Adopting Datalad for collaboration

DataLad is a powerful tool for versioning and sharing raw and processed data, and for tracking data provenance (i.e. recording how data was processed). This page shares how we adopted DataLad to manage and process datasets with Connectome Mapper 3 in our lab, following the YODA principles as closely as possible.

You may ask “What are the YODA principles?”. They are basic principles behind creating, sharing, and publishing reproducible, understandable, and open data analysis projects with DataLad.

For more details and tutorials on Datalad and YODA, please check the recent Datalad Handbook and the YODA principles.

Happy Collaborative and Reproducible Connectome Mapping!

Prerequisites

  • Python3 must be installed with Datalad and all dependencies. You can use the conda environment py39cmp-gui for instance. See Installation of py39cmp-gui for more installation details.

  • A recent version of git-annex and liblzma (included in py39cmp-gui for Ubuntu/Debian).

  • Docker must be installed on systems running Connectome Mapper 3. See Prerequisites of Connectome Mapper 3 for more installation instructions.
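
To check that all prerequisites are met, you can for instance query the version of each tool:

datalad --version
git annex version
docker --version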

Copy BIDS dataset to server

Copy the raw BIDS dataset using rsync:

rsync -P -avz -e 'ssh' \
--exclude 'derivatives' \
--exclude 'code' \
--exclude '.datalad' \
--exclude '.git' \
--exclude '.gitattributes' \
/path/to/ds-example/* \
<SERVER_USERNAME>@<SERVER_IP_ADDRESS>:/archive/data/ds-example

where:

  • -P is used to show progress during transfer

  • -v increases verbosity

  • -e specifies the remote shell to use (ssh)

  • -a indicates archive mode

  • -z enables file data compression during the transfer

  • --exclude DIR_NAME excludes the specified DIR_NAME from the copy
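
To preview what would be transferred without copying anything, the same command can first be run with rsync's --dry-run (-n) flag:

rsync -n -P -avz -e 'ssh' --exclude 'derivatives' --exclude 'code' \
--exclude '.datalad' --exclude '.git' --exclude '.gitattributes' \
/path/to/ds-example/* \
<SERVER_USERNAME>@<SERVER_IP_ADDRESS>:/archive/data/ds-example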

Remote datalad dataset creation on Server

Connect to Server

To connect with SSH:

ssh <SERVER_USERNAME>@<SERVER_IP_ADDRESS>

Creation of Datalad dataset

Go to the source dataset directory:

cd /archive/data/ds-example

Initialize the Datalad dataset:

datalad create -f -c text2git -D "Original example dataset on lab server" -d .

where:

  • -f forces the dataset creation even if the target directory is not empty

  • -c text2git configures Datalad to use Git to manage text files

  • -D gives a brief description of the dataset

  • -d specifies the location where the Datalad dataset is created
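
The text2git procedure records its file-routing rules in the dataset's .gitattributes; you can inspect what it configured with:

cat .gitattributes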

Track all files contained in the dataset with Datalad:

datalad save -m "Source (Origin) BIDS dataset" --version-tag origin

where:

  • -m MESSAGE is the description of the state or the changes made to the dataset

  • --version-tag tags the state of the Dataset

Report on the state of dataset content:

datalad status -r
git log
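
Since --version-tag stores the tag as a regular Git tag, you can also verify it with:

git tag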

Processing using the Connectome Mapper BIDS App on Alice’s workstation

Processed dataset creation

Initialize a datalad dataset with the YODA procedure:

datalad create -c text2git -c yoda \
-D "Processed example dataset by Alice with CMP3" \
/home/alice/data/ds-example-processed

This will create a datalad dataset with:

  • a code directory in your dataset

  • three files for human consumption (README.md, CHANGELOG.md, and code/README.md)

  • everything in the code/ directory configured to be tracked by Git, not git-annex

  • README.md and CHANGELOG.md configured in the root of the dataset to be tracked by Git

  • Text files configured to be tracked by Git
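
At this point the top of the dataset looks roughly like this (input/ and derivatives/ are added in the next steps):

ds-example-processed/
├── CHANGELOG.md
├── README.md
└── code/
    └── README.md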

Go to the created dataset directory:

cd /home/alice/data/ds-example-processed

Create the derivatives output directory:

mkdir derivatives

Raw BIDS dataset installation

Install the remote datalad dataset ds-example in /home/alice/data/ds-example-processed/input/:

datalad install -d . -s ssh://<SERVER_USERNAME>@<SERVER_IP_ADDRESS>:/archive/data/ds-example \
/home/alice/data/ds-example-processed/input/

where:

  • -s SOURCE specifies the URL or local path of the installation source
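
You can confirm that the raw BIDS dataset was registered as a subdataset of the processed dataset:

datalad subdatasets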

Get T1w and Diffusion images to be processed

For reproducibility, write the datalad get commands to code/get_required_files_for_analysis.sh:

echo "datalad get input/sub-*/ses-*/anat/sub-*_T1w.nii.gz" > code/get_required_files_for_analysis.sh
echo "datalad get input/sub-*/ses-*/dwi/sub-*_dwi.nii.gz" >> code/get_required_files_for_analysis.sh
echo "datalad get input/sub-*/ses-*/dwi/sub-*_dwi.bvec" >> code/get_required_files_for_analysis.sh
echo "datalad get input/sub-*/ses-*/dwi/sub-*_dwi.bval" >> code/get_required_files_for_analysis.sh

Save the script to the dataset’s history:

datalad save -m "Add script to get the files required for analysis by Alice"

Execute the script:

sh code/get_required_files_for_analysis.sh
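
To check how much annexed content is now present locally, you can for instance run:

datalad status --annex all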

Run Connectome Mapper with Datalad
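
This assumes the BIDS App container was registered with the dataset beforehand via the datalad-container extension, along these lines (the Docker Hub image URL is shown here as an assumption, adjust it to the release you use):

datalad containers-add connectomemapper-bidsapp-<VERSION_TAG> \
--url dhub://sebastientourbier/connectomemapper-bidsapp:<VERSION_TAG>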

Run Connectome Mapper on all subjects:

datalad containers-run --container-name connectomemapper-bidsapp-<VERSION_TAG> \
--input code/ref_anatomical_config.json \
--input code/ref_diffusion_config.json \
--output derivatives \
/bids_dir /output_dir participant \
--anat_pipeline_config '/bids_dir/{inputs[0]}' \
--dwi_pipeline_config '/bids_dir/{inputs[1]}'

Note

datalad containers-run will take care of replacing each {inputs[i]} placeholder with the value given by the i-th --input flag (indexing starts at 0).
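
For example, with the two --input flags declared above, the placeholders expand to:

--anat_pipeline_config '/bids_dir/code/ref_anatomical_config.json' \
--dwi_pipeline_config '/bids_dir/code/ref_diffusion_config.json'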

Save the state:

datalad save -m "Alice's test dataset on local \
workstation processed by connectomemapper-bidsapp:<VERSION_TAG>, {Date/Time}" \
--version-tag processed-<date>-<time>

Report on the state of dataset content:

datalad status -r
git log

Configure a datalad dataset target on the Server

Create a remote dataset repository and configure it as a dataset sibling to be used as a publication target:

datalad create-sibling --name remote -d . \
<SERVER_USERNAME>@<SERVER_IP_ADDRESS>:/archive/data/ds-example-processed

See the documentation of datalad create-sibling command for more details.
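
You can list the siblings known to the dataset to verify that the publication target was added:

datalad siblings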

Update the remote datalad dataset

Push the datalad dataset with data derivatives to the server:

datalad push -d . --to remote

Note

--to remote specifies the remote dataset sibling i.e. ssh://<SERVER_USERNAME>@<SERVER_IP_ADDRESS>:/archive/data/ds-example-processed previously configured.

Uninstall all files accessible from the remote

With DataLad you do not have to keep the input data around: you can safely uninstall it without losing the ability to reproduce the analysis:

datalad uninstall input/sub-*/*
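
Uninstalled content remains available from the remote and can be re-obtained at any time with datalad get, e.g. (sub-01 being a hypothetical subject label):

datalad get input/sub-01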

Local collaboration with Bob for Electrical Source Imaging

Processed dataset installation on Bob’s workstation

Install the processed datalad dataset ds-example-processed in /home/bob/data/ds-example-processed:

datalad install -s ssh://<SERVER_USERNAME>@<SERVER_IP_ADDRESS>:/archive/data/ds-example-processed  \
/home/bob/data/ds-example-processed

Go to the datalad dataset clone directory:

cd /home/bob/data/ds-example-processed

Get connectome mapper output files (Brain Segmentation and Multi-scale Parcellation) used by Bob in his analysis

For reproducibility, write the datalad get commands to code/get_required_files_for_analysis_by_bob.sh:

echo "datalad get derivatives/cmp/sub-*/ses-*/anat/sub-*_mask.nii.gz" \
> code/get_required_files_for_analysis_by_bob.sh
echo "datalad get derivatives/cmp/sub-*/ses-*/anat/sub-*_class-*_dseg.nii.gz" \
>> code/get_required_files_for_analysis_by_bob.sh
echo "datalad get derivatives/cmp/sub-*/ses-*/anat/sub-*_scale*_atlas.nii.gz" \
>> code/get_required_files_for_analysis_by_bob.sh

Save the script to the dataset’s history:

datalad save -m "Add script to get the files required for analysis by Bob"

Execute the script:

sh code/get_required_files_for_analysis_by_bob.sh

Update derivatives

Update derivatives with data produced by Cartool:

cd /home/bob/data/ds-example-processed
mkdir derivatives/cartool
cp [...]

Save the state:

datalad save -m "Bob's test dataset on local \
workstation processed by cartool:<CARTOOL_VERSION>, {Date/Time}" \
--version-tag processed-<date>-<time>

Report on the state of dataset content:

datalad status -r
git log

Update the remote datalad dataset

Update the remote datalad dataset with data derivatives:

datalad push -d . --to origin

Note

--to origin specifies the origin dataset sibling i.e. ssh://<SERVER_USERNAME>@<SERVER_IP_ADDRESS>:/archive/data/ds-example-processed from which it was cloned.

Uninstall all files accessible from the remote

Again, with DataLad you do not have to keep these derivatives around: you can safely uninstall them without losing the ability to reproduce the analysis:

datalad uninstall derivatives/cmp/*
datalad uninstall derivatives/freesurfer/*
datalad uninstall derivatives/nipype/*

Authors

Sebastien Tourbier

Version

Revision: 2.1 (Last modification: 2022 Feb 09)