Adopting Datalad for collaboration
DataLad is a powerful tool for versioning and sharing raw and processed data, as well as for tracking data provenance (i.e., recording how data was processed). This page shares how we adopted DataLad to manage and process datasets with Connectome Mapper 3 in our lab, following the YODA principles as closely as possible.
You may ask, “What are the YODA principles?” They are the basic principles behind creating, sharing, and publishing reproducible, understandable, and open data analysis projects with DataLad.
For more details and tutorials on DataLad and YODA, please check the recent DataLad Handbook and the YODA principles.
Happy Collaborative and Reproducible Connectome Mapping!
Prerequisites
- Python3 must be installed with DataLad and all dependencies. You can use the conda environment py39cmp-gui, for instance. See Installation of py39cmp-gui for more installation details.
- A recent version of git-annex and liblzma (included in py39cmp-gui for Ubuntu/Debian).
- Docker must be installed on systems running Connectome Mapper 3. See Prerequisites of Connectome Mapper 3 for more installation instructions.
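You can quickly check that these prerequisites are available, for example with the following commands (assuming the conda environment name suggested above):
conda activate py39cmp-gui
datalad --version
git annex version
docker --version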
Copy BIDS dataset to server
Copy the raw BIDS dataset using rsync:
rsync -P -avz -e 'ssh' \
--exclude 'derivatives' \
--exclude 'code' \
--exclude '.datalad' \
--exclude '.git' \
--exclude '.gitattributes' \
/path/to/ds-example/* \
<SERVER_USERNAME>@<SERVER_IP_ADDRESS>:/archive/data/ds-example
where:
-P is used to show progress during transfer
-v increases verbosity
-e specifies the remote shell to use (ssh)
-a indicates archive mode
-z enables file data compression during the transfer
--exclude DIR_NAME excludes the specified DIR_NAME from the copy
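If you want to check the file selection before actually copying anything, rsync's --dry-run (-n) flag can be added to the same command; it only lists what would be transferred:
rsync -n -P -avz -e 'ssh' \
--exclude 'derivatives' \
--exclude 'code' \
--exclude '.datalad' \
--exclude '.git' \
--exclude '.gitattributes' \
/path/to/ds-example/* \
<SERVER_USERNAME>@<SERVER_IP_ADDRESS>:/archive/data/ds-example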
Remote datalad dataset creation on Server
Connect to Server
To connect with SSH:
ssh <SERVER_USERNAME>@<SERVER_IP_ADDRESS>
Creation of Datalad dataset
Go to the source dataset directory:
cd /archive/data/ds-example
Initialize the Datalad dataset:
datalad create -f -c text2git -D "Original example dataset on lab server" -d .
where:
-f forces the creation of the DataLad dataset even if the directory is not empty
-c text2git configures DataLad to use Git to manage text files
-D gives a brief description of the dataset
-d specifies the location where the DataLad dataset is created
Track all files contained in the dataset with Datalad:
datalad save -m "Source (Origin) BIDS dataset" --version-tag origin
where:
-m MESSAGE is the description of the state or the changes made to the dataset
--version-tag tags the state of the dataset
Report on the state of dataset content:
datalad status -r
git log
Processing using the Connectome Mapper BIDS App on Alice’s workstation
Processed dataset creation
Initialize a datalad dataset with the YODA procedure:
datalad create -c text2git -c yoda \
-D "Processed example dataset by Alice with CMP3" \
/home/alice/data/ds-example-processed
This will create a DataLad dataset with:
- a code directory in your dataset
- three files for human consumption (README.md, CHANGELOG.md)
- everything in the code/ directory configured to be tracked by Git, not git-annex
- README.md and CHANGELOG.md configured in the root of the dataset to be tracked by Git
- text files configured to be tracked by Git
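As a rough sketch (the exact content may vary slightly between DataLad versions), the freshly created dataset is expected to look something like:
ds-example-processed/
├── CHANGELOG.md
├── README.md
└── code/
    └── README.md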
Go to the created dataset directory:
cd /home/alice/data/ds-example-processed
Create the derivatives output directory:
mkdir derivatives
Raw BIDS dataset installation
Install the remote DataLad dataset ds-example in /home/alice/data/ds-example-processed/input/:
datalad install -d . -s ssh://<SERVER_USERNAME>@<SERVER_IP_ADDRESS>:/archive/data/ds-example \
/home/alice/data/ds-example-processed/input/
where:
-s SOURCE specifies the URL or local path of the installation source
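You can then verify, for example, that the raw dataset was registered correctly with the datalad subdatasets command, which lists the subdatasets of the current dataset:
datalad subdatasets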
Get T1w and Diffusion images to be processed
For reproducibility, create and write datalad get commands to get_required_files_for_analysis.sh:
echo "datalad get input/sub-*/ses-*/anat/sub-*_T1w.nii.gz" > code/get_required_files_for_analysis.sh
echo "datalad get input/sub-*/ses-*/dwi/sub-*_dwi.nii.gz" >> code/get_required_files_for_analysis.sh
echo "datalad get input/sub-*/ses-*/dwi/sub-*_dwi.bvec" >> code/get_required_files_for_analysis.sh
echo "datalad get input/sub-*/ses-*/dwi/sub-*_dwi.bval" >> code/get_required_files_for_analysis.sh
Save the script to the dataset’s history:
datalad save -m "Add script to get the files required for analysis by Alice"
Execute the script:
sh code/get_required_files_for_analysis.sh
Link the container image with the dataset
Add Connectome Mapper’s container image to the datalad dataset:
datalad containers-add connectomemapper-bidsapp-<VERSION_TAG> \
--url dhub://sebastientourbier/connectomemapper-bidsapp:<VERSION_TAG> \
-d . \
--call-fmt \
"docker run --rm -t \
-v "$(pwd)/input":/bids_dir \
-v "$(pwd)/code":/bids_dir/code \
-v "$(pwd)"/derivatives:/output_dir \
-u "$(id -u)":"$(id -g)" \
sebastientourbier/connectomemapper-bidsapp:<VERSION_TAG> {cmd}"
Note
--call-fmt specifies a custom docker run command. The current directory is assumed to be the BIDS root directory, retrieved with "$(pwd)"/input, and the output directory is inside the derivatives/ folder.
Important
The container name registered to DataLad cannot contain dots, so a <VERSION_TAG> of v3.X.Y needs to be rewritten as v3-X-Y (for example, v3.0.3 becomes v3-0-3).
Copy existing reference pipeline configuration files to the code folder:
cp /path/to/existing/ref_anatomical_config.json \
code/ref_anatomical_config.json
cp /path/to/existing/ref_diffusion_config.json \
code/ref_diffusion_config.json
Copy the FreeSurfer license file to the code folder:
cp /path/to/freesurfer/license.txt \
code/license.txt
Save the state of the dataset prior to analysis:
datalad save -m "Alice's test dataset on local \
workstation ready for analysis with connectomemapper-bidsapp:<VERSION_TAG>" \
--version-tag ready4analysis-<date>-<time>
Run Connectome Mapper with Datalad
Run Connectome Mapper on all subjects:
datalad containers-run --container-name connectomemapper-bidsapp-<VERSION_TAG> \
--input code/ref_anatomical_config.json \
--input code/ref_diffusion_config.json \
--output derivatives \
/bids_dir /output_dir participant \
--anat_pipeline_config '/bids_dir/{inputs[0]}' \
--dwi_pipeline_config '/bids_dir/{inputs[1]}'
Note
datalad containers-run will take care of replacing each {inputs[i]} placeholder with the value specified by the i-th --input flag (indexing starts at 0).
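In this example, {inputs[0]} and {inputs[1]} thus resolve to the two configuration files, so the arguments passed to the BIDS App effectively become:
/bids_dir /output_dir participant \
--anat_pipeline_config '/bids_dir/code/ref_anatomical_config.json' \
--dwi_pipeline_config '/bids_dir/code/ref_diffusion_config.json'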
Save the state:
datalad save -m "Alice's test dataset on local \
workstation processed by connectomemapper-bidsapp:<VERSION_TAG>, {Date/Time}" \
--version-tag processed-<date>-<time>
Report on the state of dataset content:
datalad status -r
git log
Configure a datalad dataset target on the Server
Create a remote dataset repository and configure it as a dataset sibling to be used as a publication target:
datalad create-sibling --name remote -d . \
<SERVER_USERNAME>@<SERVER_IP_ADDRESS>:/archive/data/ds-example-processed
See the documentation of the datalad create-sibling command for more details.
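To verify, for example, that the sibling was registered, the datalad siblings command lists all configured siblings of the dataset:
datalad siblings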
Update the remote datalad dataset
Push the datalad dataset with data derivatives to the server:
datalad push -d . --to remote
Note
--to remote specifies the remote dataset sibling, i.e. ssh://<SERVER_USERNAME>@<SERVER_IP_ADDRESS>:/archive/data/ds-example-processed previously configured.
Uninstall all files accessible from the remote
With DataLad we don’t have to keep those inputs around, so you can safely uninstall them without losing the ability to reproduce an analysis:
datalad uninstall input/sub-*/*
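If the images are needed again later, they can be re-obtained at any time, for instance by re-running the saved script:
sh code/get_required_files_for_analysis.sh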
Local collaboration with Bob for Electrical Source Imaging
Processed dataset installation on Bob’s workstation
Install the processed DataLad dataset ds-example-processed in /home/bob/data/ds-example-processed:
datalad install -s ssh://<SERVER_USERNAME>@<SERVER_IP_ADDRESS>:/archive/data/ds-example-processed \
/home/bob/data/ds-example-processed
Go to the DataLad dataset clone directory:
cd /home/bob/data/ds-example-processed
Get Connectome Mapper output files (Brain Segmentation and Multi-scale Parcellation) used by Bob in his analysis
For reproducibility, write datalad get commands to get_required_files_for_analysis_by_bob.sh:
echo "datalad get derivatives/cmp/sub-*/ses-*/anat/sub-*_mask.nii.gz" \
> code/get_required_files_for_analysis_by_bob.sh
echo "datalad get derivatives/cmp/sub-*/ses-*/anat/sub-*_class-*_dseg.nii.gz" \
>> code/get_required_files_for_analysis_by_bob.sh
echo "datalad get derivatives/cmp/sub-*/ses-*/anat/sub-*_scale*_atlas.nii.gz" \
>> code/get_required_files_for_analysis_by_bob.sh
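The resulting code/get_required_files_for_analysis_by_bob.sh should then contain:
datalad get derivatives/cmp/sub-*/ses-*/anat/sub-*_mask.nii.gz
datalad get derivatives/cmp/sub-*/ses-*/anat/sub-*_class-*_dseg.nii.gz
datalad get derivatives/cmp/sub-*/ses-*/anat/sub-*_scale*_atlas.nii.gz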
Save the script to the dataset’s history:
datalad save -m "Add script to get the files required for analysis by Bob"
Execute the script:
sh code/get_required_files_for_analysis_by_bob.sh
Update derivatives
Update derivatives with data produced by Cartool:
cd /home/bob/data/ds-example-processed
mkdir derivatives/cartool
cp [...]
Save the state:
datalad save -m "Bob's test dataset on local \
workstation processed by cartool:<CARTOOL_VERSION>, {Date/Time}" \
--version-tag processed-<date>-<time>
Report on the state of dataset content:
datalad status -r
git log
Update the remote datalad dataset
Update the remote datalad dataset with data derivatives:
datalad push -d . --to origin
Note
--to origin specifies the origin dataset sibling, i.e. ssh://<SERVER_USERNAME>@<SERVER_IP_ADDRESS>:/archive/data/ds-example-processed from which it was cloned.
Uninstall all files accessible from the remote
Again, with DataLad we don’t have to keep those derivatives around, so you can safely uninstall them without losing the ability to reproduce an analysis:
datalad uninstall derivatives/cmp-*/*
datalad uninstall derivatives/freesurfer-*/*
datalad uninstall derivatives/nipype-*/*
- Authors: Sebastien Tourbier
- Version: Revision 2.1 (Last modification: 2022 Feb 09)