Part 1-2 Visualize and create input data
Part 1. Visualize input data
Open the `main.nf` file in VS Code. This file is pre-filled with a workflow named `get_data`, which is responsible for fetching input files from a specified directory. This step serves as a generic data-loading process commonly used at the start of a pipeline.
A key concept here is the use of `Channel`, which enables efficient, asynchronous data flow. The `fromFilePairs()` method is particularly useful for handling paired-end sequencing data, but in this case it helps group related files.
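To build intuition for what `fromFilePairs()` does, here is a minimal sketch in Python (illustration only — the pipeline itself stays in Nextflow). It groups a couple of hypothetical file names by their shared stem, which is the same idea the channel factory applies to the files matched by its glob pattern:

```python
import re
from collections import defaultdict

# Hypothetical file names, mimicking the tutorial's data directory.
files = ["data/participants.json", "data/participants.tsv"]

# Group files that share a common stem, similar in spirit to what
# Channel.fromFilePairs() does with its glob pattern.
groups = defaultdict(list)
for f in files:
    stem = re.sub(r"\.(json|tsv)$", "", f.rsplit("/", 1)[-1])
    groups[stem].append(f)

for key, members in sorted(groups.items()):
    print([key] + sorted(members))
# ['participants', 'data/participants.json', 'data/participants.tsv']
```

Each group becomes one channel element: a key followed by the grouped files, just like the tuple printed by the pipeline below.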
To run the Nextflow pipeline, use the following command:
```bash
nextflow run main.nf --input data -profile docker
```

You should see output similar to:

```
[participants_files, /workspaces/nf-neuro-tutorial_test/data/participants.json, /workspaces/nf-neuro-tutorial_test/data/participants.tsv]
```
Part 2. Create input structure
1. Update data structure
Now, let’s modify the `get_data` workflow and the main workflow to fetch the test data. Replace the existing `main.nf` file with the following.
```nextflow
#!/usr/bin/env nextflow

workflow get_data {
    main:
    if ( !params.input ) {
        log.info "You must provide an input directory containing all images using:"
        log.info ""
        log.info "    --input=/path/to/[input]   Input directory containing your subjects"
        log.info "        |"
        log.info "        ├-- S1"
        log.info "        |   └-- ses-01"
        log.info "        |   |   ├-- anat"
        log.info "        |   |   |   |-- *t1.nii.gz"
        log.info "        |   |   |-- dwi"
        log.info "        |   |   |   |-- *dwi.nii.gz"
        log.info "        |   |   |   ├-- *dwi.bval"
        log.info "        |   |   |   └-- *dwi.bvec"
        log.info "        |   └-- ses-02"
        log.info "        └-- S2"
        log.info "            └-- ses-01"
        log.info "            |   ├-- anat"
        log.info "            |   |   |-- *t1.nii.gz"
        log.info "            |   |-- dwi"
        log.info "            |   |   |-- *dwi.nii.gz"
        log.info "            |   |   ├-- *dwi.bval"
        log.info "            |   |   └-- *dwi.bvec"
        log.info "            └-- ses-02"
        log.info ""
        error "Please resubmit your command with the previous file structure."
    }

    input = file(params.input)

    // ** Loading all files. ** //
    dwi_channel = Channel.fromFilePairs("$input/**/dwi/*dwi.{nii.gz,bval,bvec}", size: 3, flat: true)

    emit:
    dwi = dwi_channel
}

workflow {
    // ** Now call your input workflow to fetch your files ** //
    data = get_data()
    data.dwi.view() // Contains your DWI data: [meta, dwi, bval, bvec]
}
```
Now, you can run Nextflow:
```bash
nextflow run main.nf --input data -profile docker
```

The output should look like this:

```
[sub-003_ses-01_dir-AP, /workspaces/nf-neuro-tutorial_test/data/sub-003/ses-01/dwi/sub-003_ses-01_dir-AP_dwi.bval, /workspaces/nf-neuro-tutorial_test/data/sub-003/ses-01/dwi/sub-003_ses-01_dir-AP_dwi.bvec, /workspaces/nf-neuro-tutorial_test/data/sub-003/ses-01/dwi/sub-003_ses-01_dir-AP_dwi.nii.gz]
```
Each element in the output channel is a tuple containing:

- the subject/session ID
- the `.bval` file
- the `.bvec` file
- the `.nii.gz` file (the DWI image)

It follows this format:

```
[ subject_session_id, /path/to/subject/session/dwi/*dwi.bval, /path/to/subject/session/dwi/*dwi.bvec, /path/to/subject/session/dwi/*dwi.nii.gz ]
```
2. Set the subject and session ID correctly
Now, let’s modify the input structure so that the key identifier `sub-003_ses-01_dir-AP` becomes `sub-003_ses-01`. We keep the current structure, but add a grouping-key closure that maps each matched file (`it`) to this identifier. Check the Before and After sections below to see the needed modification.
Before:

```nextflow
dwi_channel = Channel.fromFilePairs("$input/**/dwi/*dwi.{nii.gz,bval,bvec}", size: 3, flat: true)
```
After:

```nextflow
dwi_channel = Channel.fromFilePairs("$input/**/dwi/*dwi.{nii.gz,bval,bvec}", size: 3, flat: true) { it.parent.parent.parent.name + "_" + it.parent.parent.name }
```
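To see what the closure computes, here is a small Python sketch (illustration only — in the pipeline this logic is a Groovy closure receiving each matched file as `it`). It climbs from one of the matched files, using a hypothetical path mirroring the test data, up past the `dwi/` folder and joins the subject and session directory names:

```python
from pathlib import Path

# One of the matched files (hypothetical path mirroring the tutorial's dataset).
f = Path("/workspaces/nf-neuro-tutorial_test/data/sub-003/ses-01/dwi/sub-003_ses-01_dir-AP_dwi.nii.gz")

# Equivalent of { it.parent.parent.parent.name + "_" + it.parent.parent.name }:
# parent        -> .../dwi
# parent.parent -> .../ses-01
# parent.parent.parent -> .../sub-003
key = f.parent.parent.parent.name + "_" + f.parent.parent.name
print(key)  # sub-003_ses-01
```

All three files of a subject/session now map to the same key, so they stay grouped under `sub-003_ses-01` regardless of the `dir-AP` suffix in the file names.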
Now, you can run Nextflow:
```bash
nextflow run main.nf --input data -profile docker
```

The output should now be:

```
[sub-003_ses-01, /workspaces/nf-neuro-tutorial_test/data/sub-003/ses-01/dwi/sub-003_ses-01_dir-AP_dwi.bval, /workspaces/nf-neuro-tutorial_test/data/sub-003/ses-01/dwi/sub-003_ses-01_dir-AP_dwi.bvec, /workspaces/nf-neuro-tutorial_test/data/sub-003/ses-01/dwi/sub-003_ses-01_dir-AP_dwi.nii.gz]
```
3. Organizing Data for Processing
By default, files are sorted alphabetically, so you need to reorder them to get a specific file order. To do this, use the `map` operator and change `main.nf` as follows:
Before:

```nextflow
dwi_channel = Channel.fromFilePairs("$input/**/dwi/*dwi.{nii.gz,bval,bvec}", size: 3, flat: true) { it.parent.parent.parent.name + "_" + it.parent.parent.name }
```
After:

```nextflow
dwi_channel = Channel.fromFilePairs("$input/**/dwi/*dwi.{nii.gz,bval,bvec}", size: 3, flat: true) { it.parent.parent.parent.name + "_" + it.parent.parent.name }
    .map{ sid, bvals, bvecs, dwi -> [ [id: sid], dwi, bvals, bvecs ] } // Reordering the inputs.
```
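The reordering done by `map` can be sketched in Python (illustration only — in the pipeline this is the Groovy closure shown above; the file names here are hypothetical placeholders for the full paths):

```python
# Tuple emitted by fromFilePairs, in alphabetical file order: bval, bvec, nii.gz.
item = ("sub-003_ses-01", "dwi.bval", "dwi.bvec", "dwi.nii.gz")

# Equivalent of the Groovy closure:
# { sid, bvals, bvecs, dwi -> [ [id: sid], dwi, bvals, bvecs ] }
def reorder(sid, bvals, bvecs, dwi):
    # Wrap the ID in a map and move the DWI image ahead of bval/bvec.
    return [{"id": sid}, dwi, bvals, bvecs]

print(reorder(*item))
# [{'id': 'sub-003_ses-01'}, 'dwi.nii.gz', 'dwi.bval', 'dwi.bvec']
```

Wrapping the ID in a map (`[id: sid]`) gives each element a `meta` entry, matching the `[meta, dwi, bval, bvec]` layout expected downstream.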
Now, you can run Nextflow:
```bash
nextflow run main.nf --input data -profile docker
```

The output should now be:

```
[[id:sub-003_ses-01], /workspaces/nf-neuro-tutorial_test/data/sub-003/ses-01/dwi/sub-003_ses-01_dir-AP_dwi.nii.gz, /workspaces/nf-neuro-tutorial_test/data/sub-003/ses-01/dwi/sub-003_ses-01_dir-AP_dwi.bval, /workspaces/nf-neuro-tutorial_test/data/sub-003/ses-01/dwi/sub-003_ses-01_dir-AP_dwi.bvec]
```
Now, your input pipeline data is well-structured, facilitating seamless processing in subsequent pipeline stages. Each dataset includes a clearly labeled subject ID and session, along with all necessary files for DWI processing — such as the DWI file, b-values, and b-vectors.