FROGS ITS Benchmarking
Introduction
The objective of these analyses is to compare the tools currently available for analyzing large ITS datasets. To this end, we analyzed simulated ITS1 and ITS2 data. FROGS (Escudié et al. 2018), USEARCH (Edgar 2016), DADA2 (Callahan et al. 2016) and QIIME 2 (Bolyen et al. 2019) were run and compared following their respective guidelines.
Simulated data
We produced Illumina MiSeq-like paired-end ITS1 and ITS2 reads (R1 and R2) from as diverse a set of Fungi as possible. Reference sequences were extracted from the UNITE v7.1 database (Abarenkov et al. 2010), which is centered on the eukaryotic nuclear ribosomal ITS region, and 35, 115 or 515 fungal species were picked. Sequencing errors and chimeras were incorporated to reflect what is actually obtained after sequencing: reads were simulated with a number of substitutions and indels that increases linearly along the read. Samples were generated with two types of abundance distribution, following either a power law or a uniform law. We deliberately included long ITS1 and ITS2 amplicons that cannot be fully covered by two overlapping 250-nucleotide paired reads.
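The simulation code itself is not reproduced here, but the following minimal R sketch illustrates the two ingredients described above. The function names (simulate_abundances, mutate_read) and the error-rate values (0.1% at the first base, 2% at the last) are illustrative assumptions, not the parameters used to build the benchmark datasets; indels and chimeras are omitted for brevity.
# Minimal sketch (R) of the simulation principles: abundances drawn from a
# uniform or power law, and a per-base substitution probability that
# increases linearly along the read.
simulate_abundances <- function(n_species, law = c("Uniform", "Powerlaw"), alpha = 1.5) {
  law <- match.arg(law)
  w <- switch(law,
              Uniform  = rep(1, n_species),
              Powerlaw = (1:n_species)^(-alpha))  # rank-abundance power law
  w / sum(w)                                      # relative abundances summing to 1
}
mutate_read <- function(read, error_start = 0.001, error_end = 0.02) {
  bases <- strsplit(read, "")[[1]]
  p <- seq(error_start, error_end, length.out = length(bases))  # linear error profile
  hit <- runif(length(bases)) < p
  bases[hit] <- vapply(bases[hit],
                       function(b) sample(setdiff(c("A", "C", "G", "T"), b), 1),
                       character(1))
  paste(bases, collapse = "")
}
set.seed(1)
abund   <- simulate_abundances(35, "Powerlaw")
read    <- paste(sample(c("A", "C", "G", "T"), 250, replace = TRUE), collapse = "")
mutated <- mutate_read(read)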
Results
The results obtained with each tool (FROGS, DADA2, USEARCH and QIIME 2) were compared to the expected composition, and several metrics were calculated:
- At the OTU/ASV/ZOTU level:
  - Richness, expressed as the number of OTUs/ASVs/ZOTUs
- At the species level:
  - Abundance divergence rate
  - False negatives
  - False positives
  - True positives
Divergence evaluates a tool's ability to recover the expected community composition at the species level. It is defined as affiliation divergence = 100 - ∑ min(expected, found) and is equal to the Canberra distance (expressed in percent) between the expected (i.e. the true taxonomic composition of the community) and observed (i.e. the one inferred by the OTU picking tool) community compositions, after aggregating OTUs at a given taxonomic level. The second metric, FN, is the number of False Negative species, i.e. species present in the original fungal community but not discovered by the OTU picking method (FROGS, DADA2, USEARCH or QIIME 2); these are the species lost during OTU picking. This metric evaluates sensitivity, i.e. the probability of detecting the species present in the dataset. The third metric, FP, is the number of False Positive species, i.e. species discovered by the OTU picking method but not present in the original fungal community; it evaluates a tool's propensity to create spurious OTUs. Finally, TP is the number of True Positive species, i.e. species both expected and recovered.
N.B.: FROGS and QIIME 2 produce OTUs, DADA2 produces ASVs and USEARCH produces ZOTUs.
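For clarity, here is a small R sketch of how these metrics can be computed. It assumes `expected` and `observed` are named vectors of species-level abundances expressed in percent (each summing to 100), which matches the definitions above but is not the benchmark's actual evaluation code.
# Sketch of the metrics: divergence = 100 - sum of the per-species minima,
# plus TP/FN/FP species counts.
divergence_metrics <- function(expected, observed) {
  species <- union(names(expected), names(observed))
  e <- setNames(numeric(length(species)), species); e[names(expected)] <- expected
  o <- setNames(numeric(length(species)), species); o[names(observed)] <- observed
  list(divergence = 100 - sum(pmin(e, o)),  # affiliation divergence (percent)
       TP = sum(e > 0 & o > 0),             # expected species that were recovered
       FN = sum(e > 0 & o == 0),            # expected species that were lost
       FP = sum(e == 0 & o > 0))            # spurious species
}
expected <- c(sp_A = 60, sp_B = 40)
observed <- c(sp_A = 55, sp_C = 45)
divergence_metrics(expected, observed)  # divergence 45, TP 1, FN 1, FP 1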
35 species
Amplicon length distribution
List of mergeable amplicons
List of non-mergeable amplicons
Expected vs. observed depth (count of sequences)
35sp ITS1 Uniform
35sp ITS1 Powerlaw
35sp ITS2 Uniform
35sp ITS2 Powerlaw
Expected vs. observed Richness (count of OTU, ASV, ZOTU)
35sp ITS1 Uniform
35sp ITS1 Powerlaw
35sp ITS2 Uniform
35sp ITS2 Powerlaw
Metrics calculated relative to the expected community
Affiliations and associated abundances are taken into account.
35sp ITS1 Uniform
35sp ITS1 Powerlaw
35sp ITS2 Uniform
35sp ITS2 Powerlaw
Taxonomies found vs. lost
Column order is based on the Canberra distance.
35sp ITS1 Uniform
35sp ITS1 Powerlaw
35sp ITS2 Uniform
35sp ITS2 Powerlaw
Reconstruction of the reference.
35sp ITS1 Uniform
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
35 | 35 | 33 | 27 | 19 | 27 | 31 |
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
35 | 34 | 20 | 2 | 12 | 2 | 21 |
35sp ITS1 Powerlaw
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
35 | 35 | 33 | 27 | 21 | 27 | 31 |
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
35 | 34 | 20 | 2 | 12 | 2 | 21 |
35sp ITS2 Uniform
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
35 | 35 | 33 | 18 | 18 | 33 | 33 |
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
35 | 34 | 29 | 0 | 12 | 0 | 30 |
35sp ITS2 Powerlaw
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
35 | 35 | 32 | 19 | 19 | 33 | 33 |
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
35 | 34 | 29 | 0 | 12 | 0 | 30 |
115 species
Amplicon length distribution
List of mergeable amplicons
List of non-mergeable amplicons
Expected vs. observed depth (count of sequences)
115sp ITS1 Uniform
115sp ITS1 Powerlaw
115sp ITS2 Uniform
115sp ITS2 Powerlaw
Expected vs. observed Richness (count of OTU, ASV, ZOTU)
115sp ITS1 Uniform
115sp ITS1 Powerlaw
115sp ITS2 Uniform
115sp ITS2 Powerlaw
Metrics calculated relative to the expected community
Affiliations and associated abundances are taken into account.
115sp ITS1 Uniform
115sp ITS1 Powerlaw
115sp ITS2 Uniform
115sp ITS2 Powerlaw
Taxonomies found vs. lost
Column order is based on the Canberra distance.
115sp ITS1 Uniform
115sp ITS1 Powerlaw
115sp ITS2 Uniform
115sp ITS2 Powerlaw
Reconstruction of the reference.
115sp ITS1 Uniform
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
115 | 115 | 102 | 67 | 55 | 75 | 103 |
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
115 | 114 | 89 | 0 | 40 | 0 | 93 |
115sp ITS1 Powerlaw
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
115 | 115 | 103 | 65 | 56 | 73 | 102 |
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
115 | 114 | 90 | 0 | 40 | 0 | 94 |
115sp ITS2 Uniform
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
115 | 115 | 95 | 49 | 47 | 104 | 104 |
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
115 | 113 | 91 | 1 | 39 | 1 | 100 |
115sp ITS2 Powerlaw
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
115 | 115 | 96 | 48 | 48 | 102 | 103 |
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
115 | 113 | 92 | 1 | 39 | 1 | 99 |
515 species
Amplicon length distribution
List of mergeable amplicons
List of non-mergeable amplicons
Expected vs. observed depth (count of sequences)
515sp ITS1 Uniform
515sp ITS1 Powerlaw
515sp ITS2 Uniform
515sp ITS2 Powerlaw
Expected vs. observed Richness (count of OTU, ASV, ZOTU)
515sp ITS1 Uniform
515sp ITS1 Powerlaw
515sp ITS2 Uniform
515sp ITS2 Powerlaw
Metrics calculated relative to the expected community
Affiliations and associated abundances are taken into account.
515sp ITS1 Uniform
515sp ITS1 Powerlaw
515sp ITS2 Uniform
515sp ITS2 Powerlaw
Taxonomies found vs. lost
Column order is based on the Canberra distance.
515sp ITS1 Uniform
515sp ITS1 Powerlaw
515sp ITS2 Uniform
515sp ITS2 Powerlaw
Reconstruction of the reference.
515sp ITS1 Uniform
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
515 | 513 | 453 | 270 | 252 | 302 | 472 |
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
515 | 505 | 433 | 1 | 220 | 1 | 459 |
515sp ITS1 Powerlaw
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
515 | 506 | 448 | 262 | 244 | 285 | 465 |
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
515 | 498 | 429 | 1 | 220 | 1 | 455 |
515sp ITS2 Uniform
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
515 | 512 | 436 | 228 | 221 | 456 | 465 |
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
515 | 504 | 424 | 4 | 188 | 4 | 453 |
515sp ITS2 Powerlaw
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
515 | 504 | 434 | 219 | 212 | 446 | 462 |
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
515 | 496 | 423 | 4 | 188 | 4 | 449 |
Command lines
Tools need to be installed and available in the PATH.
FROGS (v. 3.2.0)
#!/bin/bash
SAVE_DIR="results/FROGS"
INPUT_DIR="data_input"
REFERENCE="Unite_ref/Unite_7.1_clean_certae_refs.fasta"
LAW="Uniform Powerlaw"
SP="35sp 115sp 515sp"
ITS="ITS1 ITS2"
CPU=1
# Create archive for FROGS for each dataset
for path in $INPUT_DIR/*sp/jeu1/ITS*/*/ ; do cd $path ; tar zcvf Archive.tar.gz *.fastq.gz; cd - ;done
for sp in $SP
do
for its in $ITS
do
for law in $LAW
do
# Create output directory for each dataset
ARCHIVE_FILE="$(pwd)/$INPUT_DIR/$sp/jeu1/$its/$law/Archive.tar.gz" # absolute path, as we cd into the output directory below
mkdir -p "$SAVE_DIR/$sp/$its/$law/"
cd "$SAVE_DIR/$sp/$its/$law/"
mkdir -p LOG
# Preprocess
preprocess.py illumina --input-archive $ARCHIVE_FILE --keep-unmerged --merge-software pear --min-amplicon-size 20 --max-amplicon-size 490 --without-primers --R1-size 250 --R2-size 250 --nb-cpus $CPU --output-dereplicated preprocess.fasta --output-count preprocess.tsv --summary preprocess.html --log-file preprocess.log
# Clustering
clustering.py --input-fasta preprocess.fasta --input-count preprocess.tsv --distance 1 --nb-cpus $CPU --output-biom clustering.biom --output-fasta clustering.fasta --log-file clustering.log
# Chimera removal
remove_chimera.py --input-fasta clustering.fasta --input-biom clustering.biom --nb-cpus $CPU --out-abundance remove_chimera.biom --non-chimera remove_chimera.fasta --log-file remove_chimera.log --summary remove_chimera.html
# OTU filters
otu_filters.py --min-abundance 0.00005 --nb-cpus $CPU --input-fasta remove_chimera.fasta --input-biom remove_chimera.biom --output-biom filters.biom --output-fasta filters.fasta --log-file filters.log --summary filters.html
# ITSx
itsx.py --nb-cpus $CPU --region $its --check-its-only --input-fasta filters.fasta --input-biom filters.biom --out-fasta itsx.fasta --out-abundance itsx.biom --summary itsx.html --log-file itsx.log
# Taxonomic affiliation
affiliation_OTU.py --nb-cpus $CPU --reference $REFERENCE --input-biom itsx.biom --input-fasta itsx.fasta --output-biom frogs.biom --summary affiliation.html --log-file affiliation.log
# Biom to TSV
biom_to_tsv.py --input-biom frogs.biom --input-fasta itsx.fasta --output-tsv frogs.tsv --output-multi-affi multiaff.tsv --log-file biom_to_tsv.log
# Affiliation filters: parameters are highly dependent on your dataset and questions
# choose either --delete or --mask behavior for filtered OTUs
affiliation_filters.py --input-biom frogs.biom --input-fasta itsx.fasta --output-biom frogs_affi_filter.biom --output-fasta frogs_affi_filter.fasta --summary frogs_affi_filter.html --impacted frogs_affi_filter_impacted_OTU.tsv --impacted-multihit frogs_affi_filter_impacted_OTU_multihit.tsv --log-file affi_filter.log --delete
# add BLAST affiliation filters if needed, e.g.:
# --min-blast-identity 0.8 --min-blast-coverage 0.8 --ignore-blast-taxa "known_contaminant_term or key_species"
cd -
done
done
done
DADA2-pe (v. 1.14.1)
dada2.sh
#!/bin/bash
SAVE_DIR="results/DADA2"
INPUT_DIR="data_input"
SCRIPT_DIR="scripts/DADA2"
REFERENCE="Unite_ref/Unite_7.1_clean_certae_refs_for_DADA2.fasta"
LAW="Uniform Powerlaw"
SP="35sp 115sp 515sp"
ITS="ITS1 ITS2"
CPU=1
for sp in $SP
do
for its in $ITS
do
for law in $LAW
do
mkdir -p "$SAVE_DIR/$sp/$its/$law/"
cd "$SAVE_DIR/$sp/$its/$law/"
mkdir -p LOG
# launch DADA2
module load system/R-3.6.1 libraries/proj-4.9.3 libraries/gdal-2.3.0 && $SCRIPT_DIR/dada2.R --input $INPUT_DIR/$sp/jeu1/$its/$law/ --output $SAVE_DIR/$sp/$its/$law/ --reference $REFERENCE --cpus $CPU
cd -
done
done
done
dada2.R
#!/usr/bin/env Rscript
library(argparse)
library(ShortRead)
library(Biostrings)
library(tidyverse)
library(phyloseq)
library(dada2)
## Get arguments from bash script
args <- commandArgs(TRUE)
parser <- ArgumentParser()
parser$add_argument("-i", "--input", type="character", default=NULL,
help="dataset file name", metavar="character")
parser$add_argument("-o", "--output", type="character", default="out.txt",
help="output file name", metavar="character")
parser$add_argument("--reference", type="character", default=NULL,
help="reference database", metavar="character")
parser$add_argument("--cpus", type="integer", default=1,
help="number of cpus", metavar="N")
args <- parser$parse_args()
input_dir <- args$input
output_dir <- args$output
## Input files
fnFs <- sort(list.files(input_dir, pattern = "_R1.fastq.gz", full.names = TRUE))
fnRs <- sort(list.files(input_dir, pattern = "_R2.fastq.gz", full.names = TRUE))
# Remove sequences with N
path.N <- file.path(output_dir, "filtered")
filtFs <- file.path(output_dir, "filtered", basename(fnFs)) # Put N-filtered files in the filtered/ subdirectory
filtRs <- file.path(output_dir, "filtered", basename(fnRs))
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, maxN = 0, maxEE = c(2, 2),
truncQ = 2, minLen = 50, rm.phix = TRUE, compress = TRUE, multithread = args$cpus)
get.sample.name <- function(fname) paste(strsplit(basename(fname), "_R")[[1]][1],collapse="_")
sample.names <- unname(sapply(fnFs, get.sample.name))
# Dada
dadaFs <- dada(filtFs, err = NULL, selfConsist = TRUE, pool = TRUE, multithread = args$cpus)
dadaRs <- dada(filtRs, err = NULL, selfConsist = TRUE, pool = TRUE, multithread = args$cpus)
names(dadaFs) <- sample.names
names(dadaRs) <- sample.names
# Deal with pairs
mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs)
seqtab <- makeSequenceTable(mergers)
seqtab.nochim <- removeBimeraDenovo(seqtab, multithread=args$cpus)
taxa <- assignTaxonomy(seqtab.nochim, args$reference, multithread = args$cpus)
saveRDS(taxa,file.path(output_dir,"taxa.rds"))
taxa <- taxa[,c(1,2,3,4,5,6,7)]
colnames(taxa) <- c("Kingdom", "Phylum", "Class", "Order" , "Family" , "Genus" , "Species")
### Track reads
getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
# If processing a single sample, remove the sapply calls: e.g. replace sapply(dadaFs, getN) with getN(dadaFs)
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
print(track)
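# Build a FROGS-style abundance TSV: one row per ASV with taxonomy, seed
# sequence, observation name, total count and per-sample counts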
asv_seqs <- colnames(seqtab.nochim)
biomtable <- t(seqtab.nochim)
b <- as_tibble(biomtable)
obs_sum <- rowSums(b)
b$observation_sum <- obs_sum
b$observation_name <- paste0("ASV", 1:nrow(b))
b$seed_sequence <- asv_seqs
taxonomy <- as_tibble(taxa) %>% unite("#taxonomy", 1:7 , sep=";", remove = FALSE)
b$"#taxonomy" <- taxonomy$"#taxonomy"
col_order <- c("#taxonomy", "seed_sequence", "observation_name", "observation_sum",
rownames(seqtab.nochim))
b2 <- b[, col_order]
write.table(b2, file.path(output_dir,"dada2.tsv"), sep="\t", quote=F, row.names=F)
tsv_to_biom <- "tsv_to_biom.py"
system2(tsv_to_biom, args = c("--input-tsv", file.path(output_dir,"dada2.tsv"),
"--output-biom", file.path(output_dir,"dada2.biom"),
"--output-fasta", file.path(output_dir,"dada2.fasta"),
"--log-file", file.path(output_dir,"tsv_to_biom.log")
)
)
samples.out <- rownames(seqtab.nochim)
ps <- phyloseq(otu_table(seqtab.nochim, taxa_are_rows=FALSE) ,
tax_table(taxa))
saveRDS(ps, file.path(output_dir,"dada2.rds"))
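# Re-export the table with FROGS-like "Cluster_<n>" observation names and a
# "#blast_taxonomy" column, then convert it to BIOM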
ids_study <- paste("Cluster", 1:ncol(seqtab.nochim), sep = "_")
asv_headers <- paste(">",ids_study, sep="")
asv_seqs <- colnames(seqtab.nochim)
asv_fasta <- c(rbind(asv_headers, asv_seqs))
taxonomy_fields <- apply(unname(taxa), 1, function(x) paste(x, collapse=";"))
seqtab.nochim_t <- as.data.frame(t(seqtab.nochim))
samples_ids <- colnames(seqtab.nochim_t)
seqtab.nochim_t$ids <- ids_study
seqtab.nochim_t$"#blast_taxonomy" <- taxanomy_fields
seqtab.nochim_t$"seed_sequence" <- asv_seqs
seqtab.nochim_t$"observation_name" <- ids_study
seqtab.nochim_t$observation_sum <- rowSums(t(seqtab.nochim))
seqtab.nochim_t <- seqtab.nochim_t[, c("#blast_taxonomy", "seed_sequence", "observation_name", "observation_sum", samples_ids)]
write.table(x = rbind(colnames(seqtab.nochim_t), seqtab.nochim_t),
file = file.path(output_dir, "dada2.tsv"),
quote = FALSE,
sep = "\t",
col.names = FALSE,
row.names = FALSE)
system2(tsv_to_biom, args=c("--input-tsv", file.path(output_dir, "dada2.tsv"), "--output-biom", file.path(output_dir, "dada2.biom")))
DADA2-se (v. 1.14.1)
dada2_se.sh
#!/bin/bash
SAVE_DIR="results/DADA2_SE2"
INPUT_DIR="data_input"
SCRIPT_DIR="scripts/DADA2_SE"
REFERENCE="Unite_ref/Unite_7.1_clean_certae_refs_for_DADA2.fasta"
LAW="Uniform Powerlaw"
SP="35sp 115sp 515sp"
ITS="ITS1 ITS2"
CPU=1
for sp in $SP
do
for its in $ITS
do
for law in $LAW
do
mkdir -p "$SAVE_DIR/$sp/$its/$law/"
cd "$SAVE_DIR/$sp/$its/$law/"
mkdir -p LOG
# launch DADA2_SE
module load system/R-3.6.1 libraries/proj-4.9.3 libraries/gdal-2.3.0 && $SCRIPT_DIR/dada2_se.R --input $INPUT_DIR/$sp/jeu1/$its/$law/ --output $SAVE_DIR/$sp/$its/$law/ --reference $REFERENCE --cpus $CPU
cd -
done
done
done
dada2_se.R
#!/usr/bin/env Rscript
library(argparse)
library(ShortRead)
library(Biostrings)
library(tidyverse)
library(phyloseq)
library(dada2)
## Get arguments from bash script
args <- commandArgs(TRUE)
parser <- ArgumentParser()
parser$add_argument("-i", "--input", type="character", default=NULL,
help="dataset file name", metavar="character")
parser$add_argument("-o", "--output", type="character", default="out.txt",
help="output file name", metavar="character")
parser$add_argument("--reference", type="character", default=NULL,
help="reference database", metavar="character")
parser$add_argument("--cpus", type="integer", default=1,
help="number of cpus", metavar="N")
args <- parser$parse_args()
input_dir <- args$input
output_dir <- args$output
## Input files
fnFs <- sort(list.files(input_dir, pattern = "_R1.fastq.gz", full.names = TRUE))
# Remove sequences with N
path.N <- file.path(output_dir, "filtered")
filtFs <- file.path(output_dir, "filtered", basename(fnFs)) # Put N-filtered files in the filtered/ subdirectory
filterAndTrim(fnFs, filtFs, maxN = 0, maxEE = 2,
truncQ = 2, minLen = 50, rm.phix = TRUE, compress = TRUE, multithread = args$cpus)
get.sample.name <- function(fname) paste(strsplit(basename(fname), "_R")[[1]][1],collapse="_")
sample.names <- unname(sapply(fnFs, get.sample.name))
# Dada
dadaFs <- dada(filtFs, err = NULL, selfConsist = TRUE, pool = TRUE, multithread = args$cpus)
names(dadaFs) <- sample.names
seqtab <- makeSequenceTable(dadaFs)
seqtab.nochim <- removeBimeraDenovo(seqtab, multithread=args$cpus)
taxa <- assignTaxonomy(seqtab.nochim, args$reference, multithread = args$cpus)
saveRDS(taxa,file.path(output_dir,"taxa.rds"))
taxa <- taxa[,c(1,2,3,4,5,6,7)]
colnames(taxa) <- c("Kingdom", "Phylum", "Class", "Order" , "Family" , "Genus" , "Species")
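# Build a FROGS-style abundance TSV: one row per ASV with taxonomy, seed
# sequence, observation name, total count and per-sample counts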
asv_seqs <- colnames(seqtab.nochim)
biomtable <- t(seqtab.nochim)
b <- as_tibble(biomtable)
obs_sum <- rowSums(b)
b$observation_sum <- obs_sum
b$observation_name <- paste0("ASV", 1:nrow(b))
b$seed_sequence <- asv_seqs
taxonomy <- as_tibble(taxa) %>% unite("#taxonomy", 1:7 , sep=";", remove = FALSE)
b$"#taxonomy" <- taxonomy$"#taxonomy"
col_order <- c("#taxonomy", "seed_sequence", "observation_name", "observation_sum",
rownames(seqtab.nochim))
b2 <- b[, col_order]
write.table(b2, file.path(output_dir,"dada2.tsv"), sep="\t", quote=F, row.names=F)
tsv_to_biom <- "tsv_to_biom.py"
system2(tsv_to_biom, args = c("--input-tsv", file.path(output_dir,"dada2.tsv"),
"--output-biom", file.path(output_dir,"dada2.biom"),
"--output-fasta", file.path(output_dir,"dada2.fasta"),
"--log-file", file.path(output_dir,"tsv_to_biom.log")
)
)
samples.out <- rownames(seqtab.nochim)
ps <- phyloseq(otu_table(seqtab.nochim, taxa_are_rows=FALSE) ,
tax_table(taxa))
saveRDS(ps, file.path(output_dir,"dada2.rds"))
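# Re-export the table with FROGS-like "Cluster_<n>" observation names and a
# "#blast_taxonomy" column, then convert it to BIOM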
ids_study <- paste("Cluster", 1:ncol(seqtab.nochim), sep = "_")
asv_headers <- paste(">",ids_study, sep="")
asv_seqs <- colnames(seqtab.nochim)
asv_fasta <- c(rbind(asv_headers, asv_seqs))
taxonomy_fields <- apply(unname(taxa), 1, function(x) paste(x, collapse=";"))
seqtab.nochim_t <- as.data.frame(t(seqtab.nochim))
samples_ids <- colnames(seqtab.nochim_t)
seqtab.nochim_t$ids <- ids_study
seqtab.nochim_t$"#blast_taxonomy" <- taxanomy_fields
seqtab.nochim_t$"seed_sequence" <- asv_seqs
seqtab.nochim_t$"observation_name" <- ids_study
seqtab.nochim_t$observation_sum <- rowSums(t(seqtab.nochim))
seqtab.nochim_t <- seqtab.nochim_t[, c("#blast_taxonomy", "seed_sequence", "observation_name", "observation_sum", samples_ids)]
write.table(x = rbind(colnames(seqtab.nochim_t), seqtab.nochim_t),
file = file.path(output_dir, "dada2.tsv"),
quote = FALSE,
sep = "\t",
col.names = FALSE,
row.names = FALSE)
system2(tsv_to_biom, args=c("--input-tsv", file.path(output_dir, "dada2.tsv"), "--output-biom", file.path(output_dir, "dada2.biom")))
USEARCH (usearch11.0.667)
makeudb_usearch.sh
#!/bin/sh
REF=$1
OUT_DIR=$2
mkdir -p $OUT_DIR
REF_WOEXT=`basename ${REF%.*}`
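# Reformat FASTA headers for SINTAX: ">ID taxo1;taxo2;...;" becomes ">ID;tax=taxo1,taxo2,...;"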
cat $REF | awk '{if(substr($1,1,1)==">"){gsub(";",",",$2) ; print $1";tax="$2}else{print $0}}' | sed -r 's/,$/;/' > $OUT_DIR/${REF_WOEXT}_usearch.fasta
usearch -makeudb_usearch $OUT_DIR/${REF_WOEXT}_usearch.fasta -output $OUT_DIR/${REF_WOEXT}_usearch.udb
usearch.sh
#!/bin/sh
READS_DIR=$1
REF=$2
OUT_DIR=$3
CPU=$4
mkdir -p $OUT_DIR/usearch/log
###############################################################################################
# Merge all R1/R2 file pairs
# Add sample name to read labels (-relabel @ option)
# Pool samples together into raw.fq
# keep R1 unmerged reads
mkdir $OUT_DIR/data
for R1 in $READS_DIR/*_R1.fastq.gz
do
R2=`echo $R1 |sed 's/_R1.fastq.gz/_R2.fastq.gz/'`
sample=`basename $R1 |sed 's/_R1.fastq.gz//'`
gzip -dc $R1 > $OUT_DIR/data/`basename $R1`
gzip -dc $R2 > $OUT_DIR/data/`basename $R2`
R1=$OUT_DIR/data/`basename $R1`
R2=$OUT_DIR/data/`basename $R2`
usearch -fastq_mergepairs $R1 -reverse $R2 \
-sample $sample -threads $CPU \
-fastqout $OUT_DIR/usearch/${sample}_raw_paired.fq -fastqout_notmerged_fwd $OUT_DIR/usearch/${sample}_raw_unpaired_fw.fq 2>&1 | sed -e 's/\r/\n/g' > $OUT_DIR/usearch/log/${sample}_fastq_mergepairs.log
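# Tag unmerged R1 reads with their sample: append ";sample=<name>;" to the
# header (first line of each 4-line FASTQ record)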
awk -v S=$sample '{c++; if(c==1){print $0";sample="S";"}else{print $0; if(c==4){c=0}}}' $OUT_DIR/usearch/${sample}_raw_unpaired_fw.fq > $OUT_DIR/usearch/${sample}_raw_unpaired_fw_rename.fq
rm $OUT_DIR/usearch/${sample}_raw_unpaired_fw.fq
done
rm -r $OUT_DIR/data
# concat R1R2 merged and R1 unmerged reads
cat $OUT_DIR/usearch/*raw_paired.fq $OUT_DIR/usearch/*_raw_unpaired_fw_rename.fq > $OUT_DIR/usearch/amplicon.fq
rm $OUT_DIR/usearch/*raw_paired.fq $OUT_DIR/usearch/*raw_unpaired_fw_rename.fq
###############################################################################################
# Amplicon sequence quality filter
# Discard reads with more than 1.0 total expected errors (-fastq_maxee): https://www.drive5.com/usearch/manual/exp_errs.html
# (the related -fastq_maxee_rate option works per base: set to 0.01, a read of
# length 100 is discarded if its expected errors exceed 1, and a read of
# length 1,000 if they exceed 10)
usearch -fastq_filter $OUT_DIR/usearch/amplicon.fq \
-fastq_maxee 1.0 -relabel Filt -threads $CPU \
-fastaout $OUT_DIR/usearch/amplicon_filtered.fa 2>&1 | sed -e 's/\r/\n/g' > $OUT_DIR/usearch/log/fastq_filter.log
###############################################################################################
# Find unique read sequences, store abundances, and sort by abundance
usearch -fastx_uniques $OUT_DIR/usearch/amplicon_filtered.fa \
-sizeout -relabel Uniq -threads $CPU \
-fastaout $OUT_DIR/usearch/amplicon_uniques.fa 2>&1 | sed -e 's/\r/\n/g' > $OUT_DIR/usearch/log/fastx_uniques.log
###############################################################################################
# Run UNOISE algorithm to get denoised sequences (ZOTUs)
# The -minsize option specifies the minimum abundance (size= annotation). Default is 8.
# Input sequences with lower abundances are discarded.
usearch -unoise3 $OUT_DIR/usearch/amplicon_uniques.fa \
-zotus $OUT_DIR/usearch.fa 2>&1 | sed -e 's/\r/\n/g' > $OUT_DIR/usearch/log/unoise3.log
# make ZOTU table
usearch -otutab $OUT_DIR/usearch/amplicon.fq \
-threads $CPU \
-otus $OUT_DIR/usearch.fa -otutabout $OUT_DIR/usearch/zotus.txt 2>&1 | sed -e 's/\r/\n/g' > $OUT_DIR/usearch/log/otutab.log
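# Reformat the ZOTU table for FROGS tsv_to_biom.py: rename the "#OTU ID" header
# and insert an observation_sum column (sum of the per-sample counts)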
awk -F "\t" 'BEGIN{OFS="\t"}{if($1=="#OTU ID"){$1="observation_name\tobservation_sum"}else{s=0; for (i=2;i<= NF; i++){s+=$i} ; $1=$1"\t"s} print $0}' $OUT_DIR/usearch/zotus.txt > $OUT_DIR/usearch/zotus_reformat.txt
tsv_to_biom.py -t $OUT_DIR/usearch/zotus_reformat.txt -b $OUT_DIR/usearch/zotus.biom
###############################################################################################
# perform affiliation using Sintax
# If the -sintax_cutoff option is given then predictions are written a second time after applying the confidence
# threshold, keeping only ranks with high enough confidence. On V4 reads, using a cutoff of 0.8 gives predictions
# with similar accuracy to RDP at 80% bootstrap cutoff.
usearch -sintax $OUT_DIR/usearch.fa \
-db $REF -strand both -threads $CPU \
-tabbedout $OUT_DIR/usearch/zotus_tax.txt -sintax_cutoff 0.8 2>&1 | sed -e 's/\r/\n/g' > $OUT_DIR/usearch/log/sintax.log
cat $OUT_DIR/usearch/zotus_tax.txt | awk '{c++; if(c==1){print "#observation_name\tbootstrapped_taxonomy\tstrand\ttaxonomy"}; gsub(",",";",$0) ; print $0}' > $OUT_DIR/usearch/zotus_tax_reformat.txt
# ==> should the bootstrap cutoff be removed?
# create final biom
biom add-metadata -i $OUT_DIR/usearch/zotus.biom \
--observation-metadata-fp $OUT_DIR/usearch/zotus_tax_reformat.txt \
-o $OUT_DIR/usearch.biom --output-as-json \
--sc-separated bootstrapped_taxonomy,taxonomy
run_usearch.sh
#!/bin/sh
CPU=1
DIR=/work/frogsfungi/FungiPubli/JeuTestDuMoment
DATA_DIR=$DIR/data_input
REF=$DIR/Unite_ref/Unite_7.1_clean_certae_refs.fasta
USEARCH_OUT_DIR=$DIR/results/USEARCH
mkdir -p $USEARCH_OUT_DIR/LOG
##############
# reference build
sh $DIR/scripts/USEARCH/makeudb_usearch.sh $REF $DIR/Unite_ref
DB_USEARCH=$DIR/Unite_ref/Unite_7.1_clean_certae_refs_usearch.udb
##############
# run usearch
for sp in 35sp 115sp 515sp
do
for its in ITS1 ITS2
do
for law in Powerlaw Uniform
do
sh usearch.sh $DATA_DIR/$sp/jeu1/$its/$law/ $DB_USEARCH $USEARCH_OUT_DIR/$sp/$its/$law/ $CPU
done
done
done
QIIME-pe (qiime2-2019.4)
qiime2_train_classifier.sh
#!/bin/sh
module load system/Miniconda3-4.4.10
module load bioinfo/qiime2-2019.4
# sbatch -J Unite_qiime2_train_classifier --mem=20G -o %x.out -e %x.err train_classifier.sh
REF=$1
OUT_DIR=$2
mkdir -p $OUT_DIR
REF_WOEXT=`basename ${REF%.*}`
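# Build the taxonomy TSV (sequence ID <tab> taxonomy) from the FASTA headers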
grep '>' $REF | awk '{sub(">","",$1); print $1"\t"$2}' > $OUT_DIR/$REF_WOEXT.tax
qiime tools import \
--input-path $REF \
--output-path $OUT_DIR/$REF_WOEXT.seq.qza \
--type 'FeatureData[Sequence]'
qiime tools import \
--type 'FeatureData[Taxonomy]' \
--input-format HeaderlessTSVTaxonomyFormat \
--input-path $OUT_DIR/$REF_WOEXT.tax \
--output-path $OUT_DIR/$REF_WOEXT.tax.qza
qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads $OUT_DIR/$REF_WOEXT.seq.qza \
--i-reference-taxonomy $OUT_DIR/$REF_WOEXT.tax.qza \
--o-classifier $OUT_DIR/$REF_WOEXT.classifier.qza
run_qiime_pe.sh
#!/bin/sh
CPU=10
export PATH=/save/frogs/src/FROGS_dev/assessment/bin:/save/frogs/src/FROGS_dev/app:$PATH
export PYTHONPATH=/save/frogs/src/FROGS_dev/lib:$PYTHONPATH
DIR=/work/frogsfungi/FungiPubli/JeuTestDuMoment
DATA_DIR=$DIR/data_input
REF=$DIR/Unite_ref/Unite_7.1_clean_certae_refs.fasta
# QIIME2-PE
QIIME_OUT_DIR=$DIR/results/QIIME2
mkdir -p $QIIME_OUT_DIR/LOG
# note for analyses on real data
# Classification of fungal ITS sequences
# In our experience, fungal ITS classifiers trained on the UNITE reference database do NOT benefit from extracting/trimming reads to primer sites.
# We recommend training UNITE classifiers on the full reference sequences.
# Furthermore, we recommend the “developer” sequences (located within the QIIME-compatible release download)
# because the standard versions of the sequences have already been trimmed to the ITS region (excluding portions of flanking rRNA genes
# that may be present in amplicons generated with standard ITS primers).
##############
# reference build
sbatch -J Unite_7.1_qiime2_train_classifier --mem=20G -o $DIR/results/QIIME2/LOG/%x.out -e $DIR/results/QIIME2/LOG/%x.err --wrap="sh $DIR/scripts/QIIME2/qiime2_train_classifier.sh $REF $DIR/Unite_ref/"
DB_PREFIX=$DIR/Unite_ref/Unite_7.1_clean_certae_refs
##############
# run qiime2 open reference clustering
for sp in 35sp 115sp 515sp
do
for its in ITS1 ITS2
do
for law in Powerlaw Uniform
do
# create manifest
mkdir -p $QIIME_OUT_DIR/$sp/$its/$law/pe
echo -e "sample-id\tforward-absolute-filepath\treverse-absolute-filepath" > $QIIME_OUT_DIR/$sp/$its/$law/pe/manifest_paired.tsv
for R1 in `ls $DATA_DIR/$sp/jeu1/$its/$law/*_R1.fastq.gz`
do
sample_name=`basename $R1 | sed 's/_R1.fastq.gz//'`
R2=`echo $R1 | sed 's/_R1.fastq.gz/_R2.fastq.gz/'`
echo -e $sample_name"\t"$R1"\t"$R2 >> $QIIME_OUT_DIR/$sp/$its/$law/pe/manifest_paired.tsv
done
# run qiime2-pe
sbatch -J qiime2-pe_${sp}_${its}_${law} --cpus-per-task=$CPU -o $QIIME_OUT_DIR/LOG/%x.out -e $QIIME_OUT_DIR/LOG/%x.err --wrap="sh $DIR/scripts/QIIME2/qiime2-pe.sh $QIIME_OUT_DIR/$sp/$its/$law/pe/manifest_paired.tsv $DB_PREFIX"
done
done
done
qiime2-pe.sh
#!/bin/sh
# FROGS/app directory needs to be in the PATH
module load system/Miniconda3-4.4.10
module load bioinfo/qiime2-2019.4
MANIFEST=$1
OUT_DIR=`dirname $MANIFEST`
WORK_DIR=$OUT_DIR/qiime_workdir
mkdir -p $WORK_DIR
DB_PREFIX=$2
DB_REF_SEQ=$DB_PREFIX.seq.qza
DB_REF_CLASSIFIER=$DB_PREFIX.classifier.qza
### import
qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path $MANIFEST \
--output-path $WORK_DIR/pe-demux.qza \
--input-format PairedEndFastqManifestPhred33V2
qiime demux summarize \
--i-data $WORK_DIR/pe-demux.qza \
--o-visualization $WORK_DIR/pe-demux.qzv
# no trimming: there are no adapters and quality is very good
########################################################################
### clustering pipeline: open-reference clustering, paired end
# The features produced by clustering methods are known as operational taxonomic units (OTUs), which is Esperanto for suboptimal, imprecise rubbish
# join pair
# default parameters:
# --p-minlen 1
# --p-maxns (optional)
# --p-allowmergestagger False
# --p-minovlen 10
# --p-maxdiffs 10
# --p-minmergelen (optional)
# --p-maxmergelen (optional)
# --p-maxee (optional)
# --p-qmin 0
# --p-qminout 0
# --p-qmax 41
# --p-qmaxout 41
qiime vsearch join-pairs \
--i-demultiplexed-seqs $WORK_DIR/pe-demux.qza \
--p-allowmergestagger --p-minovlen 10 --p-minmergelen 20 --p-maxmergelen 490 \
--o-joined-sequences $WORK_DIR/pe-join.qza
qiime demux summarize \
--i-data $WORK_DIR/pe-join.qza \
--o-visualization $WORK_DIR/pe-join.qzv
# dereplication
qiime vsearch dereplicate-sequences \
--i-sequences $WORK_DIR/pe-join.qza \
--o-dereplicated-table $WORK_DIR/pe-join-derepTable.qza \
--o-dereplicated-sequences $WORK_DIR/pe-join-derepSeqs.qza
qiime feature-table summarize \
--i-table $WORK_DIR/pe-join-derepTable.qza \
--o-visualization $WORK_DIR/pe-join-derepTable.qzv
# clustering
# Given a feature table and the associated feature sequences, cluster the
# features against a reference database based on user-specified percent
# identity threshold of their sequences. Any sequences that don't match are
# then clustered de novo. This is not a general-purpose clustering method,
# but rather is intended to be used for clustering the results of quality-
# filtering/dereplication methods, such as DADA2, or for re-clustering a
# FeatureTable at a lower percent identity than it was originally clustered
# at. When a group of features in the input table are clustered into a
# single feature, the frequency of that single feature in a given sample is
# the sum of the frequencies of the features that were clustered in that
# sample. Feature identifiers will be inherited from the centroid feature of
# each cluster. For features that match a reference sequence, the centroid
# feature is that reference sequence, so its identifier will become the
# feature identifier. The clustered_sequences result will contain feature
# representative sequences that are derived from the sequences input for all
# features in clustered_table. This will always be the most abundant
# sequence in the cluster. The new_reference_sequences result will contain
# the entire reference database, plus feature representative sequences for
# any de novo features. This is intended to be used as a reference database
# in subsequent iterations of cluster_features_open_reference, if
# applicable. See the vsearch documentation for details on how sequence
# clustering is performed.
# default param : --p-strand plus
qiime vsearch cluster-features-open-reference \
--i-table $WORK_DIR/pe-join-derepTable.qza \
--i-sequences $WORK_DIR/pe-join-derepSeqs.qza \
--i-reference-sequences $DB_REF_SEQ \
--p-perc-identity 0.99 \
--o-clustered-table $WORK_DIR/pe-clusterTable-or-99.qza \
--o-clustered-sequences $WORK_DIR/pe-clusterSeqs-or-99.qza \
--o-new-reference-sequences $WORK_DIR/new-ref-seqs-or-99.qza
qiime feature-table summarize \
--i-table $WORK_DIR/pe-clusterTable-or-99.qza \
--o-visualization $WORK_DIR/pe-clusterTable-or-99.qzv
# chimera removal
# Apply the vsearch uchime_denovo method to identify chimeric feature
# sequences. The results of this method can be used to filter chimeric
# features from the corresponding feature table. For additional details,
# please refer to the vsearch documentation.
# default parameters (vsearch defaults; note that abskew, the min abundance ratio of parent vs chimera (2.0), is not exposed):
# --p-dn 1.4
# --p-mindiffs 3
# --p-mindiv 0.8
# --p-minh 0.28
# --p-xn 8.0
qiime vsearch uchime-denovo \
--i-table $WORK_DIR/pe-clusterTable-or-99.qza \
--i-sequences $WORK_DIR/pe-clusterSeqs-or-99.qza \
--output-dir $WORK_DIR/pe-uchime-dn-out
qiime feature-table filter-features \
--i-table $WORK_DIR/pe-clusterTable-or-99.qza \
--m-metadata-file $WORK_DIR/pe-uchime-dn-out/nonchimeras.qza \
--o-filtered-table $WORK_DIR/pe-nonchimeric-wo-borderline-table.qza
qiime feature-table summarize \
--i-table $WORK_DIR/pe-nonchimeric-wo-borderline-table.qza \
--o-visualization $WORK_DIR/pe-nonchimeric-wo-borderline-table.qzv
qiime feature-table filter-seqs \
--i-data $WORK_DIR/pe-clusterSeqs-or-99.qza \
--m-metadata-file $WORK_DIR/pe-uchime-dn-out/nonchimeras.qza \
--o-filtered-data $WORK_DIR/pe-nonchimeric-wo-borderline-seqs.qza
# filter single singletons
# q2-vsearch implements three different OTU clustering strategies: de novo, closed reference, and open reference.
# All should be preceded by basic quality-score-based filtering and followed by chimera filtering and aggressive
# OTU filtering (the treacherous trio, a.k.a. the Bokulich method). ==> 0.00005%
# ==> problem: this relative-frequency filter cannot be applied automatically
# ==> so we filter single singletons instead
qiime feature-table filter-features \
--i-table $WORK_DIR/pe-nonchimeric-wo-borderline-table.qza \
--p-min-frequency 2 \
--o-filtered-table $WORK_DIR/pe-nonchimeric-wo-borderline-noSingle-table.qza
qiime feature-table summarize \
--i-table $WORK_DIR/pe-nonchimeric-wo-borderline-noSingle-table.qza \
--o-visualization $WORK_DIR/pe-nonchimeric-wo-borderline-noSingle-table.qzv
qiime feature-table filter-seqs \
--i-data $WORK_DIR/pe-nonchimeric-wo-borderline-seqs.qza \
--i-table $WORK_DIR/pe-nonchimeric-wo-borderline-noSingle-table.qza \
--o-filtered-data $WORK_DIR/pe-nonchimeric-wo-borderline-noSingle-seqs.qza
# affiliation
qiime feature-classifier classify-sklearn \
--i-classifier $DB_REF_CLASSIFIER \
--i-reads $WORK_DIR/pe-nonchimeric-wo-borderline-noSingle-seqs.qza \
--o-classification $WORK_DIR/pe-nonchimeric-wo-borderline-noSingle-tax.qza
qiime metadata tabulate \
--m-input-file $WORK_DIR/pe-nonchimeric-wo-borderline-noSingle-tax.qza \
--o-visualization $WORK_DIR/pe-nonchimeric-wo-borderline-noSingle-tax.qzv
# export biom
qiime tools export \
--input-path $WORK_DIR/pe-nonchimeric-wo-borderline-noSingle-table.qza \
--output-path $WORK_DIR/export
qiime tools export \
--input-path $WORK_DIR/pe-nonchimeric-wo-borderline-noSingle-tax.qza \
--output-path $WORK_DIR/export
# create final biom
sed -i 's/Feature/#Feature/' $WORK_DIR/export/taxonomy.tsv
biom add-metadata -i $WORK_DIR/export/feature-table.biom \
--observation-metadata-fp $WORK_DIR/export/taxonomy.tsv \
-o $OUT_DIR/pe-qiime.biom \
--output-as-json
# export fasta
qiime tools export \
--input-path $WORK_DIR/pe-nonchimeric-wo-borderline-noSingle-seqs.qza \
--output-path $WORK_DIR/export
mv $WORK_DIR/export/dna-sequences.fasta $OUT_DIR/pe-qiime.fasta
# create abundance table in TSV format
biom_to_tsv.py -b $OUT_DIR/pe-qiime.biom -f $OUT_DIR/pe-qiime.fasta -t $OUT_DIR/pe-qiime.tsv
QIIME-se (qiime2-2019.4)
qiime2_train_classifier.sh
#!/bin/sh
module load system/Miniconda3-4.4.10
module load bioinfo/qiime2-2019.4
REF=$1
OUT_DIR=$2
mkdir -p $OUT_DIR
REF_WOEXT=`basename ${REF%.*}`
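# Build the taxonomy TSV (sequence ID <tab> taxonomy) from the FASTA headers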
grep '>' $REF | awk '{sub(">","",$1); print $1"\t"$2}' > $OUT_DIR/$REF_WOEXT.tax
qiime tools import \
--input-path $REF \
--output-path $OUT_DIR/$REF_WOEXT.seq.qza \
--type 'FeatureData[Sequence]'
qiime tools import \
--type 'FeatureData[Taxonomy]' \
--input-format HeaderlessTSVTaxonomyFormat \
--input-path $OUT_DIR/$REF_WOEXT.tax \
--output-path $OUT_DIR/$REF_WOEXT.tax.qza
qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads $OUT_DIR/$REF_WOEXT.seq.qza \
--i-reference-taxonomy $OUT_DIR/$REF_WOEXT.tax.qza \
--o-classifier $OUT_DIR/$REF_WOEXT.classifier.qza
run_qiime-se.sh
#!/bin/sh
CPU=1
DIR="."
DATA_DIR=$DIR/data_input
REF=$DIR/Unite_ref/Unite_7.1_clean_certae_refs.fasta
# QIIME2-SE
QIIME_OUT_DIR=$DIR/results/QIIME2
mkdir -p $QIIME_OUT_DIR/LOG
# Classification of fungal ITS sequences
# In our experience, fungal ITS classifiers trained on the UNITE reference database do NOT benefit from extracting/trimming reads to primer sites.
# We recommend training UNITE classifiers on the full reference sequences.
# Furthermore, we recommend the “developer” sequences (located within the QIIME-compatible release download)
# because the standard versions of the sequences have already been trimmed to the ITS region (excluding portions of flanking rRNA genes
# that may be present in amplicons generated with standard ITS primers).
##############
# reference build
sh $DIR/scripts/QIIME2/qiime2_train_classifier.sh $REF $DIR/Unite_ref/
DB_PREFIX=$DIR/Unite_ref/Unite_7.1_clean_certae_refs
##############
# run qiime2 open reference clustering
for sp in 35sp 115sp 515sp
do
for its in ITS1 ITS2
do
for law in Powerlaw Uniform
do
# create manifest
mkdir -p $QIIME_OUT_DIR/$sp/$its/$law/se
echo -e "sample-id\tabsolute-filepath" > $QIIME_OUT_DIR/$sp/$its/$law/se/manifest_singleFwd.tsv
for R1 in `ls $DATA_DIR/$sp/jeu1/$its/$law/*_R1.fastq.gz`
do
sample_name=`basename $R1 | sed 's/_R1.fastq.gz//'`
echo -e $sample_name"\t"$R1 >> $QIIME_OUT_DIR/$sp/$its/$law/se/manifest_singleFwd.tsv
done
# launch qiime2-se
sh $DIR/scripts/QIIME2/qiime2-se.sh $QIIME_OUT_DIR/$sp/$its/$law/se/manifest_singleFwd.tsv $DB_PREFIX $CPU
done
done
done
qiime2-se.sh
#!/bin/sh
module load system/Miniconda3-4.4.10
module load bioinfo/qiime2-2019.4
MANIFEST=$1
OUT_DIR=`dirname $MANIFEST`
WORK_DIR=$OUT_DIR/qiime_workdir
mkdir -p $WORK_DIR
DB_PREFIX=$2
DB_REF_SEQ=$DB_PREFIX.seq.qza
DB_REF_CLASSIFIER=$DB_PREFIX.classifier.qza
CPU=$3
### import
qiime tools import \
--type 'SampleData[SequencesWithQuality]' \
--input-path $MANIFEST \
--output-path $WORK_DIR/se-demux.qza \
--input-format SingleEndFastqManifestPhred33V2
qiime demux summarize \
--i-data $WORK_DIR/se-demux.qza \
--o-visualization $WORK_DIR/se-demux.qzv
########################################################################
### clustering pipeline: open-reference clustering, single end
# The features produced by clustering methods are known as operational taxonomic units (OTUs), which is Esperanto for suboptimal, imprecise rubbish
qiime vsearch dereplicate-sequences \
--i-sequences $WORK_DIR/se-demux.qza \
--o-dereplicated-table $WORK_DIR/se-derepTable.qza \
--o-dereplicated-sequences $WORK_DIR/se-derepSeqs.qza
qiime feature-table summarize \
--i-table $WORK_DIR/se-derepTable.qza \
--o-visualization $WORK_DIR/se-derepTable.qzv
# clustering
# Given a feature table and the associated feature sequences, cluster the
# features against a reference database based on user-specified percent
# identity threshold of their sequences. Any sequences that don't match are
# then clustered de novo. This is not a general-purpose clustering method,
# but rather is intended to be used for clustering the results of quality-
# filtering/dereplication methods, such as DADA2, or for re-clustering a
# FeatureTable at a lower percent identity than it was originally clustered
# at. When a group of features in the input table are clustered into a
# single feature, the frequency of that single feature in a given sample is
# the sum of the frequencies of the features that were clustered in that
# sample. Feature identifiers will be inherited from the centroid feature of
# each cluster. For features that match a reference sequence, the centroid
# feature is that reference sequence, so its identifier will become the
# feature identifier. The clustered_sequences result will contain feature
# representative sequences that are derived from the sequences input for all
# features in clustered_table. This will always be the most abundant
# sequence in the cluster. The new_reference_sequences result will contain
# the entire reference database, plus feature representative sequences for
# any de novo features. This is intended to be used as a reference database
# in subsequent iterations of cluster_features_open_reference, if
# applicable. See the vsearch documentation for details on how sequence
# clustering is performed.
# default param : --p-strand plus
qiime vsearch cluster-features-open-reference \
--i-table $WORK_DIR/se-derepTable.qza \
--i-sequences $WORK_DIR/se-derepSeqs.qza \
--i-reference-sequences $DB_REF_SEQ \
--p-perc-identity 0.99 --p-threads $CPU \
--o-clustered-table $WORK_DIR/se-clusterTable-or-99.qza \
--o-clustered-sequences $WORK_DIR/se-clusterSeqs-or-99.qza \
--o-new-reference-sequences $WORK_DIR/new-ref-seqs-or-99.qza
qiime feature-table summarize \
--i-table $WORK_DIR/se-clusterTable-or-99.qza \
--o-visualization $WORK_DIR/se-clusterTable-or-99.qzv
# remove chimera
# Apply the vsearch uchime_denovo method to identify chimeric feature
# sequences. The results of this method can be used to filter chimeric
# features from the corresponding feature table. For additional details,
# please refer to the vsearch documentation.
# default parameters (vsearch defaults; note that abskew, the min abundance ratio of parent vs chimera (2.0), is not exposed):
# --p-dn 1.4
# --p-mindiffs 3
# --p-mindiv 0.8
# --p-minh 0.28
# --p-xn 8.0
qiime vsearch uchime-denovo \
--i-table $WORK_DIR/se-clusterTable-or-99.qza \
--i-sequences $WORK_DIR/se-clusterSeqs-or-99.qza \
--output-dir $WORK_DIR/se-uchime-dn-out
qiime feature-table filter-features \
--i-table $WORK_DIR/se-clusterTable-or-99.qza \
--m-metadata-file $WORK_DIR/se-uchime-dn-out/nonchimeras.qza \
--o-filtered-table $WORK_DIR/se-nonchimeric-wo-borderline-table.qza
qiime feature-table summarize \
--i-table $WORK_DIR/se-nonchimeric-wo-borderline-table.qza \
--o-visualization $WORK_DIR/se-nonchimeric-wo-borderline-table.qzv
qiime feature-table filter-seqs \
--i-data $WORK_DIR/se-clusterSeqs-or-99.qza \
--m-metadata-file $WORK_DIR/se-uchime-dn-out/nonchimeras.qza \
--o-filtered-data $WORK_DIR/se-nonchimeric-wo-borderline-seqs.qza
# filter single singletons
# q2-vsearch implements three different OTU clustering strategies: de novo, closed reference, and open reference.
# All should be preceded by basic quality-score-based filtering and followed by chimera filtering and aggressive
# OTU filtering (the treacherous trio, a.k.a. the Bokulich method). ==> 0.00005%
# ==> problem: this relative-frequency filter cannot be applied automatically
# ==> so we filter single singletons instead
qiime feature-table filter-features \
--i-table $WORK_DIR/se-nonchimeric-wo-borderline-table.qza \
--p-min-frequency 2 \
--o-filtered-table $WORK_DIR/se-nonchimeric-wo-borderline-noSingle-table.qza
qiime feature-table summarize \
--i-table $WORK_DIR/se-nonchimeric-wo-borderline-noSingle-table.qza \
--o-visualization $WORK_DIR/se-nonchimeric-wo-borderline-noSingle-table.qzv
qiime feature-table filter-seqs \
--i-data $WORK_DIR/se-nonchimeric-wo-borderline-seqs.qza \
--i-table $WORK_DIR/se-nonchimeric-wo-borderline-noSingle-table.qza \
--o-filtered-data $WORK_DIR/se-nonchimeric-wo-borderline-noSingle-seqs.qza
# affiliation
qiime feature-classifier classify-sklearn \
--i-classifier $DB_REF_CLASSIFIER \
--i-reads $WORK_DIR/se-nonchimeric-wo-borderline-noSingle-seqs.qza \
--o-classification $WORK_DIR/se-nonchimeric-wo-borderline-noSingle-tax.qza
qiime metadata tabulate \
--m-input-file $WORK_DIR/se-nonchimeric-wo-borderline-noSingle-tax.qza \
--o-visualization $WORK_DIR/se-nonchimeric-wo-borderline-noSingle-tax.qzv
# export biom
qiime tools export \
--input-path $WORK_DIR/se-nonchimeric-wo-borderline-noSingle-table.qza \
--output-path $WORK_DIR/export
qiime tools export \
--input-path $WORK_DIR/se-nonchimeric-wo-borderline-noSingle-tax.qza \
--output-path $WORK_DIR/export
# create final biom
sed -i 's/Feature/#Feature/' $WORK_DIR/export/taxonomy.tsv
biom add-metadata -i $WORK_DIR/export/feature-table.biom \
--observation-metadata-fp $WORK_DIR/export/taxonomy.tsv \
-o $OUT_DIR/se-qiime.biom \
--output-as-json
# export fasta
qiime tools export \
--input-path $WORK_DIR/se-nonchimeric-wo-borderline-noSingle-seqs.qza \
--output-path $WORK_DIR/export
mv $WORK_DIR/export/dna-sequences.fasta $OUT_DIR/se-qiime.fasta
# create abundance table in TSV format
biom_to_tsv.py -b $OUT_DIR/se-qiime.biom -f $OUT_DIR/se-qiime.fasta -t $OUT_DIR/se-qiime.tsv
References
- Escudié, Frédéric, Lucas Auer, Maria Bernard, Mahendra Mariadassou, Laurent Cauquil, Katia Vidal, Sarah Maman, Guillermina Hernandez-Raquet, Sylvie Combes, and Géraldine Pascal. 2018. "FROGS: Find, Rapidly, OTUs with Galaxy Solution." Bioinformatics 34 (8): 1287–94. https://doi.org/10.1093/bioinformatics/btx791
- Abarenkov, Kessy, R Henrik Nilsson, Karl-Henrik Larsson, Ian J Alexander, Ursula Eberhardt, Susanne Erland, Klaus Høiland, et al. 2010. "The UNITE Database for Molecular Identification of Fungi – Recent Updates and Future Perspectives." The New Phytologist 186 (2): 281–85. https://doi.org/10.1111/j.1469-8137.2009.03160.x
- Bolyen, Evan, Jai Ram Rideout, Matthew R Dillon, Nicholas A Bokulich, Christian C Abnet, Gabriel A Al-Ghalith, Harriet Alexander, et al. 2019. "Reproducible, Interactive, Scalable and Extensible Microbiome Data Science Using QIIME 2." Nature Biotechnology 37 (8): 852–57. https://doi.org/10.1038/s41587-019-0209-9
- Callahan, Benjamin J, Paul J McMurdie, Michael J Rosen, Andrew W Han, Amy Jo A Johnson, and Susan P Holmes. 2016. "DADA2: High-Resolution Sample Inference from Illumina Amplicon Data." Nature Methods 13 (7): 581–83. https://doi.org/10.1038/nmeth.3869
- Edgar, Robert C. 2016. "UNOISE2: Improved Error-Correction for Illumina 16S and ITS Amplicon Sequencing." bioRxiv. https://doi.org/10.1101/081257