FROGS ITS Benchmarking
Introduction
The objective of these analyses is to compare the tools currently available for analyzing large ITS datasets. To this end, we analyzed simulated ITS1 and ITS2 data. FROGS (Escudié et al. 2018), USEARCH (Edgar 2016), DADA2 (Callahan et al. 2016) and QIIME 2 (Bolyen et al. 2019) were run and compared following their respective guidelines.
Simulated data
We produced Illumina MiSeq-like paired-end ITS1 and ITS2 reads (R1 and R2) from as diverse a set of Fungi as possible. Reference sequences were extracted from the UNITE v7.1 database (Abarenkov et al. 2010), which is centered on the eukaryotic nuclear ribosomal ITS region, and 35, 115 or 515 fungal species were picked. Sequencing errors and chimeras were incorporated to reflect what is actually obtained after sequencing: reads were simulated with a number of substitutions and indels that increases linearly along the read. Samples were generated with two types of abundance distribution, following either a power law or a uniform law. We deliberately included long ITS1 and ITS2 amplicons that cannot be fully covered by two overlapping 250-nucleotide paired reads.
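The simulation code itself is not reproduced here, but the following minimal R sketch illustrates the two ingredients described above. The function names (simulate_abundances, mutate_read) and the error-rate values (0.1% at the first base, 2% at the last) are illustrative assumptions, not the parameters used to build the benchmark datasets; indels and chimeras are omitted for brevity.
# Minimal sketch (R) of the simulation principles: abundances drawn from a
# uniform or power law, and a per-base substitution probability that
# increases linearly along the read.
simulate_abundances <- function(n_species, law = c("Uniform", "Powerlaw"), alpha = 1.5) {
  law <- match.arg(law)
  w <- switch(law,
              Uniform  = rep(1, n_species),
              Powerlaw = (1:n_species)^(-alpha))  # rank-abundance power law
  w / sum(w)                                      # relative abundances summing to 1
}
mutate_read <- function(read, error_start = 0.001, error_end = 0.02) {
  bases <- strsplit(read, "")[[1]]
  p <- seq(error_start, error_end, length.out = length(bases))  # linear error profile
  hit <- runif(length(bases)) < p
  bases[hit] <- vapply(bases[hit],
                       function(b) sample(setdiff(c("A", "C", "G", "T"), b), 1),
                       character(1))
  paste(bases, collapse = "")
}
set.seed(1)
abund   <- simulate_abundances(35, "Powerlaw")
read    <- paste(sample(c("A", "C", "G", "T"), 250, replace = TRUE), collapse = "")
mutated <- mutate_read(read)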
Results
The results obtained with each tool (FROGS, DADA2, USEARCH and QIIME 2) were compared to the expected composition, and several metrics were calculated:
- At the OTU/ASV/ZOTU level:
  - Richness, expressed as the number of OTUs/ASVs/ZOTUs
- At the species level:
  - Abundance divergence rate
  - False negatives
  - False positives
  - True positives
Divergence evaluates a tool's ability to recover the expected community composition at the species level. It is defined as affiliation divergence = 100 - ∑ min(expected, found) and is equal to the Canberra distance (expressed in percent) between the expected (i.e. the true taxonomic composition of the community) and observed (i.e. the one inferred by the OTU picking tool) community compositions, after aggregating OTUs at a given taxonomic level. The second metric, FN, is the number of False Negative species, i.e. species present in the original fungal community but not discovered by the OTU picking method (FROGS, DADA2, USEARCH or QIIME 2); these are the species lost during OTU picking. This metric evaluates sensitivity, i.e. the probability of detecting the species present in the dataset. The third metric, FP, is the number of False Positive species, i.e. species discovered by the OTU picking method but not present in the original fungal community; it evaluates a tool's propensity to create spurious OTUs. Finally, TP is the number of True Positive species, i.e. species both expected and recovered.
N.B.: FROGS and QIIME 2 produce OTUs, DADA2 produces ASVs and USEARCH produces ZOTUs.
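For clarity, here is a small R sketch of how these metrics can be computed. It assumes `expected` and `observed` are named vectors of species-level abundances expressed in percent (each summing to 100), which matches the definitions above but is not the benchmark's actual evaluation code.
# Sketch of the metrics: divergence = 100 - sum of the per-species minima,
# plus TP/FN/FP species counts.
divergence_metrics <- function(expected, observed) {
  species <- union(names(expected), names(observed))
  e <- setNames(numeric(length(species)), species); e[names(expected)] <- expected
  o <- setNames(numeric(length(species)), species); o[names(observed)] <- observed
  list(divergence = 100 - sum(pmin(e, o)),  # affiliation divergence (percent)
       TP = sum(e > 0 & o > 0),             # expected species that were recovered
       FN = sum(e > 0 & o == 0),            # expected species that were lost
       FP = sum(e == 0 & o > 0))            # spurious species
}
expected <- c(sp_A = 60, sp_B = 40)
observed <- c(sp_A = 55, sp_C = 45)
divergence_metrics(expected, observed)  # divergence 45, TP 1, FN 1, FP 1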
35 species
Amplicon length distribution
List of mergeable amplicons
List of non-mergeable amplicons
Expected vs. observed depth (count of sequences)
35sp ITS1 Uniform
35sp ITS1 Powerlaw
35sp ITS2 Uniform
35sp ITS2 Powerlaw
Expected vs. observed Richness (count of OTU, ASV, ZOTU)
35sp ITS1 Uniform
35sp ITS1 Powerlaw
35sp ITS2 Uniform
35sp ITS2 Powerlaw
Metrics calculated relative to the expected community
Affiliations and associated abundances are taken into account.
35sp ITS1 Uniform
35sp ITS1 Powerlaw
35sp ITS2 Uniform
35sp ITS2 Powerlaw
Taxonomies found vs. lost
Column order is based on the Canberra distance.
35sp ITS1 Uniform
35sp ITS1 Powerlaw
35sp ITS2 Uniform
35sp ITS2 Powerlaw
Reconstruction of the reference.
35sp ITS1 Uniform
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
35 | 35 | 33 | 27 | 19 | 27 | 31 |
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
35 | 34 | 20 | 2 | 12 | 2 | 21 |
35sp ITS1 Powerlaw
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
35 | 35 | 33 | 27 | 21 | 27 | 31 |
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
35 | 34 | 20 | 2 | 12 | 2 | 21 |
35sp ITS2 Uniform
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
35 | 35 | 33 | 18 | 18 | 33 | 33 |
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
35 | 34 | 29 | 0 | 12 | 0 | 30 |
35sp ITS2 Powerlaw
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
35 | 35 | 32 | 19 | 19 | 33 | 33 |
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
35 | 34 | 29 | 0 | 12 | 0 | 30 |
115 species
Amplicon length distribution
List of mergeable amplicons
List of non-mergeable amplicons
Expected vs. observed depth (count of sequences)
115sp ITS1 Uniform
115sp ITS1 Powerlaw
115sp ITS2 Uniform
115sp ITS2 Powerlaw
Expected vs. observed Richness (count of OTU, ASV, ZOTU)
115sp ITS1 Uniform
115sp ITS1 Powerlaw
115sp ITS2 Uniform
115sp ITS2 Powerlaw
Metrics calculated relative to the expected community
Affiliations and associated abundances are taken into account.
115sp ITS1 Uniform
115sp ITS1 Powerlaw
115sp ITS2 Uniform
115sp ITS2 Powerlaw
Taxonomies found vs. lost
Column order is based on the Canberra distance.
115sp ITS1 Uniform
115sp ITS1 Powerlaw
115sp ITS2 Uniform
115sp ITS2 Powerlaw
Reconstruction of the reference.
115sp ITS1 Uniform
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
115 | 115 | 102 | 67 | 55 | 75 | 103 |
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
115 | 114 | 89 | 0 | 40 | 0 | 93 |
115sp ITS1 Powerlaw
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
115 | 115 | 103 | 65 | 56 | 73 | 102 |
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
115 | 114 | 90 | 0 | 40 | 0 | 94 |
115sp ITS2 Uniform
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
115 | 115 | 95 | 49 | 47 | 104 | 104 |
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
115 | 113 | 91 | 1 | 39 | 1 | 100 |
115sp ITS2 Powerlaw
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
115 | 115 | 96 | 48 | 48 | 102 | 103 |
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
115 | 113 | 92 | 1 | 39 | 1 | 99 |
515 species
Amplicon length distribution
List of mergeable amplicons
List of non-mergeable amplicons
Expected vs. observed depth (count of sequences)
515sp ITS1 Uniform
515sp ITS1 Powerlaw
515sp ITS2 Uniform
515sp ITS2 Powerlaw
Expected vs. observed Richness (count of OTU, ASV, ZOTU)
515sp ITS1 Uniform
515sp ITS1 Powerlaw
515sp ITS2 Uniform
515sp ITS2 Powerlaw
Metrics calculated relative to the expected community
Affiliations and associated abundances are taken into account.
515sp ITS1 Uniform
515sp ITS1 Powerlaw
515sp ITS2 Uniform
515sp ITS2 Powerlaw
Taxonomies found vs. lost
Column order is based on the Canberra distance.
515sp ITS1 Uniform
515sp ITS1 Powerlaw
515sp ITS2 Uniform
515sp ITS2 Powerlaw
Reconstruction of the reference.
515sp ITS1 Uniform
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
515 | 513 | 453 | 270 | 252 | 302 | 472 |
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
515 | 505 | 433 | 1 | 220 | 1 | 459 |
515sp ITS1 Powerlaw
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
515 | 506 | 448 | 262 | 244 | 285 | 465 |
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
515 | 498 | 429 | 1 | 220 | 1 | 455 |
515sp ITS2 Uniform
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
515 | 512 | 436 | 228 | 221 | 456 | 465 |
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
515 | 504 | 424 | 4 | 188 | 4 | 453 |
515sp ITS2 Powerlaw
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
515 | 504 | 434 | 219 | 212 | 446 | 462 |
EXPECTED | FROGS | USEARCH | DADA2-se | DADA2-pe | QIIME-se | QIIME-pe |
---|---|---|---|---|---|---|
515 | 496 | 423 | 4 | 188 | 4 | 449 |
Command lines
Tools need to be installed and available in the PATH.
FROGS (v. 3.2.0)
#!/bin/bash
SAVE_DIR="results/FROGS"
INPUT_DIR="data_input"
REFERENCE="Unite_ref/Unite_7.1_clean_certae_refs.fasta"
LAW="Uniform Powerlaw"
SP="35sp 115sp 515sp"
ITS="ITS1 ITS2"
CPU=1
# Create archive for FROGS for each dataset
for path in $INPUT_DIR/*sp/jeu1/ITS*/*/ ; do cd $path ; tar zcvf Archive.tar.gz *.fastq.gz; cd - ;done
for sp in $SP
do
for its in $ITS
do
for law in $LAW
do
# Create output directory for each dataset
ARCHIVE_FILE="$(pwd)/$INPUT_DIR/$sp/jeu1/$its/$law/Archive.tar.gz" # absolute path, as we cd into the output directory below
mkdir -p "$SAVE_DIR/$sp/$its/$law/"
cd "$SAVE_DIR/$sp/$its/$law/"
mkdir -p LOG
# Preprocess
preprocess.py illumina --input-archive $ARCHIVE_FILE --keep-unmerged --merge-software pear --min-amplicon-size 20 --max-amplicon-size 490 --without-primers --R1-size 250 --R2-size 250 --nb-cpus $CPU --output-dereplicated preprocess.fasta --output-count preprocess.tsv --summary preprocess.html --log-file preprocess.log
# Clustering
clustering.py --input-fasta preprocess.fasta --input-count preprocess.tsv --distance 1 --nb-cpus $CPU --output-biom clustering.biom --output-fasta clustering.fasta --log-file clustering.log
# Chimera removal
remove_chimera.py --input-fasta clustering.fasta --input-biom clustering.biom --nb-cpus $CPU --out-abundance remove_chimera.biom --non-chimera remove_chimera.fasta --log-file remove_chimera.log --summary remove_chimera.html
# OTU filters
otu_filters.py --min-abundance 0.00005 --nb-cpus $CPU --input-fasta remove_chimera.fasta --input-biom remove_chimera.biom --output-biom filters.biom --output-fasta filters.fasta --log-file filters.log --summary filters.html
# ITSx
itsx.py --nb-cpus $CPU --region $its --check-its-only --input-fasta filters.fasta --input-biom filters.biom --out-fasta itsx.fasta --out-abundance itsx.biom --summary itsx.html --log-file itsx.log
# Taxonomic affiliation
affiliation_OTU.py --nb-cpus $CPU --reference $REFERENCE --input-biom itsx.biom --input-fasta itsx.fasta --output-biom frogs.biom --summary affiliation.html --log-file affiliation.log
# Biom to TSV
biom_to_tsv.py --input-biom frogs.biom --input-fasta itsx.fasta --output-tsv frogs.tsv --output-multi-affi multiaff.tsv --log-file biom_to_tsv.log
# Affiliation filters: parameters are highly dependent on your dataset and questions
# choose either --delete or --mask behavior for filtered OTUs
affiliation_filters.py --input-biom frogs.biom --input-fasta itsx.fasta --output-biom frogs_affi_filter.biom --output-fasta frogs_affi_filter.fasta --summary frogs_affi_filter.html --impacted frogs_affi_filter_impacted_OTU.tsv --impacted-multihit frogs_affi_filter_impacted_OTU_multihit.tsv --log-file affi_filter.log --delete
# add BLAST affiliation filters if needed, e.g.:
# --min-blast-identity 0.8 --min-blast-coverage 0.8 --ignore-blast-taxa "known_contaminant_term or key_species"
cd -
done
done
done
DADA2-pe (v. 1.14.1)
dada2.sh
#!/bin/bash
SAVE_DIR="results/DADA2"
INPUT_DIR="data_input"
SCRIPT_DIR="scripts/DADA2"
REFERENCE="Unite_ref/Unite_7.1_clean_certae_refs_for_DADA2.fasta"
LAW="Uniform Powerlaw"
SP="35sp 115sp 515sp"
ITS="ITS1 ITS2"
CPU=1
for sp in $SP
do
for its in $ITS
do
for law in $LAW
do
mkdir -p "$SAVE_DIR/$sp/$its/$law/"
cd "$SAVE_DIR/$sp/$its/$law/"
mkdir -p LOG
# launch DADA2
module load system/R-3.6.1 libraries/proj-4.9.3 libraries/gdal-2.3.0 && $SCRIPT_DIR/dada2.R --input $INPUT_DIR/$sp/jeu1/$its/$law/ --output $SAVE_DIR/$sp/$its/$law/ --reference $REFERENCE --cpus $CPU
cd -
done
done
done
dada2.R
#!/usr/bin/env Rscript
library(argparse)
library(ShortRead)
library(Biostrings)
library(tidyverse)
library(phyloseq)
library(dada2)
## Get arguments from bash script
args <- commandArgs(TRUE)
parser <- ArgumentParser()
parser$add_argument("-i", "--input", type="character", default=NULL,
help="dataset file name", metavar="character")
parser$add_argument("-o", "--output", type="character", default="out.txt",
help="output file name", metavar="character")
parser$add_argument("--reference", type="character", default=NULL,
help="reference database", metavar="character")
parser$add_argument("--cpus", type="integer", default=1,
help="number of cpus", metavar="N")
args <- parser$parse_args()
input_dir <- args$input
output_dir <- args$output
## Input files
fnFs <- sort(list.files(input_dir, pattern = "_R1.fastq.gz", full.names = TRUE))
fnRs <- sort(list.files(input_dir, pattern = "_R2.fastq.gz", full.names = TRUE))
# Remove sequences with N
path.N <- file.path(output_dir, "filtered")
filtFs <- file.path(output_dir, "filtered", basename(fnFs)) # Put N-filtered files in the filtered/ subdirectory
filtRs <- file.path(output_dir, "filtered", basename(fnRs))
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, maxN = 0, maxEE = c(2, 2),
truncQ = 2, minLen = 50, rm.phix = TRUE, compress = TRUE, multithread = args$cpus)
get.sample.name <- function(fname) paste(strsplit(basename(fname), "_R")[[1]][1],collapse="_")
sample.names <- unname(sapply(fnFs, get.sample.name))
# Dada
dadaFs <- dada(filtFs, err = NULL, selfConsist = TRUE, pool = TRUE, multithread = args$cpus)
dadaRs <- dada(filtRs, err = NULL, selfConsist = TRUE, pool = TRUE, multithread = args$cpus)
names(dadaFs) <- sample.names
names(dadaRs) <- sample.names
# Deal with pairs
mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs)
seqtab <- makeSequenceTable(mergers)
seqtab.nochim <- removeBimeraDenovo(seqtab, multithread=args$cpus)
taxa <- assignTaxonomy(seqtab.nochim, args$reference, multithread = args$cpus)
saveRDS(taxa,file.path(output_dir,"taxa.rds"))
taxa <- taxa[,c(1,2,3,4,5,6,7)]
colnames(taxa) <- c("Kingdom", "Phylum", "Class", "Order" , "Family" , "Genus" , "Species")
### Track reads
getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
# If processing a single sample, remove the sapply calls: e.g. replace sapply(dadaFs, getN) with getN(dadaFs)
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
print(track)
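# Build a FROGS-style abundance TSV: one row per ASV with taxonomy, seed
# sequence, observation name, total count and per-sample counts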
asv_seqs <- colnames(seqtab.nochim)
biomtable <- t(seqtab.nochim)
b <- as_tibble(biomtable)
obs_sum <- rowSums(b)
b$observation_sum <- obs_sum
b$observation_name <- paste0("ASV", 1:nrow(b))
b$seed_sequence <- asv_seqs
taxonomy <- as_tibble(taxa) %>% unite("#taxonomy", 1:7 , sep=";", remove = FALSE)
b$"#taxonomy" <- taxonomy$"#taxonomy"
col_order <- c("#taxonomy", "seed_sequence", "observation_name", "observation_sum",
rownames(seqtab.nochim))
b2 <- b[, col_order]
write.table(b2, file.path(output_dir,"dada2.tsv"), sep="\t", quote=F, row.names=F)
tsv_to_biom <- "tsv_to_biom.py"
system2(tsv_to_biom, args = c("--input-tsv", file.path(output_dir,"dada2.tsv"),
"--output-biom", file.path(output_dir,"dada2.biom"),
"--output-fasta", file.path(output_dir,"dada2.fasta"),
"--log-file", file.path(output_dir,"tsv_to_biom.log")
)
)
samples.out <- rownames(seqtab.nochim)
ps <- phyloseq(otu_table(seqtab.nochim, taxa_are_rows=FALSE) ,
tax_table(taxa))
saveRDS(ps, file.path(output_dir,"dada2.rds"))
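# Re-export the table with FROGS-like "Cluster_<n>" observation names and a
# "#blast_taxonomy" column, then convert it to BIOM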
ids_study <- paste("Cluster", 1:ncol(seqtab.nochim), sep = "_")
asv_headers <- paste(">",ids_study, sep="")
asv_seqs <- colnames(seqtab.nochim)
asv_fasta <- c(rbind(asv_headers, asv_seqs))
taxonomy_fields <- apply(unname(taxa), 1, function(x) paste(x, collapse=";"))
seqtab.nochim_t <- as.data.frame(t(seqtab.nochim))
samples_ids <- colnames(seqtab.nochim_t)
seqtab.nochim_t$ids <- ids_study
seqtab.nochim_t$"#blast_taxonomy" <- taxanomy_fields
seqtab.nochim_t$"seed_sequence" <- asv_seqs
seqtab.nochim_t$"observation_name" <- ids_study
seqtab.nochim_t$observation_sum <- rowSums(t(seqtab.nochim))
seqtab.nochim_t <- seqtab.nochim_t[, c("#blast_taxonomy", "seed_sequence", "observation_name", "observation_sum", samples_ids)]
write.table(x = rbind(colnames(seqtab.nochim_t), seqtab.nochim_t),
file = file.path(output_dir, "dada2.tsv"),
quote = FALSE,
sep = "\t",
col.names = FALSE,
row.names = FALSE)
system2(tsv_to_biom, args=c("--input-tsv", file.path(output_dir, "dada2.tsv"), "--output-biom", file.path(output_dir, "dada2.biom")))
DADA2-se (v. 1.14.1)
dada2_se.sh
#!/bin/bash
SAVE_DIR="results/DADA2_SE2"
INPUT_DIR="data_input"
SCRIPT_DIR="scripts/DADA2_SE"
REFERENCE="Unite_ref/Unite_7.1_clean_certae_refs_for_DADA2.fasta"
LAW="Uniform Powerlaw"
SP="35sp 115sp 515sp"
ITS="ITS1 ITS2"
CPU=1
for sp in $SP
do
for its in $ITS
do
for law in $LAW
do
mkdir -p "$SAVE_DIR/$sp/$its/$law/"
cd "$SAVE_DIR/$sp/$its/$law/"
mkdir -p LOG
# launch DADA2_SE
module load system/R-3.6.1 libraries/proj-4.9.3 libraries/gdal-2.3.0 && $SCRIPT_DIR/dada2_se.R --input $INPUT_DIR/$sp/jeu1/$its/$law/ --output $SAVE_DIR/$sp/$its/$law/ --reference $REFERENCE --cpus $CPU
cd -
done
done
done
dada2_se.R
#!/usr/bin/env Rscript
library(argparse)
library(ShortRead)
library(Biostrings)
library(tidyverse)
library(phyloseq)
library(dada2)
## Get arguments from bash script
args <- commandArgs(TRUE)
parser <- ArgumentParser()
parser$add_argument("-i", "--input", type="character", default=NULL,
help="dataset file name", metavar="character")
parser$add_argument("-o", "--output", type="character", default="out.txt",
help="output file name", metavar="character")
parser$add_argument("--reference", type="character", default=NULL,
help="reference database", metavar="character")
parser$add_argument("--cpus", type="integer", default=1,
help="number of cpus", metavar="N")
args <- parser$parse_args()
input_dir <- args$input
output_dir <- args$output
## Input files
fnFs <- sort(list.files(input_dir, pattern = "_R1.fastq.gz", full.names = TRUE))
# Remove sequences with N
path.N <- file.path(output_dir, "filtered")
filtFs <- file.path(output_dir, "filtered", basename(fnFs)) # Put N-filtered files in the filtered/ subdirectory
filterAndTrim(fnFs, filtFs, maxN = 0, maxEE = 2,
truncQ = 2, minLen = 50, rm.phix = TRUE, compress = TRUE, multithread = args$cpus)
get.sample.name <- function(fname) paste(strsplit(basename(fname), "_R")[[1]][1],collapse="_")
sample.names <- unname(sapply(fnFs, get.sample.name))
# Dada
dadaFs <- dada(filtFs, err = NULL, selfConsist = TRUE, pool = TRUE, multithread = args$cpus)
names(dadaFs) <- sample.names
seqtab <- makeSequenceTable(dadaFs)
seqtab.nochim <- removeBimeraDenovo(seqtab, multithread=args$cpus)
taxa <- assignTaxonomy(seqtab.nochim, args$reference, multithread = args$cpus)
saveRDS(taxa,file.path(output_dir,"taxa.rds"))
taxa <- taxa[,c(1,2,3,4,5,6,7)]
colnames(taxa) <- c("Kingdom", "Phylum", "Class", "Order" , "Family" , "Genus" , "Species")
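# Build a FROGS-style abundance TSV: one row per ASV with taxonomy, seed
# sequence, observation name, total count and per-sample counts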
asv_seqs <- colnames(seqtab.nochim)
biomtable <- t(seqtab.nochim)
b <- as_tibble(biomtable)
obs_sum <- rowSums(b)
b$observation_sum <- obs_sum
b$observation_name <- paste0("ASV", 1:nrow(b))
b$seed_sequence <- asv_seqs
taxonomy <- as_tibble(taxa) %>% unite("#taxonomy", 1:7 , sep=";", remove = FALSE)
b$"#taxonomy" <- taxonomy$"#taxonomy"
col_order <- c("#taxonomy", "seed_sequence", "observation_name", "observation_sum",
rownames(seqtab.nochim))
b2 <- b[, col_order]
write.table(b2, file.path(output_dir,"dada2.tsv"), sep="\t", quote=F, row.names=F)
tsv_to_biom <- "tsv_to_biom.py"
system2(tsv_to_biom, args = c("--input-tsv", file.path(output_dir,"dada2.tsv"),
"--output-biom", file.path(output_dir,"dada2.biom"),
"--output-fasta", file.path(output_dir,"dada2.fasta"),
"--log-file", file.path(output_dir,"tsv_to_biom.log")
)
)
samples.out <- rownames(seqtab.nochim)
ps <- phyloseq(otu_table(seqtab.nochim, taxa_are_rows=FALSE) ,
tax_table(taxa))
saveRDS(ps, file.path(output_dir,"dada2.rds"))
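# Re-export the table with FROGS-like "Cluster_<n>" observation names and a
# "#blast_taxonomy" column, then convert it to BIOM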
ids_study <- paste("Cluster", 1:ncol(seqtab.nochim), sep = "_")
asv_headers <- paste(">",ids_study, sep="")
asv_seqs <- colnames(seqtab.nochim)
asv_fasta <- c(rbind(asv_headers, asv_seqs))
taxonomy_fields <- apply(unname(taxa), 1, function(x) paste(x, collapse=";"))
seqtab.nochim_t <- as.data.frame(t(seqtab.nochim))
samples_ids <- colnames(seqtab.nochim_t)
seqtab.nochim_t$ids <- ids_study
seqtab.nochim_t$"#blast_taxonomy" <- taxanomy_fields
seqtab.nochim_t$"seed_sequence" <- asv_seqs
seqtab.nochim_t$"observation_name" <- ids_study
seqtab.nochim_t$observation_sum <- rowSums(t(seqtab.nochim))
seqtab.nochim_t <- seqtab.nochim_t[, c("#blast_taxonomy", "seed_sequence", "observation_name", "observation_sum", samples_ids)]
write.table(x = rbind(colnames(seqtab.nochim_t), seqtab.nochim_t),
file = file.path(output_dir, "dada2.tsv"),
quote = FALSE,
sep = "\t",
col.names = FALSE,
row.names = FALSE)
system2(tsv_to_biom, args=c("--input-tsv", file.path(output_dir, "dada2.tsv"), "--output-biom", file.path(output_dir, "dada2.biom")))
USEARCH (usearch11.0.667)
makeudb_usearch.sh
#!/bin/sh
REF=$1
OUT_DIR=$2
mkdir -p $OUT_DIR
REF_WOEXT=`basename ${REF%.*}`
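# Reformat FASTA headers for SINTAX: ">ID taxo1;taxo2;...;" becomes ">ID;tax=taxo1,taxo2,...;"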
cat $REF | awk '{if(substr($1,1,1)==">"){gsub(";",",",$2) ; print $1";tax="$2}else{print $0}}' | sed -r 's/,$/;/' > $OUT_DIR/${REF_WOEXT}_usearch.fasta
usearch -makeudb_usearch $OUT_DIR/${REF_WOEXT}_usearch.fasta -output $OUT_DIR/${REF_WOEXT}_usearch.udb
usearch.sh
#!/bin/sh
READS_DIR=$1
REF=$2
OUT_DIR=$3
CPU=$4
mkdir -p $OUT_DIR/usearch/log
###############################################################################################
# Merge all R1/R2 file pairs
# Add sample name to read labels (-relabel @ option)
# Pool samples together into raw.fq
# keep R1 unmerged reads
mkdir $OUT_DIR/data
for R1 in $READS_DIR/*_R1.fastq.gz
do
R2=`echo $R1 |sed 's/_R1.fastq.gz/_R2.fastq.gz/'`
sample=`basename $R1 |sed 's/_R1.fastq.gz//'`
gzip -dc $R1 > $OUT_DIR/data/`basename $R1`
gzip -dc $R2 > $OUT_DIR/data/`basename $R2`
R1=$OUT_DIR/data/`basename $R1`
R2=$OUT_DIR/data/`basename $R2`
usearch -fastq_mergepairs $R1 -reverse $R2 \
-sample $sample -threads $CPU \
-fastqout $OUT_DIR/usearch/${sample}_raw_paired.fq -fastqout_notmerged_fwd $OUT_DIR/usearch/${sample}_raw_unpaired_fw.fq 2>&1 | sed -e 's/\r/\n/g' > $OUT_DIR/usearch/log/${sample}_fastq_mergepairs.log
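# Tag unmerged R1 reads with their sample: append ";sample=<name>;" to the
# header (first line of each 4-line FASTQ record)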
awk -v S=$sample '{c++; if(c==1){print $0";sample="S";"}else{print $0; if(c==4){c=0}}}' $OUT_DIR/usearch/${sample}_raw_unpaired_fw.fq > $OUT_DIR/usearch/${sample}_raw_unpaired_fw_rename.fq
rm $OUT_DIR/usearch/${sample}_raw_unpaired_fw.fq
done
rm -r $OUT_DIR/data
# concat R1R2 merged and R1 unmerged reads
cat $OUT_DIR/usearch/*raw_paired.fq $OUT_DIR/usearch/*_raw_unpaired_fw_rename.fq > $OUT_DIR/usearch/amplicon.fq
rm $OUT_DIR/usearch/*raw_paired.fq $OUT_DIR/usearch/*raw_unpaired_fw_rename.fq
###############################################################################################
# Amplicon sequence quality filter
# Discard reads with more than 1.0 total expected errors (-fastq_maxee): https://www.drive5.com/usearch/manual/exp_errs.html
# (the related -fastq_maxee_rate option works per base: set to 0.01, a read of
# length 100 is discarded if its expected errors exceed 1, and a read of
# length 1,000 if they exceed 10)
usearch -fastq_filter $OUT_DIR/usearch/amplicon.fq \
-fastq_maxee 1.0 -relabel Filt -threads $CPU \
-fastaout $OUT_DIR/usearch/amplicon_filtered.fa 2>&1 | sed -e 's/\r/\n/g' > $OUT_DIR/usearch/log/fastq_filter.log
###############################################################################################
# Find unique read sequences, store abundances, and sort by abundance
usearch -fastx_uniques $OUT_DIR/usearch/amplicon_filtered.fa \
-sizeout -relabel Uniq -threads $CPU \
-fastaout $OUT_DIR/usearch/amplicon_uniques.fa 2>&1 | sed -e 's/\r/\n/g' > $OUT_DIR/usearch/log/fastx_uniques.log
###############################################################################################
# Run UNOISE algorithm to get denoised sequences (ZOTUs)
# The -minsize option specifies the minimum abundance (size= annotation). Default is 8.
# Input sequences with lower abundances are discarded.
usearch -unoise3 $OUT_DIR/usearch/amplicon_uniques.fa \
-zotus $OUT_DIR/usearch.fa 2>&1 | sed -e 's/\r/\n/g' > $OUT_DIR/usearch/log/unoise3.log
# make ZOTU table
usearch -otutab $OUT_DIR/usearch/amplicon.fq \
-threads $CPU \
-otus $OUT_DIR/usearch.fa -otutabout $OUT_DIR/usearch/zotus.txt 2>&1 | sed -e 's/\r/\n/g' > $OUT_DIR/usearch/log/otutab.log
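# Reformat the ZOTU table for FROGS tsv_to_biom.py: rename the "#OTU ID" header
# and insert an observation_sum column (sum of the per-sample counts)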
awk -F "\t" 'BEGIN{OFS="\t"}{if($1=="#OTU ID"){$1="observation_name\tobservation_sum"}else{s=0; for (i=2;i<= NF; i++){s+=$i} ; $1=$1"\t"s} print $0}' $OUT_DIR/usearch/zotus.txt > $OUT_DIR/usearch/zotus_reformat.txt
tsv_to_biom.py -t $OUT_DIR/usearch/zotus_reformat.txt -b $OUT_DIR/usearch/zotus.biom
###############################################################################################
# perform affiliation using Sintax
# If the -sintax_cutoff option is given then predictions are written a second time after applying the confidence
# threshold, keeping only ranks with high enough confidence. On V4 reads, using a cutoff of 0.8 gives predictions
# with similar accuracy to RDP at 80% bootstrap cutoff.
usearch -sintax $OUT_DIR/usearch.fa \
-db $REF -strand both -threads $CPU \
-tabbedout $OUT_DIR/usearch/zotus_tax.txt -sintax_cutoff 0.8 2>&1 | sed -e 's/\r/\n/g' > $OUT_DIR/usearch/log/sintax.log
cat $OUT_DIR/usearch/zotus_tax.txt | awk '{c++; if(c==1){print "#observation_name\tbootstrapped_taxonomy\tstrand\ttaxonomy"}; gsub(",",";",$0) ; print $0}' > $OUT_DIR/usearch/zotus_tax_reformat.txt
# ==> should the bootstrap cutoff be removed?
# create final biom
biom add-metadata -i $OUT_DIR/usearch/zotus.biom \
--observation-metadata-fp $OUT_DIR/usearch/zotus_tax_reformat.txt \
-o $OUT_DIR/usearch.biom --output-as-json \
--sc-separated bootstrapped_taxonomy,taxonomy
run_usearch.sh
#!/bin/sh
CPU=1
DIR=/work/frogsfungi/FungiPubli/JeuTestDuMoment
DATA_DIR=$DIR/data_input
REF=$DIR/Unite_ref/Unite_7.1_clean_certae_refs.fasta
USEARCH_OUT_DIR=$DIR/results/USEARCH
mkdir -p $USEARCH_OUT_DIR/LOG
##############
# reference build
sh $DIR/scripts/USEARCH/makeudb_usearch.sh $REF $DIR/Unite_ref
DB_USEARCH=$DIR/Unite_ref/Unite_7.1_clean_certae_refs_usearch.udb
##############
# run usearch
for sp in 35sp 115sp 515sp
do
for its in ITS1 ITS2
do
for law in Powerlaw Uniform
do
sh usearch.sh $DATA_DIR/$sp/jeu1/$its/$law/ $DB_USEARCH $USEARCH_OUT_DIR/$sp/$its/$law/ $CPU
done
done
done
QIIME-pe (qiime2-2019.4)
qiime2_train_classifier.sh
#!/bin/sh
module load system/Miniconda3-4.4.10
module load bioinfo/qiime2-2019.4
# sbatch -J Unite_qiime2_train_classifier --mem=20G -o %x.out -e %x.err train_classifier.sh
REF=$1
OUT_DIR=$2
mkdir -p $OUT_DIR
REF_WOEXT=`basename ${REF%.*}`
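# Build the taxonomy TSV (sequence ID <tab> taxonomy) from the FASTA headers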
grep '>' $REF | awk '{sub(">","",$1); print $1"\t"$2}' > $OUT_DIR/$REF_WOEXT.tax
qiime tools import \
--input-path $REF \
--output-path $OUT_DIR/$REF_WOEXT.seq.qza \
--type 'FeatureData[Sequence]'
qiime tools import \
--type 'FeatureData[Taxonomy]' \
--input-format HeaderlessTSVTaxonomyFormat \
--input-path $OUT_DIR/$REF_WOEXT.tax \
--output-path $OUT_DIR/$REF_WOEXT.tax.qza
qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads $OUT_DIR/$REF_WOEXT.seq.qza \
--i-reference-taxonomy $OUT_DIR/$REF_WOEXT.tax.qza \
--o-classifier $OUT_DIR/$REF_WOEXT.classifier.qza
run_qiime_pe.sh
#!/bin/sh
CPU=10
export PATH=/save/frogs/src/FROGS_dev/assessment/bin:/save/frogs/src/FROGS_dev/app:$PATH
export PYTHONPATH=/save/frogs/src/FROGS_dev/lib:$PYTHONPATH
DIR=/work/frogsfungi/FungiPubli/JeuTestDuMoment
DATA_DIR=$DIR/data_input
REF=$DIR/Unite_ref/Unite_7.1_clean_certae_refs.fasta
# QIIME2-PE
QIIME_OUT_DIR=$DIR/results/QIIME2
mkdir -p $QIIME_OUT_DIR/LOG
# note for analyses on real data
# Classification of fungal ITS sequences
# In our experience, fungal ITS classifiers trained on the UNITE reference database do NOT benefit from extracting/trimming reads to primer sites.
# We recommend training UNITE classifiers on the full reference sequences.
# Furthermore, we recommend the “developer” sequences (located within the QIIME-compatible release download)
# because the standard versions of the sequences have already been trimmed to the ITS region (excluding portions of flanking rRNA genes
# that may be present in amplicons generated with standard ITS primers).
##############
# reference build
sbatch -J Unite_7.1_qiime2_train_classifier --mem=20G -o $DIR/results/QIIME2/LOG/%x.out -e $DIR/results/QIIME2/LOG/%x.err --wrap="sh $DIR/scripts/QIIME2/qiime2_train_classifier.sh $REF $DIR/Unite_ref/"
DB_PREFIX=$DIR/Unite_ref/Unite_7.1_clean_certae_refs
##############
# run qiime2 open reference clustering
for sp in 35sp 115sp 515sp
do
for its in ITS1 ITS2
do
for law in Powerlaw Uniform
do
# create manifest
mkdir -p $QIIME_OUT_DIR/$sp/$its/$law/pe
echo -e "sample-id\tforward-absolute-filepath\treverse-absolute-filepath" > $QIIME_OUT_DIR/$sp/$its/$law/pe/manifest_paired.tsv
for R1 in `ls $DATA_DIR/$sp/jeu1/$its/$law/*_R1.fastq.gz`
do
sample_name=`basename $R1 | sed 's/_R1.fastq.gz//'`
R2=`echo $R1 | sed 's/_R1.fastq.gz/_R2.fastq.gz/'`
echo -e $sample_name"\t"$R1"\t"$R2 >> $QIIME_OUT_DIR/$sp/$its/$law/pe/manifest_paired.tsv
done
# run qiime2-pe
sbatch -J qiime2-pe_${sp}_${its}_${law} --cpus-per-task=$CPU -o $QIIME_OUT_DIR/LOG/%x.out -e $QIIME_OUT_DIR/LOG/%x.err --wrap="sh $DIR/scripts/QIIME2/qiime2-pe.sh $QIIME_OUT_DIR/$sp/$its/$law/pe/manifest_paired.tsv $DB_PREFIX"
done
done
done
qiime2-pe.sh
#!/bin/sh
# FROGS/app directory needs to be in the PATH
module load system/Miniconda3-4.4.10
module load bioinfo/qiime2-2019.4
MANIFEST=$1
OUT_DIR=`dirname $MANIFEST`
WORK_DIR=$OUT_DIR/qiime_workdir
mkdir -p $WORK_DIR
DB_PREFIX=$2
DB_REF_SEQ=$DB_PREFIX.seq.qza
DB_REF_CLASSIFIER=$DB_PREFIX.classifier.qza
### import
qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path $MANIFEST \
--output-path $WORK_DIR/pe-demux.qza \
--input-format PairedEndFastqManifestPhred33V2
qiime demux summarize \
--i-data $WORK_DIR/pe-demux.qza \
--o-visualization $WORK_DIR/pe-demux.qzv
# no trimming: there are no adapters and quality is very good
########################################################################
### clustering pipeline: open-reference clustering, paired end
# The features produced by clustering methods are known as operational taxonomic units (OTUs), which is Esperanto for suboptimal, imprecise rubbish
# join pair
# default parameters:
# --p-minlen 1
# --p-maxns (optional)
# --p-allowmergestagger False
# --p-minovlen 10
# --p-maxdiffs 10
# --p-minmergelen (optional)
# --p-maxmergelen (optional)
# --p-maxee (optional)
# --p-qmin 0
# --p-qminout 0
# --p-qmax 41
# --p-qmaxout 41
qiime vsearch join-pairs \
--i-demultiplexed-seqs $WORK_DIR/pe-demux.qza \
--p-allowmergestagger --p-minovlen 10 --p-minmergelen 20 --p-maxmergelen 490 \
--o-joined-sequences $WORK_DIR/pe-join.qza
qiime demux summarize \
--i-data $WORK_DIR/pe-join.qza \
--o-visualization $WORK_DIR/pe-join.qzv
# dereplication
qiime vsearch dereplicate-sequences \
--i-sequences $WORK_DIR/pe-join.qza \
--o-dereplicated-table $WORK_DIR/pe-join-derepTable.qza \
--o-dereplicated-sequences $WORK_DIR/pe-join-derepSeqs.qza
qiime feature-table summarize \
--i-table $WORK_DIR/pe-join-derepTable.qza \
--o-visualization $WORK_DIR/pe-join-derepTable.qzv
# clustering
# Given a feature table and the associated feature sequences, cluster the
# features against a reference database based on user-specified percent
# identity threshold of their sequences. Any sequences that don't match are
# then clustered de novo. This is not a general-purpose clustering method,
# but rather is intended to be used for clustering the results of quality-
# filtering/dereplication methods, such as DADA2, or for re-clustering a
# FeatureTable at a lower percent identity than it was originally clustered
# at. When a group of features in the input table are clustered into a
# single feature, the frequency of that single feature in a given sample is
# the sum of the frequencies of the features that were clustered in that
# sample. Feature identifiers will be inherited from the centroid feature of
# each cluster. For features that match a reference sequence, the centroid
# feature is that reference sequence, so its identifier will become the
# feature identifier. The clustered_sequences result will contain feature
# representative sequences that are derived from the sequences input for all
# features in clustered_table. This will always be the most abundant
# sequence in the cluster. The new_reference_sequences result will contain
# the entire reference database, plus feature representative sequences for
# any de novo features. This is intended to be used as a reference database
# in subsequent iterations of cluster_features_open_reference, if
# applicable. See the vsearch documentation for details on how sequence
# clustering is performed.
# default param : --p-strand plus
qiime vsearch cluster-features-open-reference \
--i-table $WORK_DIR/pe-join-derepTable.qza \
--i-sequences $WORK_DIR/pe-join-derepSeqs.qza \
--i-reference-sequences $DB_REF_SEQ \
--p-perc-identity 0.99 \
--o-clustered-table $WORK_DIR/pe-clusterTable-or-99.qza \
--o-clustered-sequences $WORK_DIR/pe-clusterSeqs-or-99.qza \
--o-new-reference-sequences $WORK_DIR/new-ref-seqs-or-99.qza
qiime feature-table summarize \
--i-table $WORK_DIR/pe-clusterTable-or-99.qza \
--o-visualization $WORK_DIR/pe-clusterTable-or-99.qzv
# chimera removal
# Apply the vsearch uchime_denovo method to identify chimeric feature
# sequences. The results of this method can be used to filter chimeric
# features from the corresponding feature table. For additional details,
# please refer to the vsearch documentation.
# default parameters (vsearch defaults; note that abskew, the min abundance ratio of parent vs chimera (2.0), is not exposed):
# --p-dn 1.4
# --p-mindiffs 3
# --p-mindiv 0.8
# --p-minh 0.28
# --p-xn 8.0
qiime vsearch uchime-denovo \
--i-table $WORK_DIR/pe-clusterTable-or-99.qza \
--i-sequences $WORK_DIR/pe-clusterSeqs-or-99.qza \
--output-dir $WORK_DIR/pe-uchime-dn-out
qiime feature-table filter-features \
--i-table $WORK_DIR/pe-clusterTable-or-99.qza \
--m-metadata-file $WORK_DIR/pe-uchime-dn-out/nonchimeras.qza \
--o-filtered-table $WORK_DIR/pe-nonchimeric-wo-borderline-table.qza
qiime feature-table summarize \
--i-table $WORK_DIR/pe-nonchimeric-wo-borderline-table.qza \
--o-visualization $WORK_DIR/pe-nonchimeric-wo-borderline-table.qzv
qiime feature-table filter-seqs \
--i-data $WORK_DIR/pe-clusterSeqs-or-99.qza \
--m-metadata-file $WORK_DIR/pe-uchime-dn-out/nonchimeras.qza \
--o-filtered-data $WORK_DIR/pe-nonchimeric-wo-borderline-seqs.qza
# filter single singletons
# q2-vsearch implements three different OTU clustering strategies: de novo, closed reference, and open reference.
# All should be preceded by basic quality-score-based filtering and followed by chimera filtering and aggressive
# OTU filtering (the treacherous trio, a.k.a. the Bokulich method). ==> 0.00005%
# ==> problem: this relative-frequency filter cannot be applied automatically
# ==> so we filter single singletons instead
qiime feature-table filter-features \
--i-table $WORK_DIR/pe-nonchimeric-wo-borderline-table.qza \
--p-min-frequency 2 \
--o-filtered-table $WORK_DIR/pe-nonchimeric-wo-borderline-noSingle-table.qza
qiime feature-table summarize \
--i-table $WORK_DIR/pe-nonchimeric-wo-borderline-noSingle-table.qza \
--o-visualization $WORK_DIR/pe-nonchimeric-wo-borderline-noSingle-table.qzv
qiime feature-table filter-seqs \
--i-data $WORK_DIR/pe-nonchimeric-wo-borderline-seqs.qza \
--i-table $WORK_DIR/pe-nonchimeric-wo-borderline-noSingle-table.qza \
--o-filtered-data $WORK_DIR/pe-nonchimeric-wo-borderline-noSingle-seqs.qza
# affiliation
qiime feature-classifier classify-sklearn \
--i-classifier $DB_REF_CLASSIFIER \
--i-reads $WORK_DIR/pe-nonchimeric-wo-borderline-noSingle-seqs.qza \
--o-classification $WORK_DIR/pe-nonchimeric-wo-borderline-noSingle-tax.qza
qiime metadata tabulate \
--m-input-file $WORK_DIR/pe-nonchimeric-wo-borderline-noSingle-tax.qza \
--o-visualization $WORK_DIR/pe-nonchimeric-wo-borderline-noSingle-tax.qzv
# export biom
qiime tools export \
--input-path $WORK_DIR/pe-nonchimeric-wo-borderline-noSingle-table.qza \
--output-path $WORK_DIR/export
qiime tools export \
--input-path $WORK_DIR/pe-nonchimeric-wo-borderline-noSingle-tax.qza \
--output-path $WORK_DIR/export
# create final biom
sed -i 's/Feature/#Feature/' $WORK_DIR/export/taxonomy.tsv
biom add-metadata -i $WORK_DIR/export/feature-table.biom \
--observation-metadata-fp $WORK_DIR/export/taxonomy.tsv \
-o $OUT_DIR/pe-qiime.biom \
--output-as-json
# export fasta
qiime tools export \
--input-path $WORK_DIR/pe-nonchimeric-wo-borderline-noSingle-seqs.qza \
--output-path $WORK_DIR/export
mv $WORK_DIR/export/dna-sequences.fasta $OUT_DIR/pe-qiime.fasta
# create abundance table in TSV format
biom_to_tsv.py -b $OUT_DIR/pe-qiime.biom -f $OUT_DIR/pe-qiime.fasta -t $OUT_DIR/pe-qiime.tsv
QIIME-se (qiime2-2019.4)
qiime2_train_classifier.sh
#!/bin/sh
module load system/Miniconda3-4.4.10
module load bioinfo/qiime2-2019.4
REF=$1
OUT_DIR=$2
mkdir -p $OUT_DIR
REF_WOEXT=`basename ${REF%.*}`
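# Build the taxonomy TSV (sequence ID <tab> taxonomy) from the FASTA headers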
grep '>' $REF | awk '{sub(">","",$1); print $1"\t"$2}' > $OUT_DIR/$REF_WOEXT.tax
qiime tools import \
--input-path $REF \
--output-path $OUT_DIR/$REF_WOEXT.seq.qza \
--type 'FeatureData[Sequence]'
qiime tools import \
--type 'FeatureData[Taxonomy]' \
--input-format HeaderlessTSVTaxonomyFormat \
--input-path $OUT_DIR/$REF_WOEXT.tax \
--output-path $OUT_DIR/$REF_WOEXT.tax.qza
qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads $OUT_DIR/$REF_WOEXT.seq.qza \
--i-reference-taxonomy $OUT_DIR/$REF_WOEXT.tax.qza \
--o-classifier $OUT_DIR/$REF_WOEXT.classifier.qza
run_qiime-se.sh
#!/bin/sh
CPU=1
DIR="."
DATA_DIR=$DIR/data_input
REF=$DIR/Unite_ref/Unite_7.1_clean_certae_refs.fasta
# QIIME2-SE
QIIME_OUT_DIR=$DIR/results/QIIME2
mkdir -p $QIIME_OUT_DIR/LOG
# Classification of fungal ITS sequences
# In our experience, fungal ITS classifiers trained on the UNITE reference database do NOT benefit from extracting/trimming reads to primer sites.
# We recommend training UNITE classifiers on the full reference sequences.
# Furthermore, we recommend the “developer” sequences (located within the QIIME-compatible release download)
# because the standard versions of the sequences have already been trimmed to the ITS region (excluding portions of flanking rRNA genes
# that may be present in amplicons generated with standard ITS primers).
##############
# reference build
sh $DIR/scripts/QIIME2/qiime2_train_classifier.sh $REF $DIR/Unite_ref/
DB_PREFIX=$DIR/Unite_ref/Unite_7.1_clean_certae_refs
##############
# run qiime2 open reference clustering
for sp in 35sp 115sp 515sp
do
for its in ITS1 ITS2
do
for law in Powerlaw Uniform
do
# create manifest
mkdir -p $QIIME_OUT_DIR/$sp/$its/$law/se
echo -e "sample-id\tabsolute-filepath" > $QIIME_OUT_DIR/$sp/$its/$law/se/manifest_singleFwd.tsv
for R1 in `ls $DATA_DIR/$sp/jeu1/$its/$law/*_R1.fastq.gz`
do
sample_name=`basename $R1 | sed 's/_R1.fastq.gz//'`
echo -e $sample_name"\t"$R1 >> $QIIME_OUT_DIR/$sp/$its/$law/se/manifest_singleFwd.tsv
done
# launch qiime2-se
sh $DIR/scripts/QIIME2/qiime2-se.sh $QIIME_OUT_DIR/$sp/$its/$law/se/manifest_singleFwd.tsv $DB_PREFIX $CPU
done
done
done
qiime2-se.sh
#!/bin/sh
module load system/Miniconda3-4.4.10
module load bioinfo/qiime2-2019.4
MANIFEST=$1
OUT_DIR=`dirname $MANIFEST`
WORK_DIR=$OUT_DIR/qiime_workdir
mkdir -p $WORK_DIR
DB_PREFIX=$2
DB_REF_SEQ=$DB_PREFIX.seq.qza
DB_REF_CLASSIFIER=$DB_PREFIX.classifier.qza
CPU=$3
### import
qiime tools import \
--type 'SampleData[SequencesWithQuality]' \
--input-path $MANIFEST \
--output-path $WORK_DIR/se-demux.qza \
--input-format SingleEndFastqManifestPhred33V2
qiime demux summarize \
--i-data $WORK_DIR/se-demux.qza \
--o-visualization $WORK_DIR/se-demux.qzv
########################################################################
### clustering pipeline: open-reference clustering, single end
# The features produced by clustering methods are known as operational taxonomic units (OTUs), which is Esperanto for suboptimal, imprecise rubbish
qiime vsearch dereplicate-sequences \
--i-sequences $WORK_DIR/se-demux.qza \
--o-dereplicated-table $WORK_DIR/se-derepTable.qza \
--o-dereplicated-sequences $WORK_DIR/se-derepSeqs.qza
qiime feature-table summarize \
--i-table $WORK_DIR/se-derepTable.qza \
--o-visualization $WORK_DIR/se-derepTable.qzv
# clustering
# Given a feature table and the associated feature sequences, cluster the
# features against a reference database based on user-specified percent
# identity threshold of their sequences. Any sequences that don't match are
# then clustered de novo. This is not a general-purpose clustering method,
# but rather is intended to be used for clustering the results of quality-
# filtering/dereplication methods, such as DADA2, or for re-clustering a
# FeatureTable at a lower percent identity than it was originally clustered
# at. When a group of features in the input table are clustered into a
# single feature, the frequency of that single feature in a given sample is
# the sum of the frequencies of the features that were clustered in that
# sample. Feature identifiers will be inherited from the centroid feature of
# each cluster. For features that match a reference sequence, the centroid
# feature is that reference sequence, so its identifier will become the
# feature identifier. The clustered_sequences result will contain feature
# representative sequences that are derived from the sequences input for all
# features in clustered_table. This will always be the most abundant
# sequence in the cluster. The new_reference_sequences result will contain
# the entire reference database, plus feature representative sequences for
# any de novo features. This is intended to be used as a reference database
# in subsequent iterations of cluster_features_open_reference, if
# applicable. See the vsearch documentation for details on how sequence
# clustering is performed.
# default param : --p-strand plus
qiime vsearch cluster-features-open-reference \
--i-table $WORK_DIR/se-derepTable.qza \
--i-sequences $WORK_DIR/se-derepSeqs.qza \
--i-reference-sequences $DB_REF_SEQ \
--p-perc-identity 0.99 --p-threads $CPU \
--o-clustered-table $WORK_DIR/se-clusterTable-or-99.qza \
--o-clustered-sequences $WORK_DIR/se-clusterSeqs-or-99.qza \
--o-new-reference-sequences $WORK_DIR/new-ref-seqs-or-99.qza
qiime feature-table summarize \
--i-table $WORK_DIR/se-clusterTable-or-99.qza \
--o-visualization $WORK_DIR/se-clusterTable-or-99.qzv
# remove chimera
# Apply the vsearch uchime_denovo method to identify chimeric feature
# sequences. The results of this method can be used to filter chimeric
# features from the corresponding feature table. For additional details,
# please refer to the vsearch documentation.
# default parameters (vsearch defaults; note that abskew, the min abundance ratio of parent vs chimera (2.0), is not exposed):
# --p-dn 1.4
# --p-mindiffs 3
# --p-mindiv 0.8
# --p-minh 0.28
# --p-xn 8.0
qiime vsearch uchime-denovo \
--i-table $WORK_DIR/se-clusterTable-or-99.qza \
--i-sequences $WORK_DIR/se-clusterSeqs-or-99.qza \
--output-dir $WORK_DIR/se-uchime-dn-out
qiime feature-table filter-features \
--i-table $WORK_DIR/se-clusterTable-or-99.qza \
--m-metadata-file $WORK_DIR/se-uchime-dn-out/nonchimeras.qza \
--o-filtered-table $WORK_DIR/se-nonchimeric-wo-borderline-table.qza
qiime feature-table summarize \
--i-table $WORK_DIR/se-nonchimeric-wo-borderline-table.qza \
--o-visualization $WORK_DIR/se-nonchimeric-wo-borderline-table.qzv
qiime feature-table filter-seqs \
--i-data $WORK_DIR/se-clusterSeqs-or-99.qza \
--m-metadata-file $WORK_DIR/se-uchime-dn-out/nonchimeras.qza \
--o-filtered-data $WORK_DIR/se-nonchimeric-wo-borderline-seqs.qza
# filter single singletons
# q2-vsearch implements three different OTU clustering strategies: de novo, closed reference, and open reference.
# All should be preceded by basic quality-score-based filtering and followed by chimera filtering and aggressive
# OTU filtering (the treacherous trio, a.k.a. the Bokulich method). ==> 0.00005%
# ==> problem: this relative-frequency filter cannot be applied automatically
# ==> so we filter single singletons instead
qiime feature-table filter-features \
--i-table $WORK_DIR/se-nonchimeric-wo-borderline-table.qza \
--p-min-frequency 2 \
--o-filtered-table $WORK_DIR/se-nonchimeric-wo-borderline-noSingle-table.qza
qiime feature-table summarize \
--i-table $WORK_DIR/se-nonchimeric-wo-borderline-noSingle-table.qza \
--o-visualization $WORK_DIR/se-nonchimeric-wo-borderline-noSingle-table.qzv
qiime feature-table filter-seqs \
--i-data $WORK_DIR/se-nonchimeric-wo-borderline-seqs.qza \
--i-table $WORK_DIR/se-nonchimeric-wo-borderline-noSingle-table.qza \
--o-filtered-data $WORK_DIR/se-nonchimeric-wo-borderline-noSingle-seqs.qza
# affiliation
qiime feature-classifier classify-sklearn \
--i-classifier $DB_REF_CLASSIFIER \
--i-reads $WORK_DIR/se-nonchimeric-wo-borderline-noSingle-seqs.qza \
--o-classification $WORK_DIR/se-nonchimeric-wo-borderline-noSingle-tax.qza
qiime metadata tabulate \
--m-input-file $WORK_DIR/se-nonchimeric-wo-borderline-noSingle-tax.qza \
--o-visualization $WORK_DIR/se-nonchimeric-wo-borderline-noSingle-tax.qzv
# export biom
qiime tools export \
--input-path $WORK_DIR/se-nonchimeric-wo-borderline-noSingle-table.qza \
--output-path $WORK_DIR/export
qiime tools export \
--input-path $WORK_DIR/se-nonchimeric-wo-borderline-noSingle-tax.qza \
--output-path $WORK_DIR/export
# create final biom
sed -i 's/Feature/#Feature/' $WORK_DIR/export/taxonomy.tsv
biom add-metadata -i $WORK_DIR/export/feature-table.biom \
--observation-metadata-fp $WORK_DIR/export/taxonomy.tsv \
-o $OUT_DIR/se-qiime.biom \
--output-as-json
# export fasta
qiime tools export \
--input-path $WORK_DIR/se-nonchimeric-wo-borderline-noSingle-seqs.qza \
--output-path $WORK_DIR/export
mv $WORK_DIR/export/dna-sequences.fasta $OUT_DIR/se-qiime.fasta
# create abundance table in TSV format
biom_to_tsv.py -b $OUT_DIR/se-qiime.biom -f $OUT_DIR/se-qiime.fasta -t $OUT_DIR/se-qiime.tsv
References
- Escudié, Frédéric, Lucas Auer, Maria Bernard, Mahendra Mariadassou, Laurent Cauquil, Katia Vidal, Sarah Maman, Guillermina Hernandez-Raquet, Sylvie Combes, and Géraldine Pascal. 2018. "FROGS: Find, Rapidly, OTUs with Galaxy Solution." Bioinformatics 34 (8): 1287–94. https://doi.org/10.1093/bioinformatics/btx791
- Abarenkov, Kessy, R Henrik Nilsson, Karl-Henrik Larsson, Ian J Alexander, Ursula Eberhardt, Susanne Erland, Klaus Høiland, et al. 2010. "The UNITE Database for Molecular Identification of Fungi – Recent Updates and Future Perspectives." The New Phytologist 186 (2): 281–85. https://doi.org/10.1111/j.1469-8137.2009.03160.x
- Bolyen, Evan, Jai Ram Rideout, Matthew R Dillon, Nicholas A Bokulich, Christian C Abnet, Gabriel A Al-Ghalith, Harriet Alexander, et al. 2019. "Reproducible, Interactive, Scalable and Extensible Microbiome Data Science Using QIIME 2." Nature Biotechnology 37 (8): 852–57. https://doi.org/10.1038/s41587-019-0209-9
- Callahan, Benjamin J, Paul J McMurdie, Michael J Rosen, Andrew W Han, Amy Jo A Johnson, and Susan P Holmes. 2016. "DADA2: High-Resolution Sample Inference from Illumina Amplicon Data." Nature Methods 13 (7): 581–83. https://doi.org/10.1038/nmeth.3869
- Edgar, Robert C. 2016. "UNOISE2: Improved Error-Correction for Illumina 16S and ITS Amplicon Sequencing." bioRxiv. https://doi.org/10.1101/081257