1 Introduction

This is an R markdown document intended to compare the performance of FROGS, MOTHUR, UPARSE and QIIME in terms of accuracy on real microbial communities. Throughout, MOTHUR/UPARSE/QIIME (SOP) refers to the affiliation strategy suggested in the SOP of the respective pipeline. SOP stands for Standard Operating Procedure (program guidelines).

1.1 Metrics used for comparison

The results of FROGS, MOTHUR, UPARSE and QIIME are compared using a single metric:

Divergence: the Bray-Curtis distance (expressed in percent) between the true taxonomic composition of the community and the one inferred by the OTU-picking tool. The divergence is measured at all taxonomic levels, from Phylum down to either Genus (utax) or Species (Silva).
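The divergence metric above can be sketched as follows. This is an illustrative Python snippet, not the document's R code; the taxon names and abundances are invented for the example.

```python
def bray_curtis(expected, observed):
    """Bray-Curtis dissimilarity between two abundance profiles,
    given as dicts mapping taxon -> relative abundance (%)."""
    taxa = set(expected) | set(observed)
    # Sum of absolute differences over sum of totals, taxon by taxon.
    num = sum(abs(expected.get(t, 0.0) - observed.get(t, 0.0)) for t in taxa)
    den = sum(expected.get(t, 0.0) + observed.get(t, 0.0) for t in taxa)
    return num / den

# Hypothetical compositions at the Phylum level (values in %).
expected = {"Firmicutes": 50.0, "Proteobacteria": 30.0, "Bacteroidetes": 20.0}
observed = {"Firmicutes": 45.0, "Proteobacteria": 35.0, "Actinobacteria": 20.0}

divergence = 100 * bray_curtis(expected, observed)  # divergence in percent
```

A taxon present in only one of the two profiles (here Bacteroidetes and Actinobacteria) contributes its full abundance to the numerator, which is why missed or spurious taxa inflate the divergence.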

1.2 Experimental design for Real communities

The experimental design used for the real communities was slightly different from the one used for the simulated communities:

| nb_OTU | amplicon | abundance_law | count | reference |
|--------|----------|---------------|-------|-----------|
| 20sp   | V3V4     | uniform       | 4     | BEI:HM-278D - Kozich et al. 2013 |
| 20sp   | V3V4     | staggered     | 4     | BEI:HM-278D - Kozich et al. 2013 |
| 20sp   | V4       | uniform       | 1     | BEI:HM-278D - Kozich et al. 2013 |
| 20sp   | V4V5     | uniform       | 1     | BEI:HM-278D - Kozich et al. 2013 |
| 4sp    | V3V4     | uneven        | 10    | Personal mock - SRA: Project ID SRP113288 |
| 67sp   | V4       | even          | 9     | Mock 6, 7, 8 - Caporaso et al. 2011 |

Three samples corresponding to communities of size 20 with a uniform abundance distribution were used to compare the V3V4, V4 and V4V5 amplicons (1 per amplicon). Eight samples corresponding to communities of size 20 and amplicon V3V4 were used to compare the uniform and staggered abundance distributions (4 per distribution). Ten samples (community size 4, amplicon V3V4, uneven distribution) were used to compare the accuracy of the different OTU-picking methods. Finally, the last real dataset consists of three experiments where a synthetic community (mock 6, 7 and 8) made of 67 species was sequenced three times each. Mock 6, 7 and 8 are biological replicates of the same community and the three samples for each mock are technical replicates. Since we only have three replicates for mock 6, 7 and 8, we cannot use non-parametric tests and must restrict ourselves to paired t-tests.

2 Material and Methods of Statistical analysis

For the divergence metric, we performed two-sided paired tests, either parametric (paired t-test) or non-parametric (Wilcoxon signed rank test), to assess the difference in accuracy between FROGS and each of its competitors.
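In R this corresponds to `t.test(..., paired = TRUE)` and `wilcox.test(..., paired = TRUE)`. As a minimal illustration of the parametric side, the sketch below computes the paired t statistic from scratch in Python (the p-value lookup against the t distribution is omitted); the divergence values are invented.

```python
import math

def paired_t(x, y):
    """Paired t statistic and degrees of freedom for samples x and y."""
    d = [a - b for a, b in zip(x, y)]      # per-replicate differences
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n), n - 1

# Hypothetical divergences (%) of two methods over 4 paired replicates.
frogs  = [18.2, 19.5, 17.8, 18.9]
mothur = [21.0, 22.3, 20.1, 21.7]

t_stat, df = paired_t(frogs, mothur)
```

Note that with very few replicates the exact signed rank test is bounded away from significance (with n = 4 pairs its smallest two-sided p-value is 0.125), which is why the parametric t-test is the only option on the smallest designs.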

The tests were performed at the theoretical community level (dataset) using biological replicates (set_number) as replicates. We chose to compare the methods at this level because it is the finest one for which we have replication. Pooling different theoretical communities and/or abundance distributions to compare the methods at higher levels (e.g. community size \(\times\) amplicon) would blur the signal, as a method may outclass the others for uniform abundances but perform worse on other abundance distributions.

For each theoretical community, we declared FROGS better (resp. worse) than its competitor when the test was significant at the 0.05 level and FROGS had a lower (resp. higher) metric than its competitor. When the test was not significant, the methods were declared tied. Finally, we aggregated the results to count for each condition (community size \(\times\) abundance distribution \(\times\) amplicon) the number of theoretical communities favoring one or none of the methods.
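The decision rule above can be sketched as follows. This is a hypothetical Python illustration, not the document's R code; the p-values and divergences are invented.

```python
def verdict(p_value, frogs_div, other_div, alpha=0.05):
    """'better'/'worse' for FROGS when the paired test is significant
    at level alpha, 'tied' otherwise."""
    if p_value >= alpha:
        return "tied"
    return "better" if frogs_div < other_div else "worse"

# One (p-value, FROGS divergence, competitor divergence) triple per
# theoretical community of a given condition (made-up values).
tests = [(0.01, 18.2, 21.0), (0.20, 19.0, 19.4), (0.03, 22.5, 20.1)]

# Aggregate per condition: count communities favoring each outcome.
counts = {"better": 0, "worse": 0, "tied": 0}
for p, frogs_div, other_div in tests:
    counts[verdict(p, frogs_div, other_div)] += 1
```

Lower divergence means better accuracy, so a significant test with a lower FROGS divergence counts as a win for FROGS.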

Due to the different design used for real communities, we are going to perform focused comparisons of the samples.

3 Real Communities: BEI datasets

3.1 Amplicon Effect on BEI datasets

We study the amplicon effect on the uniform community with 20 species as it is the only one for which different amplicons were used. We represent the trends observed at different taxonomic levels. All methods seem to give comparable results but there are no replicates per amplicon so we cannot assess the significance of the observed differences.

Note that all methods have high divergences compared to the simulated datasets. This may reflect experimental limitations (sequencing and amplification biases, copy number variations, etc.) rather than the intrinsic complexity of the real community and/or differences between the methods.

3.2 Abundance Distribution Effect on BEI datasets

We study the abundance distribution effect on the community with 20 species sequenced with the V3V4 amplicon region. FROGS is better than MOTHUR (SOP), UPARSE (SOP) and QIIME (SOP) on the staggered community and worse on the uniform one.

The baseline divergence is, however, not very satisfactory for either distribution.

As we have 4 replicates for this community, we can compare all methods using a paired t-test (there are not enough replicates for the non-parametric signed rank test to reach significance). The statistical analysis confirms that FROGS outperforms MOTHUR (SOP), UPARSE (SOP) and QIIME (SOP) on communities with staggered abundances but does worse on communities with uniform abundances.

3.3 Classification of OTUs

In addition to divergence comparisons, we blasted our reconstructed OTUs against the corresponding regions of the 16S rRNA sequences of the 20 species used in the mock. The OTUs were classified into 3 classes according to their blast results:

  • true OTU if the cluster seed had 100% identity and 100% coverage with one of the 20 species' 16S rRNA sequences
  • accepted OTU if the seed had > 97% identity over > 95% coverage with one of the 20 species' 16S rRNA sequences
  • spurious OTU if the seed had < 97% identity with one of the 20 species' 16S rRNA sequences
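These rules can be sketched as a small classifier. This is an illustrative Python function (the original analysis is in R); note that the stated rules leave a gap (e.g. high identity but low coverage), which this sketch assigns to 'spurious' by default.

```python
def classify_otu(identity, coverage):
    """Classify an OTU from the blast percent identity and coverage of
    its cluster seed against the best-matching reference 16S rRNA.
    Returns 'true', 'accepted' or 'spurious'."""
    if identity == 100.0 and coverage == 100.0:
        return "true"                      # exact match to a mock species
    if identity > 97.0 and coverage > 95.0:
        return "accepted"                  # near-exact match
    return "spurious"                      # everything else (assumption)

# Example blast results (invented values).
results = [classify_otu(100.0, 100.0),   # perfect hit
           classify_otu(98.5, 99.0),     # close hit
           classify_otu(95.0, 99.0)]     # below the 97% identity cutoff
```

The 97% identity cutoff mirrors the conventional species-level OTU threshold used by the clustering methods themselves.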

For each class, we counted both (i) the number of reconstructed OTUs in that class (middle panel: OTU) and (ii) their aggregate contribution to the community (top panel: Abundancy).

We also blasted the 20 16S rRNA sequences that were supposed to be retrieved against the recovered OTUs, to find out how many were truly missed, as opposed to reconstructed as a slightly erroneous OTU. Again, the sequences were classified into 3 classes according to their blast results:

  • retrieved as true (true) if the sequence had 100% identity with one of the cluster seeds
  • retrieved as accepted (accepted) if the sequence had between 97% and 100% sequence identity
  • not retrieved (not) otherwise

The number of species in each category is shown in the bottom panel (Retrieved).

3.3.1 Amplicon Effect

As previously shown, the bulk of the reads falls into true or accepted OTUs (see figure below). UPARSE produces the fewest OTUs but also misses more OTUs than the competing methods. Both QIIME and MOTHUR produce enormous amounts of spurious OTUs. All methods fail to recover true OTUs on the V4V5 region.

A focus on the FROGS and UPARSE results reveals the same trend at a lower scale (see below). We note that UPARSE is the method that most closely approximates the expected results for these V4V5 sequences. But, unlike FROGS, UPARSE does not find true OTUs for this region: FROGS finds 4 of the 20 expected true OTUs.

3.3.2 Distribution Effect

As previously shown, the bulk of the reads falls into true or accepted OTUs (see figure below). UPARSE produces the fewest OTUs but also misses more OTUs than the competing methods, especially with staggered abundances. Both QIIME and MOTHUR produce enormous amounts of spurious OTUs.

A focus on FROGS and UPARSE (see below) reveals the same trend at a different scale. Both FROGS and UPARSE sort most of the reads into true OTUs. FROGS always recovers a few more OTUs than UPARSE, misses a few fewer and produces fewer spurious OTUs. The most important difference is in terms of accepted OTUs, where FROGS systematically produces more accepted OTUs than UPARSE.

Note that the very criteria used to classify OTUs (sequence identity and coverage) are likely to favor UPARSE over FROGS in this setting. Indeed, any of the 20 species recovered as a true OTU will not produce accepted OTUs, as those OTUs would be > 97% identical to the true sequence and clustered with it in the first place. This is both an advantage, when the sequences are well separated by the 97% sequence identity threshold, and a limitation, if two populations with less than 3% divergence are present simultaneously: they cannot be recovered by UPARSE but can be by FROGS, thanks to its reliance on SWARM.

4 Comparison on a Toy Community

Finally, we compare the methods on a toy community with 4 species and uneven abundances. We have 10 replicates for this community, enough to use both the signed rank test and the paired t-test. Once again, the baseline divergence of FROGS is not fantastic

but it is in line with the divergences obtained by its competitors.

The statistical analyses confirm the graphical diagnosis of the boxplot: FROGS is generally better than MOTHUR (SOP) and QIIME (SOP) and tied with UPARSE (SOP), except at the Genus level, where it outperforms all three. It also performs worse than UPARSE (SOP) at the Phylum level (for the paired t-test).

4.1 Paired t-test on divergence on Toy community

4.2 Signed rank test on divergence on Toy community

5 Real Communities: Caporaso Datasets

We compare the methods on the Caporaso datasets, which consist of three experiments where a real community (mock 6, 7 and 8) made of 67 species was sequenced three times each. Mock 6, 7 and 8 are biological replicates of the same community and the three samples for each mock are technical replicates. Since we only have three replicates for each mock, we cannot use non-parametric tests and must restrict ourselves to paired t-tests.

5.1 Visualisation of Divergences

All methods achieve comparable (and not so good) divergences at high taxonomic ranks but FROGS and UPARSE are better at the Genus level.

FROGS exhibits a slight excess divergence over UPARSE, is comparable to MOTHUR and QIIME at high taxonomic ranks, and is strictly better than the latter two at low taxonomic levels.

The baseline divergence of all methods is quite high (an inherent difficulty of working with real data): the differences between methods amount to a few percent, whereas the baseline divergences are measured in tens of percent.

FROGS does better than MOTHUR and QIIME for fine descriptions but less well than UPARSE.

5.2 Results of paired t-test on Divergence

The high divergence observed is probably due to the fact that the dataset consists of very short, non-mergeable reads on the V4 region. Thus, only R1 reads could be processed by the 4 pipelines. These very short sequences do not allow the formation of well-defined clusters and thus cause a large deviation of the observed results from the expected ones. These mocks are probably not well suited to the intended use of the 4 pipelines, which explains the unsatisfactory results.