1 Introduction

This is an R Markdown document intended to compare the performance of FROGS, MOTHUR, UPARSE and QIIME in terms of accuracy on simulated microbial communities. We consider two variants of MOTHUR, UPARSE and QIIME, called SOP and MA respectively. SOP corresponds to the Standard Operating Procedure (program guidelines), whereas MA stands for Multi-Affiliation and corresponds to a multiple-affiliation strategy used to propagate uncertainty when several equally good affiliations are found for a given OTU. Multi-affiliation is the default strategy used in FROGS.

Throughout, QIIME (MA) refers to QIIME with Multi-Affiliation strategy, QIIME (SOP) to the affiliation strategy suggested in QIIME SOP and QIIME to both variants: QIIME (MA) and QIIME (SOP).

1.1 Metrics used for comparison

The results of FROGS, MOTHUR, UPARSE and QIIME are compared using three different metrics:

  1. Divergence: Bray-Curtis distance (expressed in percent) between the true taxonomic composition of the community and the one inferred by the OTU-picking tool. The divergence is measured at all taxonomic levels from Phylum to either Genus (utax) or Species (Silva);
  2. FN: Number of false negative taxa (i.e. present in the original bacterial community but not discovered by the OTU-picking method);
  3. FP: Number of false positive taxa (i.e. discovered by the OTU-picking method but not present in the original bacterial community).
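
As a minimal sketch of these three metrics (not the code used to produce this report), the following Python functions compute the divergence, FN and FP at a single taxonomic rank. The function names, the dictionary representation (taxon name mapped to relative abundance) and the toy community are illustrative assumptions.

```python
def bray_curtis_percent(expected, observed):
    """Bray-Curtis dissimilarity (in percent) between two compositions.

    Each composition is a dict mapping a taxon name to its abundance.
    """
    taxa = set(expected) | set(observed)
    gap = sum(abs(expected.get(t, 0.0) - observed.get(t, 0.0)) for t in taxa)
    total = sum(expected.values()) + sum(observed.values())
    return 100.0 * gap / total


def present(composition):
    """Set of taxa with a strictly positive abundance."""
    return {taxon for taxon, abundance in composition.items() if abundance > 0}


def false_negatives(expected, observed):
    """Taxa of the true community that the tool did not recover."""
    return present(expected) - present(observed)


def false_positives(expected, observed):
    """Taxa reported by the tool that are absent from the true community."""
    return present(observed) - present(expected)


# Hypothetical toy community at one taxonomic rank (made-up values).
truth = {"Bacteroides": 0.5, "Prevotella": 0.3, "Roseburia": 0.2}
inferred = {"Bacteroides": 0.6, "Prevotella": 0.3, "Blautia": 0.1}

print(round(bray_curtis_percent(truth, inferred), 2))  # 20.0
print(false_negatives(truth, inferred))                # {'Roseburia'}
print(false_positives(truth, inferred))                # {'Blautia'}
```

In this toy case the absolute abundance gaps sum to 0.4 and the two compositions sum to 2, which gives a divergence of 20%.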

1.2 Experimental design of simulated bacterial communities

The simulated communities were built according to the following full-factorial design:

  1. Databank: Reference databank from which taxa were drawn to construct the theoretical communities, either Silva (silva) or Utax (utax);
  2. Number of OTUs: 20, 100, 200, 500 and 1000;
  3. Abundance distribution: Abundances of OTUs were either uniform (uniform) or sampled from a power law distribution (power_law);
  4. Dataset: Theoretical communities. Each dataset (5 for each combination of abundance distribution and number of OTUs) corresponds to a unique ideal bacterial community specified by its own taxa set and corresponding vector of relative abundances;
  5. Set Number: Biological replicates (10 for each dataset), i.e. communities created by sampling organisms with replacement from the theoretical communities;
  6. Amplicon: Variable region of the 16S rRNA gene used to produce the amplicon sequences, either the V3-V4 (V3V4) or the V4 (V4) variable region.

This resulted in a total of 2 databanks \(\times\) 5 community sizes \(\times\) 2 abundance distributions \(\times\) 5 theoretical communities \(\times\) 10 replicates for each theoretical community \(\times\) 2 amplicons \(=\) 2000 samples (1000 per databank), with 100 000 sequences per sample.
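
The sample count can be checked by enumerating the factorial design. The sketch below is illustrative only: the variable names are assumptions, and it takes 5 theoretical communities per combination, as stated in item 4 of the list above.

```python
from collections import Counter
from itertools import product

# Factor levels as described in the design list (labels are illustrative).
databanks = ["silva", "utax"]
community_sizes = [20, 100, 200, 500, 1000]
distributions = ["uniform", "power_law"]
datasets = range(1, 6)        # 5 theoretical communities per combination
set_numbers = range(1, 11)    # 10 biological replicates per dataset
amplicons = ["V3V4", "V4"]

# Full-factorial design: one sample per combination of all factor levels.
samples = list(product(databanks, community_sizes, distributions,
                       datasets, set_numbers, amplicons))
per_databank = Counter(sample[0] for sample in samples)

print(len(samples))        # 2000
print(dict(per_databank))  # {'silva': 1000, 'utax': 1000}
```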

2 Material and Methods of Statistical analysis

For each of the three metrics (divergence, FN and FP), we performed a two-sided paired test, either parametric (paired t-test) or non-parametric (Wilcoxon signed-rank test, sometimes loosely described as a paired Mann-Whitney test), to assess the difference in accuracy between FROGS and each of its competitors.
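
To make the two paired tests concrete, here is a small standard-library Python sketch, given as an illustration rather than the analysis code of this report. It computes the paired t statistic and an exact two-sided signed-rank p-value by enumerating all sign assignments, which is feasible for 10 replicates. The function names and the divergence values are hypothetical, and the exact test assumes no zero and no tied absolute differences.

```python
from itertools import product
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(x, y):
    """t statistic of the paired t-test (n - 1 degrees of freedom)."""
    d = [a - b for a, b in zip(x, y)]
    return mean(d) / (stdev(d) / sqrt(len(d)))


def signed_rank_exact_pvalue(x, y):
    """Exact two-sided p-value of the Wilcoxon signed-rank test."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    abs_d = [abs(v) for v in d]
    if 0 in abs_d or len(set(abs_d)) < n:
        raise ValueError("this sketch does not handle zeros or ties")
    # Rank the absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(n), key=lambda i: abs_d[i])
    rank = {i: r + 1 for r, i in enumerate(order)}
    w_plus = sum(rank[i] for i in range(n) if d[i] > 0)
    total = n * (n + 1) // 2
    extreme = min(w_plus, total - w_plus)
    # Under the null hypothesis each rank is positive with probability 1/2.
    hits = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(range(1, n + 1), signs) if s)
        if w <= extreme or w >= total - extreme:
            hits += 1
    return hits / 2 ** n


# Hypothetical divergences (in percent) for the 10 replicates of one dataset.
frogs = [2.1, 1.8, 2.5, 3.0, 2.2, 1.9, 2.7, 2.4, 2.0, 2.6]
competitor = [2.4, 2.3, 2.6, 3.9, 2.9, 2.5, 3.7, 2.6, 2.8, 3.0]

t_stat = paired_t_statistic(frogs, competitor)
p_rank = signed_rank_exact_pvalue(frogs, competitor)
print(t_stat < 0)  # True: FROGS has the lower mean divergence here
print(p_rank)      # 0.001953125, i.e. 2 / 1024
```

With 10 replicates the smallest attainable two-sided signed-rank p-value is 2/1024, reached when one method is better in every replicate. Turning the t statistic into a p-value requires the Student t distribution with 9 degrees of freedom, which is best left to a statistics library.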

The tests were performed at the level of the theoretical community (dataset), using biological replicates (set_number) as replicates. We chose to compare the methods at this level because it is the finest one for which we have replication. Pooling different theoretical communities and/or abundance distributions to compare the methods at higher levels (e.g. community size \(\times\) amplicon) would blur the signal, as a method may outclass the others for uniform abundances but perform worse for other abundance distributions.

For each theoretical community, we declared FROGS better (resp. worse) than its competitor when the test was significant at the 0.05 level and FROGS had a lower (resp. higher) metric than its competitor. When the test was not significant, the methods were declared tied. Finally, we aggregated the results to count for each condition (community size \(\times\) abundance distribution \(\times\) amplicon) the number of theoretical communities favoring one or none of the methods.
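
The decision rule and the aggregation step can be sketched as follows. This is an illustration under stated assumptions: the function name, the tuple layout and the p-values are hypothetical, and the metric of each method is summarised here by its mean over the replicates.

```python
from collections import Counter

ALPHA = 0.05  # significance level used in the text


def verdict(p_value, frogs_mean, competitor_mean, alpha=ALPHA):
    """Classify one theoretical community for one metric."""
    if p_value >= alpha:
        return "tied"
    # Lower is better for divergence, FN and FP alike.
    return "FROGS better" if frogs_mean < competitor_mean else "FROGS worse"


# Hypothetical per-dataset results:
# (community size, distribution, amplicon, dataset, p-value,
#  mean metric for FROGS, mean metric for the competitor)
results = [
    (500, "uniform", "V4", 1, 0.001, 3.2, 4.1),
    (500, "uniform", "V4", 2, 0.200, 3.5, 3.6),
    (500, "uniform", "V4", 3, 0.010, 4.0, 3.1),
    (500, "uniform", "V4", 4, 0.030, 2.9, 3.8),
    (500, "uniform", "V4", 5, 0.600, 3.3, 3.3),
]

# Count, for each condition, the communities favoring one or none of the methods.
tally = Counter()
for size, distribution, amplicon, _dataset, p, m_frogs, m_comp in results:
    tally[(size, distribution, amplicon, verdict(p, m_frogs, m_comp))] += 1

for key, count in sorted(tally.items()):
    print(key, count)
```

For this made-up condition, the tally is 2 communities in favor of FROGS, 1 in favor of the competitor and 2 ties.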

3 SDFU

Simulated data from UTAX databank.

3.1 Visualization

3.1.1 Divergence

The comparisons of divergence at the sample level in the scatterplots show that, on average, FROGS performs comparably to but better than MOTHUR, UPARSE and QIIME (SOP): most samples end up in the upper left corner (corresponding to the region “divergence FROGS < divergence competitor”) but not too far away from the first diagonal (grey line). FROGS also performs comparably to QIIME (MA), but in certain conditions the majority of samples no longer lies in the upper left half of the graph, meaning that QIIME (MA) outperforms FROGS for those parameter values.

[Figure A]

A more traditional representation using boxplots of the excess divergence of FROGS, with samples from all theoretical communities pooled together, confirms the results: FROGS has similar (compared to UPARSE) or lower (compared to MOTHUR) divergence for the vast majority of samples. Note that the y-range was reduced from \([-85, 31]\) to \([-15, 3]\) in order to exclude outliers (4% of samples with excess divergence < -15 and 0.02% with excess divergence > 3) and zoom in on the boxplots. As expected, all methods perform quite similarly up to the Order level and the main differences appear at the Family and Genus levels, where MOTHUR and QIIME (SOP) produce much larger divergences than competing methods. The only configuration where FROGS is consistently outperformed is complex communities (number of species > 200) with uniform abundances and sequenced on the V4 region. In that configuration, FROGS is outperformed by QIIME (MA).

[Figure B]

Finally, a focus on the accuracy of FROGS alone shows that divergence levels vary between 0 and 10% and, as expected, are higher for fine classifications (Genus) than for coarse ones (Phylum). Unsurprisingly, the V3V4 amplicon gives a less distorted view of communities than the V4. Overall, FROGS recovers community compositions very well, except at the Genus level for complex communities (size > 200) with uniform abundances and sequenced using the V4 region.

[Figure C]

3.1.2 False Positive and False Negative OTUs

We repeat the graphical exploration of the results with False Positive and False Negative OTUs. A first representation shows that use of the V3V4 amplicon leads to more false positives and fewer false negatives than the V4. The graphics also highlight the very large number of false positives inferred by MOTHUR (up to 20 times more than the real community size).

A focus on FROGS and UPARSE leads to similar patterns: FROGS always produces fewer false negatives than UPARSE but produces slightly more false positives under the power law abundance distribution and slightly fewer under the uniform abundance distribution.

The lower number of false positives produced by UPARSE under power law abundances could be due to the abundance-based filters used in UPARSE.

3.2 Statistical Analysis

3.2.1 Divergence

We present the results of the paired tests, either parametric (t-test, top) or non-parametric (signed rank test, bottom). Both tests show that FROGS performs as well as or better than UPARSE and MOTHUR in most conditions. The only condition in which FROGS does worse than UPARSE is small community size (20). It also does better than QIIME (SOP) in all settings, and better than QIIME (MA) in most settings, with the notable exception of large communities (size > 200) with uniform abundances studied using the V4 region.

The real strength of FROGS lies in its ability to give a more accurate view of large communities (size > 200) at fine scales (Species or Genus level).

3.2.1.1 Paired t-test

3.2.1.2 Signed rank test

3.2.2 Conditions where QIIME (MA) outperforms FROGS

A focus on the performance of QIIME (MA) reveals that QIIME (MA) indeed performs better than FROGS with the uniform distribution when using the V4 amplicon region. However, although significant, the differences are small in that case: less than 2 percentage points in all cases, and most marked at the Genus level, where the divergences of both FROGS and QIIME (MA) are quite high.

3.2.3 False Positive and False Negative OTUs

The same paired tests as in the previous section reveal that FROGS strictly outperforms MOTHUR in terms of both FP and FN taxa. It also produces fewer FN than UPARSE. Additionally, it produces fewer FP than UPARSE for uniform distributions and more for power law ones. Overall, FROGS produces fewer FP and fewer FN than either UPARSE or MOTHUR for high community sizes (>200 for uniform distributions, >1000 for power law distributions). It also produces fewer FP and FN than QIIME under the power law distribution, and fewer FP but more FN than QIIME under the uniform distribution.

3.2.3.1 Paired t-test

3.2.3.2 Signed rank test

4 SDFS

Simulated data from Silva databank.

4.1 Visualization

4.1.1 Divergence

The comparisons of divergence at the sample level in the scatterplots show that, on average, FROGS performs comparably to but better than MOTHUR (MA), UPARSE (MA) and QIIME (MA): most samples end up in the upper left corner (corresponding to the region “divergence FROGS < divergence competitor”) but not too far away from the first diagonal (grey line).

[Figure A]

A more traditional representation using boxplots of the excess divergence of FROGS, with samples from all theoretical communities pooled together, confirms the results: FROGS has similar (compared to UPARSE (MA) and QIIME (MA)) or lower (compared to MOTHUR (MA)) divergence for the vast majority of samples. Note that the y-range was reduced from \([-51, 41]\) to \([-15, 3]\) in order to exclude outliers (1% of communities with low FROGS but very high MOTHUR (MA) divergence, or high FROGS but low QIIME (MA) divergence) and zoom in on the boxplots. The only configuration where FROGS is consistently outperformed is complex communities (number of species > 200) with uniform abundances and sequenced on the V4 region. In that configuration, FROGS is outperformed by QIIME (MA).

[Figure B]

Finally, a focus on the accuracy of FROGS alone shows that divergence levels vary mostly between 0 and 15% and, as expected, are higher for fine classifications (Species) than for coarse ones (Phylum). Unsurprisingly, the V3V4 amplicon gives a less distorted view of communities than the V4.

[Figure C]

4.1.2 False Positive and False Negative OTUs

We repeat the graphical exploration of the results with False Positive and False Negative OTUs. A first representation shows that use of the V3V4 amplicon leads to more false positives and fewer false negatives than the V4. The graphics also highlight the very large number of false positives inferred by MOTHUR and QIIME (up to 20 times more than the real community size).

A focus on FROGS and UPARSE (MA) leads to similar patterns: FROGS always produces fewer false negatives than UPARSE (MA) but produces slightly more false positives under the power law abundance distribution and slightly fewer under the uniform abundance distribution.

The lower number of false positives produced by UPARSE (MA) under power law abundances could be due to the abundance-based filters used in UPARSE (MA).

4.2 Statistical Analysis

4.2.1 Divergence

We present the results of the paired tests, either parametric (t-test, top) or non-parametric (signed rank test, bottom). Both tests show that FROGS performs as well as or better than UPARSE (MA) and MOTHUR (MA) in most conditions. The only condition in which FROGS does worse than UPARSE (MA) is small community size (20). It also does better than QIIME (MA) in most settings, with the exception of large communities (size > 200) with uniform abundances studied using the V4 region.

The real strength of FROGS lies in its ability to give a more accurate view of large communities (size > 200) at fine scales (Genus level).

4.2.1.1 Paired t-test

4.2.1.2 Signed rank test

4.2.2 False Positive and False Negative OTUs

The same paired tests as in the previous section reveal that FROGS strictly outperforms MOTHUR (MA) in terms of both FP and FN taxa. It also produces fewer FN than UPARSE (MA) and fewer FP than QIIME (MA). Additionally, it produces fewer FP than UPARSE (MA) for uniform distributions and more for power law ones.

Overall, FROGS produces fewer FP and fewer FN than either UPARSE (MA) or MOTHUR (MA) for high community sizes (>200 for uniform distributions, >1000 for power law distributions) and fewer FP than QIIME (MA) at all sizes.

4.2.2.1 Paired t-test

4.2.2.2 Signed rank test