Article

 

ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition

https://ir.library.oregonstate.edu/concern/articles/ff3657098

An open-source, platform-independent implementation of the method in the Julia programming language is freely available at https://github.com/dkoslicki/ARK. A Matlab implementation is available at http://www.ee.kth.se/ctsoftware.

Supporting information is available online at: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0140644#sec021

To the best of our knowledge, one or more authors of this paper were federal employees when contributing to this work. This is the publisher’s final PDF. The article was published by the Public Library of Science and is in the public domain. The published article can be found at: http://www.plosone.org/.

Descriptions

Creator
Abstract
  • Motivation: Estimation of bacterial community composition from high-throughput sequenced 16S rRNA gene amplicons is a key task in microbial ecology. Because the sequence data from each sample typically consist of a large number of reads and are adversely affected by varying levels of biological and technical noise, accurate analysis of such large datasets is challenging.
  • Results: There has been a recent surge of interest in using compressed-sensing-inspired and convex-optimization-based methods to solve the estimation problem for bacterial community composition. These methods typically rely on summarizing the sequence data by frequencies of low-order k-mers and matching this information statistically with a taxonomically structured database. Here we show that the accuracy of the resulting community composition estimates can be substantially improved by aggregating the reads from a sample with an unsupervised machine learning approach prior to the estimation phase. The aggregation of reads is a pre-processing step in which a standard K-means clustering algorithm partitions a large set of reads into subsets at reasonable computational cost, providing several vectors of first-order statistics instead of only a single statistical summary in terms of k-mer frequencies. The output of the clustering is then processed further to obtain the final estimate for each sample. The resulting method is called Aggregation of Reads by K-means (ARK), and it is based on a statistical argument via a mixture density formulation. ARK is found to improve the fidelity and robustness of several recently introduced methods, with only a modest increase in computational complexity.
Resource Type
DOI
  • 10.1371/journal.pone.0140644
Date Available
Date Issued
Citation
  • Koslicki, D., Chatterjee, S., Shahrivar, D., Walker, A. W., Francis, S. C., Fraser, L. J., ... & Corander, J. (2015). ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition. PLoS ONE, 10(10), e0140644. doi:10.1371/journal.pone.0140644
Journal Title
  • PLoS ONE
Journal Volume
  • 10
Journal Issue/Number
  • 10
Academic Affiliation
Rights Statement
Funding Statement (additional comments about funding)
  • This work was supported by the Swedish Research Council Linnaeus Centre ACCESS (S.C.), ERC grant 239784 (J.C.), the Academy of Finland Center of Excellence COIN (J.C.), the Academy of Finland (M.V.), the Scottish Government's Rural and Environment Science and Analytical Services Division (RESAS) (A.W.W.), and the UK MRC/DFID grant G1002369 (S.C.F.). L.J.F. received funding in the form of salary from Illumina Cambridge Ltd.
Publisher
Peer Reviewed
Language
Replaces
Additional Information
  • description.provenance : Submitted by Patricia Black (patricia.black@oregonstate.edu) on 2015-11-19T17:18:07Z No. of bitstreams: 2 license_rdf: 1089 bytes, checksum: 0a703d871bf062c5fdc7850b1496693b (MD5) KoslickiDavidMathARKAggregationReads.pdf: 1646190 bytes, checksum: dcf4511c1db00902056886c12620eb07 (MD5)
  • description.provenance : Approved for entry into archive by Patricia Black (patricia.black@oregonstate.edu) on 2015-11-19T17:18:21Z (GMT) No. of bitstreams: 2 license_rdf: 1089 bytes, checksum: 0a703d871bf062c5fdc7850b1496693b (MD5) KoslickiDavidMathARKAggregationReads.pdf: 1646190 bytes, checksum: dcf4511c1db00902056886c12620eb07 (MD5)
  • description.provenance : Made available in DSpace on 2015-11-19T17:18:21Z (GMT). No. of bitstreams: 2 license_rdf: 1089 bytes, checksum: 0a703d871bf062c5fdc7850b1496693b (MD5) KoslickiDavidMathARKAggregationReads.pdf: 1646190 bytes, checksum: dcf4511c1db00902056886c12620eb07 (MD5) Previous issue date: 2015-10-23
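
The read-aggregation step summarized in the abstract can be illustrated with a minimal Python sketch. This is not the authors' Julia or Matlab implementation; the function names (`kmer_frequencies`, `kmeans`, `aggregate_reads`), the toy reads, and the choices of k and number of clusters are all hypothetical, picked only for illustration. The sketch computes a k-mer frequency vector per read, clusters the reads with plain K-means (Lloyd's algorithm), and returns one mean k-mer vector per cluster together with the fraction of reads in each cluster. The later estimation phase, in which the clustering output is processed into a community composition estimate, is not shown.

```python
import itertools
import random


def kmer_frequencies(read, k=2, alphabet="ACGT"):
    """Return the normalized k-mer frequency vector of a single read."""
    index = {"".join(p): i
             for i, p in enumerate(itertools.product(alphabet, repeat=k))}
    counts = [0.0] * len(index)
    for i in range(len(read) - k + 1):
        j = index.get(read[i:i + k])
        if j is not None:  # skip k-mers with ambiguous bases such as N
            counts[j] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def kmeans(vectors, n_clusters, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, n_clusters)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest centroid in squared Euclidean distance.
        new_labels = [
            min(range(n_clusters),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        # Update step: each non-empty cluster's centroid becomes its mean.
        for c in range(n_clusters):
            members = [v for v, lab in zip(vectors, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
    return centroids, labels


def aggregate_reads(reads, n_clusters, k=2, seed=0):
    """Cluster reads by k-mer profile (hypothetical helper).

    Returns one mean k-mer vector per cluster and the fraction of
    reads assigned to each cluster.
    """
    vectors = [kmer_frequencies(r, k) for r in reads]
    centroids, labels = kmeans(vectors, n_clusters, seed=seed)
    weights = [labels.count(c) / len(reads) for c in range(n_clusters)]
    return centroids, weights


# Toy reads, invented for illustration only.
reads = ["ACGTACGTAC", "ACGTACGAAC", "GGGGCCCCGG",
         "GGGCCCCGGG", "ACGTTCGTAC", "GGGGCCCGGG"]
centroids, weights = aggregate_reads(reads, n_clusters=2, k=2)
```

In this sketch the cluster means, weighted by the cluster fractions, sum back to the single overall mean k-mer vector of the sample, so the clustering refines the single-summary representation without discarding it.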
