US 12,221,657 B2
Method and system for improving amplicon sequencing based taxonomic resolution of microbial communities
Sharmila Shekhar Mande, Pune (IN); Anirban Dutta, Pune (IN); Nishal Kumar Pinna, Pune (IN); and Mohammed Monzoorul Haque, Pune (IN)
Assigned to Tata Consultancy Services Limited, Mumbai (IN)
Filed by Tata Consultancy Services Limited, Mumbai (IN)
Filed on Aug. 9, 2019, as Appl. No. 16/537,133.
Claims priority of application No. 201821030219 (IN), filed on Aug. 10, 2018.
Prior Publication US 2020/0115766 A1, Apr. 16, 2020
Int. Cl. C12Q 1/6888 (2018.01); C12N 15/10 (2006.01); C12Q 1/6869 (2018.01); G16B 10/00 (2019.01); G16B 30/00 (2019.01); G16B 40/10 (2019.01)
CPC C12Q 1/6888 (2013.01) [C12N 15/1003 (2013.01); C12Q 1/6869 (2013.01); G16B 10/00 (2019.02); G16B 30/00 (2019.02); G16B 40/10 (2019.02)]
4 Claims
 
1. A method for improving accuracy of taxonomic profiling of a microbial community based on amplicon sequencing, the method comprising:
collecting a biological sample from an environment;
obtaining a first subsample and a second subsample from the biological sample;
extracting microbial DNA from the first subsample and the second subsample;
sequencing the extracted microbial DNA from the first subsample using a sequencer to get first DNA sequence data, wherein the first DNA sequence data comprises a plurality of pairs of sequence fragments, wherein each pair of the plurality of pairs of sequence fragments is generated through paired-end sequencing of a first amplicon that comprises a first combination of informative regions within the first amplicon, wherein the first combination of informative regions comprises informative regions arranged contiguously or non-contiguously in a phylogenetic marker gene targeted in the first amplicon sequencing, the sequencing of the first combination of informative regions arranged contiguously comprising the steps of:
designing primers including a forward primer and a reverse primer against a stretch of the extracted microbial DNA such that the informative regions reside within the stretch and the primers target two contiguous informative regions, wherein the two contiguous informative regions are two adjacent informative regions;
generating paired-end reads including a forward read and a reverse read by performing the paired-end sequencing of the first amplicon, wherein the paired-end sequencing is a 250 bp × 2 paired-end sequencing where the stretch of the extracted microbial DNA is sequenced from both ends; and
merging the forward read and the reverse read into a single sequence forming a merged read based on an overlap between the forward read and the reverse read constituting a pair, wherein the overlap is found between the forward and the reverse read on sequencing the two adjacent informative regions; and
wherein the informative regions contain phylogenetically relevant information;
sequencing the extracted DNA from the second subsample using the sequencer to get second DNA sequence data, wherein the second DNA sequence data comprises a plurality of pairs of sequence fragments, wherein each pair of the plurality of pairs of sequence fragments is generated through paired-end sequencing of a second amplicon that comprises a second combination of informative regions within the second amplicon, wherein the second combination of informative regions comprises informative regions arranged non-contiguously in the phylogenetic marker gene targeted in the second amplicon sequencing, the sequencing of the second combination of informative regions arranged non-contiguously comprising the steps of:
designing primers including a forward primer and a reverse primer against a stretch of the extracted microbial DNA such that the informative regions reside within the stretch and the primers target two non-contiguous informative regions, wherein the two non-contiguous informative regions are two distantly separated informative regions;
generating paired-end reads including a forward read and a reverse read by performing the paired-end sequencing of the second amplicon, wherein the paired-end sequencing is a 250 bp × 2 paired-end sequencing where the stretch of the extracted microbial DNA is sequenced from both ends;
concatenating the forward read and the reverse read into a single sequence forming a concatenated read using a string of multiple ambiguous nucleotide characters when the forward read and the reverse read do not overlap, wherein the overlap is not found between the forward read and the reverse read on sequencing the two separated informative regions;
wherein utility of targeting pairs of non-contiguously placed informative regions improves taxonomic classification accuracy,
wherein the second combination of informative regions are different from the first combination of informative regions and one of the informative regions in the first combination of informative regions and the second combination of informative regions is shared by the first combination of informative regions and the second combination of informative regions, and
wherein the first and second amplicon sequencing experiments target the phylogenetic marker gene;
generating, via one or more hardware processors, a first microbial taxonomic abundance profile of the first sequenced subsample by performing a taxonomic classification of phylogenetically relevant information corresponding to the first combination of informative regions, wherein the first combination of informative regions are submitted as query sequences for performing the taxonomic classification, and wherein the first microbial taxonomic abundance profile comprises abundance values corresponding to one or more pair of sequence fragments comprising the first combination of informative regions classified into a plurality of taxonomic groups;
generating, via the one or more hardware processors, a second microbial taxonomic abundance profile of the second sequenced subsample by performing the taxonomic classification of phylogenetically relevant information corresponding to the second combination of informative regions, wherein the second combination of informative regions are submitted as query sequences for performing the taxonomic classification, and wherein the second microbial taxonomic abundance profile comprises abundance values corresponding to one or more pair of sequence fragments comprising the second combination of informative regions classified into the plurality of taxonomic groups;
pre-computing, via the one or more hardware processors, taxonomic classification accuracies for various possible combinations of informative regions for microbes belonging to the plurality of taxonomic groups, wherein the pre-computing is based on marker gene sequences of known taxonomic origin present in existing sequence databases, to generate a computation table; and
combining, via the one or more hardware processors, the first microbial taxonomic abundance profile and the second microbial taxonomic abundance profile of the first and the second sequenced subsample based on the computation table to generate a combined microbial taxonomic abundance profile, wherein combining the first microbial taxonomic abundance profile and the second microbial taxonomic abundance profile utilizes a combinatorial strategy and the combined microbial taxonomic abundance profile has a refined abundance value for each taxonomic group and has improved taxonomic classification accuracy as compared to the first microbial taxonomic abundance profile and the second microbial taxonomic abundance profile obtained individually for the first and the second subsample, targeting the first combination of informative regions and the second combination of informative regions in the phylogenetic marker gene, wherein the combinatorial strategy comprises:
obtaining the abundance values of a particular taxonomic group ‘i’ (Tix and Tiy) corresponding to the first and second sequenced subsamples, generated by performing the taxonomic classification utilizing the first combination of informative regions and the second combination of informative regions;
providing pre-computed relative accuracies Wix and Wiy in taxonomic classification for the particular taxonomic group ‘i’ using the first combination of informative regions ‘x’ and the second combination of informative regions ‘y’; and
calculating the refined abundance value (Tixy) for the particular taxonomic group ‘i’ using the following formula:

[formula image for Tixy not reproduced in this text]
and calculating the refined abundance value for all the taxonomic groups to obtain a more accurate microbial taxonomic abundance profile as compared to the first microbial taxonomic abundance profile and the second microbial taxonomic abundance profile obtained individually for the first and the second subsample.
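
The read-joining and profile-combining steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names, the exact-match overlap search, the `min_overlap` and `gap_len` parameters, and the example taxa and numbers are all hypothetical. The claim text as reproduced here does not include the formula for Tixy, so the sketch assumes an accuracy-weighted mean of Tix and Tiy with weights Wix and Wiy; the actual claimed formula may differ.

```python
from typing import Dict

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def join_read_pair(forward: str, reverse: str,
                   min_overlap: int = 10, gap_len: int = 8) -> str:
    """Merge a read pair on an exact overlap, else concatenate with N's.

    Assumption: overlaps are detected by exact suffix/prefix matching,
    longest first. Real read mergers tolerate mismatches and use quality
    scores; this simplification is for illustration only.
    """
    rev = reverse_complement(reverse)  # bring the reverse read onto the forward strand
    longest = min(len(forward), len(rev))
    for k in range(longest, min_overlap - 1, -1):
        if forward[-k:] == rev[:k]:
            return forward + rev[k:]  # merged read (contiguous regions)
    # No overlap found: bridge with ambiguous nucleotide characters
    # (non-contiguous regions).
    return forward + "N" * gap_len + rev


def combine_profiles(t_x: Dict[str, float], t_y: Dict[str, float],
                     w_x: Dict[str, float], w_y: Dict[str, float]) -> Dict[str, float]:
    """Combine two abundance profiles taxon by taxon.

    ASSUMED formula (not taken from the source text):
        Tixy = (Wix * Tix + Wiy * Tiy) / (Wix + Wiy)
    Taxa missing from a profile or weight table are treated as 0.
    """
    combined = {}
    for taxon in sorted(set(t_x) | set(t_y)):
        tx, ty = t_x.get(taxon, 0.0), t_y.get(taxon, 0.0)
        wx, wy = w_x.get(taxon, 0.0), w_y.get(taxon, 0.0)
        total = wx + wy
        # Fall back to a plain mean when no accuracy information is available.
        combined[taxon] = (wx * tx + wy * ty) / total if total > 0 else (tx + ty) / 2
    return combined


if __name__ == "__main__":
    # Hypothetical abundances and pre-computed accuracies for two region combinations.
    profile_x = {"Bacteroides": 0.40, "Prevotella": 0.10}
    profile_y = {"Bacteroides": 0.30, "Prevotella": 0.20}
    acc_x = {"Bacteroides": 0.9, "Prevotella": 0.6}
    acc_y = {"Bacteroides": 0.6, "Prevotella": 0.9}
    print(combine_profiles(profile_x, profile_y, acc_x, acc_y))
```

Under this assumed weighting, each taxon's combined abundance leans toward the profile whose region combination classifies that taxon more accurately according to the pre-computed table.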