CPC C12Q 1/6881 (2013.01) [C07K 14/7051 (2013.01); C07K 16/2809 (2013.01); G16B 30/00 (2019.02); G16B 40/30 (2019.02); G16B 45/00 (2019.02); C07K 2317/565 (2013.01); C12Q 1/686 (2013.01); C12Q 1/6806 (2013.01); C12Q 1/6883 (2013.01); C12Q 2600/156 (2013.01); C12Q 2600/158 (2013.01); C12Q 2600/16 (2013.01); G16B 30/10 (2019.02)] | 14 Claims |
1. A method of screening a plurality of clonotypes, the method comprising:
at a computer system having one or more processors, and memory storing one or more programs for execution by the one or more processors:
obtaining, in electronic form, a clonotype dataset formed using a first plurality of sequence reads of nucleic acids in a first aliquot of nucleic acids pooled from a plurality of 100 or more cells from a biological sample of a single subject comprising B cells or T cells, wherein
each respective sequence read in the first plurality of sequence reads includes a corresponding barcode, from a plurality of barcodes, that indicates which cell in the plurality of cells originated the nucleic acid represented by the respective sequence read,
the clonotype dataset comprises, for each respective cell in the plurality of cells, a corresponding contig entry in a plurality of contig entries, wherein the plurality of contig entries represents the plurality of clonotypes, and wherein the plurality of clonotypes comprises 25 clonotypes,
each respective clonotype in the plurality of clonotypes corresponds to one or more contig entries in the clonotype dataset, and
for each respective cell in the plurality of cells, a corresponding contig entry in the plurality of contig entries:
(i) corresponds to a T-cell receptor or B-cell receptor from the respective cell in the plurality of cells and (ii) comprises an indication of chain type for the corresponding T-cell receptor or B-cell receptor, a corresponding contig sequence, and a corresponding barcode, from the plurality of barcodes, that identifies the respective cell for the respective contig entry,
wherein the corresponding contig sequence is determined by a subset of sequence reads in the first plurality of sequence reads (i) having the respective barcode for the respective cell and (ii) encoding all or a portion of the corresponding T-cell receptor chain or B-cell receptor chain from the respective cell;
obtaining, in electronic form, a discrete attribute value dataset formed using a second plurality of sequence reads of nucleic acids in a second aliquot of nucleic acids pooled from the plurality of cells, wherein each respective sequence read in the second plurality of sequence reads includes a corresponding barcode, from the plurality of barcodes, that indicates which cell in the plurality of cells originated the nucleic acid represented by the respective sequence read, and the discrete attribute value dataset comprises:
for each respective cell in the plurality of cells,
for each respective gene in a plurality of genes, a corresponding discrete attribute value for a count of a number of mRNA mapping to the respective gene in the respective cell determined from a number of sequence reads in the second plurality of sequence reads that map to the respective gene and have the barcode for the respective cell;
clustering the plurality of cells into a plurality of clusters by (i) computing a plurality of distances using each discrete attribute value of each cell in the plurality of cells for each unique pair of cells in the plurality of cells and (ii) evaluating the plurality of distances with a criterion function, wherein
the plurality of distances includes a separate distance for each pair of cells in the plurality of cells,
each respective distance in the plurality of distances represents a different pair of cells in the plurality of cells and quantifies a distance between a respective first vector formed by the discrete attribute values for a respective first cell in the different pair of cells and a respective second vector formed by the discrete attribute values for a respective second cell in the different pair of cells, and
each respective cluster in the plurality of clusters represents a corresponding subset of cells of the plurality of cells that are clustered together based on evaluation of distances in the plurality of distances representing different pairs of cells within the corresponding subset of cells with the criterion function;
matching, for each respective cell in a first cluster in the plurality of clusters, the corresponding barcode of the respective cell with a corresponding barcode for a respective contig entry in the plurality of contig entries, thereby obtaining a corresponding contig entry for the respective cell; and
providing, for each respective clonotype in the plurality of clonotypes represented in the first cluster, a number of cells in the plurality of cells in the clonotype dataset, that represent the respective clonotype that are in the first cluster, based on the corresponding contig entry of each respective cell in the first cluster.
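The steps recited in claim 1 (pairwise distances over per-cell gene-count vectors, clustering, barcode matching, and per-cluster clonotype tallies) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration only: the toy barcodes, gene counts, and clonotype labels are hypothetical, Euclidean distance stands in for the unspecified distance measure, and a single-linkage distance threshold stands in for the unspecified criterion function. The claim itself names none of these, and the sketch uses 4 cells rather than the 100 or more cells and 25 clonotypes the claim requires.

```python
from collections import Counter
from itertools import combinations
from math import dist

# Hypothetical discrete attribute value dataset:
# barcode -> per-gene mRNA counts for that cell (three genes here).
counts = {
    "AAAC": [10, 0, 1],
    "AAAG": [9, 1, 0],
    "CCCA": [0, 8, 7],
    "CCCG": [1, 9, 8],
}

# Hypothetical clonotype dataset: one contig entry per cell,
# carrying the same barcodes as the gene-count data.
contigs = [
    {"barcode": "AAAC", "chain": "TRB", "clonotype": "clonotype1"},
    {"barcode": "AAAG", "chain": "TRB", "clonotype": "clonotype1"},
    {"barcode": "CCCA", "chain": "TRB", "clonotype": "clonotype2"},
    {"barcode": "CCCG", "chain": "TRA", "clonotype": "clonotype3"},
]


def pairwise_distances(counts):
    """Return one Euclidean distance for each unique pair of cells."""
    return {
        (a, b): dist(counts[a], counts[b])
        for a, b in combinations(sorted(counts), 2)
    }


def cluster(counts, threshold):
    """Group cells whose distance is at most `threshold` (assumed criterion)."""
    parent = {barcode: barcode for barcode in counts}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), d in pairwise_distances(counts).items():
        if d <= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for barcode in counts:
        groups.setdefault(find(barcode), []).append(barcode)
    return sorted(sorted(members) for members in groups.values())


def clonotype_counts(cluster_barcodes, contigs):
    """Match cluster barcodes to contig entries and tally cells per clonotype."""
    by_barcode = {entry["barcode"]: entry for entry in contigs}
    return Counter(
        by_barcode[b]["clonotype"] for b in cluster_barcodes if b in by_barcode
    )


clusters = cluster(counts, threshold=3.0)
first_cluster_tally = clonotype_counts(clusters[0], contigs)
```

With this toy data, `clusters[0]` holds the barcodes of the first cluster and `first_cluster_tally` maps each clonotype represented in that cluster to its number of cells. A real analysis would typically normalize counts and use a dedicated clustering library; the threshold of 3.0 is arbitrary.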
|