CPC C12N 15/113 (2013.01) [C12N 9/22 (2013.01); C12N 15/102 (2013.01); C12N 15/111 (2013.01); C12N 15/63 (2013.01); C12N 15/74 (2013.01); C12Q 1/6876 (2013.01); G16B 25/00 (2019.02); G16B 30/00 (2019.02); G16B 30/10 (2019.02); G16B 30/20 (2019.02); C12N 2310/20 (2017.05); C12Q 2600/156 (2013.01)] | 27 Claims |
1. A method for identifying and generating novel nucleic acid modifying effectors, comprising:
a computer-implemented method comprising:
(a) identifying putative nucleic acid modifying loci from a set of nucleic acid sequences, the set of nucleic acid sequences obtained from a genomic or metagenomic database and the nucleic acid sequences are long enough to encode a protein with a defined size limit greater than 700 amino acids, wherein the putative nucleic acid modifying loci: are a defined distance between 1 and 25 kb of a CRISPR array, the CRISPR array being identified using a repeat or pattern finding analysis of the set of nucleic acid sequences, and
comprise only one sequence encoding a protein with a defined size limit greater than 700 amino acids;
(b) identifying candidate effector proteins that are the protein with greater than 700 amino acids in (a) and homologous proteins thereof;
(c) grouping the candidate effector proteins into subsets based on homology;
(d) selecting the subsets that have at least 10 candidate effector proteins and more than 50% of the candidate effector proteins within 10 kb of the CRISPR array,
(e) identifying a candidate set of novel nucleic acid modifying effector proteins by selecting from one or more of the subsets selected in (d) based on one or more of the following:
subsets comprising loci of coding sequences for putative candidate effector proteins with no more than 90% homology matches to known protein domains relative to loci in other subsets,
subsets whose loci have same orientations as putative adjacent accessory proteins relative to effector proteins in other subsets,
subsets comprising candidate effector proteins with lower existing nucleic acid modifying classifications relative to other subsets,
subsets comprising loci with a lower proximity to known nucleic acid modifying loci relative to other subsets, and
total number of candidate effector proteins in each subset; and
generating nucleic acid molecules encoding one or more of the nucleic acid modifying proteins in the candidate set identified by the computer-implemented method; and
expressing the one or more nucleic acid modifying proteins from the generated nucleic acid molecules and performing one or more biochemical assays that validate a level of nucleic acid modifying function of the one or more novel nucleic acid modifying proteins.
|
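The numeric thresholds in steps (a) and (d) of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the `Locus` class, the function names, and the input layout (distances to the nearest CRISPR array already computed, in base pairs) are all hypothetical, and "within" a distance is assumed to mean less than or equal to it. CRISPR array detection, homology search, and clustering (steps (b) and (c)) are left out.

```python
from dataclasses import dataclass
from typing import Dict, List

MIN_PROTEIN_AA = 700        # step (a): protein size limit, strictly greater than
MIN_WINDOW_BP = 1_000       # step (a): lower bound of the "defined distance"
MAX_WINDOW_BP = 25_000      # step (a): upper bound of the "defined distance"
MIN_SUBSET_SIZE = 10        # step (d): at least 10 candidate effector proteins
NEAR_ARRAY_BP = 10_000      # step (d): "within 10 kb of the CRISPR array"
MIN_NEAR_FRACTION = 0.5     # step (d): more than 50% of candidates


@dataclass
class Locus:
    """Hypothetical record for one locus; not a structure defined in the claim."""
    locus_id: str
    distance_to_array_bp: int       # assumed precomputed distance to nearest array
    protein_lengths_aa: List[int]   # lengths of all proteins encoded at the locus


def is_putative_locus(locus: Locus, window_bp: int = MAX_WINDOW_BP) -> bool:
    """Step (a): inside the chosen window and exactly one protein above 700 aa."""
    if not MIN_WINDOW_BP <= window_bp <= MAX_WINDOW_BP:
        raise ValueError("window_bp must lie between 1 kb and 25 kb")
    large = [n for n in locus.protein_lengths_aa if n > MIN_PROTEIN_AA]
    return locus.distance_to_array_bp <= window_bp and len(large) == 1


def select_subsets(subsets: Dict[str, List[int]]) -> List[str]:
    """Step (d): keep subsets with >= 10 members and > 50% within 10 kb.

    `subsets` maps a subset name to the distance (bp) of each candidate
    effector protein from its CRISPR array (an assumed input layout).
    """
    selected = []
    for name, distances in subsets.items():
        if len(distances) < MIN_SUBSET_SIZE:
            continue
        near = sum(1 for d in distances if d <= NEAR_ARRAY_BP)
        if near / len(distances) > MIN_NEAR_FRACTION:
            selected.append(name)
    return selected
```

In this sketch a subset with exactly half of its members near an array is rejected, because the claim requires more than 50%, and a protein of exactly 700 amino acids does not count as large, because the claim requires greater than 700.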