Genomic phylostratigraphy

From Wikipedia, the free encyclopedia

Genomic phylostratigraphy is a statistical method in genetics for dating the origin of specific genes by examining their homologs across species. It was first developed at the Ruđer Bošković Institute in Zagreb, Croatia.[1] The method links each gene to its founder gene, which makes it possible to estimate the gene's age. This can help clarify evolutionary processes such as patterns of gene birth throughout evolution, or how the age of the transcriptome changes throughout embryonic development. Bioinformatic tools such as GenEra have been developed to calculate relative gene ages based on genomic phylostratigraphy.[2]

Method

An example of a phylogenetic tree with its different phylostrata. If the large grey bars are taken as the phylogeny of the taxa and the thin coloured lines as the gene lineages within them, two founder genes, 1 and 2, can be identified in their respective phylostrata 1 and 2. A phylostratum is usually named after the smallest determinable clade that includes all the taxa present.[1]

This technique relies on the assumption that the diversity of the genome is due not only to gene duplications but also to continuous, frequent de novo gene births. These genes (called "founder genes") would form from non-genic DNA sequences, from changes in reading frame (or other ways of arising from within existing genes), or even from very rapid evolution of a protein that modifies its sequence beyond recognition.[3] Such new genes would at first have high evolutionary rates that then slow down with time, which makes it possible to recognise their lineage in their descendants.[1] Each founder gene can then be assigned to a specific phylostratum. The phylostratum is represented as the clade that includes all the genes derived from the same founder gene, signifying that this gene was formed in the common ancestor of that clade (e.g. Arthropoda, Mammalia, Metazoa). Placing founder genes and their descendants in different phylostrata makes it possible to date them. This can in turn be used to analyse the origin of certain protein functions and developmental processes on a macroevolutionary scale, including by observing connections between genes.

The original method for genomic phylostratigraphy uses a BLAST sequence similarity search with an E-value cut-off of 10⁻³. The genes deemed similar enough in sequence are gathered, and the clade encompassing all the taxa represented by those genes is determined. This clade then becomes the phylostratum of these genes. By determining the common ancestor of this clade, an age can be assigned to the founder gene and all its descendants. Modern implementations replace BLAST with DIAMOND, since it is orders of magnitude faster,[4] and have refined the process to account for sequence contamination, horizontal gene transfer and even homology detection failure.[2] Applying the process on a genome-wide scale makes it possible to detect patterns of founder gene births and to infer the role of certain genes in clade-specific developmental processes and physiological pathways, as well as the origin of those traits.

In the original paper,[1] the developers of the method gave an example of how to use the approach in practice with Drosophila. They gathered 13,000 genes, determined their founder genes and grouped them into their respective phylostrata. They also separated the gene families according to whether they were mainly expressed in one of the three germ layers (endoderm, mesoderm, ectoderm). By studying the expression frequencies of genes in the different phylostrata, they were able to propose specific periods and ancestral organisms in evolutionary history in which those germ layers may have originated.
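The assignment step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used by any published tool: the lineages, the species names and the helper functions `shared_depth` and `assign_phylostratum` are hypothetical, and the similarity hits are assumed to have already been produced by a search such as BLAST or DIAMOND.

```python
from typing import Dict, List, Tuple

# Hypothetical lineage of the focal species, ordered from the most inclusive
# (oldest) clade to the least inclusive (youngest). Phylostratum 1 is the oldest.
FOCAL_LINEAGE: List[str] = [
    "Cellular organisms", "Eukaryota", "Metazoa", "Arthropoda", "Drosophila",
]

# Hypothetical lineages of the species in the search database.
SPECIES_LINEAGES: Dict[str, List[str]] = {
    "E. coli": ["Cellular organisms", "Bacteria"],
    "S. cerevisiae": ["Cellular organisms", "Eukaryota", "Fungi"],
    "H. sapiens": ["Cellular organisms", "Eukaryota", "Metazoa", "Chordata"],
    "A. mellifera": ["Cellular organisms", "Eukaryota", "Metazoa", "Arthropoda", "Apis"],
    "D. melanogaster": FOCAL_LINEAGE,
}

E_VALUE_CUTOFF = 1e-3  # cut-off of the original method


def shared_depth(lineage: List[str]) -> int:
    """Count the leading clades that a lineage shares with the focal lineage."""
    depth = 0
    for focal_clade, other_clade in zip(FOCAL_LINEAGE, lineage):
        if focal_clade != other_clade:
            break
        depth += 1
    return depth


def assign_phylostratum(hits: List[Tuple[str, float]]) -> Tuple[int, str]:
    """Return (rank, clade) of the smallest clade containing every species with a hit.

    `hits` is a list of (species, e_value) pairs for one gene of the focal species.
    """
    depths = [
        shared_depth(SPECIES_LINEAGES[species])
        for species, e_value in hits
        if e_value <= E_VALUE_CUTOFF
    ]
    # A gene with no accepted hit outside its own genome stays in the youngest stratum.
    depth = min(depths, default=len(FOCAL_LINEAGE))
    return depth, FOCAL_LINEAGE[depth - 1]


if __name__ == "__main__":
    example_hits = [("D. melanogaster", 0.0), ("A. mellifera", 1e-40), ("H. sapiens", 1e-12)]
    print(assign_phylostratum(example_hits))
```

The most distant species with an accepted hit determines the result, which is why a single spurious hit (for example from contamination) or a single missed homolog can shift a gene to an older or younger phylostratum.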

Other studies have found that bursts of gene founder events are linked to important evolutionary innovations such as the emergence of bilateral symmetry in animals, the emergence of multicellularity in streptophyte algae, the colonization of land by plants or the emergence of flowering plants.[2]

Since its introduction, genomic phylostratigraphy has been used regularly by the original research team[5] as well as by others.[6] One notable application has been the attempt to determine the origin of cancer genes: the results suggest a strong link between a peak in the formation of cancer genes and the transition to multicellular organisms, a connection that had previously been hypothesised and is thus further supported by phylostratigraphy. As its use has grown, the method has been assessed and enhanced on multiple occasions, and programs that run it automatically and more efficiently have been developed.[2][3]

One of the most prominent uses of genomic phylostratigraphy has been in inferring the correlation between phylogeny and developmental processes (often called the phylogeny-ontogeny correlation). Using the method, researchers have reported a significant phylogeny-ontogeny correlation in animals,[7] plants,[8] fungi,[9][10] and even bacterial biofilms.[11]
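Studies of this kind typically summarise each developmental stage with a transcriptome age index (TAI), the measure named in the title of the cited animal study.[7] As commonly defined, the TAI of a stage is the expression-weighted mean phylostratum of the genes expressed at that stage. The sketch below assumes that definition; the gene names, phylostratum ranks and expression values are invented for illustration, and the function name `transcriptome_age_index` is hypothetical.

```python
from typing import Dict, List


def transcriptome_age_index(
    phylostrata: Dict[str, int],
    expression: Dict[str, List[float]],
) -> List[float]:
    """Expression-weighted mean phylostratum for each developmental stage.

    `phylostrata` maps gene -> phylostratum rank (1 = oldest).
    `expression` maps gene -> expression level at each stage, in the same stage order.
    """
    n_stages = len(next(iter(expression.values())))
    tai = []
    for stage in range(n_stages):
        total = sum(levels[stage] for levels in expression.values())
        weighted = sum(
            phylostrata[gene] * levels[stage] for gene, levels in expression.items()
        )
        tai.append(weighted / total)
    return tai


# Toy data: three genes measured at three stages (all values invented).
toy_phylostrata = {"geneA": 1, "geneB": 3, "geneC": 5}
toy_expression = {
    "geneA": [10.0, 30.0, 10.0],
    "geneB": [10.0, 10.0, 10.0],
    "geneC": [20.0, 0.0, 20.0],
}

if __name__ == "__main__":
    print(transcriptome_age_index(toy_phylostrata, toy_expression))
```

With phylostratum 1 as the oldest rank, a lower value indicates that the stage is dominated by evolutionarily older genes. In the toy data the middle stage has the lowest index.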

Criticism

Although it is now used frequently by the scientific community, genomic phylostratigraphy has also been criticised as too inaccurate for its estimates to be trustworthy. First, according to some authors, several of its assumptions are imprecise.[12][13] For example, it is erroneous to assume that all species beyond the organism of focus share the same protein evolutionary rate; the rate varies depending on cell cycle speeds, which makes it difficult to set a BLAST threshold that captures all proteins descended from the same founder gene. Second, the BLAST search assumes that the evolutionary rate is constant across all sites of a protein, which is also false. Lastly, the model arguably does not account correctly for gene duplications and gene losses: changes in evolutionary rate that follow gene duplications and the resulting functional changes would increase BLAST error rates, and gene loss in taxa distant from the one studied could lead to large underestimates of the calculated gene age and phylostratum of founder genes compared with their true values. Rather than calling for the method to be abandoned, however, critics have worked on refining it by introducing other mathematical formulas or sequence search tools,[14] although researchers from the Ruđer Bošković Institute have replied that their original approach was valid and did not need extensive revision.[15] This debate is part of the wider discussion on the importance of de novo gene births in creating genetic diversity, in which genomic phylostratigraphy supports the view that they have a strong effect.

Gene founder events have also been proposed to represent not only de novo gene births but also substantial changes in a protein sequence that make it unrecognisable as related to other genes and allow it to acquire novel biological activities.[16] Either way, genomic phylostratigraphy assumes that the detection of such "founder events" entails evolutionary novelty. However, recent studies suggest that many genes restricted to certain taxonomic groups (called taxonomically restricted genes) can be explained by homology detection failure, that is, the inability of bioinformatic tools (BLAST or any other tool for detecting sequence homology) to trace homology when protein sequences are short and evolve quickly, which leads to gene untraceability.[17] Thus, many of the proposed "founder events" originally detected by genomic phylostratigraphy could be reinterpreted as methodological artifacts arising from the technical limits of detecting distantly related homologs. Homology detection failure can, however, be tested explicitly.[17] The tool GenEra can disentangle homology detection failure from potential gene founder events by explicitly testing for gene untraceability.[2]
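The intuition behind homology detection failure can be shown with a toy calculation. The sketch below assumes, purely for illustration, that the expected similarity score between two homologs decays exponentially with divergence time, S(t) = a·exp(−b·t), where a hypothetical parameter `a` grows with protein length and `b` with evolutionary rate. The function name, the parameter values and the score threshold are all invented, and the code is not the procedure used by GenEra or any other cited tool.

```python
import math


def max_traceable_distance(a: float, b: float, score_threshold: float) -> float:
    """Largest divergence time t at which a * exp(-b * t) still reaches the threshold.

    Assumes the toy decay model S(t) = a * exp(-b * t); returns 0.0 when the
    score is already below the threshold at t = 0.
    """
    if a <= score_threshold:
        return 0.0
    return math.log(a / score_threshold) / b


if __name__ == "__main__":
    threshold = 30.0  # hypothetical minimum score for a detectable hit
    # Long, slowly evolving protein versus short, fast-evolving protein
    # (all numbers invented for illustration).
    long_slow = max_traceable_distance(a=800.0, b=0.5, score_threshold=threshold)
    short_fast = max_traceable_distance(a=90.0, b=2.0, score_threshold=threshold)
    print(f"long, slow protein traceable up to t = {long_slow:.2f}")
    print(f"short, fast protein traceable up to t = {short_fast:.2f}")
```

Under this model a short, fast-evolving gene becomes undetectable at a much smaller divergence than a long, slowly evolving one, so it can appear younger than it is even if it was never born de novo.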

References

  1. Domazet-Lošo T, Brajković J, Tautz D (November 2007). "A phylostratigraphy approach to uncover the genomic history of major adaptations in metazoan lineages". Trends in Genetics. 23 (11): 533–539. doi:10.1016/j.tig.2007.08.014. PMID 18029048.
  2. Barrera-Redondo J, Lotharukpong JS, Drost HG, Coelho SM (24 March 2023). "Uncovering gene-family founder events during major evolutionary transitions in animals, plants and fungi using GenEra". Genome Biology. 24 (1): 1552–1554. doi:10.1186/s13059-023-02895-z. PMC 10037820. PMID 10037820.
  3. Arendsee Z, Li J, Singh U, Seetharam A, Dorman K, Wurtele ES (October 2019). "phylostratr: a framework for phylostratigraphy". Bioinformatics. 35 (19): 3617–3627. bioRxiv 10.1101/360164. doi:10.1093/bioinformatics/btz171. PMID 30873536.
  4. Buchfink B, Reuter K, Drost HG (April 2021). "Sensitive protein alignments at tree-of-life scale using DIAMOND". Nature Methods. 18 (4): 366–368. doi:10.1038/s41592-021-01101-x. PMC 8026399. PMID 33828273.
  5. Domazet-Lošo T, Tautz D (May 2010). "Phylostratigraphic tracking of cancer genes suggests a link to the emergence of multicellularity in metazoa". BMC Biology. 8 (1): 66. doi:10.1186/1741-7007-8-66. PMC 2880965. PMID 20492640.
  6. Zhang L, Tan Y, Fan S, Zhang X, Zhang Z (February 2019). "Phylostratigraphic analysis of gene co-expression network reveals the evolution of functional modules for ovarian cancer". Scientific Reports. 9 (1): 2623. Bibcode:2019NatSR...9.2623Z. doi:10.1038/s41598-019-40023-9. PMC 6384884. PMID 30796309.
  7. Domazet-Lošo T, Tautz D (December 2010). "A phylogenetically based transcriptome age index mirrors ontogenetic divergence patterns". Nature. 468 (7325): 815–818. Bibcode:2010Natur.468..815D. doi:10.1038/nature09632. PMID 21150997. S2CID 1417664.
  8. Quint M, Drost HG, Gabel A, Ullrich KK, Bönn M, Grosse I (October 2012). "A transcriptomic hourglass in plant embryogenesis". Nature. 490 (7418): 98–101. Bibcode:2012Natur.490...98Q. doi:10.1038/nature11394. PMID 22951968. S2CID 4404460.
  9. Cheng X, Hui JH, Lee YY, Wan Law PT, Kwan HS (June 2015). "A "developmental hourglass" in fungi". Molecular Biology and Evolution. 32 (6): 1556–1566. doi:10.1093/molbev/msv047. PMID 25725429.
  10. Xie Y, Kwan HS, Chan PL, Wu W, Chiou J, Chang J (2022-07-16). "The Phylotranscriptomic Hourglass Pattern in Fungi: An Updated Model". bioRxiv: 2022.07.14.500038. doi:10.1101/2022.07.14.500038. S2CID 250646039.
  11. Futo M, Opašić L, Koska S, Čorak N, Široki T, Ravikumar V, et al. (January 2021). "Embryo-Like Features in Developing Bacillus subtilis Biofilms". Molecular Biology and Evolution. 38 (1): 31–47. doi:10.1093/molbev/msaa217. PMC 7783165. PMID 32871001.
  12. Moyers BA, Zhang J (January 2015). "Phylostratigraphic bias creates spurious patterns of genome evolution". Molecular Biology and Evolution. 32 (1): 258–267. doi:10.1093/molbev/msu286. PMC 4271527. PMID 25312911.
  13. Casola C (November 2018). "From De Novo to "De Nono": The Majority of Novel Protein-Coding Genes Identified with Phylostratigraphy Are Old Genes or Recent Duplicates". Genome Biology and Evolution. 10 (11): 2906–2918. doi:10.1093/gbe/evy231. PMC 6239577. PMID 30346517.
  14. Moyers BA, Zhang J (August 2018). Martin B (ed.). "Toward Reducing Phylostratigraphic Errors and Biases". Genome Biology and Evolution. 10 (8): 2037–2048. doi:10.1093/gbe/evy161. PMC 6105108. PMID 30060201.
  15. Domazet-Lošo T, Carvunis AR, Albà MM, Šestak MS, Bakaric R, Neme R, Tautz D (April 2017). "No Evidence for Phylostratigraphic Bias Impacting Inferences on Patterns of Gene Emergence and Evolution". Molecular Biology and Evolution. 34 (4): 843–856. doi:10.1093/molbev/msw284. PMC 5400388. PMID 28087778.
  16. Tautz D, Domazet-Lošo T (October 2011). "The evolutionary origin of orphan genes". Nature Reviews Genetics. 12 (10): 692–702. doi:10.1038/nrg3053.
  17. Weisman CM, Murray AW, Eddy SR (2 November 2020). "Many, but not all, lineage-specific genes can be explained by homology detection failure". PLOS Biology. 18 (11): e3000862. doi:10.1371/journal.pbio.3000862. PMC 7660931. PMID 33137085.