Posts by James McInerney
Dissecting Phylogenetic Support: Unified Decay Indices, AU Tests, and Branch-Site Specific Visualizations. www.biorxiv.org/content/10.64898/2025.12...
What's even wilder is that models like scGPT and Geneformer are now doing this at the single-cell level — learning gene regulatory patterns from raw expression data with zero biological annotation. They're rediscovering cell types and disease states purely from the structure of the data.
A transformer model trained on raw DNA, with no biological instruction whatsoever, learned to distinguish between genetic codes. Nobody taught it what a codon is. That is just “wow”.
We have a new Masters programme in Liverpool. Please pass on to anybody you think might be interested: news.liverpool.ac.uk/2026/02/13/n...
This looks really great. I’ve been following this for a while.
On this last point: the implication is (and I wrote this in the paper) that evolution is not primarily optimising biochemical (or other) activity, but context-dependent function. And this is liable to change often.
those who felt word meanings came from definitions, and those who thought word meaning emerged from context. It just struck a chord & I started from there. I think gene "function" emerges from context (function ≠ biochemical activity, btw) and for the next ten years, I plan on focussing on context.
The entire thing came from watching a YouTube interview with Geoffrey Hinton (en.wikipedia.org/wiki/Geoffre...), who was talking about how the field of language modelling was in two camps 2/3
I'm thankful to MBE for publishing this paper. I wasn't sure anybody would. The handling editor (who I now know was Jeff Townsend) was great, as were the 2 reviewers. It is not an outcome/product/discovery, it is my way of thinking about HGT & pangenomes 1/3 academic.oup.com/mbe/article/...
That looks really cool, Stu. Will be reading it properly later today.
True story.
@jomcinerney.bsky.social proposes that genomes do not encode fixed functions but rather “probability distributions” over functional and phenotypic outcomes, and introduces “genomic perplexity” as a measure of gene-context incompatibility.
🔗 doi.org/10.1093/molbev/msag041
#evobio #molbio
@guigau.bsky.social Saw PanGBank earlier. Very nice 👍
This looks absolutely great. For those of us interested in pangenomes, I am sure this will be a super place to get data and the interface is very clean (plotly). Congrats to the authors (I don't know if they are on bsky): pangbank.genoscope.cns.fr
But it is an eight-year-old version of EndNote, and Word is a similar vintage. I think me and them are fused together at this stage. Late-stage calcification 😜
Thanks for the tip. I have done what many people do: bought the book and only part-read it. However, I do need to think more deeply about correcting for phylogeny, so this is a great tip. Many thanks.👍
It’s the power ballads that were the hardest part. Nobody needs to listen to Van Halen at 2pm on a Tuesday.
I'm in a coffee shop wrestling with EndNote, while 1980s power ballads blast out from a giant speaker. The sun is shining for the first time this year, but I've got paper submission deadlines. I hope nobody sees me cry.
Thanks to The Leverhulme Trust (RF-2023-408) for supporting this work, and to the reviewers and associate editor whose feedback greatly improved the manuscript.
Paper: doi.org/10.1093/molbev/msag041
This connects to real tools. Transformer-based genome models (DNABERT, Evo) can calculate perplexity directly. AlphaFold confidence scores estimate structural perplexity. Flux Balance Analysis handles metabolic perplexity. The framework is testable now.
The practical shift: instead of asking "what does this gene do?" we should ask "what can this gene become?" Synthetic biology, antimicrobial development, and evolutionary prediction all become questions of context engineering rather than gene optimisation alone.
It also explains open vs closed pangenomes. Open pangenomes (like E. coli) arise when large population sizes can detect small fitness advantages and high environmental variability creates many contexts where accessory genes pay off - despite integration costs.
The framework predicts pangenome structure. Core genes = low perplexity across contexts. Rare accessory genes = high perplexity generally, but strong benefits in specific contexts. The U-shaped frequency distribution falls out naturally.
This explains why HGT has a fitness cost — not because transferred genes are broken, but because they arrive into a genome optimised for different statistical patterns. Over time, codon adaptation, regulatory rewiring, and compensatory mutations reduce perplexity. The gene becomes "expected."
Perplexity operates across multiple dimensions: codon usage, protein structure, regulatory compatibility, metabolic integration, protein-protein interactions, chromosomal organisation, and gene co-occurrence patterns. Each contributes to the fitness cost of genomic novelty.
This leads to the concept of "genomic perplexity" — borrowed from information theory. Perplexity measures how "surprised" a model is by a sequence. A horizontally transferred gene landing in a new genome is a high-perplexity token — statistically unexpected in that context.
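As a rough illustration of the idea above (a sketch, not the method used in the paper): perplexity is the exponential of the average negative log-probability a model assigns to each token. The toy "model" below is just a smoothed codon-frequency table estimated from a host's genes; the function names, the add-one smoothing, and the example sequences are all hypothetical, and sequences are assumed to be in-frame and to contain only A, C, G and T.

```python
import math
from collections import Counter

BASES = "ACGT"
ALL_CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]


def split_codons(seq):
    """Split an in-frame sequence into codons, dropping any trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def codon_frequencies(genes, pseudocount=1.0):
    """Estimate codon probabilities from host genes, with additive smoothing
    so that codons never seen in the host still get a non-zero probability."""
    counts = Counter()
    for seq in genes:
        counts.update(split_codons(seq))
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}


def perplexity(seq, probs):
    """Perplexity = exp(-(1/N) * sum(log p(codon))) over the N codons in seq."""
    codons = split_codons(seq)
    if not codons:
        raise ValueError("sequence must contain at least one full codon")
    log_sum = sum(math.log(probs[c]) for c in codons)
    return math.exp(-log_sum / len(codons))


# Hypothetical toy data: a "host" that strongly prefers GCT,
# a gene that matches that usage, and one that does not.
host_genes = ["ATGGCTGCTGCTAAA", "ATGGCTAAAGCTGCT"]
model = codon_frequencies(host_genes)
print("native-like gene:", perplexity("ATGGCTGCTAAA", model))
print("foreign-like gene:", perplexity("ATGCGACGACGA", model))
```

The gene built from codons the host rarely uses scores the higher perplexity. A transformer-based genome model would replace the frequency table with context-dependent probabilities, but the final calculation has the same form.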
I propose that evolution shapes genomes not to encode fixed functions, but to optimise probability distributions of functional outcomes across the contexts organisms actually encounter. Selection acts on these distributions, not on singular gene activities.
Modern language models (transformers) succeed because they learn probability distributions over outcomes given context. They use "attention": each word's contribution depends on the other words in the sequence. Epistasis is the biological equivalent. A gene's effect depends on what else is in the genome.
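To make the attention idea concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is a simplification under stated assumptions: real transformers apply learned query, key and value projections, which are replaced here by the raw input vectors, and the two-dimensional "tokens" are hypothetical. The point is only that the same token yields a different output when its neighbours change.

```python
import math


def softmax(scores):
    """Convert raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections
    (an assumption made for brevity; real models learn Q, K and V)."""
    dim = len(vectors[0])
    outputs, weights = [], []
    for query in vectors:
        # Similarity of this token to every token in the context.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        row = softmax(scores)
        weights.append(row)
        # Output is the attention-weighted mix of all token vectors.
        outputs.append(
            [sum(w * v[i] for w, v in zip(row, vectors)) for i in range(dim)]
        )
    return outputs, weights


# The same first token, placed in two different contexts.
out_a, _ = self_attention([[1.0, 0.0], [0.0, 1.0]])
out_b, _ = self_attention([[1.0, 0.0], [1.0, 0.0]])
print("token 0 with a different neighbour:", out_a[0])
print("token 0 with an identical neighbour:", out_b[0])
```

In the analogy drawn in the post, the tokens would be genes and the neighbours the rest of the genome: the representation of one element is always computed relative to what surrounds it.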