We are excited to share GPN-Star, a cost-effective, biologically grounded genomic language modeling framework that achieves state-of-the-art performance across a wide range of variant effect prediction tasks relevant to human genetics.
www.biorxiv.org/content/10.1...
(1/n)
Posts by Bin Shao
Big congrats, Yunha!
Interesting work on plasmid engineering.
All NIH study sections canceled indefinitely. This will halt science and devastate research budgets in universities.
This gives me such hope for biodiversity conservation, mammals and future mammalogists! Go young people!! 🧪
A new paper from Lingchong You's group develops a cool amplification circuit that expands the dynamic range of plasmid transfer #ChemBio #synbio #microsky
www.nature.com/articles/s41...
Recruiting PhD students: our research covers language model + genomics + systems biology: scholar.google.com/citations?us...
1. Four-year PhD program in Beijing
2. Master's degree required
3. Start date: Sep 2025
Please DM if you are interested.
After 24 years of work, I’m thrilled to announce the TYMEFLIES dataset, which comprises metagenomes from Lake Mendota (Madison, WI), collected roughly every 10 days (471 samples) for 20 years! @quendi.bsky.social @robinrohwer.bsky.social
rdcu.be/d5put
A thread…
We deeply appreciate the experimental studies that have made this work possible! Please check our github for more details: github.com/lingxusb/TXp...
We hope this work will be a useful tool. Feedback is welcome! Please feel free to try our Colab notebook to predict transcriptomes at (almost) zero cost! It takes about 20 minutes for a genome with 4k genes: colab.research.google.com/drive/1Kd-QI...
TXpredict captures variations in gene expression both across different protein functional groups and within the same functional group.
We further used TXpredict to predict the expression of 3.1M genes across a collection of 900 microbial genomes. Small clusters of ribosomal genes were located at the periphery of the t-SNE plot of all genes and showed high predicted expression.
Our model leverages information learned from the ESM2 model and basic protein statistics to predict genome-wide gene expression. It achieves an average Spearman correlation of 0.53 in predicting gene expression for bacterial genomes that are not in the training dataset.
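For readers unfamiliar with the metric: a minimal sketch of how an average Spearman correlation across held-out genomes could be computed. This is not the TXpredict evaluation code; the genome names, the expression values, and the helpers `average_ranks` and `spearman` are hypothetical and only illustrate the idea of ranking observed and predicted expression within each genome and then averaging the per-genome correlations.

```python
import numpy as np

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    # Replace the ranks of tied values with their group mean.
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return rank_sums[inverse] / counts[inverse]

def spearman(observed, predicted):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return float(np.corrcoef(average_ranks(observed), average_ranks(predicted))[0, 1])

# Hypothetical (observed, predicted) expression values for two held-out genomes.
genomes = {
    "genome_A": ([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 1.0, 4.0, 3.0, 5.0]),
    "genome_B": ([0.5, 1.5, 2.5, 3.5], [0.1, 0.2, 0.3, 0.4]),
}

per_genome = {name: spearman(obs, pred) for name, (obs, pred) in genomes.items()}
mean_rho = sum(per_genome.values()) / len(per_genome)
print(per_genome, mean_rho)
```

Because the correlation is computed on ranks, it rewards getting the ordering of genes right within a genome rather than matching absolute expression levels.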
Is it possible to get the transcriptome of any sequenced microbe without doing the experiments? Happy to introduce TXpredict, a transcriptome prediction tool that generalizes to novel microbial genomes: www.biorxiv.org/content/10.1...
Predicting microbial transcriptome using genome sequence www.biorxiv.org/content/10.1101/2024.12....
9/n We envision EcoVAE will advance biodiversity investigations, especially in under-sampled regions and ultimately support global biodiversity monitoring efforts🙏
💻Code is publicly available: github.com/lingxusb/Eco...
8/n 🧩 EcoVAE can also interpolate missing occurrences. For example: In North America, EcoVAE predictions for Sassafras largely overlapped with iNaturalist records. In South Asia, EcoVAE highlighted a wider distribution of Desmodium, consistent with field surveys.
7/n 🌍Where is biodiversity under-sampled? We found that regions with high prediction error overlap with known "darkspots" of biodiversity collection. For example, the highest prediction errors for plants were observed in South Asia, Southeast Asia, the Middle East, and Central Africa.
6/n 🦋EcoVAE isn’t limited to plants. The model generalizes well to other taxa, including butterflies and mammals, showcasing its versatility across ecosystems.
5/n🖥️Remarkably, EcoVAE can predict species distributions even with sparse inputs. With just 20% of input data, it achieved an AUROC of 0.78, effectively identifying the locations of missing genera.
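As a reminder of what the AUROC measures here: a small sketch, assuming binary labels (1 = a withheld record is truly present at a location, 0 = absent) and model scores. The labels, scores, and the `auroc` helper are hypothetical and are not taken from the EcoVAE codebase. The AUROC is the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half.

```python
import numpy as np

def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    correct = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(correct / (len(pos) * len(neg)))

# Hypothetical example: 2 locations with a missing genus, 3 without.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.5, 0.1]
auc = auroc(labels, scores)
print(auc)
```

The pairwise form is quadratic in the number of locations, so it only suits small examples; a rank-based formula gives the same value at scale.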
4/n🌍 We withheld data from three independent regions to test its generalization. The model reconstructed species distributions effectively—even for withheld test regions—and predicted the location of missing records at genus and species levels.
3/n 🚀We leverage a VAE structure that enables fast and scalable modeling of species distribution patterns. In training, we masked 50% of species records and tasked the model with reconstructing the full species distribution, mimicking real-world biodiversity sampling.
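The masking step can be illustrated with a short sketch. It assumes occurrences are stored as a binary grid-cell by species matrix, which is an assumption made for illustration; the function `mask_records`, the matrix shape, and the random data are hypothetical and do not come from the EcoVAE repository. The masked matrix would serve as the model input and the full matrix as the reconstruction target.

```python
import numpy as np

def mask_records(occurrence, mask_fraction=0.5, rng=None):
    """Hide a random fraction of the presence records (1s) in a binary matrix."""
    rng = np.random.default_rng() if rng is None else rng
    occurrence = np.asarray(occurrence)
    masked = occurrence.copy()
    present = np.argwhere(occurrence == 1)          # (row, col) of each record
    n_hide = int(round(mask_fraction * len(present)))
    hide = present[rng.choice(len(present), size=n_hide, replace=False)]
    masked[hide[:, 0], hide[:, 1]] = 0              # simulate unsampled records
    return masked

rng = np.random.default_rng(0)
# Hypothetical data: 6 grid cells x 10 species, roughly 40% presences.
full = (rng.random((6, 10)) < 0.4).astype(int)
partial = mask_records(full, mask_fraction=0.5, rng=rng)

# Training pair: input = partial (masked), target = full (unmasked).
print(full.sum(), partial.sum())
```

Only presences are removed, never added, so the masked input is always a subset of the true records, much like an incompletely surveyed region.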
2/n 🌿Biodiversity is under immense pressure. Predicting global species distributions at scale is critical, but traditional species distribution models struggle with massive datasets (e.g., >33M records and >127K species of plants) and interspecies interactions.
🌏What happens when generative AI meets ecology? How can we use AI to advance biodiversity exploration and monitoring?
Excited to introduce EcoVAE, a generative approach trained on over 100 million high-quality vouchered records to model global biodiversity
www.biorxiv.org/content/10.1...
1/n🧵
Preprint alert! A thread is coming soon.
The third edition of my textbook, Nonlinear Dynamics and Chaos, was published today. You can preview the first 68 pages on Google Books, or take a look at the preface below to see what's new. The main new thing is a chapter on the Kuramoto model! Hope you enjoy it.
Two BioML starter packs now:
Pack 1: go.bsky.app/2VWBcCd
Pack 2: go.bsky.app/Bw84Hmc
DM if you want to be included (or nominate people who should be!)