Thanks to my amazing collaborators: @samsja19.bsky.social , Johannes Hagemann, @shangshang-wang.bsky.social , Jason Wiemels, Jeff Kaufman, and @willieneis.bsky.social
Special shout out to the Nucleic Acid Observatory for the sequencing data, and @PrimeIntellect for compute support.
Posts by Ollie Liu
We’re sharing METAGENE-1’s:
📄Paper: metagene.ai/metagene-1-p...
🌐Website: metagene.ai
🤗Model weights: huggingface.co/metagene-ai
🧵7/
🛡Tailored for detection, not design. We scoped METAGENE-1 to minimize risks while maximizing potential for public health and biosurveillance. Responsible open-sourcing matters. With open weights, we aim to drive progress in interpretability and safe genomics research.
🧵6/
📈METAGENE-1 achieves state-of-the-art results in:
- Pathogen detection
- Genomic embedding benchmarks
- Generalization to multi-species tasks
It already shows promise in public health and biosurveillance, and we are collaborating with experts to unlock its full impact.
🧵5/
The METAGENE-1 model is a 7B-parameter Llama-style transformer 🦙, pretrained and optimized for anomaly detection, embedding, and multi-species genomics. Fully compatible with 🤗Hugging Face (huggingface.co/metagene-ai) – ready to use like any of your favorite LLMs!
🧵4/
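The post above mentions using METAGENE-1 for embeddings; the thread doesn't spell out the pooling recipe, so here is a hedged, dependency-free sketch of one common approach (masked mean pooling over per-token hidden states to get a fixed-size vector per read). The shapes and the `mean_pool` name are illustrative assumptions, not the project's actual API:

```python
def mean_pool(hidden_states, attention_mask):
    """Masked mean pooling: average per-token hidden states over real
    (non-padding) positions to get one fixed-size embedding per read.

    hidden_states:  batch x seq_len x dim, nested lists of floats
    attention_mask: batch x seq_len, 1 for real tokens, 0 for padding
    """
    embeddings = []
    for tokens, mask in zip(hidden_states, attention_mask):
        # Keep only the vectors at unmasked (real-token) positions.
        kept = [vec for vec, m in zip(tokens, mask) if m]
        n = max(len(kept), 1)  # avoid divide-by-zero on all-padding rows
        dim = len(tokens[0])
        embeddings.append([sum(vec[d] for vec in kept) / n for d in range(dim)])
    return embeddings

# Hypothetical batch: 2 reads, 4 token positions, hidden dim 3.
h = [[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]],
     [[12, 13, 14], [15, 16, 17], [18, 19, 20], [21, 22, 23]]]
m = [[1, 1, 1, 0], [1, 1, 0, 0]]
emb = mean_pool(h, m)  # two fixed-size vectors, one per read
```

With real model outputs you would feed the final-layer hidden states and the tokenizer's attention mask into the same pooling step.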
📊The data behind METAGENE-1:
- Brand-new dataset collected with experts from Southern California & Missouri
- 1.5 trillion base pairs from diverse wastewater samples
- Short reads (100–300 BPs), deep sequencing at scale
- Byte-Pair Encoding customized for genomic sequences
🧵3/
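The data post above mentions Byte-Pair Encoding customized for genomic sequences. As a toy illustration only (the actual METAGENE-1 vocabulary and merge rules are not reproduced here), this sketch learns BPE merges from scratch over short A/C/G/T reads:

```python
# Toy BPE over DNA reads: repeatedly merge the most frequent adjacent
# token pair, starting from single-base tokens. Illustrative only.
from collections import Counter

def learn_bpe_merges(reads, num_merges):
    """Learn merge rules from a corpus of reads (strings over A/C/G/T)."""
    corpus = [list(read) for read in reads]  # start from single bases
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for toks in corpus:
            for a, b in zip(toks, toks[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        corpus = [_apply_merge(toks, best) for toks in corpus]
    return merges

def _apply_merge(toks, pair):
    """Replace every occurrence of `pair` with the fused token."""
    out, j = [], 0
    while j < len(toks):
        if j + 1 < len(toks) and (toks[j], toks[j + 1]) == pair:
            out.append(toks[j] + toks[j + 1])
            j += 2
        else:
            out.append(toks[j])
            j += 1
    return out

def bpe_tokenize(read, merges):
    """Apply learned merges, in order, to a single read."""
    toks = list(read)
    for pair in merges:
        toks = _apply_merge(toks, pair)
    return toks
```

Learned merges turn frequent motifs into single tokens, so a 100–300 bp read compresses into far fewer tokens than one-token-per-base.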
Why is METAGENE-1 special? 🤔We trained it on wastewater metagenomics, capturing the human-adjacent microbiome across the US over the past 12 months. This unlocks powerful capabilities for early pathogen detection and for understanding microbial ecosystems. 🌱🦠
🌐Website: metagene.ai
🧵2/
Introducing METAGENE-1🧬, an open-source 7B-parameter metagenomics foundation model pretrained on 1.5 trillion base pairs. Built for pandemic monitoring, pathogen detection, and biosurveillance, with SOTA results across many genomics tasks.
🧵1/
Landed in Vancouver to attend #NeurIPS :-) Excited to chat about multimodal models, AI4Science, decision making, and more!
Let's go! We are releasing SmolVLM, a smol 2B VLM built for on-device inference that outperforms all models at similar GPU RAM usage and token throughput.
SmolVLM can be fine-tuned in a Google Colab and run on a laptop! Or process millions of documents with a consumer GPU!
👋 nlp@usc student. thanks!
tfw you realize that this isn't an alt twitter for academic posting but an alt insta for cute doggos.
this is doodle, our border collie pup who is often used as an adversarial attack on image classification models (they classify him as a corgi :-)
yes please if there's still space left :-P
our border collie pup doodle absolutely wants nothing from that plate of banana :-P