
Posts by April Wei

Thanks 😇

1 week ago 0 0 0 0

The principles of using variant sharing and the inheritance hierarchy to compress the data would not change. How well it works compared to humans is an empirical question.

1 week ago 1 0 2 0

Thanks. Will take a look!
In humans, downstream computation becomes a hurdle when the number of genomes exceeds tens of thousands. But GRG can still help when the number of genomes is much smaller.

1 week ago 1 0 2 0

We would be curious to see its performance if you can point us to a sufficiently large public dataset.

1 week ago 0 0 1 0

Taken together, GRG v2 and grapp demonstrate that moving from tabular to GRG-based representations can deliver substantial gains in speed, memory, and cost, while leveraging the rich Python scientific computing ecosystem.

Huge thanks to the whole team (Drew DeHaas, Chris Adonizio, Ziqing Pan) 💙

1 week ago 3 1 0 0

We introduce grapp, a collection of GRG-based command-line tools that resemble PLINK2: variant and sample filtering, GWAS with covariates, PCA, and data export as native graph operations. Routine analyses can now be done easily and orders of magnitude faster with grapp, with minimal upfront cost.

1 week ago 7 4 1 0

This scalability also enables a leave-one-chromosome-out approach (LOCO) to GWAS covariate construction that avoids LD artifacts (later PCs capture local LD) without requiring LD pruning. Once computation is no longer the limit, methods can be chosen on statistical grounds rather than feasibility.

1 week ago 7 1 1 0
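
The leave-one-chromosome-out idea in the post above can be sketched in a few lines. This is a toy illustration only: it uses a small dense NumPy matrix and `np.linalg.svd`, not GRG or grapp, and the names `loco_pcs`, `X`, and `chrom` are hypothetical.

```python
import numpy as np

def loco_pcs(X, chrom, k):
    """For each chromosome c, compute top-k PCs from variants NOT on c.

    X     : (n_samples, n_variants) standardized genotype matrix (dense toy).
    chrom : (n_variants,) chromosome label of each variant.
    """
    pcs = {}
    for c in np.unique(chrom):
        X_rest = X[:, chrom != c]                     # drop chromosome c
        U, S, _ = np.linalg.svd(X_rest, full_matrices=False)
        pcs[c] = U[:, :k] * S[:k]                     # sample scores on top-k PCs
    return pcs

# Toy data: 40 samples, 3 chromosomes with 50 variants each.
rng = np.random.default_rng(0)
G = rng.binomial(2, 0.3, size=(40, 150)).astype(float)
X = (G - G.mean(axis=0)) / G.std(axis=0)
chrom = np.repeat([1, 2, 3], 50)

pcs = loco_pcs(X, chrom, k=2)
# When testing a variant on chromosome 1, use pcs[1] as covariates.
```

Because the PCs used for a given chromosome never see that chromosome's variants, local LD on it cannot leak into the covariates.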

Using these operators, scipy-based PCA can be implemented in four lines of Python. PCA on 89M variants takes 2–4 hours, 51–492× faster than existing methods.

1 week ago 4 1 1 0
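
To give a feel for how short operator-based PCA can be, here is a minimal sketch using `scipy.sparse.linalg.svds`. It is not the grapp implementation: a dense toy matrix wrapped with `aslinearoperator` stands in for the GRG-backed operator, whose actual API is not shown in these posts.

```python
import numpy as np
from scipy.sparse.linalg import aslinearoperator, svds

# Toy standardized genotype matrix (60 samples x 200 variants).
rng = np.random.default_rng(0)
G = rng.binomial(2, 0.3, size=(60, 200)).astype(float)
X = (G - G.mean(axis=0)) / G.std(axis=0)

# Stand-in for an implicit operator: svds only needs matvec/rmatvec.
X_op = aslinearoperator(X)

U, S, Vt = svds(X_op, k=4)            # top-4 singular triplets
order = np.argsort(S)[::-1]           # svds does not guarantee descending order
pcs = U[:, order] * S[order]          # sample scores on the top 4 PCs
```

The point of the design is that `svds` never asks for the matrix entries, only for products with vectors, so any object that can multiply implicitly can be dropped in.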

We also provide linear operators compatible with SciPy’s sparse linear algebra interface, enabling extremely efficient implicit multiplication against the standardized genotype matrix, the linkage disequilibrium (LD) matrix, and the genetic relatedness matrix, none of which are ever materialized.

1 week ago 4 1 1 0
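
As a rough illustration of what "implicit multiplication" means, the sketch below builds a `scipy.sparse.linalg.LinearOperator` for the standardized genotype matrix without ever forming it. It assumes a SciPy sparse matrix as a stand-in for the GRG; `standardized_operator` is a hypothetical helper, not part of GRG or grapp.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import LinearOperator

def standardized_operator(G):
    """Operator for X = (G - 1 mu^T) / sigma, with X never materialized.

    G : sparse (n_samples, n_variants) allele-count matrix (toy stand-in).
    """
    n, m = G.shape
    mu = np.asarray(G.mean(axis=0)).ravel()
    mean_sq = np.asarray(G.multiply(G).mean(axis=0)).ravel()
    sigma = np.sqrt(np.maximum(mean_sq - mu**2, 0.0))
    sigma[sigma < 1e-12] = 1.0                 # guard monomorphic variants

    def matvec(v):                             # X @ v
        w = np.ravel(v) / sigma
        return G @ w - mu @ w

    def rmatvec(u):                            # X.T @ u
        u = np.ravel(u)
        return (G.T @ u - mu * u.sum()) / sigma

    return LinearOperator((n, m), matvec=matvec, rmatvec=rmatvec,
                          dtype=np.float64)

# Toy usage: 50 samples, 20 variants.
rng = np.random.default_rng(0)
G = csr_matrix(rng.binomial(2, 0.3, size=(50, 20)).astype(float))
X_op = standardized_operator(G)
y = X_op.matvec(np.ones(20))                   # one implicit product
```

Centering would destroy sparsity if applied directly, so the operator keeps the sparse counts and applies the mean and scale corrections on the vector side instead.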

GRG is now the smallest practical phased genotype format. Applied to the UK Biobank WGS dataset (490,541 individuals; 706,556,181 variants), GRG v2 produces files 25× smaller than .vcf.gz (122GB vs. 3TB) and more than 8× smaller than PLINK2’s PGEN, at a total construction cost of less than 90 GBP.

1 week ago 3 1 1 0
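
A quick sanity check of the size figures quoted above, assuming decimal units (1 TB = 1000 GB). The PGEN value is only the lower bound implied by "more than 8× smaller", not a reported file size.

```python
vcf_gb = 3000.0            # .vcf.gz size from the post: 3 TB (decimal units assumed)
grg_gb = 122.0             # GRG v2 size from the post
n_individuals = 490_541    # UK Biobank WGS individuals from the post

ratio = vcf_gb / grg_gb                          # about 24.6, quoted as 25x
pgen_lower_bound_gb = 8 * grg_gb                 # PGEN must exceed this size
grg_mb_per_individual = grg_gb * 1000 / n_individuals

print(f"vcf.gz / GRG: {ratio:.1f}x")
print(f"implied PGEN size: > {pgen_lower_bound_gb:.0f} GB")
print(f"GRG storage per individual: {grg_mb_per_individual:.2f} MB")
```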

Here, we introduce a new construction algorithm that reduces construction time by 10–20×, halves the disk and RAM footprint, and improves load time by more than 20× relative to v1. GRG construction is now so fast that building a GRG directly from .vcf.gz can be faster than .vcf.gz to PGEN (PLINK2).

1 week ago 8 1 1 0
Fast phenotype simulation for genotype representation graphs. Abstract. Motivation: The Genotype Representation Graph (GRG) is a graph representation of whole genome polymorphisms, designed to encode the variant hard-ca

Since then, we have been working towards removing the barriers to broader adoption of GRG by both method developers and empirical researchers. We started with phenotype simulations academic.oup.com/bioinformati... and showed GRG enables orders of magnitude faster simulation than ARGs.

1 week ago 4 0 1 0
Enabling efficient analysis of biobank-scale data with genotype representation graphs - Nature Computational Science The genotype representation graph (GRG) is a compact data structure that encodes 200,000 human genomes in just 5–26 gigabytes per chromosome. Computation on GRG via graph traversal greatly accelerates...

The GRG is an ARG-motivated representation that compactly and losslessly encodes the genotypes. It is a file format and a computational data structure. ~2y ago www.nature.com/articles/s43..., we introduced GRG, its relation to ARG, a construction algorithm, GWAS, and its scalability promise.

1 week ago 5 1 1 0
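
The "compact and lossless" idea can be illustrated with a toy graph. This is a hand-built sketch, not the GRG file format or construction algorithm: a mutation attached to a node is carried by every sample below that node, so shared variants are stored once. All node ids and mutation names are made up.

```python
import numpy as np

n_samples = 4                        # leaves 0..3 are haploid samples
children = {                         # internal node -> child nodes
    4: [0, 1],
    5: [2, 3],
    6: [4, 5],
}
mutation_node = {                    # mutation -> node it is attached to
    "m0": 6, "m1": 6, "m2": 4, "m3": 4, "m4": 5, "m5": 3,
}

def samples_below(node):
    """All sample leaves reachable from `node`."""
    if node < n_samples:
        return {node}
    out = set()
    for child in children[node]:
        out |= samples_below(child)
    return out

def to_matrix():
    """Expand the graph back into the full (samples x mutations) matrix."""
    muts = sorted(mutation_node)
    G = np.zeros((n_samples, len(muts)), dtype=np.int8)
    for j, m in enumerate(muts):
        for s in samples_below(mutation_node[m]):
            G[s, j] = 1
    return G

G = to_matrix()
n_edges = sum(len(c) for c in children.values())
stored = n_edges + len(mutation_node)    # items kept in the graph
nonzeros = int(G.sum())                  # items a sparse matrix would keep
print(stored, nonzeros)
```

Every genotype is recoverable from the graph, and the gap between `stored` and `nonzeros` grows as more mutations share the same internal nodes.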

Very proud to share our new work on General, orders-of-magnitude faster whole-genome analysis with genotype representation graphs (GRG). We topped ourselves in this one 🚀 and made GRG a practical foundation for biobank-scale population and statistical genetics. www.biorxiv.org/content/10.6...

1 week ago 42 18 2 3

📊 Explore the latest from Bioinformatics Advances: "Fast phenotype simulation for genotype representation graphs" 

Read the full paper here: https://doi.org/10.1093/bioadv/vbag040

Authors include: @aprilwei.bsky.social

2 months ago 5 1 1 0

The method works in both simple and complex scenarios, jointly estimating epoch time, population size, migration rate (symmetric or asymmetric), growth rate, and admixture proportion. Software integrated with msprime, demes, tsinfer/tsdate, relate, and singer. github.com/aprilweilab/...

6 months ago 4 4 0 0
Inference of complex demographic history using composite likelihood based on whole-genome genealogies Accurate parametric inference on complex demographic models is a continuing challenge in population genetics. Ancestral recombination graphs (ARGs) provide richer information than simple population ge...

Excited to preprint our latest work (w/ Drew DeHaas, Zhibai Jia, Leo Speidel) on using ARGs for demographic inference. w/ applications using data from 1000 Genomes Project. www.biorxiv.org/content/10.1...

6 months ago 45 23 1 0
Fast Phenotype Simulation for Genotype Representation Graphs Motivation The Genotype Representation Graph (GRG) [[DeHaas et al., 2025][1]] is a graph representation of whole genome polymorphisms, designed to encode the variant hard-call information in phased wh...

Very proud of this manuscript with two talented undergraduate students, Aditya Syam and Chris Adonizio. We are continuing to push towards more scalable statistical genetics with Genotype Representation Graphs, and this is the start. www.biorxiv.org/content/10.1...

7 months ago 11 6 0 0
IGD: A simple, efficient genotype data format Motivation While there are a variety of file formats for storing reference-sequence-aligned genotype data, many are complex or inefficient. Programming language support for such formats is often limit...

Our work (by Drew DeHaas) on an extremely simple yet efficient binary genotype format - designed to facilitate scalable bioinformatics tool development. www.biorxiv.org/content/10.1...

1 year ago 10 10 0 0
Biologically inspired graphs to explore massive genetic datasets - Nature Computational Science A recent study proposes a data structure that addresses crucial challenges related to storage and computation of large genome databases.

📢In a recent News & Views, @ryanlayer.bsky.social discusses a data structure introduced by @aprilwei.bsky.social and colleagues for reducing storage and computational costs for phased whole-genome polymorphisms. www.nature.com/articles/s43...

🔓https://rdcu.be/d8ay3

1 year ago 6 2 0 1
Enabling efficient analysis of biobank-scale data with genotype representation graphs Nature Computational Science - The genotype representation graph (GRG) is a compact data structure that encodes 200,000 human genomes in just 5–26 gigabytes per chromosome. Computation...

Link to pdf. www.nature.com/articles/s43...

1 year ago 2 2 0 0
Wei Lab Web site created using create-react-app

My lab (aprilweilab.github.io) continues to develop GRG and ARG related methods & more. We are looking for a postdoc to join us.

1 year ago 4 5 1 0

Our work w/ two co-first authors Drew DeHaas and Ziqing Pan is now published. GRG allows large amounts of WGS polymorphism data to be analyzed in RAM via graph traversal & algebra operations & has some intrinsic connection w/ popgen data generating process & is different from ARG

1 year ago 21 11 1 0
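
A minimal sketch of what "graph traversal & algebra operations" can look like: one bottom-up pass over a toy graph yields the product of the transposed genotype matrix with a per-sample vector, without expanding any genotypes. The graph, node ids, and the function `transpose_product` are hypothetical, and the toy assumes each node reaches each sample by at most one path.

```python
import numpy as np

n_samples = 4                                  # leaves 0..3 are haploid samples
children = {4: [0, 1], 5: [2, 3], 6: [4, 5]}   # internal node -> child nodes
mutation_node = {"m0": 6, "m1": 4, "m2": 5, "m3": 3}

def transpose_product(y):
    """For each mutation, sum y over the samples that carry it (G^T y)."""
    value = {s: float(y[s]) for s in range(n_samples)}
    # Node ids are assigned so that children always have smaller ids than
    # their parents, which makes sorted order a valid bottom-up order.
    for node in sorted(children):
        value[node] = sum(value[c] for c in children[node])
    return {m: value[node] for m, node in mutation_node.items()}

# With a vector of ones the result is the allele count of each mutation.
allele_counts = transpose_product(np.ones(n_samples))

# With a phenotype vector it is the per-variant sum needed by association tests.
phenotype_sums = transpose_product(np.array([1.0, 2.0, 3.0, 4.0]))
```

Each internal node is visited once, so the cost scales with the number of graph edges rather than with the number of nonzero genotypes.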

Thanks, Alison. (That's me logging in 6mo later 😂)

1 year ago 1 0 0 0
Genotype Representation Graphs: Enabling Efficient Analysis of Biobank-Scale Data bioRxiv - the preprint server for biology, operated by Cold Spring Harbor Laboratory, a research and educational institution

We introduced an ARG-inspired data structure, Genotype Representation Graph (GRG), to enable lossless data compression and efficient computation through graph traversal. Developed a fast inference method. Cost ~80 GBP to convert 350TB VCF (200,000 UKBiobank WGS) into 160 GB GRG.
t.co/0badfCYz47

1 year ago 4 0 1 0