I've recently made some changes to the gallery of all my data visualisations 📊
Most examples now have links to the underlying source code! If you see something you like, you can see how it was made 🤩
#DataViz #RStats
Posts by Cristian Pattaro
Our review "Integrating genetic data with biological insight: A practical guide to cis-Mendelian randomization" is now published at @ajhgnews.bsky.social - led by @vkarhune.bsky.social and Benji Woolf with critical insight from Dipender Gill and Pallav Bhatnagar. Thread follows:
Cis-MR studies are not intrinsically superior to genome-wide MR studies, and algorithmically-performed cis-MR analyses will rarely be optimal. But when performed with care, cis-MR is a powerful tool to inform about putative causal effects.
Caveat: it remains clear that a hypothesis-generating exploration cannot estimate a sample size in the traditional way, as there is no hypothesis to base the estimation on (it is exploratory, after all; and we would be talking about hypothesis testing, not prediction).
11/11
Lesson learned: when working in applied contexts, statisticians should advocate for robustness and lead the choice of sound methods, even in apparently blurry data situations, where #AI tools are not going to work any magic.
10/n
! the good news 😀
However, methods and [importantly] software do exist to estimate the minimum sample size required for reliable estimates and to prevent overfitting 8/n
...and spreads over all #AI methods
pubmed.ncbi.nlm.nih.gov/40461350/
7/n
The issue of insufficient sample size is common to other fields www.nature.com/articles/s41... 6/n
A recent systematic review www.sciencedirect.com/science/arti... @jclinepi.bsky.social highlights that most cancer research studies don't check if N is large enough to
1-guarantee reliable prediction
2-prevent overfitting
when research is conducted w insufficient N👉reproducibility remains a mirage🏝️
Among non-statisticians it is also commonly believed that #ML outperforms traditional prediction models particularly when N is small (if anything, the opposite is true). 4/n
However, many advocate widespread use of #ML tools as magic solutions to deal with small datasets with a large number of predictors. This is especially common among non-statisticians, and grant applications are normally flooded with cloudy methods promising exaggerated results. 3/n
Most agree that for too long #statistics forgot the data in favor of the models. Time and tools have come that can put data at the centre (big data, complex data, any data) 2/n
projecteuclid.org/journals/sta...
On machine learning (#ML) & sample size
Inspired by recent posts, we ran an interesting discussion club within our group of Biostats&Epi
Bottom line: talking about sample size estimation in #ML is taboo in many fields. It shouldn't be & there're many reasons for it
#stats #biostats
1/n
🚨 New preprint:
www.biorxiv.org/content/10.6...
We studied the dynamics of maternal gene expression over the course of healthy pregnancy based on weekly samples 👇
For our April journal club, @ozvanbocher.bsky.social will present on "Making the most of whole-genome sequencing data for rare variant association tests." This is another talk you won't want to miss!
📅 Friday, April 10
⏰ 8 am (PDT), 11 am (EDT), 5 pm (CEST)
🔗 iges.memberclicks.net/assets/IGES_...
#CKD: from being the elephant in the room to being recognized as a health priority
academic.oup.com/ndt/article/...
I deeply believe causal thinking is core to good DS regardless of whether you do analytics, ML, etc
A new-to-me resource is the excellent In the Interim podcast on clinical trial design
Comes out weekly and takes the sting out of my Monday commute
Check it out! www.berryconsultants.com/resources/po...
Some slides from a recent talk on missing heritability.
www.dropbox.com/scl/fi/kvogj...
Inclusion bias in #GWAS of #EHR traits
"By weighting the sample using inverse probability weights derived from probabilities of enrollment, we replicate 54% more known GWAS variants" 😱
#statgen
www.cell.com/ajhg/fulltex...
#statistics is based on the data and on the models that might have generated those data. For so long, many have ignored the former, going on with the latter only.
Re-upping Leo Breiman's 2001 powerful piece on the two cultures
projecteuclid.org/journals/sta...
So, kids are seen as obstacles to a career. More so for women. But men are not safe either. And if it is so in academia, is it worse for other jobs?
Much to fix in our systems.
Venice’s hidden islands are being resurrected – digitally.
ERC grantee Ludovica Galeazzo at University of Padua uses 3D scans and underwater robots to map 500 years of lost history, showing us a way to preserve heritage worldwide.
🔗 t.co/u6K5cQalCO
#FrontierResearch
Delighted to see our new paper published @nature communications.
Largest GWAS of kidney function in Africa.
Lower frequency of APOL1 high-risk variants in continental African populations
www.nature.com/articles/s41...
KidneyGenAfrica rocks
👏 @sfatumo.bsky.social & team!
#ckd #africa #gwas
www.linkedin.com/posts/segun-...
I’m honored and excited to join the Board of Directors! IGES is home to such a vibrant and welcoming scientific community, I look forward to helping it continue to thrive! 💟🧬
Join us for IGES 2026 in beautiful Estérel, QC 🍁 Abstract submission is open until May 30
www.geneticepi.org/2026-annual-...
This is an amazing repository of datasets that are helpful for self-educating on key #stats principles
G-EE
A very much needed (and brilliant) reflection for Mendelian randomization
“SVM, NN and RF may need over 10 times as many events per variable to achieve a stable AUC and a small optimism than classical modelling techniques such as LR. This implies that such modern techniques should only be used in medical prediction problems if very large data sets are available.”
#stats