Excited to be speaking at Good Tech Summit in DC April 7 www.goodtechtogether.org/summit
We’ll share a program focused on K-12 education and talk about investing in the foundations of AI: data, models, and benchmarks. We'll explore how these shape AI development in a field. Join us!
Posts by Peter Bull
🎉 Excited to launch this challenge! 🎉 Over a year of data collection, curation, and annotation that we undertook to produce a first-of-its-kind dataset.
Help us build speech models that understand 2-5 year olds. $120k in prizes and huge impact!
kidsasr.drivendata.org
Great set of events for #SeattleAIWeek this week! Definitely join some if you are in town and let me know if you want to catch up luma.com/Seattle-AI-W...
🚀 New release: cloudpathlib v0.23.0
🥧 Now with Python 3.14 (π) support!
📁 New copy & move methods mean you can reduce usage of shutil 🎉
Check out the full release and docs here:
👉 cloudpathlib.drivendata.org/stable/
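For anyone wondering what "reduce usage of shutil" can look like in practice, here is a minimal sketch. It uses only the standard library's `pathlib`, which gains `Path.copy()` and `Path.move()` in Python 3.14, and falls back to `shutil` on older versions. That cloudpathlib's new methods follow the same shape is an assumption on my part, so check the release notes for the exact signatures. The helper names `copy_file`, `move_file`, and `demo` are hypothetical, made up for this illustration.

```python
import shutil
import tempfile
from pathlib import Path


def copy_file(src: Path, dst: Path) -> Path:
    """Copy src to dst and return the new path (hypothetical helper)."""
    if hasattr(src, "copy"):
        # Path.copy() exists in the standard library from Python 3.14 on.
        return src.copy(dst)
    # Older Pythons: fall back to shutil.
    return Path(shutil.copy2(src, dst))


def move_file(src: Path, dst: Path) -> Path:
    """Move src to dst and return the new path (hypothetical helper)."""
    if hasattr(src, "move"):
        # Path.move() is also new in Python 3.14.
        return src.move(dst)
    return Path(shutil.move(src, dst))


def demo() -> tuple[str, bool, bool]:
    """Copy a file, then move the copy, inside a throwaway directory."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        original = root / "data.csv"
        original.write_text("a,b\n1,2\n")
        copied = copy_file(original, root / "copy.csv")
        moved = move_file(copied, root / "moved.csv")
        # Contents of the moved file, whether the original survived,
        # and whether the intermediate copy is still there.
        return moved.read_text(), original.exists(), copied.exists()
```

Calling `demo()` copies a small file and then moves the copy; the original stays in place and the intermediate copy is gone afterwards.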
Super interesting work on a newly proposed columnar data file format called F3, with an embedded Wasm binary to decode the data 🤯 (which obviates the need for 3rd party library support). Favorable comparisons on compression, throughput, and random reads to existing formats.
db.cs.cmu.edu/papers/2025/...
Very cool to see Wikimedia embracing LLM tools and launching a hybrid similarity search API and open source embeddings for Wikipedia! Also supports Q&A style queries.
www.wikidata.org/wiki/Wikidat...
Interesting to see empirical research coming out for LLMs as education aids. In this study, active use of LLMs helped CS students debug compiler errors. Once LLM access was removed, though, students showed no lasting learning benefit from having had it...
learninganalytics.upenn.edu/ryanbaker/IC...
AI for Conservation Boston Meetup: Join us for an iNat bioblitz! September 27th from 9am-12pm. Meet at the UMass Boston Quad at 9. Register here: https://www.eventbrite.com/e/umass-boston-bioblitz-tickets-1626791971579?aff=oddtdtcreator
Are you interested in #AIforConservation #AIforBiodiversity #AIforWildlife or #AIforNature?? Are you located in the Boston Area?
If so, come join us!! The AI for Conservation Slack community is doing our first local-area Boston meetup, partnering with iNaturalist and TEDx Boston!
We just shipped two major features for cloudpathlib ✨📦 ✨ ! First, HTTP support: treat a URL like any other path (open, read_text, join). Second, compatibility with the open and os Python built-ins for a seamless transition of legacy code and third-party library support.
cloudpathlib.drivendata.org
Great opportunity to work on AI in conservation and biodiversity with Roland Kays! In-person in NC, check it out now since it is only open for a week:
www.governmentjobs.com/careers/%7B0...
Exemplary FAQ for "Your Brain on ChatGPT: Accumulation of Cognitive Debt" www.brainonllm.com/faq
I'd love to see more authors who are explicit about what NOT to claim based on a study, including wording for lay audiences that is not appropriate.
Thought I would spot check an application someone was posting about 100% vibecoding. Can you spot the issue?
Kudos to the LLM, this is verbatim from the FastAPI docs. Sometimes verbatim from the docs is not what you want for your application though...
Interesting announcement on a product from Astral! Similar model to one of the core @anacondainc.bsky.social lines of business.
Enthusiastic to build on this generation of earth observation foundation embeddings like DeepMind's AlphaEarth (and more)! We already see some promising crop type (cereals vs. orchards) results and are exploring other use cases in climate resilience. deepmind.google/discover/blo...
Very cool to see that marimo supports our cloudpathlib library for their file browser UI! Browse your S3, GCS, Azure buckets from your notebooks! docs.marimo.io/api/inputs/f...
✨ 📦 ✨ Just released new Cookiecutter Data Science version with support for pixi and poetry as environment managers! Some of our top requested features ever. Upgrade and check it out now.
cookiecutter-data-science.drivendata.org
Now getting organic inbound for www.zambacloud.com, our wildlife imagery processing platform, from ChatGPT! 😲
Just in case you thought speech-to-text worked for children, the third column is what Whisper does. Somehow in the third example it accesses my inner monologue... I guess that's why we're excited about our upcoming challenge! kidsasr.drivendata.org
How are people managing code review for their AI coding agents? I do a first glance and it is obviously bad (e.g., didn't refactor repeated code), and now I've got half a dozen AI diffs for things that aren't good enough cluttering up my todo list with things to respond to....
New research based on the CANDOR corpus shows that people enjoy conversations where they alternate longer turns better than short turns or one person dominating. Cool!
arxiv.org/html/2506.20...
The best shortcut to understanding how many experienced software engineers feel about AI is listening to ThePrimeagen's takes. Balanced perspectives on what's actually new, determinism, security, system complexity, what's promising, and what's not www.youtube.com/watch?v=vDWa...
"Damn ChatGPT" your new summer jam about using ChatGPT as a therapist open.spotify.com/track/4umq06... (edited)
Great article on the challenges of surfacing only the right info to LLMs and editing down what is not needed. If you've used a coding copilot or agent, you've seen this firsthand many times. Output iterations are often polluted with code that came before.
www.dbreunig.com/2025/06/22/h...
BioCLIP2 looks like a stellar improvement! I'm excited to think about integrating it into Zamba for open-ended classification tasks run at scale on camera trap imagery. Definitely the potential to dramatically improve CT image utility. imageomics.github.io/bioclip-2/
We've built so many low-fidelity prototypes in our HCD work. IMO vibecoding changes the feel of those prototypes, but doesn't change the process. Ask any designer—they'll tell you high-fidelity first iterations are often more distracting to clients than helpful.
www.semafor.com/article/06/0...
Check out this LLM circuit trace for the text: "The statement 'this statement is false' is". It goes through a logical contradictions node, but still outputs either "true" or "false" with the highest probabilities... www.anthropic.com/research/ope...
A new preprint shows anonymization techniques for voices make transcription accuracy substantially worse for children versus adults. This is going to be a big challenge as we work on ASR for educational settings where we emphatically need both privacy and accuracy. arxiv.org/pdf/2506.00100
😍 Incredible data storytelling about the power of conversation and human connection. Worth a read for good vibes! Based on the CANDOR corpus that we worked on. pudding.cool/2025/06/hell...
The gap between LLM prototype and production strikes again... in the worst possible place. www.propublica.org/article/trum...