Notes from my #NICAR26 session on browser automation and scraping with Playwright!
jsoma.github.io/workshop-bro...
Posts by j soma
Notes and code from my #NICAR26 talk on my FAVORITE pdf processing library (very not biased), Natural PDF
jsoma.github.io/natural-pdf-...
Notes and code from my #NICAR26 talk: Analyzing Images and Video with AI!
jsoma.github.io/workshop-ai-...
Slides and notes for #NICAR26 No-Code AI pipelines with n8n!
github.com/jsoma/worksh...
no-code pipelines and AI agents
analyzing images and video with AI
PDF processing with Natural PDF
Browser automation with Playwright
Here we go #NICAR26, a zillion and one sessions on the docket!
Thurs: No-code AI pipelines with n8n
Fri: Analyzing images/video with AI + Wrangling PDFs with Natural PDF
Sat: Browser automation with Playwright + Ethical AI in Investigations
Sun: Build your own AI Benchmark
With the awful WaPo layoffs and the state of journalism more broadly, if it's useful for any writers and reporters considering going indy, @molly.wiki, @xoxogossipgita.bsky.social of Aftermath, @jasonkoebler.bsky.social of 404 Media, @edzitron.com and me will do a little workshop next week.
Amid the hundreds of colleagues we’ve lost today, I wanted to highlight the BRILLIANT data/graphics folks who any newsroom should be fighting to hire right now—threading here:
Whatever you think of the Washington Post at this moment, here's a chance to support the dedicated, hard-working journalists who were just laid off. If you have the means, your donation is most welcome. If you don't, a kind thought and maybe spreading the word to others is support enough 💙
Find out more about the AI newsroom workflow course at its awful sales-y site, and feel free to shoot me any questions you might have!
littlecolumns.com/courses/ai-n...
The course itself is six weeks long, and while it does cost money (which is crazy strange for me!), there are steep geographic pricing discounts and coupon codes for close readers of the course site.
It's maybe like 35% a tech course, and a lot of the theory is stuff that seems simple once you've heard it: see what goes wrong, fix it, track it. That's it!
Yes, we'll learn automation tools like n8n/ActivePieces and eval suites like Opik/Arize Phoenix, buuut they're just one part
This course is going to solve every step of those crises. How do you...
- set up an AI pipeline?
- measure if it's working?
- iterate and improve it?
- make sure you're solving a reader/reporter problem instead of just playing tech games?
It isn't magic! It's easy!!!!
I'm running a six-week course in November on building and evaluating AI newsroom workflows!
It's targeted at people who don't know where to start, or who build little prototypes and end up stumped about making them production-ready.
littlecolumns.com/courses/ai-n...
a three-column table with the middle column highlighted
three columns being restructured into a vertical flow
tables being selected irrespective of their columns
the eventual pandas df
Natural PDF v0.1.13 out – a handful of useful changes but my favorite is 🗼 page restructuring support!
Grab sections and "flow" them together vertically or horizontally, making multi-column extraction infinitely easier than 24 hours ago.
Details at jsoma.github.io/natural-pdf/...
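A rough sketch of what "flowing" columns vertically means, in plain Python. This is NOT Natural PDF's API: the `Word` class, `flow_columns` function, and the `boundaries` parameter are all made up for illustration, assuming each word comes with a left/top position on the page.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge of the word
    y: float  # top edge of the word (smaller = higher on the page)


def flow_columns(words, boundaries):
    """Read each column top to bottom, then move to the next column to the right.

    `boundaries` is a hypothetical list of x positions separating the columns.
    """
    columns = [[] for _ in range(len(boundaries) + 1)]
    for word in words:
        # Count how many boundaries sit at or left of the word to pick its column.
        index = sum(word.x >= b for b in boundaries)
        columns[index].append(word)

    flowed = []
    for column in columns:
        # Within a column, sort by vertical position, then left to right.
        flowed.extend(sorted(column, key=lambda w: (w.y, w.x)))
    return [w.text for w in flowed]


# Two columns side by side; a naive row-by-row read would interleave them.
words = [
    Word("A1", 10, 10), Word("B1", 110, 10),
    Word("A2", 10, 30), Word("B2", 110, 30),
]
result = flow_columns(words, boundaries=[100])
print(result)
```

The left column comes out in full before the right one, so the text reads like one tall column instead of zig-zagging across the page.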
it looks like someone has been going very hard on scans
ONE MORE DAY OF ACCEPTING BAD PDF SUBMISSIONS
you could have won EVERY CATEGORY
Woke up to a ton of new non-English BAD PDF CONTEST submissions: 💥 Serbian! Romanian! Chinese! 💥
Mostly not scans, though, so I predict they'll be easy-peasy to extract the info from. I want to have to train a custom OCR model!!! Someone submit a big scanned non-English PDF!!!
i know you all are hiding worse scans from me
screenshot of a spreadsheet with very tiny text
i love this giant-pdf-with-tiny-text submission, we need a smallest font size category
I am running a contest. It is about bad pdfs.
It can make you independently wealthy (for immeasurably small measures of independent wealth)
badpdfs.com
Live colab demo/walkthrough here: colab.research.google.com/github/jsoma...
a screenshot of natural PDF documentation
New release of 📝 Natural PDF 📝
A million and one table extraction/document layout/Q&A/quality of life improvements for all your PDF-processing needs
jsoma.github.io/natural-pdf/
the law clinic repping this student, CLEAR, is based out of CUNY.....once again the public city university absolutely flounces the ivy league when it comes to having a backbone and standing on actual principles
Thank you – if only we could get a fix for the bug that prevents it from working 100%!