Language AI in Space Sciences hackathon day 1: given access to real Euclid telescope data
day 2: we made the galaxies flirt
👉 hf.co/spaces/astrohayley/gHarmony
Posts by Mike Smith
On Apple M3, a Linux KDE plasma desktop under Fedora Asahi Remix is now WORKING! Super excited to share this update and happy to answer any questions! Co-credits to noopwafel and Shiz. :)
New AstroPT models are out 🔭🎉 This time trained with an improved DESI galaxy image dataset. Link here: huggingface.co/Smith42/astr...
Check out these new scaling curves!
We are still seeing improvement at 800M parameters where before we stalled at 100M. Maybe high quality data is all you need 🤔
Anyways, here's the paper - it's one of the first big uses of foundation models in astronomy that I'm aware of, and it seems to have worked really well! #extragalactic #astrocode 🧪
Ooh we scalin'
This side-by-side plot compares two visualizations of the same dataset's embeddings: on the left, the original embeddings; on the right, latent representations generated via a transformation method (here labeled vec2vec).

Left: Original Embeddings
• Two clearly separated clusters of red and green points.
• The clusters represent two distinct groups (e.g., classes, domains, or modalities).
• Gray lines show strong correspondences between red and green points, suggesting shared structure or matched pairs.
• However, the clusters are far apart, meaning the original embedding space encodes strong domain-specific separation (red and green are treated as different).

Right: Latent Representations (vec2vec)
• The same points are now more uniformly mixed in latent space.
• The tight clustering by color is gone; red and green points are distributed throughout.
• This suggests the vec2vec method has projected both groups into a shared latent space, removing domain bias and aligning semantically similar items regardless of origin.
• It is indicative of embedding alignment, domain adaptation, or representation unification, where cross-domain items are mapped closer together based on semantic similarity.

Implication: vec2vec transforms the original domain-specific embeddings into a common space where structural similarity dominates over origin (color), enabling better transfer, comparison, or fusion between domains.
Strong Platonic Representation Hypothesis
Given large enough scale, embeddings from any model can be translated into any other model's space without paired data
Security implication: Embeddings aren’t encryption, they’re basically plain text
arxiv.org/abs/2505.12540
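To make the idea of translating between embedding spaces concrete, here is a minimal sketch. Note the caveat: the paper's result is that the mapping can be learned *without* paired data; this toy version cheats and uses paired vectors to fit an orthogonal map (classic Procrustes alignment), so it illustrates only what a successful translation looks like, not the vec2vec method itself. All names and dimensions are illustrative.

```python
# Toy illustration: translating between two embedding spaces.
# NOTE: this uses paired data (orthogonal Procrustes) purely for
# brevity; vec2vec's contribution is learning such a map unpaired.
import numpy as np

rng = np.random.default_rng(0)

# Pretend "model A" and "model B" embed the same 200 items in 16 dims,
# with B's space being a rotated, slightly noisy copy of A's.
true_rotation = np.linalg.qr(rng.normal(size=(16, 16)))[0]
emb_a = rng.normal(size=(200, 16))
emb_b = emb_a @ true_rotation + 0.01 * rng.normal(size=(200, 16))

# Orthogonal Procrustes: find rotation W minimizing ||emb_a @ W - emb_b||.
u, _, vt = np.linalg.svd(emb_a.T @ emb_b)
w = u @ vt

translated = emb_a @ w
err = np.linalg.norm(translated - emb_b) / np.linalg.norm(emb_b)
print(f"relative translation error: {err:.3f}")  # small => spaces align
```

If embeddings from different models really do converge on a shared structure, a map like this recovers nearly all of the target space, which is exactly why "embeddings aren't encryption": anyone holding your vectors can translate them into a space they understand.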
If you add "also Cthulhu-y" to the prompt, the results are pretty great.
A great dataset to round out UTBD's 2^2 week 😎
📢 New dataset out!
We introduce HypoGen💥, a dataset of ~5.5K structured problem–hypothesis pairs (Bit–Flip–Spark + Chain‑of‑Reasoning) to advance LLM-driven scientific ideation💡.
Fine‑tuned LLaMA 3.1 8B & R1‑distilled models show significant gains. Humans are still the best🥇.
Was great fun cooking this up with Sharaf and team! Check out all the code at github.com/UniverseTBD/... and paper at arxiv.org/abs/2504.08583
🎉 HAPPY BIRTHDAY, UniverseTBD! 🚀
As we turn 2, we’re going 2^2.
Launching a new project per day for the next four days.
We hope that you all enjoy these works as much as we have enjoyed working on them. Stay tuned for the big reveals!
tariffs getting so bad you can't even import numpy 🥲
me: i didn't know you were cool like that
val loss:
me: go left! ←←
my computer: best i can do is ^[[D
Going to be a great talk 😎
arxiv.org/abs/2501.12499 super cool paper! Extracting useful information from astro time series via RNNs
With r1 and o1, Yann LeCun's cake is now baked and ready
The final frontier for AI will be anything that can't be captured via a quantitative benchmark
broke: reading 500 page AI safety papers
woke: learning AI alignment best practices from "wallace & gromit: vengeance most fowl"
first time spotting AI art in the wild
just look at that floating ship!
the nvidia digits case is so british-housing-core
This makes so much sense – more cooperation when more willing to wait for rewards ~= less risk averse.
www.sciencedirect.com/science/arti...
An M dwarf star with the text: "I am not a toy. I am not a Christmas present. I am a 10 trillion year commitment"
Think twice before gifting someone an M dwarf this holiday season
I move to refer to 'Gold OA' as 'Pay to Publish'
o3 got me thinking about the future of selling our labour as code... how many more iterations until that's transformed? 😅
new shoggoth just dropped 😤😤
lets goo hatfield, great doc about my hometown: www.youtube.com/watch?v=IQYt...
POV: me getting on social media