Language AI in Space Sciences hackathon day 1: given access to real Euclid telescope data
day 2: we made the galaxies flirt
👉 hf.co/spaces/astrohayley/gHarmony
Posts by Mike Smith
On Apple M3, a Linux KDE plasma desktop under Fedora Asahi Remix is now WORKING! Super excited to share this update and happy to answer any questions! Co-credits to noopwafel and Shiz. :)
New AstroPT models are out 🔭🎉 This time trained with an improved DESI galaxy image dataset. Link here: huggingface.co/Smith42/astr...
Check out these new scaling curves!
We are still seeing improvement at 800M parameters where before we stalled at 100M. Maybe high quality data is all you need 🤔
Anyways, here's the paper - it's one of the first big uses of foundation models in astronomy that I'm aware of, and it seems to have worked really well! #extragalactic #astrocode 🧪
Ooh we scalin'
This side-by-side plot compares two visualizations of the same dataset's embeddings: on the left, the original embeddings; on the right, latent representations generated via a transformation method (here labeled vec2vec).

Left: Original Embeddings
• Two clearly separated clusters of red and green points.
• The clusters represent two distinct groups (e.g., classes, domains, or modalities).
• Gray lines show strong correspondences between red and green points, suggesting shared structure or matched pairs.
• However, the clusters are far apart, meaning the original embedding space encodes strong domain-specific separation (red and green are treated as different).

Right: Latent Representations (vec2vec)
• The same points are now more uniformly mixed in latent space.
• The tight clustering by color is gone; red and green points are distributed throughout.
• This suggests the vec2vec method has projected both groups into a shared latent space, removing domain bias and aligning semantically similar items regardless of origin.
• It is indicative of embedding alignment, domain adaptation, or representation unification, where cross-domain items are mapped closer together based on semantic similarity.

Implication: vec2vec transforms the original domain-specific embeddings into a common space where structural similarity dominates over origin (color), enabling better transfer, comparison, or fusion between domains.
Strong Platonic Representation Hypothesis
Given large enough scale, embeddings from any model can be translated into any other model's space without paired data
Security implication: Embeddings aren’t encryption, they’re basically plain text
arxiv.org/abs/2505.12540
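To make the idea of translating between embedding spaces concrete, here is a minimal sketch. Note the caveat: the paper's result is that the mapping can be learned *without* paired data; this toy version cheats and uses paired vectors to fit an orthogonal map (classic Procrustes alignment), so it illustrates only what a successful translation looks like, not the vec2vec method itself. All names and dimensions are illustrative.

```python
# Toy illustration: translating between two embedding spaces.
# NOTE: this uses paired data (orthogonal Procrustes) purely for
# brevity; vec2vec's contribution is learning such a map unpaired.
import numpy as np

rng = np.random.default_rng(0)

# Pretend "model A" and "model B" embed the same 200 items in 16 dims,
# with B's space being a rotated, slightly noisy copy of A's.
true_rotation = np.linalg.qr(rng.normal(size=(16, 16)))[0]
emb_a = rng.normal(size=(200, 16))
emb_b = emb_a @ true_rotation + 0.01 * rng.normal(size=(200, 16))

# Orthogonal Procrustes: find rotation W minimizing ||emb_a @ W - emb_b||.
u, _, vt = np.linalg.svd(emb_a.T @ emb_b)
w = u @ vt

translated = emb_a @ w
err = np.linalg.norm(translated - emb_b) / np.linalg.norm(emb_b)
print(f"relative translation error: {err:.3f}")  # small => spaces align
```

If embeddings from different models really do converge on a shared structure, a map like this recovers nearly all of the target space, which is exactly why "embeddings aren't encryption": anyone holding your vectors can translate them into a space they understand.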
If you add "also Cthulhu-y" to the prompt, the results are pretty great.
A great dataset to round out UTBD's 2^2 week 😎
📢 New dataset out!
We introduce HypoGen💥, a dataset of ~5.5K structured problem–hypothesis pairs (Bit–Flip–Spark + Chain‑of‑Reasoning) to advance LLM-driven scientific ideation💡.
Fine‑tuned LLaMA 3.1 8B & R1‑distilled models show significant gains. Humans are still the best🥇.
Was great fun cooking this up with Sharaf and team! Check out all the code at github.com/UniverseTBD/... and paper at arxiv.org/abs/2504.08583
🎉 HAPPY BIRTHDAY, UniverseTBD! 🚀
As we turn 2, we’re going 2^2.
Launching a new project per day for the next four days.
We hope that you all enjoy these works as much as we have enjoyed working on them. Stay tuned for the big reveals!
tariffs getting so bad you can't even import numpy 🥲
me: i didn't know you were cool like that
val loss:
me: go left! ←←
my computer: best i can do is ^[[D
Going to be a great talk 😎
arxiv.org/abs/2501.12499 super cool paper! Extracting useful information from astro time series via RNNs
With r1 and o1, Yann LeCun's cake is now baked and ready
The final frontier for AI will be anything that can't be captured via a quantitative benchmark
broke: reading 500 page AI safety papers
woke: learning AI alignment best practices from "wallace & gromit: vengeance most fowl"
first time spotting AI art in the wild
just look at that floating ship!
the nvidia digits case is so british-housing-core
This makes so much sense – more cooperation when more willing to wait for rewards ~= less risk averse.
www.sciencedirect.com/science/arti...
An M dwarf star with the text: "I am not a toy. I am not a Christmas present. I am a 10 trillion year commitment"
Think twice before gifting someone an M dwarf this holiday season
I move to refer to 'Gold OA' as 'Pay to Publish'
o3 got me thinking about the future of selling our labour as code... how many more iterations until that's transformed? 😅
new shoggoth just dropped 😤😤
lets goo hatfield, great doc about my hometown: www.youtube.com/watch?v=IQYt...
POV: me getting on social media