
Posts by Leland McInnes

Plot of XKCD color names clustered by color name embedding vectors. Embeds each color name with fastText (300-d English word vectors, mean-pooled across words for multi-word names), then projects to 2D with UMAP (cosine distance).


Plot of XKCD colors in 2D clustered by literal color value. Converts each hex code to normalized sRGB and runs UMAP (euclidean distance) directly on the 3D color space. Similar colors end up near each other spatially, and each point is colored with its actual hex value.


Map versus territory of XKCD colors.
github.com/venkatasg/xkcd…
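The two preprocessing steps described in the plot captions above can be sketched in a few lines. This is a rough illustration, not the linked repository's code: `hex_to_unit_rgb` and `mean_pool` are hypothetical helper names, and the toy `word_vectors` dictionary stands in for real 300-d fastText vectors. The UMAP step itself is left out.

```python
from statistics import fmean


def hex_to_unit_rgb(hex_code):
    """Convert '#rrggbb' to an (r, g, b) tuple with each channel scaled to [0, 1]."""
    digits = hex_code.lstrip("#")
    if len(digits) != 6:
        raise ValueError(f"expected a 6-digit hex code, got {hex_code!r}")
    return tuple(int(digits[i:i + 2], 16) / 255 for i in (0, 2, 4))


def mean_pool(name, word_vectors):
    """Average the vectors of the words in a (possibly multi-word) color name."""
    # Assumption: words without a vector are skipped rather than guessed.
    vectors = [word_vectors[w] for w in name.lower().split() if w in word_vectors]
    if not vectors:
        raise KeyError(f"no known words in {name!r}")
    # zip(*vectors) walks the vectors one dimension at a time.
    return [fmean(dimension) for dimension in zip(*vectors)]


# Toy 2-d vectors standing in for fastText embeddings (made-up values).
word_vectors = {"sky": [0.0, 2.0], "blue": [2.0, 4.0]}

red = hex_to_unit_rgb("#ff0000")
sky_blue = mean_pool("sky blue", word_vectors)
print(red, sky_blue)
```

The RGB tuples would then go to UMAP with euclidean distance and the pooled name vectors to UMAP with cosine distance, as the captions describe.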

1 week ago 4 1 1 0
EVōC: Embedding Vector Oriented Clustering — EVoC 0.1.0 documentation

EVoC is available now on GitHub and PyPI.
Documentation: evoc.readthedocs.io
GitHub: github.com/TutteInstitu...
PyPI: pip install evoc

2 weeks ago 2 0 0 0

EVoC uses the scikit-learn API, so you can drop it into existing pipelines. It requires less parameter tuning than most similar clustering algorithms.
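As a rough sketch of what a shared estimator interface buys you: any object exposing `fit_predict` can be swapped into the same call site. The `from evoc import EVoC` import and the default constructor are assumptions based on the project's README, so check the documentation before relying on them. `cluster_embeddings` and `ParityClusterer` are hypothetical names used only for illustration.

```python
def cluster_embeddings(vectors, clusterer=None):
    """Cluster embedding vectors with any estimator exposing fit_predict."""
    if clusterer is None:
        # Assumption: EVoC is importable like this after `pip install evoc`.
        from evoc import EVoC
        clusterer = EVoC()
    return clusterer.fit_predict(vectors)


class ParityClusterer:
    """Hypothetical stand-in estimator, used only to show the shared interface."""

    def fit_predict(self, vectors):
        # Toy rule: label each row by the parity of its index.
        return [i % 2 for i, _ in enumerate(vectors)]


labels = cluster_embeddings([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], ParityClusterer())
print(labels)
```

With the package installed, calling `cluster_embeddings(embeddings)` on a NumPy array of shape (n_samples, n_dims) would use EVoC itself instead of the stand-in.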

2 weeks ago 0 0 1 0
Line plot showing scaling of different algorithms with dataset size. EVoC scales equivalently to MiniBatchKMeans and much better than other algorithms.

Do you use KMeans for your embedding clustering for the sake of speed? Good news, EVoC performance scales very well, and EVoC can be competitive with MiniBatchKMeans, while producing better clusters for high dimensional embedding vectors.

2 weeks ago 0 0 1 0

Do you use UMAP + HDBSCAN for your embedding vector clustering pipeline? Good news, EVoC works similarly (but better! and much much faster!), and is developed by the original authors of the umap-learn and hdbscan python packages.

2 weeks ago 0 0 1 0

EVoC is a library designed specifically for fast clustering of high dimensional embedding vectors. It can produce high quality clusters extremely efficiently, and requires little to no hyperparameter tuning.
Better clustering than UMAP + HDBSCAN; faster clustering than KMeans.

2 weeks ago 13 3 2 1
Multi-Resolution Clustering with PLSCAN — fast_hdbscan 0.1 documentation

PLSCAN docs: fast-hdbscan.readthedocs.io/en/latest/pl...
metrics and constraints: fast-hdbscan.readthedocs.io/en/latest/me...

PLSCAN paper: arxiv.org/abs/2512.16558

1 month ago 2 0 0 0

There's a new release of fast-hdbscan out now on PyPI that adds some major new features.

- A PLSCAN implementation based on work by Jelmer Bot
- Support for diverse metrics via PyNNDescent
- A powerful "cannot-link" constraint system for semi-supervised clustering by Richard Hakim

1 month ago 7 3 1 0
Nemotron V3 Atlas - a Hugging Face Space by davanstrien This app lets you upload your vector embeddings (e.g., CSV or JSON files) and instantly creates an interactive 2‑D/3‑D plot where similar items cluster together. You can explore the layout, hover o...

huggingface.co/spaces/davan...

1 month ago 10 1 0 0
Interactive Embedding Atlas visualization of 250,000 training examples from NVIDIA's Nemotron post-training v3 collection, colored by category. Distinct clusters are visible for Math (blue, 56k), Code (orange, 45k), Agentic (green, 41k), Instruction Following (red, 25k), Finance (purple, 18k), Multilingual (brown, 18k), Science (pink, 18k), Safety (grey, 16k), and Identity (yellow, 8k). A selected point shows an example from the Agentic function-calling dataset: "I need to know the weight of the creature that can evolve into a Fire Dragon."


One of the nicest things about Nvidia model releases is that they ship the training data.

What does it look like? I sampled 250k examples from 24 datasets in the Nemotron post-training v3 collection and built an interactive Embedding Atlas to explore it.

1 month ago 37 7 3 1
Conference banner that states the text:

Language AI in Space Sciences Workshop (March 9th - 12th, 2026)

imposed over what looks like a JWST image of a galaxy cluster, and the contours that specify where some astronomy literature categories live in a UMAP-ed embedding space.


This week I'll be delivering a talk and leading a tutorial for the Language AI in Space Sciences Workshop in Baltimore, hosted at STScI.

So excited to see folks from all disciplines join the conversation -- including astronomers, engineers, computer scientists, librarians, and linguists.

1 month ago 8 1 0 0
GitHub - TutteInstitute/toponymy Contribute to TutteInstitute/toponymy development by creating an account on GitHub.

I might suggest Toponymy (github.com/TutteInstitu...) and DataMapPlot (github.com/TutteInstitu...) as a way to clean up the labeling and display a little.

1 month ago 1 0 0 0

Han Xiao (ex-Jina AI CEO) built MLX (Apple silicon) clustering libs for UMAP, t-SNE, etc. github.com/hanxiao/mlx-...

1 month ago 10 3 0 1

playing around with umap-js today.

1 month ago 3 1 0 0
State of Neuroscience 2025: Trends & Breakthroughs | The Transmitter A comprehensive look at major trends shaping the neuroscience landscape in 2025

Excited to share a new #interactive #dataviz, putting 50 years of #neuroscience research on the map:

The State of Neuroscience 2025
stateofneuroscience.thetransmitter.org

#StateOfNeuro

5 months ago 50 15 4 2

Any sufficiently large k-nn is indistinguishable from magic
🧙‍♂️

1 month ago 22 4 0 0
screenshot of map.sky.boo with a circle around a dot way off the main clustered area, a modal at the bottom right shows it as "Justin Kyle" daddysgaygarden13.bsky.social


become unclusterable

2 months ago 76 4 2 1

Some very nice ideas in there. Thanks for sharing!

2 months ago 4 0 1 0
Bluesky Map Interactive map of 3.4 million Bluesky users, visualised by their follower pattern.

I made a map of 3.4 million Bluesky users - see if you can find yourself!

bluesky-map.theo.io

I've seen some similar projects, but IMO this seems to better capture some of the fine-grained detail

2 months ago 7239 2166 660 4554
A) Dendrogram of the development dataset showing the clustering structure and optimal cut points, and spectrograms of representative calls extracted from cluster 0 and cluster 1. Within the main clusters, we observed further branching; B) UMAP projection divided into 𝐾 = 2 clusters using HAC.


Our new pre-print shows how unsupervised clustering methods can identify biologically meaningful differences in early vocal production, with no human feedback. @antorrisi.bsky.social
has led this interdisciplinary collaboration based on computational methods + #chicks 🐣 arxiv.org/abs/2601.12203

2 months ago 20 8 1 0

here's a fun side project i've been working on: i compiled a joint text<>audio embedding model to a fast coreml pipeline, and built a very fast (~400ms for 50k samples, can scale to millions) UMAP dimensionality reduction GPU impl in mlx. using it to browse music libraries and do sample sim search

2 months ago 63 4 3 0

Xiaobin Li, Run Zhang: Understanding and Improving UMAP with Geometric and Topological Priors: The JORC-UMAP Algorithm https://arxiv.org/abs/2601.16552 https://arxiv.org/pdf/2601.16552 https://arxiv.org/html/2601.16552

2 months ago 1 1 0 0

I miss the days when you'd see blog posts with clever analyses of datasets, maths and data science tricks.

That's why, as an experiment, we're starting a new moderated subreddit. People can share/promote their notebooks and you can use RSS to subscribe.

Please join and share!

2 months ago 10 2 0 0
3 by 3 grid of networks color-mapped with "plasma" and with edge-bundling. Lung looking structure.


UMAP connectivity plots of 3,627 chess openings from the @lichess.org datasets (huggingface.co/datasets/Lic...)

3 months ago 6 1 1 0

If I had to guess a direction that could be taken that would ameliorate this, it would be "small models". It feels like it should be possible to have small enough models to run locally that are either "capable enough", or specialize in a domain. In that case training is the only compute bottleneck.

3 months ago 3 0 0 0

I think it's important to note though that in spite of those incentives, the direction of the last two years has been more fungibility, *not* lock-in. And open source is the wrong fight here: when lock-in comes it will look more like the lock-in that Amazon or Uber have than Microsoft Office…

3 months ago 8 1 2 1
Probabilistic Foundations of Fuzzy Simplicial Sets for Nonlinear Dimensionality Reduction Fuzzy simplicial sets have become an object of interest in dimensionality reduction and manifold learning, most prominently through their role in UMAP. However, their definition through tools from alg...

New preprint! Have you ever wondered, what are these fuzzy simplicial sets, the theoretical framework behind e.g. UMAP? Here we show that you may simply see them as marginal distributions over simplicial sets. This provides a generative model for UMAP. (1/2)

arxiv.org/abs/2512.03899

4 months ago 14 7 1 0

Space DJ turns genre embeddings into a playable galaxy—pilot a ship, the music follows. 🚀

Key stats
768→128 PCA compression; 3D UMAP projection; three.js rendering; autopilot drift; high‑dim neighbors surfacing hidden similarities.

5 months ago 29 8 3 2

Since 2018 if I recall correctly.

5 months ago 1 0 1 0