I really enjoyed NeurIPS 2025! We put together a short article with summaries of several papers that caught my interest, especially some relevant to the future of model training for neural search / information retrieval:
www.elastic.co/search-labs/...
Posts by Michael Günther
I attended SIGIR this year together with @bowang0911.bsky.social. Together with @scottmartens.bsky.social, we wrote a blog post with our highlights and summaries of the AI and neural search papers we found most interesting at the conference
jina.ai/news/what-we...
Image resolution matters for embeddings - especially for visual document retrieval. jina-embeddings-v4 supports inputs up to 16+ MP (the default is much lower). We wrote a blog post about how resolution affects performance across benchmarks
jina.ai/news/how-ima...
We created a new benchmark for visual document retrieval with diverse, visually rich documents (going beyond linearly paginated PDFs) and more query types than just questions
👨‍💻 github.com/jina-ai/jina...
📑 jina.ai/news/jinavdr...
Our paper "Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models" has been accepted at the Robust IR Workshop @ SIGIR 2025! 🌠
📅 I'll present it on July 17th
📝 Pre-print: arxiv.org/abs/2409.04701
🔗 Workshop: sigir-2025-workshop-on-robust-ir.github.io
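The core idea of late chunking can be sketched in a few lines: encode the whole document in one forward pass first, then pool token embeddings per chunk, so each chunk vector retains document-wide context. A minimal sketch with toy data standing in for real model outputs (the function name and shapes are my own, not from the paper's code):

```python
import numpy as np

def late_chunking(token_embeddings, chunk_spans):
    """Pool token embeddings into one vector per chunk AFTER the whole
    document was encoded in a single pass, so every token embedding
    already carries document-wide context."""
    return [token_embeddings[start:end].mean(axis=0) for start, end in chunk_spans]

# Toy stand-in for the output of a long-context embedding model:
# 12 "tokens", 4-dimensional embeddings.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(12, 4))

# Chunk boundaries in token space (e.g., from a sentence splitter).
chunk_spans = [(0, 5), (5, 9), (9, 12)]

chunk_vectors = late_chunking(token_embeddings, chunk_spans)
print(len(chunk_vectors), chunk_vectors[0].shape)  # 3 chunks, each 4-dim
```

Naive chunking would instead encode each chunk in isolation, losing cross-chunk references such as pronouns and abbreviations.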
‼️Sentence Transformers v5.0 is out! The biggest update yet introduces Sparse Embedding models, encode methods improvements, Router module for asymmetric models & much more. Sparse + Dense = 🔥 hybrid search performance!
Details in 🧵
We are currently working on quantization-aware training to speed up retrieval. My colleagues Andrej, @scottmartens.bsky.social, and @bowang0911.bsky.social have published a blog post about first results - more is on the way!
jina.ai/news/quantiz...
We released a new model: jina-embeddings-v4
- multilingual text-to-text and text-to-image search w/o modality gap
- also visual documents (e.g. PDFs, maps) - trained on a wider scope than DSE, ColPali, etc.
+ MRL, late interaction, etc.
🤗 huggingface.co/jinaai/jina-...
📄 arxiv.org/abs/2506.18902
Interesting blog post by my colleague @scottmartens.bsky.social on how text length influences embedding similarity: longer queries tend to produce higher scores. As a result, comparing the scores of two documents against the same query works, but scores obtained with different queries are not comparable.
jina.ai/news/on-the-...
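A practical consequence of this: treat similarity scores as query-relative. A small sketch (toy vectors, helper name my own) that ranks documents within a single query instead of thresholding raw scores across queries:

```python
import numpy as np

def rank_docs(query_vec, doc_vecs):
    """Rank documents for ONE query by cosine similarity.
    Safe usage: scores are only compared within the same query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(-scores)  # best match first

rng = np.random.default_rng(1)
docs = rng.normal(size=(4, 8))      # 4 toy document embeddings
order = rank_docs(rng.normal(size=8), docs)
print(order.tolist())               # a permutation of [0, 1, 2, 3]
```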
New Multi-Modal Reranking Model (e.g. for text-to-image retrieval): jina.ai/news/jina-re...
Supports Multiple Languages and Dynamic Resolution (up to 4K)
🤗 huggingface.co/jinaai/jina-...
This Thursday, Bowen Jin will present his research on using reinforcement learning to train LLMs (R1) to generate search queries and use search engines more effectively, as part of our paper talks event series.
Online Event: lu.ma/j8g0wnit
Paper: arxiv.org/abs/2503.09516
Embedding models become "blind" beyond 4K tokens of context. Building on the NoLIMA paper, our experiments show that on needle-in-a-haystack tasks, the performance of embedding models drops to near-random chance at long context lengths, even with exact keyword matches 🤔 🧵
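For readers unfamiliar with the setup: a needle-in-a-haystack test plants one relevant sentence inside filler text at a controlled depth. A toy sketch of how such test documents can be built (illustrative only, not our actual experiment code):

```python
def build_haystack(needle, filler_sentence, total_sentences, needle_position):
    """Construct a needle-in-a-haystack test document: repeat a filler
    sentence and insert the needle at a given position, so the context
    length and needle depth can be varied independently."""
    sentences = [filler_sentence] * total_sentences
    sentences.insert(needle_position, needle)
    return " ".join(sentences)

doc = build_haystack(
    needle="The secret code is 4711.",
    filler_sentence="The weather report was uneventful.",
    total_sentences=10,
    needle_position=7,
)
print(doc.count("secret code"))  # 1
```

Retrieval quality is then measured by whether a query matching the needle still retrieves the document (or chunk) as the haystack grows.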
I applied LLMs for query expansion and we wrote this article:
It seems to work out of the box and generally boosts the performance of embedding models, though it adds latency. It would be interesting to see more work on this.
📃: jina.ai/news/query-e...
🛠️: github.com/jina-ai/llm-...
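The basic pattern is simple: ask an LLM for related terms and append them to the query before embedding it. A minimal sketch with a stubbed LLM call (`llm_complete` is a placeholder for any LLM client, not our tool's API):

```python
def expand_query(query, llm_complete):
    """Expand a search query with LLM-generated related terms.
    llm_complete: callable prompt -> completion text (any LLM backend)."""
    prompt = (
        "List short search terms closely related to the query below, "
        "one per line, no numbering.\n\nQuery: " + query
    )
    terms = [t.strip() for t in llm_complete(prompt).splitlines() if t.strip()]
    # The expanded query is what gets embedded instead of the original.
    return query + " " + " ".join(terms)

# Usage with a stubbed LLM:
fake_llm = lambda prompt: "deep learning\nneural networks"
print(expand_query("what is AI", fake_llm))  # what is AI deep learning neural networks
```

The extra LLM call is where the added latency mentioned above comes from.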
When it rains, it pours.
Baichuan releases Baichuan-Omni-1.5
An open-source omni-modal foundation model supporting text, image, video, and audio inputs as well as text and audio outputs.
Both the instruct model ( huggingface.co/baichuan-inc... ) and the base model ( huggingface.co/baichuan-inc... ) are available.
Some examples for instructions:
- Translating named entities and technical terms (e.g., "Big Data," "Embeddings")
- Specifying date formats (MM/DD/YY, DD/MM/YY, YYYY-MM-DD)
- Defining the tone of a text (e.g., formal vs. informal)
Nevertheless, LLMs' latency might be much higher.
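For illustration, such instructions can be assembled into a single customized prompt. A sketch (`build_translation_prompt` is hypothetical and not part of the linked tool):

```python
def build_translation_prompt(text, target_lang, keep_terms=(), date_format=None, tone=None):
    """Assemble a customized translation instruction for an LLM."""
    rules = []
    if keep_terms:
        rules.append("Keep these terms untranslated: " + ", ".join(keep_terms) + ".")
    if date_format:
        rules.append(f"Write all dates in {date_format} format.")
    if tone:
        rules.append(f"Use a {tone} tone.")
    return (
        f"Translate the following text into {target_lang}. "
        + " ".join(rules)
        + "\n\n" + text
    )

prompt = build_translation_prompt(
    "The meeting is on 03/04/25.", "German",
    keep_terms=["Embeddings"], date_format="YYYY-MM-DD", tone="formal",
)
print(prompt)
```

This kind of per-request customization is exactly what conventional translation APIs don't expose.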
It seems like LLM APIs are cheaper and more versatile for translation than translation APIs like Google Translate, as they allow customized instructions. I created a small tool to experiment with LLM translation and translation comparison.
github.com/guenthermi/t...
An interesting blog post from my colleague that compares ModernBERT to RoBERTa and to Jina-XLM-RoBERTa, the backbone of our jina-embeddings-v3 model, in the context of embedding training
jina.ai/news/what-sh...
Many search and data analysis use cases require extracting information from the HTML code of websites. To make this easier, the new ReaderLM-v2 model can effectively convert HTML to Markdown and extract information into JSON following a given JSON schema.
jina.ai/news/readerl...
Our submission to ECIR 2025 on jina-embeddings-v3 has been accepted! 🎉
At the ECIR Industry Day, my colleague @str-saba.bsky.social will present how we trained the latest version of our text embedding model.
More details on ECIR: ecir2025.eu
More details about the model: arxiv.org/abs/2409.10173
I’m releasing a series of experiments to enhance retrieval-augmented generation using attention scores. colab.research.google.com/drive/1HEUqy... The basic idea is to leverage the model's internal reading process, as it goes back and forth over the sources to find information and potential quotes.
Interesting article by Han Xiao on how to better utilize embedding models for classification tasks when the embedding model doesn't know much about your classes. It proposes an interesting method that decomposes the classification into multiple simpler classification problems
jina.ai/news/scaling...
Whether to use late chunking also depends on the chunk size: late chunking is generally more useful with smaller chunks than with larger ones.
Chunking improves performance on fact retrieval tasks but can actually harm performance on other retrieval tasks. Late chunking is useful for coherent datasets and often a good compromise: it helps embeddings retain context information while still focusing on details:
First, more input helps, but not for all retrieval tasks equally:
One year ago, we released the first open-source embedding model supporting 8192 tokens. Many suspected it wouldn't be useful and that chunking would beat a single long vector. I ran many experiments to figure out when to use what, and we summarized the findings in this article
t.co/BLC3WTU3LP
Small yet mighty! 💫
We are releasing SmolVLM: a new 2B vision language model made for on-device use, fine-tunable on a consumer GPU, and immensely memory efficient 🤠
We release three checkpoints under Apache 2.0: SmolVLM-Instruct, SmolVLM-Synthetic and SmolVLM-Base huggingface.co/collections/...
Jina-CLIP-v2: a 0.9B multilingual multimodal embedding model that supports 89 languages, 512x512 image resolution, 8192-token inputs, and Matryoshka representations down to 64 dimensions for both images and text. jina.ai/news/jina-cl... And of course strong performance on retrieval & classification tasks.