I really enjoyed NeurIPS 2025! We put together a short article with summaries of several papers that caught my interest, especially some relevant to the future of model training for neural search / information retrieval:
www.elastic.co/search-labs/...
Posts by Michael Günther
I attended SIGIR this year together with @bowang0911.bsky.social. Together with @scottmartens.bsky.social, we wrote a blog post with our highlights and summaries of the AI and neural search papers we found most interesting at the conference
jina.ai/news/what-we...
Image resolution matters for embeddings - especially for visual document retrieval. jina-embeddings-v4 supports inputs up to 16+ MP (the default is much lower). We wrote a blog post about how resolution affects performance across benchmarks
jina.ai/news/how-ima...
We created a new benchmark for visual document retrieval with diverse, visually rich documents (going beyond linearly paginated PDFs) and more query types than just questions
👨‍💻 github.com/jina-ai/jina...
📑 jina.ai/news/jinavdr...
Our paper "Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models" has been accepted at the Robust IR Workshop @ SIGIR 2025! 🌠
📅 I'll present it on July 17th
📝 Pre-print: arxiv.org/abs/2409.04701
🔗 Workshop: sigir-2025-workshop-on-robust-ir.github.io
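The core idea of late chunking can be sketched in a few lines: encode the whole document in one forward pass first, then pool token embeddings per chunk, so each chunk vector retains document-wide context. A minimal sketch with toy data standing in for real model outputs (the function name and shapes are my own, not from the paper's code):

```python
import numpy as np

def late_chunking(token_embeddings, chunk_spans):
    """Pool token embeddings into one vector per chunk AFTER the whole
    document was encoded in a single pass, so every token embedding
    already carries document-wide context."""
    return [token_embeddings[start:end].mean(axis=0) for start, end in chunk_spans]

# Toy stand-in for the output of a long-context embedding model:
# 12 "tokens", 4-dimensional embeddings.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(12, 4))

# Chunk boundaries in token space (e.g., from a sentence splitter).
chunk_spans = [(0, 5), (5, 9), (9, 12)]

chunk_vectors = late_chunking(token_embeddings, chunk_spans)
print(len(chunk_vectors), chunk_vectors[0].shape)  # 3 chunks, each 4-dim
```

Naive chunking would instead encode each chunk in isolation, losing cross-chunk references such as pronouns and abbreviations.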
‼️Sentence Transformers v5.0 is out! The biggest update yet introduces Sparse Embedding models, encode methods improvements, Router module for asymmetric models & much more. Sparse + Dense = 🔥 hybrid search performance!
Details in 🧵
We are currently working on quantization-aware training to speed up retrieval. My colleagues Andrej, @scottmartens.bsky.social, and @bowang0911.bsky.social have published a blog post about first results - more is on the way!
jina.ai/news/quantiz...
We released a new model: jina-embeddings-v4
- multilingual text-to-text and text-to-image search w/o modality gap
- also visual documents (e.g. PDFs, maps) - trained on a wider scope than DSE, ColPali, etc.
+ MRL, late interaction, etc.
🤗 huggingface.co/jinaai/jina-...
📄 arxiv.org/abs/2506.18902
Interesting blog post by my colleague @scottmartens.bsky.social on how text length influences embedding similarity: longer queries tend to produce higher scores. As a result, comparing the scores of two documents against the same query works, but scores obtained with different queries are not comparable.
jina.ai/news/on-the-...
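A practical consequence of this: treat similarity scores as query-relative. A small sketch (toy vectors, helper name my own) that ranks documents within a single query instead of thresholding raw scores across queries:

```python
import numpy as np

def rank_docs(query_vec, doc_vecs):
    """Rank documents for ONE query by cosine similarity.
    Safe usage: scores are only compared within the same query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(-scores)  # best match first

rng = np.random.default_rng(1)
docs = rng.normal(size=(4, 8))      # 4 toy document embeddings
order = rank_docs(rng.normal(size=8), docs)
print(order.tolist())               # a permutation of [0, 1, 2, 3]
```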
New Multi-Modal Reranking Model (e.g. for text-to-image retrieval): jina.ai/news/jina-re...
Supports Multiple Languages and Dynamic Resolution (up to 4K)
🤗 huggingface.co/jinaai/jina-...
This Thursday, Bowen Jin will present his research on using reinforcement learning to train LLMs (R1) to generate search queries and use search engines more effectively, as part of our paper talks event series.
Online Event: lu.ma/j8g0wnit
Paper: arxiv.org/abs/2503.09516
Embedding models become "blind" beyond 4K tokens of context. Building on the NoLIMA paper, our experiments show that on needle-in-a-haystack tasks, the performance of embedding models drops to near-random chance at long context lengths, even with exact keyword matches 🤔 🧵
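For readers unfamiliar with the setup: a needle-in-a-haystack test plants one relevant sentence inside filler text at a controlled depth. A toy sketch of how such test documents can be built (illustrative only, not our actual experiment code):

```python
def build_haystack(needle, filler_sentence, total_sentences, needle_position):
    """Construct a needle-in-a-haystack test document: repeat a filler
    sentence and insert the needle at a given position, so the context
    length and needle depth can be varied independently."""
    sentences = [filler_sentence] * total_sentences
    sentences.insert(needle_position, needle)
    return " ".join(sentences)

doc = build_haystack(
    needle="The secret code is 4711.",
    filler_sentence="The weather report was uneventful.",
    total_sentences=10,
    needle_position=7,
)
print(doc.count("secret code"))  # 1
```

Retrieval quality is then measured by whether a query matching the needle still retrieves the document (or chunk) as the haystack grows.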
I applied LLMs for query expansion and we wrote this article:
It seems to work out of the box and generally boosts the performance of embedding models, though it adds latency. It would be interesting to see more work on this.
📃: jina.ai/news/query-e...
🛠️: github.com/jina-ai/llm-...
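The basic pattern is simple: ask an LLM for related terms and append them to the query before embedding it. A minimal sketch with a stubbed LLM call (`llm_complete` is a placeholder for any LLM client, not our tool's API):

```python
def expand_query(query, llm_complete):
    """Expand a search query with LLM-generated related terms.
    llm_complete: callable prompt -> completion text (any LLM backend)."""
    prompt = (
        "List short search terms closely related to the query below, "
        "one per line, no numbering.\n\nQuery: " + query
    )
    terms = [t.strip() for t in llm_complete(prompt).splitlines() if t.strip()]
    # The expanded query is what gets embedded instead of the original.
    return query + " " + " ".join(terms)

# Usage with a stubbed LLM:
fake_llm = lambda prompt: "deep learning\nneural networks"
print(expand_query("what is AI", fake_llm))  # what is AI deep learning neural networks
```

The extra LLM call is where the added latency mentioned above comes from.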
When it rains, it pours.
Baichuan releases Baichuan-Omni-1.5
An open-source omni-modal foundation model supporting text, image, video, and audio inputs as well as text and audio outputs.
Both the instruct model ( huggingface.co/baichuan-inc... ) and the base model ( huggingface.co/baichuan-inc... ) are available.
Some examples for instructions:
- Translating named entities and technical terms (e.g., "Big Data," "Embeddings")
- Specifying date formats (MM/DD/YY, DD/MM/YY, YYYY-MM-DD)
- Defining the tone of a text (e.g., formal vs. informal)
Nevertheless, LLMs' latency might be much higher.
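For illustration, such instructions can be assembled into a single customized prompt. A sketch (`build_translation_prompt` is hypothetical and not part of the linked tool):

```python
def build_translation_prompt(text, target_lang, keep_terms=(), date_format=None, tone=None):
    """Assemble a customized translation instruction for an LLM."""
    rules = []
    if keep_terms:
        rules.append("Keep these terms untranslated: " + ", ".join(keep_terms) + ".")
    if date_format:
        rules.append(f"Write all dates in {date_format} format.")
    if tone:
        rules.append(f"Use a {tone} tone.")
    return (
        f"Translate the following text into {target_lang}. "
        + " ".join(rules)
        + "\n\n" + text
    )

prompt = build_translation_prompt(
    "The meeting is on 03/04/25.", "German",
    keep_terms=["Embeddings"], date_format="YYYY-MM-DD", tone="formal",
)
print(prompt)
```

This kind of per-request customization is exactly what conventional translation APIs don't expose.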
It seems like LLM APIs are cheaper and more versatile for translation than translation APIs like Google Translate, as they allow customized instructions. I created a small tool to experiment with LLM translation and translation comparison.
github.com/guenthermi/t...
An interesting blog post from my colleague that compares ModernBERT to RoBERTa and to Jina-XLM-RoBERTa, the backbone of our jina-embeddings-v3 model, in the context of embedding training
jina.ai/news/what-sh...
Many search and data analysis use cases require extracting information from the HTML code of websites. To make this easier, the new ReaderLM-v2 model can effectively convert HTML to Markdown and extract information into JSON following a given JSON schema.
jina.ai/news/readerl...
Our submission to ECIR 2025 on jina-embeddings-v3 has been accepted! 🎉
At the ECIR Industry Day, my colleague @str-saba.bsky.social will present how we trained the latest version of our text embedding model.
More details on ECIR: ecir2025.eu
More details about the model: arxiv.org/abs/2409.10173
I’m releasing a series of experiments to enhance retrieval-augmented generation using attention scores. colab.research.google.com/drive/1HEUqy... The basic idea is to leverage the model's internal reading process, as it goes back and forth over the sources to find information and potential quotes.
Interesting article by Han Xiao on how to better utilize embedding models for classification tasks when the embedding model doesn't know much about your classes. It proposes an interesting method that decomposes the classification into multiple simpler classification problems
jina.ai/news/scaling...
Whether to use late chunking also depends on the chunk size: late chunking is generally more useful with smaller chunks than with larger ones.
Chunking improves performance on fact retrieval tasks but can actually harm performance on other retrieval tasks. Late chunking is useful for coherent datasets and often a good compromise: it helps embeddings retain context information while still focusing on details:
First, more input helps, but not for all retrieval tasks equally:
One year ago, we released the first open-source embedding model supporting 8192 tokens. Many suspected it wouldn't be useful and that chunking would beat a single long vector. I ran many experiments to figure out when to use what, and we summarized the findings in this article
t.co/BLC3WTU3LP
Small yet mighty! 💫
We are releasing SmolVLM: a new 2B vision language model made for on-device use, fine-tunable on a consumer GPU, and immensely memory efficient 🤠
We release three checkpoints under Apache 2.0: SmolVLM-Instruct, SmolVLM-Synthetic and SmolVLM-Base huggingface.co/collections/...
Jina-CLIP-v2: a 0.9B multilingual multimodal embedding model that supports 89 languages, 512x512 image resolution, 8192-token inputs, and Matryoshka representations down to 64 dimensions for both images and text. jina.ai/news/jina-cl... And of course strong performance on retrieval & classification tasks.