🙏 Huge thanks to all collaborators @yfyuan01.bsky.social, @wenyan62.bsky.social, @aliannejadi.bsky.social, @danielhers.bsky.social, Anders Søgaard, Ivan Vulić, Wenxuan Zhang, Paul Liang, Yang Deng, @serge.belongie.com
📄 Paper: arxiv.org/abs/2505.14462
🌐 Project: jiaangli.github.io/ravenea/
💻 Code: github.com/yfyuan01/RAV...
🤗 Data: huggingface.co/datasets/jaa...
🎉 Excited to share our work "RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding", accepted at #ICLR2026!
🇧🇷 I'll be attending ICLR in person – would love to connect and chat there! 🤝
🗓️ Sat, Apr 25, 2026, 10:30 AM – 1:00 PM GMT-03
📍 Pavilion 4, P4 #3618
Feeling overwhelmed by all the recent developments in video understanding? What used to require dozens of modular computational workflows involving SLAM, feature tracking, optical flow, camera calibration, multiview geometric constraints, and ResNet backbones is now...
(1/3)
Feel free to reach out and chat with Xinyi on July 18th in Vancouver at #ICML!
Would you present your next NeurIPS paper in Europe instead of traveling to San Diego (US) if this were an option? Søren Hauberg (DTU) and I would love to hear the answer through this poll: (1/6)
Check out our new preprint TensorGRaD.
We use a robust decomposition of the gradient tensors into low-rank + sparse parts to sharply reduce optimizer memory for Neural Operators, while matching the performance of Adam, even on turbulent Navier–Stokes (Re = 10⁵).
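For intuition, here's a toy sketch of the core trick as I read the post: split a gradient into a truncated-SVD low-rank part plus a top-magnitude sparse residual, so optimizer state only needs the compressed pieces. This is a matrix version under my own naming (the actual method targets higher-order gradient tensors), not the TensorGRaD code:

```python
# Toy sketch (mine, not TensorGRaD's code): split a gradient matrix into a
# truncated-SVD low-rank part plus a sparse residual of the largest entries.
import torch

def lowrank_plus_sparse(grad: torch.Tensor, rank: int = 8, sparse_frac: float = 0.01):
    """Return (low_rank, sparse) with grad ≈ low_rank + sparse."""
    U, s, Vh = torch.linalg.svd(grad, full_matrices=False)
    low_rank = U[:, :rank] @ torch.diag(s[:rank]) @ Vh[:rank, :]
    residual = grad - low_rank
    k = max(1, int(sparse_frac * residual.numel()))   # entries to keep
    _, idx = residual.abs().flatten().topk(k)         # largest-magnitude residuals
    sparse = torch.zeros_like(residual).flatten()
    sparse[idx] = residual.flatten()[idx]
    return low_rank, sparse.view_as(residual)

g = torch.randn(512, 512)                  # stand-in for one layer's gradient
lr_part, sp_part = lowrank_plus_sparse(g)
print((g - lr_part - sp_part).norm() / g.norm())  # relative reconstruction error
```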
PhD student Jiaang Li and his collaborators share insights into the cultural understanding of vision-language models 👇
Paper title: "Cultural Evaluations of Vision-Language Models Have a Lot to Learn from Cultural Theory"
I am excited to announce our latest work 🎉 "Cultural Evaluations of Vision-Language Models Have a Lot to Learn from Cultural Theory". We review recent works on culture in VLMs and argue for deeper grounding in cultural theory to enable more inclusive evaluations.
Paper 📄: arxiv.org/pdf/2505.22793
Great collaboration with @yfyuan01.bsky.social @wenyan62.bsky.social @aliannejadi.bsky.social @danielhers.bsky.social, Anders Søgaard, Ivan Vulić, Wenxuan Zhang, Paul Liang, Yang Deng, @serge.belongie.com
🔗 More here:
Project Page: jiaangli.github.io/RAVENEA/
Code: github.com/yfyuan01/RAV...
Dataset: huggingface.co/datasets/jaa...
📊 Our experiments demonstrate that even lightweight VLMs, when augmented with culturally relevant retrievals, outperform their non-augmented counterparts and even surpass the next larger model tier, achieving at least a 3.2% improvement in cVQA and 6.2% in cIC.
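For a concrete picture of what "augmented with culturally relevant retrievals" can look like, here's a minimal sketch assuming a CLIP retriever over Wikipedia snippets and a plain text prompt; the model choice, snippets, and prompt format are my assumptions, not necessarily RAVENEA's pipeline:

```python
# Minimal RAG-for-cVQA sketch, assuming a CLIP retriever over Wikipedia
# snippets; not RAVENEA's actual pipeline.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def retrieve(image: Image.Image, docs: list[str], k: int = 2) -> list[str]:
    inputs = proc(text=docs, images=image, return_tensors="pt",
                  padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    scores = out.logits_per_image[0]             # image-to-text similarities
    return [docs[i] for i in scores.topk(min(k, len(docs))).indices]

docs = ["Songkran is the Thai New Year festival ...",
        "The Eiffel Tower is a wrought-iron lattice tower ..."]
image = Image.new("RGB", (224, 224))             # placeholder query image
context = "\n".join(retrieve(image, docs))
prompt = f"Context:\n{context}\n\nQuestion: Which festival is shown in the image?"
# `prompt` (plus the image) would then be fed to the downstream VLM.
```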
✨ Culture-Aware Contrastive Learning
We propose Culture-aware Contrastive (CAC) Learning, a supervised learning framework compatible with both CLIP and SigLIP architectures. Fine-tuning with CAC can help models better capture culturally significant content.
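The paper and code carry the real objective; as a hedged illustration, one plausible CLIP-style reading of a culture-aware contrastive loss treats every document annotated as culturally relevant to an image as a positive:

```python
# Hedged illustration: a CLIP-style contrastive loss where culturally
# relevant image-document pairs count as positives. The actual CAC
# objective may differ; see the paper/code.
import torch
import torch.nn.functional as F

def cac_loss(img_emb, txt_emb, relevance, temperature=0.07):
    """img_emb, txt_emb: (B, D) L2-normalized embeddings;
    relevance: (B, B) 0/1 mask of culturally relevant image-document pairs."""
    logits = img_emb @ txt_emb.T / temperature     # (B, B) similarity matrix
    log_prob = F.log_softmax(logits, dim=1)
    pos_per_img = relevance.sum(dim=1).clamp(min=1)
    # Maximize the likelihood of all relevant documents for each image.
    return -(log_prob * relevance).sum(dim=1).div(pos_per_img).mean()

B, D = 8, 512
img = F.normalize(torch.randn(B, D), dim=1)
txt = F.normalize(torch.randn(B, D), dim=1)
rel = torch.eye(B)          # toy case: only the paired document is relevant
print(cac_loss(img, txt, rel))
```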
📚 Dataset Construction
RAVENEA integrates 1,800+ images, 2,000+ culture-related questions, 500+ human captions, and 10,000+ human-ranked Wikipedia documents to support two key tasks:
🎯 Culture-focused Visual Question Answering (cVQA)
📝 Culture-informed Image Captioning (cIC)
🚨 New Preprint 🚨
Can Multimodal Retrieval Enhance Cultural Awareness in Vision-Language Models?
Excited to introduce RAVENEA, a new benchmark aimed at evaluating cultural understanding in VLMs through RAG.
arxiv.org/abs/2505.14462
More details: 👇
Super cool! Incidentally, in our previous project, we also found that linear alignment between embedding spaces from two modalities is viable – and the alignment improves as LLMs scale.
bsky.app/profile/jiaa...
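For anyone curious what "linear alignment between embedding spaces" means operationally, here's a self-contained toy: fit an orthogonal (Procrustes) map between paired concept embeddings, then score held-out nearest-neighbor retrieval. The data is synthetic and the setup is mine, not the paper's exact protocol:

```python
# Toy cross-modal alignment: orthogonal Procrustes fit on training pairs,
# precision@1 retrieval on held-out pairs. Synthetic data; setup is mine.
import torch

def fit_procrustes(X, Y):
    """Orthogonal W minimizing ||X @ W - Y||_F."""
    U, _, Vh = torch.linalg.svd(X.T @ Y, full_matrices=False)
    return U @ Vh

def retrieval_p_at_1(X, Y, W):
    sims = (X @ W) @ Y.T                          # map into the language space
    return (sims.argmax(dim=1) == torch.arange(len(X))).float().mean()

d, n = 64, 500
W_true = torch.linalg.qr(torch.randn(d, d)).Q     # hidden ground-truth rotation
Xv = torch.randn(n, d)                            # fake vision embeddings
Yl = Xv @ W_true + 0.1 * torch.randn(n, d)        # noisy "language" counterparts
W = fit_procrustes(Xv[:400], Yl[:400])            # fit on 400 training pairs
print(retrieval_p_at_1(Xv[400:], Yl[400:], W))    # precision@1, held-out pairs
```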
I won't be attending #ICLR in person this year 😢. But feel free to check out our paper "Revisiting the Othello World Model Hypothesis" with Anders Søgaard, accepted at the ICLR World Models Workshop!
Paper link: arxiv.org/abs/2503.04421
Thrilled to announce "Multimodality Helps Few-shot 3D Point Cloud Semantic Segmentation" has been accepted as a Spotlight (top 5%) at #ICLR2025!
Our model MM-FSS leverages 3D, 2D, & text modalities for robust few-shot 3D segmentation – all without extra labeling cost. 🤩
arxiv.org/pdf/2410.22489
More details 👇
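As a very loose illustration of the multimodal few-shot idea (not MM-FSS's actual architecture; backbones are faked with random features and all names are mine): build class prototypes from fused support-point features, then label query points by prototype similarity.

```python
# Loose illustration only: fuse per-point features from several modalities
# into class prototypes, then assign each query point its nearest prototype.
import torch
import torch.nn.functional as F

def prototypes(feats, labels, n_cls):
    """Mean support feature per class. feats: (N, D); labels: (N,)."""
    return torch.stack([feats[labels == c].mean(dim=0) for c in range(n_cls)])

N_support, N_query, n_cls = 2048, 4096, 3
# Pretend fusion: concatenate 3D, projected-2D, and text-conditioned features.
support = torch.cat([torch.randn(N_support, 32) for _ in range(3)], dim=1)
query = torch.cat([torch.randn(N_query, 32) for _ in range(3)], dim=1)
support_lbl = torch.randint(0, n_cls, (N_support,))

protos = F.normalize(prototypes(support, support_lbl, n_cls), dim=1)
pred = (F.normalize(query, dim=1) @ protos.T).argmax(dim=1)
print(pred.shape)   # (N_query,): a predicted segment label per 3D point
```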
Forget just thinking in words.
📢 Our New Preprint:
🚀 New Era of Multimodal Reasoning 🚨
💭 Imagine While Reasoning in Space with MVoT
Multimodal Visualization-of-Thought (MVoT) revolutionizes reasoning by generating visual "thoughts" that transform how AI thinks, reasons, and explains itself.
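To make the interleaving concrete, here's an illustrative-only skeleton of a reasoning loop that alternates text thoughts with generated image tokens; `MultimodalModel` is a hypothetical stand-in, not MVoT's real API:

```python
# Illustrative skeleton only: alternate text "thoughts" with generated image
# tokens, feeding each visualization back into the context for the next step.
from dataclasses import dataclass

@dataclass
class Step:
    text: str
    image_tokens: list[int]        # discrete visual tokens (e.g., VQ codes)

class MultimodalModel:             # hypothetical interleaved-generation MLLM
    def step(self, context: list) -> Step:
        # A real model would autoregressively decode text AND image tokens.
        return Step(text="thought: move one cell left", image_tokens=[1, 7, 3])

def mvot_style_reasoning(model: MultimodalModel, prompt: str, max_steps: int = 4):
    context: list = [prompt]
    for _ in range(max_steps):
        s = model.step(context)                 # emit thought + visualization
        context += [s.text, s.image_tokens]     # the image becomes new context
    return context

trace = mvot_style_reasoning(MultimodalModel(), "Navigate the maze to the goal.")
print(len(trace))
```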
FGVC12 Workshop is coming to #CVPR 2025 in Nashville!
Are you working on fine-grained visual problems?
This year we have two peer-reviewed paper tracks:
i) 8-page CVPR Workshop proceedings
ii) 4-page non-archival extended abstracts
CALL FOR PAPERS: sites.google.com/view/fgvc12/...
Here's a short film produced by the Danish Royal Academy of Sciences, showcasing the WineSensed 🍷 project of Þóranna Bender et al.: thoranna.github.io/learning_to_...
From San Diego to New York to Copenhagen, wishing you Happy Holidays! 🎄
With @neuripsconf.bsky.social right around the corner, we're excited to be presenting our work soon! Here's an overview
(1/5)
Great collaboration with @constanzafierro.bsky.social, @YovaKem_v2, and Anders Søgaard!
👨‍💻 github.com/jiaangli/VLCA
📄 direct.mit.edu/tacl/article...
💡 Takeaways:
1. Representation spaces of LMs and VMs become increasingly (though only partially) similar as model size grows.
2. Concepts with lower frequency, polysemy, and dispersion can be easier to align.
3. Shared concepts between LMs and VMs might extend beyond nouns.
🧵 (7/8)
#NLP #NLProc
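If you want to poke at takeaway 1 yourself, a quick toy with linear CKA shows the kind of trend described: representations that share more structure score as more similar. The metric and synthetic data are my illustration, not the paper's measurement protocol:

```python
# Toy demo of takeaway 1 using linear CKA as the similarity measure.
import torch

def linear_cka(X, Y):
    """Linear CKA between (N, Dx) and (N, Dy) feature matrices."""
    X = X - X.mean(dim=0)
    Y = Y - Y.mean(dim=0)
    hsic = (X.T @ Y).norm() ** 2
    return (hsic / ((X.T @ X).norm() * (Y.T @ Y).norm())).item()

N = 1000
shared = torch.randn(N, 16)              # structure common to both modalities
for size, noise in [("small", 2.0), ("medium", 1.0), ("large", 0.5)]:
    # Pretend bigger models carry less modality-specific noise.
    Xv = torch.cat([shared, noise * torch.randn(N, 16)], dim=1)
    Yl = torch.cat([shared, noise * torch.randn(N, 16)], dim=1)
    print(size, round(linear_cka(Xv, Yl), 3))
```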
🌱 We then discuss the implications of our findings:
- the LM understanding debate
- the study of emergent properties
- philosophy
🧵 (6/8)
🔍 We also measure how well the mapping generalizes to other parts of speech, and explore the impact of training-data size. 🤖 To investigate the effect of incorporating text signals during vision pretraining, we compare pure vision models against selected CLIP vision encoders.
🧵 (5/8)