💬 We thank Prof. Kementchedjhieva for the insightful talk and the discussion with UKP members on multimodal modeling and the future of vision-language systems.
#UKPLab #MultimodalAI #VisionLanguageModels #NLP #GuestTalk #NLProc #MBZUAI @tuda.bsky.social @cs-tudarmstadt.bsky.social
LEGS embeds language into 3D Gaussian splats, training 3.5x faster than LERF while improving pose fidelity in large-scale indoor scenes. #visionlanguagemodels
LEGS matches LERF in object recall while cutting training time to 12 minutes, showing how bundle adjustment boosts 3D Gaussian splat quality. #visionlanguagemodels
A real-time robotic system that builds 3D indoor maps and localizes objects using open-vocabulary language queries and Gaussian Splatting. #visionlanguagemodels
A new system merges 3D Gaussian Splatting and language models to enable real-time semantic mapping and object localization for robots. #visionlanguagemodels
A new system fuses language models with 3D Gaussian Splatting to help robots build real-time, semantic maps 3.5x faster than existing methods. #visionlanguagemodels
How a new vision-language AI uses multi-stage reasoning to identify schools, parks, and hospitals—going beyond pixels to understand cities. #visionlanguagemodels
Reviews state-of-the-art MLLMs. Highlights the challenge of expanding current models beyond the simple one-to-one image-text relationship. #visionlanguagemodels
I will be at EMNLP next week presenting this work on November 7th! Reach out with any questions :))
Work done with my advisor, Mirella Lapata!
Preprint: arxiv.org/pdf/2505.14627
#EMNLP2025 #multimodallearning #scalableoversight #visionlanguagemodels #nlproc
PerSense-D is a new benchmark dataset for personalized dense image segmentation, advancing AI accuracy in crowded visual environments. #visionlanguagemodels
PerSense's training-free one-shot segmentation framework combines adaptive prompts, density maps, and VLMs for dense image interpretation. #visionlanguagemodels
PerSense is a model-agnostic, training-free framework for one-shot personalized instance segmentation in dense images, guided by density and vision-language cues. #visionlanguagemodels
Reason‑RFT improves visual reasoning in vision‑language models
Reason-RFT improves visual reasoning in vision-language models, according to the announcement. Read more: getnews.me/reason-rft-improves-visu... #reasonrft #visionlanguagemodels #visualreasoning
MetaSpatial improves 3D spatial reasoning in vision-language models
MetaSpatial says it improves 3D spatial reasoning in vision-language models. Read more: getnews.me/metaspatial-improves-3d-... #metaspatial #3dspatial #visionlanguagemodels
Vision-Language Models Struggle with Compositional Counting
VLMCountBench shows vision‑language models count objects accurately when only one shape type (triangles, circles, or squares) appears, but accuracy drops on scenes with multiple shapes. Read more: getnews.me/vision-language-models-s... #visionlanguagemodels #counting
Vision‑Language Models Linked to Action Expert for Robot Planning
A new framework pairs vision‑language models with an action expert that refines sparse 3‑D waypoints into collision‑free motion plans, trained on synthetic and real point‑cloud data. getnews.me/vision-language-models-l... #visionlanguagemodels #robotics
DepthLM Achieves Accurate Metric Depth with Vision‑Language Models
DepthLM equips vision-language models with metric depth prediction, matching the accuracy of dedicated depth estimators, per the paper submitted on 1 Oct 2025. Read more: getnews.me/depthlm-achieves-accurat... #depthlm #visionlanguagemodels
CoFFT Boosts Vision Language Models with Iterative Focused Reasoning
CoFFT, a training-free technique, lifts Vision Language Model accuracy by 3.1%–5.8% and debuted on 1 Oct 2025, iteratively sharpening visual focus during inference. getnews.me/cofft-boosts-vision-lang... #visionlanguagemodels #cofft
Vision-Language Models Restore Spatial Awareness with New Diagnostic Tools
Three new diagnostics—PSI, CMB, RoPE probe—show VLMs favor visual tokens; reducing visual token norms raised PSI and improved spatial reasoning. Read more: getnews.me/vision-language-models-r... #visionlanguagemodels #spatialreasoning
Explanation-Driven Counterfactual Testing Boosts Faithfulness of Vision-Language Model Explanations
EDCT audits VLM explanation faithfulness on 120 OK‑VQA examples, showing many explanations are plausible but not causally linked to answers. Read more: getnews.me/explanation-driven-count... #edct #visionlanguagemodels
Capability-Attributed Data Curation Improves Vision-Language Models
CADC reduces required training data to about 5% of the original set while still outperforming full-data models on multimodal benchmarks, the authors report. Read more: getnews.me/capability-attributed-da... #visionlanguagemodels #capabilitycuration
This paper summarizes a comprehensive framework for typographic attacks, demonstrating their effectiveness and transferability against Vision-LLMs like LLaVA. #visionlanguagemodels
This article presents an empirical study on the effectiveness and transferability of typographic attacks against major Vision-LLMs using AD-specific datasets. #visionlanguagemodels
This article explores the physical realization of typographic attacks, categorizing their deployment into background and foreground elements. #visionlanguagemodels
GSM8K-V Shows Vision Language Models Lag on Visual Math Problems
GSM8K‑V adds visual format to 1,319 grade‑school math problems. Gemini‑2.5‑Pro scores 95.22% on text but only 46.93% on the visual version, showing a gap for VLMs. getnews.me/gsm8k-v-shows-vision-lan... #gsm8kv #visionlanguagemodels
This article proposes a linguistic augmentation scheme for typographic attacks using explicit instructional directives. #visionlanguagemodels
This article details the multi-step typographic attack pipeline, including Attack Auto-Generation and Attack Augmentation. #visionlanguagemodels
Disentangling Text for Better Language‑Based Object Detection
TaSe splits queries into object, attribute, and relation parts, then hierarchically recombines them, delivering a 24% boost on the OmniLabel benchmark. Read more: getnews.me/disentangling-text-for-b... #visionlanguagemodels #objectdetection
This article analyzes the critical safety trade-off of integrating Vision-LLMs into autonomous driving (AD) systems. #visionlanguagemodels
Vision-Language Models' Ability to Name Colors Evaluated
A study tested five vision‑language models on 957 color samples and found high accuracy for prototypical colors but lower performance on non‑prototypical shades across nine languages. getnews.me/vision-language-models-a... #visionlanguagemodels #colornaming