Our paper VisOnlyQA has been accepted to
@colmweb.org #COLM2025! See you in Montreal!
We find that even recent Vision Language Models struggle with simple questions about geometric properties in images, such as "What is the degree of angle AOD?"
arxiv.org/abs/2412.00947
Excited to share that Communications of the ACM published an article featuring an interview with me about LLM self-correction! I mainly discuss self-correction before o1, but I believe it still offers useful takeaways.
cacm.acm.org/news/self-co...
arxiv.org/abs/2406.01297
VLMEvalKit now supports our VisOnlyQA dataset!
github.com/open-compass...
VisOnlyQA reveals that even recent LVLMs like GPT-4o and Gemini 1.5 Pro stumble on simple visual perception questions, e.g., "What is the degree of angle AOD?"
arxiv.org/abs/2412.00947
VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information
Ryo Kamoi, Yusen Zhang, Sarkar Snigdha Sarathi Das, Ranran Haoran Zhang, Rui Zhang
Paper: arxiv.org/abs/2412.00947
Data: huggingface.co/collections/...
Code: github.com/psunlpgroup/...
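If you want to try VisOnlyQA yourself, loading it with the Hugging Face datasets library should look roughly like the sketch below. The repository id and column names are placeholders (the collection link above is truncated), so check the dataset card for the exact ones.

from datasets import load_dataset

# Placeholder repository id -- replace it with the actual dataset name
# from the Hugging Face collection linked above.
visonlyqa = load_dataset("<hf-user>/VisOnlyQA", split="test")

example = visonlyqa[0]
print(example.keys())  # inspect the available columns (image, question, options, answer, ...)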
Interestingly, our experiments suggest that stronger language models improve the visual perception of LVLMs, even when they use the same visual encoder (ViT).
We conclude that we need to improve both the training data and model architecture of LVLMs for better visual perception. [4/n]
We hypothesize that the weak visual perception stems from a lack of training data. To verify this, we create training data for VisOnlyQA, but performance after fine-tuning varies across tasks and models, suggesting that training data is not the only problem. [3/n]
VisOnlyQA includes questions about geometric and numerical information in scientific figures.
Recent benchmarks for LVLMs often involve reasoning or knowledge, putting less focus on visual perception. In contrast, VisOnlyQA is designed to evaluate visual perception directly. [2/n]
New preprint! Do LVLMs have strong visual perception capabilities? Not quite yet...
We introduce VisOnlyQA, a new dataset for evaluating the visual perception of LVLMs, and find that existing LVLMs perform poorly on it. [1/n]
arxiv.org/abs/2412.00947
github.com/psunlpgroup/...
I'm on the academic job market this year! I'm completing my @uwcse.bsky.social @uwnlp.bsky.social Ph.D. (2025), focusing on overcoming LLM limitations, such as hallucinations, by building new LMs.
My Ph.D. work focuses on Retrieval-Augmented LMs to create more reliable AI systems.
This reading list is based on our survey paper. Don't forget to check it out as well.
When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs (TACL 2024)
arxiv.org/abs/2406.01297
Curious about LLM self-correction? Check out our reading list!
github.com/ryokamoi/llm...
We feature papers & blogs on:
* Key self-correction papers
* Negative results in self-correction
* Projects inspired by OpenAI o1
A starter pack for the NLP and Computational Linguistics researchers at UT Austin!
go.bsky.app/75g9JLT
We at UT Linguistics are hiring for 2 faculty positions in Computational Linguistics! Assistant or Associate Professors; deadline Dec 1.
UT has a super vibrant comp ling & #nlp community!!
Apply here: apply.interfolio.com/158280
Hello Bluesky. It was great to talk with so many people at
#EMNLP2024!
The paper we presented, a survey on self-correction of LLMs, is now available from MIT Press!
When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs (TACL 2024)
direct.mit.edu/tacl/article...