Project: Do Large Language Models Really “See”? Evaluating Visual Understanding vs. Textual Memorization in Visualization Annotation
Description
Problem Description
Large Language Models (LLMs) are increasingly used to generate natural‑language annotations for data visualizations such as line charts, scatterplots, and bar plots. These annotations can describe trends, highlight anomalies, or summarize relationships in the data. However, it remains unclear whether LLMs produce these annotations by genuinely interpreting the visual content or by exploiting memorized textual patterns, dataset artifacts, or statistical priors unrelated to the actual visualization.
This project investigates the extent to which LLM‑generated annotations reflect true visual understanding. Specifically, it examines whether LLMs rely on the rendered visual content or instead infer annotations from textual metadata, axis labels, or common dataset structures. Understanding this distinction is crucial for evaluating the reliability of LLM‑assisted visualization tools and for designing systems that support trustworthy data analysis.
Goals
The main goals of this project are:
- Assess the degree of visual grounding in LLM‑generated annotations for common visualization types (line charts, bar charts, scatterplots).
- Design controlled experiments that separate visual information from textual cues to test whether LLMs rely on the image itself or on memorized patterns.
- Develop evaluation metrics to quantify visual understanding vs. textual memorization.
- Provide design recommendations for visualization systems that incorporate LLMs responsibly.
Approach
The project will proceed in several stages:
- Dataset Construction
  - Collect or generate a set of visualizations (line, bar, scatter) with controlled variations.
  - Create matched variants of each visualization (see the rendering sketch after this list):
    - Full variant: includes axes, labels, legends, and titles.
    - Minimal variant: removes all textual elements.
    - Perturbed variant: carries misleading or randomized labels.
    - Synthetic variant: uses procedurally generated data that does not appear in common public datasets.
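As a concrete illustration, the sketch below shows how the full, minimal, and perturbed variants of a single line chart could be rendered with matplotlib. The underlying data, labels, and file names are illustrative assumptions; the synthetic variant would follow the same pattern with procedurally generated series.

```python
# Sketch: render "full", "minimal", and "perturbed" variants of one line chart.
# The data, titles, and file names are illustrative assumptions, not fixed choices.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=42)
x = np.arange(24)
y = 0.8 * x + rng.normal(0, 3, size=x.size)  # synthetic upward trend

def render(variant: str, path: str) -> None:
    fig, ax = plt.subplots(figsize=(4, 3))
    ax.plot(x, y, marker="o", linewidth=1)
    if variant == "full":
        ax.set_title("Monthly sales (synthetic)")
        ax.set_xlabel("Month")
        ax.set_ylabel("Units sold")
    elif variant == "minimal":
        ax.set_xticks([])  # strip all textual elements
        ax.set_yticks([])
    elif variant == "perturbed":
        ax.set_title("Average temperature")  # deliberately misleading labels
        ax.set_xlabel("Latitude")
        ax.set_ylabel("Degrees Celsius")
    fig.tight_layout()
    fig.savefig(path, dpi=150)
    plt.close(fig)

for variant in ("full", "minimal", "perturbed"):
    render(variant, f"line_chart_{variant}.png")
```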
- Annotation Generation
  - Use one or more LLMs to generate annotations for each visualization variant.
  - Prompt the models under three input conditions: image only, text only, and image plus text (see the prompting sketch after this list).
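The three prompting conditions could be organized along the lines of the sketch below. The `query_model` function is a hypothetical stand-in for whichever multimodal LLM API is ultimately used, and the prompt and metadata strings are assumptions.

```python
# Sketch: collect annotations under three prompt conditions.
# `query_model` is a hypothetical placeholder for the real LLM call (assumption).
from typing import Optional

PROMPT = "Describe the main trend or pattern shown in this visualization."
METADATA = "Line chart. X axis: Month. Y axis: Units sold. Title: Monthly sales."

def query_model(prompt: str, image_path: Optional[str] = None) -> str:
    """Placeholder for the actual multimodal LLM request; returns a canned reply here."""
    return f"[model reply to a {'multimodal' if image_path else 'text-only'} prompt]"

conditions = {
    "image_only": lambda: query_model(PROMPT, image_path="line_chart_minimal.png"),
    "text_only":  lambda: query_model(f"{PROMPT}\n\n{METADATA}"),
    "image_text": lambda: query_model(f"{PROMPT}\n\n{METADATA}", image_path="line_chart_full.png"),
}

annotations = {name: run() for name, run in conditions.items()}
```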
- Evaluation Framework
  - Compare annotations across visualization variants to detect reliance on:
    - Visual patterns (e.g., slope, clustering, outliers)
    - Textual cues (e.g., axis labels, titles)
    - Dataset priors or memorized examples
  - Develop metrics such as (see the scoring sketch after this list):
    - Annotation consistency across variants
    - Visual-textual divergence
    - Error typology (e.g., hallucinated trends, label-driven misinterpretations)
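One simple way the consistency metric could be operationalized is sketched below. Word-level Jaccard overlap is used here as a lightweight stand-in for stronger semantic measures (e.g., embedding cosine similarity), and the example annotations are illustrative assumptions only.

```python
# Sketch: quantify annotation consistency across variants with a lexical proxy.
# Jaccard word overlap stands in for stronger semantic similarity measures;
# the example annotation strings are illustrative assumptions.
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two annotations."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

annotations = {
    "full":      "Sales rise steadily over the year, peaking in the final months.",
    "minimal":   "The values increase roughly linearly from left to right.",
    "perturbed": "Temperature climbs with latitude, following the axis labels.",
}

pairwise = {f"{a}~{b}": jaccard(annotations[a], annotations[b])
            for a, b in combinations(annotations, 2)}

# High full~minimal agreement suggests visually grounded descriptions, while
# annotations that track the perturbed labels instead point to text-driven behaviour.
print(pairwise)
```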
- Analysis & Interpretation
  - Identify conditions under which LLMs demonstrate genuine visual reasoning.
  - Detect failure modes where models rely on memorization or textual artifacts.
  - Compare performance across visualization types and prompt strategies.
- Outcome & Recommendations
  - Summarize findings on the reliability of LLM-based visualization annotation.
  - Provide guidelines for designing visualization tools that integrate LLMs safely and effectively.
Details
- Supervisor: Fernando Paulovich