About MTabVQA
Vision-Language Models (VLMs) struggle to reason over multiple visually presented tables, a common real-world scenario that existing benchmarks do not adequately assess. We introduce MTabVQA, a benchmark with 3,745 complex question-answer pairs requiring multi-hop reasoning across several table images. Our evaluations reveal significant VLM limitations on this task. We also release MTabVQA-Instruct, a large-scale instruction-tuning dataset for visual multi-tabular reasoning.
Our main contributions are:
- We introduce MTabVQA-Eval, a novel benchmark designed to evaluate multi-hop reasoning over multiple tables presented as images, addressing a key gap in existing table QA benchmarks.
- We provide extensive benchmark results for SOTA open-source and proprietary VLMs on MTabVQA, revealing significant challenges posed by this task.
- We release MTabVQA-Instruct, a large-scale instruction-tuning dataset.
- We introduce TableVision, a VLM fine-tuned on MTabVQA-Instruct, which shows significant improvements on visual multi-tabular reasoning.
News
- Aug 2025: 🎉 Our paper is accepted at EMNLP 2025 Findings!
- June 2025: 🚀 MTabVQA is officially released! 📝 The paper is available on arXiv.
Citation
@misc{singh2025mtabvqaevaluatingmultitabularreasoning,
title={MTabVQA: Evaluating Multi-Tabular Reasoning of Language Models in Visual Space},
author={Anshul Singh and Chris Biemann and Jan Strich},
year={2025},
eprint={2506.11684},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2506.11684},
}
Authors
1 Department of Information Technology, Panjab University
2 Language Technology Group, Universität Hamburg
| Model | MTabVQA-Spider EM | MTabVQA-Spider F1 | MTabVQA-Query EM | MTabVQA-Query F1 | MTabVQA-ATIS EM | MTabVQA-ATIS F1 | MTabVQA-MiMo EM | MTabVQA-MiMo F1 | Overall EM | Overall F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| **Open-Source VLMs (Zero-Shot)** | | | | | | | | | | |
| Gemma-3-12B-IT | 15.6 | 48.0 | 10.3 | 38.1 | 11.6 | 35.1 | 9.3 | 18.6 | 11.8 | 40.1 |
| Qwen2.5-VL-7B | 8.0 | 39.8 | 7.8 | 33.9 | 6.3 | 32.6 | 9.3 | 22.2 | 7.8 | 35.1 |
| InternVL3-8B-Instruct | 6.1 | 32.4 | 5.2 | 24.8 | 3.6 | 20.3 | 7.0 | 19.1 | 5.4 | 26.6 |
| Phi-3.5-Vision-Instruct | 2.9 | 26.1 | 2.4 | 22.0 | 1.8 | 15.0 | 0.8 | 3.2 | 2.5 | 22.3 |
| LLaVA-One-Vision-Qwen2-7B | 2.2 | 20.0 | 2.3 | 15.7 | 0.0 | 9.2 | 0.7 | 5.5 | 2.1 | 18.4 |
| **Proprietary VLMs (Zero-Shot)** | | | | | | | | | | |
| GPT-4.1 | 49.0 | 74.3 | 34.2 | 58.5 | 6.3 | 39.9 | 20.2 | 39.6 | 37.0 | 61.7 |
| Gemini-2.0-Flash | 42.9 | 68.5 | 31.4 | 57.3 | 22.3 | 36.0 | 24.0 | 42.3 | 34.1 | 59.3 |
| **Fine-tuned Model (Ours)** | | | | | | | | | | |
| TableVision (Ours) | 32.4 | 64.3 | 49.8 | 72.6 | 33.0 | 45.9 | 20.2 | 36.2 | 43.4 | 68.2 |
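The EM and F1 columns above follow standard QA scoring: exact match after answer normalization, and token-level F1 between prediction and gold. A minimal sketch (the exact normalization used in our evaluation scripts may differ):

```python
import re
from collections import Counter

def normalize(ans: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace (common QA normalization)."""
    ans = re.sub(r"[^\w\s]", " ", ans.lower())
    return " ".join(ans.split())

def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))

def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall over the normalized answers."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

Note how F1 gives partial credit: a verbose but correct answer scores 0 on EM yet well above 0 on F1, which is why the F1 columns are uniformly higher.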
Traditional benchmarks for table understanding and QA often focus on single-table scenarios or non-visual data. MTabVQA addresses the challenge of robust interpretation and reasoning over multi-tabular data presented as images, common in web pages, PDFs, and digital documents. MTabVQA-Eval comprises 3,745 complex question-answer pairs requiring multi-hop reasoning across two to five table images. The benchmark is designed to evaluate how well models can:
- Understand diverse visual table layouts presented as images.
- Parse and correlate information across multiple, physically separate tables.
- Execute multi-hop reasoning grounded in visual data.
Table images are generated with significant visual diversity (10 distinct styling themes) to mimic real-world appearances, challenging models on robust OCR and layout understanding. Questions in MTabVQA cover distinct reasoning categories like aggregation, comparison, fact-checking, and ranking.
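To make the theming idea concrete, here is a hypothetical sketch of how a table could be rendered under one of several visual styles before being rasterized to an image (the theme names and CSS are illustrative, not the exact styles used in the benchmark):

```python
# Illustrative styling themes; a browser or HTML-to-image tool would
# turn the returned markup into the kind of table image the benchmark uses.
THEMES = {
    "plain":   "border-collapse: collapse; font-family: serif;",
    "striped": "border-collapse: collapse; font-family: sans-serif; background: #f6f6f6;",
    "dark":    "border-collapse: collapse; color: #eee; background: #222;",
}

def render_table_html(header, rows, theme="plain"):
    """Render a header and data rows as one themed HTML table string."""
    cells = lambda tag, vals: "".join(f"<{tag}>{v}</{tag}>" for v in vals)
    body = "".join(f"<tr>{cells('td', row)}</tr>" for row in rows)
    return (f'<table style="{THEMES[theme]}">'
            f"<tr>{cells('th', header)}</tr>{body}</table>")
```

Varying the theme per table forces models to rely on genuine layout understanding rather than memorizing one rendering style.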
Post-Training Strategies (SFT, CoT, GRPO)
To explore methods for enhancing VLM performance, we investigated several post-training techniques using a subset of MTabVQA-Instruct (2,395 QA pairs from the Spider data source) with the Qwen2.5-VL-3B model. We compared Supervised Fine-Tuning (SFT), Chain-of-Thought (CoT) prompting, and Group Relative Policy Optimization (GRPO). SFT yielded substantial performance gains over both CoT and GRPO, boosting EM to 28.0% and F1 to 55.9% on the corresponding MTabVQA-Eval split. This demonstrates the strong effectiveness of targeted instruction tuning for this complex multi-hop reasoning task. While GRPO showed improvement, its gains did not surpass SFT with LoRA, possibly due to the challenge of defining a more sophisticated reward function for visual multi-tabular reasoning.
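One way to make the reward-design difficulty concrete: a rule-based GRPO reward for this task reduces to string matching on the final answer, which gives no credit for correct intermediate table lookups. A minimal sketch (function and format conventions here are illustrative assumptions, not the exact reward used in our experiments):

```python
import re

def extract_answer(completion: str) -> str:
    """Pull the final answer from a completion ending in 'Answer: ...' (assumed format)."""
    m = re.search(r"answer:\s*(.+)", completion, re.IGNORECASE)
    return m.group(1).strip() if m else ""

def em_reward(completion: str, gold: str) -> float:
    """Exact-match reward plus a small bonus for producing a parseable answer."""
    pred = extract_answer(completion)
    fmt_bonus = 0.1 if pred else 0.0           # encourage well-formed outputs
    correct = 1.0 if pred.lower() == gold.lower() else 0.0
    return correct + fmt_bonus
```

Such a sparse, outcome-only signal is one plausible reason GRPO trailed SFT here: it cannot distinguish a near-miss multi-hop chain from a random guess.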
Impact of Post-Training Data Scale and Source
We further analyzed how VLM performance is affected by the scale and source of data used for instruction fine-tuning. Using Qwen2.5-VL-7B as the base VLM, we fine-tuned it on several MTabVQA-Instruct subsets derived from different original data sources (Spider, MultiTabQA, MiMo+ATIS, and the full MTabVQA-Instruct set). Generally, more fine-tuning data led to better EM and F1 scores. The model trained on the full MTabVQA-Instruct dataset (15,853 diverse examples) achieved the highest overall F1 score (68.2%). However, the source of the data was critically important. For instance, a model trained only on the large MultiTabQA subset performed surprisingly poorly overall, suggesting that data characteristics and alignment with the benchmark are crucial. This highlights that while scaling instruction data is advantageous, the relevance and diversity of this data with respect to the target tasks are paramount for achieving optimal performance and generalization.
Acknowledgement
This website is based on the layout from the T²-RAGBench and Pangea project pages.