Course Project (CSE 576 – NLP) · Jan 2025 – May 2025

Optimizing Query Execution in Table QA

A multimodal Table Question Answering pipeline that decomposes questions into modality-specific sub-questions, filters irrelevant images with CLIP, and replaces images with captions in a structured table, achieving 73.94% accuracy on the MultiModalQA dataset and outperforming all reported baselines.

Technologies
Python · Gemini 3 Flash Pro API · CLIP · SQL · MultiModalQA Dataset · Zero-Shot Prompting · LLM Pipelines · Image Captioning
Team

Neil Mahajan · Gokul Ramasamy · Mann Vora · Yi Xiao

Overview

Optimizing Query Execution in Table QA is a group project for CSE 576 (Natural Language Processing) at Arizona State University. The project addresses the challenge of answering complex questions involving semi-structured tables with textual and visual information.

The pipeline decomposes queries into modality-specific sub-questions (ImageQ, TableQ, TextQ), retrieves relevant images using CLIP embeddings, prunes the table to remove irrelevant rows, replaces images with captions, and generates SQL queries via the Gemini 3 Flash Pro API in a zero-shot setting—without any fine-tuning or retraining.
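The decomposition step can be illustrated with a small parsing helper. This is a hypothetical sketch, not the project's actual code: the sub-question labels (ImageQ, TableQ, TextQ) come from the pipeline description above, while the function name, prompt-output format, and example text are assumptions.

```python
import re

# Modality labels produced by the decomposition stage
# (ImageQ / TableQ / TextQ, as described in the pipeline overview).
MODALITIES = ("ImageQ", "TableQ", "TextQ")

def parse_decomposition(llm_output: str) -> dict:
    """Parse an LLM decomposition response of the assumed form
    'ImageQ: ...' / 'TableQ: ...' / 'TextQ: ...' into a dict
    mapping modality -> sub-question. Unlabeled lines are ignored."""
    subquestions = {}
    for line in llm_output.splitlines():
        match = re.match(r"\s*(ImageQ|TableQ|TextQ)\s*:\s*(.+)", line)
        if match:
            subquestions[match.group(1)] = match.group(2).strip()
    return subquestions

# Hypothetical decomposition of a two-hop question.
example = """ImageQ: What landmark is shown in the photo?
TableQ: Which row lists the city with that landmark?"""
print(parse_decomposition(example))
```

Each sub-question can then be routed to the stage that handles its modality (CLIP retrieval for ImageQ, SQL generation for TableQ, and so on).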

Evaluated on the MultiModalQA (MMQA) dataset, the approach achieves 73.94% accuracy, outperforming Solar (65.4%), MuRAG (51.4%), the MMQA baseline (46.5%), and MoqaGPT (37.8%). The results demonstrate that careful input curation and modular prompting are more effective than increasing model size for improving QA performance.

Features

Pipeline Stages

  • Preprocessing: Identifies column datatypes and image columns, and extracts contextual text
  • Question Decomposition: LLM decomposes questions into ImageQ, TableQ, and TextQ sub-questions
  • Relevant Image Identification: CLIP joint embedding space filters top-k relevant images
  • Caption Generation & Table Pruning: Images replaced with captions conditioned on ImageQ; irrelevant rows removed
  • SQL Query & Answer Extraction: Structured table passed to LLM for SQL generation and final answer
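The image-identification stage above can be sketched as a top-k cosine-similarity search in CLIP's joint embedding space. A minimal illustration assuming the sub-question and image embeddings have already been produced by a CLIP encoder; the `top_k_images` helper, array shapes, and toy data are assumptions, not the project's code.

```python
import numpy as np

def top_k_images(text_emb: np.ndarray, image_embs: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k images whose embeddings are most
    similar (cosine) to the ImageQ sub-question's text embedding.

    text_emb:   (d,) CLIP embedding of the sub-question
    image_embs: (n, d) CLIP embeddings of the candidate images
    """
    text_emb = text_emb / np.linalg.norm(text_emb)
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = image_embs @ text_emb            # cosine similarities, shape (n,)
    return np.argsort(sims)[::-1][:k]       # indices of the k most similar images

# Toy example: 4 "images" in a 3-d embedding space; the query is a
# near-duplicate of image 2, so image 2 should rank first.
rng = np.random.default_rng(0)
images = rng.normal(size=(4, 3))
query = images[2] + 0.01 * rng.normal(size=3)
print(top_k_images(query, images, k=2))
```

Images outside the top k are dropped before caption generation, which is what keeps irrelevant visual content out of the pruned table.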

Results & Key Insights

  • 73.94% accuracy on the MMQA dev set, over 8 percentage points higher than the strongest baseline (Solar, 65.4%)
  • Zero-shot evaluation with no fine-tuning or large-scale retraining required
  • Converting images to language captions significantly improves LLM reasoning
  • Early-stage pruning of irrelevant content reduces noise and improves prediction quality
  • Gemini chat history preserves context across multi-step reasoning

Screenshots

Proposed Architecture

Question Decomposition Tree Diagram
