Multimodal symbolic logical reasoning, which aims to deduce new facts from multimodal input via formal logic, is critical in high-stakes applications such as autonomous driving and medical diagnosis, where rigorous, deterministic reasoning helps prevent serious consequences. To evaluate this capability in current state-of-the-art vision language models (VLMs), we introduce MuSLR, the first benchmark for multimodal symbolic logical reasoning grounded in formal logical rules. MuSLR comprises 1,093 instances across 7 domains, covering 35 atomic symbolic logic rules and 976 logical combinations, with reasoning depths ranging from 2 to 9. We evaluate 7 state-of-the-art VLMs on MuSLR and find that all of them struggle with multimodal symbolic reasoning, with the best model, GPT-4.1, achieving only 46.8% accuracy.
The proposed tasks require models to integrate information from both an image I and a text passage T, such that neither modality alone is sufficient for correct inference: fusing visual and textual context is essential for deriving accurate and consistent conclusions.
Task-I: Truth Evaluation (True/False/Unknown) Question. Given an image I, a text passage T, and an argument A, the model must determine the truth value of the argument based on the combined information from I and T. Specifically, the model outputs the truth value Truth(A) ∈ {True, False, Unknown} and generates a sequence of reasoning steps R = {R1, R2, . . . , Rn}, where each Ri represents an individual step that contributes to the final decision. Formally, the input is a triplet (I, T, A), and the output consists of Truth(A) and R.
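For concreteness, below is a minimal Python sketch of how a Task-I instance and its exact-match scoring could be represented; the field names (image_path, text_passage, gold_truth) and the scoring helper are illustrative assumptions, not the released benchmark format.

from dataclasses import dataclass
from typing import Literal

# Truth values allowed for an argument A in Task-I.
TruthValue = Literal["True", "False", "Unknown"]

@dataclass
class TaskIInstance:
    """One Task-I instance: input (I, T, A), output Truth(A) plus reasoning steps R."""
    image_path: str         # image I
    text_passage: str       # text passage T
    argument: str           # argument A whose truth value is evaluated
    gold_truth: TruthValue  # reference label

def score_task1(predicted_truth: TruthValue, instance: TaskIInstance) -> bool:
    """Exact-match check on the predicted truth value; the reasoning steps R are judged separately."""
    return predicted_truth == instance.gold_truth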
Task-II: Multiple Choice Question. Given an image I, a text passage T, and candidate arguments {A1, A2, A3, A4}, the model must select the argument that best matches the image and text, denoted as BestArgument(I, T) ∈ {A1, A2, A3, A4}. Additionally, the model must provide detailed reasoning steps R = {R1, R2, . . . , Rn}, where each Ri details a step in the reasoning process. Formally, the input is a triplet (I, T, {A1, A2, A3, A4}), and the output consists of BestArgument(I, T) and R.
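Analogously, a Task-II instance reduces to a four-way multiple-choice item; the sketch below again uses hypothetical field names and a simple exact-match check on the selected candidate.

from dataclasses import dataclass
from typing import List

@dataclass
class TaskIIInstance:
    """One Task-II instance: input (I, T, {A1, A2, A3, A4}), output BestArgument(I, T) plus reasoning steps R."""
    image_path: str        # image I
    text_passage: str      # text passage T
    candidates: List[str]  # four candidate arguments A1..A4
    gold_index: int        # index (0-3) of the argument that best matches I and T

def score_task2(predicted_index: int, instance: TaskIIInstance) -> bool:
    """Exact-match check on the selected candidate."""
    return predicted_index == instance.gold_index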
We collect images from various sources such as COCO, Flickr30k, nocaps, MIMIC, RVL_CDIP, ScienceQA, and manually collected Traffic Reports. Visual details for each image are extracted using GPT-4o, ensuring diverse and fine-grained descriptions. We carefully select non-trivial logical inference rules, such as Modus Ponens and Hypothetical Syllogism, drawn from propositional logic (PL), first-order logic (FOL), and non-monotonic logic (NM). These rules are then composed into meaningful but abstract reasoning chains. The abstract chains are grounded in real-world contexts by leveraging the extracted visual features and relevant text retrieved from domains such as healthcare, traffic reports, and Wikipedia. Questions and answers are then generated from these instantiated reasoning chains via rule-based substitution.
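As a rough illustration of the rule-based substitution step, the following sketch instantiates an abstract inference rule with grounded statements; the rule table, the grounding dictionary, and the traffic-light example are invented for illustration and do not reproduce the actual generation pipeline.

# Two of the inference rules named above, written over placeholder propositions.
RULES = {
    "Modus Ponens": (["P -> Q", "P"], "Q"),
    "Hypothetical Syllogism": (["P -> Q", "Q -> R"], "P -> R"),
}

def instantiate(template: str, grounding: dict) -> str:
    """Substitute placeholder symbols with grounded, context-specific statements."""
    for symbol, statement in grounding.items():
        template = template.replace(symbol, statement)
    return template

def instantiate_chain(rule_name: str, grounding: dict):
    """Turn an abstract rule into a grounded (premises, conclusion) pair."""
    premises, conclusion = RULES[rule_name]
    return [instantiate(p, grounding) for p in premises], instantiate(conclusion, grounding)

# Example grounding in a traffic context (content is made up for illustration).
premises, conclusion = instantiate_chain(
    "Modus Ponens",
    {"P": "the traffic light is red", "Q": "vehicles must stop"},
)
# premises   -> ["the traffic light is red -> vehicles must stop", "the traffic light is red"]
# conclusion -> "vehicles must stop"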
To ensure the quality and relevance of the dataset, both automatic and manual quality control procedures are employed. Automatic checks include assessing lexical similarity and commonsense plausibility, while human annotators verify the accuracy of visual details and the real-world relevance of the generated context. Instances that fail these checks are filtered out, ensuring a high-quality, logically sound, and contextually relevant dataset.
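A lexical-similarity check of the kind mentioned above could, for example, look like the sketch below; the difflib-based similarity measure and the keep/reject thresholds are assumptions chosen for illustration, not the filters actually used.

from difflib import SequenceMatcher

def lexical_similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1] via difflib's SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def passes_lexical_check(generated_context: str, source_text: str,
                         low: float = 0.2, high: float = 0.9) -> bool:
    """Keep instances grounded in the retrieved source but not near-verbatim copies of it."""
    similarity = lexical_similarity(generated_context, source_text)
    return low <= similarity <= high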
Figure: Data construction pipeline and quality control overview (placeholder).
@inproceedings{xu2025muslr,
  author    = {Jundong Xu and Hao Fei and Yuhui Zhang and Liangming Pan and Qijun Huang and Qian Liu and Preslav Nakov and Min-Yen Kan and William Yang Wang and Mong-Li Lee and Wynne Hsu},
  title     = {Multimodal Symbolic Logical Reasoning},
  booktitle = {Proceedings of the Annual Conference on Neural Information Processing Systems},
  year      = {2025},
  url       = {https://nips.cc/virtual/2025/poster/115490}
}