VizArQA: A Foundation for Visual Question Answering in Architectural Simulation
Research Paper | AI + Design Research | B2B/B2C
Product
Research Paper
Role
Product Builder, AI Research
Duration
10 months
Description
Vision-language models (VLMs) offer potential for automated interpretation support, yet their reliability on architectural simulation outputs remains unexplored, particularly across different computational scales. This study introduces VizArQA, a benchmark for evaluating VLM performance on architectural environmental simulation interpretation. We evaluate six models spanning open-source and closed-source approaches at varying computational scales: GPT-5, GPT-5-Mini, and GPT-5-Nano, alongside Qwen3-235B, Qwen3-8B, and Qwen3-2B. Results show performance variations across model sizes and architectures, with larger models achieving substantially higher accuracy. These findings establish baseline performance metrics for VLM-assisted environmental analysis and highlight the potential of AI-assisted workflows, pointing toward the integration of VLMs into design processes to improve efficiency, insight generation, and accessibility for non-experts.
Read Here
Problem Statement
Environmental simulations are critical in early-stage design, offering insights into factors like solar radiation, airflow, and climate responsiveness. However, interpreting these simulation outputs is time-consuming and requires expert reasoning, which can slow down design workflows.
Excerpts

Our evaluation revealed clear performance differences across VLMs on environmental simulation interpretation tasks. Overall accuracy ranged from 57.02% to 84.30%, establishing a clear performance hierarchy across model architectures and computational scales, as shown in Figure 7. The largest model, Qwen3-235B, achieved the highest accuracy at 84.30%, followed closely by GPT-5 and Qwen3-8B, both at 80.99%. The mid-range models, GPT-5-Mini (74.38%) and Qwen3-2B (61.98%), demonstrated moderate performance, while GPT-5-Nano trailed at 57.02%.

Performance patterns revealed distinct advantages for larger model architectures in complex visual reasoning tasks. Open-source models (the Qwen3 series) were competitive with closed-source alternatives (the GPT-5 series), with Qwen3-235B surpassing all GPT-5 variants. This finding suggests that architectural simulation interpretation capability scales effectively with model size across both proprietary and open-source frameworks, potentially democratizing access to AI-assisted environmental analysis workflows.
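To make the evaluation procedure concrete, here is a minimal sketch of how a benchmark like VizArQA could be scored. This is illustrative only: the dataset file name (`vizarqa.json`), its schema (image path, question, answer options, gold answer), the `query_model` stub, and the normalized exact-match scoring rule are all assumptions, not the paper's actual harness.

```python
# Minimal sketch of a VizArQA-style evaluation loop.
# Assumptions (not from the paper): dataset schema, file name,
# query_model interface, and exact-match scoring.
import json
from pathlib import Path

def query_model(image_path: str, question: str, options: list[str]) -> str:
    """Placeholder for a VLM call (e.g., a GPT-5 or Qwen3 client).
    Expected to return one of the answer options verbatim."""
    raise NotImplementedError("plug in your VLM client here")

def evaluate(benchmark_file: str) -> float:
    """Compute exact-match accuracy over (image, question, answer) items."""
    items = json.loads(Path(benchmark_file).read_text())
    correct = 0
    for item in items:
        prediction = query_model(item["image"], item["question"], item["options"])
        # Normalized exact match; the paper may score answers differently.
        if prediction.strip().lower() == item["answer"].strip().lower():
            correct += 1
    return correct / len(items)

if __name__ == "__main__":
    print(f"accuracy: {evaluate('vizarqa.json'):.2%}")
```

Swapping different clients into `query_model` is what would allow the same loop to produce per-model accuracies like those reported above.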