VizArQA: A Foundation for Visual Question Answering in Architectural Simulation

Research Paper | AI + Design Research | B2B/B2C

Product Strategy

Product Design

Product Management

Circular Design

User Research

Usability Testing

Rapid Prototyping

AI-Native

B2B2C Recommendation Algorithms

Systems Thinking

B2C Experience Design

Architecture & Construction

Product

Research Paper

Role

Product Builder, AI Research

Duration

10 months

Description

Vision-language models (VLMs) offer potential for automated interpretation support, yet their reliability on architectural simulation outputs remains unexplored, particularly across different computational scales. This study introduces VizArQA, a benchmark for evaluating VLM performance on architectural environmental simulation interpretation. We evaluate six models spanning open-source and closed-source approaches at varying computational scales: GPT-5, GPT-5-Mini, and GPT-5-Nano, alongside Qwen3-235B, Qwen3-8B, and Qwen3-2B. Results show performance variations across model sizes and architectures, with larger models achieving substantially higher accuracy. These findings establish baseline performance metrics for VLM-assisted environmental analysis and highlight a path toward integrating VLMs into design processes to enhance efficiency, insight generation, and accessibility for non-experts.
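To make the evaluation setup concrete, below is a minimal sketch of what a benchmark loop like this could look like. The dataset path, record fields, and the `query_model` stub are hypothetical illustrations of a multiple-choice QA format, not the paper's actual harness; a real run would wire `query_model` to the respective model APIs.

```python
import json

def query_model(model_name: str, image_path: str, question: str, choices: list[str]) -> str:
    """Placeholder for a VLM call that sends a simulation image plus a
    question and returns the model's chosen answer. Replace this stub
    with a real API client for each evaluated model."""
    return choices[0]  # dummy answer so the sketch runs end to end

def evaluate(model_name: str, dataset_path: str = "vizarqa.jsonl") -> float:
    """Compute exact-match accuracy over a JSONL benchmark file whose
    records carry (hypothetical) keys: image, question, choices, answer."""
    correct, total = 0, 0
    with open(dataset_path) as f:
        for line in f:
            item = json.loads(line)
            pred = query_model(model_name, item["image"], item["question"], item["choices"])
            correct += int(pred.strip().lower() == item["answer"].strip().lower())
            total += 1
    return correct / total

if __name__ == "__main__":
    for model in ["GPT-5", "GPT-5-Mini", "GPT-5-Nano", "Qwen3-235B", "Qwen3-8B", "Qwen3-2B"]:
        print(f"{model}: {evaluate(model):.2%}")
```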

Read Here

Problem Statement

Environmental simulations are critical in early-stage design, offering insights into factors like solar radiation, airflow, and climate responsiveness. However, interpreting these simulation outputs is time-consuming and requires expert reasoning, which can slow down design workflows.

Problem statement illustration, generated by Gemini

Excerpts

Our evaluation showed performance variations across VLMs in environmental simulation interpretation tasks. Overall accuracy ranged from 57.02% to 84.30%, establishing clear performance hierarchies between model architectures and computational scales, as shown in Figure 7. The largest model, Qwen3-235B, achieved the highest accuracy at 84.30%, followed closely by both GPT-5 and Qwen3-8B at 80.99%. Mid-range models GPT-5-Mini (74.38%) and Qwen3-2B (61.98%) demonstrated moderate performance, while GPT-5-Nano achieved 57.02% accuracy. Performance patterns revealed distinct advantages for larger model architectures in complex visual reasoning tasks. Open-source models (the Qwen3 series) were competitive with closed-source alternatives (the GPT-5 series), with Qwen3-235B surpassing all GPT-5 variants. This finding suggests that architectural simulation interpretation capability scales with model size across both proprietary and open-source frameworks, potentially democratizing access to AI-assisted environmental analysis workflows.
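The reported accuracies are easiest to compare side by side. The short sketch below simply re-ranks the values quoted above (from Figure 7 of the paper); the sorting code itself is illustrative.

```python
# Overall accuracies (%) reported in the excerpt above.
accuracies = {
    "Qwen3-235B": 84.30,
    "GPT-5": 80.99,
    "Qwen3-8B": 80.99,
    "GPT-5-Mini": 74.38,
    "Qwen3-2B": 61.98,
    "GPT-5-Nano": 57.02,
}

# Sort models from best to worst and print a simple text ranking.
for rank, (model, acc) in enumerate(
    sorted(accuracies.items(), key=lambda kv: kv[1], reverse=True), start=1
):
    print(f"{rank}. {model:<11} {acc:5.2f}%")
```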

Insight

The results indicate that current VLM reasoning models are not yet robust enough for reliably accurate design reasoning: even the best-performing model answered roughly one in six simulation questions incorrectly.

Insight

//COMING SOON//

Copyright 2026, Manasi Dushyant Mehta
All works displayed are protected by copyright law
