StoryGold Logo

STORYGOLD

How good is GPT-5 at 3D?

August 22, 2025 Jennifer Jang

We conducted a quick and dirty evaluation of GPT-5, OpenAI's latest frontier model, and its capabilities on parametric CAD code generation. While GPT-5 represents significant progress toward more general AI systems, our experiments reveal limitations in its ability to perform spatial reasoning tasks, similar to its predecessor, GPT-4. We tried some simple techniques like retrieval-augmented generation (RAG) on the CadQuery API documentation and few-shot learning.

Experimental Design

We evaluated multiple configurations of GPT models on our parametric CAD generation benchmark:

  • GPT-5 (baseline): Zero-shot generation
  • GPT-5 + Few-shot: Prompted with 3-4 high-quality example image-code pairs
  • GPT-5 + RAG: Augmented with retrieval from CadQuery API documentation
  • GPT-5 + RAG + Few-shot: Combined approach using both techniques
  • GPT-4 (baseline): Previous generation model for comparison
  • BELLA v1.0 (13B and 34B): Our fine-tuned models for reference

An example of a few-shot prompt:

An example of a few-shot image-code pair passed in as a prompt

We measured two key metrics:

  • Code Validity Rate: The percentage of generated Python CadQuery code samples that compile and produce a valid solid without errors
  • Intersection-over-Union (IoU): Geometric accuracy of successfully compiled outputs, measured against ground truth models (scale from 0 to 1.0, where 1.0 represents perfect geometric agreement)

Results

Model Validity % IoU Score
BELLA v1.0 34B 99% 0.75
BELLA v1.0 13B 98% 0.73
GPT-5 + RAG + Few-shot 98% 0.49
GPT-5 + RAG 92% 0.48
GPT-5 + Few-shot 90% 0.47
GPT-5 89% 0.43
GPT-4 93% 0.43

The combination of RAG and few-shot learning dramatically improved GPT-5's code validity rate from 89% to 98%. Basically, knowing about the CadQuery API and a few examples of how to use it produced an 82% reduction in syntax and API errors (from 11% failures to 2% failures). In some ways, this is not surprising, as GPT-5 is exceptionally good at coding tasks, as shown by the myriad of codegen tools that are widely adopted.

Despite the substantial improvement in code validity, geometric accuracy remained largely unchanged. GPT-5's IoU score improved only marginally from 0.43 (baseline) to 0.49 (RAG + few-shot), whereas BELLA v1.0 (fine-tuned using a CadQuery dataset) sits at 0.75 IoU.

What does this mean? Well, generating syntactically correct code is fundamentally different from generating three-dimensional, geometrically accurate code! RAG and few-shot learning can teach a model correct API usage patterns, but they cannot teach deep spatial awareness!

Analysis

Most likely, trying to teach a model something that deviates far away from its pre-training data -- like spatial reasoning -- requires expensive fine-tuning or even specialized architecture trained from scratch. It's also possible that LLaVA's vision-language architecture may be better suited for geometric tasks than GPT's text-centric architecture, even before fine-tuning anything.

While GPT-5 represents impressive progress in general AI capabilities, our experiments demonstrate that frontier-scale models augmented with RAG and few-shot learning still cannot match specialized, fine-tuned models on 3D spatial reasoning tasks.