Senior Design Project 2026

BrainBench

Evaluating Local LLMs on Mathematical Reasoning

3 Models • 3 Datasets • 2,544 Questions Each • Standardized & Reproducible

Models Tested
3
Local LLMs via Ollama
Questions Per Model
2,544
Across 3 math datasets
Total Responses
7,632
All verified automatically (grading sketched below)
Best Overall
Qwen3:4b
66.3% overall accuracy
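
How responses were gathered and graded: each question goes to a locally running Ollama server, and the reply is checked against the answer key. The sketch below captures the general shape under stated assumptions; the endpoint is Ollama's standard /api/generate, but the prompt convention (end with a bare numeric answer), the tolerance, and both function names are illustrative, not the project's actual code.

```python
import re
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def ask(model: str, question: str) -> str:
    """Send one question to the local Ollama server and return the raw reply."""
    r = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": question, "stream": False},
        timeout=600,  # generous cap; reasoning models can run long
    )
    r.raise_for_status()
    return r.json()["response"]

def is_correct(reply: str, expected: float, tol: float = 1e-6) -> bool:
    """Grade by comparing the last number in the reply to the answer key.

    Assumes prompts instruct the model to end with a bare numeric answer;
    a real grader also has to handle fractions, units, and LaTeX output.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", reply.replace(",", ""))
    return bool(numbers) and abs(float(numbers[-1]) - expected) <= tol
```

Grading the last number rather than the first tolerates chain-of-thought text full of intermediate values, which matters for a verbose model like Qwen3:4b.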

Accuracy by Dataset

Percentage of correctly answered questions per model per dataset
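
Given a flat results table, the numbers behind this chart reduce to a single groupby; a minimal pandas sketch, assuming a CSV with model, dataset, and a 0/1 correct column (the file name and schema are assumptions):

```python
import pandas as pd

df = pd.read_csv("results.csv")  # assumed columns: model, dataset, correct (0/1)
accuracy = (
    df.groupby(["model", "dataset"])["correct"]
      .mean()      # fraction correct per (model, dataset) cell
      .mul(100)    # convert to percent
      .round(1)
      .unstack()   # models as rows, datasets as columns
)
print(accuracy)
```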

Model Capabilities

Multi-dimensional comparison
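
The capability chart can be reproduced as a radar plot of the per-dataset accuracies listed in the Model Overview below; a matplotlib sketch, assuming accuracy is the only dimension plotted (the live chart may carry more axes, e.g. speed):

```python
import numpy as np
import matplotlib.pyplot as plt

datasets = ["Prob & Stats", "Calculus I", "Grade 8"]
scores = {  # per-dataset accuracy (%), from the Model Overview cards
    "Gemma3:4b": [85.9, 69.2, 22.0],
    "Phi3:3.8b": [38.2, 42.8, 12.3],
    "Qwen3:4b": [74.2, 84.4, 28.7],
}

# One spoke per dataset; repeat the first point to close each polygon.
angles = np.linspace(0, 2 * np.pi, len(datasets), endpoint=False).tolist()
angles += angles[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
for model, vals in scores.items():
    closed = vals + vals[:1]
    ax.plot(angles, closed, label=model)
    ax.fill(angles, closed, alpha=0.15)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(datasets)
ax.set_ylim(0, 100)
ax.legend(loc="upper right", bbox_to_anchor=(1.35, 1.1))
plt.show()
```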

Model Overview

Gemma3:4b

Google • 4B params

Prob & Stats: 85.9%
Calculus I: 69.2%
Grade 8: 22.0%
Avg: 7.3 s/question • Total: 310 min

Phi3:3.8b

Microsoft • 3.8B params

Prob & Stats: 38.2%
Calculus I: 42.8%
Grade 8: 12.3%
Avg: 5.4 s/question • Total: 230 min

Qwen3:4b

Alibaba • 4B params

Overall: 66.3%
Prob & Stats: 74.2%
Calculus I: 84.4%
Grade 8: 28.7%
Avg: 25.8 s/question • Total: 1,094 min
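
As a sanity check on the timing cards, each Total is just the average latency multiplied by the 2,544 questions; for Qwen3:4b:

```latex
2{,}544 \times 25.8\,\mathrm{s} \approx 65{,}635\,\mathrm{s} \approx 1{,}094\,\mathrm{min}
```

The Gemma3:4b and Phi3:3.8b totals (310 min and 230 min) check out the same way.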

Overall Accuracy Ranking

Percentage of all 2,544 questions answered correctly
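
Because the three datasets differ in size, overall accuracy is the question-weighted mean of the per-dataset scores, not their simple average; for Qwen3:4b the unweighted mean of 74.2%, 84.4%, and 28.7% would be about 62.4%, not the reported 66.3%:

```latex
\mathrm{acc}_{\text{overall}} \;=\; \frac{\sum_d n_d \,\mathrm{acc}_d}{\sum_d n_d},
\qquad \sum_d n_d = 2{,}544
```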

Average Response Time

Seconds per question (lower is better)
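
Per-question latency here is wall-clock time for the full request; a minimal sketch using time.perf_counter (the timed helper is hypothetical and reuses the ask() sketch from earlier):

```python
import time

def timed(fn, *args):
    """Wall-clock a single call; returns (result, seconds elapsed)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Hypothetical usage with the ask() helper sketched earlier:
#   import statistics
#   times = [timed(ask, "qwen3:4b", q)[1] for q in questions]
#   print(f"avg: {statistics.mean(times):.1f} s/question")
```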

Key Takeaways

Qwen3:4b leads overall with 66.3% accuracy and dominates Calculus I (84.4%), but at ~26 seconds per question it is roughly 3.5–5x slower than the other two models.

Gemma3:4b excels at probability with 85.9% accuracy on Prob & Stats—the highest single-dataset score—and scores perfectly on limit evaluation problems.

Phi3:3.8b is the fastest at 5.4 s per question but has the lowest accuracy on every dataset. Speed alone doesn't compensate for reduced correctness.

All models struggle with Grade 8 Math (12–29% accuracy), especially repeating-decimal-to-fraction conversions and function problems.
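
For context on the repeating-decimal items, the standard conversion the models miss runs as follows (a generic illustration, not a question drawn from the dataset):

```latex
x = 0.\overline{72}
\;\Rightarrow\; 100x = 72.\overline{72}
\;\Rightarrow\; 99x = 72
\;\Rightarrow\; x = \tfrac{72}{99} = \tfrac{8}{11}
```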