Senior Design Project 2026

BrainBench

Evaluating Local LLMs on Mathematical Reasoning

3 Models • 3 Datasets • 2,544 Questions Each • Standardized & Reproducible

Models Tested
3
Local LLMs via Ollama
Questions Per Model
2,544
Across 3 math datasets
Total Responses
7,632
All verified automatically (grading sketched below)
Best Overall
Qwen3:4b
66.3% overall accuracy
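
How responses were gathered and graded: each question goes to a locally running Ollama server, and the reply is checked against the answer key. The sketch below captures the general shape under stated assumptions; the endpoint is Ollama's standard /api/generate, but the prompt convention (end with a bare numeric answer), the tolerance, and both function names are illustrative, not the project's actual code.

```python
import re
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def ask(model: str, question: str) -> str:
    """Send one question to the local Ollama server and return the raw reply."""
    r = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": question, "stream": False},
        timeout=600,  # generous cap; reasoning models can run long
    )
    r.raise_for_status()
    return r.json()["response"]

def is_correct(reply: str, expected: float, tol: float = 1e-6) -> bool:
    """Grade by comparing the last number in the reply to the answer key.

    Assumes prompts instruct the model to end with a bare numeric answer;
    a real grader also has to handle fractions, units, and LaTeX output.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", reply.replace(",", ""))
    return bool(numbers) and abs(float(numbers[-1]) - expected) <= tol
```

Grading the last number rather than the first tolerates chain-of-thought text full of intermediate values, which matters for a verbose model like Qwen3:4b.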

Accuracy by Dataset

Percentage of correctly answered questions per model per dataset
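
Given a flat results table, the numbers behind this chart reduce to a single groupby; a minimal pandas sketch, assuming a CSV with model, dataset, and a 0/1 correct column (the file name and schema are assumptions):

```python
import pandas as pd

df = pd.read_csv("results.csv")  # assumed columns: model, dataset, correct (0/1)
accuracy = (
    df.groupby(["model", "dataset"])["correct"]
      .mean()      # fraction correct per (model, dataset) cell
      .mul(100)    # convert to percent
      .round(1)
      .unstack()   # models as rows, datasets as columns
)
print(accuracy)
```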

Model Capabilities

Multi-dimensional comparison
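
The capability chart can be reproduced as a radar plot of the per-dataset accuracies listed in the Model Overview below; a matplotlib sketch, assuming accuracy is the only dimension plotted (the live chart may carry more axes, e.g. speed):

```python
import numpy as np
import matplotlib.pyplot as plt

datasets = ["Prob & Stats", "Calculus I", "Grade 8"]
scores = {  # per-dataset accuracy (%), from the Model Overview cards
    "Gemma3:4b": [85.9, 69.2, 22.0],
    "Phi3:3.8b": [38.2, 42.8, 12.3],
    "Qwen3:4b": [74.2, 84.4, 28.7],
}

# One spoke per dataset; repeat the first point to close each polygon.
angles = np.linspace(0, 2 * np.pi, len(datasets), endpoint=False).tolist()
angles += angles[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
for model, vals in scores.items():
    closed = vals + vals[:1]
    ax.plot(angles, closed, label=model)
    ax.fill(angles, closed, alpha=0.15)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(datasets)
ax.set_ylim(0, 100)
ax.legend(loc="upper right", bbox_to_anchor=(1.35, 1.1))
plt.show()
```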

Model Overview

Gemma3:4b

Google • 4B params

Prob & Stats: 85.9%
Calculus I: 69.2%
Grade 8: 22.0%
Avg: 7.3 s/question • Total: 310 min

Phi3:3.8b

Microsoft • 3.8B params

Prob & Stats: 38.2%
Calculus I: 42.8%
Grade 8: 12.3%
Avg: 5.4 s/question • Total: 230 min

Qwen3:4b

Alibaba • 4B params

Overall: 66.3%
Prob & Stats: 74.2%
Calculus I: 84.4%
Grade 8: 28.7%
Avg: 25.8 s/question • Total: 1,094 min
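
As a sanity check on the timing cards, each Total is just the average latency multiplied by the 2,544 questions; for Qwen3:4b:

```latex
2{,}544 \times 25.8\,\mathrm{s} \approx 65{,}635\,\mathrm{s} \approx 1{,}094\,\mathrm{min}
```

The Gemma3:4b and Phi3:3.8b totals (310 min and 230 min) check out the same way.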

Overall Accuracy Ranking

Percentage of all 2,544 questions answered correctly
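
Because the three datasets differ in size, overall accuracy is the question-weighted mean of the per-dataset scores, not their simple average; for Qwen3:4b the unweighted mean of 74.2%, 84.4%, and 28.7% would be about 62.4%, not the reported 66.3%:

```latex
\mathrm{acc}_{\text{overall}} \;=\; \frac{\sum_d n_d \,\mathrm{acc}_d}{\sum_d n_d},
\qquad \sum_d n_d = 2{,}544
```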

Average Response Time

Seconds per question (lower is better)
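
Per-question latency here is wall-clock time for the full request; a minimal sketch using time.perf_counter (the timed helper is hypothetical and reuses the ask() sketch from earlier):

```python
import time

def timed(fn, *args):
    """Wall-clock a single call; returns (result, seconds elapsed)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Hypothetical usage with the ask() helper sketched earlier:
#   import statistics
#   times = [timed(ask, "qwen3:4b", q)[1] for q in questions]
#   print(f"avg: {statistics.mean(times):.1f} s/question")
```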

Key Takeaways

Qwen3:4b leads overall with 66.3% accuracy and dominates Calculus I (84.4%), but at ~26 seconds per question it is roughly 3.5–5x slower than the other two models.

Gemma3:4b excels at probability with 85.9% accuracy on Prob & Stats—the highest single-dataset score—and scores perfectly on limit evaluation problems.

Phi3:3.8b is the fastest at 5.4 s per question but has the lowest accuracy on every dataset. Speed alone doesn't compensate for reduced correctness.

All models struggle with Grade 8 Math (12–29% accuracy), especially repeating-decimal-to-fraction conversions and function problems.
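
For context on the repeating-decimal items, the standard conversion the models miss runs as follows (a generic illustration, not a question drawn from the dataset):

```latex
x = 0.\overline{72}
\;\Rightarrow\; 100x = 72.\overline{72}
\;\Rightarrow\; 99x = 72
\;\Rightarrow\; x = \tfrac{72}{99} = \tfrac{8}{11}
```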