BrainBench
Evaluating Local LLMs on Mathematical Reasoning
3 Models • 3 Datasets • 2,544 Questions • Standardized & Reproducible
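The harness itself isn't shown on this page; as a minimal sketch of what reproducing a run involves, assuming the three models are served locally through Ollama (the model names listed below match Ollama's tag format) and queried via the official ollama Python client, posing one question looks roughly like this (the sample question is illustrative, not an actual BrainBench item):

```python
import ollama  # assumes: pip install ollama, with an Ollama server running locally

MODELS = ["gemma3:4b", "phi3:3.8b", "qwen3:4b"]

def ask(model: str, question: str) -> str:
    """Send one benchmark question to a locally served model and return its answer text."""
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return response["message"]["content"]

if __name__ == "__main__":
    q = "Evaluate the limit of (x^2 - 1)/(x - 1) as x approaches 1."
    for m in MODELS:
        print(m, "->", ask(m, q)[:80])
```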
Accuracy by Dataset
Percentage of questions each model answered correctly on each dataset
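The percentages behind this chart are plain aggregation; a hedged sketch, assuming each graded answer is stored as a (model, dataset, correct) record (the real storage format isn't shown here):

```python
from collections import defaultdict

# Hypothetical record format: (model, dataset, answered_correctly)
results = [
    ("qwen3:4b", "Calculus I", True),
    ("gemma3:4b", "Prob & Stats", True),
    ("phi3:3.8b", "Grade 8 Math", False),
    # ... one tuple per graded question
]

def accuracy_by_dataset(records):
    """Return {(model, dataset): fraction correct} over all graded questions."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for model, dataset, ok in records:
        total[(model, dataset)] += 1
        correct[(model, dataset)] += ok  # bool counts as 0 or 1
    return {key: correct[key] / total[key] for key in total}

for (model, dataset), acc in sorted(accuracy_by_dataset(results).items()):
    print(f"{model:12s} {dataset:14s} {acc:.1%}")
```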
Model Capabilities
Multi-dimensional comparison
Model Overview
Gemma3:4b
Google • 4B params
Phi3:3.8b
Microsoft • 3.8B params
Qwen3:4b
Alibaba • 4B params
Overall Accuracy Ranking
Percentage of all 2,544 questions answered correctly
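For scale, the headline figure maps back to raw counts: 66.3% of all 2,544 questions is 0.663 × 2,544 ≈ 1,687 questions answered correctly by the leading model.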
Average Response Time
Seconds per question (lower is better)
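These figures are mean wall-clock latencies; a minimal sketch of how such a column can be measured, assuming a per-question ask function like the one sketched near the top of the page:

```python
import time
from typing import Callable

def mean_latency(ask_fn: Callable[[str], str], questions: list[str]) -> float:
    """Average wall-clock seconds per question for one model (lower is better)."""
    elapsed = 0.0
    for q in questions:
        start = time.perf_counter()
        ask_fn(q)  # e.g. lambda q: ask("qwen3:4b", q), with ask() as sketched above
        elapsed += time.perf_counter() - start
    return elapsed / len(questions)
```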
Key Takeaways
Qwen3:4b leads overall with 66.3% accuracy and dominates Calculus I (84.4%), but is 4–5x slower than competitors at ~26 seconds per question.
Gemma3:4b excels at probability, with 85.9% accuracy on Prob & Stats (the highest single-dataset score), and answers limit-evaluation problems perfectly.
Phi3:3.8b is the fastest at 5.4s per question but has the lowest accuracy across all categories. Speed alone doesn't compensate for reduced correctness.
All models struggle with Grade 8 Math (12–29% accuracy), especially on repeating-decimal-to-fraction conversions (see the worked example below) and function problems.
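The specific failing items aren't reproduced here, but the conversion these problems require follows one identity: a repeating block of k digits d equals d / (10^k − 1). A worked check using Python's exact-fraction type:

```python
from fractions import Fraction

# 0.151515... : let x = 0.15(repeating). Then 100x = 15.1515..., so 99x = 15
# and x = 15/99. In general, a k-digit repeating block d gives d / (10**k - 1).
x = Fraction(15, 10**2 - 1)
print(x)         # 5/33  (Fraction reduces automatically)
print(float(x))  # 0.15151515151515152
```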