AI model rankings based on Japanese benchmarks including JGLUE and Global-MMLU-Lite
| # | Model | Provider | Tier | Global-MMLU | JGLUE | JA-Alpaca | Overall |
|---|---|---|---|---|---|---|---|
| 1 | Gemini 2.5 Pro | Google | S | 94 | 92 | 91 | 92.3 |
| 2 | Claude Opus 4 | Anthropic | S | 93 | 91 | 90 | 91.3 |
| 3 | Claude Sonnet 4 | Anthropic | S | 93 | 90 | 89 | 90.7 |
| 4 | GPT-4o | OpenAI | A | 90 | 89 | 88 | 89.0 |
| 5 | GPT-4.1 | OpenAI | A | 91 | 88 | 87 | 88.7 |
| 6 | Gemini 2.0 Flash | Google | A | 88 | 86 | 85 | 86.3 |
| 7 | DeepSeek V3 | DeepSeek | B | 85 | 84 | 83 | 84.0 |
| 8 | Qwen 2.5 72B | Qwen | B | 86 | 85 | 80 | 83.7 |
| 9 | GPT-4o mini | OpenAI | B | 82 | 80 | 79 | 80.3 |
| 10 | Mistral Large | Mistral | C | 82 | 76 | 73 | 77.0 |
| 11 | Claude Haiku 4 | Anthropic | C | 78 | 77 | 75 | 76.7 |
| 12 | Llama 3.1 70B | Meta | C | 80 | 75 | 74 | 76.3 |
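
The Overall values in the table are consistent with an unweighted mean of the three benchmark scores, rounded to one decimal place. The page does not state its aggregation method, so treat that as an inference from the numbers rather than a documented formula. A minimal Python sketch under that assumption follows; the names `SCORES`, `overall`, and `rank` are illustrative only.

```python
# Scores copied from the table: (Global-MMLU, JGLUE, JA-Alpaca).
SCORES = {
    "Gemini 2.5 Pro": (94, 92, 91),
    "Claude Opus 4": (93, 91, 90),
    "Claude Sonnet 4": (93, 90, 89),
    "GPT-4o": (90, 89, 88),
    "GPT-4.1": (91, 88, 87),
    "Gemini 2.0 Flash": (88, 86, 85),
    "DeepSeek V3": (85, 84, 83),
    "Qwen 2.5 72B": (86, 85, 80),
    "GPT-4o mini": (82, 80, 79),
    "Mistral Large": (82, 76, 73),
    "Claude Haiku 4": (78, 77, 75),
    "Llama 3.1 70B": (80, 75, 74),
}


def overall(scores):
    """Unweighted mean rounded to one decimal (assumed, not stated by the source)."""
    return round(sum(scores) / len(scores), 1)


def rank(table):
    """Return (model, overall) pairs sorted from highest to lowest overall score."""
    rows = [(name, overall(scores)) for name, scores in table.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for position, (name, score) in enumerate(rank(SCORES), start=1):
        print(f"{position:>2}  {name:<18} {score:.1f}")
```

Because the mean weights all three benchmarks equally, a model can rank above another that beats it on one benchmark: GPT-4o (89.0) sits above GPT-4.1 (88.7) even though GPT-4.1 has the higher Global-MMLU score.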
Models that score highly on English benchmarks may perform significantly worse in Japanese. Dedicated Japanese benchmarks give a more reliable picture of a model's capability in the language.
Dedicated Japanese benchmarks also evaluate understanding of culturally specific elements such as honorifics, seasonal expressions, and business etiquette.
JGLUE evaluates models on real NLP tasks (sentiment analysis, sentence-pair tasks, question answering), which makes it a more practical measure than academic exam-style scores alone.
Providers are actively improving Japanese support. Google Gemini scores especially high, and Claude is rapidly improving.