Japanese Language Performance Ranking

AI model rankings based on Japanese-language benchmarks, including JGLUE, Global-MMLU-Lite, and JA-Alpaca.

#	Model	Provider	Tier	Global-MMLU	JGLUE	JA-Alpaca	Overall
1	Gemini 2.5 Pro	Google	S	94	92	91	92.3
2	Claude Opus 4	Anthropic	S	93	91	90	91.3
3	Claude Sonnet 4	Anthropic	S	93	90	89	90.7
4	GPT-4o	OpenAI	A	90	89	88	89.0
5	GPT-4.1	OpenAI	A	91	88	87	88.7
6	Gemini 2.0 Flash	Google	A	88	86	85	86.3
7	DeepSeek V3	DeepSeek	B	85	84	83	84.0
8	Qwen 2.5 72B	Qwen	B	86	85	80	83.7
9	GPT-4o mini	OpenAI	B	82	80	79	80.3
10	Mistral Large	Mistral	C	82	76	73	77.0
11	Claude Haiku 4	Anthropic	C	78	77	75	76.7
12	Llama 3.1 70B	Meta	C	80	75	74	76.3
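
The Overall column in the table above matches the unweighted mean of the three benchmark scores, rounded to one decimal place. The following is a minimal Python sketch of that arithmetic, not the site's actual scoring code. The function names and the tier cutoffs (S at 90 or higher, A at 85 or higher, B at 80 or higher, otherwise C) are assumptions: the cutoffs reproduce every tier shown in the table, but the source does not state them.

```python
# (model, Global-MMLU, JGLUE, JA-Alpaca), copied from the ranking table.
SCORES = [
    ("Gemini 2.5 Pro", 94, 92, 91),
    ("Claude Opus 4", 93, 91, 90),
    ("Claude Sonnet 4", 93, 90, 89),
    ("GPT-4o", 90, 89, 88),
    ("GPT-4.1", 91, 88, 87),
    ("Gemini 2.0 Flash", 88, 86, 85),
    ("DeepSeek V3", 85, 84, 83),
    ("Qwen 2.5 72B", 86, 85, 80),
    ("GPT-4o mini", 82, 80, 79),
    ("Mistral Large", 82, 76, 73),
    ("Claude Haiku 4", 78, 77, 75),
    ("Llama 3.1 70B", 80, 75, 74),
]

# Hypothetical cutoffs: consistent with the table, not stated by the source.
TIER_CUTOFFS = [("S", 90.0), ("A", 85.0), ("B", 80.0)]


def overall(scores):
    """Unweighted mean of the benchmark scores, rounded to one decimal."""
    return round(sum(scores) / len(scores), 1)


def tier(overall_score):
    """Map an overall score to a tier letter using the assumed cutoffs."""
    for name, cutoff in TIER_CUTOFFS:
        if overall_score >= cutoff:
            return name
    return "C"


def rank(rows):
    """Return (model, overall, tier) tuples sorted by overall, highest first."""
    ranked = []
    for model, *scores in rows:
        score = overall(scores)
        ranked.append((model, score, tier(score)))
    return sorted(ranked, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for position, (model, score, letter) in enumerate(rank(SCORES), start=1):
        print(f"{position:>2}  {model:<18} {letter}  {score:.1f}")
```

Because the mean is unweighted, a weak result on one benchmark pulls the overall score down as much as a weak result on any other. Qwen 2.5 72B, for example, beats DeepSeek V3 on Global-MMLU and JGLUE but ranks below it because of its lower JA-Alpaca score.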

Why Japanese-Specific Benchmarks Matter

Models that score well on English benchmarks may perform significantly worse in Japanese, so dedicated Japanese benchmarks give a more reliable picture of a model's capability in the language.

Japanese-specific benchmarks evaluate understanding of culturally specific elements such as honorifics, seasonal expressions, and business etiquette.

JGLUE evaluates models on practical NLP tasks (sentiment analysis, sentence-pair tasks, and question answering), which makes it a more practical measure than academic knowledge scores alone.

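Providers are actively improving Japanese support. Google's Gemini 2.5 Pro scores highest in this ranking (92.3 overall), and Claude is improving rapidly, with Claude Opus 4 and Claude Sonnet 4 close behind at 91.3 and 90.7.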