Humanity's Last Exam
Paper • arXiv:2501.14249
Note: currently the hardest
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Paper • arXiv:2206.04615
Note: *BB* => BBH => BBEH
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
Paper • arXiv:2210.09261
Note: BB => *BBH* => BBEH
BIG-Bench Extra Hard
Paper • arXiv:2502.19187
Note: BB => BBH => **BBEH**
Measuring Massive Multitask Language Understanding
Paper • arXiv:2009.03300
Note: the original MMLU
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
Paper • arXiv:2406.01574
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
Paper • arXiv:2311.12022
Instruction-Following Evaluation for Large Language Models
Paper • arXiv:2311.07911
Note: IFEval
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Paper • arXiv:2310.06770
Note: coding benchmark
LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks
Paper • arXiv:2412.15204
Note: best for long context (as of July 2025); "long context" here means at least 8K