LLM Benchmarks - a veryhungryhippo Collection

LLM Benchmarks

updated Jul 17, 2025

Humanity's Last Exam

Paper • 2501.14249 • Published Jan 24, 2025 • 77

Note: currently the hardest

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Paper • 2206.04615 • Published Jun 9, 2022 • 5

Note: *BB* => BBH => BBEH

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

Paper • 2210.09261 • Published Oct 17, 2022 • 1

Note: BB => *BBH* => BBEH

BIG-Bench Extra Hard

Paper • 2502.19187 • Published Feb 26, 2025 • 10

Note: BB => BBH => *BBEH*

Measuring Massive Multitask Language Understanding

Paper • 2009.03300 • Published Sep 7, 2020 • 3

Note: OG MMLU!

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Paper • 2406.01574 • Published Jun 3, 2024 • 51

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Paper • 2311.12022 • Published Nov 20, 2023 • 33

Instruction-Following Evaluation for Large Language Models

Paper • 2311.07911 • Published Nov 14, 2023 • 22

Note: IFEval

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Paper • 2310.06770 • Published Oct 10, 2023 • 9

Note: Coding benchmark

LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks

Paper • 2412.15204 • Published Dec 19, 2024 • 37

Note: Best for long context (as of July 2025). Long context here means at least 8K.