Evaluating Arabic Large Language Models: A Survey of Benchmarks, Methods, and Gaps Paper • 2510.13430 • Published Oct 15, 2025 • 1
3LM: Bridging Arabic, STEM, and Code through Benchmarking Paper • 2507.15850 • Published Jul 21, 2025 • 6
NeurIPS 2025 E2LM Competition: Early Training Evaluation of Language Models Paper • 2506.07731 • Published Jun 9, 2025 • 2
Are Arabic Benchmarks Reliable? QIMMA's Quality-First Approach to LLM Evaluation Paper • 2604.03395 • Published 5 days ago • 2