Measuring Physical-World Privacy Awareness of Large Language Models: An Evaluation Benchmark • Paper • 2510.02356 • Published Sep 27, 2025 • 11
CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis • Paper • 2503.23145 • Published Mar 29, 2025 • 35