Differences in the results of the reproduction test on lm-evaluation-harness

#8
by ThreeGold116 - opened

Ouro-1.4B R4 evaluation results: reproduction vs. paper

  1. The reproduction evaluation settings follow the paper.

  2. Result comparison

    Benchmark   Paper Result   Reproduction Result
    mmlu        67.35          66.74
    bbh         71.02          60.77
    gsm8k       78.92          60.80
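
To make the size of each gap explicit, here is a minimal sketch that subtracts the reproduction score from the paper score for every benchmark. The numbers are copied from the table above. The names `results` and `gaps` are illustrative only and not part of any library.

```python
# Scores copied from the table above as (paper, reproduction), in percentage points.
results = {
    "mmlu": (67.35, 66.74),
    "bbh": (71.02, 60.77),
    "gsm8k": (78.92, 60.80),
}


def gaps(scores):
    """Return {benchmark: paper - reproduction}, rounded to two decimals."""
    return {name: round(paper - repro, 2) for name, (paper, repro) in scores.items()}


if __name__ == "__main__":
    for name, gap in gaps(results).items():
        print(f"{name:<6} gap = {gap:.2f} points")
```

The MMLU gap is under one point, while the BBH and GSM8K gaps are above ten points. Both of those are generation-based tasks.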
ByteDance org

Sorry for the delay. I recently finished my internship at ByteDance, so I lost control of the repository for a period of time. Regarding the results, the MMLU scores are actually quite consistent (67.35 vs. 66.74); the small remaining gap is likely because the paper reports log-prob results while we used a standard 5-shot setting in lm-eval. For the discrepancies in generate-until tasks like BBH and GSM8K, we used vLLM as the backend to speed up the evaluation, since standard generation is quite slow. I suspect the performance drop is due to vLLM-specific behavior rather than the model itself, so I wanted to ask whether the original paper measurements were done without vLLM.
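
One way to test the vLLM hypothesis above would be to run the same generation task once with the `hf` backend and once with the `vllm` backend, keeping everything else identical. The sketch below only builds the two `lm_eval` command lines and does not launch them. The repo id `ByteDance/Ouro-1.4B`, the `trust_remote_code=True` argument, the `results/` output directory, and the helper `build_command` are assumptions made for illustration. The paper's exact task variants and few-shot settings are not given in this thread, so they are left at the harness defaults.

```python
import shlex

# Assumed Hugging Face repo id; replace it with the checkpoint you actually evaluate.
MODEL_ID = "ByteDance/Ouro-1.4B"


def build_command(backend, tasks=("gsm8k",), model_id=MODEL_ID):
    """Build an lm_eval argument list for the given backend ("hf" or "vllm").

    Only the backend and the output path differ between the two commands,
    so any score difference can be attributed to the backend.
    """
    if backend not in ("hf", "vllm"):
        raise ValueError(f"unsupported backend: {backend!r}")
    # trust_remote_code=True is an assumption: it is needed only if the
    # checkpoint ships custom modeling code.
    model_args = f"pretrained={model_id},trust_remote_code=True"
    return [
        "lm_eval",
        "--model", backend,
        "--model_args", model_args,
        "--tasks", ",".join(tasks),
        "--batch_size", "auto",
        "--output_path", f"results/{backend}",  # hypothetical output directory
    ]


if __name__ == "__main__":
    for backend in ("hf", "vllm"):
        # Print the commands so they can be reviewed before running them by hand.
        print(shlex.join(build_command(backend)))
```

If the `hf` run lands close to the paper numbers while the `vllm` run reproduces the drop, that would point to the backend rather than the model.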
