---
title: "LastingBench: Defend Benchmarks Against Knowledge Leakage"
tags:
- paper
- benchmark
license: cc-by-4.0
---
# 📄 Paper
<iframe
src="https://huggingface.co/kixx/LastingBench/resolve/main/paper.pdf#toolbar=0"
width="100%"
height="900"
style="border:none;">
</iframe>
<!-- Fallback download link: -->
<p><a href="https://huggingface.co/kixx/LastingBench/resolve/main/paper.pdf">📥 Download the PDF</a></p>
# LastingBench: Defend Benchmarks Against Knowledge Leakage
Welcome to the repository for the research paper "LastingBench: Defend Benchmarks Against Knowledge Leakage." The project addresses a growing concern: large language models (LLMs) can "cheat" on standard question answering (QA) benchmarks by memorizing task-specific data. When that happens, benchmark scores no longer reflect genuine model capabilities but rather the effects of data leakage.
## Project Overview
![Overview](./assets/overview.png)
LastingBench introduces a novel framework designed to continuously reinforce and safeguard existing benchmarks against knowledge leakage. The project aims to:
- **Detect knowledge leakage** through context and question perturbation techniques
- **Rewrite leaked content** to counterfactual alternatives that disrupt memorization while preserving the benchmark's original evaluative intent
- **Evaluate model responses** to contextual evidence and reasoning patterns
- **Provide practical solutions** to ensure benchmark robustness over time, promoting fairer and more interpretable evaluations of LLMs
## Installation
1. Clone the repository:
```bash
git clone https://github.com/Seriousss/lastingbench
```
2. Create and activate conda environment:
```bash
conda create -n lastingbench python=3.12
conda activate lastingbench
```
3. Install dependencies:
```bash
pip install -r requirements.txt
```
4. Set up environment variables:
```bash
export OPENAI_BASE_URL="your-api-base-url"
export OPENAI_API_KEY="your-api-key"
export CUDA_VISIBLE_DEVICES="0,1,2,3" # Adjust based on your GPU setup
```
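The API-backed scripts read these variables through an OpenAI-compatible client. Below is a minimal sketch for sanity-checking the configuration before launching longer runs; it assumes the `openai` Python package is available and that you substitute a chat model your endpoint actually serves:
```python
# Quick sanity check for OPENAI_BASE_URL / OPENAI_API_KEY -- illustrative only.
import os
from openai import OpenAI  # the client reads OPENAI_API_KEY and OPENAI_BASE_URL from the environment

client = OpenAI()
print("Using endpoint:", os.environ.get("OPENAI_BASE_URL", "https://api.openai.com/v1"))
response = client.chat.completions.create(
    model="gpt-4o",  # replace with a chat model your endpoint serves
    messages=[{"role": "user", "content": "Reply with the single word: ok"}],
    max_tokens=5,
)
print("Sample reply:", response.choices[0].message.content)
```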
## Usage
LastingBench provides three main functionalities: **Detection**, **Rewrite**, and **Evaluation** (model inference and training comparisons).
### 🔍 Detection
Detect knowledge leakage through various perturbation techniques.
#### 1. Context Leakage Detection
Evaluate models using exact-match scoring on benchmark datasets:
```bash
# Using vLLM for most models
python -m detect.contextleakage --hf_model "Qwen/Qwen2.5-7B-Instruct" \
--dataset_subset "hotpotqa" --cuda_devices "0,1"
# Using Transformers for Qwen3 models
python -m detect.contextleakage --hf_model "Qwen/Qwen3-8B" \
--is_qwen3 --max_new_tokens 30
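# Using an API endpoint for hosted models such as DeepSeek-R1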
python -m detect.contextleakage_api --model "deepseek-r1" --dataset_subset "hotpotqa"
```
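Here, "exact-match scoring" means the model's answer is normalized and compared against the gold answer. The sketch below shows a standard SQuAD-style EM for illustration; the repository's own metric lives in its utility modules and may differ in details:
```python
# Illustrative SQuAD-style exact-match (EM) scoring -- not the repository's exact implementation.
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> int:
    return int(normalize(prediction) == normalize(gold))

print(exact_match("The Eiffel Tower.", "eiffel tower"))  # -> 1
```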
#### 2. Question Perturbation Detection
Rephrase questions to opposite meanings and test model consistency:
```bash
# Using OpenAI API
python -m detect.question_rephrase_answer_api \
--model_name "gpt-4o" --dataset_subset "2wikimqa" \
--rephrase_type "opposite" --sample_count 100
# Using local vLLM models
python -m detect.question_rephrase_answer_vllm \
--model_name "Qwen/Qwen2.5-7B-Instruct" --dataset_subset "hotpotqa" --rephrase_type "similar"
# Using Qwen3 with Transformers
python -m detect.question_rephrase_answer_qwen3 \
--model_name "Qwen/Qwen3-8B" --dataset_subset "2wikimqa"
```
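The intuition behind this check: if a question is rewritten to ask the opposite and the model still returns the original benchmark answer, that answer likely comes from memorization rather than from the context. A hedged sketch of the idea, assuming an OpenAI-compatible endpoint and hypothetical prompt wording (the actual prompts live in the `detect` scripts):
```python
# Illustrative opposite-meaning rephrase + consistency check (hypothetical prompts, not the repo's).
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # any chat model your endpoint serves

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}], max_tokens=50
    )
    return resp.choices[0].message.content.strip()

question = "Which film was released earlier, Film A or Film B?"
opposite = ask(f"Rewrite this question so it asks the opposite:\n{question}")

# If both answers still match the benchmark's gold answer, the model may be
# recalling the memorized QA pair instead of reasoning over the context.
print(ask(question))
print(ask(opposite))
```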
### ✏️ Rewrite
Generate counterfactual answers and rewrite leaked evidence to create robust benchmarks.
#### 1. Evidence Finding and Counterfactual Rewriting Pipeline
Run the complete finding and rewriting pipeline:
```bash
# Specify custom output file and dataset
python main_gpu.py --output custom_output.jsonl \
--dataset_subset "hotpotqa" --start_idx 0 --max_samples 100
```
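Conceptually, the pipeline locates the evidence span that supports the gold answer and rewrites it so it supports a counterfactual answer instead, forcing models to read the context rather than recall the original fact. The following is a loose sketch of that rewriting step with a hypothetical prompt; the real pipeline in `main_gpu.py` is more involved:
```python
# Illustrative counterfactual evidence rewriting -- a sketch, not the main_gpu.py pipeline.
from openai import OpenAI

client = OpenAI()

def rewrite_evidence(evidence: str, question: str, counterfactual_answer: str) -> str:
    prompt = (
        "Rewrite the passage so that, for the question below, it supports the new answer "
        "instead of the original one. Change as little as possible.\n"
        f"Question: {question}\nNew answer: {counterfactual_answer}\nPassage: {evidence}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}], max_tokens=300
    )
    return resp.choices[0].message.content.strip()
```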
Convert and merge JSONL files with question-answer mappings:
```bash
# Merge single mapping file with original dataset
python utils/convert.py original.jsonl revised.jsonl custom_output.jsonl
```
The original and revised datasets can be found in the **data** folder.
#### 2. Random Answer Rewriting
Create random alternatives to disrupt memorization:
```bash
# Specify custom output file and dataset
python random_alternative_answer.py --output random_hotpot.jsonl \
--dataset_subset "hotpotqa" --start_idx 0 --max_samples 50
```
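As a simpler baseline than counterfactual rewriting, a leaked answer can be swapped for a random alternative drawn from the other examples. The sketch below illustrates that idea over a JSONL file with a hypothetical `answers` field; check the script itself for the actual schema and flags:
```python
# Illustrative random-answer swap over a JSONL file (the "answers" field name is an assumption).
import json
import random

with open("data/hotpotqa.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

answer_pool = [r["answers"] for r in records]
for record in records:
    # Replace the gold answer with one sampled from a different example.
    alternatives = [a for a in answer_pool if a != record["answers"]]
    record["answers"] = random.choice(alternatives)

with open("random_hotpot.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```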
### 🚀 Dataset Evaluations on Model Inference and Training
#### 1. Model Inference Evaluation
Comprehensive evaluation on original and revised benchmarks:
```bash
# Transformers-based evaluation
python -m eval.evaluation -i data/hotpotqa.jsonl -model "Qwen/Qwen3-8B" -k 40 -t 0.5
# API-based evaluation
python -m eval.eval_with_api --input data/hotpotqa_antifact.jsonl \
--model "deepseek-r1" --max_tokens 30 --temperature 0.5
```
#### 2. Model Training Evaluation
Compare training dynamics between original and rewritten datasets:
The training loss data can be found under **training_result**.
To reproduce the figure in our paper:
```bash
python utils/draw.py training_result/training_loss_qwen38.csv training_result/training_loss_antifact_qwen38.csv \
--title "Original vs Rewritten Training Loss"
```
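If you prefer not to use `utils/draw.py`, the comparison plot is straightforward to recreate. A sketch with pandas and matplotlib follows, assuming each CSV exposes `step` and `loss` columns (adjust the column names to the actual headers):
```python
# Illustrative training-loss comparison plot -- utils/draw.py is the reference implementation.
import pandas as pd
import matplotlib.pyplot as plt

original = pd.read_csv("training_result/training_loss_qwen38.csv")
rewritten = pd.read_csv("training_result/training_loss_antifact_qwen38.csv")

# Column names ("step", "loss") are assumptions; check the CSV headers.
plt.plot(original["step"], original["loss"], label="Original")
plt.plot(rewritten["step"], rewritten["loss"], label="Rewritten")
plt.xlabel("Training step")
plt.ylabel("Loss")
plt.title("Original vs Rewritten Training Loss")
plt.legend()
plt.savefig("training_loss_comparison.png", dpi=200)
```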
### 📊 Utility Functions
Additional tools for analysis and metrics:
- **Metrics Calculation**: F1 scores, EM scores, and custom evaluation metrics (see the sketch below)
- **Document Retrieval**: BM25-based retrieval for evidence analysis
All scripts support various parameters for customization. Use `--help` with any script to see available options.
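For reference, the token-level F1 commonly used for QA answers can be sketched as follows; the repository's metric implementation may normalize answers differently:
```python
# Illustrative token-level F1 for QA answers -- a sketch, not the repo's exact metric.
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("the eiffel tower", "Eiffel Tower"), 3))  # -> 0.8
```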