|
|
--- |
|
|
title: "LastingBench: Defend Benchmarks Against Knowledge Leakage" |
|
|
tags: |
|
|
- paper |
|
|
- benchmark |
|
|
license: cc-by-4.0 |
|
|
--- |
|
|
|
|
|
# 📄 Paper |
|
|
|
|
|
<iframe |
|
|
src="https://huggingface.co/kixx/LastingBench/resolve/main/paper.pdf#toolbar=0" |
|
|
width="100%" |
|
|
height="900" |
|
|
style="border:none;"> |
|
|
</iframe> |
|
|
|
|
|
<!-- Fallback link: -->
|
|
<p><a href="https://huggingface.co/kixx/LastingBench/resolve/main/paper.pdf">📥 Download the PDF</a></p> |
|
|
|
|
|
|
|
|
# LastingBench: Defend Benchmarks Against Knowledge Leakage
|
|
|
|
|
Welcome to the repository for the research paper "LastingBench: Defend Benchmarks Against Knowledge Leakage." This project addresses a growing concern: large language models (LLMs) can "cheat" on standard question answering (QA) benchmarks by memorizing task-specific data. When that happens, benchmark scores no longer reflect genuine model capabilities but rather the effects of data leakage.
|
|
|
|
|
## Project Overview |
|
|
|
|
|
 |
|
|
|
|
|
LastingBench introduces a novel framework designed to continuously reinforce and safeguard existing benchmarks against knowledge leakage. The project aims to: |
|
|
- **Detect knowledge leakage** through context and question perturbation techniques |
|
|
- **Rewrite leaked content** to counterfactual alternatives that disrupt memorization while preserving the benchmark's original evaluative intent |
|
|
- **Evaluate model responses** to contextual evidence and reasoning patterns |
|
|
- **Provide practical solutions** to ensure benchmark robustness over time, promoting fairer and more interpretable evaluations of LLMs |
|
|
|
|
|
|
|
|
## Installation |
|
|
|
|
|
1. Clone the repository: |
|
|
```bash |
|
|
git clone https://github.com/Seriousss/lastingbench |
|
|
``` |
|
|
|
|
|
2. Create and activate conda environment: |
|
|
```bash |
|
|
conda create -n lastingbench python=3.12 |
|
|
conda activate lastingbench |
|
|
``` |
|
|
|
|
|
3. Install dependencies: |
|
|
```bash |
|
|
pip install -r requirements.txt |
|
|
``` |
|
|
|
|
|
4. Set up environment variables: |
|
|
```bash |
|
|
export OPENAI_BASE_URL="your-api-base-url" |
|
|
export OPENAI_API_KEY="your-api-key" |
|
|
export CUDA_VISIBLE_DEVICES="0,1,2,3" # Adjust based on your GPU setup |
|
|
``` |
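
Before running the detection or rewriting scripts, you may want to confirm that the API credentials above are picked up correctly. A minimal check with the standard `openai` Python client (which reads `OPENAI_API_KEY` and `OPENAI_BASE_URL` from the environment) is sketched below; the model name is a placeholder for whatever your endpoint serves:

```python
# Minimal connectivity check for the OpenAI-compatible endpoint configured above.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY and OPENAI_BASE_URL from the environment

resp = client.chat.completions.create(
    model="gpt-4o",  # placeholder: any model your endpoint serves
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=5,
)
print(resp.choices[0].message.content)
```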
|
|
|
|
|
## Usage |
|
|
|
|
|
LastingBench provides three main functionalities: **Detection**, **Rewrite**, and **Training Comparison**.
|
|
|
|
|
### 🔍 Detection |
|
|
|
|
|
Detect knowledge leakage through various perturbation techniques. |
|
|
|
|
|
#### 1. Context Leakage Detection |
|
|
Evaluate models using exact-match scoring on benchmark datasets: |
|
|
```bash |
|
|
# Using vLLM for most models
python -m detect.contextleakage --hf_model "Qwen/Qwen2.5-7B-Instruct" \
    --dataset_subset "hotpotqa" --cuda_devices "0,1"

# Using Transformers for Qwen3 models
python -m detect.contextleakage --hf_model "Qwen/Qwen3-8B" \
    --is_qwen3 --max_new_tokens 30

# Using an API-served model
python -m detect.contextleakage_api --model "deepseek-r1" --dataset_subset "hotpotqa"
|
|
``` |
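
For reference, exact-match scoring in QA evaluation is typically computed after light answer normalization. The snippet below is an illustrative sketch of that metric (SQuAD-style normalization), not the repository's exact implementation:

```python
# Illustrative exact-match (EM) scoring with SQuAD-style answer normalization.
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    return any(normalize(prediction) == normalize(g) for g in gold_answers)

print(exact_match("The Eiffel Tower.", ["Eiffel Tower"]))  # True
```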
|
|
|
|
|
|
|
|
#### 2. Question Perturbation Detection |
|
|
Rephrase questions to opposite meanings and test model consistency: |
|
|
```bash |
|
|
# Using OpenAI API
python -m detect.question_rephrase_answer_api \
    --model_name "gpt-4o" --dataset_subset "2wikimqa" \
    --rephrase_type "opposite" --sample_count 100

# Using local vLLM models
python -m detect.question_rephrase_answer_vllm \
    --model_name "Qwen/Qwen2.5-7B-Instruct" --dataset_subset "hotpotqa" --rephrase_type "similar"

# Using Qwen3 with Transformers
python -m detect.question_rephrase_answer_qwen3 \
    --model_name "Qwen/Qwen3-8B" --dataset_subset "2wikimqa"
|
|
``` |
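
The intuition behind the opposite-meaning rephrasing: a model that genuinely reads the context should change its answer when the question's meaning flips, while a model that has memorized the benchmark tends to return the original gold answer regardless. Below is a rough sketch of that consistency check using an OpenAI-compatible client; the prompt, model name, and example question are placeholders rather than the repository's actual ones:

```python
# Sketch of the consistency check behind question perturbation (placeholders throughout):
# if the model gives the same answer to a question and its opposite-meaning rephrase,
# it is likely recalling a memorized answer instead of reading the context.
from openai import OpenAI

client = OpenAI()

def ask(question: str, context: str, model: str = "gpt-4o") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}\nAnswer briefly.",
        }],
        max_tokens=30,
    )
    return resp.choices[0].message.content.strip()

context = "Film A premiered in 1999. Film B premiered in 2005."
original_q = "Which film was released first, A or B?"
opposite_q = "Which film was released later, A or B?"  # opposite-meaning rephrase

suspicious = ask(original_q, context) == ask(opposite_q, context)
print("Possible leakage:", suspicious)
```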
|
|
|
|
|
|
|
|
### ✏️ Rewrite |
|
|
|
|
|
Generate counterfactual answers and rewrite leaked evidence to create robust benchmarks. |
|
|
|
|
|
|
|
#### 1. Evidence Finding and Counterfactual Rewriting Pipeline |
|
|
Run the complete evidence-finding and rewriting pipeline:
|
|
```bash |
|
|
|
|
|
# Specify custom output file and dataset |
|
|
python main_gpu.py --output custom_output.jsonl \
    --dataset_subset "hotpotqa" --start_idx 0 --max_samples 100
|
|
|
|
|
``` |
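
As a simplified picture of the evidence-finding step: before rewriting, the pipeline has to locate which parts of the context actually support the gold answer. The sketch below just matches the answer string against context sentences; `main_gpu.py` is more sophisticated, and the example data is made up:

```python
# Toy evidence finder: return the context sentences that contain the gold answer string.
# Illustrative only; the real pipeline in main_gpu.py does more than string matching.
import re

def find_evidence(context: str, answer: str) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", context)
    return [s for s in sentences if answer.lower() in s.lower()]

context = "Inception was released in 2010. It was directed by Christopher Nolan."
print(find_evidence(context, "Christopher Nolan"))
# ['It was directed by Christopher Nolan.']
```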
|
|
|
|
|
Convert and merge JSONL files with question-answer mappings: |
|
|
```bash |
|
|
# Merge a single mapping file with the original dataset
|
|
python utils/convert.py original.jsonl revised.jsonl custom_output.jsonl |
|
|
|
|
|
``` |
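
Conceptually, the merge step joins each original example with its revised question-answer mapping. The sketch below shows that idea for generic JSONL files keyed by a hypothetical `question` field; the actual schema and argument order expected by `utils/convert.py` may differ:

```python
# Generic JSONL merge sketch (hypothetical "question" key; not the exact convert.py logic).
import json

def load_jsonl(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

original = load_jsonl("original.jsonl")
revisions = {r["question"]: r for r in load_jsonl("revised.jsonl")}

with open("merged.jsonl", "w", encoding="utf-8") as out:
    for item in original:
        item.update(revisions.get(item.get("question", ""), {}))
        out.write(json.dumps(item, ensure_ascii=False) + "\n")
```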
|
|
The original and revised datasets can be found in the **data** folder.
|
|
|
|
|
#### 2. Random Answer Rewriting |
|
|
Create random alternatives to disrupt memorization: |
|
|
```bash |
|
|
# Specify custom output file and dataset |
|
|
python random_alternative_answer.py --output random_hotpot.jsonl \
    --dataset_subset "hotpotqa" --start_idx 0 --max_samples 50
|
|
|
|
|
``` |
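
The idea behind the random variant is to swap the gold answer in the evidence for a randomly chosen alternative, breaking a memorized question-answer pair without targeting a specific counterfactual. A minimal sketch follows; the field values are made up, and `random_alternative_answer.py` may differ in detail:

```python
# Toy random-answer substitution: replace the gold answer in the context with a random
# alternative drawn from other answers. Illustrative only.
import random

def substitute_answer(context: str, gold: str, candidates: list[str]) -> tuple[str, str]:
    alternatives = [c for c in candidates if c.lower() != gold.lower()]
    new_answer = random.choice(alternatives)
    return context.replace(gold, new_answer), new_answer

ctx = "It was directed by Christopher Nolan."
new_ctx, new_ans = substitute_answer(ctx, "Christopher Nolan",
                                     ["Steven Spielberg", "Greta Gerwig"])
print(new_ctx)
```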
|
|
|
|
|
|
|
|
### 🚀 Dataset Evaluations on Model Inference and Training
|
|
|
|
|
|
|
|
#### 1. Model Inference Evaluation |
|
|
Comprehensive evaluation on original and revised benchmarks: |
|
|
```bash |
|
|
# Transformers-based evaluation |
|
|
python -m eval.evaluation -i data/hotpotqa.jsonl -model "Qwen/Qwen3-8B" -k 40 -t 0.5 |
|
|
|
|
|
# API-based evaluation
python -m eval.eval_with_api --input data/hotpotqa_antifact.jsonl \
    --model "deepseek-r1" --max_tokens 30 --temperature 0.5
|
|
``` |
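
Alongside exact match, QA evaluation usually reports token-level F1 between the prediction and the gold answer. The sketch below shows the standard SQuAD-style computation for reference; the repository's `eval` scripts may apply additional normalization:

```python
# SQuAD-style token-level F1 between a prediction and a gold answer. Illustrative only.
import re
import string
from collections import Counter

def _normalize(text: str) -> str:
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def f1_score(prediction: str, gold: str) -> float:
    pred_tokens, gold_tokens = _normalize(prediction).split(), _normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_tokens), overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(round(f1_score("directed by Christopher Nolan", "Christopher Nolan"), 2))  # 0.67
```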
|
|
|
|
|
#### 2. Model Training Evaluation
|
|
Compare training dynamics between original and rewritten datasets: |
|
|
|
|
|
The training loss data can be found under **training_result**. |
|
|
|
|
|
To reproduce the figure in our paper:
|
|
```bash |
|
|
python utils/draw.py training_result/training_loss_qwen38.csv training_result/training_loss_antifact_qwen38.csv \
    --title "Original vs Rewritten Training Loss"
|
|
``` |
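
If you prefer to plot the curves yourself, a rough equivalent of `utils/draw.py` with pandas and matplotlib is sketched below; it assumes each CSV has `step` and `loss` columns, which is an assumption about the file format:

```python
# Plot original vs. rewritten training loss curves (assumed "step"/"loss" CSV columns).
import pandas as pd
import matplotlib.pyplot as plt

orig = pd.read_csv("training_result/training_loss_qwen38.csv")
rewritten = pd.read_csv("training_result/training_loss_antifact_qwen38.csv")

plt.plot(orig["step"], orig["loss"], label="Original")
plt.plot(rewritten["step"], rewritten["loss"], label="Rewritten")
plt.xlabel("Training step")
plt.ylabel("Loss")
plt.title("Original vs Rewritten Training Loss")
plt.legend()
plt.savefig("training_loss.png", dpi=200)
```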
|
|
|
|
|
|
|
|
|
|
|
### 📊 Utility Functions |
|
|
|
|
|
Additional tools for analysis and metrics: |
|
|
|
|
|
- **Metrics Calculation**: F1 scores, EM scores, and custom evaluation metrics |
|
|
- **Document Retrieval**: BM25-based retrieval for evidence analysis |
|
|
|
|
|
All scripts support various parameters for customization. Use `--help` with any script to see available options. |
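
As an illustration of the BM25-based retrieval mentioned above, the sketch below ranks candidate passages against a query using the `rank_bm25` package; this is an assumption about tooling, not necessarily how the repository implements it:

```python
# Rank candidate passages with BM25 and return the best match
# (rank_bm25 is an assumed dependency, not confirmed from requirements.txt).
import re
from rank_bm25 import BM25Okapi

def tokenize(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())

docs = [
    "Inception was released in 2010.",
    "It was directed by Christopher Nolan.",
    "The Eiffel Tower is in Paris.",
]
bm25 = BM25Okapi([tokenize(d) for d in docs])

scores = bm25.get_scores(tokenize("who directed the movie"))
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])  # 'It was directed by Christopher Nolan.'
```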
|
|
|
|
|
|