first-DPO-without-remove-approach-v2
This model is a fine-tuned version of Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO) via the Unsloth library.
This repository contains the full-merged 16-bit weights. No adapter loading is required.
Training Configuration
- Base model: Qwen/Qwen3-4B-Instruct-2507
- Method: DPO (Direct Preference Optimization)
- Epochs: 2
- Learning rate: 5e-07
- Beta: 0.1
- Max sequence length: 1024
- LoRA Config: r=4, alpha=16, dropout=0 (merged into base)
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "yokoe/first-DPO-without-remove-approach-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.float16,
device_map="auto"
)
Sources & License
- Training Data: [u-10bei/dpo-dataset-qwen-cot]
- License: MIT License.
- Downloads last month
- 19
Model tree for yokoe/first-DPO-without-remove-approach-v2
Base model
Qwen/Qwen3-4B-Instruct-2507