first-DPO-without-remove-approach-v2

This model is a fine-tuned version of Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO) via the Unsloth library.

This repository contains the full-merged 16-bit weights. No adapter loading is required.

Training Configuration

Base model: Qwen/Qwen3-4B-Instruct-2507
Method: DPO (Direct Preference Optimization)
Epochs: 2
Learning rate: 5e-07
Beta: 0.1
Max sequence length: 1024
LoRA Config: r=4, alpha=16, dropout=0 (merged into base)

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "yokoe/first-DPO-without-remove-approach-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

Sources & License

Training Data: [u-10bei/dpo-dataset-qwen-cot]
License: MIT License.

Downloads last month: 19

Safetensors

Model size

4B params

Tensor type

BF16

Model tree for yokoe/first-DPO-without-remove-approach-v2

Base model

Qwen/Qwen3-4B-Instruct-2507

Finetuned

(1382)

this model

yokoe
/

first-DPO-without-remove-approach-v2

first-DPO-without-remove-approach-v2

Training Configuration

Usage

Sources & License

Model tree for yokoe/first-DPO-without-remove-approach-v2

Dataset used to train yokoe/first-DPO-without-remove-approach-v2