IkhouDict-xs

Model Description

IkhouDict-xs is a small bilingual dictionary model fine-tuned from Qwen/Qwen3-0.6B for single-line gloss generation. Given a word or short phrase in context, the model returns 1 to 4 short translations or synonyms in a target language. The rubric enforces a single line, no quotes or labels, no trailing punctuation, and optional French grammatical hints when the target language is French.

Intended Use

This model is intended for lexicography support, language-learning tools, and first-pass draft glossing. It is not a substitute for professional translation or domain-specific terminology work, and outputs should be reviewed by a human in high-stakes settings.

Training Data

Training data are produced by the data generation pipeline in the training/ directory of this repository. The pipeline creates synthetic dictionary examples from web corpora, then filters and formats them for supervised fine-tuning (SFT).

Pipeline summary:

  1. Extract sentences from multilingual web corpora (FineWeb-2 by default; optional FineWeb for English-only supplementation).
  2. Select a target word or phrase from each sentence (single token or short phrase up to 5 tokens; phrase_ratio controls the mix).
  3. Sample target languages, including cross-lingual targets. The default config uses 10 languages (deu, eng, spa, fra, ita, jpn, kor, por, rus, cmn) and generates multiple target languages per example.
  4. A teacher LLM (OpenAI-compatible endpoint) generates a short gloss under a strict rubric. Definitions are cleaned and validated.
  5. Examples below a quality threshold are dropped, then remaining examples are de-duplicated by (source_lang, target_lang, selection, context).
  6. Each example is written in SFT JSONL format with a system prompt, a user prompt, and a <final>...</final> assistant answer (see the illustrative record after this list).
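
For illustration, a single record in the SFT JSONL output might look like the following, shown pretty-printed here (the actual file stores one record per line; the "messages" field name is an assumption for this sketch, and the Spanish gloss is invented):

{"messages": [
  {"role": "system", "content": "You are a bilingual dictionary assistant. ..."},
  {"role": "user", "content": "Expression: \"online\"\nContext: He paid for the course online and started immediately.\nSource language: eng (English)\nDefinition language: spa (Spanish)\n\nReturn the single-line gloss now."},
  {"role": "assistant", "content": "<final>en línea, por internet</final>"}
]}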

The run used for this model produced:

  • Train: 1,521,749 examples
  • Eval: 15,366 examples
  • Test: 15,011 examples

Splits are deterministic by grouping on provenance metadata to reduce leakage (see sft/src/ikhou_sft/split.py).
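
A minimal sketch of what group-based deterministic splitting can look like (illustrative only: the function name and split fractions here are assumptions, and the authoritative logic is in sft/src/ikhou_sft/split.py):

import hashlib

def assign_split(provenance_key: str, eval_frac: float = 0.01, test_frac: float = 0.01) -> str:
    # Hash the provenance key so every example derived from the same source
    # document lands in the same split, reducing train/eval leakage.
    bucket = int(hashlib.sha256(provenance_key.encode("utf-8")).hexdigest(), 16) % 10_000
    if bucket < int(eval_frac * 10_000):
        return "eval"
    if bucket < int((eval_frac + test_frac) * 10_000):
        return "test"
    return "train"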

Training Procedure

Fine-tuning was performed with the ikhou_sft pipeline in this repository:

  • Base model: Qwen/Qwen3-0.6B
  • Full fine-tuning (no LoRA)
  • Supervised fine-tuning using the chat template
  • Max sequence length: 512
  • Optimizer: Muon
  • 1 epoch with gradient accumulation

See sft/src/ikhou_sft/train.py for implementation details.
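
For orientation only, the settings above correspond to a config along these lines (a hypothetical sketch: the key names and the accumulation factor are assumptions, not values read from the training code):

sft_config = {
    "base_model": "Qwen/Qwen3-0.6B",
    "tuning": "full",           # full fine-tuning, no LoRA adapters
    "use_chat_template": True,  # format examples with the model's chat template
    "max_seq_len": 512,
    "optimizer": "muon",
    "num_epochs": 1,
    "grad_accum_steps": 8,      # illustrative; the actual factor is not stated above
}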

How To Use

The model expects a system prompt and a user prompt that mirror the data generation pipeline.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ikhou/dict-xs"

# Load the fine-tuned model and its tokenizer.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

system_prompt = (
    "You are a bilingual dictionary assistant.\n\n"
    "Your job: given a word/phrase in context, output a SHORT dictionary-style gloss line.\n\n"
    "Hard rules:\n"
    "- Output EXACTLY ONE LINE and nothing else.\n"
    "- No quotes, no bullets, no labels (no \"Definition:\", \"Meaning:\", etc).\n"
    "- Do NOT repeat the original word/phrase in the output.\n"
    "- Keep it short (ideally <= 120 characters).\n\n"
    "Gloss rules:\n"
    "- Output 1-4 translations/synonyms in the definition language, separated by \", \".\n"
    "- Each gloss should be short (1-3 words). Prefer common, user-friendly glosses.\n"
    "- Do NOT write full sentences. No trailing period.\n\n"
    "French grammar hints (only if confident):\n"
    "IMPORTANT: The French-only formatting hints below apply ONLY when the definition language is French (fr/fra).\n"
    "If the definition language is NOT French, do NOT use nm./nf./adj./adv., do NOT add French tense notes, and do NOT add (pp).\n"
    "- Noun: prefix with \"nm.\" (masc) or \"nf.\" (fem), then a space, then glosses.\n"
    "  Example: nm. face\n"
    "- Adjective: prefix with \"adj.\", then a space, then glosses.\n"
    "  Example: adj. fragile, delicate\n"
    "- Adverb: prefix with \"adv.\", then a space, then glosses.\n"
    "  Example: adv. extremely, exceedingly\n"
    "- Conjugated verb form: glosses, then add \"(tense, subject)\" in French.\n"
    "  Example: came back, used to come back (imparfait, il)\n"
    "- Past participle: glosses, then add \"(pp)\".\n"
    "  Example: watched over, supervised (pp)\n"
)

user_prompt = (
    'Expression: "online"\n'
    "Context: He paid for the course online and started immediately.\n"
    "Source language: eng (English)\n"
    "Definition language: spa (Spanish)\n\n"
    "Return the single-line gloss now."
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]

# Build the chat-formatted prompt and move it to the model's device.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_new_tokens=64,
        do_sample=False,  # greedy decoding for a deterministic gloss
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens, not the echoed prompt.
gloss = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(gloss)
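
Because the SFT targets wrap the assistant answer in <final>...</final> (see Training Data), the raw generation may include that wrapper. A minimal, illustrative post-processing step:

import re

# Strip the <final>...</final> wrapper used in the SFT targets, if present.
match = re.search(r"<final>(.*?)</final>", gloss, flags=re.DOTALL)
if match:
    gloss = match.group(1).strip()
print(gloss)  # expected: a single comma-separated line of glosses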

Limitations and Risks

  • Outputs can be inaccurate, overly general, or inconsistent with the rubric.
  • The model inherits biases from source corpora and the teacher model.
  • Rare languages or specialized terminology may be poorly handled.

Acknowledgements

Base model: Qwen/Qwen3-0.6B.
