# IkhouDict-xs
## Model Description
IkhouDict-xs is a small bilingual dictionary model fine-tuned from Qwen/Qwen3-0.6B for single-line gloss generation. Given a word or short phrase in context, the model returns 1 to 4 short translations or synonyms in a target language. Outputs follow a strict rubric: a single line, no quotes or labels, no trailing punctuation, and an optional French grammatical hint when the target language is French.
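The rubric lends itself to a mechanical check. The sketch below is a minimal illustration of those constraints using simple regex tests; the function name and patterns are ours, not taken from the pipeline:

```python
import re

def looks_like_valid_gloss(line: str) -> bool:
    """Illustrative rubric check: one line, no quotes or labels,
    no trailing punctuation, 1-4 comma-separated glosses."""
    line = line.strip()
    if not line or "\n" in line:
        return False
    if re.search(r'["\u201c\u201d]', line):               # no quotes
        return False
    if re.match(r"(?i)(definition|meaning)\s*:", line):   # no labels
        return False
    if re.search(r"[.!?]$", line):                        # no trailing punctuation
        return False
    # Drop an optional French POS prefix (nm./nf./adj./adv.) before counting.
    body = re.sub(r"^(nm|nf|adj|adv)\.\s+", "", line)
    # Approximate: a French "(tense, subject)" hint would inflate this count.
    return 1 <= len(body.split(", ")) <= 4
```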
## Intended Use
This model is intended for lexicography support, language-learning tools, and first-pass draft glossing. It is not a substitute for professional translation or domain-specific terminology work. Outputs should be reviewed by a human in high-stakes settings.
## Training Data
Training data are produced by the data generation pipeline in `training/` in this repository. The pipeline creates synthetic dictionary examples from web corpora, then filters and formats them for supervised fine-tuning (SFT).
Pipeline summary:
- Extract sentences from multilingual web corpora (FineWeb-2 by default; optional FineWeb for English-only supplementation).
- Select a target word or phrase from each sentence (a single token or a short phrase of up to 5 tokens; `phrase_ratio` controls the mix).
- Sample target languages, including cross-lingual targets. The default config uses 10 languages (`deu`, `eng`, `spa`, `fra`, `ita`, `jpn`, `kor`, `por`, `rus`, `cmn`) and generates multiple target languages per example.
- A teacher LLM (OpenAI-compatible endpoint) generates a short gloss under a strict rubric. Definitions are cleaned and validated.
- Examples below a quality threshold are dropped, and the remaining examples are de-duplicated by `(source_lang, target_lang, selection, context)`.
- Each example is written to SFT JSONL format with a system prompt, a user prompt, and a `<final>...</final>` assistant answer (see the sketch after this list).
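For illustration, a single SFT record and the de-duplication key described above might be assembled as follows. The field names here are hypothetical stand-ins; the actual schema is defined by the pipeline code in `training/`:

```python
import json

# Hypothetical record layout for one SFT example.
record = {
    "messages": [
        {"role": "system", "content": "<dictionary system prompt>"},
        {"role": "user", "content": "<expression + context + language pair>"},
        {"role": "assistant", "content": "<final>en línea, por internet</final>"},
    ],
    # Provenance fields used for filtering, de-duplication, and splitting.
    "source_lang": "eng",
    "target_lang": "spa",
    "selection": "online",
    "context": "He paid for the course online and started immediately.",
}

# De-duplication key as described above.
dedup_key = (record["source_lang"], record["target_lang"],
             record["selection"], record["context"])

with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```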
The run used for this model produced:
- Train: 1,521,749 examples
- Eval: 15,366 examples
- Test: 15,011 examples
Splits are deterministic: examples are grouped on provenance metadata so that related examples never straddle splits, which reduces leakage (see `sft/src/ikhou_sft/split.py`).
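A common way to implement deterministic, group-respecting splits is to hash the provenance key and bucket by the result. The helper below sketches that idea; the function name and percentages are assumptions, and the actual logic lives in `sft/src/ikhou_sft/split.py`:

```python
import hashlib

def assign_split(provenance_key: str, eval_pct: float = 1.0, test_pct: float = 1.0) -> str:
    """Stable train/eval/test assignment: every example sharing a provenance
    key hashes to the same bucket, so groups land in exactly one split."""
    bucket = int(hashlib.sha256(provenance_key.encode("utf-8")).hexdigest(), 16) % 10_000
    if bucket < eval_pct * 100:
        return "eval"
    if bucket < (eval_pct + test_pct) * 100:
        return "test"
    return "train"
```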
## Training Procedure
Fine-tuning was performed with the `ikhou_sft` pipeline in this repository:
- Base model: Qwen/Qwen3-0.6B
- Full fine-tuning (no LoRA)
- Supervised fine-tuning using the chat template
- Max sequence length: 512
- Optimizer: Muon
- 1 epoch with gradient accumulation
See `sft/src/ikhou_sft/train.py` for implementation details.
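For orientation, the choices above condense to roughly the following summary (illustrative only; this is not the repository's actual config schema):

```python
# Illustrative hyperparameter summary; see sft/src/ikhou_sft/train.py
# for the real configuration and training loop.
sft_config = {
    "base_model": "Qwen/Qwen3-0.6B",
    "method": "full",             # full fine-tuning, no LoRA adapters
    "chat_template": True,        # SFT on the model's chat template
    "max_seq_len": 512,
    "optimizer": "Muon",
    "num_epochs": 1,
    "gradient_accumulation": True,
}
```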
## How To Use
The model expects a system prompt and a user prompt that mirror the data generation pipeline.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ikhou/dict-xs"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

system_prompt = (
    "You are a bilingual dictionary assistant.\n\n"
    "Your job: given a word/phrase in context, output a SHORT dictionary-style gloss line.\n\n"
    "Hard rules:\n"
    "- Output EXACTLY ONE LINE and nothing else.\n"
    "- No quotes, no bullets, no labels (no \"Definition:\", \"Meaning:\", etc).\n"
    "- Do NOT repeat the original word/phrase in the output.\n"
    "- Keep it short (ideally <= 120 characters).\n\n"
    "Gloss rules:\n"
    "- Output 1-4 translations/synonyms in the definition language, separated by \", \".\n"
    "- Each gloss should be short (1-3 words). Prefer common, user-friendly glosses.\n"
    "- Do NOT write full sentences. No trailing period.\n\n"
    "French grammar hints (only if confident):\n"
    "IMPORTANT: The French-only formatting hints below apply ONLY when the definition language is French (fr/fra).\n"
    "If the definition language is NOT French, do NOT use nm./nf./adj./adv., do NOT add French tense notes, and do NOT add (pp).\n"
    "- Noun: prefix with \"nm.\" (masc) or \"nf.\" (fem), then a space, then glosses.\n"
    " Example: nm. face\n"
    "- Adjective: prefix with \"adj.\", then a space, then glosses.\n"
    " Example: adj. fragile, delicate\n"
    "- Adverb: prefix with \"adv.\", then a space, then glosses.\n"
    " Example: adv. extremely, exceedingly\n"
    "- Conjugated verb form: glosses, then add \"(tense, subject)\" in French.\n"
    " Example: came back, used to come back (imparfait, il)\n"
    "- Past participle: glosses, then add \"(pp)\".\n"
    " Example: watched over, supervised (pp)\n"
)

user_prompt = (
    'Expression: "online"\n'
    "Context: He paid for the course online and started immediately.\n"
    "Source language: eng (English)\n"
    "Definition language: spa (Spanish)\n\n"
    "Return the single-line gloss now."
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_new_tokens=64,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens, not the echoed prompt.
decoded = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(decoded)
```
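Because the training targets wrap the answer in `<final>...</final>` tags, the decoded text may include those tags depending on how they are tokenized. A small, illustrative post-processing step extracts the bare gloss:

```python
import re

# Strip the <final> wrapper if present; otherwise use the raw decode.
match = re.search(r"<final>(.*?)</final>", decoded, flags=re.DOTALL)
gloss = match.group(1).strip() if match else decoded.strip()
print(gloss)  # e.g. "en línea, por internet" (illustrative output)
```

For the English-to-Spanish example above, a well-formed gloss would be a short comma-separated list like the one in the comment; actual output may vary.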
## Limitations and Risks
- Outputs can be inaccurate, overly general, or inconsistent with the rubric.
- The model inherits biases from source corpora and the teacher model.
- Rare languages or specialized terminology may be poorly handled.
## Acknowledgements
Base model: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B).