This model is Qwen3-1.8B fine-tuned for generative product recommendation using hierarchical Semantic IDs (SIDs). Given a product description, a purchase history, or a co-purchase context, it generates a 4-level Semantic ID of the form `<|sid_start|><|A#|><|B#|><|C#|><|D#|><|sid_end|>`.
This is the smaller model in a controlled comparison experiment (1.8B vs 8B) conducted under identical training conditions.
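
To make the Semantic ID format concrete, here is a minimal parsing sketch. It assumes that each `#` in the pattern above stands for a non-negative integer index, which the card does not state explicitly; `parse_sid` and `SID_PATTERN` are hypothetical helper names, not part of the model's code.

```python
import re
from typing import Optional, Tuple

# Assumption: each '#' in the documented pattern is a non-negative integer.
SID_PATTERN = re.compile(
    r"<\|sid_start\|><\|A(\d+)\|><\|B(\d+)\|><\|C(\d+)\|><\|D(\d+)\|><\|sid_end\|>"
)


def parse_sid(text: str) -> Optional[Tuple[int, ...]]:
    """Return the (A, B, C, D) indices of the first Semantic ID in `text`.

    Returns None when no well-formed Semantic ID is found.
    """
    match = SID_PATTERN.search(text)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())
```

For example, `parse_sid("<|sid_start|><|A12|><|B3|><|C45|><|D6|><|sid_end|>")` yields the four level indices as a tuple, and malformed output yields `None`, which is convenient when filtering generations.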
Hierarchical SID prediction accuracy (A-level match, greedy decoding):
| Task | Accuracy |
|---|---|
| Text → SID | 59.9% |
| Sequential recommendation | 7.0% |
| Co-purchase prediction | 5.5% |
Evaluation: 3,000 samples per task across 11 task types.
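
The "A-level match" metric above can be read as: a prediction counts as correct when its first-level (`A`) token equals that of the reference. The sketch below implements that reading; the exact evaluation procedure used for the reported numbers is not given here, so treat this as an assumption. `a_level` and `a_level_accuracy` are hypothetical helper names.

```python
import re
from typing import Iterable, Optional

# Matches the first-level token, e.g. "<|A12|>" -> "12".
# Assumption: the level index after 'A' is an integer.
A_TOKEN = re.compile(r"<\|A(\d+)\|>")


def a_level(sid_text: str) -> Optional[str]:
    """Extract the A-level index from a Semantic ID string, or None if absent."""
    match = A_TOKEN.search(sid_text)
    return match.group(1) if match else None


def a_level_accuracy(predictions: Iterable[str], references: Iterable[str]) -> float:
    """Fraction of pairs whose A-level indices match.

    A prediction without a parsable A-level token counts as a miss.
    """
    pairs = list(zip(predictions, references))
    if not pairs:
        return 0.0
    hits = 0
    for predicted, reference in pairs:
        predicted_a = a_level(predicted)
        if predicted_a is not None and predicted_a == a_level(reference):
            hits += 1
    return hits / len(pairs)
```

Under this reading, a prediction that gets the `A` level right but differs at `B`, `C`, or `D` still counts as correct, so the metric measures coarse category agreement rather than exact item retrieval.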
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("kalistratov/qwen3-1.8b-semantic-ids")
tokenizer = AutoTokenizer.from_pretrained("kalistratov/qwen3-1.8b-semantic-ids")
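
Once the model and tokenizer are loaded, a Semantic ID can be generated with greedy decoding, matching the decoding setting reported above. The sketch below makes several assumptions: the prompt template is not documented in this card, so `prompt` is passed through as-is; the default of 8 new tokens assumes each SID component is a single token; and special tokens are kept when decoding in case the SID tokens are registered as special. `generate_sid` is a hypothetical helper name.

```python
def generate_sid(model, tokenizer, prompt: str, max_new_tokens: int = 8) -> str:
    """Greedy-decode a continuation of `prompt` and return only the new text.

    `model` and `tokenizer` are the objects returned by `from_pretrained`.
    """
    inputs = tokenizer(prompt, return_tensors="pt")
    prompt_len = len(inputs["input_ids"][0])

    # do_sample=False selects greedy decoding.
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False
    )

    # Drop the prompt tokens so that only the generated part is decoded.
    new_tokens = output_ids[0][prompt_len:]

    # Assumption: SID tokens may be special tokens, so do not skip them.
    return tokenizer.decode(new_tokens, skip_special_tokens=False)
```

A call might look like `generate_sid(model, tokenizer, "Product: wireless optical mouse")`, where the prompt text is purely illustrative and should be replaced with the format the model was trained on.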
Master's thesis, Moscow Institute of Physics and Technology (MIPT), 2026.