Considering support for QAT, Unsloth Dynamic 2.0 Quants (UD), and TQ2_0/TQ1_0 quantization

#1732
opened by wahidmounir

Recently I stumbled upon the QAT, Unsloth Dynamic 2.0 (UD), and TQ2_0/TQ1_0 quantizations of several models, and I've been testing them a lot.
TQ1_0 and TQ2_0 are good for large models but give bad results for small models and on math tasks; still, they are worth it for other tasks.
For smaller models, QAT and UD 2.0 give better results than other quantizations.

The request: it would be very nice if the mradermacher team included these quantizations in their quantized models.

(I didn't check what those are.) I mean, if they are included in default llama.cpp, I don't see why not.
@nicoboss

Llama.cpp provides experimental support for the TQ1_0 and TQ2_0 quantization formats, which are ternary-based methods primarily optimized for models like BitNet b1.58 and TriLM.

Status Report: TQ1_0 and TQ2_0 Support 

  • Definition: These are ternary quantization formats. TQ1_0 uses approximately 1.69 bits per weight, while TQ2_0 uses 2.06 bits. Unlike standard bit-based quants, they represent weights using "trits" (three possible values); see the bits-per-weight sketch after this list.
  • Current Hardware Support:
    • CPU: Fully supported via llama.cpp's conversion and inference engine.
    • GPU: Metal (Apple Silicon) and CUDA support is still in development or partial. Users have reported significant performance slowdowns (up to 30x slower than K-quants) after dequantization on certain GPUs.
  • Model Compatibility: These quants are intended only for models specifically trained for ternary weights (e.g., BitNet). Applying them to standard models like Qwen or Llama results in "garbage" or messy text output.
  • Unsloth Integration: Unsloth uses these formats for extremely small versions of massive models, such as their 1.66-bit DeepSeek-R1 GGUF, which fits within 162GB of RAM. While these are marketed as "Unsloth Dynamic Quants," they rely on the underlying llama.cpp TQ implementation for execution.
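
To make the bit counts above concrete, here is a small arithmetic sketch. The block layout is assumed from the llama.cpp ggml ternary-quant implementation (256-weight blocks with an fp16 scale, TQ1_0 packing five trits per byte); treat the byte counts as illustrative, not authoritative.

```python
import math

BLOCK = 256  # weights per quantization block (assumed)

# Theoretical floor: one ternary weight ("trit") carries log2(3) bits.
print(f"log2(3)  = {math.log2(3):.3f} bits per trit")

# TQ1_0: five trits packed per byte (3^5 = 243 fits in one byte) + an fp16 scale.
tq1_bytes = math.ceil(BLOCK / 5) + 2
print(f"TQ1_0   ~= {tq1_bytes * 8 / BLOCK:.2f} bits per weight")  # ~1.69

# TQ2_0: two bits per trit + an fp16 scale.
tq2_bytes = BLOCK * 2 // 8 + 2
print(f"TQ2_0   ~= {tq2_bytes * 8 / BLOCK:.2f} bits per weight")  # ~2.06
```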

 

Critical GitHub Links 

  • Primary Implementation (llama.cpp):
    The core support for TQ1_0 and TQ2_0 was added to the main repository under the ggml backend.
    • Main llama.cpp Repository
    • Conversion Tool (convert_hf_to_gguf.py) - use this to generate TQ quants from compatible HF models (see the conversion sketch after this list).
  • Unsloth Support & Issues:
  • Alternative Implementation (ik_llama.cpp):
    For users seeking more optimized or experimental low-bit quants beyond the main branch.
    • ik_llama.cpp Fork - contains specific SOTA optimizations for extreme quantization that often precede mainline merges.
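
As a rough sketch of that conversion path: paths and model names below are placeholders, the two-step convert-then-quantize flow is the usual llama.cpp workflow, and only ternary-trained models (e.g. BitNet b1.58) will produce usable output.

```python
import subprocess

# Placeholder paths -- substitute a ternary-trained checkpoint (e.g. BitNet b1.58).
hf_model_dir = "path/to/ternary-hf-model"
f16_gguf = "model-f16.gguf"
tq_gguf = "model-tq1_0.gguf"

# Step 1: convert the HF checkpoint to an unquantized GGUF.
subprocess.run(
    ["python", "convert_hf_to_gguf.py", hf_model_dir,
     "--outtype", "f16", "--outfile", f16_gguf],
    check=True,
)

# Step 2: requantize to the ternary format with llama-quantize.
subprocess.run(["./llama-quantize", f16_gguf, tq_gguf, "TQ1_0"], check=True)
```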

Unsloth Dynamic (UD) quantization is Unsloth's own method that optimizes model quality by assigning different bit-precisions to individual layers based on their sensitivity. Unlike static quantization, it uses a large calibration dataset (up to 1.5M tokens) to protect critical reasoning and attention layers while aggressively compressing less important ones.
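
To illustrate the idea only (this is not Unsloth's actual code; the tensor names, sensitivity scores, and thresholds below are invented for illustration), a dynamic scheme boils down to mapping each tensor to one of the existing GGML quant types based on how much damage quantizing it would cause:

```python
# Toy sketch of per-tensor quant-type selection; NOT Unsloth's implementation.
def choose_quant_type(tensor_name: str, sensitivity: float) -> str:
    """More sensitive tensors keep more bits; every choice is an existing GGML type."""
    if "embd" in tensor_name or "output" in tensor_name:
        return "Q6_K"          # embeddings / output head kept at high precision
    if sensitivity > 0.8:
        return "Q5_K"          # critical attention / reasoning layers
    if sensitivity > 0.4:
        return "Q4_K"
    return "IQ2_XXS"           # aggressively compress the least sensitive tensors

# Hypothetical sensitivity scores (in practice derived from a calibration pass).
sensitivities = {
    "token_embd.weight": 1.0,
    "blk.0.attn_q.weight": 0.9,
    "blk.10.ffn_down.weight": 0.5,
    "blk.30.ffn_up.weight": 0.2,
}
for name, score in sensitivities.items():
    print(f"{name:26s} -> {choose_quant_type(name, score)}")
```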

Llama.cpp Support Status 

Llama.cpp natively supports UD-quantized models because they utilize existing GGML quant types (like Q4_K, Q6_K, and TQ1_0). The "Dynamic" aspect is handled during the export process within Unsloth, not by a new runtime operator in llama.cpp. 

  • Runtime Support: UD quants work "out of the box" in llama.cpp versions released after early 2025.
  • Hardware Efficiency: Since UD quants mix different bitrates, memory usage is reduced while performance remains consistent with standard GGUF inference (the per-tensor mix can be inspected as in the sketch after this list).
  • Smallest Size: UD-TQ1_0 (1.8-bit) is the smallest variant, designed to fit models like DeepSeek-R1 (671B) onto limited hardware (e.g., 24GB VRAM with offloading).
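
One way to see the mixed types for yourself is to list the per-tensor quant types with the gguf Python package (pip install gguf); the filename below is a placeholder, and any downloaded UD GGUF should do.

```python
from collections import Counter
from gguf import GGUFReader  # pip install gguf

# Placeholder filename -- point this at any downloaded UD GGUF shard.
reader = GGUFReader("model-UD-Q4_K_XL.gguf")

# Count how many tensors use each GGML quantization type.
type_counts = Counter(t.tensor_type.name for t in reader.tensors)
for qtype, count in type_counts.most_common():
    print(f"{qtype:10s} x {count}")
```

A mix of several types in the output is exactly the "dynamic" part; llama.cpp itself just sees ordinary per-tensor GGML types.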

 

Critical GitHub & Resource Links 

  • Primary Implementation (Unsloth):
    The UD logic resides in Unsloth's export kernels rather than the llama.cpp core.
  • Official Documentation:
  • Integration Repositories:
    • DeepSeek-R1 Dynamic GGUF Discussion - technical breakdown of how layers are assigned specific quants.
    • Ollama UD-Q6_K_XL Support - example of UD quants deployed in Ollama (via the llama.cpp backend).

answer from nico: T-quants are only useful for very specific models, so just add them in addition to our usual quants when queuing a model that would benefit from them, or when someone explicitly requests them for a specific model. This is exactly how we currently add MXFP4 quants to all GPT-OSS-based models we queue (if we don't forget).
