---
license: other
license_name: embedl-models-community-licence-1.0
license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE
base_model:
- google/gemma-3-1b-it
tags:
- text-generation-inference
---

# gemma-3-1b-it-FlashHead-W4A16

![FlashHead banner](assets/FlashHead.png)

**Optimized version of gemma-3-1b-it using Quantization and FlashHead, Embedl’s efficient replacement for the language model head, reducing size while preserving accuracy.**

Designed for **low-latency inference** on **NVIDIA RTX GPUs**, leveraging:

- FlashHead
- Quantization (W4A16)
- Custom vLLM generation via `embedl-models`

FlashHead matches the gemma-3-1b-it baseline within rounding error on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, delivers state-of-the-art on-device latency.

---

## Model Details

| **Field** | **Value** |
|------------|------------|
| **Base Model** | gemma-3-1b-it |
| **Input / Output** | Text → Text |
| **Release Date** | 2025-12-08 |
| **Version** | 1.0 |
| **Optimizations** | FlashHead LM Head, Quantization (W4A16) |
| **Developers** | Embedl |
| **Licenses** | Upstream: Gemma Terms of Use. Optimized components: Embedl Models Community Licence v1.0 *(no redistribution)* |
| **Intended Use** | Text generation, reasoning, assistant-style interaction, and general-purpose NLP on NVIDIA RTX GPUs |

---

## Optimizations

- **FlashHead LM Head** - lightweight replacement for the dense LM head, significantly improving throughput.
- **Quantization (W4A16)** - large reduction in memory footprint and latency.
- **Custom Runtime Integration** - compatible with **vLLM (0.10.2)** via the `embedl-models` package.

---

## Performance

### Token Generation Speed (RTX 3500 Ada, batch size = 1)

| **Precision** | **Tokens/sec** | **Speedup vs BF16** |
|----------------|----------------|----------------------|
| BF16 baseline | 148 | 1.0× |
| **FlashHead (Embedl)** | **178** | **1.20×** |
| W4A16 baseline | 243 | 1.64× |
| **FlashHead W4A16 (Embedl)** | **336** | **2.27×** |

FlashHead improves end-to-end W4A16 generation speed by **1.38×** over the W4A16 baseline, while maintaining accuracy parity with the baseline (see below).

**Measurement setup:** vLLM 0.10.2, batch_size=1, prompt length=32, max_new_tokens=128, 10 warm-up runs, averaged over 100 runs.
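The snippet below is a minimal sketch of how a comparable tokens-per-second measurement can be run against this model; it is not the official benchmarking harness. It assumes the `embedl.models.vllm.LLM` wrapper exposes the standard vLLM `generate` interface (as used in the Usage Examples below), and the short fixed prompt and `max_model_len=4096` are illustrative stand-ins for the exact 32-token-prompt setup listed above.

```python
import time

from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/gemma-3-1b-it-FlashHead-W4A16"

if __name__ == "__main__":
    # max_model_len=4096 is an illustrative value, not part of the official setup.
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=4096)
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    prompt = "Explain the difference between latency and throughput in one paragraph."

    # Warm-up runs, excluded from timing.
    for _ in range(10):
        llm.generate([prompt], sampling)

    # Timed single-prompt (batch size = 1) generations.
    n_runs, total_tokens, total_time = 100, 0, 0.0
    for _ in range(n_runs):
        start = time.perf_counter()
        outputs = llm.generate([prompt], sampling)
        total_time += time.perf_counter() - start
        total_tokens += len(outputs[0].outputs[0].token_ids)

    print(f"Average decode throughput: {total_tokens / total_time:.1f} tokens/sec")
```

Results will vary with hardware, drivers, and prompt length; the numbers in the table above were measured on an RTX 3500 Ada with Embedl's own setup.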
---

## Accuracy (Parity with Baseline)

| **Method** | **MMLU-Pro** | **IFEval** | **BBH** | **TruthfulQA** | **GSM8K** |
|-------------|---------------|--------------|-------------|----------------|--------------|
| **Baseline** | 0.15 | 0.55 | 0.38 | 0.31 | 0.42 |
| **FlashHead** | 0.15 | 0.49 | 0.38 | 0.31 | 0.39 |

FlashHead closely matches baseline accuracy.

---

## Installation

```bash
pip install embedl-models
```

The `embedl-models` package is required; it provides the optimized FlashHead implementation and the quantized model runtime.

---

## Usage Examples

**Note (vLLM context length):** `max_model_len=131072` may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower `max_model_len` (or increase `gpu_memory_utilization`); a sketch of such a configuration follows the example below.

### vLLM Inference

```python
from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/gemma-3-1b-it-FlashHead-W4A16"

if __name__ == "__main__":
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=131072)

    prompt = "Write a haiku about coffee."
    output = llm.generate([prompt], sampling)
    print(output[0].outputs[0].text)
```
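If the full 131,072-token context does not fit in VRAM, a reduced configuration along these lines usually resolves KV cache memory errors. This is a minimal sketch, not an official configuration: `max_model_len=8192` and `gpu_memory_utilization=0.90` are illustrative values, and it assumes the `embedl.models.vllm.LLM` wrapper forwards these standard vLLM engine arguments.

```python
from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/gemma-3-1b-it-FlashHead-W4A16"

if __name__ == "__main__":
    # A smaller context window shrinks the KV cache reservation;
    # gpu_memory_utilization controls how much free VRAM vLLM may claim.
    llm = LLM(
        model=model_id,
        trust_remote_code=True,
        max_model_len=8192,           # illustrative: use the longest context you actually need
        gpu_memory_utilization=0.90,  # illustrative: raise cautiously if other processes share the GPU
    )
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    output = llm.generate(["Summarize the benefits of 4-bit weight quantization."], sampling)
    print(output[0].outputs[0].text)
```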
---

### Interactive REPL Example

The `run_repl()` coroutine launches an **interactive, streaming chat interface** using the vLLM backend with FlashHead enabled. It maintains an in-memory chat history and supports simple commands such as `/exit` to quit and `/reset` to clear context.

```python
import asyncio

from embedl.models.vllm.demo import run_repl

model_id = "embedl/gemma-3-1b-it-FlashHead-W4A16"

if __name__ == "__main__":
    asyncio.run(
        run_repl(
            model=model_id,
            max_model_len=131072
        )
    )
```

---

## ⚠️ Important Warning: Hugging Face Transformers Support

> **FlashHead is currently not applied when using the Hugging Face `transformers` pipeline.**
> Generation through `transformers` will fall back to the standard dense LM head, **disabling FlashHead acceleration**.
>
> For now, **we strongly recommend using the vLLM integration** (`embedl.models.vllm.LLM`) to ensure FlashHead is active and optimized for low-latency inference.
>
> Full support for the Hugging Face `transformers` pipeline with FlashHead integration will be released **in the coming days**.

---

## Limitations

- Limited to **vLLM 0.10.2** (pinned dependency)
- **Batch size = 1** (real-time generation)
- Currently optimized for **NVIDIA RTX GPUs**

---

## Roadmap

Planned improvements:

- Advanced mixed-precision quantization
- Hugging Face `transformers` generation
- vLLM CLI benchmarking for detailed latency evaluation
- `lm-eval-harness` integration for detailed accuracy evaluation
- Upstream support in **Transformers** and **vLLM**
- Compatibility with **GGUF**, **MLC**, **Llama.cpp**, **Ollama**, etc.
- Broader model coverage (larger models, VLMs, VLAs)

---

## License

- **Upstream:** Gemma Terms of Use.
- **Optimized Components:** Embedl Models Community Licence v1.0 *(no redistribution)*

---

## Contact

**Enterprise & Commercial Inquiries**  
[sales@embedl.com](mailto:sales@embedl.com)

**Technical Issues & Early Access**  
[https://github.com/embedl/embedl-models](https://github.com/embedl/embedl-models)

**More Information & Model Releases**  
[https://embedl.com](https://embedl.com)

---

### Partner & Developer Opportunities

If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for:

- Embedl SDK - AI optimization tools & profiling
- Embedl HUB - benchmarking platform
- Engineering support for on-prem/edge deployments
- Migration guidance (Llama / Qwen / Gemma)
- Early access & partner co-marketing opportunities

Contact: [sales@embedl.com](mailto:sales@embedl.com)