---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen3-4B
tags:
- text2sql
- sql
- nlp
- distillation
- qwen3
datasets:
- distil-labs/text2sql-synthetic
language:
- en
pipeline_tag: text-generation
---

# Distil-Qwen3-4B-Text2SQL

A fine-tuned Qwen3-4B model for converting natural language questions into SQL queries. Trained via knowledge distillation from DeepSeek-V3, this 4B-parameter model matches teacher-level accuracy while being small enough to run locally.

## Results

| Metric | DeepSeek-V3 (Teacher) | Qwen3-4B (Base) | **This Model** |
|--------|:---------------------:|:---------------:|:--------------:|
| LLM-as-a-Judge | 80% | 62% | **80%** |
| Exact Match | 48% | 16% | **60%** |
| ROUGE | 87.6% | 84.2% | **89.5%** |
| METEOR | 85.1% | 87.3% | 86.1% |

The fine-tuned model **matches the 685B-parameter teacher** on LLM-as-a-Judge accuracy and **exceeds it** on exact match and ROUGE scores.

## Quick Start

### Using Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("distil-labs/distil-qwen3-4b-text2sql")
tokenizer = AutoTokenizer.from_pretrained("distil-labs/distil-qwen3-4b-text2sql")

schema = """CREATE TABLE employees (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    department TEXT,
    salary INTEGER
);"""

question = "How many employees earn more than 50000?"

messages = [
    {
        "role": "system",
        "content": """You are a problem solving model working on task_description XML block:
You are given a database schema and a natural language question. Generate the SQL query that answers the question.

Input:
- Schema: One or two table definitions in SQL DDL format
- Question: Natural language question about the data

Output:
- A single SQL query that answers the question
- No explanations, comments, or additional text

Rules:
- Use only tables and columns from the provided schema
- Use uppercase SQL keywords (SELECT, FROM, WHERE, etc.)
- Use SQLite-compatible syntax

You will be given a single task in the question XML block
Solve only the task in question block. Generate only the answer, do not generate anything else"""
    },
    {
        "role": "user",
        "content": f"""Now for the real task, solve the task in question block. Generate only the solution, do not generate anything else

Schema:
{schema}

Question: {question}"""
    }
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt")
# Greedy decoding: `temperature=0` is rejected by `generate`; use do_sample=False instead
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Using Ollama (GGUF version)

For local inference, use the quantized GGUF versions:

- [distil-qwen3-4b-text2sql-gguf](https://huggingface.co/distil-labs/distil-qwen3-4b-text2sql-gguf) - Full-precision GGUF
- [distil-qwen3-4b-text2sql-gguf-4bit](https://huggingface.co/distil-labs/distil-qwen3-4b-text2sql-gguf-4bit) - 4-bit quantized (~2.5 GB)

```bash
# Download and create the Ollama model
ollama create distil-qwen3-4b-text2sql -f Modelfile

# Run inference
ollama run distil-qwen3-4b-text2sql
```

## Model Details

| Property | Value |
|----------|-------|
| Base Model | [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) |
| Parameters | 4 billion |
| Architecture | Qwen3ForCausalLM |
| Context Length | 262,144 tokens |
| Precision | bfloat16 |
| Training Data | ~10,000 synthetic examples |
| Teacher Model | DeepSeek-V3 |

## Training

This model was trained using the [Distil Labs](https://distillabs.ai) platform:

1. **Seed Data**: 50 hand-validated Text2SQL examples covering various SQL complexities
2. **Synthetic Generation**: Expanded to ~10,000 examples using DeepSeek-V3
3. **Fine-tuning**: 4 epochs on the synthetic dataset
4.
   **Evaluation**: LLM-as-a-Judge with semantic equivalence checking

### Training Hyperparameters

- Epochs: 4
- Learning Rate: 5e-5 (cosine schedule)
- Batch Size: 1 (with gradient accumulation)
- Total Steps: ~40,000

## Task Format

### Input Format

```
Schema:
CREATE TABLE table_name (
    column_name DATA_TYPE [CONSTRAINTS],
    ...
);

Question: Natural language question about the data
```

### Output Format

A single SQL query with:

- Uppercase SQL keywords (SELECT, FROM, WHERE, etc.)
- SQLite-compatible syntax
- No explanations or additional text

### Supported SQL Features

- **Simple**: SELECT, WHERE, COUNT, SUM, AVG, MAX, MIN
- **Medium**: JOIN, GROUP BY, HAVING, ORDER BY, LIMIT
- **Complex**: Subqueries, multiple JOINs, UNION

## Use Cases

- Natural language interfaces to databases
- SQL query assistance and autocompletion
- Database chatbots and conversational BI
- Educational tools for learning SQL

## Limitations

- Optimized for SQLite syntax
- Best with 1-2 table schemas
- May struggle with highly complex nested subqueries
- Trained on English questions only

## License

This model is released under the Apache 2.0 license.

## Links

- [Distil Labs Website](https://distillabs.ai)
- [GitHub](https://github.com/distil-labs)
- [Hugging Face](https://huggingface.co/distil-labs)

## Citation

```bibtex
@misc{distil-qwen3-4b-text2sql,
  author = {Distil Labs},
  title = {Distil-Qwen3-4B-Text2SQL: A Fine-tuned Model for Natural Language to SQL},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/distil-labs/distil-qwen3-4b-text2sql}
}
```
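## Tip: Validating Generated Queries

Since the model targets SQLite-compatible syntax and the output is a bare SQL string, a generated query can be sanity-checked by executing it against an in-memory SQLite database built from the same schema that was placed in the prompt. The sketch below is illustrative, not part of the model's API: `validate_query` and the sample rows are hypothetical names chosen for this example.

```python
import sqlite3

# Same schema as in the Quick Start example above.
schema = """CREATE TABLE employees (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    department TEXT,
    salary INTEGER
);"""

def validate_query(schema_sql, query, sample_rows=None):
    """Run `query` against a throwaway in-memory database.

    Raises sqlite3.OperationalError if the generated SQL is invalid
    (e.g. references a column not in the schema); otherwise returns rows.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_sql)
        if sample_rows:
            conn.executemany(
                "INSERT INTO employees (name, department, salary) VALUES (?, ?, ?)",
                sample_rows,
            )
        return conn.execute(query).fetchall()
    finally:
        conn.close()

# Hypothetical sample data; in practice `query` would come from the model.
rows = [("Ada", "Engineering", 90000), ("Bob", "Sales", 45000)]
result = validate_query(
    schema, "SELECT COUNT(*) FROM employees WHERE salary > 50000;", rows
)
print(result)  # [(1,)]
```

Executing against an empty or synthetic database catches syntax errors and schema mismatches cheaply, though it cannot confirm that the query answers the question correctly.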