distill-pipeline — modular synthetic data engine (thinking + instruct)
distill-pipeline is a modular Node.js synthetic data pipeline that reads JSONL inputs, runs generation + verification + reward, and writes JSONL outputs. It supports both:
- “Thinking” generators that produce visible reasoning, and
- “Instruct” generators that produce direct answers,
with separate caches and outputs so you can compare styles without mixing artefacts.
Rather than owning retrieval, distill-pipeline is designed as the middle layer in a stack: you feed it JSONL chunks or questions (for example, from distill-rag or your own tooling), and it orchestrates the LLM stages to produce clean, reusable synthetic data.
What it does
JSONL-first pipeline
- Reads JSONL chunks (default `data/rag_chunks.jsonl`) or static question seeds (`test_samples/seed_questions.jsonl`); see the reading sketch after this list.
- Writes accepted samples as JSONL into `gold/*.jsonl`.
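Each line of these files is a standalone JSON object. A minimal reading sketch in Node ESM, assuming illustrative field names like `id` and `text` (the pipeline's actual chunk schema may differ):

```js
// Stream chunks from a JSONL file, one JSON object per line.
// Field names (id, text) are assumptions for illustration only.
import { createReadStream } from "node:fs";
import { createInterface } from "node:readline";

const rl = createInterface({ input: createReadStream("data/rag_chunks.jsonl") });
for await (const line of rl) {
  if (!line.trim()) continue; // skip blank lines
  const chunk = JSON.parse(line); // e.g. {"id": "doc-1#0", "text": "..."}
  console.log(chunk.id, chunk.text?.slice(0, 60));
}
```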
Two pipeline modes
- Thinking mode: question → reasoning-style answer → verification → reward.
- Instruct mode: instruction → direct answer pairs, for fine-tuning assistants.
- Each mode has its own cache + output paths so you can run them independently.
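Conceptually the two modes yield differently shaped samples; the hypothetical records below are for orientation only (the real gold schema may use different field names):

```js
// Hypothetical sample shapes; not the pipeline's actual gold schema.
const thinkingSample = {
  question: "Why cache per model?",
  reasoning: "If the model changes, cached generations go stale, so ...", // visible reasoning
  answer: "So outputs can be invalidated when the model changes.",
};
const instructSample = {
  instruction: "Explain JSONL in one sentence.",
  response: "JSONL stores one JSON object per line of a text file.",
};
```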
Retrieval-agnostic, RAG-friendly
- Works with plain JSONL; any RAG stack or pre-processing step that can emit JSONL chunks can plug in.
- Optional “question-first” mode uses context chunks (or Elasticsearch) to generate questions from your corpus.
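In practice, any tool that can serialize one JSON object per line can act as the retrieval layer. A sketch of emitting chunks from your own corpus (fields again illustrative):

```js
// Emit chunks as JSONL from any source; field names are illustrative.
import { writeFileSync } from "node:fs";

const docs = [
  { id: "intro#0", text: "distill-pipeline reads JSONL chunks and writes JSONL gold samples." },
];
const jsonl = docs.map((d) => JSON.stringify(d)).join("\n") + "\n";
writeFileSync("data/rag_chunks.jsonl", jsonl);
```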
Stage-based and cache-heavy
- Questions, generations, verifications, and rewards are cached on disk (JSONL).
- You can change prompts or models and reuse existing work instead of re-running everything.
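One plausible shape for such a cache (a sketch, not the pipeline's actual layout) keys each stage result by a hash of stage name, model, prompt, and input, so changing any of them invalidates only the affected entries:

```js
// Content-addressed JSONL stage cache; layout and names are assumptions.
import { createHash } from "node:crypto";
import { existsSync, readFileSync, appendFileSync } from "node:fs";

function cacheKey(stage, model, prompt, input) {
  return createHash("sha256").update([stage, model, prompt, input].join("\x00")).digest("hex");
}

function lookup(file, key) {
  if (!existsSync(file)) return undefined;
  for (const line of readFileSync(file, "utf8").split("\n")) {
    if (!line) continue;
    const entry = JSON.parse(line); // one {key, value} record per line
    if (entry.key === key) return entry.value;
  }
}

function store(file, key, value) {
  appendFileSync(file, JSON.stringify({ key, value }) + "\n");
}
```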
Local-first providers
- Built to run locally with Ollama as the default provider for all stages.
- Also supports OpenAI/HTTP-style providers, plus mock providers for tests/benchmarks.
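A minimal non-streaming call against Ollama's public `/api/generate` endpoint might look like this (a sketch; assumes Ollama on its default port and Node 18+ for global `fetch`, with error handling trimmed):

```js
// Call a local Ollama model and return the completed text.
async function ollamaGenerate(model, prompt) {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, prompt, stream: false }),
  });
  if (!res.ok) throw new Error(`ollama: HTTP ${res.status}`);
  const data = await res.json();
  return data.response; // full completion when stream is false
}
```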
Monitoring and benchmarks
- Live HUD (`scripts/live_bench.mjs`) for real-time throughput/accept-rate monitoring; a metric sketch follows below.
- Benchmark script (`scripts/bench_pipeline.mjs`) to measure pipeline speed without burning GPU on real models.
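The HUD's numbers boil down to a pair of rolling counters; an illustrative sketch of the throughput and accept-rate math (not the actual internals of `scripts/live_bench.mjs`):

```js
// Rolling throughput/accept-rate counters; illustrative only.
const stats = { started: Date.now(), done: 0, accepted: 0 };

function record(sampleAccepted) {
  stats.done += 1;
  if (sampleAccepted) stats.accepted += 1;
}

function snapshot() {
  const elapsedS = (Date.now() - stats.started) / 1000;
  return {
    throughput: `${(stats.done / elapsedS).toFixed(2)} samples/s`,
    acceptRate: stats.done ? `${((100 * stats.accepted) / stats.done).toFixed(1)}%` : "n/a",
  };
}
```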
Quickstart
```sh
git clone https://github.com/elspru/distill-pipeline && cd distill-pipeline
npm install

# Generate reasoning data for training models on complex problem-solving
PIPELINE_SEED_MODE=question-first PIPELINE_RANDOM_WALK=1 npm run pipeline -- --limit 10 --verbose

# Create instruction data for fine-tuning helpful assistants
INSTRUCT_PIPELINE=1 INSTRUCT_GENERATOR_MODEL=phi-4-instruct npm run pipeline -- --out gold/instruct_gold.jsonl --verbose
```
For ongoing generation to build larger datasets:
```sh
scripts/run_thinking_continuous.sh
INSTRUCT_GENERATOR_MODEL=phi-4-instruct scripts/run_instruct_continuous.sh
```
Configuration (see `.env.example`)
```sh
# Retrieval
ES_NODE=http://localhost:9200
ES_INDEX=quo_distill_index
EMBED_URL=http://localhost:11434/api/embeddings
EMBED_MODEL=mxbai-embed-large

# Providers per stage
GENERATOR_PROVIDER=ollama
VERIFIER_PROVIDER=ollama
REWARD_PROVIDER=ollama
QUESTION_PROVIDER=ollama

# Models
GENERATOR_MODEL=qwen3-vl:8b-thinking
VERIFIER_MODEL=tensortemplar/patronus-lynx:8b-instruct-q4_K_M
REWARD_MODEL=tensortemplar/patronus-lynx:8b-instruct-q4_K_M
QUESTION_MODEL=qwen2.5-7b-instruct

# Instruct-only generator
INSTRUCT_PIPELINE=0
INSTRUCT_GENERATOR_MODEL=phi-4-instruct
INSTRUCT_GENERATOR_PROVIDER=ollama

# Pipeline knobs
PIPELINE_SEED_MODE=question-first
PIPELINE_RANDOM_WALK=0 # set 1 for shuffled chunks
QUESTION_MAX_PER_CHUNK=5
# PIPELINE_CHUNK_LIMIT=10
# PIPELINE_CACHE_DIR=data/cache # override (e.g., data/cache_instruct)
```
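A sketch of how per-stage provider/model resolution from these variables could look (illustrative; the project's real config loader may differ):

```js
// Resolve provider and model for one stage from the environment.
function stageConfig(stage) {
  const env = (name, fallback) => process.env[name] ?? fallback;
  return {
    provider: env(`${stage}_PROVIDER`, "ollama"), // local-first default
    model: env(`${stage}_MODEL`),
  };
}

// e.g. { provider: "ollama", model: "qwen3-vl:8b-thinking" }
console.log(stageConfig("GENERATOR"));
```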
Key scripts
- `npm run pipeline` — main pipeline CLI (`--limit`, `--out`, `--chunk-limit`, `--verbose`).
- `scripts/run_thinking_continuous.sh` — loop thinking pipeline with random walk.
- `scripts/run_instruct_continuous.sh` — loop instruct pipeline (needs `INSTRUCT_GENERATOR_MODEL`).
- `scripts/try_generator_prompt.sh` — send generator prompt with cached chunk/question (`--random`; `-r` for reasoning).
- `scripts/cache_report.mjs` — cache stats; set `CACHE_REPORT_MODE=thinking|instruct|both` or `PIPELINE_CACHE_DIR=...`.
- `scripts/bench_pipeline.mjs` — mock-provider throughput benchmark (question-first); use `--limit`, `--chunk-limit`, `--random-walk`, `--cache-dir`.
- `scripts/live_bench.mjs` — live HUD (readline) showing throughput/accept rate/status; defaults to mock providers; `--real` to use real providers.
Outputs
- Gold JSONL default: `gold/pipeline_gold.jsonl` (instruct default: `gold/pipeline_gold_instruct.jsonl`).
- Sample gold: `samples/pipeline_gold_sample.jsonl`.
- Cache defaults: `data/cache` (thinking) and `data/cache_instruct` (instruct); both gitignored.
Hugging Face / GitHub distribution
- License: Apache-2.0 (`LICENSE`).
- CI: `.github/workflows/ci.yml` runs `npm test` on push/PR.
- Push to GitHub:
  ```sh
  git remote add origin https://github.com/elspru/distill-pipeline
  git push origin main
  ```
- Push to Hugging Face (user: htaf):
  ```sh
  git lfs install
  git remote add hf https://huggingface.co/htaf/distill-pipeline
  git push origin main
  git push hf main
  ```
- Publish code + prompts + `samples/pipeline_gold_sample.jsonl`. Keep caches/gold outputs out (gitignored).
Project structure
```
prompts/   # stage prompts
src/       # pipeline, providers, stages
tests/     # Vitest
data/      # rag chunks (jsonl), cache (ignored)
gold/      # outputs (ignored)
scripts/   # tooling + runners
samples/pipeline_gold_sample.jsonl
```
Testing
```sh
npm test
```
License
Apache-2.0
Join the movement: Use distill-pipeline to create and share datasets that elevate AI for everyone.