distill-pipeline — modular synthetic data engine (thinking + instruct)
distill-pipeline is a modular Node.js synthetic data pipeline that reads JSONL inputs, runs generation + verification + reward, and writes JSONL outputs. It supports both:
- “Thinking” generators that produce visible reasoning, and
- “Instruct” generators that produce direct answers,
with separate caches and outputs so you can compare styles without mixing artefacts.
Rather than owning retrieval, distill-pipeline is designed as the middle layer in a stack: you feed it JSONL chunks or questions (for example, from distill-rag or your own tooling), and it orchestrates the LLM stages to produce clean, reusable synthetic data.
What it does
JSONL-first pipeline
- Reads JSONL chunks (default `data/rag_chunks.jsonl`) or static question seeds (`test_samples/seed_questions.jsonl`); see the reading sketch after this list.
- Writes accepted samples as JSONL into `gold/*.jsonl`.
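Each line of these files is a standalone JSON object. A minimal reading sketch in Node ESM, assuming illustrative field names like `id` and `text` (the pipeline's actual chunk schema may differ):

```js
// Stream chunks from a JSONL file, one JSON object per line.
// Field names (id, text) are assumptions for illustration only.
import { createReadStream } from "node:fs";
import { createInterface } from "node:readline";

const rl = createInterface({ input: createReadStream("data/rag_chunks.jsonl") });
for await (const line of rl) {
  if (!line.trim()) continue; // skip blank lines
  const chunk = JSON.parse(line); // e.g. {"id": "doc-1#0", "text": "..."}
  console.log(chunk.id, chunk.text?.slice(0, 60));
}
```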
Two pipeline modes
- Thinking mode: question → reasoning-style answer → verification → reward.
- Instruct mode: instruction → direct answer pairs, for fine-tuning assistants.
- Each mode has its own cache + output paths so you can run them independently.
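Conceptually the two modes yield differently shaped samples; the hypothetical records below are for orientation only (the real gold schema may use different field names):

```js
// Hypothetical sample shapes; not the pipeline's actual gold schema.
const thinkingSample = {
  question: "Why cache per model?",
  reasoning: "If the model changes, cached generations go stale, so ...", // visible reasoning
  answer: "So outputs can be invalidated when the model changes.",
};
const instructSample = {
  instruction: "Explain JSONL in one sentence.",
  response: "JSONL stores one JSON object per line of a text file.",
};
```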
Retrieval-agnostic, RAG-friendly
- Works with plain JSONL; any RAG stack or pre-processing step that can emit JSONL chunks can plug in.
- Optional “question-first” mode uses context chunks (or Elasticsearch) to generate questions from your corpus.
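In practice, any tool that can serialize one JSON object per line can act as the retrieval layer. A sketch of emitting chunks from your own corpus (fields again illustrative):

```js
// Emit chunks as JSONL from any source; field names are illustrative.
import { writeFileSync } from "node:fs";

const docs = [
  { id: "intro#0", text: "distill-pipeline reads JSONL chunks and writes JSONL gold samples." },
];
const jsonl = docs.map((d) => JSON.stringify(d)).join("\n") + "\n";
writeFileSync("data/rag_chunks.jsonl", jsonl);
```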
Stage-based and cache-heavy
- Questions, generations, verifications, and rewards are cached on disk (JSONL).
- You can change prompts or models and reuse existing work instead of re-running everything.
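One plausible shape for such a cache (a sketch, not the pipeline's actual layout) keys each stage result by a hash of stage name, model, prompt, and input, so changing any of them invalidates only the affected entries:

```js
// Content-addressed JSONL stage cache; layout and names are assumptions.
import { createHash } from "node:crypto";
import { existsSync, readFileSync, appendFileSync } from "node:fs";

function cacheKey(stage, model, prompt, input) {
  return createHash("sha256").update([stage, model, prompt, input].join("\x00")).digest("hex");
}

function lookup(file, key) {
  if (!existsSync(file)) return undefined;
  for (const line of readFileSync(file, "utf8").split("\n")) {
    if (!line) continue;
    const entry = JSON.parse(line); // one {key, value} record per line
    if (entry.key === key) return entry.value;
  }
}

function store(file, key, value) {
  appendFileSync(file, JSON.stringify({ key, value }) + "\n");
}
```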
Local-first providers
- Built to run locally with Ollama as the default provider for all stages.
- Also supports OpenAI/HTTP-style providers, plus mock providers for tests/benchmarks.
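A minimal non-streaming call against Ollama's public `/api/generate` endpoint might look like this (a sketch; assumes Ollama on its default port and Node 18+ for global `fetch`, with error handling trimmed):

```js
// Call a local Ollama model and return the completed text.
async function ollamaGenerate(model, prompt) {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, prompt, stream: false }),
  });
  if (!res.ok) throw new Error(`ollama: HTTP ${res.status}`);
  const data = await res.json();
  return data.response; // full completion when stream is false
}
```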
Monitoring and benchmarks
- Live HUD (`scripts/live_bench.mjs`) for real-time throughput/accept-rate monitoring; a metric sketch follows below.
- Benchmark script (`scripts/bench_pipeline.mjs`) to measure pipeline speed without burning GPU on real models.
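The HUD's numbers boil down to a pair of rolling counters; an illustrative sketch of the throughput and accept-rate math (not the actual internals of `scripts/live_bench.mjs`):

```js
// Rolling throughput/accept-rate counters; illustrative only.
const stats = { started: Date.now(), done: 0, accepted: 0 };

function record(sampleAccepted) {
  stats.done += 1;
  if (sampleAccepted) stats.accepted += 1;
}

function snapshot() {
  const elapsedS = (Date.now() - stats.started) / 1000;
  return {
    throughput: `${(stats.done / elapsedS).toFixed(2)} samples/s`,
    acceptRate: stats.done ? `${((100 * stats.accepted) / stats.done).toFixed(1)}%` : "n/a",
  };
}
```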
Quickstart
```sh
git clone https://github.com/elspru/distill-pipeline && cd distill-pipeline
npm install

# Generate reasoning data for training models on complex problem-solving
PIPELINE_SEED_MODE=question-first PIPELINE_RANDOM_WALK=1 npm run pipeline -- --limit 10 --verbose

# Create instruction data for fine-tuning helpful assistants
INSTRUCT_PIPELINE=1 INSTRUCT_GENERATOR_MODEL=phi-4-instruct npm run pipeline -- --out gold/instruct_gold.jsonl --verbose
```
For ongoing generation to build larger datasets:
```sh
scripts/run_thinking_continuous.sh
INSTRUCT_GENERATOR_MODEL=phi-4-instruct scripts/run_instruct_continuous.sh
```
Configuration (see `.env.example`)
```sh
# Retrieval
ES_NODE=http://localhost:9200
ES_INDEX=quo_distill_index
EMBED_URL=http://localhost:11434/api/embeddings
EMBED_MODEL=mxbai-embed-large

# Providers per stage
GENERATOR_PROVIDER=ollama
VERIFIER_PROVIDER=ollama
REWARD_PROVIDER=ollama
QUESTION_PROVIDER=ollama

# Models
GENERATOR_MODEL=qwen3-vl:8b-thinking
VERIFIER_MODEL=tensortemplar/patronus-lynx:8b-instruct-q4_K_M
REWARD_MODEL=tensortemplar/patronus-lynx:8b-instruct-q4_K_M
QUESTION_MODEL=qwen2.5-7b-instruct

# Instruct-only generator
INSTRUCT_PIPELINE=0
INSTRUCT_GENERATOR_MODEL=phi-4-instruct
INSTRUCT_GENERATOR_PROVIDER=ollama

# Pipeline knobs
PIPELINE_SEED_MODE=question-first
PIPELINE_RANDOM_WALK=0 # set 1 for shuffled chunks
QUESTION_MAX_PER_CHUNK=5
# PIPELINE_CHUNK_LIMIT=10
# PIPELINE_CACHE_DIR=data/cache # override (e.g., data/cache_instruct)
```
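A sketch of how per-stage provider/model resolution from these variables could look (illustrative; the project's real config loader may differ):

```js
// Resolve provider and model for one stage from the environment.
function stageConfig(stage) {
  const env = (name, fallback) => process.env[name] ?? fallback;
  return {
    provider: env(`${stage}_PROVIDER`, "ollama"), // local-first default
    model: env(`${stage}_MODEL`),
  };
}

// e.g. { provider: "ollama", model: "qwen3-vl:8b-thinking" }
console.log(stageConfig("GENERATOR"));
```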
Key scripts
- `npm run pipeline` — main pipeline CLI (`--limit`, `--out`, `--chunk-limit`, `--verbose`).
- `scripts/run_thinking_continuous.sh` — loop thinking pipeline with random walk.
- `scripts/run_instruct_continuous.sh` — loop instruct pipeline (needs `INSTRUCT_GENERATOR_MODEL`).
- `scripts/try_generator_prompt.sh` — send generator prompt with cached chunk/question (`--random`; `-r` for reasoning).
- `scripts/cache_report.mjs` — cache stats; set `CACHE_REPORT_MODE=thinking|instruct|both` or `PIPELINE_CACHE_DIR=...`.
- `scripts/bench_pipeline.mjs` — mock-provider throughput benchmark (question-first); use `--limit`, `--chunk-limit`, `--random-walk`, `--cache-dir`.
- `scripts/live_bench.mjs` — live HUD (readline) showing throughput/accept rate/status; defaults to mock providers; `--real` to use real providers.
Outputs
- Gold JSONL default: `gold/pipeline_gold.jsonl` (instruct default: `gold/pipeline_gold_instruct.jsonl`).
- Sample gold: `samples/pipeline_gold_sample.jsonl`.
- Cache defaults: `data/cache` (thinking) and `data/cache_instruct` (instruct); both gitignored.
Hugging Face / GitHub distribution
- License: Apache-2.0 (`LICENSE`).
- CI: `.github/workflows/ci.yml` runs `npm test` on push/PR.
- Push to GitHub:
  ```sh
  git remote add origin https://github.com/elspru/distill-pipeline
  git push origin main
  ```
- Push to Hugging Face (user: htaf):
  ```sh
  git lfs install
  git remote add hf https://huggingface.co/htaf/distill-pipeline
  git push origin main
  git push hf main
  ```
- Publish code + prompts + `samples/pipeline_gold_sample.jsonl`. Keep caches/gold outputs out (gitignored).
Project structure
```
prompts/   # stage prompts
src/       # pipeline, providers, stages
tests/     # Vitest
data/      # rag chunks (jsonl), cache (ignored)
gold/      # outputs (ignored)
scripts/   # tooling + runners
samples/pipeline_gold_sample.jsonl
```
Testing
```sh
npm test
```
License
Apache-2.0
Join the movement: Use distill-pipeline to create and share datasets that elevate AI for everyone.