distill-pipeline — modular synthetic data engine (thinking + instruct)

distill-pipeline is a modular Node.js synthetic data pipeline that reads JSONL inputs, runs generation + verification + reward, and writes JSONL outputs. It supports both:

  • “Thinking” generators that produce visible reasoning, and
  • “Instruct” generators that produce direct answers,

with separate caches and outputs so you can compare styles without mixing artefacts.

Rather than owning retrieval, distill-pipeline is designed as the middle layer in a stack: you feed it JSONL chunks or questions (for example, from distill-rag or your own tooling), and it orchestrates the LLM stages to produce clean, reusable synthetic data.
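
A minimal sketch of the JSONL shapes involved (field names here are illustrative, not the pipeline's actual schema). One chunk per input line, one accepted sample per output line:

  input  (data/rag_chunks.jsonl):    {"id": "doc-001#3", "text": "Photosynthesis converts light energy into chemical energy...", "source": "notes.md"}
  output (gold/pipeline_gold.jsonl): {"question": "What does photosynthesis convert light energy into?", "answer": "Chemical energy, stored as glucose.", "reward": 0.92}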


What it does

  • JSONL-first pipeline

    • Reads JSONL chunks (default data/rag_chunks.jsonl) or static question seeds (test_samples/seed_questions.jsonl).
    • Writes accepted samples as JSONL into gold/*.jsonl.
  • Two pipeline modes

    • Thinking mode: question → reasoning-style answer → verification → reward.
    • Instruct mode: instruction → direct answer pairs, for fine-tuning assistants.
    • Each mode has its own cache + output paths so you can run them independently.
  • Retrieval-agnostic, RAG-friendly

    • Works with plain JSONL; any RAG stack or pre-processing step that can emit JSONL chunks can plug in (see the jq sketch after this list).
    • Optional “question-first” mode uses context chunks (or Elasticsearch) to generate questions from your corpus.
  • Stage-based and cache-heavy

    • Questions, generations, verifications, and rewards are cached on disk (JSONL).
    • You can change prompts or models and reuse existing work instead of re-running everything.
  • Local-first providers

    • Built to run locally with Ollama as the default provider for all stages.
    • Also supports OpenAI/HTTP-style providers, plus mock providers for tests/benchmarks.
  • Monitoring and benchmarks

    • Live HUD (scripts/live_bench.mjs) for real-time throughput/accept-rate monitoring.
    • Benchmark script (scripts/bench_pipeline.mjs) to measure pipeline speed without burning GPU on real models.
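
To plug an arbitrary corpus into that JSONL input, even a jq one-liner will do; a hedged sketch (the id/text/source field names are assumptions, match them to whatever your chunk loader expects):

# one JSON object per text file, written to the default chunks path
mkdir -p data
for f in corpus/*.txt; do
  jq -cRs --arg src "$f" '{id: $src, text: ., source: $src}' "$f"
done > data/rag_chunks.jsonl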

Quickstart

git clone https://github.com/elspru/distill-pipeline && cd distill-pipeline
npm install
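
# The default stages talk to a local Ollama instance, so pull the models first.
# Tags mirror the .env.example defaults; swap in whatever tags your Ollama library actually has.
ollama pull qwen3-vl:8b-thinking
ollama pull tensortemplar/patronus-lynx:8b-instruct-q4_K_M
ollama pull qwen2.5-7b-instruct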

# Generate reasoning data for training complex problem-solving models
PIPELINE_SEED_MODE=question-first PIPELINE_RANDOM_WALK=1 npm run pipeline -- --limit 10 --verbose

# Create instruction data for fine-tuning helpful assistants
INSTRUCT_PIPELINE=1 INSTRUCT_GENERATOR_MODEL=phi-4-instruct npm run pipeline -- --out gold/instruct_gold.jsonl --verbose

For ongoing generation to build larger datasets:

scripts/run_thinking_continuous.sh
INSTRUCT_GENERATOR_MODEL=phi-4-instruct scripts/run_instruct_continuous.sh
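
To keep a long run going in the background (plain shell, nothing pipeline-specific):

nohup scripts/run_thinking_continuous.sh > thinking_run.log 2>&1 &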

Configuration (see .env.example)

# Retrieval
ES_NODE=http://localhost:9200
ES_INDEX=quo_distill_index
EMBED_URL=http://localhost:11434/api/embeddings
EMBED_MODEL=mxbai-embed-large
# Providers per stage
GENERATOR_PROVIDER=ollama
VERIFIER_PROVIDER=ollama
REWARD_PROVIDER=ollama
QUESTION_PROVIDER=ollama
# Models
GENERATOR_MODEL=qwen3-vl:8b-thinking
VERIFIER_MODEL=tensortemplar/patronus-lynx:8b-instruct-q4_K_M
REWARD_MODEL=tensortemplar/patronus-lynx:8b-instruct-q4_K_M
QUESTION_MODEL=qwen2.5-7b-instruct
# Instruct-only generator
INSTRUCT_PIPELINE=0
INSTRUCT_GENERATOR_MODEL=phi-4-instruct
INSTRUCT_GENERATOR_PROVIDER=ollama
# Pipeline knobs
PIPELINE_SEED_MODE=question-first
PIPELINE_RANDOM_WALK=0 # set 1 for shuffled chunks
QUESTION_MAX_PER_CHUNK=5
# PIPELINE_CHUNK_LIMIT=10
# PIPELINE_CACHE_DIR=data/cache # override (e.g., data/cache_instruct)
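
A quick smoke-test run that combines these knobs (values are just an example):

PIPELINE_CHUNK_LIMIT=2 QUESTION_MAX_PER_CHUNK=1 PIPELINE_RANDOM_WALK=1 npm run pipeline -- --limit 5 --verbose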

Key scripts

  • npm run pipeline — main pipeline CLI (--limit, --out, --chunk-limit, --verbose).
  • scripts/run_thinking_continuous.sh — loop thinking pipeline with random walk.
  • scripts/run_instruct_continuous.sh — loop instruct pipeline (needs INSTRUCT_GENERATOR_MODEL).
  • scripts/try_generator_prompt.sh — send generator prompt with cached chunk/question (--random, -r for reasoning).
  • scripts/cache_report.mjs — cache stats; set CACHE_REPORT_MODE=thinking|instruct|both or PIPELINE_CACHE_DIR=....
  • scripts/bench_pipeline.mjs — mock-provider throughput benchmark (question-first); use --limit, --chunk-limit, --random-walk, --cache-dir.
  • scripts/live_bench.mjs — live HUD (readline) showing throughput/accept rate/status; defaults to mock providers; --real to use real providers.
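
Typical invocations (assuming the .mjs scripts are run directly with node):

CACHE_REPORT_MODE=both node scripts/cache_report.mjs
node scripts/bench_pipeline.mjs --limit 50 --random-walk
node scripts/live_bench.mjs --real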

Outputs

  • Gold JSONL default: gold/pipeline_gold.jsonl (instruct default: gold/pipeline_gold_instruct.jsonl).
  • Sample gold: samples/pipeline_gold_sample.jsonl.
  • Cache defaults: data/cache (thinking) and data/cache_instruct (instruct); both gitignored.
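
To sanity-check a run (jq is optional but convenient for JSONL):

wc -l gold/pipeline_gold.jsonl
head -n 1 gold/pipeline_gold.jsonl | jq .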

Hugging Face / GitHub distribution

  • License: Apache-2.0 (LICENSE).
  • CI: .github/workflows/ci.yml runs npm test on push/PR.
  • Push to GitHub:
    git remote add origin https://github.com/elspru/distill-pipeline
    git push origin main
    
  • Push to Hugging Face (user: htaf):
    git lfs install
    git remote add hf https://huggingface.co/htaf/distill-pipeline
    git push origin main
    git push hf main
    
  • Publish code + prompts + samples/pipeline_gold_sample.jsonl. Keep caches/gold outputs out (gitignored).
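  • To double-check that caches and gold outputs stay ignored before pushing (standard git):
    git check-ignore -v data/cache gold/pipeline_gold.jsonl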

Project structure

prompts/ # stage prompts
src/ # pipeline, providers, stages
tests/ # Vitest
data/ # rag chunks (jsonl), cache (ignored)
gold/ # outputs (ignored)
scripts/ # tooling + runners
samples/pipeline_gold_sample.jsonl

Testing

npm test
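
Watch mode or a single file during development (standard Vitest usage; the file name below is hypothetical):

npx vitest watch
npx vitest run tests/pipeline.test.js   # substitute a real file under tests/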

License

Apache-2.0

Join the movement: Use distill-pipeline to create and share datasets that elevate AI for everyone.
