ruc-datalab/DataEvolver

DataEvolver: Automatic data preparation for LLMs via multi-level self-evolving pipelines



Paper · Demo · Quick Start · Usage · Results


DataEvolver overview

Turn noisy raw data + a few seed examples into training-ready, seed-aligned datasets.


Give us a ⭐ if DataEvolver helps your data-prep workflow.

DataEvolver is a seed-driven, multi-level self-evolving system for LLM training data preparation. Starting from raw data and only a handful of seed examples, it automatically understands target data characteristics, builds and repairs executable operator DAGs, trial-runs on samples, and iteratively refines the pipeline until outputs align with the seeds, then runs full preparation.

DataEvolver supports:

  • 🌱 Seed-guided understanding: distill schema, format, style, and quality constraints from seed data (not just task descriptions)
  • 🔧 Operator-level self-evolving: orchestrate DAGs, detect structural gaps, and synthesize bridging / task-specific operators when needed
  • 🔄 Pipeline-level self-evolving: trial runs, Pilot LLM judging, experience reflow, and next-round understanding + orchestration updates
  • 🖥 Three aligned interfaces: Web UI (evolution canvas), CLI, and HTTP API share the same workflow semantics
  • 📦 Fully open & deployable: git clone for the full stack, or pip install dataevolver for CLI/API; all stage artifacts are inspectable

DataEvolver framework

🖥 Demo

Upload raw data and seed examples, and DataEvolver automatically performs structured understanding, DAG orchestration, instantiation, trial evaluation, and experience reflow across rounds.

DataEvolver evolution canvas demo

▶ Watch full demo (2m50s) · Download HD .mov

Tip

Clone this repository and configure an OpenAI-compatible LLM API to run DataEvolver locally as your data-prep agent; no closed-source workflow engine is required. The evolution canvas lets you inspect every stage artifact, DAG retry, and trial score.

✨ Highlights

  • Executable + seed-aligned: jointly optimizes two questions, "Can the pipeline run end-to-end?" and "Does the output match the seed supervision style?"
  • DAG-native orchestration: three-stage LLM planning with an automatic structural check + LLM assessment after each orchestration
  • Evolvable operator registry: auto-evolve operators on demand; manually add custom operators via CLI/API and re-orchestrate
  • Observable workflow: per-step artifacts, orchestration retry tabs, token ledger, and round/iteration archives
  • Automation with control: advance-all for hands-off runs, or step-by-step understand / orchestrate / trial with rerun and --force
  • Cross-platform: Linux / macOS / Windows setup scripts; PyPI package for portable CLI workspaces

🚀 Quick Start

Full deployment guide: docs/INSTALL.md

Requirements

| Component | Version |
| --- | --- |
| Python | 3.10+ |
| Node.js | 18+ LTS (Web UI only) |
| LLM API | OpenAI-compatible endpoint + key |

1. Install

From source (Web UI + CLI + API)

git clone https://github.com/ruc-datalab/DataEvolver.git
cd DataEvolver
python scripts/setup_env.py    # or: bash setup_env.sh / setup_env.ps1

From PyPI (CLI + API only)

pip install dataevolver
mkdir my_project && cd my_project
dataevolver init
dataevolver --help

2. Configure LLM

config/api_config.json   # provider, base URL, model
config/api_keys.json     # API key (gitignored; do not commit)
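
Before starting the services, it can help to confirm that both configuration files exist and contain valid JSON. The sketch below is only a sanity check under stated assumptions: the two file names come from the listing above, while the helper name `check_llm_config` is hypothetical and no key names are assumed, since the expected schema is described in docs/INSTALL.md rather than here.

```python
import json
from pathlib import Path

# File names are taken from the README; nothing about their schema is assumed.
CONFIG_FILES = ("api_config.json", "api_keys.json")


def check_llm_config(config_dir):
    """Return a list of problems; an empty list means both files exist and parse."""
    problems = []
    for name in CONFIG_FILES:
        path = Path(config_dir) / name
        if not path.is_file():
            problems.append(f"{name}: missing")
            continue
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"{name}: invalid JSON ({exc.msg})")
    return problems


if __name__ == "__main__":
    # Run from the project root, where the config/ directory lives.
    for problem in check_llm_config("config"):
        print(problem)
```

The check deliberately stops at "is this parseable JSON"; whether the provider, base URL, and model values are correct can only be confirmed by running a pipeline step.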

3. Start services

| Service | Command | URL |
| --- | --- | --- |
| Backend | python scripts/dev.py backend | http://127.0.0.1:8000 |
| Frontend | python scripts/dev.py frontend | http://127.0.0.1:5173 |
| API docs | n/a | http://127.0.0.1:8000/docs |

4. Run your first pipeline

dataevolver session-start my_pipeline \
  --raw tmp/samples/finance_raw.jsonl \
  --seed tmp/samples/finance_seed.jsonl \
  --description tmp/samples/finance_description.txt

dataevolver workflow advance-all my_pipeline --max-steps 32
dataevolver state my_pipeline

Open http://127.0.0.1:5173 for the evolution canvas, or continue in the terminal with dataevolver advance my_pipeline.
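
For scripted or CI runs, the same three quick-start commands can be driven from Python. This is a minimal sketch, not part of DataEvolver: it uses only the subcommands and flags shown above, and the helper names (`session_start_cmd`, `advance_all_cmd`, `run_first_pipeline`) are hypothetical. It defaults to a dry run that prints the commands without executing them.

```python
import shlex
import subprocess


def session_start_cmd(pipeline, raw, seed, description):
    """Build the documented `dataevolver session-start` call as an argv list."""
    return [
        "dataevolver", "session-start", pipeline,
        "--raw", raw,
        "--seed", seed,
        "--description", description,
    ]


def advance_all_cmd(pipeline, max_steps=32):
    """Build the documented `dataevolver workflow advance-all` call."""
    return ["dataevolver", "workflow", "advance-all", pipeline,
            "--max-steps", str(max_steps)]


def run_first_pipeline(pipeline, raw, seed, description, dry_run=True):
    """Print the quick-start commands and, unless dry_run is set, execute them."""
    commands = [
        session_start_cmd(pipeline, raw, seed, description),
        advance_all_cmd(pipeline),
        ["dataevolver", "state", pipeline],
    ]
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            # check=True raises CalledProcessError at the first failing command.
            subprocess.run(argv, check=True)
    return commands
```

Passing argv lists instead of a shell string avoids quoting problems when file paths contain spaces.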

Usage

DataEvolver exposes the same workflow through three interfaces.

| Interface | Best for | Entry |
| --- | --- | --- |
| Web UI | Visual DAG, trial scores, evolution history | http://127.0.0.1:5173 |
| CLI | Reproducible runs, scripting, CI | dataevolver --help · dataevolver wf --help |
| HTTP API | Integration & automation | http://127.0.0.1:8000/docs |

Web UI

  1. Create or select a pipeline session
  2. Upload raw data, seed data, and optional task description
  3. Advance step-by-step or run continuously
  4. Inspect DAG tabs, instantiation code, trial scores, and experience
  5. Trigger full run only after quality gates pass

CLI

dataevolver state my_pipeline
dataevolver advance my_pipeline
dataevolver workflow advance-all my_pipeline --max-steps 32
dataevolver lang en    # CLI output: zh / en

| Stage | Command |
| --- | --- |
| Understanding | dataevolver understand my_pipeline |
| Orchestration | dataevolver orchestrate my_pipeline |
| Operator evolution | dataevolver evolve-operators my_pipeline |
| Instantiation | dataevolver instantiate my_pipeline |
| Trial run | dataevolver trial my_pipeline |
| Quality check | dataevolver quality-check my_pipeline |
| Experience | dataevolver experience my_pipeline |
| Full run | dataevolver run my_pipeline |

dataevolver rerun my_pipeline orchestration
dataevolver tokens my_pipeline
dataevolver op add my_op -p my_pipeline -d "Clean records" -c structure

HTTP API

| Endpoint | Purpose |
| --- | --- |
| POST /api/sessions/start | Create session & register manifest |
| GET /api/workflow/{pipeline_id}/state | Read workflow state |
| POST /api/workflow/{pipeline_id}/advance | Advance one step |
| POST /api/pipeline/{pipeline_id}/run-full | Full dataset execution |
| GET /api/operators/?pipeline_id= | List merged operator pool |

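
The two workflow endpoints can be called with the Python standard library alone. The following is a hedged sketch rather than an official client: the paths and the backend address come from the tables above, but it assumes that responses are JSON and that the advance endpoint accepts an empty JSON object as its body. The request and response schemas are not documented here, so check http://127.0.0.1:8000/docs before relying on it. All function names are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "http://127.0.0.1:8000"  # backend address from the Quick Start table


def state_url(pipeline_id, base_url=BASE_URL):
    """URL for GET /api/workflow/{pipeline_id}/state."""
    return f"{base_url}/api/workflow/{quote(pipeline_id, safe='')}/state"


def advance_url(pipeline_id, base_url=BASE_URL):
    """URL for POST /api/workflow/{pipeline_id}/advance."""
    return f"{base_url}/api/workflow/{quote(pipeline_id, safe='')}/advance"


def get_state(pipeline_id, base_url=BASE_URL, timeout=30):
    """Read the workflow state. ASSUMPTION: the response body is JSON."""
    with urllib.request.urlopen(state_url(pipeline_id, base_url), timeout=timeout) as resp:
        return json.load(resp)


def advance(pipeline_id, base_url=BASE_URL, timeout=300):
    """Advance one step. ASSUMPTION: an empty JSON object is an accepted body."""
    req = urllib.request.Request(
        advance_url(pipeline_id, base_url),
        data=json.dumps({}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

With the backend running, `get_state("my_pipeline")` followed by `advance("my_pipeline")` mirrors `dataevolver state` and `dataevolver advance` from the CLI section.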
Operator pool, project layout, configuration & FAQ

Operator pool (add custom operators, then re-orchestrate):

dataevolver op list -p my_pipeline
dataevolver op add -p my_pipeline --from-file examples/operator_template.json
dataevolver workflow orchestrate my_pipeline

Project layout

DataEvolver/
├── subsystems/     # understanding, orchestration, workflow, …
├── web/            # FastAPI
├── frontend/       # React evolution canvas
├── cli/            # Typer CLI (dataevolver)
├── config/         # api_config, api_keys, operator registry
└── data/           # runtime artifacts (gitignored)

Tips: use rerun / --force to regenerate stages · dataevolver tokens for LLM usage · artifacts under data/artifact_history/

FAQ

  • Instantiation finishes instantly? Artifacts may be reused (skipped); built-in operators use templates.
  • Experience finishes quickly? It uses rule-based aggregation for deterministic reflow, not LLM rewriting.
  • Multiple orchestration tabs? Each tab is a distinct attempt (failed validation → repaired DAG).

📊 Results

Empirical evaluation from our paper: DataEvolver improves downstream training data quality and model performance across diverse task types.

  • ~12% avg relative gain vs. weaker prep settings
  • 7 benchmarks: instruction · MC-QA · math · SQL
  • ~40% lower amortized token cost on average

Main experiment results

Downstream performance across 7 benchmarks from 4 task categories.

Comparison against baselines

Compared with vanilla SFT on raw data and strong data-prep baselines, fewer, better-prepared samples can match larger weakly-prepared sets.

Ablation, case study & how it works

Ablation study

  • Without operator-level evolution → less executable, less coherent pipelines
  • Without pipeline-level evolution → less seed-aligned outputs

Case study

Workflow loop

understanding → orchestration → operator_evolution → instantiation
             → trial_run → quality_check → experience → (refine or full run)

| Layer | What happens |
| --- | --- |
| Understanding | Learn target profile from seeds + raw samples |
| Operator evolution | Repair DAG; synthesize missing operators |
| Pipeline evolution | Trial feedback → experience → next-round refinement |

Compared to predefined recipes (stable but rigid) or one-shot pipeline synthesis (flexible but fragile), DataEvolver treats data preparation as an iterative, self-improving closed loop.
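
The closed loop described above can be pictured as a small control loop. The sketch below is illustrative only and is not DataEvolver's implementation: the stage names are taken from the workflow diagram, while `evolve`, the `run_stage` and `passes_quality` callbacks, and the `max_rounds` limit are hypothetical stand-ins for the real subsystems.

```python
# Stage names follow the workflow diagram; everything else is hypothetical.
STAGES = [
    "understanding", "orchestration", "operator_evolution", "instantiation",
    "trial_run", "quality_check", "experience",
]


def evolve(run_stage, passes_quality, max_rounds=3):
    """Repeat the stage sequence until the quality gate passes or rounds run out.

    run_stage(stage, round_no, experience) -> result   (hypothetical callback)
    passes_quality(quality_result) -> bool             (hypothetical callback)
    """
    experience = []  # feedback from earlier rounds, passed to every later stage
    for round_no in range(1, max_rounds + 1):
        results = {}
        for stage in STAGES:
            results[stage] = run_stage(stage, round_no, experience)
        if passes_quality(results["quality_check"]):
            # Gate passed: the pipeline is ready for the full dataset.
            return {"decision": "full_run", "rounds": round_no}
        # Gate failed: keep this round's experience for the next refinement.
        experience.append(results["experience"])
    return {"decision": "stopped", "rounds": max_rounds}
```

The point of the sketch is the feedback path: each failed round contributes experience that the next round's understanding and orchestration can use, which is what distinguishes the loop from one-shot pipeline synthesis.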

🤝 Community

| Channel | Link |
| --- | --- |
| Issues & features | GitHub Issues |
| Questions | GitHub Discussions |

Contributions welcome: operators, dataset adapters, UI polish, docs, and benchmark scripts.

📖 Citation

If you use DataEvolver in research, please cite our paper:

@misc{deng2026dataevolverautomaticdatapreparation,
      title={DataEvolver: Automatic Data Preparation for Large Language Models through Multi-Level Self-Evolving}, 
      author={Chao Deng and Shaolei Zhang and Ju Fan and Xiaoyong Du},
      year={2026},
      eprint={2606.07001},
      archivePrefix={arXiv},
      url={https://arxiv.org/abs/2606.07001}, 
}

📄 Paper: arXiv:2606.07001

Built for teams who want executable and seed-aligned data pipelines, not one-shot prompts.
