Paper · Demo · Quick Start · Usage · Results
Turn noisy raw data + a few seed examples into training-ready, seed-aligned datasets.
Give us a ⭐ if DataEvolver helps your data-prep workflow.
DataEvolver is a seed-driven, multi-level self-evolving system for LLM training data preparation. Starting from raw data and only a handful of seed examples, it automatically understands target data characteristics, builds and repairs executable operator DAGs, trial-runs on samples, and iteratively refines the pipeline until outputs align with seeds, then runs full preparation.
DataEvolver supports:
- Seed-guided understanding: distill schema, format, style, and quality constraints from seed data (not just task descriptions)
- Operator-level self-evolving: orchestrate DAGs, detect structural gaps, and synthesize bridging / task-specific operators when needed
- Pipeline-level self-evolving: trial runs, Pilot LLM judging, experience reflow, and next-round understanding + orchestration updates
- Three aligned interfaces: Web UI (evolution canvas), CLI, and HTTP API share the same workflow semantics
- Fully open & deployable: git clone for the full stack, or pip install dataevolver for CLI/API; all stage artifacts are inspectable
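As a rough illustration of what distilling a schema from seed data can mean, the sketch below summarizes field presence and value types across JSONL seed records. It is not DataEvolver's implementation: the infer_schema helper and the example field names are hypothetical, and it assumes every non-blank line is a JSON object.

```python
import json
from collections import defaultdict


def infer_schema(seed_lines: list[str]) -> dict[str, dict]:
    """Summarize field presence and value types across JSONL seed records.

    Hypothetical helper: assumes each non-blank line is a JSON object.
    """
    types: dict[str, set[str]] = defaultdict(set)
    counts: dict[str, int] = defaultdict(int)
    total = 0
    for line in seed_lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        total += 1
        for key, value in record.items():
            types[key].add(type(value).__name__)
            counts[key] += 1
    # A field is "required" only if every seed record contains it.
    return {
        key: {"types": sorted(types[key]), "required": counts[key] == total}
        for key in sorted(types)
    }
```

A summary like this covers only structure; style and quality constraints need richer analysis than type counting.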
Upload raw data and seed examples; DataEvolver then automatically performs structured understanding, DAG orchestration, instantiation, trial evaluation, and experience reflow across rounds.
▶ Watch full demo (2m50s) · Download HD .mov
Tip
Clone this repository and configure an OpenAI-compatible LLM API to run DataEvolver locally as your data-prep agent; no closed-source workflow engine is required. The evolution canvas lets you inspect every stage artifact, DAG retry, and trial score.
- Executable + seed-aligned: jointly optimizes whether the pipeline runs end-to-end and whether the output matches the seed supervision style
- DAG-native orchestration: three-stage LLM planning with an automatic structural check and LLM assessment after each orchestration
- Evolvable operator registry: auto-evolve operators on demand; manually add custom operators via CLI/API and re-orchestrate
- Observable workflow: per-step artifacts, orchestration retry tabs, token ledger, and round/iteration archives
- Automation with control: advance-all for hands-off runs, or step-by-step understand / orchestrate / trial with rerun and --force
- Cross-platform: Linux / macOS / Windows setup scripts; PyPI package for portable CLI workspaces
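To make the idea of a structural check on an operator DAG concrete, here is a minimal sketch using Python's standard graphlib module. It assumes a DAG is represented as a mapping from each operator name to the operators it depends on. The check_dag helper and the operator names are hypothetical, and DataEvolver's actual checks may cover more than this.

```python
from graphlib import CycleError, TopologicalSorter


def check_dag(deps: dict[str, set[str]]) -> tuple[list[str], list[str]]:
    """Return (execution_order, problems) for a dependency mapping.

    Hypothetical helper: deps maps each operator name to the set of
    operators it depends on.
    """
    problems: list[str] = []

    # Every dependency must itself be a declared operator.
    for op, upstream in deps.items():
        for dep in sorted(upstream):
            if dep not in deps:
                problems.append(f"{op} depends on undefined operator {dep}")

    # A cycle means no valid execution order exists.
    try:
        order = list(TopologicalSorter(deps).static_order())
    except CycleError as exc:
        problems.append(f"cycle detected: {exc.args[1]}")
        order = []

    return order, problems
```

For example, check_dag({"load": set(), "dedupe": {"load"}, "format": {"dedupe"}}) yields the order load, dedupe, format with no problems, while a mapping in which two operators depend on each other yields an empty order and one reported cycle.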
Full deployment guide: docs/INSTALL.md
| Component | Version |
|---|---|
| Python | 3.10+ |
| Node.js | 18+ LTS (Web UI only) |
| LLM API | OpenAI-compatible endpoint + key |
From source (Web UI + CLI + API)
git clone https://github.com/ruc-datalab/DataEvolver.git
cd DataEvolver
python scripts/setup_env.py # or: bash setup_env.sh / setup_env.ps1
From PyPI (CLI + API only)
pip install dataevolver
mkdir my_project && cd my_project
dataevolver init
dataevolver --help
config/api_config.json # provider, base URL, model
config/api_keys.json # API key (gitignored; do not commit)
| Service | Command | URL |
|---|---|---|
| Backend | python scripts/dev.py backend | http://127.0.0.1:8000 |
| Frontend | python scripts/dev.py frontend | http://127.0.0.1:5173 |
| API docs | n/a | http://127.0.0.1:8000/docs |
dataevolver session-start my_pipeline \
--raw tmp/samples/finance_raw.jsonl \
--seed tmp/samples/finance_seed.jsonl \
--description tmp/samples/finance_description.txt
dataevolver workflow advance-all my_pipeline --max-steps 32
dataevolver state my_pipeline
Open http://127.0.0.1:5173 for the evolution canvas, or continue in the terminal with dataevolver advance my_pipeline.
DataEvolver exposes the same workflow through three interfaces.
| Interface | Best for | Entry |
|---|---|---|
| Web UI | Visual DAG, trial scores, evolution history | http://127.0.0.1:5173 |
| CLI | Reproducible runs, scripting, CI | dataevolver --help · dataevolver wf --help |
| HTTP API | Integration & automation | http://127.0.0.1:8000/docs |
- Create or select a pipeline session
- Upload raw data, seed data, and optional task description
- Advance step-by-step or run continuously
- Inspect DAG tabs, instantiation code, trial scores, and experience
- Trigger full run only after quality gates pass
dataevolver state my_pipeline
dataevolver advance my_pipeline
dataevolver workflow advance-all my_pipeline --max-steps 32
dataevolver lang en # CLI output: zh / en

| Stage | Command |
|---|---|
| Understanding | dataevolver understand my_pipeline |
| Orchestration | dataevolver orchestrate my_pipeline |
| Operator evolution | dataevolver evolve-operators my_pipeline |
| Instantiation | dataevolver instantiate my_pipeline |
| Trial run | dataevolver trial my_pipeline |
| Quality check | dataevolver quality-check my_pipeline |
| Experience | dataevolver experience my_pipeline |
| Full run | dataevolver run my_pipeline |
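For scripting or CI, the stage commands in the table above can be driven from Python. The sketch below assumes the dataevolver CLI is on PATH and signals failure with a non-zero exit code (an assumption, not something this README states). The stage_command and run_stages helpers are hypothetical. The full run is left out on purpose, since it should only be triggered after the quality gates pass.

```python
import subprocess

# Stage subcommands copied, in order, from the table above.
# The final "run" (full run) stage is intentionally excluded.
STAGES = [
    "understand",
    "orchestrate",
    "evolve-operators",
    "instantiate",
    "trial",
    "quality-check",
    "experience",
]


def stage_command(stage: str, pipeline: str) -> list[str]:
    """Build the argv list for one stage of one pipeline."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    return ["dataevolver", stage, pipeline]


def run_stages(pipeline: str, runner=subprocess.run) -> None:
    """Run one pass over the stages, stopping at the first failure.

    Assumes the CLI exits non-zero on failure, so check=True raises
    subprocess.CalledProcessError and aborts the pass.
    """
    for stage in STAGES:
        runner(stage_command(stage, pipeline), check=True)
```

Calling run_stages("my_pipeline") performs a single pass. The runner parameter exists so the command sequence can be inspected or tested without invoking the CLI.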
dataevolver rerun my_pipeline orchestration
dataevolver tokens my_pipeline
dataevolver op add my_op -p my_pipeline -d "Clean records" -c structure

| Endpoint | Purpose |
|---|---|
| POST /api/sessions/start | Create session & register manifest |
| GET /api/workflow/{pipeline_id}/state | Read workflow state |
| POST /api/workflow/{pipeline_id}/advance | Advance one step |
| POST /api/pipeline/{pipeline_id}/run-full | Full dataset execution |
| GET /api/operators/?pipeline_id= | List merged operator pool |
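The two workflow endpoints can be called from Python's standard library, as in the hedged sketch below. It assumes the backend is running at http://127.0.0.1:8000, that responses are JSON objects, and that the advance endpoint accepts an empty JSON body. None of these request or response details are confirmed here, so check the API docs at /docs before relying on them. The helper names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

BASE_URL = "http://127.0.0.1:8000"  # backend address from the services table


def workflow_url(pipeline_id: str, action: str, base_url: str = BASE_URL) -> str:
    """Build /api/workflow/{pipeline_id}/{action} with the id URL-encoded."""
    return f"{base_url}/api/workflow/{quote(pipeline_id, safe='')}/{action}"


def get_state(pipeline_id: str) -> dict:
    """GET the workflow state. Assumes the response body is a JSON object."""
    with urlopen(workflow_url(pipeline_id, "state")) as resp:
        return json.load(resp)


def advance(pipeline_id: str) -> dict:
    """POST to advance one step. Assumes an empty JSON body is accepted."""
    req = Request(
        workflow_url(pipeline_id, "advance"),
        data=b"{}",
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(req) as resp:
        return json.load(resp)
```

With the backend running, advance("my_pipeline") followed by get_state("my_pipeline") mirrors the CLI commands dataevolver advance and dataevolver state.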
Operator pool, project layout, configuration & FAQ
Operator pool. Add custom operators, then re-orchestrate:
dataevolver op list -p my_pipeline
dataevolver op add -p my_pipeline --from-file examples/operator_template.json
dataevolver workflow orchestrate my_pipeline
Project layout
DataEvolver/
├── subsystems/ # understanding, orchestration, workflow, …
├── web/ # FastAPI
├── frontend/ # React evolution canvas
├── cli/ # Typer CLI (dataevolver)
├── config/ # api_config, api_keys, operator registry
└── data/ # runtime artifacts (gitignored)
Tips: use rerun / --force to regenerate stages · dataevolver tokens for LLM usage · artifacts under data/artifact_history/
FAQ
- Instantiation finishes instantly? Artifacts may be reused (skipped); built-in operators use templates.
- Experience finishes quickly? Rule-based aggregation for deterministic reflow, not LLM rewriting.
- Multiple orchestration tabs? Each tab is a distinct attempt (failed validation → repaired DAG).
Empirical evaluation from our paper: DataEvolver improves downstream training data quality and model performance across diverse task types.
- ~12% average relative gain vs. weaker prep settings
- 7 benchmarks: instruction · MC-QA · math · SQL
- ~40% lower amortized token cost on average
Downstream performance across 7 benchmarks from 4 task categories.
Compared against vanilla SFT on raw data and strong data-prep baselines: fewer, better-prepared samples can match larger weakly-prepared sets.
Ablation, case study & how it works
- Without operator-level evolution → less executable, less coherent pipelines
- Without pipeline-level evolution → less seed-aligned outputs
Workflow loop
understanding → orchestration → operator_evolution → instantiation
→ trial_run → quality_check → experience → (refine or full run)
| Layer | What happens |
|---|---|
| Understanding | Learn target profile from seeds + raw samples |
| Operator evolution | Repair DAG; synthesize missing operators |
| Pipeline evolution | Trial feedback → experience → next-round refinement |
Compared to predefined recipes (stable but rigid) or one-shot pipeline synthesis (flexible but fragile), DataEvolver treats data preparation as an iterative, self-improving closed loop.
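The closed loop described above can be pictured as a small control skeleton. This is an illustrative sketch only, not DataEvolver's scheduler. The stage names come from the workflow loop, while the evolve function, its callbacks, and the default of three rounds are hypothetical.

```python
from typing import Callable

# Stage names copied from the workflow loop above.
ROUND_STAGES = [
    "understanding",
    "orchestration",
    "operator_evolution",
    "instantiation",
    "trial_run",
    "quality_check",
    "experience",
]


def evolve(
    run_stage: Callable[[str, int], None],
    gate_passed: Callable[[int], bool],
    max_rounds: int = 3,  # hypothetical default
) -> tuple[bool, int]:
    """Repeat rounds until the quality gate passes or rounds run out.

    Returns (ready_for_full_run, rounds_used).
    """
    for round_no in range(1, max_rounds + 1):
        for stage in ROUND_STAGES:
            run_stage(stage, round_no)
        if gate_passed(round_no):
            return True, round_no  # proceed to the full run
    return False, max_rounds  # still not aligned; stop refining
```

The point of the design is that the full run is a consequence of passing the gate rather than a fixed last step, which is what separates this loop from one-shot pipeline synthesis.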
| Channel | Link |
|---|---|
| Issues & features | GitHub Issues |
| Questions | GitHub Discussions |
Contributions welcome: operators, dataset adapters, UI polish, docs, and benchmark scripts.
If you use DataEvolver in research, please cite our paper:
@misc{deng2026dataevolverautomaticdatapreparation,
title={DataEvolver: Automatic Data Preparation for Large Language Models through Multi-Level Self-Evolving},
author={Chao Deng and Shaolei Zhang and Ju Fan and Xiaoyong Du},
year={2026},
eprint={2606.07001},
archivePrefix={arXiv},
url={https://arxiv.org/abs/2606.07001},
}
Paper: arXiv:2606.07001
Built for teams who want executable and seed-aligned data pipelines, not one-shot prompts.





