This document provides a high-level introduction to Agent-Lightning, its architecture, and core components. It explains what Agent-Lightning is, how it works, and how its major subsystems interact to enable agent training and optimization.
For installation instructions, see Installation. For step-by-step tutorials, see Quick Start Tutorial. For detailed component documentation, see Components.
Agent-Lightning is a framework for training and optimizing AI agents with minimal code changes. It provides infrastructure for collecting execution traces from agents built with any framework (LangChain, AutoGen, OpenAI SDK, CrewAI, or plain Python), applying learning algorithms to improve them, and deploying the updated agents.
The framework is designed around three core principles:
Sources: README.md1-26 pyproject.toml1-6
Agent-Lightning follows a producer-consumer architecture where agents execute tasks (producing traces) and algorithms consume those traces to generate improvements. The LightningStore acts as the central coordination hub.
Key Components:
- Trainer: Orchestrates training loops, manages resources, coordinates between algorithms and runners
- LightningStore: Central data hub storing rollouts, traces, and resources (prompts/weights)
- LitAgentRunner: Executes agent tasks with tracing and lifecycle management
- LitAgent: Base class for user-defined agents (framework-agnostic wrapper)
- Algorithm: Implements learning logic (reads traces, outputs updated resources)
- Tracer: Captures execution data from agent frameworks
- LLM Proxy: Routes LLM requests and injects trace context

Sources: README.md63-73 agentlightning/__init__.py1-23 Diagram 1
The training pipeline implements a producer-consumer pattern where algorithms enqueue tasks with resource references, runners execute those tasks while capturing traces, and algorithms read the traces to compute updates.
Pipeline Stages:
- task data and resources_id reference
- rollout_id and attempt_id tags

Sources: Diagram 2 README.md65-67
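The producer-consumer flow described above can be sketched with a plain in-process queue. This is a minimal illustration only: the function names (`algorithm_enqueue`, `runner_drain`, `algorithm_collect`), the dictionary layout, and the `"prompt-v1"` identifier are assumptions made for the example and are not the Agent-Lightning API.

```python
import queue
import uuid

# Hypothetical in-memory stand-ins; not the Agent-Lightning API.
task_queue = queue.Queue()
trace_log = []  # records written by the runner


def algorithm_enqueue(tasks, resources_id):
    """Producer side: enqueue each task with a reference to the resources to use."""
    rollout_ids = []
    for task in tasks:
        rollout_id = str(uuid.uuid4())
        task_queue.put(
            {"rollout_id": rollout_id, "task": task, "resources_id": resources_id}
        )
        rollout_ids.append(rollout_id)
    return rollout_ids


def runner_drain(agent_fn):
    """Consumer side: run queued tasks and tag each record with rollout and attempt ids."""
    while not task_queue.empty():
        item = task_queue.get()
        trace_log.append(
            {
                "rollout_id": item["rollout_id"],
                "attempt_id": str(uuid.uuid4()),
                "resources_id": item["resources_id"],
                "output": agent_fn(item["task"]),
            }
        )


def algorithm_collect(rollout_ids):
    """The algorithm reads back the records that belong to its own rollouts."""
    wanted = set(rollout_ids)
    return [record for record in trace_log if record["rollout_id"] in wanted]


ids = algorithm_enqueue(["hello", "world"], resources_id="prompt-v1")
runner_drain(lambda task: task.upper())  # toy "agent" that upper-cases its input
records = algorithm_collect(ids)
```

In a real deployment the queue and the trace log would live in the store rather than in module-level variables, so that algorithms and runners can run in separate processes.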
The LightningStore provides four data management interfaces:
| Interface | Purpose | Key Methods |
|---|---|---|
| Rollout Management | Task queue with status tracking | enqueue_rollout(), dequeue_rollout(), update_attempt() |
| Resource Management | Versioned prompts and weights | add_resources(), get_resources_by_id(), list_resources() |
| Span Management | Trace storage and retrieval | add_otel_span(), query_spans(), get_many_span_sequence_ids() |
| Worker Coordination | Heartbeat and health monitoring | register_worker(), worker_heartbeat(), get_active_workers() |
The store supports two backend implementations:
- InMemoryLightningStore: Single-process, fast, no external dependencies
- MongoLightningStore: Distributed, persistent, scales horizontally

Both expose identical Python APIs and RESTful HTTP endpoints at /v1/*.
Sources: README.md65-67 docs/reference/restful.md1-20
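To make the rollout and resource interfaces more concrete, here is a toy single-process store. The method names mirror those in the table, but the signatures, return values, and id format are assumptions for illustration; the real LightningStore may differ in all of these.

```python
import itertools
from collections import deque


class ToyStore:
    """Illustrative stand-in for a store; NOT the real LightningStore."""

    def __init__(self):
        self._queue = deque()
        self._resources = {}
        self._ids = itertools.count(1)

    def add_resources(self, resources):
        # Each call stores a new version and returns its id (id format is made up).
        resources_id = f"res-{next(self._ids)}"
        self._resources[resources_id] = dict(resources)
        return resources_id

    def get_resources_by_id(self, resources_id):
        return self._resources[resources_id]

    def enqueue_rollout(self, task, resources_id):
        rollout = {
            "rollout_id": f"ro-{next(self._ids)}",
            "task": task,
            "resources_id": resources_id,
        }
        self._queue.append(rollout)
        return rollout

    def dequeue_rollout(self):
        # Returns None when no work is waiting.
        return self._queue.popleft() if self._queue else None


store = ToyStore()
v1 = store.add_resources({"prompt": "Answer briefly."})
v2 = store.add_resources({"prompt": "Answer step by step."})
store.enqueue_rollout({"question": "What is 2 + 2?"}, resources_id=v2)

claimed = store.dequeue_rollout()
prompt = store.get_resources_by_id(claimed["resources_id"])["prompt"]
```

Because a rollout carries only a `resources_id`, a runner always resolves the exact prompt or weights version the algorithm intended, even if newer versions have been added in the meantime.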
Trace Collection Flow:
- sequence_id for deterministic ordering

Sources: tests/tracer/test_integration.py1-18 Diagram 4
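The role of sequence_id can be shown with a short sketch: if spans reach the store out of order, sorting by a monotonically increasing counter restores the order in which they were emitted. The span dictionaries and the `make_span` helper below are hypothetical; only the sequence_id field comes from the text above.

```python
import itertools

# Hypothetical span records; a shared counter hands out increasing sequence ids.
_sequence = itertools.count(1)


def make_span(name):
    """Create a toy span stamped with the next sequence_id."""
    return {"sequence_id": next(_sequence), "name": name}


emitted = [make_span("agent.start"), make_span("llm.call"), make_span("tool.call")]

# Simulate out-of-order arrival, e.g. from concurrent exporters.
arrived = [emitted[2], emitted[0], emitted[1]]

# Sorting on sequence_id recovers the emission order regardless of arrival order.
ordered = sorted(arrived, key=lambda span: span["sequence_id"])
names = [span["name"] for span in ordered]
```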
Algorithms implement the learning logic. Agent-Lightning provides four built-in algorithms:
| Algorithm | Type | Description | Dependencies |
|---|---|---|---|
| VERL | Reinforcement Learning | PPO with vLLM and Ray | verl>=0.5.0, vllm>=0.8.4 |
| APO | Prompt Optimization | Automatic prompt refinement | poml |
| SFT | Supervised Fine-tuning | Standard supervised learning | transformers, torch |
| Baseline | Logging Only | No updates, used for debugging | (none) |
All algorithms inherit from a common interface and interact with the store via adapters that convert spans to algorithm-specific formats (e.g., TracerTraceToTriplet for trajectory data).
Sources: pyproject.toml30-48 README.md23 tests/tracer/test_integration.py67
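As a rough illustration of what an adapter does, the sketch below converts span-like dictionaries into (prompt, response, reward) triplets. It is not the TracerTraceToTriplet implementation: the `Triplet` type, the `spans_to_triplets` function, and the rule that only the last LLM call receives the reward are all assumptions made for the example.

```python
from typing import NamedTuple, Optional


class Triplet(NamedTuple):
    prompt: str
    response: str
    reward: Optional[float]


def spans_to_triplets(spans, final_reward):
    """Turn LLM-call spans into (prompt, response, reward) triplets.

    Assumption: only the last LLM call in the attempt carries the reward.
    """
    llm_spans = sorted(
        (
            span
            for span in spans
            if "prompt" in span["attributes"] and "response" in span["attributes"]
        ),
        key=lambda span: span["sequence_id"],
    )
    triplets = []
    for index, span in enumerate(llm_spans):
        is_last = index == len(llm_spans) - 1
        triplets.append(
            Triplet(
                prompt=span["attributes"]["prompt"],
                response=span["attributes"]["response"],
                reward=final_reward if is_last else None,
            )
        )
    return triplets


example_spans = [
    {"sequence_id": 2, "attributes": {"prompt": "Refine: 4?", "response": "4"}},
    {"sequence_id": 1, "attributes": {"prompt": "2 + 2?", "response": "maybe 4"}},
    {"sequence_id": 3, "attributes": {"tool": "calculator"}},  # not an LLM call
]
triplets = spans_to_triplets(example_spans, final_reward=1.0)
```

Keeping this conversion in an adapter means the same stored spans can feed algorithms that expect different input formats.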
Agent-Lightning supports three execution strategies:
- Trainer.dev() or Trainer.fit() with InMemoryLightningStore
- http://localhost:4747/v1/*
- agl store CLI, then set AGL_CURRENT_ROLE environment variable
- Trainer.fit() with MongoLightningStore and VERL algorithm

Sources: Diagram 5 docs/tutorials/installation.md1-10
Agent-Lightning uses four primary data structures:
A rollout represents a single task execution request. It contains:

- rollout_id: Unique identifier (UUID)
- task: Input data for the agent (any JSON-serializable object)
- resources_id: Reference to the resources (prompts/weights) to use
- status: State machine (pending → preparing → running → succeeded/failed)
- attempts: List of execution attempts (for retries)

An attempt tracks one execution of a rollout:

- attempt_id: Unique identifier (UUID)
- rollout_id: Parent rollout reference
- worker_id: Which runner executed it
- status: State machine (running → succeeded/failed)
- started_at, finished_at: Timestamps

A span captures one operation during execution (OpenTelemetry format):

- span_id, trace_id: OpenTelemetry identifiers
- rollout_id, attempt_id: Agent-Lightning context
- sequence_id: Monotonic ordering within attempt
- name: Operation name (e.g., "openai.chat.completion")
- attributes: Key-value metadata (prompt, response, model, etc.)
- parent_id: Hierarchy reference

Resources are versioned artifacts used by agents:

- resources_id: Unique identifier (UUID)
- resources: Dictionary of named artifacts (e.g., {"prompt": "...", "actor_weights": "path/to/weights"})
- created_at: Timestamp for versioning

Sources: README.md65-69
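The rollout and attempt records, together with the rollout status state machine, can be modeled with plain dataclasses. This is a sketch under stated assumptions: the class definitions, the `ALLOWED` transition table, and the `transition` method are illustrative and are not the library's own types; the transition table is one strict reading of the arrows in the status field.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, List

# One reading of "pending → preparing → running → succeeded/failed" (assumed strict).
ALLOWED = {
    "pending": {"preparing"},
    "preparing": {"running"},
    "running": {"succeeded", "failed"},
    "succeeded": set(),
    "failed": set(),
}


@dataclass
class Attempt:
    rollout_id: str
    worker_id: str
    attempt_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    status: str = "running"


@dataclass
class Rollout:
    task: Any
    resources_id: str
    rollout_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    status: str = "pending"
    attempts: List[Attempt] = field(default_factory=list)

    def transition(self, new_status: str) -> None:
        """Move to new_status, rejecting transitions the table does not allow."""
        if new_status not in ALLOWED[self.status]:
            raise ValueError(f"illegal transition {self.status} -> {new_status}")
        self.status = new_status


rollout = Rollout(task={"question": "What is 2 + 2?"}, resources_id="example-resources")
rollout.transition("preparing")
rollout.transition("running")
rollout.attempts.append(Attempt(rollout_id=rollout.rollout_id, worker_id="runner-0"))
rollout.transition("succeeded")
```

Keeping attempts as a list on the rollout lets a retry add a second attempt without losing the record of the first.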
The primary entry point is the Trainer class:
The agl command-line tool provides server management:
The store exposes HTTP endpoints:
- POST /v1/rollouts: Enqueue rollout
- GET /v1/rollouts/dequeue: Dequeue rollout
- POST /v1/spans: Add span
- GET /v1/spans: Query spans
- POST /v1/resources: Add resources
- GET /v1/resources/{id}: Fetch resources
- POST /v1/traces: OTLP trace ingestion

Sources: agentlightning/__init__.py5-23 pyproject.toml50-51 docs/reference/restful.md1-20
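The endpoint list maps naturally onto ordinary HTTP requests. The sketch below only constructs request objects with the standard library and does not send them. The base URL assumes a store listening on localhost port 4747, and the JSON body fields and the `example-id` value are hypothetical; consult the REST reference for the actual request schemas.

```python
import json
import urllib.request

BASE_URL = "http://localhost:4747"  # assumed address of a locally running store


def build_request(method, path, payload=None):
    """Build (but do not send) a JSON request for a store endpoint."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )


# Body shape is a guess for illustration, not the documented schema.
enqueue = build_request(
    "POST",
    "/v1/rollouts",
    {"task": {"question": "What is 2 + 2?"}, "resources_id": "example-id"},
)
fetch = build_request("GET", "/v1/resources/example-id")

# To actually send a request against a running store:
# with urllib.request.urlopen(enqueue) as response:
#     body = json.load(response)
```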
Agent-Lightning's minimal installation requires:
| Package | Purpose | Version |
|---|---|---|
| agentops | Tracing framework integration | >=0.4.13 |
| opentelemetry-api | Observability standard | >=1.35 |
| litellm | LLM proxy and routing | >=1.74 |
| fastapi | RESTful API server | (latest) |
| pydantic | Data validation | >=2.11 |
Additional features require extra packages:
- verl>=0.5.0, vllm>=0.8.4, torch>=2.8.0, flash-attn>=2.8.3
- poml
- pymongo
- weave>=0.52.22
- langchain>=1.0.0, autogen-agentchat, openai-agents, crewai>=1.2.0

Agent-Lightning maintains three dependency tracks to ensure compatibility:

- torch==2.7.0, openai<2.0.0, vllm==0.9.2
- torch>=2.8.0, openai>=2.0.0, vllm>=0.10.2
- uv lock --upgrade

Sources: pyproject.toml6-51 pyproject.toml79-151 docs/tutorials/installation.md100-109
Current Version: 0.3.1
Major releases:
For detailed changelog, see docs/changelog.md
Sources: pyproject.toml3 agentlightning/__init__.py3 docs/changelog.md1-212