Overview

This document introduces Agent-Lightning at a high level: what it is, how it is architected, and how its major subsystems interact to train and optimize agents.

For installation instructions, see Installation. For step-by-step tutorials, see Quick Start Tutorial. For detailed component documentation, see Components.

What is Agent-Lightning?

Agent-Lightning is a framework for training and optimizing AI agents with minimal code changes. It provides infrastructure for collecting execution traces from agents built with any framework (LangChain, AutoGen, OpenAI SDK, CrewAI, or plain Python), applying learning algorithms to improve them, and deploying the updated agents.

The framework is designed around three core principles:

Framework-agnostic: Works with existing agent code without requiring rewrites
Algorithm-flexible: Supports multiple learning approaches (PPO, APO, SFT, GRPO)
Scale-ready: Runs on a single machine or distributed across clusters

Sources: README.md:1-26, pyproject.toml:1-6

Architecture Overview

Agent-Lightning follows a producer-consumer architecture where agents execute tasks (producing traces) and algorithms consume those traces to generate improvements. The LightningStore acts as the central coordination hub.

High-Level Component Diagram

Key Components:

Trainer: Orchestrates training loops, manages resources, coordinates between algorithms and runners
LightningStore: Central data hub storing rollouts, traces, and resources (prompts/weights)
LitAgentRunner: Executes agent tasks with tracing and lifecycle management
LitAgent: Base class for user-defined agents (framework-agnostic wrapper)
Algorithm: Implements learning logic (reads traces, outputs updated resources)
Tracer: Captures execution data from agent frameworks
LLM Proxy: Routes LLM requests and injects trace context

Sources: README.md:63-73, agentlightning/__init__.py:1-23, Diagram 1

Training Pipeline

The training pipeline applies this producer-consumer pattern in a loop: algorithms enqueue tasks with resource references, runners execute those tasks while capturing traces, and algorithms read the traces back to compute updates.

Training Loop Sequence

Pipeline Stages:

Enqueue: Algorithm posts rollout requests with task data and resources_id reference
Execute: Runner dequeues, fetches resources, runs agent with tracing enabled
Collect: Tracer writes spans to store with rollout_id and attempt_id tags
Learn: Algorithm queries spans, computes learning signals, updates resources
Deploy: Updated resources versioned in store, used by subsequent rollouts

Sources: Diagram 2, README.md:65-67
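
The five stages above can be illustrated with a toy, single-process loop. This is only a sketch under stated assumptions: `ToyStore`, `run_one_rollout`, and `toy_agent` are hypothetical stand-ins written for this example, not the real LightningStore, runner, or algorithm classes, and the "learning" step is reduced to picking the best-scoring prompt.

```python
from collections import deque
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ToyStore:
    """Hypothetical in-memory stand-in for the store (not the real LightningStore)."""
    queue: deque = field(default_factory=deque)
    resources: dict = field(default_factory=dict)  # resources_id -> named resources
    spans: list = field(default_factory=list)
    _ids: count = field(default_factory=count)

    def add_resources(self, resources: dict) -> str:
        resources_id = f"res-{next(self._ids)}"
        self.resources[resources_id] = resources
        return resources_id

    def enqueue_rollout(self, task, resources_id: str) -> str:
        rollout_id = f"ro-{next(self._ids)}"
        self.queue.append(
            {"rollout_id": rollout_id, "task": task, "resources_id": resources_id}
        )
        return rollout_id


def run_one_rollout(store: ToyStore, agent) -> None:
    """Execute + Collect: dequeue, fetch resources, run the agent, record a span."""
    rollout = store.queue.popleft()
    resources = store.resources[rollout["resources_id"]]
    reward = agent(rollout["task"], resources)
    store.spans.append(
        {
            "rollout_id": rollout["rollout_id"],
            "resources_id": rollout["resources_id"],
            "reward": reward,
        }
    )


def toy_agent(task: str, resources: dict) -> float:
    """Toy agent: reward is 1.0 if the prompt mentions the task, else 0.0."""
    return 1.0 if task in resources["prompt"] else 0.0


store = ToyStore()
candidates = ["Answer briefly.", "Answer questions about math briefly."]
ids = [store.add_resources({"prompt": p}) for p in candidates]  # versioned resources
for rid in ids:
    store.enqueue_rollout("math", rid)                          # Enqueue
while store.queue:
    run_one_rollout(store, toy_agent)                           # Execute + Collect
best = max(store.spans, key=lambda s: s["reward"])              # Learn
best_prompt = store.resources[best["resources_id"]]["prompt"]   # Deploy
print(best_prompt)
```

In a real run the enqueue and execute sides would live in different components (and possibly different processes), with the store as the only shared state between them.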

Core Subsystems

Storage and Coordination

The LightningStore provides four data management interfaces:

Interface	Purpose	Key Methods
Rollout Management	Task queue with status tracking	`enqueue_rollout()`, `dequeue_rollout()`, `update_attempt()`
Resource Management	Versioned prompts and weights	`add_resources()`, `get_resources_by_id()`, `list_resources()`
Span Management	Trace storage and retrieval	`add_otel_span()`, `query_spans()`, `get_many_span_sequence_ids()`
Worker Coordination	Heartbeat and health monitoring	`register_worker()`, `worker_heartbeat()`, `get_active_workers()`

The store supports two backend implementations:

InMemoryLightningStore: Single-process, fast, no external dependencies
MongoLightningStore: Distributed, persistent, scales horizontally

Both backends expose the same Python API and the same RESTful HTTP endpoints under /v1/*.

Sources: README.md:65-67, docs/reference/restful.md:1-20

Trace Collection System

Trace Collection Flow:

Instrumentation: Framework-specific hooks capture events (LLM calls, tool executions)
Normalization: Tracers convert to OpenTelemetry span format
Processing: Dedicated thread enriches spans with metadata
Storage: Spans written to store with monotonic sequence_id for deterministic ordering

Sources: tests/tracer/test_integration.py:1-18, Diagram 4
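
To make the role of the monotonic sequence_id concrete, here is a minimal sketch of how a per-attempt counter allows spans to be put back in execution order even if they arrive out of order. `ToySpan` and `SequenceAllocator` are hypothetical names invented for this example, and the span names other than "openai.chat.completion" are illustrative; the real store's allocation strategy may differ.

```python
import itertools
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySpan:
    """Hypothetical, simplified span record; real spans carry many more fields."""
    rollout_id: str
    attempt_id: str
    sequence_id: int
    name: str


class SequenceAllocator:
    """Hands out monotonically increasing ids per (rollout_id, attempt_id)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._counters: dict = {}

    def next_id(self, rollout_id: str, attempt_id: str) -> int:
        # The lock keeps ids unique when several threads emit spans at once.
        with self._lock:
            key = (rollout_id, attempt_id)
            counter = self._counters.setdefault(key, itertools.count(1))
            return next(counter)


alloc = SequenceAllocator()
spans = [
    ToySpan("ro-1", "at-1", alloc.next_id("ro-1", "at-1"), "agent.start"),
    ToySpan("ro-1", "at-1", alloc.next_id("ro-1", "at-1"), "openai.chat.completion"),
    ToySpan("ro-2", "at-1", alloc.next_id("ro-2", "at-1"), "agent.start"),
    ToySpan("ro-1", "at-1", alloc.next_id("ro-1", "at-1"), "tool.call"),
]

# Simulate out-of-order arrival, then restore order for one rollout.
arrived = list(reversed(spans))
ordered = sorted(
    (s for s in arrived if s.rollout_id == "ro-1"),
    key=lambda s: s.sequence_id,
)
ordered_names = [s.name for s in ordered]
print(ordered_names)
```

Sorting by an explicit counter rather than by timestamps avoids ambiguity when two spans share a timestamp or when clocks differ between processes.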

Algorithm Integration

Algorithms implement the learning logic. Agent-Lightning provides four built-in algorithms:

Algorithm	Type	Description	Dependencies
VERL	Reinforcement Learning	PPO with vLLM and Ray	`verl>=0.5.0`, `vllm>=0.8.4`
APO	Prompt Optimization	Automatic prompt refinement	`poml`
SFT	Supervised Fine-tuning	Standard supervised learning	`transformers`, `torch`
Baseline	Logging Only	No updates, used for debugging	(none)

All algorithms inherit from a common interface and interact with the store via adapters that convert spans to algorithm-specific formats (e.g., TracerTraceToTriplet for trajectory data).

Sources: pyproject.toml:30-48, README.md:23, tests/tracer/test_integration.py:67
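
The adapter idea can be sketched as a plain function that turns an attempt's spans into (prompt, response, reward) triplets. This is not the TracerTraceToTriplet implementation: the `Triplet` type, the `spans_to_triplets` function, the "reward" span name, and the attribute keys are all assumptions made for illustration, and assigning the final reward to the last LLM call is just one possible credit-assignment choice.

```python
from typing import NamedTuple, Optional


class Triplet(NamedTuple):
    """Hypothetical (prompt, response, reward) record for one LLM call."""
    prompt: str
    response: str
    reward: Optional[float]


def spans_to_triplets(spans: list) -> list:
    """Convert spans to triplets; the final reward is attached to the last call."""
    ordered = sorted(spans, key=lambda s: s["sequence_id"])
    llm_calls = [s for s in ordered if s["name"] == "openai.chat.completion"]
    rewards = [s["attributes"]["reward"] for s in ordered if s["name"] == "reward"]
    final_reward = rewards[-1] if rewards else None

    triplets = []
    for i, span in enumerate(llm_calls):
        attrs = span["attributes"]
        is_last = i == len(llm_calls) - 1
        triplets.append(
            Triplet(attrs["prompt"], attrs["response"], final_reward if is_last else None)
        )
    return triplets


# Spans deliberately listed out of order; attribute keys are assumed.
spans = [
    {"sequence_id": 3, "name": "reward", "attributes": {"reward": 1.0}},
    {
        "sequence_id": 1,
        "name": "openai.chat.completion",
        "attributes": {"prompt": "2+2?", "response": "Let me think."},
    },
    {
        "sequence_id": 2,
        "name": "openai.chat.completion",
        "attributes": {"prompt": "So?", "response": "4"},
    },
]
triplets = spans_to_triplets(spans)
print(triplets)
```

Keeping this conversion in an adapter means the tracer and the store stay unaware of any particular algorithm's input format.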

Execution Models

Agent-Lightning supports three execution strategies:

Single-Process Mode

Use case: Development, debugging, small experiments
Characteristics: Fast startup, easy debugging, no network overhead
Invocation: Trainer.dev() or Trainer.fit() with InMemoryLightningStore

Multi-Process Mode

Use case: Separate algorithm and execution, multi-machine rollouts
Characteristics: Processes communicate via RESTful API at http://localhost:4747/v1/*
Invocation: Run the agl store CLI command, then set the AGL_CURRENT_ROLE environment variable

Distributed Mode

Use case: Production training at scale
Characteristics: MongoDB for persistence, Ray for distributed RL, vLLM for inference
Invocation: Trainer.fit() with MongoLightningStore and VERL algorithm

Sources: Diagram 5, docs/tutorials/installation.md:1-10

Data Model

Agent-Lightning uses four primary data structures:

Rollout

A rollout represents a single task execution request. It contains:

rollout_id: Unique identifier (UUID)
task: Input data for the agent (any JSON-serializable object)
resources_id: Reference to the resources (prompts/weights) to use
status: State machine (pending → preparing → running → succeeded/failed)
attempts: List of execution attempts (for retries)
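
The rollout status state machine can be sketched as an enum plus a transition table. This is a minimal illustration, assuming only the forward edges listed above; `RolloutStatus`, `ALLOWED`, and `advance` are hypothetical names, and the real store may permit additional transitions.

```python
from enum import Enum


class RolloutStatus(str, Enum):
    PENDING = "pending"
    PREPARING = "preparing"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# Forward transitions read off the state machine in the list above.
# Assumption: "failed" is reachable only from "running"; the real store may
# allow other edges (for example, for retries or timeouts).
ALLOWED = {
    RolloutStatus.PENDING: {RolloutStatus.PREPARING},
    RolloutStatus.PREPARING: {RolloutStatus.RUNNING},
    RolloutStatus.RUNNING: {RolloutStatus.SUCCEEDED, RolloutStatus.FAILED},
    RolloutStatus.SUCCEEDED: set(),
    RolloutStatus.FAILED: set(),
}


def advance(current: RolloutStatus, new: RolloutStatus) -> RolloutStatus:
    """Return the new status, or raise if the transition is not allowed."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {new.value}")
    return new


status = RolloutStatus.PENDING
for step in (RolloutStatus.PREPARING, RolloutStatus.RUNNING, RolloutStatus.SUCCEEDED):
    status = advance(status, step)
print(status.value)  # prints "succeeded"
```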

Attempt

An attempt tracks one execution of a rollout:

attempt_id: Unique identifier (UUID)
rollout_id: Parent rollout reference
worker_id: Which runner executed it
status: State machine (running → succeeded/failed)
started_at, finished_at: Timestamps

Span

A span captures one operation during execution (OpenTelemetry format):

span_id, trace_id: OpenTelemetry identifiers
rollout_id, attempt_id: Agent-Lightning context
sequence_id: Monotonic ordering within attempt
name: Operation name (e.g., "openai.chat.completion")
attributes: Key-value metadata (prompt, response, model, etc.)
parent_id: Hierarchy reference

NamedResources

Resources are versioned artifacts used by agents:

resources_id: Unique identifier (UUID)
resources: Dictionary of named artifacts (e.g., {"prompt": "...", "actor_weights": "path/to/weights"})
created_at: Timestamp for versioning

Sources: README.md:65-69
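
The relationships between the four structures can be sketched with plain dataclasses. The field names follow the lists above, but the types, defaults, and the `new_id` helper are assumptions made for this example; the library's actual models may differ (for instance, in validation or extra fields).

```python
import secrets
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional


def new_id() -> str:
    """Hypothetical helper: a fresh UUID string."""
    return str(uuid.uuid4())


def now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class NamedResources:
    resources: dict
    resources_id: str = field(default_factory=new_id)
    created_at: datetime = field(default_factory=now)


@dataclass
class Attempt:
    rollout_id: str
    worker_id: str
    attempt_id: str = field(default_factory=new_id)
    status: str = "running"
    started_at: datetime = field(default_factory=now)
    finished_at: Optional[datetime] = None


@dataclass
class Rollout:
    task: Any
    resources_id: str
    rollout_id: str = field(default_factory=new_id)
    status: str = "pending"
    attempts: list = field(default_factory=list)


@dataclass
class Span:
    span_id: str
    trace_id: str
    rollout_id: str
    attempt_id: str
    sequence_id: int
    name: str
    attributes: dict = field(default_factory=dict)
    parent_id: Optional[str] = None


res = NamedResources({"prompt": "You are a helpful assistant."})
rollout = Rollout(task={"question": "2+2?"}, resources_id=res.resources_id)
attempt = Attempt(rollout_id=rollout.rollout_id, worker_id="runner-0")
rollout.attempts.append(attempt)
span = Span(
    span_id=secrets.token_hex(8),    # OpenTelemetry span ids are 8 bytes
    trace_id=secrets.token_hex(16),  # OpenTelemetry trace ids are 16 bytes
    rollout_id=rollout.rollout_id,
    attempt_id=attempt.attempt_id,
    sequence_id=1,
    name="openai.chat.completion",
)
```

Each span points back to its attempt and rollout, and each rollout points to the resources version it ran against, so a learning signal can be attributed to a specific prompt or set of weights.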

Key APIs and Interfaces

Python API

The primary entry point is the Trainer class; a run is started with Trainer.fit() or, for development, Trainer.dev().

CLI API

The agl command-line tool provides server management, including the agl store command used in Multi-Process Mode.

RESTful API

The store exposes HTTP endpoints:

POST /v1/rollouts: Enqueue rollout
GET /v1/rollouts/dequeue: Dequeue rollout
POST /v1/spans: Add span
GET /v1/spans: Query spans
POST /v1/resources: Add resources
GET /v1/resources/{id}: Fetch resources
POST /v1/traces: OTLP trace ingestion

Sources: agentlightning/__init__.py:5-23, pyproject.toml:50-51, docs/reference/restful.md:1-20
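
As a rough illustration of how a client might address these endpoints, the sketch below builds (but does not send) JSON requests using only the Python standard library. The base URL is the one quoted in the Multi-Process Mode section; the `build_request` helper and the payload field names are assumptions for this example and are not the documented request schema.

```python
import json
from typing import Optional
from urllib.request import Request

BASE_URL = "http://localhost:4747"  # address quoted under Multi-Process Mode


def build_request(method: str, path: str, payload: Optional[dict] = None) -> Request:
    """Build (but do not send) a JSON request against the store's /v1 API."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return Request(f"{BASE_URL}{path}", data=data, headers=headers, method=method)


# Payload field names are assumed; consult the REST reference for the real schema.
enqueue = build_request(
    "POST",
    "/v1/rollouts",
    {"task": {"question": "2+2?"}, "resources_id": "example-id"},
)
dequeue = build_request("GET", "/v1/rollouts/dequeue")

print(enqueue.get_method(), enqueue.full_url)
print(dequeue.get_method(), dequeue.full_url)
```

With a store server running, a request built this way could be sent with `urllib.request.urlopen(enqueue)`.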

Dependencies and Ecosystem

Core Dependencies

Agent-Lightning's minimal installation requires:

Package	Purpose	Version
`agentops`	Tracing framework integration	`>=0.4.13`
`opentelemetry-api`	Observability standard	`>=1.35`
`litellm`	LLM proxy and routing	`>=1.74`
`fastapi`	RESTful API server	(latest)
`pydantic`	Data validation	`>=2.11`

Optional Dependencies

Additional features require extra packages:

VERL: verl>=0.5.0, vllm>=0.8.4, torch>=2.8.0, flash-attn>=2.8.3
APO: poml
MongoDB: pymongo
Weave: weave>=0.52.22
Agent Frameworks: langchain>=1.0.0, autogen-agentchat, openai-agents, crewai>=1.2.0

Dependency Tracks

Agent-Lightning maintains three dependency tracks to ensure compatibility:

legacy: torch==2.7.0, openai<2.0.0, vllm==0.9.2
stable: torch>=2.8.0, openai>=2.0.0, vllm>=0.10.2
latest: Rolling updates via uv lock --upgrade

Sources: pyproject.toml:6-51, pyproject.toml:79-151, docs/tutorials/installation.md:100-109

Version and Release Information

Current Version: 0.3.1

Major releases:

v0.3.0 (12/24/2025): Tinker integration, MongoDB store, dashboard, multi-modality
v0.2.0 (10/22/2025): Lightning Store, emitter system, LLM proxy, execution strategies
v0.1.0 (08/04/2025): Initial release

For the detailed changelog, see docs/changelog.md.

Sources: pyproject.toml:3, agentlightning/__init__.py:3, docs/changelog.md:1-212