Conversation

@totoluo
Contributor

@totoluo totoluo commented Dec 7, 2025

  • Added example of Chartqa agent
  • Added position id conversion for mrope
  • Added image/path message format for OpenAI-compatible MLLM APIs (see the sketch below)
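
For illustration, here is a minimal sketch of the message format idea in the last bullet: an OpenAI-compatible chat message whose image part is either an inline base64 data URL (cloud APIs) or a `file://` URL (local vLLM started with `--allowed-local-media-path`). The function and file names are assumptions for the example, not necessarily the names used in this PR.

```python
import base64
import os


def image_content_part(image_path: str, use_base64: bool) -> dict:
    """Build the image part of an OpenAI-style multimodal chat message."""
    if use_base64:
        # Cloud APIs need the image inlined as a base64 data URL.
        with open(image_path, "rb") as f:
            encoded = base64.b64encode(f.read()).decode("utf-8")
        url = f"data:image/png;base64,{encoded}"
    else:
        # A local vLLM server with --allowed-local-media-path can read file:// URLs.
        url = f"file://{os.path.realpath(image_path)}"
    return {"type": "image_url", "image_url": {"url": url}}


message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What is the highest value in the chart?"},
        image_content_part("data/chart_0001.png", use_base64=False),
    ],
}
```
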
@totoluo totoluo marked this pull request as ready for review December 7, 2025 05:18
@totoluo
Contributor Author

totoluo commented Dec 7, 2025

@microsoft-github-policy-service agree

Contributor

@ultmaster ultmaster left a comment

Thanks for your contribution to this project! I have several minor comments that I hope can be addressed before we land this on main.

Generally I want this PR to be merged quickly; we can collaborate to add more tests and refine the documentation later.

@@ -0,0 +1,229 @@
"""Utilities for multimodal model support in VERL training.
Contributor

Are these functions not importable directly from VERL?

Contributor Author

get_rope_index and the processor are imported directly from verl; the current compute_mrope_position_ids and get_image_grid_thw are just wrappers that make them compatible with agl's AgentModeDaemon. Regarding get_image_grid_thw: verl processes images inside the dataset's __getitem__ and does not return image_grid_thw explicitly. Since agl builds the training batch from rollout ids in AgentModeDaemon (get_train_data_batch), we need to reload the images and recover it from the original sample. Alternatively we could override the current dataset __getitem__, but that seems like a bigger change.
That said, the descriptions in this file are redundant; I'll refactor the functions to look simpler and move them to daemon.py, since they are only used by the daemon.
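
For context, a rough sketch of the wrapper idea described above, assuming a Qwen2-VL-style processor whose image processor returns `image_grid_thw`; the actual signature in this PR may differ.

```python
from PIL import Image


def get_image_grid_thw(processor, image_paths):
    """Re-open a rollout's images and recover image_grid_thw from the processor.

    verl computes this inside the dataset's __getitem__ and does not return it,
    so the daemon has to re-derive it from the original sample's image paths.
    """
    if not image_paths:
        return None
    images = [Image.open(p).convert("RGB") for p in image_paths]
    features = processor.image_processor(images=images, return_tensors="pt")
    # Qwen2-VL image processors return one (t, h, w) grid per image.
    return features["image_grid_thw"]
```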

Contributor Author

Hey @ultmaster, I have updated the branch based on all the comments, appreciate any reviews!


```bash
uv pip install datasets pillow pandas pyarrow nest_asyncio
uv pip install "langgraph<1.0" "langchain[openai]<1.0" "langchain-community"
```
Contributor

We have upgraded langchain to 1.0+ on main. Please upgrade your langchain accordingly.

@@ -0,0 +1,100 @@
# ChartQA Example

[![chartqa CI status](https://github.com/microsoft/agent-lightning/actions/workflows/examples-chartqa.yml/badge.svg)](https://github.com/microsoft/agent-lightning/actions/workflows/examples-chartqa.yml)
Contributor

Let's remove this until we really have a CI for this.

pyproject.toml Outdated
```toml
apo = [
    "poml",
]
multimodal = [
```
Contributor

please upload new uv.lock

@@ -0,0 +1,6 @@
#!/bin/bash
Contributor

Black-box one-click scripts are bad and hard to debug. Remove this script and add clear instructions to the README instead.

@totoluo totoluo changed the title Feature/multimodal-example Dec 8, 2025
@ultmaster
Contributor

@totoluo The code looks generally good now. I think I'll test it myself before I get it merged.

Could you tell me the lowest resource requirements to run this example?

@totoluo
Contributor Author

totoluo commented Dec 9, 2025

> @totoluo The code looks generally good now. I think I'll test it myself before I get it merged.
>
> Could you tell me the lowest resource requirements to run this example?

Hi @ultmaster, I used 2 GPUs with 40 GB of memory each. Additionally, I used vllm==0.10.2 because I hit the flash-attention issue mentioned in pyproject.toml with vllm 0.11.0. That was the only glitch I ran into when testing locally.

@ultmaster
Contributor

@totoluo I'm testing with vllm 0.11.0 (I know 0.11.1+ won't work), but I encountered the following error when running this command:

```bash
uv run --no-sync vllm serve Qwen/Qwen2-VL-2B-Instruct \
    --gpu-memory-utilization 0.6 \
    --max-model-len 4096 \
    --allowed-local-media-path /home/xxx/Projects/agl-second-bench/examples/chartqa/data \
    --enable-prefix-caching \
    --port 8088
```
(EngineCore_DP0 pid=257550) INFO 12-10 18:44:41 [default_loader.py:267] Loading weights took 0.45 seconds
(EngineCore_DP0 pid=257550) INFO 12-10 18:44:41 [gpu_model_runner.py:2653] Model loading took 4.1513 GiB and 2.411431 seconds
(EngineCore_DP0 pid=257550) INFO 12-10 18:44:41 [gpu_model_runner.py:3344] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708] EngineCore failed to start.
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708] Traceback (most recent call last):
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]   File "/home/xxx/Projects/agl-second-bench/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]   File "/home/xxx/Projects/agl-second-bench/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 498, in __init__
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]   File "/home/xxx/Projects/agl-second-bench/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 92, in __init__
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]     self._initialize_kv_caches(vllm_config)
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]   File "/home/xxx/Projects/agl-second-bench/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 190, in _initialize_kv_caches
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]     self.model_executor.determine_available_memory())
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]   File "/home/xxx/Projects/agl-second-bench/.venv/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 85, in determine_available_memory
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]     return self.collective_rpc("determine_available_memory")
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]   File "/home/xxx/Projects/agl-second-bench/.venv/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 83, in collective_rpc
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]     return [run_method(self.driver_worker, method, args, kwargs)]
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]   File "/home/xxx/Projects/agl-second-bench/.venv/lib/python3.12/site-packages/vllm/utils/__init__.py", line 3122, in run_method
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]     return func(*args, **kwargs)
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]   File "/home/xxx/Projects/agl-second-bench/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]     return func(*args, **kwargs)
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]   File "/home/xxx/Projects/agl-second-bench/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 263, in determine_available_memory
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]     self.model_runner.profile_run()
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]   File "/home/xxx/Projects/agl-second-bench/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3361, in profile_run
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]     self.model.get_multimodal_embeddings(
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]   File "/home/xxx/Projects/agl-second-bench/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen2_vl.py", line 1458, in get_multimodal_embeddings
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]     vision_embeddings = self._process_image_input(image_input)
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]   File "/home/xxx/Projects/agl-second-bench/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen2_vl.py", line 1385, in _process_image_input
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]     image_embeds = self.visual(pixel_values,
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]                    ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]   File "/home/xxx/Projects/agl-second-bench/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]   File "/home/xxx/Projects/agl-second-bench/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]   File "/home/xxx/Projects/agl-second-bench/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen2_vl.py", line 739, in forward
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]     x = blk(
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]         ^^^^
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]   File "/home/xxx/Projects/agl-second-bench/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]   File "/home/xxx/Projects/agl-second-bench/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]   File "/home/xxx/Projects/agl-second-bench/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen2_vl.py", line 489, in forward
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]     x = x + self.attn(
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]             ^^^^^^^^^^
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]   File "/home/xxx/Projects/agl-second-bench/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]   File "/home/xxx/Projects/agl-second-bench/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]   File "/home/xxx/Projects/agl-second-bench/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen2_vl.py", line 398, in forward
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]     output = flash_attn_varlen_func(q,
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]              ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]   File "/home/xxx/Projects/agl-second-bench/.venv/lib/python3.12/site-packages/flash_attn/flash_attn_interface.py", line 1443, in flash_attn_varlen_func
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]     return FlashAttnVarlenFunc.apply(
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]   File "/home/xxx/Projects/agl-second-bench/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 576, in apply
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]     return super().apply(*args, **kwargs)  # type: ignore[misc]
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]   File "/home/xxx/Projects/agl-second-bench/.venv/lib/python3.12/site-packages/flash_attn/flash_attn_interface.py", line 925, in forward
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]     out_padded, softmax_lse, S_dmask, rng_state = _wrapped_flash_attn_varlen_forward(
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]                                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]   File "/home/xxx/Projects/agl-second-bench/.venv/lib/python3.12/site-packages/torch/_ops.py", line 1243, in __call__
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]     return self._op(*args, **kwargs)
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]   File "/home/xxx/Projects/agl-second-bench/.venv/lib/python3.12/site-packages/torch/_library/custom_ops.py", line 344, in backend_impl
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]     result = self._backend_fns[device_type](*args, **kwargs)
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]   File "/home/xxx/Projects/agl-second-bench/.venv/lib/python3.12/site-packages/torch/_compile.py", line 53, in inner
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]     return disable_fn(*args, **kwargs)
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]   File "/home/xxx/Projects/agl-second-bench/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 929, in _fn
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]     return fn(*args, **kwargs)
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]   File "/home/xxx/Projects/agl-second-bench/.venv/lib/python3.12/site-packages/torch/_library/custom_ops.py", line 377, in wrapped_fn
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]     return fn(*args, **kwargs)
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]   File "/home/xxx/Projects/agl-second-bench/.venv/lib/python3.12/site-packages/flash_attn/flash_attn_interface.py", line 165, in _flash_attn_varlen_forward
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]     out, softmax_lse, S_dmask, rng_state = flash_attn_gpu.varlen_fwd(
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708]                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=257550) ERROR 12-10 18:44:42 [core.py:708] RuntimeError: cu_seqlens_q must be on CUDA

Have you tried vllm 0.11.0? Have you seen this before?

It seems to be related to this issue: vllm-project/vllm#27340

I don't want to downgrade to vllm 0.10.2, because previously it did not work well with verl 0.6.0.

@ultmaster
Contributor

I also want you to know that I've made some minor changes to your code to fix lint issues and prepare for the CI. But I haven't managed to run the example code yet.

@totoluo
Contributor Author

totoluo commented Dec 10, 2025

> cu_seqlens_q must be on CUDA

Hello @ultmaster, I hit this exact same issue. It's related to vllm-project/vllm#25926, and the fix was merged in vllm-project/vllm#26219. However, that commit is only included in 0.11.1, 0.11.2, and 0.12.0, which clash with the requirements of the torch-stable group. It seems the only compatible version is 0.10.2, or we would need different vllm versions to run different examples in the project. Should I update the dependency group? What would you suggest we do here?

@ultmaster
Contributor

Hi @totoluo, thanks for the update. I tested the code on one A100 and can confirm that it passes the sanity check and runs successfully: https://github.com/microsoft/agent-lightning/actions/runs/20140706916/job/57808847154

I have one question though: why are the batch size, train batch size, logprob batch size, and mini batch size all set to 1? Have you tried a more realistic setting? Is the training successful in terms of accuracy on the ChartQA dataset?

@ultmaster
Contributor

/ci

@github-actions

github-actions bot commented Dec 12, 2025

🚀 CI Watcher for correlation id-3645291658-mj2k48tj triggered by comment 3645291658
🏃‍♀️ Tracking 2 workflow run(s):

✅ All runs completed.

Contributor

Copilot AI left a comment

Pull request overview

This PR adds comprehensive multimodal support to AgentLightning by introducing a ChartQA example demonstrating visual reasoning with LangGraph. The implementation includes M-RoPE position embedding support for Qwen2-VL models and image handling utilities for OpenAI-compatible MLLM APIs.

Key changes:

  • New ChartQA example with multi-step reasoning workflow (observe → extract → calculate → check → refine)
  • Core framework enhancements for M-RoPE position ID computation in the VERL training pipeline (see the sketch after this list)
  • Image processing utilities supporting both base64 encoding (cloud APIs) and file:// URLs (local vLLM)
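
To make the M-RoPE bullet concrete, here is a generic illustration (not this PR's code) of why a conversion step is needed: Qwen2-VL rotary embeddings index each token with three components (temporal, height, width), while text-only pipelines produce a single 1-D sequence of positions. For pure-text tokens all three components are identical, so the text-only case is just a broadcast; vision tokens additionally need grid-based values derived from `image_grid_thw`, and the daemon here appears to stack an extra text-position row on top (hence the 4-row shape seen in the review hunks below).

```python
import torch


def text_only_mrope_position_ids(attention_mask: torch.Tensor) -> torch.Tensor:
    """Expand 1-D text positions into the 3-row (t, h, w) M-RoPE layout.

    attention_mask: (seq_len,) tensor of 0/1. Returns (3, seq_len) position ids.
    Vision tokens would get grid-based values instead; this covers text only.
    """
    pos = (attention_mask.cumsum(dim=-1) - 1).clamp(min=0)
    pos = pos.masked_fill(attention_mask == 0, 0)
    return pos.unsqueeze(0).expand(3, -1).clone()
```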

Reviewed changes

Copilot reviewed 13 out of 14 changed files in this pull request and generated 9 comments.

Summary per file:

| File | Description |
| --- | --- |
| uv.lock | Added image-related dependencies (av, qwen-vl-utils, datasets, pandas, pillow, pyarrow) |
| pyproject.toml | Defined new "image" dependency group for multimodal support |
| examples/chartqa/train_chartqa_agent.py | Training orchestration with VERL algorithm and configurable hyperparameters |
| examples/chartqa/prompts.py | LangChain prompt templates for the multi-step reasoning workflow |
| examples/chartqa/prepare_data.py | Dataset download and preprocessing script for HuggingFace ChartQA |
| examples/chartqa/multimodal_utils.py | Image encoding utilities for base64 conversion and message formatting |
| examples/chartqa/env_var.py | Environment variable configuration for ChartQA paths and API settings |
| examples/chartqa/debug_chartqa_agent.py | Debugging utilities for testing with cloud APIs or local vLLM proxy |
| examples/chartqa/chartqa_agent.py | Main agent implementation with LangGraph state machine and evaluation logic |
| examples/chartqa/README.md | Comprehensive documentation with setup and usage instructions |
| agentlightning/verl/trainer.py | Passes processor and image_base_dir to daemon for M-RoPE support |
| agentlightning/verl/daemon.py | Implements M-RoPE position_ids computation and image grid handling for Qwen2-VL |
| .github/workflows/examples-rag.yml | Fixed artifact naming consistency |
| .github/workflows/examples-chartqa.yml | New CI workflow for ChartQA example testing and training validation |



```python
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "token-abc123")

OPENAI_MODEL = os.getenv("OPENAI_MODEL", "gpt-4.1-mini")
```
Copilot AI Dec 12, 2025

The model name "gpt-4.1-mini" appears to be incorrect. OpenAI's current models are named "gpt-4o-mini", "gpt-4o", etc. The ".1" versioning pattern is not standard for OpenAI models. This should likely be "gpt-4o-mini" or another valid model identifier.

Suggested change:

```diff
-OPENAI_MODEL = os.getenv("OPENAI_MODEL", "gpt-4.1-mini")
+OPENAI_MODEL = os.getenv("OPENAI_MODEL", "gpt-4o-mini")
```
Comment on lines +301 to +308

```python
if "THE ANSWER IS CORRECT" in last_message.content:  # type: ignore
    if "THE ANSWER IS INCORRECT" in last_message.content:  # type: ignore
        correct_index = last_message.content.rfind("THE ANSWER IS CORRECT")  # type: ignore
        incorrect_index = last_message.content.rfind("THE ANSWER IS INCORRECT")  # type: ignore
        if correct_index > incorrect_index:
            return END
    else:
        return END
```
Copilot AI Dec 12, 2025

The condition logic appears incorrect. When both "CORRECT" and "INCORRECT" are present in the message, the code checks if "CORRECT" appears after "INCORRECT" to decide whether to end. However, this doesn't handle the edge case where neither substring appears, which would still fall through to checking turn limits. More critically, if only "INCORRECT" appears without "CORRECT", the code would incorrectly continue to check turn limits rather than explicitly routing to refinement. Consider restructuring to check for "INCORRECT" first, then "CORRECT", with explicit handling for each case.

Suggested change:

```diff
-if "THE ANSWER IS CORRECT" in last_message.content:  # type: ignore
-    if "THE ANSWER IS INCORRECT" in last_message.content:  # type: ignore
-        correct_index = last_message.content.rfind("THE ANSWER IS CORRECT")  # type: ignore
-        incorrect_index = last_message.content.rfind("THE ANSWER IS INCORRECT")  # type: ignore
-        if correct_index > incorrect_index:
-            return END
-    else:
-        return END
+content = last_message.content  # type: ignore
+has_correct = "THE ANSWER IS CORRECT" in content
+has_incorrect = "THE ANSWER IS INCORRECT" in content
+if has_incorrect and has_correct:
+    # If both are present, decide based on which comes last
+    correct_index = content.rfind("THE ANSWER IS CORRECT")
+    incorrect_index = content.rfind("THE ANSWER IS INCORRECT")
+    if correct_index > incorrect_index:
+        return END
+    else:
+        return "refine_answer"
+elif has_incorrect:
+    return "refine_answer"
+elif has_correct:
+    return END
```
```bash
USE_LLM_PROXY=1 \
OPENAI_API_BASE=http://localhost:8088/v1 \
OPENAI_MODEL=Qwen/Qwen2-VL-2B-Instruct \
python chartqa_agent.py
```
Copilot AI Dec 12, 2025

The comment in line 87 says to run "python chartqa_agent.py" but the correct command based on the file structure should be "python debug_chartqa_agent.py". This inconsistency could confuse users following the instructions.

Suggested change:

```diff
-python chartqa_agent.py
+python debug_chartqa_agent.py
```
Then run the training script with the external store address:

```bash
AGL_MANAGED_STORE=0 python train_chartqa_agent.py fast --external-store-address http://localhost:4747
```
Copilot AI Dec 12, 2025

The comment mentions "fast" as a config option but the code only defines three configs: "debug", "qwen", and "ci". Using "fast" in the command line example will cause an error. This should be updated to use one of the actual available config options like "qwen" or "debug".

Suggested change:

```diff
-AGL_MANAGED_STORE=0 python train_chartqa_agent.py fast --external-store-address http://localhost:4747
+AGL_MANAGED_STORE=0 python train_chartqa_agent.py qwen --external-store-address http://localhost:4747
```
Comment on lines +825 to +828

```python
# Compute image_grid_thw for this triplet using image_urls from prompt
if self._use_mrope:
    image_urls = trace.get("image_urls", [])
    image_grid_thw_list.append(self._get_image_grid_thw(image_urls))
```
Copilot AI Dec 12, 2025

When multiple triplets are processed for a single rollout, all of them share the same image_urls list from the first triplet. This could lead to incorrect image_grid_thw computations for subsequent triplets if they have different images. The code should iterate through triplets and compute image_grid_thw for each triplet individually, matching each to its corresponding input sequence.

Comment on lines +864 to +868

```python
# For mrope (3D position_ids), use the first dimension (text position_ids) for eos calculation
if self._use_mrope:
    # position_ids is (batch_size, 4, seq_length), use first dim for text positions
    text_position_ids = position_ids[:, 0, :]  # (batch_size, seq_length)
    eos_mask_idx = torch.argmax(text_position_ids * attention_mask, dim=-1)  # (bsz,)
```
Copilot AI Dec 12, 2025

The variable name "text_position_ids" is misleading because it extracts the first dimension of position_ids, which for M-RoPE includes text positions but the comment suggests this is specifically for text. According to M-RoPE documentation, dimension 0 contains temporal/sequential positions that work for both text and vision tokens. Consider renaming to "sequential_position_ids" or "temporal_position_ids" to better reflect its actual purpose.

Suggested change:

```diff
-# For mrope (3D position_ids), use the first dimension (text position_ids) for eos calculation
-if self._use_mrope:
-    # position_ids is (batch_size, 4, seq_length), use first dim for text positions
-    text_position_ids = position_ids[:, 0, :]  # (batch_size, seq_length)
-    eos_mask_idx = torch.argmax(text_position_ids * attention_mask, dim=-1)  # (bsz,)
+# For mrope (3D position_ids), use the first dimension (sequential/temporal position_ids) for eos calculation
+if self._use_mrope:
+    # position_ids is (batch_size, 4, seq_length), use first dim for sequential/temporal positions (per M-RoPE docs)
+    sequential_position_ids = position_ids[:, 0, :]  # (batch_size, seq_length)
+    eos_mask_idx = torch.argmax(sequential_position_ids * attention_mask, dim=-1)  # (bsz,)
```
Comment on lines +276 to +279

```python
text_pos = torch.zeros((1, len(input_ids)), dtype=torch.long, device=input_ids.device)
text_pos[0, valid_mask] = torch.arange(valid_mask.sum().item(), device=input_ids.device)

return torch.cat([text_pos, vision_pos], dim=0)
```
Copilot AI Dec 12, 2025

The position_ids computation for M-RoPE creates a zero tensor for text_pos and then fills valid positions, but the logic appears to have an issue. The code creates text_pos with shape (1, len(input_ids)), but input_ids is already a 1D tensor, so len(input_ids) gives the sequence length. However, the indexing text_pos[0, valid_mask] assumes valid_mask has been properly broadcast or is 1D. Since valid_mask comes from attention_mask.bool() and attention_mask is likely 1D, this should work, but it would be clearer to explicitly show the dimensions being used or add shape assertions for debugging.

Suggested change:

```diff
-text_pos = torch.zeros((1, len(input_ids)), dtype=torch.long, device=input_ids.device)
-text_pos[0, valid_mask] = torch.arange(valid_mask.sum().item(), device=input_ids.device)
-return torch.cat([text_pos, vision_pos], dim=0)
+# Assert input shapes for clarity and debugging
+assert input_ids.dim() == 1, f"input_ids should be 1D, got shape {input_ids.shape}"
+assert attention_mask.shape == input_ids.shape, f"attention_mask shape {attention_mask.shape} does not match input_ids shape {input_ids.shape}"
+text_pos = torch.zeros(len(input_ids), dtype=torch.long, device=input_ids.device)
+text_pos[valid_mask] = torch.arange(valid_mask.sum().item(), device=input_ids.device)
+return torch.cat([text_pos.unsqueeze(0), vision_pos], dim=0)
```
Comment on lines +145 to +148

```python
# Local vLLM supports file:// URLs
if not image_path.startswith("file://"):
    image_path = f"file://{os.path.realpath(image_path)}"
image_url = image_path
```
Copilot AI Dec 12, 2025

The comment references "file:// URLs" but the code in the else branch doesn't handle all cases correctly. If image_path is already an absolute path starting with "/" but not with "file://", the code will prepend "file://" correctly. However, if image_path is a relative path (which shouldn't happen based on the logic flow), it would create an invalid file:// URL. Consider adding validation to ensure the path is absolute before constructing the file:// URL, or document the assumption that image_path must be absolute when use_base64_images=False.

Suggested change:

```diff
-# Local vLLM supports file:// URLs
-if not image_path.startswith("file://"):
-    image_path = f"file://{os.path.realpath(image_path)}"
-image_url = image_path
+# Local vLLM supports file:// URLs.
+# Always use os.path.realpath to ensure the path is absolute.
+abs_path = os.path.realpath(image_path)
+image_url = f"file://{abs_path}"
```
```python
        gt_num = float(gt.replace(",", ""))
        if abs(pred_num - gt_num) / max(abs(gt_num), 1e-9) < 0.02:
            return 1.0
    except (ValueError, AttributeError):
```
Copilot AI Dec 12, 2025

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change:

```diff
-    except (ValueError, AttributeError):
+    except (ValueError, AttributeError):
+        # If conversion to float fails, fall back to substring/partial match below.
```
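
For reference, a self-contained sketch of the relaxed-match scoring this hunk implies; the 2% relative tolerance comes from the diff above, while the function name and the string-match fallback are assumptions for illustration.

```python
def relaxed_accuracy(pred: str, gt: str) -> float:
    """Return 1.0 if pred matches gt, treating numeric answers as equal within 2%."""
    try:
        pred_num = float(pred.replace(",", "").strip())
        gt_num = float(gt.replace(",", "").strip())
        if abs(pred_num - gt_num) / max(abs(gt_num), 1e-9) < 0.02:
            return 1.0
    except (ValueError, AttributeError):
        # Not parseable as numbers; fall through to the string comparison.
        pass
    return 1.0 if pred.strip().lower() == gt.strip().lower() else 0.0
```
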
@ultmaster ultmaster changed the title New Example/multimodal Dec 12, 2025
@ultmaster
Contributor

Let's leave the comments for now and resolve them in another PR.

@ultmaster ultmaster merged commit 60f9955 into microsoft:main Dec 12, 2025
26 checks passed