Multi-modal example: ChartQA #379
Conversation
totoluo commented Dec 7, 2025
- Added an example of a ChartQA agent
- Added position ID conversion for M-RoPE
- Added an image/path message format for OpenAI-compatible MLLM APIs
@microsoft-github-policy-service agree
ultmaster left a comment
Thanks for your contribution to this project! I have several minor comments that I hope can be addressed before we land this on main.
Generally, I want this PR to be merged quickly; we can collaborate to add more tests and refine the documentation later.
    @@ -0,0 +1,229 @@
    """Utilities for multimodal model support in VERL training.
Are these functions not importable directly from VERL?
get_rope_index and the processor are imported directly from VERL; the current compute_mrope_position_ids and get_image_grid_thw are just thin wrappers made compatible with agl's AgentModeDaemon. As for get_image_grid_thw, VERL processes images inside the dataset's __getitem__ and does not return it explicitly. agl builds get_train_data_batch from rollout IDs in AgentModeDaemon, so we need to load the images and compute it from the original sample. We could alternatively override the dataset's __getitem__, but that seems like a bigger change.
However, the descriptions in this file are redundant; I'll refactor to make them simpler and move them into daemon.py, since they are only used by the daemon.
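For context, here is a minimal sketch of what such a get_image_grid_thw wrapper could look like, assuming a Hugging Face Qwen2-VL style processor whose image processor returns an image_grid_thw tensor. The function name and arguments are illustrative and not necessarily the exact code in this PR.

```python
# Hypothetical sketch only; the actual wrapper in this PR may differ.
from typing import List, Optional

import torch
from PIL import Image


def get_image_grid_thw(processor, image_paths: List[str]) -> Optional[torch.Tensor]:
    """Load images from disk and recover their (t, h, w) patch grids.

    Assumes `processor` is a Qwen2-VL style HF processor whose image
    processor returns an "image_grid_thw" tensor of shape (num_images, 3).
    """
    if not image_paths:
        return None
    images = [Image.open(p).convert("RGB") for p in image_paths]
    features = processor.image_processor(images=images, return_tensors="pt")
    return features["image_grid_thw"]
```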
Hey @ultmaster, I have updated the branch based on all the comments, appreciate any reviews!
examples/chartqa/README.md
Outdated
    ```bash
    uv pip install datasets pillow pandas pyarrow nest_asyncio
    uv pip install "langgraph<1.0" "langchain[openai]<1.0" "langchain-community"
We have upgraded langchain to 1.0+ on main. Please upgrade your langchain accordingly.
examples/chartqa/README.md
Outdated
    @@ -0,0 +1,100 @@
    # ChartQA Example

    [](https://github.com/microsoft/agent-lightning/actions/workflows/examples-chartqa.yml)
Let's remove this until we really have a CI for this.
pyproject.toml
Outdated
    apo = [
        "poml",
    ]
    multimodal = [
Please upload a new uv.lock.
examples/chartqa/download_chartqa.sh
Outdated
    @@ -0,0 +1,6 @@
    #!/bin/bash
Black-box one-click scripts are opaque and hard to debug. Please remove this script and add clear instructions to the README instead.
@totoluo The code looks generally good now. I think I'll test it myself before I get it merged. Could you tell me the minimum resource requirements to run this example?
Hi @ultmaster, I used 2 GPUs with 40 GB memory. Additionally, I used vllm==0.10.2 because I hit the flash-attention issue mentioned in pyproject.toml with vllm 0.11.0. That was the only glitch I ran into when testing locally.
@totoluo I'm testing with vllm 0.11.0 (I know 0.11.1+ won't work), but I encountered the following error when running this command:

    uv run --no-sync vllm serve Qwen/Qwen2-VL-2B-Instruct \
        --gpu-memory-utilization 0.6 \
        --max-model-len 4096 \
        --allowed-local-media-path /home/xxx/Projects/agl-second-bench/examples/chartqa/data \
        --enable-prefix-caching \
        --port 8088

Have you tried vllm 0.11.0? Have you seen this before? It seems to be related to this issue: vllm-project/vllm#27340

I don't want to downgrade to vllm 0.10.2, because it previously did not work well with verl 0.6.0.
I also want to let you know that I've made some minor changes to your code to fix lint issues and prepare for the CI. I haven't managed to run the example code yet, though.
Hello @ultmaster, I ran into this exact same issue. It's related to vllm-project/vllm#25926, and the fix was merged in vllm-project/vllm#26219. However, the commit has only been included in
Hi @totoluo, thanks for the update. I tested the code on one A100 and can confirm that it passes the sanity check and runs successfully: https://github.com/microsoft/agent-lightning/actions/runs/20140706916/job/57808847154 I have one question though: why are the batch size, train batch size, logprob batch size, and mini batch size all set to 1? Have you tried a more realistic setting? Is training successful in terms of accuracy on the ChartQA dataset?
/ci
🚀 CI Watcher for correlation id-3645291658-mj2k48tj triggered by comment 3645291658
✅ All runs completed.
Pull request overview
This PR adds comprehensive multimodal support to AgentLightning by introducing a ChartQA example demonstrating visual reasoning with LangGraph. The implementation includes M-RoPE position embedding support for Qwen2-VL models and image handling utilities for OpenAI-compatible MLLM APIs.
Key changes:
- New ChartQA example with multi-step reasoning workflow (observe → extract → calculate → check → refine)
- Core framework enhancements for M-RoPE position IDs computation in VERL training pipeline
- Image processing utilities supporting both base64 encoding (cloud APIs) and file:// URLs (local vLLM)
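As a sketch of the last point above: a minimal, hypothetical helper for building an OpenAI-style image message that switches between base64 data URLs (for cloud APIs) and file:// URLs (for a local vLLM server started with --allowed-local-media-path). The function name and flag are illustrative and may differ from the PR's multimodal_utils.py.

```python
# Hypothetical sketch; the PR's actual utilities may differ in names and details.
import base64
import os


def build_image_message(image_path: str, question: str, use_base64_images: bool = True) -> dict:
    """Return an OpenAI-compatible user message containing one image and a question."""
    if use_base64_images:
        # Cloud APIs: inline the image as a base64 data URL.
        with open(image_path, "rb") as f:
            encoded = base64.b64encode(f.read()).decode("utf-8")
        image_url = f"data:image/png;base64,{encoded}"
    else:
        # Local vLLM: pass an absolute file:// URL pointing at the allowed media directory.
        image_url = f"file://{os.path.realpath(image_path)}"
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": question},
        ],
    }
```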
Reviewed changes
Copilot reviewed 13 out of 14 changed files in this pull request and generated 9 comments.
Summary per file:
| File | Description |
|---|---|
| uv.lock | Added image-related dependencies (av, qwen-vl-utils, datasets, pandas, pillow, pyarrow) |
| pyproject.toml | Defined new "image" dependency group for multimodal support |
| examples/chartqa/train_chartqa_agent.py | Training orchestration with VERL algorithm and configurable hyperparameters |
| examples/chartqa/prompts.py | LangChain prompt templates for the multi-step reasoning workflow |
| examples/chartqa/prepare_data.py | Dataset download and preprocessing script for HuggingFace ChartQA |
| examples/chartqa/multimodal_utils.py | Image encoding utilities for base64 conversion and message formatting |
| examples/chartqa/env_var.py | Environment variable configuration for ChartQA paths and API settings |
| examples/chartqa/debug_chartqa_agent.py | Debugging utilities for testing with cloud APIs or local vLLM proxy |
| examples/chartqa/chartqa_agent.py | Main agent implementation with LangGraph state machine and evaluation logic |
| examples/chartqa/README.md | Comprehensive documentation with setup and usage instructions |
| agentlightning/verl/trainer.py | Passes processor and image_base_dir to daemon for M-RoPE support |
| agentlightning/verl/daemon.py | Implements M-RoPE position_ids computation and image grid handling for Qwen2-VL |
| .github/workflows/examples-rag.yml | Fixed artifact naming consistency |
| .github/workflows/examples-chartqa.yml | New CI workflow for ChartQA example testing and training validation |
    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "token-abc123")

    OPENAI_MODEL = os.getenv("OPENAI_MODEL", "gpt-4.1-mini")
Copilot AI commented Dec 12, 2025
The model name "gpt-4.1-mini" appears to be incorrect. OpenAI's current models are named "gpt-4o-mini", "gpt-4o", etc. The ".1" versioning pattern is not standard for OpenAI models. This should likely be "gpt-4o-mini" or another valid model identifier.
Suggested change:

    OPENAI_MODEL = os.getenv("OPENAI_MODEL", "gpt-4o-mini")
| if "THE ANSWER IS CORRECT" in last_message.content: # type: ignore | ||
| if "THE ANSWER IS INCORRECT" in last_message.content: # type: ignore | ||
| correct_index = last_message.content.rfind("THE ANSWER IS CORRECT") # type: ignore | ||
| incorrect_index = last_message.content.rfind("THE ANSWER IS INCORRECT") # type: ignore | ||
| if correct_index > incorrect_index: | ||
| return END | ||
| else: | ||
| return END |
Copilot AI commented Dec 12, 2025
The condition logic appears incorrect. When both "CORRECT" and "INCORRECT" are present in the message, the code checks if "CORRECT" appears after "INCORRECT" to decide whether to end. However, this doesn't handle the edge case where neither substring appears, which would still fall through to checking turn limits. More critically, if only "INCORRECT" appears without "CORRECT", the code would incorrectly continue to check turn limits rather than explicitly routing to refinement. Consider restructuring to check for "INCORRECT" first, then "CORRECT", with explicit handling for each case.
| if "THE ANSWER IS CORRECT" in last_message.content: # type: ignore | |
| if "THE ANSWER IS INCORRECT" in last_message.content: # type: ignore | |
| correct_index = last_message.content.rfind("THE ANSWER IS CORRECT") # type: ignore | |
| incorrect_index = last_message.content.rfind("THE ANSWER IS INCORRECT") # type: ignore | |
| if correct_index > incorrect_index: | |
| return END | |
| else: | |
| return END | |
| content = last_message.content # type: ignore | |
| has_correct = "THE ANSWER IS CORRECT" in content | |
| has_incorrect = "THE ANSWER IS INCORRECT" in content | |
| if has_incorrect and has_correct: | |
| # If both are present, decide based on which comes last | |
| correct_index = content.rfind("THE ANSWER IS CORRECT") | |
| incorrect_index = content.rfind("THE ANSWER IS INCORRECT") | |
| if correct_index > incorrect_index: | |
| return END | |
| else: | |
| return "refine_answer" | |
| elif has_incorrect: | |
| return "refine_answer" | |
| elif has_correct: | |
| return END |
examples/chartqa/README.md
Outdated
    USE_LLM_PROXY=1 \
    OPENAI_API_BASE=http://localhost:8088/v1 \
    OPENAI_MODEL=Qwen/Qwen2-VL-2B-Instruct \
    python chartqa_agent.py
Copilot AI commented Dec 12, 2025
The comment in line 87 says to run "python chartqa_agent.py" but the correct command based on the file structure should be "python debug_chartqa_agent.py". This inconsistency could confuse users following the instructions.
Suggested change:

    python debug_chartqa_agent.py
examples/chartqa/README.md
Outdated
    Then run the training script with the external store address:

    ```bash
    AGL_MANAGED_STORE=0 python train_chartqa_agent.py fast --external-store-address http://localhost:4747
Copilot AI commented Dec 12, 2025
The comment mentions "fast" as a config option but the code only defines three configs: "debug", "qwen", and "ci". Using "fast" in the command line example will cause an error. This should be updated to use one of the actual available config options like "qwen" or "debug".
Suggested change:

    AGL_MANAGED_STORE=0 python train_chartqa_agent.py qwen --external-store-address http://localhost:4747
    # Compute image_grid_thw for this triplet using image_urls from prompt
    if self._use_mrope:
        image_urls = trace.get("image_urls", [])
        image_grid_thw_list.append(self._get_image_grid_thw(image_urls))
Copilot AI commented Dec 12, 2025
When multiple triplets are processed for a single rollout, all of them share the same image_urls list from the first triplet. This could lead to incorrect image_grid_thw computations for subsequent triplets if they have different images. The code should iterate through triplets and compute image_grid_thw for each triplet individually, matching each to its corresponding input sequence.
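To make the suggestion concrete, here is a hypothetical sketch of per-triplet handling, assuming each triplet dict carries its own image_urls. Names are illustrative; the grid computation is passed in as a callable and this does not reflect the daemon's actual internals.

```python
# Hypothetical sketch; not the actual AgentModeDaemon code.
from typing import Any, Callable, Dict, List, Optional

import torch


def collect_image_grid_thw(
    triplets: List[Dict[str, Any]],
    get_image_grid_thw: Callable[[List[str]], Optional[torch.Tensor]],
    use_mrope: bool,
) -> List[Optional[torch.Tensor]]:
    """Compute image_grid_thw for each triplet from its own image_urls."""
    grids: List[Optional[torch.Tensor]] = []
    for triplet in triplets:
        if not use_mrope:
            grids.append(None)
            continue
        # Use the image URLs attached to this specific triplet, not the first one's.
        image_urls = triplet.get("image_urls", [])
        grids.append(get_image_grid_thw(image_urls))
    return grids
```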
    # For mrope (3D position_ids), use the first dimension (text position_ids) for eos calculation
    if self._use_mrope:
        # position_ids is (batch_size, 4, seq_length), use first dim for text positions
        text_position_ids = position_ids[:, 0, :]  # (batch_size, seq_length)
        eos_mask_idx = torch.argmax(text_position_ids * attention_mask, dim=-1)  # (bsz,)
Copilot AI commented Dec 12, 2025
The variable name "text_position_ids" is misleading because it extracts the first dimension of position_ids, which for M-RoPE includes text positions but the comment suggests this is specifically for text. According to M-RoPE documentation, dimension 0 contains temporal/sequential positions that work for both text and vision tokens. Consider renaming to "sequential_position_ids" or "temporal_position_ids" to better reflect its actual purpose.
Suggested change:

    # For mrope (3D position_ids), use the first dimension (sequential/temporal position_ids) for eos calculation
    if self._use_mrope:
        # position_ids is (batch_size, 4, seq_length), use first dim for sequential/temporal positions (per M-RoPE docs)
        sequential_position_ids = position_ids[:, 0, :]  # (batch_size, seq_length)
        eos_mask_idx = torch.argmax(sequential_position_ids * attention_mask, dim=-1)  # (bsz,)
    text_pos = torch.zeros((1, len(input_ids)), dtype=torch.long, device=input_ids.device)
    text_pos[0, valid_mask] = torch.arange(valid_mask.sum().item(), device=input_ids.device)

    return torch.cat([text_pos, vision_pos], dim=0)
Copilot AI commented Dec 12, 2025
The position_ids computation for M-RoPE creates a zero tensor for text_pos and then fills valid positions, but the logic appears to have an issue. The code creates text_pos with shape (1, len(input_ids)), but input_ids is already a 1D tensor, so len(input_ids) gives the sequence length. However, the indexing text_pos[0, valid_mask] assumes valid_mask has been properly broadcast or is 1D. Since valid_mask comes from attention_mask.bool() and attention_mask is likely 1D, this should work, but it would be clearer to explicitly show the dimensions being used or add shape assertions for debugging.
Suggested change:

    # Assert input shapes for clarity and debugging
    assert input_ids.dim() == 1, f"input_ids should be 1D, got shape {input_ids.shape}"
    assert attention_mask.shape == input_ids.shape, f"attention_mask shape {attention_mask.shape} does not match input_ids shape {input_ids.shape}"
    text_pos = torch.zeros(len(input_ids), dtype=torch.long, device=input_ids.device)
    text_pos[valid_mask] = torch.arange(valid_mask.sum().item(), device=input_ids.device)
    return torch.cat([text_pos.unsqueeze(0), vision_pos], dim=0)
    # Local vLLM supports file:// URLs
    if not image_path.startswith("file://"):
        image_path = f"file://{os.path.realpath(image_path)}"
    image_url = image_path
Copilot AI commented Dec 12, 2025
The comment references "file:// URLs" but the code in the else branch doesn't handle all cases correctly. If image_path is already an absolute path starting with "/" but not with "file://", the code will prepend "file://" correctly. However, if image_path is a relative path (which shouldn't happen based on the logic flow), it would create an invalid file:// URL. Consider adding validation to ensure the path is absolute before constructing the file:// URL, or document the assumption that image_path must be absolute when use_base64_images=False.
Suggested change:

    # Local vLLM supports file:// URLs.
    # Always use os.path.realpath to ensure the path is absolute.
    abs_path = os.path.realpath(image_path)
    image_url = f"file://{abs_path}"
        gt_num = float(gt.replace(",", ""))
        if abs(pred_num - gt_num) / max(abs(gt_num), 1e-9) < 0.02:
            return 1.0
    except (ValueError, AttributeError):
Copilot AI commented Dec 12, 2025
The 'except' clause does nothing but pass, and there is no explanatory comment.
Suggested change:

    except (ValueError, AttributeError):
        # If conversion to float fails, fall back to substring/partial match below.
Let's leave the comments for now and resolve them in another PR.