feat(browser): implement experimental browser agent #19284
gsquared94 merged 17 commits into google-gemini:main
Summary of Changes
Hello @gsquared94, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a new, experimental browser agent designed to automate web tasks within the Gemini CLI. It provides robust web interaction capabilities, allowing the system to navigate, fill forms, click elements, and extract information from web pages. The agent is built with a focus on isolation and configurability, ensuring it integrates safely and can be tailored to specific use cases, including visual analysis when a vision model is enabled.
Code Review
This is an impressive and well-architected addition of an experimental browser agent. The feature is self-contained, disabled by default, and includes comprehensive documentation and tests. The design choices, such as the isolated BrowserManager, dynamic tool wrapping, and detailed error handling with user-friendly remediation steps, are excellent. The use of prompt engineering techniques like augmenting tool descriptions and post-processing results to improve agent reliability is particularly noteworthy.
I have one high-severity suggestion regarding the performance of the type_text tool to align it with the performance claims in the pull request description.
272804f to fdc40ba
Add extensible browser agent configuration using the agents.overrides pattern:
- Extended AgentOverride interface with customConfig field for agent-specific settings
- Added BrowserAgentCustomConfig type for browser-specific configuration
- Added getAgentOverride() and getBrowserAgentConfig() methods to Config class
- Settings configured via agents.overrides.browser_agent.customConfig
- Updated settings schema with customConfig in AgentOverride definition
This establishes the foundational pattern for configuring the browser agent through the standard agents.overrides infrastructure.
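The lookup described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function names follow the commit message, but the `Settings` shape, the standalone (non-method) form, and the individual `customConfig` fields are assumptions based on the settings mentioned elsewhere in this conversation.

```typescript
// Hypothetical shapes: the real interfaces in the codebase may differ.
type SessionMode = "persistent" | "isolated" | "existing";

interface BrowserAgentCustomConfig {
  sessionMode?: SessionMode; // assumed field name
  headless?: boolean;        // assumed field name
  profilePath?: string;      // assumed field name
  visualModel?: string;      // assumed field name
}

interface AgentOverride {
  enabled?: boolean;
  customConfig?: Record<string, unknown>;
}

interface Settings {
  agents?: { overrides?: Record<string, AgentOverride> };
}

// Generic lookup shared by every agent that uses agents.overrides.
function getAgentOverride(settings: Settings, name: string): AgentOverride | undefined {
  return settings.agents?.overrides?.[name];
}

// Browser-specific view: the agent stays disabled unless explicitly enabled.
function getBrowserAgentConfig(settings: Settings): {
  enabled: boolean;
  customConfig: BrowserAgentCustomConfig;
} {
  const override = getAgentOverride(settings, "browser_agent");
  return {
    enabled: override?.enabled ?? false,
    customConfig: (override?.customConfig ?? {}) as BrowserAgentCustomConfig,
  };
}
```

Keeping `customConfig` as an untyped record on the generic override and narrowing it only in the browser-specific accessor is one way to let other agents reuse the same settings path.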
Implement the browser agent using the LocalAgentDefinition pattern:
- BrowserAgentDefinition: Agent metadata and prompt configuration
- BrowserAgentInvocation: Handles individual browser agent invocations
- BrowserAgentFactory: Creates agent definitions with dynamic MCP tools
- BrowserManager: Manages chrome-devtools-mcp connection lifecycle
Uses getBrowserAgentConfig() to read settings from agents.overrides.browser_agent
Implement the visual agent using the LocalAgentDefinition pattern:
- VisualAgentDefinition: Agent metadata for coordinate-based visual tasks
- delegateToVisualAgent.ts: Tool for semantic agent to delegate visual tasks
- Uses gemini-2.5-computer-use-preview-10-2025 model for Computer Use capability
The visual agent handles tasks requiring visual identification or precise coordinate-based actions that cannot be done via the accessibility tree.
…ation
Fix chrome-devtools-mcp CLI flags:
- --existing (invalid) → --autoConnect for existing session mode
- --profile-path (invalid) → --userDataDir for custom profile path
- Default session mode changed from 'isolated' to 'persistent'
Add 'persistent' session mode (new default) which uses a persistent Chrome profile at ~/.cache/chrome-devtools-mcp/chrome-profile.
Add connection timeout and actionable error for 'existing' mode when Chrome remote debugging is not enabled.
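As a rough sketch of the flag mapping in this commit, the helper below builds an argument list from the session mode. `buildMcpArgs`, `LaunchOptions`, and `profilePath` are hypothetical names; only the two flags named in the commit message are used, and whether the default persistent profile is passed explicitly is not stated, so the sketch passes `--userDataDir` only for a custom path.

```typescript
// Hypothetical helper, not the PR's code.
type SessionMode = "persistent" | "isolated" | "existing";

interface LaunchOptions {
  sessionMode?: SessionMode; // defaults to "persistent", per the commit message
  profilePath?: string;      // hypothetical name for the custom profile setting
}

function buildMcpArgs(options: LaunchOptions = {}): string[] {
  const mode = options.sessionMode ?? "persistent";
  const args: string[] = [];
  if (mode === "existing") {
    // Attach to an already-running Chrome instead of launching a new one.
    args.push("--autoConnect");
  } else if (mode === "persistent" && options.profilePath) {
    // Custom profile location for the persistent session.
    args.push("--userDataDir", options.profilePath);
  }
  // "isolated": the flag for this mode is not named in the commit message,
  // so it is deliberately left out of this sketch.
  return args;
}
```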
- Add custom type_text tool that types a full string by internally calling press_key for each character, turning N model round-trips into 1. Dramatically speeds up text input in complex web apps.
- Move tool-specific usage rules from system prompt to individual tool descriptions via augmentToolDescription() for better organization and token efficiency.
- Add terminal failure handling instructions to system prompt (Chrome connection errors, browser crashes, repeated errors) with specific remediation steps.
- Add complex web app guidance (spreadsheets, rich editors) to system prompt, recommending type_text + keyboard navigation.
- Fix augmentToolDescription key ordering so more-specific keys (fill_form, click_at) match before shorter keys (fill, click).
- Remove non-existent tool references (scroll, type_text as MCP tool) and add click_at hint for vision tool.
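The key-ordering fix in this commit can be illustrated with a small sketch. The substring matching rule, the `TOOL_HINTS` table, and the hint wording are all assumptions made for illustration; only the function name and the four tool keys come from the commit message.

```typescript
// Hypothetical hint table. Order matters: specific keys precede the shorter
// keys they contain, so "fill_form" is tested before "fill".
const TOOL_HINTS: Array<[string, string]> = [
  ["fill_form", "Hint (placeholder): prefer this for several fields at once."],
  ["click_at", "Hint (placeholder): takes screen coordinates."],
  ["fill", "Hint (placeholder): takes a uid from the latest snapshot."],
  ["click", "Hint (placeholder): takes a uid from the latest snapshot."],
];

function augmentToolDescription(toolName: string, description: string): string {
  for (const [key, hint] of TOOL_HINTS) {
    // Assumed matching rule: first key contained in the tool name wins.
    if (toolName.includes(key)) {
      return `${description} ${hint}`;
    }
  }
  return description; // no hint registered for this tool
}
```

With the shorter keys listed first, `fill_form` would pick up the `fill` hint, which is the kind of mismatch the commit describes fixing.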
Vision (screenshot analysis + coordinate-based interactions) is now disabled by default. Set visualModel in browser_agent customConfig to enable it, e.g. visualModel: 'gemini-2.5-computer-use-preview-10-2025'.
…sion modes, and security.
…n errors
- Add submitKey parameter to type_text tool for pressing Enter/Tab/etc after typing, eliminating a separate model round-trip per value entry
- Update system prompt and tool hints to guide model toward type_text with submitKey instead of per-character press_key calls
- Refactor connection error handling into createConnectionError() with session-mode-aware remediation messages for profile locks, timeouts, and generic failures
- Update terminal failure prompts to pass through error remediation verbatim instead of hardcoding instructions
- Add tests for profile-lock, timeout, and generic connection errors
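A session-mode-aware error factory along the lines described above might look like the sketch below. The three failure kinds come from the commit message; the detection patterns in `classifyFailure` and every remediation string are invented for illustration and are not the messages used in the PR.

```typescript
type SessionMode = "persistent" | "isolated" | "existing";
type FailureKind = "profile-lock" | "timeout" | "generic";

// Assumed heuristics: the real detection logic is not shown in this PR text.
function classifyFailure(message: string): FailureKind {
  const text = message.toLowerCase();
  if (text.includes("already running") || text.includes("lock")) return "profile-lock";
  if (text.includes("timed out") || text.includes("timeout")) return "timeout";
  return "generic";
}

function createConnectionError(message: string, mode: SessionMode): Error {
  const kind = classifyFailure(message);
  let remediation: string;
  if (kind === "profile-lock") {
    remediation = "Close other Chrome instances that use this profile, then retry.";
  } else if (kind === "timeout" && mode === "existing") {
    remediation = "Enable remote debugging at chrome://inspect/#remote-debugging, then retry.";
  } else if (kind === "timeout") {
    remediation = "Chrome did not respond in time; retry the request.";
  } else {
    remediation = "Retry with --debug to see the underlying server output.";
  }
  // The remediation travels inside the error so callers can pass it through verbatim.
  return new Error(`Browser connection failed (${kind}): ${message}\n${remediation}`);
}
```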
The system prompt always included the VISUAL IDENTIFICATION section telling the model about analyze_screenshot, even when visualModel was not configured. This caused the model to attempt calling the tool despite it not being registered.
- Convert BROWSER_SYSTEM_PROMPT to buildBrowserSystemPrompt(visionEnabled)
- Pass vision state from factory to definition builder
- Remove analyze_screenshot reference from click_at tool description
- Add tests for conditional prompt inclusion/exclusion
- Fix misleading test comment about tool count
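The conversion from a constant to a builder can be sketched as follows. The function name and the `visionEnabled` parameter come from the commit message; the prompt text is placeholder wording, not the actual system prompt.

```typescript
// Placeholder prompt fragments (assumed wording).
const BASE_PROMPT = "You are a browser agent. Use take_snapshot and uid-based tools.";
const VISUAL_SECTION =
  "VISUAL IDENTIFICATION: call analyze_screenshot when an element cannot be found in the snapshot.";

// The visual section is appended only when a visual model is configured,
// so the model never hears about a tool that is not registered.
function buildBrowserSystemPrompt(visionEnabled: boolean): string {
  return visionEnabled ? `${BASE_PROMPT}\n\n${VISUAL_SECTION}` : BASE_PROMPT;
}
```

A caller would derive the flag from configuration, for example `buildBrowserSystemPrompt(Boolean(customConfig.visualModel))`, where `customConfig` is the hypothetical browser agent config object.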
…n mode, headless, profile path, and visual model settings.
Updated the browser_agent description from a primitive-focused listing (navigating, filling, clicking) to a goal-oriented description that emphasizes autonomy, multi-step reasoning, and dynamic feedback interpretation. This encourages the parent agent to delegate entire tasks in a single call rather than micromanaging individual browser actions.
352d98e to c1560d9
/gemini review
Code Review
This pull request introduces a new experimental browser agent, a significant feature addition. The implementation is well-structured across multiple new files, including comprehensive documentation and tests. The architecture for managing the browser lifecycle and wrapping MCP tools is robust, with good error handling and attention to user experience.
I've identified a few high-severity issues related to the propagation of AbortSignal for cancellation, which should be addressed to ensure the agent can be reliably cancelled, preventing potential resource leaks or unintended continued execution. Additionally, there's a suggestion for simplifying redundant cancellation logic.
Great work implementing this experimental feature! The architecture and test coverage look solid. I just had one minor piece of feedback regarding the strict development rules. In the test code:
// We accept any cast here because we are mocking the interface
mockMessageBus = { ... } as unknown as MessageBus;
// ...
// Use "any" to access protected method
// eslint-disable-next-line @typescript-eslint/no-explicit-any
const details = await (invocation as any).getConfirmationDetails( ... );
Our strict development guidelines forbid casts like these. Otherwise, the PR looks great!
- Remove redundant Promise.race in McpToolInvocation.execute (event listener leak)
- Propagate AbortSignal to all press_key calls (submitKey + typeCharByChar)
- Call this.close() on connectMcp failure (zombie process leak)
- Set showInDialog: false for all browser settings
- Remove debug log truncation in analyzeScreenshot
- Fix misleading --experimental-vision error message
- Replace any casts with typed TestableConfirmation interface in tests
- Update license year to 2026 in all browser agent files
- Merge duplicate imports in mcpToolWrapper
- Add sync comment to BrowserAgentCustomConfig
- Update subagents.md Chrome requirement wording
- Regenerate settings docs
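To illustrate the AbortSignal propagation item, here is a minimal sketch of a per-character typing loop that forwards the signal to every key press, including the final submit key. `typeText`, `PressKey`, and the options object are hypothetical; the injected `pressKey` callback stands in for the real press_key tool call.

```typescript
// Hypothetical callback type standing in for the press_key tool.
type PressKey = (key: string, signal?: AbortSignal) => Promise<void>;

async function typeText(
  pressKey: PressKey,
  text: string,
  options: { submitKey?: string; signal?: AbortSignal } = {},
): Promise<number> {
  const { submitKey, signal } = options;
  let pressed = 0;
  for (const char of text) {
    // Check before each key so cancellation stops the loop promptly.
    if (signal?.aborted) throw new Error("type_text cancelled");
    await pressKey(char, signal);
    pressed++;
  }
  if (submitKey) {
    // The submit key gets the same signal as the character presses.
    if (signal?.aborted) throw new Error("type_text cancelled");
    await pressKey(submitKey, signal);
    pressed++;
  }
  return pressed; // number of keys actually sent
}
```

Passing the signal into `pressKey` as well as checking it in the loop covers both cases: cancellation between keys and cancellation while a single key press is in flight.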
593273f to 3eb3f05
3eb3f05 to 89fd568
89fd568 to 2af976e
2af976e to 4ad9be5
Fixed this. It should no longer corrupt the UI in alternate buffer mode.
Will revise the browser connection management along with the devtools mcp dependency management in an immediate followup PR.
Pipe stderr from npx chrome-devtools-mcp instead of inheriting it. The server's banner warnings were leaking into the terminal and corrupting the Ink-based UI in alternate buffer mode. Piped output is forwarded to debugLogger so it remains visible with --debug.
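The forwarding side of this change can be sketched as below, assuming the child process was spawned with its stderr set to "pipe" rather than "inherit". `forwardStderr` and `StderrSource` are hypothetical names, and the `debugLog` callback stands in for debugLogger; the line prefix is invented for illustration.

```typescript
// Minimal structural type so the sketch does not depend on Node typings;
// a piped child-process stderr stream is expected to fit this shape.
interface StderrSource {
  on(event: "data", listener: (chunk: unknown) => void): unknown;
}

function forwardStderr(stderr: StderrSource, debugLog: (line: string) => void): void {
  let buffered = "";
  stderr.on("data", (chunk) => {
    // Chunks may split a line, so keep the trailing partial line buffered.
    buffered += String(chunk);
    const lines = buffered.split("\n");
    buffered = lines.pop() ?? "";
    for (const line of lines) {
      if (line.trim() !== "") debugLog(`[chrome-devtools-mcp] ${line}`);
    }
  });
}
```

Because nothing is written to the terminal directly, the server's banner output can no longer interleave with the Ink-rendered UI; it only surfaces wherever `debugLog` sends it.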
e84b0b4 to 7d32f2b
@jacob314 addressed all your comments. PTAL
9e95b8b

Summary
Implements an experimental browser_agent subagent that enables semantic and visual web browsing via chrome-devtools-mcp. The agent uses the Accessibility Tree for reliable element identification and supports optional vision-based interactions via screenshots.
🛡️ Safe & Localized Implementation
- packages/core/src/agents/browser/ (12 new files).
- AgentRegistry registration (+8 lines), Config schema (+68 lines), subagent-tool-wrapper (+13 lines). No existing agents, tools, or utilities were modified.
Architecture
Session Modes
Configured via agents.overrides.browser_agent.customConfig.sessionMode:
- persistent (default): ~/.gemini/cli-browser-profile/. Preserves logins/cookies across sessions.
- isolated
- existing: --autoConnect. Requires remote debugging enabled at chrome://inspect/#remote-debugging.
new_page/list_pages timeouts.
Key Features
- take_snapshot → uid-based click/fill for reliable element interactions.
- visualModel config enables analyze_screenshot for coordinate-based interactions via click_at. The system prompt conditionally includes visual instructions only when enabled.
- type_text composite tool: Uses document.execCommand('insertText') for fast bulk text entry with optional submitKey (Enter, Tab, etc.).
- clearTimeout: Prevents timer leaks via finally block.
- docs/core/subagents.md and docs/cli/settings.md updated with browser agent config reference.
Test Coverage
53 unit tests across 6 test files:
- browserManager.test.ts
- browserAgentFactory.test.ts
- mcpToolWrapper.test.ts
- analyzeScreenshot.test.ts
- browserAgentInvocation.test.ts
- mcpToolWrapperConfirmation.test.ts
How to Validate
1. Enable the agent
Add to .gemini/settings.json:
{ "agents": { "overrides": { "browser_agent": { "enabled": true } } } }
2. Example prompts to try (interactive mode)
Basic navigation & extraction:
Multi-step autonomous task:
Authenticated workflow (use existing session mode, with a signed-in account):
3. What to verify
- new_page, click, fill should show "Action Required" before executing (unless YOLO mode is active)
- 🌐💭 Navigating to...) as the agent works
- browser_agent call, not call it repeatedly for each step
The Browser Agent status card may briefly appear duplicated: a frozen card above and a live card below. This seems like an Ink rendering artifact that occurs when <Static> history items (like hint messages or update notifications) are added while a tool card is in the pending area.
Pre-Merge Checklist
- docs/core/subagents.md, docs/cli/settings.md)
Fixes #15956