Parser: Introduce two-pass algorithm for matching HTML Elements #795

marcoroth · 2025-11-07T02:50:53Z

This pull request introduces a two-pass algorithm for HTML tag matching to better handle ERB control flow boundaries and provide more accurate and actionable error messages.

Previously, HTML tags were matched during parsing (single-pass), which caused issues when tags appeared across ERB control flow boundaries. The parser would incorrectly report errors or miss mismatched tags because it couldn't understand that <% if %> creates a scope boundary, as in this template:

<div>
  <% if valid? %>
    <h1>Title
  <% end %>
  </h1>
</div>

Now with this pull request:

In the first pass during parsing, the parser collects HTMLOpenTagNode and HTMLCloseTagNode nodes separately without attempting to match them or perform any validation, so it no longer builds any HTMLElementNode during this pass.

In the second pass after the ERB structure has been fully analyzed, the new herb_parser_match_html_tags_post_analyze() function matches tags while respecting ERB control flow scope boundaries.

This function recursively processes all ERB control structures including if/elsif/else, case/when, begin/rescue/ensure, and others, attaching any mismatch errors directly to the relevant stray open or close tag nodes.
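
To make the scoping idea concrete, here is a minimal Python sketch of a second pass that matches tags per ERB scope. It is not Herb's C implementation: the class names (`OpenTag`, `CloseTag`, `Element`, `ERBBlock`), the `match_tags` function, and the error strings are all hypothetical, and it simplifies by only matching a close tag against the most recent open tag in the same scope.

```python
from dataclasses import dataclass, field


@dataclass
class OpenTag:          # hypothetical stand-in for an open tag node
    name: str
    errors: list = field(default_factory=list)


@dataclass
class CloseTag:         # hypothetical stand-in for a close tag node
    name: str
    errors: list = field(default_factory=list)


@dataclass
class Element:          # a matched open/close pair with its children
    name: str
    children: list


@dataclass
class ERBBlock:         # an ERB control structure such as `if ... end`
    keyword: str
    children: list


def match_tags(nodes):
    """Match open/close tags in one scope; each ERB block is its own scope."""
    result = []
    stack = []  # (open_tag, index of that tag in `result`)
    for node in nodes:
        if isinstance(node, ERBBlock):
            # Recurse with a fresh stack: tags cannot match across this boundary.
            node.children = match_tags(node.children)
            result.append(node)
        elif isinstance(node, OpenTag):
            stack.append((node, len(result)))
            result.append(node)
        elif isinstance(node, CloseTag):
            if stack and stack[-1][0].name == node.name:
                open_tag, start = stack.pop()
                children = result[start + 1:]
                del result[start:]
                result.append(Element(open_tag.name, children))
            else:
                # Attach the error to the stray close tag itself.
                node.errors.append(f"stray closing tag </{node.name}>")
                result.append(node)
        else:
            result.append(node)  # text and other nodes pass through
    for open_tag, _ in stack:
        # Attach the error to the open tag left unmatched in this scope.
        open_tag.errors.append(f"unclosed <{open_tag.name}> in this scope")
    return result


# The template from the description, as the flat list a first pass might produce.
flat = [
    OpenTag("div"),
    ERBBlock("if", [OpenTag("h1"), "Title"]),
    CloseTag("h1"),
    CloseTag("div"),
]
tree = match_tags(flat)
```

In this model the `<div>` becomes a single `Element`, while the `<h1>` inside the `if` and the `</h1>` outside it each carry their own error instead of being paired across the boundary.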

[Figure: Improved HTML Element Matching, before and after]

[Figure: ERB Control Error Improvements, before and after]

This pull request fixes a critical performance regression in the two-pass tag matching algorithm introduced in #795. Documents that previously parsed in ~75ms were taking ~11 seconds (and in some cases didn't even complete in a reasonable timeframe, see #828).

The `match_tags_visitor` function was causing double recursion by manually processing array fields through calls to `match_tags_in_node_array()`, but then returning `true` which instructed `herb_visit_node` to also automatically traverse those same children a second time. This created an exponential explosion where each level of nesting squared the number of visits.

Resolves #828
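
The double recursion can be reproduced in a small Python model. This is only a sketch: `Node`, `visit`, `count_visits`, and `chain` are hypothetical names, not Herb's API, and returning `False` is just one way to avoid the repeat traversal in this model. Here the visit count roughly doubles with each nesting level once the visitor both recurses manually and asks the generic traversal to continue.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    children: list = field(default_factory=list)


def visit(node, visitor):
    """Generic traversal: descend into children only if the visitor returns True."""
    if visitor(node):
        for child in node.children:
            visit(child, visitor)


def count_visits(root, keep_traversing):
    """Count visitor calls when the visitor also walks the children itself."""
    counter = {"visits": 0}

    def visitor(node):
        counter["visits"] += 1
        for child in node.children:      # manual processing of the children
            visit(child, visitor)
        return keep_traversing           # True makes `visit` walk them again

    visit(root, visitor)
    return counter["visits"]


def chain(depth):
    """Build a single-child chain with `depth` levels below the root."""
    node = Node()
    for _ in range(depth):
        node = Node([node])
    return node


print(count_visits(chain(10), keep_traversing=False))  # 11
print(count_visits(chain(10), keep_traversing=True))   # 2047
```

With 10 levels of nesting, the visitor runs 11 times when it handles the children itself and stops the generic traversal, but 2047 times when both traversals run.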

github-actions bot added linter formatter parser typescript c c-extension engine rust labels Nov 7, 2025

marcoroth force-pushed the parse-two-pass-matching branch 2 times, most recently from e108605 to 74467cc on November 7, 2025 03:11

github-actions bot added node rbs labels Nov 7, 2025

marcoroth force-pushed the parse-two-pass-matching branch from 74467cc to 59c08c3 on November 7, 2025 03:30

marcoroth merged commit de456b8 into main Nov 7, 2025
24 checks passed

marcoroth deleted the parse-two-pass-matching branch November 7, 2025 04:06

marcoroth mentioned this pull request Nov 13, 2025

Parser error in v0.8.0 when using if inside else of a case block #860

Closed

2 participants
