@marcoroth marcoroth commented Nov 7, 2025

This pull request introduces a two-pass algorithm for HTML tag matching to better handle ERB control flow boundaries and provide more accurate and actionable error messages.

Previously, HTML tags were matched during parsing in a single pass, which caused issues when tags appeared across ERB control flow boundaries. The parser would report spurious errors or miss mismatched tags because it couldn't tell that `<% if %>` creates a scope boundary, as in this template:

```erb
<div>
  <% if valid? %>
    <h1>Title
  <% end %>
  </h1>
</div>
```

Now with this pull request:

In the first pass during parsing, the parser collects `HTMLOpenTagNode` and `HTMLCloseTagNode` nodes separately without attempting to match them or perform any validation, so it no longer builds any `HTMLElementNode` during this pass.

In the second pass, after the ERB structure has been fully analyzed, the new `herb_parser_match_html_tags_post_analyze()` function matches tags while respecting ERB control flow scope boundaries.

This function recursively processes all ERB control structures, including `if`/`elsif`/`else`, `case`/`when`, `begin`/`rescue`/`ensure`, and others, attaching any mismatch errors directly to the relevant stray open or close tag nodes.
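
To make the scope rule concrete, here is a minimal Python sketch of the second pass. It is only an illustrative model, not Herb's C implementation: the `OpenTag`, `CloseTag`, `ERBBlock`, and `Element` classes and the `match_tags` function are hypothetical stand-ins, and it assumes the simplest possible recovery (a close tag that doesn't match the innermost open tag in the same scope is flagged as stray).

```python
from dataclasses import dataclass, field


@dataclass
class OpenTag:  # hypothetical stand-in for HTMLOpenTagNode
    name: str
    errors: list = field(default_factory=list)


@dataclass
class CloseTag:  # hypothetical stand-in for HTMLCloseTagNode
    name: str
    errors: list = field(default_factory=list)


@dataclass
class ERBBlock:  # body of an ERB control structure, e.g. <% if %> ... <% end %>
    children: list


@dataclass
class Element:  # hypothetical stand-in for HTMLElementNode
    open_tag: OpenTag
    body: list
    close_tag: CloseTag


def match_tags(nodes):
    """Pair open/close tags inside one scope; ERB blocks get their own scope."""
    result = []  # nodes emitted at the current nesting level
    stack = []   # (open_tag, parent_result) for tags still waiting for a close
    for node in nodes:
        if isinstance(node, ERBBlock):
            # Recurse with a fresh stack: tags cannot match across this boundary.
            result.append(ERBBlock(match_tags(node.children)))
        elif isinstance(node, OpenTag):
            stack.append((node, result))
            result = []
        elif isinstance(node, CloseTag):
            if stack and stack[-1][0].name == node.name:
                open_tag, parent = stack.pop()
                parent.append(Element(open_tag, result, node))
                result = parent
            else:
                # Simplifying assumption: no recovery, just flag the stray tag.
                node.errors.append(f"no matching <{node.name}> in this scope")
                result.append(node)
        else:
            result.append(node)  # text and other nodes pass through
    while stack:
        # Open tags left over at the end of the scope are never closed here.
        open_tag, parent = stack.pop()
        open_tag.errors.append(f"<{open_tag.name}> is not closed in this scope")
        parent.append(open_tag)
        parent.extend(result)
        result = parent
    return result


# The template from the description: <h1> opens inside the `if`, </h1> closes outside.
open_h1, close_h1 = OpenTag("h1"), CloseTag("h1")
tree = match_tags([
    OpenTag("div"),
    ERBBlock([open_h1, "Title"]),
    close_h1,
    CloseTag("div"),
])
print(open_h1.errors)
print(close_h1.errors)
```

In this model the outer `<div>` still becomes a single element, while the `<h1>` inside the `if` and the `</h1>` after `<% end %>` each carry their own error instead of being paired across the boundary.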


Improved HTML Element Matching

[Screenshots: before and after]

ERB Control Error Improvements

[Screenshots: before and after]
@marcoroth marcoroth force-pushed the parse-two-pass-matching branch from 74467cc to 59c08c3 Compare November 7, 2025 03:30
@marcoroth marcoroth merged commit de456b8 into main Nov 7, 2025
24 checks passed
@marcoroth marcoroth deleted the parse-two-pass-matching branch November 7, 2025 04:06
marcoroth added a commit that referenced this pull request Nov 11, 2025
This pull request fixes a critical performance regression in the
two-pass tag matching algorithm introduced in #795.

Documents that previously parsed in ~75ms were taking ~11 seconds (and
in some cases didn't even complete in a reasonable timeframe, see #828).

The `match_tags_visitor` function was causing double recursion by
manually processing array fields through calls to
`match_tags_in_node_array()`, but then returning `true` which instructed
`herb_visit_node` to also automatically traverse those same children a
second time.

This created an exponential explosion where each additional level of
nesting doubled the number of visits.

Resolves #828
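
The effect of the double traversal can be reproduced with a small Python model. This is a sketch under stated assumptions, not the actual `match_tags_visitor` code: `chain` and `count_visits` are hypothetical helpers, and a node is modelled as nothing more than a list of its children.

```python
def chain(depth):
    """Build a node nested `depth` levels deep (a node is a list of children)."""
    node = []
    for _ in range(depth):
        node = [node]
    return node


def count_visits(node, double_traverse):
    """Count visitor calls made while walking `node`."""
    visits = 1  # the visitor is called for this node
    if double_traverse:
        # Modelled bug: the visitor walks the children itself...
        for child in node:
            visits += count_visits(child, double_traverse)
    # ...and the generic walker then traverses the same children.
    for child in node:
        visits += count_visits(child, double_traverse)
    return visits


print(count_visits(chain(10), False))  # 11
print(count_visits(chain(10), True))   # 2047
```

In this model a chain of depth `d` costs `d + 1` visits with a single traversal but `2^(d+1) - 1` visits when every level is walked twice, which is the kind of blow-up that turns milliseconds into seconds on deeply nested documents.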