feat: add byte tracking by OGKevin · Pull Request #745 · servo/html5ever

OGKevin · 2026-06-16T07:41:07Z

Byte Tracking

This PR introduced the ability for TreeSink implementations to know byte positions of Tokens. This is done by making BufferQueue track byte consumption and expose methods for Tokenizer to get this info, which than passes it forwards to TreeSink.

BufferQueue

BufferQueue gets a bytes_consumed field that tracks bytes manipulated via:

pop_except_from
advance_bytes_consumed
retreat_bytes_consumed
eat
next

The "consuming" methods (pop, eat, next) automatically advance the counter. While push_back and push_front don't move the counter. This is due to some code paths pushing front without actually consuming bytes, so this would corrupt the counter. Hence it's the callers responsibility to retreat.

Tokenizer

The main use case of this PR is to know where a token began in bytes, not necessarily the raw byte position of a character (what BufferQueue::bytes_consumed tracks.

Since Tokenizer is aware of "Tokens" e.g. Tag, Comment and Character tokens, it is in a good position to dictate what the correct byte is that needs to be reported to TreeSink.

For Tag, Comment and Doctype, the byte offset is that start of the token. So for tag <some-tag></some-tag>, its offset would be 0. And for tag p in <some-tag></some-tag> it would be 10.

For character tokens, it wold be that start of the first character.
e.g. abc the text abcwould have offset3`.

This is done by keeping track of when we saw the last < when processing a token. This is stored as token_start_byte. As we continue to process, we always write to last_token_end_byte at the end of each token processing so it can be used when we get CharacterTokens. In the above example, it would contain the byte of a. As when we get the character token abc we would be at byte 6 but we want to communicate that the character token started at 3.

TokenSink

Token sink get's a new

    #[cfg(feature = "source-positions")]
    fn set_current_byte(&self, _byte_offset: u64) {}

Method instead of updating

    /// Process a token.
    fn process_token(&self, token: Token, line_number: u64) -> TokenSinkResult<Self::Handle>;

And breaking the signature.

TreeSink

For folks who implement their own Dom struct, TreeSink is able to also pass the byte offset forward.

An example of this can be seen in source-positions-integration.rs

    impl TreeSink for ByteCapturingDOM {

Conclusion

BufferQueue holds raw position, Tokenizer converts raw position to an offset, and this gets passed down to TokenSink and TreeSink.

_{This was manually written ⌨️. Mainly to proof that human 🧠 and hands were used .}

Supporting changes

Cargo

I had to make some small changes to the cargo.yaml files so that cargo would be happy to compile this project as a submodule: https://github.com/OGKevin/cadmus/blob/65cdaaabb5ab795c668b32a4206f8a49fc9e2661/Cargo.toml#L26-L28

`xhtml-self-closing` feature

This feature is needed to properly process self closing tags such as <title/>, which might not be valid html5, but seems to be valid xhtml. Without this feature, html5ever swallows everything while looking for the closing title tag.

xhtml-self-closing-integration.rs has test case that showcases this. Well, not the "bug" per se, just how this feature fixes the issue.

Ref: baskerville/plato#426 (comment)

🤖 ~~was used to generate some of the rustdocs and tests.~~ Was also used to educate me on the project and brainstorm implementation path.

Closes: #734

Ref:

simonwuelker · 2026-06-16T21:00:33Z

🤖 was used to generate some of the rustdocs and tests. Was also used to educate me on the project and brainstorm implementation path.

Note that servo does not accept AI-generated contributions: https://book.servo.org/contributing/getting-started.html#ai-contributions.

Admittedly we should do a better job of communicating that this applies to all repositories under the servo org, not just servo/servo.

OGKevin · 2026-06-17T05:02:32Z

Ahh, that might make things interesting now. I wasn't aware, mybad.

What now? As I see 2 solutions:

Remove the tests and rustdocs, as thats the easiest way to remove all 🤖 touched code, but would also significantly lower the quality of the PR.
Re-write them, but I'm now biased to what I've seen, reviewed and agreed with, so my manual rewrite might not be significantly different. Unless you have some feedback on it that makes it look diff. However, how would I then proof that was scaffolded by 🤖 is now 100% manually typed?

The tests turned out to be useful as they caught some gaps in the implementation, but since they've "served their purpose" I'm fine with removing them. As In this specific situation, I've similar tests in Cadmus that covers similar cases.

Do note that tests are the biggest diff:

I'm not challenging the usage of 🤖, just not sure how to move forward. Unfortunately, my daily life has come to 🤖 code reviewer. So I take the small manual implementations where I can, was just lazy when it came to tests and rustdocs 🙉 😭.

simonwuelker · 2026-06-17T07:48:26Z

I'd be fine with you rewriting the tests and rustdoc.

but I'm now biased to what I've seen, reviewed and agreed with, so my manual rewrite might not be significantly different.

That's okay.

However, how would I then proof that was scaffolded by 🤖 is now 100% manually typed?

You don't need to prove that. We rely on the honesty of contributors when rejecting AI generated patches (unless the patch is obviously slop). Thanks for being upfront about it.

This commit introduced a new feature flag that enables the ability to track byte offsets. When the source-positions feature is enabled, the tokenizer will track the number of UTF-8 bytes consumed from the input so far. This is done by giving BufferQueue a `bytes_consumed` field, which is incremented every time a character is consumed. The changes in cargo.toml were needed to make this project load as a git submodule in the Cadmus project. The xhtml-self-closing feature was needed due to EPUBs using XHTML-compatible self-closing on RCDATA/RAWTEXT elements. Change-Id: 566446e2bca101b7fefdca639c1b4d26 Change-Id-Short: uttvvtlxonpy

Change-Id: 1cd53437f8710e57e9a9bb17f38bc2fc Change-Id-Short: ynmuwvwskrsy

OGKevin · 2026-06-20T15:58:24Z

I think this is ready for review 🤞🏾. I believe I've touched up on all the parts that were fully generated.

So far, I haven't encountered any bugs 🙏🏾, but would like to keep this project as a submodule, so I always have the ability to make changes if needed, since its now an integral part of Cadmus. That being said:

The only thing I'm unsure about are the changes in cargo.toml. Not sure if there is a way to add this project as a submodule. I've read https://doc.rust-lang.org/cargo/reference/overriding-dependencies.html 2 times now and the only solution that seems to work is by updating cargo.toml in this project.

I was hoping maybe this could work:

[patch.crates-io]
html5ever  = { path = "thirdparty/html5ever", package = "html5ever"}

But it doesn't.

error: failed to load source for dependency `html5ever`

Caused by:
  unable to update /home/kevin/code/github.com/ogkevin/cadmus/thirdparty/html5ever

Caused by:
  found a virtual manifest at `/home/kevin/code/github.com/ogkevin/cadmus/thirdparty/html5ever/Cargo.toml` instead of a package manifest

thirdparty/html5ever/
├── AUTHORS
├── Cargo.lock
├── Cargo.toml
├── COPYRIGHT
├── html5ever
├── LICENSE-APACHE
├── LICENSE-MIT
├── markup5ever
├── rcdom
├── README.md
├── RELEASING.MD
├── rustfmt.toml
├── target
├── tendril
├── web_atoms
└── xml5ever

8 directories, 9 files

Not sure if there is something I'm missing tho 🙈.

simonwuelker

These are a couple things I found while skimming through the code. The byte-tracking looks pretty invasive, which concerns me. Ideally we would keep the number of #[cfg(feature = "source-positions")] low, much lower than it currently is. I don't think any changes to the internals of BufferQueue should be necessary.

I had to make some small changes to the cargo.yaml files so that cargo would be happy to compile this project as a submodule: https://github.com/OGKevin/cadmus/blob/65cdaaabb5ab795c668b32a4206f8a49fc9e2661/Cargo.toml#L26-L28

That is not how we intend for html5ever to be consumed. Please don't change our workspace configuration.

I was hoping maybe this could work:

try html5ever = { path = "thirdparty/html5ever/html5ever" }, that works for me when developing servo.

xhtml-self-closing feature
This feature is needed to properly process self closing tags such as <title/>, which might not be valid html5, but seems to be valid xhtml. Without this feature, html5ever swallows everything while looking for the closing title tag.
xhtml-self-closing-integration.rs has test case that showcases this. Well, not the "bug" per se, just how this feature fixes the issue.

That is completely unrelated to this PR. Please open a different one for this bug.

simonwuelker · 2026-06-30T13:47:05Z

+    ///
+    /// The default implementation is a no-op.
+    #[cfg(feature = "source-positions")]
+    fn set_current_byte(&self, _byte_offset: u64) {}


I would strongly prefer passing the byte position as an offset to process_token instead of adding a new method that will be called right before.

You can add something like a struct SourcePosition which would contain the line number and (optionally) the byte offset.

simonwuelker · 2026-06-30T13:50:16Z

+                    if c == '<' {
+                        self.token_start_byte
+                            .set(pos.saturating_sub(c.len_utf8() as u64));
+                    }


This doesn't work - not every < indicates the start of a token (for example in the PLAINTEXT state ¹ it is simply emitted as any other character).

Footnotes

https://html.spec.whatwg.org/#plaintext-state ↩

For this specific implementation, this is not a problem. As the value is only read after it has been determined that we're processing a non character token:

Token::TagToken(_) | Token::CommentToken(_) | Token::DoctypeToken(_) => { self.token_start_byte.get() },

This is not under test however, I'll add a test case later something like:

fn check_byte_offsets_text_content_with_token_start() { let entries = tokenize_bytes("hello<world");

Hopefully it proofs me right 😆.

simonwuelker · 2026-06-30T13:50:34Z

+                        #[cfg(feature = "source-positions")]
+                        if let Some(ref r) = result {
+                            let n = match r {
+                                SetResult::NotFromSet(ref t) => t.len() as u64,


no one-letter variable names please

simonwuelker · 2026-06-30T13:52:18Z

+#[cfg(all(test, feature = "source-positions"))]
+mod test_source_positions {


integration tests should live in html5ever/tests,

This is not integration testing actually, its similar to the main mod test similar to

// LinesMatch implements the TokenSink trait. It is used for testing to see // if current_line is being updated when process_token is called. The lines // vector is a collection of the line numbers that each token is on. struct LinesMatch { tokens: RefCell<Vec<Token>>, current_str: RefCell<StrTendril>, lines: RefCell<Vec<(Token, u64)>>, }

but I wanted to separate it. Would it be okay to do:

mod test { // existing #[cfg(feature = "source-positions")] mod test_source_positions { // tests } }

?

simonwuelker · 2026-06-30T13:53:23Z

+version = "0.39.0"
+license = "MIT OR Apache-2.0"
+authors = [ "The html5ever Project Developers" ]
+repository = "https://github.com/servo/html5ever"
+edition = "2021"
+rust-version = "1.71.0"


You'll need to bump the version if we publish this, but don't modify the other fields please.

simonwuelker · 2026-06-30T13:55:58Z

+    /// Total number of UTF-8 bytes consumed from this queue so far.
+    ///
+    /// Used by the tokenizer to surface byte-accurate source offsets via
+    /// [`TokenSink::set_current_byte`] and [`TreeSink::set_current_byte`].
+    #[cfg(feature = "source-positions")]
+    bytes_consumed: Cell<u64>,


I don't understand why we need to count the consumed bytes here. A BufferQueue just stores input bytes - it is not tied to the tokenizer in any way. Any book-keeping of the consumed input should happen inside the tokenizer.

My thinking is that since the buffer queue is in charge of the raw input, it should be the source of truth to what has been consumed or not?

In theory, something could happen in buffer queue that the tokenizer is not aware of and corrupt the counter 🤔

simonwuelker · 2026-06-30T13:56:42Z

+    /// Total number of UTF-8 bytes consumed from this queue so far.
+    ///
+    /// Used by the tokenizer to surface byte-accurate source offsets via
+    /// [`TokenSink::set_current_byte`] and [`TreeSink::set_current_byte`].
+    #[cfg(feature = "source-positions")]
+    bytes_consumed: Cell<u64>,


The number of bytes should be a usize. That avoids some casting back and forth in the rest of the code.

nicoburns · 2026-06-30T22:35:59Z

This feature is needed to properly process self closing tags such as <title/>, which might not be valid html5, but seems to be valid xhtml. Without this feature, html5ever swallows everything while looking for the closing title tag.

This is what the xml5ever crate is for. It will parse XHTML documents correctly.

OGKevin · 2026-07-01T04:12:43Z

try html5ever = { path = "thirdparty/html5ever/html5ever" }, that works for me when developing servo.

This is what I have! And without the changes, I get the same

failed to load source for dependency `html5ever'

error 😔. I'll spend some more time on it, worst case I'll just keep the workspace config changes in the fork on a separate commit 👍🏾 Since

That is not how we intend for html5ever to be consumed.

is valid 🤝

This is what the xml5ever crate is for. It will parse XHTML documents correctly.

Ahh I missed this, if thats the case, I should be using xml5ever instead 🤝. I was under the impression it was just for xml 😭. Nevertheless, I will still attempt at getting this PR merged for html5ever. Afterwards, I'll replicate the changes to xml5ever.

In this case, the xhtml-self-closing can be removed entirely. I'm not certain if it should still be classified as a "bug" @simonwuelker? As this is a clear case of user error(xml5ever for xhtml instead of html5ever) + invalid spec (xhtml vs html) 🤔.

simonwuelker · 2026-07-01T09:10:01Z

This is what I have!

Note the extra /htmlever in the path compared to what you posted (:

In this case, the xhtml-self-closing can be removed entirely. I'm not certain if it should still be classified as a "bug" @simonwuelker? As this is a clear case of user error(xml5ever for xhtml instead of html5ever) + invalid spec (xhtml vs html) 🤔.

If xml5ever has the correct behaviour then this is not a bug, no.

OGKevin · 2026-07-01T09:35:35Z

Note the extra /htmlever in the path compared to what you posted (:

🙈

I had to make some small changes to the cargo.yaml files so that cargo would be happy to compile this project as a submodule: OGKevin/cadmus@65cdaaa/Cargo.toml#L26-L28

That link, and what has been comitted is:

[patch.crates-io]
html5ever  = { path = "thirdparty/html5ever/html5ever" }
markup5ever = { path = "thirdparty/html5ever/markup5ever" }

Which is the same?

I was hoping maybe this could work:

this is just another variation that completely doesn't work.
And the above "thirdparty/html5ever/html5ever also doesn't work without the changes in cargo.toml in html5ever 😔.
As it doesn't like that cargo.toml in html5ever is defering fields to workspace:

$ cargo check
error: failed to load source for dependency `html5ever`

Caused by:
  unable to update /Users/x/code/github.com/ogkevin/cadmus/thirdparty/html5ever/html5ever

Caused by:
  failed to parse manifest at `/Users/x/code/github.com/ogkevin/cadmus/thirdparty/html5ever/html5ever/Cargo.toml`

Caused by:
  error inheriting `edition` from workspace root manifest's `workspace.package.edition`

Caused by:
  `workspace.package.edition` was not defined

Despite it being defined:

[workspace.package]
version = "0.39.0"
license = "MIT OR Apache-2.0"
authors = [ "The html5ever Project Developers" ]
repository = "https://github.com/servo/html5ever"
edition = "2021"
rust-version = "1.71.0"

I'll spend some time on it again later today, and if not I'll just keep it separate 👍🏾

nicoburns · 2026-07-01T11:52:36Z

If you can, try using path = '...' in the regular dependency section rather than using a patch.

github-actions Bot added the V-non-breaking A non-breaking change label Jun 16, 2026

OGKevin force-pushed the ogkevin/push/uttvvtlxonpy branch from a332771 to 0d70705 Compare June 16, 2026 13:01

github-actions Bot added V-non-breaking A non-breaking change and removed V-non-breaking A non-breaking change labels Jun 16, 2026

OGKevin force-pushed the ogkevin/push/uttvvtlxonpy branch from 0d70705 to cdd361a Compare June 18, 2026 13:44

OGKevin force-pushed the ogkevin/push/uttvvtlxonpy branch 3 times, most recently from 1b0e47d to 16e417a Compare June 18, 2026 13:49

github-actions Bot added V-non-breaking A non-breaking change and removed V-non-breaking A non-breaking change labels Jun 18, 2026

OGKevin force-pushed the ogkevin/push/uttvvtlxonpy branch from 16e417a to 74a33c7 Compare June 18, 2026 14:00

github-actions Bot added V-non-breaking A non-breaking change and removed V-non-breaking A non-breaking change labels Jun 18, 2026

OGKevin mentioned this pull request Jun 18, 2026

feat(HTML): replace manual XmlParser with html5ever OGKevin/cadmus#599

Draft

OGKevin force-pushed the ogkevin/push/uttvvtlxonpy branch from 74a33c7 to b51978b Compare June 19, 2026 20:27

github-actions Bot added V-non-breaking A non-breaking change and removed V-non-breaking A non-breaking change labels Jun 19, 2026

chore: some brooming

56499d6

Change-Id: 1cd53437f8710e57e9a9bb17f38bc2fc Change-Id-Short: ynmuwvwskrsy

OGKevin force-pushed the ogkevin/push/uttvvtlxonpy branch from b51978b to 56499d6 Compare June 19, 2026 20:47

github-actions Bot removed the V-non-breaking A non-breaking change label Jun 19, 2026

github-actions Bot added the V-non-breaking A non-breaking change label Jun 19, 2026

OGKevin marked this pull request as ready for review June 20, 2026 15:58

simonwuelker requested changes Jun 30, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: add byte tracking#745

feat: add byte tracking#745
OGKevin wants to merge 2 commits into
servo:mainfrom
OGKevin:ogkevin/push/uttvvtlxonpy

OGKevin commented Jun 16, 2026 •

edited

Loading

simonwuelker commented Jun 16, 2026

OGKevin commented Jun 17, 2026 •

edited

Loading

simonwuelker commented Jun 17, 2026 •

edited

Loading

OGKevin commented Jun 20, 2026

simonwuelker left a comment •

edited

Loading

simonwuelker Jun 30, 2026

simonwuelker Jun 30, 2026

OGKevin Jul 1, 2026 •

edited

Loading

simonwuelker Jun 30, 2026

simonwuelker Jun 30, 2026

OGKevin Jul 1, 2026 •

edited

Loading

simonwuelker Jun 30, 2026

simonwuelker Jun 30, 2026

OGKevin Jul 1, 2026

simonwuelker Jun 30, 2026

nicoburns commented Jun 30, 2026

OGKevin commented Jul 1, 2026 •

edited

Loading

simonwuelker commented Jul 1, 2026

OGKevin commented Jul 1, 2026 •

edited

Loading

nicoburns commented Jul 1, 2026

Labels

3 participants

		#[cfg(all(test, feature = "source-positions"))]
		mod test_source_positions {

Uh oh!

Conversation

OGKevin commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Byte Tracking

BufferQueue

Tokenizer

TokenSink

TreeSink

Conclusion

Supporting changes

Cargo

xhtml-self-closing feature

simonwuelker commented Jun 16, 2026

OGKevin commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

simonwuelker commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

OGKevin commented Jun 20, 2026

simonwuelker left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Footnotes

OGKevin Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

OGKevin Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nicoburns commented Jun 30, 2026

OGKevin commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

simonwuelker commented Jul 1, 2026

OGKevin commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

nicoburns commented Jul 1, 2026

Labels

3 participants

OGKevin commented Jun 16, 2026 •

edited

Loading

`xhtml-self-closing` feature

OGKevin commented Jun 17, 2026 •

edited

Loading

simonwuelker commented Jun 17, 2026 •

edited

Loading

simonwuelker left a comment •

edited

Loading

OGKevin Jul 1, 2026 •

edited

Loading

OGKevin Jul 1, 2026 •

edited

Loading

OGKevin commented Jul 1, 2026 •

edited

Loading

OGKevin commented Jul 1, 2026 •

edited

Loading