Skip to content

feat: add byte tracking#745

Open
OGKevin wants to merge 2 commits into
servo:mainfrom
OGKevin:ogkevin/push/uttvvtlxonpy
Open

feat: add byte tracking#745
OGKevin wants to merge 2 commits into
servo:mainfrom
OGKevin:ogkevin/push/uttvvtlxonpy

Conversation

@OGKevin

@OGKevin OGKevin commented Jun 16, 2026

Copy link
Copy Markdown

Byte Tracking

This PR introduced the ability for TreeSink implementations to know byte positions of Tokens. This is done by making BufferQueue track byte consumption and expose methods for Tokenizer to get this info, which than passes it forwards to TreeSink.

BufferQueue

BufferQueue gets a bytes_consumed field that tracks bytes manipulated via:

  • pop_except_from
  • advance_bytes_consumed
  • retreat_bytes_consumed
  • eat
  • next

The "consuming" methods (pop, eat, next) automatically advance the counter. While push_back and push_front don't move the counter. This is due to some code paths pushing front without actually consuming bytes, so this would corrupt the counter. Hence it's the callers responsibility to retreat.

Tokenizer

The main use case of this PR is to know where a token began in bytes, not necessarily the raw byte position of a character (what BufferQueue::bytes_consumed tracks.

Since Tokenizer is aware of "Tokens" e.g. Tag, Comment and Character tokens, it is in a good position to dictate what the correct byte is that needs to be reported to TreeSink.

For Tag, Comment and Doctype, the byte offset is that start of the token. So for tag <some-tag></some-tag>, its offset would be 0. And for tag p in <some-tag><p></p></some-tag> it would be 10.

For character tokens, it wold be that start of the first character.
e.g. <p>abc</p> the text abcwould have offset3`.

This is done by keeping track of when we saw the last < when processing a token. This is stored as token_start_byte. As we continue to process, we always write to last_token_end_byte at the end of each token processing so it can be used when we get CharacterTokens. In the above example, it would contain the byte of a. As when we get the character token abc we would be at byte 6 but we want to communicate that the character token started at 3.

TokenSink

Token sink get's a new

    #[cfg(feature = "source-positions")]
    fn set_current_byte(&self, _byte_offset: u64) {}

Method instead of updating

    /// Process a token.
    fn process_token(&self, token: Token, line_number: u64) -> TokenSinkResult<Self::Handle>;

And breaking the signature.

TreeSink

For folks who implement their own Dom struct, TreeSink is able to also pass the byte offset forward.

An example of this can be seen in source-positions-integration.rs

    impl TreeSink for ByteCapturingDOM {

Conclusion

BufferQueue holds raw position, Tokenizer converts raw position to an offset, and this gets passed down to TokenSink and TreeSink.

This was manually written ⌨️. Mainly to proof that human 🧠 and hands were used .


Supporting changes

Cargo

I had to make some small changes to the cargo.yaml files so that cargo would be happy to compile this project as a submodule: https://github.com/OGKevin/cadmus/blob/65cdaaabb5ab795c668b32a4206f8a49fc9e2661/Cargo.toml#L26-L28

xhtml-self-closing feature

This feature is needed to properly process self closing tags such as <title/>, which might not be valid html5, but seems to be valid xhtml. Without this feature, html5ever swallows everything while looking for the closing title tag.

xhtml-self-closing-integration.rs has test case that showcases this. Well, not the "bug" per se, just how this feature fixes the issue.

Ref: baskerville/plato#426 (comment)


🤖 was used to generate some of the rustdocs and tests. Was also used to educate me on the project and brainstorm implementation path.


Closes: #734

Ref:

@github-actions github-actions Bot added the V-non-breaking A non-breaking change label Jun 16, 2026
@OGKevin OGKevin force-pushed the ogkevin/push/uttvvtlxonpy branch from a332771 to 0d70705 Compare June 16, 2026 13:01
@github-actions github-actions Bot added V-non-breaking A non-breaking change and removed V-non-breaking A non-breaking change labels Jun 16, 2026
@simonwuelker

Copy link
Copy Markdown
Member

🤖 was used to generate some of the rustdocs and tests. Was also used to educate me on the project and brainstorm implementation path.

Note that servo does not accept AI-generated contributions: https://book.servo.org/contributing/getting-started.html#ai-contributions.

Admittedly we should do a better job of communicating that this applies to all repositories under the servo org, not just servo/servo.

@OGKevin

OGKevin commented Jun 17, 2026

Copy link
Copy Markdown
Author

Ahh, that might make things interesting now. I wasn't aware, mybad.

What now? As I see 2 solutions:

  1. Remove the tests and rustdocs, as thats the easiest way to remove all 🤖 touched code, but would also significantly lower the quality of the PR.
  2. Re-write them, but I'm now biased to what I've seen, reviewed and agreed with, so my manual rewrite might not be significantly different. Unless you have some feedback on it that makes it look diff. However, how would I then proof that was scaffolded by 🤖 is now 100% manually typed?

The tests turned out to be useful as they caught some gaps in the implementation, but since they've "served their purpose" I'm fine with removing them. As In this specific situation, I've similar tests in Cadmus that covers similar cases.

Do note that tests are the biggest diff:
image


I'm not challenging the usage of 🤖, just not sure how to move forward. Unfortunately, my daily life has come to 🤖 code reviewer. So I take the small manual implementations where I can, was just lazy when it came to tests and rustdocs 🙉 😭.

@simonwuelker

simonwuelker commented Jun 17, 2026

Copy link
Copy Markdown
Member

I'd be fine with you rewriting the tests and rustdoc.

but I'm now biased to what I've seen, reviewed and agreed with, so my manual rewrite might not be significantly different.

That's okay.

However, how would I then proof that was scaffolded by 🤖 is now 100% manually typed?

You don't need to prove that. We rely on the honesty of contributors when rejecting AI generated patches (unless the patch is obviously slop). Thanks for being upfront about it.

@OGKevin OGKevin force-pushed the ogkevin/push/uttvvtlxonpy branch from 0d70705 to cdd361a Compare June 18, 2026 13:44
This commit introduced a new feature flag that enables the ability to
track byte offsets.

When the source-positions feature is enabled, the tokenizer will track
the number of UTF-8 bytes consumed from the input so far. This is done
by giving BufferQueue a `bytes_consumed` field, which is incremented
every time a character is consumed.

The changes in cargo.toml were needed to make this project load as a git
submodule in the Cadmus project.

The xhtml-self-closing feature was needed due to EPUBs using
XHTML-compatible self-closing on RCDATA/RAWTEXT elements.

Change-Id: 566446e2bca101b7fefdca639c1b4d26
Change-Id-Short: uttvvtlxonpy
@OGKevin OGKevin force-pushed the ogkevin/push/uttvvtlxonpy branch 3 times, most recently from 1b0e47d to 16e417a Compare June 18, 2026 13:49
@github-actions github-actions Bot added V-non-breaking A non-breaking change and removed V-non-breaking A non-breaking change labels Jun 18, 2026
@OGKevin OGKevin force-pushed the ogkevin/push/uttvvtlxonpy branch from 16e417a to 74a33c7 Compare June 18, 2026 14:00
@github-actions github-actions Bot added V-non-breaking A non-breaking change and removed V-non-breaking A non-breaking change labels Jun 18, 2026
@OGKevin OGKevin force-pushed the ogkevin/push/uttvvtlxonpy branch from 74a33c7 to b51978b Compare June 19, 2026 20:27
@github-actions github-actions Bot added V-non-breaking A non-breaking change and removed V-non-breaking A non-breaking change labels Jun 19, 2026
Change-Id: 1cd53437f8710e57e9a9bb17f38bc2fc
Change-Id-Short: ynmuwvwskrsy
@OGKevin OGKevin force-pushed the ogkevin/push/uttvvtlxonpy branch from b51978b to 56499d6 Compare June 19, 2026 20:47
@github-actions github-actions Bot removed the V-non-breaking A non-breaking change label Jun 19, 2026
@github-actions github-actions Bot added the V-non-breaking A non-breaking change label Jun 19, 2026
@OGKevin

OGKevin commented Jun 20, 2026

Copy link
Copy Markdown
Author

I think this is ready for review 🤞🏾. I believe I've touched up on all the parts that were fully generated.

So far, I haven't encountered any bugs 🙏🏾, but would like to keep this project as a submodule, so I always have the ability to make changes if needed, since its now an integral part of Cadmus. That being said:

The only thing I'm unsure about are the changes in cargo.toml. Not sure if there is a way to add this project as a submodule. I've read https://doc.rust-lang.org/cargo/reference/overriding-dependencies.html 2 times now and the only solution that seems to work is by updating cargo.toml in this project.

I was hoping maybe this could work:

[patch.crates-io]
html5ever  = { path = "thirdparty/html5ever", package = "html5ever"}

But it doesn't.

error: failed to load source for dependency `html5ever`

Caused by:
  unable to update /home/kevin/code/github.com/ogkevin/cadmus/thirdparty/html5ever

Caused by:
  found a virtual manifest at `/home/kevin/code/github.com/ogkevin/cadmus/thirdparty/html5ever/Cargo.toml` instead of a package manifest
thirdparty/html5ever/
├── AUTHORS
├── Cargo.lock
├── Cargo.toml
├── COPYRIGHT
├── html5ever
├── LICENSE-APACHE
├── LICENSE-MIT
├── markup5ever
├── rcdom
├── README.md
├── RELEASING.MD
├── rustfmt.toml
├── target
├── tendril
├── web_atoms
└── xml5ever

8 directories, 9 files

Not sure if there is something I'm missing tho 🙈.

@OGKevin OGKevin marked this pull request as ready for review June 20, 2026 15:58

@simonwuelker simonwuelker left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are a couple things I found while skimming through the code. The byte-tracking looks pretty invasive, which concerns me. Ideally we would keep the number of #[cfg(feature = "source-positions")] low, much lower than it currently is. I don't think any changes to the internals of BufferQueue should be necessary.

I had to make some small changes to the cargo.yaml files so that cargo would be happy to compile this project as a submodule: https://github.com/OGKevin/cadmus/blob/65cdaaabb5ab795c668b32a4206f8a49fc9e2661/Cargo.toml#L26-L28

That is not how we intend for html5ever to be consumed. Please don't change our workspace configuration.

I was hoping maybe this could work:

try html5ever = { path = "thirdparty/html5ever/html5ever" }, that works for me when developing servo.

xhtml-self-closing feature
This feature is needed to properly process self closing tags such as <title/>, which might not be valid html5, but seems to be valid xhtml. Without this feature, html5ever swallows everything while looking for the closing title tag.
xhtml-self-closing-integration.rs has test case that showcases this. Well, not the "bug" per se, just how this feature fixes the issue.

That is completely unrelated to this PR. Please open a different one for this bug.

///
/// The default implementation is a no-op.
#[cfg(feature = "source-positions")]
fn set_current_byte(&self, _byte_offset: u64) {}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would strongly prefer passing the byte position as an offset to process_token instead of adding a new method that will be called right before.

You can add something like a struct SourcePosition which would contain the line number and (optionally) the byte offset.

Comment on lines +693 to +696
if c == '<' {
self.token_start_byte
.set(pos.saturating_sub(c.len_utf8() as u64));
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't work - not every < indicates the start of a token (for example in the PLAINTEXT state 1 it is simply emitted as any other character).

Footnotes

  1. https://html.spec.whatwg.org/#plaintext-state

@OGKevin OGKevin Jul 1, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this specific implementation, this is not a problem. As the value is only read after it has been determined that we're processing a non character token:

Token::TagToken(_) | Token::CommentToken(_) | Token::DoctypeToken(_) => {
                    self.token_start_byte.get()
                },

This is not under test however, I'll add a test case later something like:

fn check_byte_offsets_text_content_with_token_start() {
        let entries = tokenize_bytes("<p>hello<world</p>");

Hopefully it proofs me right 😆.

#[cfg(feature = "source-positions")]
if let Some(ref r) = result {
let n = match r {
SetResult::NotFromSet(ref t) => t.len() as u64,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no one-letter variable names please

Comment on lines +2473 to +2474
#[cfg(all(test, feature = "source-positions"))]
mod test_source_positions {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

integration tests should live in html5ever/tests,

@OGKevin OGKevin Jul 1, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not integration testing actually, its similar to the main mod test similar to

// LinesMatch implements the TokenSink trait. It is used for testing to see
    // if current_line is being updated when process_token is called. The lines
    // vector is a collection of the line numbers that each token is on.
    struct LinesMatch {
        tokens: RefCell<Vec<Token>>,
        current_str: RefCell<StrTendril>,
        lines: RefCell<Vec<(Token, u64)>>,
    }

but I wanted to separate it. Would it be okay to do:

mod test {
   // existing

   #[cfg(feature = "source-positions")]
   mod test_source_positions {
    // tests
   }
}

?

Comment thread html5ever/Cargo.toml
Comment on lines +8 to +13
version = "0.39.0"
license = "MIT OR Apache-2.0"
authors = [ "The html5ever Project Developers" ]
repository = "https://github.com/servo/html5ever"
edition = "2021"
rust-version = "1.71.0"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll need to bump the version if we publish this, but don't modify the other fields please.

Comment on lines +56 to +61
/// Total number of UTF-8 bytes consumed from this queue so far.
///
/// Used by the tokenizer to surface byte-accurate source offsets via
/// [`TokenSink::set_current_byte`] and [`TreeSink::set_current_byte`].
#[cfg(feature = "source-positions")]
bytes_consumed: Cell<u64>,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we need to count the consumed bytes here. A BufferQueue just stores input bytes - it is not tied to the tokenizer in any way. Any book-keeping of the consumed input should happen inside the tokenizer.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking is that since the buffer queue is in charge of the raw input, it should be the source of truth to what has been consumed or not?

In theory, something could happen in buffer queue that the tokenizer is not aware of and corrupt the counter 🤔

Comment on lines +56 to +61
/// Total number of UTF-8 bytes consumed from this queue so far.
///
/// Used by the tokenizer to surface byte-accurate source offsets via
/// [`TokenSink::set_current_byte`] and [`TreeSink::set_current_byte`].
#[cfg(feature = "source-positions")]
bytes_consumed: Cell<u64>,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The number of bytes should be a usize. That avoids some casting back and forth in the rest of the code.

@nicoburns

Copy link
Copy Markdown
Contributor

This feature is needed to properly process self closing tags such as <title/>, which might not be valid html5, but seems to be valid xhtml. Without this feature, html5ever swallows everything while looking for the closing title tag.

This is what the xml5ever crate is for. It will parse XHTML documents correctly.

@OGKevin

OGKevin commented Jul 1, 2026

Copy link
Copy Markdown
Author

try html5ever = { path = "thirdparty/html5ever/html5ever" }, that works for me when developing servo.

This is what I have! And without the changes, I get the same

failed to load source for dependency `html5ever'

error 😔. I'll spend some more time on it, worst case I'll just keep the workspace config changes in the fork on a separate commit 👍🏾 Since

That is not how we intend for html5ever to be consumed.

is valid 🤝


This is what the xml5ever crate is for. It will parse XHTML documents correctly.

Ahh I missed this, if thats the case, I should be using xml5ever instead 🤝. I was under the impression it was just for xml 😭. Nevertheless, I will still attempt at getting this PR merged for html5ever. Afterwards, I'll replicate the changes to xml5ever.

In this case, the xhtml-self-closing can be removed entirely. I'm not certain if it should still be classified as a "bug" @simonwuelker? As this is a clear case of user error(xml5ever for xhtml instead of html5ever) + invalid spec (xhtml vs html) 🤔.

@simonwuelker

Copy link
Copy Markdown
Member

This is what I have!

Note the extra /htmlever in the path compared to what you posted (:

In this case, the xhtml-self-closing can be removed entirely. I'm not certain if it should still be classified as a "bug" @simonwuelker? As this is a clear case of user error(xml5ever for xhtml instead of html5ever) + invalid spec (xhtml vs html) 🤔.

If xml5ever has the correct behaviour then this is not a bug, no.

@OGKevin

OGKevin commented Jul 1, 2026

Copy link
Copy Markdown
Author

Note the extra /htmlever in the path compared to what you posted (:

🙈

I had to make some small changes to the cargo.yaml files so that cargo would be happy to compile this project as a submodule: OGKevin/cadmus@65cdaaa/Cargo.toml#L26-L28

That link, and what has been comitted is:

[patch.crates-io]
html5ever  = { path = "thirdparty/html5ever/html5ever" }
markup5ever = { path = "thirdparty/html5ever/markup5ever" }

Which is the same?

I was hoping maybe this could work:

this is just another variation that completely doesn't work.
And the above "thirdparty/html5ever/html5ever also doesn't work without the changes in cargo.toml in html5ever 😔.
As it doesn't like that cargo.toml in html5ever is defering fields to workspace:

$ cargo check
error: failed to load source for dependency `html5ever`

Caused by:
  unable to update /Users/x/code/github.com/ogkevin/cadmus/thirdparty/html5ever/html5ever

Caused by:
  failed to parse manifest at `/Users/x/code/github.com/ogkevin/cadmus/thirdparty/html5ever/html5ever/Cargo.toml`

Caused by:
  error inheriting `edition` from workspace root manifest's `workspace.package.edition`

Caused by:
  `workspace.package.edition` was not defined

Despite it being defined:

[workspace.package]
version = "0.39.0"
license = "MIT OR Apache-2.0"
authors = [ "The html5ever Project Developers" ]
repository = "https://github.com/servo/html5ever"
edition = "2021"
rust-version = "1.71.0"

I'll spend some time on it again later today, and if not I'll just keep it separate 👍🏾

@nicoburns

Copy link
Copy Markdown
Contributor

If you can, try using path = '...' in the regular dependency section rather than using a patch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

V-non-breaking A non-breaking change

3 participants