feat: add byte tracking#745
Conversation
a332771 to
0d70705
Compare
Note that servo does not accept AI-generated contributions: https://book.servo.org/contributing/getting-started.html#ai-contributions. Admittedly we should do a better job of communicating that this applies to all repositories under the servo org, not just |
|
I'd be fine with you rewriting the tests and rustdoc.
That's okay.
You don't need to prove that. We rely on the honesty of contributors when rejecting AI generated patches (unless the patch is obviously slop). Thanks for being upfront about it. |
0d70705 to
cdd361a
Compare
This commit introduced a new feature flag that enables the ability to track byte offsets. When the source-positions feature is enabled, the tokenizer will track the number of UTF-8 bytes consumed from the input so far. This is done by giving BufferQueue a `bytes_consumed` field, which is incremented every time a character is consumed. The changes in cargo.toml were needed to make this project load as a git submodule in the Cadmus project. The xhtml-self-closing feature was needed due to EPUBs using XHTML-compatible self-closing on RCDATA/RAWTEXT elements. Change-Id: 566446e2bca101b7fefdca639c1b4d26 Change-Id-Short: uttvvtlxonpy
1b0e47d to
16e417a
Compare
16e417a to
74a33c7
Compare
74a33c7 to
b51978b
Compare
Change-Id: 1cd53437f8710e57e9a9bb17f38bc2fc Change-Id-Short: ynmuwvwskrsy
b51978b to
56499d6
Compare
|
I think this is ready for review 🤞🏾. I believe I've touched up on all the parts that were fully generated. So far, I haven't encountered any bugs 🙏🏾, but would like to keep this project as a submodule, so I always have the ability to make changes if needed, since its now an integral part of Cadmus. That being said: The only thing I'm unsure about are the changes in cargo.toml. Not sure if there is a way to add this project as a submodule. I've read https://doc.rust-lang.org/cargo/reference/overriding-dependencies.html 2 times now and the only solution that seems to work is by updating cargo.toml in this project. I was hoping maybe this could work: [patch.crates-io]
html5ever = { path = "thirdparty/html5ever", package = "html5ever"}But it doesn't. Not sure if there is something I'm missing tho 🙈. |
There was a problem hiding this comment.
These are a couple things I found while skimming through the code. The byte-tracking looks pretty invasive, which concerns me. Ideally we would keep the number of #[cfg(feature = "source-positions")] low, much lower than it currently is. I don't think any changes to the internals of BufferQueue should be necessary.
I had to make some small changes to the cargo.yaml files so that cargo would be happy to compile this project as a submodule: https://github.com/OGKevin/cadmus/blob/65cdaaabb5ab795c668b32a4206f8a49fc9e2661/Cargo.toml#L26-L28
That is not how we intend for html5ever to be consumed. Please don't change our workspace configuration.
I was hoping maybe this could work:
try html5ever = { path = "thirdparty/html5ever/html5ever" }, that works for me when developing servo.
xhtml-self-closing feature
This feature is needed to properly process self closing tags such as <title/>, which might not be valid html5, but seems to be valid xhtml. Without this feature, html5ever swallows everything while looking for the closing title tag.
xhtml-self-closing-integration.rs has test case that showcases this. Well, not the "bug" per se, just how this feature fixes the issue.
That is completely unrelated to this PR. Please open a different one for this bug.
| /// | ||
| /// The default implementation is a no-op. | ||
| #[cfg(feature = "source-positions")] | ||
| fn set_current_byte(&self, _byte_offset: u64) {} |
There was a problem hiding this comment.
I would strongly prefer passing the byte position as an offset to process_token instead of adding a new method that will be called right before.
You can add something like a struct SourcePosition which would contain the line number and (optionally) the byte offset.
| if c == '<' { | ||
| self.token_start_byte | ||
| .set(pos.saturating_sub(c.len_utf8() as u64)); | ||
| } |
There was a problem hiding this comment.
This doesn't work - not every < indicates the start of a token (for example in the PLAINTEXT state 1 it is simply emitted as any other character).
Footnotes
There was a problem hiding this comment.
For this specific implementation, this is not a problem. As the value is only read after it has been determined that we're processing a non character token:
Token::TagToken(_) | Token::CommentToken(_) | Token::DoctypeToken(_) => {
self.token_start_byte.get()
},This is not under test however, I'll add a test case later something like:
fn check_byte_offsets_text_content_with_token_start() {
let entries = tokenize_bytes("<p>hello<world</p>");Hopefully it proofs me right 😆.
| #[cfg(feature = "source-positions")] | ||
| if let Some(ref r) = result { | ||
| let n = match r { | ||
| SetResult::NotFromSet(ref t) => t.len() as u64, |
There was a problem hiding this comment.
no one-letter variable names please
| #[cfg(all(test, feature = "source-positions"))] | ||
| mod test_source_positions { |
There was a problem hiding this comment.
integration tests should live in html5ever/tests,
There was a problem hiding this comment.
This is not integration testing actually, its similar to the main mod test similar to
// LinesMatch implements the TokenSink trait. It is used for testing to see
// if current_line is being updated when process_token is called. The lines
// vector is a collection of the line numbers that each token is on.
struct LinesMatch {
tokens: RefCell<Vec<Token>>,
current_str: RefCell<StrTendril>,
lines: RefCell<Vec<(Token, u64)>>,
}but I wanted to separate it. Would it be okay to do:
mod test {
// existing
#[cfg(feature = "source-positions")]
mod test_source_positions {
// tests
}
}?
| version = "0.39.0" | ||
| license = "MIT OR Apache-2.0" | ||
| authors = [ "The html5ever Project Developers" ] | ||
| repository = "https://github.com/servo/html5ever" | ||
| edition = "2021" | ||
| rust-version = "1.71.0" |
There was a problem hiding this comment.
You'll need to bump the version if we publish this, but don't modify the other fields please.
| /// Total number of UTF-8 bytes consumed from this queue so far. | ||
| /// | ||
| /// Used by the tokenizer to surface byte-accurate source offsets via | ||
| /// [`TokenSink::set_current_byte`] and [`TreeSink::set_current_byte`]. | ||
| #[cfg(feature = "source-positions")] | ||
| bytes_consumed: Cell<u64>, |
There was a problem hiding this comment.
I don't understand why we need to count the consumed bytes here. A BufferQueue just stores input bytes - it is not tied to the tokenizer in any way. Any book-keeping of the consumed input should happen inside the tokenizer.
There was a problem hiding this comment.
My thinking is that since the buffer queue is in charge of the raw input, it should be the source of truth to what has been consumed or not?
In theory, something could happen in buffer queue that the tokenizer is not aware of and corrupt the counter 🤔
| /// Total number of UTF-8 bytes consumed from this queue so far. | ||
| /// | ||
| /// Used by the tokenizer to surface byte-accurate source offsets via | ||
| /// [`TokenSink::set_current_byte`] and [`TreeSink::set_current_byte`]. | ||
| #[cfg(feature = "source-positions")] | ||
| bytes_consumed: Cell<u64>, |
There was a problem hiding this comment.
The number of bytes should be a usize. That avoids some casting back and forth in the rest of the code.
This is what the |
This is what I have! And without the changes, I get the same
error 😔. I'll spend some more time on it, worst case I'll just keep the workspace config changes in the fork on a separate commit 👍🏾 Since
is valid 🤝
Ahh I missed this, if thats the case, I should be using xml5ever instead 🤝. I was under the impression it was just for In this case, the |
Note the extra
If xml5ever has the correct behaviour then this is not a bug, no. |
🙈
That link, and what has been comitted is: [patch.crates-io]
html5ever = { path = "thirdparty/html5ever/html5ever" }
markup5ever = { path = "thirdparty/html5ever/markup5ever" }Which is the same?
this is just another variation that completely doesn't work. $ cargo check
error: failed to load source for dependency `html5ever`
Caused by:
unable to update /Users/x/code/github.com/ogkevin/cadmus/thirdparty/html5ever/html5ever
Caused by:
failed to parse manifest at `/Users/x/code/github.com/ogkevin/cadmus/thirdparty/html5ever/html5ever/Cargo.toml`
Caused by:
error inheriting `edition` from workspace root manifest's `workspace.package.edition`
Caused by:
`workspace.package.edition` was not definedDespite it being defined: I'll spend some time on it again later today, and if not I'll just keep it separate 👍🏾 |
|
If you can, try using |

Byte Tracking
This PR introduced the ability for
TreeSinkimplementations to know byte positions of Tokens. This is done by makingBufferQueuetrack byte consumption and expose methods forTokenizerto get this info, which than passes it forwards toTreeSink.BufferQueue
BufferQueuegets abytes_consumedfield that tracks bytes manipulated via:pop_except_fromadvance_bytes_consumedretreat_bytes_consumedeatnextThe "consuming" methods (pop, eat, next) automatically advance the counter. While
push_backandpush_frontdon't move the counter. This is due to some code paths pushing front without actually consuming bytes, so this would corrupt the counter. Hence it's the callers responsibility to retreat.Tokenizer
The main use case of this PR is to know where a token began in bytes, not necessarily the raw byte position of a character (what
BufferQueue::bytes_consumedtracks.Since Tokenizer is aware of "Tokens" e.g. Tag, Comment and Character tokens, it is in a good position to dictate what the correct byte is that needs to be reported to
TreeSink.For Tag, Comment and Doctype, the byte offset is that start of the token. So for tag
<some-tag></some-tag>, its offset would be0. And for tagpin<some-tag><p></p></some-tag>it would be10.For character tokens, it wold be that start of the first character.
e.g.
<p>abc</p> the textabcwould have offset3`.This is done by keeping track of when we saw the last
<when processing a token. This is stored astoken_start_byte. As we continue to process, we always write tolast_token_end_byteat the end of each token processing so it can be used when we getCharacterTokens. In the above example, it would contain the byte ofa. As when we get the character tokenabcwe would be at byte6but we want to communicate that the character token started at3.TokenSink
Token sink get's a new
Method instead of updating
And breaking the signature.
TreeSink
For folks who implement their own Dom struct,
TreeSinkis able to also pass the byte offset forward.An example of this can be seen in
source-positions-integration.rsConclusion
BufferQueueholds raw position, Tokenizer converts raw position to an offset, and this gets passed down to TokenSink and TreeSink.This was manually written ⌨️. Mainly to proof that human 🧠 and hands were used .
Supporting changes
Cargo
I had to make some small changes to the
cargo.yamlfiles so that cargo would be happy to compile this project as a submodule: https://github.com/OGKevin/cadmus/blob/65cdaaabb5ab795c668b32a4206f8a49fc9e2661/Cargo.toml#L26-L28xhtml-self-closingfeatureThis feature is needed to properly process self closing tags such as
<title/>, which might not be valid html5, but seems to be valid xhtml. Without this feature, html5ever swallows everything while looking for the closing title tag.xhtml-self-closing-integration.rshas test case that showcases this. Well, not the "bug" per se, just how this feature fixes the issue.Ref: baskerville/plato#426 (comment)
🤖
was used to generate some of the rustdocs and tests.Was also used to educate me on the project and brainstorm implementation path.Closes: #734
Ref: