refactor: lexer #437

Draft: wants to merge 8 commits into base: main

@psteinroe (Collaborator) commented Jul 1, 2025

  • adds a new tokenizer crate that turns a string into simple tokens
  • adds a new lexer + lexer_codegen that uses the tokenizer to lex into a new SyntaxKind enum
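
As a rough illustration of the first bullet, a tokenizer of this shape could look like the sketch below. `TokenKind`, `Token`, and `tokenize` are hypothetical names picked for the example, not the crate's actual API, and the token kinds are far coarser than a generated `SyntaxKind` enum would be. The point is only that each token borrows a slice of the input instead of allocating its own string.

```rust
/// Hypothetical token kinds for illustration; not the PR's `SyntaxKind`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TokenKind {
    Whitespace,
    Ident,
    Number,
    Punct,
}

/// A token borrows its text from the input, so no `String` is allocated per token.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Token<'a> {
    pub kind: TokenKind,
    pub text: &'a str,
}

/// Splits `input` into tokens; every byte of the input lands in exactly one token.
pub fn tokenize(input: &str) -> Vec<Token<'_>> {
    let mut tokens = Vec::new();
    let mut chars = input.char_indices().peekable();
    while let Some(&(start, c)) = chars.peek() {
        let kind = if c.is_whitespace() {
            TokenKind::Whitespace
        } else if c.is_alphabetic() || c == '_' {
            TokenKind::Ident
        } else if c.is_ascii_digit() {
            TokenKind::Number
        } else {
            TokenKind::Punct
        };
        // Always consume the first character, then extend the run for multi-char kinds.
        chars.next();
        let mut end = start + c.len_utf8();
        if kind != TokenKind::Punct {
            while let Some(&(i, next)) = chars.peek() {
                let same_kind = match kind {
                    TokenKind::Whitespace => next.is_whitespace(),
                    TokenKind::Ident => next.is_alphanumeric() || next == '_',
                    TokenKind::Number => next.is_ascii_digit(),
                    TokenKind::Punct => false,
                };
                if !same_kind {
                    break;
                }
                end = i + next.len_utf8();
                chars.next();
            }
        }
        tokens.push(Token { kind, text: &input[start..end] });
    }
    tokens
}
```

Because the tokens cover the input without gaps, concatenating their `text` fields gives back the original string, which is handy for round-trip tests.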

the new implementation is

  • much more performant (no extra string allocations, no calls into the C library)
  • works with broken strings
  • custom-made for our use case (e.g. we need whitespace variants)
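
To make "works with broken strings" concrete, here is a minimal sketch of how a scanner can tolerate an unterminated single-quoted literal. `StringToken` and `scan_string` are hypothetical names for this example, and only the standard `''` escape is handled; the actual lexer in this PR may behave differently.

```rust
/// Hypothetical result of scanning a single-quoted SQL string literal.
#[derive(Debug, PartialEq, Eq)]
pub struct StringToken<'a> {
    /// The literal including its quotes (or everything up to end of input).
    pub text: &'a str,
    /// False when the closing quote is missing.
    pub terminated: bool,
}

/// Scans a string literal at the start of `input`.
/// Returns `None` if `input` does not start with a single quote.
pub fn scan_string(input: &str) -> Option<StringToken<'_>> {
    if !input.starts_with('\'') {
        return None;
    }
    let bytes = input.as_bytes();
    let mut i = 1;
    while i < bytes.len() {
        if bytes[i] == b'\'' {
            // A doubled quote ('') is an escaped quote inside the literal.
            if i + 1 < bytes.len() && bytes[i + 1] == b'\'' {
                i += 2;
                continue;
            }
            return Some(StringToken { text: &input[..=i], terminated: true });
        }
        i += 1;
    }
    // No closing quote: emit what we have instead of returning an error.
    Some(StringToken { text: input, terminated: false })
}
```

The unterminated case still yields a token, so whatever consumes the tokens can decide how to report it while the rest of the pipeline keeps working on incomplete input.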

in a follow-up (or maybe here), we will be able to:

  • parse custom parameters that popular tools use
  • parse commands (lines starting with \)
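
A sketch of what recognizing these two cases might involve. The parameter styles shown (`$1`, `:name`, `@name`) are assumptions about which formats are meant, and `Param`, `scan_param`, and `is_command_line` are hypothetical names for this example only.

```rust
/// Hypothetical parameter shapes; which formats the lexer supports is an open todo.
#[derive(Debug, PartialEq, Eq)]
pub enum Param<'a> {
    /// Positional, e.g. `$1`.
    Positional(&'a str),
    /// Named, e.g. `:name` or `@name`.
    Named(&'a str),
}

/// Tries to read a parameter at the start of `input`.
pub fn scan_param(input: &str) -> Option<Param<'_>> {
    let mut chars = input.chars();
    let sigil = chars.next()?;
    let rest = chars.as_str();
    match sigil {
        '$' => {
            let len = rest.bytes().take_while(|b| b.is_ascii_digit()).count();
            if len > 0 {
                Some(Param::Positional(&input[..1 + len]))
            } else {
                None
            }
        }
        ':' | '@' => {
            // Require an identifier start so that `::int` is not read as a parameter here.
            let first = rest.chars().next()?;
            if !(first.is_ascii_alphabetic() || first == '_') {
                return None;
            }
            let len = rest
                .bytes()
                .take_while(|b| b.is_ascii_alphanumeric() || *b == b'_')
                .count();
            Some(Param::Named(&input[..1 + len]))
        }
        _ => None,
    }
}

/// A command is a line starting with a backslash.
pub fn is_command_line(line: &str) -> bool {
    line.starts_with('\\')
}
```

A real lexer would have to consume a `::` cast as a single token before trying `scan_param`; otherwise the second colon in `a::int` would look like the start of a named parameter.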

we will still use libpg_query for parsing sql, but this gives us the flexibility to

  • pre-process to remove unsupported stuff
  • parse non-sql content (e.g. commands) via a simple custom parser
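
One possible shape for the pre-processing step, sketched under the assumption that unsupported lines are blanked out rather than deleted so that byte offsets reported by the SQL parser still map onto the original source. `blank_command_lines` is a hypothetical helper, not something this PR adds.

```rust
/// Replaces every line that starts with a backslash with spaces of the same
/// byte length, keeping line endings, so offsets into the result still match
/// offsets into `source`.
pub fn blank_command_lines(source: &str) -> String {
    let mut out = String::with_capacity(source.len());
    for line in source.split_inclusive('\n') {
        if line.starts_with('\\') {
            for c in line.chars() {
                if c == '\n' || c == '\r' {
                    out.push(c);
                } else {
                    // One space per byte keeps byte offsets stable for non-ASCII text.
                    for _ in 0..c.len_utf8() {
                        out.push(' ');
                    }
                }
            }
        } else {
            out.push_str(line);
        }
    }
    out
}
```

The output has the same length as the input, so diagnostics computed on the pre-processed text can be shown against the original without any offset translation.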

todos:

  • use the new lexer in the splitter (to detect double newlines, we either need to give every whitespace token a length of one or add an explicit "double newline" token - I think I prefer the latter because it's much simpler to use and more performant)
  • make sure we support all the different parameter formats popular tools use
  • tests
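
For the first todo, the explicit "double newline" option could be sketched as a classification of each whitespace run. `WhitespaceKind` and `classify_whitespace` are hypothetical names, and treating any run with two or more line feeds as a double newline is an assumption made for this example.

```rust
/// Hypothetical whitespace variants the splitter could match on.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WhitespaceKind {
    /// Spaces or tabs only.
    Inline,
    /// Exactly one line break.
    Newline,
    /// Two or more line breaks, i.e. at least one blank line.
    DoubleNewline,
}

/// Classifies one run of consecutive whitespace characters.
/// Counting `\n` also covers `\r\n` line endings.
pub fn classify_whitespace(run: &str) -> WhitespaceKind {
    match run.bytes().filter(|&b| b == b'\n').count() {
        0 => WhitespaceKind::Inline,
        1 => WhitespaceKind::Newline,
        _ => WhitespaceKind::DoubleNewline,
    }
}
```

With this shape the splitter only has to look for one token kind, instead of counting consecutive length-one newline tokens.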
@psteinroe psteinroe changed the title refactor: parser Jul 1, 2025