
fix(extract): guard against large SQL file stack overflow (#691)#698

Open
lg320531124 wants to merge 1 commit into
DeusData:mainfrom
lg320531124:fix/691-sql-large-file-guard

Conversation

@lg320531124

Summary

The tree-sitter SQL grammar (39 MB parser.c) uses deeply recursive non-terminals that overflow the C stack on files with many statements (e.g. schema dumps >10K lines, stored procedures). The stack overflow kills the thread before any timeout callback can fire, so a pre-parse guard is the only safe mitigation.

Changes

  • Add a line-count guard (CBM_SQL_MAX_LINES = 5000) in cbm_extract_file() that returns has_error for SQL files exceeding the threshold
  • The pipeline skips the file gracefully instead of crashing
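A minimal sketch of what such a line-count check could look like. The helper name `sql_exceeds_line_limit` is hypothetical and not taken from the diff; only `CBM_SQL_MAX_LINES = 5000` comes from this PR. How `cbm_extract_file()` sets `has_error` is not shown here, so the sketch stops at the yes/no decision.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Threshold named in the PR description. */
#define CBM_SQL_MAX_LINES 5000

/* Hypothetical helper (not from the diff): returns true as soon as the
 * buffer is known to hold more than max_lines lines. A final line without
 * a trailing newline still counts as a line. The scan stops early, so a
 * very large file is never walked in full. */
static bool sql_exceeds_line_limit(const char *src, size_t len, size_t max_lines)
{
    assert(src != NULL || len == 0);
    if (len == 0) {
        return false;
    }

    size_t lines = 0;
    const char *p = src;
    const char *end = src + len;

    while (p < end) {
        const char *nl = memchr(p, '\n', (size_t)(end - p));
        lines++;
        if (lines > max_lines) {
            return true;   /* over the limit: caller should skip the parse */
        }
        if (nl == NULL) {
            break;         /* last line had no trailing newline */
        }
        p = nl + 1;
    }
    return false;
}
```

A caller would run this only for SQL inputs, before invoking the parser, and report the file as errored when it returns true.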

Why line-count, not byte-size?

  • With this grammar, C stack depth during the parse grows with the number of statements, roughly one frame per statement
  • A 500KB SQL dump with 10K CREATE TABLE lines overflows the default 512KB macOS thread stack
  • A 2MB compressed binary blob with few newlines does NOT overflow
  • Line count is the correct proxy for SQL stack depth

Why 5000 lines?

  • Default macOS stack for secondary threads: 512 KB
  • tree-sitter SQL frame size: ~100 bytes per recursion
  • 5000 lines × ~100 bytes ≈ 500 KB of stack, which stays just under the 512 KB limit
  • Captures virtually all crash-inducing workloads
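The budget above can be checked with a few lines of arithmetic. This is a sketch only: the 512 KB stack size and the ~100 bytes per frame are the estimates quoted in this description, not measured values, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Figures quoted in the PR description, treated here as assumptions. */
#define STACK_BYTES       (512u * 1024u)  /* 524288 bytes */
#define FRAME_BYTES       100u            /* rough per-statement stack cost */
#define CBM_SQL_MAX_LINES 5000u

/* Estimated stack consumption for a file with the given number of lines. */
static size_t estimated_stack_bytes(size_t lines)
{
    return lines * FRAME_BYTES;
}

/* Largest line count whose estimated usage still fits in stack_bytes. */
static size_t max_lines_for_stack(size_t stack_bytes)
{
    return stack_bytes / FRAME_BYTES;
}

/* Compile-time check that the chosen threshold fits the assumed budget. */
static_assert(CBM_SQL_MAX_LINES * FRAME_BYTES < STACK_BYTES,
              "threshold must fit inside the assumed stack budget");
```

Under these assumptions the break-even point is 5242 lines, so 5000 leaves about 24 KB of headroom, while a 10K-line dump would need roughly 1 MB.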

Why in cbm_extract_file(), not the discover layer?

  • The discover layer only checks byte size and knows nothing about SQL
  • The guard must be where tree-sitter is invoked — that is where the crash happens
  • Keeps the change minimal and colocated with the dangerous call

Testing

  • All 5720 existing tests pass
  • Small SQL files (<5000 lines) parse normally
  • Large SQL files (6000 lines) are skipped gracefully with has_error=true
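For reproducing the large-file case, a synthetic dump can be generated in memory. The helpers below are a hypothetical sketch, not part of this PR's test suite; they only build and measure the input and do not call `cbm_extract_file()`.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical test helper: builds n one-line CREATE TABLE statements.
 * Returns a NUL-terminated heap buffer (caller frees) and stores its
 * length, excluding the terminator, in *out_len. Returns NULL on
 * allocation failure. */
static char *make_sql_dump(size_t n, size_t *out_len)
{
    /* Longest line is 45 bytes (20-digit index), so 48 is a safe bound. */
    const size_t per_line = 48;
    char *buf = malloc(n * per_line + 1);
    if (buf == NULL) {
        return NULL;
    }

    size_t off = 0;
    for (size_t i = 0; i < n; i++) {
        int w = snprintf(buf + off, per_line, "CREATE TABLE t%zu (id INT);\n", i);
        assert(w > 0 && (size_t)w < per_line);
        off += (size_t)w;
    }
    buf[off] = '\0';
    *out_len = off;
    return buf;
}

/* Counts newline characters in the first len bytes of s. */
static size_t count_newlines(const char *s, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (s[i] == '\n') {
            count++;
        }
    }
    return count;
}
```

A 6000-statement buffer from `make_sql_dump` sits above the 5000-line threshold, and a 100-statement buffer sits well below it, which matches the two cases listed above.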

Fixes #691

Related: #668 (original report)

@lg320531124 lg320531124 requested a review from DeusData as a code owner June 29, 2026 10:27
@lg320531124 lg320531124 force-pushed the fix/691-sql-large-file-guard branch from 3875a72 to 0b83cfb Compare June 29, 2026 10:35
@DeusData

Owner

Huge thanks for opening this PR and for the work you put into it.

The maintainer shop is currently full, so this may sit for a bit before it gets a proper review. We will come back to this as soon as possible with real feedback; I wanted to make sure it did not sit unacknowledged in the meantime.

@DeusData DeusData added the bug, stability/performance, and priority/high labels Jun 29, 2026
…ter parser

The tree-sitter SQL grammar (39 MB parser.c) uses deeply recursive
non-terminals that overflow the C stack on files with many statements
(e.g. schema dumps >10K lines, stored procedures). The stack overflow
kills the thread before any timeout callback can fire, so we must
reject the file *before* calling ts_parser_parse_with_options().

Add a pre-parse line-count guard (CBM_SQL_MAX_LINES = 5000) in
cbm_extract_file() that returns has_error for SQL files exceeding the
threshold, allowing the indexing pipeline to skip gracefully rather
than crash.

Fixes DeusData#691

Signed-off-by: lg320531124 <lg320531124@users.noreply.github.com>
@lg320531124 lg320531124 force-pushed the fix/691-sql-large-file-guard branch from 0b83cfb to e8db9e5 Compare July 1, 2026 00:07


2 participants