fix(foundation): open UTF-8 repo paths with _wfopen on Windows #729
Open
GrenAnHao wants to merge 1 commit into
On Windows, plain fopen() interprets path bytes in the active ANSI code page (e.g. GBK on zh-CN systems). Indexing a repo whose absolute path contains non-ASCII characters (such as CJK directory names) therefore fails to read source files during extraction: discovery still walks the tree via wide-char APIs, but read_file() returns NULL and the index ends up with File/Folder nodes only, with no symbols and no IMPORTS or CALLS edges.

Add cbm_fopen() in compat_fs: UTF-8 paths are widened and opened with _wfopen on Windows; POSIX keeps using fopen. Route project file reads in pass_parallel, pass_definitions, pass_semantic, pass_envscan, and language disambiguation through cbm_fopen.

Verified on Windows (ACP 936): fopen(utf8 CJK path) fails, _wfopen succeeds; full index of a CJK-path Python repo yields 426 nodes / 1095 edges vs 61/59 before the fix.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: 岩工作室 <add336633@qq.com>
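For readers unfamiliar with the pattern, the wrapper described in this commit message could look roughly like the sketch below. It is not the code from the PR: the names `utf8_fopen` and `utf8_to_wide`, and the choice of `MultiByteToWideChar` with `MB_ERR_INVALID_CHARS`, are assumptions made for illustration. Only the overall shape (widen the UTF-8 path, call `_wfopen` on Windows, use `fopen` elsewhere) comes from the description above.

```c
#include <assert.h>
#include <stdio.h>

#ifdef _WIN32
#include <stdlib.h>
#include <wchar.h>
#include <windows.h>

/* Hypothetical helper: widen a NUL-terminated UTF-8 string into a freshly
 * malloc'd UTF-16 string. Returns NULL on invalid UTF-8 or allocation failure. */
static wchar_t *utf8_to_wide(const char *utf8)
{
    /* With a length of -1 the returned count includes the terminating NUL. */
    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, utf8, -1, NULL, 0);
    if (n <= 0)
        return NULL;
    wchar_t *wide = malloc((size_t)n * sizeof *wide);
    if (wide == NULL)
        return NULL;
    if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, utf8, -1, wide, n) <= 0) {
        free(wide);
        return NULL;
    }
    return wide;
}
#endif

/* Hypothetical stand-in for the PR's cbm_fopen(): open a file whose path is
 * UTF-8 encoded, independent of the active ANSI code page. */
FILE *utf8_fopen(const char *path, const char *mode)
{
    assert(path != NULL && mode != NULL);
#ifdef _WIN32
    wchar_t *wpath = utf8_to_wide(path);
    wchar_t *wmode = utf8_to_wide(mode);
    FILE *fp = (wpath != NULL && wmode != NULL) ? _wfopen(wpath, wmode) : NULL;
    free(wpath); /* free(NULL) is a no-op */
    free(wmode);
    return fp;
#else
    /* POSIX file names are byte strings, so UTF-8 passes through unchanged. */
    return fopen(path, mode);
#endif
}
```

A caller would use it exactly like `fopen`, for example `utf8_fopen(path, "rb")`, and must still check for a `NULL` return.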
0c0fe5f to fa404b7
Owner
Huge thanks for opening this PR and for the work you put into it. The maintainer shop is currently full, so this may sit for a bit before it gets a proper review. We will come back to this as soon as possible with real feedback; I wanted to make sure it did not sit unacknowledged in the meantime.
Summary
- Add `cbm_fopen()` in `compat_fs`: on Windows, UTF-8 paths are converted to UTF-16 and opened via `_wfopen`; on POSIX it delegates to `fopen`.
- Use `cbm_fopen` in the indexing hot path (`pass_parallel`, `pass_definitions`, `pass_semantic`, `pass_envscan`) and language disambiguation (`language.c`).

Problem
On Windows with a non-UTF-8 ANSI code page (e.g. GBK / ACP 936), `fopen()` interprets path bytes in the system code page. When the repo's absolute path contains non-ASCII characters (e.g. a CJK directory name), the UTF-8 path bytes are misinterpreted and `fopen` fails.

Directory discovery already uses wide-char APIs (`cbm_opendir` / `FindFirstFileW`), so the file tree is built. Extraction passes, however, use `fopen` in `read_file()` and silently skip every source file. The resulting index contains only `File` / `Folder` nodes and `CONTAINS_*` edges: no symbols, no `IMPORTS` / `CALLS` edges, and `file_hashes` stays empty.

Verification
Reproduced and fixed on Windows 11 (ACP 936):

| | Before (`fopen`) | After (`cbm_fopen`) |
| --- | --- | --- |
| Nodes | 61 | 426 |
| Edges | 59 (`CONTAINS_*` only) | 1095 (`IMPORTS`, `CALLS`, `DEFINES`, …) |

Minimal repro: `fopen(utf8_path_to_cjk_file)` returns NULL; `_wfopen(wide_path)` opens and reads successfully.

Built with official toolchain: MSYS2 CLANG64, `scripts/build.sh CC=clang CXX=clang++`.

Test plan
- `gcc -fsyntax-only` on all changed files (MSYS2 clang 22.1.7)
- `scripts/build.sh`
- `scripts/test.sh` (maintainer / Linux CI)

Notes
This is a focused bug fix (7 files, ~40 lines). No API surface or pipeline algorithm changes.
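To illustrate why the fix can stay this small, here is a hedged sketch of a whole-file reader in the style of the `read_file()` helper mentioned in this PR. The names `read_file_sketch` and `open_utf8` are hypothetical, and `open_utf8` is only a stand-in that calls `fopen` so the example compiles anywhere; in the PR the equivalent call would go through `cbm_fopen`. The point is that the open call is the only line a call site needs to change, and that a failed open surfaces as a `NULL` return the caller has to handle.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for a UTF-8 aware opener (assumption: the real project would call
 * its cbm_fopen() wrapper here). On POSIX plain fopen() is already enough. */
static FILE *open_utf8(const char *path, const char *mode)
{
    return fopen(path, mode);
}

/* Read a whole file into a malloc'd, NUL-terminated buffer.
 * On success returns the buffer and stores the byte count in *out_len.
 * Returns NULL if the file cannot be opened, sized, or read completely. */
char *read_file_sketch(const char *path, size_t *out_len)
{
    assert(path != NULL && out_len != NULL);

    FILE *fp = open_utf8(path, "rb"); /* the only line that changes */
    if (fp == NULL)
        return NULL;

    char *buf = NULL;
    long size = -1;
    if (fseek(fp, 0, SEEK_END) == 0 && (size = ftell(fp)) >= 0 &&
        fseek(fp, 0, SEEK_SET) == 0) {
        buf = malloc((size_t)size + 1);
        if (buf != NULL) {
            size_t got = fread(buf, 1, (size_t)size, fp);
            if (got == (size_t)size) {
                buf[got] = '\0';
                *out_len = got;
            } else {
                /* Short read: treat as failure instead of returning partial data. */
                free(buf);
                buf = NULL;
            }
        }
    }
    fclose(fp);
    return buf;
}
```

A caller that logs the path when this returns `NULL` would have made the original failure visible instead of producing an index with no symbols.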
Integrators calling the CLI on Windows with non-ASCII paths in JSON args may still need ASCII `\uXXXX` escaping for argv (separate from this fix); MCP/stdio paths are unaffected.
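As a hedged illustration of the escaping mentioned above, the sketch below rewrites a UTF-8 string so that every non-ASCII code point becomes a JSON `\uXXXX` escape (a surrogate pair above U+FFFF), which keeps a JSON argument pure ASCII on the command line. The function names `json_escape_ascii` and `utf8_decode` are hypothetical and not part of this project, and the decoder is simplified: it does not reject overlong encodings or encoded surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Decode one UTF-8 sequence at s into *cp. Returns the number of bytes
 * consumed, or 0 on malformed input. Relies on s being NUL-terminated:
 * a NUL byte never matches the continuation-byte pattern 10xxxxxx. */
static size_t utf8_decode(const unsigned char *s, unsigned long *cp)
{
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
        *cp = ((unsigned long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0 && (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80) {
        *cp = ((unsigned long)(s[0] & 0x0F) << 12) |
              ((unsigned long)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    }
    if ((s[0] & 0xF8) == 0xF0 && (s[1] & 0xC0) == 0x80 &&
        (s[2] & 0xC0) == 0x80 && (s[3] & 0xC0) == 0x80) {
        *cp = ((unsigned long)(s[0] & 0x07) << 18) |
              ((unsigned long)(s[1] & 0x3F) << 12) |
              ((unsigned long)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
        return 4;
    }
    return 0;
}

/* Write the contents of a JSON string (without surrounding quotes) for the
 * UTF-8 input, using only ASCII. Returns 0 on success, -1 if the input is
 * malformed or the output buffer is too small. */
int json_escape_ascii(const char *utf8, char *out, size_t out_size)
{
    assert(utf8 != NULL && out != NULL);
    const unsigned char *s = (const unsigned char *)utf8;
    size_t used = 0;

    while (*s != '\0') {
        unsigned long cp;
        char buf[13]; /* longest piece: two \uXXXX escapes plus NUL */
        int len;
        size_t n = utf8_decode(s, &cp);
        if (n == 0 || cp > 0x10FFFF)
            return -1;
        if (cp == '"' || cp == '\\') {
            len = snprintf(buf, sizeof buf, "\\%c", (int)cp);
        } else if (cp >= 0x20 && cp < 0x7F) {
            len = snprintf(buf, sizeof buf, "%c", (int)cp);
        } else if (cp < 0x10000) {
            len = snprintf(buf, sizeof buf, "\\u%04lx", cp);
        } else {
            /* Code points above U+FFFF need a UTF-16 surrogate pair. */
            unsigned long v = cp - 0x10000;
            len = snprintf(buf, sizeof buf, "\\u%04lx\\u%04lx",
                           0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF));
        }
        if (used + (size_t)len + 1 > out_size)
            return -1;
        memcpy(out + used, buf, (size_t)len);
        used += (size_t)len;
        s += n;
    }
    if (used + 1 > out_size)
        return -1;
    out[used] = '\0';
    return 0;
}
```

For example, the path `D:/项目/a.py` comes out as `D:/\u9879\u76ee/a.py`, which survives argv regardless of the active code page.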