fix(foundation): open UTF-8 repo paths with _wfopen on Windows #729

Open
GrenAnHao wants to merge 1 commit into DeusData:main from GrenAnHao:fix/windows-utf8-fopen-cjk-paths

@GrenAnHao

Summary

  • Add portable cbm_fopen() in compat_fs: on Windows, UTF-8 paths are converted to UTF-16 and opened via _wfopen; on POSIX it delegates to fopen.
  • Route project source file reads through cbm_fopen in the indexing hot path (pass_parallel, pass_definitions, pass_semantic, pass_envscan) and language disambiguation (language.c).
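
A minimal sketch of what such a wrapper could look like, assuming the Windows branch widens the path with `MultiByteToWideChar` and the mode string is widened the same way. The names `sketch_fopen_utf8` and `utf8_to_wide` are hypothetical; the actual `cbm_fopen()` in compat_fs may differ in signature and error handling.

```c
#include <stdio.h>
#include <stdlib.h>

#ifdef _WIN32
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <wchar.h>

/* Widen a NUL-terminated UTF-8 string to UTF-16. Caller frees the result. */
static wchar_t *utf8_to_wide(const char *s) {
    /* First call only asks for the required length, terminator included. */
    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, NULL, 0);
    if (n <= 0) return NULL;
    wchar_t *w = malloc((size_t)n * sizeof *w);
    if (w == NULL) return NULL;
    if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, w, n) <= 0) {
        free(w);
        return NULL;
    }
    return w;
}
#endif

/* Hypothetical stand-in for cbm_fopen(): the path is always treated as UTF-8. */
FILE *sketch_fopen_utf8(const char *path, const char *mode) {
    if (path == NULL || mode == NULL) return NULL;
#ifdef _WIN32
    wchar_t *wpath = utf8_to_wide(path);
    wchar_t *wmode = utf8_to_wide(mode);
    /* _wfopen takes UTF-16, so the ANSI code page never sees the path. */
    FILE *fp = (wpath != NULL && wmode != NULL) ? _wfopen(wpath, wmode) : NULL;
    free(wpath);
    free(wmode);
    return fp;
#else
    /* POSIX file systems take the bytes as-is, so plain fopen is enough. */
    return fopen(path, mode);
#endif
}
```

A call site would then change from `fopen(path, "rb")` to `sketch_fopen_utf8(path, "rb")` with no other differences, which is why the change can stay small.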

Problem

On Windows with a non-UTF-8 ANSI code page (e.g. GBK / ACP 936), fopen() interprets path bytes in the system code page. When the repo absolute path contains non-ASCII characters (e.g. a CJK directory name), UTF-8 path bytes are misinterpreted and fopen fails.
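
To make the failure condition concrete: bytes below 0x80 mean the same thing in UTF-8 and in ANSI code pages such as 936, so only paths containing bytes at or above 0x80 are at risk. The helper below is a hypothetical illustration (it is not part of this PR) that flags such paths.

```c
#include <stdbool.h>

/* Hypothetical helper: true if the path contains any byte outside 7-bit
 * ASCII, i.e. a byte that a narrow fopen() on Windows would decode using
 * the active ANSI code page instead of UTF-8. */
bool path_has_non_ascii(const char *path) {
    for (const unsigned char *p = (const unsigned char *)path; *p != 0; ++p) {
        if (*p >= 0x80) return true;
    }
    return false;
}
```

For example, the two-character directory name U+4E2D U+6587 is six bytes in UTF-8 (`E4 B8 AD E6 96 87`) but four bytes in GBK (`D6 D0 CE C4`), so a narrow `fopen()` on a GBK system looks up a name that does not exist on disk.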

Directory discovery already uses wide-char APIs (cbm_opendir / FindFirstFileW), so the file tree is built. Extraction passes, however, use fopen in read_file() and silently skip every source file. The resulting index contains only File / Folder / CONTAINS_* nodes — no symbols, no IMPORTS / CALLS edges, and file_hashes stays empty.
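
The silent skip can be sketched as follows, assuming `read_file()` is a conventional "slurp the whole file" helper that returns NULL on any failure. `read_file_sketch` is a hypothetical reconstruction, not the project's actual function.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical reconstruction of a read_file()-style helper.
 * Returns a malloc'd, NUL-terminated buffer, or NULL on any failure. */
char *read_file_sketch(const char *path, size_t *out_len) {
    FILE *fp = fopen(path, "rb");  /* the kind of call the PR routes through cbm_fopen() */
    if (fp == NULL) return NULL;   /* the caller sees NULL and moves on to the next file */

    char *buf = NULL;
    long size;
    if (fseek(fp, 0, SEEK_END) == 0 && (size = ftell(fp)) >= 0 &&
        fseek(fp, 0, SEEK_SET) == 0) {
        buf = malloc((size_t)size + 1);
        if (buf != NULL) {
            size_t got = fread(buf, 1, (size_t)size, fp);
            buf[got] = '\0';
            if (out_len != NULL) *out_len = got;
        }
    }
    fclose(fp);
    return buf;
}
```

Because a NULL return here is indistinguishable from an unreadable or missing file, an extraction pass that skips on NULL produces no error output even when every file in the repo fails to open.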

Verification

Reproduced and fixed on Windows 11 (ACP 936):

| Metric | Before (fopen) | After (cbm_fopen) |
| --- | --- | --- |
| Nodes | 61 (structure only) | 426 |
| Edges | 59 (CONTAINS_* only) | 1095 (IMPORTS, CALLS, DEFINES, …) |
| File→file IMPORTS edges | 0 | 12+ |
| File→file CALLS edges | 0 | 103 |

Minimal repro: fopen(utf8_path_to_cjk_file) returns NULL; _wfopen(wide_path) opens and reads successfully.

Built with official toolchain: MSYS2 CLANG64, scripts/build.sh CC=clang CXX=clang++.

Test plan

  • Manual repro on Windows (ACP 936): CJK repo path indexes with symbols
  • gcc -fsyntax-only on all changed files (MSYS2 clang 22.1.7)
  • Production binary build via scripts/build.sh
  • CI scripts/test.sh (maintainer / Linux CI)

Notes

This is a focused bug fix (7 files, ~40 lines). No API surface or pipeline algorithm changes.

Integrators calling the CLI on Windows with non-ASCII paths in JSON args may still need ASCII \uXXXX escaping for argv (separate from this fix); MCP/stdio paths are unaffected.
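
For integrators who need that workaround, here is a minimal sketch of ASCII-escaping already-serialized JSON text before placing it in argv. The function name `escape_non_ascii` is hypothetical and nothing like it is part of this PR. It assumes well-formed UTF-8 input and does not reject overlong encodings; code points above U+FFFF are written as surrogate pairs, as JSON requires.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Decode one UTF-8 sequence at s. Returns bytes consumed, or 0 if malformed. */
static size_t decode_utf8(const unsigned char *s, uint32_t *cp) {
    size_t n;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; n = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; n = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; n = 4; }
    else return 0;
    for (size_t i = 1; i < n; ++i) {
        if ((s[i] & 0xC0) != 0x80) return 0;  /* also stops at a premature NUL */
        *cp = (*cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    return n;
}

/* Copy JSON text to out, replacing every non-ASCII character with \uXXXX.
 * ASCII bytes (including existing quotes and backslashes) pass through.
 * Returns 0 on success, -1 on malformed input or if out is too small. */
int escape_non_ascii(const char *utf8, char *out, size_t cap) {
    const unsigned char *s = (const unsigned char *)utf8;
    size_t used = 0;
    if (cap == 0) return -1;
    out[0] = '\0';
    while (*s != 0) {
        uint32_t cp;
        size_t n = decode_utf8(s, &cp);
        int w;
        if (n == 0) return -1;
        if (cp < 0x80) {
            w = snprintf(out + used, cap - used, "%c", (int)cp);
        } else if (cp < 0x10000) {
            w = snprintf(out + used, cap - used, "\\u%04X", (unsigned)cp);
        } else {
            cp -= 0x10000;  /* split into a UTF-16 surrogate pair */
            w = snprintf(out + used, cap - used, "\\u%04X\\u%04X",
                         (unsigned)(0xD800 + (cp >> 10)),
                         (unsigned)(0xDC00 + (cp & 0x3FF)));
        }
        if (w < 0 || (size_t)w >= cap - used) return -1;
        used += (size_t)w;
        s += n;
    }
    return 0;
}
```

Since valid JSON only allows non-ASCII characters inside string values, applying this to the whole serialized argument yields equivalent JSON that consists solely of ASCII bytes, so the argv code-page conversion cannot alter it.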

@GrenAnHao GrenAnHao requested a review from DeusData as a code owner June 30, 2026 21:18
On Windows, plain fopen() interprets path bytes in the active ANSI code
page (e.g. GBK on zh-CN systems). Indexing a repo whose absolute path
contains non-ASCII characters (CJK directory names) therefore fails to
read source files during extraction: discovery still walks the tree via
wide-char APIs, but read_file() returns NULL and the index ends up with
File/Folder nodes only — no symbols, IMPORTS, or CALLS edges.

Add cbm_fopen() in compat_fs: UTF-8 paths are widened and opened with
_wfopen on Windows; POSIX keeps using fopen. Route project file reads in
pass_parallel, pass_definitions, pass_semantic, pass_envscan, and
language disambiguation through cbm_fopen.

Verified on Windows (ACP 936): fopen(utf8 CJK path) fails, _wfopen succeeds;
full index of a CJK-path Python repo yields 426 nodes / 1095 edges vs 61/59
before the fix.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: 岩工作室 <add336633@qq.com>
@GrenAnHao GrenAnHao force-pushed the fix/windows-utf8-fopen-cjk-paths branch from 0c0fe5f to fa404b7 Compare June 30, 2026 21:57
@DeusData DeusData added the bug, parsing/quality, windows, and priority/high labels Jul 1, 2026
@DeusData

DeusData commented Jul 1, 2026
Owner

Huge thanks for opening this PR and for the work you put into it.

The maintainer shop is currently full, so this may sit for a bit before it gets a proper review. We will come back to this as soon as possible with real feedback; I wanted to make sure it did not sit unacknowledged in the meantime.
