
fix: detok review findings (null vocab guard + utf-8 tokens read) #303

Merged
LauraGPT merged 1 commit into main from fix/detok-robustness on Jun 20, 2026

@LauraGPT

Member

Addresses the gemini-code-assist findings on the merged in-binary detok (#3004/#302).

  • Null guard on gguf_get_arr_str (funasr-sensevoice.cpp / funasr-paraformer.cpp): a malformed GGUF could return nullptr for a vocab element, and constructing a std::string from a null pointer is undefined behavior. A null element is now coerced to an empty string.
  • tokens.json read (export_paraformer_gguf.py): now uses with open(..., encoding="utf-8"). The context manager closes the file deterministically (no fd leak), and the explicit UTF-8 encoding lets the non-ASCII tokens load correctly on Windows, where the default text encoding is locale-dependent (commonly cp1252).

Not changed: adjacent VAD-segment texts are still concatenated without a separator. This is intentional for Chinese, where the VAD splits fall mid-utterance and an inserted space would fragment the sentence (CER is unaffected either way). --ids still outputs the raw ids if a caller wants per-segment structure.
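
The effect of the separator choice can be sketched as follows. `join_segments` is a hypothetical helper written for this illustration, not a function from the binaries, and the sample segment strings are made up.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: join per-segment texts with an optional separator.
// An empty separator reproduces plain concatenation.
std::string join_segments(const std::vector<std::string> & segments,
                          const std::string & sep = "") {
    std::string out;
    for (std::size_t i = 0; i < segments.size(); ++i) {
        if (i > 0) {
            out += sep;  // separator goes only between segments
        }
        out += segments[i];
    }
    return out;
}
```

Under these assumptions, `join_segments({"今天天气", "很好"})` yields one unbroken sentence, while passing `" "` as the separator would insert a space at the VAD split point in the middle of the utterance.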

Verified: rebuilt, both binaries still print the correct text (sample + full 002).

@LauraGPT LauraGPT merged commit 07a2110 into main Jun 20, 2026