fix: detok review findings (null vocab guard + utf-8 tokens read) by LauraGPT · Pull Request #303 · FunAudioLLM/SenseVoice

LauraGPT · 2026-06-20T08:49:22Z

Addresses the gemini-code-assist findings on the merged in-binary detok (#3004/#302).

- Null guard on gguf_get_arr_str (funasr-sensevoice.cpp / funasr-paraformer.cpp): a malformed GGUF could return nullptr for a vocab element, and assigning that to std::string is UB. A null is now coerced to an empty string.
- tokens.json read (export_paraformer_gguf.py): use with open(..., encoding="utf-8"). The context manager avoids an fd leak, and the explicit UTF-8 makes the non-ASCII tokens load correctly on Windows, where the default encoding is a locale code page such as cp1252.
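For reference, the tokens.json read pattern can be sketched roughly as below. This is a minimal illustration, not the code in export_paraformer_gguf.py: the helper name `load_tokens` is hypothetical, and the file layout (a flat JSON list of token strings) is an assumption.

```python
import json
import os
import tempfile


def load_tokens(path):
    """Read a token list from a JSON file, decoding it as UTF-8 on every platform.

    Hypothetical helper; assumes the file holds a flat JSON list of strings.
    """
    # The context manager closes the file descriptor even if json.load raises.
    # The explicit encoding avoids falling back to the locale default.
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Round trip with non-ASCII tokens written to a temporary file.
sample = ["<blank>", "你", "好", "▁the"]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tokens.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False)
    loaded = load_tokens(path)

print(loaded == sample)
```

Without the `encoding` argument, `open` in text mode uses the locale's preferred encoding, which is why the same file can load fine on Linux and fail or produce mojibake on Windows.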

Not changed: adjacent VAD-segment texts are concatenated without a separator — that is intentional for Chinese, where the VAD splits fall mid-utterance and a space would fragment the sentence (CER is unaffected either way). --ids still gives raw ids if a caller wants per-segment structure.
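A small sketch of why no separator is used, assuming per-segment texts arrive as a list of strings. The function name `join_segments` and the sample segments are hypothetical and are not taken from the binaries.

```python
def join_segments(segment_texts, separator=""):
    """Concatenate per-VAD-segment texts; the default inserts no separator."""
    return separator.join(segment_texts)


# Hypothetical segments where the VAD split fell in the middle of one utterance.
segments = ["今天天气", "很好"]

joined = join_segments(segments)
spaced = join_segments(segments, separator=" ")

print(joined)
print(spaced)
```

With the empty separator the sentence stays intact, while a space would leave a visible break inside it.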

Verified: rebuilt, both binaries still print the correct text (sample + full 002).

Commit 22dc58c: fix: guard against null gguf_get_arr_str when reading sv.vocab (detok… review)

LauraGPT merged commit 07a2110 into main Jun 20, 2026

LauraGPT merged 1 commit into main from fix/detok-robustness
