feat(fts): add UTF-8 support for tokenizer and token filters via utf8proc by egolearner · Pull Request #515 · alibaba/zvec

egolearner · 2026-06-23T03:43:11Z

Integrate utf8proc 2.11.3 and upgrade the FTS text analysis pipeline to support full Unicode:

StandardTokenizer:

Replace ASCII-only std::isalnum with utf8proc_category for Unicode-aware word boundary detection
Support letters and digits of any script (Latin, Cyrillic, Greek, etc.)
Emit CJK ideographs as individual single-character tokens, aligned with Elasticsearch standard tokenizer behavior
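
A minimal sketch of the tokenization rule described above, not the PR's actual code. It works on already-decoded code points, and `IsWordChar` and `IsHan` are hypothetical stand-ins for the `utf8proc_category` lookup, covering only a few hand-picked ranges. `TokenizeSketch` is likewise an illustrative name.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a utf8proc_category() lookup: it only knows a few
// ranges (ASCII, Latin-1 letters, Greek, Cyrillic), just enough for the demo.
static bool IsWordChar(char32_t cp) {
  if ((cp >= U'0' && cp <= U'9') || (cp >= U'A' && cp <= U'Z') ||
      (cp >= U'a' && cp <= U'z')) {
    return true;
  }
  // Latin-1 letters, excluding the multiplication and division signs.
  if (cp >= 0x00C0 && cp <= 0x00FF && cp != 0x00D7 && cp != 0x00F7) return true;
  if (cp >= 0x0370 && cp <= 0x03FF) return true;  // Greek block (approximate)
  if (cp >= 0x0400 && cp <= 0x04FF) return true;  // Cyrillic block (approximate)
  return false;
}

// Basic CJK Unified Ideographs block only (assumption: a real check is wider).
static bool IsHan(char32_t cp) { return cp >= 0x4E00 && cp <= 0x9FFF; }

std::vector<std::u32string> TokenizeSketch(const std::u32string& text) {
  std::vector<std::u32string> tokens;
  std::u32string current;
  auto flush = [&] {
    if (!current.empty()) {
      tokens.push_back(current);
      current.clear();
    }
  };
  for (char32_t cp : text) {
    if (IsHan(cp)) {
      flush();                                  // end any pending word
      tokens.push_back(std::u32string(1, cp));  // one token per ideograph
    } else if (IsWordChar(cp)) {
      current.push_back(cp);                    // extend the current word
    } else {
      flush();                                  // everything else separates
    }
  }
  flush();
  return tokens;
}
```

Under these assumptions, a mixed-script input such as "hello мир 中文" would yield the tokens "hello", "мир", "中", and "文".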

LowercaseTokenFilter:

Replace byte-level std::tolower with codepoint-aware utf8proc_tolower, supporting full Unicode lowercase conversion
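
To illustrate why lowercasing has to operate on code points rather than bytes, here is a minimal sketch that decodes UTF-8, lowercases each code point, and re-encodes. It does not use utf8proc: `ToyLower` is a hypothetical stand-in for `utf8proc_tolower` that handles only ASCII, Latin-1 capitals, and basic Cyrillic capitals, and the decoder assumes well-formed input.

```cpp
#include <cstddef>
#include <string>

// Decode one UTF-8 sequence starting at s[i] and advance i.
// Assumes well-formed input; a production filter would have to validate.
static char32_t DecodeOne(const std::string& s, std::size_t& i) {
  unsigned char b0 = static_cast<unsigned char>(s[i++]);
  if (b0 < 0x80) return b0;
  int extra = (b0 >= 0xF0) ? 3 : (b0 >= 0xE0) ? 2 : 1;
  char32_t cp = b0 & (0x3F >> extra);  // payload bits of the lead byte
  while (extra-- > 0 && i < s.size()) {
    cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
  }
  return cp;
}

// Append the UTF-8 encoding of cp to out.
static void EncodeOne(char32_t cp, std::string& out) {
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Hypothetical stand-in for utf8proc_tolower(): a few ranges only.
static char32_t ToyLower(char32_t cp) {
  if (cp >= U'A' && cp <= U'Z') return cp + 32;
  if (cp >= 0x00C0 && cp <= 0x00DE && cp != 0x00D7) return cp + 32;  // À..Þ
  if (cp >= 0x0410 && cp <= 0x042F) return cp + 32;                  // А..Я
  return cp;
}

std::string LowercaseUtf8Sketch(const std::string& in) {
  std::string out;
  out.reserve(in.size());
  for (std::size_t i = 0; i < in.size();) {
    EncodeOne(ToyLower(DecodeOne(in, i)), out);
  }
  return out;
}
```

A byte-level `std::tolower` in the default "C" locale would leave the two bytes of "É" (0xC3 0x89) untouched, whereas the code-point path maps U+00C9 to U+00E9.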

AsciiFoldingTokenFilter (new):

Convert Unicode characters to ASCII equivalents using NFKD decomposition + STRIPMARK via utf8proc
Supplementary folding table for characters without decomposition mappings (ø→o, đ→d, ß→ss, Æ→AE, Þ→TH, etc.)
Registered as "ascii_folding" in TokenizerFactory
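
The supplementary folding step could look roughly like the sketch below. It covers only the table lookup, not the NFKD + STRIPMARK pass, which would need utf8proc. The table holds just the five examples named above; the function and table names are hypothetical and not taken from the PR.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical supplementary table: only the examples listed in the PR
// description, for characters that have no decomposition mapping.
static const std::unordered_map<char32_t, std::u32string>& FoldTable() {
  static const std::unordered_map<char32_t, std::u32string> table = {
      {U'\u00F8', U"o"},   // ø
      {U'\u0111', U"d"},   // đ
      {U'\u00DF', U"ss"},  // ß
      {U'\u00C6', U"AE"},  // Æ
      {U'\u00DE', U"TH"},  // Þ
  };
  return table;
}

// Replace table hits with their ASCII expansion; leave everything else as-is.
std::u32string FoldSupplementarySketch(const std::u32string& in) {
  std::u32string out;
  for (char32_t cp : in) {
    auto it = FoldTable().find(cp);
    if (it != FoldTable().end()) {
      out += it->second;  // may expand one code point to several
    } else {
      out.push_back(cp);
    }
  }
  return out;
}
```

Because entries such as ß→ss expand one code point into two, the output can be longer than the input, so a filter built this way cannot fold in place.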

Tests: 51 new unit tests (11 lowercase + 26 ascii_folding + 14 standard)


feihongxu0824 · 2026-07-01T02:44:49Z

+    case UTF8PROC_CATEGORY_LT:  // Letter, titlecase
+    case UTF8PROC_CATEGORY_LM:  // Letter, modifier
+    case UTF8PROC_CATEGORY_LO:  // Letter, other
+    case UTF8PROC_CATEGORY_MN:  // Mark, nonspacing


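The ES/Lucene standard tokenizer is based on UAX #29 and distinguishes word boundaries, rather than simply treating marks as word characters that can start a token. The code here currently allows MN/MC/ME to start a token on their own, which may cause variation selectors or standalone marks to be indexed. I suggest at least distinguishing token-start from token-continue.

Another question: what are the current differences between the zvec and ES standard tokenizers?

MN/MC/ME can no longer start a token on their own. A follow-up PR will consider aligning further with the ES standard tokenizer by following UAX #29.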

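Main differences between zvec's current standard tokenizer and the ES/Lucene standard tokenizer:

Different overall implementation model
zvec approximates with rules based on the Unicode General Category: Letter/Number can start a token, Mark can only continue a token, and CJK Han characters are split into single-character tokens.
ES/Lucene uses a UAX #29 word boundary state machine, whose rules are more fine-grained.

Different handling of word-internal punctuation
zvec treats non-alphanumeric characters such as the apostrophe ('), period (.), comma (,), and underscore (_) as separators.
ES/Lucene keeps some word-internal punctuation per UAX #29; for example, dog's, 3.14, and 1,000 are usually not simply split apart.

Different emoji behavior
zvec currently does not index emoji for the most part; after this fix, variation selectors no longer form tokens on their own either.
Lucene's StandardTokenizer has an EMOJI token type and recognizes some emoji sequences.

Different multi-script classification
zvec emits single-character tokens only for CJK Han ideographs; Japanese kana, Korean, Thai, and so on are mostly aggregated as runs of ordinary Letters.
Lucene has finer-grained token types, such as IDEOGRAPHIC, HIRAGANA, KATAKANA, HANGUL, and SOUTHEAST_ASIAN.

Mark/combining character rules are still not full UAX #29
After this fix, MN/MC/ME can no longer start a token on their own and can only continue an existing token, which prevents standalone marks and variation selectors from being indexed.
This only fixes the issue raised in the review comment; it is not equivalent to full UAX #29 Extend/Format/ZWJ handling.

max_token_length is only partially aligned
zvec defaults to 255 and splits tokens that exceed it, which is close to the ES configuration semantics.
ES/Lucene places more complete constraints on the configuration range and splitting behavior; zvec currently only checks that the value is > 0.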
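
The token-start versus token-continue rule, together with punctuation acting as a separator, can be sketched as follows. This is an illustration under stated assumptions rather than the code in standard_tokenizer.cc: `Classify` is a hypothetical stand-in for the General Category lookup that knows only ASCII letters and digits, the combining diacritics block, and the variation selectors, and Han handling is left out.

```cpp
#include <string>
#include <vector>

enum class Cat { kLetter, kNumber, kMark, kOther };

// Hypothetical stand-in for the General Category lookup
// (utf8proc_category in the PR); covers a handful of ranges only.
static Cat Classify(char32_t cp) {
  if (cp >= U'0' && cp <= U'9') return Cat::kNumber;
  if ((cp >= U'A' && cp <= U'Z') || (cp >= U'a' && cp <= U'z')) {
    return Cat::kLetter;
  }
  if (cp >= 0x0300 && cp <= 0x036F) return Cat::kMark;  // combining diacritics
  if (cp >= 0xFE00 && cp <= 0xFE0F) return Cat::kMark;  // variation selectors
  return Cat::kOther;
}

// Letters and numbers may start a token; marks may only continue one.
static bool CanStart(Cat c) { return c == Cat::kLetter || c == Cat::kNumber; }
static bool CanContinue(Cat c) { return CanStart(c) || c == Cat::kMark; }

std::vector<std::u32string> SplitSketch(const std::u32string& text) {
  std::vector<std::u32string> tokens;
  std::u32string current;
  for (char32_t cp : text) {
    Cat c = Classify(cp);
    bool accept = current.empty() ? CanStart(c) : CanContinue(c);
    if (accept) {
      current.push_back(cp);
    } else if (!current.empty()) {
      tokens.push_back(current);  // separator (or stray mark) ends the token
      current.clear();
    }
  }
  if (!current.empty()) tokens.push_back(current);
  return tokens;
}
```

With this rule, "e" followed by U+0301 stays a single token, a standalone U+FE0F produces no token, and "dog's 3.14" splits into "dog", "s", "3", and "14", which is the punctuation difference from UAX #29 noted above.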

OK. You could change UAX #29 -> UAX #29, otherwise it will be displayed as

egolearner requested review from chinaux and zhourrr as code owners June 23, 2026 03:43

github-actions Bot assigned egolearner Jun 23, 2026

egolearner force-pushed the feat/fts-utf8-lowercase-ascii-folding branch 2 times, most recently from 084b3fa to a0793cf Compare June 23, 2026 06:22

feihongxu0824 self-requested a review June 30, 2026 08:41

feihongxu0824 reviewed Jun 30, 2026


Comment thread src/db/index/column/fts_column/tokenizer/standard_tokenizer.h Outdated

Comment thread src/db/index/column/fts_column/tokenizer/standard_tokenizer.cc

Comment thread src/db/index/column/fts_column/tokenizer/ascii_folding_token_filter.cc Outdated

feihongxu0824 reviewed Jun 30, 2026


Comment thread src/db/index/column/fts_column/tokenizer/tokenizer_factory.cc

egolearner added 2 commits June 30, 2026 17:56

addressing comments

7126d78

egolearner force-pushed the feat/fts-utf8-lowercase-ascii-folding branch from a0793cf to 7126d78 Compare June 30, 2026 09:57

egolearner requested a review from Cuiyus as a code owner June 30, 2026 09:57

egolearner added 2 commits June 30, 2026 19:18

update python sdk doc

5fb9b61

test(fts): cover decomposed ascii folding behavior

a407630

feihongxu0824 reviewed Jul 1, 2026


fix: prevent marks from starting standard tokens

7189776

egolearner wants to merge 5 commits into alibaba:main from egolearner:feat/fts-utf8-lowercase-ascii-folding


2 participants
