feat(fts): add UTF-8 support for tokenizer and token filters via utf8proc #515

Open
egolearner wants to merge 5 commits into
alibaba:mainfrom
egolearner:feat/fts-utf8-lowercase-ascii-folding
Conversation

@egolearner

Collaborator

Integrate utf8proc 2.11.3 and upgrade the FTS text analysis pipeline to support full Unicode:

StandardTokenizer:

  • Replace ASCII-only std::isalnum with utf8proc_category for Unicode-aware word boundary detection
  • Support letters and digits of any script (Latin, Cyrillic, Greek, etc.)
  • Emit CJK ideographs as individual single-character tokens, aligned with Elasticsearch standard tokenizer behavior

LowercaseTokenFilter:

  • Replace byte-level std::tolower with codepoint-aware utf8proc_tolower, supporting full Unicode lowercase conversion
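
To illustrate why a byte-level `std::tolower` is not enough for UTF-8 input, here is a minimal sketch of the decode, map, re-encode loop that a codepoint-aware filter needs. It is not the code in this PR: `ToLowerCodepoint` is a hypothetical stand-in for `utf8proc_tolower` that covers only ASCII, Latin-1, and basic Cyrillic, and `DecodeUtf8`, `AppendUtf8`, and `LowercaseUtf8` are illustrative names that assume well-formed input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode one code point from well-formed UTF-8 at s[i] and advance i.
// No validation is done; malformed input is out of scope for this sketch.
static char32_t DecodeUtf8(const std::string& s, std::size_t& i) {
  assert(i < s.size());
  unsigned char lead = static_cast<unsigned char>(s[i++]);
  if (lead < 0x80) return lead;
  int extra = (lead >= 0xF0) ? 3 : (lead >= 0xE0) ? 2 : 1;
  char32_t cp = lead & (0x3F >> extra);
  while (extra-- > 0 && i < s.size()) {
    cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
  }
  return cp;
}

// Append the UTF-8 encoding of cp to out.
static void AppendUtf8(char32_t cp, std::string& out) {
  assert(cp <= 0x10FFFF);
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Hypothetical stand-in for utf8proc_tolower: handles a few ranges only.
static char32_t ToLowerCodepoint(char32_t cp) {
  if (cp >= U'A' && cp <= U'Z') return cp + 0x20;
  if (cp >= 0x00C0 && cp <= 0x00DE && cp != 0x00D7) return cp + 0x20;  // Latin-1
  if (cp >= 0x0410 && cp <= 0x042F) return cp + 0x20;  // Cyrillic U+0410..U+042F
  if (cp >= 0x0400 && cp <= 0x040F) return cp + 0x50;  // Cyrillic U+0400..U+040F
  return cp;
}

// Lowercase a UTF-8 string one code point at a time.
std::string LowercaseUtf8(const std::string& in) {
  std::string out;
  out.reserve(in.size());
  std::size_t i = 0;
  while (i < in.size()) AppendUtf8(ToLowerCodepoint(DecodeUtf8(in, i)), out);
  return out;
}
```

With this sketch, `LowercaseUtf8("ПРИВЕТ")` yields `"привет"`, whereas a per-byte ASCII lowercase would leave the multi-byte sequences untouched. A real filter would delegate the mapping to utf8proc instead of the hand-written ranges.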

AsciiFoldingTokenFilter (new):

  • Convert Unicode characters to ASCII equivalents using NFKD decomposition + STRIPMARK via utf8proc
  • Supplementary folding table for characters without decomposition mappings (ø→o, đ→d, ß→ss, Æ→AE, Þ→TH, etc.)
  • Registered as "ascii_folding" in TokenizerFactory
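
The supplementary table exists because characters such as ø, đ, ß, Æ, and Þ have no Unicode decomposition, so NFKD plus mark stripping leaves them unchanged. The sketch below shows only that table-lookup step, under assumptions: the table contains just the five mappings named above (the PR's real table is larger and its contents are not shown here), the NFKD + STRIPMARK pass is omitted, and `FoldTable` and `FoldSupplementary` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Decode one code point from well-formed UTF-8 at s[i] and advance i.
// No validation is done; malformed input is out of scope for this sketch.
static char32_t DecodeUtf8(const std::string& s, std::size_t& i) {
  assert(i < s.size());
  unsigned char lead = static_cast<unsigned char>(s[i++]);
  if (lead < 0x80) return lead;
  int extra = (lead >= 0xF0) ? 3 : (lead >= 0xE0) ? 2 : 1;
  char32_t cp = lead & (0x3F >> extra);
  while (extra-- > 0 && i < s.size()) {
    cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
  }
  return cp;
}

// Illustrative table holding only the five mappings named in the PR description.
static const std::unordered_map<char32_t, const char*>& FoldTable() {
  static const std::unordered_map<char32_t, const char*> table = {
      {0x00F8, "o"},   // ø
      {0x0111, "d"},   // đ
      {0x00DF, "ss"},  // ß
      {0x00C6, "AE"},  // Æ
      {0x00DE, "TH"},  // Þ
  };
  return table;
}

// Replace table entries with their ASCII form; copy everything else verbatim.
std::string FoldSupplementary(const std::string& in) {
  std::string out;
  std::size_t i = 0;
  while (i < in.size()) {
    std::size_t start = i;
    char32_t cp = DecodeUtf8(in, i);
    auto it = FoldTable().find(cp);
    if (it != FoldTable().end()) {
      out += it->second;
    } else {
      out.append(in, start, i - start);  // original bytes of this code point
    }
  }
  return out;
}
```

For example, `FoldSupplementary("Straße")` returns `"Strasse"`. Accented letters such as é pass through unchanged in this sketch, since they would be handled by the decomposition step.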

Tests: 51 new unit tests (11 lowercase + 26 ascii_folding + 14 standard)

@egolearner egolearner force-pushed the feat/fts-utf8-lowercase-ascii-folding branch 2 times, most recently from 084b3fa to a0793cf Compare June 23, 2026 06:22
@feihongxu0824 feihongxu0824 self-requested a review June 30, 2026 08:41
Comment thread src/db/index/column/fts_column/tokenizer/standard_tokenizer.h Outdated
Comment thread src/db/index/column/fts_column/tokenizer/standard_tokenizer.cc
Comment thread src/db/index/column/fts_column/tokenizer/ascii_folding_token_filter.cc Outdated
Comment thread src/db/index/column/fts_column/tokenizer/tokenizer_factory.cc
@egolearner egolearner force-pushed the feat/fts-utf8-lowercase-ascii-folding branch from a0793cf to 7126d78 Compare June 30, 2026 09:57
@egolearner egolearner requested a review from Cuiyus as a code owner June 30, 2026 09:57
case UTF8PROC_CATEGORY_LT: // Letter, titlecase
case UTF8PROC_CATEGORY_LM: // Letter, modifier
case UTF8PROC_CATEGORY_LO: // Letter, other
case UTF8PROC_CATEGORY_MN: // Mark, nonspacing

@feihongxu0824 feihongxu0824 Jul 1, 2026

Collaborator

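The ES/Lucene standard tokenizer is based on UAX #29 and distinguishes word boundaries, rather than simply treating marks as word characters that can start a token. The code here currently allows MN/MC/ME to start a token on their own, which may index variation selectors and standalone marks. I suggest at least distinguishing token-start from token-continue.

Another question: what are the current differences between the zvec and ES standard tokenizers?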

@egolearner egolearner Jul 1, 2026

Collaborator Author

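MN/MC/ME can no longer start a token on their own. A follow-up PR will consider aligning further with the ES standard tokenizer by adopting the UAX #29 approach.

Main differences between the current zvec standard tokenizer and the ES/Lucene standard tokenizer:

  1. The overall implementation model differs
    zvec applies approximate rules based on Unicode General Category: Letter/Number can start a token, Mark can only continue a token, and CJK Han is split into single-character tokens.
    ES/Lucene uses a UAX #29 word-boundary state machine, whose rules are more fine-grained.

  2. Word-internal punctuation is handled differently
    zvec treats non-alphanumeric characters such as ' . , _ as separators.
    ES/Lucene keeps some word-internal punctuation per UAX #29; for example, dog's, 3.14, and 1,000 are usually not simply split apart.

  3. Emoji behavior differs
    zvec currently does not index emoji for the most part; after this fix, variation selectors no longer form tokens on their own either.
    Lucene's StandardTokenizer has an EMOJI token type and recognizes some emoji sequences.

  4. Multi-script classification differs
    zvec emits single-character tokens only for CJK Han ideographs; Japanese kana, Korean, Thai, and so on are mostly aggregated as runs of ordinary Letters.
    Lucene has finer-grained token types, such as IDEOGRAPHIC, HIRAGANA, KATAKANA, HANGUL, and SOUTHEAST_ASIAN.

  5. Mark/combining-character rules are still not full UAX #29
    After this fix, MN/MC/ME can no longer start a token on their own and can only continue an existing token, which keeps standalone marks and variation selectors out of the index.
    However, this only fixes the issue raised in the review comment; it is not equivalent to full UAX #29 Extend/Format/ZWJ handling.

  6. max_token_length is only partially aligned
    zvec defaults to 255 and splits tokens that exceed it, which is close to the ES configuration semantics.
    However, ES/Lucene constrains the configurable range and the splitting behavior more completely; zvec currently only checks that the value is > 0.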
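
The token-start versus token-continue rule described in items 1 and 5 can be sketched as follows. This is an illustration under stated assumptions, not the tokenizer in this PR: `Classify` is a hypothetical stand-in for a `utf8proc_category`-based lookup and knows only a handful of ranges, `max_token_length` is ignored, and a mark that follows a Han character is simply dropped.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class CharClass { kOther, kWord, kMark, kHan };

// Decode one code point from well-formed UTF-8 at s[i] and advance i.
static char32_t DecodeUtf8(const std::string& s, std::size_t& i) {
  assert(i < s.size());
  unsigned char lead = static_cast<unsigned char>(s[i++]);
  if (lead < 0x80) return lead;
  int extra = (lead >= 0xF0) ? 3 : (lead >= 0xE0) ? 2 : 1;
  char32_t cp = lead & (0x3F >> extra);
  while (extra-- > 0 && i < s.size()) {
    cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
  }
  return cp;
}

// Hypothetical, heavily reduced classifier (assumption: a real one would
// map Unicode General Category values to these classes).
static CharClass Classify(char32_t cp) {
  if ((cp >= U'0' && cp <= U'9') || (cp >= U'A' && cp <= U'Z') ||
      (cp >= U'a' && cp <= U'z')) {
    return CharClass::kWord;
  }
  if (cp >= 0x00C0 && cp <= 0x024F && cp != 0x00D7 && cp != 0x00F7) {
    return CharClass::kWord;  // Latin-1 and Latin Extended letters
  }
  if (cp >= 0x0300 && cp <= 0x036F) return CharClass::kMark;  // combining diacritics
  if (cp >= 0xFE00 && cp <= 0xFE0F) return CharClass::kMark;  // variation selectors
  if (cp >= 0x4E00 && cp <= 0x9FFF) return CharClass::kHan;   // CJK Unified Ideographs
  return CharClass::kOther;
}

std::vector<std::string> Tokenize(const std::string& text) {
  std::vector<std::string> tokens;
  std::string current;
  auto flush = [&] {
    if (!current.empty()) {
      tokens.push_back(current);
      current.clear();
    }
  };
  std::size_t i = 0;
  while (i < text.size()) {
    std::size_t start = i;
    CharClass cls = Classify(DecodeUtf8(text, i));
    std::string bytes = text.substr(start, i - start);
    if (cls == CharClass::kHan) {
      flush();
      tokens.push_back(bytes);  // each Han ideograph is its own token
    } else if (cls == CharClass::kWord) {
      current += bytes;  // letters and digits may start or extend a token
    } else if (cls == CharClass::kMark && !current.empty()) {
      current += bytes;  // marks may only extend an existing token
    } else {
      flush();  // separators and standalone marks end the current token
    }
  }
  flush();
  return tokens;
}
```

Under these rules `dog's` splits into `dog` and `s`, which is exactly the divergence from UAX #29 noted in item 2, while a lone variation selector produces no token.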

Collaborator

OK. You could change UAX #29 -> UAX #29; otherwise it will be displayed as
image
