[Proposal] Replacing linebreak with icu4x and a Possible Implementation #687

@Vizards

Feature Request

Replace linebreak with icu4x to improve support for word-break: auto | unset.
I am also sharing a possible implementation that has worked in my experience.

Description

Background

The Unicode Line Breaking Algorithm (UAX#14) is updated continuously, and there are now significant differences between its Unicode 16.0.0 revision and the version implemented by linebreak, which is still based on Unicode 13. These discrepancies are especially apparent with Emoji ZWJ sequences and CJK characters. I previously demonstrated the issues they cause in #621 (comment).

UAX#14 receives updates almost every year, so it is unrealistic to expect the community to maintain a JavaScript library that consistently tracks the latest Unicode version. Libraries like linebreak, which can distinguish mandatory breaks and offer a user-friendly API, are already very rare. Unfortunately, linebreak appears to be struggling with Unicode 15.0 support (foliojs/linebreak#47), and Unicode 16.0 has introduced even more breaking changes to UAX#14. Updating thousands of lines of rules every year and keeping all tests passing is an enormous task for maintainers.

Why icu4x?

icu4x is developed by unicode-org and provides official FFI bindings, including WebAssembly bindings for JavaScript/TypeScript through the icu package on npm. Since icu4x’s LineSegmenter API is similar to linebreak's nextBreak iterator, the existing splitByBreakOpportunities logic can be rewritten with only minimal changes.
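To illustrate how small that rewrite could be, here is a minimal sketch of the splitting loop. It is not the code from my local project, and it does not import icu: `LineBreakIteratorLike`, `splitAtBreaks`, `iteratorFrom`, and the `{ words, requiredBreaks }` return shape are all hypothetical names chosen for illustration. The only assumption about the segmenter is that it hands back an iterator whose `next()` yields UTF-16 break offsets and a negative number at the end.

```typescript
// Hypothetical iterator shape: next() returns the next break offset in UTF-16
// code units, or a negative number once the text is exhausted. This is an
// assumption for the sketch, not copied from icu's type definitions.
interface LineBreakIteratorLike {
  next(): number
}

interface SplitResult {
  words: string[]
  // requiredBreaks[i] is true when the break after words[i] is mandatory.
  requiredBreaks: boolean[]
}

function splitAtBreaks(
  content: string,
  iterator: LineBreakIteratorLike,
  isMandatory: (content: string, position: number) => boolean
): SplitResult {
  const words: string[] = []
  const requiredBreaks: boolean[] = []
  let last = 0
  for (let pos = iterator.next(); pos >= 0; pos = iterator.next()) {
    // Some segmenters report the start of text as a boundary; skip it so we
    // never emit an empty first word.
    if (pos === 0) continue
    words.push(content.slice(last, pos))
    requiredBreaks.push(isMandatory(content, pos))
    last = pos
  }
  return { words, requiredBreaks }
}

// Stub iterator that replays a fixed list of offsets, standing in for a real
// segmenter so the sketch runs on its own.
function iteratorFrom(positions: number[]): LineBreakIteratorLike {
  let i = 0
  return { next: () => positions[i++] ?? -1 }
}

const demo = splitAtBreaks(
  'Hello \nworld',
  iteratorFrom([0, 7, 12]),
  (text, pos) => text[pos - 1] === '\n'
)
// demo.words is ['Hello \n', 'world'], demo.requiredBreaks is [true, false]
```

In a real integration, the stub would be replaced by the iterator returned from the segmenter, and `isMandatory` by a lookup of the line break class of the character before each offset.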

By leveraging icu4x’s CodePointMapData8 and detailed LineBreak type enumeration, we can accurately differentiate between opportunity breaks and mandatory breaks.
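As a rough sketch of that distinction: UAX#14 forces a break after characters in the BK, CR, LF, and NL classes (rules LB4 and LB5), except between CR and LF. The example below hard-codes those code points so it runs without any dependency. In the actual proposal the class would come from icu4x's property data rather than a switch statement; `hardBreakClassOf` and `isMandatoryBreak` are hypothetical helper names, not part of any library.

```typescript
// Line break classes that force a break per UAX #14 (rules LB4 and LB5).
type HardBreakClass = 'BK' | 'CR' | 'LF' | 'NL'

// Hand-written stand-in for a real line break property lookup. The proposal
// would query icu4x's property data here instead (assumption: that lookup
// returns a class comparable to these four).
function hardBreakClassOf(codePoint: number): HardBreakClass | undefined {
  switch (codePoint) {
    case 0x000a:
      return 'LF'
    case 0x000d:
      return 'CR'
    case 0x0085:
      return 'NL'
    case 0x000b:
    case 0x000c:
    case 0x2028:
    case 0x2029:
      return 'BK'
    default:
      return undefined
  }
}

// Returns true when the break reported at `position` (a UTF-16 offset) is
// mandatory, judged by the character immediately before it.
function isMandatoryBreak(content: string, position: number): boolean {
  if (position <= 0 || position > content.length) return false
  // All hard-break characters are in the BMP, so a code unit is enough.
  const cls = hardBreakClassOf(content.charCodeAt(position - 1))
  if (cls === undefined) return false
  // LB5: CR followed by LF is one line terminator; no break between them.
  if (cls === 'CR' && content.charCodeAt(position) === 0x000a) return false
  return true
}

const afterNewline = isMandatoryBreak('a\nb', 2) // true
const afterSpace = isMandatoryBreak('a b', 2) // false
```

Every other break the segmenter reports is then treated as an opportunity break.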

As an officially maintained subproject of unicode-org, icu4x has a proven track record of timely updates, which gives good reason to believe it will continue to keep pace with UAX#14 changes. This would greatly reduce the Unicode line breaking maintenance burden for satori maintainers.

My Attempt

I have already replaced the splitByBreakOpportunities method in my local project with an implementation based on icu. Following icu4x’s recommended approach, I have also implemented mandatory break detection in TypeScript. All existing automated tests pass, and as verified in my Node.js server production environment, the solution resolves all issues related to Unicode 13 incompatibility.

Potential Risks

  1. The TypeScript type definitions currently shipped in the icu npm package are incorrect. This should be an easy fix; I have opened an issue for it: "Missing Type Exports in ffi/npm/package.json" (unicode-org/icu4x#6695).

  2. The WebAssembly bindings for icu4x are generated by Diplomat, which introduces a few challenges:

    1. In icu@2.0.4, the icu_capi.wasm file is about 12.3MB in size. If we decide to bundle it with satori, package size might become a concern.

    2. There are also some issues with how Diplomat handles WebAssembly initialization and environment detection. For example, to use icu in the playground, I needed to update the next.config.js as follows:

      /** @type {import('next').NextConfig} */
      const nextConfig = {
        webpack: (config, { isServer }) => {
          config.experiments = {
            ...config.experiments,
            /*
             * Diplomat needs top-level-await
             */
            topLevelAwait: true, 
          }
          if (!isServer) {
            config.resolve.alias = {
              ...config.resolve.alias,
              /*
               * Force Diplomat to use browser environment
               * ref: https://github.com/rust-diplomat/diplomat/blob/main/tool/templates/js/wasm.mjs#L35
               */
              'fs': false
            }
          }
          return config
        },
      }
      
      module.exports = nextConfig

  3. I have not yet attempted to integrate icu_capi.wasm into satori/wasm.

Additional Context

Thank you to all the maintainers for your hard work. If there is interest in this proposal or questions about the implementation, I’d be glad to open a Pull Request for discussion and validation.
