[Proposal] Replacing `linebreak` with `icu4x` and a Possible Implementation #687

Feature Request

Replace linebreak with icu4x to improve support for word-break: auto | unset.
I am also sharing a possible implementation that has worked in my experience.

Description

Background

With the continuous updates to the Unicode Line Breaking Algorithm (UAX#14), there are now significant differences between the Unicode 16.0.0 implementation and the version supported by linebreak, which is still based on Unicode 13. These discrepancies have become especially apparent with Emoji ZWJ sequences and CJK characters. I previously demonstrated these differences causing issues in #621 (comment).

UAX#14 is updated almost every year, so it is unrealistic to expect the community to maintain a JavaScript library that consistently tracks the latest Unicode version. Libraries like linebreak, which can distinguish mandatory breaks and provide a user-friendly API, are already very rare. Unfortunately, linebreak appears to be struggling with Unicode 15.0 support (foliojs/linebreak#47), and Unicode 16.0 has introduced even more breaking changes to UAX#14. Updating thousands of lines of rules every year and keeping all tests passing is an enormous task for maintainers.

Why icu4x?

icu4x is the official internationalization library developed by unicode-org. It ships FFI bindings, including WebAssembly bindings for JavaScript/TypeScript published as the icu package on npm. Since icu4x’s LineSegmenter API is similar to linebreak's nextBreak iterator, the existing splitByBreakOpportunities logic can be rewritten with only minimal changes.
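To illustrate why the rewrite can stay small, here is a minimal sketch of the loop that turns break offsets into words. The `BreakIterator` interface, `splitAtBreaks`, and `fixedBreaks` are hypothetical names introduced for this example; the assumption is that the segmenter exposes an iterator whose `next()` returns the next break offset in UTF-16 code units and `-1` once exhausted. In a real integration the iterator would come from icu4x’s LineSegmenter rather than from the stand-in used here.

```typescript
// Hypothetical minimal shape of a line-break iterator (assumption):
// next() yields the next break offset in UTF-16 code units, or -1 when done.
interface BreakIterator {
  next(): number
}

// Slice `text` into chunks that end at each reported break offset.
function splitAtBreaks(text: string, breaks: BreakIterator): string[] {
  const words: string[] = []
  let last = 0
  for (let pos = breaks.next(); pos !== -1; pos = breaks.next()) {
    // Skip a break reported at the very start of the text (or any repeat).
    if (pos <= last) continue
    words.push(text.slice(last, pos))
    last = pos
  }
  // Defensive: keep any trailing text if the iterator stopped early.
  if (last < text.length) words.push(text.slice(last))
  return words
}

// Stand-in iterator over fixed offsets, used only for this demonstration.
function fixedBreaks(offsets: number[]): BreakIterator {
  let i = 0
  return { next: () => (i < offsets.length ? offsets[i++] : -1) }
}

const demoWords = splitAtBreaks('Hello world!', fixedBreaks([0, 6, 12]))
console.log(demoWords) // [ 'Hello ', 'world!' ]
```

Keeping the iterator behind a small interface like this also makes it possible to test the splitting logic without loading the WebAssembly module.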

By leveraging icu4x’s CodePointMapData8 and detailed LineBreak type enumeration, we can accurately differentiate between opportunity breaks and mandatory breaks.
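As a rough sketch of that distinction: under UAX#14, a break is mandatory when the code point just before it has line break class BK, CR, LF, or NL, and the end of the text always ends a line. The example below is self-contained, so `lineBreakClassOf` is a hypothetical stand-in that only knows the hard-break classes; in a real integration that lookup would be backed by icu4x’s line break property data. `isMandatoryBreak` is likewise a name made up for this illustration.

```typescript
type LineBreakClass = 'BK' | 'CR' | 'LF' | 'NL' | 'OTHER'

// Hypothetical stand-in for a Unicode Line_Break property lookup.
// Only the classes that force a break are modelled here.
function lineBreakClassOf(codePoint: number): LineBreakClass {
  switch (codePoint) {
    case 0x000a:
      return 'LF'
    case 0x000d:
      return 'CR'
    case 0x0085:
      return 'NL'
    case 0x000b:
    case 0x000c:
    case 0x2028:
    case 0x2029:
      return 'BK'
    default:
      return 'OTHER'
  }
}

// `pos` is a break offset (UTF-16 code units) reported by a line segmenter.
function isMandatoryBreak(text: string, pos: number): boolean {
  if (pos <= 0) return false
  // End of text always ends a line (UAX#14 rule LB3).
  if (pos >= text.length) return true
  // Find the start of the code point that ends at `pos`,
  // stepping back over a surrogate pair if there is one.
  let start = pos - 1
  const unit = text.charCodeAt(start)
  if (unit >= 0xdc00 && unit <= 0xdfff && start > 0) {
    const prev = text.charCodeAt(start - 1)
    if (prev >= 0xd800 && prev <= 0xdbff) start -= 1
  }
  return lineBreakClassOf(text.codePointAt(start)!) !== 'OTHER'
}

console.log(isMandatoryBreak('a\nb', 2)) // true: break follows LF
console.log(isMandatoryBreak('a b', 2)) // false: ordinary opportunity
```

This relies on the segmenter never reporting a break between CR and LF, so a break after CR is only seen when no LF follows. Whether the end-of-text break should be flagged as required depends on what the calling layout code expects.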

As an officially maintained subproject of unicode-org, icu4x has a proven track record of timely updates, which gives good reason to believe it will continue to keep pace with UAX#14 changes. This would greatly reduce the Unicode line breaking maintenance burden for satori maintainers.

My Attempt

I have already replaced the splitByBreakOpportunities method in my local project with an implementation based on icu. Following icu4x’s recommended approach, I have also implemented mandatory break detection in TypeScript. All existing automated tests pass, and as verified in my Node.js server production environment, the solution resolves all issues related to Unicode 13 incompatibility.

Potential Risks

The TypeScript type definitions currently shipped with the icu npm package are incorrect. This should be an easy fix, and I have opened an issue for it: unicode-org/icu4x#6695 (Missing Type Exports in ffi/npm/package.json).

The WebAssembly bindings for icu4x are generated by Diplomat, which introduces a few challenges:

In icu@2.0.4, the icu_capi.wasm file is about 12.3MB in size. If we decide to bundle it with satori, package size might become a concern.

There are also some issues with how Diplomat handles WebAssembly initialization and environment detection. For example, to use icu in the playground, I needed to update the next.config.js as follows:

/** @type {import('next').NextConfig} */
const nextConfig = {
  webpack: (config, { isServer }) => {
    config.experiments = {
      ...config.experiments,
      /*
       * Diplomat needs top-level-await
       */
      topLevelAwait: true, 
    }
    if (!isServer) {
      config.resolve.alias = {
        ...config.resolve.alias,
        /*
         * Force Diplomat to use browser environment
         * ref: https://github.com/rust-diplomat/diplomat/blob/main/tool/templates/js/wasm.mjs#L35
         */
        'fs': false
      }
    }
    return config
  },
}

module.exports = nextConfig

I have not yet attempted to integrate icu_capi.wasm into satori/wasm.

Additional Context

Thank you to all the maintainers for your hard work. If there is interest in this proposal or questions about the implementation, I’d be glad to open a Pull Request for discussion and validation.
