Preserve percent-encoded octets in hrefs to avoid UTF-8 replacement (EUC-JP / Shift-JIS) #2171

Description

@SnowMoonSS

When converting HTML links to Markdown, MarkItDown decodes the percent-encoded octets in the URL path as UTF-8 and then re-encodes the result. For URLs whose paths were percent-encoded from EUC-JP (e.g. https://abc.com/hist/%a5%c8%a5%c3%a5%d7%a5%da%a1%bc%a5%b8), the octets are not valid UTF-8, so decoding produces replacement characters (U+FFFD), which show up as %EF%BF%BD in the output. This corrupts the original percent-encoded URL.

Steps to reproduce:

  1. Use MarkItDown to convert an HTML snippet containing a link with an EUC-JP percent-encoded path, for example:
<a href="https://abc.com/hist/%a5%c8%a5%c3%a5%d7%a5%da%a1%bc%a5%b8">example</a>
  2. Observe that the converted Markdown contains %EF%BF%BD sequences where the original %a5... bytes were, indicating U+FFFD replacement characters introduced during UTF-8 decoding.
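
The corruption can be shown without MarkItDown. The sketch below assumes the round trip behaves like `urllib.parse.unquote` followed by `urllib.parse.quote` from the Python standard library; the actual MarkItDown code path may differ.

```python
from urllib.parse import quote, unquote

# Path from the example href; the octets are EUC-JP, not UTF-8.
path = "/hist/%a5%c8%a5%c3%a5%d7%a5%da%a1%bc%a5%b8"

# unquote() defaults to encoding="utf-8", errors="replace", so any octet
# sequence that is not valid UTF-8 becomes U+FFFD.
decoded = unquote(path)

# quote() then encodes the decoded string as UTF-8, so each U+FFFD
# comes back as %EF%BF%BD and the original octets are lost.
reencoded = quote(decoded)

print(reencoded)
print("%EF%BF%BD" in reencoded)  # replacement characters are present
```

Some octet pairs in this path happen to form valid UTF-8 sequences, so the damaged output can also contain unrelated characters alongside the %EF%BF%BD runs.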

Expected behavior:

  • MarkItDown should preserve existing percent-encoded octets in hrefs rather than attempting to decode them using UTF-8, or should otherwise normalize percent-encodings without introducing replacement characters.
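
One possible way to get this behavior is sketched below. The helper names `quote_preserving_escapes`, `normalize_href`, and `normalize_via_bytes` are hypothetical and are not part of MarkItDown; only standard-library calls are used.

```python
import re
from urllib.parse import quote, unquote_to_bytes, urlsplit, urlunsplit

# Matches one existing percent-encoded octet such as %a5 or %A5.
_PCT_TRIPLET = re.compile(r"(%[0-9A-Fa-f]{2})")


def quote_preserving_escapes(text: str, safe: str = "/") -> str:
    """Percent-encode text, leaving existing %XX triplets untouched."""
    parts = _PCT_TRIPLET.split(text)
    # With a capturing group, re.split puts the matched triplets at odd indices.
    return "".join(
        part if index % 2 else quote(part, safe=safe)
        for index, part in enumerate(parts)
    )


def normalize_href(href: str) -> str:
    """Hypothetical helper: normalize only the path, keeping its octets."""
    split = urlsplit(href)
    path = quote_preserving_escapes(split.path)
    return urlunsplit((split.scheme, split.netloc, path, split.query, split.fragment))


def normalize_via_bytes(path: str) -> str:
    """Alternative: round-trip through bytes so no text decoding happens.

    This uppercases the hex digits and also decodes escapes such as %2F,
    so it is a normalization, not a strict preservation.
    """
    return quote(unquote_to_bytes(path), safe="/")


href = "https://abc.com/hist/%a5%c8%a5%c3%a5%d7%a5%da%a1%bc%a5%b8"
print(normalize_href(href) == href)
print(normalize_via_bytes("/hist/%a5%c8"))
```

The first approach never decodes the existing escapes, so it does not need to know whether the octets are EUC-JP, Shift-JIS, or UTF-8. The byte round trip is closer to the "normalize without replacement characters" option, at the cost of changing the case of the hex digits.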
