When converting HTML links to Markdown, MarkItDown decodes percent-encoded octets in the URL path as UTF-8 and then re-encodes them. For URLs whose paths were originally percent-encoded from EUC-JP (e.g. https://abc.com/hist/%a5%c8%a5%c3%a5%d7%a5%da%a1%bc%a5%b8), the bytes are not valid UTF-8, so decoding produces replacement characters (U+FFFD), which show up as %EF%BF%BD in the output. The original percent-encoded URL is corrupted and no longer identifies the same resource.
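The effect can be illustrated with the standard library alone. This is only a sketch that assumes the conversion does something equivalent to `urllib.parse.unquote` followed by `urllib.parse.quote` on the path; it is not taken from MarkItDown's source, and the variable names are illustrative.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit

# URL from the report; the path octets are EUC-JP, not UTF-8.
url = "https://abc.com/hist/%a5%c8%a5%c3%a5%d7%a5%da%a1%bc%a5%b8"

parts = urlsplit(url)

# unquote() decodes as UTF-8 with errors="replace" by default, so bytes
# that are not valid UTF-8 become U+FFFD.
decoded_path = unquote(parts.path)

# quote() re-encodes the decoded text as UTF-8, turning each U+FFFD
# into the three octets %EF%BF%BD.
round_tripped = urlunsplit(parts._replace(path=quote(decoded_path)))

print(round_tripped)
print("%EF%BF%BD" in round_tripped)  # the original octets are lost
```

Note that some byte pairs in the path happen to form valid UTF-8 sequences, so the round trip can also silently reinterpret bytes as unrelated characters rather than replacing them.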
Steps to reproduce:
- Use MarkItDown to convert an HTML snippet containing a link with an EUC-JP percent-encoded path, for example:
<a href="https://abc.com/hist/%a5%c8%a5%c3%a5%d7%a5%da%a1%bc%a5%b8">example</a>
- Observe that the converted Markdown contains %EF%BF%BD sequences where the original %a5... bytes were, indicating U+FFFD replacement characters introduced during UTF-8 decoding.
Expected behavior:
- MarkItDown should preserve existing percent-encoded octets in hrefs rather than attempting to decode them using UTF-8, or should otherwise normalize percent-encodings without introducing replacement characters.
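One possible way to get the expected behavior is to leave existing `%XX` octets untouched and escape only the characters around them. The following is a minimal sketch under that assumption; `escape_path_preserving_octets` and `normalize_href` are hypothetical helper names, not part of MarkItDown.

```python
import re
from urllib.parse import quote, urlsplit, urlunsplit

# Matches one already-encoded octet such as %a5 or %EF.
_PCT_OCTET = re.compile(r"(%[0-9A-Fa-f]{2})")


def escape_path_preserving_octets(path: str) -> str:
    """Percent-encode unsafe characters but keep existing %XX octets as-is."""
    pieces = _PCT_OCTET.split(path)
    # With a capturing group, re.split() puts the matched octets at odd
    # indices; only the text between them is passed through quote().
    return "".join(
        piece if index % 2 else quote(piece, safe="/")
        for index, piece in enumerate(pieces)
    )


def normalize_href(href: str) -> str:
    """Hypothetical helper: normalize only the path component of an href."""
    parts = urlsplit(href)
    return urlunsplit(parts._replace(path=escape_path_preserving_octets(parts.path)))


original = "https://abc.com/hist/%a5%c8%a5%c3%a5%d7%a5%da%a1%bc%a5%b8"
print(normalize_href(original) == original)
print(normalize_href("https://abc.com/a b/%a5"))
```

Because the octets are never decoded, the result does not depend on which character encoding the original site used.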