converting docx files to markdown with markitdown.exe does not escape special characters #2157

Description

@fengzhenqiong

markitdown version: 0.1.6
OS: Windows 11 x64
Command, run in a PowerShell terminal: markitdown file.docx -o file.md --keep-data-uris

If a line starts with hash marks (##), they are kept unescaped, so the line is recognized as a section header.
If a line starts with a pipe character (|), it is kept unescaped and recognized as a table. This is even worse when the text is already inside a table.

Only a few special characters, such as the asterisk and underscore, are escaped.
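To illustrate the escaping that seems to be missing, here is a minimal sketch of how literal text could be escaped before it is written out as Markdown. It is not markitdown's code or API; the function name `escape_plain_text` is hypothetical, and it only covers the two cases reported here (leading hash marks and pipe characters), plus backslashes so the added escapes stay unambiguous.

```python
import re


def escape_plain_text(line: str) -> str:
    """Backslash-escape characters that Markdown would treat as syntax.

    Hypothetical helper for illustration; not part of markitdown.
    """
    # Escape existing backslashes first so they are not confused
    # with the escapes added below.
    escaped = line.replace("\\", "\\\\")
    # A literal pipe can start a table row or split a table cell.
    escaped = escaped.replace("|", "\\|")
    # A run of 1-6 hash marks at the start of a line, followed by
    # whitespace or end of line, is an ATX heading. Escaping the
    # first hash mark is enough to make the whole run literal.
    escaped = re.sub(r"^( {0,3})(#{1,6})(?=[ \t]|$)", r"\1\\\2", escaped)
    return escaped


print(escape_plain_text("## not a heading"))
print(escape_plain_text("| not | a | table |"))
```

Hash marks that do not start a heading, such as the one in "C# code", are left unchanged by this sketch.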

Also, when a table in a docx file is converted, the generated Markdown has an empty header row, and the real header becomes the first content row.

The converted result looks something like this:

| | | |
| --- | --- | --- |
| Header 1 | Header 2 | Header 3 |

But it should look like this:

| Header 1 | Header 2 | Header 3 |
| --- | --- | --- |
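Until this is fixed, the table output could be post-processed as a workaround. The following is a rough sketch, assuming the table has already been split into a list of lines and that cells contain no escaped pipes. The function name `promote_first_row_to_header` is hypothetical and not part of markitdown.

```python
def promote_first_row_to_header(table_lines: list[str]) -> list[str]:
    """Replace an empty header row with the first content row.

    Hypothetical post-processing helper; not part of markitdown.
    """

    def cells(row: str) -> list[str]:
        # Naive split: assumes no escaped pipes inside cells.
        return [c.strip() for c in row.strip().strip("|").split("|")]

    if len(table_lines) < 3:
        return table_lines

    header, separator, first_row = table_lines[:3]
    header_is_empty = all(c == "" for c in cells(header))
    # A separator cell consists only of dashes and optional colons.
    separator_ok = all(
        "-" in c and set(c) <= set("-:") for c in cells(separator)
    )

    if header_is_empty and separator_ok:
        # Drop the empty header and move the first content row above
        # the separator line.
        return [first_row, separator] + table_lines[3:]
    return table_lines


converted = [
    "| | | |",
    "| --- | --- | --- |",
    "| Header 1 | Header 2 | Header 3 |",
]
print("\n".join(promote_first_row_to_header(converted)))
```

Tables whose header row already has content are returned unchanged.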
