strip internal-only DOCX hyperlink anchors to avoid dead links #2131

Open
martian7777 wants to merge 1 commit into
microsoft:mainfrom
martian7777:DOCX-internal-hyperlinks

@martian7777


Problem

Internal DOCX Table-of-Contents (TOC) and cross-reference hyperlinks (<w:hyperlink w:anchor="..."> elements with no external relationship ID) were being converted into Markdown links such as [Executive Summary](#_Toc12345). These bookmark anchors do not resolve in the final Markdown document, so the links are dead and add noise for both human readers and LLM consumption.

Solution

  • Modified the DOCX XML pre-processing step (pre_process.py) to search for <w:hyperlink> elements that contain a w:anchor attribute but lack an external relationship r:id attribute.
  • Unwrapped these internal-only hyperlink elements so that they render as plain text in the final Markdown output, keeping their text content and formatting intact without emitting the dead link wrapper.
  • Renamed the internal _pre_process_math helper function to _pre_process_xml to better represent its expanded pre-processing responsibilities.
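
For reviewers, here is a minimal sketch of the unwrapping idea described in the bullets above. It is not the code in pre_process.py: it uses only the standard library's xml.etree.ElementTree, and the function name unwrap_internal_hyperlinks is hypothetical. It assumes whitespace between WordprocessingML elements is insignificant, so the tail text of a removed hyperlink is dropped.

```python
import xml.etree.ElementTree as ET

# Standard OOXML namespace URIs for the w: and r: prefixes.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

# Keep the familiar prefixes when serializing (note: this is a global setting).
ET.register_namespace("w", W_NS)
ET.register_namespace("r", R_NS)

HYPERLINK = f"{{{W_NS}}}hyperlink"
ANCHOR = f"{{{W_NS}}}anchor"
REL_ID = f"{{{R_NS}}}id"


def unwrap_internal_hyperlinks(document_xml: str) -> str:
    """Hypothetical helper: replace anchor-only hyperlinks with their children.

    A hyperlink counts as internal-only when it has w:anchor and no r:id.
    Hyperlinks that carry r:id (external targets) are left untouched.
    """
    root = ET.fromstring(document_xml)
    # ElementTree has no parent pointers, so visit each element as a
    # potential parent and rebuild its child list when needed.
    for parent in list(root.iter()):
        new_children = []
        changed = False
        for child in list(parent):
            internal_only = (
                child.tag == HYPERLINK
                and ANCHOR in child.attrib
                and REL_ID not in child.attrib
            )
            if internal_only:
                # Promote the runs (w:r) so text and formatting survive.
                new_children.extend(list(child))
                changed = True
            else:
                new_children.append(child)
        if changed:
            parent[:] = new_children
    return ET.tostring(root, encoding="unicode")
```

With this approach the runs inside the hyperlink stay in document order, so the visible text and run-level formatting are unchanged while the link wrapper disappears.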

Testing

  • Added test_docx_internal_hyperlinks in test_module_misc.py that verifies a DOCX file containing a w:anchor hyperlink converts to plain text rather than a Markdown link.
  • Verified that all other DOCX conversion tests pass successfully without regressions.
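
To reproduce the scenario without a binary fixture, a test input could be built in memory along these lines. This is a sketch, not the fixture used by test_docx_internal_hyperlinks: the helper name build_docx_with_internal_link is hypothetical, and the three-part package below is assumed to be the minimum a DOCX reader needs. It has not been verified against MarkItDown's converter.

```python
import io
import zipfile

CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)

ROOT_RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships '
    'xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" '
    'Target="word/document.xml"/>'
    '</Relationships>'
)

# One paragraph whose only content is an anchor-only (internal) hyperlink.
DOCUMENT = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<w:document xmlns:w="http://schemas.openxmlformats.org/'
    'wordprocessingml/2006/main">'
    '<w:body><w:p>'
    '<w:hyperlink w:anchor="_Toc12345">'
    '<w:r><w:t>Executive Summary</w:t></w:r>'
    '</w:hyperlink>'
    '</w:p></w:body></w:document>'
)


def build_docx_with_internal_link() -> bytes:
    """Hypothetical fixture: return the bytes of a minimal DOCX package."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", CONTENT_TYPES)
        zf.writestr("_rels/.rels", ROOT_RELS)
        zf.writestr("word/document.xml", DOCUMENT)
    return buf.getvalue()
```

A test could then wrap these bytes in io.BytesIO, run them through the converter, and assert that the output contains "Executive Summary" but not "(#_Toc12345)".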
@martian7777

Author

This PR addresses #2125.

@martian7777

Author

@microsoft-github-policy-service agree

