feat: added support for parsing LaTeX (.tex) documents #2890
✅ DCO Check Passed Thanks @adityasasidhar, all your commits are properly signed off. 🎉 |
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🔴 Require two reviewer for test updates: This rule is failing. When test data is updated, we require two reviewers.
🟢 Enforce conventional commit: Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
27f98dc to
170f769
Compare
Codecov Report❌ Patch coverage is
acd16d6 to
4363deb
Compare
Updated: Improved Test Coverage
The initial submission passed all 25 required CI checks (lint, tests across Python 3.9-3.14, examples, docs, DCO) but missed the codecov threshold (57% vs 73.16% target). I've now added 33 additional unit tests (39 total) to achieve ~86% coverage, well above the target. New tests cover:
Also fixed a validation bug where Awaiting workflow approval to run CI. Thank you! |
PeterStaar-IBM
left a comment
These are just my first comments; I think we will need a few iterations.
That said, this is really impressive so far: good code quality, and I love the approach!
- Add text formatting options (bold, italic, underline) for LaTeX macros
- Enhance image embedding with PIL and ImageRef.from_pil()
- Refactor list processing to use GroupItem structure
- Refactor bibliography to use GroupItem structure
- Add nested list test coverage
- All tests passing (39/39), all linters passing
|
@PeterStaar-IBM Thank you for the thorough review and kind words! All feedback has been addressed in commit Changes implemented:
Test coverage: 40/40 tests passing, all linters passing ✓ Ready for re-review! Let me know if you'd like any adjustments. PS. Will work on future enhancements through multiple iterations |
|
Dang it, I forgot the DCO sign-off on the commit (cries internally) |
|
@PeterStaar-IBM Thank you for the thorough review and kind words! All feedback has been addressed in the commit.
Test coverage: 40/40 tests passing, all linters passing ✓ Ready for re-review! Let me know if you'd like any adjustments. |
|
@adityasasidhar Nice progress. Maybe this is a good point to add some sample latex file into |
|
Will do. |
@cau-git @PeterStaar-IBM Thank you for the feedback! I've significantly overhauled the implementation to allow end-to-end ground-truth validation against real-world scientific papers.

To ensure the backend is robust enough for general use, I added the full source code of 4 major arXiv papers (Attention Is All You Need, DeepSeek V3, Mistral 7B, and OTSL) to the test suite and regenerated the ground-truth data. Testing against these complex documents revealed several real-world parsing challenges, which I have now resolved:

Split-file Documents: Added recursive support for \input{} and \include{} to handle papers that are split across multiple

There's still a lot of work left to perfect it. I had some ideas regarding restructuring tables using LLMs; the user could probably pass some arguments to enable it (just my thoughts). As you can see, manually parsing the LaTeX code leaves us with a lot of gaps. Using a small language model to assist with this use case might be too complex for now, but I gave it a bit of thought. I'll make some more attempts at perfecting it, but real-world use cases are far from perfect (cries internally wishing the world was perfect). |
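For readers following along, the recursive \input{} / \include{} handling described in the comment above could look roughly like the sketch below. This is not the code from the PR: the function name `expand_inputs`, the regex, and the choice to resolve paths relative to the main file's directory are all assumptions made for illustration. It uses only the Python standard library.

```python
import re
from pathlib import Path

# Hypothetical pattern: matches \input{name} and \include{name}.
INPUT_RE = re.compile(r"\\(?:input|include)\{([^}]+)\}")


def expand_inputs(path, root=None, _seen=frozenset()):
    """Return the text of `path` with \\input/\\include targets inlined.

    Assumption: target paths are resolved relative to the directory of the
    main file (`root`), and a missing ".tex" suffix is appended.
    """
    path = Path(path).resolve()
    root = Path(root).resolve() if root else path.parent
    if path in _seen:
        return ""  # break include cycles instead of recursing forever
    text = path.read_text(encoding="utf-8", errors="replace")

    def _load(match):
        target = root / match.group(1).strip()
        if target.suffix != ".tex":
            target = target.with_name(target.name + ".tex")
        if not target.is_file():
            return match.group(0)  # leave unresolved references untouched
        return expand_inputs(target, root, _seen | {path})

    # A function replacement is inserted literally, so backslashes in the
    # loaded LaTeX are not interpreted as regex escapes.
    return INPUT_RE.sub(_load, text)
```

Calling `expand_inputs("main.tex")` would return one flat string that can then be handed to the parser. Note that this sketch also expands \input commands that sit inside comments, so in practice comment stripping would probably need to run first.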
@vku-ibm Please have a look at this PR. Maybe you can run it over our internal LaTeX collection and give some general feedback. Overall, we cannot expect this PR to support all features of LaTeX (there are always additional feat/fix PRs that can be done). However, we need to ensure a base level of robustness. |
|
Committed the changes with proper DCO sign-off |
|
@adityasasidhar I took some random arXiv publications for tests. These are my findings so far:
from pdf:
the above was done serializing to markdown using older methods
EDIT: extraction of figures is supported when modern format of latex is used, the code mentioned above handles graphics provided as pdfs. Old latex documents often use ps and eps formats. There are more findings but I need to clarify and cross check them first. |
Becomes this in docling document: and when serialized, these "breaks" consequently become paragraph separators. |
Yes, these should become inline formulas |
|
@vku-ibm @PeterStaar-IBM thank you for the thorough testing! Working on fixes for: |
Isn't it better to preserve the scientific notation, since we are parsing academic papers? Most likely someone would want the extra bit of context if they are specifically feeding in .tex files. But all in all a project is supposed to be consistent, so it's your call. |
My apologies for not being clear. Yes, we do want to preserve scientific notation and this backend does it. I was pointing out that docling doesn't do it by default with the pdf backend. So this point wasn't the "bug report" but an observation. |
@vku-ibm my bad if I came off as accusing. I'm almost done with the fixes and am testing a few edge cases; I'll push once finished. |
@adityasasidhar no worries, we (=docling team) don't take these things personally, we just love your enthusiasm! LaTeX is a real pain sometimes, and we need to get some good rough coverage before we merge. The next iterations can then be improvements. |
|
@vku-ibm @PeterStaar-IBM Thank you for the detailed feedback and testing! I have pushed a new commit that addresses all the reported issues. Here is a summary of the fixes and verification:
🛠️ Fixes Implemented
Fix: Adjusted the regex preprocessing to safely strip comments, including files starting with %.
Fix: Modified
Fix: Refactored
Fix: Implemented
Expanded Unit Tests: Added 3 new tests specifically for these edge cases. |
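As an illustration of the comment-stripping fix mentioned above, here is a minimal sketch of regex preprocessing that removes LaTeX comments, keeps escaped percent signs (\%), and works when the file starts with %. The actual regex used in the PR is not shown in this thread, so the pattern and the name `strip_comments` are assumptions.

```python
import re

# Hypothetical pattern: a "%" that is not preceded by a backslash starts a
# comment that runs to the end of the line. "." does not match "\n", so the
# line break itself is kept.
COMMENT_RE = re.compile(r"(?<!\\)%.*")


def strip_comments(tex: str) -> str:
    """Remove LaTeX comments while keeping escaped percent signs (\\%)."""
    return COMMENT_RE.sub("", tex)


sample = "% header comment\nText 50\\% done % trailing note\nEnd"
cleaned = strip_comments(sample)
```

The negative lookbehind also succeeds at position 0, which is why a document whose very first character is % is handled. Known gaps of this sketch: a line break followed directly by a comment (`\\%`) is misread as an escaped percent sign, and verbatim environments are not protected.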
|
I've tried the latest version and it looks great! A couple of findings:
example from https://arxiv.org/abs/2501.00089 ,
same document as above.
becomes: The sample is https://arxiv.org/abs/astro-ph/0001001
becomes:
becomes: and later in the same document, another sample that breaks the same way:
The rest of the document didn't get parsed. |
|
Hey @vku-ibm, thanks a lot for the detailed feedback on the parsing issues. I realized that a large part of the problem comes from trying to patch things at the string or regex level after parsing, which is inherently fragile for LaTeX. I’m currently experimenting with a different approach that relies much more directly on the LaTeX AST and avoids text-level fixes as much as possible. The goal is to handle macros, environments, and math in a more structured and semantically correct way, instead of flattening early and compensating later. I’m also planning to build a proper dictionary of macros so that regex usage can be gradually phased out. Thanks again for your patience. I’ll likely be making some changes to the overall approach and will share updates once things stabilize. |
0e985c9 to
d6c4ee8
Compare
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
- Add text formatting options (bold, italic, underline) for LaTeX macros
- Enhance image embedding with PIL and ImageRef.from_pil()
- Refactor list processing to use GroupItem structure
- Refactor bibliography to use GroupItem structure
- Add nested list test coverage
- All tests passing (39/39), all linters passing
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
…r@gmail.com> I, Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>, hereby add my Signed-off-by to this commit: f19f135 Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
- Add custom macro expansion for improved text quality
- Fix preamble filtering to remove metadata garbage
- Support recursive \input{} and \include{} file loading
- Organize test data into subdirectories for complex papers
- Add full end-to-end ground truth for 4 major arXiv papers (Attention, Mistral, DeepSeek, OTSL)
- Pass all 41 unit tests and pre-commit checks
Addresses @cau-git feedback for ground-truth data.
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
- Fixed re.error: bad escape in macro expansion by using lambda in re.sub
- Fixed sentences breaking at inline math ($) by preserving it within paragraphs
- Improved figure environment with proper grouping and structured representation
- Fixed crashes on documents starting with % comments
- Added comprehensive unit tests and updated all ground truth data
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
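The "bad escape" fix in the commit above refers to a general property of Python's `re.sub`: a replacement string is scanned for escape sequences, so LaTeX text such as `\mathbb{R}` raises `re.error`, whereas the return value of a replacement function is inserted literally. The sketch below shows the idea; `expand_macro` is a hypothetical helper, not the function from the PR.

```python
import re


def expand_macro(text: str, name: str, replacement: str) -> str:
    """Replace every use of the argument-less macro \\<name> with `replacement`.

    Hypothetical helper for illustration. The lookahead keeps \\R from
    matching the start of a longer macro name such as \\Real.
    """
    pattern = re.compile(r"\\" + re.escape(name) + r"(?![A-Za-z])")
    # Passing the replacement as a plain string would make re.sub parse
    # "\m" as an escape and raise re.error; a lambda returns it verbatim.
    return pattern.sub(lambda _match: replacement, text)


expanded = expand_macro(r"x \in \R", "R", r"\mathbb{R}")
```

Escaping the replacement by hand (doubling every backslash) would also work, but the function form avoids having to remember which characters are special in replacement strings.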
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
This commit addresses several issues with LaTeX parsing:
- Correctly handle unknown macros (like \ion{N}{2}) inline to avoid line breaks.
- Fix extraction of structural macros (section, caption, etc.) vs text-only groups.
- Address PR feedback regarding inline math spacing and splitting.
- Regenerate ground truth files reflecting these improvements.
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
dda8d65 to
a604e80
Compare
|
@vku-ibm hey! I know it’s been a while since my last push (some might say I ran out of Claude credits… hehehe 😄), but I’m back with a major update. I’ve addressed the pending feedback and significantly hardened the LaTeX parsing logic. Below is a summary of what’s included in this merge.
✨ Summary of Changes
🔧 Robust Macro Extraction
📝 Inline Text Preservation
🧱 Structural vs Inline Distinction
🔤 Special Character Support
📊 Table Parsing Stability
🧪 End-to-End Testing
Can you please run it over some more LaTeX files and let me know about edge cases and anything I may have missed? I also had an idea for handling malformed .tex files: convert them to PDF and run them through the standard Docling PDF conversion methods. I'd love to know what you think. Maybe a small language model could be used for tweaks, but even that has its own tradeoffs. |
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
Issue resolved by this Pull Request:
Resolves #2885
This PR introduces initial support for parsing LaTeX (.tex) documents as a structured input format, using pylatexenc. This allows Docling to convert native LaTeX files directly to Markdown while preserving semantic structure.
Key changes:
- … \section, \subsection, abstract) to Docling's model.
- … $) and display ($$, \[) math, plus environments like equation, align, gather, and multline (including starred variants).
- … tabular environments, including handling of escaped characters and \hline cleanup.
- … \cite) and references (\ref).
- … docling convert file.tex.
Checklist: