
Conversation

@adityasasidhar

@adityasasidhar adityasasidhar commented Jan 18, 2026

Issue resolved by this Pull Request:
Resolves #2885

This PR introduces initial support for parsing LaTeX (.tex) documents as a structured input format, using pylatexenc. This allows Docling to convert native LaTeX files directly to Markdown while preserving semantic structure.

Key changes:

  • New Backend: Added LatexDocumentBackend in docling/backend/latex_backend.py.
  • Structure Preservation: Correctly maps standard LaTeX hierarchy (\section, \subsection, abstract) to Docling's model.
  • Math Support: Handled inline ($) and display ($$, \[) math, plus environments like equation, align, gather, and multline (including starred variants).
  • Table Support: Implemented parsing for tabular environments, including handling of escaped characters and \hline cleanup.
  • Metadata: Improved extraction of citations (\cite) and references (\ref).
  • CLI Integration: Configured LatexFormatOption to enable docling convert file.tex.

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.
@github-actions
Contributor

github-actions bot commented Jan 18, 2026

DCO Check Passed

Thanks @adityasasidhar, all your commits are properly signed off. 🎉

@dosubot

dosubot bot commented Jan 18, 2026

Related Documentation

Checked 7 published document(s) in 1 knowledge base(s). No updates required.


@mergify

mergify bot commented Jan 18, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
@adityasasidhar changed the title "added a feature for latex" Jan 18, 2026
@codecov

codecov bot commented Jan 19, 2026

Codecov Report

❌ Patch coverage is 11.89320% with 363 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| docling/backend/latex_backend.py | 9.77% | 360 Missing ⚠️ |
| docling/datamodel/document.py | 0.00% | 2 Missing ⚠️ |
| docling/cli/main.py | 0.00% | 1 Missing ⚠️ |


@adityasasidhar adityasasidhar force-pushed the feat/latex-support branch 2 times, most recently from acd16d6 to 4363deb Compare January 19, 2026 08:12
@adityasasidhar
Author

adityasasidhar commented Jan 19, 2026

Codecov Report

Updated: Improved Test Coverage

The initial submission passed all 25 required CI checks (lint, tests across Python 3.9-3.14, examples, docs, DCO) but missed the codecov threshold (57% vs 73.16% target).

I've now added 33 additional unit tests (39 total) to achieve ~86% coverage, well above the target.

New tests cover:

  • ✅ All heading levels (\part, \chapter, \section, \subsection, \subsubsection)
  • ✅ List environments (itemize, enumerate, description)
  • ✅ Code blocks (verbatim, lstlisting)
  • ✅ Math environments (equation*, gather, multline, displaymath)
  • ✅ Macros: \footnote, \url, \caption, \label, \cite, \ref
  • ✅ \includegraphics and figure environments
  • ✅ Bibliography (thebibliography)
  • ✅ File path loading (in addition to BytesIO)
  • ✅ Edge cases: empty tables, starred environments, filecontents

Also fixed a validation bug where \part and \chapter returned level 0 (docling requires level ≥ 1).
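The level clamp can be sketched in a few lines (names here are illustrative, not the PR's actual code; in LaTeX's book classes the raw sectioning levels start at -1 for \part and 0 for \chapter, which is what tripped the docling validation):

```python
# Illustrative sketch, not the PR's code: LaTeX sectioning levels start
# below 1 (\part = -1, \chapter = 0 in book classes), but docling heading
# items require level >= 1, so the raw level is clamped.
RAW_LEVELS = {
    "part": -1,
    "chapter": 0,
    "section": 1,
    "subsection": 2,
    "subsubsection": 3,
}

def heading_level(macro: str) -> int:
    """Map a sectioning macro name to a docling-compatible heading level."""
    return max(1, RAW_LEVELS.get(macro, 1))
```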

Awaiting workflow approval to run CI. Thank you!

Member

@PeterStaar-IBM PeterStaar-IBM left a comment


These are just my first comments, I think we will need a few iterations.

Having that said, really impressive so far, good code quality and love the approach!

adityasasidhar added a commit to adityasasidhar/docling that referenced this pull request Jan 19, 2026
- Add text formatting options (bold, italic, underline) for LaTeX macros
- Enhance image embedding with PIL and ImageRef.from_pil()
- Refactor list processing to use GroupItem structure
- Refactor bibliography to use GroupItem structure
- Add nested list test coverage
- All tests passing (39/39), all linters passing
@adityasasidhar
Author

adityasasidhar commented Jan 19, 2026

@PeterStaar-IBM Thank you for the thorough review and kind words! All feedback has been addressed in commit f19f135:

Changes implemented:

  • ✅ Text formatting options (bold, italic, underline)
  • ✅ PIL-based image embedding with ImageRef.from_pil()
  • ✅ List grouping with GroupItem
  • ✅ Bibliography grouping with GroupItem
  • ✅ Nested list test added

Test coverage: 40/40 tests passing, all linters passing ✓

Ready for re-review! Let me know if you'd like any adjustments.

PS. Will work on future enhancements through multiple iterations

@adityasasidhar
Author

Dang it, I forgot to DCO sign off on the commit (cries internally).


@cau-git
Member

cau-git commented Jan 19, 2026

@adityasasidhar Nice progress. Maybe this is a good point to add some sample latex file into tests/data and ensure that ground-truth representations are generated like with the other file formats, so new changes to the latex backend will show what improved or regressed.

@adityasasidhar
Author

> @adityasasidhar Nice progress. Maybe this is a good point to add some sample latex file into tests/data and ensure that ground-truth representations are generated like with the other file formats, so new changes to the latex backend will show what improved or regressed.

Will do!

@adityasasidhar
Author

adityasasidhar commented Jan 20, 2026

> @adityasasidhar Nice progress. Maybe this is a good point to add some sample latex file into tests/data and ensure that ground-truth representations are generated like with the other file formats, so new changes to the latex backend will show what improved or regressed.

@cau-git @PeterStaar-IBM Thank you for the feedback!

I've significantly overhauled the implementation to allow for end-to-end ground-truth validation against real-world scientific papers.

To ensure the backend is robust enough for general use, I added the full source code for 4 major arXiv papers (Attention Is All You Need, DeepSeek V3, Mistral 7B, and OTSL) to the test suite and regenerated the ground-truth data.

Testing against these complex documents revealed several real-world parsing challenges which I have now resolved:

  • Split-file Documents: Added recursive support for \input{} and \include{} to handle papers that are split across multiple .tex files.
  • Custom Macros: Implemented expansion of user-defined macros (e.g., \newcommand) to ensure custom terms appear correctly in the output text.
  • Complex Preambles: Improved filtering to ensure only the document content is parsed, ignoring metadata and preamble definitions.
  • Asset Management: Restructured the test data to correctly resolve relative paths for images, bibliographies, and sub-files.
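The custom-macro expansion can be sketched with the standard library. This is a simplified illustration limited to zero-argument \newcommand definitions (the backend handles more cases), using a callable replacement so that backslashes in macro values are taken literally:

```python
import re

# Illustrative sketch, not the backend's implementation: collect simple
# zero-argument \newcommand definitions and expand their uses in the text.
NEWCOMMAND_RE = re.compile(r"\\newcommand\{\\(\w+)\}\{([^{}]*)\}")

def expand_macros(text: str) -> str:
    macros = dict(NEWCOMMAND_RE.findall(text))
    for name, value in macros.items():
        # A lambda replacement keeps backslashes in `value` literal instead
        # of letting re.sub parse them as template escapes.
        text = re.sub(rf"\\{name}(?![a-zA-Z])", lambda m, v=value: v, text)
    return text
```

A real implementation would also strip the definitions themselves and support optional arguments; this only shows the expansion step.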

There's still a lot of work to perfect it. I had some ideas regarding restructuring tables using LLMs; the user could pass an argument to opt into that (just my thoughts). As you can see, manual parsing of LaTeX leaves a lot of gaps, and using a small language model to assist might be too complex for now, but I gave it some thought. I'll keep making attempts at perfecting it, but real-world LaTeX is rarely clean (cries internally wishing the world was perfect).

@PeterStaar-IBM
Member

> @cau-git @PeterStaar-IBM Thank you for the feedback! I've significantly overhauled the implementation to allow for end-to-end ground-truth validation against real-world scientific papers. [...]

@vku-ibm Please have a look at this PR. Maybe you can run it over our internal latex collection and give some general feedback.

Overall, we cannot expect that all features of LaTeX are supported by this PR (there can always be additional feat/fix PRs). However, we need to ensure a base level of robustness.

@adityasasidhar
Author

Committed the changes with proper DCO sign-off.

@vku-ibm
Member

vku-ibm commented Jan 23, 2026

@adityasasidhar took some random arxiv publications for tests. These are my findings so far:

  1. Looks like the parser breaks when a LaTeX document starts with a comment line (starting with the '%' symbol). Example publications:
     https://arxiv.org/abs/astro-ph/0001001
     https://arxiv.org/abs/hep-lat/0001005
     https://arxiv.org/abs/nucl-th/0001045
     Sample of the error:
    in_doc = InputDocument(
             ^^^^^^^^^^^^^^
  File "/Users/vku/Documents/cloud/docling-forks/docling/docling/datamodel/document.py", line 163, in __init__
    self._init_doc(backend, path_or_stream)
  File "/Users/vku/Documents/cloud/docling-forks/docling/docling/datamodel/document.py", line 221, in _init_doc
    self._backend = backend(self, path_or_stream=path_or_stream)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vku/Documents/cloud/docling-forks/docling/docling/backend/latex_backend.py", line 57, in __init__
    self.latex_text = self._preprocess_latex(self.latex_text)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vku/Documents/cloud/docling-forks/docling/docling/backend/latex_backend.py", line 94, in _preprocess_latex
    text = re.sub(rf"\\{name}(?![a-zA-Z])", value, text)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.13/Frameworks/Python.framework/Versions/3.11/lib/python3.11/re/__init__.py", line 185, in sub
    return _compile(pattern, flags).sub(repl, string, count)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.13/Frameworks/Python.framework/Versions/3.11/lib/python3.11/re/__init__.py", line 317, in _subx
    template = _compile_repl(template, pattern)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.13/Frameworks/Python.framework/Versions/3.11/lib/python3.11/re/__init__.py", line 308, in _compile_repl
    return _parser.parse_template(repl, pattern)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.13/Frameworks/Python.framework/Versions/3.11/lib/python3.11/re/_parser.py", line 1087, in parse_template
    raise s.error('bad escape %s' % this, len(this)) from None
re.error: bad escape \e at position 1
  2. Observations on conversion of this publication (https://arxiv.org/pdf/astro-ph/0001375):
     The LaTeX backend preserved scientific notation while the pdf backend lost it (default conversion settings).
     from latex:

therefore implying a column density N $$_H > 10^{25}$$ cm $$^{-2}$$ . Only two of the 11 Compton thick sources have an excess in the 15-100 keV range, while for the remaining three the hard 15-100 keV X-ray emission is unconstrained. The shortage of objects with 10 $$^{24}$$ cm $$^{-2} <$$ N $$_H < 10^{25}$$ cm $$^{-2}$$ already pointed out in a sample of optically

from pdf:

therefore implying a column density N H > 10 25 cm -2 . Only two of the 11 Compton thick sources have an excess in the 15-100 keV range, while for the remaining three the hard 15-100 keV X-ray emission is unconstrained. The shortage of objects with 10 24 cm -2 < N H < 10 25 cm -2 already pointed out in a sample of optically

the above was done by serializing to markdown using the older method document.export_to_markdown()

  3. It looks to me that images (figures) are currently not supported by the backend, but there is code to handle some cases of graphics content:
     https://github.com/adityasasidhar/docling/blob/feat/latex-support/docling/backend/latex_backend.py#L282
     A sample document that uses "figures" for images is this publication:
     https://arxiv.org/abs/cond-mat/0001307

EDIT: extraction of figures is supported when the modern LaTeX format is used; the code mentioned above handles graphics provided as PDFs. Old LaTeX documents often use ps and eps formats.

There are more findings but I need to clarify and cross-check them first.
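The bad-escape failure in finding 1 reproduces with plain `re`: a replacement *string* containing backslashes is parsed as a template, while a callable replacement is not. A minimal standalone repro (not the backend's code):

```python
import re

text = r"see \eg the results"   # body text using a custom macro
value = r"\emph{e.g.}"          # macro value containing a backslash

# A string replacement is parsed as a template, so "\e" is a bad escape:
try:
    re.sub(r"\\eg(?![a-zA-Z])", value, text)
except re.error as err:
    print(err)  # bad escape \e at position 0

# A callable replacement sidesteps template parsing entirely:
fixed = re.sub(r"\\eg(?![a-zA-Z])", lambda m: value, text)
print(fixed)  # see \emph{e.g.} the results
```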

@vku-ibm
Member

vku-ibm commented Jan 23, 2026

  1. The "$" symbol in the latex source is breaking sentences into separate blocks. For example:
We introduce a novel, interpretable CNN for images: a Sparse Feature Network (SFNet).
CNNs encode image data from a high-dimensional pixel space ($n \sim 10^5$ pixels) into a lower-dimensional latent space ($d \sim 10^3$ features). 

Becomes this in docling document:

{
      "self_ref": "#/texts/44",
      "parent": {
        "$ref": "#/body"
      },
      "children": [],
      "content_layer": "body",
      "label": "paragraph",
      "prov": [],
      "orig": "We introduce a novel, interpretable CNN for images: a Sparse Feature Network (SFNet).\nCNNs encode image data from a high-dimensional pixel space (",
      "text": "We introduce a novel, interpretable CNN for images: a Sparse Feature Network (SFNet).\nCNNs encode image data from a high-dimensional pixel space ("
    },
    {
      "self_ref": "#/texts/45",
      "parent": {
        "$ref": "#/body"
      },
      "children": [],
      "content_layer": "body",
      "label": "formula",
      "prov": [],
      "orig": "n \\sim 10^5",
      "text": "n \\sim 10^5"
    },
    {
      "self_ref": "#/texts/46",
      "parent": {
        "$ref": "#/body"
      },
      "children": [],
      "content_layer": "body",
      "label": "text",
      "prov": [],
      "orig": "pixels) into a lower-dimensional latent space (",
      "text": "pixels) into a lower-dimensional latent space ("
    },

and when serialized, these "breaks" consequently become paragraph separators.

@PeterStaar-IBM
Member

> 1. The "$" symbol in the latex source is breaking sentences into separate blocks [...] and when serialized, these "breaks" consequently become paragraph separators.

Yes, these should become inline formulas.
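The intended behavior can be sketched as a small segmenter that keeps $...$ in the running text and only emits $$...$$ as standalone formula blocks (an illustration of the desired behavior, not the backend's actual code; escaped \$ is ignored for brevity):

```python
import re

# Sketch: split a source string into segments, keeping $...$ inline with
# its paragraph text and treating only $$...$$ as display formulas.
MATH_RE = re.compile(r"\$\$(.+?)\$\$|\$(.+?)\$", re.DOTALL)

def segment(text):
    """Yield (kind, content) with kind in {'text', 'inline', 'display'}."""
    pos = 0
    for m in MATH_RE.finditer(text):
        if m.start() > pos:
            yield ("text", text[pos:m.start()])
        if m.group(1) is not None:
            yield ("display", m.group(1).strip())
        else:
            yield ("inline", m.group(2).strip())
        pos = m.end()
    if pos < len(text):
        yield ("text", text[pos:])

def to_paragraph(text):
    """Render a paragraph with inline math kept in the text flow."""
    parts = []
    for kind, content in segment(text):
        if kind == "display":
            parts.append(f"\n$$ {content} $$\n")  # standalone formula block
        elif kind == "inline":
            parts.append(f"${content}$")          # stays in the sentence
        else:
            parts.append(content)
    return "".join(parts)
```

With this split, the SFNet sentence above stays one paragraph instead of becoming three document items.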

@adityasasidhar
Author

@vku-ibm @PeterStaar-IBM thank you for the thorough testing! Working on fixes for:
✅ Comment line crash (regex escape handling)
✅ Inline math staying within paragraphs (not breaking text flow)
✅ .ps/.eps image format support

@adityasasidhar
Author

> @adityasasidhar took some random arxiv publications for tests. These are my findings so far: [...]
> 2. Observations on conversion of this publication (https://arxiv.org/pdf/astro-ph/0001375):
>    The LaTeX backend preserved scientific notation while the pdf backend lost it (default conversion settings). [...]

Isn't it better to preserve the scientific notation since we are parsing academic papers? Most probably someone would want the extra bit of context if they're specifically feeding in .tex files. But all in all, a project is supposed to be consistent... your call please.

@vku-ibm
Member

vku-ibm commented Jan 23, 2026

> Isn't it better to preserve the scientific notation since we are parsing academic papers? Most probably someone would want the extra bit of context if they're specifically feeding in .tex files. But all in all, a project is supposed to be consistent... your call please.

My apologies for not being clear. Yes, we do want to preserve scientific notation and this backend does it. I was pointing out that docling doesn't do it by default with the pdf backend. So this point wasn't the "bug report" but an observation.

@adityasasidhar
Author

> My apologies for not being clear. Yes, we do want to preserve scientific notation and this backend does it. I was pointing out that docling doesn't do it by default with the pdf backend. So this point wasn't the "bug report" but an observation.

@vku-ibm my bad if I came off as accusing. Almost done with the fixes, testing a bit for edge cases; will push once finished.

@PeterStaar-IBM
Member

> @vku-ibm my bad if I came off as accusing. Almost done with the fixes, testing a bit for edge cases; will push once finished.

@adityasasidhar no worries, we (the docling team) don't take these things personally, we just love your enthusiasm! LaTeX is a real pain sometimes, and we need to get some good rough coverage before we merge. The next iterations can then be improvements.

@adityasasidhar
Author

@vku-ibm @PeterStaar-IBM Thank you for the detailed feedback and testing!

I have pushed a new commit that addresses all the reported issues. Here is a summary of the fixes and verification:

🛠️ Fixes Implemented

1. Crash on Leading Comments (Reported by @vku-ibm)
   Fix: Adjusted the regex preprocessing to safely strip comments, including files starting with %.
   Verification: Verified against astro-ph/0001001, hep-lat/0001005, and nucl-th/0001045.

2. Macro Expansion Regex Error (re.error: bad escape)
   Fix: Modified _preprocess_latex to use a lambda function in re.sub(). This prevents backslashes in custom macro values (e.g., \emph{...}) from being interpreted as invalid escape sequences.
   Verification: Added test_latex_custom_macro_with_backslash.

3. Inline Math Breaking Paragraphs (Reported by @PeterStaar-IBM)
   Fix: Refactored _process_nodes to distinguish between inline math ($) and display math ($$, \[). Inline math is now appended directly to the text buffer, preserving sentence flow and preventing unwanted paragraph breaks. Display math still creates structured FORMULA items.
   Verification: Updated test_latex_math_parsing and manually verified output quality.

4. Figure Environment Support (Noted by @vku-ibm)
   Fix: Implemented _process_figure to create a GroupLabel.SECTION named "figure". This ensures images and their captions are semantically grouped together.
   Verification: Added test_latex_figure_with_caption and regenerated ground truth.

✅ Verification & Testing

To ensure robustness, I have:

  • Expanded Unit Tests: Added 3 new tests specifically for these edge cases.
  • End-to-End Validation: Validated against the full source of 4 complex arXiv papers:
    • Attention Is All You Need (1706.03762)
    • DeepSeek V3 (2412.19437)
    • Mistral 7B (2310.06825)
    • OTSL (2305.03393)
  • Regenerated Ground Truth: Updated all expected outputs to reflect the improved structure.
  • All Tests Passing: Ran the full suite (pytest tests/test_backend_latex.py) with 44/44 tests passing.
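The leading-comment fix amounts to stripping unescaped % comments before parsing. A minimal sketch with the standard library (assumed behavior for the simple cases only, not the PR's exact regex; a literal \% must survive):

```python
import re

# Sketch: drop a LaTeX comment, i.e. an unescaped '%' to end of line,
# while keeping the escaped percent sign '\%' intact.
COMMENT_RE = re.compile(r"(?<!\\)%.*")

def strip_comments(latex: str) -> str:
    return "\n".join(COMMENT_RE.sub("", line) for line in latex.splitlines())
```

A fuller version would also handle a % preceded by \\ (a literal backslash followed by a comment), which this one-liner misses.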

@vku-ibm
Member

vku-ibm commented Jan 26, 2026

I've tried the latest version and it looks great!

A couple of findings:

  1. Expressions like this break lines:
formation and evolution with their \textit{detailed} appearances.

example from https://arxiv.org/abs/2501.00089 , main.tex line 33

  2. Another case of line breaking:
that have emission line signal-to-noise ratios greater than 3 for [\ion{N}{2}] 

same document as above.

  3. Found a case where content order breaks. After parsing, this content pushes "power spectra" before the whole paragraph where it was present:
techniques applicable to emission line studies with the emphasis on those
that can provide information on the underlying {\em power spectra} of

becomes:

Madison, USA; email: lazarian@astro.wisc.edu

power spectra

Emission in spectral lines can provide unique information

The sample is https://arxiv.org/abs/astro-ph/0001001

  4. In the same document as point 3, the \& sequence breaks lines:
ISM phases (McKee \& Ostriker 1977)

becomes:

ISM phases (McKee

&

Ostriker 1977)
  5. Looks like a new case where an equation breaks apart (same document):
where the kernel is
\be
\Xi({\bf k}, {\bf r})=\langle e^{i f 
k_z (u_z({\bf x})- u_z({\bf x \prime}))}
\rho ({\bf x}) \rho ({\bf x \prime}) \rangle~~~.
\label{kernelXi}
\ee

becomes:

where the kernel is

k

r

(, )=

x

x

i f
k_z (u_z()- u_z())

e^

x

()

x

()

.

and later in the same document, another sample that breaks the same way:

separate velocity and density in the following way
\be
\langle e^{i f \ldots} \rho ({\bf x}) \rho ({\bf x \prime}) \rangle =\langle e^{i f \ldots} \rangle \langle \rho ({\bf x}) \rho ({\bf x}+{\bf r}) \rangle~~~,
\label{sep} 
\ee
  6. Another issue with the same document (https://arxiv.org/abs/astro-ph/0001001, procl.tex): the last content that was parsed is on line 479:
\left(1 + \left( {r_0 \over r} \right)^\gamma\right), ~~~~~ \gamma=n+3 > 0~~~.
\label{eq:xi}
\end{equation}

The rest of the document didn't get parsed.
These are the errors that the parser showed in the console; the last is likely directly related to this issue:

astro-ph0001001/procl.tex
2026-01-26 09:57:58,078 - INFO - Ignoring parse error (tolerant parsing mode): Unexpected mismatching closing brace: '}' @(13,33)
2026-01-26 09:57:58,078 - INFO - Ignoring parse error (tolerant parsing mode): Unexpected closing environment: 'equation' @(14,17)
2026-01-26 09:57:58,079 - INFO - Ignoring parse error (tolerant parsing mode): Unexpected mismatching closing brace: '}' @(15,33)
2026-01-26 09:57:58,079 - INFO - Ignoring parse error (tolerant parsing mode): Unexpected closing environment: 'eqnarray' @(16,17)
2026-01-26 09:57:58,241 - INFO - Ignoring parse error (tolerant parsing mode): Unexpected mismatching closing environment: 'document', was expecting 'equation' @(1429,0)
@adityasasidhar
Author

Hey @vku-ibm, thanks a lot for the detailed feedback on the parsing issues. I realized that a large part of the problem comes from trying to patch things at the string or regex level after parsing, which is inherently fragile for LaTeX.

I’m currently experimenting with a different approach that relies much more directly on the LaTeX AST and avoids text-level fixes as much as possible. The goal is to handle macros, environments, and math in a more structured and semantically correct way, instead of flattening early and compensating later.

I’m also planning to build a proper dictionary of macros so that regex usage can be gradually phased out. Thanks again for your patience. I’ll likely be making some changes to the overall approach and will share updates once things stabilize.

I, Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>, hereby add my Signed-off-by to this commit: f19f135

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

- Add custom macro expansion for improved text quality
- Fix preamble filtering to remove metadata garbage
- Support recursive \input{} and \include{} file loading
- Organize test data into subdirectories for complex papers
- Add full end-to-end ground truth for 4 major arXiv papers (Attention, Mistral, DeepSeek, OTSL)
- Pass all 41 unit tests and pre-commit checks

Addresses @cau-git feedback for ground-truth data.

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

- Fixed re.error: bad escape in macro expansion by using lambda in re.sub
- Fixed sentences breaking at inline math ($) by preserving it within paragraphs
- Improved figure environment with proper grouping and structured representation
- Fixed crashes on documents starting with % comments
- Added comprehensive unit tests and updated all ground truth data

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

This commit addresses several issues with LaTeX parsing:
- Correctly handle unknown macros (like \ion{N}{2}) inline to avoid line breaks.
- Fix extraction of structural macros (section, caption, etc.) vs text-only groups.
- Address PR feedback regarding inline math spacing and splitting.
- Regenerate ground truth files reflecting these improvements.

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
@adityasasidhar
Author

@vku-ibm hey!

I know it’s been a while since my last push (some might say I ran out of Claude credits… hehehe 😄), but I’m back with a major update.

I’ve addressed the pending feedback and significantly hardened the LaTeX parsing logic. Below is a summary of what’s included in this merge.


✨ Summary of Changes

🔧 Robust Macro Extraction

  • Fixed _extract_custom_macros to correctly identify macro names and definitions.
  • Handles \newcommand cases with optional arguments reliably.

📝 Inline Text Preservation

  • Refactored _process_nodes and LatexGroupNode handling.
  • Prevents unwanted line breaks for:
    • text-only groups
    • citations
    • inline math
    • unknown macros with arguments (for example, \ion{N}{2} no longer breaks across multiple lines)

🧱 Structural vs Inline Distinction

  • Introduced a dedicated structural_macros list.
  • Enables sensible decisions on when to:
    • flush the text buffer and start a new block, or
    • keep content inline

🔤 Special Character Support

  • Added comprehensive support for LaTeX accents and special symbols.
  • Examples include: \', \", \&, \%, \#
  • All now render as correct Unicode characters inline.
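A lookup-table sketch of this mapping (the entries here are illustrative samples, not the backend's full table):

```python
# Hypothetical lookup table, not the PR's actual one: map common LaTeX
# escapes and accent commands to their Unicode equivalents.
SPECIALS = {
    r"\&": "&",
    r"\%": "%",
    r"\#": "#",
    r"\'e": "é",   # acute accent
    r'\"o': "ö",   # umlaut
}

def replace_specials(text: str) -> str:
    """Replace LaTeX special-character sequences with Unicode, inline."""
    for tex, uni in SPECIALS.items():
        text = text.replace(tex, uni)
    return text
```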

📊 Table Parsing Stability

  • Improved table cell extraction logic.
  • Correctly handles & column separators, whether parsed as macros or characters.
  • Ensures consistent row and cell structure across tables.
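The escaping rules for cell extraction can be illustrated at the string level: rows split on \\, cells split on unescaped &, and \& unescaped afterwards (a sketch only; per the summary above, the backend handles & at the parsed-node level rather than on raw strings):

```python
import re

# Illustrative sketch, not the backend's code: split a tabular body into
# rows on '\\' and rows into cells on unescaped '&', then unescape '\&'.
def parse_tabular(body: str):
    rows = []
    for raw_row in re.split(r"\\\\", body):
        raw_row = raw_row.replace(r"\hline", "").strip()
        if not raw_row:
            continue  # skip rows that were only \hline or whitespace
        cells = re.split(r"(?<!\\)&", raw_row)
        rows.append([c.strip().replace(r"\&", "&") for c in cells])
    return rows
```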

🧪 End-to-End Testing

  • Updated ground-truth generation and comparison logic.
  • All 48 tests pass with the improved rendering.

Can you please run it over some more LaTeX files and let me know about edge cases and things I may have missed? Also, I had an idea about handling malformed .tex files by converting them to PDF and running them through the standard docling PDF conversion path; the user could pass an argument to enable it (just my thoughts). I'd love to know what you think about it. Maybe a small language model could help with tweaks, but even that has its own tradeoffs.

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>