feat: added support for parsing LaTeX (.tex) documents #2890

adityasasidhar · 2026-01-18T16:40:56Z

Issue resolved by this Pull Request:
Resolves #2885

This PR introduces initial support for parsing LaTeX (.tex) documents as a structured input format, using pylatexenc. This allows Docling to convert native LaTeX files directly to Markdown while preserving semantic structure.

Key changes:

New Backend: Added LatexDocumentBackend in docling/backend/latex_backend.py.
Structure Preservation: Correctly maps standard LaTeX hierarchy (\section, \subsection, abstract) to Docling's model.
Math Support: Handled inline ($) and display ($$, \[) math, plus environments like equation, align, gather, and multline (including starred variants).
Table Support: Implemented parsing for tabular environments, including handling of escaped characters and \hline cleanup.
Metadata: Improved extraction of citations (\cite) and references (\ref).
CLI Integration: Configured LatexFormatOption to enable docling convert file.tex.
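
For readers unfamiliar with the table handling described above, here is a minimal stdlib-only sketch of splitting a tabular body into rows and cells. The function name `parse_tabular` and the regex-based approach are assumptions for illustration; the backend itself is built on pylatexenc and may work differently.

```python
import re


def parse_tabular(body):
    """Split the body of a tabular environment into a list of rows of cells."""
    # Drop horizontal rules; they carry no cell content.
    body = re.sub(r"\\hline\b", "", body)
    rows = []
    # Rows are separated by a double backslash.
    for raw_row in re.split(r"\\\\", body):
        if not raw_row.strip():
            continue
        # Split on '&' only when it is not escaped as '\&'.
        cells = re.split(r"(?<!\\)&", raw_row)
        rows.append([cell.strip().replace(r"\&", "&") for cell in cells])
    return rows


sample = r"\hline Name & Value \\ \hline R\&D & 3 \\ \hline"
table = parse_tabular(sample)
```

Nested environments, \multicolumn, and cells containing math are out of scope for this sketch.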

Checklist:

Documentation has been updated, if necessary.
Examples have been added, if necessary.
Tests have been added, if necessary.

github-actions · 2026-01-18T16:41:04Z

✅ DCO Check Passed

Thanks @adityasasidhar, all your commits are properly signed off. 🎉

dosubot · 2026-01-18T16:41:09Z

Related Documentation

Checked 7 published document(s) in 1 knowledge base(s). No updates required.

mergify · 2026-01-18T16:41:30Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

#approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

codecov · 2026-01-19T07:13:58Z

Codecov Report

❌ Patch coverage is 11.89320% with 363 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
docling/backend/latex_backend.py	9.77%	360 Missing ⚠️
docling/datamodel/document.py	0.00%	2 Missing ⚠️
docling/cli/main.py	0.00%	1 Missing ⚠️

adityasasidhar · 2026-01-19T08:15:47Z

Codecov Report

Updated: Improved Test Coverage

The initial submission passed all 25 required CI checks (lint, tests across Python 3.9-3.14, examples, docs, DCO) but missed the codecov threshold (57% vs 73.16% target).

I've now added 33 additional unit tests (39 total) to achieve ~86% coverage, well above the target.

New tests cover:

✅ All heading levels (\part, \chapter, \section, \subsection, \subsubsection)
✅ List environments (itemize, enumerate, description)
✅ Code blocks (verbatim, lstlisting)
✅ Math environments (equation*, gather, multline, displaymath)
✅ Macros: \footnote, \url, \caption, \label, \cite, \ref
✅ \includegraphics and figure environments
✅ Bibliography (thebibliography)
✅ File path loading (in addition to BytesIO)
✅ Edge cases: empty tables, starred environments, filecontents

Also fixed a validation bug where \part and \chapter returned level 0 (docling requires level ≥ 1).

Awaiting workflow approval to run CI. Thank you!
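
The level fix mentioned above can be pictured as a lookup with a floor of 1. This is only a sketch: the names `RAW_LEVELS` and `heading_level` and the specific numbers are assumptions for illustration, not the mapping used in latex_backend.py.

```python
# Hypothetical raw depths; the numbers are illustrative only.
RAW_LEVELS = {
    "part": 0,
    "chapter": 0,
    "section": 1,
    "subsection": 2,
    "subsubsection": 3,
}


def heading_level(macro_name):
    """Return a heading level that is never below 1."""
    name = macro_name.rstrip("*")  # starred variants share the same level
    return max(1, RAW_LEVELS.get(name, 1))
```

Clamping at the lookup keeps every caller safe, including for macros that are missing from the table.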

PeterStaar-IBM

These are just my first comments; I think we will need a few iterations.

That said, this is really impressive so far: good code quality, and I love the approach!

docling/backend/latex_backend.py

tests/test_backend_latex.py

- Add text formatting options (bold, italic, underline) for LaTeX macros
- Enhance image embedding with PIL and ImageRef.from_pil()
- Refactor list processing to use GroupItem structure
- Refactor bibliography to use GroupItem structure
- Add nested list test coverage
- All tests passing (39/39), all linters passing

adityasasidhar · 2026-01-19T13:29:38Z

@PeterStaar-IBM Thank you for the thorough review and kind words! All feedback has been addressed in commit f19f135:

Changes implemented:

✅ Text formatting options (bold, italic, underline)
✅ PIL-based image embedding with ImageRef.from_pil()
✅ List grouping with GroupItem
✅ Bibliography grouping with GroupItem
✅ Nested list test added

Test coverage: 40/40 tests passing, all linters passing ✓

Ready for re-review! Let me know if you'd like any adjustments.

PS. Will work on future enhancements through multiple iterations

adityasasidhar · 2026-01-19T16:18:44Z

Dang it, I forgot the DCO sign-off on the commit (cries internally).

cau-git · 2026-01-19T19:06:59Z

@adityasasidhar Nice progress. Maybe this is a good point to add some sample latex file into tests/data and ensure that ground-truth representations are generated like with the other file formats, so new changes to the latex backend will show what improved or regressed.

adityasasidhar · 2026-01-20T02:59:46Z

Will do.

adityasasidhar · 2026-01-20T16:46:03Z

@cau-git @PeterStaar-IBM Thank you for the feedback!

I've significantly overhauled the implementation to allow for end-to-end ground-truth validation against real-world scientific papers.

To ensure the backend is robust enough for general use, I added the full source code for 4 major arXiv papers (Attention Is All You Need, DeepSeek V3, Mistral 7B, and OTSL) to the test suite and regenerated the ground-truth data.

Testing against these complex documents revealed several real-world parsing challenges which I have now resolved:

Split-file Documents: Added recursive support for \input{} and \include{} to handle papers that are split across multiple .tex files.
Custom Macros: Implemented expansion of user-defined macros (e.g., \newcommand) to ensure custom terms appear correctly in the output text.
Complex Preambles: Improved filtering to ensure only the document content is parsed, ignoring metadata and preamble definitions.
Asset Management: Restructured the test data to correctly resolve relative paths for images, bibliographies, and sub-files.

There's still a lot of work to do to perfect it. I had some ideas regarding restructuring tables using LLMs, which the user could opt into by passing some arguments (just my thoughts). As you can see, manually parsing the LaTeX code leaves us with a lot of gaps. Using a small language model to assist might be too complex for now, but I gave it a bit of thought. I'll probably make some more attempts at perfecting it, but real-world documents are far from perfect (cries internally wishing the world was perfect).
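
As an illustration of the split-file handling described above, here is a minimal stdlib-only sketch that splices \input{} and \include{} targets into the parent text. The name `inline_inputs`, its parameters, and the choice to resolve paths against the main file's directory are assumptions; the backend's actual implementation may differ.

```python
import re
from pathlib import Path

INPUT_RE = re.compile(r"\\(?:input|include)\{([^}]+)\}")


def inline_inputs(path, root=None, _active=frozenset()):
    """Return the text of `path` with referenced sub-files spliced in."""
    path = Path(path)
    # Assumption: targets are resolved relative to the main file's directory.
    root = path.parent if root is None else Path(root)
    resolved = path.resolve()
    if resolved in _active:  # guard against circular includes
        return ""
    active = _active | {resolved}
    text = path.read_text(encoding="utf-8")

    def splice(match):
        target = root / match.group(1).strip()
        if not target.suffix:  # authors may omit the .tex extension
            target = target.with_suffix(".tex")
        if not target.is_file():
            return match.group(0)  # leave unresolved references untouched
        return inline_inputs(target, root, active)

    # A callable replacement is inserted literally, so backslashes in the
    # spliced text are not treated as regex template escapes.
    return INPUT_RE.sub(splice, text)
```

Calling `inline_inputs("main.tex")` on a paper directory would return a single string with all reachable sub-files expanded. Commented-out \input lines are not skipped in this sketch.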

PeterStaar-IBM · 2026-01-21T04:49:04Z

@vku-ibm Please have a look at this PR. Maybe you can run it over our internal latex collection and give some general feedback.

Overall, we cannot expect this PR to support every LaTeX feature (additional feat/fix PRs can always follow). However, we need to ensure a base level of robustness.

adityasasidhar · 2026-01-22T17:34:25Z

Committed the changes with a proper DCO sign-off.

vku-ibm · 2026-01-23T09:06:41Z

@adityasasidhar I took some random arXiv publications for testing. These are my findings so far:

The parser seems to break when a LaTeX document starts with a comment line (a line beginning with the '%' symbol). Example publications:
https://arxiv.org/abs/astro-ph/0001001
https://arxiv.org/abs/hep-lat/0001005
https://arxiv.org/abs/nucl-th/0001045
Sample of the error:

    in_doc = InputDocument(
             ^^^^^^^^^^^^^^
  File "/Users/vku/Documents/cloud/docling-forks/docling/docling/datamodel/document.py", line 163, in __init__
    self._init_doc(backend, path_or_stream)
  File "/Users/vku/Documents/cloud/docling-forks/docling/docling/datamodel/document.py", line 221, in _init_doc
    self._backend = backend(self, path_or_stream=path_or_stream)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vku/Documents/cloud/docling-forks/docling/docling/backend/latex_backend.py", line 57, in __init__
    self.latex_text = self._preprocess_latex(self.latex_text)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vku/Documents/cloud/docling-forks/docling/docling/backend/latex_backend.py", line 94, in _preprocess_latex
    text = re.sub(rf"\\{name}(?![a-zA-Z])", value, text)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.13/Frameworks/Python.framework/Versions/3.11/lib/python3.11/re/__init__.py", line 185, in sub
    return _compile(pattern, flags).sub(repl, string, count)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.13/Frameworks/Python.framework/Versions/3.11/lib/python3.11/re/__init__.py", line 317, in _subx
    template = _compile_repl(template, pattern)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.13/Frameworks/Python.framework/Versions/3.11/lib/python3.11/re/__init__.py", line 308, in _compile_repl
    return _parser.parse_template(repl, pattern)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.13/Frameworks/Python.framework/Versions/3.11/lib/python3.11/re/_parser.py", line 1087, in parse_template
    raise s.error('bad escape %s' % this, len(this)) from None
re.error: bad escape \e at position 1

Observations on conversion of this publication (https://arxiv.org/pdf/astro-ph/0001375):
The LaTeX backend preserved scientific notation while the PDF backend lost it (default conversion settings).
from LaTeX:

therefore implying a column density N $$_H > 10^{25}$$ cm $$^{-2}$$ . Only two of the 11 Compton thick sources have an excess in the 15-100 keV range, while for the remaining three the hard 15-100 keV X-ray emission is unconstrained. The shortage of objects with 10 $$^{24}$$ cm $$^{-2} <$$ N $$_H < 10^{25}$$ cm $$^{-2}$$ already pointed out in a sample of optically

from pdf:

therefore implying a column density N H > 10 25 cm -2 . Only two of the 11 Compton thick sources have an excess in the 15-100 keV range, while for the remaining three the hard 15-100 keV X-ray emission is unconstrained. The shortage of objects with 10 24 cm -2 < N H < 10 25 cm -2 already pointed out in a sample of optically

The above was produced by serializing to Markdown with the older method document.export_to_markdown().

It looks to me that images (figures) are currently not supported by the backend, although there is code to handle some cases of graphics content:
https://github.com/adityasasidhar/docling/blob/feat/latex-support/docling/backend/latex_backend.py#L282
The sample document I checked that uses "figures" for images is this publication:
https://arxiv.org/abs/cond-mat/0001307

EDIT: Figure extraction is supported when a modern LaTeX format is used; the code mentioned above handles graphics provided as PDFs. Old LaTeX documents often use PS and EPS formats.

There are more findings but I need to clarify and cross check them first.
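
The `re.error: bad escape \e` in the traceback above can be reproduced outside the backend: when the replacement passed to `re.sub` is a string, Python parses backslash escapes in it, so a macro value such as `\emph{...}` is rejected. The sketch below shows the failure and one way around it using a callable replacement. The helper name `expand_macro` and the sample strings are made up for illustration.

```python
import re


def expand_macro(text, name, value):
    """Replace a backslash-prefixed macro name with `value`, taken literally."""
    pattern = rf"\\{re.escape(name)}(?![a-zA-Z])"
    # A callable replacement is inserted as-is, so backslashes in `value`
    # are never parsed as replacement-template escapes.
    return re.sub(pattern, lambda _match: value, text)


source = r"We call it \mymacro here."
value = r"\emph{SFNet}"

# Passing the value as a plain string triggers the error from the traceback.
try:
    re.sub(r"\\mymacro(?![a-zA-Z])", value, source)
    string_repl_failed = False
except re.error:
    string_repl_failed = True

expanded = expand_macro(source, "mymacro", value)
```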

vku-ibm · 2026-01-23T11:58:55Z

The "$" symbol in the LaTeX source is breaking sentences into separate blocks. For example:

We introduce a novel, interpretable CNN for images: a Sparse Feature Network (SFNet).
CNNs encode image data from a high-dimensional pixel space ($n \sim 10^5$ pixels) into a lower-dimensional latent space ($d \sim 10^3$ features).

Becomes this in docling document:

{
      "self_ref": "#/texts/44",
      "parent": {
        "$ref": "#/body"
      },
      "children": [],
      "content_layer": "body",
      "label": "paragraph",
      "prov": [],
      "orig": "We introduce a novel, interpretable CNN for images: a Sparse Feature Network (SFNet).\nCNNs encode image data from a high-dimensional pixel space (",
      "text": "We introduce a novel, interpretable CNN for images: a Sparse Feature Network (SFNet).\nCNNs encode image data from a high-dimensional pixel space ("
    },
    {
      "self_ref": "#/texts/45",
      "parent": {
        "$ref": "#/body"
      },
      "children": [],
      "content_layer": "body",
      "label": "formula",
      "prov": [],
      "orig": "n \\sim 10^5",
      "text": "n \\sim 10^5"
    },
    {
      "self_ref": "#/texts/46",
      "parent": {
        "$ref": "#/body"
      },
      "children": [],
      "content_layer": "body",
      "label": "text",
      "prov": [],
      "orig": "pixels) into a lower-dimensional latent space (",
      "text": "pixels) into a lower-dimensional latent space ("
    },

and when serialized, these "breaks" consequently become paragraph separators.

PeterStaar-IBM · 2026-01-23T12:02:09Z

Yes, these should become inline formulas.
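
To make the desired behavior concrete, here is a minimal stdlib-only sketch in which only display math ($$...$$ or \[...\]) starts a separate formula block, while inline $...$ stays inside the paragraph text. The name `split_blocks` and the regex approach are assumptions for illustration; the backend works on pylatexenc nodes rather than raw strings.

```python
import re

# Display math only: $$...$$ or \[...\]. Inline $...$ is deliberately not matched.
DISPLAY_MATH = re.compile(r"\$\$(.+?)\$\$|\\\[(.+?)\\\]", re.DOTALL)


def split_blocks(text):
    """Return (label, content) pairs; inline math remains part of paragraphs."""
    blocks = []
    pos = 0
    for match in DISPLAY_MATH.finditer(text):
        before = text[pos:match.start()].strip()
        if before:
            blocks.append(("paragraph", before))
        formula = match.group(1) if match.group(1) is not None else match.group(2)
        blocks.append(("formula", formula.strip()))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        blocks.append(("paragraph", tail))
    return blocks


inline_sample = r"a pixel space ($n \sim 10^5$ pixels) into a latent space ($d \sim 10^3$ features)."
display_sample = r"Energy: \[E = mc^2\] as shown."
```

With this split, the sentence from the report above stays in a single paragraph block instead of being cut at each "$".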

adityasasidhar · 2026-01-23T15:11:38Z

@vku-ibm @PeterStaar-IBM thank you for the thorough testing! Working on fixes for:
✅ Comment line crash (regex escape handling)
✅ Inline math staying within paragraphs (not breaking text flow)
✅ .ps/.eps image format support

adityasasidhar · 2026-01-23T15:14:06Z

2. Observations on conversion of [this](https://arxiv.org/pdf/astro-ph/0001375) publication:
   LaTex backend preserved scientific notation while pdf backend lost it (default conversion settings).

Isn't it better to preserve the scientific notation, since we are parsing academic papers? Someone who specifically feeds in .tex files would most probably want that extra bit of context. That said, a project should be consistent, so it's your call.

vku-ibm · 2026-01-23T17:30:05Z

My apologies for not being clear. Yes, we do want to preserve scientific notation and this backend does it. I was pointing out that docling doesn't do it by default with the pdf backend. So this point wasn't the "bug report" but an observation.

adityasasidhar · 2026-01-23T19:19:34Z

@vku-ibm My bad if I came across as accusatory. I'm almost done with the fixes and am testing a few edge cases; I'll push once finished.

PeterStaar-IBM · 2026-01-23T20:05:55Z

@adityasasidhar No worries, we (the docling team) don't take these things personally; we just love your enthusiasm! LaTeX is a real pain sometimes, and we need to get some good rough coverage before we merge. The next iterations can then be improvements.

adityasasidhar · 2026-01-24T11:58:03Z

@vku-ibm @PeterStaar-IBM Thank you for the detailed feedback and testing!

I have pushed a new commit that addresses all the reported issues. Here is a summary of the fixes and verification:

🛠️ Fixes Implemented

Crash on Leading Comments (Reported by @vku-ibm)
- Fix: Adjusted the regex preprocessing to safely strip comments, including files starting with %.
- Verification: Verified against astro-ph/0001001, hep-lat/0001005, and nucl-th/0001045.

Macro Expansion Regex Error (re.error: bad escape)
- Fix: Modified _preprocess_latex to use a lambda function in re.sub(). This prevents backslashes in custom macro values (e.g., \emph{...}) from being interpreted as invalid escape sequences.
- Verification: Added test_latex_custom_macro_with_backslash.

Inline Math Breaking Paragraphs (Reported by @PeterStaar-IBM)
- Fix: Refactored _process_nodes to distinguish between inline math ($) and display math ($$, \[). Inline math is now appended directly to the text buffer, preserving sentence flow and preventing unwanted paragraph breaks. Display math still creates structured FORMULA items.
- Verification: Updated test_latex_math_parsing and manually verified output quality.

Figure Environment Support (Note by @vku-ibm)
- Fix: Implemented _process_figure to create a GroupLabel.SECTION named "figure". This ensures images and their captions are semantically grouped together.
- Verification: Added test_latex_figure_with_caption and regenerated ground truth.

✅ Verification & Testing

To ensure robustness, I have:

- Expanded Unit Tests: Added 3 new tests specifically for these edge cases.
- End-to-End Validation: Validated against the full source of 4 complex arXiv papers:
  - Attention Is All You Need (1706.03762)
  - DeepSeek V3 (2412.19437)
  - Mistral 7B (2310.06825)
  - OTSL (2305.03393)
- Regenerated Ground Truth: Updated all expected outputs to reflect the improved structure.
- All Tests Passing: Ran the full suite (pytest tests/test_backend_latex.py) with 44/44 tests passing.
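
For reference, comment stripping of the kind described in the first fix can be sketched with a single regex that removes everything from an unescaped % to the end of the line. The name `strip_comments` is hypothetical, and this is not necessarily the expression used in _preprocess_latex.

```python
import re

# '%' starts a comment unless it is escaped as '\%'.
COMMENT_RE = re.compile(r"(?<!\\)%.*")


def strip_comments(text):
    """Remove LaTeX comments while keeping escaped percent signs."""
    # '.' does not match newlines, so each comment ends at its own line end.
    return COMMENT_RE.sub("", text)


sample = "% header comment\nGrowth of 5\\% per year % trailing note\n\\section{Intro}"
cleaned = strip_comments(sample)
```

One known limitation of this sketch: a % preceded by an escaped backslash (as in `\\%`) is treated as escaped, and verbatim environments are not protected.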

vku-ibm · 2026-01-26T12:17:37Z

I've tried the latest version and it looks great!

A couple of findings:

Expressions like this, break lines:

formation and evolution with their \textit{detailed} appearances.

example from https://arxiv.org/abs/2501.00089 , main.tex line 33

Another case of line breaking:

that have emission line signal-to-noise ratios greater than 3 for [\ion{N}{2}]

same document as above.

Found a case where the content order breaks. After parsing, the text "power spectra" is pushed ahead of the whole paragraph in which it appears:

techniques applicable to emission line studies with the emphasis on those
that can provide information on the underlying {\em power spectra} of

becomes:

Madison, USA; email: lazarian@astro.wisc.edu

power spectra

Emission in spectral lines can provide unique information

The sample is https://arxiv.org/abs/astro-ph/0001001

In the same document (astro-ph/0001001), the \& sequence breaks lines:

ISM phases (McKee \& Ostriker 1977)

becomes:

ISM phases (McKee

&

Ostriker 1977)

Looks like a new case where an equation is breaking apart (same document):

where the kernel is
\be
\Xi({\bf k}, {\bf r})=\langle e^{i f 
k_z (u_z({\bf x})- u_z({\bf x \prime}))}
\rho ({\bf x}) \rho ({\bf x \prime}) \rangle~~~.
\label{kernelXi}
\ee

becomes:

where the kernel is

k

r

(, )=

x

x

i f
k_z (u_z()- u_z())

e^

x

()

x

()

.

and later in the same document, another sample that breaks the same way:

separate velocity and density in the following way
\be
\langle e^{i f \ldots} \rho ({\bf x}) \rho ({\bf x \prime}) \rangle =\langle e^{i f \ldots} \rangle \langle \rho ({\bf x}) \rho ({\bf x}+{\bf r}) \rangle~~~,
\label{sep} 
\ee

Another issue with the same document (https://arxiv.org/abs/astro-ph/0001001, procl.tex): the last content that was parsed is on line 479:

\left(1 + \left( {r_0 \over r} \right)^\gamma\right), ~~~~~ \gamma=n+3 > 0~~~.
\label{eq:xi}
\end{equation}

The rest of the document didn't get parsed.
These are the errors that the parser showed in the console; the last one is likely directly related to this issue:

astro-ph0001001/procl.tex
2026-01-26 09:57:58,078 - INFO - Ignoring parse error (tolerant parsing mode): Unexpected mismatching closing brace: '}' @(13,33)
2026-01-26 09:57:58,078 - INFO - Ignoring parse error (tolerant parsing mode): Unexpected closing environment: 'equation' @(14,17)
2026-01-26 09:57:58,079 - INFO - Ignoring parse error (tolerant parsing mode): Unexpected mismatching closing brace: '}' @(15,33)
2026-01-26 09:57:58,079 - INFO - Ignoring parse error (tolerant parsing mode): Unexpected closing environment: 'eqnarray' @(16,17)
2026-01-26 09:57:58,241 - INFO - Ignoring parse error (tolerant parsing mode): Unexpected mismatching closing environment: 'document', was expecting 'equation' @(1429,0)

adityasasidhar · 2026-01-26T15:00:32Z

Hey @vku-ibm, thanks a lot for the detailed feedback on the parsing issues. I realized that a large part of the problem comes from trying to patch things at the string or regex level after parsing, which is inherently fragile for LaTeX.

I’m currently experimenting with a different approach that relies much more directly on the LaTeX AST and avoids text-level fixes as much as possible. The goal is to handle macros, environments, and math in a more structured and semantically correct way, instead of flattening early and compensating later.

I’m also planning to build a proper dictionary of macros so that regex usage can be gradually phased out. Thanks again for your patience. I’ll likely be making some changes to the overall approach and will share updates once things stabilize.

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

…r@gmail.com> I, Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>, hereby add my Signed-off-by to this commit: f19f135 Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

@cau-git

- Add custom macro expansion for improved text quality
- Fix preamble filtering to remove metadata garbage
- Support recursive \input{} and \include{} file loading
- Organize test data into subdirectories for complex papers
- Add full end-to-end ground truth for 4 major arXiv papers (Attention, Mistral, DeepSeek, OTSL)
- Pass all 41 unit tests and pre-commit checks

Addresses @cau-git feedback for ground-truth data.

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

- Fixed re.error: bad escape in macro expansion by using lambda in re.sub
- Fixed sentences breaking at inline math ($) by preserving it within paragraphs
- Improved figure environment with proper grouping and structured representation
- Fixed crashes on documents starting with % comments
- Added comprehensive unit tests and updated all ground truth data

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

This commit addresses several issues with LaTeX parsing:
- Correctly handle unknown macros (like \ion{N}{2}) inline to avoid line breaks.
- Fix extraction of structural macros (section, caption, etc.) vs text-only groups.
- Address PR feedback regarding inline math spacing and splitting.
- Regenerate ground truth files reflecting these improvements.

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

adityasasidhar · 2026-01-29T18:57:23Z

@vku-ibm hey!

I know it’s been a while since my last push (some might say I ran out of Claude credits… hehehe 😄), but I’m back with a major update.

I’ve addressed the pending feedback and significantly hardened the LaTeX parsing logic. Below is a summary of what’s included in this merge.

✨ Summary of Changes

🔧 Robust Macro Extraction

Fixed _extract_custom_macros to correctly identify macro names and definitions.
Handles \newcommand cases with optional arguments reliably.
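As a rough illustration of what extracting `\newcommand` definitions with optional arguments involves, here is a minimal sketch. It is not the code of `_extract_custom_macros`; the function `extract_macros`, the regular expression, and the returned tuple layout are all assumptions. It reads the header with a regex and then scans for the matching closing brace so that nested braces in the body are kept intact.

```python
import re
from typing import Dict, Optional, Tuple

# Header of a definition: \newcommand{\name}[n][default]{  (body follows)
_HEADER = re.compile(
    r"\\(?:re)?newcommand\*?\s*"                    # \newcommand, \renewcommand, starred
    r"(?:\{\s*(\\[A-Za-z@]+)\s*\}|(\\[A-Za-z@]+))"  # {\name} or bare \name
    r"\s*(?:\[(\d)\])?"                             # optional argument count
    r"\s*(?:\[([^\]]*)\])?"                         # optional default for argument 1
    r"\s*\{"                                        # opening brace of the body
)


def extract_macros(source: str) -> Dict[str, Tuple[int, Optional[str], str]]:
    """Map macro name -> (argument count, optional default, body)."""
    macros = {}
    for match in _HEADER.finditer(source):
        depth, i = 1, match.end()
        # Walk forward until the brace opened by the header is closed.
        while i < len(source) and depth:
            ch = source[i]
            if ch == "\\":
                i += 2  # skip the escaped character, e.g. \{ or \}
                continue
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
            i += 1
        if depth:
            continue  # unbalanced body: skip this definition
        name = match.group(1) or match.group(2)
        body = source[match.end(): i - 1]
        macros[name] = (int(match.group(3) or 0), match.group(4), body)
    return macros


sample = r"""
\newcommand{\R}{\mathbb{R}}
\newcommand{\vect}[1]{\mathbf{#1}}
\newcommand\greet[2][World]{Hello, #1 and #2}
"""
macros = extract_macros(sample)
```

The sketch ignores `\def` and `\DeclareMathOperator`, and does not strip comments before scanning.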

📝 Inline Text Preservation

Refactored _process_nodes and LatexGroupNode handling.
Prevents unwanted line breaks for:
- text-only groups
- citations
- inline math
- unknown macros with arguments (for example, \ion{N}{2}, which previously broke across multiple lines)

🧱 Structural vs Inline Distinction

Introduced a dedicated structural_macros list.
Enables sensible decisions on when to:
- flush the text buffer and start a new block, or
- keep content inline
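The flush-or-keep-inline decision can be sketched as follows. The node tuples, the `group_blocks` function, and the three-entry macro set are simplified stand-ins chosen for illustration; the real backend works on pylatexenc nodes and its `structural_macros` list is not reproduced here.

```python
# Illustrative subset only; the backend's actual list is not shown in this thread.
STRUCTURAL_MACROS = {"section", "subsection", "caption"}


def group_blocks(nodes):
    """Group nodes into blocks.

    Each node is ("text", str) or ("macro", name, [args]); both shapes are
    assumptions made for this sketch.
    """
    blocks, buffer = [], []

    def flush():
        # Emit the buffered inline content as one paragraph, if any.
        text = "".join(buffer).strip()
        if text:
            blocks.append(("paragraph", text))
        buffer.clear()

    for node in nodes:
        if node[0] == "text":
            buffer.append(node[1])
        elif node[1] in STRUCTURAL_MACROS:
            # Structural macro: close the current paragraph, start a new block.
            flush()
            blocks.append((node[1], node[2][0]))
        else:
            # Unknown macro: keep it inline so the sentence is not split.
            buffer.append("\\" + node[1] + "".join("{" + a + "}" for a in node[2]))
    flush()
    return blocks


blocks = group_blocks([
    ("macro", "section", ["Results"]),
    ("text", "The "),
    ("macro", "ion", ["N", "2"]),
    ("text", " line is strong. "),
    ("macro", "caption", ["Spectrum"]),
    ("text", "More text."),
])
```

Here the unknown `\ion{N}{2}` stays inside its sentence, while `\section` and `\caption` each end the paragraph in progress.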

🔤 Special Character Support

Added comprehensive support for LaTeX accents and special symbols.
Examples include: \', \", \&, \%, \#
All now render as correct Unicode characters inline.
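One way to get this behaviour with only the standard library is sketched below. It is an assumption-laden illustration, not the backend's approach (which may rely on pylatexenc instead): the function `latex_to_unicode` and the accent table are hypothetical, and only five accents and three escaped symbols are covered. Accents are mapped to Unicode combining marks and composed with NFC normalization.

```python
import re
import unicodedata

# LaTeX accent command -> Unicode combining mark (small illustrative subset).
COMBINING = {"'": "\u0301", '"': "\u0308", "`": "\u0300", "^": "\u0302", "~": "\u0303"}

# Matches \'e as well as \'{e}.
_ACCENT = re.compile(r"""\\(['"`^~])(?:\{([A-Za-z])\}|([A-Za-z]))""")
# Matches the escaped symbols \&, \% and \#.
_ESCAPED = re.compile(r"\\([&%#])")


def latex_to_unicode(text: str) -> str:
    def compose(match):
        letter = match.group(2) or match.group(3)
        # Letter followed by a combining mark, composed into one code point.
        return unicodedata.normalize("NFC", letter + COMBINING[match.group(1)])

    text = _ACCENT.sub(compose, text)
    # Escaped symbols simply lose their backslash.
    return _ESCAPED.sub(lambda match: match.group(1), text)


converted = latex_to_unicode(r'''Caf\'e \& Sch\"{o}n: 50\% of \#1''')
print(converted)
```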

📊 Table Parsing Stability

Improved table cell extraction logic.
Correctly handles & column separators, whether parsed as macros or characters.
Ensures consistent row and cell structure across tables.
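A minimal sketch of the row and cell splitting, assuming the body of a `tabular` environment is available as a plain string. The function `parse_tabular_body` is hypothetical and works on raw text rather than on parsed nodes as the backend does. The key detail is that only an unescaped `&` separates columns, while `\&` is a literal ampersand.

```python
import re

_ROW_END = re.compile(r"\\\\(?:\[[^\]]*\])?")  # \\ with optional [spacing]
_RULE = re.compile(r"\\hline\b")               # horizontal rules carry no cell data
_CELL_SEP = re.compile(r"(?<!\\)&")            # & not preceded by a backslash


def parse_tabular_body(body: str):
    """Return the table as a list of rows, each a list of cell strings."""
    rows = []
    for raw in _ROW_END.split(_RULE.sub("", body)):
        if not raw.strip():
            continue  # nothing left once rules and whitespace are removed
        cells = [cell.strip().replace(r"\&", "&") for cell in _CELL_SEP.split(raw)]
        rows.append(cells)
    return rows


rows = parse_tabular_body(r"""
\hline
Name & Value \\
\hline
R\&D & 10 \\
Ops & 20 \\
\hline
""")
```

The sketch does not handle `\multicolumn`, nested tables, or `&` inside math, which a real parser would need to consider.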

🧪 End-to-End Testing

Updated ground-truth generation and comparison logic.
All 48 tests pass with the improved rendering.

Could you please run it over some more LaTeX files and let me know about any edge cases or anything else I may have missed? I also had an idea for handling malformed .tex files: convert them to PDF and run them through the standard Docling PDF conversion. I'd love to know what you think. A small language model could perhaps handle the tweaks, but that has its own tradeoffs.

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

adityasasidhar changed the title ~~added a feature for latex~~ Jan 18, 2026

adityasasidhar force-pushed the feat/latex-support branch from 27f98dc to 170f769 Compare January 18, 2026 16:43

adityasasidhar force-pushed the feat/latex-support branch 2 times, most recently from acd16d6 to 4363deb Compare January 19, 2026 08:12

PeterStaar-IBM requested review from PeterStaar-IBM, cau-git, ceberam and dolfim-ibm January 19, 2026 08:38

PeterStaar-IBM requested changes Jan 19, 2026


adityasasidhar requested a review from PeterStaar-IBM January 19, 2026 13:30

adityasasidhar force-pushed the feat/latex-support branch from 0e985c9 to d6c4ee8 Compare January 29, 2026 18:04

adityasasidhar added 14 commits January 30, 2026 00:14

feat: added support for parsing LaTeX (.tex) documents

d7a3056

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

DCO Remediation Commit for Aditya Sasidhar <telikicherlaadityasasidha…

3b9e517

…r@gmail.com> I, Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>, hereby add my Signed-off-by to this commit: f19f135 Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

DCO Remediation Commit for Aditya Sasidhar <telikicherlaadityasasidha…

0bb6fc7

…r@gmail.com> I, Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>, hereby add my Signed-off-by to this commit: f19f135 Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

fix: minor formatting in test file

bc2a78f

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

WIP: saving work for laptop migration

e612e1f

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

got rid of the line breaking issues, still some do exist

53b23fe

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

style: apply automatic formatting fixes

18f79e6

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

style: fix ruff linter and formatter errors

345f7aa

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

fix: typing issues identified by mypy

7991128

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

style: apply formatting fixes to tests

a604e80

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

adityasasidhar force-pushed the feat/latex-support branch from dda8d65 to a604e80 Compare January 29, 2026 18:44

fix: update groundtruth files for latex backend

10f73eb

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>


4 participants
