feat: document-sanitization EPDF_* exports (XMP, thumbnails, JavaScript) #27

Open
Phauks wants to merge 1 commit into embedpdf:embedpdf/main from Phauks:feat/document-sanitization

Phauks commented Jun 13, 2026

Implements the engine (C++) side of embedpdf/embed-pdf-viewer#673 — document-sanitization removal functions for redaction defensibility.

Adds three EPDF_* extension functions, mirroring the existing EPDF_SetMetaText style (declared in public/fpdf_doc.h, implemented in fpdfsdk/), so the WASM build's export generator picks them up automatically:

  • EPDF_RemoveXMPMetadata(doc) — removes the catalog /Metadata (XMP) stream. XMP is stored separately from /Info, so clearing the Info dict leaves author/title/history in XMP; this removes it.
  • EPDF_RemoveEmbeddedThumbnails(doc): removes every page's /Thumb entry, which can retain a pre-redaction image of the page.
  • EPDF_RemoveAllJavaScript(doc) — removes the catalog /Names /JavaScript name tree, a JavaScript /OpenAction (a plain GoTo /OpenAction is preserved), and the catalog /AA.

Each uses the same CPDFDocumentFromFPDFDocument + GetMutableRoot() / RemoveFor (and GetMutablePageDictionary for thumbnails) patterns as the neighbouring extensions.
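
For readers who want the removal logic in one place, here is a minimal sketch on a toy dictionary model. It is not the code in this PR and does not use PDFium types: `ToyDict`, `ToyDocument`, and the `Toy*` functions are hypothetical stand-ins, and the return types are assumptions. It only models dictionary-valued /OpenAction entries and assumes the action type is read from /S.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Toy stand-in for a PDF dictionary. This is NOT PDFium's CPDF_Dictionary.
struct ToyDict {
  std::map<std::string, std::string> names;               // simple values, e.g. /S -> "JavaScript"
  std::map<std::string, std::shared_ptr<ToyDict>> dicts;  // nested dictionaries

  // Removes a key whatever its value kind; returns true if anything was removed.
  bool RemoveFor(const std::string& key) {
    const bool removed_name = names.erase(key) > 0;
    const bool removed_dict = dicts.erase(key) > 0;
    return removed_name || removed_dict;
  }

  // Returns the nested dictionary stored under key, or nullptr if absent.
  std::shared_ptr<ToyDict> GetDictFor(const std::string& key) const {
    auto it = dicts.find(key);
    return it == dicts.end() ? nullptr : it->second;
  }
};

struct ToyDocument {
  ToyDict root;                // the catalog
  std::vector<ToyDict> pages;  // one dictionary per page
};

// Drops the catalog /Metadata entry (the XMP stream).
bool ToyRemoveXMPMetadata(ToyDocument& doc) {
  return doc.root.RemoveFor("Metadata");
}

// Drops /Thumb from every page; returns how many pages had one.
int ToyRemoveEmbeddedThumbnails(ToyDocument& doc) {
  int removed = 0;
  for (ToyDict& page : doc.pages) {
    if (page.RemoveFor("Thumb")) ++removed;
  }
  return removed;
}

// Drops /Names /JavaScript, a JavaScript /OpenAction, and the catalog /AA.
// An /OpenAction whose /S is anything else (e.g. GoTo) is left alone.
bool ToyRemoveAllJavaScript(ToyDocument& doc) {
  bool changed = false;
  if (auto names = doc.root.GetDictFor("Names")) {
    changed |= names->RemoveFor("JavaScript");
  }
  if (auto open = doc.root.GetDictFor("OpenAction")) {
    auto s = open->names.find("S");
    if (s != open->names.end() && s->second == "JavaScript") {
      changed |= doc.root.RemoveFor("OpenAction");
    }
  }
  changed |= doc.root.RemoveFor("AA");
  return changed;
}
```

In this model, the other entries under /Names and a GoTo /OpenAction stay in place after `ToyRemoveAllJavaScript` runs, which is the preservation behaviour described above.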

The TypeScript sanitizeDocument(doc, options) engine method that composes these functions (plus the existing removeAttachment loop and a non-incremental save) is in a companion PR on embedpdf/embed-pdf-viewer, along with Node tests asserting that each vector is removed while unrelated content is preserved. That PR bumps the pdfium-src submodule to this commit once merged.

Open questions for maintainers are in #673 (granular exports vs. a single EPDF_SanitizeDocument; and treating hidden OCG layers as a separate follow-up).
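
To make the first question concrete, here is a rough sketch of how a single entry point could wrap the granular functions. Everything in it is hypothetical: the flag names, `ToyGranularOps`, and the `ToySanitizeDocument` signature are illustrations only, not a proposed API, and the callables stand in for the three EPDF_* removals.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical option bits; this PR does not define any flags.
enum ToySanitizeFlags : std::uint32_t {
  kToyRemoveXMP = 1u << 0,
  kToyRemoveThumbnails = 1u << 1,
  kToyRemoveJavaScript = 1u << 2,
  kToyRemoveAll = kToyRemoveXMP | kToyRemoveThumbnails | kToyRemoveJavaScript,
};

// Stand-ins for the three granular removals; each returns true on success.
struct ToyGranularOps {
  std::function<bool()> remove_xmp;
  std::function<bool()> remove_thumbnails;
  std::function<bool()> remove_javascript;
};

// Runs only the requested steps and returns the mask of steps that
// reported success, so a caller can tell which vectors were handled.
std::uint32_t ToySanitizeDocument(const ToyGranularOps& ops,
                                  std::uint32_t flags) {
  std::uint32_t done = 0;
  if ((flags & kToyRemoveXMP) && ops.remove_xmp && ops.remove_xmp()) {
    done |= kToyRemoveXMP;
  }
  if ((flags & kToyRemoveThumbnails) && ops.remove_thumbnails &&
      ops.remove_thumbnails()) {
    done |= kToyRemoveThumbnails;
  }
  if ((flags & kToyRemoveJavaScript) && ops.remove_javascript &&
      ops.remove_javascript()) {
    done |= kToyRemoveJavaScript;
  }
  return done;
}
```

With granular exports the composition lives in the TypeScript layer. With a wrapper like this it would live in the engine, at the cost of one more signature to keep stable.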

Add three EmbedPDF extension functions for redaction-defensibility scrubbing
of non-content hidden vectors, mirroring the existing EPDF_SetMetaText style:

- EPDF_RemoveXMPMetadata: drop the catalog /Metadata XMP stream (survives an
  Info-dict clear; the #1 sanitization miss).
- EPDF_RemoveEmbeddedThumbnails: drop every page /Thumb.
- EPDF_RemoveAllJavaScript: drop /Names /JavaScript, JS /OpenAction, and /AA.

Declared in public/fpdf_doc.h (auto-exported by the WASM build's generator),
implemented in fpdfsdk/fpdf_doc.cpp via GetMutableRoot()/RemoveFor.