feat: EPDF_RemoveOptionalContentGroups — hidden optional-content layer removal (follow-up to #673)#28

Open
Phauks wants to merge 2 commits into embedpdf:embedpdf/main from Phauks:feat/ocg-removal

Phauks commented Jun 13, 2026

Follow-up to embedpdf/embed-pdf-viewer#673 — adds the fourth, harder sanitization vector: removing content hidden behind OFF optional-content groups (layers).

Stacked on #27 (the XMP/JS/thumbnails exports). This branch includes #27's commit; the OCG-specific commit is 8b8a678. Please land #27 first — I'll rebase this onto the merged base so it shows only the OCG change.

EPDF_RemoveOptionalContentGroups(doc)

Hidden layers can't be sanitized by just deleting /OCProperties — that would make the hidden content visible. This excises the content instead: for each page it drops the page objects that are not visible under the default (View) configuration (using CPDF_OCContext::CheckPageObjectVisible, which resolves OCGs, OCMDs, and /VE visibility expressions), regenerates the page content, then removes the catalog /OCProperties. Implemented in fpdfsdk/fpdf_editpage.cpp (reuses FPDF_LoadPage for content parsing and CPDF_PageContentGenerator for regeneration), declared in public/fpdf_doc.h.
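
The ordering described above (excise hidden objects first, drop /OCProperties last) can be sketched with a simplified stand-in model. This is not the code in this PR and does not use PDFium types: `PageObject`, `Page`, `Document`, `IsVisible`, `RemoveHiddenLayerContent`, and `MakeExampleDocument` are hypothetical names for illustration. The model only handles a single layer name per object, whereas the real check (CPDF_OCContext::CheckPageObjectVisible) also resolves OCMDs and /VE expressions.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified model. None of these are PDFium types.
struct PageObject {
  std::string text;                // stand-in for the object's content
  std::optional<std::string> ocg;  // controlling layer name, if any
};

struct Page {
  std::vector<PageObject> objects;
};

struct Document {
  std::vector<Page> pages;
  // Stand-in for the catalog /OCProperties: layer name -> ON (true) or
  // OFF (false) in the default configuration.
  std::optional<std::map<std::string, bool>> oc_properties;
};

// Stand-in for the visibility check. Assumption of this sketch: an object
// that references a layer missing from the map is treated as visible.
bool IsVisible(const PageObject& obj, const std::map<std::string, bool>& states) {
  if (!obj.ocg) return true;  // not layer-controlled
  auto it = states.find(*obj.ocg);
  return it == states.end() || it->second;
}

// Removes every object hidden under the default configuration, then drops
// the layer table. Returns the number of objects removed.
int RemoveHiddenLayerContent(Document& doc) {
  if (!doc.oc_properties) return 0;  // no layers, nothing to do
  const std::map<std::string, bool>& states = *doc.oc_properties;
  int removed = 0;
  for (Page& page : doc.pages) {
    std::vector<PageObject> kept;
    for (PageObject& obj : page.objects) {
      if (IsVisible(obj, states)) {
        kept.push_back(std::move(obj));
      } else {
        ++removed;
      }
    }
    page.objects = std::move(kept);  // analogous to regenerating the content
  }
  // Only now is it safe to drop the layer table: the hidden content is gone,
  // so removing the table can no longer reveal it.
  doc.oc_properties.reset();
  return removed;
}

// Builds a one-page document with one unlayered object, one object on an
// OFF layer, and one object on an ON layer.
Document MakeExampleDocument() {
  Document doc;
  Page page;
  page.objects.push_back(PageObject{"visible body text", std::nullopt});
  page.objects.push_back(PageObject{"hidden draft note", std::string("Draft")});
  page.objects.push_back(PageObject{"watermark", std::string("Watermark")});
  doc.pages.push_back(page);
  doc.oc_properties =
      std::map<std::string, bool>{{"Draft", false}, {"Watermark", true}};
  return doc;
}
```

Calling `RemoveHiddenLayerContent` on the document from `MakeExampleDocument` removes only the object on the OFF `Draft` layer and leaves the other two in their original order. Resetting `oc_properties` without the excision pass would leave all three objects in place with nothing marking the draft note as hidden, which is the failure mode the description warns about.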

The TypeScript optionalContentGroups flag on sanitizeDocument and a test (test-remove-ocg.mjs: hidden-layer text removed via text extraction, /OCProperties gone, visible content preserved) are in the companion monorepo PR.

Known scope / follow-ups (noted for review): this covers page-level content marked by hidden OCGs. Content inside form XObjects with their own /OC, annotation /OC, and deeply nested cases may need additional handling or a rasterize-flatten fallback — happy to extend based on your preference.

Phauks added 2 commits June 13, 2026 03:38
Add three EmbedPDF extension functions for redaction-defensibility scrubbing
of non-content hidden vectors, mirroring the existing EPDF_SetMetaText style:

- EPDF_RemoveXMPMetadata: drop the catalog /Metadata XMP stream (it survives an
  Info-dict clear, the #1 sanitization miss).
- EPDF_RemoveEmbeddedThumbnails: drop every page /Thumb.
- EPDF_RemoveAllJavaScript: drop /Names /JavaScript, JS /OpenAction, and /AA.

Declared in public/fpdf_doc.h (auto-exported by the WASM build's generator),
implemented in fpdfsdk/fpdf_doc.cpp via GetMutableRoot()/RemoveFor.
…t layers

Per-page, drop page objects not visible under the default (View) OC config
(CPDF_OCContext::CheckPageObjectVisible resolves OCG/OCMD/VE), regenerate the
content, then remove the catalog /OCProperties. Excises hidden-layer content
rather than just deleting /OCProperties (which would reveal it). Declared in
public/fpdf_doc.h, implemented in fpdfsdk/fpdf_editpage.cpp (reuses FPDF_LoadPage
for content parsing + CPDF_PageContentGenerator for regen).