SunnkerLocket89/README.md

Idaho4 Exhibits Parser

This repository provides a command line helper that automates the task of downloading and organising the public exhibits listed in the Idaho4_exhibits_with_full_metadata.xlsx spreadsheet. The script reads the spreadsheet, downloads the referenced PDF files, and optionally extracts the first N pages of each document into a dedicated folder.

Installation

The parser now works out of the box using only the Python standard library. Optional third-party packages improve performance and unlock extras. Install them individually or, when it is available, via the provided requirements.txt file:

pip install -r requirements.txt

Usage

python run_idaho4_parser.py \
  --in-file Idaho4_exhibits_with_full_metadata.xlsx \
  --sheet Exhibits_With_Metadata \
  --workers 6 \
  --extract-pages 4
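
The --workers flag suggests that downloads run in parallel. How the script does this is not shown here, so the following is only a minimal sketch of one common approach, a thread pool from the standard library. The names download_all and fetch are hypothetical and not taken from the script.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(jobs, fetch, workers=6):
    """Run fetch(url) for each (row_number, url) job on a thread pool.

    Hypothetical helper: returns a dict mapping row_number to
    ("ok", result) or ("failed", error message).
    """
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Remember which Excel row each future belongs to.
        futures = {pool.submit(fetch, url): row for row, url in jobs}
        for future in as_completed(futures):
            row = futures[future]
            try:
                results[row] = ("ok", future.result())
            except Exception as exc:  # one bad URL should not stop the batch
                results[row] = ("failed", str(exc))
    return results
```

In a real run, fetch would be a function that downloads one URL to disk (for example a thin wrapper around urllib.request.urlopen). Passing it in as a parameter keeps the sketch testable without network access.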

By default, the script stores the downloaded PDFs in idaho4_output/downloads and writes a JSON manifest plus a CSV summary to idaho4_output. Downloaded files are prefixed with the zero-padded Excel row number to guarantee unique filenames while keeping the on-disk order aligned with the worksheet.

The manifest records whether each row succeeded, was skipped (for example, because it did not contain a URL), or failed, and includes the corresponding Excel row number for quick cross-referencing. Re-run the command with --resume to continue from where a previous session stopped without re-downloading files.
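
The naming and resume behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script's actual code: the function names, the padding width of 5, the .pdf suffix, the status labels ("ok", "skipped", "failed") and the manifest layout (a JSON list of records) are all hypothetical.

```python
import json
from pathlib import Path


def build_filename(row_number: int, exhibit_id: str, width: int = 5) -> str:
    """Prefix the exhibit identifier with the zero-padded Excel row number."""
    # Replace characters that are awkward in filenames with underscores.
    safe_id = "".join(c if c.isalnum() or c in "-_" else "_" for c in exhibit_id)
    return f"{row_number:0{width}d}_{safe_id}.pdf"


def make_entry(row_number: int, status: str, filename=None, error=None) -> dict:
    """One manifest record; the status labels are assumed, not confirmed."""
    return {"excel_row": row_number, "status": status,
            "filename": filename, "error": error}


def rows_to_process(all_rows, manifest_path: Path):
    """Resume-style filter: drop rows a previous manifest marks as finished."""
    done = set()
    if manifest_path.exists():
        for entry in json.loads(manifest_path.read_text(encoding="utf-8")):
            # Failed rows are retried; successful and skipped rows are not.
            if entry["status"] in ("ok", "skipped"):
                done.add(entry["excel_row"])
    return [row for row in all_rows if row not in done]
```

Because the row number comes first and is zero-padded, a plain alphabetical directory listing matches the worksheet order, and two exhibits with the same identifier still get distinct filenames.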

Common flags

  • --url-column – Set the spreadsheet column that contains the PDF URL. When omitted, the script attempts to infer a sensible column automatically.
  • --id-column – Configure the column that uniquely identifies each exhibit. This identifier is used to name the downloaded files.
  • --out-dir – Choose a different destination directory for all generated artefacts.
  • --manifest / --csv – Override the default output paths of the JSON manifest and the CSV summary, respectively.
  • --verbose – Enable verbose logging for troubleshooting.
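
The README does not say how the URL column is inferred when --url-column is omitted. One plausible heuristic, shown here purely as a sketch, is to check header names first and then fall back to sampling cell values. The function infer_url_column, its hint words and its sample size are assumptions, not the script's documented behaviour.

```python
def infer_url_column(headers, rows, sample_size=20):
    """Return the header most likely to hold PDF URLs, or None.

    `rows` is a list of dicts keyed by header (hypothetical layout).
    """
    # First pass: header names that look like URL columns.
    for header in headers:
        if header and any(hint in header.lower() for hint in ("url", "link", "href")):
            return header
    # Second pass: the column whose sampled values most often look like URLs.
    best, best_hits = None, 0
    for header in headers:
        hits = sum(
            1
            for row in rows[:sample_size]
            if str(row.get(header, "")).strip().lower().startswith(("http://", "https://"))
        )
        if hits > best_hits:
            best, best_hits = header, hits
    return best
```

A heuristic like this can guess wrong on unusual spreadsheets, which is a good reason to pass --url-column explicitly when the result looks off.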

Run python run_idaho4_parser.py --help to see the full list of supported flags.
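
For readers who want to see how the flags mentioned in this README fit together, here is a hypothetical argparse definition covering them. Only the flag names and the idaho4_output directory come from the text above. The types, the other defaults and the help strings are assumptions, and the real script may define more flags or different defaults.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags named in this README."""
    parser = argparse.ArgumentParser(prog="run_idaho4_parser.py")
    parser.add_argument("--in-file", required=True, help="Path to the .xlsx spreadsheet")
    parser.add_argument("--sheet", help="Worksheet name")
    # Default worker count is an assumption.
    parser.add_argument("--workers", type=int, default=4, help="Parallel downloads")
    # 0 is used here to mean "do not extract pages" (assumed convention).
    parser.add_argument("--extract-pages", type=int, default=0, help="First N pages to extract")
    parser.add_argument("--url-column", help="Column holding the PDF URL; inferred when omitted")
    parser.add_argument("--id-column", help="Column that uniquely identifies each exhibit")
    parser.add_argument("--out-dir", default="idaho4_output", help="Destination directory")
    parser.add_argument("--manifest", help="Override the JSON manifest path")
    parser.add_argument("--csv", help="Override the CSV summary path")
    parser.add_argument("--resume", action="store_true", help="Continue a previous session")
    parser.add_argument("--verbose", action="store_true", help="Enable verbose logging")
    return parser
```

The output of --help on the actual script remains the authoritative reference.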
