Add NearestPdf Script #631

Merged
khemarato merged 4 commits into main from nearpdfscript
Feb 19, 2026

@khemarato
Collaborator

Running the script pulls the latest PDFs from Drive and builds an embeddings index.

Optionally, the script then lets you review close pairs of PDF files so you can manually dedupe them or mark them as distinct. To support the latter, this PR adds the ability to track a new distinctFrom property on Drive files. Files connected by a chain of distinctFrom links are all considered distinct from each other.
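
For readers who want to picture the chain rule, here is a minimal sketch of one way to resolve it, using a union-find over the links. It is not the code in this PR: it assumes each file stores a single file ID in distinctFrom, and the names distinct_groups and known_distinct are hypothetical.

```python
from collections import defaultdict


def distinct_groups(distinct_from):
    """Group file IDs connected by distinctFrom links.

    `distinct_from` maps a file ID to the ID stored in its distinctFrom
    property (assumed here to be a single ID per file).
    """
    parent = {}

    def find(x):
        # Register unseen IDs as their own root, then walk up with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for file_id, other_id in distinct_from.items():
        root_a, root_b = find(file_id), find(other_id)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for node in parent:
        groups[find(node)].add(node)
    return list(groups.values())


def known_distinct(a, b, distinct_from):
    """True if a and b are different files in the same distinctFrom chain."""
    return a != b and any(
        a in group and b in group for group in distinct_groups(distinct_from)
    )


links = {"a": "b", "b": "c", "x": "y"}
print(known_distinct("a", "c", links))  # a and c are linked through b
print(known_distinct("a", "x", links))  # no chain connects a and x
```

With this approach a reviewer only has to mark each new file as distinct from one member of an existing chain, rather than from every member.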

@netlify

netlify Bot commented Feb 19, 2026

Deploy Preview for obu ready!

Name Link
🔨 Latest commit daaf63b
🔍 Latest deploy log https://app.netlify.com/projects/obu/deploys/69968849f5c8600008921473
😎 Deploy Preview https://deploy-preview-631--obu.netlify.app

@khemarato khemarato marked this pull request as ready for review February 19, 2026 03:49
@khemarato khemarato merged commit e3af058 into main Feb 19, 2026
3 of 4 checks passed
@khemarato khemarato deleted the nearpdfscript branch February 19, 2026 03:49
@khemarato
Collaborator Author

Running the script yielded the following histogram in the critical range:

[Figure: choosing a cosine sim threshold]

The histogram suggests a good threshold value of about 0.96, which corresponds quite well with the theoretical value hypothesized here.
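
As a rough illustration of how a threshold like this might be applied, the sketch below flags every pair of files whose embedding cosine similarity is at or above 0.96. It is an assumption-laden example rather than the script's actual code: the embeddings are made-up toy vectors, and close_pairs and SIM_THRESHOLD are hypothetical names.

```python
import math
from itertools import combinations

SIM_THRESHOLD = 0.96  # approximate value read off the histogram in this thread


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def close_pairs(embeddings, threshold=SIM_THRESHOLD):
    """Return (name_a, name_b, similarity) for pairs at or above the threshold.

    `embeddings` maps a file name to its embedding vector. The brute-force
    O(n^2) scan is only meant for illustration on small inputs.
    """
    pairs = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(embeddings.items(), 2):
        sim = cosine_similarity(vec_a, vec_b)
        if sim >= threshold:
            pairs.append((name_a, name_b, sim))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Toy vectors, not real PDF embeddings.
toy = {
    "a.pdf": [1.0, 0.0],
    "b.pdf": [0.99, 0.05],
    "c.pdf": [0.0, 1.0],
}
for name_a, name_b, sim in close_pairs(toy):
    print(f"{name_a} ~ {name_b}: {sim:.4f}")
```

Pairs returned this way are only candidates for manual review; the threshold controls how many near-duplicates a reviewer has to look at.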
