Write a nearestpdf finder and local cache for CORE API by khemarato · Pull Request #629 · buddhist-uni/buddhist-uni.github.io

khemarato · 2026-01-31T11:32:45Z

In #626 we added a title -> file name matching algorithm. This PR adds a content matching algorithm and the code that uses it to dedupe files. It then adds the code that combines the filename signal with the content signal into an overall matching algorithm capable of saying whether we already have a given CORE API work.

This PR also adds the local SQLite manager which will oversee pulling data from the CORE API.
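As a rough illustration of what a local SQLite cache for API records can look like, here is a minimal sketch. The table name `works`, its columns, and the helper names `open_cache`, `save_work`, and `load_work` are all hypothetical and are not taken from this PR's `local_core` code.

```python
import sqlite3

def open_cache(path=":memory:"):
    """Open (or create) the cache database. Schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS works ("
        " core_id INTEGER PRIMARY KEY,"
        " title TEXT NOT NULL,"
        " doi TEXT,"
        " language TEXT)"
    )
    return conn

def save_work(conn, core_id, title, doi=None, language=None):
    """Insert a work, replacing any cached row with the same id."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO works (core_id, title, doi, language)"
            " VALUES (?, ?, ?, ?)",
            (core_id, title, doi, language),
        )

def load_work(conn, core_id):
    """Return the cached row as a tuple, or None if it is not cached."""
    cur = conn.execute(
        "SELECT core_id, title, doi, language FROM works WHERE core_id = ?",
        (core_id,),
    )
    return cur.fetchone()
```

A caller would check `load_work` first and only hit the remote API on a `None` result, saving the response with `save_work` afterwards.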

netlify · 2026-01-31T11:32:50Z

✅ Deploy Preview for obu ready!

| Name | Link |
|---|---|
| 🔨 Latest commit | `1ba79ff` |
| 🔍 Latest deploy log | https://app.netlify.com/projects/obu/deploys/698daa883b50cd00083f4685 |
| 😎 Deploy Preview | https://deploy-preview-629--obu.netlify.app |

khemarato · 2026-01-31T12:38:28Z

After loading most of the PDFs and finding their nearest neighbors, here is a plot of the cosine similarities zooming in on the difficult range:

Here is the entire histogram, trying to fit a bimodal binomial distribution:

This suggests a threshold value of 0.965.
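A minimal sketch of how such a threshold could be applied to nearest-neighbor lookups. It assumes each PDF has already been turned into a numeric vector (how the vectors are produced is not specified here), and the function names are hypothetical.

```python
import math

# Threshold suggested by the histogram fit described above.
SIMILARITY_THRESHOLD = 0.965

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def nearest_neighbor(query, corpus):
    """Return (key, similarity) of the closest vector, or (None, 0.0)."""
    best_key, best_sim = None, 0.0
    for key, vec in corpus.items():
        sim = cosine_similarity(query, vec)
        if best_key is None or sim > best_sim:
            best_key, best_sim = key, sim
    return best_key, best_sim

def is_probable_duplicate(query, corpus, threshold=SIMILARITY_THRESHOLD):
    """True when the nearest neighbor clears the similarity threshold."""
    key, sim = nearest_neighbor(query, corpus)
    return key is not None and sim >= threshold
```

The linear scan is only for illustration; a real corpus would likely use a vector index instead.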

[skip ci]

khemarato · 2026-02-11T15:08:45Z

For the final combination of the title match scores and the document cosine similarities:

Fitting a Logistic Regression to the above scatterplot yields the equation:

$$z = -6.04 + (5.86 \times \text{ContentSim}) + (3.90 \times \text{TitleSim})$$

$$P(\text{match}) = \frac{1}{1 + e^{-z}}$$
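The fitted equation translates directly into a few lines of Python. This is only a sketch of the formula above; the function name `match_probability` is hypothetical.

```python
import math

def match_probability(content_sim, title_sim):
    """P(match) from the fitted logistic regression coefficients."""
    z = -6.04 + 5.86 * content_sim + 3.90 * title_sim
    return 1.0 / (1.0 + math.exp(-z))
```

For example, with both similarities at 1.0 the score is z = 3.72, which gives a probability of about 0.976.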

Redoing the regression with a hundred manual examples from the PDF similarities added in gave a more conservative version, with a cutoff equation of 6.33 ContentSim + 3.05 TitleSim > 6.73. It trusts the content similarity a little less and the title similarity significantly less.

After playing around on Desmos, I found a good line between the two regressions at 18 x + 10 y - 19 > 0, with x = ContentSim and y = TitleSim. Because neither score can exceed 1, this equation has the nice property that it requires a ContentSim > 0.5 and a Title P > 0.1, which makes pruning straightforward.
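A short sketch of the hand-tuned cutoff with the two pruning checks up front. It assumes x is ContentSim, y is the title score, and both lie in [0, 1]; the function names are hypothetical. The helper `required_content_sim` solves a linear cutoff for ContentSim so the three boundaries can be compared.

```python
def is_match(content_sim, title_sim):
    """Apply the cutoff 18x + 10y - 19 > 0 with cheap pruning first."""
    # With title_sim <= 1, the line can only be cleared if content_sim > 0.5.
    # With content_sim <= 1, it can only be cleared if title_sim > 0.1.
    if content_sim <= 0.5 or title_sim <= 0.1:
        return False
    return 18 * content_sim + 10 * title_sim - 19 > 0

def required_content_sim(title_sim, w_content, w_title, cutoff):
    """ContentSim at which w_content*x + w_title*y equals the cutoff."""
    return (cutoff - w_title * title_sim) / w_content

# Compare the three boundaries at a title score of 0.5.
first_fit = required_content_sim(0.5, 5.86, 3.90, 6.04)
hand_line = required_content_sim(0.5, 18, 10, 19)
second_fit = required_content_sim(0.5, 6.33, 3.05, 6.73)
```

At a title score of 0.5 the hand-tuned line needs a ContentSim of about 0.78, which sits between the roughly 0.70 of the first fit and the roughly 0.82 of the more conservative one.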

Playing around with normalizing the logistic curve, I found that dividing the z score by 3 gives reasonable P values in the range [0.5, 0.953) over the output domain, that is, for pairs at or above the cutoff.
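The stated range can be checked numerically. The sketch below assumes the z score in question is the hand-tuned 18 x + 10 y - 19 with both inputs in [0, 1]; that reading is consistent with the quoted upper bound, since the largest possible score is 9 and the logistic of 9 / 3 is about 0.9526. The function name is hypothetical.

```python
import math

def normalized_match_probability(content_sim, title_sim):
    """Logistic of the hand-tuned score after dividing it by 3."""
    z = 18 * content_sim + 10 * title_sim - 19
    return 1.0 / (1.0 + math.exp(-z / 3.0))

# On the cutoff line (z = 0) the probability is exactly 0.5.
p_boundary = normalized_match_probability(0.5, 1.0)
# At the best possible scores (z = 9) it tops out just under 0.953.
p_max = normalized_match_probability(1.0, 1.0)
```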

[skip ci]

khemarato self-assigned this Jan 31, 2026

khemarato added 8 commits February 10, 2026 19:00

stash

3738baa

[skip ci]

Add document similarity search function

85c6b11

[skip ci]

actually importable

5e69fe8

stash

7c8388c

More nearest checking code

31b4f04

[skip ci]

Add a basic sqlite db for CORE API

657e514

[skip ci]

Mv call_api to local_core

9988d63

[skip ci]

Some local_core schema fiddling

463bf95

[skip ci]

khemarato force-pushed the coreapidownloader branch from fc1ebcc to 463bf95 February 10, 2026 12:00

khemarato added 4 commits February 10, 2026 21:04

Calling the API works

a860f36

add language detection to local_core

8e4a569

[skip ci]

Add basic DOI getter and more nearestpdf funcs

3fe777e

[skip ci]

Fancy bulk DOI getter

4b27060

[skip ci]

Finalize the nearestpdf finder

1ba79ff

[skip ci]

khemarato changed the title ~~Write a downloader for CORE API Content~~ Write a nearestpdf finder and local cache for CORE API Feb 12, 2026

khemarato marked this pull request as ready for review February 12, 2026 10:29

khemarato merged commit 5d4f5d9 into main Feb 12, 2026
3 of 4 checks passed

khemarato deleted the coreapidownloader branch February 12, 2026 10:29

khemarato mentioned this pull request Feb 21, 2026

Add NearestPdf Script #631

Merged

khemarato merged 13 commits into main from coreapidownloader
