
Write a nearestpdf finder and local cache for CORE API #629

Merged
khemarato merged 13 commits into main from coreapidownloader
Feb 12, 2026

@khemarato
Collaborator

@khemarato khemarato commented Jan 31, 2026

In #626 we added a title-to-filename matching algorithm. This PR adds a content matching algorithm and the code that uses it to dedupe files. It then combines the filename signal with the content signal into an overall matching algorithm that can say whether we already have a given CORE API work.

This PR also adds the local SQLite manager which will oversee pulling data from the CORE API.

@khemarato khemarato self-assigned this Jan 31, 2026
@netlify

netlify Bot commented Jan 31, 2026

Deploy Preview for obu ready!

Name Link
🔨 Latest commit 1ba79ff
🔍 Latest deploy log https://app.netlify.com/projects/obu/deploys/698daa883b50cd00083f4685
😎 Deploy Preview https://deploy-preview-629--obu.netlify.app

@khemarato
Collaborator Author

khemarato commented Jan 31, 2026

After loading most of the PDFs and finding their nearest neighbors, here is a plot of the cosine similarities zooming in on the difficult range:

[Figure: Nearest Neighbor Histogram]

Here is the entire histogram, trying to fit a bimodal binomial distribution:

[Figure: Modeling PDF]

This suggests a cosine similarity threshold of 0.965 for treating a PDF's nearest neighbor as a duplicate.
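As a rough illustration of how such a threshold might be applied, here is a minimal sketch that finds the nearest neighbor of a document vector by cosine similarity and flags it as a probable duplicate. The vector representation, the function names, and the brute-force search are all assumptions made for this example; the PR's actual embedding and lookup code is not shown in this thread.

```python
import math

# Threshold suggested by the histogram fit in the comment above.
DUPLICATE_THRESHOLD = 0.965


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def nearest_neighbor(query, corpus):
    """Return (index, similarity) of the corpus vector most similar to query.

    Hypothetical brute-force helper; returns (-1, -1.0) for an empty corpus.
    """
    best_index, best_sim = -1, -1.0
    for i, vec in enumerate(corpus):
        sim = cosine_similarity(query, vec)
        if sim > best_sim:
            best_index, best_sim = i, sim
    return best_index, best_sim


def is_probable_duplicate(query, corpus, threshold=DUPLICATE_THRESHOLD):
    """True when the nearest neighbor's similarity reaches the threshold."""
    _, sim = nearest_neighbor(query, corpus)
    return sim >= threshold
```

A brute-force scan like this is only meant to show the decision rule; a real corpus would likely want an index of some kind.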

@khemarato
Collaborator Author

khemarato commented Feb 11, 2026

For the final combination of the title match scores and the document cosine similarities:

[Figure: scatterplot]

Fitting a Logistic Regression to the above scatterplot yields the equation:

$$z = -6.04 + (5.86 \times \text{ContentSim}) + (3.90 \times \text{TitleSim})$$

$$P(\text{match}) = \frac{1}{1 + e^{-z}}$$
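For reference, the fitted model above can be evaluated directly. This is a minimal sketch using the coefficients quoted in the equation; the function and constant names are made up for the example and are not taken from the PR's code.

```python
import math

# Coefficients copied from the fitted equation above.
INTERCEPT = -6.04
CONTENT_WEIGHT = 5.86
TITLE_WEIGHT = 3.90


def match_probability(content_sim, title_sim):
    """P(match) under the fitted logistic regression (hypothetical helper)."""
    z = INTERCEPT + CONTENT_WEIGHT * content_sim + TITLE_WEIGHT * title_sim
    return 1.0 / (1.0 + math.exp(-z))
```

With both similarities at 1, z is 3.72 and the probability is roughly 0.976; with both at 0, z is -6.04 and the probability is well under 1%.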

Adding a hundred manual examples from the PDF similarities and rerunning the regression gave a more conservative version, with the cutoff equation 6.33 ContentSim + 3.05 TitleSim > 6.73. It trusts the content sim a little less and the title sim significantly less.

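After playing around on Desmos, I found a good line between the two regressions at 18 x + 10 y - 19 > 0, where x is ContentSim and y is the title score. Because neither score can exceed 1, this equation has the nice property that it requires ContentSim > 0.5 (since 18 x > 19 - 10) and Title P > 0.1 (since 10 y > 19 - 18), which makes pruning straightforward.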
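The pruning property can be sketched as a cheap prefilter that rejects a candidate from a single score before the other one is computed. This assumes x is ContentSim and y is the title score, with both at most 1; the function names are hypothetical and not from the PR.

```python
def passes_cutoff(content_sim, title_sim):
    """The hand-tuned decision line: 18 x + 10 y - 19 > 0."""
    return 18 * content_sim + 10 * title_sim - 19 > 0


def can_possibly_match(content_sim=None, title_sim=None):
    """Prefilter on whichever score is already known (hypothetical helper).

    With both scores capped at 1, the cutoff can only pass when
    content_sim > 0.5 and title_sim > 0.1, so failing either bound
    rules the candidate out without computing the other score.
    """
    if content_sim is not None and content_sim <= 0.5:
        return False
    if title_sim is not None and title_sim <= 0.1:
        return False
    return True
```

For example, a candidate whose title score is 0.05 can be dropped before any content vectors are compared, since no content similarity up to 1 would lift it over the line.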

Playing around with normalizing the logistic curve, I found that dividing z by 3 gives reasonable P values in the range [0.5, 0.953) over the output domain.
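One way to read the normalization, as a sketch: if the z being scaled is the hand-tuned line 18 x + 10 y - 19 (an inference, not stated outright in the comment), then z tops out at 9 when both scores are 1, and the logistic of 9 / 3 is about 0.953, which lines up with the quoted range. The function name and the `scale` parameter are assumptions for this example.

```python
import math


def normalized_match_probability(content_sim, title_sim, scale=3.0):
    """Logistic of the hand-tuned z divided by `scale` (hypothetical helper).

    Assumes z = 18 * content_sim + 10 * title_sim - 19 with both scores <= 1.
    """
    z = 18 * content_sim + 10 * title_sim - 19
    return 1.0 / (1.0 + math.exp(-z / scale))
```

A point exactly on the decision line has z = 0 and maps to 0.5, so accepted matches land above 0.5 and never reach the saturated values near 1 that an unscaled z of 9 would produce.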

@khemarato khemarato changed the title Write a downloader for CORE API Content Feb 12, 2026
@khemarato khemarato marked this pull request as ready for review February 12, 2026 10:29
@khemarato khemarato merged commit 5d4f5d9 into main Feb 12, 2026
3 of 4 checks passed
@khemarato khemarato deleted the coreapidownloader branch February 12, 2026 10:29
@khemarato khemarato mentioned this pull request Feb 21, 2026