#pretraining

Here is 1 public repository matching this topic...

Data Preparation for Large Language Models — a curated companion to our JCST 2026 survey. Covers Pre-training, Continual Pre-training, and Post-training (SFT/RLHF/RLAIF) across collection, filtering, dedup, generation, evaluation.

  • Updated Apr 28, 2026
  • Shell
