This repository is primarily for learning and experimenting with state space models, especially readable Mamba-2-style and Mamba-3-style architectures.
The current end-to-end example tasks are:
- raw language-model pretraining on TinyStories or local text
- fine-tuning on a toy subject-predicate-object extraction task
The architecture is the main subject. The SPO pipeline is only one downstream example used to exercise the model.
The default mamba3 settings now target an approximately 1M-parameter model, keeping the repo small while remaining usable on modest GPUs.
Beyond those tasks, the repo is intended for:
- studying how selective state space updates work in practice
- comparing a simpler Mamba-2-style block against a richer Mamba-3-style block
- running small CPU-friendly architecture experiments
- using pretraining plus fine-tuning as a test harness for new block ideas
This is an educational codebase, not a production implementation.
The repo includes two block variants:

- `mamba2`: a small Mamba-2-style block with depthwise convolution and selective state updates.
- `mamba3`: an educational Mamba-3-style block with exponential-trapezoidal updates, rotary or complex state mixing, BC normalization, B/C biases, and optional MIMO rank.

On top of the blocks it provides:

- byte and character tokenizers
- raw language-model pretraining
- a toy downstream fine-tuning task
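To make the block descriptions above concrete, here is a scalar toy sketch of the two update rules. This is illustrative only, not the repo's implementation: the real blocks are multi-dimensional and derive `dt`, `B`, `C` from the input, and the exact Mamba-3 parameterization may differ from the textbook trapezoid rule shown here.

```python
import math

def zoh_step(h, x_t, dt, a, b):
    """Mamba-2-style zero-order-hold update: h' = exp(dt*a)*h + dt*b*x_t."""
    return math.exp(dt * a) * h + dt * b * x_t

def trapezoidal_step(h, x_prev, x_t, dt, a, b):
    """One standard exponential-trapezoidal form (Mamba-3-flavored):
    the input term averages the decayed previous input and the current one."""
    decay = math.exp(dt * a)
    return decay * h + 0.5 * dt * b * (decay * x_prev + x_t)

def selective_scan(xs, dts, bs, cs, a=-1.0):
    """'Selective' means dt, B, C vary per step (computed from the input
    in a real block); here they are simply passed in as lists."""
    h, ys = 0.0, []
    for x_t, dt, b, c in zip(xs, dts, bs, cs):
        h = zoh_step(h, x_t, dt, a, b)
        ys.append(c * h)
    return ys
```

Because `dt`, `b`, and `c` change at every step, the state can decide per token how much to absorb and how much to forget, which is the behavior the experiments below probe.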
The detailed docs live in docs/:
- Chapter 1: SSM Foundations
- Chapter 2: Mamba
- Chapter 3: Mamba-2
- Chapter 4: Mamba-3
- Chapter 5: Repository Overview
- Chapter 6: Project Workflows
If you want to understand the architecture, start with the first four. If you want to run experiments, go straight to the last two.
Key files:

- `ssmlab/mamba/model.py`: core architecture: config, Mamba-2 block, Mamba-3 block, and the tiny causal language model.
- `ssmlab/common/tokenizer.py`: character and byte tokenizers.
- `ssmlab/common/pretrain_data.py`: raw-text dataset utilities for language-model pretraining.
- `ssmlab/mamba/pretrain.py`: pretraining entry point for TinyStories or local text.
- `ssmlab/common/data.py`: synthetic SPO task generation and output parsing.
- `ssmlab/mamba/train.py`: fine-tuning loop for the SPO demo task.
- `ssmlab/mamba/infer.py`: inference CLI for the SPO demo task.
Install the dependencies with uv:

```shell
uv sync
```

If you want uv to manage Python too:

```shell
uv python install 3.13
uv sync --python 3.13
```

The top-level workflow is now driven by `ssmlab`, a Python CLI backed by `ssmlab.yaml`.
The CLI shape is:
```shell
uv run ssmlab pretrain --model <name> --target <target>
uv run ssmlab train --model <name> --target <target>
uv run ssmlab infer --model <name> --target <target>
```

`--model` selects a named model version from YAML, for example `mamba3-1b-a1b2`. `--target` selects where to run it, for example `local` or a named SSH machine such as `gpu-box`.
The config file is split into:
- `models`: named model versions and their task configs
- `data_sources`: reusable dataset definitions for training tasks
- `targets`: local or SSH machines
- `tasks.pretrain` / `tasks.train` / `tasks.infer`: top-level CLI tasks
Right now, the named data-source layer is implemented only for `tinystories`. The schema is there for additional sources later, but the sample config keeps it on TinyStories only.
The default ssmlab.yaml in the repo already shows two model versions:
- `mamba3-1b-a1b2`
- `mamba3-1b-a1b2-debug`
Each model keeps its shared architecture params once under `shared_args`, then adds task-specific args under `tasks.pretrain.args` and `tasks.train.args`. The TinyStories dataset settings now live under top-level `data_sources`, and `tasks.pretrain` references one with `data_source: ...`.
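For orientation, a hypothetical skeleton of that layout is sketched below. The top-level keys (`models`, `shared_args`, `tasks`, `args`, `data_sources`, `data_source`, `targets`) come from the description above, but the field names inside `shared_args`/`args` and the `data_sources` entry are made up for illustration and are not the repo's real schema:

```yaml
models:
  mamba3-1b-a1b2:
    shared_args:
      # architecture params shared by all tasks (names illustrative)
      d_model: 160
      n_layers: 5
    tasks:
      pretrain:
        data_source: tinystories
        args:
          output-dir: runs/tinystories_mamba3_1m   # illustrative
      train:
        args:
          output-dir: runs/spo_mamba3_1m_ft        # illustrative

data_sources:
  tinystories: {}   # dataset definition fields omitted here

targets:
  local:
    kind: local     # assumed; only kind: ssh is shown in this README
```

Consult the shipped `ssmlab.yaml` for the authoritative field names.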
If you need one-off flag overrides, append them after --:
```shell
uv run ssmlab pretrain --model mamba3-1b-a1b2 --target local -- --max-train-stories 4000 --log-every 5
```

For the default ~1M-parameter mamba3 setup, the cheapest sensible Vast target is usually a single RTX 3060 12GB or Tesla T4 16GB. A 6GB card is the practical floor, but 8GB+ is a safer starting point because it leaves more headroom for batch size and longer sequences.
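As a back-of-envelope sanity check on the ~1M figure, using the `--d-model 160` / `--n-layers 5` values from the low-level CLI example below: the `8 * d_model**2` per-block estimate is a rough ballpark for Mamba-style blocks with expansion factor 2, not the repo's exact count.

```python
def rough_param_count(d_model, n_layers, vocab_size=256):
    # ~8 * d_model^2 per block is a ballpark for Mamba-style blocks with
    # expansion 2; the true count also depends on conv width, state_dim,
    # head_dim, norms, etc.
    block = 8 * d_model * d_model
    embed = vocab_size * d_model  # byte tokenizer -> 256-entry embedding
    return n_layers * block + embed

print(rough_param_count(160, 5))  # ~1.06M, consistent with "approximately 1M"
```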
Configure the machine once in ssmlab.yaml under targets. For example:
```yaml
targets:
  gpu-box:
    kind: ssh
    host: root@example.com
    port: 22
    remote_dir: /workspace/ssm
    bootstrap_python: python3
    python_bin: python
    venv_dir: /workspace/ssm/.venv
    device: cuda
    detach: true
    log_file: runs/remote-gpu.log
    pid_file: runs/remote-gpu.pid
```

Then launch the configured model version on that machine:

```shell
uv run ssmlab pretrain --model mamba3-1b-a1b2 --target gpu-box
```

The SSH target path:
- syncs the repo with `rsync`
- creates or refreshes a remote virtualenv
- reuses the machine's system PyTorch when available
- installs a CUDA-enabled `torch` wheel if needed
- installs this package plus runtime dependencies
- validates `torch.cuda.is_available()`
- runs the configured task in the remote repo
With `detach: true`, the remote target runs under nohup-style detached execution and writes logs and PID files to the configured paths.
```shell
uv run ssmlab pretrain --model mamba3-1b-a1b2 --target local
```

This trains the architecture as a plain language model and is the cleanest way to study how the SSM behaves on raw text.
To use a local corpus instead of TinyStories, either update the YAML model definition or append a one-off override:
```shell
uv run ssmlab pretrain --model mamba3-1b-a1b2 --target local -- --text-file ./some_corpus.txt --output-dir runs/local_lm
```

The second CLI task is `train`. It resolves `tasks.train` for the same named model. In the sample YAML, `train` is wired to fine-tuning:
```shell
uv run ssmlab train --model mamba3-1b-a1b2 --target local
```

If you want a different action for `train`, point that task at another module or command in `ssmlab.yaml`.
If you want to bypass the YAML layer entirely, the low-level entry points are still available:
```shell
uv run ssmlab-mamba-train \
  --architecture mamba3 \
  --tokenizer byte \
  --mimo-rank 2 \
  --d-model 160 \
  --n-layers 5 \
  --head-dim 16 \
  --state-dim 16 \
  --init-checkpoint runs/tinystories_mamba3_1m/best.pt \
  --output-dir runs/spo_mamba3_1m_ft
```

The top-level YAML CLI now supports inference too:
```shell
uv run ssmlab infer --model mamba3-1b-a1b2 --target local -- --text "Alice reads a book and Bob drives a car."
```

This resolves `tasks.infer` from `ssmlab.yaml`, which points at the model's fine-tuned checkpoint by default.
The low-level CLI still works if you want to pass the checkpoint explicitly:
```shell
uv run ssmlab-mamba-infer \
  --checkpoint runs/spo_mamba3_1m_ft/best.pt \
  --text "Alice reads a book and Bob drives a car."
```

Example output:

```
raw: [(alice, book, reads), (bob, car, drives)]
triples: [('alice', 'book', 'reads'), ('bob', 'car', 'drives')]
```
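The `raw:` line can be turned into tuples with a small regex. This is only an illustrative sketch, not the actual parsing logic in `ssmlab/common/data.py`:

```python
import re

def parse_triples(raw: str):
    """Extract comma-separated tuples from output like
    '[(alice, book, reads), (bob, car, drives)]'."""
    return [
        tuple(part.strip() for part in m.group(1).split(","))
        for m in re.finditer(r"\(([^)]*)\)", raw)
    ]

print(parse_triples("[(alice, book, reads), (bob, car, drives)]"))
# [('alice', 'book', 'reads'), ('bob', 'car', 'drives')]
```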
Notes and limitations:

- `mamba2` requires `--mimo-rank 1`.
- TinyStories pretraining is configured around the byte tokenizer.
- Fine-tuning can use either tokenizer, but `--init-checkpoint` requires the tokenizer and model shapes to match exactly.
- The implementation is intentionally readable and CPU-friendly, not optimized.
- The current downstream task is synthetic and narrow by design.
Good first experiments:
- compare `mamba2` vs `mamba3`
- vary `state_dim`
- vary `head_dim`
- vary `mimo_rank` for `mamba3`
- pretrain first, then fine-tune
Those experiments are more aligned with the purpose of the repo than the specific SPO benchmark itself.