rutextnorm — Russian text normalization for TTS

Turn written Russian into something a TTS model can say: numbers, dates, money, units, fractions, times, abbreviations, symbols, and mixed Latin/Cyrillic — all spelled out, in agreement, in words.

"В 2024 году инфляция составила 7,5%, а доходы выросли на 3 млрд руб."
        ↓ normalize_russian()
"В две тысячи двадцать четвёртом году инфляция составила семь целых
 и пять десятых процента, а доходы выросли на три миллиарда рублей"

One file, zero dependencies, no network, no ML. Pure re + lookup tables. Deterministic: same input → same output. ~0.15 ms/sentence, ~370 k chars/s (flag_uncertain alone: ~0.02 ms).
Knows when it might be wrong. flag_uncertain() returns the spans the rules can't resolve from the text, so you can route just those to a slower, stronger method (a neural normalizer or LLM) and trust the fast path everywhere else.
Built for TTS, not for a benchmark. Where speakability and a corpus's written form disagree, it favours what the synthesizer should pronounce (see Design choices & gotchas).

Install

pip install rutextnorm

The PyPI name and the import name are the same:

from rutextnorm import normalize_russian

Or vendor the single file — copy rutextnorm.py straight into your project (e.g. into a TTS repo's text/ folder). Nothing else is required. When vendored, the import follows wherever you put it:

from text.rutextnorm import normalize_russian

Requires Python ≥ 3.8.

Quick start

from rutextnorm import normalize_russian

text = """У меня есть $1234 и 5678 рублей. Кроме того, я должен 90.50€ и взял в долг 4321 GBP.
В моём кошельке было 876 UAH и 543.21 RUB, а также я нашёл 20 центов."""

print(normalize_russian(text))

У меня есть тысяча двести тридцать четыре доллара и пять тысяч шестьсот семьдесят восемь рублей. Кроме того, я должен девяносто евро пятьдесят евроцентов и взял в долг четыре тысячи триста двадцать один фунт.
В моём кошельке было восемьсот семьдесят шесть гривен и пятьсот сорок три рубля двадцать одна копейка, а также я нашёл двадцать центов.

Command-line filter

echo "цена 1 500 руб." | python3 -m rutextnorm        # installed
echo "цена 1 500 руб." | python3 rutextnorm.py        # vendored
# -> цена тысяча пятьсот рублей

Use cases

TTS front-end. Run text through normalize_russian before your G2P / acoustic model so the synthesizer never has to guess how to read 7,5% or $3 млрд.
Hybrid pipeline. Use flag_uncertain as a router: the rules handle the ~90% of text they're confident about instantly; only flagged spans go to an expensive neural normalizer (RUNorm) or an LLM. You pay for the slow path only where it actually helps.
Corpus preprocessing. Normalize a training/eval corpus deterministically and reproducibly, with no model weights or API calls in the loop.
Drop-in CLI filter in shell pipelines.
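
For the corpus-preprocessing case, a minimal sketch of a line-by-line batch pass. `normalize_corpus` is a hypothetical helper, not part of rutextnorm; it takes the normalizer as a parameter, so in practice you would pass `normalize_russian`.

```python
from typing import Callable, Iterable, Iterator


def normalize_corpus(lines: Iterable[str], normalize: Callable[[str], str]) -> Iterator[str]:
    """Yield one normalized line per non-empty input line (hypothetical helper)."""
    for line in lines:
        sentence = line.strip()
        if sentence:  # skip blank lines rather than emitting empty rows
            yield normalize(sentence)


# Assumed usage with the package installed:
#   from rutextnorm import normalize_russian
#   with open("corpus.txt", encoding="utf-8") as src:
#       for spoken in normalize_corpus(src, normalize_russian):
#           print(spoken)
```

Because the normalizer is deterministic, rerunning this over the same file should give byte-identical output, which makes the result easy to diff between versions.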

What it normalizes

Category	Input → output
Cardinals (any size)	`1 234 567` → «один миллион двести тридцать четыре тысячи пятьсот шестьдесят семь»
Ordinals (suffix / Roman)	`1-й` → «первый», `XIX` → «девятнадцатый»
Dates	`05.08.2008` → «пятое августа две тысячи восьмого года», `2008 г.` → «две тысячи восьмой год»
Times	`06:06` → «шесть часов шесть минут», `1:15` → «час пятнадцать минут», `2PM` → «два часа дня»
Money	`543.21 RUB` → «пятьсот сорок три рубля двадцать одна копейка», `$1 млрд` → «один миллиард долларов»
Units (count agreement)	`5 кг` → «пять килограммов», `90 км/ч` → «девяносто километров в час», `7 км.` → «семь километров»
Multipliers	`5 млн` → «пять миллионов», `24,9 млрд руб.` → «двадцать четыре целых и девять десятых миллиарда рублей»
Decimals & percent	`1,2` → «одна целая и две десятых», `50%` → «пятьдесят процентов», `938,00` → «девятьсот тридцать восемь»
Fractions	`2/3` → «две третьих», `1/2` → «одна вторая», `½` → «одна вторая»
Context-governed case	`около 500 км` → «около пятисот километров», `с 500 рублями` → «с пятьюстами рублями»
Trigger nouns	`2 место` → «второе место», `5 этаж` → «пятый этаж»
Compound adjectives	`25-этажный` → «двадцатипятиэтажный»
Abbreviations	`и т.д.` → «и так далее», `ст. 158` → «статья сто пятьдесят восемь»
Acronyms	`СССР` → «эс эс эс эр» (vowel-less spelled out), `НАТО` kept as a word
Latin / mixed	`Google` → «гугл», `GPS` → «джи пи эс», `example.com` → «ексампле точка ком»
Symbols	`&` → «и», `²` → «в квадрате», `°C`, `№`, Greek letters
ё restoration	`еще` → «ещё» (unambiguous words only)

Vocabularies are embedded in the single file. The abbreviation and unit inventories are informed by NVIDIA NeMo-text-processing (Apache-2.0); only single-sense entries are kept and the spoken forms were rewritten and checked by hand.

Knowing when to defer: `flag_uncertain`

normalize_russian always returns its best guess. flag_uncertain(text) tells you where that guess rests on information the text doesn't contain — so a caller can escalate those spans (or the whole sentence) to a stronger method and trust the rest.

from rutextnorm import flag_uncertain

text = "Доктор Smith открыл том XIV на с. 42."
for start, end, original, reason in flag_uncertain(text):
    print(f"{original!r:14}{reason}")

'Smith'       foreign word (transliteration is approximate)
'XIV'         Roman numeral (case defaults to nominative)
'с.'          ambiguous abbreviation (секунда / страница / село / с (предлог))

It reads the input only (never a reference), runs in ~0.02 ms/sentence, and detects five structural ambiguities:

Detector	Why it's uncertain
Foreign words	transliteration is approximate; exact pronunciation needs G2P
Multi-sense abbreviations (`г.` `в.` `с.` `кв.` …)	several expansions; only context disambiguates
Roman numerals	case is context-dependent; read in the nominative by default
Four-digit year-or-cardinal	`1998` could be a year or a count
Bare numbers with no cue	grammatical case / cardinal-vs-ordinal undetermined

Each span carries a reason, so a cost-sensitive caller can ignore the reason types it doesn't care about (e.g. trust foreign-word transliteration and drop those flags). A minimal router:

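from rutextnorm import flag_uncertain, normalize_russian

def normalize_or_escalate(text, escalate):
    # escalate: any callable that takes the text and returns its normalized form
    spans = flag_uncertain(text)
    if spans:
        return escalate(text)          # neural model / LLM
    return normalize_russian(text)     # fast path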
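
The router above escalates on any flag. A cost-sensitive variant can first drop the reason types it trusts. The sketch below is hypothetical: `drop_reasons` and `route` are illustrative names, and it assumes reason strings begin with a stable prefix such as "foreign word", as in the sample output earlier. The flagging, normalizing, and escalating functions are passed in as parameters.

```python
from typing import Callable, List, Sequence, Tuple

Span = Tuple[int, int, str, str]  # (start, end, original, reason)


def drop_reasons(spans: Sequence[Span], ignored_prefixes: Tuple[str, ...]) -> List[Span]:
    """Keep only spans whose reason does not start with an ignored prefix."""
    return [s for s in spans if not s[3].startswith(ignored_prefixes)]


def route(
    text: str,
    flag: Callable[[str], Sequence[Span]],
    normalize: Callable[[str], str],
    escalate: Callable[[str], str],
    ignored_prefixes: Tuple[str, ...] = ("foreign word",),  # assumed reason prefix
) -> str:
    """Escalate only if a flag of a reason type the caller cares about remains."""
    remaining = drop_reasons(flag(text), ignored_prefixes)
    return escalate(text) if remaining else normalize(text)
```

With the package installed, this would be called as `route(text, flag_uncertain, normalize_russian, my_escalate)`, where `my_escalate` is whatever slow path you use.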

Metrics

Measured against ru_2026.csv — the Google/Kaggle Russian normalization gold (ru_train.csv, 10.6M tokens) with its dataset artifacts removed (per-letter spelling markers, sil tokens). Comparison is ё-insensitive (the module keeps ё, the gold drops it) and space-folded (the gold space-separates transliterated foreign words, e.g. т и б е р и у с, where the module writes the joined word).
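
A minimal sketch of the comparison just described. It assumes "space-folded" means deleting all whitespace before the exact-match test; the actual harness (eval_reject.py) may fold differently, and `fold_for_compare` / `same_reading` are hypothetical names.

```python
def fold_for_compare(s: str) -> str:
    """Fold ё to е and drop all whitespace before an exact-match comparison."""
    folded = s.replace("ё", "е").replace("Ё", "Е")
    return "".join(folded.split())


def same_reading(predicted: str, gold: str) -> bool:
    """True when the two strings match after ё-folding and space-folding."""
    return fold_for_compare(predicted) == fold_for_compare(gold)
```

Under this folding, `same_reading("тибериус", "т и б е р и у с")` and `same_reading("ещё", "еще")` both hold.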

acc = exact match; rej = fraction flag_uncertain escalates; trusted = accuracy on the non-escalated part — the number a hybrid pipeline actually ships.

Class	Share	acc	rej	trusted	Residual is…
PLAIN	70%	95.1%	7%	99.5%	foreign words (gold spells per-letter)
PUNCT	21%	100%	0%	100%	—
CARDINAL	2.6%	77.2%	97%	67.9%	oblique case of bare numbers (needs context)
DATE	1.7%	86.1%	47%	94.5%	year case; ambiguous day case
LETTERS	1.8%	23.6%	28%	32.7%	acronyms read as words, not bare letters (deliberate)
VERBATIM	1.5%	95.5%	0%	95.5%	symbol / Greek map
ORDINAL	0.4%	40.4%	67%	88.3%	bare-number ordinals (need context)
MEASURE	0.4%	59.5%	12%	63.4%	oblique case agreement
MONEY	<0.1%	45.9%	37%	52.9%	case agreement; `долларов США` artifact
DECIMAL	<0.1%	58.6%	3%	59.6%	oblique case agreement
FRACTION	<0.1%	77.9%	98%	100%	context-dependent case
TIME	<0.1%	87.6%	5%	90.5%	oblique case; `HH:MM:SS` kept by gold
Overall	100%	93.7%	9.1%	98.2%

How to read the router figures: without escalation the fast path is 93.7% accurate on everything. Escalating the 9.1% of tokens that flag_uncertain marks leaves a trusted part that is 98.2% accurate, which takes ~75% of all errors out of the fast path. Measured per sentence (the router's real setting, with full context) the figures are 93.8% overall and 97.9% trusted at 8.5% escalation.
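
The relationship between acc, rej, and trusted can be written down directly. The sketch below is not the project's evaluation harness; `router_metrics` and `share_of_errors_caught` are hypothetical helpers that follow the definitions given above the table, and the last line plugs in the reported overall figures.

```python
from typing import Iterable, Tuple


def router_metrics(records: Iterable[Tuple[bool, bool]]) -> Tuple[float, float, float]:
    """Return (acc, rej, trusted) from (is_correct, is_flagged) pairs."""
    rows = list(records)
    total = len(rows)
    kept = [correct for correct, flagged in rows if not flagged]
    acc = sum(correct for correct, _ in rows) / total
    rej = 1 - len(kept) / total
    # Convention chosen for this sketch: nothing kept means nothing wrong shipped.
    trusted = sum(kept) / len(kept) if kept else 1.0
    return acc, rej, trusted


def share_of_errors_caught(acc: float, rej: float, trusted: float) -> float:
    """Fraction of all errors that fall inside the escalated part."""
    all_errors = 1 - acc
    errors_left = (1 - rej) * (1 - trusted)
    return 1 - errors_left / all_errors


print(round(share_of_errors_caught(0.937, 0.091, 0.982), 2))  # 0.74
```

With the overall row (93.7%, 9.1%, 98.2%) this gives about 0.74, in line with the ~75% quoted above given rounding in the inputs.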

Performance

Measured on a single core (Python 3, M-series Mac), 10 000 iterations per sentence:

Function	Latency	Throughput
`normalize_russian`	~0.15 ms / sentence	~370 k chars/s
`flag_uncertain`	~0.02 ms / sentence	~8× faster than full normalize

Throughput is flat with sentence length (linear regex scan, no per-sentence setup). A typical TTS batch of 1 000 sentences normalizes in ~150 ms on one core.
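
To check latency on your own hardware, a timing loop along these lines should do. This is a sketch under stated assumptions, not the script behind the figures above: `ms_per_sentence` is a hypothetical helper, and the function under test is passed in so the snippet runs without the package.

```python
import time
from typing import Callable, Sequence


def ms_per_sentence(
    fn: Callable[[str], object],
    sentences: Sequence[str],
    repeats: int = 10_000,
) -> float:
    """Average wall-clock milliseconds per call of fn over the given sentences."""
    start = time.perf_counter()
    for _ in range(repeats):
        for sentence in sentences:
            fn(sentence)
    elapsed = time.perf_counter() - start
    return elapsed * 1000 / (repeats * len(sentences))


# Assumed usage with the package installed:
#   from rutextnorm import normalize_russian, flag_uncertain
#   print(ms_per_sentence(normalize_russian, ["цена 1 500 руб."]))
#   print(ms_per_sentence(flag_uncertain, ["цена 1 500 руб."]))
```

Numbers will vary with CPU, Python version, and the mix of sentences, so compare ratios rather than absolute values.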

The remaining error in the Metrics table is dominated by two things rules can't fix without a token classifier or sentence context: the grammatical case of bare numbers and a few deliberate divergences (see Design choices & gotchas). flag_uncertain is designed to route both away. The benchmark is a regression guard, not a target.

The evaluation harnesses (eval_reject.py token-level, eval_reject_sent.py sentence-level), the regression tests (test_russian.py), the extension eval set and the dataset-cleaning script live on dev branches (ru-2.0-alpha); this branch ships only the module. To reproduce: python3 eval_reject.py ru_2026.csv.

Design choices & gotchas

These are intentional. Where a corpus's written form and a synthesizer's spoken needs disagree, the module picks speech.

Feed it whole sentences, not pre-split tokens. The context rules (case after a preposition, год after a year, a unit after a number) only fire when the surrounding words are present. Normalizing isolated tokens silently disables them.
ё is kept in the output (нашёл, ещё) — it carries pronunciation. If you diff against a corpus that writes only е, compare ё/е-insensitively.
A bare number's case defaults to nominative. 5 километров, not пяти километрах — the rules can't know the governing case without a cue in the text. flag_uncertain marks these; give context or route them.
Dates read the day in the genitive and the year in the nominative by default (13 сентября → «тринадцатого сентября», 2008 г. → «две тысячи восьмой год»). Both are the citation-form defaults; the actual case is context-dependent.
Foreign words are transliterated as one word (Google → «гугл»), not spelled by English letter names. Good enough for most TTS; flag_uncertain flags them if you need exact G2P.
Cyrillic acronyms use a vowel heuristic: vowel-less → letter-by-letter (СССР → «эс эс эс эр»), pronounceable → kept (НАТО). Exceptions like США (spelled out despite vowels) need a pronunciation lexicon and aren't bundled.
Multi-sense abbreviations are left untouched (кв., г., т. standing alone) — they have several expansions. flag_uncertain marks them.
Phone/ISBN numbers are read as plain cardinals (not segmented), and HH:MM:SS times are expanded.
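
The vowel heuristic for Cyrillic acronyms is simple enough to sketch. This illustrates the rule as described above and is not the module's actual code: `read_acronym` is a hypothetical name, the letter-name table is a conventional one that may differ from the bundled table, and the input is assumed to be uppercase.

```python
VOWELS = set("АЕЁИОУЫЭЮЯ")

# Conventional Russian letter names for consonants (illustrative table).
LETTER_NAMES = {
    "Б": "бэ", "В": "вэ", "Г": "гэ", "Д": "дэ", "Ж": "жэ", "З": "зэ",
    "Й": "и краткое", "К": "ка", "Л": "эль", "М": "эм", "Н": "эн",
    "П": "пэ", "Р": "эр", "С": "эс", "Т": "тэ", "Ф": "эф", "Х": "ха",
    "Ц": "цэ", "Ч": "че", "Ш": "ша", "Щ": "ща",
}


def read_acronym(acronym: str) -> str:
    """Spell a vowel-less acronym letter by letter; keep a pronounceable one."""
    if any(ch in VOWELS for ch in acronym):
        return acronym  # has a vowel: leave it to be read as a word
    return " ".join(LETTER_NAMES.get(ch, ch) for ch in acronym)


print(read_acronym("СССР"))  # эс эс эс эр
print(read_acronym("НАТО"))  # НАТО
print(read_acronym("США"))   # США (kept as a word: the known exception)
```

The last call shows why США needs a lexicon entry: the heuristic sees a vowel and keeps the word, although speakers spell it out.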

Known limitations (need sentence context or a classifier — out of scope)

Grammatical case agreement of a bare number (500 км → «пятисот километров»).
Disambiguating a bare number as cardinal vs. ordinal vs. year.
Telephone / ISBN segmentation and full URL G2P.
Context-dependent abbreviations (г. → год/город, кв. → квартира/квартал).
Acronyms read as letters despite vowels (США).

For these, the intended pattern is flag_uncertain → escalate to a neural normalizer or LLM.

API

normalize_russian(text: str) -> str

Normalize a string (sentence, paragraph, or document). Idempotent on already-spoken text.

flag_uncertain(text: str) -> list[tuple[int, int, str, str]]

Return (start, end, original, reason) spans where the normalization is an unverifiable guess. Empty list = high confidence in the whole string. Offsets index the input.
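
Since offsets index the input, spans can be mapped back onto the original string, for example to highlight them for review. The sketch below assumes `end` is exclusive (Python slice convention) and that spans do not overlap; neither is stated above. `mark_spans` is a hypothetical helper, and the span list is hand-written to mirror the earlier example rather than captured from a real run.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int, str, str]  # (start, end, original, reason)


def mark_spans(text: str, spans: Sequence[Span]) -> str:
    """Wrap each flagged span of the input in square brackets."""
    out: List[str] = []
    cursor = 0
    for start, end, _original, _reason in sorted(spans):
        out.append(text[cursor:start])            # untouched text before the span
        out.append("[" + text[start:end] + "]")   # the flagged span itself
        cursor = end
    out.append(text[cursor:])                     # tail after the last span
    return "".join(out)


text = "Доктор Smith открыл том XIV на с. 42."
spans = [  # illustrative offsets, assuming exclusive end
    (7, 12, "Smith", "foreign word"),
    (24, 27, "XIV", "Roman numeral"),
    (31, 33, "с.", "ambiguous abbreviation"),
]
print(mark_spans(text, spans))
# Доктор [Smith] открыл том [XIV] на [с.] 42.
```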

Contributing

Found a case it reads wrong? PRs and issues welcome — please include the input, the current output, and the form a Russian TTS should say. Behavioural changes should come with a regression test (test_russian.py on the ru-2.0-alpha branch).

If you improve the solution, please contribute the fix back here too.

License

MIT (see LICENSE). The embedded abbreviation/unit inventories are informed by NVIDIA NeMo-text-processing (Apache-2.0); spoken forms were rewritten by hand.
