Turn written Russian into something a TTS model can say: numbers, dates, money, units, fractions, times, abbreviations, symbols, and mixed Latin/Cyrillic — all spelled out, in agreement, in words.
"В 2024 году инфляция составила 7,5%, а доходы выросли на 3 млрд руб."
↓ normalize_russian()
"В две тысячи двадцать четвёртом году инфляция составила семь целых
и пять десятых процента, а доходы выросли на три миллиарда рублей"
- One file, zero dependencies, no network, no ML. Pure re + lookup tables. Deterministic: same input → same output. ~0.15 ms/sentence, ~370 k chars/s (flag_uncertain alone: ~0.02 ms).
- Knows when it might be wrong. flag_uncertain() returns the spans the rules can't resolve from the text, so you can route just those to a slower, stronger method (a neural normalizer or LLM) and trust the fast path everywhere else.
- Built for TTS, not for a benchmark. Where speakability and a corpus's written form disagree, it favours what the synthesizer should pronounce (see Design choices & gotchas).
    pip install rutextnorm

The PyPI name and the import name are the same:

    from rutextnorm import normalize_russian

Or vendor the single file: copy rutextnorm.py straight into your project (e.g. into a TTS repo's text/ folder). Nothing else is required. When vendored, the import follows wherever you put it:

    from text.rutextnorm import normalize_russian

Requires Python ≥ 3.8.
    from rutextnorm import normalize_russian

    text = """У меня есть $1234 и 5678 рублей. Кроме того, я должен 90.50€ и взял в долг 4321 GBP.
    В моём кошельке было 876 UAH и 543.21 RUB, а также я нашёл 20 центов."""
    print(normalize_russian(text))

У меня есть тысяча двести тридцать четыре доллара и пять тысяч шестьсот семьдесят восемь рублей. Кроме того, я должен девяносто евро пятьдесят евроцентов и взял в долг четыре тысячи триста двадцать один фунт.
В моём кошельке было восемьсот семьдесят шесть гривен и пятьсот сорок три рубля двадцать одна копейка, а также я нашёл двадцать центов.
    echo "цена 1 500 руб." | python3 -m rutextnorm   # installed
    echo "цена 1 500 руб." | python3 rutextnorm.py   # vendored
    # -> цена тысяча пятьсот рублей

- TTS front-end. Run text through normalize_russian before your G2P / acoustic model so the synthesizer never has to guess how to read 7,5% or $3 млрд.
- Hybrid pipeline. Use flag_uncertain as a router: the rules handle the ~90% of text they're confident about instantly; only flagged spans go to an expensive neural normalizer (RUNorm) or an LLM. You pay for the slow path only where it actually helps.
- Corpus preprocessing. Normalize a training/eval corpus deterministically and reproducibly, with no model weights or API calls in the loop.
- Drop-in CLI filter in shell pipelines.
| Category | Input → output |
|---|---|
| Cardinals (any size) | 1 234 567 → «один миллион двести тридцать четыре тысячи пятьсот шестьдесят семь» |
| Ordinals (suffix / Roman) | 1-й → «первый», XIX → «девятнадцатый» |
| Dates | 05.08.2008 → «пятое августа две тысячи восьмого года», 2008 г. → «две тысячи восьмой год» |
| Times | 06:06 → «шесть часов шесть минут», 1:15 → «час пятнадцать минут», 2PM → «два часа дня» |
| Money | 543.21 RUB → «пятьсот сорок три рубля двадцать одна копейка», $1 млрд → «один миллиард долларов» |
| Units (count agreement) | 5 кг → «пять килограммов», 90 км/ч → «девяносто километров в час», 7 км. → «семь километров» |
| Multipliers | 5 млн → «пять миллионов», 24,9 млрд руб. → «двадцать четыре целых и девять десятых миллиарда рублей» |
| Decimals & percent | 1,2 → «одна целая и две десятых», 50% → «пятьдесят процентов», 938,00 → «девятьсот тридцать восемь» |
| Fractions | 2/3 → «две третьих», 1/2 → «одна вторая», ½ → «одна вторая» |
| Context-governed case | около 500 км → «около пятисот километров», с 500 рублями → «с пятьюстами рублями» |
| Trigger nouns | 2 место → «второе место», 5 этаж → «пятый этаж» |
| Compound adjectives | 25-этажный → «двадцатипятиэтажный» |
| Abbreviations | и т.д. → «и так далее», ст. 158 → «статья сто пятьдесят восемь» |
| Acronyms | СССР → «эс эс эс эр» (vowel-less spelled out), НАТО kept as a word |
| Latin / mixed | Google → «гугл», GPS → «джи пи эс», example.com → «ексампле точка ком» |
| Symbols | & → «и», ² → «в квадрате», °C, №, Greek letters |
| ё restoration | еще → «ещё» (unambiguous words only) |
Vocabularies are embedded in the single file. The abbreviation and unit inventories are informed by NVIDIA NeMo-text-processing (Apache-2.0); only single-sense entries are kept and the spoken forms were rewritten and checked by hand.
normalize_russian always returns its best guess. flag_uncertain(text) tells you
where that guess rests on information the text doesn't contain — so a caller can
escalate those spans (or the whole sentence) to a stronger method and trust the
rest.
    from rutextnorm import flag_uncertain

    text = "Доктор Smith открыл том XIV на с. 42."
    for start, end, original, reason in flag_uncertain(text):
        print(f"{original!r:14}{reason}")

    'Smith'       foreign word (transliteration is approximate)
    'XIV'         Roman numeral (case defaults to nominative)
    'с.'          ambiguous abbreviation (секунда / страница / село / с (предлог))

It reads the input only (never a reference), runs in ~0.02 ms/sentence, and detects five structural ambiguities:
| Detector | Why it's uncertain |
|---|---|
| Foreign words | transliteration is approximate; exact pronunciation needs G2P |
| Multi-sense abbreviations (г. в. с. кв. …) | several expansions; only context disambiguates |
| Roman numerals | case is context-dependent; read in the nominative by default |
| Four-digit year-or-cardinal | 1998 could be a year or a count |
| Bare numbers with no cue | grammatical case / cardinal-vs-ordinal undetermined |
Each span carries a reason, so a cost-sensitive caller can ignore the reason
types it doesn't care about (e.g. trust foreign-word transliteration and drop those
flags). A minimal router:
    from rutextnorm import flag_uncertain, normalize_russian

    def normalize_or_escalate(text, escalate):
        spans = flag_uncertain(text)
        if spans:
            return escalate(text)          # neural model / LLM
        return normalize_russian(text)     # fast path

Measured against ru_2026.csv: the Google/Kaggle Russian normalization gold (ru_train.csv, 10.6M tokens) with its dataset artifacts removed (per-letter spelling markers, sil tokens). Comparison is ё-insensitive (the module keeps ё, the gold drops it) and space-folded (the gold space-separates transliterated foreign words, e.g. т и б е р и у с, where the module writes the joined word).
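
A comparison of this kind could be sketched as follows. The helper names fold and matches are hypothetical, and reading "space-folded" as "remove all whitespace" is an assumption; the actual harness (eval_reject.py) may fold differently.

```python
def fold(text: str) -> str:
    """Map ё to е and drop all whitespace so both sides compare on letters only."""
    text = text.replace("ё", "е").replace("Ё", "Е")
    return "".join(text.split())


def matches(predicted: str, gold: str) -> bool:
    """Exact match after folding (hypothetical helper, not part of rutextnorm)."""
    return fold(predicted) == fold(gold)


print(matches("тибериус", "т и б е р и у с"))  # True: joined vs. space-separated
print(matches("нашёл", "нашел"))               # True: ё vs. е
print(matches("пять", "шесть"))                # False: genuinely different words
```

Folding both sides before the exact-match test keeps the two known formatting differences from being counted as errors.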
acc = exact match; rej = fraction flag_uncertain escalates; trusted =
accuracy on the non-escalated part — the number a hybrid pipeline actually ships.
| Class | Share | acc | rej | trusted | Residual is… |
|---|---|---|---|---|---|
| PLAIN | 70% | 95.1% | 7% | 99.5% | foreign words (gold spells per-letter) |
| PUNCT | 21% | 100% | 0% | 100% | — |
| CARDINAL | 2.6% | 77.2% | 97% | 67.9% | oblique case of bare numbers (needs context) |
| DATE | 1.7% | 86.1% | 47% | 94.5% | year case; ambiguous day case |
| LETTERS | 1.8% | 23.6% | 28% | 32.7% | acronyms read as words, not bare letters (deliberate) |
| VERBATIM | 1.5% | 95.5% | 0% | 95.5% | symbol / Greek map |
| ORDINAL | 0.4% | 40.4% | 67% | 88.3% | bare-number ordinals (need context) |
| MEASURE | 0.4% | 59.5% | 12% | 63.4% | oblique case agreement |
| MONEY | <0.1% | 45.9% | 37% | 52.9% | case agreement; долларов США artifact |
| DECIMAL | <0.1% | 58.6% | 3% | 59.6% | oblique case agreement |
| FRACTION | <0.1% | 77.9% | 98% | 100% | context-dependent case |
| TIME | <0.1% | 87.6% | 5% | 90.5% | oblique case; HH:MM:SS kept by gold |
| Overall | 100% | 93.7% | 9.1% | 98.2% | |
Reading the router story: escalating the 9.1% of tokens that flag_uncertain
marks raises accuracy from 93.7% on all tokens to 98.2% on the trusted
(non-escalated) remainder, catching ~75% of all errors. Measured per sentence
(the router's real setting, with full context), the figures are 93.8% overall
and 97.9% trusted at 8.5% escalation.
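
The "~75% of all errors" figure can be checked from the Overall row. The sketch below assumes that every error inside an escalated token counts as caught, and that the only errors left are those in the non-escalated part; the function name is hypothetical.

```python
def share_of_errors_caught(acc: float, rej: float, trusted: float) -> float:
    """Fraction of all fast-path errors that fall inside escalated tokens."""
    errors_before = 1.0 - acc                     # error rate with no routing
    errors_after = (1.0 - rej) * (1.0 - trusted)  # errors left in the trusted part
    return 1.0 - errors_after / errors_before


# Overall row of the table: acc 93.7%, rej 9.1%, trusted 98.2%.
share = share_of_errors_caught(acc=0.937, rej=0.091, trusted=0.982)
print(f"{share:.0%}")  # 74% from the rounded table figures
```

With the rounded percentages the result is about 74%, which agrees with the quoted ~75% to within rounding.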
Measured on a single core (Python 3, M-series Mac), 10 000 iterations per sentence:
| Function | Latency | Throughput |
|---|---|---|
| normalize_russian | ~0.15 ms / sentence | ~370 k chars/s |
| flag_uncertain | ~0.02 ms / sentence | ~8× faster than full normalize |
Throughput is flat with sentence length (linear regex scan, no per-sentence setup). A typical TTS batch of 1 000 sentences normalizes in ~150 ms on one core.
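
To get comparable numbers on your own hardware, a timing loop along these lines is one option. It is a sketch, not the script behind the table: measure is a hypothetical helper, and str.upper stands in for the function under test so the block runs without the module installed.

```python
import timeit
from typing import Callable, Sequence, Tuple


def measure(fn: Callable[[str], object],
            sentences: Sequence[str],
            iterations: int = 10_000) -> Tuple[float, float]:
    """Return (milliseconds per sentence, characters per second) for fn."""
    total_calls = iterations * len(sentences)
    total_chars = iterations * sum(len(s) for s in sentences)
    # timeit runs the whole batch `iterations` times and returns total seconds.
    seconds = timeit.timeit(lambda: [fn(s) for s in sentences], number=iterations)
    return seconds / total_calls * 1000.0, total_chars / seconds


if __name__ == "__main__":
    sample = ["цена 1 500 руб.", "В 2024 году инфляция составила 7,5%."]
    # Stand-in callable; pass rutextnorm.normalize_russian to time the module.
    ms, cps = measure(str.upper, sample, iterations=1_000)
    print(f"{ms:.4f} ms/sentence, {cps:,.0f} chars/s")
```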
The remaining error is dominated by two things rules can't fix without a token
classifier or sentence context — grammatical case of bare numbers and a few
deliberate divergences (next section) — both of which flag_uncertain is
designed to route away. The benchmark is a regression guard, not a target.
The evaluation harnesses (eval_reject.py token-level, eval_reject_sent.py sentence-level), the regression tests (test_russian.py), the extension eval set and the dataset-cleaning script live on dev branches (ru-2.0-alpha); this branch ships only the module. To reproduce: python3 eval_reject.py ru_2026.csv.
These are intentional. Where a corpus's written form and a synthesizer's spoken needs disagree, the module picks speech.
- Feed it whole sentences, not pre-split tokens. The context rules (case after a preposition, год after a year, a unit after a number) only fire when the surrounding words are present. Normalizing isolated tokens silently disables them.
- ё is kept in the output (нашёл, ещё) because it carries pronunciation. If you diff against a corpus that writes only е, compare ё/е-insensitively.
- A bare number's case defaults to nominative. 5 километров, not пяти километрах: the rules can't know the governing case without a cue in the text. flag_uncertain marks these; give context or route them.
- Dates read the day in the genitive and the year in the nominative by default (13 сентября → «тринадцатого сентября», 2008 г. → «две тысячи восьмой год»). Both are the citation-form defaults; the actual case is context-dependent.
- Foreign words are transliterated as one word (Google → «гугл»), not spelled by English letter names. Good enough for most TTS; flag_uncertain flags them if you need exact G2P.
- Cyrillic acronyms use a vowel heuristic: vowel-less → letter-by-letter (СССР → «эс эс эс эр»), pronounceable → kept (НАТО). Exceptions like США (spelled out despite vowels) need a pronunciation lexicon and aren't bundled.
- Multi-sense abbreviations are left untouched (кв., г., т. standing alone) because they have several expansions. flag_uncertain marks them.
- Phone/ISBN numbers are read as plain cardinals (not segmented), and HH:MM:SS times are expanded.

What the rules do not attempt:

- Grammatical case agreement of a bare number (500 км → «пятисот километров»).
- Disambiguating a bare number as cardinal vs. ordinal vs. year.
- Telephone / ISBN segmentation and full URL G2P.
- Context-dependent abbreviations (г. → год/город, кв. → квартира/квартал).
- Acronyms read as letters despite vowels (США).

For these, the intended pattern is flag_uncertain → escalate to a neural
normalizer or LLM.
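
A router that ignores some reason types before deciding to escalate might look like the sketch below. It takes the normalizer, the flagger and the escalation function as parameters so it runs without the module; route and the stand-in callables are hypothetical, and matching reasons by the prefix "foreign word" is an assumption based on the sample output shown earlier.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int, str, str]  # (start, end, original, reason)


def route(text: str,
          normalize: Callable[[str], str],
          flag: Callable[[str], List[Span]],
          escalate: Callable[[str], str],
          ignored_prefixes: Tuple[str, ...] = ("foreign word",)) -> str:
    """Escalate only if a flag remains after dropping the ignored reason types."""
    remaining = [s for s in flag(text) if not s[3].startswith(ignored_prefixes)]
    return escalate(text) if remaining else normalize(text)


# Stand-ins for demonstration; in real use pass normalize_russian,
# flag_uncertain and your own neural/LLM normalizer.
def fake_flag(text: str) -> List[Span]:
    if "Smith" not in text:
        return []
    start = text.index("Smith")
    return [(start, start + 5, "Smith", "foreign word (transliteration is approximate)")]


fast = lambda t: "FAST:" + t
slow = lambda t: "SLOW:" + t

print(route("Доктор Smith", fast, fake_flag, slow))                       # FAST:Доктор Smith
print(route("Доктор Smith", fast, fake_flag, slow, ignored_prefixes=()))  # SLOW:Доктор Smith
```

Passing an empty ignored_prefixes tuple restores the strict behaviour where any flag triggers escalation.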
    normalize_russian(text: str) -> str

Normalize a string (sentence, paragraph, or document). Idempotent on already-spoken text.

    flag_uncertain(text: str) -> list[tuple[int, int, str, str]]

Return (start, end, original, reason) spans where the normalization is an unverifiable guess. Empty list = high confidence in the whole string. Offsets index the input.
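
Because the offsets index the input, they can be used to mark or cut out the flagged spans directly. The sketch below wraps each span in brackets; mark_spans is a hypothetical helper, and it assumes that end is exclusive (Python slice convention) and that spans do not overlap. The spans are written by hand here in the shape flag_uncertain returns.

```python
from typing import List, Tuple

Span = Tuple[int, int, str, str]  # (start, end, original, reason)


def mark_spans(text: str, spans: List[Span], left: str = "[", right: str = "]") -> str:
    """Wrap every flagged span of the input in left/right markers."""
    # Work right-to-left so earlier offsets stay valid while the string grows.
    for start, end, original, _reason in sorted(spans, reverse=True):
        if text[start:end] != original:
            raise ValueError("offsets must index the unmodified input")
        text = text[:start] + left + original + right + text[end:]
    return text


text = "Доктор Smith открыл том XIV на с. 42."
spans = [
    (7, 12, "Smith", "foreign word (transliteration is approximate)"),
    (24, 27, "XIV", "Roman numeral (case defaults to nominative)"),
]
print(mark_spans(text, spans))  # Доктор [Smith] открыл том [XIV] на с. 42.
```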
Found a case it reads wrong? PRs and issues welcome — please include the input, the
current output, and the form a Russian TTS should say. Behavioural changes should
come with a regression test (test_russian.py on the ru-2.0-alpha branch).
If you improve the solution, please contribute the fix back here too.
MIT (see LICENSE). The embedded abbreviation/unit inventories are informed by NVIDIA NeMo-text-processing (Apache-2.0); spoken forms were rewritten by hand.