MOSSTTSKit is an independent Swift Package that wraps the MOSS-TTS-Nano ONNX models. It builds on the open-source MOSS-TTS-Nano project and related tooling, and aims to make MOSS-TTS-Nano easy to integrate into Apple-platform apps such as TTSMate and other Swift projects.
The upstream MOSS-TTS-Nano project is available here:
Current package scope:
- Download and cache the MOSS-TTS-Nano TTS model files from HuggingFace.
- Download and cache the MOSS Audio Tokenizer ONNX model files from HuggingFace.
- Initialize from either cached/downloaded models or explicit local model directories.
- Load text tokenizer and audio tokenizer models.
- Expose all built-in voices from the model manifest through package APIs, plus a `makeSpeaker(name:referenceAudioURL:)` API that encodes reference audio into acoustic codes for voice cloning.
- Provide a real ONNX Runtime backed `ONNXSession` wrapper for generic tensor inference.
- Run a verified real-model preview path: prefill, global decode step, fixed frame sampler, and audio tokenizer decode for the first generated acoustic frame.
- Support automatic long-text chunking for `speak(...)` and `speakStream(...)`, with chunk concatenation and short pauses inserted between synthesized segments.
This package builds on and benefits from the following open-source projects:
- OpenMOSS / MOSS-TTS-Nano - the upstream TTS model, ONNX runtime flow, and browser/runtime design this package is based on.
- Microsoft / ONNX Runtime - the inference runtime used to execute the exported ONNX models.
- microsoft / onnxruntime-swift-package-manager - the Swift Package integration used by this project.
- Hugging Face / swift-transformers - used for tokenizer and model-loading related Swift-side integration.
- Google / SentencePiece - the tokenizer model format used by MOSS-TTS-Nano.
This project is licensed under the Apache License 2.0.
```swift
import MOSSTTSKit

let tts = try await MOSSTTSKit()
let result = try await tts.speak(text: "你好,欢迎使用 MOSS-TTS-Nano。")

try await tts.speakToFile(
    text: "保存成 WAV 文件。",
    outputURL: URL(fileURLWithPath: "/tmp/moss.wav")
)
```

`speak(...)`, `speakStream(...)`, and `speakToFile(...)` now support long text directly.
MOSSTTSKit will:
- split longer input into multiple synthesis chunks automatically
- prefer sentence-ending punctuation first, then clause punctuation, then token-budget fallback splitting
- concatenate chunk audio and insert short pauses between chunks
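The splitting strategy above can be sketched roughly as follows. This is an illustrative simplification, not the package's actual implementation: the real splitter counts tokenizer tokens against `maxTextTokensPerChunk`, while characters stand in for tokens here.

```swift
// Illustrative sketch of sentence-first chunking (not the package's real code).
// Characters stand in for tokenizer tokens; `budget` plays the role of
// maxTextTokensPerChunk.
func chunk(_ text: String, budget: Int) -> [String] {
    func split(_ s: some StringProtocol, on marks: Set<Character>) -> [String] {
        var parts: [String] = []
        var current = ""
        for ch in s {
            current.append(ch)
            if marks.contains(ch) {
                parts.append(current)
                current = ""
            }
        }
        if !current.isEmpty { parts.append(current) }
        return parts
    }

    var chunks: [String] = []
    // 1. Prefer sentence-ending punctuation.
    for sentence in split(text, on: ["。", "!", "?", ".", "!", "?"]) {
        if sentence.count <= budget {
            chunks.append(sentence)
            continue
        }
        // 2. Sentence over budget: fall back to clause punctuation.
        for clause in split(sentence, on: [",", ";", "、", ",", ";"]) {
            if clause.count <= budget {
                chunks.append(clause)
            } else {
                // 3. Last resort: hard split at the budget boundary.
                var rest = clause[...]
                while !rest.isEmpty {
                    chunks.append(String(rest.prefix(budget)))
                    rest = rest.dropFirst(budget)
                }
            }
        }
    }
    return chunks
}
```

A real implementation would also merge adjacent short sentences up to the budget instead of emitting one chunk per sentence, but the priority order is the same.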
The chunking budget is controlled by `MOSSTTSOptions.maxTextTokensPerChunk`:
```swift
let options = MOSSTTSOptions(
    maxGeneratedFrames: 256,
    maxTextTokensPerChunk: 75
)
let result = try await tts.speak(
    text: "第一段。第二段。第三段。",
    options: options
)
```

For longer paragraphs, you usually no longer need to split text manually in the caller.
Important:
- Do not set a small `maxGeneratedFrames` value for long-form synthesis unless you intentionally want a short preview.
- If `maxGeneratedFrames` is left as `nil`, MOSSTTSKit falls back to the model manifest default, which is the recommended behavior for normal sentence and paragraph synthesis.
- Small frame caps such as `8`, `16`, `32`, or `64` are best reserved for smoke tests, progress UI testing, and short preview generation.
`MOSSTTSKit()` enables automatic model download by default.
```swift
let tts = try await MOSSTTSKit(
    options: .init(
        autoDownload: true,
        progressCallback: { progress in
            print(progress.description)
        }
    )
)
```

Default cache location:
- macOS: `~/Library/Caches/MOSSTTSKit/Models`
- iOS: `<App Sandbox>/Library/Caches/MOSSTTSKit/Models`
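These defaults follow the platform's standard Caches directory, so you can resolve the same path yourself if you need to inspect or prune the cache. A sketch, assuming the package keeps the documented `MOSSTTSKit/Models` layout; the package resolves this path internally:

```swift
import Foundation

// Resolve the default cache root documented above:
// ~/Library/Caches/MOSSTTSKit/Models on macOS, the sandboxed
// equivalent on iOS. Sketch only; MOSSTTSKit computes its own path.
func defaultModelCacheDir() -> URL {
    let caches = FileManager.default.urls(for: .cachesDirectory,
                                          in: .userDomainMask)[0]
    return caches
        .appendingPathComponent("MOSSTTSKit")
        .appendingPathComponent("Models")
}
```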
Default downloaded folders:
```
.../MOSSTTSKit/Models/
├── MOSS-TTS-Nano-100M-ONNX/
└── MOSS-Audio-Tokenizer-Nano-ONNX/
```
Use a custom cache directory:
```swift
let tts = try await MOSSTTSKit(
    options: .init(
        autoDownload: true,
        cacheDir: URL(fileURLWithPath: "/path/to/custom-cache")
    )
)
```

Preload models before first synthesis:
```swift
try await MOSSTTSKit.preload { progress in
    print(progress.description)
}
```

Disable automatic download and require cached models:
```swift
let tts = try await MOSSTTSKit(
    options: .init(autoDownload: false)
)
```

Check and clear cache:
```swift
let cached = await MOSSTTSKit.isModelCached()
let cacheSize = await MOSSTTSKit.cacheSize()
try await MOSSTTSKit.clearCache()
```

Recommended first-pass integration for TTSMate:
- Add the Swift Package dependency:

  ```swift
  .package(path: "/Users/yangxuehui/Documents/dev/MOSSTTSKit/MOSSTTSKit")
  ```

- Add `MOSSTTSKit` to the TTSMate target dependencies.
- For the first integration pass, either use explicit local model directories or keep auto-download on and surface the progress callback in the UI.
Auto-download version:
```swift
import MOSSTTSKit

let tts = try await MOSSTTSKit(
    options: .init(
        autoDownload: true,
        synthesisOptions: MOSSTTSOptions(),
        progressCallback: { progress in
            print(progress.description)
        }
    )
)
```

Explicit local-directory version:
```swift
import MOSSTTSKit

let tts = try await MOSSTTSKit(
    ttsModelDir: URL(fileURLWithPath: "/Users/yangxuehui/Library/Caches/MOSSTTSKit/Models/MOSS-TTS-Nano-100M-ONNX"),
    audioTokenizerDir: URL(fileURLWithPath: "/Users/yangxuehui/Library/Caches/MOSSTTSKit/Models/MOSS-Audio-Tokenizer-Nano-ONNX"),
    options: MOSSTTSOptions()
)
```

- Smoke test with a short sentence:
```swift
let result = try await tts.speak(text: "你好,这是 TTSMate 集成测试。")
print(result.audioSamples.count)
print(result.sampleRate)
```

- Show progress and allow cancel:
```swift
let result = try await tts.speak(
    text: "你好,这是带进度的测试。",
    options: MOSSTTSOptions(maxGeneratedFrames: 8)
) { progress in
    print("frame \(progress.currentStep)/\(progress.totalSteps)")
    return true
}
```

- Try streaming playback:
```swift
let stream = try await tts.speakStream(
    text: "这是流式播放测试。",
    options: MOSSTTSOptions(maxGeneratedFrames: 8)
)
for try await chunk in stream {
    if chunk.isFinal { break }
    // Feed chunk.newAudioSamples into your player/buffer here.
    print("chunk samples:", chunk.newAudioSamples.count)
}
```

- Enumerate all built-in voices:
```swift
let speakers = await tts.availableSpeakers
for speaker in speakers {
    print(speaker.identifier ?? speaker.name)
    print(speaker.displayName ?? speaker.name)
    print(speaker.group ?? "Unknown Group")
}
```

- Build a cloned speaker from a reference WAV:
```swift
let speaker = try await tts.makeSpeaker(
    name: "Reference",
    referenceAudioURL: URL(fileURLWithPath: "/path/to/reference.wav")
)
let cloned = try await tts.speak(
    text: "这是克隆音色测试。",
    speaker: speaker,
    options: MOSSTTSOptions(maxGeneratedFrames: 8)
)
```

Integration notes:
- Start with `maxGeneratedFrames: 3` or `8` only for the first UI smoke test or a deliberate short preview.
- For normal sentence and paragraph synthesis, prefer `MOSSTTSOptions()` and let the package use the model manifest frame limit by default.
- For longer paragraphs, the package now chunks text automatically. You can usually keep the full text in one `speak(...)` call and tune `maxTextTokensPerChunk` only when you need finer control.
- Default auto-download cache on macOS is `~/Library/Caches/MOSSTTSKit/Models`.
- `speakStream(...)` is the best fit if TTSMate wants progressive playback; `speak(...)` is better for a quick blocking smoke test.
- The package currently uses a bounded real-generation path and is suitable for integration testing, but still needs longer-text performance tuning before it can be called a fully polished production pipeline.
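If TTSMate goes the `speakStream(...)` route, the streamed chunks can be scheduled onto an `AVAudioPlayerNode` roughly like this. A sketch under two assumptions: `chunk.newAudioSamples` is mono `[Float]` PCM, and the hard-coded `sampleRate` matches the model's actual output rate (the non-streaming result exposes it as `result.sampleRate`):

```swift
import AVFoundation

// Sketch: progressive playback of streamed chunks via AVAudioPlayerNode.
// Assumes mono [Float] PCM; replace `sampleRate` with the model's real rate.
let engine = AVAudioEngine()
let player = AVAudioPlayerNode()
let sampleRate: Double = 24_000  // assumption, not taken from the package
let format = AVAudioFormat(standardFormatWithSampleRate: sampleRate,
                           channels: 1)!

engine.attach(player)
engine.connect(player, to: engine.mainMixerNode, format: format)
try engine.start()
player.play()

let stream = try await tts.speakStream(text: "边生成边播放。")
for try await chunk in stream {
    if chunk.isFinal { break }
    let samples = chunk.newAudioSamples
    guard let buffer = AVAudioPCMBuffer(
        pcmFormat: format,
        frameCapacity: AVAudioFrameCount(samples.count)
    ) else { continue }
    buffer.frameLength = AVAudioFrameCount(samples.count)
    // Copy the float samples into the buffer's first (only) channel.
    for (i, s) in samples.enumerated() {
        buffer.floatChannelData![0][i] = s
    }
    player.scheduleBuffer(buffer, completionHandler: nil)
}
```

Scheduling buffers as they arrive lets playback start before synthesis finishes; the player node queues them back to back.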
Limiting generated frames and observing progress:
```swift
let shortOptions = MOSSTTSOptions(maxGeneratedFrames: 8)
let preview = try await tts.speak(
    text: "短句测试。",
    options: shortOptions
) { progress in
    print("Generated frame \(progress.currentStep)/\(progress.totalSteps)")
    return true
}
```

Recommended default for full text synthesis:
```swift
let result = try await tts.speak(
    text: "这里可以放完整段落,不需要手动切分,也不需要额外设置低帧数上限。",
    options: MOSSTTSOptions()
)
```

Streaming audio chunks:
```swift
let stream = try await tts.speakStream(
    text: "边生成边播放。",
    options: MOSSTTSOptions(maxGeneratedFrames: 8)
)
for try await chunk in stream {
    if chunk.isFinal { break }
    print("new chunk samples:", chunk.newAudioSamples.count)
}
```

Using local model directories:
```swift
let tts = try await MOSSTTSKit(
    ttsModelDir: URL(fileURLWithPath: "/path/to/MOSS-TTS-Nano-100M-ONNX"),
    audioTokenizerDir: URL(fileURLWithPath: "/path/to/MOSS-Audio-Tokenizer-Nano-ONNX")
)
```

Preparing a cloned speaker:
```swift
let speaker = try await tts.makeSpeaker(
    name: "Reference Voice",
    referenceAudioURL: URL(fileURLWithPath: "/path/to/reference.wav")
)
let cloned = try await tts.speak(text: "使用参考音频音色。", speaker: speaker)
```

Default cache layout:
```
~/Library/Caches/MOSSTTSKit/Models/
├── MOSS-TTS-Nano-100M-ONNX/
│   ├── moss_tts_prefill.onnx
│   ├── moss_tts_decode_step.onnx
│   ├── moss_tts_local_decoder.onnx
│   ├── moss_tts_local_cached_step.onnx
│   ├── moss_tts_local_fixed_sampled_frame.onnx
│   ├── moss_tts_global_shared.data
│   ├── moss_tts_local_shared.data
│   ├── tokenizer.model
│   ├── tts_browser_onnx_meta.json
│   └── browser_poc_manifest.json
└── MOSS-Audio-Tokenizer-Nano-ONNX/
    ├── moss_audio_tokenizer_encode.onnx
    ├── moss_audio_tokenizer_encode.data
    ├── moss_audio_tokenizer_decode_full.onnx
    ├── moss_audio_tokenizer_decode_step.onnx
    ├── moss_audio_tokenizer_decode_shared.data
    └── codec_browser_onnx_meta.json
```
The built-in downloader now writes `.partial` files and resumes interrupted downloads when the server supports HTTP range requests. If the network is unstable, you can manually download the files from:

- https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Nano-100M-ONNX
- https://huggingface.co/OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano-ONNX
Then either place them in the default cache layout above, let `MOSSTTSKit()` auto-download into that cache, or pass explicit directories:
```swift
let tts = try await MOSSTTSKit(
    ttsModelDir: URL(fileURLWithPath: "/path/to/MOSS-TTS-Nano-100M-ONNX"),
    audioTokenizerDir: URL(fileURLWithPath: "/path/to/MOSS-Audio-Tokenizer-Nano-ONNX")
)
```

For now, model inspection is an internal development helper and is not exported as a package product. App integrations such as TTSMate only need the library product `MOSSTTSKit`.
The package builds and its unit tests pass. ONNX Runtime is integrated for generic sessions, the audio tokenizer path calls real ORT sessions, and real-model integration tests cover the first generated acoustic frame.
The package also parses `browser_poc_manifest.json`, builds the MOSS-TTS prefill request rows that combine text tokens and reference audio codes, and exposes `MOSSTTSEngine.generateFirstAudioFrame(...)` as a small real-generation slice. `MOSSTTSEngine.generateAudioCodes(...)` runs a bounded multi-frame continuation loop, and `MOSSTTSKit.speak(...)` now uses that real multi-frame path. The default `MOSSTTSOptions.maxGeneratedFrames` is 32 for early integration safety; callers can raise or lower it. `MOSSTTSKit.speakStream(...)` exposes decoded audio chunks as an `AsyncThrowingStream`.
The ONNX model repository ships `tokenizer.model` without `tokenizer.json` / `tokenizer_config.json`; MOSSTTSKit currently falls back to the model's byte-token range for offline local initialization. A true SentencePiece parser is still needed for exact tokenizer parity.
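The byte-token fallback behaves roughly like this. Illustrative only: `byteTokenBase` is an assumed placeholder, and the real offset and token IDs come from the model's vocabulary.

```swift
// Illustrative byte-level fallback: map each UTF-8 byte to a byte-token ID.
// `byteTokenBase` is a placeholder; the real offset comes from the model vocab.
// A true SentencePiece parser would instead merge bytes into subword pieces,
// yielding fewer tokens and exact parity with the upstream tokenizer.
func byteFallbackTokens(for text: String, byteTokenBase: Int) -> [Int] {
    Array(text.utf8).map { byteTokenBase + Int($0) }
}
```

This keeps offline initialization working, at the cost of longer token sequences than the real SentencePiece segmentation would produce.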
The remaining major task is to make the full MOSS-TTS autoregressive inference loop production-ready by adding streaming audio output, validating longer generation performance, and polishing:
- `moss_tts_decode_step.onnx`
- `moss_tts_local_decoder.onnx`
- `moss_tts_local_cached_step.onnx`
- `moss_tts_local_fixed_sampled_frame.onnx`