Today we're releasing three new models with 80M, 40M and 14M parameters.
The largest model (80M) has the highest quality. The 14M variant reaches new SOTA in expressivity among similarly sized models, despite being <25MB in size. This release is a major upgrade from the previous one and supports English text-to-speech applications in eight voices: four male and four female.
Here's a short demo: https://www.youtube.com/watch?v=ge3u5qblqZA.
Most models are quantized to int8 + fp16 and use ONNX for runtime. Our models are designed to run anywhere, e.g. Raspberry Pi, low-end smartphones, wearables, browsers, etc. No GPU required! This release aims to bridge the gap between on-device and cloud models for TTS applications. A multilingual model release is coming soon.
On-device AI is bottlenecked by one thing: a lack of tiny models that actually perform. Our goal is to open-source more models to run production-ready voice agents and apps entirely on-device.
We would love your feedback!
Is there any way to get those running on iPhone? I would love to have it read articles to me like a podcast.
Is there any way to do a custom voice as a DIY? Or do we need to go through you? If so, would you consider making a pricing page for purchasing a license/alternative voice? All but one of the voices are unusable in a business context.
This is a mind numbing task that requires workers to make hundreds of calls each day with only minor variations, sometimes navigating phone trees, half the time leaving almost the exact same message.
Anyway, I believe almost all such businesses will be automated within months. Human labour just cannot compete on cost.
The legitimate objection people have to AI in this use case is that it can be slow or stupid in a way that wastes time. By acting more humanlike, we signal that we are going to be closer to human level performance.
Either in the form of the API, via pitch/speed/volume parameters, for more deterministic control.
Or in expressive tags such as [coughs], [urgently], or [laughs in melodic ascending and descending arpeggiated gibberish babbles].
the 25MB model is amazingly good for being 25MB. How does it handle expressive tags?
A stretch goal is 'arbitrary tags' from [singing] [sung to the tune of {x}] [pausing for emphasis] [slowly decreasing speed for emphasis] [emphasizing the object of this sentence] [clapping] [car crash in the distance] [laser's pew pew].
But yeah: instruction/control via [tags] is the deciding feature for me, provided prompt adherence is strong enough.
Also: a thought...
Everyone is using [] for different kinds of tags in this space, which is very simple. Maybe it makes sense to differentiate kinds of tags? I.e. [tags for modifying how text is spoken] vs {tags for creating sounds that aren't specifically speech: not modifying anything, but instead its own 'sound/word'}
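A toy sketch of how a renderer might separate the two kinds of tags. The bracket convention here is purely hypothetical (no real TTS API uses it); the point is just that the two kinds are easy to distinguish syntactically:

```python
import re

# Hypothetical convention: [..] modifies how the following text is spoken,
# {..} inserts a standalone sound event.
TAG_RE = re.compile(r"\[(?P<mod>[^\]]+)\]|\{(?P<sound>[^}]+)\}")

def parse_tags(text):
    """Split text into (kind, value) events: 'mod', 'sound', or plain 'text'."""
    events, pos = [], 0
    for m in TAG_RE.finditer(text):
        if m.start() > pos:
            events.append(("text", text[pos:m.start()]))
        if m.group("mod") is not None:
            events.append(("mod", m.group("mod")))
        else:
            events.append(("sound", m.group("sound")))
        pos = m.end()
    if pos < len(text):
        events.append(("text", text[pos:]))
    return events

print(parse_tags("[urgently] Watch out! {car crash in the distance}"))
```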
I'm impressed with the quality given the size. I don't love the voices, but it's not bad. Running on an Intel 9700 CPU, it's about 1.5x realtime using the 80M model. It wasn't any faster running on a 3080 GPU though.
Regarding running on the 3080 GPU, can you share more details on GitHub issues, Discord, or email? It should be blazing fast on that. I'll add an example to run the model on GPU too.
Had to get the right Python version and make sure it didn't break anything with the previous Python version. A friend suggested using Docker, so I started down that path until I realized I'd probably have to set the whole thing up there myself. Eventually got it to run and I think I didn't break anything else.
I hate Python so much.
I did eventually do that though, and I'm pretty sure I had to mess about with installing and uninstalling torch.
I dread using anything made in Python because of this. It's always annoying and never just works if the Python version is incompatible (otherwise it's fine).
Even if you have to install using pip, it only affects the active environment.
Maybe I'm only trying simple things.
I suspect success is highly variable on macOS vs. Linux; the spaCy bug only appears on newer Pythons (3.14 or later), which Linux will have.
I should learn to give up quicker.
The iOS version is Swift-based.
Added kitten (nano only, for now, will move on to mini) to my "web tts thing": https://github.com/idle-intelligence/tts-web
If the author doesn't describe some detail about the data, training, a novel architecture, etc., I can only assume they took another model, did a little fine-tuning, and repackaged it as a new product.
Also:
TL;DR: generate human-like voice based on animal sounds. Anyway, maybe it doesn't make sense.
Kokoro TTS, for example, has a very good Norwegian voice, but the rhythm and emphasis are often so out of whack that the generated speech is almost incomprehensible.
Haven't had time to check this model out yet, how does it fare here? What's needed to improve the models in this area now that the voice part is more or less solved?
If only I could have that in Norwegian my SO would be pleased.
Also I totally misremembered regarding Kokoro TTS. It's good, but not what was butchering Norwegian. Forgot which one I was thinking of, maybe it was the old VITS stuff Rhaspy uses. Points stand, the voice was good but could barely understand what was said.
The new 15M is way better than the previous 80M model (v0.1). So we're able to predictably improve the quality, which is very encouraging.
I want to be my own personal assistant...
EDIT: I can provide it an RTX 3080 Ti.
Qwen 3 TTS is good for voice cloning but requires GPU of some sort.
I couldn't locate how to run it on a GPU anywhere in the repo.
(That's using the example as-is. If you switch it to the smaller model, adjust the above by +57 MiB of models from HuggingFace, for a total of 727 MiB.)
So I toyed with this a bit plus the Rust library "ort", and ort is only 224M in release (non-debug) mode, and it was pretty simple to run this model with it. (I did not know ort before just now.) I didn't replicate the preprocessing the Python does before running the model, though. (You have to turn the text into an array of floats, essentially; the library is doing text -> phonemes -> tokens; the latter step is straightforward.)
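The phoneme -> token step the comment calls straightforward is just a table lookup. A minimal sketch; the vocabulary below is made up for illustration (a real model ships its own symbol table, and the text -> phoneme step needs a phonemizer):

```python
# Hypothetical symbol table: index 0 reserved for padding, then phoneme symbols.
VOCAB = {sym: i for i, sym in enumerate(["<pad>", "h", "ə", "l", "oʊ", " "])}

def phonemes_to_tokens(phonemes):
    """Map a list of phoneme symbols to integer ids via table lookup."""
    return [VOCAB[p] for p in phonemes]

print(phonemes_to_tokens(["h", "ə", "l", "oʊ"]))  # → [1, 2, 3, 4]
```

The resulting id array (cast to the dtype the model expects) is what gets fed to the ONNX session.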
Perhaps his YouTube channel is worth a watch. This video from four months ago compares various STT tools: https://youtu.be/pKU9CABtnOw
Speaking of apps that would, if I had to guess, love to integrate you:
FluidVoice is incredible and developing quickly. Handy is really hot right now. Also have VoiceInk out there, solid iOS option.
[ps-not parent commenter]
I didn't expect it to pronounciate 'ms' correctly, but the number sounded just like noise. Eventually I got an acceptable result for the string "Startup finished in one hundred and thirty five seconds."
The above SECDED check-bit encoding can be implemented in a similar way, but since it uses only three-bit patterns, mapping syndromes to correction masks can be done with three-input AND gates.
It sounded quite good indeed for the normal English stuff, but I guess predictably was quite bad at the domain-specific words. It misspoke "SECDED", had wrong emphasis on "syndromes", and pronounced "AND gates" like "and gates".
Could you give some examples of what kind of preprocessing would help in this case? I tried some local LLMs, but they didn't do a good job (maybe my prompts sucked).
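One non-LLM option is a rule-based pass that spells out acronyms and digits before handing text to the model. Everything below is a toy sketch: the digit-by-digit number speller is deliberately crude, and a real pipeline might use something like num2words (which the package already pulls in) for proper wording:

```python
import re

ONES = "zero one two three four five six seven eight nine".split()

def spell_number(n: int) -> str:
    """Toy number speller: reads digits out one by one.
    A real pipeline would use num2words or similar for natural wording."""
    return " ".join(ONES[int(d)] for d in str(n))

def preprocess(text: str) -> str:
    # Space out all-caps acronyms so they are read letter by letter.
    text = re.sub(r"\b([A-Z]{2,})\b", lambda m: " ".join(m.group(1)), text)
    # Replace digit runs with spelled-out words.
    text = re.sub(r"\d+", lambda m: spell_number(int(m.group())), text)
    return text

print(preprocess("SECDED uses 135 check bits"))
# → "S E C D E D uses one three five check bits"
```

This would also catch "AND" in "AND gates" (reading it as letters), so real rules need a whitelist of words that only look like acronyms.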
I'm not sure if you're misspelling it deliberately or not, but the word you're looking for is "pronounce" and its verb form "pronouncing", as in "It just has issues pronouncing numbers" and "I didn't expect it to pronounce 'ms' correctly."
On macOS, it's a markedly different experience: it's only ~700 MiB there; I'm assuming b/c no NVIDIA libs get pulled in, b/c why would they.
For anyone who might want to play around with this: I can get down to ~3 GiB (& about 1.3 GiB if you wipe your uv cache afterwards) on Linux if I add the following to the end of `pyproject.toml`:
[tool.uv.sources]
# This tells uv to use the specific index for torch, torchvision, and torchaudio
torch = [
{index = "pytorch-cpu"}
]
torchvision = [
{index = "pytorch-cpu"}
]
torchaudio = [
{index = "pytorch-cpu"}
]
[[tool.uv.index]]
name = "pytorch-cpu"
url = "https://download.pytorch.org/whl/cpu"
& add "torch" to the direct dependencies, b/c otherwise it seems like uv is ignoring the source? (… which of course downloads a CPU-only torch.)This is an example of what one sees under Linux:
nvidia-nvjitlink-cu12 ------------------------------ 23.83 MiB/37.44 MiB
nvidia-curand-cu12 ------------------------------ 23.79 MiB/60.67 MiB
nvidia-cuda-nvrtc-cu12 ------------------------------ 23.87 MiB/83.96 MiB
nvidia-nvshmem-cu12 ------------------------------ 23.62 MiB/132.66 MiB
triton ------------------------------ 23.82 MiB/179.55 MiB
nvidia-cufft-cu12 ------------------------------ 23.76 MiB/184.17 MiB
nvidia-cusolver-cu12 ------------------------------ 23.84 MiB/255.11 MiB
nvidia-cusparselt-cu12 ------------------------------ 23.99 MiB/273.89 MiB
nvidia-cusparse-cu12 ------------------------------ 23.96 MiB/274.86 MiB
nvidia-nccl-cu12 ------------------------------ 23.79 MiB/307.42 MiB
nvidia-cublas-cu12 ------------------------------ 23.73 MiB/566.81 MiB
nvidia-cudnn-cu12 ------------------------------ 23.56 MiB/674.02 MiB
torch ------------------------------ 23.75 MiB/873.22 MiB
That's not all the libraries, either, but you can see NVIDIA here is easily over 1 GiB.

It also then crashes for me, with:
File "KittenTTS/.venv/lib/python3.14/site-packages/pydantic/v1/fields.py", line 576, in _set_default_and_type
raise errors_.ConfigError(f'unable to infer type for attribute "{self.name}"')
pydantic.v1.errors.ConfigError: unable to infer type for attribute "REGEX"
Which seems to be [this bug in spaCy](https://github.com/explosion/spaCy/issues/13895), so I'm going to have to try adding `<3.14` to `requires-python` in `pyproject.toml` too, I think. That is, for anyone wanting to try this out:
-requires-python = ">=3.8"
+requires-python = ">=3.8,<3.14"
(This isn't really something KittenTTS should have to do, since this is a bug in spaCy … and ideally, at some point, spaCy will fix it.)

Also:
+ curated-tokenizers==0.0.9
This version is so utterly ancient that there aren't wheels for it anymore, so that means a loooong wait while this builds. It's pulled in via misaki, and my editor says your one import of misaki is unused.

Hilariously, removing it breaks things, but only on my macOS machine. I think you're using it solely for the side effect that it tweaks phonemizer to use espeakng, but you can just do that tweak yourself, & then I think that dependency can be dropped. That drops a good number of dependencies & really speeds up the installation, since we're not compiling a bunch of stuff.
You need to add `phonemizer-fork` to your dependencies. (If you remove misaki, you'll find this missing.)
Huge fan of Ava multilingual, and hopefully there are many others with similar taste, so my feedback might shape things in a halfway decent direction, at least for some.
btw, use case is most often to listen to news/articles.
No need to DM me, just post on HN or /r/LocalLLama and I'll catch wind of it.
Thanks for your work!
Downloading https://github.com/KittenML/KittenTTS/releases/download/0.8.1/kittentts-0.8.1-py3-none-any.whl (22 kB)
Collecting num2words (from kittentts==0.8.1)
Using cached num2words-0.5.14-py3-none-any.whl.metadata (13 kB)
Collecting spacy (from kittentts==0.8.1)
Using cached spacy-3.8.11-cp314-cp314-win_amd64.whl.metadata (28 kB)
Collecting espeakng_loader (from kittentts==0.8.1)
Using cached espeakng_loader-0.2.4-py3-none-win_amd64.whl.metadata (1.3 kB)
INFO: pip is looking at multiple versions of kittentts to determine which version is compatible with other requirements. This could take a while.
ERROR: Ignored the following versions that require a different python version: 0.7.10 Requires-Python >=3.8,<3.13; 0.7.11 Requires-Python >=3.8,<3.13; 0.7.12 Requires-Python >=3.8,<3.13; 0.7.13 Requires-Python >=3.8,<3.13; 0.7.14 Requires-Python >=3.8,<3.13; 0.7.15 Requires-Python >=3.8,<3.13; 0.7.16 Requires-Python >=3.8,<3.13; 0.7.17 Requires-Python >=3.8,<3.13; 0.7.5 Requires-Python >=3.8,<3.13; 0.7.6 Requires-Python >=3.8,<3.13; 0.7.7 Requires-Python >=3.8,<3.13; 0.7.8 Requires-Python >=3.8,<3.13; 0.7.9 Requires-Python >=3.8,<3.13; 0.8.0 Requires-Python >=3.8,<3.13; 0.8.1 Requires-Python >=3.8,<3.13; 0.8.2 Requires-Python >=3.8,<3.13; 0.8.3 Requires-Python >=3.8,<3.13; 0.8.4 Requires-Python >=3.8,<3.13; 0.9.0 Requires-Python >=3.8,<3.13; 0.9.2 Requires-Python >=3.8,<3.13; 0.9.3 Requires-Python >=3.8,<3.13; 0.9.4 Requires-Python >=3.8,<3.13; 3.8.3 Requires-Python >=3.9,<3.13; 3.8.5 Requires-Python >=3.9,<3.13; 3.8.6 Requires-Python >=3.9,<3.13; 3.8.7 Requires-Python >=3.9,<3.14; 3.8.8 Requires-Python >=3.9,<3.14; 3.8.9 Requires-Python >=3.9,<3.14
ERROR: Could not find a version that satisfies the requirement misaki>=0.9.4 (from kittentts) (from versions: 0.1.0, 0.3.0, 0.3.5, 0.3.9, 0.4.0, 0.4.4, 0.4.5, 0.4.6, 0.4.7, 0.4.8, 0.4.9, 0.5.0, 0.5.1, 0.5.2, 0.5.3, 0.5.4, 0.5.5, 0.5.6, 0.5.7, 0.5.8, 0.5.9, 0.6.0, 0.6.1, 0.6.2, 0.6.3, 0.6.4, 0.6.5, 0.6.6, 0.6.7, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.7.4)
ERROR: No matching distribution found for misaki>=0.9.4
I realize that I can run multiple versions of Python on my system and use venv to manage them (or whatever equivalent is now trendy), but as I near retirement age, all these deep dependency nets required by modern software really depress me. Have you ever tried to build a node app that hasn't been updated in 18 months? It can't be done. Old man yelling at cloud, I guess. shrugs
> - As a result,
> - When the string "明日["tomorrow"]" is entered into TTS, the TTS model [・皿・] outputs an ambiguous pronunciation that sounds like a mix of "asu" and "ashita" (something like "[asyeta]").
> From this, we found that by using the proposed method, it is possible to obtain data from private data in which the consistency between speech, graphemes, and phonemes is almost certainly maintained for more than 80% of the total.
> Another possible cause is a mismatch between the domain of the training data's audio (all [in read-aloud tones]) and the inference domain.
My resultant rambling follows:
1. Sounds like the general state of Japanese speech datasets is a mess
1.1. they don't maintain great useful correspondence between symbols to audio
1.2. they tend to contain too many "transatlantic" voices and not enough casual speech
2. Japanese speakers generally don't denote pronunciations for text
2.1. therefore web crawls might not contain enough information as to how they're actually pronounced
2.2. (potentially) there could be some texts that don't map to pronunciations
2.3. (potentially) maybe Japanese spoken and written languages are still a bit divergent from each other
3. The situation for Chinese/Sinitic languages is likely __nowhere__ near as absurd, and so Chinese STT/TTS might not be well equipped to deal with this mess
4. This feels like a much deeper mess than the commonly observed "a cloud in a sky" Japanese TTS problems, such as obvious basic alignment errors (e.g. pronouncing "potatoes" as "tato chi")
---
0: https://xkcd.com/1425/
1: https://zenn.dev/parakeet_tech/articles/2591e71094ea58
2: https://qiita.com/maishikawa/items/dcadfeebf693080f0415

BTW, it seems that kitten (the Python package) has the following chain of dependencies: kittentts → misaki[en] → spacy-curated-transformers
So if you install it directly via uv, it will pull torch and NVIDIA CUDA packages (several GB), which are not needed to run kitten.
In case it helps anyone else, the first time I tried to run purr I got "OSError: PortAudio library not found". Installing libportaudio (apt install libportaudio2) got it running.
text ="""
Hello world. This is Kitten TTS.
Look, it's working!
"""
voice = 'Luna'
On macOS, I get "Kitten TTS", but on Linux, I get "Kit… TTS". Both OSes generate the same phonemes: ðɪs ɪz kˈɪʔn ̩ tˌiːtˌiːˈɛs,
which makes me really confused as to where it's going off the rails on Linux, since from there it should just be invoking the model.

Edit: it really helps to use the same model, facepalm. It's the 80M model, and it happens on both OSes. Wildly, the nano gets it better? I'm going to join the Discord lol.
For some insight into the original question, take a look at the Debian ML policy:
Maybe a dumb and slightly tangential question, (I don't mean this as a criticism!) but why not release a command line executable?
Even the API looks like what you'd see in a manpage.
I get that it wouldn't be too much work for a user to actually make something like that; I'm just curious what the thought process is.
People from outside the UK often use British as synonymous with English, and in the context of accents, often a South East English accent or some sort of Received Pronunciation (RP) accent. Technically a "British" accent could be from anywhere in England, Scotland, or Wales, and therefore by extension might not even be the English language.
While I'm here, since it's generally confusing, the UK is Great Britain and Northern Ireland. Great Britain is England, Scotland, and Wales.
So being factually correct doesn't really matter. Nobody cares and nobody wants to learn so I adapt for them.
In the same way I almost exclusively write with American spelling now. Life is just easier when you stop fighting.
At 25MB you can actually bundle it with the app. Going to test whether this works in a Vercel Edge Function context -- if latency is acceptable there it opens up a lot of use cases that currently require a round-trip to a hosted API.
curious about the latency characteristics though. 1.5x realtime on a 9700 is fine for batch processing but for interactive use you need first-chunk latency under 200ms or the conversation feels broken. does anyone know if it supports streaming output or is it full-utterance only?
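If the model turns out to be full-utterance only, one common workaround is to split the input at sentence boundaries and synthesize the first chunk immediately while the rest is queued. A minimal sketch of the splitter (the model call itself is omitted, since whether kitten exposes a streaming API is exactly the open question):

```python
import re

def sentence_chunks(text):
    """Split text at sentence boundaries so the first chunk can start
    playing while later chunks are still being synthesized."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

chunks = sentence_chunks("Hello there. How are you today? Fine.")
print(chunks)  # → ['Hello there.', 'How are you today?', 'Fine.']
# Each chunk would then be fed to the TTS model in turn; first-chunk
# latency becomes the time to synthesize one sentence, not the whole text.
```

This doesn't help with a single long sentence, of course, and naive splitting can hurt prosody across chunk boundaries.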
the phoneme-based approach should help with pronunciation consistency too. the models i've tried that work on raw text tend to mispronounce technical terms unpredictably — same word pronounced differently across runs.
my current best approach is wrapping gemini-flash native and having the model speak the text I send it, which gets me end-to-end latency under a second.
are there other models at this or better pricing I could be looking at?
I'm really curious: how does the inference speed of these <25MB models look on consumer GPUs? Also, are these models deterministic, or do they have a stochastic nature where you need to generate multiple takes to get the best prosody?