Open-Source Text-to-Speech Models Compared: Kokoro, SparkTTS, F5-TTS, and Sesame CSM
Large language modeling has, for good reason, become one of the most visible and impactful outcomes of the AI era. These models have unlocked a wide range of applications across many domains, including informative chatbots, capable agents, and broad text generation tasks. As a result, there has been an ongoing effort to combine additional modalities with the strengths of these models. From visual understanding to function execution to speech synthesis, the goal has been to make them more connected and more practical.
One especially compelling use case for large language models is producing long-form text intended for audio-based content, such as podcasts, scripts, or even complete narratives. That naturally leads to an interesting question: can AI generate speech that sounds genuinely human?
In this article, we review four of the strongest open-source text-to-speech (TTS) models. More specifically, we compare how well F5-TTS, Kokoro, SparkTTS, and the recently released Sesame perform when generating a paragraph of spoken audio. We evaluate them qualitatively based on how closely the speech matches the input text, as well as how well they handle punctuation and pauses. Together, these tests aim to provide a practical answer regarding which model may be best suited to different use cases. We also point out where certain models are faster than others, although nearly all of them are extremely fast.
Kokoro
Kokoro is the first TTS model covered in this review. There is limited background information available about Kokoro because no research paper has been released for it. Most of what is known comes from its Hugging Face model card, which indicates that the architecture is based on ideas from StyleTTS2.
Kokoro is a very lightweight TTS model released under the Apache 2.0 license. With just 82 million parameters, it can be deployed in many different settings, including production systems, edge environments, and personal projects. The model supports multiple languages, including Japanese, Hindi, and Mandarin Chinese. One notable limitation is that it does not offer native voice cloning in the same way as some of the other models discussed here. Instead, it provides a collection of curated voices, distributed as saved voice tensors, for users to select from. These voices are well prepared and work effectively.
Kokoro was trained entirely on permissively sourced audio: public-domain recordings and audio distributed under permissive licenses such as Apache and MIT. Interestingly, the total amount of training audio was under one thousand hours. Because of this relatively modest dataset, the model was reportedly trained for around 1,000 USD on NVIDIA A100 GPUs, which is a notable achievement at a time when training large language, image, and video models is often extremely expensive.
Running Kokoro TTS on a GPU Instance
Kokoro TTS is efficient enough in Python that, when paired with a powerful GPU-based cloud instance, it can generate speech faster than the audio can actually be spoken. To test this, you can launch a GPU-enabled virtual machine and prepare an environment with access to JupyterLab. A more detailed setup guide can be followed separately if needed.
Once the instance is ready, begin by cloning the repository to the machine. After that, install the required packages inside a virtual environment and launch the web-based GUI included with the project so the TTS models can be served. The following commands can be used in the terminal:
git clone https://github.com/hexgrad/kokoro
cd kokoro
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
cd demo/
python app.py --share
This starts a Gradio-based web application that serves the Kokoro TTS models. From there, users can choose from a broad selection of voices for speech generation, including both male and female options. The interface also includes sample content generators, such as a random quote generator and book quote generators based on Frankenstein and The Great Gatsby. It is worth experimenting with the available voices to compare how they sound.
In testing, Kokoro produced a 30-second speech sample based on the third paragraph of this article in under a second. The generated audio was high quality, with almost no distortion, and could be produced either all at once or via streaming. The model handled punctuation and pauses very well, and the voice itself sounded largely natural. The main drawback was that the delivery still felt somewhat stiff and lacking in emotion, which was enough to give away that the audio had been generated by AI.
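A common way to express "faster than the audio can be spoken" is the real-time factor (RTF): generation time divided by the duration of the generated audio, where values below 1.0 mean faster than real time. The following is a minimal sketch of that arithmetic. The function names and the stand-in synthesizer are hypothetical and are not part of Kokoro; the 0.9-second figure is only an illustrative value for "under a second".

```python
import time
from typing import Callable, Sequence


def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Generation time divided by audio duration (below 1.0 is faster than real time)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


def measure_rtf(
    synthesize: Callable[[str], Sequence[float]], text: str, sample_rate: int
) -> float:
    """Time one synthesis call and compare it with the length of the returned audio."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate)


def fake_synthesize(text: str) -> Sequence[float]:
    """Stand-in for a real TTS call (assumption): one second of silence at 24 kHz."""
    return [0.0] * 24_000


# Illustrative numbers only: a 30-second clip generated in 0.9 seconds.
print(f"RTF = {real_time_factor(0.9, 30.0):.3f}")
print(measure_rtf(fake_synthesize, "Hello", sample_rate=24_000) < 1.0)
```

To measure a real model, `fake_synthesize` would be replaced with a function that calls the model and returns its audio samples.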
Kokoro is clearly a strong TTS system, though it does not include some of the features offered by other models in this review. In particular, it cannot clone a voice directly from an audio sample. F5-TTS, covered later in this review, performs especially well in that area.
SparkTTS
The next TTS system in this comparison is SparkTTS. Spark is built around a new approach called BiCodec, a single-stream speech codec that breaks speech into two complementary token types: low-bitrate semantic tokens for linguistic meaning and fixed-length global tokens for speaker identity. Unlike more traditional approaches, this two-part encoding is designed to help the generated audio more closely resemble the reference speech sample.
This separated representation, together with the Qwen2.5 large language model and a chain-of-thought generation strategy, makes it possible to control both broad characteristics such as gender and speaking style, as well as fine-grained details such as exact pitch values and speaking rate. According to the reported experiments, Spark-TTS achieves state-of-the-art zero-shot voice cloning while also allowing highly customizable voice generation that goes beyond the limits of reference-based synthesis. (Source)
Run SparkTTS
Like Kokoro, SparkTTS includes a convenient web demo that makes testing easier. Use the following commands to prepare the environment, download the pretrained model files, and launch the web interface:
git clone https://github.com/SparkAudio/Spark-TTS
cd Spark-TTS/
pip install -r requirements.txt
mkdir pretrained_models
apt-get install git-lfs
cd pretrained_models/
git-lfs clone https://huggingface.co/SparkAudio/Spark-TTS-0.5B
cd ..
python webui.py --device 0
After the web demo has started, testing can begin with custom audio samples. Recording a short sample of around ten seconds using your own voice is a practical way to evaluate the model directly. In these experiments, Spark performed worse than Kokoro in nearly every category. Generation often took more than ten seconds, voice cloning was unreliable and frequently introduced unexpected regional accents, and the handling of punctuation and pauses was weak, often producing long gaps between sentences. Although Spark is promising because of its novel architecture, this release does not yet appear to match the quality of other state-of-the-art TTS models.
F5-TTS
Next is F5-TTS, which stands out as the favorite among the models reviewed here. F5 builds upon the earlier E2 model, which was already a strong TTS system. In simple terms, both are designed as fully non-autoregressive text-to-speech systems based on flow matching with a Diffusion Transformer (DiT). Rather than depending on complex components such as a duration model, text encoder, or phoneme alignment module, the text input is padded with filler tokens until it matches the length of the input speech, and denoising is then used to generate the audio. (Source)
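The filler-token padding described above can be illustrated with a few lines of Python. This is only a sketch under stated assumptions: the filler symbol `<F>`, the character-level tokenization, and the function name are hypothetical choices for illustration and are not taken from the E2 or F5 code.

```python
from typing import List

FILLER = "<F>"  # hypothetical filler symbol, chosen for this sketch


def pad_text_to_frames(text: str, num_frames: int) -> List[str]:
    """Extend a character sequence with filler tokens up to the speech length.

    num_frames stands for the number of mel spectrogram frames in the speech
    sequence the text must line up with.
    """
    chars = list(text)
    if len(chars) > num_frames:
        raise ValueError("text is longer than the target speech length")
    return chars + [FILLER] * (num_frames - len(chars))


padded = pad_text_to_frames("hi there", num_frames=12)
print(len(padded))
print(padded)
```

Because the padded text and the speech frames now have the same length, the model can process them together without a separate duration or alignment module.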
F5 improves on this design by adding an initial stage that models the input with ConvNeXt and refines the text representation, making alignment with speech easier before the denoising stage is applied for speech generation.
During inference, the diffusion process is effectively reversed. To generate speech from the input content, the model begins with an audio prompt and its mel spectrogram features, along with a transcription and a text prompt describing the intended content. The audio prompt provides speaker characteristics, while the text prompt guides what should be said. The final speech is then generated through diffusion, using both the tokens and the audio prompt as conditioning signals.
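To make the inference step more concrete, here is a toy sketch of flow-matching-style sampling: start from noise and integrate a velocity field from t = 0 to t = 1 with fixed Euler steps. Everything here is an assumption for illustration. In a real system the velocity comes from the trained transformer conditioned on the text and audio prompt; in this sketch it is replaced by an oracle that already knows the target, and the "audio" is a three-element list.

```python
import random
from typing import Callable, List

Vector = List[float]


def euler_sample(
    velocity: Callable[[Vector, float], Vector], x0: Vector, steps: int
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Oracle stand-in for a trained model: points straight at a known target.

    For a straight noise-to-data path, the velocity at time t is
    (target - x) / (1 - t). A real model has to learn this from data.
    """
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(3)]   # starting point: pure noise
target = [0.5, -1.0, 2.0]                         # toy stand-in for mel features
result = euler_sample(make_toy_velocity(target), noise, steps=32)
print([round(value, 6) for value in result])
```

With the oracle velocity, the sampler lands on the target; the number of steps trades speed against accuracy once a learned, imperfect velocity is used instead.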
Run F5-TTS on a GPU Instance
Running F5-TTS on a GPU-based virtual machine follows a similar process to the earlier models. There are a few additional setup steps because extra packages need to be installed, but these are straightforward. Run the following commands in the terminal from the directory where the repository should be cloned:
git clone https://github.com/SWivid/F5-TTS.git
cd F5-TTS
pip install --upgrade pip
pip install ffmpeg-python
apt-get install ffmpeg
pip install -e .
f5-tts_infer-gradio
This launches the F5 Gradio inference interface. It is worth testing all of the available demos, including basic speech generation, multi-speaker speech generation, and voice chatting. The multi-speaker generation demo is especially impressive and interesting. Until recently, it could reasonably be considered the best model for that task.
Now let us move on to the TTS model that has been attracting a great deal of attention online lately: Sesame CSM.
Sesame CSM
Sesame Conversational Speech Model (CSM) is a multimodal system that works with both written text and spoken audio. It processes Residual Vector Quantization tokens, which encode semantic as well as acoustic information learned during training. The architecture is divided into two transformer components at the zeroth codebook. A multimodal backbone handles the combined text-audio input and predicts the zeroth codebook, while a separate audio decoder reconstructs speech by modeling the remaining N − 1 codebooks with individual linear heads. Since this decoder is considerably smaller than the backbone, CSM can produce speech quickly while still functioning as an end-to-end model.
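Residual Vector Quantization itself is easy to sketch. Each codebook quantizes whatever the previous codebooks left unexplained, so the zeroth codebook carries the coarsest information and later ones add detail. The toy codebooks, vector sizes, and function names below are assumptions for illustration only and do not reflect the actual CSM tokenizer.

```python
from typing import List, Sequence

Codebook = Sequence[Sequence[float]]


def nearest(codebook: Codebook, vector: Sequence[float]) -> int:
    """Index of the codebook entry with the smallest squared distance to vector."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - x) ** 2 for c, x in zip(codebook[i], vector)),
    )


def rvq_encode(vector: Sequence[float], codebooks: Sequence[Codebook]) -> List[int]:
    """Quantize vector in stages; each stage encodes the previous residual."""
    residual = list(vector)
    codes = []
    for codebook in codebooks:
        index = nearest(codebook, residual)
        codes.append(index)
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return codes


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Codebook]) -> List[float]:
    """Reconstruct by summing the selected entry from every codebook."""
    output = [0.0] * len(codebooks[0][0])
    for index, codebook in zip(codes, codebooks):
        output = [o + c for o, c in zip(output, codebook[index])]
    return output


# Toy codebooks: the zeroth is coarse, the next one refines the residual.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]
codes = rvq_encode([1.2, 0.9], codebooks)
print(codes)
print(rvq_decode(codes, codebooks))
```

In CSM's split, the backbone would be responsible for the first entry of `codes` and the smaller decoder for the rest.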
The official CSM website currently offers one of the strongest TTS demos available online. The model combines language-model-like knowledge with a noticeably natural speaking style. It is worth testing the public demo before running CSM on a GPU cloud environment, as the hosted version appears to be more heavily optimized than the current open-source release.
Run Sesame CSM on a GPU Instance
Running CSM through Python is recommended because it makes the outputs easy to customize. One option is to use a Jupyter Notebook, although a standard Python script can also be used. For convenience, the project already includes a demo script. The following commands install the required packages and set up Hugging Face access to the model:
git clone git@github.com:SesameAILabs/csm.git
cd csm
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
export NO_TORCH_COMPILE=1
# You will need access to CSM-1B and Llama-3.2-1B
huggingface-cli login
At this point, you will be prompted to provide your Hugging Face access token, which can be either a custom (fine-grained) or read-only token. Access must also be requested on the CSM-1B and Llama-3.2-1B model pages before the weights can be downloaded. Once that is done, the script can either be executed directly or edited with a text editor such as vim or nano. Testing the script first is the recommended approach.
python run_csm.py
This produces a file named full_conversation.wav containing audio for a conversation between two speakers. Those speakers can be replaced with custom audio samples to perform voice cloning. In these experiments, this was the strongest multi-speaker model currently available. However, it did not match F5 in overall perceived quality or word error rate during longer generations.
Choosing the Best Model for TTS
Selecting the best TTS model appears to depend on three primary factors: word error rate (WER), voice cloning capability, and acoustic tokenization of non-verbal vocal cues and tonal expression.
For the first of these, Kokoro and F5 clearly stood out in the experiments. Word error rate remained very low across all tests and throughout the demos they provided. Kokoro is the recommended option when voice cloning is not a requirement.
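For readers who want to quantify this rather than judge by ear, word error rate is the word-level edit distance between the input text and a transcript of the generated audio, divided by the number of words in the input. The sketch below is a minimal implementation under stated assumptions: the transcript would come from a separate speech recognition step that is not shown, and normalization is limited to lowercasing and whitespace splitting.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Dynamic programming over one row at a time.
    # previous[j] is the distance between the ref words seen so far and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1       # reference word missing from hypothesis
            insertion = current[j - 1] + 1   # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substitution (fox -> box) and one dropped word (jumps) out of five words.
print(word_error_rate("the quick brown fox jumps", "the quick brown box"))
```

A real evaluation would also strip punctuation and normalize numbers before comparing, since those differences otherwise count as errors.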
For voice cloning, F5 and Spark are the main candidates. F5 is the preferred choice over Spark because the generated speech quality is noticeably better and, in these experiments, its cloning was more consistent than Spark's, which sometimes introduced unexpected accents.
Finally, Sesame CSM is the clear leader when it comes to acoustic tokenization of non-verbal vocal signals and tonal nuance. The potential of CSM with further fine-tuning, as demonstrated in its online demo, is genuinely impressive. At the moment, however, the open-source version still does not reach the level set by F5 in terms of voice cloning and overall audio quality.
All of the models covered in this review are excellent TTS systems. Each one stands out in different areas, but overall, F5 remains the strongest recommendation as the best TTS model of the group.


