Vijona

3 Jun at 12:16

Open-Source Text-to-Speech Models Compared: Kokoro, SparkTTS, F5-TTS, and Sesame CSM

Large language modeling has, for good reason, become one of the most visible and impactful outcomes of the AI era. These models have unlocked a wide range of applications across many domains, including informative chatbots, capable agents, and broad text generation tasks. As a result, there has been an ongoing effort to combine additional modalities with the strengths of these models. From visual understanding to function execution to speech synthesis, the goal has been to make them more connected and more practical.

One especially compelling use case for large language models is producing long-form text intended for audio-based content, such as podcasts, scripts, or even complete narratives. That naturally leads to an interesting question: can AI generate speech that sounds genuinely human?

In this article, we review four of the strongest open-source text-to-speech (TTS) models. More specifically, we compare how well F5-TTS, Kokoro, SparkTTS, and the recently released Sesame perform when generating a paragraph of spoken audio. We evaluate them qualitatively based on how closely the speech matches the input text, as well as how well they handle punctuation and pauses. Together, these tests aim to provide a practical answer regarding which model may be best suited to different use cases. We also point out where certain models are faster than others, although nearly all of them are extremely fast.

Kokoro

Kokoro is the first TTS model covered in this review. There is limited background information available about Kokoro because no research paper has been released for it. Most of what is known comes from its Hugging Face model card, which indicates that the architecture is based on ideas from StyleTTS2.

Kokoro is a very lightweight TTS model released under the Apache 2.0 license. With just 82 million parameters, it can be deployed in many different settings, including production systems, edge environments, and personal projects. The model supports multiple languages, including Japanese, Hindi, and Thai. One notable limitation is that it does not offer native voice cloning in the same way as some of the other models discussed here. Instead, it provides a collection of curated voices, stored as voice tensors, for users to select from. These voices are well prepared and work effectively.

Kokoro was trained entirely on permissive or non-copyrighted audio, such as public-domain recordings and audio distributed under licenses like Apache and MIT. Interestingly, the total amount of training audio was under one thousand hours. Because of this relatively modest dataset, the model was reportedly trained for around 1,000 USD on NVIDIA A100 GPUs, which is a notable achievement at a time when training large language, image, and video models is often extremely expensive.

Running Kokoro TTS on a GPU Instance

Kokoro TTS is efficient enough in Python that, when paired with a powerful GPU-based cloud instance, it can generate speech faster than the audio can actually be spoken. To test this, you can launch a GPU-enabled virtual machine and prepare an environment with access to JupyterLab. A more detailed setup guide can be followed separately if needed.

Once the instance is ready, begin by cloning the repository to the machine. After that, install the required packages inside a virtual environment and launch the web-based GUI included with the project so the TTS models can be served. The following commands can be used in the terminal:

git clone https://github.com/hexgrad/kokoro
cd kokoro
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
cd demo/
python app.py --share

This starts a Gradio-based web application that serves the Kokoro TTS models. From there, users can choose from a broad selection of voices for speech generation, including both male and female options. The interface also includes sample content generators, such as a random quote generator and book quote generators based on Frankenstein and The Great Gatsby. It is worth experimenting with the available voices to compare how they sound.

In testing, Kokoro produced a 30-second speech sample based on the third paragraph of this article in under a second. The generated audio was high quality, with almost no distortion, and could be produced either all at once or via streaming. The model handled punctuation and pauses very well and sounded convincingly human. The main drawback was that the delivery still felt somewhat stiff and lacking in emotion, which made it apparent that the audio had been generated by AI.
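The speed observation above is often summarized as a real-time factor (RTF): generation time divided by the duration of the resulting audio, where values below 1.0 mean faster than real time. The following is a minimal sketch of that arithmetic. The function name is illustrative, and the 0.9-second figure is an assumed stand-in for "under a second", not a measured value.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Return generation time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


# Assumed example values: a 30-second clip generated in 0.9 seconds.
rtf = real_time_factor(generation_seconds=0.9, audio_seconds=30.0)
print(f"RTF = {rtf:.3f}")  # values below 1.0 mean faster than real time
```

To compare models on the same hardware, the same function can be applied to wall-clock timings taken around each model's generation call.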

Kokoro is clearly a strong TTS system, though it does not include some of the features offered by other models in this review. In particular, it cannot clone a voice directly from an audio sample. F5-TTS, covered later in this review, performs especially well in that area.

SparkTTS

The next TTS system in this comparison is SparkTTS. Spark is built around a new approach called BiCodec, a single-stream speech codec that breaks speech into two complementary token types: low-bitrate semantic tokens for linguistic meaning and fixed-length global tokens for speaker identity. Unlike more traditional approaches, this two-part token representation is designed to help the generated audio more closely resemble the reference speech sample.

This separated representation, together with the Qwen2.5 large language model and a chain-of-thought generation strategy, makes it possible to control both broad characteristics such as gender and speaking style, as well as fine-grained details such as exact pitch values and speaking rate. According to the reported experiments, Spark-TTS achieves state-of-the-art zero-shot voice cloning while also allowing highly customizable voice generation that goes beyond the limits of reference-based synthesis. (Source)

Run SparkTTS

Like Kokoro, SparkTTS includes a convenient web demo that makes testing easier. Use the following commands to prepare the environment, download the pretrained model files, and launch the web interface:

git clone https://github.com/SparkAudio/Spark-TTS
cd Spark-TTS/
pip install -r requirements.txt
mkdir pretrained_models
apt-get install git-lfs
cd pretrained_models/
git-lfs clone https://huggingface.co/SparkAudio/Spark-TTS-0.5B
cd ..
python webui.py --device 0

After the web demo has started, testing can begin with custom audio samples. Recording a short sample of around ten seconds using your own voice is a practical way to evaluate the model directly. In these experiments, Spark performed worse than Kokoro in nearly every category. Generation often took more than ten seconds, voice cloning was unreliable and frequently introduced unexpected regional accents, and the handling of punctuation and pauses was weak, often producing long gaps between sentences. Although Spark is promising because of its novel architecture, this release does not yet appear to match the quality of other state-of-the-art TTS models.
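The pause behavior described above was judged by ear, but a rough number can help when comparing outputs. The sketch below is one hypothetical way to measure the longest stretch of near-silence in a mono waveform. The function name, the amplitude threshold, and the synthetic test signal are all assumptions made for illustration and are not part of any of the reviewed projects.

```python
from typing import Sequence


def longest_silence_seconds(
    samples: Sequence[float], sample_rate: int, threshold: float = 0.01
) -> float:
    """Return the duration of the longest run of samples with |amplitude| < threshold."""
    longest = 0
    current = 0
    for sample in samples:
        if abs(sample) < threshold:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest / sample_rate


# Synthetic signal (made-up data): speech-like level, a 1.2 s gap, then speech again.
sample_rate = 1000
signal = [0.5] * 500 + [0.0] * 1200 + [0.5] * 300
gap = longest_silence_seconds(signal, sample_rate)
print(f"Longest pause: {gap:.2f} s")
```

With real audio, the samples would first be loaded from the generated file and normalized to the range -1.0 to 1.0, and the threshold would likely need tuning per model.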

F5-TTS

Next is F5-TTS, which stands out as the favorite among the models reviewed here. F5 builds upon the earlier E2 TTS model, which was already a strong TTS system. In simple terms, both are designed as fully non-autoregressive text-to-speech systems based on flow matching with a Diffusion Transformer (DiT). Rather than depending on complex components such as a duration model, text encoder, or phoneme alignment module, the text input is padded with filler tokens until it matches the length of the input speech, and denoising is then used to generate the audio. (Source)
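The filler-token idea can be illustrated in a few lines. This is a minimal sketch, assuming character-level tokens and a made-up filler symbol; the real models use their own vocabulary and operate on batched tensors rather than Python lists.

```python
from typing import List

FILLER = "<F>"  # hypothetical filler symbol, not the token used by the real models


def pad_text_to_frames(text: str, num_frames: int, filler: str = FILLER) -> List[str]:
    """Split text into characters and append filler tokens up to num_frames."""
    tokens = list(text)
    if len(tokens) > num_frames:
        raise ValueError("text is longer than the target number of frames")
    return tokens + [filler] * (num_frames - len(tokens))


# Assume the target speech spans 12 frames; the 8-character text gets 4 fillers.
padded = pad_text_to_frames("hi there", num_frames=12)
print(padded)
```

Because the padded text and the speech representation now have the same length, the model can learn the alignment between them implicitly instead of relying on a separate duration or alignment module.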

F5 improves on this design by adding an initial stage that models the input with ConvNeXt and refines the text representation, making alignment with speech easier before the denoising stage is applied for speech generation.

During inference, the diffusion process is effectively reversed. To generate speech from the input content, the model begins with an audio prompt and its mel spectrogram features, along with a transcription and a text prompt describing the intended content. The audio prompt provides speaker characteristics, while the text prompt guides what should be said. The final speech is then generated through diffusion, using both the tokens and the audio prompt as conditioning signals.
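In flow-matching models, generation amounts to integrating a learned velocity field from random noise toward a data sample. The toy sketch below shows only that sampling loop with fixed Euler steps. The trained network is replaced by a closed-form velocity that points at a known target vector, and the text and audio conditioning are omitted entirely, so every name here is illustrative and none of it is F5-TTS code.

```python
import random
from typing import Callable, List

Vector = List[float]


def euler_sample(
    velocity: Callable[[Vector, float], Vector], x0: Vector, steps: int
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Stand-in for the trained model: a velocity that always points at a fixed target.
target = [0.5, -1.0, 2.0]


def toy_velocity(x: Vector, t: float) -> Vector:
    # t stays below 1.0 inside euler_sample, so the division is safe.
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]


random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in target]
sample = euler_sample(toy_velocity, noise, steps=16)
print(sample)  # ends at the target, up to floating-point rounding
```

In a real system the velocity would come from the DiT, evaluated on the noisy mel spectrogram together with the padded text and the audio prompt, and the number of steps trades speed against quality.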

Run F5-TTS on a GPU Instance

Running F5-TTS on a GPU-based virtual machine follows a similar process to the earlier models. There are a few additional setup steps because extra packages need to be installed, but these are straightforward. Use the following commands in the terminal inside the target repository location:

git clone https://github.com/SWivid/F5-TTS.git
cd F5-TTS
pip install --upgrade pip
pip install ffmpeg-python
apt-get install ffmpeg
pip install -e .
f5-tts_infer-gradio

This launches the F5 Gradio inference interface. It is worth testing all of the available demos, including basic speech generation, multi-speaker speech generation, and voice chatting. The multi-speaker generation demo is especially impressive and interesting. Until recently, it could reasonably be considered the best model for that task.

Now let us move on to the TTS model that has been attracting a great deal of attention online lately: Sesame CSM.

Sesame CSM

Sesame Conversational Speech Model (CSM) is a multimodal system that works with both written text and spoken audio. It processes Residual Vector Quantization tokens, which encode semantic as well as acoustic information learned during training. The architecture is divided into two transformer components at the zeroth codebook. A multimodal backbone handles the combined text-audio input and predicts the zeroth codebook, while a separate audio decoder reconstructs speech by modeling the remaining N − 1 codebooks with individual linear heads. Since this decoder is considerably smaller than the backbone, CSM can produce speech quickly while still functioning as an end-to-end model.
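Residual Vector Quantization can be sketched generically: each codebook quantizes whatever the previous codebooks left unexplained, so the zeroth index carries the coarsest information and later indices add detail. The example below is a toy, pure-Python illustration with made-up two-dimensional codebooks. It is not Sesame's tokenizer, and all names and values are assumptions.

```python
from typing import List, Sequence

Vector = List[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Index of the codebook entry closest to x in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], x)),
    )


def rvq_encode(codebooks: Sequence[Sequence[Vector]], x: Vector) -> List[int]:
    """Quantize x level by level, passing the residual on to the next codebook."""
    residual = list(x)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(codebooks: Sequence[Sequence[Vector]], indices: Sequence[int]) -> Vector:
    """Reconstruct the vector by summing the selected entry from each codebook."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy codebooks with made-up values: a coarse zeroth level and one finer level.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],    # zeroth codebook
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],  # first residual codebook
]
frame = [1.3, 0.9]
codes = rvq_encode(codebooks, frame)
approx = rvq_decode(codebooks, codes)
print(codes, approx)
```

In the split described above, the backbone would be responsible for the first entry of `codes` and the smaller audio decoder for the remaining entries.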

The official CSM website currently offers one of the strongest TTS demos available online. The model combines language-model-like knowledge with a noticeably natural speaking style. It is worth testing the public demo before running CSM on a GPU cloud environment, as the hosted version seems to be more heavily optimized than the current open-source release.

Run Sesame CSM on a GPU Instance

It is recommended to run CSM through Python in order to customize the possible outputs. One option is to use a Jupyter Notebook, although a standard Python script can also be used. For convenience, the project already includes a demo script. The required packages can be installed and model access from Hugging Face can be prepared with the following commands:

git clone git@github.com:SesameAILabs/csm.git
cd csm
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
export NO_TORCH_COMPILE=1
# You will need access to CSM-1B and Llama-3.2-1B
huggingface-cli login

At this point, you will be prompted to provide a Hugging Face access token (either a custom or a read-only token). Access must also be requested on the CSM-1B and Llama-3.2-1B model pages. Once that is done, the script can either be executed directly or edited with a text editor such as vim or nano. Testing the script first is the recommended approach.

python run_csm.py

This produces a file named full_conversation.wav containing audio for a conversation between two speakers. Those speakers can be replaced with custom audio samples to perform voice cloning. In these experiments, this was the strongest multi-speaker model currently available. However, it did not match F5 in overall perceived quality or word error rate during longer generations.

Choosing the Best Model for TTS

Selecting the best TTS model appears to depend on three primary factors: word error rate (WER), voice cloning capability, and acoustic tokenization of non-verbal vocal cues and tonal expression.

For the first of these, Kokoro and F5 clearly stood out in the experiments. Word error rate remained very low across all tests and throughout the demos they provided. Kokoro is the recommended option when voice cloning is not a requirement.
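Word error rate is the word-level edit distance between a reference transcript and a transcript of the generated audio, divided by the number of reference words. The sketch below is a minimal implementation of that definition. It assumes simple lowercase whitespace tokenization and that a transcript of the generated speech is already available, for example from a speech recognizer; the example sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "can ai generate speech that sounds genuinely human"
hypothesis = "can ai generate speech that sound human"
wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.2f}")  # one substitution and one deletion over 8 words
```

Punctuation and casing choices affect the score, so the same normalization should be applied to every model being compared.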

For voice cloning, F5 and Spark both support cloning from a short reference sample. F5 is the preferred choice over Spark because its generated speech quality is noticeably better and, in the experiments above, Spark's cloning was less reliable.

Finally, Sesame CSM is the clear leader when it comes to acoustic tokenization of non-verbal vocal signals and tonal nuance. The potential of CSM with further fine-tuning, as demonstrated in its online demo, is genuinely impressive. At the moment, however, the open-source version still does not reach the level set by F5 in terms of voice cloning and overall audio quality.

All of the models covered in this review are excellent TTS systems. Each one stands out in different areas, but overall, F5 remains the strongest recommendation as the best TTS model of the group.

Source: digitalocean.com
