Building a Real-Time Speech Translation Pipeline with Open-Source AI

Real-time speech translation has been one of the most exciting ambitions in deep learning since the technology gained major momentum in the early 2020s. Inspired by science-fiction concepts such as the universal translator in Star Trek, the idea of instantly translating spoken language has always felt both fascinating and highly valuable from a business perspective. Even purely from an efficiency standpoint, real-time translation can dramatically speed up multilingual communication and business interactions.

After years of progress in text-to-speech, translation language models, and automatic speech recognition systems, it is now possible to build a real-time speech translation workflow using open-source technologies. In this tutorial, we walk through a Python-based translation pipeline that can operate faster than speech. The setup combines Whisper ASR, the Hunyuan MT translation model, and the Soprano 80M TTS system. We then conclude by showing how these three components can be connected inside a Gradio application to create a functional real-time translated speech demo.

Key Takeaways

End-to-end real-time speech translation is now achievable with open-source tools: By combining Whisper Large-v3 for automatic speech recognition, Hunyuan MT for translation, and Soprano 80M for text-to-speech, developers can create a pipeline that transcribes, translates, and speaks audio in under a second for several seconds of input when using modern GPU hardware.

Model efficiency is just as important as accuracy: Whisper provides strong multilingual zero-shot speech recognition, Hunyuan MT delivers high-quality translation in both a 1.8B and a 7B parameter size, and Soprano offers extremely low-latency speech synthesis. Together, these efficient model choices make real-time performance possible without relying on proprietary platforms.

Production-style demos can be created with limited integration code: A compact Python application wrapped in a Gradio interface is enough to coordinate ASR, translation, and TTS. This shows how accessible real-time speech translation has become for developers and organizations.

Whisper Large v3

Whisper is an advanced automatic speech recognition and speech translation model introduced by OpenAI in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford and others. It was trained on more than five million hours of labeled audio and demonstrates strong zero-shot performance across many datasets, languages, and real-world use cases.

Whisper Large-v3 supports more than 99 languages and includes targeted improvements compared with Whisper Large v2. These include a dedicated Cantonese token and notably better performance, especially for English. The model handles accents, background noise, and technical vocabulary more effectively, reducing errors by around 10–20% compared with v2 while also improving processing speed across varied audio environments and many languages, including Afrikaans, Arabic, Chinese, French, German, Hindi, Japanese, Spanish, and many others.
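To make the "10–20% fewer errors" figure concrete, here is a minimal sketch that assumes the errors are measured as word error rate (WER), the usual ASR metric. The function names and the sample WER values are illustrative assumptions, not figures from the Whisper evaluation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(old_rate: float, new_rate: float) -> float:
    """Relative improvement when an error rate drops from old_rate to new_rate."""
    return (old_rate - new_rate) / old_rate


# One substituted word out of six reference words
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"WER: {wer:.3f}")

# Hypothetical example: a drop from 10.0% to 8.5% WER is a 15% relative reduction
print(f"Relative reduction: {relative_reduction(0.10, 0.085):.0%}")
```

Note that a relative reduction of 15% is not the same as 15 percentage points: in the hypothetical example the absolute WER only moves by 1.5 points.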

With Whisper Large v3 as the ASR foundation, spoken input from almost any source can be accurately converted into usable text for the later translation and text-to-speech stages.

Hunyuan MT

Hunyuan Translation Model version 1.5 includes two model variants: HY-MT1.5-1.8B with 1.8 billion parameters and HY-MT1.5-7B with 7 billion parameters. Both are designed for bidirectional translation across 33 languages and also support five ethnic and dialect variants.
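The demo listing later in this article selects the target language through the wording of the prompt ("Translate the following segment into English, without additional explanation."). As a minimal sketch, that prompt can be parameterized so the same code serves other target languages. The helper name `build_translation_messages` is hypothetical, and whether a given language name is accepted depends on the model's supported languages.

```python
def build_translation_messages(text: str, target_language: str = "English") -> list[dict]:
    """Build a single-turn chat message using the prompt wording from the demo listing."""
    text = text.strip()
    if not text:
        raise ValueError("nothing to translate: text is empty")

    prompt = (
        f"Translate the following segment into {target_language}, "
        "without additional explanation.\n\n"
        f"{text}"
    )
    return [{"role": "user", "content": prompt}]


messages = build_translation_messages("Guten Morgen, wie geht es Ihnen?", "French")
print(messages[0]["content"])
```

The returned list has the same shape as the `messages` variable in the listing, so it could be passed to `tokenizer.apply_chat_template` in its place.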

HY-MT1.5-7B is based on Tencent’s WMT25 championship model and is optimized for explanatory translation and mixed-language scenarios. It also supports features such as terminology control, contextual translation, and structured output formatting.

Although the 1.8B model has fewer than one-third of the parameters of the 7B version, it provides comparable translation quality with significantly higher speed. After quantization, it can also be deployed on edge devices for real-time translation use cases, making it flexible and broadly applicable.
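A rough back-of-the-envelope calculation shows why quantization matters for edge deployment. The sketch below estimates weight storage only, assuming a fixed number of bytes per parameter; activations, the KV cache, and framework overhead are ignored, so real memory use will be higher.

```python
# Approximate storage per parameter for common precisions (assumption: weights only)
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Estimate the memory needed to hold the model weights, in gigabytes (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for name, params in (("HY-MT1.5-1.8B", 1.8e9), ("HY-MT1.5-7B", 7e9)):
    for dtype in ("float16", "int4"):
        print(f"{name} @ {dtype}: {weight_memory_gb(params, dtype):.1f} GB")

# HY-MT1.5-1.8B @ float16: 3.6 GB
# HY-MT1.5-1.8B @ int4: 0.9 GB
# HY-MT1.5-7B @ float16: 14.0 GB
# HY-MT1.5-7B @ int4: 3.5 GB
```

Under these assumptions, the 4-bit 1.8B model needs under 1 GB for its weights, which is what makes on-device use plausible.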

For this demo, the full 7B model was used, mainly because the available GPU could accommodate it. The code listing below references the 1.8B checkpoint, and either variant can be selected by changing the model reference in the code.

Soprano 1.1 80M

The final component is the text-to-speech model. Soprano is a very lightweight on-device TTS model designed for expressive, high-quality speech generation at extremely high speed. It can deliver up to 2000× real-time generation on GPU and 20× real-time generation on CPU, with lossless streaming and very low latency of less than 15 ms on GPU and less than 250 ms on CPU.
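The speed-up figures translate directly into synthesis time. The short sketch below works through the arithmetic, treating the quoted 2000× and 20× values as best-case throughput; actual numbers will depend on hardware and text length.

```python
def synthesis_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Time needed to generate audio_seconds of speech at a given multiple of real time."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


clip = 10.0  # seconds of speech to generate
print(f"GPU (2000x): {synthesis_seconds(clip, 2000) * 1000:.0f} ms")  # 5 ms
print(f"CPU (20x):   {synthesis_seconds(clip, 20) * 1000:.0f} ms")    # 500 ms
```

At these rates, TTS is unlikely to be the bottleneck of the pipeline on a GPU; ASR and translation account for most of the latency budget.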

Its compact 80M-parameter architecture uses less than 1 GB of memory while supporting unlimited-length generation through automatic text splitting. It produces clear, expressive 32 kHz audio suitable for real-time speech applications.
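To illustrate the general idea behind automatic text splitting, here is a minimal sketch that groups sentences into chunks below a character limit. This is not Soprano's actual implementation; the function name, the regular expression, and the 200-character default are assumptions chosen for illustration.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit, so start a new chunk
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(split_into_chunks("Hello there. How are you? I am fine!", max_chars=20))
# ['Hello there.', 'How are you?', 'I am fine!']
```

Each chunk can then be synthesized independently and the audio concatenated, which keeps memory use bounded regardless of input length.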

Demo: Real-Time Speech Translation

Now that the main components of the pipeline have been introduced, they can be combined into a working application. A GitHub repository is available for this project at https://github.com/Jameshskelton/realtime_speech_translation, and its README file explains how to run the demo. To begin, launch a GPU-enabled server using the instructions from the related setup tutorial. This allows access to the GPU server from a local terminal and makes it possible to open the Gradio web application through a browser or a development environment's preview feature.

Next, paste the following commands into the terminal window of the GPU server:

git clone https://github.com/Jameshskelton/realtime_speech_translation
cd realtime_speech_translation
python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt
python3 realtime_speech_translation.py

This starts the web application. By default, both a public share link and a local link are created.

The demo page allows users to submit an audio file or record audio through the Gradio audio module. After adding audio, click the Run button at the bottom of the application to start the translation process. Depending on the length of the submitted audio, the process can take less than a second or a few seconds.
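To see where that time goes, each stage can be timed separately and compared against the length of the input audio. The sketch below is one way to do this with the standard library; the `timed` helper, the stage names, and the `time.sleep` calls standing in for the real model calls are all illustrative assumptions.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stage: str, timings: dict):
    """Record the wall-clock duration of the enclosed block under timings[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; values below 1.0 are faster than speech."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


timings: dict = {}
with timed("asr", timings):
    time.sleep(0.01)  # placeholder for the speech recognition call
with timed("translation", timings):
    time.sleep(0.01)  # placeholder for the translation call
with timed("tts", timings):
    time.sleep(0.01)  # placeholder for the speech synthesis call

total = sum(timings.values())
audio_seconds = 5.0  # assumed length of the input clip
print({stage: round(seconds, 3) for stage, seconds in timings.items()})
print(f"Real-time factor: {real_time_factor(total, audio_seconds):.3f}")
```

A real-time factor of 0.2, for example, would mean five seconds of audio are processed in one second.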

The complete code is shown below:

# -----------------------------
# Imports
# -----------------------------

import torch
import gradio as gr

from transformers import (
    AutoModelForSpeechSeq2Seq,
    AutoProcessor,
    AutoModelForCausalLM,
    AutoTokenizer,
    pipeline,
)

from soprano import SopranoTTS
from scipy.io.wavfile import write
from pydub import AudioSegment


# -----------------------------
# Device and dtype configuration
# -----------------------------

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32


# -----------------------------
# Whisper ASR setup
# -----------------------------

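# Note: this loads the Turbo variant of Whisper Large v3 (a pruned, faster decoder).
# Use "openai/whisper-large-v3" for the full model described earlier.
model_id = "openai/whisper-large-v3-turbo"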

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    return_timestamps=True,
    chunk_length_s=5.0,
)


# -----------------------------
# Translation model setup
# -----------------------------

# Change this reference to switch between the 1.8B and 7B variants
model_name_or_path = "tencent/HY-MT1.5-1.8B"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model_tr = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    device_map="auto",
)


# -----------------------------
# TTS setup
# -----------------------------

model_tts = SopranoTTS()


# -----------------------------
# Audio utilities
# -----------------------------

def convert_to_mono_pydub(input_file, output_file, output_format="wav"):
    """
    Converts a stereo or multi-channel audio file to mono using pydub.
    """
    audio = AudioSegment.from_file(input_file)
    mono_audio = audio.set_channels(1)
    mono_audio.export(output_file, format=output_format)
    print(f"Converted '{input_file}' to mono file '{output_file}'")


# -----------------------------
# ASR → Translation → TTS pipeline
# -----------------------------

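def tts_translate(sample_audio):
    """Transcribe the input audio, translate it into English, and synthesize speech."""
    if sample_audio is None:
        raise gr.Error("Please upload or record audio before clicking Run.")

    output_filename = "out1.wav"

    # Gradio passes audio as a (sample_rate, numpy_array) tuple
    sample_rate, audio_array = sample_audio
    write(output_filename, sample_rate, audio_array)

    # Downmix to a single channel in place before transcription
    convert_to_mono_pydub(output_filename, output_filename)

    # Speech -> text
    result = pipe(output_filename)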

    messages = [
        {
            "role": "user",
            "content": (
                "Translate the following segment into English, "
                "without additional explanation.\n\n"
                f"{result['text']}"
            ),
        }
    ]

    tokenized_chat = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=False,
        return_tensors="pt",
    )

    outputs = model_tr.generate(
        tokenized_chat.to(model_tr.device),
        max_new_tokens=2048,
    )

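    # Decode the prompt plus the generated tokens, special tokens included
    output_text = tokenizer.decode(outputs[0])

    # Keep the text after the marker token that precedes the model's reply,
    # then cut at the next special token (which begins with "<")
    after_marker = output_text.split("hy_place▁holder▁no▁8|>")[1]
    clean_text = after_marker.split("<")[0]

    # Text -> speech, written to out.wav
    model_tts.infer(clean_text, "out.wav")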

    return "out.wav", clean_text


# -----------------------------
# Gradio UI
# -----------------------------

with gr.Blocks() as demo:
    gr.Markdown("# Submit or record your audio for faster-than-speech translation!")

    with gr.Column():
        inp = gr.Audio(label="Input Audio to be Translated")

    with gr.Column():
        with gr.Row():
            out_audio = gr.Audio(label="Translated Audio")
        with gr.Row():
            out_text = gr.Textbox(label="Translated Text", lines=8)

    btn = gr.Button("Run")
    btn.click(fn=tts_translate, inputs=inp, outputs=[out_audio, out_text])

# share=True creates a public share link in addition to the local one
demo.launch(share=True)

As shown above, the application first loads the required model files. It then uses helper functions and one central orchestration function to transcribe the input audio, translate the resulting text into English, and convert the translated text back into spoken English through TTS. For example, this workflow can process five seconds of audio in less than a second when running on a high-performance GPU server powered by an NVIDIA H200.
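The step that pulls the translation out of the decoded model output relies on indexing the result of a string split, which raises an unhelpful IndexError if the marker is missing. A slightly more defensive variant is sketched below. The marker string is taken from the listing; the function name and the special-token names in the sample string are hypothetical and only serve to exercise the parsing.

```python
# Marker string copied from the listing above
REPLY_MARKER = "hy_place▁holder▁no▁8|>"


def extract_translation(decoded: str, marker: str = REPLY_MARKER) -> str:
    """Return the text between the reply marker and the next '<' character."""
    _, found, tail = decoded.partition(marker)
    if not found:
        raise ValueError("reply marker not found in decoded model output")
    # Like the original listing, this truncates at the first "<",
    # so a translation that itself contains "<" would be cut short.
    return tail.split("<", 1)[0].strip()


# Hypothetical decoded output; the token names other than the marker are made up
sample = "<|begin|>Translate the following segment...<|hy_place▁holder▁no▁8|>Good morning.<|end|>"
print(extract_translation(sample))  # Good morning.
```

The explicit error makes it easier to diagnose a change in the chat template, since the failure then names the missing marker.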

Closing Thoughts

This tutorial demonstrated that real-time speech translation can now be built by combining individual deep learning models into a single pipeline. This development creates significant potential for many translation scenarios, including business communication and entertainment applications. As ASR, TTS, and language-model-based translation continue to improve, this type of pipeline can become even more capable. A logical next step is adding voice cloning to the process.

Source: digitalocean.com
