
Speech-to-Speech Translation Explained: STT → MT → TTS in Plain English

How speech-to-speech translation works: speech recognition, machine translation, and text-to-speech in a pipeline. Explained without jargon for 2026.

Updated · 6 min read · Mingle Team

What happens when you speak into a translation app

You press a button, speak a sentence in English, and hear French a second later. It feels like magic. Behind the scenes, three distinct technologies run in sequence—each with its own strengths, weaknesses, and failure modes.

Understanding this pipeline—STT, MT, TTS—explains why translation sometimes nails a complex sentence and sometimes garbles a simple name. It also tells you how to speak for better results.

Stage 1: Speech-to-Text (STT)

Speech-to-text, also called automatic speech recognition (ASR), converts your spoken audio into written text in the same language you spoke.

When you say "Where is the nearest pharmacy?" into your phone, the STT engine produces the text string: `Where is the nearest pharmacy?`

How it works. Modern STT systems use neural network models trained on thousands of hours of labeled speech. Google Research has published extensively on end-to-end speech recognition models that map audio waveforms directly to text without intermediate phonetic representations. Meta's wav2vec and self-supervised learning research pushed the field toward models that learn from unlabeled audio, improving recognition for low-resource languages.

Where it fails. STT struggles with:

  • Proper nouns (names, streets, brands) not in its training data
  • Heavy accents or dialects underrepresented in training
  • Background noise competing with speech
  • Fast or mumbled speech
  • Code-switching between languages mid-sentence

Every error at this stage propagates downstream. If STT hears "farmacy" instead of "pharmacy," the translation stage receives the wrong input.

Stage 2: Machine Translation (MT)

Machine translation converts the text from one language to another. The STT output `Where is the nearest pharmacy?` becomes `Où est la pharmacie la plus proche?`

How it works. Neural machine translation (NMT) models are trained on parallel corpora, meaning the same text in both languages, and learn to map meaning from one language to the other. The Transformer architecture, introduced by Google researchers in 2017, revolutionized this field. Meta's NLLB (No Language Left Behind) project extended translation to about 200 languages, many of which previously lacked adequate MT support.

Where it fails. MT struggles with:

  • Idioms and cultural expressions that do not translate literally
  • Ambiguous words where context determines meaning
  • Very long or syntactically complex sentences
  • Language pairs with limited training data

For well-supported pairs like English-French or English-Spanish, MT accuracy is high for straightforward sentences. For idiomatic or technical content, errors increase.

Importantly, MT can only be as accurate as the STT text it receives. A garbled STT input produces a confident but wrong translation.

Stage 3: Text-to-Speech (TTS)

Text-to-speech converts the translated text back into spoken audio in the target language. The French text `Où est la pharmacie la plus proche?` becomes audio you hear through your speaker or earbuds.

How it works. Modern TTS uses spectrogram prediction models and neural vocoders to generate natural-sounding speech. Google's Tacotron and DeepMind's WaveNet model families set benchmarks for speech naturalness. The output is no longer a robotic monotone: it includes plausible intonation, pauses, and emphasis.

Where it fails. TTS struggles with:

  • Pronouncing foreign names embedded in translated text
  • Unusual words or neologisms not in its pronunciation dictionary
  • Matching the emotional tone of the original speaker
  • Speaking speed that feels unnatural for the content

TTS errors are usually less disruptive than STT or MT errors because the listener can often infer meaning from context even if pronunciation is slightly off.

The full pipeline in sequence

```
Your voice → [STT] → English text → [MT] → French text → [TTS] → French audio → Other person hears
```
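
The chain above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular product is built: `run_pipeline`, `toy_stt`, `toy_mt`, `toy_tts`, and `PHRASEBOOK` are hypothetical names, and the three toy stages are stand-ins for real models. The point is only that each stage consumes the previous stage's output, so an STT mistake reaches MT and TTS unchanged.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PipelineResult:
    transcript: str   # STT output: text in the source language
    translation: str  # MT output: text in the target language
    audio: bytes      # TTS output: synthesized speech


def run_pipeline(audio_in: bytes,
                 stt: Callable[[bytes], str],
                 mt: Callable[[str], str],
                 tts: Callable[[str], bytes]) -> PipelineResult:
    """Run the three stages in order; each stage only sees the previous output."""
    transcript = stt(audio_in)
    translation = mt(transcript)
    audio_out = tts(translation)
    return PipelineResult(transcript, translation, audio_out)


# Toy stand-ins (assumptions for illustration, not real models).
def toy_stt(audio: bytes) -> str:
    # Pretend the "audio" bytes already spell out what the recognizer heard.
    return audio.decode("utf-8")


PHRASEBOOK = {
    "Where is the nearest pharmacy?": "Où est la pharmacie la plus proche?",
}


def toy_mt(text: str) -> str:
    # A one-entry phrasebook; anything else comes back marked as untranslated.
    return PHRASEBOOK.get(text, f"[untranslated: {text}]")


def toy_tts(text: str) -> bytes:
    # Pretend the synthesized audio is just the encoded text.
    return text.encode("utf-8")


good = run_pipeline(b"Where is the nearest pharmacy?", toy_stt, toy_mt, toy_tts)
bad = run_pipeline(b"Where is the nearest farmacy?", toy_stt, toy_mt, toy_tts)
print(good.translation)
print(bad.translation)  # the misheard word survives every later stage
```

Keeping the transcript and translation in the result, rather than returning only audio, is what makes a text display possible as a checkpoint.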

Total latency in 2026 for cloud-connected systems is roughly one to three seconds per sentence. The typical per-stage ranges below add up to about 0.8 to 2.4 seconds:

  • STT: 300–800ms
  • MT: 100–300ms
  • TTS: 300–800ms
  • Network round-trip: 100–500ms (varies by connection)
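
The arithmetic behind these ranges is easy to check. The sketch below simply sums the best-case and worst-case figures from the list; `STAGES_MS` and `total_latency_ms` are illustrative names, and the assumption is that the stages run strictly one after another with no overlap.

```python
# Per-stage latency ranges in milliseconds, copied from the breakdown above.
STAGES_MS = {
    "stt": (300, 800),
    "mt": (100, 300),
    "tts": (300, 800),
    "network": (100, 500),
}


def total_latency_ms(stages):
    """Sum the low ends and the high ends, assuming the stages run back to back."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


best, worst = total_latency_ms(STAGES_MS)
print(f"best case: {best} ms, worst case: {worst} ms")
# prints "best case: 800 ms, worst case: 2400 ms"
```

Dropping the `"network"` entry shows what removing the round trip alone would save, though real on-device stage times may differ from these cloud figures.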

On-device pipelines (such as Apple's on-device translation) eliminate network latency but typically support fewer languages and may use smaller models with slightly lower accuracy.

Why this matters for how you speak

Knowing the pipeline changes your behavior:

Speak one sentence at a time. STT processes utterances, not continuous monologues. Pausing between thoughts gives each stage clean input.

Protect the microphone. STT is the weakest link. Background noise, distance from the mic, and mumbling all degrade the input that MT and TTS receive.

Verify proper nouns in text. If STT mishears a name, MT and TTS will confidently deliver the wrong name. The text display is your correction checkpoint.
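
For anyone building on top of a translation tool, the same check can be automated in a rough way. The sketch below is a hypothetical helper, not a feature of any specific product: `missing_terms` takes the STT transcript and a list of critical terms you expected to say, and reports the ones that do not appear.

```python
from typing import List


def missing_terms(transcript: str, critical_terms: List[str]) -> List[str]:
    """Return the critical terms that do not appear in the transcript.

    Matching is a case-insensitive substring test, so it is deliberately
    crude: "John" would also match inside "Johnson".
    """
    lowered = transcript.casefold()
    return [term for term in critical_terms if term.casefold() not in lowered]


# The speaker meant "John" and "412"; the recognizer heard "414".
print(missing_terms("Meet John at room 414", ["John", "412"]))
# prints "['412']"
```

An empty list does not prove the transcript is right, only that the terms you listed made it through.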

Use simple structures for critical information. "Room 412. Floor 4." translates more reliably than "I believe we were assigned the room on the fourth floor, possibly 412 or 414."

Direct speech-to-speech models: the next frontier

Researchers at Google, Meta, and several universities are developing end-to-end speech-to-speech models that skip the intermediate text stages. These models map audio in one language directly to audio in another language.

The advantage is preserving paralinguistic features—tone, emotion, pace—that text intermediaries lose. The challenge is data: training direct speech-to-speech models requires aligned audio in multiple languages, which is far scarcer than text corpora.

In 2026, production live translation tools still predominantly use the three-stage pipeline. Direct models are advancing in research labs but have not yet replaced the STT → MT → TTS architecture in consumer products.

What this means for live conversation tools

Every live translation product—browser-based, native app, or hardware earbuds—runs some version of this pipeline. Differences between products are primarily:

  • Model quality at each stage (better training data, larger models)
  • Language coverage (how many pairs are supported at high quality)
  • Latency optimization (streaming STT, parallel MT/TTS processing)
  • Deployment (cloud versus on-device)
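
To see why overlapping stages helps with latency, consider a simplified model in which each stage handles one sentence at a time and a sentence can enter a stage as soon as both the stage is free and the previous stage has finished with it. The stage durations below (500 ms, 200 ms, 500 ms) are illustrative values inside the ranges quoted earlier, the function names are hypothetical, and the model assumes all sentences are already available, which real conversation is not.

```python
def sequential_ms(n_sentences, stage_ms):
    """Each sentence passes through every stage before the next one starts."""
    return n_sentences * sum(stage_ms)


def pipelined_ms(n_sentences, stage_ms):
    """Stages overlap: while TTS speaks sentence 1, STT can work on sentence 2."""
    # finish[k] is the time at which stage k finished its most recent sentence.
    finish = [0] * len(stage_ms)
    for _ in range(n_sentences):
        previous_stage_done = 0
        for k, duration in enumerate(stage_ms):
            # A stage starts once it is free and its input is ready.
            start = max(finish[k], previous_stage_done)
            finish[k] = start + duration
            previous_stage_done = finish[k]
    return finish[-1]


stages = [500, 200, 500]  # STT, MT, TTS in milliseconds (assumed values)
print(sequential_ms(3, stages))  # prints 3600
print(pipelined_ms(3, stages))   # prints 2200
```

In this model the first sentence still takes the full 1,200 ms; the gain shows up from the second sentence onward, where the slowest stage sets the pace.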

When you evaluate translation tools, you are evaluating three models working together, not a single technology.

The practical takeaway

Speech-to-speech translation is not magic—it is three well-understood technologies chained together. The pipeline works remarkably well for routine conversation in well-supported language pairs. Its weaknesses—proper nouns, noise, idioms—are predictable and manageable with good speaking habits and text verification.

The next time a translation sounds wrong, ask: did it mishear me (STT), mistranslate me (MT), or mispronounce the result (TTS)? The answer tells you whether to speak more clearly, rephrase, or check the text display.

FAQ

What is the difference between speech-to-text and speech-to-speech translation?

Speech-to-text (STT) converts spoken audio into written text in the same language. Speech-to-speech translation adds machine translation and text-to-speech stages, producing spoken output in a different language. When you use live translation and hear a voice in another language, you are using the full speech-to-speech pipeline.

Why does translation sometimes get names and numbers wrong?

Proper nouns and numbers fail most often at the speech recognition stage, before translation begins. If STT mishears a name, machine translation faithfully translates the misheard text. Speaking clearly, spelling out critical details, and verifying in the text display catches these errors.

How fast is the speech-to-speech pipeline in 2026?

Modern cloud-connected pipelines deliver results in one to three seconds for a typical sentence. On-device pipelines (like Apple Live Translation) can be faster but support fewer languages. Total perceived latency includes all three stages plus network round-trip time.

Will speech-to-speech translation replace human interpreters?

For casual conversation, travel, and routine service interactions, it already has in many cases. For legal proceedings, high-stakes medical consultations, and diplomatic communication, human interpreters remain essential. The technology handles routine exchanges; humans handle nuance, culture, and accountability.
