
On-Device AI Beats Cloud for TTS – Here’s Why

Mateusz Kopciński · Apr 15, 2026 · 4 min read
Cloud AI APIs are convenient. You call an endpoint, you get a result, and you pay a "small" fee per request. It feels cheap – until it isn't.
 
Text-to-speech (TTS) is a perfect example of how that math can quietly turn against you. If you are building a product that scales, you might want to stop paying per request and start looking at the device already in your user’s pocket.
 

The Kokoro comparison

We recently added Kokoro TTS support to react-native-executorch, our open-source library for running AI models on-device in React Native. Kokoro is an 82M-parameter model under the Apache 2.0 license – free to use commercially, including the model weights.

Let's put the cost difference in concrete terms. If your app generates ~500 characters per TTS call and 1,000 users each make two calls daily, that's roughly 30 million characters a month — and the bill scales linearly with your user count.
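A quick back-of-envelope model makes the scaling visible. The $16 per 1M characters figure below is an assumption based on published Neural2-class cloud TTS pricing; check your provider's current rates before relying on it.

```typescript
// Back-of-envelope cloud TTS cost model. All inputs match the scenario
// in the text; the per-character price is an assumption.
const CHARS_PER_CALL = 500;
const USERS = 1_000;
const CALLS_PER_USER_PER_DAY = 2;
const DAYS = 30;

const charsPerMonth = CHARS_PER_CALL * USERS * CALLS_PER_USER_PER_DAY * DAYS;
// 500 * 1,000 * 2 * 30 = 30,000,000 characters per month

const PRICE_PER_MILLION_CHARS = 16; // USD, assumed Neural2-class pricing
const cloudCostPerMonth = (charsPerMonth / 1_000_000) * PRICE_PER_MILLION_CHARS;
// 30 * $16 = $480 per month for 1,000 users

console.log(charsPerMonth, cloudCostPerMonth);
```

Note that every term in `charsPerMonth` except `DAYS` grows with your product: ten times the users means ten times the bill, while the on-device cost stays at zero.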

The model download happens once, and only when the user first needs it. After that, every synthesis — whether the user is on a plane, underground, or in a country where your cloud region has high latency — costs nothing and requires no internet connection.

Kokoro's tradeoffs

Kokoro supports 62 voices across 9 languages, though react-native-executorch currently exposes 8 (more are coming). Google offers a bigger catalog, with 220+ voices across 40+ languages. That said, the difference in voice quality is small. Kokoro performs impressively well against Google’s Neural2 and Chirp tiers, all on a phone with just 82M parameters. If you need a very specific language or regional accent today, the cloud API may still be your only option. But for the vast majority of mobile use cases — accessibility features, read-aloud, voice assistants, e-learning narration — Kokoro's quality is right there.

Integration is simpler than you'd think

```tsx
import {
  useTextToSpeech,
  KOKORO_MEDIUM,
  KOKORO_VOICE_AF_HEART,
} from 'react-native-executorch';

const App = () => {
  // Runs 100% on-device. Zero network calls. Zero cost.
  const tts = useTextToSpeech({
    model: KOKORO_MEDIUM,
    voice: KOKORO_VOICE_AF_HEART,
  });

  const handleSynthesize = async (text: string) => {
    const waveform = await tts.forward(text);
    // Play or process the audio
  };

  // UI omitted for brevity
  return null;
};
```

One honest challenge: getting Kokoro running on-device required writing a phonemizer from scratch in C++. A huge portion of AI tooling – tokenizers, phonemizers, preprocessing pipelines – is written in Python, and none of that runs on iOS or Android. For library users this is completely invisible, but it's worth understanding that react-native-executorch exists precisely to absorb that complexity so you don't have to.
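To make the phonemizer's job concrete, here is a deliberately toy grapheme-to-phoneme sketch — a lookup table and a function, not the actual C++ implementation in react-native-executorch. Real phonemizers handle context-dependent pronunciation, stress marks, out-of-vocabulary words, and per-language rules, which is exactly why rewriting one outside Python is nontrivial.

```typescript
// Toy phonemizer: map words to phoneme sequences via a lexicon.
// Purely illustrative; the lexicon entries use ARPAbet-style symbols.
const LEXICON: Record<string, string[]> = {
  hello: ['HH', 'AH', 'L', 'OW'],
  world: ['W', 'ER', 'L', 'D'],
};

function phonemize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/\s+/)
    .flatMap((word) => LEXICON[word] ?? ['<unk>']); // unknown-word fallback
}

console.log(phonemize('Hello world'));
// ["HH", "AH", "L", "OW", "W", "ER", "L", "D"]
```

The TTS model consumes phoneme sequences like this, not raw text — so without a phonemizer on the device, there is no on-device TTS at all.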

Beyond cost

Cost is the argument that tends to cut through in planning discussions, but it's not the only one.

Offline capability is real and undervalued. A TTS feature that requires internet access will silently fail in the subway, in rural areas, on flights, and in emerging markets with unreliable connectivity. On-device inference simply works, everywhere, always.

Latency matters too. A cloud API call involves a round trip to a server, typically 200–800 ms depending on region. On-device inference on a modern smartphone can be faster, and more importantly, it's consistent: no cold start, no regional latency spike, no degraded performance under load.
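If you want to verify this on your own hardware, a generic timing wrapper is enough. The `timed` helper below is an illustrative utility (not part of react-native-executorch); the commented usage assumes the `tts` object from the earlier snippet.

```typescript
// Measure the wall-clock time of any async call, e.g. a synthesis call.
async function timed<T>(fn: () => Promise<T>): Promise<[T, number]> {
  const start = Date.now();
  const result = await fn();
  return [result, Date.now() - start];
}

// Usage inside a component with `tts` in scope:
// const [waveform, ms] = await timed(() => tts.forward('Hello world'));
// console.log(`synthesis took ${ms} ms, with no network round trip`);
```

Run it a few dozen times on a mid-range device and on a flaky connection against your cloud endpoint; the on-device numbers will cluster tightly while the cloud numbers spread with network conditions.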

Privacy is another dimension, and how much it matters depends heavily on your use case. When TTS runs on-device, the text being synthesized never leaves the phone — not to your servers, not to a cloud provider, not anywhere. For consumer apps that's often a nice-to-have. For apps in healthcare, legal, finance, or personal productivity, it can be a genuine architectural requirement, and in some jurisdictions, a compliance one.

The scaling argument

The cloud pricing model is designed for the early stages of a product, when usage is low and convenience outweighs cost. As you scale, though, per-request pricing becomes a structural cost that grows in proportion to your success. On-device AI inverts this: your infrastructure cost is fixed (the engineering work to integrate it), and usage is free. For features like TTS that are high-frequency and not computationally exotic by modern phone standards, this is a compelling case.
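The inversion can be framed as a simple break-even question: how many months of cloud spend does the one-time integration work pay for? The numbers below are illustrative assumptions, not measurements.

```typescript
// Break-even sketch: one-off integration cost vs recurring cloud spend.
function monthsToBreakEven(
  integrationCost: number, // one-time engineering cost of on-device TTS (USD)
  monthlyCloudSpend: number, // what the cloud API would bill per month (USD)
): number {
  return Math.ceil(integrationCost / monthlyCloudSpend);
}

// e.g. a $20,000 integration vs $5,000/month of cloud TTS usage:
console.log(monthsToBreakEven(20_000, 5_000)); // 4
```

The key property is that `monthlyCloudSpend` grows with usage while `integrationCost` does not, so the break-even point only moves closer as the product succeeds.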

React Native developers in particular are in a good position to take advantage of this shift. Our library react-native-executorch makes it possible to ship on-device AI features without deep expertise in native code or machine learning — the hard parts of model export, runtime integration, and memory management are handled at the library level.

On-device AI won't replace cloud APIs for every use case. Tasks that require massive models, real-time training, or centralized data still belong on servers. But for well-scoped inference — speech synthesis, image classification, language understanding — especially on features that run frequently and at scale, the question has shifted. It's no longer "is on-device AI good enough?" It's "why are we still paying per request?"