Clone any voice with a 15-second sample

It’s surprisingly easy to clone a voice using F5-TTS: “A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching”.

Here’s a clip of me, saying:

I think Taylor Swift is the best singer. I’ve attended every one of her concerts and in fact, I’ve even proposed to her once. Don’t tell anyone.

(Which is ironic since I didn’t know who she was until this year and I still haven’t seen or heard her.)

You’ll notice that my voice is a bit monotonous. That’s because I trained it on a segment of my talk that’s monotonous.

Here’s the code. You can run this on Google Colab for free.

A few things to keep in mind when preparing the audio.

  1. Keep the input to just under 15 seconds. That’s the optimal length.
  2. For expressive output, use an input with a broad range of voice emotions.
  3. When using unusual words (e.g. LLM), including the word in your sample helps.
  4. Transcribe the sample into input.txt manually to get it right, though Whisper is fine if you’re cloning in bulk. (But then, who are you and what are you doing?)
  5. Sometimes, each generated chunk of audio has about a second of audio from the original sample mixed in. I don’t know why. Adding a second of silence at the end of the sample may help.
  6. Keep punctuation simple in the generated text. For example, avoid dashes like “This is obvious – don’t try it.” Use “This is obvious, don’t try it.” instead.
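
Tips 1, 5 and 6 are mechanical enough to script. Here’s a minimal sketch using only Python’s standard library. It is not part of F5-TTS: the helper names `prepare_reference` and `simplify_punctuation` are hypothetical, the 13.5-second and 1-second defaults are my assumptions for staying just under 15 seconds in total, and the input is assumed to be an uncompressed PCM WAV file.

```python
import re
import wave


def prepare_reference(src_path, dst_path, speech_seconds=13.5, pad_seconds=1.0):
    """Trim a PCM WAV clip and append trailing silence.

    Returns the duration of the written file in seconds.
    The default lengths are assumptions, not F5-TTS requirements.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        max_frames = int(speech_seconds * params.framerate)
        frames = src.readframes(min(params.nframes, max_frames))

    frame_size = params.sampwidth * params.nchannels
    # 8-bit PCM is unsigned (silence is 0x80); wider samples are signed (silence is 0x00).
    silent_byte = b"\x80" if params.sampwidth == 1 else b"\x00"
    silence = silent_byte * (int(pad_seconds * params.framerate) * frame_size)

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(params.nchannels)
        dst.setsampwidth(params.sampwidth)
        dst.setframerate(params.framerate)
        dst.writeframes(frames + silence)

    return (len(frames) + len(silence)) / (frame_size * params.framerate)


def simplify_punctuation(text):
    """Replace en/em dashes, and hyphens surrounded by spaces, with a comma.

    Hyphens inside words (e.g. "super-easy") are left alone.
    """
    return re.sub(r"\s*[\u2013\u2014]\s*|\s+-\s+", ", ", text)
```

For example, `prepare_reference("talk.wav", "ref.wav")` would write a clip of at most 14.5 seconds that ends in a second of silence, and `simplify_punctuation` would turn the dashed sentence in tip 6 into the comma version. The file names here are placeholders.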

This has a number of uses I can think of (er… ChatGPT can think of), but the ones I find most interesting are:

  1. Author-narrated audio books. I’m sure this is coming soon, if it’s not already there.
  2. Personalized IVR. Why should my IVR speak in some other robot’s voice? Let’s use mine. (This has some prank potential.)
  3. Annotated presentations. I’m too lazy to speak. Typing is easier. This lets me create, for example, slide decks with my voice, but with editing made super-easy. I just change the text and the audio changes.
