It’s surprisingly easy to clone a voice using F5-TTS: “A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching”.
Here’s a clip of me, saying:
I think Taylor Swift is the best singer. I’ve attended every one of her concerts and in fact, I’ve even proposed to her once. Don’t tell anyone.
(Which is ironic since I didn’t know who she was until this year and I still haven’t seen or heard her.)
You’ll notice that my voice is a bit monotonous. That’s because I cloned it from a segment of my talk that’s monotonous.
Here’s the code. You can run this on Google Colab for free.
A few things to keep in mind when preparing the audio.
- Keep the input to just under 15 seconds. That’s the optimal length.
- For expressive output, use an input with a broad range of voice emotions.
- When using unusual words (e.g. LLM), including the word in your sample helps.
- Transcribe `input.txt` manually to get it right, though Whisper is fine to clone in bulk. (But then, who are you and what are you doing?)
- Sometimes, each chunk of audio generated has a second of audio from the original interspersed. I don’t know why. Maybe a second of silence at the end helps.
- Keep punctuation simple in the generated text. For example, avoid dashes like “This is obvious – don’t try it.” Use “This is obvious, don’t try it.” instead.
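Two of these tips are easy to automate. Here’s a minimal sketch using only Python’s standard library. It is not part of F5-TTS. The function names, the 14-second total and the 1-second pad are assumptions chosen for illustration, and it assumes the input is a 16-bit PCM WAV file (where zero bytes are silence).

```python
import re
import wave


def prepare_reference(src_path, dst_path, max_seconds=14.0, pad_seconds=1.0):
    """Trim a WAV clip and append silence so the total is at most max_seconds.

    Assumes 16-bit PCM audio, where zero-valued samples are silence.
    Returns the duration of the written file in seconds.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        frame_bytes = src.getsampwidth() * src.getnchannels()
        # Leave room for the trailing silence inside the overall budget.
        keep = int((max_seconds - pad_seconds) * rate)
        frames = src.readframes(min(keep, src.getnframes()))

    # A stretch of silence at the end, as the tip above suggests trying.
    silence = b"\x00" * (int(pad_seconds * rate) * frame_bytes)

    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)  # the frame count in the header is fixed on close
        dst.writeframes(frames + silence)

    return (len(frames) + len(silence)) / (frame_bytes * rate)


def simplify_punctuation(text):
    """Replace a dash or hyphen that has spaces on both sides with a comma."""
    # Hyphens inside words (e.g. "super-easy") have no spaces and are left alone.
    return re.sub(r"\s+[\u2013\u2014-]\s+", ", ", text)
```

For example, `prepare_reference("talk.wav", "input.wav")` (hypothetical file names) would write a clip of at most 14 seconds ending in a second of silence, and `simplify_punctuation("This is obvious – don’t try it.")` gives “This is obvious, don’t try it.”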
This has a number of uses I can think of (er… ChatGPT can think of), but the ones I find most interesting are:
- Author-narrated audio books. I’m sure this is coming soon, if it’s not already there.
- Personalized IVR. Why should my IVR speak in some other robot’s voice? Let’s use mine. (This has some prank potential.)
- Annotated presentations. I’m too lazy to speak. Typing is easier. This lets me create, for example, slide decks with my voice, but with editing made super-easy. I just change the text and the audio changes.