
Replace my voice in a video recording with TTS-generated speech

By Mia Morrison

I want to make a screencast, substituting my voice with audio generated by a TTS engine. The problem is that if I use STT followed by TTS, the generated audio will not be in sync with my voice and won't make any sense in the screencast. Is there a way to do this so that the generated audio stays in sync with my voice?

I'm not talking about a live stream, just a video. I can still edit it, but I would prefer not to have to copy every piece of the generated audio into its proper place. I would expect something that cuts the audio into pieces by itself and moves them to roughly match my voice's spectrogram (I think that's what it's called), so I don't have to do everything manually.

I had thought of just creating the TTS audio separately, by writing my script into a text file and using coqui-tts to run tts --text "$(cat tts_input.txt)". Then split the audio from the video, import both the TTS audio and the original audio into Audacity, and cut and move the generated audio around until it matches the audio from the video. Then merge the audio with the video again.
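The scriptable parts of that workflow (extract the audio, generate the TTS track, remux) can be sketched in Python. This is a minimal sketch, assuming ffmpeg and the coqui-tts `tts` CLI are on your PATH; all file names are illustrative, and the manual Audacity editing still happens between the second and third steps.

```python
# Sketch of the pipeline described above. Assumes ffmpeg and coqui-tts
# are installed; nothing here replaces the manual alignment step.
import subprocess

def extract_audio_cmd(video, wav):
    # Drop the video stream and decode the audio to a mono WAV for editing
    return ["ffmpeg", "-y", "-i", video, "-vn", "-ac", "1", wav]

def tts_cmd(text, wav):
    # The coqui-tts command line from the question, plus an output path
    return ["tts", "--text", text, "--out_path", wav]

def remux_cmd(video, wav, out):
    # Swap the (edited) TTS track in for the original audio track
    return ["ffmpeg", "-y", "-i", video, "-i", wav,
            "-map", "0:v", "-map", "1:a", "-c:v", "copy", "-shortest", out]

def run_pipeline(video, text_file, edited_wav, out):
    # Run the three steps; edit tts.wav in Audacity before the remux step.
    with open(text_file) as f:
        text = f.read()
    subprocess.run(extract_audio_cmd(video, "original.wav"), check=True)
    subprocess.run(tts_cmd(text, "tts.wav"), check=True)
    subprocess.run(remux_cmd(video, edited_wav, out), check=True)
```

The remux step copies the video stream untouched (`-c:v copy`), so no re-encoding happens when the edited audio is merged back in.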

I'm looking for a way to automate that process, because honestly it seems very involved and I would get tired of it really quickly.
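The "move pieces around until they match" step can at least be reduced to timing arithmetic. Below is a sketch of a hypothetical helper, assuming you have already detected the start times of speech segments in the original audio (e.g. via silence detection) and split the TTS output into one chunk per segment; the actual audio splicing is left to whatever editor or library you use.

```python
# Hypothetical placement helper: anchor each TTS chunk to the start of
# the matching speech segment in the original recording, pushing a chunk
# later only when the previous one overruns its slot.

def place_chunks(segment_starts, chunk_durations):
    """Return (start, end) times in seconds for each TTS chunk."""
    placements = []
    cursor = 0.0  # end time of the previously placed chunk
    for start, dur in zip(segment_starts, chunk_durations):
        begin = max(start, cursor)  # never overlap the previous chunk
        placements.append((begin, begin + dur))
        cursor = begin + dur
    return placements

# Three segments; the second TTS chunk is longer than its slot, so the
# third chunk slips from 10.0 s to 10.5 s.
print(place_chunks([0.0, 4.0, 10.0], [3.0, 6.5, 2.0]))
# -> [(0.0, 3.0), (4.0, 10.5), (10.5, 12.5)]
```

With placements in hand, each chunk can be padded with silence up to its start time and the pieces concatenated, which is exactly the manual Audacity work described above.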


1 Answer

You’re absolutely correct that going from Speech-to-Text and then Text-to-Speech would be too slow. Even on the most powerful computers you will not get the latency below 400 ms. This is primarily because the STT component cannot convert a spoken word to a text representation until after you’ve finished saying the word, and a TTS component cannot convert a written piece of text to a semi-accurate representation of sound in the English language until after the first few syllables are in place. Add to this the time many packages spend waiting for the context behind a statement, and you’re at half a second in no time at all. This is much easier for languages such as Japanese, where each character has its own sound that does not change depending on the surrounding characters (leaving aside ゃ, ゅ, and ょ for the time being).

However, if the goal is to substitute your voice in real time without burning a hole in your CPU, you may want to look at voice modulators. These are often depicted as tools used by hackers or nefarious individuals, but they’re really quite good at giving people a different sound. Lyrebird is a pretty solid tool for this, and there are many others that can integrate with streaming applications, which can simplify your workflow.

