I'm no specialist in the field at all, but Kyutai explained a bit of their workflow for building their speech-to-speech model. It basically boils down to this: if you want to train an STT
(speech-to-text) model, you can generate the audio track from text using a TTS (text-to-speech) model, and then you have a supervised audio/text pair. You can even add as much noise to the audio as you want, to make the STT model noise-resistant.
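
To make the idea concrete, here is a rough sketch of how I imagine such a pipeline, not Kyutai's actual code. Everything in it is made up for illustration: `fake_tts` is a stand-in that just emits sine bursts (a real TTS model would go there), and `add_noise`, `make_pairs` and the SNR levels are my own hypothetical names and values. It only uses the Python standard library.

```python
import math
import random


def add_noise(samples, snr_db, rng):
    """Mix white Gaussian noise into `samples` at roughly the given SNR (in dB)."""
    # Average power of the clean signal.
    signal_power = sum(s * s for s in samples) / len(samples)
    # SNR(dB) = 10 * log10(signal_power / noise_power), solved for noise_power.
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


def make_pairs(texts, synthesize, snr_levels_db, seed=0):
    """Build (noisy_audio, text) training pairs starting from text alone.

    `synthesize` is a placeholder for any TTS function mapping text -> samples.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    pairs = []
    for text in texts:
        clean = synthesize(text)
        for snr_db in snr_levels_db:
            # The label is known for free: it is the text we synthesized from.
            pairs.append((add_noise(clean, snr_db, rng), text))
    return pairs


def fake_tts(text, sample_rate=16000):
    """Hypothetical stand-in for a TTS model: one 10 ms sine burst per character."""
    samples = []
    burst_len = sample_rate // 100
    for ch in text:
        freq = 200.0 + 10.0 * (ord(ch) % 50)
        samples.extend(
            math.sin(2 * math.pi * freq * i / sample_rate) for i in range(burst_len)
        )
    return samples


pairs = make_pairs(["hello", "world"], fake_tts, snr_levels_db=[20, 5])
```

Each text yields one noisy clip per SNR level, all sharing the same label, so the amount of supervised data scales with however many noise conditions you are willing to generate.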