Thomas Wolf’s Post


Co-founder and Chief Science Officer at 🤗 Hugging Face – angel investor

Today's demo of Kyutai's fully end-to-end audio model is a huge deal that many people in the room missed.

Mostly irrelevant:
- it comes a few weeks after OpenAI's ChatGPT-4o
- the demo was less polished than the 4o one (in terms of voice quality, voice timing…)

Relevant:
- the model training pipeline and model architecture are simple and hugely scalable, with a tiny team of 8+ people at Kyutai building it in 4 months. Synthetic data is a huge enabler here.
- laser focus on local devices: Moshi will soon be everywhere. Frontier model builders have little incentive to let you run smaller models locally (price per token…), but non-profits like Kyutai have very different incentives. The Moshi demo is already online while the OpenAI 4o one is still in limbo.
- going under 300 ms of latency while keeping Llama-8B-or-above answer quality is a key enabler for interactivity, and it's game changing. The feeling when the model answers your question before you've even finished asking it, or reacts when you interrupt it mid-sentence, is quite crazy. Predictive coding in a model, an instantly updated model of what you're about to say...

Basically they nailed the fundamentals. It's here. This interactive voice tech will be everywhere. It will soon be an obvious commodity.

Andreas Blixt

Exploring • Previously Framer, Spotify

3mo

I was trying their live demo and it's by far the most responsive model I've seen, and its ability to hear and speak in parallel (two audio tracks) seems beyond even GPT-4o! The language consistency might need some more work, but that also seems like a problem we have already made a lot of progress on in text models. By far the wildest thing in the whole presentation was the part where they simulate a person from 20+ years ago. It's leaps beyond just voice cloning. It makes me think we now have the technology to upload a few hours of someone's speech and get a simulacrum of that person on the phone.

John Gordon

Software Engineer, Geek, Investor, Entrepreneur

3mo

I was working on a similar pipeline and found I had to do some explicit noise removal to prevent my little buddy from talking to herself. I understand the most popular techniques are beamforming, owner voiceprint isolation, echo cancellation, and source localization. The naive approach of muting and unmuting the mic is not as fluid. From the demo, how do they get that level of noise isolation? Did they talk about that? Is there a paper?
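For the echo cancellation part of that list, here is a minimal sketch of the idea (this is a generic illustration, not Kyutai's method; the function name, filter length, and step size are assumptions): an NLMS adaptive filter estimates the echo of the assistant's own playback in the mic signal and subtracts it, instead of naively gating the mic while the model speaks.

```python
# Minimal sketch of acoustic echo cancellation with an NLMS adaptive filter,
# so the assistant does not react to its own playback. Generic illustration,
# not Kyutai's pipeline; names and parameters are assumptions.
import numpy as np

def nlms_echo_cancel(mic, playback, filter_len=256, mu=0.5, eps=1e-8):
    """Remove an adaptively estimated echo of `playback` from `mic`.

    mic      : microphone samples (user speech + echoed playback)
    playback : the audio the assistant itself is emitting (reference signal)
    returns  : residual signal, ideally just the user's speech
    """
    w = np.zeros(filter_len)      # adaptive FIR estimate of the echo path
    x_buf = np.zeros(filter_len)  # most recent playback samples, newest first
    out = np.zeros_like(mic)
    for n in range(len(mic)):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = playback[n]
        echo_hat = w @ x_buf                         # predicted echo at sample n
        e = mic[n] - echo_hat                        # echo-cancelled output
        out[n] = e
        w += mu * e * x_buf / (x_buf @ x_buf + eps)  # normalized LMS update
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_samples = 32000                              # 2 s at 16 kHz
    playback = rng.standard_normal(n_samples)      # stand-in for the model's audio
    echo_path = np.array([0.0, 0.5, 0.25, 0.1])    # toy room impulse response
    echo = np.convolve(playback, echo_path)[:n_samples]
    user = 0.3 * np.sin(2 * np.pi * 220 * np.arange(n_samples) / 16000)
    mic = user + echo
    cleaned = nlms_echo_cancel(mic, playback)
    print("echo power before:", np.mean((mic - user) ** 2))
    print("echo power after :", np.mean((cleaned[8000:] - user[8000:]) ** 2))
```

The mute/unmute approach mentioned above simply gates the mic whenever the model is speaking, which avoids the feedback loop but also kills the interruption behaviour the demo shows.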

Rakesh Gohel

Founder at JUTEQ | Empowering Businesses through Cloud Transformation & Solutions | Specializing in Cloud Architecture & Consultation | Generative AI | Entrepreneurship & Leadership | Let's connect & innovate together!🌟

3mo

We are entering an era where team size doesn't matter and there is a lot more AI power to wield. Love the demo. Being able to interrupt the model and have it continue with that thought brings it to human-level interaction 👏

Bill Stout

Principal AI Security Architect ServiceNow, founder ServiceNow AI Red Team, AI Alliance WG member, BSA WG member

3mo

Jeroen Rombouts

Freelance Machine Learning Engineer

3mo

Until I tried something like this myself it wasn't really on my mind, but whoa does the low latency ever make a difference. 300 msec versus 3 s is an entirely different subjective experience. I hope I get to build something with this tech soon 😎

Yasir Altaf

Data Science/AI Consultant | Big Data Solutions | Cellular Radio & IoT

3mo

Well, Moshi really is awesome. However, the flow of the conversation is strange. It seems as if the model is itching to respond and starts blurting out audio in the middle of the question. It also seems to have fixed filler sentences to cover scenarios where it doesn't really have an answer. The latency is matchless though.

Spoorthi G.

Machine Learning Engineer | Cloud Applications, AWS CloudFormation, Database Administration | Data Scientist | Data Engineer | MLOps

2mo

The advances described in Kyutai's fully end-to-end audio model are impressive and mark real progress in interactive voice technology. Achieving low latency while keeping answer quality comparable to Llama 8B or above shows a strong focus on usability and interactivity, which are crucial for widespread adoption. The emphasis on on-device processing with a model like Moshi also addresses privacy and accessibility concerns, in contrast to larger models built around cloud-based serving. The rapid development cycle and the use of synthetic data showcase an efficient approach to scaling AI models. All of this points to interactive voice technology becoming a commonplace, natural part of everyday digital assistants and other AI-powered applications.

Arno Selhorst

AI-powered Innovation for Sales & Service

3mo

Yes, I tried it myself. It's almost too fast for my taste. I feel pressured not to be quiet, or I run the risk of Moshi taking off like a verbal race car. This needs to be tamed quickly. The quality is on par with what you can expect from quantised models around 7B. I love that they are going the open-source route and focusing on mobile devices. What a blessing! Go local or go home, I'd say.

I agree - a lot of impressive new bits under the hood. Adding to your list: the indexing-by-design of all Moshi generations for attribution checking, and the native watermarking of Moshi's utterances.

Eugeniu Cararus

Software Architect @ FSB Technology | Solution Team, Data Integration

3mo

Nice to see it in action. I spent a lot of time with Azure Cognitive Services trying to get voices with different tonality and pronunciation, but my results weren't anywhere near this example. It used to be more patient, though :)
