
OpenAI’s newest AI model can hold a humanlike conversation

GPT-4o can see, hear and speak with near-instant response times. The multimodal model will roll out for free over the next few weeks.
A screen announcing OpenAI at a trade show in Barcelona, Spain, in February. Pol Cartie / Sipa USA via AP

OpenAI has introduced its most comprehensive artificial intelligence model yet: a multimodal system that can communicate with users through both text and voice.

GPT-4o, which will be rolling out in ChatGPT as well as in the API over the next few weeks, is also able to recognize objects and images in real time, the company said Monday.

The model combines a slew of AI capabilities that were previously available only across separate OpenAI models. By handling all of these modalities in a single model, GPT-4o is expected to process any combination of text, audio and visual inputs more efficiently.
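For developers, the company says the model will also be available through its API. As a rough sketch only, assuming the request shape matches OpenAI's existing chat-completions API and using a placeholder image URL (the announcement did not include example code, and it did not detail how audio input will be sent), a combined text-and-image request in Python might look like this:

from openai import OpenAI

# Assumes the OPENAI_API_KEY environment variable is set.
client = OpenAI()

# Send a combined text-and-image request to the newly announced model.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is in this photo."},
                # Placeholder URL for illustration only.
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)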

Users can relay visuals through their phone camera, by uploading documents or by sharing their screen, all while conversing with the AI model as if they were on a video call. The technology will be available for free, the company announced, but paid users will have five times the usage limits of free users.

OpenAI, which is backed by Microsoft, also announced a new macOS desktop app for ChatGPT, the popular chatbot it first launched in 2022.

Mira Murati, chief technology officer at OpenAI, said during a livestream demonstration that making advanced AI tools available to users for free is a “very important” component of the company’s mission.

“This is the first time that we are really making a huge step forward when it comes to the ease of use. And this is incredibly important because we’re looking at the future of interaction between ourselves and the machines,” Murati said. “And we think that GPT-4o is really shifting that paradigm into the future of collaboration where this interaction becomes much more natural and far, far easier.”

Team members demonstrated the new model’s audio capabilities during the livestream and shared clips on social media.

An AI assistant that can reason in real time using vision, text and voice could perform a wide range of tasks, such as walking users through a math problem, translating languages mid-conversation and reading human facial expressions.

GPT-4o’s response time is much faster than previous models, Murati said in the livestream. The model significantly improves the quality and speed of its performance in 50 different languages.

It matches GPT-4 Turbo, the company’s most recent model before GPT-4o, on English text and code, according to an OpenAI blog post about the news.

It also surpasses existing OpenAI models in vision and audio understanding, the company wrote. CEO Sam Altman wrote in his own blog post Monday that GPT-4o’s new voice and video features are the “best compute interface [he’s] ever used.”

“It feels like AI from the movies; and it’s still a bit surprising to me that it’s real. Getting to human-level response times and expressiveness turns out to be a big change,” Altman wrote. “The original ChatGPT showed a hint of what was possible with language interfaces; this new thing feels viscerally different. It is fast, smart, fun, natural, and helpful.”

Voice capabilities in ChatGPT are not new; the chatbot has offered users a conversational voice assistant since last fall. But unlike the existing voice mode, Murati said, GPT-4o’s voice feature reacts in real time, eliminating the two- to three-second lag and emulating human response times.

And unlike in ChatGPT’s previous voice mode, a conversation with GPT-4o can keep flowing even if users interrupt it mid-response. During Monday’s demonstration, Mark Chen, head of frontiers research at OpenAI, showed that GPT-4o can read a user’s tone of voice while generating a range of emotions in its own voice.

The new model comes on the heels of Sora, the company’s still-unreleased text-to-video model, which made waves in tech and entertainment circles after it was unveiled in February.

OpenAI’s announcement also came just a day before Google’s annual I/O developer conference on Tuesday, where the tech giant is expected to share updates on its own latest AI developments.

In 2023, a record $29.1 billion was invested across nearly 700 generative AI deals, CNBC reported.
