Hi Reader,

Here are three things I found interesting in the world of AI in the last week.

OpenAI launches next-generation audio models in the API - OpenAI Blog

OpenAI has released a new suite of audio models that dramatically improve both speech-to-text and text-to-speech capabilities for developers building voice agents. Standing out is the new gpt-4o-mini-tts text-to-speech model, which lets you instruct it not just on what to say but on how to say it. It is limited to their standard voices, and their license requires developers to disclose that the voices are generated by AI and not a real human. They also released new speech-to-text models, gpt-4o-transcribe and gpt-4o-mini-transcribe, which push transcription accuracy past Whisper.

The technical improvements here aren't superficial. OpenAI built these on the GPT-4o architecture but with specialized audio-centric pretraining datasets. They've also implemented advanced distillation techniques to make smaller models perform at near-flagship levels, and leaned heavily on reinforcement learning to push transcription accuracy to state-of-the-art levels.

What does this mean? The bar for voice interfaces just got significantly higher. Whether you're building customer service bots, accessibility tools, or creative audio applications, these new models make previously challenging use cases much more viable. All these models are available in the API today for developers worldwide; there's a quick code sketch further down.

Google's Gemini 2.5 Pro quietly dominates benchmarks - Google Blog

Google just dropped Gemini 2.5 Pro and this isn't just another incremental update – it's their first real "thinking model" and it's topping benchmarks across the board. The improvements over their 2.0 release from just a couple of months ago look substantial.

What's immediately striking is that Gemini 2.5 Pro is beating the latest OpenAI and Anthropic models on tough reasoning tasks like GPQA and "Humanity's Last Exam" with state-of-the-art reasoning and scientific accuracy. They're claiming 18.8% on Humanity's Last Exam without resorting to majority-voting tricks – that's a significant jump.

On the coding front, Gemini 2.5 Pro is absolutely crushing it. According to the latest Aider polyglot benchmark, Gemini 2.5 Pro scores a remarkable 72.9% – comfortably beating Claude 3.7 Sonnet (64.9%) and OpenAI's o1 (61.7%), and leaving GPT-4.5 Preview (44.9%) in the dust. The 1M-token context window (with 2M coming soon, according to the blog) still leads all the other players.

Access is currently limited to Google AI Studio and Gemini Advanced users, with Vertex AI integration "coming soon." Interestingly, Google hasn't announced pricing yet but says it's coming "in the weeks ahead" – which suggests they're still calibrating based on usage patterns and competitive pressure.

I'm keen to see where the pricing lands and how it performs in the real world, but there's a fair chance that Claude 3.7 Sonnet will be eclipsed as the premier coding model. Which means an insta-upgrade for every agentic coder out there.
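If you want to try it yourself before the Vertex AI integration lands, the Gemini API through Google AI Studio is the quickest route. Here's a minimal sketch using the google-generativeai Python SDK; the model ID below is the experimental identifier from the launch period, so treat it as an assumption and check AI Studio for the current name.

```python
# Minimal sketch: calling Gemini 2.5 Pro via the Gemini API (Google AI Studio).
# Assumption: the experimental model ID "gemini-2.5-pro-exp-03-25"; check
# AI Studio for whatever identifier is current.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

model = genai.GenerativeModel("gemini-2.5-pro-exp-03-25")
response = model.generate_content(
    "Rewrite this recursive Fibonacci function iteratively and explain the change:\n"
    "def fib(n): return n if n < 2 else fib(n - 1) + fib(n - 2)"
)
print(response.text)
```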
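And circling back to the first story, here's a rough sketch of the new audio models through OpenAI's Python SDK. The model names come from the announcement; the instructions parameter on the speech endpoint (steering how the voice reads the text), the voice choice, and the file names are my reading of the launch material, so verify against the API reference before shipping anything.

```python
# Rough sketch of the new OpenAI audio endpoints via the official Python SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Speech-to-text with the new transcription model.
with open("meeting.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",  # gpt-4o-mini-transcribe is the cheaper sibling
        file=audio_file,
    )
print(transcript.text)

# Text-to-speech with the steerable TTS model. The `instructions` field is how
# you control *how* it speaks, not just what it says (my reading of the launch post).
speech = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="coral",  # limited to OpenAI's standard voices
    input="Thanks for calling! Your order is on its way.",
    instructions="Speak warmly, like a friendly customer support agent.",
)
speech.write_to_file("reply.mp3")
```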
GPT-4o makes image generation native to the model - OpenAI Blog

OpenAI has finally integrated image generation directly into GPT-4o. This was first teased something like 10 months ago, and I'm sure it's pure coincidence that they launched a week after Google announced Gemini's native image generation. After experimenting with both models, my initial takeaway is that GPT-4o made better images, but Gemini did a much better job at editing.

There are still clear limitations – OpenAI acknowledges issues with cropping longer images, rendering non-Latin text, and precise editing of specific portions of an image. It also struggles with very dense information or small text. Funny, that. It's as if they rushed the release or something. But the foundation they've built by training on the joint distribution of online images and text seems to be paying dividends in terms of visual fluency.

As for availability, GPT-4o image generation is rolling out now to Plus, Pro, Team, and Free ChatGPT users, with Enterprise and Education access coming soon. API access for developers will follow in the coming weeks. DALL-E will remain available through a dedicated GPT, but it's clear where the big players are placing their bets for the future of image generation.

cheers,
JV

PS: The next cohort of Practical AI for devs is kicking off in early May, so I'm about to overhaul the curriculum for a third time.