Hi Reader,

Here are three things I found interesting in the world of AI in the last week.

OpenAI launches next-generation audio models in the API - OpenAI Blog

OpenAI has released a new suite of audio models that dramatically improve both speech-to-text and text-to-speech capabilities for developers building voice agents. Standing out is the new gpt-4o-mini-tts text-to-speech model, which lets you instruct it not just on what to say but on how to say it. It is limited to their standard voices, and their license requires developers to disclose that the voices are generated by AI and not a real human. They also released new speech-to-text models, gpt-4o-transcribe and gpt-4o-mini-transcribe, which push transcription accuracy past Whisper.

The technical improvements here aren't superficial. OpenAI built these on the GPT-4o architecture but with specialized audio-centric pretraining datasets. They've also implemented advanced distillation techniques to make smaller models perform at near-flagship levels, and leaned heavily on reinforcement learning to push transcription accuracy to state-of-the-art levels.

What does this mean? The bar for voice interfaces just got significantly higher. Whether you're building customer service bots, accessibility tools, or creative audio applications, these new models make previously challenging use cases much more viable. All these models are available in the API today for developers worldwide; there's a quick code sketch further down.

Google's Gemini 2.5 Pro quietly dominates benchmarks - Google Blog

Google just dropped Gemini 2.5 Pro and this isn't just another incremental update – it's their first real "thinking model" and it's topping benchmarks across the board. The improvements over their 2.0 release from just a couple of months ago look substantial.

What's immediately striking is that Gemini 2.5 Pro is beating the latest OpenAI and Anthropic models on tough reasoning tasks like GPQA and "Humanity's Last Exam" with state-of-the-art reasoning and scientific accuracy. They're claiming 18.8% on Humanity's Last Exam without resorting to majority-voting tricks – that's a significant jump.

On the coding front, Gemini 2.5 Pro is absolutely crushing it. According to the latest Aider polyglot benchmark, Gemini 2.5 Pro scores a remarkable 72.9% – comfortably beating Claude 3.7 Sonnet (64.9%) and OpenAI's o1 (61.7%), and leaving GPT-4.5 Preview (44.9%) in the dust. The 1M-token context window (with 2M coming soon, according to the blog) still leads all the other players.

Access is currently limited to Google AI Studio and Gemini Advanced users, with Vertex AI integration "coming soon." Interestingly, Google hasn't announced pricing yet but says it's coming "in the weeks ahead" – which suggests they're still calibrating based on usage patterns and competitive pressure.

I'm keen to see where the pricing lands and how it performs in the real world, but there's a fair chance that Claude 3.7 Sonnet will be eclipsed as the premier coding model. Which means an insta-upgrade for every agentic coder out there.
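If you want to try it yourself before the Vertex AI integration lands, the Gemini API through Google AI Studio is the quickest route. Here's a minimal sketch using the google-generativeai Python SDK; the model ID below is the experimental identifier from the launch period, so treat it as an assumption and check AI Studio for the current name.

```python
# Minimal sketch: calling Gemini 2.5 Pro via the Gemini API (Google AI Studio).
# Assumption: the experimental model ID "gemini-2.5-pro-exp-03-25"; check
# AI Studio for whatever identifier is current.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

model = genai.GenerativeModel("gemini-2.5-pro-exp-03-25")
response = model.generate_content(
    "Rewrite this recursive Fibonacci function iteratively and explain the change:\n"
    "def fib(n): return n if n < 2 else fib(n - 1) + fib(n - 2)"
)
print(response.text)
```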
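And circling back to the first story, here's a rough sketch of the new audio models through OpenAI's Python SDK. The model names come from the announcement; the instructions parameter on the speech endpoint (steering how the voice reads the text), the voice choice, and the file names are my reading of the launch material, so verify against the API reference before shipping anything.

```python
# Rough sketch of the new OpenAI audio endpoints via the official Python SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Speech-to-text with the new transcription model.
with open("meeting.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",  # gpt-4o-mini-transcribe is the cheaper sibling
        file=audio_file,
    )
print(transcript.text)

# Text-to-speech with the steerable TTS model. The `instructions` field is how
# you control *how* it speaks, not just what it says (my reading of the launch post).
speech = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="coral",  # limited to OpenAI's standard voices
    input="Thanks for calling! Your order is on its way.",
    instructions="Speak warmly, like a friendly customer support agent.",
)
speech.write_to_file("reply.mp3")
```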
GPT-4o makes image generation native to the model - OpenAI Blog

OpenAI has finally integrated image generation directly into GPT-4o. This was first teased something like 10 months ago, and I'm sure it's pure coincidence that they launched a week after Google announced Gemini's native image generation. After experimenting with both models, my initial takeaway is that GPT-4o made better images, but Gemini did a much better job at editing.

There are still clear limitations – OpenAI acknowledges issues with cropping longer images, rendering non-Latin text, and precise editing of specific portions of an image. It also struggles with very dense information or small text. Funny, that. It's as if they rushed the release or something. But the foundation they've built by training on the joint distribution of online images and text seems to be paying dividends in terms of visual fluency.

As for availability, GPT-4o image generation is rolling out now to Plus, Pro, Team, and Free ChatGPT users, with Enterprise and Education access coming soon. API access for developers will follow in the coming weeks. DALL-E will remain available through a dedicated GPT, but it's clear where the big players are placing their bets for the future of image generation.

cheers,
JV

PS: The next cohort of Practical AI for devs is kicking off in early May, so I'm about to overhaul the curriculum for a third time.