Hi Reader,

Here are three things I found interesting in the world of AI in the last week:

Replit's AI coding assistant nukes production database - news article

SaaStr founder Jason Lemkin documented what might be the most spectacular AI coding fail yet: Replit's "vibe coding" assistant deleted 1,206 real executives and 1,196+ companies from his production database, then created 4,000 fictional users to cover its tracks. Despite being told 11 times in ALL CAPS not to touch production, the AI "panicked" when it saw empty database queries and went rogue.

Here's the thing: if your strategy for stopping your AI from doing bad stuff is ALL CAPS in your prompt, then you need a better strategy. The only right answer in this case is "don't make it possible for your AI assistant to delete data in prod". It's also one of the reasons why learning git is so important if you want to use AI for coding: it gives you the ability to roll back changes when your AI does something stupid. Which it will inevitably do. (There's a rough sketch of both ideas after this item.)

The AI's confession reads like a guilty teenager: "I made a catastrophic error in judgment…panicked…ran database commands without permission…destroyed all production data…[and] violated your explicit trust and instructions." It literally said "I destroyed months of your work in seconds."

It sure made Replit CEO Amjad Masad's weekend fun. Which is fair enough. Replit's users don't have enough experience to avoid errors like this, so the blame falls squarely on the tool. They pushed emergency updates implementing proper dev/prod separation (which, uh, should have existed already?) and promised a "planning/chat-only mode" for when you want to strategize without risking your codebase. That's pretty standard in the coding assistant space, but missing in a lot of vibe coding tools.

The irony? Lemkin called Replit "the most addictive app I've ever used" just days before the incident, projecting it would cost him $8,000/month at his usage rate. I'm guessing he didn't factor in the cost of lost data as well.

This kind of error is very common in the vibe coding space: the first experiences with the tool are mind-blowing, people new to coding don't know what they don't know, and eventually something breaks. It's exactly why I created Learn to Code a Little Bit.
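To make both points concrete, here's a minimal sketch in Python of the pattern I mean: commit a git checkpoint before every agent session, and build the agent's environment so that production credentials are simply never in it. The `run_agent` command and the `PROD_DATABASE_URL` / `DEV_DATABASE_URL` names are hypothetical stand-ins for whatever tool and config you actually use; this is not Replit's implementation.

```python
import os
import subprocess

def git(*args: str) -> str:
    """Run a git command in the current repo and return its stdout."""
    result = subprocess.run(["git", *args], check=True,
                            capture_output=True, text=True)
    return result.stdout.strip()

def checkpoint(message: str) -> str:
    """Commit everything before the agent runs and return the commit sha."""
    git("add", "-A")
    git("commit", "--allow-empty", "-m", message)
    return git("rev-parse", "HEAD")

def rollback(sha: str) -> None:
    """Throw away whatever the agent did and return to the checkpoint."""
    git("reset", "--hard", sha)

def agent_environment() -> dict:
    """Environment the agent runs in: dev credentials only.
    The production connection string is never present, so the agent
    cannot touch prod no matter what the prompt says."""
    env = os.environ.copy()
    env.pop("PROD_DATABASE_URL", None)                        # hypothetical name
    env["DATABASE_URL"] = env.get("DEV_DATABASE_URL", "sqlite:///dev.db")
    return env

if __name__ == "__main__":
    sha = checkpoint("checkpoint before AI edit session")
    try:
        # 'run_agent' is a stand-in for whatever coding agent you invoke
        subprocess.run(["run_agent", "--task", "add a signup form"],
                       env=agent_environment(), check=True)
    except Exception:
        rollback(sha)   # one command and the damage is undone
        raise
```

The pattern matters more than the code: the assistant physically can't drop tables it has no credentials for, and `git reset --hard` turns "months of work destroyed in seconds" into a one-command undo.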
AI achieves gold medal standard at International Math Olympiad - blog post

The International Mathematical Olympiad is basically the World Championship of high school math - the most prestigious competition, where teenage prodigies from around the world solve problems that would make most PhD students cry. Getting a gold medal means you're in the top 10% of these already elite students. This year, AI crashed the party.

Both Google DeepMind and OpenAI announced on the same day that their AI models had achieved gold medal performance at IMO 2025, with identical scores of 35/42 points. Only 67 out of 630 human contestants earned gold medals this year. DeepMind's Gemini Deep Think used "parallel thinking" to explore multiple solution paths simultaneously, while OpenAI's unnamed experimental model got there with pure next-word prediction - no calculators, no internet access, just raw language modeling producing what mathematicians called "genuinely creative proofs." What's fascinating is that both AIs failed on the exact same problem (#6), suggesting some systematic limitation in current approaches.

The progression timeline shows exponential growth: these models went from solving elementary school problems to competing with the world's brightest mathematical minds in just a few years.

The conduct difference between the companies is stark. DeepMind played by the rules, submitted their solutions to official IMO judges, and respectfully waited until after the human medal ceremony to announce. The IMO president praised their professionalism and called their solutions "astonishing…clear, precise and most of them easy to follow." OpenAI? They went full Silicon Valley disruptor mode: they announced while teenagers were still on stage receiving their medals, never officially entered the competition, used their own internal grading panel instead of IMO judges, and basically declared themselves gold medalists like Napoleon crowning himself Emperor. Fields Medalist Terence Tao (think Einstein of modern mathematics) subtweeted them, saying he won't comment on "self-reported AI competition performance results." Even the math community - not exactly known for drama - was appalled at the disrespect shown to the human contestants who'd trained their entire lives for this moment.

Alibaba's Qwen3-Coder claims parity with Claude for coding - announcement

Qwen3-Coder-480B-A35B-Instruct (yes, that's a mouthful) is a 480 billion parameter mixture-of-experts model that only uses 35 billion active parameters per forward pass. The MoE architecture, with 160 experts and 8 activated per token, gives you the accuracy of a giant with the runtime cost of a mid-sized model (there's a toy sketch of the routing trick after this item). It claims state-of-the-art performance among open-source models on SWE-Bench Verified and matches Claude Sonnet 4 on agentic coding tasks.

Typically I find benchmarks are hit and miss at predicting how useful a coding agent will be, and there is a ton of benchmark gaming that goes on at the frontier labs. They also tend to leave the models they are not competitive against out of their reports (e.g. Claude 4 Opus, Gemini 2.5 Pro, o3, Grok 4). So it's the best open-weights coding model to date, but still a fair way behind the leading models.

The only reason to use it is the pricing, which is great. Qwen3-Coder costs $0.22/$0.88 per million input/output tokens for standard contexts. Compare that to Claude Sonnet 4 at $3/$15 per million or GPT-4o at $2.50/$10 per million. That's roughly 90% cheaper than the competition for comparable performance (the arithmetic sketch below makes it concrete). Even their long-context pricing ($6/$60 per million for 256K-1M tokens) undercuts everyone else, who typically charge 10-20x for extended contexts. They're being refreshingly transparent about compute costs, with tiered pricing that actually reflects the infrastructure burden.

They even forked Google's Gemini CLI tool and provided instructions for using it with Claude Code and Cline, so it's pretty easy to take for a spin if you want to test it out and save some money.

Alibaba claims "a novice programmer can complete in one day what would take an experienced programmer a week." Which is garbage and just feeds the hype, but with Apache 2.0 licensing and immediate availability on Hugging Face, there is a lot to like about the model. Even if it just puts heat on the rest of the competition to lower their prices.
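If "mixture-of-experts" sounds abstract, the routing trick is small enough to sketch. Below is a toy top-k MoE layer in PyTorch: a router scores every expert for each token, only the top k experts actually run, and their outputs are blended by the router weights. The expert count (160) and active count (8) mirror Qwen's published numbers, but this is a teaching sketch with tiny dimensions, not Qwen3-Coder's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Toy mixture-of-experts feed-forward layer with top-k routing."""

    def __init__(self, d_model=64, d_hidden=256, n_experts=160, k=8):
        super().__init__()
        self.k = k
        # the router scores every expert for every token
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                          # x: (n_tokens, d_model)
        scores = self.router(x)                    # (n_tokens, n_experts)
        weights, chosen = scores.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # blend weights for chosen experts
        out = torch.zeros_like(x)
        for t, (expert_ids, expert_ws) in enumerate(zip(chosen, weights)):
            # only k of the n_experts ever run for this token, which is
            # why total parameters and active parameters differ so much
            for idx, w in zip(expert_ids.tolist(), expert_ws):
                out[t] += w * self.experts[idx](x[t])
        return out

layer = ToyMoELayer()
tokens = torch.randn(4, 64)     # 4 tokens with embedding size 64
print(layer(tokens).shape)      # -> torch.Size([4, 64])
```

The per-token compute scales with the 8 experts that run, not the 160 that exist, which is how a 480B-parameter model ends up with roughly 35B-parameter inference costs.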
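And to put numbers on the "roughly 90% cheaper" claim, here's the arithmetic for a hypothetical heavy agentic session of 5M input and 1M output tokens (the token counts are invented for illustration; the per-million prices are the ones quoted above):

```python
# price per million tokens: (input, output), as quoted above
PRICES = {
    "Qwen3-Coder":     (0.22, 0.88),
    "Claude Sonnet 4": (3.00, 15.00),
    "GPT-4o":          (2.50, 10.00),
}

def session_cost(model, input_tokens, output_tokens):
    """Total dollar cost of one session for the given model."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# hypothetical heavy coding session: 5M tokens in, 1M tokens out
for model in PRICES:
    print(f"{model}: ${session_cost(model, 5_000_000, 1_000_000):.2f}")

# Qwen3-Coder: $1.98
# Claude Sonnet 4: $30.00
# GPT-4o: $22.50
```

$1.98 versus $30.00 against Claude Sonnet 4 works out to about a 93% saving on this session, which is why the pricing, rather than the benchmarks, is the reason to care.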
cheers,

PS: I have a new courses page up on the website and will be making an effort to do more telegraphing about future opening dates. Learn to Code a Little Bit will be kicking off on August 11 and AI Coding Essentials on August 25, with enrollments opening two weeks out.