Agentic coding notes from Galapagos Island

I've been using AI fairly heavily since last November and the whole thing is a funny experience. An agent will do something that, if a human did it, you'd immediately fire them. My reaction, of course, is to act as if this is great and spin up a thousand agents so they can do even more of that.
Mid-last year, I had GPT (maybe 5.0 or 5.1) try to find the source of a bug. Naturally, this code didn't have tests and git bisect wouldn't work, and it was a UI interaction bug that I'm not even really qualified to write a test for, so I asked Codex to bisect between dates X and Y to find the commit that introduced this bug. Codex immediately told me the offending commit was after this date range (which couldn't possibly be correct). On telling Codex this was wrong, it then, once or twice, told me some commit that was obviously also not the offending commit. On telling it those were wrong, it then told me the offending commit was some plausible looking commit. When I asked it to prove or disprove its theory, it told me that it wrote a test and confirmed that the alleged commit was the breaking commit.
I then asked it to show me by making a video with the full developer end-to-end stack in the normal browser test environment. It claimed that it didn't have permissions to do that (which was a lie), but it could make a video of the execution of the repro before and after the commit in Playwright with the appropriate test code. The video was convincing and showed the feature working properly before the commit and failing to work after the commit. Something about this didn't feel right, so I tried reproducing the issue by hand before and after the commit and found out that the whole thing was a fabrication. The video made it look like Codex had reproduced the bug, but it was an artificial browser environment that was designed to create a fake repro, not the real environment.
Like I said, because this was non-ironically such a great experience, I immediately thought to myself, "how can I get more of this?" and started using agents more and more heavily until I was using coding agents heavily mid-late last year.
Since this post covers a relatively disparate set of topics, here's a brief outline.
- Testing background
- Some details on testing
- Caveman mode
- LLM variance
- Misc
- Agentic loops and writing this post
- Some reasons people talk past each other
Testing background
LLMs are highly leveraged when it comes to testing. In terms of the amount of effort it takes, it's easier than ever to hit a particular quality bar and yet, software seems to be lower quality than ever. A decade ago, we looked at the bugs I ran into in an arbitrary week. There were quite a few bugs then and I run into more bugs now, but I don't think this has to be the case.
For one thing, after a bug has been shipped, it's easier than it's ever been to use a data-driven approach to find and fix the bug. Just for example, at work, I tried creating a pipeline that goes from support ticket (chat or email) to pull request (PR). As far as I can tell, this works ok. Since I work for a company that has a traditional workflow, all of these fixes get reviewed by a human and, so far, we've had no known false positives.
Per unit of time invested, it's also possible to do more thorough testing. Personally, I think this can be effective enough that I'm fairly comfortable trying to ship a large volume of code via a "software factories" workflow because I've seen a testing-heavy no-review workflow that results in much higher quality than any review-reliant workflow I've seen or even heard of.
Like everybody, I have biases that fall out of my experiences. It just so happens that I spent the first decade of my career at a company whose test processes happen to work well in today's LLM environment. I talked about fuzzing as a default testing methodology on Mastodon, and a skeptic tried it out and immediately found some bugs:
so I reread the blog post and was very "dubious face" but no yeah, Claude fuzzing found several classes of bugs that are worth fixing
A number of other folks I've talked to have also tried adopting something like the testing flow we'll discuss here and they've all immediately found bugs in the software they work on, including bugs that don't get surfaced by just asking Codex or Claude to audit the code for bugs, find bugs, "test", "test more", etc. For example, Dennis Snell mentioned that he and a teammate, Jon Surrell, not only found bugs in the code they're working on, but also "in upstream dependencies, including the HTML specification, big-three browsers, and other open-source projects" with fairly low effort.
In general, when I talk to software folks about testing, I'm coming from such a different place that they immediately look at me like I'm an alien, so let's talk about how we tested at this hardware company I worked for, Centaur, which informs my biases about how I like to work. Some of the things that we did that were or are unorthodox in the software world are:
1. Hired dedicated QA / test engineers, with testing being a first-class career path on par with being a developer
2. No code review by default
3. Virtually no hand-written tests
4. Constant testing via what programmers sometimes call property based testing, randomized testing, fuzzing, etc., although we just called those tests (hand-written tests were called "hand tests").
5. Large regression test suite (3 months wall clock to execute on compute farm)
6. No unit tests
Just to give you an idea of the general structure, when I left (in 2013), we had about 1000 machines generating and running tests at all times for roughly 20 logic designers and 20 test engineers. This was on prem and the machines took up half a floor of the building we were in.
The general structure was that we had maybe 20% of machines running regression tests, and 80% generating and running new tests. Three months of regression tests is too much to gate commits on, so there was a much shorter list of tests that took maybe 10 minutes or so to run that people would run before committing. Those pre-commit tests would run on a special setup to run as quickly as possible, with overclocked machines that were the fastest machines money could buy, as well as a different simulator setup.
New failures would get found and reported as they happened and one to two engineers had a job of sorting through failures and triaging them (rejecting false positives, fixing issues in the test generator that caused them to generate false positives, etc.).
In terms of the magnitude of the impact, unless you count culture as a separate item, (1) was probably the biggest difference between us and a typical software company, but also the most irrelevant for readers here, so I'll relegate the discussion to a footnote1, except for this brief comment that testing is like any other skill; spending more time doing it improves skill and, since testing isn't a first-class career path at most major tech companies, people generally don't have the same level of testing skills at software companies as you see in some career CPU test engineers. In the same way that an engineer who spends 20 years working on distributed systems or UX is going to be much better at it than an equally talented engineer who spends 5% of their time on distributed systems or UX, someone who spends 20 years working on testing is going to be much better at it than somebody who spends 5% of their time on testing.
(2) is one of the things that makes some of the test practices we used at the chip company suited to AI workflows. We didn't review code by default because we trusted our test practices enough that review didn't, in general, add much reliability. We were shipping fewer than 1 significant user-visible bug per year, and review was done on an as-needed basis when someone wanted an extra set of eyes on something they thought was particularly tricky2. With AI coding workflows, it's easy for one person to generate more code than any human or even any ten humans can review by hand. People have different levels of comfort with shipping code without review. Personally, I'm very comfortable shipping code without human review because I've seen it done on products that are technically more challenging than most software at most software companies.
I often see people say things like, "that's too much risk; we have millions of users" but, empirically, they're talking about a workflow that ships bugs at a rate that's maybe a thousand times higher per capita on raw count, with the ratio being much higher if you adjust for severity. If a company were shipping bugs at, say, a hundredth the rate we were at Centaur while relying primarily on review to catch bugs, then I could see their point, but that's not what's happening at the typical software company where people don't want to move away from human review because of the perceived risk of shipping bugs.
(3) and (4) go hand in hand. Almost every software group I know of that's serious about reliability (various teams that ship reliable databases, distributed databases, etc.) is at least directionally doing the same thing, although they might have a larger fraction of hand-written tests. For the same reason it's considered a bad idea to rely on testing by interacting with the software yourself and observing whether or not the software appeared to work, it's a bad idea to rely on directly typing out the inputs to a test and the expected outputs. As previously discussed, it's just really inefficient to write tests by hand. For any given level of reliability, you'll get there more quickly if you prefer randomized test generation over hand-written tests.
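As a rough sketch of the contrast (not the setup we used at Centaur), here's a hand-written test next to a randomized test for a small function. The function `dedupe_sorted`, its planted bug, and the `reference` model are all hypothetical and made up for illustration. The point is only that the randomized version generates its own inputs and checks them against a simple reference, instead of relying on someone typing out inputs and expected outputs.

```python
import random

def dedupe_sorted(xs):
    """Hypothetical code under test: remove duplicates from a sorted list.

    Planted bug: it skips at most one duplicate, so runs of 3+ leave extras.
    """
    out = []
    i = 0
    while i < len(xs):
        out.append(xs[i])
        if i + 1 < len(xs) and xs[i + 1] == xs[i]:
            i += 1
        i += 1
    return out

def reference(xs):
    """Slow but obviously correct model used as the oracle."""
    return sorted(set(xs))

def hand_test():
    # A plausible hand-written test: one typed-out input, one expected output.
    return dedupe_sorted([1, 1, 2, 3, 3]) == [1, 2, 3]

def random_test(seed, runs=1000):
    """Generate sorted inputs at random and compare against the reference.

    Returns the first failing input, or None if every run passed.
    """
    rng = random.Random(seed)
    for _ in range(runs):
        xs = sorted(rng.randint(0, 5) for _ in range(rng.randint(0, 6)))
        if dedupe_sorted(xs) != reference(xs):
            return xs
    return None

print("hand test passes:", hand_test())
print("random test counterexample:", random_test(seed=0))
```

The hand test passes because the person who wrote it only thought of duplicate pairs. The random test doesn't need anyone to think of the triple-duplicate case; it only needs the generator to be able to produce it.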
(5) fell out of having a lot of tests find a lot of bugs. In general, if a test found a bug that we later fixed, we'd keep the test in our regression test suite forever. It turns out, if you find a lot of bugs with good tests, you'll end up with a large test suite. But putting that aside and just looking at it from a test efficiency standpoint, the standard setup in software of having the same set of tests run in CI for each PR is extraordinarily inefficient if you think about what's more likely to find a bug: running the same test a thousand times in a day or, in the same amount of test time, running a thousand different tests.
(6) came out of test efficiency concerns as well, in that we had a much smaller team than our competitors. That was a reason the company managed to survive for so long. While Intel was putting every x86 designer out of business other than AMD, our operating cost was low enough that the company survived until 2021, at which point it was acquired by Intel for $125M. With the company's tiny team size, it wouldn't have been possible to get reasonable test coverage with unit tests and hiring enough to do unit tests probably would've meant the company would've gone the way of the x86 efforts of Transmeta, Rise, Cyrix, TI, UMC, NEC, VM, etc., a decade or two sooner. From an efficiency standpoint, unit testing does pretty poorly.
To sum it up, we did quite a few things that most software people tell me are bad ideas (dedicated test engineers, no unit tests, no code review, etc.) and we had much higher quality than any software company I've worked for or any software I've used. Whenever I talk about this, people will say that this doesn't apply to software because CPUs only have X concerns and you can't do the same thing with Y. When I first switched from CPU design to software I thought that might be true, but I've since tried this testing methodology with every kind of Y that someone has mentioned this can't work for and it's worked for every single one, so I no longer find this very plausible (and the Xs generally involved incorrect assumptions of what hardware development is like). While there are real differences between hardware and software, when I’ve seen people lean on that as a reason that testing techniques don’t carry over, it’s been the case that the person is relying on some imagined factor that only seems relevant because the person doesn’t know much about hardware development.
One significant difference was the ratio of effort that went into testing vs. development, but the fixed costs of fuzzing are fairly low, so this is scalable to any level of effort and the efficiency gains are still there. And, due to the gains in test efficiency, the ratio of effort wasn't as large as software engineers generally imagine. We had about a 1:1 ratio of test engineers to developers and then spent maybe 10% of our time in a "freeze" state, where the goal was to find bugs and not ship new features, so a zeroth order estimate for the overhead here is that we spent 55% of our effort on testing and 45% on development, or we could've put 2.2x the effort into development if we spent zero effort on testing. If you look at a software company that's shipping significant bugs many times faster than we did and you declare an emergency and get people to spend 55% of their effort on testing, I don't think the ratio changes too much. Maybe they get to half the previous ratio or something, but the level of effort isn't really what's making the difference.
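For anyone who wants to check the zeroth order estimate, the arithmetic is below. The inputs come from the paragraph above; treating the freeze as time taken only out of developer effort is my reading of how the 55/45 split falls out.

```python
# Inputs from the text above.
test_engineers_per_developer = 1.0  # "about a 1:1 ratio"
freeze_fraction = 0.10              # "maybe 10% of our time in a freeze state"

# Share of headcount that is developers: 0.50
developer_share = 1 / (1 + test_engineers_per_developer)

# Developers only do feature work outside of the freeze: 0.50 * 0.90 = 0.45
dev_effort = developer_share * (1 - freeze_fraction)
test_effort = 1 - dev_effort

print(f"development: {dev_effort:.0%}, testing: {test_effort:.0%}")
# prints "development: 45%, testing: 55%"
print(f"development effort with zero testing: {1 / dev_effort:.1f}x")
# prints "development effort with zero testing: 2.2x"
```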
Nowadays, another thing people will say is, why bother with fuzzing when you can just ask an LLM to find bugs? I've tried doing both quite a few times now and my experience has been that fuzzing generally wins on latency to find a bug, and it dominates on finding more bugs and having a lower false positive rate. LLMs have fairly high variance (more on this later), so just asking Codex or Claude to find a bug can sometimes win but, on average, fuzzing has won.
Some details on testing
Despite the very positive things I've said about LLMs and testing, LLMs seem pretty bad at testing. It's less that LLMs are good at testing and more that LLMs let you apply testing effort a lot more easily than before.
An extreme example of this is that everybody I've talked to who cares about quality or testing at all finds the tests LLMs generate by default, or if you tell them "Write tests", "Write more tests", etc., to be poor. People tend to rate the tests as somewhere between worthless and marginally useful, depending on their standards.
For example, Em Chu (a compiler engineer) says:
The existing tests I'm working with aren't perfect, but are still above the bar LLMs seem to aim for, which I would describe as "thorough enough to smuggle a feature through human code review." For a compiler (compared to e.g. UI), where I'm guessing it's easier to write the average test, but a higher bar of correctness is generally expected of the end product, LLMs just suck. They are painfully bad at the adversarial "now, what if I do this" or "let's try the cross-product of everything" process humans use to write tests that actually find bugs
At the same time, I've seen a number of folks rave at how amazingly good LLMs are at testing when you tell them "Write tests", "Write more tests", etc. When I've looked into why people say LLMs are great at testing, what I've found is that people who did essentially no testing at all find LLMs to be great at testing. Well, that makes sense. If you go from basically zero testing effort to a tiny bit of testing effort, that's a huge win.
As of June 2026, directing LLMs to do fuzzing / randomized testing feels similar. I've tried using an LLM to generate a fuzzer and, for most projects, this will turn up real and often serious bugs within minutes. However, on looking at what the LLM-created fuzzer is trying to test, I have the same reaction as a normal programmer who cares about quality looking at LLM-created tests. The coverage of the LLM-generated fuzzer is curiously bad and misses all kinds of basic things you'd expect a hastily human-written fuzzer to cover. Depending on whether you're a glass half empty or a glass half full person, you might say that this says something about the test coverage of most projects, or that it says something about the unreasonable effectiveness of fuzzing.
At a high level, LLM-generated fuzzers from SOTA models today don't do a good job of "thinking about" how inputs should be varied to elicit bugs. Then, if you naively tell it about how inputs should be varied and to combine these, it will also not combine bug ingredients in a reasonable way. It's possible to give instructions that will work well, but this heavily relies on the user to provide direction.
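To make "combine bug ingredients" a bit more concrete, here's a minimal sketch of a generator that picks a value for each of several independent dimensions and composes them into one test case, plus a crude measure of how many cross-dimension pairs have been exercised. The dimensions and values in `INGREDIENTS` are hypothetical placeholders, not from any real project, and a real generator would need domain-specific ingredients.

```python
import itertools
import random

# Hypothetical "ingredients": each dimension lists values that tend to matter.
INGREDIENTS = {
    "size": [0, 1, 255, 256, 65536],  # boundaries, not just "typical" sizes
    "encoding": ["ascii", "utf-8", "invalid"],
    "concurrent_writers": [1, 2, 8],
    "fault": [None, "timeout", "partial_write"],
}

def random_case(rng):
    """Pick one value per dimension independently, so combinations that
    nobody thought to write down still show up."""
    return {name: rng.choice(values) for name, values in INGREDIENTS.items()}

def pairwise_coverage(cases):
    """Fraction of all cross-dimension value pairs exercised by `cases`."""
    names = sorted(INGREDIENTS)
    wanted = set()
    for a, b in itertools.combinations(names, 2):
        for va, vb in itertools.product(INGREDIENTS[a], INGREDIENTS[b]):
            wanted.add((a, va, b, vb))
    seen = set()
    for case in cases:
        for a, b in itertools.combinations(names, 2):
            seen.add((a, case[a], b, case[b]))
    return len(seen) / len(wanted)

rng = random.Random(0)
cases = [random_case(rng) for _ in range(200)]
total = len(list(itertools.product(*INGREDIENTS.values())))
print(f"total combinations: {total}")
print(f"pairwise coverage after {len(cases)} random cases: {pairwise_coverage(cases):.0%}")
```

Tracking something like `pairwise_coverage` is one cheap way to notice when a generator never produces certain combinations, although it can only report gaps in dimensions someone already thought to list.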
If you're using randomized testing as "extra credit", to catch a few more bugs, or to replace traditional software testing processes, you can just tell an LLM to look for risky areas of the code and find invariants that might be violated and fuzz them. This works ok. When I've convinced people to try some randomized testing, they usually start here and find quite a few bugs they're happy to have found. Due to the nature of who's interested in trying out novel-to-them test techniques, this is often from people who've worked on some of the most well-tested and reliable code at the company and they can find bugs in their own relatively well-tested code.
If you want to use randomized testing to keep an agentic "software factories" workflow honest, then you need to have a way to deal with gaps in SOTA models because, when you're shipping the equivalent of hundreds or thousands of PRs a day into a project, everything that's not constrained from degrading will rapidly degrade.
At a high level, the entire system needs some kind of feedback that finds gaps and instructs whatever loop is making adjustments to the fuzzer to close the gaps. Recently, I've been testing things where I don't understand the domain and don't understand the project or the code, so I've been flying relatively blind and know that there will be a lot of gaps in what I come up with (and, as noted above, LLMs are terrible at this). But even in areas where I'm familiar with the domain and understand the code relatively well, there will still be some gaps because humans miss things and make mistakes, so there always needs to be some kind of feedback into the test setup that can find gaps and allow you or an agent to close the gaps.
I've been playing with various ways to have agents convene and reconvene to get agentic loops running better and, while that kind of thing helps, I haven't figured out a way to do this well enough to create some kind of agentic software quality improvement loop that doesn't rely on outside feedback, whether that's occasional human input, or shipping something (ideally only fractionally and with staged rollout) and then having the system monitor metrics/logs/traces/support tickets/whatever to use that as feedback. The support ticket to PR pipeline I mentioned above is one such feedback loop. The pipeline not only tries to generate a PR, it also tries to get the test setup to add test coverage that will find the bug and possibly surface other bugs, or will re-find the bug if there's a future regression. This seems to work ok-ish, in that it finds real bugs and improves test coverage, but I'm sure there's a lot of room for improvement.
Relatedly, I've been wondering why LLMs are so bad at writing tests. On asking around a bit, I'm told that this is because the capabilities that LLMs have come out of people building RL environments which allow models to improve at tasks, sometimes in a generalizable way and sometimes not. I'm also told that there's a market for selling RL envs, but it's fairly thin because there aren't all that many buyers for them, and you really want to know someone at a lab who's a buyer or close to it. If you are such a person or can connect me with such a person, could you do me a favor and reach out to me (I'm fairly easily reachable on X, Mastodon, email, etc.). I'm curious about how this works and how plausible it is to sell an RL env for something like testing, optimization, or the longer horizon tasks discussed in this post, where it's easy to observe significant gaps.
Back on the topic of testing, when fuzzing or doing any kind of bug auditing, detecting false positives is a critical part of the process. At least for now, having access to a model that's better than anything you can publicly use won't save you. A while back, Dennis Snell told me, frustratedly, that he spent the day wading through AI slop forwarded to his employer by Anthropic that came from their vaunted Mythos model that's too dangerous to release to the public. Anthropic was apparently doing the company some kind of favor or maybe doing some kind of EA security improvement, except that they didn't bother with having a reasonable false positive rejection process so they were just forwarding garbage to us. At the time, I was using a model that, if Fable is any indication, appears to be moderately less capable than Mythos, but I had no problem generating an endless stream of bugs (some of which were security issues) with no known false positives, which seems to indicate that having a reasonable setup around the model is at least as important as having the latest and greatest model.
I've been trying custom workflows on a per-project / per-problem basis and don't exactly have a generic false positive rejection scheme, but there are various things that seem to be semi-generalizable. If you don't mind spending tokens, having independent agents repeatedly check an alleged bug reproduction (repro) substantially cuts the false positive rate. A couple months ago, I mentioned that I had good luck using different "personas" for reviews as well as for managing agentic loops and I got some responses with theoretical reasons this doesn't work well but, in practice, it seems to work fairly well. My workflow changes regularly and maybe a week after that discussion I started adding "contrarian" personas to the mix, which improved performance given the same wall clock or token budget.
For anything human reviewed, having some kind of artifact (e.g., a video if it's a bug that's expected to be apparent in the UI) is necessary. Without really explicitly trying to have the agent review this, just producing this at all seems to reduce the false positive rate somewhat, and then having the agent review the artifact reduces the false positive rate further. Asking agents to independently review the artifact (e.g., looking at the test code that produces the video vs. looking at the video itself) also reduces the false positive rate further. In general, getting independent perspectives seems to help a lot with reducing false positives. Even just asking the same question multiple times improves results, for reasons that should be obvious from the graphs in the LLM variance section, although, in the experiments I've run, this has been less effective per unit of wall clock time or per dollar than using agents with different personas / perspectives.
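As a back-of-the-envelope illustration of why repeated checks help, here's a small calculation under a strong assumption: that each check is statistically independent of the others. The per-check rates below are hypothetical numbers picked for illustration, not measurements.

```python
from math import comb

def accept_probability(p_single, n_checks, k_required):
    """Probability that at least k_required of n_checks independent checks
    accept a report, when each check accepts with probability p_single."""
    return sum(
        comb(n_checks, k) * p_single**k * (1 - p_single) ** (n_checks - k)
        for k in range(k_required, n_checks + 1)
    )

# Hypothetical per-check rates, chosen only for illustration.
P_ACCEPT_BOGUS = 0.30  # a single check wrongly accepts a bogus repro
P_ACCEPT_REAL = 0.90   # a single check correctly accepts a real bug

for n in (1, 3, 5):
    k = n  # require every check to agree
    bogus = accept_probability(P_ACCEPT_BOGUS, n, k)
    real = accept_probability(P_ACCEPT_REAL, n, k)
    print(f"{n} checks: bogus accepted {bogus:.4f}, real accepted {real:.4f}")
```

With these made-up rates, requiring three agreeing checks cuts acceptance of bogus reports from 30% to 2.7%, at the cost of cutting acceptance of real bugs from 90% to about 73%. Checks run by the same model with the same prompt are unlikely to be fully independent, so the real improvement will be smaller than this suggests.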
Pretty much everything I've tried to reduce false positive rate has worked, so if you're not scaling up a workflow to the point where optimizing the costs matters, doing anything remotely reasonable seems to work fairly well.
Caveman mode
I keep getting various tool and workflow recommendations and, when I look into it, I can almost never find good information on whether or not it makes sense to adopt the recommendation. Just for example, I've seen "caveman mode" recommended multiple times at work. Caveman mode allegedly reduces token usage and speeds up prompt resolution (the README claims 75% reduction in token usage, 65% reduction in token usage, and 2x fewer tokens used, as well as a 3x speed increase).
Searching for information (just googling 'caveman mode', no quotes), the top hit that wasn't a link to caveman mode was this reddit thread where, of the three top comments, one is a joke and the other two highly recommend it:
Just extreme brevity in a refreshing way... and dramatically lowered token count without any seeming impact on the analytical thinking... but i have no way to benchmark before and after.
Someone at work is testing it and it seems to actually save tokens AND work just as well.
Most of the rest of the top hits were also positive recommendations for caveman mode that purported to do some kind of eval (although they read like unfiltered LLM text) and the top hit on YouTube was one of the biggest programming YouTubers saying
it actually works; it actually works quite well ... no, I'm not exaggerating
In a Slack thread at work where people were recommending caveman mode, I asked if anyone had done a comparison, noting that the creator of caveman mode responded to the HN thread about caveman mode by saying it's a joke. Someone linked to an analysis of caveman mode that claims a significant win, but the analysis was an LLM-generated SEO spam article with numerous errors. When I politely pointed this out, the person who posted the link said "I only skimmed it".
At that point, I decided to spend about 15 seconds apiece generating some caveman mode benchmarks (it seems like people call benchmarks evals now, so I should call these evals?) (previously discussed in more detail here).
To start with, I'd been using a lot of GPT-5.5 xhigh when this discussion came up a couple months ago and I benchmarked this thing, so let's look at how this looks for GPT-5.5 xhigh on the first benchmark, a simple benchmark where we ask the agent to optimize some code in wasm. Since this is something I spent 15 seconds prompting an agent to generate, I don't think it's worth spending a ton of time discussing the details, but one thing to note is that it's possible to do much better than any of the results an agent achieved here. I would expect a human doing this by hand or a human who's being prescriptive about what the agent should do to get results that are literally off the charts in the charts below. For the optimization chart, 1.0 is no speedup and higher is better (below 1.0 means the "optimization" slowed things down). To give you an idea of what this looked like when running the experiments, you can click the buttons to follow along interactively or just play an animation.
We can see that, for the first benchmark (optimize a non-trivial algorithm in wasm), after one run, caveman is looking good. We get 1.027 speedup vs. 0.987, $12.10 vs. $23.10, and a wall clock time (of how long the agent took) of 8m51s vs. 14m9s. But we know that LLMs are stochastic, so we should probably run again. After a second run, we can see a big change in the results, with an average of 1.0 speedup for both, but caveman mode coming in at $12.45 in 8m64s vs. $40.38 in 17m57s. That's a huge cost savings that's in line with the claimed savings from caveman mode. It's a bit silly to narrate each step of the animation, but if we skip to the end, we can see that the average after 50 runs is in favor of caveman, with 1.03 vs. 1.01 speedup, and $17.97 in 13m46s vs. $24.21 in 16m52s. That's not as good as what we saw after two points, but that's still solidly in favor of caveman mode.
I asked GPT-5.5 xhigh to do classical and Bayesian statistics on this and it produced a script that says that, for the Optimization 1 benchmark, the p-values for caveman having better speedup, cost, and wall clock time, are 0.1, 0.005, and 0.001, respectively. With Bayesian stats, we have P(caveman better) 0.958, 0.999, and 1.000, respectively. We can look at the plots for the other two benchmarks I spent 15 seconds on as well. Optimization 2 is another "optimize this code in wasm" benchmark, and Game AI is a task where the agent is asked to implement a board game AI for the game Lost Cities with a deadline of 10ms per move.
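The script itself isn't reproduced here, but a minimal sketch of this kind of comparison could look like the following. It uses a one-sided permutation test for the classical p-value and, as a stand-in for a full Bayesian model, a Bayesian bootstrap estimate of how often one condition's mean beats the other's. The choice of these two methods is an assumption on my part, and the data is synthetic (drawn from made-up distributions), not the results above.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def permutation_p_value(a, b, rng, rounds=2000):
    """One-sided p-value for 'mean(a) > mean(b)' under random relabeling."""
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        diff = mean(pooled[:len(a)]) - mean(pooled[len(a):])
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (rounds + 1)

def bayesian_bootstrap_mean(xs, rng):
    """One posterior draw of the mean using Dirichlet(1, ..., 1) weights,
    generated by normalizing exponential samples."""
    weights = [rng.expovariate(1.0) for _ in xs]
    total = sum(weights)
    return sum(w * x for w, x in zip(weights, xs)) / total

def prob_a_better(a, b, rng, draws=2000):
    """Estimate P(mean of a > mean of b) with the Bayesian bootstrap."""
    wins = sum(
        bayesian_bootstrap_mean(a, rng) > bayesian_bootstrap_mean(b, rng)
        for _ in range(draws)
    )
    return wins / draws

rng = random.Random(0)
# Synthetic speedups for 50 runs per condition; NOT the real results.
caveman = [rng.gauss(1.03, 0.05) for _ in range(50)]
baseline = [rng.gauss(1.01, 0.05) for _ in range(50)]

print("p-value (caveman speedup > baseline):",
      permutation_p_value(caveman, baseline, rng))
print("P(caveman better):", prob_a_better(caveman, baseline, rng))
```

For cost and wall clock time, where lower is better, the same functions apply with the arguments swapped.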
The results with these benchmarks look mixed. For Optimization 2, we have P(caveman better) = 0.17, 0.999, and 1.000, respectively, and for Game AI, we have P(caveman better) = 0.04, 0.79, and 0.73, respectively, so caveman actually gives worse results for Optimization 2 and Game AI, which is the opposite of what we saw for Optimization 1. BTW, I didn't cherry pick the order of these results to present some kind of surprising narrative reversal. If we were to stop here, we might think that caveman gives worse outcomes but saves money, or maybe it gives better outcomes on some tasks and worse outcomes on some tasks and saves money.
If we try a few more models (GPT-5.4 mini, GPT-5.4, GPT-5.5) at every effort level, we get the following averages, which makes the overall picture less clear (in the graphs below, up and to the left is better, down and to the right is worse; the arrows point from the baseline to caveman):
What are the patterns here? To name a few, for Optimization 1, caveman generally has better results than standard, but for Optimization 2 and Game AI, it's mostly the other way around, although there are exceptions. For cost, we can see a variety of patterns as well. There's enough variance between conditions (tasks as well as models and effort levels) that it's clear that we'd have to run a lot more conditions to get a clear picture of what's going on and, overall, the difference averages out to be small enough that it doesn't seem worth using caveman mode.
LLM variance
Recently, when new models have been released, I've done a search to see what people are saying about them. In general, there are a lot of contradictory comments out there. For example, when GPT-5.5 was released people said, variously, GPT-5.4 is better than 5.5 because it's better at staying on task while 5.5 wanders off and overthinks the problem, making 5.5 much more expensive and pointless; 5.5 is so much better than 5.4 that it's cheaper to use because it doesn't mess up and then get stuck fixing its own issues as much; 5.5 is cheaper than 5.4 because it works so well you can run at a lower effort level; 5.5 "just works" while 5.4 often fails and needs handholding, etc. Often, someone will run a benchmark and show that their statement is true.
Looking at these benchmarks, we can see support for all of these statements that I saw on reddit when searching for comments on GPT-5.5 shortly after release. In Optimization 1, GPT-5.4 has better results than 5.5 and is much cheaper. But in Game AI, GPT-5.5 is substantially better than 5.4, so much so that 5.5 high costs about as much as 5.4 xhigh, but with better results, and 5.5 medium is cheaper than high with significantly better results. With just these three evals, you can find support for every statement I saw people making about GPT-5.5 on release because all of the statements are sometimes true. And that's when we're averaging out variance with quite a few more runs than any reasonable person is going to make to support some comment they're throwing out on the internet.
In general, this kind of thing is why, when I see a metric or graph that summarizes a set of benchmarks, I think, "show me the distribution". Benchmarks of models often reduce to a single, nice, neat number, where you see that X is better than Y, which is better than Z. I find these to be basically meaningless, in that, if we're looking at the latest and greatest from OpenAI and Anthropic, we know there are reasonable benchmarks where X is better than Y and vice versa. If the set of benchmarks had a few more benchmarks that favored Y instead of X, the results would be flipped. For some kind of summary metric like that to be useful to me, it would have to be the case that the set of benchmarks perfectly mirrors the distribution and weight of tasks I do and that I can only choose a single model to use for all tasks. Since neither of those is true, it’s not clear what actionable information I can take away from these benchmarks.
If we look at public benchmarks in more detail, the situation seems worse than it appears from the abstract argument above. Results are generally presented in fairly high precision, as if that's meaningful. On one benchmark, we might see that (for example) GPT-5.5 xhigh is 1% better than Fable 5 medium, but at 19% lower cost. And then if we compare to Opus 4.8 maybe it's 13% worse than GPT-5.5 xhigh at 11% higher cost. If we want to know what this means, we can dig into the data and see that we have some benchmark that claims to be meaningful because it has this big set of diverse tasks, but they're all pass/fail tasks that get run 4 times and most tasks are either very easy and get 4/4 with the best models (except, due to some random noise, you sometimes see a random 3/4) or are very hard and mostly get 0/4. Then there's some small subset of tasks that actually determine the relative scores of these SOTA models. If you change out one of these for a different one, the results between the two highest scoring models can get flipped. If you change a few tasks (out of about 100), then you can see the apparently much worse Opus 4.8 move ahead of GPT-5.5. Change a few more tasks and GLM-5.2 can pull ahead.
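A small simulation shows how this kind of flip can happen. Everything in it is hypothetical: the two models X and Y, the per-task pass probabilities, and the task mix (mostly very easy or very hard tasks, plus a handful of contested ones) are made up to mirror the structure described above, not taken from any real benchmark.

```python
import random

RUNS_PER_TASK = 4

def make_tasks(rng, n_easy=60, n_hard=30, n_contested=10):
    """Each task is (p_pass_model_x, p_pass_model_y); all values are invented."""
    tasks = [(0.98, 0.98)] * n_easy + [(0.02, 0.02)] * n_hard
    for _ in range(n_contested):
        # Contested tasks favor one model or the other, at random.
        hi, lo = 0.8, 0.3
        tasks.append((hi, lo) if rng.random() < 0.5 else (lo, hi))
    return tasks

def score(tasks, which, rng):
    """Fraction of passes over RUNS_PER_TASK attempts at every task."""
    passes = sum(
        rng.random() < task[which]
        for task in tasks
        for _ in range(RUNS_PER_TASK)
    )
    return passes / (len(tasks) * RUNS_PER_TASK)

def x_wins_fraction(trials, seed=0):
    """How often model X scores strictly higher than model Y when the
    contested tasks are redrawn for each trial."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        tasks = make_tasks(rng)
        if score(tasks, 0, rng) > score(tasks, 1, rng):
            wins += 1
    return wins / trials

print("fraction of benchmark variants where X beats Y:", x_wins_fraction(1000))
```

The two hypothetical models are equally good on average, so whichever one comes out ahead on a given variant is determined entirely by which contested tasks happened to be included plus run-to-run noise.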
When I see things like this, it reminds me of Miguel Indurain, who was enough of a household name when I was a kid that I'd heard of him even though I don't follow cycling. A few years ago, I was curious why household names in cycling since Indurain are all different archetypes from Indurain and it turns out the answer is that it's arbitrary. For arbitrary reasons, the Tour de France has become the most famous cycling race in the world and someone who has a dominant streak can become famous enough that they become known outside of cycling circles. For other arbitrary reasons, there was a period of time where the TdF had much longer time trial stages than it does now, which suits someone of Indurain's archetype. You tweak the benchmark a bit and Miguel Indurain goes from being a once household name to an all-time great time trialist that pretty much nobody has heard of unless they follow cycling.
Back on the topic of coding agents, it's not clear who really needs to pay attention to these benchmarks that present summary metrics of how models are doing. As we noted above, as a user of models, these benchmarks don't meaningfully tell me which model I should use. But if many other users based their decisions on these benchmarks, then AI labs would need to care about their results on these benchmarks.
But even though GPT-5.5 has been handily beating the various Opus 4.x models during 5.5's tenure on most of these benchmarks, Anthropic's business grew much faster than OpenAI's during the time period that the best publicly available models were GPT-5.5 and Opus 4.6/4.7/4.8, so much so that OpenAI has been giving companies free tokens to try to convince people to use GPT. My company was one of many to get months of free tokens and, during that time period, most people still primarily used Claude and Opus. Anthropic's revenue trajectory is incompatible with these benchmarks being major determinants of user choice, so I don't know why anyone should really care what these summary metrics show.
The last set of graphs we looked at shows how much variance we see across tasks, but from the prior set of graphs, we also saw a lot of variance within tasks with the same model and effort level. If we look at a small number of individual runs, then pretty much any conclusion is possible due to the variance between runs. Just for example, if we look at Optimization 1, for GPT-5.5 xhigh, one standard deviation (SD) between runs is 0.075 (i.e., a 7.5% performance increase / decrease). If we look at the average scores for the best and worst tested GPT, those are 1.055 (GPT-5.4 xhigh caveman) and 0.986 (GPT 5.4 mini low), a gap of 0.069, which is less than 1 SD across GPT-5.5 xhigh runs. For the actual graphs and not just summary statistics, we have:
For every task, whatever the best condition is, it's easy to get a result where the best condition actually scores worse than a result from the worst condition. When I was chatting with Max Bittker, who's done a lot more benchmarking than I have because he runs an RL environment startup, he said:
Yeah - the level of noise from task to task and run to run is so high that it's no wonder the discourse ends up confused. Easy to make mistakes like "Wow, $new_model is amazing" -> "Oops, I was still using the old model the whole time", or "this new harness / prompting trick works great!"
But on the other hand, benchmarking has given me confidence in statements like "Opus models recover and debug confusing conditions much better than Sonnet models" and "Chinese models score well on SWE-Bench but underperform on novel tasks" with more statistical significance.
Another comment, this time loosely paraphrased because this is from memory, was that Matt Mullenweg said that, if you look at people who undertake high variance activities, like gamblers, they're often superstitious. You'll see somebody wear their lucky socks or have a specific routine they do before they sit down to play the slots. Using caveman mode or deciding which model is good because a coding agent coughed up a good result after trying it is not so different.
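To put a rough number on the run-to-run variance in the Optimization 1 example above, here's a back-of-the-envelope sketch. It leans on assumptions that aren't established by the data: that single-run scores are independent and roughly normal, and that the 0.075 SD measured for GPT-5.5 xhigh also applies to the best and worst conditions (means 1.055 and 0.986). Treat the output as an order-of-magnitude illustration, not a measurement.

```python
import math


def prob_worse_beats_better(mean_better, mean_worse, sd_per_run):
    """P(one run of the worse condition outscores one run of the better one).

    Assumes independent, normally distributed runs with the same per-run SD,
    so the difference of two runs has SD sd_per_run * sqrt(2).
    """
    z = (mean_better - mean_worse) / (sd_per_run * math.sqrt(2))
    return 0.5 * math.erfc(z / math.sqrt(2))  # standard normal CDF at -z


def runs_needed(mean_better, mean_worse, sd_per_run, z=2.0):
    """Runs per condition so the gap in means is about z standard errors."""
    gap = mean_better - mean_worse
    return math.ceil(2 * (z * sd_per_run / gap) ** 2)


p = prob_worse_beats_better(1.055, 0.986, 0.075)
n = runs_needed(1.055, 0.986, 0.075)
print(f"P(worst condition wins a single head-to-head run) = {p:.2f}")
print(f"runs per condition for a ~2 standard error gap = {n}")
```

Under those assumptions, a single run of the worst condition beats a single run of the best condition roughly a quarter of the time, and you'd want something like 10 runs per condition before the gap even reaches two standard errors. That's a lot more than "I tried it once and it seemed better".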
I did the above and wrote the draft of this post before Fable was released, and then Fable was released, so just out of curiosity, I ran a few Fable benchmarks, at which point it was unreleased, so I also did a few runs with Opus 4.8.
Like I mentioned above, these are little benchmarks I created by typing to an agent for about 15 seconds apiece. With how coding agents let you scale processes, you could create a very large number of reasonable benchmarks in 15 seconds of human time apiece, but this isn't some carefully designed setup for stamping out a bunch of these; it's just something I did because I wanted to get some kind of caveman mode comparison, and when I tried to have agents do this with no supervision, the benchmarks were completely worthless. With a little bit of time spent, the benchmarks seemed good enough for the purposes of evaluating caveman mode and determining that it's not worth further investigation. If I was planning on writing this up publicly, I would've spent a few more minutes making these benchmarks better in multiple ways, but I think they're good enough for the purposes of this variance discussion. I wouldn't read too much into the results, but I did do a few basic checks, such as looking at outlier results to make sure agents weren't cheating and that there wasn't some issue (such as noise on the box) causing unusually poor runtime results. The agents weren't cheating and the slowest results did reproduce on a quiet box. And, that disclaimer aside, just like with the GPT benchmarks, we can see variance in these benchmarks that reflects the discussion that's happening online, e.g., Fable is great at some tasks and not so great at other tasks, and quite a few people are making claims about Fable's overall performance based on the performance on some small set of things they tried.
Above, we discussed why it's not clear how I or any particular user of agents can usefully use these benchmarks that produce a ranking or a small tuple of numbers that allegedly tell you how good a model is. But even when drilling down into a benchmark, it's still not clear to me how, as a user, I should change my behavior as a result of the benchmarks.
Just for example, something I've observed is that Opus 4.8 makes up bad rationalizations to explain things much more often than GPT-5.5. I've tried having them approach the same problem a number of times (I'll often ask both to solve a real problem I'm running into just to see what happens). These are often much larger problems than these benchmark problems, things like real debugging tasks or, in the case of building a game AI, instead of a simple prompt to produce an AI, an agentic loop and/or a series of instructions to get the agent to produce a fairly strong AI. I've asked a small handful of people and they've all observed the same thing with GPT-5.x vs. Opus 4.6/4.7/4.8, including people who prefer Claude and primarily use Claude and are more experienced with Opus than with GPT.
On the flip side, I've seen two benchmarks that measure how good models are at detecting false information and/or not making things up and those benchmarks both show that Opus is much better than GPT at this, to the point where, if the benchmarks show what one might expect them to show, the experiences I and other people have had seem impossible. For example, in one benchmark, a naive reading of the benchmark presentation is that Claude Opus and Sonnet are much better at detecting false information than any other model and Opus 4.8 is the absolute best and detects false information 95% of the time, including 95% of the time in software. The GPTs are terrible at this and rank below Qwen, Grok, Kimi, Minimax, Mimi, and Nemotron.
What's going on here? I'm not sure.
Maybe those benchmarks are measuring a different aspect of making things up than what the people I've talked to (and I) run into while working on real programming problems, or maybe this is a small sample size and the people I've talked to (and I) have just randomly gotten bad rolls of the dice from Opus, or maybe we're "holding it wrong" when it comes to using Opus. I suspect what's happening is that the benchmarks are measuring something different from what happens when you encounter incorrect rationalizations while coding or debugging, but to have any confidence in this I'd have to run some experiments that will be fairly expensive. If you want to support my doing experiments in general, you can subscribe to my Patreon or, if you're at an AI lab, give me some free credits to run experiments with.
BTW, there's something analogous in the Game AI benchmark here, where Opus is substantially better than GPT, but when I tried to manage the process of creating an actually strong (superhuman) game AI, GPT seemed to do better because Opus kept falling into a making things up / rationalizing nonsense failure mode. With no supervision, both were worthless and had their own failure modes, but when directed, GPT took less prodding to stay on track and keep doing things that could work. This is not reflected at all in the small Game AI benchmark, where an xhigh run is a single prompt that "only" costs on the order of $10, and the small benchmark turns up the opposite result from what we see in doing the real version of the task.
But, the point here is that, even when you find benchmarks that measure something that seems to be the exact thing you care about, those benchmarks often don't end up matching what you see in practice. If you're a weirdo like me maybe you'll decide to spend a bunch of time and/or money running some experiments to figure out what's going on but, if you have some job you want to get done (maybe for your actual job), that's a fairly unreasonable thing to do.
Benchmarking and data analysis
For no particular reason, I've always liked designing experiments and measuring things. This was true long before I ever thought about careers and it's still true today. As discussed here, measurement is one of the primary themes of this blog, maybe the primary theme. Lucky for me, this lifelong hobby has been something I've been able to make a career out of. And even luckier, this skill seems to have been made relatively more valuable by coding agents3.
As we discussed above, testing is a rate-limiting factor in highly agentic workflows because when you let agents start doing a lot of things, anything that's not well tested gets stochastically degraded. The more you ship with agents, the worse this gets. If everything you do is like one of these model evals where every task is a pass/fail task and there's no way to really do better than simply not getting the task wrong, then you can make sure your tests are good enough and let agents go to town until tests pass and everything will be great. In practice, for almost all non-contrived problems, it's possible to have a solution that passes a strict correctness eval but is still better or worse in some kind of non-strict correctness sense. If you want to let agents go wild on a problem (whether this is fast and loose vibe coding or running autonomous agentic loops), now you have a benchmarking problem.
When left to their own devices, running in self-improving loops, agents seem fairly bad at this. There are various little things I've been doing for a few months that get some improvement here (having lots of agents think about things independently and reconvene over multiple iterations with a mix of contrarian agents, etc.), but they still don't seem great at this without human intervention.
One reason for this seems to be that agents are really bad at understanding data and doing data analysis. Doing this in the context of some kind of agentic loop or large problem is harder than doing a standalone data analysis, but when people just have an agent do a standalone data analysis, the results are generally terrible without a lot of guidance. Because it's so easy to have an agent do data analysis, I've seen quite a few agent led analyses and, at least so far, every one that I've seen is completely bogus. I've also done a fair number myself (I think people really underrate the value of getting completely bogus output from an LLM; more on this shortly) and found the same thing.
When I say completely bogus, I mean things like finding different numbers that aren't really related and somehow inferring something deep about their relationship, picking two numbers or examples out of many and coming up with a theory that contradicts other numbers in the data, making plots that show something meaningless (but often look pretty), etc. For a random concrete example, the last time I looked at an analysis produced by an agent (I glanced at this as I was writing this), someone had asked an agent to analyze the resource utilization of something. This was an Opus 4.8 on max agent (Fable was disabled, so this was the best thing money could buy from Anthropic at the time), and it determined that 514% of the resources were being consumed by some task where it was impossible that more than 100% of the resources could possibly be consumed in any way. I have a colleague who sometimes comments on these things and will ask innocuous questions like "what does X mean?". Generally, he either gets no response or what appears to be some kind of AI-written response that's as incorrect as the first thing he replied to.
Anyway, when I started using agents for data analysis, maybe in November 2025, I found the speedup to be pretty incredible. There are analyses that, roughly speaking, would've taken weeks that instead take hours.
I hesitate to even describe this next thing I've been doing because of how many completely bogus data analyses I'm already seeing, but something I've been doing a lot of, which I find to be a much larger speedup than doing a "normal vibe coded data analysis" is to run a simple agentic loop where the agent (or agents) "understand" and analyze the problem and then fix up the parts that need to be fixed for me to extract a good analysis from what the agent has produced.
How much this bogus LLM loop speeds things up depends on the problem, but the last time I tried it, I think a traditional pre-LLM analysis of the issue would've taken me some number of days. Let's say two days. With a workflow where I'm multitasking and poking the LLM when it needs poking, my guess is that it would've taken one to three hours of my time over the course of a day or two. With this "have the LLM loop on producing a result I know will be wrong" workflow, it took about five minutes of my time over the course of a few days to get an acceptable result. Literally every time I looked at any part of the analysis I hadn't previously corrected, the analysis was wrong (and, in some cases, I had to issue multiple corrections) but, somehow, the whole thing still moved a lot faster than if I tried to steer the LLM.
I think people are really underestimating the value of getting completely incorrect results out of an LLM. This is, non-ironically, a game changer in a positive way, when directed correctly.
There's something a bit odd about how incredibly bad SOTA models are at data analysis and how much they speed up human data analysis. I have this series of exercises in benchmarking, evals, and experimental design (part 1, part 2, part 3, part 4, part 5, part 6). When GPT-o3 was released, Tyler Cowen said that o3
wipes the floor with the humans, pretty much across the board ... I don’t mind if you don’t want to call it AGI. And no it doesn’t get everything right, and there are some ways to trick it, typically with quite simple (for humans) questions. But let’s not fool ourselves about what is going on here. On a vast array of topics and methods, it wipes the floor with the humans. It is time to just fess up and admit that.
This prompted me to try asking the questions from these exercises to SOTA models. They generally underperform what I'd expect from a reasonable junior colleague unless you specifically word the question so that models can answer the questions. If you just ask the question the way you'd ask the question to a reasonable human colleague in real life, they generally don't do well. This isn't just a theoretical problem that comes up when someone asks an evals homework question. This issue comes up any time you ask an agent to do some kind of empirical, open ended, longer horizon improvement task.
I've seen this issue come up on a wide variety of real world problems, but it even comes up on contrived problems that don't have almost any of the messiness of real world problems. An example of this that I think is a nice problem is building board game AIs. This is a much easier problem to evaluate than most real-world problems because, at the end of the day, you want to beat humans and other AIs on a well-defined game that has a simple win/loss/draw result.
Agents can't really figure out how to do this when left to their own devices. To be fair to AIs, this problem is somewhat harder than internet commenters give it credit for. For example, if you just search for information on how to implement these things, you see all kinds of exchanges like this or this where someone who hasn't done it will explain why it's easy and someone who has will tell them it's harder than they think. The problem isn't exactly trivial, but it's not exactly hard either.
When I first started trying to implement a board game AI (this was in the GPT-5.1 to 5.2 days), I didn't know anything about board game AIs and tried having coding agents implement what they thought would work. Now that I know a bit about board game AIs, I can say that every direction GPT tried to suggest was bad and couldn't work. As an experiment, I tried this again recently, when GPT-5.5 and Opus 4.8 were the best publicly available models and they failed in the exact same way and suggested many of the exact same bad ideas (just to really make sure, I let them implement the bad ideas and they failed as expected).
I think, at some level, this is a fairly well understood gap in LLMs today. Coincidentally, around the time I was trying GPT-5.1 or 5.2 at this, the Code Clash eval was created, which tests the same thing and found the same result:
Unable to Iterate: Models struggle to improve over rounds, exhibiting a variety of failure modes.
Despite coding agents being terrible at setting direction and me not knowing anything about board game AIs, I was somehow able to cobble together a superhuman AI for Azul in about 5 hours of my time and get it to a crushingly strong level that's well above any other AI that was out there at the time (and I think still out there now) in about 20 hours of my time. The Azul world champion and probably strongest player in the world overall said:
the bot is quite insane now. def seems stronger than me, though to really test it i would have to play it like a turn-based game. with limited time on my end, i stand no chance 🙂4.
As much as I'd like to be able to say there was "one weird trick", there were really two tricks, but two tricks isn't so bad, especially when they generalize to other kinds of projects as well. The two tricks were:
- Look at the data / have some reasonable-ish evals
- Solve problems systematically
None of this has to be very rigorous. In fact, I'd say this was all running on vibes (in the pre-vibe coding sense of the word). For the data side, for this project, I used a common habit for me, plotting a bunch of things that seemed relevant, eyeballing them, and then nudging things in a direction that I would hope improves things. A more rigorous (and possibly better) approach is to run a bunch of experiments from scratch, do "ablation" runs where you add or remove individual ideas, etc., and understand what each component does and what the impact is. But I wanted to get a strong AI while using as little of my time as possible and just training on my laptop and doing things rigorously seemed too expensive given those constraints.
I'm sure all of this would've been trivial for someone with ML experience, but there were a bunch of little things I had to observe and then find ways to deal with (which is something agents are currently terrible at). For example, in terms of figuring out a good set of evals to look at, there are a few funny things. One is that you can easily make changes that reduce loss but don't change the win rate against actual opponents (humans or other bots) and vice versa. Another is that, if you try to track improvement in terms of how well your bot beats previous versions of itself, it's very easy to make a series of changes that allows the bot to beat all previous versions of itself and have an apparent Elo gain of 1000 or 2000 against prior versions without being any better against humans or bots that play in a different style. And, at the time, there was no superhuman bot that was available to play against, so you couldn't get a reasonable eval by just playing against a field of existing bots.
For a human, these aren't really hard problems. A reasonable human can look at one of these issues, think for a few minutes, and come up with a proposed solution that will probably work (and when it doesn't work, they can try another solution). But there's enough subtlety that agents don't do well at this today without supervision (to be fair to coding agents, many or most humans don't either; almost all of the bots I found to play against were fairly bad, despite people seeming to have spent a decent amount of time on a lot of them, with the one publicly available exception being the "PJF98" Azul bot in https://github.com/cestpasphoto/alpha-zero-general ). There was one bot stronger than the PJF98 bot in existence, but it wasn't made generally available to play against and was just used to win one season on BGA (which was voided because using a bot is cheating) and then never seen again.
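For a sense of scale on the self-play Elo numbers mentioned above, here's a small sketch using the standard logistic Elo formula, where a rating difference of d predicts an expected score of 1 / (1 + 10^(-d/400)). The win rates in the example at the bottom are made-up illustrative numbers, not results from the Azul bot.

```python
import math


def expected_score(elo_diff):
    """Expected score (win = 1, draw = 0.5) for a player rated elo_diff higher."""
    return 1.0 / (1.0 + 10 ** (-elo_diff / 400))


def elo_diff_from_score(score):
    """Inverse of expected_score; `score` must be strictly between 0 and 1."""
    return 400 * math.log10(score / (1.0 - score))


# An apparent +1000 Elo over an old version means scoring ~99.7% against it.
print(f"{expected_score(1000):.4f}")

# Hypothetical eval: the new version crushes the old one in self-play, but
# both score about the same against a fixed external opponent.
self_play_gain = elo_diff_from_score(0.99)
external_gain = elo_diff_from_score(0.56) - elo_diff_from_score(0.55)
print(f"self-play gain: {self_play_gain:+.0f} Elo")
print(f"external gain:  {external_gain:+.0f} Elo")
```

The two numbers measure different things. The self-play figure says the new version has learned to exploit the old one, which is only weak evidence about how it does against opponents that play in a different style, so it's worth tracking both.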
The other trick I mentioned was taking a systematic approach. At a high level, this is because when dealing with an opaque system with some complexity, every time you see a symptom of a bug, there's a good chance this is a window into one or more other bugs that you don't know how to observe. If you non-systematically close the window without fixing the other bugs, those other bugs are still there; you just don't know how to find them anymore.
This is specifically the case when building a board game AI that uses a neural net because you have this net that's doing who-knows-what and some kind of search function that's doing who-knows-what and you can sometimes spot specific obviously bad moves. This is also generally the case with vibe coding and/or agentic loops because you have agents doing who-knows-what and you sometimes get a window into what's going wrong as the giant, overly complex pile of code the agents produce will emit some kind of obviously bad behavior.
Specifically in the case of board games, I looked at the other Azul AIs that are out there. A lot of people had built bots, especially recently with coding agents making it so easy to build a bot. For some of those, you can actually see how the person tried to nudge the bot (for example, because they included the instructions they gave to Claude in their repo) and, AFAICT, the main reason these other bots were much worse despite more human time spent on the effort was a combination of using the wrong evals and fixing the symptom of a problem and not the cause.
For example, in one case where the author left the Claude instructions in the repo so you can see what they did, at one point, their bot was making a very bad move that's worse than any move a human would play even if they'd just learned the game. The author instructed Claude to make a series of straightforward fixes that would stop the bot from making that move, which it did, but this didn't fix all of the other bad moves that fall out of the systemic problems the bot has; it just fixed the one very obvious symptom of a bad move that the author could see.
A vaguely analogous issue my bot had is that, at one point, my bot would almost always try to open in column 2. A simple fix that I think would've worked would be to tweak the bot to play in column 2 less often. I tried the various "obvious" fixes like increasing noise to increase exploration in self play, etc., but none of the obvious fixes worked other than directly reducing the value of column 2, but that has this other issue we discussed of not really solving the problem. In that specific case, a more systematic fix that worked and made that bot's play a lot more generally robust was to, at various points in the game, fork the game into a new game with different column permutations. The idea here is, the problem is that, on average, column 2 is the best column, so even if the bot opens in a different column, it will have this tendency to go back to the average best column due to the games it's seen in the past. If you're DeepMind, you can throw a bunch of money at this problem and have the bot play a lot more games and learn something else. But if you're running on a laptop, that could take a while and if you're trying to tweak the weight at which the bot should be nudged away from column 2, there's no way you're going to get a reasonably optimal set of weights to do this with. But if you (for example) sometimes, near the endgame, permute column 2 to column 4, the bot will learn that if you have a nearly filled column 4, it's a winning move to complete it, which will help the bot learn that earlier in the game, if you have a partially complete column 4, it makes sense to advance it to near completion, and so on and so forth, until, in the opening, the bot learns that in the opening, it should then open with a reasonable distribution.
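As a rough illustration of the "fork the game with permuted columns" idea, here's a minimal sketch. The state representation is a stand-in: a plain list-of-rows grid, with hypothetical function names and a hypothetical `fork_prob` parameter. It is not the bot's actual encoding, and whether a permuted position is still a sensible training position depends on how the real game state and rules are represented.

```python
import random


def permute_columns(grid, perm):
    """Return a new grid whose column j holds the original column perm[j]."""
    return [[row[src] for src in perm] for row in grid]


def maybe_fork_with_permutation(grid, rng, fork_prob=0.1):
    """With probability fork_prob, return (permuted_grid, perm).

    Otherwise return (grid, None), meaning self-play continues unchanged.
    """
    if rng.random() >= fork_prob:
        return grid, None
    perm = list(range(len(grid[0])))
    rng.shuffle(perm)
    return permute_columns(grid, perm), perm


# Toy 2x5 position where column index 2 is the one that's nearly filled.
position = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
rng = random.Random(0)
forked, perm = maybe_fork_with_permutation(position, rng, fork_prob=1.0)
print(perm)
print(forked)
```

The point of forking mid-game, rather than only randomizing the opening, is that the bot gets to see late-game positions where some other column is the one worth completing, and that value can then propagate back toward earlier moves during training.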
For someone like me with no ML/AI background, getting the bot to superhuman performance was a series of puzzles like this that are a combination of figuring out what the right view of the data is to see problems and then tweaking this systematically to solve the problem. This is not so different from managing agentic loops for general problems. Michael Malis has said that running a "software factories" workflow feels like playing Factorio and I can see what he means by that.
Misc
I'm not sure how much value there is in talking about specific workflow tips and tricks since the half-life of these is so short. Before I was running more autonomous loops and was doing more human-in-the-loop work, "my" agents would often do something that's in direct contradiction of AGENTS.md or other instructions. When I offhandedly mentioned this to Yossi Kreinin, he suggested adding a note at the bottom of my AGENTS.md to re-read instructions after compaction. For all I know, that's a superstitious act that did nothing, but it seemed to reduce the rate of this issue from a few times a day (when juggling a handful of agents at once, so maybe once every few agent-days) to once every week or two (maybe once every 100 agent-days or so). It seems like this isn't necessary anymore, so either this was a placebo that never did anything in the first place and I was just getting unlucky for a while, or the AI labs have added this as well. A trick we discussed in August 2024, back when LLMs would produce code but not run it by default, was to just have a loop that runs the code to make sure that it works and passes tests. That trick had a surprisingly long useful life considering how obvious and useful the trick was, but of course the big AI labs eventually realized how useful that was and they all have tools that do this now.
For the past few months, the "trick" I've been using is to have something that's vaguely in the same space as something like Gas Town, but with various little things to try to make it more reliable. It seems like Claude's Dynamic Workflows put half of these tricks into an easy to use package (quite possibly in a better way than I was doing; I need to play with it to find out) and Codex has some advertised but yet to be released features that seem targeted at doing the same thing. In principle, goal mode should package up most of the other half of the tricks, although it doesn't work nearly as well for now. Presumably that will improve and most of what I'm doing now will be available off-the-shelf within a couple of months? Like the tricks mentioned above, these are all things that are obvious enough that I don't know that there's much value in spending a lot of time talking about the details, but I'll put some brief comments in a later appendix in case anyone is curious.
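For anyone who hasn't seen the "run the code in a loop until it passes" trick spelled out, it looks something like the sketch below. The names are hypothetical: `propose` stands in for whatever calls the LLM (it takes the previous failure output, or None on the first attempt, and returns source code), and `run_candidate` is one simple way to run a script and capture its output. This is a minimal sketch of the pattern, not how any particular lab's tooling works.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(source, timeout_s=10.0):
    """Run `source` as a Python script; return (passed, combined output).

    A hung script raises subprocess.TimeoutExpired, which this sketch
    deliberately leaves for the caller to handle.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "candidate.py"
        path.write_text(source)
        proc = subprocess.run(
            [sys.executable, str(path)],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    return proc.returncode == 0, proc.stdout + proc.stderr


def loop_until_passing(propose, check=run_candidate, max_attempts=5):
    """Call propose(feedback) until check(source) passes or attempts run out.

    Returns (source, attempts_used), or (None, max_attempts) on failure.
    """
    feedback = None
    for attempt in range(1, max_attempts + 1):
        source = propose(feedback)
        passed, output = check(source)
        if passed:
            return source, attempt
        feedback = output  # feed the failure back into the next proposal
    return None, max_attempts
```

In use, `propose` would wrap a model call that includes the failing output in the prompt, and the script being run would be the code plus its tests. The attempt cap matters because, without one, a model that keeps producing the same broken code will loop forever.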
The meta techniques discussed above have generalized fairly well for me on a variety of projects and they seem like they should be durable as long as agents can't just take a non-trivial problem description and fully solve the problem better than I can, at which point I won't have to worry about it because my economic value may be close to zero.
Another meta idea that's been useful is the very obvious thought that coding agents are highly non-uniform in their effectiveness relative to humans. When discussing traditional productivity, Fabian Giesen made this comment on how velocity improvements change how he works:
There are "phase changes" as you cross certain thresholds (details depend on the problem to some extent) where your entire way of working changes. ... There's a lot of things I could in theory do at any speed but in practice cannot, because as iteration time increases it first becomes so frustrating that I can't do it for long and eventually it takes so long that it literally drops out of my short-term memory, so I need to keep notes or otherwise organize it or I can't do it at all. Certainly if I can do an experiment in an interactive UI by dragging on a slider and see the result in a fraction of a second, at that point it's very "no filter", if you want to try something you just do it. Once you're at iteration times in the low seconds (say a compile-link cycle with a statically compiled lang) you don't just try stuff anymore, you also spend time thinking about whether it's gonna tell you anything because it takes long enough that you'd rather not waste a run. Once you get into several-minute or multi-hour iteration times there's a lot of planning to not waste runs, and context switching because you do other stuff while you wait, and note-taking/bookkeeping; also at this level mistakes are both more expensive (because a wasted run wastes more time) and more common (because your attention is so divided). As you scale that up even more you might now take significant resources for a noticeable amount of time and need to get that approved and budgeted, which takes its own meetings etc.
There's something analogous about agentic coding velocity improvements. People sometimes make claims about how agentic coding is 100x or 1000x more productive. When I look at tasks I do, it's hard to really pin down a number because you can't effectively lift-and-shift a human workflow to agents, so I'm doing something very different from what I would've done in the first place.
For example, consider the idea discussed above of having agents look at every support ticket to convert support issues into PRs. If I were to claim some kind of speedup here, it would be a huge number, easily more than 1000000x. But of course I wouldn't read every support ticket myself, so such a claim is meaningless. A more reasonable way to estimate this would be to try to figure out what kind of traditional organizational structure would generate that many bug fixes using classical techniques and estimate the ratio, but the error bars on an estimate like that will be more than one order of magnitude and it's also not something that would've been done at any normal software company, so the comparison is once again meaningless. It's like someone who drives 20000 miles a year saying they saved 6000 hours and got a 15x speedup on their commute by driving because that's how long it would've taken to walk the same distance. It's clear that the driving enabled them to do things they wouldn't have been able to otherwise do, but there's no point in discussing the ratio.
Another place where you can claim a massive but meaningless speedup number, possibly more than 1000000x, is fuzzer generation. When trying to get an LLM to generate a fuzzer in a semi-automated way, I've tried having an LLM look at the entire commit history and all bug fixes, as well as every plausibly related support ticket in history, to try to get the LLM-written fuzzer to be able to reproduce the bugs in some general way. This seems to have been useful to do, but it's hard to put a real number on the value.
For me, the effective value of LLMs is that there are a lot of tasks that would've been too time consuming to reasonably do that can now be done fairly trivially. In a footnote, I mentioned that there are lots of apps you can just trivially build now, so I sometimes play online board games on a custom app that's just a nicer experience than I can get on any of the major board game platforms (which is also nicer than the typical dedicated app for a board game).
In things that I mostly do to satisfy my curiosity, but that can also have some relevance to a business, I'll also do data analyses that didn't really seem worth the time before. For example, I've long been curious about the ratio of user-experienced bugs to support tickets. With coding agents, it took minutes of my time to get agents to produce a list of major incidents where we know that every user that tried to access the site or use a feature couldn't use the feature or site and to get a list of tickets that appeared to be associated with each incident. On joining this data, a typical ratio was 200 impacted users per ticket, with the ratio often being between 100:1 and 1000:1 (and, I'm sure, a wider range would've been found if more incidents were included). FWIW, I pre-registered a guess of 100:1 to 1000:1, leaning more towards the 1000:1, so I was somewhat off here as the typical 200:1 is closer to 100:1 than it is to 1000:1.
Of course, there are a lot of holes in this analysis. There's no way to tell for sure that there aren't missing tickets (I did check for false negatives by independently cross-checking with tickets that were manually tagged in the bug tracker as being associated with an incident and the main thing that turned up was that agents were able to find many more tickets than humans found, but that doesn't mean that there aren't false negatives that eluded both humans and agents). I also randomly sampled a handful of tickets to see if the associated tickets were reasonable and they seemed reasonable, but this was just a quick order-of-magnitude analysis, so I didn't manually inspect as many tickets as would be necessary to really validate a high precision analysis.
It's also not clear how the ratio changes for less severe issues. It should generally be the case that, for less severe issues, the ratio is higher, and also for less invested users (such as the typical user in the sign-up flow). My guess is that, for many less severe issues, we generally see ratios well above 1000:1, with the ratio being easily above 10000:1 or even above 100000:1 for many subtle issues.
This analysis doesn't let me say anything too precise, but it does work as a response to when someone says "only six users were impacted" because an issue had six tickets. That typically means that there's an internal bug tracker ticket to which someone attached six support tickets. On using an agent to search for related tickets, you might find twenty-five tickets and then, looking at the issue severity, you might estimate a ratio of 1000:1, so "only six users were impacted" may turn into "it's overwhelmingly likely that at least 5000 users were impacted and plausible to likely that 25000 users were impacted".
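The arithmetic behind that kind of restatement is just the ticket count multiplied by an assumed users-per-ticket ratio. A minimal sketch, assuming the 25 tickets from the example, the typical 200:1 ratio from the incident analysis as the low end, and 1000:1 as the high end; the function name is made up for illustration:

```python
def estimate_impacted_users(ticket_count: int, users_per_ticket: int) -> int:
    """Rough estimate of impacted users from a ticket count and an assumed ratio."""
    return ticket_count * users_per_ticket


tickets_found = 25  # tickets turned up by the agent search in the example

# Low end: the typical ratio seen for major incidents.
low = estimate_impacted_users(tickets_found, 200)
# High end: an assumed ratio for a less severe issue.
high = estimate_impacted_users(tickets_found, 1000)

print(f"at least ~{low} users, plausibly ~{high} users")
```

The ratios are the uncertain part, so the useful output is the range rather than either number on its own.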
If I was already familiar with every system I'd need to query for this analysis, I suspect getting a rough guess for this would've taken hours and it would've been a rougher guess. But given that I wasn't familiar, I would guess that this would've taken a couple of days. Instead, this took maybe 15 minutes of my time. Like Fabian mentioned, this kind of speedup where an app or analysis that would've taken hours or days can be done in minutes is a phase change that really changes how you work, and that's not even including autonomous loops that only need occasional maintenance.
Another thought, and it wasn't obvious to me in advance that this would be the case, is that LLMs seem to be a larger productivity multiplier for people who are relatively more expert at the thing they're trying to do. On looking around for commentary on this, I of course got pages full of LLM generated SEO spam and, after wading through it, I saw quite a few comments indicating the opposite. Basically, that anyone can do anything now. While that's more true than it used to be, it seems to be even more true that expertise in an area has become more valuable. Max Bittker had a comment that this is at least partially because LLMs are good at counterfeiting things, but that counterfeits aren't that good (yet?), and experts can tell the difference between a counterfeit and the real thing, which I thought was an interesting framing.
And following up on this 2015 post about how many people are underestimating how much AI will displace humans, in the past year, my median LLM-driven remote customer service interaction has been better than my median human remote customer service interaction (a well-paid human would provide a better experience than all but maybe one of my LLM-driven experiences, but companies don't want well-paid humans). The exact same line of reasoning that was rebutted in that 2015 post was repeated in 2022 on the release of ChatGPT. The line of reasoning is so popular that it's been a repeat thought leader viral hit every year since 2022 and I don't see that changing any time soon, but it still seems wrong to me today.
By the way, as we previously discussed, there are lots of ways to be an effective programmer, so I'm not saying what's discussed here is the best way to do things or even a very good way (I'm changing my workflow regularly as I figure out new things) but this is what I've been trying that seems to work better than a purely traditional workflow for me. Just given my background and interests, it's natural for me to take a systematic, evals-driven, approach to agentic coding. I've seen people use what is pretty much the opposite approach and make it work for them as well, where they move very quickly with almost no understanding of what's happening and, when they strike gold, they recursively put all of their effort into that. Not only is that not a great fit for my background, I don't think it would really work well for most of the problems I've worked on, which I select because they seem like a reasonable fit for how I approach problems. No doubt my workflow, ported to the problems the opposite kind of workflow is suited to, would also not work very well.
Appendix: people talking past each other
It's always been the case that people talk past each other when they disagree, but this seems to be more of an issue when talking about agentic coding today than it has been for a lot of other topics.
Some of the major reasons I've seen for this are:
- General incredulity
- Workflow-based reliability differences
- General workflow issues, i.e., "skill issue"
- Differences in expectations
On general incredulity, more so than at any other time in my life, I see people saying that X is impossible, nobody does X, and anyone who claims that they're doing X is a liar, when I know people (who I trust and are credible) doing X or I'm doing X. There's a lot of incredulity out there about AI. Eleven years ago, we looked at how people were saying AI can't possibly replace or displace humans even as it was already happening, but the kind of displacement there wasn't a serious fear for a typical middle-class person or programmer, so it wasn't as salient for folks I might run into "on the street" as it is today. Now that this is more "in your face" for people I'm running into, the amount of denying reality has shot up.
BTW, I find it quite reasonable that a lot of people are incredulous. For one thing, a lot of the positive claims about AI are incorrect. On average, wild, incorrect, claims get more traction than boringly precise and correct claims, so someone who isn't following this stuff is going to see a lot of incorrect claims that are easy to dismiss. It doesn't help that a lot of the people making these claims have a direct financial interest in the general success of AI companies, which makes it easy for skeptics to conclude that these people are all self-interested liars.
On workflow-based reliability differences, what I mean is that different workflows demand different levels of reliability and people who use a workflow that demands a higher level of reliability will often claim AI is useless because it can't do X reliably, missing that there are many effective uses that don't require that level of reliability. To take an extreme example, some people have done novel mathematical research and solved open problems using AI. Let's say, hypothetically, AI solves some open problem you're working on 1 out of every 100 times you try it and returns complete gibberish 99 out of 100 times. That seems great. Solving a serious open problem in math research 1 in 100 times is a pretty awesome result, even if doing a reasonable code review 1 in 100 times and producing a bad code review 99 out of 100 times would be a bad result.
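One way to see why a 1-in-100 success rate can still be great is to look at how quickly the chance of at least one success grows with the number of attempts. A small sketch, assuming attempts are independent (which real agent runs may not be) and using the hypothetical 1% success rate from the example:

```python
def p_at_least_one_success(p_success: float, attempts: int) -> float:
    """Probability of at least one success in `attempts` independent tries."""
    return 1.0 - (1.0 - p_success) ** attempts


# Hypothetical 1-in-100 success rate per attempt.
for n in (1, 100, 300, 500):
    print(n, round(p_at_least_one_success(0.01, n), 3))
```

This only helps if you can cheaply tell the one good result apart from the gibberish, which is part of why the same success rate can be fine for one kind of task and bad for another.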
For a less extreme example, for programming work, if you don't assume that the agent is 100% reliable and build systems to handle this, you can tolerate a much lower level of reliability than somebody who assumes the agent is reliable and needs some kind of human-in-the-loop correction to deal with cases where the agent isn't reliable. I regularly see people give advice saying, "don't use AI for X [because it won't work 100% of the time]" where I know someone who does X all the time and has some boring system for handling X not being 100% reliable. Handling unreliable components is something you commonly have to do in programming regardless of whether or not you're dealing with agents, so why not use the same techniques to handle agents being unreliable?
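As an illustration of a "boring system" for this, here is a minimal sketch of the usual check-and-retry pattern applied to an unreliable step. The names `run_with_checks`, `task`, and `is_acceptable` are made up for this example, and the "agent" is a stand-in iterator rather than a real model call:

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_checks(
    task: Callable[[], T],
    is_acceptable: Callable[[T], bool],
    max_attempts: int = 5,
) -> Optional[T]:
    """Run an unreliable task until its output passes a check, or give up."""
    for _ in range(max_attempts):
        try:
            result = task()
        except Exception:
            continue  # treat a crash like any other failed attempt
        if is_acceptable(result):
            return result
    return None  # caller decides what happens next (escalate, log, skip, ...)


# Stand-in for an agent call: two unusable outputs, then a usable one.
outputs = iter(["", "garbage", "def add(a, b): return a + b"])
result = run_with_checks(lambda: next(outputs), lambda s: s.startswith("def "))
print(result)
```

The check is where the real work is. In practice it would be something like running tests or a linter, not a string prefix.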
Differences in scale also cause people to talk past each other as people have very different reliability demands at different scales. There's an old rule of thumb that, for traditional software systems, for every order of magnitude increase in scale, you need a different architecture. That's not really true, but it gets at this directionally correct idea that you really want different architectures at different scales. This is also true for using agents.
Just for example, one that I've seen come up repeatedly is related to the earlier note that, on tasks that I do, adding a clear, strict, instruction in AGENTS.md to not do something fails once every week or two when juggling 10 agents, a failure rate of something like one per 100 rule-agent-days. Does this work? If you're doing human-in-the-loop coding with a relatively small number of agents and rules (say, 10 agents, 10 rules) with failures being non-catastrophic things that are usually caught by human review, this works fine. If you're running hundreds or thousands of agents or if you're shipping without review or other guardrails even with only a few agents and failures can cause serious problems, this doesn't work at all. While some techniques are scale invariant, many aren't.
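To make the scale difference concrete, here is a back-of-the-envelope sketch. It assumes the rough rate of one failure per 100 rule-agent-days from above and, as a simplification, that failures scale linearly with the number of agents and rules:

```python
def expected_failures_per_day(
    agents: int,
    rules: int,
    failures_per_rule_agent_day: float = 1 / 100,
) -> float:
    """Expected rule violations per day, assuming failures scale linearly."""
    return agents * rules * failures_per_rule_agent_day


small = expected_failures_per_day(agents=10, rules=10)
large = expected_failures_per_day(agents=1000, rules=10)

print(f"10 agents, 10 rules: ~{small:.0f} failure(s) per day")
print(f"1000 agents, 10 rules: ~{large:.0f} failures per day")
```

Roughly one failure a day is something human review can absorb. Roughly a hundred a day is not, which is why the same instruction-in-a-file technique stops working at the larger scale.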
Another kind of workflow issue is when people have a workflow that just can't do X and then conclude that nobody can do X because they weren't able to do X. I haven't thought too much about the specifics of what doesn't work, but Max Bittker had the following comments:
A few patterns I've noticed, not in order:
1) People sometimes have misconfigured environments: They're on an old model, they have a bunch of MCPs or after-market system prompts turned on, or the LLM is trying to deal with Windows or some unusual environment or IDE.
2) People sometimes trash their context window by using a single super long chat with lots of backtracking or single super-messy working directory.
3) Injecting accidental requirements via imprecise language, sending the LLM off on a harder task than they intended like building a matchmaking system before the core of the game
4) If you don't have a strong idea what you want or when you're done, you might under-specify and then collaboratively drift and change requirements over time and going nowhere in particular (many AI-psychosis vibe coders do this)
Appendix: agentic loops and writing this post
As I noted in the alt text at the start of this post, I was very hesitant to write about AI for years because I'm up here in Galapagos Island, highly disconnected from what's happening because I haven't been reading social media much and I've also been doing a bad job at keeping in touch with folks who are really in the thick of it in SF.
I was convinced that it might make sense to write something in May, when I chatted with a couple folks from SF who were in town for PGConf. Even though they are in the thick of it and seem to be close to the cutting edge of what people are doing with agents, there were still things from my workflow that they thought were interesting. But, like I said, the half-life of particular workflow tips and tricks is short enough that I didn't think it would be useful to write up detailed notes on an exact workflow, which is why the main post has focused more on higher level ideas. I'm not sure what might actually be useful, but I'm going to try writing up something about the evolution of my workflow. As noted above, I don't think there was any point in time where my workflow was great, exactly, but it was useful and there always seemed to be a natural next step to make it a bit more productive.
Back in the stone ages, before tools like Claude and Codex, I would sometimes just run a very simple loop that would just keep compiling the code and running the tests and re-prompting until everything passes, which I mentioned here in mid-2024. At the time, I didn't find this useful enough to use for anything where I knew what I was doing, but it enabled me to embed a little web game into that post and do other tasks that would've required me to learn something about an area where having actual expertise will probably never be particularly interesting to me, such as building a web app.
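A loop like that can be sketched in a few lines. This is not the script I actually used, just a minimal illustration; `reprompt` is a hypothetical hook that sends the failure output to whatever model you're using and applies its proposed edits:

```python
import subprocess
from typing import Callable, Sequence


def fix_until_green(
    check_cmd: Sequence[str],
    reprompt: Callable[[str], None],
    max_rounds: int = 20,
) -> bool:
    """Run the build/test command; on failure, re-prompt with the output and retry."""
    for _ in range(max_rounds):
        proc = subprocess.run(check_cmd, capture_output=True, text=True)
        if proc.returncode == 0:
            return True  # everything compiles and the tests pass
        # Hand the combined compiler/test output to the (hypothetical) model hook.
        reprompt(proc.stdout + proc.stderr)
    return False  # gave up after max_rounds without a passing run
```

Usage would look something like `fix_until_green(["make", "test"], reprompt=my_model_hook)`, where both the command and `my_model_hook` are placeholders. The `max_rounds` cap is there so a stuck loop doesn't run forever.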
That kind of thing was mildly useful for quite a while when I wanted to accomplish some simple task, but I didn't start using agents much until running into the bisect/video story at the top of this post, and doing some other data analysis and seeing how much it could speed up data analysis. BTW, the bisect/video story came out of an analysis where I was wondering what the half-life of a working feature and/or bug fix is. Coincidentally, that analysis is linked to a couple of ideas in the post. First, the half-life turned out to be fairly short pre-LLM, which seemed like a point in favor of more testing if we wanted to turn the dial up on velocity (with or without LLMs). And, second, this is yet another analysis that would've taken long enough that I wouldn't have done it pre-LLM.
At some point mid-late 2025, I tried vibe coding some personal projects. I used Codex just because people said it gives you more quota on the subscription plans (I've since tried both and that seems to be accurate). At that point, my goal was to use as little of my time as possible on personal projects while getting the output I wanted. I hadn't really done any vibe coding before and didn't have any idea what would work, so I did the most naive possible loop, where I would just queue up a bunch of copies of "if X isn't done, implement X until complete; if X is done, ...". I wasn't reading anything about what other people were doing and the people I was personally talking to generally weren't using AI much, so I had no idea that the term "Ralph loop" had been coined in mid-2025 and people were doing various, less cumbersome, variants of this, although this wasn't really all that cumbersome (maybe 5 seconds or so to queue up the copies of the prompt once they're written).
And, in retrospect, this workflow wasn't so bad in terms of steering things in the right direction compared to a Ralph loop. I'd usually look at this a handful of times every day (right before bed, on waking up, maybe a couple other times) and, in general, every time I looked, things would not quite be right and I'd queue up some commands to nudge things in the right direction after looking at the logs/metrics/graphs/etc. that explained what was going on. Since this was for personal projects I was spending a small number of minutes a day on, I wasn't trying to maximize throughput and was just running a few things on each of two different laptops I have lying around. It would've been better to have more autonomous loops that can keep going indefinitely without refilling the queue by hand, but the actual time savings from that isn't really all that much, I think less than 10 minutes per day.
After doing that for a while, I started (sometimes) trying to run more autonomous loops earlier this year. Once again, I think this must've been somewhat behind the curve compared to whatever people are excited about at any given time, but it seemed to work ok. On starting to do that, I found that a lot of the tooling people were recommending wasn't really a great fit for managing agentic loops and the tooling that was designed for it seemed a bit "vibe"-y and would have reliability issues.
An example of the first thing is that Conductor was widely recommended at the time and it wasn't really made to support a lot of things I'd want to do. For example, Conductor has the concept of a workspace to separate things. I often want to invoke what one might call "higher order workspaces", where an agent (or a set of independent agents) decide what should be done with a list of items, after which each item should be fed into another agent (or independent set of agents) (possibly with some limit on agent concurrency), after which, etc., with some kind of graph structure determining how tasks move around, where there are reduce-like steps that look at the output from a bunch of tasks, and so on and so forth. While it's technically possible to make Conductor do this, it wasn't really designed for it and, given that you can just tell agents to write scripts to create a structure like this, it seemed a lot simpler to just use custom vibe-coded scripts.
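For a sense of what one of these structures looks like as a script, here is a minimal sketch of a single triage, fan-out, and reduce stage with a concurrency limit. Plain functions stand in for agent calls, and all of the names here are made up for illustration:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

Item = TypeVar("Item")
Out = TypeVar("Out")
Summary = TypeVar("Summary")


def fan_out_then_reduce(
    items: Iterable[Item],
    triage: Callable[[Item], bool],
    worker: Callable[[Item], Out],
    reduce_step: Callable[[List[Out]], Summary],
    max_concurrency: int = 4,
) -> Summary:
    """Triage items, run one worker per surviving item, then reduce the outputs."""
    selected = [item for item in items if triage(item)]
    # The pool size is the cap on how many workers ("agents") run at once.
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        outputs = list(pool.map(worker, selected))  # map preserves input order
    return reduce_step(outputs)


# Stand-ins for agent calls: a triage step, a per-item worker, and a reducer.
report = fan_out_then_reduce(
    items=["ticket-1", "ticket-2", "spam", "ticket-3"],
    triage=lambda t: t.startswith("ticket"),
    worker=lambda t: f"{t}: investigated",
    reduce_step=lambda outs: "\n".join(outs),
    max_concurrency=2,
)
print(report)
```

A graph of stages is then just the output of one call feeding the `items` of the next, which is the kind of thing that's easy to have an agent write as a one-off script.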
The kinds of tools that were designed for use of a lot of agents, like Gas Town, mostly seemed to be for less structured workflows, and they often seemed highly vibe coded and somewhat unreliable as a result, to the point that I wasn't sure why I would want to use one of them and not just cobble together something myself. What I mean there is, per the discussion in the long footnote about board games, if something doesn't meet a certain bar for quality and complexity, it's fairly easy to just make a version of it yourself. Even if the thing you make for yourself is buggier than the product, you can make sure that, given your workflow, it doesn't have bugs that impact your particular workflow, so you don't even have to meet the same overall quality bar as the product does for it to be better for you.
As a result, what I've ended up doing is making one-off loops for tasks that need to run for a while. There's probably an efficiency gain from building some kind of orchestrator that fits what I want, but I don't think I've really set up enough of these loops to know what I want, and what I want varies a lot depending on the problem. Sometimes I want to have certain metrics trigger a health check by an agent that can fix issues and sometimes I want an agent or set of agents to automatically be invoked after each iteration (whatever that means) to check on progress and sometimes I want agents to continuously check on progress. Sometimes I want a graph structure with some kind of multi-level triage, etc.
Because it's fairly easy to have an agent set up whatever arbitrary structure you want, I'll generally start with some kind of simple structure with some basic health checks and monitoring and then I'll check in on it periodically and try to make structural fixes to address issues so that the issue doesn't recur. In principle, something like this that's well set up could produce useful work indefinitely but, at least for what I've set up, these loops tend to degrade in productivity over time. Even if nothing is badly wrong, the loops are a lot more productive if I'm actively at the keyboard typing instructions to nudge things in various ways, while monitoring what's happening. If I leave one of these alone for a few days or a week, it will usually still be producing a non-zero amount of useful work, but not nearly as much as if I'd check in daily. I've tried various strategies to try to keep these loops more on track without intervention and I've found various things that can help, but I haven't figured out how to replace myself (yet?).
For example, in April/May 2026, I tried having a bunch of "personas" convene and reconvene to try to nudge the loop instead of doing it myself. At the time, I was trying to get a loop running smoothly that was doing some work on some code that uses CRDTs (which I know nothing about, BTW), so I tried diagnosing issues with something like "use independent agents to review as linus torvalds, kyle kingsbury, marc brooker, tptacek, dan luu, and 4 contrarian personas. have each think for a long time", with multiple iterations of this (yes, I do find it quite silly to invoke my own name). Each of these personas seemed to keep things on track in a different way, e.g., the "linus torvalds" persona would tend to push back against the loop's "natural" tendency to have things spiral out of control by repeatedly adding unnecessary complexity, the "dan luu" persona would force the loop to measure things before implementing and also sometimes say things like "we must not rationalize XYZ" and push back on some bad reasoning. None of this was enough to replace human observation and intervention, but doing variations of this was enough to keep these loops running more smoothly, with less intervention required.
This kind of persona setup also improved things for architectural design, debugging, etc. Again, not so much that you wouldn't want a human at all, but enough that the human can do a bit less work.
All of this seems to work best if the human understands the common failure modes. For example, when asked to debug and explain when a bug was introduced, if just prompted to explain, even with multiple independent rounds of analysis, the agents would often come back with a completely incorrect explanation (maybe 50% of the time or so). When asked to confirm their hypothesis by actually executing code to check if the hypothesis is correct, that removed most incorrect explanations. Just adding forced checking to a simple prompt has a better success rate than having agents do independent analyses and cross-check each other (doing both together seems to have an even higher success rate).
A lot of getting value out of agents seems to be having some kind of understanding of their failure modes and then working around them. Earlier in the post, we noted that people aren't really going to be left behind if they're not using coding agents now and get started in M months because they can be at most M months behind, but will likely be much less behind than that because of how quickly things are changing. A major reason for this is, AFAICT, a lot of the skill involved in using agents is working around their failures. Of course AI labs want to fix these failures, which means new releases are designed to obsolete these skills as much as possible. This is possible to observe directly, in that some failure modes that were quite common a year ago are now much rarer.
BTW, this is another reason I don't find a lot of benchmarks useful for me personally. In general, benchmarks have moved to trying not to "over explain" and instead act like a naive user who isn't very good at using agents. This is a reasonable thing for AI labs to care about because they want to make a product that needs as little expertise as possible to operate but, as a professional programmer, if there's a way to work around some failure mode of agents, you're probably going to use it instead of acting like you're a benchmark, and you therefore get a very different result even if you're doing the exact task from some benchmark.
Thanks to Max Bittker, Dennis Snell, Em Chu, Yossi Kreinin, Peter Geoghegan, Michael Malis, @caleb@goodfeeds.net, Misha Yaugdin, and Jason Seibel for comments/corrections/discussion.
The argument against having test engineers that I've heard everywhere I've worked, sometimes as test engineers were being laid off, goes something like "the people who can test the code best are the people who wrote it; also, programmers will get lazy about testing if they outsource testing to somebody else", but that's only one side of a trade-off that has two sides.
In general, you get efficiency gains from specialization because people develop deeper skills in the specialization, and you get efficiency losses because the split in skills causes various kinds of fragmentation issues. You can state the same case made against having dedicated testers about having dedicated front-end and back-end folks, mobile engineers, etc.; it's all true. But there's also a true argument on the other side and it's generally acknowledged that, once you hit a certain scale, the gains from having specialists who understand specific things outweigh the benefits of having teams full of generalists who are only kinda ok at each thing. From having seen how good people can get at testing if they spend a decade or two specializing in learning test skills, I think software companies are missing out by not having people with these skills, but this is arguably out of scope for this post since it's not like you can go out and hire a bunch of people with that skillset with respect to software. To even have that population exist, you'd need a culture and ladders like hardware companies have, where verification engineers are first-class citizens, just like logic designers, where they're on the same pay scale, get promoted to high levels at the same rate, have the same amount of prestige, etc., and then you'd have to do that for twenty years.
The gains here are not just the direct expertise gained from spending a career doing something. The existence of a large community of practice at a company also means that people level up faster at the company. In the same way that I was lucky to learn a lot about distributed tracing because I happened to sit next to an expert in distributed tracing and I probably learned as much in a few months from that as I would've if I spent years figuring it out on my own, I was lucky to learn a lot about testing because I sat in a cluster of world class test engineers. Even at software companies that were over one-thousand times the size of that hardware company, I'm not sure that they had a cluster of that many talented test engineers who'd spent 40+ years specializing in testing for me to learn from (and if they did, that group would be so far from me that I'd never interact with them anyway).
At one point, as an experiment, we did a review of the code where a bunch of people sat down and looked at all of the code. This found a non-zero number of bugs but, IIRC, it was a low single digit number of bugs. You might argue that the same thing I said about testing skills is also true, but going in the other direction, for a crew that didn't review code by default, and that we would've found more bugs if people had better code review skills. No doubt that's true but, even so, if I think of the absolute best people I know at finding bugs in code review and imagine somebody who's even better than that, that's still not even close to the same league of bug finding effectiveness per unit time as the median Centaur test engineer.
And, coincidentally, as with testing, it happens to be a skill that gets developed a lot more at CPU companies than in typical software companies, even ones that produce highly performance sensitive products, like databases. Above, we estimated that the effort spent on testing at the CPU design shop I worked for was maybe a ~2:1 ratio over what you'd see in a traditional software company. When it comes to benchmarking/evals/experimental design, the denominator is low enough at traditional software companies that it's hard to estimate the ratio, but it's surely at least 10:1 and 100:1 and 1000:1 are plausible numbers as well.
Of course, by focusing on and developing much more expertise than software companies in these areas, chip companies are often relatively in the stone ages in a number of other areas. Relatively speaking, I'm also relatively weak in most of those areas. I think this works out ok in the context of a company, where it's valuable to have people with complementary skills, but it can definitely cause some problems in interviews.
Just as an aside, I wonder what's going to happen with online board games (except for the end of this footnote, this entire footnote is about board games, and you might want to skip to the bottom if you have no interest in board games). One issue is that there's high "demand" for cheating and there was quite a bit of cheating in competitive online play even before LLMs. Today, it's not all that hard to make a bot that can cheat and it's only going to get easier. I'm not really interested in playing against random strangers online, so I don't generally play ranked/competitive online board games, but quite a few people do. It's hard to see how this survives if the amount of cheating increases.
Another thing I wonder about is the value of the big game platforms. Board Game Arena (BGA) seems like the biggest platform by a fairly large margin, with Tabletop Simulator (TTS) being second. BGA is, uhhh, you might call it fairly quirky if you're being generous. For example, by default, the platform doesn't really allow players to act concurrently. If you click on anything at the same time another player clicks, it prevents them from doing an action if your click manages to get through to the server slightly before theirs. For games with simultaneous actions or selections, this can cause people to get stalled out of their turns for tens of seconds at a time as other players do their actions. I could write a post the length of this entire post on various issues with BGA, but suffice to say that the other issues you might expect of a multiple player online platform that fundamentally doesn't work if players attempt to act concurrently are problems on the platform. On average, TTS is even clunkier to use.
As a result of this, it's fairly easy to implement a nicer user experience for games that don't need a complex interface. Game rules and mechanisms don't fall under US copyright law, so it appears completely legal to implement a game with the exact same rules as long as you don't use their graphics, trademarked logo, etc. I don't personally want to try to build a platform with unlicensed games that tries to compete with BGA, but I do sometimes want to play a game without the quirks that BGA has (or that isn't available on BGA). Depending on the game, it's taken between seconds of my time and a small number of hours to get what seems like a reasonable version of the rules working. Seconds is for simple games where the game basically just works from a simple prompt (for example, here's a game with the same rules as Scout, which I implemented because my friends couldn't find a way to play Scout online. BTW, I don't promise the game will work at all if you try it; it's hosted on some kind of janky free tier hosting because, as I mentioned, I'm not trying to build some kind of big platform and this works better than BGA when me and my friends use it, which is the (very low) quality bar I'm shooting for, which is that we don't run into bugs while playing, not that it's a generally robust implementation). And although the runtime performance of my game implementations is very poor compared to what's technically possible, my friends who are used to BGA all comment on how it feels instantaneous because they're used to the much slower performance of BGA. When the platform people are used to is both extremely slow and extremely buggy, it's easy to vibe code something nicer.
Some games will take longer because current models can't one-shot them. In terms of my time, Guards of Atlantis is probably the most time-consuming game I've tried to implement. It has a lot of rules because each card has some custom rules, and the rules seem unusually difficult for humans to understand, which also makes them non-ideal for LLMs. I've seen a much higher rate of people accidentally playing the rules incorrectly than I have for most games, and I'm not sure that I've seen anyone play correctly unless they're on the official discord and follow rules discussions there or were taught by such a person. Given that, I don't think it would be reasonable to expect an LLM to one-shot the game.
Anyway, given that it takes between seconds and a few hours of someone's time to implement a board game in a nicer form than on the dominant online board game platforms, I wonder what's going to happen to the dominant board game platforms? Will people keep them because of network effects? A decent fraction of the few people I know who play board games online find network effects to be of little import. For example, see these quotes from the aforementioned #1 Azul player in the world:
[I]'ve been having a blast experimenting with [your bot] during my games recently. the bot finds really cool/creative moves that i would never consider, which helps me become more open-minded to how i approach rounds
...
playing games against these bots can be quite fun (i havent touched azul on bga in months) if they are of ~equal strength
It was interesting, but I think not surprising, to see the top player in the world playing against bots instead of on BGA for months. One thing with the big online platforms today is that it's actually fairly hard to match up with someone of comparable skill. The stronger a player is, the worse this problem is. Like I said above, I don't really enjoy playing against random strangers, so I don't do it much, but a nearly universal complaint from folks who play competitive ranked games on BGA is what a grind the ladder is. If you're a strong competitive player, almost everyone you match up with has a much lower Elo, so you're playing these boring games where you stomp the other player to gain a tiny bit of Elo per game. And then if you make a blunder, you lose a ton of Elo for a single loss. A while ago, I was too sick to do much of anything, so I tried ranked competitive play for a game on BGA and got up to #4 in the world. During the entire climb to get to #4, I played two games against someone ranked in the top 10. Every other game was a fairly boring game against a significantly weaker player. For competitive players, there's the impending problem of rampant cheating, and the existing problem that ranked competitive play is mostly mind-numbingly boring.
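To put rough numbers on the "tiny gain per win, huge loss per blunder" problem, here's a small sketch using the textbook Elo formula. I don't know the exact rating formula or constants BGA uses, so the K-factor of 32 and the 400 point rating gap below are assumptions picked purely for illustration.

```python
def expected_score(rating: float, opponent_rating: float) -> float:
    """Textbook Elo expected score for the player with `rating`."""
    return 1.0 / (1.0 + 10.0 ** ((opponent_rating - rating) / 400.0))


def rating_change(rating: float, opponent_rating: float, score: float,
                  k: float = 32.0) -> float:
    """Rating delta for one game; score is 1.0 for a win, 0.0 for a loss.

    k=32 is an assumed K-factor, not a value taken from any real platform.
    """
    return k * (score - expected_score(rating, opponent_rating))


# Hypothetical ratings: a strong player matched against someone 400 points lower.
strong, weak = 2000.0, 1600.0

gain_per_win = rating_change(strong, weak, score=1.0)
loss_per_blunder = rating_change(strong, weak, score=0.0)
wins_to_recover = -loss_per_blunder / gain_per_win

print(f"gain per win: {gain_per_win:+.1f}")
print(f"loss per loss: {loss_per_blunder:+.1f}")
print(f"wins needed to undo one loss: {wins_to_recover:.1f}")
```

Under these assumptions, the stronger player is expected to win about 91% of the time, gains roughly 3 points per win, loses roughly 29 points per loss, and needs about ten wins to undo a single blunder. The gap only has to be a bit wider for the grind to get much worse.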
For non-competitive play, BGA is less problematic, but I happened to be teaching somebody how to play a game while writing this footnote, and they ran into three different bugs in that time. And, unless you're playing one of the most popular games, there's often a long wait for a game (and even for the most popular games, you're in for a wait during off hours). It's not clear to me why people should be playing on these platforms other than inertia. Maybe inertia is enough of a reason and these platforms will continue to be the place people play, but for someone like me who mostly just plays with local friends when playing online, there isn't really any reason not to use a less buggy and better performing app that someone vibe coded over their lunch break.
This same kind of reasoning seems like it should apply to a lot of other kinds of apps. For example, here's Josh Bleecher Snyder talking about coding up a shopping list app as he was shopping. In theory, if software worked, you'd want to use the highly polished, well-tested version that some company has created, but software mostly doesn't work and there are a lot of apps where it's fairly easy to whip up a better version than what most people are using today. Of course, there are still many cases where that's not practical, but I still find it amazing how much of the software I use somewhat regularly can be remade in a nicer way fairly easily.