AI Predictions 2026

Hey man, it’s still January; these still count as New Year’s predictions.

2025 Predictions in Hindsight

At the start of 2025 I made some predictions via the Sage AI 2025 Forecasting Survey. Let’s see how I did and what we can learn:

  • RE-bench: 1.13 (I predicted 1.1)
  • SWE-bench Verified: 80.9% (I predicted 87%)
  • Cybench: 82% (I predicted 78%)
  • OSWorld: 72.6% (I predicted 50%)
  • FrontierMath: 40.7% (I predicted 35%)
  • Cybersecurity Medium: YES (I put 50%)
  • CBRN High: YES (I put 25%)
  • Autonomy Medium: YES (I put 75%)

I was too bullish on SWE-bench Verified; too bearish on OSWorld and the OAI Cyber/CBRN thresholds; and a bit too bearish on FrontierMath. I think I was about right on Cybench and OAI Autonomy. I didn’t predict on revenues or public salience, and OAI Persuasion resolved ambiguous.

Looking back at the tweet thread where I talked through my reasoning a bit, I was probably thinking too much about inference-time compute. I thought it would be a factor in favor of SWE-bench and against OSWorld. I think I underestimated RLVR more generally, along with further improvements in pretraining.

Sadly, forecasting benchmark scores is unavoidably a bit noisy, because problem difficulty curves aren’t necessarily intuitive. It’s difficult to know any benchmark well enough to really compare problems at the X vs Y difficulty percentiles. The clearest example of this was the top end of SWE-bench Verified moving much more slowly than the median did, and than I expected. Still, it was the best we had at the time – time horizons are much better in this regard, even if they are squishier.

I don’t think I wrote down enough of my reasoning to learn as much as I could’ve from the 2025 forecasting exercise, so this year I’ll write down some more.

Quantitative Forecasts for 2026

METR time horizon is ever so slightly cursed, because the time horizon task suite badly needs to be extended to longer completion times – it seems plausible (25%) that METR will stop reporting a 50% time horizon because it’s too hard to baseline. It’s also possible that the time horizons of current top models are inflated, and that the true trend is actually very different from a 4.7-month doubling time, probably slower (less than 3.5h after the task suite is revised, 20%). With both of those caveats, my median view is roughly that the straight line continues to go brr; my 80% CI on the doubling time is 3–7.5 months. These correspond to roughly 72-, 28-, and 14-hour coding time horizons by the end of the year (fast end, median, slow end respectively).
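The arithmetic behind those endpoints is just compound doubling. A minimal sketch, assuming a current 50% horizon of ~5.5 hours and ~11 months remaining in the year – both are my own back-solved assumptions, not numbers METR reports:

```python
# Compound-doubling sketch of the time-horizon extrapolation above.
# ASSUMPTIONS (not from the post): current 50% horizon ~5.5 hours,
# ~11 months left in the year.
current_horizon_h = 5.5
months_left = 11

for doubling_months in (3.0, 4.7, 7.5):  # fast end, median, slow end
    eoy = current_horizon_h * 2 ** (months_left / doubling_months)
    print(f"{doubling_months}-month doubling -> ~{eoy:.0f}h by EOY")
```

This lands within a couple hours of the 72/28/14 figures; the gaps are just imprecision in the assumed starting horizon.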

FrontierMath T4 is weird. On a naive logistic fit, T4 looks like it’s going up much quicker than T1–T3, and is on track to saturate by the end of the year. I have no ability to assess the difficulty curves of these benchmarks on my own inside view, but it sure would be weird if T4 actually did surpass T1–T3. I ended up mostly going with the predictions of the simplest model I could come up with that seems reasonable in this regard: jointly fitting T1–T3 and T4 with the same sigmoid, shifted in time. The baseline I take away from this model is closer to the bootstrap median than to the curve of best fit, because of the usual problems with fitting sigmoids; the 6% difference from there to my all-things-considered view is basically just a gut fudge factor.
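For concreteness, here’s a minimal sketch of the “same sigmoid, shifted in time” model. The scores below are synthetic (generated from a known curve plus noise) rather than the real FrontierMath data; only the model form matches the text.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

def joint_sigmoid(x, k, t0, shift):
    # One sigmoid shape shared across tiers: T4 has the same slope k,
    # but its midpoint is delayed by `shift` years relative to T1-T3.
    time, is_t4 = x
    return 1 / (1 + np.exp(-k * (time - t0 - shift * is_t4)))

# SYNTHETIC scores for illustration (true params: k=2.0, t0=1.5, shift=1.0)
time = np.tile(np.linspace(0, 3, 12), 2)
is_t4 = np.repeat([0.0, 1.0], 12)
true = joint_sigmoid((time, is_t4), 2.0, 1.5, 1.0)
scores = np.clip(true + rng.normal(0, 0.03, true.size), 0, 1)

# Joint fit recovers the shared slope, the T1-T3 midpoint, and the T4 lag
popt, _ = curve_fit(joint_sigmoid, (time, is_t4), scores, p0=(1.0, 1.0, 0.5))
k, t0, shift = popt
print(f"slope={k:.2f}, T1-T3 midpoint={t0:.2f}y, T4 lag={shift:.2f}y")
```

Bootstrapping would then resample the score points and refit to get a distribution over the saturation date, which is where the bootstrap-median baseline comes from.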

I don’t know too much about RLI. From eyeballing their website, it contains tasks like “Building Vent CAD Models” and “Apartment Interior Design”. I think the interface is going to matter a lot for tasks like this, and it doesn’t seem like RLI provides a very good one. The results on the CAD task in particular really smell to me like a model fighting inadequate tooling. My guess is that the models are going to get better at this sort of thing over the course of the year, and there will be a bit of progress on RLI, but that the RLI results will somewhat under-represent the diffusion of LLMs into occupations beyond writing/coding.

I also don’t know much about OAI-Proof Q&A. There’s not a lot of info on the benchmark, and not much of a trend to extrapolate. My reasoning is basically as follows: “takes an OAI team >1 day” maybe means 5 people working 8 hours each, so a 40h effective coding time horizon? Slap on a 2x for messiness: 80h. My main expectation for the 50% coding time horizon by EOY is more like 30h, and eyeballing the logistic curves, a task a couple times longer than the 50% coding time horizon usually has a 30–40% success rate.
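To make that last step concrete, here’s a sketch that models success probability as a logistic in log task length, pinned to 50% at the time horizon. The slope `beta` is my own assumption, not a number from METR or the benchmark:

```python
# Success probability vs. task length: logistic in log task length,
# with 50% success exactly at the time horizon. beta is an ASSUMED slope.
def p_success(task_hours, horizon_hours, beta=0.8):
    return 1 / (1 + (task_hours / horizon_hours) ** beta)

# "Takes an OAI team >1 day": 5 people * 8h = 40h, times 2 for messiness
task_hours = 5 * 8 * 2     # 80h effective
horizon_hours = 30         # assumed EOY 50% coding time horizon

print(f"{p_success(task_hours, horizon_hours):.0%}")
```

With these numbers the sketch lands at ~31%, inside the 30–40% band.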

GSOBench is pure coding, and it has a trend to extrapolate. Straight line go brr. The naive logistic fit says 82% by EOY; I fear the noise ceiling, so I fudge downward to 72%.

Same situation with ECI: trend extrapolation suggests 170. I fudged downward three points; I don’t remember why.

A naive exponential fit to lab revenue predicts 173B, but a lot of that comes from extremely aggressive Anthropic growth. Fitting all three included labs with a shared slope suggests a much more conservative 100B. For reference, if there are about 50 million SWEs in the world, and they’re all paying for something like Cursor Ultra at $200/mo, that’s about 120B/year. Obviously that’s not what the revenue mix will actually look like – maybe it’ll be 50/50 business/consumer by EOY – but it’s in the right ballpark for 110B as my final gut number.
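The back-of-envelope behind that reference point:

```python
# Sanity check: 50M SWEs all paying ~$200/mo (the Cursor Ultra price point)
swes = 50_000_000
usd_per_month = 200
annual_usd = swes * usd_per_month * 12
print(f"${annual_usd / 1e9:.0f}B/year")  # -> $120B/year
```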

The “Americans who say AI is most important” question is from a survey where participants are asked “What do you think is the most important problem facing the country today?”. Answers are coded by humans into categories, but one answer can fit multiple categories. Some of the highest scorers are Government (25%), Immigration (20%), and Economy In General (13%). Answers in the vicinity of 5% include Unifying the Country, Judicial System/Courts/Laws, Race Relations/Racism, and Elections/Democracy. Answers around 1% include Corporate Corruption, Abortion, China, Russia, and Lack of Money. Staying below 1% is probably a bottom-10% outcome IMO, but beating Racism for salience is around a top-10% outcome. I’m not that wedded to 2%; who knows man.

The METR uplift question is also a bit sad, because it’s so unclear when the last METR uplift study of 2026 will land. There will probably be one in the first half, but will there be another in the second half? I have maybe 25% on only one more uplift study being published between now and EOY, and maybe 20% that the next study still shows downlift for experienced devs in familiar codebases. I now think my 10th percentile at 0.95x is too low, but my 90th percentile at 2x seems about right. Median at 1.3x – eh idk bro, it’s a gut number.

Qualitative Forecasts for 2026

Let’s talk briefly about other things I expect will/won’t happen in 2026.

No more slop codebases. Currently the LLMs are very bad at organizing their code, and often shoot themselves in the foot with horrific technical debt when doing tasks autonomously. I expect this to get basically solved this year, or at least pushed out to much larger codebases and architecture problems that would be somewhat difficult for humans.

More coding users make the context switching jump. Currently many people use AI coding assistance in a “copilot” workflow, where you keep track of most of the code and write a lot of it yourself, but delegate certain quick tasks to AI. The slightly sad part of a copilot workflow is that while the AI is coding, you usually just sit there and stare at it for 30 seconds up to a few minutes. As the tasks get bigger and take longer you’ll have more of this wait time, up until it becomes worth it to context switch and do something else. My guess is this becomes barely doable around 10 minutes, and reasonable around 30. Some people claim to already have made this jump and work by checking in on multiple parallel agents. I think these people are mostly either 1) working on small/greenfield projects 2) getting downlifted 3) lying for Twitter clout 4) crazy.

Communication/judgement increasingly bottleneck uplift more than pure coding ability. Another thing that naturally happens as the AIs handle bigger tasks is that the tasks are more complicated and take longer to describe. This has a few main consequences:

  • More up-front communication. Cursor’s “Planning” workflow is an early version of this, which I currently don’t think adds very much value beyond some to-do list scaffolding and prompting the model to ask clarifying questions.
  • As the tasks get bigger, enumerating and communicating all the degrees of freedom in advance gets harder. The next stage after up-front communication is for the LLM to ping the user with notifications when there’s a particularly high value-of-information (VoI) decision to be made.
  • But the obvious eventual destination is the LLMs being able to competently make those decisions themselves. Some amount of this skill is obviously already present whenever the AIs work on tasks that aren’t 100% well-specified number-go-up tasks, but it’s going to become a bigger deal.

Copilots for a few “editor-based tasks”, but still no broadly useful computer-use (CU) agents and not much workforce adoption. LLMs are so great for coding largely because of the copilot workflow; if they had to be autonomous agents for some reason, they’d be way less useful. The copilot workflow is basically only possible in an “editor”-type domain where proposals are easy to review and accept/reject. I can’t think of any good way to uplift a human with AI in a GUI-heavy domain like CRM munging or shopping on Amazon. Either the human sits there watching the AI work, or sends the AI off autonomously, or maybe tries to monitor multiple AIs split-screen – which might work a little bit, but I expect mostly sucks. Cursor analogues for Sheets, Docs, Slides, and Forms might happen, but even so I don’t expect much adoption. Another factor in favor of coding is that SWEs do all their work in an IDE, so tools like Cursor that invest heavily in optimizing the copilot workflow are worthwhile.

No neuralese in a production model this year (<10%), and not much public research into it (25%). You only invest in neuralese when you’re using lots and lots of CoT, and the only time you use lots and lots of CoT is for autonomous agents – you can’t use that much reasoning in a copilot workflow while the user is waiting. Plus, there’s still so much RL scaling to do in 2026 that neuralese surely isn’t the lowest-hanging fruit yet. In 2027, I think, it starts to get a little dicey.