What do you think of SWE-bench or SWE-bench Verified as benchmarks for tracking real-world software engineering skill? Those scores are rising, but I'm not sure how easy it is to game them.
Has anyone ever tested AI programming on tasks that require numerical analysis? Example: systems of nonlinear differential equations, in which the number of iterations needs to be limited as a function of how quickly results begin diverging, or of whether convergence is rooted in reality.
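To make the kind of task concrete, here is a minimal sketch (my own illustration, not from any benchmark): a plain Newton iteration on a small nonlinear system where the iteration budget is cut short as soon as the residual starts growing instead of shrinking. The system, tolerances, and blow-up factor are illustrative assumptions.

```python
import numpy as np

def F(x):
    # Example nonlinear system: a circle intersected with an exponential curve.
    return np.array([x[0]**2 + x[1]**2 - 4.0,
                     np.exp(x[0]) + x[1] - 1.0])

def J(x):
    # Analytic Jacobian of F.
    return np.array([[2.0 * x[0], 2.0 * x[1]],
                     [np.exp(x[0]), 1.0]])

def newton_with_divergence_guard(x0, tol=1e-10, max_iter=50, blowup=10.0):
    x = np.asarray(x0, dtype=float)
    prev_res = np.linalg.norm(F(x))
    for k in range(max_iter):
        step = np.linalg.solve(J(x), -F(x))
        x = x + step
        res = np.linalg.norm(F(x))
        if res < tol:                      # converged on the actual residual,
            return x, k + 1, "converged"   # not just on a small step size
        if res > blowup * prev_res:        # residual grew sharply: stop early
            return x, k + 1, "diverging"   # instead of burning the full budget
        prev_res = res
    return x, max_iter, "budget exhausted"

x, iters, status = newton_with_divergence_guard([1.0, 1.0])
print(status, iters, x)
```

The interesting question is whether an AI can write the guard logic itself, i.e. decide when an iteration is genuinely converging versus merely stalling.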
Interesting take. I have been pointing out how bad coding puzzles are as a proxy for SWE progress, and as a result I did not look into the benchmark methodology used - if they didn't adjust for time, it is a pointless way of comparing.
On the other hand, this just illustrates that we are using LLMs wrong if we expect good results in a single pass, when they excel at generating candidates in batches instead.
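For instance, something like this rough best-of-N loop is closer to how the models are actually useful; `generate` and `run_tests` here are hypothetical stand-ins, not a real API.

```python
import random
from typing import Optional

def generate(prompt: str, temperature: float) -> str:
    # Placeholder for an LLM call; returns one candidate patch as text.
    return f"candidate patch (T={temperature:.2f}, seed={random.random():.3f})"

def run_tests(patch: str) -> bool:
    # Placeholder for applying the patch and running the test suite.
    return random.random() < 0.3  # pretend roughly 30% of samples pass

def best_of_n(prompt: str, n: int = 8) -> Optional[str]:
    # Sample several candidates instead of trusting a single pass,
    # and keep the first one that passes the tests.
    for _ in range(n):
        patch = generate(prompt, temperature=random.uniform(0.2, 1.0))
        if run_tests(patch):
            return patch
    return None  # all n samples failed; fall back or escalate

print(best_of_n("fix the reported issue"))
```

The single-pass number is the pessimistic bound; the batch-plus-verifier number is what you actually get in practice, provided you have a decent test suite to verify against.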