What do you think of SWE-bench or SWE-bench Verified as benchmarks for tracking real-world software engineering skill? Those scores are rising, but I'm not sure how easy it is to game them.
Has anyone ever tested AI programming on tasks that require numerical analysis? Example: systems of nonlinear differential equations, in which the number of iterations needs to be limited as a function of how quickly results begin diverging, or of whether convergence is rooted in reality.
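To make the kind of task concrete, here is a minimal sketch (my own illustration, not from any benchmark): a plain Newton iteration on a small nonlinear system where the iteration budget is cut short as soon as the residual starts growing instead of shrinking. The system, tolerances, and blow-up factor are illustrative assumptions.

```python
import numpy as np

def F(x):
    # Example nonlinear system: a circle intersected with an exponential curve.
    return np.array([x[0]**2 + x[1]**2 - 4.0,
                     np.exp(x[0]) + x[1] - 1.0])

def J(x):
    # Analytic Jacobian of F.
    return np.array([[2.0 * x[0], 2.0 * x[1]],
                     [np.exp(x[0]), 1.0]])

def newton_with_divergence_guard(x0, tol=1e-10, max_iter=50, blowup=10.0):
    x = np.asarray(x0, dtype=float)
    prev_res = np.linalg.norm(F(x))
    for k in range(max_iter):
        step = np.linalg.solve(J(x), -F(x))
        x = x + step
        res = np.linalg.norm(F(x))
        if res < tol:                      # converged on the actual residual,
            return x, k + 1, "converged"   # not just on a small step size
        if res > blowup * prev_res:        # residual grew sharply: stop early
            return x, k + 1, "diverging"   # instead of burning the full budget
        prev_res = res
    return x, max_iter, "budget exhausted"

x, iters, status = newton_with_divergence_guard([1.0, 1.0])
print(status, iters, x)
```

The interesting question is whether an AI can write the guard logic itself, i.e. decide when an iteration is genuinely converging versus merely stalling.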
Interesting take. I have been pointing out how bad coding puzzles are as a proxy for SWE progress, and as a result I did not look into the benchmark methodology used - if they didn't adjust for time, it is a pointless way of comparing.
On the other hand, this just illustrates that we are using LLMs wrong if we expect good results in a single pass, when they excel at generating candidates in batches instead.
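For instance, something like this rough best-of-N loop is closer to how the models are actually useful; `generate` and `run_tests` here are hypothetical stand-ins, not a real API.

```python
import random
from typing import Optional

def generate(prompt: str, temperature: float) -> str:
    # Placeholder for an LLM call; returns one candidate patch as text.
    return f"candidate patch (T={temperature:.2f}, seed={random.random():.3f})"

def run_tests(patch: str) -> bool:
    # Placeholder for applying the patch and running the test suite.
    return random.random() < 0.3  # pretend roughly 30% of samples pass

def best_of_n(prompt: str, n: int = 8) -> Optional[str]:
    # Sample several candidates instead of trusting a single pass,
    # and keep the first one that passes the tests.
    for _ in range(n):
        patch = generate(prompt, temperature=random.uniform(0.2, 1.0))
        if run_tests(patch):
            return patch
    return None  # all n samples failed; fall back or escalate

print(best_of_n("fix the reported issue"))
```

The single-pass number is the pessimistic bound; the batch-plus-verifier number is what you actually get in practice, provided you have a decent test suite to verify against.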