
[I'm one of the co-creators of SWE-bench] The team managed to improve on the already very strong o3 results on SWE-bench, but it's interesting that we're only seeing an improvement of a few percentage points. I wonder if getting from 75% to 85% on Verified is going to take as long as it took to get from 20% to 75%.


I could be completely off base, but it feels to me like benchmaxxing is going on with SWE-bench.

Look at the results from Multi-SWE-bench - https://multi-swe-bench.github.io/#/

SWE-PolyBench - https://amazon-science.github.io/SWE-PolyBench/

Kotlin bench - https://firebender.com/leaderboard


I kind of had the feeling LLMs would be better at Python than at other languages, but wow, the difference on Multi-SWE-bench is pretty crazy.


Maybe a lot of the difference we see between people's comments about how useful AI is for their coding is a function of what language they're using. Python coders may love it, Go coders not so much.


Not sure what you mean by benchmaxxing, but we think there's still a lot of useful signal you can infer from SWE-bench-style benchmarking.

We also have SWE-bench Multimodal, which adds a twist I haven't seen elsewhere: https://www.swebench.com/multimodal.html


I mean that there's a possibility SWE-bench is being specifically targeted during training, so the results may not reflect real-world performance.


How long did it take to go from 20% to 75%?



