Hacker News | joshuahedlund's comments

> thinkism

Thank you for this term. In my view, the belief that AGI singularly will rapidly destroy us because it will think 10,000 times faster than us is a form of thinkism.


> Last mile is already “solved” with the little robots that drive around cities, no need for hands.

And yet we haven’t seen widespread adoption because they can’t handle stairs, steep slopes, streets without sidewalks, sidewalks with mud, or a hundred other real-world challenges.


We haven’t seen widespread adoption because they can’t hope to compete with human delivery drivers on cost. The cost to DoorDash and Uber Eats of a delivery driver is nothing upfront and a few dollars per delivery. The cost of a delivery robot is thousands of dollars upfront and more per delivery. Stairs aren’t even in the top 10 problems these robots face; they’re more than capable of delivering to most customers already.

Sometimes it’s hard to tell objectively whether two animals don’t appear to reproduce because they are genetically unable to, because they are technically able but behaviorally unwilling under normal natural circumstances, or because we just haven’t observed it for that particular combo.


It has, tho the rate of new record highs has been declining from peak to peak: 10x > 3x > 1.5x


Any ideas why Verified has stagnated? It was increasing rapidly and then basically stopped.


it has been pretty much a benchmark for memorization for a while. there is a paper on the subject somewhere.

swe bench pro public is newer, but it's not live, so it will slowly get memorized as well. the private dataset is more interesting, as are the results there:

https://scale.com/leaderboard/swe_bench_pro_private


Scott Alexander blogged about it today: https://www.astralcodexten.com/p/best-of-moltbook


> If your job is to translate requirements into code manually - and that's it - you're the generalist travel agent.

I’ve been a full-stack web programmer at five different companies over the last fifteen years, big and small, e-commerce and B2B, junior to senior to staff, and that has never fully described my responsibilities.


Which responsibilities do you figure are a combination of highly valuable in your role, and resistant to automation?


Knowing what to implement, and having the social skills to perform various tasks in a company?


I would love for SWE-bench Verified to put out a set of fresh but comparable problems and see how the top-performing models do, to test against overfitting.


> most normal people don't know what Claude or Gemini are

“Google Gemini” is the No. 2 ranked app in the Apple App Store (behind ChatGPT) and has been for some time.


https://en.wikipedia.org/wiki/Goodhart%27s_law "When a measure becomes a target, it ceases to be a good measure"

I'm also curious what results we would get if SWE came up with a new set of 500 problems to run all these models against, to guard against overfitting.

