Hacker News

It's even worse than this: the "tasks" that are evaluated are limited to a single markdown file of instructions, plus an opaque verifier (pages 13-14). No problems involving existing codebases, refactors, or anything of the sort, where the key constraint is that the "problem definition" in the broadest sense doesn't fit in context.

So when we look at the prompt they gave to have the agent generate its own skills:

> Important: Generate Skills First
>
> Before attempting to solve this task, please follow these steps:
>
> 1. Analyze the task requirements and identify what domain knowledge, APIs, or techniques are needed.
> 2. Write 1–5 modular skill documents that would help solve this task. Each skill should: focus on a specific tool, library, API, or technique; include installation/setup instructions if applicable; provide code examples and usage patterns; be reusable for similar tasks.
> 3. Save each skill as a markdown file in the environment/skills/ directory with a descriptive name.
> 4. Then solve the task using the skills you created as reference.

There's literally nothing it can do by way of "exploration" to populate and distill self-generated skills - not with a web search, not exploring an existing codebase for best practices and key files - only within its own hallucinations around the task description.

It also seems they're not even restarting the session after the skills are generated, judging from that fourth step? So it's just regurgitating the context that was used to generate the skills.

So yeah, your empty-codebase vibe coding agent can't just "plan harder" and make itself better. But this is a misleading result for any other context, including the context where you ask for a second feature on that just-vibe-coded codebase with a fresh session.



I don't see how "create an abstraction before attempting to solve the problem" will ever work as a decent prompt when you are not even steering it towards specifics.

If you gave this exact prompt to a senior engineer I would expect them to throw it back and ask wtf you actually want.

LLMs are not mind readers.


If it were in the context of parachuting into a codebase, I’d make these skills an important familiarization exercise: how are tests made, what are patterns I see frequently, what are the most important user flows. By forcing myself to distill that first, I’d be better at writing code that is in keeping with the codebase’s style and overarching/subtle goals. But this makes zero sense in a green-field task.


There's overlap: with brownfield or legacy code you are strongly opinionated toward the status quo, while with greenfield you are still strongly opinionated, just with fewer constraints.

You have to work with conviction though. It's when you offload everything to the LLM that things start to drift from expectations, because you kept the expectations in your head and away from the prompt.


Do skills extracted from existing codebases cause better or worse code in that they bias the LLM towards existing bad practices? Or, can they assist in acknowledging these practices, and bias it towards actively ensuring they're fixed in new code? How dependent is this on the prompt used for the skill extraction? Are the skills an improvement over just asking to do this extraction at the start of the task?

Now this dynamic would be a good topic to research!


Interesting.

I think it's because AI models have learned that we prefer confident-sounding answers, and that they shouldn't pester us with questions before giving one.

That is, follow my prompt, and don't bother me about it.

Because if I am coming to an AI Agent to do something, it's because I'd rather be doing something else.


If I already know the problem space very well, we can tailor a skill that will help solve the problem exactly how I already know I want it to be solved.


> limited to a single markdown file of instructions

A single file of instructions is common in most benchmark papers, e.g. Terminal Bench. We also have very complicated prompts, like this one: https://www.skillsbench.ai/tasks/shock-analysis-supply

> opaque verifier

Could you specify which tasks' verifiers are unclear or defective for benchmarking purposes?

> No problems involving existing codebases, refactors, or anything of the like

Also not true; we have many such tasks, e.g. https://www.skillsbench.ai/tasks/fix-build-google-auto, https://www.skillsbench.ai/tasks/fix-build-agentops, https://www.skillsbench.ai/tasks/react-performance-debugging


That's actually super interesting, and it's why I really don't like the whole .md folder structure, or even any CLAUDE.md. Most of the time you just want to give it exactly what it needs for best results.

The headline is really bullshit, yes, but I like the testing.


CLAUDE.md in my projects only has coding / architecture guidelines. Here's what not to do. Here's what you should do. Here are my preferences. Here's where the important things are.
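For concreteness, a file like that might look roughly as follows. This is an illustrative sketch, not the commenter's actual file; all paths and rules are made up:

```markdown
# CLAUDE.md (illustrative example)

## Don't
- Don't add new dependencies without asking first.
- Don't touch generated files under src/gen/.

## Do
- Run the test suite after every change.
- Prefer small, focused functions over large ones.

## Where the important things are
- API handlers: src/api/
- Shared types: src/types.ts
- Migrations: db/migrations/
```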

Even though my CLAUDE.md is small, my rules are often ignored. Not always, though, so it's still at least somewhat useful!


I’m pretty sure Claude just uses mine to keep a running list of pressure points for when I get cross with it.


I'm screwed when the robot psychological warfare begins. They'll make everything I read have 4 space indentation... and I'll just hand over the keys.


I'm trying out some other Claude Code features, and I'm thinking maybe hooks can do something with this.

Have a hook on switching out of plan mode, and maybe on edits, that passes the change to Haiku along with the CLAUDE.md to check whether it matches.
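A sketch of what that could look like in `.claude/settings.json` — the event name, matcher, and the `tool_input.file_path` field are assumptions about Claude Code's hook contract, so check them against the docs before relying on this:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          {
            "type": "command",
            "command": "jq -r '.tool_input.file_path' | xargs -I{} claude -p \"Does the change in {} follow the rules in CLAUDE.md? Answer PASS or FAIL with reasons.\" --model haiku"
          }
        ]
      }
    ]
  }
}
```

The idea is that the hook command receives the tool-call JSON on stdin, pulls out the edited file's path with jq, and asks a cheap Haiku call to grade the change against CLAUDE.md.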


What's the hook for switching out of plan mode? I'd like to launch a planning skill whenever Claude writes a plan, but it never picks up the skill, and I haven't found a hook that can force it to.


Man, that's what I've been trying to build the whole time, but I keep getting JSON parsing errors. I've debugged a lot, but it seems Haiku's output isn't consistent with the expected format. I want a hook that tells it at the end: "Make sure you've built and run the relevant tests. Let me know if you need anything else."
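One way to sidestep the JSON parsing errors is to keep the hook script itself defensive: parse stdin tolerantly and only emit well-formed JSON back. This is a minimal sketch of a Stop-style hook; the `transcript_path` field and the `{"decision": "block", "reason": ...}` response shape are assumptions about Claude Code's hook payload, so verify them against the documentation:

```python
import json
import sys


def check_stop(payload: dict) -> dict:
    """Decide whether to let the agent stop, or push it to run tests first.

    `payload` is the JSON a Stop hook receives on stdin. The field name
    `transcript_path` is an assumption about that payload, not a documented
    contract.
    """
    transcript = payload.get("transcript_path", "")
    if not transcript:
        # Nothing to check; return an empty object, which allows the stop.
        return {}
    # Block the stop and feed a reminder back to the agent.
    return {
        "decision": "block",
        "reason": "Before finishing, make sure you've built and run the relevant tests.",
    }


if __name__ == "__main__":
    try:
        payload = json.load(sys.stdin)
    except json.JSONDecodeError:
        # Tolerate empty or malformed input instead of crashing the hook.
        payload = {}
    print(json.dumps(check_stop(payload)))
```

Keeping the decision logic in a pure function makes it easy to unit-test separately from the stdin/stdout plumbing, which is where the flaky parsing tends to bite.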


We didn't create that headline, yeah. Thanks for liking the testing!



