Hacker News | visarga's comments

> In that situation, they give the (wrong) answer that sounds the most plausible.

Not if you use web search or deep research. You should not use LLMs as knowledge bases; they are language models. They learn language, not information, and are just models, not replicas of the training set.


> because it has nothing to pull from?

Chat rooms produce trillions of interactive tokens per day now, tokens where AI can poke and prod at us and have its ideas tested in the real world (by us).


> It's crazy to me that you'd trust the output of an LLM for that. It's something where if you do it wrong it could cause major damage,

With critical tasks you need to cross-reference multiple AIs. Start by running four deep research reports, on Claude, ChatGPT, Gemini, and Perplexity, then put all of them into a comparative, critical-analysis round. This reduces variance: the models are different and use different search tools. You can even send them in different directions; one searches blogs, another Reddit, and so on.
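The fan-out and cross-check workflow can be sketched as follows. `run_deep_report` is a stand-in, not a real API: in practice each call would go to a different provider's deep-research feature, and the model names and source assignments below are just illustrations.

```python
# Stub for a provider-specific deep-research call, so the
# orchestration logic itself is runnable.
def run_deep_report(model: str, question: str, focus: str) -> str:
    return f"[{model} report on {question!r}, focused on {focus}]"

def cross_check(question: str) -> str:
    # Send each model in a different direction, as described above.
    assignments = {
        "claude": "official docs and manuals",
        "chatgpt": "blogs and articles",
        "gemini": "academic sources",
        "perplexity": "reddit and forums",
    }
    reports = {m: run_deep_report(m, question, focus)
               for m, focus in assignments.items()}
    # The comparative, critical-analysis round: one prompt containing
    # all four reports, asking for agreements and contradictions.
    joined = "\n\n".join(f"## {m}\n{r}" for m, r in reports.items())
    return ("Compare the following reports. List points of agreement, "
            "contradictions, and unsupported claims:\n\n" + joined)
```

The combined prompt then goes back to one of the models for the final analysis round.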


Or you can ask for a link to the manual. I genuinely can't tell if your post is real advice or sarcasm intended to highlight the insanity of fitting square pegs into round holes by using LLMs for everything.

> These news sites run ads that are borderline gore, disturbing images promoting snake oil weight loss or skin care treatments

And that doesn't raise an eyebrow, but a well-worded AI article based on sources is described as slop.


Looks like another bullet-point machine, the cheapest way to present a response.

You have two paths: code tests, and AI review, which is just a vibe test of the LGTM kind. You should use both in tandem. Code testing is cheap to run, and you can build more complex systems if you apply it well. But ultimately it is the user, or usage, that needs to direct testing, or you pay the price for formal verification. Most of the time it is usage: time passing reveals failure modes, and hindsight is 20/20.

> In 2030, how is Anthropic going to keep Claude "up-to-date"

I think the majority of research, design, and learning goes through LLMs and coding agents today; considering the large user base and usage, it must be trillions of tokens per day. You can take a long research session, or a series of them, and apply hindsight: what idea above can be validated below? This creates a dense learning signal based on validation in the real world, with a human in the loop and other tools, code and search.


An unqualified statement. The user has copyright over the elements they provide: in an image, if they make manual edits, for example, those are protected. In a modern agentic codebase the code itself is the least valuable part; what counts more are the specs and tests.

Good luck with that argument in court.

A nice way to use traditional ML models today is to do feature extraction with an LLM and classification on top with a traditional ML model. Why? Because this way you can tune your own decision boundary and piggyback on features from a generic LLM to power the classifier.

For example, CV triage: you use an LLM with a rubric to extract features; choosing the features you are going to rely on does a lot of the work here. Then collect a few hundred examples, label them (accept/reject), and train your traditional ML model on top. It will not have the LLM's biases.

You can probably use any LLM for feature preparation and retrain the small model in seconds as new data is added. A coding agent can write its own small-model-as-a-tool on the fly and use it in the same session.
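A minimal sketch of the second stage, assuming the LLM step has already turned each CV into a fixed feature vector (here a made-up [years_experience, has_python, education_level] layout with illustrative toy values); the point is how cheap the classifier on top is to train and retrain.

```python
from sklearn.linear_model import LogisticRegression

# Each row: [years_experience, has_python, education_level 0-3].
# Toy values standing in for LLM-extracted features.
X = [[1, 0, 1], [7, 1, 3], [3, 1, 2], [0, 0, 0],
     [5, 1, 2], [2, 0, 1], [8, 1, 3], [1, 1, 1]]
y = [0, 1, 1, 0, 1, 0, 1, 0]  # accept/reject labels from your own triage

clf = LogisticRegression().fit(X, y)

# Retraining takes milliseconds, so the decision boundary can be re-tuned
# whenever new labeled examples arrive; to shift the accept threshold,
# use predict_proba instead of predict.
print(clf.predict([[6, 1, 3]]))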


What do you mean by "feature extraction with an LLM"? I can get this for text-based data, but would you do that on numeric data? It seems like there are better tools you could use for AutoML in that sphere.

Unless by LLM feature extraction you mean something like "have Claude Code write some preprocessing pipeline"?


It's for unstructured inputs, text and images, where you need to extract specific features such as education level or experience with various technologies and tasks. The trick is to choose the features that actually matter for your company, and to build a classifier on top so the decision is also calibrated to your own triage policy via a small training/test set. It works with few examples because it only needs a small classifier with few parameters to learn.
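The rubric step might look like the sketch below. The rubric fields and the sample response are hypothetical; in practice the JSON would come from an LLM call on the raw CV text, and you would pick fields that match your own triage policy.

```python
import json

# Hypothetical rubric prompt; {cv_text} is filled with the raw CV.
RUBRIC = """Extract from the CV below, as strict JSON:
- education_level: 0 (none) to 3 (PhD)
- years_python: integer
- has_ml_experience: true/false
CV:
{cv_text}"""

def to_feature_vector(llm_json: str) -> list:
    """Turn the model's JSON answer into one row for the small classifier."""
    d = json.loads(llm_json)
    return [d["education_level"], d["years_python"],
            int(d["has_ml_experience"])]

# Stand-in for an actual model response:
sample = '{"education_level": 2, "years_python": 4, "has_ml_experience": true}'
print(to_feature_vector(sample))  # [2, 4, 1]
```

Keeping the extraction output as strict JSON makes the two stages easy to decouple: you can swap the LLM without touching the classifier.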

Isn't the whole point for it to learn what features to extract?

Yes, it should remain part of the commit, and the work plan too, including judgements/reviews done with other agents. The chat log encodes user intent in raw form, which justifies the tasks, which in turn justify the code and its tests. Bottom up, we say the tests satisfy the code, which satisfies the plan, and finally the user intent. You can play the "satisfies/justifies" game across the whole stack.

I only log my own user messages, not AI responses, in a chat_log.md file, which is written by a user-message hook in the repo.
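Such a hook could be sketched like this, assuming the agent invokes the script with a JSON payload on stdin that carries the message under a "prompt" key (the exact payload shape depends on your tool; adjust accordingly).

```python
import datetime
import json
import sys

def append_prompt(payload: dict, path: str = "chat_log.md") -> str:
    """Append only the user's message, with a timestamp, to the log."""
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    entry = f"\n## {stamp}\n\n{payload.get('prompt', '')}\n"
    with open(path, "a", encoding="utf-8") as f:
        f.write(entry)
    return entry

# In the actual hook script you would call:
#   append_prompt(json.load(sys.stdin))
# Demo with a stand-in payload and a throwaway path:
import os, tempfile
demo_path = os.path.join(tempfile.gettempdir(), "chat_log_demo.md")
entry = append_prompt({"prompt": "how do I reset the router?"}, demo_path)
print(entry)
```

Appending rather than rewriting keeps the log an immutable, chronological record of intent.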

