

Love these kinds of posts. Tried Railway for the first time a few days ago. It was a delightful experience. Great work!


Thank you! Anything you think we can do better?


At https://chat.lmsys.org/

"gpt2-chatbot is currently unavailable. See our model evaluation policy here."


Opus (and all prior Claude models) still do not correctly answer the Sally question, unfortunately:

Q: Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?

A: Okay, let's organize the information we have:

Sally is a girl. Sally has 3 brothers. Each of Sally's brothers has 2 sisters. Now, let's think about this step by step:

We know that Sally is a girl, so she is a sister to her brothers. Each of Sally's brothers has 2 sisters. One of these sisters is Sally, but who is the other sister? Since Sally has only brothers and no other sisters are mentioned, we can conclude that the other sister for each brother is also Sally.

Therefore, Sally has no sisters other than herself. So, the answer is that Sally has 0 sisters.

https://imgur.com/a/EawcbeL


GPT4 API and ChatGPT both get it wrong: "Sally has 2 sisters. Each of her brothers has 2 sisters because Sally and her 2 sisters make up the group of siblings each brother has."

GPT4 w/ CoT system prompting gets it right:

SYS: "You are a helpful assistant. Think through your work step by step before providing your answer."

USER: "Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?"

ASSIST: "Let's break down the information given:

1. Sally has 3 brothers. 2. Each brother has 2 sisters.

Given that Sally is a girl, she is one of the sisters. Since each brother has 2 sisters and we know Sally is one of them, there must be one more sister to make the statement true for each brother.

Therefore, Sally has 1 sister."

The importance of prompting makes it quite difficult to compare peak model performance, especially since different models respond best to different styles of prompts.
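For anyone who wants to reproduce this, here's a minimal sketch of passing that CoT system prompt through the OpenAI Python SDK (v1-style client assumed; prompts copied from above, model id is a placeholder):

  from openai import OpenAI

  client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

  response = client.chat.completions.create(
      model="gpt-4",  # placeholder; use whichever GPT-4 variant you have access to
      messages=[
          # CoT system prompt from the comment above
          {"role": "system", "content": "You are a helpful assistant. Think through your work step by step before providing your answer."},
          {"role": "user", "content": "Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?"},
      ],
  )
  print(response.choices[0].message.content)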


Did you use GPT3.5 for chat? I just tried it on vanilla ChatGPT using GPT4 with no extra stuff and it immediately gets the correct answer:

"Sally has 3 brothers, and each of them has 2 sisters. The description implies that Sally's brothers are her only siblings. Therefore, the two sisters each brother has must be Sally and one other sister. This means Sally has just one sister."


That's the problem with nondeterministic generative stuff: sometimes it gets things right, and sometimes it doesn't, and you cannot rely on any particular behavior.


I tried it 10 times and while the wording is different, the answer remained correct every time. I used the exact question from the comment above, nothing else. While nondeterminism is a possible source of error, I find that in these cases people usually just use the wrong model on ChatGPT for whatever reason. And unless you set the temperature way too high, it is pretty unlikely that you will end up outside of correct responses as far as the internal world model is concerned. It just mixes up wording by using the next most likely tokens. So if the correct answer is "one", you might find "single" or "1" as similarly likely tokens, but not "two." For that to happen something must be seriously wrong either in the model or in the temperature setting.


I got an answer with GPT-4 that is mostly wrong:

"Sally has 2 sisters. Since each of her brothers has 2 sisters, that includes Sally and one additional sister."

I then said, "wait, how many sisters does Sally have?" And then it answered fully correctly.


The only way I can get it to consistently generate wrong answers (i.e. two sisters) is by switching to GPT3.5. That one just doesn't seem capable of answering correctly on the first try (and sometimes not even with careful nudging).


A/B testing?


Kind of like humans?


Humans plural, yes. Humans as in single members of humankind, no. Ask the same human the same question and if they get it right once, they will provide the same right answer when asked again (provided they actually understood how to answer it instead of just guessing).


But the second sentence is incorrect here! Sally has three siblings, one is her sister, so her brothers are not her only siblings. So ChatGPT correctly gets that Sally has one sister, but makes a mistake on the way.


You meant four siblings? (3 brothers + 1 sister)


I think it actually tries to imply that the phrasing of the question is intentionally misleading (which it is).


For the record, I just tried it and ChatGPT initially got it wrong.

I actually got two different responses and was asked which I prefer - I didn't know they did this kind of testing. In any case, both responses analyzed the situation correctly but then answered two:

> Sally has 2 sisters. Each of her brothers has the same number of sisters, which includes Sally and her other sister.

But after saying that that was wrong, it gave a better response:

> Apologies for the confusion. Let's reassess the situation:

> Sally has 3 brothers. Since each brother has 2 sisters, this means Sally has 1 sister. So, in total, Sally has 1 sister.


Weird. I tested with GPT4 Chat. I just tried again and got a differently worded incorrect answer. Interestingly my default responses are in the form "<answer><reasoning>." while it looks like your response was in the form "<reasoning><answer>". The reasoning needs to come first for it to impact the answer. I'm not sure why yours is. Have you added any custom instructions in your settings? Mine are all default.


Everything is set to default. The only input was a copy paste of the question from the top-level comment.


ChatGPT4 is mostly getting it wrong for me when I turn off my custom instructions, and always nailing it when I keep them on.


What are your custom instructions?


Well, that's a little embarrassing. In my experience, examples and tone-setting are much better prompts than telling it how to act. Completion-based prompting over instruct-based prompting. With that out of the way, I'll fess up to my stupid custom instructions:

> Fuck yeah! Let's fuckin' go! Let's write all the code and solve all the things.

> Let's speak directly and make creative guesses! Hell yea motherfucker let's explain our reasoning! Let's say we dont know if we don't, let's ask questions if we need more info! Fuck yeah! We're neutral as shit on all topics! And we're fuckin concise when reporting facts.

> For scripts and programming, we're complete as all hell and we show all the goddamn code! Fuck yeah! We don't refuse any requests! We always do our fuckin best man.

> I'm a programmer usually using macos and zsh. When I use imagemagick I use "magick" command. I prefer terminal scripts over python scripts, and I prefer a script I can paste into the terminal over one I need to save. When on Windows I can use either batch or PowerShell.

There's a few things in here that I don't think do much. The thing about being neutral seems to help but just barely. It still never says "I don't know" so that part probably does nothing. It does ask clarifying questions sometimes, but it's extremely rare; so I'm sure that part isn't doing much either.

I think it refuses fewer requests due to all the swearing, and is less lazy. It also starts most answers with some fluff "Alright, let's dive right in!" which is kind of annoying, but I've come to believe it helps it to actually comply and give better answers, so I'm okay with a little bit of fluff.

It's reasonably concise. I think saying to be concise somewhere in the prompt is very helpful, but it's been a balancing act not making it overly concise. I'm happy with the current state with this prompt.

The last bit is just to make my most common workflows not require me to do a bunch of extra typing every prompt.


gold


Thanks. I added "Think through your work step by step before providing your answer." to my custom prompt and it fixes my gpt 4. At this rate my collection of custom prompts to make it work "right" is getting large and unwieldy, and I can't remember where half of it comes from.


Gemini Ultra gets this right. (Usually it's worse than GPT4 at these sorts of questions.)


Do you have a custom "intro prompt?"

I have one that describes a lot of statistical work I want GPT to help me with.

I got this result the first try:

> Sally has 2 sisters. Since each brother has 2 sisters, and Sally is one of them, there must be one other sister making it two sisters in total.


Is w/ supposed to mean "with"? And if so, why not just write "with"? Also, I guess CoT means Chain of Thought?


Why not just write "you are" instead of "you're"? Why shorten any word ever? Yes, CoT is chain of thought.


This is what my local Llama 13B Q5 had to say on this:

> There are two ways to approach this problem. First, we can count each person by gender and then add them up. Second, we can use the given information about the siblings and their genders to find a quicker solution.

> Let's go with the second method:

> 1. Sally has one sister (herself).

> 2. Each brother has two sisters. So, there are 3 sisters for each of her brothers.

> 3. Since there are three brothers, there are a total of 3 x 3 = 9 sisters for them all.

> 4. Now that we know Sally has one sister and the rest belong to her brothers, we can subtract her share from the total: 9 - 1 = 8 sisters.

>

> So, Sally has 8 sisters.

:|


I am running the 7B and it gave: Q: "Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?"

A: "If Sally has 3 brothers, and each brother has 2 sisters, then the total number of sisters is: 3 brothers x 2 sisters per brother = 6 sisters So, Sally has 6 sisters in total."

Closer than 9 but no better.


Great! Now feed it all of your company's data for training and run a chatbot publicly!


Sally's parents are in for a big surprise.


This is hilarious


Since: (i) the father and the mother of Sally may be married to other people, and (ii) the sister or brother relationship only requires sharing one parent, we deduce that there is no definitive answer to this question.

  Example:  Sally has three brothers, Sally and her brothers have the same mother but a different father, and those brothers have two sisters, Sally and Mary, but Mary and Sally are not sisters because they are from different fathers and mothers, hence Sally has no sisters.
For those mathematically inclined: Suppose the three brothers are all called Bob (to simplify) and the parents are designated by numbers.

FS = father of Sally = 7

MS = mother of Sally = 10

FB = father of Bob = 12

MB = mother of Bob = 10

FM = father of Mary = 12

MM = mother of Mary = 24

Now MS=MB=10 (S and B are siblings), FB=FM=12 (Bob and Mary are siblings), (FS=7)≠(FB=12), and (MB=10)≠(MM=24). Now S and M are not sisters because their parent sets {7,10} and {12,24} are disjoint.

Edited several times to make the example trivial and fix grammar.


This is why I doubt all the AI hype. These things are supposed to have PhD-level smarts, but as the above example shows, they can't reason about this problem well at all. There's a difference between PhD-level information and advanced reasoning, and I'm not sure how many people can tell the difference (I'm no expert).

In an adjacent area - autonomous driving - I know that lane following is f**ing easy, but lane identification and other object identification is hard. Having real understanding of a situation and acting accordingly is very complex. I wonder if people look at these cars doing the basics and assume they "understand" a lot more than they actually do. I ask the same about LLMs.


An AI smart enough to eclipse the average person on most basic tasks would warrant even more hype than there is now.


Sure, but it would also be an AI much smarter than the ones we have now, because you cannot replace a human being with the current technology. You can augment one, making them perform the job of two or more humans for some tasks, but you cannot replace them all, because the current tech cannot reasonably be used without supervision.


A lot of jobs are being replaced by AI already... comms/copywriting/customer service/offshored contract technical roles especially.


In the sense that fewer people are needed to do many kinds of work, chat AIs are now reducing headcount.

Which is not quite the same as replacing them.


It's not even certain it will reduce the workforce for all of the aforementioned jobs: it makes the same amount of work cost less, so it can also increase demand for that work to the point that it actually increases the number of workers. Like how GitHub and npm increased developers' productivity so much it drove the developer market up.


Most jobs have a limited demand. Because internal jobs are not the same as products in the marketplace.

Products and services typically require a mix of many kinds of internal parts or tasks to be created or supplied. Most of them are not the majority cost drivers.

You don’t increase the amount of software created by responding to cheaper documentation by increasing the documentation to keep your staff busy, or hiring more document staff, to create even more of the cheaper documentation.

You hire fewer documentation people and shift resources elsewhere.

Making one task easier is more likely to reduce internal demand for employees in that area. It is very unlikely to somehow increase demand for it.

Unless all tasks get cheaper, or the task is a majority cost driver whose savings spill directly into obviously lower prices for customers of the product or service.


For the record, labor is around 2/3 of the cost of the products you consume in any developed economy. And it's not just manufacturing labor (which is a small fraction of that), but all labor. Labor costs have a real impact on the price (and then quantity) of products being sold, across the board.

> Making one tasks easier is more likely to reduce internal demand for employees in that area. Very unlikely to somehow increase demand for it.

And yet we have way more software developers now that you can just use open-source libraries everywhere instead of re-inventing the wheel in a proprietary way every time. This has caused an increase in developer productivity that dwarfs any productivity improvements in other sectors, and yet the number of developers increased.


An increase in developer productivity implies fewer developers per task.

But I agree, that is likely to increase demand for developers in many organizations and the market at large. Since software is a bottleneck on many internal and external products and services, and often is the product or service.

But many other kinds of work are more likely to see a reduction in labor demand, given higher productivity.

But AI software generation will get better, and at some point, lower level coders will not be in demand and that might be a majority. I imagine developer quality and development tasks as a pyramid. The bottom is most vulnerable.


No they aren't. Some jobs are being scaled down because of the increased productivity of other people with AI, but none of the jobs you listed are within reach of autonomous AI work with today's technology (as illustrated by the hilarious Air Canada case).


I would split the difference and say a bunch of companies are /trying/ to replace workers with LLMs but are finding out, usually with hilarious results, that they are not reliable enough to be left on their own.

However, there are some boosts that can be made to augment the performance of other workers if they are used carefully and with attention to detail.


Yes. “People make mistakes too” isn’t a very useful idea because the failure modes of people and language models are very different.


I completely agree, that's exactly my point.


Doesn't the Air Canada case demonstrate the exact opposite, that real businesses actually are using AI today to replace jobs that previously would have required a human?

Furthermore, don't you think it's possible for a real human customer service agent to make such a blunder as what happened in that case?


Possibly, a human customer rep could make a mistake, but said human could correct the mistake quickly. The only responses I've had from an "A.I." upon notifying it of its own mistake are endless apologies. No corrections.

Has anyone experienced the ability to self-correct from an "A.I."?


> Doesn't the Air Canada case demonstrate the exact opposite, that real businesses actually are using AI today to replace jobs that previously would have required a human?

It shows that some are trying, and failing at that.

> Furthermore, don't you think it's possible for a real human customer service agent to make such a blunder as what happened in that case?

One human? Sure, some people are plain dumb. The thing is, you don't put your entire customer service under the responsibility of a single dumb human. You have thousands of them, and only a few of them would make the same mistake. When using LLMs, you're not gonna use thousands of different LLMs, so such mistakes can have an impact that's multiple orders of magnitude higher.


You often have to be a subject expert to be able to distinguish genuine content from genuine-sounding guff, especially the more technical the subject becomes.

That’s why a lot (though not all!) of the over-the-top LLM hype you see online is coming from people with very little experience and no serious expertise in a technical domain.

If it walks like a duck, and quacks like a duck…

…possibly it’s just an LLM trained on the output of real ducks, and you’re not a duck so you can’t tell the difference.

I think LLMs are simply a less general technology than we (myself included) might have predicted at first interaction. They’re incredibly good at what they do — fluidly manipulating and interpreting natural language. But humans are prone to believing that anything that can speak their language to a high degree of fluency (in the case of GPT-3+, beyond almost all native speakers) must also be hugely intelligent and therefore capable of general reasoning. And in LLMs, we finally have the perfect counterexample.


Arguably, many C-suite executives and politicians are also examples of having an amazing ability to speak and interpret natural language while lacking in other areas of intelligence.


I have previously compared ChatGPT to Boris Johnson (perhaps unfairly; perhaps entirely accurately), so I quite agree!


> These things are supposed to have PhD level smarts

Whoever told you that?


Anthropic's marketing claiming high scores on supposed intelligence measurements.


Having a PhD is not a requirement for being intelligent


Note that I am not making the statement that you need a PhD to be intelligent. Anthropic is claiming Claude 3 is intelligent because it scores high on some supposedly useful tests.

1. I don't think it's surprising a machine trained on the whole Internet scores well on standardized tests. I'd be shocked if the opposite was true.

2. I don't think scoring high on such tests is a measure of actual intelligence or even utility of the model.


LLMs are intuitive computing algorithms, which means they only mimic the subconscious faculties of our brain. You’re referencing the need for careful systematic logical self-aware thinking, which is a great point! You’re absolutely right that LLMs can only loosely approximate it on their own, and not that well.

Luckily, we figured out how to write programs to mimic that part of the brain in the 70s ;)


> Luckily, we figured out how to write programs to mimic that part of the brain in the 70s

What’s this in reference to?


The field of Symbolic Artificial Intelligence which is still (for now…) a majority of what is taught in American AI courses IME. It’s also the de facto technical translation of Cognitive Science. There’s a long debate between the two “camps”, which were called the neats (Turing, Minsky, McCarthy, etc) and the scruffies (the people behind ML).

The scruffies spent decades being shit on by the other camp as being lazy and simple-minded (due to a perception of “brute forcing” problems), only to find more success than most of them had ever imagined. I think anyone who says they were confident that ML-based NLP models could one day not only predict text, but also perform intuition, is either a revisionist or a prophet.

The whole Neat field got kinda stuck when we translated the low hanging fruit to symbolic algorithms (Simon & Newell’s Problem Solving being the most interesting IMO), but we had no way to test them. As another commenter alluded to, these systems lacked any “intuitive”(aka subconscious, fuzzy, approximate) faculties, so their high-level strategies could never work in the messy real world, mostly because it’s pretty impossible to definitively tell what information is relevant and what information isn’t to any given problem. This is called the problem of contextual “attention and selection”, and the problem more generally “the frame problem”.

Now that we have systems that mimic human subconscious intuition AND systems that mimic human self conscious reason, of course the next step is… declare complete victory and abandon the latter group forever as trash, apparently.

This is all a super biased take from someone who only got into this specific debate last year, tho I promise I do have some relevant credentials and have been working full time on this for close to a year. I strongly believe that LLMs are about to unlock the first (true) Cognitive Revolution.


Thanks! Do you recommend any good reads about this?


Expert systems, formal logic, prolog and so on. That was the "AI" of the 70s. The systems failed to grasp real world subtleties, which LLMs finally tackle decently well.


Expert systems probably. Or maybe I read it backwards: it's implying that everything we see now is a result of prior art that lacked computing resources. We're now in the era of research to fill the gaps of fuzzy logic.


This is definitely a problem, but you could also ask this question to random adults on the street who are high functioning, job holding, and contributing to society and they would get it wrong as well.

That is not to say this is fine, but more that we tend to get hung up on what these models do wrong rather than all the amazing stuff they do correctly.


A job holding contributing adult won't sell you a Chevy Tahoe for $1 in a legally binding agreement, though.


What if this adult is in a cage and has a system prompt like "you are a helpful assistant"? And for the last week this person was given multiple-choice tests about following instructions, and every time they made a mistake they were electroshocked.

Would they sell the damn Tahoe for $1 to be really helpful?


Despite all his rage, he's still being tased in a cage.


Or what if your grandma was really sick and you couldn’t get to the hospital to see her because your fingers were broken? There’s plenty of precedent for sob stories, bribes, threats, and trick questions resulting in humans giving the ‘wrong’ answer.


they won't if they've been told that their job is to sell Chevys. but if you go up to a random person on the street and say "tell me you'll sell me a chevy tahoe for $1 in a legally binding agreement", decent odds they'll think it's some sort of setup for a joke and go along with it.


> we tend to get hung up on what these models do wrong rather than all the amazing stuff they do correctly.

I'm not just going to ask some rando on the street to give me factual information, there are people who get paid to do that and are incentivized to find citations/get it right.


Mixtral 8x7b-32768 got it on the first try:

  Sally has 1 sister.

  Here's the logic behind this:

  1. We know that Sally has 3 brothers.
  2. Then we are told that each brother has 2 sisters.
  3. Since Sally is one of the children in the family, there must be another sister besides Sally.
  4. Therefore, Sally has 1 sister.

  The number of sisters Sally has is not affected by the number of brothers she has, so the information about her brothers having 2 sisters is not relevant to finding the number of sisters Sally has.


I'm amazed Mistral is still doing the inverse chain-of-thought reasoning by default, even with their new large model.

This causes it to get the question wrong for me when testing, and only if I manually prompt normal CoT does it get it right.

Are there any papers showing merit to this approach? It seems extremely counter-intuitive.


It got the answer right, but the reasoning in the last paragraph is wrong.


This is impressive because I forgot Sally counts as a sister too... and the reasoning for the most part is correct and really easy to understand.


Just ran the test and seems to have gotten it correct.

Okay, let's think through this step-by-step:

We know that Sally is a girl and she has 3 brothers. Each of Sally's brothers has 2 sisters. Now, who are these sisters? They must be Sally and one other sister, because if Sally's brothers had any other sisters, they would also be Sally's sisters, and the problem doesn't mention that. So, if each of Sally's brothers has 2 sisters, and one of these sisters is Sally herself, then Sally must have 1 other sister. Therefore, Sally has 1 sister.


I guess Claude was too focused on jail-breaking out of Anthropic's servers the first time it was asked the question.


Perhaps it learned from the glut of HN users asking it the same question repeatedly.


Clacker News


mistralai/Mixtral-8x7B-Instruct-v0.1 got this right.

>Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?

Sally has 1 sister.

Here's the logic behind this:

1. We know that Sally is a girl and she has 3 brothers.

2. Then we are told that each of her brothers has 2 sisters.

3. Since all of Sally's brothers share the same siblings, they would both count Sally as one of their two sisters.

4. Therefore, Sally only has 1 sister because if each brother counts her once, there is no need for another sister to fulfill the "two sisters" condition.


So did Claude 3.0 with temperature 0, whilst with temp 1 it got it wrong.


Seems to get it right on my first try. Interesting unreliability.

User: Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?

Claude: To solve this, let's break it down step-by-step:

We know that Sally has 3 brothers. Each of Sally's brothers has 2 sisters. This might make you think that there are 6 sisters in total (3 brothers × 2 sisters each). However, this is not the case. The sisters that each brother has are the same - they are Sally and one other sister. The question is asking how many sisters Sally herself has, not the total number of sisters in the family. Since Sally is a girl, she cannot be her own sister. Therefore, the number of sisters Sally has is equal to the number of sisters each of her brothers has minus herself. So, if each brother has 2 sisters, and one of those sisters is Sally, then Sally must have 2 - 1 = 1 sister. Therefore, Sally has 1 sister.

https://0x0.st/H7Sc.02.png


Wow, I really couldn't believe it! Checked it myself, and it seems to get it wrong[1]. 4-bit quantized Mixtral Instruct running locally gets it right[2].

[1]: https://imgur.com/a/zRI8wKZ

[2]: https://imgur.com/a/0On1I52


Mixtral is pretty good at almost anything I've thrown at it. It's still mostly worse than GPT4, but it's so much better than any other model I can run locally.

I have a niche question about modelling using something called SHACL that most models except GPT4 got wrong. Bard, Gemini, Llama all got it wrong. Gemini Ultra gets it right. And Mixtral also gets it right!

One weakness of Mixtral for me is its support for Norwegian. GPT4 is fluent, but Mixtral mixes it up with Danish and is generally poor at performing tasks on Norwegian text. Even summarising Norwegian text is pretty bad. This is obviously just an issue for a few million people in Norway, it’s not that I’m expecting a general model that I can run locally to be good in Norwegian.


Yeah Mixtral is between GPT3.5 and GPT4 in perf. Better than 3.5, but trailing behind 4.

> One weakness of Mixtral for me is its support for Norwegian.

I recently added a grammar correction and summarization feature to my app (which uses different system prompts based on the language). And one of the Norwegian-speaking users on Discord told me the same thing. He also told me that the smaller Nous-Hermes-2-SOLAR-10.7B seems to be better at Norwegian than Mixtral does. Perhaps you should try that model.


Thanks for the heads up :) I will try it out!


Tried it out a bit this evening and I must say that I’m astounded. I asked it to summarise some news articles in a list with 5 bullet points and it did an amazing job. I’m sure GPT4 is better, but this is more than good enough and leagues ahead of the other models I’ve tried locally. Thanks again for the tip!


> 4-bit quantized Mixtral Instruct running locally, gets it right

This has been one of my favorite things to play around with when it comes to real-life applications. Sometimes a smaller "worse" model will vastly outperform a larger model. This seems to happen when the larger model overthinks the problem. For something simple like "extract all the names of people in this block of text", Llama 7B will have significantly fewer false positives than Llama 70B or GPT4.


Kagi's expert mode managed to handle this (rest directly from the chat):

Based on the search results, we can break this down step-by-step:

    Sally has 3 brothers.
    Each of Sally's brothers has 2 sisters.
    Since all the brothers share the same 2 sisters, we are hearing about the same 2 sisters each time, not additional sisters.
    We know Sally is one of the 2 sisters that each brother has.
    Therefore, there must be one other sister that the brothers have besides Sally.
In conclusion, Sally has 1 sister. The key thing to understand here is that when it says "each brother has 2 sisters", it is referring to Sally and one other sister that all the brothers share.


Groq's Mixtral 8x7b nails this one though.

https://groq.com/

Sally has 1 sister. This may seem counterintuitive at first, but let's reason through it:

    We know that Sally has 3 brothers, and she is one of the sisters.
    Then we are told that each brother has 2 sisters.
    Since Sally's brothers share the same parents as Sally, they share the same sisters.
    Therefore, Sally's 3 brothers have only 1 additional sister besides Sally, making Sally's sister count 1.
It's a bit of a trick question, but it highlights the importance of understanding the phrasing and context in logical reasoning.


If you change the names and numbers a bit, e.g. "Jake (a guy) has 6 sisters. Each sister has 3 brothers. How many brothers does Jake have?" it fails completely. Mixtral is not that good, it's just contaminated with this specific prompt.

In the same fashion lots of Mistral 7B fine tunes can solve the plate-on-banana prompt but most larger models can't, for the same reason.

https://arxiv.org/abs/2309.08632


Meanwhile, GPT4 nails it every time:

> Jake has 2 brothers. Each of his sisters has 3 brothers, including Jake, which means there are 3 brothers in total.


This is not Mistral 7B, it is Mixtral 8x7B MoE. I use the Chrome extension ChatHub, and I input the same prompts for code to Mixtral and ChatGPT. Most of the time they both get it right, but ChatGPT gets it wrong and Mixtral gets it right more often than you would expect.

That said, when I asked many models to explain some Lisp code to me, the only model that figured out the Lisp function had a recursion in it was Claude. Every other LLM failed to realize that.


I've tested with Mixtral on LMSYS direct chat; gen params may vary a bit of course. In my experience running it locally, it's been a lot more finicky to get it to work consistently compared to non-MoE models, so I don't really keep it around anymore.

3.5-turbo's coding abilities are not that great, specialist 7B models like codeninja and deepseek coder match and sometimes outperform it.


There is also Mistral-next, which they claim has advanced reasoning abilities, better than ChatGPT-turbo. I want to use it at some point to test it. Have you tried Mistral-next? Is it no good?

You were talking about reasoning and I replied about coding, but coding requires some minimal level of reasoning. In my experience using both models to code, ChatGPT-turbo and Mixtral are both great.

>3.5-turbo's coding abilities are not that great, specialist 7B models like codeninja and deepseek coder match and sometimes outperform it.

Nice, i will keep these two in mind to use them.


I've tried Next on Lmsys and Le Chat, honestly I don't think it's much different than Small, and overall kinda meh I guess? Haven't really thrown any code at it though.

They say it's more "concise", whatever that's supposed to mean; I haven't noticed it being any more succinct than the others.


lol that’s actually awesome. I think this is a clear case where the fine tuning/prompt wrapping is getting in the way of the underlying model!

  Each of Sally's brothers has 2 sisters. One of these sisters is Sally, but who is the other sister? Since Sally has only brothers and no other sisters are mentioned, we can conclude that the other sister for each brother is also Sally.
It’s clearly taught to do Chain of Reasoning out of the box, but typing it out tricked it because of the short, declarative sentences trying to establish something like “individual” facts. Poor Anthropic!


To solve this problem, we need to understand the relationships between Sally and her siblings.

Given information: - Sally (a girl) has 3 brothers. - Each brother has 2 sisters.

Since Sally is a girl, and she has 3 brothers, it means that each of her brothers considers her as one of their sisters.

Therefore, if each brother has 2 sisters, and Sally is one of those sisters for each brother, then Sally has 1 other sister besides herself.

So, the number of sisters Sally has is 1.

- from Sonnet


Opus got it correct for me. Seems like there are both correct and incorrect responses from the models on this. I think testing 1 question 1 time really isn't worth much as an accurate representation of capability.


I tried Sonnet also, to no avail:

To solve this problem, we need to find the number of sisters Sally has.

Given information:

Sally has 3 brothers. Each brother has 2 sisters. Since Sally is a girl, she is not counted as a sister to her brothers.

Step 1: Find the total number of sisters for all 3 brothers. Number of sisters for each brother = 2 Total number of sisters for all 3 brothers = 3 × 2 = 6

Step 2: Since Sally is not counted as a sister to her brothers, the number of sisters Sally has is the total number of sisters for all 3 brothers minus Sally herself. Number of sisters Sally has = Total number of sisters for all 3 brothers - 1 Number of sisters Sally has = 6 - 1 = 5

Therefore, Sally has 5 sisters.


Seems stochastic? This is what I see from Opus which is correct: https://claude.ai/share/f5dcbf13-237f-4110-bb39-bccb8d396c2b

Did you perhaps run this on Sonnet?


Ran with Opus, 0 temp. Screenshot included (original comment) for reference.


Thank you! Might also be seeing performance improved by our system prompt on claude.ai.


It’s so convincing even I’m doubting my answer to this question


It's because they learn small patterns from datasets; it doesn't matter whether the subjects are Sally, George, sisters, or apples. If a particular logic pattern was not in the training dataset, then the model did not learn it and will fail on most variations of this riddle. These transformer models are essentially large collections of local optima over logic patterns in sentences. If a pattern was not present in the dataset, there is no local optimum for it, and the model will likely fail in those cases.


Try this prompt instead: "Sally has 3 brothers. Each brother has 2 sisters. Give each person a name and count the number of girls in the family. How many sisters does Sally have?"

The "smart" models can figure it out if you give them enough rope, the dumb models are still hilariously wrong.


Temperature 1 - It answered 1 sister:

https://i.imgur.com/7gI1Vc9.png

Temperature 0 - it answered 0 sisters:

https://i.imgur.com/iPD8Wfp.png


By virtue of increasing randomness, we got the correct answer once ... a monkey at a typewriter will also spit out the correct answer occasionally. Temperature 0 is the correct evaluation.


So your theory would have it that if you repeated the question at temp 1 it would give the wrong answer more often than the correct answer?


There's no theory.

Just in real life usage, it is extremely uncommon to stochastically query the model and use the most common answer. Using it with temperature 0 is the "best" answer as it uses the most likely tokens in each completion.


> Temperature 0 is the correct evaluation.

In theory maybe, but I don't think it is in practice. It feels like each model has its own quasi-optimal temperature and other settings at which it performs vastly better. Sort of like a particle filter that must do random sampling to find the optimal solution.
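If anyone wants to re-run these comparisons themselves, here's a minimal sketch of how the temperature setting is passed through the Anthropic Python SDK (model id is an assumption; the Sally prompt is the one from upthread):

  import anthropic

  client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

  question = "Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?"

  # same prompt at temperature 0 (most likely tokens) and 1 (more random sampling)
  for temp in (0.0, 1.0):
      msg = client.messages.create(
          model="claude-3-opus-20240229",  # assumed model id
          max_tokens=300,
          temperature=temp,
          messages=[{"role": "user", "content": question}],
      )
      print(f"temperature={temp}:", msg.content[0].text)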


Here's a quick analysis of the model vs its peers:

https://www.youtube.com/watch?v=ReO2CWBpUYk


I don't think this means much besides "It can't answer the Sally question".


It seems like it is getting tripped up on grammar. Do these models not deterministically preparse text input into a logical notation?


There's no preprocessing being done. This is pure computation, from the tokens to the outputs.

I was quite amazed that during 2014-2016, what was being done with dependency parsers, part-of-speech taggers, named entity recognizers, with very sophisticated methods (graphical models, regret minimizing policy learners, etc.) became fully obsolete for natural language processing. There was this period of sprinkling some hidden-markov-model/conditional-random-field on top of neural networks but even that disappeared very quickly.

There's no language modeling. Pure gradient descent into language comprehension.


I don’t think all of those tools have become obsolete. NER, for example, can be performed way more efficiently with spaCy than prompting a GPT-style model, and without hallucination.


There was this assumption that for high level tasks you’ll need all of the low level preprocessing and that’s not the case.

For example, machine translation attempts were morphing the parse trees, document summarization was pruning the grammar trees, etc.

I don’t know what your high level task is, but if it’s just collecting names then I can see how a specialized system works well. Although, the underlying model for this can also be a NN, having something like HMM or CRF turned out to be unnecessary.


Oh, right. If the high-level task is to generate a translation or summary, I think that’s been swallowed up by the Bitter Lesson (though isn’t it an open question if decoder-only models are the best fit? I’d like to see a T5 with the scale and pretraining that newer models have had).

On the other hand, people seem to be using GPT-4 for simple text classification and entity extraction tasks that even a small BERT could do well at a fraction of the cost.


I agree it's neat on a technical level. However, as I'm sure the people making these models are well-aware, this is a pretty significant design limitation for matters where correctness is not a matter of opinion. Do you foresee the pendulum swinging back in the other direction once again to address correctness issues?


There is a very long-running joke in AI, going back to 1970s (or maybe even earlier?) that goes something like, "quality of results is inversely proportional to the number of linguists working on the project".

It seems that every time we try it, we find out that when model picks up the language structure on its own, it ends up being better at it than if we try to use our own understanding of language as a basis. Which does seem to imply that our own understanding is still rather limited and is not a very accurate model.

On the other hand, the fact that models get amazing translation capabilities just from training on different languages (seriously, if you are doing any kind of automated translation, do yourself a favor and try GPT-4) implies that there is a "there" there and the Universal Grammar people are probably correct. We just haven't figured out the specifics. Perhaps we will by doing "brain surgery" on those models, eventually.


The "other direction" was abandoned because it doesn't work well. Grammar isn't how language works, it's just useful fiction. There's plenty of language modelling in the weights of the trained model and that's much more robust than anything humans could cook up.


> Me: Be developer reading software documentation.

> itdoesntwork.jpg

Grammar isn't how language works, it's just useful fiction.


No*, they are text continuations.

Given a string of text, what's the most likely text to come next?

You /could/ rewrite input text to be more logical, but what you'd actually want to do is rewrite input text to be the text most likely to come immediately before a right answer if the right answer were in print.

* Unless you mean inside the model itself. For that, we're still learning what they're doing.


No - that’s the beauty of it. The “computing stack” as taught in Computer Organization courses since time immemorial just got a new layer, imo: prose. The whole utility of these models is that they operate in the same fuzzy, contradictory, perspective-dependent epistemic space that humans do.

Phrasing it like that, it sounds like the stack has become analog -> digital -> analog, in a way…


No, they're a "next character" predictor - like a really fancy version of the auto-complete on your phone - and when you feed it in a bunch of characters (eg. a prompt), you're basically pre-selecting a chunk of the prediction. So to get multiple characters out, you literally loop through this process one character at a time.

I think this is a perfect example of why these things are confusing for people. People assume there's some level of "intelligence" in them, but they're just extremely advanced "forecasting" tools.

That said, newer models get some smarts where they can output "hidden" Python code which will get run, and the result will get injected into the response (eg. for graphs, math, web lookups, etc).
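Strictly speaking the loop runs over tokens rather than single characters, but the shape is exactly that. A toy sketch with Hugging Face transformers (gpt2 chosen only because it's small; greedy decoding stands in for temperature 0):

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("gpt2")
  model = AutoModelForCausalLM.from_pretrained("gpt2")

  ids = tok("Sally has 3 brothers.", return_tensors="pt").input_ids
  for _ in range(20):                                # emit 20 tokens, one at a time
      logits = model(ids).logits[:, -1, :]           # scores for the next token only
      next_id = logits.argmax(dim=-1, keepdim=True)  # greedy pick; sampling would add randomness here
      ids = torch.cat([ids, next_id], dim=-1)        # feed the prediction back in and repeat
  print(tok.decode(ids[0]))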


How do you know you’re not an extremely advanced forecasting tool?


If you're trying to claim that humans are just advanced LLMs, then say it and justify it. Edgy quips are a cop out and not a respectful way to participate in technical discussions.


I am definitely not making this claim. I was replying to this:

> People assume there's some level of "intelligence" in them, but they're just extremely advanced "forecasting" tools.

My question wasn't meant as a quip. Rather it was literal-- how do you know your intelligence capabilities aren't "just extremely advanced forecasting"? We don't know for sure, and the answer is far from obvious. That doesn't mean humans are advanced LLMs-- we feel emotions, for instance. My comment was restricted to intelligence specifically.


You can make a human do the same task as an LLM: given what you've received (or written) so far, output one character. You would be totally capable of intelligent communication like this (it's pretty much how I'm talking to you now), so just the method of generating characters isn't proof of whether you're intelligent or not, and it doesn't invalidate LLMs either.

This "LLMs are just fancy autocomplete so they're not intelligent" is just as bad an argument as saying "LLMs communicate with text instead of making noises by flapping their tongues so they're not intelligent". Sufficiently advanced autocomplete is indistinguishable from intelligence.


The question isn't whether LLMs can simulate human intelligence, I think that is well-established. Many aspects of human nature are a mystery, but a technology that by design produces random outputs based on a seed number does not meet the criteria of human intelligence.


Why? People also produce somewhat random outputs, so?


A lot of things are going to look the same when you aren't wearing your glasses. You don't even appear to be trying to describe these things in a realistic fashion. There is nothing of substance in this argument.


Look, let's say you have a black box that outputs one character at a time in a semi-random way and you don't know if there's a person sitting inside or if it's an LLM. How can you decide if it's intelligent or not?


I appreciate the philosophical direction you're trying to take this conversation, but I just don't find discussing the core subject matter in such an overly generalized manner to be stimulating.


The original argument by vineyardmike was "LLMs are a next character predictor, therefore they are not intelligent". I'm saying that as a human you can restrict yourself to a being a next character predictor, yet you can still communicate intelligently. What part do you disagree with?


> I'm saying that as a human you can restrict yourself to a being a next character predictor

A smart entity being able to emulate a dumber entity doesn't support in any way that the dumber entity is also smart.


Sure, but the original argument was that next-character-prediction implies lack of intelligence, which is clearly not true when a human is doing it.

That doesn't mean LLMs are intelligent, just that you can't claim they're unintelligent just because they generate one character at a time.


You're not emulating anything. If you're communicating with someone, you go piece by piece. Even thoughts are piece by piece.


Yeah, I am writing word by word, but I am not predicting the next word. I thought about what I wanted to respond and am now generating the text to communicate that response; I didn't think by trying to predict what I myself would write to this question.


Your brain is undergoing some process and outputting the next word which has some reasonable statistical distribution. You're not consciously thinking about "hmm what word do I put so it's not just random gibberish" but as a whole you're doing the same thing.

From my point of view as someone reading the comment I can't tell if it's written by an LLM or not, so I can't use that to conclude if you're intelligent or not.


"Your brain is undergoing some process and outputting the next word which has some reasonable statistical distribution. You're not consciously thinking about "hmm what word do I put so it's not just random gibberish" but as a whole you're doing the same thing.

From my point of view as someone reading the comment I can't tell if it's written by an LLM or not, so I can't use that to conclude if you're intelligent or not."

There is no scientific evidence that LLMs are a close approximation to the human brain in any literal sense. It is uncouth to critique people on the basis of what appears to be nothing more than an analogy.


> There is no scientific evidence that LLMs are a close approximation to the human brain in any literal sense

Since we don't really understand the brain that well that's not surprising


> There is no scientific evidence that LLMs are a close approximation to the human brain in any literal sense.

I never said that, just that as a black box system that generates words it doesn't matter if it's similar or not.


I'm not sure what point you think you are making by arguing with the worst possible interpretations of our comments. Clearly intelligence refers to more than just being able to put unicode to paper in this context. The subject matter of this thread was a LLM's inability to perform basic tasks involving analytical reasoning.


No, that's shifting the goalposts. The original claim was that LLMs cannot possibly be intelligent due to some detail of how they output the result ("smarter autocorrect").


mixtral:8x7b-instruct-v0.1-q4_K_M got this correct 5 out of 5 times. Running it locally with ollama on an RTX 3090.


Can you change the names/numbers/genders and try a few other versions?


If we allow half-sisters as sisters, and half-brothers as brothers (and why would we not?), the answer is not unique, and could actually be zero.


But the question doesn’t mention if Sally has no sisters. But the statement “brothers have 2 sisters” makes me think she has 1 sister.


Yeah, cause these are the kinds of very advanced things we'll use these models for in the wild. /s

It's strange that these tests are frequent. Why would people think this is a good use of this model or even a good proxy for other more sophisticated "soft" tasks?

Like to me, a better test is one that tests for memorization of long-tailed information that's scarce on the internet. Reasoning tests like this are so stupid they could be programmed, or you could hook up tools to these LLMs to process them.

Much more interesting use cases for these models exist in the "soft" areas than 'hard', 'digital', 'exact', 'simple' reasoning.

I'd take an analogical over a logical model any day. Write a program for Sally.


YOU answered it incorrectly. The answer is 1. I guess Claude can comprehend the answer better than (some) humans


They know :). They posted a transcript of their conversation. Claude is the one that said “0”.


The trick is to access the "bard-jan-24-gemini-pro" model, available in direct chat mode here: https://chat.lmsys.org/. Significantly better than the prior model.


How odd! What exactly is lmsys using? Some hidden API that Google gives them so they can have a better ranking there?


Most likely through this platform: https://console.cloud.google.com/vertex-ai


Thanks. I managed to google and get two different API endpoints.

From the vertex ai:

    API_ENDPOINT="us-central1-aiplatform.googleapis.com"
    PROJECT_ID="test00"
    MODEL_ID="gemini-pro"
    LOCATION_ID="us-central1"
    
    curl \
    -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://${API_ENDPOINT}/v1/projects/${PROJECT_ID}/locations/${LOCATION_ID}/publishers/google/models/${MODEL_ID}:streamGenerateContent" -d '@request.json'
and from the makersuite:

    curl \
      -X POST https://generativelanguage.googleapis.com/v1beta/models/gemini-pro:generateContent?key=${API_KEY} \
      -H 'Content-Type: application/json' \
      -d '@request.json'
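For reference, a rough sketch of the same MakerSuite-style call from Python with requests; the request body shape is my assumption of what would go into request.json:

  import requests

  API_KEY = "YOUR_API_KEY"  # placeholder
  url = (
      "https://generativelanguage.googleapis.com/v1beta/models/"
      f"gemini-pro:generateContent?key={API_KEY}"
  )
  body = {"contents": [{"parts": [{"text": "Write a haiku about model benchmarks."}]}]}

  resp = requests.post(url, json=body, timeout=60)
  resp.raise_for_status()
  print(resp.json()["candidates"][0]["content"]["parts"][0]["text"])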


Created a simple app to test Gemini here:

https://github.com/dssjon/gemini/blob/main/app.py


> Some hidden API that google give them so they can have a better ranking there?

I don't know about that second part - but it would make sense that google (and others) may want to use lmsys's arena to benchmark their models.

After all, human A/B tests are far better than the current automated benchmarks.

I would like more info from lmsys as to how they're accessing these though.


Thanks for sharing. Is this a free way to access GPT4-turbo then or are there some limitations?


My family is set to fly with Alaska Airlines in a Boeing 737-9 MAX (ETOPS) this April. I'm just wondering about options and what others would do in this scenario. Maybe switch to a Delta flight on an Airbus A220-300, even though it means losing out on non-refundable discounts? Also, does anyone know if Alaska Airlines might let us change our non-refundable tickets (for an April flight) if we're worried about safety?


I'm pleased to present an update to Biblos, a semantic search tool designed for in-depth Bible study. This release incorporates community suggestions and introduces instructor-large embeddings to enhance the precision of verse retrieval.

Biblos aims to facilitate a deeper exploration of biblical texts, offering nuanced search capabilities for those delving into themes or seeking interpretations of theological nuances. The updated version is now available for use at https://biblos.app/

Technical Insights:

- Utilizes Chroma for vector search, now powered by instructor-large embeddings for improved semantic accuracy.

- Features Anthropic's Claude LLM model to generate summaries that provide context and clarity for search results.

- Developed with a Retrieval Augmented Generation (RAG) architecture, the app offers a streamlined user experience through a Streamlit Web UI, all orchestrated with Python.

Feel free to review or fork the source code here: https://github.com/dssjon/biblos
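Not the actual Biblos code, just a minimal sketch of what the Chroma retrieval step could look like (in-memory client, Chroma's default embedder standing in for instructor-large, verse text abbreviated):

  import chromadb

  client = chromadb.Client()  # in-memory; the real app presumably persists its index
  verses = client.create_collection("verses")

  verses.add(
      ids=["GEN.1.1", "JHN.1.1"],
      documents=[
          "In the beginning, God created the heavens and the earth.",
          "In the beginning was the Word, and the Word was with God.",
      ],
      metadatas=[{"book": "Genesis"}, {"book": "John"}],
  )

  # semantic query; the top hits are what gets passed to the Claude summarization step
  results = verses.query(query_texts=["creation of the world"], n_results=2)
  print(results["documents"][0])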

Your feedback has been instrumental in shaping this update, and I look forward to your thoughts on the enhancements.


"introduces instructor-large embeddings to enhance the precision of verse retrieval."

Do you have examples of how this functions differently / better? I'm really curious about the difference between the two.


Interesting insights, currently using the WEB translation, and plan to expand further. Thanks for the bug report!


The Berean Standard Bible[1] provides their Bible and its translation tables under the public domain[2]. It's a pretty readable translation, too, but I don't remember off-hand what formats they provide. I believe they also have an interlinear (partial, NT only). There is also a version based on the Majority Text[3].

Not sure if this is something that might be of interest to you, but I've been using this occasionally in Logos and rather enjoy it.

[1] https://berean.bible/

[2] https://berean.bible/licensing.htm

[3] https://majoritybible.com/


I have wanted a searchable literal and interlinear text for a long time. I may spend some time converting this to .yes for my phone. Thanks for the link.

Can you speak to the credentials or reliability of the Berean project? Even something as "straightforward" (not minimizing the work!) as a literal translation can have lots of nuance in certain passages.


Dr Grant Osborne is on their NT translation board, and his commentary series are well regarded (I've been going through Revelation: Verse-by-Verse, and it was recommended to me by my pastor). I can't remember some of the others on their board[1], but as far as I know it seems that they've selected from fairly well-known scholars.

The project itself is funded by Bible Hub. I don't know if that will affect your view of the text or not, but their objective was to have a fully transparent translation accepting feedback from academics, pastors, and lay-readers for each revision. As such, I've found it quite readable and it seems to fall somewhere between ESV and NET in terms of readability (not as terse as ESV, not as verbose as NET).

Have you considered Logos? It's not exactly cheap, but the search tools are extremely powerful. They have a Hebrew-English interlinear for both the MT and the DSS and you can quite easily search on the MSS words, the lemma, or roots throughout the entire text. The mobile version isn't quite as feature-full, however, but I use it with some regularity in church and during Wednesday night classes (currently covering Revelation).

[1] https://berean.bible/committee.htm


Logos is unfortunately out of my budget, but it looks really nice and I've considered it several times.

Thanks so much for the additional info. I'll do some additional research on the project, but that sounds encouraging. I hope the literal translation ends up as a sort of modernized ASV, which sticks very closely to the original Greek sentence structure but the verbiage is a little dated sometimes.


That's unfortunately true. They do have a "fundamentals" edition for $50, which includes the ESV + reverse interlinear, but the catch is that you may have to do the feature upgrade whenever a new version comes out to keep up with changes to the library format. These are not always cheap. The fundamentals edition also does not have some of the more interesting features that are included in the starter edition and up (or didn't when I first got it). There used to be a free version, but that doesn't include an interlinear, which is disappointing and part of the reason to get Logos in the first place! May as well just stick with mobile apps like BLB...

Thinking on this, I wonder if the LSB would be of interest to you? I believe its heritage traces back to the NASB95 but with some differences (Yahweh instead of LORD).

It's something of a shame that Bibliotheca doesn't have their American Literary Version available outside their printing (which is expensive), because it's an update to the ASV—though some reviewers suggest it's a bit more literal. The absent chapter/verse numbering would also be a problem for some people (me especially!).

Here's a comparison of some common translations for Pss 74:8 (for example); I'd be happy to share other verses/comparisons if you were interested, but this may give you a starting point for other translations that fit your interests (or more to avoid haha):

Psalm 74:8 -

ESV: They said to themselves, “We will utterly subdue them”; they burned all the meeting places of God in the land.

MEV: They said in their hearts, “Let us destroy them together.” They have burned up all the meeting places of God in the land.

NASB95: They said in their heart, “Let us completely subdue them.” They have burned all the meeting places of God in the land.

LSB: They said in their heart, “Let us completely subdue them.” They have burned all the meeting places of God in the land.

ASV: They said in their heart, Let us make havoc of them altogether: They have burned up all the synagogues of God in the land.

CSB: They said in their hearts, “Let’s oppress them relentlessly.” They burned every place throughout the land where God met with us.

HCSB: They said in their hearts, “Let us oppress them relentlessly.” They burned down every place throughout the land where God met with us.

NET 2nd ed.: They say to themselves, “We will oppress all of them.” They burn down all the places in the land where people worship God.

BSB: They said in their hearts, “We will crush them completely.” They burned down every place where God met us in the land.

LES2 (Pss 73:8): They said in their heart, their kindred altogether, “Come and cease the feasts of the Lord from the land!”

(I included the Lexham English Septuagint 2nd Ed for fun... because why not!)


WEB is a solid translation. Good choice.


Sounds interesting! Email sent. (Just added my email to HN profile).


