
It depends on the domain, but chain of thought can make 3.5 extremely reliable, especially with the new 16k variant.

I built notionsmith.ai on 3.5. For a while I experimented with GPT-4, but the product was significantly worse to use because of how slow it became: ~15 seconds per generated output stretched to a minute plus.

You can work around that with streaming output for some use cases, but that doesn't work for chain of thought. GPT-4 can do some tasks without chain of thought that 3.5 required it for, but there are still many cases where chain of thought dramatically improves GPT-4's results too.

For example, I use chain of thought in replies to the user when they're in a chat, and it results in a much better user experience: it's very difficult to hit the default 'As a large language model' disclaimer no matter how deeply you probe a generated experience. GPT-4 requires the same chain-of-thought process to avoid that, but ends up needing several seconds per response, whereas 3.5 is near-instant.
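The pattern described above (reason privately, reply publicly) can be sketched roughly like this. The prompt wording, delimiter, and helper names are my own illustration, not notionsmith's actual code:

```python
# Minimal sketch of chat-style chain of thought: the model reasons in a
# hidden section, and only the text after "Reply:" is shown to the user.
# All names and the prompt format are hypothetical.

COT_SYSTEM_PROMPT = (
    "Think through the user's message step by step in a section "
    "starting with 'Reasoning:'. Then give your final reply on a new "
    "line starting with 'Reply:'. Only the reply is shown to the user."
)

def extract_reply(raw_completion: str) -> str:
    """Strip the reasoning section, keeping only the user-facing reply."""
    marker = "Reply:"
    if marker in raw_completion:
        return raw_completion.split(marker, 1)[1].strip()
    # Fall back to the whole completion if the model ignored the format.
    return raw_completion.strip()
```

The cost of this pattern is that you can't stream the reply token-by-token, since it only begins after the reasoning finishes, which is why model latency matters so much here.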

-

I suspect a lot of people building on 4 would get better output if they leaned more on chain of thought and either accepted slower responses or moved to 3.5 (or a mix of 3.5 and 4).
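A mix of 3.5 and 4 could look like a simple router that picks a model per request. The threshold and model names below are assumptions for illustration, not anything from the comment:

```python
# Hypothetical router: prefer GPT-3.5 + chain of thought when latency
# matters, and reserve GPT-4 for requests that can tolerate the wait.

def pick_model(needs_cot: bool, latency_budget_s: float) -> str:
    if needs_cot and latency_budget_s < 30:
        # CoT can't stream its final answer early, so a slow model
        # hurts most on these requests.
        return "gpt-3.5-turbo-16k"
    return "gpt-4"
```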



It depends a lot on the domain, even for CoT. I don't think there are enough NLU evaluations yet to robustly compare GPT-3.5 with CoT/SC against GPT-4 across domains.

For instance, on the MATH dataset, my own n=500 evaluation showed no difference between GPT-3.5 (with and without CoT) and GPT-4. I was pretty surprised by that.
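For context, SC (self-consistency) samples several independent CoT completions and majority-votes their final answers. A toy sketch of the voting step, with hypothetical names:

```python
from collections import Counter

def self_consistency(final_answers: list[str]) -> str:
    """Majority vote over the final answers from several sampled CoT runs."""
    return Counter(final_answers).most_common(1)[0][0]
```

In practice each answer would come from a separate temperature-sampled completion, with the final answer extracted from the end of each chain.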



