
It depends on the domain, but chain of thought can make 3.5 extremely reliable, especially with the new 16k variant.

I built notionsmith.ai on 3.5. For a while I experimented with GPT-4, but the product was significantly worse to use because of how slow it became: ~15 seconds per generated output stretched to a minute plus.

You can work around that with streaming output for some use cases, but that doesn't work for chain of thought. GPT-4 can do some tasks without chain of thought that 3.5 required it for, but there are still many cases where chain of thought dramatically improves GPT-4's results too.

For example, I use chain of thought in replies to the user when they're in a chat, and it results in a much better user experience: it's very difficult to hit the default 'As a large language model' disclaimer no matter how deeply you probe a generated experience. GPT-4 requires the same chain-of-thought process to avoid that, but ends up needing several seconds per response, whereas 3.5 is near-instant.
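The pattern described above (reason privately, reply publicly) can be sketched roughly like this. The prompt wording, delimiter, and helper names are my own illustration, not notionsmith's actual code:

```python
# Minimal sketch of chat-style chain of thought: the model reasons in a
# hidden section, and only the text after "Reply:" is shown to the user.
# All names and the prompt format are hypothetical.

COT_SYSTEM_PROMPT = (
    "Think through the user's message step by step in a section "
    "starting with 'Reasoning:'. Then give your final reply on a new "
    "line starting with 'Reply:'. Only the reply is shown to the user."
)

def extract_reply(raw_completion: str) -> str:
    """Strip the reasoning section, keeping only the user-facing reply."""
    marker = "Reply:"
    if marker in raw_completion:
        return raw_completion.split(marker, 1)[1].strip()
    # Fall back to the whole completion if the model ignored the format.
    return raw_completion.strip()
```

The cost of this pattern is that you can't stream the reply token-by-token, since it only begins after the reasoning finishes, which is why model latency matters so much here.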

-

I suspect a lot of people building on 4 would get better output if they leaned more on chain of thought and either accepted slower responses or moved to 3.5 (or a mix of 3.5 and 4).
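A mix of 3.5 and 4 could look like a simple router that picks a model per request. The threshold and model names below are assumptions for illustration, not anything from the comment:

```python
# Hypothetical router: prefer GPT-3.5 + chain of thought when latency
# matters, and reserve GPT-4 for requests that can tolerate the wait.

def pick_model(needs_cot: bool, latency_budget_s: float) -> str:
    if needs_cot and latency_budget_s < 30:
        # CoT can't stream its final answer early, so a slow model
        # hurts most on these requests.
        return "gpt-3.5-turbo-16k"
    return "gpt-4"
```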



It depends a lot on the domain, even for CoT. I don't think there are enough NLU evaluations yet to robustly compare GPT-3.5 with CoT/SC against GPT-4 across domains.

For instance, on the MATH dataset, my own n=500 evaluation showed no difference between GPT-3.5 (with and without CoT) and GPT-4. I was pretty surprised by that.
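For context, SC (self-consistency) samples several independent CoT completions and majority-votes their final answers. A toy sketch of the voting step, with hypothetical names:

```python
from collections import Counter

def self_consistency(final_answers: list[str]) -> str:
    """Majority vote over the final answers from several sampled CoT runs."""
    return Counter(final_answers).most_common(1)[0][0]
```

In practice each answer would come from a separate temperature-sampled completion, with the final answer extracted from the end of each chain.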



