Hacker News | new | past | comments | ask | show | jobs | submit | gnulinux's comments

Maybe. Ever since I graduated from college I've learned again and again that pretty much anything worth thinking about in life boils down to math for me. I'd maybe/probably study CS as a minor or double major, but Pure/Applied Math programs can be more intellectually enriching in this day and age. This is a completely personal analysis; it'll differ for everyone.

My first impressions: not impressed at all. I tried using this for my daily tasks today, and for writing it was very poor; o3 was much better at this task. I'm not planning on using this model in the upcoming days. I'll keep using Gemini 2.5 Pro, Claude Sonnet, and o3.

Imho chatterbox is the current open-weight SOTA model in terms of quality: https://huggingface.co/ResembleAI/chatterbox

Thank you, I hadn't heard of it. Will have a look! The samples sound excellent indeed.

Name recognition? Advertisement? Federal grant to beat Chinese competition?

There could be many legitimate reasons, but yeah, I'm very surprised by this too. Some companies take it a bit too seriously and go above and beyond, too. At this point, unless you need the absolute SOTA models because you're throwing an LLM at an extremely hard problem, there is very little utility in using the larger providers. Via OpenRouter, or by renting your own GPU, you can run on-par models for much cheaper.


Not even that: even if o3 being marginally better matters for your task (let's say), why would anyone use o4-mini? It's almost 10x the price for the same (maybe even worse) performance: https://openrouter.ai/openai/o4-mini

Probably because they are going to announce GPT-5 imminently.

Wow, that's significantly cheaper than o4-mini, which seems to be on par with gpt-oss-120b. At $1.10/M input tokens and $4.40/M output tokens, o4-mini is almost 10x the price.

LLMs are getting cheaper much faster than I anticipated. I'm curious whether it's still the hype cycle and Groq/Fireworks/Cerebras are taking a loss here, or whether things are actually getting cheaper. At this rate we'll be able to run Qwen3-32B-level models on phones and embedded devices soon.


It's funny because I was thinking the opposite: the pricing seems way too high for a model with only ~5B active parameters.

Sure, you're right, but if I can squeeze o4-mini-level utility out of it at less than a quarter of the price, does it really matter?


Are the prices staying aligned to the fundamentals (hardware, energy), or is this a VC-funded land grab pushing prices to the bottom?

It's averaging $0.3/1M input tokens and $1.2/1M output tokens. That's mind-blowingly cheap for a model of its caliber. Gemini 2.5 Pro is more than 10x that price.

At $2/1Mt it's cheaper than e.g. Gemini 2.5 Pro ($1.25/1Mt input and $10/1Mt output). When I code with Aider my requests average something like 5000 tokens input and 800 tokens output. At this rate, Gemini 2.5 Pro is about $0.01425 per single Aider request and Cerebras Qwen3 Coder is $0.0116 per request. Not a huge difference, but I think sufficiently cheaper to be competitive, especially given Qwen3-coder is on par with Gemini/Claude/o3; it even surpasses them in some tests.
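The per-request arithmetic above can be sketched as follows (token counts are my rough Aider averages; prices are the per-1M-token figures quoted, and a flat $2/1Mt for both directions on Cerebras):

```python
# Rough per-request cost comparison. Prices are in $ per 1M tokens.
def request_cost(in_tok, out_tok, in_price, out_price):
    return in_tok / 1e6 * in_price + out_tok / 1e6 * out_price

# Typical Aider request: ~5000 input tokens, ~800 output tokens.
gemini = request_cost(5000, 800, 1.25, 10.0)  # Gemini 2.5 Pro
qwen = request_cost(5000, 800, 2.0, 2.0)      # Cerebras Qwen3 Coder, flat $2/1Mt

print(f"Gemini 2.5 Pro: ${gemini:.5f}")  # $0.00625 + $0.00800 = $0.01425
print(f"Qwen3 Coder:    ${qwen:.5f}")    # $0.01000 + $0.00160 = $0.01160
```

Note how the comparison flips depending on the input/output mix: Gemini's cheap input but expensive output means output-heavy workloads favor the flat-priced model even more.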

NOTE: Currently on OpenRouter, Qwen3-Coder requests are averaging $0.3/1M input tokens and $1.2/1M output tokens. That's so significantly cheaper that I wouldn't be surprised if open-weight models start eating Google/Anthropic/OpenAI's lunch. https://openrouter.ai/qwen/qwen3-coder


Do you have any experience with how Qwen3-coder compares to Claude 4 Sonnet?

No, unfortunately, I haven't used Qwen3-coder yet. I do like Claude 4 Sonnet, but my favorite programming LLM at the moment is Gemini 2.5 Pro; I think it's the smartest model (though Claude and o3 do print better code).

I have experience using the base Qwen3-32B model and it's extremely good for its size, especially at solving undergrad/grad-level math problems. So my guess would be that Qwen3-coder should be competitive, but this is just speculation.


Qwen3 is the open-weight state of the art at the moment. Qwen3-Embedding-8B and Qwen3-Reranker-8B are surprisingly good (according to some benchmarks, better than Gemini 2.5 embedding). The 4B is also nearly as good, so you might as well use that unless 8B benefits your use case. If you don't need a SOTA-precise embedding model because you'll run a more powerful reranker afterwards, you could run Qwen3-Embedding-4B at Q4, which is only about 2GB and will process extremely fast on most hardware. A weaker but close choice is `Qwen3-Embedding-0.6B` at Q8, which is about 600MB and will run just fine on most reasonably powerful CPUs. So if that does the job for you, you may not even need a GPU; just grab an instance with 16 vCPUs. That'll give you plenty of throughput, probably more than you need until your RAG has thousands of active users.

Tool calling complements RAG. You build a full-scale RAG pipeline (embed, rerank, create the prompt, get output from the LLM) and hook it up as a tool another agent can see. That combines the power of both.
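A minimal sketch of that wiring, with the retrieval and LLM steps stubbed out so it's self-contained (the keyword-overlap "retrieval" is a hypothetical stand-in for a real embedding + reranker stage, and the tool schema follows the common OpenAI-style function-calling shape):

```python
# Sketch: expose a RAG pipeline as a tool another agent can call.

def retrieve(query, corpus, top_k=3):
    # Stand-in for embedding search + reranking: naive keyword overlap.
    words = query.lower().split()
    scored = sorted(corpus, key=lambda d: -sum(w in d.lower() for w in words))
    return scored[:top_k]

def rag_answer(query, corpus):
    """Full RAG pass: retrieve, build the prompt, call the LLM."""
    context = "\n".join(retrieve(query, corpus))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return prompt  # in practice: return llm(prompt)

# Tool schema the orchestrating agent sees; it never touches the
# embedding model or reranker directly, only this one function.
rag_tool = {
    "type": "function",
    "function": {
        "name": "rag_answer",
        "description": "Answer a question using the document knowledge base.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}
```

The point of the indirection is that the agent only reasons about when to call the tool; the retrieval quality lives entirely inside `rag_answer`, so you can swap embedding models or rerankers without touching the agent.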
