
Fantastic, this looks excellent, and I'm excited to try it.

I regularly use the @ key to add files to context for tasks I know require edits, or for patterns I want Claude to follow. It adds a few extra keystrokes, but in most cases the quality improvement is worth it.


Where do you get the 220 tokens/second figure? Genuinely curious, as that would be very impressive for a model comparable to Sonnet 4. OpenRouter is currently publishing around 116 tps [1].

[1] https://openrouter.ai/anthropic/claude-haiku-4.5


I was just about to post that Haiku 4.5 does something I have never encountered before [0]: there is a massive delta in tokens/sec depending on the query. Some variance, including task-specific variance, is of course nothing new, but never as pronounced and reproducible as here.

A few examples, prompted between 21:30 and 23:00 UTC via T3 Chat [0]:

Prompt 1 — 120.65 token/sec — https://t3.chat/share/tgqp1dr0la

Prompt 2 — 118.58 token/sec — https://t3.chat/share/86d93w093a

Prompt 3 — 203.20 token/sec — https://t3.chat/share/h39nct9fp5

Prompt 4 — 91.43 token/sec — https://t3.chat/share/mqu1edzffq

Prompt 5 — 167.66 token/sec — https://t3.chat/share/gingktrf2m

Prompt 6 — 161.51 token/sec — https://t3.chat/share/qg6uxkdgy0

Prompt 7 — 168.11 token/sec — https://t3.chat/share/qiutu67ebc

Prompt 8 — 203.68 token/sec — https://t3.chat/share/zziplhpw0d

Prompt 9 — 102.86 token/sec — https://t3.chat/share/s3hldh5nxs

Prompt 10 — 174.66 token/sec — https://t3.chat/share/dyyfyc458m

Prompt 11 — 199.07 token/sec — https://t3.chat/share/7t29sx87cd

Prompt 12 — 82.13 token/sec — https://t3.chat/share/5ati3nvvdx

Prompt 13 — 94.96 token/sec — https://t3.chat/share/q3ig7k117z

Prompt 14 — 190.02 token/sec — https://t3.chat/share/hp5kjeujy7

Prompt 15 — 190.16 token/sec — https://t3.chat/share/77vs6yxcfa

Prompt 16 — 92.45 token/sec — https://t3.chat/share/i0qrsvp29i

Prompt 17 — 190.26 token/sec — https://t3.chat/share/berx0aq3qo

Prompt 18 — 187.31 token/sec — https://t3.chat/share/0wyuk0zzfc

Prompt 19 — 204.31 token/sec — https://t3.chat/share/6vuawveaqu

Prompt 20 — 135.55 token/sec — https://t3.chat/share/b0a11i4gfq

Prompt 21 — 208.97 token/sec — https://t3.chat/share/al54aha9zk

Prompt 22 — 188.07 token/sec — https://t3.chat/share/wu3k8q67qc

Prompt 23 — 198.17 token/sec — https://t3.chat/share/0bt1qrynve

Prompt 24 — 196.25 token/sec — https://t3.chat/share/nhnmp0hlc5

Prompt 25 — 185.09 token/sec — https://t3.chat/share/ifh6j4d8t5

I ran each prompt three times and got the same tokens/sec results for each respective prompt (within expected variance, meaning plus or minus less than 5%). Each run used Claude Haiku 4.5 with "High reasoning". I will continue testing, but this is beyond odd. I will add that my very early evals leaned heavily into pure code output, where 200 tokens/sec is consistently possible at the moment, but it is certainly not the average as I claimed before; there I was mistaken. That said, even across a wider range of challenges we are above 160 tokens/sec, and if you focus solely on coding, whether Rust or React, Haiku 4.5 is very swift.

[0] Normally I don't use T3 Chat for evals; it's just easier to share prompts this way, though I was disappointed to find that the model information (tokens/sec, TTF, etc.) can't be enabled without an account. Also, these aren't the prompts I usually use for evals. Those I try to keep somewhat out of training by only using the paid API for benchmarks. As anything on Hacker News is most assuredly part of model training, I decided to write some quick and dirty prompts to highlight what I have been seeing.
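For anyone who wants to reproduce this against the paid API, the measurement itself is simple. A minimal sketch in Python (the model id and prompt are placeholders for whatever you test, and a non-streaming call only gives wall-clock throughput, so it will slightly understate tokens/sec versus measuring from the first token):

    # Rough tokens/sec measurement against the Anthropic Messages API.
    # The model id and prompt below are placeholders.
    import os
    import time
    import requests

    def measure_tps(prompt: str, model: str = "claude-haiku-4-5") -> float:
        start = time.monotonic()
        resp = requests.post(
            "https://api.anthropic.com/v1/messages",
            headers={
                "x-api-key": os.environ["ANTHROPIC_API_KEY"],
                "anthropic-version": "2023-06-01",
                "content-type": "application/json",
            },
            json={
                "model": model,
                "max_tokens": 2048,
                "messages": [{"role": "user", "content": prompt}],
            },
            timeout=120,
        )
        resp.raise_for_status()
        elapsed = time.monotonic() - start
        output_tokens = resp.json()["usage"]["output_tokens"]
        # Wall-clock tokens/sec; use streaming if you want TTFT separately.
        return output_tokens / elapsed

    print(measure_tps("Write a Rust function that parses RFC 3339 timestamps."))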


Interesting, and if they are using speculative decoding that variance would make sense. Also, your numbers line up with what OpenRouter is now publishing: 169.1 tps [1].

Anthropic mentioned this model is more than twice as fast as Claude Sonnet 4 [2], which OpenRouter averaged at 61.72 tps [3]. If these numbers hold (169.1 / 61.72 ≈ 2.7), we're really looking at an almost 3x improvement in throughput and less than half the initial latency.

[1] https://openrouter.ai/anthropic/claude-haiku-4.5

[2] https://www.anthropic.com/news/claude-haiku-4-5

[3] https://openrouter.ai/anthropic/claude-sonnet-4


That's what you get when you use speculative decoding and focus/overfit the draft model on coding. When the answer is out of distribution for the draft model, you get more token rejections by the main model and throughput suffers. This probably still makes sense for them if they expect a lot of their load to come from Claude Code and they need to make it economical.
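A toy way to see the effect (this is just the standard speculative decoding arithmetic, not anything specific to Anthropic's setup): the draft model proposes up to k tokens per target forward pass, and effective throughput scales with how many of those the target accepts.

    # Toy model of speculative decoding throughput: tokens emitted per
    # target-model forward pass as a function of draft acceptance rate.
    # Acceptance rates here are illustrative, not measurements.
    import random

    def tokens_per_pass(accept_prob: float, k: int = 4, trials: int = 100_000) -> float:
        emitted = 0
        for _ in range(trials):
            accepted = 0
            while accepted < k and random.random() < accept_prob:
                accepted += 1
            # The target pass always contributes one token of its own,
            # even when every draft token is rejected.
            emitted += accepted + 1
        return emitted / trials

    print("in-distribution (e.g. code):", tokens_per_pass(0.85))  # ~3.7 tokens/pass
    print("out-of-distribution prose: ", tokens_per_pass(0.35))   # ~1.5 tokens/pass

In this toy setup the gap between the two regimes is roughly the 2x spread the parent measured across prompts.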


I'm curious whether Anthropic mentions anywhere that they use speculative decoding. OpenAI does seem to use it, based on this tweet [1].

[1] https://x.com/stevendcoffey/status/1853582548225683814


Congrats to the team. I'm surprised the industry hasn't been more impressed with their benchmarks on token throughput. We're using the Qwen 3 Coder 480B model and seeing ~2000 tokens/second, which is easily 10-20x faster than most LLMs on the market. Even some of the fastest models still only achieve 100-150 tokens/second (see OpenRouter stats by provider). I do feel that after around 300-400 tokens/second the gains in speed become more incremental, so if there were a model at 300+ tokens/second, I would consider that a very competitive alternative.


Really excited for this product; the industry needs alternatives to WebContainers, which has become more restrictive around licensing. Also great to see that non-Node runtimes (Ruby / Python) will be supported. Having said that, I really wish this were open source, even if that meant the OSS version had more limited features than the commercial alternative.


This token throughput is incredible and going to set a new bar in the industry. The main issue with the Cerebras Code plan is that the number of requests per minute is throttled, and with agentic coding systems each tool call is treated as a new "message", so you can easily hit the API limits (10 messages/minute).

One workaround we're using now that seems to work is to use Claude for all tasks but delegate specific tools to the cerebras/qwen-3-coder-480b model for generating files and other token-heavy tasks, to avoid spiking the total number of requests. This has cost and latency consequences (and adds complexity to the code), but until those throttle limits are lifted it seems to be a good combo. I also find that Claude has better tool-selection quality when the number of tools required is > 15, which is the case in our current setup.
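Roughly, the delegation looks like this; a sketch of the token-heavy tool handler (the Cerebras endpoint is OpenAI-compatible, and the model id shown is what we pass today, so treat both as assumptions for your own setup):

    # Tool handler that offloads bulk file generation to Cerebras so the
    # agent loop only spends one Cerebras request per generated file.
    # Endpoint and model id are assumptions based on our current setup.
    import os
    import requests

    CEREBRAS_URL = "https://api.cerebras.ai/v1/chat/completions"

    def generate_file(spec: str) -> str:
        resp = requests.post(
            CEREBRAS_URL,
            headers={"Authorization": f"Bearer {os.environ['CEREBRAS_API_KEY']}"},
            json={
                "model": "qwen-3-coder-480b",
                "messages": [
                    {"role": "system", "content": "Output only the file contents."},
                    {"role": "user", "content": spec},
                ],
            },
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

Claude still drives the loop and sees generate_file as just another tool, so its tool-selection quality is unaffected.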


Frontend.co | REMOTE | Full-time & Part-time | https://www.frontend.co

Frontend is building an AI-powered Shopify development platform. We use AI to generate full-stack applications built with Next.js / Tailwind for Shopify-connected storefronts.

Experience: 7-10+ years of full-stack experience, with recent experience using TypeScript, Next.js, Tailwind CSS, Postgres, and Ruby on Rails. E-commerce development using Shopify GraphQL APIs is a plus. Our tech stack is:

- Next.js

- Supabase / Postgres

- TypeScript

- Tailwind

- Ruby on Rails

- Shopify Storefront + Admin GraphQL APIs

Send us a note at: info[plus]hn@frontend.co


Looks great and excited to try this out. We’ve also had success using CodeSandbox SDK and E2B, can you share some thoughts on how you compare or future direction? Do you also use Firecracker under the hood?


> can you share some thoughts on how you compare or future direction?

Microsandbox does not offer a cloud solution. It is self-hosted and designed to do what E2B does: make it easier to work with microVM-based sandboxes on your local machine, whether that is Linux, macOS, or Windows (planned), and to transition seamlessly to prod.

> Do you also use Firecracker under the hood?

It uses libkrun.


Self-hosting is definitely something we are keen to explore, as most of the cloud solutions have resource constraints (i.e., total active microVMs and/or specs per VM) and managing billing gets complicated even with hibernation features. Great project, and we'll definitely take it for a spin.


I can't tell if it uses Firecracker, but that's my main question too. I'm curious whether microsandbox will be maintained and whether proper auditing will be done.

I welcome alternatives. It's been tough wrestling with Firecracker and OCI images. Kata Containers is also tough.


It will be maintained, as I will be using it for some other product. And it will be audited in the future, but it's still early days.


I wanted to try Kata Containers soon. What difficulties did you have with them?


Excited to try this out. It will solve two problems we’ve had: applying a code diff reliably and selecting which files from a large codebase to use for context.

We quickly discovered that RAG using a similarity search over embedded vectors can easily miss relevant files, unless we cast a very wide net during retrieval.

We’ve also had trouble getting any LLM to reliably generate a diff format (such as unified diff), so your approach to applying a patch is exciting.
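For others hitting the retrieval issue: the "wide net" for us just means a large top_k plus a low similarity floor, then letting the LLM discard what it doesn't need. A minimal sketch (the embedding model, top_k, and threshold are placeholders):

    # Cosine-similarity file retrieval with a deliberately generous top_k
    # and a low score floor. query_vec / file_vecs come from whatever
    # embedding model you use; the thresholds here are illustrative.
    import numpy as np

    def candidate_files(query_vec, file_vecs, paths, top_k=50, min_score=0.15):
        # Normalize so a dot product equals cosine similarity.
        q = query_vec / np.linalg.norm(query_vec)
        f = file_vecs / np.linalg.norm(file_vecs, axis=1, keepdims=True)
        scores = f @ q
        order = np.argsort(-scores)[:top_k]
        return [paths[i] for i in order if scores[i] >= min_score]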


This looks great, glad to see this project and congrats on the launch. Having said that, how does this project fit in with the Shopify Hydrogen effort using Remix / React? There seems to be an ever-growing number of ways to build a Shopify storefront these days (i.e., native templates, Remix/Hydrogen, web components, the Shopify JS Buy SDK, etc.), so it's not clear what technology to "bet on" from a developer perspective.

Separately, nice touch adding the refined LLM instructions; this looks like a nice pattern for other UI frameworks to follow.


Different tools for different users and different levels of control.

