
I've heard lots of people on HN complaining about bot traffic bogging down their websites, and as a website operator myself I'm honestly puzzled. If you're already using Cloudflare, some basic cache configuration should guarantee that most bot traffic hits the cache and doesn't bog down your servers. And even if you don't want to do that, bandwidth and CPU are so cheap these days that it shouldn't make a difference. Why is everyone so upset?





As someone who had some outages due to AI traffic and is now using Cloudflare's tools:

Most of my site is cached in multiple different layers. But some things that I surface to the unauthenticated public can't be cached and still be functional. Hammering those endpoints has taken my app down.

Additionally, even though there are multiple layers, things that are expensive to generate can still slip through the cracks. My site has millions of public-facing pages, and a batch of simultaneous misses on heavier pages that are expensive to regenerate can back up requests, which leads to errors, and errors don't result in the cache getting filled. So the AI traffic keeps hitting those endpoints, they keep not getting cached and keep throwing errors, and it spirals from there.
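
For anyone curious, the guard that helps with that miss spiral is roughly this shape (just a sketch, not what my stack literally runs; `cache` stands in for whatever get/set-with-TTL store you use, and the lock only coordinates within a single process): let one request regenerate a missing page, and fall back to a stale copy if regeneration blows up.

  import threading

  _locks = {}
  _locks_guard = threading.Lock()

  def get_or_render(cache, key, render, ttl=300):
      # Fast path: serve whatever is already cached.
      hit = cache.get(key)
      if hit is not None:
          return hit
      # One lock per key so only a single request regenerates the page.
      with _locks_guard:
          lock = _locks.setdefault(key, threading.Lock())
      with lock:
          hit = cache.get(key)          # another request may have filled it meanwhile
          if hit is not None:
              return hit
          try:
              page = render()
              cache.set(key, page, ttl)                   # normal entry
              cache.set(("stale", key), page, ttl * 10)   # longer-lived stale copy
              return page
          except Exception:
              # If regeneration fails under load, degrade to the stale copy
              # instead of returning an error that leaves the cache empty.
              stale = cache.get(("stale", key))
              if stale is not None:
                  return stale
              raise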


That's a pretty big assumption.

The largest site I work on has 100,000s of pages, each in around 10 languages — that's already millions of pages.

It generally works fine. Yesterday it served just under 1000 RPS over the day.

AI crawlers have brought it down when a single crawler added 100, 200 or more RPS distributed over a wide range of IPs. It's not so much the number of extra requests, though that's wildly disproportionate for one "user"; the problem is they can end up hitting an expensive endpoint that's excluded by robots.txt and protected by other rate-limiting measures, none of which anticipated a DDoS.


Ok, clearly I had no idea of the scale of it. 200 RPS from a single bot sounds pretty bad! Do all 100,000+ pages have to be live to be useful, or could many be served from a cache that is minutes/hours/days old?

The main data for those pages is in a column store, so it can sustain many thousand RPS (at least).

The problem is we have things like

  Disallow: /the-search-page
  Disallow: /some-statistics-pages
in robots.txt, which is respected by most search engine (etc) crawlers, but completely ignored by the AI crawlers.

By chance, this morning I found a legacy site down, because in the last 8 hours it's had 2 million hits (about 70/s) to a location disallowed in robots.txt. These hits have come from over 1.5 million different IP addresses, so the existing rate-limit-by-IP didn't catch it.

The User-Agents are a huge mixture of real-looking web browsers; the IPs look to come from residential, commercial and sometimes cloud ranges, so it's probably all hacked computers.

I could see Cloudflare might have data to block this better. They don't just get 1 or 2 requests from an IP, they presumably see a stream of them to different sites. They could see many different user agents being used from that IP, and other patterns, and can assign a reputation score.

I think we will need to add a proof-of-work thing in front of these pages and probably whitelist some 'good' bots (Wikipedia, Internet Archive etc). It's annoying, since this had been working fine in its current form for over 5 years.
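
The proof-of-work idea, roughly sketched (this is just the general shape, not Anubis's actual scheme; the difficulty and challenge format here are invented): the server hands the client a challenge, some JavaScript on the client grinds for a nonce, and the server needs only a hash or two to verify.

  import hashlib, hmac, os

  SECRET = os.urandom(32)   # per-deployment secret used to derive challenges
  DIFFICULTY = 20           # client needs ~2^20 hash attempts on average

  def make_challenge(client_id: str) -> str:
      # Deterministic per client, so the server doesn't have to store state.
      return hmac.new(SECRET, client_id.encode(), hashlib.sha256).hexdigest()

  def verify(client_id: str, nonce: str) -> bool:
      # Accept only if sha256(challenge + nonce) starts with DIFFICULTY zero bits:
      # trivial for the server to check, expensive for a crawler to mass-produce.
      challenge = make_challenge(client_id)
      digest = hashlib.sha256((challenge + nonce).encode()).digest()
      return int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0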


The presumption that I'm already using Cloudflare is quite a starting assumption. Is this a requirement for maintaining a simple website now?

Either that or Anubis (https://anubis.techaro.lol/docs), yes.

So these companies broke the internet

Which companies?

OpenAI, Anthropic, Google? No, their bots are pretty well behaved.

The smaller AI companies deploying bots that don't respect any reasonable rate limits and are scraping the same static pages thousands of times an hour? Yup


Anecdote, but at least for my tiny little server hosting a single public repository, none of these companies had 'well behaved' bots. It's possible they've since learned to behave better, but I wouldn't know, since my only possible recourse was to blacklist them all AND take the repo private.

Those are the small companies spoofing their user agent as the big companies to dodge countermeasures.

The stories I've heard have been mostly about scraper bots finding APIs like "get all posts in date range" and then hammering that with every combo of start/end date.

I too am a bit confused/mystified at the strong reaction. But I do expect there are a lot of badly optimized sites that just want out.

I struggle to think of a web-related library that has spread faster than the Anubis checker. It's everywhere now! https://github.com/TecharoHQ/anubis

I'm surprised we don't see more efforts to rate limit. I assume many of these are distributed crawlers, but it feels like there have got to be pools of activity spinning up on a handful of IPs, and that they would be pretty clearly correlated in time. Maybe that's not true. But it feels like the web, more than anything else, needs some open source software to hand out a lot more 420 Enhance Your Calm responses. https://http.dev/420
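
The kind of thing I mean, as a sketch (the window, the limit and the choice of key are all placeholders; the key doesn't have to be an IP, a /24 prefix, an (IP, UA family) pair or some behavioural fingerprint would work too):

  import time
  from collections import defaultdict, deque

  WINDOW = 60    # seconds
  LIMIT = 120    # requests allowed per key per window

  _hits = defaultdict(deque)

  def allow(key: str) -> bool:
      # Sliding-window counter: drop timestamps that fell out of the window,
      # then decide whether this request pushes the key over the limit.
      now = time.monotonic()
      q = _hits[key]
      while q and now - q[0] > WINDOW:
          q.popleft()
      if len(q) >= LIMIT:
          return False   # respond with 429, or 420 if you're feeling whimsical
      q.append(now)
      return True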


The reaction comes from some combination of

- opposition to generative AI in general

- a view that AI, unlike search which also relies on crawling, offers you no benefits in return

- crawlers from the AI firms being less well-behaved than the legacy search crawlers, not obeying robots.txt, crawling more often, more aggressively, more completely, more redundantly, from more widely-distributed addresses

- companies sneaking in AI crawling underneath their existing tolerated/whitelisted user-agents (Facebook was pretty clearly doing this with "facebookexternalhit" that people would have allowed to get Facebook previews; they eventually made a new agent for their crawling activity)

- a simultaneous huge spike in obvious crawler activity with spoofed user agents: e.g. a constant random cycling between every version of Chrome or Firefox or any browser ever released; who this is or how many different actors it is and whether they're even doing crawling for AI, who knows, but it's a fair bet.

Better optimization and caching can make all of this matter less, but not everything can be cached, and plenty of small operations got by just fine before all this extra traffic and would get by just fine without it. So can you really blame them for turning to blocking?


I'm not an expert on website hosting, but after reading some of the blog posts on Anubis, those people were truly at their wits' end trying to block AI scrapers with techniques like the ones you imply.

https://xeiaso.net/blog/2025/anubis/ links to https://pod.geraspora.de/posts/17342163 which says:

> If you try to rate-limit them, they'll just switch to other IPs all the time. If you try to block them by User Agent string, they'll just switch to a non-bot UA string (no, really). This is literally a DDoS on the entire internet.

My gut is that the switching between IP addresses can't be that hard to follow, and that the access pattern is pretty obvious to follow across identities.

But it would be non-trivial: it would entail crafting new systems and doing new work per request (kicking in when traffic starts to be elevated, as a first gate).

Just making the client run through some math gauntlet is an obvious win that aggressors probably can't break. But I still think there's some really good low-hanging fruit for identifying and rate limiting even these rather more annoying traffic patterns: the behavior itself leaves a fingerprint that can't be hidden and which can absolutely be rate limited. I'd like to see that area explored.

Edit: oh heck yes, a new submission with 1.7 TB of logs of what AI crawlers do. Now we can machine-learn some better rate-limiting techniques! https://news.ycombinator.com/item?id=44450352 https://huggingface.co/datasets/lee101/webfiddle-internet-ra...


This isn't as helpful as you think. If it included all of the HTTP headers that the bots sent and other metadata like TLS ClientHelloInfo it would be a lot more useful.

There are headers, but I hadn't noticed that they're the response headers. :facepalm:

I'm not much into that kind of DevOps. What does good basic caching look like in this instance?

It comes down to:

1. Use the Cache-Control header to express how to cache your site correctly (https://developer.mozilla.org/en-US/docs/Web/HTTP/Guides/Cac...)

2. Use a CDN service, or at least a caching reverse proxy, to serve most of the cacheable requests to reduce load on the (typically much more expensive) origin servers
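
A minimal sketch of point 1, using Flask purely as an example (the routes, TTLs and page bodies here are made up for illustration):

  from flask import Flask, make_response

  app = Flask(__name__)

  @app.route("/articles/<slug>")
  def article(slug):
      resp = make_response(f"<h1>{slug}</h1>")   # stand-in for real page rendering
      # Public page: any cache (browser or CDN) may keep it for 5 minutes, and may
      # keep serving a stale copy for up to a day while revalidating in the background.
      resp.headers["Cache-Control"] = "public, max-age=300, stale-while-revalidate=86400"
      return resp

  @app.route("/account")
  def account():
      resp = make_response("your account")       # stand-in for a personalised page
      # Personalised page: nothing should store it.
      resp.headers["Cache-Control"] = "no-store"
      return resp

Point 2 then mostly comes down to pointing a CDN or caching reverse proxy at the origin and letting it respect those headers.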


Just note that many AI scrapers will go to great lengths to do cache busting. For some reason many of them feel like they need the absolute latest version and don't trust your cache.

You can use Cache-Control headers to express that your own CDN should aggressively refresh a resource from the origin but always serve it to external clients from cache. It's covered in the link above under "Managed Caches".
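
Roughly that split with standard directives, as a sketch (the numbers are placeholders, and the CDN has to actually honour s-maxage and stale-while-revalidate):

  # max-age=0                  -> browsers treat their copy as immediately stale and
  #                               revalidate, so clients effectively always ask the CDN
  # s-maxage=60                -> the shared CDN cache treats it as fresh for a minute
  # stale-while-revalidate=600 -> after that, the CDN keeps answering from its stale
  #                               copy while it refreshes from the origin in the background
  CACHE_CONTROL = "public, max-age=0, s-maxage=60, stale-while-revalidate=600"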

A CDN can be configured to ignore cache control headers in the requests and cache things anyway.

Cloudflare and other CDNs will usually automatically cache your static pages.

It's not complex. I worked on a big site. We did not have the compute or I/O (most particularly DB IOPS) to generate the site live. Massive crawls both generated cold pages/objects (CPU + IOPS) and yanked them into cache, pushing out hot entries and dramatically worsening cache hit rates. This could easily take down the site.

Cache is expensive at scale. So permitting big or frequent crawls by stupid crawlers either requires significant investment in cache or slows down and worsens the site for all users. For whom we, you know, built the site, not to provide training data for companies.

As others have mentioned, Google is significantly more competent than 99.9% of the others. They are very careful not to take your site down, and they provide, or used to provide, traffic via their search. So it was a trade, not a taking.

Not to mention I prefer not to do business with Cloudflare because I don't like companies that don't publish quotas. If going over X means I need an enterprise account that starts at $10k/mo, I need to know the X. Cloudflare's business practice appears to be letting customers exceed that quota, then aggressively demanding they pay or be kicked off the service almost immediately.



