
I too am a bit confused/mystified by the strong reaction. But I do expect there are a lot of badly optimized sites that just want out.

I struggle to think of a web-related library that has spread faster than the Anubis checker. It's everywhere now! https://github.com/TecharoHQ/anubis

I'm surprised we don't see more efforts to rate limit. I assume many of these are distributed crawlers, but it feels like there have got to be pools of activity spinning up on a handful of IPs, and that they'd be pretty clearly time-correlated with each other. Maybe that's not true. But it feels like the web, more than anything else, needs some open source software that hands out a lot more 420 Enhance Your Calm responses. https://http.dev/420
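Something like the sketch below is what I have in mind: a sliding-window counter keyed by subnet rather than single IP, so a crawler hopping between neighboring addresses still burns one shared budget. This is just illustrative Go; the thresholds, the /24 and /64 grouping, and the middleware shape are all made up, and note that 420 was only ever Twitter's joke status, the standardized one is 429.

    package main

    import (
    	"log"
    	"net"
    	"net/http"
    	"sync"
    	"time"
    )

    // subnetLimiter counts requests per subnet over a sliding window, so a
    // crawler hopping between neighboring IPs still shares one budget.
    type subnetLimiter struct {
    	mu     sync.Mutex
    	window time.Duration
    	limit  int
    	hits   map[string][]time.Time // key: subnet prefix
    }

    func newSubnetLimiter(limit int, window time.Duration) *subnetLimiter {
    	return &subnetLimiter{window: window, limit: limit, hits: map[string][]time.Time{}}
    }

    func (l *subnetLimiter) allow(remoteAddr string) bool {
    	host, _, err := net.SplitHostPort(remoteAddr)
    	if err != nil {
    		host = remoteAddr
    	}
    	ip := net.ParseIP(host)
    	if ip == nil {
    		return true // don't block what we can't parse
    	}
    	var key string
    	if v4 := ip.To4(); v4 != nil {
    		key = v4.Mask(net.CIDRMask(24, 32)).String() // group IPv4 by /24
    	} else {
    		key = ip.Mask(net.CIDRMask(64, 128)).String() // group IPv6 by /64
    	}

    	now := time.Now()
    	l.mu.Lock()
    	defer l.mu.Unlock()

    	// Keep only the timestamps still inside the window, then add this hit.
    	recent := l.hits[key][:0]
    	for _, t := range l.hits[key] {
    		if now.Sub(t) <= l.window {
    			recent = append(recent, t)
    		}
    	}
    	recent = append(recent, now)
    	l.hits[key] = recent
    	return len(recent) <= l.limit
    }

    func (l *subnetLimiter) middleware(next http.Handler) http.Handler {
    	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
    		if !l.allow(r.RemoteAddr) {
    			w.Header().Set("Retry-After", "60")
    			// 420 was never standardized; 429 Too Many Requests is the real one.
    			http.Error(w, "Enhance Your Calm", http.StatusTooManyRequests)
    			return
    		}
    		next.ServeHTTP(w, r)
    	})
    }

    func main() {
    	limiter := newSubnetLimiter(300, time.Minute) // 300 req per subnet per minute: made-up numbers
    	http.Handle("/", limiter.middleware(http.FileServer(http.Dir("."))))
    	log.Fatal(http.ListenAndServe(":8080", nil))
    }

Obviously real crawler pools aren't always in neighboring subnets, which is why purely IP-based grouping only gets you so far.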

The reaction comes from some combination of

- opposition to generative AI in general

- a view that AI, unlike search which also relies on crawling, offers you no benefits in return

- crawlers from the AI firms being less well-behaved than the legacy search crawlers, not obeying robots.txt, crawling more often, more aggressively, more completely, more redundantly, from more widely-distributed addresses

- companies sneaking in AI crawling underneath their existing tolerated/whitelisted user-agents (Facebook was pretty clearly doing this with "facebookexternalhit" that people would have allowed to get Facebook previews; they eventually made a new agent for their crawling activity)

- a simultaneous huge spike in obvious crawler activity with spoofed user agents: e.g. constant random cycling through every version of Chrome, Firefox, or any other browser ever released. Who this is, how many different actors it is, and whether they're even crawling for AI, who knows, but it's a fair bet.

Better optimization and caching can make all of this matter less, but not everything can be cached, and plenty of small operations got by just fine before this extra traffic arrived and would get by just fine without it, so can you really blame them for turning to blocking?


I'm not an expert on website hosting, but from reading some of the blog posts around Anubis, those people were truly at their wits' end trying to block AI scrapers with techniques like the ones you suggest.

https://xeiaso.net/blog/2025/anubis/ links to https://pod.geraspora.de/posts/17342163 which says:

> If you try to rate-limit them, they'll just switch to other IPs all the time. If you try to block them by User Agent string, they'll just switch to a non-bot UA string (no, really). This is literally a DDoS on the entire internet.

My gut says the switching between IP addresses can't be that hard to follow, that the access pattern is pretty easy to trace across identities.

But it would be nontrivial: it would entail building new systems and doing new work per request (at least once traffic starts to get elevated, as a first gate).
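To make "trace it across identities" concrete, here's one sketch of what that per-request work could look like: instead of keying a counter on the IP, key it on a behavioral signature built from headers and the crawl shape, and track how many distinct IPs share each signature. One signature producing heavy traffic from dozens of addresses is exactly the rotation pattern the quoted post describes. The signature recipe and the thresholds here are invented, not anything Anubis or anyone else ships:

    package main

    import (
    	"crypto/sha256"
    	"encoding/hex"
    	"fmt"
    	"log"
    	"net"
    	"net/http"
    	"strings"
    	"sync"
    	"time"
    )

    // signature hashes traits that tend to stay constant while a crawler
    // rotates IPs: user agent, accept headers, and the first path segment
    // (the crawl "shape" rather than the exact page).
    func signature(r *http.Request) string {
    	h := sha256.New()
    	fmt.Fprintln(h, r.UserAgent())
    	fmt.Fprintln(h, r.Header.Get("Accept"))
    	fmt.Fprintln(h, r.Header.Get("Accept-Language"))
    	if parts := strings.SplitN(r.URL.Path, "/", 3); len(parts) > 1 {
    		fmt.Fprintln(h, parts[1])
    	}
    	return hex.EncodeToString(h.Sum(nil)[:8])
    }

    // budget tracks, per signature, how many distinct source IPs it has been
    // seen from and how many requests it has made in the current window.
    type budget struct {
    	mu    sync.Mutex
    	seen  map[string]map[string]struct{} // signature -> set of source IPs
    	count map[string]int                 // signature -> requests this window
    }

    func newBudget(window time.Duration) *budget {
    	b := &budget{seen: map[string]map[string]struct{}{}, count: map[string]int{}}
    	go func() {
    		for range time.Tick(window) { // reset the window periodically
    			b.mu.Lock()
    			b.seen = map[string]map[string]struct{}{}
    			b.count = map[string]int{}
    			b.mu.Unlock()
    		}
    	}()
    	return b
    }

    // suspicious flags a request when one behavioral signature is producing
    // heavy traffic from many different addresses. Thresholds are invented.
    func (b *budget) suspicious(r *http.Request) bool {
    	host, _, err := net.SplitHostPort(r.RemoteAddr)
    	if err != nil {
    		host = r.RemoteAddr
    	}
    	sig := signature(r)
    	b.mu.Lock()
    	defer b.mu.Unlock()
    	if b.seen[sig] == nil {
    		b.seen[sig] = map[string]struct{}{}
    	}
    	b.seen[sig][host] = struct{}{}
    	b.count[sig]++
    	return len(b.seen[sig]) > 50 && b.count[sig] > 5000
    }

    func main() {
    	b := newBudget(10 * time.Minute)
    	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
    		if b.suspicious(r) {
    			http.Error(w, "Enhance Your Calm", http.StatusTooManyRequests)
    			return
    		}
    		w.Write([]byte("ok\n"))
    	})
    	log.Fatal(http.ListenAndServe(":8080", nil))
    }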

Just making the client run through some math gauntlet is an obvious win that aggressors probably can't break. But I still think there's some really good low-hanging fruit for identifying and rate limiting even these rather more annoying traffic patterns: the behavior itself leaves a fingerprint that can't be hidden and which can absolutely be rate limited. I'd like to see that area explored.
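For the "math gauntlet" part, the general shape (not necessarily Anubis's exact protocol, which I haven't read closely) is a hash proof of work: the server hands out a random challenge, the client must find a nonce such that SHA-256(challenge:nonce) has N leading zero bits, and verification costs the server a single hash. A rough sketch; the difficulty, encoding, and challenge format here are all invented:

    package main

    import (
    	"crypto/rand"
    	"crypto/sha256"
    	"encoding/hex"
    	"fmt"
    	"math/bits"
    )

    // difficulty is the number of leading zero bits required; invented here.
    // Real deployments tune it so browsers solve it in well under a second.
    const difficulty = 18

    // newChallenge returns a random value the client must solve against.
    func newChallenge() string {
    	var b [16]byte
    	if _, err := rand.Read(b[:]); err != nil {
    		panic(err)
    	}
    	return hex.EncodeToString(b[:])
    }

    // leadingZeroBits counts the leading zero bits of a SHA-256 digest.
    func leadingZeroBits(sum [32]byte) int {
    	n := 0
    	for _, b := range sum {
    		if b == 0 {
    			n += 8
    			continue
    		}
    		n += bits.LeadingZeros8(b)
    		break
    	}
    	return n
    }

    // verify is the server side: one cheap hash to check work the client
    // had to spend roughly 2^difficulty attempts to find.
    func verify(challenge string, nonce uint64) bool {
    	sum := sha256.Sum256([]byte(fmt.Sprintf("%s:%d", challenge, nonce)))
    	return leadingZeroBits(sum) >= difficulty
    }

    // solve is what the client-side script would do: brute-force a nonce.
    func solve(challenge string) uint64 {
    	for nonce := uint64(0); ; nonce++ {
    		if verify(challenge, nonce) {
    			return nonce
    		}
    	}
    }

    func main() {
    	ch := newChallenge()
    	nonce := solve(ch)
    	fmt.Printf("challenge=%s nonce=%d valid=%v\n", ch, nonce, verify(ch, nonce))
    }

The asymmetry is the whole point: the server does one hash per check, while a crawler hammering millions of URLs has to pay the solve cost on every fresh challenge.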

Edit: oh heck yes, a new submission with 1.7 TB of logs of what AI crawlers do. Now we can machine-learn some better rate limiting techniques! https://news.ycombinator.com/item?id=44450352 https://huggingface.co/datasets/lee101/webfiddle-internet-ra...


This isn't as helpful as you think. If it included all of the HTTP headers the bots sent, plus other metadata like the TLS ClientHelloInfo, it would be a lot more useful.
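For anyone capturing their own logs and wanting that metadata: Go's TLS stack hands you the ClientHello via GetConfigForClient, so you can record cipher suites, ALPN protocols, and supported versions next to the request headers. Rough sketch; the field selection and the per-address lookup are my own choices, not anything the dataset does:

    package main

    import (
    	"crypto/tls"
    	"log"
    	"net/http"
    	"sync"
    )

    var (
    	mu sync.Mutex
    	// hellos remembers the last ClientHello per remote address so the HTTP
    	// handler can log it next to the request headers. (Unbounded map: fine
    	// for a sketch, not for production.)
    	hellos = map[string]*tls.ClientHelloInfo{}
    )

    func main() {
    	srv := &http.Server{
    		Addr: ":8443",
    		TLSConfig: &tls.Config{
    			// GetConfigForClient runs for every handshake and receives the raw
    			// ClientHello: cipher suites, ALPN protocols, supported versions, SNI.
    			GetConfigForClient: func(hello *tls.ClientHelloInfo) (*tls.Config, error) {
    				mu.Lock()
    				hellos[hello.Conn.RemoteAddr().String()] = hello
    				mu.Unlock()
    				return nil, nil // nil, nil means "keep using the server's default config"
    			},
    		},
    		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
    			mu.Lock()
    			hello := hellos[r.RemoteAddr]
    			mu.Unlock()
    			if hello != nil {
    				log.Printf("%s %q ua=%q ciphers=%v alpn=%v versions=%v",
    					r.RemoteAddr, r.URL.Path, r.UserAgent(),
    					hello.CipherSuites, hello.SupportedProtos, hello.SupportedVersions)
    			}
    			w.Write([]byte("ok\n"))
    		}),
    	}
    	// Needs a cert/key pair on disk; the filenames are placeholders.
    	log.Fatal(srv.ListenAndServeTLS("cert.pem", "key.pem"))
    }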

There are headers, but I hadn't noticed that they're the response headers. :facepalm:


