
I'm not an expert on website hosting, but after reading some of the blog posts on Anubis, those people were truly at their wits' end trying to block AI scrapers with techniques like the ones you imply.

https://xeiaso.net/blog/2025/anubis/ links to https://pod.geraspora.de/posts/17342163 which says:

> If you try to rate-limit them, they'll just switch to other IPs all the time. If you try to block them by User Agent string, they'll just switch to a non-bot UA string (no, really). This is literally a DDoS on the entire internet.

My gut is that the switching between IP addresses can't be that hard to follow, that the access pattern is pretty obvious to track across identities.

But it would be non-trivial: it would entail crafting new systems and doing new work per request (when traffic starts to be elevated, as a first gate).
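
As a minimal sketch of what that per-request work could look like: count requests per behavioral fingerprint (User-Agent, header names, Accept-Language) instead of per IP, so a crawler that rotates addresses still accumulates into one bucket. The choice of traits and the threshold here are hypothetical, just an illustration of the idea, not anything Anubis actually does.

    package main

    import (
        "crypto/sha256"
        "fmt"
        "net/http"
        "sort"
        "strings"
        "sync"
    )

    var (
        mu     sync.Mutex
        counts = map[string]int{}
    )

    // fingerprint hashes traits that tend to stay constant while IPs rotate:
    // the User-Agent, the set of header names, and the Accept-Language value.
    // (Hypothetical trait selection; a real deployment would tune this.)
    func fingerprint(r *http.Request) string {
        names := make([]string, 0, len(r.Header))
        for k := range r.Header {
            names = append(names, k)
        }
        sort.Strings(names)
        h := sha256.Sum256([]byte(r.UserAgent() + "|" +
            strings.Join(names, ",") + "|" + r.Header.Get("Accept-Language")))
        return fmt.Sprintf("%x", h[:8])
    }

    func handler(w http.ResponseWriter, r *http.Request) {
        fp := fingerprint(r)
        mu.Lock()
        counts[fp]++
        n := counts[fp]
        mu.Unlock()
        if n > 1000 { // arbitrary threshold, for illustration only
            http.Error(w, "slow down", http.StatusTooManyRequests)
            return
        }
        fmt.Fprintln(w, "ok")
    }

    func main() {
        http.HandleFunc("/", handler)
        http.ListenAndServe(":8080", nil)
    }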

Just making the client run through some math gauntlet is an obvious win that aggressors probably can't break. But I still think there's probably some really good low-hanging fruit for identifying and rate limiting even these rather more annoying traffic patterns, that the behavior itself leaves a fingerprint that can't be hidden and which can absolutely be rate limited. And I'd like to see that area explored.
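
For context, the "math gauntlet" is typically a proof-of-work challenge: the client has to find a nonce whose hash of the server's challenge has a given number of leading zero bits, which is expensive to find and cheap to verify. A rough sketch of that idea (not Anubis's actual scheme; the encoding and difficulty here are made up):

    package main

    import (
        "crypto/sha256"
        "encoding/binary"
        "fmt"
        "math/bits"
    )

    // verify checks that sha256(challenge || nonce) starts with at least
    // `difficulty` zero bits -- the property the client has to satisfy.
    func verify(challenge string, nonce uint64, difficulty int) bool {
        buf := make([]byte, 8)
        binary.BigEndian.PutUint64(buf, nonce)
        sum := sha256.Sum256(append([]byte(challenge), buf...))
        zeros := 0
        for _, b := range sum {
            z := bits.LeadingZeros8(b)
            zeros += z
            if z < 8 {
                break
            }
        }
        return zeros >= difficulty
    }

    // solve brute-forces a nonce; this loop is the cost the client pays.
    func solve(challenge string, difficulty int) uint64 {
        for n := uint64(0); ; n++ {
            if verify(challenge, n, difficulty) {
                return n
            }
        }
    }

    func main() {
        nonce := solve("example-challenge", 16) // ~65k hashes on average
        fmt.Println("nonce:", nonce, "valid:", verify("example-challenge", nonce, 16))
    }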

Edit: oh heck yes, a new submission with 1.7 TB of logs of what AI crawlers do. Now we can machine-learn some better rate limiting techniques! https://news.ycombinator.com/item?id=44450352 https://huggingface.co/datasets/lee101/webfiddle-internet-ra...


This isn't as helpful as you think. If it included all of the HTTP headers that the bots sent and other metadata like TLS ClientHelloInfo it would be a lot more useful.
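
For anyone collecting their own logs, Go's crypto/tls exposes that handshake metadata through GetConfigForClient, which receives a ClientHelloInfo per connection. A rough sketch (the log format and cert paths are placeholders):

    package main

    import (
        "crypto/tls"
        "fmt"
        "log"
        "net/http"
    )

    func main() {
        srv := &http.Server{
            Addr: ":8443",
            TLSConfig: &tls.Config{
                // Runs for every handshake and sees the raw ClientHello
                // metadata -- the signal the dataset above is missing.
                GetConfigForClient: func(hi *tls.ClientHelloInfo) (*tls.Config, error) {
                    log.Printf("sni=%q ciphers=%v curves=%v alpn=%v versions=%v",
                        hi.ServerName, hi.CipherSuites, hi.SupportedCurves,
                        hi.SupportedProtos, hi.SupportedVersions)
                    return nil, nil // nil keeps the server's default config
                },
            },
            Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
                fmt.Fprintln(w, "ok")
            }),
        }
        // cert.pem / key.pem are placeholder paths for illustration.
        log.Fatal(srv.ListenAndServeTLS("cert.pem", "key.pem"))
    }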

There are headers, but I hadn't noticed that they are the response headers. :facepalm:


