The biggest issue for me is that they masquerade their User-Agent strings. Regardless of whether they are slow and respectful crawlers, they should clearly identify themselves, provide a documentation URL, and obey robots.txt. Without that, I have to play a frankly tiring game of cat and mouse, wasting my time and my users' time (they have to put up with some form of captcha or PoW challenge).
I've been an active lurker in the self-hosting community and I'm definitely not alone. Nearly everyone hosting public-facing websites, particularly those whose content is juicy for LLMs, has been facing these issues. It costs more time and money to deal with this, when a simple User-Agent block would be much cheaper and trivial to set up and maintain.
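To illustrate how trivial that would be if crawlers identified themselves: a minimal sketch of a User-Agent block as WSGI middleware, assuming a Python/WSGI stack (the middleware name and the bot list are just examples of self-identifying crawler tokens, not an exhaustive list).

```python
# Minimal sketch: reject requests from self-identifying crawlers by User-Agent.
# Assumes a Python/WSGI stack; the tokens below are illustrative examples.
BLOCKED_UA_SUBSTRINGS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")

class BlockByUserAgent:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if any(token in ua for token in BLOCKED_UA_SUBSTRINGS):
            # Honest crawlers get a clean 403; no captcha or PoW needed.
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return self.app(environ, start_response)
```

A dozen lines, no extra infrastructure, no impact on real users. The whole problem is that masquerading crawlers make this simple approach useless.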
sigh