With that opinion, are you also suggesting that we ban ad blockers? Because it's...

ijk · 2025-07-02T15:14:50 1751469290

What I want to know is if the flood of scraping everyone has been complaining about is coming from people trying to scrape for training or bots doing RAG search.

I get that everyone wants data, but presumably the big players already scraped the web. Do they really need to do it again? Or is it bit players reproducing data that's likely already in the training set? Or is it really that valuable to have your own scraped copy of internet scale data?

I feel like I'm missing something here. My expectation is that RAG traffic is going to be orders of magnitude higher than scraping for training. Not that it would be easy to measure from the outside.

mattcollins · 2025-07-02T15:27:59 1751470079

I wondered about this, too.

Cloudflare have some recent data about traffic from bots (https://blog.cloudflare.com/from-googlebot-to-gptbot-whos-cr...) which indicates that, for the time being, the overwhelming majority of the bot requests are for AI training and not for RAG.

progmetaldev · 2025-07-03T00:11:40 1751501500

I believe it's both. We're at a place where legislation hasn't really declared what is and isn't allowed. These scrapers are acting like Googlebot or any other search engine crawler, and trying to find any kind of new content that might be of value to their users.

New data is still being added online daily (probably hourly, if not more often) by humans, and the first ones to gain access could be the "winners," particularly if their users happen to need up to date data (and the service happens to have scraped it). Just like with search engines/crawlers, there's also the big players that may respect your website, but there are also those that don't use rate-limiting or respect robots.txt.

wiether · 2025-07-02T20:03:14 1751486594

You should ask Zuck, since, for what we've seen and what we were ask to act against, Meta is the main culprit in scraping every single page of websites, multiple times a day.

And I'm talking about ecommerce websites, with their bot scraping every variation of each product, multiple times a day.

postalcoder · 2025-07-02T14:34:39 1751466879

I don't think we should ban ad blockers, but I also think it's fair to suggest that the loss of organic traffic could be affecting the incentive to create new digital content, at least as much as the fear of having your content absorbed into an LLM's training data.

Boldened15 · 2025-07-02T15:11:25 1751469085

IMO the backlash against LLMs is more philosophical, a lot of people don’t like them or the idea of one learning from their content. Unless your website has some unique niche information unavailable anywhere else there’s no direct personal risk. RAG would be a more direct threat if anything.

toomuchtodo · 2025-07-02T15:46:20 1751471180

It's really about who is getting the value from the work of the content. If content creators of all sorts have their work consumed by LLMs, and LLM orgs charge for it can capture all the value, why should people create to have their work vacuumed up for the robot's benefit? For exposure? You can't eat or pay rent with exposure. Humans must get paid, and LLMs (foundational models and output using RAG) cannot improve without a stream of works and data humans create.

Whether you call it training or something else is irrelevant, it's really exploitation of human work and effort for AI shareholder returns and tech worker comp (if those who create aren't compensated). And the technocracy has not been, based on the evidence, great stewards of the power they obtain through this. Pay the humans for their work.

o11c · 2025-07-02T19:00:09 1751482809

It's not philosophical, it's economical.

AI scrapers increase traffic by maybe 10x (this varies per site) but provide no real value whatsoever to anyone. If you look at various forms of "value":

* Saying "this uses AI" might make numbers go up on the stock market if you manage to persuade people it will make numbers go up (see also: the market will remain irrational longer than you can remain solvent).

* Saying "this uses AI" might fulfill some corporate mandate.

* Asking AI to solve a problem (for which you would actually use the solution) allows you to "launder" the copyright of whatever source it is paraphrasing (it's well established that LLMs fail entirely if a question isn't found within their training set). Pirating it directly provides the same value, with significantly less errors/handholding.

* Asking AI to entertain you ... well, there's the novelty factor I guess, but even if people refuse to train themselves out of that obsession, the world is still far too full of options for any individual to explore them all. Even just the question of "what kind of ways can I throw some kind of ball around" has more answers than probably anyone here knows.

What am I missing?

robrenaud · 2025-07-03T01:35:23 1751506523

Why are 100s of millions of people using AI if it is providing no value?

o11c · 2025-07-03T02:22:15 1751509335

Because it's injected into a previously working product - even if it makes it worse - and automatically injects its ideas. That counts as "somebody using it".

Because it's bundled with other products that do provide value, and that counts as "someone using it".

Because some middle manager has declared that I must add AI to my workflow ... somehow. Whatever, if they want to pay me to accomplish less than usual, that's not my problem.

Because it's a cool new toy to play around with a bit.

Because surely all these people saying "AI is useful now" aren't just lying shills, so we'd better investigate their claims again ... nope, still terminally broken.