I write online (comments here, open source software, blogging, etc) because I have ideas I want to share. Whether it's "I did a thing and here's how" or "we should change policy in this specific way" or "does anyone know how to X" I'm happy for this to go into training models just like I'm happy for it to go into humans reading.
Thank you for having this attitude. I have never attempted any blogging because I always figured no one is actually going to read it. With LLMs, however, I know they will. I actually see this as a motivation to blog, as we are in a position to shape this emerging knowledge base. I don't find it discouraging that others may be profiting off our freely published work, just as I myself have benefited tremendously from open source and the freely published works of others.
This is an interesting take, thanks for sharing. I wonder how someone should adjust their blogging if they believe their primary audience will be LLMs.
There are a few instances of things I stated (about historical topics or very narrow topics in sociology) that were incorrect. LLMs scraped these off of web forums or other places, and now these bogus “facts” are permanently embedded into LLMs, because nobody else ever really talked about the specific topic.
Most amusingly, someone cited LLM generated output about this telling me how this “fact” is true when I was telling them it’s not true.
Tbh, that content I'm mostly fine with. My only real issue is that people are making trillions off the free labor of people like you and me, leaving us less time to create that OSS and those blogs. But this isn't new to AI, it is just scaled.
What I do care about is the theft of my identity. A person may learn from the words I write but that person doesn't end up mimicking the way I write. They are still uniquely themselves.
I'm concerned that the more I write the more my text becomes my identifier. I use a handle so I can talk more openly about some issues.
We write OSS and blog because information should be free. But that information is then being locked behind paywalls and becoming more difficult to be found through search. Frankly, that's not okay.
> What I do care about is the theft of my identity. A person may learn from the words I write but that person doesn't end up mimicking the way I write. They are still uniquely themselves.
Of course they do, to some extent. It's just been infeasible to track the exact "graph of influence"; that's literally how humans have learned to speak and write for as long as we've had language and writing.
> I'm concerned that the more I write the more my text becomes my identifier. I use a handle so I can talk more openly about some issues.
That's a much more serious concern, in my view. But I believe that LLMs are both the problem and solution here: "Remove style entropy" is just a prompt away, these days.
> A person may learn from the words I write but that person doesn't end up mimicking the way I write.
Oh, I wish I could get AI to mimic the way I write! I'd pay money for it. I often want to type up an email/doc/whatever but don't because of occasional RSI issues. If I could get an AI to type it up for me while still sounding like me - that would be a big boon for my health.
Oh yeah, I use dictation and then clean it up with GPT. It's awesome. But I speak very differently from how I write. So I'd like to dictate it, and then have it rewrite it in my writing style.
> people are making trillions off the free labor of people like you and me
I read "No Discrimination Against Fields of Endeavor" to also include LLMs and especially the cases that we most deeply disagree with.
Either we believe in the principles of OSS or we do not. If you do not like the idea of your intellectual property being used for commercial purposes then this model is definitely not for you.
There is no shame in keeping your source code and other IP a secret. If you have strong expectations of being compensated for your work, then perhaps a different licensing and distribution model is what you are after.
> that information is then being locked behind paywalls and becoming more difficult to be found through search
Sure, if you give up and delete everything. No one is forcing you to put your blog and GH repos behind a paywall.
> Either we believe in the principles of OSS or we do not. If you do not like the idea of your intellectual property being used for commercial purposes then this model is definitely not for you.
I've been writing open source for more than 20 years
I gave away my work for free with one condition: leave my name on it (MIT license)
the AI parasites then strip the attribution out
they are the ones violating the principles of open source
> then perhaps a different licensing and distribution model is what you are after.
I've now stopped producing open source entirely
and I suggest every developer does the same until the legal position is clarified (in our favour)
> I suggest every developer does the same until the legal position is clarified (in our favour)
There are a lot of people developing open source software with a wide range of goals. In my case, I'm totally happy for LLMs to learn from my coding, just like they've learned from millions of other people's. I wouldn't want them to duplicate it verbatim, but (due to copyright filters + that not usually being the best way to solve a problem) they don't.
> Either we believe in the principles of OSS or we do not.
What about respecting licenses?
Seriously, don't lick the boot. We can recognize that there's complexity here. Trivializing everything only helps the abusers.
Giving credit where credit is due is not too much to ask. Other people making money off my work can be good[0]. Taking credit for it is insulting.
[0] If you're not making much, who cares. But if you're a trillion dollar business you can afford to give a little back. Here's the truth, OSS only works if we get enough money and time to do the work. That's either by having a good work life balance and good pay or enough donations coming in. We've been mostly supported by the former, but that deal seems to be going away
I think this may be too much of a "literal" interpretation of OSS without really considering the social contract many OSS supporters believe in, wherein users of OSS will act in good faith and might eventually reciprocate for the benefits they're getting, e.g. the way companies have slowly accepted paying their own employees to contribute to projects openly, releasing their own open source code, respecting the spirit of OSS licenses, sponsoring the developers of the thing they use, etc.
I think it's entirely fair that even staunch supporters of OSS get turned off when AI companies scrape their work to ingest into a black box regurgitator and then turn around and tell the world how their AI will make trillions of dollars by taking away the jobs of those obsolete OSS developers, showing no intention of ever giving back to the community.
Whether training on code is fair use is still an open legal question, and it may well be fair use. The way a license works is by saying "you have my permission to use this code as long as you follow these conditions", but if no license is required then the conditions are irrelevant.
There is an active case on this, where Microsoft has been sued over GitHub Copilot, and it has been slowly moving through the court system since 2022. Most of the claims have been dismissed, and the prediction market is at 11%: https://manifold.markets/JeffKaufman/will-the-github-copilot...
Let's actually look at the MIT license, a very permissive license
> Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to ***use***, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
> The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
So, you can use it but need to cite the usage. It's not that hard. Fair use if you just acknowledge usage.
Is it really that difficult to acknowledge that you didn't do everything on your own? People aren't asking for money. It's just basic acknowledgement.
Forget the courts for a second, just ask yourself what is the right thing to do. Ethically.
> Forget the courts for a second, just ask yourself what is the right thing to do
Forgetting the courts, whether reading the source code and learning from it is intended to count as "use" is not clear to me, and I would have guessed no. Using a tool and examining a tool are pretty different.
Human reading code? Ambiguous. But I think you're using it. Running code? Not ambiguous.
Machine processing code? I don't think that's ambiguous. It's using the code. A person is using the code to make their machine better.
This really isn't that hard.
Let's think about it this way. How do you use a book?
I think you need to be careful that you're not justifying the answer you want and instead are looking for what the right answer is. I'm saying this because you quoted me saying "what is right" and you just didn't address it. To quote Feynman (<- look, I cited my work. I fulfilled the MIT license obligations!)
> The first principle is that you must not fool yourself, and you are the easiest person to fool.
The key question is whether it is sufficiently "transformative". See Authors Guild vs Google, Kelly vs Arriba Soft, and Sony vs Universal. This is a way a judge could definitely rule, and at this point I think it is the most likely outcome.
> Microsoft will forever be a pariah if they get away with this.
I doubt this. Talking to developers, it seems like the majority are pretty excited about coding assistants, including the ones that many companies other than Microsoft (especially Anthropic) are putting out.