To quote from the article: "These repositories, belonging to more than 16,000 organizations, were originally posted to GitHub as public, but were later set to private [..]" Once things are public, they will forever remain public (in some form). That's how the internet works.
tl;dr Bing indexed and cached a public repository, then made it available to its AI chat. Later, the repository author switched the repository to private and learned the hard way how the internet works. And the story only gets better: the author is the founder of a “cybersecurity” company.
I'm baffled. There isn't even the seed of a story here, just someone not understanding that if you put data out there, the data is [checks notes] out there.
There are also the more fundamental security issues GitHub has where, after making a repo private (and in a few other cases, e.g. related to forks), commits/content _which were never public_ (e.g. pushed after it was made private) remain publicly accessible.
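A minimal sketch of how to check this for a specific commit (assuming reqwest with the "blocking" feature; OWNER, UPSTREAM_REPO and COMMIT_SHA are hypothetical placeholders, not values from any real incident):

// Rough illustration of the fork-network issue, not a definitive exploit:
// if you know the SHA of a commit pushed to a repo that later went private,
// the public upstream repository will often still serve it by SHA.
// OWNER, UPSTREAM_REPO and COMMIT_SHA are hypothetical placeholders.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let url = format!(
        "https://api.github.com/repos/{}/{}/commits/{}",
        "OWNER", "UPSTREAM_REPO", "COMMIT_SHA"
    );
    let resp = reqwest::blocking::Client::new()
        .get(url.as_str())
        .header("User-Agent", "dangling-commit-check") // GitHub's API requires a User-Agent
        .send()?;
    // A 200 response means the commit is still reachable even though the
    // repository that introduced it is no longer public.
    println!("{url} -> {}", resp.status());
    Ok(())
}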
Sorry for the late reply. We don't have any dedicated marketers watching the communications. We did take inspiration from it and other projects, and we have just added a credit statement to our GitHub page; hope that addresses your concern. As for being OSS, we are not ready to do that right now, on either the code or the business-strategy side. We may do it later.
However, the broader issue remains: Microsoft has successfully infiltrated OSS and its organizations by hiring and donating. It would not surprise me at all if they now hire people with an ostensibly "freedom fighter" background for credibility.
Look at how many people here cite his (former?) membership in the Pirate Party for credibility! Party membership means nothing. Politicians (in general!) change their minds, can be bought, etc. The Green Party in Germany started out as a peace party and has been used repeatedly to lend credibility to the Kosovo and other wars.
Today, we are pleased to announce that Microsoft will once again be supporting Open Data Day by providing mini-grants to organisations to help them run events; the call will launch on Open Data Day 2022.
They also supported "Open Data Day 2021". Sounds like a nice trojan horse to influence EU legislation through purported activists.
That’s no longer true. Copilot uses the same GPT-3.5 model as, well, ChatGPT. If it were trained on just GitHub projects, the chat features wouldn’t work at all.
You're assuming that Copilot Chat and the regular completion are the same model. Do you have a source that says so? I'd assumed that they were two different models, since they're quite different tasks.
Footnote 1 on page 2 explicitly mentions the 3.5 model, and the research in this paper is only about autocompletion: https://arxiv.org/pdf/2306.15033.pdf
Lastly, OpenAI states on the original Codex page: “OpenAI Codex is a descendant of GPT-3; its training data contains both natural language and billions of lines of source code from publicly available sources, including code in public GitHub repositories.” It included GitHub repos, but it was never only GitHub repos. https://openai.com/blog/openai-codex
As I said to the other commenter, I specifically avoided saying "only"; I said "primarily", and I should have clarified that I meant "primarily fine-tuned". My point is simply that it is far more likely to spit out results that are patterned after GitHub than results that are patterned after any inter-programmer communications.
Also, I wasn't contesting that autocomplete uses GPT-3.5 as the base model; I was contesting the idea that it uses the same derivative model as chat.
You don't need to use the chat to test whether it was exclusively trained on GitHub or not.
I started typing this in my IDE in a Rust project I already had open:
// the following function prints out the best soccer players of all time, which teams they played for, and why they were great as a hashmap
fn soccer_greats() -> HashMap<String, (Vec<String>, String)> {
let mut hm = HashMap::new();
It completed with the following:
hm.insert("Pele", (vec!["Santos", "New York Cosmos"], "He scored over 1000 goals in his career".to_string()));
hm.insert("Maradona", (vec!["Boca Juniors", "Barcelona", "Napoli"], "He won the world cup with Argentina".to_string()));
hm.insert("Messi", (vec!["Barcelona"], "He has won the Ballon d'Or 6 times".to_string()));
hm.insert("Ronaldo", (vec!["Barcelona", "Inter Milan", "Real Madrid", "AC Milan", "Corinthians"], "He won the world cup with Brazil".to_string()));
hm.insert("Zidane", (vec!["Juventus", "Real Madrid"], "He won the world cup with France".to_string()));
}
I don't believe that information is going to be on GitHub anywhere, but I could be mistaken.
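As an aside, the completion as pasted doesn't compile on its own (no use std::collections::HashMap; import, &str keys and team names where the signature asks for String, and the map is never returned). A cleaned-up sketch of the same idea, with the remaining entries elided:

use std::collections::HashMap;

// Returns a map from player name to (teams they played for, why they were great).
fn soccer_greats() -> HashMap<String, (Vec<String>, String)> {
    let mut hm = HashMap::new();
    hm.insert(
        "Pele".to_string(),
        (
            vec!["Santos".to_string(), "New York Cosmos".to_string()],
            "He scored over 1000 goals in his career".to_string(),
        ),
    );
    hm.insert(
        "Messi".to_string(),
        (
            vec!["Barcelona".to_string()],
            "He has won the Ballon d'Or 6 times".to_string(),
        ),
    );
    // ...the other entries follow the same pattern...
    hm
}

fn main() {
    for (player, (teams, reason)) in soccer_greats() {
        println!("{player} ({teams:?}): {reason}");
    }
}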
You're addressing a straw man; I never claimed it was "exclusively" trained on GitHub. I said "primarily", though I should have been specific and said "primarily fine-tuned".
In the context of the person I replied to, the point is that it isn't made up primarily of a bunch of communications between programmers.
They did not prompt at all. They used GitHub’s code search to find projects where the repo owner specified that the code was generated “by Copilot” and the authors took that at face value for all code in the project. Whether the code was actually suggested by Copilot is not at all analyzed in the paper. As such, the results are highly questionable.