Is it just the model that needs to be open source? I thought the big secret sauc...

dragonwriter · on Nov 19, 2023

No, the model is useful without the dataset, but it's not functionally "open source", because while you can tune it if you have the training code, you can't replicate it or, more importantly, train it from scratch with a modified, but not completely new, dataset. (Also, understanding the existing training data helps you understand how to structure data to train that particular model, whether from scratch with a new or modified dataset, or for finetuning.)

At least, that's my understanding.

PeterisP · on Nov 19, 2023

For various industry-specific or specialized-task models (e.g. recognizing dangerous events in self-driving car scenarios), having appropriate data is often the big secret sauce. However, for the specific case of LLMs, there are reasonable, sufficiently large datasets available to the public, and even the specific RLHF adaptations aren't a limiting secret sauce, because there are techniques to extract them from the available commercial models.