Hacker News

Is it just the model that needs to be open source?

I thought the big secret sauce was the data used to train the models. Without it, the model itself is quite literally useless.




No, the model is useful without the dataset, but it's not functionally "open source": while you can tune it if you have the training code, you can't replicate it or, more importantly, train it from scratch with a modified, but not completely new, dataset. (Also, understanding the existing training data helps you understand how to structure data to train that particular model, whether that's from scratch with a new or modified dataset, or for finetuning.)

At least, that's my understanding.


For various industry-specific or specialized-task models (e.g. recognizing dangerous events in a self-driving car scenario), having appropriate data is often the big secret sauce. For the specific case of LLMs, however, there are reasonable, sufficiently large datasets available to the public, and even the specific RLHF adaptations aren't a limiting secret sauce, because there are techniques to extract them from the available commercial models.



