The Information’s report from earlier this month claimed that GPT-5 was only developed in the last 1-2 months, after some sort of breakthrough in training methodology.
> As recently as June, the technical problems meant none of OpenAI’s models under development seemed good enough to be labeled GPT-5, according to a person who has worked on it.
But it could be that this refers to post-training and the base model was developed earlier.
My understanding is that training data cut-offs and the dates on which a model is trained are independent things.
AI labs gather training data and then do a ton of work to process it, filter it, etc.
Model training teams run different parameters and techniques against that processed training data.
It wouldn't surprise me to hear that OpenAI had collected data up to September 2024, dumped that data in a data warehouse of some sort, then spent months experimenting with ways to filter and process it and different training parameters to run against it.
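Roughly the kind of decoupling I mean, sketched in Python (all the names, dates, and numbers here are made up for illustration, not anything OpenAI actually does):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Document:
    text: str
    collected: date

def snapshot(corpus: list[Document], cutoff: date) -> list[Document]:
    """Freeze the raw corpus at a cut-off date; everything downstream reuses this."""
    return [doc for doc in corpus if doc.collected <= cutoff]

def filter_and_process(docs: list[Document], min_length: int) -> list[Document]:
    """One of many candidate filtering passes tried over the following months."""
    return [doc for doc in docs if len(doc.text) >= min_length]

raw_corpus: list[Document] = []  # filled by ongoing crawling / licensing / collection
frozen = snapshot(raw_corpus, cutoff=date(2024, 9, 30))  # the "knowledge cut-off"

# Experimentation happens long after the cut-off: sweep filtering and
# training parameters against the same frozen snapshot.
for min_length in (100, 200, 500):
    candidate = filter_and_process(frozen, min_length)
    # train_proxy_model(candidate)  # hypothetical: evaluate each variant cheaply
```

The point being that the cut-off is a property of the frozen snapshot, not of when any particular training run kicks off.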
Is the filtering and processing very specific to the data set?
I'd kind of assume that they would dump data into the data warehouse in September 2024, then in parallel continue data collection and do the months of work to determine how best to filter it, process it, select training parameters, etc. Then, once that was locked in, do a final update of the warehouse to, say, December 2024 data for the final training run.
Do the filtering, processing, and training parameters need to be fairly fine-tuned to the specific data set?
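To make that parallel-track idea concrete, here's a rough sketch that assumes the tuned settings do transfer cleanly to a refreshed snapshot, which is exactly the open question; every name and value is hypothetical:

```python
from datetime import date

# Filter/processing settings tuned over months of experiments against the
# September 2024 snapshot (values invented for illustration).
TUNED_FILTERS = {"min_length": 200, "max_dup_ratio": 0.3, "quality_threshold": 0.8}

def build_training_set(cutoff: date, filters: dict) -> str:
    """Re-run the same filtering/processing against whichever snapshot is chosen:
    query the warehouse up to `cutoff`, apply `filters`, then tokenize and shard
    (those steps are elided here)."""
    return f"training-set@{cutoff.isoformat()} filters={filters}"

dev_set = build_training_set(date(2024, 9, 30), TUNED_FILTERS)     # tuning loop
final_set = build_training_set(date(2024, 12, 31), TUNED_FILTERS)  # final run
```

If the answer is that the filters and parameters are tightly coupled to the exact data, then reusing the tuned settings on the December snapshot wouldn't be safe and you'd have to re-validate against it.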
https://www.theinformation.com/articles/inside-openais-rocky...
https://archive.ph/d72B4