The Information’s report from earlier this month claimed that GPT-5 was only developed in the last 1-2 months, after some sort of breakthrough in training methodology.
> As recently as June, the technical problems meant none of OpenAI’s models under development seemed good enough to be labeled GPT-5, according to a person who has worked on it.
But it could be that this refers to post-training and the base model was developed earlier.
My understanding is that training data cut-offs and the dates on which a model is trained are independent things.
AI labs gather training data and then do a ton of work to process it, filter it, etc.
Model training teams run different parameters and techniques against that processed training data.
It wouldn't surprise me to hear that OpenAI had collected data up to September 2024, dumped that data in a data warehouse of some sort, then spent months experimenting with ways to filter and process it and different training parameters to run against it.
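Roughly the kind of decoupling I mean, sketched in Python (all the names, dates, and numbers here are made up for illustration, not anything OpenAI actually does):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Document:
    text: str
    collected: date

def snapshot(corpus: list[Document], cutoff: date) -> list[Document]:
    """Freeze the raw corpus at a cut-off date; everything downstream reuses this."""
    return [doc for doc in corpus if doc.collected <= cutoff]

def filter_and_process(docs: list[Document], min_length: int) -> list[Document]:
    """One of many candidate filtering passes tried over the following months."""
    return [doc for doc in docs if len(doc.text) >= min_length]

raw_corpus: list[Document] = []  # filled by ongoing crawling / licensing / collection
frozen = snapshot(raw_corpus, cutoff=date(2024, 9, 30))  # the "knowledge cut-off"

# Experimentation happens long after the cut-off: sweep filtering and
# training parameters against the same frozen snapshot.
for min_length in (100, 200, 500):
    candidate = filter_and_process(frozen, min_length)
    # train_proxy_model(candidate)  # hypothetical: evaluate each variant cheaply
```

The point being that the cut-off is a property of the frozen snapshot, not of when any particular training run kicks off.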
Is the filtering and processing very specific to the data set?
I'd kind of assume that they would dump data into the data warehouse in September 2024, then in parallel continue data collection and do the months of work to determine how best to filter it, process it, select training parameters, etc. Then, once that was locked in, do a final update of the warehouse to, say, December 2024 data for the final training run.
Do the filtering, processing, and training parameters need to be fairly fine-tuned to the specific data set?
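To make that parallel-track idea concrete, here's a rough sketch that assumes the tuned settings do transfer cleanly to a refreshed snapshot, which is exactly the open question; every name and value is hypothetical:

```python
from datetime import date

# Filter/processing settings tuned over months of experiments against the
# September 2024 snapshot (values invented for illustration).
TUNED_FILTERS = {"min_length": 200, "max_dup_ratio": 0.3, "quality_threshold": 0.8}

def build_training_set(cutoff: date, filters: dict) -> str:
    """Re-run the same filtering/processing against whichever snapshot is chosen:
    query the warehouse up to `cutoff`, apply `filters`, then tokenize and shard
    (those steps are elided here)."""
    return f"training-set@{cutoff.isoformat()} filters={filters}"

dev_set = build_training_set(date(2024, 9, 30), TUNED_FILTERS)     # tuning loop
final_set = build_training_set(date(2024, 12, 31), TUNED_FILTERS)  # final run
```

If the answer is that the filters and parameters are tightly coupled to the exact data, then reusing the tuned settings on the December snapshot wouldn't be safe and you'd have to re-validate against it.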
https://www.theinformation.com/articles/inside-openais-rocky...
https://archive.ph/d72B4