I would extend the same reasoning to Mistral as to DeepSeek in terms of where they sit on the innovation pipeline. That doesn’t have to be a bad thing (when done fairly); the point is just to remain mindful that it’s not a fair comparison (to go back to the original point).
There is a message from the founders of Mistral from when they accidentally leaked a work-in-progress version that was a fine-tune of LLaMA, and there are a few hints pointing to that.
Like:
> What is the architectural difference between Mistral and Llama? HF Mistral seems the same as Llama except for sliding window attention.
So even their “trained from scratch” models like the 7B aren’t that impressive if they just pick the dataset and tweak a few parameters.
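
(For anyone unfamiliar with the term in that quote: sliding window attention restricts each token to attending over the last W positions instead of the whole prefix; Mistral 7B's published window is 4096. A toy sketch of the mask difference, with tiny made-up dimensions just so it's printable:)

```python
import torch

def causal_mask(T):
    # Full causal attention: token i can attend to every j <= i.
    return torch.tril(torch.ones(T, T, dtype=torch.bool))

def sliding_window_mask(T, window):
    # Sliding window attention: token i attends only to j in (i - window, i].
    i = torch.arange(T).unsqueeze(1)
    j = torch.arange(T).unsqueeze(0)
    return (j <= i) & (j > i - window)

# Tiny toy sizes so the masks are readable when printed;
# Mistral 7B's real window is 4096 over much longer sequences.
print(causal_mask(6).int())
print(sliding_window_mask(6, window=3).int())
```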
Right, so Mistral accidentally released one internal prototype that was a fine-tuned LLaMA. How does it follow from there that their other models are the same? Given that the weights are open, we can look, and nope, they're not the same. They don't even use the same vocabulary!
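
A minimal sketch of how to check this yourself with the `transformers` library (assuming the Hub repos `mistralai/Mistral-7B-v0.1` and `meta-llama/Llama-2-7b-hf`, both gated behind a license acceptance; LLaMA-2 is the openly distributed point of comparison here):

```python
from transformers import AutoConfig, AutoTokenizer

mistral_id = "mistralai/Mistral-7B-v0.1"
llama_id = "meta-llama/Llama-2-7b-hf"

# Compare the actual token inventories, not just the vocab sizes.
m_vocab = set(AutoTokenizer.from_pretrained(mistral_id).get_vocab())
l_vocab = set(AutoTokenizer.from_pretrained(llama_id).get_vocab())
print("shared tokens:", len(m_vocab & l_vocab))
print("Mistral-only: ", len(m_vocab - l_vocab))
print("LLaMA-only:   ", len(l_vocab - m_vocab))

# The configs differ too: Mistral declares a sliding window and
# grouped-query attention; LLaMA-2 7B uses neither.
m_cfg = AutoConfig.from_pretrained(mistral_id)
l_cfg = AutoConfig.from_pretrained(llama_id)
for key in ("sliding_window", "num_key_value_heads"):
    print(key, getattr(m_cfg, key, None), "vs", getattr(l_cfg, key, None))
```

If the vocabularies really diverge, the tokenizer files alone settle it, no GPU required.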
And I have no idea what you mean by "they just pick the dataset". The LLaMA training set is not publicly available - it's open weights, not open source (i.e. not reproducible).