Nobody has mentioned Approximate Nearest Neighbor search (aka vector search), which IMO is a fundamental advancement in data structures.
Basically: given an indexed set of a million (billion, trillion...) roughly 1000-dimensional vectors, and a query vector of the same dimensionality, find the closest vector in the set. This is "nearest neighbor" search, and there have been increasingly sophisticated approaches:

http://ann-benchmarks.com

https://haystackconf.com/us2022/talk-6/

And it has spawned companies like Weaviate, Pinecone, etc etc (half a dozen others).
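For concreteness, here's the exact (brute-force) version of the problem in Python/NumPy, with made-up sizes scaled down to laptop territory; everything on the benchmark site above is a way of avoiding this full scan:

```python
import numpy as np

# Scaled down so it runs on a laptop; real deployments index millions
# to billions of vectors, which is exactly why ANN structures exist.
n, d = 100_000, 1000
rng = np.random.default_rng(0)
index = rng.standard_normal((n, d), dtype=np.float32)
query = rng.standard_normal(d).astype(np.float32)

# Exact nearest neighbor: one distance per indexed vector, O(n * d).
# ANN indexes (HNSW, IVF, LSH, ...) trade a little recall for
# orders-of-magnitude less work per query.
dists = np.linalg.norm(index - query, axis=1)
nearest = int(np.argmin(dists))
print(nearest, float(dists[nearest]))
```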
That's about the number of pixels in a 2-hour movie at 4K. Not too common yet, but once it's possible, we're going to want to feed that much data into neural networks.
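The back-of-the-envelope arithmetic behind that claim, assuming a 24 fps source:

```python
pixels_per_frame = 3840 * 2160       # one 4K frame: ~8.3 million pixels
frames = 24 * 60 * 60 * 2            # 24 fps x 7200 seconds = 172,800 frames
print(f"{pixels_per_frame * frames:.2e}")  # ~1.43e12: on the order of a trillion
```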
There are about 14B trades per year on the NYSE, which I'm sure could represent 10x that in entities (buyer, seller, broker, etc.) and could easily hit 1000x that in log lines. Shares traded per day are in the billions, so you'd hit 1T per year if each share were represented uniquely.
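Roughing out those multipliers (the 10x and 1000x factors are guesses from the comment above, not exchange statistics):

```python
trades_per_year = 14e9
entity_records = trades_per_year * 10     # buyer, seller, broker, ... per trade
log_lines = trades_per_year * 1000
shares_per_year = 2e9 * 250               # ~2B shares/day x ~250 trading days
print(f"{entity_records:.0e} {log_lines:.0e} {shares_per_year:.0e}")
# ~1e+11 entity records, ~1e+13 log lines, ~5e+11 shares: trillion-scale territory
```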
You don't typically use vector search for trade data, though. It's already ridiculously well structured: assets have identifiers, parties and counterparties have IDs, etc. I'm not sure what nearest neighbors in a vector space would add.
Dumb example, but still an example from the practical world. Your body (assuming you are a human) has trillions of cells. Each cell is way more complicated than what a 1000-dimensional vector can represent, but such a vector could be a loose estimate of some properties of each cell. Now the algorithm could be about finding the most similar cell. That could be useful for finding, e.g., other cancer cells based on the properties of one known cancer cell.

Not that this is a practical example, because we currently have no way to index all the cells in a body. But an algorithm being studied today might be useful one day when we do have the capability to collect such data.
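To make the thought experiment concrete: if each cell really were reduced to a feature vector, "find cells like this one" is just a k-nearest-neighbor query. A sketch with entirely invented data:

```python
import numpy as np

# Hypothetical setup: each cell reduced to a 1000-dimensional feature
# vector (marker expression levels, morphology stats, etc. -- invented here).
rng = np.random.default_rng(1)
cells = rng.random((10_000, 1000), dtype=np.float32)
query = cells[42]                     # one known cancer cell, say

# Cosine similarity against every cell; at trillions of cells this
# full scan is exactly what an ANN index would replace.
sims = (cells @ query) / (np.linalg.norm(cells, axis=1) * np.linalg.norm(query))
most_similar = np.argsort(-sims)[:10]  # index 42 itself will rank first
print(most_similar)
```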
If I were building it, from my 5 minutes of googling: using 15TB NVMe U.2 drives and easily available server chassis, I can get 24 drives per 2U of a rack. That's 360TB plus a couple of server nodes, so ~6U per PB. A full-height rack is 42U, so 6-7PB per rack once you give up some of the space to networking, etc. So dozens of PB is doable in a short datacenter row.
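The same arithmetic spelled out (setting aside 4U for networking and server nodes is my own guess at "some of the space"):

```python
drive_tb = 15
tb_per_2u = 24 * drive_tb            # 24 front-mounted drives: 360 TB per 2U
u_per_pb = 2 * 1000 / tb_per_2u      # ~5.6U, call it 6U per PB
usable_u = 42 - 4                    # leave a few U for networking/server nodes
print(u_per_pb, usable_u / u_per_pb) # ~5.6, ~6.8 -> the "6-7PB per rack" figure
```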
Realistically you could fit a lot more storage per U, depending on how much compute you need per unit of data. The example above assumes all the disks are at the front of the server only; if you mount them internally as well, you can fit a lot more (see Backblaze's storage pods for how they did it with spinning disks).
Probably an order of magnitude or two more. Still something that is feasible in a research context - early MRI and genome sequencing had similar "too much data" problems, but the researchers still built things out to learn stuff. Tech marched forward, and these days no one really blinks at it. I presume that if such an "all the cells" scanner were invented today, it would only be used for research for a long time - and that by the time it became widespread, data storage would have caught up.
Should theoretical research of data structures and algorithms have been capped at 1GB in 1980 because that was the biggest single hard drive available back then and you couldn’t store for example a 2GB dataset on a disk?
Sorry, they've crawled trillions of pages and narrowed it down to an index of hundreds of billions. Conveniently, the link answers your question of "can you have PB-sized indices?", to which we can clearly say yes.
Perhaps one example would be 100 billion websites, but with a set of vectors for each website (chunked BERT encodings, summed chunked GloVe vectors, whatever).

Then you could have something like 3-100ish vectors for each website, which would be a few trillion vectors.
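The multiplication, for scale:

```python
websites = 100e9
low, high = websites * 3, websites * 100
print(f"{low:.0e} to {high:.0e}")    # 3e+11 to 1e+13 vectors: a few trillion
```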
I'm not in the web scraping/search area, so idk about the 100B website figure (other than that it's Google order of magnitude), but the encoding would take some mega amount of time depending on how it's done - hence the suggested sum of GloVe chunks (potentially doable with decent hardware in months) rather than throwing an LLM at it (which would take literal centuries to process).
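A sketch of the cheap option, summed GloVe chunks, since it's just table lookups and adds; the `vectors` dict stands in for a loaded GloVe file (gensim or a plain text parse would supply it), and the chunk size is arbitrary:

```python
import numpy as np

def embed_by_summed_chunks(text: str, vectors: dict[str, np.ndarray],
                           dim: int = 300, chunk_words: int = 100) -> list[np.ndarray]:
    """One vector per chunk: the sum of the GloVe vectors of its words.

    Cheap enough to run over billions of pages; no GPU or LLM needed.
    """
    words = text.lower().split()
    chunks = [words[i:i + chunk_words] for i in range(0, len(words), chunk_words)]
    out = []
    for chunk in chunks:
        v = np.zeros(dim, dtype=np.float32)
        for w in chunk:
            if w in vectors:          # skip out-of-vocabulary words
                v += vectors[w]
        out.append(v)
    return out
```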
This is still several orders of magnitude more items than the entire training corpus for all GPT models combined. I guess if you were to index individual codepoints in the training corpus, we'd start to see those volumes.
You don't index the training data, but other data. It gives LLMs the medium-term memory that they're missing.
Think of it like an accountant. The weights are all of their experience. The prompt is the form in front of them. A vector database makes it easier to find the appropriate tax law and have that open (in the prompt) as well.
This is useful for people as well, like literally this example. But the LLM + vector combination is looking really powerful because of the tight loops.
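That loop, sketched end to end; every function here is a stand-in (a real setup swaps in an embedding model, an ANN index, and an actual LLM call):

```python
import numpy as np

# All stand-ins so the sketch runs; none of this is a real model or index.
DOCS = ["Capital gains are taxed at ...",
        "Section 179 allows expensing of ...",
        "The standard deduction is ..."]
rng = np.random.default_rng(0)
DOC_VECS = rng.random((len(DOCS), 64), dtype=np.float32)

def embed(text: str) -> np.ndarray:
    return rng.random(64, dtype=np.float32)       # placeholder embedding

def vector_search(q_vec: np.ndarray, top_k: int) -> list[str]:
    sims = DOC_VECS @ q_vec                       # nearest neighbors by dot product
    return [DOCS[i] for i in np.argsort(-sims)[:top_k]]

def llm(prompt: str) -> str:
    return f"(answer given {len(prompt)} chars of context)"  # placeholder

def answer(question: str) -> str:
    # The tight loop: embed -> nearest-neighbor lookup -> stuff the prompt.
    passages = vector_search(embed(question), top_k=2)
    prompt = "Context:\n" + "\n".join(passages) + f"\n\nQuestion: {question}"
    return llm(prompt)

print(answer("How are capital gains taxed?"))
```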
For context stuffing LLMs with small token limits (AKA retrieval augmentation), you will need to break up each article into few-sentence chunks. You can get to a large number of these chunks very rapidly.
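A minimal chunker along those lines (three sentences per chunk is an arbitrary choice):

```python
import re

def chunk_article(text: str, sentences_per_chunk: int = 3) -> list[str]:
    # Naive split on terminal punctuation; real pipelines use a proper
    # sentence tokenizer, but the point here is how fast the count grows.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [" ".join(sentences[i:i + sentences_per_chunk])
            for i in range(0, len(sentences), sentences_per_chunk)]

# e.g. 1e8 articles x ~30 chunks each is already 3 billion vectors to index.
print(chunk_article("One. Two! Three? Four. Five."))
```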