Nobody has mentioned Approximate Nearest Neighbor search (aka vector search), which IMO is a fundamental advancement in data structures.
Basically: given an indexed set of a million (billion, trillion...) roughly 1000-dimensional vectors, and a query vector of the same dimensionality, find the closest vector in the set. This is "nearest neighbor" search, and there have been increasingly sophisticated approaches:

http://ann-benchmarks.com

https://haystackconf.com/us2022/talk-6/

And it has spawned companies like Weaviate, Pinecone, etc etc (half a dozen others).
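For concreteness, here's the exact (brute-force) version of the problem in Python/NumPy, with made-up sizes scaled down to laptop territory; everything on the benchmark site above is a way of avoiding this full scan:

```python
import numpy as np

# Scaled down so it runs on a laptop; real deployments index millions
# to billions of vectors, which is exactly why ANN structures exist.
n, d = 100_000, 1000
rng = np.random.default_rng(0)
index = rng.standard_normal((n, d), dtype=np.float32)
query = rng.standard_normal(d).astype(np.float32)

# Exact nearest neighbor: one distance per indexed vector, O(n * d).
# ANN indexes (HNSW, IVF, LSH, ...) trade a little recall for
# orders-of-magnitude less work per query.
dists = np.linalg.norm(index - query, axis=1)
nearest = int(np.argmin(dists))
print(nearest, float(dists[nearest]))
```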
That's about the number of pixels in a 2-hour movie at 4K. Not too common yet, but once it's possible, we're going to want to feed that much data into neural networks.
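The back-of-the-envelope arithmetic behind that claim, assuming a 24 fps source:

```python
pixels_per_frame = 3840 * 2160       # one 4K frame: ~8.3 million pixels
frames = 24 * 60 * 60 * 2            # 24 fps x 7200 seconds = 172,800 frames
print(f"{pixels_per_frame * frames:.2e}")  # ~1.43e12: on the order of a trillion
```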
There are about 14B trades per year on the NYSE, which I'm sure could represent 10x that in entities (buyer, seller, broker, etc.) and could easily hit 1000x that in log lines. Shares traded per day are in the billions, so you'd hit 1T per year if each share were represented uniquely.
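Roughing out those multipliers (the 10x and 1000x factors are guesses from the comment above, not exchange statistics):

```python
trades_per_year = 14e9
entity_records = trades_per_year * 10     # buyer, seller, broker, ... per trade
log_lines = trades_per_year * 1000
shares_per_year = 2e9 * 250               # ~2B shares/day x ~250 trading days
print(f"{entity_records:.0e} {log_lines:.0e} {shares_per_year:.0e}")
# ~1e+11 entity records, ~1e+13 log lines, ~5e+11 shares: trillion-scale territory
```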
You don't typically use vector search for trade data, though. It's already ridiculously well structured: assets have identifiers, parties and counterparties have IDs, etc. I'm not sure what nearest neighbors in a vector space would add.
Dumb example, but still an example from the practical world. Your body (assuming you are a human) has trillions of cells. Each cell is way more complicated than what a 1000-dimensional vector can represent, but such a vector could be a loose estimate of some properties of each cell. Now the algorithm could be about finding the most similar cell. That could be useful for finding, e.g., other cancer cells based on the properties of one known cancer cell.

Not that this is a practical example, because we currently have no way to index all the cells in a body. But an algorithm being studied today might be useful one day when we do have the capability to collect such data.
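To make the thought experiment concrete: if each cell really were reduced to a feature vector, "find cells like this one" is just a k-nearest-neighbor query. A sketch with entirely invented data:

```python
import numpy as np

# Hypothetical setup: each cell reduced to a 1000-dimensional feature
# vector (marker expression levels, morphology stats, etc. -- invented here).
rng = np.random.default_rng(1)
cells = rng.random((10_000, 1000), dtype=np.float32)
query = cells[42]                     # one known cancer cell, say

# Cosine similarity against every cell; at trillions of cells this
# full scan is exactly what an ANN index would replace.
sims = (cells @ query) / (np.linalg.norm(cells, axis=1) * np.linalg.norm(query))
most_similar = np.argsort(-sims)[:10]  # index 42 itself will rank first
print(most_similar)
```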
If I were building it, from my 5 minutes of googling: using 15TB NVMe U.2 drives and easily available server chassis, I can get 24 drives per 2U of a rack. That's 360TB plus a couple of server nodes, so ~6U per PB. A full-height rack is 42U, so 6-7PB per rack once you give up some of the space to networking, etc. So dozens of PB is doable in a short datacenter row.
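The same arithmetic spelled out (setting aside 4U for networking and server nodes is my own guess at "some of the space"):

```python
drive_tb = 15
tb_per_2u = 24 * drive_tb            # 24 front-mounted drives: 360 TB per 2U
u_per_pb = 2 * 1000 / tb_per_2u      # ~5.6U, call it 6U per PB
usable_u = 42 - 4                    # leave a few U for networking/server nodes
print(u_per_pb, usable_u / u_per_pb) # ~5.6, ~6.8 -> the "6-7PB per rack" figure
```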
Realistically you could fit a lot more storage per U, depending on how much compute you need per unit of data. The example above assumes all the disks are at the front of the server only; if you mount them internally as well, you can fit a lot more (see Backblaze's storage pods for how they did it with spinning disks).
Probably an order of magnitude or two more. Still something that is feasible in a research context - early MRI and genome sequencing had similar "too much data" problems, but the researchers still built things out to learn stuff. Tech marched forward, and these days no one really blinks at it. I presume that if such an "all the cells" scanner were invented today, it would only be used for research for a long time - and that by the time it became widespread, data storage would have caught up.
Should theoretical research of data structures and algorithms have been capped at 1GB in 1980 because that was the biggest single hard drive available back then and you couldn’t store for example a 2GB dataset on a disk?
Sorry, they've crawled trillions of pages and narrowed it down to an index of hundreds of billions. Conveniently, the link answers your question of "can you have PB-sized indices?", to which we can clearly say yes.
Perhaps one example would be 100 billion websites, but with a set of vectors for each website (chunked BERT encodings, summed chunked GloVe vectors, whatever).

Then you could have something like 3-100ish vectors for each website, which would be a few trillion vectors.
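The multiplication, for scale:

```python
websites = 100e9
low, high = websites * 3, websites * 100
print(f"{low:.0e} to {high:.0e}")    # 3e+11 to 1e+13 vectors: a few trillion
```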
I'm not in the web scraping/search area, so idk about the 100B website figure (other than that it's Google order of magnitude), but the encoding would take some mega amount of time depending on how it's done - hence the suggested sum of GloVe chunks (potentially doable with decent hardware in months) rather than throwing an LLM at it (which would take literal centuries to process).
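A sketch of the cheap option, summed GloVe chunks, since it's just table lookups and adds; the `vectors` dict stands in for a loaded GloVe file (gensim or a plain text parse would supply it), and the chunk size is arbitrary:

```python
import numpy as np

def embed_by_summed_chunks(text: str, vectors: dict[str, np.ndarray],
                           dim: int = 300, chunk_words: int = 100) -> list[np.ndarray]:
    """One vector per chunk: the sum of the GloVe vectors of its words.

    Cheap enough to run over billions of pages; no GPU or LLM needed.
    """
    words = text.lower().split()
    chunks = [words[i:i + chunk_words] for i in range(0, len(words), chunk_words)]
    out = []
    for chunk in chunks:
        v = np.zeros(dim, dtype=np.float32)
        for w in chunk:
            if w in vectors:          # skip out-of-vocabulary words
                v += vectors[w]
        out.append(v)
    return out
```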
This is still several orders of magnitude more items than the entire training corpus for all GPT models combined. I guess if you were to index individual codepoints in the training corpus, we'd start to see those volumes.
You don't index the training data, but other data. It gives LLMs the medium-term memory that they're missing.
Think of it like an accountant. The weights are all of their experience. The prompt is the form in front of them. A vector database makes it easier to find the appropriate tax law and have that open (in the prompt) as well.
This is useful for people as well, like literally this example. But the LLM + vector combination is looking really powerful because of the tight loops.
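That loop, sketched end to end; every function here is a stand-in (a real setup swaps in an embedding model, an ANN index, and an actual LLM call):

```python
import numpy as np

# All stand-ins so the sketch runs; none of this is a real model or index.
DOCS = ["Capital gains are taxed at ...",
        "Section 179 allows expensing of ...",
        "The standard deduction is ..."]
rng = np.random.default_rng(0)
DOC_VECS = rng.random((len(DOCS), 64), dtype=np.float32)

def embed(text: str) -> np.ndarray:
    return rng.random(64, dtype=np.float32)       # placeholder embedding

def vector_search(q_vec: np.ndarray, top_k: int) -> list[str]:
    sims = DOC_VECS @ q_vec                       # nearest neighbors by dot product
    return [DOCS[i] for i in np.argsort(-sims)[:top_k]]

def llm(prompt: str) -> str:
    return f"(answer given {len(prompt)} chars of context)"  # placeholder

def answer(question: str) -> str:
    # The tight loop: embed -> nearest-neighbor lookup -> stuff the prompt.
    passages = vector_search(embed(question), top_k=2)
    prompt = "Context:\n" + "\n".join(passages) + f"\n\nQuestion: {question}"
    return llm(prompt)

print(answer("How are capital gains taxed?"))
```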
For context stuffing LLMs with small token limits (AKA retrieval augmentation), you will need to break up each article into few-sentence chunks. You can get to a large number of these chunks very rapidly.
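A minimal chunker along those lines (three sentences per chunk is an arbitrary choice):

```python
import re

def chunk_article(text: str, sentences_per_chunk: int = 3) -> list[str]:
    # Naive split on terminal punctuation; real pipelines use a proper
    # sentence tokenizer, but the point here is how fast the count grows.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [" ".join(sentences[i:i + sentences_per_chunk])
            for i in range(0, len(sentences), sentences_per_chunk)]

# e.g. 1e8 articles x ~30 chunks each is already 3 billion vectors to index.
print(chunk_article("One. Two! Three? Four. Five."))
```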