
Nobody has mentioned Approximate Nearest Neighbor search (aka vector search), which IMO is a fundamental advance in data structures.

Basically: given an indexed set of a million (billion, trillion...) ~1000-dimensional vectors and a ~1000-dimensional query vector, find the closest vector in the indexed set. This is "nearest neighbor" search, and there have been increasingly sophisticated approaches:

http://ann-benchmarks.com

https://haystackconf.com/us2022/talk-6/

And it has spawned companies like Weaviate, Pinecone, and half a dozen others.
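For intuition, the exact (brute-force) version of the problem is a few lines of numpy (toy sketch, made-up sizes); everything interesting in the ANN literature is about getting roughly the same answer without scanning all N vectors:

  import numpy as np

  # Brute-force exact nearest neighbor: O(N * d) work per query.
  # Sizes here are toy; a real index holds millions to billions of vectors.
  N, d = 10_000, 1000
  index_vectors = np.random.rand(N, d).astype(np.float32)
  query = np.random.rand(d).astype(np.float32)

  dists = np.linalg.norm(index_vectors - query, axis=1)  # distance to every vector
  nearest = int(np.argmin(dists))                        # exact answer

  # ANN indexes (HNSW graphs, IVF/PQ, DiskANN, ...) return (almost) the same
  # result while touching only a tiny fraction of the N vectors.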



The most commonly used ANN algorithms today are clever optimizations atop "old" data structures. DiskANN is a recent favorite: https://milvus.io/blog/2021-09-24-diskann.md


"Most commonly used X today are clever optimizations atop old Y" is pretty much the story of technology, isn't it?


It's lego all the way down to the clock cycle.


What sort of dataset are you indexing that has a trillion entries?

That's 100,000 English Wikipedias or 50 Googles.


That's about the number of pixels in a 2-hour movie at 4K. Not too common yet, but once it's possible we're going to want to feed this amount of data into neural networks.
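Back-of-envelope (assuming 24 fps; higher frame rates only push it further past a trillion):

  pixels_per_frame = 3840 * 2160      # ~8.3 million pixels in a 4K frame
  frames = 24 * 60 * 60 * 2           # 172,800 frames in a 2-hour movie at 24 fps
  total_pixels = pixels_per_frame * frames
  print(total_pixels)                 # ~1.43 trillion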


Using an approximate nearest neighbors search for this is just worse though.

You can retrieve the exact information even faster: from an (x, y, t) coordinate, you can find the exact index in constant time.
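For example, with a row-major layout (hypothetical helper, just to make "constant time" concrete):

  # Direct addressing into a densely stored video: no search structure needed.
  def pixel_index(x: int, y: int, t: int, width: int, height: int) -> int:
      return t * width * height + y * width + x  # O(1)

  # e.g. a pixel in a 4K frame:
  idx = pixel_index(x=100, y=50, t=1000, width=3840, height=2160)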


There are about 14B trades per year on the NYSE, which I'm sure could represent 10x that in entities (buyer, seller, broker, etc.) and could easily hit 1000x that in log lines. Shares traded per day are in the billions, so you'd hit 1T if each share were represented uniquely.


You don't typically use vector search for trade data though. It's already ridiculously well structured. Assets have identifiers, parties and counterparties have IDs, etc. I'm not sure what nearest neighbors in a vector space would add.


Nonetheless, it's an example of what you asked for: a dataset with over a trillion entries.


I asked which dataset they were indexing that was of this size, not whether any such dataset exists in other domains.


I believe I hear the sound of a True Scotsman knocking at the door.


My exact words are still up there

> What sort of dataset are you indexing that has a trillion entries?

It doesn't say

> What sort of dataset has a trillion entries?


Dumb example, but still one from the practical world. Your body (assuming you are a human) has trillions of cells. Each cell is way more complicated than what a 1000-dimensional vector can represent, but a vector could be a loose estimate of some properties of each cell. The algorithm could then be about finding the most similar cell, which could be useful for finding, e.g., other cancer cells based on the properties of one cancer cell.

Not that this is a practical example, because we technically cannot index all the cells in a body. But an algorithm being studied today might be useful one day when we do have the capability to collect such data.


Where are you going to store this data? It's dozens of petabytes.


It's only a few racks' worth of disk servers.

If I were building it from my 5 minutes of googling, using 15TB NVMe U.2 drives and easily available server chassis, I can get 24 drives per 2U of a rack. That's 360 TB plus a couple of server nodes, so ~6U per PB. A full-height rack is 42U, so 6-7 PB per rack once you take up some of the space with networking, etc. So dozens is doable in a short datacenter row.
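Spelling out that back-of-envelope math (rounded, and ignoring any RAID/replication overhead):

  drive_tb = 15                          # 15 TB NVMe U.2 drive
  drives_per_2u = 24
  tb_per_2u = drive_tb * drives_per_2u   # 360 TB in 2U
  u_per_pb = 2 * 1000 / tb_per_2u        # ~5.6U per PB (call it ~6U)
  usable_u = 36                          # 42U rack minus networking/servers
  print(usable_u / u_per_pb)             # ~6.5 PB per rack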

Realistically you could fit a lot more storage per U, depending on how much compute you need per unit of data. The example above assumes all the disks are at the front of the server only; if you mount them internally as well, you can fit a lot more (see Backblaze's storage pods for how they did it with spinning disks).

Dozens of PB is not that much data in 2023.


This is still like tens of thousands of dollars of equipment to store information about a single person's biology.


Probably an order of magnitude or two more. Still something that is feasible in a research context - early MRI and genome sequencing had similar "too much data" problems, but the researchers still built it out to learn stuff. Tech marched forward and these days no one really blinks about it. I presume that if such an "all the cells" scanner were invented today, it would only be used for research for a long time - and that by the time it became widespread, data storage would have caught up.


> Dozens of PB is not that much data in 2023.

Yes it is. Just transferring it at data center speeds will take days if not weeks.


A truck technically has more bandwidth than network adapters.


Should theoretical research on data structures and algorithms have been capped at 1GB in 1980 because that was the biggest single hard drive available back then and you couldn't store, for example, a 2GB dataset on a disk?


Not at all; I'll still call out fantastic claims when I see them, though.


Google has definitely indexed over a trillion pages.


Do you have any sources for this claim?

As far as I am aware Google doesn't publish any statistics about the size of its index, which no doubt varies.



Well, what do you know, they contradict the claim made above.


Sorry, they've crawled trillions of pages and narrowed it down to an index of hundreds of billions. Conveniently, the link answers your question of "can you have PB-sized indices?", to which we can clearly say yes.


Where do you think computers store data?


index every photo taken in the past ten years (1.8 trillion per year)

for each photo, extract and index SIFT vectors

https://en.m.wikipedia.org/wiki/Scale-invariant_feature_tran...
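Roughly what that extraction step looks like with OpenCV (sketch; "photo.jpg" is a placeholder, and this assumes opencv-python >= 4.4, where SIFT is back in the main module):

  import cv2

  img = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE)
  sift = cv2.SIFT_create()
  keypoints, descriptors = sift.detectAndCompute(img, None)
  # 'descriptors' is an (n_keypoints x 128) float32 array; every row is a
  # vector you'd add to the ANN index, so one photo contributes hundreds or
  # thousands of entries.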


Is SIFT still used?


i’m curious also.

i used it as an example of a way a dataset might have a huge number of feature vectors.

curious if there are better ways to do this now.


Perhaps one example would be 100 billion websites, but with a set of vectors for each website (chunked BERT encodings, summed GloVe vectors per chunk, whatever). Then you could have something like 3-100ish vectors for each website, which would be a few trillion vectors.

I'm not in the web scraping/search area, so I don't know about the 100B-website figure (other than it's Google order of size), but the encoding would take some mega amount of time depending on how it's done, hence the suggested sum of GloVe chunks (potentially doable with decent hardware in months) rather than throwing an LLM at it (which would take literal centuries to process).
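A sketch of the "sum chunked GloVe vecs" option (the toy lookup table here stands in for a real GloVe file such as glove.6B.300d.txt):

  import numpy as np

  # Stand-in for a real GloVe table: token -> 300-dim vector.
  rng = np.random.default_rng(0)
  glove = {w: rng.standard_normal(300) for w in ["vector", "search", "is", "fast"]}

  def chunk_vector(tokens, dim=300):
      vecs = [glove[t] for t in tokens if t in glove]
      return np.sum(vecs, axis=0) if vecs else np.zeros(dim)

  # One website -> a handful of chunks -> a handful of vectors in the index.
  chunks = [["vector", "search", "is", "fast"], ["search", "is", "cheap"]]
  site_vectors = [chunk_vector(c) for c in chunks]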


One application is LLMs. I've seen a project use Pinecone to enable "infinite context" for GPT-3.


This is still several orders of magnitude more items than the entire training corpus for all GPT models combined. I guess if you were to index individual codepoints in the training corpus, we'd start to see those volumes.


You don't index the training data, but other data. It gives LLMs the medium-term memory that they're missing.

Think of it like an accountant. The weights are all of their experience. The prompt is the form in front of them. A vector database makes it easier to find the appropriate tax law and have that open (in the prompt) as well.

This is useful for people as well, like literally this example. But the LLM + vector combination is looking really powerful because of the tight loops.
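A minimal sketch of that loop (the bag-of-words "embedding" here is a throwaway stand-in; real systems use a learned text-embedding model plus an ANN index like Pinecone):

  import numpy as np

  def embed(text, dim=256):
      # Toy embedding: hashed bag of words, normalized. Illustration only.
      v = np.zeros(dim)
      for tok in text.lower().split():
          v[hash(tok) % dim] += 1.0
      return v / (np.linalg.norm(v) + 1e-9)

  docs = ["Section 179 covers equipment depreciation.",
          "Form 1040 is the individual income tax return."]
  doc_vecs = np.stack([embed(d) for d in docs])   # indexed once, reused per query

  question = "Which form is the individual tax return?"
  scores = doc_vecs @ embed(question)             # cosine similarity
  context = docs[int(np.argmax(scores))]          # the "tax law" pulled off the shelf
  prompt = f"Context: {context}\n\nQuestion: {question}"  # what the LLM actually sees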


For context-stuffing LLMs that have small token limits (aka retrieval augmentation), you will need to break up each article into few-sentence chunks. You can get to a large number of these chunks very rapidly.
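The chunking step itself is simple; a minimal sketch (naive sentence splitting, where a real pipeline would use a proper tokenizer):

  import re

  def chunk_article(article, sentences_per_chunk=3):
      # Naive sentence split on terminal punctuation, then group into chunks.
      sentences = re.split(r"(?<=[.!?])\s+", article.strip())
      return [" ".join(sentences[i:i + sentences_per_chunk])
              for i in range(0, len(sentences), sentences_per_chunk)]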



