
I'm sure you've already explored this, but is there some precompression operation that you could do to the vectors such that they're no longer sparse, and therefore relatively uniform?

They weren't sparse; they were dense, but the density was quite non-uniform (think typical learned ML vectors). Not too far from an N-dimensional Gaussian (I ended up reading research on quantizing Gaussian distributions, but that didn't help either, since what we had wasn't perfectly Gaussian).

VAE objectives are useful for pushing embeddings into a Gaussian distribution.
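
The relevant piece is the KL term in the usual VAE ELBO; a minimal numpy sketch (mu and logvar would come from whatever encoder you're using):

    import numpy as np

    def kl_to_standard_normal(mu, logvar):
        # KL( N(mu, diag(exp(logvar))) || N(0, I) ) per example.
        # Minimizing this term pulls the latent codes toward a standard Gaussian.
        return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=1)

    # training loss ~ reconstruction_error + beta * kl_to_standard_normal(mu, logvar).mean()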

Here's some work on low-latency neural compression that you might find interesting: https://arxiv.org/abs/2107.03312


If you start from N-dimensional Gaussian-distributed vectors and normalize them, you wind up with uniformly distributed vectors on an (N-1)-dimensional hypersphere. Of course, sphere-packing on the surface of a hypersphere is its own fun problem, and if your data is not actually Gaussian this may still not be exactly what you want, but coding the vector magnitude separately from the vector direction is probably a good start.
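
Concretely, the split looks something like this (a rough numpy sketch with made-up sizes, just to illustrate the magnitude/direction decomposition, not any particular codec):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((10000, 128))      # roughly Gaussian embeddings

    norms = np.linalg.norm(X, axis=1, keepdims=True)
    directions = X / norms                     # ~uniform on the 127-sphere
                                               # (exactly so for an isotropic, zero-mean Gaussian)

    # Code the two parts separately, e.g. a scalar quantizer for the magnitude
    # and a spherical / product quantizer for the direction.
    mag_edges = np.quantile(norms, np.linspace(0, 1, 17))       # 16-level scalar quantizer
    mag_codes = np.digitize(norms, mag_edges[1:-1])             # codes in 0..15

    # Reconstruction would be: direction-codebook lookup (not shown) times the
    # dequantized magnitude.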

I see, thank you.

Another thing that I'm sure you explored, and I'd love to hear how it went, would be to rearrange the elements in the vectors so that, on average, the denser parts become more contiguous and the sparser parts become more contiguous. That sounds like something that would be easier to compress. Were the distributions such that a rearrangement like this might have been possible? Or were they very evenly distributed?

I.e. could you have rearranged a Gaussian-like distribution into a Poisson-like distribution?


What ended up launching is a fancy product quantization based on k-means. Some of the tricks were storing magnitude separately (i.e. removing the mean) and rearranging dimensions based on variance (and/or rotation based on PCA) for PQ to work better.
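
Roughly, a pipeline like that looks as follows (a toy numpy/sklearn sketch with made-up sizes and names, not the code that actually shipped; the magnitude could also be pulled out and scalar-quantized first, as discussed above):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    def train_pq(X, n_subvectors=8, n_centroids=256, seed=0):
        # Rotate with PCA; components come out ordered by variance, which is one
        # simple way to "rearrange dimensions based on variance" before splitting.
        pca = PCA(random_state=seed).fit(X)
        Xr = pca.transform(X)

        # Assumes the dimensionality is divisible by n_subvectors.
        d = Xr.shape[1] // n_subvectors
        codebooks, codes = [], []
        for i in range(n_subvectors):
            sub = Xr[:, i * d:(i + 1) * d]
            km = KMeans(n_clusters=n_centroids, n_init=4, random_state=seed).fit(sub)
            codebooks.append(km.cluster_centers_)
            codes.append(km.labels_.astype(np.uint8))   # 1 byte per subvector
        return pca, codebooks, np.stack(codes, axis=1)

    def decode_pq(pca, codebooks, codes):
        # Look up each subvector's centroid, concatenate, and undo the rotation.
        parts = [codebooks[i][codes[:, i]] for i in range(len(codebooks))]
        return pca.inverse_transform(np.hstack(parts))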

I also remember trying to fit a distribution so that I could generate synthetic data (not for lack of data, but more to understand the problem space better). The synthetic data quantized quite differently; my guess is that it's because of random areas of density and sparsity.
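
The simplest version of that experiment looks something like this (a sketch of "fit a single Gaussian and sample from it"; the real fit may have been fancier):

    import numpy as np

    def synthetic_like(X, n_samples, seed=0):
        # Fit one multivariate Gaussian to the real embeddings and sample from it.
        # Local clusters / density pockets in the real data are lost, which is one
        # reason the synthetic set can quantize very differently.
        rng = np.random.default_rng(seed)
        mu = X.mean(axis=0)
        cov = np.cov(X, rowvar=False)
        return rng.multivariate_normal(mu, cov, size=n_samples)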

I'm not quite following your exact rearranging idea though. Not sure if the above answers the question.


I was just wondering, it seems like an interesting problem to explore.


