Hacker News | markisus's comments

I’ve confirmed this on my iPhone as well.

Using swipe, no space bar after kill: Kill maps Jill myself Jill myself

Using swipe, manually pressing space bar after kill: Kill mussels Kill mussels Kill mussels


Yeah same -

Kill males kill males kill muddled kill mussels (hilarious)

Treat myself tear myself try myself tell myself

It won’t do it.


Kill mussels confirmed


On the contrary, I think it demonstrates an inherent limit of the kinds of tasks / datasets that human beings care about.

It's known that large neural networks can even memorize random data. The number of random datasets is unfathomably large, and the weights of networks trained on random data would probably not live in a low dimensional subspace.

It's only the interesting-to-human datasets, as far as I know, that drive the neural network weights to a low dimensional subspace.


Each fine tune drags the model weights away from the base model in a certain direction.

Given 500 fine tune datasets, we could expect the 500 drag directions to span a 500 dimensional space. After all, 500 random vectors in a high dimensional space are likely to be nearly orthogonal to one another.

The paper shows, however, that the 500 drag directions live in a ~40 dimensional subspace.

Another way to say it is that you can compress fine tune weights into a vector of 40 floats.

Imagine if, one day, fine tunes on huggingface were not measured in gigabytes, megabytes, or even kilobytes. Suppose you started to see listings like 160 bytes. Would that be surprising?

I’m leaving out the detail that the basis direction vectors themselves would have to be on your machine and each basis direction is as big as the model itself. And I’m also taking for granted that the subspace dimension will not increase as the number of fine tune datasets increases.
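
Here is a toy numpy sketch of that picture (the sizes, the rank-40 basis, and the noise level are made up, not the paper's actual setup): if the drag directions really do live in a ~40 dimensional subspace, an SVD of the stacked deltas recovers it, and a new fine tune can then be shipped as ~40 coefficients in that shared basis.

    import numpy as np

    rng = np.random.default_rng(0)
    d, k, n = 10_000, 40, 500   # params per model (toy), subspace dim, number of fine tunes

    # Pretend the 500 fine-tune deltas really do live in a k-dimensional
    # subspace, plus a little noise. Each column is one "drag direction".
    basis = rng.standard_normal((d, k))
    deltas = basis @ rng.standard_normal((k, n)) + 0.01 * rng.standard_normal((d, n))

    # The SVD of the stacked deltas recovers the subspace: the spectrum
    # collapses after the first ~40 singular values.
    U, S, _ = np.linalg.svd(deltas, full_matrices=False)
    print("singular values 1-5:  ", np.round(S[:5], 1))
    print("singular values 41-45:", np.round(S[40:45], 1))

    # A new fine tune can then be shipped as k floats (its coordinates in the
    # shared basis, ~160 bytes at fp32) instead of d floats.
    new_delta = basis @ rng.standard_normal(k)
    coeffs = U[:, :k].T @ new_delta
    recon = U[:, :k] @ coeffs
    err = np.linalg.norm(recon - new_delta) / np.linalg.norm(new_delta)
    print("relative reconstruction error:", round(float(err), 4))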

I agree that the authors' decision to use random models from Hugging Face is unfortunate. I'm hopeful that this paper will inspire follow-up work that trains large models from scratch.


Agreed. What's surprising here to me isn't that the fine tunes are compressible, it's the degree to which they're compressible. It seems like very little useful new information is being added by the fine-tune.

They're using SVD to throw away almost all of the "new information" and apparently getting solid results anyhow. Which of course raises interesting questions if replicable. The code doesn't seem to have been released yet though.
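
To make "throwing away almost all of the new information" concrete, here is a generic truncated-SVD illustration in numpy. It is not the paper's procedure (their code isn't out); it's just a made-up weight delta compressed to its top singular directions.

    import numpy as np

    rng = np.random.default_rng(1)
    m = n = 768
    # Made-up fine-tune delta for one weight matrix: low-rank signal plus small noise.
    delta = rng.standard_normal((m, 8)) @ rng.standard_normal((8, n)) \
            + 0.05 * rng.standard_normal((m, n))

    U, S, Vt = np.linalg.svd(delta, full_matrices=False)
    for k in (1, 8, 64):
        approx = (U[:, :k] * S[:k]) @ Vt[:k]     # keep only the top-k directions
        err = np.linalg.norm(delta - approx) / np.linalg.norm(delta)
        kept = k * (m + n) / (m * n)             # fraction of numbers you must store
        print(f"rank {k:>2}: store {kept:5.1%} of the delta, relative error {err:.3f}")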


Yeah, but it also made me wonder whether, deep down, neural networks are just curated random basis vectors, like in random projections.


Where do lock free algorithms fall in this analysis?


Some considerations can be similar, but the overall set of details is different.

It also depends which lock free solutions you're evaluating.

Some are essentially higher-order spins (with similar high-level problems); others have different secondary costs (such as copying). A common overlap is the inter-core, inter-thread, and inter-package side effects of synchronization points: for anything with a strong atomic in the middle, that means things like synchronization instruction costs, the pipeline impact of barriers/fences, etc.


In a very basic case, lock-free data structures make threads race instead of spin. A thread makes its own copy of the part of the list/tree/whatever that it needs to update, applies its changes to that copy, and then tries to swap its own pointer in for the data structure's pointer, provided the pointer hasn't changed in the meantime (there is a CPU atomic instruction for exactly that, compare-and-swap). If the thread fails (someone else changed the pointer in the meantime), it tries again.
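
Here's a rough Python sketch of that retry loop, for pushing onto a shared linked list. Python has no user-level CAS instruction, so the compare-and-swap below is emulated with a tiny lock purely to make the example runnable; the point is the copy / attempt-to-swap / retry control flow, not a genuinely lock-free implementation.

    import threading

    class Node:
        def __init__(self, value, next):
            self.value, self.next = value, next

    class AtomicRef:
        # Stand-in for a machine-level compare-and-swap on a pointer. On real
        # hardware this is a single instruction (e.g. CMPXCHG); the lock here
        # exists only so the sketch runs in plain Python.
        def __init__(self, value=None):
            self._value, self._lock = value, threading.Lock()
        def get(self):
            return self._value
        def compare_and_swap(self, expected, new):
            with self._lock:
                if self._value is expected:
                    self._value = new
                    return True
                return False

    head = AtomicRef(None)

    def push(value):
        while True:                           # optimistic retry loop
            old = head.get()                  # snapshot the current head
            node = Node(value, old)           # build the new version privately
            if head.compare_and_swap(old, node):
                return                        # head unchanged since the snapshot: we won
            # someone else moved the head first: take a fresh snapshot and retry

    threads = [threading.Thread(target=push, args=(i,)) for i in range(8)]
    for t in threads: t.start()
    for t in threads: t.join()

    count, node = 0, head.get()
    while node is not None:
        count, node = count + 1, node.next
    print("pushed", count, "nodes")           # 8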


There are several classes of lock free algorithms with different properties.

Lock free algorithms for read-only access to shared data structures seldom have disadvantages (only when the shared data structure is modified extremely frequently by writers, so the readers never manage to read it between changes), so for read-only access they are typically the best choice.

On the other hand, lock free algorithms for read/write access to shared data structures must be used with great caution, because they frequently have a higher cost than using mutual exclusion. Such lock free algorithms are based on the optimistic assumption that your thread will complete the access before the shared structure is touched by another competing thread. Whenever this assumption fails (which will happen when there is high contention), the transaction must be retried, which leads to much more wasted work than a spinlock would waste.

Lock free algorithms for read/write access are normally preferable only when it is certain that there is low contention for the shared resource, but in that case a spinlock also wastes negligible time.

The term "lock-free" is properly applied only to the access methods based on optimistic access instead of mutual exclusion (which uses locks).

However, there is a third kind of access method, which uses neither optimistic access nor mutual exclusion with locks, so some authors may conflate such methods with the lock-free algorithms based on optimistic access.

This third kind of access method for shared data has properties very different from the other two kinds, so it is better considered separately. It is based on partitioning the shared data structure between the threads that access it concurrently, so such methods are applicable only to certain kinds of data structures, mainly arrays and queues. Nevertheless, almost any kind of inter-thread communication can be reorganized around message queues and shared buffers, so most applications that use either mutual exclusion with locks or lock-free optimistic access can be rewritten to use concurrent access to dynamically partitioned shared data structures. There the access is deterministic, unlike with lock-free optimistic algorithms, and there are no explicit locks; the partitioning of the shared data structure must still be done with atomic instructions (usually atomic fetch-and-add), which contain implicit locks, but those are equivalent to extremely short locked critical sections.


Typically lock-free algorithms do not retry when a read happens concurrently with a write[1], so they scale very well for read mostly data or when data is partitioned between writers.

Scalability is always going to be poor when writers attempt to modify the same object, no matter the solution you implement.

[1] of course you could imagine a lock-free algorithm where reads actually mutate, but that's not common.


> Scalability is always going to be poor when writers attempt to modify the same object, no matter the solution you implement.

MVCC.


Well, yes, that's one way of avoiding mutating the same object of course.


Technically it is not, because eventually the object will be mutated; but it is one way of achieving scalability in the multiple-writer scenario.


Their VGGT, DINOv3, and Segment Anything models are pretty impressive.


The problem becomes complicated once the large discrete objects are not actuated. Even worse if the large discrete objects are not consistently observable because of occlusions or other sensor limitations. And almost impossible if the large discrete objects are actuated by other agents with potentially adversarial goals.

Self-driving cars, an application in which the physics is simple and arguably two-dimensional, have taken more than a decade to reach a deployable solution.


The authors somewhat address your questions in the accompanying paper https://arxiv.org/abs/2410.24206

> We emphasize that the central flow is a theoretical tool for understanding optimizer behavior, not a practical optimization method. In practice, maintaining an exponential moving average of the iterates (e.g., Morales-Brotons et al., 2024) is likely a computationally feasible way to estimate the optimizer’s time-averaged trajectory.

They analyze the behavior of RMSProp (Adam without momentum) using their framework to come up with simplified mathematical models that are able to predict actual training behavior in experiments. It looks like their mathematical models explain why RMSProp works, in a way that is more satisfying than the usual hand waving explanations.
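
For intuition, here is a tiny numpy toy (my own constants, nothing from the paper) of what that EMA-of-iterates suggestion does: on a valley that is sharp in one direction and shallow in the other, the raw iterate bounces back and forth across the sharp direction, while the EMA of the iterates sits near the valley floor and tracks the smooth, time-averaged path.

    import numpy as np

    # L(x, y) = 0.5 * a * x**2 + 0.5 * b * y**2, with a >> b.
    a, b = 100.0, 1.0
    lr = 2.0 / a          # knife-edge step size, so the oscillation in x persists
    beta = 0.9            # EMA coefficient

    w = np.array([1.0, 1.0])
    ema = w.copy()
    for t in range(200):
        grad = np.array([a * w[0], b * w[1]])
        w = w - lr * grad
        ema = beta * ema + (1 - beta) * w     # exponential moving average of the iterates

    print("raw iterate:    ", np.round(w, 3))     # x is still bouncing at +/- 1
    print("EMA of iterates:", np.round(ema, 3))   # x is near 0, i.e. near the valley floor

A pure quadratic has no self-stabilization, so the step size is pinned at exactly 2/a here; in a real network the higher-order terms of the loss do the stabilizing, which, as I understand it, is what the central flow analysis formalizes.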


Yes, it certainly provides a lot more clarity than the handwaving.

While momentum seems to work, and the authors clearly state this is not intended as a practical optimization method, I can't rule out that we could improve convergence rates by building on this knowledge.

Is the oscillating behavior guaranteed to have a period of 2 steps, or is, say, a 3-step period also possible (a vector in a plane could alternately point at 0 degrees, 120 degrees, and 240 degrees)?

The way I read this presentation, the implication seems to be that it's always a period of 2. Perhaps if the top-2 sharpnesses are degenerate (identical), a period N distinct from 2 could be possible?

It makes me wonder what would happen if, instead of storing momentum as an exponential moving average, one used the average of the last 2 iterates, so there would be less lag.

It also makes me wonder if we should perform 2 iterative steps PER sequence, so that the single-sample sequence gives feedback along its valley instead of across it. One would go through the corpus at half the speed, but convergence might be more accurate.
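
To make the 2-step oscillation concrete: in the one-direction quadratic model, gradient descent along the top eigendirection is x -> (1 - lr * sharpness) * x, a single real scale factor, so the only sustained oscillation it can produce is a sign flip, and averaging the last two iterates cancels it exactly. A tiny sketch with made-up numbers:

    a = 100.0        # sharpness of the top eigendirection
    lr = 2.0 / a     # step size at the edge of stability: the factor is exactly -1
    x = prev = 0.7
    for t in range(6):
        prev, x = x, (1 - lr * a) * x
        print(f"step {t}: x = {x:+.2f}   mean of last two iterates = {(x + prev) / 2:+.2f}")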


I doubt it would ever be period 3.

Not a formal proof, but there is this fun theorem called "period three implies chaos" that my gut instinct says applies here.

Basically, if you have a continuous mapping from [a,b] -> [a,b] and there exists a 3-cycle, then every other cycle length exists as well.

Which in this case would kinda say that if you are bouncing between three values on the y axis (and the bouncing is a continuous function, which admittedly the gradient of a ReLU is not), you are probably in a chaotic system.

Now that requires assuming that the behaviour of y is largely a function of just y, but their derivation seems to imply that is the case.
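
For a concrete instance of the theorem: the logistic map x -> r*x*(1-x) has a stable 3-cycle near r = 3.83, and the Li-Yorke result then guarantees that periodic points of every other period exist as well (they're just unstable, so you never see them). A tiny sketch:

    # The classic example behind "period three implies chaos": the logistic map.
    r, x = 3.83, 0.5
    for _ in range(1000):            # burn in until we land on the attracting cycle
        x = r * x * (1 - x)
    orbit = []
    for _ in range(6):
        x = r * x * (1 - x)
        orbit.append(round(x, 4))
    print(orbit)                     # the same three values repeat: a genuine 3-cycle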


At first I thought this would be just another gradient descent tutorial for beginners, but the article goes quite deep into gradient descent dynamics, looking into third order approximations of the loss function and eventually motivating a concept called "central flows." Their central flow model was able to predict the loss curves of various training runs across different neural network architectures.


Can someone explain the bit counting argument in the reinforcement learning part?

I don’t get why a trajectory would provide only one bit of information.

Each step of the trajectory gives at least some information about which state transitions are possible.

An infinitely long trajectory can explore the whole state space if there are no absorbing states. Such a trajectory would provide a massive amount of information about the system, even if we ignored the final reward.


I believe it's because of the way you measure things in RL: each episode only tells you whether it was good (say reward +1) or bad (say 0 or a negative reward); it does not tell you anything about the trace that was produced to get that outcome. The reward is the only measurement used to produce your gradients, hence the amount of information in it is O(1).

This is in contrast to more "supervised" forms of learning, where you get a loss for each token produced (e.g. cross-entropy loss) and where, as a consequence, you get O(number of tokens) of information into your gradients.
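
Here's a schematic numpy sketch of that contrast (made-up shapes and numbers, not anyone's actual training code): in the policy-gradient case a single scalar reward multiplies every per-token term of the gradient, whereas the supervised loss gives each position its own independently measured error signal.

    import numpy as np

    rng = np.random.default_rng(0)
    T, V = 100, 50                        # trajectory length, vocab size (toy numbers)
    logits = rng.standard_normal((T, V))
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    actions = rng.integers(0, V, size=T)  # the sampled trajectory

    # Policy gradient: the whole trajectory is judged by ONE scalar, which
    # scales every per-token term, so the episode contributes O(1) measured bits.
    reward = 1.0                          # "this episode was good"
    grad_rl = probs.copy()
    grad_rl[np.arange(T), actions] -= 1.0 # d(-log p(a_t)) / d(logits_t)
    grad_rl *= reward

    # Supervised learning: every position has its own label, hence its own
    # independently measured error signal, O(T) of them per sequence.
    targets = rng.integers(0, V, size=T)
    grad_sl = probs.copy()
    grad_sl[np.arange(T), targets] -= 1.0

    print("measurements behind grad_rl:", 1)   # one scalar reward
    print("measurements behind grad_sl:", T)   # one label per token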


A fair amount of research has shown that RL doesn't add knowledge to the base model; it just optimizes paths that already exist. Now, ProRL from Nvidia showed there are ways of adding knowledge, mostly through progressive merging.

I'm still not fully convinced of the 1-bit claim; they made other mistakes in the blog post.


These problems seem to have the flavor of interview questions I've heard for quant positions.


So the role that would benefit from this specific competency is: interviewer

(For both quant and leetcode positions)

