thinkharderdev's comments | Hacker News

> To have better performance in benchmarks

Yes, exactly.


> Except a consumer can discard an unprocessable record?

It's not the unprocessable records that are the problem, it's the records that are very slow to process (for whatever reason).


> If it were written with async it would likely have enough other baggage that it wouldn't fit or otherwise wouldn't work

I'm unclear what this means. What is the other baggage in this context?


In context (embedded programming, which in retrospect is still too big of a field for this comment to make sense by itself; what I meant was embedded programming on devices with very limited RAM or other such significant restrictions), "baggage" is the fact that you don't have many options when converting async high-level code into low-level machine code. The two normal things people write into their languages/compilers/whatever (the first being much more popular, and there do exist more than just these two options) are:

1. Your async/await syntax desugars to a state machine. The set of possible states might only be runtime-known (JS, Python), or it might be comptime-known (Rust, old-Zig, arguably new-Zig if you squint a bit). The concrete value representing the current state of that state machine is only runtime-known, and you have some sort of driver (often called an "event loop", but there are other abstractions) managing state transitions.

2. You restrict the capabilities of async/await to just those which you're able to statically (compile-time) analyze, and you require the driver (the "event loop") to be compile-time known so that you're able to desugar what looks like an async program to the programmer into a completely static, synchronous program.

On sufficiently resource-constrained devices, both of those are unworkable.

In the case of (1) (by far the most common approach, and the thing I had in mind when arguing that async has potential issues for embedded programming), you waste RAM/ROM on a more complicated program involving state machines, you waste RAM/ROM on the driver code, you waste RAM on the runtime-known states in those state machines, and you waste RAM on the runtime-known boxing of events you intend to run later. The same program (especially in an embedded context where programs tend to be simpler) can easily be written by a skilled developer in a way which avoids that overhead, but reaching for async/await from the start can prevent you from reaching your goals for the project. It's that RAM/ROM/CPU overhead that I'm talking about in the word "baggage."
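
To make the "desugars to a state machine" idea concrete, here's a rough, heavily simplified hand-written sketch of the shape the compiler generates in approach (1). The Io type, the Poll stand-in, and the function are all made up for illustration; real compiler output is considerably more involved:

    // Toy stand-ins for the real Future/Poll machinery, just to show the shape.
    enum Poll<T> { Ready(T), Pending }

    struct Io { next: u32 }
    impl Io {
        fn try_read(&mut self) -> Poll<u32> { Poll::Ready(self.next) }
        fn try_write(&mut self, _n: u32) -> Poll<()> { Poll::Ready(()) }
    }

    // Roughly what `async fn read_then_write(io) { let n = io.read().await;
    // io.write(n).await; n }` desugars to: one enum variant per suspension
    // point, carrying whichever locals are still live at that point.
    #[derive(Clone, Copy)]
    enum ReadThenWrite {
        Start,
        AwaitingWrite { n: u32 }, // `n` has to survive across the second await
        Done,
    }

    impl ReadThenWrite {
        // The driver (the "event loop") calls this until it returns Ready.
        fn poll(&mut self, io: &mut Io) -> Poll<u32> {
            loop {
                match *self {
                    ReadThenWrite::Start => match io.try_read() {
                        Poll::Ready(n) => *self = ReadThenWrite::AwaitingWrite { n },
                        Poll::Pending => return Poll::Pending,
                    },
                    ReadThenWrite::AwaitingWrite { n } => match io.try_write(n) {
                        Poll::Ready(()) => { *self = ReadThenWrite::Done; return Poll::Ready(n); }
                        Poll::Pending => return Poll::Pending,
                    },
                    ReadThenWrite::Done => panic!("polled after completion"),
                }
            }
        }
    }

The point is that all of this (the enum, the poll function, the driver that calls it) exists as real code and real state, whether or not your program needed that generality.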

In the case of (2), there are a couple potential flaws. One is just that not all reasonable programs can be represented that way (it's the same flaw with pure, non-unsafe Rust and with attempts to create languages which are known to terminate), so the technique might literally not work for your project. A second is that the compiler's interpretation of the particular control flow and jumps you want to execute will often differ from the high-level plan you had in mind, potentially creating more physical bytecode or other issues. Details matter in constrained environments.


That makes sense. I don't know anything about embedded programming really, but I thought it fundamentally requires async (in the conceptual sense), so you have to structure your program as an event loop no matter what. Wasn't the alleged goal of Rust async to be zero-cost, in the sense that the transformation of a future ends up being roughly what you would write by hand if you had to hand-roll a state machine? Of course async itself requires a runtime, and I get why something like Tokio would be a non-starter in embedded environments, but you can still hand-roll the core runtime and structure the rest of the code with async/await, right? Or are you saying that the generated code, even without the runtime, is too heavy for an embedded environment?


> fundamentally requires async (in the conceptual sense)

Sometimes, kind of. For some counter-examples, consider a security camera or a thermostat. In the former you run in a hot loop because it's more efficient when you constantly have stuff to do, and in the latter you run in a hot loop (details apply for power-efficiency reasons, but none which are substantially improved by async) since the timing constraints are loose enough that you have no benefit from async. One might argue that those are still "conceptually" async, but I think that misses the mark. For the camera, for example, a mental model of "process all the frames, maybe pausing for a bit if you must" is going to give you much better results when modeling that domain and figuring out how to add in other features (between those two choices of code models, the async one buys you less "optionality" and is more likely to hamstring your business).
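
For the thermostat case, the kind of thing I have in mind is just a blocking superloop; a minimal sketch, where the sensor/relay functions are hypothetical stand-ins for real hardware access:

    // Hypothetical blocking superloop for a thermostat: poll, act, sleep.
    fn read_temperature_c() -> f32 { 21.0 }
    fn set_heater(_on: bool) {}

    fn main() {
        const TARGET_C: f32 = 20.0;
        loop {
            let t = read_temperature_c();
            set_heater(t < TARGET_C); // simple bang-bang control
            // Timing is loose, so plain sleeping (or a low-power wait) is fine;
            // there's nothing here for async to overlap with.
            std::thread::sleep(std::time::Duration::from_secs(5));
        }
    }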

> zero-cost

IMO this is a big misnomer, especially when applied to abstractions like async. I'll defer async till a later bullet point, looking instead at simpler abstractions.

The "big" observation is that optimization is hard, especially as information gets stripped away. Doing it perfectly seemingly has an exponential cost (active research problem to reduce those bounds, or even to reduce constant factors). Doing it approximately isn't "zero"-cost.

With perfect optimization being impossible for all intents and purposes, you're left with a world where equivalent units of code don't have the same generated instructions. I.e., the initial flavor of your code biases the generated instructions one way or another. One way of writing high-performance code then is to choose initial representations which are closer to what the optimizer will want to work with (basically, you're doing some of the optimization yourself and relying on the compiler to not screw it up too much -- which it mostly won't (there be dragons here, but as an approximate rule of thumb) because it can't search too far from the initial state you present to it).

Another framing of that is that if you start with one of many possible representations of the code you want to write, it has a low probability of giving the compiler the information it needs to actually optimize it.

Let's look at iterators for a second. The thing that's being eliminated with "zero-cost" iterators is logical instructions. Suppose you're applying a set of maps to an initial sequence. A purely runtime solution (if "greedy" and not using any sort of builder pattern) like you would normally see in JS or Python would have explicit "end of data" checks for every single map you're applying, increasing the runtime with all the extra operations existing to support the iterator API for each of those maps.

Contrast that with Rust's implementation (or similar in many other languages, including Zig -- "zero-cost" iterators are a fun thing that a lot of programmers like to write even when not provided natively by the language). Rust recognizes at compile-time that applying a set of maps to a sequence can be re-written as `for x in input: f0(f1(f2(...(x))))`. The `for x in input` thing is the only part which actually handles bounds-checking/termination-checking/etc. From there all the maps are inlined and just create optimal assembly. The overhead from iteration is removed, so the abstraction of iteration is zero-cost.
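
Concretely, something like this: the chained version and the hand-written loop compile to essentially the same single loop, and that equivalence is the whole of the "zero-cost" claim (the transforms here are arbitrary stand-ins):

    // One termination check per element in both versions, with the maps inlined.
    fn with_iterators(input: &[u32]) -> Vec<u32> {
        input.iter()
            .copied()
            .map(|x| x.wrapping_mul(3))
            .map(|x| x ^ 0xA5A5_A5A5)
            .map(|x| x.rotate_left(7))
            .collect()
    }

    fn by_hand(input: &[u32]) -> Vec<u32> {
        let mut out = Vec::with_capacity(input.len());
        for &x in input { // the only bounds/termination check
            out.push((x.wrapping_mul(3) ^ 0xA5A5_A5A5).rotate_left(7)); // maps inlined
        }
        out
    }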

Except it's not, at least not for a definition of "zero-cost" the programmer likely cares about (I have similar qualms about safe Rust being "free of data-races", but those are more esoteric and less likely to come up in your normal day-to-day). It's almost always strictly better than nested, dynamic "end of iterator" checks, but it's not actually zero-cost.

Taking as an example something that came up somewhat recently for me, math over fields like GF(2^16) can be ... interesting. It's not that complicated, but it takes a reasonable number of instructions (and/or memory accesses). I understand that's not an everyday concern for most people, but the result illustrates a more general point which does apply. Your CPU's resources (execution units, instruction cache, branch-prediction cache (at several hierarchical layers), etc) are bounded. Details vary, but when iterating over an array of data and applying a bunch of functions, even when none of that is vectorizable, you very often don't want codegen with that shape. You instead want to pop a few elements, apply the first function to those elements, apply the second function to those results, etc, and then proceed with the next batch once you've finished the first. The problems you're avoiding include data dependencies (it's common for throughput for an instruction to be 1-2/cycle but for latency to be 2-4 cycles, meaning that if one instruction depends on another's output it'll have to wait 2-4 cycles when it could in theory otherwise process that data in 0.5-1 cycles) and bursting your pipeline depth (your CPU can automagically resolve those data dependencies if you don't have too many instructions per loop iteration, but writing out the code explicitly guarantees that the CPU will _always_ be happy).
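
A rough sketch of the layout I mean, with generic u32 transforms standing in for the field math and an arbitrarily chosen chunk width:

    // Stand-ins for the multi-instruction field operations described above.
    fn f0(x: u32) -> u32 { x.wrapping_mul(0x9E37_79B9).rotate_left(5) }
    fn f1(x: u32) -> u32 { (x ^ (x >> 7)).wrapping_add(0x7F4A_7C15) }

    // Fused loop: f1 for each element waits on the latency of f0 for the same
    // element, so the chain of data dependencies limits throughput.
    fn fused(data: &mut [u32]) {
        for x in data.iter_mut() {
            *x = f1(f0(*x));
        }
    }

    // Batched layout: apply f0 across a small chunk, then f1 across that chunk.
    // Within each pass the iterations are independent, so the CPU can overlap
    // them instead of stalling on per-element dependencies.
    fn batched(data: &mut [u32]) {
        for chunk in data.chunks_mut(8) { // 8 is an arbitrary illustrative width
            for x in chunk.iter_mut() { *x = f0(*x); }
            for x in chunk.iter_mut() { *x = f1(*x); }
        }
    }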

BUT, your compiler often won't do that sort of analysis and fix your code's shortcomings. If that approximate layout of instructions doesn't exist in your code explicitly then the optimizer won't solve for it. The difference in performance is absolutely massive when those scenarios crop up (often 4-8x). The "zero-cost" iterator API won't yield that better codegen, since it has an output that the optimizer can't effectively turn into that better solution (yet -- polyhedral models solve some similar problems, and that might be something that gets incorporated in modern optimizers eventually -- but it doesn't exist yet, it's very hard, and it's illustrative of the idea that optimizers can't solve all your woes; when that one is fixed there will still exist plenty more).

> zero-cost async

Another pitfall of "zero-cost" is that all it promises is that the generated code is the same as what you would have written by hand. We saw in the iterator model that "would have written" doesn't quite align between the programmer and the compiler, but it's more obvious in their async abstraction. Internally, Rust models async with state machines. More importantly, those all have runtime-known states.

You asked about hand-rolling the runtime to avoid Tokio in an embedded environment. That's a good start, but it's not enough (it _might_ be; "embedded" nowadays includes machines faster than some desktops from the 90s; but let's assume we're working in one of the more resource-constrained subsets of "embedded" programming). The problem is that the abstraction the compiler assumes we're going to need is much more complicated than an optimal solution given the requirements we actually have. Moreover, the compiler doesn't know those requirements and almost certainly couldn't codegen its assumptions into our optimal solution even if it had them. If you use Rust async/await, with very few exceptions, you're going to end up with both a nontrivial runtime (might be very light, but still nontrivial in an embedded sense), and also a huge amount of bloat on all your async definitions (along with runtime bloat (RAM+CPU) as you navigate that unnecessary abstraction layer).
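
One cheap way to see the per-definition bloat is to measure the compiler-generated state machine directly. A toy example (nothing here is special; any buffer held across an .await ends up stored inside the future):

    // Any local that's live across an .await has to be stored inside the
    // compiler-generated state machine, i.e. inside the future itself.
    async fn pending_once() {} // stand-in for "some IO we wait on"

    async fn handle_request() {
        let buf = [0u8; 1024];
        pending_once().await; // `buf` is still needed afterwards...
        let _ = buf[0];       // ...so the future carries all 1024 bytes around
    }

    fn main() {
        // No executor needed just to construct the future and measure it:
        // prints a bit over 1024, before anything has even been polled.
        println!("{}", std::mem::size_of_val(&handle_request()));
    }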

The compiler definitely can't strip away the runtime completely, at least for nontrivial programs. For sufficiently simple programs it does a pretty good job (you still might not be able to afford supporting the explicit state machines it leaves behind, but whatever, most machines aren't _that_ small). Past a certain complexity level, though, we're back to the idea of zero-cost abstractions not being real because of optimization impossibility: once you use most of the features you'd actually want to use with async/await, you find that the compiler can't fully desugar even very simple programs, and fully dynamic async (by definition) obviously can't exist without a runtime.

So, answering your question a bit more directly: you usually can't fix the issue by hand-rolling the core runtime, since the runtime won't be abstracted away (resulting in high RAM/ROM/CPU costs), and even in code simple and carefully constructed enough that it can be abstracted away, you're still left with full runtime state machines, which themselves are overkill for most simple async problems. The space and time those take up can be prohibitive.


Right, because this would deadlock. But it seems like Zig would have the same issue. If I am running something in an evented IO system and then I try to do some blocking IO inside it then I will get a deadlock. The idea that you can write libraries that are agnostic to the asynchronous runtime seems fanciful to me beyond trivial examples.


Honestly I don't see how that is different from how it works in Rust. Synchronous code is a proper subset of asynchronous code. If you have a streaming API then you can have an implementation that works in a synchronous way with no overhead if you want. For example, if you already have the whole buffer in memory then you can just use it, and the stream will work exactly like the loop you would write in the sync version.
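
As a minimal illustration of what I mean, using std::io::Read as the "streaming" trait (an analogy rather than the specific API under discussion):

    use std::io::{Cursor, Read};

    // Generic over any byte stream. With an in-memory buffer this is just a
    // loop over a slice; with a file or socket the same code does real
    // incremental IO. No async machinery is involved either way.
    fn count_newlines<R: Read>(mut src: R) -> std::io::Result<usize> {
        let mut buf = [0u8; 4096];
        let mut count = 0;
        loop {
            let n = src.read(&mut buf)?;
            if n == 0 { break; } // end of stream
            count += buf[..n].iter().filter(|&&b| b == b'\n').count();
        }
        Ok(count)
    }

    fn main() -> std::io::Result<()> {
        // The whole "stream" is already in memory: no waiting, no runtime.
        let in_memory = Cursor::new(b"one\ntwo\nthree\n".to_vec());
        println!("{}", count_newlines(in_memory)?);
        Ok(())
    }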


serde is a pull parser and it would take significant modification to convert it into an incremental push parser without blocking a thread.


> The problem is I still have some of their clothes I bought 10 years ago and their quality trumps premium brands now.

I'm skeptical of this claim. Maybe it's true for some particular brand but that's just an artifact of one particular "premium brand" essentially cashing in its brand equity by reducing quality while (temporarily) being able to command a premium price. But it is easier now than at any other time in my life to purchase high-quality clothing that is built to last for decades. You just have to pay for that quality, which is something a lot of people don't want to do.


> It's good to know that OOTB duckdb can replace snowflake et all in these situations, especially with how expensive they are.

Does this article demonstrate that though? I get, and agree, that a lot of people are using "big data" tools for datasets that are way too small to require it. But this article consists of exactly one very simple aggregation query. And even then it takes 16m to run (in the best case). As others have mentioned the long execution time is almost certainly dominated by IO because of limited network bandwidth, but network bandwidth is one of the resources you get more of in a distributed computing environment.

But my bigger issue is just that real analytical queries are often quite a bit more complicated than a simple count by timestamp. As soon as you start adding non-trivial compute to the query, or multiple joins (and g*d forbid you have a nested-loop join in there somewhere), or sorting, then the single-node execution time is going to explode.


I completely agree, real world queries are complicated joins, aggregations, staged intermediary datasets, and further manipulations. Even if you start with a single coherent 650GB dataset, if you have a downstream product based on it, you will have multiple copies and iterations, which also have to be reproducible, tracked in source control, and visualized in other tools in real time. Honestly, yes, parquet and duckdb make all this easier than awk. But they still need to be integrated into a larger system.


This depends a lot on what you are using exceptions for. I think in general the branch on Ok/Err is probably not meaningful performance-wise because the branch predictor will see right through it.

But more generally the happy-path/error-path distinction can be a bit murky. From my days writing Java it was very common to see code where checked exceptions were used as a sort of control flow mechanism, so you ended up on the slow path relatively frequently because that was just how you handled certain expected conditions that were arbitrarily designated as "exceptions". The idea behind Result types to me is just that recoverable, expected errors are part of the program's control flow and should be handled through normal code and not some side-channel. Exceptions/panics should be used only for actually exceptional conditions (programming errors which break some expected invariant of the system) and should immediately terminate the unit of work that experienced the exception.
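
In Rust terms, the distinction I mean looks roughly like this (the function names are made up):

    use std::num::ParseIntError;

    // Expected, recoverable failure: bad input is part of normal control flow,
    // so it comes back as a Result the caller handles like any other value.
    fn parse_port(s: &str) -> Result<u16, ParseIntError> {
        s.trim().parse::<u16>()
    }

    fn handle_line(line: &str) {
        match parse_port(line) {
            Ok(port) => println!("listening on {port}"),
            Err(e) => println!("ignoring bad input {line:?}: {e}"),
        }
    }

    // Broken invariant (a programming error), not an expected condition:
    // panic and terminate this unit of work rather than thread an error back.
    fn port_at(ports: &[u16], index: usize) -> u16 {
        assert!(index < ports.len(), "config index out of range");
        ports[index]
    }

    fn main() {
        handle_line("8080");
        handle_line("not-a-port");
        let _ = port_at(&[8080, 9090], 1);
    }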


Happened to me. Bought a house with wood floors in the basement. We had some flooding which ruined the wood, and when we ripped it out to replace it, it turned out the wood floors had been installed over the original asbestos tiles. From what I can tell, the asbestos tiles themselves were of no particular danger to us, but once they got wet and started cracking they had to be removed, which cost an additional couple thousand dollars on top of replacing the floors.


Yes

