I never understood why the recommended replacement for ß is ss. It is a ligature of sz (similar to & being a ligature of et) and is even pronounced ess-zet. The only logical replacement would have been sz, and it would have avoided the clash of Masse (mass) and Maße (measurements). Then again, it only affects whether the vowel before it is pronounced short or long, and there are better ways to encode that in written language in the first place.
I agree that writing it "sz" might have created fewer problems.
However, it is likely that it has never been pronounced "sz" but always "ss", and the habit of writing "sz" for the double consonant may have had the same reason as writing "ck" instead of a double "kk".
On the other hand, fonts can be an expression of your personality. Shouldn't it be preferable to centrally enable overriding fonts instead of forcing every site designer not to use custom fonts to express themselves? Theoretically, it is easier to remove formatting than it is to add it. Therefore, this functionality should be part of the browser, not the website. Firefox has this as an option: "Allow pages to choose their own fonts, instead of your selections above".
Personally, I quite like the site's design and its font. My gripe is more often with light gray text on a darker gray background. The bad readability that so many newer sites seem to prefer makes me question my eyes or my monitor's capabilities. Reader mode in Firefox is also often very helpful.
“Ideally” here is a statement of what I’d find ideal. I’m not nominating myself as font-police or suggesting that we force people to do anything.
But the feature is overused, IMO. Anything can be used to express a bit of personality, but I do think custom fonts are sometimes specified in cases where they really aren't adding anything.
I concur with most of these arguments, especially about longevity. But for me this only applies to smallish files like configurations, because I don't agree with the last paragraph regarding efficiency.
I have had to work with large 1GB+ JSON files, and it is not fun. Amazing projects exist, such as jsoncons for streaming JSON and simdjson for parsing JSON with SIMD, but as far as I know, the latter still does not support streaming and even has an open issue for files larger than 4 GiB. So you cannot have streaming for memory efficiency and SIMD parsing for computational efficiency at the same time. You want streaming because holding the whole JSON in memory is wasteful and sometimes not even possible. JSONL changes the format to fix that, but now you have yet another format that you need to support.
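To make the memory argument concrete, here is a minimal Python sketch (records.json, records.jsonl, and the "size" field are hypothetical): with JSONL you can process one record at a time instead of materializing the whole document.

    import json

    # Whole-document parsing: the entire structure must fit in memory.
    # with open("records.json") as f:
    #     data = json.load(f)

    # JSONL: one JSON document per line, so memory use stays at one record.
    total = 0
    with open("records.jsonl") as f:
        for line in f:
            if line.strip():
                record = json.loads(line)
                total += record.get("size", 0)  # hypothetical field
    print(total)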
I was also contemplating the mentioned formats for another project, but they are hardly usable when you need to store binary data, such as images, compressed data, or simply arbitrary data. Storing binary data as base64 strings seems wasteful. Random access into these files is also an issue, depending on the use case. Sometimes it would be a nice feature to jump over some data, but for JSON, you cannot do that without parsing everything in search of the closing bracket or quotes, accounting for escaped brackets and quotes, and nesting.
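As a quick illustration of how wasteful base64 is, a small Python sketch: every 3 bytes of binary data become 4 bytes of text, i.e., roughly 33% bloat before any JSON escaping or framing is even considered.

    import base64
    import json
    import os

    blob = os.urandom(3 * 1024 * 1024)          # 3 MiB of arbitrary binary data
    encoded = base64.b64encode(blob).decode()   # ~4 MiB of base64 text
    as_json = json.dumps({"data": encoded})     # plus quotes, key, and braces

    print(len(blob), len(encoded), len(as_json))
    # roughly 3145728, 4194304, 4194316 -> ~33% larger just from base64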
My rule of thumb, which has been surprisingly robust over several uses, is that if you gzip a JSON file you can expect it to shrink by a factor of about 15.
That is not the hallmark of a space-efficient file format.
Between repeated string keys and frequently repeated string values, which are often quite large due to being "human readable", it adds up fast.
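The rule of thumb is easy to check yourself; a minimal Python sketch (data.json is whatever file you have at hand):

    import gzip

    path = "data.json"  # hypothetical input file
    with open(path, "rb") as f:
        raw = f.read()
    compressed = gzip.compress(raw)

    print(f"{len(raw)} -> {len(compressed)} bytes, "
          f"ratio {len(raw) / len(compressed):.1f}")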
"I was also contemplating the mentioned formats for another project, but they are hardly usable when you need to store binary data, such as images, compressed data, or simply arbitrary data."
One trick you can use is to prefix a file with some JSON or other readable value, then dump the binary afterwards. The JSON can have offsets into the binary as necessary for identifying things, or labeling whether or not it is compressed, or whatever. This often largely mitigates the inefficiency concerns, because if you've got a big pile of binary data, the JSON bloat tends to be a small percentage of the payload; if it isn't, then of course I don't recommend this.
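A minimal Python sketch of that trick, assuming a made-up layout of a 4-byte header length, the JSON header, and then the raw payloads at the offsets recorded in the header:

    import json
    import struct

    def write_container(path, blobs):
        # blobs: dict of name -> bytes
        header = {"entries": []}
        offset = 0
        for name, blob in blobs.items():
            header["entries"].append({"name": name, "offset": offset, "size": len(blob)})
            offset += len(blob)
        header_bytes = json.dumps(header).encode()
        with open(path, "wb") as f:
            f.write(struct.pack("<I", len(header_bytes)))  # 4-byte header length
            f.write(header_bytes)
            for blob in blobs.values():
                f.write(blob)

    def read_entry(path, name):
        with open(path, "rb") as f:
            (header_len,) = struct.unpack("<I", f.read(4))
            header = json.loads(f.read(header_len))
            data_start = 4 + header_len
            for entry in header["entries"]:
                if entry["name"] == name:
                    f.seek(data_start + entry["offset"])
                    return f.read(entry["size"])
        raise KeyError(name)

Reading an entry then only touches the small header plus the bytes of that one payload.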
I can confirm usual compression ratios of 10-20 for JSON. For example, wikidata-20220103.json.gz is quite fun to work with. It is 109 GB, which decompresses to 1.4 TB, and even the non-compressed index for random access with indexed_gzip is 11 GiB. The compressed random access index format, which gztool supports, would be 1.4 GB (compression ratio 8). And rapidgzip even supports the compressed gztool format with further file size reduction by doing a sparsity analysis of required seek point data and setting all unnecessary bytes to 0 to increase compressibility. The resulting index is only 536 MiB.
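For anyone curious, random access into such a .gz with indexed_gzip looks roughly like this; a sketch from memory, so treat the exact method names as assumptions:

    import indexed_gzip

    # Building the index is the expensive part; export it so later runs can seek immediately.
    with indexed_gzip.IndexedGzipFile("wikidata-20220103.json.gz") as f:
        f.build_full_index()
        f.export_index("wikidata.gzindex")

    with indexed_gzip.IndexedGzipFile("wikidata-20220103.json.gz") as f:
        f.import_index("wikidata.gzindex")
        f.seek(1_000_000_000_000)   # jump into the decompressed stream
        chunk = f.read(4096)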
The trick for the mix of JSON with binary is a good reminder. That's how the ASAR file archive format works. That could indeed be usable for what I was working on: a new file format for random seek indexes. Although the gztool index format seems to suffice for now.
I see sooo many comments on this submission talking about large files. It feels like a massively over-represented concern to me.
On Linux, a good number of filesystems have built-in compression. My JSON all gets hit with lz4 compression automatically.
It is indeed annoying having to compress & decompress files before sending them. It'd be lovely if file transfer tools (including messaging apps) were a bit better at auto-compressing. I think btrfs tests for compressibility too, and will give up on trying to compress at some point; a similar effort ought to be applied here.
The large-file & efficiency questions feel like they're dominating this discussion, and they just don't seem like a particularly interesting or fruitful concern to me. It shouldn't matter much. The computer can and should generally be able to eliminate most of the downsides relatively effectively.
> I have had to work with large 1GB+ JSON files, and it is not fun.
I have also had to work with large JSON files, even though I would prefer other formats. I wrote some C code to split them into records, which is done by keeping track of the nesting level, of whether or not you are inside a string, and of escaping within strings (so that escaped quotation marks are handled properly). It is not too difficult.
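A minimal Python sketch of the same bookkeeping, assuming the top level is a JSON array whose elements are objects or arrays:

    def split_records(text):
        # Yield the top-level elements of a JSON array as raw strings,
        # tracking nesting depth, string state, and escapes.
        depth = 0
        in_string = False
        escaped = False
        start = None
        for i, c in enumerate(text):
            if in_string:
                if escaped:
                    escaped = False
                elif c == "\\":
                    escaped = True
                elif c == '"':
                    in_string = False
            elif c == '"':
                in_string = True
            elif c in "[{":
                depth += 1
                if depth == 2:
                    start = i
            elif c in "]}":
                if depth == 2:
                    yield text[start:i + 1]
                depth -= 1

The real version would of course consume the input in chunks instead of one big string.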
> I was also contemplating the mentioned formats for another project, but they are hardly usable when you need to store binary data, such as images, compressed data, or simply arbitrary data. Storing binary data as base64 strings seems wasteful.
I agree, which is one reason I do not like JSON (I prefer DER). In addition to that, there is the escaping of text.
> Random access into these files is also an issue, depending on the use case. Sometimes it would be a nice feature to jump over some data, but for JSON, you cannot do that
With DER you can easily skip over any data.
However, I think formats with type/length/value (such as DER) do not work as well for streaming writes, since each length must be known up front, and vice versa.
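For illustration, skipping an element in a type/length/value encoding is just arithmetic; a minimal Python sketch for DER-style definite lengths (single-byte tags assumed for brevity):

    def skip_element(buf, pos):
        # Return the offset just past the DER element starting at pos.
        # Assumes single-byte tags and definite lengths, as DER requires.
        pos += 1                      # tag octet
        length = buf[pos]
        pos += 1
        if length >= 0x80:            # long form: low 7 bits = number of length octets
            n = length & 0x7F
            length = int.from_bytes(buf[pos:pos + n], "big")
            pos += n
        return pos + length           # jump over the value without parsing it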
This resonates. I have finally started looking into local inference a bit more recently.
I have tried Cursor a bit, and whatever model it used worked somewhat alright to generate a starting point for a feature and for a large refactor, and to break through writer's block. It was fun to see it behave similarly to my own workflow by creating step-by-step plans before doing work, then searching for functions to find the locations and change things. I feel like one could learn structured-thinking approaches from reading these agentic AI logs. There were lots of issues with both of these tasks, though, e.g., many missed locations for the refactor and spuriously deleted or indented code, but it was a starting point and somewhat workable with git. The refactoring usage caused me to reach the free token limit in only two days. Based on the usage, it burned through millions of tokens in minutes, only rarely less than 100K tokens per request, and therefore probably needs a similarly large context length for best performance.
I wanted to replicate this with VSCodium and Cline or Continue, because I want to use it without exfiltrating all my data to megacorps as payment, use it to work on non-open-source projects, and maybe even use it offline. Having Cursor start indexing everything in the project folder, including possibly private data, as soon as it starts left a bad taste, as useful as it is. But I quickly ran into context length problems with Cline, and Continue does not seem to work very well. Some models did not work at all, and DeepSeek was thinking for hours in loops (default temperature too high, supposedly it should be <0.5). And even after getting tool use to work somewhat with Qwen QwQ 32B Q4, it feels like it does not have a full view of the codebase, even though it has been indexed. For one refactor request mentioning names from the project, it started by doing useless web searches. It might also be a context length issue. But larger contexts really eat up memory.
I am also contemplating a new system for local AI, but it is really hard to decide. You have the choice between fast GPU inference, e.g., an RTX 5090 if you have the money, or one or two used RTX 3090s, and slower but qualitatively better CPU / integrated-GPU inference on unified memory with systems such as the DGX Spark, the Framework Desktop with the AMD Ryzen AI Max, or the Mac Pro systems. Neither is ideal (nor cheap). That said, my problems with context length and low-performing agentic models seem to indicate that slower but more capable models on a large unified memory would be better for my use case, which would mostly be agentic coding. Code completion does not seem to fit me, because I find it distracting and I don't need much boilerplate.
It also feels like the GPU would be wasted, and local inference might be a red herring altogether. Given that a batch size of 1 is one of the worst cases for GPU computation and that the hardware would only be used in bursts, any cloud solution will easily be an order of magnitude or two more efficient, if I understand this correctly. Maybe local inference will therefore never fully take off, barring even more specialized hardware or hard requirements on privacy, e.g., for companies. Solving that would take something like computing on encrypted data, which exists (homomorphic encryption) but seems impractical at this scale.
Then again, if a batch size of 1 really is as bad as I think it is, then maybe one could simply generate a batch of results in parallel and pick the best answer? Maybe this is not a thing because it would increase memory usage even more.
The AI overview was the straw that broke the camel's back for me recently. But I have also suffered from dark-mode issues for a long time. On almost every visit, it shows the outer background dark but the smaller search-results background as white, while the result text is still styled for dark mode, ergo, it is not readable. After refreshing, it works, but this user experience is untenable for a trillion-dollar company. I changed to Startpage.com, though.
When you do want an AI overview you can have Kagi do that by adding a ? at the end of your query. It flows nicely for me as the difference between searching for something, or just asking the internet a question. Kagi cites sources and allows you to move the conversation to a new LLM session.
I don't mind it necessarily, as I use it all the time in place of Googling, so you would think that folding it into search would work well.
But it doesn't, at least not for me. But I think that's tied to poor implementation, design, and being a solution looking for a problem rather than a philosophical issue with using AI to improve search experience.
The anime Yukikaze (2002-2005) has some similar themes. It's about a fighter-jet pilot using a new AI-supported jet to fight against aliens. It asserts that the combination of human intuition and artificial intelligence trumps either of the two on its own. If I remember correctly, the jet can fly on its own, but when things get dangerous, the human pilot only uses the AI's hints instead of letting it fly itself.
Yukikaze is a very interesting novel - I still have to sit down and read the novels instead of just watching the anime - but an important plot element is the interaction between humans and their AIs (which are not just the usual "human in a computer"), as well as a different take on popular views of which way an AI will decide in a conflict :)
I wish Kambayashi were more widely known. Or Japanese sci-fi and light novels in general. There have been a couple of legitimate "oh, that's now reality" moments for me in real-world developments of AI.
> I'm sure there are zip-bomb equivalents in binary formats like .xlsx, PDF, .docx, etc.
Yes. Both docx and xlsx are literally just zips of XML files with a different extension. PDF can contain zlib streams, which use the same deflate compression as gzip, so all the mentioned methods apply to all three formats.
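You can see this directly; a minimal Python sketch with a hypothetical document.docx. The zip directory even reports per-entry sizes, which is exactly what a decompression-bomb check would look at:

    import zipfile

    with zipfile.ZipFile("document.docx") as z:
        for info in z.infolist():
            ratio = info.file_size / max(info.compress_size, 1)
            print(f"{info.filename}: {info.compress_size} -> {info.file_size} bytes "
                  f"(x{ratio:.0f})")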
The "gold" would not be radioactive as far as I understand. The reaction is 198Hg (stable) -> 197Hg (65h half life) -> 197Au (stable). You would end up with a mix of radioactive 197Hg and stable 197Au. It should be easy to separate these with established processes because you can easily refine 99.99% gold via various chemical and electrolytic processes. But I doubt any established refinery would touch radioactive inputs because it would contaminate the whole processing chain.
Even if you could separate mercury-198 at zero cost, it would only be 10% of mercury production, and yearly mercury production is 4500 t/yr, i.e., at most 450 t/yr of mercury-198. Compare this to gold production of 3100 t/yr, or silver production of 27000 t/yr. One might argue that mercury production could be ramped up if it were needed more, but its abundance in the Earth's crust is only slightly higher than silver's, and again, mercury-198 would be 10x rarer than silver, i.e., only about twice as abundant as gold.
> Since the process described here permanently transmutes mercury into a valuable material, it is possible that fusion transmutation could be considered as a form of waste disposal. While early plants will be highly incentivized to specifically transmute 198Hg, we note that the isotopes with higher neutron number can also in the long term be transmuted to 197Au...
> The EU also has 6000 tons of mercury currently and expects to need to dispose of 11,000 tons over the next 40 years [95, 96]. As such, even with no change in existing processes, 14,000 metric tons of mercury could be made available for processing and isotope removal in the next ten years of fusion development, corresponding to 1400 tons of 198Hg and about the same mass of 197Au, with a current market value of ∼ $140B.
Yes, that section is fitting and interesting; it is the production-side view. I think I was more motivated by the comments envisioning an abundance of cheap gold, which does not seem anywhere near or even possible, as cool and baffling as this approach is.
I don't think it is of much use as waste disposal, because again, it can only remove 10%, i.e., an insignificant amount. If mercury were mined specifically for this, more mercury waste would be produced than before, but increased mining is probably many decades or centuries in the future, as long as there is still waste to reuse.
So, how long would the projected stockpile of 1400 t of 198Hg for the next ten years last? At 5 t per GW per year, i.e., 5 t per 8.76 TWh, and a current global electricity generation of ~30 PWh/yr, replacing all electricity production with fusion would correspond to roughly 3400 GW of continuous output, enough to transmute about 17,000 t of 198Hg per year, more than ten times the stockpile. Of course, there would be a myriad of other bottlenecks long before that, but consuming all the existing stockpile seems feasible on human time scales.
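The back-of-the-envelope arithmetic, as a small Python sketch (the 5 t per GW-year figure is taken from the discussion above):

    gw_year_twh = 8.76          # 1 GW running for a year, in TWh
    global_twh = 30_000         # ~30 PWh/yr of global electricity generation
    tonnes_per_gw_year = 5      # 198Hg transmuted per GW-year, as quoted above
    stockpile = 1_400           # t of 198Hg expected over the next decade

    gw_equivalent = global_twh / gw_year_twh               # ~3,400 GW of continuous output
    tonnes_per_year = gw_equivalent * tonnes_per_gw_year   # ~17,000 t/yr
    print(gw_equivalent, tonnes_per_year, tonnes_per_year / stockpile)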
I am honestly impressed by the amount of transmutation that is possible with fusion. And it is a lucky coincidence that the half-life of the intermediate product is only dozens of hours. I never thought of that process and would have guessed grams of production rather than tons, probably because of the association with existing particle accelerators. It is quite amazing, but also presumably still decades in the future.