Bad news: the optimizer is moving your functions outside the loop.
for _ in 0..<iterations {
    let result = flatuseStruct(outputData)
    assert(result == 8644311667)
    total = total + UInt64(result)
}
Looking at the assembly... the call to `flatuseStruct` is moved outside the loop in Release builds. You're only measuring 1 thousand iterations of `flatuseStruct`, not 1 million.
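Conceptually, the optimised code ends up equivalent to something like this (a sketch of the transformation, not the actual codegen):

// Sketch of what loop-invariant code motion leaves behind (not the real assembly):
let result = flatuseStruct(outputData)   // invariant call, hoisted out and executed once
assert(result == 8644311667)             // asserts are stripped in Release builds anyway
for _ in 0..<iterations {
    total = total + UInt64(result)       // only this cheap accumulation stays in the loop
}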
Your red flag should have been this:
> One million times decoding of a small object graph took 0.35ms
That's literally impossible. That's doing 2.8 billion iterations per second. A single function call generally takes around 2 nanoseconds, so you can't even manage 1 billion calls per second, let alone 2.8 billion.
Eager run
=================================
1557 ms encode
264 ms decode
34 ms use
206 ms dealloc
504 ms decode+use+dealloc
0,38 ms direct
0,32 ms using struct
=================================
Total counter1 is 8644311667000000
Total counter2 is 8644311667000000
Total counter3 is 8644311667000000
Encoded size is 315 bytes, should be 344 if not using unique strings
=================================
I can't figure out what's at these two addresses; lldb didn't seem to accept any reasonable syntax. lldb is terrible. But I'll bet that RBX is holding the value of `total`. I don't know what r14 is, and it doesn't seem to matter since nothing here uses it.
So this code actually times one call to flatDecodeDirect, then 200 iterations of an unrolled do-nothing loop. The compiler has figured out somehow that flatDecodeDirect is going to do exactly the same thing each time, and taken advantage of that by calling it only once. I'm guessing this means that flatDecodeDirect is only called 1,000 times in total.
As a sanity check for this kind of thing - try making a little loop that just increments an integer the appropriate number of times, and see how long that takes. (Check the assembly language output to ensure the generated code is doing what you think - it should be a 2-instruction loop.)
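A minimal sketch of such a sanity check might look like this (timing via Foundation's Date is just one option):

import Foundation

let iterations: UInt64 = 1_000_000
var counter: UInt64 = 0

let start = Date()
for _ in 0..<iterations {
    counter += 1
}
let elapsed = Date().timeIntervalSince(start)

// Print the counter so the loop isn't dead code, and, as suggested above,
// check the assembly to make sure the loop itself survived optimisation.
print("counter = \(counter), took \(elapsed * 1000) ms")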
On my laptop that takes 1.8ms. This isn't the absolute limit of how long it takes to do 1,000,000 of anything, but it'll do as a rough estimate. So you should be suspicious if a program suggests it's taking much less time than that to do 1,000,000 of something that's a lot more complicated, as the test did. (It reported 1,000,000 iterations in 0.53ms on my PC.)
(Of course, as with any rough estimate, this only gives you a suspicion, and isn't proof without further investigation.)
This is happening because of a compiler optimisation technique called Loop-Invariant Code Motion [0], which means that the author is not measuring what he thinks he's measuring (and this should really be obvious from the numbers), so the result is meaningless.
* Swift's boolean -> integer conversion is oddly slow (should probably be reported)
* allocations are expensive (duh)
* Converting arbitrary binary data to a string is more work than it looks, which is obvious coming from C or C++ but possibly less so coming from higher-level languages:
> String conversion. If I use byte array to string conversion I move from 0.35ms to 1774.73ms. And if I do what the test needs to do (get the length of the string “s.utf8.count”), I am at 2737.4ms. Which is an additional second spend doing factually nothing.
Except it's not doing nothing; it has to:
* allocate a buffer for the string (possibly multiple times, depending on how reservation works by default), and, specifically for Swift, there's the potential additional issue of NSString bridging
* validate that the input is decodable, and possibly transcode to whatever the internal encoding is if it's not UTF-8
* iterate the string's code points and sum the number of UTF-8 bytes needed to encode each of them
That's a shit-ton of work compared to doing literally nothing if you just check the number of bytes in the original array.
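To make the contrast concrete, here is a small sketch with made-up data (using Foundation's String(bytes:encoding:) for the conversion):

import Foundation

let bytes: [UInt8] = Array("hello, flatbuffers".utf8)   // stand-in for the decoded payload

// The expensive route criticised above: build a String, then ask for its UTF-8 length.
// This allocates, validates the bytes as UTF-8 and walks the contents.
if let s = String(bytes: bytes, encoding: .utf8) {
    print("length via String: \(s.utf8.count)")
}

// The "doing literally nothing" route: the byte count is already known.
print("length via bytes:  \(bytes.count)")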
I might be misreading this, but that number doesn't seem possible. He says he can do 1 million decodings in 0.35ms, but that means each decoding is done in about a third of a nanosecond, which sounds unreasonable. I'm not familiar with FlatBuffers, but surely there needs to be some sort of validation step for the data, right?
As was said above, this is probably a micro-benchmarking compiler-optimization mistake. I.e.: don't ignore your result, or the compiler will probably optimize it away.
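For illustration, one common way to harden such a loop (a sketch only; `blackHole`, `workload` and `inputs` are made-up names, not anything from the original benchmark) is to vary the input per iteration and push the result through an opaque, non-inlinable sink:

import Foundation

@inline(never)
func blackHole<T>(_ value: T) {
    // Opaque, non-inlinable sink: makes it much harder for the optimizer
    // to prove the value is unused and delete the work that produced it.
}

@inline(never)
func workload(_ bytes: [UInt8]) -> UInt64 {
    // Stand-in for the decode/use function being measured.
    return bytes.reduce(UInt64(0)) { $0 &+ UInt64($1) }
}

let iterations = 1_000_000
// Several distinct inputs, so the call inside the loop is not loop-invariant.
let inputs: [[UInt8]] = (0..<4).map { seed in (0..<64).map { UInt8(($0 &+ seed) % 256) } }

var total: UInt64 = 0
let start = Date()
for i in 0..<iterations {
    total &+= workload(inputs[i % inputs.count])
}
let elapsed = Date().timeIntervalSince(start)

blackHole(total)   // consume the accumulated result
print("total = \(total), took \(elapsed * 1000) ms")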
Rather unlikely. To get “faster than C” you need to hand-code in assembler, and know your target CPU really well to outsmart the compiler.
One advantage of C is that it's relatively easy to see what the CPU does when you look at the source. In Swift, this is no longer the case. So, unless you know Swift really well, being “as fast as C” doesn't come easily. One thing that helps you here is looking at your compiler's assembly output. Sadly, with Swift this is not as convenient in Xcode as it is with (Objective-)C.
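If it helps: with a recent toolchain you can dump the optimised assembly from the command line with `swiftc -O -emit-assembly YourFile.swift` (the file name here is just a placeholder), without going through Xcode.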
'restrict' is part of the C specification these days, so Fortran has no theoretical performance advantage over C. Of course, there is a body of code in Fortran built up with restrict-semantics applied by default, so there may well be real-world advantages.
"We saw that allocating memory and retain release calls where dominating when profiling. So why not just do the same bare bone thing that C does. Write a bunch of functions which read data from a byte array without allocating any objects."
Good, but this can become a problem if you are not properly clearing the buffer and suddenly start leaking data. OpenSSL is the current big example.
Doing benchmarks without knowing what the code is doing, without understanding what is happening, and then talking about being faster than C only shows naïveté.
I hope you will use this mistake to learn how things work and to investigate in more detail before announcing misleading and erroneous information.
I can understand the desire to beat C, and that this comes from detaching software from hardware, which obscures the basis of software and turns it into a kind of magic driven by faith.
The writer disbelieved their own results, got others to double and triple check their work, and when the correct explanation emerged, published a correction. I think you're being overly harsh (and, in the end, the results were very close to C).
The good thing about any narrative is that it resonates with different people on different levels.
The blog post is indeed titled "10,000 times faster Swift".
I thought it would be a catchy title, even though going from 6 seconds to 0.35 ms is not a factor of 10,000.
I thought about renaming it to "500 times faster Swift", which would be rather more accurate in light of the current findings, but then, what the heck. It's a blog post. I didn't publish a scientific paper. I just reflected on my recent work.
The main points of the blog post were in any case about how low-level optimisations can make Swift programs faster. And as a matter of fact, loop-invariant code motion was a valid technique to get the same result, the result being the sum of the payload content. The compiler was smarter than me: it gave me the same result doing 250 times less work. I find it impressive.
I must be honest: I am not fluent in assembly, which is why I could not figure it out by myself.
Was I suspicious? Absolutely!!!
But the facts were in my face.
Shouldn't I publish an article where I am not sure why I got what I got?
If I hadn't published the article, I would not have figured out the truth and wouldn't have learned from this experience.
And after all, this post is about performance pitfalls in the Swift language. The comparison with C was almost accidental. I would have compared it with C++ if I had a Windows machine, as the benchmark for the C++ project has Windows-specific code. I also consulted the author of flatcc, who is much more relaxed about my blog post than you are :)
This blog post is about learning something. I learned something before I wrote this post, I shared it, and now I have learned even more.
You should try it yourself.
Maybe not as satisfying as criticising, but it also has its moments.
This. I may have been harsh, but I won't assume it was only a rookie mistake. And being 1.8x slower (45ms to 25ms?) is not even "near" C, but it is surely what you'd expect from a "modern" language.
It would have been more honest to modify the headline and put corrections inline, for sure. (Still, 170x faster is not bad — I would still have clicked :-) )
How do you know? There is no data on the respective machines and environments. A number from a tweet and a number from a blog post do not make such data.