Bad news: the optimizer is moving your functions outside the loop.
for _ in 0..<iterations {
    let result = flatuseStruct(outputData)
    assert(result == 8644311667)
    total = total + UInt64(result)
}
Looking at the assembly... the call to `flatuseStruct` is moved outside the loop in Release builds. You're only measuring 1 thousand iterations of `flatuseStruct`, not 1 million.
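Conceptually, the optimised code ends up equivalent to something like this (a sketch of the transformation, not the actual codegen):

// Sketch of what loop-invariant code motion leaves behind (not the real assembly):
let result = flatuseStruct(outputData)   // invariant call, hoisted out and executed once
assert(result == 8644311667)             // asserts are stripped in Release builds anyway
for _ in 0..<iterations {
    total = total + UInt64(result)       // only this cheap accumulation stays in the loop
}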
Your red flag should have been this:
> One million times decoding of a small object graph took 0.35ms
That's literally impossible. That's doing 2.8 billion iterations per second. A single function call generally takes around 2 nanoseconds, so you can't even manage 1 billion calls per second, let alone 2.8 billion.
Eager run
=================================
1557 ms encode
264 ms decode
34 ms use
206 ms dealloc
504 ms decode+use+dealloc
0,38 ms direct
0,32 ms using struct
=================================
Total counter1 is 8644311667000000
Total counter2 is 8644311667000000
Total counter3 is 8644311667000000
Encoded size is 315 bytes, should be 344 if not using unique strings
=================================
I can't figure out what's at these two addresses; lldb didn't seem to accept any reasonable syntax. lldb is terrible. But I'll bet that RBX is holding the value of `total`. I don't know what r14 is, and it doesn't seem to matter since nothing here uses it.
So this code actually times one call to flatDecodeDirect, then 200 iterations of an unrolled do-nothing loop. The compiler has figured out somehow that flatDecodeDirect is going to do exactly the same thing each time, and taken advantage of that by calling it only once. I'm guessing this means that flatDecodeDirect is only called 1,000 times in total.
As a sanity check for this kind of thing - try making a little loop that just increments an integer the appropriate number of times, and see how long that takes. (Check the assembly language output to ensure the generated code is doing what you think - it should be a 2-instruction loop.)
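A minimal sketch of such a sanity check might look like this (timing via Foundation's Date is just one option):

import Foundation

let iterations: UInt64 = 1_000_000
var counter: UInt64 = 0

let start = Date()
for _ in 0..<iterations {
    counter += 1
}
let elapsed = Date().timeIntervalSince(start)

// Print the counter so the loop isn't dead code, and, as suggested above,
// check the assembly to make sure the loop itself survived optimisation.
print("counter = \(counter), took \(elapsed * 1000) ms")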
On my laptop that takes 1.8ms. This isn't the absolute limit of how long it takes to do 1,000,000 of anything, but it'll do as a rough estimate. So you should be suspicious if a program suggests it's taking much less time than that to do 1,000,000 of something that's a lot more complicated, as the test did. (It reported 1,000,000 iterations in 0.53ms on my PC.)
(Of course, as with any rough estimate, this only gives you a suspicion, and isn't proof without further investigation.)
This is happening because of a compiler optimisation technique called Loop-Invariant Code Motion [0], which means that the author is not measuring what he thinks he's measuring (and this should really be obvious from the numbers), so the result is meaningless.
* Swift's boolean -> integer conversion is oddly slow (should probably be reported)
* allocations are expensive (duh)
* Converting arbitrary binary data to a string is more work than it looks, which is obvious coming from C or C++ but possibly less so coming from higher-level languages:
> String conversion. If I use byte array to string conversion I move from 0.35ms to 1774.73ms. And if I do what the test needs to do (get the length of the string “s.utf8.count”), I am at 2737.4ms. Which is an additional second spend doing factually nothing.
Except it's not doing nothing; it has to:
* allocate a buffer for the string (possibly multiple times, depending on how reservation works by default), and, specifically for Swift, there's the potential additional issue of NSString bridging
* validate that the input is decodable, and possibly transcode to whatever the internal encoding is if it's not UTF-8
* iterate the string's code points and sum the number of UTF-8 bytes needed to encode each of them
That's a shit-ton of work compared to doing literally nothing if you just check the number of bytes in the original array.
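To make the contrast concrete, here is a small sketch with made-up data (using Foundation's String(bytes:encoding:) for the conversion):

import Foundation

let bytes: [UInt8] = Array("hello, flatbuffers".utf8)   // stand-in for the decoded payload

// The expensive route criticised above: build a String, then ask for its UTF-8 length.
// This allocates, validates the bytes as UTF-8 and walks the contents.
if let s = String(bytes: bytes, encoding: .utf8) {
    print("length via String: \(s.utf8.count)")
}

// The "doing literally nothing" route: the byte count is already known.
print("length via bytes:  \(bytes.count)")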
I might be misreading this, but that number doesn't seem possible. He says he can do 1 million decodings in 0.35ms, but that means each decoding is done in about a third of a nanosecond, which sounds unreasonable. I'm not familiar with FlatBuffers, but surely there needs to be some sort of validation step for the data, right?
As was said above, this is probably a micro-benchmarking compiler-optimization mistake. I.e.: don't ignore your result, or the compiler will probably optimize it away.
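For illustration, one common way to harden such a loop (a sketch only; `blackHole`, `workload` and `inputs` are made-up names, not anything from the original benchmark) is to vary the input per iteration and push the result through an opaque, non-inlinable sink:

import Foundation

@inline(never)
func blackHole<T>(_ value: T) {
    // Opaque, non-inlinable sink: makes it much harder for the optimizer
    // to prove the value is unused and delete the work that produced it.
}

@inline(never)
func workload(_ bytes: [UInt8]) -> UInt64 {
    // Stand-in for the decode/use function being measured.
    return bytes.reduce(UInt64(0)) { $0 &+ UInt64($1) }
}

let iterations = 1_000_000
// Several distinct inputs, so the call inside the loop is not loop-invariant.
let inputs: [[UInt8]] = (0..<4).map { seed in (0..<64).map { UInt8(($0 &+ seed) % 256) } }

var total: UInt64 = 0
let start = Date()
for i in 0..<iterations {
    total &+= workload(inputs[i % inputs.count])
}
let elapsed = Date().timeIntervalSince(start)

blackHole(total)   // consume the accumulated result
print("total = \(total), took \(elapsed * 1000) ms")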
Rather unlikely. To get “faster than C” you need to hand-code in assembler, and know your target CPU really well to outsmart the compiler.
One advantage of C is that it's relatively easy to see what the CPU does when you look at the source. In Swift, this is no longer the case. So, unless you know Swift really well, being “as fast as C” doesn't come easily. One thing that helps you here is looking at your compiler's assembly output. Sadly, with Swift this is not as convenient in Xcode as it is with (Objective-)C.
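If it helps: with a recent toolchain you can dump the optimised assembly from the command line with `swiftc -O -emit-assembly YourFile.swift` (the file name here is just a placeholder), without going through Xcode.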
'restrict' is part of the C specification these days, so Fortran has no theoretical performance advantage over C. Of course, there is a body of code in Fortran built up with restrict-semantics applied by default, so there may well be real-world advantages.
"We saw that allocating memory and retain release calls where dominating when profiling. So why not just do the same bare bone thing that C does. Write a bunch of functions which read data from a byte array without allocating any objects."
Good, but this can become a problem if you are not properly clearing the buffer and suddenly start leaking data. OpenSSL is the current big example.
Doing benchmarks without knowing what the code is doing, without understanding what is happening, and then talking about being faster than C only shows naïveté.
I hope you will use this mistake to learn how things work and to investigate in more detail before announcing misleading and erroneous information.
I can understand the desire to beat C, and that this comes from detaching software from hardware, which obscures the basis of software and turns it into a kind of magic driven by faith.
The writer disbelieved their own results, got others to double and triple check their work, and when the correct explanation emerged, published a correction. I think you're being overly harsh (and, in the end, the results were very close to C).
The good thing about any narrative is that it resonates with different people on different levels.
The blog post is indeed titled "10,000 times faster Swift".
I thought it would be a catchy title, even though going from 6 seconds to 0.35 ms is not a factor of 10,000.
I thought about renaming it to "500 times faster Swift", which would be rather more accurate in light of the current findings, but then, what the heck. It's a blog post. I didn't publish a scientific paper. I just reflected on my recent work.
The main points of the blog post were in any case about how low-level optimisations can make Swift programs faster. And as a matter of fact, loop-invariant code motion was a valid technique to get the same result, the result being the sum of the payload content. The compiler was smarter than me: it gave me the same result doing 250 times less work. I find it impressive.
I must be honest: I am not fluent in assembly, which is why I could not figure it out by myself.
Was I suspicious? Absolutely!!!
But the facts were in my face.
Shouldn't I publish an article where I am not sure why I got what I got?
If I hadn't published the article, I would not have figured out the truth and wouldn't have learned from this experience.
And after all, this post is about performance pitfalls in the Swift language. The comparison with C was almost accidental. I would have compared it with C++ if I had a Windows machine, as the benchmark for the C++ project has Windows-specific code. I also consulted the author of flatcc, who is much more relaxed about my blog post than you are :)
This blog post is about learning something. I learned something before I wrote this post, I shared it, and now I have learned even more.
You should try it yourself.
Maybe not as satisfying as criticising, but it also has its moments.
This. I may have been harsh, but I won't assume it was only a rookie mistake. And being 1.8x slower (45ms to 25ms?) is not even "near" C, but it is surely what you'd expect from a "modern" language.
It would have been more honest to modify the headline and put corrections inline, for sure. (Still, 170x faster is not bad — I would still have clicked :-) )
How do you know? There is no data on the respective machines and environments. A number from a tweet and a number from a blog post do not make such data.