Viable ROP-free roadmap for i386/armv8/riscv64/alpha/sparc64 (marc.info)
116 points by zdw on Sept 26, 2023 | 63 comments



All of this complexity is one of the key reasons that Wasm does not have an addressable stack (it is fully virtualized in the execution semantics). That moves the management of stack-allocated data structures up a level, but guarantees that the machine has control flow integrity (CFI) built in.


Code compiled to WASM typically also has a stack in linear memory for storing the program's data structures and any values that need to be addressable, and that stack has no such protection. Only the "local" variables and the operand stack are protected.

"Control-Flow Integrity" can have a general sense, or a specific sense. In the general sense, it means that function pointers and return addresses are protected — and WASM protects the raw function pointers.

In code compiled to WASM, a function pointer is represented as an integer indexing into a single global list with all functions. The "call_indirect" op checks only that the index is within bounds and that the function type signature of the function you look up matches.

In the specific sense, as in the 2005 paper titled "Control-Flow Integrity: Principles, Implementations, and Applications", it refers to the code also enforcing that a legitimate function pointer is used at each indirect call site: each site has its own list of allowed function pointers. WASM does not do this.
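
To make this concrete, here is a rough C sketch (my own, assuming a C-to-wasm32 toolchain such as clang with --target=wasm32; the function names are illustrative): the indirect call compiles to "call_indirect", which traps on a bad index or a signature mismatch, but any same-signature function in the table remains a legal target, which is exactly what the per-site check from the 2005 paper would forbid.

  typedef int (*binop)(int, int);

  int add(int a, int b) { return a + b; }
  int sub(int a, int b) { return a - b; }

  int apply(binop op, int a, int b) {
      /* Lowered to call_indirect: 'op' is just an index into the
         module's function table. The engine traps if the index is
         out of bounds or the entry's type is not (i32, i32) -> i32,
         but it will happily call sub() where add() was intended,
         because both signatures match. */
      return op(a, b);
  }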

(Sorry for the long post but I just don't want people to get confused and believe that WASM is safer than it is)


Wait, it's not about addressability, but essentially about type confusion - typically data that is not actually a code pointer suddenly gets used as such by `ret`. If you are running on a machine that can push an integer but pop it as something else you're still in trouble. BTW, signed pointers are essentially a type check mechanism, where forging types is hard.


The activation records don't have to be stored in one piece: it's entirely possible to have a call stack that stores only return addresses and a data stack that stores frames of local variables (that's how you'd normally program e.g. on 6502). Which is a scenario that x64 arguably supports too, having separate RSP and RBP registers, with RBP-addressing being nice and easy and RSP-relative addressing encoded very cumbersomely.


Clang/LLVM supports such a scheme. It also keeps other variables there, provided they are never accessed through a pointer. Many utilities on several BSDs are compiled with it, but they are necessarily statically linked. <https://clang.llvm.org/docs/SafeStack.html>
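
As a rough illustration (my own sketch based on the Clang documentation linked above, not taken from the BSD builds), locals that are never accessed through a pointer stay on the protected stack next to the return address, while address-taken buffers are moved to the separate unsafe stack:

  /* Build with something like: clang -fsanitize=safe-stack example.c */
  #include <stdio.h>

  void f(void) {
      int counter = 0;   /* address never taken: stays on the safe stack */
      char buf[64];      /* address escapes to fgets(): placed on the
                            separate unsafe stack, away from the saved
                            return address */
      if (fgets(buf, sizeof buf, stdin))
          counter++;
      printf("%d: %s", counter, buf);
  }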

It is also the default on Fuchsia, which therefore supports shared libraries. <https://fuchsia.dev/fuchsia-src/concepts/kernel/safestack>

The problem with these software-based approaches is that they are security-by-obscurity, which breaks if the address of the safe stack leaks. Like ASLR, it is considered more or less broken on 32-bit systems, where an attacker can allocate large portions of the address space, see where allocation fails, and then make educated guesses about where the safe stack lives. However, there have been a few papers using Intel's MPK or even CET to protect it properly, at some performance cost of course.

It was also the model that Itanium used: you got the return address in a register, and because register windows were spilled to a separate backing stack, the return address was saved there too.


There's AMD's shadow stack and Intel CET, which do that as well. For compatibility, the return address is also kept on the data stack, but it is verified against the shadow stack.


CET complicates software virtual machines in several ways, e.g. deoptimization and exception handling. It's the wrong solution to a software problem that should never have arisen and just makes software even more complicated. An unfortunate metastasis of running unsafe code without checks for too long.


Funny that you use the term "x64", which is a technically incorrect way to refer to 64 bit x86 / amd64. If anything, "x64" refers to Alpha: 21064, 21164, 21264, 21364.


If the thing is widely being known as "x64", then "x64" is that thing's name. That's what the word "name" means.

And if you want to be picking nits, it's actually EM64T or at the very worst, IA-32e. Then again, there are actually two versions of this ISA and Intel is currently calling its version "Intel 64" (and AMD used to call their pre-release version "x86-64", by the way), but definitely not "64-bit x86".

Edit: Oh hey, you're the same guy who made that silly argument 11 months ago [0]. Never mind me then.

[0] https://news.ycombinator.com/item?id=33097250


This is in a thread that's about various CPUs INCLUDING Alpha.

Good job ;)


I sure hope you don't call those boxes full of electronics being discussed 'computers' since that's obviously a technically incorrect way to refer to them[1].

[1] https://en.wikipedia.org/wiki/Computer_(occupation)


Your pedantry is incorrect. Computers are people who compute and machines that compute.

The point is that in a discussion that includes Alpha, "x64" obviously isn't clear. Doubling down with "but everyone does it" isn't a good look.


It's a good argument. If "everyone else does it" wasn't an effective argument against changing names, "oceanography" would've been rightfully renamed to "oceanology" by now.


Your accusation of pedantry, under the circumstances, is comical. Talk about 'isn't a good look'.


My post was about the literal incorrectness of calling amd64 "x64" in a thread that also references Alpha, which is the correct referent of "x64". You're the one who tried to make it seem silly by claiming that I refer to computers incorrectly because I mean electronic computers rather than human computers.

Now you want to pretend it's not an attempt at pedantry? I'm not sure why you WANT to be an asshole, but either disagree with me TECHNICALLY (that is, point out how "x64" is NOT and never was a reference to Alpha, show evidence that "x64" has always referred to amd64 outside of Windows-centric circles, and we can discuss that), or admit you're trying to be a wise-ass, and don't be pedantic and then get upset about me supposedly accusing you of being pedantic.

In other words, what do you really think you're bringing to the discussion? If you don't like when people point out incorrectness, then just say so.

There's a time and a place for making incorrect generalizations. Technical people shouldn't make incorrect generalizations when talking about technical things.


Your accusation of assholery, under the circumstances, is comical. Talk about 'isn't a good look'.


It does make implementing a custom garbage collector harder, though. Maybe they could have come up with a compromise where there's a way to walk over local variables on the stack. Instead they're building GC into the runtime, which is also a good solution, but having both options might have been nice, because a custom collector can be more flexible and can run on runtimes that don't have GC.


We've discussed engine support for stack walking for application-level GC within linear memory, but there hasn't been a clear win (at least in my mind) over just using a shadow stack, which doesn't require walking frames, just scanning a memory segment. Opening up the contents of executing stack frames imposes a ton of constraints and complexity on the engine (e.g. the engine now has to maintain mappings, support frame modification, etc.) that have implications for basically all execution tiers. It doesn't seem like the right tradeoff to me.

I've implemented a shadow stack for Virgil, so I am aware of how much it sucks. But that doesn't suck as bad as coming up with a stack walking protocol and then modifying every Wasm engine in existence to support that.
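
For anyone curious what that looks like, a minimal sketch (mine, not Virgil's actual scheme): references are pushed onto an explicit array in linear memory, so the collector scans one flat region instead of asking the engine to expose frame contents.

  #include <stddef.h>

  #define SHADOW_MAX 1024
  static void *shadow_roots[SHADOW_MAX]; /* live references held by frames */
  static size_t shadow_top;

  static void push_root(void *p)  { shadow_roots[shadow_top++] = p; }
  static void pop_roots(size_t n) { shadow_top -= n; }

  /* GC entry point: no frame walking, just scan the flat segment. */
  static void scan_roots(void (*mark)(void *)) {
      for (size_t i = 0; i < shadow_top; i++)
          mark(shadow_roots[i]);
  }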


Actually, yes. It sounded good at first, but it makes optimisation and determinism harder for the engines, so it's probably better not to have it.


The use of an illegal instruction to cause the crash is interesting, especially given the speculative execution angle. I naively would have expected branch prediction to do a good job given that the branch is never taken in typical operation, but I guess that only applies to frequently run code


It makes sense now that I think about it.

We usually base our intuition on the older and simpler branch predictors of 10-20 years ago, where the location of the branch and its taken/not-taken history are tightly coupled. In those, a never-taken branch is either unknown (which counts as a not-taken prediction) or known and correctly predicted.

But modern branch predictors decouple the branch location and branch history. They hash the sequence of the last few branches and use that hash to index into the branch history table, allowing a single branch to have multiple different histories depending on the control flow leading up to it. That significantly improves branch prediction for hot code: the predictor can now track things like a branch that is always taken during the first iteration of a loop, or a branch that is always taken if the function was called from one location but not another. They can even track the relationships between indirect branches in vtables.

But hash collisions are expected, especially outside of hot code. So it should be quite common for never-taken branches in warmish code to be known, but predicted as taken due to a collision.


I wouldn't expect a CPU to speculatively execute something like "int 3" either, given compilers will frequently place it in between real functions (for alignment).


It's unclear whether this matters, though. Some CPUs speculate through unconditional control flow as if it is never going to happen.


Removing RET instructions won't make your program unexploitable. Especially on architectures like x86_64 where you can get so many possible gadgets from a standard library.

Trying to make software mechanisms like this that span multiple architectures feels like a relic of the past. These days you need to use processor-specific features like ARM pointer authentication.


It's not meant to be 100% reliable. Also they clearly state that variable length instruction architectures (like x86) are harder to protect. And if you can use something better then great, use it. You can then disable this mechanism.


This misses the point of how ROP works, and is in fact why previous attempts at ROP gadget reduction were flawed: it only takes a handful of gadgets to make a chain. Trying to protect control flow requires more sound approaches than this.


At the time, Todd was testing with whatever the popular ROP compiler was, and it wasn't able to chain anything using libc. Even on amd64 you can restrict which gadgets are available. Maybe you can find other approaches with a careful hand search, but I think knocking out the biggest exploit generator is hardly flawed in a practical sense.


People don't generate exploits using popular ROP compilers.


what are popular ROP compilers used for then?


CTF challenges mostly


so they're used in combination with already known exploits but you're saying no one uses them during the development of exploits?


No, they’re mostly toys and demos. They’re not an accurate representation of real-world exploit development.


oh i must be confused about what a CTF challenge is


CTF challenges are to cooking competitions what exploit development is to being a restaurant cook. There are time limits, practicality is less of a concern, and everyone knows that toy constraints are added because nobody wants to watch you stare at IDA for three weeks


Say, if hardware checked that RET transfers control to a place that's immediately preceded by a CALL instruction, would that help?


It would help but it wouldn’t solve ROP. I think it would probably be less useful than gadget reduction, honestly, since there are a lot of useful sequences after a call instruction.


It would not help at all. See (all of, but especially) section 5.4 of N. Carlini, A. Barresi, M. Payer, D. Wagner, and T.R. Gross, "Control-Flow Bending: On the Effectiveness of Control-Flow Integrity," in proc. USENIX Security 2015, https://www.usenix.org/conference/usenixsecurity15/technical...


That would disable certain mechanisms that are occasionally useful, such as user-mode context switches and function hooking.


ARM's Branch Target Identification does something similar to that (but for jumps & calls, not returns).


A shadow stack of return addresses would help.


Not to mention there are other methods, like JMP oriented programming (someone’s even written some papers on that, and tools for it).

Also worth noting: the mov being Turing complete paper.


Somewhat related: did the Power arch really reach its “EOL” by now? IIRC IBM was still doing something with it at least? Anyone in the know?

> So amd64 isn't as good as arm64, riscv64, mips64, powerpc, or powerpc64.


https://www.ibm.com/power? Too obscure?

On the embedded side, NXP (formerly Freescale, formerly Motorola) makes them, as does MACOM (formerly AMCC, via some hedge fund gymnastics... never heard of them either). Others like Xilinx probably still have licenses to produce. And you can still get the rad-hard RAD750 from BAE Systems, for your outer space or post-apocalypse needs.


Why do OpenBSD developers in 2023 care about architectures like alpha and sparc64? What are their use cases? Why would anybody pick those over Intel / ARM? What do those architectures provide that makes people care? Is it just existing hardware that people already have and don't want to throw away, or is there an actual reason one would buy such a processor over the more popular competition? I get the arguments for RISC-V, for example, but the rest are a mystery to me.

I'm not criticizing here, but actually seeking to understand.


Big-endian CPUs were once the dominant clients in TCP/IP networking, and these machines define the hton and ntoh C macros as no-ops. This means that the network itself is big-endian.

Here is an old HP-UX machine running on PA-RISC:

  # grep ntoh /usr/include/netinet/in.h    
  #ifndef ntohl
  #define ntohl(x)        (x)
  #define ntohs(x)        (x)
On x86_64, this is a byte-swap:

  $ grep bswap /usr/include/netinet/*.h         
  /usr/include/netinet/in.h:#   define ntohl(x) __bswap_32 (x)
  /usr/include/netinet/in.h:#   define ntohs(x) __bswap_16 (x)
  /usr/include/netinet/in.h:#   define htonl(x) __bswap_32 (x)
  /usr/include/netinet/in.h:#   define htons(x) __bswap_16 (x)
SPARC is big-endian, and the memory is not in the same order as it is on x86. This can coax bugs out of software that are otherwise not seen on little-endian systems.
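
A contrived but representative example (mine, not from this thread): code that grabs the "low byte" of an integer through a pointer cast happens to work on little-endian x86 and silently breaks on big-endian SPARC.

  #include <stdio.h>
  #include <stdint.h>

  int main(void) {
      uint32_t v = 42;
      /* Buggy shortcut: assumes the low-order byte comes first in
         memory. Prints 42 on little-endian x86, 0 on big-endian SPARC. */
      uint8_t low = *(uint8_t *)&v;
      printf("%u\n", low);
      return 0;
  }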

Alpha is little-endian, but it has its own exotic problems.


> Alpha is little-endian, but it has its own exotic problems.

At least some models of the Alpha were bi-endian, selectable via a pin on the package, IIRC. I believe the Cray T3E ran the chips big-endian.


I think MIPS and POWER could also do this.


Arm (both 32 and 64-bit) supports both BE and LE. Switchable at runtime so that you can have a BE VM on an LE host just fine.

Apple Arm chips however are LE only, but all of Arm's Cortex-A/Neoverse and NVIDIA's cores support both LE and BE operation.


Theo de Raadt has expressed that compiling and testing on CPUs that are "not boring" helps them find bugs earlier.


Running on many, many architectures is another way of testing code robustness. Code that runs on X could crash on Y because it always had a bug that the quirks of X hid. The weirder the architecture the better, kind of.

Or somebody may just think it'd be a fun challenge, and so they support it. That's all.


There are a lot of mistakes that make it in to common code when people only test with one architecture.

Also, the environmental cost of manufacturing new hardware is often greater than even a decade of running less efficient, older hardware longer.

To add to that, Alphas and older Sun systems are actually proper server hardware. There are differences that you may never learn about in the x86 world. I've been running an AlphaServer DS25 for years now, and it's worlds better hardware quality than any x86 server product you can buy now.


I am having trouble believing that the environmental impact of running a server from 2002 is lower than buying a machine that is 2-3 orders of magnitude more efficient.


I doubt that the machine is even one order of magnitude less efficient.

The power supplies on that Alpha are 500W and even fairly low end Supermicro systems have 600W supplies.

The biggest differential is probably the efficiency of the power supply.


Better in what sense?

How would you characterize the differences?


Well, for one thing people still sell sparc machines. But beyond that, they care to catch bugs that an architecture monoculture misses/ignores. For example, the OpenBSD folks have said a number of times SPARC64 is a good way to catch memory alignment problems. Diversity in testing is a good thing.


OT: I wonder whether Apple or Firefox’s “reading mode” algorithms (both broken for this link) will gain the ability to handle these plaintext, fixed-width, hard line ended, htmlified email archives before the owners of any of these sites like marc.info adapt to use more user friendly formatting that can reflow in a narrower-width browser window…


My Firefox does not offer reading mode because ... the page itself is pretty much in reading mode already, with line breaks limiting line length.

Nothing is broken as I see it.


It’s not very user friendly on mobile.


You can force reader view by entering "about:reader?url=<page URL>" into the address bar, but it can not decide what should be displayed for this page. IIUC this check runs on all visited pages and the button is only shown when it succeeds.


I see margin-left: $small.

Reader mode would help with that.


I'm on safari 16.6 and reader mode seems to work just fine on this site.


I’m on Safari on iOS 15.7.4 and Reader Mode opens but just shows the long lines with a horizontal scroll and no wrapping still


Same on safari 15.6



