File Descriptor Transfer over Unix Domain Sockets (copyconstruct.medium.com)
117 points by talonx on Nov 2, 2020 | hide | past | favorite | 39 comments



The article doesn't mention that file descriptors "in flight" over sockets are garbage collected if the listening process doesn't pick them up. This has been the subject of several bugs/security issues: https://nvd.nist.gov/vuln/detail/CVE-2008-5029 https://lwn.net/Articles/779472/

Al Viro's description sums up one of the recent problems (which was fixed):

Among the features provided by io_uring is the ability to "register" one or more files with an open ring; that speeds I/O operations by eliminating the need to acquire and release references to the registered files every time. When a file is registered with an io_uring, the kernel will create and hold a reference for the duration of that registration. This is a useful feature but it contained a problem that, seemingly, only somebody with a Viro-level understanding of the VFS could spot, describe, and fix; it is a new variant on the cycle problem described above. In short: a process could create a Unix-domain socket and register both ends with an io_uring. If it were then to pass the file descriptor corresponding to the io_uring itself over that socket, then close all of the file descriptors, a cycle would be created. The io_uring code was unprepared for that eventuality.


That's an interesting bug. Thanks for linking it here!

Many flavours of UNIX have bugs in the area of file descriptor passing. The most recent one I spotted was that FreeBSD would leak file descriptors on the receiving side if the cmsg space was too small to hold all incoming file descriptor numbers.

Another interesting one is that garbage collection of in-flight file descriptors is sometimes implemented by effectively calling close() recursively. This means that you can cause kernel panics due to kernel stack overflows by closing a socket pair that contains in-flight file descriptors of socket pairs that contain in-flight file descriptors of socket pairs that contain in-flight file descriptors of [...]


This doesn't list the biggest gotcha of all, the fact that the cmsg API is incredibly sharp and there are at least 7 ways you can screw it up. Nearly every single use of cmsg in Android's source tree was buggy in at least one of these ways:

  - not aligning the cmsg buffer
  - leaking fds if more fds are received than expected
  - blindly dereferencing CMSG_DATA without checking the header
  - using CMSG_SPACE(fd_count) instead of CMSG_SPACE(fd_count * sizeof(int))
  - using CMSG_SPACE instead of CMSG_LEN for .cmsg_len
  - using CMSG_LEN instead of CMSG_SPACE for .msg_controllen
  - using a length specified in number of fds instead of bytes
It's possible that Android is uniquely bad at this, but I'm skeptical.
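For reference, here's a sketch (mine, not Android's actual code) of a receive path that tries to dodge the pitfalls listed above: an aligned control buffer, header checks before touching CMSG_DATA, lengths in bytes rather than fd counts, and closing any surplus fds instead of leaking them. Assumes Linux for MSG_CMSG_CLOEXEC.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

#define MAX_FDS 4

/* Receive up to max_fds file descriptors from sock; returns count or -1. */
static int recv_fds(int sock, int *fds, int max_fds) {
    char data;
    struct iovec iov = { .iov_base = &data, .iov_len = 1 };
    /* union guarantees the control buffer is suitably aligned */
    union {
        char buf[CMSG_SPACE(MAX_FDS * sizeof(int))]; /* bytes, not fd count */
        struct cmsghdr align;
    } u;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };
    if (recvmsg(sock, &msg, MSG_CMSG_CLOEXEC) < 0)
        return -1;

    int count = 0;
    for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c)) {
        /* check the header before dereferencing CMSG_DATA */
        if (c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_RIGHTS)
            continue;
        int nfds = (c->cmsg_len - CMSG_LEN(0)) / sizeof(int);
        for (int i = 0; i < nfds; i++) {
            int fd;
            memcpy(&fd, (char *)CMSG_DATA(c) + i * sizeof(int), sizeof(fd));
            if (count < max_fds)
                fds[count++] = fd;
            else
                close(fd); /* more fds than expected: don't leak them */
        }
    }
    return count;
}
```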


Compare this to the WinAPI approach: its analogue of dup() (called DuplicateHandle) also accepts a handle to the process into which the file descriptor must be duplicated. You need to have the correct access rights, of course.

However, because in Windows TCP sockets are implemented in user space, you can't quite use this routine to pass sockets: the part of their state in user space won't get duplicated (well, you can't send a FILE* into another process on either Linux or Windows, right). So there is the WSADuplicateSocket function, which doesn't actually duplicate the socket; it gives you a struct full of its internal state, which you have to send to the other process somehow, and then the other process has to re-create the socket manually by passing this struct into a WSASocket call... and I've no idea how the inter-process sharing is then set up, but apparently it is, since those two sockets actually have synchronized state from then on.


>However, because in Windows TCP sockets are implemented in the user space,

That is not what's happening. TCP is entirely in the kernel. You are confusing Winsock with TCP.

Even then, Winsock has its own driver (AFD.sys, which definitely does not stand for Another Fucking Driver) as well.

Most of the user-mode Winsock (ws2_32.dll, mswsock.dll, ws2tcpip.dll) implements the Winsock 2.0 spec-related bookkeeping, but Winsock calls essentially boil down to Ioctl()s to AFD.sys in the common case.

The thing to realize is that Winsock 2.0 was spec-ed at a time when Windows NT (and 95) allowed vendors to install things called "layered service providers", which were providers that hooked into Winsock and allowed you to intercept/redirect/filter various Winsock calls. Lots of internet filters worked in this way in the early days.

Now, given that these 3rd-party providers were free to implement their protocol entirely in user space, the socket handle you got was not guaranteed to be a kernel handle (called an IFS handle throughout the Winsock documentation, for Installable File System handle).

Thus, WSADuplicateSocket() was a way to allow these providers to re-create a socket based on whatever provider-specific information was passed in through IPC.

Going back to your original claim, you can actually take a native AFD socket handle to a TCP connection and duplicate it (with DuplicateHandle) into another process. It will also probably work just fine, but no one at Microsoft is going to officially support this usage, for the above-mentioned compatibility reasons, even though Winsock layered providers are officially deprecated now.

Edit: clarified what I meant by duplicate in last paragraph.


Thanks for clarifying!

And I guess one could still implement an LSP in user mode while supporting DuplicateHandle directly, without all the WSADuplicateSocket stuff, by having a service with all the custom logic and returning pipes into this service as socket handles. That's basically what a microkernel would do as well, right?


I guess you technically could, because the handle your LSP returns is entirely up to it.

Haven't thought about microkernels in a couple of decades :), but yes, I can see how what you are saying could work since the handle will technically be a duplicatable kernel handle. It might be better for the service (or subsystem in NT microkernel terminology) to have a driver so that the client can reference it with a proper, dedicated handle, but that's a bit beyond my wheelhouse as to which approach is better.

Edit: Completely forgot about ws2ifsl.sys, which is a driver that exists just to give non-IFS providers a kernel handle via the WPUCreateSocketHandle() call!


A more analogous approach would be to use ALPC; its handle-passing support was inspired by unix domain sockets. Unfortunately, it looks like ALPC never got exposed via Win32.

Windows TCP sockets are definitely their own thing, which is pretty unfortunate, but Windows exposes a number of other kernel abstractions as handles - processes, mutexes, shared memory sections, events, timers, &c; “Wait for either this particular process to exit or this timer to expire” is trivial. At the time I worked on it, it was one of the features I really preferred about Windows over Unix.


The whole idea of having (awaitable) process descriptors in addition to PIDs is amazing. Suddenly, you don't need wait/waitpid/wait3/wait4/SIGCHLD or even PID 1. When the last process descriptor vanishes, the process itself can be purged from the process table.

And Win8 added nested jobs, so now you can basically have inescapable process groups with "if the leader process dies, terminate all its children and grandchildren" semantics. On Linux, I had to just give up and use Docker containers for this: some software simply insists on daemonizing itself too much, to the point that on receiving SIGTERM, it re-sends SIGTERM to everyone in its process group (including you, the unsuspecting parent. That was very "funny" to debug).

But of course, the downside of Windows is that the amount of legacy crap and weird quirks and incompatibilities is insane, even more than in x86. "The Windows low-level APIs wouldn't recognize "general" if it showed up in a uniform with stars on the shoulder." (c) njsmith


It's my understanding that this hole will be filled with pidfd. It's now upstream but there are no glibc bindings yet. https://lwn.net/Articles/801319/
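To illustrate, a minimal sketch of the pidfd idea on Linux >= 5.3 (raw syscall() because, as noted, glibc wrappers came later): obtain a pollable descriptor for a process and wait for its exit with plain poll() instead of SIGCHLD machinery.

```c
#include <poll.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434   /* same number on x86-64 and arm64 */
#endif

/* Returns 0 if the process observed via the pidfd exited within timeout_ms. */
static int wait_for_exit(pid_t pid, int timeout_ms) {
    int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
    if (pidfd < 0)
        return -1;
    struct pollfd p = { .fd = pidfd, .events = POLLIN };
    int r = poll(&p, 1, timeout_ms);   /* becomes readable on process exit */
    close(pidfd);
    return r == 1 ? 0 : -1;
}
```

Since a pidfd is just a file descriptor, it also composes with everything else in this thread: it can sit in the same poll set as sockets and timers, and can itself be passed over a Unix domain socket.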


Yeah, it's pretty awful. The only remotely sane way to use them is also to use SOCK_SEQPACKET which is not supported everywhere. Another way that works for some cases on Linux at least is to pick up fds from /proc.


I have no way of verifying your Android claim, but this API is really well-documented, including examples of how to use it properly.

Here's the OpenBSD version, which is quite typical - https://man.openbsd.org/CMSG_DATA.3.

It's literally a no-brainer to code it all correctly on the first try. So the scary stuff on your list is exactly of the same nature as using printf() without reading the spec and then calling it "incredibly sharp".


File descriptor passing has been a common technique in privilege-separated designs, used quite extensively in OpenBSD software. Other notable examples include Google's Chrome browser. It's quite telling that it's not mentioned once in this article.

Combined with OS security features like pledge(2) on OpenBSD, which has separate sendfd/recvfd promises, and unveil(2), an unprivileged process can have its access to the filesystem and other system attack surfaces (system calls, ioctls) removed completely or restricted, and only be able to act on file descriptors passed by a privileged parent.

https://man.openbsd.org/pledge.2

https://man.openbsd.org/unveil.2

A skeleton example of a common style for OpenBSD privsep daemons, which uses 3 processes:

https://github.com/krwesterback/newdctl

https://github.com/krwesterback/newd

This uses OpenBSD's imsg(3) API, an abstraction around the underlying Unix sendmsg/SCM_RIGHTS functionality, along with other IPC abstractions.

https://man.openbsd.org/imsg_init.3

https://github.com/tmux/tmux/blob/master/compat/imsg.c

https://github.com/tmux/tmux/blob/master/compat/imsg.h


FD transfer is also used when a process needs to work with files (or devices) that are out of its reach due to account restrictions.

In this case, the process will talk to another process that does have the required access; the latter opens the file of interest and passes the handle back.

This is needed very rarely, but in cases where it's a good fit, it provides a very elegant and simple solution for an otherwise hairy problem.

One such case is when the program uses an Engine + UI model, whereby the engine runs under a system account and the UI runs under an interactive user. As the engine runs, it writes logs, and the UI needs to display them. One solution is to tweak permissions on the log files to make them universally readable. It's not hard to do, but it makes the whole thing more fragile: these permissions may get inadvertently stripped off, the UI process may be sandboxed by an antivirus, etc. That is, the program may end up in a state where the UI cannot access the logs, but the engine can.

The alternative here is for the UI process to ask the engine to open the logs and pass the handles back. Very simple to do and resistant to accidental breakage.

Another case was when we had to ship a pre-built Linux binary (a VPN client) that needed to open a TAP device. That normally requires root access, but the client was closed-source, and it had to be able to run under restricted user accounts, because asking people to run it as root was not an option. The solution was a small open-source daemon that listened on a domain socket for requests to open /dev/tapx, did that, and passed the TAP fd back to the requesting process.
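The privileged side of that pattern looks roughly like this (a sketch under illustrative names, not the actual daemon): the helper opens a path on behalf of the client and ships the descriptor back over the Unix socket with SCM_RIGHTS. Note that the helper can close its own copy immediately; the in-flight copy in the message survives.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Open `path` and send the resulting fd over `sock`; returns 0 on success. */
static int open_and_send(int sock, const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    char status = 0;
    struct iovec iov = { .iov_base = &status, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;
    memset(&u, 0, sizeof(u));
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type  = SCM_RIGHTS;
    c->cmsg_len   = CMSG_LEN(sizeof(int));   /* CMSG_LEN for cmsg_len */
    memcpy(CMSG_DATA(c), &fd, sizeof(fd));

    int rc = sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
    close(fd);   /* the copy in flight is independent of ours */
    return rc;
}
```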


I’m using this technique in manpages.debian.org, which uses mandocd, a daemon which allows converting many manpages (using mandoc) without the exec overhead: https://github.com/Debian/debiman/commit/3715b1eaf9c1793b9a8...

I’m transferring the stdin, stdout and stderr file descriptors instead of starting new processes :)
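A sketch of that idea in C (not debiman's actual code, which is Go): all three standard streams travel in a single SCM_RIGHTS control message, so the daemon can run the conversion directly against the caller's stdio.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send fds 0, 1 and 2 over `sock` in one control message; 0 on success. */
static int send_stdio(int sock) {
    int fds[3] = { 0, 1, 2 };
    char d = 0;
    struct iovec iov = { .iov_base = &d, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(3 * sizeof(int))]; /* space for three ints */
        struct cmsghdr align;
    } u;
    memset(&u, 0, sizeof(u));
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type  = SCM_RIGHTS;
    c->cmsg_len   = CMSG_LEN(3 * sizeof(int));
    memcpy(CMSG_DATA(c), fds, 3 * sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}
```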


FWIW this technique is described in "Unix Network Programming" by W. Richard Stevens (http://www.kohala.com/start/unp.html) in section 6.1: "Passing File Descriptors"


Unless you have really long-lived connections you cannot drop, let your load balancer layer above the machine drain old connections and then just restart after some threshold. I’ve seen too many bugs in socket handover scenarios to make it worth it in basically every normal use case.

Remember, just handing over your sockets means you get connections in all kinds of different phases of your protocol’s state machine. So now you need to bolt on some more context transfer mechanisms as well...


TFA is about handing over server sockets, so that the new version can deal with new connections while the old version deals with old connections.

I have done this for an RTMP server. It worked out pretty well.


That only works when a session is a 1:1 relationship with a connection (e.g. a TCP service with sessions that can’t span TCP connections). That assumption falls apart with UDP, multiple TCP connections sharing session identifiers, etc.

The world of QUIC means the kernel is a little too out of the loop for this to easily work.


If you can do that, maybe you can just use SO_REUSEPORT and arrange for SIGTERM to close the listening socket before going into the drain phase.


SO_REUSEPORT has only been supported since about 2013 though, the trick with passing the fd over a Unix socket has worked for way longer than that.


That does seem like it would have worked and been easier.


Android makes liberal use of exchanging file descriptors between processes in its IPC mechanisms. Slightly different use case than what the article discusses, but it's an interesting pattern available for multi-process, same-host IPC.


Another real-world use case is zero-copy interprocess communication, as in the Wayland protocol <https://wayland-book.com/surfaces/shared-memory.html>. It can also be combined with sealed files <https://lwn.net/Articles/593918/> to avoid some of the pitfalls of shared memory.


Wayland (and GStreamer) uses fd-passing to implement zero-copy of graphics textures/video frames across processes. This is typically used in conjunction with buffers allocated by a mechanism like dmabuf, which allocates a memory area and provides an fd associated with it to userland. This (translated) fd can be used by the other process to map the same region of memory.


I don't follow. How can file descriptors be passed through a shared memory to another process and remain valid in its context?

Assuming it's not Windows, where it is possible to explicitly clone a handle for a specific process.


The linked code example doesn't seem to do fd passing, but I guess the use case of fd passing in shared memory context would be the other way around: passing fd's that point to shared memory. The shared memory could then be mmap'ed using the passed fd as the handle.
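A sketch of exactly that direction, assuming Linux's memfd_create (>= 3.17): the sender creates an anonymous memory file and writes into it; the resulting fd is what would then travel over the UDS with SCM_RIGHTS, and the receiver mmap()s it to see the same pages.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a memfd of `len` bytes containing the NUL-terminated `msg`
 * (sender side); returns the fd that would be passed over the socket. */
static int make_shared_region(const char *msg, size_t len) {
    int fd = memfd_create("region", 0);   /* the name is for debugging only */
    if (fd < 0 || ftruncate(fd, len) < 0)
        return -1;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return -1;
    memcpy(p, msg, strlen(msg) + 1);
    munmap(p, len);   /* the data lives in the memfd, not our mapping */
    return fd;
}
```

The receiving side just mmap()s whatever fd arrives; combined with file sealing (F_SEAL_SHRINK etc.) this avoids the classic "the other side truncated the segment under me" pitfall.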


>"Socket Takeover enables Zero Downtime Restarts for Proxygen by spinning up an updated instance in parallel that takes over the listening sockets, whereas the old instance goes into graceful draining phase. The new instance assumes the responsibility of serving the new connections and responding to health-check probes from the L4LB Katran. Old connections are served by the older instance until the end of draining period, after which other mechanism (e.g., Downstream Connection Reuse) kicks in."

Great idea!

Now, that being said, this idea, or rather, this specific solution to this specific problem -- is actually a subset of a much broader problem in Computer Science, and that is:

How to move any OS component (up to and including running programs that may have many files, locks, sockets and other shared OS objects open) to another OS on another machine,

without causing any problems!

That is, how to move such entities robustly.

There are various ideas in this field (of which the above paper/article is one) -- but due to the complexities involved, there are no easy answers (at least, not as far as moving whole running programs with many shared OS objects go).

At least one, and probably several, experimental OSes have been created in the past which attempt to do this -- but they aren't mainstream, and without doing more research, I'm not sure how robust (which is always a subjective term!) they were...

But, it's a fascinating area of Computer Science, to be sure.

Anyway, great idea and great article in this area!


Using UDS to seamlessly move Proxygen workloads like that is so slick!

Here's how we've used this file descriptor transfer feature:

We made a transport which can accept local connections on a Unix Domain Socket and then "upgrades" that connection to two pipes (read and write). Those pipes are passed over the UDS and the client/server communicate over them.

We use a kernel-bypass library (OpenOnload) that implements pipes as shared memory in user space. Very low latency and high throughput.

We made a `boost::asio` implementation of this available on GitHub. It is old and I'm not sure if it works with latest Boost, but it is quite readable for people to play with. We once bolted it onto Redis' Unix Socket transport for fun, but abandoned it as it was a hassle to maintain.

https://github.com/neomantra/asio-pipe-transport


Also useful for having one process provide other processes with access to certain files.

You can further restrict it using the fact that you can easily verify the PID/UID/GID on the other end of the Unix socket (https://man7.org/linux/man-pages/man7/unix.7.html, see SO_PEERCRED). You can also manually send your PID/UID/GID (which allows you to specify any of your real, effective, or saved set *IDs; or, if you're root, anything).
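The kernel-verified direction is the easy one. A sketch (Linux-specific; struct ucred needs _GNU_SOURCE): the credentials are recorded by the kernel at connect()/socketpair() time, so the client can't forge them.

```c
#define _GNU_SOURCE
#include <sys/socket.h>

/* Fill in the credentials of the process on the other end of `sock`. */
static int peer_creds(int sock, struct ucred *out) {
    socklen_t len = sizeof(*out);
    return getsockopt(sock, SOL_SOCKET, SO_PEERCRED, out, &len);
}
```

A server would typically call this right after accept() on the UDS and reject clients whose uid doesn't match a policy before handing out any descriptors.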


This can also be a handy technique when dealing with CLI tools that take a long time to boot/warm. Leave a zygote/server process running in the background that’s warmed, serving a UDS. When you want to invoke the tool, have a lightweight client connect to the server UDS and send argv, the env vars, cwd, and open file descriptors across the UDS to the zygote. The zygote forks, and the fork sets the env vars, argv, etc on itself and then runs the job.

The only problem with this is cancellation: you'll need your client to propagate signals to the forked runner process as well.


This technique is also used to implement privilege separation in OpenSSH.


This is often done with fork(2) + set{uid,gid,euid,egid}(2).

Is there an advantage to passing a file descriptor? What's the workflow in this case, the daemon runs with privileges and passes the fd to a process that's entirely separate and was never started as root? What would passing fds provide that forking + dropping privileges doesn't?


This is a really great paper that described the concept: http://www.citi.umich.edu/u/provos/papers/privsep.pdf

You got the gist of it: you can have a privileged process that can access resources and pass them to an unprivileged process upon request. The unprivileged process runs as nobody and is chrooted into an empty directory, so the file descriptors are capabilities. It doesn't really matter if the unprivileged process is a child of the root one, so long as it discards all inherited resources/privileges other than its end of the socket before real work begins.

The advantage over merely inheriting resources is that you can grant new resources later on and decide whether to grant them in a stateful manner.


Can someone say what method Nginx uses to pass file descriptors from the listening or master socket to the workers? Does the Nginx master process accept() new connections and use sendmsg() with SCM_RIGHTS to send those new connections to worker processes?


See `ngx_write_channel` in `ngx_channel.c`[1]:

    cmsg.cm.cmsg_len = CMSG_LEN(sizeof(int));
    cmsg.cm.cmsg_level = SOL_SOCKET;
    cmsg.cm.cmsg_type = SCM_RIGHTS;
    
    [...]
    
    ngx_memcpy(CMSG_DATA(&cmsg.cm), &ch->fd, sizeof(int));
    
    [...]
    
    n = sendmsg(s, &msg, 0);

edit: This article[2] also explains the way the master process sends commands to its workers.

[1] https://github.com/nginx/nginx/blob/2a81e0556611188a1b9b3e12...

[2] https://titanwolf.org/Network/Articles/Article?AID=6fb184f9-...


Ah thanks, this second link is really great analysis of that part of the code base. Cheers!


Plan 9 had dedicated sendfd()/recvfd() functions, which seem like a friendlier API.


I have one question: could this be done better with CRIU? I mean transferring the socket from one process to another, or from one netns to another. libsoccr is just a kind of socket serialization library, right? :-)



