February 24, 2022 in Systems - 21 minutes
I came across some interesting sockets-related behavior this week that caused me to go down a bit of a rabbit hole. This ended up taking me on a tour of Linux’s socket and IPv4/IPv6 implementation. I thought the journey was instructive, and I hope that my attempt to recount the steps I went through is useful to you.
Working with sockets on Linux is typically done with a handful of syscalls. The first and most obvious one, socket(), creates the socket; then bind() is used to assign an address to it (the kind of address depends on the address family specified when creating the socket).
At this point, you can use either connect() or listen() to connect to an existing listener or to listen passively for incoming connections, respectively. However, for the scope of this post, we’ll focus just on the first few steps, up through bind(). It’s important to note that this bind() operation takes place before either of those calls, which means that it is how you choose the local address for the socket. When you send traffic to the other side of the socket, either as a client or server, this will be the source address in those packets.
In most programming languages (especially systems-focused ones), this is quite easy to do, as many have basic socket primitives built into the standard library. Python, as an example, has a socket module in its standard library, and creating a socket and binding it are both one-liners:
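A minimal sketch of those two calls follows. The port number 0 is an assumption made here so that the kernel picks any free port; any unused port would do.

```python
import socket

# Create a TCP/IPv4 socket (AF_INET selects the IPv4 address family).
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Bind to the wildcard address. Port 0 (an assumption for this sketch)
# asks the kernel to choose any free port.
sock.bind(("0.0.0.0", 0))

# getsockname() reports the local address and port the socket is bound to.
bound_address, bound_port = sock.getsockname()
print(bound_address, bound_port)

sock.close()
```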
We’ll stick with IPv4 for the first few examples, but we’ll be covering IPv6 in this post as well.
We specified 0.0.0.0 as the address, which tells the OS to listen on all IP addresses currently configured on the system. This can be useful, as when you’re writing software you often don’t know (and don’t want to know) the actual IP address of the server where that software is going to run - you just know you want to grab whatever address is available.
We can inspect the actual syscalls being made using strace:
However, there are use cases where being more explicit about which address a given socket binds to is necessary. Imagine that there’s some software on your machine that should only accept connections from other software running on the same machine. In this case, you would probably want to bind only to addresses in 127.0.0.0/8, as this is the dedicated range for this kind of traffic (systemd-resolved works this way, as an example). This is also useful in situations where a machine has multiple network connections - you can bind only to an address in a particular subnet. This is common on firewalls where a webserver running an administrative application is only available on the “inside” connection.
These are all fairly common reasons why you might explicitly specify an address during bind(), but ultimately all of these methods involve using an address that already “belongs” to the system, meaning it’s either built-in (as is the case with 127.0.0.0/8) or it is configured on a network interface.
However, some use cases exist where you might want to bind to an address that’s not configured on any network interface at all. One of the canonical examples is when you want the return traffic for a given connection to pass through a load balancer, and leave it up to the load balancer to determine where to ultimately deliver the traffic.
There are a few ways you can do this in Linux, but we’ll look at two. The first is through a socket option called IP_FREEBIND (described in Linux’s IPv4 protocol documentation). Socket options are per-socket configuration settings that you can specify after initially creating the socket. This is done through the setsockopt() syscall. Linux uses an integer value of 15 for the IP_FREEBIND option, so that’s what we need to pass in to setsockopt(), while also specifying a value of 1, indicating this option should be enabled. Python also makes this easy:
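Here is a hedged sketch of that sequence. The helper name bind_with_freebind and the use of port 0 are my own choices for illustration; the option number 15 is the value mentioned above, used as a fallback in case the socket module does not export the constant.

```python
import socket

# Linux defines IP_FREEBIND as 15; fall back to that value when the
# socket module does not export the constant.
IP_FREEBIND = getattr(socket, "IP_FREEBIND", 15)


def bind_with_freebind(address, port=0):
    """Try to bind a TCP/IPv4 socket with IP_FREEBIND enabled.

    Returns None on success, or the OSError raised along the way.
    (Hypothetical helper, written for this example only.)
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Enable the option (value 1) at the IP level before binding.
        sock.setsockopt(socket.IPPROTO_IP, IP_FREEBIND, 1)
        sock.bind((address, port))
        return None
    except OSError as exc:
        return exc
    finally:
        sock.close()


error = bind_with_freebind("192.168.123.123")
print("bound" if error is None else f"bind failed: {error}")
```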
This will work, despite the fact that 192.168.123.123 is not actually configured on any of our system’s network interfaces.
Another way to accomplish this in Linux (which, as we’ll soon see, only helps with bind() for IPv4) is through a feature called Any-IP. This is a fancy term for adding an entry to the local routing table, indicating that any traffic received on a given prefix should be handled by the local machine on the specified interface, as if every address in that prefix was individually configured on that interface. This can be very useful if you want to potentially bind to a huge number of addresses - rather than configuring each one individually on an interface, you can just add a single routing entry that summarizes them:
With TCPv4 sockets we don’t even need the IP_FREEBIND option. As long as the address we’re binding to exists in an Any-IP route, this will work just fine.
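As a sketch, assume a local route covering 192.168.123.0/24 has already been added as root, for example with something like ip route add local 192.168.123.0/24 dev lo (the exact prefix and interface are assumptions here). A plain bind with no socket options then looks like this. Without such a route, I’d expect the same call to report EADDRNOTAVAIL instead.

```python
import errno
import socket


def bind_plain(address, port=0):
    """Bind a TCP/IPv4 socket with no extra socket options.

    Returns "OK" on success, or the symbolic errno name on failure.
    (Hypothetical helper, written for this example only.)
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((address, port))
        return "OK"
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        sock.close()


# With a matching Any-IP route this should print OK; without one,
# EADDRNOTAVAIL is the expected result.
result = bind_plain("192.168.123.123")
print(result)
```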
So, to summarize, in order to bind to an IPv4 address that is not configured on an interface, you must either specify the IP_FREEBIND socket option, or bind to an address that’s part of an Any-IP route.
In addition to the two methods we’ve just explored, there are two other mechanisms in Linux that, when enabled, also allow a socket to bind to an address that is non-local (not configured anywhere on the machine).
- The sysctl option net.ipv4.ip_nonlocal_bind (and the IPv6 equivalent net.ipv6.ip_nonlocal_bind) - this is a system-wide setting, so it affects all sockets. Naturally, elevated privileges are required to set this option.
- The IP_TRANSPARENT socket option. This is used for transparent proxying, and while, like IP_FREEBIND, it is a per-socket option, it requires root privileges or the CAP_NET_ADMIN capability to enable. It also requires additional iptables rules as part of the transparent proxy configuration.

While both of these options incidentally enable non-local binds, they are designed for different purposes and/or come with drawbacks we don’t want to deal with for this example. So while we’ll see references to these options in our exploration, just know that they’re out of scope for this post.
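Even though these are out of scope, it can be handy to confirm that the system-wide setting is not quietly influencing an experiment. A small sketch, assuming the usual /proc/sys layout on Linux (the helper name is hypothetical):

```python
from pathlib import Path


def read_nonlocal_bind(family="ipv4"):
    """Return the ip_nonlocal_bind sysctl for a family as an int.

    Returns None if the value cannot be read (for example on a
    non-Linux system or when that protocol stack is unavailable).
    """
    path = Path(f"/proc/sys/net/{family}/ip_nonlocal_bind")
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


for family in ("ipv4", "ipv6"):
    print(family, read_nonlocal_bind(family))
```

A value of 0 means the sysctl is off, so any successful non-local bind must be explained by something else.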
Now, it’s time to see how this is done in IPv6. Let’s take the Any-IP approach, by first creating a local route for a prefix that doesn’t match any IPv6 address configured on our system:
This should allow us to simply create the socket and bind to any address in that prefix:
However, this fails with OSError: [Errno 99] Cannot assign requested address. We can see with strace that bind() is returning an EADDRNOTAVAIL:
Linux’s IPv6 protocol documentation doesn’t contain an IP_FREEBIND option like the IPv4 version does, but it does say “The IPv6 API aims to be mostly compatible with the IPv4 API (see ip(7)). Only differences are described in this man page”. Because of this, I tried setting the IP_FREEBIND option on the socket:
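Both IPv6 attempts (with and without the option) can be reproduced with a sketch like the following. The address 2001:db8::1 comes from the documentation prefix and is only a stand-in for whatever non-local address you test with; the helper name is also my own.

```python
import errno
import socket

IP_FREEBIND = getattr(socket, "IP_FREEBIND", 15)  # 15 on Linux


def bind_v6(address, freebind=False, port=0):
    """Try to bind a TCP/IPv6 socket, optionally with IP_FREEBIND.

    Returns "OK" on success, or the symbolic errno name on failure.
    """
    try:
        sock = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    except OSError as exc:
        # For example, IPv6 may be disabled on this machine.
        return errno.errorcode.get(exc.errno, str(exc.errno))
    try:
        if freebind:
            # Same level and option number as in the IPv4 case.
            sock.setsockopt(socket.IPPROTO_IP, IP_FREEBIND, 1)
        sock.bind((address, port))
        return "OK"
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        sock.close()


print("without IP_FREEBIND:", bind_v6("2001:db8::1"))
print("with IP_FREEBIND:   ", bind_v6("2001:db8::1", freebind=True))
```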
This worked, so this means that the IP_FREEBIND socket option is IP version agnostic. However, I was intrigued, because with IPv4, the presence of an Any-IP route meant that IP_FREEBIND was not needed. Clearly in IPv6, it still is for some reason.
Next, I wanted to test the reverse case: IP_FREEBIND set, but without a matching Any-IP route:

Despite the lack of an Any-IP prefix, the previous example will still work.
So, it seems that while IPv4 sockets require either an Any-IP route, or IP_FREEBIND, IPv6 is a bit more strict; regardless of whether or not an Any-IP prefix matches the address you want to bind to, you always need to use the IP_FREEBIND option to bind to an address that’s not actually configured on an interface.
I’ve found a few other posts ([1], [2]) that seem to confirm that this difference in behavior is real, so I felt better that what I was observing wasn’t due to some error on my part, but there were still some unanswered questions rattling around in my head:
I spent a good chunk of time Googling, but there were few results that even acknowledged this difference, never mind explained why it was the case. Soon, I realized that the fastest way for me to get my answers was to just go straight to the source - the kernel source code itself.
I am running a fairly recent kernel (5.10) and as a result, all examples provided are from that version. Ideally not much has changed since then as of the time of this writing, but if you’re looking at a different version, YMMV.
Since the error in question occurs when we try to bind our existing socket to an address, I figured the best place to start was looking at the implementation for the bind() syscall. This can be found in the __sys_bind() function in net/socket.c:
As we know from the syscall documentation, the first parameter passed to bind() is the file descriptor where our socket was created. Naturally, this is the first parameter to __sys_bind(). This is then passed to sockfd_lookup_light() to get the socket details, including the protocol-specific implementations (remember we specified AF_INET or AF_INET6 when creating sockets?). The important step for our purposes is the call to sock->ops->bind(), which invokes the bind implementation for the protocol used by this socket.
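To make the dispatch idea concrete, here is a toy Python model of it. This is not kernel code: the dictionary stands in for the per-family operations table reached through sock->ops, and the two functions only borrow the names of the kernel’s bind entry points for readability.

```python
import socket


def inet_bind(address):
    # Placeholder for the IPv4 bind path.
    return f"IPv4 bind path handles {address}"


def inet6_bind(address):
    # Placeholder for the IPv6 bind path.
    return f"IPv6 bind path handles {address}"


# Stand-in for the operations table selected when the socket was created.
BIND_OPS = {
    socket.AF_INET: inet_bind,
    socket.AF_INET6: inet6_bind,
}


def sys_bind(family, address):
    """Dispatch to the bind implementation registered for the family."""
    return BIND_OPS[family](address)


print(sys_bind(socket.AF_INET, "192.168.123.123"))
print(sys_bind(socket.AF_INET6, "2001:db8::1"))
```

The point is simply that the address family chosen at socket() time decides which bind code runs later, which is why IPv4 and IPv6 are free to behave differently.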
This blog post is really great, and goes into way more detail on the process of getting to the appropriate bind implementation.
The IPv4 implementation in Linux can be found in net/ipv4/af_inet.c. Within, inet_bind() is what the sock->ops->bind() call above invokes when the AF_INET family is used. This ultimately calls __inet_bind(), where the real work is done.
You won’t scroll far before you see a large conditional that looks promising:
This evaluates a few conditions to determine the suitability of the address that we wish to bind to. First, the socket is passed to inet_can_nonlocal_bind:
This function checks to see if any of the three options that would allow for a nonlocal bind are present, including IP_FREEBIND, and if so, returns true.
inet_addr_valid_or_nonlocal was added in later kernel versions to further cut down on repeated code, so if you’re looking at more recent kernel versions, you may only see a call to this function. It wraps both the conditions in inet_can_nonlocal_bind as well as the address types of addr->sin_addr.s_addr and chk_addr_ret all in one place.
Since we know that enabling IP_FREEBIND on a socket will cause this function to return true, we also know that the conditional above in __inet_bind() will not reject the bind: it only raises an error when none of the permitting conditions holds.
However, let’s assume we haven’t configured IP_FREEBIND. What other conditions could be true in our case that would enable this to still bind successfully?
The second interesting conditional from the check in __inet_bind() is:
This looks interesting because the suffix LOCAL seems to imply that this address was checked for membership in the local routing table, which we know to be the mechanism by which Any-IP works. However, this is just a theory based on nothing more than the name of a referenced constant, so let’s figure out where chk_addr_ret comes from.
This value is retrieved a few lines above in __inet_bind():
This is exported as inet_addr_type_table() in net/ipv4/fib_frontend.c but ultimately implemented via the __inet_dev_addr_type() function just above.
Broadcast and multicast are easy to identify at the bit level, so those are checked immediately and returned when detected. Provided the address isn’t one of those, it looks like a FIB lookup is performed to further identify the type for this address. We can also tell from the function signature that it returns an unsigned int. So, in the condition chk_addr_ret != RTN_LOCAL back in __inet_bind(), the integer value from inet_addr_type_table() must match whatever value is assigned to RTN_LOCAL. But what is that value?
RTN_LOCAL is actually defined as an item within an enum within include/uapi/linux/rtnetlink.h:
Since no values are being explicitly set here, each of these items is assigned the corresponding 0-based index (this is how enums work in C). This means that RTN_LOCAL would have the value of 2, RTN_BROADCAST is 3, and so on. The comment Accept locally seems to further indicate that this value represents an address found in the local routing table, but it would still be nice to confirm this somehow. Ultimately what we’re looking for is the exact integer value returned by the inet_addr_type_table() function.
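For reference, here is a Python rendering of the first few route types together with a rough model of the checks that happen before any FIB lookup. The enum order mirrors the header as I read it; the classify_without_fib helper is hypothetical, and its treatment of all of 0.0.0.0/8 plus 255.255.255.255 as broadcast is my understanding of the kernel’s bit-level helpers rather than something quoted from the source.

```python
import ipaddress
from enum import IntEnum


class RouteType(IntEnum):
    """First few route types, numbered by position as a C enum would be."""
    RTN_UNSPEC = 0
    RTN_UNICAST = 1
    RTN_LOCAL = 2
    RTN_BROADCAST = 3
    RTN_ANYCAST = 4
    RTN_MULTICAST = 5


def classify_without_fib(address):
    """Return the type for addresses that need no FIB lookup, else None.

    None means "a routing table lookup would be needed", which this
    sketch does not attempt to model.
    """
    addr = ipaddress.IPv4Address(address)
    # Assumption: zero-network and limited-broadcast addresses are
    # reported as broadcast before the FIB is consulted.
    if addr in ipaddress.IPv4Network("0.0.0.0/8") or int(addr) == 0xFFFFFFFF:
        return RouteType.RTN_BROADCAST
    if addr.is_multicast:
        return RouteType.RTN_MULTICAST
    return None


for candidate in ("0.0.0.0", "224.0.0.1", "192.168.123.123"):
    print(candidate, classify_without_fib(candidate))
```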
We could continue to dive into the kernel source code, and figure out how the internals of the FIB work, and probably get to a reasonable conclusion, but this would take considerably more time. And it turns out, we don’t have to! We can use eBPF to inspect the parameters and return value of inet_addr_type_table() on a live, running system.
bpftrace makes it really easy to create simple tracing programs on Linux that are powered by eBPF. We can attach to a kprobe for the inet_addr_type_table function, to print the address being checked whenever the function is invoked. Of course, we also want to attach to a kretprobe to print the return value from this function as well.
We will pass this file to bpftrace, and once we see the message Attaching 2 probes..., we can open a few sockets in a separate process. Here’s the output from bpftrace when we bind to a few different addresses:
- 0.0.0.0 didn’t even need to be checked against the routing table; this was returned immediately as RTN_BROADCAST, which corresponds to a value of 3, and this matches what we’re seeing in the bpftrace output.
- 10.12.0.1 is another host on our network, so this returns 1, which is RTN_UNICAST. The same applies for 192.168.1.123, which matches the default route, so this traffic will be unicast to the default gateway.
- 192.168.123.123 matches the local route we added earlier, and as expected, returns a value of 2, which corresponds with RTN_LOCAL. For good measure, I tested 127.0.0.1, which obviously matches a local route, and this also returns 2.

So where does this get us? Well, if we zoom all the way back to the conditions in __inet_bind() that could result in the return of EADDRNOTAVAIL, the one we’ve been trying to figure out thus far tries to see if chk_addr_ret != RTN_LOCAL. We know now that this will evaluate to false, since in the case of our local address, chk_addr_ret will equal 2, which is the value of RTN_LOCAL.
This means that if one of the nonlocal bind options is set, like IP_FREEBIND, or the address matches a local route, the bind can proceed to the next step. This confirms the behavior we observed in userspace, but more importantly, we know how the kernel is making the decisions it makes. Now, it’s time to take this knowledge over to the IPv6 implementation, and compare.
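Before moving on, the IPv4 decision traced so far can be condensed into a small Python model. This is a paraphrase of the 5.10 conditional as I read it, not kernel code: the function and class names are made up for the sketch, and the exact set of accepted cases (wildcard address, local, multicast, broadcast) should be treated as my reading of that check.

```python
from dataclasses import dataclass

# Route type values taken from the enum discussed earlier.
RTN_LOCAL, RTN_BROADCAST, RTN_MULTICAST = 2, 3, 5
INADDR_ANY = "0.0.0.0"


@dataclass
class SocketFlags:
    """Per-socket options relevant to non-local binds."""
    freebind: bool = False
    transparent: bool = False


def can_nonlocal_bind(sysctl_nonlocal_bind, flags):
    """Any one of the three switches is enough to allow a non-local bind."""
    return bool(sysctl_nonlocal_bind) or flags.freebind or flags.transparent


def ipv4_bind_allowed(address, chk_addr_ret, flags, sysctl_nonlocal_bind=0):
    """Model of the address check: reject only if nothing permits the bind."""
    return (
        can_nonlocal_bind(sysctl_nonlocal_bind, flags)
        or address == INADDR_ANY
        or chk_addr_ret in (RTN_LOCAL, RTN_MULTICAST, RTN_BROADCAST)
    )


# Any-IP route present: the FIB lookup yields RTN_LOCAL (2).
print(ipv4_bind_allowed("192.168.123.123", 2, SocketFlags()))
# No route and no options: the lookup yields RTN_UNICAST (1).
print(ipv4_bind_allowed("192.168.123.123", 1, SocketFlags()))
# No route, but IP_FREEBIND set on the socket.
print(ipv4_bind_allowed("192.168.123.123", 1, SocketFlags(freebind=True)))
```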
We’ve seen from playing around with sockets in userspace that IPv6 is more strict than IPv4 when it comes to non-local binds, requiring an option like IP_FREEBIND to be enabled even when the address being bound matches an Any-IP route. However, can we verify this by looking at the kernel-side implementation in the same way we’ve verified this for IPv4?
net/ipv6/af_inet6.c contains the Linux IPv6 implementation, so this is a good place to start. Searching for nonlocal_bind here actually shows a promising result; we see a conditional that looks very similar to that found in the IPv4 implementation. However, this is a red herring; if you scroll up, you’ll notice this only applies to v4-mapped IPv6 addresses, which is not what we’re working with here.
Scrolling down a little further, we see something a bit more familiar:
This conditional seems to be simpler at first glance, but we’ll have to look at the two functions that are called in order to know for sure. First, we can look at ipv6_can_nonlocal_bind():
This looks remarkably similar to the function inet_can_nonlocal_bind we saw back in the IPv4 implementation. In short, this is checking for the three options that would permit nonlocal binds to take place with IPv6 addresses: the net.ipv6.ip_nonlocal_bind sysctl option, and the two socket options IP_FREEBIND and IP_TRANSPARENT. If any of these are enabled, this function returns true. Because the negated result of this function is the first operand of a logical AND (&&), a true result short-circuits the conditional, and the second half, calling ipv6_chk_addr, wouldn’t even execute.
We know that neither net.ipv6.ip_nonlocal_bind nor IP_TRANSPARENT is set, so the presence of IP_FREEBIND is clearly what’s allowing the bind to move past this potential EADDRNOTAVAIL return. However, let’s take a look at what would happen if we didn’t set this option, which would result in a false result, and cause ipv6_chk_addr() to be called. Given that this is the second of only two conditions to be checked, this function must return a true result, or our bind will fail. So what does ipv6_chk_addr() do?
ipv6_chk_addr() is just a passthrough for another function, ipv6_chk_addr_and_flags(), passing along its own parameters and a few others. This function in turn does much the same thing to __ipv6_chk_addr_and_flags(), which is where the decision is ultimately made.
The first important thing to keep in mind is that both ipv6_chk_addr() and ipv6_chk_addr_and_flags() have an int return type, and will return 0 to indicate false, and 1 to indicate true. However, __ipv6_chk_addr_and_flags() will return a pointer to a struct net_device. This can of course be either a NULL or non-NULL value, and you’ll notice the ternary operator translates these to int values 0 and 1, respectively, before returning the result.
Within __ipv6_chk_addr_and_flags, you’ll notice the use of hlist_for_each_entry_rcu - this is a macro used for iterating over an RCU list, and in this case is iterating over inet6_addr_lst, which is a hash table of all configured IPv6 addresses on the system.
From here it gets a bit more straightforward - the conditional at the bottom of the loop first compares the address being passed in to this function against the current entry in inet6_addr_lst. If none of the entries match, the iteration ends, and the final statement returns a NULL. Following this back up the chain, this will cause ipv6_chk_addr_and_flags() to return a 0, which will cause ipv6_chk_addr() to return a 0, which will be interpreted as a false by the conditional back in the main IPv6 implementation. When this happens, an EADDRNOTAVAIL is returned, and the bind fails.
This is our smoking gun - if the address we’re attempting to bind to isn’t configured on the system, one of the three options that explicitly permit this must be enabled, otherwise, it will fail. No FIB lookup, no implicit Any-IP tie-in.
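The same kind of toy model makes the contrast with IPv4 easy to see. Again, this is a sketch with made-up names, covering only the ordinary unicast case described above: the configured list stands in for the table of addresses actually assigned to interfaces, and there is deliberately no routing table input at all.

```python
import ipaddress


def ipv6_bind_allowed(address, configured, freebind=False,
                      transparent=False, sysctl_nonlocal_bind=0):
    """Model of the IPv6 address check for a unicast bind.

    `configured` is an iterable of IPv6 addresses assigned to interfaces.
    Routes (including Any-IP routes) play no part in the decision.
    """
    # Any of the three switches short-circuits the address lookup.
    if sysctl_nonlocal_bind or freebind or transparent:
        return True
    wanted = ipaddress.IPv6Address(address)
    # Otherwise the address must match a configured one exactly.
    return any(ipaddress.IPv6Address(item) == wanted for item in configured)


configured_addresses = ["::1"]

# Not configured anywhere, no options: rejected (EADDRNOTAVAIL in practice).
print(ipv6_bind_allowed("2001:db8::1", configured_addresses))
# Same address with IP_FREEBIND: accepted.
print(ipv6_bind_allowed("2001:db8::1", configured_addresses, freebind=True))
# A configured address needs no options.
print(ipv6_bind_allowed("::1", configured_addresses))
```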
Just because I like to be exhaustive, we can verify all of this again using bpftrace:
The kprobe here will let us know when __ipv6_chk_addr_and_flags() is called, and will print the address being checked. The kretprobe will let us know what value it returns.
As expected, neither of these triggers when we’re binding using IP_FREEBIND, since this is enough to get our conditional in __inet6_bind() to exit early. However, when we omit that socket option, and use a nonlocal IPv6 address, we see a return value of 0:
Of course, one potential source of confusion (at least it was for me) was that Any-IP is totally supported for IPv6 (it was originally added back in 2010). This is great news, because Any-IP is even more useful in IPv6; you can treat an absurdly large number of addresses as “local”, with a single routing entry. So don’t be misled into believing that somehow this feature is missing.
The difference here is that unlike IPv4, the FIB is not consulted when binding an IPv6 address to a socket, full-stop. If you want to bind to a non-local address, you must use something like IP_FREEBIND.
At this point I feel it’s obvious I’ve kicked this dead horse quite a bit. I have a pretty firm grasp on the code, and I understand the conditions and logic that leads to the behavior I’m seeing. However, there’s still one question lingering in my mind:
To be honest... I am not really sure. And to be clear, I wouldn’t consider this a huge problem necessarily, just a slight irritation. It seems that most people I’ve talked to about this have been bitten by it in the past, and have just learned to always pass an option like IP_FREEBIND when doing non-local address binds.
Most of the reason I dug into this as far as I did was in case there’s a more concrete reason that IPv6 binds don’t do a FIB lookup - a corner case I haven’t considered, that might bite me as I use this feature in production. To date I haven’t found one (though I did ask on the netdev mailing list, and if I get a response I’ll be sure to update this section).
The best answer I’ve gotten thus far is that this wasn’t exactly intentionally left out, more likely a byproduct of the fact that the IPv6 implementation was developed separately, and different decisions were made. Could be as simple as that. IPv6 is its own protocol, with its own set of considerations and decisions to be made, rather than a simple extension of IPv4. So I’d buy this reason. However, if anyone knows of any other reasons I haven’t covered, I’d love to know, both for my own curiosity as well as awareness of corner cases I’ve not considered. Please comment below if you have any information here.