Non-Local Address Binds in Linux

February 24, 2022 in Systems · 21 minutes

I came across some interesting sockets-related behavior this week that caused me to go down a bit of a rabbit hole. This ended up taking me on a tour of Linux’s socket and IPv4/IPv6 implementation. I thought the journey was instructive, and I hope that my attempt to recount the steps I went through is useful to you.

Working with sockets on Linux is typically done with a handful of syscalls. The first and most obvious one, socket(), creates the socket; bind() is then used to assign an address to it (the kind of address depends on the address family specified when creating the socket).

At this point, you can use either connect() to connect to an existing listener, or listen() to wait passively for incoming connections. However, for the scope of this post, we’ll focus just on the first few steps, up through bind(). It’s important to note that bind() takes place before either of those calls, which makes it the way you choose the socket’s local address. When you send traffic to the other side of the socket, either as a client or server, this will be the source address in those packets.

In most programming languages (especially systems-focused ones), this is quite easy to do, as a large number have basic socket primitives built into the standard library. Python, as an example, has a socket module in its standard library, and creating a socket and binding it are both one-liners:

import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("0.0.0.0", 0))

We’ll stick with IPv4 for the first few examples, but we’ll be covering IPv6 in this post as well.

We specified 0.0.0.0 as the address, which tells the OS to listen on all IP addresses currently configured on the system. This can be useful, as when you’re writing software you often don’t know (and don’t want to know) the actual IP address of the server where that software is going to run - you just know you want to grab whatever address is available.
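
As a small aside, here is a minimal sketch of how you could check what the kernel actually assigned after a wildcard bind. Passing port 0 asks the kernel to pick a free port, and getsockname() reads the result back. The variable names are just for illustration.

```python
import socket

# Bind to the wildcard address and let the kernel pick a free port.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.bind(("0.0.0.0", 0))
    # getsockname() returns the (address, port) pair the socket is bound to.
    local_addr, local_port = s.getsockname()

print(f"bound to {local_addr}:{local_port}")
```

The address stays 0.0.0.0 because we never picked a specific one, while the port is whatever the kernel had available at the time.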

We can inspect the actual syscalls being made using strace:

socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 3
bind(3, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0

However, there are use cases where being more explicit about which address a given socket binds to is necessary. Imagine that there’s some software on your machine that should only accept connections from other software running on the same machine. In this case, you would probably want to bind only to addresses in 127.0.0.0/8, as this is the dedicated range for this kind of traffic (systemd-resolved works this way, as an example). This is also useful in situations where a machine has multiple network connections - you can bind only to an address in a particular subnet. This is common on firewalls where a webserver running an administrative application is only available on the “inside” connection.

These are all fairly common reasons why you might explicitly specify an address during bind(), but ultimately all of these methods involve using an address that already “belongs” to the system, meaning it’s either built-in (as is the case with 127.0.0.0/8) or it is configured on a network interface.

Observing Non-Local Binds from Userspace

However, some use cases exist where you might want to bind to an address that’s not configured on any network interface at all. One of the canonical examples is when you want the return traffic for a given connection to pass through a load balancer, and leave it up to the load balancer to determine where to ultimately deliver the traffic.

There are a few ways you can do this in Linux, but we’ll look at two. The first is through a socket option called IP_FREEBIND (described in Linux’s IPv4 protocol documentation). Socket options are per-socket configuration settings that you can specify after initially creating the socket. This is done through the setsockopt() syscall. Linux uses an integer value of 15 for the IP_FREEBIND option, so that’s what we need to pass in to setsockopt(), while also specifying a value of 1, indicating this option should be enabled. Python also makes this easy:

# creating this, for convenience (will be used in later examples)
IP_FREEBIND = 15

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_IP, IP_FREEBIND, 1)
s.bind(("192.168.123.123", 0))

This will work, despite the fact that 192.168.123.123 is not actually configured on any of our system’s network interfaces.
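
If you want to convince yourself that the option really took effect, a sketch like the following reads it back with getsockopt() before binding. I’m using 192.0.2.10 from the documentation range here on the assumption that it is not configured on your machine, and the getattr() fallback is only there in case your Python build doesn’t export an IP_FREEBIND constant.

```python
import socket

# Use the constant if this Python build exports it, else the Linux value.
IP_FREEBIND = getattr(socket, "IP_FREEBIND", 15)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_IP, IP_FREEBIND, 1)

# Read the option back; a non-zero value means it is enabled on this socket.
freebind_enabled = s.getsockopt(socket.SOL_IP, IP_FREEBIND)

# Assumed to be an address that is NOT configured on any local interface.
s.bind(("192.0.2.10", 0))
bound_addr = s.getsockname()[0]
s.close()

print(f"IP_FREEBIND={freebind_enabled}, bound to {bound_addr}")
```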

Another way to accomplish this in Linux (which as we’ll soon see only works for IPv4) is through a feature called Any-IP. This is a fancy term for adding an entry to the local routing table, indicating that any traffic received on a given prefix should be handled by the local machine on the specified interface, as if every address in that prefix was individually configured on that interface. This can be very useful if you want to potentially bind to a huge number of addresses - rather than configuring each one individually on an interface, you can just add a single routing entry that summarizes them:

~$ ip route add local 192.168.123.0/24 dev lo

With TCPv4 sockets we don’t even need the IP_FREEBIND option. As long as the address we’re binding to exists in an Any-IP route, this will work just fine.

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("192.168.123.123", 0))

So, to summarize, in order to bind to an IPv4 address that is not configured on an interface, you must either specify the IP_FREEBIND socket option, or bind to an address that’s part of an Any-IP route.

In addition to the two methods we’ve just explored, there are two other mechanisms in Linux that, when enabled, also allow a socket to bind to an address that is non-local (not configured anywhere on the machine).

  • The sysctl option net.ipv4.ip_nonlocal_bind (and the IPv6 equivalent net.ipv6.ip_nonlocal_bind) - this is a system-wide setting, so it affects all sockets. Naturally, elevated privileges are required to set this option.
  • The IP_TRANSPARENT socket option. This is used for transparent proxying. Like IP_FREEBIND, it is a per-socket option, but it requires root privileges or the CAP_NET_ADMIN capability to enable. It also requires additional iptables rules as part of the transparent proxy configuration.

While both of these options incidentally enable non-local binds, they are designed for different purposes and/or come with drawbacks we don’t want to deal with for this example. So while we’ll see references to these options in our exploration, just know that they’re out of scope for this post.
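
Even though we’re leaving these alone, it’s worth checking that the system-wide sysctl isn’t already enabled before experimenting, since it would mask the behavior we’re about to look at. Here is a minimal sketch that reads both values from /proc/sys; the helper name is my own, and it returns None when the file can’t be read (for example, if IPv6 is disabled).

```python
from pathlib import Path

def read_nonlocal_bind(family):
    """Return net.<family>.ip_nonlocal_bind as an int, or None if unreadable."""
    path = Path(f"/proc/sys/net/{family}/ip_nonlocal_bind")
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None

settings = {family: read_nonlocal_bind(family) for family in ("ipv4", "ipv6")}
print(settings)
```

A value of 0 for both means neither address family allows non-local binds by default, which is the state the examples in this post assume.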

Now, it’s time to see how this is done in IPv6. Let’s take the Any-IP approach, by first creating a local route for a prefix that doesn’t match any IPv6 address configured on our system:

~$ ip -6 route add local 2001:db8:123::/64 dev lo

This should allow us to simply create the socket and bind to any address in that prefix:

# Note the AF_INET6 address family
s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
# Matches the previously-configured Any-IP route
s.bind(("2001:db8:123::1", 0))

However, this fails with OSError: [Errno 99] Cannot assign requested address. We can see with strace that bind() is returning an EADDRNOTAVAIL:

socket(AF_INET6, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 3
bind(3, {sa_family=AF_INET6, sin6_port=htons(0), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "2001:db8:123::1", &sin6_addr), sin6_scope_id=0}, 28) = -1 EADDRNOTAVAIL (Cannot assign requested address)

Linux’s IPv6 protocol documentation doesn’t contain an IP_FREEBIND option like the IPv4 version does, but it does say “The IPv6 API aims to be mostly compatible with the IPv4 API (see ip(7)). Only differences are described in this man page”. Because of this, I tried setting the IP_FREEBIND option on the socket:

s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_IP, IP_FREEBIND, 1)
s.bind(("2001:db8:123::1", 0))

This worked, so this means that the IP_FREEBIND socket option is IP version agnostic. However, I was intrigued, because with IPv4, the presence of an Any-IP route meant that IP_FREEBIND was not needed. Clearly in IPv6, it still is for some reason.

Next, I wanted to test the reverse case: IP_FREEBIND set, but without a matching Any-IP route:

~$ ip -6 route del local 2001:db8:123::/64 dev lo

Despite the lack of an Any-IP prefix, the previous example will still work.

So, it seems that while IPv4 sockets require either an Any-IP route, or IP_FREEBIND, IPv6 is a bit more strict; regardless of whether or not an Any-IP prefix matches the address you want to bind to, you always need to use the IP_FREEBIND option to bind to an address that’s not actually configured on an interface.
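
To make the comparison repeatable, the experiments above can be sketched as a small matrix: both address families, with and without IP_FREEBIND. The helper name and the two test addresses (taken from the documentation ranges) are my own choices, and the outcome for the cases without IP_FREEBIND depends on which local routes and sysctls are set on your machine, so I’m deliberately not listing expected output.

```python
import errno
import socket

# Use the constant if this Python build exports it, else the Linux value.
IP_FREEBIND = getattr(socket, "IP_FREEBIND", 15)

def bind_outcome(family, address, freebind):
    """Return "ok" or the symbolic errno name for a single bind attempt."""
    try:
        with socket.socket(family, socket.SOCK_STREAM) as s:
            if freebind:
                s.setsockopt(socket.SOL_IP, IP_FREEBIND, 1)
            s.bind((address, 0))
        return "ok"
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))

# Addresses assumed NOT to be configured on any local interface.
cases = [
    (socket.AF_INET, "192.0.2.10"),
    (socket.AF_INET6, "2001:db8:ffff::10"),
]
results = {
    (address, freebind): bind_outcome(family, address, freebind)
    for family, address in cases
    for freebind in (False, True)
}
for (address, freebind), outcome in results.items():
    print(f"{address:<20} freebind={freebind!s:<5} -> {outcome}")
```

Running this before and after adding the Any-IP routes for the matching prefixes is a quick way to see which combinations change.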

I’ve found a few other posts ([1], [2]) that seem to confirm that this difference in behavior is real, so I felt better that what I was observing wasn’t due to some error on my part, but there were still some unanswered questions rattling around in my head:

  1. Is it possible I’m still doing something wrong? After all, this is really the first time I’ve played around with binding sockets to nonlocal addresses on Linux.
  2. Assuming I’m not doing anything wrong, why does the IPv6 implementation differ? Is there some kind of other corner case I’m not thinking of, or some other mechanism for implicitly allowing nonlocal binds that I haven’t found?

I spent a good chunk of time Googling, but there were few results that even acknowledged this difference, never mind explained why it was the case. Soon, I realized that the fastest way for me to get my answers was to just go straight to the source - the kernel source code itself.

IPv4 Non-Local Binds in the Kernel

I am running a fairly recent kernel (5.10) and as a result, all examples provided are from that version. Ideally not much has changed since then as of the time of this writing, but if you’re looking at a different version, YMMV.

Since the error in question occurs when we try to bind our existing socket to an address, I figured the best place to start was looking at the implementation for the bind() syscall. This can be found in the __sys_bind() function in net/socket.c:

int __sys_bind(int fd, struct sockaddr __user *umyaddr, int addrlen)
{
	struct socket *sock;
	struct sockaddr_storage address;
	int err, fput_needed;

	sock = sockfd_lookup_light(fd, &err, &fput_needed);
	if (sock) {
		err = move_addr_to_kernel(umyaddr, addrlen, &address);
		if (!err) {
			err = security_socket_bind(sock,
						   (struct sockaddr *)&address,
						   addrlen);
			if (!err)
				err = sock->ops->bind(sock,
						      (struct sockaddr *)
						      &address, addrlen);
		}
		fput_light(sock->file, fput_needed);
	}
	return err;
}

As we know from the syscall documentation, the first parameter passed to bind() is the file descriptor where our socket was created. Naturally, this is the first parameter to __sys_bind(). This is then passed to sockfd_lookup_light() to get the socket details, including the protocol-specific implementations (remember we specified AF_INET or AF_INET6 when creating sockets?). The important step for our purposes is the call to sock->ops->bind(), which invokes the bind implementation for the protocol used by this socket.

This blog post is really great, and goes into way more detail on the process of getting to the appropriate bind implementation.

The IPv4 implementation in Linux can be found in net/ipv4/af_inet.c. Within, inet_bind() is called by the previous example when the AF_INET family is used. This ultimately calls __inet_bind(), where the real work is done.

You won’t scroll far before you see a large conditional that looks promising:

err = -EADDRNOTAVAIL;
if (!inet_can_nonlocal_bind(net, inet) &&
    addr->sin_addr.s_addr != htonl(INADDR_ANY) &&
    chk_addr_ret != RTN_LOCAL &&
    chk_addr_ret != RTN_MULTICAST &&
    chk_addr_ret != RTN_BROADCAST)
    goto out;

This evaluates a few conditions to determine the suitability of the address that we wish to bind to. First, the network namespace and the socket are passed to inet_can_nonlocal_bind:

static inline bool inet_can_nonlocal_bind(struct net *net,
					  struct inet_sock *inet)
{
	return net->ipv4.sysctl_ip_nonlocal_bind ||
		inet->freebind || inet->transparent;
}

This function checks to see if any of the three options that would allow for a nonlocal bind are present, including IP_FREEBIND, and if so, returns true.

inet_addr_valid_or_nonlocal was added in later kernel versions to further cut down on repeated code, so if you’re looking at more recent kernel versions, you may only see a call to this function. It wraps both the conditions in inet_can_nonlocal_bind as well as the address types of addr->sin_addr.s_addr and chk_addr_ret all in one place.

Since we know that enabling IP_FREEBIND on a socket will cause this function to return true, we also know that the bind will get past the conditional above in __inet_bind(): it only jumps to the error path if every clause in the && chain is true, and !inet_can_nonlocal_bind() is now false.

However, let’s assume we haven’t configured IP_FREEBIND. What other conditions could be true in our case that would enable this to still bind successfully?

The second interesting conditional from the check in __inet_bind() is:

chk_addr_ret != RTN_LOCAL

This looks interesting because the suffix LOCAL seems to imply that this address was checked for membership in the local routing table, which we know to be the mechanism by which Any-IP works. However, this is just a theory based on nothing more than the name of a referenced constant, so let’s figure out where chk_addr_ret comes from.

This value is retrieved a few lines above in __inet_bind():

chk_addr_ret = inet_addr_type_table(net, addr->sin_addr.s_addr, tb_id);

This is exported as inet_addr_type_table() in net/ipv4/fib_frontend.c but ultimately implemented via the __inet_dev_addr_type() function just above.

static inline unsigned int __inet_dev_addr_type(struct net *net,
						const struct net_device *dev,
						__be32 addr, u32 tb_id)
{
	struct flowi4		fl4 = { .daddr = addr };
	struct fib_result	res;
	unsigned int ret = RTN_BROADCAST;
	struct fib_table *table;

	if (ipv4_is_zeronet(addr) || ipv4_is_lbcast(addr))
		return RTN_BROADCAST;
	if (ipv4_is_multicast(addr))
		return RTN_MULTICAST;

	rcu_read_lock();

	table = fib_get_table(net, tb_id);
	if (table) {
		ret = RTN_UNICAST;
		if (!fib_table_lookup(table, &fl4, &res, FIB_LOOKUP_NOREF)) {
			struct fib_nh_common *nhc = fib_info_nhc(res.fi, 0);

			if (!dev || dev == nhc->nhc_dev)
				ret = res.type;
		}
	}

	rcu_read_unlock();
	return ret;
}

unsigned int inet_addr_type_table(struct net *net, __be32 addr, u32 tb_id)
{
	return __inet_dev_addr_type(net, NULL, addr, tb_id);
}
EXPORT_SYMBOL(inet_addr_type_table);

Broadcast and multicast are easy to identify at the bit level, so those are checked immediately and returned when detected. Provided the address isn’t one of those, it looks like a FIB lookup is performed to further identify the type for this address. We can also tell from the function signature that it returns an unsigned int. So, in the condition chk_addr_ret != RTN_LOCAL back in __inet_bind(), the integer value from inet_addr_type_table() must match whatever value is assigned to RTN_LOCAL. But what is that value?
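
To make the bit-level part concrete, here is a rough Python model of just the early returns, before any FIB lookup happens. It assumes, based on my reading of the helpers, that ipv4_is_zeronet() matches 0.0.0.0/8, ipv4_is_lbcast() matches 255.255.255.255, and ipv4_is_multicast() matches 224.0.0.0/4. The function name is made up, and it returns None where the kernel would go on to consult the routing table.

```python
import ipaddress

ZERONET = ipaddress.ip_network("0.0.0.0/8")
LIMITED_BROADCAST = ipaddress.ip_address("255.255.255.255")

def classify_without_fib(addr):
    """Model the early returns of __inet_dev_addr_type() (illustrative only)."""
    ip = ipaddress.IPv4Address(addr)
    if ip in ZERONET or ip == LIMITED_BROADCAST:
        return "RTN_BROADCAST"
    if ip.is_multicast:  # 224.0.0.0/4
        return "RTN_MULTICAST"
    return None  # the kernel would perform a FIB lookup here

for candidate in ("0.0.0.0", "255.255.255.255", "224.0.0.1", "192.168.123.123"):
    print(candidate, classify_without_fib(candidate))
```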

RTN_LOCAL is actually defined as an item within an enum within include/uapi/linux/rtnetlink.h:

enum {
	RTN_UNSPEC,
	RTN_UNICAST,		/* Gateway or direct route	*/
	RTN_LOCAL,		/* Accept locally		*/
	RTN_BROADCAST,		/* Accept locally as broadcast,
				   send as broadcast */
	RTN_ANYCAST,		/* Accept locally as broadcast,
				   but send as unicast */
	RTN_MULTICAST,		/* Multicast route		*/
	RTN_BLACKHOLE,		/* Drop				*/
	RTN_UNREACHABLE,	/* Destination is unreachable   */
	RTN_PROHIBIT,		/* Administratively prohibited	*/
	RTN_THROW,		/* Not in this table		*/
	RTN_NAT,		/* Translate this address	*/
	RTN_XRESOLVE,		/* Use external resolver	*/
	__RTN_MAX
};

Since no values are being explicitly set here, each of these items is assigned the corresponding 0-based index (this is how enums work in C). This means that RTN_LOCAL would have the value of 2, RTN_BROADCAST is 3, and so on. The comment Accept locally seems to further indicate that this value represents an address found in the local routing table, but it would still be nice to confirm this somehow. Ultimately what we’re looking for is the exact integer value returned by the inet_addr_type_table() function.
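
Since we’ll be staring at raw integers shortly, a small lookup helper is handy. This is just the enum above mirrored into a Python IntEnum with the implicit 0-based values written out; the RouteType and decode names are my own.

```python
from enum import IntEnum

class RouteType(IntEnum):
    """Mirror of the route type enum in include/uapi/linux/rtnetlink.h."""
    RTN_UNSPEC = 0
    RTN_UNICAST = 1
    RTN_LOCAL = 2
    RTN_BROADCAST = 3
    RTN_ANYCAST = 4
    RTN_MULTICAST = 5
    RTN_BLACKHOLE = 6
    RTN_UNREACHABLE = 7
    RTN_PROHIBIT = 8
    RTN_THROW = 9
    RTN_NAT = 10
    RTN_XRESOLVE = 11

def decode(value):
    """Translate a raw integer into its symbolic route type name."""
    return RouteType(value).name

print(decode(2))  # RTN_LOCAL
```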

We could continue to dive into the kernel source code, and figure out how the internals of the FIB work, and probably get to a reasonable conclusion, but this would take considerably more time. And it turns out, we don’t have to! We can use eBPF to inspect the parameters and return value of inet_addr_type_table() on a live, running system.

bpftrace makes it really easy to create simple tracing programs on Linux that are powered by eBPF. We can attach to a kprobe for the inet_addr_type_table function, to print the address being checked whenever the function is invoked. Of course, we also want to attach to a kretprobe to print the return value from this function as well.

kprobe:inet_addr_type_table
{
    printf("inet_addr_type_table finding type for %s\n", ntop(arg1));
}

kretprobe:inet_addr_type_table
{
    printf("inet_addr_type_table returned: %d\n", retval);
}

We will pass this file to bpftrace, and once we see the message Attaching 2 probes..., we can open a few sockets in a separate process. Here’s the output from bpftrace when we bind to a few different addresses:

~$ bpftrace addr_type_trace.bt
Attaching 2 probes...
inet_addr_type_table finding type for 0.0.0.0
inet_addr_type_table returned: 3
inet_addr_type_table finding type for 10.12.0.1
inet_addr_type_table returned: 1
inet_addr_type_table finding type for 192.168.123.123
inet_addr_type_table returned: 2
inet_addr_type_table finding type for 192.168.1.123
inet_addr_type_table returned: 1
inet_addr_type_table finding type for 127.0.0.1
inet_addr_type_table returned: 2

  • 0.0.0.0 didn’t even need to be checked against the routing table; this was returned immediately as RTN_BROADCAST, which corresponds to a value of 3, and this matches what we’re seeing in the bpftrace output.
  • 10.12.0.1 is another host on our network, so this returns 1, which is RTN_UNICAST. The same applies for 192.168.1.123, which matches the default route, so this traffic will be unicasted to the default gateway.
  • Finally, 192.168.123.123 matches the local route we added earlier, and as expected, returns a value of 2, which corresponds with RTN_LOCAL. For good measure, I tested 127.0.0.1 which obviously matches a local route, and this also returns 2.

So where does this get us? Well, if we zoom all the way back to the conditions in __inet_bind() that could result in the return of EADDRNOTAVAIL, the one we’ve been trying to figure out thus far tries to see if chk_addr_ret != RTN_LOCAL. We know now that this will evaluate to false, since in the case of our local address, chk_addr_ret will equal 2, which is the value of RTN_LOCAL.

This means that if one of the nonlocal bind options is set, like IP_FREEBIND, or the address matches a local route, the bind can proceed to the next step. This confirms the behavior we observed in userspace, but more importantly, we know how the kernel is making the decisions it makes. Now, it’s time to take this knowledge over to the IPv6 implementation, and compare.
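
To double-check my reading of that conditional, here is the same logic transcribed into Python. This is only a model of the one check in __inet_bind() as shown in the 5.10 source above, not of the whole bind path; the function and parameter names are mine, and the constants are the enum values from rtnetlink.h.

```python
# Route type values from include/uapi/linux/rtnetlink.h
RTN_UNICAST, RTN_LOCAL, RTN_BROADCAST, RTN_MULTICAST = 1, 2, 3, 5

def ipv4_bind_address_ok(can_nonlocal_bind, is_inaddr_any, chk_addr_ret):
    """Return False where the kernel check would fail with EADDRNOTAVAIL."""
    rejected = (
        not can_nonlocal_bind
        and not is_inaddr_any
        and chk_addr_ret != RTN_LOCAL
        and chk_addr_ret != RTN_MULTICAST
        and chk_addr_ret != RTN_BROADCAST
    )
    return not rejected

# No special options, address resolves to a plain unicast route: rejected.
print(ipv4_bind_address_ok(False, False, RTN_UNICAST))
# No special options, but the address matches a local (Any-IP) route: allowed.
print(ipv4_bind_address_ok(False, False, RTN_LOCAL))
# IP_FREEBIND (or a sibling option) set: allowed regardless of route type.
print(ipv4_bind_address_ok(True, False, RTN_UNICAST))
```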

IPv6 Non-Local Binds in the Kernel

We’ve seen from playing around with sockets in userspace that IPv6 is more strict than IPv4 when it comes to non-local binds, requiring an option like IP_FREEBIND to be enabled regardless of whether the address being bound matches an Any-IP route. However, can we verify this by looking at the kernel-side implementation in the same way we’ve verified this for IPv4?

net/ipv6/af_inet6.c contains the Linux IPv6 implementation, so this is a good place to start. Searching for nonlocal_bind here actually shows a promising result; we see a conditional that looks very similar to that found in the IPv4 implementation. However, this is a red herring; if you scroll up, you’ll notice this only applies to v4-mapped IPv6 addresses, which is not what we’re working with here.

Scrolling down a little further, we see something a bit more familiar:

if (!ipv6_can_nonlocal_bind(net, inet) &&
    !ipv6_chk_addr(net, &addr->sin6_addr,
            dev, 0)) {
    err = -EADDRNOTAVAIL;
    goto out_unlock;
}

This conditional seems to be simpler at first glance, but we’ll have to look at the two functions that are called in order to know for sure. First, we can look at ipv6_can_nonlocal_bind():

static inline bool ipv6_can_nonlocal_bind(struct net *net,
					  struct inet_sock *inet)
{
	return net->ipv6.sysctl.ip_nonlocal_bind ||
		inet->freebind || inet->transparent;
}

This looks remarkably similar to the function inet_can_nonlocal_bind we saw back in the IPv4 implementation. In short, this is checking for the three options that would permit nonlocal binds to take place with IPv6 addresses: the net.ipv6.ip_nonlocal_bind sysctl option, and the two socket options IP_FREEBIND and IP_TRANSPARENT. If any of these are enabled, this function returns true. Because its negated result is the left-hand side of a logical AND (&&), a true return short-circuits the conditional: the second half, the call to ipv6_chk_addr(), never executes, and the error path is skipped.

We know that neither net.ipv6.ip_nonlocal_bind nor IP_TRANSPARENT is set, so the presence of IP_FREEBIND is clearly what’s allowing the bind to move past this potential EADDRNOTAVAIL return. However, let’s take a look at what would happen if we didn’t set this option: ipv6_can_nonlocal_bind() would return false, and ipv6_chk_addr() would be called. Given that this is the second of only two conditions to be checked, this function must return a true result, or our bind will fail. So what does ipv6_chk_addr() do?

ipv6_chk_addr() is just a passthrough, for another function ipv6_chk_addr_and_flags(), passing along its own parameters and a few others. This function in turn does much the same thing to __ipv6_chk_addr_and_flags(), which is where the decision is ultimately made.

int ipv6_chk_addr(struct net *net, const struct in6_addr *addr,
		  const struct net_device *dev, int strict)
{
	return ipv6_chk_addr_and_flags(net, addr, dev, !dev,
				       strict, IFA_F_TENTATIVE);
}
EXPORT_SYMBOL(ipv6_chk_addr);

/* device argument is used to find the L3 domain of interest. If
 * skip_dev_check is set, then the ifp device is not checked against
 * the passed in dev argument. So the 2 cases for addresses checks are:
 *   1. does the address exist in the L3 domain that dev is part of
 *      (skip_dev_check = true), or
 *
 *   2. does the address exist on the specific device
 *      (skip_dev_check = false)
 */
static struct net_device *
__ipv6_chk_addr_and_flags(struct net *net, const struct in6_addr *addr,
			  const struct net_device *dev, bool skip_dev_check,
			  int strict, u32 banned_flags)
{
	unsigned int hash = inet6_addr_hash(net, addr);
	struct net_device *l3mdev, *ndev;
	struct inet6_ifaddr *ifp;
	u32 ifp_flags;

	rcu_read_lock();

	l3mdev = l3mdev_master_dev_rcu(dev);
	if (skip_dev_check)
		dev = NULL;

	hlist_for_each_entry_rcu(ifp, &inet6_addr_lst[hash], addr_lst) {
		ndev = ifp->idev->dev;
		if (!net_eq(dev_net(ndev), net))
			continue;

		if (l3mdev_master_dev_rcu(ndev) != l3mdev)
			continue;

		/* Decouple optimistic from tentative for evaluation here.
		 * Ban optimistic addresses explicitly, when required.
		 */
		ifp_flags = (ifp->flags&IFA_F_OPTIMISTIC)
			    ? (ifp->flags&~IFA_F_TENTATIVE)
			    : ifp->flags;
		if (ipv6_addr_equal(&ifp->addr, addr) &&
		    !(ifp_flags&banned_flags) &&
		    (!dev || ndev == dev ||
		     !(ifp->scope&(IFA_LINK|IFA_HOST) || strict))) {
			rcu_read_unlock();
			return ndev;
		}
	}

	rcu_read_unlock();
	return NULL;
}

int ipv6_chk_addr_and_flags(struct net *net, const struct in6_addr *addr,
			    const struct net_device *dev, bool skip_dev_check,
			    int strict, u32 banned_flags)
{
	return __ipv6_chk_addr_and_flags(net, addr, dev, skip_dev_check,
					 strict, banned_flags) ? 1 : 0;
}
EXPORT_SYMBOL(ipv6_chk_addr_and_flags);

The first important thing to keep in mind is that both ipv6_chk_addr() and ipv6_chk_addr_and_flags() have an int return type, and will return 0 to indicate false, and 1 to indicate true. However, __ipv6_chk_addr_and_flags() will return a pointer to a struct net_device. This can of course be either a NULL or non-NULL value, and you’ll notice the ternary operator translates these to int values 0 and 1, respectively, before returning the result.

Within __ipv6_chk_addr_and_flags, you’ll notice the use of hlist_for_each_entry_rcu - this is a macro used for iterating over an RCU list, and in this case is iterating over inet6_addr_lst, which is a hash table of all configured IPv6 addresses on the system.
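
From userspace, the closest thing I know of to a view of those configured addresses is /proc/net/if_inet6. To be clear, this is not the same data structure as inet6_addr_lst; I’m only assuming it’s a reasonable stand-in for "addresses actually configured on an interface". Each line has six whitespace-separated fields: the address as 32 hex digits, the interface index, the prefix length, the scope, the flags, and the device name. The function names and the sample line below are my own.

```python
import ipaddress
from pathlib import Path

def parse_if_inet6(text):
    """Parse /proc/net/if_inet6 content into (address, interface) pairs."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # skip anything that does not look like an entry
        address = ipaddress.IPv6Address(int(fields[0], 16))
        entries.append((address, fields[5]))
    return entries

def configured_ipv6_addresses():
    """Return the addresses configured on this machine, or [] if unavailable."""
    try:
        return parse_if_inet6(Path("/proc/net/if_inet6").read_text())
    except OSError:
        return []  # IPv6 disabled, or /proc not mounted

# Hand-written sample line for the loopback address.
sample = "00000000000000000000000000000001 01 80 10 80       lo\n"
print(parse_if_inet6(sample))
print(configured_ipv6_addresses())
```

An address covered only by an Any-IP route will not show up in this list, which lines up with what the kernel loop above is checking.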

From here it gets a bit more straightforward - the conditional at the bottom of the loop first compares the address being passed in to this function against the current iteration through inet6_addr_lst. If none of these match, the iteration ends, and the final statement returns a NULL. Following this back up the chain, this will cause ipv6_chk_addr_and_flags() to return a 0, which will cause ipv6_chk_addr() to return a 0, which will be interpreted as a false by the conditional back in the main IPv6 implementation. When this happens, an EADDRNOTAVAIL is returned, and the bind fails.

This is our smoking gun - if the address we’re attempting to bind to isn’t configured on the system, one of the three options that explicitly permit this must be enabled, otherwise, it will fail. No FIB lookup, no implicit Any-IP tie-in.
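
Here is that decision transcribed into Python, as a rough model only. It deliberately ignores details of the real function such as tentative address flags, scopes, and per-device checks, and it leaves out the separate handling of wildcard, multicast, and v4-mapped addresses; the names are mine, and "configured" stands in for the kernel’s table of configured addresses.

```python
import ipaddress

def ipv6_bind_address_ok(can_nonlocal_bind, address, configured):
    """Return False where the kernel check would fail with EADDRNOTAVAIL."""
    target = ipaddress.IPv6Address(address)
    # Mirrors `!ipv6_can_nonlocal_bind(...) && !ipv6_chk_addr(...)`.
    # Python's `and` short-circuits too, so the membership test is skipped
    # whenever a nonlocal bind option is enabled.
    rejected = (not can_nonlocal_bind) and (target not in configured)
    return not rejected

configured = {ipaddress.IPv6Address("::1")}

# Configured address, no options: allowed.
print(ipv6_bind_address_ok(False, "::1", configured))
# Unconfigured address, no options: rejected, even if an Any-IP route exists,
# because routes are never part of the input to this check.
print(ipv6_bind_address_ok(False, "2001:db8:123::1", configured))
# Unconfigured address with IP_FREEBIND (or a sibling option): allowed.
print(ipv6_bind_address_ok(True, "2001:db8:123::1", configured))
```

Notice that, unlike the IPv4 check, there is simply no parameter here through which a routing table entry could influence the result.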

Just because I like to be exhaustive, we can verify all of this again using bpftrace:

kprobe:__ipv6_chk_addr_and_flags
{
    printf("__ipv6_chk_addr_and_flags checking address: %s\n", ntop(((struct in6_addr *)arg1)->in6_u));
}

kretprobe:__ipv6_chk_addr_and_flags
{
    printf("__ipv6_chk_addr_and_flags returned: %d\n", retval);
}

The kprobe here will let us know when __ipv6_chk_addr_and_flags() is called, and will print the address being checked. The kretprobe will let us know what value it returns. As expected, neither of these trigger when we’re binding using IP_FREEBIND, since this is enough to get our conditional in __inet6_bind() to exit early. However, when we omit that socket option, and use a nonlocal IPv6 address, we see a return value of 0:

~$ bpftrace addr_type_trace.bt
Attaching 2 probes...
__ipv6_chk_addr_and_flags checking address: 2001:db8:1234::5678
__ipv6_chk_addr_and_flags returned: 0

Of course, one potential source of confusion (at least it was for me) was that Any-IP is totally supported for IPv6 (it was originally added back in 2010). This is great news, because Any-IP is even more useful in IPv6; you can treat an absurdly large number of addresses as “local”, with a single routing entry. So don’t be misled into believing that somehow this feature is missing.

The difference here is that unlike IPv4, the FIB is not consulted when binding an IPv6 address to a socket, full stop. If you want to bind to a non-local address, you must use something like IP_FREEBIND.

Conclusion

At this point I feel it’s obvious I’ve kicked this dead horse quite a bit. I have a pretty firm grasp on the code, and I understand the conditions and logic that leads to the behavior I’m seeing. However, there’s still one question lingering in my mind: why doesn’t the IPv6 bind path consult the FIB the way IPv4 does?

To be honest... I am not really sure. And to be clear, I wouldn’t consider this a huge problem necessarily, just a slight irritation. It seems that most people I’ve talked to about this have been bitten by it in the past, and have just learned to always pass an option like IP_FREEBIND when doing non-local address binds.

Most of the reason I dug into this as far as I did was in case there’s a more concrete reason that IPv6 binds don’t do a FIB lookup - a corner case I haven’t considered, that might bite me as I use this feature in production. To date I haven’t found one (though I did ask on the netdev mailing list, and if I get a response I’ll be sure to update this section).

The best answer I’ve gotten thus far is that this wasn’t exactly intentionally left out, more likely a byproduct of the fact that the IPv6 implementation was developed separately, and different decisions were made. Could be as simple as that. IPv6 is its own protocol, with its own set of considerations and decisions to be made, rather than a simple extension of IPv4. So I’d buy this reason. However, if anyone knows of any other reasons I haven’t covered, I’d love to know, both for my own curiosity as well as awareness of corner cases I’ve not considered. Please comment below if you have any information here.