updated: 2018-06-05 Replace gorb with merlin.
In addition to the Kubernetes stack on AWS, I’m also helping to build an on-premise Kubernetes platform. We want to continue to leverage feed, the ingress controller we built. Ingress generally requires an external IP load balancer to front requests from the internet and elsewhere. In AWS we use ELBs. For on-premise, we need to build our own.
The solution we’ve settled on for now is:
- IPVS with consistent hashing (using built-in source hash module) and direct-return.
- merlin to provide an API for ipvs so our ingress controller can attach and detach itself.
- VIPs registered to a DNS entry with active/passive failover, handled by keepalived.
IPVS is the Linux kernel solution to load balancing IP connections. It has been around a while, is quite stable, and is used by many companies in production. It has a lot of options and some quirks.
We chose an active/passive setup as it’s pretty simple to get going and well documented.
The only reasonable configuration is to use either direct return or tunnelling. NAT doesn’t scale and has resiliency issues. Direct return requires that your real servers be on the same Ethernet network. We chose this approach since our use case is relatively simple at the moment. Tunnelling is much more generic, as you can send traffic to any IP-reachable server, but requires a bit more configuration.
Direct return alters the destination MAC address in the original packet to be that of the chosen real server. The real server needs to be configured with the same VIP as the IPVS node on its loopback interface. The IPVS node sends the modified packet out onto the local network, and it gets directed to the real server, which then routes it internally to its loopback. As the source address remains unaltered, the real server can send the response directly back to the client without having to go through IPVS. As responses tend to be much larger than requests, this can substantially improve performance. Note that ELBs, in comparison, proxy both requests and responses, like IPVS-NAT.
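To make the packet flow concrete, here is a minimal Python sketch of direct return. It is a toy model, not IPVS code: the `Frame` type, the function names, and all addresses are made up for illustration, and only the fields that matter to the explanation are modelled.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Frame:
    """Simplified frame: only the fields relevant to direct return."""
    dst_mac: str
    src_ip: str
    dst_ip: str


def direct_return_forward(frame: Frame, real_server_mac: str) -> Frame:
    """IPVS node: rewrite only the destination MAC; the IP header is untouched."""
    return replace(frame, dst_mac=real_server_mac)


def real_server_reply(request: Frame, gateway_mac: str) -> Frame:
    """Real server: reply from the VIP (held on loopback) straight to the client."""
    return Frame(dst_mac=gateway_mac, src_ip=request.dst_ip, dst_ip=request.src_ip)


# Hypothetical addresses, for illustration only.
VIP = "192.0.2.10"
incoming = Frame(dst_mac="02:00:00:00:00:01", src_ip="198.51.100.7", dst_ip=VIP)

forwarded = direct_return_forward(incoming, real_server_mac="02:00:00:00:00:02")
reply = real_server_reply(forwarded, gateway_mac="02:00:00:00:00:ff")

print(forwarded)
print(reply)
```

Because the forwarded frame still carries the VIP as its destination IP, the real server only accepts it if the VIP is configured locally, which is why the VIP has to be on its loopback.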
Our real servers are the Kubernetes nodes running feed, our ingress controller. We’ve modified feed to register the VIPs on the node’s local loopback to facilitate direct return.
Connection state and consistent hashing
As we have an active/passive setup, we need to keep state synchronised between nodes so that when failover occurs, connections are preserved. One approach is to use IPVS connection syncing, but this has some problems:
- Only deltas are synced – so if an IPVS node is restarted, it knows nothing about existing connection state.
- Increases traffic between IPVS nodes on the order of number of connections being created.
- Connection sync may be delayed causing some connections to get dropped on failover.
Using consistent hashing obviates the need for state sync between IPVS processes. IPVS will use its local connection state table for quick lookup. If that fails, it will use its scheduler to pick the destination server. As long as we use a consistent hash, such that every IPVS node would pick the same real server, we can avoid having to synchronise state.
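The idea can be sketched in a few lines of Python. This is a toy model under stated assumptions: `Director` and `pick_real_server` are hypothetical names, and SHA-256 stands in for the kernel's hash purely to get a deterministic mapping. Each director checks its own connection table first and falls back to the deterministic scheduler, so two directors with the same server list agree without exchanging any state.

```python
import hashlib


def pick_real_server(src_ip: str, src_port: int, real_servers: list[str]) -> str:
    """Deterministically map a client (ip, port) to one real server."""
    key = f"{src_ip}:{src_port}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(real_servers)
    return real_servers[index]


class Director:
    """Toy IPVS node: local connection table plus a deterministic scheduler."""

    def __init__(self, real_servers: list[str]) -> None:
        self.real_servers = list(real_servers)
        self.connections: dict[tuple[str, int], str] = {}

    def route(self, src_ip: str, src_port: int) -> str:
        key = (src_ip, src_port)
        if key not in self.connections:
            # Table miss (new connection, or this node just took over):
            # fall back to the scheduler.
            self.connections[key] = pick_real_server(src_ip, src_port, self.real_servers)
        return self.connections[key]


# Hypothetical addresses, for illustration only.
servers = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]
active = Director(servers)
passive = Director(servers)  # never saw the original SYN

choice_before_failover = active.route("198.51.100.7", 40001)
choice_after_failover = passive.route("198.51.100.7", 40001)
print(choice_before_failover == choice_after_failover)  # True
```

The agreement only holds while both directors have the same server list in the same order, which is why the configuration has to be kept identical across IPVS nodes.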
One of the built-in schedulers provides this functionality: source hashing, or sh. It indexes into the list of real servers using a hash of the source address. Combined with merlin, which configures all IPVS nodes to have the same set of real servers, we can ensure consistent server selection for any incoming packet and seamless failover. Setting this up correctly requires tweaking a few IPVS settings:
- sh-port to include the source port. This is important to improve the distribution of connections.
- sh-fallback to use the next real server if the server is unavailable. Needed for draining.
- net.ipv4.vs.sloppy_tcp=1 allows the passive node to handle preexisting connections (otherwise IPVS will ignore any TCP packets it doesn’t handle the initial SYN for).
- net.ipv4.vs.schedule_icmp properly schedules ICMP packets required for correct TCP functioning.
- net.ipv4.vs.expire_nodest_conn=1 to remove stale/dead connections so further packets will get resets, letting clients quickly retry.
- net.ipv4.vs.conn_reuse_mode=2 so IPVS can properly detect terminated connections with direct return. Without this, it’s possible for source port reuse to lead to broken connections.
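For reference, here is a hedged sketch of how these settings could be applied by hand with `sysctl` and `ipvsadm`. Our setup configures IPVS through merlin rather than `ipvsadm`, so this only illustrates the equivalent commands. The VIP, port, and real server address are placeholders, and `schedule_icmp=1` is an assumption, since no value is given for it above. The script only prints the commands; it does not run them.

```python
import shlex

# Assumption: schedule_icmp is enabled with 1; the other values come from the text.
SYSCTLS = {
    "net.ipv4.vs.sloppy_tcp": 1,
    "net.ipv4.vs.schedule_icmp": 1,
    "net.ipv4.vs.expire_nodest_conn": 1,
    "net.ipv4.vs.conn_reuse_mode": 2,
}


def sysctl_commands(settings: dict[str, int]) -> list[list[str]]:
    """One `sysctl -w name=value` command per setting."""
    return [["sysctl", "-w", f"{name}={value}"] for name, value in settings.items()]


def add_service_command(vip: str, port: int) -> list[str]:
    """Virtual service with the sh scheduler and both scheduler flags."""
    # -A add service, -t TCP service, -s scheduler, -b scheduler flags
    return ["ipvsadm", "-A", "-t", f"{vip}:{port}", "-s", "sh", "-b", "sh-port,sh-fallback"]


def add_real_server_command(vip: str, port: int, real_ip: str, weight: int = 1) -> list[str]:
    """Attach a real server using direct routing."""
    # -a add server, -r real server, -g direct routing (gatewaying), -w weight
    return ["ipvsadm", "-a", "-t", f"{vip}:{port}", "-r", f"{real_ip}:{port}",
            "-g", "-w", str(weight)]


# Placeholder addresses, for illustration only.
commands = sysctl_commands(SYSCTLS)
commands.append(add_service_command("192.0.2.10", 80))
commands.append(add_real_server_command("192.0.2.10", 80, "10.0.0.11"))

for command in commands:
    print(shlex.join(command))
```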
Once we had done this, we still ran into some issues with poor balancing to the real servers. Some investigation showed the problem: the sh hash function has a bug. It uses Knuth’s multiplicative hashing, but with the common error of not right-shifting the final result. As a result, the higher bits of the key have virtually no impact on the hash. The missing right shift is visible in the kernel source.
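The effect of the missing shift is easy to reproduce. The sketch below illustrates this class of bug rather than copying the kernel code: the function names are made up, and the 256-bucket table size is an assumption. Masking the low bits of the product means only the low bits of the key matter, whereas keeping the high bits of the 32-bit product lets every key bit contribute.

```python
import ipaddress

KNUTH_MULTIPLIER = 2654435761  # roughly 2**32 divided by the golden ratio
TABLE_BITS = 8                 # assumption: 256 buckets
MASK_32 = 0xFFFFFFFF


def hash_without_shift(key: int) -> int:
    """Multiply, then keep the LOW bits: high key bits never reach the result."""
    return (key * KNUTH_MULTIPLIER) & ((1 << TABLE_BITS) - 1)


def hash_with_shift(key: int) -> int:
    """Multiply, then keep the HIGH bits of the 32-bit product."""
    return ((key * KNUTH_MULTIPLIER) & MASK_32) >> (32 - TABLE_BITS)


def ip_key(addr: str) -> int:
    return int(ipaddress.IPv4Address(addr))


# 256 clients that differ only in the middle octets of their address.
clients = [ip_key(f"10.{a}.{b}.7") for a in range(4) for b in range(64)]

buckets_without_shift = {hash_without_shift(k) for k in clients}
buckets_with_shift = {hash_with_shift(k) for k in clients}

print(len(buckets_without_shift))  # 1: every client lands in the same bucket
print(len(buckets_with_shift))     # many distinct buckets
```

With the shift missing, clients that share the low byte of the key all collide, which matches the poor balancing described above.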
As a result, we ended up building our own scheduler module using the ip_vs_sh module code with the hash function fixed. This is pretty easy to build and install using DKMS.
There are some other options for scheduling we’d like to consider, but sh with the hash fix provides a very good distribution for us. A promising option would be to implement maglev hashing, which has some nice load balancing properties.
To support deployment, reboots, and general movement of our ingress nodes, we have a drain mechanism that works well even under high loads:
- On graceful termination, feed (our ingress controller) sets its weight in IPVS to 0. This means new connections will go to the next server, while existing connections remain intact.
- Feed waits for some drain period, such as 300s, before finally removing itself from IPVS. This gives existing client connections a chance to close. Once the real server is removed from IPVS, IPVS will drop all the related connections.
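The two drain steps can be sketched as follows. This is not feed's actual code: `IpvsApi`, `RecordingApi`, and `drain_and_detach` are hypothetical names standing in for whatever client talks to the IPVS API, and the sleep function is injectable so the example runs instantly.

```python
import time
from typing import Callable, Protocol


class IpvsApi(Protocol):
    """Hypothetical client interface for the IPVS management API."""

    def set_weight(self, server: str, weight: int) -> None: ...
    def remove_server(self, server: str) -> None: ...


class RecordingApi:
    """In-memory stand-in that records calls instead of touching IPVS."""

    def __init__(self) -> None:
        self.calls: list[tuple] = []

    def set_weight(self, server: str, weight: int) -> None:
        self.calls.append(("set_weight", server, weight))

    def remove_server(self, server: str) -> None:
        self.calls.append(("remove_server", server))


def drain_and_detach(
    api: IpvsApi,
    server: str,
    drain_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Stop new connections, wait for existing ones to finish, then detach."""
    api.set_weight(server, 0)   # weight 0: no new connections are scheduled here
    sleep(drain_seconds)        # give existing connections a chance to close
    api.remove_server(server)   # remaining connections are dropped at this point


api = RecordingApi()
waited: list[float] = []
drain_and_detach(api, "10.0.0.11", drain_seconds=300, sleep=waited.append)
print(api.calls)
```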
This process relies on IPVS maintaining connection state for preexisting connections, so it has a couple of drawbacks. If an IPVS node has a failure or restart during this process, any draining connections will be discarded. Similarly, if an IPVS node restarts after the drain, any shifted connections (from the real server list changing) will get broken. This is a tricky issue to solve, and we are still trying to figure out the right approach. But for the normal use case, feed deployments and migrations should be seamless.
Merlin, combined with etcd, provides a distributed consistent view of the set of real backend servers. It’s merlin’s job to provide both an API and synchronisation of IPVS configuration across all the IPVS nodes.
Merlin runs on each IPVS node. The active IPVS node has a special management VIP that feed can use to access the merlin API. This is necessary because the feed node itself will have the normal VIP on its loopback, so can’t use the normal VIP to access the IPVS node.
We originally used gorb, but kept running into issues around its management of the IPVS state. After many attempts to fix these issues, we came to the conclusion that the fundamental way it reconciled state was not safe. So we built merlin from scratch, around a reconciliation loop that reconciles actual state with state in the store. This gives us high confidence in merlin’s ability to keep the local state correct, regardless of any external modifications. While we were at it, we modernized the API with gRPC, exposed the full options of IPVS, and provided a nice CLI called
meradm in the spirit of
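A reconciliation loop of this kind can be sketched as below. This is a simplified illustration, not merlin's implementation: state is modelled as a plain mapping from real server to weight, and `reconcile` and `apply_ops` are hypothetical names. Each pass compares the desired state from the store with the actual local state and emits whatever operations close the gap, so any external modification is corrected on the next pass.

```python
def reconcile(desired: dict[str, int], actual: dict[str, int]) -> list[tuple]:
    """Return the operations that turn `actual` into `desired` (server -> weight)."""
    ops: list[tuple] = []
    for server, weight in desired.items():
        if server not in actual:
            ops.append(("add", server, weight))
        elif actual[server] != weight:
            ops.append(("edit", server, weight))
    for server in actual:
        if server not in desired:
            ops.append(("delete", server))
    return ops


def apply_ops(actual: dict[str, int], ops: list[tuple]) -> None:
    """Apply operations to the local state (here, just a dict)."""
    for op in ops:
        if op[0] == "delete":
            del actual[op[1]]
        else:
            actual[op[1]] = op[2]


# Desired state from the store versus what is actually configured locally.
desired = {"10.0.0.11": 1, "10.0.0.12": 0}
actual = {"10.0.0.12": 1, "10.0.0.13": 1}

ops = reconcile(desired, actual)
apply_ops(actual, ops)
print(ops)
print(reconcile(desired, actual))  # [] once the states match
```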
We are using a simple active/passive IPVS node pair setup. Each pair gets its own VIP (virtual IP), which is added to the DNS entry for all domains our ingress can handle. keepalived handles setting the VIP on the active node, and issues a gratuitous ARP on failover to update any forwarding tables.
This has some drawbacks though:
- Half your nodes sit idle.
- Relies on DNS load balancing to direct requests to multiple VIPs. This has all the drawbacks of DNS load balancing – such as clients caching IP addresses indefinitely. This can lead to poor load balancing and also poor failure recovery.
- It requires use of keepalived to handle failover of the VIP. This introduces its own set of failure scenarios, such as split brain.
- Scaling is somewhat painful and slow, as it requires adding a new pair and waiting for DNS to propagate.
A better, though somewhat more complex, solution would be to use ECMP in the local router to load balance across all the IPVS nodes, making everything active. This is something we need to investigate, and will probably move towards in the future. It requires using BGP to update the local router’s routing tables as IPVS nodes go in and out of service. Combined with consistent hashing, this should allow for non-disruptive removal and addition of IPVS nodes in a fully active setup.
Our on-premise solution for front-ending Kubernetes is still a work in progress. There is a lot more to be done and improved upon. I think we have a good start, though.
- ECMP for a fully active solution.
- Better consistent hashing if we need it.
- Support for