Over the last two years I’ve been building an in house PaaS system based on Kubernetes. We started on Kubernetes 1.0, which was early days. It’s been a challenging and fun experience.
Our tech stack is:
- Java/Python/golang applications
- Gradle plugin/Jenkins for PaaS and CI/CD
- Kubernetes (1.6 + custom patches)
- Ubuntu 16.04 (with HWE kernel)+Docker 1.12
- Terraform and Ansible for provisioning
- Prometheus/Alertmanager/Monit for monitoring
Kubernetes as a tool
Kubernetes is a building block for your own systems. Out of the box it provides an API that abstracts the gory details of whatever hardware, OS, and container technology you use. This provides an excellent abstraction for product teams to develop and deploy against. An SRE team (or devops team, or whatever you want to call it) can focus on everything under the hood – the standing up and maintenance of the k8s clusters. A single team owning the k8s platform can support and scale to many teams (in our case, 50-100 devs in 6-8 person teams). Much of this is due to the design and formalised API of k8s, which provides a “just right” balancing point between platform and product development.
I think the success of k8s really goes to show it has hit a sweet spot. There are still a lot of pain points, which I’ll get to later, but I really believe that k8s is the best choice for software deployments at any medium to large size company. At small companies or startups I’d recommend fully hosted solutions.
Run everything on Kubernetes
We try to run everything inside of k8s, even special cased servers where we use kubelet manifests. The benefits are numerous:
- Get all daemonsets for free. Log shipping, node problem detector, and so on.
- pod.yml provides a much nicer abstraction for running Docker containers.
- All remote administration is centralised via the k8s api and kubectl, including ACL. SSH access is rarely needed.
PaaS on Kubernetes
Kubernetes out of the box is relatively low level if you want to do automated deployments, CI/CD, and so on. If you want a more high level experience, you’ll need to build a fair amount of tooling on top of k8s. To address this there are several commercial products like Redhat’s Openshift. Due to lack of maturity of available products when we started, we built our own abstraction using Jenkins and Gradle.
We built a gradle plugin as the main PaaS entry point for teams to build, deploy, and promote artifacts through various artifacts. This plugin evolved as we adopted new k8s features, such as ingress. The commands that are run look like this:
./gradlew release # use axion plugin to tag the repo with a version ./gradlew buildImage pushImageToDev # build docker image and push to the dev repo ./gradlew deployToDev # deploy to dev environment ./gradlew promoteFromDevToReleaseCandidate # promote image between dev and release candidate repos ./gradlew deployToShowcase # deploy to showcase environment ... and so on, including canary and production deployments
These commands can be run from a developer’s workstation, as well as in a CI/CD system. We used Jenkins for CI/CD with the excellent job-dsl-plugin which lets you fully automate job creation.
A project’s configuration happens inside of
build.gradle in a
k8plugin block. This defines all the various blobs for building images, deployments and ingress. We source a
pod.yml in the project directory which gets set as the deployment’s spec.
We create a single bootstrap job as part of the Jenkins image, which builds a folder and bootstrap job for every project. Each project bootstrap builds out the project’s build pipeline via a file
ci.yaml that exists in the root of each project. This approach was inspired by TravisCI/CircleCI, but done in Jenkins with the Job DSL plugin – which also let’s us write unit tests against the expected job creation. In the end, every team has full CI/CD pipelines for their projects and the submodules inside of them.
We handle the interface between Jenkins and the projects via generic gradle task names, such as
ciStubbedFunctionalTests, and so on. Product teams can implement these however they want, allowing for a lot of flexibility without having to resort to manual editing of a Jenkins job.
Standing up Kubernetes on AWS
When we started building this platform, we ran into a big hurdle. Not much help was provided for standing up k8s clusters, managing them, monitoring them, and so on. K8s was almost a pure software project. The hard part, the part requiring the true grit of hardened site engineers, of actually deploying it and supporting it in production, had little guidance or support. There were existing contributed config management scripts (saltstack, ansible, various others), but they were either incomplete or not particularly great, nor easily compatible with AWS. So we had to build this from the ground up, which was a challenging process.
Fortunately, these days we have great projects like kops and kubeadm which make this much easier. If you’re starting k8s, I heavily recommend you use one of these tools. Kubeadm provides building blocks. Kops is a full end to end solution for AWS k8s. If you are going to do Kubernetes from scratch, be warned – here be dragons!
Ansible + Terraform
Our platform is built with terraform for standing up the AWS nodes and infrastructure. We wrap the terraform command to support multiple environments and roles. We try to put as much as we can in auto scaling groups, since it makes it much easier to spin down and up nodes, as terraform only tracks the auto scaling group’s launch configuration.
Ansible is used to configure all the nodes. The ansible repo is packaged and prfmoted similar to any other software project, and deployed across our clusters from a centralised deployment tool. We have a set of serverspec and java tests using the fabric.io k8s client to verify the deployments. These tests are very extensive, almost too much so since they catch a lot of the flakiness inherent in config management being applied to running servers. But they give us a high level of confidence that what we deploy works.
Going forward we would really like to move towards building immutable AMIs with packer, which we think would eliminate a lot of the flakiness. Now that our development is stable I think this approach makes more sense.
Ingress and AWS
We originally used the
Loadbalancer service type, which uses the AWS functionality in k8s to create an ELB and connect it to all nodes in the cluster. We found this had a few limitations:
- Creates an ELB for every single service – we quickly ran into ELB limits, and it’s also costly with hundreds of services.
- Each service ELB attaches to every single node in the cluster, which causes a large amount of unnecessary traffic from health checks. It increases as O(m*n) where m=# services and n=# nodes.
- Performs poorly in failure scenarios. If a single pod dies, the ELB might end up removing a significant portion of your cluster’s nodes because it doesn’t know where the unhealthy instance is.
Because of these reasons we decided to move to ingress resources as soon as it was available. Ingress is a generic resource that defines how you want your service accessed externally. A controller is created in the cluster to watch ingress resources and then setup the external access. A benefit in AWS is we could create a single shared ELB and handle virtual dispatch in our cluster instead of outside of it, allowing us to avoid the drawbacks described above.
As there wasn’t anything available at the time, we built our own controller called feed. It is nginx based and uses Route53 to point ingress hostnames to the ELB. The nginx controller is lightweight, and can handle almost all our traffic with only a few single CPU instances (we have more for availability). This was a big improvement for us.
In modern k8s I would advise not using LoadBalancer type, and use an ingress controller fronted by ELBs.
I also would avoid using ALBs. The connection draining behaviour on them is broken when we last tried using them a few months ago. This prevents 0 downtime deployments of the nginx component.
Ubuntu and Docker
We run Ubuntu 16.04 LTS with the hwe kernel. We use the hwe kernel because it has the recommended version of the ixgbevf driver for AWS enhanced networking. The only issue we had so far was a kernel bug which caused MTU to get locked to 1500 instead of the 9001 jumbo frame AWS supports. Fortunately this was non disruptive.
We’re on the latest version of Docker supported by k8s, which is 1.12, using aufs. After a lot of trial and error, we settled on the combo of Ubuntu+Docker+AUFS for a stable solution. We spent a lot of time trying to get Centos+Devicemapper/DirectLVM stable, but we could not avoid the occasional hard lockup. The other issue we encountered is that devicemapper seems to degrade poorly in full disk situations, usually resulting in data loss.
It’s still not perfect though. All of our docker processes leak memory over time. We’ve had to create a monit task to restart docker when memory exceeds some threshold. Fortunately the live-restore feature of Docker 1.12 seems to work, so the restart is non-disruptive.
Monitoring and management
On a node level, every daemon runs as a systemd service with Restart=always. Monit is used to check high level behaviour is functioning (for example, if DNS is working).
VPC wide, metrics are fed into a Prometheus server. Combined with alertmanager, this is the main mechanism for alerting on problems. All our alerts go into specific slack rooms per environment. Major alerts are also routed to an internal call out API.
Logs are all sent via a log shipper on every node to a AWS’s hosted ElasticSearch. The log shipper was built by us to handle some specific needs, and runs as a daemonset. It let’s teams create custom log fields via json log messages. It also has sophisticated metrics so we can monitor logs backing up and detect if a pod starts logging abnormally, which can threaten the stability of the ES cluster.
Poor Man’s Chaos Monkey
Early on in development, we decided to turn on auto package OS updates with automatic reboots as a simple mechanism to keep the system up to date. We set a random hour and minute to do the reboot via Ansible, so every node reboots at a random time.
While we originally intended to remove this mechanism for production, it has worked so well for us that we’ve kept it. It forces us to make node reboots non disruptive, from the OS level all the way up to the application. In effect, we get the benefits of a chaos monkey with just a few lines of apt configuration. And we benefit from up to date OS packages without needing a complex roll out strategy.
There is a chance of multiple nodes rebooting at the same time. This is the birthday problem. As the number of nodes approaches the number of hour/minute combinations (24*60 =1440), the chance of multiple rebooting at the same time quickly approaches 100%. In practice though, this is rarely a problem for us since we don’t have anywhere near 1000 nodes – and simply forces us to make things HA for the rare times it happens. If our cluster was much larger we’d have to create a more sophisticated system.
We primarily use Cassandra for our applications’ persistent state. We explicitly made the decision to not put Cassandra in Docker or k8s when we started, due to the instability of Docker and the lack of features in k8s to support a database. Currently there is a lot of work on StatefulSets, which are intended for databases. I imagine it won’t be long before it makes a lot of sense to put our database inside of k8s.
As with any technology there are inevitably challenges in adopting it. Here are some of the highlights, some of which were mentioned briefly already.
Every major release seems to introduce new regressions. With the right magical incantation of various pieces of software, we were finally able to get a stable docker setup.
We started on Centos with device mapper and direct LVM on an EBS. Our nodes would frequently hard hang – locking up the kernel completely.
We switched from EBS to ephemeral storage, and this helped dramatically for some reason. Still not perfect though – we experienced occasional hard lock ups.
After a long period of dealing with the instability and trying to fix it, we finally gave up and settled on the most widely used production combination – Ubuntu + Docker + AUFS. The difference in stability was remarkable. We even switched back to EBS storage so we could use the newer EC2 instance types. Everything has been swell, apart from the Docker memory leak previously mentioned.
We originally started on Centos, which was great in general. We only had a couple issues:
- Occasional hard locks on mounted ext4 file systems on EBS – xfs had no problem though. Not sure why – this seemed Centos kernel specific, so was difficult to find any related bugs.
- Poor behaviour with transparent huge pages which caused a lot of grief – much worse than THP on other OSs, for some reason. Was rectified by disabling THP though.
I’m also not too comfortable with the usage of a frankenkernel on Centos. It means bugs very specific to a distro, which hurts the ability to Google-Fu issues. Combined with the fact we wanted to use AUFS and more recent network driver, we decided to swap over to Ubuntu.
We chose nginx because the only contributed ingress controller at the time also used nginx. So it gave us a good base to start from. In retrospect, nginx has an issue with reloads that makes it not ideal for usage as an dynamically updating ingress proxy.
Whenever a user updates an ingress or service, we need to reload the nginx configuration. If we wanted to use endpoints directly over service IPs, this reloading becomes even more dynamic.
Unfortunately, nginx behaves in a rather hostile manner to clients on a reload. It drops all client connections after in flight requests are processed. It’s to spec, but it’s not ideal if you want zero downtime deployments. Dropping the connection will inevitably cause client errors.
In contrast, haproxy handles this in a nice way (with the latest haproxy taking care of the one outstanding edge case). When reloading, it will send a
connection: close header to all clients on the response, and then wait for all client connections to close. This works well in practice. As long as requests are coming in, all client connections to the old instance get closed very quickly. This would have let us easily have zero downtime updates.
To mitigate this, we limit the frequency of nginx reloads to once per 5 minutes. Looking long term, we may move to an haproxy solution.
We swapped from ELBs to ALBs because it’s behaviour mitigated most of the problems of the nginx reloads. ALBs don’t keep a pool of connections like ELBs. Instead, an ALB seems to aggressively close connections shortly after using them, while still reusing them for bursts of requests. That means when nginx reloads and drops all client connections, generally ALBs won’t have any outstanding connections anyways (unless under load). Whereas an ELB will keep around a large pool of connections even when not under load, so a reload is much more disruptive.
Unfortunately, we ran into another problem with ALBs. When doing a deployment of our nginx controller, we would drain off connections to guarantee no new requests would go to the old controller before detaching it. This works perfectly fine in an ELB. In an ALB, it doesn’t work – new requests still get sent to the old controllers. We worked with AWS to get to the bottom of it, and they said it should get fixed in a future ALB update. For now, we just swapped back to ELBs until this is solved.
JVM in a container
Most of our applications are JVM based. Running the hotspot JVM out of the box in a container is problematic. All the defaults, such as number of GC threads, sizing of memory pools, and so forth use the host system’s resources as a basis. This is not ideal since the container is usually far more constrained than the host. In our case, we use m4.4xlarge nodes which have 16 cores and 64Gi of ram, while running pods that are usually limited to a couple cores and couple Gi of ram.
It took us a while to figure out the right balance of options to get the JVM to behave well:
-Xmxto roughly half the size of the memory limits of the pod. There is a lot of memory overhead in hotspot. This value will depend on the application as well, so it takes some work to find the right amount of max heap to prevent an oom kill.
-XX:+AlwaysPreTouchso we can more quickly ensure our max heap setting is reasonable. Otherwise we might only get oom killed under load.
- Never limit CPU on a JVM container, unless you really don’t care about long pause times. The CPU limit will cause hard throttling of your application whenever garbage collections happen. The right choice almost always is to use burstable CPU. This let’s the scheduler soft clamp the application if it uses too much CPU in relation to other applications running on the node, while also letting it use as much CPU as available on the machine.
Layer 4 connectivity between pods
Most of the applications prior to k8s use relied on external load balancers to proxy client connections. This meant applications could be sloppy with regards to graceful termination, since they could rely on the load balancer to gracefully drain off connections. Most Java web frameworks do not shutdown in a non disruptive way. Instead, they do similar to nginx where in flight requests are processed (although even that is/was bugged in many frameworks) and then client connections are dropped without any further ado.
Once we moved these applications inside of k8s they became responsible for their own graceful termination. The solution, after much trial and error, was to add a filter with a conditional that gets activated at shutdown. When activated, it adds a
connection: close header to all responses, and delays shutdown for a drain duration. As long as a request comes in during the drain time, the client connection gets closed due to the header. This was relatively straightforward to implement, although the implementation varied based on the framework used. This solved all of our issues with errors on application deployment, giving us zero downtime application deployments.
Other alternatives include using things like linkerd or istio to proxy all connections between pods, much like having a load balancer between each pod. We experimented with these, but none were as simple or reliable as simply having the application handle shutdown properly. Using an alternative RPC mechanism instead of http/1.1 would probably take care of this as well.
If you are planning on deploying Kubernetes widely in a production environment, you should probably expect to have to dig into the source code at some point and fix bugs. Things are much better now than 1.0 days, but k8s is still undergoing heavy feature development and regressions happen. Here is a summary of the some of the issues we had to debug and submit PRs for:
- Deployment replica set hash used ADLER32 which caused collissions in as few as 100 deployments (very easy to hit if you’re doing a deployment on every build). This led to inconsistent deployment states.
- Many EBS and volume manager bugs. K8s allows you to attach EBS’s by volume id to your pods, which we use for things like Prometheus. The stability of this feature hasn’t been great, and been one of the long standing sore points of k8s for us. Volumes would fail to attach or detach, leading to outages of affected components. It took about 4 versions (1.2-1.6) and many PRs to get some stability on this. We still don’t trust it fully, and won’t let critical systems mount external storage.
- Endpoint updates, the association between a service IP and the pod IPs, could take 10+ seconds to update. Obviously this is not ideal – requests could go to old pods and fail during deployments.
- Memory leak in the apiserver due to a wasteful cache that stored large amounts of text JSON in memory – gigabytes worth, with low cache efficiency. Mostly solved now by the move to protocol buffers.
Fortunately, k8s is extremely responsive to PRs. Once we got to the bottom of the problem, we could get a PR merged within a couple weeks. It’s a well run project and very pleasant to contribute to.
Log shipping with fluentd
We use fluentd for log shipping from each node. We had to build something custom so we could allow teams to specify their own fields. It consumes json messages from the docker log files, and then parses it for expected fields as well as any custom defined fields. These get shipped to an AWS hosted ElasticSearch cluster.
This works pretty well, but we’ve found fluentd uses quite a lot of memory and cpu. It’s not too bad on large nodes, but we need to dedicate 512-1Gi of ram and 1-2 CPUs to process logs with it.
Prometheus with Grafana for dashboards is the possibly the best metric and analysis tool I’ve used. It has some quirks though. Most of the metric functions do extrapolation, which just doesn’t work well with certain types of things (like counting the exact number of errors in a time period). There are ways of getting exact amounts though, such as using
offset but it took us a while to figure out these intricacies of the query language. This has caused a lot of confusion among users, especially with people familiar with the graphite/grafana setup which provides precise functions for viewing data.
The other problem we had was exorbitant memory usage. None of the tunables seem to help here – memory usage almost seems unbounded, I suspect as a combination of fragmentation in the golang heap and time series being loaded in memory from queries. With about 1.5 million time series total we need around 40-50Gi of ram and 6-8 CPUs. In earlier versions memory would just increase until it got OOMKilled, but this has improved in later versions. There will also be a rewrite of the storage engine in 2.0 which should help. We found disabling transparent huge pages also helped substantially (~30% drop in memory usage).