March 15th, 2020
AWS CloudHSM provides a ready-made solution for managing your secrets and cryptographic operations within a dedicated Hardware Security Module (HSM), compatible with the relevant compliance standards. However, using CloudHSM with Kubernetes orchestration (such as AWS Elastic Kubernetes Service, EKS for short) is anything but well documented. The core problem is how the application should learn the locations of the CloudHSM devices and how to connect to them.
Disclaimer: I am neither for nor against CloudHSM as a product. CloudHSM is still a costly service with quite a cumbersome architecture. While it does what it promises, it is not a default go-to solution; rather, it is a solution you end up with when other possibilities are blocked.
CloudHSM devices can be clustered to produce a highly available setup. CloudHSM does not offer a single endpoint for a newly created cluster; instead, it somewhat obscurely relies on the client library to learn the available devices in the cluster via any single CloudHSM device. The CloudHSM devices in the cluster synchronize their data and the metadata about cluster members. While these synchronization and connectivity concerns are taken care of by the CloudHSM cluster and client libraries (e.g., the Cavium client library), we still need a way to provide the IP address of a single HSM device to our application.
CloudHSM is designed for a rather static runtime environment, such as good old long-lived virtual machines. In that setup the application on the virtual machine connects to a single HSM device and then learns the other devices it can fall back to in case the first device fails.
While applications running in pods on an EKS cluster can also be long-lived, we usually want to design the architecture so that it reacts to changes immediately and in an automated fashion. There are a few different cases to consider when running an application that uses CloudHSM in Kubernetes pods.
A new application pod is started, gets a preconfigured IP of a CloudHSM device, connects to it, and starts learning the other CloudHSM device IP addresses.
While the application is running, a CloudHSM device can experience a malfunction or a controlled takedown. If the client library has already learned of other CloudHSM devices, it can connect to the remaining available devices by itself.
Now the preconfigured IP of the CloudHSM device is no longer served by a functioning CloudHSM device, and the client library cannot connect to any CloudHSM device. There must be a way to learn the IP address of any single working CloudHSM device.
Allow the application pod to query the AWS API action cloudhsm:DescribeClusters to get the list of available CloudHSM device IP addresses. Use a Kubernetes init container to run the query in a simple container that has the awscli available, and pass the IP to the main container in the pod through an emptyDir volume, from which the application can read it.
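For illustration, the same lookup the awscli would do can be sketched as a small Go program run by the init container; the CLOUDHSM_CLUSTER_ID environment variable and the /hsm/ip path on the shared volume are naming assumptions for this sketch only.

```go
// Minimal sketch of an init container that looks up a CloudHSM device IP
// and writes it to a file on a shared emptyDir volume. CLOUDHSM_CLUSTER_ID
// and the /hsm/ip output path are made-up conventions for this example.
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudhsmv2"
)

func main() {
	clusterID := os.Getenv("CLOUDHSM_CLUSTER_ID")

	svc := cloudhsmv2.New(session.Must(session.NewSession()))
	out, err := svc.DescribeClusters(&cloudhsmv2.DescribeClustersInput{
		Filters: map[string][]*string{"clusterIds": {aws.String(clusterID)}},
	})
	if err != nil {
		log.Fatalf("DescribeClusters failed: %v", err)
	}

	// Pick the first ACTIVE HSM in the cluster; any single device is enough,
	// since the client library learns the rest of the cluster from it.
	for _, cluster := range out.Clusters {
		for _, hsm := range cluster.Hsms {
			if aws.StringValue(hsm.State) == "ACTIVE" && hsm.EniIp != nil {
				if err := os.WriteFile("/hsm/ip", []byte(*hsm.EniIp), 0o644); err != nil {
					log.Fatalf("writing IP file: %v", err)
				}
				fmt.Printf("wrote HSM IP %s\n", *hsm.EniIp)
				return
			}
		}
	}
	log.Fatal("no active CloudHSM device found")
}
```

The main container mounts the same emptyDir volume and reads the IP before the application connects to CloudHSM.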
Pros/Cons
+ Immediate and fresh information about CloudHSM devices
+ Everything happens within Kubernetes
- Requires init containers and extra volumes to pass the information, and adds complexity to the application itself, which has to do the heavy lifting
The application can use a DNS name to connect to any CloudHSM device IP. The DNS records can be kept up to date, e.g., with a Lambda function processing the stream of CloudTrail events and periodically polling the CloudHSM API for the current set of devices.
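As a rough sketch, and assuming the records live in a Route 53 hosted zone, such a Lambda function could look like the following; CLOUDHSM_CLUSTER_ID, HOSTED_ZONE_ID, and HSM_RECORD_NAME are placeholder environment variables for this example, not anything defined by AWS.

```go
// Hypothetical Lambda handler that refreshes a DNS A record with the current
// CloudHSM device IPs. The environment variable names are placeholders.
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/aws/aws-lambda-go/lambda"
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudhsmv2"
	"github.com/aws/aws-sdk-go/service/route53"
)

func handler(ctx context.Context) error {
	sess := session.Must(session.NewSession())

	// Fetch the current set of HSM IPs for the cluster.
	hsmSvc := cloudhsmv2.New(sess)
	out, err := hsmSvc.DescribeClusters(&cloudhsmv2.DescribeClustersInput{
		Filters: map[string][]*string{"clusterIds": {aws.String(os.Getenv("CLOUDHSM_CLUSTER_ID"))}},
	})
	if err != nil {
		return fmt.Errorf("DescribeClusters: %w", err)
	}

	var records []*route53.ResourceRecord
	for _, cluster := range out.Clusters {
		for _, hsm := range cluster.Hsms {
			if aws.StringValue(hsm.State) == "ACTIVE" && hsm.EniIp != nil {
				records = append(records, &route53.ResourceRecord{Value: hsm.EniIp})
			}
		}
	}
	if len(records) == 0 {
		return fmt.Errorf("no active CloudHSM devices found")
	}

	// UPSERT a single A record that resolves to all active HSM IPs.
	dnsSvc := route53.New(sess)
	_, err = dnsSvc.ChangeResourceRecordSets(&route53.ChangeResourceRecordSetsInput{
		HostedZoneId: aws.String(os.Getenv("HOSTED_ZONE_ID")),
		ChangeBatch: &route53.ChangeBatch{
			Changes: []*route53.Change{{
				Action: aws.String("UPSERT"),
				ResourceRecordSet: &route53.ResourceRecordSet{
					Name:            aws.String(os.Getenv("HSM_RECORD_NAME")),
					Type:            aws.String("A"),
					TTL:             aws.Int64(60),
					ResourceRecords: records,
				},
			}},
		},
	})
	return err
}

func main() {
	lambda.Start(handler)
}
```

The same handler can be triggered both on a schedule and from CloudTrail-driven events, so the record converges quickly after a CloudHSM device is added or removed.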
Pros/Cons
+ Requires no changes to the application except switching to a DNS name instead of an IP address
- Keeping the records up to date is now an external process; pods might get a stale IP address, leading to multiple wasted restarts
- Requires glue on the AWS side to maintain the DNS records, such as monitoring CloudTrail and triggering Lambda functions to run the updates
A Kubernetes operator deployment can poll the current CloudHSM devices in its reconciler loop and update a Kubernetes ConfigMap accordingly. The application pod can then consume the IP addresses from the ConfigMap as environment variables, which the application uses to connect to the CloudHSM devices.
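Stripped of the operator machinery, the reconciler idea boils down to something like the sketch below, written with plain client-go and the AWS SDK. The ConfigMap name cloudhsm-ips, the HSM_IP_ADDRESSES key, the default namespace, and the 60-second interval are arbitrary choices for this example and are not taken from the actual cloudhsm-operator.

```go
// Simplified sketch of the reconciler idea: periodically resolve the current
// CloudHSM device IPs and upsert them into a ConfigMap that application pods
// can consume as environment variables. Names and interval are arbitrary.
package main

import (
	"context"
	"log"
	"strings"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudhsmv2"
	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// hsmIPs lists the ENI IPs of all active HSM devices visible to the account.
func hsmIPs(svc *cloudhsmv2.CloudHSMV2) ([]string, error) {
	out, err := svc.DescribeClusters(&cloudhsmv2.DescribeClustersInput{})
	if err != nil {
		return nil, err
	}
	var ips []string
	for _, cluster := range out.Clusters {
		for _, hsm := range cluster.Hsms {
			if aws.StringValue(hsm.State) == "ACTIVE" && hsm.EniIp != nil {
				ips = append(ips, *hsm.EniIp)
			}
		}
	}
	return ips, nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)
	svc := cloudhsmv2.New(session.Must(session.NewSession()))
	ctx := context.Background()

	for {
		ips, err := hsmIPs(svc)
		if err != nil {
			log.Printf("describing clusters: %v", err)
		} else if len(ips) > 0 {
			cm := &corev1.ConfigMap{
				ObjectMeta: metav1.ObjectMeta{Name: "cloudhsm-ips", Namespace: "default"},
				Data:       map[string]string{"HSM_IP_ADDRESSES": strings.Join(ips, ",")},
			}
			cms := clientset.CoreV1().ConfigMaps("default")
			// Upsert: try to update, fall back to create if the ConfigMap is missing.
			_, uerr := cms.Update(ctx, cm, metav1.UpdateOptions{})
			if apierrors.IsNotFound(uerr) {
				_, uerr = cms.Create(ctx, cm, metav1.CreateOptions{})
			}
			if uerr != nil {
				log.Printf("upserting ConfigMap: %v", uerr)
			}
		}
		time.Sleep(60 * time.Second) // the "reconciler loop idle timeout"
	}
}
```

The application pod can then reference the ConfigMap through envFrom or a configMapKeyRef environment variable, so the only requirement on the application side is reading an environment variable.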
Pros/Cons
+ Requires no changes to the application
+ Everything happens within Kubernetes
- Updating the ConfigMap after the CloudHSM IPs change can take up to the reconciler loop's idle timeout, a configurable duration
- Requires an implementation
For those of us who see the last approach, a Kubernetes operator providing CloudHSM device IPs to the cluster, as the way to go, I have created the CloudHSM operator, available at https://github.com/hhamalai/cloudhsm-operator