Hardening Spot Ocean: Building Custom AMIs

September 9, 2022

By Scott Stanford, Sr. Automation Engineer

Security is top of mind for IT professionals around the world. Potential new threats pop up seemingly every day, which means that all NetApp IT systems must meet the highest security standards. 

During our Spot Ocean journey, the AWS-published Amazon Machine Images (AMIs) for Elastic Kubernetes Service (EKS) worker nodes were flagged by our container scanning tools. This isn’t unusual – we scan for common vulnerabilities and exposures (CVEs) frequently. What was surprising, however, was that EKS worker nodes fall on the customer side of the shared responsibility model. This meant that any security hardening or updates had to come through customer-provided AMIs.

Problem solving

We had another issue: there were very few resources on building custom AMIs for EKS worker nodes. After some searching, one of the AWS Labs GitHub repos turned out to be our answer. It provides an unsupported framework for creating EKS worker AMIs, leveraging HashiCorp Packer and several custom scripts. Now that we could create an AMI usable by EKS workers, half of our problem was solved.
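The framework boils down to a Packer template plus provisioning scripts: start from the AWS-published EKS-optimized image and layer changes on top. A minimal sketch of that idea in Packer's HCL syntax (the actual AWS Labs templates are more involved; the region, version, and script names here are illustrative):

```hcl
source "amazon-ebs" "eks_worker" {
  region        = "us-east-1"
  instance_type = "m5.large"
  ami_name      = "custom-eks-worker-1.23-{{timestamp}}"

  # Start from the AWS-published EKS-optimized AMI for the target version.
  source_ami_filter {
    filters = {
      name = "amazon-eks-node-1.23-v*"
    }
    owners      = ["602401143452"] # Amazon's EKS AMI account in most regions
    most_recent = true
  }
  ssh_username = "ec2-user"
}

build {
  sources = ["source.amazon-ebs.eks_worker"]

  # Provisioning scripts run inside the build instance before it is
  # snapshotted into the new AMI.
  provisioner "shell" {
    scripts = ["scripts/custom-setup.sh"] # hypothetical script name
  }
}
```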

The second half was solved by building on those working custom AMIs. We reused some of the hardening scripts from our Linux systems and created new scripts to fill in the gaps. These were added to the Packer builds to produce an updated and hardened EKS worker AMI.
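As an illustration, one typical hardening step carried over from a Linux baseline – tightening sshd settings – might look like the sketch below. This is not NetApp's actual script, and for safety it runs here against a scratch copy of the config rather than the live `/etc/ssh/sshd_config`:

```shell
#!/usr/bin/env bash
# Sketch of a hardening step: enforce stricter sshd settings in a config file.
set -euo pipefail

harden_sshd() {
  local cfg="$1"
  # Replace each setting if present, otherwise append it.
  for kv in "PermitRootLogin no" "PasswordAuthentication no" "X11Forwarding no"; do
    key="${kv%% *}"
    if grep -q "^${key}" "$cfg"; then
      sed -i "s/^${key}.*/${kv}/" "$cfg"
    else
      echo "$kv" >> "$cfg"
    fi
  done
}

# Demo against a scratch copy of a config file.
cfg="$(mktemp)"
printf 'PermitRootLogin yes\nX11Forwarding yes\n' > "$cfg"
harden_sshd "$cfg"
cat "$cfg"
```

In the real build, scripts like this are listed in the Packer `shell` provisioner so they run inside the build instance before the AMI snapshot is taken.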

With the hardened AMIs in hand, the next hurdle was how to leverage automation to reduce the time needed to update the Ocean clusters. First, the AMI build process was moved into a Docker container. The container was a shell that dynamically pulled in everything needed from Git repos, allowing changes to be made outside of rebuilding the Docker image. 
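Because the container is only a shell, its entrypoint can stay generic along these lines (paths, repo URLs, and variable names are assumptions, not our exact layout):

```shell
#!/usr/bin/env bash
# Illustrative entrypoint for the AMI-builder container.
set -euo pipefail

# Build settings (e.g. target Kubernetes version) come from a mounted
# ConfigMap file, so they can change without rebuilding the image.
CONFIG="${CONFIG:-/config/build.env}"
if [ -f "$CONFIG" ]; then
  # shellcheck disable=SC1090
  . "$CONFIG"
fi
K8S_VERSION="${K8S_VERSION:-1.23}"

echo "building EKS worker AMI for Kubernetes ${K8S_VERSION}"

# Everything else is pulled in fresh at run time rather than baked in:
# git clone https://github.com/awslabs/amazon-eks-ami.git
# git clone <internal hardening-scripts repo>
# ...then invoke the Packer build with the cloned templates and scripts.
```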

A Kubernetes CronJob was used to launch the container on a specific day of every month, encrypt the EBS volumes, and deploy to multiple regions. Any settings that needed to change, such as the target Kubernetes version, were defined through a ConfigMap.
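In Kubernetes terms, that setup might look roughly like this (the names, schedule, image, and ConfigMap keys are all illustrative):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ami-builder
spec:
  schedule: "0 6 9 * *"   # 06:00 UTC on the 9th of every month
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: builder
              image: registry.example.com/ami-builder:latest
              envFrom:
                - configMapRef:
                    name: ami-builder-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ami-builder-config
data:
  K8S_VERSION: "1.23"
  TARGET_REGIONS: "us-east-1,us-west-2"
  ENCRYPT_EBS: "true"
```

Changing the ConfigMap is then enough to retarget the next monthly build – no image rebuild or CronJob edit required.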

The last step was updating the clusters – or rather the Virtual Node Groups (VNGs) – that we managed through Terraform Cloud. Two workflows were used.

For the POC clusters, the process was fully automated. For the production clusters, two manual steps were needed. In both cases, the AMI referenced in the Terraform workspaces was updated to the new AMI in the relevant region through a separate Kubernetes CronJob.
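The AMI swap itself amounts to a small Terraform Cloud API call that repoints a workspace variable at the new image. A hedged sketch – the variable ID, token, and AMI ID below are placeholders, and the live call is left commented out:

```shell
#!/usr/bin/env bash
# Illustrative: update a Terraform Cloud workspace variable to a new AMI ID.
set -euo pipefail

NEW_AMI="ami-0123456789abcdef0"  # placeholder
VAR_ID="var-example"             # placeholder

# JSON:API payload for the Terraform Cloud variables endpoint.
patch_payload=$(cat <<EOF
{"data":{"type":"vars","id":"${VAR_ID}","attributes":{"value":"${NEW_AMI}"}}}
EOF
)
echo "$patch_payload"

# curl -s -X PATCH \
#      -H "Authorization: Bearer $TFC_TOKEN" \
#      -H "Content-Type: application/vnd.api+json" \
#      -d "$patch_payload" \
#      "https://app.terraform.io/api/v2/vars/${VAR_ID}"
```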

With the POC clusters, the Terraform workspace was planned and applied through the Terraform Cloud API, and the Ocean clusters were rolled through the Spot Ocean API. The POC clusters are updated approximately a week and a half before the production clusters, allowing time to find any issues. Updating the production clusters requires a manual apply in Terraform Cloud, with the clusters rolled through the Spot console.
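For the POC flow, both the plan/apply and the cluster roll are plain REST calls. The sketch below uses placeholder IDs and keeps the network calls commented; endpoints follow the Terraform Cloud and Spot Ocean API documentation, but details may differ from our exact jobs:

```shell
#!/usr/bin/env bash
# Illustrative: queue an auto-applying Terraform Cloud run, then roll Ocean.
set -euo pipefail

WORKSPACE_ID="ws-example"     # placeholder Terraform Cloud workspace ID
OCEAN_CLUSTER_ID="o-example"  # placeholder Spot Ocean cluster ID

# Payload for a Terraform Cloud run that plans and auto-applies the workspace.
run_payload=$(cat <<EOF
{"data":{"type":"runs","attributes":{"message":"AMI refresh","auto-apply":true},"relationships":{"workspace":{"data":{"type":"workspaces","id":"${WORKSPACE_ID}"}}}}}
EOF
)
echo "$run_payload"

# curl -s -H "Authorization: Bearer $TFC_TOKEN" \
#      -H "Content-Type: application/vnd.api+json" \
#      -d "$run_payload" https://app.terraform.io/api/v2/runs

# After the apply succeeds, trigger a rolling replacement of the worker
# nodes through the Spot Ocean API:
# curl -s -X POST -H "Authorization: Bearer $SPOT_TOKEN" \
#      -d '{"roll":{"batchSizePercentage":20}}' \
#      "https://api.spotinst.io/ocean/aws/k8s/cluster/${OCEAN_CLUSTER_ID}/roll"
```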

Automating the POC cluster updates made it easier for the team that owned the clusters to test and validate changes, and allowed access for the separate teams that perform patching.