NetApp IT is Optimizing Container Costs with Spot Ocean
By Scott Stanford
It seems like everyone talks about how cloud solutions reduce costs, increase efficiency, and make automation possible. However, when you drill into the available solutions, you find that many of them require significant ramp-up time or startup costs to achieve results. It makes you wonder about the real ROI.
When NetApp IT’s automation and monitoring tools services team began exploring solutions for expanding into hyperscaler environments, we knew that we needed something that would be optimal for running in AWS and could deliver a similar user experience to what we have on our premises. However, we also wanted a solution that was cost effective and easy to automate. We were able to check all the boxes using Spot Ocean by NetApp® with Amazon Elastic Kubernetes Service.
As part of our cloud journey in AWS, NetApp IT started looking at solutions that don’t require management of Amazon Elastic Compute Cloud instances (EC2s). Like most companies expanding to the cloud, our first iteration was to use EC2s to run Kubernetes clusters matching what we have on our premises.
Spot Ocean by NetApp is an application-scaling service that manages the worker nodes that are running as Spot instances. AWS manages the control plane for the cluster and Spot Ocean manages everything else. By default, Ocean is configured to look at all instance types that are available in the region being used to determine which would be the most cost effective. It also right-sizes the worker nodes, adjusting their instance types and sizes to match what is actually being consumed in the cluster. For example, if the cluster has pods using 51 CPU cores, Ocean adjusts the number and types of the worker nodes so that total capacity is as close as possible to those 51 cores.
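To make the right-sizing idea concrete, here is a minimal Python sketch. It is not how Ocean is implemented. It is a simple greedy approximation, and the instance names, vCPU counts, and hourly prices in the catalog are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str            # hypothetical instance name, not a real AWS type
    vcpus: int
    hourly_price: float  # hypothetical Spot price in USD


# Hypothetical catalog; real Spot prices vary by region, AZ, and time.
CATALOG = [
    InstanceType("small-4", 4, 0.05),
    InstanceType("medium-8", 8, 0.09),
    InstanceType("large-16", 16, 0.16),
]


def plan_nodes(required_vcpus: int, catalog: list[InstanceType]) -> dict[str, int]:
    """Greedy sketch: fill with the cheapest-per-vCPU type, then cover
    the remainder with the cheapest single instance that is big enough."""
    best = min(catalog, key=lambda t: t.hourly_price / t.vcpus)
    full, remainder = divmod(required_vcpus, best.vcpus)
    plan: dict[str, int] = {}
    if full:
        plan[best.name] = full
    if remainder:
        # The remainder is always smaller than best.vcpus, so at least
        # one candidate (best itself) is guaranteed to exist.
        candidates = [t for t in catalog if t.vcpus >= remainder]
        filler = min(candidates, key=lambda t: t.hourly_price)
        plan[filler.name] = plan.get(filler.name, 0) + 1
    return plan


def total_vcpus(plan: dict[str, int], catalog: list[InstanceType]) -> int:
    """Sum the vCPUs provided by a plan."""
    sizes = {t.name: t.vcpus for t in catalog}
    return sum(sizes[name] * count for name, count in plan.items())


plan = plan_nodes(51, CATALOG)
print(plan, total_vcpus(plan, CATALOG))
```

With this made-up catalog, the 51-core example lands on three large nodes plus one small node, for 52 vCPUs in total. A real scheduler also has to weigh memory, pod placement constraints, and Spot availability, which this sketch ignores.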
For a full demo of the platform, check out the webinar I did recently.
We have two clusters that run 24/7: stage and production. Once these are set up, the only time that anyone needs to touch them is to upgrade the version of the cluster. The upgrades are also simple enough that they don’t need to be automated. After the EKS upgrade is complete, all we need to do is update the Amazon Machine Image (AMI) used by Ocean and kick off a cluster roll. The Ocean cluster roll replaces all of the worker nodes at a pace specified by the user: it first adds capacity with the new AMI and then drains the workers being replaced.
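The ordering of a cluster roll (add new capacity first, then drain the old workers, one batch at a time) can be sketched as a small simulation. This is only an illustration of the sequence described above, not Ocean's implementation or API. The `batch_pct` parameter, the function names, and the worker and AMI identifiers are all assumptions made up for the example.

```python
import math


def roll_batches(workers: list[str], batch_pct: int) -> list[list[str]]:
    """Split workers into batches, each holding batch_pct percent of the
    fleet (rounded up, and never fewer than one worker per batch)."""
    size = max(1, math.ceil(len(workers) * batch_pct / 100))
    return [workers[i:i + size] for i in range(0, len(workers), size)]


def simulate_roll(workers: list[str], batch_pct: int, new_ami: str) -> list[tuple]:
    """Return the ordered steps of a roll: for every batch, launch
    replacements on the new AMI before draining the old workers."""
    steps: list[tuple] = []
    for batch in roll_batches(workers, batch_pct):
        steps.append(("launch", len(batch), new_ami))
        steps.append(("drain", tuple(batch)))
    return steps


# Hypothetical worker names and AMI ID, for illustration only.
for step in simulate_roll(["w1", "w2", "w3", "w4", "w5"], 40, "ami-new"):
    print(step)
```

A smaller batch percentage means a slower roll with less extra capacity running at any one time, while a larger one finishes sooner but temporarily runs more nodes.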
AWS Spot instances are often viewed as low-cost, easily disposable excess capacity. With only a 2-minute interruption notice, you can lose the compute as well as the EBS volumes attached to it. In NetApp IT’s use case, the storage is disposable because new workers are configured from the AMI and the user data. The preventive replacement feature of Spot.io replaces the workers with new instances when they are near the average replacement time for the instance type, size, and region. This allows the clusters to avoid spikes in capacity due to Spot replacements.
With multiple members on the team managing and testing with clusters, there are times when an isolated environment is needed. Typically, the EKS cluster would be created and then imported into Spot Ocean, which would then start creating worker nodes on Spot. To simplify this process, the Spot team forked the AWS eksctl codebase and added functionality to create the full deployment in one step, resulting in an EKS cluster managed by Spot Ocean. If a dedicated cluster is needed, we can just run the modified eksctl, do what we need to, and then delete the cluster with eksctl to prevent any unneeded charges.
What’s exciting about Spot Ocean is that we don’t need to build any automation or monitor the workers ourselves, because that is all handled by Ocean. It proved to be a turnkey solution that got us up and running quickly and that takes little effort to keep running.
So far, we have seen AWS compute savings between approximately 60% and 70% for our two 24/7 clusters and an average of 70% in our smaller dev clusters. (However, these numbers can fluctuate based on the Spot availability in the AZs we are using.) These savings allow us to keep our spend similar to what we’ve seen in the past with On-Demand instances while getting a lot more compute.
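As a rough back-of-the-envelope check on "a lot more compute": if Spot capacity costs a fraction `s` less than On-Demand, the same budget buys 1 / (1 - s) times as much compute. The short Python sketch below works that out for the savings range quoted above. It assumes the discount applies uniformly to all compute, which is a simplification.

```python
def compute_multiplier(savings_fraction: float) -> float:
    """How much more compute the same spend buys at a given savings rate.

    Assumes the discount applies uniformly to all compute (a simplification).
    """
    if not 0 <= savings_fraction < 1:
        raise ValueError("savings_fraction must be in the range [0, 1)")
    return 1 / (1 - savings_fraction)


for savings in (0.60, 0.70):
    multiplier = compute_multiplier(savings)
    print(f"{savings:.0%} savings -> {multiplier:.2f}x compute for the same spend")
```

At 60% to 70% savings, that works out to roughly 2.5 to 3.3 times the compute for the same spend.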
We’re excited about what the future holds for this model and the potential cost savings we can achieve with a large cluster. You can learn more about Spot Ocean by visiting the website.