NetApp on NetApp

Transcript: NetApp IT’s Automation Use Case for Trident, OpenShift, and Ansible

Automation & Containerized Apps Success with NetApp Trident and Red Hat OpenShift & Ansible

Leyddy Portas: Good afternoon, everyone, and welcome to this NetApp on NetApp INFORM session. I’m going to be your host today. My name is Leyddy Portas and I am a technical partner manager at NetApp. Just a quick note for those who are new to these NetApp INFORM webinars: we are hosting them every month, so if you haven’t received the invitations through me, please contact me or Elias. I’ll put the email in the chat now, and we’ll be happy to add you to our distribution list. Today, we have a lot of people on the call. So if you have any questions during the presentation or the demo, just feel free to post them in the chat or in the Q&A, and we’ll get back to you at the end of the session, just because we only have one presenter today. Today’s session is entitled NetApp IT Automation Use Case with Trident, OpenShift and Ansible. Presenting today, we have David Fox, who is a systems engineer at NetApp IT. David, thanks so much for joining us. I’ll hand it over to you.

Overview: Ansible and the NetApp IT Journey

David Fox: Great. Thank you. Today we’re going to talk a little bit about our journey with Ansible, OpenShift and Trident. We’ll start with Ansible since that’s what we’ve been using the longest. We started using Ansible back in 2014, primarily as a configuration management tool. Ansible is actually really adept at configuration management, and it’s also very strong with orchestration, as we’ll see later. We started out using Ansible in this configuration management role, and we had a really simple first use case of configuring 3,000-plus Unix machines to have an appropriate NTP client configuration. This was a project that, when I joined IT, had been ongoing in a manual fashion for several weeks. As things typically go when you’re doing it manually, the results were inconsistent. There was a lot of configuration drift and that kind of thing. So, we took this project that had been ongoing in a manual fashion for several weeks, and after a couple of days of coding playbooks and testing our automation, we were able to wrap it up within one week.

The gains were immediately evident. On top of the speed with which we were able to complete that, the results were utterly consistent. And on top of that, because Ansible really is good at desired state, we eliminated the configuration drift as well. With Ansible, we took a land-and-expand approach. We tested it out in this simple use case, it worked really well, and from there we grew. From 2014 to 2016, we grew a lot. We started enforcing SOX compliance and SOX settings. From there in 2017, we expanded even further: we started building machines with Ansible and adopting infrastructure-as-code practices. We started orchestrating not only across compute, but across things like networking, load balancer configuration, that kind of stuff. We also started getting our feet wet with auto response.

Think about those use cases where, if there’s a problem, you’ve got an operations team and they have a runbook: if they see this alert, go do these things. We were able to use Ansible to automate those things that would typically be done manually by an operations team. Then in 2019, we released Ansible modules for ONTAP, which allowed us to further expand in the orchestration realm and configure not only compute and networking, but ONTAP storage as well. And of course we get all the benefits of Ansible when using these ONTAP modules in our storage configuration. It’s very consistent and we’re able to rapidly move changes out into our production landscapes. That’s where we started out with Ansible, a little bit of our journey, how we got to where we’re at today. But during this time, the world was changing.

Three Parts of CloudOne with IT Governance: CloudOne Architecture

We had 3,000 servers and these servers all represented different applications. These applications were deployed manually by application teams. So, during this time there was a transition to containerizing applications and deploying those containers on really vanilla servers. We started leveraging the Kubernetes container technology around 2017, specifically the Kubernetes distribution called OpenShift, which is provided by Red Hat. In 2017, getting our feet wet with that, we learned a lot of lessons. And then in 2019, we took a step back, looked at all those lessons and said, “Hey, we can deliver not only the OpenShift platform, but really a suite of capabilities that are what our application teams need, all in one bundle.” We decided to wrap all of that up in one package and call it CloudOne. That package is essentially a ServiceNow front-end that allows our application teams to go and order an application stack of a certain t-shirt size.

So, like a large Spring Boot deployment or a medium MongoDB, and have that sample application delivered, set up with an Azure DevOps Git repository and a CI/CD pipeline. That’s all pointing to an OpenShift cluster where we’re running Trident. And we’re running two clusters, one in AWS for our sub-prod workloads and then an OpenShift cluster in our on-premises data center for production workloads. All of this is really tied together with Ansible. So Ansible is doing the delivery of all this stuff: Ansible is setting up the Azure DevOps Git repository, it’s setting up the pipelines, it’s creating the project in OpenShift, setting up all our RBAC in OpenShift, that kind of thing.

CloudOne Architecture: Day 1 – Create Application Environment

A little bit more about what that looks like. Again, this is the ServiceNow front-end. The user experience consists of a ServiceNow portal, where the user is presented with a service catalog. Again, think, “I want Nginx, large t-shirt size,” or “MongoDB, large t-shirt size.” They request that application set in ServiceNow. And when they click the complete order button, that kicks off a job in Ansible Tower. Ansible Tower, for those that don’t know, is a product by Red Hat that essentially provides an API interface for Ansible playbooks. And our playbooks do a bunch of things.

Again, we’re setting up a Docker registry of social locations. We’re delivering NetApp groups, which are effectively Active Directory groups that we use later on for our RBAC. We’re setting up OpenShift projects: we’re creating the project, and we’re creating the quota on top of the project. We’re doing all the Artifactory integration and that kind of stuff. Also we’re orchestrating across, again, Azure DevOps, setting up the Git repository. Setting up the hello world application stack inside of that repository, we’re setting up a CI/CD pipeline that allows them to take this hello world code and immediately start deploying it.
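To illustrate the step where the completed ServiceNow order kicks off a job in Ansible Tower, here is a minimal sketch of how such a launch request could be built against Tower's REST API. The Tower URL, job template ID, token, and the `extra_vars` names (`app_stack`, `tshirt_size`, `team`) are all hypothetical and were not shown in the session. The request is only constructed, not sent.

```python
import json
import urllib.request


def build_launch_request(tower_url, template_id, token, extra_vars):
    """Build (but do not send) a POST that launches a Tower job template."""
    url = f"{tower_url.rstrip('/')}/api/v2/job_templates/{template_id}/launch/"
    body = json.dumps({"extra_vars": extra_vars}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical order: a large t-shirt size Nginx stack for a team named "payments".
request = build_launch_request(
    "https://tower.example.com",  # assumed hostname
    template_id=42,               # assumed job template ID
    token="REDACTED",             # assumed OAuth2 token
    extra_vars={"app_stack": "nginx", "tshirt_size": "large", "team": "payments"},
)
print(request.full_url)
```

Passing the order details as `extra_vars` is one plausible way to let a single playbook serve every catalog item. The job template would need to be configured to accept variables at launch.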

CloudOne Architecture: Day 2 and Beyond

Again, what you end up with is this bundle where you can immediately get started. Thanks to what we’re able to do with Ansible and Trident, we’re able to get the users going in 30 minutes with a fully functional delivery pipeline where they can start doing stuff immediately. The way that they interface with our platform is really simply through the Git repository. They check code into their Azure DevOps Git repository. They kick off their CI/CD pipeline. First, it deploys to their development namespace in our OpenShift AWS cluster. Once they are happy with that, they can then graduate that code change through the landscapes, into staging, which runs in our private cloud cluster, and then finally into production. Of course, if they need to make any changes, such as needing a new persistent volume, they will put that persistent volume claim manifest in their Git repository in Azure DevOps. When they push it out via the pipeline, Trident’s going to make sure that that gets created quickly.
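As a sketch of what such a persistent volume claim manifest might contain, the snippet below builds one as a plain dictionary and prints it as JSON, which Kubernetes and OpenShift accept alongside YAML. The claim name, namespace, size, access mode, and storage class name (`trident-nas`) are assumptions for illustration and are not taken from the session.

```python
import json


def build_pvc_manifest(name, namespace, size_gi, storage_class):
    """Return a Kubernetes PersistentVolumeClaim manifest as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # ReadWriteMany is an assumption that suits NFS-style backends.
            "accessModes": ["ReadWriteMany"],
            "resources": {"requests": {"storage": f"{size_gi}Gi"}},
            # Hypothetical storage class that would be served by Trident.
            "storageClassName": storage_class,
        },
    }


manifest = build_pvc_manifest("app-data", "myapp-dev", 10, "trident-nas")
print(json.dumps(manifest, indent=2))
```

The developer only states a size and a storage class. Which backend actually serves the volume is decided by the provisioner behind that storage class.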

NetApp Cloud Volumes Service in AWS

Let’s talk a little bit about what these clusters look like at a little bit lower level. As I was saying earlier, we have two clusters in our CloudOne environment. One of the OpenShift clusters is in AWS. This is where our sub-prod, what we call workspaces, go. When developers are testing out the code, it lands here first. Fundamentally, the developers really don’t know the difference between applications running in the cloud in AWS and applications running on-prem. OpenShift and Trident abstract that away. In AWS, we’re using two storage providers as backends for Trident. One is Cloud Volumes Service, and the other one is NetApp NPS. In either event, again, the user really doesn’t know the difference. They create a persistent volume claim, and Trident creates the storage on the appropriate backend and presents the storage to OpenShift.

From that point, the application developer, the end user, they don’t need to know anything about that process. It just happens.

Private Cloud for CloudOne: Leveraging NetApp Products

And again, we’ve got our private cloud implementation of the same infrastructure. We’re running OpenShift, in this case on top of NetApp HCI, which is really a VMware appliance. We’ve got Trident installed here, and its backend is now All Flash FAS. Just like in the OpenShift AWS cluster, application developers have manifests. And one of those manifests is a persistent volume claim manifest. It says, “I need a disk of a certain size.” Trident’s going to take care to make sure that the disk is presented to the application set. And from that point, that application has what it needs to function correctly. Let me break for a second and see if there are any questions.

Q: From your perspective, does it make sense to use Terraform, maybe in combination with Ansible, for automation in the context of NetApp, or is Terraform not very useful in this context?

A: Terraform is actually a really awesome tool. We looked into Terraform, but at the time when we started our automation journey, a lot of the Terraform providers just weren’t there, some of which were critical. And on top of that, as we’ll see going forward, the OpenShift playbooks that we need to run come from Red Hat. So, at some level, it was always going to be at least a Terraform and Ansible discussion. But given that Ansible could do it, and that at the time Terraform couldn’t do what we needed it to do, we just stuck completely with Ansible.

Q: Since HCI is being retired, what is the new option?

A: Right now we’re looking at the FlexPod and that’s something probably best for a separate webinar, but yes, we’re looking at the FlexPod with the retirement of HCI.

Okay, cool. If there are no more questions, let’s go to the next slide.

AutoScale: Expand and contract VM inventory based on container usage

I talked a little bit about how we’re using Ansible in our Unix infrastructure for configuration management. I talked a bit about how we’re using Ansible in our CloudOne environment for service delivery. Now, let’s talk a little bit about how we’re using Ansible in CloudOne for orchestration and configuration management of a different type. One of the things that we need to make sure of in our CloudOne environment is that we have the appropriate footprint of an OpenShift cluster to meet the compute needs of our applications. When you set up an OpenShift cluster, there is a certain amount of CPU that is allocatable to CPU requests. And if you run out, applications stop deploying. You’ll not be able to schedule the application because there is not enough CPU allocatable to meet your requests. To avoid this problem while at the same time reducing our IT spend in AWS, we used Ansible to create an AutoScale routine.

AutoScale is essentially a scheduled task in Ansible Tower. What it does is look at the available CPUs in the cluster relative to the CPUs being requested by all the applications running on that cluster. If the requested CPU reaches a certain threshold relative to the available CPU that we have on our cluster, our AutoScale routine is going to create a new VM or AWS EC2 instance and add it to our existing OpenShift cluster. Then we’re going to make sure that the export policy being used by Trident includes the IPs for those new VMs or new AWS EC2 instances. Conversely, if we drop below a certain threshold, in our case 50%, then we’re going to remove virtual machines or AWS EC2 instances, again as part of trying to reduce overall IT spend. The goal here is really to find the Goldilocks zone, and AutoScale helps us do that. We don’t want to overprovision, but we want to make sure that we have the right amount of resources available at all times.
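The threshold logic described above can be sketched as a small function. This is an illustration under stated assumptions, not the playbook NetApp IT runs: the 50% scale-down threshold comes from the talk, while the 80% scale-up threshold and all function and variable names are hypothetical.

```python
def autoscale_decision(requested_cpu, available_cpu,
                       scale_up_at=0.80, scale_down_at=0.50):
    """Compare requested CPU to available CPU and flag a scaling action.

    scale_down_at=0.50 is the figure quoted in the session;
    scale_up_at=0.80 is an assumed value for illustration only.
    """
    if available_cpu <= 0:
        raise ValueError("available_cpu must be positive")
    utilization = requested_cpu / available_cpu
    return {
        "utilization": utilization,
        "scale_up": utilization >= scale_up_at,
        "scale_down": utilization < scale_down_at,
    }


# Between the two thresholds neither flag is set: the cluster is sized correctly.
print(autoscale_decision(requested_cpu=60, available_cpu=100))
# At or above the scale-up threshold, new VMs or EC2 instances would be added.
print(autoscale_decision(requested_cpu=90, available_cpu=100))
```

Keeping a gap between the two thresholds avoids adding a node on one scheduled run and removing it again on the next.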

CloudOne Use Case: Compute Autoscale
Overview of How NetApp IT delivers compute to OpenShift

Well, how do we do this? We’ve got a simple Ansible playbook. What it does is a Splunk search. It’ll look for the CPUs that are allocatable, and it’ll also look for the CPUs that are being requested. If we’re above the scale-up threshold or below the scale-down threshold, we’ll integrate with ServiceNow to either create or remove CMDB entries. We’ll create or destroy AWS EC2 instances or VMs on HCI. We’ll add or remove those instances from OpenShift. And then again, we’ll make sure the export policy is up to snuff relative to the VMs that we’re either creating or destroying. This is completely automated, no manual intervention required. Again, it’s running on a schedule.

Splunk Search to determine Autoscale condition

What does this look like from a Tower perspective? For those that haven’t seen Ansible Tower, this is what it looks like. What we’re looking at right now is what’s called an Ansible Tower workflow. A workflow is really just a collection of job templates. And a job template is really an Ansible Tower artifact that represents an Ansible playbook. This workflow that’s running right now is our AutoScale workflow. The first thing that happens is we’re going to update the project and update the inventory that our playbooks live on from the Git repository. And then the first playbook that runs after that is our AutoScale search. This is the playbook that does the Splunk search to see how much compute we have relative to how much compute is being consumed.

Once that finishes, we’ll run the next playbook, which is the scale-up playbook. Notice that on the left-hand side, we have some extra variables set. One is called scale-down and one is called scale-up. In this case, both are false. What that means is we’re appropriately sized and nothing has to happen.

Scaling out an application on OpenShift

That’s ultimately what we want to see. But how does that change if we go and add workload to our cluster? Let’s find out. So I’m going to go create a project from the boilerplate templates provided by OpenShift. Again, this isn’t how application teams in a CloudOne environment create applications. This is more to serve as an example. We’re going to create a CakePHP boilerplate application and take the defaults.

Now, that’s created. You see that the application’s building from source code. We can also see that Trident has already created storage for us. Now that the application is built, we can see that this particular pod is requesting half a CPU. That’s half a CPU per pod. Let’s scale up from one pod, which is a total of half a CPU requested for this application stack, to 50.

Now this application stack is requesting 25 CPUs, and what you’ll see in the events is that we’ve got a bunch of pods that can’t be scheduled because there’s insufficient CPU. And this is the problem that AutoScale is meant to avoid. Typically in our environment, we don’t see people radically scaling to this extent. So again, this is really just to illustrate the power of what we’re able to do with Ansible and the ONTAP modules.
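The arithmetic behind the demo is simple: 50 pods at half a CPU each request 25 CPUs in total. The sketch below works in millicores to stay in integers and estimates how many pods would run and how many would stay pending. The figure of 8 free CPUs is hypothetical, and the cluster is treated as a single pool, whereas the real scheduler places pods node by node.

```python
def schedulable_pods(replicas, cpu_request_millicores, free_millicores):
    """Return (running, pending) pod counts for a single pooled CPU budget."""
    if cpu_request_millicores <= 0:
        raise ValueError("cpu_request_millicores must be positive")
    fits = min(replicas, free_millicores // cpu_request_millicores)
    return fits, replicas - fits


replicas = 50
per_pod = 500  # half a CPU per pod, as shown in the demo
total_request = replicas * per_pod  # 25,000 millicores, i.e. 25 CPUs

# Hypothetical: only 8 CPUs are still unrequested in the cluster.
running, pending = schedulable_pods(replicas, per_pod, free_millicores=8_000)
print(total_request, running, pending)
```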

Machinations of scaling out compute resources

Now, let’s look at AutoScale again. Same workflow template as before. Once again, we’re going to kick off our AutoScale search. Then the next playbook runs to scale up or down as needed. And we’ll notice on the left-hand side that this time, the value for scale-up, instead of being false, is true. That means we need to create, in this case, AWS EC2 instances. The playbook starts to do that. First, we’re going to launch the instances. This process takes about 15 minutes, by the way; I’ve trimmed it down quite a bit for the sake of time, but you’ll still get a feel for what’s going on. After the instance is up, we’re going to do some bootstrapping to get the node ready for inclusion into the cluster: things like setting hostnames, installing Docker, and performing a full OS upgrade, so all the patches from Red Hat. Once that’s done, this playbook will kick off the Red Hat-supplied playbook to add a new node to the cluster.

Once that playbook is completed, we’re going to do some more housekeeping. We’re going to label these newly added nodes. We’re going to restore our logging agent. We’re going to add some iptables rules for SNMP monitoring. We’re going to set up SNMP itself. And then we’re going to request ServiceNow delivery… I’m sorry, discovery.

Then finally, the Trident playbook runs to make sure that the export policy that Trident is using includes our new nodes.
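The export policy step is a reconciliation: compare the client addresses currently allowed by the policy with the node addresses now in the cluster, then add or remove the difference. The sketch below shows only that comparison. It does not call ONTAP or any Ansible module, and the function name and IP addresses are hypothetical.

```python
def reconcile_export_rules(current_clients, node_ips):
    """Return (to_add, to_remove) so the export policy matches the node list."""
    current = set(current_clients)
    desired = set(node_ips)
    return sorted(desired - current), sorted(current - desired)


# Hypothetical state: 10.0.1.12 was just added, 10.0.1.11 was scaled away.
to_add, to_remove = reconcile_export_rules(
    current_clients=["10.0.1.10", "10.0.1.11"],
    node_ips=["10.0.1.10", "10.0.1.12"],
)
print(to_add, to_remove)
```

Because the result depends only on the current and desired sets, running the step again after it has been applied yields two empty lists, which is the desired-state behavior the talk attributes to Ansible.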

Impact to application

Now that that’s completed successfully, we can come back to our OpenShift dashboard, and you can see that there are several containers that are coming up. They’re not quite ready yet, but they’re becoming ready, whereas before they were stuck in pending. There are many more that are still stuck in pending. And AutoScale will keep on creating virtual machines in this very extreme use case until that condition is satisfied and all pods are running.

Key Takeaways

The key takeaways here: we found that Ansible is a very powerful configuration management and orchestration tool, especially using the NetApp modules. We’re able to bring a level of consistency and agility to our IT department that we just weren’t able to achieve when doing things manually. The same thing goes for Trident. It enables us to provision storage at the speed of business. No more waiting for manual storage requests that could take hours, or maybe days. Simply create a persistent volume claim in your OpenShift environment, Trident creates a volume and presents it to the application, end of story. We found that both tools have a very important role to play in our organization.

That’s the end of my presentation. I see there’s some more questions, let’s go through those.

Q: Are you using Spot on CloudOne?

A: Yes, we are. And our latest playbook creates all the Spot Elastigroups and everything for us. So yes, we’re able to leverage that in our OpenShift AWS environment.

Q: Any potential to use Spot in the AutoScale solution?

A: We’ve looked into that, and I guess you could say we actually do that. Again, we’re using our Ansible automation to create the Spot Elastigroups, but as far as Spot kicking off the AutoScale job, I’m not sure how that would work. I mean, it may be possible. We have looked into it. That said, I believe Spot Ocean can do that. But with this version of OpenShift that we’re running, which is OpenShift 3.11, I don’t believe that’s a realistic option. But in our next-generation environment, again, the one that I said earlier probably deserves its own webinar, we are using Spot Ocean. It’s a much better fit.

Q: Screenshots/videos looked like OpenShift 3. Is this use case based on OpenShift 3?

A: Yes, it is. We’re looking at migrating away from OpenShift 3. And this goes into, once again, that follow-up webinar that we probably need to have to discuss it.

I like that you put so much effort in automation on Ansible.

Yeah, we like it too. It’s really made a big difference within our IT organization. When I first started out, and that was in 2014, we were very much a traditional shop: doing things manually, having a project for really some of the smallest things. We were able to take those projects, the duration of which was measured in weeks or months, and shrink those down to days.

And we can do the same thing with storage now.

Q: Can you show me something of Trident?

A: I could, and I think I did, but that’s sort of the beauty of Trident: it does one thing and it does it well. That’s creating volumes to back persistent volume claims. There’s not really a lot to show. It’s the UNIX philosophy: do one thing, do it well. And that’s what Trident does. It provisions storage and presents it to your cluster quickly. So, we did show that, it just happened so quickly you didn’t see it.

Leyddy: Thanks, David. I don’t think there are any more questions in the chat, but we just wanted to thank you for this presentation. I’m going to give them a few minutes just in case another question pops up.

David: Sure.

Leyddy: I also wanted to ask everyone in the audience today which topics you are most interested in and what future topics you want to see in these sessions. It would help a lot if you could just pop them in the chat, and then we can look into it. That’s all I wanted to ask for. I think no one has asked any more questions, so I will just end this. Thanks so much, David, for this session. It was great. And thank you all for attending today. We’ll be sending the recording and the slide deck. And please don’t hesitate to contact me or Elias or Deanna if you have any questions. Thank you all. Have a great day. Thanks, David.

David: Thank you.