Transcript: Managing large scale storage upgrades across the NetApp enterprise

Managing large scale storage upgrades across the NetApp enterprise

Hello everyone. Thanks for joining this presentation. My name is Peter Han, Senior Technical Program Manager in NetApp IT. David Tanigawa, our technical leader, will join this presentation to cover the technical considerations.

Agenda

For today’s agenda: NetApp IT has been running the FAST program to upgrade and maintain the hardware platform and storage software. We want to share our experience and give a current status update for the European TPM teams. Dave will cover the technical considerations of this program, and then I will cover the NetApp life-cycle measurement dashboard, the challenges, and the lessons learned.

Corporate IT at NetApp

For an overview of corporate NetApp IT: we currently run 100 terabytes of storage with 450 storage controllers and 5,000 servers, supporting 375 enterprise applications. As of today, we have 124 SaaS applications running in our environment.

Established Program and Process

The first step was to establish the program and define the processes for upgrading and maintaining the hardware and OS levels of the NetApp storage solutions. The major goal of this program is to keep the operating environment consistent and stable, and to maintain a structured refresh cycle. By implementing new products and new code, we bring new features and bug fixes into our operating environment.

This program covers the NetApp products and has 10 tracks. One of the major tracks is the ONTAP 9 track. It also covers other technology products, including cluster switches, StorageGRID, HCI, E-Series systems, and more. Through this program, NetApp IT can improve performance, stay current on storage software and hardware updates, and create an enterprise showcase of NetApp products for our customers.

There are benefits to NetApp product development engineering as well. Because we are Customer-1, the early adopters of NetApp products, the product development team can validate product readiness before shipping to customers. They can also learn how their products behave under variable loads with live data sets from NetApp IT, discover new bugs and issues, and work together with the IT and engineering teams on early triage and resolution of bug fixes.

First Application Systems Testing (FAST) Program

There are two types of releases at NetApp: RC (release candidate) and GA (general availability). The FAST program is the process used by Customer-1 for upgrading and maintaining the OS level of NetApp storage solutions within a large corporate environment. In terms of the deployment process, we put the GA build in the IT lab for feature testing, do the TOIs, then move through the approved release cycle to deploy to the other environments.

Step #1 is the lab environment. We then wait for a week to check the stability of the new code before moving to the backup environment, because backup runs SnapMirror services and SnapMirror destination relationships have version requirements. We upgrade the entire backup environment before we move to subprod.

Once we have upgraded all the backups and waited for a week, we move to the subprod migration. The subprod environment includes the test and staging environments. Per our upgrade policy, we have to upgrade at least four clusters in the subprod environment before moving to production. Once we have upgraded the subprod clusters and watched them for two weeks, we move to the production environment. Production is our critical application environment, and we only deploy GA releases to production.
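The staged rollout described above can be sketched as a simple gating model. This is a hypothetical illustration only: the stage names and bake-in periods come from the process just described, while the data structures and function names are invented.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical model of the FAST deployment stages described above.
# Stage names and bake-in periods follow the transcript; everything
# else is illustrative.
@dataclass
class Stage:
    name: str
    bake_in_days: int      # how long we watch this stage before promoting
    ga_only: bool = False  # production accepts only GA releases

PIPELINE = [
    Stage("lab", 7),
    Stage("backup", 7),       # upgraded early: holds SnapMirror destinations
    Stage("subprod", 14),     # test + staging clusters, at least four upgraded
    Stage("production", 0, ga_only=True),
]

def next_stage(current: str, release_type: str) -> Optional[str]:
    """Return the next stage a release may be promoted to, or None."""
    names = [s.name for s in PIPELINE]
    i = names.index(current)
    if i + 1 >= len(PIPELINE):
        return None
    nxt = PIPELINE[i + 1]
    if nxt.ga_only and release_type != "GA":
        return None  # RC builds stop before production
    return nxt.name
```

For example, an RC build in subprod cannot be promoted further, while a GA build promotes from subprod to production.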

Deployment flow for corporate environment

Once the new code has been released and the lab upgrade is complete, along with its bake-in period, we move to the corporate upgrade environment. This is the process flow for the corporate environment for supported RC and GA releases. It then goes to the IT release process. For major releases, like an 8.0 to 9.0 upgrade, we usually run the IT certification process and have all of the application, database, and server environments certified against the new version.

Once that certification is completed, it goes through the IT release process. In the first upgrade, the entire backup environment is upgraded and monitored for one week. Then it moves into subprod for 15 days (around two weeks). For subprod, our requirement is to upgrade at least four clusters before moving into production (prod). Once the subprod migration and waiting period are completed, we start upgrading the major clusters to the new version.

Mitigate Risks

In the early stages of this program, it was very difficult to get approvals from the application owners for storage upgrades. At the beginning of the program, upgrades would cause major outages for some applications, because filer upgrades require takeover of data access. The infrastructure standard was not fully implemented, and some critical servers could cause a P1 outage. If there were issues during an upgrade, they sometimes impacted major application releases because of poor communication. There was no collaborative planning with application teams for the storage upgrades.

To mitigate these risks, the program team implemented collaborative planning with the corporate application owners. With proactive communications and upgrade pre-checks, we have been able to avoid technical issues during upgrades. Standard installations and approaches are also in place by the time of the upgrade. After the storage upgrade is completed, we perform active application and database validations, and monitor the stability of the HA pair before moving to the next upgrade.

The new process in NetApp ONTAP is very helpful and lowers risk, because the automated non-disruptive upgrade avoids the manual errors that can hinder an upgrade. A buddy checklist during the upgrade is helpful, and active post-upgrade validation is done on critical applications as part of the standard process used for the upgrade.

FAST Upgrade Schedule

Here is our standard FAST upgrade schedule for ONTAP 9. Usually it is a 4-week schedule. At Week-3, we do a complete upgrade pre-check and identify any technical issues preventing the upgrade. If there is a Severity-1 (i.e., P1) issue that would prevent the upgrade, we start work to address it.

At Week-2, there is an application leaders and owners meeting. The program team explains the benefits of the upgrade and the schedule. The application teams can then share their release schedules to avoid any conflicts between their releases and our storage upgrade.

If there are major application releases scheduled during our upgrade window, we change our schedule. We then send our first user notification and ask for feedback from all of the application teams hosted on the storage system. We review the application teams’ feedback to understand any remaining items, and then the project team makes a “Go” or “No-Go” decision for the upgrade.

At Week-0, we send out the second user notification and perform the upgrade. For the Tier-1 applications, we do active post-upgrade application validation right after the storage upgrade. The Tier-1 application teams join our call and perform the validation right after the upgrade on each node.

And at Day +1, we send out the final user notification about the status of the storage upgrade. If there are any follow-up issues from the application teams, we take action on them.
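The milestones above can be laid out programmatically relative to the upgrade day. This is a minimal sketch: the offsets and milestone labels follow the schedule just described, while the helper itself is hypothetical.

```python
from datetime import date, timedelta

# Hypothetical helper laying out the 4-week FAST schedule relative to
# the upgrade date. Offsets (in days) follow the description above.
MILESTONES = [
    (-21, "Upgrade pre-check; identify any blocking (P1) issues"),
    (-14, "Application leaders/owners meeting; first user notification"),
    (0,   "Second user notification; upgrade; Tier-1 post-validation"),
    (1,   "Final user notification; follow up on any issues"),
]

def fast_schedule(upgrade_day: date):
    """Return (date, milestone) pairs for one upgrade window."""
    return [(upgrade_day + timedelta(days=offset), what)
            for offset, what in MILESTONES]
```

For an upgrade on 2020-07-06, the pre-check would land on 2020-06-15 and the final notification on 2020-07-07.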

That is the overview of the FAST program and its processes. If you have any questions, let us know.

FAST Program – ONTAP Upgrades

Hello everyone, my name is David Tanigawa. I’m a senior storage engineer with the Customer-1 Solutions Group within NetApp IT. Today, I’m going to discuss some technical considerations that go into planning and preparing for an ONTAP upgrade.

Pre-upgrade Checks

On the first slide, you have an overview of our pre-upgrade checks. Our FAST upgrade process has evolved over time as we have refined it through our continuous improvement efforts. Over the course of the past several years, we’ve gained a more thorough understanding of the dependencies and considerations that go into preparing for an upgrade. Last year, we undertook a large effort to upgrade from ONTAP 9.1 to 9.6, and that effort really included the full gamut of considerations.

So, as we begin our pre-upgrade checks, we look at Upgrade Advisor, Active IQ, and Config Advisor, specifically for health issues or risks in our environment that we need to address before beginning an upgrade. We also look at our hardware. We want to make sure that our hardware is supported with our target ONTAP release, and that any hardware failures or issues in our environment get addressed before we start our upgrade as well.

We also look at the external tools and utilities that we use to help manage our storage environment, specifically software which runs outside of the cluster. We want to make sure those tools and utilities are also supported with our target ONTAP release. In the case of our SAN hosts, we want to make sure those are multipathed, so that when we do takeovers and givebacks as part of our upgrade process, those SAN hosts continue to have uninterrupted access to their LUNs.

In cases where we are using SnapMirror, we want to make sure that we’re running compatible ONTAP versions on the clusters with our source and destination volumes. Our cluster switches, and their NX-OS and RCF versions, also need to be supported by our target ONTAP version. We also review system, shelf, and disk firmware. And it’s a good idea to review the ONTAP 9 release notes to understand the changes being introduced with the ONTAP version you’re planning to upgrade to.
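The common thread in these checks is that every component, hardware, switch firmware, and off-box tooling alike, must appear in a support matrix for the target release. A minimal sketch of that idea follows; the component labels, the matrix contents, and the NX-OS version string are all invented for illustration.

```python
# Hypothetical pre-upgrade check: every component in the environment
# must be listed as supported for the target ONTAP release. The matrix
# below is invented; the real answer comes from tools like the
# Interoperability Matrix Tool and Upgrade Advisor.
SUPPORT_MATRIX = {
    "9.6": {
        "hardware/FAS8200",
        "switch/NX-OS-7.0(3)",   # invented version string
        "tool/SnapCenter",
    },
}

def unsupported_components(components, target_release):
    """Return the components not supported with the target release."""
    supported = SUPPORT_MATRIX.get(target_release, set())
    return sorted(c for c in components if c not in supported)
```

Anything this returns becomes a blocker to resolve (replace, upgrade, or evacuate) before the ONTAP upgrade proceeds.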

SnapMirror Data Protection relationships

On the next slide, we talk about data protection relationships, or SnapMirror DP relationships. These behave as they always have: clusters with SnapMirror destination volumes need to be running a version of ONTAP greater than or equal to the ONTAP version running on the clusters with the SnapMirror source volumes.
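The DP rule stated above is easy to express in code. This sketch checks it for simple major.minor version strings (patch-level strings like "9.1P3" are out of scope for this illustration):

```python
# The DP rule: the SnapMirror destination cluster must run an ONTAP
# version greater than or equal to the source cluster's version.
def parse_version(v: str) -> tuple:
    """'9.6' -> (9, 6). Simple major.minor strings only."""
    return tuple(int(part) for part in v.split("."))

def dp_relationship_ok(source_ontap: str, destination_ontap: str) -> bool:
    return parse_version(destination_ontap) >= parse_version(source_ontap)
```

This is also why the backup environment (which holds destination volumes) is upgraded before subprod and production in our deployment flow.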

SnapMirror Unified Replication relationships

On the next slide, we talk about SnapMirror XDP relationships, now known as Unified Replication relationships. As you can see in this table, there is more interoperability between ONTAP versions with XDP. However, there are some exceptions, so it’s helpful to refer to this table in planning your upgrade. Included on this slide is a link to the data protection guide, which contains this table and some additional information.

SnapMirror DP to XDP conversion

The next slide addresses some additional considerations for SnapMirror. As of ONTAP 9.3, XDP replaced DP as the default SnapMirror type. It is important to be aware that when you upgrade to 9.3 or later, any preexisting DP relationships aren’t automatically converted to XDP. Those relationships are still DP relationships and continue to function as they always have.

However, there are three reasons why we may want to consider converting those relationships to XDP. The first reason, as we discussed in the previous slide, is the improved interoperability with different ONTAP versions for XDP relationships.

Another reason why we might want to convert those relationships is BURT 1166096, which describes a condition in which clusters with SnapMirror DP destination volumes might panic or experience an unplanned takeover. That risk is effectively worked around by converting those relationships from DP to XDP.

And the third reason to consider converting those relationships is the storage efficiency limitations associated with DP relationships. With a DP relationship, the source volume and the destination volume need to be configured with identical storage efficiency configurations. That limitation is also effectively eliminated by converting those relationships to XDP.

On this slide I have provided a link to a helpful article which explains how to convert DP relationships to XDP. The important thing to be aware of when going through that process is to make sure that the source and destination volumes have a common snapshot. That way, when we break off the existing DP relationship and create a new XDP relationship in its place, we can simply resync the relationship instead of having to re-baseline it.
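The conversion flow just described can be sketched as a generator of the command sequence. The command strings only approximate ONTAP CLI syntax and the volume paths are hypothetical; the linked article remains the authoritative procedure.

```python
# Sketch of the DP-to-XDP conversion flow described above. Command
# strings approximate ONTAP CLI syntax; paths are hypothetical.
def dp_to_xdp_commands(src: str, dst: str):
    # Precondition: src and dst share a common snapshot, so the new
    # XDP relationship can be resynced rather than re-baselined.
    return [
        f"snapmirror quiesce -destination-path {dst}",
        f"snapmirror break -destination-path {dst}",
        f"snapmirror delete -destination-path {dst}",  # drop the DP relationship
        f"snapmirror create -source-path {src} -destination-path {dst} -type XDP",
        f"snapmirror resync -destination-path {dst}",  # resync, no re-baseline
    ]
```

The key design point is the last step: because a common snapshot survives the delete/create, `resync` only transfers the delta instead of a full baseline.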

End of Support (EOS) FAS Platforms

On the next slide, we talk about end-of-support FAS platforms. There are several product families of FAS platforms which are no longer supported as of ONTAP 9.2. In our case last year, as we were preparing to upgrade from 9.1 to 9.6, we had several filers belonging to these product families that we needed to evacuate. So, we developed a systematic approach to our node evacuation process, which is explained on the next slide.

End of Support (EOS) Hardware

Our node evacuation effort begins with cleanup. We want to start by identifying any orphaned cluster configurations or other storage resources which are no longer in use: anything that is inactive, anything that has been orphaned. This simplifies our migration process.

So we clean these up. One big consideration as part of that is unused volumes, so we take a look at a zero-IOPS report. What we found in our environment is that volumes with zero IOPS over a period of 90 or more days can safely be considered inactive, and we can start the decommissioning process on those volumes.

In our environment we also enable the volume recovery queue with a volume-delete retention of 360 hours. This gives us the opportunity to recover previously deleted volumes for 360 hours, or 15 days, following the deletion. If there ends up being no reason to recover a volume, then it is automatically and permanently deleted from the cluster.
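The cleanup policy above boils down to two thresholds: 90+ days of zero IOPS marks a volume inactive, and 360 hours of retention bounds the recovery window. A minimal sketch, with invented function names, purely to make the arithmetic concrete:

```python
from datetime import datetime, timedelta

# Sketch of the cleanup policy described above. Thresholds come from
# the transcript; the helper names are invented for illustration.
INACTIVE_AFTER_DAYS = 90   # zero IOPS for this long => candidate for decommission
RETENTION_HOURS = 360      # recovery-queue retention == 15 days

def is_inactive(zero_iops_days: int) -> bool:
    """True if a volume's zero-IOPS streak makes it a decommission candidate."""
    return zero_iops_days >= INACTIVE_AFTER_DAYS

def recoverable_until(deleted_at: datetime) -> datetime:
    """Deadline after which a deleted volume is purged from the recovery queue."""
    return deleted_at + timedelta(hours=RETENTION_HOURS)
```

A volume deleted on January 1 would remain recoverable until January 16, after which it is purged automatically.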

Once our cleanup effort is complete, we start migrating NAS data volumes. These are really the low-hanging fruit of our data volume migrations. They are simple to accomplish, and we can make a nice dent in our migration process by knocking them out. So we start here, and once our NAS data volumes have been migrated, we continue on with our SAN volumes.

With the SAN volumes, there are a few different considerations. In our environment we have several hosts with boot LUNs, which required downtime to update the storage path in the BIOS of those hosts, in order to point the configuration to the new LUN location on the new HA pair.

For data LUNs, we can migrate one path at a time by taking the LIF down on the old node, bringing it up on a port on the new node, making sure that our host has access through the new path, and then repeating the process with the redundant path.

Selective LUN mapping restricts LUN access to the HA pair which hosts the LUN. So, as we prepare to migrate a SAN volume from the current HA pair to a new HA pair, we want to update the LUN mapping configuration to include the destination HA pair as well as the source HA pair. After we migrate the volume, we can then remove the source HA pair from that mapping configuration. In cases where we have Fibre Channel LUNs, we also needed to update the zoning configuration to make sure those hosts have access to their LUNs on the new HA pair.
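The add-then-remove pattern above can be sketched as set arithmetic on the LUN's reporting nodes: add the destination HA pair before the move, drop the source HA pair after it. Node names here are invented.

```python
# Sketch of the Selective LUN Mapping flow described above: before the
# volume move, the destination HA pair is added to the LUN's reporting
# nodes; after the move, the source HA pair is removed. Names invented.
def migrate_reporting_nodes(reporting: set, source_ha: set, dest_ha: set):
    """Return the reporting-node sets during and after the volume move."""
    during = reporting | dest_ha   # both HA pairs can serve paths
    after = during - source_ha     # only the new HA pair remains
    return during, after
```

Keeping both HA pairs in the mapping during the move is what lets the host retain uninterrupted paths to the LUN throughout the migration.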

Other considerations related to the node evacuation process include migrating SVM root volumes, recreating LS mirror volumes, rehoming NAS data LIFs, and rehoming SVM management LIFs.

Cluster Switches

On the next slide, we have a matrix which shows the supported Cisco switches, NX-OS versions, and RCF versions for various versions of ONTAP. At the bottom of the slide is a link to the page which contains this matrix. And at the end of the slide deck, I have links to the pages where you can download the NX-OS software and the RCF configuration files. We want to make sure all of these components are supported with our target ONTAP release.

Storage management software: SnapDrive for Windows

On the next slide, we start to discuss our storage management software. Again, this is the software, the tools and utilities, that we run outside of the cluster to help manage our storage environment. In our environment with ONTAP 9.1, we were extensively using SnapDrive for Windows and SnapManager for SQL Server. As you can see from this screenshot from the Interoperability Matrix Tool, SnapDrive for Windows is no longer supported after ONTAP 9.3.

Storage management software: SnapManager for SQL Server

On the next slide, we can see that SnapManager for SQL Server requires SnapDrive for Windows. So, in planning for our upgrade to ONTAP 9.6, we needed to replace this solution for backing up our SQL Server data. What we ended up doing was replacing those products with SnapCenter, which is depicted on the next slide.

Storage management software: SnapCenter

SnapCenter provides application-consistent data protection for ONTAP-hosted data anywhere in the hybrid cloud. The architecture consists of a centralized SnapCenter server, and then a plugin installed on each of the hosts that we want to back up. In our environment, because we are using this product to replace SnapDrive and SnapManager for SQL Server, we are primarily using it to back up our SQL Server data.

However, the solution also provides plugins for other types of data, including Exchange Server, Windows, Oracle, SAP HANA, and VMware vSphere, and there is also a custom plugin which can be configured to back up other application data types.

ONTAP Upgrade Considerations

So, in summary, as we prepare to upgrade ONTAP, there are several considerations which are part of the process. We want to make sure our hardware is supported with our target ONTAP release, and that any hardware failures or issues have been addressed and resolved.

In cases where we have SnapMirror configured, we want to make sure that we are running compatible ONTAP versions on the source and destination clusters. We want to make sure our cluster switches, and their NX-OS and RCF versions, are supported with our target ONTAP release. We want to review our monitoring tools to make sure we identify and address any health issues or risks in our environment.

In cases where we are migrating 7-Mode data, we want to make sure that our target ONTAP version supports TDP SnapMirror. And for any external, off-box storage management software that we may be using, we want to make sure it is also compatible with our target ONTAP release.

Additional Resources

I have links to some other resources which may be helpful to you in planning for your ONTAP upgrades, including the Interoperability Matrix Tool, the availability report, a couple of firmware download pages, and some software download pages. And with that, I am happy to take any questions that you may have.

Challenges encountered: ONTAP 9.1 to 9.6/9.7

One of the most recent challenges with our program, and I heard that many European customers are facing the same challenge, is the ONTAP 9.1 to 9.6 upgrade, because upgrading from 9.1 to 9.6 requires the end-of-support nodes to be removed from the cluster. This means the data on the end-of-support nodes has to be migrated to supported nodes, and data migrations involving boot LUN moves in particular require downtime. The project team has to work closely with the application teams to schedule that downtime.

The other challenge was the SnapManager migration. For SQL backups, SnapManager is supported only up to ONTAP 9.3, so all of the SnapManager backups had to be migrated to SnapCenter. We migrated the SnapManager servers, around a hundred of them, to SnapCenter. We had two projects to drive these requirements, and we successfully completed both. Now all the clusters are either already migrated to ONTAP 9.6 or ready to migrate. The next page shows the current status of our ONTAP fleet.

Dashboard for ONTAP

Our entire ONTAP storage fleet supports the entire corporate application portfolio. We have 39 ONTAP 9 clusters with 172 cluster nodes. In May of last year, 46% of the clusters were running 9.1 and 37% were running 9.5 or 9.6. Twelve months later, as of last month, the 9.1 clusters represent 21%, while 9.5, 9.6, and 9.7 together represent 72%. For the remaining 9.1 clusters, all data migration and SnapCenter migration is completed, so they are ready for the 9.7 upgrade. We are currently running the 9.7 upgrade cycle, and our target is to have the entire ONTAP fleet running 9.7 by the end of July.

Lessons Learned

If you look at the next slide, what we have learned from this program is that in order to be successful, and to have no impact on the applications during ONTAP upgrades, infrastructure standards should be implemented. An accurate “Upgrade Pre-Check” is also critical for successful upgrades. Further, data migrations, whether for 7-Mode migrations or for node evacuations to remove end-of-support nodes, are very important. Working closely with the application teams and infrastructure teams, and maintaining the users’ trust in the project team, is also very critical.

As we are migrating thousands of volumes and hundreds of applications in our data migration process, it is very important to have a very organized approach and an efficient process. For some applications, like Tier-1 applications and those with high impact on the business, it is very difficult to schedule downtime; the application downtime has a direct impact on the business. We developed Ansible automation playbooks to eliminate downtime during the cutover, and that was very successful. In summary, with these lessons learned from the program, we continue to improve our processes in order to maintain a robust storage fleet and data infrastructure.

Thank you all for attending today. Be sure to check out www.NetAppIT.com for our collection of use cases leveraging NetApp technology. Also check out the webinar page to find on-demand recordings from our previous webinars as well as details on our next webinar event.