NetApp IT uses ONTAP FlexGroup to power and manage a 3.6PB Active IQ data lake

By Faisal Salam

NetApp Active IQ provides customers and partners with actionable intelligence about their NetApp environments through a dashboard that summarizes performance, availability, capacity forecasts, health, case histories, upgrade recommendations, and more. Every week the system generates about 100TB of data and 225 million files, and that volume keeps growing.

As the team responsible for IT operations and for meeting the storage requirements of these rapidly growing, almost insatiable data sets, we struggled on two fronts. First, as the Active IQ data lake grew, IT was constantly on the verge of missing the application-processing SLA targets agreed with the internal business team. It was nerve-wracking.

Second, we continually hit the capacity limits of the assigned NFS volumes. Every two to three weeks, new volumes had to be created and the application redirected to them. This drove 24.2 hours of change activity each month: the Command Center dealt with frequent alerts from exceeded thresholds, the storage team created new volumes, and the application developers updated more than 200 servers with the new volume information. It was a reactive, manual hot mess.

NetApp ONTAP FlexGroup Volumes

To address the Active IQ data ingestion challenge and its growing data lake, we implemented NetApp ONTAP FlexGroup volumes, which can scale up to 20PB of storage and 400 billion files. FlexGroup allowed us to present a single, scalable storage volume to the application while reducing overall data processing time on the application side by 15-20%.

We have also seen a 2x improvement in input/output operations per second (IOPS), 10% more throughput, and lower average latency. Today we easily meet our SLA targets with ample headroom.

By implementing FlexGroup, we have simplified operations and cut the tedious manual volume changes from once every two to three weeks to once every two years (based on projected data growth). This is possible because a FlexGroup volume can span multiple nodes and grow capacity non-disruptively while presenting a single namespace. Today, when we run low on space, we add more nodes and constituent volumes to the same FlexGroup volume, transparently to the application. We also benefit from all of the ONTAP storage efficiencies, such as deduplication, compaction, and compression.

Over the past few years, the cluster has grown as we continued to add nodes for capacity. The FlexGroup volume started at 600TB and is now 3.6PB. Over the same period, the number of constituent volumes increased from 25 to 120, and the cluster grew from 4 nodes at initial deployment to 16 nodes today.
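As a rough, back-of-the-envelope check of these growth figures, the Python sketch below divides the quoted capacity by the constituent and node counts. It assumes decimal units (1PB = 1,000TB) and an even spread of capacity across constituents and nodes, which is a simplification rather than a description of how ONTAP actually places data; the names `initial`, `current`, and `per_unit` are illustrative only.

```python
# Figures quoted in the post. Assumption: decimal units (1 PB = 1000 TB)
# and capacity spread evenly across constituents and nodes.
initial = {"capacity_tb": 600, "constituents": 25, "nodes": 4}
current = {"capacity_tb": 3600, "constituents": 120, "nodes": 16}


def per_unit(snapshot):
    """Return (average TB per constituent, constituents per node)."""
    return (
        snapshot["capacity_tb"] / snapshot["constituents"],
        snapshot["constituents"] / snapshot["nodes"],
    )


for label, snap in (("initial", initial), ("current", current)):
    tb_per_constituent, constituents_per_node = per_unit(snap)
    print(f"{label}: {tb_per_constituent:.1f} TB/constituent, "
          f"{constituents_per_node:.2f} constituents/node")

# Overall capacity growth factor between the two snapshots.
print(f"capacity growth: {current['capacity_tb'] / initial['capacity_tb']:.1f}x")
```

Under these assumptions, the average constituent grew from about 24TB to 30TB, while the number of constituents grew 4.8x, so most of the 6x capacity growth came from adding nodes and constituent volumes rather than from enlarging existing ones.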


Faisal Salam is a Senior Storage Engineer with NetApp IT, responsible for designing and deploying ONTAP systems.
