Vast Data solves high-performance computing challenge

When DUG began to supply high-performance computing as a service, it discovered its storage was not up to the task, sending the company on a years-long journey.

DownUnder GeoSolutions got its start in a backyard shed built by its founders.

Almost 20 years later, DownUnder GeoSolutions, which is now called DUG Technology, has carved out a niche for itself, providing high-performance computing as a service (HPCaaS) to academics, scientists and enterprises for modeling and data processing.

As the company grew, so did its storage requirements. It searched for a storage provider to support its growing HPCaaS business and took briefings with a series of vendors, including DDN and WekaIO. The company, which has developed much of its technology in-house, even tried its hand at a homegrown alternative before finding a good fit with Vast Data. 

"We needed to scale with performance, we needed a high level of reliability and we needed a high level of maintainability," said Stuart Midgley, CIO at DUG.

Outgrowing homegrown storage

Headquartered in Perth, Western Australia, DUG started out processing data for oil and gas companies and developing software for seismic processing to help with exploration and production.

As data volumes and data dependency grew, so did DUG, eventually expanding into an HPCaaS business, which, in turn, exposed its own need for a storage product that could support its growth.

DUG looked at vendors from around the globe. When the company couldn't find a product that worked, it stitched together its own, Midgley said.

It used the open source parallel distributed file system Lustre, software it knew well and had contributed patches to.

But DUG's use of Lustre, a file system commonly used by supercomputers, presented its own challenges, according to Midgley. DUG patched together Lustre software, ran its own client and patched the kernel to fix issues it saw, all while Lustre was not maintained as part of the Linux kernel.

Midgley said issues would arise when the kernel version needed to run a stable version of Lustre introduced, say, a security issue; addressing the security issue then meant dealing with an unstable version of Lustre.

For hardware, DUG used a combination of NetApp's E-Series block storage and white box systems, or generic arrays.

"We were buying white box systems from any hardware vendor, loading a lot of disks and then building on top of those our own storage solution," Midgley said.

The NetApp storage device ran both SSDs and HDDs behind a redundant array of independent disks (RAID) controller, which was used to manage the storage system and protect against a drive failure.

But the NetApp RAID controller turned out to be problematic, returning corrupt data or data that was different from what DUG had written, Midgley said. DUG stopped using the RAID controllers and instead used the NetApp units as a chassis, like a JBOD or "just a bunch of disks" system, for capacity. DUG ran homegrown software alongside the ZFS file system and Lustre on top of the chassis, he said.

Drive failures, which are normally rare and happen about once every three years, were occurring daily, Midgley said. HPCaaS required DUG to work with thousands of drives, a number that far exceeds normal operations, so failures became a constant problem.
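The arithmetic behind daily failures at this scale can be sketched as follows. This is an illustration only: it assumes the "once every three years" figure applies to each individual drive and that drives fail independently, and the fleet sizes are hypothetical, since the article says only that DUG ran "thousands" of drives.

```python
# Back-of-the-envelope estimate of fleet-wide drive failures per day.
# Assumptions (not from DUG): each drive fails on average once every
# three years, and failures are independent of one another.
DAYS_PER_YEAR = 365
MEAN_YEARS_BETWEEN_FAILURES = 3.0


def expected_failures_per_day(drive_count: int) -> float:
    """Expected number of drive failures per day across a fleet."""
    per_drive_daily_rate = 1.0 / (MEAN_YEARS_BETWEEN_FAILURES * DAYS_PER_YEAR)
    return drive_count * per_drive_daily_rate


if __name__ == "__main__":
    # Hypothetical fleet sizes, chosen only to show how the rate scales.
    for drives in (1, 1_000, 5_000):
        rate = expected_failures_per_day(drives)
        print(f"{drives:>5} drives: {rate:.2f} expected failures per day")
```

Under these assumptions, a fleet of roughly 1,100 drives would already average about one failure a day, which is consistent with failures becoming a daily event once the drive count reaches the thousands.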

"We don't want the file system to go down," he said. "If we had an issue, if we lost a server or something happened, the file system had to keep going."

The drive failures led the company to look for an external storage provider.

DUG considered DDN, Qumulo, Pure Storage and WekaIO. DUG benchmarked, tested and played around with several pieces of hardware, but Midgley's team kept running into the same problem: Eventually, the storage needs would "break a RAID controller" or cause the controller to crash even during a simple test, he said.

Using Vast Data for HPC

Roughly four years ago, DUG decided to talk to Vast Data, a flash memory storage startup that was still in stealth at the time.

Vast sent DUG hardware so it could stress test the storage equipment against its expected performance needs. The tests resulted in some "absolutely catastrophic failures," Midgley said. As with all the storage products that came before it, DUG broke Vast's storage the same way it broke NetApp's -- severely, Midgley said.

Within a couple of days, Vast recovered the test file system in read-only, copied the data, wiped and reset the hardware, and then put DUG's data back on the clean hardware. Within a month, Vast was able to fix the bug that caused the crash, a selling point for DUG.

Still, Vast did things that caused some confusion for DUG. For instance, Vast used Network File System (NFS), a file access protocol designed for one computer to serve data to many, not a protocol designed for clustered, scalable file systems like Lustre. Vast acknowledged drawbacks to NFS, including relaxed security features, performance issues with locking and file consistency, and issues with concurrent writing to files, but it also noted that NFS was stable and maintained by the Linux community.

DUG also ran into performance issues with Vast's hardware. Midgley said that, using Vast, DUG could create 30,000 to 40,000 files a second but could delete only 200 to 300 files a second.

"If you put that in the hands of users, very quickly you will have a billion files, and it will take you the rest of time to delete," he said.
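To put those rates in perspective, the short calculation below estimates how long it would take to create and then delete one billion files. It is a rough sketch that assumes the quoted rates are sustained and that nothing else competes for the system; the function and variable names are illustrative.

```python
# Rough time-to-complete estimate from the create and delete rates
# Midgley quoted. Assumes the rates are sustained for the whole run.
SECONDS_PER_HOUR = 3_600
SECONDS_PER_DAY = 86_400
FILES = 1_000_000_000  # "a billion files"


def seconds_to_process(file_count: int, files_per_second: float) -> float:
    """Time in seconds to process file_count files at a steady rate."""
    return file_count / files_per_second


# Creating at 40,000 and 30,000 files per second (fastest rate first).
create_hours = [
    seconds_to_process(FILES, rate) / SECONDS_PER_HOUR for rate in (40_000, 30_000)
]
# Deleting at 300 and 200 files per second (fastest rate first).
delete_days = [
    seconds_to_process(FILES, rate) / SECONDS_PER_DAY for rate in (300, 200)
]

if __name__ == "__main__":
    print(f"Create: {create_hours[0]:.1f} to {create_hours[1]:.1f} hours")
    print(f"Delete: {delete_days[0]:.1f} to {delete_days[1]:.1f} days")
```

At those rates, a billion files could be written in about 7 to 9 hours but would take roughly 39 to 58 days to remove, a gap of more than two orders of magnitude.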

Vast systematically worked through the issue until files could be deleted in a more realistic timeframe.

Every storage system, including DUG's own homegrown system, hit catastrophic failures, but Vast was able to recover the fastest, Midgley said. Other issues have cropped up, including during the proof-of-concept stage, but Vast has continued to be attentive in addressing those issues.

"Every problem we hit, Vast solved," Midgley said. "[Vast] didn't solve it over five years -- they solved it within a month or two."

Six to seven months into working with Vast Data, DUG was no longer able to break the storage system.

DUG's focus going forward

When DUG started using Vast, its data requirements were relatively small. But DUG's customer base has evolved in the last four years, with users wanting more out of their storage.

For example, one feature now in demand is encryption at rest, or encrypting data that is not actively being used. Currently, DUG manages a customer's encryption keys. But increasingly, DUG's customers want to handle key management themselves, so that only the end user can see what has been written. This is a feature that Vast Data currently lacks, although it is on the vendor's roadmap.

That said, Vast's customer service is top notch, according to Midgley.

"We have a Slack channel [with Vast] for problems. We put them in there. Usually within a couple of minutes, someone's on the system looking. So, we can't ask for much more than that," he said.