Deakin Applied AI Institute in Australia Chooses WekaIO and Xenon

To stay in front of increasing performance demands from AI researchers

The Deakin Applied AI Institute (A2I2) opened in 2019 merging 2 groups at the University – the Pattern Recognition and Data Analytics team (PRaDA) and the Deakin Software and Technology Innovation Laboratory (DSTIL).


The merger combined ML expertise with the ability to deliver complex ideas and systems as user-friendly software, web and mobile applications.

A2I2 takes a multi-disciplinary approach to projects, and works on “human in the loop” problems where AI solutions enhance human capabilities. Partnering with government and industry, it works across diverse sectors including health, defense, education, finance, manufacturing and security.

To drive the ML processes of data analytics, training and inference, the institute operates an HPC cluster that includes the latest NVIDIA DGX System as well as other compute nodes. There are over 80 researchers at A2I2, and at any given time over 50 of them are active users of the HPC cluster. Robert Ruge, systems and network engineer, and Josh Cole, AI systems officer, shared their story of meeting the Institute’s increasing need for faster access to data.

The Virtual Dementia Experience (VDE) is an immersive, emotive and interactive experience that aims to capture and simulate the experience of living with dementia.


Computer Vision Demands Higher Storage Performance
The HPC cluster was originally established with 1.7PB of storage, deployed with a Linux-based parallel file system across 8 storage servers. It was set up with volume mirroring, so the effective working space was 850TB. After it was commissioned, a number of researchers moved into computer-vision-based AI, which placed a heavy IO/s load on the storage. This had a knock-on effect across the Institute: the heavy IO/s load affected all jobs, which ran slower and slower.
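
The 850TB figure follows from simple arithmetic: under two-way mirroring every block is stored twice, so usable space is half the raw space. A minimal Python sketch, assuming decimal units (1PB = 1,000TB) and a mirror factor of 2 (the article says only “volume mirroring”); the function name is hypothetical:

```python
def usable_capacity_tb(raw_tb: float, copies: int = 2) -> float:
    """Return usable space when every block is stored `copies` times."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return raw_tb / copies


# Assumption: 1.7PB expressed in decimal terabytes.
RAW_TB = 1700.0

usable = usable_capacity_tb(RAW_TB)  # assumed two-way mirror
print(f"{usable:.0f}TB usable out of {RAW_TB:.0f}TB raw")
```

The same function shows why removing the mirror overhead matters: with `copies=1`, all of the raw capacity would be usable.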

The team started evaluating alternative storage solutions. Ruge and Cole took a broad look at the current storage offerings from major and emerging vendors, looking for the “highest IO/s we could get for our budget”. Supportability, modern design, and minimal complexity were also key considerations.

The team settled on WekaIO, Inc., which was able to offer the highest IO/s for their budget, and, as Ruge pointed out, “WekaIO runs on commodity hardware we could buy from anyone, and we could grow it incrementally as we have funds – which is what we’ve done … allowing us to keep one step ahead of the researchers.”

WekaIO also stood out for ease of operation and support. Ruge noted: “WekaIO was built with new technology in mind. NVMe-based, it doesn’t carry any baggage from the HDD era. That was also an advantage.”

Move Encourages Archiving
The initial roll-out was 436TB raw NVMe, across 10 servers. Taking a cautious approach, the original plan was to stand up the WekaIO storage and use it in parallel with the existing Linux clustered storage. The team took advantage of the move to the new storage to spring clean the existing data with the researchers, who were able to archive more than half their data.

“As we did our clean-up and pulled our accounts across, we soon realized we could squeeze it all in the new system. Which was good for us, we didn’t have to run two stacks,” explained Ruge.

The old Linux system had also implemented mirroring and carried a lot of overhead from its HDD-based architecture.

As a consequence, Ruge explained: “Monthly backups were taking 35 days, so we couldn’t guarantee we would have a good backup. Since we transitioned to WekaIO the backup is now taking 6 days!”

Cole added that “the team is now implementing weekly backups on top of the monthlies”, which weren’t possible before.
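
The backup figures can be read as a simple scheduling check: a backup is only dependable if one run finishes before the next is due. The sketch below is illustrative Python, assuming a 30-day month and assuming a weekly run takes no longer than the 6 days quoted for the monthly backup; neither assumption comes from the article, and the function name is hypothetical.

```python
def fits_cadence(backup_days: float, cadence_days: float) -> bool:
    """True if a backup run completes before the next one is due."""
    return backup_days < cadence_days


OLD_BACKUP_DAYS = 35  # quoted duration on the old storage
NEW_BACKUP_DAYS = 6   # quoted duration on WekaIO
MONTH_DAYS = 30       # assumption: nominal month length
WEEK_DAYS = 7

# How many times faster the quoted backup run became.
speedup = OLD_BACKUP_DAYS / NEW_BACKUP_DAYS

print("old monthly fits:", fits_cadence(OLD_BACKUP_DAYS, MONTH_DAYS))
print("new monthly fits:", fits_cadence(NEW_BACKUP_DAYS, MONTH_DAYS))
# Assumption: a weekly run is no slower than the 6-day monthly run.
print("new weekly fits:", fits_cadence(NEW_BACKUP_DAYS, WEEK_DAYS))
print(f"speedup: {speedup:.1f}x")
```

Under these assumptions, a 35-day run overruns a monthly schedule, while a 6-day run fits inside both a monthly and a weekly window.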

Enroute Trauma Management Project: Helping paramedics make the right decisions to save lives.

Benchmarking Performance
With the WekaIO system in place, Cole ran a series of benchmarks to compare it to the old system.

He found that “a single host on the new system could get much higher IO/s than the entire cluster on the old system … and that’s including using NFS, UDP,” which are the slowest WekaIO protocols.

He provided the benchmarks below on the original install with 10 servers.

Commenting on the benchmarks, Ruge noted that “we’ve not reached the limits of the storage system,” and Cole added: “The performance we’re seeing with the current storage is much higher than the old storage system could have handled. But we still have plenty of space to spare, and we’ve got plenty of overhead for them to play with. So, storage is no longer a bottleneck which is good.”

The experience of the researchers matches the benchmark numbers, with Cole providing the following quotes from the A2I2 team:

“I would like to inform you that my code is running very fast on the server. In particular, I usually need from 40 minutes to 1 hour to complete one epoch, but now it only takes around 6 minutes per epoch.”

“Thanks for implementing the new WekaIO file system. Performance is very much improved.”

“WekaIO is great. It is now pretty much instantaneous to start running jobs, whereas before I would sometimes wait 10 minutes with jobs blocked in a disk wait state.”

Early Cerebral Palsy Screening with Deep Learning

WekaIO Provides Solid Foundation for A2I2
WekaIO has been a game changer for A2I2.

Ruge noted: “Price compared to performance compared to other solutions we looked at … it definitely delivered the performance we needed and we still have headroom to push the system as researchers get more sophisticated. And in the less than 12 months that it has been in, we’ve managed to upgrade it with extra money that’s become available.”

This was a key capability that stood out, allowing the team to “expand incrementally as we have funds available, and doing so increase the performance and keep ahead of the researchers.”

Ruge reflected on the early part of the decision, and noted: “One concern we had, [WekaIO] was a new company to us, a small company, and fairly new on the global market, and that was quite a concern for us but we decided to take a punt and so far so good and it’s been great to see the support roll out in Australia, and staff being employed, with WekaIO growing from one staff in APJ to 20 now. The fact that Xenon was the implementation partner, and stood by WekaIO meant a lot to us as well, as we’ve always had successful engagement with the Xenon team.”

Cole highlighted the support and the admin interface that make his job easier: “With WekaIO the user interface has a lot more information that I can make use of and having the support is nice and just better documentation.” When there are issues, the team has been able to access the WekaIO developers directly and resolve them on the first call. That, combined with the local support team, has made for a smooth roll-out and implementation across the Institute.

A2I2 researchers are constantly pushing the boundaries of AI research, data and model size, so the road ahead includes further incremental expansion, faster storage performance, and continuing to “stay one step ahead of our researchers with WekaIO.”

 
