Coolant leak forces Rutgers supercomputer into partial operation
Rutgers University’s new supercomputer has been in a state of partial operation since January, when a seal burst, leaking coolant and preventing the system from working as designed.
The cooling system broke about two months ago, forcing the shutdown of the entire Caliburn supercomputer, said Manish Parashar, a distinguished professor in the Department of Computer Science.
The computer itself did not break; the failure was limited to the cooling system in the facility that houses it.
“On one of the cold mornings in January, one of the seals for the cooling system broke, causing the coolant to leak,” he said. “So the cooling in the system was not sufficient to create the right environment, so we shut down Caliburn until (University Facilities) was able to fix it.”
The computer was shut down to prevent damage from excessive heat.
The system has been online since last June, though it only became accessible to the general University community at the beginning of this semester.
“Caliburn has been operational since June 2016, and was being tested, benchmarked and tuned,” he said. “We worked with early users during Fall 2016 to evaluate different usage modes before it was made available more generally to the Rutgers and (New Jersey) researchers.”
There are two parts to the Caliburn system, he said. The first is the supercomputer itself, and the second is a data center that provides the power and cooling needed to create the proper environment for the machine.
While the system has not been repaired yet, parts of the supercomputer were relocated to another facility, allowing users to access part of the machine’s services, he said.
“What we (did) in a week was to get this temporary solution up and running by moving part of the system. That was a short-term thing,” he said. “The long-term is really a process where Rutgers Facilities points out what the problem and solution is, figures out the cost and fixes it.”
The system is under warranty and insurance, so facilities personnel are working with the computer’s original vendors at no additional cost to the University, he said.
Once the facility is repaired, researchers will test Caliburn and ensure everything is working as designed before they fully start the system up again, he said.
Every component in the system has already been tested individually, he said. The computer itself has shown no signs of damage or other issues, meaning once the facility is fixed, the system can be made operational immediately.
Some of the repairs have already been completed. The team will make sure they understand why coolant leaked before they can spin the machine up again, he said.
Once they have determined the cause and made sure it cannot happen again, they will perform a full integrity check of the supercomputer.
Parashar anticipates the system will be fully operational within a few weeks, at which point normal operations will resume.
Even though the system is only partially online, users have still been taking advantage of it, he said. Those working on projects have continued to access the system through the computer’s web portal.
“I think the response on using the system has been great,” Parashar said. “Even the partial system that we’ve got online has been heavily used and we hope to get the entire system online very very soon.”
Nikhilesh De is a correspondent for The Daily Targum. He is a School of Arts and Sciences senior. Follow him on Twitter @nikhileshde for more.