Microsoft Australian East Data Center control system cyber incident – unintentional or malicious?
Operators of data centers have been primarily concerned about the cybersecurity of the information technology (IT) data and networks rather than operational technology (OT) and control systems used to monitor and control the physical infrastructure of the data center. Cybersecurity of the data center infrastructure has generally been limited to OT networks using internet protocols (IP) and Internet-connected controllers. Yet the control system field devices in data centers, often referred to as the Purdue Reference Model Level 0 and 1 devices, are the most cyber-vulnerable (no cybersecurity, authentication or cyber forensics) and least likely to be adequately addressed. These devices include the process sensors that measure pressure, level, flow, temperature, humidity, voltage and current; uninterruptible power supplies (UPSs); power distribution units (PDUs); smart breakers; automatic transfer switches; generator systems; battery monitoring systems; chiller motors; fire suppression; etc.
Because of these gaps, I wrote an article for the Federal Facilities Council of the National Academy of Engineering – “Challenges in Federal Facility Control System Cybersecurity, Including Level 0 and 1 Devices” – which is on the National Academies of Sciences website. It should also be noted that much of this equipment is used throughout the critical infrastructures.
Microsoft data center shutdown
Between Aug. 30, 2023, and Sept. 1, 2023, customers experienced issues accessing or using Azure, Microsoft 365 and Power Platform services. A utility voltage sag in the Australia East region tripped a subset of the cooling units offline in one data center. While staff worked to restore cooling, temperatures in the data center rose, so Microsoft proactively powered down a subset of selected compute and storage scale units to avoid damage to hardware.
Data centers and chillers have been impacted by control system cyber incidents
Data center control system cyber incidents have shut down or damaged data centers. Cases include damage to chiller motors, shutdown of chiller motors, “frying” the servers in the data center, data center fires, shutdowns from loss of power, etc. Systems that were designed to protect mission- and safety-critical systems have been co-opted to be used as attack vectors against the very systems they were meant to protect. Control system cyber incidents have impacted the physical operation of data centers operated by many different entities globally.
NIST, GAO, ISA and others have defined a cyber incident as electronic communications between systems or systems and people (e.g., operator displays) that can affect confidentiality, integrity or availability. A cyber incident does not have to be malicious. However, the informal IT definition of a cyber incident is that the system or device is connected to the Internet and data has been compromised or stolen. This definition focuses on data, not on physical impact. My non-public database includes more than 17 million control system cyber incidents. Some of these have had serious safety and financial consequences. However, most of these control system cyber incidents were not due to compromise of Internet protocol (IP) networks, so they were not identified as being cyber-related. In the case of the Microsoft data center, the incident was treated as a mechanical issue, not a cyber issue. As a result, the incident wasn’t addressed by CISA, SANS or others as a cyber incident.
According to Microsoft, its original equipment manufacturer (OEM) vendor is trying to understand what caused the event. There was no mention of either Microsoft or the OEM vendor’s cybersecurity organization being involved. Few OEMs are cognizant of control system cyber issues at the field-device level.
Difference between a control system cyber incident and a control system cybersecurity incident
Prior to Stuxnet, a control system cyber incident could be distinguished from a control system cybersecurity incident. A control system cyber incident was unintentional, like the PG&E San Bruno natural gas pipeline rupture. A control system cybersecurity incident, on the other hand, was a malicious cyberattack, like the 2000 Australian wastewater attack. However, Stuxnet demonstrated that a sophisticated cyber attacker could make a cyberattack look like an equipment malfunction and go unidentified by traditional OT network monitoring for an extended period of time. The Triton cyberattack against a Saudi Arabian petrochemical plant is another example: it caused a shutdown in June 2017, yet network monitoring identified no cyber issues. The safety modules that caused the shutdown were sent to the OEM for a root cause analysis. No issues were identified with the controllers because the malware was in the Triconex engineering workstation. As a result, the plant was restarted even though the malware was still in the engineering workstation. The malware was discovered when the plant was shut down again in August 2017.
What happened at Microsoft’s Australian East Data Center
The cooling capacity for the two affected data halls consisted of seven chillers, with five chillers in operation and two chillers in standby (N+2) before the voltage dip event. When the event occurred, all five chillers in operation faulted and didn’t restart because the corresponding pumps did not get the run signal from the chillers. Infrastructure thermal warnings from components in the affected data halls directed a shutdown of selected compute, network and storage infrastructure. The shutdown was to protect data integrity and infrastructure health and resulted in a loss of service availability for a subset of the availability zone. The onsite team was able to manually restart the five chillers, meaning it was the automation that was impacted.
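The N+2 arithmetic can be made concrete with a short sketch. It assumes all chillers have equal capacity, and the helper name cooling_shortfall is hypothetical and not part of any Microsoft tooling. Only the unit counts come from the description above.

```python
def cooling_shortfall(total_units: int, required_units: int, failed_units: int) -> int:
    """Return how many chillers are missing relative to the required count.

    Assumes every chiller has the same capacity (an illustrative assumption).
    """
    if not 0 <= failed_units <= total_units:
        raise ValueError("failed_units must be between 0 and total_units")
    available = total_units - failed_units
    return max(0, required_units - available)


# Seven chillers with five required: the N+2 design absorbs two failures...
print(cooling_shortfall(total_units=7, required_units=5, failed_units=2))
# ...but not the simultaneous loss of all five operating units.
print(cooling_shortfall(total_units=7, required_units=5, failed_units=5))
```

Under these assumptions, losing five units at once leaves a gap that the two standby units cannot close even if both start, which is why the simultaneous fault matters more than any single chiller failure.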
Due to the size of the data center campus, the staffing of the team at night was insufficient to restart the chillers in a timely manner. Microsoft has temporarily increased the team size until the underlying issues are better understood and appropriate mitigations can be put in place. This was an automation event. Will the teams be trained in control system cybersecurity, particularly at the field device level?
The emergency operating procedures for restarting chillers were slow to execute. Consequently, Microsoft is exploring ways to improve existing automation to be more resilient to various voltage sag event types. How will control system cyber issues be addressed for automation that has no built-in cybersecurity?
The sequencing of workload failovers and equipment shutdown under the playbook could have been prioritized differently with better insights. Microsoft is working to improve reporting on chilled water temperature to enable more timely failover/shutdown decisions based on thresholds. How will the staff know whether the temperature sensor readings are correct, or even whether the measurements come from the sensors and are not spoofed, given that there is no cybersecurity or authentication in the process sensors?
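One partial answer to that question is to cross-check redundant readings before acting on a threshold. The sketch below is a minimal illustration, not Microsoft's method: the sensor names, the 2.0 °C tolerance and the 18.0 °C threshold are all assumptions. It does not authenticate any sensor, so a spoof that shifts every reading consistently would still pass.

```python
from statistics import median


def flag_outliers(readings_c: dict[str, float], tolerance_c: float = 2.0) -> list[str]:
    """Return names of sensors whose reading differs from the median by more than tolerance_c."""
    mid = median(readings_c.values())
    return sorted(name for name, value in readings_c.items()
                  if abs(value - mid) > tolerance_c)


def should_fail_over(readings_c: dict[str, float],
                     threshold_c: float = 18.0,
                     tolerance_c: float = 2.0) -> bool:
    """Decide on failover using only the readings that agree with the median."""
    suspects = set(flag_outliers(readings_c, tolerance_c))
    trusted = [value for name, value in readings_c.items() if name not in suspects]
    if not trusted:
        # No two readings agree, so there is no basis for an automated decision.
        raise ValueError("no mutually consistent readings available")
    return median(trusted) >= threshold_c


# Hypothetical chilled water supply temperatures in degrees Celsius
readings = {"supply_a": 7.1, "supply_b": 7.3, "supply_c": 15.0}
print(flag_outliers(readings))      # sensors that disagree with the rest
print(should_fail_over(readings))   # decision based on the consistent readings
```

A check like this only catches a reading that disagrees with its peers. Confirming that a signal really originates from the physical sensor requires monitoring below the network layer, which is the gap the question above points to.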
What made this incident a control system cyber incident
In this case, the control system cyber incident involved electronic communication between systems that impacted availability. The five chillers did not restart because the corresponding pumps did not get the run signal. As a result, the data center shut down, resulting in a loss of availability, which makes this a control system cyber incident.
Chiller motor availability is very important for data center availability, as demonstrated by the N+2 configuration: the two affected data halls had seven chillers, with five in operation and two in standby. A rough, non-quantitative estimate shows how unlikely an unintentional scenario is:
- Probability of a voltage dip – small
- Probability of a voltage dip that affects the operation of the data center with back-up power sources not mitigating the voltage dip – very low
- Probability of the failure of a chiller system – low
- Probability of the failure of five signals to turn on the chiller pumps at the same time – extremely low
- Probability of the concurrent failure of five signals during a non-mitigated voltage dip – extremely, extremely low.
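The qualitative chain above can be sketched numerically. Every value below is a placeholder assumed purely for illustration, not a measured rate, and the function name is hypothetical. The calculation also assumes the five run-signal failures are independent of one another, which is the premise of the unintentional scenario.

```python
def independent_joint_probability(p_unmitigated_dip: float,
                                  p_signal_failure: float,
                                  n_signals: int = 5) -> float:
    """Probability of an unmitigated dip AND n independent run-signal failures."""
    return p_unmitigated_dip * p_signal_failure ** n_signals


# Placeholder inputs, assumed for illustration only (not measured rates)
P_UNMITIGATED_DIP = 1e-3   # voltage dip that back-up power does not mitigate
P_SIGNAL_FAILURE = 1e-2    # one pump run signal fails, given such a dip

p_joint = independent_joint_probability(P_UNMITIGATED_DIP, P_SIGNAL_FAILURE)
print(f"dip alone:            {P_UNMITIGATED_DIP:.1e}")
print(f"dip + 5 independent:  {p_joint:.1e}")
```

With any inputs of this kind, the independent-failure product sits many orders of magnitude below the probability of the dip itself. Five signals failing together is therefore hard to explain as five separate faults, which points toward a shared cause, whether a design weakness or deliberate interference.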
Could it have been malicious?
The incident in Microsoft’s data center might be considered in the light of activity by China’s Volt Typhoon threat actor. Microsoft assessed with moderate confidence that the Volt Typhoon campaign is pursuing development of capabilities that could disrupt critical communications infrastructure between the United States and the Asia region during future crises. Volt Typhoon has been active since mid-2021 and has targeted critical infrastructure organizations in Guam and elsewhere in the United States. In this campaign, the affected organizations span the communications, manufacturing, utility, transportation, construction, maritime, government, information technology and education sectors. There have been other control system cyberattack campaigns, including Redfly.
There is no cybersecurity or authentication in process sensors, including in chiller systems. It is possible for a threat actor to spoof sensor signals or prevent them from being issued. This has happened before. It is possible for a cyberattack to cause a voltage dip. This has also happened before.
IT cybersecurity organizations are not control system cybersecurity experts
Being an IT cybersecurity expert does not mean a company also has sufficient expertise in control system cybersecurity. Microsoft is obviously a major player in operating systems and IP network cybersecurity. In this case, a physical failure from a control system cyber incident affected a large IT operation, so it is critical to understand the lack of cyber resilience of the control system involved, particularly the process sensors that are 100% trusted.
In another case, a large utility had its IT organization perform security scans of its data center assets. Because the scanning was successful, the IT organization expanded it into NERC CIP substations. The security group had no previous experience with scanning substations. The port scanning caused the real-time protocol operation of the protective relays to stop. All the devices in each substation were affected at the same time in every case. To anyone who did not know that a security scan had been initiated, it looked like a DDoS attack resulting in equipment malfunction.
These cases demonstrate why appropriate control system cybersecurity training is so important.
Learning opportunity for both the defensive and offensive cyber communities
As this control system cyber incident resulted in physical impacts, it should be a learning experience for both cyber defenders and offensive cyber attackers. Even if this incident was unintentional, it could have been caused by a malicious cyberattack. I think it is fair to say the offensive cyber attackers will be looking at this event regardless of whether it was malicious or unintentional. Unfortunately, control system field device cyber issues generally are not included in red team exercises. Will the cyber defenders heed the lessons learned?
Technologies that could help improve the reliability and cybersecurity of data centers (and other critical infrastructures) include process sensor monitoring at the physics level off-line from the OT networks, appropriate network cloaking technologies, use of low-voltage UPS systems and appropriate machine learning technologies. These technologies can also improve the reliability and cybersecurity of any industrial or manufacturing environment. Additionally, these technologies are not susceptible to IT malware such as SolarWinds or ransomware.
I have written a micro learning module for ISA on Identifying Control System Cyber Incidents. I am also preparing a class for Stevens Institute’s Maritime Security Center on control system cybersecurity which will address the lack of identifying control system cyber incidents and provide actual control system cyber incident case histories.
Summary
There are myriad issues raised by this incident. OT networks and control system field devices should be designed with cybersecurity protection. However, cybersecurity technology often doesn’t exist at the control system field device level, so compensating controls, including control system cybersecurity policies and cybersecurity training for the engineers, should be initiated for all critical infrastructures. Compensating technologies such as process sensor monitoring at the physics level, appropriate network cloaking technologies, use of low-voltage UPS systems and appropriate machine learning technologies should be considered for all critical infrastructures. Regardless of the best attempts to secure control systems, unintentional or malicious incidents can occur. Consequently, potential control system cyber incidents need to be identified so that a control system cyber incident response program is already in place and ready to use.