On July 19, Windows systems were caught in a recurring cycle of blue screens of death (BSOD) and reboots because of a problem with an update to cybersecurity firm CrowdStrike's sensor intended to gather telemetry data on potential new threat techniques. The images of people crammed into airports around the world as their flights were grounded were perhaps the most visible depiction of the chaos caused by this latest IT outage. More than 3,300 flights were canceled, and several other industries were impacted as well. With about 8.5 million devices crashing, the event has amounted to a serious global economic loss. CrowdStrike and Microsoft responded swiftly, providing steps for machine recovery. The cybersecurity firm has since also explained what went wrong with its sensor. In its account, we noticed something crucial that we see as a lesson about shifting security testing to the left.
What the outage was about
The incident involved CrowdStrike's endpoint security sensor for Microsoft Windows systems. One type of security content configuration tells the sensor what specific behaviors to observe, detect or prevent. Threat detection staff use it to identify probable adversarial activity. CrowdStrike's platform receives security content configuration updates often. One specific update, sent to Windows hosts running sensor version 7.11, made the systems crash if they were online during the specific hour in which the update was available. The crash was due to the inability of CrowdStrike's system to handle an exception triggered by an out-of-bounds memory read (i.e., reading data from outside the intended buffer), which was in turn triggered by the problematic security content the firm delivered. The affected Windows systems then got stuck in an endless loop of BSODs and reboots.
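To make the failure mode more concrete, here is a minimal Python sketch of a bounds-checked read. It is not CrowdStrike's code: the names `ContentError`, `read_field` and `evaluate_rule`, as well as the rule layout, are hypothetical and chosen only for illustration. Python raises an error instead of reading adjacent memory, so the sketch models just the logic: check the index against the buffer and treat a bad content entry as a recoverable error rather than letting it take down the host.

```python
from typing import Sequence


class ContentError(Exception):
    """Raised when a content entry refers to data the sensor does not have."""


def read_field(inputs: Sequence[str], index: int) -> str:
    """Return inputs[index], refusing any index outside the buffer."""
    if not 0 <= index < len(inputs):
        raise ContentError(f"field index {index} is outside the input buffer")
    return inputs[index]


def evaluate_rule(inputs: Sequence[str], rule: dict) -> bool:
    """Evaluate one hypothetical rule; skip it if it is malformed."""
    try:
        return read_field(inputs, rule["field"]) == rule["expected"]
    except (ContentError, KeyError, TypeError):
        # Assumption: skipping one bad rule is preferable to crashing the host.
        return False
```

In this sketch, a rule that points past the end of `inputs` simply evaluates to `False`. Whether skipping, logging or rolling back is the right reaction depends on the product; the point is only that the bad read is caught.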
The update was delivered because the firm trusted three things: the successful results of the stress tests run in the staging phase, the absence of issues with previous updates, and clearance by its content validator system. The validator checks that an update is safe to publish, but due to a bug, in this case it found no problem.
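As an illustration of the kind of check a content validator can run before publishing, the following Python sketch rejects an update whose rules reference fields the sensor would not supply. Everything in it is an assumption made for the example: the function name `validate_update`, the `field` key and the idea of a fixed number of supplied fields do not come from CrowdStrike's account.

```python
from typing import Iterable, List


def validate_update(rules: Iterable[dict], supplied_fields: int) -> List[str]:
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    for position, rule in enumerate(rules):
        field = rule.get("field")
        # bool is a subclass of int in Python, so it is excluded explicitly.
        if not isinstance(field, int) or isinstance(field, bool):
            problems.append(f"rule {position}: missing or non-integer field index")
        elif not 0 <= field < supplied_fields:
            problems.append(f"rule {position}: field {field} out of range")
    return problems
```

A release pipeline could refuse to publish whenever the returned list is not empty. A check like this is only as good as its own tests, which is why the validator itself needs to be reviewed and tested too.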
With their systems rendered temporarily useless, many companies could not operate normally. One estimate says about 125 of the 500 most profitable publicly traded firms in the U.S. were affected and that they face a collective direct loss of $5.4 billion, with the healthcare and banking industries bearing 57% of the losses caused by the event. As for airlines, which in the U.S. included Delta, United Airlines and American Airlines, their loss is estimated at $143 million each. Other affected sectors for which losses are expected include IT, retail and wholesale, finance, and manufacturing.
Machine recovery has been a pain, as in most cases it has required manual work by IT staff. Microsoft advised rebooting as an effective solution, indicating that it may take several tries (even 15 reboots) for the strategy to do the trick. Other fixes were suggested in case rebooting did not work, among them restoring the system to a version without the CrowdStrike update or booting the machine into safe mode to manually delete the problematic file. It may come as no surprise that malicious actors have taken advantage of the event to push traps that they pass off to users as solutions.
The consequences have been serious: what some refer to as "the largest IT outage in history" has led to CrowdStrike's CEO being asked to testify publicly before the House Committee on Homeland Security. This is in spite of the cybersecurity firm diligently taking action to prevent the damage from escalating.
This is yet another call to shift security to the left
This whole incident shakes us cybersecurity firms to the core. We are important enablers of the successful operation of companies around the world. This event therefore reminds us that we need to be ever more watchful that our products are thoroughly tested before each release and that the tests themselves are correct and comprehensive. In response to the incident, CrowdStrike has published a set of actions to prevent future incidents like this one with its sensor. It mentions, for example, local developer testing and enhancements to its content validator, such as additional validation checks. What is more interesting to us, however, is its mention of having third parties conduct code reviews and review quality processes from development through deployment.
We feel the need to highlight that trust in catching defects in the content update was placed in the testing and staging phases, that is, right before deployment into production. But on the line that represents the development lifecycle from left to right, from requirements to maintenance, security needs to be shifted to the left. This shift-left approach means testing security earlier in the software development lifecycle (SDLC), that is, earlier than the traditional testing phase. Instead of waiting until the staging or post-deployment phases, testing begins in the initial stages of development, including the requirements gathering, design, and coding phases. By incorporating security and functionality tests early, developers can identify and remediate vulnerabilities before moving forward. This prevents errors from compounding, reduces the likelihood of critical issues emerging in later stages, and makes remediation less costly than if it is done after the product has been handed to the end user. The incident discussed here clearly shows how costly issues discovered after release can be.
Developers need to have security in mind while they code and be able to manually review the security of code written by their peers. A third-party review does add an extra layer of care, but developers themselves must still write secure code and comprehensive tests for the functionality they add.
In the same vein, we feel the need to stress the importance of dogfooding (i.e., trying the products developed by one's own company as though one were an end user). This should be done before making product updates available. We understand from CrowdStrike's preliminary Post Incident Review that the firm does this for one type of security content configuration for its sensor but not for the type that was at fault in this incident. It is worth noting, then, that this strategy should be considered for every part of the products one offers end users.
So, reader, think of your development practices and recognize whether you're continuously weaving security into the entire SDLC. Think of the importance of your product for the community and your client companies' operations. And if you want us to help you develop and deploy secure software, don't hesitate to ask us about our Continuous Hacking solution.