Lessons from CrowdStrike: Managing Risks in IT and OT Environments

Episode 19 July 29, 2024 00:15:43
PrOTect It All

Hosted By

Aaron Crow

Show Notes

In Episode 19 of "Protect It All," titled "Lessons from CrowdStrike: Managing Risks in IT and OT Environments," host Aaron Crow gets into the recent CrowdStrike Falcon platform incident that caused widespread system crashes and blue screens of death on Windows machines. Drawing on his extensive IT and OT experience, Aaron explains that the issue stemmed from a routine update error, not a cybersecurity attack, and explores why it had such a significant impact on major entities like airlines and airports.

 

Aaron highlights the critical differences between IT and OT risk management, emphasizing the importance of automated updates, real-time threat detection, and thorough update testing. He discusses the need for comprehensive risk assessment and the implementation of cyber-informed engineering practices to prevent similar issues in the future.

 

Listeners will gain key insights into balancing cybersecurity measures with system reliability and availability, along with actionable recommendations for strengthening their IT and OT environments.




To be a guest or suggest a guest/episode, please email us at [email protected]

 


Episode Transcript

[00:00:00] Speaker A: You're listening to Protect It All, where Aaron Crow expands the conversation beyond just OT, delving into the interconnected worlds of IT and OT cybersecurity. Get ready for essential strategies and insights. Here's your host, Aaron Crow.

[00:00:18] Speaker B: What's up, everyone? I wanted to dive into this CrowdStrike incident that happened. Everybody's talked about it. If you haven't heard about it, then you've probably been under a rock. But a little bit about me: I'm Aaron Crow. I kind of grew up in my career starting out doing desktop support, and I worked in infrastructure, building networks and supporting Active Directory and Exchange. So over my career, I've been in a lot of spaces, including my OT space. When I first got into OT and working in critical infrastructure, a lot of that time was spent rolling out basic services that we've had in IT for a while. What do I mean by that? I mean antivirus and patching and firewalls and network architecture, things that we'd had in the IT world for decades, but really just rolling those out into these OT spaces.

As we started bringing that technology into these spaces, we started having some of those problems. We had viruses because we were using commercial off-the-shelf products like Windows and VMware and Cisco, whereas before it was proprietary systems. Those weren't updated, but they also didn't have the same vulnerabilities, because they just weren't as prevalent, right? If you look at a lot of the vulnerabilities that come out today, it's because attackers go after the most prevalent systems, Windows and all those types of things. So those are the ones that have the issues that we see the most. We see malware on Windows machines, we see viruses on Windows machines.
It's not that Mac or Linux or others are immune to it; it's that the most prevalent devices are Windows machines. That's even the case in the OT world, as we saw in this.

Now, this CrowdStrike issue. Well, maybe not obviously, so just to be clear for everybody: this was not a cybersecurity issue. CrowdStrike is a cybersecurity tool, but this was not an incident from a bad actor or a nation state attacking. It wasn't malicious in any way. It was really just an update issue.

So what happened? On July 19, CrowdStrike released a routine sensor configuration update for its Falcon platform on Windows systems. This update was meant to enhance security against specific cyber threats, but it contained a logic error that caused a system crash, leading to a blue screen of death cycle: the machine would boot up, go to the blue screen, and sit there.

Why was the issue so widespread? Ultimately it only impacted Windows machines, and a big part of how widespread it became was the response, right? The problem became widespread due to a few factors, so let's dive into them.

Automated updates. I've got my tool, and it automatically updates. The faulty update was pushed automatically to systems running the Falcon sensor. Companies that use CrowdStrike for endpoint security are typically going to auto-update. You want your latest antivirus updates and all those types of things, so you're constantly getting those updates and securing against vulnerabilities as they come up. But that also meant that the flaw was quickly distributed across their enterprise and across their environments.

Global use of CrowdStrike. CrowdStrike is a major player, and for good reason. They're a great product. The CrowdStrike Falcon platform is widely used, with over 24,000 customers, including many Fortune 500 companies.
As a result, this update impacted a significant number of critical systems. We saw major airlines and airports. You look at Twitter and there are tons of pictures of people walking by and seeing blue screens, and I think even Times Square had them.

So why do organizations use CrowdStrike? CrowdStrike is a leading cybersecurity firm, known for its advanced threat detection and response capabilities. We use products like that for a few reasons. One is real-time threat detection: they provide real-time monitoring and response to cyber threats. As devices and systems and vulnerabilities and threats come up, they're using endpoint protection, and they have artificial intelligence and machine learning to detect anomalous activity and do something about it. They offer comprehensive protection, from malware detection to endpoint protection to threat intelligence, protecting system data and operations. Ultimately, you want those systems to be operating, as we saw with those blue screens and why they had such an impact. And CrowdStrike has built trust with major corporations and government entities because it's effective. It's not that a product is infallible, as we see, and I'll dive more into the impact, but CrowdStrike is popular because it works. It's popular because it's been a staple for organizations.

But why was this issue so impactful? That's really the key here: understanding all of these details and what we can learn from this to make sure that we don't go down the same road again.

So, critical system failures. The blue screen of death was caused by an update; that update is what pushed those failures, right? It only pushed it to Windows machines, and they were basically inoperable. Until a person put hands on a keyboard and fixed the problem, there was nothing you could do. They would just sit there at the blue screen of death until you did it.
So if you think about airline check-ins and banking services and hospitals, all of those types of devices, any time any of those were hit, some person has to put hands on them. You look at airports, and there are hundreds of them. There were some automated fixes, but for the most part, a lot of those systems are segmented for a reason, so it's difficult to get to them. Maybe the ones in the airport are a kiosk, and the machine may be on the back. I saw pictures of workers on ladders trying to get to the PC to be able to do the work.

It was a fairly complex remediation. The fix required technicians to manually boot systems into safe mode or recovery mode and delete the problematic file, which is time consuming. Just that process of booting it in, even if I can do it remotely, is difficult. But then you have other things like BitLocker, and the fact that these devices are physically segmented, so I have to go physically put hands on them. And there are tens or hundreds of these devices spread across large geographical areas, every airport, all these different locations. I don't necessarily have people just sitting there waiting to deploy. So how do I put hands on all of these things?

The simultaneous failure of systems worldwide really compounded on itself, right? We had business continuity issues at large companies and airports. If you look at the airlines in the air during that time, there were significantly fewer airplanes in the air, and that was compounded because of this. This was not necessarily an OT issue, and obviously it wasn't a cybersecurity issue, but it is a prime example of this IT and OT convergence thing, right?
So we have these systems, and many of these systems were OT systems in my view, but they weren't the systems on the airplane. It wasn't that they weren't flying planes because there was CrowdStrike on the control system of the plane. It was because they couldn't book things, they couldn't schedule, they couldn't get you boarding passes, they couldn't check your luggage. All of those were the reasons that brought this thing down, and that draws into the bigger picture of why things are different in IT and OT.

Automatic updates: there's a reason why we don't patch in OT the same way we do in IT. One of the other stories in this is that Southwest Airlines came out and said they had Windows 3.1, and they weren't impacted the same way that some of the other airlines were. I'm not saying that everybody should have really old operating systems. What I am saying is that it's a prime example that upgrading and having the latest and greatest of everything doesn't necessarily make you more reliable, more available, or even more secure. At the end of the day, these entities are doing what they need and what they focus on, whether it's a power plant, an airline, an airport, a train, or a manufacturing facility. Just replacing and upgrading is not necessarily the right choice or the right action.

Same thing with updating, right? We want to update our CrowdStrike or whatever our systems are: the IOS on our Cisco devices, the firmware on our firewalls, all these things. In an IT world, I'm going to update those very quickly, within weeks and sometimes hours of those things being released. But in an OT world, it's so dangerous to do that, and this is a prime example of how that is dangerous. So that also brings in risk. If I update these devices, then there's a risk of bringing down an entire entity. We saw this on a large scale.
This may be one of the largest ever. But what do I do about that? Let's play devil's advocate and pretend that we're not going to auto-update any of our devices in critical systems from now on. What does that mean? How can you do this differently? Well, you can roll it out to a smaller group. You can have a test environment. You can validate that it's not going to break. If they had tested this on a few systems before they broadly rolled it out across their entire organization, they would have seen these blue screens come up and they would have stopped it from impacting their entire operation.

But that takes time. That takes resources. That takes dedication. How many updates are made? How frequently can they do that? Some of these updates happen daily or hourly, and if they're not updating on that same schedule, what happens when you don't update and you've got a hole and a vulnerability in these environments?

So you have to look at and architect these environments to be purpose built. I talk a lot about cyber-informed engineering. Idaho National Labs has done a great job of really pushing that concept and designing this idea. I'm going to be speaking about it at DEF CON and ICS Village, but ultimately it's for purposes like this, right? What can I do? Let's take a step back again: I can't patch these things all the time, so how can I make sure these environments are safe and secure and available when I know I can't patch them all the time? I have to make other remediations and other mitigations to fix those things. I need to make sure I'm keeping backups. I need to make sure that I'm doing testing. I'm going through all of these steps, and I have somebody at the table playing that devil's advocate, because obviously it makes sense to go patch and put the latest vulnerability information into my endpoint protection, like a CrowdStrike, right?
Obviously I want to have the latest and greatest. I want to have the latest information. It's like the president always wanting the latest information about whatever is going on in the world. He doesn't want to be working on two-month-old or even hour-old information, because it could have changed, and his decisions can vary depending on that information. We look at it the same way. But on the flip side, the risk is that something like this can happen.

There's a reason why OT is segmented. There's a reason why we don't patch at the same level. There's a reason why we use different tools. There's a reason why we segment from Active Directory and we don't typically put OT devices into an IT organization. We have different teams that are running it, and you need to have different analysts that are looking at the data. All of this is because, while a lot of the technology in OT and IT is very similar, the way that we run them and the impacts of an outage are different. If your email server goes down for a few hours, it's a bad day, but it's not the end of the world. Your entire organization doesn't shut down. But as we see with incidents like this, in an OT environment, bringing down your entire booking operation means that you can't fly planes and you can't sell tickets. All of these things have ripple effects across sectors.

So, the CrowdStrike blue screen of death. This incident highlights how critically organizations depend on cybersecurity solutions, and the cascading effects that a single update can have on global operations. Understanding the cause and the widespread impact of this issue helps to underscore the importance of update testing and the challenges of maintaining cybersecurity.
While CrowdStrike has taken all the great steps to rectify the problem, it really just shows that it's not a CrowdStrike problem. Yes, this incident came from CrowdStrike, but it's an industry issue. It's about how I deal with my OT environment and how I understand risk. It can be and should be a learning experience for cyber providers, vendors, clients, and asset owners alike, to really understand the importance of comprehensive management of our environments and architecture: what my recovery plan is, and how I make sure I minimize these impacts in the future.

Dig into CrowdStrike. They had a good response and detailed information about what happened. You can look at CrowdStrike's page, Wikipedia, and a lot of folks out there talking about the technical nitty-gritty of what happened. But from a business, an OT, and an overall architecture side, there's a lot to be learned from this incident.

[00:15:18] Speaker A: Thanks for joining us on Protect It All, where we explore the crossroads of IT and OT cybersecurity. Remember to subscribe wherever you get your podcasts to stay ahead in this ever-evolving field. Until next time.
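
Editor's note: the idea raised in the transcript of rolling an update out to a smaller group and validating it before going wide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how CrowdStrike or any particular vendor deploys updates. The names `staged_rollout`, `apply_update`, `is_healthy`, the ring fractions, and the `kiosk-NNN` host names are all hypothetical.

```python
from typing import Callable, Iterable, List, Sequence


def staged_rollout(
    hosts: Sequence[str],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    ring_fractions: Iterable[float] = (0.05, 0.25, 1.0),
) -> List[str]:
    """Apply an update ring by ring and stop at the first unhealthy ring.

    ring_fractions are cumulative shares of the fleet (hypothetical defaults).
    Returns the hosts that actually received the update.
    """
    updated: List[str] = []
    for fraction in ring_fractions:
        # Cumulative number of hosts that should be updated after this ring.
        target = max(1, int(len(hosts) * fraction))
        ring = hosts[len(updated):target]
        for host in ring:
            apply_update(host)
            updated.append(host)
        # Check the hosts just updated before widening the rollout.
        if any(not is_healthy(host) for host in ring):
            break
    return updated


# Simulated fleet where the update crashes every machine it touches.
fleet = [f"kiosk-{i:03d}" for i in range(100)]
crashed = set()

touched = staged_rollout(
    fleet,
    apply_update=crashed.add,
    is_healthy=lambda host: host not in crashed,
)
print(f"{len(touched)} of {len(fleet)} hosts received the faulty update")
```

In this simulation the faulty update stops after the first ring instead of reaching the whole fleet. The trade-off is the one described in the episode: every extra ring adds delay before the remaining machines receive a fix they may genuinely need.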

Other Episodes

Episode 10

June 03, 2024 00:56:07

Tools and Techniques for Better Network Visibility and Vulnerability Management with Kylie McClanahan

In Episode 10 of Protect It All, titled "Tools and Techniques for Better Network Visibility and Vulnerability Management with Kylie McClanahan," host Aaron Crow...


Episode 24

September 16, 2024 00:52:03

Evolution of Maritime Safety: From Analog Beginnings to Digital Redundancies

In this episode of Protect It All, host Aaron Crow is joined by Christopher Stein from Royal Caribbean Group to delve into the fascinating...


Episode 1

January 23, 2024 00:03:08

Welcome to PrOTect IT All

In this episode, Aaron discusses: His background in IT, cybersecurity, and operational technology The vision of bridging the gap between OT and IT The...
