Enhancing OT Cybersecurity: From Legacy Systems to Cloud Solutions with Paul Shaver

Episode 32 November 18, 2024 00:57:11
PrOTect It All

Hosted By

Aaron Crow

Show Notes

In this episode, Aaron is joined by Paul Shaver, an experienced OT security consultant from Mandiant, part of Google Cloud. Together, they navigate the nuanced landscape of operational technology (OT) cybersecurity.

 

The episode begins with Aaron recalling a critical incident at a power plant that underscores the potential pitfalls in OT environments. This sets the stage for a rich discussion on the evolution of OT technology, with Aaron and Paul reminiscing about primary domain controllers and early NT workstations.

 

The conversation shifts to the future of OT in the cloud, where Paul highlights the benefits of cloud solutions, including enhanced resiliency, security, and data optimization through AI. A compelling customer case study illustrates modern technology adoption with web-based HMIs and Chromeboxes.

 

Paul offers a detailed analysis of the current OT cybersecurity landscape, addressing the persistent legacy system challenges and the need for a cohesive IT-OT security strategy. He discusses the evolving threat landscape influenced by global geopolitical tensions and the rise of zero-day vulnerabilities.

 

Listeners will gain practical insights into foundational cybersecurity measures, such as network segmentation, asset inventory management, and robust access control.

Key Moments:

 

04:14 Connecting IT and OT optimizes processes securely.

09:54 Lost production severely impacts manufacturing revenue recovery.

14:06 Ensure network notifications; control access, separate credentials.

17:10 Engineers need secure access to adjust parameters.

21:55 Endpoint detection on older systems is critical.

28:47 Resilience is crucial in CrowdStrike incident response effectiveness.

32:11 Limited resources for global incident response efforts.

39:22 Rebuilt domain controller caused authentication issues.

42:37 Focus on resiliency and cloud opportunities, leveraging multi-cloud.

44:59 Improve grid operations using cloud and hyper-converged technology.

48:38 Local cloud provides redundancy for remote sites.

51:15 Critical for acquisition process and problem-solving.

 

About the guest:

Paul Shaver has dedicated more than two decades to various roles in Operational Technology (OT), primarily within the oil and gas industry. His expertise spans OT architecture, design, and build, along with run-and-maintain responsibilities as an asset owner.

Before transitioning into cybersecurity, Paul served as a Technology Director for an oil and gas company in California. Driven by a burgeoning interest in security, he joined Mandiant nearly five years ago. At Mandiant, now part of Google, Paul relishes the mission of enhancing security postures in OT and critical infrastructure, contributing to significant advancements in the field.

How to connect with Paul: https://www.linkedin.com/in/pbshaver/

Connect With Aaron Crow:

 

Learn more about PrOTect IT All:


To be a guest or suggest a guest/episode, please email us at [email protected]


Episode Transcript

[00:00:00] Speaker A: You're listening to PrOTect It All, where Aaron Crow expands the conversation beyond just OT, delving into the interconnected worlds of IT and OT cybersecurity. Get ready for essential strategies and insights. Here's your host, Aaron Crow. Welcome to the show, Paul. Thank you so much for taking time. I'm excited to have this conversation. We've met many times in person and we've crossed a lot of the same paths. Obviously this world is fairly small, especially in OT cybersecurity. So why don't you introduce yourself, tell the audience who you are, what you do and who you work for.
[00:00:37] Speaker B: Yeah, awesome. Thanks, Aaron. Thanks for having me. So, Paul Shaver. I work for Mandiant, which is now part of Google Cloud and Google Cloud Security. I lead the ICS and OT security consulting team at Mandiant and I've been here about five years. Prior to that I spent about 20 to 25 years in various roles in OT. Spent the bulk of my career in the oil and gas industry and the majority of my career in OT architecture, design, and build, with some time in there with run and maintain at an asset owner. Prior to joining Mandiant, I was technology director for an oil and gas company in California and decided I wanted to make the full-time jump to security. Had spent a lot of years with security kind of as a collateral role doing those jobs and landed at Mandiant just shy of five years ago. And yeah, it's been a fun experience and I really enjoy the mission and what we get to do here. And now being part of Google, really expanding that and helping to improve security posture in OT and critical infrastructure and all of those buzzwords, the world that we live in.
[00:01:54] Speaker A: Absolutely, yeah.
And I definitely want to get there, but first I just want to kind of kick us off with, you know, the current state of OT cyber, from persistent changes to legacy systems, visibility, all of those basic types of things of the current state that we're at. Like you said, we've both been doing this for 20-something years, and some of the problems that we're having, we were having 10 years ago and we were having 20 years ago, and some of the same systems are running that were around 20 years ago when we started this thing. Right. So kick us off with where we are today from your perspective in this OT cyber space and where we sit.
[00:02:34] Speaker B: I think, you know, like you said, to your point, we're still facing a lot of those same struggles. Right. The same things that add complexity, the same things that introduce risk, that expand the attack surface for these environments, are kind of all still there, right? We still have legacy systems. We still have a lot of Windows XP running HMIs in these environments. Right. So there's a lot there. I think what we're seeing, probably in the last two or three years, is more and more companies starting to take the OT security of their estates, of their environments, more seriously. And I think we talked about this a couple times when we've hung out. We're starting to get to a point where we're not chasing a bunch of legacy problems and we're getting the chance to innovate a little bit. Right? The technologies are improving. The vendors in the OT device space are starting to build security into those OT devices, the PLCs and the communication cards. And so now we're starting to be on this kind of level playing field where we see greenfields being built and we've got a bunch of really good security capability that we've never had before. And so I think the life cycle on security is starting to get better with new products.
We're starting to catch up with this idea that we can't implement this control in this environment because the PLC doesn't support it or the comm card doesn't support it. So they're starting to balance out. And now, you know, again, there's also the expansion of the technology that's connecting to these OT environments, right? Organizations are connecting their ERP systems, their manufacturing engineering systems. They're using data to optimize their processes, or they're using data from the production floor to optimize financial process, right? So with those connections, we're seeing companies start to look at how they can connect between IT and OT and leverage that data. And so that brings in, you know, both a complication to how we do that and how we do that securely. But it also brings in the opportunity to have some really good conversation about what data flow inside the OT environment looks like. Not just what are we connecting it to on the IT side, but how can we better segment, how can we improve network efficiencies and bandwidth and latency and all those things in the OT environments, because we're starting to have conversations around security and introducing external connections. So I think the security conversation facilitates a lot of, like, how do we optimize these environments in ways that we can then leverage for security purposes as well.
[00:05:31] Speaker A: Yeah.
[00:05:32] Speaker B: So a lot to unpack in that statement. Right. But I do see, like, you know, five years ago, six years ago, I would have said, man, everything's a mess. Everything's vulnerable. Like, we've got to fix this. And.
[00:05:45] Speaker A: Yeah.
[00:05:45] Speaker B: And I think we're getting to a point now where we're seeing more companies spend the money, invest the resources, people, time to make those improvements.
And that says a lot for how fast we're addressing some of these problems.
[00:06:02] Speaker A: It seems slow, but it is really fast. Right. Especially in these spaces. So pivoting off of that and continuing down that thought process, have you seen the threat landscape shift in the last three to five years? We've seen so much more in the news around critical infrastructure and OT and stuff like that. So what have you seen in the threat landscape focusing on these environments and these OT spaces? Because we just see more and more in the news.
[00:06:31] Speaker B: Yeah. I mean, a lot of that has to do, I think, with kind of geopolitical struggles across the globe. Sure. And so, you know, we see threat actors from Russia-associated groups, China-associated groups, trying to establish a foothold and some kind of persistence in these environments, leveraging edge devices and the vulnerabilities that, you know, I can't remember the number. I just had this stat for another call. But a ridiculous amount of zero days in the last 12 to 18 months. Right. And a lot of those on edge devices, firewalls and load balancers and some kind of edge connectivity. And for OT environments, maybe it's a cellular modem, maybe it's a firewall that connects to some kind of wireless backhaul, or some kind of connection that leverages public infrastructure. That means you've got something potentially hanging on the Internet that's vulnerable. That's a persistence point. That potentially becomes an endpoint for use in a botnet. You know, there's a lot of that. So I think because we've got these threat actors that have some geopolitical type stake for leveraging that, and the increase in vulnerabilities and zero days that we've seen exploited, there's a lot more reason to be concerned.
But I also believe that we're doing a much better job of improving the security posture of these environments. If you go back, you know, that five, six years and you run a Shodan scan for, you know, Rockwell PLCs or HMI devices or whatever it might be, there was a lot of stuff connected directly to the Internet. And now when we run those research scans, when we're doing those types of engagements for clients where we're looking to see what their attack surface on the Internet looks like, it's way better than it used to be. So yes, the threat landscape changes. The attack surface broadens with some of these zero days and more edge devices. But I think we are doing a better job of identifying that stuff early. The threat intel is looking for these things and doing a better job of getting victim notifications out there. Right. So we're definitely evolving on both sides of the coin.
[00:09:07] Speaker A: Yeah, absolutely. It's changing so much and it's an uphill battle and it's constantly changing. Cyber is in general. But you know, OT, I think we all believe we're a little bit behind the eight ball. But there's reasons behind that. We have to move slower because there's physical implications to those things. Like we can't just go patch everything and hope for the best. You can't do that in a production environment because it can literally kill people.
[00:09:33] Speaker B: Yeah. I mean, you know, the best case is some kind of equipment damage in those cases. Right. Then you move to some kind of environmental damage or loss of life or limb. And you take all of those risks out of it and it's just a revenue risk.
[00:09:54] Speaker A: Right, right.
[00:09:55] Speaker B: These companies that are running a production environment, manufacturing a widget, you know, three days of lost production is three days of lost revenue potentially.
[00:10:04] Speaker A: Correct.
[00:10:04] Speaker B: Right. And you look at some applications in, like, refining or chemical processes or oil and gas: you have to shut in production for two or three days, and it might take you weeks to get back up to that same level of production. Because you have to get stuff up to an operating temperature, or you've got stuff being made in a batch sequence, and when you shut that process down, you have to start back at batch zero. And that might take four, five, six days, a week or more, to get back to that optimal level of production. So, you know, there's definitely a revenue impact to being able to patch and update in some of this. So you really have to look at compensating controls and good security hygiene and make sure that we're doing the things that are most critical to protect these environments. You know, you do all that before you start spending millions of dollars on network security products that are great, that do a really good job of helping you identify the things that are in your environment. But if you don't have good architecture, if you don't have good segmentation, if you don't have good access controls, then your network security monitoring maybe is just going to detect a thousand things a day because you've got lots of holes in the sieve, as it were.
[00:11:26] Speaker A: Which leads me to my next topic, and it's really around foundational cybersecurity hygiene. What are the must-haves for an organization to protect their OT environment? Right, you just kind of talked about that at a brief level. But you know, it's not all spending billions or millions or hundreds of millions or, you know, huge numbers.
Sometimes the foundational things that we need to do are just basic. They're process, they're people. They're not always buying the new technology and the Lamborghini; sometimes you just need to buy the screwdriver.
[00:12:00] Speaker B: Yeah, no, absolutely. You know, the biggest one is network segmentation. I can't tell you, and I'm sure you know this, you work in this space and you go out and support customers much the same way our teams do. The large flat networks in these OT environments still exist. The application servers, the engineering workstations, the HMI machines in the control room and the PLCs and the comm devices, sometimes even the network-connected instrumentation, all still on that same subnet, all still on one default VLAN 1 or VLAN 255, and all addressed in a private, you know, class C address space. And good Lord, there's so much out there, right? So network segmentation is the first thing, in kind of peeling back the layers of the onion to figure out what is on these networks. You know, the number of devices that you have: an asset inventory is kind of the next thing, trying to understand what's there so that you understand what the risks are. If you've got HMIs that are running embedded Windows XP, that's one level of risk versus maybe those are a very specific Linux kernel that was developed just for that machine. And there's a list of CVEs for that, that's like two or three CVEs for that device, versus, I don't even remember what the list of CVEs for Windows XP is. It's in the thousands, if not tens of thousands. Right. So understanding what you have becomes really critical. These are, again, all basic security hygiene stuff. And then controlling the access into the environments, making sure that the people, the machines, you know. We can't necessarily zero trust an OT environment. Again, the technology doesn't support it.
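The flat-network problem Paul describes can be spotted from an asset inventory alone. What follows is a minimal sketch, not a tool mentioned in the episode: it assumes a hypothetical inventory of (name, role, IP address) records and treats a /24 as the segment boundary, then reports any subnet where more than one role shares the same address space. All asset names, roles, and addresses are made up for illustration.

```python
import ipaddress
from collections import defaultdict

# Hypothetical inventory: (asset name, role, IP address). Illustrative values only.
INVENTORY = [
    ("hmi-01", "hmi", "192.168.1.10"),
    ("eng-ws-01", "engineering_workstation", "192.168.1.20"),
    ("plc-01", "plc", "192.168.1.30"),
    ("app-srv-01", "application_server", "192.168.1.40"),
    ("plc-02", "plc", "192.168.20.5"),
    ("plc-03", "plc", "192.168.20.6"),
]


def roles_by_subnet(inventory, prefix_len=24):
    """Group asset roles by the subnet each address falls in."""
    subnets = defaultdict(set)
    for _name, role, ip in inventory:
        # strict=False lets us pass a host address and get its containing network.
        net = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        subnets[net].add(role)
    return subnets


def flat_subnets(inventory, prefix_len=24):
    """Return subnets that mix more than one role, a sign of a flat network."""
    return {
        str(net): sorted(roles)
        for net, roles in roles_by_subnet(inventory, prefix_len).items()
        if len(roles) > 1
    }


if __name__ == "__main__":
    for net, roles in flat_subnets(INVENTORY).items():
        print(f"{net} mixes roles: {', '.join(roles)}")
```

With the sample data, only 192.168.1.0/24 is reported, because servers, workstations, HMIs, and a PLC all share it, while the PLC-only 192.168.20.0/24 is not. A real check would use VLAN and firewall-zone data rather than a fixed prefix length.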
Yeah, but you can ensure that if a new machine connects to the network, at least somewhere there's a notification from the switch, from the router, from the firewall, from the infrastructure somewhere that says, hey, something new connected, we need to know what that is. So having an idea of what's on your network is important, and then controlling who can connect to it, who can access those environments. Not having shared credentials between your IT and your OT environment. I can't tell you how many times we see it, even where, you know, you're not enforcing good password security in the OT environment. So the username is maybe different between IT and OT, but that guy that has access in both environments, he doesn't want to remember two passwords, and so he's using the same one on both sides. Right. So. And that's on a sticky note under.
[00:15:07] Speaker A: His keyboard or printed on the monitor that is sitting in the room.
[00:15:13] Speaker B: Yeah. The admin password is on a label tape on the side of the HMI out in the field. Exactly. Because the operators need the admin password when they have to change a set point.
[00:15:28] Speaker A: Well, and it gets to a good point too. And I've had this argument, this conversation, with people, especially people outside of OT that don't get it. Like, you walk into a control room in a power plant. The workstations are not locked. They never have to log in with a password. But there's a reason for that, right? A, first of all, there's multiple physical layers of security that you had to get through. B, there's not that many people that work there, and if you're not supposed to be sitting there, somebody's going to notice it. It's manned 24 hours a day, seven days a week. They've got cameras on it. Like, there's a lot of other mitigating factors that go into that.
But at the end of the day, if there's something going on and they need to stop something, like, you know, life, liberty, all this stuff. Like, you can damage equipment, you can kill people, you can blow up. All sorts of really bad things can happen. Avoiding any hesitation from that operator in being able to stop or start a process that can save a life is more important than having to log in with a password, forgetting his password, and having to type it three times. Because we've all done that. I know my password, but I have to type it six times because I fat-fingered it. Because it's a freaking complex.
[00:16:41] Speaker B: Yeah. 32 random characters and 10% of them have to be special characters and four capitals. Yeah. No, it's a completely valid point. Right. The operators shouldn't need to log in. Right. Typically, especially when you've got something that's manned 24/7. Right, sure. That person's always there. And operators might just have the ability to do set point changes, or have a restriction on how much they can adjust that set point, whereas the engineers that can actually change logic, change code, set that set point outside of a predetermined set of values, that absolutely needs a password. That absolutely needs two-factor authentication if you can enable it, and not having those shared credentials. And then again, it's policy too, right? It's having good security policy that is collaborative with the IT security policy and enforceable in the OT environment. You don't want these two drastically different policies because that becomes a nightmare to try to manage. But also, if you've got an OT environment and you can't use zero trust or you can't use two-factor authentication or your endpoint detection agents can't be installed on those machines.
Your policy has to be able to be collaborative with your IT environment to be able to make sure that it's manageable.
[00:18:26] Speaker A: So yeah, I mean, I have a great use case of that. IT was pushing policies down in an OT space and they were doing, you know, good work, and they rolled out a group policy through Active Directory that just locked workstations after five minutes of inactivity, or 15 minutes, whatever the number is. But there was literally a screen. It was on the corporate network, so it wasn't technically an OT device. Like, we have this argument all the time about was this an IT attack, was it an OT attack. At the end of the day, just like Colonial Pipeline, none of that was OT, but it impacted OT, right? They shut down the pipeline because of it. So does it matter if it was an OT attack? No, it really doesn't, because it shut it down, right? So same thing happened in this scenario. This was in a control room. It was a screen that had PI data. And, as we know, the operators run off of PI, right? Everything they do, they're running from PI and all the indications that come from that OSI PI server. Well, they locked the screen, and they didn't know what the password was because the operators never had to log into it. So they had to get an engineer who was not there; he wasn't on shift that day. So the person that knew how to log in couldn't. And they couldn't get the data they needed because an IT person, trying to do a good job, put this device into a group policy that locked down the workstation, and they had no access to it, right? So it's a prime example of a simple good policy, good hygiene conversation. Like, nobody would argue that it wasn't a good idea, except that they didn't understand all the implications that it was going to bring. And obviously it got put into a separate group policy, and that goes back to the whole asset inventory.
They knew that device was there, but they didn't understand the function of it, right? So yes, they knew it existed, yes, they knew they were patching it, they were backing it up, all that kind of stuff, but they didn't know what it was really used for. So they didn't know that when I roll this thing out, there could be implications to my production at a power plant, right? God forbid it's at a nuclear power plant that they can't control. That's a whole other line of problems.
[00:20:20] Speaker B: And it just comes down to that good hygiene, right? And I love the fact that there's an OT environment with Active Directory in it, right? This is another thing that we just don't see enough of. We don't see access credentials being managed even to that level, right? So definitely having an OT environment that has its own Active Directory, phenomenal, right? But it just comes back down to asset inventory and access controls and knowing what's there and making sure that you're building your Active Directory forest out, you're creating OUs, you're creating user groups that include these machines, and being able to categorize them for their purpose. But to your point, right, 99% of what we see that impacts OT environments starts in IT, right? It has to. Whether that's an IT system on the enterprise side or an IT system on the OT side, it doesn't matter. The attackers, for the vast, vast majority of these cases, have to have that access point, that initial compromise, and that's going to be the system that has the most CVEs, that potentially can't get patched. That's going to be a Windows operating system. That's going to be a Linux operating system. That's going to be something that is a traditional OS sitting in that environment.
That gives an attacker an initial compromise and a place to pivot from and do their recon. So, you know, protecting those assets becomes the most critical. Again, all the network security monitoring platforms that are out there, they do a phenomenal job of being able to detect anomalies and help organizations better understand what's in their environment and what's happening in their environment. But having endpoint detection capabilities on those commercial off-the-shelf operating systems that we're beholden to is critical. So for some of those older systems where you can't install an endpoint agent or you can't have EDR running or antivirus running, we really need to have that visibility of audit logs and system logs and application logs and pay attention to what's happening on those machines in a much better way. Again, good hygiene methods here: before you're spending all this money, know what data is available that can be leveraged for a security purpose.
[00:22:55] Speaker A: Sure.
[00:22:56] Speaker B: Without necessarily having it be quote-unquote security data. Right.
[00:23:00] Speaker A: Yeah.
[00:23:01] Speaker B: PLC system variables are a great example of this. I tell people all the time, if you can't afford a network security monitoring platform right now, let's get you to a point where you can, but in the meantime, CPU runtimes, PLC ladder logic scan times, memory usage on a PLC, those are all system variables that can be archived in your historian.
[00:23:26] Speaker A: Yep.
[00:23:26] Speaker B: Right. And once they're archived in the historian, they can be used to create an alert in your HMI alarm system. And that can be escalated based on, hey, the CPU runtime is abnormally high.
[00:23:42] Speaker A: Sure.
[00:23:42] Speaker B: Right. We can say, hey, something changed.
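The historian-based alert Paul outlines can be approximated with a simple baseline comparison. The sketch below is an assumption-laden illustration, not a method given in the episode: it supposes that PLC scan times have already been exported from a historian as plain lists of milliseconds, and it flags recent samples that sit more than three standard deviations above the historical mean. The function name, the threshold, and the sample values are all hypothetical.

```python
from statistics import mean, stdev


def abnormal_samples(history, recent, threshold=3.0):
    """Return recent samples more than `threshold` standard deviations
    above the mean of the historical baseline.

    `history` needs at least two values so a standard deviation exists.
    """
    baseline_mean = mean(history)
    baseline_std = stdev(history)
    limit = baseline_mean + threshold * baseline_std
    return [value for value in recent if value > limit]


# Hypothetical PLC scan times in milliseconds, as if exported from a historian.
SCAN_TIME_HISTORY_MS = [12.0, 12.4, 11.8, 12.1, 12.3, 11.9, 12.2, 12.0]
RECENT_SCAN_TIMES_MS = [12.1, 12.3, 19.5]

if __name__ == "__main__":
    for value in abnormal_samples(SCAN_TIME_HISTORY_MS, RECENT_SCAN_TIMES_MS):
        # In practice this would raise an HMI alarm or open a triage ticket.
        print(f"Scan time {value} ms is abnormally high; check change records")
```

With the sample data only the 19.5 ms reading is flagged. The same comparison would apply to CPU runtime or memory usage, and a flagged value is the cue to check whether an approved change explains it.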
Let's go figure out what changed. Can we tie that to a change management process? Look, you have a security triage playbook right there in front of you, and you don't have anything but the normal process data; we're just looking at it in a different way. So that's really valuable for the small operators that can't afford big security budgets. They can't afford the tools, and maybe they're working towards that. But there's lots of ways to start at the ground level to protect these environments and detect when something changes.
[00:24:19] Speaker A: Yeah. And when I started this process, it was before we really called it OT cyber. Right. And for a lot of the things that I was rolling out, I didn't have a budget. Right. I was working as an asset owner in a power utility. And you know, I've told this story multiple times. But when I went to the plants and was explaining what we were trying to do, I didn't sell them on cyber because, again, this was 2010; nobody cared about cyber. It wasn't a concern that these asset owners were really all that focused on. What they did care about was their bottom line, availability, you know, anything that could make them run more efficiently. So what you just talked about, right, helping them monitor. We were getting logs out of all these systems that they already had, and we were just showing it to them in a way they hadn't seen before. Right. So when we first turned on our first Splunk server in like 2010, at a power plant in an OT environment, monitoring a critical DCS control system, within seconds we started noticing something bubble up to the top. We found that there was a redundant switch that they didn't know. It was sitting there beeping and having an error code.
And it was telling the world, but nobody was listening, that it had a problem. And it was actually down; it had administratively taken itself down because it was overheating. Well, we found the device and we went over to it and the fan wasn't spinning. So it said, you know, it's overheating, fan's not working in the power supply. It was a Cisco switch. We walked over to it and there was a zip tie in the fan, like through the case, blocking the fan from spinning. So we pull the zip tie out, the fan starts spinning, the error code goes away. But nobody had ever seen that, or walked by it and understood what that red blinky light was, or logged into it and known. So these tools, they're not just cyber related, right? They're efficiency related. They're the reliability, the resilience. It's more than just cyber. I know cyber is the big thing right now. But at a power plant, if they have to choose, it really comes down to availability. A lot of these critical infrastructures, they care about availability and safety, safety and availability. Cyber is one risk mitigation, it's one attack vector, but they really care about that availability. So all of these tools, yes, they give cyber data, but they can also make it more reliable and give data to operators that can help them run their plant better. Right?
[00:26:39] Speaker B: Yeah. No, I mean, that's the name of this game. Right. And I think we focus a lot on security, and security is important, keeping these systems. But I look at it from a resiliency standpoint. Right. And to that point, uptime is the most critical factor.
[00:26:56] Speaker A: Yep.
[00:26:56] Speaker B: Right.
Whether you're a power grid, you know, in the hottest part of the summer, or you're a gas producer in the winter, or you're pumping the wastewater out of the sewer so it doesn't back up in somebody's house. Right. Uptime is critical. And so a lot of what we look at in better protecting these environments is also putting resiliency at the forefront of this. We want to layer in those levels of defense. We want to be able to detect and respond more quickly so that we keep these systems as resilient as possible and keep uptime. But that also enables us to have quicker recovery capability. So when we do have a problem, when there is a compromise, when we don't know if or how an OT system is compromised when the IT system is, the tools that we have in place for response capability, for forensic capability, for detection speed, the recovery time. Right. I use the hurricane analogy a lot. Right. If you've lived on the Gulf Coast and you've dealt with hurricanes, you know that you're not going to stop a hurricane from making landfall.
[00:28:10] Speaker A: Right?
[00:28:10] Speaker B: Right. But if you've got good defensive measures in place, you board your windows up, you sandbag your doors, you keep the trees away from the roof line of your house so they're not going to blow onto it, you're building good defensive capability, and then the resiliency is there: recovery time shortens and you have less overall impact from a storm than you would if you didn't do those things. And so the cyber defense comes into the same kind of aspect: better prepared means quicker recovery, better resiliency.
[00:28:47] Speaker A: Well, and we had a great example of this worldwide incident with CrowdStrike. Right. And we saw a difference. And really the difference in how people responded was how they had built resilience into their system.
Like, how well did they, have they tested their incident response and recovery plan? How, how well have they tested their backup plan? How well have they gone through all these steps? Because at the end of the day, yes, it was a patch and it was a Microsoft thing and all the different key factors that went into it, but ultimately it was the, the owner of the systems. In my perspective, this is just Aaron speaking, not anybody else, but it was ultimately their fault. Right. You know, I would never have pushed updates to my entire fleet of systems in the power industry. When I supported 45 power plants, I would never have just blindly sent them all at once. Like, I had a staged way that I did that. I'd, I'd test one system, make sure it didn't break anything, and then I'd roll it out to another system. Okay, it didn't break the second system, and then I'd start rolling out to more people. But I would do that per site. I didn't just blindly say, oh, it worked okay in the lab. Send it, right? [00:30:00] Speaker B: At 5 o'clock on a Friday. Yeah, yeah. And to your point, like the recovery time. So I had a meeting with a customer that week. I think it was probably a day or a day and a half later. And I got on the call a couple minutes early and I was fully expecting him to not show up because I know he's a CrowdStrike shop. And I was like, I hadn't heard from him. And so I sat on the call and when he popped on, I was like, look, I said, if you want to reschedule this, I'm sure you're busy doing this. He goes, no, we're fine. He goes, less than 18 hours. He said it's like 16 hours and 27 minutes or something. Everything was back up and running. And I was like, wow, how'd you do that? And he goes, disaster recovery systems were tested and they worked. And we had a plan and a process in place. He said, admittedly, he said there was a guy that couldn't sleep and he was up at 1 o'clock in the morning and saw something weird happen.
And we got on it probably quicker than most people were able to get on it, he said, but even if we hadn't had that, he said we'd still be talking about 24 hours rather than 18 hours or whatever it was. Right. So having that plan in place, having that capability and knowing that it's tested and you know you can push a button and roll it back is incredible. And you know, if you know about the CrowdStrike problem, right, that's not just a, that's not just a roll back the, right, the application. Right. That was a boot level process that needed to happen and their recovery processes supported that, which is incredible. And that was, you know, that didn't have a, it had a minimal impact to their production environments, but it did have an impact to their production environments because they did have application servers and data servers that had the endpoint agent on them. So. [00:31:51] Speaker A: Yeah, yeah, it's insane. I have a couple of stories from folks. Same thing having those conversations and it was either people were down for a week and they were panicking because Delta Airlines is a prime example of that where they had kiosks all around the world. And it wasn't a problem of they could do it but they had to physically touch things. So they, it was, it was a, it was a body problem. They. So you know, we sent people out, they were getting people from every consultant firm they could find. And that gets to a bigger problem of you know, making sure that you know, a lot of folks will put their, their, all of their incident response plan into a, we've got a retainer with Mandiant or whomever. Right. And as good as you guys are and all of us are, there's only so many of us, right. And when a global incident like this happens, there's only so many competent and qualified people that we can put on a plane and send out to put hands on keyboard.
So there, there's a problem, a resource problem from that perspective because everybody uses Mandiant because you guys are lead, you know, industry leading in that space, rightfully so. But again, how many people do you have that you can send out to do this? And when every company, when you're talking Amazon and Facebook and Delta and American Airlines and Southwest. Well, Southwest didn't get it because they were running really old systems. But that's a whole nother company conversation. [00:33:08] Speaker B: Yeah, yeah. Your systems don't even support the agent, back to XP. Right. [00:33:13] Speaker A: Yeah, yeah. But you know that, that, that's a bigger problem. And, and part. And it goes to the, the overall resilience. And, and, and when we talk about resilience we're not just talking about technology, we're not just talking about architecture. Yes, that is a, that is a factor and that is a, a piece of, of building a resilient system. But it's also the oh, the, you know, break glass, you know, the oh, handlebar. Right. It's like we know something's going to happen. Have we planned for that? Like have we thought through some of those things? And you're never going to think through everything but if you go through that, that exercise, thinking, okay, I know I designed an awesome system, I'm really smart, good job, Aaron. But what could go wrong and bringing in outside people, have you thought about this? Have you thought about that, like going through those exercises. And I think that's the biggest piece of people that probably came out a little better than others, because they probably had those conversations and their system was more resilient because of that. [00:34:07] Speaker B: Right? Yeah. And you're absolutely right. So incident response retainers, right? Yes, you should absolutely have an incident response retainer with Mandiant. Yep. We've got a free one by the way. So I mean that just gets paperwork in place.
You know, worst case scenario, we don't go through the. However long it takes to get contracts signed. You should have that capability with as many providers as you can have. Because when a bad day happens, you know, it's not always, you know, not always possible to answer the phone. And to that, to that point, right. When you've got this mass conflag situation. Right. If we had this huge compromise that was affecting the entire globe, it's going to be all hands on deck. And Mandiant and CrowdStrike and Palo and all of these organizations that have incident response teams are all going to be working together to solve it. We've seen that happen before. We know that the collaboration is there when the, we're all in competition on a day to day basis in some cases. But when those bad days happen, this is a, you know, this is a critical mission field and everybody comes together and we get the job done. Yeah, but having contracts in place and not having to wait that additional 24 or 48 hours to get paperwork signed or especially during a compromise when you probably are going to want that contract to be a three way contract with counsel and you know that's going to make that process a little bit longer. Counsel's not going to want to accept web terms or whatever the base level terms and conditions are. So go through that exercise. I can't stress this enough. Get the paperwork done ahead of time. I know that it's, you know, in some cases you're maybe spending a little bit of money, but having that level of assurance that somebody's going to answer the phone is, is well worth it. So. [00:35:59] Speaker A: Yeah, so obviously there's, there's differences and, and that's, that's the complexity of this, right, is, is it's, it's way beyond. And again, I know we hang our hat on cyber security, but I think you and I both come from a background of we did all this way before we did cyber. Right. It all kind of gets lumped underneath that cyber umbrella now.
But there's so much more than just cyber. It's, it's, it's understanding my business process. And you know, there's, there's influencers online that, you know, talk about how it doesn't matter if you have an asset inventory, you can protect it. It's just like, okay, you're right, you can, but the more information that you have, the better off you're going to be. You know, again banging on the, on the Colonial Pipeline thing, one of the problems that they had was they didn't know how far it went because they didn't have a true understanding of their environment and they didn't know where that, that delineation from IT to OT was. And so because of that, they made some safe choices, which they should have. They did the right thing. But this goes back to understanding, you know, that foundational understanding of your environment, understanding what your assets are and not just a list of them. Like, I don't just want a list. I need to understand what each, what you, what they all do and what they're. If I take that one out, does my system fall down? Like, does the deck of cards fall and can I replace that one without breaking things? Can I work on unit? You talked about the flat network. I've seen environments where, you know, I take down unit one and I, when unit two is online and anything I do is actually impacting a production run, you know, line that's running. Right. And obviously that's not designed well. Right. I need to think about all those things. If I need to be able to take one down without impacting the other, what shared services do they have that are, that are going across those that I would have to have both of them down to be able to work on, or worst case scenario, a hacker got it or it just broke and failed. Am I going to bring down two, two lines, two power plants, two whatever, because I only want to have one switch or one fiber run or one server or whatever.
The thing is, those are all considerations when you're designing these systems or when you're assessing these systems that you need to have that perspective when you're looking at these things. [00:38:12] Speaker B: Yeah, yeah. It's funny, it reminded me like take, you know, impacting one unit and then, and then something else happening. Like we had something similar where it's a COM card, one of the early generations of EtherNet/IP COM cards, and if you disconnected that one unit or you turned that one unit from run to program, the COM card would fault. And for whatever reason when that COM card faulted, because there was a communication string between two other units for load balancing, it would just like the other two units, you could take them completely offline, it wouldn't even bother it. But that one, that one lead unit always took things down if you, if you took that one offline at the wrong time. So forgot about that. [00:38:58] Speaker A: I had, we've been doing this for too long. I know. I had a similar one way back in the day. And again they had a shared Active Directory between multiple units at a power plant. And the vendor was in doing a control system upgrade and they were upgrading unit one. Unit one was offline, unit two was still running and they replaced a domain controller. And when they replaced the domain controller they used a little DVD that booted up and was supposed to go through all the scripts that the vendor had put that did all the things, added the OUs and the, all the group policies and all that kind of stuff, but it didn't work. So what, what the technician did and what they'd been told to do is they just rebuilt a brand new domain controller from scratch. So they built a new forest using the same name and IP address and domain controller name as the old one. But it's on one big flat network.
So this new domain controller with the same name, IP address and everything comes online and as they start adding the unit, whatever unit was offline, I don't remember which one it was. When they added the new assets to this new domain controller, they all worked. The problem was they started having authentication and token issues on the other unit because those devices were trying to authenticate with this domain controller that was the FSMO role holder. So it had PDC emulator, all that type of stuff on it. So unit two that was running, all of its authentication started failing. And then all of a sudden in the control room, all at once, all the numbers on the, on the screen smurfed and went to zero because they had no indication whatsoever. Now, the plant was running fine because the PLCs and all the controllers in the field were doing their thing, but everything the operator could see just went dark and they panicked. Obviously they're trained to punch the unit out when they can't control it or know what to do. But we, we spent two or three days proving what happened because they had no idea, the vendor didn't know what happened. They were just like, we didn't do it. We were working on unit two. There's no, there's no way what we did had any impact, except that it did. [00:40:49] Speaker B: Yeah, yeah. Not that you. And not to point out that you dated yourself in there because we both, you said PDCs and we both nodded like primary domain controllers haven't been a thing for. I don't even remember how long at this. I know I've been doing this for way too long. [00:41:04] Speaker A: Exactly. Dude. I was a domain. I was a domain admin at, at AT&T and an Exchange admin supporting Exchange 5.5. So that tells you how long I've been doing this. [00:41:14] Speaker B: I, when I first joined the military in, in the mid-90s, I was one of the first generation coming out of high school, joining the military that had had a personal computer at home.
[00:41:25] Speaker A: Right. [00:41:25] Speaker B: And when they started rolling out NT workstations on board ships, they didn't have anybody to do IT work. And so they sent a bunch of us to A+ and Network+ and basic Microsoft training and we rolled out NT 3.5 on board warships like, like. Yeah, it's been, it's been a minute. [00:41:46] Speaker A: It has. So thinking about that. And so we went from one extreme. We just talked about how we aged ourselves in this conversation and how some of the older technology, I mean, I've got a Mac over there. That, that, that was actually mine. So that, that tells you how, how long I've been doing this stuff. But I actually had that when that was a new, new system. So. But swinging to the other side, especially since Mandiant is part of Google now, what are the opportunities in, you know, the future of OT in the cloud? Because we know it's coming. Even though everybody's like, oh, we're never going to do that, but, you know, OT is going to be in the cloud. Many, many systems are already going to cloud in some capacity. Obviously not everything, but talk to me a little bit about your thoughts on, you know, kind of moving OT systems in the cloud and that emerging trend of, you know, at least thought leadership on it. [00:42:37] Speaker B: Look, I think we've got a lot of really great opportunities to, back to that kind of resiliency conversation, to keep these systems up and online and operational and withstand, you know, some attack, withstand, you know, system failures. Because we've got that level of resiliency built in. And so there's some really good opportunities back to that resiliency side of this.
From a security side, same thing, we push these systems, the cloud connections, running application data, application servers in the cloud I think makes a lot of sense because you have a lot of horsepower to use that data for optimization, for analysis, using the gen AI against that data and helping to better understand your process or there's a lot of really cool opportunities there. And I think to your point, OT in the cloud, it's going to get here quicker than we think. I think we've got a really interesting opportunity for once on the industrial control side and controls engineering side to be ahead of a technology curve and really start thinking about how we build frameworks and build standards and, and how we do this in a way that makes sense and leveraging multi cloud too. Because back to the resiliency, not everything should be in one cloud provider because you want to make sure that, especially where you've got critical infrastructure systems, you want to make sure that you diversify that as much as possible. So there's a lot of really good opportunity and I think we have an opportunity as security practitioners, as automation engineers, as control system engineers to get ahead of that technology curve. And we're getting, as Mandiant, we're getting to have some really good conversations with some of our, you know, some of our Google Cloud partners that are working in manufacturing spaces, working with critical infrastructure, you know, even at a, even at a resiliency level of keeping the grid more operational. Right? We've got NERC CIP in place. Your medium and your high assets have to have electronic security perimeters and we've got to keep those systems protected based on those regulations. But grid stabilization could be a whole lot more effective and efficient. And I'm not saying flip the switch and go do this. I'm saying let's have a really good conversation around how we do it in a way that makes sense.
But you can improve grid operations, you could improve grid interoperability by pushing a lot of these systems to the cloud. You could reduce the potential impact that your IT systems being compromised has on your OT systems if you're leveraging some of these more resilient systems that the cloud offers. And there's also some really great ways to use, you know, cloud hyperconverged technology down at the on prem level. If you've got to meet a regulatory requirement or you've got to meet a data sovereignty requirement and bring cloud technology down to your local on prem edge. You know every, every cloud manufacturer has this, but we've got Google Cloud Edge where we can, we can put a Google Cloud hypervisor that normally sits in one of our data centers on site, in your, in your environment, have all the processing horsepower, AI, all those tools sit locally and then have a backend connection to the cloud for management, for security, for patches and updates. And so you're improving some resiliency but you're keeping your data local, you're maintaining that security perimeter. So again I think we're in a really interesting place where we still have a 15 to 20 year lifecycle on the hardware in the field. And that might be true, but we have an opportunity to start leveraging some newer technology to support all of the backend infrastructure that supports that. We had a customer a couple years ago, they were doing a big project to rip and replace some of their OT hardware on their manufacturing plant floor. They realized that they could move to all HMI based or web based HMI servers. So all their applications got moved to web based applications. And in the process of doing that they didn't need Windows machines on the plant floor anymore to support that and they moved to Chromeboxes. So they've got these little, you know, $199 Chromeboxes on the manufacturing plant floor.
All their operators are using a web based interface that they're accessing through Chrome. Those things are completely replaceable in 15 minutes. You know, they get, they plug a new one in, you get a new IP address, you log into the, to the web application and you're back up and running. So it makes that, that plant floor and they're managing those things with their Chrome Enterprise solution. And so they've got all of the security capabilities that Chrome Enterprise has that Google uses when we use our Chromebooks for ourselves. You've got all that capability to protect those environments. So it's really interesting to start seeing some of this progression and where organizations are able to leverage it and not break a regulatory requirement. But again, starting to really have forward thinking conversations about what this really looks like. And I'm, I don't want to, I don't want to spoil the surprise, but I think in 2025, we're gonna have some really good conference talks on this subject. [00:48:05] Speaker A: So, yeah, I mean, and ultimately the irony is many of our sites are not that far. I mean, I literally just did an assessment at a power plant in West Texas and you know, it's probably a, a, you know, 20, 30 year old plant. It probably hadn't been upgraded that much along the way. But they're using those thin clients now. The only difference is the server is in the room, so they have a thin client sitting on the table for all their operator screens. And the remote server is just in a server rack, just, you know, in the same room. Yeah, they don't have to know. That could get moved to the data center or the cloud. It doesn't have to go to Google or Amazon or Microsoft. It can go to your own local cloud and you can have that redundancy. And you know, one of the hesitations that I've seen at these sites, you know, power plants are not located in usually highly dense populated areas. They're in the middle of nowhere.
But now with the technology that has changed from cell phone signals and speeds on cellular communications, not to mention, you know, the whole, you know, satellite communications and how much faster that has gotten because of, you know, Elon and all the, all the technologies that they've brought, like this is a, anywhere in the world you can do this type of stuff and support. Especially because I'm not sending an entire thing. It's a web hosted application. So I'm just sending basic data back and forth. I'm not having to, I'm not streaming Netflix. I don't have to have the same bandwidth requirements and it's available and I could have multiple paths. I can have cellular, I can have satellite and I've reduced the latency so all of it can work in these environments and we're not that far off. Like the operator experience, they probably wouldn't even know that you moved the server out of the room because it's the same experience that they are currently having. It's just not in the room. It's just now down at your local data center or in Google. Right? [00:50:00] Speaker B: Yeah, yeah. I mean, and that data center can be anywhere, right? To your point, it doesn't necessarily have to be a cloud based data center. Right. You could have, you know, a server running in your buddy's garage in Honolulu and have a, you know, again, that can go anywhere. So the technology that we've got is enabling a lot of that. You know, I want us, I'd love to see us start leveraging some of those newer connectivity technologies for, you know, replacing some of these antiquated. There's a customer that we just did an assessment for. The cell modem that they're using is like 7 years old, 8 years old at this point, like one of the first generations of LTE cell modems. And it's still got default creds. It's never been updated. Okay, let's take a hard look at this.
But at the same time, we got an opportunity to, if it's just connection technology, you've got an opportunity sometimes to replace that, to upgrade that, to improve the security posture of it. But if we don't know what's there, back to the asset identification conversation. If we don't know it's there, we don't know we need to replace it. So. Or update or whatever. [00:51:15] Speaker A: And what, what critical, how is it critical to my process and what do I use it for? Because that's usually when I. A lot of the things that I do and one of the more hard, more difficult things that I've done is, you know, acquisition. So a company buys a power plant or manufacturing facility or whatever from another entity and they don't necessarily understand how to transfer that thing out. Right, right. So understand. So a lot of times what we do is we, we pull something like we transfer something over and we wait till stuff breaks and then we fix it on the fly. Right? Yeah. Because they don't understand the process enough to know what's going to break whenever we start changing things. So we just have to be ready. And I've got a SWAT team that's sitting there, okay, go fix this. Okay, go fix that. You look at this, you look at that. Right? And then we, we get it working, but we have to do it in concert with the operators because they're the ones that truly understand something's not right. And then we have to start troubleshooting and figure it out because ultimately there isn't anybody that really understands the whole business process around it to be able to say, hey, you can't move this until I've done this, because this is dependent upon that. Right? [00:52:15] Speaker B: Yeah. Yeah. [00:52:16] Speaker A: So. So I always like to wrap up with a, you know, in the next five to 10 years, what's one thing that you are maybe excited about coming up over the horizon? And maybe one thing that's concerning regarding OT cyber that again that you think may be coming up, good or bad?
[00:52:33] Speaker B: I think it's the same answer for both. And it's the last thing we just talked about. Right? [00:52:37] Speaker A: Yeah. [00:52:37] Speaker B: OT capability in the cloud. I'm really excited about the potentials that it brings for resiliency and interoperability and all of those things. But to the point I made earlier, if we've got a chance, for once, we've got a chance to be ahead of the curve on this. And I think we need to be having the conversations now about how we secure it, how we build good frameworks, how we build good standards, and how we build something that, that for critical infrastructure is unifying, that supports all of these sectors. Right. We've got some proposed updates to the security directives from the TSA coming out. Right, there's, that's open for review now. I think they're due in February. Who knows how that changes potentially with the, you know, administration changes that we're about to go through. But we've got a great opportunity for, you know, to bring all of this under one umbrella when we start talking about OT security and OT systems in the cloud. And if we start having those conversations now, if we start forward thinking on that and having those opportunities for conversations now, it removes some of that cause for concern, but it definitely is not something that we can wait on. [00:53:52] Speaker A: Yeah, absolutely. I agree, man. We're at a place where we can start thinking and future proofing this. We know it's coming, so we need to start planning for it. And we find the least critical processes and start doing those. Prove the process out and then rolling it out to others. You don't just blanketly send it like we said earlier, right. You find a pilot, you test it out there, you find the weaknesses and then you enhance that root cause and continue to improve. And then you start rolling it out and rolling it out. And the cool thing with this industry that I love is that there's a lot of sharing, right?
So power industry is great in that, right? Because they're not really competing. You know, I'm not competing with the power plant down the street because we're all selling our power. So I'm, I'm happy Duke Energy shares with, you know, Luminant and NextEra and all the others because they're not exactly competing with each other. In fact, they buy each other's plants all the time. So, yeah, I mean, that's a, that's a good thing. I'm excited to see what, what Google has coming out because I agree with you. I think that the, the cloud is the, is the future and we need to make sure we find a good way to do it so that we're not forced to go in blindly and we're not 20 years down the road and just starting to implement technologies that could be done today to really make a difference. [00:55:06] Speaker B: Yeah, agreed. [00:55:07] Speaker A: Awesome. Well, thanks for your time today. Anything you want. Kind of closing out call to action. Anybody you're going to be anywhere you want somebody to see, read your book or listen to you on a podcast or any of the things. Buy a cigar or a bourbon. [00:55:20] Speaker B: No books for me but for this community, look forward to seeing everybody in Tampa here in a couple months. Yeah, always good to get a chance to spend time with this community. And I think that's S4 in Tampa and previously in Miami. It's always a good time for this community to get together and solve the world's problems. So looking forward to that. And shout out to my buddy Mike Holcomb who has spun up ICS BSides. It's going to run alongside S4 this year. So, so big thanks to Mike for the effort that he's putting in there. Really excited about getting a chance to have a BSides associated with ICS and excited to see how that grows alongside S4 continuing to grow. [00:56:11] Speaker A: So yeah, 100%. I love that. I'm definitely going to reach out to Mike and get him on here to talk about that because I think it's huge.
Like we have these BSides and they correspond with Black Hat and RSA and all these other ones, right. We're having one for ICS. As important as it is and as big as it is, it'll complement S4 and not take anything away. Just, just be an add on for us to really get that message out and have more conversations. So like you said, so we can solve the world's problems. [00:56:36] Speaker B: Yeah. Awesome. [00:56:38] Speaker A: Awesome, man. Hey, good to see you. Thanks for your time today and until next time, if nothing else, I'll see you in Tampa, sir. [00:56:43] Speaker B: Thank you sir. Good to talk. Catch up later. [00:56:46] Speaker A: Thanks for joining us on PrOTect It All, where we explore the crossroads of IT and OT cybersecurity. Remember to subscribe wherever you get your podcasts to stay ahead in this ever evolving field. Until next time.
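
Editor's sketch: the staged update process Aaron describes in the episode (test one system, then a second, then widen, and do that per site rather than pushing to the whole fleet at once) can be illustrated roughly as below. This is a minimal sketch under stated assumptions, not a tool either speaker presented. The names `staged_rollout`, `RolloutResult`, `apply_update` and `is_healthy`, and the host names in the demo, are hypothetical stand-ins for whatever patching and health-check mechanism a site actually uses.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class RolloutResult:
    updated: List[str] = field(default_factory=list)  # hosts updated and verified healthy
    failed: List[str] = field(default_factory=list)   # hosts that failed the health check


def staged_rollout(
    sites: Dict[str, List[str]],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> RolloutResult:
    """Update one site at a time in three stages, halting at the first unhealthy stage.

    apply_update and is_healthy are hypothetical hooks supplied by the caller.
    """
    result = RolloutResult()
    for hosts in sites.values():
        # Stage 1: one test system. Stage 2: a second system. Stage 3: the rest of the site.
        stages = [hosts[:1], hosts[1:2], hosts[2:]]
        for stage in stages:
            for host in stage:
                apply_update(host)
            result.failed = [host for host in stage if not is_healthy(host)]
            if result.failed:
                # Stop here so the problem never reaches later stages or other sites.
                return result
            result.updated.extend(stage)
    return result


if __name__ == "__main__":
    # Hypothetical fleet: two sites, with one host that breaks after the update.
    fleet = {
        "plant_a": ["a-hmi1", "a-hmi2", "a-hist1", "a-eng1"],
        "plant_b": ["b-hmi1", "b-hmi2"],
    }
    broken = {"a-hmi2"}
    outcome = staged_rollout(
        fleet,
        apply_update=lambda host: None,  # placeholder: no real patching happens here
        is_healthy=lambda host: host not in broken,
    )
    print(outcome)
```

The design choice worth noting is the early return: a failed check at any stage leaves every later stage and every other site untouched, which is the property that limits the blast radius of a bad update.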
