BC/DR Planning: Brittleness of the Modern Internet
10–20–2025 (Monday)
Hello, and welcome to The Intentional Brief - your weekly video update on the one big thing in cybersecurity for middle market companies, their investors, and executive teams.
I’m your host, Shay Colson, Managing Partner at Intentional Cybersecurity, and you can find us online at intentionalcyber.com.
Today is Monday, October 20, 2025, and we’re going to look at some real-life struggles around Business Continuity and Disaster Recovery, brought to us by AWS.
BC/DR Planning: Brittleness of the Modern Internet
The big story around the Internet today is, ironically, that a significant portion of the Internet is unavailable, due to what we now know is a DNS error in Amazon Web Services’ US-EAST-1 Region.
Beginning just after midnight this morning, AWS began experiencing what they call “service impacts”. While the term might seem somewhat generic or benign, the impacts were far from it.
According to their own published updates and timeline, it took several hours to identify the issue and push the mitigations, and more hours still for those mitigations to work their way through the deeply interconnected systems that we use every day on the modern Internet.
DNS, or the Domain Name System, serves as the bridge from a domain name (e.g. amazon.com) to the actual IP address your computer needs to communicate with in order to retrieve that web page, application, data, etc. When that lookup isn’t available, your request fails, and so does your page load, app load, etc.
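To make that concrete, here’s a minimal sketch of what that lookup looks like in practice and how a resolution failure surfaces; the hostname is just an example, and the error handling is illustrative rather than anything AWS-specific.

```python
import socket

def resolve(hostname: str) -> list[str]:
    """Resolve a hostname to its IPv4 addresses, raising on DNS failure."""
    try:
        # getaddrinfo performs the same DNS lookup your browser does before
        # it can open a connection to the site.
        results = socket.getaddrinfo(hostname, 443, family=socket.AF_INET)
        return sorted({entry[4][0] for entry in results})
    except socket.gaierror as exc:
        # This is the failure mode a DNS outage produces: the name simply
        # won't resolve, so nothing downstream (HTTPS, the app) can even start.
        raise RuntimeError(f"DNS resolution failed for {hostname}: {exc}") from exc

if __name__ == "__main__":
    print(resolve("amazon.com"))
```

When resolution fails, everything built on top of it fails with it, which is why a DNS issue in a single region can ripple so far.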
In this case, it also caused a cascade of failures both within AWS itself, and across thousands of companies and their digital estates.
The BBC has been doing a good job of tracking this in a sort of live-blog style that’s common in these types of incidents, noting that some applications took more than six hours to come back online, and others are still offline.
The challenge here is that we take modern Infrastructure-as-a-Service (IaaS) or Platform-as-a-Service (PaaS) components for granted because their uptime is so robust. In many ways, we don’t even think about what an outage might mean for our business, or if we do, we look at running in multiple availability zones and see a significant increase in both cost and complexity.
It will be interesting to see what business leaders decide to do after today’s outage, which came early in the US workday, but squarely in the middle of the day for the UK and many others.
Couple this with the fact that US-EAST-1 is one of the oldest and largest regions within AWS, and that the initial issue caused a raft of downstream issues, further complicated by every tenant in that region attempting to run their own restorations on top of the impacted services, and you can see how things can spiral.
But, as we often do on this show, we should come back to understanding those things we can control, and accepting the risks inherent in these sorts of arrangements. While outages do happen, they’re extremely rare for AWS (or any of the other hyperscalers like Microsoft Azure or Google’s GCP).
The alternative is that you add significant cost and complexity into your own environment to either run in multiple clouds, run in multiple availability zones within the same cloud, or keep failover infrastructure otherwise available. While a fallback plan sounds like a good idea, it’s often a point of diminishing returns, creating significant overhead and friction around routine requirements like patching, updating, deploying new software, backups, rotating hardware, or the myriad other mechanics that AWS abstracts away when you leverage their platform.
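To make that trade-off concrete, here’s a minimal sketch of the kind of client-side failover logic a multi-region or multi-cloud setup implies. The endpoint URLs are hypothetical, and every branch of this is something you now own, test, and keep in sync alongside your primary deployment.

```python
import urllib.error
import urllib.request

# Hypothetical endpoints for illustration only: a primary deployment and a
# warm standby. Each one is a full environment you have to patch, back up,
# and keep consistent with the other.
ENDPOINTS = [
    "https://api-primary.example.com/health",
    "https://api-standby.example.com/health",
]

def first_healthy(endpoints: list[str], timeout: float = 3.0) -> str:
    """Return the first endpoint that answers its health check."""
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url
        except (urllib.error.URLError, TimeoutError):
            continue  # primary unreachable; fall through to the standby
    raise RuntimeError("No healthy endpoint available")
```

Even this toy version hides real questions, like how stale the standby’s data is allowed to be and who notices when the health check itself is wrong, which is exactly the overhead the paragraph above is describing.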
Beyond that, they have Site Reliability Engineers and other responders who are able to engage at a speed and scale that we’re highly unlikely to match here in the middle market.
Instead of moving away from the cloud, I would encourage you to think through fallback and alternative approaches for critical workloads, identify the ways in which you may have adopted (knowingly or unknowingly) a single point of failure, and - in some cases - simply accept the risk and wait for AWS to get back on its feet.
I know that’s not always the answer we want to hear, or the one we want to give our executives, Board, or shareholders, but sometimes it’s still the best path forward. Put your energy into building robust recovery playbooks to get back online and functional once the infrastructure has returned, and minimize the impact with proactive customer communication and alternative engagement models.
Fundraising
From a fundraising perspective, another tip-top week, with newly announced funds totaling more than $26B, putting us at over $116B in newly committed capital for the quarter just a few weeks in.
To be fair, the vast majority of this week’s new capital comes from Ardian, which raised $20B for its fifth infrastructure fund, focused mostly on Europe.
But for some additional context: if next week looks anything like last week, we’ll have raised more in October than in all of Q3 combined.
As usual, it’s hard to read the tea leaves on all this, given the macro uncertainty in some critical areas, whether it’s the ongoing sanctions battle between the US and China (now involving rare earth elements), the ongoing war in Ukraine, the government shutdown here in the US, or any number of other things going seemingly sideways. And yet - the fund announcements keep coming.
A reminder that you can find links to all the articles we covered below, find back issues of these videos and the written transcripts at intentionalcyber.com, and now sign up for our monthly newsletter, the Intentional Dispatch.
We’ll see you next week for another edition of the Intentional Brief.
Links
https://health.aws.amazon.com/health/status
https://www.bbc.com/news/live/c5y8k7k6v1rt
https://www.axios.com/2025/10/20/aws-outage-reddit-roblox-coinbase
https://www.ft.com/content/755d7413-c71a-4a6e-bbe9-2b821032bdee