The failure of networked systems: The repercussions of systematic risk revisited

NOTE: Images in this archived article have been removed.

This is an updated story that originally ran in January 2008. David Clarke’s warnings about the risks of failure in highly connected systems have proved to be prescient in light of recent events – Big Gav. (David Clarke writes under the pen name aeldric.)

There are those among the Peak Oil community who suspect that we could be facing a failure of our interdependent society that may be sudden, profound, and complete. I have repeatedly said that I am not numbered among them. My opinion is that our way of life will have to change significantly, but slowly. I don’t expect to be clubbing anybody with a femur in any foreseeable future. This opinion is on record in both print and electronic media, and I don’t expect to be issuing a retraction any time soon–but a recent event forced me to admit that I may have to hedge a little.

Our internal network here has been having problems. My email (and more importantly my access to TOD) has been very unreliable over the last two days. The network regularly flicked from “working” to “failed” in the blink of an eye. I was reminded that the speed of collapse in a network is often a function of the natural frequency (speed) of the network, while the breadth of failure depends on a number of factors, including load and the degree of interdependence within the network.

The problem was eventually traced to one piece of software on one machine on our intranet: the drivers for that machine’s network interface card were corrupt.

This raised a question in my mind: The Internet Protocol was originally designed to be a robust, reliable, redundant system. How does one piece of software on one machine bring down a network with thousands of nodes?

The answer is easy: Cost efficiencies.

Our intranet could have been built to be reliable, but instead it was built to be “efficient”. Far from being a network of fail-safe systems, our network is a network of interdependencies. When the system was loaded, a single failure brought the whole system down. “Business Efficiency” has brought our network to its knees for two consecutive days.

I have seen this pattern a lot recently. Last year the power went out in my city. The power transmission system was heavily loaded one afternoon, when a single failure brought the whole system down.

Academics have studied failures of complex systems with interesting results. One of the experiments they did will be familiar to anyone who has ever played with sand-castles as a child. Build a sand pile by gradually adding grains of sand. After a while, avalanches start to run down your pile. Sometimes they are minor, while other times they affect the whole pile. There is seemingly no way to reliably predict the outcome.

However, Per Bak, in his book “How Nature Works,” shows that there is an instructive way to look at this question.

There is a critical angle for piles of sand–a level of steepness that the slope cannot go beyond without sand starting to roll down the slope. Imagine that, as you add sand, you colour red all of the areas of the pile that achieve this critical angle (and are thus on the verge of an avalanche). You will notice that the red patches appear as tendrils running down the side of the pile. As you add sand to the pile it gets higher and wider – the pile gets steeper and more little tendrils of red appear. Eventually you will see the tendrils of red start to interconnect.

If you drop a grain of sand on a red area then you will precipitate an avalanche. If the red area is interconnected with other red areas then all these areas will be drawn into the avalanche. If the red area is isolated, then the avalanche will be confined to one red tendril running down the side of the pile.
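The red-patch picture can be tried out in a toy simulation. The sketch below is a simplified grid sand pile of the kind associated with Bak's work, not a reproduction of anything in his book: it assumes a square grid where a cell "topples" once it holds four grains, passing one grain to each neighbour, so a cell holding three grains plays the part of a red area. The function names, grid size and grain count are all illustrative choices.

```python
import random

def drop_grain(grid, row, col, threshold=4):
    """Add one grain at (row, col), let the pile settle, and return
    the avalanche size (the number of topplings it caused)."""
    size = len(grid)
    grid[row][col] += 1
    unstable = [(row, col)]
    topplings = 0
    while unstable:
        r, c = unstable.pop()
        if grid[r][c] < threshold:
            continue  # already settled (cells can be queued more than once)
        grid[r][c] -= threshold
        topplings += 1
        if grid[r][c] >= threshold:
            unstable.append((r, c))
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            # Grains pushed past the edge of the grid simply fall off the pile.
            if 0 <= nr < size and 0 <= nc < size:
                grid[nr][nc] += 1
                if grid[nr][nc] >= threshold:
                    unstable.append((nr, nc))
    return topplings

def simulate(size=15, grains=5000, seed=1):
    """Drop grains at random positions and record every avalanche size."""
    rng = random.Random(seed)
    grid = [[0] * size for _ in range(size)]
    return [drop_grain(grid, rng.randrange(size), rng.randrange(size))
            for _ in range(grains)]

sizes = simulate()
print("largest avalanche:", max(sizes))
print("drops that caused no avalanche:", sum(1 for s in sizes if s == 0))
```

In this model, a grain dropped on an isolated critical cell causes exactly one toppling, while the same grain dropped on a grid where every cell is critical sets off a chain of topplings, which is the interconnected-tendrils case described above.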

This basic principle can be applied to my network problem. If one route on the network gets loaded to capacity (i.e. turns red), the system detects that it has reached maximum capacity, and it delays traffic (piles it higher) or switches traffic to other routes (spreads wider).

If the other routes were new, unloaded and redundant parts of the network, then this would not be a problem. But they are not. The other routes are simply other parts of the old, heavily loaded network. Pretty soon all routes are red, and they are all interconnected. So when one part of the network fails, it passes the traffic to another part of the network, which fails and your avalanche starts. With all networks connected, all of them are vulnerable and all fail.
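The hand-off of traffic from a failed route to its neighbours can be sketched in a few lines. This is a deliberately crude model, not a description of how any real router behaves: it assumes every route has the same capacity, that a failed route's traffic is split evenly among the survivors, and that a route fails the moment its load exceeds capacity. The function name `cascade` and the numbers are hypothetical.

```python
def cascade(loads, capacity, first_failure):
    """Fail one route, share its traffic among the surviving routes,
    and repeat for any route that is pushed over capacity.
    Returns the set of routes that end up failed."""
    loads = list(loads)
    failed = set()
    pending = [first_failure]
    while pending:
        route = pending.pop()
        if route in failed:
            continue
        failed.add(route)
        survivors = [i for i in range(len(loads)) if i not in failed]
        if not survivors:
            break  # nothing left to take the traffic
        share = loads[route] / len(survivors)
        loads[route] = 0.0
        for i in survivors:
            loads[i] += share
            if loads[i] > capacity:
                pending.append(i)  # this route is now overloaded too
    return failed

# Five routes, each able to carry 100 units (illustrative numbers).
print("lightly loaded:", sorted(cascade([50] * 5, 100, first_failure=0)))
print("heavily loaded:", sorted(cascade([90] * 5, 100, first_failure=0)))
```

With the routes half loaded, the survivors absorb the extra traffic and only the original route is lost. With every route already at 90 percent, the same single failure takes out all five.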

Our network operates at electronic speeds, and it failed with the same rapidity.

Understanding how this happened is critically important. There are four parts to creating the complete meltdown of a network:

1. Create a network by building connections between systems.

2. When a particular part of the network approaches overload (goes red), recognise that this is happening and use the connections you have created to allow you to switch load to another part of the network.

3. Continue doing this until all areas are red.

4. Now add more load.

When we poured sand on our sand pile we allowed the sand to fall randomly, and thus the avalanches seemed random. But once we had the ability to monitor (see our potential “avalanche” areas coloured red), we were able to carefully divert the sand into other areas. This delays the avalanche, but in the long run the avalanche is going to be much worse, because it will occur when all areas are red.

In summary: The ability to measure and monitor the system gives us the capacity to avoid small avalanches in individual areas. However, if we keep adding load without adding capacity we overload the entire network and thus make an all-encompassing avalanche inevitable.

If we can’t add capacity, then it would have been better to allow a series of small avalanches.
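The trade-off between many small avalanches and one large one can be compared in a toy model. The sketch below assumes ten regions of equal capacity and load arriving one unit at a time. In the "divert" case, each unit is steered to the least-loaded region until none has room left. In the other case load lands at random and an overloaded region fails on its own. Treating a failure as "all load in the failed area is lost" is a simplifying assumption, and the function name `run` and its parameters are hypothetical.

```python
import random

def run(regions=10, capacity=100, steps=5000, divert=True, seed=0):
    """Return the size of every failure event (load lost per event)."""
    rng = random.Random(seed)
    load = [0] * regions
    outages = []
    for _ in range(steps):
        if divert:
            # Monitoring lets us steer new load to the least-loaded region.
            target = min(range(regions), key=lambda i: load[i])
        else:
            # No monitoring: load lands wherever it lands.
            target = rng.randrange(regions)
        load[target] += 1
        if load[target] > capacity:
            if divert:
                # Even the least-loaded region was full, so every region
                # is at capacity and the whole network goes down together.
                outages.append(sum(load))
                load = [0] * regions
            else:
                # Only the overloaded region fails.
                outages.append(load[target])
                load[target] = 0
    return outages

managed = run(divert=True)
unmanaged = run(divert=False)
print("diverting load:  ", len(managed), "failures, largest", max(managed))
print("no diverting:    ", len(unmanaged), "failures, largest", max(unmanaged))
```

Under these assumptions the diverting network fails rarely, but each failure involves the load of every region at once, while the unmanaged network fails often and only one region at a time.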

A look at the financial markets at the moment might illustrate the same point. When we look at the “sub-prime” issues that are emerging, we see that the market created a series of “Investment Vehicles” that allowed risk to be shared. A complex network of interdependencies was created to share this risk, but capacity was not added to deal with the possibility of default. The various institutions that bought these “Investment Vehicles” thought they were buying assets, not debts. The institutions failed to recognise that they needed to add “capacity” in the form of liquidity equal to the possible value of defaults on this debt. As a result, now that load is being applied (in the form of defaults) it threatens to bring down the entire network, rather than just the single “node” that originated the debt.

[Update, October 2008: In view of what has happened since I wrote this piece in January I should probably mention that, in my view, the natural frequency for the cascading failure of the economic system is quite variable. We have electronically linked systems in some areas, while other areas rely on lawyers and accountants laboriously unwinding CDS and other derivatives by hand. The variability means that contrary to what I was hearing yesterday, this cascade is far from finished.

It is also worth noting that a crash can happen in a fast system, but you may not feel it until it has propagated through a slow system, if these systems exist as part of a chain. For example, credit systems can lock up quite quickly, but you may not feel it until the effects have propagated through transport systems. When credit is unavailable, resellers cannot buy items (such as grain), so it does not go on ships, and does not get delivered–but it will be weeks before you notice the delivery failure. (Baltic Dry is an indicator of shipping rates. As I write this, the Baltic Dry Index is down about 80%. http://www.bloomberg.com/apps/quote?ticker=bdiy&exch=IND&x=15&y=11 –and we are starting to feel the effects of this downturn.) The speed of impact of a cascading failure is often limited by the natural frequency of the slowest link.]
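The point about chained systems can be put in rough numbers. In the sketch below every delay is a made-up placeholder chosen only to show the arithmetic, not a measurement of any real credit or shipping system, and the function name `time_to_impact` is hypothetical.

```python
def time_to_impact(chain):
    """Given (name, delay_in_days) links in order, return the total
    delay before the end of the chain feels a failure, plus the
    slowest link."""
    total = sum(delay for _, delay in chain)
    slowest = max(chain, key=lambda link: link[1])
    return total, slowest

# Hypothetical delays, in days, for a failure to pass through each link.
chain = [("credit", 2), ("grain purchasing", 7), ("shipping", 30)]

total, slowest = time_to_impact(chain)
print("days until the failure is felt:", total)
print("slowest link:", slowest[0], "with", slowest[1], "days")
```

With these placeholder figures the fast credit link accounts for only a small share of the total delay, and most of the wait comes from the slowest link in the chain.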

The critical concept is that monitoring and networking the system allows us to go right up to the edge of disaster, and then move load to another part of the network until it, too, is on the edge of disaster.

Now that the networking effects have been discussed, I would like to push the analogy a bit further and look at how this plays out from a Peak Oil perspective.

Several years ago, sweet light crude oil started getting a bit more difficult to obtain. In response, we stopped talking about “oil” and started talking about “liquids”. The word “liquids” covers Liquefied Natural Gas (LNG), ethanol, heavy oils, tar sands, and an increasing number of other oil-substitutes.

Essentially the part of the network called “Sweet Light Crude” turned red, so we started connecting the “Oil Network” to other networks.

We connected oil to the “food” network by turning food into ethanol. Actually food was already connected because you need oil to make food in the modern world, but now the circle is complete–previously we used oil to create food, and now we use food (corn, sugar, palm oil, etc.) to create oil (or oil-substitutes).

Adding LNG and CTL (Coal-To-Liquid) to the network connects oil to other energy sources. As this connection strengthens and load starts to be applied, a shortage of any of these sources would have an impact in each of the other sources. To some extent, this has already started to occur.

Adding tar sands and various other oil substitutes to the network has made a surprising connection between the environment and oil. This connection takes many forms, but the most interesting lies in the fact that oil substitutes are less efficient than light sweet crude–much more CO2 is produced for any given amount of work done. This connection is emerging, and could have interesting repercussions. The problem applies to virtually all the oil-substitutes, so the widespread adoption of substitutes (particularly CTL and tar sands) might cause an environmental disaster which in turn would suppress ethanol production and create knock-on effects in other parts of the network.

The financial system has an important role to play in this network. If energy, food and the environment can be considered three portions of the network, then our financial system can be considered to be both a form of network monitoring, and the communication medium that the network uses to pass signals around. Consider the financial system to be similar to the blue cable running out the back of your computer. Your computer’s blue cable isn’t likely to run hot, but our finance system is a network of networks, and it is glowing red. In addition to monitoring and communication, the financial system provides support for maintenance and upgrades of the energy systems, so capacity in the financial system is critical.

When one part of the network develops a problem (say production of LNG suddenly drops), then messages get sent via the financial system (in the form of increased prices), and the other parts of the system accept the load, if they can, by increasing production. When compared to an Internet Protocol network there are many faults in this system. High latency leads to slow responses. Poor monitoring leads to conflicting signals or a failure to detect faults. Bad messages are often not corrected, leading to incorrect responses, and so on.

The speed of a crash

The interesting point to note is that increasing demand past capacity will not immediately “crash” this system. Oil facilities that are working at capacity will not “crash” if demand exceeds that capacity; they will simply continue working at capacity. The crash may come, but it will come because demand heats up the financial system and crashes other systems that depend on finances. Since the oil production system is dependent on other systems, this could conceivably cause an eventual crash. Eventually lack of maintenance will degrade the capacity, but this is a process that occurs over a period of months or years.

Likewise, the process of adding capacity is exceptionally slow. Building CTL or NGL plants takes the best part of a decade.

The oil production system can certainly crash, but it would be a crash in slow motion.

The only part of the system that can crash quickly is the financial system. The financial system provides monitoring, communication, maintenance and upgrades. So a profound, complete crash in this area could conceivably bring down the whole network.

However, could such a financial crash occur? An immediate halt to oil production would require a crash far more profound than the Great Depression. The response speed of our financial system has been improved by linking many of the sub-systems electronically, but there are still a number of choke points, circuit breakers, and sanity checks. The Great Depression emerged over a period of months or years. Even with the electronic linkages in place today, a complete breakdown of our financial institutions is unlikely to happen overnight.

If this system crashes overnight, it will be because the plug got pulled–a breakdown of society external to the system.

The natural frequency for events in the oil and oil-substitute network is in the range of months at least, or more likely years. Internal stresses cannot cause it to crash overnight.

The breadth of a crash

The breadth of the crash depends on the degree of linkage and the degree to which each part of the network is loaded. This is where I start to worry.

Oil appears to be at or near peak capacity–exports are dropping. As for the food network–world grain reserves are at historic lows, and expected to drop a little more next year. And the environment? Climate change is clearly with us, indicating that the environment has already gone past its capacity.

When looked at in these terms it appears that the network is already in decline. Each of these three parts of the network is at or past capacity. If a span of years is the natural time-frame for a crash in this system, then it seems quite plausible that we are watching a very broad-based crash of our energy systems–right now.

Our actions in increasing the connections to the food and environment networks will not help, and may simply speed the crash.

The signals indicating the start of a crash would be seen in the monitoring and communication system–the financial systems. Prices for oil would go up. Which we have seen…. Prices for food would go up. Which we have seen…. We might expect perturbations, volatility, and attempts to “price” the environment…. Hmmmm.

Conclusion

I am forced to concede that a broad-based collapse is a possibility. I still maintain that a sudden collapse is unlikely, but if it is already happening, then it could certainly look sudden when we eventually notice it.

[Update, October 2008: I am still hoping to avoid a sudden, broad-based collapse. Some factors look like they will contribute, while others will mitigate. In many cases, the pace cannot proceed faster than the slowest system in the dependency chain. In monitoring this situation, look for dependency connections between systems, and then ask yourself what the natural frequency of the slowest system in the chain is.]