> [!NOTE]
> The thinking in this article is still preliminary and has not been studied in depth; it may be significantly revised in the future.
Exceptions in Complex Systems#
Whenever a major product failure occurs, people's first reaction is astonishingly consistent: find the person who made the mistake or the part that failed. We eagerly search for a clear "root cause" because it gives us an illusory sense of control—if we just fix this point, everything will return to normal.
In some cases this simple way of thinking is not wrong; simplicity means speed. When staff are under pressure to placate a complaining customer and rescue the scene, quickly naming a "root cause" and declaring the problem solved is a workable move. We often feel satisfied, even smug, about having "solved the problem," but we must recognize clearly that this kind of "solution" addresses the people involved rather than the system as a whole.
Whether it is an energy network, a company, a piece of equipment, or a software system, these things are complex by design or in practice, and no one can fully understand how they actually work at any given moment. They are in fact riddled with minor faults, but enough redundancy has been designed in that they keep functioning normally.
At some moment, several of those minor faults conspire, a planned task suddenly becomes impossible, and an accident occurs. We then have to deal with the accident and, based on a superficial understanding, apply a patch that is not too unsightly, convincing whoever discovered the problem that the issue has been resolved. I once accompanied a classmate to deal with an administrative director at a school. Even as the two of us were cursing her for being needlessly difficult, she said something I found genuinely philosophical: "Every action leaves a trace; everything you do has consequences." In the same way, our eagerness to apply a patch quietly introduces yet more subtle faults into the system.
In engineering management, people often obsess over finding a superficial "root cause" while neglecting the real underlying causes. The people who actually do the work need to hand "an explanation" up the chain, and an explanation that pins responsibility on a specific person or thing passes review more easily, but it ultimately only obscures the systemic root causes.
This mindset is rooted in the second stage of quality control, the "statistical quality control" stage,[^three-stages-of-quality-management] a legacy of the Ford-assembly-line industrial era: it is heavily focused on decomposition and standardization and assumes strong, simple causal relationships. Compared with the first stage, the "quality inspection" stage, this was certainly a major methodological improvement. In the current era, however, the complexity of products, engineering, society, and organizations has risen sharply, producing complex systems with many variables, non-linear relationships, and real-time change, in which variables influence one another and humans can no longer grasp the causal relationships of every link. Yet our thinking patterns remain stuck in the primitive, instinctive handling of simple, linear relationships, which leaves a significant cognitive gap.
[^three-stages-of-quality-management]: 1. Quality inspection stage: before the 18th century, products typically came out of workshops, and quality rested on the craftsmanship and experience of manual workers, with skilled workers performing the final check. This inertia lasted into the early 20th century and amounted to little more than picking defective items out of the finished goods, a purely after-the-fact check. 2. Statistical quality control stage: mainly uses statistical methods and the control charts proposed by Shewhart to detect and correct defects in particular processes in a timely way. 3. Total quality management stage: the 1956 TQC paper argued that quality problems arising in the production process account for only about 20% of all quality problems and introduced total quality management, which takes market research, design, production, and service all into account.
Focus on the System#
If a coffee shop's quality is inconsistent and one day a customer complaint comes in, the manager's first reaction is always to "put out the fire": call an emergency meeting, quickly identify the staff member on duty, accuse them of failing to adjust the coffee machine properly, then fine them and compensate the customer.
Is that enough? The complaint is handled promptly, but why does the same issue keep recurring? Because nothing in the coffee shop as a system has changed: it still depends entirely on individual staff skill to produce its coffee.
In our time there are almost no coffee shops like this left. Think about it: in a chain-brand coffee shop, isn't the flavor of the product nearly identical across every location? Of course, that flavor is not necessarily excellent; it may well be worse than what that occasionally erring staff member can brew at their best. But that is the chain's positioning: we produce products of exactly this quality, and we naturally serve only the customers who are satisfied with that quality.
Deming[^1] divided all quality problems into two types.
The first type comes from what Deming called "special causes." This is like your computer suddenly blue-screening: an abnormal, sudden disruption with an identifiable cause, perhaps user error, a hardware fault, or a driver crash. For problems like this you must act immediately: find the cause, fix it, and make sure it does not happen again. It is firefighting, and it demands immediate execution.
More common and more troublesome is the second type, which Deming attributed to "common causes." This is more like your computer's overall speed fluctuating. It is not produced by any single, identifiable failure; it is inherent in the system itself. Perhaps the operating system is a bit bloated, too many programs run in the background, the disk is nearly full... countless small, random factors combine to create that overall, hard-to-describe "lag." This is the system's "background noise," and it is always present.
Seen this way, quality fluctuations caused by coffee shop staff are common-cause variation, because staff are human. Chain coffee shop managers respond by wisely lowering the target customer segment, building a stable and complete coffee bean supply chain, and minimizing the complexity of staff operations; that is how they optimize the system.
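To make the distinction concrete, here is a toy simulation, not taken from Deming and with every number invented for illustration: espresso extraction times on a day with only common-cause variation, versus a day on which a mis-adjusted grinder introduces a special cause partway through.

```python
# Hypothetical espresso extraction times in seconds.
# Common-cause variation: small, ever-present randomness in dose, tamping, water temperature.
# Special-cause variation: the grinder is knocked out of adjustment and shifts every later shot.
import random

random.seed(42)

def normal_day(shots=50):
    # Only common causes: every shot fluctuates a little around the 27-second target.
    return [random.gauss(27.0, 1.5) for _ in range(shots)]

def grinder_fault_day(shots=50, fault_at=30):
    # A special cause appears mid-day: a coarser grind makes every later shot run about 5 s fast.
    times = []
    for i in range(shots):
        shift = -5.0 if i >= fault_at else 0.0
        times.append(random.gauss(27.0 + shift, 1.5))
    return times

for label, day in [("normal day", normal_day()), ("grinder-fault day", grinder_fault_day())]:
    mean = sum(day) / len(day)
    spread = max(day) - min(day)
    print(f"{label}: mean {mean:.1f} s, range {spread:.1f} s")
```

On the normal day the fluctuation never goes away, it is simply part of the process; on the grinder-fault day a single identifiable event drags the whole distribution away from the target, and that event is what deserves firefighting.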
Deming's path is to first extinguish all of the suddenly flaring special-cause problems. By establishing a set of criteria (see the next section) one can judge scientifically which signals are genuine anomalies. Once all special causes have been eliminated, the system enters a "stable state." Problems and fluctuations still exist, but they are all normal noise.
At this point the truly important improvement work is only beginning: the root cause of every subsequent problem is no longer a specific person or thing but the system itself. Managers must improve the system more intelligently and more cautiously, repeating the cycle of reflection and improvement again and again.
How to Determine if the System Has Entered a "Stable State"#
I have not yet worked through the relevant mathematical methods and indicators in detail.
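For reference, the standard tool for this question is the Shewhart control chart mentioned in the footnote above. Below is a minimal sketch of an "individuals" chart; the measurements are invented, and real applications use more run rules than the single 3-sigma test shown here.

```python
# A minimal sketch of a Shewhart individuals control chart (illustrative data).
# A point outside the 3-sigma control limits, estimated from the moving range,
# is treated as a special-cause signal; if no such signals remain, the process
# is provisionally considered to be in a stable state (statistical control).

measurements = [27.1, 26.4, 27.8, 26.9, 27.3, 25.9, 27.6, 27.0, 26.5, 33.2,
                27.2, 26.8, 27.4, 26.6, 27.1]

center = sum(measurements) / len(measurements)

# Average moving range between consecutive points; 1.128 is the standard d2
# constant for subgroups of size 2, used to estimate short-term variation.
moving_ranges = [abs(b - a) for a, b in zip(measurements, measurements[1:])]
mr_bar = sum(moving_ranges) / len(moving_ranges)
sigma_hat = mr_bar / 1.128

ucl = center + 3 * sigma_hat  # upper control limit
lcl = center - 3 * sigma_hat  # lower control limit

signals = [(i, x) for i, x in enumerate(measurements) if x > ucl or x < lcl]

print(f"center line {center:.2f}, limits [{lcl:.2f}, {ucl:.2f}]")
print("special-cause signals:", signals or "none; process looks stable")
```

The idea is that once no points fall outside the limits (and no other run rules fire), the remaining variation is treated as common-cause noise, and only then does it make sense to improve the system rather than chase individual incidents.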
PDCA and PDSA#
First, the idea behind the Deming cycle: optimize the system by going through a few fixed stages over and over.
PDCA is the widely known "Deming cycle": Plan-Do-Check-Act, that is, plan, execute, check, and improve. Deming himself, however, stated clearly that he never proposed it; the attribution is likely mistaken.
PDSA is the original "Deming cycle." It stands for Plan-Do-Study-Act, which means planning, executing, learning, and improving.
Repeating these four stages achieves a stepwise improvement in the system.
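Purely as an illustration of the shape of the cycle (the metric, thresholds, and helper functions below are all invented, not part of any formal PDSA definition), the four stages can be read as one loop body executed again and again:

```python
# A deliberately simplified sketch of the Plan-Do-Study-Act cycle as a loop.
# Everything here is made up; the point is only the shape of the iteration.

def plan(history):
    """Pick one small, testable change and predict its effect."""
    return {"change": f"adjustment #{len(history) + 1}", "expected_gain": 1.0}

def do(proposal):
    """Try the change on a small scale and collect a measurement (stubbed here)."""
    return {"observed_gain": 0.8}  # pretend measurement

def study(proposal, result):
    """Compare what actually happened against what was predicted."""
    return result["observed_gain"] >= 0.5 * proposal["expected_gain"]

def act(proposal, keep):
    """Adopt the change, or abandon it and carry the lesson into the next Plan."""
    return {"change": proposal["change"], "kept": keep}

history = []
for cycle in range(3):          # each pass is one turn of the cycle
    p = plan(history)
    d = do(p)
    keep = study(p, d)
    history.append(act(p, keep))

print(history)
```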
Modern methodologies talk about "large cycles containing small cycles," and some stages get pretentiously expanded, for example C into 4C: Check, Communicate, Clean, Control. My view, though, is that a methodology should not be over-refined; excessive refinement ultimately amounts to having no methodology at all.
I also believe that over-emphasizing cycle-driven improvement can stifle a system's capacity for innovation, which at certain stages of a system's life can be fatal. It also lowers the system's ceiling: remarkable breakthroughs and innovations naturally bring more minor faults with them. So I regard this quality optimization approach as applicable only once the system is in a "stable state," and as a tool for execution rather than anything more.
Deming expressed similar views in [Deming's New Economics (2nd Edition) | yono's document](https://data.yono233.cn/ 书籍 / 戴明的新经济观(原书第 2 版)=THE NEW ECONOMICS FOR INDUSTRYK,GOVERNMENT,EDUCATION SECOND EDITION_13726844.pdf): "do not be superstitious about methods, but adapt to local conditions." This book is Deming's last work, and I strongly recommend downloading and reading it. It also discusses the futility of performance rankings, everyone's commitment to optimizing the system, and respecting employees rather than merely rewarding them materially; these rather idealistic views, like PDCA/PDSA, help in understanding the master's thinking.
There are also Deming's famous Fourteen Points, which you can look up and study on your own; they are not especially relevant to ordinary people like us.
Reflection#
My biggest takeaway is not to pin a mistake on a single point: when problems occur, they are really flaws in the design of the system. So there is no need to be anxious or to blame yourself; these are problems for those above us, the people who own the system, to solve.
[^1]: An American quality management master who laid a solid foundation for quality management in Japanese enterprises.