Thursday, May 9, 2024

Chaos Engineering: Embracing Disorder In Resilient Systems

In my last post I wrote about resilience engineering – interesting stuff but can it be tested? Of course!

When I was growing up in the 1960s, one of my favorite shows was Get Smart, a classic comedy TV series that parodied the spy genre. It followed bumbling secret agent Maxwell Smart in his adventures against the villainous organization KAOS, the International Organization of Evil.


As in the TV show, unpredictability and chaos are often viewed in engineering as adversaries to be conquered. In recent years, however, a paradigm shift has occurred, with some engineers embracing chaos as a means to build more resilient systems. This approach, aptly named "Chaos Engineering," is a discipline that advocates deliberately injecting failure into systems to test their robustness and identify weaknesses before they cause real-world disasters.

 

At its core, Chaos Engineering is about embracing the inherent unpredictability of complex systems and using it to our advantage. Instead of waiting for failures to occur in production, engineers proactively introduce controlled chaos to observe how systems respond under adverse conditions. By doing so, they gain valuable insights into system behavior and dependencies, enabling them to build more resilient architectures.

[image credit: https://www.bmc.com/blogs/chaos-engineering/]

The principles of Chaos Engineering are grounded in scientific experimentation. Engineers start by defining a hypothesis about how their system should behave under stress. They then design and execute experiments that simulate real-world failures, such as server crashes, network latency spikes, or database outages. These experiments are carefully controlled to minimize the impact on users while still providing meaningful insights into system behavior.
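
To make the hypothesis-then-experiment loop concrete, here is a minimal sketch in Python. It is a toy simulation, not a real chaos tool: the `Replica`, `Service`, and `run_experiment` names, the 99% success threshold, and the request count are all assumptions made up for illustration. The hypothesis is "the service keeps answering requests even if one replica crashes."

```python
import random


class Replica:
    """Hypothetical stand-in for one server behind a load balancer."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class Service:
    """Toy service that retries a request on the remaining replicas."""

    def __init__(self, replicas):
        self.replicas = replicas

    def call(self, request):
        # Try replicas in random order; fail only if every one is down.
        for replica in random.sample(self.replicas, len(self.replicas)):
            try:
                return replica.handle(request)
            except ConnectionError:
                continue
        raise ConnectionError("all replicas down")


def success_rate(service, n=200):
    """Fraction of n requests that the service answers."""
    ok = 0
    for i in range(n):
        try:
            service.call(i)
            ok += 1
        except ConnectionError:
            pass
    return ok / n


def run_experiment(service, threshold=0.99, seed=0):
    """Hypothesis: success rate stays >= threshold with one replica down."""
    random.seed(seed)
    baseline = success_rate(service)        # steady state before the fault
    victim = random.choice(service.replicas)
    victim.healthy = False                  # inject the failure
    try:
        during = success_rate(service)      # observe behavior under failure
    finally:
        victim.healthy = True               # always roll the fault back
    return {
        "victim": victim.name,
        "baseline": baseline,
        "during": during,
        "hypothesis_holds": during >= threshold,
    }


if __name__ == "__main__":
    redundant = Service([Replica("a"), Replica("b"), Replica("c")])
    print(run_experiment(redundant))
    single = Service([Replica("solo")])
    print(run_experiment(single))
```

With three replicas the hypothesis should hold, because the retry logic routes around the crashed replica. With a single replica it fails, which is exactly the kind of hidden weakness an experiment like this is meant to surface. The `finally` block reflects the "carefully controlled" part: the injected fault is rolled back even if the measurement itself goes wrong.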

 

One of the key benefits of Chaos Engineering is its ability to uncover hidden weaknesses in distributed systems. In today's world of microservices and cloud computing, systems are becoming increasingly complex, with numerous interdependencies and failure points. Traditional testing methods often fail to uncover these issues until they manifest in production. Chaos Engineering helps mitigate this risk by actively probing for weaknesses in a controlled environment.

 

Netflix is perhaps the most famous proponent of Chaos Engineering. Its "Chaos Monkey" tool randomly terminates instances in production so that engineers must build services that tolerate such failures. By regularly causing disruptions in its own infrastructure, Netflix keeps engineers constantly aware of potential weaknesses and pushes them to design systems that can withstand them. This approach has helped Netflix maintain high levels of uptime and scalability, even in the face of unexpected events like server outages or network failures.

 

However, Chaos Engineering is not without its challenges. Introducing chaos into a system requires careful planning and coordination to ensure that experiments do not cause widespread outages or data loss. Moreover, interpreting the results of chaos experiments can be complex, as system behavior is often non-linear and influenced by numerous factors.

 

Despite the challenges, the benefits of Chaos Engineering are hard to ignore. By proactively testing for weaknesses and building resilience into their systems, engineers can reduce downtime and ultimately deliver a better experience for users.


Tuesday, May 7, 2024

Resilience Engineering: Understanding System Failures, Causes, And Enhancing Resilience For Better Recovery

Resilience engineering is a multifaceted concept that has gained increasing importance in various fields, particularly in engineering, systems design, and risk management. It refers to the ability of a system, organization, or individual to adapt and recover from unexpected challenges, disturbances, or failures while maintaining essential functions and performance. In essence, resilience engineering aims to understand how systems fail, why they fail, and how they can be designed or improved to withstand and recover from failures effectively. 

[image source: https://rote.se/upload/images/resilience-engineering.png]

One of the core principles of resilience engineering is recognizing that failures are inevitable in complex systems. Instead of focusing solely on preventing failures, resilience engineering emphasizes building systems that can gracefully degrade or adapt when failures occur. This approach acknowledges the interconnectedness and unpredictability inherent in complex systems, whether they are technological, organizational, or socio-technical.
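
As a small illustration of graceful degradation, here is a hedged Python sketch. The `DegradingClient` class and its `fetch` callback are hypothetical names invented for this example. The idea is simply that when a dependency fails, the caller falls back to the last known good answer, and then to a safe default, instead of failing outright.

```python
class DegradingClient:
    """Wraps an unreliable lookup and degrades instead of failing.

    `fetch` is any callable that takes a key and may raise an exception;
    it stands in for a real dependency such as a remote service.
    """

    def __init__(self, fetch, default):
        self._fetch = fetch
        self._default = default
        self._last_good = {}  # key -> most recent successful result

    def get(self, key):
        try:
            result = self._fetch(key)
        except Exception:
            # Dependency failed: serve stale data if we have it,
            # otherwise a safe default. Report which mode was used.
            if key in self._last_good:
                return self._last_good[key], "stale"
            return self._default, "default"
        self._last_good[key] = result
        return result, "live"


if __name__ == "__main__":
    outage = {"active": False}

    def fetch(key):
        if outage["active"]:
            raise ConnectionError("backend unavailable")
        return f"fresh data for {key}"

    client = DegradingClient(fetch, default="generic data")
    print(client.get("alice"))   # live answer
    outage["active"] = True
    print(client.get("alice"))   # stale answer from before the outage
    print(client.get("bob"))     # never seen, so the default
```

Returning the mode ("live", "stale", or "default") alongside the value is a design choice in this sketch: it lets the rest of the system know it is running in a degraded state rather than hiding the failure.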

 

At the heart of resilience engineering is the concept of "anticipating the unexpected." This involves actively seeking out potential sources of failure, understanding their potential impacts, and developing strategies to mitigate or recover from them. Rather than relying solely on past data or traditional risk assessment methods, resilience engineering encourages a proactive and dynamic approach to risk management that takes into account uncertainties and emergent properties of complex systems.

 

Central to resilience engineering is the idea of "resilience barriers" or "safety margins." These are mechanisms, redundancies, or practices built into a system to prevent or mitigate the escalation of failures. Unlike traditional safety measures, resilience barriers are designed to be flexible and adaptable, allowing for rapid response and recovery in the face of unforeseen events. Examples of resilience barriers include backup systems, cross-training of personnel, flexible protocols, and decentralized decision-making structures.


Resilience engineering also emphasizes the importance of learning from failures. Instead of viewing failures as purely negative events to be avoided, resilience engineering sees them as valuable opportunities for learning and improvement. This involves conducting thorough post-event analyses, sharing insights across organizational boundaries, and implementing changes to prevent similar failures in the future. By fostering a culture of learning and adaptation, resilience engineering helps organizations become more agile and responsive to change.

 

In recent years, resilience engineering has found applications in diverse domains, including aviation, healthcare, cybersecurity, and disaster management. In aviation, for example, Crew Resource Management (CRM) programs embody many of the same ideas: they focus on improving communication, decision-making, and teamwork among flight crews to enhance safety and resilience in the face of unexpected events.

 

Interested in learning more? Here’s a 9-minute, 51-second introductory video by Dr. David Woods, a professor in the Department of Integrated Systems Engineering at The Ohio State University. https://youtu.be/r8awKlk7JPM?feature=shared

 

Additional videos from Dr. Woods are posted on the YouTube C/S/E/L BackChannel here.

 

As our world becomes increasingly interconnected and complex, the principles of resilience engineering will continue to be essential for ensuring safety, reliability, and effectiveness in all walks of life.

Friday, May 3, 2024

Community College Engineering Student Transfer

Yesterday I checked in via LinkedIn with a Holyoke Community College Engineering program graduate who transferred to a nationally ranked top ten engineering university. The student is studying Electrical Engineering there and I asked how things were going. Here’s a screenshot of the response I got with identifying information removed – including the student's name and the transfer university. Pretty cool!

The student compliments my two classes (Circuits 1 and Circuits 2) but there is so much more. Both classes are based on Calculus and Differential Equations, so the students need to really know their math stuff before I get them. The math, physics and chemistry instruction is exceptional at Holyoke Community College – as it is at so many other community colleges in the country. It is not just the STEM classes that prepare students for my classes though. To get their degree our students need to take additional courses including English Composition, History, Social Sciences and in some cases Business courses. These courses are critical: they complement the technical knowledge, skills and abilities gained in engineering courses and produce well-rounded professionals capable of addressing complex challenges with creativity, empathy, and ethical awareness.

 

I see it every day with students coming to my classes prepared to learn, solve problems, communicate and understand some pretty complex stuff. Amazing faculty doing amazing things in their classrooms makes it pretty easy for me to teach those classes.


We (community colleges) often face unjust criticism due to misconceptions. Despite offering quality education, we’re sometimes seen as inferior to four-year institutions. We provide valuable opportunities and options with smaller classes, dedicated faculty, and affordable tuition. And let's not forget transfer to four-year institutions.

 

Thanks to the unnamed student – you certainly made my day!