Thursday, May 9, 2024

Chaos Engineering: Embracing Disorder In Resilient Systems

In my last post I wrote about resilience engineering – interesting stuff but can it be tested? Of course!

Growing up in the 1960s, one of my favorite shows was Get Smart, a classic TV comedy parodying the spy genre that followed bumbling secret agent Maxwell Smart in his adventures against the villainous organization KAOS, The International Organization of Evil.


Like the TV show, in the world of engineering, unpredictability and chaos are often viewed as adversaries to be conquered. However, a paradigm shift has occurred in recent years, with some engineers embracing chaos as a means to build more resilient systems. This approach, aptly named "Chaos Engineering," is a discipline that advocates deliberately injecting failure into systems to test their robustness and identify weaknesses before they cause real-world disasters.

 

At its core, Chaos Engineering is about embracing the inherent unpredictability of complex systems and using it to our advantage. Instead of waiting for failures to occur in production, engineers proactively introduce controlled chaos to observe how systems respond under adverse conditions. By doing so, they gain valuable insights into system behavior and dependencies, enabling them to build more resilient architectures.

[image credit: https://www.bmc.com/blogs/chaos-engineering/]

The principles of Chaos Engineering are grounded in scientific experimentation. Engineers start by defining a hypothesis about how their system should behave under stress. They then design and execute experiments that simulate real-world failures, such as server crashes, network latency spikes, or database outages. These experiments are carefully controlled to minimize the impact on users while still providing meaningful insights into system behavior.
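To make that hypothesis-experiment-observe loop concrete, here is a minimal Python sketch of a chaos experiment run against a stand-in service. Everything in it is hypothetical (the service_call function, the 20% failure rate, the 0.3 second objective); real chaos tooling would inject faults at the infrastructure level and measure steady state from production telemetry.

import random
import time

# Hypothetical stand-in for a real dependency (e.g., a downstream service call).
def service_call():
    time.sleep(0.02)  # nominal 20 ms response
    return "ok"

def with_injected_chaos(call, failure_rate=0.2, added_latency_s=0.5):
    """Wrap a call so it randomly fails or slows down, simulating faults."""
    def chaotic_call():
        if random.random() < failure_rate:
            raise ConnectionError("injected failure")
        time.sleep(added_latency_s * random.random())  # injected latency spike
        return call()
    return chaotic_call

def run_experiment(trials=50, slo_seconds=0.3):
    """Hypothesis: the client still meets its objective despite injected faults,
    because it retries once on failure."""
    chaotic = with_injected_chaos(service_call)
    successes = 0
    for _ in range(trials):
        start = time.time()
        try:
            chaotic()
        except ConnectionError:
            try:
                chaotic()  # one retry, the only resilience mechanism in this sketch
            except ConnectionError:
                continue
        if time.time() - start <= slo_seconds:
            successes += 1
    print(f"{successes}/{trials} trials met the {slo_seconds}s objective")

if __name__ == "__main__":
    run_experiment()

The point of the sketch is the shape of the experiment: state the steady-state hypothesis up front, inject the fault in a controlled way, and measure whether the hypothesis held.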

 

One of the key benefits of Chaos Engineering is its ability to uncover hidden weaknesses in distributed systems. In today's world of microservices and cloud computing, systems are becoming increasingly complex, with numerous interdependencies and failure points. Traditional testing methods often fail to uncover these issues until they manifest in production. Chaos Engineering helps mitigate this risk by actively probing for weaknesses in a controlled environment.

 

Netflix is perhaps the most famous proponent of Chaos Engineering, with its "Chaos Monkey" tool being widely used to inject failures into production systems. By regularly causing disruptions in their infrastructure, Netflix ensures that engineers are constantly aware of potential weaknesses and can design systems to withstand them. This approach has helped Netflix maintain high availability and scalability, even in the face of unexpected events like server outages or network failures.

 

However, Chaos Engineering is not without its challenges. Introducing chaos into a system requires careful planning and coordination to ensure that experiments do not cause widespread outages or data loss. Moreover, interpreting the results of chaos experiments can be complex, as system behavior is often non-linear and influenced by numerous factors.

 

Despite the challenges, the benefits of Chaos Engineering are undeniable. By proactively testing for weaknesses and building resilience into their systems, engineers can improve availability, recover from incidents faster, and ultimately deliver a better experience for users.


Tuesday, May 7, 2024

Resilience Engineering: Understanding System Failures, Causes, And Enhancing Resilience For Better Recovery

Resilience engineering is a multifaceted concept that has gained increasing importance in various fields, particularly in engineering, systems design, and risk management. It refers to the ability of a system, organization, or individual to adapt and recover from unexpected challenges, disturbances, or failures while maintaining essential functions and performance. In essence, resilience engineering aims to understand how systems fail, why they fail, and how they can be designed or improved to withstand and recover from failures effectively. 

[image source: https://rote.se/upload/images/resilience-engineering.png]

One of the core principles of resilience engineering is recognizing that failures are inevitable in complex systems. Instead of focusing solely on preventing failures, resilience engineering emphasizes building systems that can gracefully degrade or adapt when failures occur. This approach acknowledges the interconnectedness and unpredictability inherent in complex systems, whether they are technological, organizational, or socio-technical.

 

At the heart of resilience engineering is the concept of "anticipating the unexpected." This involves actively seeking out potential sources of failure, understanding their potential impacts, and developing strategies to mitigate or recover from them. Rather than relying solely on past data or traditional risk assessment methods, resilience engineering encourages a proactive and dynamic approach to risk management that takes into account uncertainties and emergent properties of complex systems.

 

Central to resilience engineering is the idea of "resilience barriers" or "safety margins." These are mechanisms, redundancies, or practices built into a system to prevent or mitigate the escalation of failures. Unlike traditional safety measures, resilience barriers are designed to be flexible and adaptable, allowing for rapid response and recovery in the face of unforeseen events. Examples of resilience barriers include backup systems, cross-training of personnel, flexible protocols, and decentralized decision-making structures.
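As a small illustration of one such barrier, here is a hedged Python sketch of a primary/backup fallback. The function names (fetch_primary, fetch_backup) are made up for the example; a production version would add timeouts, health checks, circuit breaking, and alerting.

import logging

logging.basicConfig(level=logging.WARNING)

def fetch_primary(key):
    # Hypothetical primary data source; here it always fails to simulate an outage.
    raise TimeoutError("primary store unavailable")

def fetch_backup(key):
    # Hypothetical backup/replica; possibly stale but still available.
    return {"key": key, "value": "cached-copy", "stale": True}

def resilient_fetch(key):
    """A simple resilience barrier: degrade gracefully to a backup
    instead of failing outright when the primary is unavailable."""
    try:
        return fetch_primary(key)
    except (TimeoutError, ConnectionError) as exc:
        logging.warning("primary failed (%s); falling back to backup", exc)
        return fetch_backup(key)

print(resilient_fetch("user-42"))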


Resilience engineering also emphasizes the importance of learning from failures. Instead of viewing failures as purely negative events to be avoided, resilience engineering sees them as valuable opportunities for learning and improvement. This involves conducting thorough post-event analyses, sharing insights across organizational boundaries, and implementing changes to prevent similar failures in the future. By fostering a culture of learning and adaptation, resilience engineering helps organizations become more agile and responsive to change.

 

In recent years, resilience engineering has found applications in diverse domains, including aviation, healthcare, cybersecurity, and disaster management. For example, in aviation, resilience engineering has led to the development of Crew Resource Management (CRM) programs, which focus on improving communication, decision-making, and teamwork among flight crews to enhance safety and resilience in the face of unexpected events.

 

Interested in learning more? Here’s a short (9 minute, 51 second) introductory video by Dr. David Woods, a professor in the Department of Integrated Systems Engineering at The Ohio State University. https://youtu.be/r8awKlk7JPM?feature=shared

 

Additional videos from Dr. Woods are posted on the C/S/E/L BackChannel YouTube channel here.

 

As our world becomes increasingly interconnected and complex, the principles of resilience engineering will continue to be essential for ensuring safety, reliability, and effectiveness in all walks of life.

Friday, May 3, 2024

Community College Engineering Student Transfer

Yesterday I checked in via LinkedIn with a Holyoke Community College Engineering program graduate who transferred to a nationally ranked top-ten engineering university. The student is studying Electrical Engineering there, and I asked how things were going. Here’s a screenshot of the response I got, with identifying information removed, including the student’s name and the transfer university. Pretty cool!

The student compliments my two classes (Circuits 1 and Circuits 2), but there is so much more. Both classes are Calculus and Differential Equations based, so the students need to really know their math before I get them. The math, physics and chemistry instruction is exceptional at Holyoke Community College, as it is at so many other community colleges in the country. It is not just the STEM classes that prepare students for my classes, though. To get their degree, our students need to take additional courses including English Composition, History, Social Sciences and, in some cases, Business courses. These courses are critical, complementing the technical knowledge, skills and abilities gained in engineering courses and producing well-rounded professionals capable of addressing complex challenges with creativity, empathy, and ethical awareness.

 

I see it every day with students coming to my classes prepared to learn, solve problems, communicate and understand some pretty complex stuff. Amazing faculty doing amazing things in their classrooms makes it pretty easy for me to teach those classes.


We (community colleges) often face unjust criticism due to misconceptions. Despite offering quality education, we’re sometimes seen as inferior to four-year institutions. We provide valuable opportunities and options with smaller classes, dedicated faculty, and affordable tuition. And let's not forget transfer to four-year institutions.

 

Thanks to the unnamed student – you certainly made my day!

Monday, April 29, 2024

Distributed Inference And Tesla With Some SETI Nostalgia


In this post, I’m setting aside any political stuff and focusing solely on tech.

 

In recent months, the electric vehicle (EV) market has seen a decline, marked by falling sales and an increase in unsold inventory. Tesla, in particular, has received a significant share of negative attention. During Tesla's first-quarter earnings call last week, Elon Musk diverged from the norm by highlighting Tesla's broader identity beyond its role in the automotive industry. He emphasized the company's engagement in artificial intelligence and robotics, suggesting that pigeonholing Tesla solely within the EV sector overlooks its broader potential.
 

Musk's suggestion to actively utilize Tesla's computational power hints at a larger strategic vision. He envisions a future where idle Tesla vehicles contribute to a distributed network for AI model processing, termed distributed inference. This concept could leverage the collective computational strength of millions of Tesla cars worldwide, extending the company's impact beyond transportation.

 

Very interesting – I drive maybe 1-2 hours per day; the rest of the time my car sits unused. What if all that computing horsepower could be put to work while I’m not driving? Musk’s concept brings up memories of the sunsetted SETI@home application. SETI@home was a distributed computing project that allowed volunteers to contribute their idle computer processing power to analyzing radio signals from space in the search for extraterrestrial intelligence (SETI). SETI@home used data collected by the Arecibo Observatory in Puerto Rico and the Green Bank Telescope in West Virginia to search for patterns or anomalies that could indicate the presence of intelligent alien civilizations.

 

Participants in SETI@home downloaded a screensaver or software client onto their computers, which would then process small segments of radio telescope data during periods of inactivity. The processed data would be sent back to the project's servers for analysis. By harnessing the collective power of millions of volunteer computers around the world, SETI@home was able to perform computations on an unprecedented scale. The project was launched in 1999 by the University of California, Berkeley, and it quickly became one of the largest distributed computing projects in history. Although the original SETI@home project ended in 2020, its legacy lives on as an example of the power of distributed computing and the widespread public interest in the search for extraterrestrial life.
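The basic fetch-process-report loop behind SETI@home (and, conceptually, behind any Tesla-style distributed inference scheme) can be sketched in a few lines of Python. This is a simplified illustration, not the actual SETI@home/BOINC client protocol; the work-unit format and the idle check are invented for the example, and the "processing" step is just simulated CPU work.

import hashlib
import time

def fetch_work_unit(n):
    # Stand-in for downloading a small slice of telescope data from the project server.
    return {"id": n, "data": f"signal-segment-{n}".encode()}

def process(work_unit):
    # Stand-in for the real signal analysis; repeated hashing simulates CPU-bound work.
    digest = work_unit["data"]
    for _ in range(100_000):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()

def device_is_idle():
    # A real client would check screensaver state or CPU load;
    # a car would check whether it is parked and charging.
    return True

def run_client(units=3):
    for n in range(units):
        if not device_is_idle():
            time.sleep(60)  # wait and check again later
            continue
        wu = fetch_work_unit(n)
        result = process(wu)
        print(f"work unit {wu['id']} -> result {result[:16]}... (would be uploaded)")

if __name__ == "__main__":
    run_client()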

 

Musk's vision underscores Tesla's potential to revolutionize not only the automotive sector but also broader domains such as artificial intelligence and robotics. It signifies a strategic shift towards leveraging Tesla's resources and expertise in a SETI-like way to drive innovation and create value in new and unexpected ways.

Friday, April 26, 2024

Communications, Networking Methods & Protocols: Introduction and the Information Asset

Terry Pardoe and I wrote an unpublished text titled Data Communications, Networking Methods and Protocols 20 years ago. Terry passed away on May 2, 2016 at the age of 76. Over this summer I’ll be posting content from that unpublished book here in honor and respect of Terry. It is interesting, 20 years later, to see a combination of now-obsolete and still-relevant technologies. Here’s the first post, drawn from the first chapter.

 

The creation and introduction of the binary digital computer into the world of information collection, processing and distribution has brought with it massive expansions in the speed of processing and the breadth of distribution. It has also brought new approaches to connection and an ever-increasing need to construct and operate complex, multi-vendor networks. Computer systems allow us to perform complex information manipulations millions of times faster than by hand and reduce the risk of making the same mistakes we always did.

 

Before any attempt is made to analyze the creation and operation of networks, ranging in size from those covering a single household to those with global coverage, we need to understand the evolving role of computers in the past, the present and the future, and how our need to deliver computing power and information to a wide range of users has resulted in complex solutions utilizing a broad spectrum of computer types and transmission mechanisms. Such integration has made the use of standardized approaches of paramount importance.

 

In this post we'll take a look at how computer systems and information use have evolved into modern approaches, and how the world of standards has ensured this transition from the simple to the complex.

 

The Information Asset


The collection, storage and maintenance of timely information over a wide range of types has been carried out over the centuries by a range of written bookkeeping techniques, including wall paintings, scrolls, and both handwritten and typed ledgers.

 

Within a corporation, different types of information exist in many forms. Corporate-level information can include financial records, asset lists, customer profiles, product definitions and specifications, trend analyses, competition evaluations and much more. At the department level, information can include function definitions, resource availability, staffing lists, technical specifications, schedules and other operational information. In addition, information such as personal schedules, travel support documents, operating procedures, usernames and passwords is typically collected and saved by individuals.

 

A corporation may also acquire and maintain personal, often private and sensitive, information about its employees, including social security and tax information, educational background materials and work history. It may also save information from public sources that is considered useful to the corporation. Trade laws and restrictions in overseas markets, climatic conditions in countries of operation, demographics, maps and travel instructions are all examples of this type of information. Such collection and storage of information has always presented a number of issues to management, the major ones being:


Ownership - Who, within the organization, owns the information and protects and certifies its accuracy.

 

Control - Who controls the information: its collection, its use and by whom, its modification (again by whom and when), and its final elimination. (It should be noted that ownership and control may be vested in different individuals or organizational units.)

 

Distribution - How is information distributed, to whom, under what conditions, by what technical mechanisms, and what controls are in place to prevent it from being misused or falling into the wrong hands.
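One way to make those three questions operational is to record the answers as metadata alongside each information asset. The sketch below is purely illustrative; the field names are assumptions made for the example, not any established records-management standard.

from dataclasses import dataclass, field
from typing import List

@dataclass
class InformationAsset:
    name: str
    owner: str                      # who certifies accuracy and is accountable
    controller: str                 # who governs collection, modification, and disposal
    allowed_recipients: List[str] = field(default_factory=list)   # distribution: to whom
    distribution_channels: List[str] = field(default_factory=list)  # by what mechanisms
    retention: str = "until superseded"  # when the information is finally eliminated

payroll = InformationAsset(
    name="Employee payroll records",
    owner="Finance Director",
    controller="HR Information Systems",
    allowed_recipients=["Payroll staff", "External auditor"],
    distribution_channels=["Encrypted internal file share"],
)
print(payroll)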

 

The key to successful information control lies in the selection or creation, and subsequent implementation, of a company-wide suite of information standards. Many examples of such standards exist and have evolved over the ages, addressing issues such as:


·      The infrastructure needed to create, maintain and use stored information resources.

·      The financial cost of creation, maintenance, protection and final elimination of all forms of information.

·      All machine (where machines are used) and human factors.

·      Measures taken to eliminate the impact of all disasters, natural or man-made.

 

The goal with all collected and distributed information, whether stored as paintings on cave walls or as detailed writings in ledgers, has always been to meet the objectives of what the authors have defined as the Information Bill of Rights:

  • The right information
  • To the right person or process
  • At the right time
  • In the right place
  • In the right form and format
  • At the right price

Wednesday, April 24, 2024

Spatial Diversity In Wireless Communications

Spatial diversity is one of those fundamental technologies used in wireless communications
(cellular networks, Wi-Fi, satellite communications, and broadcasting) that does not get much exposure. The technology is used to combat fading and improve signal quality, enabling reliable communication links, especially in challenging environments characterized by obstacles, interference, or long propagation distances. Let’s take an introductory look. 

Spatial diversity exploits the spatial dimension of wireless channels by deploying multiple antennas at either the transmitter or receiver, or both. By leveraging spatial separation between antennas, spatial diversity techniques minimize the effects of fading, which result from signal attenuation, reflections, and scattering in multipath propagation environments. Through the simultaneous reception of multiple independent copies of a transmitted signal, spatial diversity enhances the likelihood of receiving at least one strong signal, thus improving the overall reliability of communication links.

 

There are three key combining methods: Selection Diversity, Maximal Ratio Combining (MRC) and Equal Gain Combining (EGC). A short simulation comparing the three follows the descriptions below.

 

Selection Diversity: In selection diversity, multiple antennas are strategically placed to receive the same signal, and the antenna with the highest received signal strength is chosen for further processing. This technique is relatively simple to implement and offers improved diversity gain, particularly in scenarios with moderate to severe fading.

 

Maximal Ratio Combining (MRC): MRC combines the signals from multiple antennas using complex weights determined by the channel conditions. By weighting each received signal in proportion to its channel gain (and hence its signal-to-noise ratio, SNR) and combining the branches coherently, MRC maximizes the SNR at the combiner output, giving the best signal quality and reliability of the three methods.

 

Equal Gain Combining (EGC): EGC takes a simpler approach, co-phasing the signals from the multiple antennas and combining them with equal weights. While less complex than MRC, EGC still provides diversity gain by averaging across independent fading branches.
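To make the three methods concrete, here is a small NumPy sketch comparing their average output SNR over simulated Rayleigh-fading branches. It is a back-of-the-envelope model (flat fading, perfect channel knowledge, equal noise power per branch), not a full link-level simulation.

import numpy as np

rng = np.random.default_rng(0)
branches = 4          # number of receive antennas
trials = 100_000
noise_power = 1.0     # per-branch noise variance
signal_power = 1.0    # transmit symbol energy

# Rayleigh-fading channel gains: one complex coefficient per branch per trial.
h = (rng.standard_normal((trials, branches)) +
     1j * rng.standard_normal((trials, branches))) / np.sqrt(2)
branch_snr = signal_power * np.abs(h) ** 2 / noise_power   # instantaneous per-branch SNR

# Selection diversity: use only the strongest branch.
snr_selection = branch_snr.max(axis=1)

# Maximal ratio combining: co-phase and weight by channel gain;
# output SNR is the sum of the branch SNRs.
snr_mrc = branch_snr.sum(axis=1)

# Equal gain combining: co-phase only, equal weights;
# the combined noise power grows with the number of branches.
snr_egc = signal_power * np.abs(h).sum(axis=1) ** 2 / (branches * noise_power)

for name, snr in [("single branch", branch_snr[:, 0]),
                  ("selection", snr_selection),
                  ("EGC", snr_egc),
                  ("MRC", snr_mrc)]:
    print(f"{name:14s} mean SNR: {10 * np.log10(snr.mean()):5.2f} dB")

With four branches the expected ordering should show up: MRC gives the highest average output SNR, EGC trails it slightly, and selection diversity still comfortably beats a single antenna.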

 

Spatial diversity offers an effective mechanism to combat fading and enhance signal reliability. Through the strategic deployment of multiple antennas and the application of diverse combining techniques, the technology improves data transmission across a wide range of environments and applications.

Monday, April 15, 2024

Lost Text and Lost Friend: Terry Pardoe and Data Communications, Networking Methods and Protocols

 

In 2003-2004, I collaborated with Terry Pardoe, co-authoring a Network Security book published in 2004. Inspired by its success, in 2005 we began work on another book titled Data Communications, Networking Methods and Protocols, which unfortunately never made it to publication. Fast forward to September 2014, when I had the honor of delivering the opening keynote for the fall semester at New Hampshire Community Technical College (NHCTC) in Nashua. Terry played a pivotal role in making that happen, and during the event we had the chance to capture the photo here together, proudly holding our first co-authored text.

I first crossed paths with Terry back in 1999 when he joined NHCTC-Nashua as a part-time lecturer and became a subject matter expert in the (sunsetted in 2016) Verizon NextStep program. Despite a rocky start, our relationship quickly blossomed into a close friendship. Terry possessed a remarkable sense of humor, though I can't recall ever seeing him laugh. He sure knew how to make me laugh, though. He resided in Nashua, New Hampshire, alongside his wife and family. A few years ago I learned that Terry passed away on May 2, 2016 at the age of 76. Years later and I’m just finding out; it happens to us all. We lose touch with the people we work with and are friends with when things change. Before we get into the content, here’s a little bit about Terry.

 

Terry was born in the United Kingdom and educated at the Birmingham College of Advanced Technology (now Aston University). After coming to the States, Terry D. Pardoe was executive vice president of International Management Services Inc., a US-based computer application and training organization, for 23 years (until August 1999). At the time of his death, he had more than 40 years of experience in the design and application of networks, communications and information systems. He was an internationally recognized expert on all aspects of telecommunications and networking, including wide and local area networks, TCP/IP based networks, the Internet, intranets, client-server computing, data and network security and many other applied areas. He lectured and consulted on a worldwide basis for a wide range of clients including Digital Equipment Corp., AT&T, Sprint United, Verizon, Citibank, IBM, Honeywell, NT&T (H.K.), and SCI (Brazil). Terry worked with all the major agencies of the US Government including NASA, NSA, DISA, the US Navy, the US Army, the IRS and many others.

 

Terry was the author or co-author of over 200 technical texts on computer applications, management techniques and data communications, including the text supporting the first Java seminar available on a worldwide basis. He also authored many web pages that included complex graphics, Java applets and JavaScript.

 

Today I found an old CD-ROM with all of the text, images and other materials for the Data Communications, Networking Methods and Protocols book intact. Over the summer I’ll be posting content from that unpublished book here in honor and respect of Terry. It is interesting, 20 years later, to see a combination of now-obsolete and still-relevant technologies.

 

An amazing man and an amazing career, building the foundation for the Internet we have today. Thanks Terry!!