Thursday, February 26, 2015

Product Management - Design For Reliability



The role of quality and reliability in a product's success cannot be disputed. Product failures in the field inevitably lead to losses in the form of repair costs, product recalls, lost sales, warranty claims, customer dissatisfaction, and in extreme cases, loss of life. Thus, quality and reliability play a critical role in product development.

Quality and reliability have become standard in products such as airplanes, medical devices, cars, robotics, and industrial automation. Yet in software products, reliability and quality often seem to be sadly lacking.

Often, during the product development cycle, reliability and quality testing is compromised in favor of faster time to market. The general attitude is: "If customers find a bug, we will fix it in a patch release."

Agile product development and rapid release cadences add to the pressure: a new release every quarter or every month and, in the case of extreme programming, daily updates!

The idea that all known defects, security failures, and so on can be fixed quickly after release has led to products that have poor reliability.

Today, customers typically wait two to three quarters after a product release before putting that product into production. Large enterprise customers have to test new software products before deploying them. But with shrinking product life cycles, companies are being forced to build products that have specific design features for reliability.

As a result, new enterprise products are now being designed for reliability. From a product design perspective, reliability is an application's ability to operate without failure.

This includes ensuring that accurate data comes into the system, that data transformation and state management are error free, and that recovery does not corrupt data when a failure condition is detected.

Creating a high-reliability application starts early in the development life cycle, right at the product specification, and reliability is then built into architecture, design, coding, testing, deployment, and operational maintenance.

Reliability cannot be added to an application at the deployment stage, though attempting to do so is quite common. It has to be carried through from early design specification, through building and testing, to deployment and ongoing operational maintenance.

Common steps for building reliability into a product are:


  1. Define product reliability requirements in the product specification.
  2. Include reliability in the product architecture, e.g., a distributed Vapp architecture.
  3. Build application management information into the application.
  4. Use redundancy for reliability.
  5. Use quality development tools.
  6. Use built-in application health checks.
  7. Use consistent error handling.
  8. Build error recovery mechanisms into the product.
  9. Incorporate Design for Debug functionality for easy debugging.
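
Steps 6 through 8 can be illustrated with a small Python sketch. Everything here is hypothetical and only meant to show the shape of the idea: `AppError`, `HealthReport`, `run_health_checks`, and `with_recovery` are names invented for this example, not part of any particular product or library.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


class AppError(Exception):
    """Hypothetical single base error so every failure has the same shape."""

    def __init__(self, code: str, message: str, retryable: bool = False):
        super().__init__(message)
        self.code = code            # stable identifier for logs and support
        self.retryable = retryable  # tells callers whether recovery makes sense


@dataclass
class HealthReport:
    healthy: bool
    details: Dict[str, bool]


def run_health_checks(checks: Dict[str, Callable[[], bool]]) -> HealthReport:
    """Run every registered check; a check that raises counts as failed."""
    details: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            details[name] = bool(check())
        except Exception:
            details[name] = False
    return HealthReport(healthy=all(details.values()), details=details)


def with_recovery(operation: Callable[[], T], attempts: int = 3,
                  delay_s: float = 0.0) -> T:
    """Retry an operation, but only when it fails with a retryable AppError."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except AppError as err:
            # Give up on non-retryable errors or when attempts are exhausted.
            if not err.retryable or attempt == attempts:
                raise
            time.sleep(delay_s)
```

The design choice worth noting is that the error type itself carries the "is this recoverable?" decision, so recovery code does not have to guess from error messages.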


Many reliability design ideas overlap with high availability, where system resilience is built into the software. In high-availability systems, two or more instances of the software run separately but in sync. New software systems are designed for geo-distributed deployment, so customers can continue to use the product even if a data center goes down.

There is a very close relationship between reliability and availability. While reliability is about how long an application runs between failures, availability is about an application's capacity to immediately begin handling all service requests, and especially — if a failure occurs — to recover quickly and thereby minimize the time when the application is not available. Obviously, when an application's components and services are highly reliable, they cause fewer failures from which to recover and thereby help increase availability.
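
One common way to put numbers on this relationship is the steady-state availability formula, availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to repair. The formula and the sample figures below are assumptions brought in for illustration; they are not measurements from any real product.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the application is usable."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(avail: float, hours_per_year: float = 8760.0) -> float:
    """Expected unavailable hours in a year for a given availability."""
    return (1.0 - avail) * hours_per_year


# Illustrative numbers only: one failure per 1000 hours, one hour to recover.
baseline = availability(mtbf_hours=1000.0, mttr_hours=1.0)
# Same reliability, but recovery is twice as fast.
faster_recovery = availability(mtbf_hours=1000.0, mttr_hours=0.5)
```

With these made-up figures, the baseline works out to roughly 8.75 hours of downtime per year, and halving the recovery time brings that to about 4.4 hours, without changing how often the application fails.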


Improving Software Reliability


Software and system reliability can be improved by giving attention to the following factors:


  1. Focus strongly and systematically on requirements development, validation, and traceability, with particular emphasis on software usage and software management aspects. Full requirements development also requires specifying both what the system must do and what it must not do (e.g., heat-seeking missiles should not boomerang and return to the installation that fired them).
  2. Formally capture a "lessons learned" database and use it to avoid past issues with reliability and thus mitigate potential failures during the design process. Think defensively. Examine how the code handles off-normal program inputs. Design to mitigate these conditions.
  3. Beta software releases are most helpful in clarifying the software's requirements. The user can see what the software will do and what it will not do. This helps clarify the user's needs and the developer's understanding of the user's requirements. Beta releases help the user and the developer gather experience and promote a better operational and functional definition of the product. Beta releases also help clarify the user's environment and the system exception conditions that the code must handle.
  4. Build diagnostic capability into the product. When a software system fails, it must automatically collect all the information needed to debug the failure.
  5. Carry out a potential failure modes and effects analysis to harden the system against abnormal conditions.
  6. Software failures at customer sites should always be analyzed down to their underlying root cause, both for repair and to prevent recurrence. To be most proactive, search the rest of the system software for other places where the same type of failure could occur.
  7. Every common failure must be treated as critical, traced to its root cause, and remedied.
  8. Capture and document the most significant failures: understand what caused each failure and develop designs to prevent such failures in the future.
  9. Fault injection testing must be part of system testing.
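
Fault injection (item 9) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: `FaultInjector`, `InjectedFault`, and `lookup_with_fallback` are hypothetical names made up for this example, and a real system test would inject faults at the network, disk, or process level rather than around a single function.

```python
import random
from typing import Callable


class InjectedFault(Exception):
    """Raised by the injector in place of a real dependency failure."""


class FaultInjector:
    """Wraps a callable and makes it fail on a configurable fraction of calls."""

    def __init__(self, target: Callable[[str], str], failure_rate: float,
                 seed: int = 0):
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0 and 1")
        self._target = target
        self._failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded so test runs are repeatable
        self.injected = 0                # how many faults were injected so far

    def __call__(self, key: str) -> str:
        if self._rng.random() < self._failure_rate:
            self.injected += 1
            raise InjectedFault(f"injected failure for {key!r}")
        return self._target(key)


def lookup_with_fallback(fetch: Callable[[str], str], key: str,
                         default: str) -> str:
    """Code under test: must return a safe default when the dependency fails."""
    try:
        return fetch(key)
    except Exception:
        return default


# Usage: force the dependency to fail every time and check the fallback path.
always_failing = FaultInjector(lambda k: k.upper(), failure_rate=1.0)
result = lookup_with_fallback(always_failing, "config", default="safe-default")
```

The point of the exercise is that the failure path gets executed on purpose during testing, instead of being discovered for the first time at a customer site.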


Benefits of Design for Reliability


The concept of design for reliability (DFR) in software has been gaining ground as a standard in recent years and will continue to develop and evolve in the years to come. Design for reliability shifts the focus from a "test-analyze-fix" philosophy to designing reliability into products and processes using the best available technologies.

DFR also changes test engineering from product testing for defect detection to testing for system stability and system resilience.

As DFR standards evolve, product companies are setting up reliability engineering teams as an enterprise-wide activity. These teams give guidance on how to design for reliability, provide risk assessments and templates for reliability analysis, and develop quantitative models to derive the probability of failure for products.
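
As a rough idea of what such a quantitative model can look like, here is a minimal Python sketch. It assumes a constant failure rate (the exponential model, R(t) = exp(-t / MTBF)) and independent components. Both are simplifying assumptions introduced for this example, and the function names are hypothetical.

```python
import math
from typing import Iterable


def reliability(t_hours: float, mtbf_hours: float) -> float:
    """Probability of running t_hours without failure (exponential model)."""
    if t_hours < 0 or mtbf_hours <= 0:
        raise ValueError("t_hours must be non-negative and mtbf_hours positive")
    return math.exp(-t_hours / mtbf_hours)


def failure_probability(t_hours: float, mtbf_hours: float) -> float:
    """Probability of at least one failure within t_hours."""
    return 1.0 - reliability(t_hours, mtbf_hours)


def series_reliability(component_reliabilities: Iterable[float]) -> float:
    """All components must work: multiply the individual reliabilities."""
    return math.prod(component_reliabilities)


def parallel_reliability(component_reliabilities: Iterable[float]) -> float:
    """Redundant components: the system fails only if every copy fails."""
    return 1.0 - math.prod(1.0 - r for r in component_reliabilities)
```

Under these assumptions, two components of 0.9 reliability each give 0.81 when both are required, but 0.99 when either one is enough, which is the arithmetic behind using redundancy for reliability.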

DFR impacts the entire product life cycle, reducing life-cycle risks and minimizing the combined cost of design, manufacturing, quality, warranty, and service. Advances in system diagnostics/prognostics and system health management are helping the development of new models and algorithms that can predict the future reliability of a product by assessing the extent of its degradation from expected operating conditions.

DFR principles and methods aim to proactively prevent faults, failures, and product malfunctions, which results in cheaper, faster, and better products. Product reliability is best used as a tool to gain customer loyalty and customer trust. For example, a lot of customers still use Sun/Oracle computers, IBM Z series systems, and Unix operating systems for their reliability.
