Obtaining pre-market approval for a medical device is arduous. Manufacturers must look beyond the strictly technical challenges and focus on the needed environment and culture. They must consider ten fundamental truths—truths that we must tell and truths that we must face—about building and gaining approval for medical devices.

by Yi Zheng and Chris Hobbs, QNX Software Systems

Truth 1: Safety Culture
The first truth applies most broadly. Without a company-wide safety culture, it is unlikely that a safe medical product can be built. A safety culture is not only a culture in which engineers are permitted to raise questions related to safety, but a culture in which they are encouraged to think of each decision in that light. A programmer might think, “I could code this message exchange using technique A or B, and I am not sure how to balance the better performance of A against the higher dependability of B,” and know with whom that decision should be discussed. The culture that encourages the programmer even to consider the question must be nurtured.
 

Truth 2: Experts
Safe systems must be simple. And creating a simple system is the hardest challenge for any engineer. For this we need experts. It takes specialized training and experience to define what a safe system must do and to verify that it meets its safety requirements. Ultimately, it is the relevant experts—domain experts, system architects, software designers, process specialists, programmers, verification specialists, among others—who determine the requirements, select appropriate design patterns and build and validate the system.
Such expertise is expensive because it must be based on experience rather than training: few university undergraduate courses in computer engineering cover embedded software development, and even fewer teach the elements of creating embedded systems with sufficient dependability.
No system is absolutely dependable, and so we must understand what our system needs in order to be sufficiently dependable. Accepting sufficient dependability reduces development cost and gives us the measures against which we can validate our safety claims. Without an understanding of what level of dependability is sufficient, we are likely to produce a system that is overly complex, and hence fault-ridden and prone to failure. Software design patterns and techniques have advanced significantly since the mid-1990s, but many designers have not been exposed to these changes; Figures 1 and 2 illustrate some of the newer development tools and methods.

Truth 3: Processes
Good processes are a measurable proxy for something that is currently largely unmeasurable. It is relatively easy to measure whether a process has been followed; it is much more difficult to assess whether good-quality design and code are being produced. While no one claims that a good process guarantees a good product, it is generally recognized that a good product is unlikely to result from a poor process.
The medical device software standard IEC 62304 is about processes, and without good processes we will never be able to demonstrate that the system meets its safety requirements. IEC 62304 sets out the processes required in developing a medical device, not because these guarantee the production of a safe product, but because they provide the environment within which development parameters can be assessed. For example, having a good test process allows statistical claims to be made about test coverage; without the process, this would be impossible. In addition, good processes provide the structure within which the chain of evidence in the safety case is preserved. Retrospectively producing a safety case is possible but expensive, and would almost certainly require the re-generation of evidence that existed during development but was not preserved.
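To illustrate the kind of statistical claim a documented test process makes possible, here is a minimal sketch with hypothetical figures of the classical zero-failure confidence bound (often quoted as the “rule of three”): if n tests drawn at random from the device’s operational profile all pass, the 95 percent upper confidence bound on the probability of failure per demand is approximately 3/n.

/*
 * Illustrative sketch only (figures are hypothetical): the statistical
 * claim that a documented random-test process makes possible.
 */
#include <math.h>
#include <stdio.h>

/* Exact one-sided upper bound on failure probability per demand when
 * n independent tests have all passed: solve (1 - p)^n = alpha.       */
static double upper_bound_zero_failures(unsigned n, double alpha)
{
    return 1.0 - pow(alpha, 1.0 / (double)n);
}

int main(void)
{
    const unsigned n = 30000;     /* hypothetical number of passing tests */
    const double alpha = 0.05;    /* 1 - confidence level (95%)           */

    printf("Exact 95%% bound: %.2e failures per demand\n",
           upper_bound_zero_failures(n, alpha));
    printf("Rule of three  : %.2e failures per demand\n", 3.0 / n);
    return 0;
}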

Truth 4: Making Claims Explicit
Safety claims must explicitly state dependability levels, and the limits within which these levels are claimed. The FDA has recognized that “indirect process data showing that design and production practices are sound” is not adequate to demonstrate that software is safe, and that “device assurance practices […] focused on demonstrating product-specific device safety” are also required. This demonstration is included in a safety case and reflects the observation above that the purpose of a high-quality process is not to guarantee a high-quality product but to provide the environment within which evidence can be assessed.
Every safety case has at its heart claims of this sort: “This system will do A with level of dependability B under conditions C and, if it is unable to do A, it will move to its design safe state with probability P.” This claim, with its attendant caveats, is laid out in the system’s Safety Manual so that it can be incorporated into the safety case of a higher-level system.
A system’s dependability is its ability to respond correctly to events in a timely manner, for as long as required: a combination of availability—how often it responds to requests in a timely manner—and reliability—how often these responses are correct.
The safety case states the system’s dependability claims and provides the evidence that it meets these claims. The limits of the dependability claims are as important as the claims themselves. For example, a medical imaging system may be designed to meet IEC 61508 SIL3 requirements for continuous operation not exceeding 8 hours, at which time the system must be reset (rejuvenated). Since imaging sessions are typically brief, this limit will pose no inconvenience, even for a system being used 24 hours a day.
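To make the arithmetic behind such a limited claim concrete, the short calculation below uses hypothetical figures: it assumes a constant dangerous-failure rate at the top of the IEC 61508 SIL3 continuous-mode band (below 10^-7 dangerous failures per hour) and computes the probability of a dangerous failure within a single 8-hour session before reset.

/*
 * Illustrative arithmetic only; the rate and session length are
 * hypothetical, not figures for any real device.
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double lambda  = 1.0e-7; /* claimed dangerous failures per hour */
    const double session = 8.0;    /* hours of operation before reset     */

    /* With a constant failure rate, the probability of a dangerous
     * failure during one session is 1 - exp(-lambda * t).                */
    double p_session = 1.0 - exp(-lambda * session);

    printf("P(dangerous failure in one %.0f-hour session) = %.1e\n",
           session, p_session);
    return 0;
}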

Figure 1: Detail from a diagram (small part of a Bayesian Network) showing the probability of failure per hour for a medical monitoring device reference design. Great expertise is required to identify risks and correctly calculate probabilities of failure.

Truth 5: System Failures
No system is immune to bugs, especially Heisenbugs—mysterious bugs that “appear,” then “disappear” when we look for them. Failures will occur. Build a system that will recover or move to its design safe state. Accepting that all systems will contain faults, and that faults may lead to failures, a safe system must include multiple lines of defense (Table 1).
The first of these is the isolation of safety-critical processes: identify the safety-critical components and design the system so that they cannot be compromised by other components. The ideal solution would be to identify and remove every fault from the code, but this is impractical, so the next line of defense is to prevent faults from becoming errors. Beware the Heisenbug, and design so that faults are caught and encapsulated before they become errors in the field.
The next level is to prevent errors from becoming failures. Techniques such as replication and diversification are less suitable for software than for hardware, but they can still be valuable if used carefully.
The final line of defense is detection of, and recovery from, failures. In many systems it is acceptable to move to the pre-defined design safe state and leave recovery to a higher-level system (such as a human). In some systems this is not practical and either recovery or restart will be needed. In general, the crash-only model followed by a fast reset may be preferable to attempting recovery in an ill-defined environment.
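As an illustration of the crash-only approach, here is a minimal sketch in portable POSIX C; it is not drawn from any particular product, and the worker, its simulated fault and the restart limit are all hypothetical. A supervisor detects that a monitored component has failed and restarts it from a clean state rather than attempting in-place recovery; after repeated failures a real device would move to its design safe state and escalate.

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* The monitored component; abort() simulates a fault that has become
 * a failure.  Real component work would go here.                      */
static void run_worker(void)
{
    sleep(1);
    abort();
}

int main(void)
{
    for (int restarts = 0; restarts < 3; restarts++) {
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            return EXIT_FAILURE;
        }
        if (pid == 0) {              /* child: the monitored component  */
            run_worker();
            _exit(EXIT_SUCCESS);
        }

        int status;
        waitpid(pid, &status, 0);    /* supervisor waits for the outcome */
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            break;                   /* clean exit: nothing to do        */

        /* Crash-only: restart from a known-clean state rather than
         * trying to repair the failed instance in place.               */
        fprintf(stderr, "worker failed; fast restart #%d\n", restarts + 1);
    }
    /* After repeated failures, a real device would move to its design
     * safe state and escalate to a higher-level system.                */
    return 0;
}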

Figure 2: Detail from a system-level fault tree (partial view) for a medical monitoring device. The fault tree uses a Bayesian network and can be seamlessly integrated into a safety case, if the case is also prepared using Bayesian techniques.

Truth 6: Validation
Testing is designed to detect faults in the design or implementation indirectly by uncovering the errors and failures that they can cause. Testing is of primary importance in detecting and isolating Bohrbugs—solid, reproducible bugs that remain unchanged even when a debugger is applied—but is of less use when faced with Heisenbugs because the same fault manifests as different errors each time it occurs.

However, to demonstrate that our system meets its safety claims, we must use testing as just one of many techniques because testing is insufficient to prove dependability. Other methods are required including formal design, statistical analysis, retrospective design validation and more.

Among these, static analysis is recommended by agencies such as the FDA because it is invaluable for locating suspect code. Static analysis can include syntax checking against coding standards, fault probability estimation, correctness proofs against assertions in the code, and symbolic execution (static/dynamic hybrid). In addition, proven-in-use and prior-use data are essential for building dependability claims. The in-use hours and failures resulting from this use should be gathered throughout the product lifecycle. The larger the sample size, the greater the confidence we can place in our claims.
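As a small illustration of the “correctness proofs against assertions in the code” mentioned above, the hypothetical function below states its contract explicitly; an analysis tool then attempts to show that no execution path can violate the stated bounds.

/*
 * Illustrative sketch: the function, buffer length and calling code
 * are hypothetical, not taken from any real device.
 */
#include <assert.h>
#include <stddef.h>

#define SAMPLE_BUFFER_LEN 64u

/* Store one sample.  The assertions state the contract that a static
 * analyzer or prover attempts to show can never be violated.          */
static void store_sample(unsigned short *buf, size_t index,
                         unsigned short value)
{
    assert(buf != NULL);
    assert(index < SAMPLE_BUFFER_LEN);  /* no out-of-bounds write       */
    buf[index] = value;
}

int main(void)
{
    unsigned short samples[SAMPLE_BUFFER_LEN] = {0};
    store_sample(samples, 3, 512);      /* satisfies the stated contract */
    return 0;
}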
Other techniques include fault injection: deliberately introducing faults, both to exercise the code designed to detect and handle errors and to help estimate the number of faults remaining in the system. As with the analysis of random tests, the results of fault injection require careful statistical analysis. Formal and semi-formal design verification is traditionally done before implementation, but it can also be performed retrospectively.
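The sketch below illustrates both uses of fault injection with hypothetical names and figures: a compile-time switch that forces an error return so that the error-handling path is genuinely exercised, and the classical fault-seeding estimate of remaining faults (if testing finds s of S deliberately seeded faults and n unseeded faults, the total number of unseeded faults is estimated as n × S / s).

/*
 * Illustrative only: the sensor function, the injection switch and the
 * seeding figures are all hypothetical.
 */
#include <stdio.h>

#ifdef INJECT_SENSOR_FAULT
#define SENSOR_READ(valp) (-1)          /* injected fault: the read fails */
#else
#define SENSOR_READ(valp) read_sensor(valp)
#endif

static int read_sensor(int *value)
{
    *value = 42;                        /* placeholder reading            */
    return 0;
}

int main(void)
{
    int value = 0;

    /* 1. Compile with -DINJECT_SENSOR_FAULT to force the error path,
     *    confirming that the detection and handling code really runs.  */
    if (SENSOR_READ(&value) != 0)
        fprintf(stderr, "sensor failure detected and handled\n");

    /* 2. Fault-seeding estimate (hypothetical figures): testing found
     *    20 of 25 seeded faults and 8 unseeded faults, so the unseeded
     *    total is estimated at 8 * 25 / 20 = 10.                        */
    const double seeded = 25.0, seeded_found = 20.0, unseeded_found = 8.0;
    printf("Estimated unseeded faults: %.1f\n",
           unseeded_found * seeded / seeded_found);
    return 0;
}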

Truth 7: COTS and SOUP
The best way to build a safe software system is usually not to build everything oneself; doing so entails more risk than building the system from carefully selected commercial off-the-shelf (COTS) components. Building OSs, communications stacks and databases requires specialized knowledge, and the COTS equivalent may have the advantage of tens of millions of hours of in-use history. So it is permissible to use COTS and even software of unknown provenance (SOUP), provided these components come with sufficient evidence to support the overall system’s safety case.
That said, COTS software is usually SOUP as far as the developer of the medical device is concerned, and should therefore be treated with appropriate caution. Both IEC 61508 and IEC 62304 assume that SOUP will be used. The trick is to ensure that sufficient documented evidence is available to quantify the implications of the SOUP for our system’s ability to meet its safety requirements.
This evidence will include proven-in-use data, fault histories and other historical data. We should request the source code and test plans so we can scrutinize the software with static code analysis tools. The vendor should also make available the detailed processes used to build the software or a statement from an external auditor that those processes were suitable for an IEC 62304 device.

Truth 8: Certified Components and Their Vendors
Components with safety certifications, such as an OS certified to IEC 61508, can speed development and validation, and facilitate approvals. If COTS is used, there is an advantage to employing components that have received relevant approvals. Agencies such as the FDA, MHRA and Health Canada, and their counterparts in other jurisdictions, approve not the components but the entire system or device for market; nonetheless, components that have received certifications, such as IEC 61508 or IEC 62304, can streamline the approval process and reduce time-to-market.
In order to receive certification, these components must be developed in an environment with appropriate processes and quality management. They must undergo the proper testing and validation, and the COTS software vendor must provide all the necessary artifacts, which in turn support the approval case for the final device.

Truth 9: Auditors
The auditors are our friends; engage them early. In the world of safe software development, certification auditors understand how we need to establish our processes to obtain the certifications, and they can help us structure our safety case. The earlier we bring the auditors in to help us, the less we’ll have to revise, and the more efficient our development cycle will be.
It is particularly useful to explore the proposed structure of the safety case argument with the auditor before evidence has been added to it. If a notation such as GSN or BBN is used to express the argument, clearly separating the structure of the argument from the evidence, we can ask the auditor: “If we present the evidence for this argument, would you be satisfied?” This reduces the chances of surprise during an audit.

Truth 10: It Doesn’t End with the Product Release
Our responsibility for a safe system does not end when the product is released; it continues until the last device and the last system are retired. Updates to software can compromise its integrity, and the following numbers, though a little dated, are eloquent. In a study the FDA conducted between 1992 and 1998, 242 of 3,140 device recalls (7.7 percent) were found to be due to faulty software. Of these, 192 (almost 80 percent) were caused by defects introduced during software maintenance.
In other words, the faults were introduced after the devices had gone to market. Hence, the processes we use to ensure that our software meets its safety requirements must encompass the entire lifecycle of the software, including fixes and updates.

QNX Software Systems, Ottawa, ONT. (613) 591-0931. [www.qnx.com].