Predicting When to Reboot ``Continuously Operating'' Embedded Software

Jeffrey Voas and Frank Charron

ABSTRACT

One of the greatest problems with embedded control software is knowing how to gracefully handle state corruption that gradually builds up after continuous execution of the software. This paper presents: (1) an empirical technique for assessing how frequently such systems should be rebooted, and (2) an empirical technique for determining which portions of the state create the greatest safety risks if they are corrupted. Armed with this information, developers and certifiers can create justifiable plans for the frequency with which the software should be rebooted. Further, they can customize and embed internal self-tests into those portions of the state found to have the greatest safety risks. These self-tests will warn that serious safety problems are likely to occur unless action is taken.

KEYWORDS: Fault injection, State corruption, Reboot, Embedded software

 

1 INTRODUCTION

Conventional software systems take inputs from users and provide output information back to the users. In contrast, embedded control software begins executing from a set of initial conditions when it boots up and then continues executing according to state information that was calculated from earlier executions of the software. Figure 1 illustrates a basic control system that is supervised by continuously running software. The software is comprised of two parts: initialization and a control loop.

Figure 1 Basic control system

The initialization code is executed once per program execution at the time when the program is first booted. After the initialization has completed, the internal data state is in a predefined state. This startup state will be the same each time the system is rebooted and the program started up again. The control loop iterates continuously, collecting new inputs from the system, and calculating new outputs which are fed back into the system. The calculation of new output values for a current iteration through the loop is based on the current input values and the internal data state, which evolves as a function of all previous input values and control parameters. The outputs produced for a current pass through the loop directly impact the inputs that will be supplied in the next pass through the loop.

We term state information that was calculated from earlier executions of the software as feedback state. The process of using information from previous executions on a current computation is handled by a feedback mechanism built into the software or device that the software controls. In addition to stored state information, other input information from the environment that the software is embedded into may also be fed into the software.

One other interesting difference between conventional software and embedded software is that embedded software often produces little external output. Most of the results that embedded software computes remains held as internal state information. This information is created during execution but not released to the external environment. Results that are released externally mostly consist of output signals that are necessary to keep the physical entity under the software's control operating safely. Because embedded systems often produce little external output when compared to the amount of information they actually compute, they are more difficult to test and debug than conventional software programs.

The problems of testing embedded software can generally be tied to a lack of ``observability.'' The more information that a tester can glean concerning what actually occurred inside of a program as it executed, the easier the software will be to debug if the software failed. Also, having access to this additional information provides more confidence to the tester that the software did not execute any faults on an execution in which the software did not fail.

By increasing the amount of information released from inside of the software during testing (e.g., outputting 2 64-bit floating point values as opposed to one), more of the internal (intermediate) calculations can be observed and thus evaluated for correctness. Observability has long been a metric used in hardware design that describes the degree (or ability) of a chip to detect problems in the inner logic of a chip at the output of the chip. When observability is poor, Built-In Self Tests (BISTs) have often been employed to force complex circuits to perform self validation. These hardware probes are placed into circuits to increase the observability of the circuit during test. Similarly, assertions increase software's observability by increasing the dimensionality and/or cardinality of the software's output space, which is precisely what you want if your goal of testing is to catch defects [4].

 

2 THE REBOOTING QUESTION

Aside from the problem of testing embedded software, another difficult problem occurs during specification and design. The problem is knowing, before the software is built, how frequently to restart (reboot) the software after it is deployed. It is prudent to assume that eventually the state of embedded software will become corrupted. This is particularly true as the interval between restarts increases.

Corruption of the internal data state maintained within the control loop could manifest itself in one of two undesirables ways: (1) the unstable data state could lead to computation of unsafe outputs; or (2) due to the feedback within the system, the corruption could continue to degrade the system until the software is unable to control the system according to specifications.

Consider software built to keep a car on a pre-defined track. Corruptions in the internal data state could cause the controller portion of the software to yield unsafe outputs to the devices that guide the car. Moreover, over time the corruption could propagate throughout the data state enough to defeat the controller, ultimately leading the car off the track to an unacceptable and presumably catastrophic state. This cannot be tolerated by many control systems, particularly safety-critical systems.

Consider how this problem resulted in the Patriot missile failure during the Persian Gulf War. According to the Associated Press [1]:

Army investigators concluded that the exact reason for Patriot's failure to shoot at the Scud will never be known for sure. But they said the most likely explanation was a previously unknown glitch in Patriot computer software. Army technicians had determined as much as two weeks prior to the attack that the Patriot computer was vulnerable to losing track of incoming Scuds when the computer was kept running for long periods, according to internal Army reports released in response to a Freedom of Information Act request by The Associated Press. Tragically, no alert bulletins were sent to Patriot operators in the field because the technicians viewed this as a minor problem of less importance than other Patriot improvements they were working on. The technicians did not think Patriot computers would be kept running for more than several hours at a time. [Earlier reports in RISKS noted that the spec called for only 14 hours of operation, not 100, and that the clock was drifting...]

``Had rebooting, or shutting off the system, occurred, it would have decreased the chance that the inexact (computer) calculation would have occurred,'' said June 14 memo signed by Lt. Gen. Ellis D. Parker, director of the Army Staff.A previously classified internal Army memo dated Feb. 20 described a series of software improvements to the Patriot computer, including a change that was designed to avert the tracking problem under circumstances of long continuous operation. But in follow-up memos intended to be seen by users of the Patriot, no mention was made of the tracking problem or of a need for periodic computer shutdowns. The technical specialists simply thought they had found a way of improving further on the accuracy of Patriot, not correcting a potentially fatal error. ``No significance was given this change because no prior related tracking problems had been seen,'' said an Army Patriot program office report. ``Therefore, no alarm or urgent notification was transmitted to the field.''

This suggests that if the missile system had been rebooted at the appropriate time interval, it would have operated properly when it was imperative that it did so.Because of this possibility for any system that ``operates continuously'', it is important to have access to conservative predictions of how often to reboot the software. Ideally, this information will be provided to the user before the software is released and placed into operation. Clearly after accidents occur we know precisely how often we should have rebooted. The goal of our research is to not wait until accidents occur and to instead provide a conservative (safe) metric for how often to reboot before the software is released. A further benefit of our approach is that it will allow developers and testers to determine which portions of a ``corrupted'' state have a benign impact on overall safety and which portions cause catastrophic (hazardous) problems. Once armed with the information that our approach provides, developers and testers will know which internal software computations need additional assurances of integrity before the software is deployed.

2.1 Software Fault Injection

Our approach for studying how corrupt states affect systems that employ feedback is based on software fault injection. The goal of software fault injection is to observe whether a program can behave in certain predefined, undesirable manners [3,6]. Software fault injection is a dynamic analysis that can be used to mine from the internals of the software and discover corrupted program states that force hazardous output events to occur.

Software fault injection is a ``what if'' analysis technique that can forcefully corrupt the information in a program's state (as well as mutate code) [6]. This paper employs the approach of corrupting program states. The reason for this is simple: one particular corruption of a specific program state for a given test execution of the software can simulate a wide class of code mutations. Further, methods for performing the instrumentation to mutate code are not nearly as sophisticated as those for corrupting internal states.

Before addressing the issue of state build-up, let us first discuss in greater depth the different ingredients needed to perform fault injection. To begin, there are two unique classes of events that must be defined by the person performing fault injection. First, there are those events that will be forcefully injected into the state of the application software as it executes (this is where the ``injecting'' is taking place). These events are termed data anomalies. Examples here would include corrupting a pointer, modifying the value that a programmer variable currently contains, or slowing down a computation.

The second set of events are classes of functional behavior (i.e., output events) that the user does not want the application software to exhibit. Examples here include hazardous output states (that are defined during hazard analysis) or calls to system-level utilities that the application software should not make. This second class of events is referred to as output anomalies. What constitutes an output anomaly must be defined with respect to the state of the system that the software resides in or controls. In safety-critical systems, output anomalies are hazards.

Fault injection usually creates data anomalies using pseudo-random number generation [7]. There are two basic ways in which pseudo-random number generation is employed to corrupt program states: (1) by changing a value to a distinct value (that is based on the original value), and (2) by changing a value to something that is totally independent of the original value.

Our approach here is simple: (1) freeze the execution of the software (2) inject data state anomalies once into some portion of the state of the software, and (3) unfreeze the execution of the software allowing it to execute with corrupt state information. From there, we will observe whether our artificial data anomaly propagates to other portions of the state, and whether the state becomes so corrupt that it eventually causes the software to produce unsafe external outputs to the environment that it resides in. If it does, we can build safe predictions for how frequently the software should be rebooted to avoid the most damaging forms of ``corrupted state'' build-up.

 

3 ALGORITHMS AND METRICS

3.1 Setting up the Analysis

In preparation for performing the proposed analysis, the user must do some preliminary work. First, the user must determine which portion of the state is reused between executions of the control loop. Static data flow can be employed to assist in identifying those state variables that are fed back into the control loop.

Next, the user must decide what classes of external events the software should never output to the entity that it controls. These unacceptable events are referred to as software-influenced hazards. We will seek to avert these hazardous external events by deciding how frequently to reboot. An additional guard against these hazardous external events can be accomplished by embedding assertions (internal self-tests) on the appropriate portions of the state that, if corrupted, are likely to engender hazards.

Towards finding effective assertions that can be used after deployment, we recommend embedding internal assertions on the state at various locations of the control loop during analysis. This step is critical in addressing the observability issue discussed earlier in the paper. The internal assertions can reveal those segments of the state data that may be monitored after the software is deployed to anticipate hazards before the hazardous conditions occur. Furthermore, these assertions provide a means to characterize the impact that corruption of one portion of the state has on the rest of the state.

Again, static data flow techniques may prove useful in determining portions of the state to test at various locations. For efficiency, isolate program variables earlier in the data flow that statically appear to impact the outcome of hazard evaluations; likewise, isolate state variables later in the data flow that statically appear to depend on state variables that may be corrupted earlier in the data flow. For each of these selected state variables, an assertion can be provided to test for unacceptable values or even to provide a simple watch mechanism.

Finally, the user must determine the way to introduce data anomalies during analysis: (1) what types of data anomalies to inject into the state, and (2) on which execution of the software to perform the injection of data anomalies. For example, is it reasonable to assume that on the first execution of the software the state will get corrupted? If so, then that is an execution of the software on which we should inject a data anomaly to see what happens. Or should we wait until the software reaches a stable/steady state before we inject the data anomaly? For example, maybe it would make more sense to wait until the thousandth execution, after the system has ``warmed up''.

3.2 Algorithm

Once we make these decisions, the following algorithm can be applied:

  1. Let the software execute completely for X executions. (For continuously operating software programs, it is often hard to distinguish a complete execution because it is unclear what constitutes as a single, complete input vector. If this occurs, then allow the software to run for a fixed interval of time equal to some constant Y).
  2. Halt execution.
  3. Corrupt some portion of the state that is used as feedback. This simulates that portion of the state being corrupted on the execution before X or at time earlier than Y.
  4. Resume software execution.
  5. This algorithm performs the task of corrupting the state (i.e. through fault injection). To collect the information that we are seeking after we have corrupted the state:

  6. Check the internal assertions or probes into the software state (both feedback state and non-feedback state), that reveal what other portions of the state are subsequently corrupted as a result of the injected data anomaly. Record the results of the internal assertions.
  7. Check the mechanism that acts in an ``oracle-like'' capacity and determines if any external output states from the software have hazardous potential. If any are found, then information should be recorded that specifically tells when the hazard was first observed. A timestamp of this event will be stored in a set T. (This information will be needed by our reboot frequency metric.) If no hazards are observed and the software has either executed X + Z times or has executed for Y + Q time (where the size of Z and Q are based on a rough estimate of how often the system would be rebooted before this analysis was performed), then halt execution and also record this timestamp in T.

By applying steps (5) and (6), the appropriate self-tests can be built that warn before a hazard is likely to occur. In addition, step (6) can be used to determine how frequently rebooting should occur.

3.3 Metric

The reboot frequency metric that we propose can either be based on some number of executions of the embedded software or execution time. Recall our earlier comments about the difficulty in determining what constitutes a full input vector (and hence the difficulty in determining when one execution ends and the next execution begins). For this reason, the time-based metric is likely to be preferred.

In order to simulate a variety of possible corruptions when computing the reboot frequency metric, the above algorithm should be applied repeatedly to different portions of the feedback state using different data anomalies. The amount of resources available when this process is applied will determine how thoroughly the analysis will cover different anomalies and portions of the feedback state.

Once fault injection has been applied and the results are collected, a sequence of times will be recorded from the different trials. Each time either represents when a hazard was observed or the trial timed-out (no hazard had yet occurred). If Y and Q were constant for all trials, and if all recorded times equal Y + Q, then no hazards were observed. If however there exists a time in the sequence that is less than Y + Q, then that represents a trial on which an external hazard was produced.

The first task in determining the reboot frequency metric is to find the minimum time in T. Let this value be g. If g < Y + Q, then we predict that we should reboot at least every g units of time. To be conservative, we should probably double our caution and reboot every 0.5g units of time. If g = Y + Q, then we predict that by rebooting every Y + Q units of time (or possibly at intervals slightly greater than Y + Q), that we can have confidence that no state corruption capable of causing a hazard exists in the feedback state.

It is worth noting that the initial period Y should be carefully chosen. We should have a high degree of confidence that the software will maintain a relatively stable state and operate safely for the initial duration of time Y. The analysis allows us to extend our confidence to a time interval beyond Y; it helps answer the question, ``how much longer after initial time length Y should the software remain running before we should restart it?''

Note that a similar analysis can be performed if number of executions is used instead of time.

3.4 Assertions

As we mentioned earlier, fault injection analysis can provide detailed information on which portions of the feedback state, if corrupted, resulted in hazards. Also, fault injection can be implemented in a such a way as to record the trace of events that happened between the time when the anomaly was injected and the hazard occurred. This information can be used to build assertions that sit on a shadow processor or are embedded into the control software to warn when similar events occur after the software is deployed. These assertions then act as warning mechanisms.

3.5 Implementation and Deployment Issues

By now, the reader may be wondering how the internal assertions can be successfully implemented. Embedded systems often have scarce extra memory for programs that are bloated with instrumentation. Also, there may not be a channel by which the information can get out to the tester or user. Recall we discussed earlier that this was the observability dilemma.

To address the first issue, it is not necessarily the case that the embedded software be riddled with assertions to test to see if states have become corrupted. There are techniques such as shadow processing where the assertions that would normally be instrumented into the embedded software are instead executed on different machines [2]. To make this work, however, there must be channels by which probes can physically connect to the memory cells that the software is using and probe the stored values. This information is then sent back to the shadow processors that then execute the assertions on that information. In addition, the time at which the memory is probed is critical, so the shadow processor must be time-synchronized with the embedded software.

We should also address the issue of how to select Q and Z in more detail. Recall that we said that their values should be based on a rough estimate of how often the system was expected to be rebooted before this analysis was performed. So for example, if we had originally thought that the system would require rebooting every 6 months, then we should set Q to a value closer to 6 months than 5 minutes. This allows us to test out the viability of our 6 month estimate. And by applying techniques such as accelerated testing, we can obtain results that predict a very long period of time between reboots, from analysis performed within a much shorter period of time.

One final point that we would like to make deals with post-deployment of the software. As the software continues to operate safely for greater and greater periods of time without incident, it is likely that the interval between reboots can continue to be increased. Why? Because all of that operational use demonstrates that the software is of even better quality and robustness than we could have predicted from the laboratory testing alone. The point here is that our reboot frequency metric really only estimates how often to reboot the software before the software has been in field operation. Once fielded and without serious incidents, the reboot intervals are candidates for increase. These field observations may be used to effectively improve upon our original estimates for the initial time period Y in the preliminary laboratory analysis that was conducted. As stated before, the proposed analysis can be used to gauge how much longer the deployed software should be left running, given its current safe operating duration in the field.

 

4 VARIATIONS

The previous section describes a general algorithm for estimating the reboot frequency for software. In this section, we discuss variations on the general algorithm.

4.1 Varying X

The user may be interested in varying the number of executions of the loop, X (or the time, Y), that elapses before the anomalous state is injected. Different portions of the state may be particularly sensitive at different intervals of the total software execution lifetime. For instance, corrupting a specific portion of the state after only a few passes through the execution loop may lead to hazards within a few more iterations of the execution loop. However, corrupting the same portion of state later in the execution of the program (after the system has "warmed up") may not lead to any hazards at all or may lead to a hazard only after a significant length of execution time elapses.

During this analysis, it would be useful to record the length of additional time that the software can run safely after introduction of a corrupted portion of the state. Such data allows the user to gain an understanding of how much longer the software should continue to run if a state corruption has occurred. Given the user’s estimate of the probability of state corruption over time, the user can arrive at an optimal reboot frequency.

For example, the graph in Figure 2 represents how a user might estimate the probabilities of a state corruption as a function of software execution time. Somewhere between 600 and 700 units of time (i.e. iterations of the control loop or hours), the user has a greater than 50% confidence that a state corruption will occur. If the user performs analysis where the value of X is varied, a graph like the one shown in Figure 3 can be obtained. The graph shows the number of additional units of time (on average) that the software may be run before a hazardous outcome occurs, for different times of state corruption. The graph indicates that after a certain time, about 300 units of time, the software is fairly tolerant of state corruptions, which do not propagate to hazardous outcomes until, on average, 1000 more units of time are executed. This trend is consistent with software that has a warm-up period during the early portion of execution; once the software has reached a steady state, it is much more tolerant of state corruptions. Based on this data, the user may choose a reboot frequency in the neighborhood of 600 units time + 1000 units of time = 1600 units of time.

Figure 2 Probabilities of state corruption at different times of the software execution.

Figure 3 Estimates of time to hazard after injecting state corruptions at different times.

4.2 Location Based Approach

The general algorithm can also be refined to apply to individual locations within the main execution loop of the software. At a given location, if a state variable gets defined (i.e. assigned a value), then we can easily introduce the corruption of the associated state variable by injecting a fault at this location.

When injecting anomalous state data at specific locations, the analysis will be applied in a similar manner as proposed in the general algorithm. After X executions (or after time Y), the state is corrupted at location l. Following this state corruption, the analysis resumes execution of the software, checking internal assertions that detect resulting corruptions to other the portions of the state and checking for hazardous outcomes. We could compute the reboot frequency for each location l. Instead, a more useful measurement appears to be: how much longer does the software execute after the corruption injected at location l before a hazard is encountered? Fault injection can be performed for different values of X (or Y) as well as different anomalous data values at location l. The goal of the analysis is to understand whether the data corruptions at location l propagate over time to a hazardous outcome, and if so, how long it takes to propagate to the hazard.

If the user finds a specific location where hazards are created within a short period of time after corruption, this indicates that the location is significant in the causation of the hazardous conditions. Closer inspection of the state variable getting defined at this location is warranted, because if a software fault exists at the location, the software can create a hazardous condition very quickly. Internal assertions may be specified at these locations in order to test for values, which can predict impending hazards before hazards occur.

4.3 Distributed Faults Approach

This location-based analysis can be taken one step farther. We can extend the analysis to include distributed fault simulation. State corruptions can be injected into the running software at multiple locations at a given iteration X (or time Y). The analysis can help determine whether distributed faults can lead to hazardous software output.

With well-placed internal assertions during analysis, the user may discover that the multiple state corruptions compound the impact on other parts of the state and accelerate the propagation to hazardous outcome. These other portions of the state can be treated with fault injection to assess the sensitivity of the reboot frequency to variations in these state variable values. Such investigations can lead to the discovery of portions of the state that can be monitored to prevent hazardous outcomes.

5 SUMMARY

Parnas has advocated periodically resetting most variables in safety-critical real-time systems so that test trajectories would be limited in length [5]. This periodic cleansing also limits the propagation of corrupt data values. Our reboot frequency metric is designed to give users a feeling of how often this should occur. And by placing assertions on certain portions of the feedback state, users can be warned when types of corruption have occurred that are likely to result in hazards.

The inclusion of the internal assertions provides an opportunity for other types of analyses that can study the dynamic relationships among the state variables. While static data flow theoretically can provide information about which state variables depend on which other variables, dynamic analysis that combines state corruptions and internal assertions can help quantify the dependencies. Thus assertions can assess the sensitivities of state variables to other state variables. This is a future research area of research.

In summary, the proposed analysis holds great promise for allowing us to assess the expected safe operation span of safety-critical control software and to lengthen this safe operation span through the use of well placed assertions.

 

6 REFERENCES

1. R. Burns. Army Records Say Computer Shutdown Might Have Averted Scud Disaster, Associated Press, August 15, 1991.

2. H.Patil and C. Fischer. Efficient Run-time Monitoring Using Shadow Processing. In Mireille Ducasse, editor, Proc. of the 2nd International Workshop on Automated and Algorithmic Debugging, St. Malo, France, 1995.

3. M. Hsueh, T. K. Tsai, and R. K. Iyer. Fault injection techniques and tools. IEEE Computer, 30(4):75--82, April 1997.

4. J. Voas and L. Kassab. Using Assertions to Make Untestable Software More Testable. Software Quality Professional, 23, September, 1999.

5. D. Parnas, A. J. Van Schouwen, and S. Po Kwan. Evaluation of Safety-Critical Software. Communications of the ACM, pages 636--648, June 1990.

6. J. Voas and G. McGraw. Software Fault Injection: Inoculating Programs Against Errors. John Wiley and Sons, New York, 1998.

7. S. Park and K. Miller. Random Number Generators: Good Ones are Hard to Find. Communications of the ACM, 31(10):1192--1201, October 1988.

 

7 AUTHORS AND BIOGRAPHY

J. M. Voas. Chief Scientist, Reliable Software Technologies. Address: 21351 Ridgetop Circle, Suite 400, Dulles, VA 20166, USA. Phone: 703.404.9293. Fax: 703.404.9295. Email: jmvoas@rstcorp.com. Jeffrey Voas is a Co-founder and Vice-President of Reliable Software Technologies (RST). In 1999, RST was named to the WashTech-50, identifying it as one of the 50 fastest growing high-tech companies in the Washington D.C. area. Also in 1999, RST was ranked as the 5th fasting growing technology company in Virginia. Overall, RST was ranked as the 8th fasting growing privately held company in Virginia. Voas has coauthored two Wiley books: (1) Software Assessment: Reliability, Safety, Testability (1995), and (2) Software Fault Injection: Inoculating Programs Against Errors (1998). Voas was the General Chair for COMPASS'97, is the Program Chair for ISSRE'99, and Program Co-Chair for ICSM 2000. Voas is a Senior Member of the IEEE, received a Ph.D. in computer science from the College of William & Mary in 1990, and in 1999, Voas was named Young Engineer of the Year by the District of Columbia Council of Engineering and Architectural Societies. Voas is an adjunct professor at West Virginia University.

 

Frank Charron. Project Manager, Reliable Software Technologies. Address: 21351 Ridgetop Circle, Suite 400, Dulles, VA 20166, USA. Phone: 703.404.9293. Fax: 703.404.9295. Email: fhchar@rstcorp.com. Frank Charron is the Project Manager of the fault injection core technology at Reliable Software Technologies (RST). He is responsible for the architecture, design, and development of fault injection tools, including the current RST SafetyNet product. Charron has experience applying SafetyNet in assessing the fault tolerance of several safety-critical systems, including medical, nuclear, and avionics systems. As a project leader in the RST Research Division, he has contributed to prototype development and experimentation in research areas that include software fault-tolerance, software security/vulnerability analysis, intrusion detection, and commercial-of-the-shelf (COTS) component wrapping. Charron received a BA in Mathematics from the University of Virginia and an MS in Operations Research from the George Washington University, Washington DC.