Recovery testing is an important and generally overlooked technique. Instead of ignoring the inevitability of bugs, it faces them head-on by investigating how software will react when trouble strikes. It is applicable across all phases of software testing, and is especially productive at exposing bugs on systems under heavy load and stress. It is essential that software testing engineers give due consideration to recovery implications while developing their test plans.
Software’s ability to recover from a failure is an important contributor to its robustness. Recovery can also be one of the most interesting test focus areas. How much recovery testing is needed largely depends upon the nature of the target program, as well as the operating system environment it will operate within.
At the other end of the spectrum are environments in which an application that fails is simply expected to crash and the entire operating system will need to be rebooted before the application can be restarted cleanly. Most software lies somewhere in between.
Recovery testing takes various forms across the Function Verification Test (FVT), System Verification Test (SVT), and integration test disciplines.
This post discusses recovery testing during Function Verification Test (FVT) and System Verification Test (SVT).
A) Methods of attacking programs during Function Verification Test
Depending on the situation, there are many different ways to attack a program’s recovery capabilities during FVT. However, before we can check how well a program recovers from an error, we need a way to generate that error in the first place.
Some of the options are described below.
Option –1: By using Special Tools and Techniques
In some cases, an error can be easily created through external means, such as filling up a log file or killing a process. But many times such techniques aren’t enough during FVT. As software testing engineers, we may need to simulate a bad parameter being passed from one module to another, or force an error interrupt to occur just as the module reaches a critical point in its processing. It may not be obvious how to go about injecting such errors, but several techniques are available.
a) Stub Routines:
If we need to force another module or component to pass bad input into our target software, we need to replace that module with a small stub routine. The stub routine will do little more than accept incoming requests, then turn around and reply to them in a reasonable way. However, it will purposely corrupt the one parameter we are interested in. Alternatively, rather than replacing a module with a stub we can tamper with the module itself, altering it to pass back bad data when called by our target software.
These approaches will only work if the module we intend to "stub out" is called infrequently, under conditions that we can externally generate. Ideally, it would only be called by the module under test. We don’t want to insert a bogus stub routine that will be invoked millions of times per second for routine tasks by many other modules in the component. If we do, its identity as an impostor will quickly be revealed and the software will surely stumble. This stubbing approach obviously creates an artificial environment, so it’s probably the least desirable method listed here. But under the right circumstances, it can be useful.
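As a minimal sketch of the stub idea, the Python below swaps a collaborator for a stub that answers normally but corrupts one field. All names here (`lookup_account`, `stub_lookup_account`, `process_request`, the `balance` field) are hypothetical and chosen only for illustration; a real FVT stub would replace an actual module of the product.

```python
# All names are hypothetical; they do not come from any real product.

def lookup_account(account_id):
    """The real collaborator: returns a well-formed account record."""
    return {"id": account_id, "balance": 100}


def stub_lookup_account(account_id):
    """Stub: replies like the real routine but corrupts one field."""
    record = lookup_account(account_id)
    record["balance"] = None  # the single parameter we purposely corrupt
    return record


def process_request(account_id, lookup=lookup_account):
    """Module under test: it should recover from bad data, not crash."""
    record = lookup(account_id)
    balance = record.get("balance")
    if not isinstance(balance, int):
        # Recovery path: reject the request and report the problem.
        return {"status": "error", "reason": "invalid balance from lookup"}
    return {"status": "ok", "balance": balance}


normal = process_request(42)
injected = process_request(42, lookup=stub_lookup_account)
print(normal)
print(injected)
```

Passing the stub in as a parameter keeps the impostor visible only to the module under test, which is the condition under which stubbing is safe.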
b) Zapping Tools:
Some systems have tools that allow the software testing engineer to find exactly where a particular module is loaded in memory on a running system, display its memory, and change bytes of that memory on the fly. This dynamic alteration of memory is called a zap. If we can’t find such a tool for the system we are testing on, we can consider writing our own. We will probably find that creating a crude zapping tool is not a major undertaking.
A zapping tool gives us an easy means to selectively corrupt data. We can also use it to overlay an instruction within a module with carefully constructed garbage, so when that instruction is executed it will fail. As with the stub routine case, care must be used not to meddle in an area that is frequently executed on the running system, or the volume of errors we will generate will be overwhelming. However, zapping is not nearly as artificial a technique as stub routines. In the right situations it can be very effective.
c) Error Injection Programs:
Another approach is to create a small seek-and-destroy program to inject the desired errors into the system. To create such a program we must first determine exactly what error we wish to inject by studying the target software. Let us say the module in question maintains a queue of pending requests, and a counter which indicates the current length of the queue. When the module scans the queue, it relies on this counter to determine if it has reached the end. We decide to corrupt that counter so that the queue scanning code will fall off the end of the queue and throw an error.
To implement this plan, software testing engineers write a small program that operates with full system privileges. It follows a chain of system control structures until it locates our target module in memory. Our program establishes addressability to this module’s dynamic area (i.e., access to its variables), examines the current contents of the counter variable, doubles it, and then exits. The next time the target module tries to traverse the full queue, it’s in for a surprise.
This is a simple example, but try to imagine other cases where our error injection program corrupts the contents of a control structure shared by multiple modules within a component, or performs other nasty deeds. In essence, this is nothing more than automating the function of a manual zapping tool. But because the seek-and-destroy program is operating at computer speeds, it can be much more nimble and precise in its attacks.
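The counter-corruption scenario can be sketched in a few lines of Python. This is only an in-process simulation under stated assumptions: the `RequestQueue` class, its `count` field, and `inject_counter_error` are hypothetical, and a real injection program would run with system privileges and locate the target module in memory rather than hold a direct reference to it.

```python
class RequestQueue:
    """Hypothetical target module: a queue plus a separate length counter."""

    def __init__(self, requests):
        self.slots = list(requests)
        self.count = len(self.slots)  # the counter the scan relies on

    def scan(self):
        """Visit every request, trusting self.count to mark the end."""
        seen = []
        try:
            for i in range(self.count):
                seen.append(self.slots[i])
        except IndexError as err:
            # Recovery routine: report the damage and repair the counter.
            print(f"recovery: scan ran past end of queue ({err})")
            self.count = len(self.slots)
        return seen


def inject_counter_error(queue):
    """Seek-and-destroy step: double the counter, then exit."""
    queue.count *= 2


queue = RequestQueue(["req-a", "req-b", "req-c"])
before = queue.scan()
inject_counter_error(queue)
after_injection = queue.scan()  # falls off the end; recovery takes over
after_recovery = queue.scan()   # counter repaired, so the scan is clean
```

The test then checks two things: that the failing scan still returned the valid requests, and that the next scan runs without tripping recovery again.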
d) Emulators and Hypervisors:
Emulators and hypervisors make it possible to create what are known as virtualized environments. For this discussion all we need to realize is that they create another layer of software between an operating system and the hardware it runs on. In some implementations, this extra layer has special debugging capabilities that can be used to set breakpoints. These breakpoints can freeze the entire system when triggered. This gives the software testing engineer an opportunity to stop the system at a specific point, corrupt memory or register contents, then restart it and watch the recovery support take action.
This is quite different from the sort of breakpoint function available in interactive debuggers, which can create a very artificial environment. In virtualized environments, the operating system and all of the middleware and applications running on top of it are unaware of the existence of this extra layer. When a breakpoint is hit, the entire system stops, not just one module. At that point, the virtualization layer hands control over to the software testing engineer.
Such technology is not universally available. But if we have access to a virtualized environment that supports break-pointing capabilities, it probably offers the most powerful mechanism for injecting errors during FVT.
Option –2: Enabling the Restartability of Program
The most basic recovery option is enabling a program to restart cleanly after a crash. In FVT, the focus is placed on failures within individual components of the overall product. We will generally need to trick a component into crashing. We can do this in a virtualized environment by setting a breakpoint at some specific location in its code. When the breakpoint hits we can insert carefully corrupted data, set the system’s next instruction pointer to the address of an invalid instruction, or zap the component’s code itself to overlay a valid instruction with some sort of garbage that’s not executable. We then resume the program after the breakpoint, watch it fail, and ensure it generates the appropriate failure messages, log entries, dump codes, etc. If it has robust recovery support, it may be able to resume processing as if nothing had happened. If not, it may force the entire product to terminate.
If the program terminates, the software testing engineer can then restart it and determine whether it restarts successfully and is able to process new work (or resume old work, depending on its nature). If we resorted to zapping the component’s code with garbage to force it to crash, and that code remains resident in memory, then we will need to repair the overlay prior to restarting the program (or it will just fail again).
Option –3: Using Component level Recovery out of Anticipated Errors
Most commercial software has some sort of component-level (or object-level) recovery, whether it is managed by the operating system or built on the more basic signal and try-and-catch mechanisms offered by some programming languages. At a high level, the idea is to establish a recovery environment around a chunk of code, such that if an error interrupt (e.g., program check, I/O error) occurs, the recovery routine will be given control to take some sort of action. That action could be as simple as issuing an error message. Or, it could be as complex as generating a memory dump, logging or tracing the error, releasing program-owned resources and serialization, and freeing held memory. It might even restore overlaid data in key control structures and retry the failed operation.
There may be a long list of anticipated error types for which the recovery routines take unique actions. At a minimum, our FVT plan should include scenarios for forcing each of those errors. After each error, we need to ensure the recovery code processes it correctly. It should issue the correct error messages, write the correct trace entries and log records, generate a valid memory dump, or perform whatever other action the code is designed for. When choosing locations within a component to inject errors, prime consideration should be given to points where memory is obtained, shared resources are in use, or serialization mechanisms (e.g., locks, mutexes) are held. These areas are complicated to handle properly during recovery processing, and so are good grounds for test exploration.
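To make the idea of injecting an error while serialization is held more concrete, here is a small Python sketch. It assumes a try-and-catch style recovery environment; `InjectedError`, `update_shared_state`, and the in-memory `log` list are hypothetical stand-ins for a real error interrupt, component, and log.

```python
import threading


class InjectedError(Exception):
    """Hypothetical error type used to simulate an error interrupt."""


lock = threading.Lock()
log = []  # stands in for the component's log records


def update_shared_state(inject=False):
    """Component under test: does its work while holding a lock."""
    lock.acquire()
    try:
        if inject:
            # Injection point: the error hits while the lock is held.
            raise InjectedError("simulated I/O error while lock held")
        log.append("update complete")
        return True
    except InjectedError as err:
        # Recovery routine: record the error for later diagnosis.
        log.append(f"recovered: {err}")
        return False
    finally:
        # Serialization must be released on every path.
        lock.release()


result = update_shared_state(inject=True)

# Verify recovery released the lock, then confirm new work still runs.
lock_free = lock.acquire(blocking=False)
if lock_free:
    lock.release()
retry = update_shared_state()
```

The check that matters most is `lock_free`: a recovery routine that logs the error but leaves the lock held would pass a casual test and then hang the next caller.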
Sufficient Diagnostic Data:
Our test plan should also include an attempt to verify that any error information generated is sufficient for its intended purpose. If a message is presented to the end user, is there enough information so the user can make an intelligent decision about what to do next? Or, if there’s no reasonable action the user can take, is the message necessary at all or will it just lead to needless confusion? If diagnostic data is generated, will it be sufficient to determine the root cause of the problem? This is where we go beyond simply testing to the specifications, and instead determine in a broader sense if the function is "fit for purpose." As software testing engineers, we bring a different perspective to the table than the developer does. We need to be sure to leverage that perspective to ensure the program’s actions are useful and helpful.
Option –4: Using Component-level Recovery out of Unanticipated Errors
A thorough test plan will go beyond errors that the program’s recovery support was coded to handle. It will also investigate how the program responds to unanticipated errors. At a minimum, the code should have some sort of catchall processing for handling unknown errors (if it doesn’t, we may have found our first bug). We need to be a little devious here. Using the instruction zapping approach if necessary, we find a way to force the code to react to errors it hasn’t attempted to address, and then ensure it reacts reasonably. Again, software testing engineers use their own end-user view to determine what "reasonably" means for this program.
Also included in this category are errors that occur at the system level but also impact the individual component. These errors can include memory shortages, hardware element failures, network problems, and system restarts. Force or simulate as many of these types of errors as seem relevant, and discover whether the component handles them gracefully or takes a nosedive.
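A rough Python sketch of probing catchall processing follows. It assumes a component with one anticipated error (a timeout) and a generic handler for everything else; `handle_request` and `failing` are hypothetical names, and the raised `MemoryError` merely simulates a memory shortage rather than causing one.

```python
def handle_request(operation):
    """Hypothetical component with specific and catchall recovery."""
    try:
        return {"status": "ok", "value": operation()}
    except TimeoutError:
        # Anticipated error with its own recovery action.
        return {"status": "retry", "detail": "timed out"}
    except Exception as err:
        # Catchall processing for errors nobody planned for.
        return {"status": "failed", "detail": f"{type(err).__name__}: {err}"}


def failing(exc):
    """Build an operation that raises the given exception when called."""
    def operation():
        raise exc
    return operation


anticipated = handle_request(failing(TimeoutError()))

# Errors the component was never coded to handle individually.
unanticipated = [
    ZeroDivisionError("division by zero"),
    KeyError("missing"),
    MemoryError("simulated shortage"),
]
results = [handle_request(failing(exc)) for exc in unanticipated]
for outcome in results:
    print(outcome)
```

Whether a `failed` status with the error type in the detail counts as a reasonable reaction is a judgment the tester makes from the end-user point of view.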
B) Methods of attacking programs during System Verification Test (SVT)
The objective of SVT is similar to that of FVT, namely to wreak controlled havoc and see how the software responds. But in SVT, the focus shifts from a narrow, component-level view to an entire product view. It also folds load / stress into the picture. This is critical, because it’s common for recovery processing to work perfectly on an unloaded system, only to collapse when the system is under heavy stress.
Restartability: In System Verification Test, there are two aspects to restartability.
1) Program crash
2) System crash
1) Program crash: Here, because we are operating at an end-user level in which techniques such as setting breakpoints are not applicable, there must be an external way to cause the program to fail. Such external means could include bad input, memory shortages, or a system operator command designed to force the program to terminate fast and hard. Alternatively, input from software testing engineers during the software’s design might have led to the inclusion of special testability features that can aid with error injection.
An advantage to using external means to crash the program is that we are able to send normal work to the program so it is busy doing something at the time we force the crash. Programs that die with many active, in-flight tasks tend to have more problems cleanly restarting than idle ones do, so we are more likely to find a bug this way.
2) System crash: This case is similar to program crash, except that any recovery code intended to clean up files or other resources before the program terminates will not have a chance to execute. The approach here should be to get the program busy processing some work, and then kill the entire system. The simplest way to kill the system is to power it off. Some operating systems, like z/OS, provide a debugging aid that allows a user to request that a particular action be taken when some event occurs on a live system. That event could be the crash of a given program, the invocation of a particular module, or even the execution of a specific line of code. The action could be to force a memory dump, write a record to a log, or even freeze the entire system. In z/OS, this is called setting a trap for the software. If such support is available, then another way to kill the system is to set a trap for the invocation of a common operating system function (like the dispatcher), which when sprung will take the action of stopping the system immediately so we can reboot it from there.
After the system reboot, restart the application and check for anomalies that may indicate a recovery problem by watching for messages it issues, log entries it creates, or any other information it generates as it comes back up. Then send some work to the program and ensure it executes it properly and any data it manipulates is still intact. Restartability is the most basic of recovery tests but, if carefully done, will often unearth a surprising number of defects.
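The crash-then-restart check can be sketched with a small Python harness. This is a simulation under stated assumptions, not a real SVT setup: the worker program, its journal file, and the `crash_after` argument are all hypothetical, and `os._exit` stands in for an operator command that terminates the program fast and hard with no cleanup.

```python
import os
import subprocess
import sys
import tempfile
import textwrap

# Hypothetical worker: journals each item and resumes from the journal.
WORKER = textwrap.dedent("""
    import os, sys
    journal, total, crash_after = sys.argv[1], int(sys.argv[2]), int(sys.argv[3])
    # Restart logic: count the records already in the journal.
    done = 0
    if os.path.exists(journal):
        with open(journal) as f:
            done = sum(1 for line in f if line.endswith("\\n"))
    with open(journal, "a") as f:
        for i in range(done, total):
            if crash_after >= 0 and i == crash_after:
                os._exit(1)  # die hard: no cleanup code gets to run
            f.write(f"item-{i}\\n")
            f.flush()
    print(f"resumed at {done}, finished {total}")
""")


def run_worker(journal, total, crash_after=-1):
    """Run the worker as a separate process and capture its output."""
    return subprocess.run(
        [sys.executable, "-c", WORKER, journal, str(total), str(crash_after)],
        capture_output=True,
        text=True,
    )


with tempfile.TemporaryDirectory() as tmp:
    journal = os.path.join(tmp, "journal.log")
    crashed = run_worker(journal, total=5, crash_after=3)  # dies mid-work
    restarted = run_worker(journal, total=5)               # clean restart
    with open(journal) as f:
        records = f.read().split()

print(restarted.stdout.strip())
print(records)
```

After the restart we look at the same evidence the text describes: the messages issued as the program comes back up, and whether the data it manipulates is intact, with no item lost or processed twice.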
Clustered System Failures:
Some software is designed to operate in a clustered environment to improve its scalability or reliability characteristics. Devise scenarios to probe these capabilities.
For example, consider a group of Web application servers clustered together, all capable of running the same banking application. An additional system sits in front of this cluster and sprays incoming user requests across the various systems. If one server in the cluster fails, the sprayer should detect the loss and send new work elsewhere. We can try crashing a server and restarting it, all the while watching how the remaining systems react. Another scenario might be to crash several members of the cluster serially before restarting any of them, or to crash multiple members in parallel.
# What if the sprayer system crashes?
# Does it have a hot standby that will take over to keep work flowing? Should it?
# Does any data appear corrupted after completion of the recovery process?
# All such possibilities are fair game for the wily tester.
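The sprayer scenarios can be rehearsed on a toy model before trying them on a real cluster. The Python sketch below is purely illustrative: `Server` and `Sprayer` are hypothetical classes, the round-robin policy is an assumption, and a real sprayer would detect failures over the network rather than through an exception.

```python
class Server:
    """Hypothetical cluster member that can be crashed and restarted."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class Sprayer:
    """Round-robin dispatcher that skips members that fail."""

    def __init__(self, servers):
        self.servers = servers
        self.next_index = 0

    def dispatch(self, request):
        for _ in range(len(self.servers)):
            server = self.servers[self.next_index]
            self.next_index = (self.next_index + 1) % len(self.servers)
            try:
                return server.handle(request)
            except ConnectionError:
                continue  # loss detected; send the work elsewhere
        raise RuntimeError("no cluster members available")


cluster = [Server("a"), Server("b"), Server("c")]
sprayer = Sprayer(cluster)

cluster[1].alive = False  # crash one member
during_outage = [sprayer.dispatch(f"r{i}") for i in range(4)]

cluster[1].alive = True   # restart it
after_restart = [sprayer.dispatch(f"r{i}") for i in range(4, 6)]

for server in cluster:    # crash every member in parallel
    server.alive = False
try:
    sprayer.dispatch("r6")
    total_outage = "work accepted"
except RuntimeError as err:
    total_outage = str(err)
```

The three phases mirror the scenarios above: a single failure, a restart that should bring the member back into rotation, and a total outage that should be reported rather than silently dropping work.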
Depending on the nature of the software under test, it may need to cope with failures in the underlying environment. In the case of operating systems, this usually means failure of hardware components (e.g., disk drives, network adapters, peripherals). For middleware and applications, it usually means the failure of services that the operating system provides based on those hardware components.
# What happens if a file system the application is using fills up, or the disk fails?
# What if a path to a required Storage Area Network (SAN) device fails or is unplugged by a careless maintenance person?
# What if a single CPU in a multiprocessing system fails?
# Are there any cases in which the operating system will alert the application of environmental failures?
# How does the application respond to such information?
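When the real environment cannot easily be broken, one rough way to explore the first question is to simulate the failure at the point where the application meets the operating system. The Python sketch below assumes a hypothetical `save_record` routine and a stand-in stream, `FullDiskStream`, that fails every write with the standard `ENOSPC` error code, as a full file system would.

```python
import errno


class FullDiskStream:
    """Hypothetical stand-in for a file on a full file system."""

    def write(self, data):
        raise OSError(errno.ENOSPC, "No space left on device")


def save_record(stream, record, pending):
    """Application code under test: persist a record or degrade gracefully."""
    try:
        stream.write(record)
        return "saved"
    except OSError as err:
        if err.errno == errno.ENOSPC:
            # Recovery: keep the record in memory and tell the caller.
            pending.append(record)
            return "deferred: file system full"
        raise  # any other environmental failure is unanticipated here


pending = []
outcome = save_record(FullDiskStream(), "txn-1001\n", pending)
print(outcome, pending)
```

A simulation like this only shows how the application reacts to the error report; it says nothing about how the operating system itself behaves when the disk really fills, which still needs a test on a live system.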
Even if the application has no specific support for such events, it may still be worthwhile to see how badly it is compromised when the unexpected happens. In the mainframe world, tools are used to inject error information into specific control structures in memory on a running system, and then force a branch to the operating system’s interrupt handler to simulate the occurrence of various hardware failures. Similar tools can be created for Linux or other operating systems. This sort of testing is very disruptive, so unless we have our own isolated system, we will need to schedule a window to execute these scenarios to avoid impacting everyone else’s work.
During the course of normal load / stress or longevity runs, the software being tested will almost surely fail on its own, with no help from the software testing engineer.
# Rather than cursing these spontaneous, natural errors, take advantage of them.
# Do not look only at the failure itself; also examine how the program dealt with it.
# Monitor recovery processing to see how the software responds to unplanned failures.