News Column

Patent Issued for Computer-Implemented System and Method for Providing Software Fault Tolerance

June 19, 2014



By a News Reporter-Staff News Editor at Computer Weekly News -- F5 Networks, Inc. (Seattle, WA) has been issued patent number 8745440, according to news reporting originating out of Alexandria, Virginia, by VerticalNews editors.

The patent's inventors are Ceze, Luis (Seattle, WA); Godman, Peter (Seattle, WA); Oskin, Mark (Seattle, WA).

This patent was filed on September 21, 2011 and was published online on June 3, 2014.

From the background information supplied by the inventors, news correspondents obtained the following quote: "Fault tolerance is an engineering principle that requires a system to continue operating despite failures or faults, albeit with a possibly diminished level of service or capacity. Fault tolerant design has been applied to computer systems to ensure system availability and crash resilience, generally through replication and redundancy. Replication requires multiple components to provide identical functionality while operating in parallel to improve the chances that at least one system is working properly. One way to effect fault tolerance is by following a quorum rule based on a majority of votes as received from the constituent replicated components; the output agreed by the majority is used as the output of the whole system. Redundancy requires multiple components to provide identical functionality, but fault tolerance is instead provided by switching or 'failing' over from a faulty component to a spare redundant component. A fault tolerant system can combine both replication and redundancy at different levels of component design. For instance, storage servers conventionally use redundancy in data storage through a redundant array of inexpensive disks, or RAID, hard drive configuration, while also including replicated power supplies to serve as spares in case of the failure of the power supply currently in use.

"In general, replication and redundancy provide fault tolerance at the physical component- or hardware-level, where failure conditions can be readily contained by removing a component from service until a permanent repair can be later effected. Replicated or redundant fault tolerance can also be used in software. Providing software fault tolerance through replication or redundancy, though, can be expensive and inefficient in terms of resource utilization and physical component costs. Additionally, software fault tolerance adds the complication of needing to provide continued service often without the certainty that the correct faulty software component has been both identified and rendered harmless. A software error that potentially affects system state could remain undetected and persist latently, only to later rematerialize with possibly catastrophic consequences, despite earlier efforts to undertake fault tolerance.

"Alternatively, fault recovery can be used for software to directly address underlying causes of fault or failure, rather than relying on the indirect quorum voting or failover solutions used in fault tolerance. In general, software fault recovery can be provided through either roll forward or roll back. Roll forward requires a software system to attempt to correct its system state upon detecting an error and continue processing forward from that point in execution on by relying on the self-corrections made as being sufficiently remedial. Roll back requires a software system to revert to an earlier, and presumably safe, version of system state and continue processing forward from the earlier version on after backing out any erroneous system state.

"However, even with replicated multithreaded software execution, resilience to latent software faults and failures, colloquially referred to as 'bugs,' is not assured due to the nondeterministic nature of multithreaded programs. Current multicores execute multithreaded code nondeterministically because, given the same inputs, execution threads can potentially interleave their memory and I/O operations differently in each execution. The nondeterminism arises from small perturbations in the execution environment due to, for instance, other processes executing simultaneously, differences in operating system resource allocation, cache and translation lookaside buffer states, bus contention, and other factors relating to microarchitectural structures. Software behavior is subject to change with each execution due to multicore nondeterminism, and runtime faults, such as synchronization errors, can appear at inconsistent times that may defy subsequent efforts to reproduce and remedy. Multithreaded programming is difficult and can lead to complicated bugs. Existing solutions focus on hardware fault tolerance and are ill-suited to resolving the kinds of multithreaded programming bugs necessary for achieving software fault tolerance."
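As a rough illustration of two ideas from the background quoted above, the sketch below shows a majority quorum vote over replica outputs and a simple roll back to an earlier state. It is a minimal Python sketch under stated assumptions, not code from the patent: the function names majority_vote and run_with_rollback are hypothetical, replica outputs are assumed to be hashable values, and program state is assumed to be a plain dictionary.

```python
import copy
from collections import Counter


def majority_vote(outputs):
    """Return the output agreed on by a strict majority of replicas.

    Hypothetical helper: `outputs` is a non-empty list of hashable values,
    one per replicated component.
    """
    if not outputs:
        raise ValueError("no replica outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among replicas")
    return value


def run_with_rollback(state, step):
    """Run `step` against `state`; on any error, return the earlier state.

    A snapshot is taken first, so a step that fails after partially
    modifying the state cannot leave erroneous changes behind.
    """
    snapshot = copy.deepcopy(state)
    try:
        step(state)
        return state
    except Exception:
        return snapshot  # roll back: discard the partially updated state


def faulty_step(state):
    """Example step (an assumption) that corrupts state and then fails."""
    state["balance"] = -1
    raise ValueError("simulated software fault")
```

For example, majority_vote([42, 42, 7]) returns 42 because two of the three replicas agree, and run_with_rollback({"balance": 10}, faulty_step) returns the original {"balance": 10}. Roll forward would instead try to repair the state in place and continue from the point of failure.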

Supplementing the background information on this patent, VerticalNews reporters also obtained the inventors' summary information for this patent: "One embodiment provides a computer-implemented method for providing software fault tolerance. A multithreaded program is executed. The program execution includes a primary multithreaded process and a secondary multithreaded process. A set of inputs is provided to the primary multithreaded process and the inputs set is copied to the secondary multithreaded process. The executions of the primary multithreaded process and the secondary multithreaded process are divided into a deterministic subset of the execution (chunks of the dynamic execution) that ends at a checkpoint. Upon occurrence of a fault in one of the executions on one of the multithreaded processes prior to reaching the checkpoint, an execution path through the deterministic subset for the faulty execution is retired. Execution of the deterministic subset is continued on the other multithreaded process with a subsequent different path to elide the fault being chosen.

"A further embodiment provides a computer-implemented method for providing software fault tolerance through state transfer. A multithreaded program is executed. The program execution includes a primary multithreaded process and a secondary multithreaded process. A set of inputs is provided to the primary multithreaded process and the inputs set is copied to the secondary multithreaded process. The executions of the primary multithreaded process and the secondary multithreaded process are divided into a deterministic subset of the execution that ends at a checkpoint. An execution of the deterministic subset is speculatively performed on the secondary multithreaded process. Upon successfully completing the speculative execution up through the checkpoint, the program state is provided to the primary multithreaded process; thus, execution proceeds in the primary process without experiencing a failure.

"A still further embodiment provides a computer-implemented method for providing software fault tolerance through a state repository. A multithreaded program is executed. The program execution includes a primary multithreaded process and a secondary multithreaded process. A set of inputs is provided to the primary multithreaded process and the inputs set is copied to the secondary multithreaded process. The executions of the primary multithreaded process and the secondary multithreaded process are divided into a deterministic subset of the execution that ends at a checkpoint. An execution of the deterministic subset is speculatively performed on the secondary multithreaded process. A state repository is searched for a stable execution path while performing the speculative execution, wherein the state repository is searched either periodically or upon encountering a program fault. Upon ending the speculative execution, the program state is stored into the state repository accessible by the primary multithreaded process.

"A still further embodiment provides a computer-implemented method for providing software fault tolerance through speculative execution. A multithreaded program is executed. The program execution includes a primary multithreaded process and a secondary multithreaded process. A set of inputs is provided to the primary multithreaded process and the inputs set is copied to the secondary multithreaded process. The executions of the primary multithreaded process and the secondary multithreaded process are divided into a deterministic subset of the execution that ends at a checkpoint, wherein the secondary multithreaded process executes slightly behind the primary multithreaded process. Upon occurrence of a fault in the execution of the primary multithreaded process prior to reaching the checkpoint, an execution path through the deterministic subset for the faulty execution is discarded and the secondary multithreaded process is notified to begin speculatively performing one or more executions of the deterministic subset. Upon successfully completing the speculative execution on the secondary multithreaded process up through the checkpoint, the program state is provided to the primary multithreaded process, which resumes execution past the checkpoint that was just transferred.

"Still other embodiments will become readily apparent to those skilled in the art from the following detailed description, wherein are described embodiments by way of illustrating the best mode contemplated. As will be realized, other and different embodiments are possible and the embodiments' several details are capable of modifications in various obvious respects, all without departing from their spirit and the scope. Accordingly, the drawings and detailed description are to be regarded as illustrative in nature and not as restrictive."
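The general pattern in the summary (run a deterministic chunk up to a checkpoint, retire a faulty execution path, retry the same chunk along a different path) can be approximated in a few lines. The following is a simplified, single-threaded Python simulation under stated assumptions, not the patented mechanism: a "path" is modeled as a fixed order in which thread operations run, and the names Fault, run_chunk, execute_to_checkpoint, init_buffer and sum_buffer are all hypothetical.

```python
import copy


class Fault(Exception):
    """Raised when a chunk of execution hits a simulated software fault."""


def init_buffer(state):
    """Thread 0 (example): initialise the shared buffer."""
    state["buf"] = [1, 2, 3]


def sum_buffer(state):
    """Thread 1 (example): read the buffer; faults if it runs too early."""
    if state["buf"] is None:
        raise Fault("buffer read before initialisation")
    state["total"] = sum(state["buf"])


OPS = {0: init_buffer, 1: sum_buffer}


def run_chunk(checkpoint, schedule, ops):
    """Deterministically replay one chunk from a checkpointed state.

    `schedule` fixes the order in which the thread operations run, so the
    same schedule always produces the same result or the same fault.
    """
    working = copy.deepcopy(checkpoint)  # the checkpoint itself stays intact
    for thread_id in schedule:
        ops[thread_id](working)
    return working


def execute_to_checkpoint(checkpoint, ops, schedules):
    """Try schedules in turn until one reaches the next checkpoint.

    Returns (new_state, schedule_used, retired_schedules). Faulty paths are
    retired and the chunk is retried from the same checkpoint.
    """
    retired = []
    for schedule in schedules:
        try:
            return run_chunk(checkpoint, schedule, ops), schedule, retired
        except Fault:
            retired.append(schedule)
    raise Fault("no fault-free path found for this chunk")
```

Called with the starting state {"buf": None, "total": 0} and the schedules [(1, 0), (0, 1)], the first path faults and is retired, and the second completes with {"buf": [1, 2, 3], "total": 6}. In the arrangement the summary describes, the retry would run on the secondary process and the resulting state would then be handed to the primary; here both roles are collapsed into one loop for brevity.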

For the URL and additional information on this patent, see: Ceze, Luis; Godman, Peter; Oskin, Mark. Computer-Implemented System and Method for Providing Software Fault Tolerance. U.S. Patent Number 8745440, filed September 21, 2011, and published online on June 3, 2014. Patent URL: http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=8745440.PN.&OS=PN/8745440&RS=PN/8745440

Keywords for this news article include: Software, F5 Networks Inc.

Our reports deliver fact-based news of research and discoveries from around the world. Copyright 2014, NewsRx LLC


Source: Computer Weekly News

