News Column

Patent Issued for Non-Volatile Memory for Checkpoint Storage

August 7, 2014



By a News Reporter-Staff News Editor at Computer Weekly News -- A patent by the inventors Blumrich, Matthias A. (Yorktown Heights, NY); Chen, Dong (Yorktown Heights, NY); Cipolla, Thomas M. (Yorktown Heights, NY); Coteus, Paul W. (Yorktown Heights, NY); Gara, Alan (Yorktown Heights, NY); Heidelberger, Philip (Yorktown Heights, NY); Jeanson, Mark J. (Rochester, MN); Kopcsay, Gerard V. (Yorktown Heights, NY); Ohmacht, Martin (Yorktown Heights, NY); Takken, Todd E. (Yorktown Heights, NY), filed on January 10, 2011, was published online on July 22, 2014, according to news reporting originating from Alexandria, Virginia, by VerticalNews correspondents.

Patent number 8788879 is assigned to International Business Machines Corporation (Armonk, NY).

The following quote was obtained by the news editors from the background information supplied by the inventors: "The present invention relates generally to checkpointing in computer systems; and, particularly, to checkpoints in applications running on high performance parallel computers.

"To achieve high performance computing, multiple individual processors have been interconnected to form a multiprocessor computer system capable of parallel processing. Multiple processors can be placed on a single chip, or several chips--each containing one or more processors--become interconnected to form single- or multi-dimensional computing networks into a multiprocessor computer system, such as described in co-pending U.S. Patent Publication No. 2009/0006808 A1 corresponding to U.S. patent application Ser. No. 11/768,905, the whole contents and disclosure of which is incorporated by reference as if fully set forth herein, describing a massively parallel supercomputing system.

"Some processors in a multiprocessor computer system, such as a massively parallel supercomputing system, typically implement some form of direct memory access (DMA) functionality that facilitates communication of messages within and among network nodes, each message including packets containing a payload, e.g., data or information, to and from a memory, e.g., a memory shared among one or more processing elements. Types of messages include user messages (applications) and system initiated (e.g., operating system) messages.

"Generally, a uni- or multi-processor system communicates with a single DMA engine, typically having multi-channel capability, to initialize data transfer between the memory and a network device (or other I/O device).

"Such a DMA engine may directly control transfer of long messages, which long messages are typically preceded by short protocol messages that are deposited into reception FIFOs on a receiving node (for example, at a compute node). Through these protocol messages, the sender compute node and receiver compute node agree on which injection counter and reception counter (not shown) identifications to use, and what the base offsets are for the messages being processed. The software is constructed so that the sender and receiver nodes agree to the counter ids and offsets without having to send such protocol messages.

"In parallel computing system, such as BluGene.RTM. (a trademark of International Business Machines Corporation, Armonk N.Y.), system messages are initiated by the operating system of a compute node. They could be messages communicated between the OS (kernel) on two different compute nodes, or they could be file I/O messages, e.g., such as when a compute node performs a 'printf' function, which gets translated into one or more messages between the OS on a compute node OS and the OS on (one or more) I/O nodes of the parallel computing system. In highly parallel computing systems, a plurality of processing nodes may be interconnected to form a network, such as a Torus; or, alternately, may interface with an external communications network for transmitting or receiving messages, e.g., in the form of packets.

"As known, a checkpoint refers to a designated place in a program at which normal processing is interrupted specifically to preserve the status information, e.g., to allow resumption of processing at a later time. Checkpointing, is the process of saving the status information. While checkpointing in high performance parallel computing systems is available, generally, in such parallel computing systems, checkpoints are initiated by a user application or program running on a compute node that implements an explicit start checkpointing command, typically when there is no on-going user messaging activity.

"Further, in prior art user-initiated checkpointing, programs running on large parallel computer systems often save the state, e.g., of long running calculations, at predetermined intervals. This saved data is called a checkpoint. This process enables restarting the calculation from a saved checkpoint after a program interruption, e.g., due to soft errors, hardware or software failures, machine maintenance or reconfiguration. Large parallel computers are often reconfigured, for example to allow multiple jobs on smaller partitions for software development, or larger partitions for extended production runs.

"A typical checkpoint requires saving the data from a relatively large fraction of available memory of each processor, which is then typically written to an external file system. Writing these checkpoints can be a relatively slow process for a highly parallel machine with limited I/O bandwidth to file servers. The optimum checkpoint interval for reliability and utilization depends on the problem data size, required compute time, expected failure rate, and the time required to write the checkpoint to storage. Reducing the time required to write a checkpoint improves system performance, availability and effective throughput.

"Thus, it would be highly desirable to increase the speed and efficiency of the checkpoint process at each parallel computing node."

In addition to the background information obtained for this patent, VerticalNews journalists also obtained the inventors' summary information for this patent: "In one aspect, there is provided a system and method for increasing the speed and efficiency of a checkpoint process performed at a computing node of a computing system, such as a massively parallel computing system.

"In one embodiment, there is provided a system and method for increasing the speed and efficiency of a checkpoint process performed at a computing node of a computing system by integrating a non-volatile memory device, e.g., flash memory cards, with a direct interface to the processor and memory that make up each computing node.

"Thus, in one aspect, there is provided a method for checkpointing messages in a parallel computing system having a plurality of nodes connected as a network, each node having multiple processor units and an associated memory operatively connected therewith via an interconnect device, the method comprising:

"receiving, at one or more control units, a command instruction from a processor for controlling data flow of packets received by a network and flow of packets to be transmitted to the network, each the one or more control units coupled to each of a plurality of devices within the node involved with processing of received and transmitted packets for communicating data therebetween;

"performing, at each the node, a checkpoint, the performing including: generating, at the control units, a control signal to initiate stopping flow of packets received by a network and flow of packets to be transmitted to the network; and,

"responding to a first control signal received at a logic device associated with each the plurality of devices, to initiate obtaining the checkpoint data when the packet data flow has stopped, the checkpointing data obtained from the plurality of devices for receipt in register devices associated with each the one or more control units; and,

"responding to a second control signal for writing out the checkpoint data received at the associated register devices to a non-volatile memory storage device,

"wherein each the control unit generates selective control signals to perform the checkpointing of system related data in presence of messaging activity associated with a user application running at the node.

"Further, there is provided a system for checkpointing messages in a parallel computing system having a plurality of nodes connected as a network, each node having multiple processors and an associated memory operatively connected therewith via an interconnect device, the checkpointing system comprising:

"at each node:

"a non-volatile memory device;

"one or more control units, each control unit adapted to receive command instructions from a processor for controlling data flow of packets received by a network and flow of packets to be transmitted to the network, each the one or more control units coupled to each of a plurality of devices within the node involved with processing of received and transmitted packets for communicating data therebetween;

"the control unit responsive to a control signal for performing a checkpoint at the node, wherein the control unit generates a control signal to initiate stopping of a flow of packets received by a network and flow of packets to be transmitted to the network; and,

"a logic device associated with each the plurality of devices and each responsive to a control signal to initiate obtaining the checkpoint data when the packet data flow has stopped, the checkpointing data obtained from the plurality of devices for receipt in register devices associated with each the one or more control units,

"the control unit responsive to a further control signal for writing out the checkpoint data received at the associated register devices to said non-volatile memory storage device,

"wherein each the control unit generates selective control signals to perform the checkpointing of system related data in presence of messaging activity associated with a user application running at the node.

"In a further aspect, there is provided a computer program product for checkpointing messages in a parallel computing system having a plurality of nodes connected as a network, each node having multiple processor units and an associated memory operatively connected therewith via an interconnect device, the computer program product comprising:

"a storage medium readable by a processing circuit and storing instructions for execution by the processing circuit for performing a method comprising:

"receiving, at one or more control units, a command instruction from a processor for controlling data flow of packets received by a network and flow of packets to be transmitted to the network, each the one or more control units coupled to each of a plurality of devices within the node involved with processing of received and transmitted packets for communicating data therebetween;

"performing, at each the node, a checkpoint, the performing including: generating, at the control units, a control signal to initiate stopping flow of packets received by a network and flow of packets to be transmitted to the network; and,

"responding to a first control signal received at a logic device associated with each the plurality of devices, to initiate obtaining the checkpoint data when the packet data flow has stopped, the checkpointing data obtained from the plurality of devices for receipt in register devices associated with each the one or more control units; and,

"responding to a second control signal for writing out the checkpoint data received at the associated register devices to a non-volatile memory storage device,

"wherein each the control unit generates selective control signals to perform the checkpointing of system related data in presence of messaging activity associated with a user application running at the node.

"Advantageously, incorporating a non-volatile memory device such as flash memory provides a local storage for checkpoints thus relieving the bottleneck due to I/O bandwidth limitations associated with some memory access operations."

For the URL and more information on this patent, see: Blumrich, Matthias A.; Chen, Dong; Cipolla, Thomas M.; Coteus, Paul W.; Gara, Alan; Heidelberger, Philip; Jeanson, Mark J.; Kopcsay, Gerard V.; Ohmacht, Martin; Takken, Todd E. Non-Volatile Memory for Checkpoint Storage. U.S. Patent Number 8788879, filed January 10, 2011, and published online on July 22, 2014. Patent URL: http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=8788879.PN.&OS=PN/8788879RS=PN/8788879

Keywords for this news article include: Software, International Business Machines Corporation.

Our reports deliver fact-based news of research and discoveries from around the world. Copyright 2014, NewsRx LLC


Source: Computer Weekly News

