Patent Issued for Data Processing System for Reverse Reproduction in Data Stream Processing

June 24, 2014



By a News Reporter-Staff News Editor at Information Technology Newsweekly -- A patent by the inventor Imanishi, Junichi (Tokyo, JP), filed on March 3, 2010, was published online on June 10, 2014, according to news reporting originating from Alexandria, Virginia, by VerticalNews correspondents.

Patent number 8751566 is assigned to Hitachi, Ltd. (Tokyo, JP).

The following quote was obtained by the news editors from the background information supplied by the inventor: "A computer system that performs data stream processing generally handles data that arrives from one moment to the next (hereinafter, may be referred to as time-series data). Time-series data is one element comprising stream data. In other words, stream data is an aggregation of time-series data.

"Time-series data comprises a timestamp, which is information denoting the time that this time-series data occurred. In data stream processing, operations (for example, grouping, duplication removal, sum/difference/product set operations, a tabulation operation, and a join operation) are performed on time-series data.

"However, since time-series data arrives endlessly, the stream data (large quantity of time-series data) must be separated into finite datasets. As a method for separating the stream data into finite datasets, for example, there is the sliding window method (for example, Non Patent Literature 1).

"According to the sliding window method, the lifetime of the time-series data is configured. Data stream processing, for example, includes the following processes: (1) A process for acquiring on the basis of the configured lifetime a window (a dataset) 42, which at a certain point in time will become an operation target, from inputted stream data 41 as shown in FIG. 2; (2) a process for performing an operation on the time-series data (input data) 45 included in the dataset 42; and (3) a process for sequentially outputting output data 46 inside a dataset 43 of output data 46 comprising the operation result.

"As a result of this, stream data 44 is constructed using the sequentially outputted output data 46."
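The three sliding-window steps quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patent's implementation: the function name `forward_window_totals`, the sum as the operation, and the rule that a piece of data is alive over the half-open interval from its timestamp to its timestamp plus the lifetime are all hypothetical choices made for the example.

```python
from collections import deque


def forward_window_totals(stream, lifetime):
    """Return (timestamp, total) outputs for a time-based sliding window.

    Assumption: data with timestamp ts is alive over [ts, ts + lifetime).
    """
    window = deque()  # (timestamp, value) pairs currently inside the window
    total = 0
    outputs = []
    for t, v in stream:  # stream is ordered from older to newer timestamps
        # (1) Acquire the window: drop data whose lifetime has elapsed.
        while window and window[0][0] + lifetime <= t:
            _, old_value = window.popleft()
            total -= old_value
        window.append((t, v))
        # (2) Perform the operation (here, a sum) on the data in the window.
        total += v
        # (3) Output the result, stamped with the newest input's timestamp.
        outputs.append((t, total))
    return outputs


stream = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
outputs = forward_window_totals(stream, lifetime=3)
# outputs == [(1, 1), (2, 3), (3, 6), (4, 9), (5, 12)]
```

The output list plays the role of the constructed stream data 44: one result per input, in chronological order.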

In addition to the background information obtained for this patent, VerticalNews journalists also obtained the inventor's summary information for this patent: "Technical Problem

"Hereinbelow, for convenience sake, the expressions 'old'/'new' will be used with respect to time. For example, a first time is older than a second time signifies that the first time is further in the past than the second time. Alternatively, a first time is newer than a second time signifies that the first time is further in the future than the second time.

"Now then, time-series data (input data) arriving from one moment to the next is temporarily stored. Then, in a case where an event (for example, a failure or a symptom of a failure) has been detected, based on input data comprising a timestamp that denotes a time further in the past than the time at which the event was detected (hereinafter, the event detection time), a process (hereinafter, search process) for searching for a phenomenon related to the event (for example, the cause of the event, referred to hereinafter as event-related phenomenon) is performed. Specifically, for example, a time that constitutes a criterion (hereinafter, reference time) is configured, and the search process is performed on the basis of this time. In the search process, for example, input data, which comprise timestamps denoting times subsequent to the reference time, are processed chronologically (in the time axis direction). That is, multiple pieces of input data are processed in order from older-to-newer times denoted by the timestamps beginning from the input data comprising the oldest timestamp (the timestamp that denotes the oldest time of the timestamps denoting times subsequent to the reference time). The event-related phenomenon, for example, is discovered from the output data created based on the input data related to the event-related phenomenon.

"Because the event-related phenomenon is a phenomenon that is discovered in the search process, normally it is not possible to know the time that the event-related phenomenon occurred (hereinafter, event-related time) prior to the start of the search process. This makes it difficult to configure the reference time.

"When the time between the event detection time and the reference time is long, most likely the reference time will be a time that is further in the past than the event-related time, and therefore the event-related phenomenon will most likely be discovered in accordance with the search process. However, in this case, there is likely to be a large quantity of time-series data that will have to be processed before the event-related phenomenon is found, and therefore, it will probably take a long time for the event-related phenomenon to be found.

"Alternatively, when the time between the event detection time and the reference time is short, most likely the reference time will be a time that is further in the future than the event-related time, and therefore the event-related phenomenon will most likely not be discovered in accordance with the search process. In this case, it is necessary to change the reference time and perform the search process once again. When the search process has to be performed again, it will ultimately take a long time until the event-related phenomenon is found.

"Consequently, an object of the present invention is to shorten the time required to find a phenomenon related to a detected event.

"Solution to Problem

"Multiple pieces of input data are processed in order from newer-to-older times denoted by the timestamps within the input data beginning from the input data comprising the timestamp that denotes the newest time of the timestamps denoting times prior to the reference time (for example, the event detection time) (hereinafter, latest reference stamp). That is, reverse reproduction is performed in the data stream processing.

"'Reverse reproduction in data stream processing' is reproduction in the reverse order from the forward reproduction in data stream processing.

"'Forward reproduction in data stream processing' is chronological (hereinafter, may be referred to as 'forward order') reproduction. Specifically, forward reproduction in data stream processing is a process performed by inputting multiple pieces of input data into a dataset (window) in order from older-to-newer times denoted by the timestamps in the input data as shown in FIG. 3A, and the result of this processing is that multiple pieces of output data (51 through 55) that have been created are outputted in order from older-to-newer times denoted by the timestamps in the output data (T=1, 2, 3, 4, 5). In accordance with this, the input data is processed in the time axis direction having a certain time as a reference.

"Alternatively, 'reverse reproduction in data stream processing' is reproduction in reverse chronological order (hereinafter, may be referred to as 'reverse order'). Specifically, reverse reproduction in data stream processing, as shown in FIG. 3B, is inputting input data into a dataset in order from newer-to-older times denoted by the timestamps in the input data, processing the inputted input data, and as a result of this processing, outputting multiple pieces of output data (61 through 65) that have been created in order from newer-to-older times denoted by the timestamps in the output data (T=5, 4, 3, 2, 1). The results of the output data processing here are the same as the results of the processing for the output data with the same timestamps as at forward reproduction. In accordance with this, the input data is processed in reverse order to the time axis direction having a certain time as a reference.

"According to the present invention, multiple pieces of input data are processed in reverse order (that is, an order that goes backward in time) beginning from input data comprising the latest reference stamp. This makes it possible to shorten the time required for discovering the event-related phenomenon.
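A rough way to see the claimed saving is to count how many stored items each search order touches before the event-related phenomenon is found. The sketch below is a simplified model, not the patent's search process: the function names, the `is_related` predicate, the inclusive treatment of the reference time, and the sample timestamps are all assumptions made for illustration.

```python
def forward_search(timestamps, reference_time, is_related):
    """Scan older-to-newer starting at reference_time.

    Returns (time_found_or_None, number_of_items_processed).
    """
    processed = 0
    for t in sorted(ts for ts in timestamps if ts >= reference_time):
        processed += 1
        if is_related(t):
            return t, processed
    return None, processed


def reverse_search(timestamps, reference_time, is_related):
    """Scan newer-to-older starting at reference_time (e.g. the detection time)."""
    processed = 0
    for t in sorted((ts for ts in timestamps if ts <= reference_time), reverse=True):
        processed += 1
        if is_related(t):
            return t, processed
    return None, processed


# Hypothetical data: one item per time unit, event detected at T=1000,
# event-related phenomenon at T=990 (unknown before the search starts).
timestamps = range(1, 1001)
is_related = lambda t: t == 990

early_reference = forward_search(timestamps, 100, is_related)   # (990, 891)
late_reference = forward_search(timestamps, 995, is_related)    # (None, 6)
from_detection = reverse_search(timestamps, 1000, is_related)   # (990, 11)
```

In this toy model the forward search either processes many items (reference time set far in the past) or misses the phenomenon and has to be rerun (reference time set too late), while the reverse search starting from the event detection time needs no guess at all.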

"Furthermore, when employing reverse reproduction in data stream processing, it is desirable to keep in mind a number of points, for example, the following four points. Furthermore, in the following explanation, it is supposed that one piece of input data comprises a timestamp (T) and a value (V), and that one piece of output data created in accordance with an operation that uses one or multiple pieces of input data comprises a timestamp (T) and the total of the value(s) (V). Therefore, it is supposed that the operation is a tabulation.

"A first point to keep in mind relates to the creation of the output data.

"As shown in FIG. 4A, when input data (71 and 72) is inputted during forward reproduction, the results of the operation on the input data that exist at this time are outputted (73 and 74). For example, in a case where the input data 71 has been inputted, the only input data that exists is the input data 71, and therefore the output data 73, which comprises the same timestamp and value (T=1, total=1) as the input data 71 is outputted. Next, in a case where the input data 72 has been inputted, since the existing input data are the input data 71 and 72, the output data 74, which comprises the timestamp T=2 of the input data 72, and a total of 3, which is the total of the value V=1 of the input 71 and the value V=2 of the input data 72, is outputted.

"The dataset of the operation-target input data (71 and 72) at this time is a set of data generated during a period that extends back into the past from a certain point in time. That is, when the input data 72 was inputted, the input data 71, which was generated further in the past than this data, existed in the operation-target dataset.

"Therefore, simply performing an operation by inputting input data in the reverse direction of the time axis (that is, simply performing reverse reproduction) does not enable the operation result to be reproduced properly. Specifically, for example, as shown in FIG. 4B, since the input data is inputted in the newer-to-older order of the times denoted by the timestamps in the reverse reproduction, the input data 82 is inputted, and thereafter, the input data 81 is inputted. When the input data 82 is inputted, normally the input data 81 should exist in the operation-target dataset, but the operation is performed in a state in which the input data 81 does not exist. For this reason, the output data 84, which comprises an incorrect total of 2, is outputted. Furthermore, although the input data 82 should not exist in the operation-target dataset when the input data 81 is inputted, the operation is performed in a state in which this input data 82 exists. For this reason, the output data 83, which comprises an incorrect total of 3, is outputted.
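The failure described for FIG. 4B can be reproduced directly. The sketch below uses the two pieces of input data from the quoted example, (T=1, V=1) and (T=2, V=2); the helper name `running_totals` is hypothetical, and the window is assumed to be large enough that nothing expires.

```python
def running_totals(inputs):
    """Emit (timestamp, total of all values inputted so far) in input order."""
    total = 0
    outputs = []
    for t, v in inputs:
        total += v
        outputs.append((t, total))
    return outputs


inputs = [(1, 1), (2, 2)]  # (T, V) pairs for the two pieces of input data

forward = running_totals(inputs)
# forward == [(1, 1), (2, 3)], the correct results of FIG. 4A

naive_reverse = running_totals(list(reversed(inputs)))
# naive_reverse == [(2, 2), (1, 3)]: both totals are wrong, as in FIG. 4B
```

At T=2 the naive reverse run is missing the older data (total 2 instead of 3), and at T=1 it still contains the newer data (total 3 instead of 1).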

"Consequently, reverse reproduction must be performed by taking the lifetime of the input data into account. The lifetime, for example, is specified (expressed) using either a time period or the latest number of pieces of data. The 'latest number of pieces of data' is the maximum number of pieces of input data that can be put into a dataset (window).

"In a case where the lifetime is specified using a time period, the timestamp inside the input data is corrected. The time denoted by the post-correction timestamp is a time computed by adding the lifetime to the time denoted by the pre-correction timestamp. The time denoted by the post-correction timestamp is the extinction time in the forward order (the time at which the input data becomes extinct in forward reproduction), in other words, the input time in the reverse order (the time denoted by the timestamp inside the input data in reverse reproduction). Alternatively, the time denoted by the pre-correction timestamp is the input time in the forward order (the time denoted by the timestamp inside the input data in forward reproduction), in other words, the extinction time in the reverse order (the time at which this input data becomes extinct in reverse reproduction). Therefore, in reverse reproduction, the input data becomes extinct at the time denoted by the pre-correction timestamp inside this input data (in other words, the time obtained by subtracting the lifetime from the time denoted by the post-correction timestamp inside this input data). As shown in FIG. 4C, the operation is performed as though the input data (92 and 91) do not exist at the times (T=5, T=4) denoted by the post-correction timestamp inside these input data, and the operation is performed as though the input data (92 and 91) do exist at the times (T=2, T=1) denoted by the pre-correction timestamp (in the example of FIG. 4C, the lifetime is 3). As a result of this, it is possible to correctly reproduce the state of the operation-target dataset at the points in time of the input times in forward reproduction (T=2, T=1), that is, the original output data (93 and 94) can be obtained. The output data (93 and 94) are outputted in the order of output data 94 and 93. That is, the same output data 93 and 94 as the output data 73 and 74 outputted in forward reproduction are outputted in reverse order from the case of the forward reproduction.
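The timestamp correction for a time-period lifetime can be sketched as an event list processed backward in time. This is a minimal illustration, not the patented implementation: the function name `reverse_reproduction`, the event encoding, the assumption of distinct timestamps, and the tie-breaking rule (extinctions are handled before entries at the same time, so that data is alive over the half-open interval from T to T plus the lifetime) are all assumptions introduced here.

```python
def reverse_reproduction(inputs, lifetime):
    """Reproduce windowed totals from newer to older times.

    Each input (T, V) enters the window at its post-correction time
    T + lifetime and becomes extinct at its pre-correction time T.
    Assumes distinct timestamps.
    """
    ENTER, EXTINCT = 1, 0
    events = []
    for t, v in inputs:
        events.append((t + lifetime, ENTER, v))  # input time in reverse order
        events.append((t, EXTINCT, v))           # extinction time in reverse order
    # Newest time first; at equal times, extinctions come before entries.
    events.sort(key=lambda e: (-e[0], e[1]))

    total = 0
    outputs = []
    for time, kind, v in events:
        if kind == ENTER:
            total += v
        else:
            # The state just before extinction equals the forward state at `time`.
            outputs.append((time, total))
            total -= v
    return outputs


result = reverse_reproduction([(1, 1), (2, 2)], lifetime=3)
# result == [(2, 3), (1, 1)]: the forward outputs, in reverse order
```

With the lifetime of 3 from the FIG. 4C example, the two pieces of data enter at T=5 and T=4 and become extinct at T=2 and T=1, which yields the forward results (T=2, total=3) and (T=1, total=1) in newer-to-older order.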

"In a case where the lifetime is specified using the number of pieces of latest data, when the input data is inputted in the newer-to-older order of the times denoted by the timestamps inside the input data, the state of the dataset is not the state of the time denoted by the timestamp inside the input data that was inputted to this dataset last, but rather is the state at the time denoted by the timestamp inside the input data initially inputted to this dataset. That is, the time corresponding to the state of the dataset is displaced in accordance with the number of pieces of latest data. Therefore, in accordance with this, the timestamp inside the output data must be corrected in accordance with the specified number of pieces of latest data.

"A second point to keep in mind relates to the timing for outputting the output data.

"As shown in FIG. 5A, in the forward reproduction, output data (data comprising operation results) can be outputted every time the input data (101 through 103) is inputted into the window (dataset) 100.

"However, as shown in FIG. 5B, in the reverse reproduction, even though the input data comprising T=10 is inputted to the window 110, a correct operation result is not obtained at this point in time. In this case, the output of the latest operation result must start at the point in time at which the dataset is reproduced correctly, that is, the point in time at which a number of pieces of input data (111 through 116) corresponding to that specified by the lifetime of the data (for example, 6) has been inputted to the window 110.

"Consequently, in a case where the lifetime has been specified using a time period, when input data comprising a timestamp denoting the latest time has been inputted into the dataset in the forward order in the forward reproduction, this dataset transitions to the latest state. Alternatively, in the reverse reproduction, when input data comprising a timestamp denoting the latest time of the input data inputted into the dataset becomes extinct, this dataset transitions to the latest state. Therefore, subsequent latest operation results (output data) are outputted from the point in time at which the input data comprising the timestamp denoting the latest time becomes extinct.

"In a case where the lifetime is specified using the number of pieces of latest data, when the inside of the dataset is full of input data, that is, when input data corresponding to the specified number of pieces of latest data has been inputted, this dataset is the latest state. Subsequent latest operation results (output data) are outputted from this point in time.
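For a lifetime given as a number of pieces of latest data, both points (the corrected output timestamp and the delayed start of output) can be shown together. The sketch below is an assumption-laden illustration: the function name `reverse_count_window`, the sum as the operation, and the sample values are hypothetical, and it simply emits nothing until the window is full.

```python
from collections import deque


def reverse_count_window(inputs_newest_first, count):
    """Reverse reproduction for a window holding the latest `count` items.

    Output starts only once the window is full, and each output carries the
    timestamp of the newest item in the window (the one inputted first),
    not the timestamp of the item inputted last.
    """
    window = deque()  # left end = newest item, right end = oldest item
    outputs = []
    for t, v in inputs_newest_first:
        if len(window) == count:
            window.popleft()  # the newest item leaves as an older one arrives
        window.append((t, v))
        if len(window) == count:
            newest_timestamp = window[0][0]
            outputs.append((newest_timestamp, sum(val for _, val in window)))
    return outputs


# Hypothetical data: value equals timestamp, inputted from T=10 down to T=3.
inputs = [(t, t) for t in range(10, 2, -1)]
outputs = reverse_count_window(inputs, count=6)
# outputs == [(10, 45), (9, 39), (8, 33)]
```

No result is emitted for the first five inputs. The first output appears when the sixth piece of data (T=5) has been inputted, and it is stamped T=10 because the window then matches the forward-order state at T=10.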

"A third point to keep in mind relates to synchronization when joining datasets. That is, when the lifetime of the data and the output timing are taken into account, there are cases where the states of datasets will deviate at some point in time (a timestamp) between multiple datasets at the point in time when a certain input data is inputted. Therefore, when multiple datasets are joined, the datasets must be synchronized by taking into account the point in time at which the states of the datasets deviate.

"A fourth point to keep in mind relates to the relationship between an output timing specified by the user and the output timing during reverse reproduction. Specifically, for example, in a case where the user specifies that operation results be outputted when the input data has been inputted, in the reverse reproduction the operation results are outputted when the input data becomes extinct. Alternatively, in a case where the user specifies that operation results be outputted when the input data becomes extinct, in the reverse reproduction the operation results are outputted when the input data is inputted.
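The fourth point amounts to swapping the two output triggers when running in reverse. A minimal sketch, assuming a hypothetical `Trigger` enumeration and `reverse_trigger` helper that do not appear in the source:

```python
from enum import Enum


class Trigger(Enum):
    ON_INPUT = "input"            # output when input data is inputted
    ON_EXTINCTION = "extinction"  # output when input data becomes extinct


def reverse_trigger(user_trigger):
    """Map the user-specified output timing to its reverse-reproduction timing."""
    if user_trigger is Trigger.ON_INPUT:
        return Trigger.ON_EXTINCTION
    return Trigger.ON_INPUT


effective = reverse_trigger(Trigger.ON_INPUT)
# effective is Trigger.ON_EXTINCTION
```

Applying the mapping twice returns the original setting, which reflects that input and extinction simply exchange roles between the two playback directions.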

"Data stream processing technology related to the present invention can be expected to be employed in a variety of fields, such as, for example, the prediction of stock price based on stock trading data, the prediction of traffic jams using operational data (for example, data denoting speed, direction of travel, and so forth) collected from large numbers of vehicles, the monitoring of Web server accesses, the monitoring of the operating status of machines and equipment, traffic control for vehicles in the distribution industry, the preparation of new management indicators based on accumulated business data (data related to business), and the discovery of anomalous patterns based on multiple pieces of diagnosis data that has been collected (for example, the electrocardiograph data, scanned image data, and so forth of multiple patients)."

For the URL and more information on this patent, see: Imanishi, Junichi. Data Processing System for Reverse Reproduction in Data Stream Processing. U.S. Patent Number 8751566, filed March 3, 2010, and published online on June 10, 2014. Patent URL: http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=8751566.PN.&OS=PN/8751566RS=PN/8751566

Keywords for this news article include: Hitachi Ltd., Information Technology, Information and Data Processing.

Our reports deliver fact-based news of research and discoveries from around the world. Copyright 2014, NewsRx LLC


Source: Information Technology Newsweekly

