Patent Issued for Analyzing Large Data Sets to Find Deviation Patterns

July 29, 2014

By a News Reporter-Staff News Editor at Information Technology Newsweekly -- According to news reporting originating from Alexandria, Virginia, by VerticalNews journalists, a patent by the inventors Sengupta, Arijit (San Mateo, CA); Stronger, Brad A. (San Mateo, CA); Kane, Daniel (Menlo Park, CA), filed on September 29, 2011, was published online on July 15, 2014.

The assignee for this patent, patent number 8782087, is BeyondCore, Inc. (San Mateo, CA).

Reporters obtained the following quote from the background information supplied by the inventors: "The present invention relates generally to quality management in a data-processing environment. Specifically, it relates to operational risk estimation and control associated with a data processing operation.

"Errors in documents during a data processing operation, for example, data entry and data transformation are common. These errors may result in significant losses to an organization, especially if a large amount of data is processed. It is therefore important to control the quality of documents. Conventional techniques for controlling the quality of documents include error detection and correction, and determination of parameters for measuring errors. One such measurement parameter can be the percentage of documents with errors. However, these parameters do not directly indicate the impact of the errors to the organization.

"Further, the conventional techniques for error detection are manual in nature. Errors can be detected by manually checking a set of documents to catch errors and compute the error rate. However, this technique may be error prone since the errors are detected manually. Further, the number of documents to be reviewed for catching errors (rather than just estimating error rates) is a function of the error rate. If the error rate is high, then a high percentage of documents need to be reviewed for catching a higher percentage of errors. Consequently, this technique can be labor intensive and therefore expensive.

"Another technique for error prevention involves double typing the same document. The two different versions of the same document are compared electronically, and any discrepancies are reviewed and corrected. However, in this case each document needs to be double typed, which can be a labor-intensive exercise. The double typing and the confirmation of its correctness are done on a larger set of the documents. Further, a supervisor has to manually review each discrepancy to detect which of the two operators has made an error, or to correct the errors. Further, manual reviews themselves are prone to errors and result in wastage of labor, money and time. Conventional techniques for detection of errors and correction are therefore cumbersome and expensive.

"Furthermore, data entry operators can become aware as to when the supervisors are carrying out quality checks, and concentrate on quality for that period. If the process requires double entry of a complete document, it may result in 'gaming' of the system by the data entry operators, i.e., they may be lax in the initial data entry and catch errors if there is a discrepancy.

"In other conventional techniques, critical fields are pre-defined by a supervisor/management. These critical fields are defined on the basis of their subjective criticality. Subsequently, preventive and corrective measures are taken in these critical fields. Further these critical fields themselves are not updated automatically and are only updated periodically during management review. As a result, the quality of the processed document may not be improved beyond a certain extent.

"Accordingly, there is a need for developing techniques that manage the quality of documents. Such techniques should be cost-effective, scalable, and less time-consuming. There is a need for techniques that can measure error rate, control error rate, predict errors, and enable their subsequent prevention. Further, there is a need for techniques that ensure that the critical fields are identified dynamically and automatically.

"Further, these techniques should enable benchmarking of organizations, i.e., how well organizations control data processing operational risk relative to one another. Such a benchmark should be comparable across process variations, organization size, document type, etc. Also, measurement schemes for data processing operators and systems should be directly correlated to measures used to evaluate the organizations. This enables true alignment of measurement schemes with performance requirements. These techniques should also deter 'gaming' of the system by data entry operators and supervisors."

In addition to obtaining background information on this patent, VerticalNews editors also obtained the inventors' summary information for this patent: "Various embodiments of the invention provide methods and systems for identifying critical fields in documents, for example so that quality improvement efforts can be prioritized on the critical fields.

"One aspect of the invention concerns a method for improving quality of a data processing operation in a plurality of documents. A set of documents is sampled. An error rate for fields in the documents is estimated based on the sampling. Critical fields are identified based on which fields have error rates higher than a threshold. Which fields are the critical fields may be automatically updated on a dynamic basis. In one approach, the error rate for a field is based on both a frequency of errors in the field and a relative weight for that field. For example, the relative weight might be based on the operational impact of data processing errors in that field.
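To make the weighting idea concrete, the following is a minimal sketch and not the patent's own implementation. It assumes the weighted error rate is simply the observed error frequency multiplied by the field's relative weight; the summary only says the rate is "based on" both quantities. The function names, field names, and the default weight of 1.0 are hypothetical.

```python
from typing import Dict, List


def weighted_error_rates(
    error_counts: Dict[str, int],
    sample_size: int,
    weights: Dict[str, float],
) -> Dict[str, float]:
    """Return a weighted error rate per field.

    Assumption: rate = (errors / sampled documents) * relative weight.
    Fields without an explicit weight default to 1.0.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return {
        field: (count / sample_size) * weights.get(field, 1.0)
        for field, count in error_counts.items()
    }


def critical_fields(rates: Dict[str, float], threshold: float) -> List[str]:
    """Return the fields whose weighted error rate exceeds the threshold."""
    return sorted(field for field, rate in rates.items() if rate > threshold)
```

Rerunning both functions on each new sample is one simple way to keep the set of critical fields updated as error counts change.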

"Various types of thresholds can be used. For example, the threshold may be a predetermined constant value. Alternately, the threshold may vary as a function of the relative weight of a field. It may also be adjustable, either by the user or dynamically based on the sampled documents. The threshold may be an aggregate across multiple fields, not just a threshold for a single field. For example, the set of critical fields may be determined by selecting the critical fields with the highest error rates until the aggregate sum of error rates reaches a threshold. The threshold can also vary as a function of the distribution of error rates for the fields. For example, if the distribution of error rates is bimodal, the threshold may be set at some point between the two modes.
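Two of the threshold variants described above can be sketched as follows. This is an illustration under stated assumptions: `select_by_aggregate` and `gap_threshold` are hypothetical names, and placing the cutoff at the midpoint of the largest gap between sorted error rates is only a crude stand-in for finding "some point between the two modes" of a bimodal distribution.

```python
from typing import Dict, List


def select_by_aggregate(rates: Dict[str, float], aggregate_threshold: float) -> List[str]:
    """Pick fields in descending order of error rate until the running
    sum of their rates reaches the aggregate threshold."""
    selected: List[str] = []
    total = 0.0
    for field, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
        if total >= aggregate_threshold:
            break
        selected.append(field)
        total += rate
    return selected


def gap_threshold(rates: Dict[str, float]) -> float:
    """Return the midpoint of the largest gap between sorted error rates.

    Assumption: a wide gap separates the low-error group from the
    high-error group when the distribution is roughly bimodal.
    """
    values = sorted(rates.values())
    if len(values) < 2:
        raise ValueError("need at least two fields")
    # Index of the lower edge of the widest gap.
    i = max(range(len(values) - 1), key=lambda j: values[j + 1] - values[j])
    return (values[i] + values[i + 1]) / 2
```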

"In various embodiments, the error rate for a field is determined in part by estimating a probability that data entered for a field in a document is in error, without knowing a correct transcription for the field. The data entered for a given field typically has a distribution among the different answers provided. Data-entered answers that are identical form a cluster. For example, if three operators type (or otherwise data enter) the same answer for a field, that is a cluster. A mode is the cluster for the most frequently appearing answer. There can be multiple modes if different answers are data-entered with the same frequency.
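The cluster and mode definitions in this passage map directly onto a frequency count. The sketch below assumes answers are compared as exact strings, with no normalization of case or whitespace; the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def clusters(answers: Sequence[str]) -> Dict[str, int]:
    """Group identical data-entered answers; the value is the cluster size."""
    return dict(Counter(answers))


def modes(answers: Sequence[str]) -> List[str]:
    """Return the most frequent answer(s). More than one answer is
    returned when several clusters tie for the largest size."""
    counts = Counter(answers)
    if not counts:
        return []
    top = max(counts.values())
    return sorted(answer for answer, n in counts.items() if n == top)
```

For example, `modes(["Smith", "Smith", "Smyth"])` yields a single mode, while two operators who disagree produce two modes of size one.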

"In one aspect, estimating the probability of error accounts for clusters, modes and/or their equivalencies. Equivalencies can be determined based on the number of and sizes of clusters, as well as other factors. In one approach, the clusters that have the largest size for a field are determined to be equivalent and correct answers. In another approach, these clusters are determined to be not equivalent. Nevertheless, a single cluster is not selected as the correct answer. Rather, each non-equivalent cluster is assigned a probability of being a correct answer that is a function of the cluster's size. In yet another approach, the cluster, for which the associated operators have a lower average historical error rate, is selected as a correct answer for a field. Clusters could also be selected as the correct answer, based on whether the associated operators have a lower error rate for the field within the set of documents currently being evaluated or whether the associated operators have a lower historic error rate for the field. Estimating the correct answer can also take into account whether the data entered for a field is the default value for that field.
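Two of these approaches can be illustrated with a short sketch. It is not the patented method: the summary says only that a cluster's probability is "a function of the cluster's size", and direct proportionality is assumed here as the simplest such function. The names `cluster_probabilities` and `pick_by_operator_history`, and the operator identifiers in the usage, are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def cluster_probabilities(entries: Dict[str, str]) -> Dict[str, float]:
    """Map each distinct answer to a probability of being correct.

    `entries` maps operator id -> answer typed for one field.
    Assumption: probability is proportional to cluster size.
    """
    counts = Counter(entries.values())
    total = sum(counts.values())
    return {answer: n / total for answer, n in counts.items()}


def pick_by_operator_history(
    entries: Dict[str, str],
    historical_error_rate: Dict[str, float],
) -> str:
    """Select the answer whose operators have the lowest average
    historical error rate. Raises ValueError if `entries` is empty."""
    rates_by_answer: Dict[str, List[float]] = defaultdict(list)
    for operator, answer in entries.items():
        rates_by_answer[answer].append(historical_error_rate[operator])
    return min(
        rates_by_answer,
        key=lambda a: sum(rates_by_answer[a]) / len(rates_by_answer[a]),
    )
```

The two functions can disagree: a minority answer typed by a historically accurate operator may be selected by the second function even though the first assigns it the lower probability.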

"Various embodiments of the present invention further provide methods and systems for quality management of a plurality of documents for a data-processing operation in an entity. Each document comprises at least one field. The entity includes an organization, or one or more employees of the organization.

"In an embodiment of the invention, the method measures the quality of a plurality of documents in a data-processing operation. A relative operational risk is assigned for errors in each field of the plurality of documents. The assignment is based on the relative operational impact of the errors, and a frequency of errors is determined for each field. Finally, an error rate is determined, based on the relative operational risk and the frequency of errors associated with each field.

"In another embodiment, a method for quality management of a plurality of documents for a data-processing operation in an entity is provided. The method comprises determination of error rates. Further, critical fields in the documents are dynamically identified based on the relative operational impact and the frequency of errors in the various fields. Errors are then reduced in the critical fields by using, for example, double typing of the data in the critical fields.

"Further, the occurrence of errors is predicted by determining a correlation between them and a set of process and external attributes. The possibility of occurrence of the errors is notified to a supervisor if the attributes exhibit the characteristics correlated with errors. The supervisor can then take preventive measures. Alternatively, other preventative/corrective actions can be taken based on the predictions. This process of error prediction, error rate computation and error prevention can be performed independently or iteratively, thereby reducing the occurrence of the errors. Further, the set of error correlation attributes and the set of critical fields also get updated depending upon changes in the measured error rate.
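One simple way to look for attributes that move with errors is a Pearson correlation between each attribute and a 0/1 error indicator. This is only an assumed stand-in, since the summary does not say which correlation measure is used. The attribute names in the usage and the 0.5 cutoff are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlated_attributes(
    attributes: Dict[str, Sequence[float]],
    errors: Sequence[int],
    cutoff: float = 0.5,
) -> List[str]:
    """Return attribute names whose absolute correlation with the
    per-document error indicator (1 = error, 0 = no error) meets the cutoff."""
    return sorted(
        name
        for name, values in attributes.items()
        if abs(pearson(values, errors)) >= cutoff
    )
```

A supervisor alert could then be raised whenever an incoming document's attributes fall in the range that the flagged attributes associate with errors.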

"In an embodiment of the invention, a set of documents is randomly identified for the purpose of sampling. Such a random sampling is used for determining the probability of errors related to specific fields of the documents.
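The sampling step can be sketched with the standard library. This sketch assumes that each sampled document is audited by some separate process that marks every field as erroneous or not; that audit, the function names, and the optional `seed` parameter are assumptions for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Optional


def sample_documents(doc_ids: Iterable[str], k: int, seed: Optional[int] = None) -> List[str]:
    """Draw up to k document ids uniformly at random, without replacement."""
    pool = list(doc_ids)
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(pool, min(k, len(pool)))


def field_error_probability(audited: List[Dict[str, bool]]) -> Dict[str, float]:
    """Estimate the per-field error probability from audited samples.

    Each item maps field name -> True if that field was found to be in error.
    """
    seen: Dict[str, int] = defaultdict(int)
    wrong: Dict[str, int] = defaultdict(int)
    for doc in audited:
        for field, is_error in doc.items():
            seen[field] += 1
            wrong[field] += int(is_error)
    return {field: wrong[field] / seen[field] for field in seen}
```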

"In another embodiment of the invention, the 'operational risk weighted error' is identified for each employee for each field corresponding to the randomly sampled documents. This helps in identifying the specific training needs of the employees and in better targeting training efforts. Employees may also be assigned to various tasks based on their error rates.
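A per-employee, per-field breakdown could look like the sketch below. It assumes the operational risk weighted error is the employee's error frequency in a field multiplied by that field's relative weight; the record layout, the function name, and the default weight of 1.0 are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# One audited entry: (employee id, field name, True if the entry was wrong).
Record = Tuple[str, str, bool]


def risk_weighted_errors(
    records: Iterable[Record],
    weights: Dict[str, float],
) -> Dict[Tuple[str, str], float]:
    """Return {(employee, field): weight * error frequency}."""
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    errors: Dict[Tuple[str, str], int] = defaultdict(int)
    for employee, field, is_error in records:
        key = (employee, field)
        totals[key] += 1
        errors[key] += int(is_error)
    return {
        key: weights.get(key[1], 1.0) * errors[key] / totals[key]
        for key in totals
    }
```

Sorting the result by value in descending order gives a rough priority list of which employee and field combinations might benefit most from training.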

"Furthermore, a pattern of errors can be identified at a process level and an employee level. The identified error patterns are then correlated with the root causes of errors. Subsequently, on the basis of the correlation, a database is generated. The database can then be used for identifying the root causes of further error patterns. The database can be used to diagnose the root cause of an error pattern, for example, the root cause of an error pattern can be training related or process related or system related. Once an error pattern (or high frequency of errors) corresponding to a field has been identified, either for individual employees or for groups of employees, the database can also be used for a predictive diagnosis of the error. The diagnosis may be a training, system or process error. If the diagnosis identifies a training need, then the method described in the previous paragraph can be used to better allocate training resources to the specific weaknesses of the employee or to specific weak employees. Employees may also be assigned to various tasks based on their error patterns.

"Furthermore, the database can provide information regarding the historic diagnosis of previously observed error patterns corresponding to a field and/or an employee. For example, the database can provide historic data about diagnosis of a previous error or error pattern, and the methodology adopted at that time for mitigating the error.

"The quality management system pertaining to the plurality of documents includes means for determining error rates. The means for reducing errors is responsible for reducing errors by focusing on critical fields in the plurality of documents. It also updates the critical fields based on changes in error rates and patterns. The means for predicting the occurrence of errors predicts errors by determining a correlation between the errors and a set of attributes. It also updates the set of attributes based on changes in error rates and patterns. A means for controlling is used to coordinate between the remaining system elements of the quality management system. The means for controlling keeps a tab on the quality of the plurality of documents.

"Other aspects of the invention include components and applications for the approaches described above, as well as systems and methods for their implementation."

For more information, see this patent: Sengupta, Arijit; Stronger, Brad A.; Kane, Daniel. Analyzing Large Data Sets to Find Deviation Patterns. U.S. Patent Number 8782087, filed September 29, 2011, and published online on July 15, 2014. Patent URL:

Keywords for this news article include: BeyondCore, BeyondCore Inc., Information Technology, Information and Data Processing.

Our reports deliver fact-based news of research and discoveries from around the world. Copyright 2014, NewsRx LLC

Source: Information Technology Newsweekly
