July 22, 2014

Researchers Submit Patent Application, "Identifying Confidential Data in a Data Item by Comparing the Data Item to Similar Data Items from Alternative Sources", for Approval

By a News Reporter-Staff News Editor at Information Technology Newsweekly -- From Washington, D.C., VerticalNews journalists report that a patent application by the inventors Thomason, Michael Scott (Raleigh, NC); Arun, Jai S. (Morrisville, NC); Myers, Benjamin L. (Durham, NC); Rotermund, Chad C. (Williamsburg, VA); Jariwala, Ajit J. (Cary, NC), filed on January 2, 2013, was made available online on July 10, 2014.

The patent's assignee is International Business Machines Corporation.

News editors obtained the following quote from the background information supplied by the inventors: "This disclosure relates generally to protection against unintentional disclosure of confidential information in a computing environment.

"A problem arises, however, when referencing a document that contains both confidential information and non-confidential information, especially when that document may need to live (in whole or in part) external to the enterprise (or to a particular system thereof). Consider, for example, a customer filing a problem management record (PMR) with an external support provider. That problem record, which may have been generated in an automated manner, may include both confidential information and information about the problem. The non-confidential information, if it could be extracted, may have independent value (e.g., if published in a support note). In such case, however, it would be necessary to remove or redact the confidential information. Removing or redacting the confidential information manually, naturally, is prone to errors of identification and unintentional omission. Publication of even a seemingly innocuous piece of information can create a significant legal or financial liability.

"Existing confidential data detection solutions often rely on various strategies to prevent confidential information from being disclosed inadvertently. In one approach, a list of confidential items or terms is used; a document is compared against this list to identify portions that might require omission or redaction. Assembling and maintaining such a list, however, are non-trivial tasks. Another approach is to run the document against a simple tool, such as a spellchecker to allow irregularities to be exposed (and which then might be acted upon proactively). This approach, however, produces a large number of both false positives and false negatives. Yet another approach involves data string matching, e.g., searching for and removing terms matching a particular format (e.g., (###) ###-####), but this approach is narrow in scope. Other known approaches involve machine learning systems, pattern matching, and the like.
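To make the limits of two of these existing approaches concrete, here is a minimal Python sketch of list-based redaction and format-based string matching. The function names, the placeholder text, and the sample message are illustrative assumptions, not taken from the patent application.

```python
import re

# Assumed regular expression for the "(###) ###-####" format mentioned above.
PHONE_PATTERN = re.compile(r"\(\d{3}\) \d{3}-\d{4}")


def redact_by_format(text, placeholder="[REDACTED]"):
    """Replace only strings that match the one fixed format."""
    return PHONE_PATTERN.sub(placeholder, text)


def redact_by_list(text, confidential_terms, placeholder="[REDACTED]"):
    """Replace every occurrence of each term in a hand-maintained list."""
    for term in confidential_terms:
        text = re.sub(re.escape(term), placeholder, text, flags=re.IGNORECASE)
    return text


# Hypothetical sample message.
sample = "Call (919) 555-0100 or 919.555.0101; host db01.acme.example failed."

print(redact_by_format(sample))
print(redact_by_list(sample, ["acme"]))
```

In this sample the format matcher misses the second phone number because it is written with dots, and the list matcher catches only what someone remembered to add to the list, which illustrates the narrow scope and maintenance burden the inventors describe.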

"There remains a need in the art to provide for enhanced techniques to identify and distinguish confidential and non-confidential information from within a document (or, more generally, a data item) so that the confidential information may remain protected against inappropriate disclosure."

As a supplement to the background information on this patent application, VerticalNews correspondents also obtained the inventors' summary information for this patent application: "The technique herein takes advantage of a characteristic of confidential information, namely, that it is unlikely to be found outside of (i.e., external to) an organization that owns or controls (or at least desires to protect) that confidential information. As such, when it is desired to examine a given 'data item' (e.g., a document) for inclusion of confidential information, the data item is compared against data items having similar structure and content from one or more other sources. In the example scenario of a data item being a problem record, a particular PMR (the one being examined for potential confidential information) from a customer is compared to other PMRs from one or more other customer(s) distinct from the customer. By comparing data items (of similar structure and content) from different sources, confidential information is then made to stand out by searching for terms (from the sources) that are not shared between or among the data items. In contrast, common words or terms that are shared across the sources are ignored as likely being non-confidential information; what remains as not shared may then be classified as confidential information and protected accordingly (e.g., by omission, redaction, substitution, or the like). Using this technique, non-confidential information may be safely segmented from confidential information, preferably in a dynamic, automated manner.
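The core idea in the summary, that terms not shared with similar items from other sources stand out as potentially confidential, can be sketched as a set difference. This is only an illustration under simple assumptions: the tokenizer, the function names, and the sample messages are hypothetical, and the application does not specify how terms are extracted.

```python
import re

# Assumed tokenizer: lowercase words, optionally joined by '.', '_' or '-'.
TOKEN = re.compile(r"[a-z0-9]+(?:[._-][a-z0-9]+)*")


def tokenize(text):
    """Return the set of terms found in a data item."""
    return set(TOKEN.findall(text.lower()))


def unshared_terms(item, other_items):
    """Return terms in `item` that appear in none of the other sources' items."""
    shared = set()
    for other in other_items:
        shared |= tokenize(other)
    return tokenize(item) - shared


# Hypothetical problem records from three different customers.
item = "Connection to db01.acme.example failed for user jsmith"
others = [
    "Connection to server7.globex.example failed for user admin",
    "Connection to 10.0.0.5 failed for user root",
]

print(sorted(unshared_terms(item, others)))
```

Common words such as "connection", "failed", and "user" are shared across the sources and drop out, leaving the host name and user name as candidates for protection.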

"In one embodiment, a method of identifying confidential information in a data item (e.g., a document, a report, a file, a log, an email, a message, or other communication) is described. The data item is associated with a source, such as a customer. The method begins by receiving or obtaining, from each of a set of alternative sources, a data item of a same type and format as the data item. The data item (being examined for confidential information) is then compared to each of the data item(s) obtained from the set of alternative sources to identify occurrences of particular pieces of information in the data item. Based on the occurrences of particular pieces of information in the data item and a given sensitivity criteria, one or more pieces of information in the data item are then segmented as representing potential confidential information. These one or more pieces of information may then be highlighted (e.g., in a visual representation of the data item) so that a user viewing the data item may take a given action (e.g., removal, redaction, substitution, or the like) with respect to a particular piece of information. The resulting data item, with the particular piece(s) of information representing confidential information no longer present, also may be used for other purposes (e.g., auditing, logging, reporting, publication, or the like).
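The embodiment above can be sketched as two steps: count how often each piece of information occurs in the items from alternative sources, then flag and highlight the pieces that fall under a sensitivity threshold. The threshold parameter `max_share`, the bracket markers, and the sample error messages are assumptions made for this sketch; the application does not define the sensitivity criteria in this excerpt.

```python
import re

# Assumed tokenizer: words optionally joined by '.', '_' or '-'.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:[._-][A-Za-z0-9]+)*")


def flag_candidates(item, alternative_items, max_share=0.0):
    """Flag tokens of `item` found in at most `max_share` of the alternative items.

    Note: with no alternative items, every token is flagged.
    """
    alt_sets = [{t.lower() for t in TOKEN.findall(a)} for a in alternative_items]
    flagged = set()
    for tok in TOKEN.findall(item):
        hits = sum(tok.lower() in s for s in alt_sets)
        share = hits / len(alt_sets) if alt_sets else 0.0
        if share <= max_share:
            flagged.add(tok)
    return flagged


def highlight(item, flagged, left="[[", right="]]"):
    """Wrap each flagged token in markers so a reviewer can act on it."""
    def mark(match):
        tok = match.group(0)
        return f"{left}{tok}{right}" if tok in flagged else tok
    return TOKEN.sub(mark, item)


# Hypothetical error messages from four different customers.
item = "ERROR: cannot open /data/acme/payroll.db on host db01"
alternatives = [
    "ERROR: cannot open /data/globex/orders.db on host app3",
    "ERROR: cannot open /var/tmp/cache.db on host web9",
    "ERROR: cannot open /data/initech/hr.db on host db01",
]

strict = flag_candidates(item, alternatives)              # never seen elsewhere
loose = flag_candidates(item, alternatives, max_share=0.34)  # seen in at most ~1/3
print(highlight(item, strict))
print(sorted(loose))
```

Raising `max_share` makes the check more sensitive: here the host name, which appears in one of the three alternative messages, is flagged only under the looser setting. A reviewer would then remove, redact, or substitute the highlighted pieces.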

"The foregoing has outlined some of the more pertinent features of the invention. These features should be construed to be merely illustrative. Many other beneficial results can be attained by applying the disclosed invention in a different manner or by modifying the invention as will be described.


"For a more complete understanding of the present invention and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:

"FIG. 1 depicts an exemplary block diagram of a distributed data processing environment in which exemplary aspects of the illustrative embodiments may be implemented;

"FIG. 2 is an exemplary block diagram of a data processing system in which exemplary aspects of the illustrative embodiments may be implemented;

"FIG. 3 illustrates a known data loss prevention (DLP) solution in which the subject matter of this disclosure may be implemented;

"FIG. 4 illustrates a process flow illustrating a method to dynamically segment non-confidential information from a set of confidential data according to this disclosure;

"FIG. 5 illustrates a first use case scenario illustrating a set of debug error messages generated by several different customers and received by a support company;

"FIG. 6 illustrates a first customer's error message annotated with a highlight to identify potential confidential information that has been identified by the technique of this disclosure;

"FIG. 7 illustrates a tech note that the support company has created in the first use case scenario after substituting non-confidential information for the confidential information located by the method;

"FIG. 8 illustrates an error message from a customer in a second use case scenario;

"FIG. 9 illustrates frequency of occurrence data generated as a result of analyzing similar error messages received from the alternative sources in the second use case scenario;

"FIG. 10 illustrates the error message of FIG. 8 annotated to show potential pieces of information that may be confidential information; and

"FIG. 11 illustrates a representative data processing system and workflow that implements the described method."

For additional information on this patent application, see: Thomason, Michael Scott; Arun, Jai S.; Myers, Benjamin L.; Rotermund, Chad C.; Jariwala, Ajit J. Identifying Confidential Data in a Data Item by Comparing the Data Item to Similar Data Items from Alternative Sources. Filed January 2, 2013 and posted July 10, 2014. Patent URL:

Keywords for this news article include: Information Technology, Information and Data Processing, International Business Machines Corporation.

Our reports deliver fact-based news of research and discoveries from around the world. Copyright 2014, NewsRx LLC

Source: Information Technology Newsweekly
