Patent Application Titled "Method, Apparatus, System and Storage Medium Having Computer Executable Instrutions for Determination of a Measure of Similarity and Processing of Documents" Published Onlin
The assignee for this patent application is
Reporters obtained the following quote from the background information supplied by the inventors: "Field of the Invention
"The invention relates to the determination of a measure of similarity between two documents and to processing of documents on the basis of a measure of similarity.
"Different text recognition (also referred to as optical character recognition (OCR)) methods which can be used to recognize text inside images in an automated manner are known. The images are, for example, electronically scanned documents, the content of which is intended to be analyzed further.
"The documents may be electronic documents, for example electronically processed, preprocessed or processable documents. The approach can be used, for example, in applications relating to document management or document archiving, for example of business documents, but can also be used for other types of data extraction, for example extraction of information from photographed till receipts and other small documents.
"In document management, index data relating to a document, for example sender, recipient, invoice number or invoice amount, play a central role. A document management system provides, for example, search functions using index data or archives a document using its index data.
"Index data extraction (also referred to as 'extraction') denotes automatic determination of index data relating to a document. In addition to rule-based methods, use is also made of learning methods which determine the index data relating to a document using similar documents (so-called training documents) whose index data have already been confirmed or corrected by a user.
"A measure of similarity for comparing documents is known. Distance determination methods (Euclidean distance, vector space models and probabilistic methods) are thus applied to the problem of determining the distance between documents. An overview of the different methods is found, for example, in an article by A. Huang, entitled 'Similarity Measures for Text Document Clustering' edited by
"An article by
"An article by
"An article by
"However, the known approaches have disadvantages if the determination of the similarity of documents whose text and layout need to be considered is involved."
In addition to obtaining background information on this patent application, VerticalNews editors also obtained the inventors' summary information for this patent application: "The object of the invention is to avoid the abovementioned disadvantages and to specify, in particular, an efficient solution for determining the similarity between electronic documents and to provide possibilities for processing documents which use a similarity between documents which is determined in this manner.
"In order to achieve the object, a method for determining a measure of similarity between a first document and a second document is proposed, in which a vector space model which takes into account word frequencies and coordinates is determined for the first document and for the second document, in which a measure of the similarity between the first document and the second document is determined using the vector space model.
"The present approach has the advantage that the text and the layout of the documents to be compared are taken into account for the purpose of determining the similarity. An additional advantage is that, in addition to the similarity of the documents, the similarity of the index data relating to the documents can also be taken into account. It is therefore possible, for example, to quickly identify a document which has been erroneously or deliberately provided with incorrect index data by a user.
"The present solution allows a suitable measure of the similarity between two documents to be determined, for example a function which assigns a value of between 0 and 1 to each tuple of two documents. In this case, this value is higher, the more similar the two documents are with respect to content (i.e. vocabulary) and layout and assume the value 1, for example, when the two documents are identical.
"One development is that the coordinates of those words which occur together in both documents are taken into account.
"Another development is that the vector space model is determined by determining a first vector for the first document and a second vector for a second document.
"One development is, in particular, that the measure of the similarity is determined by determining a cosine between the first vector and the second vector.
"A development is also that a respective word vector is determined for the first document and for the second document. Elements of the word vectors indicate whether or not a word occurs in the respective document a word distance between the documents is determined. A respective coordinate vector is determined for the first document and for the second document. Elements of the word vectors indicating coordinates for words which occur together in the two documents. A coordinate distance between the documents is determined, and a total distance is determined on the basis of the word distance and the coordinate distance.
"For example, an element '1' denotes that the word occurs in the respective document (an element '0' accordingly denotes that the word does not occur and an element '4' denotes, for example, that the word occurs four times); the position of the element inside the word vector is linked to a particular word in this case. The coordinate vector contains, for example for each jointly occurring word in each document, two entries, for example for x and y coordinates within the respective document.
"One development involves determining the word distance using a cosine between the word vectors.
"One development also involves determining the coordinate distance using a cosine between the coordinate vectors.
"A next development involves determining the total distance according to
"where s denotes the word distance, t denotes the coordinate distance and p denotes a predefinable parameter.
"One refinement is that words occurring repeatedly in both documents are compared with one another in the coordinate vector according to one of the following mechanisms in accordance with their occurrence, using an assignment method in which those words for which the sum of the distances between the compared pairs is as small as possible are compared, using an assignment method in which those words for which the sum of the distances between the compared pairs is as large as possible are compared.
"In this case, the comparison denotes the use of identical positions inside the two vectors.
"The above object is also achieved by a method for processing an electronic document, in which a super ordinate database for extracting information is adapted on the basis of an electronic document if no documents which are sufficiently similar to the electronic document are present in the super ordinate database, the similarity between the electronic document and documents present in the super ordinate data bank being determined in accordance with the abovementioned method.
"This approach can be used repeatedly for a plurality of levels of super ordinate model spaces (model space corresponds to the abovementioned database here).
"In this case, it is advantageous that it is possible to interchange document information between individual users as a result of the cross-organizational approach.
"In the case of organization-based or company-based document management, users (for example companies) (also) provide a super ordinate model space (also referred to as a super ordinate database) or a multilevel hierarchy containing such super ordinate model spaces, for example, with their documents which have already been provided with correct index data. If another user now carries out extraction for a document, similar documents from the super ordinate model spaces can be used to determine the index data.
"In this case, the super ordinate model spaces can be used in different ways.
"First of all, the question arises of which documents from a user are intended to be supplied to the super ordinate model spaces up to which level of the hierarchy. On the one hand, it is desirable to provide only a small number of documents in terms of efficient storage space use. On the other hand, a large number of provided documents increases the likelihood of a current document being successfully indexed (that is to say of index data extraction for the current document being successful) by virtue of a sufficient number of similar documents being able to be provided.
"A set of documents which is as small as possible, but where the total set represents the documents of all users to be processed as well as possible with regard to their similarity, is therefore sought.
"An alternative embodiment involves adapting the super ordinate database by adding the electronic document or features of the electronic document to the super ordinate database.
"For example, index data or other data characteristic of the document can be added to the super ordinate database.
"A method for processing an electronic document is also proposed, in which a super ordinate database is used to extract information relating to the document, only those documents in the super ordinate database which have a predefined similarity to the electronic document being used, the similarity between the electronic document and documents present in the super ordinate data bank being determined in accordance with the method explained here.
"A next refinement is that the predefined similarity is determined by a threshold value comparison with a predefined minimum measure of similarity.
"A refinement is also that the super ordinate database is used to extract information relating to the document if the super ordinate database has more similar documents than a local database.
"The local database may be a local model space, in particular in the form of a data bank. The local database and the super ordinate database may contain already classified documents, document types, items of feedback from the user, data fields, values for data fields, etc.
"The super ordinate database may be a database of a further physical or logical unit which may be separate from a first unit containing the local database.
"In particular, it is possible to provide a plurality of super ordinate databases which are hierarchically arranged; accordingly, the present proposal can be carried out several times in succession in order to obtain a sufficiently good extraction result for the document across a plurality of hierarchical levels.
"A particular advantage of the solution presented is that the local database is used in a first step and the material (documents, classifications, fields, values, coordinates, etc.) already present locally is therefore used to produce the best possible classification result; this can be expected, in particular, for those document types which have already been extracted often and for which extensive extraction knowledge is accordingly stored in the local database. If no sufficient extraction knowledge is found locally, the escalation in the super ordinate database uses the information which is available there and possibly comes from a different organizational structure and/or from a different extraction service.
"The present solution makes it possible for a current user to benefit, in particular, from extraction results which have already been carried out, for example caused or carried out by other users or processes, by virtue of the extraction results being improved or only just enabled for the current user thereby.
"For example, extraction services in electronic documents (data extraction services and/or model spaces with training documents which are managed by the data extraction services) can be interconnected in a freely definable hierarchy, in particular without the current user being able to draw conclusions on the contents of the documents belonging to the other users. The confidentiality of the contents is therefore ensured and the extraction results which have already been carried out can nevertheless be used.
"The abovementioned object is also achieved by an apparatus for determining a measure of similarity between a first document and a second document, having a processing unit which is set up in such a manner that in which a vector space model which takes into account word frequencies and coordinates can be determined for the first document and for the second document, and in which a measure of the similarity between the first document and the second document can be determined using the vector space model.
"The object is also achieved by an apparatus for processing an electronic document, having a processing unit which is set up in such a manner that the steps of the method described herein can be carried out.
"The processing unit mentioned here may be, in particular, in the form of a processor unit, a computer or a distributed system of processor units or computers. In particular, the processing unit may have computers which are connected to one another via a network connection, for example via the Internet.
"The database may be or contains a data bank or a data bank management system.
"In particular, the processing unit may be or contains any type of processor or computer with accordingly required peripherals (memory, input/output interfaces, input/output devices, etc.).
"The above explanations relating to the method accordingly apply to the apparatus. The apparatus may be in one component or distributed in a plurality of components.
"One refinement is that the apparatus contains the local database and/or the super ordinate database.
"The abovementioned object is also achieved by a system containing at least one of the apparatuses described here.
"The solution presented here also contains a computer program product which can be loaded directly into a memory of a digital computer, containing program code parts which are suitable for carrying out steps of the method described here.
"The abovementioned problem is also solved by a non-transitory computer-readable storage medium, for example any desired memory, containing instructions (for example in the form of program code) which can be executed by a computer and are suitable for the computer to carry out steps of the method described here.
"The above-described properties, features and advantages of this invention and the manner in which they are achieved become more clearly and distinctly comprehensible in connection with the following schematic description of exemplary embodiments which are explained in more detail in connection with the drawings. For the sake of clarity, in this case, identical or identically acting elements can be provided with the identical reference symbols.
"Other features which are considered as characteristic for the invention are set forth in the appended claims.
"Although the invention is illustrated and described herein as embodied in a determination of a measure of similarity and processing of documents, it is nevertheless not intended to be limited to the details shown, since various modifications and structural changes may be made therein without departing from the spirit of the invention and within the scope and range of equivalents of the claims.
"The construction and method of operation of the invention, however, together with additional objects and advantages thereof will be best understood from the following description of specific embodiments when read in connection with the accompanying drawings.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWING
"FIG. 1 is a schematic illustration of a propagation strategy of documents across model spaces;
"FIG. 2 is a schematic image of an invoice as an exemplary document with blocks, coordinates and recognized words;
"FIG. 3 is a schematic image of an invoice, which is similar but alternative to FIG. 2, with blocks, coordinates and recognized words; and
"FIG. 4 is a schematic image of a cover letter with blocks, coordinates and recognized words."
For more information, see this patent application: HOFMEIER, Andreas; WEIDLING, Christoph; BERGER, Michael. Method, Apparatus, System and Storage Medium Having Computer Executable Instrutions for Determination of a Measure of Similarity and Processing of Documents. Filed
Our reports deliver fact-based news of research and discoveries from around the world. Copyright 2014, NewsRx LLC