"In another embodiment a system for generating non-standard tokens from standard tokens has been developed. The system includes a memory, the memory storing a plurality of standard tokens and a plurality of operational parameters for a random field model and a processing module operatively connected to the memory. The processing module is configured to obtain the operational parameters for the random field model from the memory, generate the random field model from the operational parameters, select a standard token from the plurality of standard tokens in the memory, the selected standard token having a plurality of input characters, select an operation from a plurality of predetermined operations in accordance with the random field model for each input character in the plurality of input characters for the selected standard token, perform the selected operation on each input character in the selected standard token to generate an output token that is different from each standard token in the plurality of standard tokens, and store the output token in the memory in association with the selected standard token.
"In another embodiment, a method for selection of training data for generation of a statistical model has been developed. The method includes identifying a plurality of occurrences of a non-standard token in a text corpus stored in a memory, identifying a first plurality of tokens in the text corpus that are located proximate to at least one of the occurrences of the non-standard token, identifying a plurality of occurrences of a candidate standard token in the text corpus, identifying a second plurality of tokens in the text corpus that are located proximate to at least one of the occurrences of the candidate standard token, identifying a contextual similarity between the first plurality of tokens and the second plurality of tokens, generating a statistical model for correction of non-standard tokens with the non-standard token in association with the standard token for generation of a statistical model only in response to the identified contextual similarity being greater than a predetermined threshold, and storing the generated statistical model in the memory for use in identification of another standard token that corresponds to another non-standard token identified in text data that are not included in the text corpus.
"In another embodiment, a method for identification of a standard token in a dictionary that corresponds to a non-standard token has been developed. The method includes identifying a candidate token in a plurality of standard tokens stored in a memory, identifying a longest common sequence (LCS) of features in the candidate token corresponding to at least one feature in the candidate token that is present in the non-standard token, identifying a number of features in the LCS, identifying a frequency of the candidate token in a text corpus stored in a memory, identifying a similarity score between the non-standard token and the standard token with reference to a ratio of the identified number of features in the LCS to a total number of features in the non-standard token multiplied by a logarithm of the identified frequency of the candidate token, and presenting with a user interface device the standard candidate token to a user in replacement of the non-standard token or in association with the non-standard token in response to the identified similarity score exceeding a predetermined threshold.