Word Recognition

When analyzing unstructured or dirty data, it is sometimes hard to tell actual words apart from strings that are not words (garbage, tags, typos, …). In such cases, an automated method for separating actual words from garbage is of great help. While several approaches exist, here are two methods that can be implemented at no cost with open source software:

  • statistical method: Google Hit Counts
  • lexicon-based method: WordNet Synsets

 

Google Hit Counts

This method rests on the principle that an existing word is more likely to appear on the WWW than a non-existent one. A simple count of Google’s hits for a given string therefore provides a measure of confidence that the string is an actual word. In other words, for some chosen threshold Q:

if ( hit_count('cat') > Q ) then 'cat' is a word

A few examples:

Google Hit Count for rainfall
Google Hit Count for asdasd
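
The threshold rule above can be sketched in Java as follows. This is only a minimal illustration under stated assumptions: the class name HitCountRecognizer and the pluggable hit-count source are hypothetical, and no real search API is called. How the counts are actually obtained from Google is left to the caller.

```java
import java.util.function.ToLongFunction;

/** Threshold rule from the text: a string is treated as a word when its hit count exceeds Q. */
final class HitCountRecognizer {
    // Hypothetical hit-count source; in practice this would wrap a search query.
    private final ToLongFunction<String> hitCount;
    // Q in the text; its value has to be chosen empirically.
    private final long threshold;

    public HitCountRecognizer(ToLongFunction<String> hitCount, long threshold) {
        this.hitCount = hitCount;
        this.threshold = threshold;
    }

    /** Returns true when the reported hit count is strictly greater than the threshold. */
    public boolean isWord(String term) {
        return hitCount.applyAsLong(term) > threshold;
    }
}
```

For example, new HitCountRecognizer(source, 100000L).isWord("rainfall") returns true whenever the supplied source reports more than 100000 hits for "rainfall". The threshold value here is arbitrary and would need tuning on real data.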

This method has several advantages: it is free, easy, and lightweight. Above all, the Google Hit Count method can recognize words in common usage that might not be present in dictionaries:

Google Hit Count for www

The downside of this method is its dependency on Google and on the system’s Internet connection. For that reason, the Google Hit Count method is not suitable for large-scale batch processing.

 

WordNet Synset

WordNet (WN) is a digital lexicon. It is commonly used to identify the relationships between terms or the nature of terms in context. What is of interest for word recognition is the ability to request all synsets for a given term. Synsets are WordNet’s representation of a “semantic sense”. Requesting the synsets of a term is equivalent to looking up the term’s definitions in WN. If no such definition exists, chances are that the term is not a word.

Sample Java WN usage

import edu.smu.tspell.wordnet.Synset;
import edu.smu.tspell.wordnet.WordNetDatabase;
 
...
 
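// Returns true when WordNet holds at least one synset for the term.
// JAWS locates the WordNet files through the "wordnet.database.dir" system property.
public boolean isWNword(String term){
   WordNetDatabase wndatabase = WordNetDatabase.getFileInstance();
   Synset[] syns = wndatabase.getSynsets(term);
   return syns.length > 0;
}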

This method, while more complex than the Google Hit Count one, is much faster. The downside is that only words within the strict WN definition will be captured: any non-formal literals (such as Big-Blue, …) will be discarded.
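
Since the two methods have complementary weaknesses, one possible way to use them together is to try the lexicon first and query the hit count only when the lexicon has no entry. The sketch below is an assumption, not something prescribed by either tool: the class name CombinedRecognizer is hypothetical, and both checks are passed in as plain predicates so that any implementation (a WordNet lookup, a hit-count threshold) can be plugged in.

```java
import java.util.function.Predicate;

/** Tries the lexicon check first and falls back to the hit-count check only on a miss. */
final class CombinedRecognizer {
    // Hypothetical fast, offline check, e.g. a WordNet synset lookup.
    private final Predicate<String> lexiconCheck;
    // Hypothetical slower, network-bound check, e.g. a hit-count threshold test.
    private final Predicate<String> hitCountCheck;

    public CombinedRecognizer(Predicate<String> lexiconCheck, Predicate<String> hitCountCheck) {
        this.lexiconCheck = lexiconCheck;
        this.hitCountCheck = hitCountCheck;
    }

    /** The short-circuit || ensures the hit-count check runs only when the lexicon misses. */
    public boolean isWord(String term) {
        return lexiconCheck.test(term) || hitCountCheck.test(term);
    }
}
```

With this ordering, terms known to the lexicon never trigger a network request, which limits the dependency on the Internet connection to the terms the lexicon does not cover.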
