org.apache.nutch.analysis.lang
public class: NGramProfile
java.lang.Object
   org.apache.nutch.analysis.lang.NGramProfile
This class runs an ngram analysis over submitted text; the results might be used for automatic language identification. The similarity calculation is experimental. You have been warned. Methods are provided to build new NGramProfiles.
Nested Class Summary:
class  NGramProfile.NGramEntry  Inner class that describes an NGram 
Field Summary
public static final  Log LOG     
static final  int ABSOLUTE_MIN_NGRAM_LENGTH    The minimum length allowed for an ngram. 
static final  int ABSOLUTE_MAX_NGRAM_LENGTH    The maximum length allowed for an ngram. 
static final  int DEFAULT_MIN_NGRAM_LENGTH    The default min length of an ngram 
static final  int DEFAULT_MAX_NGRAM_LENGTH    The default max length of an ngram 
static final  String FILE_EXTENSION    The ngram profile file extension 
static final  int MAX_SIZE    The profile max size (number of ngrams of the same size) 
static final  char SEPARATOR    separator char 
Constructor:
 public NGramProfile(String name,
    int minlen,
    int maxlen) 
    Construct a new ngram profile
    Parameters:
    name - is the name of the profile
    minlen - is the min length of ngram sequences
    maxlen - is the max length of ngram sequences
Method from org.apache.nutch.analysis.lang.NGramProfile Summary:
add,   add,   analyze,   create,   getName,   getSimilarity,   getSorted,   load,   main,   normalize,   save,   toString
Methods from java.lang.Object:
equals,   getClass,   hashCode,   notify,   notifyAll,   toString,   wait,   wait,   wait
Method from org.apache.nutch.analysis.lang.NGramProfile Detail:
 public  void add(Token t) 
    Add ngrams from a token to this profile
 public  void add(StringBuffer word) 
    Add ngrams from a single word to this profile
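    As a rough illustration of what adding the ngrams of a single word can involve, the sketch below pads a word with a separator character and emits every substring whose length lies between a minimum and a maximum. It is a standalone sketch, not the Nutch implementation: the class NGramExtractSketch, the method ngrams, and the '_' separator value are assumptions (this page documents a SEPARATOR field but not its value).

```java
import java.util.ArrayList;
import java.util.List;

public class NGramExtractSketch {
    // Assumed separator; the actual value of NGramProfile.SEPARATOR is not shown on this page.
    static final char SEP = '_';

    // Returns all ngrams of length minLen..maxLen from the separator-padded word.
    static List<String> ngrams(String word, int minLen, int maxLen) {
        String padded = SEP + word + SEP;
        List<String> out = new ArrayList<>();
        for (int n = minLen; n <= maxLen; n++) {
            // Slide a window of width n across the padded word.
            for (int i = 0; i + n <= padded.length(); i++) {
                out.add(padded.substring(i, i + n));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("ab", 1, 2));
    }
}
```

    Padding with a separator lets the ngrams at the start and end of a word be told apart from those in the middle, which is one plausible reason for the documented SEPARATOR field.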
 public  void analyze(StringBuilder text) 
    Analyze a piece of text
 public static NGramProfile create(String name,
    InputStream is,
    String encoding) 
    Create a new language profile from a (preferably quite large) text file
 public String getName() 
 public float getSimilarity(NGramProfile another) 
    Calculate a score for how well two NGramProfiles match each other
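    This page does not say how the score is computed, and the class description calls the calculation experimental. Purely to make the idea concrete, the sketch below compares two profiles by summing the absolute differences of their ngram frequencies. The class SimilaritySketch, the method distance, and the formula itself are assumptions, not the getSimilarity algorithm. In this sketch a lower value means a closer match; whether getSimilarity follows the same convention is not stated here.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SimilaritySketch {
    // Hypothetical measure: sum of |freqA - freqB| over every ngram seen in either profile.
    // An ngram missing from one profile counts as frequency 0 there.
    static float distance(Map<String, Float> a, Map<String, Float> b) {
        Set<String> keys = new HashSet<>(a.keySet());
        keys.addAll(b.keySet());
        float sum = 0f;
        for (String k : keys) {
            sum += Math.abs(a.getOrDefault(k, 0f) - b.getOrDefault(k, 0f));
        }
        return sum;
    }
}
```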
 public List getSorted() 
    Return a sorted list of ngrams (sort done by 1. frequency 2. sequence)
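    The two-level ordering described above can be sketched with a plain comparator. This is a standalone sketch rather than the Nutch code: the class SortedNGramsSketch and the method sortedKeys are hypothetical, and the sort directions (highest frequency first, then ascending character sequence) are assumptions, since this page names only the sort keys.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SortedNGramsSketch {
    // Orders ngrams by count (assumed descending), breaking ties by the ngram text (assumed ascending).
    static List<String> sortedKeys(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((x, y) -> {
            int byCount = Integer.compare(y.getValue(), x.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            keys.add(e.getKey());
        }
        return keys;
    }
}
```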
 public  void load(InputStream is) throws IOException 
    Loads an ngram profile from an InputStream (assumes UTF-8 encoded content)
 public static  void main(String[] args) 
    main method used for testing only
 protected  void normalize() 
    Normalize the profile (calculates the ngram frequencies)
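    One way to turn raw ngram counts into frequencies is to divide each count by the total count of all ngrams of the same length. The sketch below does that. It is an assumption about what normalization could look like, not the body of normalize(): the class NormalizeSketch and the per-length grouping are hypothetical, the latter suggested only by the MAX_SIZE description ("number of ngrams of the same size").

```java
import java.util.HashMap;
import java.util.Map;

public class NormalizeSketch {
    // Converts counts to relative frequencies within each ngram length (assumed grouping).
    static Map<String, Float> normalize(Map<String, Integer> counts) {
        // First pass: total count per ngram length.
        Map<Integer, Integer> totalByLength = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            totalByLength.merge(e.getKey().length(), e.getValue(), Integer::sum);
        }
        // Second pass: divide each count by the total for its length.
        Map<String, Float> freq = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int total = totalByLength.get(e.getKey().length());
            freq.put(e.getKey(), e.getValue() / (float) total);
        }
        return freq;
    }
}
```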
 public  void save(OutputStream os) throws IOException 
    Writes the NGramProfile content to an OutputStream using UTF-8 encoding
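    Both load and save are documented as working with UTF-8 streams. The sketch below shows a UTF-8 round trip using only standard java.io classes. The one-entry-per-line "ngram count" layout, the class ProfileIoSketch, and its methods are hypothetical; the real profile file format is not described on this page.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class ProfileIoSketch {
    // Writes one "ngram count" line per entry (hypothetical format), encoded as UTF-8.
    static void save(Map<String, Integer> counts, OutputStream os) throws IOException {
        Writer w = new BufferedWriter(new OutputStreamWriter(os, StandardCharsets.UTF_8));
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            w.write(e.getKey() + " " + e.getValue() + "\n");
        }
        w.flush(); // flush, but leave closing the stream to the caller
    }

    // Reads the same hypothetical format back, decoding the bytes as UTF-8.
    static Map<String, Integer> load(InputStream is) throws IOException {
        BufferedReader r = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
        Map<String, Integer> counts = new LinkedHashMap<>();
        String line;
        while ((line = r.readLine()) != null) {
            int sp = line.lastIndexOf(' ');
            if (sp <= 0) continue; // skip blank or malformed lines
            counts.put(line.substring(0, sp), Integer.parseInt(line.substring(sp + 1)));
        }
        return counts;
    }

    // Convenience helper: save to memory, then load the bytes back.
    static Map<String, Integer> roundTrip(Map<String, Integer> counts) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            save(counts, bos);
            return load(new ByteArrayInputStream(bos.toByteArray()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

    Naming the charset explicitly on both the writer and the reader keeps the result independent of the platform default encoding, which matters for profiles that contain non-ASCII ngrams.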
 public String toString()