lucene-2.3.2-src » org.apache » lucene » index » memory
org.apache.lucene.index.memory
public class: PatternAnalyzer
java.lang.Object
   org.apache.lucene.analysis.Analyzer
      org.apache.lucene.index.memory.PatternAnalyzer
Efficient Lucene analyzer/tokenizer that preferably operates on a String rather than a java.io.Reader, that flexibly separates text into terms via a regular expression Pattern (with behaviour identical to String#split(String)), and that combines the functionality of org.apache.lucene.analysis.LetterTokenizer, org.apache.lucene.analysis.LowerCaseTokenizer, org.apache.lucene.analysis.WhitespaceTokenizer and org.apache.lucene.analysis.StopFilter in a single efficient multi-purpose class.

If you are unsure what exactly a regular expression should look like, consider prototyping by trying various expressions on some test texts via String#split(String). Once you are satisfied, give that regex to PatternAnalyzer. Also see the Java Regular Expression Tutorial.
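The prototyping step suggested above can be tried with nothing but the JDK. The following is a minimal sketch; the class name SplitPrototype, the helper tokens and the sample regexes are illustrative assumptions, not part of Lucene.

```java
import java.util.Arrays;

// Hypothetical helper for trying out candidate regexes before
// handing one of them to PatternAnalyzer.
class SplitPrototype {
    // Splits the text exactly as String#split(String) does.
    static String[] tokens(String regex, String text) {
        return text.split(regex);
    }

    public static void main(String[] args) {
        // Candidate 1: split on runs of whitespace.
        System.out.println(Arrays.toString(
            tokens("\\s+", "James is running round in the woods")));
        // Candidate 2: split on a comma or semicolon plus optional spaces.
        System.out.println(Arrays.toString(
            tokens("[,;]\\s*", "red, green;blue")));
    }
}
```

Once a regex yields the desired terms, compile it with Pattern.compile and pass the resulting Pattern to the PatternAnalyzer constructor.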

This class can be considerably faster than the "normal" Lucene tokenizers. It can also serve as a building block in a compound Lucene org.apache.lucene.analysis.TokenFilter chain, as in this stemming example:

PatternAnalyzer pat = ...
TokenStream tokenStream = new SnowballFilter(
    pat.tokenStream("content", "James is running round in the woods"),
    "English");
Nested Class Summary:
static final class  PatternAnalyzer.FastStringReader  A StringReader that exposes its contained string for fast direct access. Might make sense to generalize this to CharSequence and make it public?
Field Summary
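public static final  Pattern NON_WORD_PATTERN    "\\W+"; Divides text at non-word characters (the regex class \W; roughly NOT Character.isLetter(c), although digits and underscores also count as word characters)
public static final  Pattern WHITESPACE_PATTERN    "\\s+"; Divides text at whitespace (the regex class \s; roughly Character.isWhitespace(c))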
public static final  PatternAnalyzer DEFAULT_ANALYZER    A lower-casing word analyzer with English stop words (can be shared freely across threads without harm); global per class loader. 
public static final  PatternAnalyzer EXTENDED_ANALYZER    A lower-casing word analyzer with extended English stop words (can be shared freely across threads without harm); global per class loader. The stop words are borrowed from http://thomas.loc.gov/home/stopwords.html, see http://thomas.loc.gov/home/all.about.inquery.html 
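To see how the two predefined patterns differ, the sketch below applies the same regexes to one sample sentence using only java.util.regex. The class PatternComparison and its constants are local stand-ins that assume the regex strings listed in the field summary; they are not the Lucene fields themselves.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Local stand-ins for the two patterns, built from the regexes
// shown in the field summary (assumption: "\\W+" and "\\s+").
class PatternComparison {
    static final Pattern NON_WORD = Pattern.compile("\\W+");
    static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Pattern#split(CharSequence) behaves like String#split(String).
    static String[] split(Pattern pattern, String text) {
        return pattern.split(text);
    }

    public static void main(String[] args) {
        String text = "Don't panic: it's 42!";
        // Punctuation separates terms; digits are kept as word characters.
        System.out.println(Arrays.toString(split(NON_WORD, text)));
        // Only whitespace separates terms; punctuation stays attached.
        System.out.println(Arrays.toString(split(WHITESPACE, text)));
    }
}
```

With this input the non-word pattern yields Don, t, panic, it, s, 42, while the whitespace pattern yields Don't, panic:, it's, 42!.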
Constructor:
 public PatternAnalyzer(Pattern pattern,
    boolean toLowerCase,
    Set stopWords) 
    Constructs a new instance with the given parameters.
    Parameters:
    pattern - a regular expression delimiting tokens
    toLowerCase - if true returns tokens after applying String.toLowerCase()
    stopWords - if non-null, ignores all tokens that are contained in the given stop set (after previously having applied toLowerCase() if applicable). For example, created via StopFilter#makeStopSet(String[]) and/or org.apache.lucene.analysis.WordlistLoader, as in WordlistLoader.getWordSet(new File("samples/fulltext/stopwords.txt")), or from other stop word lists.
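The way the three constructor parameters interact can be illustrated without Lucene. The following is a rough sketch of the described order of operations (split, then lower-case, then stop-word check) using only the JDK; it is not the class's actual implementation. The class PatternAnalysisSketch, the method analyze, the use of Locale.ROOT and the skipping of empty strings are all assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical stand-in mirroring the three constructor parameters.
class PatternAnalysisSketch {
    static List<String> analyze(Pattern pattern, boolean toLowerCase,
                                Set<String> stopWords, String text) {
        List<String> terms = new ArrayList<String>();
        for (String token : pattern.split(text)) {
            // split can yield a leading empty string; this sketch drops it.
            if (token.isEmpty()) continue;
            // Lower-casing happens before the stop-word lookup.
            if (toLowerCase) token = token.toLowerCase(Locale.ROOT);
            // A null stop set means no filtering at all.
            if (stopWords != null && stopWords.contains(token)) continue;
            terms.add(token);
        }
        return terms;
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<String>(Arrays.asList("is", "in", "the"));
        System.out.println(analyze(Pattern.compile("\\W+"), true, stops,
            "James is running round in the woods"));
    }
}
```

Because the lookup runs after lower-casing, a stop set used with toLowerCase set to true should contain lower-case entries.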
Method from org.apache.lucene.index.memory.PatternAnalyzer Summary:
equals,   hashCode,   tokenStream,   tokenStream
Methods from org.apache.lucene.analysis.Analyzer:
getPositionIncrementGap,   getPreviousTokenStream,   reusableTokenStream,   setPreviousTokenStream,   tokenStream
Methods from java.lang.Object:
equals,   getClass,   hashCode,   notify,   notifyAll,   toString,   wait,   wait,   wait
Method from org.apache.lucene.index.memory.PatternAnalyzer Detail:
 public boolean equals(Object other) 
    Indicates whether some other object is "equal to" this one.
 public int hashCode() 
    Returns a hash code value for the object.
 public TokenStream tokenStream(String fieldName,
    String text) 
    Creates a token stream that tokenizes the given string into token terms (aka words).
 public TokenStream tokenStream(String fieldName,
    Reader reader) 
    Creates a token stream that tokenizes all the text in the given Reader. This implementation forwards to tokenStream(String, String) and is less efficient than calling tokenStream(String, String) directly.
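The extra cost of the Reader variant comes from having to materialize the whole text as a String before tokenizing. The sketch below shows one plausible shape of that forwarding step, together with a StringReader subclass that exposes its string in the spirit of the nested FastStringReader. All names here (ReaderForwardingSketch, ExposingStringReader, readAll) are hypothetical and the code is not taken from the Lucene sources.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReaderForwardingSketch {
    // Hypothetical analogue of FastStringReader: remembers its string.
    static final class ExposingStringReader extends StringReader {
        private final String s;
        ExposingStringReader(String s) { super(s); this.s = s; }
        String getString() { return s; }
    }

    // Returns the reader's full text so it can be passed to a
    // String-based tokenizing method.
    static String readAll(Reader reader) {
        if (reader instanceof ExposingStringReader) {
            // Fast path: no copying, the string is already available.
            return ((ExposingStringReader) reader).getString();
        }
        try {
            // Slow path: drain the reader into a buffer.
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Callers that already hold the text as a String avoid the copy entirely by calling the String overload.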