org.apache.lucene.index.memory
public class: PatternAnalyzer
java.lang.Object
org.apache.lucene.analysis.Analyzer
org.apache.lucene.index.memory.PatternAnalyzer
Efficient Lucene analyzer/tokenizer that preferably operates on a String rather than a
java.io.Reader, that can flexibly separate text into terms via a regular expression
Pattern (with behaviour identical to String#split(String)),
and that combines the functionality of
org.apache.lucene.analysis.LetterTokenizer,
org.apache.lucene.analysis.LowerCaseTokenizer,
org.apache.lucene.analysis.WhitespaceTokenizer, and
org.apache.lucene.analysis.StopFilter into a single efficient
multi-purpose class.
If you are unsure what exactly a regular expression should look like, consider
prototyping by trying various expressions on some test texts via
String#split(String). Once you are satisfied, give that regex to
PatternAnalyzer. Also see Java Regular Expression Tutorial.
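The prototyping step described above can be tried with the JDK alone. The following is a minimal sketch; the class name SplitPrototype and the helper terms are made up for illustration and are not part of Lucene.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class SplitPrototype {
    // Hypothetical helper: splits text with a candidate regex, yielding the
    // same terms that text.split(regex) would.
    static List<String> terms(String regex, String text) {
        return Arrays.asList(Pattern.compile(regex).split(text));
    }

    public static void main(String[] args) {
        // Try a few candidate expressions on sample text before settling on one.
        System.out.println(terms("\\s+", "James is running round in the woods"));
        System.out.println(terms("\\W+", "Hello, world: regex-based splitting"));
    }
}
```

Once a candidate expression produces the desired terms, the compiled Pattern is what would be handed to the PatternAnalyzer constructor.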
This class can be considerably faster than the "normal" Lucene tokenizers.
It can also serve as a building block in a compound Lucene
org.apache.lucene.analysis.TokenFilter chain, as in this
stemming example:
PatternAnalyzer pat = ...
TokenStream tokenStream = new SnowballFilter(
    pat.tokenStream("content", "James is running round in the woods"),
    "English");
- author: whoschek.AT.lbl.DOT.gov
| Nested Class Summary: |
|---|
| static final class | PatternAnalyzer.FastStringReader | A StringReader that exposes its contained string for fast direct access. Might make sense to generalize this to CharSequence and make it public? |
| Field Summary |
|---|
| public static final Pattern | NON_WORD_PATTERN | "\\W+"; Divides text at non-letters (NOT Character.isLetter(c)) |
| public static final Pattern | WHITESPACE_PATTERN | "\\s+"; Divides text at whitespace (Character.isWhitespace(c)) |
| public static final PatternAnalyzer | DEFAULT_ANALYZER | A lower-casing word analyzer with English stop words (can be shared
freely across threads without harm); global per class loader. |
| public static final PatternAnalyzer | EXTENDED_ANALYZER | A lower-casing word analyzer with extended English stop words
(can be shared freely across threads without harm); global per class
loader. The stop words are borrowed from
http://thomas.loc.gov/home/stopwords.html, see
http://thomas.loc.gov/home/all.about.inquery.html |
| Constructor: |
public PatternAnalyzer(Pattern pattern,
        boolean toLowerCase,
        Set stopWords) {
    if (pattern == null)
        throw new IllegalArgumentException("pattern must not be null");
    if (eqPattern(NON_WORD_PATTERN, pattern)) pattern = NON_WORD_PATTERN;
    else if (eqPattern(WHITESPACE_PATTERN, pattern)) pattern = WHITESPACE_PATTERN;
    if (stopWords != null && stopWords.size() == 0) stopWords = null;
    this.pattern = pattern;
    this.toLowerCase = toLowerCase;
    this.stopWords = stopWords;
}
Constructs a new instance with the given parameters.
Parameters:
pattern - a regular expression delimiting tokens
toLowerCase - if true, returns tokens after applying String.toLowerCase()
stopWords - if non-null, ignores all tokens that are contained in the
given stop set (after toLowerCase() has been applied, if applicable).
The set can be created, for example, via StopFilter#makeStopSet(String[])
and/or org.apache.lucene.analysis.WordlistLoader, as in
WordlistLoader.getWordSet(new File("samples/fulltext/stopwords.txt")),
or from other stop word lists.
|
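To make the interplay of the three constructor parameters concrete, here is a JDK-only sketch that mimics their documented effect. It is not the Lucene implementation: the class ConstructorParamsSketch and the method analyze are hypothetical. Skipping empty terms and using Locale.ROOT are choices of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

class ConstructorParamsSketch {
    // Hypothetical helper, not part of Lucene: approximates the documented
    // effect of (pattern, toLowerCase, stopWords) on a piece of text.
    static List<String> analyze(Pattern pattern, boolean toLowerCase,
                                Set<String> stopWords, String text) {
        List<String> terms = new ArrayList<>();
        for (String term : pattern.split(text)) {
            if (term.isEmpty()) continue; // sketch's choice: drop an empty leading term
            // Locale.ROOT keeps the sketch deterministic; the docs name String.toLowerCase().
            if (toLowerCase) term = term.toLowerCase(Locale.ROOT);
            // Stop words are checked after lower-casing, as the parameter docs describe.
            if (stopWords != null && stopWords.contains(term)) continue;
            terms.add(term);
        }
        return terms;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("is", "in", "the"));
        System.out.println(analyze(Pattern.compile("\\W+"), true, stop,
                "James is running round in the woods"));
    }
}
```

Because the stop set is consulted after lower-casing, a stop set of lower-case words only catches capitalised tokens when toLowerCase is true.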
| Method from org.apache.lucene.index.memory.PatternAnalyzer Detail: |
public boolean equals(Object other) {
    if (this == other) return true;
    if (this == DEFAULT_ANALYZER && other == EXTENDED_ANALYZER) return false;
    if (other == DEFAULT_ANALYZER && this == EXTENDED_ANALYZER) return false;
    if (other instanceof PatternAnalyzer) {
        PatternAnalyzer p2 = (PatternAnalyzer) other;
        return
            toLowerCase == p2.toLowerCase &&
            eqPattern(pattern, p2.pattern) &&
            eq(stopWords, p2.stopWords);
    }
    return false;
}
Indicates whether some other object is "equal to" this one. |
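equals relies on an eqPattern helper because java.util.regex.Pattern does not override Object.equals, so two separately compiled but identical patterns are not equal to each other. The body of eqPattern is not shown on this page. The sketch below assumes it compares the regex string and the flags, which matches the fields that hashCode uses.

```java
import java.util.regex.Pattern;

class PatternEquality {
    // Assumed shape of an eqPattern-style helper; the real body is not shown here.
    static boolean eqPattern(Pattern p1, Pattern p2) {
        return p1 == p2
            || (p1.flags() == p2.flags() && p1.pattern().equals(p2.pattern()));
    }

    public static void main(String[] args) {
        Pattern a = Pattern.compile("\\W+");
        Pattern b = Pattern.compile("\\W+");
        System.out.println(a.equals(b));    // identity comparison only
        System.out.println(eqPattern(a, b)); // compares regex string and flags
    }
}
```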
public int hashCode() {
    if (this == DEFAULT_ANALYZER) return -1218418418; // fast path
    if (this == EXTENDED_ANALYZER) return 1303507063; // fast path
    int h = 1;
    h = 31*h + pattern.pattern().hashCode();
    h = 31*h + pattern.flags();
    h = 31*h + (toLowerCase ? 1231 : 1237);
    h = 31*h + (stopWords != null ? stopWords.hashCode() : 0);
    return h;
}
Returns a hash code value for the object. |
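The constants 1231 and 1237 in hashCode are the values the JDK itself uses for Boolean.hashCode(true) and Boolean.hashCode(false). The following standalone sketch restates the same recipe outside the class; HashRecipe and hash are illustrative names, and the two fast paths for the shared analyzers are omitted.

```java
import java.util.Set;
import java.util.regex.Pattern;

class HashRecipe {
    // Illustrative restatement of the 31-multiplier recipe shown above.
    static int hash(Pattern pattern, boolean toLowerCase, Set<String> stopWords) {
        int h = 1;
        h = 31 * h + pattern.pattern().hashCode();
        h = 31 * h + pattern.flags();
        h = 31 * h + Boolean.hashCode(toLowerCase); // 1231 for true, 1237 for false
        h = 31 * h + (stopWords != null ? stopWords.hashCode() : 0);
        return h;
    }

    public static void main(String[] args) {
        // Two separately compiled but identical patterns hash the same way.
        System.out.println(hash(Pattern.compile("\\s+"), true, null)
                == hash(Pattern.compile("\\s+"), true, null));
    }
}
```

Hashing the regex string and flags rather than the Pattern object keeps hashCode consistent with an equals that treats identically compiled patterns as equal.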
public TokenStream tokenStream(String fieldName,
        String text) {
    // Ideally the Analyzer superclass should have a method with the same signature,
    // with a default impl that simply delegates to the StringReader flavour.
    if (text == null)
        throw new IllegalArgumentException("text must not be null");
    TokenStream stream;
    if (pattern == NON_WORD_PATTERN) { // fast path
        stream = new FastStringTokenizer(text, true, toLowerCase, stopWords);
    }
    else if (pattern == WHITESPACE_PATTERN) { // fast path
        stream = new FastStringTokenizer(text, false, toLowerCase, stopWords);
    }
    else {
        stream = new PatternTokenizer(text, pattern, toLowerCase);
        if (stopWords != null) stream = new StopFilter(stream, stopWords);
    }
    return stream;
}
Creates a token stream that tokenizes the given string into token terms
(aka words). |
public TokenStream tokenStream(String fieldName,
        Reader reader) {
    if (reader instanceof FastStringReader) { // fast path
        return tokenStream(fieldName, ((FastStringReader)reader).getString());
    }
    try {
        String text = toString(reader);
        return tokenStream(fieldName, text);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}
Creates a token stream that tokenizes all the text in the given Reader.
This implementation forwards to tokenStream(String, String) and is
less efficient than calling tokenStream(String, String) directly. |
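The Reader variant is slower because the whole stream has to be copied into a String before tokenizing can start. The toString(reader) helper it calls is not shown on this page. The sketch below is a hypothetical stand-in (ReaderDrain.readAll) that illustrates that copying step. Closing the reader and wrapping the IOException in an unchecked exception are assumptions of this sketch.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReaderDrain {
    // Hypothetical stand-in for a toString(reader)-style helper.
    static String readAll(Reader reader) {
        try (Reader in = reader) { // sketch's choice: close the reader when done
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n); // copy each chunk into the growing buffer
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readAll(new StringReader("James is running round in the woods")));
    }
}
```

When the text is already available as a String, calling tokenStream(String, String) directly avoids this copy.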