org.apache.lucene.analysis.standard
public class: StandardTokenizer
java.lang.Object
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.analysis.standard.StandardTokenizer
A grammar-based tokenizer constructed with JFlex.
This should be a good tokenizer for most European-language documents:
- Splits words at punctuation characters, removing punctuation. However, a
dot that's not followed by whitespace is considered part of a token.
- Splits words at hyphens, unless there's a number in the token, in which case
the whole token is interpreted as a product number and is not split.
- Recognizes email addresses and internet hostnames as one token.
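The hyphen rule in the list above can be illustrated with a small, self-contained sketch. It is not the JFlex grammar and does not use the Lucene API. The class `HyphenRuleSketch` and the method `splitAtHyphens` are hypothetical names introduced only for this illustration, and the digit check is a simplification of what the real grammar treats as a product number.

```java
import java.util.ArrayList;
import java.util.List;

final class HyphenRuleSketch {
    /**
     * Simplified illustration of the documented rule: a whitespace-free chunk
     * is split at hyphens unless it contains a digit, in which case it is kept
     * whole (treated like a product number).
     */
    static List<String> splitAtHyphens(String chunk) {
        List<String> tokens = new ArrayList<>();
        boolean hasDigit = chunk.chars().anyMatch(Character::isDigit);
        if (hasDigit) {
            tokens.add(chunk); // keep e.g. "x-200" as one token
            return tokens;
        }
        for (String part : chunk.split("-")) {
            if (!part.isEmpty()) {
                tokens.add(part); // hyphens themselves are dropped
            }
        }
        return tokens;
    }
}
```

With this sketch, `splitAtHyphens("wi-fi")` yields two tokens, while `splitAtHyphens("x-200")` yields the single token `x-200`.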
Many applications have specific tokenizer needs. If this tokenizer does
not suit your application, please consider copying this source code
directory to your project and maintaining your own grammar-based tokenizer.
| Field Summary | | |
|---|---|---|
| public static final int | ALPHANUM | |
| public static final int | APOSTROPHE | |
| public static final int | ACRONYM | |
| public static final int | COMPANY | |
| public static final int | EMAIL | |
| public static final int | HOST | |
| public static final int | NUM | |
| public static final int | CJ | |
| public static final int | ACRONYM_DEP | |
| public static final String[] | TOKEN_TYPES | String token types that correspond to token type int constants |
| public static final String[] | tokenImage | |
| Constructor: |
public StandardTokenizer(Reader input) {
this.input = input;
this.scanner = new StandardTokenizerImpl(input);
}
Creates a new instance of the StandardTokenizer. Attaches the
input to a newly created JFlex scanner. |
public StandardTokenizer(Reader input,
boolean replaceInvalidAcronym) {
this.replaceInvalidAcronym = replaceInvalidAcronym;
this.input = input;
this.scanner = new StandardTokenizerImpl(input);
}
Parameters:
input - The input reader
replaceInvalidAcronym - Set to true to replace mischaracterized acronyms with HOST.
See http://issues.apache.org/jira/browse/LUCENE-1068
|
| Methods from org.apache.lucene.analysis.Tokenizer: |
|---|
|
close, reset |
| Method from org.apache.lucene.analysis.standard.StandardTokenizer Detail: |
public int getMaxTokenLength() {
return maxTokenLength;
}
|
public boolean isReplaceInvalidAcronym() {
return replaceInvalidAcronym;
}
Deprecated! To be removed in 3.X, when true becomes the only valid value.
Prior to https://issues.apache.org/jira/browse/LUCENE-1068, StandardTokenizer mischaracterized tokens like www.abc.com as acronyms
when they should have been labeled as hosts instead. |
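The effect of the replaceInvalidAcronym flag on a token that the scanner reports as ACRONYM_DEP can be sketched without Lucene on the classpath. The class `AcronymDepSketch`, the method `resolve`, and the plain labels `"HOST"` and `"ACRONYM"` are hypothetical stand-ins for this illustration. The sketch also assumes that the scanned text ends with an extra '.', as the "remove extra '.'" comment in next(Token) suggests.

```java
final class AcronymDepSketch {
    /**
     * Returns {typeLabel, termText} for a token scanned as ACRONYM_DEP.
     * Assumes scannedText is non-empty and ends with the extra '.'.
     */
    static String[] resolve(String scannedText, boolean replaceInvalidAcronym) {
        if (replaceInvalidAcronym) {
            // Relabel as a host and trim the trailing '.'
            String trimmed = scannedText.substring(0, scannedText.length() - 1);
            return new String[] { "HOST", trimmed };
        }
        // Legacy behavior: keep the acronym label and leave the text untouched
        return new String[] { "ACRONYM", scannedText };
    }
}
```

Under these assumptions, `resolve("www.abc.com.", true)` gives the label `HOST` with the term `www.abc.com`, while passing `false` keeps the label `ACRONYM` and the original text.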
public Token next(Token result) throws IOException {
int posIncr = 1;
while(true) {
int tokenType = scanner.getNextToken();
if (tokenType == StandardTokenizerImpl.YYEOF) {
return null;
}
if (scanner.yylength() <= maxTokenLength) {
result.clear();
result.setPositionIncrement(posIncr);
scanner.getText(result);
final int start = scanner.yychar();
result.setStartOffset(start);
result.setEndOffset(start+result.termLength());
// This 'if' should be removed in the next release. For now, it converts
// invalid acronyms to HOST. When removed, only the 'else' part should
// remain.
if (tokenType == StandardTokenizerImpl.ACRONYM_DEP) {
if (replaceInvalidAcronym) {
result.setType(StandardTokenizerImpl.TOKEN_TYPES[StandardTokenizerImpl.HOST]);
result.setTermLength(result.termLength() - 1); // remove extra '.'
} else {
result.setType(StandardTokenizerImpl.TOKEN_TYPES[StandardTokenizerImpl.ACRONYM]);
}
} else {
result.setType(StandardTokenizerImpl.TOKEN_TYPES[tokenType]);
}
return result;
} else {
// When we skip a too-long term, we still increment the
// position increment
posIncr++;
}
}
}
|
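The position-increment bookkeeping in next(Token) can be modeled in isolation. The sketch below is a simplified stand-in that does not use Lucene: `PositionIncrementSketch`, `Tok`, and `filterByLength` are hypothetical names, and a plain list of strings plays the role of the scanner's output.

```java
import java.util.ArrayList;
import java.util.List;

final class PositionIncrementSketch {
    /** Hypothetical stand-in for a token: term text plus position increment. */
    static final class Tok {
        final String term;
        final int positionIncrement;

        Tok(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /**
     * Drops terms longer than maxTokenLength. Each skipped term increases the
     * position increment of the next kept term by one, so positions still
     * reflect the gap left by the skipped terms.
     */
    static List<Tok> filterByLength(List<String> scanned, int maxTokenLength) {
        List<Tok> kept = new ArrayList<>();
        int posIncr = 1;
        for (String term : scanned) {
            if (term.length() <= maxTokenLength) {
                kept.add(new Tok(term, posIncr));
                posIncr = 1; // each next(Token) call starts again at 1
            } else {
                posIncr++; // too-long term skipped, but its position is counted
            }
        }
        return kept;
    }
}
```

For the input `["a", "toolongtoken", "b"]` with a maximum length of 5, the sketch keeps `a` with increment 1 and `b` with increment 2.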
public void reset() throws IOException {
super.reset();
scanner.yyreset(input);
}
|
public void reset(Reader reader) throws IOException {
input = reader;
reset();
}
|
void setInput(Reader reader) {
this.input = reader;
}
|
public void setMaxTokenLength(int length) {
this.maxTokenLength = length;
}
Sets the maximum allowed token length. Any token longer
than this is skipped. |
public void setReplaceInvalidAcronym(boolean replaceInvalidAcronym) {
this.replaceInvalidAcronym = replaceInvalidAcronym;
}
Deprecated! To be removed in 3.X, when true becomes the only valid value.
See https://issues.apache.org/jira/browse/LUCENE-1068
|