Docjar: A Java Source and Document Engine


org.apache.lucene.analysis.cn.* (3)
org.apache.lucene.analysis.ru.* (8)
org.apache.lucene.analysis.standard.* (10)

org.apache.lucene.analysis: Javadoc index of package org.apache.lucene.analysis.


Package Samples:

org.apache.lucene.analysis.standard: API and code to convert text into indexable tokens.  
org.apache.lucene.analysis.ru
org.apache.lucene.analysis.cn

Classes:

PorterStemFilter: Transforms the token stream as per the Porter stemming algorithm. Note: the input to the stemming filter must already be in lower case, so you will need to use LowerCaseFilter or LowerCaseTokenizer farther down the Tokenizer chain in order for this to work properly! To use this filter with other analyzers, you'll want to write an Analyzer class that sets up the TokenStream chain as you want it. To use this with LowerCaseTokenizer, for example, you'd write an analyzer like this: class MyAnalyzer extends Analyzer { public final TokenStream tokenStream(String fieldName, Reader reader) { return new ...
Token: A Token is an occurrence of a term from the text of a field. It consists of a term's text, the start and end offset of the term in the text of the field, and a type string. The start and end offsets permit applications to re-associate a token with its source text, e.g., to display highlighted query terms in a document browser, or to show matching text fragments in a KWIC (KeyWord In Context) display, etc. The type is an interned string, assigned by a lexical analyzer (a.k.a. tokenizer), naming the lexical or syntactic class that the token belongs to. For example an end of sentence marker token might ...
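The role of the start and end offsets can be illustrated with a small standalone sketch. It does not use Lucene: the class SimpleToken, the whitespace-based tokenize method, and the bracket-style highlight method are all hypothetical names invented for this example, and the only assumption carried over from the description above is that a token stores its text, its start and end offsets, and a type string.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone illustration only; SimpleToken is a hypothetical stand-in,
// not Lucene's Token class.
final class TokenOffsetsDemo {

    static final class SimpleToken {
        final String text;  // term text
        final int start;    // offset of the first character in the source
        final int end;      // offset one past the last character
        final String type;  // lexical class assigned by the tokenizer

        SimpleToken(String text, int start, int end, String type) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Splits on whitespace and records where each token came from.
    static List<SimpleToken> tokenize(String source) {
        List<SimpleToken> tokens = new ArrayList<>();
        int i = 0;
        while (i < source.length()) {
            while (i < source.length() && Character.isWhitespace(source.charAt(i))) {
                i++;
            }
            int start = i;
            while (i < source.length() && !Character.isWhitespace(source.charAt(i))) {
                i++;
            }
            if (i > start) {
                tokens.add(new SimpleToken(source.substring(start, i), start, i, "word"));
            }
        }
        return tokens;
    }

    // Uses the stored offsets to mark matching terms in the original text.
    static String highlight(String source, List<SimpleToken> tokens, String query) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (SimpleToken t : tokens) {
            if (t.text.equalsIgnoreCase(query)) {
                out.append(source, last, t.start);
                out.append('[').append(source, t.start, t.end).append(']');
                last = t.end;
            }
        }
        out.append(source, last, source.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String source = "The quick brown fox";
        // prints "The [quick] brown fox"
        System.out.println(highlight(source, tokenize(source), "quick"));
    }
}
```

Because the offsets point back into the unmodified source, highlighting still works when a filter has changed the token text (for example by lowercasing or stemming it).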
CharStream: This interface describes a character stream that maintains line and column number positions of the characters. It also has the capability to back up the stream to some extent. An implementation of this interface is used in the TokenManager implementation generated by JavaCCParser. All the methods except backup can be implemented in any fashion. backup needs to be implemented correctly for the correct operation of the lexer. The rest of the methods are all used to get information like line number, column number and the String that constitutes a token and are not used by the lexer. Hence their implementation ...
TestPerFieldAnalzyerWrapper: Copyright 2004 The Apache Software Foundation. Licensed under the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0).
StandardAnalyzer: Filters StandardTokenizer with StandardFilter, org.apache.lucene.analysis.LowerCaseFilter and org.apache.lucene.analysis.StopFilter.
TokenStream: A TokenStream enumerates the sequence of tokens, either from fields of a document or from query text. This is an abstract class. Concrete subclasses are: Tokenizer, a TokenStream whose input is a Reader; and TokenFilter, a TokenStream whose input is another TokenStream.
LowerCaseTokenizer: LowerCaseTokenizer performs the function of LetterTokenizer and LowerCaseFilter together. It divides text at non-letters and converts the resulting tokens to lower case. While it is functionally equivalent to the combination of LetterTokenizer and LowerCaseFilter, there is a performance advantage to doing the two tasks at once, hence this (redundant) implementation. Note: this does a decent job for most European languages, but does a terrible job for some Asian languages, where words are not separated by spaces.
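The idea of splitting at non-letters and lowercasing in the same pass can be sketched without Lucene. The class and method names below are hypothetical; the sketch only assumes what the description states, namely that tokens are maximal runs of characters accepted by Character.isLetter() and that each letter is lowercased as it is collected.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone illustration only; not Lucene's LowerCaseTokenizer.
final class LowerCaseLetterSplit {

    // One pass over the text: letters are lowercased and appended to the
    // current token; any non-letter ends the current token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [hello, world, it, s]
        System.out.println(tokenize("Hello, World! It's 2004"));
    }
}
```

Digits and punctuation never reach the output, which is why "It's 2004" yields only "it" and "s".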
Analyzer: An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text. Typical implementations first build a Tokenizer, which breaks the stream of characters from the Reader into raw Tokens. One or more TokenFilters may then be applied to the output of the Tokenizer. WARNING: You must override one of the methods defined by this class in your subclass or the Analyzer will enter an infinite loop.
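The tokenizer-then-filters policy described for Analyzer can be modelled in a few lines of plain Java. This is a rough sketch, not the Lucene API: AnalyzerChainSketch, whitespaceTokenize, lowerCase, and stopFilter are hypothetical names, tokens are plain strings held in lists rather than streamed, and the stop words are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;
import java.util.function.UnaryOperator;

// Standalone illustration only; not Lucene's Analyzer class.
final class AnalyzerChainSketch {
    private final Function<String, List<String>> tokenizer;
    private final List<UnaryOperator<List<String>>> filters;

    AnalyzerChainSketch(Function<String, List<String>> tokenizer,
                        List<UnaryOperator<List<String>>> filters) {
        this.tokenizer = tokenizer;
        this.filters = filters;
    }

    // The tokenizer produces raw tokens; each filter then transforms the
    // output of the previous stage, in order.
    List<String> analyze(String text) {
        List<String> tokens = tokenizer.apply(text);
        for (UnaryOperator<List<String>> filter : filters) {
            tokens = filter.apply(tokens);
        }
        return tokens;
    }

    static List<String> whitespaceTokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                out.add(part);
            }
        }
        return out;
    }

    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    static UnaryOperator<List<String>> stopFilter(Set<String> stopWords) {
        return tokens -> {
            List<String> out = new ArrayList<>();
            for (String t : tokens) {
                if (!stopWords.contains(t)) {
                    out.add(t);
                }
            }
            return out;
        };
    }

    // Example policy: whitespace tokenizer, then lowercase, then stop words.
    static AnalyzerChainSketch example() {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "a", "of"));
        List<UnaryOperator<List<String>>> filters = new ArrayList<>();
        filters.add(AnalyzerChainSketch::lowerCase);
        filters.add(stopFilter(stopWords));
        return new AnalyzerChainSketch(AnalyzerChainSketch::whitespaceTokenize, filters);
    }

    public static void main(String[] args) {
        // prints [history, search, engine]
        System.out.println(example().analyze("The History of A Search Engine"));
    }
}
```

The order of the filters matters: the stop-word set here is lowercase, so the lowercasing stage has to run before the stop filter, in the same way that a stemming filter expecting lowercase input has to come after a lowercasing stage.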
PerFieldAnalyzerWrapper: This analyzer is used to facilitate scenarios where different fields require different analysis techniques. Use addAnalyzer(java.lang.String, org.apache.lucene.analysis.Analyzer) to add a non-default analyzer on a field name basis. See TestPerFieldAnalyzerWrapper.java for example usage.
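The per-field dispatch described above amounts to a lookup table with a fallback. The following sketch shows that idea in plain Java and is not the Lucene class: PerFieldSketch is a hypothetical name, an "analyzer" is reduced to a function from text to a list of strings, and the method is called addAnalyzer only to mirror the description.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Standalone illustration only; not Lucene's PerFieldAnalyzerWrapper.
final class PerFieldSketch {
    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();

    PerFieldSketch(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    // Registers a non-default analyzer for one field name.
    void addAnalyzer(String fieldName, Function<String, List<String>> analyzer) {
        perField.put(fieldName, analyzer);
    }

    // Fields without a registered analyzer fall back to the default.
    List<String> analyze(String fieldName, String text) {
        return perField.getOrDefault(fieldName, defaultAnalyzer).apply(text);
    }

    // Example analyzers (hypothetical): lowercase words vs. whole-value keyword.
    static List<String> lowerCaseWords(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    public static void main(String[] args) {
        PerFieldSketch wrapper = new PerFieldSketch(PerFieldSketch::lowerCaseWords);
        wrapper.addAnalyzer("id", PerFieldSketch::keyword);
        System.out.println(wrapper.analyze("body", "Hello World")); // default analyzer
        System.out.println(wrapper.analyze("id", "AB-123 x"));      // keyword analyzer
    }
}
```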
ChineseFilter: Title: ChineseFilter. Description: Filter with a stop word table. Rules: No digits are allowed. An English word/token should be longer than 1 character. One Chinese character counts as one Chinese word. TO DO: 1. Add Chinese stop words, such as ? 2. Dictionary-based Chinese word extraction. 3. Intelligent Chinese word extraction. Copyright: Copyright (c) 2001. Company:
RussianCharsets: The RussianCharsets class contains encoding schemes (charsets) and a toLowerCase() method implementation for Russian characters in Unicode, KOI8 and CP1252. Each encoding scheme contains lowercase (positions 0-31) and uppercase (positions 32-63) characters. One should be able to add other encoding schemes (like ISO-8859-5 or customized) by adding a new charset and adding logic to the toLowerCase() method for that charset.
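The table layout described above (lowercase letters at positions 0-31, uppercase at positions 32-63) suggests a simple lookup-based toLowerCase(). The sketch below is an assumption-laden illustration, not the Lucene source: CharsetTableSketch and unicodeRussianTable are hypothetical names, and the table is built only from the two contiguous 32-letter Unicode ranges U+0430..U+044F (lowercase) and U+0410..U+042F (uppercase).

```java
// Standalone illustration only; not Lucene's RussianCharsets.
final class CharsetTableSketch {

    // Builds a 64-entry table: lowercase letters at 0-31, uppercase at 32-63.
    static char[] unicodeRussianTable() {
        char[] table = new char[64];
        for (int i = 0; i < 32; i++) {
            table[i] = (char) (0x0430 + i);      // lowercase range
            table[32 + i] = (char) (0x0410 + i); // uppercase range
        }
        return table;
    }

    // Looks the letter up in the uppercase half; if found, returns the
    // character at the same position in the lowercase half.
    static char toLowerCase(char letter, char[] table) {
        for (int i = 32; i < 64; i++) {
            if (table[i] == letter) {
                return table[i - 32];
            }
        }
        return letter; // already lowercase, or not in this charset
    }

    public static void main(String[] args) {
        char[] table = unicodeRussianTable();
        // U+0416 (uppercase) maps to U+0436 (lowercase)
        System.out.println((int) toLowerCase((char) 0x0416, table) == 0x0436);
    }
}
```

Supporting another encoding under this scheme would mean supplying a different 64-entry table with the same layout, which matches the extension path the description mentions.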
LetterTokenizer: A LetterTokenizer is a tokenizer that divides text at non-letters. That is to say, it defines tokens as maximal strings of adjacent letters, as defined by the java.lang.Character.isLetter() predicate. Note: this does a decent job for most European languages, but does a terrible job for some Asian languages, where words are not separated by spaces.
StandardTokenizer: A grammar-based tokenizer constructed with JavaCC. This should be a good tokenizer for most European-language documents. Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.
RussianLetterTokenizer: A RussianLetterTokenizer is a tokenizer that extends LetterTokenizer by additionally looking up letters in a given "russian charset". The problem with LetterTokenizer is that it uses the Character.isLetter() method, which doesn't know how to detect letters in encodings like CP1252 and KOI8 (well-known problems with the 0xD7 and 0xF7 chars).
FastCharStream: An efficient implementation of JavaCC's CharStream interface. Note that this does not do line-number counting, but instead keeps track of the character position of the token in the input, as required by Lucene's org.apache.lucene.analysis.Token API.
ParseException: This exception is thrown when parse errors are encountered. You can explicitly create objects of this exception type by calling the method generateParseException in the generated parser. You can modify this class to customize your error reporting mechanisms so long as you retain the public fields.
RussianStemFilter: A filter that stems Russian words. The implementation was inspired by GermanStemFilter. The input should be filtered by RussianLowerCaseFilter before passing it to RussianStemFilter, because RussianStemFilter only works with the lowercase part of any "russian" charset.
PorterStemmer: Stemmer, implementing the Porter Stemming Algorithm. The Stemmer class transforms a word into its root form. The input word can be provided a character at a time (by calling add()), or at once by calling one of the various stem(something) methods.
ChineseTokenizer: Title: ChineseTokenizer. Description: Extracts tokens from the stream using Character.getType(). Rule: A Chinese character is a single token. Copyright: Copyright (c) 2001. Company:
ChineseAnalyzer: Title: ChineseAnalyzer. Description: Subclass of org.apache.lucene.analysis.Analyzer built from a ChineseTokenizer, filtered with ChineseFilter. Copyright: Copyright (c) 2001. Company:
RussianAnalyzer: Analyzer for Russian language. Supports an external list of stopwords (words that will not be indexed at all). A default set of stopwords is used unless an alternative list is specified.
StandardFilter: Normalizes tokens extracted with StandardTokenizer.
WhitespaceTokenizer: A WhitespaceTokenizer is a tokenizer that divides text at whitespace. Adjacent sequences of non-Whitespace characters form tokens.
RussianStemmer: Russian stemming algorithm implementation (see http://snowball.sourceforge.net for detailed description).
TokenFilter: A TokenFilter is a TokenStream whose input is another token stream. This is an abstract class.
