Stemmer, implementing the Porter Stemming Algorithm
The Stemmer class transforms a word into its root form. The input
word can be provided a character at a time (by calling add()), or all at once
by calling one of the various stem(something) methods.
| Method from org.apache.lucene.analysis.PorterStemmer Detail: |
public void add(char ch) {
    if (b.length <= i + EXTRA) {
        char[] new_b = new char[b.length+INC];
        System.arraycopy(b, 0, new_b, 0, b.length);
        b = new_b;
    }
    b[i++] = ch;
}
Add a character to the word being stemmed. When you are finished
adding characters, call stem() to process the word. |
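The growth strategy used by add() can be looked at in isolation. The following is a minimal sketch, not the Lucene source: the class name GrowableCharBuffer, the accessor methods, and the values chosen for INC and EXTRA are assumptions made for illustration.

```java
import java.util.Arrays;

// Hypothetical stand-in that isolates the buffer-growth pattern of add().
// The INC and EXTRA values below are assumed, not taken from the Lucene source.
public class GrowableCharBuffer {
    private static final int INC = 50;   // assumed growth increment
    private static final int EXTRA = 1;  // assumed spare capacity kept free

    private char[] b = new char[INC];
    private int i = 0;                   // number of characters stored so far

    public void add(char ch) {
        // Grow before writing whenever no more than EXTRA free slots remain.
        if (b.length <= i + EXTRA) {
            b = Arrays.copyOf(b, b.length + INC);
        }
        b[i++] = ch;
    }

    public int length() { return i; }

    public int capacity() { return b.length; }

    @Override
    public String toString() { return new String(b, 0, i); }
}
```

Growing by a fixed increment keeps the code simple, and because words are short the buffer rarely needs to grow more than once.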
public char[] getResultBuffer() {
    return b;
}
Returns a reference to the internal character buffer containing the result
of the stemming process. The buffer may be longer than the result, so call
getResultLength() to determine how many characters are valid. |
public int getResultLength() {
    return i;
}
Returns the length of the word resulting from the stemming process. |
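The reason the buffer and the length must be read together can be shown without the stemmer itself. The sketch below simulates the state by hand; the class ResultBufferDemo and the helper validPrefix are hypothetical names, and the buffer contents are made up for illustration rather than produced by PorterStemmer.

```java
public class ResultBufferDemo {
    // Copies only the valid prefix of the buffer; anything past `length` is stale.
    static String validPrefix(char[] buffer, int length) {
        return new String(buffer, 0, length);
    }

    public static void main(String[] args) {
        // Simulated state: the buffer still holds all the original letters,
        // while the length marks how many of them belong to the result.
        char[] buffer = {'r', 'u', 'n', 'n', 'i', 'n', 'g'};
        int length = 3;

        System.out.println(validPrefix(buffer, length)); // prints "run"
        System.out.println(new String(buffer));          // prints "running"
    }
}
```

Reading the whole array, as the second print does, would include leftover characters, which is why the length is needed.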
public static void main(String[] args) {
    PorterStemmer s = new PorterStemmer();
    for (int i = 0; i < args.length; i++) {
        try {
            InputStream in = new FileInputStream(args[i]);
            byte[] buffer = new byte[1024];
            int bufferLen, offset, ch;
            bufferLen = in.read(buffer);
            offset = 0;
            s.reset();
            while(true) {
                if (offset < bufferLen)
                    ch = buffer[offset++];
                else {
                    bufferLen = in.read(buffer);
                    offset = 0;
                    if (bufferLen < 0)
                        ch = -1;
                    else
                        ch = buffer[offset++];
                }
                if (Character.isLetter((char) ch)) {
                    s.add(Character.toLowerCase((char) ch));
                }
                else {
                    s.stem();
                    System.out.print(s.toString());
                    s.reset();
                    if (ch < 0)
                        break;
                    else {
                        System.out.print((char) ch);
                    }
                }
            }
            in.close();
        }
        catch (IOException e) {
            System.out.println("error reading " + args[i]);
        }
    }
}
Test program for demonstrating the Stemmer. It reads each file named on
the command line and stems each word, writing the result to standard output.
Usage: Stemmer file-name |
void r(String s) {
    if (m() > 0) setto(s);
}
|
public void reset() {
    i = 0; dirty = false;
}
reset() resets the stemmer so it can stem another word. If you invoke
the stemmer by calling add(char) and then stem(), you must call reset()
before starting another word. |
void setto(String s) {
    int l = s.length();
    int o = j+1;
    for (int i = 0; i < l; i++)
        b[o+i] = s.charAt(i);
    k = j+l;
    dirty = true;
}
|
public boolean stem() {
    return stem(0);
}
Stem the word placed into the Stemmer buffer through calls to add().
Returns true if the stemming process resulted in a word different
from the input. You can retrieve the result with
getResultLength()/getResultBuffer() or toString(). |
public String stem(String s) {
    if (stem(s.toCharArray(), s.length()))
        return toString();
    else
        return s;
}
Stem a word provided as a String. Returns the result as a String. |
public boolean stem(char[] word) {
    return stem(word, word.length);
}
Stem a word contained in a char[]. Returns true if the stemming process
resulted in a word different from the input. You can retrieve the
result with getResultLength()/getResultBuffer() or toString(). |
public boolean stem(int i0) {
    k = i - 1;
    k0 = i0;
    if (k > k0+1) {
        step1(); step2(); step3(); step4(); step5(); step6();
    }
    // Also, a word is considered dirty if we lopped off letters
    // Thanks to Ifigenia Vairelles for pointing this out.
    if (i != k+1)
        dirty = true;
    i = k+1;
    return dirty;
}
|
public boolean stem(char[] word,
                    int wordLen) {
    return stem(word, 0, wordLen);
}
Stem a word contained in the leading portion of a char[] array.
Returns true if the stemming process resulted in a word different
from the input. You can retrieve the result with
getResultLength()/getResultBuffer() or toString(). |
public boolean stem(char[] wordBuffer,
                    int offset,
                    int wordLen) {
    reset();
    if (b.length < wordLen) {
        char[] new_b = new char[wordLen + EXTRA];
        b = new_b;
    }
    System.arraycopy(wordBuffer, offset, b, 0, wordLen);
    i = wordLen;
    return stem(0);
}
Stem a word contained in a portion of a char[] array. Returns
true if the stemming process resulted in a word different from
the input. You can retrieve the result with
getResultLength()/getResultBuffer() or toString(). |
public String toString() {
    return new String(b,0,i);
}
After a word has been stemmed, it can be retrieved by toString(),
or a reference to the internal buffer can be retrieved by getResultBuffer()
and getResultLength(), which is generally more efficient because it avoids
allocating a new String. |