org.dinopolis.util.io
Class Tokenizer

java.lang.Object
  org.dinopolis.util.io.Tokenizer
- public class Tokenizer
- extends java.lang.Object
This tokenizer merges the benefits of the java.lang.StringTokenizer class and the java.io.StreamTokenizer class. It provides a low level and a high level interface to the tokenizer. The low level interface consists of the method pair nextToken() and getWord(), where the first returns the type of token in the parsing process, and the latter returns the String element itself.
The high level interface consists of the methods hasNextLine() and nextLine(). They use the low level interface to parse the data line by line and create a list of strings from it.
It is unclear whether mixing the high level and the low level interface is wise. For normal usage, the high level interface should be more comfortable to use and has no drawbacks.
An example for the high level interface:
try
{
  // simple example: tokenizing a string, no escape characters,
  // but quoted words are respected:
  System.out.println("example 1");
  Tokenizer tokenizer = new Tokenizer("text,,,\"another,text\"");
  List tokens;
  while(tokenizer.hasNextLine())
  {
    tokens = tokenizer.nextLine();
    System.out.println(tokens.get(0)); // prints 'text'
    System.out.println(tokens.get(1)); // prints ''
    System.out.println(tokens.get(2)); // prints ''
    System.out.println(tokens.get(3)); // prints 'another,text'
  }
  System.out.println("example 2");
  // simple example: tokenizing a string, using the escape character
  // and quoted words:
  tokenizer = new Tokenizer("text,text with\\,comma,,\"another,text\"");
  tokenizer.respectEscapedCharacters(true);
  while(tokenizer.hasNextLine())
  {
    tokens = tokenizer.nextLine();
    System.out.println(tokens.get(0)); // prints 'text'
    System.out.println(tokens.get(1)); // prints 'text with,comma'
    System.out.println(tokens.get(2)); // prints ''
    System.out.println(tokens.get(3)); // prints 'another,text'
  }
}
catch(Exception ioe)
{
  ioe.printStackTrace();
}
The advantage over the StreamTokenizer class: unlike the StreamTokenizer, this Tokenizer class returns the delimiters as tokens and can therefore be used to tokenize, for example, comma separated files with empty fields (the StreamTokenizer treats multiple delimiters in a row as one delimiter).
The tokenizer respects quoted words, so a delimiter is ignored if it is
inside quotes. If respectEscapedCharacters(true) is set, it can also
handle escaped characters (like an escaped quote character, or an
escaped new line). So the line
eric,"he said, \"great!\"" returns eric
and he said, "great!" as words.
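The quoting and escaping rules described above can be illustrated with a small, self-contained sketch. It does not use or reproduce the Tokenizer implementation; the class QuoteAwareSplitSketch and its split method are hypothetical names introduced only for this illustration, and the handling shown (quote characters dropped from the word, the character after an escape taken literally) is an assumption based on the description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not part of org.dinopolis.util.io.
final class QuoteAwareSplitSketch {

    // Splits one line at the delimiter. Delimiters inside quotes are ignored,
    // and the character following the escape character is taken literally.
    static List<String> split(String line, char delimiter, char quote, char escape) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == escape && i + 1 < line.length()) {
                current.append(line.charAt(++i)); // escaped character, kept as-is
            } else if (c == quote) {
                inQuotes = !inQuotes;             // quote toggles quoted mode, not stored
            } else if (c == delimiter && !inQuotes) {
                words.add(current.toString());    // delimiter outside quotes ends a word
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        words.add(current.toString());            // end of input ends the last word
        return words;
    }

    public static void main(String[] args) {
        // The line from the text: eric,"he said, \"great!\""
        for (String word : split("eric,\"he said, \\\"great!\\\"\"", ',', '"', '\\')) {
            System.out.println(word);
        }
        // prints:
        // eric
        // he said, "great!"
    }
}
```

With the real class, the equivalent behavior would presumably require calling respectEscapedCharacters(true) first, since escape characters are ignored by default.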
Low level interface: The design of the Tokenizer makes it possible either to get empty columns or to treat multiple delimiters in a row as one delimiter. For the first approach, emit the value on every DELIMITER and EOF token; for the second, emit it only on WORD tokens.
To be informed about empty words as well, use the Tokenizer as in the following code fragment:
Tokenizer tokenizer = new Tokenizer("text,,,another text");
String word = "";
int token;
while((token = tokenizer.nextToken()) != Tokenizer.EOF)
{
  switch(token)
  {
    case Tokenizer.EOL:
      // end of line: report the pending word and mark the line break
      System.out.println("word: "+word);
      word = "";
      System.out.println("-------------");
      break;
    case Tokenizer.WORD:
      word = tokenizer.getWord();
      break;
    case Tokenizer.QUOTED_WORD:
      word = tokenizer.getWord() + " (quoted)";
      break;
    case Tokenizer.DELIMITER:
      // delimiter: report the pending word (empty for empty columns)
      System.out.println("word: "+word);
      word = "";
      break;
    default:
      System.err.println("Unknown Token: "+token);
  }
}
In this example, if the delimiter is set to a comma, a line like
column1,,,"column4,partofcolumn4" would be treated correctly.
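The two approaches described for the low level interface (emit on every delimiter versus emit only on words) can be compared in a sketch that uses java.util.StringTokenizer with returnDelims set to true as a stand-in token source. This is only an analogy to the Tokenizer's DELIMITER and WORD tokens; the class DelimiterTokenSketch and its method names are hypothetical, and quoting is not handled here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical illustration using the JDK's StringTokenizer, not the Tokenizer class.
final class DelimiterTokenSketch {

    // First approach: emit a value at every delimiter and at end of input,
    // so empty columns are preserved.
    static List<String> keepEmptyColumns(String line) {
        StringTokenizer st = new StringTokenizer(line, ",", true); // delimiters returned as tokens
        List<String> columns = new ArrayList<>();
        String word = "";
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (token.equals(",")) {
                columns.add(word); // delimiter closes the current column
                word = "";
            } else {
                word = token;      // remember the word until the next delimiter
            }
        }
        columns.add(word);         // end of input closes the last column
        return columns;
    }

    // Second approach: emit a value only for word tokens,
    // so a run of delimiters behaves like a single delimiter.
    static List<String> wordsOnly(String line) {
        StringTokenizer st = new StringTokenizer(line, ",", true);
        List<String> words = new ArrayList<>();
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (!token.equals(",")) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(keepEmptyColumns("text,,,another text").size()); // prints 4
        System.out.println(wordsOnly("text,,,another text").size());        // prints 2
    }
}
```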
This tokenizer uses the LF character as the end of line character. It ignores any CR characters, so it can be used in Windows environments as well.
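The line-ending rule above (LF ends a line, CR is dropped) can be sketched as follows. This is a minimal illustration under that stated rule, not the Tokenizer's own code; LineEndingSketch and splitLines are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of "LF ends the line, CR is ignored".
final class LineEndingSketch {

    static List<String> splitLines(String text) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r') {
                continue;                      // CR is simply skipped
            }
            if (c == '\n') {
                lines.add(current.toString()); // LF terminates the current line
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            lines.add(current.toString());     // last line without trailing LF
        }
        return lines;
    }

    public static void main(String[] args) {
        // Windows (CRLF) and Unix (LF) input yield the same lines.
        System.out.println(splitLines("a,b\r\nc,d\r\n").equals(splitLines("a,b\nc,d\n"))); // prints true
    }
}
```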
- Version:
- $Revision: 1.4 $
| Field Summary | |
| protected java.lang.StringBuffer | buffer_ |
| static int | DELIMITER |
| protected java.lang.String | delimiters_ |
| static int | EOF |
| protected boolean | eof_reached_ |
| static int | EOL |
| protected boolean | eol_is_significant_ |
| static int | ERROR |
| protected int | escape_char_ |
| protected boolean | escape_mode_ |
| protected int | last_token_ |
| protected int | line_count_ |
| static int | NOT_STARTED |
| protected int | quote_char_ |
| static int | QUOTED_WORD |
| protected java.io.PushbackReader | reader_ |
| protected boolean | respect_escaped_chars_ |
| protected boolean | respect_quoted_words_ |
| static int | WORD |
| Constructor Summary | |
| Tokenizer(java.io.InputStream in_stream) | Creates a tokenizer that reads from the given input stream. |
| Tokenizer(java.io.Reader reader) | Creates a tokenizer that reads from the given reader. |
| Tokenizer(java.lang.String string) | Creates a tokenizer that reads from the given string. |
| Tokenizer(java.lang.String string, java.lang.String delimiters) | Creates a tokenizer that reads from the given string. |
| Method Summary | |
| void | close() | Closes the tokenizer (and the reader it uses internally). |
| void | eolIsSignificant(boolean significant) | If set to true, the end of line is signaled by the EOL token. |
| int | getDelimiter() | Deprecated. Use the getDelimiters() method instead. |
| java.lang.String | getDelimiters() | Get the delimiter characters. |
| int | getEscapeChar() | Get the escape character. |
| int | getLastToken() | Returns the last token that was returned from the nextToken() method. |
| int | getLineNumber() | Returns the current line number of the reader. |
| int | getQuoteChar() | Get the quote character. |
| java.lang.String | getWord() | Returns the value of the token. |
| boolean | hasNextLine() | Returns true if the tokenizer can return another line. |
| protected boolean | isDelimiter(int c) | Returns true if the given character is seen as a delimiter. |
| protected boolean | isEndOfLine(int c) | Returns true if the given character is seen as an end of line character. |
| boolean | isEolSignificant() | Returns true if an EOL token is returned when an end of line is detected. |
| protected boolean | isEscapeChar(int c) | Returns true if the given character is seen as an escape character. |
| protected boolean | isQuoteChar(int c) | Returns true if the given character is seen as a quote character. |
| static void | main(java.lang.String[] args) | |
| java.util.List | nextLine() | Returns a list of elements (Strings) from the next line of the tokenizer. |
| int | nextToken() | Returns the next token from the reader. |
| protected int | readNextChar() | Reads and returns the next character from the reader and checks for the escape character. |
| static java.util.List | removeZeroLengthElements(java.util.List list) | This helper method removes all zero length elements from the given list and returns it. |
| boolean | respectEscapedCharacters() | Returns true if the escape character is respected. |
| void | respectEscapedCharacters(boolean respect_escaped) | If escape characters should be respected, set the param to true. |
| boolean | respectQuotedWords() | Returns true if quoted words are respected. |
| void | respectQuotedWords(boolean respect_quotes) | If quoted words should be respected, set the param to true. |
| void | setDelimiter(int delimiter_char) | Set the delimiter character. |
| void | setDelimiters(java.lang.String delimiters) | Set the delimiter characters. |
| void | setEscapeChar(int escape_char) | Set the escape character. |
| void | setQuoteChar(int quote_char) | Set the quote character. |
| protected static void | testGeonetUTF8(java.lang.String[] args) | |
| protected static void | testHighLevel(java.lang.String[] args) | |
| protected static void | testHighLevelExample() | |
| protected static void | testLowLevel(java.lang.String[] args) | |
| Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
reader_
protected java.io.PushbackReader reader_
buffer_
protected java.lang.StringBuffer buffer_
delimiters_
protected java.lang.String delimiters_
escape_char_
protected int escape_char_
quote_char_
protected int quote_char_
escape_mode_
protected boolean escape_mode_
eol_is_significant_
protected boolean eol_is_significant_
respect_escaped_chars_
protected boolean respect_escaped_chars_
respect_quoted_words_
protected boolean respect_quoted_words_
line_count_
protected int line_count_
eof_reached_
protected boolean eof_reached_
last_token_
protected int last_token_
EOF
public static final int EOF
- See Also:
- Constant Field Values
EOL
public static final int EOL
- See Also:
- Constant Field Values
WORD
public static final int WORD
- See Also:
- Constant Field Values
QUOTED_WORD
public static final int QUOTED_WORD
- See Also:
- Constant Field Values
DELIMITER
public static final int DELIMITER
- See Also:
- Constant Field Values
ERROR
public static final int ERROR
- See Also:
- Constant Field Values
NOT_STARTED
public static final int NOT_STARTED
- See Also:
- Constant Field Values
| Constructor Detail |
Tokenizer
public Tokenizer(java.lang.String string)
- Creates a tokenizer that reads from the given string. It uses the
comma as delimiter, does not respect escape characters but respects
quoted words.
Tokenizer
public Tokenizer(java.lang.String string, java.lang.String delimiters)
- Creates a tokenizer that reads from the given string. All
characters in the given delimiters string are used as
delimiter. The tokenizer does not respect escape characters but
respects quoted words.
Tokenizer
public Tokenizer(java.io.InputStream in_stream)
- Creates a tokenizer that reads from the given input stream. It uses the
comma as delimiter, does not respect escape characters but respects
quoted words.
Tokenizer
public Tokenizer(java.io.Reader reader)
- Creates a tokenizer that reads from the given reader. It uses the
comma as delimiter, does not respect escape characters but respects
quoted words.
| Method Detail |
setDelimiter
public void setDelimiter(int delimiter_char)
- Set the delimiter character. The default is the comma.
getDelimiter
public int getDelimiter()
- Deprecated. Use the getDelimiters() method instead.
- Get the first delimiter character.
setDelimiters
public void setDelimiters(java.lang.String delimiters)
- Set the delimiter characters. All characters in the delimiters are
used as delimiter.
getDelimiters
public java.lang.String getDelimiters()
- Get the delimiter characters.
setEscapeChar
public void setEscapeChar(int escape_char)
- Set the escape character. The default is the backslash.
getEscapeChar
public int getEscapeChar()
- Get the escape character.
respectEscapedCharacters
public void respectEscapedCharacters(boolean respect_escaped)
- If escape characters should be respected, set the param to
true. The default is to ignore escape characters.
respectEscapedCharacters
public boolean respectEscapedCharacters()
- Returns true if the escape character is respected.
getQuoteChar
public int getQuoteChar()
- Get the quote character.
setQuoteChar
public void setQuoteChar(int quote_char)
- Set the quote character. The default is the double quote.
respectQuotedWords
public void respectQuotedWords(boolean respect_quotes)
- If quoted words should be respected, set the param to
true. The default is to respect quoted words.
respectQuotedWords
public boolean respectQuotedWords()
- Returns true if quoted words are respected.
eolIsSignificant
public void eolIsSignificant(boolean significant)
- If set to true, the end of line is signaled by the EOL token. If set to false, the end of line is treated as a normal delimiter. The default value is true.
isEolSignificant
public boolean isEolSignificant()
- Returns true if an EOL token is returned when an end of line is detected. If false, the end of line is treated as a normal delimiter.
getLineNumber
public int getLineNumber()
- Returns the current line number of the reader.
getWord
public java.lang.String getWord()
- Returns the value of the token. If the token was of the type WORD,
the word is returned.
getLastToken
public int getLastToken()
- Returns the last token that was returned from the nextToken() method.
isDelimiter
protected boolean isDelimiter(int c)
- Returns true, if the given character is seen as a delimiter. This
method respects escape_mode, so if the escape character was found
before, it has to act accordingly (usually, return false, even if
the character is a delimiter).
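How a delimiter check might combine a set of delimiter characters with the escape mode can be sketched as follows. The class DelimiterCheckSketch, its constructor, and setEscapeMode are hypothetical; the real method reads the protected fields delimiters_ and escape_mode_, whose exact handling is not shown in this documentation, so this is only an assumption based on the description.

```java
// Hypothetical illustration, not the Tokenizer's own code.
final class DelimiterCheckSketch {
    private final String delimiters; // every character in this string counts as a delimiter
    private boolean escapeMode;      // true if the current character was preceded by the escape character

    DelimiterCheckSketch(String delimiters) {
        this.delimiters = delimiters;
    }

    void setEscapeMode(boolean escapeMode) {
        this.escapeMode = escapeMode;
    }

    boolean isDelimiter(int c) {
        if (escapeMode) {
            return false;                  // an escaped character is never a delimiter
        }
        return delimiters.indexOf(c) >= 0; // membership test over all delimiter characters
    }

    public static void main(String[] args) {
        DelimiterCheckSketch check = new DelimiterCheckSketch(",;");
        System.out.println(check.isDelimiter(';')); // prints true
        check.setEscapeMode(true);
        System.out.println(check.isDelimiter(';')); // prints false
    }
}
```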
isQuoteChar
protected boolean isQuoteChar(int c)
- Returns true, if the given character is seen as a quote
character. This method respects escape_mode, so if the escape
character was found before, it has to act accordingly (usually,
return false, even if the character is a quote character).
isEscapeChar
protected boolean isEscapeChar(int c)
- Returns true, if the given character is seen as an escape
character. This method respects escape_mode, so if the escape
character was found before, it has to act accordingly (usually,
return false, even if the character is an escape character).
isEndOfLine
protected boolean isEndOfLine(int c)
- Returns true, if the given character is seen as an end of line
character. This method respects end of line_mode, so if the end of
line character was found before, it has to act accordingly
(usually, return false, even if the character is an end of line
character).
close
public void close()
throws java.io.IOException
- Closes the tokenizer (and the reader it uses internally).
readNextChar
protected int readNextChar()
throws java.io.IOException
- Reads and returns the next character from the reader and checks for
the escape character. If an escape character is read, a flag is set
and the next character is read. A newline following the escape
character is ignored.
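The reading rule described above (an escape character sets a flag and the following character is returned; an escaped newline is skipped) can be sketched like this. EscapeReadSketch and readAll are hypothetical names, the backslash is hard-coded as the escape character (the documented default), and a plain Reader is used in place of the PushbackReader held by the real class.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical illustration, not the Tokenizer's own code.
final class EscapeReadSketch {
    private final Reader reader;
    private boolean escapeMode; // true if the last returned character was escaped

    EscapeReadSketch(String text) {
        this.reader = new StringReader(text);
    }

    int readNextChar() throws IOException {
        escapeMode = false;
        int c = reader.read();
        if (c == '\\') {               // escape character found
            c = reader.read();         // the escaped character follows
            if (c == '\n') {
                return readNextChar(); // escaped newline is ignored
            }
            escapeMode = true;
        }
        return c;
    }

    // Reads the whole text and marks escaped characters with brackets.
    static String readAll(String text) {
        EscapeReadSketch in = new EscapeReadSketch(text);
        StringBuilder out = new StringBuilder();
        try {
            int c;
            while ((c = in.readNextChar()) != -1) {
                if (in.escapeMode) {
                    out.append('[').append((char) c).append(']');
                } else {
                    out.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Input characters: a \ , b \ LF c
        System.out.println(readAll("a\\,b\\\nc")); // prints a[,]bc
    }
}
```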
nextToken
public int nextToken()
throws java.io.IOException
- Returns the next token from the reader. The token's value may be
WORD, QUOTED_WORD, EOF, EOL, or DELIMITER. In the case of WORD or
QUOTED_WORD the actual word can be obtained by the use of the
getWord method.
hasNextLine
public boolean hasNextLine()
throws java.io.IOException
- Returns true, if the tokenizer can return another line.
nextLine
public java.util.List nextLine() throws java.io.IOException
- Returns a list of elements (Strings) from the next line of the
tokenizer. If there are multiple delimiters without any values in
between, empty (zero length) strings are added to the list. They
may be removed by the use of the
removeZeroLengthElements(List) method.
removeZeroLengthElements
public static java.util.List removeZeroLengthElements(java.util.List list)
- This helper method removes all zero length elements from the given
list and returns it.
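What such a helper does can be sketched in a few lines. This is an assumption-laden illustration: ZeroLengthSketch and removeZeroLength are hypothetical names, the sketch uses a typed List<String> where the real method takes a raw java.util.List, and it assumes the list is modified in place, which the description suggests but does not state outright.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not the Tokenizer's own helper.
final class ZeroLengthSketch {

    // Removes all empty strings from the given (mutable) list and returns the same list.
    static List<String> removeZeroLength(List<String> list) {
        list.removeIf(String::isEmpty);
        return list;
    }

    public static void main(String[] args) {
        List<String> tokens = new ArrayList<>(Arrays.asList("text", "", "", "another,text"));
        System.out.println(removeZeroLength(tokens)); // prints [text, another,text]
    }
}
```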
testLowLevel
protected static void testLowLevel(java.lang.String[] args)
testHighLevel
protected static void testHighLevel(java.lang.String[] args)
testGeonetUTF8
protected static void testGeonetUTF8(java.lang.String[] args)
testHighLevelExample
protected static void testHighLevelExample()
main
public static void main(java.lang.String[] args)