org.dinopolis.util.io
Class Tokenizer

java.lang.Object
  extended by org.dinopolis.util.io.Tokenizer

public class Tokenizer
extends java.lang.Object

This tokenizer merges the benefits of the java.util.StringTokenizer class and the java.io.StreamTokenizer class. It provides both a low level and a high level interface. The low level interface consists of the method pair nextToken() and getWord(): the first returns the type of the next token in the parsing process, and the second returns the String element itself.

The high level interface consists of the methods hasNextLine() and nextLine(). They use the low level interface to parse the data line by line and create a list of strings from it.

It is unclear whether it is wise to mix the usage of the high and the low level interface. For normal usage, the high level interface should be more comfortable to use and should have no drawbacks.

An example for the high level interface:

    try
    {
          // simple example, tokenizing string, no escape, but quoted
          // works:
      System.out.println("example 1");
      Tokenizer tokenizer = new Tokenizer("text,,,\"another,text\"");
      List tokens;
      while(tokenizer.hasNextLine())
      {
        tokens = tokenizer.nextLine();
        System.out.println(tokens.get(0)); // prints 'text'
        System.out.println(tokens.get(1)); // prints ''
        System.out.println(tokens.get(2)); // prints ''
        System.out.println(tokens.get(3)); // prints 'another,text'
      }

      System.out.println("example 2");
          // simple example, tokenizing string, using escape char and
          // quoted strings:
      tokenizer = new Tokenizer("text,text with\\,comma,,\"another,text\"");
      tokenizer.respectEscapedCharacters(true);
      while(tokenizer.hasNextLine())
      {
        tokens = tokenizer.nextLine();
        System.out.println(tokens.get(0)); // prints 'text'
        System.out.println(tokens.get(1)); // prints 'text with,comma'
        System.out.println(tokens.get(2)); // prints ''
        System.out.println(tokens.get(3)); // prints 'another,text'
      }
    }
    catch(Exception ioe)
    {
      ioe.printStackTrace();
    }
 

The advantage over the StreamTokenizer class: unlike the StreamTokenizer, this Tokenizer class returns the delimiters as tokens and can therefore be used to tokenize, for example, comma separated files with empty fields (the StreamTokenizer treats multiple delimiters in a row as one delimiter).
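The StreamTokenizer behaviour mentioned above can be reproduced with the JDK alone. The following is a minimal sketch, not part of this class: the class name StreamTokenizerEmptyFields and the method words are made up for illustration, and the syntax table setup (every character is a word character, the comma is whitespace) is just one way to make a comma act as a separator.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; it only uses java.io.StreamTokenizer from the JDK.
final class StreamTokenizerEmptyFields {

    /** Splits a line on commas with StreamTokenizer; empty fields are silently lost. */
    static List<String> words(String line) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(line));
        st.resetSyntax();              // start from an empty syntax table
        st.wordChars(0, 255);          // treat every character as part of a word ...
        st.whitespaceChars(',', ',');  // ... except the comma, which separates words
        List<String> words = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                words.add(st.sval);    // every token is a word with this syntax table
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        // Three commas in a row are skipped as one run of separators.
        System.out.println(words("text,,,another text")); // prints [text, another text]
    }
}
```

The two empty fields between the commas never show up, which is the limitation this Tokenizer class is meant to avoid.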

The tokenizer respects quoted words, so a delimiter inside quotes is ignored. It can also handle escaped characters (such as an escaped quote character or an escaped new line) if this is enabled with respectEscapedCharacters(true). So the line eric,"he said, \"great!\"" returns eric and he said, "great!" as words.
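To illustrate the quoting and escaping rules described above, here is a minimal, self-contained sketch of a splitter for a single line. It is not the implementation of this class: the name QuotedSplitSketch and its split method are hypothetical, and escape handling is always on here, whereas the Tokenizer ignores escape characters by default.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of quote- and escape-aware splitting; not the Tokenizer source.
final class QuotedSplitSketch {

    static List<String> split(String line, char delimiter, char quote, char escape) {
        List<String> words = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == escape && i + 1 < line.length()) {
                word.append(line.charAt(++i));   // escaped character is taken literally
            } else if (c == quote) {
                inQuotes = !inQuotes;            // quotes group text but are not part of the word
            } else if (c == delimiter && !inQuotes) {
                words.add(word.toString());      // delimiter outside quotes ends the word
                word.setLength(0);
            } else {
                word.append(c);
            }
        }
        words.add(word.toString());              // end of line ends the last word
        return words;
    }

    public static void main(String[] args) {
        // The line from the text: eric,"he said, \"great!\""
        String line = "eric,\"he said, \\\"great!\\\"\"";
        for (String w : split(line, ',', '"', '\\')) {
            System.out.println(w);
        }
    }
}
```

Running main prints eric on the first line and he said, "great!" on the second.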

Low level interface: The design of the Tokenizer makes it possible either to get empty columns or to treat multiple delimiters in a row as one delimiter. For the first approach, emit a value on every DELIMITER and EOF token; for the second, emit a value only on WORD tokens.

To be informed about empty words as well, use the Tokenizer as in the following code fragment:

   Tokenizer tokenizer = new Tokenizer("text,,,another text");
   String word = "";
   int token;
   while((token = tokenizer.nextToken()) != Tokenizer.EOF)
   {
     switch(token)
     {
     case Tokenizer.EOL:
       System.out.println("word: "+word);
       word = "";
       System.out.println("-------------");
       break;
     case Tokenizer.WORD:
       word = tokenizer.getWord();
       break;
     case Tokenizer.QUOTED_WORD:
       word = tokenizer.getWord() + " (quoted)";
       break;
     case Tokenizer.DELIMITER:
       System.out.println("word: "+word);
       word = "";
       break;
     default:
       System.err.println("Unknown Token: "+token);
     }
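   }
   // EOF closes the last column, so print the word that is still pending
   System.out.println("word: "+word);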
 
In this example, if the delimiter is set to a comma, a line like column1,,,"column4,partofcolumn4" would be treated correctly.
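The two approaches described above (emit a value on every delimiter, or only on words) can be compared side by side without this library. The sketch below uses java.util.StringTokenizer with returnDelims set to true as a stand-in for the low level token stream; the class DelimiterModes and its methods are hypothetical names, and quoting is not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical sketch; StringTokenizer(returnDelims = true) stands in for nextToken().
final class DelimiterModes {

    private static boolean isDelimiter(String token, String delims) {
        return token.length() == 1 && delims.indexOf(token.charAt(0)) >= 0;
    }

    /** First approach: close a column on every delimiter and at end of input. */
    static List<String> keepEmpty(String line, String delims) {
        List<String> columns = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, delims, true);
        String word = "";
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (isDelimiter(token, delims)) {
                columns.add(word);   // may be empty: two delimiters in a row
                word = "";
            } else {
                word = token;
            }
        }
        columns.add(word);           // end of input closes the last column
        return columns;
    }

    /** Second approach: react to words only, so runs of delimiters collapse. */
    static List<String> skipEmpty(String line, String delims) {
        List<String> columns = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, delims, true);
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (!isDelimiter(token, delims)) {
                columns.add(token);
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        System.out.println(keepEmpty("text,,,another text", ",")); // prints [text, , , another text]
        System.out.println(skipEmpty("text,,,another text", ",")); // prints [text, another text]
    }
}
```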

This tokenizer uses the LF character as the end of line character. It ignores any CR characters, so it can be used in Windows environments as well.
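The line ending rule above can be sketched in a few lines: drop every CR and end a line at every LF, so that LF and CR LF input give the same result. This is only an illustration under that assumption; LineEndSketch and lines are hypothetical names and not part of this class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of "LF ends a line, CR is ignored".
final class LineEndSketch {

    static List<String> lines(String text) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r') {
                continue;                      // CR is ignored wherever it appears
            }
            if (c == '\n') {
                lines.add(current.toString()); // LF ends the current line
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            lines.add(current.toString());     // last line without a trailing LF
        }
        return lines;
    }

    public static void main(String[] args) {
        // Unix and Windows line endings produce the same two lines.
        System.out.println(lines("a,b\nc,d\n").equals(lines("a,b\r\nc,d\r\n"))); // prints true
    }
}
```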

Version:
$Revision: 1.4 $

Field Summary
protected  java.lang.StringBuffer buffer_
           
static int DELIMITER
           
protected  java.lang.String delimiters_
           
static int EOF
           
protected  boolean eof_reached_
           
static int EOL
           
protected  boolean eol_is_significant_
           
static int ERROR
           
protected  int escape_char_
           
protected  boolean escape_mode_
           
protected  int last_token_
           
protected  int line_count_
           
static int NOT_STARTED
           
protected  int quote_char_
           
static int QUOTED_WORD
           
protected  java.io.PushbackReader reader_
           
protected  boolean respect_escaped_chars_
           
protected  boolean respect_quoted_words_
           
static int WORD
           
 
Constructor Summary
Tokenizer(java.io.InputStream in_stream)
          Creates a tokenizer that reads from the given input stream.
Tokenizer(java.io.Reader reader)
          Creates a tokenizer that reads from the given reader.
Tokenizer(java.lang.String string)
          Creates a tokenizer that reads from the given string.
Tokenizer(java.lang.String string, java.lang.String delimiters)
          Creates a tokenizer that reads from the given string.
 
Method Summary
 void close()
          Closes the tokenizer (and the reader it uses internally).
 void eolIsSignificant(boolean significant)
          If set to true the end of line is signaled by the EOL token.
 int getDelimiter()
          Deprecated. Use the getDelimiters() method instead.
 java.lang.String getDelimiters()
          Get the delimiter characters.
 int getEscapeChar()
          Get the escape character.
 int getLastToken()
          Returns the last token that was returned from the nextToken() method.
 int getLineNumber()
          Returns the current line number of the reader.
 int getQuoteChar()
          Get the quote character.
 java.lang.String getWord()
          Returns the value of the token.
 boolean hasNextLine()
          Returns true, if the tokenizer can return another line.
protected  boolean isDelimiter(int c)
          Returns true, if the given character is seen as a delimiter.
protected  boolean isEndOfLine(int c)
          Returns true, if the given character is seen as an end of line character.
 boolean isEolSignificant()
          Returns true, if an EOL token is returned when an end of line is detected.
protected  boolean isEscapeChar(int c)
          Returns true, if the given character is seen as an escape character.
protected  boolean isQuoteChar(int c)
          Returns true, if the given character is seen as a quote character.
static void main(java.lang.String[] args)
           
 java.util.List nextLine()
          Returns a list of elements (Strings) from the next line of the tokenizer.
 int nextToken()
          Returns the next token from the reader.
protected  int readNextChar()
          Reads and returns the next character from the reader and checks for the escape character.
static java.util.List removeZeroLengthElements(java.util.List list)
          This helper method removes all zero length elements from the given list and returns it.
 boolean respectEscapedCharacters()
          Returns true, if escape character is respected.
 void respectEscapedCharacters(boolean respect_escaped)
          If escape characters should be respected, set the param to true.
 boolean respectQuotedWords()
          Returns true, if quoted words are respected.
 void respectQuotedWords(boolean respect_quotes)
          If quoted words should be respected, set the param to true.
 void setDelimiter(int delimiter_char)
          Set the delimiter character.
 void setDelimiters(java.lang.String delimiters)
          Set the delimiter characters.
 void setEscapeChar(int escape_char)
          Set the escape character.
 void setQuoteChar(int quote_char)
          Set the quote character.
protected static void testGeonetUTF8(java.lang.String[] args)
           
protected static void testHighLevel(java.lang.String[] args)
           
protected static void testHighLevelExample()
           
protected static void testLowLevel(java.lang.String[] args)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

reader_

protected java.io.PushbackReader reader_

buffer_

protected java.lang.StringBuffer buffer_

delimiters_

protected java.lang.String delimiters_

escape_char_

protected int escape_char_

quote_char_

protected int quote_char_

escape_mode_

protected boolean escape_mode_

eol_is_significant_

protected boolean eol_is_significant_

respect_escaped_chars_

protected boolean respect_escaped_chars_

respect_quoted_words_

protected boolean respect_quoted_words_

line_count_

protected int line_count_

eof_reached_

protected boolean eof_reached_

last_token_

protected int last_token_

EOF

public static final int EOF
See Also:
Constant Field Values

EOL

public static final int EOL
See Also:
Constant Field Values

WORD

public static final int WORD
See Also:
Constant Field Values

QUOTED_WORD

public static final int QUOTED_WORD
See Also:
Constant Field Values

DELIMITER

public static final int DELIMITER
See Also:
Constant Field Values

ERROR

public static final int ERROR
See Also:
Constant Field Values

NOT_STARTED

public static final int NOT_STARTED
See Also:
Constant Field Values
Constructor Detail

Tokenizer

public Tokenizer(java.lang.String string)
Creates a tokenizer that reads from the given string. It uses the comma as delimiter, does not respect escape characters but respects quoted words.


Tokenizer

public Tokenizer(java.lang.String string,
                 java.lang.String delimiters)
Creates a tokenizer that reads from the given string. All characters in the given delimiters string are used as delimiter. The tokenizer does not respect escape characters but respects quoted words.


Tokenizer

public Tokenizer(java.io.InputStream in_stream)
Creates a tokenizer that reads from the given input stream. It uses the comma as delimiter, does not respect escape characters but respects quoted words.


Tokenizer

public Tokenizer(java.io.Reader reader)
Creates a tokenizer that reads from the given reader. It uses the comma as delimiter, does not respect escape characters but respects quoted words.

Method Detail

setDelimiter

public void setDelimiter(int delimiter_char)
Set the delimiter character. The default is the comma.


getDelimiter

public int getDelimiter()
Deprecated. Use the getDelimiters() method instead.

Get the first delimiter character.


setDelimiters

public void setDelimiters(java.lang.String delimiters)
Set the delimiter characters. All characters in the delimiters are used as delimiter.


getDelimiters

public java.lang.String getDelimiters()
Get the delimiter characters.


setEscapeChar

public void setEscapeChar(int escape_char)
Set the escape character. The default is the backslash.


getEscapeChar

public int getEscapeChar()
Get the escape character.


respectEscapedCharacters

public void respectEscapedCharacters(boolean respect_escaped)
If escape characters should be respected, set the param to true. The default is to ignore escape characters.


respectEscapedCharacters

public boolean respectEscapedCharacters()
Returns true, if escape character is respected.


getQuoteChar

public int getQuoteChar()
Get the quote character.


setQuoteChar

public void setQuoteChar(int quote_char)
Set the quote character. The default is the double quote.


respectQuotedWords

public void respectQuotedWords(boolean respect_quotes)
If quoted words should be respected, set the param to true. The default is to respect quoted words.


respectQuotedWords

public boolean respectQuotedWords()
Returns true, if quoted words are respected.


eolIsSignificant

public void eolIsSignificant(boolean significant)
If set to true, the end of line is signaled by the EOL token. If set to false, the end of line is treated as a normal delimiter. The default value is true.


isEolSignificant

public boolean isEolSignificant()
Returns true, if an EOL token is returned when an end of line is detected. If false, the end of line is treated as a normal delimiter.


getLineNumber

public int getLineNumber()
Returns the current line number of the reader.


getWord

public java.lang.String getWord()
Returns the value of the token. If the token was of the type WORD, the word is returned.


getLastToken

public int getLastToken()
Returns the last token that was returned from the nextToken() method.


isDelimiter

protected boolean isDelimiter(int c)
Returns true, if the given character is seen as a delimiter. This method respects escape_mode, so if the escape character was found before, it has to act accordingly (usually, return false, even if the character is a delimiter).


isQuoteChar

protected boolean isQuoteChar(int c)
Returns true, if the given character is seen as a quote character. This method respects escape_mode, so if the escape character was found before, it has to act accordingly (usually, return false, even if the character is a quote character).


isEscapeChar

protected boolean isEscapeChar(int c)
Returns true, if the given character is seen as an escape character. This method respects escape_mode, so if the escape character was found before, it has to act accordingly (usually, return false, even if the character is an escape character).


isEndOfLine

protected boolean isEndOfLine(int c)
Returns true, if the given character is seen as an end of line character. This method respects end of line_mode, so if the end of line character was found before, it has to act accordingly (usually, return false, even if the character is an end of line character).


close

public void close()
           throws java.io.IOException
Closes the tokenizer (and the reader it uses internally).


readNextChar

protected int readNextChar()
                    throws java.io.IOException
Reads and returns the next character from the reader and checks for the escape character. If an escape character is read, a flag is set and the next character is read. A newline following the escape character is ignored.
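The escape handling described above (set a flag on the escape character, read the next character, drop an escaped newline) can be sketched as follows. This is an assumption about the behaviour based on the description only, not the source of readNextChar(): the class EscapeReadSketch and the method readAll are hypothetical, and the sketch works on a String and returns the whole result instead of reading one character at a time from a reader.

```java
// Hypothetical sketch of the escape rule; it flattens the per-character reading into one pass.
final class EscapeReadSketch {

    static String readAll(String input, char escape) {
        StringBuilder out = new StringBuilder();
        boolean escapeMode = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (!escapeMode && c == escape) {
                escapeMode = true;     // remember the escape and look at the next character
                continue;
            }
            if (escapeMode && c == '\n') {
                escapeMode = false;    // an escaped newline is ignored entirely
                continue;
            }
            out.append(c);             // ordinary or escaped character is kept as is
            escapeMode = false;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Input: a, backslash, newline, b, backslash, comma, c
        System.out.println(readAll("a\\\nb\\,c", '\\')); // prints ab,c
    }
}
```

A real tokenizer would also need to remember that the comma was escaped, so that it is not treated as a delimiter afterwards; the sketch leaves that out.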


nextToken

public int nextToken()
              throws java.io.IOException
Returns the next token from the reader. The token's value may be WORD, QUOTED_WORD, EOF, EOL, or DELIMITER. In the case of WORD or QUOTED_WORD, the actual word can be obtained with the getWord method.


hasNextLine

public boolean hasNextLine()
                    throws java.io.IOException
Returns true, if the tokenizer can return another line.


nextLine

public java.util.List nextLine()
                        throws java.io.IOException
Returns a list of elements (Strings) from the next line of the tokenizer. If there are multiple delimiters without any values in between, empty (zero length) strings are added to the list. They may be removed by the use of the removeZeroLengthElements(List) method.


removeZeroLengthElements

public static java.util.List removeZeroLengthElements(java.util.List list)
This helper method removes all zero length elements from the given list and returns it.
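A helper with the behaviour described above could look like the following sketch. It is not the source of removeZeroLengthElements: the class ZeroLengthSketch and the method removeZeroLength are hypothetical, the sketch assumes a mutable list of Strings (the real method takes a raw java.util.List), and it uses List.removeIf from Java 8.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; assumes a mutable List<String>.
final class ZeroLengthSketch {

    /** Removes all empty strings from the given list in place and returns the same list. */
    static List<String> removeZeroLength(List<String> list) {
        list.removeIf(String::isEmpty);
        return list;
    }

    public static void main(String[] args) {
        // The tokens of the first high level example: text,,,"another,text"
        List<String> tokens = new ArrayList<>(Arrays.asList("text", "", "", "another,text"));
        System.out.println(removeZeroLength(tokens).size()); // prints 2
    }
}
```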


testLowLevel

protected static void testLowLevel(java.lang.String[] args)

testHighLevel

protected static void testHighLevel(java.lang.String[] args)

testGeonetUTF8

protected static void testGeonetUTF8(java.lang.String[] args)

testHighLevelExample

protected static void testHighLevelExample()

main

public static void main(java.lang.String[] args)