org.htmlparser
Class Parser

java.lang.Object
  extended by org.htmlparser.Parser
All Implemented Interfaces:
java.io.Serializable

public class Parser
extends java.lang.Object
implements java.io.Serializable

This is the class that the user will use, either to get an iterator into the HTML page or to parse the page directly and print the results.
Typical usage of the parser is as follows:
[1] Create a parser object, passing the URL and a feedback object to the parser.
[2] Register the common scanners. See registerScanners().
You would not do this if you want to configure a custom lightweight parser. In that case, add the scanners of your choice using addScanner(TagScanner).
[3] Enumerate through the elements from the parser object.
It is important to note that parsing occurs ON DEMAND, as you enumerate. This approach is thread-safe, and control returns to you only after a particular element has been parsed and returned.
Below is some sample code to parse Yahoo.com and print all the tags.

 Parser parser = new Parser("http://www.yahoo.com", new DefaultHTMLParserFeedback());
 // In this example, we are registering all the common scanners
 parser.registerScanners();
 for (NodeIterator i = parser.elements(); i.hasMoreNodes();) {
 	Node node = i.nextNode();
 	node.print();
 }
 
Below is some sample code to parse Yahoo.com and print only the text information. This scanning will run faster, as there are no scanners registered here.
 Parser parser = new Parser("http://www.yahoo.com", new DefaultHTMLParserFeedback());
 // In this example, none of the scanners need to be registered
 // as a string node is not a tag to be scanned for.
 for (NodeIterator i = parser.elements(); i.hasMoreNodes();) {
 	Node node = i.nextNode();
 	if (node instanceof StringNode) {
 		StringNode stringNode = (StringNode) node;
 		System.out.println(stringNode.getText());
 	}
 }
 
The above snippet will print out only the text contents of the HTML document.
Here's another snippet that will print out only the link URLs in a document. This is an example of adding a link scanner.
 Parser parser = new Parser("http://www.yahoo.com", new DefaultHTMLParserFeedback());
 parser.addScanner(new LinkScanner("-l"));
 for (NodeIterator i = parser.elements(); i.hasMoreNodes();) {
 	Node node = i.nextNode();
 	if (node instanceof LinkTag) {
 		LinkTag linkTag = (LinkTag) node;
 		System.out.println(linkTag.getLink());
 	}
 }
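
The on-demand behaviour noted in the class description can be illustrated without the library. The following is a minimal JDK-only sketch: LazyTokenIterator and its whitespace tokenizing are hypothetical stand-ins, not part of org.htmlparser. It only shows that no input is scanned until the caller asks for the next element.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical stand-in for an on-demand node iterator; not part of org.htmlparser.
class LazyTokenIterator implements Iterator<String> {
    private final String input;
    private int pos = 0;       // how far the input has been consumed
    private int produced = 0;  // how many tokens have been handed out

    LazyTokenIterator(String input) {
        this.input = input;
    }

    // Advance past whitespace so pos sits on the start of the next token, if any.
    private void skipWhitespace() {
        while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
            pos++;
        }
    }

    @Override
    public boolean hasNext() {
        skipWhitespace();
        return pos < input.length();
    }

    // Scans exactly one token; the remainder of the input is left untouched.
    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        int start = pos;
        while (pos < input.length() && !Character.isWhitespace(input.charAt(pos))) {
            pos++;
        }
        produced++;
        return input.substring(start, pos);
    }

    int producedSoFar() {
        return produced;
    }

    int consumedChars() {
        return pos;
    }

    public static void main(String[] args) {
        LazyTokenIterator it = new LazyTokenIterator("<html> hello <b>");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

Because the work happens inside next(), a caller that stops early never pays for scanning the rest of the input.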
 


Field Summary
protected  java.lang.String character_set
          The encoding being used to decode the connection input stream.
protected static java.lang.String CHARSET_STRING
          Trigger for charset detection.
protected static java.lang.String DEFAULT_CHARSET
          The default charset.
protected  org.htmlparser.util.ParserFeedback feedback
          Feedback object.
protected  java.io.BufferedInputStream input
          The bytes extracted from the source.
static org.htmlparser.util.ParserFeedback noFeedback
          A quiet message sink.
private  org.htmlparser.parserHelper.ParserHelper parserHelper
           
protected  NodeReader reader
          The html reader associated with this parser.
protected  java.lang.String resourceLocn
          The URL or filename to be parsed.
private  java.util.Map scanners
          The list of scanners to apply at the top level.
static org.htmlparser.util.ParserFeedback stdout
          A verbose message sink.
protected  java.net.URLConnection url_conn
          The source for HTML.
static java.lang.String VERSION_DATE
          The date of the version.
static double VERSION_NUMBER
          The floating point version number.
static java.lang.String VERSION_STRING
          The display version.
static java.lang.String VERSION_TYPE
          The type of version.
 
Constructor Summary
Parser()
          Zero argument constructor.
Parser(NodeReader reader)
          This constructor is present to enable users to plug in their own readers.
Parser(NodeReader rd, org.htmlparser.util.ParserFeedback fb)
          This constructor enables the construction of test cases, with readers associated with test string buffers.
Parser(java.lang.String resourceLocn)
          Creates a Parser object with the location of the resource (URL or file).
Parser(java.lang.String resourceLocn, org.htmlparser.util.ParserFeedback feedback)
          Creates a Parser object with the location of the resource (URL or file). You would typically create a DefaultHTMLParserFeedback object and pass it in.
Parser(java.net.URLConnection connection)
          Constructor for non-standard access.
Parser(java.net.URLConnection connection, org.htmlparser.util.ParserFeedback fb)
          Constructor for custom HTTP access.
 
Method Summary
 void addScanner(org.htmlparser.scanners.TagScanner scanner)
          Add a new Tag Scanner.
protected  java.io.InputStreamReader createInputStreamReader()
          Open a stream reader on the InputStream.
 org.htmlparser.util.IteratorImpl createIteratorImpl(boolean remove_scanner, org.htmlparser.util.IteratorImpl ret)
           
static Parser createLinkRecognizingParser(java.lang.String inputHTML)
           
static Parser createParser(java.lang.String inputHTML)
          Creates the parser on an input string.
protected  void createReader()
          Create a new reader for the URLConnection object.
 org.htmlparser.util.NodeIterator elements()
          Returns an iterator (enumeration) to the html nodes.
 Node[] extractAllNodesThatAre(java.lang.Class nodeType)
           
 void flushScanners()
          Flush the current scanners registered.
protected  java.lang.String getCharacterSet(java.net.URLConnection connection)
          Try and extract the character set from the HTTP header.
protected  java.lang.String getCharset(java.lang.String content)
          Get a CharacterSet name corresponding to a charset parameter.
 java.net.URLConnection getConnection()
          Return the current connection.
 java.lang.String getEncoding()
          The current encoding.
 org.htmlparser.util.ParserFeedback getFeedback()
          Returns the feedback.
 int getNumScanners()
          Get the number of scanners currently registered in the parser.
 NodeReader getReader()
          Returns the reader associated with the parser
 org.htmlparser.scanners.TagScanner getScanner(java.lang.String id)
          Return the scanner registered in the parser having the given id
 java.util.Map getScanners()
          Get the map of scanners currently registered in the parser.
 java.lang.String getURL()
          Return the current URL being parsed.
static java.lang.String getVersion()
          Return the version string of this parser.
static double getVersionNumber()
          Return the version number of this parser.
static void main(java.lang.String[] args)
          The main program, which can be executed from the command line
 void parse(java.lang.String filter)
          Parse the given resource, using the filter provided
private  void readObject(java.io.ObjectInputStream in)
           
protected  void recreateReader()
          Create a new reader for the URLConnection object but reuse the input stream.
 void registerDomScanners()
          Make a call to registerDomScanners(), instead of registerScanners(), when you are interested in retrieving a Dom representation of the html page.
 void registerScanners()
          This method should be invoked in order to register some common scanners.
 void removeScanner(org.htmlparser.scanners.TagScanner scanner)
          Removes a specified scanner object.
 void setConnection(java.net.URLConnection connection)
          Set the connection for this parser.
 void setEncoding(java.lang.String encoding)
          Set the encoding for this parser.
 void setFeedback(org.htmlparser.util.ParserFeedback fb)
          Sets the feedback object used in scanning.
 void setInputHTML(java.lang.String inputHTML)
          Initializes the parser with the given input HTML String.
static void setLineSeparator(java.lang.String lineSeparator)
           
 void setReader(NodeReader rd)
          Set the reader for this parser.
 void setScanners(java.util.Map newScanners)
          This method is to be used to change the set of scanners in the current parser.
 void setURL(java.lang.String url)
          Set the URL for this parser.
 void visitAllNodesWith(org.htmlparser.visitors.NodeVisitor visitor)
           
private  void writeObject(java.io.ObjectOutputStream out)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

VERSION_NUMBER

public static final double VERSION_NUMBER
The floating point version number.

See Also:
Constant Field Values

VERSION_TYPE

public static final java.lang.String VERSION_TYPE
The type of version.

See Also:
Constant Field Values

VERSION_DATE

public static final java.lang.String VERSION_DATE
The date of the version.

See Also:
Constant Field Values

VERSION_STRING

public static final java.lang.String VERSION_STRING
The display version.

See Also:
Constant Field Values

DEFAULT_CHARSET

protected static final java.lang.String DEFAULT_CHARSET
The default charset. This should be ISO-8859-1, see RFC 2616 (http://www.ietf.org/rfc/rfc2616.txt?number=2616) section 3.7.1 Another alias is "8859_1".

See Also:
Constant Field Values

CHARSET_STRING

protected static final java.lang.String CHARSET_STRING
Trigger for charset detection.

See Also:
Constant Field Values

feedback

protected org.htmlparser.util.ParserFeedback feedback
Feedback object.


resourceLocn

protected java.lang.String resourceLocn
The URL or filename to be parsed.


reader

protected transient NodeReader reader
The html reader associated with this parser.


scanners

private java.util.Map scanners
The list of scanners to apply at the top level.


character_set

protected java.lang.String character_set
The encoding being used to decode the connection input stream.


url_conn

protected transient java.net.URLConnection url_conn
The source for HTML.


input

protected transient java.io.BufferedInputStream input
The bytes extracted from the source.


noFeedback

public static org.htmlparser.util.ParserFeedback noFeedback
A quiet message sink. Use this for no feedback.


stdout

public static org.htmlparser.util.ParserFeedback stdout
A verbose message sink. Use this for output on System.out.


parserHelper

private org.htmlparser.parserHelper.ParserHelper parserHelper
Constructor Detail

Parser

public Parser()
Zero argument constructor. The parser is in a safe but useless state. Set the reader or connection using setReader() or setConnection().


Parser

public Parser(NodeReader rd,
              org.htmlparser.util.ParserFeedback fb)
This constructor enables the construction of test cases, with readers associated with test string buffers. It can also be used with readers of the user's choice streaming data into the parser.

Important: If you are using this constructor, and you would like to use the parser to parse multiple times (multiple calls to parser.elements()), you must ensure the following:

  • Before the first parse, you must mark the reader for a length that you anticipate (the size of the stream).
  • After the first parse, calls to elements() must be preceded by calls to :
     parser.getReader().reset();
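
The mark-then-reset requirement above follows the general java.io.Reader contract. As a minimal JDK-only sketch, assume a BufferedReader over a StringReader stands in for the NodeReader (whose own mark/reset behaviour is not shown on this page); MarkResetSketch and its methods are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper; it only demonstrates Reader.mark()/reset(), not NodeReader itself.
class MarkResetSketch {

    // Read everything that is left in the reader.
    static String drain(BufferedReader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    // Read the same content twice: mark before the first pass, reset before the second.
    static String[] readTwice(String html) {
        try (BufferedReader reader = new BufferedReader(new StringReader(html))) {
            // The read-ahead limit must cover everything read before reset(),
            // including the final read that detects end of stream.
            reader.mark(html.length() + 1);
            String first = drain(reader);
            reader.reset();
            String second = drain(reader);
            return new String[] { first, second };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] passes = readTwice("<html><body>hi</body></html>");
        System.out.println(passes[0].equals(passes[1]));
    }
}
```

If the mark limit is smaller than the amount read, BufferedReader may invalidate the mark and reset() then fails, which is why the anticipated stream size matters.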
     


Parser

public Parser(java.net.URLConnection connection,
              org.htmlparser.util.ParserFeedback fb)
       throws org.htmlparser.util.ParserException
Constructor for custom HTTP access.


Parser

public Parser(java.lang.String resourceLocn,
              org.htmlparser.util.ParserFeedback feedback)
       throws org.htmlparser.util.ParserException
Creates a Parser object with the location of the resource (URL or file). You would typically create a DefaultHTMLParserFeedback object and pass it in.


Parser

public Parser(java.lang.String resourceLocn)
       throws org.htmlparser.util.ParserException
Creates a Parser object with the location of the resource (URL or file). A DefaultHTMLParserFeedback object is used for feedback.


Parser

public Parser(NodeReader reader)
This constructor is present to enable users to plug in their own readers. A DefaultHTMLParserFeedback object is used for feedback. It can also be used with readers of the user's choice streaming data into the parser.

Important: If you are using this constructor, and you would like to use the parser to parse multiple times (multiple calls to parser.elements()), you must ensure the following:

  • Before the first parse, you must mark the reader for a length that you anticipate (the size of the stream).
  • After the first parse, calls to elements() must be preceded by calls to :
     parser.getReader().reset();
     

Parser

public Parser(java.net.URLConnection connection)
       throws org.htmlparser.util.ParserException
Constructor for non-standard access. A DefaultHTMLParserFeedback object is used for feedback.

Method Detail

setLineSeparator

public static void setLineSeparator(java.lang.String lineSeparator)

getVersion

public static java.lang.String getVersion()
Return the version string of this parser.


getVersionNumber

public static double getVersionNumber()
Return the version number of this parser.


writeObject

private void writeObject(java.io.ObjectOutputStream out)
                  throws java.io.IOException

readObject

private void readObject(java.io.ObjectInputStream in)
                 throws java.io.IOException,
                        java.lang.ClassNotFoundException
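
The reader, url_conn and input fields are declared transient in Field Detail, so default serialization skips them. Private writeObject/readObject hooks are the standard JDK mechanism for dealing with such fields. The sketch below is a generic illustration only: ResourceHolder is a hypothetical class, and what Parser actually does inside its own hooks is not documented on this page.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical class; not part of org.htmlparser.
class ResourceHolder implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String content;           // serialized normally
    private transient StringReader reader;  // skipped by default serialization

    ResourceHolder(String content) {
        this.content = content;
        this.reader = new StringReader(content);
    }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject(); // writes the non-transient fields only
    }

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();              // restores content
        reader = new StringReader(content);  // rebuild the transient field
    }

    boolean hasReader() {
        return reader != null;
    }

    String content() {
        return content;
    }

    // Serialize to a byte array and read the object back.
    static ResourceHolder roundTrip(ResourceHolder original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (ResourceHolder) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```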

setConnection

public void setConnection(java.net.URLConnection connection)
                   throws org.htmlparser.util.ParserException
Set the connection for this parser. This method sets four of the fields in the parser object: resourceLocn, url_conn, character_set and reader. It does not adjust the scanners list or feedback object. The four fields are set atomically by this method: either they are all set or none of them is set. Trying to set the connection to null is a no-op.
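
One common way to get the all-or-nothing behaviour described for setConnection() is to compute every new value into local variables first and assign the fields only once nothing can fail any more. The sketch below assumes that pattern; SourceState, update() and the two-field layout are hypothetical and are not taken from the library's source.

```java
import java.net.URI;
import java.nio.charset.Charset;

// Hypothetical holder; field names echo the ones listed above, but this is not the library's code.
class SourceState {
    private String resourceLocn;                 // null until the first successful update
    private String characterSet = "ISO-8859-1";

    // Either both fields change or neither does. A null location is a no-op.
    void update(String location, String charsetName) {
        if (location == null) {
            return;
        }
        // Step 1: validate into locals. An exception here leaves the fields untouched.
        String newLocn = URI.create(location).toString();        // IllegalArgumentException if malformed
        String newCharset = Charset.forName(charsetName).name(); // IllegalArgumentException subclass if unknown

        // Step 2: commit. Plain assignments cannot fail.
        resourceLocn = newLocn;
        characterSet = newCharset;
    }

    String resourceLocn() {
        return resourceLocn;
    }

    String characterSet() {
        return characterSet;
    }
}
```

A failed update therefore never leaves the object with a new location paired with an old character set.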


getConnection

public java.net.URLConnection getConnection()
Return the current connection.


setURL

public void setURL(java.lang.String url)
            throws org.htmlparser.util.ParserException
Set the URL for this parser. This method sets four of the fields in the parser object: resourceLocn, url_conn, character_set and reader. It does not adjust the scanners list or feedback object. Trying to set the URL to null or an empty string is a no-op.


getURL

public java.lang.String getURL()
Return the current URL being parsed.


setEncoding

public void setEncoding(java.lang.String encoding)
                 throws org.htmlparser.util.ParserException
Set the encoding for this parser. If there is no connection (getConnection() returns null), it simply sets the character set name stored in the parser. (Note: the reader object, which must have been set in the constructor or by setReader(), may or may not be using this character set.) Otherwise (getConnection() does not return null), it reopens the input stream of the connection and creates a reader that uses this character set. In this case, the method sets two of the fields in the parser object: character_set and reader. It does not adjust resourceLocn, url_conn, scanners or feedback. The two fields are set atomically by this method: either they are both set or neither is set. Trying to set the encoding to null or an empty string is a no-op.


getEncoding

public java.lang.String getEncoding()
The current encoding. This item is set from the HTTP header but may be overridden by meta tags in the head, so it may change after the head has been parsed.


setReader

public void setReader(NodeReader rd)
Set the reader for this parser. This method sets four of the fields in the parser object: resourceLocn, url_conn, character_set and reader. It does not adjust the scanners list or feedback object. The url_conn is set to null since this cannot be determined from the reader. The character_set is set to the default character set since this cannot be determined from the reader. Trying to set the reader to null is a no-op.


getReader

public NodeReader getReader()
Returns the reader associated with the parser


getNumScanners

public int getNumScanners()
Get the number of scanners currently registered in the parser.


setScanners

public void setScanners(java.util.Map newScanners)
This method is to be used to change the set of scanners in the current parser.


getScanners

public java.util.Map getScanners()
Get the map of scanners currently registered in the parser.


setFeedback

public void setFeedback(org.htmlparser.util.ParserFeedback fb)
Sets the feedback object used in scanning.


getFeedback

public org.htmlparser.util.ParserFeedback getFeedback()
Returns the feedback.


createInputStreamReader

protected java.io.InputStreamReader createInputStreamReader()
                                                     throws java.io.UnsupportedEncodingException
Open a stream reader on the InputStream. Revise the character set to its default value if an UnsupportedEncodingException is thrown.
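
The description above leaves open whether the method retries after revising the character set. The JDK-only sketch below shows one plausible reading, in which the stored name falls back to ISO-8859-1 and the reader is opened with that default. ReaderFallbackSketch and its methods are hypothetical names, not the library's implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of "fall back to the default charset"; not the library's code.
class ReaderFallbackSketch {
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    private String characterSet;

    ReaderFallbackSketch(String characterSet) {
        this.characterSet = characterSet;
    }

    // Open a reader with the stored charset, reverting to the default if the JVM does not support it.
    InputStreamReader open(InputStream in) {
        try {
            return new InputStreamReader(in, characterSet);
        } catch (UnsupportedEncodingException e) {
            characterSet = DEFAULT_CHARSET; // revise the stored name
            return new InputStreamReader(in, StandardCharsets.ISO_8859_1);
        }
    }

    // Decode a byte array using open(), for demonstration purposes.
    String decode(byte[] bytes) {
        try (InputStreamReader reader = open(new ByteArrayInputStream(bytes))) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    String characterSet() {
        return characterSet;
    }
}
```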


createReader

protected void createReader()
                     throws java.io.IOException
Create a new reader for the URLConnection object. The current character set is used to transform the input stream into a character reader.


recreateReader

protected void recreateReader()
                       throws java.io.IOException
Create a new reader for the URLConnection object but reuse the input stream. The current character set is used to transform the input stream into a character reader. Defaults to createReader() if there is no existing input stream.


getCharacterSet

protected java.lang.String getCharacterSet(java.net.URLConnection connection)
Try and extract the character set from the HTTP header.


getCharset

protected java.lang.String getCharset(java.lang.String content)
Get a CharacterSet name corresponding to a charset parameter.
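
To make the header-based detection concrete, here is a minimal sketch of pulling a charset parameter out of a Content-Type value and falling back to ISO-8859-1, the default named under DEFAULT_CHARSET. The class name, the regular expression and the method charsetOf() are assumptions for illustration; the library's actual parsing rules and the value of CHARSET_STRING are not shown on this page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not the library's getCharset()/getCharacterSet() implementation.
class CharsetParamSketch {
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    // Matches charset=value or charset="value", case-insensitively.
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("charset\\s*=\\s*\"?([^\";\\s]+)", Pattern.CASE_INSENSITIVE);

    // Return the charset named in a Content-Type value, or the default if none is present.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return DEFAULT_CHARSET;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        return m.find() ? m.group(1) : DEFAULT_CHARSET;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8"));
        System.out.println(charsetOf("text/html"));
    }
}
```

The same extraction could be applied to the content attribute of a meta tag, which is how a value from the document head could later override the one from the HTTP header.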


addScanner

public void addScanner(org.htmlparser.scanners.TagScanner scanner)
Add a new Tag Scanner. In typical situations where you require a no-frills parser, use the registerScanners() method to add the most common scanners. When you wish to compose a parser with only certain scanners registered, use this method instead. Registering only the scanners you want gives faster parsing. This method is also useful when you have developed custom scanners and need to register them with the parser.


elements

public org.htmlparser.util.NodeIterator elements()
                                          throws org.htmlparser.util.ParserException
Returns an iterator (enumeration) to the HTML nodes. Each node can be a tag, end tag, string, link or image.
This is perhaps the most important method of this class. In typical situations, you will need to use the parser like this:
 
  Parser parser = new Parser("http://www.yahoo.com");
  parser.registerScanners();
  for (NodeIterator i = parser.elements(); i.hasMoreNodes();) {
     Node node = i.nextNode();
     if (node instanceof StringNode) {
       // Downcasting to StringNode
       StringNode stringNode = (StringNode)node;
       // Do whatever processing you want with the string node
       System.out.println(stringNode.getText());
     }
     // Check for the node or tag that you want
     if (node instanceof ...) {
       // Downcast, and process
     }
  }
  
 


createIteratorImpl

public org.htmlparser.util.IteratorImpl createIteratorImpl(boolean remove_scanner,
                                                           org.htmlparser.util.IteratorImpl ret)
                                                    throws org.htmlparser.util.ParserException

flushScanners

public void flushScanners()
Flush the current scanners registered. The registered scanners list becomes empty with this call.


getScanner

public org.htmlparser.scanners.TagScanner getScanner(java.lang.String id)
Return the scanner registered in the parser having the given id


parse

public void parse(java.lang.String filter)
           throws java.lang.Exception
Parse the given resource, using the filter provided


registerScanners

public void registerScanners()
This method should be invoked in order to register some common scanners. The scanners that get added are :
LinkScanner (filter key "-l")
HTMLImageScanner (filter key "-i")
HTMLScriptScanner (filter key "-s")
HTMLStyleScanner (filter key "-t")
HTMLJspScanner (filter key "-j")
HTMLAppletScanner (filter key "-a")
HTMLMetaTagScanner (filter key "-m")
HTMLTitleScanner (filter key "-t")
HTMLDoctypeScanner (filter key "-d")
HTMLFormScanner (filter key "-f")
HTMLFrameSetScanner (filter key "-r")
HTMLBaseHREFScanner (filter key "-b")

Call this method after creating the Parser object. e.g.
 Parser parser = new Parser("http://www.yahoo.com");
 parser.registerScanners();
 


registerDomScanners

public void registerDomScanners()
Make a call to registerDomScanners(), instead of registerScanners(), when you are interested in retrieving a DOM representation of the HTML page. Upon parsing, you will receive an Html object, which will contain children, one of which would be the body. This is still evolving, and in future releases you might see consolidation of Html to provide methods to access the body and the head.


removeScanner

public void removeScanner(org.htmlparser.scanners.TagScanner scanner)
Removes a specified scanner object. You can create an anonymous object as a parameter. This method will use the scanner's key and remove it from the registry of scanners. e.g.
 removeScanner(new FormScanner(""));
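
Removal by key, as described above, means that any scanner carrying the same key can be used to unregister the stored one, which is why a throwaway instance works. The sketch below models that with a plain map; ScannerRegistrySketch and TagScannerStub are hypothetical stand-ins, and how a real TagScanner derives its key is not shown on this page.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry modelled on the method names above; not the library's code.
class ScannerRegistrySketch {

    // Minimal stand-in for a scanner: it only carries the id used as the registry key.
    static final class TagScannerStub {
        private final String id;

        TagScannerStub(String id) {
            this.id = id;
        }

        String id() {
            return id;
        }
    }

    private final Map<String, TagScannerStub> scanners = new HashMap<>();

    void addScanner(TagScannerStub scanner) {
        scanners.put(scanner.id(), scanner);
    }

    // Looks up by key, so the argument need not be the instance that was registered.
    void removeScanner(TagScannerStub scanner) {
        scanners.remove(scanner.id());
    }

    TagScannerStub getScanner(String id) {
        return scanners.get(id);
    }

    int getNumScanners() {
        return scanners.size();
    }

    // Empties the registry, leaving no scanners registered.
    void flushScanners() {
        scanners.clear();
    }
}
```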
 


main

public static void main(java.lang.String[] args)
The main program, which can be executed from the command line


visitAllNodesWith

public void visitAllNodesWith(org.htmlparser.visitors.NodeVisitor visitor)
                       throws org.htmlparser.util.ParserException

setInputHTML

public void setInputHTML(java.lang.String inputHTML)
Initializes the parser with the given input HTML String.


extractAllNodesThatAre

public Node[] extractAllNodesThatAre(java.lang.Class nodeType)
                              throws org.htmlparser.util.ParserException

createParser

public static Parser createParser(java.lang.String inputHTML)
Creates the parser on an input string.


createLinkRecognizingParser

public static Parser createLinkRecognizingParser(java.lang.String inputHTML)