org.htmlparser
Class Parser

java.lang.Object
  org.htmlparser.Parser
- All Implemented Interfaces:
- java.io.Serializable
- public class Parser
- extends java.lang.Object
- implements java.io.Serializable
This is the class that the user will use, either to get an iterator into the
HTML page or to parse the page directly and print the results.
Typical usage of the parser is as follows:
[1] Create a parser object, passing the URL and a feedback object to the
parser.
[2] Register the common scanners. See registerScanners().
You would not do this if you want to configure a custom lightweight parser. In
that case, add only the scanners of your choice using
addScanner(TagScanner).
[3] Enumerate through the elements from the parser object.
Note that parsing occurs ON DEMAND, as you enumerate. This is a thread-safe
approach, and you get control back only after a particular element has been
parsed and returned.
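The on-demand behaviour described above can be illustrated without the library. The following is a minimal sketch and not the htmlparser implementation: LazyNodeDemo is a hypothetical class that splits a string into tag and text pieces only when next() is called, and parsedCount is an illustrative counter showing that nothing is parsed ahead of the caller.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical illustration only; this class is not part of org.htmlparser.
class LazyNodeDemo implements Iterator<String> {
    private final String html;
    private int pos = 0;      // index of the next unread character
    int parsedCount = 0;      // number of pieces parsed so far

    LazyNodeDemo(String html) {
        this.html = html;
    }

    @Override
    public boolean hasNext() {
        return pos < html.length();
    }

    // Parses exactly one piece (a tag or a run of text) per call.
    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        int end;
        if (html.charAt(pos) == '<') {
            int close = html.indexOf('>', pos);
            end = (close < 0) ? html.length() : close + 1;
        } else {
            int open = html.indexOf('<', pos);
            end = (open < 0) ? html.length() : open;
        }
        String piece = html.substring(pos, end);
        pos = end;
        parsedCount++;
        return piece;
    }

    public static void main(String[] args) {
        LazyNodeDemo it = new LazyNodeDemo("<p>Hello</p>");
        while (it.hasNext()) {
            String piece = it.next();
            System.out.println(it.parsedCount + ": " + piece);
        }
    }
}
```

Because each call to next() does only the work needed for one piece, a caller that stops early never pays for the rest of the input.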
Below is some sample code to parse Yahoo.com and print all the tags.
Parser parser = new Parser("http://www.yahoo.com", new DefaultHTMLParserFeedback());
// In this example, we are registering all the common scanners
parser.registerScanners();
// The iterator variable is i, so the loop condition tests i
for (NodeIterator i = parser.elements(); i.hasMoreNodes();) {
    Node node = i.nextNode();
    node.print();
}
Below is some sample code to parse Yahoo.com and print only the text
information. This scanning will run faster, as there are no scanners
registered here.
Parser parser = new Parser("http://www.yahoo.com", new DefaultHTMLParserFeedback());
// In this example, none of the scanners need to be registered
// as a string node is not a tag to be scanned for.
for (NodeIterator i = parser.elements(); i.hasMoreNodes();) {
    Node node = i.nextNode();
    if (node instanceof StringNode) {
        StringNode stringNode = (StringNode) node;
        System.out.println(stringNode.getText());
    }
}
The above snippet will print out only the text contents in the HTML document.
Here's another snippet that will print out only the link URLs in a document. This is an example of adding a link scanner.
Parser parser = new Parser("http://www.yahoo.com", new DefaultHTMLParserFeedback());
// Register only the link scanner
parser.addScanner(new LinkScanner("-l"));
for (NodeIterator i = parser.elements(); i.hasMoreNodes();) {
    Node node = i.nextNode();
    if (node instanceof LinkTag) {
        LinkTag linkTag = (LinkTag) node;
        System.out.println(linkTag.getLink());
    }
}
| Field Summary | |
| protected java.lang.String | character_set - The encoding being used to decode the connection input stream. |
| protected static java.lang.String | CHARSET_STRING - Trigger for charset detection. |
| protected static java.lang.String | DEFAULT_CHARSET - The default charset. |
| protected org.htmlparser.util.ParserFeedback | feedback - Feedback object. |
| protected java.io.BufferedInputStream | input - The bytes extracted from the source. |
| static org.htmlparser.util.ParserFeedback | noFeedback - A quiet message sink. |
| private org.htmlparser.parserHelper.ParserHelper | parserHelper |
| protected NodeReader | reader - The html reader associated with this parser. |
| protected java.lang.String | resourceLocn - The URL or filename to be parsed. |
| private java.util.Map | scanners - The list of scanners to apply at the top level. |
| static org.htmlparser.util.ParserFeedback | stdout - A verbose message sink. |
| protected java.net.URLConnection | url_conn - The source for HTML. |
| static java.lang.String | VERSION_DATE - The date of the version. |
| static double | VERSION_NUMBER - The floating point version number. |
| static java.lang.String | VERSION_STRING - The display version. |
| static java.lang.String | VERSION_TYPE - The type of version. |
| Constructor Summary | |
| Parser() | Zero argument constructor. |
| Parser(NodeReader reader) | This constructor is present to enable users to plug in their own readers. |
| Parser(NodeReader rd, org.htmlparser.util.ParserFeedback fb) | This constructor enables the construction of test cases, with readers associated with test string buffers. |
| Parser(java.lang.String resourceLocn) | Creates a Parser object with the location of the resource (URL or file). |
| Parser(java.lang.String resourceLocn, org.htmlparser.util.ParserFeedback feedback) | Creates a Parser object with the location of the resource (URL or file). You would typically create a DefaultHTMLParserFeedback object and pass it in. |
| Parser(java.net.URLConnection connection) | Constructor for non-standard access. |
| Parser(java.net.URLConnection connection, org.htmlparser.util.ParserFeedback fb) | Constructor for custom HTTP access. |
| Method Summary | |
| void | addScanner(org.htmlparser.scanners.TagScanner scanner) - Add a new Tag Scanner. |
| protected java.io.InputStreamReader | createInputStreamReader() - Open a stream reader on the InputStream. |
| org.htmlparser.util.IteratorImpl | createIteratorImpl(boolean remove_scanner, org.htmlparser.util.IteratorImpl ret) |
| static Parser | createLinkRecognizingParser(java.lang.String inputHTML) |
| static Parser | createParser(java.lang.String inputHTML) - Creates the parser on an input string. |
| protected void | createReader() - Create a new reader for the URLConnection object. |
| org.htmlparser.util.NodeIterator | elements() - Returns an iterator (enumeration) to the html nodes. |
| Node[] | extractAllNodesThatAre(java.lang.Class nodeType) |
| void | flushScanners() - Flush the current scanners registered. |
| protected java.lang.String | getCharacterSet(java.net.URLConnection connection) - Try and extract the character set from the HTTP header. |
| protected java.lang.String | getCharset(java.lang.String content) - Get a CharacterSet name corresponding to a charset parameter. |
| java.net.URLConnection | getConnection() - Return the current connection. |
| java.lang.String | getEncoding() - The current encoding. |
| org.htmlparser.util.ParserFeedback | getFeedback() - Returns the feedback. |
| int | getNumScanners() - Get the number of scanners currently registered in the parser. |
| NodeReader | getReader() - Returns the reader associated with the parser. |
| org.htmlparser.scanners.TagScanner | getScanner(java.lang.String id) - Return the scanner registered in the parser having the given id. |
| java.util.Map | getScanners() - Get the map of scanners currently registered in the parser. |
| java.lang.String | getURL() - Return the current URL being parsed. |
| static java.lang.String | getVersion() - Return the version string of this parser. |
| static double | getVersionNumber() - Return the version number of this parser. |
| static void | main(java.lang.String[] args) - The main program, which can be executed from the command line. |
| void | parse(java.lang.String filter) - Parse the given resource, using the filter provided. |
| private void | readObject(java.io.ObjectInputStream in) |
| protected void | recreateReader() - Create a new reader for the URLConnection object but reuse the input stream. |
| void | registerDomScanners() - Make a call to registerDomScanners(), instead of registerScanners(), when you are interested in retrieving a Dom representation of the html page. |
| void | registerScanners() - This method should be invoked in order to register some common scanners. |
| void | removeScanner(org.htmlparser.scanners.TagScanner scanner) - Removes a specified scanner object. |
| void | setConnection(java.net.URLConnection connection) - Set the connection for this parser. |
| void | setEncoding(java.lang.String encoding) - Set the encoding for this parser. |
| void | setFeedback(org.htmlparser.util.ParserFeedback fb) - Sets the feedback object used in scanning. |
| void | setInputHTML(java.lang.String inputHTML) - Initializes the parser with the given input HTML String. |
| static void | setLineSeparator(java.lang.String lineSeparator) |
| void | setReader(NodeReader rd) - Set the reader for this parser. |
| void | setScanners(java.util.Map newScanners) - This method is to be used to change the set of scanners in the current parser. |
| void | setURL(java.lang.String url) - Set the URL for this parser. |
| void | visitAllNodesWith(org.htmlparser.visitors.NodeVisitor visitor) |
| private void | writeObject(java.io.ObjectOutputStream out) |
| Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
VERSION_NUMBER
public static final double VERSION_NUMBER
- The floating point version number.
- See Also:
- Constant Field Values
VERSION_TYPE
public static final java.lang.String VERSION_TYPE
- The type of version.
- See Also:
- Constant Field Values
VERSION_DATE
public static final java.lang.String VERSION_DATE
- The date of the version.
- See Also:
- Constant Field Values
VERSION_STRING
public static final java.lang.String VERSION_STRING
- The display version.
- See Also:
- Constant Field Values
DEFAULT_CHARSET
protected static final java.lang.String DEFAULT_CHARSET
- The default charset. This should be ISO-8859-1; see RFC 2616
(http://www.ietf.org/rfc/rfc2616.txt?number=2616), section 3.7.1. Another alias is "8859_1".
- See Also:
- Constant Field Values
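The alias mentioned for DEFAULT_CHARSET can be checked against the Java runtime. This is a small sketch using only java.nio.charset; CharsetAliasDemo and canonicalName are hypothetical names chosen for illustration and are not part of the parser.

```java
import java.nio.charset.Charset;

// Hypothetical helper; shows how the JDK resolves a charset alias.
class CharsetAliasDemo {
    // Returns the canonical name the runtime uses for the given charset name or alias.
    static String canonicalName(String nameOrAlias) {
        return Charset.forName(nameOrAlias).name();
    }

    public static void main(String[] args) {
        System.out.println(canonicalName("8859_1"));
        System.out.println(canonicalName("ISO-8859-1"));
    }
}
```

Both names are expected to resolve to the same charset, so either spelling can be passed to APIs that take a charset name.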
CHARSET_STRING
protected static final java.lang.String CHARSET_STRING
- Trigger for charset detection.
- See Also:
- Constant Field Values
feedback
protected org.htmlparser.util.ParserFeedback feedback
- Feedback object.
resourceLocn
protected java.lang.String resourceLocn
- The URL or filename to be parsed.
reader
protected transient NodeReader reader
- The html reader associated with this parser.
scanners
private java.util.Map scanners
- The list of scanners to apply at the top level.
character_set
protected java.lang.String character_set
- The encoding being used to decode the connection input stream.
url_conn
protected transient java.net.URLConnection url_conn
- The source for HTML.
input
protected transient java.io.BufferedInputStream input
- The bytes extracted from the source.
noFeedback
public static org.htmlparser.util.ParserFeedback noFeedback
- A quiet message sink. Use this for no feedback.
stdout
public static org.htmlparser.util.ParserFeedback stdout
- A verbose message sink. Use this for output on
System.out.
parserHelper
private org.htmlparser.parserHelper.ParserHelper parserHelper
| Constructor Detail |
Parser
public Parser()
- Zero argument constructor. The parser is in a safe but useless state. Set
the reader or connection using setReader() or setConnection().
Parser
public Parser(NodeReader rd, org.htmlparser.util.ParserFeedback fb)
- This constructor enables the construction of test cases, with readers
associated with test string buffers. It can also be used with readers of
the user's choice streaming data into the parser. Important:
If you are using this constructor, and you would like to use the parser
to parse multiple times (multiple calls to parser.elements()), you must
ensure the following:
- Before the first parse, you must mark the reader for a length that you anticipate (the size of the stream).
- After the first parse, each call to elements() must be preceded by a call
to:
parser.getReader().reset();
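The mark-then-reset requirement follows the standard java.io contract. The sketch below uses a plain BufferedReader rather than NodeReader, so it only assumes that NodeReader behaves like other buffered readers in this respect; MarkResetDemo and readTwice are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical illustration of mark()/reset(); uses java.io only.
class MarkResetDemo {
    // Reads everything that remains in the reader.
    private static String drain(BufferedReader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    // Consumes the same reader twice: mark before the first pass, reset before the second.
    static String[] readTwice(String html) {
        try {
            BufferedReader reader = new BufferedReader(new StringReader(html));
            // The read-ahead limit must exceed the amount read before reset().
            reader.mark(html.length() + 1);
            String first = drain(reader);
            reader.reset();
            String second = drain(reader);
            return new String[] { first, second };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] passes = readTwice("<html><body>hi</body></html>");
        System.out.println(passes[0].equals(passes[1]));
    }
}
```

If the mark limit is smaller than the amount actually read, reset() may fail, which is why the text asks you to anticipate the size of the stream.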
Parser
public Parser(java.net.URLConnection connection, org.htmlparser.util.ParserFeedback fb) throws org.htmlparser.util.ParserException
- Constructor for custom HTTP access.
Parser
public Parser(java.lang.String resourceLocn, org.htmlparser.util.ParserFeedback feedback) throws org.htmlparser.util.ParserException
- Creates a Parser object with the location of the resource (URL or file).
You would typically create a DefaultHTMLParserFeedback object and pass it
in.
Parser
public Parser(java.lang.String resourceLocn) throws org.htmlparser.util.ParserException
- Creates a Parser object with the location of the resource (URL or file).
A DefaultHTMLParserFeedback object is used for feedback.
Parser
public Parser(NodeReader reader)
- This constructor is present to enable users to plug in their own readers.
A DefaultHTMLParserFeedback object is used for feedback. It can also be
used with readers of the user's choice streaming data into the parser.
Important: If you are using this constructor, and you would like
to use the parser to parse multiple times (multiple calls to
parser.elements()), you must ensure the following:
- Before the first parse, you must mark the reader for a length that you anticipate (the size of the stream).
- After the first parse, each call to elements() must be preceded by a call
to:
parser.getReader().reset();
Parser
public Parser(java.net.URLConnection connection) throws org.htmlparser.util.ParserException
- Constructor for non-standard access. A DefaultHTMLParserFeedback object
is used for feedback.
| Method Detail |
setLineSeparator
public static void setLineSeparator(java.lang.String lineSeparator)
getVersion
public static java.lang.String getVersion()
- Return the version string of this parser.
getVersionNumber
public static double getVersionNumber()
- Return the version number of this parser.
writeObject
private void writeObject(java.io.ObjectOutputStream out) throws java.io.IOException
readObject
private void readObject(java.io.ObjectInputStream in) throws java.io.IOException, java.lang.ClassNotFoundException
setConnection
public void setConnection(java.net.URLConnection connection) throws org.htmlparser.util.ParserException
- Set the connection for this parser. This method sets four of the fields
in the parser object: resourceLocn, url_conn, character_set and reader. It
does not adjust the scanners list or feedback object. The four fields are set
atomically by this method: either they are all set or none of them is set.
Trying to set the connection to null is a no-op.
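One common way to get the all-or-nothing behaviour described here is to compute every new value into local variables and assign the fields only after nothing has thrown. The sketch below shows that pattern in isolation. It is an assumption about how such a setter could be written, not the parser's actual code, and AllOrNothingDemo, its two fields and the whitespace check are all hypothetical.

```java
// Hypothetical illustration of an all-or-nothing setter; not htmlparser code.
class AllOrNothingDemo {
    private String location = "initial";
    private String charset = "ISO-8859-1";

    String getLocation() { return location; }
    String getCharset() { return charset; }

    // Hypothetical derivation step that can fail for some inputs.
    private static String deriveCharset(String loc) {
        if (loc.contains(" ")) {
            throw new IllegalArgumentException("location must not contain spaces");
        }
        return loc.endsWith(".utf8.html") ? "UTF-8" : "ISO-8859-1";
    }

    void setLocation(String newLocation) {
        if (newLocation == null) {
            return; // null is a no-op
        }
        // Step 1: compute into locals; the fields are untouched if this throws.
        String loc = newLocation;
        String cs = deriveCharset(loc);
        // Step 2: nothing threw, so commit both fields together.
        location = loc;
        charset = cs;
    }

    public static void main(String[] args) {
        AllOrNothingDemo demo = new AllOrNothingDemo();
        demo.setLocation("page.utf8.html");
        System.out.println(demo.getLocation() + " " + demo.getCharset());
    }
}
```

Note that this pattern protects against partial updates after a failure; it does not by itself make the setter safe for concurrent callers.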
getConnection
public java.net.URLConnection getConnection()
- Return the current connection.
setURL
public void setURL(java.lang.String url) throws org.htmlparser.util.ParserException
- Set the URL for this parser. This method sets four of the fields in the
parser object: resourceLocn, url_conn, character_set and reader. It does not
adjust the scanners list or feedback object. Trying to set the URL to null or
an empty string is a no-op.
getURL
public java.lang.String getURL()
- Return the current URL being parsed.
setEncoding
public void setEncoding(java.lang.String encoding) throws org.htmlparser.util.ParserException
- Set the encoding for this parser. If there is no connection
(getConnection() returns null), it simply sets the character set name
stored in the parser. (Note: the reader object, which must have been set in
the constructor or by setReader(), may or may not be using this character
set.) Otherwise (getConnection() doesn't return null), it does this by
reopening the input stream of the connection and creating a reader that uses
this character set. In this case, this method sets two of the fields in the
parser object: character_set and reader. It does not adjust resourceLocn,
url_conn, scanners or feedback. The two fields are set atomically by this
method: either they are both set or neither is set. Trying to set the
encoding to null or an empty string is a no-op.
getEncoding
public java.lang.String getEncoding()
- The current encoding. This item is set from the HTTP header but may be
overridden by meta tags in the head, so this may change after the head
has been parsed.
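The precedence described here (a meta tag in the head overrides the HTTP header) can be sketched as follows. This is an illustration under stated assumptions, not the parser's detection logic: MetaCharsetDemo, effectiveEncoding and the regular expression are hypothetical, and the pattern handles only simple meta tags.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; the real parser may detect meta charsets differently.
class MetaCharsetDemo {
    // Matches charset=VALUE inside a <meta ...> tag, with optional quotes.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([\\w.:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the meta tag's charset if one is present, else the header's charset.
    static String effectiveEncoding(String headerCharset, String head) {
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : headerCharset;
    }

    public static void main(String[] args) {
        String head = "<head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=UTF-8\"></head>";
        System.out.println(effectiveEncoding("ISO-8859-1", head));
        System.out.println(effectiveEncoding("ISO-8859-1", "<head></head>"));
    }
}
```

Code that caches the result of getEncoding() before the head has been parsed may therefore hold a stale value.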
setReader
public void setReader(NodeReader rd)
- Set the reader for this parser. This method sets four of the fields in
the parser object: resourceLocn, url_conn, character_set and reader. It does
not adjust the scanners list or feedback object. The url_conn is set to null
since this cannot be determined from the reader. The character_set is set to
the default character set since this cannot be determined from the reader.
Trying to set the reader to null is a no-op.
getReader
public NodeReader getReader()
- Returns the reader associated with the parser
getNumScanners
public int getNumScanners()
- Get the number of scanners currently registered in the parser.
setScanners
public void setScanners(java.util.Map newScanners)
- This method is to be used to change the set of scanners in the current
parser.
getScanners
public java.util.Map getScanners()
- Get the map of scanners currently registered in the parser.
setFeedback
public void setFeedback(org.htmlparser.util.ParserFeedback fb)
- Sets the feedback object used in scanning.
getFeedback
public org.htmlparser.util.ParserFeedback getFeedback()
- Returns the feedback.
createInputStreamReader
protected java.io.InputStreamReader createInputStreamReader() throws java.io.UnsupportedEncodingException
- Open a stream reader on the InputStream. Revise the character set to its
default value if an UnsupportedEncodingException is thrown.
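The fallback described above can be sketched with the standard InputStreamReader constructor, which throws UnsupportedEncodingException for an unknown charset name. This is a minimal illustration, not the parser's code: EncodingFallbackDemo, open and decode are hypothetical names, and ISO-8859-1 is used as the default because the DEFAULT_CHARSET description names it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of falling back to a default charset.
class EncodingFallbackDemo {
    // Opens a reader with the requested charset, or ISO-8859-1 if it is unsupported.
    static InputStreamReader open(InputStream in, String charsetName) {
        try {
            return new InputStreamReader(in, charsetName);
        } catch (UnsupportedEncodingException e) {
            return new InputStreamReader(in, StandardCharsets.ISO_8859_1);
        }
    }

    // Decodes all bytes using open(), so the fallback applies here too.
    static String decode(byte[] bytes, String charsetName) {
        try (InputStreamReader reader = open(new ByteArrayInputStream(bytes), charsetName)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = { 0x63, 0x61, 0x66, (byte) 0xE9 };
        // "x-no-such-charset" is a made-up name that the runtime does not support.
        System.out.println(decode(latin1, "x-no-such-charset"));
    }
}
```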
createReader
protected void createReader()
throws java.io.IOException
- Create a new reader for the URLConnection object. The current character
set is used to transform the input stream into a character reader.
recreateReader
protected void recreateReader()
throws java.io.IOException
- Create a new reader for the URLConnection object but reuse the input
stream. The current character set is used to transform the input stream
into a character reader. Defaults to createReader() if there is no existing
input stream.
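Reusing one input stream under a different character set can be sketched with BufferedInputStream, the type of the input field. The example below is an assumption about the general technique (mark the stream, decode, reset, decode again), not a copy of recreateReader(); RecreateReaderDemo and decodeTwice are hypothetical names.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical illustration of re-decoding the same buffered bytes.
class RecreateReaderDemo {
    // Reads everything that remains in the reader.
    private static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    // Decodes the same stream twice, once per charset, without reopening the source.
    static String[] decodeTwice(byte[] bytes, String firstCharset, String secondCharset) {
        try {
            BufferedInputStream input =
                    new BufferedInputStream(new ByteArrayInputStream(bytes));
            input.mark(bytes.length + 1); // remember the start of the stream
            String first = readAll(new InputStreamReader(input, Charset.forName(firstCharset)));
            input.reset();                // rewind the bytes, then build a new reader
            String second = readAll(new InputStreamReader(input, Charset.forName(secondCharset)));
            return new String[] { first, second };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = { (byte) 0xC3, (byte) 0xA9 };
        String[] decoded = decodeTwice(bytes, "ISO-8859-1", "UTF-8");
        System.out.println(decoded[0].length() + " chars, then " + decoded[1].length());
    }
}
```

The same two bytes decode to two characters under ISO-8859-1 and to one under UTF-8, which is why the reader has to be rebuilt when the encoding changes.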
getCharacterSet
protected java.lang.String getCharacterSet(java.net.URLConnection connection)
- Try and extract the character set from the HTTP header.
getCharset
protected java.lang.String getCharset(java.lang.String content)
- Get a CharacterSet name corresponding to a charset parameter.
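Extracting a charset parameter from a Content-Type value can be sketched as below. This is illustrative only: ContentTypeCharsetDemo and charsetOf are hypothetical names, the literal "charset=" is an assumed trigger string (the value of CHARSET_STRING is not given here), and the fallback to ISO-8859-1 follows the DEFAULT_CHARSET description.

```java
import java.util.Locale;

// Hypothetical illustration; not the parser's getCharset() implementation.
class ContentTypeCharsetDemo {
    static final String DEFAULT = "ISO-8859-1";
    private static final String TRIGGER = "charset="; // assumed trigger string

    // Returns the charset parameter of a Content-Type value, or the default.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return DEFAULT;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith(TRIGGER)) {
                String value = p.substring(TRIGGER.length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                if (!value.isEmpty()) {
                    return value;
                }
            }
        }
        return DEFAULT;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8"));
        System.out.println(charsetOf("text/html"));
    }
}
```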
addScanner
public void addScanner(org.htmlparser.scanners.TagScanner scanner)
- Add a new Tag Scanner. In typical situations where you require a
no-frills parser, use the registerScanners() method to add the most
common scanners. When you wish to compose a parser with only certain
scanners registered, use this method instead. It is advantageous to
register only the scanners you want, in order to achieve faster parsing
speed. This method is also of use when you have developed custom
scanners and need to register them with the parser.
elements
public org.htmlparser.util.NodeIterator elements() throws org.htmlparser.util.ParserException
- Returns an iterator (enumeration) to the HTML nodes. Each node can be a
tag, end tag, string, link or image.
This is perhaps the most important method of this class. In typical situations, you will need to use the parser like this:
Parser parser = new Parser("http://www.yahoo.com");
parser.registerScanners();
for (NodeIterator i = parser.elements(); i.hasMoreNodes();) {
    Node node = i.nextNode();
    if (node instanceof StringNode) {
        // Downcasting to StringNode
        StringNode stringNode = (StringNode) node;
        // Do whatever processing you want with the string node
        System.out.println(stringNode.getText());
    }
    // Check for the node or tag that you want
    if (node instanceof ...) {
        // Downcast, and process
    }
}
createIteratorImpl
public org.htmlparser.util.IteratorImpl createIteratorImpl(boolean remove_scanner, org.htmlparser.util.IteratorImpl ret) throws org.htmlparser.util.ParserException
flushScanners
public void flushScanners()
- Flush the current scanners registered. The registered scanners list
becomes empty with this call.
getScanner
public org.htmlparser.scanners.TagScanner getScanner(java.lang.String id)
- Return the scanner registered in the parser having the given id
parse
public void parse(java.lang.String filter) throws java.lang.Exception
- Parse the given resource, using the filter provided
registerScanners
public void registerScanners()
- This method should be invoked in order to register some common scanners.
The scanners that get added are :
LinkScanner (filter key "-l")
HTMLImageScanner (filter key "-i")
HTMLScriptScanner (filter key "-s")
HTMLStyleScanner (filter key "-t")
HTMLJspScanner (filter key "-j")
HTMLAppletScanner (filter key "-a")
HTMLMetaTagScanner (filter key "-m")
HTMLTitleScanner (filter key "-t")
HTMLDoctypeScanner (filter key "-d")
HTMLFormScanner (filter key "-f")
HTMLFrameSetScanner(filter key "-r")
HTMLBaseHREFScanner(filter key "-b")
Call this method after creating the Parser object, e.g.
Parser parser = new Parser("http://www.yahoo.com");
parser.registerScanners();
registerDomScanners
public void registerDomScanners()
- Make a call to registerDomScanners(), instead of registerScanners(), when
you are interested in retrieving a Dom representation of the html page.
Upon parsing, you will receive an Html object - which will contain
children, one of which would be the body. This is still evolving, and in
future releases, you might see consolidation of Html - to provide you
with methods to access the body and the head.
removeScanner
public void removeScanner(org.htmlparser.scanners.TagScanner scanner)
- Removes a specified scanner object. You can create an anonymous object as
a parameter. This method will use the scanner's key and remove it from
the registry of scanners. e.g.
removeScanner(new FormScanner(""));
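Removal by key explains why a freshly created, otherwise empty scanner object is enough to unregister an existing one. The sketch below models the registry as a map keyed by an id. It assumes, without confirming, that the parser's private scanners map works along these lines; ScannerRegistryDemo, DemoScanner and the "FORM" id are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a registry keyed by scanner id.
class ScannerRegistryDemo {
    // Stand-in for a tag scanner; only the id matters for registration.
    static final class DemoScanner {
        final String id;
        final String filter;

        DemoScanner(String id, String filter) {
            this.id = id;
            this.filter = filter;
        }
    }

    private final Map<String, DemoScanner> scanners = new HashMap<>();

    void addScanner(DemoScanner scanner) {
        scanners.put(scanner.id, scanner);
    }

    // Only the key of the argument is used, so any object with the same id works.
    void removeScanner(DemoScanner scanner) {
        scanners.remove(scanner.id);
    }

    DemoScanner getScanner(String id) {
        return scanners.get(id);
    }

    int getNumScanners() {
        return scanners.size();
    }

    void flushScanners() {
        scanners.clear();
    }

    public static void main(String[] args) {
        ScannerRegistryDemo registry = new ScannerRegistryDemo();
        registry.addScanner(new DemoScanner("FORM", "-f"));
        registry.removeScanner(new DemoScanner("FORM", ""));
        System.out.println(registry.getNumScanners());
    }
}
```

A consequence of keying by id is that registering two scanners with the same id keeps only the most recent one.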
main
public static void main(java.lang.String[] args)
- The main program, which can be executed from the command line
visitAllNodesWith
public void visitAllNodesWith(org.htmlparser.visitors.NodeVisitor visitor) throws org.htmlparser.util.ParserException
setInputHTML
public void setInputHTML(java.lang.String inputHTML)
- Initializes the parser with the given input HTML String.
extractAllNodesThatAre
public Node[] extractAllNodesThatAre(java.lang.Class nodeType) throws org.htmlparser.util.ParserException
createParser
public static Parser createParser(java.lang.String inputHTML)
- Creates the parser on an input string.
createLinkRecognizingParser
public static Parser createLinkRecognizingParser(java.lang.String inputHTML)