org.htmlparser: Javadoc index of package org.htmlparser.
Package Samples:
org.htmlparser.visitors: The basic API classes which will be used by most users when working with the html parser (the Parser class is the most important one in this).
org.htmlparser.tags.data: The tags package contains tag types that are created mostly by the scanners.
org.htmlparser.tests.visitorsTests: This package contains testcases for the html package.
org.htmlparser.beans
org.htmlparser.parserHelper
org.htmlparser.parserapplications
org.htmlparser.scanners
org.htmlparser.tags
org.htmlparser.tests.codeMetrics
org.htmlparser.tests.parserHelperTests
org.htmlparser.tests.scannersTests
org.htmlparser.tests.tagTests
org.htmlparser.tests.utilTests
org.htmlparser.tests
org.htmlparser.util
Classes:
Parser: This is the class that the user will use, either to get an iterator into the HTML page or to directly parse the page and print the results. Typical usage of the parser is as follows: [1] Create a parser object, passing the URL and a feedback object to the parser. [2] Register the common scanners; see registerScanners(). You wouldn't do this if you want to configure a custom lightweight parser. In that case, you would add the scanners of your choice using addScanner(TagScanner). [3] Enumerate through the elements from the parser object. It is important to note that the parsing occurs when you ...
CompositeTagScanner: To create your own scanner that can hold children, create a subclass of this class. The composite tag scanner can be configured with: [1] tags which will trigger a match; [2] tags which, when encountered before a legal end tag, should force a correction; [3] preventing more tags of its own type from appearing as children. Here are examples of each. Tags which will trigger a match: if we wish to recognize <mytag>, MyScanner extends CompositeTagScanner { private static final String [] MATCH_IDS = { "MYTAG" }; MyScanner() { super(MATCH_IDS); } ... } Tags which force correction: if we wish to insert end tags ...
CommandLine: Simple command line parser/handler. A dashed argument is one preceded by a dash character. In a sequence of arguments: 1) If a dashed argument starts with a command character, the rest of the argument, if any, is assumed to be a value. 2) If a dashed argument is followed by a non-dashed argument value, the value is assumed to be associated with the preceding dashed argument name. 3) If an argument with a dash prefix is not followed by a non-dashed value, and does not use a command character, it is assumed to be a flag. 4) If none of the above is true, the argument is a name. Command characters can ...
TagScanner: TagScanner is an abstract superclass which is subclassed to create specific scanners that operate on a tag's strings, identify it, and can extract data from it. If you wish to write your own scanner, then you must implement scan(). You MAY implement evaluate() as well, if your evaluation logic is not based on a simple text match. You MUST implement getID(), which identifies your scanner uniquely in the hashtable of scanners. Also, you have a feedback object provided to you, should you want to send log messages. This object is instantiated by Parser when a scanner is added to its collection.
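The evaluate/scan/getID contract described for TagScanner can be illustrated with a small, self-contained sketch. The classes below (SketchScanner, SketchImageScanner, SketchScannerRegistry) are hypothetical stand-ins written for this illustration; they are not the library's classes, and the real signatures may differ.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a scanner base class; NOT the library's TagScanner.
abstract class SketchScanner {
    // Unique key under which the scanner is registered.
    abstract String getID();

    // Extraction step, run only after evaluate() has accepted the tag text.
    abstract String scan(String tagText);

    // Default evaluation: a simple case-insensitive prefix match on the tag name.
    // A subclass may override this when a plain text match is not enough.
    boolean evaluate(String tagText) {
        return tagText.trim().toUpperCase(Locale.ROOT).startsWith(getID());
    }
}

// Example scanner: accepts IMG tags and extracts the value of the src attribute.
class SketchImageScanner extends SketchScanner {
    @Override
    String getID() { return "IMG"; }

    @Override
    String scan(String tagText) {
        int start = tagText.indexOf("src=\"");
        if (start < 0) return null;
        start += 5; // skip past src="
        int end = tagText.indexOf('"', start);
        return end < 0 ? null : tagText.substring(start, end);
    }
}

// Hypothetical registry playing the role of the parser's scanner table.
class SketchScannerRegistry {
    private final Map<String, SketchScanner> scanners = new LinkedHashMap<>();

    void addScanner(SketchScanner s) { scanners.put(s.getID(), s); }

    // Ask each scanner to evaluate the text; the first one that accepts it scans it.
    String process(String tagText) {
        for (SketchScanner s : scanners.values()) {
            if (s.evaluate(tagText)) return s.scan(tagText);
        }
        return null;
    }
}
```

In this sketch the registry keys scanners by getID(), which is why the ID has to be unique; the prefix match in evaluate() is deliberately naive and would also accept a tag name that merely begins with the ID.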
StringBean: Extract strings from a URL. Text within <SCRIPT></SCRIPT> tags is removed. The text within <PRE></PRE> tags is not altered. The property Strings, which is the output property, is null until a URL is set. So a typical usage is: StringBean sb = new StringBean(); sb.setLinks(false); sb.setReplaceNonBreakingSpaces(true); sb.setCollapse(true); sb.setURL("http://www.netbeans.org"); // the HTTP is performed here String s = sb.getStrings();
Generate: Create a character reference translation class source file. Usage: java -classpath .:lib/htmlparser.jar Generate > Translate.java Derived from HTMLStringFilter.java, provided as an example with the htmlparser.jar file available at htmlparser.sourceforge.net, written by Somik Raha (somik@industriallogic.com, http://industriallogic.com).
BenchmarkTidy: Title: Apache Jakarta JMeter Copyright: Copyright (c) Apache Company: Apache License: The license is at the top! Description: This is a quick class to benchmark tidy against htmlparser. It is pretty basic and uses the same process as the original image parsing code in JMeter 1.9.0 and earlier. Author: pete Version: 0.1 Created on: Sep 30, 2003 Last Modified: 7:41:39 AM
Translate: Translate numeric character references and character entity references to Unicode characters. Based on tables found at http://www.w3.org/TR/REC-html40/sgml/entities.html. Note: Do not edit! This class is created by the Generate class. Typical usage: String s = Translate.decode(getTextFromHtmlPage());
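To make the idea of decoding character references concrete, here is a minimal, self-contained sketch. ReferenceDecoderSketch is a hypothetical class written for this illustration, it is not the generated Translate class, and it knows only five named entities rather than the full HTML 4.0 tables.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; the generated Translate class covers far more entities.
class ReferenceDecoderSketch {
    private static final Map<String, Character> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", '&');
        NAMED.put("lt", '<');
        NAMED.put("gt", '>');
        NAMED.put("quot", '"');
        NAMED.put("nbsp", '\u00a0');
    }

    // Replaces &name;, &#123; and &#x7B; style references; unknown ones are left as-is.
    static String decode(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            int semi = (c == '&') ? text.indexOf(';', i + 1) : -1;
            String decoded = (semi > i + 1)
                    ? decodeReference(text.substring(i + 1, semi))
                    : null;
            if (decoded != null) {
                out.append(decoded);
                i = semi + 1; // skip the whole reference including ';'
            } else {
                out.append(c); // not a recognized reference: copy the character
                i++;
            }
        }
        return out.toString();
    }

    // Returns the decoded text for the part between '&' and ';', or null if unknown.
    private static String decodeReference(String body) {
        if (body.startsWith("#")) {
            try {
                boolean hex = body.length() > 1
                        && (body.charAt(1) == 'x' || body.charAt(1) == 'X');
                int codePoint = hex
                        ? Integer.parseInt(body.substring(2), 16)
                        : Integer.parseInt(body.substring(1));
                if (!Character.isValidCodePoint(codePoint)) return null;
                return new String(Character.toChars(codePoint));
            } catch (NumberFormatException e) {
                return null; // malformed numeric reference
            }
        }
        Character named = NAMED.get(body);
        return named == null ? null : String.valueOf(named);
    }
}
```

Leaving unrecognized references untouched is a design choice of this sketch; the library's own behavior for unknown entities is not stated in the summary above.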
BulletScanner: This scanner is created by BulletListScanner. It shares a stack to maintain the parent-child relationship with BulletListScanner. The rules implemented are : [1] A <ul> can have <li> under it [2] A <li> can have <ul> under it [3] A <li> cannot have <li> under it These rules are implemented easily through the shared stack.
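The three nesting rules can be sketched with a single stack of open tags. BulletNestingSketch below is a hypothetical, self-contained illustration of the idea; it does not reproduce how BulletScanner and BulletListScanner actually share their stack.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration of stack-based parent/child tracking for <ul> and <li>.
class BulletNestingSketch {
    // Takes tag names such as "ul", "li", "/ul" and returns one "parent>child"
    // entry per opened tag; tags opened at top level get the parent "ROOT".
    static List<String> nest(String... tags) {
        Deque<String> open = new ArrayDeque<>();
        List<String> relations = new ArrayList<>();
        for (String tag : tags) {
            if (tag.startsWith("/")) {
                String name = tag.substring(1);
                // Close everything up to and including the matching open tag.
                if (open.contains(name)) {
                    while (!open.pop().equals(name)) {
                        // keep popping unclosed children
                    }
                }
            } else {
                // Rule 3: an <li> cannot contain an <li>, so close the open one first.
                if (tag.equals("li") && "li".equals(open.peek())) {
                    open.pop();
                }
                String parent = open.isEmpty() ? "ROOT" : open.peek();
                relations.add(parent + ">" + tag);
                open.push(tag);
            }
        }
        return relations;
    }
}
```

For example, the sequence ul, li, li, ul, li makes the second li a sibling of the first (both children of the outer ul), while the inner ul becomes a child of that second li.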
BgSoundScanner: Scans for the bgsound Tag. This is a subclass of TagScanner, and is called using a variant of the template method. If the evaluate() method returns true, that means the given string contains a bgsound tag. Extraction is done by the scan method thereafter by the user of this class.
FormScanner: Scans for the Form Tag. This is a subclass of TagScanner, and is called using a variant of the template method. If the evaluate() method returns true, that means the given string contains a form tag. Extraction is done by the scan method thereafter by the user of this class.
FrameScanner: Scans for the Frame Tag. This is a subclass of TagScanner, and is called using a variant of the template method. If the evaluate() method returns true, that means the given string contains a frame tag. Extraction is done by the scan method thereafter by the user of this class.
FrameSetScanner: Scans for the FrameSet Tag. This is a subclass of TagScanner, and is called using a variant of the template method. If the evaluate() method returns true, that means the given string contains a frameset tag. Extraction is done by the scan method thereafter by the user of this class.
ImageScanner: Scans for the Image Tag. This is a subclass of TagScanner, and is called using a variant of the template method. If the evaluate() method returns true, that means the given string contains an image tag. Extraction is done by the scan method thereafter by the user of this class.
LinkScanner: Scans for the Link Tag. This is a subclass of TagScanner, and is called using a variant of the template method. If the evaluate() method returns true, that means the given string contains a link tag. Extraction is done by the scan method thereafter by the user of this class.
BenchmarkP: Title: Apache Jakarta JMeter Copyright: Copyright (c) Apache Company: Apache License: The license is at the top! Description: Author: pete Version: 0.1 Created on: Sep 30, 2003 Last Modified: 4:45:28 PM
ParserFeedback: Interface for providing feedback without forcing the output destination to be predefined. A default implementation is provided to output events to the console but alternate implementations that log, watch for specific messages, etc. are also possible.
SimpleNodeIterator: The HTMLSimpleEnumeration interface is similar to NodeIterator, except that it does not throw exceptions. This interface is useful when using HTMLVector, to enumerate through its elements in a simple manner, without needing to do class casts for Node.
Tag: Tag represents a generic tag. This class allows users to register specific tag scanners, which can identify links, or image references. This tag asks the scanners to run over the text, and identify. It can be used to dynamically configure a parser.
TextExtractingVisitor: Extracts text from a web page. Usage: Parser parser = new Parser(...); TextExtractingVisitor visitor = new TextExtractingVisitor(); parser.visitAllNodesWith(visitor); String textInPage = visitor.getExtractedText();
AttributeParser: To change this generated comment edit the template variable "typecomment": Window>Preferences>Java>Templates. To enable and disable the creation of type comments go to Window>Preferences>Java>Code Generation.
DefaultParserFeedback: Default implementation of the ParserFeedback interface. This implementation prints output to the console, but users can implement their own classes to support alternate behavior.
MailRipper: MailRipper will rip out all the mail addresses from a given web page. Pass a web site (or HTML file on your local disk) as an argument.
LinkExtractor: LinkExtractor extracts all the links from the given webpage and prints them on standard output.
Robot: The Robot Crawler application will crawl through URLs recursively, based on a depth value.
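Depth-limited recursive crawling can be sketched without any network access by walking an in-memory link table. DepthCrawlSketch is a hypothetical illustration; the link map, the method names, and the reading of "depth" as the number of link hops allowed are all assumptions, not details of the Robot application.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; a real crawler would fetch pages and extract their links.
class DepthCrawlSketch {
    // Returns pages in visit order. 'links' maps a URL to the URLs it links to.
    // 'depth' is the number of link hops allowed from the start page (assumption).
    static List<String> crawl(Map<String, List<String>> links, String start, int depth) {
        List<String> visited = new ArrayList<>();
        visit(links, start, depth, visited);
        return visited;
    }

    private static void visit(Map<String, List<String>> links, String url,
                              int depth, List<String> visited) {
        visited.add(url);
        if (depth == 0) return; // hop budget exhausted: do not follow links
        List<String> next = links.getOrDefault(url, Collections.<String>emptyList());
        for (String target : next) {
            visit(links, target, depth - 1, visited);
        }
    }
}
```

This sketch does not remember pages it has already seen, so a page can appear more than once in the result; the depth limit alone is what guarantees termination when links form a cycle.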