org.apache.nutch.parse
public interface Parser

All Superinterfaces:
    org.apache.hadoop.conf.Configurable, Pluggable

All Known Implementing Classes:
    ZipParser, ExtParser, MP3Parser, MSWordParser, RSSParser, SWFParser, PdfParser, FeedParser, MSExcelParser, RTFParseFactory, MSBaseParser, HtmlParser, JSParseFilter, OOParser, MSPowerPointParser, TextParser

A parser for content generated by an org.apache.nutch.protocol.Protocol implementation. This interface is implemented by extensions; Nutch's core contains no page-parsing code.
Field Summary
public static final String X_POINT_ID    The name of the extension point.
Method from org.apache.nutch.parse.Parser Summary:
getParse
Method from org.apache.nutch.parse.Parser Detail:
 public ParseResult getParse(Content c)

    This method parses the given content and returns a map of <key, parse> pairs. Each Parse instance will be persisted under its key.
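
The <key, parse> shape can be illustrated with plain Java collections. This is only a sketch: SimpleParse, KeyedParseSketch, and the "url#index" key scheme are hypothetical stand-ins invented for illustration, not Nutch classes or Nutch's actual keying. It assumes a parser that splits one fetched document into several sub-documents and stores each under its own key.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a parse: only a title and the extracted text.
final class SimpleParse {
    final String title;
    final String text;

    SimpleParse(String title, String text) {
        this.title = title;
        this.text = text;
    }
}

final class KeyedParseSketch {
    // Hypothetical parser: stores the whole document under its URL and
    // each blank-line-separated part under "url#index" (an assumed scheme).
    static Map<String, SimpleParse> parse(String url, String content) {
        Map<String, SimpleParse> result = new LinkedHashMap<>();
        result.put(url, new SimpleParse("whole document", content));
        String[] parts = content.split("\\n\\s*\\n");
        for (int i = 0; i < parts.length; i++) {
            result.put(url + "#" + i, new SimpleParse("part " + i, parts[i].trim()));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, SimpleParse> result = parse("foo.bar.com/doc.txt", "first\n\nsecond");
        // Each entry would be persisted under its own key.
        for (Map.Entry<String, SimpleParse> e : result.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().title);
        }
    }
}
```

The point of the map is that a caller looks up a parse by key rather than assuming one parse per fetched document.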

    Note: Meta-redirects should be followed only when they come from the original URL. For example:
    Assume the fetcher is in parsing mode and is currently processing foo.bar.com/redirect.html. If this page contains a meta-redirect to another URL, the fetcher should follow the redirect only if the map contains an entry of the form <"foo.bar.com/redirect.html", Parse with a ParseStatus indicating the redirect>.
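
The redirect rule in the note can be sketched as a lookup keyed by the URL being fetched. This is a minimal illustration under stated assumptions: MetaRedirectSketch, ParseEntry, and redirectToFollow are hypothetical names, and a nullable redirectTarget field stands in for a ParseStatus that indicates a redirect. It is not the fetcher's actual code.

```java
import java.util.HashMap;
import java.util.Map;

final class MetaRedirectSketch {
    // Hypothetical stand-in for a Parse with its status.
    // redirectTarget is null when the parse reports no meta-redirect.
    static final class ParseEntry {
        final String redirectTarget;

        ParseEntry(String redirectTarget) {
            this.redirectTarget = redirectTarget;
        }

        boolean isRedirect() {
            return redirectTarget != null;
        }
    }

    // Returns the URL to follow, or null if no redirect should be followed.
    // Only the entry stored under the originally fetched URL is consulted;
    // redirects reported under any other key are ignored.
    static String redirectToFollow(String fetchedUrl, Map<String, ParseEntry> parseResult) {
        ParseEntry entry = parseResult.get(fetchedUrl);
        if (entry == null || !entry.isRedirect()) {
            return null;
        }
        return entry.redirectTarget;
    }

    public static void main(String[] args) {
        String fetched = "foo.bar.com/redirect.html";

        Map<String, ParseEntry> fromOriginal = new HashMap<>();
        fromOriginal.put(fetched, new ParseEntry("foo.bar.com/target.html"));

        Map<String, ParseEntry> fromOtherKey = new HashMap<>();
        fromOtherKey.put(fetched, new ParseEntry(null));
        fromOtherKey.put("foo.bar.com/embedded.html", new ParseEntry("foo.bar.com/elsewhere.html"));

        System.out.println(redirectToFollow(fetched, fromOriginal));
        System.out.println(redirectToFollow(fetched, fromOtherKey));
    }
}
```

In the second map the redirect is recorded under a different key, so the lookup for the fetched URL finds no redirect and nothing is followed.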