org.apache.nutch.crawl
public class: CrawlDbMerger
java.lang.Object
   org.apache.hadoop.conf.Configured
      org.apache.nutch.crawl.CrawlDbMerger

All Implemented Interfaces:
    org.apache.hadoop.util.Tool

This tool merges several CrawlDbs into one, optionally filtering URLs through the current URLFilters to skip prohibited pages.

The tool can also be used for filtering alone; in that case, specify only one CrawlDb in the arguments.

If more than one CrawlDb contains information about the same URL, only the most recent version is retained, as determined by the value of org.apache.nutch.crawl.CrawlDatum#getFetchTime(). However, metadata from all versions is accumulated, with newer values taking precedence over older values.
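
The merge rule described above can be illustrated with a minimal, standalone sketch. It does not use the Nutch classes: Entry is a hypothetical stand-in for a CrawlDatum (a fetch time plus a metadata map), and MergeSketch and mergeVersions are illustrative names, not part of the Nutch API. How the actual tool breaks ties between equal fetch times is not stated here; this sketch simply lets the later list element win.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MergeSketch {

    /** Hypothetical stand-in for a CrawlDatum: a fetch time plus metadata. */
    static final class Entry {
        final long fetchTime;
        final Map<String, String> metadata;

        Entry(long fetchTime, Map<String, String> metadata) {
            this.fetchTime = fetchTime;
            this.metadata = metadata;
        }
    }

    /**
     * Merges all versions of one URL's entry: the result carries the newest
     * fetch time, and metadata from every version, with newer values
     * overwriting older ones for the same key.
     */
    static Entry mergeVersions(List<Entry> versions) {
        if (versions.isEmpty()) {
            throw new IllegalArgumentException("at least one version is required");
        }
        // Sort oldest first (stable sort), so later putAll calls overwrite older values.
        List<Entry> sorted = new ArrayList<>(versions);
        sorted.sort(Comparator.comparingLong((Entry e) -> e.fetchTime));

        Map<String, String> merged = new HashMap<>();
        for (Entry e : sorted) {
            merged.putAll(e.metadata);
        }
        Entry newest = sorted.get(sorted.size() - 1);
        return new Entry(newest.fetchTime, merged);
    }

    public static void main(String[] args) {
        Map<String, String> oldMeta = new HashMap<>();
        oldMeta.put("a", "old");
        oldMeta.put("b", "keep");
        Map<String, String> newMeta = new HashMap<>();
        newMeta.put("a", "new");

        List<Entry> versions = new ArrayList<>();
        versions.add(new Entry(200L, newMeta));
        versions.add(new Entry(100L, oldMeta));

        Entry result = mergeVersions(versions);
        System.out.println(result.fetchTime);         // prints 200
        System.out.println(result.metadata.get("a")); // prints new
        System.out.println(result.metadata.get("b")); // prints keep
    }
}
```

In this sketch the key "b" survives even though it appears only in the older version, while "a" takes its value from the newer one, mirroring the accumulation rule stated above.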

Nested Class Summary:
public static class CrawlDbMerger.Merger
Constructor:
 public CrawlDbMerger() 
 public CrawlDbMerger(Configuration conf) 
Method from org.apache.nutch.crawl.CrawlDbMerger Summary:
createMergeJob,   main,   merge,   run
Methods from java.lang.Object:
equals,   getClass,   hashCode,   notify,   notifyAll,   toString,   wait,   wait,   wait
Method from org.apache.nutch.crawl.CrawlDbMerger Detail:
 public static JobConf createMergeJob(Configuration conf,
    Path output,
    boolean normalize,
    boolean filter) 
 public static void main(String[] args) throws Exception 
 public void merge(Path output,
    Path[] dbs,
    boolean normalize,
    boolean filter) throws Exception 
 public int run(String[] args) throws Exception