org.apache.nutch.crawl
public class: LinkDbMerger
java.lang.Object
   org.apache.hadoop.conf.Configured
      org.apache.nutch.crawl.LinkDbMerger

All Implemented Interfaces:
    org.apache.hadoop.util.Tool, org.apache.hadoop.mapred.Reducer

This tool merges several LinkDbs into one, optionally passing URLs through the current URLFilters so that prohibited URLs and links are skipped.

The tool can also be used purely for filtering; in that case, specify only one LinkDb in the arguments.

If more than one LinkDb contains information about the same URL, the inlinks from all of them are accumulated, but at most db.max.inlinks inlinks are ever added for that URL.
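
The capping rule above can be illustrated with a small, self-contained sketch. This is not the Nutch implementation: the class InlinkMergeSketch and the method mergeCapped are hypothetical names, inlinks are modelled as plain strings, and treating duplicate inlinks as a single entry is an assumption made for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; it does not use any Nutch or Hadoop classes.
public class InlinkMergeSketch {

    /**
     * Accumulates the inlinks recorded for one URL in several LinkDbs,
     * stopping once maxInlinks distinct inlinks have been collected.
     * maxInlinks plays the role of the db.max.inlinks setting.
     */
    public static List<String> mergeCapped(List<List<String>> inlinksPerDb, int maxInlinks) {
        // Assumption: duplicate inlinks collapse into one entry.
        Set<String> merged = new LinkedHashSet<>();
        for (List<String> inlinks : inlinksPerDb) {
            for (String inlink : inlinks) {
                if (merged.size() >= maxInlinks) {
                    // Limit reached: later inlinks are ignored.
                    return new ArrayList<>(merged);
                }
                merged.add(inlink);
            }
        }
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) {
        List<List<String>> perDb = Arrays.asList(
            Arrays.asList("http://a.example/", "http://b.example/"),
            Arrays.asList("http://b.example/", "http://c.example/", "http://d.example/"));
        // With a limit of 3, the fourth distinct inlink is dropped.
        System.out.println(mergeCapped(perDb, 3));
    }
}
```

The limit is checked before each addition, so the result never exceeds it regardless of how many LinkDbs contribute inlinks for the same URL.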

If activated, URLFilters are applied both to the target URLs and to every incoming link URL. If a target URL is prohibited, the target URL and all inlinks to it are removed. If only some of the incoming links are prohibited, only those links are removed, and they do not count toward the db.max.inlinks limit.
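
The filtering rules can be sketched in the same spirit. Again, this is a hypothetical illustration rather than the Nutch code: LinkDbFilterSketch and filterAndCap are invented names, a java.util.function.Predicate stands in for the URLFilters chain, and dropping targets that end up with no inlinks is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical illustration only; it does not use any Nutch or Hadoop classes.
public class LinkDbFilterSketch {

    /**
     * Applies a URL filter to a map of target URL -> inlink URLs.
     * "allowed" stands in for the URLFilters chain: it returns false
     * for prohibited URLs.
     */
    public static Map<String, List<String>> filterAndCap(
            Map<String, List<String>> linkDb, Predicate<String> allowed, int maxInlinks) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : linkDb.entrySet()) {
            // A prohibited target is dropped together with all of its inlinks.
            if (!allowed.test(entry.getKey())) {
                continue;
            }
            List<String> kept = new ArrayList<>();
            for (String inlink : entry.getValue()) {
                if (kept.size() >= maxInlinks) {
                    break;
                }
                // Prohibited inlinks are skipped and do not count toward the limit.
                if (allowed.test(inlink)) {
                    kept.add(inlink);
                }
            }
            // Assumption: a target left with no inlinks is not emitted.
            if (!kept.isEmpty()) {
                result.put(entry.getKey(), kept);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> linkDb = new LinkedHashMap<>();
        linkDb.put("http://ok.example/", Arrays.asList(
            "http://bad.example/x", "http://a.example/", "http://b.example/", "http://c.example/"));
        linkDb.put("http://bad.example/", Arrays.asList("http://a.example/"));

        Predicate<String> allowed = url -> !url.contains("bad.example");
        System.out.println(filterAndCap(linkDb, allowed, 2));
    }
}
```

Because the limit is compared against the inlinks actually kept, a prohibited inlink at the front of the list does not reduce the number of permitted inlinks that survive.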

Constructors:
 public LinkDbMerger() 
 public LinkDbMerger(Configuration conf) 
Method summary for org.apache.nutch.crawl.LinkDbMerger:
close,   configure,   createMergeJob,   main,   merge,   reduce,   run
Methods from java.lang.Object:
equals,   getClass,   hashCode,   notify,   notifyAll,   toString,   wait,   wait,   wait
Method detail for org.apache.nutch.crawl.LinkDbMerger:
 public void close() throws IOException
 public void configure(JobConf job)
 public static JobConf createMergeJob(Configuration config,
    Path linkDb,
    boolean normalize,
    boolean filter)
 public static void main(String[] args) throws Exception
 public void merge(Path output,
    Path[] dbs,
    boolean normalize,
    boolean filter) throws Exception
 public void reduce(Text key,
    Iterator values,
    OutputCollector output,
    Reporter reporter) throws IOException
 public int run(String[] args) throws Exception