org.apache.lenya.lucene.html
Class HtmlDocument

java.lang.Object
  extended by org.apache.lenya.lucene.html.HtmlDocument

public class HtmlDocument
extends java.lang.Object

The HtmlDocument class creates a Lucene org.apache.lucene.document.Document from an HTML document.

It does this by using the JTidy package. It can take input from a java.io.File or a java.io.InputStream.
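
The flow described above (read HTML, obtain a DOM element, pull out the title and body text, hand them to the index) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Lenya implementation: the class name HtmlFieldsSketch and the field names "title" and "contents" are hypothetical, the JDK XML parser stands in for JTidy (so the input must be well-formed XHTML), and a plain Map stands in for org.apache.lucene.document.Document.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical stand-in for HtmlDocument; not the Lenya class.
class HtmlFieldsSketch {

    // Parses well-formed XHTML and returns its root element.
    // Assumption: the JDK parser replaces JTidy here, so malformed HTML is rejected.
    static Element parse(InputStream is) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(is)
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
    }

    // Returns the trimmed text of the first element with the given tag name, or "".
    static String firstText(Element root, String tag) {
        NodeList nodes = root.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    // Collects title and body into a map that stands in for a Lucene Document.
    // The keys "title" and "contents" are illustrative names only.
    static Map<String, String> toFields(InputStream is) {
        Element root = parse(is);
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", firstText(root, "title"));
        fields.put("contents", firstText(root, "body"));
        return fields;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Hello</title></head>"
                + "<body><p>Some text.</p></body></html>";
        InputStream is = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
        System.out.println(toFields(is));
    }
}
```

With the real class, the equivalent entry points would be the static Document(java.io.File) and getDocument(java.io.InputStream) methods listed in the Method Summary.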


Field Summary
private  java.lang.String luceneClassValue
           
private  java.lang.String luceneTagName
           
private  org.w3c.dom.Element rawDoc
           
 
Constructor Summary
HtmlDocument(java.io.File file)
          Constructs an HtmlDocument from a java.io.File.
HtmlDocument(java.io.InputStream is)
          Constructs an HtmlDocument from a java.io.InputStream.
 
Method Summary
static org.apache.lucene.document.Document Document(java.io.File file)
          Creates a Lucene Document from a java.io.File.
 java.lang.String getBody()
          Gets the body text attribute of the HtmlDocument object.
private  java.lang.String getBodyText(org.w3c.dom.Node node, boolean indexByLucene)
          Gets the bodyText attribute of the HtmlDocument object.
static org.apache.lucene.document.Document getDocument(java.io.InputStream is)
          Creates a Lucene Document from a java.io.InputStream.
 java.lang.String getTitle()
          Gets the title attribute of the HtmlDocument object.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

rawDoc

private org.w3c.dom.Element rawDoc

luceneTagName

private java.lang.String luceneTagName

luceneClassValue

private java.lang.String luceneClassValue

Constructor Detail

HtmlDocument

public HtmlDocument(java.io.File file)
             throws java.io.IOException
Constructs an HtmlDocument from a java.io.File.


HtmlDocument

public HtmlDocument(java.io.InputStream is)
             throws java.io.IOException
Constructs an HtmlDocument from a java.io.InputStream.

Method Detail

getDocument

public static org.apache.lucene.document.Document getDocument(java.io.InputStream is)
                                                       throws java.io.IOException
Creates a Lucene Document from a java.io.InputStream.


Document

public static org.apache.lucene.document.Document Document(java.io.File file)
                                                    throws java.io.IOException
Creates a Lucene Document from a java.io.File.


getTitle

public java.lang.String getTitle()
Gets the title attribute of the HtmlDocument object.


getBody

public java.lang.String getBody()
Gets the body text attribute of the HtmlDocument object.


getBodyText

private java.lang.String getBodyText(org.w3c.dom.Node node,
                                     boolean indexByLucene)
Gets the bodyText attribute of the HtmlDocument object.
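
This reference does not say how the indexByLucene flag or the luceneTagName and luceneClassValue fields are used. One plausible reading, offered purely as an assumption, is that text is collected recursively and kept only inside elements whose tag name and class attribute match those two fields. The sketch below illustrates that reading; the class name BodyTextSketch and the marker values "div" and "index" are invented for the example, and the JDK XML parser again stands in for JTidy.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical sketch; not the Lenya implementation.
class BodyTextSketch {
    static final String TAG_NAME = "div";      // assumed role of luceneTagName
    static final String CLASS_VALUE = "index"; // assumed role of luceneClassValue

    // Walks the subtree under node and concatenates its text nodes.
    // Text is kept only while indexByLucene is true; the flag switches on
    // for the subtree of any element matching TAG_NAME and CLASS_VALUE.
    static String getBodyText(Node node, boolean indexByLucene) {
        StringBuilder sb = new StringBuilder();
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                if (indexByLucene) {
                    sb.append(child.getNodeValue());
                }
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                Element el = (Element) child;
                boolean marked = TAG_NAME.equals(el.getTagName())
                        && CLASS_VALUE.equals(el.getAttribute("class"));
                sb.append(getBodyText(el, indexByLucene || marked));
            }
        }
        return sb.toString();
    }

    // Parses a well-formed XHTML string with the JDK parser (stand-in for JTidy).
    static Element parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        Element body = parse("<body><p>skip</p>"
                + "<div class=\"index\"><p>keep</p> me</div></body>");
        System.out.println(getBodyText(body, false));
    }
}
```

Passing true as the initial flag would index every text node, which is one way a caller could fall back to whole-page indexing when no marked region exists.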