Source code: com/eireneh/bible/book/ser/SerBible.java


1   
2   package com.eireneh.bible.book.ser;
3   
4   import java.io.*;
5   import java.util.*;
6   import java.net.*;
7   
8   import org.jdom.*;
9   
10  import com.sun.java.util.collections.*;
11  
12  import com.eireneh.util.*;
13  import com.eireneh.bible.util.*;
14  import com.eireneh.bible.book.*;
15  import com.eireneh.bible.passage.*;
16  
17  /**
18   * A Biblical source that comes from files on the local file system.
19   *
20   * <p>This format is designed to be fast. At any cost. So disk space does
21   * not matter, which is good because early versions used about 100Mb!</p>
22   *
23   * <p>This is a history of some of the design decisions that this class has
24   * been through.</p>
25   *
26   * <h4>Searching</h4>
27   * <p>I think that a Bible ought not to store anything other than Bible
28   * text. I have experimented with a search mechanism that cached searches
29   * in a very effective manner, however it took up a lot of disk space,
30   * and only worked for one version. It might be good to have it work in a
31   * more generic way, and an in-memory cache would also be a good idea. So
32   * I am going to move the natty search bit into a caching class.
33   *
34   * <h4>Text Storage</h4>
35   * It would be good to get a handle on the way the OLB and Sword and so on
36   * work:<ul>
37   * <li><b>OLB:</b> 2 core files: an index file that starts with text like:
38   *     "AaronitesbaddonAbagthanarimabasedingtedAbbadaeelielonednegolbet"
39   *     which is a strange sort of index. Possibly strings with start pos
40   *     and length. Then data files, and plenty of other indexes.
41   * <li><b>Theophilos:</b> Single data file that begins- "aaron aaronites
42   *     aarons abaddon abagtha abana abarim abase abased abasing abated"
43   *     This is again an index type affair.
44   * <li><b>Sword:</b> All this VerseKey stuff ...
45   * </ul>
46   * I think the answer is that a word index is good. (Like this is news)
47   * So we can map all the words to numbers and then encode the biblical
48   * text as a series of numbers.
49   *
50   * <h4>Priorities</h4>
51   * What factors affect our design the most?<ul>
52   * <li><b>Search Speed:</b> Probably the biggest reason people will have to
53   *     use this program initially will be the powerful search engine. This
54   *     can be very demanding though, and every effort should be taken to
55   *     make best match searches fast.
56   * <li><b>Size:</b> Size is not a huge problem from a disk space point of
57   *     view - the average hard disk is now about 10Gb. Looking at the
58   *     various installations that I have, the average is a little short of
59   *     20Mb each. Generally each version takes up 3-5Mb. If we were to be
60   *     over double this size and take up 50Mb total, I don't think there
61   *     would be a huge problem.<br>
62   *     However many people will first come to use this program from a net
63   *     download - now size is a huge problem. Maybe we should have a
64   *     very very compact download that on installation indexed itself.
65   * <li><b>Text Retrieval Speed:</b> I do not see this as being a huge
66   *     issue. The text generation time from reverse-engineering my
67   *     concordance was acceptable if slow, so this should not be a big
68   *     deal, and I guess it is very easily cacheable too.
69   * </ul>
70   *
71   * <h4>Strategies</h4>
72   * For a single verse we have 2 basic strategies. Have a single block of
73   * data that specifies the words, punctuation, and markup, or for each set
74   * of data we could have a separate source. Clearly there are also hybrid
75   * versions. The pros and cons:<ul>
76   * <li>Searches only have to read one file, and the information is more
77   *     dense in that (fewer disk reads for wanted data). This also applies
78   *     to the ability to ignore certain types of mark-up.
79   * <li>It is easier to add/alter a single source of information - or even
80   *     to share a source amongst versions. Maybe things like red lettering
81   *     could benefit from this.
82   * <li>Text display is slower because the information is spread over
83   *     several files. But as mentioned above - who cares?
84   * </ul>
85   * So how far do we take this? The parts that we can split off from the
86   * words are these:<ul>
87   * <li>Markup: Most markup is tied to a particular word, so we would need
88   *     some way of attaching markup to words.
89   * <li>Inter-Word Punctuation: We could do for punctuation exactly what we
90   *     do for the words. List the options in a dictionary, and then write
91   *     out an index. I guess fewer than 255 different types of inter-word
92   *     punctuation (1 byte per inter-word). (as opposed to 18360 different
93   *     words at 2 bytes per word)<br>
94   *     There are 32k words in the Bible - this would make the central data
95   *     file about 64k in size!
96   * <li>Case: To get down to 18k words you need to make "Foo" the same as
97   *     "foo" and "FOO", however I guess that even making words case
98   *     sensitive we would be under 65k words.
99   *     Splitting case would not decrease file sizes (but may make it
100  *     compress better) however it would introduce a new case file. Since
101  *     there are only 4 cases (See PassageUtil) that is 0.25 bytes per
102  *     word. (8k for the whole Bible)
103  * <li>Intra-Word Punctuation: Examples "-',". Examples of words that use
104  *     these punctuations: Maher-Shalal-Hash-Baz, Aaron's, 144,000. Watch
105  *     out for --. The NIV uses it to join sentences together--Something
106  *     like this. However there is no space between these words. This is
107  *     closely linked to-
108  * <li>Word Semantics: We could make the words "job", "jobs", and "job's"
109  *     the same. Also "run", "runs", "running", "runned" and so on. Even
110  *     "am", "are", "is". This would dramatically reduce the size of the
111  *     dictionary, make the text re-generation quite complex and the data
112  *     generation nigh on impossible. But it would make for some really
113  *     powerful searches (although possibly nothing that a thesaurus would
114  *     not help)
115  * </ul>
116  * I think the last 2 are hard to suss. However I am keen to work on them
117  * next. So it looks like I sort out the first 3. Time to resurrect that
118  * VB code. Now is it a port or a re-write?
119  *
120  * <table border='1' cellPadding='3' cellSpacing='0' width="100%">
121  * <tr><td bgColor='white' class='TableRowColor'><font size='-7'>
122  * Distribution Licence:<br />
123  * Project B is free software; you can redistribute it
124  * and/or modify it under the terms of the GNU General Public License,
125  * version 2 as published by the Free Software Foundation.<br />
126  * This program is distributed in the hope that it will be useful,
127  * but WITHOUT ANY WARRANTY; without even the implied warranty of
128  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
129  * General Public License for more details.<br />
130  * The License is available on the internet
131  * <a href='http://www.gnu.org/copyleft/gpl.html'>here</a>, by writing to
132  * <i>Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston,
133  * MA 02111-1307, USA</i>, Or locally at the Licence link below.<br />
134  * The copyright to this program is held by its authors.
135  * </font></td></tr></table>
136  * @see <a href='http://www.eireneh.com/servlets/Web'>Project B Home</a>
137  * @see docs.Licence
138  * @author Joe Walker
139  * @version D4.I0.T0
140  */
141 public class SerBible extends VersewiseBible
142 {
143     /**
144      * Basic constructor for a SerBible
145      */
146     public SerBible(String name, URL url, boolean create) throws BookException
147     {
148         this.name = name;
149         this.url = url;
150 
151         if (!url.getProtocol().equals("file"))
152             throw new BookException("ser_url");
153 
154         try
155         {
156             String file_mode;
157 
158             if (create)
159             {
160                 // We leave the Version unknown until we have data
161                 version = null;
162 
163                 // Create blank indexes
164                 ref_map = new TreeMap(new StringComparator());
165                 xml_arr = new long[Books.versesInBible()];
166 
167                 // Open the random access files read write
168                 file_mode = "rw";
169             }
170             else
171             {
172                 // The version information
173                 URL prop_url = NetUtil.lengthenURL(url, "bible.properties");
174                 InputStream prop_in = prop_url.openStream();
175                 Properties prop = new Properties();
176                 PropertiesUtil.load(prop, prop_in);
177                 String version_name = prop.getProperty("Version");
178                 version = VersionFactory.getVersion(version_name);
179 
180                 // Load the ascii Passage index
181                 URL ref_idy_url = NetUtil.lengthenURL(url, "ref.index");
182                 BufferedReader ref_idy_bin = new BufferedReader(new InputStreamReader(ref_idy_url.openStream()));
183                 ref_map = new TreeMap(new StringComparator());
184                 while (true)
185                 {
186                     String line = ref_idy_bin.readLine();
187                     if (line == null) break;
188                     int colon1 = line.indexOf(":");
189                     int colon2 = line.lastIndexOf(":");
190                     String word = line.substring(0, colon1);
191                     long offset = Long.parseLong(line.substring(colon1+1, colon2));
192                     int length = Integer.parseInt(line.substring(colon2+1));
193                     Section section = new Section(offset, length);
194                     ref_map.put(word, section);
195                 }
196                 ref_idy_bin.close();
197 
198                 // Load the ascii XML index
199                 URL xml_idy_url = NetUtil.lengthenURL(url, "xml.index");
200                 BufferedReader xml_idy_bin = new BufferedReader(new InputStreamReader(xml_idy_url.openStream()));
201                 xml_arr = new long[Books.versesInBible()];
202                 for (int i=0; i<Books.versesInBible(); i++)
203                 {
204                     String line = xml_idy_bin.readLine();
205                     xml_arr[i] = Long.parseLong(line);
206                 }
207                 xml_idy_bin.close();
208 
209                 // Open the random access files read only
210                 file_mode = "r";
211             }
212 
213             // Open the Passage RAF
214             URL ref_dat_url = NetUtil.lengthenURL(url, "ref.data");
215             ref_dat = new RandomAccessFile(ref_dat_url.getFile(), file_mode);
216 
217             // Open the XML RAF
218             URL xml_dat_url = NetUtil.lengthenURL(url, "xml.data");
219             xml_dat = new RandomAccessFile(xml_dat_url.getFile(), file_mode);
220         }
221         catch (Exception ex)
222         {
223             throw new BookException("ser_init", ex);
224         }
225 
226         log.fine("Started SerBible url="+url+ " name="+name+" create="+create);
227     }
228 
229     /**
230      * What driver is controlling this Bible?
231      * @return A BibleDriver relevant to this Bible
232      */
233     public BibleDriver getDriver()
234     {
235         return SerBibleDriver.driver;
236     }
237 
238     /**
239      * Meta-Information: What name can I use to get this Bible in a call
240      * to Bibles.getBible(name);
241      * @return The name of this Bible
242      */
243     public String getName()
244     {
245         return name;
246     }
247 
248     /**
249      * Meta-Information: What version of the Bible is this?
250      * @return A Version for this Bible
251      */
252     public Version getVersion()
253     {
254         return version;
255     }
256 
257     /**
258      * Setup the Version information
259      * @param version The version that this Bible is becoming
260      */
261     public void setVersion(Version version)
262     {
263         this.version = version;
264     }
265 
266     /**
267      * Create a String for the specified Verses
268      * @param range The verses to search for
269      * @return The Bible text
270      */
271     public String getText(VerseRange range) throws BookException
272     {
273         try
274         {
275             BibleEle doc = new BibleEle();
276 
277             // We should be doing all to the <ref> manually.
278             Passage ref = PassageFactory.createPassage();
279             ref.add(range);
280             getDocument(doc, ref);
281 
282             return doc.getText().trim();
283         }
284         catch (Exception ex)
285         {
286             throw new BookException("ser_read", ex);
287         }
288     }
289 
290     /**
291      * Retrieval: Use JDOM to retrieve some Bible data
292      * @param ref The verses to search for
293      * @return A JDOM Element holding the requested verses
294      */
295     public Element getElement(Passage ref) throws BookException
296     {
297         try
298         {
299             Element doc = new Element("bible");
300 
301             // For all the ranges in this Passage
302             Enumeration ren = ref.rangeElements();
303             while (ren.hasMoreElements())
304             {
305                 VerseRange range = (VerseRange) ren.nextElement();
306 
307                 Element section = new Element("section");
308                 section.addAttribute("title", range.toString());
309                 doc.addContent(section);
310 
311                 // For all the verses in this range
312                 Enumeration ven = range.verseElements();
313                 while (ven.hasMoreElements())
314                 {
315                     Verse verse = (Verse) ven.nextElement();
316 
317                     // Seek to the correct point
318                     xml_dat.seek(xml_arr[verse.getOrdinal()-1]);
319 
320                     // Read the XML text
321                     String text = xml_dat.readUTF();
322 
323                     Element vref = new Element("ref");
324                     section.addContent(vref);
325                     vref.addAttribute("b", ""+verse.getBook());
326                     vref.addAttribute("c", ""+verse.getChapter());
327                     vref.addAttribute("v", ""+verse.getVerse());
328                     // vref.addAttribute("para", "true");
329 
330                     Element it = new Element("it");
331                     vref.addContent(it);
332                     it.setText(text);
333                 }
334             }
335 
336             return doc;
337         }
338         catch (Exception ex)
339         {
340             throw new BookException("ser_read", ex);
341         }
342     }
343 
344     /**
345      * Create an XML document for the specified Verses. A section is
346      * appended to the document for each range in the Passage.
347      * @param doc The XML document to append to
348      * @param ref The verses to search for
349      */
350     public void getDocument(BibleEle doc, Passage ref) throws BookException
351     {
352         try
353         {
354             // For all the ranges in this Passage
355             Enumeration ren = ref.rangeElements();
356             while (ren.hasMoreElements())
357             {
358                 VerseRange range = (VerseRange) ren.nextElement();
359                 SectionEle section = doc.createSectionEle(range.toString());
360 
361                 // For all the verses in this range
362                 Enumeration ven = range.verseElements();
363                 while (ven.hasMoreElements())
364                 {
365                     Verse verse = (Verse) ven.nextElement();
366 
367                     // Seek to the correct point
368                     xml_dat.seek(xml_arr[verse.getOrdinal()-1]);
369 
370                     // Read the XML text
371                     String text = xml_dat.readUTF();
372 
373                     RefEle vref = section.createRefEle(verse, false);
374                     vref.setPlainText(text);
375                 }
376             }
377         }
378         catch (Exception ex)
379         {
380             throw new BookException("ser_read", ex);
381         }
382     }
383 
384     /**
385      * For a given word find a list of references to it
386      * @param word The text to search for
387      * @return The references to the word
388      */
389     public Passage findPassage(String word) throws BookException
390     {
391         if (word == null)
392             return PassageFactory.createPassage();
393 
394         Section section = (Section) ref_map.get(word.toLowerCase());
395 
396         if (section == null)
397             return PassageFactory.createPassage();
398 
399         try
400         {
401             // Read blob
402             byte[] blob = new byte[section.length];
403             ref_dat.seek(section.offset);
404             ref_dat.readFully(blob);
405 
406             // De-serialize
407             return PassageUtil.fromBinaryRepresentation(blob);
408         }
409         catch (Exception ex)
410         {
411             log.warning("Search failed on:");
412             log.warning("  word="+word);
413             log.warning("  ref_ptr="+section.offset);
414             log.warning("  ref_length="+section.length);
415             Reporter.informUser(this, ex);
416 
417             return PassageFactory.createPassage();
418         }
419     }
420 
421     /**
422      * Retrieval: Return an array of words that are used by this Bible
423      * that start with the given string. For example calling:
424      * <code>getStartsWith("love")</code> will return something like:
425      * { "love", "loves", "lover", "lovely", ... }
426      * @param word The word to base your word array on
427      * @return An array of words starting with the base
428      */
429     public String[] getStartsWith(String word) throws BookException
430     {
431         word = word.toLowerCase();
432         SortedMap sub_map = ref_map.subMap(word, word+"\u9999");
433         Object[] temp = sub_map.keySet().toArray();
434 
435         String[] retcode = new String[temp.length];
436         for (int i=0; i<temp.length; i++)
437         {
438             retcode[i] = (String) temp[i];
439         }
440 
441         return retcode;
442     }
443 
444     /**
445      * Retrieval: Get a list of the words used by this Version. This is
446      * not vital for normal display, however it is very useful for various
447      * things, not least of which is new Version generation. However if
448      * you are only looking to <i>display</i> from this Bible then you
449      * could skip this one.
450      * @return An Enumeration of the words used by this Version
451      */
452     public Enumeration listWords() throws BookException
453     {
454         return new IteratorEnumeration(ref_map.keySet().iterator());
455     }
456 
457     /**
458      * Write the XML to disk, recording the file offset of each verse
459      * so that it can be read back later
460      * @param doc The data to write
461      */
462     public void setDocument(BibleEle doc) throws BookException
463     {
464         try
465         {
466             // For all of the sections
467             for (Enumeration sen=doc.getSectionEles(); sen.hasMoreElements(); )
468             {
469                 SectionEle section = (SectionEle) sen.nextElement();
470 
471                 // For all of the Verses in the section
472                 for (Enumeration ven=section.getRefEles(); ven.hasMoreElements(); )
473                 {
474                     RefEle vel = (RefEle) ven.nextElement();
475 
476                     Verse verse = vel.getVerse();
477                     String text = vel.getPlainText();
478 
479                     // Remember where we were so we can read it back later
480                     xml_arr[verse.getOrdinal()-1] = xml_dat.getFilePointer();
481 
482                     // And write the entry
483                     xml_dat.writeUTF(text);
484                 }
485             }
486         }
487         catch (IOException ex)
488         {
489             throw new BookException("ser_write", ex);
490         }
491     }
492 
493     /**
494      * Write the references for a Word
495      * @param word The word to write
496      * @param ref The references to the word
497      */
498     public void foundPassage(String word, Passage ref) throws BookException
499     {
500         if (word == null) return;
501 
502         try
503         {
504             log.fine("s "+word+" "+System.currentTimeMillis());
505             byte[] buffer = PassageUtil.toBinaryRepresentation(ref);
506             log.fine("e "+word+" "+System.currentTimeMillis());
507 
508             Section section = new Section(ref_dat.getFilePointer(), buffer.length);
509 
510             ref_dat.write(buffer);
511             ref_map.put(word.toLowerCase(), section);
512 
513             // log.debug(this, "Written:");
514             // log.debug(this, "  word="+word);
515             // log.debug(this, "  ref_ptr="+ref_ptr);
516             // log.debug(this, "  ref_length="+ref_blob.length);
517             // log.debug(this, "  ref_blob="+new String(ref_blob));
518         }
519         catch (Exception ex)
520         {
521             throw new BookException("ser_write", ex);
522         }
523     }
524 
525     /**
526      * Flush the data written to disk
527      */
528     public void flush() throws BookException
529     {
530         try
531         {
532             ObjectOutputStream oout;
533 
534             // Save the ascii Passage index
535             URL ref_idy_url = NetUtil.lengthenURL(url, "ref.index");
536             PrintWriter ref_idy_out = new PrintWriter(NetUtil.getOutputStream(ref_idy_url));
537             Iterator it = ref_map.keySet().iterator();
538             while (it.hasNext())
539             {
540                 String word = (String) it.next();
541                 Section section = (Section) ref_map.get(word);
542                 ref_idy_out.println(word+":"+section.offset+":"+section.length);
543             }
544             ref_idy_out.close();
545 
546             // Save the ascii XML index
547             URL xml_idy_url = NetUtil.lengthenURL(url, "xml.index");
548             PrintWriter xml_idy_out = new PrintWriter(NetUtil.getOutputStream(xml_idy_url));
549             for (int i=0; i<xml_arr.length; i++)
550             {
551                 xml_idy_out.println(xml_arr[i]);
552             }
553             xml_idy_out.close();
554 
555             // The Bible config info
556             Properties prop = new Properties();
557             prop.put("Version", getVersion().getFullName());
558             URL prop_url = NetUtil.lengthenURL(url, "bible.properties");
559             OutputStream prop_out = NetUtil.getOutputStream(prop_url);
560             PropertiesUtil.save(prop, prop_out, "RawBible Config");
561         }
562         catch (IOException ex)
563         {
564             throw new BookException("ser_index", ex);
565         }
566     }
567 
568     /**
569      * The directory that holds the SerBible files
570      * @return The index file directory
571      */
572     public URL getBaseURL()
573     {
574         return url;
575     }
576 
577     /** The SAX parser to use */
578     private static final String PARSER = "com.ibm.xml.parsers.SAXParser";
579 
580     /** The base url */
581     private URL url;
582 
583     /** The name of this version */
584     private String name;
585 
586     /** The passages random access file */
587     private RandomAccessFile ref_dat;
588 
589     /** The hash of indexes into the passages file */
590     private SortedMap ref_map;
591 
592     /** The text random access file */
593     private RandomAccessFile xml_dat;
594 
595     /**
596      * The hash of indexes into the text file, one per verse. Note that the
597      * index in use is NOT the ordinal number of the verse since ordinal nos are
598      * 1 based. The index into xml_arr is verse.getOrdinal() - 1
599      */
600     private long[] xml_arr;
601 
602     /** Some shortcuts into the list of names to help startsWith */
603     private long[] letters = new long[26];
604 
605     /** The Version of the Bible that this produces */
606     private Version version;
607 
608     /** The log stream */
609     protected static Logger log = Logger.getLogger("bible.book");
610 
611     /**
612      * A simple class to hold an offset and length into the passages random
613      * access file
614      */
615     static class Section
616     {
617         public Section(long offset, int length)
618         {
619             this.offset = offset;
620             this.length = length;
621         }
622         public long offset;
623         public int length;
624     }
625 
626     /**
627      * This customization just clips off the .ser from the array members
628      */
629     static class CustomArrayEnumeration extends ArrayEnumeration
630     {
631         /**
632          * This is the only one of the ArrayEnumeration ctors that we need
633          */
634         CustomArrayEnumeration(Object[] array)
635         {
636             super(array);
637         }
638 
639         /**
640          * Get the next item from the database
641          * @return The next object in the list
642          */
643         public Object nextElement()
644         {
645             String file = (String) array[pos++];
646             return file.substring(0, file.length()-4);
647         }
648     }
649 
650     /**
651      * Check that the directories in the version directory really
652      * represent versions.
653      */
654     static class CustomFilenameFilter implements FilenameFilter
655     {
656         /**
657          * Create a CustomFilenameFilter with a word to match
658          * the start of
659          */
660         public CustomFilenameFilter(String word)
661         {
662             this.word = word;
663         }
664 
665         /**
666          * Match word
667          */
668         public boolean accept(File parent, String name)
669         {
670             return name.startsWith(word);
671         }
672 
673         /** The word to match */
674         private String word;
675     }
676 }
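
The listing above stores each verse with writeUTF at an offset remembered from getFilePointer, reads it back with seek and readUTF, and saves the word index as text lines of the form word:offset:length. The sketch below illustrates that scheme on a temporary file. It is a minimal illustration under stated assumptions, not the project's code: the class SerIndexSketch and its methods parseIndexLine, writeAll, readAt and roundTrip are hypothetical names, and only java.io classes from the standard library are used.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

// Hypothetical sketch: none of these names come from SerBible itself.
class SerIndexSketch
{
    /** Parse one "word:offset:length" index line into { offset, length }. */
    static long[] parseIndexLine(String line)
    {
        int colon1 = line.indexOf(':');
        int colon2 = line.lastIndexOf(':');
        long offset = Long.parseLong(line.substring(colon1 + 1, colon2));
        long length = Long.parseLong(line.substring(colon2 + 1));
        return new long[] { offset, length };
    }

    /** Append each text with writeUTF and return the offset where each one starts. */
    static long[] writeAll(RandomAccessFile raf, String[] texts) throws IOException
    {
        long[] offsets = new long[texts.length];
        for (int i = 0; i < texts.length; i++)
        {
            // Remember where this entry begins before writing it
            offsets[i] = raf.getFilePointer();
            raf.writeUTF(texts[i]);
        }
        return offsets;
    }

    /** Seek to a remembered offset and read back the entry stored there. */
    static String readAt(RandomAccessFile raf, long offset) throws IOException
    {
        raf.seek(offset);
        return raf.readUTF();
    }

    /** Write all texts to a temporary file, then read back the one at index. */
    static String roundTrip(String[] texts, int index)
    {
        File file = null;
        try
        {
            file = File.createTempFile("ser-sketch", ".data");
            try (RandomAccessFile raf = new RandomAccessFile(file, "rw"))
            {
                long[] offsets = writeAll(raf, texts);
                return readAt(raf, offsets[index]);
            }
        }
        catch (IOException ex)
        {
            throw new UncheckedIOException(ex);
        }
        finally
        {
            if (file != null) file.delete();
        }
    }

    public static void main(String[] args)
    {
        String[] texts = { "In the beginning", "And the earth", "And God said" };
        System.out.println(roundTrip(texts, 1));

        long[] section = parseIndexLine("aaron:1024:96");
        System.out.println(section[0] + " " + section[1]);
    }
}
```

Offsets are kept as long values throughout, which is why the sketch parses them with Long.parseLong. The offset array plays the role that xml_arr plays in the listing: one slot per stored entry, indexed from zero.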