Source code: com/eireneh/bible/book/ser/SerBible.java
1
2 package com.eireneh.bible.book.ser;
3
4 import java.io.*;
5 import java.util.*;
6 import java.net.*;
7
8 import org.jdom.*;
9
10 import com.sun.java.util.collections.*;
11
12 import com.eireneh.util.*;
13 import com.eireneh.bible.util.*;
14 import com.eireneh.bible.book.*;
15 import com.eireneh.bible.passage.*;
16
17 /**
18 * A Biblical source that comes from files on the local file system.
19 *
20 * <p>This format is designed to be fast. At any cost. So disk space does
21 * not matter, which is good because early versions used about 100Mb!</p>
22 *
23 * <p>This is a history of some of the design decisions that this class has
24 * been through.</p>
25 *
26 * <h4>Searching</h4>
27 * <p>I think that a Bible ought not to store anything other than Bible
28 * text. I have experimented with a search mechanism that cached searches
29 * in a very effective manner, however it took up a lot of disk space,
30 * and only worked for one version. It might be good to have it work in a
31 * more generic way, and an in-memory cache would also be a good idea. So
32 * I am going to move the natty search bit into a caching class.
33 *
34 * <h4>Text Storage</h4>
35 * It would be good to get a handle on the way the OLB and Sword and so on
36 * work:<ul>
37 * <li><b>OLB:</b> 2 core files: an index file that starts with text like:
38 * "AaronitesbaddonAbagthanarimabasedingtedAbbadaeelielonednegolbet"
39 * which is a strange sort of index. Possibly strings with start pos
40 * and length. Then data files, and plenty of other indexes.
41 * <li><b>Theophilos:</b> Single data file that begins: "aaron aaronites
42 * aarons abaddon abagtha abana abarim abase abased abasing abated"
43 * This is again an index type affair.
44 * <li><b>Sword:</b> All this VerseKey stuff ...
45 * </ul>
46 * I think the answer is that a word index is good. (Like this is news)
47 * So we can map all the words to numbers and then encode the biblical
48 * text as a series of numbers.
49 *
50 * <h4>Priorities</h4>
51 * What factors affect our design the most?<ul>
52 * <li><b>Search Speed:</b> Probably the biggest reason people will have to
53 * use this program initially will be the powerful search engine. This
54 * can be very demanding though, and every effort should be taken to
55 * make best match searches fast.
56 * <li><b>Size:</b> Size is not a huge problem from a disk space point of
57 * view - the average hard disk is now about 10Gb. Looking at the
58 * various installations that I have, the average is a little short of
59 * 20Mb each. Generally each version takes up 3-5Mb. If we were to be
60 * over double this size and take up 50Mb total, I don't think there
61 * would be a huge problem.<br>
62 * However many people will first come to use this program from a net
63 * download - now size is a huge problem. Maybe we should have a
64 * very very compact download that on installation indexed itself.
65 * <li><b>Text Retrieval Speed:</b> I do not see this as being a huge
66 * issue. The text generation time from reverse-engineering my
67 * concordance was acceptable if slow, so this should not be a big
68 * deal, and I guess it is very easily cacheable too.
69 * </ul>
70 *
71 * <h4>Strategies</h4>
72 * For a single verse we have 2 basic strategies. Have a single block of
73 * data that specifies the words, punctuation, and markup, or for each set
74 * of data we could have a separate source. Clearly there are also hybrid
75 * versions. The pros and cons:<ul>
76 * <li>Searches only have to read one file, and the information is more
77 * dense in that (fewer disk reads for wanted data). This also applies
78 * to the ability to ignore certain types of mark-up.
79 * <li>It is easier to add/alter a single source of information - or even
80 * to share a source amongst versions. Maybe things like red lettering
81 * could benefit from this.
82 * <li>Text display is slower because the information is spread over
83 * several files. But as mentioned above - who cares?
84 * </ul>
85 * So how far do we take this? The parts that we can split off from the
86 * words are these:<ul>
87 * <li>Markup: Most markup is tied to a particular word, so we would need
88 * some way of attaching markup to words.
89 * <li>Inter-Word Punctuation: We could do for punctuation exactly what we
90 * do for the words. List the options in a dictionary, and then write
91 * out an index. I guess less than 255 different types of inter-word
92 * punctuation (1 byte per inter-word). (as opposed to 18360 different
93 * words 2 bytes per word)<br>
94 * There are 32k words in the Bible - this would make the central data
95 * file about 64k in size!
96 * <li>Case: To get down to 18k words you need to make "Foo" the same as
97 * "foo" and "FOO", however I guess that even making words case
98 * sensitive we would be under 65k words.
99 * Splitting case would not decrease file sizes (but may make it
100 * compress better) however it would introduce a new case file. Since
101 * there are only 4 cases (See PassageUtil) that is 0.25 bytes per
102 * word. (8k for the whole Bible)
103 * <li>Intra-Word Punctuation: Examples "-',". Examples of words that use
104 * these punctuations: Maher-Shalal-Hash-Baz, Aaron's, 144,000. Watch
105 * out for --. The NIV uses it to join sentences together--Something
106 * like this. However there is no space between these words. This is
107 * closely linked to-
108 * <li>Word Semantics: We could make the words "job", "jobs", and "job's"
109 * the same. Also "run", "runs", "running", "runned" and so on. Even
110 * "am", "are", "is". This would dramatically reduce the size of the
111 * dictionary, make the text re-generation quite complex and the data
112 * generation nigh on impossible. But it would make for some really
113 * powerful searches (although possibly nothing that a thesaurus would
114 * not help)
115 * </ul>
116 * I think the last 2 are hard to sus. However I am keen to work on them
117 * next. So it looks like I sort out the first 3. Time to resurrect that
118 * VB code. Now is it a port or a re-write?
119 *
120 * <table border='1' cellPadding='3' cellSpacing='0' width="100%">
121 * <tr><td bgColor='white' class='TableRowColor'><font size='-7'>
122 * Distribution Licence:<br />
123 * Project B is free software; you can redistribute it
124 * and/or modify it under the terms of the GNU General Public License,
125 * version 2 as published by the Free Software Foundation.<br />
126 * This program is distributed in the hope that it will be useful,
127 * but WITHOUT ANY WARRANTY; without even the implied warranty of
128 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
129 * General Public License for more details.<br />
130 * The License is available on the internet
131 * <a href='http://www.gnu.org/copyleft/gpl.html'>here</a>, by writing to
132 * <i>Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston,
133 * MA 02111-1307, USA</i>, Or locally at the Licence link below.<br />
134 * The copyright to this program is held by its authors.
135 * </font></td></tr></table>
136 * @see <a href='http://www.eireneh.com/servlets/Web'>Project B Home</a>
137 * @see docs.Licence
138 * @author Joe Walker
139 * @version D4.I0.T0
140 */
141 public class SerBible extends VersewiseBible
142 {
143 /**
144 * Basic constructor for a SerBible
145 */
146 public SerBible(String name, URL url, boolean create) throws BookException
147 {
148 this.name = name;
149 this.url = url;
150
151 if (!url.getProtocol().equals("file"))
152 throw new BookException("ser_url");
153
154 try
155 {
156 String file_mode;
157
158 if (create)
159 {
160 // We leave the Version unknown until we have data
161 version = null;
162
163 // Create blank indexes
164 ref_map = new TreeMap(new StringComparator());
165 xml_arr = new long[Books.versesInBible()];
166
167 // Open the random access files read write
168 file_mode = "rw";
169 }
170 else
171 {
172 // The version information
173 URL prop_url = NetUtil.lengthenURL(url, "bible.properties");
174 InputStream prop_in = prop_url.openStream();
175 Properties prop = new Properties();
176 PropertiesUtil.load(prop, prop_in);
177 String version_name = prop.getProperty("Version");
178 version = VersionFactory.getVersion(version_name);
179
180 // Load the ascii Passage index
181 URL ref_idy_url = NetUtil.lengthenURL(url, "ref.index");
182 BufferedReader ref_idy_bin = new BufferedReader(new InputStreamReader(ref_idy_url.openStream()));
183 ref_map = new TreeMap(new StringComparator());
184 while (true)
185 {
186 String line = ref_idy_bin.readLine();
187 if (line == null) break;
188 int colon1 = line.indexOf(":");
189 int colon2 = line.lastIndexOf(":");
190 String word = line.substring(0, colon1);
191 long offset = Long.parseLong(line.substring(colon1+1, colon2));
192 int length = Integer.parseInt(line.substring(colon2+1));
193 Section section = new Section(offset, length);
194 ref_map.put(word, section);
195 }
196 ref_idy_bin.close();
197
198 // Load the ascii XML index
199 URL xml_idy_url = NetUtil.lengthenURL(url, "xml.index");
200 BufferedReader xml_idy_bin = new BufferedReader(new InputStreamReader(xml_idy_url.openStream()));
201 xml_arr = new long[Books.versesInBible()];
202 for (int i=0; i<Books.versesInBible(); i++)
203 {
204 String line = xml_idy_bin.readLine();
205 xml_arr[i] = Long.parseLong(line);
206 }
207 xml_idy_bin.close();
208
209 // Open the random access files read only
210 file_mode = "r";
211 }
212
213 // Open the Passage RAF
214 URL ref_dat_url = NetUtil.lengthenURL(url, "ref.data");
215 ref_dat = new RandomAccessFile(ref_dat_url.getFile(), file_mode);
216
217 // Open the XML RAF
218 URL xml_dat_url = NetUtil.lengthenURL(url, "xml.data");
219 xml_dat = new RandomAccessFile(xml_dat_url.getFile(), file_mode);
220 }
221 catch (Exception ex)
222 {
223 throw new BookException("ser_init", ex);
224 }
225
226 log.fine("Started SerBible url="+url+ " name="+name+" create="+create);
227 }
228
229 /**
230 * What driver is controlling this Bible?
231 * @return A BibleDriver relevant to this Bible
232 */
233 public BibleDriver getDriver()
234 {
235 return SerBibleDriver.driver;
236 }
237
238 /**
239 * Meta-Information: What name can I use to get this Bible in a call
240 * to Bibles.getBible(name);
241 * @return The name of this Bible
242 */
243 public String getName()
244 {
245 return name;
246 }
247
248 /**
249 * Meta-Information: What version of the Bible is this?
250 * @return A Version for this Bible
251 */
252 public Version getVersion()
253 {
254 return version;
255 }
256
257 /**
258 * Setup the Version information
259 * @param version The version that this Bible is becoming
260 */
261 public void setVersion(Version version)
262 {
263 this.version = version;
264 }
265
266 /**
267 * Create a String for the specified Verses
268 * @param range The verses to search for
269 * @return The Bible text
270 */
271 public String getText(VerseRange range) throws BookException
272 {
273 try
274 {
275 BibleEle doc = new BibleEle();
276
277 // We should be doing all to the <ref> manually.
278 Passage ref = PassageFactory.createPassage();
279 ref.add(range);
280 getDocument(doc, ref);
281
282 return doc.getText().trim();
283 }
284 catch (Exception ex)
285 {
286 throw new BookException("ser_read", ex);
287 }
288 }
289
290 /**
291 * Retrieval: Use JDOM to retrieve some Bible data
292 * @param ref The verses to search for
293 * @return A JDOM Element holding the requested verses
294 */
295 public Element getElement(Passage ref) throws BookException
296 {
297 try
298 {
299 Element doc = new Element("bible");
300
301 // For all the ranges in this Passage
302 Enumeration ren = ref.rangeElements();
303 while (ren.hasMoreElements())
304 {
305 VerseRange range = (VerseRange) ren.nextElement();
306
307 Element section = new Element("section");
308 section.addAttribute("title", range.toString());
309 doc.addContent(section);
310
311 // For all the verses in this range
312 Enumeration ven = range.verseElements();
313 while (ven.hasMoreElements())
314 {
315 Verse verse = (Verse) ven.nextElement();
316
317 // Seek to the correct point
318 xml_dat.seek(xml_arr[verse.getOrdinal()-1]);
319
320 // Read the XML text
321 String text = xml_dat.readUTF();
322
323 Element vref = new Element("ref");
324 section.addContent(vref);
325 vref.addAttribute("b", ""+verse.getBook());
326 vref.addAttribute("c", ""+verse.getChapter());
327 vref.addAttribute("v", ""+verse.getVerse());
328 // vref.addAttribute("para", "true");
329
330 Element it = new Element("it");
331 vref.addContent(it);
332 it.setText(text);
333 }
334 }
335
336 return doc;
337 }
338 catch (Exception ex)
339 {
340 throw new BookException("ser_read", ex);
341 }
342 }
343
344 /**
345 * Create an XML document for the specified Verses
346 * @param doc The XML document to append to
347 * @param ref The verses to search for
348 * @throws BookException If the text could not be read
349 */
350 public void getDocument(BibleEle doc, Passage ref) throws BookException
351 {
352 try
353 {
354 // For all the ranges in this Passage
355 Enumeration ren = ref.rangeElements();
356 while (ren.hasMoreElements())
357 {
358 VerseRange range = (VerseRange) ren.nextElement();
359 SectionEle section = doc.createSectionEle(range.toString());
360
361 // For all the verses in this range
362 Enumeration ven = range.verseElements();
363 while (ven.hasMoreElements())
364 {
365 Verse verse = (Verse) ven.nextElement();
366
367 // Seek to the correct point
368 xml_dat.seek(xml_arr[verse.getOrdinal()-1]);
369
370 // Read the XML text
371 String text = xml_dat.readUTF();
372
373 RefEle vref = section.createRefEle(verse, false);
374 vref.setPlainText(text);
375 }
376 }
377 }
378 catch (Exception ex)
379 {
380 throw new BookException("ser_read", ex);
381 }
382 }
383
384 /**
385 * For a given word find a list of references to it
386 * @param word The text to search for
387 * @return The references to the word
388 */
389 public Passage findPassage(String word) throws BookException
390 {
391 if (word == null)
392 return PassageFactory.createPassage();
393
394 Section section = (Section) ref_map.get(word.toLowerCase());
395
396 if (section == null)
397 return PassageFactory.createPassage();
398
399 try
400 {
401 // Read blob
402 byte[] blob = new byte[section.length];
403 ref_dat.seek(section.offset);
404 ref_dat.readFully(blob);
405
406 // De-serialize
407 return PassageUtil.fromBinaryRepresentation(blob);
408 }
409 catch (Exception ex)
410 {
411 log.warning("Search failed on:");
412 log.warning(" word="+word);
413 log.warning(" ref_ptr="+section.offset);
414 log.warning(" ref_length="+section.length);
415 Reporter.informUser(this, ex);
416
417 return PassageFactory.createPassage();
418 }
419 }
420
421 /**
422 * Retrieval: Return an array of words that are used by this Bible
423 * that start with the given string. For example calling:
424 * <code>getStartsWith("love")</code> will return something like:
425 * { "love", "loves", "lover", "lovely", ... }
426 * @param word The word to base your word array on
427 * @return An array of words starting with the base
428 */
429 public String[] getStartsWith(String word) throws BookException
430 {
431 word = word.toLowerCase();
432 SortedMap sub_map = ref_map.subMap(word, word+"\u9999");
433 Object[] temp = sub_map.keySet().toArray();
434
435 String[] retcode = new String[temp.length];
436 for (int i=0; i<temp.length; i++)
437 {
438 retcode[i] = (String) temp[i];
439 }
440
441 return retcode;
442 }
443
444 /**
445 * Retrieval: Get a list of the words used by this Version. This is
446 * not vital for normal display, however it is very useful for various
447 * things, not least of which is new Version generation. However if
448 * you are only looking to <i>display</i> from this Bible then you
449 * could skip this one.
450 * @return An Enumeration over the words used by this Bible
451 */
452 public Enumeration listWords() throws BookException
453 {
454 return new IteratorEnumeration(ref_map.keySet().iterator());
455 }
456
457 /**
458 * Write the XML to disk
459 * @param doc The data to write
460 * @throws BookException If the text could not be written
461 */
462 public void setDocument(BibleEle doc) throws BookException
463 {
464 try
465 {
466 // For all of the sections
467 for (Enumeration sen=doc.getSectionEles(); sen.hasMoreElements(); )
468 {
469 SectionEle section = (SectionEle) sen.nextElement();
470
471 // For all of the Verses in the section
472 for (Enumeration ven=section.getRefEles(); ven.hasMoreElements(); )
473 {
474 RefEle vel = (RefEle) ven.nextElement();
475
476 Verse verse = vel.getVerse();
477 String text = vel.getPlainText();
478
479 // Remember where we were so we can read it back later
480 xml_arr[verse.getOrdinal()-1] = xml_dat.getFilePointer();
481
482 // And write the entry
483 xml_dat.writeUTF(text);
484 }
485 }
486 }
487 catch (IOException ex)
488 {
489 throw new BookException("ser_write", ex);
490 }
491 }
492
493 /**
494 * Write the references for a Word
495 * @param word The word to write
496 * @param ref The references to the word
497 */
498 public void foundPassage(String word, Passage ref) throws BookException
499 {
500 if (word == null) return;
501
502 try
503 {
504 log.fine("s "+word+" "+System.currentTimeMillis());
505 byte[] buffer = PassageUtil.toBinaryRepresentation(ref);
506 log.fine("e "+word+" "+System.currentTimeMillis());
507
508 Section section = new Section(ref_dat.getFilePointer(), buffer.length);
509
510 ref_dat.write(buffer);
511 ref_map.put(word.toLowerCase(), section);
512
513 // log.debug(this, "Written:");
514 // log.debug(this, " word="+word);
515 // log.debug(this, " ref_ptr="+ref_ptr);
516 // log.debug(this, " ref_length="+ref_blob.length);
517 // log.debug(this, " ref_blob="+new String(ref_blob));
518 }
519 catch (Exception ex)
520 {
521 throw new BookException("ser_write", ex);
522 }
523 }
524
525 /**
526 * Flush the data written to disk
527 */
528 public void flush() throws BookException
529 {
530 try
531 {
532 ObjectOutputStream oout;
533
534 // Save the ascii Passage index
535 URL ref_idy_url = NetUtil.lengthenURL(url, "ref.index");
536 PrintWriter ref_idy_out = new PrintWriter(NetUtil.getOutputStream(ref_idy_url));
537 Iterator it = ref_map.keySet().iterator();
538 while (it.hasNext())
539 {
540 String word = (String) it.next();
541 Section section = (Section) ref_map.get(word);
542 ref_idy_out.println(word+":"+section.offset+":"+section.length);
543 }
544 ref_idy_out.close();
545
546 // Save the ascii XML index
547 URL xml_idy_url = NetUtil.lengthenURL(url, "xml.index");
548 PrintWriter xml_idy_out = new PrintWriter(NetUtil.getOutputStream(xml_idy_url));
549 for (int i=0; i<xml_arr.length; i++)
550 {
551 xml_idy_out.println(xml_arr[i]);
552 }
553 xml_idy_out.close();
554
555 // The Bible config info
556 Properties prop = new Properties();
557 prop.put("Version", getVersion().getFullName());
558 URL prop_url = NetUtil.lengthenURL(url, "bible.properties");
559 OutputStream prop_out = NetUtil.getOutputStream(prop_url);
560 PropertiesUtil.save(prop, prop_out, "SerBible Config");
561 }
562 catch (IOException ex)
563 {
564 throw new BookException("ser_index", ex);
565 }
566 }
567
568 /**
569 * The directory that holds the SerBible files
570 * @return The index file directory
571 */
572 public URL getBaseURL()
573 {
574 return url;
575 }
576
577 /** The SAX parser to use */
578 private static final String PARSER = "com.ibm.xml.parsers.SAXParser";
579
580 /** The base url */
581 private URL url;
582
583 /** The name of this version */
584 private String name;
585
586 /** The passages random access file */
587 private RandomAccessFile ref_dat;
588
589 /** The sorted map of indexes into the passages file, keyed by word */
590 private SortedMap ref_map;
591
592 /** The text random access file */
593 private RandomAccessFile xml_dat;
594
595 /**
596 * The array of offsets into the text file, one per verse. Note that the
597 * index in use is NOT the ordinal number of the verse since ordinal nos are
598 * 1 based. The index into xml_arr is verse.getOrdinal() - 1
599 */
600 private long[] xml_arr;
601
602 /** Some shortcuts into the list of names to help startsWith */
603 private long[] letters = new long[26];
604
605 /** The Version of the Bible that this produces */
606 private Version version;
607
608 /** The log stream */
609 protected static Logger log = Logger.getLogger("bible.book");
610
611 /**
612 * A simple class to hold an offset and length into the passages random
613 * access file
614 */
615 static class Section
616 {
617 public Section(long offset, int length)
618 {
619 this.offset = offset;
620 this.length = length;
621 }
622 public long offset;
623 public int length;
624 }
625
626 /**
627 * This customization just clips off the .ser from the array members
628 */
629 static class CustomArrayEnumeration extends ArrayEnumeration
630 {
631 /**
632 * This is the only one of the ArrayEnumeration constructors that we need
633 */
634 CustomArrayEnumeration(Object[] array)
635 {
636 super(array);
637 }
638
639 /**
640 * Get the next item from the database
641 * @return The next object in the list
642 */
643 public Object nextElement()
644 {
645 String file = (String) array[pos++];
646 return file.substring(0, file.length()-4);
647 }
648 }
649
650 /**
651 * Check that the directories in the version directory really
652 * represent versions.
653 */
654 static class CustomFilenameFilter implements FilenameFilter
655 {
656 /**
657 * Create a CustomFilenameFilter with a word to match
658 * the start of
659 */
660 public CustomFilenameFilter(String word)
661 {
662 this.word = word;
663 }
664
665 /**
666 * Match word
667 */
668 public boolean accept(File parent, String name)
669 {
670 return name.startsWith(word);
671 }
672
673 /** The word to match */
674 private String word;
675 }
676 }
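
The listing above stores its word index in ref.index as lines of the form word:offset:length and answers prefix queries with a sorted-map range. The following is a minimal stand-alone sketch of that idea only, not the class itself. The names RefIndexSketch, Entry, addLine and startsWith are hypothetical, it uses java.util generics rather than the com.sun collections classes, and it assumes natural String ordering in place of the project's StringComparator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical stand-alone sketch; not part of SerBible. */
final class RefIndexSketch
{
    /** Offset and length of one word's blob, in the spirit of SerBible.Section. */
    static final class Entry
    {
        final long offset;
        final int length;

        Entry(long offset, int length)
        {
            this.offset = offset;
            this.length = length;
        }
    }

    /** Parse one "word:offset:length" line and store it in the map. */
    static void addLine(SortedMap<String, Entry> map, String line)
    {
        int colon1 = line.indexOf(':');
        int colon2 = line.lastIndexOf(':');
        String word = line.substring(0, colon1);
        // Offsets are file positions, so they are parsed as long
        long offset = Long.parseLong(line.substring(colon1 + 1, colon2));
        int length = Integer.parseInt(line.substring(colon2 + 1));
        map.put(word, new Entry(offset, length));
    }

    /** All keys that start with the given prefix, in sorted order. */
    static List<String> startsWith(SortedMap<String, Entry> map, String prefix)
    {
        String low = prefix.toLowerCase();
        // Same upper-bound trick as getStartsWith(): append a high character
        // so the half-open range covers every key beginning with the prefix
        String high = low + "\u9999";
        return new ArrayList<String>(map.subMap(low, high).keySet());
    }

    public static void main(String[] args)
    {
        SortedMap<String, Entry> map = new TreeMap<String, Entry>();
        addLine(map, "love:0:120");
        addLine(map, "loves:120:40");
        addLine(map, "lord:160:900");
        System.out.println(startsWith(map, "Love"));
    }
}
```

Running main prints [love, loves]. The range query is only as good as its upper bound: a key whose character after the prefix sorts at or above the appended sentinel would be missed, which is acceptable for plain English word lists but is a limitation of this approach.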