public String getTextFromPieces() {
    StringBuilder textBuf = new StringBuilder();
    Iterator textPieces = doc.getTextTable().getTextPieces().iterator();
    while (textPieces.hasNext()) {
        TextPiece piece = (TextPiece) textPieces.next();
        // Pieces are stored either as 8-bit Cp1252 or as 16-bit little-endian Unicode
        String encoding = "Cp1252";
        if (piece.usesUnicode()) {
            encoding = "UTF-16LE";
        }
        try {
            String text = new String(piece.getRawBytes(), encoding);
            textBuf.append(text);
        } catch (UnsupportedEncodingException e) {
            throw new InternalError("Standard Encoding " + encoding + " not found, JVM broken");
        }
    }
    String text = textBuf.toString();
    // Fix line endings (note: this won't catch all of them; only runs of
    // two or three \r and a trailing \r are converted)
    text = text.replaceAll("\r\r\r", "\r\n\r\n\r\n");
    text = text.replaceAll("\r\r", "\r\n\r\n");
    if (text.endsWith("\r")) {
        text += "\n";
    }
    return text;
}
Grabs the text directly out of the text pieces. The result may include
various bits of crud, but this works in cases where the text piece -> paragraph
mapping is broken. It is also fast.
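
The two steps described above (decode each piece with its own encoding, then fix the line endings) can be sketched in isolation. This is a minimal stand-alone sketch, not the library's code: the `TextPieceSketch` class, its `Piece` type, and both method names are hypothetical stand-ins. It also swaps in a different line-ending rule: a single regex converts every `\r` not already followed by `\n`, whereas the method above only handles runs of two or three and a trailing `\r`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch; none of these names come from the original code.
class TextPieceSketch {

    /** Hypothetical stand-in for a text piece: raw bytes plus a Unicode flag. */
    static final class Piece {
        final byte[] rawBytes;
        final boolean unicode;

        Piece(byte[] rawBytes, boolean unicode) {
            this.rawBytes = rawBytes;
            this.unicode = unicode;
        }
    }

    // "windows-1252" is the canonical name of the charset the JDK also knows as Cp1252.
    static final Charset CP1252 = Charset.forName("windows-1252");

    /** Decodes each piece with its own charset and concatenates the results. */
    static String decode(List<Piece> pieces) {
        StringBuilder sb = new StringBuilder();
        for (Piece p : pieces) {
            Charset cs = p.unicode ? StandardCharsets.UTF_16LE : CP1252;
            sb.append(new String(p.rawBytes, cs));
        }
        return sb.toString();
    }

    /** Converts every \r that is not already followed by \n into \r\n. */
    static String normalizeLineEndings(String text) {
        return text.replaceAll("\r(?!\n)", "\r\n");
    }

    public static void main(String[] args) {
        List<Piece> pieces = Arrays.asList(
                new Piece("Hello\r".getBytes(CP1252), false),
                new Piece("World\r\r".getBytes(StandardCharsets.UTF_16LE), true));
        String text = normalizeLineEndings(decode(pieces));
        // Print with the line endings made visible.
        System.out.println(text.replace("\r\n", "\\r\\n"));
    }
}
```

Passing `Charset` objects to the `String` constructor avoids the checked `UnsupportedEncodingException`, so the sketch needs no try/catch around the decode step.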