Posted to dev@poi.apache.org by bu...@apache.org on 2005/06/03 18:33:48 UTC
DO NOT REPLY [Bug 35208] New: -
[PATCH] HSLF Update: new (quicker but greedy) text extractor
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://issues.apache.org/bugzilla/show_bug.cgi?id=35208>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND
INSERTED IN THE BUG DATABASE.
http://issues.apache.org/bugzilla/show_bug.cgi?id=35208
Summary: [PATCH] HSLF Update: new (quicker but greedy) text
extractor
Product: POI
Version: unspecified
Platform: Other
OS/Version: other
Status: NEW
Severity: normal
Priority: P2
Component: POI Overall
AssignedTo: poi-dev@jakarta.apache.org
ReportedBy: nick@torchbox.com
To quote from the javadoc of this single class:
* This class will get all the text from a PowerPoint Document, including
* all the bits you didn't want, and in a somewhat random order, but will
* do it very fast.
* The class ignores most of the hslf classes, and doesn't use
* HSLFSlideShow. Instead, it just does a very basic scan through the
* file, grabbing all the text records as it goes. It then returns the
* text, either as a single string, or as a vector of all the individual
* strings.
* Because of how it works, it will return a lot of "crud" text that you
* probably didn't want! It will return text from master slides. It will
* return duplicate text, and some mangled text (PowerPoint files often
* have duplicate copies of slide text in them). You don't get any idea
* what the text was associated with.
* Almost everyone will want to use @see PowerPointExtractor instead. There
* are only a very small number of cases (e.g. some performance-sensitive
* Lucene indexers) that would ever want to use this!
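As a rough illustration of the "very basic scan" described above (this is not the attached class, just a sketch), the idea can be shown on a raw byte array. It assumes the usual PowerPoint record layout: an 8-byte little-endian header (2 bytes of version/instance, 2 bytes of type, 4 bytes of length), a version nibble of 0xF marking a container, and text held in TextCharsAtom (type 4000, UTF-16LE) and TextBytesAtom (type 4008, one byte per character) records. The class name GreedyTextScan and its methods are made up for the example, and reading the "PowerPoint Document" stream out of the file is left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a greedy record scan; not part of POI.
class GreedyTextScan {
    static final int TEXT_CHARS_ATOM = 4000; // text stored as UTF-16LE
    static final int TEXT_BYTES_ATOM = 4008; // text stored one byte per character

    /** Walks the records in file order and returns the text of every text atom found. */
    static List<String> scan(byte[] data) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos + 8 <= data.length) {
            int version = data[pos] & 0x0F;   // low nibble of the first header byte
            int type = u16(data, pos + 2);
            long len = u32(data, pos + 4);
            int body = pos + 8;
            if (version == 0x0F) {            // container: step into its children
                pos = body;
                continue;
            }
            if (len > data.length - body) {
                break;                        // truncated or corrupt record, stop here
            }
            int n = (int) len;
            if (type == TEXT_BYTES_ATOM) {
                out.add(new String(data, body, n, StandardCharsets.ISO_8859_1));
            } else if (type == TEXT_CHARS_ATOM) {
                out.add(new String(data, body, n, StandardCharsets.UTF_16LE));
            }
            pos = body + n;                   // skip past this atom
        }
        return out;
    }

    /** Same scan, with all the strings joined into one, one per line. */
    static String scanAsString(byte[] data) {
        return String.join("\n", scan(data));
    }

    private static int u16(byte[] b, int p) {
        return (b[p] & 0xFF) | ((b[p + 1] & 0xFF) << 8);
    }

    private static long u32(byte[] b, int p) {
        return (b[p] & 0xFFL) | ((b[p + 1] & 0xFFL) << 8)
             | ((b[p + 2] & 0xFFL) << 16) | ((b[p + 3] & 0xFFL) << 24);
    }

    public static void main(String[] args) {
        // A container (type 1000) holding one TextBytesAtom and one TextCharsAtom.
        byte[] data = {
            0x0F, 0x00, (byte) 0xE8, 0x03, 0x16, 0, 0, 0,
            0x00, 0x00, (byte) 0xA8, 0x0F, 0x02, 0, 0, 0, 'H', 'i',
            0x00, 0x00, (byte) 0xA0, 0x0F, 0x04, 0, 0, 0, 'O', 0, 'k', 0
        };
        System.out.println(scanAsString(data));
    }
}
```

Because nothing here looks at which slide, master or notes page a record belongs to, the output has exactly the drawbacks listed above: duplicates, master slide text, and no association between the text and its owner.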
The file should go in org.apache.poi.hslf.extractor. It also needs a single-line
change in org.apache.poi.hslf.record.Record, making createRecordForType public
so that it can be called from outside the record package:
Index: Record.java
===================================================================
RCS file: /home/cvspublic/jakarta-poi/src/scratchpad/src/org/apache/poi/hslf/record/Record.java,v
retrieving revision 1.1
diff -u -r1.1 Record.java
--- Record.java 28 May 2005 05:36:00 -0000 1.1
+++ Record.java 3 Jun 2005 16:31:00 -0000
@@ -122,7 +122,7 @@
* (not including the size of the header), this code assumes you're
* passing in corrected lengths
*/
- protected static Record createRecordForType(long type, byte[] b, int start, int len) {
+ public static Record createRecordForType(long type, byte[] b, int start, int len) {
// Default is to use UnknownRecordPlaceholder
// When you create classes for new Records, add them here
switch((int)type) {