Posted to dev@poi.apache.org by bu...@apache.org on 2005/06/03 18:33:48 UTC

DO NOT REPLY [Bug 35208] New: - [PATCH] HSLF Update: new (quicker but greedy) text extractor

http://issues.apache.org/bugzilla/show_bug.cgi?id=35208

           Summary: [PATCH] HSLF Update: new (quicker but greedy) text
                    extractor
           Product: POI
           Version: unspecified
          Platform: Other
        OS/Version: other
            Status: NEW
          Severity: normal
          Priority: P2
         Component: POI Overall
        AssignedTo: poi-dev@jakarta.apache.org
        ReportedBy: nick@torchbox.com


To quote from the javadoc of this single class:
 * This class will get all the text from a PowerPoint Document, including
 *  all the bits you didn't want, and in a somewhat random order, but will
 *  do it very fast.
 * The class ignores most of the hslf classes, and doesn't use 
 *  HSLFSlideShow. Instead, it just does a very basic scan through the
 *  file, grabbing all the text records as it goes. It then returns the
 *  text, either as a single string, or as a vector of all the individual
 *  strings.
 * Because of how it works, it will return a lot of "crud" text that you 
 *  probably didn't want! It will return text from master slides. It will
 *  return duplicate text, and some mangled text (PowerPoint files often
 *  have duplicate copies of slide text in them). You don't get any idea
 *  what the text was associated with.
 * Almost everyone will want to use {@link PowerPointExtractor} instead. There
 *  are only a very small number of cases (e.g. some performance-sensitive
 *  Lucene indexers) that would ever want to use this!
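
For readers who want a feel for the "very basic scan" the javadoc describes, here is a minimal sketch. It is not the class attached to this bug and does not use any POI classes; GreedyTextScan, scan and record are hypothetical names chosen for illustration. It assumes the bytes of the "PowerPoint Document" stream are already in memory, that each record starts with an 8-byte little-endian header (2 bytes of option flags, 2 bytes of type, 4 bytes of length), that a low nibble of 0xF in the option flags marks a container, and that types 4000 and 4008 are the TextCharsAtom and TextBytesAtom text records.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the extractor attached to this bug.
class GreedyTextScan {
    static final int TEXT_CHARS_ATOM = 4000; // assumed: UTF-16LE text
    static final int TEXT_BYTES_ATOM = 4008; // assumed: one byte per character

    /** Walks the record stream front to back and collects every text atom. */
    static List<String> scan(byte[] data) {
        List<String> found = new ArrayList<>();
        int pos = 0;
        while (pos + 8 <= data.length) {
            int options = ushort(data, pos);
            int type = ushort(data, pos + 2);
            long len = uint(data, pos + 4);

            if ((options & 0x0F) == 0x0F) {
                // Container: step past the header so its children are read next.
                pos += 8;
                continue;
            }
            if (len > data.length - pos - 8) {
                break; // Length runs past the end of the data; stop scanning.
            }
            int bodyStart = pos + 8;
            int bodyLen = (int) len;
            if (type == TEXT_CHARS_ATOM) {
                found.add(new String(data, bodyStart, bodyLen, StandardCharsets.UTF_16LE));
            } else if (type == TEXT_BYTES_ATOM) {
                found.add(new String(data, bodyStart, bodyLen, StandardCharsets.ISO_8859_1));
            }
            pos = bodyStart + bodyLen; // Skip the body of any atom, text or not.
        }
        return found;
    }

    /** Builds one record (header plus concatenated parts); used for demo data. */
    static byte[] record(int options, int type, byte[]... parts) {
        int len = 0;
        for (byte[] p : parts) {
            len += p.length;
        }
        byte[] out = new byte[8 + len];
        out[0] = (byte) options;
        out[1] = (byte) (options >>> 8);
        out[2] = (byte) type;
        out[3] = (byte) (type >>> 8);
        out[4] = (byte) len;
        out[5] = (byte) (len >>> 8);
        out[6] = (byte) (len >>> 16);
        out[7] = (byte) (len >>> 24);
        int pos = 8;
        for (byte[] p : parts) {
            System.arraycopy(p, 0, out, pos, p.length);
            pos += p.length;
        }
        return out;
    }

    private static int ushort(byte[] b, int off) {
        return (b[off] & 0xFF) | ((b[off + 1] & 0xFF) << 8);
    }

    private static long uint(byte[] b, int off) {
        return (b[off] & 0xFFL) | ((b[off + 1] & 0xFFL) << 8)
                | ((b[off + 2] & 0xFFL) << 16) | ((b[off + 3] & 0xFFL) << 24);
    }

    public static void main(String[] args) {
        byte[] atom = record(0, TEXT_BYTES_ATOM, "Hello".getBytes(StandardCharsets.ISO_8859_1));
        byte[] container = record(0x000F, 1006, atom); // 1006 is an arbitrary container type here
        System.out.println(scan(container));
    }
}
```

Because the scan never looks at which container a text record sits in, it has no way to tell slide text from master slide text or to drop duplicates, which is the trade-off the javadoc warns about.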


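The file should go in org.apache.poi.hslf.extractor. The patch also needs a
one-line change in org.apache.poi.hslf.record.Record, making
createRecordForType public instead of protected so that it can be called from
the extractor package: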


Index: Record.java
===================================================================
RCS file: /home/cvspublic/jakarta-poi/src/scratchpad/src/org/apache/poi/hslf/record/Record.java,v
retrieving revision 1.1
diff -u -r1.1 Record.java
--- Record.java 28 May 2005 05:36:00 -0000      1.1
+++ Record.java 3 Jun 2005 16:31:00 -0000
@@ -122,7 +122,7 @@
         *  (not including the size of the header), this code assumes you're
         *  passing in corrected lengths
         */
-       protected static Record createRecordForType(long type, byte[] b, int start, int len) {
+       public static Record createRecordForType(long type, byte[] b, int start, int len) {
                // Default is to use UnknownRecordPlaceholder
                // When you create classes for new Records, add them here
                switch((int)type) {
