Posted to dev@poi.apache.org by Phil Varner <ph...@gmail.com> on 2009/12/14 03:43:16 UTC

excel text extraction

Hi all,

I've been working with POI for text extraction and had a few
design questions.  And, yes, I am volunteering to fix these if they
are indeed problems.

1) ExtractorFactory uses the ExcelExtractor rather than the
EventBasedExcelExtractor, which causes it to OOM for very large
workbooks.  I was wondering why this was and if it would be reasonable
to change it.

2) Without an event-based extractor for OOXML workbooks, you can never
extract text from very large workbooks.  I implemented a hacky
workaround to read only the shared strings xml doc, but I was
wondering if there was a better way to do this or if there was any
interest in polishing this into something that could be part of POI.

3) QuickButCruddyTextExtractor doesn't extend POIOLE2TextExtractor,
and I was wondering if there was a reason why.

Thanks,

--Phil

-- 

Machines might be interesting, but people are fascinating. -- K.P.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org

Re: excel text extraction

Posted by Phil Varner <ph...@gmail.com>.

Thanks for the suggestions. Comments inline.

>> 1) ExtractorFactory uses the ExcelExtractor rather than the
>> EventBasedExcelExtractor, which causes it to OOM for very large workbooks.
>>  I was wondering why this was and if it would be reasonable to change it.
>
> The default is to use the UserModel based ones, as they tend to be more
> accurate and more configurable. However, I don't see why we couldn't add a
> "boolean preferEventBased" flag to toggle this.
>
> That said, iirc we only have an event based extractor for .xls, so it might
> not make all that much difference given that all other files you throw at it
> will take loads of memory again :/

Yes, the event-based extractor is only for xls.  However, I think the difference
is that an Excel doc has the potential to be very large (in my case,
65535x10) since they can be generated from another datasource, whereas
ppt and doc are usually human-created and are much smaller. I think
it's probably less common to have a word doc that took up multiple
gigs in memory, but easy to do with excel.  Anecdotally, my customer
has several XLS that cause OOMs and no ppt or doc that do so. No easy
answer here.

>> 2) Without an event-based extractor for OOXML workbooks, you can never
>> extract text from very large workbooks.  I implemented a hacky workaround to
>> read only the shared strings xml doc, but I was wondering if there was a
>> better way to do this or if there was any interest in polishing this into
>> something that could be part of POI.
>
> You could probably base something on XLSX2CSV which is largely event based

I got an OOM just loading the Package, so I don't think XLSX2CSV will work.
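
One way to avoid loading the whole Package is to treat the .xlsx as a
plain zip and stream only the shared strings part. Below is a rough
sketch of that idea, not the workaround mentioned above. It uses only
JDK classes (java.util.zip and StAX) and no POI API. The class and
method names are made up for illustration, and it assumes the part
lives at the conventional path xl/sharedStrings.xml.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of POI.
final class SharedStringsTextSketch {

    /** Scans the zip for xl/sharedStrings.xml (assumed path) and streams its text to out. */
    static void extract(InputStream xlsx, Appendable out)
            throws IOException, XMLStreamException {
        try (ZipInputStream zip = new ZipInputStream(xlsx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("xl/sharedStrings.xml".equals(entry.getName())) {
                    copyText(zip, out);
                    return;
                }
            }
        }
    }

    /** Appends the content of every <t> element, one line per <si> string item. */
    private static void copyText(InputStream xml, Appendable out)
            throws IOException, XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Disable DTDs and external entities; spreadsheet parts do not need them.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        XMLStreamReader reader = factory.createXMLStreamReader(xml);
        boolean inText = false;
        while (reader.hasNext()) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    inText = "t".equals(reader.getLocalName());
                    break;
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.CDATA:
                    if (inText) {
                        out.append(reader.getText());
                    }
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    if ("t".equals(reader.getLocalName())) {
                        inText = false;
                    } else if ("si".equals(reader.getLocalName())) {
                        out.append('\n'); // end of one shared string
                    }
                    break;
                default:
                    break;
            }
        }
        reader.close();
    }
}
```

Memory use stays roughly constant because nothing is held beyond the
current parser event. The catch is that this only sees the string
table: numbers, inline strings and cached formula results live in the
sheet parts, phonetic runs are included as if they were cell text, and
the output is in string-table order rather than cell order.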

>> 3) QuickButCruddyTextExtractor doesn't extend POIOLE2TextExtractor,
>> and I was wondering if there was a reason why.
>
> It predates the extractor interface by quite a bit, so I'm guessing it was
> forgotten :/

I tried implementing it a few weeks ago, but there's a reason (now
forgotten) that QBCTE can't implement POTE.  I ran into a bug with it
and switched back to PowerpointTextExtractor, which has worked fine.

>
> If you do fancy knocking up some patches for any of this, that'd be very
> much appreciated :)

Will do, once I'm sure they're stable.

--Phil




Re: excel text extraction

Posted by Nick Burch <ni...@alfresco.com>.

On Sun, 13 Dec 2009, Phil Varner wrote:
> 1) ExtractorFactory uses the ExcelExtractor rather than the 
> EventBasedExcelExtractor, which causes it to OOM for very large 
> workbooks.  I was wondering why this was and if it would be reasonable 
> to change it.

The default is to use the UserModel based ones, as they tend to be more 
accurate and more configurable. However, I don't see why we couldn't add a 
"boolean preferEventBased" flag to toggle this.

That said, iirc we only have an event based extractor for .xls, so it 
might not make all that much difference given that all other files you 
throw at it will take loads of memory again :/
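
A minimal sketch of how such a flag could behave. Every name here
(TextSource, ExtractorChooserSketch, Kind) is hypothetical and stands
in for the real extractor classes; this is not ExtractorFactory's API.
It assumes, as noted above, that only .xls has an event-based
extractor, so every other kind falls back to the UserModel one.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a text extractor; not a POI type.
interface TextSource {
    String getText();
}

// Hypothetical chooser illustrating a "preferEventBased" toggle.
final class ExtractorChooserSketch {

    /** File kinds this sketch distinguishes. */
    enum Kind { XLS, OTHER }

    /**
     * Returns the event-based extractor only when the caller asked for it
     * and one exists for this kind; otherwise the UserModel extractor.
     * Suppliers are used so the extractor that is not chosen is never built.
     */
    static TextSource choose(Kind kind,
                             boolean preferEventBased,
                             Supplier<TextSource> userModel,
                             Supplier<TextSource> eventBased) {
        if (preferEventBased && kind == Kind.XLS) {
            return eventBased.get();
        }
        return userModel.get();
    }
}
```

Keeping the default at false would preserve current behaviour for
existing callers, while anyone who needs to process very large .xls
files could opt in.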

> 2) Without an event-based extractor for OOXML workbooks, you can never 
> extract text from very large workbooks.  I implemented a hacky 
> workaround to read only the shared strings xml doc, but I was wondering 
> if there was a better way to do this or if there was any interest in 
> polishing this into something that could be part of POI.

You could probably base something on XLSX2CSV which is largely event based

> 3) QuickButCruddyTextExtractor doesn't extend POIOLE2TextExtractor,
> and I was wondering if there was a reason why.

It predates the extractor interface by quite a bit, so I'm guessing it was 
forgotten :/

If you do fancy knocking up some patches for any of this, that'd be very 
much appreciated :)

Nick
