Posted to user@poi.apache.org by Pedro Dalcin <pd...@gmail.com> on 2011/07/29 15:40:39 UTC

ExtractorFactory doubt

ExtractorFactory doubt (or problem)

So... I'm running some tests with POI to learn a bit more about it, and I
stumbled upon a problem I can't seem to figure out.
I've created a simple document reader based on the sample on the Apache
POI website.

But when I get to

POITextExtractor[] embeddedExtractors =
    ExtractorFactory.getEmbededDocsTextExtractors(oleTextExtractor);

embeddedExtractors ends up empty if I use an Excel file, and I receive the
following error if I use a doc file:

Exception in thread "main" java.lang.IllegalArgumentException: No supported documents found in the OLE2 stream
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:238)
    at org.apache.poi.extractor.ExtractorFactory.getEmbededDocsTextExtractors(ExtractorFactory.java:309)
    at test.PoiReadExcelFile.main(PoiReadExcelFile.java:30)

Have I missed something? Or did I misunderstand what ExtractorFactory
actually does?

Thanks,

Pedro Dalcin

Re: ExtractorFactory doubt

Posted by Pedro Dalcin <pd...@gmail.com>.
I know that it's a POI supported file :-)
Thanks, that gave me some directions to follow.

Pedro Dalcin


On Fri, Jul 29, 2011 at 12:02 PM, Nick Burch <ni...@alfresco.com> wrote:

> On Fri, 29 Jul 2011, Pedro Dalcin wrote:
>
>> I'm actually after a way to identify what type of file I'm inputting. I've
>> figured I can't simply check the extension since there are different types
>> of "doc" that use the same extension.
>>
>
> If you know it's a POI supported file, then passing it to the
> ExtractorFactory will give you back a suitable simple text extractor for it.
>
> If you don't know what it is, or if you want "fancy" text extraction, then
> use Apache Tika. That provides detection, and fairly full-featured text
> extractors.
>
> (If you know what it is, and you need full control of the text extraction
> process, then you'll likely want to write your own code on top of POI.)
>
>
> Nick
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
> For additional commands, e-mail: user-help@poi.apache.org
>
>

Re: ExtractorFactory doubt

Posted by Nick Burch <ni...@alfresco.com>.
On Fri, 29 Jul 2011, Pedro Dalcin wrote:
> I'm actually after a way to identify what type of file I'm inputting. 
> I've figured I can't simply check the extension since there are 
> different types of "doc" that use the same extension.

If you know it's a POI supported file, then passing it to the
ExtractorFactory will give you back a suitable simple text extractor for it.

If you don't know what it is, or if you want "fancy" text extraction, then
use Apache Tika. That provides detection, and fairly full-featured text
extractors.

(If you know what it is, and you need full control of the text extraction
process, then you'll likely want to write your own code on top of POI.)

Nick
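
To make the detection point above concrete, here is a minimal sketch of why
the extension alone is not enough: files saved as ".doc" may be OLE2
containers, ZIP-based OOXML packages, or plain RTF, and the first few bytes
tell them apart. This is only an illustration using the standard library. The
class name FileSniffer and its detect methods are hypothetical and are not
part of POI or Tika. It identifies the container only, so an OLE2 result does
not say whether the file is Word, Excel or PowerPoint. For real detection,
Tika remains the better choice.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

/** Hypothetical helper (not a POI or Tika class): guesses a container format from leading bytes. */
final class FileSniffer {

    // OLE2 compound document signature (container for legacy .doc, .xls, .ppt).
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header (container for OOXML: .docx, .xlsx, .pptx).
    private static final byte[] ZIP = { 'P', 'K', 0x03, 0x04 };
    // RTF documents start with "{\rtf" and are sometimes saved as .doc.
    private static final byte[] RTF = { '{', '\\', 'r', 't', 'f' };

    /** Classifies the given leading bytes as "OLE2", "ZIP", "RTF" or "UNKNOWN". */
    static String detect(byte[] header) {
        if (startsWith(header, OLE2)) return "OLE2";
        if (startsWith(header, ZIP)) return "ZIP";
        if (startsWith(header, RTF)) return "RTF";
        return "UNKNOWN";
    }

    /** Reads up to the first 8 bytes of the file and classifies them. */
    static String detect(Path file) throws IOException {
        byte[] header = new byte[8];
        int total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            // A single read may return fewer bytes than requested, so loop.
            while (total < header.length
                    && (n = in.read(header, total, header.length - total)) > 0) {
                total += n;
            }
        }
        return detect(Arrays.copyOf(header, total));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            System.out.println(arg + ": " + detect(Paths.get(arg)));
        }
    }
}
```

Saved as FileSniffer.java, this can be run against any file path, for
example "java FileSniffer.java some.doc" on Java 11 or later. An OLE2 or ZIP
result suggests the file is worth handing to ExtractorFactory. Anything else
is probably a job for Tika.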



Re: ExtractorFactory doubt

Posted by Pedro Dalcin <pd...@gmail.com>.
Yeah... While I was writing the e-mail I suspected I had misunderstood what
ExtractorFactory does.
I'm actually after a way to identify what type of file I'm inputting. I've
figured I can't simply check the extension since there are different types
of "doc" that use the same extension.

Pedro Dalcin


On Fri, Jul 29, 2011 at 11:29 AM, Nick Burch <ni...@alfresco.com> wrote:

> On Fri, 29 Jul 2011, Pedro Dalcin wrote:
>
>> POITextExtractor[] embeddedExtractors
>> =ExtractorFactory.getEmbededDocsTextExtractors(oleTextExtractor);
>>
>> embeddedExtractors ends up empty
>>
>
> Does your document have any other docs embedded in it? I suspect it
> doesn't, which is why you're not getting any back.
>
> Nick

Re: ExtractorFactory doubt

Posted by Nick Burch <ni...@alfresco.com>.
On Fri, 29 Jul 2011, Pedro Dalcin wrote:
> POITextExtractor[] embeddedExtractors
> =ExtractorFactory.getEmbededDocsTextExtractors(oleTextExtractor);
>
> embeddedExtractors ends up empty

Does your document have any other docs embedded in it? I suspect it
doesn't, which is why you're not getting any back.

Nick
