Posted to user@poi.apache.org by Pedro Dalcin <pd...@gmail.com> on 2011/07/29 15:40:39 UTC
ExtractorFactory doubt
ExtractorFactory doubt (or problem)
So... I'm running some tests with POI to learn a bit more about it, and I
stumbled upon a problem I can't seem to figure out.
I've created a simple document reader based on the sample within the Apache
POI website.
But when I get to
POITextExtractor[] embeddedExtractors
=ExtractorFactory.getEmbededDocsTextExtractors(oleTextExtractor);
embeddedExtractors ends up empty if I use an Excel file, and I receive the
following error if I use a doc file:
Exception in thread "main" java.lang.IllegalArgumentException: No supported documents found in the OLE2 stream
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:238)
    at org.apache.poi.extractor.ExtractorFactory.getEmbededDocsTextExtractors(ExtractorFactory.java:309)
    at test.PoiReadExcelFile.main(PoiReadExcelFile.java:30)
Have I missed something? Or have I misunderstood what ExtractorFactory
actually does?
Thanks,
Pedro Dalcin
Re: ExtractorFactory doubt
Posted by Pedro Dalcin <pd...@gmail.com>.
I know that it's a POI supported file : - )
Thanks, that gave me some directions to follow.
Pedro Dalcin
On Fri, Jul 29, 2011 at 12:02 PM, Nick Burch <ni...@alfresco.com> wrote:
> On Fri, 29 Jul 2011, Pedro Dalcin wrote:
>
>> I'm actually after a way to identify what type of file I'm inputting. I've
>> figured I can't simply check the extension since there are different types
>> of "doc" that use the same extension.
>>
>
> If you know it's a POI supported file, then passing it to the
> ExtractorFactory will give you back a suitable simple text extractor for it.
>
> If you don't know what it is, or if you want "fancy" text extraction, then
> use Apache Tika. That provides detection, and fairly full featured text
> extractors.
>
> (If you know what it is, and you need full control of the text extraction
> process, then you'll likely want to write your own code on top of POI.)
>
>
> Nick
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
> For additional commands, e-mail: user-help@poi.apache.org
Re: ExtractorFactory doubt
Posted by Nick Burch <ni...@alfresco.com>.
On Fri, 29 Jul 2011, Pedro Dalcin wrote:
> I'm actually after a way to identify what type of file I'm inputting.
> I've figured I can't simply check the extension since there are
> different types of "doc" that use the same extension.
If you know it's a POI supported file, then passing it to the
ExtractorFactory will give you back a suitable simple text extractor for it.
If you don't know what it is, or if you want "fancy" text extraction, then
use Apache Tika. That provides detection, and fairly full featured text
extractors.
(If you know what it is, and you need full control of the text extraction
process, then you'll likely want to write your own code on top of POI.)
Nick
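[Editorial sketch, not part of the original message.] To illustrate why the
file extension is not enough, the snippet below guesses the container format
from the first bytes of a file. It uses only the JDK and is not how POI or
Tika do detection internally; the class name ContainerSniffer and its methods
are hypothetical. It only tells containers apart: an OLE2 result does not say
whether the file is a Word, Excel or PowerPoint document, and a ZIP result is
not necessarily an OOXML file. For real detection, Tika remains the better
choice, as suggested above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

/** Hypothetical helper (not a POI or Tika class): guesses a container format from leading bytes. */
public class ContainerSniffer {

    // OLE2 compound document signature, used by binary .doc, .xls and .ppt files.
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header; OOXML files (.docx, .xlsx) are ZIP archives.
    private static final byte[] ZIP = { 'P', 'K', 0x03, 0x04 };
    // RTF files, which are sometimes saved with a .doc extension.
    private static final byte[] RTF = { '{', '\\', 'r', 't', 'f' };

    /** Returns true if data begins with the given signature. */
    public static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    /** Classifies a header as "OLE2", "ZIP", "RTF" or "UNKNOWN". */
    public static String sniff(byte[] header) {
        if (startsWith(header, OLE2)) return "OLE2";
        if (startsWith(header, ZIP)) return "ZIP";
        if (startsWith(header, RTF)) return "RTF";
        return "UNKNOWN";
    }

    /** Reads up to 8 bytes from the file and classifies them. */
    public static String sniff(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] header = new byte[8];
            int n = 0;
            while (n < header.length) {
                int r = in.read(header, n, header.length - n);
                if (r < 0) {
                    break; // file is shorter than the header buffer
                }
                n += r;
            }
            return sniff(Arrays.copyOf(header, n));
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ContainerSniffer file1 [file2 ...]
        for (String arg : args) {
            System.out.println(arg + ": " + sniff(Paths.get(arg)));
        }
    }
}
```

Only files reported as OLE2 or ZIP are candidates for ExtractorFactory; an RTF
file named something.doc is one example of a "doc" that it would not handle.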
Re: ExtractorFactory doubt
Posted by Pedro Dalcin <pd...@gmail.com>.
Yeah... While I was writing the e-mail I was suspecting I misunderstood what
ExtractorFactory does.
I'm actually after a way to identify what type of file I'm inputting. I've
figured I can't simply check the extension since there are different types
of "doc" that use the same extension.
Pedro Dalcin
On Fri, Jul 29, 2011 at 11:29 AM, Nick Burch <ni...@alfresco.com> wrote:
> On Fri, 29 Jul 2011, Pedro Dalcin wrote:
>
>> POITextExtractor[] embeddedExtractors
>> =ExtractorFactory.getEmbededDocsTextExtractors(oleTextExtractor);
>>
>> embeddedExtractors ends up empty
>>
>
> Does your document have any other docs embedded in it? I suspect it
> doesn't, which is why you're not getting any back
>
> Nick
Re: ExtractorFactory doubt
Posted by Nick Burch <ni...@alfresco.com>.
On Fri, 29 Jul 2011, Pedro Dalcin wrote:
> POITextExtractor[] embeddedExtractors
> =ExtractorFactory.getEmbededDocsTextExtractors(oleTextExtractor);
>
> embeddedExtractors ends up empty
Does your document have any other docs embedded in it? I suspect it
doesn't, which is why you're not getting any back
Nick