Posted to user@tika.apache.org by Roland Cornelissen <ro...@metamatter.nl> on 2010/11/05 01:20:01 UTC

error parsing .XLS file

Hi,

I use Tika 0.7. When trying to parse an .XLS file I get this error:

Could not parse document:class
org.apache.tika.exception.TikaException:TIKA-198: Illegal IOException
from org.apache.tika.parser.microsoft.OfficeParser@110003
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException
from org.apache.tika.parser.microsoft.OfficeParser@110003
    at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:138)
    at
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:99)
    at
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
    at metricAv.TikaParser.parse(TikaParser.java:57)
    at metricAv.TikaParser.main(TikaParser.java:39)
Caused by: java.io.IOException: Unable to read entire block; 1 byte
read; expected 512 bytes
    at
org.apache.poi.poifs.storage.RawDataBlock.<init>(RawDataBlock.java:62)
    at
org.apache.poi.poifs.storage.RawDataBlockList.<init>(RawDataBlockList.java:51)
    at
org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:86)
    at
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:74)
    at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:132)
    ... 4 more

When the same document is saved as .ODS file there are no problems.

This is the source used:

    private void parse(String resourceLocation) throws IOException,
    SAXException, TikaException {
        InputStream input = new FileInputStream(new File(resourceLocation));
        try {
            ContentHandler textHandler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            AutoDetectParser parser = new AutoDetectParser();
            parser.parse(input, textHandler, metadata);
        } finally {
            // Close the stream even when parsing throws.
            input.close();
        }
    }

Can anybody clarify my problem?

Roland

Re: error parsing .XLS file

Posted by Roland Cornelissen <ro...@metamatter.nl>.
Using Tika from trunk produces the same error with the XLS.

When the .XLS is saved as .XLSX, another error comes up:

Could not parse document:class
java.lang.NoSuchMethodError:org.apache.poi.poifs.filesystem.POIFSFileSystem.hasPOIFSHeader(Ljava/io/InputStream;)Z
java.lang.NoSuchMethodError:
org.apache.poi.poifs.filesystem.POIFSFileSystem.hasPOIFSHeader(Ljava/io/InputStream;)Z
    at
org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:148)
    at
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:65)
    at
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:67)
    at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
    at
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:150)
    at metricAv.TikaParser.parse(TikaParser.java:57)
    at metricAv.TikaParser.main(TikaParser.java:39)


I see there has been a new release of POI (3.7) since 29 Oct.
I would like to build this in TIKA, but I am not familiar with Maven.
Maybe someone could explain how to modify the POMs in order to use POI
3.7  with TIKA?

Thanks
Roland





On 11/05/2010 04:49 PM, Nick Burch wrote:
> On Fri, 5 Nov 2010, Roland Cornelissen wrote:
>> Caused by: java.io.IOException: Unable to read entire block; 1 byte
>> read; expected 512 bytes
>>    at
>> org.apache.poi.poifs.storage.RawDataBlock.<init>(RawDataBlock.java:62)
>
> This is normally caused by truncated files. However, it might be worth
> trying with a recent nightly build of tika, as that has a newer POI in
> it, and I can't remember if there have been POIFS fixes since 0.7
>
> Nick
>


Re: error parsing .XLS file

Posted by Yatin Baraiya <ya...@highqsolutions.net>.
Nick Burch <ni...@...> writes:

> 
> On Thu, 29 Sep 2011, Yatin Baraiya wrote:
> > I have tried with the latest Tika 0.9 and POI 3.7 jars, but I get the
> > same exception.
> 
> What about with the release candidate of tika 0.10 (which uses POI 3.8 
> beta 4)?
> 
> Nick
> 
> 
Could you provide me with the download link for the Apache Tika 0.10 jar?

yatin





Re: error parsing .XLS file

Posted by Nick Burch <ni...@alfresco.com>.
On Thu, 29 Sep 2011, Yatin Baraiya wrote:
> I have tried with the latest Tika 0.9 and POI 3.7 jars, but I get the
> same exception.

What about with the release candidate of tika 0.10 (which uses POI 3.8 
beta 4)?

Nick

Re: error parsing .XLS file

Posted by Yatin Baraiya <ya...@highqsolutions.net>.
I have tried with the latest Tika 0.9 and POI 3.7 jars,
but I get the same exception.

Regards
Yatin




RE: error parsing .XLS file

Posted by Uwe Schindler <uw...@thetaphi.de>.
How about using a newer TIKA version?

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Yatin Baraiya [mailto:yatin.baraiya@highqsolutions.net]
> Sent: Thursday, September 29, 2011 9:35 AM
> To: user@tika.apache.org
> Subject: Re: error parsing .XLS file
> 
> Hi Roland,
>
> I get the same issue when I parse a Microsoft Office document.
> I have the poi-3.6 jar and the Tika 0.6 jar in my project.
>
> We get the following exception:
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 221433
> at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:45)
> at org.apache.poi.hwpf.model.ListLevel.<init>(ListLevel.java:120)
> at org.apache.poi.hwpf.model.ListFormatOverrideLevel.<init>(ListFormatOverrideLevel.java:48)
> at org.apache.poi.hwpf.model.ListTables.<init>(ListTables.java:88)
> at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:267)
> at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:157)
> at org.apache.poi.hwpf.extractor.WordExtractor.<init>(WordExtractor.java:62)
> at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:87)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
> 
> I tried opening tika-parsers-0.6.jar in WinRAR, found the pom.xml inside
> the jar, and edited it as per your suggestion. A snippet of the edited
> pom.xml is below:
> <dependency>
>       <groupId>org.apache.poi</groupId>
>       <artifactId>poi</artifactId>
>       <version>3.6</version>
>     </dependency>
>     <dependency>
>       <groupId>org.apache.poi</groupId>
>       <artifactId>poi-scratchpad</artifactId>
>       <version>3.6</version>
>     </dependency>
>     <dependency>
>       <groupId>org.apache.poi</groupId>
>       <artifactId>poi-ooxml</artifactId>
>       <version>3.6</version>
>       <exclusions>
>         <exclusion>
>           <groupId>stax</groupId>
>           <artifactId>stax-api</artifactId>
>         </exclusion>
>       </exclusions>
>     </dependency>
> 
> Can you tell me exactly how you arrived at the solution?
>
> Can you help me solve this issue?
>
> How do I modify the POM in order to use POI 3.7 with Tika?
> 
> Thanks
> Yatin Baraiya
> 



Re: error parsing .XLS file

Posted by Yatin Baraiya <ya...@highqsolutions.net>.
Hi Roland,

I get the same issue when I parse a Microsoft Office document.
I have the poi-3.6 jar and the Tika 0.6 jar in my project.

We get the following exception:
Caused by: java.lang.ArrayIndexOutOfBoundsException: 221433
at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:45)
at org.apache.poi.hwpf.model.ListLevel.<init>(ListLevel.java:120)
at org.apache.poi.hwpf.model.ListFormatOverrideLevel.<init>(ListFormatOverrideLevel.java:48)
at org.apache.poi.hwpf.model.ListTables.<init>(ListTables.java:88)
at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:267)
at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:157)
at org.apache.poi.hwpf.extractor.WordExtractor.<init>(WordExtractor.java:62)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:87)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)

I tried opening tika-parsers-0.6.jar in WinRAR, found the pom.xml inside
the jar, and edited it as per your suggestion.
A snippet of the edited pom.xml is below:
<dependency>
      <groupId>org.apache.poi</groupId>
      <artifactId>poi</artifactId>
      <version>3.6</version>
    </dependency>
    <dependency>
      <groupId>org.apache.poi</groupId>
      <artifactId>poi-scratchpad</artifactId>
      <version>3.6</version>
    </dependency>
    <dependency>
      <groupId>org.apache.poi</groupId>
      <artifactId>poi-ooxml</artifactId>
      <version>3.6</version>
      <exclusions>
        <exclusion>
          <groupId>stax</groupId>
          <artifactId>stax-api</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

Can you tell me exactly how you arrived at the solution?

Can you help me solve this issue?

How do I modify the POM in order to use POI 3.7 with Tika?

Thanks
Yatin Baraiya




Re: error parsing .XLS file

Posted by Roland Cornelissen <ro...@metamatter.nl>.
Solved...
There was a second, older version of POI (2.5.1) on my classpath,
from another app.

Now I notice that there is a difference in parsing results between the
ODS and XLS files, even though the same table is in both files.
Is this expected?

Thanks,
Roland
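
A duplicate library on the classpath, as described above, can be spotted by asking the class loader for every location that provides a given class file. The sketch below is only an illustration and not part of the original exchange: the class name ClasspathDuplicates and the helper findCopies are made up for this example, and it assumes the check is run with the same classpath as the failing application. More than one reported location for POIFSFileSystem would suggest that two POI jars are visible at once.

```java
import java.io.IOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

public class ClasspathDuplicates {

    /** Returns every location from which the named class file is visible. */
    static List<URL> findCopies(String className) throws IOException {
        // Convert "a.b.C" into the resource path "a/b/C.class".
        String resource = className.replace('.', '/') + ".class";
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        // getResources lists all matches, not just the first one found.
        return Collections.list(loader.getResources(resource));
    }

    public static void main(String[] args) throws IOException {
        // Default to the POI class named in the NoSuchMethodError.
        String name = args.length > 0
                ? args[0]
                : "org.apache.poi.poifs.filesystem.POIFSFileSystem";
        List<URL> copies = findCopies(name);
        System.out.println(copies.size() + " location(s) for " + name);
        for (URL url : copies) {
            System.out.println("  " + url);
        }
    }
}
```

Each printed URL names the jar that contains a copy of the class, so an unexpected second entry points at the jar to remove.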




Re: error parsing .XLS file

Posted by Paul Jakubik <pa...@purediscovery.com>.
On Fri, Nov 5, 2010 at 1:28 PM, Roland Cornelissen <ro...@metamatter.nl> wrote:

> I see there is a new release of POI (3.7) since 29. oct.
> I would like to build this in TIKA, but I am not familiar with Maven.
> Maybe someone could explain how to modify the POMs in order to use POI
> 3.7  with TIKA?
>
>
In the pom.xml files you should see something like the following:

<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>3.7</version>
</dependency>

The difference would be that the version will be something other than 3.7,
and you'll want to change it to 3.7.

I hope this helps,
Paul

Re: error parsing .XLS file

Posted by Nick Burch <ni...@alfresco.com>.
On Fri, 5 Nov 2010, Roland Cornelissen wrote:
> Caused by: java.io.IOException: Unable to read entire block; 1 byte
> read; expected 512 bytes
>    at
> org.apache.poi.poifs.storage.RawDataBlock.<init>(RawDataBlock.java:62)

This is normally caused by truncated files. However, it might be worth 
trying with a recent nightly build of tika, as that has a newer POI in it, 
and I can't remember if there have been POIFS fixes since 0.7

Nick
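
To test the truncated-file theory before changing library versions, one could check the file itself. The sketch below is a rough illustration under stated assumptions, not something from the original messages: the class name Ole2Check and its helpers are hypothetical, and it assumes the common 512-byte sector size, which matches the "expected 512 bytes" in the error. It checks the 8-byte OLE2 signature and reports how many bytes are left over after the last full 512-byte block.

```java
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Arrays;

public class Ole2Check {

    // Signature found in the first 8 bytes of an OLE2 compound document.
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Assumption: the file uses 512-byte sectors.
    static final int SECTOR_SIZE = 512;

    /** True if the given bytes start with the OLE2 signature. */
    static boolean hasOle2Header(byte[] firstBytes) {
        return firstBytes.length >= 8
                && Arrays.equals(Arrays.copyOf(firstBytes, 8), OLE2_MAGIC);
    }

    /** Number of bytes left over after the last complete block. */
    static long trailingBytes(long fileLength) {
        return fileLength % SECTOR_SIZE;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java Ole2Check <file>");
            return;
        }
        File file = new File(args[0]);
        byte[] head = new byte[8];
        DataInputStream in = new DataInputStream(new FileInputStream(file));
        try {
            // Throws EOFException if the file is shorter than 8 bytes.
            in.readFully(head);
        } finally {
            in.close();
        }
        System.out.println("OLE2 header present: " + hasOle2Header(head));
        System.out.println("bytes past last full block: "
                + trailingBytes(file.length()));
    }
}
```

A non-zero remainder hints at a truncated or padded file, but it is not proof either way, since the same exception can apparently also surface when a mismatched POI version is on the classpath.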