Posted to user@nutch.apache.org by LEVILLAIN Olivier <ol...@coface.com> on 2012/05/15 16:17:26 UTC

Tika parser exception IndexOutOfBoundsException

Hi,
Each time I include a Word file in my fetch/parse list, I get the following error:

2012-05-15 15:02:40,319 ERROR tika.TikaParser - Error parsing http://mydomain/mydir/mydoc.doc
java.lang.IndexOutOfBoundsException: Unable to read 512 bytes from 103424 in stream of length 62511
            at org.apache.poi.poifs.nio.ByteArrayBackedDataSource.read(ByteArrayBackedDataSource.java:41)
            at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt(NPOIFSFileSystem.java:420)
            at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.readBAT(NPOIFSFileSystem.java:397)
            at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.readCoreContents(NPOIFSFileSystem.java:356)
            at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:293)
            at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:170)
            at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
            at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
            at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
            at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
            at java.util.concurrent.FutureTask.run(Unknown Source)
            at java.lang.Thread.run(Unknown Source)
Is this a known problem? (I googled it but didn't find anything related.)
I'm using Nutch 1.4 with Solr 3.6.
Regards,
Olivier



Re: Tika parser exception IndexOutOfBoundsException

Posted by Piet van Remortel <pi...@gmail.com>.
Just a quick remark: I recently had continuous problems when setting
http.content.limit to -1, probably due to extremely large pages or loop
issues, which caused timeouts.

Setting it to a very large value instead of -1 solved that.

hth

Piet


On Tue, May 15, 2012 at 4:43 PM, Julien Nioche <lists.digitalpebble@gmail.com> wrote:

> Try setting http.content.limit to a very large value or -1. The parser
> sometimes chokes on truncated content.

Re: Tika parser exception IndexOutOfBoundsException

Posted by Julien Nioche <li...@gmail.com>.
Try setting http.content.limit to a very large value or -1. The parser
sometimes chokes on truncated content.
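
For reference, the error in the original post says POI tried to read 512
bytes at offset 103424 from a stream of only 62511 bytes, which is what one
would expect from a document that was cut short before parsing. A minimal
sketch of the override, assuming it goes in conf/nutch-site.xml (the usual
place for local overrides of nutch-default.xml). The 10485760 figure (10 MB)
is only an illustrative placeholder, not a recommended value; pick one larger
than your biggest document, or use -1 to disable truncation entirely.

```xml
<!-- conf/nutch-site.xml (assumed location for local overrides) -->
<property>
  <name>http.content.limit</name>
  <!-- Illustrative value: 10 MB. A negative value such as -1 means no truncation. -->
  <value>10485760</value>
</property>
```

The documents probably need to be fetched again after the change, since
content that was already stored truncated stays truncated.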



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble