Posted to user@nutch.apache.org by WebDev Freak <we...@gmail.com> on 2006/09/29 04:04:19 UTC

Need Help....Problem Crawling,

Hi, when I'm crawling PowerPoint documents, some parse fine and others give me
the following error:

2006-09-27 17:12:29,044 ERROR mspowerpoint.ContentReaderListener - extractClientTextBoxes
java.lang.ArrayIndexOutOfBoundsException: 1611976644
    at org.apache.poi.util.LittleEndian.getNumber(LittleEndian.java:491)
    at org.apache.poi.util.LittleEndian.getUShort(LittleEndian.java:64)
    at org.apache.nutch.parse.mspowerpoint.ContentReaderListener.extractTextBoxes(ContentReaderListener.java:200)
    at org.apache.nutch.parse.mspowerpoint.ContentReaderListener.processPOIFSReaderEvent(ContentReaderListener.java:110)
    at org.apache.poi.poifs.eventfilesystem.POIFSReader.processProperties(POIFSReader.java:260)
    at org.apache.poi.poifs.eventfilesystem.POIFSReader.read(POIFSReader.java:96)
    at org.apache.nutch.parse.mspowerpoint.PPTExtractor.extractText(PPTExtractor.java:49)
    at org.apache.nutch.parse.ms.MSExtractor.extract(MSExtractor.java:77)
    at org.apache.nutch.parse.ms.MSBaseParser.getParse(MSBaseParser.java:81)
    at org.apache.nutch.parse.mspowerpoint.MSPowerPointParser.getParse(MSPowerPointParser.java:44)
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:283)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:152)


Any help is appreciated.

Re: Need Help....Problem Crawling,

Posted by tryma <tr...@creuna.no>.

Anyone ever get any closer to a solution to this?

We are encountering the same error parsing some PowerPoint documents too.

Am I right in assuming this stems from the POI library rather than Nutch?


Best,
Trym
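
For reference, the trace shows the exception raised inside POI's
LittleEndian.getUShort, which was called from Nutch's
ContentReaderListener.extractTextBoxes with an index of 1611976644. The
following is only a minimal sketch of how an unchecked offset into a byte array
leads to this kind of error and how a bounds check could guard against it. The
class SafeLittleEndian and the method readUShort are hypothetical names made up
for illustration; this is not the actual POI or Nutch code, and it does not
settle which of the two supplies the bad offset.

```java
import java.util.OptionalInt;

// Hypothetical helper for illustration only; not part of POI or Nutch.
class SafeLittleEndian {

    /**
     * Reads an unsigned 16-bit little-endian value at the given offset.
     * Returns an empty OptionalInt when the two bytes do not fit inside
     * the array, instead of throwing ArrayIndexOutOfBoundsException.
     */
    static OptionalInt readUShort(byte[] data, int offset) {
        // Reject negative offsets and offsets whose second byte
        // would fall past the end of the array.
        if (offset < 0 || offset > data.length - 2) {
            return OptionalInt.empty();
        }
        int low = data[offset] & 0xFF;       // least significant byte first
        int high = data[offset + 1] & 0xFF;  // most significant byte second
        return OptionalInt.of((high << 8) | low);
    }

    public static void main(String[] args) {
        byte[] data = {0x34, 0x12, (byte) 0xFF, (byte) 0xFF};
        // An offset inside the array yields a value.
        System.out.println(readUShort(data, 0));
        // The index from the log above is out of range, so the result is empty.
        System.out.println(readUShort(data, 1611976644));
    }
}
```

With a check of this kind, a caller could skip the malformed record or the
whole document rather than letting the exception propagate.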



