Posted to user@nutch.apache.org by WebDev Freak <we...@gmail.com> on 2006/09/29 04:04:19 UTC
Need Help....Problem Crawling,
Hi, when I'm crawling PowerPoint documents, some parse fine and some give me
the following error:
2006-09-27 17:12:29,044 ERROR mspowerpoint.ContentReaderListener - extractClientTextBoxes
java.lang.ArrayIndexOutOfBoundsException: 1611976644
    at org.apache.poi.util.LittleEndian.getNumber(LittleEndian.java:491)
    at org.apache.poi.util.LittleEndian.getUShort(LittleEndian.java:64)
    at org.apache.nutch.parse.mspowerpoint.ContentReaderListener.extractTextBoxes(ContentReaderListener.java:200)
    at org.apache.nutch.parse.mspowerpoint.ContentReaderListener.processPOIFSReaderEvent(ContentReaderListener.java:110)
    at org.apache.poi.poifs.eventfilesystem.POIFSReader.processProperties(POIFSReader.java:260)
    at org.apache.poi.poifs.eventfilesystem.POIFSReader.read(POIFSReader.java:96)
    at org.apache.nutch.parse.mspowerpoint.PPTExtractor.extractText(PPTExtractor.java:49)
    at org.apache.nutch.parse.ms.MSExtractor.extract(MSExtractor.java:77)
    at org.apache.nutch.parse.ms.MSBaseParser.getParse(MSBaseParser.java:81)
    at org.apache.nutch.parse.mspowerpoint.MSPowerPointParser.getParse(MSPowerPointParser.java:44)
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:283)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:152)
Any help is appreciated.
Re: Need Help....Problem Crawling,
Posted by tryma <tr...@creuna.no>.
Anyone ever get any closer to a solution to this?
We are encountering the same error parsing some PowerPoint documents too.
Am I right in assuming this stems from the POI library rather than Nutch?
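For what it's worth, the exception is thrown inside POI's LittleEndian class, but the offset it receives comes from ContentReaderListener.extractTextBoxes on the Nutch side, so the trace alone does not settle which library computed the bad value. Below is a minimal sketch of how such a read can fail and how a bounds check could skip the record instead. The class LittleEndianSketch and both of its methods are hypothetical stand-ins written for illustration; they are not the POI or Nutch code.

```java
import java.util.OptionalInt;

// Hypothetical helper, NOT the POI API: it only mimics how a little-endian
// unsigned 16-bit value is read from a byte array at a given offset.
final class LittleEndianSketch {

    // Unchecked read: throws ArrayIndexOutOfBoundsException when offset or
    // offset + 1 lies outside the array, as in the trace above.
    static int readUShort(byte[] data, int offset) {
        int low = data[offset] & 0xFF;
        int high = data[offset + 1] & 0xFF;
        return (high << 8) | low;
    }

    // Guarded read: returns an empty result instead of throwing when the
    // offset cannot be valid for this buffer.
    static OptionalInt tryReadUShort(byte[] data, int offset) {
        if (offset < 0 || offset > data.length - 2) {
            return OptionalInt.empty();
        }
        return OptionalInt.of(readUShort(data, offset));
    }

    public static void main(String[] args) {
        byte[] record = {0x34, 0x12};
        // Valid offset: bytes 0x34, 0x12 decode to 0x1234.
        System.out.println(readUShort(record, 0));
        // Offset taken from the exception message: the guard rejects it.
        System.out.println(tryReadUShort(record, 1611976644).isPresent());
    }
}
```

If the offset is derived from a length field inside the PowerPoint file, a check like this would let the parser skip a damaged or unexpected record rather than abort the whole document, though that is an assumption about where the value comes from.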
Best,
Trym