Posted to user@nutch.apache.org by Elisabeth Adler <el...@gmail.com> on 2011/08/30 19:29:51 UTC

Nutch 1.3 - DIFAT array IOException on parsing files

Hi,

I am using Nutch 1.3 to crawl our intranet site. I have enabled the
parse-tika plugin (see [1]) to parse PDFs and MS Office documents, and
added the MIME types to parse-plugins.xml.

On crawling, the URLs of my files are correctly retrieved, but on 
parsing the files, I get the following errors:
[Error1]: Error parsing: http://.../sample-site/news/cactus/2011-06-22_ClientSupportWeeklyReport.pdf: failed(2,0): null
[Error2]: Error parsing: http://../sample-site/news/cactus/Operations-Meeting-Minutes-2011-wk02.doc: failed(2,0): Your file contains 127 sectors, but the initial DIFAT array at index 0 referenced block # 208. This isn't allowed and  your file is corrupt
[Error3]: Error parsing: http://../sample-site/news/cactus/work_log_lisi.xls: failed(2,0): Your file contains 127 sectors, but the initial DIFAT array at index 0 referenced block # 241. This isn't allowed and  your file is corrupt

Full stack traces for the errors are below. When I enter the URLs in a
browser, the files open without problems. I also added the files to the
Nutch test cases, where Nutch opened and read them correctly, so the
files themselves do not seem to be the problem. The commands I use to
fetch and parse are also below [2].

Has anyone encountered any of these problems? Any pointers are very
much appreciated!
Thanks a lot,
Elisabeth


[1] nutch-site.xml
<property><name>plugin.includes</name> 
<value>parse-(html|tika|js|zip)|...</value> </property>

[Error1]:
2011-08-30 18:13:11,783 ERROR tika.TikaParser - Error parsing http://.../sample-site/news/cactus/2011-06-22_ClientSupportWeeklyReport.pdf
java.lang.NullPointerException
         at org.apache.pdfbox.pdmodel.PDPageNode.getCount(PDPageNode.java:109)
         at org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(PDDocument.java:946)
         at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:107)
         at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:88)
         at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
         at java.lang.Thread.run(Thread.java:662)

[Error2]:
2011-08-30 18:13:11,864 ERROR tika.TikaParser - Error parsing http://.../sample-site/news/cactus/Operations-Meeting-Minutes-2011-wk02.doc
java.io.IOException: Your file contains 127 sectors, but the initial DIFAT array at index 0 referenced block # 208. This isn't allowed and  your file is corrupt
         at org.apache.poi.poifs.storage.BlockAllocationTableReader.<init>(BlockAllocationTableReader.java:113)
         at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:166)
         at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
         at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
         at java.lang.Thread.run(Thread.java:662)

[Error3]:
2011-08-30 18:13:11,902 ERROR tika.TikaParser - Error parsing http://.../sample-site/news/cactus/work_log_lisi.xls
java.io.IOException: Your file contains 127 sectors, but the initial DIFAT array at index 0 referenced block # 241. This isn't allowed and  your file is corrupt
         at org.apache.poi.poifs.storage.BlockAllocationTableReader.<init>(BlockAllocationTableReader.java:113)
         at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:166)
         at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
         at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
         at java.lang.Thread.run(Thread.java:662)

[2]
./bin/nutch inject crawl/crawldb urls >> crawl.log
./bin/nutch generate crawl/crawldb crawl/segments >> crawl.log
# Pick the newest segment; the assignment prints nothing, so no redirect is needed.
s1=$(ls -d crawl/segments/2* | tail -1)
./bin/nutch fetch "$s1" -noParsing >> crawl.log
./bin/nutch parse "$s1" >> crawl.log

Re: Nutch 1.3 - DIFAT array IOException on parsing files

Posted by Elisabeth Adler <el...@gmail.com>.
will do, thanks!

On 30.08.2011 19:41, Markus Jelsma wrote:
> Hi,
>
> Can you report your issues to the Tika mailing list? You're more likely to get
> help there.
>
> Cheers
>

Re: Nutch 1.3 - DIFAT array IOException on parsing files

Posted by Markus Jelsma <ma...@openindex.io>.
Ah, you are using an older version. Newer Nutch releases mention both limits 
in the description to avoid confusion.

Cheers

On Wednesday 31 August 2011 15:11:29 Elisabeth Adler wrote:
> Hi all,
> thanks for the help!
> The culprit was that I was setting the file.content.limit instead of the
> http.content.limit.
> 
> On 31.08.2011 08:22, Elisabeth Adler wrote:
> > The PDF is 528 KB (the .doc is 108 KB and the .xls is 123 KB),
> > and I set the limit in the config to -1:
> > <property>
> > <name>file.content.limit</name>
> > <value>-1</value>
> > <description>The length limit for downloaded content, in bytes.
> >             If this value is nonnegative (>=0), content longer than it
> >             will be truncated; otherwise, no truncation at all.
> > </description>
> > </property>
> > 
> > Is there any setting where I can force Nutch to somehow persist the
> > file before parsing it so I can make sure it's actually there?
> > 
> > On 30.08.2011 21:42, lewis john mcgibbney wrote:
> >> Hi Elisabeth,
> >> 
> >> Can you please check the size of the pdf files you are trying to parse
> >> and set the http.content.limit property accordingly in nutch-site.xml
> >> 
> >> Anything over the default limit will be truncated (or skipped in some
> >> cases)
> >> 
> >> Please get back to us on this one..
> >> 
> >> On Tue, Aug 30, 2011 at 8:27 PM, Elisabeth Adler
> >> 
> >> <el...@gmail.com>wrote:
> >>> Actually, I don't think tika is the issue. If I add manually downloaded
> >>> PDFs to Nutch's test cases, the files are parsed correctly. I think it
> >>> is more likely something with Nutch not being able to download the
> >>> files correctly.
> >>> Any pointers?
> >>> thanks,
> >>> Elisabeth
> >>> 
> >>> On 30.08.2011 19:41, Markus Jelsma wrote:
> >>>> Hi,
> >>>> 
> >>>> Can you report your issues to the Tika mailing list? You're more
> >>>> likely to get
> >>>> help there.
> >>>> 
> >>>> Cheers
> >>>> 

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Nutch 1.3 - DIFAT array IOException on parsing files

Posted by Elisabeth Adler <el...@gmail.com>.
Hi all,
thanks for the help!
The culprit was that I was setting the file.content.limit instead of the 
http.content.limit.
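
A minimal sketch of why the property name matters, assuming (as the fix above suggests) that each protocol plugin reads its own content-limit property. The function name limit_property_for is made up for illustration and is not a Nutch command; ftp.content.limit is included on the assumption that it follows the same naming pattern.

```shell
# Hypothetical helper (not part of Nutch): map a URL to the content-limit
# property that would apply to it, based on the URL scheme.
limit_property_for() {
  case "$1" in
    http://*|https://*) echo "http.content.limit" ;;
    file:*)             echo "file.content.limit" ;;
    ftp://*)            echo "ftp.content.limit" ;;   # assumed to follow the same pattern
    *)                  echo "unknown" ;;
  esac
}

# The failing documents were fetched over HTTP, so setting
# file.content.limit to -1 had no effect on them.
limit_property_for "http://.../sample-site/news/cactus/work_log_lisi.xls"
```

For the intranet URLs in this thread the helper names http.content.limit, which is the property that had to be changed.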

On 31.08.2011 08:22, Elisabeth Adler wrote:
> The PDF is 528 KB (the .doc is 108 KB and the .xls is 123 KB),
> and I set the limit in the config to -1:
> <property>
> <name>file.content.limit</name>
> <value>-1</value>
> <description>The length limit for downloaded content, in bytes.
>             If this value is nonnegative (>=0), content longer than it
>             will be truncated; otherwise, no truncation at all.
> </description>
> </property>
>
> Is there any setting where I can force Nutch to somehow persist the 
> file before parsing it so I can make sure it's actually there?
>
>
> On 30.08.2011 21:42, lewis john mcgibbney wrote:
>> Hi Elisabeth,
>>
>> Can you please check the size of the pdf files you are trying to parse and
>> set the http.content.limit property accordingly in nutch-site.xml
>>
>> Anything over the default limit will be truncated (or skipped in some cases)
>>
>> Please get back to us on this one..
>>
>> On Tue, Aug 30, 2011 at 8:27 PM, Elisabeth Adler
>> <el...@gmail.com>wrote:
>>
>>> Actually, I don't think tika is the issue. If I add manually downloaded
>>> PDFs to Nutch's test cases, the files are parsed correctly. I think it is
>>> more likely something with Nutch not being able to download the files
>>> correctly.
>>> Any pointers?
>>> thanks,
>>> Elisabeth
>>>
>>> On 30.08.2011 19:41, Markus Jelsma wrote:
>>>
>>>> Hi,
>>>>
>>>> Can you report your issues to the Tika mailing list? You're more likely to
>>>> get
>>>> help there.
>>>>
>>>> Cheers
>>>>

Re: Nutch 1.3 - DIFAT array IOException on parsing files

Posted by Elisabeth Adler <el...@gmail.com>.
The PDF is 528 KB (the .doc is 108 KB and the .xls is 123 KB),
and I set the limit in the config to -1:
<property>
<name>file.content.limit</name>
<value>-1</value>
<description>The length limit for downloaded content, in bytes.
             If this value is nonnegative (>=0), content longer than it
             will be truncated; otherwise, no truncation at all.
</description>
</property>

Is there any setting where I can force Nutch to somehow persist the file 
before parsing it so I can make sure it's actually there?
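
One way to sanity-check the numbers in this message is to compare the approximate file sizes with a content limit, using the rule from the property description (a negative limit means no truncation). This is only a sketch: the byte counts assume 1 KB = 1024 bytes, the helper name would_truncate is made up, and 65536 is assumed to be the default limit for this Nutch version.

```shell
# would_truncate SIZE LIMIT
# Succeeds (exit status 0) if content of SIZE bytes would be cut off by LIMIT.
# Per the property description, a negative limit disables truncation.
would_truncate() {
  if [ "$2" -lt 0 ]; then
    return 1
  fi
  [ "$1" -gt "$2" ]
}

assumed_default_limit=65536   # assumption: default content limit, in bytes

# Approximate sizes from the message, converted with 1 KB = 1024 bytes.
for entry in "pdf:540672" "doc:110592" "xls:125952"; do
  name=${entry%%:*}
  size=${entry##*:}
  if would_truncate "$size" "$assumed_default_limit"; then
    echo "$name ($size bytes): truncated at $assumed_default_limit bytes"
  else
    echo "$name ($size bytes): fetched in full"
  fi
done
```

Under these assumptions all three files are larger than the limit, so a limit of -1 only helps if it is set on the property that the fetching protocol actually reads.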


On 30.08.2011 21:42, lewis john mcgibbney wrote:
> Hi Elisabeth,
>
> Can you please check the size of the pdf files you are trying to parse and
> set the http.content.limit property accordingly in nutch-site.xml
>
> Anything over the default limit will be truncated (or skipped in some cases)
>
> Please get back to us on this one..
>
> On Tue, Aug 30, 2011 at 8:27 PM, Elisabeth Adler
> <el...@gmail.com>wrote:
>
>> Actually, I don't think tika is the issue. If I add manually downloaded
>> PDFs to Nutch's test cases, the files are parsed correctly. I think it is
>> more likely something with Nutch not being able to download the files
>> correctly.
>> Any pointers?
>> thanks,
>> Elisabeth
>>
>> On 30.08.2011 19:41, Markus Jelsma wrote:
>>
>>> Hi,
>>>
>>> Can you report your issues to the Tika mailing list? You're more likely to
>>> get
>>> help there.
>>>
>>> Cheers
>>>
>>>   Hi,
>>>> I am using Nutch 1.3 to crawl our intranet page. I have turned on the
>>>> tika-plugin (see [1]) to parse pdfs  and MS Office documents, and
>>>> included the mime types in the parse-plugins.xml.
>>>>
>>>> On crawling, the URLs of my files are correctly retrieved, but on
>>>> parsing the files, I get the following errors:
>>>> [Error1]: Error parsing:
>>>> http://.../sample-site/news/**cactus/2011-06-22_**
>>>> ClientSupportWeeklyReport.pdf
>>>> : failed(2,0): null
>>>> [Error2]: Error parsing:
>>>> http://../sample-site/news/**cactus/Operations-Meeting-**
>>>> Minutes-2011-wk02.doc:
>>>> failed(2,0): Your file contains 127 sectors, but the initial DIFAT array
>>>> at index 0 referenced block # 208. This isn't allowed and  your file is
>>>> corrupt
>>>> [Error3]: Error parsing:
>>>> http://../sample-site/news/**cactus/work_log_lisi.xls: failed(2,0): Your
>>>> file contains 127 sectors, but the initial DIFAT array at index 0
>>>> referenced block # 241. This isn't allowed and  your file is corrupt
>>>>
>>>> Further stack traces to the errors are below. When entering the ULRs in
>>>> a browser, the files can be opened without problems. Also, I used the
>>>> file in the Nutch test cases, and the files could be opened and read
>>>> correctly by Nutch, so it does not seem to be a problem with the files.
>>>> Also below on how I parse the files [2].
>>>>
>>>> Did anyone encounter any of these problems so far? Any pointers are very
>>>> much appreciated!
>>>> Thanks a lot,
>>>> Elisabeth
>>>>
>>>>
>>>> [1] nutch-site.xml
>>>> <property><name>plugin.**includes</name>
>>>> <value>parse-(html|tika|js|**zip)|...</value>   </property>
>>>>
>>>> [Error1]:
>>>> 2011-08-30 18:13:11,783 ERROR tika.TikaParser - Error parsing
>>>> http://.../sample-site/news/**cactus/2011-06-22_**
>>>> ClientSupportWeeklyReport.pdf
>>>> java.lang.NullPointerException
>>>>           at
>>>> org.apache.pdfbox.pdmodel.**PDPageNode.getCount(**PDPageNode.java:109)
>>>>           at
>>>> org.apache.pdfbox.pdmodel.**PDDocument.getNumberOfPages(**
>>>> PDDocument.java:946)
>>>>           at
>>>> org.apache.tika.parser.pdf.**PDFParser.extractMetadata(**
>>>> PDFParser.java:107)
>>>>           at org.apache.tika.parser.pdf.**PDFParser.parse(PDFParser.**
>>>> java:88)
>>>>           at
>>>> org.apache.nutch.parse.tika.**TikaParser.getParse(**TikaParser.java:95)
>>>>           at
>>>> org.apache.nutch.parse.**ParseCallable.call(**ParseCallable.java:35) at
>>>> org.apache.nutch.parse.**ParseCallable.call(**ParseCallable.java:24) at
>>>> java.util.concurrent.**FutureTask$Sync.innerRun(**FutureTask.java:303)
>>>>           at java.util.concurrent.**FutureTask.run(FutureTask.**java:138)
>>>>           at java.lang.Thread.run(Thread.**java:662)
>>>>
>>>> [Error2]:
>>>> 2011-08-30 18:13:11,864 ERROR tika.TikaParser - Error parsing
>>>> http://.../sample-site/news/**cactus/Operations-Meeting-**
>>>> Minutes-2011-wk02.doc
>>>> java.io.IOException: Your file contains 127 sectors, but the initial
>>>> DIFAT array at index 0 referenced block # 208. This isn't allowed and
>>>> your file is corrupt
>>>>           at
>>>> org.apache.poi.poifs.storage.**BlockAllocationTableReader.<**
>>>> init>(BlockAllocat
>>>> ionTableReader.java:113) at
>>>> org.apache.poi.poifs.**filesystem.POIFSFileSystem.<**
>>>> init>(POIFSFileSystem.java
>>>> :166) at
>>>> org.apache.tika.parser.**microsoft.OfficeParser.parse(**
>>>> OfficeParser.java:160)
>>>>           at
>>>> org.apache.nutch.parse.tika.**TikaParser.getParse(**TikaParser.java:95)
>>>>           at
>>>> org.apache.nutch.parse.**ParseCallable.call(**ParseCallable.java:35) at
>>>> org.apache.nutch.parse.**ParseCallable.call(**ParseCallable.java:24) at
>>>> java.util.concurrent.**FutureTask$Sync.innerRun(**FutureTask.java:303)
>>>>           at java.util.concurrent.**FutureTask.run(FutureTask.**java:138)
>>>>           at java.lang.Thread.run(Thread.**java:662)
>>>>
>>>> [Error3]:
>>>> 2011-08-30 18:13:11,902 ERROR tika.TikaParser - Error parsing
>>>> http://.../sample-site/news/cactus/work_log_lisi.xls
>>>> java.io.IOException: Your file contains 127 sectors, but the initial
>>>> DIFAT array at index 0 referenced block # 241. This isn't allowed and
>>>> your file is corrupt
>>>>           at org.apache.poi.poifs.storage.BlockAllocationTableReader.<init>(BlockAllocationTableReader.java:113)
>>>>           at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:166)
>>>>           at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
>>>>           at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
>>>>           at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
>>>>           at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
>>>>           at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>           at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>           at java.lang.Thread.run(Thread.java:662)
>>>>
>>>> [2]
>>>> ./bin/nutch inject crawl/crawldb urls >> crawl.log
>>>> ./bin/nutch generate crawl/crawldb crawl/segments >> crawl.log
>>>> s1=`ls -d crawl/segments/2* | tail -1` >> crawl.log
>>>> ./bin/nutch fetch $s1 -noParsing >> crawl.log
>>>> ./bin/nutch parse $s1 >> crawl.log
>>>>
>

Re: Nutch 1.3 - DIFAT array IOException on parsing files

Posted by lewis john mcgibbney <le...@gmail.com>.
Hi Elisabeth,

Can you please check the size of the PDF files you are trying to parse and
set the http.content.limit property accordingly in nutch-site.xml?

Anything over the default limit will be truncated (or skipped in some cases).

Please get back to us on this one.
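
A rough sanity check of the truncation theory, sketched under assumptions that are not stated in the thread: OLE2 files (.doc, .xls) are taken to consist of a 512-byte header followed by 512-byte sectors, and http.content.limit is taken to be at what is believed to be its stock value of 65536 bytes. The function name is made up for illustration. If those assumptions hold, a download cut at the limit leaves exactly the sector count reported in Error2 and Error3.

```python
# Assumed layout (not confirmed in the thread): a 512-byte OLE2 header
# followed by 512-byte sectors.
HEADER_BYTES = 512
SECTOR_BYTES = 512


def sectors_after_truncation(content_limit: int) -> int:
    """Return how many whole sectors survive if a download is cut at content_limit bytes."""
    if content_limit <= HEADER_BYTES:
        return 0
    return (content_limit - HEADER_BYTES) // SECTOR_BYTES


if __name__ == "__main__":
    # 65536 is the assumed default http.content.limit; the errors report
    # 127 sectors while referencing blocks 208 and 241.
    print(sectors_after_truncation(65536))  # prints 127
```

A file whose allocation table points at block 208 or 241 but which only has 127 sectors on disk is what a cut-off download would look like, which would also explain why the same files parse fine when downloaded by hand.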

On Tue, Aug 30, 2011 at 8:27 PM, Elisabeth Adler
<el...@gmail.com> wrote:

> Actually, I don't think tika is the issue. If I add manually downloaded
> PDFs to Nutch's test cases, the files are parsed correctly. I think it is
> more likely something with Nutch not being able to download the files
> correctly.
> Any pointers?
> thanks,
> Elisabeth
>
> On 30.08.2011 19:41, Markus Jelsma wrote:
>
>> Hi,
>>
>> Can you report your issues to the Tika mailing list? You're more likely
>> to get help there.
>>
>> Cheers


-- 
*Lewis*

Re: Nutch 1.3 - DIFAT array IOException on parsing files

Posted by Markus Jelsma <ma...@openindex.io>.
In that case, check the file size. Nutch imposes configurable limits on
fetched content, so compare the size of the files against your configuration.
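
A minimal sketch of that size check, assuming you have local copies of the documents (for example the manually downloaded ones). The helper name and the 65536-byte default are assumptions for illustration; use whatever value your nutch-site.xml or nutch-default.xml actually sets for http.content.limit. A negative limit is treated as "no truncation", which matches how the property is described.

```python
import os
from typing import Dict, Iterable


def files_over_limit(paths: Iterable[str], content_limit: int = 65536) -> Dict[str, int]:
    """Map each path whose size exceeds content_limit to its size in bytes.

    A negative content_limit means no limit, so nothing is reported.
    """
    if content_limit < 0:
        return {}
    oversized = {}
    for path in paths:
        size = os.path.getsize(path)
        if size > content_limit:
            oversized[path] = size
    return oversized


if __name__ == "__main__":
    import sys

    # Usage: python check_sizes.py file1.pdf file2.doc ...
    for path, size in files_over_limit(sys.argv[1:]).items():
        print(f"{path}: {size} bytes exceeds the limit")
```

Any file it lists would be cut off at the limit during fetching, so raising the limit (or disabling it with a negative value) and re-fetching is the thing to try.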

> Actually, I don't think tika is the issue. If I add manually downloaded
> PDFs to Nutch's test cases, the files are parsed correctly. I think it
> is more likely something with Nutch not being able to download the files
> correctly.
> Any pointers?
> thanks,
> Elisabeth
> 
> On 30.08.2011 19:41, Markus Jelsma wrote:
> > Hi,
> > 
> > Can you report your issues to the Tika mailing list? You're more likely
> > to get help there.
> > 
> > Cheers

Re: Nutch 1.3 - DIFAT array IOException on parsing files

Posted by Elisabeth Adler <el...@gmail.com>.
Actually, I don't think Tika is the issue. If I add manually downloaded
PDFs to Nutch's test cases, the files are parsed correctly. I think it
is more likely that Nutch is not downloading the files correctly.
Any pointers?
thanks,
Elisabeth
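
One way to test the download theory for the PDF, sketched under assumptions: a complete PDF normally carries an %%EOF marker close to its end, so content that was cut off during fetching will usually lack it. The function name and the 1024-byte tail window are made up for illustration, and getting the fetched bytes out of the segment is left out here.

```python
def pdf_looks_truncated(data: bytes, tail_bytes: int = 1024) -> bool:
    """Heuristic only: report True when no %%EOF marker appears in the last tail_bytes."""
    return b"%%EOF" not in data[-tail_bytes:]


if __name__ == "__main__":
    # Synthetic stand-in for a real document: header, filler, trailer marker.
    complete = b"%PDF-1.4\n" + b"x" * 70000 + b"\n%%EOF\n"
    fetched = complete[:65536]  # what a 65536-byte content limit would keep

    print(pdf_looks_truncated(complete))  # prints False
    print(pdf_looks_truncated(fetched))   # prints True
```

If the manually downloaded copy passes and the fetched content does not, the file itself is fine and the fetch step is cutting it short.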

On 30.08.2011 19:41, Markus Jelsma wrote:
> Hi,
>
> Can you report your issues to the Tika mailing list? You're more likely to get
> help there.
>
> Cheers

Re: Nutch 1.3 - DIFAT array IOException on parsing files

Posted by Markus Jelsma <ma...@openindex.io>.
Hi,

Can you report your issues to the Tika mailing list? You're more likely to get 
help there.

Cheers
