Posted to user@nutch.apache.org by Ilya Vishnevsky <Il...@e-legion.com> on 2007/05/23 15:20:37 UTC

some pdf's are not parsed

Hi!
Some of the fetched PDF documents are not parsed. When I use SegmentReader,
the value corresponding to the key "pt" in the resulting map is empty.
For example, this happens with the following URLs:

http://www.virtualacquisitionshowcase.com/docs/DETech-Brochure.pdf

http://www.dtic.mil/ndia/22ndISB2005/thursday/fong.pdf

http://www.dsto.defence.gov.au/publications/2581/DSTO-TR-1479.pdf

http://sill-www.army.mil/FAMAG/2000/JUL_AUG_2000/JUL_AUG-2000_PAGES_36_39.pdf

http://www.dtic.mil/ndia/2001armaments/fong.pdf

At the same time, there are PDF files that are parsed normally.
Why can this problem occur, and how can I resolve it?

Re: some pdf's are not parsed

Posted by Doğacan Güney <do...@gmail.com>.

Hi,

On 5/23/07, Ilya Vishnevsky <Il...@e-legion.com> wrote:
> Hi!
> Some of the fetched PDF documents are not parsed. When I use SegmentReader,
> the value corresponding to the key "pt" in the resulting map is empty.
> For example, this happens with the following URLs:
>
> http://www.virtualacquisitionshowcase.com/docs/DETech-Brochure.pdf
>
> http://www.dtic.mil/ndia/22ndISB2005/thursday/fong.pdf
>
> http://www.dsto.defence.gov.au/publications/2581/DSTO-TR-1479.pdf
>
> http://sill-www.army.mil/FAMAG/2000/JUL_AUG_2000/JUL_AUG-2000_PAGES_36_39.pdf
>
> http://www.dtic.mil/ndia/2001armaments/fong.pdf
>
> At the same time, there are PDF files that are parsed normally.
> Why can this problem occur, and how can I resolve it?
>

What is your http.content.limit? The first URL in your list, for
example, is a little over 200K. If http.content.limit is lower than
that (the default is 64K, i.e. 65536 bytes), Nutch truncates the
content at http.content.limit, and parse-pdf can't parse partial PDF
files.
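
If the limit does turn out to be the cause, raising it should help. The
fragment below is a minimal sketch, assuming the override goes into
conf/nutch-site.xml; the 1 MB value is only an illustration and not a
recommendation, so pick whatever fits your largest documents.

```xml
<!-- conf/nutch-site.xml (assumed location for local overrides) -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- Illustrative value: 1048576 bytes = 1 MB.
         A negative value (e.g. -1) disables truncation entirely. -->
    <value>1048576</value>
  </property>
</configuration>
```

Documents that were already fetched with the old limit are stored
truncated, so they would most likely have to be fetched again before
parse-pdf can handle them.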

-- 
Doğacan Güney