Posted to user@nutch.apache.org by Mohammad Anbari <md...@gmail.com> on 2011/09/16 07:57:44 UTC

Problem to crawl pdf content in urls

I have some URLs that contain many PDF links, and I want to index them,
but when I start crawling with Nutch 1.3, no PDF links are fetched. Is there
any config I am missing?
Thanks

Re: Problem to crawl pdf content in urls

Posted by lewis john mcgibbney <le...@gmail.com>.

At the very least, please provide some log output for the URLs that are
not being fetched. That would give us a chance of providing accurate
information.

http.content.limit is one of many options that might be the problem
here.
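
As a rough illustration only, and assuming the PDFs are being truncated
because they exceed the default download limit (not confirmed without logs),
an override in conf/nutch-site.xml might look like the sketch below. The
value -1 is an assumption chosen for the example; any negative value
disables truncation, and a specific byte count could be used instead.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: local overrides for nutch-default.xml -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- Assumed value for illustration: a negative number means content
         is not truncated. The shipped default is 65536 bytes, which is
         smaller than many PDF files. -->
    <value>-1</value>
    <description>Length limit in bytes for content downloaded over HTTP.
    Content longer than a non-negative limit is truncated.</description>
  </property>
</configuration>
```

If the logs show a different cause, such as URL filters or a missing parse
plugin, this setting alone will not help.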

Thank you

On Fri, Sep 16, 2011 at 6:57 AM, Mohammad Anbari <md...@gmail.com> wrote:

> I have some urls that contain many pdf links and i want to index them
> but when i start crawling with nutch 1.3 no pdf link fetch,is there
> any config i miss?
> thanks
>

-- 
*Lewis*