Posted to dev@nutch.apache.org by Kai_testing Middleton <ka...@yahoo.com> on 2007/06/27 02:49:44 UTC

NUTCH-119 :: how hard to fix

I am evaluating nutch+lucene as a crawl and search solution.

However, I am finding major bugs in nutch right off the bat.

In particular, NUTCH-119: nutch is not crawling relative URLs.  I have some discussion of it here:
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg08644.html

Most of the links off www.variety.com, one of my main test sites, have relative URLs.  It seems incredible that nutch, which is capable of mapreduce, cannot fetch these URLs.

It could be that I would fix this bug if, for other reasons, I decide to go with nutch+lucene.  Has anyone tried fixing this problem?  Is it intractable?  Or are the developers, who are just volunteers anyway, more interested in fixing other problems?

Could someone outline the issue for me a bit more clearly so I would know how to evaluate it?





Re: NUTCH-119 :: how hard to fix

Posted by Doğacan Güney <do...@gmail.com>.
On 6/27/07, Kai_testing Middleton <ka...@yahoo.com> wrote:
> I am evaluating nutch+lucene as a crawl and search solution.
>
> However, I am finding major bugs in nutch right off the bat.
>
> In particular, NUTCH-119: nutch is not crawling relative URLs.  I have some discussion of it here:
> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg08644.html
>
> Most of the links off www.variety.com, one of my main test sites, have relative URLs.  It seems incredible that nutch, which is capable of mapreduce, cannot fetch these URLs.
>
> It could be that I would fix this bug if, for other reasons, I decide to go with nutch+lucene.  Has anyone tried fixing this problem?  Is it intractable?  Or are the developers, who are just volunteers anyway, more interested in fixing other problems?
>
> Could someone outline the issue for me a bit more clearly so I would know how to evaluate it?

Both this site and the other one you mentioned (sf911truth) have
more than 100 outlinks. By default, Nutch stores only 100 outlinks
per page (db.max.outlinks.per.page). The link to about.html happens to
be the 105th or so, so Nutch doesn't store it. All you have to do is
either increase db.max.outlinks.per.page or set it to -1 (which
means store all outlinks).
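
For reference, a minimal sketch of what that override might look like,
assuming local settings are kept in conf/nutch-site.xml (the usual place
for overriding values from nutch-default.xml). The description text is
only an illustrative comment, not copied from the Nutch distribution:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Overrides the default limit of 100 outlinks stored per page. -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <!-- -1 means store all outlinks; a larger positive number such as
         500 would raise the limit instead of removing it. -->
    <value>-1</value>
    <description>Illustrative override: keep every outlink found on a
    page rather than only the first 100.</description>
  </property>
</configuration>
```

Pages that were already fetched and parsed under the old limit would
presumably need to be crawled again before the previously dropped links
show up.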



-- 
Doğacan Güney