Posted to user@nutch.apache.org by Nutch User - 1 <nu...@gmail.com> on 2011/06/20 10:16:01 UTC

URL redirection and zero scores

Hi.

I did a test crawl with the seed URL http://www.aalto.fi. When the
crawling and indexing process was over, I opened the index in Luke and
browsed the documents. Every one of them had 0.0f as its score (and
thus as its boost value). I doubt that this is the expected result.

The problem seems to be related to the fact that http://www.aalto.fi
redirects to http://www.aalto.fi/fi/ (in my case; probably to .../en/
or .../sv/ in some other cases). The same behavior also showed up when
http://www.muropaketti.com was used as a seed URL. The URL
http://www.muropaketti.com is redirected to http://plaza.fi/muropaketti/.
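
To make the suspicion above concrete, here is a toy model in Python. It
is only a sketch under stated assumptions, not Nutch's OPIC code: the
function propagate, the flag inherit_score and the /fi/page-* URLs are
all hypothetical. It assumes that a page splits its score evenly among
its outlinks, and it compares a redirect target that inherits the seed's
initial score with one that starts at zero.

```python
# Toy model only: this is NOT Nutch's OPIC implementation.
# propagate(), inherit_score and the /fi/page-* URLs are hypothetical.

def propagate(seed, redirects, outlinks, inherit_score):
    """Pass score from the seed to linked pages, splitting it evenly."""
    start = redirects.get(seed, seed)  # follow at most one redirect
    redirected = start != seed
    # Assumption under test: a redirect target may not get the seed's score.
    scores = {start: 0.0 if (redirected and not inherit_score) else 1.0}
    queue, visited = [start], set()
    while queue:
        page = queue.pop(0)
        if page in visited:
            continue
        visited.add(page)
        links = outlinks.get(page, [])
        for target in links:
            # Each outlink receives an equal share of the page's score.
            scores[target] = scores.get(target, 0.0) + scores[page] / len(links)
            queue.append(target)
    return scores


seed = "http://www.aalto.fi"
redirects = {seed: "http://www.aalto.fi/fi/"}
outlinks = {  # made-up link graph, for illustration only
    "http://www.aalto.fi/fi/": [
        "http://www.aalto.fi/fi/page-a",
        "http://www.aalto.fi/fi/page-b",
    ],
    "http://www.aalto.fi/fi/page-a": ["http://www.aalto.fi/fi/page-c"],
}

lost = propagate(seed, redirects, outlinks, inherit_score=False)
kept = propagate(seed, redirects, outlinks, inherit_score=True)
print(lost)
print(kept)
```

In this model every score in lost is zero, because a zero at the
redirect target is all that gets passed on, while kept contains non-zero
scores. Whether Nutch really behaves like the first case is exactly what
I am asking.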

Is this a flaw in Nutch? If not, then why was every document's boost
value zero? I have been under the impression that a document's boost
value is supposed to describe its relevance. I ran the same tests with
versions 1.2 and 1.3, and the problem appeared in both.

Re: URL redirection and zero scores

Posted by Nutch User - 1 <nu...@gmail.com>.
My example address http://www.muropaketti.com redirects to
http://muropaketti.com/ now.

Re: URL redirection and zero scores

Posted by Nutch User - 1 <nu...@gmail.com>.
On 06/20/2011 11:20 AM, Markus Jelsma wrote:
> What scoring plugin are you using? OPIC? Link? Custom? None?

Could someone check whether they also get zero scores in a similar
situation?

Re: URL redirection and zero scores

Posted by Nutch User - 1 <nu...@gmail.com>.
On 06/20/2011 11:20 AM, Markus Jelsma wrote:
> What scoring plugin are you using? OPIC? Link? Custom? None?

I used OPIC as it's the default.

Re: URL redirection and zero scores

Posted by Markus Jelsma <ma...@openindex.io>.
What scoring plugin are you using? OPIC? Link? Custom? None?
