Posted to user@nutch.apache.org by Nutch User - 1 <nu...@gmail.com> on 2011/07/12 13:45:24 UTC

A possible solution to my URL redirection and zero scores problem

I have mentioned earlier
(http://lucene.472066.n3.nabble.com/URL-redirection-and-zero-scores-td3085311.html)
that I have encountered a problem in which redirected URLs and possibly,
depending on the topology of the graph, all URLs inlinked to them will
have zero scores.
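
To illustrate why a single zero score can spread through the graph, here is a
minimal sketch. It assumes a simplified OPIC-like model in which each page
splits its score evenly among its outlinks; this is not Nutch's actual scoring
code. The class name, the propagate() helper and the page names under
http://www.aalto.fi/fi/ are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ZeroScorePropagationSketch {

    /**
     * Simplified, assumed model (not Nutch code): pages are expanded once in
     * breadth-first order, and each page passes score / outlinkCount to every
     * one of its outlinks.
     */
    static Map<String, Float> propagate(Map<String, List<String>> outlinks,
                                        String start, float startScore) {
        Map<String, Float> scores = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        scores.put(start, startScore);
        queue.add(start);
        while (!queue.isEmpty()) {
            String page = queue.remove();
            List<String> targets =
                outlinks.getOrDefault(page, Collections.<String>emptyList());
            if (targets.isEmpty()) {
                continue; // nothing to distribute from a page without outlinks
            }
            float share = scores.get(page) / targets.size();
            for (String target : targets) {
                boolean firstVisit = !scores.containsKey(target);
                scores.merge(target, share, Float::sum);
                if (firstVisit) {
                    queue.add(target);
                }
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        // Hypothetical link graph below the redirect target.
        String root = "http://www.aalto.fi/fi/";
        Map<String, List<String>> outlinks = new HashMap<>();
        outlinks.put(root, Arrays.asList(root + "a", root + "b"));
        outlinks.put(root + "a", Arrays.asList(root + "c"));

        // Redirect target starts at zero: every page reached from it stays at zero.
        System.out.println(propagate(outlinks, root, 0.0f).get(root + "c"));
        // Redirect target starts with a non-zero score: downstream pages get a share.
        System.out.println(propagate(outlinks, root, 1.0f).get(root + "c"));
    }
}
```

In this model, how far the zero spreads depends only on which pages are
reachable solely through the redirect target, which matches the dependence on
graph topology described above.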

For instance, on line 818 of Fetcher.java
(http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/fetcher/Fetcher.java?view=markup)
a new CrawlDatum is created for the redirected URL, but nowhere is the
original URL's CrawlDatum's score passed to the new one. The ScoringFilter
interface's initialScore() method is called for the new CrawlDatum, but
it only sets the score to zero.

Is this how it was meant to be or is there a flaw?

I started a crawl from http://www.aalto.fi which is redirected to
http://www.aalto.fi/fi/ (in my case). The URL http://www.aalto.fi had
1.0f as its score but every other URL had 0.0f, which in my opinion indicates
that there's a problem. Adding "newDatum.setScore(datum.getScore());"
after the call to initialScore() resulted in a situation where none of the
URLs' scores is zero.
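
The effect of that one-line change can be sketched in isolation. The Datum
class and the initialScore() method below are hypothetical stand-ins, not the
real Nutch CrawlDatum or ScoringFilter; the stand-in initialScore() simply
sets the score to zero, as described above. Only the relative order of the
calls is meant to mirror the fetcher code.

```java
import java.util.HashMap;
import java.util.Map;

public class RedirectScoreSketch {

    /** Hypothetical stand-in for CrawlDatum: a score plus a metadata map. */
    static class Datum {
        private float score;
        private final Map<String, String> metaData = new HashMap<>();

        float getScore() { return score; }
        void setScore(float score) { this.score = score; }
        Map<String, String> getMetaData() { return metaData; }
    }

    /** Stand-in for initialScore() as described in this thread: score becomes zero. */
    static void initialScore(Datum datum) {
        datum.setScore(0.0f);
    }

    /** Builds the datum for a redirect target, optionally applying the proposed fix. */
    static Datum redirect(Datum original, boolean carryScore) {
        Datum newDatum = new Datum();
        // The metadata is copied, but the score is a separate field and is not.
        newDatum.getMetaData().putAll(original.getMetaData());
        initialScore(newDatum);
        if (carryScore) {
            // Proposed fix: pass the original URL's score on to the redirect target.
            newDatum.setScore(original.getScore());
        }
        return newDatum;
    }

    public static void main(String[] args) {
        Datum seed = new Datum();
        seed.setScore(1.0f);
        System.out.println("without fix: " + redirect(seed, false).getScore());
        System.out.println("with fix:    " + redirect(seed, true).getScore());
    }
}
```

Whether overwriting the result of initialScore() is appropriate for every
scoring plugin is a separate question; the sketch only shows that the score
survives the redirect once it is copied explicitly.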

Re: A possible solution to my URL redirection and zero scores problem

Posted by Nutch User - 1 <nu...@gmail.com>.
On 07/13/2011 11:25 AM, Julien Nioche wrote:
> Can you please open a JIRA for this?
> Thanks
>
> Julien

Yes, it's here now: (https://issues.apache.org/jira/browse/NUTCH-1044).

Re: A possible solution to my URL redirection and zero scores problem

Posted by Julien Nioche <li...@gmail.com>.
Can you please open a JIRA for this?
Thanks

Julien

On 13 July 2011 08:01, Nutch User - 1 <nu...@gmail.com> wrote:

> On 07/12/2011 08:09 PM, lewis john mcgibbney wrote:
> > Well I think in order to address the problem directly it would be better to
> > focus on getting something working with a distribution of Nutch you are most
> > comfortable working with. For the time being I would avoid working with
> > trunk 2.0 unless you can justify otherwise. I would also either make a
> > decision between Nutch 1.2 and the current 1.3 release rather than focussing
> > on previous branches, which may or may not be stable depending on when you
> > last svn updated.
> >
> > If you can try working with a fresh 1.2 or 1.3 (preferably 1.3) then we
> > could maybe get to the bottom of this one as it would be great to find
> > whether there is scope to file a JIRA with this.
> >
> > Thank you
>
> Currently I'm working with the official 1.3 distribution of Nutch
> (apache-nutch-1.3-bin.zip). I have encountered this URL redirection and
> zero scores problem in both 1.2 and 1.3.
>
> I crawled ~12k pages with the quick fix I made, and none of the URLs in
> the CrawlDB had zero as their score. Before the fix, crawling the same
> pages resulted in ~1.5k of the URLs having zero scores.
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: A possible solution to my URL redirection and zero scores problem

Posted by Nutch User - 1 <nu...@gmail.com>.
On 07/12/2011 08:09 PM, lewis john mcgibbney wrote:
> Well I think in order to address the problem directly it would be better to
> focus on getting something working with a distribution of Nutch you are most
> comfortable working with. For the time being I would avoid working with
> trunk 2.0 unless you can justify otherwise. I would also either make a
> decision between Nutch 1.2 and the current 1.3 release rather than focussing
> on previous branches, which may or may not be stable depending on when you
> last svn updated.
>
> If you can try working with a fresh 1.2 or 1.3 (preferably 1.3) then we
> could maybe get to the bottom of this one as it would be great to find
> whether there is scope to file a JIRA with this.
>
> Thank you

Currently I'm working with the official 1.3 distribution of Nutch
(apache-nutch-1.3-bin.zip). I have encountered this URL redirection and
zero scores problem in both 1.2 and 1.3.

I crawled ~12k pages with the quick fix I made, and none of the URLs in
the CrawlDB had zero as their score. Before the fix, crawling the same
pages resulted in ~1.5k of the URLs having zero scores.

Re: A possible solution to my URL redirection and zero scores problem

Posted by lewis john mcgibbney <le...@gmail.com>.
Well I think in order to address the problem directly it would be better to
focus on getting something working with a distribution of Nutch you are most
comfortable working with. For the time being I would avoid working with
trunk 2.0 unless you can justify otherwise. I would also either make a
decision between Nutch 1.2 and the current 1.3 release rather than focussing
on previous branches, which may or may not be stable depending on when you
last svn updated.

If you can try working with a fresh 1.2 or 1.3 (preferably 1.3) then we
could maybe get to the bottom of this one as it would be great to find
whether there is scope to file a JIRA with this.

Thank you

On Tue, Jul 12, 2011 at 2:02 PM, Nutch User - 1 <nu...@gmail.com> wrote:

> On 07/12/2011 03:42 PM, lewis john mcgibbney wrote:
> > Hi,
> >
> > An observation is that you are using the 1.3 branch, which will now contain
> > some older code. For example the fetcher class has now been upgraded to deal
> > with Nutch-962, which is mentioned at the top of the class as per your URL
> > example.
> >
> > Can anyone explain what the existing metadata being transferred is as per
> > below if it does not include the score as you state?
> >
> >         } else {
> >           CrawlDatum newDatum = new CrawlDatum(CrawlDatum.STATUS_LINKED,
> >               datum.getFetchInterval());
> >           // transfer existing metadata
> >           newDatum.getMetaData().putAll(datum.getMetaData());
> >           try {
> >             scfilters.initialScore(url, newDatum);
> >
> > I would have imagined that the metadata would have included the relative
> > initial score we are discussing if it were to be of use in attributing an
> > initial URL's metadata to a redirect?
> > Apart from this, with the addition of your datum.getScore(), do the new
> > scores attributed to the URL redirects accurately reflect your general
> > understanding of the web graph?
>
> I have only been dealing with Nutch 1.2 and 1.3. I tried to set up 2.0
> with Eclipse but failed as described here
> (http://lucene.472066.n3.nabble.com/TestFetcher-hangs-td3091057.html).
> The new scores were as they should have been in my opinion. (Even though
> I would state that Nutch's implementation of OPIC isn't exactly what the
> publication says.) I don't know what information is passed in metadata.
>



-- 
*Lewis*

Re: A possible solution to my URL redirection and zero scores problem

Posted by Nutch User - 1 <nu...@gmail.com>.
On 07/12/2011 03:42 PM, lewis john mcgibbney wrote:
> Hi,
>
> An observation is that you are using the 1.3 branch, which will now contain
> some older code. For example the fetcher class has now been upgraded to deal
> with Nutch-962, which is mentioned at the top of the class as per your URL
> example.
>
> Can anyone explain what the existing metadata being transferred is as per
> below if it does not include the score as you state?
>
>         } else {
>           CrawlDatum newDatum = new CrawlDatum(CrawlDatum.STATUS_LINKED,
>               datum.getFetchInterval());
>           // transfer existing metadata
>           newDatum.getMetaData().putAll(datum.getMetaData());
>           try {
>             scfilters.initialScore(url, newDatum);
>
> I would have imagined that the metadata would have included the relative
> initial score we are discussing if it were to be of use in attributing an
> initial URL's metadata to a redirect?
> Apart from this, with the addition of your datum.getScore(), do the new
> scores attributed to the URL redirects accurately reflect your general
> understanding of the web graph?

I have only been dealing with Nutch 1.2 and 1.3. I tried to set up 2.0
with Eclipse but failed as described here
(http://lucene.472066.n3.nabble.com/TestFetcher-hangs-td3091057.html).
The new scores were as they should have been in my opinion. (Even though
I would state that Nutch's implementation of OPIC isn't exactly what the
publication says.) I don't know what information is passed in metadata.

Re: A possible solution to my URL redirection and zero scores problem

Posted by lewis john mcgibbney <le...@gmail.com>.
Hi,

An observation is that you are using the 1.3 branch, which will now contain
some older code. For example the fetcher class has now been upgraded to deal
with Nutch-962, which is mentioned at the top of the class as per your URL
example.

Can anyone explain what the existing metadata being transferred is as per
below if it does not include the score as you state?

        } else {
          CrawlDatum newDatum = new CrawlDatum(CrawlDatum.STATUS_LINKED,
              datum.getFetchInterval());
          // transfer existing metadata
          newDatum.getMetaData().putAll(datum.getMetaData());
          try {
            scfilters.initialScore(url, newDatum);

I would have imagined that the metadata would have included the relative
initial score we are discussing if it were to be of use in attributing an
initial URL's metadata to a redirect?
Apart from this, with the addition of your datum.getScore(), do the new
scores attributed to the URL redirects accurately reflect your general
understanding of the web graph?



On Tue, Jul 12, 2011 at 12:45 PM, Nutch User - 1 <nu...@gmail.com> wrote:

> I have mentioned earlier
> (http://lucene.472066.n3.nabble.com/URL-redirection-and-zero-scores-td3085311.html)
> that I have encountered a problem in which redirected URLs and possibly,
> depending on the topology of the graph, all URLs inlinked to them will
> have zero scores.
>
> For instance, on line 818 of Fetcher.java
> (http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/fetcher/Fetcher.java?view=markup)
> a new CrawlDatum is created for the redirected URL, but nowhere is the
> original URL's CrawlDatum's score passed to the new one. The ScoringFilter
> interface's initialScore() method is called for the new CrawlDatum, but
> it only sets the score to zero.
>
> Is this how it was meant to be or is there a flaw?
>
> I started a crawl from http://www.aalto.fi which is redirected to
> http://www.aalto.fi/fi/ (in my case). The URL http://www.aalto.fi had
> 1.0f as its score but every other URL had 0.0f, which in my opinion indicates
> that there's a problem. Adding "newDatum.setScore(datum.getScore());"
> after the call to initialScore() resulted in a situation where none of the
> URLs' scores is zero.
>



-- 
*Lewis*