Posted to dev@nutch.apache.org by "Stefan Groschupf (JIRA)" <ji...@apache.org> on 2006/08/18 06:51:17 UTC

[jira] Created: (NUTCH-353) pages that serverside forwards will be refetched every time

pages that serverside forwards will be refetched every time
-----------------------------------------------------------

                 Key: NUTCH-353
                 URL: http://issues.apache.org/jira/browse/NUTCH-353
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 0.8.1, 0.9.0
            Reporter: Stefan Groschupf
            Priority: Blocker
             Fix For: 0.8.1
         Attachments: doNotRefecthForwarderPagesV1.patch

Pages that do a server-side forward are not written back into the crawlDb with a status change, and their nextFetchTime is not updated.
As a result the same page is refetched again and again. Nutch is therefore not polite: it refetches both the forwarding page and the target page in each segment iteration. It also affects the scoring, since the forwarding page contributes its score to all outlinks.
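
To make the failure mode concrete, here is a minimal sketch in Python (not Nutch's Java code). It assumes the crawlDb can be modelled as a dict from URL to an entry holding a status and a next fetch time. The names `Entry`, `record_redirect`, `due_for_fetch` and the status labels are hypothetical and only illustrate the kind of write-back the description says is missing.

```python
from dataclasses import dataclass

# Hypothetical status labels for illustration; not Nutch's CrawlDatum constants.
DB_UNFETCHED = "db_unfetched"
DB_REDIRECTED = "db_redirected"


@dataclass
class Entry:
    status: str = DB_UNFETCHED
    next_fetch_time: float = 0.0


def record_redirect(crawl_db, url, now, interval):
    """Write the redirect back: change the status and push the next fetch time forward."""
    entry = crawl_db.setdefault(url, Entry())
    entry.status = DB_REDIRECTED
    entry.next_fetch_time = now + interval
    return entry


def due_for_fetch(crawl_db, now):
    """Return the URLs whose next fetch time has been reached, in sorted order."""
    return sorted(u for u, e in crawl_db.items() if e.next_fetch_time <= now)
```

Without a step like `record_redirect`, the forwarding URL keeps its old fetch time and is selected again in every iteration, which matches the behaviour reported here.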



-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

Re: [jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

Posted by Uroš Gruber <ur...@sir-mag.com>.

Doug Cook wrote:
> In this case, the site uses the "right" kind of redirect. Unfortunately, as
> you point out, it's not at all clear that we can rely on sites correctly
> choosing the type of redirect (I tried a few sites and most were 302s, even
> in cases where the redirect was to the permanent, canonical version of the
> page). And then there's the problem of what to do with meta refresh tags,
> which don't have a "permanent" vs. "temporary" indication.
>
> An alternative is to use the link structure - the page with the most
> external links is likely the canonical version of the page. (Although with
> permanent redirects, there is a time lag as sites linking to the page stop
> using the old name and start using the new name). This won't work well in
> small crawls, though, given the relative paucity of links.
>
>   
This could work, because other sites most likely don't link to redirect 
URLs. But as you point out, permanent redirects are a problem, and we have 
exactly this situation on our portal. We introduced a new structure and 
some links changed, so we added permanent redirects from the old URLs to 
the new ones. In this case the only solution is to replace the old URL 
with the permanent redirect target.

> In any case, if we have an inexpensive way of aliasing the two to be the
> same, we won't lose any anchor text, and we're effectively not "throwing
> out" either URL, so it matters less which one we choose. 
>   
Do you have an example of what this aliasdb would look like?

regards

Re: [jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

Posted by Doug Cook <na...@candiru.com>.


In this case, the site uses the "right" kind of redirect. Unfortunately, as
you point out, it's not at all clear that we can rely on sites correctly
choosing the type of redirect (I tried a few sites and most were 302s, even
in cases where the redirect was to the permanent, canonical version of the
page). And then there's the problem of what to do with meta refresh tags,
which don't have a "permanent" vs. "temporary" indication.

An alternative is to use the link structure - the page with the most
external links is likely the canonical version of the page. (Although with
permanent redirects, there is a time lag as sites linking to the page stop
using the old name and start using the new name). This won't work well in
small crawls, though, given the relative paucity of links.

In any case, if we have an inexpensive way of aliasing the two to be the
same, we won't lose any anchor text, and we're effectively not "throwing
out" either URL, so it matters less which one we choose. 

      -Doug
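
The link-structure heuristic described above can be sketched roughly as follows. This is an illustration in Python, not code from Nutch: the function names and the shape of the inlink data (a dict from URL to a list of source URLs) are assumptions, and "external" is taken to mean "from a different host".

```python
from urllib.parse import urlsplit


def external_inlink_count(url, inlinks):
    """Count inlinks whose host differs from the host of `url`."""
    host = urlsplit(url).hostname
    return sum(1 for src in inlinks if urlsplit(src).hostname != host)


def pick_canonical(candidates, inlinks_by_url):
    """Return the candidate with the most external inlinks.

    On a tie, the earliest candidate in the list is kept.
    """
    return max(
        candidates,
        key=lambda u: external_inlink_count(u, inlinks_by_url.get(u, [])),
    )
```

In a small crawl most candidates would have zero or one external inlink, so the choice degenerates to the tie-break, which is the weakness mentioned above.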


Uroš Gruber-2 wrote:
> 
> Ken Krugler (JIRA) wrote:
>>     [
>> http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439304
>> ] 
>>             
>> Ken Krugler commented on NUTCH-353:
>> -----------------------------------
>>
>> +1 that the redirect target is not always the "real" URL that we want to
>> keep.
>>
>> For example,
>> http://www.ibm.com/developerworks/lotus/downloads/toolkits.html =>
>> http://www-128.ibm.com/developerworks/lotus/downloads/toolkits.html. This
>> holds true for most  (all?) developerWorks pages; they redirect to
>> www-128.ibm.com/<whatever>, but IBM would love for the URL everybody sees
>> to still be www.ibm.com/<whatever>.
>>
>>   
> If you check status code of the original URL you get 302 Found. By 
> definition
> 
> 
>       10.3.3 302 Found
> 
> The requested resource resides temporarily under a different URI. Since 
> the redirection might be altered on occasion, the client SHOULD continue 
> to use the Request-URI for future requests. This response is only 
> cacheable if indicated by a Cache-Control or Expires header field.
> 
> In this case there is no need to replace original url with redirected.
> 
> I know that a lot of sites use permanent redirects in such cases. But I 
> don't see any proper solution for both.
> 
> 
> regards
> 
> Uros


[jira] Updated: (NUTCH-353) pages that serverside forwards will be refetched every time

Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-353?page=all ]

Stefan Groschupf updated NUTCH-353:
-----------------------------------

    Attachment: doNotRefecthForwarderPagesV1.patch

Since we discussed that Nutch needs to be more polite, we should fix this ASAP.


[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

Posted by "King Kong (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12434881 ] 
            
King Kong commented on NUTCH-353:
---------------------------------

This is a really serious problem, because the original URLs are fetched again and again :-(

I agree with Stefan's solution.

I think this problem should attract more people's attention.


[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

Posted by "Doug Cook (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439248 ] 
            
Doug Cook commented on NUTCH-353:
---------------------------------

This is definitely a complex issue. It is also high priority -- issues with redirects and duplicates, which URL is chosen, and what happens to the anchor text for the pages involved are causing significant relevance issues.
A few observations:

(1) A redirect *target* is not always the canonical version of a URL. For example, it is very common for root-level pages to redirect to an internal home page (some 30% of the root pages in my index do so). However, the root pages have all the anchor text and are truly the canonical, permanent version of the page; the internal redirect target is just the "temporary" homepage, and could change at any time depending on the site implementation. Here are some examples:
    http://www.landwirtschaft-bw.info/
    http://www.dlr-rnh.rlp.de/
    http://www.niederoesterreich.at/
Because of the current policy of "discarding" the redirect source, I lose 30% of the home pages in my index, which makes my relevance very poor for navigational queries.

In this case, we would likely want to mark the internal redirect target as an alias as Andrzej suggests, and automatically transfer any link information to the root page.

(2) There may be other cases where we want to alias two pages, either to avoid recrawling them, or to merge anchor text. Suppose we crawl both 
     http://www.x.com/
and
     http://www.x.com/index.html
and these are the same document.

Right now we will always crawl both of these, and the dedup algorithm will pick one (sadly often the /index.html version due to strange score anomalies), and throw out the anchor text for the other. While we can't safely normalize these two URLs to be the same in advance of seeing the content, once we see that the signatures are the same, we can, and should, merge them so that the index.html version is marked as an alias of the / version, and future crawls simply skip crawling the /index.html version and transfer its link information to the / page.

This problem, like the first one, is causing me to lose root-level URLs along with their anchor text, further affecting relevance for navigational queries.

In short, I agree with Andrzej that we need a way to mark a URL as an alias of another, to avoid recrawl, and to merge link information. We need to be careful, however, of *which* URL we pick. It is not always the redirect target that should win. And some of our current concept of "duplicates" should also be subsumed under the new notion of "alias."

I'm happy to help out in any way with a fix. I'm just looking at hacking together something in my own environment because the problems are affecting me so severely, but as I'm new-ish to Nutch, what I come up with might not be as elegant or flexible as what others might envision...
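
As an illustration of point (2), here is a minimal Python sketch of signature-based aliasing. It is only a sketch under stated assumptions: an MD5 digest stands in for the content signature, pages are a dict from URL to raw bytes, and the rule "the shortest URL wins" is an assumption chosen so that / beats /index.html; it is not a rule taken from Nutch. The names `signature` and `build_aliases` are hypothetical.

```python
import hashlib
from collections import defaultdict


def signature(content: bytes) -> str:
    """Stand-in content signature: the MD5 hex digest of the raw bytes."""
    return hashlib.md5(content).hexdigest()


def build_aliases(pages):
    """Map each duplicate URL to the canonical URL that shares its signature.

    pages: dict of url -> content bytes.
    Returns: dict of alias_url -> canonical_url.
    """
    by_sig = defaultdict(list)
    for url, content in pages.items():
        by_sig[signature(content)].append(url)

    aliases = {}
    for urls in by_sig.values():
        # Assumption: the shortest URL (ties broken alphabetically) is canonical.
        canonical = min(urls, key=lambda u: (len(u), u))
        for u in urls:
            if u != canonical:
                aliases[u] = canonical
    return aliases
```

A later crawl could consult the returned map to skip the alias URLs and credit their link information to the canonical page. As point (1) warns, the rule for picking the winner is the part that needs care.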


[jira] Updated: (NUTCH-353) pages that serverside forwards will be refetched every time

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-353?page=all ]

Andrzej Bialecki  updated NUTCH-353:
------------------------------------

    Fix Version/s: 0.9.0
                       (was: 0.8.1)
         Assignee: Andrzej Bialecki 


[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12437131 ] 
            
Andrzej Bialecki  commented on NUTCH-353:
-----------------------------------------

I think this issue requires more discussion, especially how it affects the linkdb.

Let's say that page A links to B, but B redirects to C. Issues to discuss:

* Should we mark B as gone? We could do so, to prevent refetching. We should also store the redirect URL in CrawlDatum.metaData. This redirect URL may change in the future to some other value, but since no page is ever truly gone (we should retry it at some point in the future) we should be able to adjust the redirect info.

* For all practical purposes, C now becomes a replacement for B. Should we transfer all inlink information (anchor text, incoming URLs, and score contributions) to C? From the implementation point of view this would require changes to the linkdb format, to be able to create "aliases" that automatically transfer all inlink information to C even though it's inserted under B.
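
The A, B, C scenario can be sketched as follows. This is a Python illustration, not the linkdb format: it assumes inlinks are a dict from target URL to a list of (source URL, anchor text) pairs and that aliases are a plain dict from the redirecting URL to its target. The names `resolve` and `merge_inlinks` are hypothetical.

```python
from collections import defaultdict


def resolve(url, aliases):
    """Follow alias entries until a URL with no alias is reached.

    Stops if a cycle is detected, returning the URL where the cycle closed.
    """
    seen = set()
    while url in aliases and url not in seen:
        seen.add(url)
        url = aliases[url]
    return url


def merge_inlinks(inlinks, aliases):
    """Re-key inlink information under the resolved URL.

    inlinks: dict of url -> list of (source_url, anchor_text).
    Returns: dict keyed by resolved URL with the inlink lists concatenated.
    """
    merged = defaultdict(list)
    for url, links in inlinks.items():
        merged[resolve(url, aliases)].extend(links)
    return dict(merged)
```

Because the alias map is kept separate from the inlink data, a redirect that later points somewhere else only needs the one alias entry changed, which fits the point that the redirect info should stay adjustable.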


Re: [jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

Posted by Uroš Gruber <ur...@sir-mag.com>.

Ken Krugler (JIRA) wrote:
>     [ http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439304 ] 
>             
> Ken Krugler commented on NUTCH-353:
> -----------------------------------
>
> +1 that the redirect target is not always the "real" URL that we want to keep.
>
> For example, http://www.ibm.com/developerworks/lotus/downloads/toolkits.html => http://www-128.ibm.com/developerworks/lotus/downloads/toolkits.html. This holds true for most  (all?) developerWorks pages; they redirect to www-128.ibm.com/<whatever>, but IBM would love for the URL everybody sees to still be www.ibm.com/<whatever>.
>
>   
If you check the status code of the original URL, you get 302 Found. By 
definition (RFC 2616):


      10.3.3 302 Found

The requested resource resides temporarily under a different URI. Since 
the redirection might be altered on occasion, the client SHOULD continue 
to use the Request-URI for future requests. This response is only 
cacheable if indicated by a Cache-Control or Expires header field.

In this case there is no need to replace the original URL with the 
redirect target.

I know that a lot of sites use permanent redirects in such cases, but I 
don't see any proper solution that handles both.
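
A minimal sketch of a policy driven purely by the status code, written in Python for illustration. The function name `url_to_keep` is hypothetical. It follows the RFC 2616 wording: after a 301 future references should use the new URI, while after a 302 or 307 the client should continue to use the Request-URI. It deliberately ignores meta refresh and sites that pick the wrong redirect type.

```python
def url_to_keep(original, target, status):
    """Choose which URL to keep after a redirect, based only on the HTTP status.

    301 (Moved Permanently): keep the redirect target.
    302 / 307 (temporary):   keep the original Request-URI.
    Anything else is not handled by this sketch and raises ValueError.
    """
    if status == 301:
        return target
    if status in (302, 307):
        return original
    raise ValueError(f"not a redirect status handled here: {status}")
```

For the IBM example, the 302 response means `url_to_keep` returns the www.ibm.com URL, which is the one IBM would want people to see.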


regards

Uros

[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439304 ] 
            
Ken Krugler commented on NUTCH-353:
-----------------------------------

+1 that the redirect target is not always the "real" URL that we want to keep.

For example, http://www.ibm.com/developerworks/lotus/downloads/toolkits.html => http://www-128.ibm.com/developerworks/lotus/downloads/toolkits.html. This holds true for most  (all?) developerWorks pages; they redirect to www-128.ibm.com/<whatever>, but IBM would love for the URL everybody sees to still be www.ibm.com/<whatever>.
