Posted to dev@nutch.apache.org by "Doug Cook (JIRA)" <ji...@apache.org> on 2006/10/02 20:12:37 UTC

[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

    [ http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439248 ] 
            
Doug Cook commented on NUTCH-353:
---------------------------------

This is definitely a complex issue. It is also high priority -- issues with redirects and duplicates, which URL is chosen, and what happens to the anchor text for the pages involved are causing significant relevance issues.
A few observations:

(1) A redirect *target* is not always the canonical version of a URL. For example, it is very common for root-level pages to redirect to an internal home page (some 30% of the root pages in my index do so). However, the root pages have all the anchor text and are truly the canonical, permanent version of the page; the internal redirect target is just the "temporary" homepage, and could change at any time depending on the site implementation. Here are some examples:
    http://www.landwirtschaft-bw.info/
    http://www.dlr-rnh.rlp.de/
    http://www.niederoesterreich.at/
Because of the current policy of "discarding" the redirect source, I lose 30% of the home pages in my index, which makes my relevance very poor for navigational queries.

In this case, we would likely want to mark the internal redirect target as an alias as Andrzej suggests, and automatically transfer any link information to the root page.
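To make the idea concrete, here is a minimal sketch in plain Java. It is not Nutch code: the class RedirectAliasSketch, its methods, and the rule "a root page that redirects within its own host stays canonical" are all assumptions made for illustration, and a real fix would have to live in the crawl db and link db rather than in in-memory maps.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Nutch.
final class RedirectAliasSketch {
    // alias URL -> canonical URL
    private final Map<String, String> aliasOf = new HashMap<>();
    // URL -> anchor texts of the links pointing at it
    private final Map<String, List<String>> anchors = new HashMap<>();

    // True for "http://host" and "http://host/".
    static boolean isRootPage(String url) {
        String path = URI.create(url).getPath();
        return path == null || path.isEmpty() || path.equals("/");
    }

    static boolean sameHost(String a, String b) {
        String hostA = URI.create(a).getHost();
        String hostB = URI.create(b).getHost();
        return hostA != null && hostA.equalsIgnoreCase(hostB);
    }

    // Assumed heuristic: a root page redirecting inside its own host
    // remains canonical; otherwise the redirect target wins.
    static String pickCanonical(String source, String target) {
        return (isRootPage(source) && sameHost(source, target)) ? source : target;
    }

    void addAnchor(String url, String text) {
        anchors.computeIfAbsent(url, k -> new ArrayList<>()).add(text);
    }

    // Marks the losing URL as an alias and moves its anchor text
    // to the canonical URL. Returns the canonical URL.
    String recordRedirect(String source, String target) {
        String canonical = pickCanonical(source, target);
        String alias = canonical.equals(source) ? target : source;
        aliasOf.put(alias, canonical);
        List<String> moved = anchors.remove(alias);
        if (moved != null) {
            anchors.computeIfAbsent(canonical, k -> new ArrayList<>()).addAll(moved);
        }
        return canonical;
    }

    String canonicalOf(String url) {
        return aliasOf.getOrDefault(url, url);
    }

    List<String> anchorsFor(String url) {
        return anchors.getOrDefault(url, Collections.emptyList());
    }
}
```

With this rule, a redirect from http://www.x.com/ to an internal page on the same host keeps the root URL and folds the internal page's anchor text into it, while a redirect to a different host still resolves to the target.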

(2) There may be other cases where we want to alias two pages, either to avoid recrawling them, or to merge anchor text. Suppose we crawl both 
     http://www.x.com/
and
     http://www.x.com/index.html
and these are the same document.

Right now we will always crawl both of these, and the dedup algorithm will pick one (sadly often the /index.html version due to strange score anomalies), and throw out the anchor text for the other. While we can't safely normalize these two URLs to be the same in advance of seeing the content, once we see that the signatures are the same, we can, and should, merge them so that the index.html version is marked as an alias of the / version, and future crawls simply skip crawling the /index.html version and transfer its link information to the / page.

This problem, like the first one, is causing me to lose root-level URLs along with their anchor text, further affecting relevance for navigational queries.
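The signature-based merge described in (2) could look roughly like the sketch below, again in plain Java rather than against any real Nutch class. SignatureAliasSketch and its methods are hypothetical, the signature is treated as an opaque string, and "prefer the shortest URL" is only a stand-in for whatever rule we would actually want for choosing the winner.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; none of these names exist in Nutch.
final class SignatureAliasSketch {

    // Assumed preference: shorter URL first, ties broken alphabetically,
    // so "http://www.x.com/" beats "http://www.x.com/index.html".
    static int preference(String a, String b) {
        int byLength = Integer.compare(a.length(), b.length());
        return byLength != 0 ? byLength : a.compareTo(b);
    }

    // Input: URL -> content signature, known only after fetching.
    // Output: alias URL -> canonical URL for every group of URLs
    // that share a signature.
    static Map<String, String> aliasesBySignature(Map<String, String> signatureByUrl) {
        Map<String, List<String>> groups = new HashMap<>();
        for (Map.Entry<String, String> e : signatureByUrl.entrySet()) {
            groups.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        Map<String, String> aliasOf = new TreeMap<>();
        for (List<String> urls : groups.values()) {
            if (urls.size() < 2) {
                continue; // unique content, nothing to merge
            }
            String canonical = Collections.min(urls, SignatureAliasSketch::preference);
            for (String url : urls) {
                if (!url.equals(canonical)) {
                    aliasOf.put(url, canonical);
                }
            }
        }
        return aliasOf;
    }

    // Future crawls would skip anything already marked as an alias.
    static boolean shouldFetch(String url, Map<String, String> aliasOf) {
        return !aliasOf.containsKey(url);
    }
}
```

Link information for a skipped alias would then be credited to its canonical URL, in the same way as for redirects.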

In short, I agree with Andrzej that we need a way to mark a URL as an alias of another, to avoid recrawl, and to merge link information. We need to be careful, however, of *which* URL we pick. It is not always the redirect target that should win. And some of our current concept of "duplicates" should also be subsumed under the new notion of "alias."

I'm happy to help out in any way with a fix. I'm just looking at hacking together something in my own environment because the problems are affecting me so severely, but as I'm new-ish to Nutch, what I come up with might not be as elegant or flexible as what others might envision...

> pages that serverside forwards will be refetched every time
> -----------------------------------------------------------
>
>                 Key: NUTCH-353
>                 URL: http://issues.apache.org/jira/browse/NUTCH-353
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8.1, 0.9.0
>            Reporter: Stefan Groschupf
>         Assigned To: Andrzej Bialecki 
>            Priority: Blocker
>             Fix For: 0.9.0
>
>         Attachments: doNotRefecthForwarderPagesV1.patch
>
>
> Pages that do a serverside forward are not written with a status change back into the crawlDb. Also the nextFetchTime is not changed. 
> This causes a refetch of the same page again and again. The result is that Nutch is not polite and refetches the forwarding and target pages in each segment iteration. It also affects the scoring, since the forward page contributes its score to all outlinks.
