Posted to dev@nutch.apache.org by "Sebastian Nagel (Jira)" <ji...@apache.org> on 2019/10/18 16:21:02 UTC
[jira] [Created] (NUTCH-2749) Fetcher and scoring-opic: transfer score to redirects

Sebastian Nagel created NUTCH-2749:
--------------------------------------

             Summary: Fetcher and scoring-opic: transfer score to redirects
                 Key: NUTCH-2749
                 URL: https://issues.apache.org/jira/browse/NUTCH-2749
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher, plugin, scoring
    Affects Versions: 1.16
            Reporter: Sebastian Nagel
             Fix For: 1.17


See the discussion "[Score value lost after two successive redirects|https://lists.apache.org/thread.html/dbf7737fb8e6566d252e76290db806fac19dc56b854749c78a995bb8@1385999850@%3Cuser.nutch.apache.org%3E]" dating back to 2012.

Redirects should be able to pass scores to their targets. This is mandatory for reliable scoring; otherwise scores often get lost when a link target is redirected. E.g., when the target site has moved from http:// to https://, incoming links to http:// pages are usually redirected to https:// (on the target site), and the incoming score is lost. If the migration to https:// happened recently, the scores for this site might just become zero.

I agree with [~markus17]'s comment in the mentioned discussion @user that "it cannot be a good idea to just copy over the score". Instead, redirects should have the same effect as a page containing a single href link.

This would require the following change(s):

1. in Fetcher (class FetcherThread): the score should be passed forward to the redirect target
 * because the method {{distributeScoreToOutlinks(...)}} cannot be called for redirects (no content is parsed), we would need a dedicated hook
 {{distributeScoreToRedirect(Text fromUrl, Text toUrl, CrawlDatum source, CrawlDatum target)}}
 * to be called both for "recorded" and followed redirects (depending on http.redirect.max)
 * scoring strategies can be implemented there, e.g. apply "db.score.link.\{internal,external}"
 * to be implemented as a [default method|https://docs.oracle.com/javase/tutorial/java/IandI/defaultmethods.html] so that existing scoring filter plugins are not broken
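
A minimal sketch of how such a default-method hook could look. It is not the actual Nutch API: {{Datum}}, {{RedirectAwareScoringFilter}}, {{LegacyFilter}} and {{SingleLinkRedirectFilter}} are simplified, hypothetical stand-ins, URLs are plain strings instead of {{Text}}, and the two factors only mimic what "db.score.link.\{internal,external}" would provide. Treating "internal" as "same host" is also an assumption.

```java
import java.net.URI;

// Hypothetical stand-in for CrawlDatum: only the score matters here.
final class Datum {
    float score;
    Datum(float score) { this.score = score; }
}

// Hypothetical stand-in for the scoring filter interface.
interface RedirectAwareScoringFilter {
    // Represents the methods existing plugins already implement.
    float initialScore();

    // Proposed hook. The default is a no-op, so plugins written before
    // the hook existed keep compiling and behave as before.
    default void distributeScoreToRedirect(String fromUrl, String toUrl,
                                           Datum source, Datum target) {
        // intentionally empty
    }
}

// An "old" plugin that knows nothing about the new hook.
final class LegacyFilter implements RedirectAwareScoringFilter {
    @Override
    public float initialScore() { return 1.0f; }
}

// Assumed strategy: a redirect acts like a page with a single outlink,
// so the whole source score is passed on, scaled by a link factor.
final class SingleLinkRedirectFilter implements RedirectAwareScoringFilter {
    private final float internalFactor; // assumed analogue of db.score.link.internal
    private final float externalFactor; // assumed analogue of db.score.link.external

    SingleLinkRedirectFilter(float internalFactor, float externalFactor) {
        this.internalFactor = internalFactor;
        this.externalFactor = externalFactor;
    }

    @Override
    public float initialScore() { return 1.0f; }

    @Override
    public void distributeScoreToRedirect(String fromUrl, String toUrl,
                                          Datum source, Datum target) {
        boolean sameHost = hostOf(fromUrl).equalsIgnoreCase(hostOf(toUrl));
        float factor = sameHost ? internalFactor : externalFactor;
        target.score += source.score * factor;
    }

    private static String hostOf(String url) {
        String host = URI.create(url).getHost();
        return host == null ? "" : host;
    }
}
```

With this shape, a redirect from http://example.com/a to https://example.com/a counts as internal, so the target receives the source score times the internal factor, while {{LegacyFilter}} leaves the target untouched.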

2. during CrawlDb update (class CrawlDbReducer), there are different cases to consider:

a. URL not yet in CrawlDb: nothing to do if the score has been already passed forward (step 1)

b. URL already in CrawlDb, redirects not followed in fetcher (http.redirect.max == 0): the redirect target has been stored as db_outlink, so it will be used in the scoring method updateDbScore(...) -> nothing to do

c. URL already in CrawlDb, fetcher follows redirects: to get the same behavior as for incoming links we would need to mark fetches stemming from a followed redirect and use them in a modified updateDbScore(...)
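
The three cases can be summarized as a small decision function. This is only an illustrative sketch: {{RedirectCase}} and {{RedirectCases}} are hypothetical names and not part of CrawlDbReducer, and the flag {{followedInFetcher}} is assumed to be true when http.redirect.max is greater than zero.

```java
// Hypothetical labels for the cases 2a, 2b and 2c described above.
enum RedirectCase {
    TARGET_NOT_IN_CRAWLDB,   // 2a: score already passed forward in step 1
    RECORDED_AS_OUTLINK,     // 2b: target stored as db_outlink, handled by updateDbScore(...)
    FOLLOWED_IN_FETCHER      // 2c: would need marked fetches and a modified updateDbScore(...)
}

final class RedirectCases {
    private RedirectCases() { }

    // Maps the two conditions named in the issue text to one of the cases.
    static RedirectCase classify(boolean targetInCrawlDb, boolean followedInFetcher) {
        if (!targetInCrawlDb) {
            return RedirectCase.TARGET_NOT_IN_CRAWLDB;
        }
        return followedInFetcher
            ? RedirectCase.FOLLOWED_IN_FETCHER
            : RedirectCase.RECORDED_AS_OUTLINK;
    }

    // Only case 2c needs additional work in the CrawlDb update.
    static boolean needsCrawlDbChange(RedirectCase c) {
        return c == RedirectCase.FOLLOWED_IN_FETCHER;
    }
}
```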

Being pragmatic, I would address in this issue only point 1 (and, implicitly, 2a and 2b). Point 2c would require significant changes and would be hard to control in the worst case, when multiple followed redirects all end in the same target.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)