Posted to dev@nutch.apache.org by "Dennis Kubes (JIRA)" <ji...@apache.org> on 2007/11/07 20:24:50 UTC
[jira] Updated: (NUTCH-547) Redirection handling: YahooSlurp's algorithm
[ https://issues.apache.org/jira/browse/NUTCH-547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dennis Kubes updated NUTCH-547:
-------------------------------
Attachment: NUTCH-547-3.patch
Updated to current trunk.
> Redirection handling: YahooSlurp's algorithm
> --------------------------------------------
>
> Key: NUTCH-547
> URL: https://issues.apache.org/jira/browse/NUTCH-547
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Reporter: Doğacan Güney
> Fix For: 1.0.0
>
> Attachments: NUTCH-547-3.patch, redirect_draft.patch, redirect_draft_v2.patch
>
>
> After reading Yahoo's algorithm (the one Andrzej linked to:
> http://help.yahoo.com/l/nz/yahooxtra/search/webcrawler/slurp-11.html )
> in the redirect/alias handling discussion, I had a bit of spare
> time, so I implemented it.
> Note that the patch I am attaching is for the 'choosing' algorithm described in
> Yahoo's help page. It makes no attempt to handle aliases in any way. (See http://www.nabble.com/Redirects-and-alias-handling-%28LONG%29-tf4270371.html#a12154362 for the discussion about alias handling).
> E.g.,
> generate "http://www.milliyet.com.tr/"
> fetch "http://www.milliyet.com.tr/" which redirects to
> "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39".
> Update second page's datum's metadata to indicate that
> "http://www.milliyet.com.tr/" is the representative form.
> Updatedb, invertlinks, etc...
> While indexing second page, change its "url" field to
> "http://www.milliyet.com.tr/".
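
The example flow above can be sketched roughly as follows. This is a minimal Python illustration, not the attached patch: the metadata key name, the dict-based crawl_db, and both helper functions are assumptions made for illustration only. The sketch always treats the redirect source as the representative form, as in the example, and does not model the choosing rules from Yahoo's help page.

```python
# Hypothetical metadata key; the name actually used by the patch is not given here.
REPR_KEY = "representative_url"


def record_redirect(crawl_db, source_url, target_url):
    """On a redirect, store the source URL as the target's representative form.

    Assumption: the redirect source is always chosen, as in the example;
    the choosing rules themselves are not modeled.
    """
    datum = crawl_db.setdefault(target_url, {"metadata": {}})
    datum["metadata"][REPR_KEY] = source_url
    return datum


def build_index_doc(crawl_db, url, content):
    """Build an index document, using the representative URL if one is recorded."""
    metadata = crawl_db.get(url, {}).get("metadata", {})
    return {"url": metadata.get(REPR_KEY, url), "content": content}


crawl_db = {}  # stand-in for the crawl database: URL -> datum
source = "http://www.milliyet.com.tr/"
target = "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39"

record_redirect(crawl_db, source, target)
doc = build_index_doc(crawl_db, target, "page text")
print(doc["url"])  # prints http://www.milliyet.com.tr/
```

A page that was never reached through a redirect has no representative URL recorded, so its "url" field is left unchanged at indexing time.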
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.