You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@nutch.apache.org by "Doğacan Güney (JIRA)" <ji...@apache.org> on 2007/09/03 09:47:18 UTC

[jira] Created: (NUTCH-547) Redirection handling: YahooSlurp's algorithm

Redirection handling: YahooSlurp's algorithm
--------------------------------------------

                 Key: NUTCH-547
                 URL: https://issues.apache.org/jira/browse/NUTCH-547
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher
            Reporter: Doğacan Güney
             Fix For: 1.0.0


After reading Yahoo's algorithm (then one Andrzej linked to:
http://help.yahoo.com/l/nz/yahooxtra/search/webcrawler/slurp-11.html )
in the redirect/alias handling discussion, I had a bit of a spare
time, so I implemented it.

Note that the patch I am attaching is for the 'choosing' algorithm described in
Yahoo's help page. It makes no attempt to handle aliases in any way. (See http://www.nabble.com/Redirects-and-alias-handling-%28LONG%29-tf4270371.html#a12154362 for the discussion about alias handling).

E.g,
generate "http://www.milliyet.com.tr/"

fetch "http:/www.milliyet.com.tr/" which redirects to
"http://www.milliyet.com.tr/2007/08/29/index.html?ver=39".

Update second page's datum's metadata to indicate that
"http://www.milliyet.com.tr/" is the representative form.

Updatedb, invertlinks, etc...

While indexing second page, change its "url" field to
"http://www.milliyet.com.tr/".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-547) Redirection handling: YahooSlurp's algorithm

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540884 ] 

Chris A. Mattmann commented on NUTCH-547:
-----------------------------------------

+1: without having tested it, it improves an existing significant functionality and since we're using a CM, we can always revert, so IMO, go ahead and commit since it's been out there for quite a while.

> Redirection handling: YahooSlurp's algorithm
> --------------------------------------------
>
>                 Key: NUTCH-547
>                 URL: https://issues.apache.org/jira/browse/NUTCH-547
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>            Reporter: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-547-3.patch, redirect_draft.patch, redirect_draft_v2.patch
>
>
> After reading Yahoo's algorithm (then one Andrzej linked to:
> http://help.yahoo.com/l/nz/yahooxtra/search/webcrawler/slurp-11.html )
> in the redirect/alias handling discussion, I had a bit of a spare
> time, so I implemented it.
> Note that the patch I am attaching is for the 'choosing' algorithm described in
> Yahoo's help page. It makes no attempt to handle aliases in any way. (See http://www.nabble.com/Redirects-and-alias-handling-%28LONG%29-tf4270371.html#a12154362 for the discussion about alias handling).
> E.g,
> generate "http://www.milliyet.com.tr/"
> fetch "http:/www.milliyet.com.tr/" which redirects to
> "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39".
> Update second page's datum's metadata to indicate that
> "http://www.milliyet.com.tr/" is the representative form.
> Updatedb, invertlinks, etc...
> While indexing second page, change its "url" field to
> "http://www.milliyet.com.tr/".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (NUTCH-547) Redirection handling: YahooSlurp's algorithm

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-547:
--------------------------------

    Attachment: redirect_draft_v2.patch

Draft 2

 * Refresh time is now stored as an extra argument.
  * During indexing, if "url" field is changed, "host" and "site" fields are updated accordingly.
  * Made refreshTime magic number a constant. Made redirType a constant.

> Redirection handling: YahooSlurp's algorithm
> --------------------------------------------
>
>                 Key: NUTCH-547
>                 URL: https://issues.apache.org/jira/browse/NUTCH-547
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>            Reporter: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: redirect_draft.patch, redirect_draft_v2.patch
>
>
> After reading Yahoo's algorithm (then one Andrzej linked to:
> http://help.yahoo.com/l/nz/yahooxtra/search/webcrawler/slurp-11.html )
> in the redirect/alias handling discussion, I had a bit of a spare
> time, so I implemented it.
> Note that the patch I am attaching is for the 'choosing' algorithm described in
> Yahoo's help page. It makes no attempt to handle aliases in any way. (See http://www.nabble.com/Redirects-and-alias-handling-%28LONG%29-tf4270371.html#a12154362 for the discussion about alias handling).
> E.g,
> generate "http://www.milliyet.com.tr/"
> fetch "http:/www.milliyet.com.tr/" which redirects to
> "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39".
> Update second page's datum's metadata to indicate that
> "http://www.milliyet.com.tr/" is the representative form.
> Updatedb, invertlinks, etc...
> While indexing second page, change its "url" field to
> "http://www.milliyet.com.tr/".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-547) Redirection handling: YahooSlurp's algorithm

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540881 ] 

Dennis Kubes commented on NUTCH-547:
------------------------------------

I agree that we should be less reluctant to commit things and hopefully that will pick up the speed of Nutch development.  This patch has been out there for a while, all I did was to bring it up to the current trunk.  You may want to run it through a final test but I feel that it is ready and if there are larger problems that aren't apparent we can always roll it back.  +1 for committing it ASAP.  I wouldn't wait until Sunday.  Whenever you feel it is ready. :)

> Redirection handling: YahooSlurp's algorithm
> --------------------------------------------
>
>                 Key: NUTCH-547
>                 URL: https://issues.apache.org/jira/browse/NUTCH-547
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>            Reporter: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-547-3.patch, redirect_draft.patch, redirect_draft_v2.patch
>
>
> After reading Yahoo's algorithm (then one Andrzej linked to:
> http://help.yahoo.com/l/nz/yahooxtra/search/webcrawler/slurp-11.html )
> in the redirect/alias handling discussion, I had a bit of a spare
> time, so I implemented it.
> Note that the patch I am attaching is for the 'choosing' algorithm described in
> Yahoo's help page. It makes no attempt to handle aliases in any way. (See http://www.nabble.com/Redirects-and-alias-handling-%28LONG%29-tf4270371.html#a12154362 for the discussion about alias handling).
> E.g,
> generate "http://www.milliyet.com.tr/"
> fetch "http:/www.milliyet.com.tr/" which redirects to
> "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39".
> Update second page's datum's metadata to indicate that
> "http://www.milliyet.com.tr/" is the representative form.
> Updatedb, invertlinks, etc...
> While indexing second page, change its "url" field to
> "http://www.milliyet.com.tr/".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-547) Redirection handling: YahooSlurp's algorithm

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12524693 ] 

Doğacan Güney commented on NUTCH-547:
-------------------------------------

Thanks a lot for the quick review, Andrzej.

> * the patch uses a strange diff format ... the first lines of context diffs appear on the same lines as chunk coordinates. 

Sorry about that. I am using git-svn (which, by the way, is an awesome tool) to develop nutch so I may have forgotten to use "svn diff" for the patch.

> * in Fetcher[2].handleRedirect(), what happens when the selected reprUrl is the same as the urlString? We should skip the 
> redirect then. 

We don't follow reprUrl,  we follow newUrl which is tested for equality with urlString. However, we should probably avoid writing reprUrl in crawldatum metadata if it is the same as the urlString.

> * the repeating parsing of refreshTime should be hidden in a utility method in ParseStatus - although the proper way to 
> support this would be to extend ParseStatus to store this int value if necessary, i.e. if ParseStatus is SUCCESS_REDIRECT (we
> would have to bump the version number, too).

Good point. Will look into that.

> * minimum refreshTime should be at least a constant, or configurable, and not a literal. Similarly the redirType should be a 
> constant. 

This patch is only a rough draft. I will fix all such issues in a later patch.

> * if we change the "url" field in BasicIndexingFilter, shouldn't we also change the "site"and "host" fields? [...]

Wow, can't believe I missed that. 

> [..] We could also consider adding reprUrl as an additional value for the same "url" field - this way we would get hits both on
>  the original url and the reprUrl. 

This may cause problems with dedup which assumes that "url" field has a single value. Also, it may be difficult to decide which value of "url" to show in web UI. I also like that fact that "url" is like a UNIQUE KEY for the document. If we allow "url" to have multiple values we lose that. 

Perhaps we can add reprUrl to a "repr" field instead?

> Redirection handling: YahooSlurp's algorithm
> --------------------------------------------
>
>                 Key: NUTCH-547
>                 URL: https://issues.apache.org/jira/browse/NUTCH-547
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>            Reporter: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: redirect_draft.patch
>
>
> After reading Yahoo's algorithm (then one Andrzej linked to:
> http://help.yahoo.com/l/nz/yahooxtra/search/webcrawler/slurp-11.html )
> in the redirect/alias handling discussion, I had a bit of a spare
> time, so I implemented it.
> Note that the patch I am attaching is for the 'choosing' algorithm described in
> Yahoo's help page. It makes no attempt to handle aliases in any way. (See http://www.nabble.com/Redirects-and-alias-handling-%28LONG%29-tf4270371.html#a12154362 for the discussion about alias handling).
> E.g,
> generate "http://www.milliyet.com.tr/"
> fetch "http:/www.milliyet.com.tr/" which redirects to
> "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39".
> Update second page's datum's metadata to indicate that
> "http://www.milliyet.com.tr/" is the representative form.
> Updatedb, invertlinks, etc...
> While indexing second page, change its "url" field to
> "http://www.milliyet.com.tr/".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-547) Redirection handling: YahooSlurp's algorithm

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12526258 ] 

Andrzej Bialecki  commented on NUTCH-547:
-----------------------------------------

> > I'm not sure why the patch to Indexer.java tries to overwrite reprUrl from fetchDatum with the value from dbDatum [..]

I'm still not sure about this issue - could you please clarify?

> Perhaps we can add reprUrl to a "repr" field instead?

Shouldn't this be the other way around - the idea of your patch is to put the data under the reprUrl, so in order to minimize code changes you replace the original url with reprUrl. This way we lose the value of the original url, so it seems to me that if we want to preserve it we should add it to an "orig" field ..

> Redirection handling: YahooSlurp's algorithm
> --------------------------------------------
>
>                 Key: NUTCH-547
>                 URL: https://issues.apache.org/jira/browse/NUTCH-547
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>            Reporter: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: redirect_draft.patch
>
>
> After reading Yahoo's algorithm (then one Andrzej linked to:
> http://help.yahoo.com/l/nz/yahooxtra/search/webcrawler/slurp-11.html )
> in the redirect/alias handling discussion, I had a bit of a spare
> time, so I implemented it.
> Note that the patch I am attaching is for the 'choosing' algorithm described in
> Yahoo's help page. It makes no attempt to handle aliases in any way. (See http://www.nabble.com/Redirects-and-alias-handling-%28LONG%29-tf4270371.html#a12154362 for the discussion about alias handling).
> E.g,
> generate "http://www.milliyet.com.tr/"
> fetch "http:/www.milliyet.com.tr/" which redirects to
> "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39".
> Update second page's datum's metadata to indicate that
> "http://www.milliyet.com.tr/" is the representative form.
> Updatedb, invertlinks, etc...
> While indexing second page, change its "url" field to
> "http://www.milliyet.com.tr/".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-547) Redirection handling: YahooSlurp's algorithm

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12526263 ] 

Doğacan Güney commented on NUTCH-547:
-------------------------------------

>> > I'm not sure why the patch to Indexer.java tries to overwrite reprUrl from fetchDatum with the value from dbDatum [..]
>
> I'm still not sure about this issue - could you please clarify?

Sorry, it seems I forgot to answer it :) 

It is possible that we discover a meta-redirect during parse phase. We have no way of updating fetch datum-s at this point, so instead, parse writes this information to crawl_parse which of course is then passed to crawldb. So, during indexing, it is possible that dbDatum contains (meta-)redirect information while fetchDatum does not. But you are right that we should probably give some sort of priority to fetchDatum's metadata over dbDatum.

> Perhaps we can add reprUrl to a "repr" field instead?
> 
> Shouldn't this be the other way around - the idea of your patch is to put the data under the reprUrl, so in order to minimize code changes you 
> replace the original url with reprUrl. This way we lose the value of the original url, so it seems to me that if we want to preserve it we should add it
> to an "orig" field ..

OK, this makes sense to me. I guess we should make "orig" both indexed and stored?

---

Btw, one of the major issues with redirection (that this patch does not solve) is that scores/other information are not reflected in redirections. Assume foo.com is a major web site. Url http://www.foo.com/ redirects to http://www.foo.com/daily.html . People, naturally, are much more likely to link to http://www.foo.com/ then http://www.foo.com/daily.html (the problem is even more interesting if http://foo.com also points to http://www.foo.com/daily.html ). So, I think we must have some a way to "pass" the score from source site to redirection site. Same thing for adaptive crawls: It may look like www.foo.com never changes (since it just redirects to a different url). But it should be considered "modified" whenever page at redirect url is updated.

I am not sure how we can achieve this, though. We will probably need an extra job (that should run at least once before indexing) that merges information from such pages.



> Redirection handling: YahooSlurp's algorithm
> --------------------------------------------
>
>                 Key: NUTCH-547
>                 URL: https://issues.apache.org/jira/browse/NUTCH-547
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>            Reporter: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: redirect_draft.patch
>
>
> After reading Yahoo's algorithm (then one Andrzej linked to:
> http://help.yahoo.com/l/nz/yahooxtra/search/webcrawler/slurp-11.html )
> in the redirect/alias handling discussion, I had a bit of a spare
> time, so I implemented it.
> Note that the patch I am attaching is for the 'choosing' algorithm described in
> Yahoo's help page. It makes no attempt to handle aliases in any way. (See http://www.nabble.com/Redirects-and-alias-handling-%28LONG%29-tf4270371.html#a12154362 for the discussion about alias handling).
> E.g,
> generate "http://www.milliyet.com.tr/"
> fetch "http:/www.milliyet.com.tr/" which redirects to
> "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39".
> Update second page's datum's metadata to indicate that
> "http://www.milliyet.com.tr/" is the representative form.
> Updatedb, invertlinks, etc...
> While indexing second page, change its "url" field to
> "http://www.milliyet.com.tr/".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Closed: (NUTCH-547) Redirection handling: YahooSlurp's algorithm

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-547.
-------------------------------


> Redirection handling: YahooSlurp's algorithm
> --------------------------------------------
>
>                 Key: NUTCH-547
>                 URL: https://issues.apache.org/jira/browse/NUTCH-547
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>            Reporter: Doğacan Güney
>            Assignee: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-547-3.patch, redirect_draft.patch, redirect_draft_v2.patch
>
>
> After reading Yahoo's algorithm (then one Andrzej linked to:
> http://help.yahoo.com/l/nz/yahooxtra/search/webcrawler/slurp-11.html )
> in the redirect/alias handling discussion, I had a bit of a spare
> time, so I implemented it.
> Note that the patch I am attaching is for the 'choosing' algorithm described in
> Yahoo's help page. It makes no attempt to handle aliases in any way. (See http://www.nabble.com/Redirects-and-alias-handling-%28LONG%29-tf4270371.html#a12154362 for the discussion about alias handling).
> E.g,
> generate "http://www.milliyet.com.tr/"
> fetch "http:/www.milliyet.com.tr/" which redirects to
> "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39".
> Update second page's datum's metadata to indicate that
> "http://www.milliyet.com.tr/" is the representative form.
> Updatedb, invertlinks, etc...
> While indexing second page, change its "url" field to
> "http://www.milliyet.com.tr/".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (NUTCH-547) Redirection handling: YahooSlurp's algorithm

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-547:
--------------------------------

    Attachment: redirect_draft.patch

Draft version.

> Redirection handling: YahooSlurp's algorithm
> --------------------------------------------
>
>                 Key: NUTCH-547
>                 URL: https://issues.apache.org/jira/browse/NUTCH-547
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>            Reporter: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: redirect_draft.patch
>
>
> After reading Yahoo's algorithm (then one Andrzej linked to:
> http://help.yahoo.com/l/nz/yahooxtra/search/webcrawler/slurp-11.html )
> in the redirect/alias handling discussion, I had a bit of a spare
> time, so I implemented it.
> Note that the patch I am attaching is for the 'choosing' algorithm described in
> Yahoo's help page. It makes no attempt to handle aliases in any way. (See http://www.nabble.com/Redirects-and-alias-handling-%28LONG%29-tf4270371.html#a12154362 for the discussion about alias handling).
> E.g,
> generate "http://www.milliyet.com.tr/"
> fetch "http:/www.milliyet.com.tr/" which redirects to
> "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39".
> Update second page's datum's metadata to indicate that
> "http://www.milliyet.com.tr/" is the representative form.
> Updatedb, invertlinks, etc...
> While indexing second page, change its "url" field to
> "http://www.milliyet.com.tr/".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (NUTCH-547) Redirection handling: YahooSlurp's algorithm

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-547:
-------------------------------

    Attachment: NUTCH-547-3.patch

Updated to current trunk.

> Redirection handling: YahooSlurp's algorithm
> --------------------------------------------
>
>                 Key: NUTCH-547
>                 URL: https://issues.apache.org/jira/browse/NUTCH-547
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>            Reporter: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-547-3.patch, redirect_draft.patch, redirect_draft_v2.patch
>
>
> After reading Yahoo's algorithm (then one Andrzej linked to:
> http://help.yahoo.com/l/nz/yahooxtra/search/webcrawler/slurp-11.html )
> in the redirect/alias handling discussion, I had a bit of a spare
> time, so I implemented it.
> Note that the patch I am attaching is for the 'choosing' algorithm described in
> Yahoo's help page. It makes no attempt to handle aliases in any way. (See http://www.nabble.com/Redirects-and-alias-handling-%28LONG%29-tf4270371.html#a12154362 for the discussion about alias handling).
> E.g,
> generate "http://www.milliyet.com.tr/"
> fetch "http:/www.milliyet.com.tr/" which redirects to
> "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39".
> Update second page's datum's metadata to indicate that
> "http://www.milliyet.com.tr/" is the representative form.
> Updatedb, invertlinks, etc...
> While indexing second page, change its "url" field to
> "http://www.milliyet.com.tr/".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-547) Redirection handling: YahooSlurp's algorithm

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541233 ] 

Hudson commented on NUTCH-547:
------------------------------

Integrated in Nutch-Nightly #261 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/261/])

> Redirection handling: YahooSlurp's algorithm
> --------------------------------------------
>
>                 Key: NUTCH-547
>                 URL: https://issues.apache.org/jira/browse/NUTCH-547
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>            Reporter: Doğacan Güney
>            Assignee: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-547-3.patch, redirect_draft.patch, redirect_draft_v2.patch
>
>
> After reading Yahoo's algorithm (then one Andrzej linked to:
> http://help.yahoo.com/l/nz/yahooxtra/search/webcrawler/slurp-11.html )
> in the redirect/alias handling discussion, I had a bit of a spare
> time, so I implemented it.
> Note that the patch I am attaching is for the 'choosing' algorithm described in
> Yahoo's help page. It makes no attempt to handle aliases in any way. (See http://www.nabble.com/Redirects-and-alias-handling-%28LONG%29-tf4270371.html#a12154362 for the discussion about alias handling).
> E.g,
> generate "http://www.milliyet.com.tr/"
> fetch "http:/www.milliyet.com.tr/" which redirects to
> "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39".
> Update second page's datum's metadata to indicate that
> "http://www.milliyet.com.tr/" is the representative form.
> Updatedb, invertlinks, etc...
> While indexing second page, change its "url" field to
> "http://www.milliyet.com.tr/".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-547) Redirection handling: YahooSlurp's algorithm

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540850 ] 

Dennis Kubes commented on NUTCH-547:
------------------------------------

Doğacan, great work on this patch.  I have brought your patch up to the most recent trunk and tested it.  Works good.  In terms of scoring your patch actually does handle the most common *scoring* error.  That where we wouldn't store or index the source url for temporary refreshes (See NUTCH-572).  All unit tests have passed.  I am +1 for committing this ASAP.

> Redirection handling: YahooSlurp's algorithm
> --------------------------------------------
>
>                 Key: NUTCH-547
>                 URL: https://issues.apache.org/jira/browse/NUTCH-547
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>            Reporter: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: redirect_draft.patch, redirect_draft_v2.patch
>
>
> After reading Yahoo's algorithm (then one Andrzej linked to:
> http://help.yahoo.com/l/nz/yahooxtra/search/webcrawler/slurp-11.html )
> in the redirect/alias handling discussion, I had a bit of a spare
> time, so I implemented it.
> Note that the patch I am attaching is for the 'choosing' algorithm described in
> Yahoo's help page. It makes no attempt to handle aliases in any way. (See http://www.nabble.com/Redirects-and-alias-handling-%28LONG%29-tf4270371.html#a12154362 for the discussion about alias handling).
> E.g,
> generate "http://www.milliyet.com.tr/"
> fetch "http:/www.milliyet.com.tr/" which redirects to
> "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39".
> Update second page's datum's metadata to indicate that
> "http://www.milliyet.com.tr/" is the representative form.
> Updatedb, invertlinks, etc...
> While indexing second page, change its "url" field to
> "http://www.milliyet.com.tr/".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-547) Redirection handling: YahooSlurp's algorithm

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540869 ] 

Doğacan Güney commented on NUTCH-547:
-------------------------------------

Dennis, thanks for the update and testing.  I have been a bit unresponsive lately, I hope that this will change (very) soon.

I actually was going to commit it a while back, but Andrzej said that he is working on adding alias support/etc. (as outlined in his earlier email to nutch-dev) which included this patch as part of his effort but it probably fell off his radar during his move and everything.

Anyway, I feel that lately we have been too reluctant to commit new stuff. So I am also +1 for committing it. If we break something or miss something, we can always fix it later :)

So I am going to give this one a day or two (say, until sunday) and commit it if there are no objections. Is this OK with you, Dennis?

> Redirection handling: YahooSlurp's algorithm
> --------------------------------------------
>
>                 Key: NUTCH-547
>                 URL: https://issues.apache.org/jira/browse/NUTCH-547
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>            Reporter: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-547-3.patch, redirect_draft.patch, redirect_draft_v2.patch
>
>
> After reading Yahoo's algorithm (then one Andrzej linked to:
> http://help.yahoo.com/l/nz/yahooxtra/search/webcrawler/slurp-11.html )
> in the redirect/alias handling discussion, I had a bit of a spare
> time, so I implemented it.
> Note that the patch I am attaching is for the 'choosing' algorithm described in
> Yahoo's help page. It makes no attempt to handle aliases in any way. (See http://www.nabble.com/Redirects-and-alias-handling-%28LONG%29-tf4270371.html#a12154362 for the discussion about alias handling).
> E.g,
> generate "http://www.milliyet.com.tr/"
> fetch "http:/www.milliyet.com.tr/" which redirects to
> "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39".
> Update second page's datum's metadata to indicate that
> "http://www.milliyet.com.tr/" is the representative form.
> Updatedb, invertlinks, etc...
> While indexing second page, change its "url" field to
> "http://www.milliyet.com.tr/".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Commented: (NUTCH-547) Redirection handling: YahooSlurp's algorithm

Posted by Andrzej Bialecki <ab...@getopt.org>.
Andrzej Bialecki (JIRA) wrote:

> * parsing of the redirect time should be moved IMHO to
> handleRedirect(), to simplify the logic in the FetcherThread.run().

Hehe .. I wrote a part of this comment just before I left for a few 
hours, and then finished it without checking the first part. So, strike 
out this section, the other comment on this subject should apply. ;)

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


[jira] Commented: (NUTCH-547) Redirection handling: YahooSlurp's algorithm

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12524567 ] 

Andrzej Bialecki  commented on NUTCH-547:
-----------------------------------------

A few comments:

* the patch uses a strange diff format ... the first lines of context diffs appear on the same lines as chunk coordinates.

* in Fetcher[2].handleRedirect(), what happens when the selected reprUrl is the same as the urlString? We should skip the redirect then.

* the repeating parsing of refreshTime should be hidden in a utility method in ParseStatus - although the proper way to support this would be to extend ParseStatus to store this int value if necessary, i.e. if ParseStatus is SUCCESS_REDIRECT (we would have to bump the version number, too).

* minimum refreshTime should be at least a constant, or configurable, and not a literal. Similarly the redirType should be a constant.

* parsing of the redirect time should be moved IMHO to handleRedirect(), to simplify the logic in the FetcherThread.run().

* if we change the "url" field in BasicIndexingFilter, shouldn't we also change the "site"and "host" fields? We could also consider adding reprUrl as an additional value for the same "url" field - this way we would get hits both on the original url and the reprUrl.

* I'm not sure why the patch to Indexer.java tries to overwrite reprUrl from fetchDatum with the value from dbDatum - if anything, the value in fetchDatum should be more up to date, no? as it is now, it's silently overwritten. The only way the reprUrl could end up in dbDatum is from a previous updatedb operation, so it should contain an older information.

> Redirection handling: YahooSlurp's algorithm
> --------------------------------------------
>
>                 Key: NUTCH-547
>                 URL: https://issues.apache.org/jira/browse/NUTCH-547
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>            Reporter: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: redirect_draft.patch
>
>
> After reading Yahoo's algorithm (then one Andrzej linked to:
> http://help.yahoo.com/l/nz/yahooxtra/search/webcrawler/slurp-11.html )
> in the redirect/alias handling discussion, I had a bit of a spare
> time, so I implemented it.
> Note that the patch I am attaching is for the 'choosing' algorithm described in
> Yahoo's help page. It makes no attempt to handle aliases in any way. (See http://www.nabble.com/Redirects-and-alias-handling-%28LONG%29-tf4270371.html#a12154362 for the discussion about alias handling).
> E.g,
> generate "http://www.milliyet.com.tr/"
> fetch "http:/www.milliyet.com.tr/" which redirects to
> "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39".
> Update second page's datum's metadata to indicate that
> "http://www.milliyet.com.tr/" is the representative form.
> Updatedb, invertlinks, etc...
> While indexing second page, change its "url" field to
> "http://www.milliyet.com.tr/".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (NUTCH-547) Redirection handling: YahooSlurp's algorithm

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney resolved NUTCH-547.
---------------------------------

    Resolution: Fixed
      Assignee: Doğacan Güney

Latest patch committed in rev. 593151 with a trivial modification.

There was a minor typo in Indexer.java (word "protocol" was added to a comment).

> Redirection handling: YahooSlurp's algorithm
> --------------------------------------------
>
>                 Key: NUTCH-547
>                 URL: https://issues.apache.org/jira/browse/NUTCH-547
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>            Reporter: Doğacan Güney
>            Assignee: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-547-3.patch, redirect_draft.patch, redirect_draft_v2.patch
>
>
> After reading Yahoo's algorithm (then one Andrzej linked to:
> http://help.yahoo.com/l/nz/yahooxtra/search/webcrawler/slurp-11.html )
> in the redirect/alias handling discussion, I had a bit of a spare
> time, so I implemented it.
> Note that the patch I am attaching is for the 'choosing' algorithm described in
> Yahoo's help page. It makes no attempt to handle aliases in any way. (See http://www.nabble.com/Redirects-and-alias-handling-%28LONG%29-tf4270371.html#a12154362 for the discussion about alias handling).
> E.g,
> generate "http://www.milliyet.com.tr/"
> fetch "http:/www.milliyet.com.tr/" which redirects to
> "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39".
> Update second page's datum's metadata to indicate that
> "http://www.milliyet.com.tr/" is the representative form.
> Updatedb, invertlinks, etc...
> While indexing second page, change its "url" field to
> "http://www.milliyet.com.tr/".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.