Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2014/08/19 12:14:19 UTC

[jira] [Commented] (NUTCH-685) Content-level redirect status lost in ParseSegment

    [ https://issues.apache.org/jira/browse/NUTCH-685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102092#comment-14102092 ] 

Sebastian Nagel commented on NUTCH-685:
---------------------------------------

Confirmed (for 1.x): content-level redirects (a.k.a. meta refresh) do not result in a redirect status in CrawlDb (db_redir_perm or db_redir_temp).

In current trunk/1.x, they are not even recorded if Fetcher is parsing (fetcher.parse==true):
* setting the status "redir_perm" was introduced with r492525 in Fetcher.java:
{code}
 case ProtocolStatus.SUCCESS:        // got a page
   pstatus = output(url, datum, content, status, CrawlDatum.STATUS_FETCH_SUCCESS);
   if (pstatus != null && pstatus.isSuccess() &&
       pstatus.getMinorCode() == ParseStatus.SUCCESS_REDIRECT) {
...
      // record that we were redirected
      output(url, datum, null, status, CrawlDatum.STATUS_FETCH_REDIR_PERM);
{code}
* but lost with r593151 (since release 1.0 / NUTCH-547)

The problem is that pages containing a content-level redirect are indexed as successfully fetched pages. But usually they contain only a note like "You will be redirected in 10 seconds. If not click here." Possible solutions to exclude those pages (for 1.x):
# mark meta-refresh redirects as such (which status to use is debatable):
** have Fetcher emit a CrawlDatum with redirect status again
** try this also for ParseOutput (if fetcher.parse==false): possible in principle, but at the price of lost information. If we emit a redirect CrawlDatum into crawl_parse, it overwrites the one from crawl_fetch. The status is then redirect, but we lose the fetch time and metadata. The original fetch datum is not available while parsing segments.
# skip and delete content-level redirects during indexing (similar to robots=noindex)
** check for {{parseData.getStatus().getMinorCode() == ParseStatus.SUCCESS_REDIRECT}} in IndexerMapReduce
** additionally (it would do no harm!), try to add the meta refresh to the CrawlDatum's metadata
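
The second option (skipping content-level redirects during indexing) could look roughly like the sketch below. It is self-contained and does not use any Nutch classes: {{ParsedDoc}}, {{urlsToIndex}} and the constant values are hypothetical stand-ins, and only the names {{SUCCESS_REDIRECT}} and the major/minor code idea are taken from the check above. The real change would go into IndexerMapReduce and would also have to delete documents that are already in the index, which is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: none of these types are Nutch classes. */
public class RedirectIndexFilterSketch {

    // Stand-in status codes; the values are illustrative assumptions.
    static final int SUCCESS = 1;            // major code: parse succeeded
    static final int SUCCESS_OK = 0;         // minor code: ordinary page
    static final int SUCCESS_REDIRECT = 100; // minor code: meta refresh found

    /** Hypothetical stand-in for a parsed page together with its parse status. */
    static final class ParsedDoc {
        final String url;
        final int majorCode;
        final int minorCode;

        ParsedDoc(String url, int majorCode, int minorCode) {
            this.url = url;
            this.majorCode = majorCode;
            this.minorCode = minorCode;
        }

        /** True if parsing succeeded but the page is only a content-level redirect. */
        boolean isContentLevelRedirect() {
            return majorCode == SUCCESS && minorCode == SUCCESS_REDIRECT;
        }
    }

    /** Returns the URLs that would be indexed, skipping content-level redirects. */
    static List<String> urlsToIndex(List<ParsedDoc> docs) {
        List<String> keep = new ArrayList<>();
        for (ParsedDoc doc : docs) {
            if (doc.isContentLevelRedirect()) {
                continue; // skip, similar to how robots=noindex pages are skipped
            }
            keep.add(doc.url);
        }
        return keep;
    }

    public static void main(String[] args) {
        List<ParsedDoc> docs = new ArrayList<>();
        docs.add(new ParsedDoc("http://example.com/a", SUCCESS, SUCCESS_OK));
        docs.add(new ParsedDoc("http://example.com/old", SUCCESS, SUCCESS_REDIRECT));
        System.out.println(urlsToIndex(docs)); // prints [http://example.com/a]
    }
}
```

The check is kept in a single predicate so that the same condition could be reused if the redirect is also to be recorded in the CrawlDatum's metadata.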

> Content-level redirect status lost in ParseSegment
> --------------------------------------------------
>
>                 Key: NUTCH-685
>                 URL: https://issues.apache.org/jira/browse/NUTCH-685
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.0.0
>            Reporter: Andrzej Bialecki 
>            Assignee: Julien Nioche
>             Fix For: 1.10
>
>
> When Fetcher runs in parsing mode, content-level redirects (HTML meta tag "Refresh") are properly discovered and recorded in crawl_fetch under source URL and target URL. If Fetcher runs in non-parsing mode, and ParseSegment is run as a separate step, the content-level redirection data is used only to add the new (target) URL, but the status of the original URL is not reset to indicate a redirect. Consequently, status of the original URL will be different depending on the way you run Fetcher, whereas it should be the same.


