Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2006/07/19 14:11:13 UTC

[jira] Created: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

Fetcher discards ProtocolStatus, doesn't store redirected pages
---------------------------------------------------------------

                 Key: NUTCH-322
                 URL: http://issues.apache.org/jira/browse/NUTCH-322
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 0.8-dev
            Reporter: Andrzej Bialecki 
             Fix For: 0.8-dev


Fetcher doesn't store ProtocolStatus in output segments. ProtocolStatus contains important information, such as protocol-level response code, lastModified time, and possibly other messages.

I propose that ProtocolStatus should be stored inside CrawlDatum.metaData, which is then stored into crawl_fetch (in Fetcher.FetcherThread.output()). In addition, if ProtocolStatus contains a valid lastModified time, that CrawlDatum's modified time should also be set to this value.
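To make the proposal concrete, here is a minimal sketch of what output() might do with the status. It does not use the real Nutch classes: Status and Datum are hypothetical stand-ins for ProtocolStatus and CrawlDatum, the metadata key name is invented, and treating 0 as "no lastModified reported" is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

public class ProtocolStatusSketch {
    // Hypothetical stand-in for ProtocolStatus: only the fields named in the proposal.
    static final class Status {
        final int code;           // protocol-level response code
        final long lastModified;  // epoch millis; 0 is assumed to mean "not reported"
        final String message;

        Status(int code, long lastModified, String message) {
            this.code = code;
            this.lastModified = lastModified;
            this.message = message;
        }
    }

    // Hypothetical stand-in for CrawlDatum: a modified time plus a metadata map.
    static final class Datum {
        long modifiedTime;
        final Map<String, String> metaData = new HashMap<String, String>();
    }

    // Invented key name, used for illustration only.
    static final String STATUS_KEY = "protocolStatus";

    // Record the status in the datum's metadata and, when the status carries
    // a valid lastModified time, copy it to the datum's modified time.
    static void attach(Datum datum, Status status) {
        datum.metaData.put(STATUS_KEY, status.code + " " + status.message);
        if (status.lastModified > 0) {
            datum.modifiedTime = status.lastModified;
        }
    }

    public static void main(String[] args) {
        Datum datum = new Datum();
        attach(datum, new Status(200, 1153311073000L, "OK"));
        System.out.println(datum.metaData + " modified=" + datum.modifiedTime);
    }
}
```

A status without a lastModified value leaves the datum's modified time untouched, so an earlier value is not overwritten with zero.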

Additionally, Fetcher doesn't store redirected pages. Content of such pages is silently discarded. When Fetcher translates from protocol-level status to crawldb-level status it should probably store such pages with the following translation of status codes:

* ProtocolStatus.TEMP_MOVED -> CrawlDatum.STATUS_DB_RETRY. This code indicates a transient change, so we probably shouldn't mark the initial URL as bad.

* ProtocolStatus.MOVED -> CrawlDatum.STATUS_DB_GONE. This code indicates a permanent change, so the initial URL is no longer valid, i.e. it will always result in redirects.
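The two translations above could be sketched roughly as follows. The enums are simplified stand-ins, not the actual ProtocolStatus and CrawlDatum constants, and the SUCCESS to DB_FETCHED case is only an assumption added so that the method is total.

```java
public class RedirectStatusSketch {
    // Simplified stand-ins for the protocol-level and crawldb-level status codes.
    enum ProtocolCode { SUCCESS, MOVED, TEMP_MOVED }
    enum DbStatus { DB_FETCHED, DB_GONE, DB_RETRY }

    // Translate a protocol-level status into the crawldb-level status
    // that would be stored for the URL we were redirected from.
    static DbStatus translate(ProtocolCode code) {
        switch (code) {
            case TEMP_MOVED:
                // Transient change: keep the initial URL and retry it later.
                return DbStatus.DB_RETRY;
            case MOVED:
                // Permanent change: the initial URL will always redirect.
                return DbStatus.DB_GONE;
            default:
                // Assumption for this sketch: anything else counts as fetched.
                return DbStatus.DB_FETCHED;
        }
    }

    public static void main(String[] args) {
        for (ProtocolCode code : ProtocolCode.values()) {
            System.out.println(code + " -> " + translate(code));
        }
    }
}
```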

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-322?page=comments#action_12422383 ] 
            
Andrzej Bialecki  commented on NUTCH-322:
-----------------------------------------

It's true that redirected pages are fetched, but it's also true that the intermediate pages (the ones that we were redirected from) are discarded. Please see the logic in Fetcher.FetcherThread.run() - there is no call to output() in that case, we just proceed to fetch the page we were redirected to.

Re: ProtocolStatus: if we decide to store the intermediate redirected pages, then ProtocolStatus will be stored under each intermediate URL, so there is no need to add the original URL explicitly to ProtocolStatus. Also, in case of redirects, the URL we were redirected to is already stored in ProtocolStatus (or ParseStatus).


        

[jira] Commented: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-322?page=comments#action_12428858 ] 
            
Stefan Groschupf commented on NUTCH-322:
----------------------------------------

I think this is a serious problem. Page A does a server-side redirect to Page B, and Page A is never written to the output. As a result, Page A's state and next fetch time never change, which means that Page A is fetched again, again, again ... ∞

I suggest that we write out Page A with a status change to STATUS_DB_GONE.



       

[jira] Updated: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

Posted by "Sami Siren (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-322?page=all ]

Sami Siren updated NUTCH-322:
-----------------------------

    Fix Version/s: 0.9-dev
                       (was: 0.8-dev)


        

[jira] Commented: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-322?page=comments#action_12422508 ] 
            
Andrzej Bialecki  commented on NUTCH-322:
-----------------------------------------

I hope I don't come across as arguing ... just trying to explain the rationale for this. Redirected pages often have content - take a look at e.g. http://dmoz.org/Arts (notice the missing trailing slash). I agree that most of the time this content is trivial, but we always read this content anyway. In some cases, it's not the content but the metadata (HTTP headers) that is important - start a protocol analyzer and look at what happens when you try to visit http://www.svd.se/annonsera (again, no slash at the end) - the second redirect will also set a cookie, which may be important for further requests in this session. And Nutch stores metadata only when it stores the content ...

There is also a case of content-level redirection, caused by <meta http-equiv="refresh" ...>, where you most likely get a full page of content, and then after a while you get redirected to another page. This may be immediately, but it also may be after 120 seconds - so, the intermediate content does matter in this case.
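To illustrate the delay that a content-level redirect carries, here is a small sketch that pulls the delay and target out of the content attribute of a meta refresh tag. It is not how Nutch parses these tags; the class, the Refresh type and the regular expression are made up for this example, and the pattern only covers the common "N; url=..." form.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaRefreshSketch {
    // Result of parsing a refresh directive (hypothetical type for this sketch).
    static final class Refresh {
        final int delaySeconds;
        final String target; // null when the directive has only a delay

        Refresh(int delaySeconds, String target) {
            this.delaySeconds = delaySeconds;
            this.target = target;
        }
    }

    // Matches e.g. "120; url=http://example.com/next" or "0;URL='http://example.com/'".
    private static final Pattern CONTENT = Pattern.compile(
        "^\\s*(\\d{1,9})\\s*(?:[;,]\\s*(?:url\\s*=\\s*)?['\"]?([^'\"\\s]+)['\"]?)?\\s*$",
        Pattern.CASE_INSENSITIVE);

    // Returns the parsed directive, or null when the value is not recognized.
    static Refresh parse(String content) {
        Matcher m = CONTENT.matcher(content);
        if (!m.matches()) {
            return null;
        }
        return new Refresh(Integer.parseInt(m.group(1)), m.group(2));
    }

    public static void main(String[] args) {
        Refresh r = parse("120; url=http://example.com/next");
        System.out.println(r.delaySeconds + " seconds, then " + r.target);
    }
}
```

With the delay available as a number, the fetcher could tell an immediate refresh apart from one that leaves the intermediate page on screen for two minutes.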


        

[jira] Commented: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

Posted by "Enrico Triolo (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-322?page=comments#action_12422379 ] 
            
Enrico Triolo commented on NUTCH-322:
-------------------------------------

I'm not at all sure that the fetcher doesn't store redirected pages. In my experience, redirected pages are fetched, and the crawldb is updated too.
Anyway, I agree with you that the ProtocolStatus should be stored inside CrawlDatum, and I furthermore propose that it should contain the original url in case of redirection.


        

[jira] Commented: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-322?page=comments#action_12423187 ] 
            
Andrzej Bialecki  commented on NUTCH-322:
-----------------------------------------

Good questions ... ;)

ad 1: Google shows only the final page, and you can access it through both the original (starting) url and the final redirected url. You can't view the intermediate pages.

To be Google-compatible we should index only the final page, but put it under both URLs. This is relatively easy to implement in Fetcher and index-basic, by appropriately marking the starting and intermediate pages, skipping any non-final pages during indexing, and then adding the original url to the final url when indexing the final page.

Also, I think that if redirect refresh time is large (e.g. larger than 20 seconds) we should consider the pages to be separate, and treat them separately.

ad 2: Google shows only inlinks going to the final url. However, the same inlinks can be obtained by using either the starting or the final url. OTOH MSN has separate inlinks in each case. I'm not sure yet how we should implement this...
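The indexing rule from ad 1 could look roughly like the sketch below: walk the redirect chain, skip non-final pages, and index the final page with the skipped URLs attached, except that a page whose refresh delay exceeds 20 seconds is indexed on its own. Hop, IndexEntry and plan() are hypothetical names, not part of Fetcher or index-basic, and attaching every skipped URL (rather than only the starting one) is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RedirectIndexSketch {
    // Pages whose refresh delay exceeds this are treated as separate pages.
    static final int SEPARATE_PAGE_THRESHOLD_SECONDS = 20;

    // One page in a redirect chain (hypothetical type).
    static final class Hop {
        final String url;
        final int delayToNextSeconds; // 0 for protocol-level redirects; ignored on the last hop

        Hop(String url, int delayToNextSeconds) {
            this.url = url;
            this.delayToNextSeconds = delayToNextSeconds;
        }
    }

    // A page to index, plus the other URLs it should be reachable under.
    static final class IndexEntry {
        final String url;
        final List<String> aliases;

        IndexEntry(String url, List<String> aliases) {
            this.url = url;
            this.aliases = aliases;
        }
    }

    // Decide which pages of the chain get indexed and under which URLs.
    static List<IndexEntry> plan(List<Hop> chain) {
        List<IndexEntry> entries = new ArrayList<IndexEntry>();
        List<String> pending = new ArrayList<String>(); // skipped URLs awaiting a page
        for (int i = 0; i < chain.size(); i++) {
            Hop hop = chain.get(i);
            boolean isFinal = (i == chain.size() - 1);
            if (isFinal || hop.delayToNextSeconds > SEPARATE_PAGE_THRESHOLD_SECONDS) {
                entries.add(new IndexEntry(hop.url, new ArrayList<String>(pending)));
                pending.clear();
            } else {
                pending.add(hop.url);
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        List<IndexEntry> entries = plan(Arrays.asList(
            new Hop("http://example.com/start", 0),
            new Hop("http://example.com/final", 0)));
        for (IndexEntry e : entries) {
            System.out.println(e.url + " also reachable as " + e.aliases);
        }
    }
}
```

How inlinks should follow the same chain (ad 2) is left open here, as it is in the discussion.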


        

Re: [jira] Resolved: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

Posted by Andrzej Bialecki <ab...@getopt.org>.
Stefan Groschupf (JIRA) wrote:
>      [ http://issues.apache.org/jira/browse/NUTCH-322?page=all ]
>
> Stefan Groschupf resolved NUTCH-322.
> ------------------------------------
>
>     Resolution: Duplicate
>
> duplicate of NUTCH-353
>   

??? If anything, NUTCH-353 is a duplicate of this issue, as it was 
created just now, and it should be closed, and the patch that is 
attached there should be moved here. Most of the discussion took place 
already in NUTCH-322, so this issue should stay open as the primary one 
until it is resolved. The presence of a patch doesn't mean the issue is 
resolved, either.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] Resolved: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-322?page=all ]

Stefan Groschupf resolved NUTCH-322.
------------------------------------

    Resolution: Duplicate

duplicate of NUTCH-353


        

[jira] Reopened: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-322?page=all ]

Andrzej Bialecki  reopened NUTCH-322:
-------------------------------------

      Assignee: Andrzej Bialecki 
             
Re-opening - this issue is not resolved yet.


        

[jira] Commented: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

Posted by "Enrico Triolo (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-322?page=comments#action_12422409 ] 
            
Enrico Triolo commented on NUTCH-322:
-------------------------------------

I'm probably missing something, but redirected pages don't have content, they only return a 30x status in the http header... Why would you need to fetch those pages?
In my opinion it would be better if we stored only the urls of the intermediate redirected pages in the ProtocolStatus of the 'final' page.

It's only my two cents ;-)


        

[jira] Commented: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

Posted by "Enrico Triolo (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-322?page=comments#action_12422996 ] 
            
Enrico Triolo commented on NUTCH-322:
-------------------------------------

Ok, I can see your point, nevertheless I think we should consider some potential problems that could arise from such modifications:

  1. When a redirect occurs, both the redirecting and the redirected pages should be indexed, independently of crawling depth, but I think this is what you meant from the beginning...
  2. How should linkdb be updated? Or better, should linkdb be updated at all? I mean, if page A has a link to page B, and page B redirects to C, should we set an incoming link to C from A?
