Posted to dev@nutch.apache.org by "Markus Jelsma (Created) (JIRA)" <ji...@apache.org> on 2012/01/12 19:41:40 UTC

[jira] [Created] (NUTCH-1247) CrawlDatum.retries should be int

CrawlDatum.retries should be int
--------------------------------

                 Key: NUTCH-1247
                 URL: https://issues.apache.org/jira/browse/NUTCH-1247
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 1.4
            Reporter: Markus Jelsma
             Fix For: 1.5


CrawlDatum.retries is a byte and overflows to negative values once the retry count exceeds 127.

12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -127: 1
12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -128: 1
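For illustration, a minimal sketch of the wrap-around behind the log lines above, assuming the counter is held in a signed Java byte as the description says. The class and method names are hypothetical and are not Nutch code.

```java
// Minimal demonstration of signed-byte wrap-around; all names are hypothetical.
final class RetryByteOverflowDemo {

    /** Increments a retry counter stored as a byte, as a byte-typed field would. */
    static byte incrementAsByte(byte retries) {
        return (byte) (retries + 1); // the narrowing cast silently discards the high bits
    }

    /** Recovers the unsigned value (0..255) from a wrapped byte. */
    static int unsigned(byte retries) {
        return retries & 0xFF;
    }

    public static void main(String[] args) {
        byte retries = Byte.MAX_VALUE;        // 127
        retries = incrementAsByte(retries);   // wraps to -128
        System.out.println("after 128 retries: " + retries);
        retries = incrementAsByte(retries);   // -127
        System.out.println("after 129 retries: " + retries);
        System.out.println("unsigned view: " + unsigned(retries));
    }
}
```

Under this assumption, "retry -128" and "retry -127" correspond to 128 and 129 failed attempts respectively.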


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (NUTCH-1247) CrawlDatum.retries should be int

Posted by "Andrzej Bialecki (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185908#comment-13185908 ] 

Andrzej Bialecki  commented on NUTCH-1247:
------------------------------------------

Originally the reason for a byte was compactness, but we can get the same effect using vint.
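To illustrate the compactness argument: a variable-length integer still costs a single byte for small counts and only grows when the value does. Hadoop ships WritableUtils.writeVInt/readVInt for this; the sketch below deliberately does not use them and instead shows a generic base-128 encoding, which is an assumption for illustration and not Hadoop's on-disk format. All names are hypothetical.

```java
import java.io.ByteArrayOutputStream;

// Generic base-128 variable-length integer sketch; NOT Hadoop's vint format.
final class VarIntSketch {

    /** Encodes 7 bits per byte, low bits first; the high bit means "more bytes follow". */
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    /** Decodes a value produced by encode(). */
    static int decode(byte[] bytes) {
        int result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // last byte of this value
            }
            shift += 7;
        }
        return result;
    }

    public static void main(String[] args) {
        for (int retries : new int[] {0, 3, 127, 128, 300}) {
            byte[] enc = encode(retries);
            System.out.println(retries + " -> " + enc.length + " byte(s), decoded " + decode(enc));
        }
    }
}
```

In this encoding, values up to 127 take one byte, so typical retry counts would be no larger on disk than the current byte field.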

Markus, something seems off in your setup if you get such high values of retries ... usually CrawlDbReducer will set STATUS_DB_GONE if the number of retries reaches db.fetch.retry.max, so the page will not be tried again until FetchSchedule.forceRefetch resets its status (and the number of retries).
                

[jira] [Commented] (NUTCH-1247) CrawlDatum.retries should be int

Posted by "Sebastian Nagel (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187761#comment-13187761 ] 

Sebastian Nagel commented on NUTCH-1247:
----------------------------------------

A FETCH_RETRY is already set to DB_GONE in CrawlDbReducer if numRetries is exceeded; see the CrawlDbReducer code quoted in this thread.
The problem is that the retry counter is definitively reset only for FETCH_GONE (and old FETCH_NOT_MODIFIED).
So if a DB_GONE URL is fetched again after db.fetch.interval.max and throws an exception again, the retry counter is incremented further. For long-running continuous crawls, or with shorter values of maxInterval, the counter may overflow.
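The cycle described above can be sketched as a small simulation. This is a simplified model under stated assumptions, not the Nutch implementation: the class, enum and method names are hypothetical, the value 3 for db.fetch.retry.max is assumed, and every fetch of the page is assumed to fail with an exception.

```java
// Simplified model of a page whose fetches always end in FETCH_RETRY; names are hypothetical.
final class RetryCycleSketch {
    enum Status { DB_UNFETCHED, DB_GONE }

    static final int RETRY_MAX = 3; // assumed value for db.fetch.retry.max

    byte retries = 0;               // byte-sized counter, as in the issue description
    Status status = Status.DB_UNFETCHED;

    /** Models the FETCH_RETRY branch: increment, then compare with the maximum. */
    void onFetchRetry() {
        retries = (byte) (retries + 1); // never reset in this branch
        status = retries < RETRY_MAX ? Status.DB_UNFETCHED : Status.DB_GONE;
    }

    /** Counts failed fetches until the byte counter first turns negative. */
    static int failedFetchesUntilOverflow() {
        RetryCycleSketch page = new RetryCycleSketch();
        int fetches = 0;
        while (page.retries >= 0) {
            page.onFetchRetry();
            fetches++;
        }
        return fetches;
    }

    public static void main(String[] args) {
        System.out.println("failed fetches until overflow: " + failedFetchesUntilOverflow());
    }
}
```

In this model the counter turns negative on the 128th failed fetch. A side effect worth noting is that a negative counter compares as less than the maximum, so the modeled page would flip back from DB_GONE to DB_UNFETCHED after the overflow.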
                

[jira] [Commented] (NUTCH-1247) CrawlDatum.retries should be int

Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185826#comment-13185826 ] 

Markus Jelsma commented on NUTCH-1247:
--------------------------------------

Hints and thoughts are much appreciated; messing with CrawlDatum is pretty invasive.
                

[jira] [Commented] (NUTCH-1247) CrawlDatum.retries should be int

Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186217#comment-13186217 ] 

Markus Jelsma commented on NUTCH-1247:
--------------------------------------

Alright, then I think this must be related to NUTCH-1245. In that case the record is set to DB_GONE but generated anyway, so this counter would continue to increase forever.
                

[jira] [Commented] (NUTCH-1247) CrawlDatum.retries should be int

Posted by "Andrzej Bialecki (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186212#comment-13186212 ] 

Andrzej Bialecki  commented on NUTCH-1247:
------------------------------------------

Indeed, line 264 increases the retry counter, but once it reaches retryMax the page status is set to DB_GONE, so the page won't be generated again until it expires, and its retry counter won't increase. Once it expires, Generator should invoke FetchSchedule.forceRefetch on this page, and the default implementation resets the retry counter. So either there's some bug in this cycle, or your retryMax is greater than 127.
                

[jira] [Commented] (NUTCH-1247) CrawlDatum.retries should be int

Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186177#comment-13186177 ] 

Markus Jelsma commented on NUTCH-1247:
--------------------------------------

Lewis, we're seeing many URLs with a high retry value. When the value exceeds 127, the byte wraps around and is reported as negative. This is not a problem in itself, but in my setup the value seems to keep increasing.

Andrzej, there may indeed be something wrong. Might this be related to NUTCH-1245 then? There seems to be something wrong with the following CrawlDbReducer code:

{code}
260   case CrawlDatum.STATUS_FETCH_RETRY: // temporary failure
261     if (oldSet) {
262       result.setSignature(old.getSignature()); // use old signature
263     }
264     result = schedule.setPageRetrySchedule((Text)key, result, prevFetchTime,
265         prevModifiedTime, fetch.getFetchTime());
266     if (result.getRetriesSinceFetch() < retryMax) {
267       result.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
268     } else {
269       result.setStatus(CrawlDatum.STATUS_DB_GONE);
270     }
271     break;
{code}

In setPageRetrySchedule() the retry count is always incremented. This causes records with exceptions such as UnknownHostException to be refetched for each segment. This makes sense because the first segment in our cycle has many more exceptions than average.

Do you follow?
                

[jira] [Commented] (NUTCH-1247) CrawlDatum.retries should be int

Posted by "Lewis John McGibbney (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185844#comment-13185844 ] 

Lewis John McGibbney commented on NUTCH-1247:
---------------------------------------------

Where in CrawlDatum does the CrawlDbReader map method, on line 159, get the value of getRetriesSinceFetch() from?
{code}
output.collect(new Text("retry " + value.getRetriesSinceFetch()), COUNT_1);
{code}

Also, excuse my naivety, but can you be more specific about why the byte value for CrawlDatum.retries goes bad?
                

[jira] [Commented] (NUTCH-1247) CrawlDatum.retries should be int

Posted by "Sebastian Nagel (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187716#comment-13187716 ] 

Sebastian Nagel commented on NUTCH-1247:
----------------------------------------

Interestingly, I also found a couple of URLs with unreasonably high retry counters in the data where NUTCH-1245 was first observed (it was Nutch 1.2).
* All these URLs failed with some exception (invalid URI or HTTP 403), not with 404 (not found) or robots denied.
  Markus, do the URLs which overflow the retry counter in your CrawlDb also belong to this class?
* In the segments, the status of these URLs is fetch_retry (in crawl_fetch):
  In Fetcher.java, the case ProtocolStatus.EXCEPTION inside the switch statement in FetcherThread.run() falls through to the default case, where the result is collected with STATUS_FETCH_RETRY.

CrawlDbReducer calls FetchSchedule.forceRefetch() only for the cases STATUS_FETCH_NOT_MODIFIED or STATUS_FETCH_GONE (here via setPageGoneSchedule). The branch STATUS_FETCH_RETRY does not reset the retry counter. Generator never calls forceRefetch() nor does it reset the retry counter.

If this analysis is correct there are two possible patches:
* A (CrawlDbReducer): call setPageGoneSchedule for the case STATUS_FETCH_RETRY
* B (Generator): reset the retry counter to zero when a db_gone URL is generated again
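As a rough illustration of option B, the sketch below resets the counter when a db_gone entry is selected for fetching again. It is not the attached patch and not Nutch code: the Datum class, the Status enum and prepareForGeneration are hypothetical stand-ins for CrawlDatum and the Generator logic.

```java
// Hypothetical stand-in types; only the reset-on-regenerate idea is taken from option B.
final class GeneratorResetSketch {
    enum Status { DB_UNFETCHED, DB_FETCHED, DB_GONE }

    static final class Datum {
        final Status status;
        final int retries;

        Datum(Status status, int retries) {
            this.status = status;
            this.retries = retries;
        }
    }

    /** When a db_gone entry is generated again, start counting retries from zero. */
    static Datum prepareForGeneration(Datum datum) {
        if (datum.status == Status.DB_GONE) {
            return new Datum(datum.status, 0);
        }
        return datum; // other entries keep their counter
    }

    public static void main(String[] args) {
        Datum gone = prepareForGeneration(new Datum(Status.DB_GONE, 120));
        Datum pending = prepareForGeneration(new Datum(Status.DB_UNFETCHED, 2));
        System.out.println("db_gone retries after reset: " + gone.retries);
        System.out.println("db_unfetched retries kept: " + pending.retries);
    }
}
```

With a reset like this, each expiry of a db_gone page would start a fresh run of at most retryMax attempts, so the counter would stay bounded regardless of its storage type.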


                

[jira] [Commented] (NUTCH-1247) CrawlDatum.retries should be int

Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185141#comment-13185141 ] 

Markus Jelsma commented on NUTCH-1247:
--------------------------------------

I assume we have to update the CrawlDatum version, gracefully handle both versions, and write an int in the end for this change to work properly, right?
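A minimal sketch of what such version handling could look like, assuming a leading version byte that tells the reader how the counter was stored. The version numbers, the record layout and all names here are hypothetical and do not reflect the actual CrawlDatum format. Reading old counters as unsigned is a design choice made in this sketch so that already-wrapped values are recovered.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical record layout: [version byte][retries]; not the real CrawlDatum format.
final class VersionedRetriesSketch {
    static final byte OLD_VERSION = 1; // hypothetical: retries stored as one byte
    static final byte NEW_VERSION = 2; // hypothetical: retries stored as an int

    static byte[] writeOld(byte retries) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeByte(OLD_VERSION);
            out.writeByte(retries);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static byte[] writeNew(int retries) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeByte(NEW_VERSION);
            out.writeInt(retries);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads either layout and always returns the counter as an int. */
    static int readRetries(byte[] record) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(record));
            byte version = in.readByte();
            if (version == OLD_VERSION) {
                return in.readUnsignedByte(); // recovers wrapped values such as -127 as 129
            } else if (version == NEW_VERSION) {
                return in.readInt();
            }
            throw new IllegalArgumentException("unknown version " + version);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("old record: " + readRetries(writeOld((byte) -127)));
        System.out.println("new record: " + readRetries(writeNew(1000)));
    }
}
```

The reader accepts both layouts while the writer only emits the new one, so existing data would migrate as records are rewritten.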
                

[jira] [Updated] (NUTCH-1247) CrawlDatum.retries should be int

Posted by "Markus Jelsma (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1247:
---------------------------------

    Fix Version/s:     (was: 1.5)
                   1.6

20120304-push-1.6
                

[jira] [Commented] (NUTCH-1247) CrawlDatum.retries should be int

Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187730#comment-13187730 ] 

Markus Jelsma commented on NUTCH-1247:
--------------------------------------

Sebastian, most of these records throw an UnknownHostException and get STATUS_FETCH_RETRY indeed. Should AbstractFetchSchedule be modified to set GONE for RETRY if numRetries has been exceeded?
                

[jira] [Updated] (NUTCH-1247) CrawlDatum.retries should be int

Posted by "Sebastian Nagel (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1247:
-----------------------------------

    Attachment: NUTCH-1247.patch_B
                NUTCH-1247.patch_A
    