Posted to dev@nutch.apache.org by "Michiel (JIRA)" <ji...@apache.org> on 2015/02/02 16:46:35 UTC

[jira] [Updated] (NUTCH-1930) Fetcher erases Markers for certain URLs / documents

     [ https://issues.apache.org/jira/browse/NUTCH-1930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michiel updated NUTCH-1930:
---------------------------
    Description: 
During an active crawling project, I noticed what appears to be a bug in the fetcher: the markers for certain pages (PDFs especially) are either not saved, or erased altogether. The pages are thus not parsed, nor updated in the DB. They keep appearing in the generate lists and fetch lists. Note that this is a separate issue from NUTCH-1922. That one involves correctly parsed pages. This bug prevents certain pages from getting correct markers set.

Although I'm still new to Nutch and no Java expert, I'm currently trying to debug this. Because the error seems rather easy to replicate, it seemed sensible to share my findings so far. If I find out more myself, I'll update this issue.

For this test, I injected two test URLs which never seemed to get parsed, even though they are valid documents which are not excluded by any filters. I use an http.content.limit of 64 MB, and Tika is used for parsing documents. Note that these are just two examples; I can provide more if needed.

- http://www.aanvalopschooluitval.nl/userfiles/file/projectenbank/Flex%20Lectoraat.pdf
- http://www.prettywoman-utrecht.nl/wp-content/uploads/PrettyWoman-methodiek_web.pdf

Steps:
1) Whenever a batch gets generated, the GENERATE_MARK is set. So far so good.

2) During fetch, map() inside FetcherJob checks if this GENERATE_MARK is set. If so, it continues. Still, so far so good.

3) After fetch, output() inside FetcherReducer sets the FETCH_MARK. I've logged the marker, and it gets set with the correct batchId. It gets a value.

4) However, when another nutch command is run, all the markers from these example URLs appear to have been erased. Not only is FETCH_MARK suddenly not set, GENERATE_MARK is also erased. Thus, the parser will think the URL hasn't been fetched yet. The fetchStatus, however, is nicely set to "2 (status_fetched)". It's just the markers that are not correctly set.

My first assumption was that FETCH_MARK was not saved. However, as noted in step 3), it gets the correct value. Also, GENERATE_MARK is erased after the process is complete, so something else goes wrong. Somewhere before the end of FetcherJob, the markers for certain pages are erased. Note that all other values, like content, baseUrl, fetchtimes and fetchStatus, are saved correctly for these URLs.
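
To make the four steps above concrete, here is a minimal sketch of the marker lifecycle as described in this report. It is a toy model and not Nutch code: the Page class, the string marker keys, the method names and the batch ID "batch-1" are all hypothetical stand-ins, and the erasure is simulated by hand because its real cause is still unknown.

```java
import java.util.HashMap;
import java.util.Map;

public class MarkerLifecycleSketch {

    // Hypothetical marker keys, for illustration only (not Nutch's storage layout).
    static final String GENERATE_MARK = "GENERATE_MARK";
    static final String FETCH_MARK = "FETCH_MARK";
    // The report shows fetchStatus "2 (status_fetched)" for the affected pages.
    static final int STATUS_FETCHED = 2;

    /** Toy stand-in for a stored page row. */
    static class Page {
        final Map<String, String> markers = new HashMap<>();
        int fetchStatus = 0;
    }

    /** Step 1: generating a batch sets the generate marker. */
    static void generate(Page page, String batchId) {
        page.markers.put(GENERATE_MARK, batchId);
    }

    /** Steps 2 and 3: fetch only pages that carry the generate marker, then set the fetch marker. */
    static boolean fetch(Page page, String batchId) {
        if (!page.markers.containsKey(GENERATE_MARK)) {
            return false; // not generated, so the page is skipped
        }
        page.fetchStatus = STATUS_FETCHED;
        page.markers.put(FETCH_MARK, batchId);
        return true;
    }

    /** A later parse step only picks up pages that carry the fetch marker. */
    static boolean readyForParse(Page page) {
        return page.markers.containsKey(FETCH_MARK);
    }

    /** Step 4 symptom: the status says "fetched" but every marker is gone. */
    static boolean looksAffected(Page page) {
        return page.fetchStatus == STATUS_FETCHED && page.markers.isEmpty();
    }

    public static void main(String[] args) {
        Page healthy = new Page();
        generate(healthy, "batch-1");
        fetch(healthy, "batch-1");

        Page affected = new Page();
        generate(affected, "batch-1");
        fetch(affected, "batch-1");
        affected.markers.clear(); // simulate the unexplained erasure

        System.out.println("healthy ready for parse:  " + readyForParse(healthy));
        System.out.println("affected ready for parse: " + readyForParse(affected));
        System.out.println("affected matches symptom: " + looksAffected(affected));
    }
}
```

In this model, a page whose markers are wiped after the fetch keeps its fetched status but is never handed to the parser, which matches the behaviour reported for the two example URLs.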

Finally, for testing purposes, here is an example URL that DOES work: http://www.aanvalopschooluitval.nl/userfiles/file/2011/Plusvoorzieningenkrant.pdf

  was:
During an active crawling project, I noticed what appears to be a bug in the fetcher: the markers for certain pages (PDFs especially) are either not saved, or erased altogether. The pages are thus not parsed, nor updated in the DB. They keep appearing in the generate lists and fetch lists. Note that this is a separate issue from NUTCH-1922. That one involves correctly parsed pages. This bug prevents certain pages from getting correct markers set.

Although I'm still new to Nutch and no Java expert, I'm currently trying to debug this. Because the error seems rather easy to replicate, it seemed sensible to share my findings so far. If I find out more myself, I'll update this issue.

For this test, I injected two test URLs which never seemed to get parsed, even though they are valid documents which are not excluded by any filters. I use an http.content.limit of 64 MB, and Tika is used for parsing documents. Note that these are just two examples; I can provide more if needed.

- http://www.aanvalopschooluitval.nl/userfiles/file/projectenbank/Flex%20Lectoraat.pdf
- http://www.prettywoman-utrecht.nl/wp-content/uploads/PrettyWoman-methodiek_web.pdf

Notes:
1) Whenever a batch gets generated, the GENERATE_MARK is set. So far so good.

2) During fetch, map() inside FetcherJob checks if this GENERATE_MARK is set. If so, it continues. Still, so far so good.

3) After fetch, output() inside FetcherReducer sets the FETCH_MARK. I've logged the marker, and it gets set with the correct batchId. It gets a value.

4) However, when another nutch command is run, all the markers from these example URLs appear to have been erased. Not only is FETCH_MARK suddenly not set, GENERATE_MARK is also erased. Thus, the parser will think the URL hasn't been fetched yet. The fetchStatus, however, is nicely set to "2 (status_fetched)". It's just the markers that are not correctly set.

My first assumption was that FETCH_MARK was not saved. However, as noted in step 3), it gets the correct value. Also, GENERATE_MARK is erased after the process is complete, so something else goes wrong. Somewhere before the end of FetcherJob, the markers for certain pages are erased. Note that all other values, like content, baseUrl, fetchtimes and fetchStatus, are saved correctly for these URLs.

Finally, for testing purposes, here is an example URL that DOES work: http://www.aanvalopschooluitval.nl/userfiles/file/2011/Plusvoorzieningenkrant.pdf


> Fetcher erases Markers for certain URLs / documents
> ---------------------------------------------------
>
>                 Key: NUTCH-1930
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1930
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 2.3
>            Reporter: Michiel
>
> During an active crawling project, I noticed what appears to be a bug in the fetcher: the markers for certain pages (PDFs especially) are either not saved, or erased altogether. The pages are thus not parsed, nor updated in the DB. They keep appearing in the generate lists and fetch lists. Note that this is a separate issue from NUTCH-1922. That one involves correctly parsed pages. This bug prevents certain pages from getting correct markers set.
> Although I'm still new to Nutch and no Java expert, I'm currently trying to debug this. Because the error seems rather easy to replicate, it seemed sensible to share my findings so far. If I find out more myself, I'll update this issue.
> For this test, I injected two test URLs which never seemed to get parsed, even though they are valid documents which are not excluded by any filters. I use an http.content.limit of 64 MB, and Tika is used for parsing documents. Note that these are just two examples; I can provide more if needed.
> - http://www.aanvalopschooluitval.nl/userfiles/file/projectenbank/Flex%20Lectoraat.pdf
> - http://www.prettywoman-utrecht.nl/wp-content/uploads/PrettyWoman-methodiek_web.pdf
> Steps:
> 1) Whenever a batch gets generated, the GENERATE_MARK is set. So far so good.
> 2) During fetch, map() inside FetcherJob checks if this GENERATE_MARK is set. If so, it continues. Still, so far so good.
> 3) After fetch, output() inside FetcherReducer sets the FETCH_MARK. I've logged the marker, and it gets set with the correct batchId. It gets a value.
> 4) However, when another nutch command is run, all the markers from these example URLs appear to have been erased. Not only is FETCH_MARK suddenly not set, GENERATE_MARK is also erased. Thus, the parser will think the URL hasn't been fetched yet. The fetchStatus, however, is nicely set to "2 (status_fetched)". It's just the markers that are not correctly set.
> My first assumption was that FETCH_MARK was not saved. However, as noted in step 3), it gets the correct value. Also, GENERATE_MARK is erased after the process is complete, so something else goes wrong. Somewhere before the end of FetcherJob, the markers for certain pages are erased. Note that all other values, like content, baseUrl, fetchtimes and fetchStatus, are saved correctly for these URLs.
> Finally, for testing purposes, here is an example URL that DOES work: http://www.aanvalopschooluitval.nl/userfiles/file/2011/Plusvoorzieningenkrant.pdf


