Posted to dev@nutch.apache.org by "Michiel (JIRA)" <ji...@apache.org> on 2015/02/03 17:28:36 UTC
[jira] [Commented] (NUTCH-1930) Fetcher erases Markers for certain URLs / documents

    [ https://issues.apache.org/jira/browse/NUTCH-1930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303530#comment-14303530 ] 

Michiel commented on NUTCH-1930:
--------------------------------

Update: it took some searching, but I found what triggers the issue. If the server does NOT return a Content-Length header, none of the header information is saved to the store. Changing http.content.limit has no effect unless it is set to a value above Integer.MAX_VALUE (2147483647). In summary: in order to retrieve pages that have no Content-Length set, http.content.limit needs to be >= 2147483648. If set to any other value, including -1, the fetcher will NOT save any header values to the store, and will erase the markers already set.
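
To make the threshold concrete, here is a minimal Java sketch of the arithmetic only. It does not reproduce any Nutch code: the class name ContentLimitThreshold and the helper exceedsInt are hypothetical, and nothing here shows how Nutch actually reads http.content.limit. It only shows that 2147483648 is the first value that no longer fits in a signed 32-bit int, which is the boundary observed above.

```java
public class ContentLimitThreshold {

    // Hypothetical helper: true when the limit cannot be held in a signed 32-bit int.
    static boolean exceedsInt(long limit) {
        return limit > Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        long sixtyFourMb = 64L * 1024 * 1024; // the 64 MB limit from the issue description
        long threshold = 2147483648L;         // Integer.MAX_VALUE + 1

        System.out.println(Integer.MAX_VALUE);       // 2147483647
        System.out.println(exceedsInt(sixtyFourMb)); // false
        System.out.println(exceedsInt(-1L));         // false
        System.out.println(exceedsInt(threshold));   // true

        // Narrowing the threshold to int wraps around to a negative number.
        System.out.println((int) threshold);         // -2147483648
    }
}
```

Whether an overflow or narrowing of this kind is what actually happens inside the fetcher is an assumption that still needs to be confirmed in the code.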

For the past few hours I've tried to find the source of the problem, but I can't seem to locate the fault. The only place I can find where Content-Length is compared to http.content.limit is inside protocol-http(client), and the problem does not seem to reside there. It seems clear that somewhere in the code, Content-Length is compared to the maximum 32-bit signed integer value, and if there is no Content-Length, the procedure does not save any header information and erases the markers.

Perhaps someone can further debug this problem based on these findings.
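
For anyone who wants to check which URLs are affected, the following is a small stand-alone sketch that reports whether a server sends a Content-Length header. It uses only the JDK (HttpURLConnection), not Nutch's protocol plugins, so the headers Nutch sees may differ. The class name ContentLengthCheck and the helper hasContentLength are hypothetical.

```java
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.List;
import java.util.Map;

public class ContentLengthCheck {

    // True when the response headers contain a Content-Length field.
    // Header names are compared case-insensitively; the status line has a null key.
    static boolean hasContentLength(Map<String, List<String>> headers) {
        for (String name : headers.keySet()) {
            if (name != null && name.equalsIgnoreCase("Content-Length")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        // Pass one or more URLs on the command line.
        for (String arg : args) {
            HttpURLConnection conn =
                (HttpURLConnection) URI.create(arg).toURL().openConnection();
            conn.setRequestMethod("GET");
            try {
                // getHeaderFields() connects and reads the response headers.
                boolean present = hasContentLength(conn.getHeaderFields());
                System.out.println(arg + " -> Content-Length present: " + present);
            } finally {
                conn.disconnect();
            }
        }
    }
}
```

Running it against the failing and the working example URLs from the issue description should show whether the missing header matches the affected documents.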

> Fetcher erases Markers for certain URLs / documents
> ---------------------------------------------------
>
>                 Key: NUTCH-1930
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1930
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 2.3
>            Reporter: Michiel
>
> During an active crawling project, I noticed what appears to be a bug in the fetcher: the markers for certain pages (PDFs especially) are either not saved, or erased altogether. The pages are thus not parsed, nor updated in the DB. They keep appearing in the generate lists and fetch lists. Note that this is a separate issue from NUTCH-1922. That one involves correctly parsed pages. This bug prevents certain pages from getting correct markers set.
> Although I'm still new to Nutch and no Java expert, I'm currently trying to debug this. Because the error seems to be rather easy to replicate, it seemed sensible to share my findings so far. If I find out more myself, I'll update this issue.
> For this test, I injected two test URLs which never seemed to get parsed, even though they are valid documents which are not excluded by any filters. I use an http.content.limit of 64 MB, and Tika is used for parsing documents. Note that these are just two examples; I can provide more if needed.
> - http://www.aanvalopschooluitval.nl/userfiles/file/projectenbank/Flex%20Lectoraat.pdf
> - http://www.prettywoman-utrecht.nl/wp-content/uploads/PrettyWoman-methodiek_web.pdf
> Steps:
> 1) Whenever a batch gets generated, the GENERATE_MARK is set. So far so good.
> 2) During fetch, map() inside FetcherJob checks if this GENERATE_MARK is set. If so, it continues. Still, so far so good.
> 3) After fetch, output() inside FetcherReducer sets the FETCH_MARK. I've logged the marker, and it gets set with the correct batchId. It gets a value.
> 4) However, when another nutch command is run, all the markers from these example URLs appear to have been erased. Not only is FETCH_MARK suddenly not set, GENERATE_MARK is also erased. Thus, the parser will think the URL hasn't been fetched yet. The fetchStatus, however, is nicely set to "2 (status_fetched)". It's just the markers that are not correctly set.
> My first assumption was that FETCH_MARK was not saved. However, as noted in step 3), it gets the correct value. Also, GENERATE_MARK is erased after the process is complete, so something else goes wrong. Somewhere before the end of FetcherJob, the markers for certain pages are erased. Note that all other values, like content, baseUrl, fetchtimes and fetchStatus, are saved correctly for these URLs.
> Finally, for testing purposes, here is an example URL that DOES work: http://www.aanvalopschooluitval.nl/userfiles/file/2011/Plusvoorzieningenkrant.pdf
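
The marker flow described in steps 1) to 4) can be pictured with a toy model. This is a sketch only: it does not use Nutch's classes, the class MarkerFlowSketch and its methods are hypothetical, and a plain map keyed by "GENERATE_MARK" and "FETCH_MARK" stands in for the stored markers.

```java
import java.util.HashMap;
import java.util.Map;

public class MarkerFlowSketch {

    // Step 1: generating a batch sets GENERATE_MARK to the batch id.
    static void generate(Map<String, String> markers, String batchId) {
        markers.put("GENERATE_MARK", batchId);
    }

    // Steps 2 and 3: fetch only proceeds when GENERATE_MARK is set,
    // then records FETCH_MARK with the same batch id.
    static boolean fetch(Map<String, String> markers) {
        String batchId = markers.get("GENERATE_MARK");
        if (batchId == null) {
            return false;
        }
        markers.put("FETCH_MARK", batchId);
        return true;
    }

    // The parser treats a page as fetched only when FETCH_MARK is present.
    static boolean readyToParse(Map<String, String> markers) {
        return markers.containsKey("FETCH_MARK");
    }

    public static void main(String[] args) {
        Map<String, String> markers = new HashMap<>();
        generate(markers, "batch-1");
        System.out.println(fetch(markers));        // true
        System.out.println(readyToParse(markers)); // true

        // Step 4: simulate the reported symptom, where all markers are gone afterwards.
        markers.clear();
        System.out.println(readyToParse(markers)); // false
    }
}
```

In this model, a page whose markers have been cleared is indistinguishable from one that was never generated or fetched, which matches the reported behaviour of such pages reappearing in the generate and fetch lists.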



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)