Posted to dev@nutch.apache.org by "Clement Mai (JIRA)" <ji...@apache.org> on 2015/04/16 00:13:59 UTC

[jira] [Commented] (NUTCH-1930) Fetcher erases Markers for certain URLs / documents

    [ https://issues.apache.org/jira/browse/NUTCH-1930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497115#comment-14497115 ] 

Clement Mai commented on NUTCH-1930:
------------------------------------

I'm new to Nutch and am currently trying to use it to crawl some PDFs and index them into Elasticsearch.  I see the same problem when fetching a PDF larger than 2MB: the markers are erased after the fetch job completes.  However, it works fine with PDFs smaller than 2MB.

Setting http.content.limit to a value >= 2147483648 throws an exception, because the value cannot exceed Integer.MAX_VALUE (2147483647).
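
The limit described above can be illustrated with a minimal plain-Java sketch. It assumes the property value is ultimately parsed as a Java int, which is consistent with the exception but is not taken from the Nutch source; the class and method names (ContentLimitCheck, fitsInInt) are hypothetical and exist only for this illustration.

```java
// Hypothetical helper, not part of Nutch: checks whether a configured
// value such as http.content.limit can be represented as a Java int.
class ContentLimitCheck {

    /** Returns true if the given property value parses as a Java int. */
    static boolean fitsInInt(String value) {
        try {
            Integer.parseInt(value.trim());
            return true;
        } catch (NumberFormatException e) {
            // Thrown for anything above Integer.MAX_VALUE (2147483647).
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("Integer.MAX_VALUE = " + Integer.MAX_VALUE);
        // The limit shown in the fetcher logs below (20 * 1024 * 1024).
        System.out.println("20971520   fits: " + fitsInInt("20971520"));
        // The largest value an int can hold.
        System.out.println("2147483647 fits: " + fitsInInt("2147483647"));
        // One above the maximum: parsing fails.
        System.out.println("2147483648 fits: " + fitsInInt("2147483648"));
    }
}
```

Under that assumption, any value up to 2147483647 is accepted and 2147483648 is the first value that fails to parse.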

I printed the markers in FetcherReducer output() and verified that they are correctly put into the WebPage.  It is not clear to me how a missing Content-Length header would cause the markers to be erased.

Thanks.

=============================
PDF size > 2MB, no markers in HBase
=============================
org.apache.nutch.fetcher.FetcherJob: fetching https://dev-web/fdnycfa/htmls/test8.pdf (queue crawl delay=100ms)
org.apache.nutch.protocol.httpclient.Http: http.content.limit = 20971520
org.apache.nutch.protocol.httpclient.Http: http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
org.apache.commons.httpclient.HttpMethodBase: Response content length is not known
org.apache.nutch.fetcher.FetcherJob: output content length: 8340405
org.apache.nutch.fetcher.FetcherJob: _gnmrk_ = 1429127685-28325
org.apache.nutch.fetcher.FetcherJob: _injmrk_ = y
org.apache.nutch.fetcher.FetcherJob: _ftcmrk_ = 1429127685-28325
org.apache.nutch.fetcher.FetcherJob: dist = 0
org.apache.nutch.fetcher.FetcherJob: -finishing thread FetcherThread10, activeThreads=5

hbase(main):069:0> get 'TestCrawl_webpage','dev-web:https/fdnycfa/htmls/test8.pdf',{COLUMN => 'mk'}
COLUMN                CELL
0 row(s) in 0.0070 seconds


=============================
PDF size < 2MB
=============================
org.apache.nutch.fetcher.FetcherJob: fetching https://dev-web/fdnycfa/htmls/test2_006.pdf (queue crawl delay=100ms)
org.apache.nutch.protocol.httpclient.Http: http.content.limit = 20971520
org.apache.nutch.protocol.httpclient.Http: http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
org.apache.commons.httpclient.HttpMethodBase: Response content length is not known
org.apache.nutch.fetcher.FetcherJob: output content length: 2006860
org.apache.nutch.fetcher.FetcherJob: _gnmrk_ = 1429127685-28325
org.apache.nutch.fetcher.FetcherJob: _injmrk_ = y
org.apache.nutch.fetcher.FetcherJob: _ftcmrk_ = 1429127685-28325
org.apache.nutch.fetcher.FetcherJob: dist = 0
org.apache.nutch.fetcher.FetcherJob: -finishing thread FetcherThread9, activeThreads=7

hbase(main):007:0> get 'TestCrawl_webpage','dev-web:https/fdnycfa/htmls/test2_006.pdf',{COLUMN => 'mk'}
COLUMN                CELL
 mk:_ftcmrk_          timestamp=1429127801312, value=1429127685-28325
 mk:_gnmrk_           timestamp=1429127801312, value=1429127685-28325
 mk:_injmrk_          timestamp=1429127801312, value=y
 mk:dist              timestamp=1429127801312, value=0
4 row(s) in 0.0480 seconds


> Fetcher erases Markers for certain URLs / documents
> ---------------------------------------------------
>
>                 Key: NUTCH-1930
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1930
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 2.3
>            Reporter: Michiel
>             Fix For: 2.4
>
>
> During an active crawling project, I noticed what appears to be a bug in the fetcher: the markers for certain pages (PDFs especially) are either not saved, or erased altogether. The pages are thus not parsed, nor updated in the DB. They keep appearing in the generate lists and fetch lists. Note that this is a separate issue from NUTCH-1922. That one involves correctly parsed pages. This bug prevents certain pages from getting correct markers set.
> Although I'm still new to Nutch and no Java expert, I'm currently trying to debug this. Because the error seems rather easy to replicate, it seemed sensible to share my findings so far. If I find out more myself, I'll update this issue.
> For this test, I injected two test URLs which never seemed to get parsed, even though they are valid documents which are not excluded by any filters. I use an http.content.limit of 64 MB, and Tika is used for parsing documents. Note that these are just two examples; I can provide more if needed.
> - http://www.aanvalopschooluitval.nl/userfiles/file/projectenbank/Flex%20Lectoraat.pdf
> - http://www.prettywoman-utrecht.nl/wp-content/uploads/PrettyWoman-methodiek_web.pdf
> Steps:
> 1) Whenever a batch gets generated, the GENERATE_MARK is set. So far so good.
> 2) During fetch, map() inside FetcherJob checks if this GENERATE_MARK is set. If so, it continues. Still, so far so good.
> 3) After fetch, output() inside FetcherReducer sets the FETCH_MARK. I've logged the marker, and it gets set with the correct batchId. It gets a value.
> 4) However, when another nutch command is run, all the markers from these example URLs appear to have been erased. Not only is FETCH_MARK suddenly not set, GENERATE_MARK is also erased. Thus, the parser will think the URL hasn't been fetched yet. The fetchStatus, however, is nicely set to "2 (status_fetched)". It's just the markers that are not correctly set.
> My first assumption was that FETCH_MARK was not saved. However, as noted in step 3), it gets the correct value. Also, GENERATE_MARK is erased after the process is complete, so something else goes wrong. Somewhere before the end of FetcherJob, the markers for certain pages are erased. Note that all other values, like content, baseUrl, fetchtimes and fetchStatus, are saved correctly for these URLs.
> Finally, for testing purposes, here is an example URL that DOES work: http://www.aanvalopschooluitval.nl/userfiles/file/2011/Plusvoorzieningenkrant.pdf


