
Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2015/05/16 00:17:00 UTC

[jira] [Reopened] (NUTCH-2011) Endpoint to support realtime JSON output from the fetcher

     [ https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel reopened NUTCH-2011:
------------------------------------

Sorry, but this needs some rework:
- after 35,000+ fetched pages with the default max. heap size of 1000M, the fetcher becomes slow and mainly throws parser timeouts and caught OOM exceptions. Only small HTML pages with few outlinks per page were crawled; the limit is reached sooner if there are many overlong outlinks or big PDF documents.
- why an in-memory "database" of page-related information (URL, title, outlinks + anchor texts)?
-- all information is available in CrawlDb, LinkDb, segments
-- MapReduce job counters provide instant progress information (e.g., number of fetched pages)
-- if required, a queue of limited total size should be used
- in any case, this feature should be optional and off by default if NutchServer is not used
- "reporting" to FetchNodeDb is off if fetcher.parse is false (the default)? Is this intended? Construction of FetchNodes is then useless work.
- no traces to System.out: "FetchNodeDb : putting node ..."
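
To illustrate the "queue of limited total size" point above, here is a minimal sketch, not the actual FetchNodeDb code. The class name BoundedFetchLog and its methods are hypothetical, and it assumes the limit is counted in entries rather than bytes. Once the buffer is full, the oldest entry is dropped, so memory use stays bounded regardless of crawl size.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical bounded buffer of fetched URLs; not part of Nutch. */
final class BoundedFetchLog {
  private final int capacity;
  private final Deque<String> entries = new ArrayDeque<>();

  BoundedFetchLog(int capacity) {
    if (capacity <= 0) {
      throw new IllegalArgumentException("capacity must be positive");
    }
    this.capacity = capacity;
  }

  /** Adds a URL, evicting the oldest entry when the buffer is full. */
  synchronized void add(String url) {
    if (entries.size() == capacity) {
      entries.removeFirst();
    }
    entries.addLast(url);
  }

  /** Returns a copy of the current contents, oldest first. */
  synchronized List<String> snapshot() {
    return new ArrayList<>(entries);
  }

  synchronized int size() {
    return entries.size();
  }
}
```

With a capacity of 2, adding three URLs leaves only the last two in the buffer. A byte-based limit would additionally need a size estimate per entry, which this sketch leaves out.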

> Endpoint to support realtime JSON output from the fetcher
> ---------------------------------------------------------
>
>                 Key: NUTCH-2011
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2011
>             Project: Nutch
>          Issue Type: Sub-task
>          Components: fetcher, REST_api
>            Reporter: Sujen Shah
>            Assignee: Chris A. Mattmann
>              Labels: memex
>             Fix For: 1.11
>
>
> This fix will create an endpoint to query the Nutch REST service and get a real-time JSON response of the current/past fetched URLs.
> This endpoint also includes pagination of the output to reduce data transfer bandwidth in large crawls.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)