Posted to dev@nutch.apache.org by "Sujen Shah (JIRA)" <ji...@apache.org> on 2015/05/18 12:08:00 UTC
[jira] [Comment Edited] (NUTCH-2011) Endpoint to support realtime JSON output from the fetcher

    [ https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547812#comment-14547812 ] 

Sujen Shah edited comment on NUTCH-2011 at 5/18/15 10:07 AM:
-------------------------------------------------------------

Hi [~wastl-nagel], 
Just to add a little to Asitang's reply, 

- "fetch round" means one fetch job; this corresponds to one "bin/nutch fetch ...." invocation in the crawl script. 

- "greater depth fetch rounds" means longer fetch lists, which correspond to the higher iteration numbers allowed by the noOfRounds parameter passed to the bin/crawl script. 

As Asitang mentioned, the FetchNodeDb is used to make a D3 graph (currently a BFS tree) that shows the progress of the crawl. To build the graph, we need the "round" (or iteration) in which each node was fetched. 
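
To illustrate why the round number matters for the graph, here is a minimal Python sketch that groups fetched nodes into tree levels by round and builds parent-to-child edges. The record layout (url, parent url, round) and the function names are assumptions made for this example; they are not the actual FetchNode fields or any Nutch API.

```python
from collections import defaultdict

# Hypothetical records: (url, parent_url, round). This layout is an
# assumption for illustration, not the real FetchNode structure.
nodes = [
    ("http://a/", None, 1),
    ("http://a/x", "http://a/", 2),
    ("http://a/y", "http://a/", 2),
    ("http://a/x/z", "http://a/x", 3),
]

def levels_by_round(records):
    """Group URLs by the round in which they were fetched."""
    levels = defaultdict(list)
    for url, _parent, rnd in records:
        levels[rnd].append(url)
    # Sort by round so the levels come out in crawl order.
    return dict(sorted(levels.items()))

def children_by_parent(records):
    """Build parent -> children edges for a tree layout."""
    edges = defaultdict(list)
    for url, parent, _rnd in records:
        if parent is not None:
            edges[parent].append(url)
    return dict(edges)

levels = levels_by_round(nodes)
edges = children_by_parent(nodes)
```

Each round becomes one level of the tree, so without the round stored per node a client would have to recompute depths from the edges alone.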

There was some initial discussion about modifying the CrawlDb to hold one more parameter, the round number. But since the FetchNodeDb was created to store real-time information, the idea of modifying the CrawlDb was dropped. 

One point from your earlier comment: the reason for storing the FetchNodes in an enumerated manner was so that the client could paginate its requests and reduce the amount of bandwidth used. This was done to handle client-side failures in large crawls. This option is not currently supported by any of the persistent databases used (CrawlDb, LinkDb, etc.).
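
As a rough sketch of the pagination idea, assuming the nodes are keyed by consecutive integer ids starting at 1 (the dict layout, the paginate helper, and the example URLs are all hypothetical, not the actual endpoint or FetchNodeDb API):

```python
def paginate(fetch_nodes, start, size):
    """Return the entries with ids in [start, start + size) and the next
    start id, or None when no further entries exist."""
    page = {i: fetch_nodes[i]
            for i in range(start, start + size)
            if i in fetch_nodes}
    last_id = max(fetch_nodes, default=0)
    next_start = start + size if start + size <= last_id else None
    return page, next_start

# Hypothetical enumerated store: id -> fetched URL.
fetch_nodes = {i: f"http://example.org/page{i}" for i in range(1, 8)}

first_page, next_start = paginate(fetch_nodes, 1, 3)
last_page, end_marker = paginate(fetch_nodes, 7, 3)
```

A client that fails partway through only has to re-request from the last start id it processed, rather than pulling the whole node list again.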



was (Author: sujenshah):
Hi [~wastl-nagel], 
Just to add a little to Asitang's reply, 

- "fetch round" means one fetch job, this basically corresponds to "bin/nutch fetch ...." in the crawl script. 

- "greater depth fetch rounds" means longer fetch lists, corresponding to higher iteration numbers specified in the noOfRounds parameter while running the bin/crawl script. 

As Asitang mentioned, the FetchNodeDb is used to make a D3 graph (currently a BFS tree) to show the progress of the crawl, we would need the "round"(or iteration) in which it was fetched to make the graph. 

There was some initial discussion about modifying the CrawlDb to hold one more parameter which is the round number. But since a FetchNodeDb was created to store real-time information, the idea of modifying the crawldb was dropped. 

One point from your comment on NUTCH-2015, the reason to store the FetchNodes in an enumerated manner was so that the client could paginate his requests to reduce the amount of bandwidth used. This was done to take care of client side failures in large crawls. This option is not currently supported by any persistent databases used (CrawlDb/LinkDb, etc)


> Endpoint to support realtime JSON output from the fetcher
> ---------------------------------------------------------
>
>                 Key: NUTCH-2011
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2011
>             Project: Nutch
>          Issue Type: Sub-task
>          Components: fetcher, REST_api
>            Reporter: Sujen Shah
>            Assignee: Chris A. Mattmann
>              Labels: memex
>             Fix For: 1.11
>
>
> This fix will create an endpoint to query the Nutch REST service and get a real-time JSON response of the current/past Fetched URLs. 
> This endpoint also includes pagination of the output to reduce data transfer bandwidth in large crawls.


