Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2015/05/20 13:58:31 UTC

[Nutch Wiki] Update of "Nutch_1.X_RESTAPI/RunningJobsTutorial" by SujenShah

The "Nutch_1.X_RESTAPI/RunningJobsTutorial" page has been changed by SujenShah:
https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI/RunningJobsTutorial?action=diff&rev1=3&rev2=4

  {   
      "type":"INJECT",
      "confId":"default",
-     "args": {"crawldb":"crawl/crawldb", "url_dir":"url/"}
+     "crawlId":"crawl01",
+     "args": {"url_dir":"url/"}
  }
  }}}}
- The args contain two keys - crawldb, url_dir. These should be put with appropriate values.
+ The args contain one key: url_dir. Its value should be the path of the URL directory where the seed file is stored.
  The response of the request is a JSON output
  {{{{
  {
     "confId":"default",
-    "args":{"crawldb":"crawl/crawldb","url_dir":"url/"},
-    "crawlId":null,
+    "args":{"url_dir":"url/"},
+    "crawlId":"crawl01",
     "msg":"OK",
     "id":"default-INJECT-635077497",
     "state":"RUNNING",
@@ -56, +57 @@

  {  
      "type":"GENERATE",
      "confId":"default",
-     "args": {"crawldb":"crawl/crawldb", "segments_dir":"crawl/segments"}
+     "crawlId":"crawl01",
+     "args": {}
  }
  }}}}
- The args contain keys - crawldb, segments_dir, force, topN, numFetchers, adddays, noFilter, noNorm, maxNumSegments. These should be put with appropriate values.
+ The args contain keys - force, topN, numFetchers, adddays, noFilter, noNorm, maxNumSegments. These should be put with appropriate values.
  
  The description of these parameters can be found [[https://wiki.apache.org/nutch/bin/nutch%20generate|here]].
  
@@ -67, +69 @@

  {{{{
  {
      "confId":"default",
-     "args":{"crawldb":"crawl/crawldb","segments_dir":"crawl/segments"},
-     "crawlId":null,
+     "args":{},
+     "crawlId":"crawl01",
      "msg":"OK",
      "id":"default-GENERATE-274614034",
      "state":"RUNNING",
@@ -84, +86 @@

  {  
      "type":"FETCH",
      "confId":"default",
-     "args": {"segment":"crawl/segments/20150331153517"}
+     "crawlId":"crawl01",
+     "args": {}
  }
  }}}}
- The args contain keys - segment, threads, noParsing. These should be put with appropriate values.
+ The args contain keys - threads, noParsing. These should be put with appropriate values.
  
  The description of these parameters can be found [[https://wiki.apache.org/nutch/bin/nutch%20fetch | here]].
  
@@ -95, +98 @@

  {{{{
  {
       "confId":"default",
-      "args":{"segment":"crawl/segments/20150331153517"},
+      "args":{},
-      "crawlId":null,
+      "crawlId":"crawl01",
       "msg":"idle",
       "id":"default-FETCH-99398319",
       "state":"IDLE",
@@ -112, +115 @@

  {  
      "type":"PARSE",
      "confId":"default",
-     "args": {"segment":"crawl/segments/20150331153517", "noFilter":"true"}
+     "crawlId":"crawl01",
+     "args": {"noFilter":"true"}
  }
  }}}}
- The args contain keys - segment, noFilter, noNormalize. These should be put with appropriate values.
+ The args contain keys - noFilter, noNormalize. These should be put with appropriate values.
  
  The description of these parameters can be found [[https://wiki.apache.org/nutch/bin/nutch%20parse | here]].
  
@@ -123, +127 @@

  {{{{
  {
       "confId":"default",
-      "args":{"segment":"crawl/segments/20150331153517","noFilter":"true"},
+      "args":{"noFilter":"true"},
-      "crawlId":null,
+      "crawlId":"crawl01",
       "msg":"OK",
       "id":"default-PARSE-1413156163",
       "state":"IDLE",
@@ -140, +144 @@

  {  
      "type":"UPDATEDB",
      "confId":"default",
-     "args": {"crawldb":"crawl/crawldb", "segments":"crawl/segments/20150331153517"}
+     "crawlId":"crawl01",
+     "args": {}
  }
  }}}}
- The args contain keys - crawldb, segments, dir, force, normalize, filter, noAdditions. These should be put with appropriate values.
+ The args contain keys - force, normalize, filter, noAdditions. These should be put with appropriate values.
- 
- To use multiple segments, the segments parameter should contain the names of the segments separated by spaces. If you wish to specify an entire directory, use the dir parameter.
  
  The description of these parameters can be found [[https://wiki.apache.org/nutch/bin/nutch%20updatedb|here]].
  
@@ -170, +173 @@

  {  
      "type":"INVERTLINKS",
      "confId":"default",
-     "args": {"linkdb":"crawl/linkdb", "dir":"crawl/segments"}
+     "crawlId":"crawl01",
+     "args": {}
  }
  }}}}
  
- The args contain keys - crawldb, segments, dir, force, noNormalize, noFilter. These should be put with appropriate values.
+ The args contain keys - force, noNormalize, noFilter. These should be put with appropriate values.
- 
- To use multiple segments, the segments parameter should contain the names of the segments separated by spaces. If you wish to specify an entire directory, use the dir parameter.
  
  The description of these parameters can be found [[https://wiki.apache.org/nutch/bin/nutch%20invertlinks|here]].
  
@@ -184, +186 @@

  {{{{
  {
      "confId":"default",
-     "args":{"linkdb":"crawl/linkdb", "dir":"crawl/segments"},
-     "crawlId":null,
+     "args":{},
+     "crawlId":"crawl01",
      "msg":"OK",
      "id":"default-INVERTLINKS-572647647",
      "state":"RUNNING",
@@ -202, +204 @@

  {  
      "type":"DEDUP",
      "confId":"default",
-     "args": {"crawldb":"crawl/crawldb"}
+     "crawlId":"crawl01",
+     "args": {}
  }
  }}}}
- 
- The args contain keys - crawldb. These should be put with appropriate values.
  
  The response of the request is a JSON output
  {{{{
  {
      "confId":"default",
      "args":{"crawldb":"crawl/crawldb"},
-     "crawlId":null,
+     "crawlId":"crawl01",
      "msg":"OK",
      "id":"default-DEDUP-1394212503",
      "state":"RUNNING",
@@ -222, +223 @@

  }
  }}}}
  
- === Readdb Job ===
- To run the readdb job, call '''POST /db/readdb''' with the following:
- {{{{
- POST /db/readdb
- {     
-     "type":"stats",
-     "confId":"default",
-     "args":{"crawldb":"crawl/crawldb"}
- }
- }}}}
- The different types are - dump, topN and url. Their corresponding arguments can be found [[https://wiki.apache.org/nutch/bin/nutch%20readdb|here]].
- 
- The response of the request is a JSON output
- {{{{
-   {
-       "retry 0":"8350",
-       "minScore":"0.0",
-       "retry 1":"96",
-       "status":{ 
-                 "3":{"count":"21","statusValue":"db_gone"},
-                 "2":{"count":"594","statusValue":"db_fetched"},
-                 "1":{"count":"7721","statusValue":"db_unfetched"},
-                 "5":{"count":"86","statusValue":"db_redir_perm"},
-                 "4":{"count":"24","statusValue":"db_redir_temp"}
-                 },
-       "totalUrls":"8446",
-       "maxScore":"0.528",
-       "avgScore":"0.029593771"
-   }
- }}}}
- '''Note: ''' If a type other than stats is used (dump, topN or url), the response will be a file (application/octet-stream).
-
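
The revised request format in the diff above (a top-level crawlId, with args holding only job-specific options) can be sketched in Python as follows. This is an illustrative sketch, not part of the wiki page: the helper names build_job_request and post_job are hypothetical, and the base URL http://localhost:8081 and the /job/create path are assumptions that do not appear in this diff, so check them against your own Nutch server before use.

```python
import json
import urllib.request

# Job types that appear in the diff above.
JOB_TYPES = {"INJECT", "GENERATE", "FETCH", "PARSE",
             "UPDATEDB", "INVERTLINKS", "DEDUP"}


def build_job_request(job_type, crawl_id, args=None, conf_id="default"):
    """Build a request body in the revised format (hypothetical helper).

    crawlId is a top-level field; args carries only job-specific options,
    for example {"url_dir": "url/"} for INJECT, or {} when none are needed.
    """
    if job_type not in JOB_TYPES:
        raise ValueError("unknown job type: %s" % job_type)
    return {
        "type": job_type,
        "confId": conf_id,
        "crawlId": crawl_id,
        "args": dict(args or {}),
    }


def post_job(body, base_url="http://localhost:8081"):
    """POST the body to the server and return the decoded JSON response.

    ASSUMPTION: both the base URL and the /job/create path are placeholders;
    neither is shown in the diff above.
    """
    request = urllib.request.Request(
        base_url + "/job/create",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only builds and prints the bodies; no network call is made here.
    inject = build_job_request("INJECT", "crawl01", {"url_dir": "url/"})
    generate = build_job_request("GENERATE", "crawl01")
    print(json.dumps(inject, indent=2))
    print(json.dumps(generate, indent=2))
```

Keeping the body construction separate from the HTTP call means the request format can be checked without a running server. A response would carry the fields shown in the examples above, such as "id", "state" and "msg".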