Posted to commits@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2007/01/29 05:30:11 UTC

[Nutch Wiki] Trivial Update of "Automating Fetches with Python" by DennisKubes

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by DennisKubes:
http://wiki.apache.org/nutch/Automating_Fetches_with_Python

The comment on the change is:
Formatting changes

------------------------------------------------------------------------------
  --------------------------------------------------------------------------------
  The fetching process is one of the most essential processes for a production search engine.  Automation of that process is also essential.  This brief document covers the JobStream.py Python script, which automates the fetching process, including fetching, updating the crawl database, and merging fetches into single segments.  Please note that storage of the fetches is assumed to occur on the Hadoop DFS (although the script could be altered to run on the local file system).
  
- === The JobStream.py Process === 
+ === The JobStream.py Process ===
  JobStream.py is a single-class Python script that automates the webpage fetching / update process.
  
  The job starts by dumping the crawl database to a local disk from the master
@@ -92, +92 @@

  
  This assumes that you have set the basic options in the main def of the script and are overriding them.
  
- === The JobStream.py Script === 
+ === The JobStream.py Script ===
  {{{
  #!/usr/bin/python
  
@@ -561, +561 @@

    main(sys.argv[1:])
  }}}
  
- === The JobStream Logging.conf File === 
+ === The JobStream Logging.conf File ===
  {{{
  [formatters]
  keys=simple