Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2010/08/19 04:54:47 UTC

[Nutch Wiki] Trivial Update of "JavaDemoApplication" by Cristian Vulpe

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "JavaDemoApplication" page has been changed by Cristian Vulpe.
http://wiki.apache.org/nutch/JavaDemoApplication?action=diff&rev1=5&rev2=6

--------------------------------------------------

  ## page was renamed from JavaApplication
  = Integrating Nutch search functionality into a Java application =
- 
  This example is the fruit of much searching of the Nutch users mailing list in order to get a working application that uses the Nutch APIs.  I couldn't find everything needed for a quick start in one place, so this document was born...
  
  Using Nutch within an application is actually very simple; the requirements are merely the existence of a previously created crawl index, a couple of settings in a configuration file, and a handful of jars in your classpath. Nothing else is needed from the Nutch release that you can download.
  
  This example assumes that an index has been created in the directory /home/nutch-java-demo/crawl-dir and that a copy of the 'plugins' folder from the Nutch distribution is in the directory /home/nutch-java-demo/plugins. This directory tree is completely external to the deployment of the Java application.
- 
  
  == Configuration ==
  For the search to work, some appropriate settings need to be in a file called nutch-site.xml. If you have read the first part of this document, this file will be familiar to you. While you could use the same version of that file as before, there is no need to do so, as only two properties are required within it:
@@ -22, +20 @@

    <description />
  </property>
  }}}
- 
  This should point to a folder containing all the Nutch plugins. This can be placed anywhere within the filesystem and has no dependency on any other files distributed with Nutch.
  
  2) searcher.dir must be a fully qualified path to the crawl directory you want to use
+ 
  {{{
  <property>
    <name>searcher.dir</name>
@@ -36, +34 @@

  Place this copy of nutch-site.xml and a copy of common-terms.utf8 (from the conf directory in the Nutch distribution) in the WEB-INF/classes directory of the web application that you're deploying.
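
Before deploying, it can help to confirm that the trimmed-down nutch-site.xml really contains the required properties. The following is a minimal sketch using only the JDK's XML parser, not any Nutch API. The class name NutchSiteCheck and its methods are hypothetical. The property name "searcher.dir" comes from the text above, while "plugin.folders" is an assumption for the plugin folder property, whose name is not visible in this excerpt.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: checks a nutch-site.xml style file for required properties.
final class NutchSiteCheck {

    // Parses <property><name/><value/></property> entries into a name -> value map.
    static Map<String, String> parseProperties(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new LinkedHashMap<>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element property = (Element) nodes.item(i);
                props.put(childText(property, "name"), childText(property, "value"));
            }
            return props;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse configuration XML", e);
        }
    }

    // Returns the trimmed text of the first child element with the given tag, or "".
    private static String childText(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? "" : list.item(0).getTextContent().trim();
    }

    // Lists the required property names that are absent or have an empty value.
    static List<String> missingProperties(Map<String, String> props, String... required) {
        List<String> absent = new ArrayList<>();
        for (String key : required) {
            String value = props.get(key);
            if (value == null || value.isEmpty()) {
                absent.add(key);
            }
        }
        return absent;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: NutchSiteCheck <path-to-nutch-site.xml>");
            return;
        }
        String xml = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        // "plugin.folders" is an assumed property name; "searcher.dir" is from the page.
        List<String> absent = missingProperties(parseProperties(xml), "plugin.folders", "searcher.dir");
        System.out.println(absent.isEmpty() ? "All required properties are set" : "Missing: " + absent);
    }
}
```

Run it against the file you intend to place in WEB-INF/classes. It only checks that the properties are present and non-empty, not that the paths exist on disk.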
  
  You also need to make sure that the following jars are placed in WEB-INF/lib:
+ 
  {{{
  commons-cli-2.0-SNAPSHOT.jar
  hadoop-0.12.2-core.jar
@@ -43, +42 @@

  lucene-misc-2.2.0.jar
  nutch-0.9.jar
  }}}
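
A missing jar usually only shows up as a ClassNotFoundException at run time, so a quick check of WEB-INF/lib can save a deployment cycle. The sketch below is a plain JDK helper, and the class name LibDirCheck is hypothetical. The jar names in main are only the four visible in this excerpt, so the list is incomplete and should be extended with the full set from the wiki page.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: reports which expected jar files are absent from a lib directory.
final class LibDirCheck {

    // Returns the expected names that do not appear in the collection of present names.
    static List<String> missingJars(Collection<String> present, List<String> expected) {
        Set<String> have = new HashSet<>(present);
        List<String> absent = new ArrayList<>();
        for (String jar : expected) {
            if (!have.contains(jar)) {
                absent.add(jar);
            }
        }
        return absent;
    }

    // Same check against a directory on disk; a non-existent directory counts as empty.
    static List<String> missingJars(File libDir, List<String> expected) {
        String[] names = libDir.list(); // null when libDir is not a readable directory
        List<String> present = names == null ? new ArrayList<String>() : Arrays.asList(names);
        return missingJars(present, expected);
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: LibDirCheck <path-to-WEB-INF/lib>");
            return;
        }
        // Partial list: only the jar names visible in this excerpt of the page.
        List<String> expected = Arrays.asList(
                "commons-cli-2.0-SNAPSHOT.jar",
                "hadoop-0.12.2-core.jar",
                "lucene-misc-2.2.0.jar",
                "nutch-0.9.jar");
        List<String> absent = missingJars(new File(args[0]), expected);
        System.out.println(absent.isEmpty() ? "All listed jars found" : "Missing jars: " + absent);
    }
}
```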
- 
  == Sample code ==
  With that, all is ready and we can now write some simple code to search. A quick example in Java to search the crawl index and return the number of hits found is:
  
@@ -59, +57 @@

  Hits nutchHits = nutchBean.search(nutchQuery, maxHits);
  out.println("Found " + nutchHits.getLength() + " hits\n");
  }}}
- 
  Obviously this is not the most useful application, but it provides the basics for querying the Nutch index. Once a Hits object is returned, we can inspect each Hit object within that structure and glean more information from it:
  
  {{{
@@ -77, +74 @@

    System.out.println("----------------------------------------");
  }
  }}}
+ Chaz Hickman (Jan 2008) y
  
- Chaz Hickman (Jan 2008)
-