Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2011/03/29 09:33:23 UTC

[Nutch Wiki] Update of "FabioGiavazzi/HowtoGettingNutchRunningonWindows" by FabioGiavazzi

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "FabioGiavazzi/HowtoGettingNutchRunningonWindows" page has been changed by FabioGiavazzi.
The comment on this change is: Template.. not finished yet.
http://wiki.apache.org/nutch/FabioGiavazzi/HowtoGettingNutchRunningonWindows

--------------------------------------------------

New page:
##master-page:HomepageReadPageTemplate
##master-date:Unknown-Date
#format wiki
#language en
How to set up Nutch on Windows Server 2008 R2 Enterprise (64-bit), and how to crawl Samba shares with Nutch.

First of all, you need to download the following software:

 * Java 1.6 (or newer version): http://www.oracle.com/technetwork/java/javase/downloads/index.html
 * Tomcat 7: http://tomcat.apache.org/download-70.cgi
 * Cygwin: http://www.cygwin.com/
 * Nutch-1.2 (or newer version): ftp://mirror.switch.ch/mirror/apache/dist//nutch/ (apache-nutch-1.2-bin.zip)

Step 1: Install Cygwin (run cygwin.exe) and follow the setup assistant.

Step 2: Install Java (run jdk-6u24-windows-i586.exe) and set JAVA_HOME in Start -> Computer -> Properties -> Advanced system settings -> Advanced -> Environment Variables...

(Use the 32-bit version of Java; the 64-bit version causes problems on this OS.)
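As a quick check from the Cygwin shell, the following sketch sets JAVA_HOME for the current session only and reports whether a java executable is found under it. The install path is an assumption (the usual default for a 32-bit JDK 6u24 on 64-bit Windows); adjust it to your system.

```shell
# Assumed install location of the 32-bit JDK; change it if yours differs.
export JAVA_HOME="/cygdrive/c/Program Files (x86)/Java/jdk1.6.0_24"

# Report whether a java executable exists under JAVA_HOME.
if [ -x "$JAVA_HOME/bin/java" ]; then
  JAVA_STATUS="found"
  "$JAVA_HOME/bin/java" -version
else
  JAVA_STATUS="missing"
fi
echo "java under JAVA_HOME: $JAVA_STATUS"
```

If the status is "missing", the variable most likely points at the wrong folder.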

Step 3: Install Tomcat (run apache-tomcat-7.0.11.exe). After installation, Tomcat should start the service automatically. If the service is not running, start it manually by clicking on Configure Tomcat and then Start:

Now go to http://localhost:8080 in your browser and check if you see the following screen:

Step 4: To crawl a Samba share, you first have to map it as a network drive (in this example it's ipa-data1):

Step 5: Unzip apache-nutch-1.2-bin.zip to any directory you like (I prefer C:\):

Now go to the nutch-1.2 directory and create a urls folder:

In this folder, create a text file with any name you like (e.g. files). Now edit it and paste in your file URLs:

Every URL has to start with file:///, otherwise it won't work.
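The same step can be sketched from the Cygwin shell. The drive letter Z: and the folder name are hypothetical; use the path of your own mapped network drive.

```shell
# Run from the nutch-1.2 directory.
mkdir -p urls

# Hypothetical seed: a folder on a network drive mapped as Z:.
# Replace it with your own paths, one URL per line.
printf '%s\n' 'file:///Z:/documents/' > urls/files

# Show the seed list that the crawl will start from.
cat urls/files
```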

Step 6: Go to the nutch-1.2\conf directory and edit nutch-default.xml:

Here we have to change the plugin.includes property and set the limit for file content (file.content.limit) to -1 for unlimited file length. Take a look at the changes:

In plugin.includes, change the value protocol-http to protocol-file (don't change the other default values):

To make Nutch crawl only the links you listed in the urls folder, you have to disable this property by setting it to false:

Step 7: Go to nutch-1.2\conf\ and edit the file crawl-urlfilter.txt:

 * Change -^(file|ftp|mailto) to -^(http|ftp|mailto).
 * Disable the rules "skip URLs with slash-delimited" and "accept hosts in MY.DOMAIN.NAME".
 * Change the rule "skip everything else" to "accept everything else".
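For comparison, here is a sketch of how the affected rules might look after these edits. The patterns are written from memory of the stock Nutch 1.2 file, so treat them as an assumption and check them against your own copy. The suffix and special-character rules are left out because they stay unchanged. The sketch writes to a scratch file (a hypothetical name) and does not touch conf\crawl-urlfilter.txt.

```shell
# Write the edited rules to a scratch file for side-by-side comparison.
cat > crawl-urlfilter.sketch.txt <<'EOF'
# skip http:, ftp:, & mailto: urls
-^(http|ftp|mailto):

# disabled: skip URLs with slash-delimited segment that repeats 3+ times
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# disabled: accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# accept everything else
+.
EOF

# Print only the active rules (no comments, no blank lines).
grep -v '^#' crawl-urlfilter.sketch.txt | grep -v '^$'
```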

Step 8: Edit the file nutch-1.2\conf\nutch-site.xml and paste in some default properties:

<configuration>
 <property>
  <name>http.agent.name</name>
  <value>test</value>
  <description>test</description>
 </property>
 <property>
  <name>http.agent.description</name>
  <value>Nutch</value>
  <description>Nutch</description>
 </property>
 <property>
  <name>http.agent.url</name>
  <value>http://test.url</value>
  <description>http://test.url</description>
 </property>
 <property>
  <name>http.agent.email</name>
  <value>test@test.ch</value>
  <description>test@test.ch</description>
 </property>
</configuration>

Step 9: Open Cygwin, navigate to the nutch-1.2 directory with cd /cygdrive/c/nutch-1.2, and run the crawl with this command:

Options you can use:

 * -dir dir names the directory to put the crawl in
 * -threads threads determines the number of threads that will fetch in parallel
 * -depth depth indicates the link depth from the root page that should be crawled
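The command itself is not shown on this page yet. As a sketch, the usual one-step crawl invocation in Nutch 1.x looks like the following. The values for -threads and -depth are illustrative assumptions, and -dir crawl is chosen to match the crawl folder used later in Step 12.

```shell
# Run from the nutch-1.2 directory (cd /cygdrive/c/nutch-1.2).
# "urls" is the seed folder from Step 5; the option values are examples only.
CRAWL_CMD="bin/nutch crawl urls -dir crawl -threads 10 -depth 3"

if [ -x bin/nutch ]; then
  # Start the crawl; results are written to the "crawl" directory.
  $CRAWL_CMD
else
  echo "bin/nutch not found; would run: $CRAWL_CMD"
fi
```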

Step 10: To use the Tomcat manager, you have to edit tomcat-users.xml in C:\Program Files (x86)\Apache Software Foundation\Tomcat 7.0\conf\:

Add a new user and new role, like this:
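A minimal sketch of such an entry, written to a scratch file (a hypothetical name) so that the real tomcat-users.xml is not overwritten. In Tomcat 7 the manager-gui role gives access to the HTML manager. The user name and password here are hypothetical placeholders; choose your own.

```shell
# Write an example tomcat-users.xml to a scratch file for reference.
cat > tomcat-users.sketch.xml <<'EOF'
<tomcat-users>
  <!-- manager-gui grants access to the HTML manager at /manager/html -->
  <role rolename="manager-gui"/>
  <!-- hypothetical credentials: choose your own -->
  <user username="nutchadmin" password="change-me" roles="manager-gui"/>
</tomcat-users>
EOF

cat tomcat-users.sketch.xml
```

Copy the role and user lines into the existing tomcat-users element of the real file.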

Save the settings and restart Tomcat (see Step 3).

Step 11: Go to http://localhost:8080/manager/html in the browser (log in with the user from Step 10). In the WAR file to deploy section, select the \nutch-1.2\nutch-1.2.war file to upload:

You will then see /nutch-1.2 in the list; start it.

Go to C:\Program Files (x86)\Apache Software Foundation\Tomcat 7.0\webapps\ and you will see that there is a folder called nutch-1.2.

Step 12: Navigate to C:\Program Files (x86)\Apache Software Foundation\Tomcat 7.0\webapps\nutch-1.2\WEB-INF\classes\ and edit nutch-site.xml. Add the searcher.dir property after the last existing </property> and before the closing </configuration>:

 </property>
 <property>
  <name>searcher.dir</name>
  <value>your_crawl_folder (like C:\nutch-1.2\crawl\)</value>
 </property>
</configuration>

After that, restart Tomcat.

Step 13: Go to http://localhost:8080/nutch-1.2 and you should see the following:

Now you can search for your files! Don't forget that you have to map the network drives on every system, so that files can be opened and edited directly from the Nutch search results.

Enjoy!

(Pictures will follow.) Under construction!