Posted to user@nutch.apache.org by Info <in...@radionav.it> on 2006/07/21 22:17:40 UTC
Hadoop and Recrawl
Hi List
I'm trying to use this script with Hadoop, but it doesn't work.
I tried replacing ls with bin/hadoop dfs -ls,
but the script still fails because it uses ls -d, not plain ls.
Can someone help me?
Best regards
Roberto Navoni
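One possible way around the `ls -d` problem, sketched here as a suggestion rather than something from the thread: since Nutch segment names are timestamps, the newest segment can be picked by sorting the `bin/hadoop dfs -ls` output. This assumes the 0.8-era `dfs -ls` format, where each path appears in the first field of its line.

```shell
# pick_latest_segment: read `bin/hadoop dfs -ls <segments_dir>` output on
# stdin and print the newest segment path. Segment names are timestamps,
# so a plain lexical sort finds the latest one; no `ls -d` is needed.
pick_latest_segment() {
  awk '$1 ~ /^\// { print $1 }' | sort | tail -1
}

# intended use (requires a running Hadoop DFS, so it is only sketched here):
# segment=$(bin/hadoop dfs -ls /user/root/crawld/segments | pick_latest_segment)
```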
-----Original message-----
From: Matthew Holt [mailto:mholt@redhat.com]
Sent: Friday, 21 July 2006, 18:58
To: nutch-user@lucene.apache.org
Subject: Re: Recrawl script for 0.8.0 completed...
Lourival Júnior wrote:
> I think it won't work for me because I'm using Nutch version 0.7.2.
> Actually I use this script (comments translated from Portuguese):
>
> #!/bin/bash
>
> # A simple script to run a Nutch re-crawl
> # Script source:
> http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
>
> #{
>
> if [ -n "$1" ]
> then
> crawl_dir=$1
> else
> echo "Usage: recrawl crawl_dir [depth] [adddays]"
> exit 1
> fi
>
> if [ -n "$2" ]
> then
> depth=$2
> else
> depth=5
> fi
>
> if [ -n "$3" ]
> then
> adddays=$3
> else
> adddays=0
> fi
>
> webdb_dir=$crawl_dir/db
> segments_dir=$crawl_dir/segments
> index_dir=$crawl_dir/index
>
> # Stop the Tomcat service
> #net stop "Apache Tomcat"
>
> # The generate/fetch/update cycle
> for ((i=1; i <= depth ; i++))
> do
> bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
> segment=`ls -d $segments_dir/* | tail -1`
> bin/nutch fetch $segment
> bin/nutch updatedb $webdb_dir $segment
> echo
> echo "End of cycle $i."
> echo
> done
>
> # Update segments
> echo
> echo "Updating the segments..."
> echo
> mkdir tmp
> bin/nutch updatesegs $webdb_dir $segments_dir tmp
> rm -R tmp
>
> # Index segments
> echo "Indexing the segments..."
> echo
> for segment in `ls -d $segments_dir/* | tail -$depth`
> do
> bin/nutch index $segment
> done
>
> # De-duplicate indexes
> # "bogus" argument is ignored but needed due to
> # a bug in the number of args expected
> bin/nutch dedup $segments_dir bogus
>
> # Merge indexes
> #echo "Merging the segments..."
> #echo
> ls -d $segments_dir/* | xargs bin/nutch merge $index_dir
>
> chmod 777 -R $index_dir
>
> # Start the Tomcat service
> #net start "Apache Tomcat"
>
> echo "Done."
>
> #} > recrawl.log 2>&1
>
> As you suggested, I used the touch command instead of stopping Tomcat.
> However, I still get the error posted in my previous message. I'm running
> Nutch on Windows under Cygwin, and I only avoid the error when I stop
> Tomcat. I use
> this command to call the script:
>
> ./recrawl crawl-legislacao 1
>
> Could you give me more clarifications?
>
> Thanks a lot!
>
> On 7/21/06, Matthew Holt <mh...@redhat.com> wrote:
>>
>> Lourival Júnior wrote:
>> > Hi Renaud!
>> >
>> > I'm a newbie with shell scripts, and I know stopping the Tomcat service
>> > is not the best way to do this. The problem is, when I run the re-crawl
>> > script with Tomcat started, I get this error:
>> >
>> > 060721 132224 merging segment indexes to: crawl-legislacao2\index
>> > Exception in thread "main" java.io.IOException: Cannot delete _0.f0
>> > at
>> > org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195)
>> > at
>> org.apache.lucene.store.FSDirectory.init(FSDirectory.java:176)
>> > at
>> > org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java
>> > :141)
>> > at
>> > org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:225)
>> > at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java
>> :92)
>> > at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java
>> :160)
>> >
>> > So, I want another way to re-crawl my pages without this error and
>> > without restarting Tomcat. Could you suggest one?
>> >
>> > Thanks a lot!
>> >
>> >
>> Try this updated script and tell me exactly what command you run to call
>> the script. Then let me know the error message.
>>
>> Matt
>>
>>
>> #!/bin/bash
>>
>> # Nutch recrawl script.
>> # Based on 0.7.2 script at
>> http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
>>
>> # Modified by Matthew Holt
>>
>> if [ -n "$1" ]
>> then
>> nutch_dir=$1
>> else
>> echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
>> echo "servlet_path - Path of the nutch servlet (i.e.
>> /usr/local/tomcat/webapps/ROOT)"
>> echo "crawl_dir - Name of the directory the crawl is located in."
>> echo "[depth] - The link depth from the root page that should be
>> crawled."
>> echo "[adddays] - Advance the clock # of days for fetchlist
>> generation."
>> exit 1
>> fi
>>
>> if [ -n "$2" ]
>> then
>> crawl_dir=$2
>> else
>> echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
>> echo "servlet_path - Path of the nutch servlet (i.e.
>> /usr/local/tomcat/webapps/ROOT)"
>> echo "crawl_dir - Name of the directory the crawl is located in."
>> echo "[depth] - The link depth from the root page that should be
>> crawled."
>> echo "[adddays] - Advance the clock # of days for fetchlist
>> generation."
>> exit 1
>> fi
>>
>> if [ -n "$3" ]
>> then
>> depth=$3
>> else
>> depth=5
>> fi
>>
>> if [ -n "$4" ]
>> then
>> adddays=$4
>> else
>> adddays=0
>> fi
>>
>> # Only change if your crawl subdirectories are named something different
>> webdb_dir=$crawl_dir/crawldb
>> segments_dir=$crawl_dir/segments
>> linkdb_dir=$crawl_dir/linkdb
>> index_dir=$crawl_dir/index
>>
>> # The generate/fetch/update cycle
>> for ((i=1; i <= depth ; i++))
>> do
>> bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
>> segment=`ls -d $segments_dir/* | tail -1`
>> bin/nutch fetch $segment
>> bin/nutch updatedb $webdb_dir $segment
>> done
>>
>> # Update segments
>> bin/nutch invertlinks $linkdb_dir -dir $segments_dir
>>
>> # Index segments
>> new_indexes=$crawl_dir/newindexes
>> #ls -d $segments_dir/* | tail -$depth | xargs
>> bin/nutch index $new_indexes $webdb_dir $linkdb_dir $segments_dir/*
>>
>> # De-duplicate indexes
>> bin/nutch dedup $new_indexes
>>
>> # Merge indexes
>> bin/nutch merge $index_dir $new_indexes
>>
>> # Tell Tomcat to reload index
>> touch $nutch_dir/WEB-INF/web.xml
>>
>> # Clean up
>> rm -rf $new_indexes
>>
>>
>
>
Oh yeah, you're right, the one I sent out was for 0.8... you should just
be able to put this at the end of your script:
# Tell Tomcat to reload index
touch $nutch_dir/WEB-INF/web.xml
and fill in the appropriate path, of course.
Good luck,
Matt
--
Nutch to...Frutch
Posted by Hans Vallden <ha...@vallden.com>.
Greetings All!
I recently became interested in search technology and more
specifically Nutch. So, I'm a newbie by all standards. Don't hesitate
to treat me as one. :)
My vision would be to build a Froogle-like ecommerce search engine,
possibly using Nutch. I am wondering if anyone on this list has ever
pondered the same idea? I would be very interested in hearing
thoughts and experiences. Don't hesitate to contact me off the list,
if you feel it to be more appropriate.
--
Hans Vallden
hans@vallden.com
This is my tutorial for Hadoop + Nutch 0.8. I'm searching for a
recrawl script tutorial for Nutch + Hadoop
Posted by info <in...@radionav.it>.
Tutorial: Nutch 0.8 and Hadoop
This tutorial is derived from the Hadoop + Nutch tutorial and other 0.8
tutorials found on the wiki site and on Google, and it works fine!
Now I'm working on a recrawl tutorial.
# Format the Hadoop namenode
root@LSearchDev01:/nutch/search# bin/hadoop namenode -format
Re-format filesystem in /nutch/filesystem/name ? (Y or N) Y
Formatted /nutch/filesystem/name
#Start Hadoop
root@LSearchDev01:/nutch/search# bin/start-all.sh
namenode running as process 16789.
root@lsearchdev01's password:
jobtracker running as process 16866.
root@lsearchdev01's password:
LSearchDev01: starting tasktracker, logging
to /nutch/search/logs/hadoop-root-tasktracker-LSearchDev01.out
# ls on the Hadoop file system
root@LSearchDev01:/nutch/search#
root@LSearchDev01:/nutch/search# bin/hadoop dfs -ls
Found 0 items
# Hadoop works fine
# Use vi to add your sites, one per line, in http://www.yoursite.com format
root@LSearchDev01:/nutch/search# vi urls.txt
# Make the urls directory on the Hadoop file system
root@LSearchDev01:/nutch/search# bin/hadoop dfs -mkdir urls
# Copy urls.txt from the Linux file system to the Hadoop file system
root@LSearchDev01:/nutch/search# bin/hadoop dfs -copyFromLocal urls.txt
urls/urls.txt
# List the file on the Hadoop file system
root@LSearchDev01:/nutch/search# bin/hadoop dfs -lsr /user/root/urls
<dir>
/user/root/urls/urls.txt <r 2> 41
# To delete the old urls file on the Hadoop file system and upload a new
# one, use the following commands
root@LSearchDev01:/nutch/search# bin/hadoop dfs
-rm /user/root/urls/urls.txt
Deleted /user/root/urls/urls.txt
root@LSearchDev01:/nutch/search# bin/hadoop dfs -copyFromLocal urls.txt
urls/urls.txt
# Inject the URLs from urls.txt into the <crawld> crawl db
root@LSearchDev01:/nutch/search# bin/nutch inject crawld urls
# (*) to see the status of the job, go to:
http://127.0.0.1:50030
# This is the new situation of your hadoop file system now
root@LSearchDev01:/nutch/search# bin/hadoop dfs -lsr
/user/root/crawld <dir>
/user/root/crawld/current <dir>
/user/root/crawld/current/part-00000 <dir>
/user/root/crawld/current/part-00000/data <r 2> 62
/user/root/crawld/current/part-00000/index <r 2> 33
/user/root/crawld/current/part-00001 <dir>
/user/root/crawld/current/part-00001/data <r 2> 62
/user/root/crawld/current/part-00001/index <r 2> 33
/user/root/crawld/current/part-00002 <dir>
/user/root/crawld/current/part-00002/data <r 2> 124
/user/root/crawld/current/part-00002/index <r 2> 74
/user/root/crawld/current/part-00003 <dir>
/user/root/crawld/current/part-00003/data <r 2> 181
/user/root/crawld/current/part-00003/index <r 2> 74
/user/root/urls <dir>
/user/root/urls/urls.txt <r 2> 64
# Now you can generate the fetch list
root@LSearchDev01:/nutch/search# bin/nutch
generate /user/root/crawld /user/root/crawld/segments
# (*) to see the status of the job, go to:
http://127.0.0.1:50030
# /user/root/crawld/segments/20060722130642 is the name of the segment
# that you want to fetch
root@LSearchDev01:/nutch/search# bin/hadoop dfs
-ls /user/root/crawld/segments
Found 1 items
/user/root/crawld/segments/20060722130642 <dir>
root@LSearchDev01:/nutch/search#
# Fetch the sites listed in urls.txt
root@LSearchDev01:/nutch/search# bin/nutch
fetch /user/root/crawld/segments/20060722130642
# (*) to see the status of the job, go to:
http://127.0.0.1:50030
# This is what is on your Hadoop file system now
root@LSearchDev01:/nutch/search# bin/hadoop dfs -lsr /user/root/crawld
<dir>
/user/root/crawld/current <dir>
/user/root/crawld/current/part-00000 <dir>
/user/root/crawld/current/part-00000/data <r 2> 62
/user/root/crawld/current/part-00000/index <r 2> 33
/user/root/crawld/current/part-00001 <dir>
/user/root/crawld/current/part-00001/data <r 2> 62
/user/root/crawld/current/part-00001/index <r 2> 33
/user/root/crawld/current/part-00002 <dir>
/user/root/crawld/current/part-00002/data <r 2> 124
/user/root/crawld/current/part-00002/index <r 2> 74
/user/root/crawld/current/part-00003 <dir>
/user/root/crawld/current/part-00003/data <r 2> 181
/user/root/crawld/current/part-00003/index <r 2> 74
/user/root/crawld/segments <dir>
/user/root/crawld/segments/20060722130642 <dir>
/user/root/crawld/segments/20060722130642/content <dir>
/user/root/crawld/segments/20060722130642/content/part-00000 <dir>
/user/root/crawld/segments/20060722130642/content/part-00000/data
<r 2> 62
/user/root/crawld/segments/20060722130642/content/part-00000/index
<r 2> 33
/user/root/crawld/segments/20060722130642/content/part-00001 <dir>
/user/root/crawld/segments/20060722130642/content/part-00001/data
<r 2> 62
/user/root/crawld/segments/20060722130642/content/part-00001/index
<r 2> 33
/user/root/crawld/segments/20060722130642/content/part-00002 <dir>
/user/root/crawld/segments/20060722130642/content/part-00002/data
<r 2> 2559
/user/root/crawld/segments/20060722130642/content/part-00002/index
<r 2> 74
/user/root/crawld/segments/20060722130642/content/part-00003 <dir>
/user/root/crawld/segments/20060722130642/content/part-00003/data
<r 2> 6028
/user/root/crawld/segments/20060722130642/content/part-00003/index
<r 2> 74
/user/root/crawld/segments/20060722130642/crawl_fetch <dir>
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00000
<dir>
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00000/data
<r 2> 62
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00000/index
<r 2> 33
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00001
<dir>
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00001/data
<r 2> 62
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00001/index
<r 2> 33
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00002
<dir>
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00002/data
<r 2> 140
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00002/index
<r 2> 74
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00003
<dir>
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00003/data
<r 2> 213
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00003/index
<r 2> 74
/user/root/crawld/segments/20060722130642/crawl_generate <dir>
/user/root/crawld/segments/20060722130642/crawl_generate/part-00000
<r 2> 119
/user/root/crawld/segments/20060722130642/crawl_generate/part-00001
<r 2> 124
/user/root/crawld/segments/20060722130642/crawl_generate/part-00002
<r 2> 124
/user/root/crawld/segments/20060722130642/crawl_generate/part-00003
<r 2> 62
/user/root/crawld/segments/20060722130642/crawl_parse <dir>
/user/root/crawld/segments/20060722130642/crawl_parse/part-00000
<r 2> 62
/user/root/crawld/segments/20060722130642/crawl_parse/part-00001
<r 2> 62
/user/root/crawld/segments/20060722130642/crawl_parse/part-00002
<r 2> 784
/user/root/crawld/segments/20060722130642/crawl_parse/part-00003
<r 2> 1698
/user/root/crawld/segments/20060722130642/parse_data <dir>
/user/root/crawld/segments/20060722130642/parse_data/part-00000 <dir>
/user/root/crawld/segments/20060722130642/parse_data/part-00000/data
<r 2> 61
/user/root/crawld/segments/20060722130642/parse_data/part-00000/index
<r 2> 33
/user/root/crawld/segments/20060722130642/parse_data/part-00001 <dir>
/user/root/crawld/segments/20060722130642/parse_data/part-00001/data
<r 2> 61
/user/root/crawld/segments/20060722130642/parse_data/part-00001/index
<r 2> 33
/user/root/crawld/segments/20060722130642/parse_data/part-00002 <dir>
/user/root/crawld/segments/20060722130642/parse_data/part-00002/data
<r 2> 839
/user/root/crawld/segments/20060722130642/parse_data/part-00002/index
<r 2> 74
/user/root/crawld/segments/20060722130642/parse_data/part-00003 <dir>
/user/root/crawld/segments/20060722130642/parse_data/part-00003/data
<r 2> 1798
/user/root/crawld/segments/20060722130642/parse_data/part-00003/index
<r 2> 74
/user/root/crawld/segments/20060722130642/parse_text <dir>
/user/root/crawld/segments/20060722130642/parse_text/part-00000 <dir>
/user/root/crawld/segments/20060722130642/parse_text/part-00000/data
<r 2> 61
/user/root/crawld/segments/20060722130642/parse_text/part-00000/index
<r 2> 33
/user/root/crawld/segments/20060722130642/parse_text/part-00001 <dir>
/user/root/crawld/segments/20060722130642/parse_text/part-00001/data
<r 2> 61
/user/root/crawld/segments/20060722130642/parse_text/part-00001/index
<r 2> 33
/user/root/crawld/segments/20060722130642/parse_text/part-00002 <dir>
/user/root/crawld/segments/20060722130642/parse_text/part-00002/data
<r 2> 377
/user/root/crawld/segments/20060722130642/parse_text/part-00002/index
<r 2> 74
/user/root/crawld/segments/20060722130642/parse_text/part-00003 <dir>
/user/root/crawld/segments/20060722130642/parse_text/part-00003/data
<r 2> 811
/user/root/crawld/segments/20060722130642/parse_text/part-00003/index
<r 2> 74
/user/root/urls <dir>
/user/root/urls/urls.txt <r 2> 64
# Now you need to run the invertlinks job
root@LSearchDev01:/nutch/search# bin/nutch
invertlinks /user/root/crawld/linkdb /user/root/crawld/segments/20060722130642
# And at the end you need to build your index
root@LSearchDev01:/nutch/search# bin/nutch
index /user/root/crawld/indexes /user/root/crawld/ /user/root/crawld/linkdb /user/root/crawld/segments/20060722130642
root@LSearchDev01:/nutch/search# bin/hadoop dfs -ls /user/root/crawld
Found 4 items
/user/root/crawld/current <dir>
/user/root/crawld/indexes <dir>
/user/root/crawld/linkdb <dir>
/user/root/crawld/segments <dir>
root@LSearchDev01:/nutch/search#
After all this hard work, you have these directories on your Hadoop file
system, so you are ready to start Tomcat.
Before you start Tomcat, remember to change the path of your search
directory in the nutch-site.xml file in the webapps/ROOT/WEB-INF/classes
directory.
# This is an example of my configuration
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>LSearchDev01:9000</value>
</property>
<property>
<name>searcher.dir</name>
<value>/user/root/crawld</value>
</property>
</configuration>
I hope this helps someone build their first search engine with Nutch 0.8 +
Hadoop :)
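The tutorial steps above can be wrapped into a single recrawl function. This is only a sketch, not part of the original tutorial: the paths, the default depth, and the assumption that `bin/hadoop dfs -ls` prints each path in the first field (and that segment names are timestamps, so a lexical sort finds the newest) are all illustrative.

```shell
#!/bin/bash
# Sketch of a recrawl loop built from the tutorial commands above.

# newest_segment: read `bin/hadoop dfs -ls` output on stdin and print the
# latest segment path (timestamp names sort lexically in date order)
newest_segment() {
  awk '$1 ~ /^\// { print $1 }' | sort | tail -1
}

recrawl() {
  crawl=$1          # e.g. /user/root/crawld
  depth=${2:-3}     # number of generate/fetch/update rounds
  for ((i = 1; i <= depth; i++)); do
    bin/nutch generate "$crawl" "$crawl/segments"
    segment=$(bin/hadoop dfs -ls "$crawl/segments" | newest_segment)
    bin/nutch fetch "$segment"
    bin/nutch updatedb "$crawl" "$segment"
  done
  bin/nutch invertlinks "$crawl/linkdb" -dir "$crawl/segments"
}

# usage: recrawl /user/root/crawld 3
```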
Best crawling
Roberto Navoni
Hadoop and Inject and Recrawl hadoop and nutch v0.8 WORK FINE!!!!
Posted by roberto navoni <r....@radionav.it>.
Tutorial: Nutch 0.8 and Hadoop
This tutorial is derived from the Hadoop + Nutch tutorial and other 0.8
tutorials found on the wiki site and on Google, and it works fine!
At the end of the tutorial you will also find a recrawl and index-rebuild
tutorial.
#RECRAWL AND NEW INJECT
# Create a new index, indexe0
bin/nutch
index /user/root/crawld/indexe0 /user/root/crawld/ /user/root/crawld/linkdb /user/root/crawld/segments/20060722153133
# Create a new index, indexe1
bin/nutch
index /user/root/crawld/indexe1 /user/root/crawld/ /user/root/crawld/linkdb /user/root/crawld/segments/20060722182213
# Dedup the new indexe0
bin/nutch dedup /user/root/crawld/indexe0
# Dedup the new indexe1
bin/nutch dedup /user/root/crawld/indexe1
# Delete the old index
# Merge the new indexes into the merged index directory
bin/nutch
merge /user/root/crawld/index /user/root/crawld/indexe0 /user/root/crawld/indexe1 ... #(and the other indexes created for the fetched segments)
# index is the standard directory in crawld (DB) that holds the merged
# master index
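The indexe0/indexe1 commands above generalize to any number of per-segment indexes. The sketch below is illustrative only: the NUTCH variable and the example paths are not part of the tutorial, and NUTCH is made overridable so the sequence can be dry-run without a Nutch installation.

```shell
# merge_indexes: dedup each per-segment index, then merge them all into
# the master index under <crawl>/index. NUTCH defaults to bin/nutch but
# can be overridden (e.g. NUTCH=echo for a dry run).
NUTCH=${NUTCH:-bin/nutch}

merge_indexes() {
  crawl=$1; shift
  for idx in "$@"; do
    $NUTCH dedup "$idx"    # remove duplicate documents from this index
  done
  $NUTCH merge "$crawl/index" "$@"   # merge all indexes into the master
}

# usage: merge_indexes /user/root/crawld \
#          /user/root/crawld/indexe0 /user/root/crawld/indexe1
```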
I hope this helps someone build their first search engine with Nutch 0.8 +
Hadoop :)
Best crawling
Roberto Navoni
HELP ME PLEASE Re: Hadoop and Nutch 0.8
Posted by Info <in...@radionav.it>.
Hi Renaud,
I tried that link, but it didn't solve my problem.
The problem is that the nutch-0.8 nightly build doesn't use the Linux file
system but the Hadoop file system.
So when the script looks up the segment name, it uses ls -d on the local
Linux file system.
Instead, I need a script that uses the Hadoop DFS, because my experimental
project requires it.
I have 4 Linux servers where I installed Nutch, and I use Hadoop to have a
distributed file system.
My first problem is merging the indexes.
My second problem is that when I try to connect to the slave servers by
ssh, they ask me for the password. I have seen the online Nutch + Hadoop
tutorial...
Is there someone who can help me?
Best regards
Roberto Navoni
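On the second problem (slaves prompting for a password), the usual fix is key-based ssh: generate a passwordless key on the master and append its public half to each slave's authorized_keys. A sketch only; the key path and the slave hostname are placeholders, not values from this thread.

```shell
# setup_passwordless_ssh: generate a key (if missing) and push the public
# half to one slave, so Hadoop's start scripts can log in without a prompt.
setup_passwordless_ssh() {
  keyfile=$1   # e.g. $HOME/.ssh/id_rsa
  slave=$2     # e.g. root@slave1 (placeholder hostname)
  # create a key with an empty passphrase if one does not already exist
  [ -f "$keyfile" ] || ssh-keygen -q -t rsa -N '' -f "$keyfile"
  # append the public key to the slave's authorized_keys
  ssh "$slave" 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys' < "$keyfile.pub"
}

# usage: setup_passwordless_ssh "$HOME/.ssh/id_rsa" root@slave1
```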
-----Original message-----
From: Info [mailto:info@radionav.it]
Sent: Saturday, 22 July 2006, 10:09
To: nutch-user@lucene.apache.org
Subject: Re: Hadoop and Recrawl
-----Original message-----
From: Renaud Richardet [mailto:renaud.richardet@wyona.com]
Sent: Friday, 21 July 2006, 22:24
To: nutch-user@lucene.apache.org
Subject: Re: Hadoop and Recrawl
Hi Roberto,
Did you try http://wiki.apache.org/nutch/IntranetRecrawl (thanks to
Matthew Holt)
HTH,
Renaud
Info wrote:
> Hi List
> I try to use this script with hadoop but don't work.
> I try to change ls with bin/hadoop dfs -ls
> But the script don't work because is ls -d and don't ls only.
> Someone can help me
> Best Regards
> Roberto Navoni
>
> -----Messaggio originale-----
> Da: Matthew Holt [mailto:mholt@redhat.com]
> Inviato: venerdì 21 luglio 2006 18.58
> A: nutch-user@lucene.apache.org
> Oggetto: Re: Recrawl script for 0.8.0 completed...
>
> Lourival Júnior wrote:
>
>> I thing it wont work with me because i'm using the Nutch version 0.7.2.
>> Actually I use this script (some comments are in Portuguese):
>>
>> #!/bin/bash
>>
>> # A simple script to run a Nutch re-crawl
>> # Fonte do script:
>> http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
>>
>> #{
>>
>> if [ -n "$1" ]
>> then
>> crawl_dir=$1
>> else
>> echo "Usage: recrawl crawl_dir [depth] [adddays]"
>> exit 1
>> fi
>>
>> if [ -n "$2" ]
>> then
>> depth=$2
>> else
>> depth=5
>> fi
>>
>> if [ -n "$3" ]
>> then
>> adddays=$3
>> else
>> adddays=0
>> fi
>>
>> webdb_dir=$crawl_dir/db
>> segments_dir=$crawl_dir/segments
>> index_dir=$crawl_dir/index
>>
>> #Para o serviço do TomCat
>> #net stop "Apache Tomcat"
>>
>> # The generate/fetch/update cycle
>> for ((i=1; i <= depth ; i++))
>> do
>> bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
>> segment=`ls -d $segments_dir/* | tail -1`
>> bin/nutch fetch $segment
>> bin/nutch updatedb $webdb_dir $segment
>> echo
>> echo "Fim do ciclo $i."
>> echo
>> done
>>
>> # Update segments
>> echo
>> echo "Atualizando os Segmentos..."
>> echo
>> mkdir tmp
>> bin/nutch updatesegs $webdb_dir $segments_dir tmp
>> rm -R tmp
>>
>> # Index segments
>> echo "Indexando os segmentos..."
>> echo
>> for segment in `ls -d $segments_dir/* | tail -$depth`
>> do
>> bin/nutch index $segment
>> done
>>
>> # De-duplicate indexes
>> # "bogus" argument is ignored but needed due to
>> # a bug in the number of args expected
>> bin/nutch dedup $segments_dir bogus
>>
>> # Merge indexes
>> #echo "Unindo os segmentos..."
>> #echo
>> ls -d $segments_dir/* | xargs bin/nutch merge $index_dir
>>
>> chmod 777 -R $index_dir
>>
>> #Inicia o serviço do TomCat
>> #net start "Apache Tomcat"
>>
>> echo "Done."
>>
>> #} > recrawl.log 2>&1
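An aside on the script above: the three positional-argument `if/else` blocks can be collapsed with bash default expansions. A sketch with equivalent behavior (here `set --` fakes the command line so the snippet is self-contained):

```shell
#!/bin/bash
# Compact equivalent of the argument handling in the recrawl script.
set -- mycrawl                 # simulate: ./recrawl mycrawl
crawl_dir=${1:?Usage: recrawl crawl_dir [depth] [adddays]}
depth=${2:-5}                  # $2 omitted -> default 5
adddays=${3:-0}                # $3 omitted -> default 0
echo "crawl_dir=$crawl_dir depth=$depth adddays=$adddays"
```

`${1:?...}` prints the usage message and exits when the first argument is missing, matching the intent of the original check.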
>>
>> As you suggested, I used the touch command instead of stopping Tomcat.
>> However, I get the error posted in my previous message. I'm running Nutch
>> on the Windows platform with Cygwin. I only get no errors when I stop
>> Tomcat. I use this command to call the script:
>>
>> ./recrawl crawl-legislacao 1
>>
>> Could you give me some more clarification?
>>
>> Thanks a lot!
>>
>> On 7/21/06, Matthew Holt <mh...@redhat.com> wrote:
>>
>>> Lourival Júnior wrote:
>>>
>>>> Hi Renaud!
>>>>
>>>> I'm new to shell scripts, and I know stopping the Tomcat service is not
>>>> the best way to do this. The problem is, when I run the re-crawl script
>>>> with Tomcat started I get this error:
>>>>
>>>> 060721 132224 merging segment indexes to: crawl-legislacao2\index
>>>> Exception in thread "main" java.io.IOException: Cannot delete _0.f0
>>>>     at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195)
>>>>     at org.apache.lucene.store.FSDirectory.init(FSDirectory.java:176)
>>>>     at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:141)
>>>>     at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:225)
>>>>     at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:92)
>>>>     at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:160)
>>>
>>>> So, I want another way to re-crawl my pages without this error and
>>>> without restarting Tomcat. Could you suggest one?
>>>>
>>>> Thanks a lot!
>>>>
>>>>
>>>>
>>> Try this updated script and tell me exactly what command you run to call
>>> the script. Then let me know the error message.
>>>
>>> Matt
>>>
>>>
>>> #!/bin/bash
>>>
>>> # Nutch recrawl script.
>>> # Based on the 0.7.2 script at
>>> # http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
>>> # Modified by Matthew Holt
>>>
>>> if [ -n "$1" ]
>>> then
>>> nutch_dir=$1
>>> else
>>> echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
>>> echo "servlet_path - Path of the Nutch servlet (e.g.
>>> /usr/local/tomcat/webapps/ROOT)"
>>> echo "crawl_dir - Name of the directory the crawl is located in."
>>> echo "[depth] - The link depth from the root page that should be
>>> crawled."
>>> echo "[adddays] - Advance the clock # of days for fetchlist
>>> generation."
>>> exit 1
>>> fi
>>>
>>> if [ -n "$2" ]
>>> then
>>> crawl_dir=$2
>>> else
>>> echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
>>> echo "servlet_path - Path of the Nutch servlet (e.g.
>>> /usr/local/tomcat/webapps/ROOT)"
>>> echo "crawl_dir - Name of the directory the crawl is located in."
>>> echo "[depth] - The link depth from the root page that should be
>>> crawled."
>>> echo "[adddays] - Advance the clock # of days for fetchlist
>>> generation."
>>> exit 1
>>> fi
>>>
>>> if [ -n "$3" ]
>>> then
>>> depth=$3
>>> else
>>> depth=5
>>> fi
>>>
>>> if [ -n "$4" ]
>>> then
>>> adddays=$4
>>> else
>>> adddays=0
>>> fi
>>>
>>> # Only change if your crawl subdirectories are named something different
>>> webdb_dir=$crawl_dir/crawldb
>>> segments_dir=$crawl_dir/segments
>>> linkdb_dir=$crawl_dir/linkdb
>>> index_dir=$crawl_dir/index
>>>
>>> # The generate/fetch/update cycle
>>> for ((i=1; i <= depth ; i++))
>>> do
>>> bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
>>> segment=`ls -d $segments_dir/* | tail -1`
>>> bin/nutch fetch $segment
>>> bin/nutch updatedb $webdb_dir $segment
>>> done
>>>
>>> # Update segments
>>> bin/nutch invertlinks $linkdb_dir -dir $segments_dir
>>>
>>> # Index segments
>>> new_indexes=$crawl_dir/newindexes
>>> #ls -d $segments_dir/* | tail -$depth | xargs
>>> bin/nutch index $new_indexes $webdb_dir $linkdb_dir $segments_dir/*
>>>
>>> # De-duplicate indexes
>>> bin/nutch dedup $new_indexes
>>>
>>> # Merge indexes
>>> bin/nutch merge $index_dir $new_indexes
>>>
>>> # Tell Tomcat to reload index
>>> touch $nutch_dir/WEB-INF/web.xml
>>>
>>> # Clean up
>>> rm -rf $new_indexes
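On the commented-out `tail -$depth` line in the script above: indexing only the newest segments rather than all of them works because Nutch segment directories are named by timestamp, so name order is time order. A standalone demo with made-up directory names:

```shell
# Pick the newest $depth segment directories by name.
tmp=$(mktemp -d)
mkdir -p "$tmp/segments/20060719000000" \
         "$tmp/segments/20060720000000" \
         "$tmp/segments/20060721000000"
depth=2
newest=$(ls -d "$tmp"/segments/* | tail -$depth)
echo "$newest"
rm -rf "$tmp"
```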
>>>
>>>
>>>
>>
> Oh yeah, you're right, the one I sent out was for 0.8. You should just
> be able to put this at the end of your script:
>
> # Tell Tomcat to reload index
> touch $nutch_dir/WEB-INF/web.xml
>
> and fill in the appropriate path, of course.
> Good luck,
> Matt
>
>
>
>
--
Renaud Richardet
COO America
Wyona Inc. - Open Source Content Management - Apache Lenya
office +1 857 776-3195 mobile +1 617 230 9112
renaud.richardet <at> wyona.com http://www.wyona.com
Re: Hadoop and Recrawl
Posted by Info <in...@radionav.it>.
-----Original message-----
From: Renaud Richardet [mailto:renaud.richardet@wyona.com]
Sent: Friday, 21 July 2006 22:24
To: nutch-user@lucene.apache.org
Subject: Re: Hadoop and Recrawl
Hi Roberto,
Did you try http://wiki.apache.org/nutch/IntranetRecrawl (thanks to
Matthew Holt)?
HTH,
Renaud
Re: Hadoop and Recrawl
Posted by Renaud Richardet <re...@wyona.com>.
Hi Roberto,
Did you try http://wiki.apache.org/nutch/IntranetRecrawl (thanks to
Matthew Holt)?
HTH,
Renaud