Posted to user@nutch.apache.org by Info <in...@radionav.it> on 2006/07/22 10:16:20 UTC

HELP ME PLEASE Re: Hadoop and Nutch 0.8

Hi Renaud,
I tried that link, but it didn't solve my problem.
The problem is that the nutch-0.8 nightly build doesn't use the local Linux
file system but the Hadoop file system (DFS).
So when I try to find the segment names, the script runs ls -d on the local
Linux file system. Instead I need a script that uses the Hadoop DFS, because
my experimental project requires it.
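A minimal sketch of what such a lookup might look like, assuming the 0.8-era
"bin/hadoop dfs -ls" output where the path is the first column (the paths and
the parsing details below are illustrative, not taken from this thread):

```shell
# Hypothetical sketch: select the newest segment from the Hadoop DFS instead
# of the local "ls -d". Segment names are timestamps, so a lexical sort finds
# the latest one.
latest_segment() {
  # Keep the first column (the path), drop header lines, take the newest.
  awk '{print $1}' | grep '/segments/' | sort | tail -1
}
# On the cluster, something like:
#   segment=$(bin/hadoop dfs -ls crawl/segments | latest_segment)
#   bin/nutch fetch "$segment"
```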
I have 4 Linux servers where I installed Nutch, and I use Hadoop to get a
distributed file system.

My first problem is merging the indexes.
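For the merge step, one possibility is a command sketch along these lines,
assuming the 0.8 dedup/merge tools and that the crawl directories live on the
DFS (the paths are examples, not taken from this thread):

```shell
# Command sketch: de-duplicate the per-segment indexes, then merge them into
# a single index. "crawl/indexes" and "crawl/index" are example DFS paths.
bin/nutch dedup crawl/indexes
bin/nutch merge crawl/index crawl/indexes
```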
My second problem is that when I try to connect to the slave servers over
ssh, they ask me for a password. I have seen the online Nutch + Hadoop
tutorial...
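For the ssh password prompts, the usual fix is key-based authentication,
which the Hadoop start scripts rely on. A configuration sketch (the hostnames
and the key path are examples):

```shell
# Configuration sketch: let the master ssh to each slave without a password.
# Generate a key pair with an empty passphrase, then append the public key to
# each slave's authorized_keys. "slave1".."slave3" are example hostnames.
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
for host in slave1 slave2 slave3; do
  cat ~/.ssh/id_rsa.pub \
    | ssh "$host" 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys'
done
```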

Is there someone who can help me?
Best regards,
Roberto Navoni

-----Original Message-----
From: Info [mailto:info@radionav.it]
Sent: Saturday, July 22, 2006 10:09 AM
To: nutch-user@lucene.apache.org
Subject: Re: Hadoop and Recrawl



-----Original Message-----
From: Renaud Richardet [mailto:renaud.richardet@wyona.com]
Sent: Friday, July 21, 2006 10:24 PM
To: nutch-user@lucene.apache.org
Subject: Re: Hadoop and Recrawl

Hi Roberto,

Did you try http://wiki.apache.org/nutch/IntranetRecrawl (thanks to
Matthew Holt)?

HTH,
Renaud


Info wrote:
> Hi list,
> I tried to use this script with Hadoop, but it doesn't work.
> I tried replacing ls with bin/hadoop dfs -ls,
> but the script still fails because it uses ls -d, not plain ls.
> Can someone help me?
> Best regards,
> Roberto Navoni
>
> -----Original Message-----
> From: Matthew Holt [mailto:mholt@redhat.com]
> Sent: Friday, July 21, 2006 6:58 PM
> To: nutch-user@lucene.apache.org
> Subject: Re: Recrawl script for 0.8.0 completed...
>
> Lourival Júnior wrote:
>   
>> I think it won't work for me because I'm using Nutch version 0.7.2.
>> Actually I use this script (some comments were originally in Portuguese):
>>
>> #!/bin/bash
>>
>> # A simple script to run a Nutch re-crawl
>> # Script source:
>> http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
>>
>> #{
>>
>> if [ -n "$1" ]
>> then
>>  crawl_dir=$1
>> else
>>  echo "Usage: recrawl crawl_dir [depth] [adddays]"
>>  exit 1
>> fi
>>
>> if [ -n "$2" ]
>> then
>>  depth=$2
>> else
>>  depth=5
>> fi
>>
>> if [ -n "$3" ]
>> then
>>  adddays=$3
>> else
>>  adddays=0
>> fi
>>
>> webdb_dir=$crawl_dir/db
>> segments_dir=$crawl_dir/segments
>> index_dir=$crawl_dir/index
>>
>> # Stop the Tomcat service
>> #net stop "Apache Tomcat"
>>
>> # The generate/fetch/update cycle
>> for ((i=1; i <= depth ; i++))
>> do
>>  bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
>>  segment=`ls -d $segments_dir/* | tail -1`
>>  bin/nutch fetch $segment
>>  bin/nutch updatedb $webdb_dir $segment
>>  echo
>>  echo "End of cycle $i."
>>  echo
>> done
>>
>> # Update segments
>> echo
>> echo "Updating segments..."
>> echo
>> mkdir tmp
>> bin/nutch updatesegs $webdb_dir $segments_dir tmp
>> rm -R tmp
>>
>> # Index segments
>> echo "Indexing segments..."
>> echo
>> for segment in `ls -d $segments_dir/* | tail -$depth`
>> do
>>  bin/nutch index $segment
>> done
>>
>> # De-duplicate indexes
>> # "bogus" argument is ignored but needed due to
>> # a bug in the number of args expected
>> bin/nutch dedup $segments_dir bogus
>>
>> # Merge indexes
>> #echo "Merging segments..."
>> #echo
>> ls -d $segments_dir/* | xargs bin/nutch merge $index_dir
>>
>> chmod 777 -R $index_dir
>>
>> # Start the Tomcat service
>> #net start "Apache Tomcat"
>>
>> echo "Done."
>>
>> #} > recrawl.log 2>&1
>>
>> As you suggested, I used the touch command instead of stopping Tomcat.
>> However, I get the error posted in my previous message. I'm running Nutch
>> on Windows with Cygwin. I only get no errors when I stop Tomcat. I use
>> this command to call the script:
>>
>> ./recrawl crawl-legislacao 1
>>
>> Could you give me more clarifications?
>>
>> Thanks a lot!
>>
>> On 7/21/06, Matthew Holt <mh...@redhat.com> wrote:
>>     
>>> Lourival Júnior wrote:
>>>       
>>>> Hi Renaud!
>>>>
>>>> I'm a newbie with shell scripts, and I know stopping the Tomcat service
>>>> is not the best way to do this. The problem is that when I run the
>>>> re-crawl script with Tomcat started, I get this error:
>>>>
>>>> 060721 132224 merging segment indexes to: crawl-legislacao2\index
>>>> Exception in thread "main" java.io.IOException: Cannot delete _0.f0
>>>>        at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195)
>>>>        at org.apache.lucene.store.FSDirectory.init(FSDirectory.java:176)
>>>>        at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:141)
>>>>        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:225)
>>>>        at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:92)
>>>>        at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:160)
>>>> So I want a way to re-crawl my pages without this error and without
>>>> restarting Tomcat. Could you suggest one?
>>>>
>>>> Thanks a lot!
>>>>
>>>>
>>>>         
>>> Try this updated script and tell me exactly what command you run to call
>>> the script. Let me know the error message then.
>>>
>>> Matt
>>>
>>>
>>> #!/bin/bash
>>>
>>> # Nutch recrawl script.
>>> # Based on the 0.7.2 script at
>>> # http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
>>> # Modified by Matthew Holt
>>>
>>> if [ -n "$1" ]
>>> then
>>>   nutch_dir=$1
>>> else
>>>   echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
>>>   echo "servlet_path - Path of the nutch servlet (i.e.
>>> /usr/local/tomcat/webapps/ROOT)"
>>>   echo "crawl_dir - Name of the directory the crawl is located in."
>>>   echo "[depth] - The link depth from the root page that should be
>>> crawled."
>>>   echo "[adddays] - Advance the clock # of days for fetchlist 
>>> generation."
>>>   exit 1
>>> fi
>>>
>>> if [ -n "$2" ]
>>> then
>>>   crawl_dir=$2
>>> else
>>>   echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
>>>   echo "servlet_path - Path of the nutch servlet (i.e.
>>> /usr/local/tomcat/webapps/ROOT)"
>>>   echo "crawl_dir - Name of the directory the crawl is located in."
>>>   echo "[depth] - The link depth from the root page that should be
>>> crawled."
>>>   echo "[adddays] - Advance the clock # of days for fetchlist 
>>> generation."
>>>   exit 1
>>> fi
>>>
>>> if [ -n "$3" ]
>>> then
>>>   depth=$3
>>> else
>>>   depth=5
>>> fi
>>>
>>> if [ -n "$4" ]
>>> then
>>>   adddays=$4
>>> else
>>>   adddays=0
>>> fi
>>>
>>> # Only change if your crawl subdirectories are named something different
>>> webdb_dir=$crawl_dir/crawldb
>>> segments_dir=$crawl_dir/segments
>>> linkdb_dir=$crawl_dir/linkdb
>>> index_dir=$crawl_dir/index
>>>
>>> # The generate/fetch/update cycle
>>> for ((i=1; i <= depth ; i++))
>>> do
>>>   bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
>>>   segment=`ls -d $segments_dir/* | tail -1`
>>>   bin/nutch fetch $segment
>>>   bin/nutch updatedb $webdb_dir $segment
>>> done
>>>
>>> # Update segments
>>> bin/nutch invertlinks $linkdb_dir -dir $segments_dir
>>>
>>> # Index segments
>>> new_indexes=$crawl_dir/newindexes
>>> #ls -d $segments_dir/* | tail -$depth | xargs
>>> bin/nutch index $new_indexes $webdb_dir $linkdb_dir $segments_dir/*
>>>
>>> # De-duplicate indexes
>>> bin/nutch dedup $new_indexes
>>>
>>> # Merge indexes
>>> bin/nutch merge $index_dir $new_indexes
>>>
>>> # Tell Tomcat to reload index
>>> touch $nutch_dir/WEB-INF/web.xml
>>>
>>> # Clean up
>>> rm -rf $new_indexes
>>>
>>>
>>>       
>>     
> Oh yeah, you're right, the one I sent out was for 0.8... you should just
> be able to put this at the end of your script:
>
> # Tell Tomcat to reload index
> touch $nutch_dir/WEB-INF/web.xml
>
> and fill in the appropriate path, of course.
> Good luck,
> Matt
>
>
>
>   

-- 
Renaud Richardet
COO America
Wyona Inc.  -   Open Source Content Management   -   Apache Lenya
office +1 857 776-3195                     mobile +1 617 230 9112
renaud.richardet <at> wyona.com              http://www.wyona.com




-- 
No virus found in this incoming message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.10.3/394 - Release Date: 20/07/2006



