Posted to user@nutch.apache.org by Chetan Patel <ch...@webmail.aruhat.com> on 2008/09/15 13:43:21 UTC

Re: hadoop dfs -ls and nutch generate/fetch commands

Hi,

I have tried the recrawl script available at
http://wiki.apache.org/nutch/IntranetRecrawl.

I got the following error:

2008-09-15 17:04:32,238 INFO  fetcher.Fetcher - Fetcher: starting
2008-09-15 17:04:32,254 INFO  fetcher.Fetcher - Fetcher: segment:
google/segments/20080915170335
2008-09-15 17:04:32,972 FATAL fetcher.Fetcher - Fetcher:
java.io.IOException: Segment already fetched!
	at
org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:46)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:329)
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
	at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
	at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
	at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
	at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)

2008-09-15 17:04:35,144 INFO  crawl.CrawlDb - CrawlDb update: starting

Please help me solve this error.

Thanks in advance

Regards,
Chetan Patel



Hilkiah Lavinier wrote:
> 
> Hi,
> 
> I think I've come across an issue with the way hadoop lists files.  Or
> maybe it's just me... anyway, I'm using a modified version of the crawl
> script found on the wiki site.  I'm trying to ensure that the fetch
> operation always uses the latest segment generated by the generate
> operation, so the code looks like:
> 
> 
>  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
> 
>   echo "** $NUTCH generate $crawl/crawldb $crawl/segments $topN -adddays
> $adddays"
>   $NUTCH generate $crawl/crawldb $crawl/segments $topN -adddays $adddays
> 
>   if [ $? -ne 0 ]
>   then
>     echo "runbot: Stopping at depth $depth. No more URLs to fetch."
>     break
>   fi
> 
>   #debug
>   ls -l --sort=t -r $crawl/segments
>   $HADOOP dfs -ls $crawl/segments
> 
>   segment=`$HADOOP dfs -ls $crawl/segments | tail -1|sed -e
> 's/\/.*\/.*\/\(.*\/.*\/[0-9]*\).*/\1/'`
>   echo "** segment: $segment"
> 
>   echo "** $NUTCH fetch $segment -threads $threads"
>   $NUTCH fetch $segment -threads $threads
> 
> 
> 
> However, every so often the crawl fails, as per:
> 
> --- Beginning crawl at depth 1 of 1 ---
> ** /nutch/nutch-1.0-dev/bin/nutch generate crawl/crawldb crawl/segments 
> -adddays 24
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20080417204644
> Generator: filtering: true
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> generate return value: 0
> total 8
> drwxr-xr-x 7 hilkiah hilkiah 4096 2008-04-17 20:46 20080417204628
> drwxr-xr-x 3 hilkiah hilkiah 4096 2008-04-17 20:46 20080417204644
> Found 2 items
> /nutch/search/crawl/segments/20080417204644     <dir>           2008-04-17
> 20:46        rwxr-xr-x       hilkiah hilkiah
> /nutch/search/crawl/segments/20080417204628     <dir>           2008-04-17
> 20:46        rwxr-xr-x       hilkiah hilkiah
> ** segment: crawl/segments/20080417204628
> ** /nutch/nutch-1.0-dev/bin/nutch fetch crawl/segments/20080417204628
> -threads 1000
> Fetcher: starting
> Fetcher: segment: crawl/segments/20080417204628
> Fetcher: java.io.IOException: Segment already fetched!
>         at
> org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:49)
>         at
> org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:540)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:805)
>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:524)
>         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:559)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:531)
> 
> 
> 
> I think the problem is that hadoop doesn't return the latest segment
> directory in chronological order (or alphanumeric order).  First, is this
> a known issue, and if so, how do I work around it?  Secondly, since I
> believe I refetch/reindex all pages (I only do depth=1 and run a series
> of crawls at depth=1), can I safely delete the old segment/nnnnn folder
> before generating a new one?
> 
> Regards,
>  
> Hilkiah G. Lavinier MEng (Hons), ACGI 
> 6 Winston Lane, 
> Goodwill, 
> Roseau, Dominica
> 
> Mbl: (767) 275 3382
> Hm : (767) 440 3924
> Fax: (767) 440 4991
> VoIP USA: (646) 432 4487
> 
> 
> Email: hilkiah@yahoo.com
> Email: hilkiah.lavinier@gmail.com
> IM: Yahoo hilkiah / MSN hilkiahlavinier@hotmail.com
> IM: ICQ #8978201  / AOL hilkiah21
> 
> 
> 
> 
> 
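Since segment directory names are timestamps (YYYYMMDDhhmmss), a plain
lexicographic sort of the listing is also a chronological sort, so sorting
before taking the last entry always yields the newest segment, whatever
order hadoop dfs -ls happens to return. A minimal sketch of such a fix,
assuming (as in the output above) that this version of hadoop dfs -ls
prints the full segment path in the first column:

  # Sort the segment listing explicitly instead of trusting the order
  # of hadoop dfs -ls; segment names are timestamps, so lexicographic
  # order is chronological order.
  latest=`$HADOOP dfs -ls $crawl/segments | grep '/segments/' \
    | awk '{print $1}' | sed -e 's|.*/||' | sort | tail -1`
  segment=$crawl/segments/$latest
  echo "** segment: $segment"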



Re: hadoop dfs -ls and nutch generate/fetch commands

Posted by Chetan Patel <ch...@webmail.aruhat.com>.
Hi,

I have a new problem.

When I run the recrawling script, it works fine the first time.

When I try it again, it fails.

See the attached log file for the error:
http://www.nabble.com/file/p19493344/hadoop.log hadoop.log 

Please suggest a solution.

Thanks in advance.

Regards,
Chetan Patel



Chetan Patel wrote:
> 
> Hi Doğacan Güney,
> 
> Thanks for the solution.
> 
> Is it possible to recrawl without removing files?
> 
> Thank you again.
> 
> Regards,
> Chetan Patel
> 
> 
> Doğacan Güney-3 wrote:
>> 
>> Hi,
>> 
>> On Mon, Sep 15, 2008 at 2:43 PM, Chetan Patel <ch...@webmail.aruhat.com>
>> wrote:
>>>
>>> Hi,
>>>
>>> I have tried the recrawl script available at
>>> http://wiki.apache.org/nutch/IntranetRecrawl.
>>>
>>> I got the following error:
>>>
>>> 2008-09-15 17:04:32,238 INFO  fetcher.Fetcher - Fetcher: starting
>>> 2008-09-15 17:04:32,254 INFO  fetcher.Fetcher - Fetcher: segment:
>>> google/segments/20080915170335
>>> 2008-09-15 17:04:32,972 FATAL fetcher.Fetcher - Fetcher:
>>> java.io.IOException: Segment already fetched!
>>>        at
>>> org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:46)
>>>        at
>>> org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:329)
>>>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
>>>        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
>>>        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
>>>        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
>>>        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)
>>>
>>> 2008-09-15 17:04:35,144 INFO  crawl.CrawlDb - CrawlDb update: starting
>>>
>>> Please help me solve this error.
>>>
>> 
>> The segment you are trying to crawl has already been fetched. Try
>> removing everything except crawl_generate under that segment.
>> 
>>> Thanks in advance
>>>
>>> Regards,
>>> Chetan Patel
>>>
>>>
>>>
>> 
>> 
>> 
>> 
>> -- 
>> Doğacan Güney
>> 
>> 
> 
> 



Re: hadoop dfs -ls and nutch generate/fetch commands

Posted by Dennis Kubes <ku...@apache.org>.

Chetan Patel wrote:
> Hi Doğacan Güney,
> 
> Thanks for the solution.
> 
> Is it possible to recrawl without removing files?
> 

Yes and no.  If the segment is already fetched, there is no fetch-again 
or update mode.  You can copy the segment to a new directory, rename the 
old directory, rename the new directory to the old name, do as Dogacan 
suggested and remove everything except crawl_generate in the copied 
directory, then fetch again.
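
A minimal sketch of that sequence, assuming dfs -cp copies directories
recursively, that $HADOOP and $NUTCH are set as in the script quoted
earlier, and the usual segment subdirectories (content, crawl_fetch,
crawl_parse, parse_data, parse_text):

  # The already-fetched segment from the log above.
  seg=google/segments/20080915170335

  # Copy the segment, set the fetched original aside, and move the
  # copy into place under the old name.
  $HADOOP dfs -cp $seg ${seg}.copy
  $HADOOP dfs -mv $seg ${seg}.fetched
  $HADOOP dfs -mv ${seg}.copy $seg

  # Strip everything except crawl_generate so the fetcher accepts it.
  for d in content crawl_fetch crawl_parse parse_data parse_text; do
    $HADOOP dfs -rmr $seg/$d
  done

  # Fetch again (thread count is illustrative).
  $NUTCH fetch $seg -threads 50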

Dennis

> Thank you again.
> 
> Regards,
> Chetan Patel
> 
> 
> Doğacan Güney-3 wrote:
>> Hi,
>>
>> On Mon, Sep 15, 2008 at 2:43 PM, Chetan Patel <ch...@webmail.aruhat.com>
>> wrote:
>>> Hi,
>>>
>>> I have tried the recrawl script available at
>>> http://wiki.apache.org/nutch/IntranetRecrawl.
>>>
>>> I got the following error:
>>>
>>> 2008-09-15 17:04:32,238 INFO  fetcher.Fetcher - Fetcher: starting
>>> 2008-09-15 17:04:32,254 INFO  fetcher.Fetcher - Fetcher: segment:
>>> google/segments/20080915170335
>>> 2008-09-15 17:04:32,972 FATAL fetcher.Fetcher - Fetcher:
>>> java.io.IOException: Segment already fetched!
>>>        at
>>> org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:46)
>>>        at
>>> org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:329)
>>>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
>>>        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
>>>        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
>>>        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
>>>        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)
>>>
>>> 2008-09-15 17:04:35,144 INFO  crawl.CrawlDb - CrawlDb update: starting
>>>
>>> Please help me solve this error.
>>>
>> The segment you are trying to crawl has already been fetched. Try
>> removing everything except crawl_generate under that segment.
>>
>>> Thanks in advance
>>>
>>> Regards,
>>> Chetan Patel
>>>
>>>
>>>
>>
>>
>>
>> -- 
>> Doğacan Güney
>>
>>
> 

Re: hadoop dfs -ls and nutch generate/fetch commands

Posted by Chetan Patel <ch...@webmail.aruhat.com>.
Hi Doğacan Güney,

Thanks for the solution.

Is it possible to recrawl without removing files?

Thank you again.

Regards,
Chetan Patel


Doğacan Güney-3 wrote:
> 
> Hi,
> 
> On Mon, Sep 15, 2008 at 2:43 PM, Chetan Patel <ch...@webmail.aruhat.com>
> wrote:
>>
>> Hi,
>>
>> I have tried the recrawl script available at
>> http://wiki.apache.org/nutch/IntranetRecrawl.
>>
>> I got the following error:
>>
>> 2008-09-15 17:04:32,238 INFO  fetcher.Fetcher - Fetcher: starting
>> 2008-09-15 17:04:32,254 INFO  fetcher.Fetcher - Fetcher: segment:
>> google/segments/20080915170335
>> 2008-09-15 17:04:32,972 FATAL fetcher.Fetcher - Fetcher:
>> java.io.IOException: Segment already fetched!
>>        at
>> org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:46)
>>        at
>> org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:329)
>>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
>>        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
>>        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
>>        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
>>        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)
>>
>> 2008-09-15 17:04:35,144 INFO  crawl.CrawlDb - CrawlDb update: starting
>>
>> Please help me solve this error.
>>
> 
> The segment you are trying to crawl has already been fetched. Try
> removing everything except crawl_generate under that segment.
> 
>> Thanks in advance
>>
>> Regards,
>> Chetan Patel
>>
>>
>>
> 
> 
> 
> 
> -- 
> Doğacan Güney
> 
> 



Re: hadoop dfs -ls and nutch generate/fetch commands

Posted by Doğacan Güney <do...@gmail.com>.
Hi,

On Mon, Sep 15, 2008 at 2:43 PM, Chetan Patel <ch...@webmail.aruhat.com> wrote:
>
> Hi,
>
> I have tried the recrawl script available at
> http://wiki.apache.org/nutch/IntranetRecrawl.
>
> I got the following error:
>
> 2008-09-15 17:04:32,238 INFO  fetcher.Fetcher - Fetcher: starting
> 2008-09-15 17:04:32,254 INFO  fetcher.Fetcher - Fetcher: segment:
> google/segments/20080915170335
> 2008-09-15 17:04:32,972 FATAL fetcher.Fetcher - Fetcher:
> java.io.IOException: Segment already fetched!
>        at
> org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:46)
>        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:329)
>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
>        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
>        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
>        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
>        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)
>
> 2008-09-15 17:04:35,144 INFO  crawl.CrawlDb - CrawlDb update: starting
>
> Please help me solve this error.
>

The segment you are trying to crawl has already been fetched. Try removing
everything except crawl_generate under that segment.
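
For example, a minimal sketch, assuming the standard segment layout
(only crawl_generate must remain) and $HADOOP pointing at bin/hadoop as
in the quoted script below:

  seg=google/segments/20080915170335
  for d in content crawl_fetch crawl_parse parse_data parse_text; do
    $HADOOP dfs -rmr $seg/$d
  done

After that the segment contains only crawl_generate, so the
"Segment already fetched!" check passes and it can be fetched again.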

> Thanks in advance
>
> Regards,
> Chetan Patel
>
>
>
> Hilkiah Lavinier wrote:
>>
>> Hi,
>>
>> I think I've come across an issue with the way hadoop lists files.  Or
>> maybe it's just me... anyway, I'm using a modified version of the crawl
>> script found on the wiki site.  I'm trying to ensure that the fetch
>> operation always uses the latest segment generated by the generate
>> operation, so the code looks like:
>>
>>
>>  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
>>
>>   echo "** $NUTCH generate $crawl/crawldb $crawl/segments $topN -adddays
>> $adddays"
>>   $NUTCH generate $crawl/crawldb $crawl/segments $topN -adddays $adddays
>>
>>   if [ $? -ne 0 ]
>>   then
>>     echo "runbot: Stopping at depth $depth. No more URLs to fetch."
>>     break
>>   fi
>>
>>   #debug
>>   ls -l --sort=t -r $crawl/segments
>>   $HADOOP dfs -ls $crawl/segments
>>
>>   segment=`$HADOOP dfs -ls $crawl/segments | tail -1|sed -e
>> 's/\/.*\/.*\/\(.*\/.*\/[0-9]*\).*/\1/'`
>>   echo "** segment: $segment"
>>
>>   echo "** $NUTCH fetch $segment -threads $threads"
>>   $NUTCH fetch $segment -threads $threads
>>
>>
>>
>> However, every so often the crawl fails, as per:
>>
>> --- Beginning crawl at depth 1 of 1 ---
>> ** /nutch/nutch-1.0-dev/bin/nutch generate crawl/crawldb crawl/segments
>> -adddays 24
>> Generator: Selecting best-scoring urls due for fetch.
>> Generator: starting
>> Generator: segment: crawl/segments/20080417204644
>> Generator: filtering: true
>> Generator: jobtracker is 'local', generating exactly one partition.
>> Generator: Partitioning selected urls by host, for politeness.
>> Generator: done.
>> generate return value: 0
>> total 8
>> drwxr-xr-x 7 hilkiah hilkiah 4096 2008-04-17 20:46 20080417204628
>> drwxr-xr-x 3 hilkiah hilkiah 4096 2008-04-17 20:46 20080417204644
>> Found 2 items
>> /nutch/search/crawl/segments/20080417204644     <dir>           2008-04-17
>> 20:46        rwxr-xr-x       hilkiah hilkiah
>> /nutch/search/crawl/segments/20080417204628     <dir>           2008-04-17
>> 20:46        rwxr-xr-x       hilkiah hilkiah
>> ** segment: crawl/segments/20080417204628
>> ** /nutch/nutch-1.0-dev/bin/nutch fetch crawl/segments/20080417204628
>> -threads 1000
>> Fetcher: starting
>> Fetcher: segment: crawl/segments/20080417204628
>> Fetcher: java.io.IOException: Segment already fetched!
>>         at
>> org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:49)
>>         at
>> org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:540)
>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:805)
>>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:524)
>>         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:559)
>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:531)
>>
>>
>>
>> I think the problem is that hadoop doesn't return the latest segment
>> directory in chronological order (or alphanumeric order).  First, is this
>> a known issue, and if so, how do I work around it?  Secondly, since I
>> believe I refetch/reindex all pages (I only do depth=1 and run a series
>> of crawls at depth=1), can I safely delete the old segment/nnnnn folder
>> before generating a new one?
>>
>> Regards,
>>
>> Hilkiah G. Lavinier MEng (Hons), ACGI
>> 6 Winston Lane,
>> Goodwill,
>> Roseau, Dominica
>>
>> Mbl: (767) 275 3382
>> Hm : (767) 440 3924
>> Fax: (767) 440 4991
>> VoIP USA: (646) 432 4487
>>
>>
>> Email: hilkiah@yahoo.com
>> Email: hilkiah.lavinier@gmail.com
>> IM: Yahoo hilkiah / MSN hilkiahlavinier@hotmail.com
>> IM: ICQ #8978201  / AOL hilkiah21
>>
>>
>>
>>
>>
>>
>>
>>
>
>
>



-- 
Doğacan Güney