Posted to user@nutch.apache.org by Chetan Patel <ch...@webmail.aruhat.com> on 2008/09/15 13:43:21 UTC
Re: hadoop dfs -ls and nutch generate/fetch commands
Hi,
I have tried the recrawl script available at
http://wiki.apache.org/nutch/IntranetRecrawl.
I got the following error.
2008-09-15 17:04:32,238 INFO fetcher.Fetcher - Fetcher: starting
2008-09-15 17:04:32,254 INFO fetcher.Fetcher - Fetcher: segment:
google/segments/20080915170335
2008-09-15 17:04:32,972 FATAL fetcher.Fetcher - Fetcher:
java.io.IOException: Segment already fetched!
at
org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:46)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:329)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)
2008-09-15 17:04:35,144 INFO crawl.CrawlDb - CrawlDb update: starting
Please help me to solve this error.
Thanks in advance
Regards,
Chetan Patel
Hilkiah Lavinier wrote:
>
> Hi,
>
> I think I've come across an issue with the way hadoop lists files. Or
> maybe it's just me... anyway, I'm using a modified version of the crawl
> script found on the wiki site. I'm trying to ensure that the fetch
> operation always uses the latest segment generated by the generate
> operation. Thus the code looks like:
>
>
> echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
>
> echo "** $NUTCH generate $crawl/crawldb $crawl/segments $topN -adddays
> $adddays"
> $NUTCH generate $crawl/crawldb $crawl/segments $topN -adddays $adddays
>
> if [ $? -ne 0 ]
> then
>     echo "runbot: Stopping at depth $depth. No more URLs to fetch."
>     break
> fi
>
> #debug
> ls -l --sort=t -r $crawl/segments
> $HADOOP dfs -ls $crawl/segments
>
> segment=`$HADOOP dfs -ls $crawl/segments | tail -1|sed -e
> 's/\/.*\/.*\/\(.*\/.*\/[0-9]*\).*/\1/'`
> echo "** segment: $segment"
>
> echo "** $NUTCH fetch $segment -threads $threads"
> $NUTCH fetch $segment -threads $threads
>
>
>
> However, every so often the crawl fails, as per:
>
> --- Beginning crawl at depth 1 of 1 ---
> ** /nutch/nutch-1.0-dev/bin/nutch generate crawl/crawldb crawl/segments
> -adddays 24
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20080417204644
> Generator: filtering: true
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> generate return value: 0
> total 8
> drwxr-xr-x 7 hilkiah hilkiah 4096 2008-04-17 20:46 20080417204628
> drwxr-xr-x 3 hilkiah hilkiah 4096 2008-04-17 20:46 20080417204644
> Found 2 items
> /nutch/search/crawl/segments/20080417204644 <dir> 2008-04-17
> 20:46 rwxr-xr-x hilkiah hilkiah
> /nutch/search/crawl/segments/20080417204628 <dir> 2008-04-17
> 20:46 rwxr-xr-x hilkiah hilkiah
> ** segment: crawl/segments/20080417204628
> ** /nutch/nutch-1.0-dev/bin/nutch fetch crawl/segments/20080417204628
> -threads 1000
> Fetcher: starting
> Fetcher: segment: crawl/segments/20080417204628
> Fetcher: java.io.IOException: Segment already fetched!
> at
> org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:49)
> at
> org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:540)
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:805)
> at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:524)
> at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:559)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:531)
>
>
>
> I think the problem is that hadoop doesn't return the latest segment
> directory in chronological order (or alphanumeric order). First, is this a
> known issue and if so, how do I work around it? Secondly, since I believe
> I refetch/reindex all pages (I only do depth=1 and run a series of crawls
> at depth=1), can I safely delete the old segment/nnnnn folder before
> generating a new one?
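Because Nutch segment names are fetchlist-generation timestamps, a plain
lexicographic sort of the listed paths is also a chronological sort. A
minimal workaround sketch for the script above (assuming the listing
format shown earlier, where the path is the first column):

  # keep only the segment path lines, take the first column (the path),
  # sort lexicographically, and use the last (newest) entry; the path may
  # come back absolute, which Hadoop and Nutch accept as well
  segment=`$HADOOP dfs -ls $crawl/segments | awk '/segments\// {print $1}' | sort | tail -1`
  echo "** newest segment: $segment"

Sorting makes the choice independent of whatever order -ls happens to
return the entries in.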
>
> Regards,
>
> Hilkiah G. Lavinier MEng (Hons), ACGI
> 6 Winston Lane,
> Goodwill,
> Roseau, Dominica
>
> Mbl: (767) 275 3382
> Hm : (767) 440 3924
> Fax: (767) 440 4991
> VoIP USA: (646) 432 4487
>
>
> Email: hilkiah@yahoo.com
> Email: hilkiah.lavinier@gmail.com
> IM: Yahoo hilkiah / MSN hilkiahlavinier@hotmail.com
> IM: ICQ #8978201 / AOL hilkiah21
>
>
>
>
>
>
>
>
--
View this message in context: http://www.nabble.com/hadoop-dfs--ls-and-nutch-generate-fetch-commands-tp16758617p19491397.html
Re: hadoop dfs -ls and nutch generate/fetch commands
Posted by Chetan Patel <ch...@webmail.aruhat.com>.
Hi,
I have a new problem.
When I run the recrawling script, it works fine the first time.
When I try it again, it fails.
See the attached log file for the error:
http://www.nabble.com/file/p19493344/hadoop.log hadoop.log
Please give me a solution.
Thanks in advance.
Regards,
Chetan Patel
Chetan Patel wrote:
>
> Hi Doğacan Güney,
>
> Thanks for giving the solution.
>
> Is it possible to recrawl without removing files?
>
> Thank you again.
>
> Regards,
> Chetan Patel
>
>
> Doğacan Güney-3 wrote:
>>
>> Hi,
>>
>> On Mon, Sep 15, 2008 at 2:43 PM, Chetan Patel <ch...@webmail.aruhat.com>
>> wrote:
>>>
>>> Hi,
>>>
>>> I have tried the recrawl script available at
>>> http://wiki.apache.org/nutch/IntranetRecrawl.
>>>
>>> I got the following error.
>>>
>>> 2008-09-15 17:04:32,238 INFO fetcher.Fetcher - Fetcher: starting
>>> 2008-09-15 17:04:32,254 INFO fetcher.Fetcher - Fetcher: segment:
>>> google/segments/20080915170335
>>> 2008-09-15 17:04:32,972 FATAL fetcher.Fetcher - Fetcher:
>>> java.io.IOException: Segment already fetched!
>>> at
>>> org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:46)
>>> at
>>> org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:329)
>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
>>> at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
>>> at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
>>> at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
>>> at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)
>>>
>>> 2008-09-15 17:04:35,144 INFO crawl.CrawlDb - CrawlDb update: starting
>>>
>>> Please help me to solve this error.
>>>
>>
>> The segment you are trying to crawl has already been fetched. Try removing
>> everything but crawl_generate under that segment.
>>
>>> Thanks in advance
>>>
>>> Regards,
>>> Chetan Patel
>>>
>>>
>>>
>>
>>
>>
>>
>> --
>> Doğacan Güney
>>
>>
>
>
--
View this message in context: http://www.nabble.com/hadoop-dfs--ls-and-nutch-generate-fetch-commands-tp16758617p19493344.html
Re: hadoop dfs -ls and nutch generate/fetch commands
Posted by Dennis Kubes <ku...@apache.org>.
Chetan Patel wrote:
> Hi Doğacan Güney,
>
> Thanks for giving the solution.
>
> Is it possible to recrawl without removing files?
>
Yes and no. If the segment is already fetched, there is no fetch-again
or update mode. You can copy the segment to a new directory, rename the
old directory, rename the new directory to the old name, then, as Dogacan
suggested, remove everything except crawl_generate in the copied
directory, and fetch again.
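A minimal sketch of that sequence, assuming the segment lives on the
local filesystem (on HDFS the same steps work with $HADOOP dfs
-cp/-mv/-rmr) and using a hypothetical segment path:

  seg=crawl/segments/20080915170335    # hypothetical path; use the one that failed
  cp -r $seg ${seg}.new                # copy the segment to a new directory
  mv $seg ${seg}.fetched               # rename the old, already-fetched directory
  mv ${seg}.new $seg                   # rename the copy to the original name
  # remove everything except crawl_generate from the copy, so the
  # fetcher no longer sees it as already fetched
  for d in content crawl_fetch crawl_parse parse_data parse_text; do
      rm -rf $seg/$d
  done
  $NUTCH fetch $seg -threads $threads  # fetch again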
Dennis
> Thank you again.
>
> Regards,
> Chetan Patel
>
>
> Doğacan Güney-3 wrote:
>> Hi,
>>
>> On Mon, Sep 15, 2008 at 2:43 PM, Chetan Patel <ch...@webmail.aruhat.com>
>> wrote:
>>> Hi,
>>>
>>> I have tried the recrawl script available at
>>> http://wiki.apache.org/nutch/IntranetRecrawl.
>>>
>>> I got the following error.
>>>
>>> 2008-09-15 17:04:32,238 INFO fetcher.Fetcher - Fetcher: starting
>>> 2008-09-15 17:04:32,254 INFO fetcher.Fetcher - Fetcher: segment:
>>> google/segments/20080915170335
>>> 2008-09-15 17:04:32,972 FATAL fetcher.Fetcher - Fetcher:
>>> java.io.IOException: Segment already fetched!
>>> at
>>> org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:46)
>>> at
>>> org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:329)
>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
>>> at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
>>> at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
>>> at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
>>> at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)
>>>
>>> 2008-09-15 17:04:35,144 INFO crawl.CrawlDb - CrawlDb update: starting
>>>
>>> Please help me to solve this error.
>>>
>> The segment you are trying to crawl has already been fetched. Try removing
>> everything but crawl_generate under that segment.
>>
>>> Thanks in advance
>>>
>>> Regards,
>>> Chetan Patel
>>>
>>>
>>>
>>
>>
>>
>> --
>> Doğacan Güney
>>
>>
>
Re: hadoop dfs -ls and nutch generate/fetch commands
Posted by Chetan Patel <ch...@webmail.aruhat.com>.
Hi Doğacan Güney,
Thanks for giving the solution.
Is it possible to recrawl without removing files?
Thank you again.
Regards,
Chetan Patel
Doğacan Güney-3 wrote:
>
> Hi,
>
> On Mon, Sep 15, 2008 at 2:43 PM, Chetan Patel <ch...@webmail.aruhat.com>
> wrote:
>>
>> Hi,
>>
>> I have tried the recrawl script available at
>> http://wiki.apache.org/nutch/IntranetRecrawl.
>>
>> I got the following error.
>>
>> 2008-09-15 17:04:32,238 INFO fetcher.Fetcher - Fetcher: starting
>> 2008-09-15 17:04:32,254 INFO fetcher.Fetcher - Fetcher: segment:
>> google/segments/20080915170335
>> 2008-09-15 17:04:32,972 FATAL fetcher.Fetcher - Fetcher:
>> java.io.IOException: Segment already fetched!
>> at
>> org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:46)
>> at
>> org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:329)
>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
>> at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
>> at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
>> at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
>> at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)
>>
>> 2008-09-15 17:04:35,144 INFO crawl.CrawlDb - CrawlDb update: starting
>>
>> Please help me to solve this error.
>>
>
> The segment you are trying to crawl has already been fetched. Try removing
> everything but crawl_generate under that segment.
>
>> Thanks in advance
>>
>> Regards,
>> Chetan Patel
>>
>>
>>
>
>
>
>
> --
> Doğacan Güney
>
>
--
View this message in context: http://www.nabble.com/hadoop-dfs--ls-and-nutch-generate-fetch-commands-tp16758617p19491939.html
Re: hadoop dfs -ls and nutch generate/fetch commands
Posted by Doğacan Güney <do...@gmail.com>.
Hi,
On Mon, Sep 15, 2008 at 2:43 PM, Chetan Patel <ch...@webmail.aruhat.com> wrote:
>
> Hi,
>
> I have tried the recrawl script available at
> http://wiki.apache.org/nutch/IntranetRecrawl.
>
> I got the following error.
>
> 2008-09-15 17:04:32,238 INFO fetcher.Fetcher - Fetcher: starting
> 2008-09-15 17:04:32,254 INFO fetcher.Fetcher - Fetcher: segment:
> google/segments/20080915170335
> 2008-09-15 17:04:32,972 FATAL fetcher.Fetcher - Fetcher:
> java.io.IOException: Segment already fetched!
> at
> org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:46)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:329)
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
> at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
> at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
> at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
> at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)
>
> 2008-09-15 17:04:35,144 INFO crawl.CrawlDb - CrawlDb update: starting
>
> Please help me to solve this error.
>
The segment you are trying to crawl has already been fetched. Try removing
everything but crawl_generate under that segment.
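A minimal sketch of that cleanup, assuming the segment sits on the local
filesystem (with HDFS, use $HADOOP dfs -rmr on each directory instead);
the path is taken from the log above, and the directory names are the
standard Nutch segment subdirectories:

  seg=google/segments/20080915170335
  # delete the outputs of the previous fetch/parse run, keeping only
  # crawl_generate, so the segment no longer counts as fetched
  for d in content crawl_fetch crawl_parse parse_data parse_text; do
      rm -rf $seg/$d
  done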
> Thanks in advance
>
> Regards,
> Chetan Patel
>
>
>
> Hilkiah Lavinier wrote:
>>
>> Hi,
>>
>> I think I've come across an issue with the way hadoop lists files. Or
>> maybe it's just me... anyway, I'm using a modified version of the crawl
>> script found on the wiki site. I'm trying to ensure that the fetch
>> operation always uses the latest segment generated by the generate
>> operation. Thus the code looks like:
>>
>>
>> echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
>>
>> echo "** $NUTCH generate $crawl/crawldb $crawl/segments $topN -adddays
>> $adddays"
>> $NUTCH generate $crawl/crawldb $crawl/segments $topN -adddays $adddays
>>
>> if [ $? -ne 0 ]
>> then
>>     echo "runbot: Stopping at depth $depth. No more URLs to fetch."
>>     break
>> fi
>>
>> #debug
>> ls -l --sort=t -r $crawl/segments
>> $HADOOP dfs -ls $crawl/segments
>>
>> segment=`$HADOOP dfs -ls $crawl/segments | tail -1|sed -e
>> 's/\/.*\/.*\/\(.*\/.*\/[0-9]*\).*/\1/'`
>> echo "** segment: $segment"
>>
>> echo "** $NUTCH fetch $segment -threads $threads"
>> $NUTCH fetch $segment -threads $threads
>>
>>
>>
>> However, every so often the crawl fails, as per:
>>
>> --- Beginning crawl at depth 1 of 1 ---
>> ** /nutch/nutch-1.0-dev/bin/nutch generate crawl/crawldb crawl/segments
>> -adddays 24
>> Generator: Selecting best-scoring urls due for fetch.
>> Generator: starting
>> Generator: segment: crawl/segments/20080417204644
>> Generator: filtering: true
>> Generator: jobtracker is 'local', generating exactly one partition.
>> Generator: Partitioning selected urls by host, for politeness.
>> Generator: done.
>> generate return value: 0
>> total 8
>> drwxr-xr-x 7 hilkiah hilkiah 4096 2008-04-17 20:46 20080417204628
>> drwxr-xr-x 3 hilkiah hilkiah 4096 2008-04-17 20:46 20080417204644
>> Found 2 items
>> /nutch/search/crawl/segments/20080417204644 <dir> 2008-04-17
>> 20:46 rwxr-xr-x hilkiah hilkiah
>> /nutch/search/crawl/segments/20080417204628 <dir> 2008-04-17
>> 20:46 rwxr-xr-x hilkiah hilkiah
>> ** segment: crawl/segments/20080417204628
>> ** /nutch/nutch-1.0-dev/bin/nutch fetch crawl/segments/20080417204628
>> -threads 1000
>> Fetcher: starting
>> Fetcher: segment: crawl/segments/20080417204628
>> Fetcher: java.io.IOException: Segment already fetched!
>> at
>> org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:49)
>> at
>> org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:540)
>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:805)
>> at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:524)
>> at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:559)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:531)
>>
>>
>>
>> I think the problem is that hadoop doesn't return the latest segment
>> directory in chronological order (or alphanumeric order). First, is this a
>> known issue and if so, how do I work around it? Secondly, since I believe
>> I refetch/reindex all pages (I only do depth=1 and run a series of crawls
>> at depth=1), can I safely delete the old segment/nnnnn folder before
>> generating a new one?
>>
>> Regards,
>>
>> Hilkiah G. Lavinier MEng (Hons), ACGI
>> 6 Winston Lane,
>> Goodwill,
>> Roseau, Dominica
>>
>> Mbl: (767) 275 3382
>> Hm : (767) 440 3924
>> Fax: (767) 440 4991
>> VoIP USA: (646) 432 4487
>>
>>
>> Email: hilkiah@yahoo.com
>> Email: hilkiah.lavinier@gmail.com
>> IM: Yahoo hilkiah / MSN hilkiahlavinier@hotmail.com
>> IM: ICQ #8978201 / AOL hilkiah21
>>
>>
>>
>>
>>
>>
>>
>>
>
> --
> View this message in context: http://www.nabble.com/hadoop-dfs--ls-and-nutch-generate-fetch-commands-tp16758617p19491397.html
>
>
--
Doğacan Güney