Posted to user@nutch.apache.org by "Moore, Lee C" <Le...@xerox.com> on 2007/11/19 21:41:01 UTC

http://www.mail-archive.com/nutch-user@lucene.apache.org/msg09096.html

Hello:
 
I am trying to do recrawling with Nutch-0.9.  I have done some Google
searching but I haven't found an answer that works.
 
I had hopes for the script located at:
 
 
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg09096.html
 
I tried this script for re-crawling and it has the same problem after a
couple of re-crawls:

	----- Merge Indexes (Step 7 of 8) -----
	merging indexes to: crawl/index
	Adding crawl/NEWindexes/part-00000
	IndexMerger: java.io.IOException: Target crawl/index/merge-output already exists

(Also, this script has an unrelated bug: it references the variable
$rank, but $rank is never defined. I guess this is supposed to be topN.)
 
Has anybody found a solution for successfully re-crawling with 0.9?
 
thanks,
 
 -Lee
 

RE: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg09096.html

Posted by "Moore, Lee C" <Le...@xerox.com>.
Hi Susam Pal,

Thanks for the pointer to the latest crawl/recrawl script. It has worked
very well with Nutch-0.9. It is the answer to my problem!

Thanks again!

 -Lee

-----Original Message-----
From: Susam Pal [mailto:susam.pal@gmail.com] 
Sent: Tuesday, November 20, 2007 1:39 AM
To: nutch-user@lucene.apache.org
Subject: Re: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg09096.html

Hi Lee,

Thanks for the feedback. The script posted in the mailing list has
some bugs. Please use the latest script from
http://wiki.apache.org/nutch/Crawl

I have also made some minor changes to make it work with Nutch 1.0-dev
in trunk. I have tested this with Nutch 1.0-dev. I believe this should
work fine for Nutch 0.9 too.

We had a discussion on re-crawling for Nutch 1.0-dev here:-
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg09514.html

Please try this script for re-crawling with Nutch-0.9 and let us know
how it goes.

Regards,
Susam Pal

On Nov 20, 2007 2:11 AM, Moore, Lee C <Le...@xerox.com> wrote:
>
>
> Hello:
>
> I am trying to do recrawling with Nutch-0.9.  I have done some Google searching
> but I haven't found an answer that works.
>
> I had hopes for the script located at:
>
>     http://www.mail-archive.com/nutch-user@lucene.apache.org/msg09096.html
>
> I tried this script for re-crawling and it has the same problem after a
> couple of re-crawls:
>
> ----- Merge Indexes (Step 7 of 8) -----
> merging indexes to: crawl/index
> Adding crawl/NEWindexes/part-00000
> IndexMerger: java.io.IOException: Target crawl/index/merge-output already exists
> (Also, this script has an unrelated bug: it references the variable $rank,
> but $rank is never defined. I guess this is supposed to be topN.)
>
> Has anybody found a solution for successfully re-crawling with 0.9?
>
> thanks,
>
>  -Lee
>

Re: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg09096.html

Posted by Susam Pal <su...@gmail.com>.
Hi Lee,

Thanks for the feedback. The script posted in the mailing list has
some bugs. Please use the latest script from
http://wiki.apache.org/nutch/Crawl

I have also made some minor changes to make it work with Nutch 1.0-dev
in trunk. I have tested this with Nutch 1.0-dev. I believe this should
work fine for Nutch 0.9 too.

We had a discussion on re-crawling for Nutch 1.0-dev here:-
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg09514.html

Please try this script for re-crawling with Nutch-0.9 and let us know
how it goes.
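
The wiki script is not reproduced in this thread. As a rough, unofficial
sketch of one way to avoid the "Target crawl/index/merge-output already
exists" error, assuming it comes from merging directly into an existing
crawl/index, the merge step can write to a fresh directory and swap it
into place afterwards. The function name, the NUTCH and CRAWL_DIR
variables, and the MERGEDindex directory are names made up for this
illustration, and the sketch assumes the crawl directory is on the local
filesystem rather than a distributed one.

```shell
#!/bin/sh
# Sketch only, not the script from the wiki.
# NUTCH, CRAWL_DIR and MERGEDindex are hypothetical names for illustration.
NUTCH="${NUTCH:-bin/nutch}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"

merge_indexes() {
    new_indexes="$CRAWL_DIR/NEWindexes"
    merged="$CRAWL_DIR/MERGEDindex"   # temporary merge target (assumed name)

    # Start from a clean target so no output from an earlier run is in the way.
    rm -rf "$merged"

    # Usage of the merge command: nutch merge <outputIndex> <indexesDir...>
    "$NUTCH" merge "$merged" "$new_indexes" || return 1

    # Replace the old index only after the merge has succeeded.
    rm -rf "$CRAWL_DIR/index"
    mv "$merged" "$CRAWL_DIR/index"
}
```

Calling merge_indexes in place of a merge step that targets crawl/index
directly leaves the old index untouched if the merge fails.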

Regards,
Susam Pal

On Nov 20, 2007 2:11 AM, Moore, Lee C <Le...@xerox.com> wrote:
>
>
> Hello:
>
> I am trying to do recrawling with Nutch-0.9.  I have done some Google searching
> but I haven't found an answer that works.
>
> I had hopes for the script located at:
>
>     http://www.mail-archive.com/nutch-user@lucene.apache.org/msg09096.html
>
> I tried this script for re-crawling and it has the same problem after a
> couple of re-crawls:
>
> ----- Merge Indexes (Step 7 of 8) -----
> merging indexes to: crawl/index
> Adding crawl/NEWindexes/part-00000
> IndexMerger: java.io.IOException: Target crawl/index/merge-output already exists
> (Also, this script has an unrelated bug: it references the variable $rank,
> but $rank is never defined. I guess this is supposed to be topN.)
>
> Has anybody found a solution for successfully re-crawling with 0.9?
>
> thanks,
>
>  -Lee
>