Posted to user@nutch.apache.org by zzcgiacomini <zz...@echo.fr> on 2006/05/22 17:45:57 UTC

Incremental crawl again ... (Please explain)

I am currently using the latest nightly nutch-0.8-dev build and
I am really confused about how to proceed after I have done two
different "whole web" incremental crawls.

The tutorial is not clear to me on how to merge the results of the
two crawls in order to be able to run a search.

Could someone please give me a hint on what the right procedure is?
Here is what I am doing:

1. create an initial urls file  /tmp/dmoz/urls.txt
2. hadoop dfs -put /tmp/urls/ url
3. nutch inject test/crawldb dmoz
4. nutch generate test/crawldb test/segments
5. nutch fetch test/segments/20060522144050
6. nutch updatedb test/crawldb   test/segments/20060522144050
7. nutch invertlinks linkdb test/segments/20060522144050
8. nutch index test/indexes test/crawldb linkdb test/segments/20060522144050
 
...and now I am able to search...
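
For later rounds it can help to script this sequence so the segment timestamp never has to be typed by hand. A minimal sketch, assuming the crawl data sits on the local filesystem (with the data in DFS, as in step 2 above, the newest segment would have to be picked out of "hadoop dfs -ls test/segments" output instead):

   nutch generate test/crawldb test/segments
   segment=`ls -d test/segments/2* | tail -1`             # newest segment produced by generate
   nutch fetch $segment
   nutch updatedb test/crawldb $segment
   nutch invertlinks linkdb $segment
   nutch index test/indexes test/crawldb linkdb $segment  # on later rounds this needs a fresh output dir, see below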

Now I run

 9. nutch generate test/crawldb test/segments -topN 1000

and I will end up with a new segment: test/segments/20060522151957

10. nutch fetch test/segments/20060522151957
11. nutch updatedb test/crawldb test/segments/20060522151957


From this point on I cannot make much progress.

A) I have tried to merge the two segments into a new one, with the idea of rerunning invertlinks and index on it, but:

   nutch mergesegs test/segments -dir test/segments  

   whatever I specify as the output dir or output segment, I get errors.
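
One possible cause of the errors in A) is that the output directory given to mergesegs is the same as the input directory; a hedged sketch with a separate output directory (test/segments_merged, test/linkdb_merged and test/indexes_new are just illustrative names):

   nutch mergesegs test/segments_merged -dir test/segments      # writes a single merged segment under test/segments_merged
   merged=`ls -d test/segments_merged/2* | tail -1`             # local fs; on DFS list it with hadoop dfs -ls instead
   nutch invertlinks test/linkdb_merged $merged                 # rebuild a link database from the merged segment
   nutch index test/indexes_new test/crawldb test/linkdb_merged $merged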

B) I have also tried to run invertlinks on all of test/segments, with the goal of running the nutch index command to produce a second
   indexes directory, say test/indexes1, and finally running the index merge into index2:

   nutch invertlinks  test/linkdb  -dir test/segments

   This created a new linkdb directory *NOT* under test as specified, but as <user>/linkdb-1108390519.

   nutch index  test/indexes1 test/crawldb linkdb test/segments/20060522144050
   nutch merge index2 test/indexes test/indexes1

   Now I am not sure what to do; if I rename index2 to test/indexes after having removed test/indexes,
   I am no longer able to search.
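
One hedged guess at why the search stops working in B): as far as I can tell, the 0.8 searcher looks under searcher.dir for either a single merged index named "index" or an "indexes" directory containing part-NNNNN sub-indexes, so a merged index renamed to test/indexes is not in the shape it expects. Merging straight into "index" avoids the rename (the second segment path is the one produced in step 9):

   nutch index test/indexes1 test/crawldb linkdb test/segments/20060522151957   # index only the new segment
   nutch merge test/index test/indexes test/indexes1                            # merge old and new parts into a single test/index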


-Corrado
   



Re: Incremental crawl again ... (Please explain)

Posted by Jacob Brunson <ja...@gmail.com>.
Additional comments and a test case below.

On 5/25/06, Zaheed Haque <za...@gmail.com> wrote:
> On 5/25/06, Jacob Brunson <ja...@gmail.com> wrote:
> > I looked at the referenced message at
> > http://www.mail-archive.com/nutch-user@lucene.apache.org/msg03990.html
> > but I am still having problems.
> >
> > I am running the latest checkout from subversion.
> >
> > These are the commands which I've run:
>
> > bin/nutch crawl myurls/ -dir crawl -threads 4 -depth 3 -topN 10000
>
> bin/nutch crawl is a one-shot command to fetch/generate/index a
> Nutch index. I would NOT recommend using this one-shot command.
That's funny, because when I look at the source code for crawling, it
does pretty much the same thing as the "whole web crawling" method.
>
> Please take the long route which will give you more control over your
> tasks. The long route meaning - inject, generate, fetch, updatedb,
> index, dedup, merge. Please see the following -
> Whole web crawling...
>
> http://lucene.apache.org/nutch/tutorial8.html#Whole-web+Crawling
>
Yes, I've gone through that tutorial too and followed it, and I'm
having the same problem.  The tutorial does not describe how to add to
the original index.  If you can help me figure this out, I would be
glad to add to the tutorial and make it more complete.

Just to be perfectly clear, this is the complete set of steps I take
to get the error (I'm running Java 1.5; only the urls/ directory
exists at the beginning):
$ svn update
$ ant
$ bin/nutch inject crawl.test/crawldb urls/
$ bin/nutch generate crawl.test/crawldb crawl.test/segments -topN 20
$ lastsegment=`ls -d crawl.test/segments/2* | tail -1`
$ bin/nutch fetch $lastsegment
$ bin/nutch updatedb crawl.test/crawldb $lastsegment
$ bin/nutch invertlinks crawl.test/linkdb $lastsegment
$ bin/nutch index crawl.test/indexes crawl.test/crawldb crawl.test/linkdb $lastsegment
$ bin/nutch merge crawl.test/index crawl.test/indexes
$ bin/nutch generate crawl.test/crawldb crawl.test/segments -topN 20
$ lastsegment=`ls -d crawl.test/segments/2* | tail -1`
$ bin/nutch fetch $lastsegment
$ bin/nutch updatedb crawl.test/crawldb $lastsegment
$ bin/nutch invertlinks crawl.test/linkdb $lastsegment
$ bin/nutch index crawl.test/indexes crawl.test/crawldb crawl.test/linkdb $lastsegment

And at this point, I have my problem.  I get the following output:
060525 171327 Indexer: adding segment: crawl.test/segments/20060525165518
Exception in thread "main" java.io.IOException: Output directory
/home/nutch/nutch/crawl.test/indexes already exists.
        at org.apache.hadoop.mapred.OutputFormatBase.checkOutputSpecs(OutputFormatBase.java:37)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:263)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:311)
        at org.apache.nutch.indexer.Indexer.index(Indexer.java:287)
        at org.apache.nutch.indexer.Indexer.main(Indexer.java:304)

So if you could help me figure out what I need to do differently, I
would be sure to update the tutorial on the wiki to help others who
might have the same problems as me.
Thanks,
Jacob
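
A hedged workaround for the IOException above: the map-reduce indexer refuses to write into an existing output directory, so either remove crawl.test/indexes before re-indexing everything, or index the new segment into a fresh directory and merge. A sketch of the second variant (indexes2 and index2 are illustrative names):

$ bin/nutch index crawl.test/indexes2 crawl.test/crawldb crawl.test/linkdb $lastsegment
$ bin/nutch dedup crawl.test/indexes2                        # optional: delete duplicate documents in the new part
$ bin/nutch merge crawl.test/index2 crawl.test/indexes crawl.test/indexes2

(Then replace the old crawl.test/index with crawl.test/index2 so the searcher picks it up.)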

Re: Incremental crawl again ... (Please explain)

Posted by Zaheed Haque <za...@gmail.com>.
On 5/25/06, Jacob Brunson <ja...@gmail.com> wrote:
> I looked at the referenced message at
> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg03990.html
> but I am still having problems.
>
> I am running the latest checkout from subversion.
>
> These are the commands which I've run:

> bin/nutch crawl myurls/ -dir crawl -threads 4 -depth 3 -topN 10000

bin/nutch crawl is a one-shot command to fetch/generate/index a
Nutch index. I would NOT recommend using this one-shot command.

Please take the long route which will give you more control over your
tasks. The long route meaning - inject, generate, fetch, updatedb,
index, dedup, merge. Please see the following -
Whole web crawling...

http://lucene.apache.org/nutch/tutorial8.html#Whole-web+Crawling

Cheers
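
Of the steps listed above, dedup is the one the rest of this thread never actually shows; as far as I understand it (usage taken from the 0.8 tutorial, so treat it as an assumption for the current nightly), it runs over the index parts after indexing, and merge then combines the parts into one index:

bin/nutch dedup crawl/indexes               # delete duplicate documents from the index parts
bin/nutch merge crawl/index crawl/indexes   # merge the parts into a single index for the searcher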

> bin/nutch generate crawl/crawldb crawl/segments -topN 500
> lastsegment=`ls -d crawl/segments/2* | tail -1`
> bin/nutch fetch $lastsegment
> bin/nutch updatedb crawl/crawldb $lastsegment
> bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $lastsegment
>
> This last command fails with a java.io.IOException saying: "Output
> directory /home/nutch/nutch/crawl/indexes already exists"
>
> So I'm confused because it seems like I did exactly what was described
> in the referenced email, but it didn't work for me.  Can someone help
> me figure out what I'm doing wrong or what I need to do instead?
> Thanks,
> Jacob
>
>
>
> --
> http://JacobBrunson.com
>

Re: Incremental crawl again ... (Please explain)

Posted by zzcgiacomini <zz...@echo.fr>.
I just tried this and it looks like it is working:

nutch index test/indexes1 test/crawldb linkdb test/segments/20060522181136
nutch index test/indexes2 test/crawldb linkdb test/segments/20060522181136
nutch merge test/index test/indexes1 test/indexes2

Querying also works; I have set searcher.dir in nutch-site.xml to "test"
and used the following line to query:
/opt/nutch-0.8-dev/bin/nutch org.apache.nutch.searcher.NutchBean computer

I am just experimenting, so I do not know if this is the right way to do things.

-Corrado
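
If this recipe is repeated for further rounds, one hedged generalization (my extrapolation, not something confirmed in this thread) is to index each new segment into its own indexesN directory and re-merge all of them into a fresh index, which searcher.dir=test then picks up once it is in place as test/index:

nutch index test/indexes3 test/crawldb linkdb test/segments/<new segment>   # index only the newest segment
nutch merge test/index_new test/indexes1 test/indexes2 test/indexes3        # re-merge all parts into a fresh index

(Then remove the old test/index and move test/index_new into its place, e.g. with hadoop dfs -mv if this build supports it.)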


Stefan Neufeind wrote:
> I haven't yet tried - but could you maybe:
> - move the new segments somewhere independent of the existing ones
> - create a separate linkdb for it (to my understanding the linkdb is
> only needed when indexing)
> - create a separate index on that
> - then move segment into segments-dir and new index into indexes-dir as
> "part-XXXX"
> - just merge indexes (should work relatively fast)
>
> In the long term your segments, indexes etc. add up - so in this case
> you'd need to maybe think about merging segments etc.
>
> Also, this is "only" my current understanding of the topic. It would be
> nice to get feedback and maybe easier solutions from others as well.
>
>
>
> Regards,
>  Stefan
>
>   


Re: Incremental crawl again ... (Please explain)

Posted by Stefan Neufeind <ap...@stefan-neufeind.de>.
I haven't yet tried - but could you maybe:
- move the new segments somewhere independent of the existing ones
- create a separate linkdb for it (to my understanding the linkdb is
only needed when indexing)
- create a separate index on that
- then move segment into segments-dir and new index into indexes-dir as
"part-XXXX"
- just merge indexes (should work relatively fast)

In the long term your segments, indexes etc. add up - so in this case
you'd need to maybe think about merging segments etc.

Also, this is "only" my current understanding of the topic. It would be
nice to get feedback and maybe easier solutions from others as well.



Regards,
 Stefan
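
A hedged sketch of the outline above (the directory names, and part-00001 as the name for the dropped-in index part, are guesses at what is meant; merging the index directories is the simpler fallback):

bin/nutch generate crawl/crawldb crawl/segments_new -topN 1000        # keep the new segment apart from the existing ones
segment=`ls -d crawl/segments_new/2* | tail -1`
bin/nutch fetch $segment
bin/nutch updatedb crawl/crawldb $segment
bin/nutch invertlinks crawl/linkdb_new $segment                       # separate linkdb, only needed for indexing
bin/nutch index crawl/indexes_new crawl/crawldb crawl/linkdb_new $segment
# then move $segment under crawl/segments/ and crawl/indexes_new/part-00000 under crawl/indexes/ as part-00001,
# or simply merge: bin/nutch merge crawl/index crawl/indexes crawl/indexes_new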

Jacob Brunson wrote:
> Yes, I see what you mean about re-indexing again over all the
> segments.  However, indexing takes a lot of time and I was hoping that
> merging many smaller indexes would be a much more efficient method.
> Besides, deleting the index and re-indexing just doesn't seem like
> *The Right Thing(tm)*.
> 

Re: Incremental crawl again ... (Please explain)

Posted by Jacob Brunson <ja...@gmail.com>.
Yes, I see what you mean about re-indexing again over all the
segments.  However, indexing takes a lot of time and I was hoping that
merging many smaller indexes would be a much more efficient method.
Besides, deleting the index and re-indexing just doesn't seem like
*The Right Thing(tm)*.

On 5/26/06, zzcgiacomini <zz...@echo.fr> wrote:
> I am not at all a Nutch expert, I am just experimenting a little bit,
> but as far as I understand it
> you can remove the indexes directory and re-index the segments again:
> In my case, after step 8 (see below) I have only one segment:
> test/segments/20060522144050
> after step 9 I have a second segment:
> test/segments/20060522151957
> Now what we can do is remove the test/indexes directory and
> re-index the two segments. This is what I did:
>
> hadoop dfs -rm test/indexes
> nutch index test/indexes test/crawldb linkdb test/segments/20060522144050 test/segments/20060522151957
>
> Hope it helps
>
> -Corrado
>
>
>


-- 
http://JacobBrunson.com

Re: Incremental crawl again ... (Please explain)

Posted by zzcgiacomini <zz...@echo.fr>.
I am not at all a Nutch expert, I am just experimenting a little bit,
but as far as I understand it
you can remove the indexes directory and re-index the segments again:
In my case, after step 8 (see below) I have only one segment:
test/segments/20060522144050
after step 9 I have a second segment:
test/segments/20060522151957
Now what we can do is remove the test/indexes directory and
re-index the two segments. This is what I did:

hadoop dfs -rm test/indexes
nutch index test/indexes test/crawldb linkdb test/segments/20060522144050 test/segments/20060522151957

Hope it helps

-Corrado
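
As a hedged addition (taken from the 0.8 tutorial rather than from this thread), running dedup after such a re-index removes pages that ended up in both segments:

nutch dedup test/indexes     # delete duplicate documents across the rebuilt index parts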



Jacob Brunson wrote:
> I looked at the referenced message at
> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg03990.html
> but I am still having problems.
>
> I am running the latest checkout from subversion.
>
> These are the commands which I've run:
> bin/nutch crawl myurls/ -dir crawl -threads 4 -depth 3 -topN 10000
> bin/nutch generate crawl/crawldb crawl/segments -topN 500
> lastsegment=`ls -d crawl/segments/2* | tail -1`
> bin/nutch fetch $lastsegment
> bin/nutch updatedb crawl/crawldb $lastsegment
> bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $lastsegment
>
> This last command fails with a java.io.IOException saying: "Output
> directory /home/nutch/nutch/crawl/indexes already exists"
>
> So I'm confused because it seems like I did exactly what was described
> in the referenced email, but it didn't work for me.  Can someone help
> me figure out what I'm doing wrong or what I need to do instead?
> Thanks,
> Jacob
>
>


Re: Incremental crawl again ... (Please explain)

Posted by Jacob Brunson <ja...@gmail.com>.
I looked at the referenced message at
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg03990.html
but I am still having problems.

I am running the latest checkout from subversion.

These are the commands which I've run:
bin/nutch crawl myurls/ -dir crawl -threads 4 -depth 3 -topN 10000
bin/nutch generate crawl/crawldb crawl/segments -topN 500
lastsegment=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $lastsegment
bin/nutch updatedb crawl/crawldb $lastsegment
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $lastsegment

This last command fails with a java.io.IOException saying: "Output
directory /home/nutch/nutch/crawl/indexes already exists"

So I'm confused because it seems like I did exactly what was described
in the referenced email, but it didn't work for me.  Can someone help
me figure out what I'm doing wrong or what I need to do instead?
Thanks,
Jacob


On 5/22/06, sudhendra seshachala <su...@yahoo.com> wrote:
> Please do follow the link below:
>   http://www.mail-archive.com/nutch-user@lucene.apache.org/msg03990.html
>
>   I have been able to follow the thread as explained and merge multiple crawls. It works like a champ.
>
>   Thanks
>   Sudhi
>


-- 
http://JacobBrunson.com

Re: Incremental crawl again ... (Please explain)

Posted by sudhendra seshachala <su...@yahoo.com>.
Please do follow the link below:
  http://www.mail-archive.com/nutch-user@lucene.apache.org/msg03990.html

  I have been able to follow the thread as explained and merge multiple crawls. It works like a champ.
   
  Thanks
  Sudhi







  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   

