Posted to user@nutch.apache.org by Brian Whitman <br...@variogr.am> on 2007/02/15 22:13:28 UTC

crawl indexes and part-00000

I am looking for a simple explanation of what the "part-00000"
directory in my crawl/index folders is, and when it is created and
when it is not.

I am having a bit of trouble merging multiple Nutch-created indexes
using bin/nutch merge -- the merge tool seems to always expect the
index directory to contain only a "part-00000" folder, with the
actual Lucene indexes below it. For some reason, my crawls leave me
with some index dirs that have part-00000 and some that do not.
And so I get:

IndexMerger: java.io.IOException: crawl/index/_0.fnm not a directory

How could this happen? Do all Nutch index generation tools generate
this part-00000 folder? Did something go wrong further up the chain?
Will there ever be a part-00001? :)

RE: crawl indexes and part-00000

Posted by Gal Nitzan <gn...@usa.net>.

Funnily enough, merge is not run as a job, so you end up with one folder
with the merged index in it and no parts.

Let's say you have 2 separate indexes created in 2 separate runs.
Now let's say that one index is located at crawl/index_1 and the second is
in crawl/index_2

So now in each of those folders you have a part-0000x folder, so your
tree looks like crawl/index_1/part-00000, right?

Now do the following:
1. bin/hadoop dfs -mkdir crawl/indexes
2. bin/hadoop dfs -cp crawl/index_1/part-00000 crawl/indexes/index_1_part_0
3. bin/hadoop dfs -cp crawl/index_2/part-00000 crawl/indexes/index_2_part_0
4. bin/nutch merge crawl/newindex crawl/indexes

When done you should have a new folder (crawl/newindex) with the merged
index in it.
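
The four steps above can be sketched as a small shell helper. This is only
a dry run that prints the commands rather than executing them, and it
assumes each source index holds a single part-00000 (one reduce task).
merge_plan is a hypothetical name, not something shipped with Nutch or
Hadoop.

```shell
#!/bin/sh
# Print (do not run) the copy-and-merge commands for a set of index dirs.
# merge_plan is a hypothetical helper, not part of Nutch or Hadoop.
# Usage: merge_plan OUTPUT_INDEX STAGING_DIR INDEX_DIR...
merge_plan() {
    out=$1
    staging=$2
    shift 2
    echo "bin/hadoop dfs -mkdir $staging"
    for idx in "$@"; do
        # Assumption: every index dir contains exactly one part-00000.
        name=$(basename "$idx")
        echo "bin/hadoop dfs -cp $idx/part-00000 $staging/${name}_part_0"
    done
    echo "bin/nutch merge $out $staging"
}

merge_plan crawl/newindex crawl/indexes crawl/index_1 crawl/index_2
```

Printing the commands first makes it easy to check the paths before
anything is copied in the DFS.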

HTH,

Gal

-----Original Message-----
From: Brian Whitman [mailto:brian.whitman@variogr.am] 
Sent: Friday, February 16, 2007 12:06 AM
To: nutch-user@lucene.apache.org
Subject: Re: crawl indexes and part-00000

> The merge program doesn't care what the name of the folder is. It cares it
> should be in a certain structure.
>
> So if we assume you have a folder named indexes, the program wants that each
> folder inside indexes (represents a previous run of index) should have a
> Lucene index in it (it looks for a folder name segments).


Thanks Gal for the explanation. It makes sense.

What doesn't, though, is that

bin/nutch merge crawl/index crawl/index_1 crawl/index_2 crawl/index

(i.e. merging three indexes, including the previously merged one) will
not generate a part-00000 in crawl/index; it just dumps the merged
Lucene index directly into crawl/index. So the next time I do a
crawl merge I have to manually move crawl/index/* to
crawl/index/part-00000/.

But knowing this at least is helpful so I can update my scripts!
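
A minimal sketch of that manual move, assuming the index lives on the
local filesystem (an index in the DFS would need the equivalent
bin/hadoop dfs commands instead). wrap_in_part is a hypothetical helper
name.

```shell
#!/bin/sh
# Move the files of a flat index directory into a part-00000 subdirectory.
# wrap_in_part is a hypothetical helper; it assumes a local filesystem path.
wrap_in_part() {
    dir=$1
    # Nothing to do if the directory already has the part-00000 layout.
    if [ -d "$dir/part-00000" ]; then
        return 0
    fi
    mkdir "$dir/part-00000" || return 1
    for f in "$dir"/*; do
        # Skip the directory just created, and an unmatched glob.
        [ "$f" = "$dir/part-00000" ] && continue
        [ -e "$f" ] || continue
        mv "$f" "$dir/part-00000/" || return 1
    done
    return 0
}

# Example (not run here): wrap_in_part crawl/index
```

Calling it a second time on the same directory leaves the layout
unchanged, so it should be safe to run before every merge.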

-Brian

RE: crawl indexes and part-00000

Posted by Gal Nitzan <gn...@usa.net>.

Hi Brian,

Well, it took me a while to figure it out too :-).

The number of parts is actually the number of reduce tasks defined in
hadoop-site.xml. If you are working with only one machine this value should
be one, and when you run different jobs you will notice that the result is
saved in part-00000. If you had two machines and also changed the number
of reduce tasks to two, you would get part-00000 and part-00001, and so on.

To emphasize: if you have two machines in your cluster and you run the
following command:

bin/nutch index indexes crawldb linkdb segments/2007....

then you will end up with the folder indexes, which contains two folders,
part-00000 and part-00001, and each of these folders contains a Lucene index.
You could actually import each of those folders and open it with Luke.
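
To illustrate the naming, here is a minimal shell sketch. It assumes one
part directory per reduce task and the zero-padded five-digit numbering
seen in part-00000. part_names is a hypothetical helper, not part of
Nutch or Hadoop.

```shell
#!/bin/sh
# Print the part directory names a job with N reduce tasks would leave behind.
# part_names is a hypothetical helper, not part of Nutch or Hadoop.
part_names() {
    n=$1
    i=0
    while [ "$i" -lt "$n" ]; do
        # Zero-pad the task number to five digits, as in part-00000.
        printf 'part-%05d\n' "$i"
        i=$((i + 1))
    done
}

# With two reduce tasks this prints part-00000 and part-00001.
part_names 2
```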


Now to merge:
The merge program doesn't care what the name of the folder is. It cares
that the folder has a certain structure.

So if we assume you have a folder named indexes, the program wants each
folder inside indexes (each representing a previous run of index) to have a
Lucene index in it (it looks for a folder named segments).


What I do (I run the whole generate-merge process in a loop) is create
the index in the DFS root with a name like the segments (2007...), and then I
transfer the part folders to the indexes folder (I rename them from part...
to date-00001 and so forth).
Then you can call merge and it will merge all the indexes.

HTH,

Gal