Posted to user@nutch.apache.org by Brian Whitman <br...@variogr.am> on 2007/02/15 22:13:28 UTC
crawl indexes and part-00000
I am looking for a simple explanation of what the "part-00000"
directory in my crawl/index folders is, and when it is created and
when it is not.
I am having a bit of trouble merging multiple nutch-created indexes
using bin/nutch merge -- the merge tool seems to always expect the
index directory to have only a "part-00000" folder in it, with the
actual lucene indexes below. For some reason, in my crawls I end up
with some index dirs with the part-00000 and some index dirs without
it. And so I get
"IndexMerger: java.io.IOException: crawl/index/_0.fnm not a directory"
How could this happen? Do all nutch index generation tools generate
this part-00000 folder? Did something go wrong further up the chain? Will
there ever be a part-00001? :)
RE: crawl indexes and part-00000
Posted by Gal Nitzan <gn...@usa.net>.
Oddly enough, merge is not run as a job, so you end up with a single folder
holding the merged index, with no part folders in it.
Let's say you have 2 separate indexes created in 2 separate runs.
Now let's say that one index is located at crawl/index_1 and the second is
in crawl/index_2
So each of those folders now contains a part-0000x folder, and your
tree looks like crawl/index_1/part-00000, right?
Now do the following:
1. bin/hadoop dfs -mkdir crawl/indexes
2. bin/hadoop dfs -cp crawl/index_1/part-00000 crawl/indexes/index_1_part_0
3. bin/hadoop dfs -cp crawl/index_2/part-00000 crawl/indexes/index_2_part_0
4. bin/nutch merge crawl/newindex crawl/indexes
When done you should have a new folder (crawl/newindex) with the merged
index in it.
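The four steps above can be generalized to any number of index folders. The following is only a sketch: it prints the commands rather than running them, and the variable names (INDEX_DIRS, STAGING, MERGED) and the plan_merge function are made up for illustration. It also assumes every run produced exactly one part, part-00000.

```shell
#!/bin/sh
# Sketch only: prints the commands from the steps above instead of running them.
# INDEX_DIRS, STAGING and MERGED are hypothetical names chosen for illustration.
INDEX_DIRS="crawl/index_1 crawl/index_2"
STAGING="crawl/indexes"
MERGED="crawl/newindex"

plan_merge() {
    echo "bin/hadoop dfs -mkdir $STAGING"
    for dir in $INDEX_DIRS; do
        name=$(basename "$dir")
        # Assumption: each run produced exactly one part (part-00000).
        echo "bin/hadoop dfs -cp $dir/part-00000 $STAGING/${name}_part_0"
    done
    echo "bin/nutch merge $MERGED $STAGING"
}

plan_merge
```

Once the printed commands look right for your layout, they can be run by hand or piped into sh.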
HTH,
Gal
Re: crawl indexes and part-00000
Posted by Brian Whitman <br...@variogr.am>.
> The merge program doesn't care what the name of the folder is. It
> cares it
> should be in a certain structure.
>
> So if we assume you have a folder named indexes, the program wants
> that each
> folder inside indexes (represents a previous run of index) should
> have a
> Lucene index in it (it looks for a folder name segments).
Thanks Gal for the explanation. It makes sense.
What doesn't, though, is that
bin/nutch merge crawl/index crawl/index_1 crawl/index_2 crawl/index
(i.e. merging three indexes, including the previously merged one) will
not generate the part-00000 in crawl/index; it just dumps the merged
Lucene index directly into crawl/index. So the next time I do a
crawl merge I have to manually move crawl/index/* to
crawl/index/part-00000/.
But knowing this at least is helpful so I can update my scripts!
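For anyone scripting the manual move described above, here is a rough sketch. It assumes the index lives on the local filesystem (not in DFS), and wrap_in_part is a hypothetical helper name, not part of Nutch.

```shell
#!/bin/sh
# Sketch only: wrap a flat Lucene index in a part-00000 subdirectory.
# Assumes a local filesystem; wrap_in_part is a hypothetical helper.
wrap_in_part() {
    index_dir=$1
    # Nothing to do if the part directory is already there.
    if [ -d "$index_dir/part-00000" ]; then
        return 0
    fi
    mkdir "$index_dir/part-00000" || return 1
    for f in "$index_dir"/*; do
        # Skip the directory we just created; move everything else into it.
        [ "$f" = "$index_dir/part-00000" ] && continue
        mv "$f" "$index_dir/part-00000/" || return 1
    done
}

# Example call (commented out so the sketch has no side effects):
# wrap_in_part crawl/index
```

Calling it a second time on the same directory does nothing, so it should be safe to run before every merge.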
-Brian
RE: crawl indexes and part-00000
Posted by Gal Nitzan <gn...@usa.net>.
Hi Brian,
Well, it took me a while to figure it out too :-).
The number of parts is actually the number of reduce tasks defined in
hadoop-site.xml. If you are working with only one machine, this value should
be one, and when you run different jobs you will notice that the result is
saved in part-00000. If you had two machines and also changed the number
of reduce tasks to two, you would get part-00000 and part-00001, and so on.
To emphasize: if you have two machines in your cluster and you run the
following command:
bin/nutch index indexes crawldb linkdb segments/2007....
then you will end up with the folder indexes, which contains two folders,
part-00000 and part-00001, and each of these folders contains a Lucene index.
You could actually import each of those folders and open it with Luke.
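To make the naming concrete, here is a small sketch that prints the part folder names you would expect for a given number of reduce tasks. The part_names helper is hypothetical, and the five-digit zero padding is simply taken from the part-00000 and part-00001 names mentioned above.

```shell
#!/bin/sh
# Sketch only: print the part directory names expected for N reduce tasks.
# part_names is a hypothetical helper; the five-digit padding follows the
# part-00000 / part-00001 names used in this thread.
part_names() {
    reducers=$1
    i=0
    while [ "$i" -lt "$reducers" ]; do
        printf 'part-%05d\n' "$i"
        i=$((i + 1))
    done
}

part_names 2
```

With a single reduce task only part-00000 is listed, which matches the one-machine case.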
Now to merge:
The merge program doesn't care what the name of the folder is. It cares it
should be in a certain structure.
So if we assume you have a folder named indexes, the program wants that each
folder inside indexes (represents a previous run of index) should have a
Lucene index in it (it looks for a folder name segments).
What I do (I run the whole process of generate-merge in a loop) is create
the index in the DFS root with a name like the segments (2007...), and then I
transfer the part folders to the indexes folder (I rename them from part...
to date-00001 and so forth).
Then you can call merge and it will merge all the indexes.
HTH,
Gal