Posted to user@nutch.apache.org by Hetal Shah <he...@investorsprovident.com> on 2007/02/01 13:12:09 UTC

RE: Dedup index error

Another quick update:

I ran Luke on the index, and part-00000 works fine, whereas part-00001 comes
up as corrupt or missing. Now seeing from the list of files in both these
directories, we know that there is nothing in part-00001 - so why does it
get generated? And if it does, why does dedup not handle it gracefully?

I also ran a merge on the two indexes, and it worked fine. 

So that lays to rest the idea that both indexes are corrupted. This leads me
to think that since I only had two pages indexed and the index was small,
part-00001 came up with nothing, and dedup does not handle that case?

Any thoughts?



-- Hetal Shah wrote: --

That's what I had read on another post as well, but somehow, I can't
understand how it can be corrupted! It's not even a massive index. Just a
couple of urls. Every step that I followed was per the tutorials on the wiki
page.

Here's the list under /indexes:

drwxr-xr-x  2 root root 4096 Jan 31 16:21 part-00000
drwxr-xr-x  2 root root 4096 Jan 31 16:21 part-00001

This is what's under part-00000

-rw-r--r--  1 root root    2 Jan 31 16:21 _2.f0
-rw-r--r--  1 root root    2 Jan 31 16:21 _2.f1
-rw-r--r--  1 root root    2 Jan 31 16:21 _2.f2
-rw-r--r--  1 root root    2 Jan 31 16:21 _2.f3
-rw-r--r--  1 root root    2 Jan 31 16:21 _2.f4
-rw-r--r--  1 root root    2 Jan 31 16:21 _2.f5
-rw-r--r--  1 root root  399 Jan 31 16:21 _2.fdt
-rw-r--r--  1 root root   16 Jan 31 16:21 _2.fdx
-rw-r--r--  1 root root   74 Jan 31 16:21 _2.fnm
-rw-r--r--  1 root root  945 Jan 31 16:21 _2.frq
-rw-r--r--  1 root root 1790 Jan 31 16:21 _2.prx
-rw-r--r--  1 root root  105 Jan 31 16:21 _2.tii
-rw-r--r--  1 root root 6850 Jan 31 16:21 _2.tis
-rw-r--r--  1 root root    4 Jan 31 16:21 deletable
-rw-r--r--  1 root root    0 Jan 31 16:21 index.done
-rw-r--r--  1 root root   27 Jan 31 16:21 segments

This is what's under part-00001

-rw-r--r--  1 root root  0 Jan 31 16:21 index.done
-rw-r--r--  1 root root 20 Jan 31 16:21 segments
 
By the way, I should also mention that I am running dedup on DFS. I haven't
tried running it on the local filesystem yet - does that matter?

Thanks for your help.




RE: Dedup index error

Posted by Hetal Shah <he...@investorsprovident.com>.
Thanks Andrzej. I don't think my scenario would be applicable in real-life
situations. However, it would be great to know where the root of the problem
lies.

I have managed to dedup a larger index, and it is working perfectly. So your
theory is correct. I guess it's a matter of digging a little deeper to
eliminate this once and for all.

Thanks.




-----Original Message-----
From: Andrzej Bialecki [mailto:ab@getopt.org] 
Sent: 01 February 2007 17:59
To: nutch-user@lucene.apache.org
Subject: Re: Dedup index error

Hetal Shah wrote:
> Another quick update:
>
> I ran Luke on the index, and part-00000 works fine, whereas part-00001 
> comes up as corrupt or missing. Now seeing from the list of files in 
> both these directories, we know that there is nothing in part-00001 - 
> so why does it get generated? And if it does, why does dedup not
> handle it gracefully?
>
> I also ran a merge on the two indexes, and it worked fine. 
>
> So that lays to rest the idea that both indexes are corrupted. This 
> leads me to think that since I only had two pages indexed and 
> the index was small, part-00001 came up with nothing, and dedup 
> does not handle that case?
>
> Any thoughts?
>   

There seems to be an issue with the document partitioning - it seems that
for larger numbers of documents the partitioning scheme puts at least
one document in each partition, but in your case there were too few documents
to fill the second partition ... I need to check where the problem originates
- however, this should not happen if you index more documents than 2 * the
number of reduce tasks.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Dedup index error

Posted by Andrzej Bialecki <ab...@getopt.org>.
Hetal Shah wrote:
> Another quick update:
>
> I ran Luke on the index, and part-00000 works fine, whereas part-00001 comes
> up as corrupt or missing. Now seeing from the list of files in both these
> directories, we know that there is nothing in part-00001 - so why does it
> get generated? And if it does, why does dedup not handle it gracefully?
>
> I also ran a merge on the two indexes, and it worked fine. 
>
> So that lays to rest the idea that both indexes are corrupted. This leads
> me to think that since I only had two pages indexed and the index was
> small, part-00001 came up with nothing, and dedup does not handle that case?
>
> Any thoughts?
>   

There seems to be an issue with the document partitioning - it seems 
that for larger numbers of documents the partitioning scheme puts at 
least one document in each partition, but in your case there were too few 
documents to fill the second partition ... I need to check where the 
problem originates - however, this should not happen if you index more 
documents than 2 * the number of reduce tasks.
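The point above can be illustrated with a small sketch. This is not Nutch's
actual code - it just assumes a Hadoop-style hash partitioner, and the class
name and URLs below are made up for illustration:

```java
import java.util.Arrays;

public class PartitionSketch {

    // Hadoop-style hash partitioner: mask off the sign bit so the
    // result is non-negative, then take the remainder modulo the
    // number of reduce tasks.
    public static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 2;
        int[] counts = new int[numReduceTasks];

        // Only two documents, as in the report above (URLs made up).
        String[] urls = {"http://example.com/a", "http://example.com/b"};
        for (String url : urls) {
            counts[partition(url, numReduceTasks)]++;
        }

        // Nothing guarantees that every partition receives a document;
        // a reduce task that gets none still writes a part-0000N
        // directory containing only a bare "segments" file, which
        // dedup then fails to open as a Lucene index.
        System.out.println(Arrays.toString(counts));
    }
}
```

With two reduce tasks and only two documents, both URLs may well hash to the
same partition, leaving the other reducer to write an empty part-0000N index
- hence the advice to index more documents than 2 * the number of reduce
tasks.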

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com