Posted to user@nutch.apache.org by Howie Wang <ho...@hotmail.com> on 2006/09/15 22:27:23 UTC

mergesegs is losing (and adding!) urls

I'm on Nutch 0.7, and I just noticed recently that after
merging segments, a lot of URLs that I thought should be
there disappeared. I did a segread -dumpsort on the
original segments and on the merged segment and found
that I had lost 30% of my URLs.

Doing a diff on the URL files, I found that some URLs were
even resurrected (they didn't show up in the original
segments, but showed up in the merged segment).
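
For anyone who wants to repeat this kind of comparison, here is a minimal
sketch. It assumes the segread dumps have already been reduced to plain text
files with one URL per line; the helper names and the file names in the usage
comment are hypothetical and are not part of Nutch.

```python
def load_urls(path):
    """Read one URL per line into a set, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def compare_url_sets(before, after):
    """Return (lost, added, lost_pct) for two sets of URLs.

    lost     -- URLs in the original segments but not in the merged one
    added    -- URLs in the merged segment but not in the originals
    lost_pct -- lost URLs as a percentage of the original set
    """
    lost = sorted(before - after)
    added = sorted(after - before)
    lost_pct = 100.0 * len(lost) / len(before) if before else 0.0
    return lost, added, lost_pct


# Usage (hypothetical file names):
#   before = load_urls("original_urls.txt")
#   after = load_urls("merged_urls.txt")
#   lost, added, lost_pct = compare_url_sets(before, after)
#   print(len(lost), "lost;", len(added), "added;", round(lost_pct, 1), "% lost")
```

Using sets rather than a line-by-line diff ignores ordering and repeated
lines, so only genuinely missing or new URLs are reported.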

I checked the logs and there was one small corrupted segment
(not enough to account for the lost URLs), but mergesegs just
seemed to ignore it and go on.

I commented out the code in SegmentMergeTool.java that
had to do with deleting duplicates, and the problem went
away. I get the same set of URLs before and after merging.

My plan for now is to locally comment out this deletion
code, and use bin/nutch dedup on the merged index, but
I was wondering if anyone else has seen this problem in either
0.7 or 0.8. Any ideas on why it might be happening?

Thanks!
Howie



Re: mergesegs is losing (and adding!) urls

Posted by Howie Wang <ho...@hotmail.com>.
Thanks for the response, Andrzej.

>Mergesegs also performs dedup. If you compare the list of urls in the index 
>based on the original input segments, but AFTER dedup, and in the index 
>built from the merged segment, are they different?

I should have specified: I didn't run index after merging. I just ran
bin/nutch mergesegs -dir mydb/segments (no -i or -ds options)
and then immediately did a segread on the new merged segment.
The list of URLs is different -- mostly missing URLs, but also
some "new" URLs.

I find the addition of new URLs in the merged segments especially
puzzling. Where do they come from? Is segread lying to me about
what's in the original segments?

I checked the segread output on the deleted URLs and I don't
find anything strange in their status.

I have a feeling that the mergesegs dedup is what is causing the
problem since when I commented out this code, the list of urls
is the same before and after merging. It's possible that I have
some sort of corruption in the original segments that is causing
unpredictable behavior in the mergesegs dedup code.
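
A toy model of that reasoning follows. This is not the actual
SegmentMergeTool logic; the (url, content_hash) record shape and the
keep-the-first-record rule are assumptions made only for illustration. The
point is that a dedup step which only drops records can shrink the set of
URLs but cannot grow it, so "new" URLs would have to come from somewhere
else (for example, corrupted input).

```python
def toy_dedup(records):
    """Keep the first record seen for each URL and each content hash.

    records is an iterable of (url, content_hash) tuples. This is a
    hypothetical simplification, not Nutch's implementation.
    """
    seen_urls = set()
    seen_hashes = set()
    kept = []
    for url, content_hash in records:
        # Drop the record if either its URL or its content was already seen.
        if url in seen_urls or content_hash in seen_hashes:
            continue
        seen_urls.add(url)
        seen_hashes.add(content_hash)
        kept.append((url, content_hash))
    return kept


records = [
    ("http://example.com/a", "h1"),
    ("http://example.com/a", "h2"),  # duplicate URL, dropped
    ("http://example.com/b", "h1"),  # duplicate content, dropped
    ("http://example.com/c", "h3"),
]
kept = toy_dedup(records)
urls_before = {url for url, _ in records}
urls_after = {url for url, _ in kept}
# A drop-only dedup always yields a subset of the input URLs.
print(urls_after <= urls_before)
```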

>Could you perhaps provide a minimal fetchlist + exact steps you took, to 
>illustrate and reproduce the problem?

I don't have a minimal fetchlist right now. I'll see if I can get one
together. I wouldn't be surprised if the problem only occurred after
getting a significant number of pages.

Thanks,
Howie



Re: mergesegs is losing (and adding!) urls

Posted by Andrzej Bialecki <ab...@getopt.org>.
Howie Wang wrote:
> I'm on Nutch 0.7, and I just noticed recently that after
> merging segments, a lot of URLs that I thought should be
> there disappeared. I did a segread -dumpsort on the
> original segments and on the merged segment and found
> that I had lost 30% of my URLs.
>
> Doing a diff on the URL files, I found that some URLs were
> even resurrected (they didn't show up in the original
> segments, but showed up in the merged segment).
>
> I checked the logs and there was one small corrupted segment
> (not enough to account for the lost URLs), but mergesegs just
> seemed to ignore it and go on.
>
> I commented out the code in SegmentMergeTool.java that
> had to do with deleting duplicates, and the problem went
> away. I get the same set of URLs before and after merging.
>
> My plan for now is to locally comment out this deletion
> code, and use bin/nutch dedup on the merged index, but
> I was wondering if anyone else has seen this problem in either
> 0.7 or 0.8. Any ideas on why it might be happening?

Mergesegs also performs dedup. If you compare the list of urls in the 
index based on the original input segments, but AFTER dedup, and in the 
index built from the merged segment, are they different?

Could you perhaps provide a minimal fetchlist + exact steps you took, to 
illustrate and reproduce the problem?

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com