Posted to user@nutch.apache.org by "NG-Marketing, M.Schneider" <sc...@ng-marketing.com> on 2006/10/30 10:41:09 UTC
mergesegs bigger than original
Hello List,
I noticed that if I run a mergesegs on a single segment, the resulting
segment is bigger than the original. Is it a feature or a bug?
This behaviour occurred mainly in the directory parse-data. All other dirs
are more or less equal (+-3MB).
Original
Parse-data: 550 MB
After mergesegs
Parse-data: 1.5 GB
Greetings
Matthias
RE: mergesegs bigger than original
Posted by "NG-Marketing, M.Schneider" <sc...@ng-marketing.com>.
> > Hello List,
> > I noticed that if I run a mergesegs on a single segment, the resulting
> > segment is bigger than the original. Is it a feature or a bug?
> > This behaviour occurred mainly in the directory parse-data. All other dirs
> > are more or less equal (+-3MB).
> >
> > Org
> > Parse-data.: 550 MB
> >
> > After mergesegs
> > Parse-data: 1.5 GB
> >
> This looks wrong. Could you try it on a smaller segment, dump the
> "before" and "after" to a text file and see what's different?
>
Hmm, it looks like it has something to do with DMOZ-parsed fetchlists.
Unfortunately, I cannot reconstruct the complete process. On smaller segments
the merge went well. I will let you know if I find out more.
Greets
Matthias
Re: mergesegs bigger than original
Posted by Andrzej Bialecki <ab...@getopt.org>.
NG-Marketing, M.Schneider wrote:
> Hello List,
>
> I noticed that if I run a mergesegs on a single segment, the resulting
> segment is bigger than the original. Is it a feature or a bug?
>
> This behaviour occurred mainly in the directory parse-data. All other dirs
> are more or less equal (+-3MB).
>
> Org
> Parse-data.: 550 MB
>
> After mergesegs
> Parse-data: 1.5 GB
>
This looks wrong. Could you try it on a smaller segment, dump the
"before" and "after" to a text file and see what's different?
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
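
The "before" and "after" comparison suggested in this thread could be sketched roughly as follows. This is only an illustration, not part of the original messages: it assumes both segments have already been dumped to plain text files (for instance with Nutch's segment reader tool, if your version provides one) and that each record in the dump contains a line starting with "URL:: ". That prefix, the file names, and the helper names count_urls and compare_dumps are assumptions made for this example. The script reports URLs whose record count differs between the two dumps, which would show whether the merged segment holds duplicated entries.

```python
import os
import tempfile
from collections import Counter

# Assumed marker for the line that carries a record's URL in the text dump.
# Adjust it if your dump format differs.
URL_PREFIX = "URL:: "


def count_urls(dump_path):
    """Count how many records each URL has in one text dump."""
    counts = Counter()
    with open(dump_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if line.startswith(URL_PREFIX):
                counts[line[len(URL_PREFIX):].strip()] += 1
    return counts


def compare_dumps(before_path, after_path):
    """Return {url: (count_before, count_after)} for URLs whose counts differ."""
    before = count_urls(before_path)
    after = count_urls(after_path)
    return {
        url: (before[url], after[url])
        for url in sorted(set(before) | set(after))
        if before[url] != after[url]
    }


if __name__ == "__main__":
    # Tiny self-contained demo with made-up dump contents.
    tmp = tempfile.mkdtemp()
    before_file = os.path.join(tmp, "before.txt")
    after_file = os.path.join(tmp, "after.txt")
    with open(before_file, "w", encoding="utf-8") as fh:
        fh.write("Recno:: 0\nURL:: http://a.example/\n\n")
    with open(after_file, "w", encoding="utf-8") as fh:
        fh.write("Recno:: 0\nURL:: http://a.example/\n\n")
        fh.write("Recno:: 1\nURL:: http://a.example/\n\n")
    for url, (n_before, n_after) in compare_dumps(before_file, after_file).items():
        print(f"{url}: {n_before} record(s) before, {n_after} after")
```

If the record counts match for every URL, the size difference more likely comes from how the data is stored (for example compression settings) than from duplicated records, and a line-by-line diff of the two dumps would be the next thing to look at.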