Posted to user@nutch.apache.org by "NG-Marketing, M.Schneider" <sc...@ng-marketing.com> on 2006/10/30 10:41:09 UTC

mergesegs bigger than original

Hello List,

I noticed that when I run mergesegs on a single segment, the resulting
segment is bigger than the original. Is this a feature or a bug?

The growth is mainly in the parse-data directory. All other directories
are more or less the same size as before (within about 3 MB).

Original
Parse-data: 550 MB

After mergesegs
Parse-data: 1.5 GB

Greetings
Matthias

RE: mergesegs bigger than original

Posted by "NG-Marketing, M.Schneider" <sc...@ng-marketing.com>.
> > Hello List,
> > I noticed that if I run a mergesegs on a single segment, the resulting
> > segment is bigger than the original. Is it a feature or a bug?
> > This behaviour occurred mainly in the directory parse-data. All other dirs
> > are more or less equal (+-3MB).
> >
> > Org
> > Parse-data.: 550 MB
> >
> > After mergesegs
> > Parse-data: 1.5 GB
> >

> This looks wrong. Could you try it on a smaller segment, dump the
> "before" and "after" to a text file and see what's different?
> 
Hmm, it looks like it has something to do with fetchlists built from parsed
DMOZ data. Unfortunately, I cannot reconstruct the complete process. On
smaller segments the merge went well. I will let you know if I find out more.

Greets 
Matthias


Re: mergesegs bigger than original

Posted by Andrzej Bialecki <ab...@getopt.org>.
NG-Marketing, M.Schneider wrote:
> Hello List,
>
> I noticed that if I run a mergesegs on a single segment, the resulting
> segment is bigger than the original. Is it a feature or a bug?
>
> This behaviour occurred mainly in the directory parse-data. All other dirs
> are more or less equal (+-3MB).
>
> Org
> Parse-data.: 550 MB
>
> After mergesegs
> Parse-data: 1.5 GB

This looks wrong. Could you try it on a smaller segment, dump the 
"before" and "after" to a text file and see what's different?
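
A minimal sketch of such a comparison, assuming both segments have already
been dumped to plain text (for example with Nutch's readseg -dump, if your
version provides it). The function names and the file names in the usage note
are hypothetical, and nothing here depends on the exact dump format: it only
counts how often each line occurs in the two files and reports the lines
whose counts differ, which should make duplicated records stand out.

```python
from collections import Counter


def count_changes(before_lines, after_lines):
    """Return {line: (count_before, count_after)} for lines whose counts differ."""
    before = Counter(line.rstrip("\n") for line in before_lines)
    after = Counter(line.rstrip("\n") for line in after_lines)
    return {
        line: (before[line], after[line])
        for line in before.keys() | after.keys()
        if before[line] != after[line]
    }


def report(before_path, after_path, top=20):
    """Print the lines that grew the most between the two dump files."""
    with open(before_path, encoding="utf-8", errors="replace") as b, \
         open(after_path, encoding="utf-8", errors="replace") as a:
        changes = count_changes(b, a)
    # Largest growth first: lines repeated many more times after the merge.
    ranked = sorted(
        changes.items(),
        key=lambda item: item[1][1] - item[1][0],
        reverse=True,
    )
    for line, (n_before, n_after) in ranked[:top]:
        print(f"{n_before:>6} -> {n_after:>6}  {line[:100]}")
```

Calling report("before.txt", "after.txt") on the two dumps prints the lines
that grew the most. If the same metadata lines show up two or three times as
often after the merge, that would point to duplicated entries rather than to
a difference in compression or file layout.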

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com