Posted to user@nutch.apache.org by Byron Miller <by...@yahoo.com> on 2005/10/27 21:47:02 UTC

Peak index performance

When generating an index from a segment, is there a
measure of peak performance?

For example, I've been tweaking maxMergeDocs/minMergeDocs and
such, and I've been able to double my performance
without increasing anything but CPU load.

Is there a point at which tweaking these will cause a
heavier IO load, or is there lots of CPU work actually
being done? I would have expected to see IO wait,
but it's 49.5 percent user, 5.6 sys, 40.4 idle, and
every once in a while 0.5 IO wait.  (Hyperthreaded CPU,
so take the idle figure with a grain of salt.)

Re: Peak index performance

Posted by Byron Miller <by...@yahoo.com>.
My testing is on 100k documents, but most of the time
I work with 1 million so I don't have a gazillion
segments across my servers.

I'll try adjusting that number down and see what
happens.

-byron

--- Doug Cutting <cu...@nutch.org> wrote:

> Byron Miller wrote:
> > <property>
> >   <name>indexer.mergeFactor</name>
> >   <value>350</value>
> >   <description>
> >   </description>
> > </property>
> > 
> > Initially, a high index mergeFactor caused out-of-file-handle
> > errors, but increasing the other settings along with it
> > seemed to help get around that.
> 
> That is a very large mergeFactor, larger than I would
> recommend.  How many documents do you index in a run?
> More than 350*500 = 175,000?  If not, you're not hitting
> a merge yet.  What does 'ulimit -n' show?  Does your
> performance actually change much when you lower this?
> 
> Doug
> 
> 


Re: Peak index performance

Posted by Doug Cutting <cu...@nutch.org>.
Byron Miller wrote:
> <property>
>   <name>indexer.mergeFactor</name>
>   <value>350</value>
>   <description>
>   </description>
> </property>
> 
> Initially, a high index mergeFactor caused out-of-file-handle
> errors, but increasing the other settings along with it
> seemed to help get around that.

That is a very large mergeFactor, larger than I would recommend.  How
many documents do you index in a run?  More than 350*500 = 175,000?  If
not, you're not hitting a merge yet.  What does 'ulimit -n' show?
Does your performance actually change much when you lower this?

Doug
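Doug's threshold arithmetic can be sketched as a quick calculation. The variable names below mirror the Nutch property names, and the merge rule assumed here (documents are buffered until a flush of minMergeDocs, and mergeFactor flushed segments are needed before the first on-disk merge) is his description from this thread, not verified against the code:

```java
// Sketch of Doug's arithmetic: with the Lucene merge policy of this era,
// roughly mergeFactor * minMergeDocs documents must be indexed before
// the first on-disk segment merge is triggered.
public class FirstMergeThreshold {
    static long docsBeforeFirstMerge(int mergeFactor, int minMergeDocs) {
        return (long) mergeFactor * minMergeDocs;
    }

    public static void main(String[] args) {
        // Byron's settings: indexer.mergeFactor=350, indexer.minMergeDocs=500
        long threshold = docsBeforeFirstMerge(350, 500);
        System.out.println(threshold); // 175000

        // A 100k-document test run never reaches this threshold, so the
        // large mergeFactor is effectively unexercised in such a run.
        System.out.println(100000 >= threshold); // false
    }
}
```

This is why lowering mergeFactor may show no change at all on a 100k test: the setting never comes into play below 175,000 documents.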

Re: Peak index performance

Posted by Byron Miller <by...@yahoo.com>.
Just as an aside: with Java 1.4, if I increase these
further, performance degrades much more quickly.  With 1.5
I've bumped minMergeDocs up to 600 and doubled my
rec/s processing speed (now nearly 400 rec/s).

Above 600, performance starts to dwindle.

-byron


--- Byron Miller <by...@yahoo.com> wrote:

> I've been working with the following to consistently
> get 200 rec/s indexed (index_more and language-ident
> enabled).
> 
> Mind you, I have oversized these and I'm working
> backwards to shrink them down (all this machine does
> is index). The odd thing is the JVM really didn't change
> much with these adjusted.  Resident memory use went
> up a bit, but CPU and overall memory usage didn't
> change. This is on a 2 GB RAM server.
> 
> <property>
>   <name>lang.ngram.max.length</name>
>   <value>3</value>
>   <description>
>   </description>
> </property>
> 
> <property>
>   <name>lang.analyze.max.length</name>
>   <value>512</value>
>   <description>
>   </description>
> </property>
> 
> <property>
>   <name>indexer.minMergeDocs</name>
>   <value>500</value>
>   <description>
>   </description>
> </property>
> 
> <property>
>   <name>indexer.maxMergeDocs</name>
>   <value>17179869176</value>
>   <description>
>   </description>
> </property>
> 
> <property>
>   <name>indexer.mergeFactor</name>
>   <value>350</value>
>   <description>
>   </description>
> </property>
> 
> Initially, a high index mergeFactor caused out-of-file-handle
> errors, but increasing the other settings along with it
> seemed to help get around that.
> 
> -byron
> 
> 
> --- Doug Cutting <cu...@nutch.org> wrote:
> 
> > Byron Miller wrote:
> > > For example, I've been tweaking maxMergeDocs/minMergeDocs and
> > > such, and I've been able to double my performance
> > > without increasing anything but CPU load.
> > 
> > Smaller maxMergeDocs will cost you in the end, since these will
> > eventually be merged during the index optimization at the end.  I would
> > just leave this at Integer.MAX_VALUE.
> > 
> > Larger minMergeDocs will improve performance, but by using more heap.
> > So watch your heap size as you increase this and leave a healthy margin
> > for safety.  This is the best way to tweak indexing performance.
> > 
> > Larger mergeFactors may improve performance somewhat, but by using more
> > file handles.  In general, the maximum number of file handles is around
> > 10-20x the mergeFactor (depending on plugins).  So raising this above 50
> > on most systems is risky, and the performance improvements are marginal,
> > so I wouldn't bother.
> > 
> > Doug
> > 
> 
> 


Re: Peak index performance

Posted by Byron Miller <by...@yahoo.com>.
I've been working with the following to consistently
get 200 rec/s indexed (index_more and language-ident
enabled).

Mind you, I have oversized these and I'm working
backwards to shrink them down (all this machine does
is index). The odd thing is the JVM really didn't change
much with these adjusted.  Resident memory use went
up a bit, but CPU and overall memory usage didn't
change. This is on a 2 GB RAM server.

<property>
  <name>lang.ngram.max.length</name>
  <value>3</value>
  <description>
  </description>
</property>

<property>
  <name>lang.analyze.max.length</name>
  <value>512</value>
  <description>
  </description>
</property>

<property>
  <name>indexer.minMergeDocs</name>
  <value>500</value>
  <description>
  </description>
</property>

<property>
  <name>indexer.maxMergeDocs</name>
  <value>17179869176</value>
  <description>
  </description>
</property>

<property>
  <name>indexer.mergeFactor</name>
  <value>350</value>
  <description>
  </description>
</property>

Initially, a high index mergeFactor caused out-of-file-handle
errors, but increasing the other settings along with it
seemed to help get around that.

-byron
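A rough way to reason about the heap side of the configuration above: each document buffered under indexer.minMergeDocs sits in RAM until a flush, so the buffer grows linearly with the setting. The per-document size below is purely an assumption for illustration, not a measurement from this thread:

```java
// Back-of-envelope heap estimate for the in-memory document buffer.
// The average indexed size per document is an assumption (10 KB here);
// real documents vary widely, so treat the result as an order of magnitude.
public class BufferHeapEstimate {
    static long bufferBytes(int minMergeDocs, int avgBytesPerDoc) {
        return (long) minMergeDocs * avgBytesPerDoc;
    }

    public static void main(String[] args) {
        // minMergeDocs=500 at ~10 KB/doc: about 5 MB of buffered documents.
        System.out.println(bufferBytes(500, 10 * 1024)); // 5120000

        // minMergeDocs=600 at the same assumed size: about 6 MB.
        System.out.println(bufferBytes(600, 10 * 1024)); // 6144000
    }
}
```

On a 2 GB machine, a few megabytes of buffer is a small fraction of the heap, which is consistent with the observation that the JVM footprint barely changed.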


--- Doug Cutting <cu...@nutch.org> wrote:

> Byron Miller wrote:
> > For example, I've been tweaking maxMergeDocs/minMergeDocs and
> > such, and I've been able to double my performance
> > without increasing anything but CPU load.
> 
> Smaller maxMergeDocs will cost you in the end, since these will
> eventually be merged during the index optimization at the end.  I would
> just leave this at Integer.MAX_VALUE.
> 
> Larger minMergeDocs will improve performance, but by using more heap.
> So watch your heap size as you increase this and leave a healthy margin
> for safety.  This is the best way to tweak indexing performance.
> 
> Larger mergeFactors may improve performance somewhat, but by using more
> file handles.  In general, the maximum number of file handles is around
> 10-20x the mergeFactor (depending on plugins).  So raising this above 50
> on most systems is risky, and the performance improvements are marginal,
> so I wouldn't bother.
> 
> Doug
> 


Re: Peak index performance

Posted by Doug Cutting <cu...@nutch.org>.
Byron Miller wrote:
> For example, I've been tweaking maxMergeDocs/minMergeDocs and
> such, and I've been able to double my performance
> without increasing anything but CPU load.

Smaller maxMergeDocs will cost you in the end, since these will 
eventually be merged during the index optimization at the end.  I would 
just leave this at Integer.MAX_VALUE.

Larger minMergeDocs will improve performance, but by using more heap. 
So watch your heap size as you increase this and leave a healthy margin 
for safety.  This is the best way to tweak indexing performance.

Larger mergeFactors may improve performance somewhat, but by using more
file handles.  In general, the maximum number of file handles is around
10-20x the mergeFactor (depending on plugins).  So raising this above 50
on most systems is risky, and the performance improvements are marginal,
so I wouldn't bother.

Doug
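Doug's advice condenses to three rules, sketched below. The 15x multiplier is an assumed midpoint of his 10-20x range, and the 1024-descriptor figure is only a common 'ulimit -n' default, not something stated in this thread:

```java
// Condensing Doug's tuning advice: leave maxMergeDocs at Integer.MAX_VALUE,
// raise minMergeDocs as far as heap comfortably allows, and keep
// mergeFactor modest so the file-handle budget is not exceeded.
public class IndexTuning {
    // Doug: open file handles run roughly 10-20x the mergeFactor.
    // 15x is an assumed midpoint used here purely for estimation.
    static int estimatedFileHandles(int mergeFactor) {
        return 15 * mergeFactor;
    }

    public static void main(String[] args) {
        int typicalUlimit = 1024; // common 'ulimit -n' default (assumption)

        // mergeFactor=50 (Doug's suggested ceiling) fits the default limit:
        System.out.println(estimatedFileHandles(50));  // 750

        // mergeFactor=350 blows well past it, which would explain the
        // out-of-file-handle errors reported earlier in the thread:
        System.out.println(estimatedFileHandles(350)); // 5250
        System.out.println(estimatedFileHandles(350) > typicalUlimit); // true
    }
}
```

This also suggests why raising the other settings "seemed to help": larger minMergeDocs means fewer flushed segments exist at once, deferring the point where the handle budget is exhausted rather than fixing it.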