Posted to solr-user@lucene.apache.org by Britske <gb...@gmail.com> on 2008/04/05 16:09:16 UTC

indexing slow, IO-bound?

Hi, 

I have a schema with a lot of (about 10000) non-stored indexed fields, which
I use for sorting. (no really, that is needed). Moreover I have about 30
stored fields. 

Indexing of these documents takes a long time. Because of the size of the
documents (because of the indexed fields) I am currently batching 50
documents at once, which takes about 2 seconds. Without adding the 10000
indexed fields to the document, indexing flies at about 15 ms for these 50
documents. Indexing is done using SolrJ.

This is on an Intel Core 2 6400 @ 2.13 GHz with 2 GB RAM.

To speed this up I let 2 threads do the indexing in parallel. What happens
is that solr just takes double the time (about 4 seconds) to complete these
two jobs of 50 docs each in parallel. I figured because of the multi-core
setup indexing should improve, which it doesn't. 
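
As a rough check on the figures above, the sketch below turns the reported timings into throughput. The function name is made up for illustration, and the timings are the approximate values quoted in this message, not new measurements. One thread and two threads both come out at 25 docs/s, so the second thread adds no throughput.

```python
def docs_per_second(docs: int, seconds: float) -> float:
    """Throughput for `docs` documents indexed in `seconds` of wall-clock time."""
    return docs / seconds


# Approximate timings taken from the message above.
one_thread = docs_per_second(50, 2.0)          # one batch of 50 docs in about 2 s
two_threads = docs_per_second(2 * 50, 4.0)     # two parallel batches in about 4 s
no_sort_fields = docs_per_second(50, 0.015)    # same batch without the 10000 fields

# A value of 1.0 means the second thread bought no extra throughput.
speedup = two_threads / one_thread

print(f"one thread:          {one_thread:.0f} docs/s")
print(f"two threads:         {two_threads:.0f} docs/s")
print(f"without sort fields: {no_sort_fields:.0f} docs/s")
print(f"speedup from second thread: {speedup:.1f}x")
```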

Does this perhaps indicate that the setup is IO-bound? What would be your
best guess (given the fact that the schema has a large number of indexed
fields) to try next to improve indexing performance?

Geert-Jan


RE: indexing slow, IO-bound?

Posted by "Norskog, Lance" <la...@divvio.com>.
Also, Linux has optional file systems that might be better for this. We
plan to try them. ReiserFS and XFS have good reputations. (Reiser
himself, that's a different story :( )

Cheers,

Lance

-----Original Message-----
From: Mike Klaas [mailto:mike.klaas@gmail.com] 
Sent: Monday, April 07, 2008 12:04 PM
To: solr-user@lucene.apache.org
Subject: Re: indexing slow, IO-bound?

On 5-Apr-08, at 7:09 AM, Britske wrote:

> Indexing of these documents takes a long time. Because of the size of
> the documents (because of the indexed fields) I am currently batching
> 50 documents at once, which takes about 2 seconds. Without adding the
> 10000 indexed fields to the document, indexing flies at about 15 ms
> for these 50 documents. Indexing is done using SolrJ.
>
> This is on an Intel Core 2 6400 @ 2.13 GHz with 2 GB RAM.
>
> To speed this up I let 2 threads do the indexing in parallel. What
> happens is that solr just takes double the time (about 4 seconds) to
> complete these two jobs of 50 docs each in parallel. I figured because
> of the multi-core setup indexing should improve, which it doesn't.

Multiple processors really only help indexing speeds when there is heavy
analysis.

> Does this perhaps indicate that the setup is IO-bound? What would be
> your best guess (given the fact that the schema has a large number of
> indexed fields) to try next to improve indexing performance?

Use Lucene 2.3 with Solr 1.2, or simply try out Solr trunk.  The
indexing has been reworked to be considerably faster (it also makes
better use of multiple processors by spawning a background merging
thread).

-Mike

Re: indexing slow, IO-bound?

Posted by Norberto Meijome <fr...@meijome.net>.
On Mon, 7 Apr 2008 16:37:48 -0400
"Yonik Seeley" <yo...@apache.org> wrote:

> On Mon, Apr 7, 2008 at 4:30 PM, Mike Klaas <mi...@gmail.com> wrote:
> >  'top', 'vmstat' tell exactly what's going on in terms of io and cpu on
> > unix.  Perhaps someone has gotten these to work under windows with cygwin.  
> 
> The Windows Task Manager is a pretty good replacement for top... do
> "select columns" and you can get all sorts of stuff like number of
> threads, file handles, page faults, etc.  You can also simply see if
> things are CPU bound or not (sort by the CPU column, or go to the
> "Performance" tab).

I suggest you use the Performance Monitor tool; in server versions of Win32 it should be under Administrative Tools. You can also generate logs for later review (otherwise it only shows you the last x minutes of activity). You can mix and match different performance providers. I'm not sure whether Java itself provides counters; you *may* be able to trace CPU / memory by application once the app is running, but I doubt you can do that for IO.
If only you had DTrace on Windows ;)

B

_________________________
{Beto|Norberto|Numard} Meijome

"Web2.0 is what you were doing while the rest of us were building businesses."
  The Reverend

I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.

Re: indexing slow, IO-bound?

Posted by Yonik Seeley <yo...@apache.org>.
On Mon, Apr 7, 2008 at 4:30 PM, Mike Klaas <mi...@gmail.com> wrote:
>  'top', 'vmstat' tell exactly what's going on in terms of io and cpu on
> unix.  Perhaps someone has gotten these to work under windows with cygwin.

The Windows Task Manager is a pretty good replacement for top... do
"select columns" and you can get all sorts of stuff like number of
threads, file handles, page faults, etc.  You can also simply see if
things are CPU bound or not (sort by the CPU column, or go to the
"Performance" tab).

For a better view inside the JVM, use JDK 1.6 (Java 6) and run jconsole.

-Yonik

Re: indexing slow, IO-bound?

Posted by Mike Klaas <mi...@gmail.com>.
On 7-Apr-08, at 12:21 PM, Geert-Jan Brits wrote:
> Thanks Mike, I'll try that.
>
> So not being CPU-bound, you would indeed think indexing here is
> IO-bound? (Maybe it generally is, I'm not sure.)

That's pretty much impossible for me to say.  All I said was that it  
doesn't seem to be bound by the _analysis_ phase of indexing.

> What's a good tool to profile IO on windows, anyone?

'top', 'vmstat' tell exactly what's going on in terms of io and cpu on  
unix.  Perhaps someone has gotten these to work under windows with  
cygwin.
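
The CPU-versus-waiting distinction that these tools report can also be approximated from inside a single process. The sketch below is only an illustration of the idea, not a replacement for 'top' or 'vmstat': it compares CPU time with wall-clock time for a piece of work, and it only sees the calling process, not a separate Solr server. All function names here are hypothetical. A ratio near 1 suggests the work is CPU-bound; a ratio near 0 suggests it spends its time waiting (on IO, the network, or another process).

```python
import time
from typing import Callable


def cpu_to_wall_ratio(work: Callable[[], None]) -> float:
    """Run `work` once and return CPU time divided by wall-clock time."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall if wall > 0 else 0.0


def busy_work() -> None:
    # Pure computation: keeps the CPU busy for the whole duration.
    total = 0
    for i in range(2_000_000):
        total += i * i


def waiting_work() -> None:
    # Stand-in for time spent waiting on a disk or a remote server.
    time.sleep(0.2)


busy_ratio = cpu_to_wall_ratio(busy_work)
waiting_ratio = cpu_to_wall_ratio(waiting_work)
print(f"busy work:    CPU/wall = {busy_ratio:.2f}")
print(f"waiting work: CPU/wall = {waiting_ratio:.2f}")
```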

-Mike

Re: indexing slow, IO-bound?

Posted by Geert-Jan Brits <gb...@gmail.com>.
Thanks Mike, I'll try that.

So not being CPU-bound, you would indeed think indexing here is IO-bound?
(Maybe it generally is, I'm not sure.)
What's a good tool to profile IO on windows, anyone?


2008/4/7, Mike Klaas <mi...@gmail.com>:
>
> On 5-Apr-08, at 7:09 AM, Britske wrote:
>
> > Indexing of these documents takes a long time. Because of the size of the
> > documents (because of the indexed fields) I am currently batching 50
> > documents at once, which takes about 2 seconds. Without adding the 10000
> > indexed fields to the document, indexing flies at about 15 ms for these
> > 50 documents. Indexing is done using SolrJ.
> >
> > This is on an Intel Core 2 6400 @ 2.13 GHz with 2 GB RAM.
> >
> > To speed this up I let 2 threads do the indexing in parallel. What
> > happens is that solr just takes double the time (about 4 seconds) to
> > complete these two jobs of 50 docs each in parallel. I figured because
> > of the multi-core setup indexing should improve, which it doesn't.
>
> Multiple processors really only help indexing speeds when there is heavy
> analysis.
>
> > Does this perhaps indicate that the setup is IO-bound? What would be your
> > best guess (given the fact that the schema has a large number of indexed
> > fields) to try next to improve indexing performance?
>
> Use Lucene 2.3 with Solr 1.2, or simply try out Solr trunk.  The indexing
> has been reworked to be considerably faster (it also makes better use of
> multiple processors by spawning a background merging thread).
>
> -Mike

Re: indexing slow, IO-bound?

Posted by Mike Klaas <mi...@gmail.com>.
On 5-Apr-08, at 7:09 AM, Britske wrote:

> Indexing of these documents takes a long time. Because of the size of the
> documents (because of the indexed fields) I am currently batching 50
> documents at once, which takes about 2 seconds. Without adding the 10000
> indexed fields to the document, indexing flies at about 15 ms for these 50
> documents. Indexing is done using SolrJ.
>
> This is on an Intel Core 2 6400 @ 2.13 GHz with 2 GB RAM.
>
> To speed this up I let 2 threads do the indexing in parallel. What happens
> is that solr just takes double the time (about 4 seconds) to complete these
> two jobs of 50 docs each in parallel. I figured because of the multi-core
> setup indexing should improve, which it doesn't.

Multiple processors really only help indexing speeds when there is  
heavy analysis.
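
One way to test whether a second thread helps at all is to time the same batches once sequentially and once in parallel. The sketch below shows the shape of such a measurement. `fake_index_batch` is a hypothetical stand-in that merely sleeps; in a real test it would be replaced by the actual client call that sends one batch to Solr. If the parallel wall-clock time is not clearly lower than the sequential one, extra client threads are not where the time is going.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

Batch = List[int]


def time_sequential(index_batch: Callable[[Batch], None],
                    batches: Sequence[Batch]) -> float:
    """Wall-clock seconds to send all batches one after another."""
    start = time.perf_counter()
    for batch in batches:
        index_batch(batch)
    return time.perf_counter() - start


def time_parallel(index_batch: Callable[[Batch], None],
                  batches: Sequence[Batch], workers: int = 2) -> float:
    """Wall-clock seconds to send all batches using `workers` threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(index_batch, batches))
    return time.perf_counter() - start


def fake_index_batch(batch: Batch) -> None:
    # Hypothetical placeholder: pretends each batch takes 50 ms on the server.
    time.sleep(0.05)


batches = [list(range(50)), list(range(50))]  # two batches of 50 "documents"
sequential = time_sequential(fake_index_batch, batches)
parallel = time_parallel(fake_index_batch, batches)
print(f"sequential: {sequential:.3f} s, parallel: {parallel:.3f} s")
```

With the sleeping placeholder the parallel run finishes sooner because the two waits overlap. The behaviour reported in this thread is the opposite case: two parallel batches took about as long as running them back to back.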

> Does this perhaps indicate that the setup is IO-bound? What would be your
> best guess (given the fact that the schema has a large number of indexed
> fields) to try next to improve indexing performance?

Use Lucene 2.3 with Solr 1.2, or simply try out Solr trunk.  The
indexing has been reworked to be considerably faster (it also makes
better use of multiple processors by spawning a background merging
thread).

-Mike

RE: indexing slow, IO-bound?

Posted by Jae Joo <jj...@ECNext.com>.
You can tune indexing performance by adjusting these parameters in solrconfig.xml:

<mainIndex>
    <!-- lucene options specific to the main on-disk lucene index -->
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>10</mergeFactor>
    <maxBufferedDocs>1000</maxBufferedDocs>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>10000</maxFieldLength>
  </mainIndex>
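
As a rough illustration of how maxBufferedDocs and mergeFactor interact, here is a deliberately simplified model. It is not Lucene's actual merge policy: it assumes a new segment is flushed every max_buffered_docs documents, and that whenever merge_factor segments pile up at one level they are merged into a single segment at the next level. The function name and the document counts are made up for illustration.

```python
from typing import List, Tuple


def simulate_flushes_and_merges(total_docs: int,
                                max_buffered_docs: int,
                                merge_factor: int) -> Tuple[int, int]:
    """Return (flushes, merges) under the simplified model described above.

    Documents left over after the last full buffer are ignored.
    """
    levels: List[int] = [0]  # levels[i] = segments currently at level i
    flushes = 0
    merges = 0
    for _ in range(total_docs // max_buffered_docs):
        flushes += 1
        levels[0] += 1
        level = 0
        # Cascade: a full level collapses into one segment on the next level.
        while levels[level] == merge_factor:
            levels[level] = 0
            merges += 1
            if level + 1 == len(levels):
                levels.append(0)
            levels[level + 1] += 1
            level += 1
    return flushes, merges


# Hypothetical workload of 100000 documents with mergeFactor 10.
large_buffer = simulate_flushes_and_merges(100_000, 1000, 10)
small_buffer = simulate_flushes_and_merges(100_000, 100, 10)
print("maxBufferedDocs=1000 -> flushes, merges:", large_buffer)
print("maxBufferedDocs=100  -> flushes, merges:", small_buffer)
```

In this model a larger buffer means fewer flushes and fewer merges for the same number of documents, at the cost of more memory held before each flush.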


Jae

-----Original Message-----
From: Britske [mailto:gbrits@gmail.com]
Sent: Sat 4/5/2008 10:09 AM
To: solr-user@lucene.apache.org
Subject: indexing slow, IO-bound?
 

Hi, 

I have a schema with a lot of (about 10000) non-stored indexed fields, which
I use for sorting. (no really, that is needed). Moreover I have about 30
stored fields. 

Indexing of these documents takes a long time. Because of the size of the
documents (because of the indexed fields) I am currently batching 50
documents at once, which takes about 2 seconds. Without adding the 10000
indexed fields to the document, indexing flies at about 15 ms for these 50
documents. Indexing is done using SolrJ.

This is on an Intel Core 2 6400 @ 2.13 GHz with 2 GB RAM.

To speed this up I let 2 threads do the indexing in parallel. What happens
is that solr just takes double the time (about 4 seconds) to complete these
two jobs of 50 docs each in parallel. I figured because of the multi-core
setup indexing should improve, which it doesn't. 

Does this perhaps indicate that the setup is IO-bound? What would be your
best guess (given the fact that the schema has a large number of indexed
fields) to try next to improve indexing performance?

Geert-Jan