Posted to solr-user@lucene.apache.org by Chantal Ackermann <ch...@btelligent.de> on 2009/08/03 18:32:12 UTC

Re: mergeFactor / indexing speed

Hi all,

I'm still struggling with the index performance. I've moved the indexer
to a different machine, now, which is faster and less occupied.

The new machine is a 64-bit RedHat box with 8GB RAM, running JDK 1.6 and
Tomcat 6.0.18 with these settings (among others):
-server -Xms1G -Xmx7G

Currently, I've set mergeFactor to 1000 and ramBufferSize to 256MB.
It has been processing roughly 70k documents in half an hour so far,
which means at least 1.5 hours for 200k - as fast (or slow) as
before, on the less powerful machine.

The machine is not swapping. It is only using 13% of the memory.
iostat gives me:
  iostat
Linux 2.6.9-67.ELsmp      08/03/2009

avg-cpu:  %user   %nice    %sys %iowait   %idle
            1.23    0.00    0.03    0.03   98.71

Basically, it is doing very little? *scratch*

The sourcing database is responding as fast as ever. (I checked that
from my own machine, and only did a ping from the Linux box to the db
server.)

Any help, any hint on where to look would be greatly appreciated.


Thanks!
Chantal


Chantal Ackermann schrieb:
> Hi again!
>
> Thanks for the answer, Grant.
>
>  > It could very well be the case that you aren't seeing any merges with
>  > only 20K docs.  Ultimately, if you really want to, you can look in
>  > your data.dir and count the files.  If you have indexed a lot and have
>  > an MF of 100 and haven't done an optimize, you will see a lot more
>  > index files.
>
> Do you mean that 20k is not representative enough to test those settings?
> I've chosen the smaller data set so that the index can run to completion
> without taking too long.
> If it were faster to begin with, I could use a larger data set, of
> course. I still can't believe that 11 minutes is normal (I haven't
> managed to make it run faster or slower than that; the duration is very
> stable).
>
> It "feels kinda" slow to me...
> Out of your experience - what would you expect as duration for an index
> with:
> - 21 fields, some using a text type with 6 filters
> - database access using DataImportHandler with a query of (far) less
> than 20ms
> - 2 transformers
>
> If I knew that indexing time should be shorter than that, at least, I
> would know that something is definitely wrong with what I am doing or
> with the environment I am using.
>
>  > Likely, but not guaranteed.  Typically, larger merge factors are good
>  > for batch indexing, but a lot of that has changed with Lucene's new
>  > background merger, such that I don't know if it matters as much anymore.
>
> OK. I also read a posting that basically said the default
> parameters are fine and one shouldn't mess around with them.
>
> The thing is that our current search setup uses Lucene directly, and the
> indexer takes less than an hour (MF: 500, maxBufferedDocs: 7500). The
> fields are different, the complete setup is different. But it will be
> hard to advertise a new implementation/setup where indexing is three
> times slower - unless I can give some reasons why that is.
>
> The full index should be fairly fast because the backing data is updated
> every few hours. I want to put in place an incremental/partial update as
> the main process, but full indexing might have to be done at certain times
> if the data has changed completely, or the schema has to be changed/extended.
>
>  > No, those are separate things.  The ramBufferSizeMB (although, I like
>  > the thought of a "rum"BufferSizeMB too!  ;-)  ) controls how many docs
>  > Lucene holds in memory before it has to flush.  MF controls how many
>  > segments are on disk
>
> alas! the rum. I had that typo on the commandline before. that's my
> subconscious telling me what I should do when I get home, tonight...
>
> So, increasing ramBufferSize should lead to higher memory usage,
> shouldn't it? I'm not seeing that. :-(
>
> I'll try once more with MF 10 and a higher rum... well, you know... ;-)
>
> Cheers,
> Chantal
>
> Grant Ingersoll schrieb:
>> On Jul 31, 2009, at 8:04 AM, Chantal Ackermann wrote:
>>
>>> Dear all,
>>>
>>> I want to find out which settings give the best full index
>>> performance for my setup.
>>> Therefore, I have been running a small index (less than 20k
>>> documents) with a mergeFactor of 10 and 100.
>>> In both cases, indexing took about 11.5 min:
>>>
>>> mergeFactor: 10
>>> <str name="Time taken ">0:11:46.792</str>
>>> mergeFactor: 100
>>> /admin/cores?action=RELOAD
>>> <str name="Time taken ">0:11:44.441</str>
>>> Tomcat restart
>>> <str name="Time taken ">0:11:34.143</str>
>>>
>>> This is a Tomcat 5.5.20, started with a max heap size of 1GB. But it
>>> always used much less. No swapping (RedHat Linux 32bit, 3GB RAM, old
>>> ATA disk).
>>>
>>>
>>> Now, I have three questions:
>>>
>>> 1. How can I check which mergeFactor is really being used? The
>>> solrconfig.xml that is displayed in the admin application is the up-
>>> to-date view on the file system. I tested that. But it's not
>>> necessarily what the current SOLR core is using, is it?
>>> Is there a way to check on the actually used mergeFactor (while the
>>> index is running)?
>> It could very well be the case that you aren't seeing any merges with
>> only 20K docs.  Ultimately, if you really want to, you can look in
>> your data.dir and count the files.  If you have indexed a lot and have
>> an MF of 100 and haven't done an optimize, you will see a lot more
>> index files.
>>
>>
>>> 2. I changed the mergeFactor in both available settings (default and
>>> main index) in the solrconfig.xml file of the core I am reindexing.
>>> Is that the correct place? Should a change in performance be
>>> noticeable when increasing from 10 to 100? Or is the change not
>>> perceivable if the requests for data are taking far longer than all
>>> the indexing itself?
>> Likely, but not guaranteed.  Typically, larger merge factors are good
>> for batch indexing, but a lot of that has changed with Lucene's new
>> background merger, such that I don't know if it matters as much anymore.
>>
>>
>>> 3. Do I have to increase rumBufferSizeMB if I increase mergeFactor?
>>> (Or some other setting?)
>> No, those are separate things.  The ramBufferSizeMB (although, I like
>> the thought of a "rum"BufferSizeMB too!  ;-)  ) controls how many docs
>> Lucene holds in memory before it has to flush.  MF controls how many
>> segments are on disk
>>
>>> (I am still trying to get profiling information on how much
>>> application time is eaten up by db connection/requests/processing.
>>> The root entity query takes about 20ms on average. The child entity
>>> query takes less than 10ms.
>>> I have my custom entity processor running on the child entity that
>>> populates the map using a multi-row result set. I have also attached
>>> one regex and one script transformer.)
>>>
>>> Thank you for any tips!
>>> Chantal
>>>
>>>
>>>
>>> --
>>> Chantal Ackermann
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>> using Solr/Lucene:
>> http://www.lucidimagination.com/search
>>




Re: mergeFactor / indexing speed

Posted by Avlesh Singh <av...@gmail.com>.
>
> avg-cpu:  %user   %nice    %sys %iowait   %idle
>           1.23    0.00    0.03    0.03   98.71
>
I agree, really bad statistics, actually.

> Currently, I've set mergeFactor to 1000 and ramBufferSize to 256MB.
>
To me the former appears too high and the latter too low (for your machine
configuration). You can safely increase the ramBufferSize (or
maxBufferedDocs) to a higher value.

Couple of things -

   1. The stock solrconfig.xml comes with two sections, <indexDefaults> and
   <mainIndex>. Options in the latter override the former. Just make sure that
   you have the right values in the right place.
   2. Do you have too many nested entities inside the DIH's data-config? If
   yes, a database-level optimization (creating views, in-memory tables ...)
   might hold the answer.
   3. Have you tried playing around with the JDBC parameters in the data source?
   Setting the "batchSize" property to a considerable value might help (see the
   sketch after this list).

Cheers
Avlesh


Re: mergeFactor / indexing speed

Posted by Chantal Ackermann <ch...@btelligent.de>.
Thanks for the tip, Shalin. I'm happy with 6 indexes running in parallel
and completing in less than 10min right now, but I'll have a look anyway.


Shalin Shekhar Mangar schrieb:
> On Fri, Aug 7, 2009 at 3:58 PM, Chantal Ackermann <
> chantal.ackermann@btelligent.de> wrote:
> 
>> Juhu, great news, guys. I merged my child entity into the root entity, and
>> changed the custom entityprocessor to handle the additional columns
>> correctly.
>> And - indexing 160k documents now takes 5min instead of 1.5h!
>>
> 
> I'm a little late to the party but you may also want to look at
> CachedSqlEntityProcessor.
> 
> --
> Regards,
> Shalin Shekhar Mangar.

Re: mergeFactor / indexing speed

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Fri, Aug 7, 2009 at 3:58 PM, Chantal Ackermann <
chantal.ackermann@btelligent.de> wrote:

> Juhu, great news, guys. I merged my child entity into the root entity, and
> changed the custom entityprocessor to handle the additional columns
> correctly.
> And - indexing 160k documents now takes 5min instead of 1.5h!
>

I'm a little late to the party but you may also want to look at
CachedSqlEntityProcessor.
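
For example, something roughly like this (entity and column names made up
for illustration) - the child query is executed once and cached, instead
of once per parent row:

<entity name="item" query="SELECT id, title FROM item">
  <entity name="attr" processor="CachedSqlEntityProcessor"
          query="SELECT item_id, name, value FROM item_attr"
          where="item_id=item.id"/>
</entity>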

-- 
Regards,
Shalin Shekhar Mangar.

Re: mergeFactor / indexing speed

Posted by Avlesh Singh <av...@gmail.com>.
>
> And - indexing 160k documents now takes 5min instead of 1.5h!
>
Awesome! It works for all!

> (Now I can go relaxed on vacation. :-D )
>
Take me along!

Cheers
Avlesh

>

Re: mergeFactor / indexing speed

Posted by Chantal Ackermann <ch...@btelligent.de>.
Juhu, great news, guys. I merged my child entity into the root entity, 
and changed the custom entityprocessor to handle the additional columns 
correctly.
And - indexing 160k documents now takes 5min instead of 1.5h!

(Now I can go relaxed on vacation. :-D )


Conclusion:
In my case performance was so bad because the indexer was constantly
querying a database on a different machine (network traffic + one db
query per document).
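
In data-config terms the change is roughly this (table names shortened for
illustration; the column names are the ones my processor expects):

Before - one extra query per document:
<entity name="epg" query="SELECT id FROM epg_def">
  <entity name="val" processor="EpgValueEntityProcessor"
          query="SELECT att_name, epg_value, epg_subvalue FROM epg_val
                 WHERE id_epg_definition = '${epg.id}'"/>
</entity>

After - one joined query, pivoted by the custom entity processor:
<entity name="epg" processor="EpgValueEntityProcessor"
        query="SELECT d.id, v.att_name, v.epg_value, v.epg_subvalue
               FROM epg_def d JOIN epg_val v ON v.id_epg_definition = d.id
               ORDER BY d.id"/>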


Thanks for all your help!
Chantal


Avlesh Singh schrieb:
>> does DIH call commit periodically, or are things done in one big batch?
>>
> AFAIK, one big batch.

yes. There is no index available once the full-import has started (and the
searcher has no cache; otherwise it still reads from that). There is no
data visible (i.e. in the Admin/Luke frontend) until the import has
finished correctly.

Re: mergeFactor / indexing speed

Posted by Avlesh Singh <av...@gmail.com>.
>
> does DIH call commit periodically, or are things done in one big batch?
>
AFAIK, one big batch.

Cheers
Avlesh


Re: mergeFactor / indexing speed

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Mon, Aug 3, 2009 at 12:32 PM, Chantal
Ackermann<ch...@btelligent.de> wrote:
> avg-cpu:  %user   %nice    %sys %iowait   %idle
>           1.23    0.00    0.03    0.03   98.71
>
> Basically, it is doing very little? *scratch*

How often is commit being called?  (A Lucene commit syncs all of the
index files so a crash won't result in a corrupted index... this can
be costly.)

Guys - does DIH call commit periodically, or are things done in one big batch?
Chantal - is autocommit configured in solrconfig.xml?
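
If it is, it looks something like this in solrconfig.xml (the thresholds
below are just an example, not a recommendation):

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs> <!-- commit after this many added docs -->
    <maxTime>60000</maxTime> <!-- or after this many milliseconds -->
  </autoCommit>
</updateHandler>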

-Yonik
http://www.lucidimagination.com

Re: mergeFactor / indexing speed

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi,

I'd have to poke around the machine(s) to give you better guidance, but here is some initial feedback:

- mergeFactor of 1000 seems crazy.  mergeFactor is probably not your problem.  I'd go back to the default of 10.
- 256 MB for ramBufferSizeMB sounds OK.
- pinging the DB won't tell you much about the DB server's performance - ssh to the machine and check its CPU load, memory usage, and disk IO

Other things to look into:
- Network as the bottleneck?
- Field analysis as the bottleneck?


Otis 
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR





Re: mergeFactor / indexing speed

Posted by Avlesh Singh <av...@gmail.com>.
>
> Do you think it's possible to return (in the nested entity) rows
> independent of the unique id, and let the processor decide when a document
> is complete?
>
I don't think so.

In my case, I had 9 (JDBC) entities for each document. Most of these
entities returned a single column and a limited number of rows per document.
I observed a significant improvement in performance by using an aggregation
query in my parent query. E.g. in MySQL, I used the group_concat() function to
aggregate all the values (separated using some delimiter) into a single
column of the parent query's resultset. I would then use a RegexTransformer
to split this data on the previously used delimiter and populate a
multi-valued field.
I actually got rid of 5 entities out of 9 in my data-config. It reduced the
import time significantly too.
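
Something along these lines, just for illustration (table and column names
are made up):

<entity name="product" transformer="RegexTransformer"
        query="SELECT p.id, p.name,
                      (SELECT GROUP_CONCAT(t.tag SEPARATOR '|')
                       FROM product_tag t WHERE t.product_id = p.id) AS tags
               FROM product p">
  <field column="id"/>
  <field column="name"/>
  <!-- splitBy turns the concatenated column into a multi-valued field -->
  <field column="tags" splitBy="\|"/>
</entity>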

Cheers
Avlesh

Re: mergeFactor / indexing speed

Posted by Chantal Ackermann <ch...@btelligent.de>.
Hi all,

to keep this thread up to date... ;-)


d) jdbc batch size
changed to 10. (Was default: 500, then 1000)

The problem with my DIH setup is that the root entity query returns a
huge set (all ids that shall be indexed). A larger fetch size would be
good for that query.
The nested entity, however, only ever returns up to 9 rows. The
constraints are so strict (by id) that there is no way that any
additional data could be pre-fetched.
(Actually, anyone using DIH with nested entities should run into that
problem?)

After changing to 10, I cannot see that this low batch size slowed the 
indexer down (significantly).

As I would like to stick with DIH (instead of dumping the data into CSV
and then importing it), here is my question:

Do you think it's possible to return (in the nested entity) rows
independent of the unique id, and let the processor decide when a
document is complete?
The examples in the wiki always use an ID to get the data for the nested
entity, so I'm not sure it was planned with that in mind. But as I'm
already handling multiple db rows for one document, it might not be too
difficult to handle the unique id correctly as well?
Of course, I would need something like a look-ahead to know whether the
next row is already part of the next document (sketched below).
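
To make the idea concrete, here is a minimal sketch of that buffering
pattern in plain Java (not the actual processor - it assumes the merged
query is ordered by the document id, and it just merges rows instead of
pivoting them into lists):

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Buffers one row so the caller can detect document boundaries. */
class LookAheadPivoter {
	private final Iterator<Map<String, Object>> rows;
	private Map<String, Object> buffered; // first row of the *next* document

	LookAheadPivoter(Iterator<Map<String, Object>> rows) {
		this.rows = rows;
	}

	/** Returns one merged document, or null when the rows are exhausted. */
	Map<String, Object> nextDocument(String idColumn) {
		Map<String, Object> row = buffered != null ? buffered
				: rows.hasNext() ? rows.next() : null;
		buffered = null;
		if (row == null)
			return null;
		Object id = row.get(idColumn);
		Map<String, Object> doc = new HashMap<String, Object>(row);
		while (rows.hasNext()) {
			Map<String, Object> next = rows.next();
			if (!id.equals(next.get(idColumn))) {
				buffered = next; // belongs to the next document
				break;
			}
			doc.putAll(next); // same id: merge this row into the document
		}
		return doc;
	}
}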


Cheers,
Chantal



Concerning the other settings (just fyi):

a) mergeFactor 10 (and also tried 100)
I don't think that changed anything for the worse, rather for the better.
So, I'll stick with 10 from now on.

b) ramBufferSizeMB
tried 512, 1024. RAM usage went up when I increased from 256 to 512. Not 
sure about 1024. I'll stick to 512.



Re: mergeFactor / indexing speed

Posted by Chantal Ackermann <ch...@btelligent.de>.
Hi Avlesh,
hi Otis,
hi Grant,
hi all,


(enumerating to keep track of all the input)

a) mergeFactor 1000 too high
I'll change that back to 10. I thought it would make Lucene use more RAM 
before starting IO.

b) ramBufferSize:
OK, or maybe more. I'll keep that in mind.

c) solrconfig.xml - default and main index:
I've always changed both sections, the default and the main index one.

d) JDBC batch size:
I haven't set it. I'll do that.

e) DB server performance:
I agree, a ping is definitely not much information. I also ran queries
against it from my own computer (while the indexer ran) which came back
as fast as usual.
Currently, I don't have a login to ssh to that machine, but I'm going
to try to get one.

f) Network:
I'll definitely need to have a look at that once I have access to the db 
machine.


g) the data

g.1) nested entity in DIH conf
there is only the root and one nested entity. However, that nested 
entity returns multiple rows (about 10) per query. (The fetched-row count
is about 10 times the number of processed documents.)

g.2) my custom EntityProcessor
( The code is pasted at the very end of this e-mail. )
- iterates over those multiple rows,
- uses one column to create a key in a map,
- uses two other columns to create the corresponding value (String 
concatenation),
- if a key already exists, it gets the value, if that value is a list, 
it adds the new value to that list, if it's not a list, it creates one 
and adds the old and the new value to it.
I refrained from adding any business logic to that processor. It treats 
all rows alike, no matter whether they hold values that can appear 
multiple or values that must appear only once.

g.3) the two transformers
- to split one value into two (regex)
<field column="person" />
<field column="participant" sourceColName="person" regex="([^\|]+)\|.*"/>
<field column="role" sourceColName="person" 
regex="[^\|]+\|\d+,\d+,\d+,(.*)"/>

- to extract a number from an existing number (bit shifting
using the script transformer). As that one works on a field that is
potentially multiValued, it needs to take care of creating and
populating a list, as well.
<field column="cat" name="cat" />
<script><![CDATA[
function getMainCategory(row) {
	var cat = row.get('cat');
	var mainCat;
	if (cat != null) {
		// check whether cat is an array
		if (cat instanceof java.util.List) {
			var arr = new java.util.ArrayList();
			for (var i=0; i<cat.size(); i++) {
				mainCat = new java.lang.Integer(cat.get(i)>>8);
				if (!arr.contains(mainCat)) {
					arr.add(mainCat);
				}
			}
			row.put('maincat', arr);
		} else { // it is a single value
			mainCat = new java.lang.Integer(cat>>8);
			row.put('maincat', mainCat);
		}
	}
	return row;
}
]]></script>
(The EpgValueEntityProcessor decides on creating lists on a case by case 
basis: only if a value is specified multiple times for a certain data 
set does it create a list. This is because I didn't want to put any 
complex configuration or business logic into it.)

g.4) fields
the DIH extracts 5 fields from the root entity, 11 fields from the
nested entity, and the transformers might create an additional 3 (multiValued).
schema.xml defines 21 fields (two additional fields: the timestamp field
(default="NOW") and a field collecting three other text fields for
default search (using copyField)):
- 2 long
- 3 integer
- 3 sint
- 3 date
- 6 text_cs (class="solr.TextField" positionIncrementGap="100"):
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="0"
generateWordParts="0" generateNumberParts="0" catenateWords="0" 
catenateNumbers="0" catenateAll="0" />
</analyzer>
- 4 text_de (one is the field populated by copying from the 3 others):
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.LengthFilterFactory" min="2" max="5000" />
<filter class="solr.StopFilterFactory" ignoreCase="true" 
words="stopwords_de.txt" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="1" catenateNumbers="1" 
catenateAll="0" splitOnCaseChange="1" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.SnowballPorterFilterFactory" language="German" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>


Thank you for taking your time!
Cheers,
Chantal





************** EpgValueEntityProcessor.java *******************

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.logging.Logger;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.SqlEntityProcessor;

public class EpgValueEntityProcessor extends SqlEntityProcessor {
	private static final Logger log = Logger
			.getLogger(EpgValueEntityProcessor.class.getName());
	private static final String ATTR_ID_EPG_DEFINITION = 
"columnIdEpgDefinition";
	private static final String ATTR_COLUMN_ATT_NAME = "columnAttName";
	private static final String ATTR_COLUMN_EPG_VALUE = "columnEpgValue";
	private static final String ATTR_COLUMN_EPG_SUBVALUE = "columnEpgSubvalue";
	private static final String DEF_ATT_NAME = "ATT_NAME";
	private static final String DEF_EPG_VALUE = "EPG_VALUE";
	private static final String DEF_EPG_SUBVALUE = "EPG_SUBVALUE";
	private static final String DEF_ID_EPG_DEFINITION = "ID_EPG_DEFINITION";
	private String colIdEpgDef = DEF_ID_EPG_DEFINITION;
	private String colAttName = DEF_ATT_NAME;
	private String colEpgValue = DEF_EPG_VALUE;
	private String colEpgSubvalue = DEF_EPG_SUBVALUE;

	@SuppressWarnings("unchecked")
	public void init(Context context) {
		super.init(context);
		colIdEpgDef = context.getEntityAttribute(ATTR_ID_EPG_DEFINITION);
		colAttName = context.getEntityAttribute(ATTR_COLUMN_ATT_NAME);
		colEpgValue = context.getEntityAttribute(ATTR_COLUMN_EPG_VALUE);
		colEpgSubvalue = context.getEntityAttribute(ATTR_COLUMN_EPG_SUBVALUE);
	}

	public Map<String, Object> nextRow() {
		if (rowcache != null)
			return getFromRowCache();
		if (rowIterator == null) {
			String q = getQuery();
			initQuery(resolver.replaceTokens(q));
		}
		Map<String, Object> pivottedRow = new HashMap<String, Object>();
		Map<String, Object> epgValue;
		String attName, value, subvalue;
		Object existingValue, newValue;
		String id = null;
		
		// return null once the end of that data set is reached
		if (!rowIterator.hasNext()) {
			rowIterator = null;
			return null;
		}
		// as long as there is data, iterate over the rows and pivot them
		// return the pivotted row after the last row of data has been reached
		do {
			epgValue = rowIterator.next();
			id = epgValue.get(colIdEpgDef).toString();
			assert id != null;
			if (pivottedRow.containsKey(colIdEpgDef)) {
				assert id.equals(pivottedRow.get(colIdEpgDef));
			} else {
				pivottedRow.put(colIdEpgDef, id);
			}
			attName = (String) epgValue.get(colAttName);
			if (attName == null) {
				log.warning("No value returned for attribute name column "
						+ colAttName);
			}
			value = (String) epgValue.get(colEpgValue);
			subvalue = (String) epgValue.get(colEpgSubvalue);

			// create a single object for value and subvalue
			// if subvalue is not set, use value only, otherwise create string
			// array
			if (subvalue == null || subvalue.trim().length() == 0) {
				newValue = value;
			} else {
				newValue = value + "|" + subvalue;
			}

			// if there is already an entry for that attribute, extend
			// the existing value
			if (pivottedRow.containsKey(attName)) {
				existingValue = pivottedRow.get(attName);
//				newValue = existingValue + " " + newValue;
//				pivottedRow.put(attName, newValue);
				if (existingValue instanceof List) {
					((List) existingValue).add(newValue);
				} else {
					ArrayList v = new ArrayList();
					Collections.addAll(v, existingValue, newValue);
					pivottedRow.put(attName, v);
				}
			} else {
				pivottedRow.put(attName, newValue);
			}
		} while (rowIterator.hasNext());
		
		pivottedRow = applyTransformer(pivottedRow);
		return pivottedRow;
	}

}

Re: mergeFactor / indexing speed

Posted by Grant Ingersoll <gs...@apache.org>.
How big are your documents?  I haven't benchmarked DIH, so I am not  
sure what to expect, but it does seem like something isn't right.  Can  
you fully describe how you are indexing?  Have you done any profiling?


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search