Posted to dev@lucene.apache.org by Erick Erickson <er...@gmail.com> on 2011/06/13 20:13:06 UTC

Indexing slower in trunk

I half remember that this has come up before, but I couldn't find the
thread. I was running some tests over the weekend that involved
indexing 1.9M documents from the English Wiki dump.

I'm consistently seeing that trunk takes about twice as long to index
the docs as 1.4, 3.2, and 3x. Optimize is also taking quite a bit
longer. I admit that these aren't very sophisticated tests, and I only
ran the trunk process twice (although both of those runs were consistent).

I'm pretty sure my ramBufferSizeMB and autocommit settings are
identical, and I remove the data/index directory before each run. These
results come from running the indexing program in IntelliJ on my Mac;
both the server and the indexing program were running locally.

No, trunk isn't compiling before running <G>.

Here's the server definition:
new StreamingUpdateSolrServer(url, 10, 4);

and I'm batching up the documents and sending them to Solr in batches of 1,000.
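
For concreteness, the send loop is roughly the following sketch (the URL,
field names, and the WikiArticle type are placeholders for illustration,
not my actual code):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    StreamingUpdateSolrServer server =
        new StreamingUpdateSolrServer("http://localhost:8983/solr", 10, 4);
    List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(1000);
    for (WikiArticle article : articles) {     // stand-in for the XML parsing
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", article.getId());     // illustrative field names
      doc.addField("text", article.getBody());
      batch.add(doc);
      if (batch.size() == 1000) {              // ship each batch of 1,000
        server.add(batch);
        batch.clear();
      }
    }
    if (!batch.isEmpty()) server.add(batch);   // last partial batch
    server.commit();                           // docs aren't searchable before this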

So, my question is whether this should be pursued. Note that I'm still
getting around 3K docs/second, which I can't complain about. Not that
that stops me, you understand. And in return for a memory footprint
reduction from 389M to 90M after some off-the-wall sorting and
faceting, I'll take it!

Hmmmm, speaking of which: the memory usage changes seem like a good
candidate for a page on the Wiki; anyone want to suggest a home?


Solr 1.4.1
Total Time Taken-> 257 seconds
Total documents added-> 1917728
Docs/sec-> 7461
starting optimize
optimizing took 26 seconds

Solr 3.2
Total Time Taken-> 243 seconds
Total documents added-> 1917728
Docs/sec-> 7891
starting optimize
optimizing took 21 seconds

Solr 3x
Total Time Taken-> 269 seconds
Total documents added-> 1917728
Docs/sec-> 7129
starting optimize
optimizing took 21 seconds

Solr trunk (2011-06-11, 17:24 EST)
Total Time Taken-> 592 seconds
Total documents added-> 1917728
Docs/sec-> 3239
starting optimize
optimizing took 159 seconds

What do folks think? Is there anything I can/should do to narrow this down?

Erick



Re: Indexing slower in trunk

Posted by Simon Willnauer <si...@googlemail.com>.
On Tue, Jun 14, 2011 at 2:54 PM, Erick Erickson <er...@gmail.com> wrote:
> So, absent the commit step, the times are reasonably
> comparable. Can I impose upon one of you to give a
> two-sentence summary of what DWPT buys us from a
> user perspective? If memory serves it should have
> background merging and other goodies.

Let me try to sum this up in a couple of sentences:

Previously, IW wrote small in-memory segments on a per-thread basis and
merged them together on flush. This meant you couldn't add or update
any documents while we were flushing, and flushing can take a long time.
With DWPT we write segments single-threaded: each thread gets its
own private DWPT. That lets us flush segments to disk concurrently while
indexing carries on at the same time. The performance gains are massive
here; our nightly benchmark sees a 269% speedup in indexing throughput.
The downside is that you eventually have to do more merges, since you
write more smallish segments (there's no in-memory merge on flush). Also,
if you read from the same disk you are writing to, you might see slower
indexing.

If you want to read more about this, I wrote a blog post that explains a
big portion of it:
http://blog.jteam.nl/2011/04/01/gimme-all-resources-you-have-i-can-use-them/
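
If you want to see the effect in isolation, here is a bare-bones
multi-threaded Lucene sketch (the class name, doc contents, and the Version
constant are illustrative and may differ on your checkout):

    import java.io.File;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class DwptDemo {
      public static void main(String[] args) throws Exception {
        IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_40,
            new StandardAnalyzer(Version.LUCENE_40));
        final IndexWriter writer =
            new IndexWriter(FSDirectory.open(new File("/tmp/dwpt-demo")), conf);
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int t = 0; t < 4; t++) {
          pool.submit(new Callable<Void>() {
            public Void call() throws Exception {
              for (int i = 0; i < 100000; i++) {
                Document doc = new Document();
                doc.add(new Field("body", "some text " + i,
                    Field.Store.NO, Field.Index.ANALYZED));
                // each thread fills its own DWPT, so one thread's flush
                // no longer blocks the others
                writer.addDocument(doc);
              }
              return null;
            }
          });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        writer.close();
      }
    }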



Re: Indexing slower in trunk

Posted by Erick Erickson <er...@gmail.com>.
Thanks, guys. Yes, I am running it all locally, and disk seeks
may well be the culprit. This thread is mainly to make sure that
the behavior I'm seeing is expected, or at least explainable.

Really, I don't need to pursue this further unless there's
actually data I can gather to help speed things up. If this
is just a consequence of DWPT and/or my particular
setup, then that's fine. I'm mostly trying to understand
the characteristics of indexing/searching on trunk.
This started with me exploring memory
requirements, and it's really just something I noticed along
the way and wanted to get some feedback on.

So, absent the commit step, the times are reasonably
comparable. Can I impose upon one of you to give a
two-sentence summary of what DWPT buys us from a
user perspective? If memory serves, it should have
background merging and other goodies.

Uwe:
Yep, I was curious about optimize but understand that it's not required
in recent code. That said, data is not searchable until a commit
happens, so just for yucks I changed the optimize to a commit. Stats
from that run are below.

Simon:
OK, I adjusted the RAM buffer size to 512M, and it's a bit faster, but
not by all that much; see the stats. The delta could well be sampling
error; one run doth not a statistical certainty make. Up until the
commit step, the admin stats page shows no documents in
the index, so I think this setting completely avoids intermediate
committing, although that says nothing about the individual writers
writing lots of segments to disk; that still happens.

Added 1880000 docs. Took 1437 ms. cumulative interval (seconds) = 284
Added 1890000 docs. Took 1285 ms. cumulative interval (seconds) = 285
Added 1900000 docs. Took 1182 ms. cumulative interval (seconds) = 286
Added 1910000 docs. Took 1675 ms. cumulative interval (seconds) = 288
About to commit, total time so far: 290
Total Time Taken-> 395 seconds    *** ~100 secs of that was the commit finishing.
Total documents added-> 1917728
Docs/sec-> 4855

Thanks, all
Erick




Re: Indexing slower in trunk

Posted by Simon Willnauer <si...@googlemail.com>.
Erick, it seems you need to adjust your settings for 4.0 a little.
When you index with DWPT, it builds thread-private segments which are
independently flushed to disk. However, when you set your RAM buffer, IW
accumulates the RAM used by all active DWPTs and flushes the largest one
once you reach the buffer limit. With 128M you might end up with lots of
small segments which need to be merged in the background. Eventually
what happens is that your disk is so busy that you are not
able to flush fast enough, and threads might stall.

What you can try here is adjusting your RAM buffer a little higher,
let's say 350MB, or changing the max number of thread states in
DocumentsWriterPerThreadPool, i.e.
ThreadAffinityDocumentsWriterThreadPool. The latter is unfortunately
not exposed in Solr yet, so for testing you may just want to change
the default value in DocumentsWriterPerThreadPool to 4. That will also
cause segments to be bigger eventually.
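
At the Lucene level the RAM buffer knob looks like this (a sketch; in Solr
you would set the equivalent ramBufferSizeMB in solrconfig.xml, and the path
and Version constant here are illustrative):

    // imports as in the sketch upthread
    IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_40,
        new StandardAnalyzer(Version.LUCENE_40));
    conf.setRAMBufferSizeMB(350.0); // bigger buffer -> fewer, larger flushed segments
    IndexWriter writer =
        new IndexWriter(FSDirectory.open(new File("/path/to/index")), conf);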

simon

On Tue, Jun 14, 2011 at 10:28 AM, Uwe Schindler <uw...@thetaphi.de> wrote:
>> -----Original Message-----
>> From: Erick Erickson [mailto:erickerickson@gmail.com]
>> Sent: Tuesday, June 14, 2011 2:46 AM
>> To: dev@lucene.apache.org; simon.willnauer@gmail.com
>> Subject: Re: Indexing slower in trunk
>>
>> Simon:
>>
>> Yep, I was asking to see if it was weird. Pursuant to our
>> chat I tried some things, results below:
>>
>> All these are running on my local machine, same disk, same
>> JVM settings re: memory. The SolrJ indexing is happening
>> from IntelliJ (OK, I'm lazy).
>>
>> Rambuffer is set at 128M in all cases. Merge factor is 10.
>>
>> I'm allocating 2G to the server and 2G to the indexer.
>>
>> Servers get started like this:
>> _server = new StreamingUpdateSolrServer(url, 10, 4);
>>
>> I chose the threads and queue length semi-arbitrarily.
>>
>> Autocommit was originally 10,000 docs / 60,000 ms maxTime for
>> all tests, but I removed that in the case of trunk, substituting
>> a "commitWithin" of 10 minutes in the SolrJ code. That flattened
>> out the run up until after the last doc was added, when I tried
>> to do an optimize: it took 270 seconds on trunk and 19
>> seconds on 3x. I do see a commit happening in the log on trunk
>> when I issue this, which isn't surprising.
>>
>> I'm a little surprised that I also see a bunch of segments forming
>> during indexing on trunk, even though I don't issue any commits and
>> I doubt that I was running into RAM buffer problems. By the time I'd
>> indexed 400K docs, there were 206 files in the index; by 1000K docs
>> there were 416. But I vaguely remember that there's some
>> background merging going on with DWPT, which would
>> explain the fact that the number of files in the directory
>> keeps bouncing up and down (although it generally
>> trends up until flattening out in the low-to-mid 400s). I don't know
>> whether it means anything or not, but the number of files
>> goes down when there are long pauses in the "interval"
>> reported to the console.
>>
>> Note below that the trunk seems slower even without
>> committing. But the optimize shoots through the roof.
>>
>> I've attached a file containing the full console output from
>> both runs, and the indexing code for your perusal. I can
>> easily try different parameters if you have some you
>> want me to run. I can instrument the code, compile and run
>> it if that'd help, just let me know where.
>>
>> I suspect that I've done something silly, but I have no clue
>> what as yet.
>>
>>
>> Erick
>>
>> Here's the end of the console output for the 3x case:
>> End of the run from my Idea console window (Took is in ms):
>>
>> Added 1890000 docs. Took 1361 ms. cumulative interval (seconds) = 248
>> Added 1900000 docs. Took 719 ms. cumulative interval (seconds) = 248
>> Added 1910000 docs. Took 1305 ms. cumulative interval (seconds) = 250
>> Jun 13, 2011 8:01:08 PM org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner run
>> INFO: finished: org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner@60487c5f
>> Jun 13, 2011 8:01:08 PM org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner run
>> INFO: finished: org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner@132f4538
>> Jun 13, 2011 8:01:09 PM org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner run
>> INFO: finished: org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner@469695f
>> Jun 13, 2011 8:01:09 PM org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner run
>> INFO: finished: org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner@58f41393
>> Total Time Taken-> 252 seconds
>> Total documents added-> 1917728
>> Docs/sec-> 7610
>> starting optimize
>> optimizing took 19 seconds
>>
>> Here's the end from trunk:
>> Added 1880000 docs. Took 1513 ms. cumulative interval (seconds) = 304
>> Added 1890000 docs. Took 1518 ms. cumulative interval (seconds) = 306
>> Added 1900000 docs. Took 1703 ms. cumulative interval (seconds) = 307
>> Added 1910000 docs. Took 1161 ms. cumulative interval (seconds) = 309
>> About to commit, total time so far: 309
>> Total Time Taken-> 309 seconds
>> Total documents added-> 1917728
>> Docs/sec-> 6206
>> starting optimize
>> optimizing took 270 seconds
>>
>> On Mon, Jun 13, 2011 at 4:50 PM, Simon Willnauer
>> <si...@googlemail.com> wrote:
>> > On Mon, Jun 13, 2011 at 8:13 PM, Erick Erickson <er...@gmail.com> wrote:
>> >> What do folks think? Is there anything I can/should do to narrow this down?
>> >
>> > Hi Erick,
>> >
>> > this looks weird, I have some questions:
>> >
>> > - are you indexing to the same disk you read the data from?
>> > - what are your rambuffer settings?
>> > - how many threads are you using to send data to solr?
>> > - what is your autocommit setting?
>> >
>> > simon


Re: Indexing slower in trunk

Posted by Erick Erickson <er...@gmail.com>.
OK, I tried using separate machines for indexing and running Solr, connecting
my work and personal Macs with an ethernet cable. A poor man's network...

I also changed things a bit to parse the Wiki data, store 1.5M docs in
memory, and then try to send them to Solr in various-sized batches, thus
removing all the work associated with reading/parsing the XML from the
timings.

And the results are... ambiguous. So I re-read some of the blog posts by
Simon and Mark, and I think where I'm missing out is the phrase
"... computers with highly concurrent hardware ...". I don't have that, and
what I'm seeing is that DWPT doesn't seem to make much difference in
this situation. Of course, my situation is probably totally irrelevant,
since I've got to believe that people indexing *serious* data will have,
shall we say, more impressive hardware than I have.

Or perhaps I should say that whatever I do, I can get trunk and 3x to
perform pretty much equivalently. Do note that what I'm really looking for
is the time until I can search the last document sent to Solr, so a
"commit" step is included in these numbers. If I take that out, I'm seeing
very substantial gains in trunk. So presumably with a run that lasted
longer than just a couple of minutes I'd see impressive speedups.

I suspect that I just don't have enough hardware to consistently
hit the situations where DWPT really shines.

It's also possible that I'm doing something stupid, but until some kind
person sets me up with sufficient hardware I'm afraid I'll have to drop
it <G>....

Best
Erick



Re: Indexing slower in trunk

Posted by Martijn v Groningen <ma...@gmail.com>.
@Uwe
Solr does support expunge deletes:
http://wiki.apache.org/solr/UpdateXmlMessages#Optional_attributes_for_.22commit.22



-- 
With kind regards,

Martijn van Groningen

Re: Indexing slower in trunk

Posted by Erick Erickson <er...@gmail.com>.
OK, after more tests I'm pretty sure that the personal machine
I'm testing on is just resource-constrained, leading to the
results I mentioned before. After all, I'm running my Solr
instance, the indexing program, etc. on a MacBook
with 1 CPU and 2 cores, and the indexing program is also
parsing the XML.

On a proper setup, where the indexing machine was separate
from the machine(s) feeding the index process, I suspect this would
be a different story. Hmmmm, I may try that sometime too....

Best
Erick



RE: Indexing slower in trunk

Posted by Uwe Schindler <uw...@thetaphi.de>.
For simply removing deletes, there is also IW.expungeDeletes(), which is
less intensive! I'm not sure whether Solr supports this too, but as far as
I know there is an issue open for it.

Also please note: as soon as a segment is selected for merging (the merge
policy may also do this depending on the number of deletes in a segment),
the merge will reclaim all deleted resources -- that's what merging does.
So expunging deletes once per week is a good idea if your index consists
of very old and large segments that are rarely merged anymore and lots of
documents have been deleted from them.
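
At the Lucene level it's a one-liner on an open writer; a sketch (the path
and Version constant are illustrative):

    // imports/config as usual for your Lucene release
    IndexWriter writer = new IndexWriter(
        FSDirectory.open(new File("/path/to/index")),
        new IndexWriterConfig(Version.LUCENE_32,
            new StandardAnalyzer(Version.LUCENE_32)));
    writer.expungeDeletes(); // merges only segments that actually carry deletions
    writer.close();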

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de




Re: Indexing slower in trunk

Posted by Erick Erickson <er...@gmail.com>.
Optimization used to have a very noticeable impact
on search speed prior to some index format changes
from quite a while ago.

At this point the effect is much less noticeable, but
the thing optimize does do is reclaim resources
from deleted documents. If you have lots of deletions,
it's a good idea to periodically optimize, but in that
case it's often done pretty infrequently (once a
day/week/month) rather than as part of any ongoing
indexing process.
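
In SolrJ terms, that infrequent job is nothing more than the following
sketch (the URL is illustrative; run it from a scheduled task, not the
indexing client):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    server.optimize(); // blocks until the merge completes, so schedule it off-peak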

Best
Erick



RE: Indexing slower in trunk

Posted by Uwe Schindler <uw...@thetaphi.de>.
Since Lucene 2.9, Lucene works on a per-segment basis when searching. Since
Lucene 3.1 it can even parallelize across multiple segments. If you optimize
your index, you only have one segment. Also, when you frequently reopen your
indexes (e.g. after updates), the cost of warming the caches and FieldCache
for a very large new segment (and the CPU power spent optimizing the
whole index) does not justify a more compact representation on disk. For
search performance it has a minor impact.

It's a much better idea to configure the MergePolicy to merge
segments optimally (Lucene 3.2 has a TieredMergePolicy that's now the
default). This will minimize the number of segments.
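
For example (a sketch; the values are only illustrative starting points):

    // imports as usual; TieredMergePolicy is in org.apache.lucene.index
    TieredMergePolicy tmp = new TieredMergePolicy();
    tmp.setSegmentsPerTier(10.0);          // roughly the old mergeFactor
    tmp.setMaxMergedSegmentMB(5 * 1024);   // cap the largest merged segment
    IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_32,
        new StandardAnalyzer(Version.LUCENE_32));
    conf.setMergePolicy(tmp);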

The same overhead applies to Solr when you replicate your index. If you
don't optimize, Solr will only exchange new segments between the indexes.
Since the whole index is rebuilt after an optimize, Solr then has to
transfer all the index files to the replicas.

Optimizing only makes sense for, e.g., read-only indexes that are built
once and never touched again.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de




Re: Indexing slower in trunk

Posted by Yury Kats <yu...@yahoo.com>.
On 6/14/2011 4:28 AM, Uwe Schindler wrote:
> indexing and optimizing was only a
> good idea pre Lucene-2.9, now it's mostly obsolete)

Could you please elaborate on this? Is optimizing obsolete
in general, or just after indexing new documents? Is it obsolete
after deletions? And what does "mostly" mean?

Thanks!



RE: Indexing slower in trunk

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi Erick,

Do you use hard disks or SSDs? I assume hard disks, which may explain what
you see:

- DWPT writes lots of segments in parallel, which also explains why you are
seeing more files. Writing to several files in parallel needs more head
movements on a hard disk, and that slows things down. In the past, only one
segment was written at a time (sequentially), so the disk was under less
stress.
- Optimizing may be slower for the same reason: there are many more files
to merge (but the optimize cost shouldn't be counted as a problem here, as
normally you won't need to optimize after initial indexing, and optimizing
was only a good idea pre-Lucene 2.9; now it's mostly obsolete).
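
One cheap way to test the disk-seek theory from the client side would be
to vary how many update threads hit Solr at once and compare docs/sec
(an untested sketch reusing the SolrJ class from this thread; the URL is
a placeholder, and "fewer client threads means fewer segments being
written concurrently" is an assumption, not a guarantee):

    import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;

    public class ThreadCountProbe {
      public static void main(String[] args) throws Exception {
        // Same queue size (10) as in the test below; only the thread
        // count changes. Fewer client threads should mean fewer
        // concurrent update requests on the server side.
        StreamingUpdateSolrServer oneThread =
            new StreamingUpdateSolrServer("http://localhost:8983/solr", 10, 1);
        StreamingUpdateSolrServer fourThreads =
            new StreamingUpdateSolrServer("http://localhost:8983/solr", 10, 4);
        // ... index the same document set through each and compare docs/sec ...
      }
    }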

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: Tuesday, June 14, 2011 2:46 AM
> To: dev@lucene.apache.org; simon.willnauer@gmail.com
> Subject: Re: Indexing slower in trunk
> 
> Simon:
> 
> Yep, I was asking to see if it was weird. Pursuant to our
> chat I tried some things, results below:
> 
> All these are running on my local machine, same disk, same
> JVM settings re: memory. The SolrJ indexing is happening
> from IntelliJ (OK, I'm lazy).
> 
> Rambuffer is set at 128M in all cases. Merge factor is 10.
> 
> I'm allocating 2G to the server and 2G to the indexer.
> 
> Servers get started like this:
> _server = new StreamingUpdateSolrServer(url, 10, 4);
> 
> I chose the threads and queue length semi-arbitrarily.
> 
> Autocommit originally was 10,000 docs, 60,000 maxTime for
> all tests, but I removed that in the case of trunk, substituting
> a "commitWithin" of 10 minutes in the SolrJ code. That flattened
> out the run up until after the last doc was added, when I tried
> to do an optimize and it took 270 seconds on trunk and 19
> seconds on 3x. I do see a commit happening in the log on trunk
> when I issue this, which isn't surprising.
> 
> I'm a little surprised that I also see a bunch of segments forming
> during indexing on trunk, even though I don't issue any commits and
> I doubt that I was running into ram buffer problems. By the time I'd
> indexed 400K docs, there were 206 files in the index; by 1000K docs
> there were 416. But I vaguely remember that there's some
> background merging going on with DWPT, which would
> explain the fact that the number of files in the directory
> keeps bouncing up and down (although it generally
> trends up until flattening out in the low-to-mid 400s). I don't know
> whether it means anything or not, but the number of files
> goes down when there are long pauses in the "interval"
> reported to the console.
> 
> Note below that the trunk seems slower even without
> committing. But the optimize shoots through the roof.
> 
> I've attached a file containing the full console output from
> both runs, and the indexing code for your perusal. I can
> easily try different parameters if you have some you
> want me to run. I can instrument the code, compile and run
> it if that'd help, just let me know where.
> 
> I suspect that I've done something silly, but I have no clue
> what as yet.
> 
> 
> Erick
> 
> Here's the end of the console output for the 3x case, from my Idea
> console window ("Took" is in ms):
> 
> Added 1890000 docs. Took 1361 ms. cumulative interval (seconds) = 248
> Added 1900000 docs. Took 719 ms. cumulative interval (seconds) = 248
> Added 1910000 docs. Took 1305 ms. cumulative interval (seconds) = 250
> Jun 13, 2011 8:01:08 PM
> org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner run
> INFO: finished:
> org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner@60487c5f
> Jun 13, 2011 8:01:08 PM
> org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner run
> INFO: finished:
> org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner@132f4538
> Jun 13, 2011 8:01:09 PM
> org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner run
> INFO: finished:
> org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner@469695f
> Jun 13, 2011 8:01:09 PM
> org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner run
> INFO: finished:
> org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner@58f41393
> Total Time Taken-> 252 seconds
> Total documents added-> 1917728
> Docs/sec-> 7610
> starting optimize
> optimizing took 19 seconds
> 
> Here's the end from trunk:
> Added 1880000 docs. Took 1513 ms. cumulative interval (seconds) = 304
> Added 1890000 docs. Took 1518 ms. cumulative interval (seconds) = 306
> Added 1900000 docs. Took 1703 ms. cumulative interval (seconds) = 307
> Added 1910000 docs. Took 1161 ms. cumulative interval (seconds) = 309
> About to commit, total time so far: 309
> Total Time Taken-> 309 seconds
> Total documents added-> 1917728
> Docs/sec-> 6206
> starting optimize
> optimizing took 270 seconds
> 
> On Mon, Jun 13, 2011 at 4:50 PM, Simon Willnauer
> <si...@googlemail.com> wrote:
> > On Mon, Jun 13, 2011 at 8:13 PM, Erick Erickson <er...@gmail.com> wrote:
> >> I half remember that this has come up before, but I couldn't find the
> >> thread. I was running some tests over the weekend that involved
> >> indexing 1.9M documents from the English Wiki dump.
> >>
> >> I'm consistently seeing that trunk takes about twice as long to index
> >> the docs as 1.4, 3.2 and 3x. Optimize is also taking quite a bit
> >> longer. I admit that these aren't very sophisticated tests, and I only
> >> ran the trunk process twice (although both those were consistent).
> >>
> >> I'm pretty sure my rambuffersize and autocommit settings are
> >> identical. I remove the data/index directory before each run. These
> >> results are running the indexing program in IntelliJ, on my Mac, both
> >> the server and the indexing programs were running locally.
> >>
> >> No, trunk isn't compiling before running <G>.
> >>
> >> Here's the server definition:
> >> new StreamingUpdateSolrServer(url, 10, 4);
> >>
> >> and I'm batching up the documents and sending them to Solr in
> >> batches of 1,000.
> >>
> >> So, my question is whether this should be pursued. Note that I'm still
> >> getting around 3K docs/second, which I can't complain about. Not that
> >> that stops me, you understand. And in return for a memory footprint
> >> reduction from 389M to 90M after some off-the-wall sorting and
> >> faceting I'll take it!
> >>
> >> Hmmmm, speaking of which, the memory usage changes seem like a good
> >> candidate for a page on the Wiki, anyone want to suggest a home?
> >>
> >>
> >> Solr 1.4.1
> >> Total Time Taken-> 257 seconds
> >> Total documents added-> 1917728
> >> Docs/sec-> 7461
> >> starting optimize
> >> optimizing took 26 seconds
> >>
> >> Solr 3.2
> >> Total Time Taken-> 243 seconds
> >> Total documents added-> 1917728
> >> Docs/sec-> 7891
> >> starting optimize
> >> optimizing took 21 seconds
> >>
> >> Solr 3x
> >> Total Time Taken-> 269 seconds
> >> Total documents added-> 1917728
> >> Docs/sec-> 7129
> >> starting optimize
> >> optimizing took 21 seconds
> >>
> >> Solr trunk. 2011-6-11: 17:24 EST
> >> Total Time Taken-> 592 seconds
> >> Total documents added-> 1917728
> >> Docs/sec-> 3239
> >> starting optimize
> >> optimizing took 159 seconds
> >>
> >> What do folks think? Is there anything I can/should do to narrow
> >> this down?
> >
> > Hi Erick,
> >
> > This looks weird; I have some questions:
> >
> > - are you indexing to the same disk you read the data from?
> > - what are your ramBuffer settings?
> > - how many threads are you using to send data to Solr?
> > - what is your autocommit setting?
> >
> > simon
> >
> >
> >>
> >> Erick
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
> >>
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >
> >


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Indexing slower in trunk

Posted by Erick Erickson <er...@gmail.com>.
Simon:

Yep, I was asking to see if it was weird. Pursuant to our
chat I tried some things, results below:

All these are running on my local machine, same disk, same
JVM settings re: memory. The SolrJ indexing is happening
from IntelliJ (OK, I'm lazy).

Rambuffer is set at 128M in all cases. Merge factor is 10.

I'm allocating 2G to the server and 2G to the indexer.

Servers get started like this:
_server = new StreamingUpdateSolrServer(url, 10, 4);

I chose the threads and queue length semi-arbitrarily.
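
The two ints are, as far as I can tell, the queue size and the thread
count; here's a commented version (with a placeholder URL):

    import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;

    public class ServerSetup {
      public static void main(String[] args) throws Exception {
        // queueSize = 10: how many update requests may buffer client-side
        //   before add() blocks
        // threadCount = 4: background threads draining the queue to Solr
        StreamingUpdateSolrServer server =
            new StreamingUpdateSolrServer("http://localhost:8983/solr", 10, 4);
      }
    }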

Autocommit originally was 10,000 docs, 60,000 maxTime for
all tests, but I removed that in the case of trunk, substituting
a "commitWithin" of 10 minutes in the SolrJ code. That flattened
out the run up until after the last doc was added, when I tried
to do an optimize and it took 270 seconds on trunk and 19
seconds on 3x. I do see a commit happening in the log on trunk
when I issue this, which isn't surprising.
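
The commitWithin bit is just the standard SolrJ request option, roughly
like this (a sketch; the BatchAdder class and addBatch() helper are
hypothetical, not my actual code):

    import java.util.List;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchAdder {
      // Send one batch (1,000 docs in my runs) with a 10-minute commitWithin.
      static void addBatch(SolrServer server, List<SolrInputDocument> batch)
          throws Exception {
        UpdateRequest req = new UpdateRequest();
        req.add(batch);
        req.setCommitWithin(10 * 60 * 1000); // milliseconds
        req.process(server);
      }
    }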

I'm a little surprised that I also see a bunch of segments forming
during indexing on trunk, even though I don't issue any commits and
I doubt that I was running into ram buffer problems. By the time I'd indexed
400K docs, there were 206 files in the index; by 1000K docs
there were 416. But I vaguely remember that there's some
background merging going on with DWPT, which would
explain the fact that the number of files in the directory
keeps bouncing up and down (although it generally
trends up until flattening out in the low-to-mid 400s). I don't know
whether it means anything or not, but the number of files
goes down when there are long pauses in the "interval"
reported to the console.
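
(The file counts are just directory-entry counts, something along these
lines; the path is a placeholder for wherever the core keeps data/index:)

    import java.io.File;

    public class IndexFileCount {
      public static void main(String[] args) {
        // Count whatever files are currently in the index directory.
        String[] names = new File("solr/data/index").list();
        System.out.println("files in index dir: "
            + (names == null ? 0 : names.length));
      }
    }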

Note below that the trunk seems slower even without
committing. But the optimize shoots through the roof.
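
(The optimize numbers are plain wall-clock time around the SolrJ call,
along these lines; the class and method names are just for illustration:)

    import org.apache.solr.client.solrj.SolrServer;

    public class OptimizeTimer {
      static void timedOptimize(SolrServer server) throws Exception {
        long start = System.currentTimeMillis();
        server.optimize(); // issues an optimize request to Solr
        System.out.println("optimizing took "
            + (System.currentTimeMillis() - start) / 1000 + " seconds");
      }
    }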

I've attached a file containing the full console output from
both runs, and the indexing code for your perusal. I can
easily try different parameters if you have some you
want me to run. I can instrument the code, compile and run
it if that'd help, just let me know where.

I suspect that I've done something silly, but I have no clue
what as yet.


Erick

Here's the end of the console output for the 3x case, from my Idea
console window ("Took" is in ms):

Added 1890000 docs. Took 1361 ms. cumulative interval (seconds) = 248
Added 1900000 docs. Took 719 ms. cumulative interval (seconds) = 248
Added 1910000 docs. Took 1305 ms. cumulative interval (seconds) = 250
Jun 13, 2011 8:01:08 PM
org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner run
INFO: finished:
org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner@60487c5f
Jun 13, 2011 8:01:08 PM
org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner run
INFO: finished:
org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner@132f4538
Jun 13, 2011 8:01:09 PM
org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner run
INFO: finished:
org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner@469695f
Jun 13, 2011 8:01:09 PM
org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner run
INFO: finished:
org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner@58f41393
Total Time Taken-> 252 seconds
Total documents added-> 1917728
Docs/sec-> 7610
starting optimize
optimizing took 19 seconds

Here's the end from trunk:
Added 1880000 docs. Took 1513 ms. cumulative interval (seconds) = 304
Added 1890000 docs. Took 1518 ms. cumulative interval (seconds) = 306
Added 1900000 docs. Took 1703 ms. cumulative interval (seconds) = 307
Added 1910000 docs. Took 1161 ms. cumulative interval (seconds) = 309
About to commit, total time so far: 309
Total Time Taken-> 309 seconds
Total documents added-> 1917728
Docs/sec-> 6206
starting optimize
optimizing took 270 seconds

On Mon, Jun 13, 2011 at 4:50 PM, Simon Willnauer
<si...@googlemail.com> wrote:
> On Mon, Jun 13, 2011 at 8:13 PM, Erick Erickson <er...@gmail.com> wrote:
>> I half remember that this has come up before, but I couldn't find the
>> thread. I was running some tests over the weekend that involved
>> indexing 1.9M documents from the English Wiki dump.
>>
>> I'm consistently seeing that trunk takes about twice as long to index
>> the docs as 1.4, 3.2 and 3x. Optimize is also taking quite a bit
>> longer. I admit that these aren't very sophisticated tests, and I only
>> ran the trunk process twice (although both those were consistent).
>>
>> I'm pretty sure my rambuffersize and autocommit settings are
>> identical. I remove the data/index directory before each run. These
>> results are running the indexing program in IntelliJ, on my Mac, both
>> the server and the indexing programs were running locally.
>>
>> No, trunk isn't compiling before running <G>.
>>
>> Here's the server definition:
>> new StreamingUpdateSolrServer(url, 10, 4);
>>
>> and I'm batching up the documents and sending them to Solr in batches of 1,000.
>>
>> So, my question is whether this should be pursued. Note that I'm still
>> getting around 3K docs/second, which I can't complain about. Not that
>> that stops me, you understand. And in return for a memory footprint
>> reduction from 389M to 90M after some off-the-wall sorting and
>> faceting I'll take it!
>>
>> Hmmmm, speaking of which, the memory usage changes seem like a good
>> candidate for a page on the Wiki, anyone want to suggest a home?
>>
>>
>> Solr 1.4.1
>> Total Time Taken-> 257 seconds
>> Total documents added-> 1917728
>> Docs/sec-> 7461
>> starting optimize
>> optimizing took 26 seconds
>>
>> Solr 3.2
>> Total Time Taken-> 243 seconds
>> Total documents added-> 1917728
>> Docs/sec-> 7891
>> starting optimize
>> optimizing took 21 seconds
>>
>> Solr 3x
>> Total Time Taken-> 269 seconds
>> Total documents added-> 1917728
>> Docs/sec-> 7129
>> starting optimize
>> optimizing took 21 seconds
>>
>> Solr trunk. 2011-6-11: 17:24 EST
>> Total Time Taken-> 592 seconds
>> Total documents added-> 1917728
>> Docs/sec-> 3239
>> starting optimize
>> optimizing took 159 seconds
>>
>> What do folks think? Is there anything I can/should do to narrow this down?
>
> Hi Erick,
>
> This looks weird; I have some questions:
>
> - are you indexing to the same disk you read the data from?
> - what are your ramBuffer settings?
> - how many threads are you using to send data to Solr?
> - what is your autocommit setting?
>
> simon
>
>
>>
>> Erick
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: Indexing slower in trunk

Posted by Simon Willnauer <si...@googlemail.com>.
On Mon, Jun 13, 2011 at 8:13 PM, Erick Erickson <er...@gmail.com> wrote:
> I half remember that this has come up before, but I couldn't find the
> thread. I was running some tests over the weekend that involved
> indexing 1.9M documents from the English Wiki dump.
>
> I'm consistently seeing that trunk takes about twice as long to index
> the docs as 1.4, 3.2 and 3x. Optimize is also taking quite a bit
> longer. I admit that these aren't very sophisticated tests, and I only
> ran the trunk process twice (although both those were consistent).
>
> I'm pretty sure my rambuffersize and autocommit settings are
> identical. I remove the data/index directory before each run. These
> results are running the indexing program in IntelliJ, on my Mac, both
> the server and the indexing programs were running locally.
>
> No, trunk isn't compiling before running <G>.
>
> Here's the server definition:
> new StreamingUpdateSolrServer(url, 10, 4);
>
> and I'm batching up the documents and sending them to Solr in batches of 1,000.
>
> So, my question is whether this should be pursued. Note that I'm still
> getting around 3K docs/second, which I can't complain about. Not that
> that stops me, you understand. And in return for a memory footprint
> reduction from 389M to 90M after some off-the-wall sorting and
> faceting I'll take it!
>
> Hmmmm, speaking of which, the memory usage changes seem like a good
> candidate for a page on the Wiki, anyone want to suggest a home?
>
>
> Solr 1.4.1
> Total Time Taken-> 257 seconds
> Total documents added-> 1917728
> Docs/sec-> 7461
> starting optimize
> optimizing took 26 seconds
>
> Solr 3.2
> Total Time Taken-> 243 seconds
> Total documents added-> 1917728
> Docs/sec-> 7891
> starting optimize
> optimizing took 21 seconds
>
> Solr 3x
> Total Time Taken-> 269 seconds
> Total documents added-> 1917728
> Docs/sec-> 7129
> starting optimize
> optimizing took 21 seconds
>
> Solr trunk. 2011-6-11: 17:24 EST
> Total Time Taken-> 592 seconds
> Total documents added-> 1917728
> Docs/sec-> 3239
> starting optimize
> optimizing took 159 seconds
>
> What do folks think? Is there anything I can/should do to narrow this down?

Hi Erick,

This looks weird; I have some questions:

- are you indexing to the same disk you read the data from?
- what are your ramBuffer settings?
- how many threads are you using to send data to Solr?
- what is your autocommit setting?

simon


>
> Erick
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org