Posted to solr-user@lucene.apache.org by Matteo Grolla <ma...@gmail.com> on 2013/10/06 18:19:10 UTC

Improving indexing performance

I'd like to have some suggestions on how to improve indexing performance in the following scenario:
I'm uploading 1M docs to Solr,

every doc has
	id: sequential number
	title:	small string
	date: date
	body: 1kb of text

Here are my benchmarks (they are all single executions, not averages from multiple executions):

1) 	using the UpdateRequestHandler
	and streaming docs from a csv file on the same disk as Solr
	auto commit every 15s with openSearcher=false and commit after last document	

	total time: 143035ms

1.1) 	using the UpdateRequestHandler
	and streaming docs from a csv file on the same disk as Solr
	auto commit every 15s with openSearcher=false and commit after last document	
	<ramBufferSizeMB>500</ramBufferSizeMB>
	<maxBufferedDocs>100000</maxBufferedDocs>
	
	total time: 134493ms

1.2) 	using the UpdateRequestHandler
	and streaming docs from a csv file on the same disk as Solr
	auto commit every 15s with openSearcher=false and commit after last document	
	<mergeFactor>30</mergeFactor>

	total time: 143134ms

2)	using a SolrJ client from another pc on the LAN (100Mbps)
	with HttpSolrServer
	with javabin format
	add documents to the server in batches of 1k docs	( server.add( <collection> ) )
	auto commit every 15s with openSearcher=false and commit after last document		

	total time: 139022ms

3)	using a SolrJ client from another pc on the LAN (100Mbps)
	with ConcurrentUpdateSolrServer
	with javabin format
	add documents to the server in batches of 1k docs	( server.add( <collection> ) ; see the client sketch below )
	server queue size=20k
	server threads=4
	no auto-commit and commit every 100k docs

	total time: 167301ms
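
For reference, the SolrJ client used for benchmarks 2 and 3 looks roughly like the sketch below. The URL, field names and document generation are placeholders, not the real code; only the batch size, queue size, thread count and commit strategy reflect the settings listed above.

import java.util.ArrayList;
import java.util.Date;
import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexerSketch {
    public static void main(String[] args) throws Exception {
        // benchmark 2: plain HttpSolrServer with the javabin format (the default)
        // SolrServer server = new org.apache.solr.client.solrj.impl.HttpSolrServer("http://solrhost:8983/solr/collection1");

        // benchmark 3: ConcurrentUpdateSolrServer, queue size 20k, 4 threads
        SolrServer server = new ConcurrentUpdateSolrServer("http://solrhost:8983/solr/collection1", 20000, 4);

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 1000000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", i);                    // sequential number
            doc.addField("title", "title " + i);      // small string
            doc.addField("date", new Date());         // date
            doc.addField("body", oneKbOfText());      // 1kb of text
            batch.add(doc);
            if (batch.size() == 1000) {               // send batches of 1k docs
                server.add(batch);
                batch.clear();
                // benchmark 3: no auto commit, explicit commit every 100k docs
                if ((i + 1) % 100000 == 0) {
                    server.commit();
                }
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        server.commit();                              // commit after the last document
        server.shutdown();
    }

    private static String oneKbOfText() {
        return "...";                                 // placeholder for the real 1kb body
    }
}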


--On the Solr server--
CPU averages	25%
	peaking at 100% for 1 core
IO	is still far from being saturated
	iostat gives a pattern like this (sampled every 5s)

	time(s)		%util
	100			45,20
	105			1,68
	110			17,44
	115			76,32
	120			2,64
	125			68
	130			1,28

I thought that by using ConcurrentUpdateSolrServer I would be able to max out CPU or IO, but I wasn't.
With ConcurrentUpdateSolrServer I can't rely on auto commit, otherwise I get an OutOfMemory error,
and I found that committing every 100k docs gives worse performance than auto commit every 15s (benchmark 3 run with HttpSolrServer took 193515ms).

I'd really like to understand why I can't max out the resources on the server hosting Solr (the disk above all),
and I'd really like to understand what I'm doing wrong with ConcurrentUpdateSolrServer.

thanks


Re: Improving indexing performance

Posted by Erick Erickson <er...@gmail.com>.
The queue size shouldn't really be too large; the whole point of
the concurrency is to keep from waiting around for the
communication with the server in a single thread. So having
a bunch of stuff backed up in the queue isn't buying you anything....

And you can always increase the memory allocated to the JVM
running SolrJ...
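(For example, by starting the client with a larger heap via -Xmx, e.g. java -Xmx1g; the exact value is just an illustration.)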

Erick


Re: Improving indexing performance

Posted by Matteo Grolla <ma...@gmail.com>.
Thanks Erick,
	I think I have been able to exhaust a resource:
	if I split the data in 2 and upload it with 2 clients like benchmark 1.1, it takes 120s; here the bottleneck is my LAN.
	If I use a setting like benchmark 1, probably the bottleneck is the ramBuffer.

	I'm going to buy a Gigabit ethernet cable so I can make a better test.

	OutOfMemory error: it's the SolrJ client that crashes
		I'm using Solr 4.2.1 and the corresponding SolrJ client
		HttpSolrServer works fine
		ConcurrentUpdateSolrServer gives me problems, and I didn't understand how to size the queueSize parameter optimally

	


Re: Improving indexing performance

Posted by Erick Erickson <er...@gmail.com>.
Just skimmed, but the usual reason you can't max out the server
is that the client can't go fast enough. Very quick experiment:
comment out the server.add line in your client and run it again;
does that speed up the client substantially? If not, then the time
is being spent on the client.
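
Concretely, something like this, assuming the client loops over pre-built batches (the names here are placeholders, not your actual code):

for (List<SolrInputDocument> batch : batches) {
    // server.add(batch);   // the only Solr call, commented out to time the client alone
}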

Or split your csv file into, say, 5 parts and run it from 5 different
PCs in parallel.

bq:  I can't rely on auto commit, otherwise I get an OutOfMemory error
This shouldn't be happening, I'd get to the bottom of this. Perhaps simply
allocating more memory to the JVM running Solr.

bq: committing every 100k docs gives worse performance
It'll be best to specify openSearcher=false for max indexing throughput
BTW. You should be able to do this quite frequently, 15 seconds seems
quite reasonable.
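
For reference, that kind of hard auto commit lives in the updateHandler section of
solrconfig.xml, roughly like this (15000ms matching the 15s used in the benchmarks):

<autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
</autoCommit>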

Best,
Erick
