Posted to solr-user@lucene.apache.org by Matteo Grolla <ma...@gmail.com> on 2013/10/06 18:19:10 UTC
Improving indexing performance
I'd like to have some suggestions on how to improve indexing performance in the following scenario.
I'm uploading 1M docs to Solr;
every doc has
id: sequential number
title: small string
date: date
body: 1kb of text
Here are my benchmarks (they are all single executions, not averages from multiple executions):
1) using the UpdateRequestHandler
and streaming docs from a csv file on the same disk as Solr
auto commit every 15s with openSearcher=false and a commit after the last document
total time: 143035ms
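(For reference, that autocommit setup corresponds to a solrconfig.xml fragment along these lines; a sketch mirroring the 15s / openSearcher=false description above:

  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
)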
1.1) using the UpdateRequestHandler
and streaming docs from a csv file on the same disk as Solr
auto commit every 15s with openSearcher=false and a commit after the last document
<ramBufferSizeMB>500</ramBufferSizeMB>
<maxBufferedDocs>100000</maxBufferedDocs>
total time: 134493ms
1.2) using the UpdateRequestHandler
and streaming docs from a csv file on the same disk as Solr
auto commit every 15s with openSearcher=false and a commit after the last document
<mergeFactor>30</mergeFactor>
total time: 143134ms
2) using a SolrJ client from another PC on the LAN (100Mbps)
with HttpSolrServer
with the javabin format
adding documents to the server in batches of 1k docs ( server.add( <collection> ) )
auto commit every 15s with openSearcher=false and a commit after the last document
total time: 139022ms
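(A minimal sketch of such a client, for reference; the URL and the makeBody() helper are illustrative placeholders, not the actual code:

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class BatchIndexer {
    public static void main(String[] args) throws Exception {
      HttpSolrServer server = new HttpSolrServer("http://solrhost:8983/solr/collection1"); // illustrative URL
      server.setRequestWriter(new BinaryRequestWriter()); // send updates as javabin instead of XML

      List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(1000);
      for (int i = 0; i < 1000000; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", i);
        doc.addField("title", "title " + i);
        doc.addField("date", new java.util.Date());
        doc.addField("body", makeBody()); // ~1kb of text per doc
        batch.add(doc);
        if (batch.size() == 1000) { // flush in batches of 1k docs
          server.add(batch);
          batch.clear();
        }
      }
      if (!batch.isEmpty()) server.add(batch); // leftover docs
      server.commit();   // explicit commit after the last document
      server.shutdown();
    }

    private static String makeBody() { return "..."; } // placeholder for the 1kb body
  }
)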
3) using a SolrJ client from another PC on the LAN (100Mbps)
with ConcurrentUpdateSolrServer
with the javabin format
adding documents to the server in batches of 1k docs ( server.add( <collection> ) )
server queue size=20k
server threads=4
no auto-commit, and a commit every 100k docs
total time: 167301ms
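(Swapping in ConcurrentUpdateSolrServer only changes the construction; add() calls return immediately and background threads stream the queued requests. A sketch with this benchmark's settings, URL illustrative:

  import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;

  // queueSize=20000 and threadCount=4 as in this benchmark.
  // Note the queue buffers whole update requests, so with 1k-doc batches
  // a 20k-slot queue can buffer a very large amount of data client-side.
  ConcurrentUpdateSolrServer server =
      new ConcurrentUpdateSolrServer("http://solrhost:8983/solr/collection1", 20000, 4);
)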
--On the Solr server--
CPU averages 25%,
at best 100% for one core;
IO is far from being saturated.
iostat shows a pattern like this (sampled every 5s):
time(s) %util
100 45.20
105 1.68
110 17.44
115 76.32
120 2.64
125 68.00
130 1.28
I thought that using ConcurrentUpdateSolrServer I would be able to max out CPU or IO, but I wasn't.
With ConcurrentUpdateSolrServer I can't rely on auto commit, otherwise I get an OutOfMemory error,
and I found that committing every 100k docs gives worse performance than auto commit every 15s (benchmark 3 with HttpSolrServer took 193515ms).
I'd really like to understand why I can't max out the resources on the server hosting Solr (the disk above all),
and what I'm doing wrong with ConcurrentUpdateSolrServer.
thanks
Re: Improving indexing performance
Posted by Erick Erickson <er...@gmail.com>.
queue size shouldn't really be too large, the whole point of
the concurrency is to keep from waiting around for the
communication with the server in a single thread. So having
a bunch of stuff backed up in the queue isn't buying you anything....
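As an illustration (the numbers here are arbitrary, not tuned, and 'url' stands in for whatever base URL the client targets): a queue sized to a handful of batches is usually enough to keep the sender threads fed:

  // with 1k-doc batches, even a few queued requests already buffer
  // several thousand docs on the client side
  ConcurrentUpdateSolrServer server = new ConcurrentUpdateSolrServer(url, 8, 4);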
And you can always increase the memory allocated to the JVM
running SolrJ...
Erick
Re: Improving indexing performance
Posted by Matteo Grolla <ma...@gmail.com>.
Thanks Erick,
I think I have been able to exhaust a resource:
if I split the data in 2 and upload it with 2 clients as in benchmark 1.1, it takes 120s; here the bottleneck is my LAN.
If I use a setting like benchmark 1, the bottleneck is probably the ramBuffer.
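(As a rough check on the LAN figure, assuming ~1kb per document: 1M docs is about 1GB of payload, and 100Mbps is at best ~12.5MB/s, so the transfer alone needs roughly 80s before any protocol overhead, which fits the 120s observed.)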
I'm going to buy a Gigabit ethernet cable so I can run a better test.
About the OutOfMemory error: it's the SolrJ client that crashes.
I'm using Solr 4.2.1 and the corresponding SolrJ client.
HttpSolrServer works fine;
ConcurrentUpdateSolrServer gives me problems, and I haven't understood how to size the queueSize parameter optimally.
Re: Improving indexing performance
Posted by Erick Erickson <er...@gmail.com>.
Just skimmed, but the usual reason you can't max out the server
is that the client can't go fast enough. Very quick experiment:
comment out the server.add line in your client and run it again,
does that speed up the client substantially? If not, then the time
is being spent on the client.
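A minimal sketch of that experiment (buildDocsFromCsv() is a stand-in for your existing parsing code):

  long start = System.currentTimeMillis();
  for (SolrInputDocument doc : buildDocsFromCsv()) {
      // server.add(doc);  // <- commented out: measure the client alone
  }
  System.out.println("client-only time: "
      + (System.currentTimeMillis() - start) + "ms");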
Or split your csv file into, say, 5 parts and run it from 5 different
PCs in parallel.
bq: I can't rely on auto commit, otherwise I get an OutOfMemory error
This shouldn't be happening, I'd get to the bottom of this. Perhaps simply
allocating more memory to the JVM running Solr.
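For the stock Solr 4.x Jetty example that would be along the lines of (the heap size is illustrative):

  java -Xmx2g -jar start.jar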
bq: committing every 100k docs gives worse performance
It'll be best to specify openSearcher=false for max indexing throughput
BTW. You should be able to do this quite frequently, 15 seconds seems
quite reasonable.
Best,
Erick