Posted to solr-user@lucene.apache.org by deniz <de...@gmail.com> on 2012/11/27 06:02:25 UTC

SolrCloud Performance - Indexing

I am somewhat confused, so I want to check whether anyone else has the same
confusion about SolrCloud.

I have set up an environment with 3 Solr instances and 2 ZooKeepers, and
tried to index some documents from a MySQL DB. There are around 3.5M docs in
total. Before indexing, I expected the cloud to take somewhat longer, since
it replicates between nodes, but I am rather disappointed to see that
indexing took 4 to 5 times longer than on a single Solr instance. On a
single Solr instance I can index those docs in around 17 mins, while with
the cloud it took around 60 minutes. And since a possible production
environment will have more instances and machines in the cloud, I can't
imagine the indexing time... In addition to the initial indexing, we will be
updating our indexes frequently, which makes me sceptical about SolrCloud.
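
For reference, the figures quoted above can be turned into throughput numbers. This is only arithmetic on the stated 3.5M docs, 17 minutes and 60 minutes; the function and variable names are illustrative, and the two timings given work out to a slowdown of roughly 3.5x.

```python
def docs_per_second(total_docs, minutes):
    """Average indexing throughput for a run of the given length."""
    return total_docs / (minutes * 60)


TOTAL_DOCS = 3_500_000   # "around 3.5M" docs, as stated in the post
SINGLE_MINUTES = 17      # single Solr instance
CLOUD_MINUTES = 60       # SolrCloud with 3 instances

single_rate = docs_per_second(TOTAL_DOCS, SINGLE_MINUTES)
cloud_rate = docs_per_second(TOTAL_DOCS, CLOUD_MINUTES)
slowdown = CLOUD_MINUTES / SINGLE_MINUTES

print(f"single node: {single_rate:.0f} docs/s")
print(f"cloud:       {cloud_rate:.0f} docs/s")
print(f"slowdown:    {slowdown:.2f}x")
```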

So in a possible production environment with SolrCloud, if there is a
serious failure on some nodes, the sync operation on the cloud will take a
long time... whereas reindexing everything on a single instance would take
less than 17 mins, which is a reasonable recovery time after a crash. In
that case, does it make sense to use SolrCloud even though indexing takes
much longer than on a single instance? Or would a traditional master-slave
setup be better?

I am aware that the cloud provides load balancing and other features that
mostly concern searching rather than indexing, but for a frequently updated
system, is it still useful to set up a cloud environment?

And are there any workarounds for indexing speed on the cloud, other than
the known ones for Solr?



-----
Smart, but doesn't work hard... Would do well if he did...
--
View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-Performance-Indexing-tp4022549.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: SolrCloud Performance - Indexing

Posted by Mark Miller <ma...@gmail.com>.
Yup, DIH is not optimal for SolrCloud yet. I made a few JIRA issues a short while ago that may help.

I've seen people use it with SolrCloud in the past though - and it wasn't so slow... (though I'm sure it was slower than a single node).

Search me...

- Mark

On Nov 27, 2012, at 1:24 PM, Mikhail Khludnev <mk...@griddynamics.com> wrote:

> It sounds like DataImportHandler will not be really performant with
> SolrCloud. From what I see it should essentially work - it sends docs to
> the chain, which should distribute them via DistributedUpdateProcessor. But
> it works synchronously - no multithreading in DIH since 4.0!
> Does anyone have experience with, or ideas for, fast data acquisition with
> DIH and SolrCloud?
> Excuse me for thread hijacking.


Re: SolrCloud Performance - Indexing

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
It sounds like DataImportHandler will not be really performant with
SolrCloud. From what I see it should essentially work - it sends docs to
the chain, which should distribute them via DistributedUpdateProcessor. But
it works synchronously - no multithreading in DIH since 4.0!
Does anyone have experience with, or ideas for, fast data acquisition with
DIH and SolrCloud?
Excuse me for thread hijacking.


On Tue, Nov 27, 2012 at 8:10 PM, Mark Miller <ma...@gmail.com> wrote:

> To get the best speed out of SolrCloud you have to index from many clients
> (or threads). Even better is if you index to many nodes rather than one.
>
> Using a single thread against a single instance with replicas will be a
> fair amount slower with cloud than if you just used one node.
>
> - Mark


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: SolrCloud Performance - Indexing

Posted by Mark Miller <ma...@gmail.com>.
To get the best speed out of SolrCloud you have to index from many clients (or threads). Even better is if you index to many nodes rather than one.

Using a single thread against a single instance with replicas will be a fair amount slower with cloud than if you just used one node.

- Mark
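
A minimal sketch of the advice above (several indexing threads, spread over several nodes), written in Python with only the standard library. Nothing here is from the thread itself: the function names (chunked, post_batch, index_parallel), the batch size, the worker count and the node URLs in the usage comment are all illustrative assumptions. It also assumes each node's update handler accepts a JSON array of documents sent with Content-Type application/json, and it leaves commits to the caller or to the server's auto-commit settings.

```python
import itertools
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def chunked(docs, size):
    """Yield successive lists of at most `size` documents."""
    it = iter(docs)
    while True:
        batch = list(itertools.islice(it, size))
        if not batch:
            return
        yield batch


def post_batch(update_url, batch):
    """POST one batch to an update URL as a JSON array (assumed format)."""
    req = urllib.request.Request(
        update_url,
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


def index_parallel(docs, update_urls, batch_size=1000, workers=8,
                   send=post_batch):
    """Send batches from `workers` threads, rotating over `update_urls`."""
    targets = itertools.cycle(update_urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Batches are assigned to nodes round-robin in submission order.
        futures = [
            pool.submit(send, next(targets), batch)
            for batch in chunked(docs, batch_size)
        ]
        # result() re-raises any exception from a worker thread.
        return [f.result() for f in futures]


# Usage (hypothetical host names and collection path):
# urls = ["http://solr1:8983/solr/collection1/update",
#         "http://solr2:8983/solr/collection1/update",
#         "http://solr3:8983/solr/collection1/update"]
# index_parallel(rows_from_database, urls)
```

The sketch submits every batch up front, so for millions of documents it would be worth bounding the number of in-flight batches. The send parameter is there only so that the HTTP call can be swapped out, for example for a dry run.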
