Posted to solr-user@lucene.apache.org by Modassar Ather <mo...@gmail.com> on 2014/07/08 11:01:34 UTC

Parallel optimize of index on SolrCloud.

Hi,

I need to optimize an index created using the CloudSolrServer APIs under a
SolrCloud setup of 3 instances on separate machines. Currently it optimizes
sequentially if I invoke cloudSolrServer.optimize().

To make it parallel, I tried creating three separate HttpSolrServer instances
and invoking httpSolrServer.optimize() on them in parallel, but it still seems
to be doing the optimization sequentially.

I tried invoking optimize directly using HttpPost with the following URL and
parameters, but it still seems to be sequential.
*URL* : http://host:port/solr/collection/update

*Parameters*:
params.add(new BasicNameValuePair("optimize", "true"));
params.add(new BasicNameValuePair("maxSegments", "1"));
params.add(new BasicNameValuePair("waitFlush", "true"));
params.add(new BasicNameValuePair("distrib", "false"));
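
For reference, the request described above can be sketched with the JDK alone.
This is a minimal illustration, assuming Java 10 or later; the class and method
names (OptimizeRequestSketch, encodeParams, optimizeParams) are hypothetical,
and "host:port" and "collection" are the placeholders from the message. It only
builds the form-encoded parameter string and does not send anything.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class OptimizeRequestSketch {

    // Encodes the parameters as an application/x-www-form-urlencoded string,
    // keeping the iteration order of the map.
    static String encodeParams(Map<String, String> params) {
        return params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // The four parameters listed in the message, in the same order.
    static Map<String, String> optimizeParams() {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("optimize", "true");
        params.put("maxSegments", "1");
        params.put("waitFlush", "true");
        params.put("distrib", "false");
        return params;
    }

    public static void main(String[] args) {
        // Placeholder URL taken from the message; replace host, port and collection.
        String url = "http://host:port/solr/collection/update";
        System.out.println(url + "?" + encodeParams(optimizeParams()));
    }
}
```

Printing the full URL makes it easy to compare what is actually sent against
what a curl call would send.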

Kindly provide your suggestion and help.

Regards,
Modassar

Re: Parallel optimize of index on SolrCloud.

Posted by Mark Miller <ma...@gmail.com>.
I think that’s pretty much a search time param, though it might end up being used on the update side as well. In any case, I know it doesn’t affect commit or optimize.

Also, to my knowledge, SolrCloud optimize support was never explicitly added or tested.

--  
Mark Miller
about.me/markrmiller

On July 9, 2014 at 12:00:27 PM, Shawn Heisey (solr@elyograg.org) wrote:
> > I thought a bug had been filed on the distrib=false problem,  


Re: Parallel optimize of index on SolrCloud.

Posted by Shawn Heisey <so...@elyograg.org>.
On 7/9/2014 8:49 AM, Timothy Potter wrote:
> Hi Modassar,
>
> Have you tried hitting the cores for each replica directly (instead of
> using the collection)? i.e. if you had col_shard1_replica1 on node1,
> then send the optimize command to that core URL directly:
>
> curl -i -v "http://host:port/solr/col_shard1_replica1/update" -H
> 'Content-type:application/xml' \
>   --data-binary "<optimize/>"
>
> I haven't tried this myself, but it might work ;-)

That doesn't work.  It will optimize the whole collection, one core at a
time.  I thought that sending the optimize with distrib=false would
limit the optimize to just the called core, but that also doesn't work. 
I thought a bug had been filed on the distrib=false problem, but it's
been long enough that I'm no longer sure about that.

Thanks,
Shawn


Re: Parallel optimize of index on SolrCloud.

Posted by Timothy Potter <th...@gmail.com>.
Hi Modassar,

Have you tried hitting the cores for each replica directly (instead of
using the collection)? i.e. if you had col_shard1_replica1 on node1,
then send the optimize command to that core URL directly:

curl -i -v "http://host:port/solr/col_shard1_replica1/update" -H
'Content-type:application/xml' \
  --data-binary "<optimize/>"

I haven't tried this myself, but it might work ;-)
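
The idea of sending the command to each core URL at the same time could be
sketched as below. This is a rough illustration only: the class and method
names are hypothetical, the core names other than col_shard1_replica1 are
made up, and sendOptimize is a stand-in hook rather than a real HTTP call.
Other replies in this thread report that a per-core request may still
optimize the whole collection one core at a time, so concurrent requests do
not guarantee concurrent merges.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelCoreOptimizeSketch {

    // Builds one update URL per core; baseUrl is e.g. "http://host:port/solr".
    static List<String> coreUpdateUrls(String baseUrl, List<String> coreNames) {
        List<String> urls = new ArrayList<>();
        for (String core : coreNames) {
            urls.add(baseUrl + "/" + core + "/update");
        }
        return urls;
    }

    // Calls sendOptimize for every URL on its own thread and waits for all calls.
    // sendOptimize is a hypothetical hook that would POST "<optimize/>" to the URL
    // and return the HTTP status code. Results come back in the order of the URLs.
    static List<Integer> optimizeAll(List<String> urls, Function<String, Integer> sendOptimize) {
        List<Integer> statuses = new ArrayList<>();
        if (urls.isEmpty()) {
            return statuses;
        }
        ExecutorService pool = Executors.newFixedThreadPool(urls.size());
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (String url : urls) {
                tasks.add(() -> sendOptimize.apply(url));
            }
            for (Future<Integer> result : pool.invokeAll(tasks)) {
                statuses.add(result.get());
            }
            return statuses;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("optimize request failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Only col_shard1_replica1 appears in the thread; the other names are assumed.
        List<String> urls = coreUpdateUrls("http://host:port/solr",
                List.of("col_shard1_replica1", "col_shard2_replica1", "col_shard3_replica1"));
        // Stub sender that only prints the target; swap in a real HTTP POST to use it.
        List<Integer> statuses = optimizeAll(urls, url -> {
            System.out.println("would POST <optimize/> to " + url);
            return 200;
        });
        System.out.println(statuses);
    }
}
```

Keeping the sender as a function argument means the threading part can be
tried out with a stub before pointing it at a live cluster.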

Tim

On Wed, Jul 9, 2014 at 12:59 AM, Modassar Ather <mo...@gmail.com> wrote:
> Hi All,
>
> Thanks for your kind suggestions and inputs.
>
> We have been taking the optimize route and it has helped. Testing and
> benchmarking around memory and performance have already been done.
> We see scope for improving the optimize step itself by running it in
> parallel, so kindly suggest how this can be achieved.
>
> Thanks,
> Modassar

Re: Parallel optimize of index on SolrCloud.

Posted by Modassar Ather <mo...@gmail.com>.
Hi All,

Thanks for your kind suggestions and inputs.

We have been taking the optimize route and it has helped. Testing and
benchmarking around memory and performance have already been done.
We see scope for improving the optimize step itself by running it in
parallel, so kindly suggest how this can be achieved.

Thanks,
Modassar


On Wed, Jul 9, 2014 at 11:48 AM, Shalin Shekhar Mangar <
shalinmangar@gmail.com> wrote:

> Hi Walter,
>
> I wonder why you think SolrCloud isn't necessary if you're indexing once
> per week. Aren't the automatic failover and auto-sharding still useful? One
> can also do custom sharding with SolrCloud if necessary.

Re: Parallel optimize of index on SolrCloud.

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
Hi Walter,

I wonder why you think SolrCloud isn't necessary if you're indexing once
per week. Aren't the automatic failover and auto-sharding still useful? One
can also do custom sharding with SolrCloud if necessary.


On Wed, Jul 9, 2014 at 11:38 AM, Walter Underwood <wu...@wunderwood.org>
wrote:

> More memory or faster disks will make a much bigger improvement than a
> forced merge.
>
> What are you measuring? If it is average query time, that is not a good
> measure. Look at 90th or 95th percentile. Test with queries from logs.
>
> No user can see a 10% or 20% difference. If your managers are watching
> that, they are watching the wrong thing.
>
> If you are indexing once per week, you don't really need the complexity of
> Solr Cloud. You can do manual sharding.
>
> wunder


-- 
Regards,
Shalin Shekhar Mangar.

Re: Parallel optimize of index on SolrCloud.

Posted by Walter Underwood <wu...@wunderwood.org>.
More memory or faster disks will make a much bigger improvement than a forced merge.

What are you measuring? If it is average query time, that is not a good measure. Look at 90th or 95th percentile. Test with queries from logs.
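
To illustrate why the average can mislead, here is a small sketch that
compares the mean with the 90th and 95th percentiles. The class name, the
nearest-rank percentile definition, and the sample timings are all
assumptions chosen for illustration; they are not measurements from this
thread.

```java
import java.util.Arrays;

public class LatencyPercentileSketch {

    // Nearest-rank percentile: the smallest sample such that at least p percent
    // of all samples are less than or equal to it. Requires at least one sample.
    static double percentile(double[] samples, double p) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    // Arithmetic mean of the samples, NaN when there are none.
    static double mean(double[] samples) {
        return Arrays.stream(samples).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // Made-up query times in seconds: mostly fast, with two slow outliers.
        double[] times = {2, 2, 2, 2, 2, 2, 2, 2, 30, 40};
        System.out.println("mean = " + mean(times));
        System.out.println("p90  = " + percentile(times, 90));
        System.out.println("p95  = " + percentile(times, 95));
    }
}
```

With these made-up numbers the mean sits between the typical query and the
slow ones and describes neither, while the percentiles show how bad the slow
queries are.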

No user can see a 10% or 20% difference. If your managers are watching that, they are watching the wrong thing.

If you are indexing once per week, you don't really need the complexity of Solr Cloud. You can do manual sharding.

wunder

On Jul 8, 2014, at 10:55 PM, Modassar Ather <mo...@gmail.com> wrote:

> Our index has almost 100M documents running on SolrCloud of 3 shards and
> each shard has an index size of about 700GB (for the record, we are not
> using stored fields - our documents are pretty large). We perform a full
> indexing every weekend and during the week there are no updates made to the
> index. Most of the queries that we run are pretty complex with hundreds of
> terms using PhraseQuery, BooleanQuery, SpanQuery, Wildcards, boosts etc.
> and take many minutes to execute. A difference of 10-20% is also a big
> advantage for us.
> 
> We have been optimizing the index after indexing for years and it has
> worked well for us. Every once in a while, we upgrade Solr to the latest
> version and try without optimizing so that we can save the many hours it
> takes to optimize such a huge index, but it does not work well.
> 
> Kindly provide your suggestion.
> 
> Thanks,
> Modassar

--
Walter Underwood
wunder@wunderwood.org




Re: Parallel optimize of index on SolrCloud.

Posted by Modassar Ather <mo...@gmail.com>.
Our index has almost 100M documents running on SolrCloud of 3 shards and
each shard has an index size of about 700GB (for the record, we are not
using stored fields - our documents are pretty large). We perform a full
indexing every weekend and during the week there are no updates made to the
index. Most of the queries that we run are pretty complex with hundreds of
terms using PhraseQuery, BooleanQuery, SpanQuery, Wildcards, boosts etc.
and take many minutes to execute. A difference of 10-20% is also a big
advantage for us.

We have been optimizing the index after indexing for years and it has
worked well for us. Every once in a while, we upgrade Solr to the latest
version and try without optimizing so that we can save the many hours it
takes to optimize such a huge index, but it does not work well.

Kindly provide your suggestion.

Thanks,
Modassar


On Wed, Jul 9, 2014 at 10:47 AM, Walter Underwood <wu...@wunderwood.org>
wrote:

> I seriously doubt that you are required to force merge.
>
> How much improvement? And is the big performance cost also OK?
>
> I have worked on search engines that do automatic merges and offer forced
> merges for over fifteen years. For all that time, forced merges have
> usually caused problems.
>
> Stop doing forced merges.
>
> wunder

Re: Parallel optimize of index on SolrCloud.

Posted by Walter Underwood <wu...@wunderwood.org>.
I seriously doubt that you are required to force merge.

How much improvement? And is the big performance cost also OK?

I have worked on search engines that do automatic merges and offer forced merges for over fifteen years. For all that time, forced merges have usually caused problems.

Stop doing forced merges.

wunder

On Jul 8, 2014, at 10:09 PM, Modassar Ather <mo...@gmail.com> wrote:

> Thanks Walter for your inputs.
> 
> Our use case and performance benchmarks require us to invoke optimize.
> 
> Here we see a chance to improve the performance of optimize() if it is invoked
> in parallel.
> I found that if *distrib=false* is used, the optimization will happen in
> parallel.
> 
> But I could not find a way to set it using HttpSolrServer/CloudSolrServer.
> Also, the parameter settings given in my earlier mail do not seem to
> work.
> 
> Please let me know in what ways I can achieve the parallel optimize on
> SolrCloud.
> 
> Thanks,
> Modassar

--
Walter Underwood
wunder@wunderwood.org




Re: Parallel optimize of index on SolrCloud.

Posted by Modassar Ather <mo...@gmail.com>.
Thanks Walter for your inputs.

Our use case and performance benchmarks require us to invoke optimize.

Here we see a chance to improve the performance of optimize() if it is invoked
in parallel.
I found that if *distrib=false* is used, the optimization will happen in
parallel.

But I could not find a way to set it using HttpSolrServer/CloudSolrServer.
Also, the parameter settings given in my earlier mail do not seem to
work.

Please let me know in what ways I can achieve the parallel optimize on
SolrCloud.

Thanks,
Modassar



On Tue, Jul 8, 2014 at 7:53 PM, Walter Underwood <wu...@wunderwood.org>
wrote:

> You probably do not need to force merge (mistakenly called "optimize")
> your index.
>
> Solr does automatic merges, which work just fine.
>
> There are only a few situations where a forced merge is even a good idea.
> The most common one is a replicated (non-cloud) setup with a full reindex
> every night.
>
> If you need Solr Cloud, I cannot think of a situation where you would want
> a forced merge.
>
> wunder

Re: Parallel optimize of index on SolrCloud.

Posted by Walter Underwood <wu...@wunderwood.org>.
You probably do not need to force merge (mistakenly called "optimize") your index.

Solr does automatic merges, which work just fine.

There are only a few situations where a forced merge is even a good idea. The most common one is a replicated (non-cloud) setup with a full reindex every night.

If you need Solr Cloud, I cannot think of a situation where you would want a forced merge.

wunder

On Jul 8, 2014, at 2:01 AM, Modassar Ather <mo...@gmail.com> wrote:

> Hi,
> 
> I need to optimize an index created using the CloudSolrServer APIs under a
> SolrCloud setup of 3 instances on separate machines. Currently it optimizes
> sequentially if I invoke cloudSolrServer.optimize().
> 
> To make it parallel, I tried creating three separate HttpSolrServer instances
> and invoking httpSolrServer.optimize() on them in parallel, but it still seems
> to be doing the optimization sequentially.
> 
> I tried invoking optimize directly using HttpPost with the following URL and
> parameters, but it still seems to be sequential.
> *URL* : http://host:port/solr/collection/update
> 
> *Parameters*:
> params.add(new BasicNameValuePair("optimize", "true"));
> params.add(new BasicNameValuePair("maxSegments", "1"));
> params.add(new BasicNameValuePair("waitFlush", "true"));
> params.add(new BasicNameValuePair("distrib", "false"));
> 
> Kindly provide your suggestion and help.
> 
> Regards,
> Modassar