
Posted to solr-user@lucene.apache.org by Mukesh Jha <me...@gmail.com> on 2014/04/25 15:49:45 UTC

Solr Cluster management having too many cores

Hi Experts,

I need to divide my indexes by hour/day, with each index holding ~50-80
GB of data and ~50-80 million docs, so I'm planning to create daily
collections with names like *sample_collection_yyyy_mm_dd_hh*.
I'll also create an alias *sample_collection* and update it whenever I
create a new collection, so that the entire data set stays searchable.

I have a couple of questions about this design:
1) How far can it scale? As the number of collections grows (and with it
the shards & replicas), is there a breaking point where adding more or
searching becomes an issue?
2) As the cluster grows with the number of collections, the
clusterstate.json file in ZooKeeper will grow too. Won't this become a
limiting factor? If so, instead of storing all this info in one
clusterstate.json file, shouldn't Solr keep only cluster-wide details in
that file and store collection-specific state files in ZooKeeper?
3) How can I easily manage all these collections? Is there a Java CoreAdmin
API available? I can't find much documentation on it.

-- 
Txz,

*Mukesh Jha <me...@gmail.com>*

Re: Solr Cluster management having too many cores

Posted by Anshum Gupta <an...@anshumgupta.net>.

You're correct Mukesh, that's the JIRA with pretty much all of that discussion.

On Fri, Aug 8, 2014 at 8:44 PM, Mukesh Jha <me...@gmail.com> wrote:
> Looks like https://issues.apache.org/jira/browse/SOLR-5473 is the story :)



-- 

Anshum Gupta
http://www.anshumgupta.net

Re: Solr Cluster management having too many cores

Posted by Mukesh Jha <me...@gmail.com>.

Looks like https://issues.apache.org/jira/browse/SOLR-5473 is the story :)





-- 


Thanks & Regards,

*Mukesh Jha <me...@gmail.com>*

Re: Solr Cluster management having too many cores

Posted by Mukesh Jha <me...@gmail.com>.

Hey *Shawn*, *Erick*,

I was wondering if there is a JIRA story I can track for splitting the
current clusterstate.json into collection-specific clusterstate files.
I looked around a bit but couldn't find anything useful on that.




-- 


Thanks & Regards,

*Mukesh Jha <me...@gmail.com>*

Re: Solr Cluster management having too many cores

Posted by Shawn Heisey <so...@elyograg.org>.

On 4/28/2014 5:05 AM, Mukesh Jha wrote:
> Thanks Erick,
> 
> Sounds about right.
> 
> BTW, how long can I keep adding collections, i.e. can I keep 5-10 years of
> data like this?
> 
> Also, what do you think of bullet 2), having collection-specific
> configurations in ZooKeeper?

Regarding bullet 2, there is work underway right now to create a
separate clusterstate within ZooKeeper for each collection.  I do not
know how far along that work is.

There are no hard limits in SolrCloud at all.  The things that will
cause issues with scalability are resource-related problems.  You'll
exceed the 1MB default size limit on a ZooKeeper znode (where
clusterstate.json is stored) pretty quickly.  If you're not using the
example Jetty included with Solr, you'll exceed the default maxThreads
on most servlet containers very quickly.  You may run into problems
with the default limits on Solr's HttpShardHandler.
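
To get a feel for how quickly a single shared cluster state approaches that limit, here is a rough back-of-the-envelope sketch. The per-replica byte count and the shard/replica numbers are assumptions for illustration only; measure your own clusterstate.json to get real figures. The limit constant is ZooKeeper's default jute.maxbuffer.

```python
ZNODE_LIMIT_BYTES = 0xFFFFF  # ZooKeeper's default jute.maxbuffer, just under 1MB


def estimated_state_bytes(collections, shards, replicas, bytes_per_replica=300):
    """Rough size of one shared cluster state document.

    bytes_per_replica is an assumed average for the JSON describing one
    replica (core name, node name, base URL, state); it is not measured.
    """
    return collections * shards * replicas * bytes_per_replica


def max_collections(shards, replicas, bytes_per_replica=300):
    """Largest collection count whose estimate still fits in one znode."""
    return ZNODE_LIMIT_BYTES // (shards * replicas * bytes_per_replica)


# 730 daily collections (two years), 2 shards x 2 replicas: all assumed values.
size = estimated_state_bytes(collections=730, shards=2, replicas=2)
print(size, size > ZNODE_LIMIT_BYTES)
print(max_collections(shards=2, replicas=2))
```

Under these assumed numbers, two years of daily collections already uses over 80% of the default limit, and hourly collections would pass it within a few months.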

Running hundreds or thousands of cores efficiently will require lots of
RAM, both for the OS disk cache and the Java heap.  A large Java heap
will require significant tuning of Java garbage collection parameters.

Most operating systems limit a user to 1024 open files and 1024 running
processes (which includes threads).  These limits will need to be increased.
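
A quick way to see the limits that currently apply is a small check like the one below, run as the user that starts Solr. It is only a sketch: it uses Python's Unix-only `resource` module and reports the limits without changing them.

```python
import resource  # Unix-only standard library module


def limits():
    """Return (soft, hard) limits for open files and processes/threads."""
    return {
        "open_files": resource.getrlimit(resource.RLIMIT_NOFILE),
        "processes": resource.getrlimit(resource.RLIMIT_NPROC),
    }


for name, (soft, hard) in limits().items():
    # A value equal to resource.RLIM_INFINITY means no limit is set.
    print(f"{name}: soft={soft} hard={hard}")
```

The soft value is the one a running process actually hits; the hard value is the ceiling it can be raised to without extra privileges.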

There may be other limits imposed by the Solr config, Java, and/or the
operating system that I have not thought of or stated here.

Thanks,
Shawn

Re: Solr Cluster management having too many cores

Posted by Mukesh Jha <me...@gmail.com>.

Thanks Erick,

Sounds about right.

BTW, how long can I keep adding collections, i.e. can I keep 5-10 years of
data like this?

Also, what do you think of bullet 2), having collection-specific
configurations in ZooKeeper?





-- 


Thanks & Regards,

*Mukesh Jha <me...@gmail.com>*

Re: Solr Cluster management having too many cores

Posted by Erick Erickson <er...@gmail.com>.

So you're talking about 700 or so collections. That should be doable,
especially as Solr is rapidly evolving to handle more and more
collections and there are two years for that to happen.
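
As a rough check on the "700 or so" figure, the sketch below counts collections and cores for a two-year window. The shard and replica counts are placeholder assumptions, not values from this thread.

```python
def collection_count(years, per_day=1):
    """Number of time-partitioned collections kept for `years` of data."""
    return years * 365 * per_day


def core_count(collections, shards, replicas):
    """Each collection contributes shards * replicas cores to the cluster."""
    return collections * shards * replicas


daily = collection_count(2)               # one collection per day
hourly = collection_count(2, per_day=24)  # one collection per hour

# shards=2 and replicas=2 are illustrative assumptions only.
print(daily, core_count(daily, shards=2, replicas=2))
print(hourly, core_count(hourly, shards=2, replicas=2))
```

Daily collections land near the 700 estimate; hourly collections would be 24 times that, which changes the scaling picture considerably.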

The aging-out bit is manual (well, you'd script it, I suppose). So
every day there'd be a script that ran and "just knew" the right
collection to change the alias on; there's nothing automatic yet.
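
A minimal sketch of what such a daily script could compute. It assumes daily collections named `sample_collection_yyyy_mm_dd` (the hour suffix is dropped for brevity), a 730-day retention window, and a Solr node at `http://localhost:8983/solr`; all three are assumptions. It only builds the Collections API `CREATEALIAS` and `DELETE` request URLs; sending them is left to the caller.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

SOLR_URL = "http://localhost:8983/solr"  # assumed node address
ALIAS = "sample_collection"
RETENTION_DAYS = 730                      # assumed retention window


def collection_name(day):
    """Name of the daily collection for `day` (naming scheme assumed)."""
    return f"{ALIAS}_{day:%Y_%m_%d}"


def live_collections(today, retention_days=RETENTION_DAYS):
    """Collections that should stay behind the alias, newest first."""
    return [collection_name(today - timedelta(days=i))
            for i in range(retention_days)]


def create_alias_url(today, retention_days=RETENTION_DAYS):
    """CREATEALIAS re-points the alias at exactly the listed collections."""
    params = {
        "action": "CREATEALIAS",
        "name": ALIAS,
        "collections": ",".join(live_collections(today, retention_days)),
    }
    return f"{SOLR_URL}/admin/collections?{urlencode(params)}"


def delete_expired_url(today, retention_days=RETENTION_DAYS):
    """DELETE request for the collection that just left the window."""
    expired = collection_name(today - timedelta(days=retention_days))
    params = {"action": "DELETE", "name": expired}
    return f"{SOLR_URL}/admin/collections?{urlencode(params)}"


# Short window used here only to keep the printed URLs readable.
today = date(2014, 4, 25)
print(create_alias_url(today, retention_days=3))
print(delete_expired_url(today, retention_days=3))
```

With hundreds of collections behind one alias the query string gets large, so check the request size limits of your servlet container before relying on this as written.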

Best,
Erick


Re: Solr Cluster management having too many cores

Posted by Mukesh Jha <me...@gmail.com>.

Thanks for the quick reply, Erick,

I want to keep my collections until I run out of hardware, which means at
least a couple of years' worth of data.
I'd like to know more about aging out collections with aliases; I did a
quick search but didn't find much.





-- 


Thanks & Regards,

*Mukesh Jha <me...@gmail.com>*

Re: Solr Cluster management having too many cores

Posted by Erick Erickson <er...@gmail.com>.

Hmmm, tell us a little more about your use-case. In particular, how
long do you need to keep the data around? Days? Months? Years?

Because if you only need to keep the data for a specified period, you
can use the collection aliasing process to age-out collections and
keep the number of cores from growing too large.

Best,
Erick
