Posted to dev@asterixdb.apache.org by Chris Hillery <ch...@hillery.land> on 2016/06/29 08:05:43 UTC

Cluster XML files

My understanding of how Managix-based deployment currently works is as
follows:

  - User composes a cluster.xml

  - Managix consumes this and produces an asterix-configuration.xml, which
contains some of the same data as cluster.xml as well as some things
derived from that data (such as composing the <iodevices> directories with
the <store> subdirectory name to produce <storeDirs>)

  - Managix places both the original cluster.xml and the generated
asterix-configuration.xml onto the CLASSPATH of the NCs and CCs

  - The user is never directly aware of asterix-configuration.xml, and
certainly does not edit it in the normal course of operation

Is this an accurate summary?
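
A minimal sketch of the derivation mentioned in the second bullet, for illustration only. It assumes that <iodevices> holds a comma-separated list of directories and that <store> is a single subdirectory name; the function name and the sample paths are hypothetical and are not taken from Managix.

```python
import posixpath


def compose_store_dirs(iodevices: str, store: str) -> list[str]:
    """Join each I/O device directory with the store subdirectory name."""
    # Assumption: iodevices is a comma-separated list of directories.
    devices = [d.strip() for d in iodevices.split(",") if d.strip()]
    return [posixpath.join(d, store) for d in devices]


# Hypothetical sample values.
print(compose_store_dirs("/mnt/disk1, /mnt/disk2", "storage"))
```

The result is one store directory per I/O device, which is the kind of derived value the generated file would carry in addition to the data copied from cluster.xml.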

Ceej
aka Chris Hillery

Re: Cluster XML files

Posted by Raman Grover <ra...@gmail.com>.

Hi

I understand the challenge here, and it's tricky :)

We have two dimensions along which to slice our configs:
a) physical config vs. application config
b) mutability via the managix alter command (these can be further split into
those requiring a restart and those that do not)

I think it is simpler and cleaner to slice it along dimension (b). The
reasons are as follows.

(i) Config params that need to be user-defined, but only once (e.g. page
size), sit nicely in cluster.xml and follow the immutable-yet-configurable
policy directly.

(ii) Config params in cluster.xml are validated prior to use (managix
validate). Performance-sensitive properties like page size can be validated
here to avoid mistakes.

(iii) We do not need any additional logic in alter command to filter
immutable configs.

(iv) Config params such as ports, directories, and ip addresses are not
typically modified across restarts. Using (b), these remain in cluster.xml.
One may argue for ports being mutable, but if a port is conflicting, it is
so right from the beginning and should be caught as part of cluster
validation. A non-conflicting port should not require any modifications
thereafter. So it is immutable for practical reasons.

Next, there are configs in asterix-configuration.xml that could in theory be
changed without a restart. Currently we do not have a way to propagate
values without restarting (but this can be done easily). So let's assume we
did that. This would cause NC opts to move to cluster.xml (which is not ugly
but actually makes sense).

So the split is as follows:

a) Cluster.xml: configurable yet immutable configs
   params: IP addresses, directories, ports, JVM options, page size, etc.

b) AsterixConfiguration: mutable configs (a current limitation requires
   restarts, but we would fix it)
   params: all others.
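
A rough sketch of point (iii) under this split, for illustration only. The parameter names are the ones quoted elsewhere in this thread; the values, the variable names, and the alter function are hypothetical and do not reflect how managix alter is implemented. The idea is that alter only ever sees the mutable set, so it needs no dedicated list of immutable parameters.

```python
from types import MappingProxyType

# (a) Configurable yet immutable: fixed when the cluster is created.
# Values are hypothetical.
cluster_params = MappingProxyType({
    "web.port": "19001",
    "nc.java.opts": "-Xmx1024m",
    "storage.buffercache.pagesize": "131072",
})

# (b) Mutable: the only parameters that alter operates on.
instance_params = {"compiler.sortmemory": "33554432"}


def alter(current: dict[str, str], changes: dict[str, str]) -> dict[str, str]:
    """Return a copy of the mutable parameters with the changes applied."""
    unknown = sorted(set(changes) - set(current))
    if unknown:
        # Immutable parameters are simply not present in the mutable set.
        raise KeyError(f"not an alterable parameter: {unknown}")
    return {**current, **changes}


print(alter(instance_params, {"compiler.sortmemory": "67108864"}))
```

An attempt to alter "storage.buffercache.pagesize" fails here only because that key lives in the other set, not because of any page-size-specific check.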

We need to find better names for (a) and (b) though to better reflect this
split.
Thoughts?

Regards,
Raman

On Wed, Jun 29, 2016 at 12:56 PM, Till Westmann <ti...@apache.org> wrote:

> Hi Raman,
>
> thanks for chiming in. The separation of the physical configuration from
> the
> software configuration indeed looks good.
> However, I’m a little challenged by the current split. If the physical
> configuration is in 1 file, it seems that it should contain all network and
> storage settings. However, we also have some network settings (e.g.
> "web.port") and storage-like settings (at least "compiler.pregelix.home"
> refers to a directory ..) in asterix-configuration.xml.
> Should those move to the cluster.xml? Or should those be where they are?
>
> Also, I'm wondering if there's a difference in the lifecycle of the
> parameter settings? Are all the parameters in cluster.xml fixed "forever"?
> Or could some of them be modified between restarts (e.g. it seems feasible
> to change ports between restarts, while changing storage directories will
> probably break the cluster).
> Also it seems that some of the parameters in asterix-configuration.xml can
> only be changed between restarts (e.g. "nc.java.opts"), while changing
> others would break the cluster (e.g. "storage.buffercache.pagesize"), and
> others could theoretically be changed in a running cluster or even per job
> (e.g. "compiler.sortmemory").
> Would it maybe make sense to split configurations along those lines? Or
> should we just put all configurations in one file and leave it up the the
> user to make sense of the lifecycles?
>
> I'm really not sure if there's a "right" way to organize these and - if so
> -
> what it is.
>
> Cheers,
> Till
>
> On 29 Jun 2016, at 9:40, Raman Grover wrote:
>
> > Hi,
> >
> > It was natural to define your physical clusters separately from the
> > properties of the Asterix instance(s) that run over the hardware.
> >
> > As such, the cluster xml mapped to the clusters we had - sensorium,
> > asterix, or the yahoo cluster we once had access to. A single cluster xml
> > could be reused by multiple devs wishing to use a part (by commenting out
> > sections in the xml)  or the complete cluster to launch their instances.
> > Properties related to the cluster do not change often e.g. the IP
> addresses
> > etc and so these need not be repeated and redefined for each asterix
> > instance.
> >
> > Asterix configuration xml was meant to contain tuning parameters specific
> > to an asterix instance.
> >
> > So the user model was to have a fixed set of cluster xmls and a set of
> > asterix configuration files, maintained by different users, each
> > representing different runtine tuning parameters that devs would have
> > different values for or would frequently change as per the workload or
> > experiments they are running.
> >
> > separation of concerns and avoiding repetition of properties (when
> defining
> > multiple instances over the same hardware)  were the main reasons for
> > having two separate files.
> >
> > Regards,
> > Raman
> > On Jun 29, 2016 8:36 AM, "Till Westmann" <ti...@apache.org> wrote:
> >
> >> Is there a conceptual or lifecycle reason to put a parameter in one or
> the
> >> other file? I really would like to understand why we have 2 files and
> what
> >> the difference is. I think that one hint might be what Ian just
> mentioned,
> >> that the parameters in asterix-configuration.xml can be modified (with a
> >> restart?) and the other ones cannot. Is that right?
> >>
> >> On 29 Jun 2016, at 7:56, Ian Maxon wrote:
> >>
> >>> Managix sort of splices the cluster.xml with the existing
> >>> asterix-configuration.xml to produce a new asterix-configuration.xml
> that
> >>> then gets put into the asterix-app jar inside of asterix-server. The
> user
> >>> has to know about the base asterix-configuration.xml because that is
> >> where
> >>> you change some important memory parameters. You can also edit it
> without
> >>> deleting the cluster itself (managix alter).
> >>>
> >>> On Wed, Jun 29, 2016 at 1:05 AM, Chris Hillery <ch...@hillery.land>
> >>> wrote:
> >>>
> >>>> My understanding of how Managix-based deployment currently works is as
> >>>> follows:
> >>>>
> >>>>   - User composes a cluster.xml
> >>>>
> >>>>   - Managix consumes this and produces an asterix-configuration.xml,
> >> which
> >>>> contains some of the same data as cluster.xml as well as some things
> >>>> derived from that data (such as composing the <iodevices> directories
> >> with
> >>>> the <store> subdirectory name to produce <storeDirs>)
> >>>>
> >>>>   - Managix places both the original cluster.xml and the generated
> >>>> asterix-configuration.xml onto the CLASSPATH of the NCs and CCs
> >>>>
> >>>>   - The user is never directly aware of asterix-configuration.xml, and
> >>>> certainly does not edit it in the normal course of operation
> >>>>
> >>>> Is this an accurate summary?
> >>>>
> >>>> Ceej
> >>>> aka Chris Hillery
> >>>>
> >>
>



-- 
Raman

Re: Cluster XML files

Posted by Till Westmann <ti...@apache.org>.

Hi Raman,

thanks for chiming in. The separation of the physical configuration from the
software configuration indeed looks good.
However, I'm a little challenged by the current split. If the physical
configuration is in 1 file, it seems that it should contain all network and
storage settings. However, we also have some network settings (e.g.
"web.port") and storage-like settings (at least "compiler.pregelix.home"
refers to a directory ..) in asterix-configuration.xml.
Should those move to the cluster.xml? Or should those be where they are?

Also, I'm wondering if there's a difference in the lifecycle of the
parameter settings? Are all the parameters in cluster.xml fixed "forever"?
Or could some of them be modified between restarts (e.g. it seems feasible
to change ports between restarts, while changing storage directories will
probably break the cluster).
Also it seems that some of the parameters in asterix-configuration.xml can
only be changed between restarts (e.g. "nc.java.opts"), while changing
others would break the cluster (e.g. "storage.buffercache.pagesize"), and
others could theoretically be changed in a running cluster or even per job
(e.g. "compiler.sortmemory").
Would it maybe make sense to split configurations along those lines? Or
should we just put all configurations in one file and leave it up to the
user to make sense of the lifecycles?
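
One way to picture the three lifecycles described above, as a sketch only. The three parameter names and their classification come from the examples in this message; the enum, the table, and the can_change helper are hypothetical.

```python
from enum import Enum


class Lifecycle(Enum):
    FIXED = "fixed once the cluster exists"
    RESTART = "changeable between restarts"
    RUNTIME = "changeable in a running cluster or per job"


# Classification taken from the examples given in the message.
LIFECYCLE = {
    "storage.buffercache.pagesize": Lifecycle.FIXED,
    "nc.java.opts": Lifecycle.RESTART,
    "compiler.sortmemory": Lifecycle.RUNTIME,
}


def can_change(name: str, *, running: bool) -> bool:
    """Whether a parameter of an existing cluster may be changed right now."""
    lifecycle = LIFECYCLE[name]
    if lifecycle is Lifecycle.FIXED:
        return False
    if lifecycle is Lifecycle.RESTART:
        return not running
    return True


for name in LIFECYCLE:
    print(name, can_change(name, running=True), can_change(name, running=False))
```

Splitting the files "along those lines" would amount to grouping parameters by this enum value rather than by physical versus software configuration.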

I'm really not sure if there's a "right" way to organize these and - if so -
what it is.

Cheers,
Till

On 29 Jun 2016, at 9:40, Raman Grover wrote:

> Hi,
>
> It was natural to define your physical clusters separately from the
> properties of the Asterix instance(s) that run over the hardware.
>
> As such, the cluster xml mapped to the clusters we had - sensorium,
> asterix, or the yahoo cluster we once had access to. A single cluster xml
> could be reused by multiple devs wishing to use a part (by commenting out
> sections in the xml)  or the complete cluster to launch their instances.
> Properties related to the cluster do not change often e.g. the IP addresses
> etc and so these need not be repeated and redefined for each asterix
> instance.
>
> Asterix configuration xml was meant to contain tuning parameters specific
> to an asterix instance.
>
> So the user model was to have a fixed set of cluster xmls and a set of
> asterix configuration files, maintained by different users, each
> representing different runtine tuning parameters that devs would have
> different values for or would frequently change as per the workload or
> experiments they are running.
>
> separation of concerns and avoiding repetition of properties (when defining
> multiple instances over the same hardware)  were the main reasons for
> having two separate files.
>
> Regards,
> Raman
> On Jun 29, 2016 8:36 AM, "Till Westmann" <ti...@apache.org> wrote:
>
>> Is there a conceptual or lifecycle reason to put a parameter in one or the
>> other file? I really would like to understand why we have 2 files and what
>> the difference is. I think that one hint might be what Ian just mentioned,
>> that the parameters in asterix-configuration.xml can be modified (with a
>> restart?) and the other ones cannot. Is that right?
>>
>> On 29 Jun 2016, at 7:56, Ian Maxon wrote:
>>
>>> Managix sort of splices the cluster.xml with the existing
>>> asterix-configuration.xml to produce a new asterix-configuration.xml that
>>> then gets put into the asterix-app jar inside of asterix-server. The user
>>> has to know about the base asterix-configuration.xml because that is
>> where
>>> you change some important memory parameters. You can also edit it without
>>> deleting the cluster itself (managix alter).
>>>
>>> On Wed, Jun 29, 2016 at 1:05 AM, Chris Hillery <ch...@hillery.land>
>>> wrote:
>>>
>>>> My understanding of how Managix-based deployment currently works is as
>>>> follows:
>>>>
>>>>   - User composes a cluster.xml
>>>>
>>>>   - Managix consumes this and produces an asterix-configuration.xml,
>> which
>>>> contains some of the same data as cluster.xml as well as some things
>>>> derived from that data (such as composing the <iodevices> directories
>> with
>>>> the <store> subdirectory name to produce <storeDirs>)
>>>>
>>>>   - Managix places both the original cluster.xml and the generated
>>>> asterix-configuration.xml onto the CLASSPATH of the NCs and CCs
>>>>
>>>>   - The user is never directly aware of asterix-configuration.xml, and
>>>> certainly does not edit it in the normal course of operation
>>>>
>>>> Is this an accurate summary?
>>>>
>>>> Ceej
>>>> aka Chris Hillery
>>>>
>>

Re: Cluster XML files

Posted by Raman Grover <ra...@gmail.com>.

Hi,

It was natural to define your physical clusters separately from the
properties of the Asterix instance(s) that run over the hardware.

As such, the cluster xml mapped to the clusters we had - sensorium,
asterix, or the yahoo cluster we once had access to. A single cluster xml
could be reused by multiple devs wishing to use a part (by commenting out
sections in the xml)  or the complete cluster to launch their instances.
Properties related to the cluster do not change often e.g. the IP addresses
etc and so these need not be repeated and redefined for each asterix
instance.

Asterix configuration xml was meant to contain tuning parameters specific
to an asterix instance.

So the user model was to have a fixed set of cluster xmls and a set of
asterix configuration files, maintained by different users, each
representing different runtime tuning parameters that devs would have
different values for or would frequently change as per the workload or
experiments they are running.

Separation of concerns and avoiding repetition of properties (when defining
multiple instances over the same hardware)  were the main reasons for
having two separate files.

Regards,
Raman
On Jun 29, 2016 8:36 AM, "Till Westmann" <ti...@apache.org> wrote:

> Is there a conceptual or lifecycle reason to put a parameter in one or the
> other file? I really would like to understand why we have 2 files and what
> the difference is. I think that one hint might be what Ian just mentioned,
> that the parameters in asterix-configuration.xml can be modified (with a
> restart?) and the other ones cannot. Is that right?
>
> On 29 Jun 2016, at 7:56, Ian Maxon wrote:
>
> > Managix sort of splices the cluster.xml with the existing
> > asterix-configuration.xml to produce a new asterix-configuration.xml that
> > then gets put into the asterix-app jar inside of asterix-server. The user
> > has to know about the base asterix-configuration.xml because that is
> where
> > you change some important memory parameters. You can also edit it without
> > deleting the cluster itself (managix alter).
> >
> > On Wed, Jun 29, 2016 at 1:05 AM, Chris Hillery <ch...@hillery.land>
> > wrote:
> >
> >> My understanding of how Managix-based deployment currently works is as
> >> follows:
> >>
> >>   - User composes a cluster.xml
> >>
> >>   - Managix consumes this and produces an asterix-configuration.xml,
> which
> >> contains some of the same data as cluster.xml as well as some things
> >> derived from that data (such as composing the <iodevices> directories
> with
> >> the <store> subdirectory name to produce <storeDirs>)
> >>
> >>   - Managix places both the original cluster.xml and the generated
> >> asterix-configuration.xml onto the CLASSPATH of the NCs and CCs
> >>
> >>   - The user is never directly aware of asterix-configuration.xml, and
> >> certainly does not edit it in the normal course of operation
> >>
> >> Is this an accurate summary?
> >>
> >> Ceej
> >> aka Chris Hillery
> >>
>

Re: Cluster XML files

Posted by Raman Grover <ra...@gmail.com>.

Yeah, not documenting the alter command was a conscious decision we took
(I remember, in Mike's office), as a not-so-advanced user may inadvertently
change parameters that would slow down the system. Exposing these could
mean introducing too many knobs.

> One quick point to add is that somehow I've always found the 'alter'
> command being hidden from the typical end-users. It was never part of the
> installation documentation (with managix):
>
> https://ci.apache.org/projects/asterixdb/install.html#Section4ManagingTheLifecycleOfAnAsterixDBInstance
>
> So basically if one goal of separation is letting end users modify a
> portion of config at least they need to be aware of that.
>
> Pouria

On Wed, Jun 29, 2016 at 10:53 AM, Ian Maxon <im...@uci.edu> wrote:

> You can modify some things in the asterix-configuration.xml that you really
> shouldn't once you've created a cluster (page size for one), so it's not
> total, but in general yes the cluster.xml contains most of the "immutable"
> stuff.
>
> On Wed, Jun 29, 2016 at 8:36 AM, Till Westmann <ti...@apache.org> wrote:
>
> > Is there a conceptual or lifecycle reason to put a parameter in one or
> the
> > other file? I really would like to understand why we have 2 files and
> what
> > the difference is. I think that one hint might be what Ian just
> mentioned,
> > that the parameters in asterix-configuration.xml can be modified (with a
> > restart?) and the other ones cannot. Is that right?
> >
> > On 29 Jun 2016, at 7:56, Ian Maxon wrote:
> >
> > > Managix sort of splices the cluster.xml with the existing
> > > asterix-configuration.xml to produce a new asterix-configuration.xml
> that
> > > then gets put into the asterix-app jar inside of asterix-server. The
> user
> > > has to know about the base asterix-configuration.xml because that is
> > where
> > > you change some important memory parameters. You can also edit it
> without
> > > deleting the cluster itself (managix alter).
> > >
> > > On Wed, Jun 29, 2016 at 1:05 AM, Chris Hillery <ch...@hillery.land>
> > > wrote:
> > >
> > >> My understanding of how Managix-based deployment currently works is as
> > >> follows:
> > >>
> > >>   - User composes a cluster.xml
> > >>
> > >>   - Managix consumes this and produces an asterix-configuration.xml,
> > which
> > >> contains some of the same data as cluster.xml as well as some things
> > >> derived from that data (such as composing the <iodevices> directories
> > with
> > >> the <store> subdirectory name to produce <storeDirs>)
> > >>
> > >>   - Managix places both the original cluster.xml and the generated
> > >> asterix-configuration.xml onto the CLASSPATH of the NCs and CCs
> > >>
> > >>   - The user is never directly aware of asterix-configuration.xml, and
> > >> certainly does not edit it in the normal course of operation
> > >>
> > >> Is this an accurate summary?
> > >>
> > >> Ceej
> > >> aka Chris Hillery
> > >>
> >
>

Re: Cluster XML files

Posted by Pouria Pirzadeh <po...@gmail.com>.

One quick point to add is that somehow I've always found the 'alter'
command being hidden from the typical end-users. It was never part of the
installation documentation (with managix):

https://ci.apache.org/projects/asterixdb/install.html#Section4ManagingTheLifecycleOfAnAsterixDBInstance

So basically if one goal of separation is letting end users modify a
portion of config at least they need to be aware of that.

Pouria

On Wed, Jun 29, 2016 at 10:53 AM, Ian Maxon <im...@uci.edu> wrote:

> You can modify some things in the asterix-configuration.xml that you really
> shouldn't once you've created a cluster (page size for one), so its not
> total, but in general yes the cluster.xml contains most of the "immutable"
> stuff.
>
> On Wed, Jun 29, 2016 at 8:36 AM, Till Westmann <ti...@apache.org> wrote:
>
> > Is there a conceptual or lifecycle reason to put a parameter in one or
> the
> > other file? I really would like to understand why we have 2 files and
> what
> > the difference is. I think that one hint might be what Ian just
> mentioned,
> > that the parameters in asterix-configuration.xml can be modified (with a
> > restart?) and the other ones cannot. Is that right?
> >
> > On 29 Jun 2016, at 7:56, Ian Maxon wrote:
> >
> > > Managix sort of splices the cluster.xml with the existing
> > > asterix-configuration.xml to produce a new asterix-configuration.xml
> that
> > > then gets put into the asterix-app jar inside of asterix-server. The
> user
> > > has to know about the base asterix-configuration.xml because that is
> > where
> > > you change some important memory parameters. You can also edit it
> without
> > > deleting the cluster itself (managix alter).
> > >
> > > On Wed, Jun 29, 2016 at 1:05 AM, Chris Hillery <ch...@hillery.land>
> > > wrote:
> > >
> > >> My understanding of how Managix-based deployment currently works is as
> > >> follows:
> > >>
> > >>   - User composes a cluster.xml
> > >>
> > >>   - Managix consumes this and produces an asterix-configuration.xml,
> > which
> > >> contains some of the same data as cluster.xml as well as some things
> > >> derived from that data (such as composing the <iodevices> directories
> > with
> > >> the <store> subdirectory name to produce <storeDirs>)
> > >>
> > >>   - Managix places both the original cluster.xml and the generated
> > >> asterix-configuration.xml onto the CLASSPATH of the NCs and CCs
> > >>
> > >>   - The user is never directly aware of asterix-configuration.xml, and
> > >> certainly does not edit it in the normal course of operation
> > >>
> > >> Is this an accurate summary?
> > >>
> > >> Ceej
> > >> aka Chris Hillery
> > >>
> >
>

Re: Cluster XML files

Posted by Ian Maxon <im...@uci.edu>.

You can modify some things in the asterix-configuration.xml that you really
shouldn't once you've created a cluster (page size for one), so it's not
total, but in general yes the cluster.xml contains most of the "immutable"
stuff.

On Wed, Jun 29, 2016 at 8:36 AM, Till Westmann <ti...@apache.org> wrote:

> Is there a conceptual or lifecycle reason to put a parameter in one or the
> other file? I really would like to understand why we have 2 files and what
> the difference is. I think that one hint might be what Ian just mentioned,
> that the parameters in asterix-configuration.xml can be modified (with a
> restart?) and the other ones cannot. Is that right?
>
> On 29 Jun 2016, at 7:56, Ian Maxon wrote:
>
> > Managix sort of splices the cluster.xml with the existing
> > asterix-configuration.xml to produce a new asterix-configuration.xml that
> > then gets put into the asterix-app jar inside of asterix-server. The user
> > has to know about the base asterix-configuration.xml because that is
> where
> > you change some important memory parameters. You can also edit it without
> > deleting the cluster itself (managix alter).
> >
> > On Wed, Jun 29, 2016 at 1:05 AM, Chris Hillery <ch...@hillery.land>
> > wrote:
> >
> >> My understanding of how Managix-based deployment currently works is as
> >> follows:
> >>
> >>   - User composes a cluster.xml
> >>
> >>   - Managix consumes this and produces an asterix-configuration.xml,
> which
> >> contains some of the same data as cluster.xml as well as some things
> >> derived from that data (such as composing the <iodevices> directories
> with
> >> the <store> subdirectory name to produce <storeDirs>)
> >>
> >>   - Managix places both the original cluster.xml and the generated
> >> asterix-configuration.xml onto the CLASSPATH of the NCs and CCs
> >>
> >>   - The user is never directly aware of asterix-configuration.xml, and
> >> certainly does not edit it in the normal course of operation
> >>
> >> Is this an accurate summary?
> >>
> >> Ceej
> >> aka Chris Hillery
> >>
>

Re: Cluster XML files

Posted by Till Westmann <ti...@apache.org>.

Is there a conceptual or lifecycle reason to put a parameter in one or the
other file? I really would like to understand why we have 2 files and what
the difference is. I think that one hint might be what Ian just mentioned,
that the parameters in asterix-configuration.xml can be modified (with a
restart?) and the other ones cannot. Is that right?

On 29 Jun 2016, at 7:56, Ian Maxon wrote:

> Managix sort of splices the cluster.xml with the existing
> asterix-configuration.xml to produce a new asterix-configuration.xml that
> then gets put into the asterix-app jar inside of asterix-server. The user
> has to know about the base asterix-configuration.xml because that is where
> you change some important memory parameters. You can also edit it without
> deleting the cluster itself (managix alter).
>
> On Wed, Jun 29, 2016 at 1:05 AM, Chris Hillery <ch...@hillery.land>
> wrote:
>
>> My understanding of how Managix-based deployment currently works is as
>> follows:
>>
>>   - User composes a cluster.xml
>>
>>   - Managix consumes this and produces an asterix-configuration.xml, which
>> contains some of the same data as cluster.xml as well as some things
>> derived from that data (such as composing the <iodevices> directories with
>> the <store> subdirectory name to produce <storeDirs>)
>>
>>   - Managix places both the original cluster.xml and the generated
>> asterix-configuration.xml onto the CLASSPATH of the NCs and CCs
>>
>>   - The user is never directly aware of asterix-configuration.xml, and
>> certainly does not edit it in the normal course of operation
>>
>> Is this an accurate summary?
>>
>> Ceej
>> aka Chris Hillery
>>

Re: Cluster XML files

Posted by Ian Maxon <im...@uci.edu>.

Managix sort of splices the cluster.xml with the existing
asterix-configuration.xml to produce a new asterix-configuration.xml that
then gets put into the asterix-app jar inside of asterix-server. The user
has to know about the base asterix-configuration.xml because that is where
you change some important memory parameters. You can also edit it without
deleting the cluster itself (managix alter).
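
A minimal sketch of the splice step, for illustration only. It assumes a simplified asterix-configuration.xml made of <property> elements with <name> and <value> children and no XML namespace, and it models the values derived from cluster.xml as a plain dictionary; the splice function and the sample values are hypothetical and are not Managix code.

```python
import xml.etree.ElementTree as ET


def splice(base_conf_xml: str, derived: dict[str, str]) -> str:
    """Overlay values derived from cluster.xml onto the base configuration."""
    root = ET.fromstring(base_conf_xml)
    existing = {p.findtext("name"): p for p in root.findall("property")}
    for name, value in derived.items():
        prop = existing.get(name)
        if prop is None:
            # Not in the base file: add a new property element.
            prop = ET.SubElement(root, "property")
            ET.SubElement(prop, "name").text = name
        value_el = prop.find("value")
        if value_el is None:
            value_el = ET.SubElement(prop, "value")
        value_el.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical base file and derived value.
base = (
    "<asterixConfiguration>"
    "<property><name>compiler.sortmemory</name><value>33554432</value></property>"
    "</asterixConfiguration>"
)
print(splice(base, {"storeDirs": "/mnt/disk1/storage"}))
```

The user-edited tuning parameters in the base file survive, and the cluster-derived entries are added or overwritten in the generated copy.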

On Wed, Jun 29, 2016 at 1:05 AM, Chris Hillery <ch...@hillery.land>
wrote:

> My understanding of how Managix-based deployment currently works is as
> follows:
>
>   - User composes a cluster.xml
>
>   - Managix consumes this and produces an asterix-configuration.xml, which
> contains some of the same data as cluster.xml as well as some things
> derived from that data (such as composing the <iodevices> directories with
> the <store> subdirectory name to produce <storeDirs>)
>
>   - Managix places both the original cluster.xml and the generated
> asterix-configuration.xml onto the CLASSPATH of the NCs and CCs
>
>   - The user is never directly aware of asterix-configuration.xml, and
> certainly does not edit it in the normal course of operation
>
> Is this an accurate summary?
>
> Ceej
> aka Chris Hillery
>