Posted to common-user@hadoop.apache.org by Steve Loughran <st...@apache.org> on 2008/05/15 14:05:55 UTC

How do people keep their client configurations in sync with the remote cluster(s)

I have a question for users: how do they ensure their client apps have
configuration XML files that are kept up to date?

I know how I do it to date (get the site config off the site team, have
my private copy in SVN), but that is too brittle, and diagnosing
failures is pretty tricky. All you get is "Failed to Submit Job!"
exceptions and local stack traces, from which you have to work backwards
to the cause.
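
To make that concrete: a minimal sketch (old mapred API; the class name
is made up) of where a client's view of the cluster comes from today,
namely whatever hadoop-site.xml is on its classpath:

  import org.apache.hadoop.mapred.JobConf;

  public class StaleConfigDemo {
    public static void main(String[] args) {
      // Values come from hadoop-default.xml / hadoop-site.xml on the
      // classpath; nothing here talks to the cluster to verify them.
      JobConf conf = new JobConf(StaleConfigDemo.class);
      System.out.println("fs.default.name    = " + conf.get("fs.default.name"));
      System.out.println("mapred.job.tracker = " + conf.get("mapred.job.tracker"));
      // A job submitted with stale values dies with a local stack trace.
    }
  }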

I'm thinking of looking at what it would take for a job submitter to ask 
the tracker for its config data, to get things like the various 
directory bases from the cluster, instead of being compiled into the 
client. Then the management problem becomes one of keeping the cluster 
configuration under control, which is a much easier proposition.

what do people do right now?

-steve

Re: How do people keep their client configurations in sync with the remote cluster(s)

Posted by Arun C Murthy <ar...@yahoo-inc.com>.
Steve,

On May 15, 2008, at 5:05 AM, Steve Loughran wrote:

>
> I have a question for users: how do they ensure their client apps  
> have configuration XML files that are kept up to date?
>
> I know how I do it to date (get the site config off the site team,  
> have my private copy in SVN), but that is too brittle, and  
> diagnosing failures is pretty tricky. All you get is "Failed to  
> Submit Job!" exceptions and local stack traces, from which you have  
> to work backwards to the cause.
>
> I'm thinking of looking at what it would take for a job submitter  
> to ask the tracker for its config data, to get things like the  
> various directory bases from the cluster, instead of being compiled  
> into the client. Then the management problem becomes one of keeping  
> the cluster configuration under control, which is a much easier  
> proposition.
>

Fair analysis. http://issues.apache.org/jira/browse/HADOOP-3135 is
quite close; it should get into hadoop-0.18 and will help a great deal.
It does precisely what you proposed: fixing the job submitter to query
the JT for the right configs.
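
Roughly, the client side after that patch could look like the sketch
below; the getSystemDir() accessor and the tracker address are my
guesses for illustration, not necessarily the final API:

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class QueryTracker {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf();
      conf.set("mapred.job.tracker", "jt.example.com:8021"); // only the JT address is local
      JobClient client = new JobClient(conf);
      // Ask the live JobTracker for its system directory instead of
      // trusting mapred.system.dir from a local XML file:
      Path sysDir = client.getSystemDir();
      System.out.println("cluster system dir: " + sysDir);
    }
  }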

Arun

Re: How do people keep their client configurations in sync with the remote cluster(s)

Posted by Allen Wittenauer <aw...@yahoo-inc.com>.


On 5/15/08 8:56 AM, "Steve Loughran" <st...@apache.org> wrote:

> Allen Wittenauer wrote:
>> On 5/15/08 5:05 AM, "Steve Loughran" <st...@apache.org> wrote:
>>> I have a question for users: how do they ensure their client apps have
>>> configuration XML files that are kept up to date?
>> 
>>     We control the client as well as the servers, so it all gets pushed at
>> once. :)
> 
> yes, but you use NFS, so you have your own problems, like the log
> message "NFS Server not responding still trying" appearing across
> everyone's machines simultaneously, which is to be feared almost as much
> as when ClearCase announces that its filesystem is offline.

    We don't use NFS for this.



Re: How do people keep their client configurations in sync with the remote cluster(s)

Posted by Steve Loughran <st...@apache.org>.
Allen Wittenauer wrote:
> On 5/15/08 5:05 AM, "Steve Loughran" <st...@apache.org> wrote:
>> I have a question for users: how do they ensure their client apps have
>> configuration XML files that are kept up to date?
> 
>     We control the client as well as the servers, so it all gets pushed at
> once. :)

yes, but you use NFS, so you have your own problems, like the log 
message "NFS Server not responding still trying" appearing across 
everyone's machines simultaneously, which is to be feared almost as much 
as when ClearCase announces that its filesystem is offline.

> 
>     That said, we're starting to allow clients that aren't controlled by us
> to talk to our grids.  We'll likely re-bundle our configs into digestible
> packages for them at some point and then have flag days.

Mmm. But then you have the problem that once you change your settings,
all code that has the old settings compiled into its JAR breaks.

>> I'm thinking of looking at what it would take for a job submitter to ask
>> the tracker for its config data, to get things like the various
>> directory bases from the cluster, instead of being compiled into the
>> client. Then the management problem becomes one of keeping the cluster
>> configuration under control, which is a much easier proposition.
> 
>     But I like this idea a lot.  The tricky part comes when clients really
> do need to modify something (# of mappers, heap size, whatever).
> 

yes. I think the jobs need to have the right to override most of a
site's settings, but I don't see why they should have the responsibility
of getting all those settings in the first place, at build time. At the
very least, they should be retrieved at run time.

Having looked at http://issues.apache.org/jira/browse/HADOOP-3135, I
can see that it addresses a core issue: you need to know the cluster's
filesystem layout.

-steve

Re: How do people keep their client configurations in sync with the remote cluster(s)

Posted by Marco Nicosia <ma...@yahoo-inc.com>.
On 5/15/08 08:35, "Allen Wittenauer" <aw...@yahoo-inc.com> wrote:
>> I'm thinking of looking at what it would take for a job submitter to ask
>> the tracker for its config data, to get things like the various
>> directory bases from the cluster, instead of being compiled into the
>> client. Then the management problem becomes one of keeping the cluster
>> configuration under control, which is a much easier proposition.
> 
>     But I like this idea a lot.  The tricky part comes when clients really
> do need to modify something (# of mappers, heap size, whatever).

It may be too late in the process, but couldn't all of this
configuration be included as fields in the job config?
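
That is roughly what JobConf already gives you for per-job knobs; a
small sketch of the overrides Allen mentions (the values are
illustrative):

  import org.apache.hadoop.mapred.JobConf;

  public class PerJobOverrides {
    public static void main(String[] args) {
      JobConf job = new JobConf();
      job.setNumMapTasks(200);                       // hint for # of mappers
      job.setNumReduceTasks(16);
      job.set("mapred.child.java.opts", "-Xmx512m"); // per-task heap size
      // Anything *not* set here would, under Steve's proposal, come
      // from the cluster's defaults rather than a baked-in hadoop-site.xml.
    }
  }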

-- 
   Marco Nicosia - Grid Services Ops
   Systems, Tools, and Services Group



Re: How do people keep their client configurations in sync with the remote cluster(s)

Posted by Allen Wittenauer <aw...@yahoo-inc.com>.
On 5/15/08 5:05 AM, "Steve Loughran" <st...@apache.org> wrote:
> I have a question for users: how do they ensure their client apps have
> configuration XML files that are kept up to date?

    We control the client as well as the servers, so it all gets pushed at
once. :)

    That said, we're starting to allow clients that aren't controlled by us
to talk to our grids.  We'll likely re-bundle our configs into digestible
packages for them at some point and then have flag days.

> I'm thinking of looking at what it would take for a job submitter to ask
> the tracker for its config data, to get things like the various
> directory bases from the cluster, instead of being compiled into the
> client. Then the management problem becomes one of keeping the cluster
> configuration under control, which is a much easier proposition.

    But I like this idea a lot.  The tricky part comes when clients really
do need to modify something (# of mappers, heap size, whatever).


Re: How do people keep their client configurations in sync with the remote cluster(s)

Posted by brainstorm <br...@gmail.com>.
In a more "sysadminish" sense... has anyone tried out Rocks [1] for
Hadoop cluster deployment/management? I'm about to start with it...

[1] http://www.rocksclusters.org

On Tue, May 20, 2008 at 5:23 AM, Alejandro Abdelnur <tu...@gmail.com> wrote:
> That would be an option too.
>
> On Mon, May 19, 2008 at 10:26 PM, Ted Dunning <td...@veoh.com> wrote:
>>
>> I think it would be better to have the client retrieve the default
>> configuration.  Not all configuration settings are simple overrides.   Some
>> are read-modify-write operations.
>>
>> This also fits the current code better.
>>
>>
>> On 5/19/08 6:38 AM, "Steve Loughran" <st...@apache.org> wrote:
>>
>>> Alejandro Abdelnur wrote:
>>>> A while ago I opened an issue related to this topic:
>>>>
>>>>   https://issues.apache.org/jira/browse/HADOOP-3287
>>>>
>>>> My take is a little different: when submitting a job, the clients
>>>> should only send to the jobtracker the configuration they explicitly
>>>> set; the job tracker would then apply the defaults for all the other
>>>> configuration.
>>>>
>>>> By doing this the cluster admin can modify things at any time, and
>>>> changes to default values take effect for all clients without having
>>>> to distribute a new configuration to all clients.
>>>>
>>>> IMO, this approach was the intended behavior at some point, according
>>>> to the Configuration.write(OutputStream) javadoc: 'Writes non-default
>>>> properties in this configuration.' But as the write method writes
>>>> default properties too, this is not happening.
>>>
>>> I'll keep an eye on that issue. I think a key problem right now is that
>>> clients take their config from the configuration file in the core jar,
>>> and from their own settings. You need to keep the settings in sync
>>> somehow, and have to take what the core jar provides.
>>>
>>>
>>>> This approach would also get rid of the separate mechanism (zookeeper,
>>>> svn, etc) to keep clients synchronized as there would be no need to do
>>>> so.
>>>
>>> zookeeper and similar are to keep the cluster alive; they shouldn't be
>>> needed for clients, which should only need some URL of a job tracker to
>>> talk to.
>>
>>
>

Re: How do people keep their client configurations in sync with the remote cluster(s)

Posted by Alejandro Abdelnur <tu...@gmail.com>.
That would be an option too.

On Mon, May 19, 2008 at 10:26 PM, Ted Dunning <td...@veoh.com> wrote:
>
> I think it would be better to have the client retrieve the default
> configuration.  Not all configuration settings are simple overrides.   Some
> are read-modify-write operations.
>
> This also fits the current code better.
>
>
> On 5/19/08 6:38 AM, "Steve Loughran" <st...@apache.org> wrote:
>
>> Alejandro Abdelnur wrote:
>>> A while ago I opened an issue related to this topic:
>>>
>>>   https://issues.apache.org/jira/browse/HADOOP-3287
>>>
>>> My take is a little different: when submitting a job, the clients
>>> should only send to the jobtracker the configuration they explicitly
>>> set; the job tracker would then apply the defaults for all the other
>>> configuration.
>>>
>>> By doing this the cluster admin can modify things at any time, and
>>> changes to default values take effect for all clients without having
>>> to distribute a new configuration to all clients.
>>>
>>> IMO, this approach was the intended behavior at some point, according
>>> to the Configuration.write(OutputStream) javadoc: 'Writes non-default
>>> properties in this configuration.' But as the write method writes
>>> default properties too, this is not happening.
>>
>> I'll keep an eye on that issue. I think a key problem right now is that
>> clients take their config from the configuration file in the core jar,
>> and from their own settings. You need to keep the settings in sync
>> somehow, and have to take what the core jar provides.
>>
>>
>>> This approach would also get rid of the separate mechanism (zookeeper,
>>> svn, etc) to keep clients synchronized as there would be no need to do
>>> so.
>>
>> zookeeper and similar are to keep the cluster alive; they shouldn't be
>> needed for clients, which should only need some URL of a job tracker to
>> talk to.
>
>

Re: How do people keep their client configurations in sync with the remote cluster(s)

Posted by Alejandro Abdelnur <tu...@gmail.com>.
> I'll keep an eye on that issue. I think a key problem right now is that
> clients take their config from the configuration file in the core jar, and
> from their own settings. You need to keep the settings in sync somehow, and
> have to take what the core jar provides.

Yes, exactly, that is the problem: there is no way to have default
values set by the cluster unless you redistribute a hadoop-site.xml to
all your clients.

Regarding HADOOP-3287, there is some resistance to such a fix, so if
you feel it makes sense, please comment on it.

Alejandro

Re: How do people keep their client configurations in sync with the remote cluster(s)

Posted by Ted Dunning <td...@veoh.com>.
I think it would be better to have the client retrieve the default
configuration.  Not all configuration settings are simple overrides.   Some
are read-modify-write operations.

This also fits the current code better.
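
For instance, extending a list-valued property is a read-modify-write:
the client has to see the current value before appending to it. A
sketch, with a made-up codec class:

  import org.apache.hadoop.conf.Configuration;

  public class ReadModifyWrite {
    public static void main(String[] args) {
      Configuration conf = new Configuration();
      // Read the current (default or site-supplied) codec list...
      String codecs = conf.get("io.compression.codecs", "");
      // ...modify it by appending our own codec, then write it back.
      conf.set("io.compression.codecs",
               codecs.length() == 0 ? "com.example.MyCodec"
                                    : codecs + ",com.example.MyCodec");
      System.out.println(conf.get("io.compression.codecs"));
    }
  }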


On 5/19/08 6:38 AM, "Steve Loughran" <st...@apache.org> wrote:

> Alejandro Abdelnur wrote:
>> A while ago I opened an issue related to this topic:
>>
>>   https://issues.apache.org/jira/browse/HADOOP-3287
>>
>> My take is a little different: when submitting a job, the clients
>> should only send to the jobtracker the configuration they explicitly
>> set; the job tracker would then apply the defaults for all the other
>> configuration.
>>
>> By doing this the cluster admin can modify things at any time, and
>> changes to default values take effect for all clients without having
>> to distribute a new configuration to all clients.
>>
>> IMO, this approach was the intended behavior at some point, according
>> to the Configuration.write(OutputStream) javadoc: 'Writes non-default
>> properties in this configuration.' But as the write method writes
>> default properties too, this is not happening.
> 
> I'll keep an eye on that issue. I think a key problem right now is that
> clients take their config from the configuration file in the core jar,
> and from their own settings. You need to keep the settings in sync
> somehow, and have to take what the core jar provides.
> 
> 
>> This approach would also get rid of the separate mechanism (zookeeper,
>> svn, etc) to keep clients synchronized as there would be no need to do
>> so.
> 
> zookeeper and similar are to keep the cluster alive; they shouldn't be
> needed for clients, which should only need some URL of a job tracker to
> talk to.


Re: How do people keep their client configurations in sync with the remote cluster(s)

Posted by Steve Loughran <st...@apache.org>.
Alejandro Abdelnur wrote:
> A while ago I opened an issue related to this topic:
> 
>   https://issues.apache.org/jira/browse/HADOOP-3287
> 
> My take is a little different: when submitting a job, the clients
> should only send to the jobtracker the configuration they explicitly
> set; the job tracker would then apply the defaults for all the other
> configuration.
> 
> By doing this the cluster admin can modify things at any time, and
> changes to default values take effect for all clients without having
> to distribute a new configuration to all clients.
> 
> IMO, this approach was the intended behavior at some point, according
> to the Configuration.write(OutputStream) javadoc: 'Writes non-default
> properties in this configuration.' But as the write method writes
> default properties too, this is not happening.

I'll keep an eye on that issue. I think a key problem right now is that
clients take their config from the configuration file in the core jar,
and from their own settings. You need to keep the settings in sync
somehow, and have to take what the core jar provides.


> This approach would also get rid of the separate mechanism (zookeeper,
> svn, etc) to keep clients synchronized as there would be no need to do
> so.

zookeeper and similar are to keep the cluster alive; they shouldn't be 
needed for clients, which should only need some URL of a job tracker to 
talk to.

Re: How do people keep their client configurations in sync with the remote cluster(s)

Posted by Alejandro Abdelnur <tu...@gmail.com>.
A while ago I opened an issue related to this topic:

  https://issues.apache.org/jira/browse/HADOOP-3287

My take is a little different: when submitting a job, the clients
should only send to the jobtracker the configuration they explicitly
set; the job tracker would then apply the defaults for all the other
configuration.

By doing this the cluster admin can modify things at any time, and
changes to default values take effect for all clients without having
to distribute a new configuration to all clients.

IMO, this approach was the intended behavior at some point, according
to the Configuration.write(OutputStream) javadoc: 'Writes non-default
properties in this configuration.' But as the write method writes
default properties too, this is not happening.

This approach would also get rid of the separate mechanism (zookeeper,
svn, etc) to keep clients synchronized as there would be no need to do
so.
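
A rough sketch of the idea, done client-side by diffing the job's
configuration against a freshly loaded one; this illustrates the intent
rather than the actual HADOOP-3287 patch, and assumes Configuration's
iterator over its key/value pairs:

  import java.util.HashMap;
  import java.util.Map;
  import org.apache.hadoop.conf.Configuration;

  public class ExplicitOnly {
    public static void main(String[] args) {
      Configuration defaults = new Configuration(); // whatever defaults this client sees
      Configuration job = new Configuration();
      job.set("mapred.reduce.tasks", "16");         // pretend the user set this

      Map<String, String> explicit = new HashMap<String, String>();
      for (Map.Entry<String, String> e : job) {
        String def = defaults.get(e.getKey());
        if (def == null || !def.equals(e.getValue())) {
          explicit.put(e.getKey(), e.getValue());   // keep only non-defaults
        }
      }
      // 'explicit' is all that would go over the wire; the jobtracker
      // would fill in its own defaults for everything else.
      System.out.println(explicit);
    }
  }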

Alejandro

On Fri, May 16, 2008 at 10:25 PM, Ted Dunning <td...@veoh.com> wrote:
>
> That is all that almost all of my arms-length clients need.  With 0.18,
> all clients should be able to ask for the default configuration if they
> have a root URL, which will make the amount of information needed for
> any and all clients very small.
>
>
> On 5/16/08 2:03 AM, "Steve Loughran" <st...@apache.org> wrote:
>
>> I agree. I think right now clients need a bit too much info about the
>> name node; its URL should be all they need, and presumably who to log in
>> as.
>
>

Re: How do people keep their client configurations in sync with the remote cluster(s)

Posted by Ted Dunning <td...@veoh.com>.
That is all that almost all of my arms-length clients need.  With 0.18,
all clients should be able to ask for the default configuration if they
have a root URL, which will make the amount of information needed for
any and all clients very small.
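
In that world the client carries just the two endpoints; a minimal
sketch (host names and ports made up):

  import org.apache.hadoop.conf.Configuration;

  public class MinimalClientConf {
    public static void main(String[] args) {
      Configuration conf = new Configuration();
      conf.set("fs.default.name", "hdfs://namenode.example.com:8020");
      conf.set("mapred.job.tracker", "jobtracker.example.com:8021");
      // Everything else gets pulled from the cluster once 0.18-style
      // default retrieval is in place.
      System.out.println(conf.get("fs.default.name"));
    }
  }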


On 5/16/08 2:03 AM, "Steve Loughran" <st...@apache.org> wrote:

> I agree. I think right now clients need a bit too much info about the
> name node; its URL should be all they need, and presumably who to log in
> as. 


Re: How do people keep their client configurations in sync with the remote cluster(s)

Posted by Ted Dunning <td...@veoh.com>.
I would think that this would cover about 80% of the newbie problem reports.

It would be especially good if it included ssh'ed commands run on the slaves
that look back at the namenode and job tracker.  Forward and reverse name
lookups are also important.

This is worth a Jira.


On 5/16/08 2:03 AM, "Steve Loughran" <st...@apache.org> wrote:

> For Hadoop, something that prints out the config and checks it for
> known problems, maybe even does nslookup() on the hosts, checks the
> ports are open, etc., would be nice, especially if it provides hints
> when things are screwed up.


Re: How do people keep their client configurations in sync with the remote cluster(s)

Posted by Steve Loughran <st...@apache.org>.
Ted Dunning wrote:
> I use several strategies:
> 
> A) avoid dependency on hadoop's configuration by using http access to files.
> I use this, for example, where we have a PHP or grails or oracle app that
> needs to read a data file or three from HDFS.
> 
> B) rsync early and often and lock down the config directory.
> 
> C) get a really good sysop who does (b) and shoots people who mess up
> 
> D) (we don't do this yet) establish a configuration repository using
> zookeeper or a webdav or a (horrors) NFS file system.  At the very least, I
> would like to be able to get namenode address and port.

I think in a single organisation you can get away with SVN management
of conf files, if you build everything in. With an HTTP-SVN bridge you
could always pull in the file over HTTP during startup.
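
For example (the bridge URL is made up), Configuration can pull a
resource straight over HTTP at startup:

  import java.net.URL;
  import org.apache.hadoop.conf.Configuration;

  public class HttpSiteConfig {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Pull hadoop-site.xml from the HTTP-SVN bridge instead of the classpath:
      conf.addResource(new URL("http://svn.example.com/ops/hadoop/hadoop-site.xml"));
      System.out.println("namenode: " + conf.get("fs.default.name"));
    }
  }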

> 
> Mostly, our apps are in the cluster and covered by b+c or very out of the
> cluster and covered by a.  Many of our apps are pure import or pure export.
> The import side really only needs to know where the namenode is and the pure
> export only really needs the http access.  That makes the configuration
> management task vastly easier.
> 
> Another serious (as in SERIOUS) problem is how you keep data-processing
> elements from a QA or staging data chain from inserting bogus data into the
> production data chain, but still have them work in production with minimal
> reconfiguration on final deploy. 

That's a problem on all projects. One team I was on once had a classic
disaster where a test cluster bonded to the production MSSQL database
(remember, Windows likes flat naming: \\database-3 style names) and
caused plenty of damage. Still, it's good to test that your backup
strategy works, at least in hindsight. It never seems so good at the
time, though.

> We don't have a particularly good solution
> for that yet, but are planning on using zookeeper host based permissions to
> good effect there.  That should let us have data mirrors that shadow the
> production data feed system so that staged systems can process live data,
> but be unable to insert it back into the production setting.  The mirror
> will have read-only access to the feed meta-data and the staging machines
> will have no access to the production feed meta-data and these limitations
> will be imposed by a single configuration on the zookeeper rather than on
> each machine.  This should allow us to keep it cleaner than these things
> normally wind up.
> 
> But the short answer is that this is a hard problem to get really, really
> right.  

I agree. I think right now clients need a bit too much info about the
name node; its URL should be all they need, and presumably who to log in
as. In a local cluster you can use discovery services to get a list of
machines offering the service too, though it's that kind of automatic
binding that leads to the erased-database problem I've hit before.

One thing that might be useful over time would be more client-side
diagnostics. If you type ant -diagnostics (or run <diagnostics>), Ant
runs through a set of health checks for things that have caused
problems in the past:
  -mixed-up JAR versions
  -tmp dir unwriteable
  -tmp dir on a filesystem with a different clock/TZ from the local machine
  -bad proxy settings
  -wrong XML parser

Some you can test, some it just prints. What it prints out is enough for 
a support email/bugrep.

For Hadoop, something that prints out the config and checks it for
known problems, maybe even does nslookup() on the hosts, checks the
ports are open, etc., would be nice, especially if it provides hints
when things are screwed up.
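
A sketch of the kind of checks such a tool could make with plain JDK
calls (host names and ports made up):

  import java.net.InetAddress;
  import java.net.InetSocketAddress;
  import java.net.Socket;

  public class ClusterDiagnostics {
    public static void main(String[] args) {
      check("namenode.example.com", 8020);
      check("jobtracker.example.com", 8021);
    }

    static void check(String host, int port) {
      try {
        InetAddress addr = InetAddress.getByName(host);     // forward lookup
        String reverse = addr.getCanonicalHostName();       // reverse lookup
        System.out.println(host + " -> " + addr.getHostAddress() + " -> " + reverse);
        Socket s = new Socket();
        s.connect(new InetSocketAddress(addr, port), 5000); // is the port open?
        s.close();
        System.out.println(host + ":" + port + " reachable");
      } catch (Exception e) {
        System.out.println("PROBLEM with " + host + ":" + port + ": " + e);
      }
    }
  }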

-Steve

-- 
Steve Loughran                  http://www.1060.org/blogxter/publish/5
Author: Ant in Action           http://antbook.org/

Re: How do people keep their client configurations in sync with the remote cluster(s)

Posted by Ted Dunning <td...@veoh.com>.
I use several strategies:

A) avoid dependency on Hadoop's configuration by using HTTP access to
files. I use this, for example, where we have a PHP or Grails or Oracle
app that needs to read a data file or three from HDFS (see the sketch
after this list).

B) rsync early and often and lock down the config directory.

C) get a really good sysop who does (b) and shoots people who mess up

D) (we don't do this yet) establish a configuration repository using
ZooKeeper, WebDAV, or a (horrors) NFS file system.  At the very least,
I would like to be able to get the namenode address and port.
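
For strategy A, the consuming side is just an HTTP GET against whatever
front-end exposes the files; the gateway URL below is made up:

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.net.URL;

  public class HttpHdfsRead {
    public static void main(String[] args) throws Exception {
      // Hypothetical HTTP front-end to HDFS; no hadoop-site.xml needed at all.
      URL data = new URL("http://hdfs-gateway.example.com/data/feeds/part-00000");
      BufferedReader in = new BufferedReader(new InputStreamReader(data.openStream()));
      for (String line = in.readLine(); line != null; line = in.readLine()) {
        System.out.println(line);
      }
      in.close();
    }
  }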

Mostly, our apps are in the cluster and covered by b+c or very out of the
cluster and covered by a.  Many of our apps are pure import or pure export.
The import side really only needs to know where the namenode is and the pure
export only really needs the http access.  That makes the configuration
management task vastly easier.

Another serious (as in SERIOUS) problem is how you keep data-processing
elements from a QA or staging data chain from inserting bogus data into the
production data chain, but still have them work in production with minimal
reconfiguration on final deploy.  We don't have a particularly good solution
for that yet, but are planning on using zookeeper host based permissions to
good effect there.  That should let us have data mirrors that shadow the
production data feed system so that staged systems can process live data,
but be unable to insert it back into the production setting.  The mirror
will have read-only access to the feed meta-data and the staging machines
will have no access to the production feed meta-data and these limitations
will be imposed by a single configuration on the zookeeper rather than on
each machine.  This should allow us to keep it cleaner than these things
normally wind up.

But the short answer is that this is a hard problem to get really, really
right.  


On 5/15/08 5:05 AM, "Steve Loughran" <st...@apache.org> wrote:

> 
> I have a question for users: how do they ensure their client apps have
> configuration XML files that are kept up to date?
> 
> I know how I do it to date (get the site config off the site team, have
> my private copy in SVN), but that is too brittle, and diagnosing
> failures is pretty tricky. All you get is "Failed to Submit Job!"
> exceptions and local stack traces, from which you have to work backwards
> to the cause.
> 
> I'm thinking of looking at what it would take for a job submitter to ask
> the tracker for its config data, to get things like the various
> directory bases from the cluster, instead of being compiled into the
> client. Then the management problem becomes one of keeping the cluster
> configuration under control, which is a much easier proposition.
> 
> what do people do right now?
> 
> -steve