Posted to user@whirr.apache.org by pr...@nokia.com on 2011/01/27 21:21:53 UTC

Running Mapred jobs after launching cluster

Hello all,
I wrote a Java class HadoopLanucher that is very similar to HadoopServiceController. I was successfully able to launch a cluster programmatically from my application using Whirr. Now I want to copy files to HDFS and also run a job programmatically.

When I copy a file to HDFS it is copied to the local file system, not HDFS. Here is the code I used:

Configuration conf = new Configuration();
FileSystem hdfs = FileSystem.get(conf);
hdfs.copyFromLocalFile(false, true, new Path(localFilePath), new Path(hdfsFileDirectory));

Do I need to add anything else to the classpath so the Hadoop libraries know they need to talk to the dynamically launched cluster? When running Whirr from the command line I know it uses HADOOP_CONF_DIR to find the Hadoop config files, but I am wondering how to solve this when doing the same from Java.

Praveen



Re: Running Mapred jobs after launching cluster

Posted by pr...@nokia.com.
I think that's what I did, unless you see something wrong in the steps mentioned below.

Praveen

On Jan 31, 2011, at 6:15 PM, "ext Andrei Savu" <sa...@gmail.com> wrote:

> You can overwrite the properties in the scripts and that should be enough.
> 
> You need to place the scripts on a public web server, accessible from
> your AWS instances.
> 
> On Tue, Feb 1, 2011 at 1:03 AM,  <pr...@nokia.com> wrote:
>> If I anyway have to upload the files to webservers, do I still need the patch then? It looks like the script has these properties that I can overwrite.
>> 
>> BTW I tried with webserver path and I could not make it work so far.
>> 
>> 1. I copied scripts/apache folder to my /var/www folder and modified below 3 properties in /var/www/apache/hadoop/post-configure.
>> 2. I changed hadoop.properties added following line
>>       run-url-base=http://localhost/
>> 3. Launched the cluster and verified the job properties are not what I changed to. They are all defaults.
>> 
>> How do I debug this issue?
>> 
>> Praveen
>> 
>> 
>> Launched the cluster and I didn't see child jvm have 2G alloc
>> -----Original Message-----
>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>> Sent: Monday, January 31, 2011 3:02 PM
>> To: whirr-user@incubator.apache.org
>> Subject: Re: Running Mapred jobs after launching cluster
>> 
>> Hi Praveen,
>> 
>> I think removing the webserver dependency (or making it optional) would be a good goal, but we're not there yet. I've just created
>> https://issues.apache.org/jira/browse/WHIRR-225 as a place to discuss the design and implementation.
>> 
>> In the meantime you could take a look at https://issues.apache.org/jira/browse/WHIRR-55, and try using the patch there to override some Hadoop properties (you will need to upload the scripts to a webserver still however, until it is committed, since it modifies Hadoop's post-configure script).
>> 
>> Hope this helps.
>> 
>> Cheers,
>> Tom
>> 
>> BTW what are the security concerns you have? There are no credentials embedded in the scripts, so it should be safe to host them publicly, no?
>> 
>> On Mon, Jan 31, 2011 at 11:00 AM,  <pr...@nokia.com> wrote:
>>> Hi Tom,
>>> If the hadoop install is fixed, Whirr must be getting all default hadoop properties from the hadoop install itself, correct? I sent an email about configuring hadoop properties and you mentioned I need to put the modified scripts on a webserver that is publicly accessible. I was wondering if there is place inside hadoop install I can change so that I don't need to put the scripts on webserver (for security reasons). Do you think it is possible? If so, how? I do not mind customizing the jar file for our purposes. I want to change the following properties:
>>> 
>>> mapred.reduce.tasks=24
>>> mapred.map.tasks=64
>>> mapred.child.java.opts=-Xmx2048m
>>> 
>>> Thanks in advance.
>>> Praveen
>>> 
>>> -----Original Message-----
>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>> Sent: Friday, January 28, 2011 4:02 PM
>>> To: whirr-user@incubator.apache.org
>>> Subject: Re: Running Mapred jobs after launching cluster
>>> 
>>> It is fixed, and currently on 0.20.2. It will be made configurable in https://issues.apache.org/jira/browse/WHIRR-222.
>>> 
>>> Cheers
>>> Tom
>>> 
>>> On Fri, Jan 28, 2011 at 12:56 PM,  <pr...@nokia.com> wrote:
>>>> Hi Tom,
>>>> So the hadoop version is not going to change for a given Whirr install? I thought Whirr is getting hadoop install dynamically from a URL which is always going to have the latest hadoop version. If that is not the case I guess I am fine. I just don't want to get hadoop version mismatch 6 months after our software is released just because new hadoop version got released.
>>>> 
>>>> Thanks
>>>> Praveen
>>>> 
>>>> -----Original Message-----
>>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>>> Sent: Friday, January 28, 2011 3:35 PM
>>>> To: whirr-user@incubator.apache.org
>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>> 
>>>> On Fri, Jan 28, 2011 at 12:06 PM,  <pr...@nokia.com> wrote:
>>>>> Thanks Tom. I think I got it working with my own driver so I will go with it for now (unless that proves to be a bad option).
>>>>> 
>>>>> BTW, could you tell me how to stick with one hadoop version while launching a cluster? I have hadoop-0.20.2 in my classpath but it looks like Whirr gets the latest hadoop from the repository. Since the latest version may change over time, I would like to stick to one version so that a hadoop version mismatch won't happen.
>>>> 
>>>> You do need to make sure that the versions are the same. See the Hadoop integration tests, which specify the version of Hadoop to use in their POM.
>>>> 
>>>>> 
>>>>> Also, what jar files are necessary for launching a cluster using Java? Currently I have the cli version of the jar file, but that's way too large since it has everything in it.
>>>> 
>>>> You need Whirr's core and Hadoop jars, as well as their dependencies.
>>>> If you look at the POMs in the source code they will tell you the dependencies.
>>>> 
>>>> Cheers
>>>> Tom
>>>> 
>>>>> 
>>>>> Thanks
>>>>> Praveen
>>>>> 
>>>>> -----Original Message-----
>>>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>>>> Sent: Friday, January 28, 2011 2:12 PM
>>>>> To: whirr-user@incubator.apache.org
>>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>> 
>>>>> On Fri, Jan 28, 2011 at 6:28 AM,  <pr...@nokia.com> wrote:
>>>>>> Thanks Tom. Could you elaborate a little more on the second option?
>>>>>> 
>>>>>> What is the HADOOP_CONF_DIR here, after launching the cluster?
>>>>> 
>>>>> ~/.whirr/<cluster-name>
>>>>> 
>>>>>> When you said run in new process, did you mean using command line Whirr tool?
>>>>> 
>>>>> I meant that you could launch Whirr using the CLI, or Java. Then run the job in another process, with HADOOP_CONF_DIR set.
>>>>> 
>>>>> The MR jobs you are running I assume can be run against an arbitrary cluster, so you should be able to point them at a cluster started by Whirr.
>>>>> 
>>>>> Tom
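
A minimal sketch of that second option: run the job in a separate process with HADOOP_CONF_DIR pointing at the configuration Whirr writes under ~/.whirr/<cluster-name>. The cluster name, jar path, driver class and paths below are placeholders.

import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;

public class LaunchJobInNewProcess {
  public static void main(String[] args) throws Exception {
    // Placeholder values -- substitute your own cluster name, jar and driver class.
    File confDir = new File(System.getProperty("user.home"), ".whirr/myhadoopcluster");

    ProcessBuilder pb = new ProcessBuilder(
        "hadoop", "jar", "/path/to/my-job.jar", "com.example.MyJobDriver",
        "input", "output");
    pb.environment().put("HADOOP_CONF_DIR", confDir.getAbsolutePath());
    pb.redirectErrorStream(true);

    Process p = pb.start();
    BufferedReader reader = new BufferedReader(new InputStreamReader(p.getInputStream()));
    String line;
    while ((line = reader.readLine()) != null) {
      System.out.println(line);   // relay the job's console output
    }
    System.out.println("exit code: " + p.waitFor());
  }
}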
>>>>> 
>>>>>> 
>>>>>> I may finally end up writing my own driver for running external mapred jobs so I can have more control but I was just curious to know if option #2 is better than writing my own driver.
>>>>>> 
>>>>>> Praveen
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: ext Tom White [mailto:tom@cloudera.com]
>>>>>> Sent: Thursday, January 27, 2011 4:01 PM
>>>>>> To: whirr-user@incubator.apache.org
>>>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>>> 
>>>>>> If they implement the Tool interface then you can set configuration on them. Failing that you could set HADOOP_CONF_DIR and run them in a new process.
>>>>>> 
>>>>>> Cheers,
>>>>>> Tom
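
A minimal sketch of the Tool route; MyJobTool is a placeholder standing in for any driver that implements Tool (your own, or an external one that already does), and the Configuration passed to ToolRunner would be the one that points at the Whirr cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class RunToolExample {

  // Placeholder driver; a real one would build and submit a MapReduce job in run().
  public static class MyJobTool extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
      // use getConf() here when setting up the job
      return 0;
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // conf should already carry the cluster settings (fs.default.name, mapred.job.tracker, ...)
    int exitCode = ToolRunner.run(conf, new MyJobTool(), args);
    System.exit(exitCode);
  }
}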
>>>>>> 
>>>>>> On Thu, Jan 27, 2011 at 12:52 PM,  <pr...@nokia.com> wrote:
>>>>>>> Hmm...
>>>>>>> I am running some of the map reduce jobs written by me but some of them are in external libraries (eg. Mahout) which I don't have control over. Since I can't modify the code in external libraries, is there any other way to make this work?
>>>>>>> 
>>>>>>> Praveen
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>>>>>> Sent: Thursday, January 27, 2011 3:42 PM
>>>>>>> To: whirr-user@incubator.apache.org
>>>>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>>>> 
>>>>>>> You don't need to add anything to the classpath, but you need to use the configuration in the org.apache.whirr.service.Cluster object to populate your Hadoop Configuration object so that your code knows which cluster to connect to. See the getConfiguration() method in HadoopServiceController for how to do this.
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> Tom
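
A minimal sketch of that approach with the old 0.20 property names; the master hostname and ports are placeholders, and in real code they would be taken from the org.apache.whirr.service.Cluster instance (as the getConfiguration() method mentioned above does) rather than hard-coded.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToClusterHdfs {
  public static void main(String[] args) throws Exception {
    // Placeholder master address -- read it from the Cluster object in real code.
    String master = "ec2-1-2-3-4.compute-1.amazonaws.com";

    Configuration conf = new Configuration();
    // Without these, a plain new Configuration() falls back to the local file system.
    conf.set("fs.default.name", "hdfs://" + master + ":8020/");
    conf.set("mapred.job.tracker", master + ":8021");

    FileSystem fs = FileSystem.get(conf);
    fs.copyFromLocalFile(false, true,
        new Path("/tmp/localfile.txt"), new Path("/user/hadoop/input/"));
  }
}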
>>>>>>> 
>>>>>>> On Thu, Jan 27, 2011 at 12:21 PM,  <pr...@nokia.com> wrote:
>>>>>>>> Hello all,
>>>>>>>> I wrote a Java class HadoopLanucher that is very similar to
>>>>>>>> HadoopServiceController. I was successfully able to launch a
>>>>>>>> cluster programmatically from my application using Whirr. Now I
>>>>>>>> want to copy files to HDFS and also run a job programmatically.
>>>>>>>> 
>>>>>>>> When I copy a file to HDFS it is copied to the local file system, not HDFS.
>>>>>>>> Here is the code I used:
>>>>>>>> 
>>>>>>>> Configuration conf = new Configuration(); FileSystem hdfs =
>>>>>>>> FileSystem.get(conf); hdfs.copyFromLocalFile(false, true, new
>>>>>>>> Path(localFilePath), new Path(hdfsFileDirectory));
>>>>>>>> 
>>>>>>>> Do I need to add anything else to the classpath so Hadoop
>>>>>>>> libraries know they need to talk to the dynamically launched cluster?
>>>>>>>> When running Whirr from command line I know it uses
>>>>>>>> HADOOP_CONF_DIR to find the hadoop config files but when doing
>>>>>>>> the same from Java I am wondering how to solve this issue.
>>>>>>>> 
>>>>>>>> Praveen
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 
> 
> 
> --
> Andrei Savu -- andreisavu.ro

Re: Running Mapred jobs after launching cluster

Posted by Andrei Savu <sa...@gmail.com>.
You can overwrite the properties in the scripts and that should be enough.

You need to place the scripts on a public web server, accessible from
your AWS instances.

On Tue, Feb 1, 2011 at 1:03 AM,  <pr...@nokia.com> wrote:
> If I anyway have to upload the files to webservers, do I still need the patch then? It looks like the script has these properties that I can overwrite.
>
> BTW I tried with webserver path and I could not make it work so far.
>
> 1. I copied scripts/apache folder to my /var/www folder and modified below 3 properties in /var/www/apache/hadoop/post-configure.
> 2. I changed hadoop.properties added following line
>        run-url-base=http://localhost/
> 3. Launched the cluster and verified the job properties are not what I changed to. They are all defaults.
>
> How do I debug this issue?
>
> Praveen
>
>
> Launched the cluster and I didn't see child jvm have 2G alloc
> -----Original Message-----
> From: ext Tom White [mailto:tom.e.white@gmail.com]
> Sent: Monday, January 31, 2011 3:02 PM
> To: whirr-user@incubator.apache.org
> Subject: Re: Running Mapred jobs after launching cluster
>
> Hi Praveen,
>
> I think removing the webserver dependency (or making it optional) would be a good goal, but we're not there yet. I've just created
> https://issues.apache.org/jira/browse/WHIRR-225 as a place to discuss the design and implementation.
>
> In the meantime you could take a look at https://issues.apache.org/jira/browse/WHIRR-55, and try using the patch there to override some Hadoop properties (you will need to upload the scripts to a webserver still however, until it is committed, since it modifies Hadoop's post-configure script).
>
> Hope this helps.
>
> Cheers,
> Tom
>
> BTW what are the security concerns you have? There are no credentials embedded in the scripts, so it should be safe to host them publicly, no?
>
> On Mon, Jan 31, 2011 at 11:00 AM,  <pr...@nokia.com> wrote:
>> Hi Tom,
>> If the hadoop install is fixed, Whirr must be getting all default hadoop properties from the hadoop install itself, correct? I sent an email about configuring hadoop properties and you mentioned I need to put the modified scripts on a webserver that is publicly accessible. I was wondering if there is place inside hadoop install I can change so that I don't need to put the scripts on webserver (for security reasons). Do you think it is possible? If so, how? I do not mind customizing the jar file for our purposes. I want to change the following properties:
>>
>> mapred.reduce.tasks=24
>> mapred.map.tasks=64
>> mapred.child.java.opts=-Xmx2048m
>>
>> Thanks in advance.
>> Praveen
>>
>> -----Original Message-----
>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>> Sent: Friday, January 28, 2011 4:02 PM
>> To: whirr-user@incubator.apache.org
>> Subject: Re: Running Mapred jobs after launching cluster
>>
>> It is fixed, and currently on 0.20.2. It will be made configurable in https://issues.apache.org/jira/browse/WHIRR-222.
>>
>> Cheers
>> Tom
>>
>> On Fri, Jan 28, 2011 at 12:56 PM,  <pr...@nokia.com> wrote:
>>> Hi Tom,
>>> So the hadoop version is not going to change for a given Whirr install? I thought Whirr is getting hadoop install dynamically from a URL which is always going to have the latest hadoop version. If that is not the case I guess I am fine. I just don't want to get hadoop version mismatch 6 months after our software is released just because new hadoop version got released.
>>>
>>> Thanks
>>> Praveen
>>>
>>> -----Original Message-----
>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>> Sent: Friday, January 28, 2011 3:35 PM
>>> To: whirr-user@incubator.apache.org
>>> Subject: Re: Running Mapred jobs after launching cluster
>>>
>>> On Fri, Jan 28, 2011 at 12:06 PM,  <pr...@nokia.com> wrote:
>>>> Thanks Tom. I think I got it working with my own driver so I will go with it for now (unless that proves to be a bad option).
>>>>
>>>> BTW, could you tell me how to stick with one hadoop version while launching a cluster? I have hadoop-0.20.2 in my classpath but it looks like Whirr gets the latest hadoop from the repository. Since the latest version may change over time, I would like to stick to one version so that a hadoop version mismatch won't happen.
>>>
>>> You do need to make sure that the versions are the same. See the Hadoop integration tests, which specify the version of Hadoop to use in their POM.
>>>
>>>>
>>>> Also, what jar files are necessary for launching a cluster using Java? Currently I have the cli version of the jar file, but that's way too large since it has everything in it.
>>>
>>> You need Whirr's core and Hadoop jars, as well as their dependencies.
>>> If you look at the POMs in the source code they will tell you the dependencies.
>>>
>>> Cheers
>>> Tom
>>>
>>>>
>>>> Thanks
>>>> Praveen
>>>>
>>>> -----Original Message-----
>>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>>> Sent: Friday, January 28, 2011 2:12 PM
>>>> To: whirr-user@incubator.apache.org
>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>
>>>> On Fri, Jan 28, 2011 at 6:28 AM,  <pr...@nokia.com> wrote:
>>>>> Thanks Tom. Could you elaborate a little more on the second option?
>>>>>
>>>>> What is the HADOOP_CONF_DIR here, after launching the cluster?
>>>>
>>>> ~/.whirr/<cluster-name>
>>>>
>>>>> When you said run in new process, did you mean using command line Whirr tool?
>>>>
>>>> I meant that you could launch Whirr using the CLI, or Java. Then run the job in another process, with HADOOP_CONF_DIR set.
>>>>
>>>> The MR jobs you are running I assume can be run against an arbitrary cluster, so you should be able to point them at a cluster started by Whirr.
>>>>
>>>> Tom
>>>>
>>>>>
>>>>> I may finally end up writing my own driver for running external mapred jobs so I can have more control but I was just curious to know if option #2 is better than writing my own driver.
>>>>>
>>>>> Praveen
>>>>>
>>>>> -----Original Message-----
>>>>> From: ext Tom White [mailto:tom@cloudera.com]
>>>>> Sent: Thursday, January 27, 2011 4:01 PM
>>>>> To: whirr-user@incubator.apache.org
>>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>>
>>>>> If they implement the Tool interface then you can set configuration on them. Failing that you could set HADOOP_CONF_DIR and run them in a new process.
>>>>>
>>>>> Cheers,
>>>>> Tom
>>>>>
>>>>> On Thu, Jan 27, 2011 at 12:52 PM,  <pr...@nokia.com> wrote:
>>>>>> Hmm...
>>>>>> I am running some of the map reduce jobs written by me but some of them are in external libraries (eg. Mahout) which I don't have control over. Since I can't modify the code in external libraries, is there any other way to make this work?
>>>>>>
>>>>>> Praveen
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>>>>> Sent: Thursday, January 27, 2011 3:42 PM
>>>>>> To: whirr-user@incubator.apache.org
>>>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>>>
>>>>>> You don't need to add anything to the classpath, but you need to use the configuration in the org.apache.whirr.service.Cluster object to populate your Hadoop Configuration object so that your code knows which cluster to connect to. See the getConfiguration() method in HadoopServiceController for how to do this.
>>>>>>
>>>>>> Cheers,
>>>>>> Tom
>>>>>>
>>>>>> On Thu, Jan 27, 2011 at 12:21 PM,  <pr...@nokia.com> wrote:
>>>>>>> Hello all,
>>>>>>> I wrote a Java class HadoopLanucher that is very similar to
>>>>>>> HadoopServiceController. I was successfully able to launch a
>>>>>>> cluster programmatically from my application using Whirr. Now I
>>>>>>> want to copy files to HDFS and also run a job programmatically.
>>>>>>>
>>>>>>> When I copy a file to HDFS it is copied to the local file system, not HDFS.
>>>>>>> Here is the code I used:
>>>>>>>
>>>>>>> Configuration conf = new Configuration(); FileSystem hdfs =
>>>>>>> FileSystem.get(conf); hdfs.copyFromLocalFile(false, true, new
>>>>>>> Path(localFilePath), new Path(hdfsFileDirectory));
>>>>>>>
>>>>>>> Do I need to add anything else to the classpath so Hadoop
>>>>>>> libraries know they need to talk to the dynamically launched cluster?
>>>>>>> When running Whirr from command line I know it uses
>>>>>>> HADOOP_CONF_DIR to find the hadoop config files but when doing
>>>>>>> the same from Java I am wondering how to solve this issue.
>>>>>>>
>>>>>>> Praveen
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>



-- 
Andrei Savu -- andreisavu.ro

RE: Running Mapred jobs after launching cluster

Posted by pr...@nokia.com.
Hi Andrei,
I checked the /usr/local/hadoop-0.20.2/conf directory on the master node and all the *site.xml files are empty. I know some of the properties I changed in the post_configure script are taking effect, but I wonder where that information is stored. Even the master and slaves files are pointing to just localhost. It almost looks like this is not the conf directory.

Praveen

________________________________
From: ext praveen.peddi@nokia.com [mailto:praveen.peddi@nokia.com]
Sent: Thursday, February 03, 2011 1:07 PM
To: whirr-user@incubator.apache.org
Cc: whirr-user@incubator.apache.org
Subject: Re: Running Mapred jobs after launching cluster

Yes, I modified the post_configure script. I modified 3 different properties and saw that the other two were overridden, but not the mapred.child.java.opts property.

Praveen

On Feb 3, 2011, at 12:33 PM, ext Andrei Savu <sa...@gmail.com> wrote:

Here are the relevant lines from the install scripts:

HADOOP_VERSION=${HADOOP_VERSION:-0.20.2}
HADOOP_HOME=/usr/local/hadoop-$HADOOP_VERSION
HADOOP_CONF_DIR=$HADOOP_HOME/conf

Have you tried changing CHILD_OPTS in apache/hadoop/post-configure and using that custom script to deploy a cluster?

I don't have a running cluster to check this now.

On Thu, Feb 3, 2011 at 7:18 PM, <m...@nokia.com> wrote:
Hi Tom/all
Where are the hadoop config files stored on the cluster nodes? I would like to debug this issue since I need to give more memory for child java mapred processes to process huge chunks of data.

Thanks
Praveen
-----Original Message-----
From: ext praveen.peddi@nokia.com [mailto:praveen.peddi@nokia.com]
Sent: Wednesday, February 02, 2011 5:23 PM
To: whirr-user@incubator.apache.org
Subject: RE: Running Mapred jobs after launching cluster

Can anyone think of a reason why the property below is not honoured when I overwrote it along with other properties in the post_configure script? The other properties are correctly overwritten except this one. I need to set the mapred task JVM heap to more than 200m.

Praveen
________________________________________
From: Peddi Praveen (Nokia-MS/Boston)
Sent: Tuesday, February 01, 2011 11:21 AM
To: whirr-user@incubator.apache.org
Subject: RE: Running Mapred jobs after launching cluster

Thanks Tom. Silly me I should have thought of the property name. It works now except one issue: I ran the wordcount example and I saw that no. of map and reduce tasks are as I configured in post_configure script but for some reason the below property in job.xml is always -Xmx200m and I set it to -Xmx1700m. Not sure if this property is any special.

mapred.child.java.opts  -Xmx200m

Praveen
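
For reference, mapred.child.java.opts is a job-level property, so it can also be overridden per job on the Configuration handed to the job driver, provided the cluster does not mark it final; a minimal sketch with placeholder values:

import org.apache.hadoop.conf.Configuration;

public class PerJobChildOpts {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // These go into the submitted job.xml and override the cluster defaults
    // for this job only, as long as the cluster does not declare them final.
    conf.set("mapred.child.java.opts", "-Xmx1700m");
    conf.setInt("mapred.reduce.tasks", 24);
    // pass conf to the job driver, e.g. ToolRunner.run(conf, tool, args)
  }
}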
________________________________________
From: ext Tom White [<ma...@gmail.com>]
Sent: Tuesday, February 01, 2011 12:13 AM
To: whirr-user@incubator.apache.org
Subject: Re: Running Mapred jobs after launching cluster

Try setting whirr.run-url-base, not run-url-base.

Tom

On Mon, Jan 31, 2011 at 5:33 PM,  <<m...@nokia.com>> wrote:
> I am not using cdh (for now anyway) but the default hadoop. I even changed "localhost" to the IP address and still no luck. It is likely that I am doing something wrong, but I am having a hard time debugging it.
> Here are the properties I changed in  /var/www/apache/hadoop/post-configure but when I run the job I am not seeing these values.
>  MAX_MAP_TASKS=16
>  MAX_REDUCE_TASKS=24
>  CHILD_OPTS=-Xmx1700m
>
> Here is what I see in  /tmp/runscript/runscript.sh of master node. It doesn't look like it used my scripts...
>
> installRunUrl || exit 1
> runurl http://whirr.s3.amazonaws.com/0.3.0-incubating-SNAPSHOT/util/configure-hostnames -c cloudservers
> runurl http://whirr.s3.amazonaws.com/0.3.0-incubating-SNAPSHOT/sun/java/install
> runurl http://whirr.s3.amazonaws.com/0.3.0-incubating-SNAPSHOT/apache/hadoop/install -c cloudservers
>
> Any suggestions?
> Praveen
> ________________________________________
> From: ext Tom White [<ma...@gmail.com>]
> Sent: Monday, January 31, 2011 6:23 PM
> To: <ma...@incubator.apache.org> whirr-user@incubator.apache.org<ma...@incubator.apache.org>
> Subject: Re: Running Mapred jobs after launching cluster
>
> On Mon, Jan 31, 2011 at 3:03 PM,  <<m...@nokia.com>> wrote:
>> If I anyway have to upload the files to webservers, do I still need the patch then? It looks like the script has these properties that I can overwrite.
>
> I suggested you look at the patch (WHIRR-55) so you can see how it
> will be possible once it's committed. To try it out you need to upload
> the scripts to a webserver (since the patch changes one of them).
>
>>
>> BTW I tried with webserver path and I could not make it work so far.
>>
>> 1. I copied scripts/apache folder to my /var/www folder and modified below 3 properties in /var/www/apache/hadoop/post-configure.
>> 2. I changed hadoop.properties added following line
>>        run-url-base=http://localhost/
>> 3. Launched the cluster and verified the job properties are not what I changed to. They are all defaults.
>
> This looks right to me. If you are using CDH you need to change
> cloudera/cdh/post-configure.
>
>>
>> How do I debug this issue?
>
> You can log into the instances (see the FAQ for how to do this) and
> look at the scripts that actually ran (and their output) in the /tmp
> directory.
>
>
> Tom
>
>>
>> Praveen
>>
>>
>> Launched the cluster and I didn't see child jvm have 2G alloc
>> -----Original Message-----
>> From: ext Tom White [mailto:<ma...@gmail.com>]
>> Sent: Monday, January 31, 2011 3:02 PM
>> To: <ma...@incubator.apache.org> whirr-user@incubator.apache.org<ma...@incubator.apache.org>
>> Subject: Re: Running Mapred jobs after launching cluster
>>
>> Hi Praveen,
>>
>> I think removing the webserver dependency (or making it optional)
>> would be a good goal, but we're not there yet. I've just created
>> https://issues.apache.org/jira/browse/WHIRR-225 as a place to discuss the design and implementation.
>>
>> In the meantime you could take a look at https://issues.apache.org/jira/browse/WHIRR-55, and try using the patch there to override some Hadoop properties (you will need to upload the scripts to a webserver still however, until it is committed, since it modifies Hadoop's post-configure script).
>>
>> Hope this helps.
>>
>> Cheers,
>> Tom
>>
>> BTW what are the security concerns you have? There are no credentials embedded in the scripts, so it should be safe to host them publicly, no?
>>
>> On Mon, Jan 31, 2011 at 11:00 AM,  <<m...@nokia.com>> wrote:
>>> Hi Tom,
>>> If the hadoop install is fixed, Whirr must be getting all default hadoop properties from the hadoop install itself, correct? I sent an email about configuring hadoop properties and you mentioned I need to put the modified scripts on a webserver that is publicly accessible. I was wondering if there is place inside hadoop install I can change so that I don't need to put the scripts on webserver (for security reasons). Do you think it is possible? If so, how? I do not mind customizing the jar file for our purposes. I want to change the following properties:
>>>
>>> mapred.reduce.tasks=24
>>> mapred.map.tasks=64
>>> mapred.child.java.opts=-Xmx2048m
>>>
>>> Thanks in advance.
>>> Praveen
>>>
>>> -----Original Message-----
>>> From: ext Tom White [mailto:<ma...@gmail.com>]
>>> Sent: Friday, January 28, 2011 4:02 PM
>>> To: <ma...@incubator.apache.org> whirr-user@incubator.apache.org<ma...@incubator.apache.org>
>>> Subject: Re: Running Mapred jobs after launching cluster
>>>
>>> It is fixed, and currently on 0.20.2. It will be made configurable in https://issues.apache.org/jira/browse/WHIRR-222.
>>>
>>> Cheers
>>> Tom
>>>
>>> On Fri, Jan 28, 2011 at 12:56 PM,  <<m...@nokia.com>> wrote:
>>>> Hi Tom,
>>>> So the hadoop version is not going to change for a given Whirr install? I thought Whirr is getting hadoop install dynamically from a URL which is always going to have the latest hadoop version. If that is not the case I guess I am fine. I just don't want to get hadoop version mismatch 6 months after our software is released just because new hadoop version got released.
>>>>
>>>> Thanks
>>>> Praveen
>>>>
>>>> -----Original Message-----
>>>> From: ext Tom White [mailto:<ma...@gmail.com>]
>>>> Sent: Friday, January 28, 2011 3:35 PM
>>>> To: <ma...@incubator.apache.org> whirr-user@incubator.apache.org<ma...@incubator.apache.org>
>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>
>>>> On Fri, Jan 28, 2011 at 12:06 PM,  <<m...@nokia.com>> wrote:
>>>>> Thanks Tom. I think I got it working with my own driver so I will go with it for now (unless that proves to be a bad option).
>>>>>
>>>>> BTW, could you tell me how to stick with one hadoop version while launching a cluster? I have hadoop-0.20.2 in my classpath but it looks like Whirr gets the latest hadoop from the repository. Since the latest version may change over time, I would like to stick to one version so that a hadoop version mismatch won't happen.
>>>>
>>>> You do need to make sure that the versions are the same. See the Hadoop integration tests, which specify the version of Hadoop to use in their POM.
>>>>
>>>>>
>>>>> Also, what jar files are necessary for launching a cluster using Java? Currently I have the cli version of the jar file, but that's way too large since it has everything in it.
>>>>
>>>> You need Whirr's core and Hadoop jars, as well as their dependencies.
>>>> If you look at the POMs in the source code they will tell you the dependencies.
>>>>
>>>> Cheers
>>>> Tom
>>>>
>>>>>
>>>>> Thanks
>>>>> Praveen
>>>>>
>>>>> -----Original Message-----
>>>>> From: ext Tom White [mailto:<ma...@gmail.com>]
>>>>> Sent: Friday, January 28, 2011 2:12 PM
>>>>> To: <ma...@incubator.apache.org> whirr-user@incubator.apache.org<ma...@incubator.apache.org>
>>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>>
>>>>> On Fri, Jan 28, 2011 at 6:28 AM,  <<m...@nokia.com>> wrote:
>>>>>> Thanks Tom. Could you elaborate a little more on the second option?
>>>>>>
>>>>>> What is the HADOOP_CONF_DIR here, after launching the cluster?
>>>>>
>>>>> ~/.whirr/<cluster-name>
>>>>>
>>>>>> When you said run in new process, did you mean using command line Whirr tool?
>>>>>
>>>>> I meant that you could launch Whirr using the CLI, or Java. Then run the job in another process, with HADOOP_CONF_DIR set.
>>>>>
>>>>> The MR jobs you are running I assume can be run against an arbitrary cluster, so you should be able to point them at a cluster started by Whirr.
>>>>>
>>>>> Tom
>>>>>
>>>>>>
>>>>>> I may finally end up writing my own driver for running external mapred jobs so I can have more control but I was just curious to know if option #2 is better than writing my own driver.
>>>>>>
>>>>>> Praveen
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: ext Tom White [mailto:<ma...@cloudera.com>]
>>>>>> Sent: Thursday, January 27, 2011 4:01 PM
>>>>>> To: <ma...@incubator.apache.org> whirr-user@incubator.apache.org<ma...@incubator.apache.org>
>>>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>>>
>>>>>> If they implement the Tool interface then you can set configuration on them. Failing that you could set HADOOP_CONF_DIR and run them in a new process.
>>>>>>
>>>>>> Cheers,
>>>>>> Tom
>>>>>>
>>>>>> On Thu, Jan 27, 2011 at 12:52 PM,  <<m...@nokia.com>> wrote:
>>>>>>> Hmm...
>>>>>>> I am running some of the map reduce jobs written by me but some of them are in external libraries (eg. Mahout) which I don't have control over. Since I can't modify the code in external libraries, is there any other way to make this work?
>>>>>>>
>>>>>>> Praveen
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: ext Tom White [mailto:<ma...@gmail.com>]
>>>>>>> Sent: Thursday, January 27, 2011 3:42 PM
>>>>>>> To: <ma...@incubator.apache.org> whirr-user@incubator.apache.org<ma...@incubator.apache.org>
>>>>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>>>>
>>>>>>> You don't need to add anything to the classpath, but you need to use the configuration in the org.apache.whirr.service.Cluster object to populate your Hadoop Configuration object so that your code knows which cluster to connect to. See the getConfiguration() method in HadoopServiceController for how to do this.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Tom
>>>>>>>
>>>>>>> On Thu, Jan 27, 2011 at 12:21 PM,  <<m...@nokia.com>> wrote:
>>>>>>>> Hello all,
>>>>>>>> I wrote a Java class HadoopLanucher that is very similar to
>>>>>>>> HadoopServiceController. I was successfully able to launch a
>>>>>>>> cluster programmatically from my application using Whirr. Now I
>>>>>>>> want to copy files to HDFS and also run a job programmatically.
>>>>>>>>
>>>>>>>> When I copy a file to HDFS it is copied to the local file system, not HDFS.
>>>>>>>> Here is the code I used:
>>>>>>>>
>>>>>>>> Configuration conf = new Configuration(); FileSystem hdfs =
>>>>>>>> FileSystem.get(conf); hdfs.copyFromLocalFile(false, true, new
>>>>>>>> Path(localFilePath), new Path(hdfsFileDirectory));
>>>>>>>>
>>>>>>>> Do I need to add anything else to the classpath so Hadoop
>>>>>>>> libraries know they need to talk to the dynamically launched cluster?
>>>>>>>> When running Whirr from command line I know it uses
>>>>>>>> HADOOP_CONF_DIR to find the hadoop config files but when doing
>>>>>>>> the same from Java I am wondering how to solve this issue.
>>>>>>>>
>>>>>>>> Praveen
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>



--
Andrei Savu -- andreisavu.ro


Re: Running Mapred jobs after launching cluster

Posted by pr...@nokia.com.
Yes, I modified the post_configure script. I modified 3 different properties and saw that the other two were overridden, but not the mapred.child.java.opts property.

Praveen

On Feb 3, 2011, at 12:33 PM, ext Andrei Savu <sa...@gmail.com> wrote:

Here are the relevant lines from the install scripts:

HADOOP_VERSION=${HADOOP_VERSION:-0.20.2}
HADOOP_HOME=/usr/local/hadoop-$HADOOP_VERSION
HADOOP_CONF_DIR=$HADOOP_HOME/conf

Have you tried changing CHILD_OPTS in apache/hadoop/post-configure and using that custom script to deploy a cluster?

I don't have a running cluster to check this now.

On Thu, Feb 3, 2011 at 7:18 PM, <m...@nokia.com> wrote:
Hi Tom/all
Where are the hadoop config files stored on the cluster nodes? I would like to debug this issue since I need to give more memory for child java mapred processes to process huge chunks of data.

Thanks
Praveen
-----Original Message-----
From: ext praveen.peddi@nokia.com [mailto:praveen.peddi@nokia.com]
Sent: Wednesday, February 02, 2011 5:23 PM
To: whirr-user@incubator.apache.org
Subject: RE: Running Mapred jobs after launching cluster

Can anyone think of a reason why the property below is not honoured when I overwrote it along with other properties in the post_configure script? The other properties are correctly overwritten except this one. I need to set the mapred task JVM heap to more than 200m.

Praveen
________________________________________
From: Peddi Praveen (Nokia-MS/Boston)
Sent: Tuesday, February 01, 2011 11:21 AM
To: whirr-user@incubator.apache.org
Subject: RE: Running Mapred jobs after launching cluster

Thanks Tom. Silly me I should have thought of the property name. It works now except one issue: I ran the wordcount example and I saw that no. of map and reduce tasks are as I configured in post_configure script but for some reason the below property in job.xml is always -Xmx200m and I set it to -Xmx1700m. Not sure if this property is any special.

mapred.child.java.opts  -Xmx200m

Praveen
________________________________________
From: ext Tom White [<ma...@gmail.com>]
Sent: Tuesday, February 01, 2011 12:13 AM
To: whirr-user@incubator.apache.org
Subject: Re: Running Mapred jobs after launching cluster

Try setting whirr.run-url-base, not run-url-base.

Tom

On Mon, Jan 31, 2011 at 5:33 PM,  <<m...@nokia.com>> wrote:
> I am not using cdh (for now anyway) but the default hadoop. I even changed "localhost" to the IP address and still no luck. It is likely that I am doing something wrong, but I am having a hard time debugging it.
> Here are the properties I changed in  /var/www/apache/hadoop/post-configure but when I run the job I am not seeing these values.
>  MAX_MAP_TASKS=16
>  MAX_REDUCE_TASKS=24
>  CHILD_OPTS=-Xmx1700m
>
> Here is what I see in  /tmp/runscript/runscript.sh of master node. It doesn't look like it used my scripts...
>
> installRunUrl || exit 1
> runurl http://whirr.s3.amazonaws.com/0.3.0-incubating-SNAPSHOT/util/configure-hostnames -c cloudservers
> runurl http://whirr.s3.amazonaws.com/0.3.0-incubating-SNAPSHOT/sun/java/install
> runurl http://whirr.s3.amazonaws.com/0.3.0-incubating-SNAPSHOT/apache/hadoop/install -c cloudservers
>
> Any suggestions?
> Praveen
> ________________________________________
> From: ext Tom White [<ma...@gmail.com>]
> Sent: Monday, January 31, 2011 6:23 PM
> To: <ma...@incubator.apache.org> whirr-user@incubator.apache.org<ma...@incubator.apache.org>
> Subject: Re: Running Mapred jobs after launching cluster
>
> On Mon, Jan 31, 2011 at 3:03 PM,  <<m...@nokia.com>> wrote:
>> If I anyway have to upload the files to webservers, do I still need the patch then? It looks like the script has these properties that I can overwrite.
>
> I suggested you look at the patch (WHIRR-55) so you can see how it
> will be possible once it's committed. To try it out you need to upload
> the scripts to a webserver (since the patch changes one of them).
>
>>
>> BTW I tried with webserver path and I could not make it work so far.
>>
>> 1. I copied scripts/apache folder to my /var/www folder and modified below 3 properties in /var/www/apache/hadoop/post-configure.
>> 2. I changed hadoop.properties added following line
>>        run-url-base=http://localhost/
>> 3. Launched the cluster and verified the job properties are not what I changed to. They are all defaults.
>
> This looks right to me. If you are using CDH you need to change
> cloudera/cdh/post-configure.
>
>>
>> How do I debug this issue?
>
> You can log into the instances (see the FAQ for how to do this) and
> look at the scripts that actually ran (and their output) in the /tmp
> directory.
>
>
> Tom
>
>>
>> Praveen
>>
>>
>> Launched the cluster and I didn't see child jvm have 2G alloc
>> -----Original Message-----
>> From: ext Tom White [mailto:<ma...@gmail.com>]
>> Sent: Monday, January 31, 2011 3:02 PM
>> To: <ma...@incubator.apache.org> whirr-user@incubator.apache.org<ma...@incubator.apache.org>
>> Subject: Re: Running Mapred jobs after launching cluster
>>
>> Hi Praveen,
>>
>> I think removing the webserver dependency (or making it optional)
>> would be a good goal, but we're not there yet. I've just created
>> https://issues.apache.org/jira/browse/WHIRR-225 as a place to discuss the design and implementation.
>>
>> In the meantime you could take a look at https://issues.apache.org/jira/browse/WHIRR-55, and try using the patch there to override some Hadoop properties (you will need to upload the scripts to a webserver still however, until it is committed, since it modifies Hadoop's post-configure script).
>>
>> Hope this helps.
>>
>> Cheers,
>> Tom
>>
>> BTW what are the security concerns you have? There are no credentials embedded in the scripts, so it should be safe to host them publicly, no?
>>
>> On Mon, Jan 31, 2011 at 11:00 AM,  <<m...@nokia.com>> wrote:
>>> Hi Tom,
>>> If the hadoop install is fixed, Whirr must be getting all default hadoop properties from the hadoop install itself, correct? I sent an email about configuring hadoop properties and you mentioned I need to put the modified scripts on a webserver that is publicly accessible. I was wondering if there is place inside hadoop install I can change so that I don't need to put the scripts on webserver (for security reasons). Do you think it is possible? If so, how? I do not mind customizing the jar file for our purposes. I want to change the following properties:
>>>
>>> mapred.reduce.tasks=24
>>> mapred.map.tasks=64
>>> mapred.child.java.opts=-Xmx2048m
>>>
>>> Thanks in advance.
>>> Praveen
>>>
>>> -----Original Message-----
>>> From: ext Tom White [mailto:<ma...@gmail.com>]
>>> Sent: Friday, January 28, 2011 4:02 PM
>>> To: <ma...@incubator.apache.org> whirr-user@incubator.apache.org<ma...@incubator.apache.org>
>>> Subject: Re: Running Mapred jobs after launching cluster
>>>
>>> It is fixed, and currently on 0.20.2. It will be made configurable in https://issues.apache.org/jira/browse/WHIRR-222.
>>>
>>> Cheers
>>> Tom
>>>
>>> On Fri, Jan 28, 2011 at 12:56 PM,  <<m...@nokia.com>> wrote:
>>>> Hi Tom,
>>>> So the hadoop version is not going to change for a given Whirr install? I thought Whirr is getting hadoop install dynamically from a URL which is always going to have the latest hadoop version. If that is not the case I guess I am fine. I just don't want to get hadoop version mismatch 6 months after our software is released just because new hadoop version got released.
>>>>
>>>> Thanks
>>>> Praveen
>>>>
>>>> -----Original Message-----
>>>> From: ext Tom White [mailto:<ma...@gmail.com>]
>>>> Sent: Friday, January 28, 2011 3:35 PM
>>>> To: <ma...@incubator.apache.org> whirr-user@incubator.apache.org<ma...@incubator.apache.org>
>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>
>>>> On Fri, Jan 28, 2011 at 12:06 PM,  <<m...@nokia.com>> wrote:
>>>>> Thanks Tom. I think I got it working with my own driver so I will go with it for now (unless that proves to be a bad option).
>>>>>
>>>>> BTW, could you tell me how to stick with one hadoop version while launching a cluster? I have hadoop-0.20.2 in my classpath but it looks like Whirr gets the latest hadoop from the repository. Since the latest version may change over time, I would like to stick to one version so that a hadoop version mismatch won't happen.
>>>>
>>>> You do need to make sure that the versions are the same. See the Hadoop integration tests, which specify the version of Hadoop to use in their POM.
>>>>
>>>>>
>>>>> Also, what jar files are necessary for launching a cluster using Java? Currently I have the cli version of the jar file, but that's way too large since it has everything in it.
>>>>
>>>> You need Whirr's core and Hadoop jars, as well as their dependencies.
>>>> If you look at the POMs in the source code they will tell you the dependencies.
>>>>
>>>> Cheers
>>>> Tom
>>>>
>>>>>
>>>>> Thanks
>>>>> Praveen
>>>>>
>>>>> -----Original Message-----
>>>>> From: ext Tom White [mailto:<ma...@gmail.com>]
>>>>> Sent: Friday, January 28, 2011 2:12 PM
>>>>> To: <ma...@incubator.apache.org> whirr-user@incubator.apache.org<ma...@incubator.apache.org>
>>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>>
>>>>> On Fri, Jan 28, 2011 at 6:28 AM,  <<m...@nokia.com>> wrote:
>>>>>> Thanks Tom. Could you elaborate a little more on the second option?
>>>>>>
>>>>>> What is the HADOOP_CONF_DIR here, after launching the cluster?
>>>>>
>>>>> ~/.whirr/<cluster-name>
>>>>>
>>>>>> When you said run in new process, did you mean using command line Whirr tool?
>>>>>
>>>>> I meant that you could launch Whirr using the CLI, or Java. Then run the job in another process, with HADOOP_CONF_DIR set.
>>>>>
>>>>> The MR jobs you are running I assume can be run against an arbitrary cluster, so you should be able to point them at a cluster started by Whirr.
>>>>>
>>>>> Tom
>>>>>
>>>>>>
>>>>>> I may finally end up writing my own driver for running external mapred jobs so I can have more control but I was just curious to know if option #2 is better than writing my own driver.
>>>>>>
>>>>>> Praveen
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: ext Tom White [mailto:<ma...@cloudera.com>]
>>>>>> Sent: Thursday, January 27, 2011 4:01 PM
>>>>>> To: <ma...@incubator.apache.org> whirr-user@incubator.apache.org<ma...@incubator.apache.org>
>>>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>>>
>>>>>> If they implement the Tool interface then you can set configuration on them. Failing that you could set HADOOP_CONF_DIR and run them in a new process.
>>>>>>
>>>>>> Cheers,
>>>>>> Tom
>>>>>>
>>>>>> On Thu, Jan 27, 2011 at 12:52 PM,  <<m...@nokia.com>> wrote:
>>>>>>> Hmm...
>>>>>>> I am running some of the map reduce jobs written by me but some of them are in external libraries (eg. Mahout) which I don't have control over. Since I can't modify the code in external libraries, is there any other way to make this work?
>>>>>>>
>>>>>>> Praveen
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: ext Tom White [mailto:<ma...@gmail.com>]
>>>>>>> Sent: Thursday, January 27, 2011 3:42 PM
>>>>>>> To: <ma...@incubator.apache.org> whirr-user@incubator.apache.org<ma...@incubator.apache.org>
>>>>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>>>>
>>>>>>> You don't need to add anything to the classpath, but you need to use the configuration in the org.apache.whirr.service.Cluster object to populate your Hadoop Configuration object so that your code knows which cluster to connect to. See the getConfiguration() method in HadoopServiceController for how to do this.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Tom
>>>>>>>
>>>>>>> On Thu, Jan 27, 2011 at 12:21 PM,  <<m...@nokia.com>> wrote:
>>>>>>>> Hello all,
>>>>>>>> I wrote a Java class HadoopLanucher that is very similar to
>>>>>>>> HadoopServiceController. I was successfully able to launch a
>>>>>>>> cluster programmatically from my application using Whirr. Now I
>>>>>>>> want to copy files to HDFS and also run a job programmatically.
>>>>>>>>
>>>>>>>> When I copy a file to HDFS it is copied to the local file system, not HDFS.
>>>>>>>> Here is the code I used:
>>>>>>>>
>>>>>>>> Configuration conf = new Configuration(); FileSystem hdfs =
>>>>>>>> FileSystem.get(conf); hdfs.copyFromLocalFile(false, true, new
>>>>>>>> Path(localFilePath), new Path(hdfsFileDirectory));
>>>>>>>>
>>>>>>>> Do I need to add anything else to the classpath so Hadoop
>>>>>>>> libraries know they need to talk to the dynamically launched cluster?
>>>>>>>> When running Whirr from command line I know it uses
>>>>>>>> HADOOP_CONF_DIR to find the hadoop config files but when doing
>>>>>>>> the same from Java I am wondering how to solve this issue.
>>>>>>>>
>>>>>>>> Praveen
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>



--
Andrei Savu -- andreisavu.ro


Re: Running Mapred jobs after launching cluster

Posted by Andrei Savu <sa...@gmail.com>.
Here are the relevant lines from the install scripts:

HADOOP_VERSION=${HADOOP_VERSION:-0.20.2}
HADOOP_HOME=/usr/local/hadoop-$HADOOP_VERSION
HADOOP_CONF_DIR=$HADOOP_HOME/conf

Have you tried changing CHILD_OPTS in apache/hadoop/post-configure and using
that custom script to deploy a cluster?
that custom script to deploy a cluster?

I don't have a running cluster to check this now.
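
One way to check what those files actually contain on a node is to load them into a Configuration and print the properties of interest; a small sketch, assuming the standard 0.20 file names under that conf directory:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class DumpNodeConf {
  public static void main(String[] args) {
    String confDir = "/usr/local/hadoop-0.20.2/conf";
    Configuration conf = new Configuration();
    // Layer the on-node site files over the built-in defaults.
    conf.addResource(new Path(confDir + "/core-site.xml"));
    conf.addResource(new Path(confDir + "/hdfs-site.xml"));
    conf.addResource(new Path(confDir + "/mapred-site.xml"));

    String[] keys = {
        "fs.default.name", "mapred.job.tracker", "mapred.child.java.opts",
        "mapred.tasktracker.map.tasks.maximum",
        "mapred.tasktracker.reduce.tasks.maximum"
    };
    for (String key : keys) {
      System.out.println(key + " = " + conf.get(key));
    }
  }
}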

On Thu, Feb 3, 2011 at 7:18 PM, <pr...@nokia.com> wrote:

> Hi Tom/all
> Where are the hadoop config files stored on the cluster nodes? I would like
> to debug this issue since I need to give more memory for child java mapred
> processes to process huge chunks of data.
>
> Thanks
> Praveen
> -----Original Message-----
> From: ext praveen.peddi@nokia.com [mailto:praveen.peddi@nokia.com]
> Sent: Wednesday, February 02, 2011 5:23 PM
> To: whirr-user@incubator.apache.org
> Subject: RE: Running Mapred jobs after launching cluster
>
> Can anyone think of a reason why the below property is not honoured when I
> overwrote this along with other properties in the post_configure. Other
> properties are correctly overwritten except this one. I need to set the
> mapred tasks jvm to bigger than 200m.
>
> Praveen
> ________________________________________
> From: Peddi Praveen (Nokia-MS/Boston)
> Sent: Tuesday, February 01, 2011 11:21 AM
> To: whirr-user@incubator.apache.org
> Subject: RE: Running Mapred jobs after launching cluster
>
> Thanks Tom. Silly me I should have thought of the property name. It works
> now except one issue: I ran the wordcount example and I saw that no. of map
> and reduce tasks are as I configured in post_configure script but for some
> reason the below property in job.xml is always -Xmx200m and I set it to
> -Xmx1700m. Not sure if this property is any special.
>
> mapred.child.java.opts  -Xmx200m
>
> Praveen
> ________________________________________
> From: ext Tom White [tom.e.white@gmail.com]
> Sent: Tuesday, February 01, 2011 12:13 AM
> To: whirr-user@incubator.apache.org
> Subject: Re: Running Mapred jobs after launching cluster
>
> Try setting whirr.run-url-base, not run-url-base.
>
> Tom
>
> On Mon, Jan 31, 2011 at 5:33 PM,  <pr...@nokia.com> wrote:
> > I am not using cdh (for now anyway) but the default hadoop. I even
> changed the "localhost" to ipaddress and still no luck. It likely that I am
> doing something wrong but having hard time debugging.
> > Here are the properties I changed in
>  /var/www/apache/hadoop/post-configure but when I run the job I am not
> seeing these values.
> >  MAX_MAP_TASKS=16
> >  MAX_REDUCE_TASKS=24
> >  CHILD_OPTS=-Xmx1700m
> >
> > Here is what I see in  /tmp/runscript/runscript.sh of master node. It
> doesn't look like it used my scripts...
> >
> > installRunUrl || exit 1
> > runurl http://whirr.s3.amazonaws.com/0.3.0-incubating-SNAPSHOT/util/configure-hostnames -c cloudservers
> > runurl http://whirr.s3.amazonaws.com/0.3.0-incubating-SNAPSHOT/sun/java/install
> > runurl http://whirr.s3.amazonaws.com/0.3.0-incubating-SNAPSHOT/apache/hadoop/install -c cloudservers
> >
> > Any suggestions?
> > Praveen
> > ________________________________________
> > From: ext Tom White [tom.e.white@gmail.com]
> > Sent: Monday, January 31, 2011 6:23 PM
> > To: whirr-user@incubator.apache.org
> > Subject: Re: Running Mapred jobs after launching cluster
> >
> > On Mon, Jan 31, 2011 at 3:03 PM,  <pr...@nokia.com> wrote:
> >> If I anyway have to upload the files to webservers, do I still need the
> patch then? It looks like the script has these properties that I can
> overwrite.
> >
> > I suggested you look at the patch (WHIRR-55) so you can see how it
> > will be possible once it's committed. To try it out you need to upload
> > the scripts to a webserver (since the patch changes one of them).
> >
> >>
> >> BTW I tried with webserver path and I could not make it work so far.
> >>
> >> 1. I copied scripts/apache folder to my /var/www folder and modified
> below 3 properties in /var/www/apache/hadoop/post-configure.
> >> 2. I changed hadoop.properties added following line
> >>        run-url-base=http://localhost/ 3. Launched the cluster and
> >> verified the job properties are not what I changed to. They are all
> defaults.
> >
> > This looks right to me. If you are using CDH you need to change
> > cloudera/cdh/post-configure.
> >
> >>
> >> How do I debug this issue?
> >
> > You can log into the instances (see the FAQ for how to do this) and
> > look at the scripts that actually ran (and their output) in the /tmp
> > directory.
> >
> >
> > Tom
> >
> >>
> >> Praveen
> >>
> >>
> >> Launched the cluster and I didn't see child jvm have 2G alloc
> >> -----Original Message-----
> >> From: ext Tom White [mailto:tom.e.white@gmail.com]
> >> Sent: Monday, January 31, 2011 3:02 PM
> >> To: whirr-user@incubator.apache.org
> >> Subject: Re: Running Mapred jobs after launching cluster
> >>
> >> Hi Praveen,
> >>
> >> I think removing the webserver dependency (or making it optional)
> >> would be a good goal, but we're not there yet. I've just created
> >> https://issues.apache.org/jira/browse/WHIRR-225 as a place to discuss
> the design and implementation.
> >>
> >> In the meantime you could take a look at
> https://issues.apache.org/jira/browse/WHIRR-55, and try using the patch
> there to override some Hadoop properties (you will need to upload the
> scripts to a webserver still however, until it is committed, since it
> modifies Hadoop's post-configure script).
> >>
> >> Hope this helps.
> >>
> >> Cheers,
> >> Tom
> >>
> >> BTW what are the security concerns you have? There are no credentials
> embedded in the scripts, so it should be safe to host them publicly, no?
> >>
> >> On Mon, Jan 31, 2011 at 11:00 AM,  <pr...@nokia.com> wrote:
> >>> Hi Tom,
> >>> If the hadoop install is fixed, Whirr must be getting all default
> hadoop properties from the hadoop install itself, correct? I sent an email
> about configuring hadoop properties and you mentioned I need to put the
> modified scripts on a webserver that is publicly accessible. I was wondering
> if there is place inside hadoop install I can change so that I don't need to
> put the scripts on webserver (for security reasons). Do you think it is
> possible? If so, how? I do not mind customizing the jar file for our
> purposes. I want to change the following properties:
> >>>
> >>> mapred.reduce.tasks=24
> >>> mapred.map.tasks=64
> >>> mapred.child.java.opts=-Xmx2048m
> >>>
> >>> Thanks in advance.
> >>> Praveen
> >>>
> >>> -----Original Message-----
> >>> From: ext Tom White [mailto:tom.e.white@gmail.com]
> >>> Sent: Friday, January 28, 2011 4:02 PM
> >>> To: whirr-user@incubator.apache.org
> >>> Subject: Re: Running Mapred jobs after launching cluster
> >>>
> >>> It is fixed, and currently on 0.20.2. It will be made configurable in
> https://issues.apache.org/jira/browse/WHIRR-222.
> >>>
> >>> Cheers
> >>> Tom
> >>>
> >>> On Fri, Jan 28, 2011 at 12:56 PM,  <pr...@nokia.com> wrote:
> >>>> Hi Tom,
> >>>> So the hadoop version is not going to change for a given Whirr
> install? I thought Whirr is getting hadoop install dynamically from a URL
> which is always going to have the latest hadoop version. If that is not the
> case I guess I am fine. I just don't want to get hadoop version mismatch 6
> months after our software is released just because new hadoop version got
> released.
> >>>>
> >>>> Thanks
> >>>> Praveen
> >>>>
> >>>> -----Original Message-----
> >>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
> >>>> Sent: Friday, January 28, 2011 3:35 PM
> >>>> To: whirr-user@incubator.apache.org
> >>>> Subject: Re: Running Mapred jobs after launching cluster
> >>>>
> >>>> On Fri, Jan 28, 2011 at 12:06 PM,  <pr...@nokia.com> wrote:
> >>>>> Thanks Tom. I think I got it working with my own driver so I will go
> with it for now (unless that proves to be a bad option).
> >>>>>
> >>>>> BTW, could you tell me how to stick with one hadoop version while
> launching a cluster? I have hadoop-0.20.2 in my classpath but it looks like
> Whirr gets the latest hadoop from the repository. Since the latest version
> may change over time, I would like to stick to one version so that a hadoop
> version mismatch won't happen.
> >>>>
> >>>> You do need to make sure that the versions are the same. See the
> Hadoop integration tests, which specify the version of Hadoop to use in
> their POM.
> >>>>
> >>>>>
> >>>>> Also, what jar files are necessary for launching a cluster using Java?
> Currently I have the cli version of the jar file, but that's way too large
> since it has everything in it.
> >>>>
> >>>> You need Whirr's core and Hadoop jars, as well as their dependencies.
> >>>> If you look at the POMs in the source code they will tell you the
> dependencies.
> >>>>
> >>>> Cheers
> >>>> Tom
> >>>>
> >>>>>
> >>>>> Thanks
> >>>>> Praveen
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
> >>>>> Sent: Friday, January 28, 2011 2:12 PM
> >>>>> To: whirr-user@incubator.apache.org
> >>>>> Subject: Re: Running Mapred jobs after launching cluster
> >>>>>
> >>>>> On Fri, Jan 28, 2011 at 6:28 AM,  <pr...@nokia.com> wrote:
> >>>>>> Thanks Tom. Could you elaborate a little more on the second option?
> >>>>>>
> >>>>>> What is the HADOOP_CONF_DIR here, after launching the cluster?
> >>>>>
> >>>>> ~/.whirr/<cluster-name>
> >>>>>
> >>>>>> When you said run in new process, did you mean using command line
> Whirr tool?
> >>>>>
> >>>>> I meant that you could launch Whirr using the CLI, or Java. Then run
> the job in another process, with HADOOP_CONF_DIR set.
> >>>>>
> >>>>> The MR jobs you are running I assume can be run against an arbitrary
> cluster, so you should be able to point them at a cluster started by Whirr.
> >>>>>
> >>>>> Tom
> >>>>>
> >>>>>>
> >>>>>> I may finally end up writing my own driver for running external
> mapred jobs so I can have more control but I was just curious to know if
> option #2 is better than writing my own driver.
> >>>>>>
> >>>>>> Praveen
> >>>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: ext Tom White [mailto:tom@cloudera.com]
> >>>>>> Sent: Thursday, January 27, 2011 4:01 PM
> >>>>>> To: whirr-user@incubator.apache.org
> >>>>>> Subject: Re: Running Mapred jobs after launching cluster
> >>>>>>
> >>>>>> If they implement the Tool interface then you can set configuration
> on them. Failing that you could set HADOOP_CONF_DIR and run them in a new
> process.
> >>>>>>
> >>>>>> Cheers,
> >>>>>> Tom
> >>>>>>
> >>>>>> On Thu, Jan 27, 2011 at 12:52 PM,  <pr...@nokia.com> wrote:
> >>>>>>> Hmm...
> >>>>>>> I am running some of the map reduce jobs written by me but some of
> them are in external libraries (eg. Mahout) which I don't have control over.
> Since I can't modify the code in external libraries, is there any other way
> to make this work?
> >>>>>>>
> >>>>>>> Praveen
> >>>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
> >>>>>>> Sent: Thursday, January 27, 2011 3:42 PM
> >>>>>>> To: whirr-user@incubator.apache.org
> >>>>>>> Subject: Re: Running Mapred jobs after launching cluster
> >>>>>>>
> >>>>>>> You don't need to add anything to the classpath, but you need to
> use the configuration in the org.apache.whirr.service.Cluster object to
> populate your Hadoop Configuration object so that your code knows which
> cluster to connect to. See the getConfiguration() method in
> HadoopServiceController for how to do this.
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Tom
> >>>>>>>
> >>>>>>> On Thu, Jan 27, 2011 at 12:21 PM,  <pr...@nokia.com>
> wrote:
> >>>>>>>> Hello all,
> >>>>>>>> I wrote a java class HadoopLanucher that is very similar to
> >>>>>>>> HadoopServiceController. I was succesfully able to launch a
> >>>>>>>> cluster programtically from my application using Whirr. Now I
> >>>>>>>> want to copy files to hdfs and also run a job progrmatically.
> >>>>>>>>
> >>>>>>>> When I copy a file to hdfs its copying to local file system, not
> hdfs.
> >>>>>>>> Here is the code I used:
> >>>>>>>>
> >>>>>>>> Configuration conf = new Configuration(); FileSystem hdfs =
> >>>>>>>> FileSystem.get(conf); hdfs.copyFromLocalFile(false, true, new
> >>>>>>>> Path(localFilePath), new Path(hdfsFileDirectory));
> >>>>>>>>
> >>>>>>>> Do I need to add anything else to the classpath so Hadoop
> >>>>>>>> libraries know that it needs to talk to the dynamically lanuched
> cluster?
> >>>>>>>> When running Whirr from command line I know it uses
> >>>>>>>> HADOOP_CONF_DIR to find the hadoop config files but when doing
> >>>>>>>> the same from Java I am wondering how to solve this issue.
> >>>>>>>>
> >>>>>>>> Praveen
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>



-- 
Andrei Savu -- andreisavu.ro

RE: Running Mapred jobs after launching cluster

Posted by pr...@nokia.com.
Hi Tom/all
Where are the hadoop config files stored on the cluster nodes? I would like to debug this issue since I need to give more memory to the child Java mapred processes so they can handle huge chunks of data.

Thanks
Praveen
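
A minimal client-side sketch of one way to check and work around this, assuming the hadoop-site.xml that Whirr writes under ~/.whirr/<cluster-name> (the HADOOP_CONF_DIR mentioned earlier in this thread); the cluster name and file name below are only examples:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class ChildHeapCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Client-side config generated by Whirr; "myhadoopcluster" is an example name.
    conf.addResource(new Path(System.getProperty("user.home")
        + "/.whirr/myhadoopcluster/hadoop-site.xml"));
    // Hadoop's built-in default is -Xmx200m if nothing sets this property.
    System.out.println("mapred.child.java.opts = "
        + conf.get("mapred.child.java.opts", "-Xmx200m (default)"));
    // Per-job override until the cluster-wide value is sorted out.
    conf.set("mapred.child.java.opts", "-Xmx1700m");
  }
}
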
-----Original Message-----
From: ext praveen.peddi@nokia.com [mailto:praveen.peddi@nokia.com] 
Sent: Wednesday, February 02, 2011 5:23 PM
To: whirr-user@incubator.apache.org
Subject: RE: Running Mapred jobs after launching cluster

Can anyone think of a reason why the below property is not honoured when I overwrote this along with other properties in the post_configure. Other properties are correctly overwritten except this one. I need to set the mapred tasks jvm to bigger than 200m.

Praveen
________________________________________
From: Peddi Praveen (Nokia-MS/Boston)
Sent: Tuesday, February 01, 2011 11:21 AM
To: whirr-user@incubator.apache.org
Subject: RE: Running Mapred jobs after launching cluster

Thanks Tom. Silly me I should have thought of the property name. It works now except one issue: I ran the wordcount example and I saw that no. of map and reduce tasks are as I configured in post_configure script but for some reason the below property in job.xml is always -Xmx200m and I set it to -Xmx1700m. Not sure if this property is any special.

mapred.child.java.opts  -Xmx200m

Praveen
________________________________________
From: ext Tom White [tom.e.white@gmail.com]
Sent: Tuesday, February 01, 2011 12:13 AM
To: whirr-user@incubator.apache.org
Subject: Re: Running Mapred jobs after launching cluster

Try setting whirr.run-url-base, not run-url-base.

Tom

On Mon, Jan 31, 2011 at 5:33 PM,  <pr...@nokia.com> wrote:
> I am not using cdh (for now anyway) but the default hadoop. I even changed the "localhost" to ipaddress and still no luck. It likely that I am doing something wrong but having hard time debugging.
> Here are the properties I changed in  /var/www/apache/hadoop/post-configure but when I run the job I am not seeing these values.
>  MAX_MAP_TASKS=16
>  MAX_REDUCE_TASKS=24
>  CHILD_OPTS=-Xmx1700m
>
> Here is what I see in  /tmp/runscript/runscript.sh of master node. It doesn't look like it used my scripts...
>
> installRunUrl || exit 1
> runurl http://whirr.s3.amazonaws.com/0.3.0-incubating-SNAPSHOT/util/configure-hostnames -c cloudservers
> runurl http://whirr.s3.amazonaws.com/0.3.0-incubating-SNAPSHOT/sun/java/install
> runurl http://whirr.s3.amazonaws.com/0.3.0-incubating-SNAPSHOT/apache/hadoop/install -c cloudservers
>
> Any suggestions?
> Praveen
> ________________________________________
> From: ext Tom White [tom.e.white@gmail.com]
> Sent: Monday, January 31, 2011 6:23 PM
> To: whirr-user@incubator.apache.org
> Subject: Re: Running Mapred jobs after launching cluster
>
> On Mon, Jan 31, 2011 at 3:03 PM,  <pr...@nokia.com> wrote:
>> If I anyway have to upload the files to webservers, do I still need the patch then? It looks like the script has these properties that I can overwrite.
>
> I suggested you look at the patch (WHIRR-55) so you can see how it 
> will be possible once it's committed. To try it out you need to upload 
> the scripts to a webserver (since the patch changes one of them).
>
>>
>> BTW I tried with webserver path and I could not make it work so far.
>>
>> 1. I copied scripts/apache folder to my /var/www folder and modified below 3 properties in /var/www/apache/hadoop/post-configure.
>> 2. I changed hadoop.properties added following line
>>        run-url-base=http://localhost/
>> 3. Launched the cluster and verified the job properties are not what I changed to. They are all defaults.
>
> This looks right to me. If you are using CDH you need to change 
> cloudera/cdh/post-configure.
>
>>
>> How do I debug this issue?
>
> You can log into the instances (see the FAQ for how to do this) and 
> look at the scripts that actually ran (and their output) in the /tmp 
> directory.
>
>
> Tom
>
>>
>> Praveen
>>
>>
>> Launched the cluster and I didn't see child jvm have 2G alloc 
>> -----Original Message-----
>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>> Sent: Monday, January 31, 2011 3:02 PM
>> To: whirr-user@incubator.apache.org
>> Subject: Re: Running Mapred jobs after launching cluster
>>
>> Hi Praveen,
>>
>> I think removing the webserver dependency (or making it optional) 
>> would be a good goal, but we're not there yet. I've just created
>> https://issues.apache.org/jira/browse/WHIRR-225 as a place to discuss the design and implementation.
>>
>> In the meantime you could take a look at https://issues.apache.org/jira/browse/WHIRR-55, and try using the patch there to override some Hadoop properties (you will need to upload the scripts to a webserver still however, until it is committed, since it modifies Hadoop's post-configure script).
>>
>> Hope this helps.
>>
>> Cheers,
>> Tom
>>
>> BTW what are the security concerns you have? There are no credentials embedded in the scripts, so it should be safe to host them publicly, no?
>>
>> On Mon, Jan 31, 2011 at 11:00 AM,  <pr...@nokia.com> wrote:
>>> Hi Tom,
>>> If the hadoop install is fixed, Whirr must be getting all default hadoop properties from the hadoop install itself, correct? I sent an email about configuring hadoop properties and you mentioned I need to put the modified scripts on a webserver that is publicly accessible. I was wondering if there is place inside hadoop install I can change so that I don't need to put the scripts on webserver (for security reasons). Do you think it is possible? If so, how? I do not mind customizing the jar file for our purposes. I want to change the following properties:
>>>
>>> mapred.reduce.tasks=24
>>> mapred.map.tasks=64
>>> mapred.child.java.opts=-Xmx2048m
>>>
>>> Thanks in advance.
>>> Praveen
>>>
>>> -----Original Message-----
>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>> Sent: Friday, January 28, 2011 4:02 PM
>>> To: whirr-user@incubator.apache.org
>>> Subject: Re: Running Mapred jobs after launching cluster
>>>
>>> It is fixed, and currently on 0.20.2. It will be made configurable in https://issues.apache.org/jira/browse/WHIRR-222.
>>>
>>> Cheers
>>> Tom
>>>
>>> On Fri, Jan 28, 2011 at 12:56 PM,  <pr...@nokia.com> wrote:
>>>> Hi Tom,
>>>> So the hadoop version is not going to change for a given Whirr install? I thought Whirr is getting hadoop install dynamically from a URL which is always going to have the latest hadoop version. If that is not the case I guess I am fine. I just don't want to get hadoop version mismatch 6 months after our software is released just because new hadoop version got released.
>>>>
>>>> Thanks
>>>> Praveen
>>>>
>>>> -----Original Message-----
>>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>>> Sent: Friday, January 28, 2011 3:35 PM
>>>> To: whirr-user@incubator.apache.org
>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>
>>>> On Fri, Jan 28, 2011 at 12:06 PM,  <pr...@nokia.com> wrote:
>>>>> Thanks Tom. I think I got it working with my own driver so I will go with it for now (unless that proves to be a bad option).
>>>>>
>>>>> BTW, could you tell me how to stick with one hadoop version while launching cluster. I have hadoop-0.20.2 in my classpath but it lookws like Whirr gets the latest hadoop from the repository. Since the latest version may be different depending on the time, I would like to stick to one version so that hadoop version mismatch won't happen.
>>>>
>>>> You do need to make sure that the versions are the same. See the Hadoop integration tests, which specify the version of Hadoop to use in their POM.
>>>>
>>>>>
>>>>> Also what jar files are necessary for launching cluster using Java. Currently I have cli version of jar file but that's way too large since it has ervrything in it.
>>>>
>>>> You need Whirr's core and Hadoop jars, as well as their dependencies.
>>>> If you look at the POMs in the source code they will tell you the dependencies.
>>>>
>>>> Cheers
>>>> Tom
>>>>
>>>>>
>>>>> Thanks
>>>>> Praveen
>>>>>
>>>>> -----Original Message-----
>>>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>>>> Sent: Friday, January 28, 2011 2:12 PM
>>>>> To: whirr-user@incubator.apache.org
>>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>>
>>>>> On Fri, Jan 28, 2011 at 6:28 AM,  <pr...@nokia.com> wrote:
>>>>>> Thanks Tom. Could you eloborate little more on the second option.
>>>>>>
>>>>>> What is the HADOOP_CONF_DIR here, after launching the cluster?
>>>>>
>>>>> ~/.whirr/<cluster-name>
>>>>>
>>>>>> When you said run in new process, did you mean using command line Whirr tool?
>>>>>
>>>>> I meant that you could launch Whirr using the CLI, or Java. Then run the job in another process, with HADOOP_CONF_DIR set.
>>>>>
>>>>> The MR jobs you are running I assume can be run against an arbitrary cluster, so you should be able to point them at a cluster started by Whirr.
>>>>>
>>>>> Tom
>>>>>
>>>>>>
>>>>>> I may finally end up writing my own driver for running external mapred jobs so I can have more control but I was just curious to know if option #2 is better than writing my own driver.
>>>>>>
>>>>>> Praveen
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: ext Tom White [mailto:tom@cloudera.com]
>>>>>> Sent: Thursday, January 27, 2011 4:01 PM
>>>>>> To: whirr-user@incubator.apache.org
>>>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>>>
>>>>>> If they implement the Tool interface then you can set configuration on them. Failing that you could set HADOOP_CONF_DIR and run them in a new process.
>>>>>>
>>>>>> Cheers,
>>>>>> Tom
>>>>>>
>>>>>> On Thu, Jan 27, 2011 at 12:52 PM,  <pr...@nokia.com> wrote:
>>>>>>> Hmm...
>>>>>>> I am running some of the map reduce jobs written by me but some of them are in external libraries (eg. Mahout) which I don't have control over. Since I can't modify the code in external libraries, is there any other way to make this work?
>>>>>>>
>>>>>>> Praveen
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>>>>>> Sent: Thursday, January 27, 2011 3:42 PM
>>>>>>> To: whirr-user@incubator.apache.org
>>>>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>>>>
>>>>>>> You don't need to add anything to the classpath, but you need to use the configuration in the org.apache.whirr.service.Cluster object to populate your Hadoop Configuration object so that your code knows which cluster to connect to. See the getConfiguration() method in HadoopServiceController for how to do this.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Tom
>>>>>>>
>>>>>>> On Thu, Jan 27, 2011 at 12:21 PM,  <pr...@nokia.com> wrote:
>>>>>>>> Hello all,
>>>>>>>> I wrote a java class HadoopLanucher that is very similar to 
>>>>>>>> HadoopServiceController. I was succesfully able to launch a 
>>>>>>>> cluster programtically from my application using Whirr. Now I 
>>>>>>>> want to copy files to hdfs and also run a job progrmatically.
>>>>>>>>
>>>>>>>> When I copy a file to hdfs its copying to local file system, not hdfs.
>>>>>>>> Here is the code I used:
>>>>>>>>
>>>>>>>> Configuration conf = new Configuration(); FileSystem hdfs = 
>>>>>>>> FileSystem.get(conf); hdfs.copyFromLocalFile(false, true, new 
>>>>>>>> Path(localFilePath), new Path(hdfsFileDirectory));
>>>>>>>>
>>>>>>>> Do I need to add anything else to the classpath so Hadoop 
>>>>>>>> libraries know that it needs to talk to the dynamically lanuched cluster?
>>>>>>>> When running Whirr from command line I know it uses 
>>>>>>>> HADOOP_CONF_DIR to find the hadoop config files but when doing 
>>>>>>>> the same from Java I am wondering how to solve this issue.
>>>>>>>>
>>>>>>>> Praveen
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

RE: Running Mapred jobs after launching cluster

Posted by pr...@nokia.com.
Can anyone think of a reason why the below property is not honoured when I overrode it, along with other properties, in the post-configure script? The other properties are overridden correctly; only this one is not. I need to set the mapred task JVM heap to more than 200m.

Praveen
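
One stopgap, since mapred.child.java.opts is a per-job setting, is to override it when the job is submitted. A rough sketch of a Tool-based driver (the class name is made up for illustration); because it goes through ToolRunner, passing -Dmapred.child.java.opts=-Xmx1700m on the command line is picked up as well:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ChildOptsDriver extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    JobConf job = new JobConf(getConf(), ChildOptsDriver.class);
    System.out.println("before override: " + job.get("mapred.child.java.opts"));
    job.set("mapred.child.java.opts", "-Xmx1700m");   // per-job override
    System.out.println("after override:  " + job.get("mapred.child.java.opts"));
    // Submit the real job with this JobConf as usual, e.g. JobClient.runJob(job).
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner also honours -D overrides given on the command line.
    System.exit(ToolRunner.run(new Configuration(), new ChildOptsDriver(), args));
  }
}
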
________________________________________
From: Peddi Praveen (Nokia-MS/Boston)
Sent: Tuesday, February 01, 2011 11:21 AM
To: whirr-user@incubator.apache.org
Subject: RE: Running Mapred jobs after launching cluster

Thanks Tom. Silly me I should have thought of the property name. It works now except one issue: I ran the wordcount example and I saw that no. of map and reduce tasks are as I configured in post_configure script but for some reason the below property in job.xml is always -Xmx200m and I set it to -Xmx1700m. Not sure if this property is any special.

mapred.child.java.opts  -Xmx200m

Praveen
________________________________________
From: ext Tom White [tom.e.white@gmail.com]
Sent: Tuesday, February 01, 2011 12:13 AM
To: whirr-user@incubator.apache.org
Subject: Re: Running Mapred jobs after launching cluster

Try setting whirr.run-url-base, not run-url-base.

Tom

On Mon, Jan 31, 2011 at 5:33 PM,  <pr...@nokia.com> wrote:
> I am not using cdh (for now anyway) but the default hadoop. I even changed the "localhost" to ipaddress and still no luck. It likely that I am doing something wrong but having hard time debugging.
> Here are the properties I changed in  /var/www/apache/hadoop/post-configure but when I run the job I am not seeing these values.
>  MAX_MAP_TASKS=16
>  MAX_REDUCE_TASKS=24
>  CHILD_OPTS=-Xmx1700m
>
> Here is what I see in  /tmp/runscript/runscript.sh of master node. It doesn't look like it used my scripts...
>
> installRunUrl || exit 1
> runurl http://whirr.s3.amazonaws.com/0.3.0-incubating-SNAPSHOT/util/configure-hostnames -c cloudservers
> runurl http://whirr.s3.amazonaws.com/0.3.0-incubating-SNAPSHOT/sun/java/install
> runurl http://whirr.s3.amazonaws.com/0.3.0-incubating-SNAPSHOT/apache/hadoop/install -c cloudservers
>
> Any suggestions?
> Praveen
> ________________________________________
> From: ext Tom White [tom.e.white@gmail.com]
> Sent: Monday, January 31, 2011 6:23 PM
> To: whirr-user@incubator.apache.org
> Subject: Re: Running Mapred jobs after launching cluster
>
> On Mon, Jan 31, 2011 at 3:03 PM,  <pr...@nokia.com> wrote:
>> If I anyway have to upload the files to webservers, do I still need the patch then? It looks like the script has these properties that I can overwrite.
>
> I suggested you look at the patch (WHIRR-55) so you can see how it
> will be possible once it's committed. To try it out you need to upload
> the scripts to a webserver (since the patch changes one of them).
>
>>
>> BTW I tried with webserver path and I could not make it work so far.
>>
>> 1. I copied scripts/apache folder to my /var/www folder and modified below 3 properties in /var/www/apache/hadoop/post-configure.
>> 2. I changed hadoop.properties added following line
>>        run-url-base=http://localhost/
>> 3. Launched the cluster and verified the job properties are not what I changed to. They are all defaults.
>
> This looks right to me. If you are using CDH you need to change
> cloudera/cdh/post-configure.
>
>>
>> How do I debug this issue?
>
> You can log into the instances (see the FAQ for how to do this) and
> look at the scripts that actually ran (and their output) in the /tmp
> directory.
>
>
> Tom
>
>>
>> Praveen
>>
>>
>> Launched the cluster and I didn't see child jvm have 2G alloc
>> -----Original Message-----
>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>> Sent: Monday, January 31, 2011 3:02 PM
>> To: whirr-user@incubator.apache.org
>> Subject: Re: Running Mapred jobs after launching cluster
>>
>> Hi Praveen,
>>
>> I think removing the webserver dependency (or making it optional) would be a good goal, but we're not there yet. I've just created
>> https://issues.apache.org/jira/browse/WHIRR-225 as a place to discuss the design and implementation.
>>
>> In the meantime you could take a look at https://issues.apache.org/jira/browse/WHIRR-55, and try using the patch there to override some Hadoop properties (you will need to upload the scripts to a webserver still however, until it is committed, since it modifies Hadoop's post-configure script).
>>
>> Hope this helps.
>>
>> Cheers,
>> Tom
>>
>> BTW what are the security concerns you have? There are no credentials embedded in the scripts, so it should be safe to host them publicly, no?
>>
>> On Mon, Jan 31, 2011 at 11:00 AM,  <pr...@nokia.com> wrote:
>>> Hi Tom,
>>> If the hadoop install is fixed, Whirr must be getting all default hadoop properties from the hadoop install itself, correct? I sent an email about configuring hadoop properties and you mentioned I need to put the modified scripts on a webserver that is publicly accessible. I was wondering if there is place inside hadoop install I can change so that I don't need to put the scripts on webserver (for security reasons). Do you think it is possible? If so, how? I do not mind customizing the jar file for our purposes. I want to change the following properties:
>>>
>>> mapred.reduce.tasks=24
>>> mapred.map.tasks=64
>>> mapred.child.java.opts=-Xmx2048m
>>>
>>> Thanks in advance.
>>> Praveen
>>>
>>> -----Original Message-----
>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>> Sent: Friday, January 28, 2011 4:02 PM
>>> To: whirr-user@incubator.apache.org
>>> Subject: Re: Running Mapred jobs after launching cluster
>>>
>>> It is fixed, and currently on 0.20.2. It will be made configurable in https://issues.apache.org/jira/browse/WHIRR-222.
>>>
>>> Cheers
>>> Tom
>>>
>>> On Fri, Jan 28, 2011 at 12:56 PM,  <pr...@nokia.com> wrote:
>>>> Hi Tom,
>>>> So the hadoop version is not going to change for a given Whirr install? I thought Whirr is getting hadoop install dynamically from a URL which is always going to have the latest hadoop version. If that is not the case I guess I am fine. I just don't want to get hadoop version mismatch 6 months after our software is released just because new hadoop version got released.
>>>>
>>>> Thanks
>>>> Praveen
>>>>
>>>> -----Original Message-----
>>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>>> Sent: Friday, January 28, 2011 3:35 PM
>>>> To: whirr-user@incubator.apache.org
>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>
>>>> On Fri, Jan 28, 2011 at 12:06 PM,  <pr...@nokia.com> wrote:
>>>>> Thanks Tom. I think I got it working with my own driver so I will go with it for now (unless that proves to be a bad option).
>>>>>
>>>>> BTW, could you tell me how to stick with one hadoop version while launching cluster. I have hadoop-0.20.2 in my classpath but it lookws like Whirr gets the latest hadoop from the repository. Since the latest version may be different depending on the time, I would like to stick to one version so that hadoop version mismatch won't happen.
>>>>
>>>> You do need to make sure that the versions are the same. See the Hadoop integration tests, which specify the version of Hadoop to use in their POM.
>>>>
>>>>>
>>>>> Also what jar files are necessary for launching cluster using Java. Currently I have cli version of jar file but that's way too large since it has ervrything in it.
>>>>
>>>> You need Whirr's core and Hadoop jars, as well as their dependencies.
>>>> If you look at the POMs in the source code they will tell you the dependencies.
>>>>
>>>> Cheers
>>>> Tom
>>>>
>>>>>
>>>>> Thanks
>>>>> Praveen
>>>>>
>>>>> -----Original Message-----
>>>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>>>> Sent: Friday, January 28, 2011 2:12 PM
>>>>> To: whirr-user@incubator.apache.org
>>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>>
>>>>> On Fri, Jan 28, 2011 at 6:28 AM,  <pr...@nokia.com> wrote:
>>>>>> Thanks Tom. Could you eloborate little more on the second option.
>>>>>>
>>>>>> What is the HADOOP_CONF_DIR here, after launching the cluster?
>>>>>
>>>>> ~/.whirr/<cluster-name>
>>>>>
>>>>>> When you said run in new process, did you mean using command line Whirr tool?
>>>>>
>>>>> I meant that you could launch Whirr using the CLI, or Java. Then run the job in another process, with HADOOP_CONF_DIR set.
>>>>>
>>>>> The MR jobs you are running I assume can be run against an arbitrary cluster, so you should be able to point them at a cluster started by Whirr.
>>>>>
>>>>> Tom
>>>>>
>>>>>>
>>>>>> I may finally end up writing my own driver for running external mapred jobs so I can have more control but I was just curious to know if option #2 is better than writing my own driver.
>>>>>>
>>>>>> Praveen
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: ext Tom White [mailto:tom@cloudera.com]
>>>>>> Sent: Thursday, January 27, 2011 4:01 PM
>>>>>> To: whirr-user@incubator.apache.org
>>>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>>>
>>>>>> If they implement the Tool interface then you can set configuration on them. Failing that you could set HADOOP_CONF_DIR and run them in a new process.
>>>>>>
>>>>>> Cheers,
>>>>>> Tom
>>>>>>
>>>>>> On Thu, Jan 27, 2011 at 12:52 PM,  <pr...@nokia.com> wrote:
>>>>>>> Hmm...
>>>>>>> I am running some of the map reduce jobs written by me but some of them are in external libraries (eg. Mahout) which I don't have control over. Since I can't modify the code in external libraries, is there any other way to make this work?
>>>>>>>
>>>>>>> Praveen
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>>>>>> Sent: Thursday, January 27, 2011 3:42 PM
>>>>>>> To: whirr-user@incubator.apache.org
>>>>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>>>>
>>>>>>> You don't need to add anything to the classpath, but you need to use the configuration in the org.apache.whirr.service.Cluster object to populate your Hadoop Configuration object so that your code knows which cluster to connect to. See the getConfiguration() method in HadoopServiceController for how to do this.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Tom
>>>>>>>
>>>>>>> On Thu, Jan 27, 2011 at 12:21 PM,  <pr...@nokia.com> wrote:
>>>>>>>> Hello all,
>>>>>>>> I wrote a java class HadoopLanucher that is very similar to
>>>>>>>> HadoopServiceController. I was succesfully able to launch a
>>>>>>>> cluster programtically from my application using Whirr. Now I
>>>>>>>> want to copy files to hdfs and also run a job progrmatically.
>>>>>>>>
>>>>>>>> When I copy a file to hdfs its copying to local file system, not hdfs.
>>>>>>>> Here is the code I used:
>>>>>>>>
>>>>>>>> Configuration conf = new Configuration(); FileSystem hdfs =
>>>>>>>> FileSystem.get(conf); hdfs.copyFromLocalFile(false, true, new
>>>>>>>> Path(localFilePath), new Path(hdfsFileDirectory));
>>>>>>>>
>>>>>>>> Do I need to add anything else to the classpath so Hadoop
>>>>>>>> libraries know that it needs to talk to the dynamically lanuched cluster?
>>>>>>>> When running Whirr from command line I know it uses
>>>>>>>> HADOOP_CONF_DIR to find the hadoop config files but when doing
>>>>>>>> the same from Java I am wondering how to solve this issue.
>>>>>>>>
>>>>>>>> Praveen
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

RE: Running Mapred jobs after launching cluster

Posted by pr...@nokia.com.
Thanks Tom. Silly me, I should have thought of the property name. It works now except for one issue: I ran the wordcount example and the numbers of map and reduce tasks are as I configured in the post-configure script, but for some reason the property below in job.xml is always -Xmx200m even though I set it to -Xmx1700m. Not sure if this property is special in some way.

mapred.child.java.opts	-Xmx200m

Praveen
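
If I understand the 0.20 configuration precedence correctly, the reason is that mapred.child.java.opts is a job-level property: job.xml is assembled by the submitting client, which falls back to Hadoop's built-in -Xmx200m when neither the client configuration nor the job sets it, regardless of what the cluster-side site file says. Two ways around it are to set the property at submission time (per job), or to mark the cluster-side entry final so job.xml cannot override it. A sketch of the latter, using the standard Hadoop site-file syntax rather than anything Whirr-specific:

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1700m</value>
  <final>true</final>
</property>
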
________________________________________
From: ext Tom White [tom.e.white@gmail.com]
Sent: Tuesday, February 01, 2011 12:13 AM
To: whirr-user@incubator.apache.org
Subject: Re: Running Mapred jobs after launching cluster

Try setting whirr.run-url-base, not run-url-base.

Tom

On Mon, Jan 31, 2011 at 5:33 PM,  <pr...@nokia.com> wrote:
> I am not using cdh (for now anyway) but the default hadoop. I even changed the "localhost" to ipaddress and still no luck. It likely that I am doing something wrong but having hard time debugging.
> Here are the properties I changed in  /var/www/apache/hadoop/post-configure but when I run the job I am not seeing these values.
>  MAX_MAP_TASKS=16
>  MAX_REDUCE_TASKS=24
>  CHILD_OPTS=-Xmx1700m
>
> Here is what I see in  /tmp/runscript/runscript.sh of master node. It doesn't look like it used my scripts...
>
> installRunUrl || exit 1
> runurl http://whirr.s3.amazonaws.com/0.3.0-incubating-SNAPSHOT/util/configure-hostnames -c cloudservers
> runurl http://whirr.s3.amazonaws.com/0.3.0-incubating-SNAPSHOT/sun/java/install
> runurl http://whirr.s3.amazonaws.com/0.3.0-incubating-SNAPSHOT/apache/hadoop/install -c cloudservers
>
> Any suggestions?
> Praveen
> ________________________________________
> From: ext Tom White [tom.e.white@gmail.com]
> Sent: Monday, January 31, 2011 6:23 PM
> To: whirr-user@incubator.apache.org
> Subject: Re: Running Mapred jobs after launching cluster
>
> On Mon, Jan 31, 2011 at 3:03 PM,  <pr...@nokia.com> wrote:
>> If I anyway have to upload the files to webservers, do I still need the patch then? It looks like the script has these properties that I can overwrite.
>
> I suggested you look at the patch (WHIRR-55) so you can see how it
> will be possible once it's committed. To try it out you need to upload
> the scripts to a webserver (since the patch changes one of them).
>
>>
>> BTW I tried with webserver path and I could not make it work so far.
>>
>> 1. I copied scripts/apache folder to my /var/www folder and modified below 3 properties in /var/www/apache/hadoop/post-configure.
>> 2. I changed hadoop.properties added following line
>>        run-url-base=http://localhost/
>> 3. Launched the cluster and verified the job properties are not what I changed to. They are all defaults.
>
> This looks right to me. If you are using CDH you need to change
> cloudera/cdh/post-configure.
>
>>
>> How do I debug this issue?
>
> You can log into the instances (see the FAQ for how to do this) and
> look at the scripts that actually ran (and their output) in the /tmp
> directory.
>
>
> Tom
>
>>
>> Praveen
>>
>>
>> Launched the cluster and I didn't see child jvm have 2G alloc
>> -----Original Message-----
>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>> Sent: Monday, January 31, 2011 3:02 PM
>> To: whirr-user@incubator.apache.org
>> Subject: Re: Running Mapred jobs after launching cluster
>>
>> Hi Praveen,
>>
>> I think removing the webserver dependency (or making it optional) would be a good goal, but we're not there yet. I've just created
>> https://issues.apache.org/jira/browse/WHIRR-225 as a place to discuss the design and implementation.
>>
>> In the meantime you could take a look at https://issues.apache.org/jira/browse/WHIRR-55, and try using the patch there to override some Hadoop properties (you will need to upload the scripts to a webserver still however, until it is committed, since it modifies Hadoop's post-configure script).
>>
>> Hope this helps.
>>
>> Cheers,
>> Tom
>>
>> BTW what are the security concerns you have? There are no credentials embedded in the scripts, so it should be safe to host them publicly, no?
>>
>> On Mon, Jan 31, 2011 at 11:00 AM,  <pr...@nokia.com> wrote:
>>> Hi Tom,
>>> If the hadoop install is fixed, Whirr must be getting all default hadoop properties from the hadoop install itself, correct? I sent an email about configuring hadoop properties and you mentioned I need to put the modified scripts on a webserver that is publicly accessible. I was wondering if there is place inside hadoop install I can change so that I don't need to put the scripts on webserver (for security reasons). Do you think it is possible? If so, how? I do not mind customizing the jar file for our purposes. I want to change the following properties:
>>>
>>> mapred.reduce.tasks=24
>>> mapred.map.tasks=64
>>> mapred.child.java.opts=-Xmx2048m
>>>
>>> Thanks in advance.
>>> Praveen
>>>
>>> -----Original Message-----
>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>> Sent: Friday, January 28, 2011 4:02 PM
>>> To: whirr-user@incubator.apache.org
>>> Subject: Re: Running Mapred jobs after launching cluster
>>>
>>> It is fixed, and currently on 0.20.2. It will be made configurable in https://issues.apache.org/jira/browse/WHIRR-222.
>>>
>>> Cheers
>>> Tom
>>>
>>> On Fri, Jan 28, 2011 at 12:56 PM,  <pr...@nokia.com> wrote:
>>>> Hi Tom,
>>>> So the hadoop version is not going to change for a given Whirr install? I thought Whirr is getting hadoop install dynamically from a URL which is always going to have the latest hadoop version. If that is not the case I guess I am fine. I just don't want to get hadoop version mismatch 6 months after our software is released just because new hadoop version got released.
>>>>
>>>> Thanks
>>>> Praveen
>>>>
>>>> -----Original Message-----
>>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>>> Sent: Friday, January 28, 2011 3:35 PM
>>>> To: whirr-user@incubator.apache.org
>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>
>>>> On Fri, Jan 28, 2011 at 12:06 PM,  <pr...@nokia.com> wrote:
>>>>> Thanks Tom. I think I got it working with my own driver so I will go with it for now (unless that proves to be a bad option).
>>>>>
>>>>> BTW, could you tell me how to stick with one hadoop version while launching cluster. I have hadoop-0.20.2 in my classpath but it lookws like Whirr gets the latest hadoop from the repository. Since the latest version may be different depending on the time, I would like to stick to one version so that hadoop version mismatch won't happen.
>>>>
>>>> You do need to make sure that the versions are the same. See the Hadoop integration tests, which specify the version of Hadoop to use in their POM.
>>>>
>>>>>
>>>>> Also what jar files are necessary for launching cluster using Java. Currently I have cli version of jar file but that's way too large since it has ervrything in it.
>>>>
>>>> You need Whirr's core and Hadoop jars, as well as their dependencies.
>>>> If you look at the POMs in the source code they will tell you the dependencies.
>>>>
>>>> Cheers
>>>> Tom
>>>>
>>>>>
>>>>> Thanks
>>>>> Praveen
>>>>>
>>>>> -----Original Message-----
>>>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>>>> Sent: Friday, January 28, 2011 2:12 PM
>>>>> To: whirr-user@incubator.apache.org
>>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>>
>>>>> On Fri, Jan 28, 2011 at 6:28 AM,  <pr...@nokia.com> wrote:
>>>>>> Thanks Tom. Could you eloborate little more on the second option.
>>>>>>
>>>>>> What is the HADOOP_CONF_DIR here, after launching the cluster?
>>>>>
>>>>> ~/.whirr/<cluster-name>
>>>>>
>>>>>> When you said run in new process, did you mean using command line Whirr tool?
>>>>>
>>>>> I meant that you could launch Whirr using the CLI, or Java. Then run the job in another process, with HADOOP_CONF_DIR set.
>>>>>
>>>>> The MR jobs you are running I assume can be run against an arbitrary cluster, so you should be able to point them at a cluster started by Whirr.
>>>>>
>>>>> Tom
>>>>>
>>>>>>
>>>>>> I may finally end up writing my own driver for running external mapred jobs so I can have more control but I was just curious to know if option #2 is better than writing my own driver.
>>>>>>
>>>>>> Praveen
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: ext Tom White [mailto:tom@cloudera.com]
>>>>>> Sent: Thursday, January 27, 2011 4:01 PM
>>>>>> To: whirr-user@incubator.apache.org
>>>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>>>
>>>>>> If they implement the Tool interface then you can set configuration on them. Failing that you could set HADOOP_CONF_DIR and run them in a new process.
>>>>>>
>>>>>> Cheers,
>>>>>> Tom
>>>>>>
>>>>>> On Thu, Jan 27, 2011 at 12:52 PM,  <pr...@nokia.com> wrote:
>>>>>>> Hmm...
>>>>>>> I am running some of the map reduce jobs written by me but some of them are in external libraries (eg. Mahout) which I don't have control over. Since I can't modify the code in external libraries, is there any other way to make this work?
>>>>>>>
>>>>>>> Praveen
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>>>>>> Sent: Thursday, January 27, 2011 3:42 PM
>>>>>>> To: whirr-user@incubator.apache.org
>>>>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>>>>
>>>>>>> You don't need to add anything to the classpath, but you need to use the configuration in the org.apache.whirr.service.Cluster object to populate your Hadoop Configuration object so that your code knows which cluster to connect to. See the getConfiguration() method in HadoopServiceController for how to do this.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Tom
>>>>>>>
>>>>>>> On Thu, Jan 27, 2011 at 12:21 PM,  <pr...@nokia.com> wrote:
>>>>>>>> Hello all,
>>>>>>>> I wrote a java class HadoopLanucher that is very similar to
>>>>>>>> HadoopServiceController. I was succesfully able to launch a
>>>>>>>> cluster programtically from my application using Whirr. Now I
>>>>>>>> want to copy files to hdfs and also run a job progrmatically.
>>>>>>>>
>>>>>>>> When I copy a file to hdfs its copying to local file system, not hdfs.
>>>>>>>> Here is the code I used:
>>>>>>>>
>>>>>>>> Configuration conf = new Configuration(); FileSystem hdfs =
>>>>>>>> FileSystem.get(conf); hdfs.copyFromLocalFile(false, true, new
>>>>>>>> Path(localFilePath), new Path(hdfsFileDirectory));
>>>>>>>>
>>>>>>>> Do I need to add anything else to the classpath so Hadoop
>>>>>>>> libraries know that it needs to talk to the dynamically lanuched cluster?
>>>>>>>> When running Whirr from command line I know it uses
>>>>>>>> HADOOP_CONF_DIR to find the hadoop config files but when doing
>>>>>>>> the same from Java I am wondering how to solve this issue.
>>>>>>>>
>>>>>>>> Praveen
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Running Mapred jobs after launching cluster

Posted by Tom White <to...@gmail.com>.
Try setting whirr.run-url-base, not run-url-base.

Tom
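
For reference, a sketch of the relevant hadoop.properties line; the host below is a placeholder and needs to be reachable from the launched instances, not just from the machine running Whirr:

whirr.run-url-base=http://<your-webserver>/
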

On Mon, Jan 31, 2011 at 5:33 PM,  <pr...@nokia.com> wrote:
> I am not using cdh (for now anyway) but the default hadoop. I even changed the "localhost" to ipaddress and still no luck. It likely that I am doing something wrong but having hard time debugging.
> Here are the properties I changed in  /var/www/apache/hadoop/post-configure but when I run the job I am not seeing these values.
>  MAX_MAP_TASKS=16
>  MAX_REDUCE_TASKS=24
>  CHILD_OPTS=-Xmx1700m
>
> Here is what I see in  /tmp/runscript/runscript.sh of master node. It doesn't look like it used my scripts...
>
> installRunUrl || exit 1
> runurl http://whirr.s3.amazonaws.com/0.3.0-incubating-SNAPSHOT/util/configure-hostnames -c cloudservers
> runurl http://whirr.s3.amazonaws.com/0.3.0-incubating-SNAPSHOT/sun/java/install
> runurl http://whirr.s3.amazonaws.com/0.3.0-incubating-SNAPSHOT/apache/hadoop/install -c cloudservers
>
> Any suggestions?
> Praveen
> ________________________________________
> From: ext Tom White [tom.e.white@gmail.com]
> Sent: Monday, January 31, 2011 6:23 PM
> To: whirr-user@incubator.apache.org
> Subject: Re: Running Mapred jobs after launching cluster
>
> On Mon, Jan 31, 2011 at 3:03 PM,  <pr...@nokia.com> wrote:
>> If I anyway have to upload the files to webservers, do I still need the patch then? It looks like the script has these properties that I can overwrite.
>
> I suggested you look at the patch (WHIRR-55) so you can see how it
> will be possible once it's committed. To try it out you need to upload
> the scripts to a webserver (since the patch changes one of them).
>
>>
>> BTW I tried with webserver path and I could not make it work so far.
>>
>> 1. I copied scripts/apache folder to my /var/www folder and modified below 3 properties in /var/www/apache/hadoop/post-configure.
>> 2. I changed hadoop.properties added following line
>>        run-url-base=http://localhost/
>> 3. Launched the cluster and verified the job properties are not what I changed to. They are all defaults.
>
> This looks right to me. If you are using CDH you need to change
> cloudera/cdh/post-configure.
>
>>
>> How do I debug this issue?
>
> You can log into the instances (see the FAQ for how to do this) and
> look at the scripts that actually ran (and their output) in the /tmp
> directory.
>
>
> Tom
>
>>
>> Praveen
>>
>>
>> Launched the cluster and I didn't see child jvm have 2G alloc
>> -----Original Message-----
>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>> Sent: Monday, January 31, 2011 3:02 PM
>> To: whirr-user@incubator.apache.org
>> Subject: Re: Running Mapred jobs after launching cluster
>>
>> Hi Praveen,
>>
>> I think removing the webserver dependency (or making it optional) would be a good goal, but we're not there yet. I've just created
>> https://issues.apache.org/jira/browse/WHIRR-225 as a place to discuss the design and implementation.
>>
>> In the meantime you could take a look at https://issues.apache.org/jira/browse/WHIRR-55, and try using the patch there to override some Hadoop properties (you will need to upload the scripts to a webserver still however, until it is committed, since it modifies Hadoop's post-configure script).
>>
>> Hope this helps.
>>
>> Cheers,
>> Tom
>>
>> BTW what are the security concerns you have? There are no credentials embedded in the scripts, so it should be safe to host them publicly, no?
>>
>> On Mon, Jan 31, 2011 at 11:00 AM,  <pr...@nokia.com> wrote:
>>> Hi Tom,
>>> If the hadoop install is fixed, Whirr must be getting all default hadoop properties from the hadoop install itself, correct? I sent an email about configuring hadoop properties and you mentioned I need to put the modified scripts on a webserver that is publicly accessible. I was wondering if there is place inside hadoop install I can change so that I don't need to put the scripts on webserver (for security reasons). Do you think it is possible? If so, how? I do not mind customizing the jar file for our purposes. I want to change the following properties:
>>>
>>> mapred.reduce.tasks=24
>>> mapred.map.tasks=64
>>> mapred.child.java.opts=-Xmx2048m
>>>
>>> Thanks in advance.
>>> Praveen
>>>
>>> -----Original Message-----
>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>> Sent: Friday, January 28, 2011 4:02 PM
>>> To: whirr-user@incubator.apache.org
>>> Subject: Re: Running Mapred jobs after launching cluster
>>>
>>> It is fixed, and currently on 0.20.2. It will be made configurable in https://issues.apache.org/jira/browse/WHIRR-222.
>>>
>>> Cheers
>>> Tom
>>>
>>> On Fri, Jan 28, 2011 at 12:56 PM,  <pr...@nokia.com> wrote:
>>>> Hi Tom,
>>>> So the hadoop version is not going to change for a given Whirr install? I thought Whirr is getting hadoop install dynamically from a URL which is always going to have the latest hadoop version. If that is not the case I guess I am fine. I just don't want to get hadoop version mismatch 6 months after our software is released just because new hadoop version got released.
>>>>
>>>> Thanks
>>>> Praveen
>>>>
>>>> -----Original Message-----
>>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>>> Sent: Friday, January 28, 2011 3:35 PM
>>>> To: whirr-user@incubator.apache.org
>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>
>>>> On Fri, Jan 28, 2011 at 12:06 PM,  <pr...@nokia.com> wrote:
>>>>> Thanks Tom. I think I got it working with my own driver so I will go with it for now (unless that proves to be a bad option).
>>>>>
>>>>> BTW, could you tell me how to stick with one hadoop version while launching cluster. I have hadoop-0.20.2 in my classpath but it lookws like Whirr gets the latest hadoop from the repository. Since the latest version may be different depending on the time, I would like to stick to one version so that hadoop version mismatch won't happen.
>>>>
>>>> You do need to make sure that the versions are the same. See the Hadoop integration tests, which specify the version of Hadoop to use in their POM.
>>>>
>>>>>
>>>>> Also what jar files are necessary for launching cluster using Java. Currently I have cli version of jar file but that's way too large since it has ervrything in it.
>>>>
>>>> You need Whirr's core and Hadoop jars, as well as their dependencies.
>>>> If you look at the POMs in the source code they will tell you the dependencies.
>>>>
>>>> Cheers
>>>> Tom
>>>>
>>>>>
>>>>> Thanks
>>>>> Praveen
>>>>>
>>>>> -----Original Message-----
>>>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>>>> Sent: Friday, January 28, 2011 2:12 PM
>>>>> To: whirr-user@incubator.apache.org
>>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>>
>>>>> On Fri, Jan 28, 2011 at 6:28 AM,  <pr...@nokia.com> wrote:
>>>>>> Thanks Tom. Could you eloborate little more on the second option.
>>>>>>
>>>>>> What is the HADOOP_CONF_DIR here, after launching the cluster?
>>>>>
>>>>> ~/.whirr/<cluster-name>
>>>>>
>>>>>> When you said run in new process, did you mean using command line Whirr tool?
>>>>>
>>>>> I meant that you could launch Whirr using the CLI, or Java. Then run the job in another process, with HADOOP_CONF_DIR set.
>>>>>
>>>>> The MR jobs you are running I assume can be run against an arbitrary cluster, so you should be able to point them at a cluster started by Whirr.
>>>>>
>>>>> Tom
>>>>>
>>>>>>
>>>>>> I may finally end up writing my own driver for running external mapred jobs so I can have more control but I was just curious to know if option #2 is better than writing my own driver.
>>>>>>
>>>>>> Praveen
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: ext Tom White [mailto:tom@cloudera.com]
>>>>>> Sent: Thursday, January 27, 2011 4:01 PM
>>>>>> To: whirr-user@incubator.apache.org
>>>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>>>
>>>>>> If they implement the Tool interface then you can set configuration on them. Failing that you could set HADOOP_CONF_DIR and run them in a new process.
>>>>>>
>>>>>> Cheers,
>>>>>> Tom
>>>>>>
>>>>>> On Thu, Jan 27, 2011 at 12:52 PM,  <pr...@nokia.com> wrote:
>>>>>>> Hmm...
>>>>>>> I am running some of the map reduce jobs written by me but some of them are in external libraries (eg. Mahout) which I don't have control over. Since I can't modify the code in external libraries, is there any other way to make this work?
>>>>>>>
>>>>>>> Praveen
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>>>>>> Sent: Thursday, January 27, 2011 3:42 PM
>>>>>>> To: whirr-user@incubator.apache.org
>>>>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>>>>
>>>>>>> You don't need to add anything to the classpath, but you need to use the configuration in the org.apache.whirr.service.Cluster object to populate your Hadoop Configuration object so that your code knows which cluster to connect to. See the getConfiguration() method in HadoopServiceController for how to do this.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Tom
>>>>>>>
>>>>>>> On Thu, Jan 27, 2011 at 12:21 PM,  <pr...@nokia.com> wrote:
>>>>>>>> Hello all,
>>>>>>>> I wrote a java class HadoopLanucher that is very similar to
>>>>>>>> HadoopServiceController. I was succesfully able to launch a
>>>>>>>> cluster programtically from my application using Whirr. Now I
>>>>>>>> want to copy files to hdfs and also run a job progrmatically.
>>>>>>>>
>>>>>>>> When I copy a file to hdfs its copying to local file system, not hdfs.
>>>>>>>> Here is the code I used:
>>>>>>>>
>>>>>>>> Configuration conf = new Configuration(); FileSystem hdfs =
>>>>>>>> FileSystem.get(conf); hdfs.copyFromLocalFile(false, true, new
>>>>>>>> Path(localFilePath), new Path(hdfsFileDirectory));
>>>>>>>>
>>>>>>>> Do I need to add anything else to the classpath so Hadoop
>>>>>>>> libraries know that it needs to talk to the dynamically lanuched cluster?
>>>>>>>> When running Whirr from command line I know it uses
>>>>>>>> HADOOP_CONF_DIR to find the hadoop config files but when doing
>>>>>>>> the same from Java I am wondering how to solve this issue.
>>>>>>>>
>>>>>>>> Praveen
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

RE: Running Mapred jobs after launching cluster

Posted by pr...@nokia.com.
I am not using cdh (for now anyway) but the default hadoop. I even changed "localhost" to the machine's IP address and still no luck. It is likely that I am doing something wrong, but I am having a hard time debugging it.
Here are the properties I changed in /var/www/apache/hadoop/post-configure, but when I run the job I am not seeing these values:
  MAX_MAP_TASKS=16
  MAX_REDUCE_TASKS=24
  CHILD_OPTS=-Xmx1700m

Here is what I see in /tmp/runscript/runscript.sh on the master node. It doesn't look like it used my scripts...

installRunUrl || exit 1
runurl http://whirr.s3.amazonaws.com/0.3.0-incubating-SNAPSHOT/util/configure-hostnames -c cloudservers
runurl http://whirr.s3.amazonaws.com/0.3.0-incubating-SNAPSHOT/sun/java/install
runurl http://whirr.s3.amazonaws.com/0.3.0-incubating-SNAPSHOT/apache/hadoop/install -c cloudservers

Any suggestions?
Praveen
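
For comparison, if the override were picked up, the runurl lines in /tmp/runscript/runscript.sh would be expected to point at the custom base rather than whirr.s3.amazonaws.com, roughly along these lines (the host shown is a placeholder):

runurl http://<your-webserver>/util/configure-hostnames -c cloudservers
runurl http://<your-webserver>/sun/java/install
runurl http://<your-webserver>/apache/hadoop/install -c cloudservers
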
________________________________________
From: ext Tom White [tom.e.white@gmail.com]
Sent: Monday, January 31, 2011 6:23 PM
To: whirr-user@incubator.apache.org
Subject: Re: Running Mapred jobs after launching cluster

On Mon, Jan 31, 2011 at 3:03 PM,  <pr...@nokia.com> wrote:
> If I anyway have to upload the files to webservers, do I still need the patch then? It looks like the script has these properties that I can overwrite.

I suggested you look at the patch (WHIRR-55) so you can see how it
will be possible once it's committed. To try it out you need to upload
the scripts to a webserver (since the patch changes one of them).

>
> BTW I tried with webserver path and I could not make it work so far.
>
> 1. I copied scripts/apache folder to my /var/www folder and modified below 3 properties in /var/www/apache/hadoop/post-configure.
> 2. I changed hadoop.properties added following line
>        run-url-base=http://localhost/
> 3. Launched the cluster and verified the job properties are not what I changed to. They are all defaults.

This looks right to me. If you are using CDH you need to change
cloudera/cdh/post-configure.

>
> How do I debug this issue?

You can log into the instances (see the FAQ for how to do this) and
look at the scripts that actually ran (and their output) in the /tmp
directory.


Tom

>
> Praveen
>
>
> Launched the cluster and I didn't see child jvm have 2G alloc
> -----Original Message-----
> From: ext Tom White [mailto:tom.e.white@gmail.com]
> Sent: Monday, January 31, 2011 3:02 PM
> To: whirr-user@incubator.apache.org
> Subject: Re: Running Mapred jobs after launching cluster
>
> Hi Praveen,
>
> I think removing the webserver dependency (or making it optional) would be a good goal, but we're not there yet. I've just created
> https://issues.apache.org/jira/browse/WHIRR-225 as a place to discuss the design and implementation.
>
> In the meantime you could take a look at https://issues.apache.org/jira/browse/WHIRR-55, and try using the patch there to override some Hadoop properties (you will need to upload the scripts to a webserver still however, until it is committed, since it modifies Hadoop's post-configure script).
>
> Hope this helps.
>
> Cheers,
> Tom
>
> BTW what are the security concerns you have? There are no credentials embedded in the scripts, so it should be safe to host them publicly, no?
>
> On Mon, Jan 31, 2011 at 11:00 AM,  <pr...@nokia.com> wrote:
>> Hi Tom,
>> If the hadoop install is fixed, Whirr must be getting all default hadoop properties from the hadoop install itself, correct? I sent an email about configuring hadoop properties and you mentioned I need to put the modified scripts on a webserver that is publicly accessible. I was wondering if there is place inside hadoop install I can change so that I don't need to put the scripts on webserver (for security reasons). Do you think it is possible? If so, how? I do not mind customizing the jar file for our purposes. I want to change the following properties:
>>
>> mapred.reduce.tasks=24
>> mapred.map.tasks=64
>> mapred.child.java.opts=-Xmx2048m
>>
>> Thanks in advance.
>> Praveen
>>
>> -----Original Message-----
>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>> Sent: Friday, January 28, 2011 4:02 PM
>> To: whirr-user@incubator.apache.org
>> Subject: Re: Running Mapred jobs after launching cluster
>>
>> It is fixed, and currently on 0.20.2. It will be made configurable in https://issues.apache.org/jira/browse/WHIRR-222.
>>
>> Cheers
>> Tom
>>
>> On Fri, Jan 28, 2011 at 12:56 PM,  <pr...@nokia.com> wrote:
>>> Hi Tom,
>>> So the hadoop version is not going to change for a given Whirr install? I thought Whirr is getting hadoop install dynamically from a URL which is always going to have the latest hadoop version. If that is not the case I guess I am fine. I just don't want to get hadoop version mismatch 6 months after our software is released just because new hadoop version got released.
>>>
>>> Thanks
>>> Praveen
>>>
>>> -----Original Message-----
>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>> Sent: Friday, January 28, 2011 3:35 PM
>>> To: whirr-user@incubator.apache.org
>>> Subject: Re: Running Mapred jobs after launching cluster
>>>
>>> On Fri, Jan 28, 2011 at 12:06 PM,  <pr...@nokia.com> wrote:
>>>> Thanks Tom. I think I got it working with my own driver so I will go with it for now (unless that proves to be a bad option).
>>>>
>>>> BTW, could you tell me how to stick with one hadoop version while launching cluster. I have hadoop-0.20.2 in my classpath but it lookws like Whirr gets the latest hadoop from the repository. Since the latest version may be different depending on the time, I would like to stick to one version so that hadoop version mismatch won't happen.
>>>
>>> You do need to make sure that the versions are the same. See the Hadoop integration tests, which specify the version of Hadoop to use in their POM.
>>>
>>>>
>>>> Also what jar files are necessary for launching cluster using Java. Currently I have cli version of jar file but that's way too large since it has ervrything in it.
>>>
>>> You need Whirr's core and Hadoop jars, as well as their dependencies.
>>> If you look at the POMs in the source code they will tell you the dependencies.
>>>
>>> Cheers
>>> Tom
>>>
>>>>
>>>> Thanks
>>>> Praveen
>>>>
>>>> -----Original Message-----
>>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>>> Sent: Friday, January 28, 2011 2:12 PM
>>>> To: whirr-user@incubator.apache.org
>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>
>>>> On Fri, Jan 28, 2011 at 6:28 AM,  <pr...@nokia.com> wrote:
>>>>> Thanks Tom. Could you eloborate little more on the second option.
>>>>>
>>>>> What is the HADOOP_CONF_DIR here, after launching the cluster?
>>>>
>>>> ~/.whirr/<cluster-name>
>>>>
>>>>> When you said run in new process, did you mean using command line Whirr tool?
>>>>
>>>> I meant that you could launch Whirr using the CLI, or Java. Then run the job in another process, with HADOOP_CONF_DIR set.
>>>>
>>>> The MR jobs you are running I assume can be run against an arbitrary cluster, so you should be able to point them at a cluster started by Whirr.
>>>>
>>>> Tom
>>>>
>>>>>
>>>>> I may finally end up writing my own driver for running external mapred jobs so I can have more control but I was just curious to know if option #2 is better than writing my own driver.
>>>>>
>>>>> Praveen
>>>>>
>>>>> -----Original Message-----
>>>>> From: ext Tom White [mailto:tom@cloudera.com]
>>>>> Sent: Thursday, January 27, 2011 4:01 PM
>>>>> To: whirr-user@incubator.apache.org
>>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>>
>>>>> If they implement the Tool interface then you can set configuration on them. Failing that you could set HADOOP_CONF_DIR and run them in a new process.
>>>>>
>>>>> Cheers,
>>>>> Tom
>>>>>
>>>>> On Thu, Jan 27, 2011 at 12:52 PM,  <pr...@nokia.com> wrote:
>>>>>> Hmm...
>>>>>> I am running some of the map reduce jobs written by me but some of them are in external libraries (eg. Mahout) which I don't have control over. Since I can't modify the code in external libraries, is there any other way to make this work?
>>>>>>
>>>>>> Praveen
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>>>>> Sent: Thursday, January 27, 2011 3:42 PM
>>>>>> To: whirr-user@incubator.apache.org
>>>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>>>
>>>>>> You don't need to add anything to the classpath, but you need to use the configuration in the org.apache.whirr.service.Cluster object to populate your Hadoop Configuration object so that your code knows which cluster to connect to. See the getConfiguration() method in HadoopServiceController for how to do this.
>>>>>>
>>>>>> Cheers,
>>>>>> Tom
>>>>>>
>>>>>> On Thu, Jan 27, 2011 at 12:21 PM,  <pr...@nokia.com> wrote:
>>>>>>> Hello all,
>>>>>>> I wrote a java class HadoopLanucher that is very similar to
>>>>>>> HadoopServiceController. I was succesfully able to launch a
>>>>>>> cluster programtically from my application using Whirr. Now I
>>>>>>> want to copy files to hdfs and also run a job progrmatically.
>>>>>>>
>>>>>>> When I copy a file to hdfs its copying to local file system, not hdfs.
>>>>>>> Here is the code I used:
>>>>>>>
>>>>>>> Configuration conf = new Configuration(); FileSystem hdfs =
>>>>>>> FileSystem.get(conf); hdfs.copyFromLocalFile(false, true, new
>>>>>>> Path(localFilePath), new Path(hdfsFileDirectory));
>>>>>>>
>>>>>>> Do I need to add anything else to the classpath so Hadoop
>>>>>>> libraries know that it needs to talk to the dynamically lanuched cluster?
>>>>>>> When running Whirr from command line I know it uses
>>>>>>> HADOOP_CONF_DIR to find the hadoop config files but when doing
>>>>>>> the same from Java I am wondering how to solve this issue.
>>>>>>>
>>>>>>> Praveen
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Running Mapred jobs after launching cluster

Posted by Tom White <to...@gmail.com>.
On Mon, Jan 31, 2011 at 3:03 PM,  <pr...@nokia.com> wrote:
> If I anyway have to upload the files to webservers, do I still need the patch then? It looks like the script has these properties that I can overwrite.

I suggested you look at the patch (WHIRR-55) so you can see how it
will be possible once it's committed. To try it out you need to upload
the scripts to a webserver (since the patch changes one of them).

>
> BTW I tried with webserver path and I could not make it work so far.
>
> 1. I copied scripts/apache folder to my /var/www folder and modified below 3 properties in /var/www/apache/hadoop/post-configure.
> 2. I changed hadoop.properties added following line
>        run-url-base=http://localhost/
> 3. Launched the cluster and verified the job properties are not what I changed to. They are all defaults.

This looks right to me. If you are using CDH you need to change
cloudera/cdh/post-configure.

>
> How do I debug this issue?

You can log into the instances (see the FAQ for how to do this) and
look at the scripts that actually ran (and their output) in the /tmp
directory.


Tom

>
> Praveen
>
>
> Launched the cluster and I didn't see child jvm have 2G alloc
> -----Original Message-----
> From: ext Tom White [mailto:tom.e.white@gmail.com]
> Sent: Monday, January 31, 2011 3:02 PM
> To: whirr-user@incubator.apache.org
> Subject: Re: Running Mapred jobs after launching cluster
>
> Hi Praveen,
>
> I think removing the webserver dependency (or making it optional) would be a good goal, but we're not there yet. I've just created
> https://issues.apache.org/jira/browse/WHIRR-225 as a place to discuss the design and implementation.
>
> In the meantime you could take a look at https://issues.apache.org/jira/browse/WHIRR-55, and try using the patch there to override some Hadoop properties (you will need to upload the scripts to a webserver still however, until it is committed, since it modifies Hadoop's post-configure script).
>
> Hope this helps.
>
> Cheers,
> Tom
>
> BTW what are the security concerns you have? There are no credentials embedded in the scripts, so it should be safe to host them publicly, no?
>
> On Mon, Jan 31, 2011 at 11:00 AM,  <pr...@nokia.com> wrote:
>> Hi Tom,
>> If the hadoop install is fixed, Whirr must be getting all default hadoop properties from the hadoop install itself, correct? I sent an email about configuring hadoop properties and you mentioned I need to put the modified scripts on a webserver that is publicly accessible. I was wondering if there is place inside hadoop install I can change so that I don't need to put the scripts on webserver (for security reasons). Do you think it is possible? If so, how? I do not mind customizing the jar file for our purposes. I want to change the following properties:
>>
>> mapred.reduce.tasks=24
>> mapred.map.tasks=64
>> mapred.child.java.opts=-Xmx2048m
>>
>> Thanks in advance.
>> Praveen
>>
>> -----Original Message-----
>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>> Sent: Friday, January 28, 2011 4:02 PM
>> To: whirr-user@incubator.apache.org
>> Subject: Re: Running Mapred jobs after launching cluster
>>
>> It is fixed, and currently on 0.20.2. It will be made configurable in https://issues.apache.org/jira/browse/WHIRR-222.
>>
>> Cheers
>> Tom
>>
>> On Fri, Jan 28, 2011 at 12:56 PM,  <pr...@nokia.com> wrote:
>>> Hi Tom,
>>> So the hadoop version is not going to change for a given Whirr install? I thought Whirr is getting hadoop install dynamically from a URL which is always going to have the latest hadoop version. If that is not the case I guess I am fine. I just don't want to get hadoop version mismatch 6 months after our software is released just because new hadoop version got released.
>>>
>>> Thanks
>>> Praveen
>>>
>>> -----Original Message-----
>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>> Sent: Friday, January 28, 2011 3:35 PM
>>> To: whirr-user@incubator.apache.org
>>> Subject: Re: Running Mapred jobs after launching cluster
>>>
>>> On Fri, Jan 28, 2011 at 12:06 PM,  <pr...@nokia.com> wrote:
>>>> Thanks Tom. I think I got it working with my own driver so I will go with it for now (unless that proves to be a bad option).
>>>>
>>>> BTW, could you tell me how to stick with one hadoop version while launching cluster. I have hadoop-0.20.2 in my classpath but it lookws like Whirr gets the latest hadoop from the repository. Since the latest version may be different depending on the time, I would like to stick to one version so that hadoop version mismatch won't happen.
>>>
>>> You do need to make sure that the versions are the same. See the Hadoop integration tests, which specify the version of Hadoop to use in their POM.
>>>
>>>>
>>>> Also what jar files are necessary for launching cluster using Java. Currently I have cli version of jar file but that's way too large since it has ervrything in it.
>>>
>>> You need Whirr's core and Hadoop jars, as well as their dependencies.
>>> If you look at the POMs in the source code they will tell you the dependencies.
>>>
>>> Cheers
>>> Tom
>>>
>>>>
>>>> Thanks
>>>> Praveen
>>>>
>>>> -----Original Message-----
>>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>>> Sent: Friday, January 28, 2011 2:12 PM
>>>> To: whirr-user@incubator.apache.org
>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>
>>>> On Fri, Jan 28, 2011 at 6:28 AM,  <pr...@nokia.com> wrote:
>>>>> Thanks Tom. Could you eloborate little more on the second option.
>>>>>
>>>>> What is the HADOOP_CONF_DIR here, after launching the cluster?
>>>>
>>>> ~/.whirr/<cluster-name>
>>>>
>>>>> When you said run in new process, did you mean using command line Whirr tool?
>>>>
>>>> I meant that you could launch Whirr using the CLI, or Java. Then run the job in another process, with HADOOP_CONF_DIR set.
>>>>
>>>> The MR jobs you are running I assume can be run against an arbitrary cluster, so you should be able to point them at a cluster started by Whirr.
>>>>
>>>> Tom
>>>>
>>>>>
>>>>> I may finally end up writing my own driver for running external mapred jobs so I can have more control but I was just curious to know if option #2 is better than writing my own driver.
>>>>>
>>>>> Praveen
>>>>>
>>>>> -----Original Message-----
>>>>> From: ext Tom White [mailto:tom@cloudera.com]
>>>>> Sent: Thursday, January 27, 2011 4:01 PM
>>>>> To: whirr-user@incubator.apache.org
>>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>>
>>>>> If they implement the Tool interface then you can set configuration on them. Failing that you could set HADOOP_CONF_DIR and run them in a new process.
>>>>>
>>>>> Cheers,
>>>>> Tom
>>>>>
>>>>> On Thu, Jan 27, 2011 at 12:52 PM,  <pr...@nokia.com> wrote:
>>>>>> Hmm...
>>>>>> I am running some of the map reduce jobs written by me but some of them are in external libraries (eg. Mahout) which I don't have control over. Since I can't modify the code in external libraries, is there any other way to make this work?
>>>>>>
>>>>>> Praveen
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>>>>> Sent: Thursday, January 27, 2011 3:42 PM
>>>>>> To: whirr-user@incubator.apache.org
>>>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>>>
>>>>>> You don't need to add anything to the classpath, but you need to use the configuration in the org.apache.whirr.service.Cluster object to populate your Hadoop Configuration object so that your code knows which cluster to connect to. See the getConfiguration() method in HadoopServiceController for how to do this.
>>>>>>
>>>>>> Cheers,
>>>>>> Tom
>>>>>>
>>>>>> On Thu, Jan 27, 2011 at 12:21 PM,  <pr...@nokia.com> wrote:
>>>>>>> Hello all,
>>>>>>> I wrote a java class HadoopLanucher that is very similar to
>>>>>>> HadoopServiceController. I was succesfully able to launch a
>>>>>>> cluster programtically from my application using Whirr. Now I
>>>>>>> want to copy files to hdfs and also run a job progrmatically.
>>>>>>>
>>>>>>> When I copy a file to hdfs its copying to local file system, not hdfs.
>>>>>>> Here is the code I used:
>>>>>>>
>>>>>>> Configuration conf = new Configuration(); FileSystem hdfs =
>>>>>>> FileSystem.get(conf); hdfs.copyFromLocalFile(false, true, new
>>>>>>> Path(localFilePath), new Path(hdfsFileDirectory));
>>>>>>>
>>>>>>> Do I need to add anything else to the classpath so Hadoop
>>>>>>> libraries know that it needs to talk to the dynamically lanuched cluster?
>>>>>>> When running Whirr from command line I know it uses
>>>>>>> HADOOP_CONF_DIR to find the hadoop config files but when doing
>>>>>>> the same from Java I am wondering how to solve this issue.
>>>>>>>
>>>>>>> Praveen
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

RE: Running Mapred jobs after launching cluster

Posted by pr...@nokia.com.
If I have to upload the files to a webserver anyway, do I still need the patch? It looks like the script already has these properties, so I can overwrite them.

BTW, I tried the webserver approach and I could not make it work so far.

1. I copied the scripts/apache folder to my /var/www folder and modified the three properties below in /var/www/apache/hadoop/post-configure.
2. I changed hadoop.properties and added the following line:
	run-url-base=http://localhost/
3. I launched the cluster and verified that the job properties are not what I changed them to. They are all defaults.

How do I debug this issue?

Praveen


I launched the cluster and didn't see the child JVM get the 2 GB allocation.
-----Original Message-----
From: ext Tom White [mailto:tom.e.white@gmail.com] 
Sent: Monday, January 31, 2011 3:02 PM
To: whirr-user@incubator.apache.org
Subject: Re: Running Mapred jobs after launching cluster

Hi Praveen,

I think removing the webserver dependency (or making it optional) would be a good goal, but we're not there yet. I've just created
https://issues.apache.org/jira/browse/WHIRR-225 as a place to discuss the design and implementation.

In the meantime you could take a look at https://issues.apache.org/jira/browse/WHIRR-55, and try using the patch there to override some Hadoop properties (you will need to upload the scripts to a webserver still however, until it is committed, since it modifies Hadoop's post-configure script).

Hope this helps.

Cheers,
Tom

BTW what are the security concerns you have? There are no credentials embedded in the scripts, so it should be safe to host them publicly, no?

On Mon, Jan 31, 2011 at 11:00 AM,  <pr...@nokia.com> wrote:
> Hi Tom,
> If the hadoop install is fixed, Whirr must be getting all default hadoop properties from the hadoop install itself, correct? I sent an email about configuring hadoop properties and you mentioned I need to put the modified scripts on a webserver that is publicly accessible. I was wondering if there is place inside hadoop install I can change so that I don't need to put the scripts on webserver (for security reasons). Do you think it is possible? If so, how? I do not mind customizing the jar file for our purposes. I want to change the following properties:
>
> mapred.reduce.tasks=24
> mapred.map.tasks=64
> mapred.child.java.opts=-Xmx2048m
>
> Thanks in advance.
> Praveen
>
> -----Original Message-----
> From: ext Tom White [mailto:tom.e.white@gmail.com]
> Sent: Friday, January 28, 2011 4:02 PM
> To: whirr-user@incubator.apache.org
> Subject: Re: Running Mapred jobs after launching cluster
>
> It is fixed, and currently on 0.20.2. It will be made configurable in https://issues.apache.org/jira/browse/WHIRR-222.
>
> Cheers
> Tom
>
> On Fri, Jan 28, 2011 at 12:56 PM,  <pr...@nokia.com> wrote:
>> Hi Tom,
>> So the hadoop version is not going to change for a given Whirr install? I thought Whirr is getting hadoop install dynamically from a URL which is always going to have the latest hadoop version. If that is not the case I guess I am fine. I just don't want to get hadoop version mismatch 6 months after our software is released just because new hadoop version got released.
>>
>> Thanks
>> Praveen
>>
>> -----Original Message-----
>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>> Sent: Friday, January 28, 2011 3:35 PM
>> To: whirr-user@incubator.apache.org
>> Subject: Re: Running Mapred jobs after launching cluster
>>
>> On Fri, Jan 28, 2011 at 12:06 PM,  <pr...@nokia.com> wrote:
>>> Thanks Tom. I think I got it working with my own driver so I will go with it for now (unless that proves to be a bad option).
>>>
>>> BTW, could you tell me how to stick with one hadoop version while launching cluster. I have hadoop-0.20.2 in my classpath but it lookws like Whirr gets the latest hadoop from the repository. Since the latest version may be different depending on the time, I would like to stick to one version so that hadoop version mismatch won't happen.
>>
>> You do need to make sure that the versions are the same. See the Hadoop integration tests, which specify the version of Hadoop to use in their POM.
>>
>>>
>>> Also what jar files are necessary for launching cluster using Java. Currently I have cli version of jar file but that's way too large since it has ervrything in it.
>>
>> You need Whirr's core and Hadoop jars, as well as their dependencies.
>> If you look at the POMs in the source code they will tell you the dependencies.
>>
>> Cheers
>> Tom
>>
>>>
>>> Thanks
>>> Praveen
>>>
>>> -----Original Message-----
>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>> Sent: Friday, January 28, 2011 2:12 PM
>>> To: whirr-user@incubator.apache.org
>>> Subject: Re: Running Mapred jobs after launching cluster
>>>
>>> On Fri, Jan 28, 2011 at 6:28 AM,  <pr...@nokia.com> wrote:
>>>> Thanks Tom. Could you eloborate little more on the second option.
>>>>
>>>> What is the HADOOP_CONF_DIR here, after launching the cluster?
>>>
>>> ~/.whirr/<cluster-name>
>>>
>>>> When you said run in new process, did you mean using command line Whirr tool?
>>>
>>> I meant that you could launch Whirr using the CLI, or Java. Then run the job in another process, with HADOOP_CONF_DIR set.
>>>
>>> The MR jobs you are running I assume can be run against an arbitrary cluster, so you should be able to point them at a cluster started by Whirr.
>>>
>>> Tom
>>>
>>>>
>>>> I may finally end up writing my own driver for running external mapred jobs so I can have more control but I was just curious to know if option #2 is better than writing my own driver.
>>>>
>>>> Praveen
>>>>
>>>> -----Original Message-----
>>>> From: ext Tom White [mailto:tom@cloudera.com]
>>>> Sent: Thursday, January 27, 2011 4:01 PM
>>>> To: whirr-user@incubator.apache.org
>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>
>>>> If they implement the Tool interface then you can set configuration on them. Failing that you could set HADOOP_CONF_DIR and run them in a new process.
>>>>
>>>> Cheers,
>>>> Tom
>>>>
>>>> On Thu, Jan 27, 2011 at 12:52 PM,  <pr...@nokia.com> wrote:
>>>>> Hmm...
>>>>> I am running some of the map reduce jobs written by me but some of them are in external libraries (eg. Mahout) which I don't have control over. Since I can't modify the code in external libraries, is there any other way to make this work?
>>>>>
>>>>> Praveen
>>>>>
>>>>> -----Original Message-----
>>>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>>>> Sent: Thursday, January 27, 2011 3:42 PM
>>>>> To: whirr-user@incubator.apache.org
>>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>>
>>>>> You don't need to add anything to the classpath, but you need to use the configuration in the org.apache.whirr.service.Cluster object to populate your Hadoop Configuration object so that your code knows which cluster to connect to. See the getConfiguration() method in HadoopServiceController for how to do this.
>>>>>
>>>>> Cheers,
>>>>> Tom
>>>>>
>>>>> On Thu, Jan 27, 2011 at 12:21 PM,  <pr...@nokia.com> wrote:
>>>>>> Hello all,
>>>>>> I wrote a java class HadoopLanucher that is very similar to 
>>>>>> HadoopServiceController. I was succesfully able to launch a 
>>>>>> cluster programtically from my application using Whirr. Now I 
>>>>>> want to copy files to hdfs and also run a job progrmatically.
>>>>>>
>>>>>> When I copy a file to hdfs its copying to local file system, not hdfs.
>>>>>> Here is the code I used:
>>>>>>
>>>>>> Configuration conf = new Configuration(); FileSystem hdfs = 
>>>>>> FileSystem.get(conf); hdfs.copyFromLocalFile(false, true, new 
>>>>>> Path(localFilePath), new Path(hdfsFileDirectory));
>>>>>>
>>>>>> Do I need to add anything else to the classpath so Hadoop 
>>>>>> libraries know that it needs to talk to the dynamically lanuched cluster?
>>>>>> When running Whirr from command line I know it uses 
>>>>>> HADOOP_CONF_DIR to find the hadoop config files but when doing 
>>>>>> the same from Java I am wondering how to solve this issue.
>>>>>>
>>>>>> Praveen
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Running Mapred jobs after launching cluster

Posted by Tom White <to...@gmail.com>.
Hi Praveen,

I think removing the webserver dependency (or making it optional)
would be a good goal, but we're not there yet. I've just created
https://issues.apache.org/jira/browse/WHIRR-225 as a place to discuss
the design and implementation.

In the meantime you could take a look at
https://issues.apache.org/jira/browse/WHIRR-55, and try using the
patch there to override some Hadoop properties (you will need to
upload the scripts to a webserver still however, until it is
committed, since it modifies Hadoop's post-configure script).

Hope this helps.

Cheers,
Tom

BTW what are the security concerns you have? There are no credentials
embedded in the scripts, so it should be safe to host them publicly,
no?

On Mon, Jan 31, 2011 at 11:00 AM,  <pr...@nokia.com> wrote:
> Hi Tom,
> If the hadoop install is fixed, Whirr must be getting all default hadoop properties from the hadoop install itself, correct? I sent an email about configuring hadoop properties and you mentioned I need to put the modified scripts on a webserver that is publicly accessible. I was wondering if there is place inside hadoop install I can change so that I don't need to put the scripts on webserver (for security reasons). Do you think it is possible? If so, how? I do not mind customizing the jar file for our purposes. I want to change the following properties:
>
> mapred.reduce.tasks=24
> mapred.map.tasks=64
> mapred.child.java.opts=-Xmx2048m
>
> Thanks in advance.
> Praveen
>
> -----Original Message-----
> From: ext Tom White [mailto:tom.e.white@gmail.com]
> Sent: Friday, January 28, 2011 4:02 PM
> To: whirr-user@incubator.apache.org
> Subject: Re: Running Mapred jobs after launching cluster
>
> It is fixed, and currently on 0.20.2. It will be made configurable in https://issues.apache.org/jira/browse/WHIRR-222.
>
> Cheers
> Tom
>
> On Fri, Jan 28, 2011 at 12:56 PM,  <pr...@nokia.com> wrote:
>> Hi Tom,
>> So the hadoop version is not going to change for a given Whirr install? I thought Whirr is getting hadoop install dynamically from a URL which is always going to have the latest hadoop version. If that is not the case I guess I am fine. I just don't want to get hadoop version mismatch 6 months after our software is released just because new hadoop version got released.
>>
>> Thanks
>> Praveen
>>
>> -----Original Message-----
>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>> Sent: Friday, January 28, 2011 3:35 PM
>> To: whirr-user@incubator.apache.org
>> Subject: Re: Running Mapred jobs after launching cluster
>>
>> On Fri, Jan 28, 2011 at 12:06 PM,  <pr...@nokia.com> wrote:
>>> Thanks Tom. I think I got it working with my own driver so I will go with it for now (unless that proves to be a bad option).
>>>
>>> BTW, could you tell me how to stick with one hadoop version while launching cluster. I have hadoop-0.20.2 in my classpath but it lookws like Whirr gets the latest hadoop from the repository. Since the latest version may be different depending on the time, I would like to stick to one version so that hadoop version mismatch won't happen.
>>
>> You do need to make sure that the versions are the same. See the Hadoop integration tests, which specify the version of Hadoop to use in their POM.
>>
>>>
>>> Also what jar files are necessary for launching cluster using Java. Currently I have cli version of jar file but that's way too large since it has ervrything in it.
>>
>> You need Whirr's core and Hadoop jars, as well as their dependencies.
>> If you look at the POMs in the source code they will tell you the dependencies.
>>
>> Cheers
>> Tom
>>
>>>
>>> Thanks
>>> Praveen
>>>
>>> -----Original Message-----
>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>> Sent: Friday, January 28, 2011 2:12 PM
>>> To: whirr-user@incubator.apache.org
>>> Subject: Re: Running Mapred jobs after launching cluster
>>>
>>> On Fri, Jan 28, 2011 at 6:28 AM,  <pr...@nokia.com> wrote:
>>>> Thanks Tom. Could you eloborate little more on the second option.
>>>>
>>>> What is the HADOOP_CONF_DIR here, after launching the cluster?
>>>
>>> ~/.whirr/<cluster-name>
>>>
>>>> When you said run in new process, did you mean using command line Whirr tool?
>>>
>>> I meant that you could launch Whirr using the CLI, or Java. Then run the job in another process, with HADOOP_CONF_DIR set.
>>>
>>> The MR jobs you are running I assume can be run against an arbitrary cluster, so you should be able to point them at a cluster started by Whirr.
>>>
>>> Tom
>>>
>>>>
>>>> I may finally end up writing my own driver for running external mapred jobs so I can have more control but I was just curious to know if option #2 is better than writing my own driver.
>>>>
>>>> Praveen
>>>>
>>>> -----Original Message-----
>>>> From: ext Tom White [mailto:tom@cloudera.com]
>>>> Sent: Thursday, January 27, 2011 4:01 PM
>>>> To: whirr-user@incubator.apache.org
>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>
>>>> If they implement the Tool interface then you can set configuration on them. Failing that you could set HADOOP_CONF_DIR and run them in a new process.
>>>>
>>>> Cheers,
>>>> Tom
>>>>
>>>> On Thu, Jan 27, 2011 at 12:52 PM,  <pr...@nokia.com> wrote:
>>>>> Hmm...
>>>>> I am running some of the map reduce jobs written by me but some of them are in external libraries (eg. Mahout) which I don't have control over. Since I can't modify the code in external libraries, is there any other way to make this work?
>>>>>
>>>>> Praveen
>>>>>
>>>>> -----Original Message-----
>>>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>>>> Sent: Thursday, January 27, 2011 3:42 PM
>>>>> To: whirr-user@incubator.apache.org
>>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>>
>>>>> You don't need to add anything to the classpath, but you need to use the configuration in the org.apache.whirr.service.Cluster object to populate your Hadoop Configuration object so that your code knows which cluster to connect to. See the getConfiguration() method in HadoopServiceController for how to do this.
>>>>>
>>>>> Cheers,
>>>>> Tom
>>>>>
>>>>> On Thu, Jan 27, 2011 at 12:21 PM,  <pr...@nokia.com> wrote:
>>>>>> Hello all,
>>>>>> I wrote a java class HadoopLanucher that is very similar to
>>>>>> HadoopServiceController. I was succesfully able to launch a
>>>>>> cluster programtically from my application using Whirr. Now I want
>>>>>> to copy files to hdfs and also run a job progrmatically.
>>>>>>
>>>>>> When I copy a file to hdfs its copying to local file system, not hdfs.
>>>>>> Here is the code I used:
>>>>>>
>>>>>> Configuration conf = new Configuration(); FileSystem hdfs =
>>>>>> FileSystem.get(conf); hdfs.copyFromLocalFile(false, true, new
>>>>>> Path(localFilePath), new Path(hdfsFileDirectory));
>>>>>>
>>>>>> Do I need to add anything else to the classpath so Hadoop
>>>>>> libraries know that it needs to talk to the dynamically lanuched cluster?
>>>>>> When running Whirr from command line I know it uses
>>>>>> HADOOP_CONF_DIR to find the hadoop config files but when doing the
>>>>>> same from Java I am wondering how to solve this issue.
>>>>>>
>>>>>> Praveen
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

RE: Running Mapred jobs after launching cluster

Posted by pr...@nokia.com.
Hi Tom,
If the hadoop install is fixed, Whirr must be getting all the default hadoop properties from the hadoop install itself, correct? I sent an email about configuring hadoop properties and you mentioned I need to put the modified scripts on a publicly accessible webserver. I was wondering if there is a place inside the hadoop install I can change so that I don't need to put the scripts on a webserver (for security reasons). Do you think that is possible? If so, how? I don't mind customizing the jar file for our purposes. I want to change the following properties:

mapred.reduce.tasks=24
mapred.map.tasks=64
mapred.child.java.opts=-Xmx2048m

Thanks in advance.
Praveen
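
These can also be set per job on the client side (a per-job override rather than a cluster-wide default), if that is enough for the use case. A minimal sketch, assuming clusterConf is the Hadoop Configuration already built for the launched cluster; the class name is illustrative:

import org.apache.hadoop.conf.Configuration;

public class JobPropertyOverrides {
  // Per-job overrides of the properties above; the map task count is only a
  // hint to the framework, while the reducer count and child JVM opts are honoured.
  public static Configuration withOverrides(Configuration clusterConf) {
    Configuration conf = new Configuration(clusterConf);
    conf.setInt("mapred.reduce.tasks", 24);
    conf.setInt("mapred.map.tasks", 64);
    conf.set("mapred.child.java.opts", "-Xmx2048m");
    return conf;
  }
}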

-----Original Message-----
From: ext Tom White [mailto:tom.e.white@gmail.com] 
Sent: Friday, January 28, 2011 4:02 PM
To: whirr-user@incubator.apache.org
Subject: Re: Running Mapred jobs after launching cluster

It is fixed, and currently on 0.20.2. It will be made configurable in https://issues.apache.org/jira/browse/WHIRR-222.

Cheers
Tom

On Fri, Jan 28, 2011 at 12:56 PM,  <pr...@nokia.com> wrote:
> Hi Tom,
> So the hadoop version is not going to change for a given Whirr install? I thought Whirr is getting hadoop install dynamically from a URL which is always going to have the latest hadoop version. If that is not the case I guess I am fine. I just don't want to get hadoop version mismatch 6 months after our software is released just because new hadoop version got released.
>
> Thanks
> Praveen
>
> -----Original Message-----
> From: ext Tom White [mailto:tom.e.white@gmail.com]
> Sent: Friday, January 28, 2011 3:35 PM
> To: whirr-user@incubator.apache.org
> Subject: Re: Running Mapred jobs after launching cluster
>
> On Fri, Jan 28, 2011 at 12:06 PM,  <pr...@nokia.com> wrote:
>> Thanks Tom. I think I got it working with my own driver so I will go with it for now (unless that proves to be a bad option).
>>
>> BTW, could you tell me how to stick with one hadoop version while launching cluster. I have hadoop-0.20.2 in my classpath but it lookws like Whirr gets the latest hadoop from the repository. Since the latest version may be different depending on the time, I would like to stick to one version so that hadoop version mismatch won't happen.
>
> You do need to make sure that the versions are the same. See the Hadoop integration tests, which specify the version of Hadoop to use in their POM.
>
>>
>> Also what jar files are necessary for launching cluster using Java. Currently I have cli version of jar file but that's way too large since it has ervrything in it.
>
> You need Whirr's core and Hadoop jars, as well as their dependencies.
> If you look at the POMs in the source code they will tell you the dependencies.
>
> Cheers
> Tom
>
>>
>> Thanks
>> Praveen
>>
>> -----Original Message-----
>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>> Sent: Friday, January 28, 2011 2:12 PM
>> To: whirr-user@incubator.apache.org
>> Subject: Re: Running Mapred jobs after launching cluster
>>
>> On Fri, Jan 28, 2011 at 6:28 AM,  <pr...@nokia.com> wrote:
>>> Thanks Tom. Could you eloborate little more on the second option.
>>>
>>> What is the HADOOP_CONF_DIR here, after launching the cluster?
>>
>> ~/.whirr/<cluster-name>
>>
>>> When you said run in new process, did you mean using command line Whirr tool?
>>
>> I meant that you could launch Whirr using the CLI, or Java. Then run the job in another process, with HADOOP_CONF_DIR set.
>>
>> The MR jobs you are running I assume can be run against an arbitrary cluster, so you should be able to point them at a cluster started by Whirr.
>>
>> Tom
>>
>>>
>>> I may finally end up writing my own driver for running external mapred jobs so I can have more control but I was just curious to know if option #2 is better than writing my own driver.
>>>
>>> Praveen
>>>
>>> -----Original Message-----
>>> From: ext Tom White [mailto:tom@cloudera.com]
>>> Sent: Thursday, January 27, 2011 4:01 PM
>>> To: whirr-user@incubator.apache.org
>>> Subject: Re: Running Mapred jobs after launching cluster
>>>
>>> If they implement the Tool interface then you can set configuration on them. Failing that you could set HADOOP_CONF_DIR and run them in a new process.
>>>
>>> Cheers,
>>> Tom
>>>
>>> On Thu, Jan 27, 2011 at 12:52 PM,  <pr...@nokia.com> wrote:
>>>> Hmm...
>>>> I am running some of the map reduce jobs written by me but some of them are in external libraries (eg. Mahout) which I don't have control over. Since I can't modify the code in external libraries, is there any other way to make this work?
>>>>
>>>> Praveen
>>>>
>>>> -----Original Message-----
>>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>>> Sent: Thursday, January 27, 2011 3:42 PM
>>>> To: whirr-user@incubator.apache.org
>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>
>>>> You don't need to add anything to the classpath, but you need to use the configuration in the org.apache.whirr.service.Cluster object to populate your Hadoop Configuration object so that your code knows which cluster to connect to. See the getConfiguration() method in HadoopServiceController for how to do this.
>>>>
>>>> Cheers,
>>>> Tom
>>>>
>>>> On Thu, Jan 27, 2011 at 12:21 PM,  <pr...@nokia.com> wrote:
>>>>> Hello all,
>>>>> I wrote a java class HadoopLanucher that is very similar to 
>>>>> HadoopServiceController. I was succesfully able to launch a 
>>>>> cluster programtically from my application using Whirr. Now I want 
>>>>> to copy files to hdfs and also run a job progrmatically.
>>>>>
>>>>> When I copy a file to hdfs its copying to local file system, not hdfs.
>>>>> Here is the code I used:
>>>>>
>>>>> Configuration conf = new Configuration(); FileSystem hdfs = 
>>>>> FileSystem.get(conf); hdfs.copyFromLocalFile(false, true, new 
>>>>> Path(localFilePath), new Path(hdfsFileDirectory));
>>>>>
>>>>> Do I need to add anything else to the classpath so Hadoop 
>>>>> libraries know that it needs to talk to the dynamically lanuched cluster?
>>>>> When running Whirr from command line I know it uses 
>>>>> HADOOP_CONF_DIR to find the hadoop config files but when doing the 
>>>>> same from Java I am wondering how to solve this issue.
>>>>>
>>>>> Praveen
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Running Mapred jobs after launching cluster

Posted by Tom White <to...@gmail.com>.
It is fixed, and currently on 0.20.2. It will be made configurable in
https://issues.apache.org/jira/browse/WHIRR-222.

Cheers
Tom

On Fri, Jan 28, 2011 at 12:56 PM,  <pr...@nokia.com> wrote:
> Hi Tom,
> So the hadoop version is not going to change for a given Whirr install? I thought Whirr is getting hadoop install dynamically from a URL which is always going to have the latest hadoop version. If that is not the case I guess I am fine. I just don't want to get hadoop version mismatch 6 months after our software is released just because new hadoop version got released.
>
> Thanks
> Praveen
>
> -----Original Message-----
> From: ext Tom White [mailto:tom.e.white@gmail.com]
> Sent: Friday, January 28, 2011 3:35 PM
> To: whirr-user@incubator.apache.org
> Subject: Re: Running Mapred jobs after launching cluster
>
> On Fri, Jan 28, 2011 at 12:06 PM,  <pr...@nokia.com> wrote:
>> Thanks Tom. I think I got it working with my own driver so I will go with it for now (unless that proves to be a bad option).
>>
>> BTW, could you tell me how to stick with one hadoop version while launching cluster. I have hadoop-0.20.2 in my classpath but it lookws like Whirr gets the latest hadoop from the repository. Since the latest version may be different depending on the time, I would like to stick to one version so that hadoop version mismatch won't happen.
>
> You do need to make sure that the versions are the same. See the Hadoop integration tests, which specify the version of Hadoop to use in their POM.
>
>>
>> Also what jar files are necessary for launching cluster using Java. Currently I have cli version of jar file but that's way too large since it has ervrything in it.
>
> You need Whirr's core and Hadoop jars, as well as their dependencies.
> If you look at the POMs in the source code they will tell you the dependencies.
>
> Cheers
> Tom
>
>>
>> Thanks
>> Praveen
>>
>> -----Original Message-----
>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>> Sent: Friday, January 28, 2011 2:12 PM
>> To: whirr-user@incubator.apache.org
>> Subject: Re: Running Mapred jobs after launching cluster
>>
>> On Fri, Jan 28, 2011 at 6:28 AM,  <pr...@nokia.com> wrote:
>>> Thanks Tom. Could you eloborate little more on the second option.
>>>
>>> What is the HADOOP_CONF_DIR here, after launching the cluster?
>>
>> ~/.whirr/<cluster-name>
>>
>>> When you said run in new process, did you mean using command line Whirr tool?
>>
>> I meant that you could launch Whirr using the CLI, or Java. Then run the job in another process, with HADOOP_CONF_DIR set.
>>
>> The MR jobs you are running I assume can be run against an arbitrary cluster, so you should be able to point them at a cluster started by Whirr.
>>
>> Tom
>>
>>>
>>> I may finally end up writing my own driver for running external mapred jobs so I can have more control but I was just curious to know if option #2 is better than writing my own driver.
>>>
>>> Praveen
>>>
>>> -----Original Message-----
>>> From: ext Tom White [mailto:tom@cloudera.com]
>>> Sent: Thursday, January 27, 2011 4:01 PM
>>> To: whirr-user@incubator.apache.org
>>> Subject: Re: Running Mapred jobs after launching cluster
>>>
>>> If they implement the Tool interface then you can set configuration on them. Failing that you could set HADOOP_CONF_DIR and run them in a new process.
>>>
>>> Cheers,
>>> Tom
>>>
>>> On Thu, Jan 27, 2011 at 12:52 PM,  <pr...@nokia.com> wrote:
>>>> Hmm...
>>>> I am running some of the map reduce jobs written by me but some of them are in external libraries (eg. Mahout) which I don't have control over. Since I can't modify the code in external libraries, is there any other way to make this work?
>>>>
>>>> Praveen
>>>>
>>>> -----Original Message-----
>>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>>> Sent: Thursday, January 27, 2011 3:42 PM
>>>> To: whirr-user@incubator.apache.org
>>>> Subject: Re: Running Mapred jobs after launching cluster
>>>>
>>>> You don't need to add anything to the classpath, but you need to use the configuration in the org.apache.whirr.service.Cluster object to populate your Hadoop Configuration object so that your code knows which cluster to connect to. See the getConfiguration() method in HadoopServiceController for how to do this.
>>>>
>>>> Cheers,
>>>> Tom
>>>>
>>>> On Thu, Jan 27, 2011 at 12:21 PM,  <pr...@nokia.com> wrote:
>>>>> Hello all,
>>>>> I wrote a java class HadoopLanucher that is very similar to
>>>>> HadoopServiceController. I was succesfully able to launch a cluster
>>>>> programtically from my application using Whirr. Now I want to copy
>>>>> files to hdfs and also run a job progrmatically.
>>>>>
>>>>> When I copy a file to hdfs its copying to local file system, not hdfs.
>>>>> Here is the code I used:
>>>>>
>>>>> Configuration conf = new Configuration(); FileSystem hdfs =
>>>>> FileSystem.get(conf); hdfs.copyFromLocalFile(false, true, new
>>>>> Path(localFilePath), new Path(hdfsFileDirectory));
>>>>>
>>>>> Do I need to add anything else to the classpath so Hadoop libraries
>>>>> know that it needs to talk to the dynamically lanuched cluster?
>>>>> When running Whirr from command line I know it uses HADOOP_CONF_DIR
>>>>> to find the hadoop config files but when doing the same from Java I
>>>>> am wondering how to solve this issue.
>>>>>
>>>>> Praveen
>>>>>
>>>>>
>>>>
>>>
>>
>

RE: Running Mapred jobs after launching cluster

Posted by pr...@nokia.com.
Hi Tom,
So the hadoop version is not going to change for a given Whirr install? I thought Whirr fetched the hadoop install dynamically from a URL that always points to the latest hadoop version. If that is not the case, I guess I am fine. I just don't want a hadoop version mismatch six months after our software is released just because a new hadoop version came out.

Thanks
Praveen

-----Original Message-----
From: ext Tom White [mailto:tom.e.white@gmail.com] 
Sent: Friday, January 28, 2011 3:35 PM
To: whirr-user@incubator.apache.org
Subject: Re: Running Mapred jobs after launching cluster

On Fri, Jan 28, 2011 at 12:06 PM,  <pr...@nokia.com> wrote:
> Thanks Tom. I think I got it working with my own driver so I will go with it for now (unless that proves to be a bad option).
>
> BTW, could you tell me how to stick with one hadoop version while launching cluster. I have hadoop-0.20.2 in my classpath but it lookws like Whirr gets the latest hadoop from the repository. Since the latest version may be different depending on the time, I would like to stick to one version so that hadoop version mismatch won't happen.

You do need to make sure that the versions are the same. See the Hadoop integration tests, which specify the version of Hadoop to use in their POM.

>
> Also what jar files are necessary for launching cluster using Java. Currently I have cli version of jar file but that's way too large since it has ervrything in it.

You need Whirr's core and Hadoop jars, as well as their dependencies.
If you look at the POMs in the source code they will tell you the dependencies.

Cheers
Tom

>
> Thanks
> Praveen
>
> -----Original Message-----
> From: ext Tom White [mailto:tom.e.white@gmail.com]
> Sent: Friday, January 28, 2011 2:12 PM
> To: whirr-user@incubator.apache.org
> Subject: Re: Running Mapred jobs after launching cluster
>
> On Fri, Jan 28, 2011 at 6:28 AM,  <pr...@nokia.com> wrote:
>> Thanks Tom. Could you eloborate little more on the second option.
>>
>> What is the HADOOP_CONF_DIR here, after launching the cluster?
>
> ~/.whirr/<cluster-name>
>
>> When you said run in new process, did you mean using command line Whirr tool?
>
> I meant that you could launch Whirr using the CLI, or Java. Then run the job in another process, with HADOOP_CONF_DIR set.
>
> The MR jobs you are running I assume can be run against an arbitrary cluster, so you should be able to point them at a cluster started by Whirr.
>
> Tom
>
>>
>> I may finally end up writing my own driver for running external mapred jobs so I can have more control but I was just curious to know if option #2 is better than writing my own driver.
>>
>> Praveen
>>
>> -----Original Message-----
>> From: ext Tom White [mailto:tom@cloudera.com]
>> Sent: Thursday, January 27, 2011 4:01 PM
>> To: whirr-user@incubator.apache.org
>> Subject: Re: Running Mapred jobs after launching cluster
>>
>> If they implement the Tool interface then you can set configuration on them. Failing that you could set HADOOP_CONF_DIR and run them in a new process.
>>
>> Cheers,
>> Tom
>>
>> On Thu, Jan 27, 2011 at 12:52 PM,  <pr...@nokia.com> wrote:
>>> Hmm...
>>> I am running some of the map reduce jobs written by me but some of them are in external libraries (eg. Mahout) which I don't have control over. Since I can't modify the code in external libraries, is there any other way to make this work?
>>>
>>> Praveen
>>>
>>> -----Original Message-----
>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>> Sent: Thursday, January 27, 2011 3:42 PM
>>> To: whirr-user@incubator.apache.org
>>> Subject: Re: Running Mapred jobs after launching cluster
>>>
>>> You don't need to add anything to the classpath, but you need to use the configuration in the org.apache.whirr.service.Cluster object to populate your Hadoop Configuration object so that your code knows which cluster to connect to. See the getConfiguration() method in HadoopServiceController for how to do this.
>>>
>>> Cheers,
>>> Tom
>>>
>>> On Thu, Jan 27, 2011 at 12:21 PM,  <pr...@nokia.com> wrote:
>>>> Hello all,
>>>> I wrote a java class HadoopLanucher that is very similar to 
>>>> HadoopServiceController. I was succesfully able to launch a cluster 
>>>> programtically from my application using Whirr. Now I want to copy 
>>>> files to hdfs and also run a job progrmatically.
>>>>
>>>> When I copy a file to hdfs its copying to local file system, not hdfs.
>>>> Here is the code I used:
>>>>
>>>> Configuration conf = new Configuration(); FileSystem hdfs = 
>>>> FileSystem.get(conf); hdfs.copyFromLocalFile(false, true, new 
>>>> Path(localFilePath), new Path(hdfsFileDirectory));
>>>>
>>>> Do I need to add anything else to the classpath so Hadoop libraries 
>>>> know that it needs to talk to the dynamically lanuched cluster? 
>>>> When running Whirr from command line I know it uses HADOOP_CONF_DIR 
>>>> to find the hadoop config files but when doing the same from Java I 
>>>> am wondering how to solve this issue.
>>>>
>>>> Praveen
>>>>
>>>>
>>>
>>
>

Re: Running Mapred jobs after launching cluster

Posted by Tom White <to...@gmail.com>.
On Fri, Jan 28, 2011 at 12:06 PM,  <pr...@nokia.com> wrote:
> Thanks Tom. I think I got it working with my own driver so I will go with it for now (unless that proves to be a bad option).
>
> BTW, could you tell me how to stick with one hadoop version while launching cluster. I have hadoop-0.20.2 in my classpath but it lookws like Whirr gets the latest hadoop from the repository. Since the latest version may be different depending on the time, I would like to stick to one version so that hadoop version mismatch won't happen.

You do need to make sure that the versions are the same. See the
Hadoop integration tests, which specify the version of Hadoop to use
in their POM.

>
> Also what jar files are necessary for launching cluster using Java. Currently I have cli version of jar file but that's way too large since it has ervrything in it.

You need Whirr's core and Hadoop jars, as well as their dependencies.
If you look at the POMs in the source code they will tell you the
dependencies.

Cheers
Tom

>
> Thanks
> Praveen
>
> -----Original Message-----
> From: ext Tom White [mailto:tom.e.white@gmail.com]
> Sent: Friday, January 28, 2011 2:12 PM
> To: whirr-user@incubator.apache.org
> Subject: Re: Running Mapred jobs after launching cluster
>
> On Fri, Jan 28, 2011 at 6:28 AM,  <pr...@nokia.com> wrote:
>> Thanks Tom. Could you eloborate little more on the second option.
>>
>> What is the HADOOP_CONF_DIR here, after launching the cluster?
>
> ~/.whirr/<cluster-name>
>
>> When you said run in new process, did you mean using command line Whirr tool?
>
> I meant that you could launch Whirr using the CLI, or Java. Then run the job in another process, with HADOOP_CONF_DIR set.
>
> The MR jobs you are running I assume can be run against an arbitrary cluster, so you should be able to point them at a cluster started by Whirr.
>
> Tom
>
>>
>> I may finally end up writing my own driver for running external mapred jobs so I can have more control but I was just curious to know if option #2 is better than writing my own driver.
>>
>> Praveen
>>
>> -----Original Message-----
>> From: ext Tom White [mailto:tom@cloudera.com]
>> Sent: Thursday, January 27, 2011 4:01 PM
>> To: whirr-user@incubator.apache.org
>> Subject: Re: Running Mapred jobs after launching cluster
>>
>> If they implement the Tool interface then you can set configuration on them. Failing that you could set HADOOP_CONF_DIR and run them in a new process.
>>
>> Cheers,
>> Tom
>>
>> On Thu, Jan 27, 2011 at 12:52 PM,  <pr...@nokia.com> wrote:
>>> Hmm...
>>> I am running some of the map reduce jobs written by me but some of them are in external libraries (eg. Mahout) which I don't have control over. Since I can't modify the code in external libraries, is there any other way to make this work?
>>>
>>> Praveen
>>>
>>> -----Original Message-----
>>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>>> Sent: Thursday, January 27, 2011 3:42 PM
>>> To: whirr-user@incubator.apache.org
>>> Subject: Re: Running Mapred jobs after launching cluster
>>>
>>> You don't need to add anything to the classpath, but you need to use the configuration in the org.apache.whirr.service.Cluster object to populate your Hadoop Configuration object so that your code knows which cluster to connect to. See the getConfiguration() method in HadoopServiceController for how to do this.
>>>
>>> Cheers,
>>> Tom
>>>
>>> On Thu, Jan 27, 2011 at 12:21 PM,  <pr...@nokia.com> wrote:
>>>> Hello all,
>>>> I wrote a java class HadoopLanucher that is very similar to
>>>> HadoopServiceController. I was succesfully able to launch a cluster
>>>> programtically from my application using Whirr. Now I want to copy
>>>> files to hdfs and also run a job progrmatically.
>>>>
>>>> When I copy a file to hdfs its copying to local file system, not hdfs.
>>>> Here is the code I used:
>>>>
>>>> Configuration conf = new Configuration(); FileSystem hdfs =
>>>> FileSystem.get(conf); hdfs.copyFromLocalFile(false, true, new
>>>> Path(localFilePath), new Path(hdfsFileDirectory));
>>>>
>>>> Do I need to add anything else to the classpath so Hadoop libraries
>>>> know that it needs to talk to the dynamically lanuched cluster? When
>>>> running Whirr from command line I know it uses HADOOP_CONF_DIR to
>>>> find the hadoop config files but when doing the same from Java I am
>>>> wondering how to solve this issue.
>>>>
>>>> Praveen
>>>>
>>>>
>>>
>>
>

RE: Running Mapred jobs after launching cluster

Posted by pr...@nokia.com.
Thanks Tom. I think I got it working with my own driver so I will go with it for now (unless that proves to be a bad option).

BTW, could you tell me how to stick with one hadoop version while launching a cluster? I have hadoop-0.20.2 in my classpath, but it looks like Whirr gets the latest hadoop from the repository. Since the latest version may change over time, I would like to stick to one version so that a hadoop version mismatch won't happen.

Also, what jar files are necessary for launching a cluster using Java? Currently I have the CLI version of the jar file, but that's way too large since it has everything in it.

Thanks
Praveen 

-----Original Message-----
From: ext Tom White [mailto:tom.e.white@gmail.com] 
Sent: Friday, January 28, 2011 2:12 PM
To: whirr-user@incubator.apache.org
Subject: Re: Running Mapred jobs after launching cluster

On Fri, Jan 28, 2011 at 6:28 AM,  <pr...@nokia.com> wrote:
> Thanks Tom. Could you eloborate little more on the second option.
>
> What is the HADOOP_CONF_DIR here, after launching the cluster?

~/.whirr/<cluster-name>

> When you said run in new process, did you mean using command line Whirr tool?

I meant that you could launch Whirr using the CLI, or Java. Then run the job in another process, with HADOOP_CONF_DIR set.

The MR jobs you are running I assume can be run against an arbitrary cluster, so you should be able to point them at a cluster started by Whirr.

Tom

>
> I may finally end up writing my own driver for running external mapred jobs so I can have more control but I was just curious to know if option #2 is better than writing my own driver.
>
> Praveen
>
> -----Original Message-----
> From: ext Tom White [mailto:tom@cloudera.com]
> Sent: Thursday, January 27, 2011 4:01 PM
> To: whirr-user@incubator.apache.org
> Subject: Re: Running Mapred jobs after launching cluster
>
> If they implement the Tool interface then you can set configuration on them. Failing that you could set HADOOP_CONF_DIR and run them in a new process.
>
> Cheers,
> Tom
>
> On Thu, Jan 27, 2011 at 12:52 PM,  <pr...@nokia.com> wrote:
>> Hmm...
>> I am running some of the map reduce jobs written by me but some of them are in external libraries (eg. Mahout) which I don't have control over. Since I can't modify the code in external libraries, is there any other way to make this work?
>>
>> Praveen
>>
>> -----Original Message-----
>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>> Sent: Thursday, January 27, 2011 3:42 PM
>> To: whirr-user@incubator.apache.org
>> Subject: Re: Running Mapred jobs after launching cluster
>>
>> You don't need to add anything to the classpath, but you need to use the configuration in the org.apache.whirr.service.Cluster object to populate your Hadoop Configuration object so that your code knows which cluster to connect to. See the getConfiguration() method in HadoopServiceController for how to do this.
>>
>> Cheers,
>> Tom
>>
>> On Thu, Jan 27, 2011 at 12:21 PM,  <pr...@nokia.com> wrote:
>>> Hello all,
>>> I wrote a java class HadoopLanucher that is very similar to 
>>> HadoopServiceController. I was succesfully able to launch a cluster 
>>> programtically from my application using Whirr. Now I want to copy 
>>> files to hdfs and also run a job progrmatically.
>>>
>>> When I copy a file to hdfs its copying to local file system, not hdfs.
>>> Here is the code I used:
>>>
>>> Configuration conf = new Configuration(); FileSystem hdfs = 
>>> FileSystem.get(conf); hdfs.copyFromLocalFile(false, true, new 
>>> Path(localFilePath), new Path(hdfsFileDirectory));
>>>
>>> Do I need to add anything else to the classpath so Hadoop libraries 
>>> know that it needs to talk to the dynamically lanuched cluster? When 
>>> running Whirr from command line I know it uses HADOOP_CONF_DIR to 
>>> find the hadoop config files but when doing the same from Java I am 
>>> wondering how to solve this issue.
>>>
>>> Praveen
>>>
>>>
>>
>

Re: Running Mapred jobs after launching cluster

Posted by Tom White <to...@gmail.com>.
On Fri, Jan 28, 2011 at 6:28 AM,  <pr...@nokia.com> wrote:
> Thanks Tom. Could you eloborate little more on the second option.
>
> What is the HADOOP_CONF_DIR here, after launching the cluster?

~/.whirr/<cluster-name>

> When you said run in new process, did you mean using command line Whirr tool?

I meant that you could launch Whirr using the CLI, or Java. Then run
the job in another process, with HADOOP_CONF_DIR set.

The MR jobs you are running I assume can be run against an arbitrary
cluster, so you should be able to point them at a cluster started by
Whirr.
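
A minimal sketch of that second option in Java, assuming the cluster was launched as "myhadoopcluster" and that the jar and driver names are placeholders for the real external job:

import java.io.File;

public class RunJobInNewProcess {
  public static void main(String[] args) throws Exception {
    // Assumption: the cluster was launched as "myhadoopcluster", so Whirr
    // wrote its client configuration under ~/.whirr/myhadoopcluster.
    File confDir = new File(System.getProperty("user.home"), ".whirr/myhadoopcluster");

    // Placeholder jar and driver class; substitute the real external job here.
    ProcessBuilder pb = new ProcessBuilder(
        "hadoop", "jar", "external-job.jar", "com.example.SomeDriver", "arg1");
    // The child process picks up the Whirr-generated config via HADOOP_CONF_DIR.
    pb.environment().put("HADOOP_CONF_DIR", confDir.getAbsolutePath());
    pb.inheritIO();
    System.exit(pb.start().waitFor());
  }
}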

Tom

>
> I may finally end up writing my own driver for running external mapred jobs so I can have more control but I was just curious to know if option #2 is better than writing my own driver.
>
> Praveen
>
> -----Original Message-----
> From: ext Tom White [mailto:tom@cloudera.com]
> Sent: Thursday, January 27, 2011 4:01 PM
> To: whirr-user@incubator.apache.org
> Subject: Re: Running Mapred jobs after launching cluster
>
> If they implement the Tool interface then you can set configuration on them. Failing that you could set HADOOP_CONF_DIR and run them in a new process.
>
> Cheers,
> Tom
>
> On Thu, Jan 27, 2011 at 12:52 PM,  <pr...@nokia.com> wrote:
>> Hmm...
>> I am running some of the map reduce jobs written by me but some of them are in external libraries (eg. Mahout) which I don't have control over. Since I can't modify the code in external libraries, is there any other way to make this work?
>>
>> Praveen
>>
>> -----Original Message-----
>> From: ext Tom White [mailto:tom.e.white@gmail.com]
>> Sent: Thursday, January 27, 2011 3:42 PM
>> To: whirr-user@incubator.apache.org
>> Subject: Re: Running Mapred jobs after launching cluster
>>
>> You don't need to add anything to the classpath, but you need to use the configuration in the org.apache.whirr.service.Cluster object to populate your Hadoop Configuration object so that your code knows which cluster to connect to. See the getConfiguration() method in HadoopServiceController for how to do this.
>>
>> Cheers,
>> Tom
>>
>> On Thu, Jan 27, 2011 at 12:21 PM,  <pr...@nokia.com> wrote:
>>> Hello all,
>>> I wrote a java class HadoopLanucher that is very similar to
>>> HadoopServiceController. I was succesfully able to launch a cluster
>>> programtically from my application using Whirr. Now I want to copy
>>> files to hdfs and also run a job progrmatically.
>>>
>>> When I copy a file to hdfs its copying to local file system, not hdfs.
>>> Here is the code I used:
>>>
>>> Configuration conf = new Configuration(); FileSystem hdfs =
>>> FileSystem.get(conf); hdfs.copyFromLocalFile(false, true, new
>>> Path(localFilePath), new Path(hdfsFileDirectory));
>>>
>>> Do I need to add anything else to the classpath so Hadoop libraries
>>> know that it needs to talk to the dynamically lanuched cluster? When
>>> running Whirr from command line I know it uses HADOOP_CONF_DIR to
>>> find the hadoop config files but when doing the same from Java I am
>>> wondering how to solve this issue.
>>>
>>> Praveen
>>>
>>>
>>
>

RE: Running Mapred jobs after launching cluster

Posted by pr...@nokia.com.
Thanks Tom. Could you elaborate a little more on the second option?

What is the HADOOP_CONF_DIR here, after launching the cluster?
When you said run it in a new process, did you mean using the command-line Whirr tool?

I may finally end up writing my own driver for running external mapred jobs so I can have more control, but I was just curious to know whether option #2 is better than writing my own driver.

Praveen

-----Original Message-----
From: ext Tom White [mailto:tom@cloudera.com] 
Sent: Thursday, January 27, 2011 4:01 PM
To: whirr-user@incubator.apache.org
Subject: Re: Running Mapred jobs after launching cluster

If they implement the Tool interface then you can set configuration on them. Failing that you could set HADOOP_CONF_DIR and run them in a new process.

Cheers,
Tom

On Thu, Jan 27, 2011 at 12:52 PM,  <pr...@nokia.com> wrote:
> Hmm...
> I am running some of the map reduce jobs written by me but some of them are in external libraries (eg. Mahout) which I don't have control over. Since I can't modify the code in external libraries, is there any other way to make this work?
>
> Praveen
>
> -----Original Message-----
> From: ext Tom White [mailto:tom.e.white@gmail.com]
> Sent: Thursday, January 27, 2011 3:42 PM
> To: whirr-user@incubator.apache.org
> Subject: Re: Running Mapred jobs after launching cluster
>
> You don't need to add anything to the classpath, but you need to use the configuration in the org.apache.whirr.service.Cluster object to populate your Hadoop Configuration object so that your code knows which cluster to connect to. See the getConfiguration() method in HadoopServiceController for how to do this.
>
> Cheers,
> Tom
>
> On Thu, Jan 27, 2011 at 12:21 PM,  <pr...@nokia.com> wrote:
>> Hello all,
>> I wrote a java class HadoopLanucher that is very similar to 
>> HadoopServiceController. I was succesfully able to launch a cluster 
>> programtically from my application using Whirr. Now I want to copy 
>> files to hdfs and also run a job progrmatically.
>>
>> When I copy a file to hdfs its copying to local file system, not hdfs.
>> Here is the code I used:
>>
>> Configuration conf = new Configuration(); FileSystem hdfs = 
>> FileSystem.get(conf); hdfs.copyFromLocalFile(false, true, new 
>> Path(localFilePath), new Path(hdfsFileDirectory));
>>
>> Do I need to add anything else to the classpath so Hadoop libraries 
>> know that it needs to talk to the dynamically lanuched cluster? When 
>> running Whirr from command line I know it uses HADOOP_CONF_DIR to 
>> find the hadoop config files but when doing the same from Java I am 
>> wondering how to solve this issue.
>>
>> Praveen
>>
>>
>

Re: Running Mapred jobs after launching cluster

Posted by Tom White <to...@cloudera.com>.
If they implement the Tool interface then you can set configuration on
them. Failing that you could set HADOOP_CONF_DIR and run them in a new
process.
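
A minimal sketch of the Tool option, assuming clusterConf is a Hadoop Configuration already populated for the launched cluster and that the external job (a Mahout driver, for example) implements Tool; the wrapper class name is illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ExternalJobRunner {
  // ToolRunner injects the given Configuration into the Tool, so the job
  // submits to the namenode/jobtracker of the dynamically launched cluster
  // instead of whatever defaults are on the local classpath.
  public static int runAgainstCluster(Configuration clusterConf,
                                      Tool job, String[] args) throws Exception {
    return ToolRunner.run(clusterConf, job, args);
  }
}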

Cheers,
Tom

On Thu, Jan 27, 2011 at 12:52 PM,  <pr...@nokia.com> wrote:
> Hmm...
> I am running some of the map reduce jobs written by me but some of them are in external libraries (eg. Mahout) which I don't have control over. Since I can't modify the code in external libraries, is there any other way to make this work?
>
> Praveen
>
> -----Original Message-----
> From: ext Tom White [mailto:tom.e.white@gmail.com]
> Sent: Thursday, January 27, 2011 3:42 PM
> To: whirr-user@incubator.apache.org
> Subject: Re: Running Mapred jobs after launching cluster
>
> You don't need to add anything to the classpath, but you need to use the configuration in the org.apache.whirr.service.Cluster object to populate your Hadoop Configuration object so that your code knows which cluster to connect to. See the getConfiguration() method in HadoopServiceController for how to do this.
>
> Cheers,
> Tom
>
> On Thu, Jan 27, 2011 at 12:21 PM,  <pr...@nokia.com> wrote:
>> Hello all,
>> I wrote a java class HadoopLanucher that is very similar to
>> HadoopServiceController. I was succesfully able to launch a cluster
>> programtically from my application using Whirr. Now I want to copy
>> files to hdfs and also run a job progrmatically.
>>
>> When I copy a file to hdfs its copying to local file system, not hdfs.
>> Here is the code I used:
>>
>> Configuration conf = new Configuration(); FileSystem hdfs =
>> FileSystem.get(conf); hdfs.copyFromLocalFile(false, true, new
>> Path(localFilePath), new Path(hdfsFileDirectory));
>>
>> Do I need to add anything else to the classpath so Hadoop libraries
>> know that it needs to talk to the dynamically lanuched cluster? When
>> running Whirr from command line I know it uses HADOOP_CONF_DIR to find
>> the hadoop config files but when doing the same from Java I am
>> wondering how to solve this issue.
>>
>> Praveen
>>
>>
>

RE: Running Mapred jobs after launching cluster

Posted by pr...@nokia.com.
Hmm...
I am running some map reduce jobs written by me, but some of them are in external libraries (e.g. Mahout) which I don't have control over. Since I can't modify the code in external libraries, is there any other way to make this work?

Praveen 

-----Original Message-----
From: ext Tom White [mailto:tom.e.white@gmail.com] 
Sent: Thursday, January 27, 2011 3:42 PM
To: whirr-user@incubator.apache.org
Subject: Re: Running Mapred jobs after launching cluster

You don't need to add anything to the classpath, but you need to use the configuration in the org.apache.whirr.service.Cluster object to populate your Hadoop Configuration object so that your code knows which cluster to connect to. See the getConfiguration() method in HadoopServiceController for how to do this.

Cheers,
Tom

On Thu, Jan 27, 2011 at 12:21 PM,  <pr...@nokia.com> wrote:
> Hello all,
> I wrote a java class HadoopLanucher that is very similar to 
> HadoopServiceController. I was succesfully able to launch a cluster 
> programtically from my application using Whirr. Now I want to copy 
> files to hdfs and also run a job progrmatically.
>
> When I copy a file to hdfs its copying to local file system, not hdfs. 
> Here is the code I used:
>
> Configuration conf = new Configuration(); FileSystem hdfs = 
> FileSystem.get(conf); hdfs.copyFromLocalFile(false, true, new 
> Path(localFilePath), new Path(hdfsFileDirectory));
>
> Do I need to add anything else to the classpath so Hadoop libraries 
> know that it needs to talk to the dynamically lanuched cluster? When 
> running Whirr from command line I know it uses HADOOP_CONF_DIR to find 
> the hadoop config files but when doing the same from Java I am 
> wondering how to solve this issue.
>
> Praveen
>
>

Re: Running Mapred jobs after launching cluster

Posted by Tom White <to...@gmail.com>.
You don't need to add anything to the classpath, but you need to use
the configuration in the org.apache.whirr.service.Cluster object to
populate your Hadoop Configuration object so that your code knows
which cluster to connect to. See the getConfiguration() method in
HadoopServiceController for how to do this.
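
A minimal sketch of this, assuming clusterConf comes from a getConfiguration()-style helper built from the Whirr Cluster object (so fs.default.name points at the launched namenode); the class and method names beyond the Hadoop API are illustrative. With such a configuration, the copyFromLocalFile call quoted below targets HDFS on the cluster rather than the local file system:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopy {
  // clusterConf is assumed to be built from the Whirr Cluster, so
  // FileSystem.get() returns the remote HDFS instead of the local file system.
  public static void copyToCluster(Configuration clusterConf,
                                   String localFilePath,
                                   String hdfsFileDirectory) throws Exception {
    FileSystem hdfs = FileSystem.get(clusterConf);
    hdfs.copyFromLocalFile(false, true,
        new Path(localFilePath), new Path(hdfsFileDirectory));
  }
}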

Cheers,
Tom

On Thu, Jan 27, 2011 at 12:21 PM,  <pr...@nokia.com> wrote:
> Hello all,
> I wrote a java class HadoopLanucher that is very similar to
> HadoopServiceController. I was succesfully able to launch a cluster
> programtically from my application using Whirr. Now I want to copy files to
> hdfs and also run a job progrmatically.
>
> When I copy a file to hdfs its copying to local file system, not hdfs. Here
> is the code I used:
>
> Configuration conf = new Configuration();
> FileSystem hdfs = FileSystem.get(conf);
> hdfs.copyFromLocalFile(false, true, new Path(localFilePath), new
> Path(hdfsFileDirectory));
>
> Do I need to add anything else to the classpath so Hadoop libraries know
> that it needs to talk to the dynamically lanuched cluster? When running
> Whirr from command line I know it uses HADOOP_CONF_DIR to find the hadoop
> config files but when doing the same from Java I am wondering how to solve
> this issue.
>
> Praveen
>
>