Posted to user@whirr.apache.org by Selwyn McCracken <se...@gmail.com> on 2011/03/07 22:51:28 UTC

some nodes terminating at startup

Hi Whirrers,

I have been successfully launching smaller clusters with whirr (<= 4
data nodes).

When I try to scale to something larger (8+ nodes), some of the nodes
terminate during the startup process, and frequently it is the name
node.

I have reviewed the logs and there doesn't seem to be anything I can spot
(in fact the whirr script hangs and never closes, so the log never
completes).

I suspect something is timing out if the cluster is being launched serially...

Has there been any progress made in adding nodes to an already running
cluster? This might help to work around this problem, and make it
easier for my benchmarking tests, where I am trying to show a linear
decrease in processing time as the number of nodes increases. That is,
I won't have to start a fresh cluster and reload the data into HDFS for
each test run.

Anyway, here is the recipe I have been using:

whirr.cluster-name=hadoop8l
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,8 hadoop-datanode+hadoop-tasktracker
whirr.hadoop-install-function=install_cdh_hadoop
whirr.hadoop-configure-function=configure_cdh_hadoop
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
whirr.hardware-id=m1.large
whirr.image-id=us-east-1/ami-da0cf8b3
whirr.location-id=us-east-1

Any help greatly appreciated.
Selwyn
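
For reference, the recipe above asks for nine instances in total: one master plus eight workers. The following is a minimal Python sketch of how the whirr.instance-templates value breaks down, assuming the property is written on a single line. parse_instance_templates is a hypothetical helper written for illustration and is not part of Whirr.

```python
def parse_instance_templates(value):
    """Split a whirr.instance-templates value into (count, roles) pairs."""
    templates = []
    for group in value.split(","):
        # Each group looks like "<count> <role>+<role>..."
        count, roles = group.strip().split(None, 1)
        templates.append((int(count), roles.split("+")))
    return templates


# Value taken from the recipe above, joined onto one line.
recipe_value = ("1 hadoop-namenode+hadoop-jobtracker,"
                "8 hadoop-datanode+hadoop-tasktracker")
templates = parse_instance_templates(recipe_value)
total_nodes = sum(count for count, _ in templates)

for count, roles in templates:
    print(count, "x", " + ".join(roles))
print("total instances:", total_nodes)
```

Comparing this total with the number of nodes that actually reach the RUNNING state is a quick way to see how many instances were lost during startup.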

Re: some nodes terminating at startup

Posted by Selwyn McCracken <se...@gmail.com>.
Sorry, I forgot to report that using 0.4.0 from the trunk seems to
launch larger clusters more reliably.

Thanks for the good work and help.

Selwyn.

On Thu, Mar 10, 2011 at 4:12 PM, Tom White <to...@gmail.com> wrote:
> The client might be hanging when trying to connect to instances over
> SSH. I'm not sure if jclouds has (or supports) timeouts for this
> operation. If you see this situation again then a thread dump would be
> very useful in diagnosing further.
>
> Thanks,
> Tom

Re: some nodes terminating at startup

Posted by Tom White <to...@gmail.com>.
The client might be hanging when trying to connect to instances over
SSH. I'm not sure if jclouds has (or supports) timeouts for this
operation. If you see this situation again then a thread dump would be
very useful in diagnosing further.

Thanks,
Tom
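
One way to capture the thread dump mentioned above is with the JDK's jps and jstack tools on the launch machine. The Python sketch below is only an illustration. find_java_pids and dump_threads are hypothetical helper names, and matching on the substring "whirr" assumes that the Whirr client's main class or jar name contains that word.

```python
import shutil
import subprocess


def find_java_pids(jps_output, marker="whirr"):
    """Return PIDs from `jps -l` output whose main class or jar mentions marker."""
    pids = []
    for line in jps_output.splitlines():
        parts = line.split(None, 1)
        # Each line is "<pid> <main class or jar path>"; the name may be missing.
        if len(parts) == 2 and marker.lower() in parts[1].lower():
            pids.append(int(parts[0]))
    return pids


def dump_threads(pid, out_path):
    """Write the output of `jstack <pid>` to out_path."""
    with open(out_path, "w") as out:
        subprocess.run(["jstack", str(pid)], stdout=out, check=True)


def main():
    # List running JVMs, then dump the threads of every match to its own file.
    listing = subprocess.run(["jps", "-l"], capture_output=True, text=True).stdout
    for pid in find_java_pids(listing):
        out_path = "whirr-threads-%d.txt" % pid
        dump_threads(pid, out_path)
        print("wrote", out_path)


if __name__ == "__main__":
    # Only attempt the dump when the JDK tools are actually installed.
    if shutil.which("jps") and shutil.which("jstack"):
        main()
    else:
        print("jps/jstack not found; a JDK is needed to capture thread dumps")
```

The dump should be taken while the client is still hanging, before it is suspended or killed. Sending SIGQUIT to the JVM (kill -QUIT with its PID) is an alternative that prints the dump to the JVM's own standard output.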

On Thu, Mar 10, 2011 at 12:25 AM, Selwyn McCracken
<se...@gmail.com> wrote:
> Thanks Tom.
>
> Will build from the trunk tonight and give it a test (it does appear
> to be the same issue as WHIRR-167).
>
> The script hangs on the launch machine. I launched some smaller
> clusters, so hopefully this is the relevant section of the log
> displayed when I had to use Ctrl-Z to recover control of the terminal
> so I could destroy the cluster.
>

Re: some nodes terminating at startup

Posted by Selwyn McCracken <se...@gmail.com>.
Thanks Tom.

Will build from the trunk tonight and give it a test (it does appear
to be the same issue as WHIRR-167).

The script hangs on the launch machine. I launched some smaller
clusters, so hopefully this is the relevant section of the log
displayed when I had to use Ctrl-Z to recover control of the terminal
so I could destroy the cluster.

--
Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)
dpkg-preconfigure: unable to re-open stdin:

2011-03-07 20:23:31,518 DEBUG [jclouds.compute] (user thread 11) <<
options applied node(us-east-1/i-851d14e9)
2011-03-07 20:23:31,524 INFO
[org.apache.whirr.cluster.actions.NodeStarter] (pool-1-thread-2) Nodes
started: [[id=us-east-1/i-8b1d14e7, providerId=i-8b1d14e7,
group=hadoop8l, name=null, location=[id=us-east-1b, scope=ZONE,
description=us-east-1b, parent=us-east-1, iso3166Codes=[US-VA],
metadata={}], uri=null, imageId=us-east-1/ami-da0cf8b3, os=[name=null,
family=ubuntu, version=10.04, arch=paravirtual, is64Bit=true,
description=ubuntu-images-us/ubuntu-lucid-10.04-amd64-server-20101020.manifest.xml],
state=RUNNING, loginPort=22, privateAddresses=[10.114.121.62],
publicAddresses=[184.73.9.122], hardware=[id=m1.large,
providerId=m1.large, name=null, processors=[[cores=2.0, speed=2.0]],
ram=7680, volumes=[[id=null, type=LOCAL, size=10.0, device=/dev/sda1,
durable=false, isBootDevice=true], [id=null, type=LOCAL, size=420.0,
device=/dev/sdb, durable=false, isBootDevice=false], [id=null,
type=LOCAL, size=420.0, device=/dev/sdc, durable=false,
isBootDevice=false]], supportsImage=is64Bit()], loginUser=ubuntu,
userMetadata={}], [id=us-east-1/i-891d14e5, providerId=i-891d14e5,
group=hadoop8l, name=null, location=[id=us-east-1b, scope=ZONE,
description=us-east-1b, parent=us-east-1, iso3166Codes=[US-VA],
metadata={}], uri=null, imageId=us-east-1/ami-da0cf8b3, os=[name=null,
family=ubuntu, version=10.04, arch=paravirtual, is64Bit=true,
description=ubuntu-images-us/ubuntu-lucid-10.04-amd64-server-20101020.manifest.xml],
state=RUNNING, loginPort=22, privateAddresses=[10.114.206.253],
publicAddresses=[72.44.38.144], hardware=[id=m1.large,
providerId=m1.large, name=null, processors=[[cores=2.0, speed=2.0]],
ram=7680, volumes=[[id=null, type=LOCAL, size=10.0, device=/dev/sda1,
durable=false, isBootDevice=true], [id=null, type=LOCAL, size=420.0,
device=/dev/sdb, durable=false, isBootDevice=false], [id=null,
type=LOCAL, size=420.0, device=/dev/sdc, durable=false,
isBootDevice=false]], supportsImage=is64Bit()], loginUser=ubuntu,
userMetadata={}], [id=us-east-1/i-871d14eb, providerId=i-871d14eb,
group=hadoop8l, name=null, location=[id=us-east-1b, scope=ZONE,
description=us-east-1b, parent=us-east-1, iso3166Codes=[US-VA],
metadata={}], uri=null, imageId=us-east-1/ami-da0cf8b3, os=[name=null,
family=ubuntu, version=10.04, arch=paravirtual, is64Bit=true,
description=ubuntu-images-us/ubuntu-lucid-10.04-amd64-server-20101020.manifest.xml],
state=RUNNING, loginPort=22, privateAddresses=[10.114.74.91],
publicAddresses=[50.16.96.184], hardware=[id=m1.large,
providerId=m1.large, name=null, processors=[[cores=2.0, speed=2.0]],
ram=7680, volumes=[[id=null, type=LOCAL, size=10.0, device=/dev/sda1,
durable=false, isBootDevice=true], [id=null, type=LOCAL, size=420.0,
device=/dev/sdb, durable=false, isBootDevice=false], [id=null,
type=LOCAL, size=420.0, device=/dev/sdc, durable=false,
isBootDevice=false]], supportsImage=is64Bit()], loginUser=ubuntu,
userMetadata={}], [id=us-east-1/i-8d1d14e1, providerId=i-8d1d14e1,
group=hadoop8l, name=null, location=[id=us-east-1b, scope=ZONE,
description=us-east-1b, parent=us-east-1, iso3166Codes=[US-VA],
metadata={}], uri=null, imageId=us-east-1/ami-da0cf8b3, os=[name=null,
family=ubuntu, version=10.04, arch=paravirtual, is64Bit=true,
description=ubuntu-images-us/ubuntu-lucid-10.04-amd64-server-20101020.manifest.xml],
state=RUNNING, loginPort=22, privateAddresses=[10.212.167.31],
publicAddresses=[174.129.88.235], hardware=[id=m1.large,
providerId=m1.large, name=null, processors=[[cores=2.0, speed=2.0]],
ram=7680, volumes=[[id=null, type=LOCAL, size=10.0, device=/dev/sda1,
durable=false, isBootDevice=true], [id=null, type=LOCAL, size=420.0,
device=/dev/sdb, durable=false, isBootDevice=false], [id=null,
type=LOCAL, size=420.0, device=/dev/sdc, durable=false,
isBootDevice=false]], supportsImage=is64Bit()], loginUser=ubuntu,
userMetadata={}], [id=us-east-1/i-b11d14dd, providerId=i-b11d14dd,
group=hadoop8l, name=null, location=[id=us-east-1b, scope=ZONE,
description=us-east-1b, parent=us-east-1, iso3166Codes=[US-VA],
metadata={}], uri=null, imageId=us-east-1/ami-da0cf8b3, os=[name=null,
family=ubuntu, version=10.04, arch=paravirtual, is64Bit=true,
description=ubuntu-images-us/ubuntu-lucid-10.04-amd64-server-20101020.manifest.xml],
state=RUNNING, loginPort=22, privateAddresses=[10.116.149.144],
publicAddresses=[174.129.74.156], hardware=[id=m1.large,
providerId=m1.large, name=null, processors=[[cores=2.0, speed=2.0]],
ram=7680, volumes=[[id=null, type=LOCAL, size=10.0, device=/dev/sda1,
durable=false, isBootDevice=true], [id=null, type=LOCAL, size=420.0,
device=/dev/sdb, durable=false, isBootDevice=false], [id=null,
type=LOCAL, size=420.0, device=/dev/sdc, durable=false,
isBootDevice=false]], supportsImage=is64Bit()], loginUser=ubuntu,
userMetadata={}], [id=us-east-1/i-8f1d14e3, providerId=i-8f1d14e3,
group=hadoop8l, name=null, location=[id=us-east-1b, scope=ZONE,
description=us-east-1b, parent=us-east-1, iso3166Codes=[US-VA],
metadata={}], uri=null, imageId=us-east-1/ami-da0cf8b3, os=[name=null,
family=ubuntu, version=10.04, arch=paravirtual, is64Bit=true,
description=ubuntu-images-us/ubuntu-lucid-10.04-amd64-server-20101020.manifest.xml],
state=RUNNING, loginPort=22, privateAddresses=[10.114.251.250],
publicAddresses=[67.202.41.42], hardware=[id=m1.large,
providerId=m1.large, name=null, processors=[[cores=2.0, speed=2.0]],
ram=7680, volumes=[[id=null, type=LOCAL, size=10.0, device=/dev/sda1,
durable=false, isBootDevice=true], [id=null, type=LOCAL, size=420.0,
device=/dev/sdb, durable=false, isBootDevice=false], [id=null,
type=LOCAL, size=420.0, device=/dev/sdc, durable=false,
isBootDevice=false]], supportsImage=is64Bit()], loginUser=ubuntu,
userMetadata={}], [id=us-east-1/i-b31d14df, providerId=i-b31d14df,
group=hadoop8l, name=null, location=[id=us-east-1b, scope=ZONE,
description=us-east-1b, parent=us-east-1, iso3166Codes=[US-VA],
metadata={}], uri=null, imageId=us-east-1/ami-da0cf8b3, os=[name=null,
family=ubuntu, version=10.04, arch=paravirtual, is64Bit=true,
description=ubuntu-images-us/ubuntu-lucid-10.04-amd64-server-20101020.manifest.xml],
state=RUNNING, loginPort=22, privateAddresses=[10.116.222.97],
publicAddresses=[75.101.229.142], hardware=[id=m1.large,
providerId=m1.large, name=null, processors=[[cores=2.0, speed=2.0]],
ram=7680, volumes=[[id=null, type=LOCAL, size=10.0, device=/dev/sda1,
durable=false, isBootDevice=true], [id=null, type=LOCAL, size=420.0,
device=/dev/sdb, durable=false, isBootDevice=false], [id=null,
type=LOCAL, size=420.0, device=/dev/sdc, durable=false,
isBootDevice=false]], supportsImage=is64Bit()], loginUser=ubuntu,
userMetadata={}], [id=us-east-1/i-851d14e9, providerId=i-851d14e9,
group=hadoop8l, name=null, location=[id=us-east-1b, scope=ZONE,
description=us-east-1b, parent=us-east-1, iso3166Codes=[US-VA],
metadata={}], uri=null, imageId=us-east-1/ami-da0cf8b3, os=[name=null,
family=ubuntu, version=10.04, arch=paravirtual, is64Bit=true,
description=ubuntu-images-us/ubuntu-lucid-10.04-amd64-server-20101020.manifest.xml],
state=RUNNING, loginPort=22, privateAddresses=[10.116.222.165],
publicAddresses=[50.16.23.148], hardware=[id=m1.large,
providerId=m1.large, name=null, processors=[[cores=2.0, speed=2.0]],
ram=7680, volumes=[[id=null, type=LOCAL, size=10.0, device=/dev/sda1,
durable=false, isBootDevice=true], [id=null, type=LOCAL, size=420.0,
device=/dev/sdb, durable=false, isBootDevice=false], [id=null,
type=LOCAL, size=420.0, device=/dev/sdc, durable=false,
isBootDevice=false]], supportsImage=is64Bit()], loginUser=ubuntu,
userMetadata={}]]
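
The "Nodes started" entry above is easier to read once it is reduced to node IDs and addresses. The rough Python sketch below assumes the log keeps the id=us-east-1/i-..., privateAddresses=[...] and publicAddresses=[...] fields in the form shown above. summarize_nodes and ssh_reachable are hypothetical helpers, and the port 22 probe with a 5 second timeout is just one way to check for the SSH connection hang suggested elsewhere in this thread.

```python
import re
import socket

# Matches one node record: its id, then the first private/public address pair.
NODE_RE = re.compile(
    r"\[id=(?P<id>[\w-]+/i-[0-9a-f]+),.*?"
    r"privateAddresses=\[(?P<private>[^\]]*)\],\s*"
    r"publicAddresses=\[(?P<public>[^\]]*)\]",
    re.DOTALL,
)


def summarize_nodes(log_text):
    """Return (node_id, private_ip, public_ip) for each node in the log text."""
    return [(m.group("id"), m.group("private"), m.group("public"))
            for m in NODE_RE.finditer(log_text)]


def ssh_reachable(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Abridged excerpt of the first node record from the log above.
sample = """started: [[id=us-east-1/i-8b1d14e7, providerId=i-8b1d14e7,
group=hadoop8l, name=null, location=[id=us-east-1b, scope=ZONE,
description=us-east-1b, parent=us-east-1, iso3166Codes=[US-VA],
metadata={}], uri=null, imageId=us-east-1/ami-da0cf8b3,
state=RUNNING, loginPort=22, privateAddresses=[10.114.121.62],
publicAddresses=[184.73.9.122], hardware=[id=m1.large,"""

for node_id, private_ip, public_ip in summarize_nodes(sample):
    print(node_id, private_ip, public_ip)
```

Calling ssh_reachable(public_ip) for each parsed node from the launch machine would show which instances accept connections on port 22 and which ones leave the client waiting.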

On Tue, Mar 8, 2011 at 11:39 PM, Tom White <to...@gmail.com> wrote:
> Hi Selwyn,
>
> https://issues.apache.org/jira/browse/WHIRR-167 should improve
> reliability of larger clusters, but it isn't in a released version yet
> (it is on trunk, which will become 0.4.0). You might try building trunk
> to see if it helps you.
>
> Where does the script hang? On the cloud instance or on the launch
> machine? What's the last thing in the log?
>
> Adding nodes to a running cluster is still under development
> (https://issues.apache.org/jira/browse/WHIRR-214).
>
> Cheers,
> Tom
>

Re: some nodes terminating at startup

Posted by Tom White <to...@gmail.com>.
Hi Selwyn,

https://issues.apache.org/jira/browse/WHIRR-167 should improve
reliability of larger clusters, but it isn't in a released version yet
(it is on trunk, which will become 0.4.0). You might try building trunk
to see if it helps you.

Where does the script hang? On the cloud instance or on the launch
machine? What's the last thing in the log?

Adding nodes to a running cluster is still under development
(https://issues.apache.org/jira/browse/WHIRR-214).

Cheers,
Tom
