Posted to common-user@hadoop.apache.org by Rakhi Khatwani <ra...@gmail.com> on 2009/04/17 18:39:45 UTC

Ec2 instability

Hi,
        It has been several days since we started trying to stabilize
hadoop/hbase on our EC2 cluster, but we have failed to do so.
We still come across frequent region server failures, scanner timeout
exceptions, OS-level deadlocks, etc.

And today, while listing the tables in hbase, I got the following
exception:

hbase(main):001:0> list
09/04/17 13:57:18 INFO ipc.HBaseClass: Retrying connect to server: /
10.254.234.32:60020. Already tried 0 time(s).
09/04/17 13:57:19 INFO ipc.HBaseClass: Retrying connect to server: /
10.254.234.32:60020. Already tried 1 time(s).
09/04/17 13:57:20 INFO ipc.HBaseClass: Retrying connect to server: /
10.254.234.32:60020. Already tried 2 time(s).
09/04/17 13:57:20 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 not
available yet, Zzzzz...
09/04/17 13:57:20 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 could
not be reached after 1 tries, giving up.
09/04/17 13:57:21 INFO ipc.HBaseClass: Retrying connect to server: /
10.254.234.32:60020. Already tried 0 time(s).
09/04/17 13:57:22 INFO ipc.HBaseClass: Retrying connect to server: /
10.254.234.32:60020. Already tried 1 time(s).
09/04/17 13:57:23 INFO ipc.HBaseClass: Retrying connect to server: /
10.254.234.32:60020. Already tried 2 time(s).
09/04/17 13:57:23 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 not
available yet, Zzzzz...
09/04/17 13:57:23 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 could
not be reached after 1 tries, giving up.
09/04/17 13:57:26 INFO ipc.HBaseClass: Retrying connect to server: /
10.254.234.32:60020. Already tried 0 time(s).
09/04/17 13:57:27 INFO ipc.HBaseClass: Retrying connect to server: /
10.254.234.32:60020. Already tried 1 time(s).
09/04/17 13:57:28 INFO ipc.HBaseClass: Retrying connect to server: /
10.254.234.32:60020. Already tried 2 time(s).
09/04/17 13:57:28 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 not
available yet, Zzzzz...
09/04/17 13:57:28 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 could
not be reached after 1 tries, giving up.
09/04/17 13:57:29 INFO ipc.HBaseClass: Retrying connect to server: /
10.254.234.32:60020. Already tried 0 time(s).
09/04/17 13:57:30 INFO ipc.HBaseClass: Retrying connect to server: /
10.254.234.32:60020. Already tried 1 time(s).
09/04/17 13:57:31 INFO ipc.HBaseClass: Retrying connect to server: /
10.254.234.32:60020. Already tried 2 time(s).
09/04/17 13:57:31 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 not
available yet, Zzzzz...
09/04/17 13:57:31 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 could
not be reached after 1 tries, giving up.
09/04/17 13:57:34 INFO ipc.HBaseClass: Retrying connect to server: /
10.254.234.32:60020. Already tried 0 time(s).
09/04/17 13:57:35 INFO ipc.HBaseClass: Retrying connect to server: /
10.254.234.32:60020. Already tried 1 time(s).
09/04/17 13:57:36 INFO ipc.HBaseClass: Retrying connect to server: /
10.254.234.32:60020. Already tried 2 time(s).
09/04/17 13:57:36 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 not
available yet, Zzzzz...

But if I check the web UI, the hbase master is still up (I tried refreshing it
several times).


And I have been getting a lot of exceptions from time to time, including
region servers going down (which happens very frequently and causes heavy
data loss, on production data at that), scanner timeout exceptions,
cannot-allocate-memory exceptions, etc.

I am working on an Amazon EC2 cluster with 6 nodes,
each node having the following hardware configuration:

   - Large Instance 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores
   with 2 EC2 Compute Units each), 850 GB of instance storage, 64-bit
   platform


I am using hadoop-0.19.0 and hbase 0.19.0 (resynced to all the nodes, and I
made sure that there is a symbolic link to hadoop-site.xml from hbase/conf).
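
For reference, the link was created with something along these lines on each
node (the install paths here are approximate, not the exact ones):

ln -s /usr/local/hadoop-0.19.0/conf/hadoop-site.xml /usr/local/hbase-0.19.0/conf/hadoop-site.xml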

Following is my configuration in hadoop-site.xml:
<configuration>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/mnt/hadoop</value>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://domU-12-31-39-00-E5-D2.compute-1.internal:50001</value>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>domU-12-31-39-00-E5-D2.compute-1.internal:50002</value>
</property>

<property>
  <name>tasktracker.http.threads</name>
  <value>80</value>
</property>

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>3</value>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>3</value>
</property>

<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>

<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>

<property>
  <name>dfs.client.block.write.retries</name>
  <value>3</value>
</property>

<property>
<name>mapred.child.java.opts</name>
<value>-Xmx4096m</value>
</property>

I gave it a high value since the RAM on each node is 7.5 GB, but I am not sure
about this setting.
** I got a "Cannot allocate memory" exception after making this setting (got it
for the first time). After going through the archives, someone suggested
enabling memory overcommit, but I am not sure about that either. **
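
If I do try the overcommit suggestion, my understanding is that it would be
something like this on each node (just a sketch based on the archives,
assuming a standard Linux AMI with root access; I have not applied it yet).
I also realise that with 3 map and 3 reduce slots per node, -Xmx4096m could
in the worst case ask for 6 x 4 GB = 24 GB of heap on a 7.5 GB machine, so
the heap is probably set too high in the first place.

# check the current overcommit policy (0 = heuristic, 1 = always allow)
cat /proc/sys/vm/overcommit_memory
# allow overcommit so fork() from a large JVM does not fail with error=12
sysctl -w vm.overcommit_memory=1
# make it persistent across reboots
echo "vm.overcommit_memory = 1" >> /etc/sysctl.conf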

<property>
<name>dfs.datanode.max.xcievers</name>
<value>4096</value>
</property>

As suggested by some of you... I guess this solved the data xceiver exceptions
on hadoop.

<property>
<name>dfs.datanode.handler.count</name>
<value>10</value>
</property>

<property>
 <name>mapred.task.timeout</name>
 <value>0</value>
 <description>The number of milliseconds before a task will be
 terminated if it neither reads an input, writes an output, nor
 updates its status string.
 </description>
</property>

This property has been set because I have been getting a lot of
"Cannot report in 602 seconds....killing" exceptions.

<property>
 <name>mapred.tasktracker.expiry.interval</name>
 <value>360000</value>
 <description>Expert: The time-interval, in milliseconds, after which
 a tasktracker is declared 'lost' if it doesn't send heartbeats.
 </description>
</property>

<property>
<name>dfs.datanode.socket.write.timeout</name>
<value>0</value>
</property>

To avoid socket timeout exceptions

<property>
  <name>dfs.replication</name>
  <value>5</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is
created.
  The default is used if replication is not specified in create time.
  </description>
</property>

<property>
 <name>mapred.job.reuse.jvm.num.tasks</name>
 <value>-1</value>
 <description>How many tasks to run per jvm. If set to -1, there is
 no limit.
 </description>
</property>

</configuration>


and following is the configuration on hbase-site.xml

<configuration>
  <property>
    <name>hbase.master</name>
    <value>domU-12-31-39-00-E5-D2.compute-1.internal:60000</value>
  </property>

  <property>
    <name>hbase.rootdir</name>

<value>hdfs://domU-12-31-39-00-E5-D2.compute-1.internal:50001/hbase</value>
  </property>

<property>
   <name>hbase.regionserver.lease.period</name>
   <value>12600000</value>
   <description>HRegion server lease period in milliseconds. Default is
   60 seconds. Clients must report in within this period else they are
   considered dead.</description>
 </property>


I have set this because there is a map-reduce program which takes almost 3-4
minutes to process a row; the worst case is 7 minutes.
So the value has been calculated as (7*60*1000) * 30 = 12600000,
where 7*60*1000 = worst-case time to process a row in ms
and 30 = the default hbase scanner caching,
so I shouldn't be getting scanner timeout exceptions.

** Made this change today... I haven't come across a scanner timeout
exception today. **
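
The other option I am considering, instead of such a long lease, is to lower
the scanner caching so that each lease renewal only has to cover one slow row.
Just a sketch; I am not certain the property name below exists in 0.19 (it may
only be settable per scan in code in this version):

<property>
  <name>hbase.client.scanner.caching</name>
  <value>1</value>
  <description>Rows fetched per scanner next() call. With rows that take
  minutes to process, a small value keeps each renewal well within
  hbase.regionserver.lease.period.</description>
</property>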

<property>
   <name>hbase.master.lease.period</name>
   <value>3600000</value>
   <description>HMaster server lease period in milliseconds. Default is
   120 seconds.  Region servers must report in within this period else
   they are considered dead.  On loaded cluster, may need to up this
   period.</description>
 </property>

</configuration>


Any suggestions on changes to the configuration??

My main concern is the region servers going down from time to time, which
happens very frequently; because of it my map-reduce tasks hang and the
entire application fails :(

I have tried almost all the suggestions mentioned by you, except separating
the datanodes from the computational nodes, which I plan to do tomorrow.
Has it been tried before?
And what would be your recommendation? How many nodes should I use as
datanodes and how many as computational nodes?

I am hoping that the cluster will be stable by tomorrow :)

Thanks a ton,
Raakhi

Re: Ec2 instability

Posted by Tim Hawkins <ti...@bejant.com>.
I would be interested in understanding what problems you are having.
We are using 0.19.0 in production on EC2, running Nutch and a set of
custom apps in a mixed workload on a farm of 5 instances.



RE: Ec2 instability

Posted by Ted Coyle <Te...@MEDecision.com>.
Rakhi,
I'd suggest going to 0.19.1 for both hbase and hadoop.

We had so many problems with 0.19.0 on EC2 that we couldn't use it.
We are having problems with name resolution and the generic startup scripts
with the 0.19.1 release, but nothing that is a show stopper.

Ted


-----Original Message-----
From: Rakhi Khatwani [mailto:rakhi.khatwani@gmail.com] 
Sent: Friday, April 17, 2009 12:45 PM
To: hbase-user@hadoop.apache.org; core-user@hadoop.apache.org
Subject: Re: Ec2 instability

Hi,
 this is the exception I have been getting in the map-reduce tasks:

java.io.IOException: Cannot run program "bash": java.io.IOException:
error=12, Cannot allocate memory
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
	at org.apache.hadoop.util.Shell.run(Shell.java:134)
	at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
	at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:321)
	at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
	at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1199)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:857)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
	at org.apache.hadoop.mapred.Child.main(Child.java:155)
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
	at java.lang.ProcessImpl.start(ProcessImpl.java:65)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
	... 10 more






Re: Ec2 instability

Posted by Rakhi Khatwani <ra...@gmail.com>.
 Hi,

     I have 6 instances allocated.
I haven't tried adding more instances because I have a maximum of 30,000 rows in
my hbase tables. What do you recommend?
I have at most 4-5 concurrent map/reduce tasks on one node.
How do we characterize the memory usage of mappers and reducers?
I am running Spinn3r in addition to the regular hadoop/hbase, but Spinn3r is
being called from one of my map tasks.
I am not running Ganglia or any other program to characterize resource usage
over time.
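
Until Ganglia is set up, I suppose even a crude loop like the one below, run
on each node, would give us a memory and load history to look at (just a
sketch; the log path is chosen arbitrarily):

# append memory and load figures every 60 seconds
while true; do
  date >> /mnt/resource-usage.log
  free -m >> /mnt/resource-usage.log
  uptime >> /mnt/resource-usage.log
  sleep 60
done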

Thanks,
Raakhi

On Sat, Apr 18, 2009 at 7:09 PM, Andrew Purtell <ap...@apache.org> wrote:

>
> Hi,
>
> This is an OS level exception. Your node is out of memory
> even to fork a process.
>
> How many instances do you currently have allocated? Have
> you increased the number of instances over time to try and
> spread the load of your application around? How many
> concurrent mapper and/or reducer processes do you execute
> on a node? Can you characterize the memory usage of your
> mappers and reducers? Are you running other processes
> external to hadoop/hbase which consume a lot of memory? Are
> you running Ganglia or similar to track and characterize
> resource usage over time?
>
> You may find you are trying to solve a 100 node problem
> with 10.
>
>   - Andy
>
> > From: Rakhi Khatwani
> > Subject: Re: Ec2 instability
> > To: hbase-user@hadoop.apache.org, core-user@hadoop.apache.org
> > Date: Friday, April 17, 2009, 9:44 AM
>  > Hi,
> >  This is the exception I have been getting from the map-reduce job:
> >
> > java.io.IOException: Cannot run program "bash":
> > java.io.IOException:
> > error=12, Cannot allocate memory
> >       at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
> >       at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
> >       at org.apache.hadoop.util.Shell.run(Shell.java:134)
> >       at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
> >       at
> >
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:321)
> >       at
> >
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
> >       at
> >
> org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
> >       at
> >
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1199)
> >       at
> > org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:857)
> >       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
> >       at org.apache.hadoop.mapred.Child.main(Child.java:155)
> > Caused by: java.io.IOException: java.io.IOException:
> > error=12, Cannot
> > allocate memory
> >       at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
> >       at java.lang.ProcessImpl.start(ProcessImpl.java:65)
> >       at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
> >       ... 10 more
>
>
>
>
>

Re: Ec2 instability

Posted by Andrew Purtell <ap...@apache.org>.
Hi,

This is an OS level exception. Your node is out of memory
even to fork a process. 

How many instances do you currently have allocated? Have
you increased the number of instances over time to try and
spread the load of your application around? How many
concurrent mapper and/or reducer processes do you execute
on a node? Can you characterize the memory usage of your
mappers and reducers? Are you running other processes
external to hadoop/hbase which consume a lot of memory? Are
you running Ganglia or similar to track and characterize
resource usage over time? 

You may find you are trying to solve a 100 node problem
with 10.

   - Andy
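
To make that concrete, a rough back-of-the-envelope budget using the figures
posted elsewhere in this thread (3 map slots and 3 reduce slots per node,
-Xmx4096m child JVMs, 7.5 GB Large instances) shows the worst-case commitment
is far beyond a node's RAM; once memory is exhausted, even forking a tiny
"bash" process fails with error=12 (ENOMEM), which is the exception quoted
below. The figure for the resident daemons is a placeholder assumption:

    public class NodeMemoryBudget {
        public static void main(String[] args) {
            // Figures taken from the configuration posted in this thread.
            int mapSlots = 3;            // mapred.tasktracker.map.tasks.maximum
            int reduceSlots = 3;         // mapred.tasktracker.reduce.tasks.maximum
            double childHeapGb = 4.0;    // mapred.child.java.opts = -Xmx4096m
            double nodeRamGb = 7.5;      // EC2 Large instance

            // Worst case: every slot occupied and every child grown to its -Xmx.
            double childJvms = (mapSlots + reduceSlots) * childHeapGb;

            // The same node also hosts a DataNode, a TaskTracker and an HBase
            // region server; 2 GB combined is an assumed placeholder, not a
            // figure from the original mail.
            double daemonsGb = 2.0;

            double committed = childJvms + daemonsGb;
            System.out.printf("Worst-case commitment: %.1f GB on a %.1f GB node%n",
                    committed, nodeRamGb);   // 26.0 GB on a 7.5 GB node
        }
    }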

> From: Rakhi Khatwani
> Subject: Re: Ec2 instability
> To: hbase-user@hadoop.apache.org, core-user@hadoop.apache.org
> Date: Friday, April 17, 2009, 9:44 AM
> Hi,
>  This is the exception I have been getting from the map-reduce job:
> 
> java.io.IOException: Cannot run program "bash":
> java.io.IOException:
> error=12, Cannot allocate memory
> 	at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
> 	at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
> 	at org.apache.hadoop.util.Shell.run(Shell.java:134)
> 	at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
> 	at
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:321)
> 	at
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
> 	at
> org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
> 	at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1199)
> 	at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:857)
> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
> 	at org.apache.hadoop.mapred.Child.main(Child.java:155)
> Caused by: java.io.IOException: java.io.IOException:
> error=12, Cannot
> allocate memory
> 	at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
> 	at java.lang.ProcessImpl.start(ProcessImpl.java:65)
> 	at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
> 	... 10 more



      


Re: Ec2 instability

Posted by Rakhi Khatwani <ra...@gmail.com>.
Hi,
 This is the exception I have been getting from the map-reduce job:

java.io.IOException: Cannot run program "bash": java.io.IOException:
error=12, Cannot allocate memory
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
	at org.apache.hadoop.util.Shell.run(Shell.java:134)
	at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
	at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:321)
	at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
	at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1199)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:857)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
	at org.apache.hadoop.mapred.Child.main(Child.java:155)
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot
allocate memory
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
	at java.lang.ProcessImpl.start(ProcessImpl.java:65)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
	... 10 more



On Fri, Apr 17, 2009 at 10:09 PM, Rakhi Khatwani
<ra...@gmail.com>wrote:

> Hi,
>         Its been several days since we have been trying to stabilize
> hadoop/hbase on ec2 cluster. but failed to do so.
> We still come across frequent region server fails, scanner timeout
> exceptions and OS level deadlocks etc...
>
> and 2day while doing a list of tables on hbase i get the following
> exception:
>
> hbase(main):001:0> list
> 09/04/17 13:57:18 INFO ipc.HBaseClass: Retrying connect to server: /
> 10.254.234.32:60020. Already tried 0 time(s).
> 09/04/17 13:57:19 INFO ipc.HBaseClass: Retrying connect to server: /
> 10.254.234.32:60020. Already tried 1 time(s).
> 09/04/17 13:57:20 INFO ipc.HBaseClass: Retrying connect to server: /
> 10.254.234.32:60020. Already tried 2 time(s).
> 09/04/17 13:57:20 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 not
> available yet, Zzzzz...
> 09/04/17 13:57:20 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 could
> not be reached after 1 tries, giving up.
> 09/04/17 13:57:21 INFO ipc.HBaseClass: Retrying connect to server: /
> 10.254.234.32:60020. Already tried 0 time(s).
> 09/04/17 13:57:22 INFO ipc.HBaseClass: Retrying connect to server: /
> 10.254.234.32:60020. Already tried 1 time(s).
> 09/04/17 13:57:23 INFO ipc.HBaseClass: Retrying connect to server: /
> 10.254.234.32:60020. Already tried 2 time(s).
> 09/04/17 13:57:23 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 not
> available yet, Zzzzz...
> 09/04/17 13:57:23 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 could
> not be reached after 1 tries, giving up.
> 09/04/17 13:57:26 INFO ipc.HBaseClass: Retrying connect to server: /
> 10.254.234.32:60020. Already tried 0 time(s).
> 09/04/17 13:57:27 INFO ipc.HBaseClass: Retrying connect to server: /
> 10.254.234.32:60020. Already tried 1 time(s).
> 09/04/17 13:57:28 INFO ipc.HBaseClass: Retrying connect to server: /
> 10.254.234.32:60020. Already tried 2 time(s).
> 09/04/17 13:57:28 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 not
> available yet, Zzzzz...
> 09/04/17 13:57:28 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 could
> not be reached after 1 tries, giving up.
> 09/04/17 13:57:29 INFO ipc.HBaseClass: Retrying connect to server: /
> 10.254.234.32:60020. Already tried 0 time(s).
> 09/04/17 13:57:30 INFO ipc.HBaseClass: Retrying connect to server: /
> 10.254.234.32:60020. Already tried 1 time(s).
> 09/04/17 13:57:31 INFO ipc.HBaseClass: Retrying connect to server: /
> 10.254.234.32:60020. Already tried 2 time(s).
> 09/04/17 13:57:31 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 not
> available yet, Zzzzz...
> 09/04/17 13:57:31 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 could
> not be reached after 1 tries, giving up.
> 09/04/17 13:57:34 INFO ipc.HBaseClass: Retrying connect to server: /
> 10.254.234.32:60020. Already tried 0 time(s).
> 09/04/17 13:57:35 INFO ipc.HBaseClass: Retrying connect to server: /
> 10.254.234.32:60020. Already tried 1 time(s).
> 09/04/17 13:57:36 INFO ipc.HBaseClass: Retrying connect to server: /
> 10.254.234.32:60020. Already tried 2 time(s).
> 09/04/17 13:57:36 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 not
> available yet, Zzzzz...
>
> but if i check on the UI, hbase master is still on, (tried refreshing it
> several times).
>
>
> and i have been getting a lot of exceptions from time to time including
> region servers going down (which happens very frequently due to which there
> is heavy data loss... that too on production data), scanner timeout
> exceptions, cannot allocate memory exceptions etc.
>
> I am working on an Amazon EC2 cluster with 6 nodes,
> each node having the following hardware configuration:
>
>    - Large Instance 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores
>    with 2 EC2 Compute Units each), 850 GB of instance storage, 64-bit
>    platform
>
>
> I am using hadoop-0.19.0 and hbase 0.19.0 (resynced to all the nodes and
> made sure that there is a symbolic link to hadoop-site from hbase/conf)
>
> Following is my configuration on hadoop-site.xml
> <configuration>
>
> <property>
>   <name>hadoop.tmp.dir</name>
>   <value>/mnt/hadoop</value>
> </property>
>
> <property>
>   <name>fs.default.name</name>
>   <value>hdfs://domU-12-31-39-00-E5-D2.compute-1.internal:50001</value>
> </property>
>
> <property>
>   <name>mapred.job.tracker</name>
>   <value>domU-12-31-39-00-E5-D2.compute-1.internal:50002</value>
> </property>
>
> <property>
>   <name>tasktracker.http.threads</name>
>   <value>80</value>
> </property>
>
> <property>
>   <name>mapred.tasktracker.map.tasks.maximum</name>
>   <value>3</value>
> </property>
>
> <property>
>   <name>mapred.tasktracker.reduce.tasks.maximum</name>
>   <value>3</value>
> </property>
>
> <property>
>   <name>mapred.output.compress</name>
>   <value>true</value>
> </property>
>
> <property>
>   <name>mapred.output.compression.type</name>
>   <value>BLOCK</value>
> </property>
>
> <property>
>   <name>dfs.client.block.write.retries</name>
>   <value>3</value>
> </property>
>
> <property>
> <name>mapred.child.java.opts</name>
> <value>-Xmx4096m</value>
> </property>
>
> I gave it a high value since the RAM on each node is 7 GB, though I am not
> sure about this setting.
> ** I got the "Cannot allocate memory" exception for the first time after
> making this change. Going through the archives, someone suggested enabling
> memory overcommit; I am not sure about that either. **
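
A more conservative setup would cap the child heaps well below
(node RAM / task slots) and bound JVM reuse; a hedged sketch using the standard
JobConf API, where the 1 GB heap and reuse count are only illustrations, not
values recommended anywhere in this thread:

    import org.apache.hadoop.mapred.JobConf;

    public class ConservativeJobSetup {
        public static JobConf configure(JobConf conf) {
            // With 3 map + 3 reduce slots per 7.5 GB node, a 4 GB -Xmx per
            // child can commit far more memory than the node has. Keep child
            // heaps small enough that (map slots + reduce slots) * heap fits
            // alongside the DataNode, TaskTracker and region server.
            conf.set("mapred.child.java.opts", "-Xmx1024m");   // illustrative

            // Reusing JVMs without limit (-1) keeps every child resident;
            // a finite value lets idle children exit and free their memory.
            conf.setInt("mapred.job.reuse.jvm.num.tasks", 1);  // illustrative
            return conf;
        }
    }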
>
> <property>
> <name>dfs.datanode.max.xcievers</name>
> <value>4096</value>
> </property>
>
> As suggested by some of you; I think this solved the data xceivers
> exception on hadoop.
>
> <property>
> <name>dfs.datanode.handler.count</name>
> <value>10</value>
> </property>
>
> <property>
>  <name>mapred.task.timeout</name>
>  <value>0</value>
>  <description>The number of milliseconds before a task will be
>  terminated if it neither reads an input, writes an output, nor
>  updates its status string.
>  </description>
> </property>
>
> This property has been set because I have been getting a lot of
> "Cannot report in 602 seconds.... killing" errors.
>
> <property>
>  <name>mapred.tasktracker.expiry.interval</name>
>  <value>360000</value>
>  <description>Expert: The time-interval, in miliseconds, after which
>  a tasktracker is declared 'lost' if it doesn't send heartbeats.
>  </description>
> </property>
>
> <property>
> <name>dfs.datanode.socket.write.timeout</name>
> <value>0</value>
> </property>
>
> To avoid socket timeout exceptions
>
> <property>
>   <name>dfs.replication</name>
>   <value>5</value>
>   <description>Default block replication.
>   The actual number of replications can be specified when the file is
> created.
>   The default is used if replication is not specified in create time.
>   </description>
> </property>
>
> <property>
>  <name>mapred.job.reuse.jvm.num.tasks</name>
>  <value>-1</value>
>  <description>How many tasks to run per jvm. If set to -1, there is
>  no limit.
>  </description>
> </property>
>
> </configuration>
>
>
> and following is the configuration on hbase-site.xml
>
> <configuration>
>   <property>
>     <name>hbase.master</name>
>     <value>domU-12-31-39-00-E5-D2.compute-1.internal:60000</value>
>   </property>
>
>   <property>
>     <name>hbase.rootdir</name>
>
> <value>hdfs://domU-12-31-39-00-E5-D2.compute-1.internal:50001/hbase</value>
>   </property>
>
> <property>
>    <name>hbase.regionserver.lease.period</name>
>    <value>12600000</value>
>    <description>HRegion server lease period in milliseconds. Default is
>    60 seconds. Clients must report in within this period else they are
>    considered dead.</description>
>  </property>
>
>
> I have set this because there is a map-reduce program which takes almost
> 3-4 minutes to process a row; the worst case is 7 minutes.
> So this has been calculated as (7*60*1000) * 30 = 12600000
> where (7*60*1000) = worst-case time to process one row in ms,
> and 30 = the default hbase scanner caching.
> So I shouldn't be getting scanner timeout exceptions.
>
> ** Made this change today; I haven't come across a scanner timeout
> exception since. **
>
> <property>
>    <name>hbase.master.lease.period</name>
>    <value>3600000</value>
>    <description>HMaster server lease period in milliseconds. Default is
>    120 seconds.  Region servers must report in within this period else
>    they are considered dead.  On loaded cluster, may need to up this
>    period.</description>
>  </property>
>
> </configuration>
>
>
> Any suggestions on changes to the configuration?
>
> My main concern is the region servers going down, which happens very
> frequently; when that happens my map-reduce tasks hang and the entire
> application fails :(
>
> I have tried almost all the suggestions mentioned by you except
> separating the datanodes from the computational nodes, which I plan to
> do tomorrow. Has it been tried before, and what would be your
> recommendation? How many nodes should I dedicate to data and how many
> to computation?
>
> I am hoping the cluster will be stable by tomorrow :)
>
> Thanks a ton,
> Raakhi
>
