Posted to user@hbase.apache.org by anil gupta <an...@gmail.com> on 2012/08/07 19:59:14 UTC

Re: Bulk loading job failed when one region server went down in the cluster

Hi HBase Folks,

I ran the bulk loader last night to load data into a table. During the
bulk loading job one of the region servers crashed and the entire job
failed. The job normally takes around 2.5 hours to finish, and it failed
when it was around 50% complete. After the failure the table was also
left corrupted in HBase. My cluster has 8 region servers.

Is bulk loading not fault tolerant to failure of region servers?

I am reviving this old email thread because my question went unanswered
at the time. Please share your views.

Thanks,
Anil Gupta

On Tue, Apr 3, 2012 at 9:12 AM, anil gupta <an...@gmail.com> wrote:

> Hi Kevin,
>
> I am not really concerned about the RegionServer going down, as the same
> thing can happen when deployed in production. Although in production we
> won't be running in a VM environment, and I am aware that my current Dev
> environment is not suited for heavy processing. What I am concerned about is
> the failure of the bulk loading job when the Region Server failed. Does this
> mean that the bulk loading job is not fault tolerant to the failure of a
> Region Server? I was expecting the job to succeed even though the
> RegionServer failed, because there were 6 more RSs running in the cluster.
> Fault tolerance is one of the biggest selling points of the Hadoop platform.
> Let me know your views.
> Thanks for your time.
>
> Thanks,
> Anil Gupta
>
>
> On Tue, Apr 3, 2012 at 7:34 AM, Kevin O'dell <ke...@cloudera.com>wrote:
>
>> Anil,
>>
>>  I am sorry for the delayed response.  Reviewing the logs it appears:
>>
>> 2/03/30 15:38:31 INFO zookeeper.ClientCnxn: Client session timed out,
>> have not heard from server in 59311ms for sessionid 0x136557f99c90065,
>> closing socket connection and attempting reconnect
>>
>> 12/03/30 15:38:32 FATAL regionserver.HRegionServer: ABORTING region
>> server serverName=ihub-dn-b1,60020,1332955859363, load=(requests=0,
>> regions=44, usedHeap=446, maxHeap=1197): Unhandled exception:
>> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
>> currently processing ihub-dn-b1,60020,1332955859363 as dead server
>>
>>   It appears to be a classic overworked RS.  You were doing too much
>> for the RS and it did not respond in time, so the Master marked it as
>> dead; when the RS finally responded, the Master said no, you are already
>> dead, and aborted the server.  This is why you see the YouAreDeadException.
>> This is probably due to the shared resources of the VM infrastructure
>> you are running.  You will either need to devote more resources or add
>> more nodes (most likely physical) to the cluster if you would like to
>> keep running these jobs.
>>
>> On Fri, Mar 30, 2012 at 9:24 PM, anil gupta <an...@buffalo.edu> wrote:
>> > Hi Kevin,
>> >
>> > Here is dropbox link to the log file of region server which failed:
>> >
>> http://dl.dropbox.com/u/64149128/hbase-hbase-regionserver-ihub-dn-b1.out
>> > IMHO, the problem starts at line #3009, which says: 12/03/30
>> 15:38:32
>> > FATAL regionserver.HRegionServer: ABORTING region server
>> > serverName=ihub-dn-b1,60020,1332955859363, load=(requests=0, regions=44,
>> > usedHeap=446, maxHeap=1197): Unhandled exception:
>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
>> > currently processing ihub-dn-b1,60020,1332955859363 as dead server
>> >
>> > I have already tested the fault tolerance of HBase by manually bringing
>> > down a RS while querying a table, and it worked fine. I was expecting the
>> > same today (even though the RS went down by itself this time) when I was
>> > loading the data, but it didn't work out well.
>> > Thanks for your time. Let me know if you need more details.
>> >
>> > ~Anil
>> >
>> > On Fri, Mar 30, 2012 at 6:05 PM, Kevin O'dell <kevin.odell@cloudera.com
>> >wrote:
>> >
>> >> Anil,
>> >>
>> >>  Can you please attach the RS logs from the failure?
>> >>
>> >> On Fri, Mar 30, 2012 at 7:05 PM, anil gupta <an...@buffalo.edu>
>> wrote:
>> >> > Hi All,
>> >> >
>> >> > I am using cdh3u2 and I have 7 worker nodes (VMs spread across two
>> >> > machines) which are running Datanode, Tasktracker, and Region Server
>> >> > (1200 MB heap size). I was loading data into HBase using the Bulk Loader
>> >> > with a custom mapper. I was loading around 34 million records, and I
>> >> > have loaded the same set of data in the same environment many times
>> >> > before without any problem. This time, while loading the data, one of
>> >> > the region servers failed (but the DN and TT kept running on that node)
>> >> > and then, after numerous map-task failures, the loading job failed. Is
>> >> > there any setting/configuration which can make Bulk Loading
>> >> > fault-tolerant to the failure of region servers?
>> >> >
>> >> > --
>> >> > Thanks & Regards,
>> >> > Anil Gupta
>> >>
>> >>
>> >>
>> >> --
>> >> Kevin O'Dell
>> >> Customer Operations Engineer, Cloudera
>> >>
>> >
>> >
>> >
>> > --
>> > Thanks & Regards,
>> > Anil Gupta
>>
>>
>>
>> --
>> Kevin O'Dell
>> Customer Operations Engineer, Cloudera
>>
>> --
>> Thanks & Regards,
>> Anil Gupta
>>
>>


-- 
Thanks & Regards,
Anil Gupta

Re: Bulk loading job failed when one region server went down in the cluster

Posted by anil gupta <an...@gmail.com>.
Hi Mike,

I knew this would be your next response. :) However, as I said earlier, this
cluster is for HBase. At present, I only use MR for loading data.

Thanks,
Anil

On Mon, Aug 13, 2012 at 8:12 PM, Michael Segel <mi...@hotmail.com>wrote:

> Anil,
>
> Same hardware, fewer VMs.
>
> On Aug 13, 2012, at 9:49 PM, Anil Gupta <an...@gmail.com> wrote:
>
> > Hi Mike,
> > I am constrained by the hardware available for the POC cluster. We are
> > waiting for hardware which we will use for performance testing.
> >
> >
> > Best Regards,
> > Anil
> >
> > On Aug 13, 2012, at 6:59 PM, Michael Segel <mi...@hotmail.com>
> wrote:
> >
> >> Anil,
> >>
> >> I don't know if you can call it a bug if you don't have enough memory
> available.
> >>
> >> I mean if you don't use HBase, then you may have more leeway in terms
> of swap.
> >>
> >> You can also do more tuning of HBase to handle the additional latency
> found in a Virtual environment.
> >>
> >> Why don't you rebuild your vm's to be slightly larger in terms of
> memory?
> >>
> >>
> >> On Aug 13, 2012, at 8:05 PM, anil gupta <an...@gmail.com> wrote:
> >>
> >>> Hi Mike,
> >>>
> >>> You hit the nail on the head that I need to lower the memory by setting
> >>> yarn.nodemanager.resource.memory-mb. Here's another major YARN bug that
> >>> you are touching on. I already tried setting that property to 1500 MB in
> >>> yarn-site.xml and setting yarn.app.mapreduce.am.resource.mb to 1000 MB in
> >>> mapred-site.xml. If I make this change then the YARN job does not run at
> >>> all, even though the configuration is right. It's a bug and I have to
> >>> file a JIRA for it. So I was only left with the option of letting it run
> >>> with the incorrect YARN conf, since my objective is to load data into
> >>> HBase rather than playing with YARN. MapReduce is only used for bulk
> >>> loading in my cluster.
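> >>>
> >>> For reference, this is roughly what I had tried -- just a sketch with the
> >>> values from my experiment on these 3.2 GB VMs, not a recommendation:
> >>>
> >>> <!-- yarn-site.xml: total memory the NodeManager may hand out to containers -->
> >>> <property>
> >>>   <name>yarn.nodemanager.resource.memory-mb</name>
> >>>   <value>1500</value>
> >>> </property>
> >>>
> >>> <!-- mapred-site.xml: memory requested for the MR ApplicationMaster container -->
> >>> <property>
> >>>   <name>yarn.app.mapreduce.am.resource.mb</name>
> >>>   <value>1000</value>
> >>> </property>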
> >>>
> >>> Here is a link to the mailing list email regarding running YARN with
> >>> less memory:
> >>> http://permalink.gmane.org/gmane.comp.jakarta.lucene.hadoop.user/33164
> >>>
> >>> It would be great if you could answer this simple question of mine: Is
> >>> HBase Bulk Loading fault tolerant to Region Server failures in a
> >>> viable/decent environment?
> >>>
> >>> Thanks,
> >>> Anil Gupta
> >>>
> >>> On Mon, Aug 13, 2012 at 5:17 PM, Michael Segel <
> michael_segel@hotmail.com>wrote:
> >>>
> >>>> Not sure why you're having an issue getting an answer.
> >>>> Even if you're not a YARN expert, Google is your friend.
> >>>>
> >>>> See:
> >>>>
> >>>>
> http://books.google.com/books?id=Wu_xeGdU4G8C&pg=PA323&lpg=PA323&dq=Hadoop+YARN+setting+number+of+slots&source=bl&ots=i7xQYwQf-u&sig=ceuDmiOkbqTqok_HfIr3udvm6C0&hl=en&sa=X&ei=8JYpUNeZJMnxygGzqIGwCw&ved=0CEQQ6AEwAQ#v=onepage&q=Hadoop%20YARN%20setting%20number%20of%20slots&f=false
> >>>>
> >>>> This is a web page from Tom White's 3rd Edition.
> >>>>
> >>>> The bottom line...
> >>>> -=-
> >>>> The considerations for how much memory to dedicate to a node manager for
> >>>> running containers are similar to those discussed in “Memory” on page
> >>>> 307. Each Hadoop daemon uses 1,000 MB, so for a datanode and a node
> >>>> manager, the total is 2,000 MB. Set aside enough for other processes
> >>>> that are running on the machine, and the remainder can be dedicated to
> >>>> the node manager’s containers by setting the configuration property
> >>>> yarn.nodemanager.resource.memory-mb to the total allocation in MB.
> >>>> (The default is 8,192 MB.)
> >>>> -=-
> >>>>
> >>>> Taken per fair use. Page 323
> >>>>
> >>>> As you can see you need to drop this down to something like 1GB if you
> >>>> even have enough memory for that.
> >>>> Again, set yarn.nodemanager.resource.memory-mb to a more realistic value.
> >>>>
> >>>> 8GB on a 3 GB node? Yeah that would really hose you, especially if
> you're
> >>>> trying to run HBase too.
> >>>>
> >>>> Even here... You really don't have enough memory to do it all. (Maybe
> >>>> enough to do a small test)
> >>>>
> >>>>
> >>>>
> >>>> Good luck.
> >>>>
> >>>> On Aug 13, 2012, at 3:24 PM, anil gupta <an...@gmail.com>
> wrote:
> >>>>
> >>>>
> >>>>> Hi Mike,
> >>>>>
> >>>>> Here is the link to my email on the Hadoop list regarding the YARN problem:
> >>>>>
> >>>>
> http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201208.mbox/%3CCAF1+Vs8oF4VsHbg14B7SGzBB_8Ty7GC9Lw3nm1bM0v+24CkEBw@mail.gmail.com%3E
> >>>>>
> >>>>> Somehow the link to the Cloudera mail in my last email does not seem
> >>>>> to work.
> >>>>> Here is the new link:
> >>>>>
> >>>>
> https://groups.google.com/a/cloudera.org/forum/?fromgroups#!searchin/cdh-user/yarn$20anil/cdh-user/J564g9A8tPE/ZpslzOkIGZYJ%5B1-25%5D
> >>>>>
> >>>>> Thanks for your help,
> >>>>> Anil Gupta
> >>>>>
> >>>>> On Mon, Aug 13, 2012 at 1:14 PM, anil gupta <an...@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>>> Hi Mike,
> >>>>>>
> >>>>>> I tried doing that by setting properties in mapred-site.xml, but YARN
> >>>>>> doesn't seem to honor the "mapreduce.tasktracker.map.tasks.maximum"
> >>>>>> property. Here is a reference to a discussion of the same problem:
> >>>>>>
> >>>>>>
> >>>>
> https://groups.google.com/a/cloudera.org/forum/?fromgroups#!searchin/cdh-user/yarn$20anil/cdh-user/J564g9A8tPE/ZpslzOkIGZYJ[1-25]
> >>>>>> I have also posted about the same problem on the Hadoop mailing list.
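> >>>>>>
> >>>>>> My understanding of the MR2 way (an assumption on my part, not
> >>>>>> something I have verified on this cluster) is that the old slot count
> >>>>>> is gone, and the number of concurrent map containers per node is
> >>>>>> roughly yarn.nodemanager.resource.memory-mb divided by the per-map
> >>>>>> memory request, e.g.:
> >>>>>>
> >>>>>> <!-- mapred-site.xml: memory requested per map container; value is illustrative -->
> >>>>>> <property>
> >>>>>>   <name>mapreduce.map.memory.mb</name>
> >>>>>>   <value>768</value>
> >>>>>> </property>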
> >>>>>>
> >>>>>> I already admitted in my previous email that YARN has major issues
> >>>>>> when we want to control it in a low-memory environment. I was just
> >>>>>> trying to get the views of HBase experts on bulk load failures, since
> >>>>>> we will be relying heavily on fault tolerance.
> >>>>>> If the HBase Bulk Loader is fault tolerant to the failure of an RS in a
> >>>>>> viable environment, then I don't have any issue. I hope this clears up
> >>>>>> my purpose of posting on this topic.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Anil
> >>>>>>
> >>>>>> On Mon, Aug 13, 2012 at 12:39 PM, Michael Segel <
> >>>> michael_segel@hotmail.com
> >>>>>>> wrote:
> >>>>>>
> >>>>>>> Anil,
> >>>>>>>
> >>>>>>> Do you know what happens when you have an airplane that has too
> heavy a
> >>>>>>> cargo when it tries to take off?
> >>>>>>> You run out of runway and you crash and burn.
> >>>>>>>
> >>>>>>> Looking at your post, why are you starting 8 map processes on each
> >>>>>>> slave? That's tunable, and you clearly do not have enough memory in
> >>>>>>> each VM to support 8 slots on a node. So you swap, and when you swap
> >>>>>>> you cause HBase to crash and burn.
> >>>>>>>
> >>>>>>> 3.2 GB of memory means no more than 1 slot per slave, and even then
> >>>>>>> you're going to be very tight. Not to mention that you will need to
> >>>>>>> loosen up on your timings, since it's all virtual and you have way too
> >>>>>>> much i/o per drive going on.
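> >>>>>>>
> >>>>>>> For example, one timing you could loosen (a sketch only; the number is
> >>>>>>> made up and I have not checked it against your configs):
> >>>>>>>
> >>>>>>> <!-- hbase-site.xml: RS ZooKeeper session timeout; the ZK server's
> >>>>>>>      maxSessionTimeout must also allow a value this large -->
> >>>>>>> <property>
> >>>>>>>   <name>zookeeper.session.timeout</name>
> >>>>>>>   <value>120000</value>
> >>>>>>> </property>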
> >>>>>>>
> >>>>>>>
> >>>>>>> My suggestion is that you go back and tune your system before
> thinking
> >>>>>>> about running anything.
> >>>>>>>
> >>>>>>> HTH
> >>>>>>>
> >>>>>>> -Mike
> >>>>>>>
> >>>>>>> On Aug 13, 2012, at 2:11 PM, anil gupta <an...@gmail.com>
> wrote:
> >>>>>>>
> >>>>>>>> Hi Guys,
> >>>>>>>>
> >>>>>>>> Sorry for not mentioning the version I am currently running. My
> >>>>>>>> current version is HBase 0.92.1 (cdh4), running Hadoop 2.0.0-alpha
> >>>>>>>> with YARN for MR. My original post was for HBase 0.92. Here are some
> >>>>>>>> more details of my current setup:
> >>>>>>>> I am running an 8 slave, 4 admin node cluster on CentOS 6.0 VMs
> >>>>>>>> installed on VMware Hypervisor 5.0. Each of my VMs has 3.2 GB of
> >>>>>>>> memory and 500 HDFS space.
> >>>>>>>> I use this cluster for POC (Proof of Concept). I am not looking for
> >>>>>>>> any performance benchmarking from this set-up. Due to some major bugs
> >>>>>>>> in YARN I am unable to make it work properly with less than 4 GB of
> >>>>>>>> memory. I am already discussing them on the Hadoop Mailing List.
> >>>>>>>>
> >>>>>>>> Here is the log of failed mapper: http://pastebin.com/f83xE2wv
> >>>>>>>>
> >>>>>>>> The problem is that when I start a bulk loading job in YARN, 8 map
> >>>>>>>> processes start on each slave and all of my slaves are hammered badly.
> >>>>>>>> Because the slaves are getting hammered, the RegionServer's lease
> >>>>>>>> expires or it gets a YouAreDeadException. Here is the log of the RS
> >>>>>>>> which caused the job to fail: http://pastebin.com/9ZQx0DtD
> >>>>>>>>
> >>>>>>>> I am aware that this is happening due to underperforming hardware
> >>>>>>>> (two slaves share one 7200 rpm hard drive in my setup) and some major
> >>>>>>>> bugs around running YARN with less than 4 GB of memory. My only
> >>>>>>>> concern is the failure of the entire MR job and its fault tolerance
> >>>>>>>> to RS failures. I am not really concerned about the RS failure
> >>>>>>>> itself, since HBase is fault tolerant.
> >>>>>>>>
> >>>>>>>> Please let me know if you need anything else.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Anil
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Mon, Aug 13, 2012 at 6:58 AM, Michael Segel <
> >>>>>>> michael_segel@hotmail.com>wrote:
> >>>>>>>>
> >>>>>>>>> Yes, it can.
> >>>>>>>>> You can see an RS failure causing a cascading RS failure. Of course
> >>>>>>>>> YMMV, and it depends on which version you are running.
> >>>>>>>>>
> >>>>>>>>> OP is on CDH3u2, which still had some issues. CDH3u4 is the latest
> >>>>>>>>> and he should upgrade.
> >>>>>>>>>
> >>>>>>>>> (Or go to CDH4...)
> >>>>>>>>>
> >>>>>>>>> HTH
> >>>>>>>>>
> >>>>>>>>> -Mike
> >>>>>>>>>
> >>>>>>>>> On Aug 13, 2012, at 8:51 AM, Kevin O'dell <
> kevin.odell@cloudera.com>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Anil,
> >>>>>>>>>>
> >>>>>>>>>> Do you have a root cause for the RS failure?  I have never heard
> >>>>>>>>>> of one RS failure causing a whole job to fail.
> >>>>>>>>>>
> >
>
>


-- 
Thanks & Regards,
Anil Gupta

Re: Bulk loading job failed when one region server went down in the cluster

Posted by Michael Segel <mi...@hotmail.com>.
Anil,

Same hardware, fewer VMs.


Re: Bulk loading job failed when one region server went down in the cluster

Posted by Anil Gupta <an...@gmail.com>.
Hi Mike,
I am constrained by the hardware available for the POC cluster. We are waiting for hardware which we will use for performance testing.


Best Regards,
Anil


Re: Bulk loading job failed when one region server went down in the cluster

Posted by Michael Segel <mi...@hotmail.com>.
Anil, 

I don't know if you can call it a bug if you don't have enough memory available. 

I mean if you don't use HBase, then you may have more leeway in terms of swap. 

You can also tune HBase to tolerate the additional latency you get in a virtualized environment. 
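
For example, the timeouts most sensitive to VM latency live in hbase-site.xml. This is a sketch only, with property names as I recall them from the 0.92-era docs and purely illustrative values, not recommendations:

    <property>
      <!-- ZooKeeper session timeout in ms; a longer lease gives a stalled RS
           more time to check in before the master declares it dead. Make sure
           the ZooKeeper server's maxSessionTimeout actually allows the value
           you pick. -->
      <name>zookeeper.session.timeout</name>
      <value>120000</value>
    </property>
    <property>
      <!-- RPC timeout in ms; raise it when I/O stalls are expected -->
      <name>hbase.rpc.timeout</name>
      <value>120000</value>
    </property>

None of that fixes an undersized node; it only buys you slack.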

Why don't you rebuild your VMs to be slightly larger in terms of memory? 


On Aug 13, 2012, at 8:05 PM, anil gupta <an...@gmail.com> wrote:

> Hi Mike,
> 
> You hit the nail on the that i need to lower down the memory by setting
> yarn.nodemanager.resource.memory-mb. Here's another major bug of YARN you
> are talking about. I already tried setting that property to 1500 MB in
> yarn-site.xml and  setting yarn.app.mapreduce.am.resource.mb to 1000 MB in
> mapred-site.xml. If i do this change then the YARN job does not runs at all
> even though the configuration is right. It's a bug and i have to file a
> JIRA for it. So, i was only left with the option to let it run with
> incorrect YARN conf since my objective is to load data into HBase rather
> than playing with YARN. MapReduce is only used for bulk loading in my
> cluster.
> 
> Here is a link to the mailing list email regarding running YARN with lesser
> memory:
> http://permalink.gmane.org/gmane.comp.jakarta.lucene.hadoop.user/33164
> 
> It would be great if you can answer this simple question of mine: Is HBase
> Bulk Loading fault tolerant to Region Server failures in a viable/decent
> environment?
> 
> Thanks,
> Anil Gupta
> 
> On Mon, Aug 13, 2012 at 5:17 PM, Michael Segel <mi...@hotmail.com>wrote:
> 
>> Not sure why you're having an issue in getting an answer.
>> Even if you're not a YARN expert,  google is your friend.
>> 
>> See:
>> 
>> http://books.google.com/books?id=Wu_xeGdU4G8C&pg=PA323&lpg=PA323&dq=Hadoop+YARN+setting+number+of+slots&source=bl&ots=i7xQYwQf-u&sig=ceuDmiOkbqTqok_HfIr3udvm6C0&hl=en&sa=X&ei=8JYpUNeZJMnxygGzqIGwCw&ved=0CEQQ6AEwAQ#v=onepage&q=Hadoop%20YARN%20setting%20number%20of%20slots&f=false
>> 
>> This is a web page from Tom White's 3rd Edition.
>> 
>> The bottom line...
>> -=-
>> The considerations for how much memory to dedicate to a node manager for
>> running containers are similar to the those discussed in
>> 
>> “Memory” on page 307. Each Hadoop daemon uses 1,000 MB, so for a datanode
>> and a node manager, the total is 2,000 MB. Set aside enough for other
>> processes that are running on the machine, and the remainder can be
>> dedicated to the node manager’s containers by setting the configuration
>> property yarn.nodemanager.resource.memory-mb to the total allocation in MB.
>> (The default is 8,192 MB.)
>> -=-
>> 
>> Taken per fair use. Page 323
>> 
>> As you can see you need to drop this down to something like 1GB if you
>> even have enough memory for that.
>> Again set yarn.nodemanager.resource.memory-mb to a more realistic value.
>> 
>> 8GB on a 3 GB node? Yeah that would really hose you, especially if you're
>> trying to run HBase too.
>> 
>> Even here... You really don't have enough memory to do it all. (Maybe
>> enough to do a small test)
>> 
>> 
>> 
>> Good luck.
>> 
>> On Aug 13, 2012, at 3:24 PM, anil gupta <an...@gmail.com> wrote:
>> 
>> 
>>> Hi Mike,
>>> 
>>> Here is the link to my email on Hadoop list regarding YARN problem:
>>> 
>> http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201208.mbox/%3CCAF1+Vs8oF4VsHbg14B7SGzBB_8Ty7GC9Lw3nm1bM0v+24CkEBw@mail.gmail.com%3E
>>> 
>>> Somehow the link for cloudera mail in last email does not seems to work.
>>> Here is the new link:
>>> 
>> https://groups.google.com/a/cloudera.org/forum/?fromgroups#!searchin/cdh-user/yarn$20anil/cdh-user/J564g9A8tPE/ZpslzOkIGZYJ%5B1-25%5D
>>> 
>>> Thanks for your help,
>>> Anil Gupta
>>> 
>>> On Mon, Aug 13, 2012 at 1:14 PM, anil gupta <an...@gmail.com>
>> wrote:
>>> 
>>>> Hi Mike,
>>>> 
>>>> I tried doing that by setting up properties in mapred-site.xml but Yarn
>>>> doesnt seems to work with "mapreduce.tasktracker.
>>>> map.tasks.maximum" property. Here is a reference to a discussion to same
>>>> problem:
>>>> 
>>>> 
>> https://groups.google.com/a/cloudera.org/forum/?fromgroups#!searchin/cdh-user/yarn$20anil/cdh-user/J564g9A8tPE/ZpslzOkIGZYJ[1-25]
>>>> I have also posted about the same problem in Hadoop mailing list.
>>>> 
>>>> I already admitted in my previous email that YARN is having major issues
>>>> when we want to control it in low memory environment. I was just trying
>> to
>>>> get views HBase experts on bulk load failures since we will be relying
>>>> heavily on Fault Tolerance.
>>>> If HBase Bulk Loader is fault tolerant to failure of RS in a viable
>>>> environment  then I dont have any issue. I hope this clears up my
>> purpose
>>>> of posting on this topic.
>>>> 
>>>> Thanks,
>>>> Anil
>>>> 
>>>> On Mon, Aug 13, 2012 at 12:39 PM, Michael Segel <
>> michael_segel@hotmail.com
>>>>> wrote:
>>>> 
>>>>> Anil,
>>>>> 
>>>>> Do you know what happens when you have an airplane that has too heavy a
>>>>> cargo when it tries to take off?
>>>>> You run out of runway and you crash and burn.
>>>>> 
>>>>> Looking at your post, why are you starting 8 map processes on each
>> slave?
>>>>> That's tunable and you clearly do not have enough memory in each VM to
>>>>> support 8 slots on a node.
>>>>> Here you swap, you swap you cause HBase to crash and burn.
>>>>> 
>>>>> 3.2GB of memory means that no more than 1 slot per slave and even
>> then...
>>>>> you're going to be very tight. Not to mention that you will need to
>> loosen
>>>>> up on your timings since its all virtual and you have way too much i/o
>> per
>>>>> drive going on.
>>>>> 
>>>>> 
>>>>> My suggestion is that you go back and tune your system before thinking
>>>>> about running anything.
>>>>> 
>>>>> HTH
>>>>> 
>>>>> -Mike
>>>>> 
>>>>> On Aug 13, 2012, at 2:11 PM, anil gupta <an...@gmail.com> wrote:
>>>>> 
>>>>>> Hi Guys,
>>>>>> 
>>>>>> Sorry for not mentioning the version I am currently running. My
>> current
>>>>>> version is HBase 0.92.1(cdh4) and running Hadoop2.0.0-Alpha with YARN
>>>>> for
>>>>>> MR. My original post was for HBase0.92. Here are some more details of
>> my
>>>>>> current setup:
>>>>>> I am running a 8 slave, 4 admin node cluster on CentOS6.0 VM's
>>>>> installed on
>>>>>> VMware Hyprevisor 5.0. Each of my VM is having 3.2 GB of memory and
>> 500
>>>>>> HDFS space.
>>>>>> I use this cluster for POC(Proof of Concepts). I am not looking for
>> any
>>>>>> performance benchmarking from this set-up. Due to some major bugs in
>>>>> YARN i
>>>>>> am unable to make work in a proper way in memory less than 4GB. I am
>>>>>> already having discussion regarding them on Hadoop Mailing List.
>>>>>> 
>>>>>> Here is the log of failed mapper: http://pastebin.com/f83xE2wv
>>>>>> 
>>>>>> The problem is that when i start a Bulk loading job in YARN, 8 Map
>>>>>> processes start on each slave and then all of my slaves are hammered
>>>>> badly
>>>>>> due to this. Since the slaves are getting hammered badly then
>>>>> RegionServer
>>>>>> gets lease expired or YourAreDeadExpcetion. Here is the log of RS
>> which
>>>>>> caused the job to fail: http://pastebin.com/9ZQx0DtD
>>>>>> 
>>>>>> I am aware that this is happening due to underperforming hardware(Two
>>>>>> slaves are using one 7200 rpm Hard Drive in my setup) and some major
>>>>> bugs
>>>>>> regarding running YARN in less than 4 GB memory. My only concern is
>> the
>>>>>> failure of entire MR job and its fault tolerance to RS failures. I am
>>>>> not
>>>>>> really concerned about RS failure since HBase is fault tolerant.
>>>>>> 
>>>>>> Please let me know if you need anything else.
>>>>>> 
>>>>>> Thanks,
>>>>>> Anil
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Mon, Aug 13, 2012 at 6:58 AM, Michael Segel <
>>>>> michael_segel@hotmail.com>wrote:
>>>>>> 
>>>>>>> Yes, it can.
>>>>>>> You can see RS failure causing a cascading RS failure. Of course YMMV
>>>>> and
>>>>>>> it depends on which version you are running.
>>>>>>> 
>>>>>>> OP is on CHD3u2 which still had some issues. CDH3u4 is the latest and
>>>>> he
>>>>>>> should upgrade.
>>>>>>> 
>>>>>>> (Or go to CHD4...)
>>>>>>> 
>>>>>>> HTH
>>>>>>> 
>>>>>>> -Mike
>>>>>>> 
>>>>>>> On Aug 13, 2012, at 8:51 AM, Kevin O'dell <ke...@cloudera.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Anil,
>>>>>>>> 
>>>>>>>> Do you have root cause on the RS failure?  I have never heard of one
>>>>> RS
>>>>>>>> failure causing a whole job to fail.
>>>>>>>> 
>>>>>>>> On Tue, Aug 7, 2012 at 1:59 PM, anil gupta <an...@gmail.com>
>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Hi HBase Folks,
>>>>>>>>> 
>>>>>>>>> I ran the bulk loader yesterday night to load data in a table.
>> During
>>>>>>> the
>>>>>>>>> bulk loading job one of the region server crashed and the entire
>> job
>>>>>>>>> failed. It takes around 2.5 hours for this job to finish and the
>> job
>>>>>>> failed
>>>>>>>>> when it was at around 50% complete. After the failure that table
>> was
>>>>>>> also
>>>>>>>>> corrupted in HBase. My cluster has 8 region servers.
>>>>>>>>> 
>>>>>>>>> Is bulk loading not fault tolerant to failure of region servers?
>>>>>>>>> 
>>>>>>>>> I am using this old email chain because at that time my question
>> went
>>>>>>>>> unanswered. Please share your views.
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Anil Gupta
>>>>>>>>> 
>>>>>>>>> On Tue, Apr 3, 2012 at 9:12 AM, anil gupta <an...@gmail.com>
>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi Kevin,
>>>>>>>>>> 
>>>>>>>>>> I am not really concerned about the RegionServer going down as the
>>>>> same
>>>>>>>>>> thing can happen when deployed in production. Although, in
>>>>> production
>>>>>>> we
>>>>>>>>>> wont be having VM environment and I am aware that my current Dev
>>>>>>>>>> environment is not good for heavy processing.  What i am concerned
>>>>>>> about
>>>>>>>>> is
>>>>>>>>>> the failure of bulk loading job when the Region Server failed.
>> Does
>>>>>>> this
>>>>>>>>>> mean that Bulk loading job is not fault tolerant to Failure of
>>>>> Region
>>>>>>>>>> Server? I was expecting the job to be successful even though the
>>>>>>>>>> RegionServer failed because there 6 more RS running in the
>> cluster.
>>>>>>> Fault
>>>>>>>>>> Tolerance is one of the biggest selling point of Hadoop platform.
>>>>> Let
>>>>>>> me
>>>>>>>>>> know your views.
>>>>>>>>>> Thanks for your time.
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Anil Gupta
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Tue, Apr 3, 2012 at 7:34 AM, Kevin O'dell <
>>>>> kevin.odell@cloudera.com
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Anil,
>>>>>>>>>>> 
>>>>>>>>>>> I am sorry for the delayed response.  Reviewing the logs it
>>>>> appears:
>>>>>>>>>>> 
>>>>>>>>>>> 2/03/30 15:38:31 INFO zookeeper.ClientCnxn: Client session timed
>>>>> out,
>>>>>>>>>>> have not heard from server in 59311ms for sessionid
>>>>> 0x136557f99c90065,
>>>>>>>>>>> closing socket connection and attempting reconnect
>>>>>>>>>>> 
>>>>>>>>>>> 12/03/30 15:38:32 FATAL regionserver.HRegionServer: ABORTING
>> region
>>>>>>>>>>> server serverName=ihub-dn-b1,60020,1332955859363,
>> load=(requests=0,
>>>>>>>>>>> regions=44, usedHeap=446, maxHeap=1197): Unhandled exception:
>>>>>>>>>>> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
>>>>> rejected;
>>>>>>>>>>> currently processing ihub-dn-b1,60020,1332955859363 as dead
>> server
>>>>>>>>>>> 
>>>>>>>>>>> It appears to be a classic overworked RS.  You were doing too
>> much
>>>>>>>>>>> for the RS and it did not respond in time, the Master marked it
>> as
>>>>>>>>>>> dead, when the RS responded Master said no your are already dead
>>>>> and
>>>>>>>>>>> aborted the server.  This is why you see the YouAreDeadException.
>>>>>>>>>>> This is probably due to the shared resources of the VM
>>>>> infrastructure
>>>>>>>>>>> you are running.  You will either need to devote more resources
>> or
>>>>> add
>>>>>>>>>>> more nodes(most likely physical) to the cluster if you would like
>>>>> to
>>>>>>>>>>> keep running these jobs.
>>>>>>>>>>> 
>>>>>>>>>>> On Fri, Mar 30, 2012 at 9:24 PM, anil gupta <
>> anilgupt@buffalo.edu>
>>>>>>>>> wrote:
>>>>>>>>>>>> Hi Kevin,
>>>>>>>>>>>> 
>>>>>>>>>>>> Here is dropbox link to the log file of region server which
>>>>> failed:
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>> 
>> http://dl.dropbox.com/u/64149128/hbase-hbase-regionserver-ihub-dn-b1.out
>>>>>>>>>>>> IMHO, the problem starts from the line #3009 which says:
>> 12/03/30
>>>>>>>>>>> 15:38:32
>>>>>>>>>>>> FATAL regionserver.HRegionServer: ABORTING region server
>>>>>>>>>>>> serverName=ihub-dn-b1,60020,1332955859363, load=(requests=0,
>>>>>>>>> regions=44,
>>>>>>>>>>>> usedHeap=446, maxHeap=1197): Unhandled exception:
>>>>>>>>>>>> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
>>>>> rejected;
>>>>>>>>>>>> currently processing ihub-dn-b1,60020,1332955859363 as dead
>> server
>>>>>>>>>>>> 
>>>>>>>>>>>> I have already tested fault tolerance of HBase by manually
>>>>> bringing
>>>>>>>>>>> down a
>>>>>>>>>>>> RS while querying a Table and it worked fine and I was expecting
>>>>> the
>>>>>>>>>>> same
>>>>>>>>>>>> today(even though the RS went down by itself today) when i was
>>>>>>> loading
>>>>>>>>>>> the
>>>>>>>>>>>> data. But, it didn't work out well.
>>>>>>>>>>>> Thanks for your time. Let me know if you need more details.
>>>>>>>>>>>> 
>>>>>>>>>>>> ~Anil
>>>>>>>>>>>> 
>>>>>>>>>>>> On Fri, Mar 30, 2012 at 6:05 PM, Kevin O'dell <
>>>>>>>>> kevin.odell@cloudera.com
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Anil,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Can you please attach the RS logs from the failure?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Fri, Mar 30, 2012 at 7:05 PM, anil gupta <
>>>>> anilgupt@buffalo.edu>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I am using cdh3u2 and i have 7 worker nodes(VM's spread across
>>>>> two
>>>>>>>>>>>>>> machines) which are running Datanode, Tasktracker, and Region
>>>>>>>>>>> Server(1200
>>>>>>>>>>>>>> MB heap size). I was loading data into HBase using Bulk Loader
>>>>>>>>> with a
>>>>>>>>>>>>>> custom mapper. I was loading around 34 million records and I
>>>>> have
>>>>>>>>>>> loaded
>>>>>>>>>>>>>> the same set of data in the same environment many times before
>>>>>>>>>>> without
>>>>>>>>>>>>> any
>>>>>>>>>>>>>> problem. This time while loading the data, one of the region
>>>>>>>>>>> server(but
>>>>>>>>>>>>> the
>>>>>>>>>>>>>> DN and TT kept on running on that node ) failed and then after
>>>>>>>>>>> numerous
>>>>>>>>>>>>>> failures of map-tasks the loding job failed. Is there any
>>>>>>>>>>>>>> setting/configuration which can make Bulk Loading
>>>>> fault-tolerant to
>>>>>>>>>>>>> failure
>>>>>>>>>>>>>> of region-servers?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>>>> Anil Gupta
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Kevin O'Dell
>>>>>>>>>>>>> Customer Operations Engineer, Cloudera
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> --
>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>> Anil Gupta
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> Kevin O'Dell
>>>>>>>>>>> Customer Operations Engineer, Cloudera
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>> Anil Gupta
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Thanks & Regards,
>>>>>>>>> Anil Gupta
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Kevin O'Dell
>>>>>>>> Customer Operations Engineer, Cloudera
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Thanks & Regards,
>>>>>> Anil Gupta
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Thanks & Regards,
>>>> Anil Gupta
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Thanks & Regards,
>>> Anil Gupta
>> 
>> 
> 
> 
> -- 
> Thanks & Regards,
> Anil Gupta


Re: Bulk loading job failed when one region server went down in the cluster

Posted by anil gupta <an...@gmail.com>.
Hi Stack,

Thanks for answering my question.  I admit that I am unable to run MR2 (YARN)
jobs in an efficient way on my cluster, due to a major bug in YARN
which is not letting me set the right configuration for MapReduce jobs.

The RSs are dying with LeaseExpiredExceptions or YouAreDeadExceptions
because of overload on the slaves due to the improper YARN conf. Once the MR
job finishes, HBase performance is OK. I am not using this cluster for
performance metrics, because we won't be using virtualization in our
production environment.

The purpose of this email post was to find out whether Bulk Loading is fault
tolerant to RS failures or not. Your answer is sufficient for clearing my
doubts.

Thanks,
Anil

On Wed, Aug 15, 2012 at 2:52 PM, Stack <st...@duboce.net> wrote:

> On Mon, Aug 13, 2012 at 6:05 PM, anil gupta <an...@gmail.com> wrote:
> > It would be great if you can answer this simple question of mine: Is
> HBase
> > Bulk Loading fault tolerant to Region Server failures in a viable/decent
> > environment?
> >
>
> Bulk Loading is an MapReduce job.  Bulk Loading is as 'fault tolerant'
> as MapReduce is (MapReduce jobs have long timeouts -- ten minutes IIRC
> -- and tasks are retried up to a maximum, 4 by default, but if after
> all timeouts and retries have expired, the job will fail).
>
> You have RSs failing, maybe because you have too many slots allocated
> to MapReduce for the hardware you are using to PoC (as Michael Segel
> suggests).  Maybe the MR task is not finding the region's new
> locations in time or maybe the regions are not coming back on line in
> time for the MR job to complete?
>
> The logs you provide for the MR task show us failing to go against a
> RS who has died but doesn't know it yet (the YouAreDeadException).
> Try looking at the subsequent map tasks that fail.  Why are they
> failing?  For same reason?  Look in the master log to see whats
> happening around log splitting of the failed server?  Is it hung up
> preventing the regions being assigned to new locations?
>
> St.Ack
>



-- 
Thanks & Regards,
Anil Gupta

Re: Bulk loading job failed when one region server went down in the cluster

Posted by Stack <st...@duboce.net>.
On Mon, Aug 13, 2012 at 6:05 PM, anil gupta <an...@gmail.com> wrote:
> It would be great if you can answer this simple question of mine: Is HBase
> Bulk Loading fault tolerant to Region Server failures in a viable/decent
> environment?
>

Bulk Loading is a MapReduce job.  Bulk Loading is as 'fault tolerant'
as MapReduce is (MapReduce jobs have long timeouts -- ten minutes IIRC
-- and tasks are retried up to a maximum, 4 by default; once all
timeouts and retries have been exhausted, the job will fail).
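
For reference, the knobs involved are roughly these (MR2 property names; a
sketch, so check the defaults for your exact version, and note the MR1 names
differ):

    <property>
      <!-- ms a task may go without reporting progress before it is killed;
           the default is the ten-minute timeout mentioned above -->
      <name>mapreduce.task.timeout</name>
      <value>600000</value>
    </property>
    <property>
      <!-- number of attempts for a map task before the whole job fails -->
      <name>mapreduce.map.maxattempts</name>
      <value>4</value>
    </property>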

You have RSs failing, maybe because you have too many slots allocated
to MapReduce for the hardware you are using for the PoC (as Michael Segel
suggests).  Maybe the MR task is not finding the regions' new
locations in time, or maybe the regions are not coming back online in
time for the MR job to complete?

The logs you provide for the MR task show it failing to go against an
RS that has died but doesn't know it yet (the YouAreDeadException).
Try looking at the subsequent map tasks that fail.  Why are they
failing?  For the same reason?  Look in the master log to see what's
happening around log splitting of the failed server.  Is it hung up,
preventing the regions from being assigned to new locations?

St.Ack

Re: Bulk loading job failed when one region server went down in the cluster

Posted by anil gupta <an...@gmail.com>.
Hi Mike,

You hit the nail on the head: I need to lower the memory by setting
yarn.nodemanager.resource.memory-mb. That is where another major YARN bug
comes in. I already tried setting that property to 1500 MB in
yarn-site.xml and setting yarn.app.mapreduce.am.resource.mb to 1000 MB in
mapred-site.xml. If I make this change, the YARN job does not run at all,
even though the configuration is right. It's a bug, and I have to file a
JIRA for it. So I was only left with the option of letting it run with an
incorrect YARN conf, since my objective is to load data into HBase rather
than play with YARN. MapReduce is only used for bulk loading in my
cluster.
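
Concretely, what I tried was the following (values exactly as above; shown
only as a sketch of the two files involved):

    In yarn-site.xml:

      <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>1500</value>
      </property>

    In mapred-site.xml:

      <property>
        <name>yarn.app.mapreduce.am.resource.mb</name>
        <value>1000</value>
      </property>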

Here is a link to the mailing-list email regarding running YARN with less
memory:
http://permalink.gmane.org/gmane.comp.jakarta.lucene.hadoop.user/33164

It would be great if you can answer this simple question of mine: Is HBase
Bulk Loading fault tolerant to Region Server failures in a viable/decent
environment?

Thanks,
Anil Gupta

On Mon, Aug 13, 2012 at 5:17 PM, Michael Segel <mi...@hotmail.com>wrote:

> Not sure why you're having an issue in getting an answer.
> Even if you're not a YARN expert,  google is your friend.
>
> See:
>
> http://books.google.com/books?id=Wu_xeGdU4G8C&pg=PA323&lpg=PA323&dq=Hadoop+YARN+setting+number+of+slots&source=bl&ots=i7xQYwQf-u&sig=ceuDmiOkbqTqok_HfIr3udvm6C0&hl=en&sa=X&ei=8JYpUNeZJMnxygGzqIGwCw&ved=0CEQQ6AEwAQ#v=onepage&q=Hadoop%20YARN%20setting%20number%20of%20slots&f=false
>
> This is a web page from Tom White's 3rd Edition.
>
> The bottom line...
> -=-
> The considerations for how much memory to dedicate to a node manager for
> running containers are similar to the those discussed in
>
> “Memory” on page 307. Each Hadoop daemon uses 1,000 MB, so for a datanode
> and a node manager, the total is 2,000 MB. Set aside enough for other
> processes that are running on the machine, and the remainder can be
> dedicated to the node manager’s containers by setting the configuration
> property yarn.nodemanager.resource.memory-mb to the total allocation in MB.
> (The default is 8,192 MB.)
> -=-
>
> Taken per fair use. Page 323
>
> As you can see you need to drop this down to something like 1GB if you
> even have enough memory for that.
> Again set yarn.nodemanager.resource.memory-mb to a more realistic value.
>
> 8GB on a 3 GB node? Yeah that would really hose you, especially if you're
> trying to run HBase too.
>
> Even here... You really don't have enough memory to do it all. (Maybe
> enough to do a small test)
>
>
>
> Good luck.
>
> On Aug 13, 2012, at 3:24 PM, anil gupta <an...@gmail.com> wrote:
>
>
> > Hi Mike,
> >
> > Here is the link to my email on Hadoop list regarding YARN problem:
> >
> http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201208.mbox/%3CCAF1+Vs8oF4VsHbg14B7SGzBB_8Ty7GC9Lw3nm1bM0v+24CkEBw@mail.gmail.com%3E
> >
> > Somehow the link for cloudera mail in last email does not seems to work.
> > Here is the new link:
> >
> https://groups.google.com/a/cloudera.org/forum/?fromgroups#!searchin/cdh-user/yarn$20anil/cdh-user/J564g9A8tPE/ZpslzOkIGZYJ%5B1-25%5D
> >
> > Thanks for your help,
> > Anil Gupta
> >
> > On Mon, Aug 13, 2012 at 1:14 PM, anil gupta <an...@gmail.com>
> wrote:
> >
> >> Hi Mike,
> >>
> >> I tried doing that by setting up properties in mapred-site.xml but Yarn
> >> doesnt seems to work with "mapreduce.tasktracker.
> >> map.tasks.maximum" property. Here is a reference to a discussion to same
> >> problem:
> >>
> >>
> https://groups.google.com/a/cloudera.org/forum/?fromgroups#!searchin/cdh-user/yarn$20anil/cdh-user/J564g9A8tPE/ZpslzOkIGZYJ[1-25]
> >> I have also posted about the same problem in Hadoop mailing list.
> >>
> >> I already admitted in my previous email that YARN is having major issues
> >> when we want to control it in low memory environment. I was just trying
> to
> >> get views HBase experts on bulk load failures since we will be relying
> >> heavily on Fault Tolerance.
> >> If HBase Bulk Loader is fault tolerant to failure of RS in a viable
> >> environment  then I dont have any issue. I hope this clears up my
> purpose
> >> of posting on this topic.
> >>
> >> Thanks,
> >> Anil
> >>
> >> On Mon, Aug 13, 2012 at 12:39 PM, Michael Segel <
> michael_segel@hotmail.com
> >>> wrote:
> >>
> >>> Anil,
> >>>
> >>> Do you know what happens when you have an airplane that has too heavy a
> >>> cargo when it tries to take off?
> >>> You run out of runway and you crash and burn.
> >>>
> >>> Looking at your post, why are you starting 8 map processes on each
> slave?
> >>> That's tunable and you clearly do not have enough memory in each VM to
> >>> support 8 slots on a node.
> >>> Here you swap, you swap you cause HBase to crash and burn.
> >>>
> >>> 3.2GB of memory means that no more than 1 slot per slave and even
> then...
> >>> you're going to be very tight. Not to mention that you will need to
> loosen
> >>> up on your timings since its all virtual and you have way too much i/o
> per
> >>> drive going on.
> >>>
> >>>
> >>> My suggestion is that you go back and tune your system before thinking
> >>> about running anything.
> >>>
> >>> HTH
> >>>
> >>> -Mike
> >>>
> >>> On Aug 13, 2012, at 2:11 PM, anil gupta <an...@gmail.com> wrote:
> >>>
> >>>> Hi Guys,
> >>>>
> >>>> Sorry for not mentioning the version I am currently running. My
> current
> >>>> version is HBase 0.92.1(cdh4) and running Hadoop2.0.0-Alpha with YARN
> >>> for
> >>>> MR. My original post was for HBase0.92. Here are some more details of
> my
> >>>> current setup:
> >>>> I am running a 8 slave, 4 admin node cluster on CentOS6.0 VM's
> >>> installed on
> >>>> VMware Hyprevisor 5.0. Each of my VM is having 3.2 GB of memory and
> 500
> >>>> HDFS space.
> >>>> I use this cluster for POC(Proof of Concepts). I am not looking for
> any
> >>>> performance benchmarking from this set-up. Due to some major bugs in
> >>> YARN i
> >>>> am unable to make work in a proper way in memory less than 4GB. I am
> >>>> already having discussion regarding them on Hadoop Mailing List.
> >>>>
> >>>> Here is the log of failed mapper: http://pastebin.com/f83xE2wv
> >>>>
> >>>> The problem is that when i start a Bulk loading job in YARN, 8 Map
> >>>> processes start on each slave and then all of my slaves are hammered
> >>> badly
> >>>> due to this. Since the slaves are getting hammered badly then
> >>> RegionServer
> >>>> gets lease expired or YourAreDeadExpcetion. Here is the log of RS
> which
> >>>> caused the job to fail: http://pastebin.com/9ZQx0DtD
> >>>>
> >>>> I am aware that this is happening due to underperforming hardware(Two
> >>>> slaves are using one 7200 rpm Hard Drive in my setup) and some major
> >>> bugs
> >>>> regarding running YARN in less than 4 GB memory. My only concern is
> the
> >>>> failure of entire MR job and its fault tolerance to RS failures. I am
> >>> not
> >>>> really concerned about RS failure since HBase is fault tolerant.
> >>>>
> >>>> Please let me know if you need anything else.
> >>>>
> >>>> Thanks,
> >>>> Anil
> >>>>
> >>>>
> >>>>
> >>>> On Mon, Aug 13, 2012 at 6:58 AM, Michael Segel <
> >>> michael_segel@hotmail.com>wrote:
> >>>>
> >>>>> Yes, it can.
> >>>>> You can see RS failure causing a cascading RS failure. Of course YMMV
> >>> and
> >>>>> it depends on which version you are running.
> >>>>>
> >>>>> OP is on CHD3u2 which still had some issues. CDH3u4 is the latest and
> >>> he
> >>>>> should upgrade.
> >>>>>
> >>>>> (Or go to CHD4...)
> >>>>>
> >>>>> HTH
> >>>>>
> >>>>> -Mike
> >>>>>
> >>>>> On Aug 13, 2012, at 8:51 AM, Kevin O'dell <ke...@cloudera.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Anil,
> >>>>>>
> >>>>>> Do you have root cause on the RS failure?  I have never heard of one
> >>> RS
> >>>>>> failure causing a whole job to fail.
> >>>>>>
> >>>>>> On Tue, Aug 7, 2012 at 1:59 PM, anil gupta <an...@gmail.com>
> >>>>> wrote:
> >>>>>>
> >>>>>>> Hi HBase Folks,
> >>>>>>>
> >>>>>>> I ran the bulk loader yesterday night to load data in a table.
> During
> >>>>> the
> >>>>>>> bulk loading job one of the region server crashed and the entire
> job
> >>>>>>> failed. It takes around 2.5 hours for this job to finish and the
> job
> >>>>> failed
> >>>>>>> when it was at around 50% complete. After the failure that table
> was
> >>>>> also
> >>>>>>> corrupted in HBase. My cluster has 8 region servers.
> >>>>>>>
> >>>>>>> Is bulk loading not fault tolerant to failure of region servers?
> >>>>>>>
> >>>>>>> I am using this old email chain because at that time my question
> went
> >>>>>>> unanswered. Please share your views.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Anil Gupta
> >>>>>>>
> >>>>>>> On Tue, Apr 3, 2012 at 9:12 AM, anil gupta <an...@gmail.com>
> >>>>> wrote:
> >>>>>>>
> >>>>>>>> Hi Kevin,
> >>>>>>>>
> >>>>>>>> I am not really concerned about the RegionServer going down as the
> >>> same
> >>>>>>>> thing can happen when deployed in production. Although, in
> >>> production
> >>>>> we
> >>>>>>>> wont be having VM environment and I am aware that my current Dev
> >>>>>>>> environment is not good for heavy processing.  What i am concerned
> >>>>> about
> >>>>>>> is
> >>>>>>>> the failure of bulk loading job when the Region Server failed.
> Does
> >>>>> this
> >>>>>>>> mean that Bulk loading job is not fault tolerant to Failure of
> >>> Region
> >>>>>>>> Server? I was expecting the job to be successful even though the
> >>>>>>>> RegionServer failed because there 6 more RS running in the
> cluster.
> >>>>> Fault
> >>>>>>>> Tolerance is one of the biggest selling point of Hadoop platform.
> >>> Let
> >>>>> me
> >>>>>>>> know your views.
> >>>>>>>> Thanks for your time.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Anil Gupta
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Tue, Apr 3, 2012 at 7:34 AM, Kevin O'dell <
> >>> kevin.odell@cloudera.com
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Anil,
> >>>>>>>>>
> >>>>>>>>> I am sorry for the delayed response.  Reviewing the logs it
> >>> appears:
> >>>>>>>>>
> >>>>>>>>> 2/03/30 15:38:31 INFO zookeeper.ClientCnxn: Client session timed
> >>> out,
> >>>>>>>>> have not heard from server in 59311ms for sessionid
> >>> 0x136557f99c90065,
> >>>>>>>>> closing socket connection and attempting reconnect
> >>>>>>>>>
> >>>>>>>>> 12/03/30 15:38:32 FATAL regionserver.HRegionServer: ABORTING
> region
> >>>>>>>>> server serverName=ihub-dn-b1,60020,1332955859363,
> load=(requests=0,
> >>>>>>>>> regions=44, usedHeap=446, maxHeap=1197): Unhandled exception:
> >>>>>>>>> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
> >>> rejected;
> >>>>>>>>> currently processing ihub-dn-b1,60020,1332955859363 as dead
> server
> >>>>>>>>>
> >>>>>>>>> It appears to be a classic overworked RS.  You were doing too
> much
> >>>>>>>>> for the RS and it did not respond in time, the Master marked it
> as
> >>>>>>>>> dead, when the RS responded Master said no your are already dead
> >>> and
> >>>>>>>>> aborted the server.  This is why you see the YouAreDeadException.
> >>>>>>>>> This is probably due to the shared resources of the VM
> >>> infrastructure
> >>>>>>>>> you are running.  You will either need to devote more resources
> or
> >>> add
> >>>>>>>>> more nodes(most likely physical) to the cluster if you would like
> >>> to
> >>>>>>>>> keep running these jobs.
> >>>>>>>>>
> >>>>>>>>> On Fri, Mar 30, 2012 at 9:24 PM, anil gupta <
> anilgupt@buffalo.edu>
> >>>>>>> wrote:
> >>>>>>>>>> Hi Kevin,
> >>>>>>>>>>
> >>>>>>>>>> Here is dropbox link to the log file of region server which
> >>> failed:
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>
> >>>
> http://dl.dropbox.com/u/64149128/hbase-hbase-regionserver-ihub-dn-b1.out
> >>>>>>>>>> IMHO, the problem starts from the line #3009 which says:
> 12/03/30
> >>>>>>>>> 15:38:32
> >>>>>>>>>> FATAL regionserver.HRegionServer: ABORTING region server
> >>>>>>>>>> serverName=ihub-dn-b1,60020,1332955859363, load=(requests=0,
> >>>>>>> regions=44,
> >>>>>>>>>> usedHeap=446, maxHeap=1197): Unhandled exception:
> >>>>>>>>>> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
> >>> rejected;
> >>>>>>>>>> currently processing ihub-dn-b1,60020,1332955859363 as dead
> server
> >>>>>>>>>>
> >>>>>>>>>> I have already tested fault tolerance of HBase by manually
> >>> bringing
> >>>>>>>>> down a
> >>>>>>>>>> RS while querying a Table and it worked fine and I was expecting
> >>> the
> >>>>>>>>> same
> >>>>>>>>>> today(even though the RS went down by itself today) when i was
> >>>>> loading
> >>>>>>>>> the
> >>>>>>>>>> data. But, it didn't work out well.
> >>>>>>>>>> Thanks for your time. Let me know if you need more details.
> >>>>>>>>>>
> >>>>>>>>>> ~Anil
> >>>>>>>>>>
> >>>>>>>>>> On Fri, Mar 30, 2012 at 6:05 PM, Kevin O'dell <
> >>>>>>> kevin.odell@cloudera.com
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Anil,
> >>>>>>>>>>>
> >>>>>>>>>>> Can you please attach the RS logs from the failure?
> >>>>>>>>>>>
> >>>>>>>>>>> On Fri, Mar 30, 2012 at 7:05 PM, anil gupta <
> >>> anilgupt@buffalo.edu>
> >>>>>>>>> wrote:
> >>>>>>>>>>>> Hi All,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I am using cdh3u2 and i have 7 worker nodes(VM's spread across
> >>> two
> >>>>>>>>>>>> machines) which are running Datanode, Tasktracker, and Region
> >>>>>>>>> Server(1200
> >>>>>>>>>>>> MB heap size). I was loading data into HBase using Bulk Loader
> >>>>>>> with a
> >>>>>>>>>>>> custom mapper. I was loading around 34 million records and I
> >>> have
> >>>>>>>>> loaded
> >>>>>>>>>>>> the same set of data in the same environment many times before
> >>>>>>>>> without
> >>>>>>>>>>> any
> >>>>>>>>>>>> problem. This time while loading the data, one of the region
> >>>>>>>>> server(but
> >>>>>>>>>>> the
> >>>>>>>>>>>> DN and TT kept on running on that node ) failed and then after
> >>>>>>>>> numerous
> >>>>>>>>>>>> failures of map-tasks the loding job failed. Is there any
> >>>>>>>>>>>> setting/configuration which can make Bulk Loading
> >>> fault-tolerant to
> >>>>>>>>>>> failure
> >>>>>>>>>>>> of region-servers?
> >>>>>>>>>>>>
> >>>>>>>>>>>> --
> >>>>>>>>>>>> Thanks & Regards,
> >>>>>>>>>>>> Anil Gupta
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> Kevin O'Dell
> >>>>>>>>>>> Customer Operations Engineer, Cloudera
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Thanks & Regards,
> >>>>>>>>>> Anil Gupta
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Kevin O'Dell
> >>>>>>>>> Customer Operations Engineer, Cloudera
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Thanks & Regards,
> >>>>>>>>> Anil Gupta
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Thanks & Regards,
> >>>>>>> Anil Gupta
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Kevin O'Dell
> >>>>>> Customer Operations Engineer, Cloudera
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> Thanks & Regards,
> >>>> Anil Gupta
> >>>
> >>>
> >>
> >>
> >> --
> >> Thanks & Regards,
> >> Anil Gupta
> >>
> >
> >
> >
> > --
> > Thanks & Regards,
> > Anil Gupta
>
>


-- 
Thanks & Regards,
Anil Gupta

Re: Bulk loading job failed when one region server went down in the cluster

Posted by Michael Segel <mi...@hotmail.com>.
Not sure why you're having an issue getting an answer. 
Even if you're not a YARN expert, Google is your friend. 

See:
http://books.google.com/books?id=Wu_xeGdU4G8C&pg=PA323&lpg=PA323&dq=Hadoop+YARN+setting+number+of+slots&source=bl&ots=i7xQYwQf-u&sig=ceuDmiOkbqTqok_HfIr3udvm6C0&hl=en&sa=X&ei=8JYpUNeZJMnxygGzqIGwCw&ved=0CEQQ6AEwAQ#v=onepage&q=Hadoop%20YARN%20setting%20number%20of%20slots&f=false

This is a web page from Tom White's 3rd Edition. 

The bottom line...
-=-
The considerations for how much memory to dedicate to a node manager for running containers are similar to those discussed in “Memory” on page 307. Each Hadoop daemon uses 1,000 MB, so for a datanode and a node manager, the total is 2,000 MB. Set aside enough for other processes that are running on the machine, and the remainder can be dedicated to the node manager’s containers by setting the configuration property yarn.nodemanager.resource.memory-mb to the total allocation in MB. (The default is 8,192 MB.)
-=-

Taken per fair use. Page 323

As you can see, you need to drop this down to something like 1 GB, if you even have enough memory for that. 
Again, set yarn.nodemanager.resource.memory-mb to a more realistic value. 

8GB on a 3 GB node? Yeah that would really hose you, especially if you're trying to run HBase too. 

Even here... You really don't have enough memory to do it all. (Maybe enough to do a small test)
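
To put rough numbers on it, using the 1,000 MB per daemon figure above and the 1,200 MB region server heap mentioned earlier in this thread (the OS is not even counted):

    1,000 MB   DataNode
    1,000 MB   NodeManager
    1,200 MB   HBase RegionServer heap
    --------
    3,200 MB   which is the whole VM, leaving essentially nothing for containers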



Good luck.

On Aug 13, 2012, at 3:24 PM, anil gupta <an...@gmail.com> wrote:


> Hi Mike,
> 
> Here is the link to my email on Hadoop list regarding YARN problem:
> http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201208.mbox/%3CCAF1+Vs8oF4VsHbg14B7SGzBB_8Ty7GC9Lw3nm1bM0v+24CkEBw@mail.gmail.com%3E
> 
> Somehow the link for cloudera mail in last email does not seems to work.
> Here is the new link:
> https://groups.google.com/a/cloudera.org/forum/?fromgroups#!searchin/cdh-user/yarn$20anil/cdh-user/J564g9A8tPE/ZpslzOkIGZYJ%5B1-25%5D
> 
> Thanks for your help,
> Anil Gupta
> 
> On Mon, Aug 13, 2012 at 1:14 PM, anil gupta <an...@gmail.com> wrote:
> 
>> Hi Mike,
>> 
>> I tried doing that by setting up properties in mapred-site.xml but Yarn
>> doesnt seems to work with "mapreduce.tasktracker.
>> map.tasks.maximum" property. Here is a reference to a discussion to same
>> problem:
>> 
>> https://groups.google.com/a/cloudera.org/forum/?fromgroups#!searchin/cdh-user/yarn$20anil/cdh-user/J564g9A8tPE/ZpslzOkIGZYJ[1-25]
>> I have also posted about the same problem in Hadoop mailing list.
>> 
>> I already admitted in my previous email that YARN is having major issues
>> when we want to control it in low memory environment. I was just trying to
>> get views HBase experts on bulk load failures since we will be relying
>> heavily on Fault Tolerance.
>> If HBase Bulk Loader is fault tolerant to failure of RS in a viable
>> environment  then I dont have any issue. I hope this clears up my purpose
>> of posting on this topic.
>> 
>> Thanks,
>> Anil
>> 
>> On Mon, Aug 13, 2012 at 12:39 PM, Michael Segel <michael_segel@hotmail.com
>>> wrote:
>> 
>>> Anil,
>>> 
>>> Do you know what happens when you have an airplane that has too heavy a
>>> cargo when it tries to take off?
>>> You run out of runway and you crash and burn.
>>> 
>>> Looking at your post, why are you starting 8 map processes on each slave?
>>> That's tunable and you clearly do not have enough memory in each VM to
>>> support 8 slots on a node.
>>> Here you swap, you swap you cause HBase to crash and burn.
>>> 
>>> 3.2GB of memory means that no more than 1 slot per slave and even then...
>>> you're going to be very tight. Not to mention that you will need to loosen
>>> up on your timings since its all virtual and you have way too much i/o per
>>> drive going on.
>>> 
>>> 
>>> My suggestion is that you go back and tune your system before thinking
>>> about running anything.
>>> 
>>> HTH
>>> 
>>> -Mike
>>> 
>>> On Aug 13, 2012, at 2:11 PM, anil gupta <an...@gmail.com> wrote:
>>> 
>>>> Hi Guys,
>>>> 
>>>> Sorry for not mentioning the version I am currently running. My current
>>>> version is HBase 0.92.1(cdh4) and running Hadoop2.0.0-Alpha with YARN
>>> for
>>>> MR. My original post was for HBase0.92. Here are some more details of my
>>>> current setup:
>>>> I am running a 8 slave, 4 admin node cluster on CentOS6.0 VM's
>>> installed on
>>>> VMware Hyprevisor 5.0. Each of my VM is having 3.2 GB of memory and 500
>>>> HDFS space.
>>>> I use this cluster for POC(Proof of Concepts). I am not looking for any
>>>> performance benchmarking from this set-up. Due to some major bugs in
>>> YARN i
>>>> am unable to make work in a proper way in memory less than 4GB. I am
>>>> already having discussion regarding them on Hadoop Mailing List.
>>>> 
>>>> Here is the log of failed mapper: http://pastebin.com/f83xE2wv
>>>> 
>>>> The problem is that when i start a Bulk loading job in YARN, 8 Map
>>>> processes start on each slave and then all of my slaves are hammered
>>> badly
>>>> due to this. Since the slaves are getting hammered badly then
>>> RegionServer
>>>> gets lease expired or YourAreDeadExpcetion. Here is the log of RS which
>>>> caused the job to fail: http://pastebin.com/9ZQx0DtD
>>>> 
>>>> I am aware that this is happening due to underperforming hardware(Two
>>>> slaves are using one 7200 rpm Hard Drive in my setup) and some major
>>> bugs
>>>> regarding running YARN in less than 4 GB memory. My only concern is the
>>>> failure of entire MR job and its fault tolerance to RS failures. I am
>>> not
>>>> really concerned about RS failure since HBase is fault tolerant.
>>>> 
>>>> Please let me know if you need anything else.
>>>> 
>>>> Thanks,
>>>> Anil
>>>> 
>>>> 
>>>> 
>>>> On Mon, Aug 13, 2012 at 6:58 AM, Michael Segel <
>>> michael_segel@hotmail.com>wrote:
>>>> 
>>>>> Yes, it can.
>>>>> You can see RS failure causing a cascading RS failure. Of course YMMV
>>> and
>>>>> it depends on which version you are running.
>>>>> 
>>>>> OP is on CHD3u2 which still had some issues. CDH3u4 is the latest and
>>> he
>>>>> should upgrade.
>>>>> 
>>>>> (Or go to CHD4...)
>>>>> 
>>>>> HTH
>>>>> 
>>>>> -Mike
>>>>> 
>>>>> On Aug 13, 2012, at 8:51 AM, Kevin O'dell <ke...@cloudera.com>
>>>>> wrote:
>>>>> 
>>>>>> Anil,
>>>>>> 
>>>>>> Do you have root cause on the RS failure?  I have never heard of one
>>> RS
>>>>>> failure causing a whole job to fail.
>>>>>> 
>>>>>> On Tue, Aug 7, 2012 at 1:59 PM, anil gupta <an...@gmail.com>
>>>>> wrote:
>>>>>> 
>>>>>>> Hi HBase Folks,
>>>>>>> 
>>>>>>> I ran the bulk loader yesterday night to load data in a table. During
>>>>> the
>>>>>>> bulk loading job one of the region server crashed and the entire job
>>>>>>> failed. It takes around 2.5 hours for this job to finish and the job
>>>>> failed
>>>>>>> when it was at around 50% complete. After the failure that table was
>>>>> also
>>>>>>> corrupted in HBase. My cluster has 8 region servers.
>>>>>>> 
>>>>>>> Is bulk loading not fault tolerant to failure of region servers?
>>>>>>> 
>>>>>>> I am using this old email chain because at that time my question went
>>>>>>> unanswered. Please share your views.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Anil Gupta
>>>>>>> 
>>>>>>> On Tue, Apr 3, 2012 at 9:12 AM, anil gupta <an...@gmail.com>
>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi Kevin,
>>>>>>>> 
>>>>>>>> I am not really concerned about the RegionServer going down as the
>>> same
>>>>>>>> thing can happen when deployed in production. Although, in
>>> production
>>>>> we
>>>>>>>> wont be having VM environment and I am aware that my current Dev
>>>>>>>> environment is not good for heavy processing.  What i am concerned
>>>>> about
>>>>>>> is
>>>>>>>> the failure of bulk loading job when the Region Server failed. Does
>>>>> this
>>>>>>>> mean that Bulk loading job is not fault tolerant to Failure of
>>> Region
>>>>>>>> Server? I was expecting the job to be successful even though the
>>>>>>>> RegionServer failed because there 6 more RS running in the cluster.
>>>>> Fault
>>>>>>>> Tolerance is one of the biggest selling point of Hadoop platform.
>>> Let
>>>>> me
>>>>>>>> know your views.
>>>>>>>> Thanks for your time.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Anil Gupta
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Tue, Apr 3, 2012 at 7:34 AM, Kevin O'dell <
>>> kevin.odell@cloudera.com
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Anil,
>>>>>>>>> 
>>>>>>>>> I am sorry for the delayed response.  Reviewing the logs it
>>> appears:
>>>>>>>>> 
>>>>>>>>> 2/03/30 15:38:31 INFO zookeeper.ClientCnxn: Client session timed
>>> out,
>>>>>>>>> have not heard from server in 59311ms for sessionid
>>> 0x136557f99c90065,
>>>>>>>>> closing socket connection and attempting reconnect
>>>>>>>>> 
>>>>>>>>> 12/03/30 15:38:32 FATAL regionserver.HRegionServer: ABORTING region
>>>>>>>>> server serverName=ihub-dn-b1,60020,1332955859363, load=(requests=0,
>>>>>>>>> regions=44, usedHeap=446, maxHeap=1197): Unhandled exception:
>>>>>>>>> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
>>> rejected;
>>>>>>>>> currently processing ihub-dn-b1,60020,1332955859363 as dead server
>>>>>>>>> 
>>>>>>>>> It appears to be a classic overworked RS.  You were doing too much
>>>>>>>>> for the RS and it did not respond in time, the Master marked it as
>>>>>>>>> dead, when the RS responded Master said no your are already dead
>>> and
>>>>>>>>> aborted the server.  This is why you see the YouAreDeadException.
>>>>>>>>> This is probably due to the shared resources of the VM
>>> infrastructure
>>>>>>>>> you are running.  You will either need to devote more resources or
>>> add
>>>>>>>>> more nodes(most likely physical) to the cluster if you would like
>>> to
>>>>>>>>> keep running these jobs.
>>>>>>>>> 
>>>>>>>>> On Fri, Mar 30, 2012 at 9:24 PM, anil gupta <an...@buffalo.edu>
>>>>>>> wrote:
>>>>>>>>>> Hi Kevin,
>>>>>>>>>> 
>>>>>>>>>> Here is dropbox link to the log file of region server which
>>> failed:
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>> 
>>> http://dl.dropbox.com/u/64149128/hbase-hbase-regionserver-ihub-dn-b1.out
>>>>>>>>>> IMHO, the problem starts from the line #3009 which says: 12/03/30
>>>>>>>>> 15:38:32
>>>>>>>>>> FATAL regionserver.HRegionServer: ABORTING region server
>>>>>>>>>> serverName=ihub-dn-b1,60020,1332955859363, load=(requests=0,
>>>>>>> regions=44,
>>>>>>>>>> usedHeap=446, maxHeap=1197): Unhandled exception:
>>>>>>>>>> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
>>> rejected;
>>>>>>>>>> currently processing ihub-dn-b1,60020,1332955859363 as dead server
>>>>>>>>>> 
>>>>>>>>>> I have already tested fault tolerance of HBase by manually
>>> bringing
>>>>>>>>> down a
>>>>>>>>>> RS while querying a Table and it worked fine and I was expecting
>>> the
>>>>>>>>> same
>>>>>>>>>> today(even though the RS went down by itself today) when i was
>>>>> loading
>>>>>>>>> the
>>>>>>>>>> data. But, it didn't work out well.
>>>>>>>>>> Thanks for your time. Let me know if you need more details.
>>>>>>>>>> 
>>>>>>>>>> ~Anil
>>>>>>>>>> 
>>>>>>>>>> On Fri, Mar 30, 2012 at 6:05 PM, Kevin O'dell <
>>>>>>> kevin.odell@cloudera.com
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Anil,
>>>>>>>>>>> 
>>>>>>>>>>> Can you please attach the RS logs from the failure?
>>>>>>>>>>> 
>>>>>>>>>>> On Fri, Mar 30, 2012 at 7:05 PM, anil gupta <
>>> anilgupt@buffalo.edu>
>>>>>>>>> wrote:
>>>>>>>>>>>> Hi All,
>>>>>>>>>>>> 
>>>>>>>>>>>> I am using cdh3u2 and i have 7 worker nodes(VM's spread across
>>> two
>>>>>>>>>>>> machines) which are running Datanode, Tasktracker, and Region
>>>>>>>>> Server(1200
>>>>>>>>>>>> MB heap size). I was loading data into HBase using Bulk Loader
>>>>>>> with a
>>>>>>>>>>>> custom mapper. I was loading around 34 million records and I
>>> have
>>>>>>>>> loaded
>>>>>>>>>>>> the same set of data in the same environment many times before
>>>>>>>>> without
>>>>>>>>>>> any
>>>>>>>>>>>> problem. This time while loading the data, one of the region
>>>>>>>>> server(but
>>>>>>>>>>> the
>>>>>>>>>>>> DN and TT kept on running on that node ) failed and then after
>>>>>>>>> numerous
>>>>>>>>>>>> failures of map-tasks the loding job failed. Is there any
>>>>>>>>>>>> setting/configuration which can make Bulk Loading
>>> fault-tolerant to
>>>>>>>>>>> failure
>>>>>>>>>>>> of region-servers?
>>>>>>>>>>>> 
>>>>>>>>>>>> --
>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>> Anil Gupta
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> Kevin O'Dell
>>>>>>>>>>> Customer Operations Engineer, Cloudera
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> Thanks & Regards,
>>>>>>>>>> Anil Gupta
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Kevin O'Dell
>>>>>>>>> Customer Operations Engineer, Cloudera
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Thanks & Regards,
>>>>>>>>> Anil Gupta
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Thanks & Regards,
>>>>>>> Anil Gupta
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Kevin O'Dell
>>>>>> Customer Operations Engineer, Cloudera
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Thanks & Regards,
>>>> Anil Gupta
>>> 
>>> 
>> 
>> 
>> --
>> Thanks & Regards,
>> Anil Gupta
>> 
> 
> 
> 
> -- 
> Thanks & Regards,
> Anil Gupta


Re: Bulk loading job failed when one region server went down in the cluster

Posted by anil gupta <an...@gmail.com>.
Hi Mike,

Here is the link to my email on the Hadoop list regarding the YARN problem:
http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201208.mbox/%3CCAF1+Vs8oF4VsHbg14B7SGzBB_8Ty7GC9Lw3nm1bM0v+24CkEBw@mail.gmail.com%3E

Somehow the link to the Cloudera mail in my last email does not seem to work.
Here is the new link:
https://groups.google.com/a/cloudera.org/forum/?fromgroups#!searchin/cdh-user/yarn$20anil/cdh-user/J564g9A8tPE/ZpslzOkIGZYJ%5B1-25%5D

Thanks for your help,
Anil Gupta

On Mon, Aug 13, 2012 at 1:14 PM, anil gupta <an...@gmail.com> wrote:

> Hi Mike,
>
> I tried doing that by setting up properties in mapred-site.xml but Yarn
> doesnt seems to work with "mapreduce.tasktracker.
> map.tasks.maximum" property. Here is a reference to a discussion to same
> problem:
>
> https://groups.google.com/a/cloudera.org/forum/?fromgroups#!searchin/cdh-user/yarn$20anil/cdh-user/J564g9A8tPE/ZpslzOkIGZYJ[1-25]
> I have also posted about the same problem in Hadoop mailing list.
>
> I already admitted in my previous email that YARN is having major issues
> when we want to control it in low memory environment. I was just trying to
> get views HBase experts on bulk load failures since we will be relying
> heavily on Fault Tolerance.
> If HBase Bulk Loader is fault tolerant to failure of RS in a viable
> environment  then I dont have any issue. I hope this clears up my purpose
> of posting on this topic.
>
> Thanks,
> Anil
>
> On Mon, Aug 13, 2012 at 12:39 PM, Michael Segel <michael_segel@hotmail.com
> > wrote:
>
>> Anil,
>>
>> Do you know what happens when you have an airplane that has too heavy a
>> cargo when it tries to take off?
>> You run out of runway and you crash and burn.
>>
>> Looking at your post, why are you starting 8 map processes on each slave?
>> That's tunable and you clearly do not have enough memory in each VM to
>> support 8 slots on a node.
>> Here you swap, you swap you cause HBase to crash and burn.
>>
>> 3.2GB of memory means that no more than 1 slot per slave and even then...
>> you're going to be very tight. Not to mention that you will need to loosen
>> up on your timings since its all virtual and you have way too much i/o per
>> drive going on.
>>
>>
>> My suggestion is that you go back and tune your system before thinking
>> about running anything.
>>
>> HTH
>>
>> -Mike
>>
>> On Aug 13, 2012, at 2:11 PM, anil gupta <an...@gmail.com> wrote:
>>
>> > Hi Guys,
>> >
>> > Sorry for not mentioning the version I am currently running. My current
>> > version is HBase 0.92.1(cdh4) and running Hadoop2.0.0-Alpha with YARN
>> for
>> > MR. My original post was for HBase0.92. Here are some more details of my
>> > current setup:
>> > I am running a 8 slave, 4 admin node cluster on CentOS6.0 VM's
>> installed on
>> > VMware Hyprevisor 5.0. Each of my VM is having 3.2 GB of memory and 500
>> > HDFS space.
>> > I use this cluster for POC(Proof of Concepts). I am not looking for any
>> > performance benchmarking from this set-up. Due to some major bugs in
>> YARN i
>> > am unable to make work in a proper way in memory less than 4GB. I am
>> > already having discussion regarding them on Hadoop Mailing List.
>> >
>> > Here is the log of failed mapper: http://pastebin.com/f83xE2wv
>> >
>> > The problem is that when i start a Bulk loading job in YARN, 8 Map
>> > processes start on each slave and then all of my slaves are hammered
>> badly
>> > due to this. Since the slaves are getting hammered badly then
>> RegionServer
>> > gets lease expired or YourAreDeadExpcetion. Here is the log of RS which
>> > caused the job to fail: http://pastebin.com/9ZQx0DtD
>> >
>> > I am aware that this is happening due to underperforming hardware(Two
>> > slaves are using one 7200 rpm Hard Drive in my setup) and some major
>> bugs
>> > regarding running YARN in less than 4 GB memory. My only concern is the
>> > failure of entire MR job and its fault tolerance to RS failures. I am
>> not
>> > really concerned about RS failure since HBase is fault tolerant.
>> >
>> > Please let me know if you need anything else.
>> >
>> > Thanks,
>> > Anil
>> >
>> >
>> >
>> > On Mon, Aug 13, 2012 at 6:58 AM, Michael Segel <
>> michael_segel@hotmail.com>wrote:
>> >
>> >> Yes, it can.
>> >> You can see RS failure causing a cascading RS failure. Of course YMMV
>> and
>> >> it depends on which version you are running.
>> >>
>> >> OP is on CHD3u2 which still had some issues. CDH3u4 is the latest and
>> he
>> >> should upgrade.
>> >>
>> >> (Or go to CHD4...)
>> >>
>> >> HTH
>> >>
>> >> -Mike
>> >>
>> >> On Aug 13, 2012, at 8:51 AM, Kevin O'dell <ke...@cloudera.com>
>> >> wrote:
>> >>
>> >>> Anil,
>> >>>
>> >>> Do you have root cause on the RS failure?  I have never heard of one
>> RS
>> >>> failure causing a whole job to fail.
>> >>>
>> >>> On Tue, Aug 7, 2012 at 1:59 PM, anil gupta <an...@gmail.com>
>> >> wrote:
>> >>>
>> >>>> Hi HBase Folks,
>> >>>>
>> >>>> I ran the bulk loader yesterday night to load data in a table. During
>> >> the
>> >>>> bulk loading job one of the region server crashed and the entire job
>> >>>> failed. It takes around 2.5 hours for this job to finish and the job
>> >> failed
>> >>>> when it was at around 50% complete. After the failure that table was
>> >> also
>> >>>> corrupted in HBase. My cluster has 8 region servers.
>> >>>>
>> >>>> Is bulk loading not fault tolerant to failure of region servers?
>> >>>>
>> >>>> I am using this old email chain because at that time my question went
>> >>>> unanswered. Please share your views.
>> >>>>
>> >>>> Thanks,
>> >>>> Anil Gupta



-- 
Thanks & Regards,
Anil Gupta

Re: Bulk loading job failed when one region server went down in the cluster

Posted by anil gupta <an...@gmail.com>.
Hi Mike,

I tried doing that by setting properties in mapred-site.xml, but YARN does
not seem to honor the "mapreduce.tasktracker.map.tasks.maximum" property.
Here is a reference to a discussion of the same problem:
https://groups.google.com/a/cloudera.org/forum/?fromgroups#!searchin/cdh-user/yarn$20anil/cdh-user/J564g9A8tPE/ZpslzOkIGZYJ
I have also posted about the same problem on the Hadoop mailing list.
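For reference, "mapreduce.tasktracker.map.tasks.maximum" belongs to the MR1
TaskTracker, so YARN ignores it; under YARN/MRv2 the number of concurrent
map containers per node is normally bounded by container sizing instead. A
minimal sketch, where the property names are standard but the values are
purely illustrative:

  <!-- yarn-site.xml: total memory the NodeManager may hand out -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>2048</value>
  </property>

  <!-- mapred-site.xml: memory requested per map container; with 2048 MB per
       node and 1024 MB per map, at most two map containers run per node -->
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>1024</value>
  </property>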

I already admitted in my previous email that YARN has major issues when you
try to control it in a low-memory environment. I was just trying to get the
views of HBase experts on bulk load failures, since we will be relying
heavily on fault tolerance.
If the HBase bulk loader is fault tolerant to the failure of a RS in a
viable environment, then I don't have any issue. I hope this clears up my
purpose of posting on this topic.

Thanks,
Anil



-- 
Thanks & Regards,
Anil Gupta

Re: Bulk loading job failed when one region server went down in the cluster

Posted by Michael Segel <mi...@hotmail.com>.
Anil, 

Do you know what happens when you have an airplane that has too heavy a cargo when it tries to take off? 
You run out of runway and you crash and burn. 

Looking at your post, why are you starting 8 map processes on each slave? That's tunable, and you clearly do not have enough memory in each VM to support 8 slots on a node. 
Once you start swapping, you cause HBase to crash and burn. 

3.2 GB of memory means no more than 1 slot per slave, and even then... you're going to be very tight. Not to mention that you will need to loosen up on your timings, since it's all virtual and you have way too much I/O per drive going on.
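As one example of loosening the timings, the ZooKeeper session timeout that
HBase requests can be raised in hbase-site.xml; the value below is only
illustrative, and the ZooKeeper server side must also be configured to allow
a session that long:

  <property>
    <name>zookeeper.session.timeout</name>
    <!-- milliseconds; illustrative value, not a recommendation -->
    <value>120000</value>
  </property>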


My suggestion is that you go back and tune your system before thinking about running anything. 

HTH

-Mike



Re: Bulk loading job failed when one region server went down in the cluster

Posted by anil gupta <an...@gmail.com>.
Hi Guys,

Sorry for not mentioning the version I am currently running. My current
version is HBase 0.92.1 (CDH4), and I am running Hadoop 2.0.0-alpha with
YARN for MR. My original post was for HBase 0.92. Here are some more
details of my current setup:
I am running an 8-slave, 4-admin-node cluster on CentOS 6.0 VMs installed
on VMware hypervisor 5.0. Each VM has 3.2 GB of memory and 500 GB of HDFS
space.
I use this cluster for POC (proof of concept); I am not looking for any
performance benchmarking from this set-up. Due to some major bugs in YARN I
am unable to make it work properly with less than 4 GB of memory. I am
already discussing them on the Hadoop mailing list.

Here is the log of a failed mapper: http://pastebin.com/f83xE2wv

The problem is that when I start a bulk loading job in YARN, 8 map
processes start on each slave and all of my slaves get hammered badly as a
result. Since the slaves are getting hammered, the RegionServer's lease
expires or it gets a YouAreDeadException. Here is the log of the RS which
caused the job to fail: http://pastebin.com/9ZQx0DtD

I am aware that this is happening due to underperforming hardware (two
slaves share one 7200 rpm hard drive in my setup) and some major bugs
around running YARN with less than 4 GB of memory. My only concern is the
failure of the entire MR job and its fault tolerance to RS failures. I am
not really concerned about the RS failure itself, since HBase is fault
tolerant.
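For what it's worth, a rough sketch of the knobs that determine whether a
job rides out a transient RS failure rather than dying outright; the
property names are standard Hadoop/HBase settings, and the values are
purely illustrative, not recommendations:

  <!-- mapred-site.xml: how many times a failed map task is re-attempted
       before the whole job is declared failed -->
  <property>
    <name>mapreduce.map.maxattempts</name>
    <value>8</value>
  </property>

  <!-- hbase-site.xml: how many times a client inside each task retries an
       operation while regions are reassigned after a RegionServer dies -->
  <property>
    <name>hbase.client.retries.number</name>
    <value>20</value>
  </property>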

Please let me know if you need anything else.

Thanks,
Anil





-- 
Thanks & Regards,
Anil Gupta

Re: Bulk loading job failed when one region server went down in the cluster

Posted by Michael Segel <mi...@hotmail.com>.
Yes, it can. 
You can see one RS failure causing a cascading RS failure. Of course YMMV, and it depends on which version you are running. 

OP is on CDH3u2, which still had some issues. CDH3u4 is the latest and he should upgrade. 

(Or go to CDH4...)

HTH

-Mike



Re: Bulk loading job failed when one region server went down in the cluster

Posted by Kevin O'dell <ke...@cloudera.com>.
Anil,

  Do you have a root cause for the RS failure? I have never heard of one RS
failure causing a whole job to fail.




-- 
Kevin O'Dell
Customer Operations Engineer, Cloudera