Posted to user@hbase.apache.org by Asaf Mesika <as...@gmail.com> on 2012/07/12 17:09:37 UTC

HDFS + HBASE process high cpu usage

Hi,

I have a cluster of 3 DN/RS and another computer hosting NN/Master.

For some reason, two of the DataNode nodes are showing a high load average (~17).
In "top" I can see that the HDFS and HBase processes are the ones using most of the CPU (95% in top).

When inspecting both the HDFS and HBase processes through JVisualVM on the problematic nodes, I can clearly see that the CPU usage is high.

Any ideas why it's happening on those two nodes (and why the 3rd is resting happily)?

All three computers have roughly the same hardware.
The cluster (both HBase and HDFS) was not in use during my inspection.

Neither the HDFS nor the HBase logs show any particular activity.


Any leads on where I should look next would be appreciated.


Thanks!

Asaf
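A common way to narrow down a hot JVM like this (a sketch, not from the thread; DN_PID and TID below are hypothetical placeholders) is to find the busiest native thread with top's per-thread view and map it to a Java stack via the hex nid that jstack reports:

```shell
# Sketch: map the hottest native thread of a JVM to its Java stack.
# DN_PID and TID are hypothetical; take the real values from your own box.
DN_PID=12345

# 1) List the process's threads sorted by CPU (run on the hot node):
#      top -b -H -n 1 -p "$DN_PID"
# 2) jstack reports native thread ids in hex as nid=0x..., so convert
#    the hot thread id (TID) from top's decimal output:
TID=30791
NID=$(printf 'nid=0x%x' "$TID")
echo "$NID"
# 3) Pull that thread's stack out of a thread dump:
#      jstack "$DN_PID" | grep -A 20 "$NID"
```

If the hot thread turns out to be a GC or JIT thread rather than an RPC handler, that usually points away from workload and toward the JVM or OS.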


Re: HDFS + HBASE process high cpu usage

Posted by Asaf Mesika <as...@gmail.com>.
Thanks a lot!

That must have been it.
Unfortunately I couldn't really test this command, since the ops guys rebooted the entire computer room during maintenance, which fixed the issue.
(This room is a lab room, of course.)

Asaf


On Jul 13, 2012, at 4:27 AM, Esteban Gutierrez wrote:

> date -s "`date`"


Re: HDFS + HBASE process high cpu usage

Posted by deanforwever2010 <de...@gmail.com>.
Maybe there is some slow query.
I met the same problem: I found out that when I queried 100 thousand columns of a
row, HBase became unresponsive and stopped working.

2012/7/13 Esteban Gutierrez <es...@cloudera.com>

> Hi Asaf,
>
> By any chance, has this issue been going on in your boxes for the last
> few days? I wouldn't be surprised by so many calls to futex by the JVM
> itself, but since you are describing the same symptoms as the leap second
> issue, it would be good to know which OS you are using, whether NTP is/was
> running, and whether the boxes have been restarted after Jul 1. If the
> leap second issue is the cause, then just running date -s "`date`" as root
> will lower the CPU usage.
>
> regards,
> esteban.
>
>
> --
> Cloudera, Inc.
>

Re: HDFS + HBASE process high cpu usage

Posted by Esteban Gutierrez <es...@cloudera.com>.
Hi Asaf,

By any chance, has this issue been going on in your boxes for the last
few days? I wouldn't be surprised by so many calls to futex by the JVM
itself, but since you are describing the same symptoms as the leap second
issue, it would be good to know which OS you are using, whether NTP is/was
running, and whether the boxes have been restarted after Jul 1. If the
leap second issue is the cause, then just running date -s "`date`" as root
will lower the CPU usage.
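For future readers, a minimal sketch of that check, assuming a Linux box with GNU date (the boot-time threshold logic is an editorial addition, not part of the original reply; only run the clock-reset command as root on an affected host):

```shell
# Sketch: flag boxes that may be hit by the 2012-06-30 leap second bug.
# A host is suspect if it booted before the leap second was inserted
# and has not been restarted since; the suggested fix resets the clock.
leap_epoch=$(date -u -d '2012-07-01 00:00:00' +%s)
boot_epoch=$(( $(date +%s) - $(cut -d. -f1 /proc/uptime) ))

if [ "$boot_epoch" -lt "$leap_epoch" ]; then
    echo "suspect: booted before the leap second"
    # Run as root on the affected host:
    #   date -s "$(date)"
else
    echo "ok: booted after the leap second"
fi
```

Setting the clock to itself looks like a no-op, but it forces the kernel's timekeeping to be reinitialized, which is what cleared the futex storms caused by the leap-second hrtimer bug.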

regards,
esteban.


--
Cloudera, Inc.





Re: HDFS + HBASE process high cpu usage

Posted by Asaf Mesika <as...@gmail.com>.
Just adding more information.
The following is a histogram output of 'strace -p <hdfs-pid> -f -C', which ran for 10 seconds. For some reason, futex accounts for 65% of the time.

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 65.06   11.097387         103    108084     53662 futex
 12.00    2.047692      170641        12         3 restart_syscall
  8.73    1.488824       23263        64           accept
  6.99    1.192192        5624       212           poll
  6.60    1.125829       22517        50           epoll_wait
  0.26    0.045039         506        89           close
  0.19    0.031703         170       187           sendto
  0.04    0.007508         110        68           setsockopt
  0.03    0.005558          27       209           recvfrom
  0.02    0.003000         375         8           sched_yield
  0.02    0.002999         107        28         1 epoll_ctl
  0.01    0.002000         125        16           open
  0.01    0.001999         167        12           getsockname
  0.01    0.001156          36        32           write
  0.01    0.001000         100        10           fstat
  0.01    0.001000          30        33           fcntl
  0.01    0.000999          15        67           dup2
  0.00    0.000488          98         5           rt_sigreturn
  0.00    0.000350           8        46        10 read
  0.00    0.000222           4        51           mprotect
  0.00    0.000167          42         4           openat
  0.00    0.000092           2        52           stat
  0.00    0.000084           2        45           statfs
  0.00    0.000074           4        21           mmap
  0.00    0.000000           0         9           munmap
  0.00    0.000000           0        26           rt_sigprocmask
  0.00    0.000000           0         3           ioctl
  0.00    0.000000           0         1           pipe
  0.00    0.000000           0         5           madvise
  0.00    0.000000           0         6           socket
  0.00    0.000000           0         6         4 connect
  0.00    0.000000           0         1           shutdown
  0.00    0.000000           0         3           getsockopt
  0.00    0.000000           0         7           clone
  0.00    0.000000           0         8           getdents
  0.00    0.000000           0         3           getrlimit
  0.00    0.000000           0         6           sysinfo
  0.00    0.000000           0         7           gettid
  0.00    0.000000           0        14           sched_getaffinity
  0.00    0.000000           0         1           epoll_create
  0.00    0.000000           0         7           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00   17.057362                109518     53680 total
