Posted to common-user@hadoop.apache.org by moon soo Lee <so...@gmail.com> on 2011/07/08 04:02:55 UTC

Can I safely set dfs.blockreport.intervalMsec to a very large value (1 year or more)?

I have many blocks, around 50~90m per datanode.

They often do not respond for 1~3 minutes at a time, and I think this is
because of a full scan for the block report.

So if I set dfs.blockreport.intervalMsec to a very large value (1 year or
more?), I expect the problem to clear.

But if I really do that, what happens? Any side effects?
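
For reference, the change I mean would look like this in hdfs-site.xml
(the value is just 365 days expressed in milliseconds):

    <property>
      <name>dfs.blockreport.intervalMsec</name>
      <!-- default is 3600000 (1 hour); 365 * 24 * 3600 * 1000 ms -->
      <value>31536000000</value>
    </property>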

Re: Can I safely set dfs.blockreport.intervalMsec to a very large value (1 year or more)?

Posted by moon soo Lee <so...@gmail.com>.
Thanks for the responses, Robert and Matt.

I apologize; I wrote the wrong information about the block count.

I have 0.5m~0.9m blocks per datanode.

The cluster summary shows "10901479 files and directories, 9553721 blocks =
20455200 total. Heap Size is 5.92 GB / 17.36 GB (34%)".

I have another HDFS cluster, hadoop-0.18, 45 datanodes, 0.12m~0.15m blocks
per datanode, which does not have this problem.

When I face the problem, the last contact of a datanode (from the
namenode's view) increases to 60~200 seconds (normally it is 1 or 2). At
that time the datanode does not seem to respond (to HDFS client
applications); I attach my datanode process stack dump from that
situation (dump.out.bz2).

I didn't precisely measure it, but I believe the interval between
occurrences (per datanode) = dfs.blockreport.intervalMsec (so, with the
default of 3600000 ms, roughly once an hour per datanode).
* How many datanodes in the cluster?
    40 datanodes are in the HDFS cluster (0.20+228, Cloudera CDH2).

* How many volumes (physical HDDs) per datanode?
    I have 11 volumes.

* How much RAM per datanode?
    6 GB physical; I set HADOOP_HEAPSIZE=3000 in hadoop-env.sh (see the
    GC-logging sketch after this list).

* What OS on the datanodes, and is it 32-bit or 64-bit? What max process
size is configured?
    64-bit CentOS 5.4 or 5.5 runs the datanodes.

    My ulimit -a result:

    # ulimit -a
    core file size          (blocks, -c) 0
    data seg size           (kbytes, -d) unlimited
    scheduling priority             (-e) 0
    file size               (blocks, -f) unlimited
    pending signals                 (-i) 57344
    max locked memory       (kbytes, -l) 32
    max memory size         (kbytes, -m) unlimited
    open files                      (-n) 102400
    pipe size            (512 bytes, -p) 8
    POSIX message queues     (bytes, -q) 819200
    real-time priority              (-r) 0
    stack size              (kbytes, -s) 10240
    cpu time               (seconds, -t) unlimited
    max user processes              (-u) 57344
    virtual memory          (kbytes, -v) unlimited
    file locks                      (-x) unlimited

* Is the datanode service's JVM running as 32-bit or 64-bit?
    64-bit, OpenJDK 1.6.0, release 1.11.b16.el5.
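
Since memory management and GC were mentioned, I plan to turn on GC
logging for the datanode JVM to see whether the 60~200 second pauses line
up with full GC events. A sketch of what I would add to hadoop-env.sh
(the log path is only an example):

    # Log datanode GC activity so long pauses can be matched
    # against full GC events. The path below is an example.
    export HADOOP_DATANODE_OPTS="-verbose:gc -XX:+PrintGCDetails \
      -XX:+PrintGCTimeStamps -Xloggc:/var/log/hadoop/datanode-gc.log \
      $HADOOP_DATANODE_OPTS"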



On Sat, Jul 9, 2011 at 3:12 AM, Matt Foley <mf...@hortonworks.com> wrote:

> [quoted message trimmed; it appears in full below]

Re: Can I safely set dfs.blockreport.intervalMsec to a very large value (1 year or more)?

Posted by Matt Foley <mf...@hortonworks.com>.
Hi Moon,
The periodic block report is constructed entirely from info in memory, so
there is no complete scan of the filesystem for this purpose.  The periodic
block report defaults to only sending once per hour from each datanode, and
each DN calculates a random start time for the hourly cycle (after initial
startup block report), to spread those hourly reports somewhat evenly across
the entire hour.  It is part of Hadoop's fault tolerance that the namenode
and datanodes perform this hourly check to assure that they both have the
same understanding of what replicas are available from each node.
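
Roughly, the datanode-side scheduling looks like this (a from-memory
sketch, not the exact 0.20 code; getBlockReport() here just stands for
building the report from the in-memory replica map):

    long blockReportInterval =
        conf.getLong("dfs.blockreport.intervalMsec", 60 * 60 * 1000L);

    // Offset the first report by a random delay so all datanodes
    // don't report at the same moment after a cluster restart.
    long lastBlockReport = System.currentTimeMillis()
        - (long) (Math.random() * blockReportInterval);

    while (shouldRun) {
      long now = System.currentTimeMillis();
      if (now - lastBlockReport > blockReportInterval) {
        // Built from the in-memory replica map; no disk scan here.
        namenode.blockReport(registration, getBlockReport());
        lastBlockReport = now;
      }
      // ... heartbeats and command processing also happen in this loop ...
    }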

However, depending on your answers to the questions below, you may be having
memory management and/or garbage collection problems.  We may be able to
help diagnose it if you can provide more info:

First, please confirm that you said 50,000,000 blocks per datanode (not
50,000).  This is a lot.  The data centers I'm most familiar with run with
approx. 100,000 blocks per datanode, because they need a higher ratio of
compute power to data.

Second, please confirm whether it is the datanodes, or the namenode
services, that are being non-responsive for minutes at a time.  And when you
say "often", how often are you experiencing such non-responsiveness?  What
are you experiencing when it happens?

Regarding your environment:
* How many datanodes in the cluster?
* How many volumes (physical HDDs) per datanode?
* How much RAM per datanode?
* What OS on the datanodes, and is it 32-bit or 64-bit?  What max process
size is configured?
* Is the datanode service's JVM running as 32-bit or 64-bit?

Hopefully these answers will help figure out what's going on.
--Matt


On Fri, Jul 8, 2011 at 7:21 AM, Robert Evans <ev...@yahoo-inc.com> wrote:

> [quoted message trimmed; it appears in full below]

Re: Can I safely set dfs.blockreport.intervalMsec to a very large value (1 year or more)?

Posted by Robert Evans <ev...@yahoo-inc.com>.
Moon Soo Lee,

The full block report is used in error cases. Currently, when a datanode heartbeats into the namenode, the namenode can send back a list of tasks to be performed; this is mostly for deleting blocks. The namenode just assumes that all of these tasks execute successfully, and if any of them fail, the namenode is unaware of it. HDFS-395 adds an ack to address this. Creation of new blocks is reported to the namenode as it happens, so that is not really an issue.

So if you set the period to 1 year, you will likely have several blocks in your cluster sitting around unused but taking up space. The full block report is also likely compensating for other error conditions, or even bugs in HDFS that I am unaware of, just because of the nature of it.
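In other words, the heartbeat exchange looks roughly like this (a
from-memory sketch of the idea, not the exact HDFS source):

    // The namenode piggybacks commands, such as block deletions,
    // on the heartbeat reply.
    DatanodeCommand[] cmds = namenode.sendHeartbeat(registration,
        capacity, dfsUsed, remaining, xmitsInProgress, xceiverCount);
    for (DatanodeCommand cmd : cmds) {
      // A DNA_INVALIDATE command asks the datanode to delete replicas.
      // If a delete fails, nothing reports the failure back; only the
      // next full block report re-syncs the namenode's view of this
      // node (HDFS-395 proposes an explicit ack instead).
      processCommand(cmd);
    }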

--Bobby Evans

On 7/7/11 9:02 PM, "moon soo Lee" <so...@gmail.com> wrote:

[quoted message trimmed; it appears in full above]