Posted to common-user@hadoop.apache.org by Sean Knapp <se...@ooyala.com> on 2009/01/31 23:20:11 UTC

HDFS Namenode Heap Size woes

I'm running 0.19.0 on a 10 node cluster (8 core, 16GB RAM, 4x1.5TB). The
current status of my FS is approximately 1 million files and directories,
950k blocks, and heap size of 7GB (16GB reserved). Average block replication
is 3.8. I'm concerned that the heap size is steadily climbing... a 7GB heap
is substantially higher per file than I have on a similar 0.18.2 cluster,
which has closer to a 1GB heap.
My typical usage model is 1) write a number of small files into HDFS (tens
or hundreds of thousands at a time), 2) archive those files, 3) delete the
originals. I've tried dropping the replication factor of the _index and
_masterindex files without much effect on overall heap size. While I had
trash enabled at one point, I've since disabled it and deleted the .Trash
folders.
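
For reference, the archive step is the Hadoop Archives tool; the sequence
looks roughly like this (the paths here are made up for illustration):

  # pack a batch of small files into a single HAR, then drop the originals
  hadoop archive -archiveName batch0131.har /user/sean/incoming /user/sean/archived
  hadoop fs -rmr /user/sean/incoming
  # what I tried: lowering replication on the HAR index files
  hadoop fs -setrep 2 /user/sean/archived/batch0131.har/_index
  hadoop fs -setrep 2 /user/sean/archived/batch0131.har/_masterindex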

On namenode startup, I get a massive number of the following lines in my log
file:
2009-01-31 21:41:23,283 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.processReport: block blk_-2389330910609345428_7332878 on
172.16.129.33:50010 size 798080 does not belong to any file.
2009-01-31 21:41:23,283 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addToInvalidates: blk_-2389330910609345428 is added to invalidSet
of 172.16.129.33:50010

I suspect the original files may be left behind and causing the heap size
bloat. Is there any accounting mechanism to determine what is contributing
to my heap size?
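
One crude accounting idea I can try myself is a class histogram from jmap
(assuming a Sun JDK 6 on the namenode box; <namenode-pid> is a placeholder):

  # prints live-object counts per class; -histo:live forces a full GC first
  jmap -histo:live <namenode-pid> | head -30

If it really is file/block metadata, the inode and block classes should
dominate the histogram.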

Thanks,
Sean

Re: HDFS Namenode Heap Size woes

Posted by Sean Knapp <se...@ooyala.com>.
Jason,
Thanks again for the response. Is there a way to inspect these lists to
verify?

Regards,
Sean


Re: HDFS Namenode Heap Size woes

Posted by jason hadoop <ja...@gmail.com>.
When the nodes drop out into dead status, that creates the replication
workload and memory load on the namenode.
I don't know whether the 100+ second case does so as well.


Re: HDFS Namenode Heap Size woes

Posted by Sean Knapp <se...@ooyala.com>.
Brian, Jason,
Thanks again for your help. Just to close the thread: while following your
suggestions I found I had an incredibly large number of files on my data
nodes that were being marked for invalidation at startup. I believe they
were left behind from an old mass-delete that was followed by a shutdown
before the deletes were performed. I've cleaned out those files and we're
humming along with <1GB heap size.
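
For anyone who hits the same thing: the tell was the flood of
addToInvalidates lines at startup. Counting them per datanode in the
namenode log made the scale obvious, something like (the log file name
will vary with your user and host):

  # one count per datanode address, using the trailing field of the log line
  grep 'NameSystem.addToInvalidates' hadoop-*-namenode-*.log | awk '{print $NF}' | sort | uniq -c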

Thanks,
Sean


Re: HDFS Namenode Heap Size woes

Posted by jason hadoop <ja...@gmail.com>.
If you set up your namenode for remote debugging, you could attach with
eclipse or the debugger of your choice.
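Roughly, for a Sun JVM that means adding the JDWP flags to the namenode's
options in hadoop-env.sh before starting it (the port is arbitrary):

  # listen for a debugger on port 8000 without suspending startup
  export HADOOP_NAMENODE_OPTS="-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8000 $HADOOP_NAMENODE_OPTS"

Then point eclipse's remote debug configuration at namenode-host:8000.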

Look at the objects in org.apache.hadoop.hdfs.server.namenode.FSNamesystem
  private UnderReplicatedBlocks neededReplications =
      new UnderReplicatedBlocks();
  private PendingReplicationBlocks pendingReplications;

  //
  // Keeps a Collection for every named machine containing
  // blocks that have recently been invalidated and are thought to live
  // on the machine in question.
  // Mapping: StorageID -> ArrayList<Block>
  //
  private Map<String, Collection<Block>> recentInvalidateSets =
    new TreeMap<String, Collection<Block>>();

  //
  // Keeps a TreeSet for every named node.  Each treeset contains
  // a list of the blocks that are "extra" at that location.  We'll
  // eventually remove these extras.
  // Mapping: StorageID -> TreeSet<Block>
  //
  Map<String, Collection<Block>> excessReplicateMap =
    new TreeMap<String, Collection<Block>>();

Much of this is run out of the ReplicationMonitor thread.

In our case we had datanodes with 2 million blocks dropping off and on
again, and this was thrashing these queues with the 2 million blocks on
the datanodes, re-replicating the blocks and then invalidating them all
when the datanode came back.



Re: HDFS Namenode Heap Size woes

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hey Sean,

I use JMX monitoring -- which allows me to trigger GC via jconsole.
There's decent documentation out there on making it work, but you'd
have to restart the namenode to do it ... let the list know if you
can't figure it out.
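
As a starting point -- this is from memory, so double-check the docs --
the knobs go in hadoop-env.sh, and turning off auth/ssl like this is
only sane on a trusted network:

  # expose an unauthenticated JMX port on the namenode
  export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote \
    -Dcom.sun.management.jmxremote.port=8004 \
    -Dcom.sun.management.jmxremote.authenticate=false \
    -Dcom.sun.management.jmxremote.ssl=false $HADOOP_NAMENODE_OPTS"

Then connect jconsole to namenode-host:8004 and use Perform GC on the
Memory tab.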

Brian


Re: HDFS Namenode Heap Size woes

Posted by Sean Knapp <se...@ooyala.com>.
Brian,
Thanks for jumping in as well. Is there a recommended way of manually
triggering GC?

Thanks,
Sean


Re: HDFS Namenode Heap Size woes

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hey Sean,

Dumb question: how much memory is used after a garbage collection cycle?

Look at the graph "jvm.metrics.memHeapUsedM":

http://rcf.unl.edu/ganglia/?m=network_report&r=hour&s=descending&c=red&h=hadoop-name&sh=1&hc=4&z=small

If you tell the JVM it has 16GB of memory to play with, it will often
use a significant portion of that before it does a thorough GC. At
our site, it actually only needs ~500MB, but sometimes it will hit
1GB before GC is triggered. One of the vagaries of Java, eh?

Trigger a GC and see how much is actually used.
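
If you'd rather watch from a shell, jstat (ships with the JDK) can
sample the collector, e.g. every 5 seconds:

  # O = old generation utilization (% of capacity); watch it drop
  # after a full GC to see the real live set
  jstat -gcutil <namenode-pid> 5000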

Brian


Re: HDFS Namenode Heap Size woes

Posted by Sean Knapp <se...@ooyala.com>.
Jason,
Thanks for the response. By falling out, do you mean a longer time since
last contact (100s+), or fully timed out where it is dropped into dead
nodes? The former happens fairly often, the latter only under serious load
but not in the last day. Also, my namenode is now up to 10GB with less than
700k files after some additional archiving.

Thanks,
Sean


Re: HDFS Namenode Heap Size woes

Posted by jason hadoop <ja...@gmail.com>.
If your datanodes are pausing and falling out of the cluster you will get a
large workload for the namenode of blocks to replicate and when the paused
datanode comes back, a large workload of blocks to delete.
These lists are stored in memory on the namenode.
The startup messages lead me to wonder if your datanodes are periodically
pausing or are otherwise dropping in and out of the cluster.
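
If memory serves, you can also dump these queues without a debugger --
metasave writes the pending-replication and pending-deletion lists, plus
per-datanode status, to a file under the namenode's log directory:

  # run as the HDFS superuser; output lands in hadoop.log.dir on the namenode
  hadoop dfsadmin -metasave meta.txt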
