Posted to user@hbase.apache.org by Henning Blohm <he...@zfabrik.de> on 2010/11/19 14:09:36 UTC

map task performance degradation - any idea why?

We have a Hadoop 0.20.2 + HBase 0.20.6 setup with three data nodes
(12 GB RAM, 1.5 TB disk each) and one master node (24 GB RAM, 1.5 TB
disk). We store a relatively simple table in HBase (1 column family,
5 columns, row key about 100 chars).

In order to better understand the load behavior, I wanted to put 5*10^8
rows into that table. I wrote an M/R job that uses a custom split
InputFormat to divide the 5*10^8 logical row keys (essentially just
counting from 0 to 5*10^8-1) into 1000 chunks of 500,000 keys, and then
lets the map tasks do the actual job of writing the corresponding rows
(with some random column values) into HBase.

So there are 1000 map tasks and no reducer. Each task writes 500,000
rows into HBase. We have 6 mapper slots, i.e. 24 mappers running in
parallel.
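
Roughly, each map task does something like the following (just a sketch:
the class, table, and column names are hypothetical, the input key/value
pair produced by the InputFormat is assumed to be a (startKey, chunkSize)
pair, and the 0.20-era HBase client API is assumed):

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Mapper;

// One map() call receives a (startKey, chunkSize) pair produced by the
// custom InputFormat and writes that many rows into the HBase table.
public class RowWriterMapper
    extends Mapper<LongWritable, LongWritable, NullWritable, NullWritable> {

  private static final byte[] FAMILY = Bytes.toBytes("cf");
  private static final byte[] QUALIFIER = Bytes.toBytes("c1");

  private HTable table;
  private final Random random = new Random();

  @Override
  protected void setup(Context context) throws IOException {
    // "testdata" is a placeholder table name.
    table = new HTable(new HBaseConfiguration(), "testdata");
  }

  @Override
  protected void map(LongWritable startKey, LongWritable chunkSize, Context context)
      throws IOException, InterruptedException {
    for (long i = 0; i < chunkSize.get(); i++) {
      // Pad the logical key so it resembles the ~100 char row keys described above.
      byte[] rowKey = Bytes.toBytes(String.format("row-%020d", startKey.get() + i));
      Put put = new Put(rowKey);
      put.add(FAMILY, QUALIFIER, Bytes.toBytes(random.nextLong()));
      table.put(put);
      if (i % 10000 == 0) {
        context.progress(); // keep the task from timing out on long loops
      }
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    // A no-op with the default auto-flush, but makes sure nothing stays buffered.
    table.flushCommits();
  }
}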

The whole job runs for approx. 48 hours. Initially the map tasks need
around 30 min. each. After a while things take longer and longer,
eventually reaching more than 2 hours per task. This peaks around the
850th task, after which things speed up again, improving to about
48 min. per task towards the end, until the job completes.

These are all dedicated machines and nothing else is running on them.
The map tasks have a 200 MB heap, and when checking with vmstat during
the run I cannot observe any swapping.

Also, on the master, heap utilization does not seem to be at its limit
and there is no swapping either. All Hadoop and HBase processes have a
1 GB heap.

Any idea what could cause this strong variation (or degradation) in
write performance?
Is there a way to find out where the time gets lost?

Thanks,
  Henning


Re: map task performance degradation - any idea why?

Posted by Alex Baranau <al...@gmail.com>.
Added this to the HBASE-2888 (Review all our metrics) issue.

Thanks for the support, Lars!

Alex Baranau
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase

On Mon, Nov 22, 2010 at 11:05 AM, Lars George <la...@gmail.com> wrote:

> Hi Alex,
>
> We once had a class called Historian that would log the splits etc.  It was
> abandoned as it was not implemented right. Currently we only have those logs
> and I am personally agreeing with you that this is not good for the average
> (or even advanced) user. In the future we may be able to use Coprocessors to
> handle region split info being written somewhere, but for now there is a
> void.
>
> I 100% agree with you that this info should be in a metric as it is vital
> information of what the system is doing and all the core devs are used to
> read the logs directly, which I think is ok for them but madness otherwise.
> Please have look if there is a JIRA to this event and if not create a new
> one so that we can improve this for everyone. I am strongly for these sort
> of additions and will help you getting this committed if you could provide a
> patch.
>
> Lars
>
> On Nov 22, 2010, at 9:38, Alex Baranau <al...@gmail.com> wrote:
>
> > Hello Lars,
> >
> >> But ideally you would do a post
> >> mortem on the master and slave logs for Hadoop and HBase, since that
> >> would give you a better insight of the events. For example, when did
> >> the system start to flush, when did it start compacting, when did the
> >> HDFS start to go slow?
> >
> > I wonder, does it makes sense to expose these events somehow to the HBase
> > web interface in an easy accessible way: e.g. list of time points of
> splits
> > (for each regionserver), compaction/flush events (or rate), etc. Looks
> like
> > this info is valuable for many users and most importantly I believe can
> > affect their configuration-related decisions, so showing the data on web
> > interface instead of making users dig into logs makes sense and brings
> HBase
> > towards "easy-to-install&use" a bit. Thoughts?
> >
> > btw, I found that in JMX we expose only flush and compaction related
> data.
> > Nothing related to splits. Could you give a hint why? Also we have only
> time
> > and size being exposed, probably count (number of actions) would be good
> to
> > expose too: thus one can see flush/compaction/split(?) rate and make
> > judgement on whether some configuration is properly set (e.g.
> > hbase.hregion.memstore.flush.size).
> >
> > Thanks,
> > Alex Baranau
> > ----
> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop -
> HBase
> >
> > On Fri, Nov 19, 2010 at 5:16 PM, Lars George <la...@gmail.com>
> wrote:
> >
> >> Hi Henning,
> >>
> >> And you what you have seen is often difficult to explain. What I
> >> listed are the obvious contenders. But ideally you would do a post
> >> mortem on the master and slave logs for Hadoop and HBase, since that
> >> would give you a better insight of the events. For example, when did
> >> the system start to flush, when did it start compacting, when did the
> >> HDFS start to go slow? And so on. One thing that I would highly
> >> recommend (if you haven't done so already) is getting graphing going.
> >> Use the build in Ganglia support and you may be able to at least
> >> determine the overall load on the system and various metrics of Hadoop
> >> and HBase.
> >>
> >> Did you use the normal Puts or did you set it to cache Puts and write
> >> them in bulk? See HTable.setWriteBufferSize() and
> >> HTable.setAutoFlush() for details (but please note that you then do
> >> need to call HTable.flushCommits() in your close() method of the
> >> mapper class). That will help a lot speeding up writing data.
> >>
> >> Lars
> >>
> >> On Fri, Nov 19, 2010 at 3:43 PM, Henning Blohm <
> henning.blohm@zfabrik.de>
> >> wrote:
> >>> Hi Lars,
> >>>
> >>> thanks. Yes, this is just the first test setup. Eventually the data
> >>> load will be significantly higher.
> >>>
> >>> At the moment (looking at the master after the run) the number of
> >>> regions is well-distributed (684,685,685 regions). The overall
> >>> HDFS use is  ~700G. (replication factor is 3 btw).
> >>>
> >>> I will want to upgrade as soon as that makes sense. It seems there is
> >>> "release" after 0.20.6 - that's why we are still with that one.
> >>>
> >>> When I do that run again, I will check the master UI and see how things
> >>> develop there. As for the current run: I do not expect
> >>> to get stable numbers early in the run. What looked suspicous was that
> >>> things got gradually worse until well into 30 hours after
> >>> the start of the run and then even got better. An unexpected load
> >>> behavior for me (would have expected early changes but then
> >>> some stable behavior up to the end).
> >>>
> >>> Thanks,
> >>> Henning
> >>>
> >>> Am Freitag, den 19.11.2010, 15:21 +0100 schrieb Lars George:
> >>>
> >>>> Hi Henning,
> >>>>
> >>>> Could you look at the Master UI while doing the import? The issue with
> >>>> a cold bulk import is that you are hitting one region server
> >>>> initially, and while it is filling up its in-memory structures all is
> >>>> nice and dandy. Then ou start to tax the server as it has to flush
> >>>> data out and it becomes slower responding to the mappers still
> >>>> hammering it. Only after a while the regions become large enough so
> >>>> that they get split and load starts to spread across 2 machines, then
> >>>> 3. Eventually you have enough regions to handle your data and you will
> >>>> see an average of the performance you could expect from a loaded
> >>>> cluster. For that reason we have added a bulk loading feature that
> >>>> helps building the region files externally and then swap them in.
> >>>>
> >>>> When you check the UI you can actually see this behavior as the
> >>>> operations-per-second (ops) are bound to one server initially. Well,
> >>>> could be two as one of them has to also serve META. If that is the
> >>>> same machine then you are penalized twice.
> >>>>
> >>>> In addition you start to run into minor compaction load while HBase
> >>>> tries to do housekeeping during your load.
> >>>>
> >>>> With 0.89 you could pre-split the regions into what you see eventually
> >>>> when your job is complete. Please use the UI to check and let us know
> >>>> how many regions you end up with in total (out of interest mainly). If
> >>>> you had that done before the import then the load is split right from
> >>>> the start.
> >>>>
> >>>> In general 0.89 is much better performance wise when it comes to bulk
> >>>> loads so you may want to try it out as well. The 0.90RC is up so a
> >>>> release is imminent and saves you from having to upgrade soon. Also,
> >>>> 0.90 is the first with Hadoop's append fix, so that you do not lose
> >>>> any data from wonky server behavior.
> >>>>
> >>>> And to wrap this up, 3 data nodes is not too great. If you ask anyone
> >>>> with a serious production type setup you will see 10 machines and
> >>>> more, I'd say 20-30 machines and up. Some would say "Use MySQL for
> >>>> this little data" but that is not fair given that we do not know what
> >>>> your targets are. Bottom line is, you will see issues (like slowness)
> >>>> with 3 nodes that 8 or 10 nodes will never show.
> >>>>
> >>>> HTH,
> >>>> Lars
> >>>>
> >>>>
> >>>> On Fri, Nov 19, 2010 at 2:09 PM, Henning Blohm <
> >> henning.blohm@zfabrik.de> wrote:
> >>>>> We have a Hadoop 0.20.2 + Hbase 0.20.6 setup with three data nodes
> >>>>> (12GB, 1.5TB each) and one master node (24GB, 1.5TB). We store a
> >>>>> relatively simple
> >>>>> table in HBase (1 column familiy, 5 columns, rowkey about 100chars).
> >>>>>
> >>>>> In order to better understand the load behavior, I wanted to put
> >> 5*10^8
> >>>>> rows into that table. I wrote an M/R job that uses a Split Input
> >> Format
> >>>>> to split the
> >>>>> 5*10^8 logical row keys (essentially just counting from 0 to
> 5*10^8-1)
> >>>>> into 1000 chunks of 500000 keys and then let the map do the actual
> job
> >>>>> of writing the corresponding rows (with some random column values)
> >> into
> >>>>> hbase.
> >>>>>
> >>>>> So there are 1000 map tasks, no reducer. Each task writes 500000 rows
> >>>>> into Hbase. We have 6 mapper slots, i.e. 24 mappers running parallel.
> >>>>>
> >>>>> The whole job runs for approx. 48 hours. Initially the map tasks need
> >>>>> around 30 min. each. After a while things take longer and longer,
> >>>>> eventually
> >>>>> reaching > 2h. It tops around the 850s task after which things speed
> >> up
> >>>>> again improving to about 48min. in the end, until completed.
> >>>>>
> >>>>> It's all dedicated machines and there is nothing else running. The
> map
> >>>>> tasks have 200m heap and when checking with vmstat in between I
> cannot
> >>>>> observe swapping.
> >>>>>
> >>>>> Also, on the master it seems that heap utilization is not at the
> limit
> >>>>> and no swapping either. All Hadoop and Hbase processes have
> >>>>> 1G heap.
> >>>>>
> >>>>> Any idea what would cause the strong variation (or degradation) of
> >> write
> >>>>> performance?
> >>>>> Is there a way of finding out where time gets lost?
> >>>>>
> >>>>> Thanks,
> >>>>> Henning
> >>>>>
> >>>>>
> >>>
> >>
>

Re: map task performance degradation - any idea why?

Posted by Lars George <la...@gmail.com>.
Hi Alex,

We once had a class called Historian that would log the splits etc. It was abandoned because it was not implemented right. Currently we only have the logs, and I personally agree with you that this is not good for the average (or even advanced) user. In the future we may be able to use Coprocessors to write region split info somewhere, but for now there is a void.

I 100% agree with you that this info should be exposed as a metric, as it is vital information about what the system is doing. All the core devs are used to reading the logs directly, which I think is OK for them but madness otherwise. Please have a look whether there is already a JIRA for this and, if not, create a new one so that we can improve this for everyone. I am strongly in favor of these sorts of additions and will help you get this committed if you can provide a patch.

Lars

On Nov 22, 2010, at 9:38, Alex Baranau <al...@gmail.com> wrote:

> Hello Lars,
> 
>> But ideally you would do a post
>> mortem on the master and slave logs for Hadoop and HBase, since that
>> would give you a better insight of the events. For example, when did
>> the system start to flush, when did it start compacting, when did the
>> HDFS start to go slow?
> 
> I wonder, does it makes sense to expose these events somehow to the HBase
> web interface in an easy accessible way: e.g. list of time points of splits
> (for each regionserver), compaction/flush events (or rate), etc. Looks like
> this info is valuable for many users and most importantly I believe can
> affect their configuration-related decisions, so showing the data on web
> interface instead of making users dig into logs makes sense and brings HBase
> towards "easy-to-install&use" a bit. Thoughts?
> 
> btw, I found that in JMX we expose only flush and compaction related data.
> Nothing related to splits. Could you give a hint why? Also we have only time
> and size being exposed, probably count (number of actions) would be good to
> expose too: thus one can see flush/compaction/split(?) rate and make
> judgement on whether some configuration is properly set (e.g.
> hbase.hregion.memstore.flush.size).
> 
> Thanks,
> Alex Baranau
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
> 
> On Fri, Nov 19, 2010 at 5:16 PM, Lars George <la...@gmail.com> wrote:
> 
>> Hi Henning,
>> 
>> And you what you have seen is often difficult to explain. What I
>> listed are the obvious contenders. But ideally you would do a post
>> mortem on the master and slave logs for Hadoop and HBase, since that
>> would give you a better insight of the events. For example, when did
>> the system start to flush, when did it start compacting, when did the
>> HDFS start to go slow? And so on. One thing that I would highly
>> recommend (if you haven't done so already) is getting graphing going.
>> Use the build in Ganglia support and you may be able to at least
>> determine the overall load on the system and various metrics of Hadoop
>> and HBase.
>> 
>> Did you use the normal Puts or did you set it to cache Puts and write
>> them in bulk? See HTable.setWriteBufferSize() and
>> HTable.setAutoFlush() for details (but please note that you then do
>> need to call HTable.flushCommits() in your close() method of the
>> mapper class). That will help a lot speeding up writing data.
>> 
>> Lars
>> 
>> On Fri, Nov 19, 2010 at 3:43 PM, Henning Blohm <he...@zfabrik.de>
>> wrote:
>>> Hi Lars,
>>> 
>>> thanks. Yes, this is just the first test setup. Eventually the data
>>> load will be significantly higher.
>>> 
>>> At the moment (looking at the master after the run) the number of
>>> regions is well-distributed (684,685,685 regions). The overall
>>> HDFS use is  ~700G. (replication factor is 3 btw).
>>> 
>>> I will want to upgrade as soon as that makes sense. It seems there is
>>> "release" after 0.20.6 - that's why we are still with that one.
>>> 
>>> When I do that run again, I will check the master UI and see how things
>>> develop there. As for the current run: I do not expect
>>> to get stable numbers early in the run. What looked suspicous was that
>>> things got gradually worse until well into 30 hours after
>>> the start of the run and then even got better. An unexpected load
>>> behavior for me (would have expected early changes but then
>>> some stable behavior up to the end).
>>> 
>>> Thanks,
>>> Henning
>>> 
>>> Am Freitag, den 19.11.2010, 15:21 +0100 schrieb Lars George:
>>> 
>>>> Hi Henning,
>>>> 
>>>> Could you look at the Master UI while doing the import? The issue with
>>>> a cold bulk import is that you are hitting one region server
>>>> initially, and while it is filling up its in-memory structures all is
>>>> nice and dandy. Then ou start to tax the server as it has to flush
>>>> data out and it becomes slower responding to the mappers still
>>>> hammering it. Only after a while the regions become large enough so
>>>> that they get split and load starts to spread across 2 machines, then
>>>> 3. Eventually you have enough regions to handle your data and you will
>>>> see an average of the performance you could expect from a loaded
>>>> cluster. For that reason we have added a bulk loading feature that
>>>> helps building the region files externally and then swap them in.
>>>> 
>>>> When you check the UI you can actually see this behavior as the
>>>> operations-per-second (ops) are bound to one server initially. Well,
>>>> could be two as one of them has to also serve META. If that is the
>>>> same machine then you are penalized twice.
>>>> 
>>>> In addition you start to run into minor compaction load while HBase
>>>> tries to do housekeeping during your load.
>>>> 
>>>> With 0.89 you could pre-split the regions into what you see eventually
>>>> when your job is complete. Please use the UI to check and let us know
>>>> how many regions you end up with in total (out of interest mainly). If
>>>> you had that done before the import then the load is split right from
>>>> the start.
>>>> 
>>>> In general 0.89 is much better performance wise when it comes to bulk
>>>> loads so you may want to try it out as well. The 0.90RC is up so a
>>>> release is imminent and saves you from having to upgrade soon. Also,
>>>> 0.90 is the first with Hadoop's append fix, so that you do not lose
>>>> any data from wonky server behavior.
>>>> 
>>>> And to wrap this up, 3 data nodes is not too great. If you ask anyone
>>>> with a serious production type setup you will see 10 machines and
>>>> more, I'd say 20-30 machines and up. Some would say "Use MySQL for
>>>> this little data" but that is not fair given that we do not know what
>>>> your targets are. Bottom line is, you will see issues (like slowness)
>>>> with 3 nodes that 8 or 10 nodes will never show.
>>>> 
>>>> HTH,
>>>> Lars
>>>> 
>>>> 
>>>> On Fri, Nov 19, 2010 at 2:09 PM, Henning Blohm <
>> henning.blohm@zfabrik.de> wrote:
>>>>> We have a Hadoop 0.20.2 + Hbase 0.20.6 setup with three data nodes
>>>>> (12GB, 1.5TB each) and one master node (24GB, 1.5TB). We store a
>>>>> relatively simple
>>>>> table in HBase (1 column familiy, 5 columns, rowkey about 100chars).
>>>>> 
>>>>> In order to better understand the load behavior, I wanted to put
>> 5*10^8
>>>>> rows into that table. I wrote an M/R job that uses a Split Input
>> Format
>>>>> to split the
>>>>> 5*10^8 logical row keys (essentially just counting from 0 to 5*10^8-1)
>>>>> into 1000 chunks of 500000 keys and then let the map do the actual job
>>>>> of writing the corresponding rows (with some random column values)
>> into
>>>>> hbase.
>>>>> 
>>>>> So there are 1000 map tasks, no reducer. Each task writes 500000 rows
>>>>> into Hbase. We have 6 mapper slots, i.e. 24 mappers running parallel.
>>>>> 
>>>>> The whole job runs for approx. 48 hours. Initially the map tasks need
>>>>> around 30 min. each. After a while things take longer and longer,
>>>>> eventually
>>>>> reaching > 2h. It tops around the 850s task after which things speed
>> up
>>>>> again improving to about 48min. in the end, until completed.
>>>>> 
>>>>> It's all dedicated machines and there is nothing else running. The map
>>>>> tasks have 200m heap and when checking with vmstat in between I cannot
>>>>> observe swapping.
>>>>> 
>>>>> Also, on the master it seems that heap utilization is not at the limit
>>>>> and no swapping either. All Hadoop and Hbase processes have
>>>>> 1G heap.
>>>>> 
>>>>> Any idea what would cause the strong variation (or degradation) of
>> write
>>>>> performance?
>>>>> Is there a way of finding out where time gets lost?
>>>>> 
>>>>> Thanks,
>>>>> Henning
>>>>> 
>>>>> 
>>> 
>> 

Re: map task performance degradation - any idea why?

Posted by Alex Baranau <al...@gmail.com>.
Hello Lars,

> But ideally you would do a post
> mortem on the master and slave logs for Hadoop and HBase, since that
> would give you a better insight of the events. For example, when did
> the system start to flush, when did it start compacting, when did the
> HDFS start to go slow?

I wonder, does it make sense to expose these events in the HBase web
interface in an easily accessible way: e.g. a list of the points in time
of splits (per region server), compaction/flush events (or rates), etc.?
This info looks valuable for many users and, most importantly, I believe
it can affect their configuration-related decisions, so showing the data
on the web interface instead of making users dig into logs makes sense
and moves HBase a bit closer to "easy to install & use". Thoughts?

Btw, I found that via JMX we expose only flush- and compaction-related
data, nothing related to splits. Could you give a hint why? Also, only
time and size are exposed; a count (number of actions) would probably be
good to expose too, so that one can see the flush/compaction/split(?)
rate and judge whether some configuration is set properly (e.g.
hbase.hregion.memstore.flush.size).

Thanks,
Alex Baranau
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase

On Fri, Nov 19, 2010 at 5:16 PM, Lars George <la...@gmail.com> wrote:

> Hi Henning,
>
> And you what you have seen is often difficult to explain. What I
> listed are the obvious contenders. But ideally you would do a post
> mortem on the master and slave logs for Hadoop and HBase, since that
> would give you a better insight of the events. For example, when did
> the system start to flush, when did it start compacting, when did the
> HDFS start to go slow? And so on. One thing that I would highly
> recommend (if you haven't done so already) is getting graphing going.
> Use the build in Ganglia support and you may be able to at least
> determine the overall load on the system and various metrics of Hadoop
> and HBase.
>
> Did you use the normal Puts or did you set it to cache Puts and write
> them in bulk? See HTable.setWriteBufferSize() and
> HTable.setAutoFlush() for details (but please note that you then do
> need to call HTable.flushCommits() in your close() method of the
> mapper class). That will help a lot speeding up writing data.
>
> Lars
>
> On Fri, Nov 19, 2010 at 3:43 PM, Henning Blohm <he...@zfabrik.de>
> wrote:
> > Hi Lars,
> >
> >  thanks. Yes, this is just the first test setup. Eventually the data
> > load will be significantly higher.
> >
> > At the moment (looking at the master after the run) the number of
> > regions is well-distributed (684,685,685 regions). The overall
> > HDFS use is  ~700G. (replication factor is 3 btw).
> >
> > I will want to upgrade as soon as that makes sense. It seems there is
> > "release" after 0.20.6 - that's why we are still with that one.
> >
> > When I do that run again, I will check the master UI and see how things
> > develop there. As for the current run: I do not expect
> > to get stable numbers early in the run. What looked suspicous was that
> > things got gradually worse until well into 30 hours after
> > the start of the run and then even got better. An unexpected load
> > behavior for me (would have expected early changes but then
> > some stable behavior up to the end).
> >
> > Thanks,
> >  Henning
> >
> > Am Freitag, den 19.11.2010, 15:21 +0100 schrieb Lars George:
> >
> >> Hi Henning,
> >>
> >> Could you look at the Master UI while doing the import? The issue with
> >> a cold bulk import is that you are hitting one region server
> >> initially, and while it is filling up its in-memory structures all is
> >> nice and dandy. Then ou start to tax the server as it has to flush
> >> data out and it becomes slower responding to the mappers still
> >> hammering it. Only after a while the regions become large enough so
> >> that they get split and load starts to spread across 2 machines, then
> >> 3. Eventually you have enough regions to handle your data and you will
> >> see an average of the performance you could expect from a loaded
> >> cluster. For that reason we have added a bulk loading feature that
> >> helps building the region files externally and then swap them in.
> >>
> >> When you check the UI you can actually see this behavior as the
> >> operations-per-second (ops) are bound to one server initially. Well,
> >> could be two as one of them has to also serve META. If that is the
> >> same machine then you are penalized twice.
> >>
> >> In addition you start to run into minor compaction load while HBase
> >> tries to do housekeeping during your load.
> >>
> >> With 0.89 you could pre-split the regions into what you see eventually
> >> when your job is complete. Please use the UI to check and let us know
> >> how many regions you end up with in total (out of interest mainly). If
> >> you had that done before the import then the load is split right from
> >> the start.
> >>
> >> In general 0.89 is much better performance wise when it comes to bulk
> >> loads so you may want to try it out as well. The 0.90RC is up so a
> >> release is imminent and saves you from having to upgrade soon. Also,
> >> 0.90 is the first with Hadoop's append fix, so that you do not lose
> >> any data from wonky server behavior.
> >>
> >> And to wrap this up, 3 data nodes is not too great. If you ask anyone
> >> with a serious production type setup you will see 10 machines and
> >> more, I'd say 20-30 machines and up. Some would say "Use MySQL for
> >> this little data" but that is not fair given that we do not know what
> >> your targets are. Bottom line is, you will see issues (like slowness)
> >> with 3 nodes that 8 or 10 nodes will never show.
> >>
> >> HTH,
> >> Lars
> >>
> >>
> >> On Fri, Nov 19, 2010 at 2:09 PM, Henning Blohm <
> henning.blohm@zfabrik.de> wrote:
> >> > We have a Hadoop 0.20.2 + Hbase 0.20.6 setup with three data nodes
> >> > (12GB, 1.5TB each) and one master node (24GB, 1.5TB). We store a
> >> > relatively simple
> >> > table in HBase (1 column familiy, 5 columns, rowkey about 100chars).
> >> >
> >> > In order to better understand the load behavior, I wanted to put
> 5*10^8
> >> > rows into that table. I wrote an M/R job that uses a Split Input
> Format
> >> > to split the
> >> > 5*10^8 logical row keys (essentially just counting from 0 to 5*10^8-1)
> >> > into 1000 chunks of 500000 keys and then let the map do the actual job
> >> > of writing the corresponding rows (with some random column values)
> into
> >> > hbase.
> >> >
> >> > So there are 1000 map tasks, no reducer. Each task writes 500000 rows
> >> > into Hbase. We have 6 mapper slots, i.e. 24 mappers running parallel.
> >> >
> >> > The whole job runs for approx. 48 hours. Initially the map tasks need
> >> > around 30 min. each. After a while things take longer and longer,
> >> > eventually
> >> > reaching > 2h. It tops around the 850s task after which things speed
> up
> >> > again improving to about 48min. in the end, until completed.
> >> >
> >> > It's all dedicated machines and there is nothing else running. The map
> >> > tasks have 200m heap and when checking with vmstat in between I cannot
> >> > observe swapping.
> >> >
> >> > Also, on the master it seems that heap utilization is not at the limit
> >> > and no swapping either. All Hadoop and Hbase processes have
> >> > 1G heap.
> >> >
> >> > Any idea what would cause the strong variation (or degradation) of
> write
> >> > performance?
> >> > Is there a way of finding out where time gets lost?
> >> >
> >> > Thanks,
> >> >  Henning
> >> >
> >> >
> >
>

Re: map task performance degradation - any idea why?

Posted by Lars George <la...@gmail.com>.
Yeah, turning off the WAL would have been my next suggestion. Apart
from that, Ganglia is really easy to set up - you might want to consider
getting used to it now :)
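
(For reference: in the client API of that era, skipping the WAL is a
per-Put setting - a sketch only, and only advisable for data you can
afford to re-load:)

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class NoWalPutExample {
  public static Put unloggedPut(String rowKey, byte[] family, byte[] qualifier, byte[] value) {
    Put put = new Put(Bytes.toBytes(rowKey));
    put.setWriteToWAL(false); // skip the write-ahead log: faster, but the edit
                              // is lost if the region server dies before a flush
    put.add(family, qualifier, value);
    return put;
  }
}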

On Fri, Nov 19, 2010 at 4:29 PM, Henning Blohm <he...@zfabrik.de> wrote:
> Hi Lars,
>
>  we do not have anything like ganglia up. Unfortunately.
>
>  I use regular puts with autoflush turned off, with a buffer of 4MB
> (could be bigger right?). We write to WAL.
>
>  I flush every 1000 recs.
>
>  I will try again - maybe over the weekend - and see if I can find out
> more.
>
> Thanks,
>  Henning
>
> Am Freitag, den 19.11.2010, 16:16 +0100 schrieb Lars George:
>
>> Hi Henning,
>>
>> And you what you have seen is often difficult to explain. What I
>> listed are the obvious contenders. But ideally you would do a post
>> mortem on the master and slave logs for Hadoop and HBase, since that
>> would give you a better insight of the events. For example, when did
>> the system start to flush, when did it start compacting, when did the
>> HDFS start to go slow? And so on. One thing that I would highly
>> recommend (if you haven't done so already) is getting graphing going.
>> Use the build in Ganglia support and you may be able to at least
>> determine the overall load on the system and various metrics of Hadoop
>> and HBase.
>>
>> Did you use the normal Puts or did you set it to cache Puts and write
>> them in bulk? See HTable.setWriteBufferSize() and
>> HTable.setAutoFlush() for details (but please note that you then do
>> need to call HTable.flushCommits() in your close() method of the
>> mapper class). That will help a lot speeding up writing data.
>>
>> Lars
>>
>> On Fri, Nov 19, 2010 at 3:43 PM, Henning Blohm <he...@zfabrik.de> wrote:
>> > Hi Lars,
>> >
>> >  thanks. Yes, this is just the first test setup. Eventually the data
>> > load will be significantly higher.
>> >
>> > At the moment (looking at the master after the run) the number of
>> > regions is well-distributed (684,685,685 regions). The overall
>> > HDFS use is  ~700G. (replication factor is 3 btw).
>> >
>> > I will want to upgrade as soon as that makes sense. It seems there is
>> > "release" after 0.20.6 - that's why we are still with that one.
>> >
>> > When I do that run again, I will check the master UI and see how things
>> > develop there. As for the current run: I do not expect
>> > to get stable numbers early in the run. What looked suspicous was that
>> > things got gradually worse until well into 30 hours after
>> > the start of the run and then even got better. An unexpected load
>> > behavior for me (would have expected early changes but then
>> > some stable behavior up to the end).
>> >
>> > Thanks,
>> >  Henning
>> >
>> > Am Freitag, den 19.11.2010, 15:21 +0100 schrieb Lars George:
>> >
>> >> Hi Henning,
>> >>
>> >> Could you look at the Master UI while doing the import? The issue with
>> >> a cold bulk import is that you are hitting one region server
>> >> initially, and while it is filling up its in-memory structures all is
>> >> nice and dandy. Then ou start to tax the server as it has to flush
>> >> data out and it becomes slower responding to the mappers still
>> >> hammering it. Only after a while the regions become large enough so
>> >> that they get split and load starts to spread across 2 machines, then
>> >> 3. Eventually you have enough regions to handle your data and you will
>> >> see an average of the performance you could expect from a loaded
>> >> cluster. For that reason we have added a bulk loading feature that
>> >> helps building the region files externally and then swap them in.
>> >>
>> >> When you check the UI you can actually see this behavior as the
>> >> operations-per-second (ops) are bound to one server initially. Well,
>> >> could be two as one of them has to also serve META. If that is the
>> >> same machine then you are penalized twice.
>> >>
>> >> In addition you start to run into minor compaction load while HBase
>> >> tries to do housekeeping during your load.
>> >>
>> >> With 0.89 you could pre-split the regions into what you see eventually
>> >> when your job is complete. Please use the UI to check and let us know
>> >> how many regions you end up with in total (out of interest mainly). If
>> >> you had that done before the import then the load is split right from
>> >> the start.
>> >>
>> >> In general 0.89 is much better performance wise when it comes to bulk
>> >> loads so you may want to try it out as well. The 0.90RC is up so a
>> >> release is imminent and saves you from having to upgrade soon. Also,
>> >> 0.90 is the first with Hadoop's append fix, so that you do not lose
>> >> any data from wonky server behavior.
>> >>
>> >> And to wrap this up, 3 data nodes is not too great. If you ask anyone
>> >> with a serious production type setup you will see 10 machines and
>> >> more, I'd say 20-30 machines and up. Some would say "Use MySQL for
>> >> this little data" but that is not fair given that we do not know what
>> >> your targets are. Bottom line is, you will see issues (like slowness)
>> >> with 3 nodes that 8 or 10 nodes will never show.
>> >>
>> >> HTH,
>> >> Lars
>> >>
>> >>
>> >> On Fri, Nov 19, 2010 at 2:09 PM, Henning Blohm <he...@zfabrik.de> wrote:
>> >> > We have a Hadoop 0.20.2 + Hbase 0.20.6 setup with three data nodes
>> >> > (12GB, 1.5TB each) and one master node (24GB, 1.5TB). We store a
>> >> > relatively simple
>> >> > table in HBase (1 column familiy, 5 columns, rowkey about 100chars).
>> >> >
>> >> > In order to better understand the load behavior, I wanted to put 5*10^8
>> >> > rows into that table. I wrote an M/R job that uses a Split Input Format
>> >> > to split the
>> >> > 5*10^8 logical row keys (essentially just counting from 0 to 5*10^8-1)
>> >> > into 1000 chunks of 500000 keys and then let the map do the actual job
>> >> > of writing the corresponding rows (with some random column values) into
>> >> > hbase.
>> >> >
>> >> > So there are 1000 map tasks, no reducer. Each task writes 500000 rows
>> >> > into Hbase. We have 6 mapper slots, i.e. 24 mappers running parallel.
>> >> >
>> >> > The whole job runs for approx. 48 hours. Initially the map tasks need
>> >> > around 30 min. each. After a while things take longer and longer,
>> >> > eventually
>> >> > reaching > 2h. It tops around the 850s task after which things speed up
>> >> > again improving to about 48min. in the end, until completed.
>> >> >
>> >> > It's all dedicated machines and there is nothing else running. The map
>> >> > tasks have 200m heap and when checking with vmstat in between I cannot
>> >> > observe swapping.
>> >> >
>> >> > Also, on the master it seems that heap utilization is not at the limit
>> >> > and no swapping either. All Hadoop and Hbase processes have
>> >> > 1G heap.
>> >> >
>> >> > Any idea what would cause the strong variation (or degradation) of write
>> >> > performance?
>> >> > Is there a way of finding out where time gets lost?
>> >> >
>> >> > Thanks,
>> >> >  Henning
>> >> >
>> >> >
>> >
>

Re: map task performance degradation - any idea why?

Posted by Henning Blohm <he...@zfabrik.de>.
Hi Lars,

  we do not have anything like Ganglia set up, unfortunately.

  I use regular Puts with auto-flush turned off and a write buffer of
4 MB (that could be bigger, right?). We do write to the WAL.

  I flush every 1000 records.

  I will try again - maybe over the weekend - and see if I can find out
more.

Thanks,
  Henning

On Friday, 19.11.2010, 16:16 +0100, Lars George wrote:

> Hi Henning,
> 
> And you what you have seen is often difficult to explain. What I
> listed are the obvious contenders. But ideally you would do a post
> mortem on the master and slave logs for Hadoop and HBase, since that
> would give you a better insight of the events. For example, when did
> the system start to flush, when did it start compacting, when did the
> HDFS start to go slow? And so on. One thing that I would highly
> recommend (if you haven't done so already) is getting graphing going.
> Use the build in Ganglia support and you may be able to at least
> determine the overall load on the system and various metrics of Hadoop
> and HBase.
> 
> Did you use the normal Puts or did you set it to cache Puts and write
> them in bulk? See HTable.setWriteBufferSize() and
> HTable.setAutoFlush() for details (but please note that you then do
> need to call HTable.flushCommits() in your close() method of the
> mapper class). That will help a lot speeding up writing data.
> 
> Lars
> 
> On Fri, Nov 19, 2010 at 3:43 PM, Henning Blohm <he...@zfabrik.de> wrote:
> > Hi Lars,
> >
> >  thanks. Yes, this is just the first test setup. Eventually the data
> > load will be significantly higher.
> >
> > At the moment (looking at the master after the run) the number of
> > regions is well-distributed (684,685,685 regions). The overall
> > HDFS use is  ~700G. (replication factor is 3 btw).
> >
> > I will want to upgrade as soon as that makes sense. It seems there is
> > "release" after 0.20.6 - that's why we are still with that one.
> >
> > When I do that run again, I will check the master UI and see how things
> > develop there. As for the current run: I do not expect
> > to get stable numbers early in the run. What looked suspicous was that
> > things got gradually worse until well into 30 hours after
> > the start of the run and then even got better. An unexpected load
> > behavior for me (would have expected early changes but then
> > some stable behavior up to the end).
> >
> > Thanks,
> >  Henning
> >
> > Am Freitag, den 19.11.2010, 15:21 +0100 schrieb Lars George:
> >
> >> Hi Henning,
> >>
> >> Could you look at the Master UI while doing the import? The issue with
> >> a cold bulk import is that you are hitting one region server
> >> initially, and while it is filling up its in-memory structures all is
> >> nice and dandy. Then ou start to tax the server as it has to flush
> >> data out and it becomes slower responding to the mappers still
> >> hammering it. Only after a while the regions become large enough so
> >> that they get split and load starts to spread across 2 machines, then
> >> 3. Eventually you have enough regions to handle your data and you will
> >> see an average of the performance you could expect from a loaded
> >> cluster. For that reason we have added a bulk loading feature that
> >> helps building the region files externally and then swap them in.
> >>
> >> When you check the UI you can actually see this behavior as the
> >> operations-per-second (ops) are bound to one server initially. Well,
> >> could be two as one of them has to also serve META. If that is the
> >> same machine then you are penalized twice.
> >>
> >> In addition you start to run into minor compaction load while HBase
> >> tries to do housekeeping during your load.
> >>
> >> With 0.89 you could pre-split the regions into what you see eventually
> >> when your job is complete. Please use the UI to check and let us know
> >> how many regions you end up with in total (out of interest mainly). If
> >> you had that done before the import then the load is split right from
> >> the start.
> >>
> >> In general 0.89 is much better performance wise when it comes to bulk
> >> loads so you may want to try it out as well. The 0.90RC is up so a
> >> release is imminent and saves you from having to upgrade soon. Also,
> >> 0.90 is the first with Hadoop's append fix, so that you do not lose
> >> any data from wonky server behavior.
> >>
> >> And to wrap this up, 3 data nodes is not too great. If you ask anyone
> >> with a serious production type setup you will see 10 machines and
> >> more, I'd say 20-30 machines and up. Some would say "Use MySQL for
> >> this little data" but that is not fair given that we do not know what
> >> your targets are. Bottom line is, you will see issues (like slowness)
> >> with 3 nodes that 8 or 10 nodes will never show.
> >>
> >> HTH,
> >> Lars
> >>
> >>
> >> On Fri, Nov 19, 2010 at 2:09 PM, Henning Blohm <he...@zfabrik.de> wrote:
> >> > We have a Hadoop 0.20.2 + Hbase 0.20.6 setup with three data nodes
> >> > (12GB, 1.5TB each) and one master node (24GB, 1.5TB). We store a
> >> > relatively simple
> >> > table in HBase (1 column familiy, 5 columns, rowkey about 100chars).
> >> >
> >> > In order to better understand the load behavior, I wanted to put 5*10^8
> >> > rows into that table. I wrote an M/R job that uses a Split Input Format
> >> > to split the
> >> > 5*10^8 logical row keys (essentially just counting from 0 to 5*10^8-1)
> >> > into 1000 chunks of 500000 keys and then let the map do the actual job
> >> > of writing the corresponding rows (with some random column values) into
> >> > hbase.
> >> >
> >> > So there are 1000 map tasks, no reducer. Each task writes 500000 rows
> >> > into Hbase. We have 6 mapper slots, i.e. 24 mappers running parallel.
> >> >
> >> > The whole job runs for approx. 48 hours. Initially the map tasks need
> >> > around 30 min. each. After a while things take longer and longer,
> >> > eventually
> >> > reaching > 2h. It tops around the 850s task after which things speed up
> >> > again improving to about 48min. in the end, until completed.
> >> >
> >> > It's all dedicated machines and there is nothing else running. The map
> >> > tasks have 200m heap and when checking with vmstat in between I cannot
> >> > observe swapping.
> >> >
> >> > Also, on the master it seems that heap utilization is not at the limit
> >> > and no swapping either. All Hadoop and Hbase processes have
> >> > 1G heap.
> >> >
> >> > Any idea what would cause the strong variation (or degradation) of write
> >> > performance?
> >> > Is there a way of finding out where time gets lost?
> >> >
> >> > Thanks,
> >> >  Henning
> >> >
> >> >
> >

Re: map task performance degradation - any idea why?

Posted by Lars George <la...@gmail.com>.
Hi Henning,

And what you have seen is often difficult to explain. What I listed
are the obvious contenders. Ideally, though, you would do a post mortem
on the master and slave logs for Hadoop and HBase, since that would
give you better insight into the events. For example, when did the
system start to flush, when did it start compacting, when did HDFS
start to go slow? And so on. One thing I would highly recommend (if you
haven't done so already) is to get graphing going. Use the built-in
Ganglia support and you may at least be able to determine the overall
load on the system and various metrics of Hadoop and HBase.

Did you use normal Puts, or did you set the client to cache Puts and
write them in bulk? See HTable.setWriteBufferSize() and
HTable.setAutoFlush() for details (but please note that you then need
to call HTable.flushCommits() in the close() method of your mapper
class). That will help a lot in speeding up writing data.
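
(A minimal sketch of that cache-and-flush pattern, using the old mapred
API with its explicit close() - the table name, column family, and
buffer size below are placeholders:)

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class BufferedPutMapper extends MapReduceBase
    implements Mapper<LongWritable, LongWritable, NullWritable, NullWritable> {

  private HTable table;

  @Override
  public void configure(JobConf job) {
    try {
      table = new HTable(new HBaseConfiguration(), "testdata"); // placeholder name
      table.setAutoFlush(false);                  // cache Puts on the client
      table.setWriteBufferSize(12 * 1024 * 1024); // e.g. a 12 MB write buffer
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable key, LongWritable value,
      OutputCollector<NullWritable, NullWritable> output, Reporter reporter)
      throws IOException {
    Put put = new Put(Bytes.toBytes(key.get()));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("c1"), Bytes.toBytes(value.get()));
    table.put(put); // buffered client-side; sent once the write buffer fills up
  }

  @Override
  public void close() throws IOException {
    table.flushCommits(); // without this, the last buffered Puts never get written
  }
}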

Lars

On Fri, Nov 19, 2010 at 3:43 PM, Henning Blohm <he...@zfabrik.de> wrote:
> Hi Lars,
>
>  thanks. Yes, this is just the first test setup. Eventually the data
> load will be significantly higher.
>
> At the moment (looking at the master after the run) the number of
> regions is well-distributed (684,685,685 regions). The overall
> HDFS use is  ~700G. (replication factor is 3 btw).
>
> I will want to upgrade as soon as that makes sense. It seems there is
> "release" after 0.20.6 - that's why we are still with that one.
>
> When I do that run again, I will check the master UI and see how things
> develop there. As for the current run: I do not expect
> to get stable numbers early in the run. What looked suspicous was that
> things got gradually worse until well into 30 hours after
> the start of the run and then even got better. An unexpected load
> behavior for me (would have expected early changes but then
> some stable behavior up to the end).
>
> Thanks,
>  Henning
>
> Am Freitag, den 19.11.2010, 15:21 +0100 schrieb Lars George:
>
>> Hi Henning,
>>
>> Could you look at the Master UI while doing the import? The issue with
>> a cold bulk import is that you are hitting one region server
>> initially, and while it is filling up its in-memory structures all is
>> nice and dandy. Then ou start to tax the server as it has to flush
>> data out and it becomes slower responding to the mappers still
>> hammering it. Only after a while the regions become large enough so
>> that they get split and load starts to spread across 2 machines, then
>> 3. Eventually you have enough regions to handle your data and you will
>> see an average of the performance you could expect from a loaded
>> cluster. For that reason we have added a bulk loading feature that
>> helps building the region files externally and then swap them in.
>>
>> When you check the UI you can actually see this behavior as the
>> operations-per-second (ops) are bound to one server initially. Well,
>> could be two as one of them has to also serve META. If that is the
>> same machine then you are penalized twice.
>>
>> In addition you start to run into minor compaction load while HBase
>> tries to do housekeeping during your load.
>>
>> With 0.89 you could pre-split the regions into what you see eventually
>> when your job is complete. Please use the UI to check and let us know
>> how many regions you end up with in total (out of interest mainly). If
>> you had that done before the import then the load is split right from
>> the start.
>>
>> In general 0.89 is much better performance wise when it comes to bulk
>> loads so you may want to try it out as well. The 0.90RC is up so a
>> release is imminent and saves you from having to upgrade soon. Also,
>> 0.90 is the first with Hadoop's append fix, so that you do not lose
>> any data from wonky server behavior.
>>
>> And to wrap this up, 3 data nodes is not too great. If you ask anyone
>> with a serious production type setup you will see 10 machines and
>> more, I'd say 20-30 machines and up. Some would say "Use MySQL for
>> this little data" but that is not fair given that we do not know what
>> your targets are. Bottom line is, you will see issues (like slowness)
>> with 3 nodes that 8 or 10 nodes will never show.
>>
>> HTH,
>> Lars
>>
>>
>> On Fri, Nov 19, 2010 at 2:09 PM, Henning Blohm <he...@zfabrik.de> wrote:
>> > We have a Hadoop 0.20.2 + Hbase 0.20.6 setup with three data nodes
>> > (12GB, 1.5TB each) and one master node (24GB, 1.5TB). We store a
>> > relatively simple
>> > table in HBase (1 column familiy, 5 columns, rowkey about 100chars).
>> >
>> > In order to better understand the load behavior, I wanted to put 5*10^8
>> > rows into that table. I wrote an M/R job that uses a Split Input Format
>> > to split the
>> > 5*10^8 logical row keys (essentially just counting from 0 to 5*10^8-1)
>> > into 1000 chunks of 500000 keys and then let the map do the actual job
>> > of writing the corresponding rows (with some random column values) into
>> > hbase.
>> >
>> > So there are 1000 map tasks, no reducer. Each task writes 500000 rows
>> > into Hbase. We have 6 mapper slots, i.e. 24 mappers running parallel.
>> >
>> > The whole job runs for approx. 48 hours. Initially the map tasks need
>> > around 30 min. each. After a while things take longer and longer,
>> > eventually
>> > reaching > 2h. It tops around the 850s task after which things speed up
>> > again improving to about 48min. in the end, until completed.
>> >
>> > It's all dedicated machines and there is nothing else running. The map
>> > tasks have 200m heap and when checking with vmstat in between I cannot
>> > observe swapping.
>> >
>> > Also, on the master it seems that heap utilization is not at the limit
>> > and no swapping either. All Hadoop and Hbase processes have
>> > 1G heap.
>> >
>> > Any idea what would cause the strong variation (or degradation) of write
>> > performance?
>> > Is there a way of finding out where time gets lost?
>> >
>> > Thanks,
>> >  Henning
>> >
>> >
>

Re: map task performance degradation - any idea why?

Posted by Henning Blohm <he...@zfabrik.de>.
Hi Lars,

  thanks. Yes, this is just the first test setup. Eventually the data
load will be significantly higher.

At the moment (looking at the master after the run) the regions are
well distributed across the region servers (684, 685, and 685 regions).
The overall HDFS usage is ~700 GB (the replication factor is 3, btw).

I will want to upgrade as soon as that makes sense. It seems there is
no "release" after 0.20.6 - that's why we are still on that one.

When I do that run again, I will check the master UI and see how things
develop there. As for the current run: I did not expect to get stable
numbers early in the run. What looked suspicious was that things got
gradually worse until well over 30 hours after the start of the run and
then even got better - an unexpected load behavior for me (I would have
expected early fluctuations but then stable behavior up to the end).

Thanks,
  Henning

On Friday, 19.11.2010, 15:21 +0100, Lars George wrote:

> Hi Henning,
> 
> Could you look at the Master UI while doing the import? The issue with
> a cold bulk import is that you are hitting one region server
> initially, and while it is filling up its in-memory structures all is
> nice and dandy. Then ou start to tax the server as it has to flush
> data out and it becomes slower responding to the mappers still
> hammering it. Only after a while the regions become large enough so
> that they get split and load starts to spread across 2 machines, then
> 3. Eventually you have enough regions to handle your data and you will
> see an average of the performance you could expect from a loaded
> cluster. For that reason we have added a bulk loading feature that
> helps building the region files externally and then swap them in.
> 
> When you check the UI you can actually see this behavior as the
> operations-per-second (ops) are bound to one server initially. Well,
> could be two as one of them has to also serve META. If that is the
> same machine then you are penalized twice.
> 
> In addition you start to run into minor compaction load while HBase
> tries to do housekeeping during your load.
> 
> With 0.89 you could pre-split the regions into what you see eventually
> when your job is complete. Please use the UI to check and let us know
> how many regions you end up with in total (out of interest mainly). If
> you had that done before the import then the load is split right from
> the start.
> 
> In general 0.89 is much better performance wise when it comes to bulk
> loads so you may want to try it out as well. The 0.90RC is up so a
> release is imminent and saves you from having to upgrade soon. Also,
> 0.90 is the first with Hadoop's append fix, so that you do not lose
> any data from wonky server behavior.
> 
> And to wrap this up, 3 data nodes is not too great. If you ask anyone
> with a serious production type setup you will see 10 machines and
> more, I'd say 20-30 machines and up. Some would say "Use MySQL for
> this little data" but that is not fair given that we do not know what
> your targets are. Bottom line is, you will see issues (like slowness)
> with 3 nodes that 8 or 10 nodes will never show.
> 
> HTH,
> Lars
> 
> 
> On Fri, Nov 19, 2010 at 2:09 PM, Henning Blohm <he...@zfabrik.de> wrote:
> > We have a Hadoop 0.20.2 + Hbase 0.20.6 setup with three data nodes
> > (12GB, 1.5TB each) and one master node (24GB, 1.5TB). We store a
> > relatively simple
> > table in HBase (1 column familiy, 5 columns, rowkey about 100chars).
> >
> > In order to better understand the load behavior, I wanted to put 5*10^8
> > rows into that table. I wrote an M/R job that uses a Split Input Format
> > to split the
> > 5*10^8 logical row keys (essentially just counting from 0 to 5*10^8-1)
> > into 1000 chunks of 500000 keys and then let the map do the actual job
> > of writing the corresponding rows (with some random column values) into
> > hbase.
> >
> > So there are 1000 map tasks, no reducer. Each task writes 500000 rows
> > into Hbase. We have 6 mapper slots, i.e. 24 mappers running parallel.
> >
> > The whole job runs for approx. 48 hours. Initially the map tasks need
> > around 30 min. each. After a while things take longer and longer,
> > eventually
> > reaching > 2h. It tops around the 850s task after which things speed up
> > again improving to about 48min. in the end, until completed.
> >
> > It's all dedicated machines and there is nothing else running. The map
> > tasks have 200m heap and when checking with vmstat in between I cannot
> > observe swapping.
> >
> > Also, on the master it seems that heap utilization is not at the limit
> > and no swapping either. All Hadoop and Hbase processes have
> > 1G heap.
> >
> > Any idea what would cause the strong variation (or degradation) of write
> > performance?
> > Is there a way of finding out where time gets lost?
> >
> > Thanks,
> >  Henning
> >
> >

Re: map task performance degradation - any idea why?

Posted by Lars George <la...@gmail.com>.
Hi Henning,

Could you look at the Master UI while doing the import? The issue with
a cold bulk import is that you are hitting one region server initially,
and while it is filling up its in-memory structures all is nice and
dandy. Then you start to tax the server as it has to flush data out,
and it becomes slower at responding to the mappers that are still
hammering it. Only after a while do the regions become large enough
that they get split and the load starts to spread across 2 machines,
then 3. Eventually you have enough regions to handle your data and you
will see the average performance you could expect from a loaded
cluster. For that reason we have added a bulk loading feature that
helps build the region files externally and then swap them in.
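
(As a rough sketch of how that bulk load path is wired up in the
0.89/0.90 line - the class and method names below are from memory and
details may differ between versions: the job writes HFiles via
HFileOutputFormat, and the resulting files are then handed to the
bundled bulk-load tool.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadJobSketch {
  public static Job createJob(Configuration conf, String tableName, Path outputDir)
      throws Exception {
    Job job = new Job(conf, "hbase-bulk-load-sketch");
    job.setJarByClass(BulkLoadJobSketch.class);
    // The mapper (not shown) emits ImmutableBytesWritable row keys and Puts.
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    FileOutputFormat.setOutputPath(job, outputDir);
    // Sets up HFileOutputFormat plus a total-order partitioner matching the
    // table's current region boundaries, so each reducer writes one region's HFiles.
    HFileOutputFormat.configureIncrementalLoad(job, new HTable(conf, tableName));
    // After the job finishes, the generated HFiles are moved into the table
    // with the bulk-load tool that ships with HBase (e.g. LoadIncrementalHFiles).
    return job;
  }
}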

When you check the UI you can actually see this behavior, as the
operations per second (ops) are bound to one server initially. Well, it
could be two, as one of them also has to serve META. If that is the
same machine then you are penalized twice.

In addition you start to run into minor compaction load while HBase
tries to do housekeeping during your load.

With 0.89 you could pre-split the table into the regions you eventually
see when your job is complete. Please use the UI to check, and let us
know how many regions you end up with in total (mainly out of
interest). If you do that before the import, then the load is spread
right from the start.
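
(A sketch of what pre-splitting at table creation time looks like with
the 0.89/0.90 client API - the table name, family, and split points
below are made up; in practice you would derive the split points from
the key distribution you expect:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTableSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("testdata"); // hypothetical table
    desc.addFamily(new HColumnDescriptor("cf"));

    // Hand the expected region boundaries to createTable() so the table starts
    // out with many regions spread over all region servers, instead of one.
    int numRegions = 100; // 5*10^8 rows / 100 regions = 5,000,000 rows per region
    byte[][] splits = new byte[numRegions - 1][];
    for (int i = 1; i < numRegions; i++) {
      splits[i - 1] = Bytes.toBytes(String.format("row-%020d", i * 5000000L));
    }
    admin.createTable(desc, splits);
  }
}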

In general, 0.89 is much better performance-wise when it comes to bulk
loads, so you may want to try it out as well. The 0.90 RC is up, so a
release is imminent, which saves you from having to upgrade again soon.
Also, 0.90 is the first release with Hadoop's append fix, so you do not
lose any data from wonky server behavior.

And to wrap this up, 3 data nodes is not great. If you ask anyone with
a serious production-type setup you will see 10 machines and more, I'd
say 20-30 machines and up. Some would say "use MySQL for this little
data", but that is not fair given that we do not know what your targets
are. The bottom line is that you will see issues (like slowness) with
3 nodes that 8 or 10 nodes will never show.

HTH,
Lars


On Fri, Nov 19, 2010 at 2:09 PM, Henning Blohm <he...@zfabrik.de> wrote:
> We have a Hadoop 0.20.2 + Hbase 0.20.6 setup with three data nodes
> (12GB, 1.5TB each) and one master node (24GB, 1.5TB). We store a
> relatively simple
> table in HBase (1 column familiy, 5 columns, rowkey about 100chars).
>
> In order to better understand the load behavior, I wanted to put 5*10^8
> rows into that table. I wrote an M/R job that uses a Split Input Format
> to split the
> 5*10^8 logical row keys (essentially just counting from 0 to 5*10^8-1)
> into 1000 chunks of 500000 keys and then let the map do the actual job
> of writing the corresponding rows (with some random column values) into
> hbase.
>
> So there are 1000 map tasks, no reducer. Each task writes 500000 rows
> into Hbase. We have 6 mapper slots, i.e. 24 mappers running parallel.
>
> The whole job runs for approx. 48 hours. Initially the map tasks need
> around 30 min. each. After a while things take longer and longer,
> eventually
> reaching > 2h. It tops around the 850s task after which things speed up
> again improving to about 48min. in the end, until completed.
>
> It's all dedicated machines and there is nothing else running. The map
> tasks have 200m heap and when checking with vmstat in between I cannot
> observe swapping.
>
> Also, on the master it seems that heap utilization is not at the limit
> and no swapping either. All Hadoop and Hbase processes have
> 1G heap.
>
> Any idea what would cause the strong variation (or degradation) of write
> performance?
> Is there a way of finding out where time gets lost?
>
> Thanks,
>  Henning
>
>