Posted to user@hbase.apache.org by Dmitry <dm...@tellapart.com> on 2010/03/17 05:10:59 UTC

Analysing slow HBase mapreduce performance

Hi all,

I'm trying to analyse some issues with HBase performance in a MapReduce job.

The job reads a table and just writes it out to HDFS.
The table is small, roughly 400M of data and 18M rows.
I've pre-split the table into 32 regions so that a single region server
doesn't end up serving the entire table.

I'm running an HBase cluster with:
- 16 region servers (each on the same machine as a Hadoop tasktracker and
datanode).
- 1 master (on the same machine as the Hadoop jobtracker and namenode).
- ZooKeeper quorum of just 1 machine (on the same machine as the master).

I have LZO compression enabled for both HBase and Hadoop.

Running this job takes about 5-6 minutes.

Running a MapReduce job that reads the exact same set of data from a
SequenceFile on HDFS takes only about 1 minute.

What else can I do to try to diagnose this?

Thanks,

- Dmitry

Re: Analysing slow HBase mapreduce performance

Posted by Dmitry Chechik <dm...@tellapart.com>.
I set it to 10,000 - the job ran in 44 seconds (compared to 29 seconds
reading from HDFS), so a speedup of 7x or so.

Thanks again,

- Dmitry
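
As a quick check on the figures in this reply, here is a small sketch of the arithmetic. It only uses numbers quoted in the thread (5-6 minutes before the change, 44 seconds after, 29 seconds for the HDFS read); the function name `speedup` is made up for illustration.

```python
def speedup(before_seconds: float, after_seconds: float) -> float:
    """Return how many times faster the 'after' run is than the 'before' run."""
    if after_seconds <= 0:
        raise ValueError("after_seconds must be positive")
    return before_seconds / after_seconds


if __name__ == "__main__":
    # Timings reported in the thread.
    hbase_before_low = 5 * 60   # "about 5-6 minutes", lower bound in seconds
    hbase_before_high = 6 * 60  # upper bound in seconds
    hbase_after = 44            # HBase job with scanner caching at 10,000
    hdfs_read = 29              # same data read from HDFS

    low = speedup(hbase_before_low, hbase_after)
    high = speedup(hbase_before_high, hbase_after)
    gap = speedup(hbase_after, hdfs_read)

    print(f"HBase job speedup: {low:.1f}x to {high:.1f}x")
    print(f"Remaining gap vs. HDFS read: {gap:.1f}x slower")
```

This works out to roughly 6.8x to 8.2x, which matches the "7x or so" estimate, and leaves the HBase job about 1.5x slower than the plain HDFS read.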

On Tue, Mar 16, 2010 at 9:28 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:

> Out of interest... to what did you set it and what was the speed-up like?
>
> J-D
>
> On Tue, Mar 16, 2010 at 9:26 PM, Dmitry Chechik <dm...@tellapart.com> wrote:
> > That did it. Thanks!
> >
> > On Tue, Mar 16, 2010 at 9:14 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> >
> >> Did you set scanner caching higher?
> >>
> >> J-D
> >>
> >> On Tue, Mar 16, 2010 at 9:10 PM, Dmitry <dm...@tellapart.com> wrote:
> >> > Hi all,
> >> >
> >> > I'm trying to analyse some issues with HBase performance in a mapreduce.
> >> >
> >> > I'm running a mapreduce which reads a table and just writes it out to HDFS.
> >> > The table is small, roughly ~400M of data and 18M rows.
> >> > I've pre-split the table into 32 regions, so that I'm not running into the
> >> > problem of having one region server serve the entire table.
> >> >
> >> > I'm running an HBase cluster with:
> >> > - 16 region servers (each on the same machine as a Hadoop tasktracker and
> >> > datanode).
> >> > - 1 master (on the same machine as the Hadoop jobtracker and namenode.)
> >> > - Zookeeper quorum of just 1 machine (on the same machine as the master).
> >> >
> >> > I have LZO compression enabled for both HBase and Hadoop.
> >> >
> >> > Running this job takes about 5-6 minutes.
> >> >
> >> > Running a mapreduce reading the exact same set of data from a SequenceFile
> >> > on HDFS takes only about 1 minute.
> >> >
> >> > What else can I do to try to diagnose this?
> >> >
> >> > Thanks,
> >> >
> >> > - Dmitry

Re: Analysing slow HBase mapreduce performance

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Out of interest... to what did you set it and what was the speed-up like?

J-D

Re: Analysing slow HBase mapreduce performance

Posted by Dmitry Chechik <dm...@tellapart.com>.
That did it. Thanks!

Re: Analysing slow HBase mapreduce performance

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Did you set scanner caching higher?

J-D
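
For readers unfamiliar with the setting: scanner caching controls how many rows the client fetches from a region server per round trip. In the HBase client it is normally set with `Scan.setCaching(int)` or the `hbase.client.scanner.caching` property. The sketch below is only a back-of-the-envelope model, not HBase code. It assumes each round trip returns at most `caching` rows and ignores per-region boundaries and row size; the function name is hypothetical, and the caching values of 1 and 100 are illustrative, since the thread does not say what the job used before the change.

```python
def scan_round_trips(total_rows: int, caching: int) -> int:
    """Approximate number of client/server round trips needed to scan
    total_rows rows when each round trip returns at most `caching` rows."""
    if total_rows < 0:
        raise ValueError("total_rows must not be negative")
    if caching < 1:
        raise ValueError("caching must be at least 1")
    # Ceiling division without floating point.
    return (total_rows + caching - 1) // caching


if __name__ == "__main__":
    rows = 18_000_000  # "18M rows" from the original post
    for caching in (1, 100, 10_000):
        trips = scan_round_trips(rows, caching)
        print(f"caching={caching:>6}: {trips:>10,} round trips")
```

Under these assumptions, a caching value of 1 means one round trip per row (18,000,000 in total), while 10,000 brings that down to 1,800, so per-call network latency stops dominating the scan. The trade-off is that larger values hold more rows in client and server memory per call.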