Posted to user@hbase.apache.org by Raghava Mutharaju <m....@gmail.com> on 2010/06/06 00:44:45 UTC

performance consideration when writing to HBase from MR job

Hi all,

    If HBase is used as the data sink in an MR job, would there be a
performance improvement if a) is done instead of b)?

a) all the Puts are collected in the Reduce (or in the Map, if there is no
reduce) and a single batch write is done
b) each <K,V> pair is written out using context.write(k, v)

If a) is done instead of b), wouldn't that violate the semantics of KEYOUT
and VALUEOUT (because no <K,V> pair is being output)? Is this OK?
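
For concreteness, a minimal sketch of the two options (the mapper, table,
and column names here are hypothetical, and option b) assumes
TableOutputFormat has been configured as the job's sink):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SinkMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

      private final List<Put> batch = new ArrayList<Put>();
      private HTable table;  // needed for option a) only

      @Override
      protected void setup(Context context) throws IOException {
        table = new HTable(context.getConfiguration(), "mytable");
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        Put put = new Put(Bytes.toBytes(value.toString()));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"),
            Bytes.toBytes("val"));

        // a) collect the Puts and batch-write them once, in cleanup()
        batch.add(put);

        // b) or hand each Put to the framework instead:
        // context.write(new ImmutableBytesWritable(put.getRow()), put);
      }

      @Override
      protected void cleanup(Context context) throws IOException {
        table.put(batch);  // a) one batched write through the HBase API
        table.close();
      }
    }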

Thank you.

Regards,
Raghava

Re: performance consideration when writing to HBase from MR job

Posted by Raghava Mutharaju <m....@gmail.com>.
aah, Ok, thank you :)

On Sun, Jun 6, 2010 at 12:40 PM, Jonathan Gray <jg...@facebook.com> wrote:

> TableOutputFormat does batching of writes under the hood, so it's
> basically doing the same thing.

RE: performance consideration when writing to HBase from MR job

Posted by Jonathan Gray <jg...@facebook.com>.
TableOutputFormat does batching of writes under the hood, so it's basically doing the same thing.
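
For reference, a sketch of that wiring. This is minimal and hypothetical:
it reuses the SinkMapper from the first message (with its option b)
context.write line enabled) and a table named "mytable":

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.mapreduce.Job;

    public class SinkDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "write-to-hbase");
        job.setJarByClass(SinkDriver.class);
        // Mapper emits (ImmutableBytesWritable, Put) via context.write
        job.setMapperClass(SinkMapper.class);

        // Wires TableOutputFormat in as the sink. Its RecordWriter turns
        // off auto-flush on the underlying HTable and buffers Puts
        // client-side, flushing on close: the batching "under the hood".
        TableMapReduceUtil.initTableReducerJob("mytable", null, job);
        job.setNumReduceTasks(0);  // map-only: mappers write straight to the table

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }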

Re: performance consideration when writing to HBase from MR job

Posted by Raghava Mutharaju <m....@gmail.com>.
Hi Amandeep,

Thank you for the reply. I was using the HBase API directly in the mapper
(there is no reducer). I thought that instead of writing out each row (using
the context), it would be quicker to do a batch write with table.put(List).
Going by what you said, I guess the difference won't be much.

Regards,
Raghava.

Re: performance consideration when writing to HBase from MR job

Posted by Amandeep Khurana <am...@gmail.com>.
1. If you can write from the mapper, you avoid the shuffle-and-sort overhead
between the map and reduce phases.
2. It makes little difference whether you use the HBase API directly in the
mapper/reducer to write to the table or write to the context and let one of
the table output formats do the writing. However, if you use the bulkload
utility (the HBASE-48 jira), you will get much better performance than with
the HBase API directly.
Regarding the semantics: no, there would not be a problem as long as you
construct your Puts properly.
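
For reference, the bulkload path in point 2 looks roughly like this. A
minimal sketch, assuming the incremental bulk-load API that grew out of that
work (HFileOutputFormat plus LoadIncrementalHFiles); the table name, output
path, and mapper are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BulkLoadDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "bulkload-prepare");
        job.setJarByClass(BulkLoadDriver.class);
        job.setMapperClass(SinkMapper.class);  // must emit (ImmutableBytesWritable, Put)
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);

        HTable table = new HTable(conf, "mytable");
        // Sets up the reducer and total-order partitioner so the HFiles
        // written out line up with the table's region boundaries.
        HFileOutputFormat.configureIncrementalLoad(job, table);
        Path out = new Path("/tmp/hfiles");
        FileOutputFormat.setOutputPath(job, out);

        if (job.waitForCompletion(true)) {
          // Moves the finished HFiles into the region servers; the same
          // thing the 'completebulkload' tool does.
          new LoadIncrementalHFiles(conf).doBulkLoad(out, table);
        }
      }
    }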

-Amandeep