Posted to user@hbase.apache.org by Shuja Rehman <sh...@gmail.com> on 2011/07/01 22:07:23 UTC

Insertion Performance [WAL Disable Vs WAL Enable]

Hi,

I have a job that reads from one HBase table, performs aggregations, and
writes the results back to another HBase table. For this I am using
TableMapReduceUtil.initTableMapperJob and
TableMapReduceUtil.initTableReducerJob. In the reducer, if I use
put.setWriteToWAL(false), the job completes within seconds, but without it
the job takes approximately 30 minutes. Why is there such a huge difference
in performance? I would like to complete the same job within seconds while
using put.setWriteToWAL(true) to prevent data loss. What other
optimizations can I make?

Thanks

-- 
Regards
Shuja-ur-Rehman Baig
<http://pk.linkedin.com/in/shujamughal>

Re: Insertion Performance [WAL Disable Vs WAL Enable]

Posted by Stack <st...@duboce.net>.

On Fri, Jul 1, 2011 at 1:07 PM, Shuja Rehman <sh...@gmail.com> wrote:
> I have a job that reads from one HBase table, performs aggregations, and
> writes the results back to another HBase table. For this I am using
> TableMapReduceUtil.initTableMapperJob and
> TableMapReduceUtil.initTableReducerJob. In the reducer, if I use
> put.setWriteToWAL(false), the job completes within seconds, but without it
> the job takes approximately 30 minutes. Why is there such a huge difference
> in performance? I would like to complete the same job within seconds while
> using put.setWriteToWAL(true) to prevent data loss. What other
> optimizations can I make?
>

Don't disable WAL.  You are just going to shoot yourself in the foot
if you leave it off.

The difference in perf comes from the WAL: with it enabled, every edit
is written to the filesystem first, before anything else is done.

Try playing with deferred sync'ing of writes.  You need to set your
table to do deferred flushes by setting the DEFERRED_LOG_FLUSH table
attribute on your table.  Once set, rather than sync every write,
we'll sync on a period.  The default is to sync every second.  Here is
the setting in hbase-default.xml:

  <property>
    <name>hbase.regionserver.optionallogflushinterval</name>
    <value>1000</value>
    <description>Sync the HLog to the HDFS after this interval if it has not
    accumulated enough entries to trigger a sync. Default 1 second. Units:
    milliseconds.
    </description>
  </property>

Now if you crash, instead of losing massive chunks of your job, you
will lose up to the last second's worth of writes, but in compensation
you should see faster writing.
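
To make the trade-off concrete, here is a rough, self-contained sketch in
plain Java.  It is a toy model, not HBase code: the class WalSyncSketch,
its methods, and the workload (10,000 edits arriving two per millisecond)
are all made up for illustration.  It only counts how many syncs each
policy would issue and how many edits would still be unsynced if the
server died right after the last edit.

```java
/** Toy model of WAL sync policies. Hypothetical; this is NOT HBase code. */
public class WalSyncSketch {

    /** Outcome of replaying a stream of edits under one sync policy. */
    static final class Result {
        final int syncs;          // simulated filesystem syncs issued
        final int unsyncedAtEnd;  // edits at risk if we crash after the last edit

        Result(int syncs, int unsyncedAtEnd) {
            this.syncs = syncs;
            this.unsyncedAtEnd = unsyncedAtEnd;
        }
    }

    /** Policy 1: sync after every single edit, so nothing is ever at risk. */
    static Result syncEveryEdit(long[] editTimesMs) {
        return new Result(editTimesMs.length, 0);
    }

    /**
     * Policy 2: a timer fires every intervalMs and syncs whatever edits are
     * pending. Edits written since the last tick are lost on a crash.
     * editTimesMs must be sorted in ascending order.
     */
    static Result deferredSync(long[] editTimesMs, long intervalMs) {
        int syncs = 0;
        int pending = 0;
        long nextSyncAt = intervalMs;
        for (long t : editTimesMs) {
            // Fire every timer tick that elapsed before this edit arrived.
            while (t >= nextSyncAt) {
                if (pending > 0) {
                    syncs++;
                    pending = 0;
                }
                nextSyncAt += intervalMs;
            }
            pending++;
        }
        return new Result(syncs, pending);
    }

    public static void main(String[] args) {
        // Hypothetical workload: 10,000 edits, two per millisecond (about 5 s).
        long[] times = new long[10000];
        for (int i = 0; i < times.length; i++) {
            times[i] = i / 2;
        }
        Result perEdit = syncEveryEdit(times);
        Result deferred = deferredSync(times, 1000);
        System.out.println("per-edit: syncs=" + perEdit.syncs
                + " atRisk=" + perEdit.unsyncedAtEnd);
        System.out.println("deferred: syncs=" + deferred.syncs
                + " atRisk=" + deferred.unsyncedAtEnd);
    }
}
```

Under this made-up workload the per-edit policy issues 10,000 syncs with
nothing at risk, while the deferred policy with a 1000 ms interval issues
4 syncs and leaves the last second's 2,000 edits unsynced.  Real numbers
on a cluster will differ; the point is only the shape of the trade.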

Also, what is slow?  The writes or the reads?  How many reducers?  If
you up the number of reducers, does that help?

St.Ack
