Posted to user@hbase.apache.org by Jean-Daniel Cryans <jd...@apache.org> on 2009/08/11 19:40:54 UTC

Tip when migrating your data loading MR jobs from 0.19 to 0.20

Hi users,

This weekend at the second HBase Hackathon (held by StumbleUpon, thx!)
we helped someone migrate a data loading MapReduce job from 0.19 to
0.20 because of a performance problem: the job had become something
like 20x slower.

How we solved it, short answer:
After instantiating the Put that you give to the TableOutputFormat, call
put.setWriteToWAL(false).
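
To make that concrete, here is a minimal sketch of a 0.20-style map task
that loads rows through TableOutputFormat with the WAL disabled. The class,
table layout, family "f" and qualifier "q" are made up for illustration;
only the put.setWriteToWAL(false) call is the actual tip from this post.

```java
import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map task that emits Puts for TableOutputFormat to write into HBase.
public class LoadMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

  private static final byte[] FAMILY = Bytes.toBytes("f"); // made-up family
  private static final byte[] QUAL = Bytes.toBytes("q");   // made-up qualifier

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    byte[] row = Bytes.toBytes(key.get());
    Put put = new Put(row);
    put.add(FAMILY, QUAL, Bytes.toBytes(value.toString()));
    // The tip from this post: skip the WAL during bulk loads.
    // Trades durability for much less disk IO on the region servers.
    put.setWriteToWAL(false);
    context.write(new ImmutableBytesWritable(row), put);
  }
}
```

Remember the trade-off discussed below: with the WAL off, anything still in
a region server's memstore is gone if that server crashes mid-job.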

Long answer:
As you may know, HDFS still does not support appends. That means that
the write-ahead logs (WAL) that HBase uses are only useful once they
are synced to disk, and that you can lose some data during a region
server crash or a kill -9. In 0.19 a log could stay open forever as
long as it had fewer than 100,000 edits. In 0.20 we fixed that by
capping each WAL file at ~62MB and also rotating the logs every hour.
This is all good because it means far less data loss until we are able
to append to files in HDFS.

Now to why this may slow down your import: the job I was talking about
had huge rows, so in 0.20 the logs got rotated much more often, whereas
in 0.19 only the number of rows triggered a log rotation. Not writing
to the WAL has the advantage of using far less disk IO but, as you can
guess, it means huge data loss in the case of a region server crash.
But, in many cases, a RS crash still means that you must restart your
job anyway, because log splitting can take more than 10 minutes, so
many tasks time out (I am currently working on making that much faster
for 0.21, btw).

Hope this helps someone,

J-D

Re: Tip when migrating your data loading MR jobs from 0.19 to 0.20

Posted by Schubert Zhang <zs...@gmail.com>.
@JG:

Regarding the write buffer, I mean the DFSClient-side buffer. In the
current version of HDFS there is a client-side buffer
(bytesPerChecksum), and written data is only flushed to the datanode
when that buffer is full. The HBase RS is a client of HDFS.
@JD:

you wrote:
"But, in many cases, a RS crash still means that you must restart your job
because log splitting can take more than 10 minutes so many tasks times out
(I am currently working on that for 0.21 to make it really faster btw)."

I think this (tasks timing out) is just a temporary or uncertain
phenomenon. For example, take a MapReduce (map-only) job where each map
task puts rows into the same HBase table. When one RS crashes, maybe
only one map task fails, and the failed task will be relaunched. So I
considered calling HBaseAdmin.flush() when each map task completes. But
too many HBaseAdmin.flush() calls will cause too many small HStoreFiles
and then too many compactions. And if we call HBaseAdmin.flush() only
when the job completes, we must ensure no RS crashes before that.
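
For reference, a minimal sketch of that job-completion flush against the
0.20 client API might look like this. The table name "mytable" is made up,
and note that in 0.20 the flush request is asynchronous, so the memstores
are not guaranteed to be on HDFS the instant the call returns.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

// After the load job completes, ask HBase to flush the table's memstores
// to HDFS, so rows written with setWriteToWAL(false) are no longer only
// in region server memory.
public class FlushAfterLoad {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    HBaseAdmin admin = new HBaseAdmin(conf);
    admin.flush("mytable"); // made-up table name; flush is asynchronous
  }
}
```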


On Thu, Aug 20, 2009 at 1:24 AM, Jonathan Gray <jl...@streamy.com> wrote:


Re: Tip when migrating your data loading MR jobs from 0.19 to 0.20

Posted by Jonathan Gray <jl...@streamy.com>.
Are you referring to the actual disk or raid card write buffer?  If so, 
then yes a single node could lose data if you don't have a raid card 
with a BBU, but remember that this ain't no RDBMS, leave your raid cards 
and batteries at home!

HDFS append does not just append to a single node, it ensures that the 
append is replicated just like every other operation.  So (on default of 
3) you would have to lose 3 nodes at the same instant, which is the 
"impossible" paradigm we are working with by using a rep factor of 3.

JG


Re: Tip when migrating your data loading MR jobs from 0.19 to 0.20

Posted by Schubert Zhang <zs...@gmail.com>.
Thank you J-D, it's a good post.
I have tested the performance of put.setWriteToWAL(false), and it is really
fast. And after the batch loading, we should call admin.flush(tablename) to
flush all data to HDFS.

One more question, about when HDFS supports append and the logs can work in
append mode: I think HDFS will still have a write buffer then. If the
server crashes while appends are sitting in the write buffer and have not
yet been written to disk, will the data in the write buffer be lost?


On Wed, Aug 12, 2009 at 1:40 AM, Jean-Daniel Cryans <jd...@apache.org>wrote:
