Posted to user@hbase.apache.org by Weihua JIANG <we...@gmail.com> on 2011/05/26 09:38:52 UTC

Is there any way to disable WAL while keeping data safety

Hi all,

As I understand it, the WAL is used to ensure data is safe even if a
certain RS or the whole HBase cluster goes down. But it is also a
burden on each put.

I am wondering: is there any way to disable the WAL while keeping data safe?

An ideal solution to me looks like this:
1. clients continually put records with the WAL disabled (a sketch of
such a put follows below).
2. clients call a certain HBase method to ensure all the
previously-put records are safely persisted; after that, the client
can remove the records on its side.
3. on error, the client re-puts the records that may have been lost.

Or a slightly different solution is:
1. clients continually append records to a sequence file on HDFS.
2. clients periodically flush the HDFS file and remove the previously
put records on the client side.
3. after all records are stored on HDFS, use a map-reduce job to put
the records into HBase with the WAL disabled.
4. before each map-reduce task finishes, a certain HBase method is
called to flush the in-memory data onto HDFS.
5. on error, the affected map-reduce tasks are re-executed (equivalent
to replaying the log).

Is there any way to do this in HBase? If not, do you have any plan to
support such a usage model in the near future?
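
For concreteness, a minimal sketch of step 1 of the first approach
with the 0.90-era Java client (the table and column names here are
made up; the relevant switch is Put.setWriteToWAL):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WalDisabledPut {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");   // hypothetical table name

    Put put = new Put(Bytes.toBytes("row-00001"));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
    // Skip the write-ahead log for this edit: faster, but the edit lives
    // only in the memstore until a flush, so an RS crash can lose it.
    put.setWriteToWAL(false);

    table.put(put);
    table.close();
  }
}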


Thanks
Weihua

Re: Is there any way to disable WAL while keeping data safety

Posted by Jean-Daniel Cryans <jd...@apache.org>.
You can call flush on the table with either the shell or HBaseAdmin,
which will persist the memstore data. What's not so good about this
trick is that if any region server dies before you call flush, you
need to re-import.
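
A minimal sketch of the HBaseAdmin route (the table name is made up;
in the shell the equivalent is simply: flush 'mytable'):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class FlushTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    // Asks the region servers hosting the table's regions to flush
    // their memstores out to HFiles on HDFS. The call is asynchronous,
    // so the flush may still be in flight when it returns.
    admin.flush("mytable");   // hypothetical table name
  }
}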

J-D


Re: Is there any way to disable WAL while keeping data safety

Posted by Ted Yu <yu...@gmail.com>.
Xiyun:
Take a look at https://issues.apache.org/jira/browse/HBASE-3871 for parallel
HFile splitting.


Re: Is there any way to disable WAL while keeping data safety

Posted by "Gan, Xiyun" <ga...@gmail.com>.
Thanks a lot

Is there any suggestion on the 'Region is not online' exception?

-- 
Best wishes
Gan, Xiyun

Re: Is there any way to disable WAL while keeping data safety

Posted by Joey Echeverria <jo...@cloudera.com>.
If you have a well-defined key space, you'll get better performance if
you pre-split your table and use the TotalOrderPartitioner with your
MapReduce job.

You can see an example of pre-splitting here:
http://hbase.apache.org/book.html#precreate.regions.
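
A rough sketch of creating a pre-split table from Java (the table
layout and choice of split points are invented for illustration; here
the row keys are assumed to start with a decimal digit):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("mytable");  // hypothetical
    desc.addFamily(new HColumnDescriptor("cf"));

    // Nine split points give ten regions, one per leading digit, so a
    // bulk load spreads across the region servers from the start
    // instead of hammering one region and splitting as it goes.
    byte[][] splits = new byte[9][];
    for (int i = 1; i <= 9; i++) {
      splits[i - 1] = Bytes.toBytes(String.valueOf(i));
    }
    admin.createTable(desc, splits);
  }
}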

-Joey


-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434

Re: Is there any way to disable WAL while keeping data safety

Posted by "Gan, Xiyun" <ga...@gmail.com>.
I used BulkLoad to import data. The step of writing HFiles using m/r is
fast, but the step of loading the HFiles into HBase takes a lot of
time. It says 'HFile at ****** no longer fits inside a single region.
Splitting...'. Even worse, sometimes it throws a 'Region is not online'
exception.

Thanks


-- 
Best wishes
Gan, Xiyun

Re: Is there any way to disable WAL while keeping data safety

Posted by Chris Tarnas <cf...@tarnas.org>.
Yes, it does deal with data merging and yes, doing a major compaction would be needed to guarantee the store files are as small as possible. 
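
For reference, a sketch of kicking that off from Java (the table name
is made up; the shell equivalent is: major_compact 'mytable'):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class MajorCompactTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    // Requests a major compaction of every region of the table, which
    // rewrites all store files of each store into a single file. The
    // call is asynchronous; the compactions run in the background.
    admin.majorCompact("mytable");   // hypothetical table name
  }
}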

-chris




Re: Is there any way to disable WAL while keeping data safety

Posted by Weihua JIANG <we...@gmail.com>.
Thanks. It seems quite useful.

Does bulk load support data merging? I.e., there is a table with
existing data and I want to add more data to it. The new data's row
key range is interleaved with the existing data's row key range, so
the net effect is that the new data should be inserted into existing
regions.

If bulk load supports this, then it is the ideal solution for me.

And do I need to perform a major compaction after the bulk load to
keep the number of store files small?


Thanks
Weihua


Re: Is there any way to disable WAL while keeping data safety

Posted by Chris Tarnas <cf...@email.com>.
Your second solution sounds quite similar to the bulk loader. Actually,
the bulk load is a bit simpler and bypasses even more of the
regionserver's overhead:

http://hbase.apache.org/bulk-loads.html

Using M/R it creates HFiles in HDFS directly, then adds the HFiles to
the existing regionservers.
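
As a sketch, the two key pieces look roughly like this (the job
wiring, paths, and table name are illustrative; the mapper and input
setup, which are job-specific, are omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");   // hypothetical table name

    // Phase 1: an M/R job that writes HFiles instead of doing puts.
    // configureIncrementalLoad wires in the reducer, a total-order
    // partitioner matched to the table's current region boundaries,
    // and HFileOutputFormat itself.
    Job job = new Job(conf, "bulk load sketch");
    // job.setMapperClass(...);  // job-specific mapper and input go here
    FileOutputFormat.setOutputPath(job, new Path("/tmp/hfiles"));
    HFileOutputFormat.configureIncrementalLoad(job, table);
    job.waitForCompletion(true);

    // Phase 2: hand the finished HFiles to the serving region servers.
    // This is the step that splits HFiles spanning a region boundary.
    new LoadIncrementalHFiles(conf).doBulkLoad(new Path("/tmp/hfiles"), table);
  }
}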

-chris

