Posted to user@hbase.apache.org by Mohit Anchlia <mo...@gmail.com> on 2012/08/01 03:09:54 UTC

sync on writes

The HBase book mentions that the default behaviour of a write is to
call sync on each node before sending replica copies to the nodes in the
pipeline. Is there a reason this was kept as the default? If data is
being written to multiple nodes, then the likelihood of losing data is
really low, since another copy is always there on the replica nodes. Is it
OK to make this sync async, and is it advisable?

Re: sync on writes

Posted by Mohit Anchlia <mo...@gmail.com>.
On Wed, Aug 1, 2012 at 9:29 AM, lars hofhansl <lh...@yahoo.com> wrote:

> I've also written about this here:
> http://hadoop-hbase.blogspot.com/2012/05/hbase-hdfs-and-durable-sync.html
>
> -- Lars

Thanks, this post is very helpful.


Re: sync on writes

Posted by lars hofhansl <lh...@yahoo.com>.
"sync" is a fluffy term in HDFS. HDFS has hsync and hflush.
hflush forces all current changes at a DFSClient to all replica nodes (but not to disk).

Until HDFS-744, hsync was identical to hflush. After HDFS-744, hsync can be used to force data to disk at the replicas.

When HBase refers to "sync", the hflush semantics are meant (at least until HBASE-5954 is finished).
I.e. a sync here ensures that the replica nodes have seen the changes, which is what you want.

So when you say "since another copy is always there on the replica nodes", that is only guaranteed after an hflush (which, again, is what HBase calls sync).
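
As a rough illustration of the distinction above, here is a toy Java model. It is not the Hadoop API: ToyReplica and ToyPipelineStream are made-up names for this sketch, and the "disk" is just a list. It only mimics the semantics described: write() buffers at the client, hflush() gets the edits into every replica's memory, and hsync() additionally forces them to disk at each replica. In Hadoop versions that have them, the real calls are hflush() and hsync() on FSDataOutputStream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only. These classes are hypothetical and are NOT Hadoop or HBase APIs.
class HflushVsHsyncDemo {
    public static void main(String[] args) {
        ToyReplica r1 = new ToyReplica();
        ToyReplica r2 = new ToyReplica();
        ToyPipelineStream out = new ToyPipelineStream(Arrays.asList(r1, r2));

        out.write("edit-1");
        System.out.println("after write:  seen=" + r2.memory.size() + " onDisk=" + r2.disk.size());

        out.hflush();
        System.out.println("after hflush: seen=" + r2.memory.size() + " onDisk=" + r2.disk.size());

        out.write("edit-2");
        out.hsync();
        System.out.println("after hsync:  seen=" + r2.memory.size() + " onDisk=" + r2.disk.size());
    }
}

class ToyReplica {
    final List<String> memory = new ArrayList<>(); // edits this replica has seen
    final List<String> disk = new ArrayList<>();   // edits this replica has forced to disk

    void receive(List<String> edits) {
        memory.addAll(edits);
    }

    void forceToDisk() {
        // Persist whatever has been seen but not yet written out.
        disk.addAll(memory.subList(disk.size(), memory.size()));
    }
}

class ToyPipelineStream {
    private final List<ToyReplica> replicas;
    private final List<String> clientBuffer = new ArrayList<>();

    ToyPipelineStream(List<ToyReplica> replicas) {
        this.replicas = replicas;
    }

    // Buffered at the client; no replica has seen the edit yet.
    void write(String edit) {
        clientBuffer.add(edit);
    }

    // hflush semantics: every replica has seen the edits, but only in memory.
    void hflush() {
        for (ToyReplica replica : replicas) {
            replica.receive(clientBuffer);
        }
        clientBuffer.clear();
    }

    // hsync semantics (after HDFS-744): as hflush, plus force to disk at each replica.
    void hsync() {
        hflush();
        for (ToyReplica replica : replicas) {
            replica.forceToDisk();
        }
    }
}
```

The point of the model is the middle state: after write() alone, no replica holds the edit, so "another copy is always there" does not yet hold. It holds only once hflush() has returned.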


I've also written about this here: http://hadoop-hbase.blogspot.com/2012/05/hbase-hdfs-and-durable-sync.html

-- Lars




Re: sync on writes

Posted by Jerry Lam <ch...@gmail.com>.
I believe you are talking about enabling the dfs.support.append feature? I
benchmarked the difference (disabled vs. enabled) previously and didn't find
much difference. It would be great if someone else could confirm this.

Best Regards,

Jerry


Re: sync on writes

Posted by Alex Baranau <al...@gmail.com>.
I believe that this is *not the default*, but the *current* implementation
of sync(). I.e. (please correct me if I'm wrong) the n-way write approach is
not available yet.
You might be confusing it with the fact that, by default, sync() is called
on every edit. You can change that by using "deferred log flushing". Either
way, sync() is going to be a pipelined write.
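
To make the trade-off concrete, here is a minimal sketch in Java of sync-per-edit versus deferred log flushing. It is a toy model, not HBase code: ToyWal and its methods are hypothetical names, and periodicFlush() stands in for whatever background timer triggers the deferred sync. (In HBase releases of this period the real switch was, to my knowledge, a per-table setting, HTableDescriptor.setDeferredLogFlush(true).)

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only. ToyWal is hypothetical and is NOT the HBase write-ahead log.
class DeferredFlushDemo {
    public static void main(String[] args) {
        ToyWal perEdit = new ToyWal(false);
        ToyWal deferred = new ToyWal(true);

        for (int i = 0; i < 3; i++) {
            perEdit.append("edit-" + i);
            deferred.append("edit-" + i);
        }
        System.out.println("per-edit: syncs=" + perEdit.syncCalls() + " atRisk=" + perEdit.editsAtRisk());
        System.out.println("deferred: syncs=" + deferred.syncCalls() + " atRisk=" + deferred.editsAtRisk());

        deferred.periodicFlush(); // the background timer fires
        System.out.println("deferred after timer: syncs=" + deferred.syncCalls()
                + " atRisk=" + deferred.editsAtRisk());
    }
}

class ToyWal {
    private final boolean deferred;
    private final List<String> unsynced = new ArrayList<>(); // not yet seen by replicas
    private final List<String> synced = new ArrayList<>();   // seen by replicas
    private int syncCalls = 0;

    ToyWal(boolean deferred) {
        this.deferred = deferred;
    }

    // Default mode syncs on every edit; deferred mode only buffers the edit.
    void append(String edit) {
        unsynced.add(edit);
        if (!deferred) {
            sync();
        }
    }

    // Stand-in for the periodic background flush used in deferred mode.
    void periodicFlush() {
        if (!unsynced.isEmpty()) {
            sync();
        }
    }

    // One sync call pushes every buffered edit through the (pipelined) write.
    private void sync() {
        synced.addAll(unsynced);
        unsynced.clear();
        syncCalls++;
    }

    int syncCalls() {
        return syncCalls;
    }

    int editsAtRisk() {
        return unsynced.size();
    }

    int editsSynced() {
        return synced.size();
    }
}
```

In this model three edits cost three syncs in the default mode and a single sync in deferred mode, but in deferred mode all three edits are at risk until the periodic flush runs. Either way, each sync is still the same pipelined write.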

There's an explanation of the benefits of pipelined and n-way writes in
the book (p337). It's not just about which approach provides better
durability of saved edits; both of them provide durability. But they can
take different amounts of time to execute and they utilize the network
differently: pipelined *may* be slower but can saturate network bandwidth
better.
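
A back-of-the-envelope sketch of why that can be, in Java. The cost formulas are my own simplifying assumptions, not taken from the book or from HDFS: every hop has the same latency and bandwidth, acknowledgements are ignored, a pipelined node forwards data while it is still receiving, and in the n-way case the client pushes all copies over one shared uplink. ToyWriteCost is a hypothetical name.

```java
// Toy cost model only. The formulas are assumptions for illustration, not measurements.
class ToyWriteCost {
    // Pipelined: one copy leaves the client and is forwarded replica to replica,
    // so latency adds up per hop but the payload crosses each link only once.
    static double pipelinedMillis(int replicas, double hopLatencyMs, double bytes, double bytesPerMs) {
        return replicas * hopLatencyMs + bytes / bytesPerMs;
    }

    // N-way: the client sends every copy itself, so there is a single hop of
    // latency but the client uplink has to carry the payload once per replica.
    static double nWayMillis(int replicas, double hopLatencyMs, double bytes, double bytesPerMs) {
        return hopLatencyMs + replicas * bytes / bytesPerMs;
    }

    public static void main(String[] args) {
        int replicas = 3;
        double hopLatencyMs = 0.5;
        double bytesPerMs = 100_000; // roughly 100 MB/s, an assumed figure

        double smallEdit = 1_000;      // a small log edit
        double largeBlock = 64_000_000; // a large block of data

        System.out.println("small edit:  pipelined=" + pipelinedMillis(replicas, hopLatencyMs, smallEdit, bytesPerMs)
                + " ms, n-way=" + nWayMillis(replicas, hopLatencyMs, smallEdit, bytesPerMs) + " ms");
        System.out.println("large block: pipelined=" + pipelinedMillis(replicas, hopLatencyMs, largeBlock, bytesPerMs)
                + " ms, n-way=" + nWayMillis(replicas, hopLatencyMs, largeBlock, bytesPerMs) + " ms");
    }
}
```

Under these assumptions the pipelined write loses on a small edit, where per-hop latency dominates, and wins on a large transfer, where the client's uplink is the bottleneck. Real clusters will differ, so treat the numbers as illustrative only.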

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
Solr
