Posted to dev@hbase.apache.org by Josh Elser <jo...@gmail.com> on 2017/04/02 16:26:41 UTC

Re: How threads interact with each other in HBase

No, that's not correct. HBase would, by definition, not be a
consistent database if a write was not durable when a client sees a
successful write.

The point that I will concede to you is that the hflush call may, in
exceptional circumstances, not be completely durable. For example,
hflush does not actually force the data to disk. If an abrupt power
failure happens before this data is pushed to disk, HBase may think
that data was durable when it actually wasn't (at the HDFS level).
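
As a rough local analogy (not HDFS or HBase code), the gap between hflush and hsync resembles the gap between writing bytes to a file and forcing that file to the storage device. The sketch below uses only java.nio; the class name FlushVersusSync and its methods are hypothetical, and the mapping to hflush() and hsync() on an HDFS output stream is an assumption made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration only: a local-file stand-in for a WAL.
final class FlushVersusSync {

    /** Appends one record and returns the file size; forces it to the device only if sync is true. */
    static long append(Path log, String record, boolean sync) {
        try (FileChannel ch = FileChannel.open(log,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            ByteBuffer buf = ByteBuffer.wrap((record + "\n").getBytes(StandardCharsets.UTF_8));
            while (buf.hasRemaining()) {
                // After write() the bytes are in OS buffers, not necessarily on disk
                // (loosely comparable to hflush leaving data in DataNode memory).
                ch.write(buf);
            }
            if (sync) {
                // force(true) asks the OS to push content and metadata to the device
                // (loosely comparable to hsync).
                ch.force(true);
            }
            return ch.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes two records to a temporary file, the second one synced, and returns the final size. */
    static long demo() {
        try {
            Path log = Files.createTempFile("wal-sketch", ".log");
            try {
                append(log, "put row1", false);
                return append(log, "put row2", true);
            } finally {
                Files.delete(log);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("bytes in log: " + demo());
    }
}
```

Both appends return successfully to the caller; only the second asks for the data to be on the device before returning, which is the distinction the power-failure case turns on.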

On Thu, Mar 30, 2017 at 4:26 PM, 杨苏立 Yang Su Li <ya...@gmail.com> wrote:
> Also, please correct me if I am wrong, but I don't think a put is durable
> when an RPC returns to the client. Just its corresponding WAL entry is
> pushed to the memory of all three data nodes, so it has a low probability
> of being lost. But nothing is persisted at this point.
>
> And this is true no matter whether you use the SYNC_WAL or FSYNC_WAL flag.
>
> On Tue, Mar 28, 2017 at 12:11 PM, Josh Elser <el...@apache.org> wrote:
>
>> 1.1 -> 2: don't forget about the block cache, which can remove the need
>> for any HDFS read.
>>
>> I think you're over-simplifying the write-path quite a bit. I'm not sure
>> what you mean by an 'asynchronous write', but that doesn't exist at the
>> HBase RPC layer as that would invalidate the consistency guarantees (if an
>> RPC returns to the client that data was "put", then it is durable).
>>
>> Going off of memory (sorry in advance if I misstate something): the
>> general way that data is written to the WAL is a "group commit". You have
>> many threads all trying to append data to the WAL -- performance would be
>> terrible if you serially applied all of these writes. Instead, many writes
>> can be accepted and the caller receives a Future. The caller must wait
>> for the Future to complete. What's happening behind the scenes is that the
>> writes are being bundled together to reduce the number of syncs to the WAL
>> ("grouping" the writes together). When one caller's Future completes,
>> what really happened is that the write/sync which included the caller's
>> update was committed (along with others). All of this is happening inside
>> the RS's implementation of accepting an update.
>>
>> https://github.com/apache/hbase/blob/55d6dcaf877cc5223e679736eb613173229c18be/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java#L74-L106
>>
>>
>> 杨苏立 Yang Su Li wrote:
>>
>>> The attachment can be found in the following URL:
>>> http://pages.cs.wisc.edu/~suli/hbase.pdf
>>>
>>> Sorry for the inconvenience...
>>>
>>>
>>> On Mon, Mar 27, 2017 at 8:25 PM, Ted Yu<yu...@gmail.com>  wrote:
>>>
>>>> Again, attachment didn't come thru.
>>>>
>>>> Is it possible to formulate as google doc ?
>>>>
>>>> Thanks
>>>>
>>>> On Mon, Mar 27, 2017 at 6:19 PM, 杨苏立 Yang Su Li<ya...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am a graduate student working on scheduling on storage systems, and we
>>>>> are interested in how different threads in HBase interact with each other
>>>>> and how it might affect scheduling.
>>>>>
>>>>> I have written down my understanding on how HBase/HDFS works based on its
>>>>> current thread architecture (attached). I am wondering if the developers of
>>>>> HBase could take a look at it and let me know if anything is incorrect or
>>>>> inaccurate, or if I have missed anything.
>>>>>
>>>>> Thanks a lot for your help!
>>>>>
>>>>> On Wed, Mar 22, 2017 at 3:39 PM, 杨苏立 Yang Su Li<ya...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am a graduate student working on scheduling on storage systems, and we
>>>>>> are interested in how different threads in HBase interact with each other
>>>>>> and how it might affect scheduling.
>>>>>>
>>>>>> I have written down my understanding on how HBase/HDFS works based on its
>>>>>> current thread architecture (attached). I am wondering if the developers of
>>>>>> HBase could take a look at it and let me know if anything is incorrect or
>>>>>> inaccurate, or if I have missed anything.
>>>>>>
>>>>>> Thanks a lot for your help!
>>>>>>
>>>>>> --
>>>>>> Suli Yang
>>>>>>
>>>>>> Department of Physics
>>>>>> University of Wisconsin Madison
>>>>>>
>>>>>> 4257 Chamberlin Hall
>>>>>> Madison WI 53703
>>>>>
>>>>> --
>>>>> Suli Yang
>>>>>
>>>>> Department of Physics
>>>>> University of Wisconsin Madison
>>>>>
>>>>> 4257 Chamberlin Hall
>>>>> Madison WI 53703
>>>
>
>
> --
> Suli Yang
>
> Department of Physics
> University of Wisconsin Madison
>
> 4257 Chamberlin Hall
> Madison WI 53703
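
The "group commit" described in the quoted Mar 28 message can be sketched in a few lines of plain Java. This is a minimal illustration under stated assumptions, not the FSHLog implementation linked above: the class GroupCommitSketch, its methods, and the in-memory list standing in for the WAL are all hypothetical. Callers enqueue an edit and receive a future; a single syncer thread drains whatever has queued up, performs one simulated sync for the whole batch, and only then completes every future in that batch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; names and structure are not taken from HBase.
final class GroupCommitSketch {

    /** One pending append plus the future its caller waits on. */
    private static final class Pending {
        final String edit;
        final CompletableFuture<Long> done = new CompletableFuture<>();
        Pending(String edit) { this.edit = edit; }
    }

    private final BlockingQueue<Pending> queue = new LinkedBlockingQueue<>();
    private final List<String> log = new ArrayList<>(); // stand-in for the WAL; touched only by the syncer
    private final AtomicInteger syncCount = new AtomicInteger();

    GroupCommitSketch() {
        Thread syncer = new Thread(this::syncLoop, "wal-syncer");
        syncer.setDaemon(true); // lets the JVM exit without an explicit shutdown
        syncer.start();
    }

    /** Accepts an edit immediately; the returned future completes once a sync has covered it. */
    CompletableFuture<Long> append(String edit) {
        Pending p = new Pending(edit);
        queue.add(p);
        return p.done;
    }

    private void syncLoop() {
        List<Pending> batch = new ArrayList<>();
        try {
            while (true) {
                batch.add(queue.take()); // block until at least one edit is waiting
                queue.drainTo(batch);    // then group everything else already queued
                for (Pending p : batch) {
                    log.add(p.edit);
                }
                long syncId = syncCount.incrementAndGet(); // one simulated sync per batch
                for (Pending p : batch) {
                    p.done.complete(syncId); // callers are released only after the sync
                }
                batch.clear();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    int syncs() { return syncCount.get(); }

    /** Issues the given number of appends, waits for all of them, returns {acknowledged, syncs}. */
    static int[] run(int writers) {
        GroupCommitSketch wal = new GroupCommitSketch();
        List<CompletableFuture<Long>> futures = new ArrayList<>();
        for (int i = 0; i < writers; i++) {
            futures.add(wal.append("edit-" + i));
        }
        int acked = 0;
        for (CompletableFuture<Long> f : futures) {
            f.join(); // the "RPC" would return to the client only at this point
            acked++;
        }
        return new int[] { acked, wal.syncs() };
    }

    public static void main(String[] args) {
        int[] result = run(1000);
        System.out.println(result[0] + " edits acknowledged using " + result[1] + " syncs");
    }
}
```

For brevity a single thread issues all appends here; the number of syncs depends on timing, but it can never exceed the number of edits, which is the point of grouping.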

Re: How threads interact with each other in HBase

Posted by Josh Elser <jo...@gmail.com>.

Yes, you are correct that there is an edge condition here when there is 
abrupt power-failure to a node. HDFS guards against most of this as 
there are multiple copies of your data spread across racks. However, if 
you have abrupt power failure across multiple racks (or your entire 
hardware), yes, you would likely lose some data. Having some form of 
redundant power-supply is a common deployment choice that further 
mitigates this risk. If this is not documented clearly enough, patches 
are welcome to improve this :)

IMO, all of this is an implementation detail, though, as I believe you 
already understand. It does not change the fact that 
architecturally/academically, HBase is a consistent system.

杨苏立 Yang Su Li wrote:
> I understand why HBase by default does not use hsync -- it does come with
> a big performance cost (though for FSYNC_WAL, which is not the default
> option, you should probably do it because the documentation explicitly
> promises it).
>
>
> I just want to make sure my description of HBase is accurate, including
> the durability aspect.
>
> On Sun, Apr 2, 2017 at 12:19 PM, Ted Yu<yu...@gmail.com>  wrote:
>
>> Suli:
>> Have you looked at HBASE-5954 ?
>>
>> It gives some background on why hbase code is formulated the way it
>> currently is.
>>
>> Cheers
>>
>> On Sun, Apr 2, 2017 at 9:36 AM, 杨苏立 Yang Su Li<ya...@gmail.com>  wrote:
>>
>>> Doesn't your second paragraph just prove my point? -- If data is not
>>> persisted to disk, then it is not durable. That is the definition of
>>> durability.
>>>
>>> If you want the data to be durable, then you need to call hsync() instead
>>> of hflush(), and that would be the correct behavior if you use FSYNC_WAL
>>> flag (per HBase documentation).
>>>
>>> However, HBase does not do that.
>>>
>>> Suli

Re: How threads interact with each other in HBase

Posted by 杨苏立 Yang Su Li <ya...@gmail.com>.

You might want to look at this follow up work as well:
https://www.usenix.org/conference/osdi16/technical-sessions/presentation/alagappan

It talks about how to use BOB on distributed systems.




-- 
Suli Yang

Department of Physics
University of Wisconsin Madison

4257 Chamberlin Hall
Madison WI 53703

Re: How threads interact with each other in HBase

Posted by Ted Yu <yu...@gmail.com>.

Need some time to digest the BOB and see if it can simplify the reasoning
of how fsync is implemented in hbase.

hdfs was evaluated by the paper where I noticed the following:

bq. both HDFS and ZooKeeper respondents lament that such an fsync() is not
easily achievable with Java

Cheers


Re: How threads interact with each other in HBase

Posted by 杨苏立 Yang Su Li <ya...@gmail.com>.

Regarding HBASE-5954 specifically, have you thought about using BOB (block
order breaker,
https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf)
to verify whether a change is correct?

It allows you to explore many different crash scenarios.



On Sun, Apr 2, 2017 at 1:35 PM, 杨苏立 Yang Su Li <ya...@gmail.com> wrote:

> I understand why HBase by default does not use hsync -- it does come with
> big performance cost (though for FSYNC_WAL which is not the default option,
> you should probably do it because the documentation explicitly promised
> it).
>
>
> I just want to make sure my description about HBase is accurate, including
> the durability aspect.


-- 
Suli Yang

Department of Physics
University of Wisconsin Madison

4257 Chamberlin Hall
Madison WI 53703

Re: How threads interact with each other in HBase

Posted by 杨苏立 Yang Su Li <ya...@gmail.com>.

I understand why HBase by default does not use hsync -- it does come with
a big performance cost (though for FSYNC_WAL, which is not the default
option, HBase should probably do it because the documentation explicitly
promises it).


I just want to make sure my description of HBase is accurate, including
the durability aspect.

On Sun, Apr 2, 2017 at 12:19 PM, Ted Yu <yu...@gmail.com> wrote:

> Suli:
> Have you looked at HBASE-5954?
>
> It gives some background on why the HBase code is formulated the way it
> currently is.
>
> Cheers



-- 
Suli Yang

Department of Physics
University of Wisconsin Madison

4257 Chamberlin Hall
Madison WI 53703

Re: How threads interact with each other in HBase

Posted by Ted Yu <yu...@gmail.com>.

Suli:
Have you looked at HBASE-5954?

It gives some background on why the HBase code is formulated the way it
currently is.

Cheers

On Sun, Apr 2, 2017 at 9:36 AM, 杨苏立 Yang Su Li <ya...@gmail.com> wrote:

> Doesn't your second paragraph just prove my point? -- If data is not
> persisted to disk, then it is not durable. That is the definition of
> durability.
>
> If you want the data to be durable, then you need to call hsync() instead
> of hflush(), and that would be the correct behavior if you use the
> FSYNC_WAL flag (per HBase documentation).
>
> However, HBase does not do that.
>
> Suli

Re: How threads interact with each other in HBase

Posted by 杨苏立 Yang Su Li <ya...@gmail.com>.

Doesn't your second paragraph just prove my point? -- If data is not
persisted to disk, then it is not durable. That is the definition of
durability.

If you want the data to be durable, then you need to call hsync() instead
of hflush(), and that would be the correct behavior if you use the
FSYNC_WAL flag (per HBase documentation).

However, HBase does not do that.
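
To make the distinction concrete, here is a minimal local-file sketch in
Java. It is only an analogy and does not use the HDFS or HBase APIs: a plain
FileChannel.write() hands the bytes to the operating system, roughly the way
hflush() only pushes data into the memory of the data nodes, while
FileChannel.force(true) asks the OS to push them to the storage device,
which is the role hsync() is meant to play. The class name FlushVsSync and
the append() helper are made up for this illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FlushVsSync {

    /**
     * Appends one record to a local log file (a stand-in for a WAL; hypothetical helper).
     * With durable == false the bytes are only handed to the OS (hflush-like).
     * With durable == true the channel is also forced to the device (hsync-like).
     * Returns the file size after the append.
     */
    public static long append(Path log, String record, boolean durable) throws IOException {
        try (FileChannel ch = FileChannel.open(log,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.APPEND)) {
            ByteBuffer buf = ByteBuffer.wrap(record.getBytes(StandardCharsets.UTF_8));
            while (buf.hasRemaining()) {
                ch.write(buf);      // data may still sit in the OS page cache
            }
            if (durable) {
                ch.force(true);     // request that data and metadata reach the device
            }
            return ch.size();
        }
    }

    public static void main(String[] args) throws IOException {
        Path log = Files.createTempFile("wal-sketch", ".log");
        try {
            append(log, "put row1\n", false);             // could be lost on power failure
            long size = append(log, "put row2\n", true);  // forced before returning
            System.out.println("bytes in log: " + size);
        } finally {
            Files.deleteIfExists(log);
        }
    }
}
```

Both calls return successfully and the file looks the same to a reader
afterwards; the difference only shows up if the machine loses power before
the OS writes the cached bytes out, which is exactly the case in question.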

Suli

On Sun, Apr 2, 2017 at 11:26 AM, Josh Elser <jo...@gmail.com> wrote:

> No, that's not correct. HBase would, by definition, not be a
> consistent database if a write was not durable when a client sees a
> successful write.
>
> The point that I will concede to you is that the hflush call may, in
> extenuating circumstances, not be completely durable. For example,
> hflush does not actually force the data to disk. If an abrupt power
> failure happens before this data is pushed to disk, HBase may think
> that data was durable when it actually wasn't (at the HDFS level).



-- 
Suli Yang

Department of Physics
University of Wisconsin Madison

4257 Chamberlin Hall
Madison WI 53703