Posted to user@hbase.apache.org by Eric Burin des Roziers <er...@yahoo.com> on 2011/05/05 15:03:41 UTC

put to WAL and scan/get operation concurrency

Hi,

I am currently looking at adding a transactional consistency aspect to HBase and had 2 questions:

1. My understanding is that when the client performs an operation (put, delete, incr), it is sent to the region server, which delegates it to different region servers, which in turn put it in the WAL and the MemStore in that region.  At some point later, the MemStore is flushed to disk (into the HFiles).  The WAL is essentially there as a way to recover the data in case the machine crashes, hence losing the data stored in its MemStore but not yet stored on disk.  Once the data is available in the MemStore (but not yet in HFiles), do scans and gets 'see' that data?  Is the data duplicated in the MemStore across 3 region servers?  If a region server crashes, can I get into a situation where a scan can return a partial data set without the client being aware of it?

2. The HBase-trx package implements transactions by effectively creating a WAL per transaction (THLog) and 'flushing' it to the main WAL (HLog) on commit.  But flushing this THLog opens a time window (however small it is).  If a scan (or get) is performed during that window, could I get into a situation where I see part of the committed transaction (some rows but not others, since they have not been flushed yet)?  Why did HBase-trx decide to go with a THLog instead of leveraging the KeyValue versioning?

I am thinking of implementing a transaction isolation/consistency mechanism by storing a unique transaction id as the version when doing a put (instead of the current time in milliseconds) and passing invalid transaction ids to scans/gets, letting them know to fetch a previous version (one with a valid transaction id) for cells that have been updated by a non-committed transaction.  Are there any reasons for not going with this approach?
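The mechanism described above can be sketched as a toy in-memory model (all names and structures here are invented for illustration; this is not HBase client API code):

```python
# Toy model of the proposed scheme: each cell keeps multiple versions
# keyed by a transaction id (standing in for the timestamp slot of an
# HBase KeyValue), and a read is handed the set of in-flight
# (uncommitted) transaction ids so it can skip their writes.

class VersionedTable:
    def __init__(self):
        # (row, column) -> {txn_id: value}; txn ids are monotonically increasing
        self.cells = {}

    def put(self, row, col, txn_id, value):
        self.cells.setdefault((row, col), {})[txn_id] = value

    def get(self, row, col, invalid_txn_ids):
        """Return the newest version not written by an uncommitted txn."""
        versions = self.cells.get((row, col), {})
        for txn_id in sorted(versions, reverse=True):
            if txn_id not in invalid_txn_ids:
                return versions[txn_id]
        return None

table = VersionedTable()
table.put("row1", "cf:a", 1, "committed")
table.put("row1", "cf:a", 2, "uncommitted")   # txn 2 is still in flight

# A reader that knows txn 2 is in flight falls back to the older version.
print(table.get("row1", "cf:a", invalid_txn_ids={2}))  # -> committed
```

A reader with an empty invalid set would see the newest version, which is why correctly tracking which transaction ids are still in flight is the crux of this approach.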

Thanks for your help,
-Eric

Re: put to WAL and scan/get operation concurrency

Posted by pob <pe...@gmail.com>.
I think splitting doesn't distribute your read load.

By read-load distribution I mean that you can access the same data on, say, 3
different nodes (RS) when the DFS replication is set to 3. That is what HBase
doesn't handle, am I right?





2011/5/6 Todd Lipcon <to...@cloudera.com>

> On Fri, May 6, 2011 at 11:19 AM, pob <pe...@gmail.com> wrote:
>
> >
> >
> > > The data for those regions is replicated, but only 1 region server
> > > does the management of that data.
> > >
> > >
> > So does it mean there isn't "scaling for reads"? {meaning higher replication ->
> > better read throughput}
> >
>
> Reads are scaled by splitting regions and distributing them around multiple
> servers. If you have one super-hot row, it should fit in cache and give you
> some >20k reads/second. If you need more reads/sec on a single row than
> that, you'll need to add your own caching layer in front.
>
> -Todd
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

Re: put to WAL and scan/get operation concurrency

Posted by Todd Lipcon <to...@cloudera.com>.
On Fri, May 6, 2011 at 11:19 AM, pob <pe...@gmail.com> wrote:

>
>
> > The data for those regions is replicated, but only 1 region server
> > does the management of that data.
> >
> >
> So does it mean there isn't "scaling for reads"? {meaning higher replication ->
> better read throughput}
>

Reads are scaled by splitting regions and distributing them around multiple
servers. If you have one super-hot row, it should fit in cache and give you
some >20k reads/second. If you need more reads/sec on a single row than
that, you'll need to add your own caching layer in front.
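Todd's point about scaling reads by splitting can be sketched as a toy key-range lookup (the region boundaries and server names below are hypothetical, for illustration only):

```python
import bisect

# A table is partitioned into regions by row-key ranges, and each region
# is served by exactly one region server. More splits -> more servers
# sharing the read load, but any single row is still served by one
# server only, which is why a single hot row does not scale this way.

# Sorted list of region start keys, and a hypothetical server assignment.
start_keys = ["", "g", "p"]
assignment = {"": "rs1", "g": "rs2", "p": "rs3"}

def server_for(row_key):
    # Find the region whose start-key range contains row_key.
    idx = bisect.bisect_right(start_keys, row_key) - 1
    return assignment[start_keys[idx]]

print(server_for("apple"))   # -> rs1
print(server_for("kiwi"))    # -> rs2
print(server_for("zebra"))   # -> rs3
```

Every lookup for a given key lands on the same server, so adding DFS replicas does not by itself add read capacity for that key.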

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera

Re: put to WAL and scan/get operation concurrency

Posted by pob <pe...@gmail.com>.
Hi,



2011/5/6 Jean-Daniel Cryans <jd...@apache.org>

> As I said before the regions aren't replicated, it is not the right
> way to see it.
>

> The data for those regions is replicated, but only 1 region server
> does the management of that data.
>
>
So does it mean there isn't "scaling for reads"? {meaning higher replication ->
better read throughput}


Thanks


> If a RS crashes, like I said before, the data is unavailable until the
> logs are replayed. In the context of a MR (or any other client
> context, because it's really the same) the maps or reducers that are
> reading data from HBase will be blocked (specifically in the
> HConnectionManager code, it is transparent) until the region that
> contains the row it's trying to get to is made available again. That's
> the strong consistency guarantee. The only case where your client will
> see an exception is if the retries are exhausted, in which case you'll
> see a RetriesExhaustedException.
>
> J-D
>
> On Fri, May 6, 2011 at 7:47 AM, Eric Burin des Roziers
> <er...@yahoo.com> wrote:
> > So, just to make sure I understand: there is a chance that a MapReduce
> job does not get all the data, without being aware of it, because a region
> server crashed?  Wouldn't HBase use a replicated region instead?  And if the
> region server crashed during the job's scan, it should get an exception,
> right?
> > Thanks,
> > -Eric
> >
> >
> >
> > ________________________________
> > From: Stack <st...@duboce.net>
> > To: user@hbase.apache.org; Eric Burin des Roziers <er...@yahoo.com>
> > Sent: Friday, May 6, 2011 4:37 PM
> > Subject: Re: put to WAL and scan/get operation concurrency
> >
> > On Fri, May 6, 2011 at 1:45 AM, Eric Burin des Roziers
> > <er...@yahoo.com> wrote:
> >> Thanks Stack,  I hadn't read the Percolator paper (doing it now).  I
> think I am not describing my question properly.  Basically, based on the
> hbase-trx implementation, when the transaction commits, there is a time
> window where a Get() might read partial rows, since it implements
> snapshot isolation by writing records to a different location (than the
> actual HTable) before the commit().  In the Percolator paper, cell versions
> provide the snapshot isolation, and an as-of timestamp is used when doing a
> Get().
> >>
> >
> > That could be the case (I had a bit of a notion of how hbase-trx
> > worked -- once -- but it's been flushed for a while now).  Want to ask
> > over on the hbase-trx github project?  James will likely know.
> >
> >> Another unrelated question: when a region server fails, does the client
> (while doing a get/scan) get notified (exception)?  Basically, I want to
> ensure that an operation (such as a rollup/aggregate) does not compute the
> wrong amounts due to missing data.
> >>
> >
> > The client?  No.  Not natively.  RegionServers do register themselves
> > in zk.  A trx-client could register a zk watcher on the regionservers dir
> > in zk.  Then you'd get notification of RS death.  If you go this route
> > and have thousands or tens of thousands of clients, you might want to do a
> > bit of research around how it'll scale.
> >
> > St.Ack
>

Re: put to WAL and scan/get operation concurrency

Posted by Jean-Daniel Cryans <jd...@apache.org>.
As I said before the regions aren't replicated, it is not the right
way to see it.

The data for those regions is replicated, but only 1 region server
does the management of that data.

If a RS crashes, like I said before, the data is unavailable until the
logs are replayed. In the context of a MR (or any other client
context, because it's really the same) the maps or reducers that are
reading data from HBase will be blocked (specifically in the
HConnectionManager code, it is transparent) until the region that
contains the row it's trying to get to is made available again. That's
the strong consistency guarantee. The only case where your client will
see an exception is if the retries are exhausted, in which case you'll
see a RetriesExhaustedException.
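The client-side behaviour described here can be sketched as a simplified retry loop (no real RPC; the exception name mirrors HBase's RetriesExhaustedException, but the code is only an illustration):

```python
import time

class RetriesExhaustedError(Exception):
    pass

def get_with_retries(fetch, max_retries=5, backoff_s=0.01):
    """Keep retrying a read while the region is unavailable; the caller
    only sees an error once every retry has been used up, mirroring the
    blocking behaviour described above (simplified, no real RPC)."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except ConnectionError:
            time.sleep(backoff_s * (attempt + 1))  # simple linear backoff
    raise RetriesExhaustedError(
        "region still unavailable after %d tries" % max_retries)

# Simulate a region that comes back online after two failed attempts
# (e.g. once its WAL has been replayed and the region is reopened).
state = {"calls": 0}
def fetch():
    state["calls"] += 1
    if state["calls"] <= 2:
        raise ConnectionError("region offline, log replay in progress")
    return "row-data"

print(get_with_retries(fetch))  # -> row-data
```

From the caller's point of view the two early failures are invisible; either the read eventually succeeds or the retries run out and a single exception surfaces.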

J-D

On Fri, May 6, 2011 at 7:47 AM, Eric Burin des Roziers
<er...@yahoo.com> wrote:
> So, just to make sure I understand: there is a chance that a MapReduce job does not get all the data, without being aware of it, because a region server crashed?  Wouldn't HBase use a replicated region instead?  And if the region server crashed during the job's scan, it should get an exception, right?
> Thanks,
> -Eric
>
>
>
> ________________________________
> From: Stack <st...@duboce.net>
> To: user@hbase.apache.org; Eric Burin des Roziers <er...@yahoo.com>
> Sent: Friday, May 6, 2011 4:37 PM
> Subject: Re: put to WAL and scan/get operation concurrency
>
> On Fri, May 6, 2011 at 1:45 AM, Eric Burin des Roziers
> <er...@yahoo.com> wrote:
>> Thanks Stack,  I hadn't read the Percolator paper (doing it now).  I think I am not describing my question properly.  Basically, based on the hbase-trx implementation, when the transaction commits, there is a time window where a Get() might read partial rows, since it implements snapshot isolation by writing records to a different location (than the actual HTable) before the commit().  In the Percolator paper, cell versions provide the snapshot isolation, and an as-of timestamp is used when doing a Get().
>>
>
> That could be the case (I had a bit of a notion of how hbase-trx
> worked -- once -- but it's been flushed for a while now).  Want to ask
> over on the hbase-trx github project?  James will likely know.
>
>> Another unrelated question: when a region server fails, does the client (while doing a get/scan) get notified (exception)?  Basically, I want to ensure that an operation (such as a rollup/aggregate) does not compute the wrong amounts due to missing data.
>>
>
> The client?  No.  Not natively.  RegionServers do register themselves
> in zk.  A trx-client could register a zk watcher on the regionservers dir
> in zk.  Then you'd get notification of RS death.  If you go this route
> and have thousands or tens of thousands of clients, you might want to do a
> bit of research around how it'll scale.
>
> St.Ack

Re: put to WAL and scan/get operation concurrency

Posted by Eric Burin des Roziers <er...@yahoo.com>.
So, just to make sure I understand: there is a chance that a MapReduce job does not get all the data, without being aware of it, because a region server crashed?  Wouldn't HBase use a replicated region instead?  And if the region server crashed during the job's scan, it should get an exception, right?
Thanks,
-Eric



________________________________
From: Stack <st...@duboce.net>
To: user@hbase.apache.org; Eric Burin des Roziers <er...@yahoo.com>
Sent: Friday, May 6, 2011 4:37 PM
Subject: Re: put to WAL and scan/get operation concurrency

On Fri, May 6, 2011 at 1:45 AM, Eric Burin des Roziers
<er...@yahoo.com> wrote:
> Thanks Stack,  I hadn't read the Percolator paper (doing it now).  I think I am not describing my question properly.  Basically, based on the hbase-trx implementation, when the transaction commits, there is a time window where a Get() might read partial rows, since it implements snapshot isolation by writing records to a different location (than the actual HTable) before the commit().  In the Percolator paper, cell versions provide the snapshot isolation, and an as-of timestamp is used when doing a Get().
>

That could be the case (I had a bit of a notion of how hbase-trx
worked -- once -- but it's been flushed for a while now).  Want to ask
over on the hbase-trx github project?  James will likely know.

> Another unrelated question: when a region server fails, does the client (while doing a get/scan) get notified (exception)?  Basically, I want to ensure that an operation (such as a rollup/aggregate) does not compute the wrong amounts due to missing data.
>

The client?  No.  Not natively.  RegionServers do register themselves
in zk.  A trx-client could register a zk watcher on the regionservers dir
in zk.  Then you'd get notification of RS death.  If you go this route
and have thousands or tens of thousands of clients, you might want to do a
bit of research around how it'll scale.

St.Ack

Re: put to WAL and scan/get operation concurrency

Posted by Stack <st...@duboce.net>.
On Fri, May 6, 2011 at 1:45 AM, Eric Burin des Roziers
<er...@yahoo.com> wrote:
> Thanks Stack,  I hadn't read the Percolator paper (doing it now).  I think I am not describing my question properly.  Basically, based on the hbase-trx implementation, when the transaction commits, there is a time window where a Get() might read partial rows, since it implements snapshot isolation by writing records to a different location (than the actual HTable) before the commit().  In the Percolator paper, cell versions provide the snapshot isolation, and an as-of timestamp is used when doing a Get().
>

That could be the case (I had a bit of a notion of how hbase-trx
worked -- once -- but it's been flushed for a while now).  Want to ask
over on the hbase-trx github project?  James will likely know.

> Another unrelated question: when a region server fails, does the client (while doing a get/scan) get notified (exception)?  Basically, I want to ensure that an operation (such as a rollup/aggregate) does not compute the wrong amounts due to missing data.
>

The client?  No.  Not natively.  RegionServers do register themselves
in zk.  A trx-client could register a zk watcher on the regionservers dir
in zk.  Then you'd get notification of RS death.  If you go this route
and have thousands or tens of thousands of clients, you might want to do a
bit of research around how it'll scale.

St.Ack

Re: put to WAL and scan/get operation concurrency

Posted by Eric Burin des Roziers <er...@yahoo.com>.
Thanks Stack,  I hadn't read the Percolator paper (doing it now).  I think I am not describing my question properly.  Basically, based on the hbase-trx implementation, when the transaction commits, there is a time window where a Get() might read partial rows, since it implements snapshot isolation by writing records to a different location (than the actual HTable) before the commit().  In the Percolator paper, cell versions provide the snapshot isolation, and an as-of timestamp is used when doing a Get().

Another unrelated question: when a region server fails, does the client (while doing a get/scan) get notified (exception)?  Basically, I want to ensure that an operation (such as a rollup/aggregate) does not compute the wrong amounts due to missing data.

Thanks again for your help,
-Eric


________________________________
From: Stack <st...@duboce.net>
To: user@hbase.apache.org; Eric Burin des Roziers <er...@yahoo.com>
Sent: Thursday, May 5, 2011 10:05 PM
Subject: Re: put to WAL and scan/get operation concurrency

On Thu, May 5, 2011 at 11:43 AM, Eric Burin des Roziers
<er...@yahoo.com> wrote:
> Hi Jean-Daniel,
>
> Yes, I need to have a multi-row transaction-aware HBase for the types of processing I need to do.  I need to avoid having partial rows available, and I am in the process of selecting a way to implement such transaction isolation.  I currently have 2 choices: (1) use HBase-trx or (2) implement my own, leveraging the versioning that HBase provides.  In light of this I wanted to understand the inner workings of HBase a little more.


Have you read the Megastore and Percolator papers?  They discuss cross-row
transactions.


> For example, I want to understand if scans read data from the MemStore even if it has not been flushed to the HFiles yet.

It does.

> HBase replicates the data 3 times (depending on your configs).  Does it do that as well for the MemStore?

The data in memstore is first put in the WAL, which is replicated three times.



> Say the client wants to insert 10 lines which happen to fall across 2 regions.  If region 2 fails, then another client will still be able to read the rows inserted in region 1, but not region 2.  Since HBase replicates data to other servers, region 2's lines could be available on other servers, right?
>

Would suggest you read the bigtable paper.  It'll answer most of your
questions more eloquently than I can (To answer your question, only
one region serves a specific piece of data.  It depends on your
transaction implementation as to whether the half written data is
readable by the client).


> The second aspect that I would like to understand is the implementation of HBase-trx.  It seems that I can still have a failure point when the transactional WAL (THLog) flushes the data to the main WAL.  Using the above example, I can get into a situation where I will only be able to read a subset of the 10 lines initially inserted.  Is that right?
>

> > I think, pardon me if I'm reading this wrong, that you have started off on
> > the wrong foot, so your question doesn't quite add up.

St.Ack

Re: put to WAL and scan/get operation concurrency

Posted by Stack <st...@duboce.net>.
On Thu, May 5, 2011 at 11:43 AM, Eric Burin des Roziers
<er...@yahoo.com> wrote:
> Hi Jean-Daniel,
>
> Yes, I need to have a multi-row transaction-aware HBase for the types of processing I need to do.  I need to avoid having partial rows available, and I am in the process of selecting a way to implement such transaction isolation.  I currently have 2 choices: (1) use HBase-trx or (2) implement my own, leveraging the versioning that HBase provides.  In light of this I wanted to understand the inner workings of HBase a little more.


Have you read the Megastore and Percolator papers?  They discuss cross-row
transactions.


> For example, I want to understand if scans read data from the MemStore even if it has not been flushed to the HFiles yet.

It does.

> HBase replicates the data 3 times (depending on your configs).  Does it do that as well for the MemStore?

The data in memstore is first put in the WAL, which is replicated three times.
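That write path can be sketched as a minimal simulation (plain Python stand-ins for the WAL, MemStore, and HFiles; no real HBase or HDFS involved):

```python
# Sketch of the write path: an edit is appended to the WAL first (the
# WAL file itself is what HDFS replicates), then applied to the
# in-memory MemStore, which is later flushed to an immutable HFile.

wal = []        # stands in for the HDFS-replicated log
memstore = {}   # per-region, single-server, in-memory
hfiles = []     # flushed, immutable store files

def put(row, value):
    wal.append((row, value))   # made durable first...
    memstore[row] = value      # ...then visible to reads from the MemStore

def flush():
    hfiles.append(dict(memstore))  # persist a snapshot of the MemStore
    memstore.clear()

put("r1", "a")
put("r2", "b")
print(memstore)  # reads see MemStore data even before any flush
flush()
print(hfiles[0])
```

The MemStore itself lives on one server only; what survives a crash is the WAL, which is why recovery is a replay of logged edits rather than a read from a MemStore replica.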



> Say the client wants to insert 10 lines which happen to fall across 2 regions.  If region 2 fails, then another client will still be able to read the rows inserted in region 1, but not region 2.  Since HBase replicates data to other servers, region 2's lines could be available on other servers, right?
>

Would suggest you read the bigtable paper.  It'll answer most of your
questions more eloquently than I can (To answer your question, only
one region serves a specific piece of data.  It depends on your
transaction implementation as to whether the half written data is
readable by the client).


> The second aspect that I would like to understand is the implementation of HBase-trx.  It seems that I can still have a failure point when the transactional WAL (THLog) flushes the data to the main WAL.  Using the above example, I can get into a situation where I will only be able to read a subset of the 10 lines initially inserted.  Is that right?
>

I think, pardon me if I'm reading this wrong, that you have started off on
the wrong foot, so your question doesn't quite add up.

St.Ack

Re: put to WAL and scan/get operation concurrency

Posted by Eric Burin des Roziers <er...@yahoo.com>.
Hi Jean-Daniel,

Yes, I need to have a multi-row transaction-aware HBase for the types of processing I need to do.  I need to avoid having partial rows available, and I am in the process of selecting a way to implement such transaction isolation.  I currently have 2 choices: (1) use HBase-trx or (2) implement my own, leveraging the versioning that HBase provides.  In light of this I wanted to understand the inner workings of HBase a little more.

For example, I want to understand if scans read data from the MemStore even if it has not been flushed to the HFiles yet.  HBase replicates the data 3 times (depending on your configs); does it do that as well for the MemStore?  Say the client wants to insert 10 lines which happen to fall across 2 regions.  If region 2 fails, then another client will still be able to read the rows inserted in region 1, but not region 2.  Since HBase replicates data to other servers, region 2's lines could be available on other servers, right?

The second aspect that I would like to understand is the implementation of HBase-trx.  It seems that I can still have a failure point when the transactional WAL (THLog) flushes the data to the main WAL.  Using the above example, I can get into a situation where I will only be able to read a subset of the 10 lines initially inserted.  Is that right?

Thanks,
-Eric



________________________________
From: Jean-Daniel Cryans <jd...@apache.org>
To: user@hbase.apache.org
Sent: Thursday, May 5, 2011 7:24 PM
Subject: Re: put to WAL and scan/get operation concurrency

Inline.

On Thu, May 5, 2011 at 6:03 AM, Eric Burin des Roziers
<er...@yahoo.com> wrote:
> Hi,
>
> I am currently looking at adding a transactional consistency aspect to HBase and had 2 questions:
>
> 1. My understanding is that when the client performs an operation (put, delete, incr), it is sent to the region server, which delegates it to different region servers, which in turn put it in the WAL and the MemStore in that region.  At some point later, the MemStore is flushed to disk (into the HFiles).  The WAL is essentially there as a way to recover the data in case the machine crashes, hence losing the data stored in its MemStore but not yet stored on disk.  Once the data is available in the MemStore (but not yet in HFiles), do scans and gets 'see' that data?  Is the data duplicated in the MemStore across 3 region servers?  If a region server crashes, can I get into a situation where a scan can return a partial data set without the client being aware of it?

Only one region server serves a region at a time; if that region
server crashes, the data is still available on other DataNodes, but it's
not available to the client until the WAL is replayed and the region
is reopened. So no stale data.

>
> 2. The HBase-trx package implements transactions by effectively creating a WAL per transaction (THLog) and 'flushing' it to the main WAL (HLog) on commit.  But flushing this THLog opens a time window (however small it is).  If a scan (or get) is performed during that window, could I get into a situation where I see part of the committed transaction (some rows but not others, since they have not been flushed yet)?  Why did HBase-trx decide to go with a THLog instead of leveraging the KeyValue versioning?

I think you are confusing multi-row transactions with single-row
transactions. In pure HBase, every single-row transaction is ACID. You
can learn more here: http://hbase.apache.org/acid-semantics.html

The trx package does multi-row transactions.

>
> I am thinking of implementing a transaction isolation/consistency mechanism by storing a unique transaction id as the version when doing a put (instead of the current time in milliseconds) and passing invalid transaction ids to scans/gets, letting them know to fetch a previous version (one with a valid transaction id) for cells that have been updated by a non-committed transaction.  Are there any reasons for not going with this approach?
>

So just to be sure, were my previous answers good enough to answer
your question, or are you trying to implement something like the
HBase-trx?

J-D

Re: put to WAL and scan/get operation concurrency

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Inline.

On Thu, May 5, 2011 at 6:03 AM, Eric Burin des Roziers
<er...@yahoo.com> wrote:
> Hi,
>
> I am currently looking at adding a transactional consistency aspect to HBase and had 2 questions:
>
> 1. My understanding is that when the client performs an operation (put, delete, incr), it is sent to the region server, which delegates it to different region servers, which in turn put it in the WAL and the MemStore in that region.  At some point later, the MemStore is flushed to disk (into the HFiles).  The WAL is essentially there as a way to recover the data in case the machine crashes, hence losing the data stored in its MemStore but not yet stored on disk.  Once the data is available in the MemStore (but not yet in HFiles), do scans and gets 'see' that data?  Is the data duplicated in the MemStore across 3 region servers?  If a region server crashes, can I get into a situation where a scan can return a partial data set without the client being aware of it?

Only one region server serves a region at a time; if that region
server crashes, the data is still available on other DataNodes, but it's
not available to the client until the WAL is replayed and the region
is reopened. So no stale data.
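The recovery step described here can be sketched as a minimal WAL replay (pure illustration; real log splitting and replay are far more involved):

```python
# Sketch of why a crash yields temporary unavailability rather than
# stale reads: the region is only reopened (and readable) after its WAL
# edits have been replayed into a fresh MemStore on the new server.

def replay_wal(wal_entries):
    """Rebuild the MemStore from logged edits after a crash."""
    memstore = {}
    for row, value in wal_entries:
        memstore[row] = value  # later edits overwrite earlier ones
    return memstore

# Edits that made it to the (HDFS-replicated) WAL before the server died:
wal = [("r1", "v1"), ("r2", "v2"), ("r1", "v1-updated")]

recovered = replay_wal(wal)
print(recovered)  # -> {'r1': 'v1-updated', 'r2': 'v2'}
```

Until this replay finishes, clients retry rather than read from anywhere else, which is what preserves the no-stale-data guarantee.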

>
> 2. The HBase-trx package implements transactions by effectively creating a WAL per transaction (THLog) and 'flushing' it to the main WAL (HLog) on commit.  But flushing this THLog opens a time window (however small it is).  If a scan (or get) is performed during that window, could I get into a situation where I see part of the committed transaction (some rows but not others, since they have not been flushed yet)?  Why did HBase-trx decide to go with a THLog instead of leveraging the KeyValue versioning?

I think you are confusing multi-row transactions with single-row
transactions. In pure HBase, every single-row transaction is ACID. You
can learn more here: http://hbase.apache.org/acid-semantics.html

The trx package does multi-row transactions.

>
> I am thinking of implementing a transaction isolation/consistency mechanism by storing a unique transaction id as the version when doing a put (instead of the current time in milliseconds) and passing invalid transaction ids to scans/gets, letting them know to fetch a previous version (one with a valid transaction id) for cells that have been updated by a non-committed transaction.  Are there any reasons for not going with this approach?
>

So just to be sure, were my previous answers good enough to answer
your question, or are you trying to implement something like the
HBase-trx?

J-D