Posted to dev@hbase.apache.org by Joydeep Sarma <js...@apache.org> on 2010/01/12 00:46:25 UTC

commit semantics

Hey HBase-devs,

we have been going through hbase code to come up to speed.

One of the questions was regarding the commit semantics. Thumbing through
the RegionServer code that's appending to the wal:

syncWal -> HLog.sync -> addToSyncQueue -> syncDone.await()

and the log writer thread calls:

hflush(), syncDone.signalAll()

however hflush doesn't necessarily call a sync on the underlying log file:

      if (this.forceSync ||
          this.unflushedEntries.get() >= this.flushlogentries) { ... sync()
... }

so it seems that if forceSync is not true, syncWal can unblock before a
sync is called (and forceSync seems to be true only for metaregion()).

are we missing something - or is there a bug here (the signalAll should be
conditional on hflush having actually flushed something).
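
For reference, a minimal sketch of the await/signal pattern in question
(class and field names are illustrative, not the actual HLog code); the
concern is that signalAll() wakes waiters whether or not sync() actually ran:

    import java.util.concurrent.atomic.AtomicLong;
    import java.util.concurrent.locks.Condition;
    import java.util.concurrent.locks.ReentrantLock;

    class WalSyncSketch {
      private final ReentrantLock lock = new ReentrantLock();
      private final Condition syncDone = lock.newCondition();
      private final AtomicLong unflushedEntries = new AtomicLong(0);
      private final int flushLogEntries = 100;   // hypothetical threshold
      private volatile boolean forceSync = false;

      // Commit path: blocks until the log writer thread signals.
      void syncWal() throws InterruptedException {
        lock.lock();
        try {
          syncDone.await();            // unblocked by signalAll() below
        } finally {
          lock.unlock();
        }
      }

      // Log writer thread: signalAll() runs even when the threshold was not
      // reached and no sync() was issued, which is the question raised above.
      void hflush() {
        if (forceSync || unflushedEntries.get() >= flushLogEntries) {
          // sync() on the underlying log file would happen here
          unflushedEntries.set(0);
        }
        lock.lock();
        try {
          syncDone.signalAll();
        } finally {
          lock.unlock();
        }
      }
    }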

thanks,

Joydeep

Re: commit semantics

Posted by Ryan Rawson <ry...@gmail.com>.
Performance.... It's all about performance.

In my own tests, calling sync() in HDFS-0.21 on every single commit
can limit the number of small rows you can do to a max of about 1200 a
second.  One way to speed things up is to sync less often.  Another
way is to sync on a timer instead.  Both of these are going to be way
more important in HDFS-0.21/Hbase-0.21.

If we are talking about hdfs/hadoop 0.20, it hardly matters either
way; there is that whole 'no append/sync' thing you know all about.
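
A rough sketch of the "sync on a timer" idea (illustrative only, not HBase's
actual LogSyncer): a daemon thread issues one sync per interval, so commits in
between share the cost and the data-loss window is bounded by the interval:

    class TimerSyncSketch extends Thread {
      interface Syncable { void sync() throws java.io.IOException; }

      private final Syncable log;       // whatever exposes the real sync()
      private final long intervalMs;    // e.g. a few hundred milliseconds
      private volatile boolean running = true;

      TimerSyncSketch(Syncable log, long intervalMs) {
        this.log = log;
        this.intervalMs = intervalMs;
        setDaemon(true);
      }

      @Override
      public void run() {
        while (running) {
          try {
            Thread.sleep(intervalMs);   // edits accumulate during this window
            log.sync();                 // one sync covers everything appended
          } catch (InterruptedException e) {
            running = false;
          } catch (java.io.IOException e) {
            running = false;
          }
        }
      }
    }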

-ryan

On Mon, Jan 11, 2010 at 3:46 PM, Joydeep Sarma <js...@apache.org> wrote:
> Hey HBase-devs,
>
> we have been going through hbase code to come up to speed.
>
> One of the questions was regarding the commit semantics. Thumbing through
> the RegionServer code that's appending to the wal:
>
> syncWal -> HLog.sync -> addToSyncQueue ->syncDone.await()
>
> and the log writer thread calls:
>
> hflush(), syncDone.signalAll()
>
> however hflush doesn't necessarily call a sync on the underlying log file:
>
>      if (this.forceSync ||
>          this.unflushedEntries.get() >= this.flushlogentries) { ... sync()
> ... }
>
> so it seems that if forceSync is not true, the syncWal can unblock before a
> sync is called (and forcesync seems to be only true for metaregion()).
>
> are we missing something - or is there a bug here (the signalAll should be
> conditional on hflush having actually flushed something).
>
> thanks,
>
> Joydeep
>

Re: commit semantics

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Hey Joydeep,

This is actually intended to work this way, but the name of the variable
is misleading. The sync is done only if forceSync is set or we have enough
entries to sync (the default is 1). If someone wants to sync only every 100
entries, for example, they would adjust that configuration.

Hope that helps,

J-D

On Mon, Jan 11, 2010 at 3:46 PM, Joydeep Sarma <js...@apache.org> wrote:
> Hey HBase-devs,
>
> we have been going through hbase code to come up to speed.
>
> One of the questions was regarding the commit semantics. Thumbing through
> the RegionServer code that's appending to the wal:
>
> syncWal -> HLog.sync -> addToSyncQueue ->syncDone.await()
>
> and the log writer thread calls:
>
> hflush(), syncDone.signalAll()
>
> however hflush doesn't necessarily call a sync on the underlying log file:
>
>      if (this.forceSync ||
>          this.unflushedEntries.get() >= this.flushlogentries) { ... sync()
> ... }
>
> so it seems that if forceSync is not true, the syncWal can unblock before a
> sync is called (and forcesync seems to be only true for metaregion()).
>
> are we missing something - or is there a bug here (the signalAll should be
> conditional on hflush having actually flushed something).
>
> thanks,
>
> Joydeep
>

RE: commit semantics

Posted by Kannan Muthukkaruppan <Ka...@facebook.com>.
Ok cool. Thanks for clarifying.

I think what I had in mind was a hybrid-- basically, try to accumulate transactions up to a certain app-configurable time window before sync'ing (and delay ack'ing the client until the sync).

Just caught up on an earlier response from Joy on this as well.

<<if the performance with setting of 1 doesn't work out - we may need an option to delay acks until actual syncs .. (most likely we would be able to compromise on latency to get higher throughput - but wouldn't be willing to compromise on data integrity)>>

Yes, that's what I had in mind. Agree that this could be something we explore later if necessary.

Regards,
Kannan
-----Original Message-----
From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of Jean-Daniel Cryans
Sent: Tuesday, January 12, 2010 11:44 AM
To: hbase-dev@hadoop.apache.org
Subject: Re: commit semantics

On Tue, Jan 12, 2010 at 11:29 AM, Kannan Muthukkaruppan
<Ka...@facebook.com> wrote:
>
> For data integrity, going with group commits (batch commits) seems like a good option. My understanding of group commits as implemented in 0.21 is as follows:
>
> *         We wait on acknowledging back to the client until the transaction has been synced to HDFS.

Yes

>
> *         Syncs are batched-a sync is called if the queue has enough transactions  or if a timer expires. (I would imagine that both the # of transactions to batch up as well as timer are configurable knobs already)? In this mode, for the client, the latency increase on writes is upper bounded by the timer setting + the cost of sync itself.

Nope. There are two kinds of group commit around that piece of code:

1) What you called batch commit, which is a configurable value
(flushlogentries): we have to append that many entries to trigger a
sync. Clients don't wait until that sync happens, so a region server
failure could lose some rows, depending on the time between the last
sync and the failure.

If flushlogentries=100 and 99 entries are lying around for more than
the timer's timeout (default 1 sec), the timer will force-sync those
entries.

2) Group commit happens at high concurrency and is only useful if a
high number of clients are writing at the same time and
flushlogentries=1. What happens in the LogSyncer thread is that
instead of calling sync() for every entry, we "group" the clients
waiting on the previous sync and issue only 1 sync for all of them. In
this case, when the call returns in the client, we are sure that the
value is in HDFS.

>
>
>
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of stack
> Sent: Tuesday, January 12, 2010 10:52 AM
> To: hbase-dev@hadoop.apache.org
> Cc: Kannan Muthukkaruppan; Dhruba Borthakur
> Subject: Re: commit semantics
>
> On Tue, Jan 12, 2010 at 10:14 AM, Dhruba Borthakur <dh...@gmail.com>> wrote:
> Hi stack,
>
> I was meaning "what if the application inserted the same record into two
> Hbase instances"? Of course, now the onus is on the appl to keep both of
> them in sync and recover from any inconsistencies between them.
>
> Ok.  Like your  "Overlapping Clusters for HA" from http://www.borthakur.com/ftp/hdfs_high_availability.pdf?
>
> I'm not sure how the application could return after writing one cluster without waiting on the second to complete as you suggest above.  It could write in parallel but the second thread might not complete for myriad reasons.  What then?  And as you say, reading, the client would have to make reconciliation.
>
> Isn't there already a 'scalable database' that gives you this headache for free without your having to do work on your part (smile)?
>
> Do you think there a problem syncing on every write (with some batching of writes happening when high-concurrency) or, if that too slow for your needs, adding the holding of clients until sync happens as joydeep suggests?  Will that be sufficient data integrity-wise?
>
> St.Ack
>
> Thanks,
> St.Ack
>

Re: commit semantics

Posted by Jean-Daniel Cryans <jd...@apache.org>.
On Tue, Jan 12, 2010 at 11:29 AM, Kannan Muthukkaruppan
<Ka...@facebook.com> wrote:
>
> For data integrity, going with group commits (batch commits) seems like a good option. My understanding of group commits as implemented in 0.21 is as follows:
>
> *         We wait on acknowledging back to the client until the transaction has been synced to HDFS.

Yes

>
> *         Syncs are batched-a sync is called if the queue has enough transactions  or if a timer expires. (I would imagine that both the # of transactions to batch up as well as timer are configurable knobs already)? In this mode, for the client, the latency increase on writes is upper bounded by the timer setting + the cost of sync itself.

Nope. There are two kinds of group commit around that piece of code:

1) What you called batch commit, which is a configurable value
(flushlogentries): we have to append that many entries to trigger a
sync. Clients don't wait until that sync happens, so a region server
failure could lose some rows, depending on the time between the last
sync and the failure.

If flushlogentries=100 and 99 entries are lying around for more than
the timer's timeout (default 1 sec), the timer will force-sync those
entries.

2) Group commit happens at high concurrency and is only useful if a
high number of clients are writing at the same time and
flushlogentries=1. What happens in the LogSyncer thread is that
instead of calling sync() for every entry, we "group" the clients
waiting on the previous sync and issue only 1 sync for all of them. In
this case, when the call returns in the client, we are sure that the
value is in HDFS.
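
To make the second case concrete, here is a hedged sketch of that grouping
(illustrative names, not the actual LogSyncer code): whichever waiter issues
the sync covers everyone whose edit was appended before it, so later arrivals
return without another sync():

    import java.util.concurrent.atomic.AtomicLong;

    class GroupCommitSketch {
      interface Syncer { void sync() throws java.io.IOException; }

      private final Syncer syncer;                              // the real sync
      private final AtomicLong appendedSeq = new AtomicLong(0); // last appended
      private long syncedSeq = 0;                               // last durable
      private final Object syncLock = new Object();

      GroupCommitSketch(Syncer syncer) { this.syncer = syncer; }

      long append(byte[] edit) {
        // write the edit to the log writer (omitted) ...
        return appendedSeq.incrementAndGet();
      }

      // Returns once edit 'seq' is durable. Threads arriving while another
      // sync is in flight block on the monitor; when they get in they usually
      // find their seq already covered and skip issuing a sync of their own.
      void syncUpTo(long seq) throws java.io.IOException {
        synchronized (syncLock) {
          if (seq <= syncedSeq) {
            return;                      // covered by a previous group sync
          }
          long target = appendedSeq.get();
          syncer.sync();                 // one sync() for the whole group
          syncedSeq = Math.max(syncedSeq, target);
        }
      }
    }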

>
>
>
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of stack
> Sent: Tuesday, January 12, 2010 10:52 AM
> To: hbase-dev@hadoop.apache.org
> Cc: Kannan Muthukkaruppan; Dhruba Borthakur
> Subject: Re: commit semantics
>
> On Tue, Jan 12, 2010 at 10:14 AM, Dhruba Borthakur <dh...@gmail.com>> wrote:
> Hi stack,
>
> I was meaning "what if the application inserted the same record into two
> Hbase instances"? Of course, now the onus is on the appl to keep both of
> them in sync and recover from any inconsistencies between them.
>
> Ok.  Like your  "Overlapping Clusters for HA" from http://www.borthakur.com/ftp/hdfs_high_availability.pdf?
>
> I'm not sure how the application could return after writing one cluster without waiting on the second to complete as you suggest above.  It could write in parallel but the second thread might not complete for myriad reasons.  What then?  And as you say, reading, the client would have to make reconciliation.
>
> Isn't there already a 'scalable database' that gives you this headache for free without your having to do work on your part (smile)?
>
> Do you think there a problem syncing on every write (with some batching of writes happening when high-concurrency) or, if that too slow for your needs, adding the holding of clients until sync happens as joydeep suggests?  Will that be sufficient data integrity-wise?
>
> St.Ack
>
> Thanks,
> St.Ack
>

RE: commit semantics

Posted by Kannan Muthukkaruppan <Ka...@facebook.com>.
Dhruba & I just talked off-line about this as well. Yes, writing to two clusters would result in unnecessary complexity... we will essentially need to deal with inconsistencies between the two clusters at the application level.

For data integrity, going with group commits (batch commits) seems like a good option. My understanding of group commits as implemented in 0.21 is as follows:

*         We wait to acknowledge back to the client until the transaction has been synced to HDFS.

*         Syncs are batched-- a sync is called if the queue has enough transactions or if a timer expires. (I would imagine that both the # of transactions to batch up as well as the timer are configurable knobs already?) In this mode, for the client, the latency increase on writes is upper bounded by the timer setting + the cost of the sync itself.




From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of stack
Sent: Tuesday, January 12, 2010 10:52 AM
To: hbase-dev@hadoop.apache.org
Cc: Kannan Muthukkaruppan; Dhruba Borthakur
Subject: Re: commit semantics

On Tue, Jan 12, 2010 at 10:14 AM, Dhruba Borthakur <dh...@gmail.com> wrote:
Hi stack,

I was meaning "what if the application inserted the same record into two
Hbase instances"? Of course, now the onus is on the appl to keep both of
them in sync and recover from any inconsistencies between them.

Ok.  Like your  "Overlapping Clusters for HA" from http://www.borthakur.com/ftp/hdfs_high_availability.pdf?

I'm not sure how the application could return after writing one cluster without waiting on the second to complete as you suggest above.  It could write in parallel but the second thread might not complete for myriad reasons.  What then?  And as you say, when reading, the client would have to do the reconciliation.

Isn't there already a 'scalable database' that gives you this headache for free without your having to do the work yourself (smile)?

Do you think there is a problem syncing on every write (with some batching of writes happening under high concurrency) or, if that is too slow for your needs, with holding clients until the sync happens as Joydeep suggests?  Will that be sufficient data integrity-wise?

St.Ack

Thanks,
St.Ack

Re: commit semantics

Posted by stack <st...@duboce.net>.
On Tue, Jan 12, 2010 at 10:14 AM, Dhruba Borthakur <dh...@gmail.com> wrote:

> Hi stack,
>
> I was meaning "what if the application inserted the same record into two
> Hbase instances"? Of course, now the onus is on the appl to keep both of
> them in sync and recover from any inconsistencies between them.
>
>
Ok.  Like your  "Overlapping Clusters for HA" from
http://www.borthakur.com/ftp/hdfs_high_availability.pdf?

I'm not sure how the application could return after writing one cluster
without waiting on the second to complete as you suggest above.  It could
write in parallel but the second thread might not complete for myriad
reasons.  What then?  And as you say, when reading, the client would have to
do the reconciliation.

Isn't there already a 'scalable database' that gives you this headache for
free without your having to do the work yourself (smile)?

Do you think there is a problem syncing on every write (with some batching of
writes happening under high concurrency) or, if that is too slow for your
needs, with holding clients until the sync happens as Joydeep suggests?  Will
that be sufficient data integrity-wise?

St.Ack

Thanks,
St.Ack

Re: commit semantics

Posted by Dhruba Borthakur <dh...@gmail.com>.
Hi stack,

I was meaning "what if the application inserted the same record into two
Hbase instances"? Of course, now the onus is on the appl to keep both of
them in sync and recover from any inconsistencies between them.

thanks,
dhruba

On Tue, Jan 12, 2010 at 9:58 AM, stack <st...@duboce.net> wrote:

> On Mon, Jan 11, 2010 at 10:25 PM, Dhruba Borthakur <dh...@gmail.com>
> wrote:
>
> > if we want the best of both worlds.. latency as well as data integrity,
> how
> > about inserting the same record into two completely separate HBase tables
> > in
> > parallel... the operation can complete as soon as the record is inserted
> > into the first HBase table (thus giving low latencies)
>
>
> Return after insert into the first table?  Then internally hbase is meant
> to
> take care of the insert into the second table?  What if the latter fails
> for
> some reason other than regionserver crash?
>
> The two writes would have to be done as hdfs does, in series, if the two
> tables are to remain in sync, with the addition of a roll back of the
> transaction if insert does not go through to both tables since we don't
> have
> something like the hdfs background thread ensuring replica counts.
>
>
> > but data integrity
> > will not be compromised because it is unlikely that two region servers
> will
> > fail exactly at the same time (assuming that there is a way to ensure
> that
> > these two tables are not handled by the same region server).
> >
>
> How do you suggest the application deal with reading from these two tables?
> If they are guaranteed in-sync, then it could pick either.  If the two can
> wander, then the application needs to read from both and make
> reconciliation
> somehow?
>
> Just trying to understand what you are suggesting Dhruba,
> St.Ack
>
>
>
> >
> > thanks,
> > dhruba
> >
> >
> > On Mon, Jan 11, 2010 at 8:12 PM, Joydeep Sarma <js...@apache.org>
> wrote:
> >
> > > ok - hadn't thought about it that way - but yeah with a default of 1 -
> > > the semantics seem correct.
> > >
> > > under high load - some batching would automatically happen at this
> > > setting (or so one would think - not sure if hdfs appends are blocked
> > > on pending syncs (in which case the batching wouldn't quite happen i
> > > think) - cc'ing Dhruba).
> > >
> > > if the performance with setting of 1 doesn't work out - we may need an
> > > option to delay acks until actual syncs .. (most likely we would be
> > > able to compromise on latency to get higher throughput - but wouldn't
> > > be willing to compromise on data integrity)
> > >
> > > > Hey Joydeep,
> > > >
> > > > This is actually intended this way but the name of the variable is
> > > > misleading. The sync is done only if forceSync or we have enough
> > > > entries to sync (default is 1). If someone wants to sync only 100
> > > > entries for example, they would play with that configuration.
> > > >
> > > > Hope that helps,
> > > >
> > > > J-D
> > > >
> > > >
> > > > On Mon, Jan 11, 2010 at 3:46 PM, Joydeep Sarma <js...@apache.org>
> > > wrote:
> > > >>
> > > >> Hey HBase-devs,
> > > >>
> > > >> we have been going through hbase code to come up to speed.
> > > >>
> > > >> One of the questions was regarding the commit semantics. Thumbing
> > > through the RegionServer code that's appending to the wal:
> > > >>
> > > >> syncWal -> HLog.sync -> addToSyncQueue ->syncDone.await()
> > > >>
> > > >> and the log writer thread calls:
> > > >>
> > > >> hflush(), syncDone.signalAll()
> > > >>
> > > >> however hflush doesn't necessarily call a sync on the underlying log
> > > file:
> > > >>
> > > >>       if (this.forceSync ||
> > > >>           this.unflushedEntries.get() >= this.flushlogentries) { ...
> > > sync() ... }
> > > >>
> > > >> so it seems that if forceSync is not true, the syncWal can unblock
> > > before a sync is called (and forcesync seems to be only true for
> > > metaregion()).
> > > >>
> > > >> are we missing something - or is there a bug here (the signalAll
> > should
> > > be conditional on hflush having actually flushed something).
> > > >>
> > > >> thanks,
> > > >>
> > > >> Joydeep
> > > >
> > >
> >
> >
> >
> > --
> > Connect to me at http://www.facebook.com/dhruba
> >
>



-- 
Connect to me at http://www.facebook.com/dhruba

Re: commit semantics

Posted by stack <st...@duboce.net>.
On Mon, Jan 11, 2010 at 10:25 PM, Dhruba Borthakur <dh...@gmail.com> wrote:

> if we want the best of both worlds.. latency as well as data integrity, how
> about inserting the same record into two completely separate HBase tables
> in
> parallel... the operation can complete as soon as the record is inserted
> into the first HBase table (thus giving low latencies)


Return after insert into the first table?  Then internally hbase is meant to
take care of the insert into the second table?  What if the latter fails for
some reason other than regionserver crash?

The two writes would have to be done as hdfs does, in series, if the two
tables are to remain in sync, with the addition of a rollback of the
transaction if the insert does not go through to both tables, since we don't
have something like the hdfs background thread ensuring replica counts.
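
As a sketch of the "in series, with rollback" point (a stand-in Table
interface, not the HBase client API):

    class DualWriteSketch {
      interface Table {
        void put(byte[] row, byte[] value) throws java.io.IOException;
        void delete(byte[] row) throws java.io.IOException;
      }

      // Insert into the first table, then the second; undo the first write if
      // the second fails. The rollback itself can of course fail too, which
      // is exactly the headache being discussed.
      static void dualWrite(Table first, Table second, byte[] row, byte[] value)
          throws java.io.IOException {
        first.put(row, value);
        try {
          second.put(row, value);
        } catch (java.io.IOException e) {
          first.delete(row);             // best-effort rollback
          throw e;
        }
      }
    }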


> but data integrity
> will not be compromised because it is unlikely that two region servers will
> fail exactly at the same time (assuming that there is a way to ensure that
> these two tables are not handled by the same region server).
>

How do you suggest the application deal with reading from these two tables?
If they are guaranteed in-sync, then it could pick either.  If the two can
wander, then the application needs to read from both and make reconciliation
somehow?

Just trying to understand what you are suggesting Dhruba,
St.Ack



>
> thanks,
> dhruba
>
>
> On Mon, Jan 11, 2010 at 8:12 PM, Joydeep Sarma <js...@apache.org> wrote:
>
> > ok - hadn't thought about it that way - but yeah with a default of 1 -
> > the semantics seem correct.
> >
> > under high load - some batching would automatically happen at this
> > setting (or so one would think - not sure if hdfs appends are blocked
> > on pending syncs (in which case the batching wouldn't quite happen i
> > think) - cc'ing Dhruba).
> >
> > if the performance with setting of 1 doesn't work out - we may need an
> > option to delay acks until actual syncs .. (most likely we would be
> > able to compromise on latency to get higher throughput - but wouldn't
> > be willing to compromise on data integrity)
> >
> > > Hey Joydeep,
> > >
> > > This is actually intended this way but the name of the variable is
> > > misleading. The sync is done only if forceSync or we have enough
> > > entries to sync (default is 1). If someone wants to sync only 100
> > > entries for example, they would play with that configuration.
> > >
> > > Hope that helps,
> > >
> > > J-D
> > >
> > >
> > > On Mon, Jan 11, 2010 at 3:46 PM, Joydeep Sarma <js...@apache.org>
> > wrote:
> > >>
> > >> Hey HBase-devs,
> > >>
> > >> we have been going through hbase code to come up to speed.
> > >>
> > >> One of the questions was regarding the commit semantics. Thumbing
> > through the RegionServer code that's appending to the wal:
> > >>
> > >> syncWal -> HLog.sync -> addToSyncQueue ->syncDone.await()
> > >>
> > >> and the log writer thread calls:
> > >>
> > >> hflush(), syncDone.signalAll()
> > >>
> > >> however hflush doesn't necessarily call a sync on the underlying log
> > file:
> > >>
> > >>       if (this.forceSync ||
> > >>           this.unflushedEntries.get() >= this.flushlogentries) { ...
> > sync() ... }
> > >>
> > >> so it seems that if forceSync is not true, the syncWal can unblock
> > before a sync is called (and forcesync seems to be only true for
> > metaregion()).
> > >>
> > >> are we missing something - or is there a bug here (the signalAll
> should
> > be conditional on hflush having actually flushed something).
> > >>
> > >> thanks,
> > >>
> > >> Joydeep
> > >
> >
>
>
>
> --
> Connect to me at http://www.facebook.com/dhruba
>

Re: commit semantics

Posted by Ryan Rawson <ry...@gmail.com>.
On Tue, Jan 12, 2010 at 12:24 AM, Dhruba Borthakur <dh...@gmail.com> wrote:
> Hi Ryan,
>
> thanks for ur response.
>
>>Right now each regionserver has 1 log, so if 2 puts on different
>>tables hit the same RS, they hit the same HLog.
>
> I understand. My point was that the application could insert the same record
> into two different tables on two different Hbase instances on two different
> piece of hardware.

Ah yes, of course, I thought you meant 2 tables in the same cluster.

>
> On a related note, can somebody explain what the tradeoff is if each region
> has its own hlog? are you worried about the number of files in HDFS? or
> maybe the number of sync-threads in the region server? Can multiple hlog
> files provide faster region splits?

So each hlog needs to be treated as a stream of edits for log
recovery. Adding more logs requires the code to still treat the
pool as 1 log and keep an overall ordering across all logs as a merged
set.  It just adds complexity, and I'd like to put it off as long as
possible.  Initially, when I was worried about performance issues,
adding a pool only improved performance by a linear amount, and I
was looking for substantially more than that.
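
A hedged sketch of that complexity (names like LogEntry/LogReader are made
up): with several logs per server, recovery has to merge them back into one
stream ordered by sequence id:

    import java.util.Comparator;
    import java.util.List;
    import java.util.PriorityQueue;
    import java.util.function.Consumer;

    class MergedLogReplaySketch {
      static class LogEntry { long seqId; byte[] edit; }
      interface LogReader { LogEntry next(); }   // null once this log is done

      private static class Head {
        final LogEntry entry; final LogReader reader;
        Head(LogEntry e, LogReader r) { entry = e; reader = r; }
      }

      // Replays edits from all logs in global sequence-id order.
      static void replay(List<LogReader> logs, Consumer<LogEntry> apply) {
        PriorityQueue<Head> heap =
            new PriorityQueue<>(Comparator.<Head>comparingLong(h -> h.entry.seqId));
        for (LogReader r : logs) {
          LogEntry e = r.next();
          if (e != null) heap.add(new Head(e, r));
        }
        while (!heap.isEmpty()) {
          Head h = heap.poll();
          apply.accept(h.entry);                 // apply edits strictly in order
          LogEntry next = h.reader.next();
          if (next != null) heap.add(new Head(next, h.reader));
        }
      }
    }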


>
>
>> I've thought about this issue quite a bit, and I think the sync every
>> 1 rows combined with optional no-sync and low time sync() is the way
>> to go. If you want to discuss this more in person, maybe we can meet
>> up for brews or something.
>>
>
> The group-commit thing I can understand. HDFS does a very similar thing. But
> can you explain your alternative "sync every 1 rows combined with optional
> no-sync and low time sync"? For those applications that have the natural
> characteristics of updating only one row per logical operation, how can they
> be sure that their data has reached some-sort-of-stable-storage unless they
> sync after every row update?

Normally this would be the case, but consider the case of the call
'incrementColumnValue', which essentially maintains a counter. Losing
some edits means losing counter values - if we are talking about a
counter that is incremented 100m times a day, then speed is more
important than potentially losing some extremely small number of
updates when a server crashes.

-ryan


>
> thanks,
> dhruba
>

Re: commit semantics

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Good idea, let me try it.

J-D

On Wed, Jan 13, 2010 at 11:01 AM, Joydeep Sarma <js...@gmail.com> wrote:
> i posted on the jira as well - but we should be able to simulate the
> effect of the patch.
>
> if the sync was simulated merely a sleep (for 2-3ms - whatever is the
> average RTT for dfs write pipeline) instead of an actual call into dfs
> client - it should simulate the effect of the patch. (the appends
> would proceed in parallel, each sync would block for sometime).
>
> so we should be able to test whether this gets a performance win for
> the queue threshold=1 case.
>
> On Wed, Jan 13, 2010 at 10:43 AM, Dhruba Borthakur <dh...@gmail.com> wrote:
>> Awesome, I will try to post a patch soon and  will let you know as soon as I
>> have the first version ready.
>>
>> thanks,
>> dhruba
>>
>>
>> On Wed, Jan 13, 2010 at 10:40 AM, Jean-Daniel Cryans <jd...@apache.org>wrote:
>>
>>> I'll be happy to benchmark, we already have code to test the
>>> multi-client hitting 1 region server case.
>>> J-D
>>>
>>> On Wed, Jan 13, 2010 at 10:38 AM, Dhruba Borthakur <dh...@gmail.com>
>>> wrote:
>>> > I will try to make a patch for it first. depending on the complexity of
>>> the
>>> > patch code, we can decide which release it can go in.
>>> >
>>> > thanks,
>>> > dhruba
>>> >
>>> > On Wed, Jan 13, 2010 at 9:56 AM, Jean-Daniel Cryans <jdcryans@apache.org
>>> >wrote:
>>> >
>>> >> That's great dhruba, I guess the sooner it could go in is 0.21.1?
>>> >>
>>> >> J-D
>>> >>
>>> >> On Wed, Jan 13, 2010 at 8:51 AM, Dhruba Borthakur <dh...@gmail.com>
>>> >> wrote:
>>> >> > I opened http://issues.apache.org/jira/browse/HDFS-895 for this one.
>>> >> >
>>> >> > thanks,
>>> >> > dhruba
>>> >> >
>>> >> > On Tue, Jan 12, 2010 at 9:41 PM, Joydeep Sarma <js...@gmail.com>
>>> >> wrote:
>>> >> >
>>> >> >> this is internal to the dfsclient. this would explain why performance
>>> >> >> would suck with queue threshold of 1.
>>> >> >>
>>> >> >> leave it up to Dhruba to explain the details.
>>> >> >>
>>> >> >> On Tue, Jan 12, 2010 at 9:16 PM, stack <st...@duboce.net> wrote:
>>> >> >> > On Tue, Jan 12, 2010 at 9:12 PM, stack <st...@duboce.net> wrote:
>>> >> >> >
>>> >> >> >> > any IO to a HDFS-file (appends, writes, etc) are actually blocked
>>> on
>>> >> a
>>> >> >> >> > pending sync. "sync" in HDFS is a pretty heavyweight operation
>>> as
>>> >> it
>>> >> >> >> stands.
>>> >> >> >>
>>> >> >> >> i think this is likely to explain limited throughput with the
>>> default
>>> >> >> >> write queue threshold of 1. if the appends cannot make progress
>>> while
>>> >> >> >> one is waiting for the sync - then the write pipeline is going to
>>> be
>>> >> >> >> idle most of the time (with queue threshold of 1).
>>> >> >> >>
>>> >> >> >> i think it would be good to have the sync not block other writers
>>> on
>>> >> >> >> the file/pipeline. logically - it's not clear why it needs to
>>> (since
>>> >> >> >> the sync is just a wait for the completion as of some write
>>> >> >> >> transaction id - allowing new ones to be queued up subsequently).
>>> >> >> >
>>> >> >> >
>>> >> >> > Are you talking about internal to DFSClient Joydeep?  Or some
>>> >> >> > synchronization block up in hlog?
>>> >> >> >
>>> >> >> > St.Ack
>>> >> >> >
>>> >> >>
>>> >> >
>>> >> >
>>> >> >
>>> >> > --
>>> >> > Connect to me at http://www.facebook.com/dhruba
>>> >> >
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > Connect to me at http://www.facebook.com/dhruba
>>> >
>>>
>>
>>
>>
>> --
>> Connect to me at http://www.facebook.com/dhruba
>>
>

Re: commit semantics

Posted by Joydeep Sarma <js...@gmail.com>.
i posted on the jira as well - but we should be able to simulate the
effect of the patch.

if the sync was simulated merely as a sleep (for 2-3ms - whatever is the
average RTT for the dfs write pipeline) instead of an actual call into the
dfs client - it should simulate the effect of the patch. (the appends
would proceed in parallel, each sync would block for some time).

so we should be able to test whether this gets a performance win for
the queue threshold=1 case.
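
A hedged sketch of that simulation (the 3ms figure and thread counts are
placeholders): each simulated sync just sleeps for roughly one pipeline round
trip, so appends are never blocked and we can measure the ceiling for the
threshold=1 case:

    class SyncSleepSimulation {
      static final long SIMULATED_SYNC_MS = 3;    // assumed dfs pipeline RTT

      public static void main(String[] args) throws InterruptedException {
        final int writers = 20;
        final int opsPerWriter = 500;
        long start = System.currentTimeMillis();
        Thread[] threads = new Thread[writers];
        for (int i = 0; i < writers; i++) {
          threads[i] = new Thread(new Runnable() {
            public void run() {
              for (int op = 0; op < opsPerWriter; op++) {
                // append(edit) would go here; assumed not to block on syncs
                try {
                  Thread.sleep(SIMULATED_SYNC_MS);   // stand-in for sync()
                } catch (InterruptedException ignored) {
                }
              }
            }
          });
          threads[i].start();
        }
        for (Thread t : threads) {
          t.join();
        }
        long elapsedMs = System.currentTimeMillis() - start;
        System.out.println((writers * opsPerWriter * 1000L / elapsedMs)
            + " simulated commits/sec");
      }
    }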

On Wed, Jan 13, 2010 at 10:43 AM, Dhruba Borthakur <dh...@gmail.com> wrote:
> Awesome, I will try to post a patch soon and  will let you know as soon as I
> have the first version ready.
>
> thanks,
> dhruba
>
>
> On Wed, Jan 13, 2010 at 10:40 AM, Jean-Daniel Cryans <jd...@apache.org>wrote:
>
>> I'll be happy to benchmark, we already have code to test the
>> multi-client hitting 1 region server case.
>> J-D
>>
>> On Wed, Jan 13, 2010 at 10:38 AM, Dhruba Borthakur <dh...@gmail.com>
>> wrote:
>> > I will try to make a patch for it first. depending on the complexity of
>> the
>> > patch code, we can decide which release it can go in.
>> >
>> > thanks,
>> > dhruba
>> >
>> > On Wed, Jan 13, 2010 at 9:56 AM, Jean-Daniel Cryans <jdcryans@apache.org
>> >wrote:
>> >
>> >> That's great dhruba, I guess the sooner it could go in is 0.21.1?
>> >>
>> >> J-D
>> >>
>> >> On Wed, Jan 13, 2010 at 8:51 AM, Dhruba Borthakur <dh...@gmail.com>
>> >> wrote:
>> >> > I opened http://issues.apache.org/jira/browse/HDFS-895 for this one.
>> >> >
>> >> > thanks,
>> >> > dhruba
>> >> >
>> >> > On Tue, Jan 12, 2010 at 9:41 PM, Joydeep Sarma <js...@gmail.com>
>> >> wrote:
>> >> >
>> >> >> this is internal to the dfsclient. this would explain why performance
>> >> >> would suck with queue threshold of 1.
>> >> >>
>> >> >> leave it up to Dhruba to explain the details.
>> >> >>
>> >> >> On Tue, Jan 12, 2010 at 9:16 PM, stack <st...@duboce.net> wrote:
>> >> >> > On Tue, Jan 12, 2010 at 9:12 PM, stack <st...@duboce.net> wrote:
>> >> >> >
>> >> >> >> > any IO to a HDFS-file (appends, writes, etc) are actually blocked
>> on
>> >> a
>> >> >> >> > pending sync. "sync" in HDFS is a pretty heavyweight operation
>> as
>> >> it
>> >> >> >> stands.
>> >> >> >>
>> >> >> >> i think this is likely to explain limited throughput with the
>> default
>> >> >> >> write queue threshold of 1. if the appends cannot make progress
>> while
>> >> >> >> one is waiting for the sync - then the write pipeline is going to
>> be
>> >> >> >> idle most of the time (with queue threshold of 1).
>> >> >> >>
>> >> >> >> i think it would be good to have the sync not block other writers
>> on
>> >> >> >> the file/pipeline. logically - it's not clear why it needs to
>> (since
>> >> >> >> the sync is just a wait for the completion as of some write
>> >> >> >> transaction id - allowing new ones to be queued up subsequently).
>> >> >> >
>> >> >> >
>> >> >> > Are you talking about internal to DFSClient Joydeep?  Or some
>> >> >> > synchronization block up in hlog?
>> >> >> >
>> >> >> > St.Ack
>> >> >> >
>> >> >>
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Connect to me at http://www.facebook.com/dhruba
>> >> >
>> >>
>> >
>> >
>> >
>> > --
>> > Connect to me at http://www.facebook.com/dhruba
>> >
>>
>
>
>
> --
> Connect to me at http://www.facebook.com/dhruba
>

Re: commit semantics

Posted by Dhruba Borthakur <dh...@gmail.com>.
Awesome, I will try to post a patch soon and  will let you know as soon as I
have the first version ready.

thanks,
dhruba


On Wed, Jan 13, 2010 at 10:40 AM, Jean-Daniel Cryans <jd...@apache.org>wrote:

> I'll be happy to benchmark, we already have code to test the
> multi-client hitting 1 region server case.
> J-D
>
> On Wed, Jan 13, 2010 at 10:38 AM, Dhruba Borthakur <dh...@gmail.com>
> wrote:
> > I will try to make a patch for it first. depending on the complexity of
> the
> > patch code, we can decide which release it can go in.
> >
> > thanks,
> > dhruba
> >
> > On Wed, Jan 13, 2010 at 9:56 AM, Jean-Daniel Cryans <jdcryans@apache.org
> >wrote:
> >
> >> That's great dhruba, I guess the sooner it could go in is 0.21.1?
> >>
> >> J-D
> >>
> >> On Wed, Jan 13, 2010 at 8:51 AM, Dhruba Borthakur <dh...@gmail.com>
> >> wrote:
> >> > I opened http://issues.apache.org/jira/browse/HDFS-895 for this one.
> >> >
> >> > thanks,
> >> > dhruba
> >> >
> >> > On Tue, Jan 12, 2010 at 9:41 PM, Joydeep Sarma <js...@gmail.com>
> >> wrote:
> >> >
> >> >> this is internal to the dfsclient. this would explain why performance
> >> >> would suck with queue threshold of 1.
> >> >>
> >> >> leave it up to Dhruba to explain the details.
> >> >>
> >> >> On Tue, Jan 12, 2010 at 9:16 PM, stack <st...@duboce.net> wrote:
> >> >> > On Tue, Jan 12, 2010 at 9:12 PM, stack <st...@duboce.net> wrote:
> >> >> >
> >> >> >> > any IO to a HDFS-file (appends, writes, etc) are actually blocked
> on
> >> a
> >> >> >> > pending sync. "sync" in HDFS is a pretty heavyweight operation
> as
> >> it
> >> >> >> stands.
> >> >> >>
> >> >> >> i think this is likely to explain limited throughput with the
> default
> >> >> >> write queue threshold of 1. if the appends cannot make progress
> while
> >> >> >> one is waiting for the sync - then the write pipeline is going to
> be
> >> >> >> idle most of the time (with queue threshold of 1).
> >> >> >>
> >> >> >> i think it would be good to have the sync not block other writers
> on
> >> >> >> the file/pipeline. logically - it's not clear why it needs to
> (since
> >> >> >> the sync is just a wait for the completion as of some write
> >> >> >> transaction id - allowing new ones to be queued up subsequently).
> >> >> >
> >> >> >
> >> >> > Are you talking about internal to DFSClient Joydeep?  Or some
> >> >> > synchronization block up in hlog?
> >> >> >
> >> >> > St.Ack
> >> >> >
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > Connect to me at http://www.facebook.com/dhruba
> >> >
> >>
> >
> >
> >
> > --
> > Connect to me at http://www.facebook.com/dhruba
> >
>



-- 
Connect to me at http://www.facebook.com/dhruba

Re: commit semantics

Posted by Jean-Daniel Cryans <jd...@apache.org>.
I'll be happy to benchmark, we already have code to test the
multi-client hitting 1 region server case.

J-D

On Wed, Jan 13, 2010 at 10:38 AM, Dhruba Borthakur <dh...@gmail.com> wrote:
> I will try to make a patch for it first. depending on the complexity of the
> patch code, we can decide which release it can go in.
>
> thanks,
> dhruba
>
> On Wed, Jan 13, 2010 at 9:56 AM, Jean-Daniel Cryans <jd...@apache.org>wrote:
>
>> That's great dhruba, I guess the sooner it could go in is 0.21.1?
>>
>> J-D
>>
>> On Wed, Jan 13, 2010 at 8:51 AM, Dhruba Borthakur <dh...@gmail.com>
>> wrote:
>> > I opened http://issues.apache.org/jira/browse/HDFS-895 for this one.
>> >
>> > thanks,
>> > dhruba
>> >
>> > On Tue, Jan 12, 2010 at 9:41 PM, Joydeep Sarma <js...@gmail.com>
>> wrote:
>> >
>> >> this is internal to the dfsclient. this would explain why performance
>> >> would suck with queue threshold of 1.
>> >>
>> >> leave it up to Dhruba to explain the details.
>> >>
>> >> On Tue, Jan 12, 2010 at 9:16 PM, stack <st...@duboce.net> wrote:
>> >> > On Tue, Jan 12, 2010 at 9:12 PM, stack <st...@duboce.net> wrote:
>> >> >
>> >> >> > any IO to a HDFS-file (appends, writes, etc) are actually blocked on
>> a
>> >> >> > pending sync. "sync" in HDFS is a pretty heavyweight operation as
>> it
>> >> >> stands.
>> >> >>
>> >> >> i think this is likely to explain limited throughput with the default
>> >> >> write queue threshold of 1. if the appends cannot make progress while
>> >> >> one is waiting for the sync - then the write pipeline is going to be
>> >> >> idle most of the time (with queue threshold of 1).
>> >> >>
>> >> >> i think it would be good to have the sync not block other writers on
>> >> >> the file/pipeline. logically - it's not clear why it needs to (since
>> >> >> the sync is just a wait for the completion as of some write
>> >> >> transaction id - allowing new ones to be queued up subsequently).
>> >> >
>> >> >
>> >> > Are you talking about internal to DFSClient Joydeep?  Or some
>> >> > synchronization block up in hlog?
>> >> >
>> >> > St.Ack
>> >> >
>> >>
>> >
>> >
>> >
>> > --
>> > Connect to me at http://www.facebook.com/dhruba
>> >
>>
>
>
>
> --
> Connect to me at http://www.facebook.com/dhruba
>

Re: commit semantics

Posted by Dhruba Borthakur <dh...@gmail.com>.
I will try to make a patch for it first. Depending on the complexity of the
patch code, we can decide which release it can go in.

thanks,
dhruba

On Wed, Jan 13, 2010 at 9:56 AM, Jean-Daniel Cryans <jd...@apache.org>wrote:

> That's great dhruba, I guess the sooner it could go in is 0.21.1?
>
> J-D
>
> On Wed, Jan 13, 2010 at 8:51 AM, Dhruba Borthakur <dh...@gmail.com>
> wrote:
> > I opened http://issues.apache.org/jira/browse/HDFS-895 for this one.
> >
> > thanks,
> > dhruba
> >
> > On Tue, Jan 12, 2010 at 9:41 PM, Joydeep Sarma <js...@gmail.com>
> wrote:
> >
> >> this is internal to the dfsclient. this would explain why performance
> >> would suck with queue threshold of 1.
> >>
> >> leave it up to Dhruba to explain the details.
> >>
> >> On Tue, Jan 12, 2010 at 9:16 PM, stack <st...@duboce.net> wrote:
> >> > On Tue, Jan 12, 2010 at 9:12 PM, stack <st...@duboce.net> wrote:
> >> >
> >> >> > any IO to a HDFS-file (appends, writes, etc) are actually blocked on
> a
> >> >> > pending sync. "sync" in HDFS is a pretty heavyweight operation as
> it
> >> >> stands.
> >> >>
> >> >> i think this is likely to explain limited throughput with the default
> >> >> write queue threshold of 1. if the appends cannot make progress while
> >> >> one is waiting for the sync - then the write pipeline is going to be
> >> >> idle most of the time (with queue threshold of 1).
> >> >>
> >> >> i think it would be good to have the sync not block other writers on
> >> >> the file/pipeline. logically - it's not clear why it needs to (since
> >> >> the sync is just a wait for the completion as of some write
> >> >> transaction id - allowing new ones to be queued up subsequently).
> >> >
> >> >
> >> > Are you talking about internal to DFSClient Joydeep?  Or some
> >> > synchronization block up in hlog?
> >> >
> >> > St.Ack
> >> >
> >>
> >
> >
> >
> > --
> > Connect to me at http://www.facebook.com/dhruba
> >
>



-- 
Connect to me at http://www.facebook.com/dhruba

Re: commit semantics

Posted by Jean-Daniel Cryans <jd...@apache.org>.
That's great dhruba, I guess the soonest it could go in is 0.21.1?

J-D

On Wed, Jan 13, 2010 at 8:51 AM, Dhruba Borthakur <dh...@gmail.com> wrote:
> I opened http://issues.apache.org/jira/browse/HDFS-895 for this one.
>
> thanks,
> dhruba
>
> On Tue, Jan 12, 2010 at 9:41 PM, Joydeep Sarma <js...@gmail.com> wrote:
>
>> this is internal to the dfsclient. this would explain why performance
>> would suck with queue threshold of 1.
>>
>> leave it up to Dhruba to explain the details.
>>
>> On Tue, Jan 12, 2010 at 9:16 PM, stack <st...@duboce.net> wrote:
>> > On Tue, Jan 12, 2010 at 9:12 PM, stack <st...@duboce.net> wrote:
>> >
>> >> > any IO to a HDFS-file (appends, writes, etc) are actually blocked on a
>> >> > pending sync. "sync" in HDFS is a pretty heavyweight operation as it
>> >> stands.
>> >>
>> >> i think this is likely to explain limited throughput with the default
>> >> write queue threshold of 1. if the appends cannot make progress while
>> >> one is waiting for the sync - then the write pipeline is going to be
>> >> idle most of the time (with queue threshold of 1).
>> >>
>> >> i think it would be good to have the sync not block other writers on
>> >> the file/pipeline. logically - it's not clear why it needs to (since
>> >> the sync is just a wait for the completion as of some write
>> >> transaction id - allowing new ones to be queued up subsequently).
>> >
>> >
>> > Are you talking about internal to DFSClient Joydeep?  Or some
>> > synchronization block up in hlog?
>> >
>> > St.Ack
>> >
>>
>
>
>
> --
> Connect to me at http://www.facebook.com/dhruba
>

Re: commit semantics

Posted by Dhruba Borthakur <dh...@gmail.com>.
I opened http://issues.apache.org/jira/browse/HDFS-895 for this one.

thanks,
dhruba

On Tue, Jan 12, 2010 at 9:41 PM, Joydeep Sarma <js...@gmail.com> wrote:

> this is internal to the dfsclient. this would explain why performance
> would suck with queue threshold of 1.
>
> leave it up to Dhruba to explain the details.
>
> On Tue, Jan 12, 2010 at 9:16 PM, stack <st...@duboce.net> wrote:
> > On Tue, Jan 12, 2010 at 9:12 PM, stack <st...@duboce.net> wrote:
> >
> >> > any IO to a HDFS-file (appends, writes, etc) are actually blocked on a
> >> > pending sync. "sync" in HDFS is a pretty heavyweight operation as it
> >> stands.
> >>
> >> i think this is likely to explain limited throughput with the default
> >> write queue threshold of 1. if the appends cannot make progress while
> >> one is waiting for the sync - then the write pipeline is going to be
> >> idle most of the time (with queue threshold of 1).
> >>
> >> i think it would be good to have the sync not block other writers on
> >> the file/pipeline. logically - it's not clear why it needs to (since
> >> the sync is just a wait for the completion as of some write
> >> transaction id - allowing new ones to be queued up subsequently).
> >
> >
> > Are you talking about internal to DFSClient Joydeep?  Or some
> > synchronization block up in hlog?
> >
> > St.Ack
> >
>



-- 
Connect to me at http://www.facebook.com/dhruba

Re: commit semantics

Posted by Joydeep Sarma <js...@gmail.com>.
this is internal to the dfsclient. this would explain why performance
would suck with queue threshold of 1.

leave it up to Dhruba to explain the details.

On Tue, Jan 12, 2010 at 9:16 PM, stack <st...@duboce.net> wrote:
> On Tue, Jan 12, 2010 at 9:12 PM, stack <st...@duboce.net> wrote:
>
>> > any IO to a HDFS-file (appends, writes, etc) are actually blocked on a
>> > pending sync. "sync" in HDFS is a pretty heavyweight operation as it
>> stands.
>>
>> i think this is likely to explain limited throughput with the default
>> write queue threshold of 1. if the appends cannot make progress while
>> one is waiting for the sync - then the write pipeline is going to be
>> idle most of the time (with queue threshold of 1).
>>
>> i think it would be good to have the sync not block other writers on
>> the file/pipeline. logically - it's not clear why it needs to (since
>> the sync is just a wait for the completion as of some write
>> transaction id - allowing new ones to be queued up subsequently).
>
>
> Are you talking about internal to DFSClient Joydeep?  Or some
> synchronization block up in hlog?
>
> St.Ack
>

Re: commit semantics

Posted by stack <st...@duboce.net>.
On Tue, Jan 12, 2010 at 9:12 PM, stack <st...@duboce.net> wrote:

> > any IO to a HDFS-file (appends, writes, etc) are actually blocked on a
> > pending sync. "sync" in HDFS is a pretty heavyweight operation as it
> stands.
>
> i think this is likely to explain limited throughput with the default
> write queue threshold of 1. if the appends cannot make progress while
> one is waiting for the sync - then the write pipeline is going to be
> idle most of the time (with queue threshold of 1).
>
> i think it would be good to have the sync not block other writers on
> the file/pipeline. logically - it's not clear why it needs to (since
> the sync is just a wait for the completion as of some write
> transaction id - allowing new ones to be queued up subsequently).


Are you talking about internal to DFSClient Joydeep?  Or some
synchronization block up in hlog?

St.Ack

Re: commit semantics

Posted by stack <st...@duboce.net>.
(Below is a note from Joydeep.  Something about Joydeep's messages is
requiring that I approve/disapprove them.  For the message below, I hit
disapprove by mistake so am copying it here manually)

---------- Forwarded message ----------
From: Joydeep Sarma <js...@apache.org>
To: hbase-dev@hadoop.apache.org, kannan@facebook.com, Dhruba Borthakur <
dhruba@facebook.com>
Date: Tue, 12 Jan 2010 15:39:05 -0800
Subject: Re: commit semantics
btw - i followed up with Dhruba afterwards on this comment:

> any IO to a HDFS-file (appends, writes, etc) are actually blocked on a
> pending sync. "sync" in HDFS is a pretty heavyweight operation as it
stands.

i think this is likely to explain limited throughput with the default
write queue threshold of 1. if the appends cannot make progress while
one is waiting for the sync - then the write pipeline is going to be
idle most of the time (with queue threshold of 1).

i think it would be good to have the sync not block other writers on
the file/pipeline. logically - it's not clear why it needs to (since
the sync is just a wait for the completion as of some write
transaction id - allowing new ones to be queued up subsequently).
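
A minimal sketch of that idea, assuming sequence numbers per queued write
(illustrative only, not DFSClient code): appends keep getting queued while
each sync() call only waits for acks to reach its own write's id:

    import java.util.concurrent.atomic.AtomicLong;

    class NonBlockingSyncSketch {
      private final AtomicLong lastQueuedSeq = new AtomicLong(0);
      private long lastAckedSeq = 0;
      private final Object ackLock = new Object();

      // Appends never wait on an in-flight sync; they just take the next id.
      long append(byte[] data) {
        // queue 'data' on the write pipeline (omitted) ...
        return lastQueuedSeq.incrementAndGet();
      }

      // Called by the pipeline when everything up to 'seq' is acknowledged.
      void onAck(long seq) {
        synchronized (ackLock) {
          lastAckedSeq = Math.max(lastAckedSeq, seq);
          ackLock.notifyAll();
        }
      }

      // Blocks only the caller, and only until its own write is covered.
      void sync(long seq) throws InterruptedException {
        synchronized (ackLock) {
          while (lastAckedSeq < seq) {
            ackLock.wait();
          }
        }
      }
    }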

Joydeep

Re: commit semantics

Posted by stack <st...@duboce.net>.
On Tue, Jan 12, 2010 at 1:07 PM, Kannan Muthukkaruppan
<Ka...@facebook.com>wrote:

>
> Seems like we all generally agree that large number of regions per region
> server may not be the way to go.
>

What Andrew says.  You could make regions bigger, so more data per
regionserver but the same rough (small) number of regions to redeploy on a
crash; the logs to replay will be correspondingly bigger, though, taking
longer to process.



> So coming back to Dhruba's question on having one commit log per region
> instead of one commit log per region server. Is the number of HDFS files
> open still a major concern?
>

Yes.  From "Commit-log implementation" section of the BT paper:

"If we kept the commit log for each tablet in a separate log file, a very
large number of files would be written concurrently in GFS. Depending on the
underlying file system implementation on each GFS server, these writes could
cause a large number of disk seeks to write to the different physical log
files. In addition, having separate log files per tablet also reduces the
effectiveness of the group commit optimization, since groups would tend to
be smaller. To fix these issues, we append mutations to a single commit log
per tablet server, co-mingling mutations for different tablets in the same
physical log file."

Not knowing any better, we presume hdfs is kinda-like gfs.


>
> Is my understanding correct that unavailability window during region server
> failover is large due to the time it takes to split the shared commit log
> into a per region log?


Yes, though truth be told, this area of hbase performance has had very
little attention paid to it.  There are things that we could do much better
-- e.g. distributed split instead of threaded split inside a single
process -- and ideas for making it so we can take on writes much sooner than
we currently do; e.g. open regions immediately on new server before split
completes.
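
As a rough sketch of what that split involves (illustrative names, not the
actual HLog split code): one pass over the dead server's shared log, demuxing
edits into per-region sets before those regions can be reopened:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class LogSplitSketch {
      static class Edit { String regionName; long seqId; byte[] kv; }
      interface HLogReader { Edit next(); }     // null at end of the shared log

      // Regions from the dead server stay unavailable until this pass is done
      // and each per-region list has been written out for replay.
      static Map<String, List<Edit>> split(HLogReader sharedLog) {
        Map<String, List<Edit>> perRegion = new HashMap<String, List<Edit>>();
        for (Edit e = sharedLog.next(); e != null; e = sharedLog.next()) {
          List<Edit> edits = perRegion.get(e.regionName);
          if (edits == null) {
            edits = new ArrayList<Edit>();
            perRegion.put(e.regionName, edits);
          }
          edits.add(e);
        }
        return perRegion;
      }
    }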


> Instead, if we always had per-region commit logs even in the normal mode of
> operation, then the unavailability window would be minimized? It does
> minimize the extent of batch/group commits you can do though-- since you can
> only batch updates going to the same region. Any other gotchas/issues?



Just those listed above.

St.Ack

Re: commit semantics

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Even with 100 regions, times 1000 region servers, we are talking about
potentially having 100 000 open files instead of 1000 (and we also
have to count every replica).

I guess that an OS that was configured for such usage would be able to
sustain it... You would have to watch that metric cluster-wide, get
new nodes when needed, etc.

Then you need to make sure that GC pauses won't block for too long if
you want a very low unavailability time.

J-D

On Tue, Jan 12, 2010 at 1:07 PM, Kannan Muthukkaruppan
<Ka...@facebook.com> wrote:
>> I presume you intend to run HBase region servers
>> colocated with HDFS DataNodes.
>
> Yes.
>
> ---
>
> Seems like we all generally agree that large number of regions per region server may not be the way to go.
>
> So coming back to Dhruba's question on having one commit log per region instead of one commit log per region server. Is the number of HDFS files open still a major concern?
>
> Is my understanding correct that unavailability window during region server failover is large due to the time it takes to split the shared commit log into a per region log? Instead, if we always had per-region commit logs even in the normal mode of operation, then the unavailability window would be minimized? It does minimize the extent of batch/group commits you can do though-- since you can only batch updates going to the same region. Any other gotchas/issues?
>
> regards,
> Kannan
> -----Original Message-----
> From: Andrew Purtell [mailto:apurtell@apache.org]
> Sent: Tuesday, January 12, 2010 12:50 PM
> To: hbase-dev@hadoop.apache.org
> Subject: Re: commit semantics
>
>> But would say having a
>> smaller number of regions per region server (say ~50) be really bad.
>
> Not at all.
>
> There are some (test) HBase deployments I know of that go pretty
> vertical, multiple TBs of disk on each node therefore wanting a high
> number of regions per region server to match that density. That may meet
> with operational success but it is architecturally suspect. I ran a test
> cluster once with > 1,000 regions per server on 25 servers, in the 0.19
> timeframe. 0.20 is much better in terms of resource demand (less) and
> liveness (enormously improved), but I still wouldn't recommend it,
> unless your clients can wait for up to several minutes on blocked reads
> and writes to affected regions should a node go down. With that many
> regions per server,  it stands to reason just about every client would be
> affected.
>
> The numbers I have for Google's canonical BigTable deployment are several
> years out of date but they go pretty far in the other direction -- about
> 100 regions per server is the target.
>
> I think it also depends on whether you intend to colocate TaskTrackers
> with the region servers. I presume you intend to run HBase region servers
> colocated with HDFS DataNodes. After you have a HBase cluster up for some
> number of hours, certainly ~24, background compaction will bring the HDFS
> blocks backing region data local to the server, generally. MapReduce
> tasks backed by HBase tables will see similar advantages of data locality
> that you are probably accustomed to with working with files in HDFS. If
> you mix storage and computation this way it makes sense to seek a balance
> between the amount of data stored on each node (number of regions being
> served) and the available computational resources (available CPU cores,
> time constraints (if any) on task execution).
>
> Even if you don't intend to do the above, it's possible that an overly
> high region density can negatively impact performance if too much I/O
> load is placed on average on each region server. Adding more servers to
> spread load would then likely help**.
>
> These considerations bias against hosting a very large number of regions
> per region server.
>
>   - Andy
>
> **: I say likely because this presumes query and edit patterns have been
> guided as necessary through engineering to be widely distributed in the
> key space. You have to take some care to avoid hot regions.
>
>
> ----- Original Message ----
>> From: Kannan Muthukkaruppan <Ka...@facebook.com>
>> To: "hbase-dev@hadoop.apache.org" <hb...@hadoop.apache.org>
>> Sent: Tue, January 12, 2010 11:40:00 AM
>> Subject: RE: commit semantics
>>
>> Btw, is there much gains in having a large number of regions-- i.e. to the tune
>> of 500 -- per region server?
>>
>> I understand that having multiple regions per region server allows finer grained
>> rebalancing when new nodes are added or a node goes down. But would say having a
>> smaller number of regions per region server (say ~50) be really bad. If a region
>> server goes down, 50 other nodes would pick up ~1/50 of its work. Not as good as
>> 500 other nodes picking up 1/500 of its work each-- but seems acceptable still.
>> Are there other advantages of having a large number of regions per region
>> server?
>>
>> regards,
>> Kannan
>> -----Original Message-----
>> From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of Jean-Daniel
>> Cryans
>> Sent: Tuesday, January 12, 2010 9:42 AM
>> To: hbase-dev@hadoop.apache.org
>> Subject: Re: commit semantics
>>
>> wrt 1 HLog per region server, this is from the Bigtable paper. Their
>> main concern is the number of opened files since if you have 1000
>> region servers * 500 regions then you may have 100 000 HLogs to
>> manage. Also you can have more than one file per HLog, so let's say
>> you have on average 5 log files per HLog that's 500 000 files on HDFS.
>>
>> J-D
>>
>> On Tue, Jan 12, 2010 at 12:24 AM, Dhruba Borthakur wrote:
>> > Hi Ryan,
>> >
>> > thanks for ur response.
>> >
>> >>Right now each regionserver has 1 log, so if 2 puts on different
>> >>tables hit the same RS, they hit the same HLog.
>> >
>> > I understand. My point was that the application could insert the same record
>> > into two different tables on two different Hbase instances on two different
>> > piece of hardware.
>> >
>> > On a related note, can somebody explain what the tradeoff is if each region
>> > has its own hlog? are you worried about the number of files in HDFS? or
>> > maybe the number of sync-threads in the region server? Can multiple hlog
>> > files provide faster region splits?
>> >
>> >
>> >> I've thought about this issue quite a bit, and I think the sync every
>> >> 1 rows combined with optional no-sync and low time sync() is the way
>> >> to go. If you want to discuss this more in person, maybe we can meet
>> >> up for brews or something.
>> >>
>> >
>> > The group-commit thing I can understand. HDFS does a very similar thing. But
>> > can you explain your alternative "sync every 1 rows combined with optional
>> > no-sync and low time sync"? For those applications that have the natural
>> > characteristics of updating only one row per logical operation, how can they
>> > be sure that their data has reached some-sort-of-stable-storage unless they
>> > sync after every row update?
>> >
>> > thanks,
>> > dhruba
>> >
>
>
>
>
>
>

RE: commit semantics

Posted by Kannan Muthukkaruppan <Ka...@facebook.com>.
> I presume you intend to run HBase region servers
> colocated with HDFS DataNodes.

Yes.

---

Seems like we all generally agree that a large number of regions per region server may not be the way to go.

So, coming back to Dhruba's question on having one commit log per region instead of one commit log per region server: is the number of open HDFS files still a major concern?

Is my understanding correct that the unavailability window during region server failover is large due to the time it takes to split the shared commit log into per-region logs? Instead, if we always had per-region commit logs even in the normal mode of operation, then the unavailability window would be minimized? It does minimize the extent of batch/group commits you can do though-- since you can only batch updates going to the same region. Any other gotchas/issues?

regards,
Kannan
-----Original Message-----
From: Andrew Purtell [mailto:apurtell@apache.org]
Sent: Tuesday, January 12, 2010 12:50 PM
To: hbase-dev@hadoop.apache.org
Subject: Re: commit semantics

> But would say having a
> smaller number of regions per region server (say ~50) be really bad.

Not at all.

There are some (test) HBase deployments I know of that go pretty
vertical, multiple TBs of disk on each node therefore wanting a high
number of regions per region server to match that density. That may meet
with operational success but it is architecturally suspect. I ran a test
cluster once with > 1,000 regions per server on 25 servers, in the 0.19
timeframe. 0.20 is much better in terms of resource demand (less) and
liveness (enormously improved), but I still wouldn't recommend it,
unless your clients can wait for up to several minutes on blocked reads
and writes to affected regions should a node go down. With that many
regions per server,  it stands to reason just about every client would be
affected.

The numbers I have for Google's canonical BigTable deployment are several
years out of date but they go pretty far in the other direction -- about
100 regions per server is the target.

I think it also depends on whether you intend to colocate TaskTrackers
with the region servers. I presume you intend to run HBase region servers
colocated with HDFS DataNodes. After you have a HBase cluster up for some
number of hours, certainly ~24, background compaction will bring the HDFS
blocks backing region data local to the server, generally. MapReduce
tasks backed by HBase tables will see similar advantages of data locality
that you are probably accustomed to with working with files in HDFS. If
you mix storage and computation this way it makes sense to seek a balance
between the amount of data stored on each node (number of regions being
served) and the available computational resources (available CPU cores,
time constraints (if any) on task execution).

Even if you don't intend to do the above, it's possible that an overly
high region density can negatively impact performance if too much I/O
load is placed on average on each region server. Adding more servers to
spread load would then likely help**.

These considerations bias against hosting a very large number of regions
per region server.

   - Andy

**: I say likely because this presumes query and edit patterns have been
guided as necessary through engineering to be widely distributed in the
key space. You have to take some care to avoid hot regions.


----- Original Message ----
> From: Kannan Muthukkaruppan <Ka...@facebook.com>
> To: "hbase-dev@hadoop.apache.org" <hb...@hadoop.apache.org>
> Sent: Tue, January 12, 2010 11:40:00 AM
> Subject: RE: commit semantics
>
> Btw, is there much gains in having a large number of regions-- i.e. to the tune
> of 500 -- per region server?
>
> I understand that having multiple regions per region server allows finer grained
> rebalancing when new nodes are added or a node goes down. But would say having a
> smaller number of regions per region server (say ~50) be really bad. If a region
> server goes down, 50 other nodes would pick up ~1/50 of its work. Not as good as
> 500 other nodes picking up 1/500 of its work each-- but seems acceptable still.
> Are there other advantages of having a large number of regions per region
> server?
>
> regards,
> Kannan
> -----Original Message-----
> From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of Jean-Daniel
> Cryans
> Sent: Tuesday, January 12, 2010 9:42 AM
> To: hbase-dev@hadoop.apache.org
> Subject: Re: commit semantics
>
> wrt 1 HLog per region server, this is from the Bigtable paper. Their
> main concern is the number of opened files since if you have 1000
> region servers * 500 regions then you may have 100 000 HLogs to
> manage. Also you can have more than one file per HLog, so let's say
> you have on average 5 log files per HLog that's 500 000 files on HDFS.
>
> J-D
>
> On Tue, Jan 12, 2010 at 12:24 AM, Dhruba Borthakur wrote:
> > Hi Ryan,
> >
> > thanks for ur response.
> >
> >>Right now each regionserver has 1 log, so if 2 puts on different
> >>tables hit the same RS, they hit the same HLog.
> >
> > I understand. My point was that the application could insert the same record
> > into two different tables on two different Hbase instances on two different
> > piece of hardware.
> >
> > On a related note, can somebody explain what the tradeoff is if each region
> > has its own hlog? are you worried about the number of files in HDFS? or
> > maybe the number of sync-threads in the region server? Can multiple hlog
> > files provide faster region splits?
> >
> >
> >> I've thought about this issue quite a bit, and I think the sync every
> >> 1 rows combined with optional no-sync and low time sync() is the way
> >> to go. If you want to discuss this more in person, maybe we can meet
> >> up for brews or something.
> >>
> >
> > The group-commit thing I can understand. HDFS does a very similar thing. But
> > can you explain your alternative "sync every 1 rows combined with optional
> > no-sync and low time sync"? For those applications that have the natural
> > characteristics of updating only one row per logical operation, how can they
> > be sure that their data has reached some-sort-of-stable-storage unless they
> > sync after every row update?
> >
> > thanks,
> > dhruba
> >






Re: commit semantics

Posted by Andrew Purtell <ap...@apache.org>.
> But would say having a 
> smaller number of regions per region server (say ~50) be really bad. 

Not at all. 

There are some (test) HBase deployments I know of that go pretty
vertical, with multiple TBs of disk on each node, and that therefore want
a high number of regions per region server to match that density. That may meet
with operational success but it is architecturally suspect. I ran a test
cluster once with > 1,000 regions per server on 25 servers, in the 0.19
timeframe. 0.20 is much better in terms of resource demand (less) and
liveness (enormously improved), but I still wouldn't recommend it,
unless your clients can wait for up to several minutes on blocked reads
and writes to affected regions should a node go down. With that many
regions per server,  it stands to reason just about every client would be
affected.

The numbers I have for Google's canonical BigTable deployment are several
years out of date but they go pretty far in the other direction -- about
100 regions per server is the target. 

I think it also depends on whether you intend to colocate TaskTrackers
with the region servers. I presume you intend to run HBase region servers
colocated with HDFS DataNodes. After you have a HBase cluster up for some
number of hours, certainly ~24, background compaction will bring the HDFS
blocks backing region data local to the server, generally. MapReduce 
tasks backed by HBase tables will see similar advantages of data locality
that you are probably accustomed to with working with files in HDFS. If
you mix storage and computation this way it makes sense to seek a balance
between the amount of data stored on each node (number of regions being
served) and the available computational resources (available CPU cores,
time constraints (if any) on task execution). 

Even if you don't intend to do the above, it's possible that an overly
high region density can negatively impact performance if too much I/O
load is placed on average on each region server. Adding more servers to
spread load would then likely help**.

These considerations bias against hosting a very large number of regions
per region server.

   - Andy

**: I say likely because this presumes query and edit patterns have been
guided as necessary through engineering to be widely distributed in the
key space. You have to take some care to avoid hot regions. 
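
One common way to engineer that distribution is to salt the row key with a
short hash so that sequential keys spread across regions; a minimal sketch
of the idea (the class name and the 2-byte prefix width are just
illustrative choices, not anything from this thread):

    import java.security.MessageDigest;

    // Illustrative key-salting helper: prefix each row key with a short hash
    // of itself so monotonically increasing keys (timestamps, sequence
    // numbers) do not all land in the same region. The cost is that range
    // scans over the original key order become scatter reads, so this only
    // suits write-heavy, point-read workloads.
    public class SaltedKeys {
      public static byte[] salt(byte[] rowKey) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(rowKey);
        byte[] salted = new byte[2 + rowKey.length];
        System.arraycopy(digest, 0, salted, 0, 2);   // 2-byte prefix: 65536 buckets
        System.arraycopy(rowKey, 0, salted, 2, rowKey.length);
        return salted;
      }
    }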


----- Original Message ----
> From: Kannan Muthukkaruppan <Ka...@facebook.com>
> To: "hbase-dev@hadoop.apache.org" <hb...@hadoop.apache.org>
> Sent: Tue, January 12, 2010 11:40:00 AM
> Subject: RE: commit semantics
> 
> Btw, is there much gains in having a large number of regions-- i.e. to the tune 
> of 500 -- per region server?
> 
> I understand that having multiple regions per region server allows finer grained 
> rebalancing when new nodes are added or a node goes down. But would say having a 
> smaller number of regions per region server (say ~50) be really bad. If a region 
> server goes down, 50 other nodes would pick up ~1/50 of its work. Not as good as 
> 500 other nodes picking up 1/500 of its work each-- but seems acceptable still. 
> Are there other advantages of having a large number of regions per region 
> server?
> 
> regards,
> Kannan
> -----Original Message-----
> From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of Jean-Daniel 
> Cryans
> Sent: Tuesday, January 12, 2010 9:42 AM
> To: hbase-dev@hadoop.apache.org
> Subject: Re: commit semantics
> 
> wrt 1 HLog per region server, this is from the Bigtable paper. Their
> main concern is the number of opened files since if you have 1000
> region servers * 500 regions then you may have 100 000 HLogs to
> manage. Also you can have more than one file per HLog, so let's say
> you have on average 5 log files per HLog that's 500 000 files on HDFS.
> 
> J-D
> 
> On Tue, Jan 12, 2010 at 12:24 AM, Dhruba Borthakur wrote:
> > Hi Ryan,
> >
> > thanks for ur response.
> >
> >>Right now each regionserver has 1 log, so if 2 puts on different
> >>tables hit the same RS, they hit the same HLog.
> >
> > I understand. My point was that the application could insert the same record
> > into two different tables on two different Hbase instances on two different
> > piece of hardware.
> >
> > On a related note, can somebody explain what the tradeoff is if each region
> > has its own hlog? are you worried about the number of files in HDFS? or
> > maybe the number of sync-threads in the region server? Can multiple hlog
> > files provide faster region splits?
> >
> >
> >> I've thought about this issue quite a bit, and I think the sync every
> >> 1 rows combined with optional no-sync and low time sync() is the way
> >> to go. If you want to discuss this more in person, maybe we can meet
> >> up for brews or something.
> >>
> >
> > The group-commit thing I can understand. HDFS does a very similar thing. But
> > can you explain your alternative "sync every 1 rows combined with optional
> > no-sync and low time sync"? For those applications that have the natural
> > characteristics of updating only one row per logical operation, how can they
> > be sure that their data has reached some-sort-of-stable-storage unless they
> > sync after every row update?
> >
> > thanks,
> > dhruba
> >



      


Re: commit semantics

Posted by Jean-Daniel Cryans <jd...@apache.org>.
It all depends heavily on the size of your data vs. the size of your
cluster vs. your usage pattern.

Example 1: you have 50 regions on a RS and they are all filled at the
same rate. The RS dies, so the master has to split the logs of all 50
regions before reassigning.

Example 2: you have 500 regions on a RS and only 1 is filled. When it
dies, the master will only have 1 region to process.

Since a planned optimization is to reassign regions that have no edits
in any HLog (you have to have that knowledge prior to processing the
files; maybe store it in ZooKeeper) right before log splitting, you
lose availability on 49 regions in this case. Nevertheless,
splitting a small number of regions should be more efficient.

Also, more regions in general means more memory usage and possibly more
open files, and if your data has to be served very fast, a higher
number of regions means more data to keep in memory.
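
To put a rough number on the memory side, a back-of-envelope sketch,
assuming a 64 MB memstore flush size and one column family per region
(both just illustrative figures):

    // Rough illustration only: the upper bound on memstore memory if every
    // region were allowed to fill its memstore before flushing. In practice
    // a global heap-percentage limit forces earlier flushes, but the
    // pressure still grows with the region count.
    public class RegionMemoryMath {
      public static void main(String[] args) {
        long memstoreFlushSize = 64L * 1024 * 1024;   // assumed flush size (64 MB)
        int familiesPerRegion = 1;                    // assumed, for illustration
        for (int regions : new int[] { 50, 500 }) {
          long worstCase = (long) regions * familiesPerRegion * memstoreFlushSize;
          System.out.println(regions + " regions -> up to "
              + (worstCase / (1024 * 1024)) + " MB of memstore");
        }
      }
    }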

J-D

On Tue, Jan 12, 2010 at 11:40 AM, Kannan Muthukkaruppan
<Ka...@facebook.com> wrote:
> Btw, is there much gains in having a large number of regions-- i.e. to the tune of 500 -- per region server?
>
> I understand that having multiple regions per region server allows finer grained rebalancing when new nodes are added or a node goes down. But would say having a smaller number of regions per region server (say ~50) be really bad. If a region server goes down, 50 other nodes would pick up ~1/50 of its work. Not as good as 500 other nodes picking up 1/500 of its work each-- but seems acceptable still. Are there other advantages of having a large number of regions per region server?
>
> regards,
> Kannan
> -----Original Message-----
> From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of Jean-Daniel Cryans
> Sent: Tuesday, January 12, 2010 9:42 AM
> To: hbase-dev@hadoop.apache.org
> Subject: Re: commit semantics
>
> wrt 1 HLog per region server, this is from the Bigtable paper. Their
> main concern is the number of opened files since if you have 1000
> region servers * 500 regions then you may have 100 000 HLogs to
> manage. Also you can have more than one file per HLog, so let's say
> you have on average 5 log files per HLog that's 500 000 files on HDFS.
>
> J-D
>
> On Tue, Jan 12, 2010 at 12:24 AM, Dhruba Borthakur <dh...@gmail.com> wrote:
>> Hi Ryan,
>>
>> thanks for ur response.
>>
>>>Right now each regionserver has 1 log, so if 2 puts on different
>>>tables hit the same RS, they hit the same HLog.
>>
>> I understand. My point was that the application could insert the same record
>> into two different tables on two different Hbase instances on two different
>> piece of hardware.
>>
>> On a related note, can somebody explain what the tradeoff is if each region
>> has its own hlog? are you worried about the number of files in HDFS? or
>> maybe the number of sync-threads in the region server? Can multiple hlog
>> files provide faster region splits?
>>
>>
>>> I've thought about this issue quite a bit, and I think the sync every
>>> 1 rows combined with optional no-sync and low time sync() is the way
>>> to go. If you want to discuss this more in person, maybe we can meet
>>> up for brews or something.
>>>
>>
>> The group-commit thing I can understand. HDFS does a very similar thing. But
>> can you explain your alternative "sync every 1 rows combined with optional
>> no-sync and low time sync"? For those applications that have the natural
>> characteristics of updating only one row per logical operation, how can they
>> be sure that their data has reached some-sort-of-stable-storage unless they
>> sync after every row update?
>>
>> thanks,
>> dhruba
>>
>

RE: commit semantics

Posted by Kannan Muthukkaruppan <Ka...@facebook.com>.
Btw, are there big gains in having a large number of regions -- i.e. to the tune of 500 -- per region server?

I understand that having multiple regions per region server allows finer-grained rebalancing when new nodes are added or a node goes down. But would, say, having a smaller number of regions per region server (~50) really be that bad? If a region server goes down, 50 other nodes would each pick up ~1/50 of its work. Not as good as 500 other nodes each picking up 1/500 of its work -- but it seems acceptable still. Are there other advantages of having a large number of regions per region server?

regards,
Kannan
-----Original Message-----
From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of Jean-Daniel Cryans
Sent: Tuesday, January 12, 2010 9:42 AM
To: hbase-dev@hadoop.apache.org
Subject: Re: commit semantics

wrt 1 HLog per region server, this is from the Bigtable paper. Their
main concern is the number of opened files since if you have 1000
region servers * 500 regions then you may have 100 000 HLogs to
manage. Also you can have more than one file per HLog, so let's say
you have on average 5 log files per HLog that's 500 000 files on HDFS.

J-D

On Tue, Jan 12, 2010 at 12:24 AM, Dhruba Borthakur <dh...@gmail.com> wrote:
> Hi Ryan,
>
> thanks for ur response.
>
>>Right now each regionserver has 1 log, so if 2 puts on different
>>tables hit the same RS, they hit the same HLog.
>
> I understand. My point was that the application could insert the same record
> into two different tables on two different Hbase instances on two different
> piece of hardware.
>
> On a related note, can somebody explain what the tradeoff is if each region
> has its own hlog? are you worried about the number of files in HDFS? or
> maybe the number of sync-threads in the region server? Can multiple hlog
> files provide faster region splits?
>
>
>> I've thought about this issue quite a bit, and I think the sync every
>> 1 rows combined with optional no-sync and low time sync() is the way
>> to go. If you want to discuss this more in person, maybe we can meet
>> up for brews or something.
>>
>
> The group-commit thing I can understand. HDFS does a very similar thing. But
> can you explain your alternative "sync every 1 rows combined with optional
> no-sync and low time sync"? For those applications that have the natural
> characteristics of updating only one row per logical operation, how can they
> be sure that their data has reached some-sort-of-stable-storage unless they
> sync after every row update?
>
> thanks,
> dhruba
>

Re: commit semantics

Posted by Jean-Daniel Cryans <jd...@apache.org>.
wrt 1 HLog per region server, this is from the Bigtable paper. Their
main concern is the number of opened files since if you have 1000
region servers * 500 regions then you may have 500 000 HLogs to
manage. Also you can have more than one file per HLog, so let's say
you have on average 5 log files per HLog; that's 2 500 000 files on HDFS.

J-D

On Tue, Jan 12, 2010 at 12:24 AM, Dhruba Borthakur <dh...@gmail.com> wrote:
> Hi Ryan,
>
> thanks for ur response.
>
>>Right now each regionserver has 1 log, so if 2 puts on different
>>tables hit the same RS, they hit the same HLog.
>
> I understand. My point was that the application could insert the same record
> into two different tables on two different Hbase instances on two different
> piece of hardware.
>
> On a related note, can somebody explain what the tradeoff is if each region
> has its own hlog? are you worried about the number of files in HDFS? or
> maybe the number of sync-threads in the region server? Can multiple hlog
> files provide faster region splits?
>
>
>> I've thought about this issue quite a bit, and I think the sync every
>> 1 rows combined with optional no-sync and low time sync() is the way
>> to go. If you want to discuss this more in person, maybe we can meet
>> up for brews or something.
>>
>
> The group-commit thing I can understand. HDFS does a very similar thing. But
> can you explain your alternative "sync every 1 rows combined with optional
> no-sync and low time sync"? For those applications that have the natural
> characteristics of updating only one row per logical operation, how can they
> be sure that their data has reached some-sort-of-stable-storage unless they
> sync after every row update?
>
> thanks,
> dhruba
>

Re: commit semantics

Posted by Dhruba Borthakur <dh...@gmail.com>.
Hi Ryan,

thanks for your response.

>Right now each regionserver has 1 log, so if 2 puts on different
>tables hit the same RS, they hit the same HLog.

I understand. My point was that the application could insert the same record
into two different tables on two different Hbase instances on two different
piece of hardware.

On a related note, can somebody explain what the tradeoff is if each region
has its own hlog? are you worried about the number of files in HDFS? or
maybe the number of sync-threads in the region server? Can multiple hlog
files provide faster region splits?


> I've thought about this issue quite a bit, and I think the sync every
> 1 rows combined with optional no-sync and low time sync() is the way
> to go. If you want to discuss this more in person, maybe we can meet
> up for brews or something.
>

The group-commit thing I can understand. HDFS does a very similar thing. But
can you explain your alternative "sync every 1 rows combined with optional
no-sync and low time sync"? For those applications that have the natural
characteristics of updating only one row per logical operation, how can they
be sure that their data has reached some-sort-of-stable-storage unless they
sync after every row update?

thanks,
dhruba

Re: commit semantics

Posted by Ryan Rawson <ry...@gmail.com>.
Right now each regionserver has 1 log, so if 2 puts on different
tables hit the same RS, they hit the same HLog.

There are 2 performance enhancing things in trunk:
- bulk commit - we only call sync() once per RPC, no matter how many
rows are involved.  If you use the batch put API you can get really
high levels of performance (a client-side sketch follows this list).
- group commit - we can take multiple thread's worth of sync()s and do
it in one, not N.  This improves performance while maintaining high
data security.
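
A client-side sketch of the batch form mentioned in the first item above;
the table, family and column names are made up, and this assumes the
0.20-style client API where HTable.put can take a list of Puts:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    // Sketch only: "mytable", "cf", "col" and the row contents are made-up
    // names. Handing the client a list of Puts lets it batch them into few
    // RPCs, and the server can then cover each RPC's rows with a single
    // WAL sync().
    public class BatchPutExample {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(new HBaseConfiguration(), "mytable");
        List<Put> puts = new ArrayList<Put>();
        for (int i = 0; i < 1000; i++) {
          Put p = new Put(Bytes.toBytes("row-" + i));
          p.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value-" + i));
          puts.add(p);
        }
        table.put(puts);   // batched; far cheaper than 1000 single-row puts
      }
    }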

If you are expecting very high concurrency, group commit is your
friend. The more concurrent operations there are, the more rows each
sync captures and the higher the overall rows/sec you can achieve,
while the number of sync() calls per second stays constant.
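
For readers new to the pattern, a stripped-down sketch of the group-commit
idea (the concept only, not HBase's HLog implementation): writer threads
append and block, a single syncer thread issues one sync() covering
everything queued so far, then wakes them all.

    import java.io.IOException;

    // Minimal group-commit sketch. Writers record the sequence number of
    // their edit and wait until a background sync() has covered it; one
    // sync acknowledges every writer that queued up in the meantime.
    public class GroupCommitLog {
      public interface Syncable { void sync() throws IOException; }

      private final Syncable out;
      private long appended = 0;   // highest edit sequence number handed out
      private long synced = 0;     // highest sequence number covered by a sync
      private volatile boolean running = true;

      public GroupCommitLog(Syncable out) {
        this.out = out;
        Thread syncer = new Thread(new Runnable() {
          public void run() { syncLoop(); }
        }, "log-syncer");
        syncer.setDaemon(true);
        syncer.start();
      }

      /** Called by writer threads: append an edit, then block until durable. */
      public synchronized void appendAndCommit(byte[] edit) throws InterruptedException {
        // write 'edit' to the log stream here (omitted in this sketch)
        long seq = ++appended;
        notifyAll();                 // wake the syncer, there is work to do
        while (synced < seq) {
          wait();                    // monitor released while the syncer runs
        }
      }

      private void syncLoop() {
        try {
          while (running) {
            long target;
            synchronized (this) {
              while (appended == synced) wait();
              target = appended;     // everything queued so far
            }
            out.sync();              // one sync() for the whole batch
            synchronized (this) {
              synced = target;
              notifyAll();           // release every writer with seq <= target
            }
          }
        } catch (Exception e) {
          // sketch: real code must propagate sync failures to the writers
        }
      }
    }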

The other option is to sync() on a fine-grained timer, e.g. every 10ms
(i.e. at 100Hz).  The window of data loss is small, and the performance
boost is substantial. I asked JD to implement a switchable config so
that you can choose, on a table-by-table basis, the right mix of
performance vs persistence with a better control feature than merely
"sync every N rows".

I've thought about this issue quite a bit, and I think the sync every
1 rows combined with optional no-sync and low time sync() is the way
to go. If you want to discuss this more in person, maybe we can meet
up for brews or something.

-ryan

On Mon, Jan 11, 2010 at 10:25 PM, Dhruba Borthakur <dh...@gmail.com> wrote:
> any IO to a HDFS-file (appends, writes, etc) ae actually blocked on a
> pending sync. "sync" in HDFS is a pretty heavyweight operation as it stands.
>
> if we want the best of both worlds.. latency as well as data integrity, how
> about inserting the same record into two completely separate HBase tables in
> parallel... the operation can complete as soon as the record is inserted
> into the first HBase table (thus giving low latencies) but data integrity
> will not be compromised because it is unlikely that two region servers will
> fail exactly at the same time (assuming that there is a way to ensure that
> these two tables are not handled by the same region server).
>
> thanks,
> dhruba
>
>
> On Mon, Jan 11, 2010 at 8:12 PM, Joydeep Sarma <js...@apache.org> wrote:
>
>> ok - hadn't thought about it that way - but yeah with a default of 1 -
>> the semantics seem correct.
>>
>> under high load - some batching would automatically happen at this
>> setting (or so one would think - not sure if hdfs appends are blocked
>> on pending syncs (in which case the batching wouldn't quite happen i
>> think) - cc'ing Dhruba).
>>
>> if the performance with setting of 1 doesn't work out - we may need an
>> option to delay acks until actual syncs .. (most likely we would be
>> able to compromise on latency to get higher throughput - but wouldn't
>> be willing to compromise on data integrity)
>>
>> > Hey Joydeep,
>> >
>> > This is actually intended this way but the name of the variable is
>> > misleading. The sync is done only if forceSync or we have enough
>> > entries to sync (default is 1). If someone wants to sync only 100
>> > entries for example, they would play with that configuration.
>> >
>> > Hope that helps,
>> >
>> > J-D
>> >
>> >
>> > On Mon, Jan 11, 2010 at 3:46 PM, Joydeep Sarma <js...@apache.org>
>> wrote:
>> >>
>> >> Hey HBase-devs,
>> >>
>> >> we have been going through hbase code to come up to speed.
>> >>
>> >> One of the questions was regarding the commit semantics. Thumbing
>> through the RegionServer code that's appending to the wal:
>> >>
>> >> syncWal -> HLog.sync -> addToSyncQueue ->syncDone.await()
>> >>
>> >> and the log writer thread calls:
>> >>
>> >> hflush(), syncDone.signalAll()
>> >>
>> >> however hflush doesn't necessarily call a sync on the underlying log
>> file:
>> >>
>> >>       if (this.forceSync ||
>> >>           this.unflushedEntries.get() >= this.flushlogentries) { ...
>> sync() ... }
>> >>
>> >> so it seems that if forceSync is not true, the syncWal can unblock
>> before a sync is called (and forcesync seems to be only true for
>> metaregion()).
>> >>
>> >> are we missing something - or is there a bug here (the signalAll should
>> be conditional on hflush having actually flushed something).
>> >>
>> >> thanks,
>> >>
>> >> Joydeep
>> >
>>
>
>
>
> --
> Connect to me at http://www.facebook.com/dhruba
>

Re: commit semantics

Posted by Dhruba Borthakur <dh...@gmail.com>.
Any I/O to an HDFS file (appends, writes, etc.) is actually blocked on a
pending sync. "sync" in HDFS is a pretty heavyweight operation as it stands.

if we want the best of both worlds.. latency as well as data integrity, how
about inserting the same record into two completely separate HBase tables in
parallel... the operation can complete as soon as the record is inserted
into the first HBase table (thus giving low latencies) but data integrity
will not be compromised because it is unlikely that two region servers will
fail exactly at the same time (assuming that there is a way to ensure that
these two tables are not handled by the same region server).
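
A client-side sketch of that idea; RecordWriter, writeTo() and the table
names are placeholders for real HBase puts, and note that the slower write
still has to be tracked or retried for the two copies to stay independent:

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorCompletionService;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Sketch of "write the same record to two tables, ack on the first
    // success". writeTo() stands in for a put against one of two tables
    // known to be served by different region servers.
    public class DualWriteSketch {
      interface RecordWriter { void writeTo(String table, byte[] record) throws Exception; }

      private final ExecutorService pool = Executors.newFixedThreadPool(2);

      public void durablePut(final RecordWriter w, final byte[] record) throws Exception {
        ExecutorCompletionService<Boolean> ecs =
            new ExecutorCompletionService<Boolean>(pool);
        for (final String table : new String[] { "TABLE_A", "TABLE_B" }) {
          ecs.submit(new Callable<Boolean>() {
            public Boolean call() throws Exception {
              w.writeTo(table, record);   // placeholder for table.put(...) + sync
              return Boolean.TRUE;
            }
          });
        }
        // Return as soon as the first copy is durable; the second write keeps
        // going in the background. Real code must still watch it for failure,
        // otherwise the "two independent copies" guarantee silently degrades.
        ecs.take().get();
      }
    }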

thanks,
dhruba


On Mon, Jan 11, 2010 at 8:12 PM, Joydeep Sarma <js...@apache.org> wrote:

> ok - hadn't thought about it that way - but yeah with a default of 1 -
> the semantics seem correct.
>
> under high load - some batching would automatically happen at this
> setting (or so one would think - not sure if hdfs appends are blocked
> on pending syncs (in which case the batching wouldn't quite happen i
> think) - cc'ing Dhruba).
>
> if the performance with setting of 1 doesn't work out - we may need an
> option to delay acks until actual syncs .. (most likely we would be
> able to compromise on latency to get higher throughput - but wouldn't
> be willing to compromise on data integrity)
>
> > Hey Joydeep,
> >
> > This is actually intended this way but the name of the variable is
> > misleading. The sync is done only if forceSync or we have enough
> > entries to sync (default is 1). If someone wants to sync only 100
> > entries for example, they would play with that configuration.
> >
> > Hope that helps,
> >
> > J-D
> >
> >
> > On Mon, Jan 11, 2010 at 3:46 PM, Joydeep Sarma <js...@apache.org>
> wrote:
> >>
> >> Hey HBase-devs,
> >>
> >> we have been going through hbase code to come up to speed.
> >>
> >> One of the questions was regarding the commit semantics. Thumbing
> through the RegionServer code that's appending to the wal:
> >>
> >> syncWal -> HLog.sync -> addToSyncQueue ->syncDone.await()
> >>
> >> and the log writer thread calls:
> >>
> >> hflush(), syncDone.signalAll()
> >>
> >> however hflush doesn't necessarily call a sync on the underlying log
> file:
> >>
> >>       if (this.forceSync ||
> >>           this.unflushedEntries.get() >= this.flushlogentries) { ...
> sync() ... }
> >>
> >> so it seems that if forceSync is not true, the syncWal can unblock
> before a sync is called (and forcesync seems to be only true for
> metaregion()).
> >>
> >> are we missing something - or is there a bug here (the signalAll should
> be conditional on hflush having actually flushed something).
> >>
> >> thanks,
> >>
> >> Joydeep
> >
>



-- 
Connect to me at http://www.facebook.com/dhruba

Re: commit semantics

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Inline.

J-D

On Mon, Jan 11, 2010 at 8:12 PM, Joydeep Sarma <js...@apache.org> wrote:
> ok - hadn't thought about it that way - but yeah with a default of 1 -
> the semantics seem correct.
>
> under high load - some batching would automatically happen at this
> setting (or so one would think - not sure if hdfs appends are blocked
> on pending syncs (in which case the batching wouldn't quite happen i
> think) - cc'ing Dhruba).

Yes this is our version of group commit.

>
> if the performance with setting of 1 doesn't work out - we may need an
> option to delay acks until actual syncs .. (most likely we would be
> able to compromise on latency to get higher throughput - but wouldn't
> be willing to compromise on data integrity)

Good idea, we don't currently support that feature, although we have
the opposite running by default, which is deferred log flush: such
tables are never sync'ed per edit and instead rely on the LogSyncer
thread's awaitNanos timeout (configurable), as opposed to tables that
are highly durable. In our opinion, a cluster with a healthy mix of
deferred and non-deferred tables still guarantees a very high level of
durability for the default setting.
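
A minimal sketch of that timer-driven mode (the class name and the period
are illustrative, not the actual LogSyncer code): appends are acknowledged
immediately and a background thread syncs on a fixed interval, so at most
one interval's worth of acknowledged edits is exposed to a crash.

    import java.io.IOException;

    // Illustrative deferred-flush syncer: acknowledge appends right away and
    // let a background thread call sync() every 'periodMillis'. Crash
    // exposure is bounded by one period, in exchange for far fewer sync()
    // calls than one per edit.
    public class LogSyncerSketch implements Runnable {
      public interface Syncable { void sync() throws IOException; }

      private final Syncable out;
      private final long periodMillis;
      private volatile boolean dirty = false;   // set by appenders, cleared after sync

      public LogSyncerSketch(Syncable out, long periodMillis) {
        this.out = out;
        this.periodMillis = periodMillis;       // e.g. 10 ms, per the discussion above
      }

      /** Appenders call this after writing an edit; they do not wait for the sync. */
      public void editAppended() { dirty = true; }

      public void run() {
        try {
          while (!Thread.currentThread().isInterrupted()) {
            Thread.sleep(periodMillis);
            if (dirty) {
              dirty = false;
              out.sync();   // one sync covers every edit since the last one
            }
          }
        } catch (InterruptedException ie) {
          // shutting down
        } catch (IOException ioe) {
          // sketch: real code must surface this to the region server
        }
      }
    }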

Re: commit semantics

Posted by Joydeep Sarma <js...@apache.org>.
ok - hadn't thought about it that way - but yeah with a default of 1 -
the semantics seem correct.

under high load - some batching would automatically happen at this
setting (or so one would think - not sure if hdfs appends are blocked
on pending syncs (in which case the batching wouldn't quite happen i
think) - cc'ing Dhruba).

if the performance with setting of 1 doesn't work out - we may need an
option to delay acks until actual syncs .. (most likely we would be
able to compromise on latency to get higher throughput - but wouldn't
be willing to compromise on data integrity)

> Hey Joydeep,
>
> This is actually intended this way but the name of the variable is
> misleading. The sync is done only if forceSync or we have enough
> entries to sync (default is 1). If someone wants to sync only 100
> entries for example, they would play with that configuration.
>
> Hope that helps,
>
> J-D
>
>
> On Mon, Jan 11, 2010 at 3:46 PM, Joydeep Sarma <js...@apache.org> wrote:
>>
>> Hey HBase-devs,
>>
>> we have been going through hbase code to come up to speed.
>>
>> One of the questions was regarding the commit semantics. Thumbing through the RegionServer code that's appending to the wal:
>>
>> syncWal -> HLog.sync -> addToSyncQueue ->syncDone.await()
>>
>> and the log writer thread calls:
>>
>> hflush(), syncDone.signalAll()
>>
>> however hflush doesn't necessarily call a sync on the underlying log file:
>>
>>       if (this.forceSync ||
>>           this.unflushedEntries.get() >= this.flushlogentries) { ... sync() ... }
>>
>> so it seems that if forceSync is not true, the syncWal can unblock before a sync is called (and forcesync seems to be only true for metaregion()).
>>
>> are we missing something - or is there a bug here (the signalAll should be conditional on hflush having actually flushed something).
>>
>> thanks,
>>
>> Joydeep
>