You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@hbase.apache.org by Todd Lipcon <to...@cloudera.com> on 2010/12/25 05:17:24 UTC

Good VLDB paper on WALs

Via Hammer - I thought this was a pretty good read, some good ideas for
optimizations for our WAL.

http://infoscience.epfl.ch/record/149436/files/vldb10aether.pdf

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Good VLDB paper on WALs

Posted by Stack <st...@duboce.net>.

On Mon, Dec 27, 2010 at 11:48 AM, Dhruba Borthakur <dh...@gmail.com> wrote:
> Does anybody have any idea on how to figure out what percentage of the above
> sys-time is spent in thread scheduling vs the time spent in other system
> calls (especially in the Namenode context)?
>

Dhruba:

Our Benoit suggests http://oprofile.sourceforge.net/

St.Ack

Re: Good VLDB paper on WALs

Posted by "M. C. Srivas" <mc...@gmail.com>.

My observation (working on Spinnaker's NFS server, and then on MapR's
server), is that ELR + group-commit is essential. ELR is trivial and I am a
bit surprised that the paper claims no one does it.

Once ELR is implemented, the bottleneck immediately shifts to forcing the
log on a commit. But if multiple commit records end up landing on the same
VM page in the Linux kernel (imagine tiny transactions), then fsync issued
during log-force will cause the Linux kernel to lock out further writes to
that page while it is flushed to disk, so things come to a halt
anyway.  HBase writes to a single log file (thus a single spindle on HDFS)
so the fsync rate is further limited

Thus, a group-commit at the HBase level will go a long way in improving
performance. But, group-commit (as usually implemented in most systems) ends
up requires a timer + 2 extra context switches for each transaction. Perhaps
a "peek" into the transacation-manager to see if other transactions are
actually running can tell whether to even bother with the group-commit (ie,
wait for a group-commit only if there are other uncommitted transactions in
flight).


On Wed, Dec 29, 2010 at 12:07 PM, Ryan Rawson <ry...@gmail.com> wrote:

> Oh no, let's be wary of those server rewrites.  My micro profiling is
> showing about 30 usec for a lock handoff in the HBase client...
>
> I think we should be able to get big wins with minimal things.  A big
> rewrite has it's major costs, not to mention to effectively be async
> we'd have to rewrite every single pice of code more complex than
> Bytes.*.  If you need to block you will need to push context on a
> context-store (aka stack) and manage that all ourselves.
>
> I've been seeing papers that are talking about threading improvements
> that could get us better performance.  Assuming that ctx is the actual
> reason why we arent as fast as we could be (note: we are NOT slow!).
>
> As for the DI, I think I'd like to see more study on the costs and
> benefits.  We have a relatively minimal amount of interfaces and
> concrete objects, for the interfaces we do, we have 1 or 2
> implementations at most.  Usually 1.  There is a cost, I'd like to see
> more descriptions of the costs vs the benefits.
>
> -ryan
>
> On Wed, Dec 29, 2010 at 11:32 AM, Stack <st...@duboce.net> wrote:
> > Nice list of things we need to do to make logging faster (with useful
> > citations on current state of art).  This notion of early lock release
> > (ELR) is worth looking into (Jon, for high rates of counter
> > transactions, you've been talking about aggregating counts in front of
> > the WAL lock... maybe an ELR and then a hold on the transaction until
> > confirmation of flush would be way to go?).  Regards flush-pipelining,
> > it would be interesting to see if there are traces of the sys-time
> > that Dhruba is seeing in his NN out in HBase servers.  My guess is
> > that its probably drowned by other context switches done in our
> > servers.  Definitely worth study.
> >
> > St.Ack
> > P.S. Minimizing context switches, a system for ELR and
> > flush-pipelining, recasting the server to make use of one of the DI or
> > OSGi frameworks, moving off log4j, etc..... Is it just me or do others
> > feel a server rewrite coming on?
> >
> >
> > On Mon, Dec 27, 2010 at 11:48 AM, Dhruba Borthakur <dh...@gmail.com>
> wrote:
> >> HDFS currently uses Hadoop RPC and the server thread blocks till the WAL
> is
> >> written to disk. In earlier deployments, I thought we could safely
> ignore
> >> flush-pipelining by creating more server threads. But in our largest
> HDFS
> >> systems, I am starting to see  20% sys-time usage on the namenode
> machine;
> >> most of this  could be thread scheduling. If so, then it makes sense to
> >> enhance the logging code to release server threads even before the WAL
> is
> >> flushed to disk (but, of course, we still have to delay the transaction
> >> response to the client till the WAL is synced to disk).
> >>
> >> Does anybody have any idea on how to figure out what percentage of the
> above
> >> sys-time is spent in thread scheduling vs the time spent in other system
> >> calls (especially in the Namenode context)?
> >>
> >> thanks,
> >> dhruba
> >>
> >>
> >> On Fri, Dec 24, 2010 at 8:17 PM, Todd Lipcon <to...@cloudera.com> wrote:
> >>
> >>> Via Hammer - I thought this was a pretty good read, some good ideas for
> >>> optimizations for our WAL.
> >>>
> >>> http://infoscience.epfl.ch/record/149436/files/vldb10aether.pdf
> >>>
> >>> -Todd
> >>> --
> >>> Todd Lipcon
> >>> Software Engineer, Cloudera
> >>>
> >>
> >>
> >>
> >> --
> >> Connect to me at http://www.facebook.com/dhruba
> >>
> >
>

Re: Good VLDB paper on WALs

Posted by Ryan Rawson <ry...@gmail.com>.

Oh no, let's be wary of those server rewrites.  My micro profiling is
showing about 30 usec for a lock handoff in the HBase client...

I think we should be able to get big wins with minimal things.  A big
rewrite has it's major costs, not to mention to effectively be async
we'd have to rewrite every single pice of code more complex than
Bytes.*.  If you need to block you will need to push context on a
context-store (aka stack) and manage that all ourselves.

I've been seeing papers that are talking about threading improvements
that could get us better performance.  Assuming that ctx is the actual
reason why we arent as fast as we could be (note: we are NOT slow!).

As for the DI, I think I'd like to see more study on the costs and
benefits.  We have a relatively minimal amount of interfaces and
concrete objects, for the interfaces we do, we have 1 or 2
implementations at most.  Usually 1.  There is a cost, I'd like to see
more descriptions of the costs vs the benefits.

-ryan

On Wed, Dec 29, 2010 at 11:32 AM, Stack <st...@duboce.net> wrote:
> Nice list of things we need to do to make logging faster (with useful
> citations on current state of art).  This notion of early lock release
> (ELR) is worth looking into (Jon, for high rates of counter
> transactions, you've been talking about aggregating counts in front of
> the WAL lock... maybe an ELR and then a hold on the transaction until
> confirmation of flush would be way to go?).  Regards flush-pipelining,
> it would be interesting to see if there are traces of the sys-time
> that Dhruba is seeing in his NN out in HBase servers.  My guess is
> that its probably drowned by other context switches done in our
> servers.  Definitely worth study.
>
> St.Ack
> P.S. Minimizing context switches, a system for ELR and
> flush-pipelining, recasting the server to make use of one of the DI or
> OSGi frameworks, moving off log4j, etc..... Is it just me or do others
> feel a server rewrite coming on?
>
>
> On Mon, Dec 27, 2010 at 11:48 AM, Dhruba Borthakur <dh...@gmail.com> wrote:
>> HDFS currently uses Hadoop RPC and the server thread blocks till the WAL is
>> written to disk. In earlier deployments, I thought we could safely ignore
>> flush-pipelining by creating more server threads. But in our largest HDFS
>> systems, I am starting to see  20% sys-time usage on the namenode machine;
>> most of this  could be thread scheduling. If so, then it makes sense to
>> enhance the logging code to release server threads even before the WAL is
>> flushed to disk (but, of course, we still have to delay the transaction
>> response to the client till the WAL is synced to disk).
>>
>> Does anybody have any idea on how to figure out what percentage of the above
>> sys-time is spent in thread scheduling vs the time spent in other system
>> calls (especially in the Namenode context)?
>>
>> thanks,
>> dhruba
>>
>>
>> On Fri, Dec 24, 2010 at 8:17 PM, Todd Lipcon <to...@cloudera.com> wrote:
>>
>>> Via Hammer - I thought this was a pretty good read, some good ideas for
>>> optimizations for our WAL.
>>>
>>> http://infoscience.epfl.ch/record/149436/files/vldb10aether.pdf
>>>
>>> -Todd
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>>
>>
>>
>>
>> --
>> Connect to me at http://www.facebook.com/dhruba
>>
>

Re: Good VLDB paper on WALs

Posted by Nicolas Spiegelberg <ns...@fb.com>.

+1 for ELR.

I think having some data structure where we prepare the next stage of
sync() operations instead of holding the row lock over the sync would be a
big win for hot regions without a huge refactor.  I think the other two
optimizations are useful to think about, but wouldn't have the same
impact/effort ratio as ELR.


On 12/29/10 11:32 AM, "Stack" <st...@duboce.net> wrote:

>Nice list of things we need to do to make logging faster (with useful
>citations on current state of art).  This notion of early lock release
>(ELR) is worth looking into (Jon, for high rates of counter
>transactions, you've been talking about aggregating counts in front of
>the WAL lock... maybe an ELR and then a hold on the transaction until
>confirmation of flush would be way to go?).  Regards flush-pipelining,
>it would be interesting to see if there are traces of the sys-time
>that Dhruba is seeing in his NN out in HBase servers.  My guess is
>that its probably drowned by other context switches done in our
>servers.  Definitely worth study.
>
>St.Ack
>P.S. Minimizing context switches, a system for ELR and
>flush-pipelining, recasting the server to make use of one of the DI or
>OSGi frameworks, moving off log4j, etc..... Is it just me or do others
>feel a server rewrite coming on?
>
>
>On Mon, Dec 27, 2010 at 11:48 AM, Dhruba Borthakur <dh...@gmail.com>
>wrote:
>> HDFS currently uses Hadoop RPC and the server thread blocks till the
>>WAL is
>> written to disk. In earlier deployments, I thought we could safely
>>ignore
>> flush-pipelining by creating more server threads. But in our largest
>>HDFS
>> systems, I am starting to see  20% sys-time usage on the namenode
>>machine;
>> most of this  could be thread scheduling. If so, then it makes sense to
>> enhance the logging code to release server threads even before the WAL
>>is
>> flushed to disk (but, of course, we still have to delay the transaction
>> response to the client till the WAL is synced to disk).
>>
>> Does anybody have any idea on how to figure out what percentage of the
>>above
>> sys-time is spent in thread scheduling vs the time spent in other system
>> calls (especially in the Namenode context)?
>>
>> thanks,
>> dhruba
>>
>>
>> On Fri, Dec 24, 2010 at 8:17 PM, Todd Lipcon <to...@cloudera.com> wrote:
>>
>>> Via Hammer - I thought this was a pretty good read, some good ideas for
>>> optimizations for our WAL.
>>>
>>> http://infoscience.epfl.ch/record/149436/files/vldb10aether.pdf
>>>
>>> -Todd
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>>
>>
>>
>>
>> --
>> Connect to me at http://www.facebook.com/dhruba
>>

Re: Good VLDB paper on WALs

Posted by Stack <st...@duboce.net>.

Nice list of things we need to do to make logging faster (with useful
citations on current state of art).  This notion of early lock release
(ELR) is worth looking into (Jon, for high rates of counter
transactions, you've been talking about aggregating counts in front of
the WAL lock... maybe an ELR and then a hold on the transaction until
confirmation of flush would be way to go?).  Regards flush-pipelining,
it would be interesting to see if there are traces of the sys-time
that Dhruba is seeing in his NN out in HBase servers.  My guess is
that its probably drowned by other context switches done in our
servers.  Definitely worth study.

St.Ack
P.S. Minimizing context switches, a system for ELR and
flush-pipelining, recasting the server to make use of one of the DI or
OSGi frameworks, moving off log4j, etc..... Is it just me or do others
feel a server rewrite coming on?

On Mon, Dec 27, 2010 at 11:48 AM, Dhruba Borthakur <dh...@gmail.com> wrote:
> HDFS currently uses Hadoop RPC and the server thread blocks till the WAL is
> written to disk. In earlier deployments, I thought we could safely ignore
> flush-pipelining by creating more server threads. But in our largest HDFS
> systems, I am starting to see  20% sys-time usage on the namenode machine;
> most of this  could be thread scheduling. If so, then it makes sense to
> enhance the logging code to release server threads even before the WAL is
> flushed to disk (but, of course, we still have to delay the transaction
> response to the client till the WAL is synced to disk).
>
> Does anybody have any idea on how to figure out what percentage of the above
> sys-time is spent in thread scheduling vs the time spent in other system
> calls (especially in the Namenode context)?
>
> thanks,
> dhruba
>
>
> On Fri, Dec 24, 2010 at 8:17 PM, Todd Lipcon <to...@cloudera.com> wrote:
>
>> Via Hammer - I thought this was a pretty good read, some good ideas for
>> optimizations for our WAL.
>>
>> http://infoscience.epfl.ch/record/149436/files/vldb10aether.pdf
>>
>> -Todd
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>
>
> --
> Connect to me at http://www.facebook.com/dhruba
>

Re: Good VLDB paper on WALs

Posted by Dhruba Borthakur <dh...@gmail.com>.

Hi Todd,

Good paper, it would be nice to get the Flush-Pipelining technique
(described in the paper) implemented in HBase and HDFS write-ahead logs. (I
am CC-ing this to hdfs-dev@hadoop as well)

HDFS currently uses Hadoop RPC and the server thread blocks till the WAL is
written to disk. In earlier deployments, I thought we could safely ignore
flush-pipelining by creating more server threads. But in our largest HDFS
systems, I am starting to see  20% sys-time usage on the namenode machine;
most of this  could be thread scheduling. If so, then it makes sense to
enhance the logging code to release server threads even before the WAL is
flushed to disk (but, of course, we still have to delay the transaction
response to the client till the WAL is synced to disk).

Does anybody have any idea on how to figure out what percentage of the above
sys-time is spent in thread scheduling vs the time spent in other system
calls (especially in the Namenode context)?

thanks,
dhruba

On Fri, Dec 24, 2010 at 8:17 PM, Todd Lipcon <to...@cloudera.com> wrote:

> Via Hammer - I thought this was a pretty good read, some good ideas for
> optimizations for our WAL.
>
> http://infoscience.epfl.ch/record/149436/files/vldb10aether.pdf
>
> -Todd
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

-- 
Connect to me at http://www.facebook.com/dhruba

Re: Good VLDB paper on WALs

Posted by Dhruba Borthakur <dh...@gmail.com>.

Hi Todd,

Good paper, it would be nice to get the Flush-Pipelining technique
(described in the paper) implemented in HBase and HDFS write-ahead logs. (I
am CC-ing this to hdfs-dev@hadoop as well)

HDFS currently uses Hadoop RPC and the server thread blocks till the WAL is
written to disk. In earlier deployments, I thought we could safely ignore
flush-pipelining by creating more server threads. But in our largest HDFS
systems, I am starting to see  20% sys-time usage on the namenode machine;
most of this  could be thread scheduling. If so, then it makes sense to
enhance the logging code to release server threads even before the WAL is
flushed to disk (but, of course, we still have to delay the transaction
response to the client till the WAL is synced to disk).

Does anybody have any idea on how to figure out what percentage of the above
sys-time is spent in thread scheduling vs the time spent in other system
calls (especially in the Namenode context)?

thanks,
dhruba

On Fri, Dec 24, 2010 at 8:17 PM, Todd Lipcon <to...@cloudera.com> wrote:

> Via Hammer - I thought this was a pretty good read, some good ideas for
> optimizations for our WAL.
>
> http://infoscience.epfl.ch/record/149436/files/vldb10aether.pdf
>
> -Todd
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

-- 
Connect to me at http://www.facebook.com/dhruba