You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@accumulo.apache.org by Aaron Cordova <aa...@interllective.com> on 2012/01/30 16:48:08 UTC

loggers

At the Hbase vs Accumulo deathmatch the other night Todd elucidated that Hbase's write-ahead log is in HDFS and benefits somewhat thereby. He neglected to mention that for years until HDFS append() was available Hbase just LOST data while Accumulo didn't .. but he was talking about the current state of affairs so, whatever.

The question now is, does it make any sense to look at HDFS as a place to store Accumulo's write-ahead log? I remember that BigTable used two write streams (each of which is transparently replicated by HDFS) and switched between them to avoid performance hiccups, so it does sound like a critical part of the overall performance. Such a big change would belong probably in 1.6 or later ... But there may be reasons to never use HDFS and to always use a separately maintained subsystem.

Any one care to lay out the arguments for staying with a separate subsystem? I think we know the arguments for using HDFS.

Aaron


Re: loggers

Posted by Todd Lipcon <to...@cloudera.com>.
By default we roll every hour regardless of the number of edits. We
also roll every 0.95 x blocksize (fairly arbitrary, but works out to
around 120M usually). And same thing - we trigger flushes when we have
too many logs - but we're much less agressive than you -- our default
here is 32 segments I think. We'll also roll immediately any time
there are less than 3 DNs in the pipeline.

-Todd

On Mon, Jan 30, 2012 at 11:15 AM, Eric Newton <er...@gmail.com> wrote:
> What is "periodically" for closing the log segments?  Accumulo basis this
> decision based on the log size, with 1G segments by default.
>
> Accumulo marks the tablets with the log segment, and initiates flushes when
> the number of segments grows large (3 segments by default).  Otherwise,
> tablets with low write-rates, like the metadata table can take forever to
> recover.
>
> Does HBase do similar things?  Have you experimented with larger/smaller
> segments?
>
> -Eric
>
> On Mon, Jan 30, 2012 at 1:59 PM, Todd Lipcon <to...@cloudera.com> wrote:
>
>> On Mon, Jan 30, 2012 at 10:53 AM, Eric Newton <er...@gmail.com>
>> wrote:
>> > I will definitely look at using HBase's WAL.  What did you guys do to
>> > distribute the log-sort/split?
>>
>> In 0.90 the log sorting was done serially by the master, which as you
>> can imagine was slow after a full outage.
>> In 0.92 we have distributed log splitting:
>> https://issues.apache.org/jira/browse/hbase-1364
>>
>> I don't think there is a single design doc for it anywhere, but
>> basically we use ZK as a distributed work queue. The master shoves
>> items in there, and the region servers try to claim them. Each item
>> corresponds to a log segment (the HBase WAL rolls periodically). After
>> the segments are split, they're moved into the appropriate region
>> directories, so when the region is next opened, they are replayed.
>>
>> -Todd
>>
>> >
>> > On Mon, Jan 30, 2012 at 1:37 PM, Todd Lipcon <to...@cloudera.com> wrote:
>> >
>> >> On Mon, Jan 30, 2012 at 10:05 AM, Aaron Cordova
>> >> <aa...@interllective.com> wrote:
>> >> >> The big problem is in the fact that writing replicas in HDFS is done
>> in
>> >> a pipeline, rather than in parallel. There is a ticket to change this
>> >> (HDFS-1783), but no movement on it since last summer.
>> >> >
>> >> > ugh - why would they change this? Pipelining maximizes bandwidth
>> usage.
>> >> It'd be cool if the log stream could be configured to return after
>> written
>> >> to one, two, or more nodes though.
>> >> >
>> >>
>> >> The JIRA proposes to allow "star replication" instead of "pipeline
>> >> replication" on a per-stream basis. Pipelining trades off latency for
>> >> bandwidth -- multiple RTTs instead of 1 RTT.
>> >>
>> >> A few other notes relevant to the discussion above (sorry for losing
>> >> the quote history):
>> >>
>> >> Regarding HDFS's being designed for large sequential writes rather
>> >> than small records, that was originally true, but now its actually
>> >> fairly efficient. We have optimizations like HDFS-895 specifically for
>> >> the WAL use case which approximate things like group commit, and when
>> >> you combine that with group commit at the tablet-server level you can
>> >> get very good throughput along with durability guarantees. I haven't
>> >> benchmarked vs Accumulo's Loggers ever, but I'd be surprised if the
>> >> difference were substantial - we tend to be network bound on the WAL
>> >> unless the edits are really quite tiny.
>> >>
>> >> We're also looking at making our WAL implementation pluggable: see
>> >> HBASE-4529. Maybe a similar approach could be taken in Accumulo such
>> >> that HBase could use Accumulo loggers, or Accumulo could use HBase's
>> >> existing WAL class?
>> >>
>> >> -Todd
>> >> --
>> >> Todd Lipcon
>> >> Software Engineer, Cloudera
>> >>
>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: loggers

Posted by Eric Newton <er...@gmail.com>.
What is "periodically" for closing the log segments?  Accumulo basis this
decision based on the log size, with 1G segments by default.

Accumulo marks the tablets with the log segment, and initiates flushes when
the number of segments grows large (3 segments by default).  Otherwise,
tablets with low write-rates, like the metadata table can take forever to
recover.

Does HBase do similar things?  Have you experimented with larger/smaller
segments?

-Eric

On Mon, Jan 30, 2012 at 1:59 PM, Todd Lipcon <to...@cloudera.com> wrote:

> On Mon, Jan 30, 2012 at 10:53 AM, Eric Newton <er...@gmail.com>
> wrote:
> > I will definitely look at using HBase's WAL.  What did you guys do to
> > distribute the log-sort/split?
>
> In 0.90 the log sorting was done serially by the master, which as you
> can imagine was slow after a full outage.
> In 0.92 we have distributed log splitting:
> https://issues.apache.org/jira/browse/hbase-1364
>
> I don't think there is a single design doc for it anywhere, but
> basically we use ZK as a distributed work queue. The master shoves
> items in there, and the region servers try to claim them. Each item
> corresponds to a log segment (the HBase WAL rolls periodically). After
> the segments are split, they're moved into the appropriate region
> directories, so when the region is next opened, they are replayed.
>
> -Todd
>
> >
> > On Mon, Jan 30, 2012 at 1:37 PM, Todd Lipcon <to...@cloudera.com> wrote:
> >
> >> On Mon, Jan 30, 2012 at 10:05 AM, Aaron Cordova
> >> <aa...@interllective.com> wrote:
> >> >> The big problem is in the fact that writing replicas in HDFS is done
> in
> >> a pipeline, rather than in parallel. There is a ticket to change this
> >> (HDFS-1783), but no movement on it since last summer.
> >> >
> >> > ugh - why would they change this? Pipelining maximizes bandwidth
> usage.
> >> It'd be cool if the log stream could be configured to return after
> written
> >> to one, two, or more nodes though.
> >> >
> >>
> >> The JIRA proposes to allow "star replication" instead of "pipeline
> >> replication" on a per-stream basis. Pipelining trades off latency for
> >> bandwidth -- multiple RTTs instead of 1 RTT.
> >>
> >> A few other notes relevant to the discussion above (sorry for losing
> >> the quote history):
> >>
> >> Regarding HDFS's being designed for large sequential writes rather
> >> than small records, that was originally true, but now its actually
> >> fairly efficient. We have optimizations like HDFS-895 specifically for
> >> the WAL use case which approximate things like group commit, and when
> >> you combine that with group commit at the tablet-server level you can
> >> get very good throughput along with durability guarantees. I haven't
> >> benchmarked vs Accumulo's Loggers ever, but I'd be surprised if the
> >> difference were substantial - we tend to be network bound on the WAL
> >> unless the edits are really quite tiny.
> >>
> >> We're also looking at making our WAL implementation pluggable: see
> >> HBASE-4529. Maybe a similar approach could be taken in Accumulo such
> >> that HBase could use Accumulo loggers, or Accumulo could use HBase's
> >> existing WAL class?
> >>
> >> -Todd
> >> --
> >> Todd Lipcon
> >> Software Engineer, Cloudera
> >>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

Re: loggers

Posted by Todd Lipcon <to...@cloudera.com>.
On Mon, Jan 30, 2012 at 10:53 AM, Eric Newton <er...@gmail.com> wrote:
> I will definitely look at using HBase's WAL.  What did you guys do to
> distribute the log-sort/split?

In 0.90 the log sorting was done serially by the master, which as you
can imagine was slow after a full outage.
In 0.92 we have distributed log splitting:
https://issues.apache.org/jira/browse/hbase-1364

I don't think there is a single design doc for it anywhere, but
basically we use ZK as a distributed work queue. The master shoves
items in there, and the region servers try to claim them. Each item
corresponds to a log segment (the HBase WAL rolls periodically). After
the segments are split, they're moved into the appropriate region
directories, so when the region is next opened, they are replayed.

-Todd

>
> On Mon, Jan 30, 2012 at 1:37 PM, Todd Lipcon <to...@cloudera.com> wrote:
>
>> On Mon, Jan 30, 2012 at 10:05 AM, Aaron Cordova
>> <aa...@interllective.com> wrote:
>> >> The big problem is in the fact that writing replicas in HDFS is done in
>> a pipeline, rather than in parallel. There is a ticket to change this
>> (HDFS-1783), but no movement on it since last summer.
>> >
>> > ugh - why would they change this? Pipelining maximizes bandwidth usage.
>> It'd be cool if the log stream could be configured to return after written
>> to one, two, or more nodes though.
>> >
>>
>> The JIRA proposes to allow "star replication" instead of "pipeline
>> replication" on a per-stream basis. Pipelining trades off latency for
>> bandwidth -- multiple RTTs instead of 1 RTT.
>>
>> A few other notes relevant to the discussion above (sorry for losing
>> the quote history):
>>
>> Regarding HDFS's being designed for large sequential writes rather
>> than small records, that was originally true, but now its actually
>> fairly efficient. We have optimizations like HDFS-895 specifically for
>> the WAL use case which approximate things like group commit, and when
>> you combine that with group commit at the tablet-server level you can
>> get very good throughput along with durability guarantees. I haven't
>> benchmarked vs Accumulo's Loggers ever, but I'd be surprised if the
>> difference were substantial - we tend to be network bound on the WAL
>> unless the edits are really quite tiny.
>>
>> We're also looking at making our WAL implementation pluggable: see
>> HBASE-4529. Maybe a similar approach could be taken in Accumulo such
>> that HBase could use Accumulo loggers, or Accumulo could use HBase's
>> existing WAL class?
>>
>> -Todd
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: loggers

Posted by Eric Newton <er...@gmail.com>.
I will definitely look at using HBase's WAL.  What did you guys do to
distribute the log-sort/split?

Actually, if you have any design description (Jira, or email)... I would
love to read it over.

-Eric

On Mon, Jan 30, 2012 at 1:37 PM, Todd Lipcon <to...@cloudera.com> wrote:

> On Mon, Jan 30, 2012 at 10:05 AM, Aaron Cordova
> <aa...@interllective.com> wrote:
> >> The big problem is in the fact that writing replicas in HDFS is done in
> a pipeline, rather than in parallel. There is a ticket to change this
> (HDFS-1783), but no movement on it since last summer.
> >
> > ugh - why would they change this? Pipelining maximizes bandwidth usage.
> It'd be cool if the log stream could be configured to return after written
> to one, two, or more nodes though.
> >
>
> The JIRA proposes to allow "star replication" instead of "pipeline
> replication" on a per-stream basis. Pipelining trades off latency for
> bandwidth -- multiple RTTs instead of 1 RTT.
>
> A few other notes relevant to the discussion above (sorry for losing
> the quote history):
>
> Regarding HDFS's being designed for large sequential writes rather
> than small records, that was originally true, but now its actually
> fairly efficient. We have optimizations like HDFS-895 specifically for
> the WAL use case which approximate things like group commit, and when
> you combine that with group commit at the tablet-server level you can
> get very good throughput along with durability guarantees. I haven't
> benchmarked vs Accumulo's Loggers ever, but I'd be surprised if the
> difference were substantial - we tend to be network bound on the WAL
> unless the edits are really quite tiny.
>
> We're also looking at making our WAL implementation pluggable: see
> HBASE-4529. Maybe a similar approach could be taken in Accumulo such
> that HBase could use Accumulo loggers, or Accumulo could use HBase's
> existing WAL class?
>
> -Todd
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

Re: loggers

Posted by Todd Lipcon <to...@cloudera.com>.
On Mon, Jan 30, 2012 at 10:05 AM, Aaron Cordova
<aa...@interllective.com> wrote:
>> The big problem is in the fact that writing replicas in HDFS is done in a pipeline, rather than in parallel. There is a ticket to change this (HDFS-1783), but no movement on it since last summer.
>
> ugh - why would they change this? Pipelining maximizes bandwidth usage. It'd be cool if the log stream could be configured to return after written to one, two, or more nodes though.
>

The JIRA proposes to allow "star replication" instead of "pipeline
replication" on a per-stream basis. Pipelining trades off latency for
bandwidth -- multiple RTTs instead of 1 RTT.

A few other notes relevant to the discussion above (sorry for losing
the quote history):

Regarding HDFS's being designed for large sequential writes rather
than small records, that was originally true, but now its actually
fairly efficient. We have optimizations like HDFS-895 specifically for
the WAL use case which approximate things like group commit, and when
you combine that with group commit at the tablet-server level you can
get very good throughput along with durability guarantees. I haven't
benchmarked vs Accumulo's Loggers ever, but I'd be surprised if the
difference were substantial - we tend to be network bound on the WAL
unless the edits are really quite tiny.

We're also looking at making our WAL implementation pluggable: see
HBASE-4529. Maybe a similar approach could be taken in Accumulo such
that HBase could use Accumulo loggers, or Accumulo could use HBase's
existing WAL class?

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera

Re: loggers

Posted by Aaron Cordova <aa...@interllective.com>.
On Jan 30, 2012, at 12:02 PM, Jesse Yates wrote:

> The large blocks issue is going away soon/already with append support in HDFS. You are still going to be hurt if you have other things IOing on the node as you still need to spin disk, but it won't be as terrible as it could be.
> 
> The big problem is in the fact that writing replicas in HDFS is done in a pipeline, rather than in parallel. There is a ticket to change this (HDFS-1783), but no movement on it since last summer.

ugh - why would they change this? Pipelining maximizes bandwidth usage. It'd be cool if the log stream could be configured to return after written to one, two, or more nodes though.

> Just my two cents, but sticking with the currently logging style makes the most sense, though maybe making it a really distinct interface so we can swap out for an HDFS implementation when it's ready and people prefer.
> 
> - Jesse Yates
> 
> Sent from my iPhone.


Re: loggers

Posted by Jesse Yates <je...@gmail.com>.
The large blocks issue is going away soon/already with append support in HDFS. You are still going to be hurt if you have other things IOing on the node as you still need to spin disk, but it won't be as terrible as it could be.

The big problem is in the fact that writing replicas in HDFS is done in a pipeline, rather than in parallel. There is a ticket to change this (HDFS-1783), but no movement on it since last summer.

Just my two cents, but sticking with the currently logging style makes the most sense, though maybe making it a really distinct interface so we can swap out for an HDFS implementation when it's ready and people prefer.

- Jesse Yates

Sent from my iPhone.

On Jan 30, 2012, at 8:18 AM, Eric Newton <er...@gmail.com> wrote:

> I wonder if the I/O model of HDFS might be different from logging.  Logging
> consists of many small appends, whereas HDFS relies on large buffers to
> push around big blocks of data.  This is awesome for performance, but not
> so great for synchronous appends.
> 
> I imagine we'll need to implement it in order to measure the impact.  I
> don't expect it will be that hard to add.
> 
> I would also like to experiment with specialized log nodes that would not
> have to compete with HDFS seeks.
> 
> And, of course, we should try the multi-queue logger discussed in the
> google paper.
> 
> As it is now, we can barely tell (by performance) if the walog is on, so
> I'm not sure it's worth spending much more time on the performance.
> 
> -Eric
> 
> 
> On Mon, Jan 30, 2012 at 10:55 AM, Keith Turner <ke...@deenlo.com> wrote:
> 
>> I think it makes sense to move to HDFS if it is reliable (can survive
>> continuous ingest and the agitator) and performs well. Also, I am very
>> curious about what the performance differences are.  It would be nice
>> to do some test.
>> 
>> On Mon, Jan 30, 2012 at 10:48 AM, Aaron Cordova
>> <aa...@interllective.com> wrote:
>>> At the Hbase vs Accumulo deathmatch the other night Todd elucidated that
>> Hbase's write-ahead log is in HDFS and benefits somewhat thereby. He
>> neglected to mention that for years until HDFS append() was available Hbase
>> just LOST data while Accumulo didn't .. but he was talking about the
>> current state of affairs so, whatever.
>>> 
>>> The question now is, does it make any sense to look at HDFS as a place
>> to store Accumulo's write-ahead log? I remember that BigTable used two
>> write streams (each of which is transparently replicated by HDFS) and
>> switched between them to avoid performance hiccups, so it does sound like a
>> critical part of the overall performance. Such a big change would belong
>> probably in 1.6 or later ... But there may be reasons to never use HDFS and
>> to always use a separately maintained subsystem.
>>> 
>>> Any one care to lay out the arguments for staying with a separate
>> subsystem? I think we know the arguments for using HDFS.
>>> 
>>> Aaron
>>> 
>> 

Re: loggers

Posted by Eric Newton <er...@gmail.com>.
I wonder if the I/O model of HDFS might be different from logging.  Logging
consists of many small appends, whereas HDFS relies on large buffers to
push around big blocks of data.  This is awesome for performance, but not
so great for synchronous appends.

I imagine we'll need to implement it in order to measure the impact.  I
don't expect it will be that hard to add.

I would also like to experiment with specialized log nodes that would not
have to compete with HDFS seeks.

And, of course, we should try the multi-queue logger discussed in the
google paper.

As it is now, we can barely tell (by performance) if the walog is on, so
I'm not sure it's worth spending much more time on the performance.

-Eric


On Mon, Jan 30, 2012 at 10:55 AM, Keith Turner <ke...@deenlo.com> wrote:

> I think it makes sense to move to HDFS if it is reliable (can survive
> continuous ingest and the agitator) and performs well. Also, I am very
> curious about what the performance differences are.  It would be nice
> to do some test.
>
> On Mon, Jan 30, 2012 at 10:48 AM, Aaron Cordova
> <aa...@interllective.com> wrote:
> > At the Hbase vs Accumulo deathmatch the other night Todd elucidated that
> Hbase's write-ahead log is in HDFS and benefits somewhat thereby. He
> neglected to mention that for years until HDFS append() was available Hbase
> just LOST data while Accumulo didn't .. but he was talking about the
> current state of affairs so, whatever.
> >
> > The question now is, does it make any sense to look at HDFS as a place
> to store Accumulo's write-ahead log? I remember that BigTable used two
> write streams (each of which is transparently replicated by HDFS) and
> switched between them to avoid performance hiccups, so it does sound like a
> critical part of the overall performance. Such a big change would belong
> probably in 1.6 or later ... But there may be reasons to never use HDFS and
> to always use a separately maintained subsystem.
> >
> > Any one care to lay out the arguments for staying with a separate
> subsystem? I think we know the arguments for using HDFS.
> >
> > Aaron
> >
>

Re: loggers

Posted by Keith Turner <ke...@deenlo.com>.
I think it makes sense to move to HDFS if it is reliable (can survive
continuous ingest and the agitator) and performs well. Also, I am very
curious about what the performance differences are.  It would be nice
to do some test.

On Mon, Jan 30, 2012 at 10:48 AM, Aaron Cordova
<aa...@interllective.com> wrote:
> At the Hbase vs Accumulo deathmatch the other night Todd elucidated that Hbase's write-ahead log is in HDFS and benefits somewhat thereby. He neglected to mention that for years until HDFS append() was available Hbase just LOST data while Accumulo didn't .. but he was talking about the current state of affairs so, whatever.
>
> The question now is, does it make any sense to look at HDFS as a place to store Accumulo's write-ahead log? I remember that BigTable used two write streams (each of which is transparently replicated by HDFS) and switched between them to avoid performance hiccups, so it does sound like a critical part of the overall performance. Such a big change would belong probably in 1.6 or later ... But there may be reasons to never use HDFS and to always use a separately maintained subsystem.
>
> Any one care to lay out the arguments for staying with a separate subsystem? I think we know the arguments for using HDFS.
>
> Aaron
>

Re: loggers

Posted by dl...@comcast.net.

even more load on the name node... 



Dave Marion 



----- Original Message -----


From: "Aaron Cordova" <aa...@interllective.com> 
To: accumulo-dev@incubator.apache.org 
Sent: Monday, January 30, 2012 10:48:08 AM 
Subject: loggers 

At the Hbase vs Accumulo deathmatch the other night Todd elucidated that Hbase's write-ahead log is in HDFS and benefits somewhat thereby. He neglected to mention that for years until HDFS append() was available Hbase just LOST data while Accumulo didn't .. but he was talking about the current state of affairs so, whatever. 

The question now is, does it make any sense to look at HDFS as a place to store Accumulo's write-ahead log? I remember that BigTable used two write streams (each of which is transparently replicated by HDFS) and switched between them to avoid performance hiccups, so it does sound like a critical part of the overall performance. Such a big change would belong probably in 1.6 or later ... But there may be reasons to never use HDFS and to always use a separately maintained subsystem. 

Any one care to lay out the arguments for staying with a separate subsystem? I think we know the arguments for using HDFS. 

Aaron