Posted to dev@hbase.apache.org by Ryan Rawson <ry...@gmail.com> on 2010/08/28 22:07:27 UTC

hbase vs bigtable

bigtable was written for 1-core machines, with ~100 regions per box.
Thanks to CMS (the concurrent mark-sweep GC) we generally can't run on
< 4 cores, and at this point 16-core machines (with HTT) are becoming
pretty standard.

The question is, how do we leverage the ever-increasing sizes of
machines and differentiate ourselves from bigtable?  What did google
do (if anything) to adapt to 16-core machines?  We should be able
to do quite a bit on a 20- or 40-node cluster.

more thread parallelism?

Re: hbase vs bigtable

Posted by Ryan Rawson <ry...@gmail.com>.
What about the cautionary tale of xcievers (the dfs.datanode.max.xcievers
limit on concurrent data-transfer threads per DataNode)?

On Aug 28, 2010 4:40 PM, "Jay Booth" <ja...@gmail.com> wrote:
> You guys could look at using ExecutorService -- set up a pool with max
> 1024 threads that are reused, then you're not spawning new threads for
> every read. Since those network waits are probably mostly latency,
> doing them in parallel could be a win that was possible from the HBase
> side. You might have problems with memory churn, though, if you're
> allocating 10+ buffers per read.
>

Re: hbase vs bigtable

Posted by Jay Booth <ja...@gmail.com>.
You guys could look at using ExecutorService -- set up a pool with a
max of 1024 threads that are reused, so you're not spawning new threads
for every read.  Since those network waits are probably mostly latency,
doing them in parallel could be a win that's achievable purely from the
HBase side.  You might have problems with memory churn, though, if
you're allocating 10+ buffers per read.
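
A minimal sketch of that pattern, for illustration only -- the pool
size, the StoreFileReader type and its readBlock() call are hypothetical
stand-ins, not existing HBase or HDFS APIs:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelReadSketch {

      // Shared, bounded pool created once; threads are reused, so nothing
      // is spawned per read.
      private static final ExecutorService READ_POOL =
          Executors.newFixedThreadPool(1024);

      // Hypothetical handle for one store file.
      interface StoreFileReader {
        byte[] readBlock(long offset, int length) throws Exception;
      }

      // Issue the reads for one row against all candidate files in
      // parallel, then wait for all of them.
      public static List<byte[]> readAll(List<StoreFileReader> readers,
          final long offset, final int length) throws Exception {
        List<Future<byte[]>> futures = new ArrayList<Future<byte[]>>();
        for (final StoreFileReader reader : readers) {
          futures.add(READ_POOL.submit(new Callable<byte[]>() {
            public byte[] call() throws Exception {
              return reader.readBlock(offset, length);  // blocking network read
            }
          }));
        }
        List<byte[]> results = new ArrayList<byte[]>();
        for (Future<byte[]> f : futures) {
          results.add(f.get());  // the network latencies overlap
        }
        return results;
      }
    }

Each read still pins a pooled thread while it blocks, and each one
allocates its own buffers, which is where the memory-churn concern
above comes from.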

On Sat, Aug 28, 2010 at 7:27 PM, Ryan Rawson <ry...@gmail.com> wrote:
> One problem of performance right now is our inability to push io down
> into the kernel. This is where async Apis help. A full read in hbase
> might require reading 10+ files before ever returning a single row.
> Doing these in parallel would be nice. Spawning 10+ threads isn't
> really a good idea.
>
> Right now hadoop scales by adding processes, we just don't have that option.
>

Re: hbase vs bigtable

Posted by Ryan Rawson <ry...@gmail.com>.
One performance problem right now is our inability to push IO down
into the kernel.  This is where async APIs help.  A full read in HBase
might require reading 10+ files before ever returning a single row.
Doing those reads in parallel would be nice, but spawning 10+ threads
per read isn't really a good idea.

Right now Hadoop scales by adding processes; we just don't have that option.

On Saturday, August 28, 2010, Todd Lipcon <to...@cloudera.com> wrote:
> Agreed, I think we'll get more bang for our buck by finishing up (reviving)
> patches like HDFS-941 or HDFS-347. Unfortunately performance doesn't seem to
> be the highest priority among our customers so it's tough to find much time
> to work on these things until we really get stability up to par.
>
> -Todd
>

Re: hbase vs bigtable

Posted by Todd Lipcon <to...@cloudera.com>.
Agreed, I think we'll get more bang for our buck by finishing up
(reviving) patches like HDFS-941 or HDFS-347.  Unfortunately,
performance doesn't seem to be the highest priority among our
customers, so it's tough to find much time to work on these things
until we really get stability up to par.

-Todd

On Sat, Aug 28, 2010 at 3:36 PM, Jay Booth <ja...@gmail.com> wrote:

> I don't think async is a magic bullet for it's own sake, we've all
> seen those papers that show good performance from blocking
> implementations.  Particularly, I don't think async is worth a whole
> lot on the client side of service, which HBase is to HDFS.
>
> What about an HDFS call for localize(Path) which attempts to replicate
> the blocks for a file to the local datanode (if any) in a background
> thread?  If RegionServers called that function for their files every
> so often, you'd eliminate a lot of bandwidth constraints, although the
> latency of establishing a local socket for every read is still there.
>

Re: hbase vs bigtable

Posted by Jay Booth <ja...@gmail.com>.
I don't think async is a magic bullet for its own sake; we've all
seen those papers that show good performance from blocking
implementations.  In particular, I don't think async is worth a whole
lot on the client side of a service, which is what HBase is to HDFS.

What about an HDFS call for localize(Path) which attempts to replicate
the blocks for a file to the local datanode (if any) in a background
thread?  If RegionServers called that function for their files every
so often, you'd eliminate a lot of bandwidth constraints, although the
latency of establishing a local socket for every read is still there.
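
A sketch of what that might look like -- note that localize() is the
hypothetical call being proposed here, not an existing HDFS API, and
the types are made up just to show the shape:

    import java.util.List;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Hypothetical client-side call: ask HDFS to replicate this file's
    // blocks onto the local datanode; the copy runs in the background.
    interface LocalizingFileSystem {
      void localize(String path);
    }

    class StoreFileLocalizer {
      private final ScheduledExecutorService scheduler =
          Executors.newSingleThreadScheduledExecutor();

      // Periodically re-request localization for every store file this
      // regionserver serves, so reads drift toward being node-local.
      void start(final LocalizingFileSystem fs, final List<String> storeFiles) {
        scheduler.scheduleWithFixedDelay(new Runnable() {
          public void run() {
            for (String path : storeFiles) {
              fs.localize(path);
            }
          }
        }, 0, 30, TimeUnit.MINUTES);
      }
    }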

On Sat, Aug 28, 2010 at 4:42 PM, Todd Lipcon <to...@cloudera.com> wrote:
> On Sat, Aug 28, 2010 at 1:38 PM, Ryan Rawson <ry...@gmail.com> wrote:
>
>> One thought I had was if we have the writable code, surely just
>> putting a different transport around it wouldn't be THAT bad right :-)
>>
>> Of course writables are really tied to that DataInputStream or
>> whatever, so we'd have to work on that.  Benoit said something about
>> writables needing to do blocking reads and that causing issues, but
>> there was a netty3 thing specifically designed to handle that by
>> throwing and retrying the op later when there was more data.
>>
>>
> The data transfer protocol actually doesn't do anything with Writables -
> it's all hand coded bytes going over the transport.
>
> I have some code floating around somewhere for translating between blocking
> IO and Netty - not sure where, though :)
>
> -Todd

Re: hbase vs bigtable

Posted by Todd Lipcon <to...@cloudera.com>.
On Sat, Aug 28, 2010 at 1:38 PM, Ryan Rawson <ry...@gmail.com> wrote:

> One thought I had was if we have the writable code, surely just
> putting a different transport around it wouldn't be THAT bad right :-)
>
> Of course writables are really tied to that DataInputStream or
> whatever, so we'd have to work on that.  Benoit said something about
> writables needing to do blocking reads and that causing issues, but
> there was a netty3 thing specifically designed to handle that by
> throwing and retrying the op later when there was more data.
>
>
The data transfer protocol actually doesn't do anything with Writables -
it's all hand-coded bytes going over the transport.

I have some code floating around somewhere for translating between blocking
IO and Netty - not sure where, though :)

-Todd



Re: hbase vs bigtable

Posted by Ryan Rawson <ry...@gmail.com>.
One thought I had was: if we have the Writable code, surely just
putting a different transport around it wouldn't be THAT bad, right? :-)

Of course Writables are really tied to that DataInputStream or
whatever, so we'd have to work on that.  Benoit said something about
Writables needing to do blocking reads and that causing issues, but
there was a netty3 facility (ReplayingDecoder, I believe) specifically
designed to handle that by throwing and retrying the op later when
more data arrives.
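
A minimal example of that style, using a made-up length-prefixed
framing rather than anything Writable-specific (assuming the Netty 3
ReplayingDecoder is indeed what was meant):

    import org.jboss.netty.buffer.ChannelBuffer;
    import org.jboss.netty.channel.Channel;
    import org.jboss.netty.channel.ChannelHandlerContext;
    import org.jboss.netty.handler.codec.replay.ReplayingDecoder;
    import org.jboss.netty.handler.codec.replay.VoidEnum;

    // Decode a length-prefixed frame as if the whole frame were already
    // buffered.  If readInt()/readBytes() run past the bytes received so
    // far, ReplayingDecoder throws an internal signal, keeps the partial
    // input, and replays decode() when more data arrives -- the "throw
    // and retry the op later" behavior described above.
    public class LengthPrefixedFrameDecoder extends ReplayingDecoder<VoidEnum> {
      @Override
      protected Object decode(ChannelHandlerContext ctx, Channel channel,
          ChannelBuffer buffer, VoidEnum state) {
        int length = buffer.readInt();     // may "block" via replay
        return buffer.readBytes(length);   // frame payload for the next handler
      }
    }

A Writable-backed decoder would presumably wrap the buffer in a
DataInput view and call readFields() inside decode(), letting the
replay machinery absorb the short reads.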

On Sat, Aug 28, 2010 at 1:32 PM, Todd Lipcon <to...@cloudera.com> wrote:
> On Sat, Aug 28, 2010 at 1:29 PM, Ryan Rawson <ry...@gmail.com> wrote:
>
>> a production server should be CPU bound, with memory caching etc.  Our
>> prod systems do see a reasonable load, and jstack always shows some
>> kind of wait generally...
>>
>> but we need more IO pushdown into HDFS.  For example if we are loading
>> regions, why not do N at the same time?  That figure N is probably
>> more dependent on how many disks/node you have than anything else
>> really.
>>
>> For simple reads (eg: hfile) would it really be that hard to retrofit
>> some kind of async netty based API on top of the existing DFSClient
>> logic?
>>
>
> Would probably be a duplication rather than a retrofit, but it's probably
> doable -- the protocol is pretty simple for reads, and failure/retry is much
> less complicated compared to writes (though still pretty complicated)
>
>
>>
>> -ryan
>>
>> On Sat, Aug 28, 2010 at 1:11 PM, Todd Lipcon <to...@cloudera.com> wrote:
>> > Depending on the workload, parallelism doesn't seem to matter much. On my
>> > 8-core Nehalem test cluster with 12 disks each, I'm always network bound
>> far
>> > before I'm CPU bound for most benchmarks. ie jstacks show threads mostly
>> > waiting for IO to happen, not blocked on locks.
>> >
>> > Is that not the case for your production boxes?
>> >
>> > On Sat, Aug 28, 2010 at 1:07 PM, Ryan Rawson <ry...@gmail.com> wrote:
>> >
>> >> bigtable was written for 1 core machines, with ~ 100 regions per box.
>> >> Thanks to CMS we generally can't run on < 4 cores, and at this point
>> >> 16 core machines (with HTT) is becoming pretty standard.
>> >>
>> >> The question is, how do we leverage the ever-increasing sizes of
>> >> machines and differentiate ourselves from bigtable?  What did google
>> >> do (if anything) to adopt to the 16 core machines?  We should be able
>> >> to do quite a bit on a 20 or 40 node cluster.
>> >>
>> >> more thread parallelism?
>> >>
>> >
>> >
>> >
>> > --
>> > Todd Lipcon
>> > Software Engineer, Cloudera
>> >
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

Re: hbase vs bigtable

Posted by Todd Lipcon <to...@cloudera.com>.
On Sat, Aug 28, 2010 at 1:29 PM, Ryan Rawson <ry...@gmail.com> wrote:

> a production server should be CPU bound, with memory caching etc.  Our
> prod systems do see a reasonable load, and jstack always shows some
> kind of wait generally...
>
> but we need more IO pushdown into HDFS.  For example if we are loading
> regions, why not do N at the same time?  That figure N is probably
> more dependent on how many disks/node you have than anything else
> really.
>
> For simple reads (eg: hfile) would it really be that hard to retrofit
> some kind of async netty based API on top of the existing DFSClient
> logic?
>

It would probably be a duplication rather than a retrofit, but it's
probably doable -- the read protocol is pretty simple, and failure/retry
is much less complicated than for writes (though still pretty
complicated).



Re: hbase vs bigtable

Posted by Ryan Rawson <ry...@gmail.com>.
A production server should be CPU bound, with memory caching etc.  Our
prod systems do see a reasonable load, but jstack generally shows some
kind of wait...

But we need more IO pushdown into HDFS.  For example, if we are loading
regions, why not open N at the same time?  That figure N probably
depends more on how many disks/node you have than anything else.

For simple reads (e.g. HFile) would it really be that hard to retrofit
some kind of async Netty-based API on top of the existing DFSClient
logic?

-ryan
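
A rough sketch of the shape such an API could take -- everything here
is hypothetical, there is no interface like this in DFSClient today:

    import java.nio.ByteBuffer;

    // Hypothetical async positional-read interface that a Netty-backed
    // DFS read path could expose to HBase.
    interface AsyncBlockReader {

      interface ReadCallback {
        void onComplete(ByteBuffer data);
        void onError(Throwable cause);
      }

      // Request `length` bytes at `offset` in the given file.  Returns
      // immediately; the callback fires on an IO thread once the datanode
      // response arrives (or the read fails and retries are exhausted).
      void pread(String path, long offset, int length, ReadCallback callback);
    }

With something like that, opening or scanning N store files becomes N
outstanding preads multiplexed over a handful of IO threads, instead of
N blocked threads.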

On Sat, Aug 28, 2010 at 1:11 PM, Todd Lipcon <to...@cloudera.com> wrote:
> Depending on the workload, parallelism doesn't seem to matter much. On my
> 8-core Nehalem test cluster with 12 disks each, I'm always network bound far
> before I'm CPU bound for most benchmarks. ie jstacks show threads mostly
> waiting for IO to happen, not blocked on locks.
>
> Is that not the case for your production boxes?
>

Re: hbase vs bigtable

Posted by Todd Lipcon <to...@cloudera.com>.
Depending on the workload, parallelism doesn't seem to matter much.  On
my test cluster of 8-core Nehalem boxes with 12 disks each, I'm always
network-bound far before I'm CPU-bound for most benchmarks, i.e. jstacks
show threads mostly waiting for IO to happen, not blocked on locks.

Is that not the case for your production boxes?

On Sat, Aug 28, 2010 at 1:07 PM, Ryan Rawson <ry...@gmail.com> wrote:

> bigtable was written for 1 core machines, with ~ 100 regions per box.
> Thanks to CMS we generally can't run on < 4 cores, and at this point
> 16 core machines (with HTT) is becoming pretty standard.
>
> The question is, how do we leverage the ever-increasing sizes of
> machines and differentiate ourselves from bigtable?  What did google
> do (if anything) to adopt to the 16 core machines?  We should be able
> to do quite a bit on a 20 or 40 node cluster.
>
> more thread parallelism?
>



-- 
Todd Lipcon
Software Engineer, Cloudera