Posted to user@hbase.apache.org by Asaf Mesika <as...@gmail.com> on 2013/06/20 12:46:09 UTC

Replication not suited for intensive write applications?

Hi,

I've been conducting lots of benchmarks to test the maximum throughput of
replication in HBase.

I've come to the conclusion that HBase replication is not suited for
write-intensive applications. I hope that people here can show me where I'm
wrong.

*My setup*
*Cluster* (Master and slave are alike)
1 Master, NameNode
3 RS, DataNode

All computers are the same: 8 cores x 3.4 GHz, 8 GB RAM, 1 Gigabit Ethernet
card

I insert data into HBase from a Java process (client) that reads files from
disk and runs on the HBase Master machine of the master cluster.

*Benchmark Results*
When the client writes with 10 threads, the master cluster writes at
17 MB/sec, while the replicated cluster writes at 12 MB/sec. The data I
wrote is 15 GB, all Puts, to two different tables.
Both clusters, when tested independently without replication, achieved a
write throughput of 17-19 MB/sec, so evidently the replication process is
the bottleneck.

I also tested connectivity between the two clusters using "netcat" and
achieved 111 MB/sec.
I've checked the usage of the network cards on the client, the master
cluster region servers and the slave region servers. No machine went over
30 MB/sec in Receive or Transmit.
The way I checked was rather crude but works: I ran "netstat -ie" before
HBase in the master cluster starts writing and after it finishes. The same
was done on the replicated cluster (when the replication started and
finished). From that I can tell the number of bytes received and
transmitted, and I know the duration each cluster worked, so I can
calculate the throughput.
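(To illustrate the arithmetic: moving my 15 GB in about 15 minutes works
out to 15 * 1024 MB / 900 sec ≈ 17 MB/sec.)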

*The bottleneck, in my opinion*
Since we've excluded network capacity, and each cluster works at a faster
rate independently, all that is left is the replication process.
My client writes to the master cluster with 10 threads and manages to
write at 17-18 MB/sec.
Each region server has only 1 thread responsible for transmitting the data
written to the WAL to the slave cluster, so in my setup I effectively have
3 threads writing to the slave cluster. This is the bottleneck, and the
process cannot be parallelized, since it must transmit the WAL in a
certain order.
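Put differently: the slave's 12 MB/sec divided by 3 shipping threads is
only ~4 MB/sec per replication thread, while my 10 client threads push
17-18 MB/sec into the master cluster.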

*Conclusion*
When writing intensively to HBase with more than 3 threads (in my setup),
you can't use replication.

*Master throughput without replication*
On a different note, I have one thing I couldn't understand at all.
When I turned off replication and wrote with my client using 3 threads, I
got a throughput of 11.3 MB/sec. When I wrote with 10 threads (any more
than that doesn't help) I achieved a maximum throughput of 19 MB/sec.
The network cards showed 30 MB/sec Receive and 20 MB/sec Transmit on each
RS, thus the network capacity was not the bottleneck.
On the HBase master machine, which ran the client, the network card again
showed a Receive throughput of 0.5 MB/sec and a Transmit throughput of
18.28 MB/sec. Hence it isn't the client machine's network card creating
the bottleneck either.

The only explanation I have is the synchronized writes to the WAL. Those
10 threads have to get in line and, one by one, write their batch of Puts
to the WAL, which creates a bottleneck.
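
To make the contention concrete, here is a toy model (plain Java, NOT
HBase code; the 2 MB batch size and 5 ms sync delay are made-up numbers):

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

// Toy model: each thread builds its batch in parallel, but every batch
// funnels through one synchronized "WAL" append.
public class WalContention {
  static final Object WAL_LOCK = new Object();
  static final AtomicLong BYTES = new AtomicLong();

  public static void main(String[] args) throws InterruptedException {
    final int threads = Integer.parseInt(args[0]); // try 3, then 10
    final CountDownLatch done = new CountDownLatch(threads);
    long start = System.nanoTime();
    for (int i = 0; i < threads; i++) {
      new Thread(new Runnable() {
        public void run() {
          for (int b = 0; b < 50; b++) {
            byte[] batch = buildBatch();  // CPU-bound, runs in parallel
            synchronized (WAL_LOCK) {     // serialized WAL append + sync
              sync(batch);
            }
            BYTES.addAndGet(batch.length);
          }
          done.countDown();
        }
      }).start();
    }
    done.await();
    double secs = (System.nanoTime() - start) / 1e9;
    System.out.printf("%.1f MB/sec%n", BYTES.get() / 1e6 / secs);
  }

  static byte[] buildBatch() {            // fake serialization cost
    byte[] b = new byte[2 * 1024 * 1024];
    for (int i = 0; i < b.length; i++) b[i] = (byte) i;
    return b;
  }

  static void sync(byte[] batch) {        // fake disk-sync latency
    try { Thread.sleep(5); } catch (InterruptedException ignored) { }
  }
}

While one thread holds the lock the others can still be building their
next batch, which is the only overlap this model allows.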

*My question*:
The one thing I couldn't understand is: when I write with 3 threads, I
have no more than 3 concurrent RPC write requests at each RS, and they
achieved 11.3 MB/sec.
The write to the WAL is synchronized, so why does increasing the number of
threads to 10 (x3 more) actually increase the throughput to 19 MB/sec?
They all get in line to write to the same location, so it seems concurrent
writes shouldn't improve throughput at all.


Thank you!

Asaf

Re: Replication not suited for intensive write applications?

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Given that the region server writes to a single WAL at a time, doing
it with multiple threads might be hard. You also have to manage the
correct position up in ZK. It might be easier with multiple WALs.

In any case, inserting at such a rate might not be doable over long
periods of time. How long were your benchmarks running for, exactly?
(can't find it in your first email)

You could also fancy doing regular bulk loads (say, every 30 minutes)
and consider shipping the same files to the other cluster.
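
Something like this, roughly (an untested sketch; it assumes the files
come out of an HFileOutputFormat job, and the paths/table name are
placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadShip {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Path hfileDir = new Path(args[0]);        // output of the periodic MR job
    HTable table = new HTable(conf, args[1]); // destination table
    // 1. bulk load the generated HFiles into the local cluster
    new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, table);
    // 2. distcp the same HFile directory to the other cluster and run
    //    the same bulk load there, instead of relying on replication
    table.close();
  }
}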

Do you have a real use case in mind?

Thanks,

J-D

On Sat, Jun 22, 2013 at 11:33 PM, Asaf Mesika <as...@gmail.com> wrote:
> bq. I'm not sure if it's really a problem tho.
>
> Let's say the maximum throughput achieved by writing with k client
> threads is 30 MB/sec, where k = the number of region servers.
> If you are consistently writing to HBase at more than 30 MB/sec - let's
> say 40 MB/sec with 2k threads - then you can't use HBase replication and
> must write your own solution.
>
> One idea I started thinking about is to somehow declare that, for a
> specific table, the order of Puts is not important (say each write is
> unique), so you can spawn multiple threads for replicating a WAL file.
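>
> A very rough pseudo-Java sketch of the idea (WalEntry, shardEntries and
> replicateToSlave are made-up names, not HBase API; imports omitted):
>
>   void shipInParallel(List<WalEntry> entries, int nThreads) throws Exception {
>     ExecutorService pool = Executors.newFixedThreadPool(nThreads);
>     try {
>       List<Future<?>> futures = new ArrayList<Future<?>>();
>       // order doesn't matter for this table, so shard and ship concurrently
>       for (final List<WalEntry> shard : shardEntries(entries, nThreads)) {
>         futures.add(pool.submit(new Runnable() {
>           public void run() { replicateToSlave(shard); } // one RPC per shard
>         }));
>       }
>       for (Future<?> f : futures) f.get(); // wait until every shard landed
>       // only now is it safe to advance the replication position in ZK
>     } finally {
>       pool.shutdown();
>     }
>   }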
>
>
>
>
> On Sat, Jun 22, 2013 at 12:18 AM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>
>> I think that the same way writing with more clients helped throughput,
>> writing with only 1 replication thread will hurt it. The clients in
>> both cases have to read something (a file from HDFS or the WAL) then
>> ship it, meaning that you can utilize the cluster better since a
>> single client isn't consistently writing.
>>
>> I agree with Asaf's assessment that it's possible that you can write
>> faster into HBase than you can replicate from it if your clients are
>> using the write buffers and have a bigger aggregate throughput than
>> replication's.
>>
>> I'm not sure if it's really a problem tho.
>>
>> J-D
>>
>> On Fri, Jun 21, 2013 at 3:05 PM, lars hofhansl <la...@apache.org> wrote:
>> > Hmm... Yes. Was worth a try :)  Should've checked and I even wrote
>> > that part of the code.
>> >
>> > I have no good explanation then, and also no good suggestion about how
>> > to improve this.
>> >
>> >
>> >
>> > ________________________________
>> >  From: Asaf Mesika <as...@gmail.com>
>> > To: "user@hbase.apache.org" <us...@hbase.apache.org>; lars hofhansl <
>> larsh@apache.org>
>> > Sent: Friday, June 21, 2013 5:50 AM
>> > Subject: Re: Replication not suited for intensive write applications?
>> >
>> >
>> > On Fri, Jun 21, 2013 at 2:38 PM, lars hofhansl <la...@apache.org> wrote:
>> >
>> >> Another thought...
>> >>
>> >> I assume you only write to a single table, right? How large are your
>> >> rows on average?
>> >>
>> > I'm writing to 2 tables: the avg row size for the 1st table is 1500
>> > bytes, and for the second it is around 800 bytes.
>> >
>> >>
>> >> Replication will send 64mb blocks by default (or 25000 edits, whichever
>> >> is smaller). The default HTable buffer is 2mb only, so the slave RS
>> >> receiving a block of edits (assuming it is a full block) has to do 32
>> >> rounds of splitting the edits per region in order to apply them.
>> >>
>> > In ReplicationSink.java (0.94.6) I see that HTable.batch() is used,
>> > which writes directly to the RS without buffers?
>> >
>> >   private void batch(byte[] tableName, List<Row> rows) throws IOException {
>> >     if (rows.isEmpty()) {
>> >       return;
>> >     }
>> >     HTableInterface table = null;
>> >     try {
>> >       table = new HTable(tableName, this.sharedHtableCon, this.sharedThreadPool);
>> >       table.batch(rows);
>> >       this.metrics.appliedOpsRate.inc(rows.size());
>> >     } catch (InterruptedException ix) {
>> >       throw new IOException(ix);
>> >     } finally {
>> >       if (table != null) {
>> >         table.close();
>> >       }
>> >     }
>> >   }
>> >
>> >
>> >
>> >>
>> >> There is no setting specifically targeted at the buffer size for
>> >> replication, but maybe you could increase "hbase.client.write.buffer" to
>> >> 64mb (67108864) on the slave cluster and see whether that makes a
>> >> difference. If it does we can (1) add a setting to control the
>> >> ReplicationSink HTable's buffer size, or (2) just have it match the
>> >> replication buffer size "replication.source.size.capacity".
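>> >>
>> >> For example, in hbase-site.xml on the slave cluster (just something to
>> >> try, not a recommendation):
>> >>
>> >>   <property>
>> >>     <name>hbase.client.write.buffer</name>
>> >>     <value>67108864</value>
>> >>   </property>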
>> >>
>> >>
>> >> -- Lars
>> >> ________________________________
>> >> From: lars hofhansl <la...@apache.org>
>> >> To: "user@hbase.apache.org" <us...@hbase.apache.org>
>> >> Sent: Friday, June 21, 2013 1:48 AM
>> >> Subject: Re: Replication not suited for intensive write applications?
>> >>
>> >>
>> >> Thanks for checking... Interesting. So talking to 3 RSs as opposed to
>> >> only 1 before had no effect on the throughput?
>> >>
>> >> Would be good to explore this a bit more.
>> >> Since our RPC is not streaming, latency will affect throughput. In this
>> >> case there is latency while all edits are shipped to the RS in the slave
>> >> cluster and then extra latency when applying the edits there (which are
>> >> likely not local to that RS). A true streaming API should be better. If
>> >> that is the case compression *could* help (but that is a big if).
>> >>
>> >> The single thread shipping the edits to the slave should not be an issue
>> >> as the edits are actually applied by the slave RS, which will use
>> >> multiple threads to apply the edits in the local cluster.
>> >>
>> >> Also my first reply - upon re-reading it - sounded a bit rough, that was
>> >> not intended.
>> >>
>> >> -- Lars
>> >>
>> >>
>> >> ----- Original Message -----
>> >> From: Asaf Mesika <as...@gmail.com>
>> >> To: "user@hbase.apache.org" <us...@hbase.apache.org>; lars hofhansl <
>> >> larsh@apache.org>
>> >> Cc:
>> >> Sent: Thursday, June 20, 2013 10:16 PM
>> >> Subject: Re: Replication not suited for intensive write applications?
>> >>
>> >> Thanks for taking the time to answer!
>> >> My answers are inline.
>> >>
>> >> On Fri, Jun 21, 2013 at 1:47 AM, lars hofhansl <la...@apache.org>
>> >> wrote:
>> >>
>> >> > I see.
>> >> >
>> >> > In HBase you have machines for both CPU (to serve requests) and
>> >> > storage (to hold the data).
>> >> >
>> >> > If you only grow your cluster for CPU and you keep all RegionServers
>> >> > 100% busy at all times, you are correct.
>> >> >
>> >> > Maybe you need to increase replication.source.size.capacity and/or
>> >> > replication.source.nb.capacity (although I doubt that this will help
>> >> > here).
>> >> >
>> >> I was thinking of giving it a shot, but theoretically it should not
>> >> have an effect, since I'm not doing anything in parallel, right?
>> >>
>> >>
>> >> > Also a replication source will pick region servers from the target at
>> >> > random (10% of them by default). That has two effects:
>> >> > 1. Each source will pick exactly one RS at the target: ceil(3*0.1)=1
>> >> > 2. With such a small cluster setup the likelihood is high that two or
>> >> > more RSs in the source will happen to pick the same RS at the target,
>> >> > thus leading to less throughput.
>> >> >
>> >> You are absolutely correct. In Graphite, in the beginning, I saw that
>> >> only one slave RS was getting all the replicateLogEntries RPC calls. I
>> >> searched the master RS logs and saw "Choose Peer" as follows:
>> >> Master RS 74: Choose peer 83
>> >> Master RS 75: Choose peer 83
>> >> Master RS 76: Choose peer 85
>> >> For some reason, they ALL talked to 83 (which seems like a bug to me).
>> >>
>> >> I thought I had nailed the bottleneck, so I changed the factor from 0.1
>> >> to 1. It had the exact effect you described, and now all RSs were
>> >> getting the same amount of replicateLogEntries RPC calls, BUT it didn't
>> >> budge the replication throughput. When I checked the network card usage
>> >> I understood that even when all 3 RSs were talking to the same slave RS,
>> >> the network wasn't the bottleneck.
>> >>
>> >>
>> >> >
>> >> > In fact your numbers might indicate that two of your source RSs might
>> >> > have picked the same target (you get 2/3 of your throughput via
>> >> > replication).
>> >> >
>> >> >
>> >> > In any case, before drawing conclusions this should be tested with a
>> >> > larger cluster.
>> >> > Maybe set replication.source.ratio from 0.1 to 1 (thus the source RSs
>> >> > will round-robin all target RSs and lead to better distribution), but
>> >> > that might have other side-effects, too.
>> >> >
>> >> I'll try getting two clusters of 10 RSs each and see if that helps. I
>> >> suspect it won't. My hunch is that since we're replicating with no more
>> >> than 10 threads, if I take my client, set it to 10 threads and measure
>> >> the throughput, that will be the maximum replication throughput. Thus,
>> >> if my client writes with, let's say, 20 threads (or I have two clients
>> >> with 10 threads each), then I'm bound to reach an ever-increasing
>> >> ageOfLastShipped.
>> >>
>> >> >
>> >> > Did you measure the disk IO at each RS at the target? Maybe one of
>> >> > them is mostly idle.
>> >> >
>> >> I didn't, but I did run my client directly against the slave cluster
>> >> and measured the throughput: I got 18 MB/sec, which is bigger than the
>> >> replication throughput of 11 MB/sec, so I concluded the hard drives
>> >> couldn't be the bottleneck here.
>> >>
>> >> I was thinking of somehow tweaking HBase a bit for my use case: I always
>> >> send Puts with new row KVs (never updates or deletes), so ordering does
>> >> not matter to me. Maybe enable, with a flag on a certain column family,
>> >> the ability to open multiple threads at the ReplicationSource?
>> >>
>> >> One more question - keeping the one thread in mind here, having
>> >> compression on the replicateLogEntries RPC call shouldn't really help
>> >> here, right? The entire RPC call time is mostly the time it takes to
>> >> run the HTable.batch call on the slave RS, right? If I enable
>> >> compression somehow (hack the HBase code to test-drive it), I will only
>> >> speed up the transfer time of the batch to the slave RS, but still wait
>> >> on the insertion of this batch into the slave cluster.
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> > -- Lars
>> >> > ________________________________
>> >> > From: Asaf Mesika <as...@gmail.com>
>> >> > To: "user@hbase.apache.org" <us...@hbase.apache.org>; lars hofhansl <
>> >> > larsh@apache.org>
>> >> > Sent: Thursday, June 20, 2013 1:38 PM
>> >> > Subject: Re: Replication not suited for intensive write applications?
>> >> >
>> >> >
>> >> > Thanks for the answer!
>> >> > My responses are inline.
>> >> >
>> >> > On Thu, Jun 20, 2013 at 11:02 PM, lars hofhansl <la...@apache.org>
>> >> > wrote:
>> >> >
>> >> > > First off, this is a pretty constructed case leading to a specious
>> >> > > general conclusion.
>> >> > >
>> >> > > If you only have three RSs/DNs and the default replication factor
>> >> > > of 3, each machine will get every single write.
>> >> > > That is the first issue. Using HBase makes little sense with such a
>> >> > > small cluster.
>> >> > >
>> >> > You are correct; nonetheless, the network, as I measured it, was far
>> >> > from its capacity and thus probably not the bottleneck.
>> >> >
>> >> > >
>> >> > > Secondly, as you say yourself, there are only three regionservers
>> >> > > writing to the replicated cluster, using a single thread each in
>> >> > > order to preserve ordering.
>> >> > > With more region servers your scale will tip the other way. Again,
>> >> > > more regionservers will make this better.
>> >> > >
>> >> > I presume, in production, I will add more region servers to
>> >> > accommodate growing write demand on my cluster. Hence, my clients
>> >> > will write with more threads. Thus, proportionally, I will always
>> >> > have a lot more client threads than region servers (each of which has
>> >> > one replication thread). So I don't see how adding more region
>> >> > servers will tip the scale to the other side.
>> >> > The only way to avoid this is to design the cluster in such a way
>> >> > that if I can handle the events received at the client (which writes
>> >> > them to HBase) with x threads, then x is the number of region servers
>> >> > I should have. If I have a spike, it will even out eventually, but
>> >> > this is under-utilizing my cluster hardware, no?
>> >> >
>> >> >
>> >> > > As for your other question, more threads can lead to better
>> >> > > interleaving of CPU and IO, thus leading to better throughput (this
>> >> > > relationship is not linear, though).
>> >> > >
>> >> > >
>> >> >
>> >> > >
>> >> > > -- Lars

Re: Replication not suited for intensive write applications?

Posted by Asaf Mesika <as...@gmail.com>.
bq. I'm not sure if it's really a problem tho.

Let's the maximum throughput achieved by writing with k client threads is
30 MB/sec, where k = the number of region servers.
If you are consistently writing to HBase more than 30 MB/sec  - lets say 40
MB/sec with 2k threads - that you can't use HBase replication and must
write your own solution.

One way I started thinking about is to somehow declare that for a specific
table, order of Puts is not important (say each write is unique), thus you
can spawn multiple threads for replicating a WAL file.




On Sat, Jun 22, 2013 at 12:18 AM, Jean-Daniel Cryans <jd...@apache.org>wrote:

> I think that the same way writing with more clients helped throughput,
> writing with only 1 replication thread will hurt it. The clients in
> both cases have to read something (a file from HDFS or the WAL) then
> ship it, meaning that you can utilize the cluster better since a
> single client isn't consistently writing.
>
> I agree with Asaf's assessment that it's possible that you can write
> faster into HBase than you can replicate from it if your clients are
> using the write buffers and have a bigger aggregate throughput than
> replication's.
>
> I'm not sure if it's really a problem tho.
>
> J-D
>
> On Fri, Jun 21, 2013 at 3:05 PM, lars hofhansl <la...@apache.org> wrote:
> > Hmm... Yes. Was worth a try :)  Should've checked and I even wrote that
> part of the code.
> >
> > I have no good explanation then, and also no good suggestion about how
> to improve this.
> >
> >
> >
> > ________________________________
> >  From: Asaf Mesika <as...@gmail.com>
> > To: "user@hbase.apache.org" <us...@hbase.apache.org>; lars hofhansl <
> larsh@apache.org>
> > Sent: Friday, June 21, 2013 5:50 AM
> > Subject: Re: Replication not suited for intensive write applications?
> >
> >
> > On Fri, Jun 21, 2013 at 2:38 PM, lars hofhansl <la...@apache.org> wrote:
> >
> >> Another thought...
> >>
> >> I assume you only write to a single table, right? How large are your
> rows
> >> on average?
> >>
> >> I'm writing to 2 tables: Avg row size for 1st table is 1500 bytes, and
> the
> > seconds around is around 800 bytes
> >
> >>
> >> Replication will send 64mb blocks by default (or 25000 edits, whatever
> is
> >> smaller). The default HTable buffer is 2mb only, so the slave RS
> receiving
> >> a block of edits (assuming it is a full block), has to do 32 rounds of
> >> splitting the edits per region in order to apply them.
> >>
> >> In the ReplicationSink.java (0.94.6) I see that HTable.batch() is used,
> > which writes directly to RS without buffers?
> >
> >   private void batch(byte[] tableName, List<Row> rows) throws
> IOException {
> >
> >     if (rows.isEmpty()) {
> >
> >       return;
> >
> >     }
> >
> >     HTableInterface table = null;
> >
> >     try {
> >
> >       table = new HTable(tableName, this.sharedHtableCon, this.
> > sharedThreadPool);
> >
> >       table.batch(rows);
> >
> >       this.metrics.appliedOpsRate.inc(rows.size());
> >
> >     } catch (InterruptedException ix) {
> >
> >       throw new IOException(ix);
> >
> >     } finally {
> >
> >       if (table != null) {
> >
> >         table.close();
> >
> >       }
> >
> >     }
> >
> >   }
> >
> >
> >
> >>
> >> There is no setting specifically targeted at the buffer size for
> >> replication, but maybe you could increase "hbase.client.write.buffer" to
> >> 64mb (67108864) on the slave cluster and see whether that makes a
> >> difference. If it does we can (1) add a setting to control the
> >> ReplicationSink HTable's buffer size, or (2) just have it match the
> >> replication buffer size "replication.source.size.capacity".
> >>
> >>
> >> -- Lars
> >> ________________________________
> >> From: lars hofhansl <la...@apache.org>
> >> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> >> Sent: Friday, June 21, 2013 1:48 AM
> >> Subject: Re: Replication not suited for intensive write applications?
> >>
> >>
> >> Thanks for checking... Interesting. So talking to 3RSs as opposed to
> only
> >> 1 before had no effect on the throughput?
> >>
> >> Would be good to explore this a bit more.
> >> Since our RPC is not streaming, latency will effect throughout. In this
> >> case there is latency while all edits are shipped to the RS in the slave
> >> cluster and then extra latency when applying the edits there (which are
> >> likely not local to that RS). A true streaming API should be better. If
> >> that is the case compression *could* help (but that is a big if).
> >>
> >> The single thread shipping the edits to the slave should not be an issue
> >> as the edits are actually applied by the slave RS, which will use
> multiple
> >> threads to apply the edits in the local cluster.
> >>
> >> Also my first reply - upon re-reading it - sounded a bit rough, that was
> >> not intended.
> >>
> >> -- Lars
> >>
> >>
> >> ----- Original Message -----
> >> From: Asaf Mesika <as...@gmail.com>
> >> To: "user@hbase.apache.org" <us...@hbase.apache.org>; lars hofhansl <
> >> larsh@apache.org>
> >> Cc:
> >> Sent: Thursday, June 20, 2013 10:16 PM
> >> Subject: Re: Replication not suited for intensive write applications?
> >>
> >> Thanks for the taking the time to answer!
> >> My answers are inline.
> >>
> >> On Fri, Jun 21, 2013 at 1:47 AM, lars hofhansl <la...@apache.org>
> wrote:
> >>
> >> > I see.
> >> >
> >> > In HBase you have machines for both CPU (to serve requests) and
> storage
> >> > (to hold the data).
> >> >
> >> > If you only grow your cluster for CPU and you keep all RegionServers
> 100%
> >> > busy at all times, you are correct.
> >> >
> >> > Maybe you need to increase replication.source.size.capacity and/or
> >> > replication.source.nb.capacity (although I doubt that this will help
> >> here).
> >> >
> >> > I was thinking of giving a shot, but theoretically it should not
> affect,
> >> since I'm doing anything in parallel, right?
> >>
> >>
> >> > Also a replication source will pick region server from the target at
> >> > random (10% of them at default). That has two effects:
> >> > 1. Each source will pick exactly one RS at the target: ceil (3*0.1)=1
> >> > 2. With such a small cluster setup the likelihood is high that two or
> >> more
> >> > RSs in the source will happen to pick the same RS at the target. Thus
> >> > leading less throughput.
> >> >
> >> You are absolutely correct. In Graphite, in the beginning, I saw that
> only
> >> one slave RS was getting all replicateLogEntries RPC calls. I search the
> >> master RS logs and saw "Choose Peer" as follows:
> >> Master RS 74: Choose peer 83
> >> Master RS 75: Choose peer 83
> >> Master RS 76: Choose peer 85
> >> From some reason, they ALL talked to 83 (which seems like a bug to me).
> >>
> >> I thought I nailed the bottleneck, so I've changed the factor from 0.1
> to
> >> 1. It had the exact you described, and now all RS were getting the same
> >> amount of replicateLogEntries RPC calls, BUT it didn't budge the
> >> replication throughput. When I checked the network card usage I
> understood
> >> that even when all 3 RS were talking to the same slave RS, network
> wasn't
> >> the bottleneck.
> >>
> >>
> >> >
> >> > In fact your numbers might indicate that two of your source RSs might
> >> have
> >> > picked the same target (you get 2/3 of your throughput via
> replication).
> >> >
> >> >
> >> > In any case, before drawing conclusions this should be tested with a
> >> > larger cluster.
> >> > Maybe set replication.source.ratio from 0.1 to 1 (thus the source RSs
> >> will
> >> > round robin all target RSs and lead to better distribution), but that
> >> might
> >> > have other side-effects, too.
> >> >
> >> I'll try getting two clusters of 10 RS each and see if that helps. I
> >> suspect it won't. My hunch is that: since we're replicating with no more
> >> than 10 threads, than if I take my client and set it to 10 threads and
> >> measure the throughput, this will the maximum replication throughput.
> Thus,
> >> if my client will write with let's say 20 threads (or have two client
> with
> >> 10 threads each), than I'm bound to reach an ever increasing
> >> ageOfLastShipped.
> >>
> >> >
> >> > Did you measure the disk IO at each RS at the target? Maybe one of
> them
> >> is
> >> > mostly idle.
> >> >
> >> > I didn't, but I did run my client directly at the slave cluster and
> >> measure throughput and got 18 MB/sec which is bigger than the
> replication
> >> throughput of 11 MB/sec, thus I concluded hard drives couldn't be the
> >> bottleneck here.
> >>
> >> I was thinking of somehow tweaking HBase a bit for my use case: I always
> >> send Puts with new row KV (never update or delete), thus I have no
> >> importance for ordering, thus maybe enable with a flag the ability, on a
> >> certain column family to open multiple threads at the Replication
> Source?
> >>
> >> One more question - keeping the one thread in mind here, having
> compression
> >> on the replicateLogEntries RPC call, shouldn't really help here right?
> >> Since the entire RPC call time is mostly the time it takes to run the
> >> HTable.batch call on the slave RS, right? If I enable compression
> somehow
> >> (hack HBase code to test drive it), I will only speed up transfer time
> of
> >> the batch to the slave RS, but still wait on the insertion of this batch
> >> into the slave cluster.
> >>
> >>
> >>
> >>
> >>
> >>
> >> > -- Lars
> >> > ________________________________
> >> > From: Asaf Mesika <as...@gmail.com>
> >> > To: "user@hbase.apache.org" <us...@hbase.apache.org>; lars hofhansl <
> >> > larsh@apache.org>
> >> > Sent: Thursday, June 20, 2013 1:38 PM
> >> > Subject: Re: Replication not suited for intensive write applications?
> >> >
> >> >
> >> > Thanks for the answer!
> >> > My responses are inline.
> >> >
> >> > On Thu, Jun 20, 2013 at 11:02 PM, lars hofhansl <la...@apache.org>
> >> wrote:
> >> >
> >> > > First off, this is a pretty constructed case leading to a specious
> >> > general
> >> > > conclusion.
> >> > >
> >> > > If you only have three RSs/DNs and the default replication factor
> of 3,
> >> > > each machine will get every single write.
> >> > > That is the first issue. Using HBase makes little sense with such a
> >> small
> >> > > cluster.
> >> > >
> >> > You are correct, non the less - network as I measured, was far from
> its
> >> > capacity thus probably not the bottleneck.
> >> >
> >> > >
> >> > > Secondly, as you say yourself, there are only three regionservers
> >> writing
> >> > > to the replicated cluster using a single thread each in order to
> >> preserve
> >> > > ordering.
> >> > > With more region servers your scale will tip the other way. Again
> more
> >> > > regionservers will make this better.
> >> > >
> >> > > I presume, in production, I will add more region servers to
> accommodate
> >> > growing write demand on my cluster. Hence, my clients will write with
> >> more
> >> > threads. Thus proportionally I will always have a lot more client
> threads
> >> > than the number of region servers (each has one replication thread).
> So,
> >> I
> >> > don't see how adding more region servers will tip the scale to other
> >> side.
> >> > The only way to avoid this, is to design the cluster in such a way
> that
> >> if
> >> > I can handle the events received at the client which write them to
> HBase
> >> > with x Threads, this is the amount of region servers I should have.
> If I
> >> > will have a spike, then it will even out eventually, but this under
> >> > utilizing my cluster hardware, no?
> >> >
> >> >
> >> > > As for your other question, more threads can lead to better
> >> interleaving
> >> > > of CPU and IO, thus leading to better throughput (this relationship
> is
> >> > not
> >> > > linear, though).
> >> > >
> >> > >
> >> >
> >> > >
> >> > > -- Lars
> >> > >
> >> > >
> >> > >
> >> > > ----- Original Message -----
> >> > > From: Asaf Mesika <as...@gmail.com>
> >> > > To: "user@hbase.apache.org" <us...@hbase.apache.org>
> >> > > Cc:
> >> > > Sent: Thursday, June 20, 2013 3:46 AM
> >> > > Subject: Replication not suited for intensive write applications?
> >> > >
> >> > > Hi,
> >> > >
> >> > > I've been conducting lots of benchmarks to test the maximum
> throughput
> >> of
> >> > > replication in HBase.
> >> > >
> >> > > I've come to the conclusion that HBase replication is not suited for
> >> > write
> >> > > intensive application. I hope that people here can show me where I'm
> >> > wrong.
> >> > >
> >> > > *My setup*
> >> > > *Cluster (*Master and slave are alike)
> >> > > 1 Master, NameNode
> >> > > 3 RS, Data Node
> >> > >
> >> > > All computers are the same: 8 Cores x 3.4 GHz, 8 GB Ram, 1 Gigabit
> >> > ethernet
> >> > > card
> >> > >
> >> > > I insert data into HBase from a java process (client) reading files
> >> from
> >> > > disk, running on the machine running the HBase Master in the master
> >> > > cluster.
> >> > >
> >> > > *Benchmark Results*
> >> > > When the client writes with 10 Threads, then the master cluster
> writes
> >> at
> >> > > 17 MB/sec, while the replicated cluster writes at 12 Mb/sec. The
> data
> >> > size
> >> > > I wrote is 15 GB, all Puts, to two different tables.
> >> > > Both clusters when tested independently without replication,
> achieved
> >> > write
> >> > > throughput of 17-19 MB/sec, so evidently the replication process is
> the
> >> > > bottleneck.
> >> > >
> >> > > I also tested connectivity between the two clusters using "netcat"
> and
> >> > > achieved 111 MB/sec.
> >> > > I've checked the usage of the network cards both on the client,
> master
> >> > > cluster region server and slave region servers. No computer when
> over
> >> > > 30mb/sec in Receive or Transmit.
> >> > > The way I checked was rather crud but works: I've run "netstat -ie"
> >> > before
> >> > > HBase in the master cluster starts writing and after it finishes.
> The
> >> > same
> >> > > was done on the replicated cluster (when the replication started and
> >> > > finished). I can tell the amount of bytes Received and Transmitted
> and
> >> I
> >> > > know that duration each cluster worked, thus I can calculate the
> >> > > throughput.
> >> > >
> >> > > *The bottleneck in my opinion*
> >> > > Since we've excluded network capacity, and each cluster works at
> faster
> >> > > rate independently, all is left is the replication process.
> >> > > My client writes to the master cluster with 10 Threads, and manages
> to
> >> > > write at 17-18 MB/sec.
> >> > > Each region server has only 1 thread responsible for transmitting
> the
> >> > data
> >> > > written to the WAL to the slave cluster. Thus in my setup I
> effectively
> >> > > have 3 threads writing to the slave cluster.  Thus this is the
> >> > bottleneck,
> >> > > since this process can not be parallelized, since it must transmit
> the
> >> > WAL
> >> > > in a certain order.
> >> > >
> >> > > *Conclusion*
> >> > > When writes intensively to HBase with more than 3 threads (in my
> >> setup),
> >> > > you can't use replication.
> >> > >
> >> > > *Master throughput without replication*
> >> > > On a different note, I have one thing I couldn't understand at all.
> >> > > When turned off replication, and wrote with my client with 3
> threads I
> >> > got
> >> > > throughput of 11.3 MB/sec. When I wrote with 10 Threads (any more
> than
> >> > that
> >> > > doesn't help) I achieved maximum throughput of 19 MB/sec.
> >> > > The network cards showed 30MB/sec Receive and 20MB/sec Transmit on
> each
> >> > RS,
> >> > > thus the network capacity was not the bottleneck.
> >> > > On the HBase master machine which ran the client, the network card
> >> again
> >> > > showed Receive throughput of 0.5MB/sec and Transmit throughput of
> 18.28
> >> > > MB/sec. Hence it's the client machine network card creating the
> >> > bottleneck.
> >> > >
> >> > > The only explanation I have is the synchronized writes to the WAL.
> >> Those
> >> > 10
> >> > > threads have to get in line, and one by one, write their batch of
> Puts
> >> to
> >> > > the WAL, which creates a bottleneck.
> >> > >
> >> > > *My question*:
> >> > > The one thing I couldn't understand is: When I write with 3 Threads,
> >> > > meaning I have no more than 3 concurrent RPC requests to write in
> each
> >> > RS.
> >> > > They achieved 11.3 MB/sec.
> >> > > The write to the WAL is synchronized, so why increasing the number
> of
> >> > > threads to 10 (x3 more) actually increased the throughput to 19
> MB/sec?
> >> > > They all get in line to write to the same location, so it seems have
> >> > > concurrent write shouldn't improve throughput at all.
> >> > >
> >> > >
> >> > > Thanks you!
> >> > >
> >> > > Asaf
> >> > > *
> >> > > *
> >> > >
> >> > >
> >> >
> >>
>

Re: Replication not suited for intensive write applications?

Posted by Jean-Daniel Cryans <jd...@apache.org>.
I think that the same way writing with more clients helped throughput,
writing with only 1 replication thread will hurt it. The clients in
both cases have to read something (a file from HDFS or the WAL) then
ship it, meaning that you can utilize the cluster better since a
single client isn't consistently writing.

I agree with Asaf's assessment that it's possible that you can write
faster into HBase than you can replicate from it if your clients are
using the write buffers and have a bigger aggregate throughput than
replication's.

I'm not sure if it's really a problem tho.

J-D

On Fri, Jun 21, 2013 at 3:05 PM, lars hofhansl <la...@apache.org> wrote:
> Hmm... Yes. Was worth a try :)  Should've checked and I even wrote that part of the code.
>
> I have no good explanation then, and also no good suggestion about how to improve this.
>
>
>
> ________________________________
>  From: Asaf Mesika <as...@gmail.com>
> To: "user@hbase.apache.org" <us...@hbase.apache.org>; lars hofhansl <la...@apache.org>
> Sent: Friday, June 21, 2013 5:50 AM
> Subject: Re: Replication not suited for intensive write applications?
>
>
> On Fri, Jun 21, 2013 at 2:38 PM, lars hofhansl <la...@apache.org> wrote:
>
>> Another thought...
>>
>> I assume you only write to a single table, right? How large are your rows
>> on average?
>>
>> I'm writing to 2 tables: Avg row size for 1st table is 1500 bytes, and the
> seconds around is around 800 bytes
>
>>
>> Replication will send 64mb blocks by default (or 25000 edits, whatever is
>> smaller). The default HTable buffer is 2mb only, so the slave RS receiving
>> a block of edits (assuming it is a full block), has to do 32 rounds of
>> splitting the edits per region in order to apply them.
>>
>> In the ReplicationSink.java (0.94.6) I see that HTable.batch() is used,
> which writes directly to RS without buffers?
>
>   private void batch(byte[] tableName, List<Row> rows) throws IOException {
>
>     if (rows.isEmpty()) {
>
>       return;
>
>     }
>
>     HTableInterface table = null;
>
>     try {
>
>       table = new HTable(tableName, this.sharedHtableCon, this.
> sharedThreadPool);
>
>       table.batch(rows);
>
>       this.metrics.appliedOpsRate.inc(rows.size());
>
>     } catch (InterruptedException ix) {
>
>       throw new IOException(ix);
>
>     } finally {
>
>       if (table != null) {
>
>         table.close();
>
>       }
>
>     }
>
>   }
>
>
>
>>
>> There is no setting specifically targeted at the buffer size for
>> replication, but maybe you could increase "hbase.client.write.buffer" to
>> 64mb (67108864) on the slave cluster and see whether that makes a
>> difference. If it does we can (1) add a setting to control the
>> ReplicationSink HTable's buffer size, or (2) just have it match the
>> replication buffer size "replication.source.size.capacity".
>>
>>
>> -- Lars
>> ________________________________
>> From: lars hofhansl <la...@apache.org>
>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
>> Sent: Friday, June 21, 2013 1:48 AM
>> Subject: Re: Replication not suited for intensive write applications?
>>
>>
>> Thanks for checking... Interesting. So talking to 3RSs as opposed to only
>> 1 before had no effect on the throughput?
>>
>> Would be good to explore this a bit more.
>> Since our RPC is not streaming, latency will effect throughout. In this
>> case there is latency while all edits are shipped to the RS in the slave
>> cluster and then extra latency when applying the edits there (which are
>> likely not local to that RS). A true streaming API should be better. If
>> that is the case compression *could* help (but that is a big if).
>>
>> The single thread shipping the edits to the slave should not be an issue
>> as the edits are actually applied by the slave RS, which will use multiple
>> threads to apply the edits in the local cluster.
>>
>> Also my first reply - upon re-reading it - sounded a bit rough, that was
>> not intended.
>>
>> -- Lars
>>
>>
>> ----- Original Message -----
>> From: Asaf Mesika <as...@gmail.com>
>> To: "user@hbase.apache.org" <us...@hbase.apache.org>; lars hofhansl <
>> larsh@apache.org>
>> Cc:
>> Sent: Thursday, June 20, 2013 10:16 PM
>> Subject: Re: Replication not suited for intensive write applications?
>>
>> Thanks for the taking the time to answer!
>> My answers are inline.
>>
>> On Fri, Jun 21, 2013 at 1:47 AM, lars hofhansl <la...@apache.org> wrote:
>>
>> > I see.
>> >
>> > In HBase you have machines for both CPU (to serve requests) and storage
>> > (to hold the data).
>> >
>> > If you only grow your cluster for CPU and you keep all RegionServers 100%
>> > busy at all times, you are correct.
>> >
>> > Maybe you need to increase replication.source.size.capacity and/or
>> > replication.source.nb.capacity (although I doubt that this will help
>> here).
>> >
>> > I was thinking of giving a shot, but theoretically it should not affect,
>> since I'm doing anything in parallel, right?
>>
>>
>> > Also a replication source will pick region server from the target at
>> > random (10% of them at default). That has two effects:
>> > 1. Each source will pick exactly one RS at the target: ceil (3*0.1)=1
>> > 2. With such a small cluster setup the likelihood is high that two or
>> more
>> > RSs in the source will happen to pick the same RS at the target. Thus
>> > leading less throughput.
>> >
>> You are absolutely correct. In Graphite, in the beginning, I saw that only
>> one slave RS was getting all replicateLogEntries RPC calls. I search the
>> master RS logs and saw "Choose Peer" as follows:
>> Master RS 74: Choose peer 83
>> Master RS 75: Choose peer 83
>> Master RS 76: Choose peer 85
>> From some reason, they ALL talked to 83 (which seems like a bug to me).
>>
>> I thought I nailed the bottleneck, so I've changed the factor from 0.1 to
>> 1. It had the exact you described, and now all RS were getting the same
>> amount of replicateLogEntries RPC calls, BUT it didn't budge the
>> replication throughput. When I checked the network card usage I understood
>> that even when all 3 RS were talking to the same slave RS, network wasn't
>> the bottleneck.
>>
>>
>> >
>> > In fact your numbers might indicate that two of your source RSs might
>> have
>> > picked the same target (you get 2/3 of your throughput via replication).
>> >
>> >
>> > In any case, before drawing conclusions this should be tested with a
>> > larger cluster.
>> > Maybe set replication.source.ratio from 0.1 to 1 (thus the source RSs
>> will
>> > round robin all target RSs and lead to better distribution), but that
>> might
>> > have other side-effects, too.
>> >
>> I'll try getting two clusters of 10 RS each and see if that helps. I
>> suspect it won't. My hunch is that: since we're replicating with no more
>> than 10 threads, than if I take my client and set it to 10 threads and
>> measure the throughput, this will the maximum replication throughput. Thus,
>> if my client will write with let's say 20 threads (or have two client with
>> 10 threads each), than I'm bound to reach an ever increasing
>> ageOfLastShipped.
>>
>> >
>> > Did you measure the disk IO at each RS at the target? Maybe one of them
>> is
>> > mostly idle.
>> >
>> > I didn't, but I did run my client directly at the slave cluster and
>> measure throughput and got 18 MB/sec which is bigger than the replication
>> throughput of 11 MB/sec, thus I concluded hard drives couldn't be the
>> bottleneck here.
>>
>> I was thinking of somehow tweaking HBase a bit for my use case: I always
>> send Puts with new row KV (never update or delete), thus I have no
>> importance for ordering, thus maybe enable with a flag the ability, on a
>> certain column family to open multiple threads at the Replication Source?
>>
>> One more question - keeping the one thread in mind here, having compression
>> on the replicateLogEntries RPC call, shouldn't really help here right?
>> Since the entire RPC call time is mostly the time it takes to run the
>> HTable.batch call on the slave RS, right? If I enable compression somehow
>> (hack HBase code to test drive it), I will only speed up transfer time of
>> the batch to the slave RS, but still wait on the insertion of this batch
>> into the slave cluster.
>>
>>
>>
>>
>>
>>
>> > -- Lars
>> > ________________________________
>> > From: Asaf Mesika <as...@gmail.com>
>> > To: "user@hbase.apache.org" <us...@hbase.apache.org>; lars hofhansl <
>> > larsh@apache.org>
>> > Sent: Thursday, June 20, 2013 1:38 PM
>> > Subject: Re: Replication not suited for intensive write applications?
>> >
>> >
>> > Thanks for the answer!
>> > My responses are inline.
>> >
>> > On Thu, Jun 20, 2013 at 11:02 PM, lars hofhansl <la...@apache.org>
>> wrote:
>> >
>> > > First off, this is a pretty constructed case leading to a specious
>> > general
>> > > conclusion.
>> > >
>> > > If you only have three RSs/DNs and the default replication factor of 3,
>> > > each machine will get every single write.
>> > > That is the first issue. Using HBase makes little sense with such a
>> small
>> > > cluster.
>> > >
>> > You are correct, non the less - network as I measured, was far from its
>> > capacity thus probably not the bottleneck.
>> >
>> > >
>> > > Secondly, as you say yourself, there are only three regionservers
>> writing
>> > > to the replicated cluster using a single thread each in order to
>> preserve
>> > > ordering.
>> > > With more region servers your scale will tip the other way. Again more
>> > > regionservers will make this better.
>> > >
>> > > I presume, in production, I will add more region servers to accommodate
>> > growing write demand on my cluster. Hence, my clients will write with
>> more
>> > threads. Thus proportionally I will always have a lot more client threads
>> > than the number of region servers (each has one replication thread). So,
>> I
>> > don't see how adding more region servers will tip the scale to other
>> side.
>> > The only way to avoid this, is to design the cluster in such a way that
>> if
>> > I can handle the events received at the client which write them to HBase
>> > with x Threads, this is the amount of region servers I should have. If I
>> > will have a spike, then it will even out eventually, but this under
>> > utilizing my cluster hardware, no?
>> >
>> >
>> > > As for your other question, more threads can lead to better
>> interleaving
>> > > of CPU and IO, thus leading to better throughput (this relationship is
>> > not
>> > > linear, though).
>> > >
>> > >
>> >
>> > >
>> > > -- Lars
>> > >
>> > >
>> > >
>> > > ----- Original Message -----
>> > > From: Asaf Mesika <as...@gmail.com>
>> > > To: "user@hbase.apache.org" <us...@hbase.apache.org>
>> > > Cc:
>> > > Sent: Thursday, June 20, 2013 3:46 AM
>> > > Subject: Replication not suited for intensive write applications?
>> > >
>> > > Hi,
>> > >
>> > > I've been conducting lots of benchmarks to test the maximum throughput
>> of
>> > > replication in HBase.
>> > >
>> > > I've come to the conclusion that HBase replication is not suited for
>> > write
>> > > intensive application. I hope that people here can show me where I'm
>> > wrong.
>> > >
>> > > *My setup*
>> > > *Cluster (*Master and slave are alike)
>> > > 1 Master, NameNode
>> > > 3 RS, Data Node
>> > >
>> > > All computers are the same: 8 Cores x 3.4 GHz, 8 GB Ram, 1 Gigabit
>> > ethernet
>> > > card
>> > >
>> > > I insert data into HBase from a java process (client) reading files
>> from
>> > > disk, running on the machine running the HBase Master in the master
>> > > cluster.
>> > >
>> > > *Benchmark Results*
>> > > When the client writes with 10 Threads, then the master cluster writes
>> at
>> > > 17 MB/sec, while the replicated cluster writes at 12 Mb/sec. The data
>> > size
>> > > I wrote is 15 GB, all Puts, to two different tables.
>> > > Both clusters when tested independently without replication, achieved
>> > write
>> > > throughput of 17-19 MB/sec, so evidently the replication process is the
>> > > bottleneck.
>> > >
>> > > I also tested connectivity between the two clusters using "netcat" and
>> > > achieved 111 MB/sec.
>> > > I've checked the usage of the network cards both on the client, master
>> > > cluster region server and slave region servers. No computer when over
>> > > 30mb/sec in Receive or Transmit.
>> > > The way I checked was rather crud but works: I've run "netstat -ie"
>> > before
>> > > HBase in the master cluster starts writing and after it finishes. The
>> > same
>> > > was done on the replicated cluster (when the replication started and
>> > > finished). I can tell the amount of bytes Received and Transmitted and
>> I
>> > > know that duration each cluster worked, thus I can calculate the
>> > > throughput.
>> > >
>> > > *The bottleneck in my opinion*
>> > > Since we've excluded network capacity, and each cluster works at faster
>> > > rate independently, all is left is the replication process.
>> > > My client writes to the master cluster with 10 Threads, and manages to
>> > > write at 17-18 MB/sec.
>> > > Each region server has only 1 thread responsible for transmitting the
>> > data
>> > > written to the WAL to the slave cluster. Thus in my setup I effectively
>> > > have 3 threads writing to the slave cluster.  Thus this is the
>> > bottleneck,
>> > > since this process can not be parallelized, since it must transmit the
>> > WAL
>> > > in a certain order.
>> > >
>> > > *Conclusion*
>> > > When writes intensively to HBase with more than 3 threads (in my
>> setup),
>> > > you can't use replication.
>> > >
>> > > *Master throughput without replication*
>> > > On a different note, I have one thing I couldn't understand at all.
>> > > When turned off replication, and wrote with my client with 3 threads I
>> > got
>> > > throughput of 11.3 MB/sec. When I wrote with 10 Threads (any more than
>> > that
>> > > doesn't help) I achieved maximum throughput of 19 MB/sec.
>> > > The network cards showed 30MB/sec Receive and 20MB/sec Transmit on each
>> > RS,
>> > > thus the network capacity was not the bottleneck.
>> > > On the HBase master machine which ran the client, the network card
>> again
>> > > showed Receive throughput of 0.5MB/sec and Transmit throughput of 18.28
>> > > MB/sec. Hence it's the client machine network card creating the
>> > bottleneck.
>> > >
>> > > The only explanation I have is the synchronized writes to the WAL.
>> Those
>> > 10
>> > > threads have to get in line, and one by one, write their batch of Puts
>> to
>> > > the WAL, which creates a bottleneck.
>> > >
>> > > *My question*:
>> > > The one thing I couldn't understand is: When I write with 3 Threads,
>> > > meaning I have no more than 3 concurrent RPC requests to write in each
>> > RS.
>> > > They achieved 11.3 MB/sec.
>> > > The write to the WAL is synchronized, so why increasing the number of
>> > > threads to 10 (x3 more) actually increased the throughput to 19 MB/sec?
>> > > They all get in line to write to the same location, so it seems have
>> > > concurrent write shouldn't improve throughput at all.
>> > >
>> > >
>> > > Thanks you!
>> > >
>> > > Asaf
>> > > *
>> > > *
>> > >
>> > >
>> >
>>

Re: Replication not suited for intensive write applications?

Posted by lars hofhansl <la...@apache.org>.
Hmm... Yes. Was worth a try :)  Should've checked and I even wrote that part of the code.

I have no good explanation then, and also no good suggestion about how to improve this.




Re: Replication not suited for intensive write applications?

Posted by Asaf Mesika <as...@gmail.com>.
On Fri, Jun 21, 2013 at 2:38 PM, lars hofhansl <la...@apache.org> wrote:

> Another thought...
>
> I assume you only write to a single table, right? How large are your rows
> on average?
>
I'm writing to 2 tables: the average row size for the 1st table is 1500
bytes, and for the 2nd it is around 800 bytes.

>
> Replication will send 64mb blocks by default (or 25000 edits, whichever
> is smaller). The default HTable buffer is 2mb only, so a slave RS
> receiving a block of edits (assuming it is a full block) has to do 32
> rounds of splitting the edits per region in order to apply them.
>
In ReplicationSink.java (0.94.6) I see that HTable.batch() is used, which
writes directly to the region servers without any client-side buffering, right?

  private void batch(byte[] tableName, List<Row> rows) throws IOException {
    if (rows.isEmpty()) {
      return;
    }
    HTableInterface table = null;
    try {
      table = new HTable(tableName, this.sharedHtableCon, this.sharedThreadPool);
      table.batch(rows);
      this.metrics.appliedOpsRate.inc(rows.size());
    } catch (InterruptedException ix) {
      throw new IOException(ix);
    } finally {
      if (table != null) {
        table.close();
      }
    }
  }
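
For comparison, a minimal sketch of what a buffered client-side write path
looks like against the 0.94 API (the table, family and row names below are
placeholders I invented; this is just to illustrate the client write buffer
discussed next, not actual HBase source):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedWriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "my_table");  // hypothetical table
    table.setAutoFlush(false);                    // buffer Puts client-side
    table.setWriteBufferSize(64 * 1024 * 1024L);  // e.g. match the 64mb block
    try {
      for (int i = 0; i < 100000; i++) {
        Put p = new Put(Bytes.toBytes("row-" + i));
        p.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v"));
        table.put(p);        // queued in the buffer, flushed when it fills
      }
      table.flushCommits();  // push any remaining buffered edits
    } finally {
      table.close();
    }
  }
}

With autoFlush disabled, Puts accumulate in the client buffer and go out in
bulk once it fills, whereas the batch() call above ships its list immediately.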



>
> There is no setting specifically targeted at the buffer size for
> replication, but maybe you could increase "hbase.client.write.buffer" to
> 64mb (67108864) on the slave cluster and see whether that makes a
> difference. If it does we can (1) add a setting to control the
> ReplicationSink HTable's buffer size, or (2) just have it match the
> replication buffer size "replication.source.size.capacity".
>
>
> -- Lars

Re: Replication not suited for intensive write applications?

Posted by lars hofhansl <la...@apache.org>.
Another thought...

I assume you only write to a single table, right? How large are your rows on average?


Replication will send 64mb blocks by default (or 25000 edits, whichever is smaller). The default HTable buffer is 2mb only, so a slave RS receiving a block of edits (assuming it is a full block) has to do 32 rounds of splitting the edits per region in order to apply them.


There is no setting specifically targeted at the buffer size for replication, but maybe you could increase "hbase.client.write.buffer" to 64mb (67108864) on the slave cluster and see whether that makes a difference. If it does we can (1) add a setting to control the ReplicationSink HTable's buffer size, or (2) just have it match the replication buffer size "replication.source.size.capacity".
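
If it helps, a minimal sketch of that override done programmatically (purely
illustrative; in practice you would just add the property to the slave
cluster's hbase-site.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class SinkBufferConfig {
  public static Configuration create() {
    Configuration conf = HBaseConfiguration.create();
    // Raise the client write buffer from the 2mb default to 64mb, so a
    // full replication block fits into a single buffered flush:
    conf.setLong("hbase.client.write.buffer", 67108864L);
    return conf;
  }
}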


-- Lars

Re: Replication not suited for intensive write applications?

Posted by lars hofhansl <la...@apache.org>.
Thanks for checking... Interesting. So talking to 3 RSs as opposed to only 1 before had no effect on the throughput?

Would be good to explore this a bit more.
Since our RPC is not streaming, latency will affect throughput: a source thread can have only one batch of edits in flight at a time. In this case there is latency while all edits are shipped to the RS in the slave cluster and then extra latency when applying the edits there (which are likely not local to that RS). A true streaming API should be better. If that is the case compression *could* help (but that is a big if).
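
As a toy illustration of that cap (the round-trip time below is invented,
not measured):

public class ShipRateMath {
  public static void main(String[] args) {
    double batchMb = 64.0;      // default replication shipping block size
    double roundTripSec = 6.0;  // hypothetical ship + apply time per batch
    // A non-streaming source thread ships one batch per round trip:
    System.out.println(batchMb / roundTripSec + " MB/sec per source thread");
  }
}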

The single thread shipping the edits to the slave should not be an issue as the edits are actually applied by the slave RS, which will use multiple threads to apply the edits in the local cluster.

Also, my first reply - upon re-reading it - sounded a bit rough; that was not intended.

-- Lars




Re: Replication not suited for intensive write applications?

Posted by Asaf Mesika <as...@gmail.com>.
Thanks for taking the time to answer!
My answers are inline.

On Fri, Jun 21, 2013 at 1:47 AM, lars hofhansl <la...@apache.org> wrote:

> I see.
>
> In HBase you have machines for both CPU (to serve requests) and storage
> (to hold the data).
>
> If you only grow your cluster for CPU and you keep all RegionServers 100%
> busy at all times, you are correct.
>
> Maybe you need to increase replication.source.size.capacity and/or
> replication.source.nb.capacity (although I doubt that this will help here).
>
I was thinking of giving it a shot, but theoretically it should not have
any effect, since I'm not doing anything in parallel, right?


> Also a replication source will pick region servers from the target at
> random (10% of them by default). That has two effects:
> 1. Each source will pick exactly one RS at the target: ceil(3*0.1)=1
> 2. With such a small cluster setup the likelihood is high that two or more
> RSs in the source will happen to pick the same RS at the target, thus
> leading to less throughput.
>
You are absolutely correct. In Graphite, in the beginning, I saw that only
one slave RS was getting all the replicateLogEntries RPC calls. I searched
the master RS logs and saw "Choose Peer" entries as follows:
Master RS 74: Choose peer 83
Master RS 75: Choose peer 83
Master RS 76: Choose peer 85
For some reason, they ALL talked to 83 (which seems like a bug to me).

I thought I had nailed the bottleneck, so I changed the factor from 0.1 to
1. It had the exact effect you described, and now all the RSs were getting
the same amount of replicateLogEntries RPC calls, BUT it didn't budge the
replication throughput. When I checked the network card usage I understood
that even back when all 3 RSs were talking to the same slave RS, the
network wasn't the bottleneck.


>
> In fact your numbers might indicate that two of your source RSs might have
> picked the same target (you get 2/3 of your throughput via replication).
>
>
> In any case, before drawing conclusions this should be tested with a
> larger cluster.
> Maybe set replication.source.ratio from 0.1 to 1 (thus the source RSs will
> round robin all target RSs and lead to better distribution), but that might
> have other side-effects, too.
>
I'll try getting two clusters of 10 RSs each and see if that helps. I
suspect it won't. My hunch is that since we're replicating with no more
than 10 threads, if I take my client, set it to 10 threads, and measure the
throughput, that will be the maximum replication throughput. Thus, if my
client writes with, say, 20 threads (or I have two clients with 10 threads
each), then I'm bound to reach an ever-increasing ageOfLastShipped.

>
> Did you measure the disk IO at each RS at the target? Maybe one of them is
> mostly idle.
>
I didn't, but I did run my client directly against the slave cluster and
measured a throughput of 18 MB/sec, which is higher than the replication
throughput of 11 MB/sec, so I concluded the hard drives couldn't be the
bottleneck here.

I was thinking of tweaking HBase a bit for my use case: I always send Puts
with new row keys (never updates or deletes), so ordering is of no
importance to me. Maybe add a flag that, for a certain column family,
enables the ReplicationSource to open multiple threads?
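
Roughly what I have in mind, as a toy sketch only (this is not the actual
ReplicationSource/Sink code; the class and the chunking are invented, and it
assumes cross-chunk ordering truly doesn't matter, as with my insert-only
workload):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Row;

public class ParallelApplySketch {
  // Apply one shipped batch as several concurrent chunks.
  static void applyInParallel(final Configuration conf, final String tableName,
      List<List<Row>> chunks) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(chunks.size());
    try {
      List<Future<Void>> results = new ArrayList<Future<Void>>();
      for (final List<Row> chunk : chunks) {
        results.add(pool.submit(new Callable<Void>() {
          public Void call() throws Exception {
            // One HTable per thread: HTable instances are not thread-safe.
            HTable table = new HTable(conf, tableName);
            try {
              table.batch(chunk);  // order across chunks is not preserved
            } finally {
              table.close();
            }
            return null;
          }
        }));
      }
      for (Future<Void> f : results) {
        f.get();  // propagate any failure from the workers
      }
    } finally {
      pool.shutdown();
    }
  }
}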

One more question - keeping that one thread in mind, compression on the
replicateLogEntries RPC call shouldn't really help here, right? The entire
RPC call time is mostly the time it takes to run the HTable.batch call on
the slave RS. If I enable compression somehow (hack the HBase code to test
drive it), I will only speed up the transfer of the batch to the slave RS,
but will still wait on the insertion of that batch into the slave cluster.






> -- Lars

Re: Replication not suited for intensive write applications?

Posted by lars hofhansl <la...@apache.org>.
I see.

In HBase you have machines for both CPU (to serve requests) and storage (to hold the data).

If you only grow your cluster for CPU and you keep all RegionServers 100% busy at all times, you are correct.

Maybe you need to increase replication.source.size.capacity and/or replication.source.nb.capacity (although I doubt that this will help here).
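
For reference, a minimal sketch of raising those two knobs programmatically
(they would normally go into hbase-site.xml on the source cluster; the
values below are illustrative, not recommendations):

    // Sketch only: bump the replication batch knobs mentioned above.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class ReplicationTuningSketch {
        public static void main(String[] args) {
            Configuration conf = HBaseConfiguration.create();
            // Upper bound (bytes) on one shipped batch of WAL edits
            conf.setLong("replication.source.size.capacity", 16 * 1024 * 1024);
            // Upper bound on the number of WAL entries in one shipped batch
            conf.setInt("replication.source.nb.capacity", 25000);
            System.out.println(conf.get("replication.source.size.capacity"));
        }
    }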

Also, a replication source will pick region servers from the target at random (10% of them by default). That has two effects:
1. Each source will pick exactly one RS at the target: ceil(3*0.1) = 1
2. With such a small cluster setup, the likelihood is high that two or more RSs in the source will happen to pick the same RS at the target, thus leading to less throughput.

In fact, your numbers might indicate that two of your source RSs picked the same target (you get roughly 2/3 of your throughput via replication: 12 of 17-18 MB/sec).
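
Roughly, the selection works like the following sketch (illustrative Java,
not the actual HBase code; server names are made up):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // Each source RS picks ceil(targetCount * replication.source.ratio)
    // sinks at random; with 3 targets and the default ratio of 0.1 that
    // is a single sink, so two sources can easily collide on one target.
    public class SinkSelectionSketch {
        public static void main(String[] args) {
            List<String> targets = new ArrayList<>(
                List.of("slave-rs1", "slave-rs2", "slave-rs3"));
            double ratio = 0.1;  // with ratio = 1, poolSize becomes 3
            int poolSize = (int) Math.ceil(targets.size() * ratio);
            Collections.shuffle(targets);
            System.out.println("sinks = " + targets.subList(0, poolSize));
        }
    }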


In any case, before drawing conclusions this should be tested with a larger cluster.
Maybe set replication.source.ratio from 0.1 to 1 (the source RSs will then round-robin across all target RSs, leading to better distribution), but that might have other side-effects, too.

Did you measure the disk IO at each RS at the target? Maybe one of them is mostly idle.

-- Lars

Re: Replication not suited for intensive write applications?

Posted by Asaf Mesika <as...@gmail.com>.
Thanks for the answer!
My responses are inline.

On Thu, Jun 20, 2013 at 11:02 PM, lars hofhansl <la...@apache.org> wrote:

> First off, this is a pretty contrived case leading to a specious general
> conclusion.
>
> If you only have three RSs/DNs and the default replication factor of 3,
> each machine will get every single write.
> That is the first issue. Using HBase makes little sense with such a small
> cluster.
>
You are correct; nonetheless, the network, as I measured it, was far from
its capacity and thus probably not the bottleneck.

>
> Secondly, as you say yourself, there are only three regionservers writing
> to the replicated cluster using a single thread each in order to preserve
> ordering.
> With more region servers your scale will tip the other way. Again more
> regionservers will make this better.
>
> I presume, in production, I will add more region servers to accommodate
growing write demand on my cluster. Hence, my clients will write with more
threads. Thus proportionally I will always have a lot more client threads
than the number of region servers (each has one replication thread). So, I
don't see how adding more region servers will tip the scale to other side.
The only way to avoid this, is to design the cluster in such a way that if
I can handle the events received at the client which write them to HBase
with x Threads, this is the amount of region servers I should have. If I
will have a spike, then it will even out eventually, but this under
utilizing my cluster hardware, no?
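
To put rough numbers on the imbalance (back-of-envelope only, using the
figures already reported in this thread):

    // 10 client threads push ~18 MB/sec into the master cluster, while
    // 3 replication threads (one per source RS) ship ~12 MB/sec out,
    // i.e. ~4 MB/sec per replication thread. At that per-thread rate,
    // keeping up with ingest would need more source RSs than I have.
    public class ScalingSketch {
        public static void main(String[] args) {
            double ingestMbPerSec = 18.0;
            int sourceRs = 3;
            double replMbPerSec = 12.0;
            double perReplThread = replMbPerSec / sourceRs;  // ~4 MB/sec
            System.out.printf("~%.0f source RSs needed to keep up%n",
                Math.ceil(ingestMbPerSec / perReplThread));  // prints ~5
        }
    }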


> As for your other question, more threads can lead to better interleaving
> of CPU and IO, thus leading to better throughput (this relationship is not
> linear, though).
>
>

>
> -- Lars

Re: Replication not suited for intensive write applications?

Posted by lars hofhansl <la...@apache.org>.
First off, this is a pretty contrived case leading to a specious general conclusion.

If you only have three RSs/DNs and the default replication factor of 3, each machine will get every single write.
That is the first issue. Using HBase makes little sense with such a small cluster.

Secondly, as you say yourself, there are only three regionservers writing to the replicated cluster, using a single thread each in order to preserve ordering.
With more region servers the scale will tip the other way; again, more regionservers will make this better.


As for your other question, more threads can lead to better interleaving of CPU and IO, thus leading to better throughput (this relationship is not linear, though).
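
As a toy illustration of the interleaving point (plain Java, nothing
HBase-specific; timings are arbitrary):

    // Each thread alternates a CPU phase with a simulated IO wait. Total
    // work grows with the thread count, but the IO waits overlap, so wall
    // time grows far more slowly - i.e. throughput improves, sublinearly.
    public class InterleavingSketch {
        static void cpuThenIo() throws InterruptedException {
            long x = 0;
            for (int i = 0; i < 5_000_000; i++) x += i;  // CPU phase
            Thread.sleep(5);                             // simulated IO wait
        }
        public static void main(String[] args) throws Exception {
            for (int threads : new int[] {1, 3, 10}) {
                Thread[] ts = new Thread[threads];
                long start = System.nanoTime();
                for (int i = 0; i < threads; i++) {
                    ts[i] = new Thread(() -> {
                        try {
                            for (int j = 0; j < 20; j++) cpuThenIo();
                        } catch (InterruptedException ignored) { }
                    });
                    ts[i].start();
                }
                for (Thread t : ts) t.join();
                System.out.printf("%2d threads: %d ms%n",
                    threads, (System.nanoTime() - start) / 1_000_000);
            }
        }
    }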


-- Lars




Re: Replication not suited for intensive write applications?

Posted by Varun Sharma <va...@pinterest.com>.
On Thu, Jun 20, 2013 at 11:10 AM, Asaf Mesika <as...@gmail.com> wrote:

> On Thu, Jun 20, 2013 at 7:12 PM, Varun Sharma <va...@pinterest.com> wrote:
>
> > What is the ageOfLastShippedOp as reported on your Master region servers
> > (should be available through the /jmx) - it tells the delay your edits are
> > experiencing before being shipped. If this number is < 1000 (in
> > milliseconds), I would say replication is doing a very good job. This is
> > the most important metric worth tracking and I would be interested in how
> > it looks since we are also looking into using replication for write heavy
> > workloads...
> >
> ageOfLastShippedOp showed 10min on 15GB of inserted data. When I ran the
> test with 50GB, it showed 30min. This was also easily spotted in Graphite:
> I can see when the writeRequests count started increasing on the slave RS
> and when it stopped, and thus can measure the duration of the replication.
>
> Although it is the single most important metric, I had to fire up JConsole
> on the 3 master RSs, since when using hadoop-metrics.properties and
> configuring a context for Graphite (or even a file) I discovered that if
> there is/was a recovered-edits queue of another RS, it reported that
> queue's ageOfLastShippedOp forever instead of the active queue's (since
> there isn't an ageOfLastShippedOp metric per queue).
>

In our tests, run on 0.94.7, we do see ageOfLastShippedOp per queue - so we
would see a giant number for the recovered queue and a small number for the
regular queue. Maybe you are running an older version which does not have
that.

>
>
> > The network on your 2nd cluster could be lower because replication ships
> > edits in batches - so the batching could be amortizing the amount of data
> > sent over the wire. Also, when you are measuring traffic - are you
> > measuring the traffic on the NIC - which will also include traffic due to
> > HDFS replication ?
> >
> My NIC/ethernet measuring is quite simple. I ran "netstat -ie", which gives
> a total counter of bytes, both Receive and Transmit, for my interface
> (eth0). Running it before and after gives you the total amount of bytes. I
> also know the duration of the replication work by watching the
> writeRequestsCount metric settle on the slave RS, thus I can calculate the
> throughput: 15 GB / 14 min.
> Regarding your question - yes, it has to include all traffic on the card,
> which probably includes HDFS replication. There's not much I can do about
> that, though.
> We should note that network capacity is not the issue, since it was
> measured at 30MB/sec Receive and 20MB/sec Transmit, far from the
> measured max bandwidth of 111MB/sec (measured by running nc - netcat).
>
> Yep, saturating the NIC is not easy!

>
>
>
> >
> > On Thu, Jun 20, 2013 at 3:46 AM, Asaf Mesika <as...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > I've been conducting lots of benchmarks to test the maximum throughput
> of
> > > replication in HBase.
> > >
> > > I've come to the conclusion that HBase replication is not suited for
> > write
> > > intensive application. I hope that people here can show me where I'm
> > wrong.
> > >
> > > *My setup*
> > > *Cluster (*Master and slave are alike)
> > > 1 Master, NameNode
> > > 3 RS, Data Node
> > >
> > > All computers are the same: 8 Cores x 3.4 GHz, 8 GB Ram, 1 Gigabit
> > ethernet
> > > card
> > >
> > > I insert data into HBase from a java process (client) reading files
> from
> > > disk, running on the machine running the HBase Master in the master
> > > cluster.
> > >
> > > *Benchmark Results*
> > > When the client writes with 10 Threads, then the master cluster writes
> at
> > > 17 MB/sec, while the replicated cluster writes at 12 Mb/sec. The data
> > size
> > > I wrote is 15 GB, all Puts, to two different tables.
> > > Both clusters when tested independently without replication, achieved
> > write
> > > throughput of 17-19 MB/sec, so evidently the replication process is the
> > > bottleneck.
> > >
> > > I also tested connectivity between the two clusters using "netcat" and
> > > achieved 111 MB/sec.
> > > I've checked the usage of the network cards both on the client, master
> > > cluster region server and slave region servers. No computer when over
> > > 30mb/sec in Receive or Transmit.
> > > The way I checked was rather crud but works: I've run "netstat -ie"
> > before
> > > HBase in the master cluster starts writing and after it finishes. The
> > same
> > > was done on the replicated cluster (when the replication started and
> > > finished). I can tell the amount of bytes Received and Transmitted and
> I
> > > know that duration each cluster worked, thus I can calculate the
> > > throughput.
> > >
> > >  *The bottleneck in my opinion*
> > > Since we've excluded network capacity, and each cluster works at faster
> > > rate independently, all is left is the replication process.
> > > My client writes to the master cluster with 10 Threads, and manages to
> > > write at 17-18 MB/sec.
> > > Each region server has only 1 thread responsible for transmitting the
> > data
> > > written to the WAL to the slave cluster. Thus in my setup I effectively
> > > have 3 threads writing to the slave cluster.  Thus this is the
> > bottleneck,
> > > since this process can not be parallelized, since it must transmit the
> > WAL
> > > in a certain order.
> > >
> > > *Conclusion*
> > > When writes intensively to HBase with more than 3 threads (in my
> setup),
> > > you can't use replication.
> > >
> > > *Master throughput without replication*
> > > On a different note, I have one thing I couldn't understand at all.
> > > When turned off replication, and wrote with my client with 3 threads I
> > got
> > > throughput of 11.3 MB/sec. When I wrote with 10 Threads (any more than
> > that
> > > doesn't help) I achieved maximum throughput of 19 MB/sec.
> > > The network cards showed 30MB/sec Receive and 20MB/sec Transmit on each
> > RS,
> > > thus the network capacity was not the bottleneck.
> > > On the HBase master machine which ran the client, the network card
> again
> > > showed Receive throughput of 0.5MB/sec and Transmit throughput of 18.28
> > > MB/sec. Hence it's the client machine network card creating the
> > bottleneck.
> > >
> > > The only explanation I have is the synchronized writes to the WAL.
> Those
> > 10
> > > threads have to get in line, and one by one, write their batch of Puts
> to
> > > the WAL, which creates a bottleneck.
> > >
> > > *My question*:
> > > The one thing I couldn't understand is: When I write with 3 Threads,
> > > meaning I have no more than 3 concurrent RPC requests to write in each
> > RS.
> > > They achieved 11.3 MB/sec.
> > > The write to the WAL is synchronized, so why increasing the number of
> > > threads to 10 (x3 more) actually increased the throughput to 19 MB/sec?
> > > They all get in line to write to the same location, so it seems have
> > > concurrent write shouldn't improve throughput at all.
> > >
> > >
> > > Thanks you!
> > >
> > > Asaf
> > > *
> > > *
> > >
> >
>

Re: Replication not suited for intensive write applications?

Posted by Asaf Mesika <as...@gmail.com>.
On Thu, Jun 20, 2013 at 7:12 PM, Varun Sharma <va...@pinterest.com> wrote:

> What is the ageOfLastShippedOp as reported on your Master region servers
> (should be available through the /jmx) - it tells the delay your edits are
> experiencing before being shipped. If this number is < 1000 (in
> milliseconds), I would say replication is doing a very good job. This is
> the most important metric worth tracking and I would be interested in how
> it looks since we are also looking into using replication for write heavy
> workloads...
>
ageOfLastShippedOp showed 10min on 15GB of inserted data. When I ran the
test with 50GB, it showed 30min. This was also easily spotted in Graphite:
I can see when the writeRequests count started increasing on the slave RS
and when it stopped, and thus can measure the duration of the replication.

Although it is the single most important metric, I had to fire up JConsole
on the 3 master RSs, since when using hadoop-metrics.properties and
configuring a context for Graphite (or even a file) I discovered that if
there is/was a recovered-edits queue of another RS, it reported that
queue's ageOfLastShippedOp forever instead of the active queue's (since
there isn't an ageOfLastShippedOp metric per queue).


> The network on your 2nd cluster could be lower because replication ships
> edits in batches - so the batching could be amortizing the amount of data
> sent over the wire. Also, when you are measuring traffic - are you
> measuring the traffic on the NIC - which will also include traffic due to
> HDFS replication ?
>
My NIC/ethernet measuring is quite simple. I ran "netstat -ie", which gives
a total counter of bytes, both Receive and Transmit, for my interface
(eth0). Running it before and after gives you the total amount of bytes. I
also know the duration of the replication work by watching the
writeRequestsCount metric settle on the slave RS, thus I can calculate the
throughput: 15 GB / 14 min.
Regarding your question - yes, it has to include all traffic on the card,
which probably includes HDFS replication. There's not much I can do about
that, though.
We should note that network capacity is not the issue, since it was
measured at 30MB/sec Receive and 20MB/sec Transmit, far from the
measured max bandwidth of 111MB/sec (measured by running nc - netcat).
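
Spelled out, the arithmetic is just the following (a sketch with made-up
counter values; only the 15 GB / 14 min figures are from my run):

    // Average throughput from two "netstat -ie" byte-counter readings.
    public class ThroughputFromCounters {
        public static void main(String[] args) {
            long rxBefore = 1_234_567_890L;                      // before run
            long rxAfter = rxBefore + 15L * 1024 * 1024 * 1024;  // +15 GB
            long seconds = 14 * 60;                              // 14 minutes
            double mbPerSec =
                (rxAfter - rxBefore) / (1024.0 * 1024.0) / seconds;
            System.out.printf("~%.1f MB/sec%n", mbPerSec);       // ~18.3
        }
    }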





Re: Replication not suited for intensive write applications?

Posted by Varun Sharma <va...@pinterest.com>.
What is the ageOfLastShippedOp as reported on your Master region servers
(should be available through the /jmx) - it tells the delay your edits are
experiencing before being shipped. If this number is < 1000 (in
milliseconds), I would say replication is doing a very good job. This is
the most important metric worth tracking and I would be interested in how
it looks since we are also looking into using replication for write heavy
workloads...
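
If you don't want to click through JConsole, a minimal sketch that scrapes
the metrics servlet (the host name is made up, and 60030 assumes the usual
regionserver info port default of that era):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    // Crudely scan the /jmx JSON dump of one source RS for the metric.
    public class AgeOfLastShippedOpProbe {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://master-rs1:60030/jmx");
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.contains("ageOfLastShippedOp")) {
                        System.out.println(line.trim());
                    }
                }
            }
        }
    }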

The network on your 2nd cluster could be lower because replication ships
edits in batches - so the batching could be amortizing the amount of data
sent over the wire. Also, when you are measuring traffic - are you
measuring the traffic on the NIC, which will also include traffic due to
HDFS replication?

