Posted to user@accumulo.apache.org by "Slater, David M." <Da...@jhuapl.edu> on 2013/09/18 23:07:49 UTC

BatchWriter performance on 1.4

Hi, I'm running a single-threaded ingestion program that takes data from an input source, parses it into mutations, and then writes those mutations (sequentially) to four different BatchWriters (all on different tables). Most of the time (95%) taken is on adding mutations, e.g. batchWriter.addMutations(mutations); I am wondering how to reduce the time taken by these methods.

1) For the method batchWriter.addMutations(Iterable<Mutation>), does it matter for performance whether the mutations returned by the iterator are sorted in lexicographic order?

2) If the Iterable<Mutation> that I pass to the BatchWriter is very large, will I need to wait for a number of Batches to be written and flushed before it will finish iterating, or does it transfer the elements of the Iterable to a different intermediate list?

3) If that is the case, would it then make sense to spawn off short threads for each time I make use of addMutations?

At a high level, my code looks like this:

BatchWriter bw1 = connector.createBatchWriter(...)
BatchWriter bw2 = ...
...
while(true) {
    String[] data = input.getData();
    List<Mutation> mutations1 = parseData1(data);
    List<Mutation> mutations2 = parseData2(data);
    ...
    bw1.addMutations(mutations1);
    bw2.addMutations(mutations2);
    ...
}

Thanks,
David

Re: BatchWriter performance on 1.4

Posted by Keith Turner <ke...@deenlo.com>.

On Fri, Sep 20, 2013 at 12:47 PM, Slater, David M.
<Da...@jhuapl.edu> wrote:

> I was using flush() after sending a bunch of mutations to the batchwriters
> to limit their latency. I thought it would normally flush the buffer to
> ensure that the maxLatency is not violated. If the maxLatency is quite
> large, how do I ensure that it doesn’t wait a long time before writing?
>

If you are constantly writing to a batch writer, then it will be continually
flushing.   The example debug output I posted was from running
org.apache.accumulo.test.TestIngest (it may be in another package before
1.6).  I ran the following command to write a million random mutations.

accumulo org.apache.accumulo.test.TestIngest --debug -u root -p secret
--timestamp 1 --size 50 --random 56 --rows 1000000 --start 0 --cols 1

I think it defaults to 50M of memory for the batch writer.  It was
continually sending batches of 80K mutations every .45 seconds.   So in
that case the latency of a mutation is probably less than two seconds. But
this is just one tablet server; the behavior would be different on multiple
tablet servers.

In this example if I set the max latency on the batch writer to 30 secs,
then it would never kick in and force a flush.
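
To make the reasoning above concrete, here is a toy Java model of a buffer that is handed off either when it fills or when its oldest entry reaches the max latency. It is only a sketch: FlushModel and simulate are made-up names, and the logic is a simplification, not Accumulo's actual flushing code. The 80,000-mutation capacity and 178 mutations per millisecond are assumptions derived from the "80K mutations every .45 seconds" figure above.

```java
/** Toy model of a write buffer with a size trigger and a max-latency trigger (not Accumulo code). */
final class FlushModel {

    /**
     * Simulates a buffer in 1 ms steps and returns {sizeFlushes, latencyFlushes}.
     * capacity is in mutations; perMs is the hypothetical arrival rate per millisecond.
     */
    static long[] simulate(long capacity, long perMs, long maxLatencyMs, long durationMs) {
        long buffered = 0;      // mutations currently waiting in the buffer
        long oldestAt = -1;     // time the oldest waiting mutation arrived
        long sizeFlushes = 0;
        long latencyFlushes = 0;
        for (long now = 0; now < durationMs; now++) {
            if (buffered == 0) {
                oldestAt = now; // buffer was empty, so this arrival is the oldest
            }
            buffered += perMs;
            if (buffered >= capacity) {
                sizeFlushes++;  // buffer filled before the latency limit was reached
                buffered = 0;
            } else if (now - oldestAt >= maxLatencyMs) {
                latencyFlushes++; // oldest mutation waited too long
                buffered = 0;
            }
        }
        return new long[] {sizeFlushes, latencyFlushes};
    }

    public static void main(String[] args) {
        long[] busy = simulate(80_000, 178, 30_000, 10_000);
        long[] idle = simulate(80_000, 1, 2_000, 10_000);
        System.out.println("busy writer: size=" + busy[0] + " latency=" + busy[1]);
        System.out.println("idle writer: size=" + idle[0] + " latency=" + idle[1]);
    }
}
```

In this model a busy writer fills its buffer about every 450 ms, so a 30 second max latency never comes into play. Only a slow trickle of mutations lets the latency trigger fire.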

> If the returned batch writers are all thread safe, then I’m still going to
> have the bottleneck of their synchronized addMutations method, correct?
>

In my experience, that's not a bottleneck, but you will need to confirm this
for your situation (hopefully the debug output can help you with this).   If
the M threads adding mutations to a queue are going at a faster rate than
the N threads taking mutations and sending them, then the synchronization
around the queue is not the bottleneck.  M threads probably could add to a
synchronized queue at a rate of millions of mutations per second.  N
threads can probably only serialize and send tens or hundreds of thousands
of mutations per second.
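
One way to check the claim about the synchronized queue on your own hardware is to time how fast several threads can add to a single shared queue. The sketch below uses only the JDK. SharedQueueSketch is a hypothetical stand-in, not the batch writer's real buffer, and the thread and item counts are arbitrary.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Toy stand-in for a shared, synchronized mutation buffer (not Accumulo code). */
final class SharedQueueSketch {
    private final Queue<String> queue = new ArrayDeque<>();

    synchronized void add(String mutation) {
        queue.add(mutation);
    }

    synchronized int size() {
        return queue.size();
    }

    /** Runs `producers` threads that each add `perProducer` items; returns adds per second. */
    static double measureAddRate(SharedQueueSketch q, int producers, int perProducer) {
        Thread[] threads = new Thread[producers];
        long start = System.nanoTime();
        for (int t = 0; t < producers; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < perProducer; i++) {
                    q.add("m"); // every add contends for the same lock
                }
            });
            threads[t].start();
        }
        try {
            for (Thread thread : threads) {
                thread.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for producers", e);
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return (double) producers * perProducer / seconds;
    }

    public static void main(String[] args) {
        SharedQueueSketch q = new SharedQueueSketch();
        double rate = measureAddRate(q, 4, 250_000);
        System.out.printf("added %,d items at %,.0f adds/sec%n", q.size(), rate);
    }
}
```

If the add rate measured this way is well above the send rate reported in the TRACE output, the lock around the queue is unlikely to be what limits ingest.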


> I’m looking for “org.apache.accumulo.client.impl” in the
> log4j.properties, generic_logger.xml and the other config files, but can’t
> locate it. Do I need to create a new entry for it there?
>

You can add something to a log4j properties file that's on the class path, or you
can try adding something like the following to your code.  I had the
package wrong; it's correct below.

Logger.getLogger("org.apache.accumulo.core.client.impl").setLevel(Level.TRACE);
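
For the properties-file route, an entry along these lines should have the same effect. This is a sketch that assumes the client uses log4j 1.x properties syntax and that the file being edited is the one actually picked up from the class path.

```properties
# Assumed log4j 1.x syntax: enable TRACE for the Accumulo client implementation package
log4j.logger.org.apache.accumulo.core.client.impl=TRACE
```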


Re: BatchWriter performance on 1.4

Posted by John Vines <vi...@apache.org>.

If you don't want it to wait a long time before writing, then set the
maxLatency lower. That is the entire reason for that setting.



RE: BatchWriter performance on 1.4

Posted by "Slater, David M." <Da...@jhuapl.edu>.

I was using flush() after sending a bunch of mutations to the batchwriters to limit their latency. I thought it would normally flush the buffer to ensure that the maxLatency is not violated. If the maxLatency is quite large, how do I ensure that it doesn't wait a long time before writing?

If the returned batch writers are all thread safe, then I'm still going to have the bottleneck of their synchronized addMutations method, correct?

I'm looking for "org.apache.accumulo.client.impl" in the log4j.properties, generic_logger.xml and the other config files, but can't locate it. Do I need to create a new entry for it there?

Thanks,
David


Re: BatchWriter performance on 1.4

Posted by Keith Turner <ke...@deenlo.com>.

On Thu, Sep 19, 2013 at 5:08 PM, Slater, David M.
<Da...@jhuapl.edu> wrote:

> Thanks Keith, I’m looking at it now. It appears like what I would want. As
> for the proper usage…
>
> Would I create one using the Connector,
> then .getBatchWriter() for each of the tables I’m interested in,
> add data to each of the BatchWriters returned,
>

yes.


> and then hit flush() when I want all of that to get written?
>

Why are you calling flush()?   Doing this frequently will increase RPC
overhead and lower throughput.


> Would the individual batch writers spawned by the multiTableBatchWriter
> still have synchronized addMutations() methods so I would have to worry
> about blocking still, or would that all happen at the flush() method?
>

The returned batch writers are thread safe. They all add to the same
queue/buffer in a synchronized manner.   Calling flush() on any of the
batch writers returned from getBatchWriter() will block the others.

If you set the log4j log level to TRACE for
org.apache.accumulo.client.impl you can see output like the following.
Binning is the process of taking each mutation and deciding which tablet
and tablet server it goes to.

  2013-09-19 18:43:37,261 [impl.ThriftTransportPool] TRACE: Using existing connection to 127.0.0.1:9997
  2013-09-19 18:43:37,393 [impl.TabletLocatorImpl] TRACE: tid=12 oid=13  Binning 80909 mutations for table 3
  2013-09-19 18:43:37,402 [impl.TabletLocatorImpl] TRACE: tid=12 oid=13  Binned 80909 mutations for table 3 to 1 tservers in 0.009 secs
  2013-09-19 18:43:37,402 [impl.TabletServerBatchWriter] TRACE: Started sending 80,909 mutations to 1 tablet servers
  2013-09-19 18:43:37,656 [impl.ThriftTransportPool] TRACE: Returned connection 127.0.0.1:9997 (120000) ioCount : 1459116
  2013-09-19 18:43:37,657 [impl.TabletServerBatchWriter] TRACE: sent 80,909 mutations to 127.0.0.1:9997 in 0.40 secs (204,832.91 mutations/sec) with 0 failures

When you close the batch writer, it will log some summary stats like the
following.


  2013-09-19 18:43:39,149 [impl.TabletServerBatchWriter] TRACE:
  2013-09-19 18:43:39,149 [impl.TabletServerBatchWriter] TRACE: TABLET SERVER BATCH WRITER STATISTICS
  2013-09-19 18:43:39,149 [impl.TabletServerBatchWriter] TRACE: Added                :  1,000,000 mutations
  2013-09-19 18:43:39,149 [impl.TabletServerBatchWriter] TRACE: Sent                 :  1,000,000 mutations
  2013-09-19 18:43:39,149 [impl.TabletServerBatchWriter] TRACE: Resent percentage   :       0.00%
  2013-09-19 18:43:39,150 [impl.TabletServerBatchWriter] TRACE: Overall time         :       5.94 secs
  2013-09-19 18:43:39,150 [impl.TabletServerBatchWriter] TRACE: Overall send rate    : 168,406.87 mutations/sec
  2013-09-19 18:43:39,150 [impl.TabletServerBatchWriter] TRACE: Send efficiency      :      86.91%
  2013-09-19 18:43:39,150 [impl.TabletServerBatchWriter] TRACE:
  2013-09-19 18:43:39,150 [impl.TabletServerBatchWriter] TRACE: BACKGROUND WRITER PROCESS STATISTICS
  2013-09-19 18:43:39,150 [impl.TabletServerBatchWriter] TRACE: Total send time      :       5.16 secs  86.91%
  2013-09-19 18:43:39,150 [impl.TabletServerBatchWriter] TRACE: Average send rate    : 193,760.90 mutations/sec
  2013-09-19 18:43:39,151 [impl.TabletServerBatchWriter] TRACE: Total bin time       :       0.46 secs   7.81%
  2013-09-19 18:43:39,151 [impl.TabletServerBatchWriter] TRACE: Average bin rate     : 2,155,172.41 mutations/sec
  2013-09-19 18:43:39,151 [impl.TabletServerBatchWriter] TRACE: tservers per batch   :     1.00 avg       1 min      1 max
  2013-09-19 18:43:39,151 [impl.TabletServerBatchWriter] TRACE: tablets per batch    :     1.00 avg       1 min      1 max
  2013-09-19 18:43:39,151 [impl.TabletServerBatchWriter] TRACE:
  2013-09-19 18:43:39,151 [impl.TabletServerBatchWriter] TRACE: SYSTEM STATISTICS
  2013-09-19 18:43:39,151 [impl.TabletServerBatchWriter] TRACE: JVM GC Time          :       0.53 secs
  2013-09-19 18:43:39,152 [impl.TabletServerBatchWriter] TRACE: JVM Compile Time     :       1.60 secs
  2013-09-19 18:43:39,152 [impl.TabletServerBatchWriter] TRACE: System load average : initial=  0.22 final=  0.20

What do these numbers look like for you?

Keith
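
When comparing your own numbers against the summary above, it can help to see how the rates and percentages relate to the raw times. The sketch below recomputes them from the logged values. WriterStatsSketch and its helper names are hypothetical, and because the log rounds times to two decimals, the recomputed figures only approximately match the logged ones.

```java
/** Recomputes batch writer summary figures from the logged times (illustrative only). */
final class WriterStatsSketch {

    /** Mutations per second for a given count and elapsed time. */
    static double rate(long mutations, double seconds) {
        return mutations / seconds;
    }

    /** Share of `whole` taken by `part`, as a percentage. */
    static double percent(double part, double whole) {
        return 100.0 * part / whole;
    }

    public static void main(String[] args) {
        long sent = 1_000_000;      // "Sent" line
        double overallSecs = 5.94;  // "Overall time" line
        double sendSecs = 5.16;     // "Total send time" line
        double binSecs = 0.46;      // "Total bin time" line

        System.out.printf("overall send rate ~ %,.0f mutations/sec%n", rate(sent, overallSecs));
        System.out.printf("average send rate ~ %,.0f mutations/sec%n", rate(sent, sendSecs));
        System.out.printf("average bin rate  ~ %,.0f mutations/sec%n", rate(sent, binSecs));
        System.out.printf("send efficiency   ~ %.2f%%%n", percent(sendSecs, overallSecs));
        System.out.printf("bin time share    ~ %.2f%%%n", percent(binSecs, overallSecs));
    }
}
```

A send efficiency close to 100% with a low bin time share, as in this run, suggests most of the wall-clock time goes to sending data to the tablet servers rather than to client-side work.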


RE: BatchWriter performance on 1.4

Posted by "Slater, David M." <Da...@jhuapl.edu>.

Thanks Keith, I'm looking at it now. It looks like what I want. As for the proper usage...

Would I create one using the Connector,
then call .getBatchWriter() for each of the tables I'm interested in,
add data to each of the BatchWriters returned,
and then call flush() when I want all of that to be written?

Would the individual batch writers spawned by the MultiTableBatchWriter still have synchronized addMutations() methods, so that I would still have to worry about blocking, or would that all happen in the flush() method?

From: Keith Turner [mailto:keith@deenlo.com]
Sent: Thursday, September 19, 2013 12:39 PM
To: user@accumulo.apache.org
Subject: Re: BatchWriter performance on 1.4

Are you aware of the multi table batch writer?  I am not sure if it would be useful, but wanted to make sure you knew about it.   It will use the same thread pool to process mutations for multiple tables.  Also it will batch mutations for multiple tablets into the same rpc calls.


Re: BatchWriter performance on 1.4

Posted by Keith Turner <ke...@deenlo.com>.

Are you aware of the MultiTableBatchWriter? I am not sure whether it would
be useful, but I wanted to make sure you knew about it. It uses the same
thread pool to process mutations for multiple tables, and it batches
mutations for multiple tablets into the same RPC calls.



Re: BatchWriter performance on 1.4

Posted by Adam Fuchs <af...@apache.org>.

The addMutations method blocks when the client-side buffer fills up, so you
may see a lot of time spent in that method due to a bottleneck downstream.
There are a number of things you could try to speed that up. Here are a few:
1. Increase the BatchWriter's buffer size. This can smooth out the network
utilization and increase efficiency.
2. Increase the number of threads that the BatchWriter uses to process
mutations. This is particularly useful if you have more tablet servers than
ingest clients.
3. Use a more efficient encoding. The more data you put through the
BatchWriter, the longer it will take, even if that data compresses well at
rest.
4. If you are seeing hold time show up on your tablet servers (displayed
through the monitor page) you can increase tserver.memory.maps.max to make
minor compactions more efficient.

Cheers,
Adam
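
To illustrate points 1 and 2 above, here is a toy discrete-time model (plain Java, not Accumulo code) of a producer that blocks when a bounded buffer fills. All names and numbers in it are made up for illustration. It suggests that a larger buffer removes stalls when the downstream can keep up on average, and that no buffer size helps for long when the downstream is simply slower than the producer. In the 1.4 client API the corresponding knobs are, as far as I know, the maxMemory and maxWriteThreads arguments of Connector.createBatchWriter(tableName, maxMemory, maxLatency, maxWriteThreads).

```java
/** Toy model of a producer feeding a bounded client-side buffer (illustrative only). */
public class BufferStallModel {

    /**
     * Simulates a producer that parses one batch every parseTicks ticks and a
     * background sender that drains up to drainPerTick units per tick.
     *
     * @return {ticks the producer spent stalled, units delivered downstream}
     */
    public static int[] simulate(int capacity, int batchSize, int parseTicks,
                                 int drainPerTick, int ticks) {
        int buffered = 0;            // units waiting in the buffer
        int pending = 0;             // units of the current batch not yet accepted
        int parseLeft = parseTicks;  // ticks until the next batch is parsed
        int stalled = 0;
        int delivered = 0;

        for (int t = 0; t < ticks; t++) {
            // Background sender: drain up to drainPerTick units.
            int sent = Math.min(buffered, drainPerTick);
            buffered -= sent;
            delivered += sent;

            // The producer parses only while no batch is waiting to be added.
            if (pending == 0 && --parseLeft == 0) {
                pending = batchSize;
                parseLeft = parseTicks;
            }

            // Stand-in for addMutations: copy what fits, stall if the buffer is full.
            if (pending > 0) {
                int accepted = Math.min(pending, capacity - buffered);
                buffered += accepted;
                pending -= accepted;
                if (pending > 0) {
                    stalled++;
                }
            }
        }
        return new int[] {stalled, delivered};
    }

    public static void main(String[] args) {
        int[] small = simulate(5, 10, 2, 5, 20);
        int[] large = simulate(20, 10, 2, 5, 20);
        System.out.println("capacity 5:  stalled=" + small[0] + " delivered=" + small[1]);
        System.out.println("capacity 20: stalled=" + large[0] + " delivered=" + large[1]);
    }
}
```

In this model the time "spent in addMutations" is really time spent waiting for the sender, which matches the observation that the method blocks because of a bottleneck downstream.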

RE: BatchWriter performance on 1.4

Posted by "Slater, David M." <Da...@jhuapl.edu>.

Hi David,

I've looked at generating rfiles directly, but I know that adds latency to the process, so I wanted to make sure I had found the upper bound for direct mutations before exploring that.

The tables are pre-split, and all tservers are engaged in ingest (though the application that does the parsing and runs the BatchWriters is on the namenode, which is not a tserver). There are some compactions happening during ingest, but not a lot.

The reason I'm running them in the same ingest process is that they use the same data, and their mutations reuse a lot of that data. However, it would be nice to have a different thread handle the ingest for each BatchWriter, so I might try that out.

From: David Medinets [mailto:david.medinets@gmail.com]
Sent: Wednesday, September 18, 2013 10:41 PM
To: accumulo-user
Subject: Re: BatchWriter performance on 1.4

Have you looked at generating rfiles instead of writing mutations directly to Accumulo?
Are the four target tables pre-split?
Are all tservers engaged in the ingest process?
Do you see a lot of compactions while the ingest is happening?
Any reason not to run four ingest processes with one batchwriter each instead of one ingest with four batchwriters?


Re: BatchWriter performance on 1.4

Posted by David Medinets <da...@gmail.com>.

Have you looked at generating rfiles instead of writing mutations directly
to Accumulo?
Are the four target tables pre-split?
Are all tservers engaged in the ingest process?
Do you see a lot of compactions while the ingest is happening?
Any reason not to run four ingest processes with one batchwriter each
instead of one ingest with four batchwriters?



Re: BatchWriter performance on 1.4

Posted by John Vines <vi...@apache.org>.

Currently the addMutation() code is synchronized, so that is a bottleneck.
A thread would get around this, but then you need to manage the thread
properly.
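
One possible way to manage such a thread, sketched with only java.util.concurrent: each writer gets a dedicated worker thread and a small bounded queue of batches, so the parsing loop only blocks when that queue is full. WriterThread and the sink argument are hypothetical names, not Accumulo classes. The sink stands in for a call such as bw.addMutations(batch); since that call throws the checked MutationsRejectedException, a real sink would need to wrap it in an unchecked exception.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

/** Hands batches to a possibly blocking sink on a dedicated thread (sketch). */
public class WriterThread<T> implements AutoCloseable {
    private final BlockingQueue<List<T>> queue;
    private final List<T> endMarker = new ArrayList<>(); // compared by identity
    private final Thread worker;
    private volatile RuntimeException failure;

    public WriterThread(String name, int maxQueuedBatches, Consumer<List<T>> sink) {
        this.queue = new ArrayBlockingQueue<>(maxQueuedBatches);
        this.worker = new Thread(() -> drain(sink), name);
        this.worker.start();
    }

    /** Worker loop: pass batches to the sink until the end marker arrives. */
    private void drain(Consumer<List<T>> sink) {
        try {
            while (true) {
                List<T> batch = queue.take();
                if (batch == endMarker) {
                    return;
                }
                // After a failure, keep consuming so the producer never blocks forever.
                if (failure == null) {
                    try {
                        sink.accept(batch);
                    } catch (RuntimeException e) {
                        failure = e;
                    }
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Blocks only when maxQueuedBatches batches are already waiting. */
    public void submit(List<T> batch) throws InterruptedException {
        checkFailure();
        queue.put(batch);
    }

    /** Waits for queued batches to be handed to the sink, then stops the worker. */
    @Override
    public void close() throws InterruptedException {
        queue.put(endMarker);
        worker.join();
        checkFailure();
    }

    private void checkFailure() {
        if (failure != null) {
            throw new IllegalStateException("sink failed", failure);
        }
    }
}
```

In the ingest loop from the original post, this would mean creating one WriterThread per BatchWriter and calling submit(mutations1), submit(mutations2), and so on instead of calling addMutations directly. It does not make the tablet servers any faster; it only lets parsing overlap with the writes and lets the four writers block independently of each other.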

