Posted to user@accumulo.apache.org by Massimilian Mattetti <MA...@il.ibm.com> on 2017/07/05 12:36:59 UTC

maximize usage of cluster resources during ingestion

Hi all,

I have an Accumulo 1.8.1 cluster made up of 12 bare metal servers. Each 
server has 256GB of RAM and 2 x 10-core CPUs. 2 machines are used as 
masters (running the HDFS NameNodes, Accumulo Master and Monitor). The 
other 10 machines have 12 disks of 1 TB each (11 used by the HDFS DataNode 
process) and are running the Accumulo TServer processes. All the machines 
are connected via a 10Gb network and 3 of them are running ZooKeeper. I 
have run some heavy ingestion tests on this cluster but I have never been 
able to reach more than 20% CPU usage on each Tablet Server. I am running 
an ingestion process (using batch writers) on each data node. The table is 
pre-split in order to have 4 tablets per tablet server. Monitoring the 
network I have seen that data is received/sent on each node at a peak rate 
of about 120MB/s / 100MB/s, while the aggregated disk write throughput on 
each tablet server is around 120MB/s. 
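
For context, a minimal sketch of the kind of per-node batch-writer ingest 
loop described above (the instance name, ZooKeeper quorum, credentials, 
table name and row layout are placeholder assumptions, not the actual 
ingest code):

    import java.nio.charset.StandardCharsets;

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;

    public class IngestSketch {
      public static void main(String[] args) throws Exception {
        // Placeholder instance name, ZooKeeper quorum and credentials.
        Connector conn = new ZooKeeperInstance("accumulo", "zk1,zk2,zk3")
            .getConnector("ingest_user", new PasswordToken("secret"));

        BatchWriter bw = conn.createBatchWriter("ingest_table", new BatchWriterConfig());
        try {
          for (long i = 0; i < 1_000_000L; i++) {
            // One mutation per row; the key layout here is purely illustrative.
            Mutation m = new Mutation(String.format("row_%012d", i));
            m.put("cf", "cq", new Value("payload".getBytes(StandardCharsets.UTF_8)));
            bw.addMutation(m);
          }
        } finally {
          bw.close(); // flushes any mutations still buffered on the client
        }
      }
    }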

The table configuration I am playing with is:
"table.file.replication": "2",
"table.compaction.minor.logs.threshold": "10",
"table.durability": "flush",
"table.file.max": "30",
"table.compaction.major.ratio": "9",
"table.split.threshold": "1G"

while the tablet server configuration is:
"tserver.wal.blocksize": "2G",
"tserver.walog.max.size": "8G",
"tserver.memory.maps.max": "32G",
"tserver.compaction.minor.concurrent.max": "50",
"tserver.compaction.major.concurrent.max": "8",
"tserver.total.mutation.queue.max": "50M",
"tserver.wal.replication": "2",
"tserver.compaction.major.thread.files.open.max": "15"

the tablet server heap has been set to 32GB
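
For anyone reproducing this setup, a hedged sketch of applying a few of the 
properties above through the Java client API (connection handling omitted, 
the table name is a placeholder, and some tserver.* properties only take 
effect after a tablet server restart):

    import org.apache.accumulo.core.client.AccumuloException;
    import org.apache.accumulo.core.client.AccumuloSecurityException;
    import org.apache.accumulo.core.client.Connector;

    public class ApplyConfigSketch {
      /** Applies a subset of the table/tserver settings listed above. */
      static void applySettings(Connector conn)
          throws AccumuloException, AccumuloSecurityException {
        // Per-table properties ("ingest_table" is a placeholder name).
        conn.tableOperations().setProperty("ingest_table", "table.durability", "flush");
        conn.tableOperations().setProperty("ingest_table", "table.split.threshold", "1G");

        // System-wide tserver properties, stored in ZooKeeper.
        conn.instanceOperations().setProperty("tserver.memory.maps.max", "32G");
        conn.instanceOperations().setProperty("tserver.total.mutation.queue.max", "50M");
      }
    }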

From the Monitor UI (the ingestion rate graph did not come through in this 
plain-text version):

As you can see I have a lot of valleys in which the ingestion rate reaches 
0. 
What would be a good procedure to identify the bottleneck which causes the 
0 ingestion rate periods?
Thanks.

Best Regards,
Max



Re: maximize usage of cluster resources during ingestion

Posted by Mike Walch <mw...@apache.org>.
This was a great question. I want to start recording answers to these types of
questions in the troubleshooting documentation[1] for 2.0. I made a pull
request[2] to the website repo for this one if anyone wants to
review/comment on it.

[1]: https://accumulo.apache.org/docs/unreleased/troubleshooting/basic
[2]: https://github.com/apache/accumulo-website/pull/18


On Wed, Jul 5, 2017 at 3:32 PM Christopher <ct...@apache.org> wrote:

> Huge GC pauses can be mitigated by ensuring you're using the Accumulo
> native maps library.
>
> On Wed, Jul 5, 2017 at 11:05 AM Cyrille Savelief <cs...@gmail.com>
> wrote:
>
>> Hi Massimilian*,*
>>
>> Using a MultiTableBatchWriter we are able to ingest about 600K entries/s
>> on a single node (30Gb of memory, 8 vCPU) running Hadoop, Zookeeper,
>> Accumulo and our ingest process. For us, "valleys" came from huge GC pauses.
>>
>> Best,
>>
>> Cyrille
>>
>> Le mer. 5 juil. 2017 à 14:37, Massimilian Mattetti <MA...@il.ibm.com>
>> a écrit :
>>
>>> Hi all,
>>>
>>> I have an Accumulo 1.8.1 cluster made by 12 bare metal servers. Each
>>> server has 256GB of Ram and 2 x 10 cores CPU. 2 machines are used as
>>> masters (running HDFS NameNodes, Accumulo Master and Monitor). The other 10
>>> machines has 12 Disks of 1 TB (11 used by HDFS DataNode process) and are
>>> running Accumulo TServer processes. All the machines are connected via a
>>> 10Gb network and 3 of them are running ZooKeeper. I have run some heavy
>>> ingestion test on this cluster but I have never been able to reach more
>>> than *20% *CPU usage on each Tablet Server. I am running an ingestion
>>> process (using batch writers) on each data node. The table is pre-split in
>>> order to have 4 tablets per tablet server. Monitoring the network I have
>>> seen that data is received/sent from each node with a peak rate of about
>>> 120MB/s / 100MB/s while the aggregated disk write throughput on each tablet
>>> servers is around 120MB/s.
>>>
>>> The table configuration I am playing with are:
>>> "table.file.replication": "2",
>>> "table.compaction.minor.logs.threshold": "10",
>>> "table.durability": "flush",
>>> "table.file.max": "30",
>>> "table.compaction.major.ratio": "9",
>>> "table.split.threshold": "1G"
>>>
>>> while the tablet server configuration is:
>>> "tserver.wal.blocksize": "2G",
>>> "tserver.walog.max.size": "8G",
>>> "tserver.memory.maps.max": "32G",
>>> "tserver.compaction.minor.concurrent.max": "50",
>>> "tserver.compaction.major.concurrent.max": "8",
>>> "tserver.total.mutation.queue.max": "50M",
>>> "tserver.wal.replication": "2",
>>> "tserver.compaction.major.thread.files.open.max": "15"
>>>
>>> the tablet server heap has been set to 32GB
>>>
>>> From Monitor UI
>>>
>>>
>>> As you can see I have a lot of valleys in which the ingestion rate
>>> reaches 0.
>>> What would be a good procedure to identify the bottleneck which causes
>>> the 0 ingestion rate periods?
>>> Thanks.
>>>
>>> Best Regards,
>>> Max
>>>
>>>

Re: maximize usage of cluster resources during ingestion

Posted by Christopher <ct...@apache.org>.
Huge GC pauses can be mitigated by ensuring you're using the Accumulo
native maps library.
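
For anyone checking this, a small sketch (given an open Connector) that 
reads the live system configuration to see whether the native map is 
enabled; the tablet server log remains the authoritative place to confirm 
that the native library actually loaded:

    import java.util.Map;

    import org.apache.accumulo.core.client.AccumuloException;
    import org.apache.accumulo.core.client.AccumuloSecurityException;
    import org.apache.accumulo.core.client.Connector;

    public class NativeMapCheck {
      /** Prints the native-map related settings from the live system configuration. */
      static void printNativeMapSettings(Connector conn)
          throws AccumuloException, AccumuloSecurityException {
        Map<String, String> sysConf = conn.instanceOperations().getSystemConfiguration();
        System.out.println("tserver.memory.maps.native.enabled = "
            + sysConf.get("tserver.memory.maps.native.enabled"));
        System.out.println("tserver.memory.maps.max = "
            + sysConf.get("tserver.memory.maps.max"));
      }
    }

If the property is enabled but the library cannot be loaded, the tablet 
server reports it in its startup log; running on the heap-based Java map is 
what makes large GC pauses likely.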

On Wed, Jul 5, 2017 at 11:05 AM Cyrille Savelief <cs...@gmail.com>
wrote:

> Hi Massimilian*,*
>
> Using a MultiTableBatchWriter we are able to ingest about 600K entries/s
> on a single node (30Gb of memory, 8 vCPU) running Hadoop, Zookeeper,
> Accumulo and our ingest process. For us, "valleys" came from huge GC pauses.
>
> Best,
>
> Cyrille
>
> Le mer. 5 juil. 2017 à 14:37, Massimilian Mattetti <MA...@il.ibm.com>
> a écrit :
>
>> Hi all,
>>
>> I have an Accumulo 1.8.1 cluster made by 12 bare metal servers. Each
>> server has 256GB of Ram and 2 x 10 cores CPU. 2 machines are used as
>> masters (running HDFS NameNodes, Accumulo Master and Monitor). The other 10
>> machines has 12 Disks of 1 TB (11 used by HDFS DataNode process) and are
>> running Accumulo TServer processes. All the machines are connected via a
>> 10Gb network and 3 of them are running ZooKeeper. I have run some heavy
>> ingestion test on this cluster but I have never been able to reach more
>> than *20% *CPU usage on each Tablet Server. I am running an ingestion
>> process (using batch writers) on each data node. The table is pre-split in
>> order to have 4 tablets per tablet server. Monitoring the network I have
>> seen that data is received/sent from each node with a peak rate of about
>> 120MB/s / 100MB/s while the aggregated disk write throughput on each tablet
>> servers is around 120MB/s.
>>
>> The table configuration I am playing with are:
>> "table.file.replication": "2",
>> "table.compaction.minor.logs.threshold": "10",
>> "table.durability": "flush",
>> "table.file.max": "30",
>> "table.compaction.major.ratio": "9",
>> "table.split.threshold": "1G"
>>
>> while the tablet server configuration is:
>> "tserver.wal.blocksize": "2G",
>> "tserver.walog.max.size": "8G",
>> "tserver.memory.maps.max": "32G",
>> "tserver.compaction.minor.concurrent.max": "50",
>> "tserver.compaction.major.concurrent.max": "8",
>> "tserver.total.mutation.queue.max": "50M",
>> "tserver.wal.replication": "2",
>> "tserver.compaction.major.thread.files.open.max": "15"
>>
>> the tablet server heap has been set to 32GB
>>
>> From Monitor UI
>>
>>
>> As you can see I have a lot of valleys in which the ingestion rate
>> reaches 0.
>> What would be a good procedure to identify the bottleneck which causes
>> the 0 ingestion rate periods?
>> Thanks.
>>
>> Best Regards,
>> Max
>>
>>

Re: maximize usage of cluster resources during ingestion

Posted by Cyrille Savelief <cs...@gmail.com>.
Hi Massimilian,

Using a MultiTableBatchWriter we are able to ingest about 600K entries/s on
a single node (30GB of memory, 8 vCPU) running Hadoop, ZooKeeper, Accumulo
and our ingest process. For us, the "valleys" came from huge GC pauses.
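
For what it's worth, the pattern looks roughly like this; a sketch only, 
with illustrative table names and buffer/thread settings rather than our 
real ones:

    import java.util.concurrent.TimeUnit;

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.MultiTableBatchWriter;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;

    public class MultiTableIngestSketch {
      /** Writes to several tables through one shared buffer/flush pipeline. */
      static void ingest(Connector conn) throws Exception {
        BatchWriterConfig cfg = new BatchWriterConfig()
            .setMaxMemory(64 * 1024 * 1024)         // shared client-side buffer
            .setMaxLatency(2, TimeUnit.SECONDS)     // flush at least this often
            .setMaxWriteThreads(8);

        MultiTableBatchWriter mtbw = conn.createMultiTableBatchWriter(cfg);
        try {
          BatchWriter events = mtbw.getBatchWriter("events");
          BatchWriter index = mtbw.getBatchWriter("events_index");

          Mutation e = new Mutation("event_0001");
          e.put("meta", "ts", new Value("1499253419".getBytes()));
          events.addMutation(e);

          Mutation i = new Mutation("1499253419");
          i.put("ref", "event_0001", new Value(new byte[0]));
          index.addMutation(i);
        } finally {
          mtbw.close(); // flushes all tables
        }
      }
    }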

Best,

Cyrille

Le mer. 5 juil. 2017 à 14:37, Massimilian Mattetti <MA...@il.ibm.com> a
écrit :

> Hi all,
>
> I have an Accumulo 1.8.1 cluster made by 12 bare metal servers. Each
> server has 256GB of Ram and 2 x 10 cores CPU. 2 machines are used as
> masters (running HDFS NameNodes, Accumulo Master and Monitor). The other 10
> machines has 12 Disks of 1 TB (11 used by HDFS DataNode process) and are
> running Accumulo TServer processes. All the machines are connected via a
> 10Gb network and 3 of them are running ZooKeeper. I have run some heavy
> ingestion test on this cluster but I have never been able to reach more
> than *20% *CPU usage on each Tablet Server. I am running an ingestion
> process (using batch writers) on each data node. The table is pre-split in
> order to have 4 tablets per tablet server. Monitoring the network I have
> seen that data is received/sent from each node with a peak rate of about
> 120MB/s / 100MB/s while the aggregated disk write throughput on each tablet
> servers is around 120MB/s.
>
> The table configuration I am playing with are:
> "table.file.replication": "2",
> "table.compaction.minor.logs.threshold": "10",
> "table.durability": "flush",
> "table.file.max": "30",
> "table.compaction.major.ratio": "9",
> "table.split.threshold": "1G"
>
> while the tablet server configuration is:
> "tserver.wal.blocksize": "2G",
> "tserver.walog.max.size": "8G",
> "tserver.memory.maps.max": "32G",
> "tserver.compaction.minor.concurrent.max": "50",
> "tserver.compaction.major.concurrent.max": "8",
> "tserver.total.mutation.queue.max": "50M",
> "tserver.wal.replication": "2",
> "tserver.compaction.major.thread.files.open.max": "15"
>
> the tablet server heap has been set to 32GB
>
> From Monitor UI
>
>
> As you can see I have a lot of valleys in which the ingestion rate reaches
> 0.
> What would be a good procedure to identify the bottleneck which causes the
> 0 ingestion rate periods?
> Thanks.
>
> Best Regards,
> Max
>
>

Re: maximize usage of cluster resources during ingestion

Posted by Keith Turner <ke...@deenlo.com>.
On Thu, Jul 13, 2017 at 10:56 AM,  <dl...@comcast.net> wrote:
> Regarding the referenced paper, pre-splitting the tables, using an optimized zookeeper deployment, and increasing concurrent minor / major compactions are good things. I'm not sure that we want to recommend turning off the write ahead logs and replication for production deployments.


I wouldn't recommend turning off write ahead logs either.   However,
it can be useful to turn them off for performance testing to
understand their impact.

I noticed you set "table.durability": "flush", which is good for
performance.  However, the metadata table may still be set to sync, which
can cause performance problems.   The following article describes how
to set the metadata table to flush. The article also describes the
consequences for regular tables of having the metadata table set to
sync.

http://accumulo.apache.org/blog/2016/11/02/durability-performance.html
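
For reference, that change can be made from the shell with
"config -t accumulo.metadata -s table.durability=flush", or
programmatically; a hedged sketch of the latter (please read the article's
caveats first, since it trades some durability of metadata updates for
speed):

    import org.apache.accumulo.core.client.AccumuloException;
    import org.apache.accumulo.core.client.AccumuloSecurityException;
    import org.apache.accumulo.core.client.Connector;

    public class MetadataDurabilitySketch {
      /** Relaxes the metadata table's durability from sync to flush. */
      static void useFlushDurabilityForMetadata(Connector conn)
          throws AccumuloException, AccumuloSecurityException {
        // Same setting the user table already has, applied to accumulo.metadata
        // so metadata writes no longer force an hsync on every update.
        conn.tableOperations().setProperty("accumulo.metadata", "table.durability", "flush");
      }
    }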

>
> -----Original Message-----
> From: Jeremy Kepner [mailto:kepner@ll.mit.edu]
> Sent: Thursday, July 13, 2017 10:05 AM
> To: user@accumulo.apache.org
> Subject: Re: maximize usage of cluster resources during ingestion
>
> https://arxiv.org/abs/1406.4923  contains a number of tricks for maximizing ingest performance.
>
> On Thu, Jul 13, 2017 at 08:13:40AM -0400, Jonathan Wonders wrote:
>> Keep in mind that Accumulo puts a much different kind of load on HDFS
>> than the DFSIO benchmark.  It might be more appropriate to use a tool
>> like dstat to monitor HDD utilization and queue depth.  HDD throughput
>> benchmarks usually will involve high queue depths as disks are much
>> more effective when they can pipeline and batch updates. Accumulo's
>> WAL workload will typically call hflush or hsync periodically which
>> interrupts the IO pipeline much like memory barriers can interrupt CPU
>> pipelining except more severe.  This is necessary to provide
>> durability guarantees, but definitely comes at a cost to throughput.
>> Any database that has these durability guarantees will suffer
>> similarly to an extent.  For Accumulo, it is probably worse than for
>> non-distributed databases because the flush or sync must happen at
>> each replica prior to the mutation being added into the in-memory map.
>>
>> I think one of the reasons the recommendation was made to add more
>> tablet servers is because each tablet server only writes to one WAL at
>> a time and each block will live on N disk based on replication factor.
>> If you have a replication factor of 3, there will be 10x3 blocks being
>> appended to at any given time (excluding compactions).  Since you have
>> 120 disks, not all will be participating in write-ahead-logging, so
>> you should not count the IO capacity of these extra disks towards
>> expected ingest throughput.  10 tablet servers per node is probably
>> too many because there would likely be a lot of contention
>> flushing/syncing WALs.  I'm not sure how smart HDFS is about how it
>> distributes the WAL load.  You might see more benefit with 2-4
>> tservers per node.  This would mostly likely require more batch writer threads in the client as well.
>>
>> I'm not too surprised that snappy did not help because the WALs are
>> not compressed and are likely a bigger bottleneck than compaction
>> since you have many disks not participating in WAL.
>>
>>
>> On Wed, Jul 12, 2017 at 11:16 AM, Josh Elser <el...@apache.org> wrote:
>>
>> > You probably want to split the table further than just 4 tablets per
>> > tablet server. Try 10's of tablets per server.
>> >
>> > Also, merging the content from (who I assume is) your coworker on
>> > this stackoverflow post[1], I don't believe the suggestion[2] to
>> > verify WAL max size, minc threshold, and native maps size was brought up yet.
>> >
>> > Also, did you look at the JVM GC logs for the TabletServers like was
>> > previously suggested to you?
>> >
>> > [1] https://stackoverflow.com/questions/44928354/accumulo-tablet
>> > -server-doesnt-utilize-all-available-resources-on-host-machine/
>> > [2] https://accumulo.apache.org/1.8/accumulo_user_manual.html#_n
>> > ative_maps_configuration
>> >
>> > On 7/12/17 10:12 AM, Massimilian Mattetti wrote:
>> >
>> >> Hi all,
>> >>
>> >> I ran a few experiments in the last days trying to identify what is
>> >> the bottleneck for the ingestion process.
>> >> - Running 10 tservers per node instead of only one gave me a very
>> >> neglectable performance improvement of about 15%.
>> >> - Running the ingestor processes from the two masters give the same
>> >> performance as running one ingestor process in each tablet server
>> >> (10
>> >> ingestors)
>> >> - neither the network limit (10 Gb network) nor the disk throughput
>> >> limit has been reached (1GB/s per node reached while running the
>> >> TestDFSIO benchmark on HDFS)
>> >> - CPU is always around 20% on each tserver
>> >> - changing compression from GZ to snappy did not provide any
>> >> benefit
>> >> - increasing the tserver.total.mutation.queue.maxto 200MB actually
>> >> decreased the performance I am going to run some ingestion
>> >> experiment with Kudu over the next few days, but any other
>> >> suggestion on how improve the performance on Accumulo is very
>> >> welcome.
>> >> Thanks.
>> >>
>> >> Best Regards,
>> >> Massimiliano
>> >>
>> >>
>> >>
>> >> From: Jonathan Wonders <jw...@gmail.com>
>> >> To: user@accumulo.apache.org, Dave Marion <dl...@comcast.net>
>> >> Date: 07/07/2017 04:02
>> >> Subject: Re: maximize usage of cluster resources during ingestion
>> >> -------------------------------------------------------------------
>> >> -----
>> >>
>> >>
>> >>
>> >> I've personally never seen full CPU utilization during pure ingest.
>> >> Typically the bottleneck has been I/O related. The majority of
>> >> steady-state CPU utilization under a heavy ingest load is probably
>> >> due to compression unless you have custom constraints running. This
>> >> can depend on the compression algorithm you have selected.  There
>> >> is probably a measurable contribution from inserting into the
>> >> in-memory map.  Otherwise, not much computation occurs during ingest per mutation.
>> >>
>> >> On Thu, Jul 6, 2017 at 8:18 AM, Dave Marion <_dlmarion@comcast.net_
>> >> <ma...@comcast.net>> wrote:
>> >> That's a good point. I would also look at increasing
>> >> tserver.total.mutation.queue.max. Are you seeing hold times? If
>> >> not, I would keep pushing harder until you do, then move to
>> >> multiple tablet servers. Do you have any GC logs?
>> >>
>> >>
>> >> On July 6, 2017 at 4:47 AM Cyrille Savelief <_csavelief@gmail.com_
>> >> <ma...@gmail.com>> wrote:
>> >>
>> >> Are you sure Accumulo is not waiting for your app's data? There
>> >> might be GC pauses in your ingest code (we have already experienced that).
>> >>
>> >> Le jeu. 6 juil. 2017 à 10:32, Massimilian Mattetti
>> >> <_MASSIMIL@il.ibm.com_ <ma...@il.ibm.com>> a écrit :
>> >> Thank you all for the suggestions.
>> >>
>> >> About the native memory map I checked the logs on each tablet
>> >> server and it was loaded correctly (of course the
>> >> tserver.memory.maps.native.enabled
>> >> was set to true), so the GC pauses should not be the problem
>> >> eventually. I managed to get much better ingestion graph by
>> >> reducing the native map size to *2GB* and increasing the Batch
>> >> Writer threads number from the default (3 was really bad for my
>> >> configuration) to *10* (I think it does not make sense having more threads than tablet servers, am I right?).
>> >>
>> >> The configuration that I used for the table is:
>> >> "table.file.replication": "2",
>> >> "table.compaction.minor.logs.threshold": "3",
>> >> "table.durability": "flush",
>> >> "table.split.threshold": "1G"
>> >>
>> >> while for the tablet servers is:
>> >> "tserver.wal.blocksize": "1G",
>> >>   "tserver.walog.max.size": "2G",
>> >> "tserver.memory.maps.max": "2G",
>> >> "tserver.compaction.minor.concurrent.max": "50",
>> >> "tserver.compaction.major.concurrent.max": "20",
>> >> "tserver.wal.replication": "2",
>> >>   "tserver.compaction.major.thread.files.open.max": "15"
>> >>
>> >> The new graph:
>> >>
>> >>
>> >> I still have the problem of a CPU usage that is less than*20%.* So
>> >> I am thinking to run multiple tablet servers per node (like 5 or
>> >> 10) in order to maximize the CPU usage. Besides that I do not have
>> >> any other idea on how to stress those servers with ingestion.
>> >> Any suggestions are very welcome. Meanwhile, thank you all again
>> >> for your help.
>> >>
>> >>
>> >> Best Regards,
>> >> Massimiliano
>> >>
>> >>
>> >>
>> >> From: Jonathan Wonders <_jwonders88@gmail.com_ <mailto:
>> >> jwonders88@gmail.com>>
>> >> To: _user@accumulo.apache.org_ <ma...@accumulo.apache.org>
>> >> Date: 06/07/2017 04:01
>> >> Subject: Re: maximize usage of cluster resources during ingestion
>> >> -------------------------------------------------------------------
>> >> -----
>> >>
>> >>
>> >>
>> >> Hi Massimilian,
>> >>
>> >> Are you seeing held commits during the ingest pauses?  Just based
>> >> on having looked at many similar graphs in the past, this might be
>> >> one of the major culprits.  A tablet server has a memory region
>> >> with a bounded size
>> >> (tserver.memory.maps.max) where it buffers data that has not yet
>> >> been written to RFiles (through the process of minor compaction).
>> >> The region is segmented by tablet and each tablet can have a buffer
>> >> that is undergoing ingest as well as a buffer that is undergoing
>> >> minor compaction. A memory manager decides when to initiate minor
>> >> compactions for the tablet buffers and the default implementation
>> >> tries to keep the memory region 80-90% full while preferring to
>> >> compact the largest tablet buffers. Creating larger RFiles during minor compaction should lead to less major compactions.
>> >> During a minor compaction, the tablet buffer still "consumes"
>> >> memory within the in memory map and high ingest rates can lead to
>> >> exhausing the remaining capacity.  The default memory manage uses
>> >> an adaptive strategy to predict the expected memory usage and makes
>> >> compaction decisions that should maintain some free memory.  Batch
>> >> writers can be bursty and a bit unpredictable which could throw off
>> >> these estimates.  Also, depending on the ingest profile, sometimes
>> >> an in-memory tablet buffer will consume a large percentage of the
>> >> total buffer.  This leads to long minor compactions when the buffer
>> >> size is large which can allow ingest enough time to exhaust the
>> >> buffer before that memory can be reclaimed. When a tablet server
>> >> has to block ingest, it can affect client ingest rates to other
>> >> tablet servers due to the way that batch writers work.  This can
>> >> lead to other tablet servers underestimating future ingest rates which can further exacerbate the problem.
>> >>
>> >> There are some configuration changes that could reduce the severity
>> >> of held commits, although they might reduce peak ingest rates.
>> >> Reducing the in memory map size can reduce the maximum pause time due to held commits.
>> >> Adding additional tablets should help avoid the problem of a single
>> >> tablet buffer consuming a large percentage of the memory region.
>> >> It might be better to aim for ~20 tablets per server if your
>> >> problem allows for it.  It is also possible to replace the memory
>> >> manager with a custom one.  I've tried this in the past and have
>> >> seen stability improvements by making the memory thresholds less
>> >> aggressive (50-75% full).  This did reduce peak ingest rate in some cases, but that was a reasonable tradeoff.
>> >>
>> >> Based on your current configuration, if a tablet server is serving
>> >> 4 tablets and has a 32GB buffer, your first minor compactions will
>> >> be at least 8GB and they will probably grow larger over time until
>> >> the tablets naturally split.  Consider how long it would take to
>> >> write this RFile compared to your peak ingest rate.  As others have
>> >> suggested, make sure to use the native maps.  Based on your current
>> >> JVM heap size, using the Java in-memory map would probably lead to OOME or very bad GC performance.
>> >>
>> >> Accumulo can trace minor compaction durations so you can get a feel
>> >> for max pause times or measure the effect of configuration changes.
>> >>
>> >> Cheers,
>> >> --Jonathan
>> >>
>> >> On Wed, Jul 5, 2017 at 7:16 PM, Dave Marion <_dlmarion@comcast.net_
>> >> <ma...@comcast.net>> wrote:
>> >>
>> >> Based on what Cyrille said, I would look at garbage collection,
>> >> specifically I would look at how much of your newly allocated
>> >> objects spill into the old generation before they are flushed to
>> >> disk. Additionally, I would turn off the debug log or log to SSD’s
>> >> if you have them. Another thought, seeing that you have 256GB RAM /
>> >> node, is to run multiple tablet servers per node. Do you have 10
>> >> threads on your Batch Writers? What about the Batch Writer latency,
>> >> is it too low such that you are not filling the buffer?
>> >>
>> >> *From:* Massimilian Mattetti [mailto:_MASSIMIL@il.ibm.com_ <mailto:
>> >> MASSIMIL@il.ibm.com>] *
>> >> Sent:* Wednesday, July 05, 2017 8:37 AM*
>> >> To:* _user@accumulo.apache.org_ <ma...@accumulo.apache.org>*
>> >> Subject:* maximize usage of cluster resources during ingestion
>> >>
>> >> Hi all,
>> >>
>> >> I have an Accumulo 1.8.1 cluster made by 12 bare metal servers.
>> >> Each server has 256GB of Ram and 2 x 10 cores CPU. 2 machines are
>> >> used as masters (running HDFS NameNodes, Accumulo Master and
>> >> Monitor). The other 10 machines has 12 Disks of 1 TB (11 used by
>> >> HDFS DataNode process) and are running Accumulo TServer processes.
>> >> All the machines are connected via a 10Gb network and 3 of them are
>> >> running ZooKeeper. I have run some heavy ingestion test on this
>> >> cluster but I have never been able to reach more than *20% *CPU
>> >> usage on each Tablet Server. I am running an ingestion process
>> >> (using batch writers) on each data node. The table is pre-split in
>> >> order to have 4 tablets per tablet server. Monitoring the network I
>> >> have seen that data is received/sent from each node with a peak
>> >> rate of about 120MB/s / 100MB/s while the aggregated disk write throughput on each tablet servers is around 120MB/s.
>> >>
>> >> The table configuration I am playing with are:
>> >> "table.file.replication": "2",
>> >> "table.compaction.minor.logs.threshold": "10",
>> >> "table.durability": "flush",
>> >> "table.file.max": "30",
>> >> "table.compaction.major.ratio": "9",
>> >> "table.split.threshold": "1G"
>> >>
>> >> while the tablet server configuration is:
>> >> "tserver.wal.blocksize": "2G",
>> >> "tserver.walog.max.size": "8G",
>> >> "tserver.memory.maps.max": "32G",
>> >> "tserver.compaction.minor.concurrent.max": "50",
>> >> "tserver.compaction.major.concurrent.max": "8",
>> >> "tserver.total.mutation.queue.max": "50M",
>> >> "tserver.wal.replication": "2",
>> >> "tserver.compaction.major.thread.files.open.max": "15"
>> >>
>> >> the tablet server heap has been set to 32GB
>> >>
>> >>  From Monitor UI
>> >>
>> >>
>> >> As you can see I have a lot of valleys in which the ingestion rate
>> >> reaches 0.
>> >> What would be a good procedure to identify the bottleneck which
>> >> causes the 0 ingestion rate periods?
>> >> Thanks.
>> >>
>> >> Best Regards,
>> >> Max
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>

RE: maximize usage of cluster resources during ingestion

Posted by dl...@comcast.net.
Regarding the referenced paper, pre-splitting the tables, using an optimized zookeeper deployment, and increasing concurrent minor / major compactions are good things. I'm not sure that we want to recommend turning off the write ahead logs and replication for production deployments.

-----Original Message-----
From: Jeremy Kepner [mailto:kepner@ll.mit.edu] 
Sent: Thursday, July 13, 2017 10:05 AM
To: user@accumulo.apache.org
Subject: Re: maximize usage of cluster resources during ingestion

https://arxiv.org/abs/1406.4923  contains a number of tricks for maximizing ingest performance.

On Thu, Jul 13, 2017 at 08:13:40AM -0400, Jonathan Wonders wrote:
> Keep in mind that Accumulo puts a much different kind of load on HDFS 
> than the DFSIO benchmark.  It might be more appropriate to use a tool 
> like dstat to monitor HDD utilization and queue depth.  HDD throughput 
> benchmarks usually will involve high queue depths as disks are much 
> more effective when they can pipeline and batch updates. Accumulo's 
> WAL workload will typically call hflush or hsync periodically which 
> interrupts the IO pipeline much like memory barriers can interrupt CPU 
> pipelining except more severe.  This is necessary to provide 
> durability guarantees, but definitely comes at a cost to throughput.  
> Any database that has these durability guarantees will suffer 
> similarly to an extent.  For Accumulo, it is probably worse than for 
> non-distributed databases because the flush or sync must happen at 
> each replica prior to the mutation being added into the in-memory map.
> 
> I think one of the reasons the recommendation was made to add more 
> tablet servers is because each tablet server only writes to one WAL at 
> a time and each block will live on N disk based on replication factor.  
> If you have a replication factor of 3, there will be 10x3 blocks being 
> appended to at any given time (excluding compactions).  Since you have 
> 120 disks, not all will be participating in write-ahead-logging, so 
> you should not count the IO capacity of these extra disks towards 
> expected ingest throughput.  10 tablet servers per node is probably 
> too many because there would likely be a lot of contention 
> flushing/syncing WALs.  I'm not sure how smart HDFS is about how it 
> distributes the WAL load.  You might see more benefit with 2-4 
> tservers per node.  This would mostly likely require more batch writer threads in the client as well.
> 
> I'm not too surprised that snappy did not help because the WALs are 
> not compressed and are likely a bigger bottleneck than compaction 
> since you have many disks not participating in WAL.
> 
> 
> On Wed, Jul 12, 2017 at 11:16 AM, Josh Elser <el...@apache.org> wrote:
> 
> > You probably want to split the table further than just 4 tablets per 
> > tablet server. Try 10's of tablets per server.
> >
> > Also, merging the content from (who I assume is) your coworker on 
> > this stackoverflow post[1], I don't believe the suggestion[2] to 
> > verify WAL max size, minc threshold, and native maps size was brought up yet.
> >
> > Also, did you look at the JVM GC logs for the TabletServers like was 
> > previously suggested to you?
> >
> > [1] https://stackoverflow.com/questions/44928354/accumulo-tablet
> > -server-doesnt-utilize-all-available-resources-on-host-machine/
> > [2] https://accumulo.apache.org/1.8/accumulo_user_manual.html#_n
> > ative_maps_configuration
> >
> > On 7/12/17 10:12 AM, Massimilian Mattetti wrote:
> >
> >> Hi all,
> >>
> >> I ran a few experiments in the last days trying to identify what is 
> >> the bottleneck for the ingestion process.
> >> - Running 10 tservers per node instead of only one gave me a very 
> >> neglectable performance improvement of about 15%.
> >> - Running the ingestor processes from the two masters give the same 
> >> performance as running one ingestor process in each tablet server 
> >> (10
> >> ingestors)
> >> - neither the network limit (10 Gb network) nor the disk throughput 
> >> limit has been reached (1GB/s per node reached while running the 
> >> TestDFSIO benchmark on HDFS)
> >> - CPU is always around 20% on each tserver
> >> - changing compression from GZ to snappy did not provide any 
> >> benefit
> >> - increasing the tserver.total.mutation.queue.maxto 200MB actually 
> >> decreased the performance I am going to run some ingestion 
> >> experiment with Kudu over the next few days, but any other 
> >> suggestion on how improve the performance on Accumulo is very 
> >> welcome.
> >> Thanks.
> >>
> >> Best Regards,
> >> Massimiliano
> >>
> >>
> >>
> >> From: Jonathan Wonders <jw...@gmail.com>
> >> To: user@accumulo.apache.org, Dave Marion <dl...@comcast.net>
> >> Date: 07/07/2017 04:02
> >> Subject: Re: maximize usage of cluster resources during ingestion
> >> -------------------------------------------------------------------
> >> -----
> >>
> >>
> >>
> >> I've personally never seen full CPU utilization during pure ingest.
> >> Typically the bottleneck has been I/O related. The majority of 
> >> steady-state CPU utilization under a heavy ingest load is probably 
> >> due to compression unless you have custom constraints running. This 
> >> can depend on the compression algorithm you have selected.  There 
> >> is probably a measurable contribution from inserting into the 
> >> in-memory map.  Otherwise, not much computation occurs during ingest per mutation.
> >>
> >> On Thu, Jul 6, 2017 at 8:18 AM, Dave Marion <_dlmarion@comcast.net_ 
> >> <ma...@comcast.net>> wrote:
> >> That's a good point. I would also look at increasing 
> >> tserver.total.mutation.queue.max. Are you seeing hold times? If 
> >> not, I would keep pushing harder until you do, then move to 
> >> multiple tablet servers. Do you have any GC logs?
> >>
> >>
> >> On July 6, 2017 at 4:47 AM Cyrille Savelief <_csavelief@gmail.com_ 
> >> <ma...@gmail.com>> wrote:
> >>
> >> Are you sure Accumulo is not waiting for your app's data? There 
> >> might be GC pauses in your ingest code (we have already experienced that).
> >>
> >> Le jeu. 6 juil. 2017 à 10:32, Massimilian Mattetti 
> >> <_MASSIMIL@il.ibm.com_ <ma...@il.ibm.com>> a écrit :
> >> Thank you all for the suggestions.
> >>
> >> About the native memory map I checked the logs on each tablet 
> >> server and it was loaded correctly (of course the 
> >> tserver.memory.maps.native.enabled
> >> was set to true), so the GC pauses should not be the problem 
> >> eventually. I managed to get much better ingestion graph by 
> >> reducing the native map size to *2GB* and increasing the Batch 
> >> Writer threads number from the default (3 was really bad for my 
> >> configuration) to *10* (I think it does not make sense having more threads than tablet servers, am I right?).
> >>
> >> The configuration that I used for the table is:
> >> "table.file.replication": "2",
> >> "table.compaction.minor.logs.threshold": "3",
> >> "table.durability": "flush",
> >> "table.split.threshold": "1G"
> >>
> >> while for the tablet servers is:
> >> "tserver.wal.blocksize": "1G",
> >>   "tserver.walog.max.size": "2G",
> >> "tserver.memory.maps.max": "2G",
> >> "tserver.compaction.minor.concurrent.max": "50",
> >> "tserver.compaction.major.concurrent.max": "20",
> >> "tserver.wal.replication": "2",
> >>   "tserver.compaction.major.thread.files.open.max": "15"
> >>
> >> The new graph:
> >>
> >>
> >> I still have the problem of a CPU usage that is less than*20%.* So 
> >> I am thinking to run multiple tablet servers per node (like 5 or 
> >> 10) in order to maximize the CPU usage. Besides that I do not have 
> >> any other idea on how to stress those servers with ingestion.
> >> Any suggestions are very welcome. Meanwhile, thank you all again 
> >> for your help.
> >>
> >>
> >> Best Regards,
> >> Massimiliano
> >>
> >>
> >>
> >> From: Jonathan Wonders <_jwonders88@gmail.com_ <mailto:
> >> jwonders88@gmail.com>>
> >> To: _user@accumulo.apache.org_ <ma...@accumulo.apache.org>
> >> Date: 06/07/2017 04:01
> >> Subject: Re: maximize usage of cluster resources during ingestion
> >> -------------------------------------------------------------------
> >> -----
> >>
> >>
> >>
> >> Hi Massimilian,
> >>
> >> Are you seeing held commits during the ingest pauses?  Just based 
> >> on having looked at many similar graphs in the past, this might be 
> >> one of the major culprits.  A tablet server has a memory region 
> >> with a bounded size
> >> (tserver.memory.maps.max) where it buffers data that has not yet 
> >> been written to RFiles (through the process of minor compaction). 
> >> The region is segmented by tablet and each tablet can have a buffer 
> >> that is undergoing ingest as well as a buffer that is undergoing 
> >> minor compaction. A memory manager decides when to initiate minor 
> >> compactions for the tablet buffers and the default implementation 
> >> tries to keep the memory region 80-90% full while preferring to 
> >> compact the largest tablet buffers. Creating larger RFiles during minor compaction should lead to less major compactions.
> >> During a minor compaction, the tablet buffer still "consumes" 
> >> memory within the in memory map and high ingest rates can lead to 
> >> exhausing the remaining capacity.  The default memory manage uses 
> >> an adaptive strategy to predict the expected memory usage and makes 
> >> compaction decisions that should maintain some free memory.  Batch 
> >> writers can be bursty and a bit unpredictable which could throw off 
> >> these estimates.  Also, depending on the ingest profile, sometimes 
> >> an in-memory tablet buffer will consume a large percentage of the 
> >> total buffer.  This leads to long minor compactions when the buffer 
> >> size is large which can allow ingest enough time to exhaust the 
> >> buffer before that memory can be reclaimed. When a tablet server 
> >> has to block ingest, it can affect client ingest rates to other 
> >> tablet servers due to the way that batch writers work.  This can 
> >> lead to other tablet servers underestimating future ingest rates which can further exacerbate the problem.
> >>
> >> There are some configuration changes that could reduce the severity 
> >> of held commits, although they might reduce peak ingest rates.  
> >> Reducing the in memory map size can reduce the maximum pause time due to held commits.
> >> Adding additional tablets should help avoid the problem of a single 
> >> tablet buffer consuming a large percentage of the memory region.  
> >> It might be better to aim for ~20 tablets per server if your 
> >> problem allows for it.  It is also possible to replace the memory 
> >> manager with a custom one.  I've tried this in the past and have 
> >> seen stability improvements by making the memory thresholds less 
> >> aggressive (50-75% full).  This did reduce peak ingest rate in some cases, but that was a reasonable tradeoff.
> >>
> >> Based on your current configuration, if a tablet server is serving 
> >> 4 tablets and has a 32GB buffer, your first minor compactions will 
> >> be at least 8GB and they will probably grow larger over time until 
> >> the tablets naturally split.  Consider how long it would take to 
> >> write this RFile compared to your peak ingest rate.  As others have 
> >> suggested, make sure to use the native maps.  Based on your current 
> >> JVM heap size, using the Java in-memory map would probably lead to OOME or very bad GC performance.
> >>
> >> Accumulo can trace minor compaction durations so you can get a feel 
> >> for max pause times or measure the effect of configuration changes.
> >>
> >> Cheers,
> >> --Jonathan
> >>
> >> On Wed, Jul 5, 2017 at 7:16 PM, Dave Marion <_dlmarion@comcast.net_ 
> >> <ma...@comcast.net>> wrote:
> >>
> >> Based on what Cyrille said, I would look at garbage collection, 
> >> specifically I would look at how much of your newly allocated 
> >> objects spill into the old generation before they are flushed to 
> >> disk. Additionally, I would turn off the debug log or log to SSD’s 
> >> if you have them. Another thought, seeing that you have 256GB RAM / 
> >> node, is to run multiple tablet servers per node. Do you have 10 
> >> threads on your Batch Writers? What about the Batch Writer latency, 
> >> is it too low such that you are not filling the buffer?
> >>
> >> *From:* Massimilian Mattetti [mailto:_MASSIMIL@il.ibm.com_ <mailto:
> >> MASSIMIL@il.ibm.com>] *
> >> Sent:* Wednesday, July 05, 2017 8:37 AM*
> >> To:* _user@accumulo.apache.org_ <ma...@accumulo.apache.org>*
> >> Subject:* maximize usage of cluster resources during ingestion
> >>
> >> Hi all,
> >>
> >> I have an Accumulo 1.8.1 cluster made by 12 bare metal servers. 
> >> Each server has 256GB of Ram and 2 x 10 cores CPU. 2 machines are 
> >> used as masters (running HDFS NameNodes, Accumulo Master and 
> >> Monitor). The other 10 machines has 12 Disks of 1 TB (11 used by 
> >> HDFS DataNode process) and are running Accumulo TServer processes. 
> >> All the machines are connected via a 10Gb network and 3 of them are 
> >> running ZooKeeper. I have run some heavy ingestion test on this 
> >> cluster but I have never been able to reach more than *20% *CPU 
> >> usage on each Tablet Server. I am running an ingestion process 
> >> (using batch writers) on each data node. The table is pre-split in 
> >> order to have 4 tablets per tablet server. Monitoring the network I 
> >> have seen that data is received/sent from each node with a peak 
> >> rate of about 120MB/s / 100MB/s while the aggregated disk write throughput on each tablet servers is around 120MB/s.
> >>
> >> The table configuration I am playing with are:
> >> "table.file.replication": "2",
> >> "table.compaction.minor.logs.threshold": "10",
> >> "table.durability": "flush",
> >> "table.file.max": "30",
> >> "table.compaction.major.ratio": "9",
> >> "table.split.threshold": "1G"
> >>
> >> while the tablet server configuration is:
> >> "tserver.wal.blocksize": "2G",
> >> "tserver.walog.max.size": "8G",
> >> "tserver.memory.maps.max": "32G",
> >> "tserver.compaction.minor.concurrent.max": "50",
> >> "tserver.compaction.major.concurrent.max": "8",
> >> "tserver.total.mutation.queue.max": "50M",
> >> "tserver.wal.replication": "2",
> >> "tserver.compaction.major.thread.files.open.max": "15"
> >>
> >> the tablet server heap has been set to 32GB
> >>
> >>  From Monitor UI
> >>
> >>
> >> As you can see I have a lot of valleys in which the ingestion rate 
> >> reaches 0.
> >> What would be a good procedure to identify the bottleneck which 
> >> causes the 0 ingestion rate periods?
> >> Thanks.
> >>
> >> Best Regards,
> >> Max
> >>
> >>
> >>
> >>
> >>
> >>
> >>


Re: maximize usage of cluster resources during ingestion

Posted by Jeremy Kepner <ke...@ll.mit.edu>.
https://arxiv.org/abs/1406.4923  contains a number of tricks for maximizing ingest performance.

On Thu, Jul 13, 2017 at 08:13:40AM -0400, Jonathan Wonders wrote:
> Keep in mind that Accumulo puts a much different kind of load on HDFS than
> the DFSIO benchmark.  It might be more appropriate to use a tool like dstat
> to monitor HDD utilization and queue depth.  HDD throughput benchmarks
> usually will involve high queue depths as disks are much more effective
> when they can pipeline and batch updates. Accumulo's WAL workload will
> typically call hflush or hsync periodically which interrupts the IO
> pipeline much like memory barriers can interrupt CPU pipelining except more
> severe.  This is necessary to provide durability guarantees, but definitely
> comes at a cost to throughput.  Any database that has these durability
> guarantees will suffer similarly to an extent.  For Accumulo, it is
> probably worse than for non-distributed databases because the flush or sync
> must happen at each replica prior to the mutation being added into the
> in-memory map.
> 
> I think one of the reasons the recommendation was made to add more tablet
> servers is because each tablet server only writes to one WAL at a time and
> each block will live on N disk based on replication factor.  If you have a
> replication factor of 3, there will be 10x3 blocks being appended to at any
> given time (excluding compactions).  Since you have 120 disks, not all will
> be participating in write-ahead-logging, so you should not count the IO
> capacity of these extra disks towards expected ingest throughput.  10
> tablet servers per node is probably too many because there would likely be
> a lot of contention flushing/syncing WALs.  I'm not sure how smart HDFS is
> about how it distributes the WAL load.  You might see more benefit with 2-4
> tservers per node.  This would mostly likely require more batch writer
> threads in the client as well.
> 
> I'm not too surprised that snappy did not help because the WALs are not
> compressed and are likely a bigger bottleneck than compaction since you
> have many disks not participating in WAL.
> 
> 
> On Wed, Jul 12, 2017 at 11:16 AM, Josh Elser <el...@apache.org> wrote:
> 
> > You probably want to split the table further than just 4 tablets per
> > tablet server. Try 10's of tablets per server.
> >
> > Also, merging the content from (who I assume is) your coworker on this
> > stackoverflow post[1], I don't believe the suggestion[2] to verify WAL max
> > size, minc threshold, and native maps size was brought up yet.
> >
> > Also, did you look at the JVM GC logs for the TabletServers like was
> > previously suggested to you?
> >
> > [1] https://stackoverflow.com/questions/44928354/accumulo-tablet
> > -server-doesnt-utilize-all-available-resources-on-host-machine/
> > [2] https://accumulo.apache.org/1.8/accumulo_user_manual.html#_n
> > ative_maps_configuration
> >
> > On 7/12/17 10:12 AM, Massimilian Mattetti wrote:
> >
> >> Hi all,
> >>
> >> I ran a few experiments in the last days trying to identify what is the
> >> bottleneck for the ingestion process.
> >> - Running 10 tservers per node instead of only one gave me a very
> >> neglectable performance improvement of about 15%.
> >> - Running the ingestor processes from the two masters give the same
> >> performance as running one ingestor process in each tablet server (10
> >> ingestors)
> >> - neither the network limit (10 Gb network) nor the disk throughput limit
> >> has been reached (1GB/s per node reached while running the TestDFSIO
> >> benchmark on HDFS)
> >> - CPU is always around 20% on each tserver
> >> - changing compression from GZ to snappy did not provide any benefit
> >> - increasing the tserver.total.mutation.queue.maxto 200MB actually
> >> decreased the performance
> >> I am going to run some ingestion experiment with Kudu over the next few
> >> days, but any other suggestion on how improve the performance on Accumulo
> >> is very welcome.
> >> Thanks.
> >>
> >> Best Regards,
> >> Massimiliano
> >>
> >>
> >>
> >> From: Jonathan Wonders <jw...@gmail.com>
> >> To: user@accumulo.apache.org, Dave Marion <dl...@comcast.net>
> >> Date: 07/07/2017 04:02
> >> Subject: Re: maximize usage of cluster resources during ingestion
> >> ------------------------------------------------------------------------
> >>
> >>
> >>
> >> I've personally never seen full CPU utilization during pure ingest.
> >> Typically the bottleneck has been I/O related. The majority of steady-state
> >> CPU utilization under a heavy ingest load is probably due to compression
> >> unless you have custom constraints running. This can depend on the
> >> compression algorithm you have selected.  There is probably a measurable
> >> contribution from inserting into the in-memory map.  Otherwise, not much
> >> computation occurs during ingest per mutation.
> >>
> >> On Thu, Jul 6, 2017 at 8:18 AM, Dave Marion <_dlmarion@comcast.net_
> >> <ma...@comcast.net>> wrote:
> >> That's a good point. I would also look at increasing
> >> tserver.total.mutation.queue.max. Are you seeing hold times? If not, I
> >> would keep pushing harder until you do, then move to multiple tablet
> >> servers. Do you have any GC logs?
> >>
> >>
> >> On July 6, 2017 at 4:47 AM Cyrille Savelief <_csavelief@gmail.com_
> >> <ma...@gmail.com>> wrote:
> >>
> >> Are you sure Accumulo is not waiting for your app's data? There might be
> >> GC pauses in your ingest code (we have already experienced that).
> >>
> >> Le jeu. 6 juil. 2017 à 10:32, Massimilian Mattetti <_MASSIMIL@il.ibm.com_
> >> <ma...@il.ibm.com>> a écrit :
> >> Thank you all for the suggestions.
> >>
> >> About the native memory map I checked the logs on each tablet server and
> >> it was loaded correctly (of course the tserver.memory.maps.native.enabled
> >> was set to true), so the GC pauses should not be the problem eventually. I
> >> managed to get much better ingestion graph by reducing the native map size
> >> to *2GB* and increasing the Batch Writer threads number from the default (3
> >> was really bad for my configuration) to *10* (I think it does not make
> >> sense having more threads than tablet servers, am I right?).
> >>
> >> The configuration that I used for the table is:
> >> "table.file.replication": "2",
> >> "table.compaction.minor.logs.threshold": "3",
> >> "table.durability": "flush",
> >> "table.split.threshold": "1G"
> >>
> >> while for the tablet servers is:
> >> "tserver.wal.blocksize": "1G",
> >>   "tserver.walog.max.size": "2G",
> >> "tserver.memory.maps.max": "2G",
> >> "tserver.compaction.minor.concurrent.max": "50",
> >> "tserver.compaction.major.concurrent.max": "20",
> >> "tserver.wal.replication": "2",
> >>   "tserver.compaction.major.thread.files.open.max": "15"
> >>
> >> The new graph:
> >>
> >>
> >> I still have the problem of a CPU usage that is less than*20%.* So I am
> >> thinking to run multiple tablet servers per node (like 5 or 10) in order to
> >> maximize the CPU usage. Besides that I do not have any other idea on how to
> >> stress those servers with ingestion.
> >> Any suggestions are very welcome. Meanwhile, thank you all again for your
> >> help.
> >>
> >>
> >> Best Regards,
> >> Massimiliano
> >>
> >>
> >>
> >> From: Jonathan Wonders <_jwonders88@gmail.com_ <mailto:
> >> jwonders88@gmail.com>>
> >> To: _user@accumulo.apache.org_ <ma...@accumulo.apache.org>
> >> Date: 06/07/2017 04:01
> >> Subject: Re: maximize usage of cluster resources during ingestion
> >> ------------------------------------------------------------------------
> >>
> >>
> >>
> >> Hi Massimilian,
> >>
> >> Are you seeing held commits during the ingest pauses?  Just based on
> >> having looked at many similar graphs in the past, this might be one of the
> >> major culprits.  A tablet server has a memory region with a bounded size
> >> (tserver.memory.maps.max) where it buffers data that has not yet been
> >> written to RFiles (through the process of minor compaction). The region is
> >> segmented by tablet and each tablet can have a buffer that is undergoing
> >> ingest as well as a buffer that is undergoing minor compaction. A memory
> >> manager decides when to initiate minor compactions for the tablet buffers
> >> and the default implementation tries to keep the memory region 80-90% full
> >> while preferring to compact the largest tablet buffers. Creating larger
> >> RFiles during minor compaction should lead to less major compactions.
> >> During a minor compaction, the tablet buffer still "consumes" memory within
> >> the in memory map and high ingest rates can lead to exhausing the remaining
> >> capacity.  The default memory manage uses an adaptive strategy to predict
> >> the expected memory usage and makes compaction decisions that should
> >> maintain some free memory.  Batch writers can be bursty and a bit
> >> unpredictable which could throw off these estimates.  Also, depending on
> >> the ingest profile, sometimes an in-memory tablet buffer will consume a
> >> large percentage of the total buffer.  This leads to long minor compactions
> >> when the buffer size is large which can allow ingest enough time to exhaust
> >> the buffer before that memory can be reclaimed. When a tablet server has to
> >> block ingest, it can affect client ingest rates to other tablet servers due
> >> to the way that batch writers work.  This can lead to other tablet servers
> >> underestimating future ingest rates which can further exacerbate the
> >> problem.
> >>
> >> There are some configuration changes that could reduce the severity of
> >> held commits, although they might reduce peak ingest rates.  Reducing the
> >> in memory map size can reduce the maximum pause time due to held commits.
> >> Adding additional tablets should help avoid the problem of a single tablet
> >> buffer consuming a large percentage of the memory region.  It might be
> >> better to aim for ~20 tablets per server if your problem allows for it.  It
> >> is also possible to replace the memory manager with a custom one.  I've
> >> tried this in the past and have seen stability improvements by making the
> >> memory thresholds less aggressive (50-75% full).  This did reduce peak
> >> ingest rate in some cases, but that was a reasonable tradeoff.
> >>
> >> Based on your current configuration, if a tablet server is serving 4
> >> tablets and has a 32GB buffer, your first minor compactions will be at
> >> least 8GB and they will probably grow larger over time until the tablets
> >> naturally split.  Consider how long it would take to write this RFile
> >> compared to your peak ingest rate.  As others have suggested, make sure to
> >> use the native maps.  Based on your current JVM heap size, using the Java
> >> in-memory map would probably lead to OOME or very bad GC performance.
> >>
> >> Accumulo can trace minor compaction durations so you can get a feel for
> >> max pause times or measure the effect of configuration changes.
> >>
> >> Cheers,
> >> --Jonathan
> >>
> >> On Wed, Jul 5, 2017 at 7:16 PM, Dave Marion <_dlmarion@comcast.net_
> >> <ma...@comcast.net>> wrote:
> >>
> >> Based on what Cyrille said, I would look at garbage collection,
> >> specifically I would look at how much of your newly allocated objects spill
> >> into the old generation before they are flushed to disk. Additionally, I
> >> would turn off the debug log or log to SSD’s if you have them. Another
> >> thought, seeing that you have 256GB RAM / node, is to run multiple tablet
> >> servers per node. Do you have 10 threads on your Batch Writers? What about
> >> the Batch Writer latency, is it too low such that you are not filling the
> >> buffer?
> >>
> >> *From:* Massimilian Mattetti [mailto:_MASSIMIL@il.ibm.com_ <mailto:
> >> MASSIMIL@il.ibm.com>] *
> >> Sent:* Wednesday, July 05, 2017 8:37 AM*
> >> To:* _user@accumulo.apache.org_ <ma...@accumulo.apache.org>*
> >> Subject:* maximize usage of cluster resources during ingestion
> >>
> >> Hi all,
> >>
> >> I have an Accumulo 1.8.1 cluster made by 12 bare metal servers. Each
> >> server has 256GB of Ram and 2 x 10 cores CPU. 2 machines are used as
> >> masters (running HDFS NameNodes, Accumulo Master and Monitor). The other 10
> >> machines has 12 Disks of 1 TB (11 used by HDFS DataNode process) and are
> >> running Accumulo TServer processes. All the machines are connected via a
> >> 10Gb network and 3 of them are running ZooKeeper. I have run some heavy
> >> ingestion test on this cluster but I have never been able to reach more
> >> than *20% *CPU usage on each Tablet Server. I am running an ingestion
> >> process (using batch writers) on each data node. The table is pre-split in
> >> order to have 4 tablets per tablet server. Monitoring the network I have
> >> seen that data is received/sent from each node with a peak rate of about
> >> 120MB/s / 100MB/s while the aggregated disk write throughput on each tablet
> >> servers is around 120MB/s.
> >>
> >> The table configuration I am playing with are:
> >> "table.file.replication": "2",
> >> "table.compaction.minor.logs.threshold": "10",
> >> "table.durability": "flush",
> >> "table.file.max": "30",
> >> "table.compaction.major.ratio": "9",
> >> "table.split.threshold": "1G"
> >>
> >> while the tablet server configuration is:
> >> "tserver.wal.blocksize": "2G",
> >> "tserver.walog.max.size": "8G",
> >> "tserver.memory.maps.max": "32G",
> >> "tserver.compaction.minor.concurrent.max": "50",
> >> "tserver.compaction.major.concurrent.max": "8",
> >> "tserver.total.mutation.queue.max": "50M",
> >> "tserver.wal.replication": "2",
> >> "tserver.compaction.major.thread.files.open.max": "15"
> >>
> >> the tablet server heap has been set to 32GB
> >>
> >>  From Monitor UI
> >>
> >>
> >> As you can see I have a lot of valleys in which the ingestion rate
> >> reaches 0.
> >> What would be a good procedure to identify the bottleneck which causes
> >> the 0 ingestion rate periods?
> >> Thanks.
> >>
> >> Best Regards,
> >> Max
> >>
> >>
> >>
> >>
> >>
> >>
> >>

Re: maximize usage of cluster resources during ingestion

Posted by Jonathan Wonders <jw...@gmail.com>.
Keep in mind that Accumulo puts a much different kind of load on HDFS than
the DFSIO benchmark.  It might be more appropriate to use a tool like dstat
to monitor HDD utilization and queue depth.  HDD throughput benchmarks
usually will involve high queue depths as disks are much more effective
when they can pipeline and batch updates. Accumulo's WAL workload will
typically call hflush or hsync periodically which interrupts the IO
pipeline much like memory barriers can interrupt CPU pipelining except more
severe.  This is necessary to provide durability guarantees, but definitely
comes at a cost to throughput.  Any database that has these durability
guarantees will suffer similarly to an extent.  For Accumulo, it is
probably worse than for non-distributed databases because the flush or sync
must happen at each replica prior to the mutation being added into the
in-memory map.

I think one of the reasons the recommendation was made to add more tablet
servers is because each tablet server only writes to one WAL at a time and
each block will live on N disks, based on the replication factor.  If you have a
replication factor of 3, there will be 10x3 blocks being appended to at any
given time (excluding compactions).  Since you have 120 disks, not all will
be participating in write-ahead-logging, so you should not count the IO
capacity of these extra disks towards expected ingest throughput.  10
tablet servers per node is probably too many because there would likely be
a lot of contention flushing/syncing WALs.  I'm not sure how smart HDFS is
about how it distributes the WAL load.  You might see more benefit with 2-4
tservers per node.  This would most likely require more batch writer
threads in the client as well.
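
To make that last point concrete, the knob lives on the client's
BatchWriterConfig; the numbers below are placeholders to illustrate the
idea, not a recommendation:

    import java.util.concurrent.TimeUnit;

    import org.apache.accumulo.core.client.BatchWriterConfig;

    public class WriterTuningSketch {
      static BatchWriterConfig tunedConfig() {
        return new BatchWriterConfig()
            .setMaxWriteThreads(16)               // more threads sending to tservers in parallel
            .setMaxMemory(128 * 1024 * 1024)      // larger client-side buffer before flushing
            .setMaxLatency(30, TimeUnit.SECONDS); // don't flush tiny batches too eagerly
      }
    }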

I'm not too surprised that snappy did not help because the WALs are not
compressed and are likely a bigger bottleneck than compaction since you
have many disks not participating in WAL.


On Wed, Jul 12, 2017 at 11:16 AM, Josh Elser <el...@apache.org> wrote:

> You probably want to split the table further than just 4 tablets per
> tablet server. Try 10's of tablets per server.
>
> Also, merging the content from (who I assume is) your coworker on this
> stackoverflow post[1], I don't believe the suggestion[2] to verify WAL max
> size, minc threshold, and native maps size was brought up yet.
>
> Also, did you look at the JVM GC logs for the TabletServers like was
> previously suggested to you?
>
> [1] https://stackoverflow.com/questions/44928354/accumulo-tablet
> -server-doesnt-utilize-all-available-resources-on-host-machine/
> [2] https://accumulo.apache.org/1.8/accumulo_user_manual.html#_n
> ative_maps_configuration
>
> On 7/12/17 10:12 AM, Massimilian Mattetti wrote:
>
>> Hi all,
>>
>> I ran a few experiments in the last days trying to identify what is the
>> bottleneck for the ingestion process.
>> - Running 10 tservers per node instead of only one gave me a very
>> negligible performance improvement of about 15%.
>> - Running the ingestor processes from the two masters give the same
>> performance as running one ingestor process in each tablet server (10
>> ingestors)
>> - neither the network limit (10 Gb network) nor the disk throughput limit
>> has been reached (1GB/s per node reached while running the TestDFSIO
>> benchmark on HDFS)
>> - CPU is always around 20% on each tserver
>> - changing compression from GZ to snappy did not provide any benefit
>> - increasing the tserver.total.mutation.queue.max to 200MB actually
>> decreased the performance
>> I am going to run some ingestion experiment with Kudu over the next few
>> days, but any other suggestion on how to improve the performance on Accumulo
>> is very welcome.
>> Thanks.
>>
>> Best Regards,
>> Massimiliano
>>
>>
>>
>> From: Jonathan Wonders <jw...@gmail.com>
>> To: user@accumulo.apache.org, Dave Marion <dl...@comcast.net>
>> Date: 07/07/2017 04:02
>> Subject: Re: maximize usage of cluster resources during ingestion
>> ------------------------------------------------------------------------
>>
>>
>>
>> I've personally never seen full CPU utilization during pure ingest.
>> Typically the bottleneck has been I/O related. The majority of steady-state
>> CPU utilization under a heavy ingest load is probably due to compression
>> unless you have custom constraints running. This can depend on the
>> compression algorithm you have selected.  There is probably a measurable
>> contribution from inserting into the in-memory map.  Otherwise, not much
>> computation occurs during ingest per mutation.
>>
>> On Thu, Jul 6, 2017 at 8:18 AM, Dave Marion <_dlmarion@comcast.net_
>> <ma...@comcast.net>> wrote:
>> That's a good point. I would also look at increasing
>> tserver.total.mutation.queue.max. Are you seeing hold times? If not, I
>> would keep pushing harder until you do, then move to multiple tablet
>> servers. Do you have any GC logs?
>>
>>
>> On July 6, 2017 at 4:47 AM Cyrille Savelief <_csavelief@gmail.com_
>> <ma...@gmail.com>> wrote:
>>
>> Are you sure Accumulo is not waiting for your app's data? There might be
>> GC pauses in your ingest code (we have already experienced that).
>>
>> Le jeu. 6 juil. 2017 à 10:32, Massimilian Mattetti <_MASSIMIL@il.ibm.com_
>> <ma...@il.ibm.com>> a écrit :
>> Thank you all for the suggestions.
>>
>> About the native memory map I checked the logs on each tablet server and
>> it was loaded correctly (of course the tserver.memory.maps.native.enabled
>> was set to true), so the GC pauses should not be the problem eventually. I
>> managed to get much better ingestion graph by reducing the native map size
>> to *2GB* and increasing the Batch Writer threads number from the default (3
>> was really bad for my configuration) to *10* (I think it does not make
>> sense having more threads than tablet servers, am I right?).
>>
>> The configuration that I used for the table is:
>> "table.file.replication": "2",
>> "table.compaction.minor.logs.threshold": "3",
>> "table.durability": "flush",
>> "table.split.threshold": "1G"
>>
>> while for the tablet servers is:
>> "tserver.wal.blocksize": "1G",
>>   "tserver.walog.max.size": "2G",
>> "tserver.memory.maps.max": "2G",
>> "tserver.compaction.minor.concurrent.max": "50",
>> "tserver.compaction.major.concurrent.max": "20",
>> "tserver.wal.replication": "2",
>>   "tserver.compaction.major.thread.files.open.max": "15"
>>
>> The new graph:
>>
>>
>> I still have the problem of a CPU usage that is less than*20%.* So I am
>> thinking to run multiple tablet servers per node (like 5 or 10) in order to
>> maximize the CPU usage. Besides that I do not have any other idea on how to
>> stress those servers with ingestion.
>> Any suggestions are very welcome. Meanwhile, thank you all again for your
>> help.
>>
>>
>> Best Regards,
>> Massimiliano
>>
>>
>>
>> From: Jonathan Wonders <_jwonders88@gmail.com_ <mailto:
>> jwonders88@gmail.com>>
>> To: _user@accumulo.apache.org_ <ma...@accumulo.apache.org>
>> Date: 06/07/2017 04:01
>> Subject: Re: maximize usage of cluster resources during ingestion
>> ------------------------------------------------------------------------
>>
>>
>>
>> Hi Massimilian,
>>
>> Are you seeing held commits during the ingest pauses?  Just based on
>> having looked at many similar graphs in the past, this might be one of the
>> major culprits.  A tablet server has a memory region with a bounded size
>> (tserver.memory.maps.max) where it buffers data that has not yet been
>> written to RFiles (through the process of minor compaction). The region is
>> segmented by tablet and each tablet can have a buffer that is undergoing
>> ingest as well as a buffer that is undergoing minor compaction. A memory
>> manager decides when to initiate minor compactions for the tablet buffers
>> and the default implementation tries to keep the memory region 80-90% full
>> while preferring to compact the largest tablet buffers. Creating larger
>> RFiles during minor compaction should lead to fewer major compactions.
>> During a minor compaction, the tablet buffer still "consumes" memory within
>> the in memory map and high ingest rates can lead to exhausting the remaining
>> capacity.  The default memory manager uses an adaptive strategy to predict
>> the expected memory usage and makes compaction decisions that should
>> maintain some free memory.  Batch writers can be bursty and a bit
>> unpredictable which could throw off these estimates.  Also, depending on
>> the ingest profile, sometimes an in-memory tablet buffer will consume a
>> large percentage of the total buffer.  This leads to long minor compactions
>> when the buffer size is large which can allow ingest enough time to exhaust
>> the buffer before that memory can be reclaimed. When a tablet server has to
>> block ingest, it can affect client ingest rates to other tablet servers due
>> to the way that batch writers work.  This can lead to other tablet servers
>> underestimating future ingest rates which can further exacerbate the
>> problem.
>>
>> There are some configuration changes that could reduce the severity of
>> held commits, although they might reduce peak ingest rates.  Reducing the
>> in memory map size can reduce the maximum pause time due to held commits.
>> Adding additional tablets should help avoid the problem of a single tablet
>> buffer consuming a large percentage of the memory region.  It might be
>> better to aim for ~20 tablets per server if your problem allows for it.  It
>> is also possible to replace the memory manager with a custom one.  I've
>> tried this in the past and have seen stability improvements by making the
>> memory thresholds less aggressive (50-75% full).  This did reduce peak
>> ingest rate in some cases, but that was a reasonable tradeoff.
>>
>> Based on your current configuration, if a tablet server is serving 4
>> tablets and has a 32GB buffer, your first minor compactions will be at
>> least 8GB and they will probably grow larger over time until the tablets
>> naturally split.  Consider how long it would take to write this RFile
>> compared to your peak ingest rate.  As others have suggested, make sure to
>> use the native maps.  Based on your current JVM heap size, using the Java
>> in-memory map would probably lead to OOME or very bad GC performance.
>>
>> Accumulo can trace minor compaction durations so you can get a feel for
>> max pause times or measure the effect of configuration changes.
>>
>> Cheers,
>> --Jonathan
>>
>> On Wed, Jul 5, 2017 at 7:16 PM, Dave Marion <_dlmarion@comcast.net_
>> <ma...@comcast.net>> wrote:
>>
>> Based on what Cyrille said, I would look at garbage collection,
>> specifically I would look at how much of your newly allocated objects spill
>> into the old generation before they are flushed to disk. Additionally, I
>> would turn off the debug log or log to SSD’s if you have them. Another
>> thought, seeing that you have 256GB RAM / node, is to run multiple tablet
>> servers per node. Do you have 10 threads on your Batch Writers? What about
>> the Batch Writer latency, is it too low such that you are not filling the
>> buffer?
>>
>> *From:* Massimilian Mattetti [mailto:_MASSIMIL@il.ibm.com_ <mailto:
>> MASSIMIL@il.ibm.com>] *
>> Sent:* Wednesday, July 05, 2017 8:37 AM*
>> To:* _user@accumulo.apache.org_ <ma...@accumulo.apache.org>*
>> Subject:* maximize usage of cluster resources during ingestion
>>
>> Hi all,
>>
>> I have an Accumulo 1.8.1 cluster made by 12 bare metal servers. Each
>> server has 256GB of Ram and 2 x 10 cores CPU. 2 machines are used as
>> masters (running HDFS NameNodes, Accumulo Master and Monitor). The other 10
>> machines has 12 Disks of 1 TB (11 used by HDFS DataNode process) and are
>> running Accumulo TServer processes. All the machines are connected via a
>> 10Gb network and 3 of them are running ZooKeeper. I have run some heavy
>> ingestion test on this cluster but I have never been able to reach more
>> than *20% *CPU usage on each Tablet Server. I am running an ingestion
>> process (using batch writers) on each data node. The table is pre-split in
>> order to have 4 tablets per tablet server. Monitoring the network I have
>> seen that data is received/sent from each node with a peak rate of about
>> 120MB/s / 100MB/s while the aggregated disk write throughput on each tablet
>> servers is around 120MB/s.
>>
>> The table configuration I am playing with are:
>> "table.file.replication": "2",
>> "table.compaction.minor.logs.threshold": "10",
>> "table.durability": "flush",
>> "table.file.max": "30",
>> "table.compaction.major.ratio": "9",
>> "table.split.threshold": "1G"
>>
>> while the tablet server configuration is:
>> "tserver.wal.blocksize": "2G",
>> "tserver.walog.max.size": "8G",
>> "tserver.memory.maps.max": "32G",
>> "tserver.compaction.minor.concurrent.max": "50",
>> "tserver.compaction.major.concurrent.max": "8",
>> "tserver.total.mutation.queue.max": "50M",
>> "tserver.wal.replication": "2",
>> "tserver.compaction.major.thread.files.open.max": "15"
>>
>> the tablet server heap has been set to 32GB
>>
>>  From Monitor UI
>>
>>
>> As you can see I have a lot of valleys in which the ingestion rate
>> reaches 0.
>> What would be a good procedure to identify the bottleneck which causes
>> the 0 ingestion rate periods?
>> Thanks.
>>
>> Best Regards,
>> Max
>>
>>
>>
>>
>>
>>
>>

Re: maximize usage of cluster resources during ingestion

Posted by Josh Elser <el...@apache.org>.
You probably want to split the table further than just 4 tablets per 
tablet server. Try tens of tablets per server.
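
For example, adding split points up front could look roughly like this 
(just a sketch: the split points and table name are placeholders, pick 
ones that match your row key distribution):

    import java.util.TreeSet;
    import org.apache.hadoop.io.Text;

    TreeSet<Text> splits = new TreeSet<>();
    for (int i = 0; i < 200; i++) {   // ~20 tablets per tserver across 10 tservers
      splits.add(new Text(String.format("row_%03d", i)));
    }
    conn.tableOperations().addSplits("ingest_test", splits);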

Also, merging the content from (who I assume is) your coworker on this 
stackoverflow post[1], I don't believe the suggestion[2] to verify WAL 
max size, minc threshold, and native maps size was brought up yet.

Also, did you look at the JVM GC logs for the TabletServers, as was 
previously suggested to you?

[1] 
https://stackoverflow.com/questions/44928354/accumulo-tablet-server-doesnt-utilize-all-available-resources-on-host-machine/
[2] 
https://accumulo.apache.org/1.8/accumulo_user_manual.html#_native_maps_configuration

On 7/12/17 10:12 AM, Massimilian Mattetti wrote:
> Hi all,
> 
> I ran a few experiments in the last days trying to identify what is the 
> bottleneck for the ingestion process.
> - Running 10 tservers per node instead of only one gave me a very 
> negligible performance improvement of about 15%. 
> - Running the ingestor processes from the two masters give the same 
> performance as running one ingestor process in each tablet server (10 
> ingestors)
> - neither the network limit (10 Gb network) nor the disk throughput 
> limit has been reached (1GB/s per node reached while running the 
> TestDFSIO benchmark on HDFS)
> - CPU is always around 20% on each tserver
> - changing compression from GZ to snappy did not provide any benefit
> - increasing the tserver.total.mutation.queue.max to 200MB actually 
> decreased the performance
> I am going to run some ingestion experiment with Kudu over the next few 
> days, but any other suggestion on how to improve the performance on 
> Accumulo is very welcome.
> Thanks.
> 
> Best Regards,
> Massimiliano
> 
> 
> 
> From: Jonathan Wonders <jw...@gmail.com>
> To: user@accumulo.apache.org, Dave Marion <dl...@comcast.net>
> Date: 07/07/2017 04:02
> Subject: Re: maximize usage of cluster resources during ingestion
> ------------------------------------------------------------------------
> 
> 
> 
> I've personally never seen full CPU utilization during pure ingest.  
> Typically the bottleneck has been I/O related. The majority of 
> steady-state CPU utilization under a heavy ingest load is probably due 
> to compression unless you have custom constraints running. This can 
> depend on the compression algorithm you have selected.  There is 
> probably a measurable contribution from inserting into the in-memory 
> map.  Otherwise, not much computation occurs during ingest per mutation.
> 
> On Thu, Jul 6, 2017 at 8:18 AM, Dave Marion <_dlmarion@comcast.net_ 
> <ma...@comcast.net>> wrote:
> That's a good point. I would also look at increasing 
> tserver.total.mutation.queue.max. Are you seeing hold times? If not, I 
> would keep pushing harder until you do, then move to multiple tablet 
> servers. Do you have any GC logs?
> 
> 
> On July 6, 2017 at 4:47 AM Cyrille Savelief <_csavelief@gmail.com_ 
> <ma...@gmail.com>> wrote:
> 
> Are you sure Accumulo is not waiting for your app's data? There might be 
> GC pauses in your ingest code (we have already experienced that).
> 
> Le jeu. 6 juil. 2017 à 10:32, Massimilian Mattetti 
> <_MASSIMIL@il.ibm.com_ <ma...@il.ibm.com>> a écrit :
> Thank you all for the suggestions.
> 
> About the native memory map I checked the logs on each tablet server and 
> it was loaded correctly (of course the 
> tserver.memory.maps.native.enabled was set to true), so the GC pauses 
> should not be the problem eventually. I managed to get much better 
> ingestion graph by reducing the native map size to *2GB* and increasing 
> the Batch Writer threads number from the default (3 was really bad for 
> my configuration) to *10* (I think it does not make sense having more 
> threads than tablet servers, am I right?).
> 
> The configuration that I used for the table is:
> "table.file.replication": "2",
> "table.compaction.minor.logs.threshold": "3",
> "table.durability": "flush",
> "table.split.threshold": "1G"
> 
> while for the tablet servers is:
> "tserver.wal.blocksize": "1G",
>   "tserver.walog.max.size": "2G",
> "tserver.memory.maps.max": "2G",
> "tserver.compaction.minor.concurrent.max": "50",
> "tserver.compaction.major.concurrent.max": "20",
> "tserver.wal.replication": "2",
>   "tserver.compaction.major.thread.files.open.max": "15"
> 
> The new graph:
> 
> 
> I still have the problem of a CPU usage that is less than*20%.* So I am 
> thinking to run multiple tablet servers per node (like 5 or 10) in order 
> to maximize the CPU usage. Besides that I do not have any other idea on 
> how to stress those servers with ingestion.
> Any suggestions are very welcome. Meanwhile, thank you all again for 
> your help.
> 
> 
> Best Regards,
> Massimiliano
> 
> 
> 
> From: Jonathan Wonders <_jwonders88@gmail.com_ 
> <ma...@gmail.com>>
> To: _user@accumulo.apache.org_ <ma...@accumulo.apache.org>
> Date: 06/07/2017 04:01
> Subject: Re: maximize usage of cluster resources during ingestion
> ------------------------------------------------------------------------
> 
> 
> 
> Hi Massimilian,
> 
> Are you seeing held commits during the ingest pauses?  Just based on 
> having looked at many similar graphs in the past, this might be one of 
> the major culprits.  A tablet server has a memory region with a bounded 
> size (tserver.memory.maps.max) where it buffers data that has not yet 
> been written to RFiles (through the process of minor compaction). The 
> region is segmented by tablet and each tablet can have a buffer that is 
> undergoing ingest as well as a buffer that is undergoing minor 
> compaction. A memory manager decides when to initiate minor compactions 
> for the tablet buffers and the default implementation tries to keep the 
> memory region 80-90% full while preferring to compact the largest tablet 
> buffers. Creating larger RFiles during minor compaction should lead to 
> fewer major compactions.  During a minor compaction, the tablet buffer 
> still "consumes" memory within the in memory map and high ingest rates 
> can lead to exhausting the remaining capacity.  The default memory manager 
> uses an adaptive strategy to predict the expected memory usage and makes 
> compaction decisions that should maintain some free memory.  Batch 
> writers can be bursty and a bit unpredictable which could throw off 
> these estimates.  Also, depending on the ingest profile, sometimes an 
> in-memory tablet buffer will consume a large percentage of the total 
> buffer.  This leads to long minor compactions when the buffer size is 
> large which can allow ingest enough time to exhaust the buffer before 
> that memory can be reclaimed. When a tablet server has to block ingest, 
> it can affect client ingest rates to other tablet servers due to the way 
> that batch writers work.  This can lead to other tablet servers 
> underestimating future ingest rates which can further exacerbate the 
> problem.
> 
> There are some configuration changes that could reduce the severity of 
> held commits, although they might reduce peak ingest rates.  Reducing 
> the in memory map size can reduce the maximum pause time due to held 
> commits. Adding additional tablets should help avoid the problem of a 
> single tablet buffer consuming a large percentage of the memory region.  
> It might be better to aim for ~20 tablets per server if your problem 
> allows for it.  It is also possible to replace the memory manager with a 
> custom one.  I've tried this in the past and have seen stability 
> improvements by making the memory thresholds less aggressive (50-75% 
> full).  This did reduce peak ingest rate in some cases, but that was a 
> reasonable tradeoff.
> 
> Based on your current configuration, if a tablet server is serving 4 
> tablets and has a 32GB buffer, your first minor compactions will be at 
> least 8GB and they will probably grow larger over time until the tablets 
> naturally split.  Consider how long it would take to write this RFile 
> compared to your peak ingest rate.  As others have suggested, make sure 
> to use the native maps.  Based on your current JVM heap size, using the 
> Java in-memory map would probably lead to OOME or very bad GC performance.
> 
> Accumulo can trace minor compaction durations so you can get a feel for 
> max pause times or measure the effect of configuration changes.
> 
> Cheers,
> --Jonathan
> 
> On Wed, Jul 5, 2017 at 7:16 PM, Dave Marion <_dlmarion@comcast.net_ 
> <ma...@comcast.net>> wrote:
> 
> Based on what Cyrille said, I would look at garbage collection, 
> specifically I would look at how much of your newly allocated objects 
> spill into the old generation before they are flushed to disk. 
> Additionally, I would turn off the debug log or log to SSD’s if you have 
> them. Another thought, seeing that you have 256GB RAM / node, is to run 
> multiple tablet servers per node. Do you have 10 threads on your Batch 
> Writers? What about the Batch Writer latency, is it too low such that 
> you are not filling the buffer?
> 
> *From:* Massimilian Mattetti [mailto:_MASSIMIL@il.ibm.com_ 
> <ma...@il.ibm.com>] *
> Sent:* Wednesday, July 05, 2017 8:37 AM*
> To:* _user@accumulo.apache.org_ <ma...@accumulo.apache.org>*
> Subject:* maximize usage of cluster resources during ingestion
> 
> Hi all,
> 
> I have an Accumulo 1.8.1 cluster made by 12 bare metal servers. Each 
> server has 256GB of Ram and 2 x 10 cores CPU. 2 machines are used as 
> masters (running HDFS NameNodes, Accumulo Master and Monitor). The other 
> 10 machines has 12 Disks of 1 TB (11 used by HDFS DataNode process) and 
> are running Accumulo TServer processes. All the machines are connected 
> via a 10Gb network and 3 of them are running ZooKeeper. I have run some 
> heavy ingestion test on this cluster but I have never been able to reach 
> more than *20% *CPU usage on each Tablet Server. I am running an 
> ingestion process (using batch writers) on each data node. The table is 
> pre-split in order to have 4 tablets per tablet server. Monitoring the 
> network I have seen that data is received/sent from each node with a 
> peak rate of about 120MB/s / 100MB/s while the aggregated disk write 
> throughput on each tablet servers is around 120MB/s.
> 
> The table configuration I am playing with are:
> "table.file.replication": "2",
> "table.compaction.minor.logs.threshold": "10",
> "table.durability": "flush",
> "table.file.max": "30",
> "table.compaction.major.ratio": "9",
> "table.split.threshold": "1G"
> 
> while the tablet server configuration is:
> "tserver.wal.blocksize": "2G",
> "tserver.walog.max.size": "8G",
> "tserver.memory.maps.max": "32G",
> "tserver.compaction.minor.concurrent.max": "50",
> "tserver.compaction.major.concurrent.max": "8",
> "tserver.total.mutation.queue.max": "50M",
> "tserver.wal.replication": "2",
> "tserver.compaction.major.thread.files.open.max": "15"
> 
> the tablet server heap has been set to 32GB
> 
>  From Monitor UI
> 
> 
> As you can see I have a lot of valleys in which the ingestion rate 
> reaches 0.
> What would be a good procedure to identify the bottleneck which causes 
> the 0 ingestion rate periods?
> Thanks.
> 
> Best Regards,
> Max
> 
> 
> 
> 
> 
> 

Re: maximize usage of cluster resources during ingestion

Posted by Massimilian Mattetti <MA...@il.ibm.com>.
Hi all,

I ran a few experiments in the last days trying to identify what is the 
bottleneck for the ingestion process.
- Running 10 tservers per node instead of only one gave me a negligible 
performance improvement of about 15%. 
- Running the ingestor processes from the two masters gives the same 
performance as running one ingestor process in each tablet server (10 
ingestors)
- neither the network limit (10 Gb network) nor the disk throughput limit 
has been reached (1GB/s per node reached while running the TestDFSIO 
benchmark on HDFS)
- CPU is always around 20% on each tserver
- changing compression from GZ to snappy did not provide any benefit
- increasing the tserver.total.mutation.queue.max to 200MB actually 
decreased the performance
I am going to run some ingestion experiment with Kudu over the next few 
days, but any other suggestion on how to improve the performance on Accumulo 
is very welcome.
Thanks.

Best Regards,
Massimiliano 
 


From:   Jonathan Wonders <jw...@gmail.com>
To:     user@accumulo.apache.org, Dave Marion <dl...@comcast.net>
Date:   07/07/2017 04:02
Subject:        Re: maximize usage of cluster resources during ingestion



I've personally never seen full CPU utilization during pure ingest.  
Typically the bottleneck has been I/O related.  The majority of 
steady-state CPU utilization under a heavy ingest load is probably due to 
compression unless you have custom constraints running.  This can depend 
on the compression algorithm you have selected.  There is probably a 
measurable contribution from inserting into the in-memory map.  Otherwise, 
not much computation occurs during ingest per mutation.

On Thu, Jul 6, 2017 at 8:18 AM, Dave Marion <dl...@comcast.net> wrote:
That's a good point. I would also look at increasing 
tserver.total.mutation.queue.max. Are you seeing hold times? If not, I 
would keep pushing harder until you do, then move to multiple tablet 
servers. Do you have any GC logs?


On July 6, 2017 at 4:47 AM Cyrille Savelief <cs...@gmail.com> wrote:

Are you sure Accumulo is not waiting for your app's data? There might be 
GC pauses in your ingest code (we have already experienced that).

Le jeu. 6 juil. 2017 à 10:32, Massimilian Mattetti <MA...@il.ibm.com> a 
écrit :
Thank you all for the suggestions.

About the native memory map I checked the logs on each tablet server and 
it was loaded correctly (of course the tserver.memory.maps.native.enabled 
was set to true), so the GC pauses should not be the problem eventually. I 
managed to get much better ingestion graph by reducing the native map size 
to 2GB and increasing the Batch Writer threads number from the default (3 
was really bad for my configuration) to 10 (I think it does not make sense 
having more threads than tablet servers, am I right?).

The configuration that I used for the table is:
"table.file.replication": "2",
"table.compaction.minor.logs.threshold": "3",
"table.durability": "flush",
"table.split.threshold": "1G"

while for the tablet servers is:
"tserver.wal.blocksize": "1G",
 "tserver.walog.max.size": "2G",
"tserver.memory.maps.max": "2G",
"tserver.compaction.minor.concurrent.max": "50",
"tserver.compaction.major.concurrent.max": "20",
"tserver.wal.replication": "2",
 "tserver.compaction.major.thread.files.open.max": "15"

The new graph:


I still have the problem of a CPU usage that is less than 20%. So I am 
thinking to run multiple tablet servers per node (like 5 or 10) in order 
to maximize the CPU usage. Besides that I do not have any other idea on 
how to stress those servers with ingestion. 
Any suggestions are very welcome. Meanwhile, thank you all again for your 
help.


Best Regards,
Massimiliano



From:        Jonathan Wonders <jw...@gmail.com>
To:        user@accumulo.apache.org
Date:        06/07/2017 04:01
Subject:        Re: maximize usage of cluster resources during ingestion



Hi Massimilian,

Are you seeing held commits during the ingest pauses?  Just based on 
having looked at many similar graphs in the past, this might be one of the 
major culprits.  A tablet server has a memory region with a bounded size 
(tserver.memory.maps.max) where it buffers data that has not yet been 
written to RFiles (through the process of minor compaction).  The region 
is segmented by tablet and each tablet can have a buffer that is 
undergoing ingest as well as a buffer that is undergoing minor 
compaction.  A memory manager decides when to initiate minor compactions 
for the tablet buffers and the default implementation tries to keep the 
memory region 80-90% full while preferring to compact the largest tablet 
buffers.  Creating larger RFiles during minor compaction should lead to 
fewer major compactions.  During a minor compaction, the tablet buffer 
still "consumes" memory within the in memory map and high ingest rates can 
lead to exhausting the remaining capacity.  The default memory manager uses 
an adaptive strategy to predict the expected memory usage and makes 
compaction decisions that should maintain some free memory.  Batch writers 
can be bursty and a bit unpredictable which could throw off these 
estimates.  Also, depending on the ingest profile, sometimes an in-memory 
tablet buffer will consume a large percentage of the total buffer.  This 
leads to long minor compactions when the buffer size is large which can 
allow ingest enough time to exhaust the buffer before that memory can be 
reclaimed.  When a tablet server has to block ingest, it can affect client 
ingest rates to other tablet servers due to the way that batch writers 
work.  This can lead to other tablet servers underestimating future ingest 
rates which can further exacerbate the problem.

There are some configuration changes that could reduce the severity of 
held commits, although they might reduce peak ingest rates.  Reducing the 
in memory map size can reduce the maximum pause time due to held commits.  
Adding additional tablets should help avoid the problem of a single tablet 
buffer consuming a large percentage of the memory region.  It might be 
better to aim for ~20 tablets per server if your problem allows for it.  
It is also possible to replace the memory manager with a custom one.  I've 
tried this in the past and have seen stability improvements by making the 
memory thresholds less aggressive (50-75% full).  This did reduce peak 
ingest rate in some cases, but that was a reasonable tradeoff.

Based on your current configuration, if a tablet server is serving 4 
tablets and has a 32GB buffer, your first minor compactions will be at 
least 8GB and they will probably grow larger over time until the tablets 
naturally split.  Consider how long it would take to write this RFile 
compared to your peak ingest rate.  As others have suggested, make sure to 
use the native maps.  Based on your current JVM heap size, using the Java 
in-memory map would probably lead to OOME or very bad GC performance.

Accumulo can trace minor compaction durations so you can get a feel for 
max pause times or measure the effect of configuration changes.

Cheers,
--Jonathan

On Wed, Jul 5, 2017 at 7:16 PM, Dave Marion <dl...@comcast.net> wrote:
 
Based on what Cyrille said, I would look at garbage collection, 
specifically I would look at how much of your newly allocated objects 
spill into the old generation before they are flushed to disk. 
Additionally, I would turn off the debug log or log to SSDs if you have 
them. Another thought, seeing that you have 256GB RAM / node, is to run 
multiple tablet servers per node. Do you have 10 threads on your Batch 
Writers? What about the Batch Writer latency, is it too low such that you 
are not filling the buffer? 
 
From: Massimilian Mattetti [mailto:MASSIMIL@il.ibm.com] 
Sent: Wednesday, July 05, 2017 8:37 AM
To: user@accumulo.apache.org
Subject: maximize usage of cluster resources during ingestion
 
Hi all,

I have an Accumulo 1.8.1 cluster made by 12 bare metal servers. Each 
server has 256GB of Ram and 2 x 10 cores CPU. 2 machines are used as 
masters (running HDFS NameNodes, Accumulo Master and Monitor). The other 
10 machines has 12 Disks of 1 TB (11 used by HDFS DataNode process) and 
are running Accumulo TServer processes. All the machines are connected via 
a 10Gb network and 3 of them are running ZooKeeper. I have run some heavy 
ingestion test on this cluster but I have never been able to reach more 
than 20% CPU usage on each Tablet Server. I am running an ingestion 
process (using batch writers) on each data node. The table is pre-split in 
order to have 4 tablets per tablet server. Monitoring the network I have 
seen that data is received/sent from each node with a peak rate of about 
120MB/s / 100MB/s while the aggregated disk write throughput on each 
tablet servers is around 120MB/s. 

The table configuration I am playing with are:
"table.file.replication": "2",
"table.compaction.minor.logs.threshold": "10",
"table.durability": "flush",
"table.file.max": "30",
"table.compaction.major.ratio": "9",
"table.split.threshold": "1G"

while the tablet server configuration is:
"tserver.wal.blocksize": "2G",
"tserver.walog.max.size": "8G",
"tserver.memory.maps.max": "32G",
"tserver.compaction.minor.concurrent.max": "50",
"tserver.compaction.major.concurrent.max": "8",
"tserver.total.mutation.queue.max": "50M",
"tserver.wal.replication": "2",
"tserver.compaction.major.thread.files.open.max": "15"

the tablet server heap has been set to 32GB

From Monitor UI


As you can see I have a lot of valleys in which the ingestion rate reaches 
0. 
What would be a good procedure to identify the bottleneck which causes the 
0 ingestion rate periods?
Thanks.

Best Regards,
Max


 





Re: maximize usage of cluster resources during ingestion

Posted by Jonathan Wonders <jw...@gmail.com>.
I've personally never seen full CPU utilization during pure ingest.
Typically the bottleneck has been I/O related.  The majority of
steady-state CPU utilization under a heavy ingest load is probably due to
compression unless you have custom constraints running.  This can depend on
the compression algorithm you have selected.  There is probably a
measurable contribution from inserting into the in-memory map.  Otherwise,
not much computation occurs during ingest per mutation.
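
If you want to confirm how much of that CPU is compression, the codec can
be switched per table and compared under the same load; a rough sketch
(the table name is a placeholder and "conn" is an existing Connector):

    // valid values include gz (the default), snappy, and none
    conn.tableOperations().setProperty("ingest_test",
        "table.file.compress.type", "snappy");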

On Thu, Jul 6, 2017 at 8:18 AM, Dave Marion <dl...@comcast.net> wrote:

> That's a good point. I would also look at increasing
> tserver.total.mutation.queue.max. Are you seeing hold times? If not, I
> would keep pushing harder until you do, then move to multiple tablet
> servers. Do you have any GC logs?
>
>
> On July 6, 2017 at 4:47 AM Cyrille Savelief <cs...@gmail.com> wrote:
>
> Are you sure Accumulo is not waiting for your app's data? There might be
> GC pauses in your ingest code (we have already experienced that).
>
> Le jeu. 6 juil. 2017 à 10:32, Massimilian Mattetti <MA...@il.ibm.com>
> a écrit :
>
>> Thank you all for the suggestions.
>>
>> About the native memory map I checked the logs on each tablet server and
>> it was loaded correctly (of course the tserver.memory.maps.native.enabled
>> was set to true), so the GC pauses should not be the problem eventually. I
>> managed to get much better ingestion graph by reducing the native map size
>> to *2GB* and increasing the Batch Writer threads number from the default
>> (3 was really bad for my configuration) to *10* (I think it does not
>> make sense having more threads than tablet servers, am I right?).
>>
>> The configuration that I used for the table is:
>> "table.file.replication": "2",
>> "table.compaction.minor.logs.threshold": "3",
>> "table.durability": "flush",
>> "table.split.threshold": "1G"
>>
>> while for the tablet servers is:
>> "tserver.wal.blocksize": "1G",
>>  "tserver.walog.max.size": "2G",
>> "tserver.memory.maps.max": "2G",
>> "tserver.compaction.minor.concurrent.max": "50",
>> "tserver.compaction.major.concurrent.max": "20",
>> "tserver.wal.replication": "2",
>>  "tserver.compaction.major.thread.files.open.max": "15"
>>
>> The new graph:
>>
>>
>> I still have the problem of a CPU usage that is less than* 20%.* So I am
>> thinking to run multiple tablet servers per node (like 5 or 10) in order to
>> maximize the CPU usage. Besides that I do not have any other idea on how to
>> stress those servers with ingestion.
>> Any suggestions are very welcome. Meanwhile, thank you all again for your
>> help.
>>
>>
>> Best Regards,
>> Massimiliano
>>
>>
>>
>> From:        Jonathan Wonders <jw...@gmail.com>
>> To:        user@accumulo.apache.org
>> Date:        06/07/2017 04:01
>> Subject:        Re: maximize usage of cluster resources during ingestion
>> ------------------------------
>>
>>
>>
>> Hi Massimilian,
>>
>> Are you seeing held commits during the ingest pauses?  Just based on
>> having looked at many similar graphs in the past, this might be one of the
>> major culprits.  A tablet server has a memory region with a bounded size
>> (tserver.memory.maps.max) where it buffers data that has not yet been
>> written to RFiles (through the process of minor compaction).  The region is
>> segmented by tablet and each tablet can have a buffer that is undergoing
>> ingest as well as a buffer that is undergoing minor compaction.  A memory
>> manager decides when to initiate minor compactions for the tablet buffers
>> and the default implementation tries to keep the memory region 80-90% full
>> while preferring to compact the largest tablet buffers.  Creating larger
>> RFiles during minor compaction should lead to fewer major compactions.
>> During a minor compaction, the tablet buffer still "consumes" memory within
>> the in memory map and high ingest rates can lead to exhausting the remaining
>> capacity.  The default memory manager uses an adaptive strategy to predict
>> the expected memory usage and makes compaction decisions that should
>> maintain some free memory.  Batch writers can be bursty and a bit
>> unpredictable which could throw off these estimates.  Also, depending on
>> the ingest profile, sometimes an in-memory tablet buffer will consume a
>> large percentage of the total buffer.  This leads to long minor compactions
>> when the buffer size is large which can allow ingest enough time to exhaust
>> the buffer before that memory can be reclaimed.  When a tablet server has
>> to block ingest, it can affect client ingest rates to other tablet servers
>> due to the way that batch writers work.  This can lead to other tablet
>> servers underestimating future ingest rates which can further exacerbate
>> the problem.
>>
>> There are some configuration changes that could reduce the severity of
>> held commits, although they might reduce peak ingest rates.  Reducing the
>> in memory map size can reduce the maximum pause time due to held commits.
>> Adding additional tablets should help avoid the problem of a single tablet
>> buffer consuming a large percentage of the memory region.  It might be
>> better to aim for ~20 tablets per server if your problem allows for it.  It
>> is also possible to replace the memory manager with a custom one.  I've
>> tried this in the past and have seen stability improvements by making the
>> memory thresholds less aggressive (50-75% full).  This did reduce peak
>> ingest rate in some cases, but that was a reasonable tradeoff.
>>
>> Based on your current configuration, if a tablet server is serving 4
>> tablets and has a 32GB buffer, your first minor compactions will be at
>> least 8GB and they will probably grow larger over time until the tablets
>> naturally split.  Consider how long it would take to write this RFile
>> compared to your peak ingest rate.  As others have suggested, make sure to
>> use the native maps.  Based on your current JVM heap size, using the Java
>> in-memory map would probably lead to OOME or very bad GC performance.
>>
>> Accumulo can trace minor compaction durations so you can get a feel for
>> max pause times or measure the effect of configuration changes.
>>
>> Cheers,
>> --Jonathan
>>
>> On Wed, Jul 5, 2017 at 7:16 PM, Dave Marion <*dlmarion@comcast.net*
>> <dl...@comcast.net>> wrote:
>>
>>
>> Based on what Cyrille said, I would look at garbage collection,
>> specifically I would look at how much of your newly allocated objects spill
>> into the old generation before they are flushed to disk. Additionally, I
>> would turn off the debug log or log to SSD’s if you have them. Another
>> thought, seeing that you have 256GB RAM / node, is to run multiple tablet
>> servers per node. Do you have 10 threads on your Batch Writers? What about
>> the Batch Writer latency, is it too low such that you are not filling the
>> buffer?
>>
>>
>>
>> *From:* Massimilian Mattetti [mailto:*MASSIMIL@il.ibm.com*
>> <MA...@il.ibm.com>]
>> *Sent:* Wednesday, July 05, 2017 8:37 AM
>> *To:* *user@accumulo.apache.org* <us...@accumulo.apache.org>
>> *Subject:* maximize usage of cluster resources during ingestion
>>
>>
>>
>> Hi all,
>>
>> I have an Accumulo 1.8.1 cluster made by 12 bare metal servers. Each
>> server has 256GB of Ram and 2 x 10 cores CPU. 2 machines are used as
>> masters (running HDFS NameNodes, Accumulo Master and Monitor). The other 10
>> machines has 12 Disks of 1 TB (11 used by HDFS DataNode process) and are
>> running Accumulo TServer processes. All the machines are connected via a
>> 10Gb network and 3 of them are running ZooKeeper. I have run some heavy
>> ingestion test on this cluster but I have never been able to reach more
>> than *20% *CPU usage on each Tablet Server. I am running an ingestion
>> process (using batch writers) on each data node. The table is pre-split in
>> order to have 4 tablets per tablet server. Monitoring the network I have
>> seen that data is received/sent from each node with a peak rate of about
>> 120MB/s / 100MB/s while the aggregated disk write throughput on each tablet
>> servers is around 120MB/s.
>>
>> The table configuration I am playing with are:
>> "table.file.replication": "2",
>> "table.compaction.minor.logs.threshold": "10",
>> "table.durability": "flush",
>> "table.file.max": "30",
>> "table.compaction.major.ratio": "9",
>> "table.split.threshold": "1G"
>>
>> while the tablet server configuration is:
>> "tserver.wal.blocksize": "2G",
>> "tserver.walog.max.size": "8G",
>> "tserver.memory.maps.max": "32G",
>> "tserver.compaction.minor.concurrent.max": "50",
>> "tserver.compaction.major.concurrent.max": "8",
>> "tserver.total.mutation.queue.max": "50M",
>> "tserver.wal.replication": "2",
>> "tserver.compaction.major.thread.files.open.max": "15"
>>
>> the tablet server heap has been set to 32GB
>>
>> From Monitor UI
>>
>>
>> As you can see I have a lot of valleys in which the ingestion rate
>> reaches 0.
>> What would be a good procedure to identify the bottleneck which causes
>> the 0 ingestion rate periods?
>> Thanks.
>>
>> Best Regards,
>> Max
>>
>>
>>
>
>

Re: maximize usage of cluster resources during ingestion

Posted by Dave Marion <dl...@comcast.net>.
That's a good point. I would also look at increasing tserver.total.mutation.queue.max. Are you seeing hold times? If not, I would keep pushing harder until you do, then move to multiple tablet servers. Do you have any GC logs?
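
For reference, that property can be overridden system-wide without editing accumulo-site.xml on every node; a rough sketch (256M is just an example value and "conn" is an existing Connector with admin credentials):

    // stores the override in ZooKeeper; note that some tserver properties
    // still require a tserver restart before they take effect
    conn.instanceOperations().setProperty("tserver.total.mutation.queue.max", "256M");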


> On July 6, 2017 at 4:47 AM Cyrille Savelief <cs...@gmail.com> wrote:
> 
>     Are you sure Accumulo is not waiting for your app's data? There might be GC pauses in your ingest code (we have already experienced that).
> 
>     Le jeu. 6 juil. 2017 à 10:32, Massimilian Mattetti <MASSIMIL@il.ibm.com mailto:MASSIMIL@il.ibm.com > a écrit :
> 
>         > > Thank you all for the suggestions.
> > 
> >         About the native memory map I checked the logs on each tablet server and it was loaded correctly (of course the tserver.memory.maps.native.enabled was set to true), so the GC pauses should not be the problem eventually. I managed to get much better ingestion graph by reducing the native map size to 2GB and increasing the Batch Writer threads number from the default (3 was really bad for my configuration) to 10 (I think it does not make sense having more threads than tablet servers, am I right?).
> > 
> >         The configuration that I used for the table is:
> >         "table.file.replication": "2",
> >         "table.compaction.minor.logs.threshold": "3",
> >         "table.durability": "flush",
> >         "table.split.threshold": "1G"
> > 
> >         while for the tablet servers is:
> >         "tserver.wal.blocksize": "1G",
> >          "tserver.walog.max.size": "2G",
> >         "tserver.memory.maps.max": "2G",
> >         "tserver.compaction.minor.concurrent.max": "50",
> >         "tserver.compaction.major.concurrent.max": "20",
> >         "tserver.wal.replication": "2",
> >          "tserver.compaction.major.thread.files.open.max": "15"
> > 
> >         The new graph:
> > 
> > 
> >         I still have the problem of a CPU usage that is less than 20%. So I am thinking to run multiple tablet servers per node (like 5 or 10) in order to maximize the CPU usage. Besides that I do not have any other idea on how to stress those servers with ingestion.
> >         Any suggestions are very welcome. Meanwhile, thank you all again for your help.
> > 
> > 
> >         Best Regards,
> >         Massimiliano
> > 
> > 
> > 
> >         From:        Jonathan Wonders <jwonders88@gmail.com mailto:jwonders88@gmail.com >
> >         To:        user@accumulo.apache.org mailto:user@accumulo.apache.org
> >         Date:        06/07/2017 04:01
> >         Subject:        Re: maximize usage of cluster resources during ingestion
> > 
> >         ---------------------------------------------
> > 
> > 
> > 
> >         Hi Massimilian,
> > 
> >         Are you seeing held commits during the ingest pauses?  Just based on having looked at many similar graphs in the past, this might be one of the major culprits.  A tablet server has a memory region with a bounded size (tserver.memory.maps.max) where it buffers data that has not yet been written to RFiles (through the process of minor compaction).  The region is segmented by tablet and each tablet can have a buffer that is undergoing ingest as well as a buffer that is undergoing minor compaction.  A memory manager decides when to initiate minor compactions for the tablet buffers and the default implementation tries to keep the memory region 80-90% full while preferring to compact the largest tablet buffers.  Creating larger RFiles during minor compaction should lead to less major compactions.  During a minor compaction, the tablet buffer still "consumes" memory within the in memory map and high ingest rates can lead to exhausing the remaining capacity.  The default memory manage uses an adaptive strategy to predict the expected memory usage and makes compaction decisions that should maintain some free memory.  Batch writers can be bursty and a bit unpredictable which could throw off these estimates.  Also, depending on the ingest profile, sometimes an in-memory tablet buffer will consume a large percentage of the total buffer.  This leads to long minor compactions when the buffer size is large which can allow ingest enough time to exhaust the buffer before that memory can be reclaimed.  When a tablet server has to block ingest, it can affect client ingest rates to other tablet servers due to the way that batch writers work.  This can lead to other tablet servers underestimating future ingest rates which can further exacerbate the problem.
> > 
> >         There are some configuration changes that could reduce the severity of held commits, although they might reduce peak ingest rates.  Reducing the in memory map size can reduce the maximum pause time due to held commits.  Adding additional tablets should help avoid the problem of a single tablet buffer consuming a large percentage of the memory region.  It might be better to aim for ~20 tablets per server if your problem allows for it.  It is also possible to replace the memory manager with a custom one.  I've tried this in the past and have seen stability improvements by making the memory thresholds less aggressive (50-75% full).  This did reduce peak ingest rate in some cases, but that was a reasonable tradeoff.
> > 
> >         Based on your current configuration, if a tablet server is serving 4 tablets and has a 32GB buffer, your first minor compactions will be at least 8GB and they will probably grow larger over time until the tablets naturally split.  Consider how long it would take to write this RFile compared to your peak ingest rate.  As others have suggested, make sure to use the native maps.  Based on your current JVM heap size, using the Java in-memory map would probably lead to OOME or very bad GC performance.
> > 
> >         Accumulo can trace minor compaction durations so you can get a feel for max pause times or measure the effect of configuration changes.
> > 
> >         Cheers,
> >         --Jonathan
> > 
> >         On Wed, Jul 5, 2017 at 7:16 PM, Dave Marion <dlmarion@comcast.net mailto:dlmarion@comcast.net > wrote:
> >          
> > 
> >         Based on what Cyrille said, I would look at garbage collection, specifically I would look at how much of your newly allocated objects spill into the old generation before they are flushed to disk. Additionally, I would turn off the debug log or log to SSD’s if you have them. Another thought, seeing that you have 256GB RAM / node, is to run multiple tablet servers per node. Do you have 10 threads on your Batch Writers? What about the Batch Writer latency, is it too low such that you are not filling the buffer?
> > 
> >          
> > 
> >         From: Massimilian Mattetti [mailto:MASSIMIL@il.ibm.com mailto:MASSIMIL@il.ibm.com ]
> >         Sent: Wednesday, July 05, 2017 8:37 AM
> >         To: user@accumulo.apache.org mailto:user@accumulo.apache.org
> >         Subject: maximize usage of cluster resources during ingestion
> > 
> >          
> > 
> >         Hi all,
> > 
> >         I have an Accumulo 1.8.1 cluster made by 12 bare metal servers. Each server has 256GB of Ram and 2 x 10 cores CPU. 2 machines are used as masters (running HDFS NameNodes, Accumulo Master and Monitor). The other 10 machines has 12 Disks of 1 TB (11 used by HDFS DataNode process) and are running Accumulo TServer processes. All the machines are connected via a 10Gb network and 3 of them are running ZooKeeper. I have run some heavy ingestion test on this cluster but I have never been able to reach more than 20% CPU usage on each Tablet Server. I am running an ingestion process (using batch writers) on each data node. The table is pre-split in order to have 4 tablets per tablet server. Monitoring the network I have seen that data is received/sent from each node with a peak rate of about 120MB/s / 100MB/s while the aggregated disk write throughput on each tablet servers is around 120MB/s.
> > 
> >         The table configuration I am playing with are:
> >         "table.file.replication": "2",
> >         "table.compaction.minor.logs.threshold": "10",
> >         "table.durability": "flush",
> >         "table.file.max": "30",
> >         "table.compaction.major.ratio": "9",
> >         "table.split.threshold": "1G"
> > 
> >         while the tablet server configuration is:
> >         "tserver.wal.blocksize": "2G",
> >         "tserver.walog.max.size": "8G",
> >         "tserver.memory.maps.max": "32G",
> >         "tserver.compaction.minor.concurrent.max": "50",
> >         "tserver.compaction.major.concurrent.max": "8",
> >         "tserver.total.mutation.queue.max": "50M",
> >         "tserver.wal.replication": "2",
> >         "tserver.compaction.major.thread.files.open.max": "15"
> > 
> >         the tablet server heap has been set to 32GB
> > 
> >         From Monitor UI
> > 
> > 
> >         As you can see I have a lot of valleys in which the ingestion rate reaches 0.
> >         What would be a good procedure to identify the bottleneck which causes the 0 ingestion rate periods?
> >         Thanks.
> > 
> >         Best Regards,
> >         Max
> > 
> > 
> >     > 
 

Re: maximize usage of cluster resources during ingestion

Posted by Cyrille Savelief <cs...@gmail.com>.
Are you sure Accumulo is not waiting for your app's data? There might be GC
pauses in your ingest code (we have already experienced that).
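
One quick way to check is to start the ingest JVM with GC logging enabled
(standard HotSpot flags; the jar name and log path below are placeholders):

    java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
         -XX:+PrintGCApplicationStoppedTime -Xloggc:/tmp/ingest-gc.log \
         -jar my-ingest.jar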

On Thu, Jul 6, 2017 at 10:32 AM, Massimilian Mattetti <MA...@il.ibm.com>
wrote:

> Thank you all for the suggestions.
>
> About the native memory map I checked the logs on each tablet server and
> it was loaded correctly (of course the tserver.memory.maps.native.enabled
> was set to true), so the GC pauses should not be the problem eventually. I
> managed to get much better ingestion graph by reducing the native map size
> to *2GB* and increasing the Batch Writer threads number from the default
> (3 was really bad for my configuration) to *10* (I think it does not make
> sense having more threads than tablet servers, am I right?).
>
> The configuration that I used for the table is:
> "table.file.replication": "2",
> "table.compaction.minor.logs.threshold": "3",
> "table.durability": "flush",
> "table.split.threshold": "1G"
>
> while for the tablet servers is:
> "tserver.wal.blocksize": "1G",
>  "tserver.walog.max.size": "2G",
> "tserver.memory.maps.max": "2G",
> "tserver.compaction.minor.concurrent.max": "50",
> "tserver.compaction.major.concurrent.max": "20",
> "tserver.wal.replication": "2",
>  "tserver.compaction.major.thread.files.open.max": "15"
>
> The new graph:
>
>
> I still have the problem of a CPU usage that is less than* 20%.* So I am
> thinking to run multiple tablet servers per node (like 5 or 10) in order to
> maximize the CPU usage. Besides that I do not have any other idea on how to
> stress those servers with ingestion.
> Any suggestions are very welcome. Meanwhile, thank you all again for your
> help.
>
>
> Best Regards,
> Massimiliano
>
>
>
> From:        Jonathan Wonders <jw...@gmail.com>
> To:        user@accumulo.apache.org
> Date:        06/07/2017 04:01
> Subject:        Re: maximize usage of cluster resources during ingestion
> ------------------------------
>
>
>
> Hi Massimilian,
>
> Are you seeing held commits during the ingest pauses?  Just based on
> having looked at many similar graphs in the past, this might be one of the
> major culprits.  A tablet server has a memory region with a bounded size
> (tserver.memory.maps.max) where it buffers data that has not yet been
> written to RFiles (through the process of minor compaction).  The region is
> segmented by tablet and each tablet can have a buffer that is undergoing
> ingest as well as a buffer that is undergoing minor compaction.  A memory
> manager decides when to initiate minor compactions for the tablet buffers
> and the default implementation tries to keep the memory region 80-90% full
> while preferring to compact the largest tablet buffers.  Creating larger
> RFiles during minor compaction should lead to fewer major compactions.
> During a minor compaction, the tablet buffer still "consumes" memory within
> the in memory map and high ingest rates can lead to exhausting the remaining
> capacity.  The default memory manager uses an adaptive strategy to predict
> the expected memory usage and makes compaction decisions that should
> maintain some free memory.  Batch writers can be bursty and a bit
> unpredictable which could throw off these estimates.  Also, depending on
> the ingest profile, sometimes an in-memory tablet buffer will consume a
> large percentage of the total buffer.  This leads to long minor compactions
> when the buffer size is large which can allow ingest enough time to exhaust
> the buffer before that memory can be reclaimed.  When a tablet server has
> to block ingest, it can affect client ingest rates to other tablet servers
> due to the way that batch writers work.  This can lead to other tablet
> servers underestimating future ingest rates which can further exacerbate
> the problem.
>
> There are some configuration changes that could reduce the severity of
> held commits, although they might reduce peak ingest rates.  Reducing the
> in memory map size can reduce the maximum pause time due to held commits.
> Adding additional tablets should help avoid the problem of a single tablet
> buffer consuming a large percentage of the memory region.  It might be
> better to aim for ~20 tablets per server if your problem allows for it.  It
> is also possible to replace the memory manager with a custom one.  I've
> tried this in the past and have seen stability improvements by making the
> memory thresholds less aggressive (50-75% full).  This did reduce peak
> ingest rate in some cases, but that was a reasonable tradeoff.
>
> Based on your current configuration, if a tablet server is serving 4
> tablets and has a 32GB buffer, your first minor compactions will be at
> least 8GB and they will probably grow larger over time until the tablets
> naturally split.  Consider how long it would take to write this RFile
> compared to your peak ingest rate.  As others have suggested, make sure to
> use the native maps.  Based on your current JVM heap size, using the Java
> in-memory map would probably lead to OOME or very bad GC performance.
>
> Accumulo can trace minor compaction durations so you can get a feel for
> max pause times or measure the effect of configuration changes.
>
> Cheers,
> --Jonathan
>
> On Wed, Jul 5, 2017 at 7:16 PM, Dave Marion <*dlmarion@comcast.net*
> <dl...@comcast.net>> wrote:
>
>
> Based on what Cyrille said, I would look at garbage collection,
> specifically I would look at how much of your newly allocated objects spill
> into the old generation before they are flushed to disk. Additionally, I
> would turn off the debug log or log to SSD’s if you have them. Another
> thought, seeing that you have 256GB RAM / node, is to run multiple tablet
> servers per node. Do you have 10 threads on your Batch Writers? What about
> the Batch Writer latency, is it too low such that you are not filling the
> buffer?
>
>
>
> *From:* Massimilian Mattetti [mailto:*MASSIMIL@il.ibm.com*
> <MA...@il.ibm.com>]
> *Sent:* Wednesday, July 05, 2017 8:37 AM
> *To:* *user@accumulo.apache.org* <us...@accumulo.apache.org>
> *Subject:* maximize usage of cluster resources during ingestion
>
>
>
> Hi all,
>
> I have an Accumulo 1.8.1 cluster made up of 12 bare metal servers. Each
> server has 256GB of RAM and 2 x 10-core CPUs. 2 machines are used as
> masters (running HDFS NameNodes, Accumulo Master and Monitor). The other 10
> machines have 12 disks of 1 TB each (11 used by the HDFS DataNode process)
> and are running Accumulo TServer processes. All the machines are connected
> via a 10Gb network and 3 of them are running ZooKeeper. I have run some
> heavy ingestion tests on this cluster but I have never been able to reach
> more than *20%* CPU usage on each Tablet Server. I am running an ingestion
> process (using batch writers) on each data node. The table is pre-split in
> order to have 4 tablets per tablet server. Monitoring the network, I have
> seen that data is received/sent from each node with a peak rate of about
> 120MB/s / 100MB/s, while the aggregated disk write throughput on each
> tablet server is around 120MB/s.
>
> The table configuration I am playing with is:
> "table.file.replication": "2",
> "table.compaction.minor.logs.threshold": "10",
> "table.durability": "flush",
> "table.file.max": "30",
> "table.compaction.major.ratio": "9",
> "table.split.threshold": "1G"
>
> while the tablet server configuration is:
> "tserver.wal.blocksize": "2G",
> "tserver.walog.max.size": "8G",
> "tserver.memory.maps.max": "32G",
> "tserver.compaction.minor.concurrent.max": "50",
> "tserver.compaction.major.concurrent.max": "8",
> "tserver.total.mutation.queue.max": "50M",
> "tserver.wal.replication": "2",
> "tserver.compaction.major.thread.files.open.max": "15"
>
> the tablet server heap has been set to 32GB
>
> From Monitor UI
>
>
> As you can see I have a lot of valleys in which the ingestion rate reaches
> 0.
> What would be a good procedure to identify the bottleneck which causes the
> 0 ingestion rate periods?
> Thanks.
>
> Best Regards,
> Max
>
>
>

Re: maximize usage of cluster resources during ingestion

Posted by Massimilian Mattetti <MA...@il.ibm.com>.
Thank you all for the suggestions.

About the native memory map: I checked the logs on each tablet server and 
it was loaded correctly (of course tserver.memory.maps.native.enabled was 
set to true), so GC pauses should not be the problem after all. I managed 
to get a much better ingestion graph by reducing the native map size to 
2GB and increasing the number of Batch Writer threads from the default of 
3 (which was really bad for my configuration) to 10 (I think it does not 
make sense to have more threads than tablet servers, am I right?).
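
For anyone who wants to reproduce this, here is a minimal sketch of how 
the Batch Writer settings can be applied with the 1.8 client API. The 10 
write threads are what I mentioned above; the buffer size, latency, table 
name, and connector are just illustrative placeholders, not the exact 
values I used:

import java.nio.charset.StandardCharsets;
import java.util.concurrent.TimeUnit;
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.MutationsRejectedException;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;

public class IngestSketch {
    // "connector" and the table name are placeholders for a real Connector and table.
    static void writeSample(Connector connector, String table)
            throws TableNotFoundException, MutationsRejectedException {
        BatchWriterConfig config = new BatchWriterConfig()
                .setMaxWriteThreads(10)               // threads used to send mutations to tservers
                .setMaxMemory(64 * 1024 * 1024)       // 64MB client-side buffer before an automatic flush
                .setMaxLatency(30, TimeUnit.SECONDS); // flush even if the buffer is not yet full
        BatchWriter writer = connector.createBatchWriter(table, config);
        try {
            Mutation m = new Mutation("row-0001");
            m.put("cf", "cq", new Value("v".getBytes(StandardCharsets.UTF_8)));
            writer.addMutation(m);
        } finally {
            writer.close(); // flushes any remaining buffered mutations
        }
    }
}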

The configuration that I used for the table is:
"table.file.replication": "2",
"table.compaction.minor.logs.threshold": "3",
"table.durability": "flush",
"table.split.threshold": "1G"

while for the tablet servers it is (a sketch of applying such settings 
through the client API follows the list):
"tserver.wal.blocksize": "1G",
"tserver.walog.max.size": "2G",
"tserver.memory.maps.max": "2G",
"tserver.compaction.minor.concurrent.max": "50",
"tserver.compaction.major.concurrent.max": "20",
"tserver.wal.replication": "2",
"tserver.compaction.major.thread.files.open.max": "15"

The new graph:


I still have the problem of CPU usage staying below 20%. So I am thinking 
of running multiple tablet servers per node (like 5 or 10) in order to 
maximize CPU usage. Besides that, I do not have any other ideas on how to 
stress those servers with ingestion. 
Any suggestions are very welcome. Meanwhile, thank you all again for your 
help.


Best Regards,
Massimiliano



From:   Jonathan Wonders <jw...@gmail.com>
To:     user@accumulo.apache.org
Date:   06/07/2017 04:01
Subject:        Re: maximize usage of cluster resources during ingestion



Hi Massimilian,

Are you seeing held commits during the ingest pauses?  Just based on 
having looked at many similar graphs in the past, this might be one of the 
major culprits.  A tablet server has a memory region with a bounded size 
(tserver.memory.maps.max) where it buffers data that has not yet been 
written to RFiles (through the process of minor compaction).  The region 
is segmented by tablet and each tablet can have a buffer that is 
undergoing ingest as well as a buffer that is undergoing minor 
compaction.  A memory manager decides when to initiate minor compactions 
for the tablet buffers and the default implementation tries to keep the 
memory region 80-90% full while preferring to compact the largest tablet 
buffers.  Creating larger RFiles during minor compaction should lead to 
fewer major compactions.  During a minor compaction, the tablet buffer 
still "consumes" memory within the in memory map and high ingest rates can 
lead to exhausting the remaining capacity.  The default memory manager uses 
an adaptive strategy to predict the expected memory usage and makes 
compaction decisions that should maintain some free memory.  Batch writers 
can be bursty and a bit unpredictable which could throw off these 
estimates.  Also, depending on the ingest profile, sometimes an in-memory 
tablet buffer will consume a large percentage of the total buffer.  This 
leads to long minor compactions when the buffer size is large which can 
allow ingest enough time to exhaust the buffer before that memory can be 
reclaimed.  When a tablet server has to block ingest, it can affect client 
ingest rates to other tablet servers due to the way that batch writers 
work.  This can lead to other tablet servers underestimating future ingest 
rates which can further exacerbate the problem.

There are some configuration changes that could reduce the severity of 
held commits, although they might reduce peak ingest rates.  Reducing the 
in memory map size can reduce the maximum pause time due to held commits.  
Adding additional tablets should help avoid the problem of a single tablet 
buffer consuming a large percentage of the memory region.  It might be 
better to aim for ~20 tablets per server if your problem allows for it.  
It is also possible to replace the memory manager with a custom one.  I've 
tried this in the past and have seen stability improvements by making the 
memory thresholds less aggressive (50-75% full).  This did reduce peak 
ingest rate in some cases, but that was a reasonable tradeoff.

Based on your current configuration, if a tablet server is serving 4 
tablets and has a 32GB buffer, your first minor compactions will be at 
least 8GB and they will probably grow larger over time until the tablets 
naturally split.  Consider how long it would take to write this RFile 
compared to your peak ingest rate.  As others have suggested, make sure to 
use the native maps.  Based on your current JVM heap size, using the Java 
in-memory map would probably lead to OOME or very bad GC performance.

Accumulo can trace minor compaction durations so you can get a feel for 
max pause times or measure the effect of configuration changes.

Cheers,
--Jonathan

On Wed, Jul 5, 2017 at 7:16 PM, Dave Marion <dl...@comcast.net> wrote:
 
Based on what Cyrille said, I would look at garbage collection; 
specifically, I would look at how many of your newly allocated objects 
spill into the old generation before they are flushed to disk. 
Additionally, I would turn off the debug log or log to SSD's if you have 
them. Another thought, seeing that you have 256GB RAM / node, is to run 
multiple tablet servers per node. Do you have 10 threads on your Batch 
Writers? What about the Batch Writer latency, is it too low such that you 
are not filling the buffer? 
 
From: Massimilian Mattetti [mailto:MASSIMIL@il.ibm.com] 
Sent: Wednesday, July 05, 2017 8:37 AM
To: user@accumulo.apache.org
Subject: maximize usage of cluster resources during ingestion
 
Hi all,

I have an Accumulo 1.8.1 cluster made up of 12 bare metal servers. Each 
server has 256GB of RAM and 2 x 10-core CPUs. 2 machines are used as 
masters (running HDFS NameNodes, Accumulo Master and Monitor). The other 
10 machines have 12 disks of 1 TB each (11 used by the HDFS DataNode 
process) and are running Accumulo TServer processes. All the machines are 
connected via a 10Gb network and 3 of them are running ZooKeeper. I have 
run some heavy ingestion tests on this cluster but I have never been able 
to reach more than 20% CPU usage on each Tablet Server. I am running an 
ingestion process (using batch writers) on each data node. The table is 
pre-split in order to have 4 tablets per tablet server. Monitoring the 
network, I have seen that data is received/sent from each node with a peak 
rate of about 120MB/s / 100MB/s, while the aggregated disk write 
throughput on each tablet server is around 120MB/s. 

The table configuration I am playing with is:
"table.file.replication": "2",
"table.compaction.minor.logs.threshold": "10",
"table.durability": "flush",
"table.file.max": "30",
"table.compaction.major.ratio": "9",
"table.split.threshold": "1G"

while the tablet server configuration is:
"tserver.wal.blocksize": "2G",
"tserver.walog.max.size": "8G",
"tserver.memory.maps.max": "32G",
"tserver.compaction.minor.concurrent.max": "50",
"tserver.compaction.major.concurrent.max": "8",
"tserver.total.mutation.queue.max": "50M",
"tserver.wal.replication": "2",
"tserver.compaction.major.thread.files.open.max": "15"

the tablet server heap has been set to 32GB

From Monitor UI


As you can see I have a lot of valleys in which the ingestion rate reaches 
0. 
What would be a good procedure to identify the bottleneck which causes the 
0 ingestion rate periods?
Thanks.

Best Regards,
Max





Re: maximize usage of cluster resources during ingestion

Posted by Jonathan Wonders <jw...@gmail.com>.
Hi Massimilian,

Are you seeing held commits during the ingest pauses?  Just based on having
looked at many similar graphs in the past, this might be one of the major
culprits.  A tablet server has a memory region with a bounded size (
tserver.memory.maps.max) where it buffers data that has not yet been
written to RFiles (through the process of minor compaction).  The region is
segmented by tablet and each tablet can have a buffer that is undergoing
ingest as well as a buffer that is undergoing minor compaction.  A memory
manager decides when to initiate minor compactions for the tablet buffers
and the default implementation tries to keep the memory region 80-90% full
while preferring to compact the largest tablet buffers.  Creating larger
RFiles during minor compaction should lead to fewer major compactions.
During a minor compaction, the tablet buffer still "consumes" memory within
the in memory map and high ingest rates can lead to exhausting the remaining
capacity.  The default memory manager uses an adaptive strategy to predict
the expected memory usage and makes compaction decisions that should
maintain some free memory.  Batch writers can be bursty and a bit
unpredictable which could throw off these estimates.  Also, depending on
the ingest profile, sometimes an in-memory tablet buffer will consume a
large percentage of the total buffer.  This leads to long minor compactions
when the buffer size is large which can allow ingest enough time to exhaust
the buffer before that memory can be reclaimed.  When a tablet server has
to block ingest, it can affect client ingest rates to other tablet servers
due to the way that batch writers work.  This can lead to other tablet
servers underestimating future ingest rates which can further exacerbate
the problem.

There are some configuration changes that could reduce the severity of held
commits, although they might reduce peak ingest rates.  Reducing the in
memory map size can reduce the maximum pause time due to held commits.
Adding additional tablets should help avoid the problem of a single tablet
buffer consuming a large percentage of the memory region.  It might be
better to aim for ~20 tablets per server if your problem allows for it.  It
is also possible to replace the memory manager with a custom one.  I've
tried this in the past and have seen stability improvements by making the
memory thresholds less aggressive (50-75% full).  This did reduce peak
ingest rate in some cases, but that was a reasonable tradeoff.
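
If it helps, a rough sketch of adding splits with the client API; the
connector, table name, number of split points, and row key format are
placeholders that would need to match your actual keys:

import java.util.SortedSet;
import java.util.TreeSet;
import org.apache.accumulo.core.client.AccumuloException;
import org.apache.accumulo.core.client.AccumuloSecurityException;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.hadoop.io.Text;

public class SplitSketch {
    // "connector" and the table name are placeholders; split points must match your key layout.
    static void addMoreSplits(Connector connector, String table)
            throws AccumuloException, AccumuloSecurityException, TableNotFoundException {
        // 10 tablet servers * ~20 tablets each => ~200 tablets, i.e. ~199 split points.
        SortedSet<Text> splits = new TreeSet<>();
        for (int i = 1; i < 200; i++) {
            splits.add(new Text(String.format("row_%03d", i))); // placeholder split points
        }
        connector.tableOperations().addSplits(table, splits);
    }
}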

Based on your current configuration, if a tablet server is serving 4
tablets and has a 32GB buffer, your first minor compactions will be at
least 8GB and they will probably grow larger over time until the tablets
naturally split.  Consider how long it would take to write this RFile
compared to your peak ingest rate.  As others have suggested, make sure to
use the native maps.  Based on your current JVM heap size, using the Java
in-memory map would probably lead to OOME or very bad GC performance.
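
As a back-of-the-envelope illustration, using the ~120MB/s disk write
throughput from your mail and ignoring compression and other overheads:

public class MinorCompactionEstimate {
    public static void main(String[] args) {
        double bufferGb = 32.0;      // tserver.memory.maps.max
        int tabletsPerServer = 4;    // from the pre-split
        double diskMbPerSec = 120.0; // observed aggregate write throughput per node

        double flushGb = bufferGb / tabletsPerServer;         // ~8GB per minor compaction
        double flushSeconds = flushGb * 1024 / diskMbPerSec;  // time to write that RFile
        System.out.printf("an %.0fGB flush takes roughly %.0f seconds at %.0fMB/s%n",
                flushGb, flushSeconds, diskMbPerSec);
    }
}

That works out to roughly a minute per flush, which is plenty of time for
the rest of the map to fill at your ingest rates.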

Accumulo can trace minor compaction durations so you can get a feel for max
pause times or measure the effect of configuration changes.

Cheers,
--Jonathan

On Wed, Jul 5, 2017 at 7:16 PM, Dave Marion <dl...@comcast.net> wrote:

>
>
> Based on what Cyrille said, I would look at garbage collection;
> specifically, I would look at how many of your newly allocated objects spill
> into the old generation before they are flushed to disk. Additionally, I
> would turn off the debug log or log to SSD’s if you have them. Another
> thought, seeing that you have 256GB RAM / node, is to run multiple tablet
> servers per node. Do you have 10 threads on your Batch Writers? What about
> the Batch Writer latency, is it too low such that you are not filling the
> buffer?
>
>
>
> *From:* Massimilian Mattetti [mailto:MASSIMIL@il.ibm.com]
> *Sent:* Wednesday, July 05, 2017 8:37 AM
> *To:* user@accumulo.apache.org
> *Subject:* maximize usage of cluster resources during ingestion
>
>
>
> Hi all,
>
> I have an Accumulo 1.8.1 cluster made up of 12 bare metal servers. Each
> server has 256GB of RAM and 2 x 10-core CPUs. 2 machines are used as
> masters (running HDFS NameNodes, Accumulo Master and Monitor). The other 10
> machines have 12 disks of 1 TB each (11 used by the HDFS DataNode process)
> and are running Accumulo TServer processes. All the machines are connected
> via a 10Gb network and 3 of them are running ZooKeeper. I have run some
> heavy ingestion tests on this cluster but I have never been able to reach
> more than *20%* CPU usage on each Tablet Server. I am running an ingestion
> process (using batch writers) on each data node. The table is pre-split in
> order to have 4 tablets per tablet server. Monitoring the network, I have
> seen that data is received/sent from each node with a peak rate of about
> 120MB/s / 100MB/s, while the aggregated disk write throughput on each
> tablet server is around 120MB/s.
>
> The table configuration I am playing with is:
> "table.file.replication": "2",
> "table.compaction.minor.logs.threshold": "10",
> "table.durability": "flush",
> "table.file.max": "30",
> "table.compaction.major.ratio": "9",
> "table.split.threshold": "1G"
>
> while the tablet server configuration is:
> "tserver.wal.blocksize": "2G",
> "tserver.walog.max.size": "8G",
> "tserver.memory.maps.max": "32G",
> "tserver.compaction.minor.concurrent.max": "50",
> "tserver.compaction.major.concurrent.max": "8",
> "tserver.total.mutation.queue.max": "50M",
> "tserver.wal.replication": "2",
> "tserver.compaction.major.thread.files.open.max": "15"
>
> the tablet server heap has been set to 32GB
>
> From Monitor UI
>
>
> As you can see I have a lot of valleys in which the ingestion rate reaches
> 0.
> What would be a good procedure to identify the bottleneck which causes the
> 0 ingestion rate periods?
> Thanks.
>
> Best Regards,
> Max
>

RE: maximize usage of cluster resources during ingestion

Posted by Dave Marion <dl...@comcast.net>.
 

Based on what Cyrille said, I would look at garbage collection; specifically,
I would look at how many of your newly allocated objects spill into the old
generation before they are flushed to disk. Additionally, I would turn off
the debug log or log to SSD's if you have them. Another thought, seeing that
you have 256GB RAM / node, is to run multiple tablet servers per node. Do
you have 10 threads on your Batch Writers? What about the Batch Writer
latency, is it too low such that you are not filling the buffer? 

 

From: Massimilian Mattetti [mailto:MASSIMIL@il.ibm.com] 
Sent: Wednesday, July 05, 2017 8:37 AM
To: user@accumulo.apache.org
Subject: maximize usage of cluster resources during ingestion

 

Hi all,

I have an Accumulo 1.8.1 cluster made up of 12 bare metal servers. Each server
has 256GB of RAM and 2 x 10-core CPUs. 2 machines are used as masters
(running HDFS NameNodes, Accumulo Master and Monitor). The other 10 machines
have 12 disks of 1 TB each (11 used by the HDFS DataNode process) and are
running Accumulo TServer processes. All the machines are connected via a 10Gb
network and 3 of them are running ZooKeeper. I have run some heavy ingestion
tests on this cluster but I have never been able to reach more than 20% CPU
usage on each Tablet Server. I am running an ingestion process (using batch
writers) on each data node. The table is pre-split in order to have 4
tablets per tablet server. Monitoring the network, I have seen that data is
received/sent from each node with a peak rate of about 120MB/s / 100MB/s,
while the aggregated disk write throughput on each tablet server is around
120MB/s. 

The table configuration I am playing with is:
"table.file.replication": "2",
"table.compaction.minor.logs.threshold": "10",
"table.durability": "flush",
"table.file.max": "30",
"table.compaction.major.ratio": "9",
"table.split.threshold": "1G"

while the tablet server configuration is:
"tserver.wal.blocksize": "2G",
"tserver.walog.max.size": "8G",
"tserver.memory.maps.max": "32G",
"tserver.compaction.minor.concurrent.max": "50",
"tserver.compaction.major.concurrent.max": "8",
"tserver.total.mutation.queue.max": "50M",
"tserver.wal.replication": "2",
"tserver.compaction.major.thread.files.open.max": "15"

the tablet server heap has been set to 32GB

From Monitor UI


As you can see I have a lot of valleys in which the ingestion rate reaches
0. 
What would be a good procedure to identify the bottleneck which causes the 0
ingestion rate periods?
Thanks.

Best Regards,
Max