Posted to issues@hbase.apache.org by "Danil Lipovoy (Jira)" <ji...@apache.org> on 2021/02/28 19:46:00 UTC
[jira] [Created] (HBASE-25619) 50% reading performance degradation 2.4.1 over 1.6.0
Danil Lipovoy created HBASE-25619:
-------------------------------------
Summary: 50% reading performance degradation 2.4.1 over 1.6.0
Key: HBASE-25619
URL: https://issues.apache.org/jira/browse/HBASE-25619
Project: HBase
Issue Type: Bug
Reporter: Danil Lipovoy
Attachments: logs.zip, scripts.zip
I have found a read performance regression. YCSB tests show:
| |*Operations per second (batch 1000)*| |
| |*1.4.13*|*1.6.0*|*2.2.6*|*2.4.1*|*comments*|
|INSERTS|68|68|75|76|< this is fine|
|GETS|92|100|72|48|< 50% less than 1.6.0|
|FLUSHED GETS|126|141|120|108| |
|GET+INSERT|69|71|68|66| |
GETS means gets issued right after the inserts.
FLUSHED GETS means gets issued after a flush and major compaction.
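To make the size of the regression explicit, here is a small sketch that computes the percent change of 2.4.1 relative to 1.6.0 for each test. The numbers are copied from the table above; the names `RESULTS` and `percent_change` are only illustrative and are not part of the attached scripts.

```python
# Throughput per version, copied from the table above
# (operations per second, batch 1000).
VERSIONS = ["1.4.13", "1.6.0", "2.2.6", "2.4.1"]
RESULTS = {
    "INSERTS": [68, 68, 75, 76],
    "GETS": [92, 100, 72, 48],
    "FLUSHED GETS": [126, 141, 120, 108],
    "GET+INSERT": [69, 71, 68, 66],
}


def percent_change(results, versions, baseline, target):
    """Return {test: percent change of target relative to baseline}."""
    b = versions.index(baseline)
    t = versions.index(target)
    return {
        test: 100.0 * (ops[t] - ops[b]) / ops[b]
        for test, ops in results.items()
    }


changes = percent_change(RESULTS, VERSIONS, baseline="1.6.0", target="2.4.1")
for test, pct in changes.items():
    print(f"{test:<13} {pct:+.1f}%")
```

With these inputs GETS comes out at -52%, which is the "50% less" noted in the table, while INSERTS are slightly higher on 2.4.1.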
All numbers are averages over 3 runs. For example, GETS for 2.4.1 => (45 + 49 + 50) / 3 = 48, obtained from:
--- run 01 hdl300_LRU_thr30_reg100 ---
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 fget ops= 108
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 get ops= 45
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 ins ops= 76
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 upd ops= 66
--- run 02 hdl300_LRU_thr30_reg100 ---
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 fget ops= 109
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 get ops= 49
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 ins ops= 77
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 upd ops= 66
--- run 03 hdl300_LRU_thr30_reg100 ---
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 fget ops= 108
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 get ops= 50
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 ins ops= 76
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 upd ops= 65
There were always 4 runs, not 3. The first run is a warm-up and is excluded from the aggregation (it is usually faster than all later runs).
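The aggregation can be sketched roughly as follows. This is not the script from scripts.zip, only an illustration that assumes the result lines have the shape shown above. The `skip_runs` parameter is hypothetical: it would drop warm-up runs if they were still present in the input, whereas in the sample below the warm-up run is already left out.

```python
import re
from collections import defaultdict

# The three 2.4.1 runs quoted above.
SAMPLE = """\
--- run 01 hdl300_LRU_thr30_reg100 ---
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 fget ops= 108
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 get ops= 45
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 ins ops= 76
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 upd ops= 66
--- run 02 hdl300_LRU_thr30_reg100 ---
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 fget ops= 109
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 get ops= 49
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 ins ops= 77
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 upd ops= 66
--- run 03 hdl300_LRU_thr30_reg100 ---
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 fget ops= 108
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 get ops= 50
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 ins ops= 76
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 upd ops= 65
"""

RUN_HEADER = re.compile(r"^-+ run (\d+) ")
OPS_LINE = re.compile(r"(\w+) ops=\s*(\d+)\s*$")


def average_ops(text, skip_runs=0):
    """Average ops/s per operation over all runs after the first skip_runs."""
    runs = []
    for line in text.splitlines():
        if RUN_HEADER.match(line):
            runs.append({})  # start collecting a new run
            continue
        m = OPS_LINE.search(line)
        if m and runs:
            runs[-1][m.group(1)] = int(m.group(2))
    values = defaultdict(list)
    for run in runs[skip_runs:]:
        for name, ops in run.items():
            values[name].append(ops)
    return {name: sum(v) / len(v) for name, v in values.items()}


averages = average_ops(SAMPLE)
for name, avg in averages.items():
    print(f"{name:<4} {round(avg)}")
```

Rounded, this gives fget 108, get 48, ins 76 and upd 66, which matches the 2.4.1 column of the table.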
All tests were run with AdaptiveLRU (https://issues.apache.org/jira/browse/HBASE-23887).
The reasons are:
# The RS with the old LRU often just goes down under pressure.
# It is faster than the current version (much faster on a powerful server).
For example, on my PC (AMD Ryzen 7 2700X Eight-Core Processor), this is the current LRU (1.4.13):
--- run 01 hdl300_oldLRU_thr30_reg100 ---
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 fget ops= 116
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 get ops= 76
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 ins ops= 67
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 upd ops= 65
--- run 02 hdl300_oldLRU_thr30_reg100 ---
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 fget ops= 115
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 get ops= 81
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 ins ops= 66
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 upd ops= 67
--- run 03 hdl300_oldLRU_thr30_reg100 ---
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 fget ops= 116
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 get ops= 82
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 ins ops= 66
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 upd ops= 66
This is the new LRU (1.4.13):
--- run 01 hdl300_newLRU_thr30_reg100 ---
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 fget ops= 128
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 get ops= 93
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 ins ops= 67
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 upd ops= 70
--- run 02 hdl300_newLRU_thr30_reg100 ---
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 fget ops= 126
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 get ops= 93
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 ins ops= 68
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 upd ops= 69
--- run 03 hdl300_newLRU_thr30_reg100 ---
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 fget ops= 125
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 get ops= 91
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 ins ops= 68
thr30 cnt100000 tim300 num0 max1 bch1000 reg100 upd ops= 67
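For a quick side-by-side of the two 1.4.13 logs above, the per-operation ratio of the new LRU to the old LRU can be computed as in the sketch below. The values are copied from the three runs of each log; the names `OLD_LRU`, `NEW_LRU` and `speedup` are only illustrative.

```python
from statistics import mean

# ops/s for each of the three runs, copied from the two 1.4.13 logs above.
OLD_LRU = {
    "fget": [116, 115, 116],
    "get": [76, 81, 82],
    "ins": [67, 66, 66],
    "upd": [65, 67, 66],
}
NEW_LRU = {
    "fget": [128, 126, 125],
    "get": [93, 93, 91],
    "ins": [67, 68, 68],
    "upd": [70, 69, 67],
}


def speedup(old, new):
    """Return {operation: mean(new) / mean(old)}; above 1.0 means new is faster."""
    return {op: mean(new[op]) / mean(old[op]) for op in old}


ratios = speedup(OLD_LRU, NEW_LRU)
for op, ratio in ratios.items():
    print(f"{op:<4} x{ratio:.2f}")
```

On these numbers the gain is largest for gets (about x1.16) and flushed gets (about x1.09), while inserts stay almost the same.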
All tests were run with the same parameters:
<configuration>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.tmp.dir</name>
<value>./tmp/hb</value>
</property>
<property>
<name>hbase.rootdir</name>
<value>/tmp/hbase</value>
</property>
<property>
<name>hbase.unsafe.stream.capability.enforce</name>
<value>false</value>
</property>
<property>
<name>zookeeper.session.timeout</name>
<value>120000</value>
</property>
<property>
<name>hbase.rpc.timeout</name>
<value>120000</value>
</property>
<property>
<name>hbase.regionserver.handler.count</name>
<value>300</value>
</property>
<property>
<name>hbase.regionserver.metahandler.count</name>
<value>30</value>
</property>
<property>
<name>hbase.regionserver.maxlogs</name>
<value>200</value>
</property>
<property>
<name>hbase.hregion.memstore.flush.size</name>
<value>1342177280</value>
</property>
<property>
<name>hbase.hregion.memstore.block.multiplier</name>
<value>6</value>
</property>
<property>
<name>hbase.hstore.compactionThreshold</name>
<value>2</value>
</property>
<property>
<name>hbase.hstore.blockingStoreFiles</name>
<value>200</value>
</property>
<property>
<name>hbase.regionserver.optionalcacheflushinterval</name>
<value>18000000</value>
</property>
<property>
<name>hbase.regionserver.thread.compaction.large</name>
<value>12</value>
</property>
<property>
<name>hbase.regionserver.wal.enablecompression</name>
<value>true</value>
</property>
<property>
<name>hbase.server.compactchecker.interval.multiplier</name>
<value>200</value>
</property>
<property>
<name>hbase.rest.threads.min</name>
<value>8</value>
</property>
<property>
<name>hbase.rest.threads.max</name>
<value>150</value>
</property>
<property>
<name>hbase.thrift.minWorkerThreads</name>
<value>200</value>
</property>
<property>
<name>hbase.regionserver.thread.compaction.small</name>
<value>6</value>
</property>
<property>
<name>hbase.ipc.server.read.threadpool.size</name>
<value>60</value>
</property>
<property>
<name>hbase.lru.cache.heavy.eviction.count.limit</name>
<value>0</value>
</property>
<property>
<name>hbase.lru.cache.heavy.eviction.mb.size.limit</name>
<value>200</value>
</property>
<property>
<name>hbase.lru.cache.heavy.eviction.overhead.coefficient</name>
<value>0.01</value>
</property>
<property>
<name>hbase.wal.provider</name>
<value>multiwal</value>
</property>
</configuration>
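When repeating the tests it may help to check that a given install really uses these values. Below is a minimal sketch that reads name/value pairs from such a file and picks out the `hbase.lru.cache.heavy.eviction.*` properties, which appear to be the AdaptiveLRU settings from HBASE-23887. The inline excerpt stands in for the real file (usually conf/hbase-site.xml, which is an assumption about the layout), and `load_properties` is an illustrative helper, not an HBase API.

```python
import xml.etree.ElementTree as ET

# Excerpt of the configuration above; a real check would read the file instead.
HBASE_SITE = """\
<configuration>
<property>
<name>hbase.hregion.memstore.flush.size</name>
<value>1342177280</value>
</property>
<property>
<name>hbase.lru.cache.heavy.eviction.count.limit</name>
<value>0</value>
</property>
<property>
<name>hbase.lru.cache.heavy.eviction.mb.size.limit</name>
<value>200</value>
</property>
<property>
<name>hbase.lru.cache.heavy.eviction.overhead.coefficient</name>
<value>0.01</value>
</property>
</configuration>
"""


def load_properties(xml_text):
    """Return {name: value} for every <property> element in the document."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}


props = load_properties(HBASE_SITE)

# Properties sharing the heavy-eviction prefix.
PREFIX = "hbase.lru.cache.heavy.eviction."
adaptive = {k: v for k, v in props.items() if k.startswith(PREFIX)}
for name, value in adaptive.items():
    print(f"{name} = {value}")

# The memstore flush size is given in bytes; convert it to GiB for readability.
flush_gib = int(props["hbase.hregion.memstore.flush.size"]) / 2**30
print(f"memstore flush size = {flush_gib} GiB")
```

The flush size of 1342177280 bytes works out to 1.25 GiB per region.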
HBASE_HEAPSIZE=22G was exported in every setup.
ZK runs separately (downloaded from the Apache site) because the RS just goes down when the built-in ZK is used.
Full logs are in the attachment.
Anyone can repeat the tests. I used a modified YCSB (batching added):
https://github.com/pustota2009/YCSB.git
It only takes these steps:
1. Download and set up ZK [https://www.apache.org/dyn/closer.lua/zookeeper/zookeeper-3.6.2/apache-zookeeper-3.6.2-bin.tar.gz]
2. Download and set up HBase ([https://hbase.apache.org/downloads.html])
3. Tune HBase (with the params above)
4. Download [^scripts.zip] (it contains YCSB and the scripts) into the HBase dir, at the same level as bin, conf, log, etc.
5. Execute run-4-tests-30t-LRU.sh.
It will run for about 1.5 hours and collect the results into hdl300_LRU_thr30_reg100.res and results_agg.txt
Maybe somebody would be interested in investigating the cause of this degradation and fixing it.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)