Posted to user@hbase.apache.org by hongbin ma <ma...@apache.org> on 2016/05/19 10:00:26 UTC

Re: Rows per second for RegionScanner

hi Vladimir

Thanks for your reply. I'll try.

a quick question: suppose there's only one region on a 4-core server. When
the region is being scanned, will all the cores be utilized to speed up
scanning? Could you kindly point me to some evidence in the code or an
official document?

Last time I observed a 4-core server scanning a single region, the CPU
usage was only 25%. That's why I'm asking.

thanks!


On Fri, Apr 22, 2016 at 12:10 PM, Vladimir Rodionov <vl...@gmail.com>
wrote:

> Try disabling block encoding - you will get better numbers.
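>
> For example, a minimal sketch of disabling the encoding from the HBase
> shell (table and family names taken from the HFile dump quoted below;
> existing HFiles only pick up the change after a major compaction):
>
>     alter 'KYLIN_YMSGYYXO12', {NAME => 'F1', DATA_BLOCK_ENCODING => 'NONE'}
>     major_compact 'KYLIN_YMSGYYXO12'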
>
> >>  I mean per region scan speed,
>
> Scan performance depends on the number of CPU cores: the more cores you
> have, the more performance you will get. Your servers are pretty low end
> (4 virtual CPU cores is just 2 hardware cores). With 32 cores per node
> you would get close to an 8x speedup.
>
> -Vlad
>
>
> On Thu, Apr 21, 2016 at 7:22 PM, hongbin ma <ma...@apache.org> wrote:
>
> > hi Thakrar
> >
> > Thanks for your reply.
> >
> > My settings for the RegionScanner's Scan are:
> >
> >     scan.setCaching(1024);
> >     scan.setMaxResultSize(5 * 1024 * 1024); // 5M
> >
> > Even if I change the caching to 100,000 I'm still not getting any
> > improvement. I guess caching works for remote scans through RPC, but
> > doesn't help much for a region-side scan?
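> >
> > For context, here is a minimal sketch of the region-side loop in an
> > endpoint coprocessor (assuming the HBase 1.x coprocessor API; names are
> > illustrative). There is no RPC between calls, so the client-side
> > caching setting has nothing to batch:
> >
> >     // env is the RegionCoprocessorEnvironment handed to the endpoint
> >     RegionScanner scanner = env.getRegion().getScanner(scan);
> >     List<Cell> cells = new ArrayList<>();
> >     boolean hasMore;
> >     try {
> >       do {
> >         cells.clear();
> >         hasMore = scanner.nextRaw(cells); // next row's cells, no RPC involved
> >         // ... per-row work goes here ...
> >       } while (hasMore);
> >     } finally {
> >       scanner.close();
> >     }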
> >
> > I also tried PREFETCH_BLOCKS_ON_OPEN for the whole table, but no
> > improvement was observed.
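> >
> > For reference, that flag is typically set per family from the shell
> > (a sketch, reusing the family name from the HFile dump below; the
> > prefetch only runs when a region's HFiles are opened):
> >
> >     alter 'KYLIN_YMSGYYXO12', {NAME => 'F1', PREFETCH_BLOCKS_ON_OPEN => 'true'}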
> >
> > I'm pursuing pure scan-read performance optimization because our
> > application is essentially read-only. I observed that even when I did
> > nothing else (only scanning) in my coprocessor, the scan speed was not
> > satisfying. The CPU seems to be fully utilized. Maybe decoding
> > FAST_DIFF rows is too CPU-heavy? What rows-per-second scan speed would
> > you expect in a normal setup? I mean the per-region scan speed, not the
> > overall scan speed across all regions.
> >
> > thanks
> >
> > On Thu, Apr 21, 2016 at 10:24 PM, Thakrar, Jayesh <
> > jthakrar@conversantmedia.com> wrote:
> >
> > > Just curious - have you set the scanner caching to some high value -
> > > say 1000 (or even higher, given your small values)?
> > >
> > > The parameter is hbase.client.scanner.caching
> > >
> > > You can read up on it - https://hbase.apache.org/book.html
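> > >
> > > For example, it can be set client-wide in hbase-site.xml:
> > >
> > >     <property>
> > >       <name>hbase.client.scanner.caching</name>
> > >       <value>1000</value>
> > >     </property>
> > >
> > > or per scan in code via scan.setCaching(1000);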
> > >
> > > Another thing, are you just looking for pure scan-read performance
> > > optimization?
> > > Depending upon the table size you can also look into caching the
> > > table or not caching at all.
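> > >
> > > For instance (a sketch, reusing the family name from the HFile dump
> > > below), a family can be given in-memory priority in the block cache,
> > > or excluded from the cache entirely:
> > >
> > >     alter 'KYLIN_YMSGYYXO12', {NAME => 'F1', IN_MEMORY => 'true'}
> > >     alter 'KYLIN_YMSGYYXO12', {NAME => 'F1', BLOCKCACHE => 'false'}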
> > >
> > > -----Original Message-----
> > > From: hongbin ma [mailto:mahongbin@apache.org]
> > > Sent: Thursday, April 21, 2016 5:04 AM
> > > To: user@hbase.apache.org
> > > Subject: Rows per second for RegionScanner
> > >
> > > Hi, experts,
> > >
> > > I'm trying to figure out how fast HBase can scan. I'm setting up the
> > > RegionScanner in an endpoint coprocessor so that no network overhead
> > > is included. My average key length is 35 and average value length is 5.
> > >
> > > My test result is that if I warm all the blocks I'm interested in
> > > into the block cache, I can only scan around 300,000 rows per second
> > > per region (with an endpoint I guess it's one thread per region), so
> > > that's about 15M of data per second. I'm not sure whether this is
> > > already an acceptable number for HBase. Your answers might help me
> > > decide whether it's worth digging further into tuning it.
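> > >
> > > (Back-of-envelope check: 300,000 rows/s at the ~48-byte mean row size
> > > reported in the HFile stats below works out to roughly 14.4 MB/s,
> > > consistent with the ~15M/s figure above.)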
> > >
> > > thanks!
> > >
> > > other info:
> > >
> > > My HBase cluster is on 8 AWS m1.xlarge instances, each with 4 CPU
> > > cores and 16G RAM. Each region server is configured with a 10G heap.
> > > The test HTable has 23 regions, one HFile per region (just major
> > > compacted). There was no other resource contention when I ran the
> > > tests.
> > >
> > > Attached is the HFile tool output for one of the region's HFiles:
> > > =============================================
> > >  hbase  org.apache.hadoop.hbase.io.hfile.HFile -m -s -v -f
> > > /apps/hbase/data/data/default/KYLIN_YMSGYYXO12/d42b9faf43eafcc9640aa256143d5be3/F1/30b8a8ff5a82458481846e364974bf06
> > > 2016-04-21 09:16:04,091 INFO  [main] Configuration.deprecation:
> > > hadoop.native.lib is deprecated. Instead, use io.native.lib.available
> > > 2016-04-21 09:16:04,292 INFO  [main] util.ChecksumType: Checksum using
> > > org.apache.hadoop.util.PureJavaCrc32
> > > 2016-04-21 09:16:04,294 INFO  [main] util.ChecksumType: Checksum can use
> > > org.apache.hadoop.util.PureJavaCrc32C
> > > SLF4J: Class path contains multiple SLF4J bindings.
> > > SLF4J: Found binding in
> > > [jar:file:/usr/hdp/2.2.9.0-3393/hadoop/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> > > SLF4J: Found binding in
> > > [jar:file:/usr/hdp/2.2.9.0-3393/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> > > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> > > explanation.
> > > 2016-04-21 09:16:05,654 INFO  [main] Configuration.deprecation:
> > > fs.default.name is deprecated. Instead, use fs.defaultFS
> > > Scanning ->
> > > /apps/hbase/data/data/default/KYLIN_YMSGYYXO12/d42b9faf43eafcc9640aa256143d5be3/F1/30b8a8ff5a82458481846e364974bf06
> > > Block index size as per heapsize: 3640
> > > reader=/apps/hbase/data/data/default/KYLIN_YMSGYYXO12/d42b9faf43eafcc9640aa256143d5be3/F1/30b8a8ff5a82458481846e364974bf06,
> > >     compression=none,
> > >     cacheConf=CacheConfig:disabled,
> > >     firstKey=\x00\x0B\x00\x00\x00\x00\x00\x00\x00\x09\x00\x00\x00\x00\x00\x01\xF4/F1:M/0/Put,
> > >     lastKey=\x00\x0B\x00\x00\x00\x00\x00\x00\x00\x1F\x06-?\x0F"U\x00\x00\x03[^\xD9/F1:M/0/Put,
> > >     avgKeyLen=35,
> > >     avgValueLen=5,
> > >     entries=160988965,
> > >     length=1832309188
> > > Trailer:
> > >     fileinfoOffset=1832308623,
> > >     loadOnOpenDataOffset=1832306641,
> > >     dataIndexCount=43,
> > >     metaIndexCount=0,
> > >     totalUncomressedBytes=1831809883,
> > >     entryCount=160988965,
> > >     compressionCodec=NONE,
> > >     uncompressedDataIndexSize=5558733,
> > >     numDataIndexLevels=2,
> > >     firstDataBlockOffset=0,
> > >     lastDataBlockOffset=1832250057,
> > >     comparatorClassName=org.apache.hadoop.hbase.KeyValue$KeyComparator,
> > >     majorVersion=2,
> > >     minorVersion=3
> > > Fileinfo:
> > >     DATA_BLOCK_ENCODING = FAST_DIFF
> > >     DELETE_FAMILY_COUNT = \x00\x00\x00\x00\x00\x00\x00\x00
> > >     EARLIEST_PUT_TS = \x00\x00\x00\x00\x00\x00\x00\x00
> > >     MAJOR_COMPACTION_KEY = \xFF
> > >     MAX_SEQ_ID_KEY = 4
> > >     TIMERANGE = 0....0
> > >     hfile.AVG_KEY_LEN = 35
> > >     hfile.AVG_VALUE_LEN = 5
> > >     hfile.LASTKEY =
> > > \x00\x16\x00\x0B\x00\x00\x00\x00\x00\x00\x00\x1F\x06-?\x0F"U\x00\x00\x03[^\xD9\x02F1M\x00\x00\x00\x00\x00\x00\x00\x00\x04
> > > Mid-key:
> > > \x00\x12\x00\x0B\x00\x00\x00\x00\x00\x00\x00\x1D\x04_\x07\x89\x00\x00\x02l\x00\x7F\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\x00\x00\x00\x007|\xBE$\x00\x00;\x81
> > > Bloom filter:
> > >     Not present
> > > Delete Family Bloom filter:
> > >     Not present
> > > Stats:
> > >    Key length:
> > >                min = 32.00
> > >                max = 37.00
> > >               mean = 35.11
> > >             stddev = 1.46
> > >             median = 35.00
> > >               75% <= 37.00
> > >               95% <= 37.00
> > >               98% <= 37.00
> > >               99% <= 37.00
> > >             99.9% <= 37.00
> > >              count = 160988965
> > >    Row size (bytes):
> > >                min = 44.00
> > >                max = 55.00
> > >               mean = 48.17
> > >             stddev = 1.43
> > >             median = 48.00
> > >               75% <= 50.00
> > >               95% <= 50.00
> > >               98% <= 50.00
> > >               99% <= 50.00
> > >             99.9% <= 51.97
> > >              count = 160988965
> > >    Row size (columns):
> > >                min = 1.00
> > >                max = 1.00
> > >               mean = 1.00
> > >             stddev = 0.00
> > >             median = 1.00
> > >               75% <= 1.00
> > >               95% <= 1.00
> > >               98% <= 1.00
> > >               99% <= 1.00
> > >             99.9% <= 1.00
> > >              count = 160988965
> > >    Val length:
> > >                min = 4.00
> > >                max = 12.00
> > >               mean = 5.06
> > >             stddev = 0.33
> > >             median = 5.00
> > >               75% <= 5.00
> > >               95% <= 5.00
> > >               98% <= 6.00
> > >               99% <= 8.00
> > >             99.9% <= 9.00
> > >              count = 160988965
> > > Key of biggest row:
> > > \x00\x0B\x00\x00\x00\x00\x00\x00\x00\x1F\x04\xDD:\x06\x00U\x00\x00\x00\x8DS\xD2
> > > Scanned kv count -> 160988965
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > *Bin Mahone | 马洪宾*
> > Apache Kylin: http://kylin.io
> > Github: https://github.com/binmahone
> >
>



-- 
Regards,

*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone