Posted to user@hbase.apache.org by Serega Sheypak <se...@gmail.com> on 2015/05/19 01:26:08 UTC

Optimizing compactions on super-low-cost HW

Hi, we are using extremely cheap HW:
2 HDD 7200 RPM
4*2 cores (Hyper-Threading)
32GB RAM

We are hitting serious I/O performance issues.
Read/write requests are spread more or less evenly across the nodes, and so
is the data size.

ServerName Request Per Second Read Request Count Write Request Count
node01.domain.com,60020,1430172017193 195 171871826 16761699
node02.domain.com,60020,1426925053570 24 34314930 16006603
node03.domain.com,60020,1430860939797 22 32054801 16913299
node04.domain.com,60020,1431975656065 33 1765121 253405
node05.domain.com,60020,1430484646409 27 42248883 16406280
node07.domain.com,60020,1426776403757 27 36324492 16299432
node08.domain.com,60020,1426775898757 26 38507165 13582109
node09.domain.com,60020,1430440612531 27 34360873 15080194
node11.domain.com,60020,1431989669340 28 44307 13466
node12.domain.com,60020,1431927604238 30 5318096 2020855
node13.domain.com,60020,1431372874221 29 31764957 15843688
node14.domain.com,60020,1429640630771 41 36300097 13049801

ServerName Num. Stores Num. Storefiles Storefile Size Uncompressed Storefile Size Index Size Bloom Size
node01.domain.com,60020,1430172017193 82 186 1052080m 76496mb 641849k 310111k
node02.domain.com,60020,1426925053570 82 179 1062730m 79713mb 649610k 318854k
node03.domain.com,60020,1430860939797 82 179 1036597m 76199mb 627346k 307136k
node04.domain.com,60020,1431975656065 82 400 1034624m 76405mb 655954k 289316k
node05.domain.com,60020,1430484646409 82 185 1111807m 81474mb 688136k 334127k
node07.domain.com,60020,1426776403757 82 164 1023217m 74830mb 631774k 296169k
node08.domain.com,60020,1426775898757 81 171 1086446m 79933mb 681486k 312325k
node09.domain.com,60020,1430440612531 81 160 1073852m 77874mb 658924k 309734k
node11.domain.com,60020,1431989669340 81 166 1006322m 75652mb 664753k 264081k
node12.domain.com,60020,1431927604238 82 188 1050229m 75140mb 652970k 304137k
node13.domain.com,60020,1431372874221 82 178 937557m 70042mb 601684k 257607k
node14.domain.com,60020,1429640630771 82 145 949090m 69749mb 592812k 266677k


When a compaction starts, a random node goes to 100% I/O utilization, with
I/O wait lasting seconds, sometimes even tens of seconds.

What are the approaches to optimizing minor and major compactions when you
are I/O bound?

Re: Optimizing compactions on super-low-cost HW

Posted by Serega Sheypak <se...@gmail.com>.
Hi! Thank you for trying to help.
Here are the settings. Do you need anything else?
> memstore
hbase.hregion.memstore.flush.size=128MB

> compaction
hbase.extendedperiod=1hour
hbase.hstore.compactionThreshold=3
hbase.hstore.blockingStoreFiles=10
hbase.hstore.compaction.max=_
hbase.hregion.majorcompaction=1day

hbase.offpeak.start.hour=1
hbase.offpeak.end.hour=5
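
In hbase-site.xml terms that is roughly the following (just a sketch restating
the same values with explicit units; I left hbase.extendedperiod and
hbase.hstore.compaction.max out since their values are elided above):

<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>134217728</value>   <!-- 128 MB -->
</property>
<property>
  <name>hbase.hstore.compactionThreshold</name>
  <value>3</value>           <!-- minor compaction considered at 3 storefiles per store -->
</property>
<property>
  <name>hbase.hstore.blockingStoreFiles</name>
  <value>10</value>          <!-- writes blocked above 10 storefiles per store -->
</property>
<property>
  <name>hbase.hregion.majorcompaction</name>
  <value>86400000</value>    <!-- 1 day in ms -->
</property>
<property>
  <name>hbase.offpeak.start.hour</name>
  <value>1</value>
</property>
<property>
  <name>hbase.offpeak.end.hour</name>
  <value>5</value>
</property>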

2015-05-20 18:01 GMT+03:00 ramkrishna vasudevan <
ramkrishna.s.vasudevan@gmail.com>:

> Can you specify what are the other details like the memstore size, the
> compaction related configurations?
>
> On Wed, May 20, 2015 at 8:11 PM, Serega Sheypak <se...@gmail.com>
> wrote:
>
> > Hi, any input here?
> >
> > 2015-05-19 2:26 GMT+03:00 Serega Sheypak <se...@gmail.com>:
> >
> > > Hi, we are using extremely cheap HW:
> > > 2 HHD 7200
> > > 4*2 core (Hyperthreading)
> > > 32GB RAM
> > >
> > > We met serious IO performance issues.
> > > We have more or less even distribution of read/write requests. The same
> > > for datasize.
> > >
> > > ServerName Request Per Second Read Request Count Write Request Count
> > > node01.domain.com,60020,1430172017193 195 171871826 16761699
> > > node02.domain.com,60020,1426925053570 24 34314930 16006603
> > > node03.domain.com,60020,1430860939797 22 32054801 16913299
> > > node04.domain.com,60020,1431975656065 33 1765121 253405
> > > node05.domain.com,60020,1430484646409 27 42248883 16406280
> > > node07.domain.com,60020,1426776403757 27 36324492 16299432
> > > node08.domain.com,60020,1426775898757 26 38507165 13582109
> > > node09.domain.com,60020,1430440612531 27 34360873 15080194
> > > node11.domain.com,60020,1431989669340 28 44307 13466
> > > node12.domain.com,60020,1431927604238 30 5318096 2020855
> > > node13.domain.com,60020,1431372874221 29 31764957 15843688
> > > node14.domain.com,60020,1429640630771 41 36300097 13049801
> > >
> > > ServerName Num. Stores Num. Storefiles Storefile Size Uncompressed
> > Storefile
> > > Size Index Size Bloom Size
> > > node01.domain.com,60020,1430172017193 82 186 1052080m 76496mb 641849k
> > > 310111k
> > > node02.domain.com,60020,1426925053570 82 179 1062730m 79713mb 649610k
> > > 318854k
> > > node03.domain.com,60020,1430860939797 82 179 1036597m 76199mb 627346k
> > > 307136k
> > > node04.domain.com,60020,1431975656065 82 400 1034624m 76405mb 655954k
> > > 289316k
> > > node05.domain.com,60020,1430484646409 82 185 1111807m 81474mb 688136k
> > > 334127k
> > > node07.domain.com,60020,1426776403757 82 164 1023217m 74830mb 631774k
> > > 296169k
> > > node08.domain.com,60020,1426775898757 81 171 1086446m 79933mb 681486k
> > > 312325k
> > > node09.domain.com,60020,1430440612531 81 160 1073852m 77874mb 658924k
> > > 309734k
> > > node11.domain.com,60020,1431989669340 81 166 1006322m 75652mb 664753k
> > > 264081k
> > > node12.domain.com,60020,1431927604238 82 188 1050229m 75140mb 652970k
> > > 304137k
> > > node13.domain.com,60020,1431372874221 82 178 937557m 70042mb 601684k
> > > 257607k
> > > node14.domain.com,60020,1429640630771 82 145 949090m 69749mb 592812k
> > > 266677k
> > >
> > >
> > > When compaction starts  random node gets I/O 100%, io wait for seconds,
> > > even tenth of seconds.
> > >
> > > What are the approaches to optimize minor and major compactions when
> you
> > > are I/O bound..?
> > >
> >
>

Re: Optimizing compactions on super-low-cost HW

Posted by ramkrishna vasudevan <ra...@gmail.com>.
Can you share the other details, like the memstore size and the
compaction-related configuration?

On Wed, May 20, 2015 at 8:11 PM, Serega Sheypak <se...@gmail.com>
wrote:

> Hi, any input here?
>
> 2015-05-19 2:26 GMT+03:00 Serega Sheypak <se...@gmail.com>:
>
> > Hi, we are using extremely cheap HW:
> > 2 HHD 7200
> > 4*2 core (Hyperthreading)
> > 32GB RAM
> >
> > We met serious IO performance issues.
> > We have more or less even distribution of read/write requests. The same
> > for datasize.
> >
> > ServerName Request Per Second Read Request Count Write Request Count
> > node01.domain.com,60020,1430172017193 195 171871826 16761699
> > node02.domain.com,60020,1426925053570 24 34314930 16006603
> > node03.domain.com,60020,1430860939797 22 32054801 16913299
> > node04.domain.com,60020,1431975656065 33 1765121 253405
> > node05.domain.com,60020,1430484646409 27 42248883 16406280
> > node07.domain.com,60020,1426776403757 27 36324492 16299432
> > node08.domain.com,60020,1426775898757 26 38507165 13582109
> > node09.domain.com,60020,1430440612531 27 34360873 15080194
> > node11.domain.com,60020,1431989669340 28 44307 13466
> > node12.domain.com,60020,1431927604238 30 5318096 2020855
> > node13.domain.com,60020,1431372874221 29 31764957 15843688
> > node14.domain.com,60020,1429640630771 41 36300097 13049801
> >
> > ServerName Num. Stores Num. Storefiles Storefile Size Uncompressed
> Storefile
> > Size Index Size Bloom Size
> > node01.domain.com,60020,1430172017193 82 186 1052080m 76496mb 641849k
> > 310111k
> > node02.domain.com,60020,1426925053570 82 179 1062730m 79713mb 649610k
> > 318854k
> > node03.domain.com,60020,1430860939797 82 179 1036597m 76199mb 627346k
> > 307136k
> > node04.domain.com,60020,1431975656065 82 400 1034624m 76405mb 655954k
> > 289316k
> > node05.domain.com,60020,1430484646409 82 185 1111807m 81474mb 688136k
> > 334127k
> > node07.domain.com,60020,1426776403757 82 164 1023217m 74830mb 631774k
> > 296169k
> > node08.domain.com,60020,1426775898757 81 171 1086446m 79933mb 681486k
> > 312325k
> > node09.domain.com,60020,1430440612531 81 160 1073852m 77874mb 658924k
> > 309734k
> > node11.domain.com,60020,1431989669340 81 166 1006322m 75652mb 664753k
> > 264081k
> > node12.domain.com,60020,1431927604238 82 188 1050229m 75140mb 652970k
> > 304137k
> > node13.domain.com,60020,1431372874221 82 178 937557m 70042mb 601684k
> > 257607k
> > node14.domain.com,60020,1429640630771 82 145 949090m 69749mb 592812k
> > 266677k
> >
> >
> > When compaction starts  random node gets I/O 100%, io wait for seconds,
> > even tenth of seconds.
> >
> > What are the approaches to optimize minor and major compactions when you
> > are I/O bound..?
> >
>

Re: Optimizing compactions on super-low-cost HW

Posted by Serega Sheypak <se...@gmail.com>.
Hi, any input here?

2015-05-19 2:26 GMT+03:00 Serega Sheypak <se...@gmail.com>:

> Hi, we are using extremely cheap HW:
> 2 HHD 7200
> 4*2 core (Hyperthreading)
> 32GB RAM
>
> We met serious IO performance issues.
> We have more or less even distribution of read/write requests. The same
> for datasize.
>
> ServerName Request Per Second Read Request Count Write Request Count
> node01.domain.com,60020,1430172017193 195 171871826 16761699
> node02.domain.com,60020,1426925053570 24 34314930 16006603
> node03.domain.com,60020,1430860939797 22 32054801 16913299
> node04.domain.com,60020,1431975656065 33 1765121 253405
> node05.domain.com,60020,1430484646409 27 42248883 16406280
> node07.domain.com,60020,1426776403757 27 36324492 16299432
> node08.domain.com,60020,1426775898757 26 38507165 13582109
> node09.domain.com,60020,1430440612531 27 34360873 15080194
> node11.domain.com,60020,1431989669340 28 44307 13466
> node12.domain.com,60020,1431927604238 30 5318096 2020855
> node13.domain.com,60020,1431372874221 29 31764957 15843688
> node14.domain.com,60020,1429640630771 41 36300097 13049801
>
> ServerName Num. Stores Num. Storefiles Storefile Size Uncompressed Storefile
> Size Index Size Bloom Size
> node01.domain.com,60020,1430172017193 82 186 1052080m 76496mb 641849k
> 310111k
> node02.domain.com,60020,1426925053570 82 179 1062730m 79713mb 649610k
> 318854k
> node03.domain.com,60020,1430860939797 82 179 1036597m 76199mb 627346k
> 307136k
> node04.domain.com,60020,1431975656065 82 400 1034624m 76405mb 655954k
> 289316k
> node05.domain.com,60020,1430484646409 82 185 1111807m 81474mb 688136k
> 334127k
> node07.domain.com,60020,1426776403757 82 164 1023217m 74830mb 631774k
> 296169k
> node08.domain.com,60020,1426775898757 81 171 1086446m 79933mb 681486k
> 312325k
> node09.domain.com,60020,1430440612531 81 160 1073852m 77874mb 658924k
> 309734k
> node11.domain.com,60020,1431989669340 81 166 1006322m 75652mb 664753k
> 264081k
> node12.domain.com,60020,1431927604238 82 188 1050229m 75140mb 652970k
> 304137k
> node13.domain.com,60020,1431372874221 82 178 937557m 70042mb 601684k
> 257607k
> node14.domain.com,60020,1429640630771 82 145 949090m 69749mb 592812k
> 266677k
>
>
> When compaction starts  random node gets I/O 100%, io wait for seconds,
> even tenth of seconds.
>
> What are the approaches to optimize minor and major compactions when you
> are I/O bound..?
>

Re: Optimizing compactions on super-low-cost HW

Posted by Serega Sheypak <se...@gmail.com>.
Hi!
Please help ^)

2015-05-21 11:04 GMT+03:00 Serega Sheypak <se...@gmail.com>:

> > Do you have the system sharing
> There are 2 HDD 7200 2TB each. There is 300GB OS partition on each drive
> with mirroring enabled. I can't persuade devops that mirroring could cause
> IO issues. What arguments can I bring? They use OS partition mirroring when
> disck fails, we can use other partition to boot OS and continue to work...
>
> >Do you have to compact? In other words, do you have read SLAs?
> Unfortunately, I have mixed workload from web applications. I need to
> write and read and SLA is < 50ms.
>
> >How are your read times currently?
> Cloudera manager says it's 4K reads per second and 500 writes per second
>
> >Does your working dataset fit in RAM or do
> reads have to go to disk?
> I have several tables for 500GB each and many small tables 10-20 GB. Small
> tables loaded hourly/daily using bulkload (prepare HFiles using MR and move
> them to HBase using utility). Big tables are used by webapps, they read and
> write them.
>
> >It looks like you are running at about three storefiles per column family
> is it hbase.hstore.compactionThreshold=3?
>
> >What if you upped the threshold at which minors run?
> you mean bump  hbase.hstore.compactionThreshold to 8 or 10?
>
> >Do you have a downtime during which you could schedule compactions?
> Unfortunately no. It should work 24/7 and sometimes it doesn't do it.
>
> >Are you managing the major compactions yourself or are you having hbase
> do it for you?
> HBase, once a day hbase.hregion.majorcompaction=1day
>
> I can disable WAL. It's ok to loose some data in case of RS failure. I'm
> not doing banking transactions.
> If I disable WAL, could it help?
>
> 2015-05-20 18:04 GMT+03:00 Stack <st...@duboce.net>:
>
>> On Mon, May 18, 2015 at 4:26 PM, Serega Sheypak <serega.sheypak@gmail.com
>> >
>> wrote:
>>
>> > Hi, we are using extremely cheap HW:
>> > 2 HHD 7200
>> > 4*2 core (Hyperthreading)
>> > 32GB RAM
>> >
>> > We met serious IO performance issues.
>> > We have more or less even distribution of read/write requests. The same
>> for
>> > datasize.
>> >
>> > ServerName Request Per Second Read Request Count Write Request Count
>> > node01.domain.com,60020,1430172017193 195 171871826 16761699
>> > node02.domain.com,60020,1426925053570 24 34314930 16006603
>> > node03.domain.com,60020,1430860939797 22 32054801 16913299
>> > node04.domain.com,60020,1431975656065 33 1765121 253405
>> > node05.domain.com,60020,1430484646409 27 42248883 16406280
>> > node07.domain.com,60020,1426776403757 27 36324492 16299432
>> > node08.domain.com,60020,1426775898757 26 38507165 13582109
>> > node09.domain.com,60020,1430440612531 27 34360873 15080194
>> > node11.domain.com,60020,1431989669340 28 44307 13466
>> > node12.domain.com,60020,1431927604238 30 5318096 2020855
>> > node13.domain.com,60020,1431372874221 29 31764957 15843688
>> > node14.domain.com,60020,1429640630771 41 36300097 13049801
>> >
>> > ServerName Num. Stores Num. Storefiles Storefile Size Uncompressed
>> > Storefile
>> > Size Index Size Bloom Size
>> > node01.domain.com,60020,1430172017193 82 186 1052080m 76496mb 641849k
>> > 310111k
>> > node02.domain.com,60020,1426925053570 82 179 1062730m 79713mb 649610k
>> > 318854k
>> > node03.domain.com,60020,1430860939797 82 179 1036597m 76199mb 627346k
>> > 307136k
>> > node04.domain.com,60020,1431975656065 82 400 1034624m 76405mb 655954k
>> > 289316k
>> > node05.domain.com,60020,1430484646409 82 185 1111807m 81474mb 688136k
>> > 334127k
>> > node07.domain.com,60020,1426776403757 82 164 1023217m 74830mb 631774k
>> > 296169k
>> > node08.domain.com,60020,1426775898757 81 171 1086446m 79933mb 681486k
>> > 312325k
>> > node09.domain.com,60020,1430440612531 81 160 1073852m 77874mb 658924k
>> > 309734k
>> > node11.domain.com,60020,1431989669340 81 166 1006322m 75652mb 664753k
>> > 264081k
>> > node12.domain.com,60020,1431927604238 82 188 1050229m 75140mb 652970k
>> > 304137k
>> > node13.domain.com,60020,1431372874221 82 178 937557m 70042mb 601684k
>> > 257607k
>> > node14.domain.com,60020,1429640630771 82 145 949090m 69749mb 592812k
>> > 266677k
>> >
>> >
>> > When compaction starts  random node gets I/O 100%, io wait for seconds,
>> > even tenth of seconds.
>> >
>> > What are the approaches to optimize minor and major compactions when you
>> > are I/O bound..?
>> >
>>
>> Yeah, with two disks, you will be crimped. Do you have the system sharing
>> with hbase/hdfs or is hdfs running on one disk only?
>>
>> Do you have to compact? In other words, do you have read SLAs?  How are
>> your read times currently?  Does your working dataset fit in RAM or do
>> reads have to go to disk?  It looks like you are running at about three
>> storefiles per column family.  What if you upped the threshold at which
>> minors run? Do you have a downtime during which you could schedule
>> compactions? Are you managing the major compactions yourself or are you
>> having hbase do it for you?
>>
>> St.Ack
>>
>
>

Re: Optimizing compactions on super-low-cost HW

Posted by Stack <st...@duboce.net>.
On Fri, May 22, 2015 at 1:18 AM, Serega Sheypak <se...@gmail.com>
wrote:

> >What version of hbase are you on?
> We are on CDH 5.2.1 HBase 0.98
>
>
> >These hfiles are created on same cluster with MR? (i.e. they are using up
> i/os)
> The same cluster :) They are created during night and we get IO degradation
> when no MR runs. I understand, that MR also gives significant IO pressure.
>
> >Can you cache more?
> Don't understand, can you explain? Row cache enabled for all tables which
> apps read.
>
>
Does your live dataset fit in memory? What is your cache hit rate? (See the
logs.) Small improvements in what you can cache can show up as big wins in
i/o. To up your cache, up your heap size and/or give more to the block
cache. You could try the off-heap cache (I think you'd have to upgrade
though to get off-heap with the necessary bug fixes).
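
The knobs I mean are roughly these (a sketch only; the numbers are
illustrative, check your version's defaults):

# hbase-env.sh: give the regionserver a bigger heap (value is in MB)
export HBASE_HEAPSIZE=16384

<!-- hbase-site.xml: fraction of heap handed to the block cache -->
<!-- note: this plus the global memstore fraction must stay <= 0.8 -->
<property>
  <name>hfile.block.cache.size</name>
  <value>0.5</value>
</property>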



> >Can you make it so files are bigger before you flush?
> How can I reach that? increase memstore size?
>
>
Yes. Read the tuning/performance section in the reference guide.
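
For example (again just a sketch with illustrative values), raising the flush
size gives you fewer, larger hfiles per flush at the cost of more heap held in
memstores:

<!-- hbase-site.xml: flush 256 MB files instead of 128 MB -->
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>268435456</value>
</property>
<!-- leave headroom in the aggregate memstore limit if you raise flush size -->
<property>
  <name>hbase.regionserver.global.memstore.upperLimit</name>
  <value>0.4</value>
</property>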



> >the traffic is not so heavy?
> During night is 3-4 times less. I run major compactions during night.
>
>
I thought you said you did not manage the major compactions.  If you don't,
they invariably run when you don't want them to.



> >You realize that a major compaction will do full rewrite of your dataset?
> I do
>
> > When they run, how many storefiles are there?
> How can I measure that? Goto hdfs and count files in table catalog?
>
>
HBase emits metrics with this count in them (you should be able to see it in
CM or whatever you are using to capture cluster-wide metrics). Otherwise,
yeah, a recursive listing (lsr) on hdfs.
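
Something like this gives a rough count (a sketch; 'mytable' is a placeholder
and the path assumes the 0.96+ /hbase/data layout):

# rough per-table storefile count: list files only, ignore recovered edits
hdfs dfs -ls -R /hbase/data/default/mytable | grep '^-' | grep -v recovered.edits | wc -l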


> Do you have to run once a day?  Can you not run once a week?
> Maybe if there is no significant read performance penalty
>
>
Study what your compactions are doing. Are minors keeping up and keeping
the store file count reasonable? Are you deleting data? Do you keep lots
of versions? If the answer to the last two questions is no, then you might
be able to put off major compactions for a week or even more.
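
If you do take the majors into your own hands, the shape of it is roughly
this (a sketch; the table name is a placeholder):

<!-- hbase-site.xml: turn off time-based major compactions -->
<property>
  <name>hbase.hregion.majorcompaction</name>
  <value>0</value>
</property>

# then cron something like this off-peak, one table (or region) at a time
echo "major_compact 'my_big_table'" | hbase shell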

St.Ack



> > Enable deferring sync'ing firs
> Will try...
>
> 2015-05-21 23:04 GMT+03:00 Stack <st...@duboce.net>:
>
> > On Thu, May 21, 2015 at 1:04 AM, Serega Sheypak <
> serega.sheypak@gmail.com>
> > wrote:
> >
> > > > Do you have the system sharing
> > > There are 2 HDD 7200 2TB each. There is 300GB OS partition on each
> drive
> > > with mirroring enabled. I can't persuade devops that mirroring could
> > cause
> > > IO issues. What arguments can I bring? They use OS partition mirroring
> > when
> > > disck fails, we can use other partition to boot OS and continue to
> > work...
> > >
> > >
> > You are already compromised i/o-wise having two disks only. I have not
> the
> > experience to say for sure but basic physics would seem to dictate that
> > having your two disks (partially) mirrored compromises your i/o even
> more.
> >
> > You are in a bit of a hard place. Your operators want the machine to boot
> > even after it loses 50% of its disk.
> >
> >
> > > >Do you have to compact? In other words, do you have read SLAs?
> > > Unfortunately, I have mixed workload from web applications. I need to
> > write
> > > and read and SLA is < 50ms.
> > >
> > >
> > Ok. You get the bit that seeks are about 10ms or each so with two disks
> you
> > can do 2x100 seeks a second presuming no one else is using disk.
> >
> >
> > > >How are your read times currently?
> > > Cloudera manager says it's 4K reads per second and 500 writes per
> second
> > >
> > > >Does your working dataset fit in RAM or do
> > > reads have to go to disk?
> > > I have several tables for 500GB each and many small tables 10-20 GB.
> > Small
> > > tables loaded hourly/daily using bulkload (prepare HFiles using MR and
> > move
> > > them to HBase using utility). Big tables are used by webapps, they read
> > and
> > > write them.
> > >
> > >
> > These hfiles are created on same cluster with MR? (i.e. they are using up
> > i/os)
> >
> >
> > > >It looks like you are running at about three storefiles per column
> > family
> > > is it hbase.hstore.compactionThreshold=3?
> > >
> >
> >
> > > >What if you upped the threshold at which minors run?
> > > you mean bump  hbase.hstore.compactionThreshold to 8 or 10?
> > >
> > >
> > Yes.
> >
> > Downside is that your reads may require more seeks to find a keyvalue.
> >
> > Can you cache more?
> >
> > Can you make it so files are bigger before you flush?
> >
> >
> >
> > > >Do you have a downtime during which you could schedule compactions?
> > > Unfortunately no. It should work 24/7 and sometimes it doesn't do it.
> > >
> > >
> > So, it is running at full bore 24/7?  There is no 'downtime'... a time
> when
> > the traffic is not so heavy?
> >
> >
> >
> > > >Are you managing the major compactions yourself or are you having
> hbase
> > do
> > > it for you?
> > > HBase, once a day hbase.hregion.majorcompaction=1day
> > >
> > >
> > Have you studied your compactions?  You realize that a major compaction
> > will do full rewrite of your dataset?  When they run, how many storefiles
> > are there?
> >
> > Do you have to run once a day?  Can you not run once a week?  Can you
> > manage the compactions yourself... and run them a region at a time in a
> > rolling manner across the cluster rather than have them just run whenever
> > it suits them once a day?
> >
> >
> >
> > > I can disable WAL. It's ok to loose some data in case of RS failure.
> I'm
> > > not doing banking transactions.
> > > If I disable WAL, could it help?
> > >
> > >
> > It could but don't. Enable deferring sync'ing first if you can 'lose'
> some
> > data.
> >
> > Work on your flushing and compactions before you mess w/ WAL.
> >
> > What version of hbase are you on? You say CDH but the newer your hbase,
> the
> > better it does generally.
> >
> > St.Ack
> >
> >
> >
> >
> >
> > > 2015-05-20 18:04 GMT+03:00 Stack <st...@duboce.net>:
> > >
> > > > On Mon, May 18, 2015 at 4:26 PM, Serega Sheypak <
> > > serega.sheypak@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi, we are using extremely cheap HW:
> > > > > 2 HHD 7200
> > > > > 4*2 core (Hyperthreading)
> > > > > 32GB RAM
> > > > >
> > > > > We met serious IO performance issues.
> > > > > We have more or less even distribution of read/write requests. The
> > same
> > > > for
> > > > > datasize.
> > > > >
> > > > > ServerName Request Per Second Read Request Count Write Request
> Count
> > > > > node01.domain.com,60020,1430172017193 195 171871826 16761699
> > > > > node02.domain.com,60020,1426925053570 24 34314930 16006603
> > > > > node03.domain.com,60020,1430860939797 22 32054801 16913299
> > > > > node04.domain.com,60020,1431975656065 33 1765121 253405
> > > > > node05.domain.com,60020,1430484646409 27 42248883 16406280
> > > > > node07.domain.com,60020,1426776403757 27 36324492 16299432
> > > > > node08.domain.com,60020,1426775898757 26 38507165 13582109
> > > > > node09.domain.com,60020,1430440612531 27 34360873 15080194
> > > > > node11.domain.com,60020,1431989669340 28 44307 13466
> > > > > node12.domain.com,60020,1431927604238 30 5318096 2020855
> > > > > node13.domain.com,60020,1431372874221 29 31764957 15843688
> > > > > node14.domain.com,60020,1429640630771 41 36300097 13049801
> > > > >
> > > > > ServerName Num. Stores Num. Storefiles Storefile Size Uncompressed
> > > > > Storefile
> > > > > Size Index Size Bloom Size
> > > > > node01.domain.com,60020,1430172017193 82 186 1052080m 76496mb
> > 641849k
> > > > > 310111k
> > > > > node02.domain.com,60020,1426925053570 82 179 1062730m 79713mb
> > 649610k
> > > > > 318854k
> > > > > node03.domain.com,60020,1430860939797 82 179 1036597m 76199mb
> > 627346k
> > > > > 307136k
> > > > > node04.domain.com,60020,1431975656065 82 400 1034624m 76405mb
> > 655954k
> > > > > 289316k
> > > > > node05.domain.com,60020,1430484646409 82 185 1111807m 81474mb
> > 688136k
> > > > > 334127k
> > > > > node07.domain.com,60020,1426776403757 82 164 1023217m 74830mb
> > 631774k
> > > > > 296169k
> > > > > node08.domain.com,60020,1426775898757 81 171 1086446m 79933mb
> > 681486k
> > > > > 312325k
> > > > > node09.domain.com,60020,1430440612531 81 160 1073852m 77874mb
> > 658924k
> > > > > 309734k
> > > > > node11.domain.com,60020,1431989669340 81 166 1006322m 75652mb
> > 664753k
> > > > > 264081k
> > > > > node12.domain.com,60020,1431927604238 82 188 1050229m 75140mb
> > 652970k
> > > > > 304137k
> > > > > node13.domain.com,60020,1431372874221 82 178 937557m 70042mb
> 601684k
> > > > > 257607k
> > > > > node14.domain.com,60020,1429640630771 82 145 949090m 69749mb
> 592812k
> > > > > 266677k
> > > > >
> > > > >
> > > > > When compaction starts  random node gets I/O 100%, io wait for
> > seconds,
> > > > > even tenth of seconds.
> > > > >
> > > > > What are the approaches to optimize minor and major compactions
> when
> > > you
> > > > > are I/O bound..?
> > > > >
> > > >
> > > > Yeah, with two disks, you will be crimped. Do you have the system
> > sharing
> > > > with hbase/hdfs or is hdfs running on one disk only?
> > > >
> > > > Do you have to compact? In other words, do you have read SLAs?  How
> are
> > > > your read times currently?  Does your working dataset fit in RAM or
> do
> > > > reads have to go to disk?  It looks like you are running at about
> three
> > > > storefiles per column family.  What if you upped the threshold at
> which
> > > > minors run? Do you have a downtime during which you could schedule
> > > > compactions? Are you managing the major compactions yourself or are
> you
> > > > having hbase do it for you?
> > > >
> > > > St.Ack
> > > >
> > >
> >
>

Re: Optimizing compactions on super-low-cost HW

Posted by Serega Sheypak <se...@gmail.com>.
>What version of hbase are you on?
We are on CDH 5.2.1 HBase 0.98


>These hfiles are created on same cluster with MR? (i.e. they are using up
i/os)
The same cluster :) They are created during the night, and we see the I/O
degradation even when no MR is running. I understand that MR also adds
significant I/O pressure.

>Can you cache more?
I don't understand, can you explain? The row cache is enabled for all tables
the apps read.

>Can you make it so files are bigger before you flush?
How can I achieve that? Increase the memstore size?

>the traffic is not so heavy?
At night traffic is 3-4 times lower. I run major compactions during the night.

>You realize that a major compaction will do full rewrite of your dataset?
I do

> When they run, how many storefiles are there?
How can I measure that? Go to HDFS and count the files under the table
directory?

> Do you have to run once a day?  Can you not run once a week?
Maybe, if there is no significant read performance penalty.

> Enable deferring sync'ing first
Will try...
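
If I understand the deferred sync idea right, it would be something like the
following (just a sketch; 'my_big_table' is a placeholder and I still need to
check the exact shell syntax on our 0.98):

# per-table deferred WAL sync: the WAL is still written, but synced asynchronously
alter 'my_big_table', DURABILITY => 'ASYNC_WAL'
# back to the default:
alter 'my_big_table', DURABILITY => 'SYNC_WAL'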

2015-05-21 23:04 GMT+03:00 Stack <st...@duboce.net>:

> On Thu, May 21, 2015 at 1:04 AM, Serega Sheypak <se...@gmail.com>
> wrote:
>
> > > Do you have the system sharing
> > There are 2 HDD 7200 2TB each. There is 300GB OS partition on each drive
> > with mirroring enabled. I can't persuade devops that mirroring could
> cause
> > IO issues. What arguments can I bring? They use OS partition mirroring
> when
> > disck fails, we can use other partition to boot OS and continue to
> work...
> >
> >
> You are already compromised i/o-wise having two disks only. I have not the
> experience to say for sure but basic physics would seem to dictate that
> having your two disks (partially) mirrored compromises your i/o even more.
>
> You are in a bit of a hard place. Your operators want the machine to boot
> even after it loses 50% of its disk.
>
>
> > >Do you have to compact? In other words, do you have read SLAs?
> > Unfortunately, I have mixed workload from web applications. I need to
> write
> > and read and SLA is < 50ms.
> >
> >
> Ok. You get the bit that seeks are about 10ms or each so with two disks you
> can do 2x100 seeks a second presuming no one else is using disk.
>
>
> > >How are your read times currently?
> > Cloudera manager says it's 4K reads per second and 500 writes per second
> >
> > >Does your working dataset fit in RAM or do
> > reads have to go to disk?
> > I have several tables for 500GB each and many small tables 10-20 GB.
> Small
> > tables loaded hourly/daily using bulkload (prepare HFiles using MR and
> move
> > them to HBase using utility). Big tables are used by webapps, they read
> and
> > write them.
> >
> >
> These hfiles are created on same cluster with MR? (i.e. they are using up
> i/os)
>
>
> > >It looks like you are running at about three storefiles per column
> family
> > is it hbase.hstore.compactionThreshold=3?
> >
>
>
> > >What if you upped the threshold at which minors run?
> > you mean bump  hbase.hstore.compactionThreshold to 8 or 10?
> >
> >
> Yes.
>
> Downside is that your reads may require more seeks to find a keyvalue.
>
> Can you cache more?
>
> Can you make it so files are bigger before you flush?
>
>
>
> > >Do you have a downtime during which you could schedule compactions?
> > Unfortunately no. It should work 24/7 and sometimes it doesn't do it.
> >
> >
> So, it is running at full bore 24/7?  There is no 'downtime'... a time when
> the traffic is not so heavy?
>
>
>
> > >Are you managing the major compactions yourself or are you having hbase
> do
> > it for you?
> > HBase, once a day hbase.hregion.majorcompaction=1day
> >
> >
> Have you studied your compactions?  You realize that a major compaction
> will do full rewrite of your dataset?  When they run, how many storefiles
> are there?
>
> Do you have to run once a day?  Can you not run once a week?  Can you
> manage the compactions yourself... and run them a region at a time in a
> rolling manner across the cluster rather than have them just run whenever
> it suits them once a day?
>
>
>
> > I can disable WAL. It's ok to loose some data in case of RS failure. I'm
> > not doing banking transactions.
> > If I disable WAL, could it help?
> >
> >
> It could but don't. Enable deferring sync'ing first if you can 'lose' some
> data.
>
> Work on your flushing and compactions before you mess w/ WAL.
>
> What version of hbase are you on? You say CDH but the newer your hbase, the
> better it does generally.
>
> St.Ack
>
>
>
>
>
> > 2015-05-20 18:04 GMT+03:00 Stack <st...@duboce.net>:
> >
> > > On Mon, May 18, 2015 at 4:26 PM, Serega Sheypak <
> > serega.sheypak@gmail.com>
> > > wrote:
> > >
> > > > Hi, we are using extremely cheap HW:
> > > > 2 HHD 7200
> > > > 4*2 core (Hyperthreading)
> > > > 32GB RAM
> > > >
> > > > We met serious IO performance issues.
> > > > We have more or less even distribution of read/write requests. The
> same
> > > for
> > > > datasize.
> > > >
> > > > ServerName Request Per Second Read Request Count Write Request Count
> > > > node01.domain.com,60020,1430172017193 195 171871826 16761699
> > > > node02.domain.com,60020,1426925053570 24 34314930 16006603
> > > > node03.domain.com,60020,1430860939797 22 32054801 16913299
> > > > node04.domain.com,60020,1431975656065 33 1765121 253405
> > > > node05.domain.com,60020,1430484646409 27 42248883 16406280
> > > > node07.domain.com,60020,1426776403757 27 36324492 16299432
> > > > node08.domain.com,60020,1426775898757 26 38507165 13582109
> > > > node09.domain.com,60020,1430440612531 27 34360873 15080194
> > > > node11.domain.com,60020,1431989669340 28 44307 13466
> > > > node12.domain.com,60020,1431927604238 30 5318096 2020855
> > > > node13.domain.com,60020,1431372874221 29 31764957 15843688
> > > > node14.domain.com,60020,1429640630771 41 36300097 13049801
> > > >
> > > > ServerName Num. Stores Num. Storefiles Storefile Size Uncompressed
> > > > Storefile
> > > > Size Index Size Bloom Size
> > > > node01.domain.com,60020,1430172017193 82 186 1052080m 76496mb
> 641849k
> > > > 310111k
> > > > node02.domain.com,60020,1426925053570 82 179 1062730m 79713mb
> 649610k
> > > > 318854k
> > > > node03.domain.com,60020,1430860939797 82 179 1036597m 76199mb
> 627346k
> > > > 307136k
> > > > node04.domain.com,60020,1431975656065 82 400 1034624m 76405mb
> 655954k
> > > > 289316k
> > > > node05.domain.com,60020,1430484646409 82 185 1111807m 81474mb
> 688136k
> > > > 334127k
> > > > node07.domain.com,60020,1426776403757 82 164 1023217m 74830mb
> 631774k
> > > > 296169k
> > > > node08.domain.com,60020,1426775898757 81 171 1086446m 79933mb
> 681486k
> > > > 312325k
> > > > node09.domain.com,60020,1430440612531 81 160 1073852m 77874mb
> 658924k
> > > > 309734k
> > > > node11.domain.com,60020,1431989669340 81 166 1006322m 75652mb
> 664753k
> > > > 264081k
> > > > node12.domain.com,60020,1431927604238 82 188 1050229m 75140mb
> 652970k
> > > > 304137k
> > > > node13.domain.com,60020,1431372874221 82 178 937557m 70042mb 601684k
> > > > 257607k
> > > > node14.domain.com,60020,1429640630771 82 145 949090m 69749mb 592812k
> > > > 266677k
> > > >
> > > >
> > > > When compaction starts  random node gets I/O 100%, io wait for
> seconds,
> > > > even tenth of seconds.
> > > >
> > > > What are the approaches to optimize minor and major compactions when
> > you
> > > > are I/O bound..?
> > > >
> > >
> > > Yeah, with two disks, you will be crimped. Do you have the system
> sharing
> > > with hbase/hdfs or is hdfs running on one disk only?
> > >
> > > Do you have to compact? In other words, do you have read SLAs?  How are
> > > your read times currently?  Does your working dataset fit in RAM or do
> > > reads have to go to disk?  It looks like you are running at about three
> > > storefiles per column family.  What if you upped the threshold at which
> > > minors run? Do you have a downtime during which you could schedule
> > > compactions? Are you managing the major compactions yourself or are you
> > > having hbase do it for you?
> > >
> > > St.Ack
> > >
> >
>

Re: Optimizing compactions on super-low-cost HW

Posted by Serega Sheypak <se...@gmail.com>.
Ok, got it. Thank you.

2015-05-25 7:58 GMT+03:00 lars hofhansl <la...@apache.org>:

> Re: blockingStoreFiles
> With LSM stores you do not get a smooth behavior when you continuously try
> to pump more data into the cluster than the system can absorb.
> For a while the memstores can absorb the write in RAM, then they need to
> flush. If compactions cannot keep up with the influx of new HFiles, you
> have two choices: (1) you allow the number of the HFiles to grow at the
> expense of read performance, or (2) you tell the clients to slow down
> (there are various levels of sophistication about how you do that, but
> that's besides the point).
> blockingStoreFiles is the maximum number of files (per store, i.e. per
> column family) that HBase will allow to accumulate before it stops
> accepting writes from the clients.In 0.94 it would simply block for a
> while. In 0.98 it throws an exception back to the client to tell it to back
> off.
> -- Lars
>
>      From: Serega Sheypak <se...@gmail.com>
>  To: user <us...@hbase.apache.org>; lars hofhansl <la...@apache.org>
>  Sent: Sunday, May 24, 2015 12:59 PM
>  Subject: Re: Optimizing compactions on super-low-cost HW
>
> Hi, thanks!
> > hbase.hstore.blockingStoreFiles
> Don't understand the idea of this setting, can I find explanation for
> "dummies"?
>
> >hbase.hregion.majorcompaction
> done already
>
> >DATA_BLOCK_ENCODING, SNAPPY
> I always use it by default, CPU OK
>
> > memstore flush size
> done
>
>
> >I assume only the 300g partitions are mirrored, right? (not the entire 2t
> drive)
> Aha
>
> >Can you add more machines?
> Will do it when earn money.
> Thank you :)
>
>
>
> 2015-05-24 21:42 GMT+03:00 lars hofhansl <la...@apache.org>:
>
> > Yeah, all you can do is drive your write amplification down.
> >
> >
> > As Stack said:
> > - Increase hbase.hstore.compactionThreshold, and
> > hbase.hstore.blockingStoreFiles. It'll hurt read, but in your case read
> is
> > already significantly hurt when compactions happen.
> >
> >
> > - Absolutely set hbase.hregion.majorcompaction to 1 week (with a jitter
> if
> > 1/2 week, that's the default in 0.98 and later). Minor compaction will
> > still happen, based on the compactionThreshold setting. Right now you're
> > rewriting _all_ you data _every_ day.
> >
> >
> > - Turning off WAL writing will safe you IO, but I doubt it'll help much.
> I
> > do not expect async WAL helps a lot as the aggregate IO is still the
> same.
> >
> > - See if you can enable DATA_BLOCK_ENCODING on your column families
> > (FAST_DIFF, or PREFIX are good). You can also try SNAPPY compression.
> That
> > would reduce you overall IO (Since your CPUs are also weak you'd have to
> > test the CPU/IO tradeoff)
> >
> >
> > - If you have RAM to spare, increase the memstore flush size (will lead
> to
> > initially larger and fewer files).
> >
> >
> > - Or (again if you have spare RAM) make your regions smaller, to curb
> > write amplification.
> >
> >
> > - I assume only the 300g partitions are mirrored, right? (not the entire
> > 2t drive)
> >
> >
> > I have some suggestions compiled here (if you don't mind the plug):
> >
> http://hadoop-hbase.blogspot.com/2015/05/my-hbasecon-talk-about-hbase.html
> >
> > Other than that, I'll repeat what others said, you have 14 extremely weak
> > machines, you can't expect the world from this.
> > You're aggregate IOPS are less than 3000, you aggregate IO bandwidth
> > ~3GB/s. Can you add more machines?
> >
> >
> > -- Lars
> >
> > ________________________________
> > From: Serega Sheypak <se...@gmail.com>
> > To: user <us...@hbase.apache.org>
> > Sent: Friday, May 22, 2015 3:45 AM
> > Subject: Re: Optimizing compactions on super-low-cost HW
> >
> >
> > We don't have money, these nodes are the cheapest. I totally agree that
> we
> > need 4-6 HDD, but there is no chance to get it unfortunately.
> > Okay, I'll try yo apply Stack suggestions.
> >
> >
> >
> >
> > 2015-05-22 13:00 GMT+03:00 Michael Segel <mi...@hotmail.com>:
> >
> > > Look, to be blunt, you’re screwed.
> > >
> > > If I read your cluster spec.. it sounds like you have a single i7 (quad
> > > core) cpu. That’s 4 cores or 8 threads.
> > >
> > > Mirroring the OS is common practice.
> > > Using the same drives for Hadoop… not so good, but once the sever boots
> > > up… not so much I/O.
> > > Its not good, but you could live with it….
> > >
> > > Your best bet is to add a couple of more spindles. Ideally you’d want
> to
> > > have 6 drives. the 2 OS drives mirrored and separate. (Use the extra
> > space
> > > to stash / write logs.) Then have 4 drives / spindles in JBOD for
> Hadoop.
> > > This brings you to a 1:1 on physical cores.  If your box can handle
> more
> > > spindles, then going to a total of 10 drives would improve performance
> > > further.
> > >
> > > However, you need to level set your expectations… you can only go so
> far.
> > > If you have 4 drives spinning,  you could start to saturate a 1GbE
> > network
> > > so that will hurt performance.
> > >
> > > That’s pretty much your only option in terms of fixing the hardware and
> > > then you have to start tuning.
> > >
> > > > On May 21, 2015, at 4:04 PM, Stack <st...@duboce.net> wrote:
> > > >
> > > > On Thu, May 21, 2015 at 1:04 AM, Serega Sheypak <
> > > serega.sheypak@gmail.com>
> > > > wrote:
> > > >
> > > >>> Do you have the system sharing
> > > >> There are 2 HDD 7200 2TB each. There is 300GB OS partition on each
> > drive
> > > >> with mirroring enabled. I can't persuade devops that mirroring could
> > > cause
> > > >> IO issues. What arguments can I bring? They use OS partition
> mirroring
> > > when
> > > >> disck fails, we can use other partition to boot OS and continue to
> > > work...
> > > >>
> > > >>
> > > > You are already compromised i/o-wise having two disks only. I have
> not
> > > the
> > > > experience to say for sure but basic physics would seem to dictate
> that
> > > > having your two disks (partially) mirrored compromises your i/o even
> > > more.
> > > >
> > > > You are in a bit of a hard place. Your operators want the machine to
> > boot
> > > > even after it loses 50% of its disk.
> > > >
> > > >
> > > >>> Do you have to compact? In other words, do you have read SLAs?
> > > >> Unfortunately, I have mixed workload from web applications. I need
> to
> > > write
> > > >> and read and SLA is < 50ms.
> > > >>
> > > >>
> > > > Ok. You get the bit that seeks are about 10ms or each so with two
> disks
> > > you
> > > > can do 2x100 seeks a second presuming no one else is using disk.
> > > >
> > > >
> > > >>> How are your read times currently?
> > > >> Cloudera manager says it's 4K reads per second and 500 writes per
> > second
> > > >>
> > > >>> Does your working dataset fit in RAM or do
> > > >> reads have to go to disk?
> > > >> I have several tables for 500GB each and many small tables 10-20 GB.
> > > Small
> > > >> tables loaded hourly/daily using bulkload (prepare HFiles using MR
> and
> > > move
> > > >> them to HBase using utility). Big tables are used by webapps, they
> > read
> > > and
> > > >> write them.
> > > >>
> > > >>
> > > > These hfiles are created on same cluster with MR? (i.e. they are
> using
> > up
> > > > i/os)
> > > >
> > > >
> > > >>> It looks like you are running at about three storefiles per column
> > > family
> > > >> is it hbase.hstore.compactionThreshold=3?
> > > >>
> > > >
> > > >
> > > >>> What if you upped the threshold at which minors run?
> > > >> you mean bump  hbase.hstore.compactionThreshold to 8 or 10?
> > > >>
> > > >>
> > > > Yes.
> > > >
> > > > Downside is that your reads may require more seeks to find a
> keyvalue.
> > > >
> > > > Can you cache more?
> > > >
> > > > Can you make it so files are bigger before you flush?
> > > >
> > > >
> > > >
> > > >>> Do you have a downtime during which you could schedule compactions?
> > > >> Unfortunately no. It should work 24/7 and sometimes it doesn't do
> it.
> > > >>
> > > >>
> > > > So, it is running at full bore 24/7?  There is no 'downtime'... a
> time
> > > when
> > > > the traffic is not so heavy?
> > > >
> > > >
> > > >
> > > >>> Are you managing the major compactions yourself or are you having
> > > hbase do
> > > >> it for you?
> > > >> HBase, once a day hbase.hregion.majorcompaction=1day
> > > >>
> > > >>
> > > > Have you studied your compactions?  You realize that a major
> compaction
> > > > will do full rewrite of your dataset?  When they run, how many
> > storefiles
> > > > are there?
> > > >
> > > > Do you have to run once a day?  Can you not run once a week?  Can you
> > > > manage the compactions yourself... and run them a region at a time
> in a
> > > > rolling manner across the cluster rather than have them just run
> > whenever
> > > > it suits them once a day?
> > > >
> > > >
> > > >
> > > >> I can disable WAL. It's ok to loose some data in case of RS failure.
> > I'm
> > > >> not doing banking transactions.
> > > >> If I disable WAL, could it help?
> > > >>
> > > >>
> > > > It could but don't. Enable deferring sync'ing first if you can 'lose'
> > > some
> > > > data.
> > > >
> > > > Work on your flushing and compactions before you mess w/ WAL.
> > > >
> > > > What version of hbase are you on? You say CDH but the newer your
> hbase,
> > > the
> > > > better it does generally.
> > > >
> > > > St.Ack
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >> 2015-05-20 18:04 GMT+03:00 Stack <st...@duboce.net>:
> > > >>
> > > >>> On Mon, May 18, 2015 at 4:26 PM, Serega Sheypak <
> > > >> serega.sheypak@gmail.com>
> > > >>> wrote:
> > > >>>
> > > >>>> Hi, we are using extremely cheap HW:
> > > >>>> 2 HHD 7200
> > > >>>> 4*2 core (Hyperthreading)
> > > >>>> 32GB RAM
> > > >>>>
> > > >>>> We met serious IO performance issues.
> > > >>>> We have more or less even distribution of read/write requests. The
> > > same
> > > >>> for
> > > >>>> datasize.
> > > >>>>
> > > >>>> ServerName Request Per Second Read Request Count Write Request
> Count
> > > >>>> node01.domain.com,60020,1430172017193 195 171871826 16761699
> > > >>>> node02.domain.com,60020,1426925053570 24 34314930 16006603
> > > >>>> node03.domain.com,60020,1430860939797 22 32054801 16913299
> > > >>>> node04.domain.com,60020,1431975656065 33 1765121 253405
> > > >>>> node05.domain.com,60020,1430484646409 27 42248883 16406280
> > > >>>> node07.domain.com,60020,1426776403757 27 36324492 16299432
> > > >>>> node08.domain.com,60020,1426775898757 26 38507165 13582109
> > > >>>> node09.domain.com,60020,1430440612531 27 34360873 15080194
> > > >>>> node11.domain.com,60020,1431989669340 28 44307 13466
> > > >>>> node12.domain.com,60020,1431927604238 30 5318096 2020855
> > > >>>> node13.domain.com,60020,1431372874221 29 31764957 15843688
> > > >>>> node14.domain.com,60020,1429640630771 41 36300097 13049801
> > > >>>>
> > > >>>> ServerName Num. Stores Num. Storefiles Storefile Size Uncompressed
> > > >>>> Storefile
> > > >>>> Size Index Size Bloom Size
> > > >>>> node01.domain.com,60020,1430172017193 82 186 1052080m 76496mb
> > 641849k
> > > >>>> 310111k
> > > >>>> node02.domain.com,60020,1426925053570 82 179 1062730m 79713mb
> > 649610k
> > > >>>> 318854k
> > > >>>> node03.domain.com,60020,1430860939797 82 179 1036597m 76199mb
> > 627346k
> > > >>>> 307136k
> > > >>>> node04.domain.com,60020,1431975656065 82 400 1034624m 76405mb
> > 655954k
> > > >>>> 289316k
> > > >>>> node05.domain.com,60020,1430484646409 82 185 1111807m 81474mb
> > 688136k
> > > >>>> 334127k
> > > >>>> node07.domain.com,60020,1426776403757 82 164 1023217m 74830mb
> > 631774k
> > > >>>> 296169k
> > > >>>> node08.domain.com,60020,1426775898757 81 171 1086446m 79933mb
> > 681486k
> > > >>>> 312325k
> > > >>>> node09.domain.com,60020,1430440612531 81 160 1073852m 77874mb
> > 658924k
> > > >>>> 309734k
> > > >>>> node11.domain.com,60020,1431989669340 81 166 1006322m 75652mb
> > 664753k
> > > >>>> 264081k
> > > >>>> node12.domain.com,60020,1431927604238 82 188 1050229m 75140mb
> > 652970k
> > > >>>> 304137k
> > > >>>> node13.domain.com,60020,1431372874221 82 178 937557m 70042mb
> > 601684k
> > > >>>> 257607k
> > > >>>> node14.domain.com,60020,1429640630771 82 145 949090m 69749mb
> > 592812k
> > > >>>> 266677k
> > > >>>>
> > > >>>>
> > > >>>> When compaction starts  random node gets I/O 100%, io wait for
> > > seconds,
> > > >>>> even tenth of seconds.
> > > >>>>
> > > >>>> What are the approaches to optimize minor and major compactions
> when
> > > >> you
> > > >>>> are I/O bound..?
> > > >>>>
> > > >>>
> > > >>> Yeah, with two disks, you will be crimped. Do you have the system
> > > sharing
> > > >>> with hbase/hdfs or is hdfs running on one disk only?
> > > >>>
> > > >>> Do you have to compact? In other words, do you have read SLAs?  How
> > are
> > > >>> your read times currently?  Does your working dataset fit in RAM or
> > do
> > > >>> reads have to go to disk?  It looks like you are running at about
> > three
> > > >>> storefiles per column family.  What if you upped the threshold at
> > which
> > > >>> minors run? Do you have a downtime during which you could schedule
> > > >>> compactions? Are you managing the major compactions yourself or are
> > you
> > > >>> having hbase do it for you?
> > > >>>
> > > >>> St.Ack
> > > >>>
> > > >>
> > >
> > >
> >
>
>
>

Re: Optimizing compactions on super-low-cost HW

Posted by lars hofhansl <la...@apache.org>.
Re: blockingStoreFiles
With LSM stores you do not get smooth behavior when you continuously try to pump more data into the cluster than the system can absorb.
For a while the memstores can absorb the writes in RAM, then they need to flush. If compactions cannot keep up with the influx of new HFiles, you have two choices: (1) you allow the number of HFiles to grow at the expense of read performance, or (2) you tell the clients to slow down (there are various levels of sophistication about how you do that, but that's beside the point).
blockingStoreFiles is the maximum number of files (per store, i.e. per column family) that HBase will allow to accumulate before it stops accepting writes from the clients. In 0.94 it would simply block for a while. In 0.98 it throws an exception back to the client to tell it to back off.
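
Concretely, the knobs involved look roughly like this (a sketch; the values
are only illustration):

<!-- hbase-site.xml: let more files accumulate before writers are blocked -->
<property>
  <name>hbase.hstore.blockingStoreFiles</name>
  <value>20</value>
</property>
<!-- how long a region keeps blocking updates after hitting that limit before
     it gives up and accepts writes again, even if compaction hasn't finished (ms) -->
<property>
  <name>hbase.hstore.blockingWaitTime</name>
  <value>90000</value>
</property>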
-- Lars

     From: Serega Sheypak <se...@gmail.com>
 To: user <us...@hbase.apache.org>; lars hofhansl <la...@apache.org> 
 Sent: Sunday, May 24, 2015 12:59 PM
 Subject: Re: Optimizing compactions on super-low-cost HW
   
Hi, thanks!
> hbase.hstore.blockingStoreFiles
Don't understand the idea of this setting, can I find explanation for
"dummies"?

>hbase.hregion.majorcompaction
done already

>DATA_BLOCK_ENCODING, SNAPPY
I always use it by default, CPU OK

> memstore flush size
done


>I assume only the 300g partitions are mirrored, right? (not the entire 2t
drive)
Aha

>Can you add more machines?
Will do it when earn money.
Thank you :)



2015-05-24 21:42 GMT+03:00 lars hofhansl <la...@apache.org>:

> Yeah, all you can do is drive your write amplification down.
>
>
> As Stack said:
> - Increase hbase.hstore.compactionThreshold, and
> hbase.hstore.blockingStoreFiles. It'll hurt read, but in your case read is
> already significantly hurt when compactions happen.
>
>
> - Absolutely set hbase.hregion.majorcompaction to 1 week (with a jitter if
> 1/2 week, that's the default in 0.98 and later). Minor compaction will
> still happen, based on the compactionThreshold setting. Right now you're
> rewriting _all_ you data _every_ day.
>
>
> - Turning off WAL writing will safe you IO, but I doubt it'll help much. I
> do not expect async WAL helps a lot as the aggregate IO is still the same.
>
> - See if you can enable DATA_BLOCK_ENCODING on your column families
> (FAST_DIFF, or PREFIX are good). You can also try SNAPPY compression. That
> would reduce you overall IO (Since your CPUs are also weak you'd have to
> test the CPU/IO tradeoff)
>
>
> - If you have RAM to spare, increase the memstore flush size (will lead to
> initially larger and fewer files).
>
>
> - Or (again if you have spare RAM) make your regions smaller, to curb
> write amplification.
>
>
> - I assume only the 300g partitions are mirrored, right? (not the entire
> 2t drive)
>
>
> I have some suggestions compiled here (if you don't mind the plug):
> http://hadoop-hbase.blogspot.com/2015/05/my-hbasecon-talk-about-hbase.html
>
> Other than that, I'll repeat what others said, you have 14 extremely weak
> machines, you can't expect the world from this.
> You're aggregate IOPS are less than 3000, you aggregate IO bandwidth
> ~3GB/s. Can you add more machines?
>
>
> -- Lars
>
> ________________________________
> From: Serega Sheypak <se...@gmail.com>
> To: user <us...@hbase.apache.org>
> Sent: Friday, May 22, 2015 3:45 AM
> Subject: Re: Optimizing compactions on super-low-cost HW
>
>
> We don't have money, these nodes are the cheapest. I totally agree that we
> need 4-6 HDD, but there is no chance to get it unfortunately.
> Okay, I'll try yo apply Stack suggestions.
>
>
>
>
> 2015-05-22 13:00 GMT+03:00 Michael Segel <mi...@hotmail.com>:
>
> > Look, to be blunt, you’re screwed.
> >
> > If I read your cluster spec.. it sounds like you have a single i7 (quad
> > core) cpu. That’s 4 cores or 8 threads.
> >
> > Mirroring the OS is common practice.
> > Using the same drives for Hadoop… not so good, but once the sever boots
> > up… not so much I/O.
> > Its not good, but you could live with it….
> >
> > Your best bet is to add a couple of more spindles. Ideally you’d want to
> > have 6 drives. the 2 OS drives mirrored and separate. (Use the extra
> space
> > to stash / write logs.) Then have 4 drives / spindles in JBOD for Hadoop.
> > This brings you to a 1:1 on physical cores.  If your box can handle more
> > spindles, then going to a total of 10 drives would improve performance
> > further.
> >
> > However, you need to level set your expectations… you can only go so far.
> > If you have 4 drives spinning,  you could start to saturate a 1GbE
> network
> > so that will hurt performance.
> >
> > That’s pretty much your only option in terms of fixing the hardware and
> > then you have to start tuning.
> >
> > > On May 21, 2015, at 4:04 PM, Stack <st...@duboce.net> wrote:
> > >
> > > On Thu, May 21, 2015 at 1:04 AM, Serega Sheypak <
> > serega.sheypak@gmail.com>
> > > wrote:
> > >
> > >>> Do you have the system sharing
> > >> There are 2 HDD 7200 2TB each. There is 300GB OS partition on each
> drive
> > >> with mirroring enabled. I can't persuade devops that mirroring could
> > cause
> > >> IO issues. What arguments can I bring? They use OS partition mirroring
> > when
> > >> disck fails, we can use other partition to boot OS and continue to
> > work...
> > >>
> > >>
> > > You are already compromised i/o-wise having two disks only. I have not
> > the
> > > experience to say for sure but basic physics would seem to dictate that
> > > having your two disks (partially) mirrored compromises your i/o even
> > more.
> > >
> > > You are in a bit of a hard place. Your operators want the machine to
> boot
> > > even after it loses 50% of its disk.
> > >
> > >
> > >>> Do you have to compact? In other words, do you have read SLAs?
> > >> Unfortunately, I have mixed workload from web applications. I need to
> > write
> > >> and read and SLA is < 50ms.
> > >>
> > >>
> > > Ok. You get the bit that seeks are about 10ms or each so with two disks
> > you
> > > can do 2x100 seeks a second presuming no one else is using disk.
> > >
> > >
> > >>> How are your read times currently?
> > >> Cloudera manager says it's 4K reads per second and 500 writes per
> second
> > >>
> > >>> Does your working dataset fit in RAM or do
> > >> reads have to go to disk?
> > >> I have several tables for 500GB each and many small tables 10-20 GB.
> > Small
> > >> tables loaded hourly/daily using bulkload (prepare HFiles using MR and
> > move
> > >> them to HBase using utility). Big tables are used by webapps, they
> read
> > and
> > >> write them.
> > >>
> > >>
> > > These hfiles are created on same cluster with MR? (i.e. they are using
> up
> > > i/os)
> > >
> > >
> > >>> It looks like you are running at about three storefiles per column
> > family
> > >> is it hbase.hstore.compactionThreshold=3?
> > >>
> > >
> > >
> > >>> What if you upped the threshold at which minors run?
> > >> you mean bump  hbase.hstore.compactionThreshold to 8 or 10?
> > >>
> > >>
> > > Yes.
> > >
> > > Downside is that your reads may require more seeks to find a keyvalue.
> > >
> > > Can you cache more?
> > >
> > > Can you make it so files are bigger before you flush?
> > >
> > >
> > >
> > >>> Do you have a downtime during which you could schedule compactions?
> > >> Unfortunately no. It should work 24/7 and sometimes it doesn't do it.
> > >>
> > >>
> > > So, it is running at full bore 24/7?  There is no 'downtime'... a time
> > when
> > > the traffic is not so heavy?
> > >
> > >
> > >
> > >>> Are you managing the major compactions yourself or are you having
> > hbase do
> > >> it for you?
> > >> HBase, once a day hbase.hregion.majorcompaction=1day
> > >>
> > >>
> > > Have you studied your compactions?  You realize that a major compaction
> > > will do full rewrite of your dataset?  When they run, how many
> storefiles
> > > are there?
> > >
> > > Do you have to run once a day?  Can you not run once a week?  Can you
> > > manage the compactions yourself... and run them a region at a time in a
> > > rolling manner across the cluster rather than have them just run
> whenever
> > > it suits them once a day?
> > >
> > >
> > >
> > >> I can disable WAL. It's ok to loose some data in case of RS failure.
> I'm
> > >> not doing banking transactions.
> > >> If I disable WAL, could it help?
> > >>
> > >>
> > > It could but don't. Enable deferring sync'ing first if you can 'lose'
> > some
> > > data.
> > >
> > > Work on your flushing and compactions before you mess w/ WAL.
> > >
> > > What version of hbase are you on? You say CDH but the newer your hbase,
> > the
> > > better it does generally.
> > >
> > > St.Ack
> > >
> > >
> > >
> > >
> > >
> > >> 2015-05-20 18:04 GMT+03:00 Stack <st...@duboce.net>:
> > >>
> > >>> On Mon, May 18, 2015 at 4:26 PM, Serega Sheypak <
> > >> serega.sheypak@gmail.com>
> > >>> wrote:
> > >>>
> > >>>> Hi, we are using extremely cheap HW:
> > >>>> 2 HHD 7200
> > >>>> 4*2 core (Hyperthreading)
> > >>>> 32GB RAM
> > >>>>
> > >>>> We met serious IO performance issues.
> > >>>> We have more or less even distribution of read/write requests. The
> > same
> > >>> for
> > >>>> datasize.
> > >>>>
> > >>>> ServerName Request Per Second Read Request Count Write Request Count
> > >>>> node01.domain.com,60020,1430172017193 195 171871826 16761699
> > >>>> node02.domain.com,60020,1426925053570 24 34314930 16006603
> > >>>> node03.domain.com,60020,1430860939797 22 32054801 16913299
> > >>>> node04.domain.com,60020,1431975656065 33 1765121 253405
> > >>>> node05.domain.com,60020,1430484646409 27 42248883 16406280
> > >>>> node07.domain.com,60020,1426776403757 27 36324492 16299432
> > >>>> node08.domain.com,60020,1426775898757 26 38507165 13582109
> > >>>> node09.domain.com,60020,1430440612531 27 34360873 15080194
> > >>>> node11.domain.com,60020,1431989669340 28 44307 13466
> > >>>> node12.domain.com,60020,1431927604238 30 5318096 2020855
> > >>>> node13.domain.com,60020,1431372874221 29 31764957 15843688
> > >>>> node14.domain.com,60020,1429640630771 41 36300097 13049801
> > >>>>
> > >>>> ServerName Num. Stores Num. Storefiles Storefile Size Uncompressed
> > >>>> Storefile
> > >>>> Size Index Size Bloom Size
> > >>>> node01.domain.com,60020,1430172017193 82 186 1052080m 76496mb
> 641849k
> > >>>> 310111k
> > >>>> node02.domain.com,60020,1426925053570 82 179 1062730m 79713mb
> 649610k
> > >>>> 318854k
> > >>>> node03.domain.com,60020,1430860939797 82 179 1036597m 76199mb
> 627346k
> > >>>> 307136k
> > >>>> node04.domain.com,60020,1431975656065 82 400 1034624m 76405mb
> 655954k
> > >>>> 289316k
> > >>>> node05.domain.com,60020,1430484646409 82 185 1111807m 81474mb
> 688136k
> > >>>> 334127k
> > >>>> node07.domain.com,60020,1426776403757 82 164 1023217m 74830mb
> 631774k
> > >>>> 296169k
> > >>>> node08.domain.com,60020,1426775898757 81 171 1086446m 79933mb
> 681486k
> > >>>> 312325k
> > >>>> node09.domain.com,60020,1430440612531 81 160 1073852m 77874mb
> 658924k
> > >>>> 309734k
> > >>>> node11.domain.com,60020,1431989669340 81 166 1006322m 75652mb
> 664753k
> > >>>> 264081k
> > >>>> node12.domain.com,60020,1431927604238 82 188 1050229m 75140mb
> 652970k
> > >>>> 304137k
> > >>>> node13.domain.com,60020,1431372874221 82 178 937557m 70042mb
> 601684k
> > >>>> 257607k
> > >>>> node14.domain.com,60020,1429640630771 82 145 949090m 69749mb
> 592812k
> > >>>> 266677k
> > >>>>
> > >>>>
> > >>>> When compaction starts  random node gets I/O 100%, io wait for
> > seconds,
> > >>>> even tenth of seconds.
> > >>>>
> > >>>> What are the approaches to optimize minor and major compactions when
> > >> you
> > >>>> are I/O bound..?
> > >>>>
> > >>>
> > >>> Yeah, with two disks, you will be crimped. Do you have the system
> > sharing
> > >>> with hbase/hdfs or is hdfs running on one disk only?
> > >>>
> > >>> Do you have to compact? In other words, do you have read SLAs?  How
> are
> > >>> your read times currently?  Does your working dataset fit in RAM or
> do
> > >>> reads have to go to disk?  It looks like you are running at about
> three
> > >>> storefiles per column family.  What if you upped the threshold at
> which
> > >>> minors run? Do you have a downtime during which you could schedule
> > >>> compactions? Are you managing the major compactions yourself or are
> you
> > >>> having hbase do it for you?
> > >>>
> > >>> St.Ack
> > >>>
> > >>
> >
> >
>

  

Re: Optimizing compactions on super-low-cost HW

Posted by Serega Sheypak <se...@gmail.com>.
Hi, thanks!
> hbase.hstore.blockingStoreFiles
I don't understand the idea of this setting; where can I find an explanation for
"dummies"?

>hbase.hregion.majorcompaction
done already

>DATA_BLOCK_ENCODING, SNAPPY
I always use them by default; CPU is OK

> memstore flush size
done


>I assume only the 300g partitions are mirrored, right? (not the entire 2t
drive)
Aha

>Can you add more machines?
Will do it when we earn money.
Thank you :)

2015-05-24 21:42 GMT+03:00 lars hofhansl <la...@apache.org>:

> Yeah, all you can do is drive your write amplification down.
>
>
> As Stack said:
> - Increase hbase.hstore.compactionThreshold, and
> hbase.hstore.blockingStoreFiles. It'll hurt read, but in your case read is
> already significantly hurt when compactions happen.
>
>
> - Absolutely set hbase.hregion.majorcompaction to 1 week (with a jitter if
> 1/2 week, that's the default in 0.98 and later). Minor compaction will
> still happen, based on the compactionThreshold setting. Right now you're
> rewriting _all_ you data _every_ day.
>
>
> - Turning off WAL writing will safe you IO, but I doubt it'll help much. I
> do not expect async WAL helps a lot as the aggregate IO is still the same.
>
> - See if you can enable DATA_BLOCK_ENCODING on your column families
> (FAST_DIFF, or PREFIX are good). You can also try SNAPPY compression. That
> would reduce you overall IO (Since your CPUs are also weak you'd have to
> test the CPU/IO tradeoff)
>
>
> - If you have RAM to spare, increase the memstore flush size (will lead to
> initially larger and fewer files).
>
>
> - Or (again if you have spare RAM) make your regions smaller, to curb
> write amplification.
>
>
> - I assume only the 300g partitions are mirrored, right? (not the entire
> 2t drive)
>
>
> I have some suggestions compiled here (if you don't mind the plug):
> http://hadoop-hbase.blogspot.com/2015/05/my-hbasecon-talk-about-hbase.html
>
> Other than that, I'll repeat what others said, you have 14 extremely weak
> machines, you can't expect the world from this.
> You're aggregate IOPS are less than 3000, you aggregate IO bandwidth
> ~3GB/s. Can you add more machines?
>
>
> -- Lars
>
> ________________________________
> From: Serega Sheypak <se...@gmail.com>
> To: user <us...@hbase.apache.org>
> Sent: Friday, May 22, 2015 3:45 AM
> Subject: Re: Optimizing compactions on super-low-cost HW
>
>
> We don't have money, these nodes are the cheapest. I totally agree that we
> need 4-6 HDD, but there is no chance to get it unfortunately.
> Okay, I'll try yo apply Stack suggestions.
>
>
>
>
> 2015-05-22 13:00 GMT+03:00 Michael Segel <mi...@hotmail.com>:
>
> > Look, to be blunt, you’re screwed.
> >
> > If I read your cluster spec.. it sounds like you have a single i7 (quad
> > core) cpu. That’s 4 cores or 8 threads.
> >
> > Mirroring the OS is common practice.
> > Using the same drives for Hadoop… not so good, but once the sever boots
> > up… not so much I/O.
> > Its not good, but you could live with it….
> >
> > Your best bet is to add a couple of more spindles. Ideally you’d want to
> > have 6 drives. the 2 OS drives mirrored and separate. (Use the extra
> space
> > to stash / write logs.) Then have 4 drives / spindles in JBOD for Hadoop.
> > This brings you to a 1:1 on physical cores.  If your box can handle more
> > spindles, then going to a total of 10 drives would improve performance
> > further.
> >
> > However, you need to level set your expectations… you can only go so far.
> > If you have 4 drives spinning,  you could start to saturate a 1GbE
> network
> > so that will hurt performance.
> >
> > That’s pretty much your only option in terms of fixing the hardware and
> > then you have to start tuning.
> >
> > > On May 21, 2015, at 4:04 PM, Stack <st...@duboce.net> wrote:
> > >
> > > On Thu, May 21, 2015 at 1:04 AM, Serega Sheypak <
> > serega.sheypak@gmail.com>
> > > wrote:
> > >
> > >>> Do you have the system sharing
> > >> There are 2 HDD 7200 2TB each. There is 300GB OS partition on each
> drive
> > >> with mirroring enabled. I can't persuade devops that mirroring could
> > cause
> > >> IO issues. What arguments can I bring? They use OS partition mirroring
> > when
> > >> disck fails, we can use other partition to boot OS and continue to
> > work...
> > >>
> > >>
> > > You are already compromised i/o-wise having two disks only. I have not
> > the
> > > experience to say for sure but basic physics would seem to dictate that
> > > having your two disks (partially) mirrored compromises your i/o even
> > more.
> > >
> > > You are in a bit of a hard place. Your operators want the machine to
> boot
> > > even after it loses 50% of its disk.
> > >
> > >
> > >>> Do you have to compact? In other words, do you have read SLAs?
> > >> Unfortunately, I have mixed workload from web applications. I need to
> > write
> > >> and read and SLA is < 50ms.
> > >>
> > >>
> > > Ok. You get the bit that seeks are about 10ms or each so with two disks
> > you
> > > can do 2x100 seeks a second presuming no one else is using disk.
> > >
> > >
> > >>> How are your read times currently?
> > >> Cloudera manager says it's 4K reads per second and 500 writes per
> second
> > >>
> > >>> Does your working dataset fit in RAM or do
> > >> reads have to go to disk?
> > >> I have several tables for 500GB each and many small tables 10-20 GB.
> > Small
> > >> tables loaded hourly/daily using bulkload (prepare HFiles using MR and
> > move
> > >> them to HBase using utility). Big tables are used by webapps, they
> read
> > and
> > >> write them.
> > >>
> > >>
> > > These hfiles are created on same cluster with MR? (i.e. they are using
> up
> > > i/os)
> > >
> > >
> > >>> It looks like you are running at about three storefiles per column
> > family
> > >> is it hbase.hstore.compactionThreshold=3?
> > >>
> > >
> > >
> > >>> What if you upped the threshold at which minors run?
> > >> you mean bump  hbase.hstore.compactionThreshold to 8 or 10?
> > >>
> > >>
> > > Yes.
> > >
> > > Downside is that your reads may require more seeks to find a keyvalue.
> > >
> > > Can you cache more?
> > >
> > > Can you make it so files are bigger before you flush?
> > >
> > >
> > >
> > >>> Do you have a downtime during which you could schedule compactions?
> > >> Unfortunately no. It should work 24/7 and sometimes it doesn't do it.
> > >>
> > >>
> > > So, it is running at full bore 24/7?  There is no 'downtime'... a time
> > when
> > > the traffic is not so heavy?
> > >
> > >
> > >
> > >>> Are you managing the major compactions yourself or are you having
> > hbase do
> > >> it for you?
> > >> HBase, once a day hbase.hregion.majorcompaction=1day
> > >>
> > >>
> > > Have you studied your compactions?  You realize that a major compaction
> > > will do full rewrite of your dataset?  When they run, how many
> storefiles
> > > are there?
> > >
> > > Do you have to run once a day?  Can you not run once a week?  Can you
> > > manage the compactions yourself... and run them a region at a time in a
> > > rolling manner across the cluster rather than have them just run
> whenever
> > > it suits them once a day?
> > >
> > >
> > >
> > >> I can disable WAL. It's ok to loose some data in case of RS failure.
> I'm
> > >> not doing banking transactions.
> > >> If I disable WAL, could it help?
> > >>
> > >>
> > > It could but don't. Enable deferring sync'ing first if you can 'lose'
> > some
> > > data.
> > >
> > > Work on your flushing and compactions before you mess w/ WAL.
> > >
> > > What version of hbase are you on? You say CDH but the newer your hbase,
> > the
> > > better it does generally.
> > >
> > > St.Ack
> > >
> > >
> > >
> > >
> > >
> > >> 2015-05-20 18:04 GMT+03:00 Stack <st...@duboce.net>:
> > >>
> > >>> On Mon, May 18, 2015 at 4:26 PM, Serega Sheypak <
> > >> serega.sheypak@gmail.com>
> > >>> wrote:
> > >>>
> > >>>> Hi, we are using extremely cheap HW:
> > >>>> 2 HHD 7200
> > >>>> 4*2 core (Hyperthreading)
> > >>>> 32GB RAM
> > >>>>
> > >>>> We met serious IO performance issues.
> > >>>> We have more or less even distribution of read/write requests. The
> > same
> > >>> for
> > >>>> datasize.
> > >>>>
> > >>>> ServerName Request Per Second Read Request Count Write Request Count
> > >>>> node01.domain.com,60020,1430172017193 195 171871826 16761699
> > >>>> node02.domain.com,60020,1426925053570 24 34314930 16006603
> > >>>> node03.domain.com,60020,1430860939797 22 32054801 16913299
> > >>>> node04.domain.com,60020,1431975656065 33 1765121 253405
> > >>>> node05.domain.com,60020,1430484646409 27 42248883 16406280
> > >>>> node07.domain.com,60020,1426776403757 27 36324492 16299432
> > >>>> node08.domain.com,60020,1426775898757 26 38507165 13582109
> > >>>> node09.domain.com,60020,1430440612531 27 34360873 15080194
> > >>>> node11.domain.com,60020,1431989669340 28 44307 13466
> > >>>> node12.domain.com,60020,1431927604238 30 5318096 2020855
> > >>>> node13.domain.com,60020,1431372874221 29 31764957 15843688
> > >>>> node14.domain.com,60020,1429640630771 41 36300097 13049801
> > >>>>
> > >>>> ServerName Num. Stores Num. Storefiles Storefile Size Uncompressed
> > >>>> Storefile
> > >>>> Size Index Size Bloom Size
> > >>>> node01.domain.com,60020,1430172017193 82 186 1052080m 76496mb
> 641849k
> > >>>> 310111k
> > >>>> node02.domain.com,60020,1426925053570 82 179 1062730m 79713mb
> 649610k
> > >>>> 318854k
> > >>>> node03.domain.com,60020,1430860939797 82 179 1036597m 76199mb
> 627346k
> > >>>> 307136k
> > >>>> node04.domain.com,60020,1431975656065 82 400 1034624m 76405mb
> 655954k
> > >>>> 289316k
> > >>>> node05.domain.com,60020,1430484646409 82 185 1111807m 81474mb
> 688136k
> > >>>> 334127k
> > >>>> node07.domain.com,60020,1426776403757 82 164 1023217m 74830mb
> 631774k
> > >>>> 296169k
> > >>>> node08.domain.com,60020,1426775898757 81 171 1086446m 79933mb
> 681486k
> > >>>> 312325k
> > >>>> node09.domain.com,60020,1430440612531 81 160 1073852m 77874mb
> 658924k
> > >>>> 309734k
> > >>>> node11.domain.com,60020,1431989669340 81 166 1006322m 75652mb
> 664753k
> > >>>> 264081k
> > >>>> node12.domain.com,60020,1431927604238 82 188 1050229m 75140mb
> 652970k
> > >>>> 304137k
> > >>>> node13.domain.com,60020,1431372874221 82 178 937557m 70042mb
> 601684k
> > >>>> 257607k
> > >>>> node14.domain.com,60020,1429640630771 82 145 949090m 69749mb
> 592812k
> > >>>> 266677k
> > >>>>
> > >>>>
> > >>>> When compaction starts  random node gets I/O 100%, io wait for
> > seconds,
> > >>>> even tenth of seconds.
> > >>>>
> > >>>> What are the approaches to optimize minor and major compactions when
> > >> you
> > >>>> are I/O bound..?
> > >>>>
> > >>>
> > >>> Yeah, with two disks, you will be crimped. Do you have the system
> > sharing
> > >>> with hbase/hdfs or is hdfs running on one disk only?
> > >>>
> > >>> Do you have to compact? In other words, do you have read SLAs?  How
> are
> > >>> your read times currently?  Does your working dataset fit in RAM or
> do
> > >>> reads have to go to disk?  It looks like you are running at about
> three
> > >>> storefiles per column family.  What if you upped the threshold at
> which
> > >>> minors run? Do you have a downtime during which you could schedule
> > >>> compactions? Are you managing the major compactions yourself or are
> you
> > >>> having hbase do it for you?
> > >>>
> > >>> St.Ack
> > >>>
> > >>
> >
> >
>

Re: Optimizing compactions on super-low-cost HW

Posted by lars hofhansl <la...@apache.org>.
Yeah, all you can do is drive your write amplification down.


As Stack said:
- Increase hbase.hstore.compactionThreshold and hbase.hstore.blockingStoreFiles (concrete values are sketched at the end of this mail). It'll hurt reads, but in your case reads are already significantly hurt when compactions happen.


- Absolutely set hbase.hregion.majorcompaction to 1 week (with a jitter of 1/2 week; that's the default in 0.98 and later). Minor compactions will still happen, based on the compactionThreshold setting. Right now you're rewriting _all_ your data _every_ day.


- Turning off WAL writing will save you IO, but I doubt it'll help much. I do not expect async WAL to help a lot, as the aggregate IO is still the same.

- See if you can enable DATA_BLOCK_ENCODING on your column families (FAST_DIFF or PREFIX are good). You can also try SNAPPY compression. That would reduce your overall IO (since your CPUs are also weak, you'd have to test the CPU/IO tradeoff).


- If you have RAM to spare, increase the memstore flush size (will lead to initially larger and fewer files).


- Or (again if you have spare RAM) make your regions smaller, to curb write amplification.


- I assume only the 300g partitions are mirrored, right? (not the entire 2t drive)


I have some suggestions compiled here (if you don't mind the plug): 
http://hadoop-hbase.blogspot.com/2015/05/my-hbasecon-talk-about-hbase.html

Other than that, I'll repeat what others said: you have 14 extremely weak machines, and you can't expect the world from them.
Your aggregate IOPS are less than 3000, and your aggregate IO bandwidth is ~3GB/s. Can you add more machines?
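
(Rough math, assuming ~100 random IOPS and ~100MB/s of sequential throughput per
7200rpm spindle: 14 nodes x 2 disks is roughly 2800 IOPS and roughly 2.8GB/s in
aggregate.)

To make the knobs above concrete, a sketch of what I mean (values are
illustrative starting points to experiment with, not tested recommendations):

  hbase.hstore.compactionThreshold=8
  hbase.hstore.blockingStoreFiles=20
  hbase.hregion.majorcompaction=604800000      (7 days, in milliseconds)
  hbase.hregion.majorcompaction.jitter=0.5     (property name may vary by version)
  hbase.hregion.memstore.flush.size=268435456  (256MB; the default is 128MB)
  hbase.hregion.max.filesize=5368709120        (5GB, for smaller regions)

For the column family settings, from the hbase shell (table and family names are
placeholders; existing HFiles only pick this up once compaction rewrites them):

  alter 'mytable', {NAME => 'cf', DATA_BLOCK_ENCODING => 'FAST_DIFF', COMPRESSION => 'SNAPPY'}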


-- Lars

________________________________
From: Serega Sheypak <se...@gmail.com>
To: user <us...@hbase.apache.org> 
Sent: Friday, May 22, 2015 3:45 AM
Subject: Re: Optimizing compactions on super-low-cost HW


We don't have money, these nodes are the cheapest. I totally agree that we
need 4-6 HDD, but there is no chance to get it unfortunately.
Okay, I'll try yo apply Stack suggestions.




2015-05-22 13:00 GMT+03:00 Michael Segel <mi...@hotmail.com>:

> Look, to be blunt, you’re screwed.
>
> If I read your cluster spec.. it sounds like you have a single i7 (quad
> core) cpu. That’s 4 cores or 8 threads.
>
> Mirroring the OS is common practice.
> Using the same drives for Hadoop… not so good, but once the sever boots
> up… not so much I/O.
> Its not good, but you could live with it….
>
> Your best bet is to add a couple of more spindles. Ideally you’d want to
> have 6 drives. the 2 OS drives mirrored and separate. (Use the extra space
> to stash / write logs.) Then have 4 drives / spindles in JBOD for Hadoop.
> This brings you to a 1:1 on physical cores.  If your box can handle more
> spindles, then going to a total of 10 drives would improve performance
> further.
>
> However, you need to level set your expectations… you can only go so far.
> If you have 4 drives spinning,  you could start to saturate a 1GbE network
> so that will hurt performance.
>
> That’s pretty much your only option in terms of fixing the hardware and
> then you have to start tuning.
>
> > On May 21, 2015, at 4:04 PM, Stack <st...@duboce.net> wrote:
> >
> > On Thu, May 21, 2015 at 1:04 AM, Serega Sheypak <
> serega.sheypak@gmail.com>
> > wrote:
> >
> >>> Do you have the system sharing
> >> There are 2 HDD 7200 2TB each. There is 300GB OS partition on each drive
> >> with mirroring enabled. I can't persuade devops that mirroring could
> cause
> >> IO issues. What arguments can I bring? They use OS partition mirroring
> when
> >> disck fails, we can use other partition to boot OS and continue to
> work...
> >>
> >>
> > You are already compromised i/o-wise having two disks only. I have not
> the
> > experience to say for sure but basic physics would seem to dictate that
> > having your two disks (partially) mirrored compromises your i/o even
> more.
> >
> > You are in a bit of a hard place. Your operators want the machine to boot
> > even after it loses 50% of its disk.
> >
> >
> >>> Do you have to compact? In other words, do you have read SLAs?
> >> Unfortunately, I have mixed workload from web applications. I need to
> write
> >> and read and SLA is < 50ms.
> >>
> >>
> > Ok. You get the bit that seeks are about 10ms or each so with two disks
> you
> > can do 2x100 seeks a second presuming no one else is using disk.
> >
> >
> >>> How are your read times currently?
> >> Cloudera manager says it's 4K reads per second and 500 writes per second
> >>
> >>> Does your working dataset fit in RAM or do
> >> reads have to go to disk?
> >> I have several tables for 500GB each and many small tables 10-20 GB.
> Small
> >> tables loaded hourly/daily using bulkload (prepare HFiles using MR and
> move
> >> them to HBase using utility). Big tables are used by webapps, they read
> and
> >> write them.
> >>
> >>
> > These hfiles are created on same cluster with MR? (i.e. they are using up
> > i/os)
> >
> >
> >>> It looks like you are running at about three storefiles per column
> family
> >> is it hbase.hstore.compactionThreshold=3?
> >>
> >
> >
> >>> What if you upped the threshold at which minors run?
> >> you mean bump  hbase.hstore.compactionThreshold to 8 or 10?
> >>
> >>
> > Yes.
> >
> > Downside is that your reads may require more seeks to find a keyvalue.
> >
> > Can you cache more?
> >
> > Can you make it so files are bigger before you flush?
> >
> >
> >
> >>> Do you have a downtime during which you could schedule compactions?
> >> Unfortunately no. It should work 24/7 and sometimes it doesn't do it.
> >>
> >>
> > So, it is running at full bore 24/7?  There is no 'downtime'... a time
> when
> > the traffic is not so heavy?
> >
> >
> >
> >>> Are you managing the major compactions yourself or are you having
> hbase do
> >> it for you?
> >> HBase, once a day hbase.hregion.majorcompaction=1day
> >>
> >>
> > Have you studied your compactions?  You realize that a major compaction
> > will do full rewrite of your dataset?  When they run, how many storefiles
> > are there?
> >
> > Do you have to run once a day?  Can you not run once a week?  Can you
> > manage the compactions yourself... and run them a region at a time in a
> > rolling manner across the cluster rather than have them just run whenever
> > it suits them once a day?
> >
> >
> >
> >> I can disable WAL. It's ok to loose some data in case of RS failure. I'm
> >> not doing banking transactions.
> >> If I disable WAL, could it help?
> >>
> >>
> > It could but don't. Enable deferring sync'ing first if you can 'lose'
> some
> > data.
> >
> > Work on your flushing and compactions before you mess w/ WAL.
> >
> > What version of hbase are you on? You say CDH but the newer your hbase,
> the
> > better it does generally.
> >
> > St.Ack
> >
> >
> >
> >
> >
> >> 2015-05-20 18:04 GMT+03:00 Stack <st...@duboce.net>:
> >>
> >>> On Mon, May 18, 2015 at 4:26 PM, Serega Sheypak <
> >> serega.sheypak@gmail.com>
> >>> wrote:
> >>>
> >>>> Hi, we are using extremely cheap HW:
> >>>> 2 HHD 7200
> >>>> 4*2 core (Hyperthreading)
> >>>> 32GB RAM
> >>>>
> >>>> We met serious IO performance issues.
> >>>> We have more or less even distribution of read/write requests. The
> same
> >>> for
> >>>> datasize.
> >>>>
> >>>> ServerName Request Per Second Read Request Count Write Request Count
> >>>> node01.domain.com,60020,1430172017193 195 171871826 16761699
> >>>> node02.domain.com,60020,1426925053570 24 34314930 16006603
> >>>> node03.domain.com,60020,1430860939797 22 32054801 16913299
> >>>> node04.domain.com,60020,1431975656065 33 1765121 253405
> >>>> node05.domain.com,60020,1430484646409 27 42248883 16406280
> >>>> node07.domain.com,60020,1426776403757 27 36324492 16299432
> >>>> node08.domain.com,60020,1426775898757 26 38507165 13582109
> >>>> node09.domain.com,60020,1430440612531 27 34360873 15080194
> >>>> node11.domain.com,60020,1431989669340 28 44307 13466
> >>>> node12.domain.com,60020,1431927604238 30 5318096 2020855
> >>>> node13.domain.com,60020,1431372874221 29 31764957 15843688
> >>>> node14.domain.com,60020,1429640630771 41 36300097 13049801
> >>>>
> >>>> ServerName Num. Stores Num. Storefiles Storefile Size Uncompressed
> >>>> Storefile
> >>>> Size Index Size Bloom Size
> >>>> node01.domain.com,60020,1430172017193 82 186 1052080m 76496mb 641849k
> >>>> 310111k
> >>>> node02.domain.com,60020,1426925053570 82 179 1062730m 79713mb 649610k
> >>>> 318854k
> >>>> node03.domain.com,60020,1430860939797 82 179 1036597m 76199mb 627346k
> >>>> 307136k
> >>>> node04.domain.com,60020,1431975656065 82 400 1034624m 76405mb 655954k
> >>>> 289316k
> >>>> node05.domain.com,60020,1430484646409 82 185 1111807m 81474mb 688136k
> >>>> 334127k
> >>>> node07.domain.com,60020,1426776403757 82 164 1023217m 74830mb 631774k
> >>>> 296169k
> >>>> node08.domain.com,60020,1426775898757 81 171 1086446m 79933mb 681486k
> >>>> 312325k
> >>>> node09.domain.com,60020,1430440612531 81 160 1073852m 77874mb 658924k
> >>>> 309734k
> >>>> node11.domain.com,60020,1431989669340 81 166 1006322m 75652mb 664753k
> >>>> 264081k
> >>>> node12.domain.com,60020,1431927604238 82 188 1050229m 75140mb 652970k
> >>>> 304137k
> >>>> node13.domain.com,60020,1431372874221 82 178 937557m 70042mb 601684k
> >>>> 257607k
> >>>> node14.domain.com,60020,1429640630771 82 145 949090m 69749mb 592812k
> >>>> 266677k
> >>>>
> >>>>
> >>>> When compaction starts  random node gets I/O 100%, io wait for
> seconds,
> >>>> even tenth of seconds.
> >>>>
> >>>> What are the approaches to optimize minor and major compactions when
> >> you
> >>>> are I/O bound..?
> >>>>
> >>>
> >>> Yeah, with two disks, you will be crimped. Do you have the system
> sharing
> >>> with hbase/hdfs or is hdfs running on one disk only?
> >>>
> >>> Do you have to compact? In other words, do you have read SLAs?  How are
> >>> your read times currently?  Does your working dataset fit in RAM or do
> >>> reads have to go to disk?  It looks like you are running at about three
> >>> storefiles per column family.  What if you upped the threshold at which
> >>> minors run? Do you have a downtime during which you could schedule
> >>> compactions? Are you managing the major compactions yourself or are you
> >>> having hbase do it for you?
> >>>
> >>> St.Ack
> >>>
> >>
>
>

Re: Optimizing compactions on super-low-cost HW

Posted by Serega Sheypak <se...@gmail.com>.
We don't have money; these nodes are the cheapest. I totally agree that we
need 4-6 HDDs, but unfortunately there is no chance to get them.
Okay, I'll try to apply Stack's suggestions.

2015-05-22 13:00 GMT+03:00 Michael Segel <mi...@hotmail.com>:

> Look, to be blunt, you’re screwed.
>
> If I read your cluster spec.. it sounds like you have a single i7 (quad
> core) cpu. That’s 4 cores or 8 threads.
>
> Mirroring the OS is common practice.
> Using the same drives for Hadoop… not so good, but once the sever boots
> up… not so much I/O.
> Its not good, but you could live with it….
>
> Your best bet is to add a couple of more spindles. Ideally you’d want to
> have 6 drives. the 2 OS drives mirrored and separate. (Use the extra space
> to stash / write logs.) Then have 4 drives / spindles in JBOD for Hadoop.
> This brings you to a 1:1 on physical cores.  If your box can handle more
> spindles, then going to a total of 10 drives would improve performance
> further.
>
> However, you need to level set your expectations… you can only go so far.
> If you have 4 drives spinning,  you could start to saturate a 1GbE network
> so that will hurt performance.
>
> That’s pretty much your only option in terms of fixing the hardware and
> then you have to start tuning.
>
> > On May 21, 2015, at 4:04 PM, Stack <st...@duboce.net> wrote:
> >
> > On Thu, May 21, 2015 at 1:04 AM, Serega Sheypak <
> serega.sheypak@gmail.com>
> > wrote:
> >
> >>> Do you have the system sharing
> >> There are 2 HDD 7200 2TB each. There is 300GB OS partition on each drive
> >> with mirroring enabled. I can't persuade devops that mirroring could
> cause
> >> IO issues. What arguments can I bring? They use OS partition mirroring
> when
> >> disck fails, we can use other partition to boot OS and continue to
> work...
> >>
> >>
> > You are already compromised i/o-wise having two disks only. I have not
> the
> > experience to say for sure but basic physics would seem to dictate that
> > having your two disks (partially) mirrored compromises your i/o even
> more.
> >
> > You are in a bit of a hard place. Your operators want the machine to boot
> > even after it loses 50% of its disk.
> >
> >
> >>> Do you have to compact? In other words, do you have read SLAs?
> >> Unfortunately, I have mixed workload from web applications. I need to
> write
> >> and read and SLA is < 50ms.
> >>
> >>
> > Ok. You get the bit that seeks are about 10ms or each so with two disks
> you
> > can do 2x100 seeks a second presuming no one else is using disk.
> >
> >
> >>> How are your read times currently?
> >> Cloudera manager says it's 4K reads per second and 500 writes per second
> >>
> >>> Does your working dataset fit in RAM or do
> >> reads have to go to disk?
> >> I have several tables for 500GB each and many small tables 10-20 GB.
> Small
> >> tables loaded hourly/daily using bulkload (prepare HFiles using MR and
> move
> >> them to HBase using utility). Big tables are used by webapps, they read
> and
> >> write them.
> >>
> >>
> > These hfiles are created on same cluster with MR? (i.e. they are using up
> > i/os)
> >
> >
> >>> It looks like you are running at about three storefiles per column
> family
> >> is it hbase.hstore.compactionThreshold=3?
> >>
> >
> >
> >>> What if you upped the threshold at which minors run?
> >> you mean bump  hbase.hstore.compactionThreshold to 8 or 10?
> >>
> >>
> > Yes.
> >
> > Downside is that your reads may require more seeks to find a keyvalue.
> >
> > Can you cache more?
> >
> > Can you make it so files are bigger before you flush?
> >
> >
> >
> >>> Do you have a downtime during which you could schedule compactions?
> >> Unfortunately no. It should work 24/7 and sometimes it doesn't do it.
> >>
> >>
> > So, it is running at full bore 24/7?  There is no 'downtime'... a time
> when
> > the traffic is not so heavy?
> >
> >
> >
> >>> Are you managing the major compactions yourself or are you having
> hbase do
> >> it for you?
> >> HBase, once a day hbase.hregion.majorcompaction=1day
> >>
> >>
> > Have you studied your compactions?  You realize that a major compaction
> > will do full rewrite of your dataset?  When they run, how many storefiles
> > are there?
> >
> > Do you have to run once a day?  Can you not run once a week?  Can you
> > manage the compactions yourself... and run them a region at a time in a
> > rolling manner across the cluster rather than have them just run whenever
> > it suits them once a day?
> >
> >
> >
> >> I can disable WAL. It's ok to loose some data in case of RS failure. I'm
> >> not doing banking transactions.
> >> If I disable WAL, could it help?
> >>
> >>
> > It could but don't. Enable deferring sync'ing first if you can 'lose'
> some
> > data.
> >
> > Work on your flushing and compactions before you mess w/ WAL.
> >
> > What version of hbase are you on? You say CDH but the newer your hbase,
> the
> > better it does generally.
> >
> > St.Ack
> >
> >
> >
> >
> >
> >> 2015-05-20 18:04 GMT+03:00 Stack <st...@duboce.net>:
> >>
> >>> On Mon, May 18, 2015 at 4:26 PM, Serega Sheypak <
> >> serega.sheypak@gmail.com>
> >>> wrote:
> >>>
> >>>> Hi, we are using extremely cheap HW:
> >>>> 2 HHD 7200
> >>>> 4*2 core (Hyperthreading)
> >>>> 32GB RAM
> >>>>
> >>>> We met serious IO performance issues.
> >>>> We have more or less even distribution of read/write requests. The
> same
> >>> for
> >>>> datasize.
> >>>>
> >>>> ServerName Request Per Second Read Request Count Write Request Count
> >>>> node01.domain.com,60020,1430172017193 195 171871826 16761699
> >>>> node02.domain.com,60020,1426925053570 24 34314930 16006603
> >>>> node03.domain.com,60020,1430860939797 22 32054801 16913299
> >>>> node04.domain.com,60020,1431975656065 33 1765121 253405
> >>>> node05.domain.com,60020,1430484646409 27 42248883 16406280
> >>>> node07.domain.com,60020,1426776403757 27 36324492 16299432
> >>>> node08.domain.com,60020,1426775898757 26 38507165 13582109
> >>>> node09.domain.com,60020,1430440612531 27 34360873 15080194
> >>>> node11.domain.com,60020,1431989669340 28 44307 13466
> >>>> node12.domain.com,60020,1431927604238 30 5318096 2020855
> >>>> node13.domain.com,60020,1431372874221 29 31764957 15843688
> >>>> node14.domain.com,60020,1429640630771 41 36300097 13049801
> >>>>
> >>>> ServerName Num. Stores Num. Storefiles Storefile Size Uncompressed
> >>>> Storefile
> >>>> Size Index Size Bloom Size
> >>>> node01.domain.com,60020,1430172017193 82 186 1052080m 76496mb 641849k
> >>>> 310111k
> >>>> node02.domain.com,60020,1426925053570 82 179 1062730m 79713mb 649610k
> >>>> 318854k
> >>>> node03.domain.com,60020,1430860939797 82 179 1036597m 76199mb 627346k
> >>>> 307136k
> >>>> node04.domain.com,60020,1431975656065 82 400 1034624m 76405mb 655954k
> >>>> 289316k
> >>>> node05.domain.com,60020,1430484646409 82 185 1111807m 81474mb 688136k
> >>>> 334127k
> >>>> node07.domain.com,60020,1426776403757 82 164 1023217m 74830mb 631774k
> >>>> 296169k
> >>>> node08.domain.com,60020,1426775898757 81 171 1086446m 79933mb 681486k
> >>>> 312325k
> >>>> node09.domain.com,60020,1430440612531 81 160 1073852m 77874mb 658924k
> >>>> 309734k
> >>>> node11.domain.com,60020,1431989669340 81 166 1006322m 75652mb 664753k
> >>>> 264081k
> >>>> node12.domain.com,60020,1431927604238 82 188 1050229m 75140mb 652970k
> >>>> 304137k
> >>>> node13.domain.com,60020,1431372874221 82 178 937557m 70042mb 601684k
> >>>> 257607k
> >>>> node14.domain.com,60020,1429640630771 82 145 949090m 69749mb 592812k
> >>>> 266677k
> >>>>
> >>>>
> >>>> When compaction starts  random node gets I/O 100%, io wait for
> seconds,
> >>>> even tenth of seconds.
> >>>>
> >>>> What are the approaches to optimize minor and major compactions when
> >> you
> >>>> are I/O bound..?
> >>>>
> >>>
> >>> Yeah, with two disks, you will be crimped. Do you have the system
> sharing
> >>> with hbase/hdfs or is hdfs running on one disk only?
> >>>
> >>> Do you have to compact? In other words, do you have read SLAs?  How are
> >>> your read times currently?  Does your working dataset fit in RAM or do
> >>> reads have to go to disk?  It looks like you are running at about three
> >>> storefiles per column family.  What if you upped the threshold at which
> >>> minors run? Do you have a downtime during which you could schedule
> >>> compactions? Are you managing the major compactions yourself or are you
> >>> having hbase do it for you?
> >>>
> >>> St.Ack
> >>>
> >>
>
>

Re: Optimizing compactions on super-low-cost HW

Posted by Michael Segel <mi...@hotmail.com>.
Look, to be blunt, you’re screwed. 

If I read your cluster spec correctly, it sounds like you have a single i7 (quad core) CPU. That’s 4 cores or 8 threads.

Mirroring the OS is common practice. 
Using the same drives for Hadoop… not so good, but once the server boots up… not so much I/O.
It's not good, but you could live with it….

Your best bet is to add a couple more spindles. Ideally you’d want to have 6 drives: the 2 OS drives mirrored and separate (use the extra space to stash / write logs), then 4 drives / spindles in JBOD for Hadoop. This brings you to a 1:1 ratio of spindles to physical cores. If your box can handle more spindles, then going to a total of 10 drives would improve performance further.

However, you need to level-set your expectations… you can only go so far. If you have 4 drives spinning, you could start to saturate a 1GbE network, and that will hurt performance.
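
(Back of the envelope: four 7200rpm drives can stream on the order of
400-500MB/s combined, while 1GbE tops out around 110-120MB/s, so the wire
becomes the bottleneck well before the disks do.)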

That’s pretty much your only option in terms of fixing the hardware and then you have to start tuning.

> On May 21, 2015, at 4:04 PM, Stack <st...@duboce.net> wrote:
> 
> On Thu, May 21, 2015 at 1:04 AM, Serega Sheypak <se...@gmail.com>
> wrote:
> 
>>> Do you have the system sharing
>> There are 2 HDD 7200 2TB each. There is 300GB OS partition on each drive
>> with mirroring enabled. I can't persuade devops that mirroring could cause
>> IO issues. What arguments can I bring? They use OS partition mirroring when
>> disck fails, we can use other partition to boot OS and continue to work...
>> 
>> 
> You are already compromised i/o-wise having two disks only. I have not the
> experience to say for sure but basic physics would seem to dictate that
> having your two disks (partially) mirrored compromises your i/o even more.
> 
> You are in a bit of a hard place. Your operators want the machine to boot
> even after it loses 50% of its disk.
> 
> 
>>> Do you have to compact? In other words, do you have read SLAs?
>> Unfortunately, I have mixed workload from web applications. I need to write
>> and read and SLA is < 50ms.
>> 
>> 
> Ok. You get the bit that seeks are about 10ms or each so with two disks you
> can do 2x100 seeks a second presuming no one else is using disk.
> 
> 
>>> How are your read times currently?
>> Cloudera manager says it's 4K reads per second and 500 writes per second
>> 
>>> Does your working dataset fit in RAM or do
>> reads have to go to disk?
>> I have several tables for 500GB each and many small tables 10-20 GB. Small
>> tables loaded hourly/daily using bulkload (prepare HFiles using MR and move
>> them to HBase using utility). Big tables are used by webapps, they read and
>> write them.
>> 
>> 
> These hfiles are created on same cluster with MR? (i.e. they are using up
> i/os)
> 
> 
>>> It looks like you are running at about three storefiles per column family
>> is it hbase.hstore.compactionThreshold=3?
>> 
> 
> 
>>> What if you upped the threshold at which minors run?
>> you mean bump  hbase.hstore.compactionThreshold to 8 or 10?
>> 
>> 
> Yes.
> 
> Downside is that your reads may require more seeks to find a keyvalue.
> 
> Can you cache more?
> 
> Can you make it so files are bigger before you flush?
> 
> 
> 
>>> Do you have a downtime during which you could schedule compactions?
>> Unfortunately no. It should work 24/7 and sometimes it doesn't do it.
>> 
>> 
> So, it is running at full bore 24/7?  There is no 'downtime'... a time when
> the traffic is not so heavy?
> 
> 
> 
>>> Are you managing the major compactions yourself or are you having hbase do
>> it for you?
>> HBase, once a day hbase.hregion.majorcompaction=1day
>> 
>> 
> Have you studied your compactions?  You realize that a major compaction
> will do full rewrite of your dataset?  When they run, how many storefiles
> are there?
> 
> Do you have to run once a day?  Can you not run once a week?  Can you
> manage the compactions yourself... and run them a region at a time in a
> rolling manner across the cluster rather than have them just run whenever
> it suits them once a day?
> 
> 
> 
>> I can disable WAL. It's ok to loose some data in case of RS failure. I'm
>> not doing banking transactions.
>> If I disable WAL, could it help?
>> 
>> 
> It could but don't. Enable deferring sync'ing first if you can 'lose' some
> data.
> 
> Work on your flushing and compactions before you mess w/ WAL.
> 
> What version of hbase are you on? You say CDH but the newer your hbase, the
> better it does generally.
> 
> St.Ack
> 
> 
> 
> 
> 
>> 2015-05-20 18:04 GMT+03:00 Stack <st...@duboce.net>:
>> 
>>> On Mon, May 18, 2015 at 4:26 PM, Serega Sheypak <
>> serega.sheypak@gmail.com>
>>> wrote:
>>> 
>>>> Hi, we are using extremely cheap HW:
>>>> 2 HHD 7200
>>>> 4*2 core (Hyperthreading)
>>>> 32GB RAM
>>>> 
>>>> We met serious IO performance issues.
>>>> We have more or less even distribution of read/write requests. The same
>>> for
>>>> datasize.
>>>> 
>>>> ServerName Request Per Second Read Request Count Write Request Count
>>>> node01.domain.com,60020,1430172017193 195 171871826 16761699
>>>> node02.domain.com,60020,1426925053570 24 34314930 16006603
>>>> node03.domain.com,60020,1430860939797 22 32054801 16913299
>>>> node04.domain.com,60020,1431975656065 33 1765121 253405
>>>> node05.domain.com,60020,1430484646409 27 42248883 16406280
>>>> node07.domain.com,60020,1426776403757 27 36324492 16299432
>>>> node08.domain.com,60020,1426775898757 26 38507165 13582109
>>>> node09.domain.com,60020,1430440612531 27 34360873 15080194
>>>> node11.domain.com,60020,1431989669340 28 44307 13466
>>>> node12.domain.com,60020,1431927604238 30 5318096 2020855
>>>> node13.domain.com,60020,1431372874221 29 31764957 15843688
>>>> node14.domain.com,60020,1429640630771 41 36300097 13049801
>>>> 
>>>> ServerName Num. Stores Num. Storefiles Storefile Size Uncompressed
>>>> Storefile
>>>> Size Index Size Bloom Size
>>>> node01.domain.com,60020,1430172017193 82 186 1052080m 76496mb 641849k
>>>> 310111k
>>>> node02.domain.com,60020,1426925053570 82 179 1062730m 79713mb 649610k
>>>> 318854k
>>>> node03.domain.com,60020,1430860939797 82 179 1036597m 76199mb 627346k
>>>> 307136k
>>>> node04.domain.com,60020,1431975656065 82 400 1034624m 76405mb 655954k
>>>> 289316k
>>>> node05.domain.com,60020,1430484646409 82 185 1111807m 81474mb 688136k
>>>> 334127k
>>>> node07.domain.com,60020,1426776403757 82 164 1023217m 74830mb 631774k
>>>> 296169k
>>>> node08.domain.com,60020,1426775898757 81 171 1086446m 79933mb 681486k
>>>> 312325k
>>>> node09.domain.com,60020,1430440612531 81 160 1073852m 77874mb 658924k
>>>> 309734k
>>>> node11.domain.com,60020,1431989669340 81 166 1006322m 75652mb 664753k
>>>> 264081k
>>>> node12.domain.com,60020,1431927604238 82 188 1050229m 75140mb 652970k
>>>> 304137k
>>>> node13.domain.com,60020,1431372874221 82 178 937557m 70042mb 601684k
>>>> 257607k
>>>> node14.domain.com,60020,1429640630771 82 145 949090m 69749mb 592812k
>>>> 266677k
>>>> 
>>>> 
>>>> When compaction starts  random node gets I/O 100%, io wait for seconds,
>>>> even tenth of seconds.
>>>> 
>>>> What are the approaches to optimize minor and major compactions when
>> you
>>>> are I/O bound..?
>>>> 
>>> 
>>> Yeah, with two disks, you will be crimped. Do you have the system sharing
>>> with hbase/hdfs or is hdfs running on one disk only?
>>> 
>>> Do you have to compact? In other words, do you have read SLAs?  How are
>>> your read times currently?  Does your working dataset fit in RAM or do
>>> reads have to go to disk?  It looks like you are running at about three
>>> storefiles per column family.  What if you upped the threshold at which
>>> minors run? Do you have a downtime during which you could schedule
>>> compactions? Are you managing the major compactions yourself or are you
>>> having hbase do it for you?
>>> 
>>> St.Ack
>>> 
>> 


Re: Optimizing compactions on super-low-cost HW

Posted by Stack <st...@duboce.net>.
On Thu, May 21, 2015 at 1:04 AM, Serega Sheypak <se...@gmail.com>
wrote:

> > Do you have the system sharing
> There are 2 HDD 7200 2TB each. There is 300GB OS partition on each drive
> with mirroring enabled. I can't persuade devops that mirroring could cause
> IO issues. What arguments can I bring? They use OS partition mirroring when
> disck fails, we can use other partition to boot OS and continue to work...
>
>
You are already compromised i/o-wise having two disks only. I have not the
experience to say for sure but basic physics would seem to dictate that
having your two disks (partially) mirrored compromises your i/o even more.

You are in a bit of a hard place. Your operators want the machine to boot
even after it loses 50% of its disk.


> >Do you have to compact? In other words, do you have read SLAs?
> Unfortunately, I have mixed workload from web applications. I need to write
> and read and SLA is < 50ms.
>
>
Ok. You get the bit that seeks are about 10ms each, so with two disks you
can do 2x100 seeks a second, presuming no one else is using the disks.
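
(That is simply 1000ms / 10ms = ~100 random seeks per spindle per second, so
~200/s across the two disks, and that budget is shared by reads, flushes,
compactions and WAL syncs.)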


> >How are your read times currently?
> Cloudera manager says it's 4K reads per second and 500 writes per second
>
> >Does your working dataset fit in RAM or do
> reads have to go to disk?
> I have several tables for 500GB each and many small tables 10-20 GB. Small
> tables loaded hourly/daily using bulkload (prepare HFiles using MR and move
> them to HBase using utility). Big tables are used by webapps, they read and
> write them.
>
>
These hfiles are created on the same cluster with MR? (i.e. they are using up
i/os)


> >It looks like you are running at about three storefiles per column family
> is it hbase.hstore.compactionThreshold=3?
>


> >What if you upped the threshold at which minors run?
> you mean bump  hbase.hstore.compactionThreshold to 8 or 10?
>
>
Yes.

Downside is that your reads may require more seeks to find a keyvalue.

Can you cache more?

Can you make it so files are bigger before you flush?
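
For example (illustrative only; you'd have to balance it against your 32GB of
RAM and the regionserver heap, and the block cache and memstore heap fractions
together have to stay within HBase's combined limit):

  hfile.block.cache.size=0.4                   (give the block cache a bigger slice of heap)
  hbase.hregion.memstore.flush.size=268435456  (256MB, so files are bigger before they flush)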



> >Do you have a downtime during which you could schedule compactions?
> Unfortunately no. It should work 24/7 and sometimes it doesn't do it.
>
>
So, it is running at full bore 24/7?  There is no 'downtime'... a time when
the traffic is not so heavy?



> >Are you managing the major compactions yourself or are you having hbase do
> it for you?
> HBase, once a day hbase.hregion.majorcompaction=1day
>
>
Have you studied your compactions? You realize that a major compaction
will do a full rewrite of your dataset? When they run, how many storefiles
are there?

Do you have to run once a day?  Can you not run once a week?  Can you
manage the compactions yourself... and run them a region at a time in a
rolling manner across the cluster rather than have them just run whenever
it suits them once a day?
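
A minimal sketch of what 'manage it yourself' could look like (names are
placeholders; pace it so only one region is compacting at a time):

  hbase.hregion.majorcompaction=0    (turn off time-based majors)

  then from the hbase shell, driven by cron or a small script:

  major_compact 'REGION_NAME_HERE'   # one region at a time
  major_compact 'mytable'            # or a whole table during a quieter window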



> I can disable WAL. It's ok to loose some data in case of RS failure. I'm
> not doing banking transactions.
> If I disable WAL, could it help?
>
>
It could but don't. Enable deferring sync'ing first if you can 'lose' some
data.
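
If you do go there, deferred sync'ing is a per-table setting; depending on your
hbase version it looks something like this from the shell:

  alter 'mytable', DURABILITY => 'ASYNC_WAL'       # 0.98 and later
  alter 'mytable', DEFERRED_LOG_FLUSH => 'true'    # older releases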

Work on your flushing and compactions before you mess w/ WAL.

What version of hbase are you on? You say CDH but the newer your hbase, the
better it does generally.

St.Ack





> 2015-05-20 18:04 GMT+03:00 Stack <st...@duboce.net>:
>
> > On Mon, May 18, 2015 at 4:26 PM, Serega Sheypak <
> serega.sheypak@gmail.com>
> > wrote:
> >
> > > Hi, we are using extremely cheap HW:
> > > 2 HHD 7200
> > > 4*2 core (Hyperthreading)
> > > 32GB RAM
> > >
> > > We met serious IO performance issues.
> > > We have more or less even distribution of read/write requests. The same
> > for
> > > datasize.
> > >
> > > ServerName Request Per Second Read Request Count Write Request Count
> > > node01.domain.com,60020,1430172017193 195 171871826 16761699
> > > node02.domain.com,60020,1426925053570 24 34314930 16006603
> > > node03.domain.com,60020,1430860939797 22 32054801 16913299
> > > node04.domain.com,60020,1431975656065 33 1765121 253405
> > > node05.domain.com,60020,1430484646409 27 42248883 16406280
> > > node07.domain.com,60020,1426776403757 27 36324492 16299432
> > > node08.domain.com,60020,1426775898757 26 38507165 13582109
> > > node09.domain.com,60020,1430440612531 27 34360873 15080194
> > > node11.domain.com,60020,1431989669340 28 44307 13466
> > > node12.domain.com,60020,1431927604238 30 5318096 2020855
> > > node13.domain.com,60020,1431372874221 29 31764957 15843688
> > > node14.domain.com,60020,1429640630771 41 36300097 13049801
> > >
> > > ServerName Num. Stores Num. Storefiles Storefile Size Uncompressed
> > > Storefile
> > > Size Index Size Bloom Size
> > > node01.domain.com,60020,1430172017193 82 186 1052080m 76496mb 641849k
> > > 310111k
> > > node02.domain.com,60020,1426925053570 82 179 1062730m 79713mb 649610k
> > > 318854k
> > > node03.domain.com,60020,1430860939797 82 179 1036597m 76199mb 627346k
> > > 307136k
> > > node04.domain.com,60020,1431975656065 82 400 1034624m 76405mb 655954k
> > > 289316k
> > > node05.domain.com,60020,1430484646409 82 185 1111807m 81474mb 688136k
> > > 334127k
> > > node07.domain.com,60020,1426776403757 82 164 1023217m 74830mb 631774k
> > > 296169k
> > > node08.domain.com,60020,1426775898757 81 171 1086446m 79933mb 681486k
> > > 312325k
> > > node09.domain.com,60020,1430440612531 81 160 1073852m 77874mb 658924k
> > > 309734k
> > > node11.domain.com,60020,1431989669340 81 166 1006322m 75652mb 664753k
> > > 264081k
> > > node12.domain.com,60020,1431927604238 82 188 1050229m 75140mb 652970k
> > > 304137k
> > > node13.domain.com,60020,1431372874221 82 178 937557m 70042mb 601684k
> > > 257607k
> > > node14.domain.com,60020,1429640630771 82 145 949090m 69749mb 592812k
> > > 266677k
> > >
> > >
> > > When compaction starts  random node gets I/O 100%, io wait for seconds,
> > > even tenth of seconds.
> > >
> > > What are the approaches to optimize minor and major compactions when
> you
> > > are I/O bound..?
> > >
> >
> > Yeah, with two disks, you will be crimped. Do you have the system sharing
> > with hbase/hdfs or is hdfs running on one disk only?
> >
> > Do you have to compact? In other words, do you have read SLAs?  How are
> > your read times currently?  Does your working dataset fit in RAM or do
> > reads have to go to disk?  It looks like you are running at about three
> > storefiles per column family.  What if you upped the threshold at which
> > minors run? Do you have a downtime during which you could schedule
> > compactions? Are you managing the major compactions yourself or are you
> > having hbase do it for you?
> >
> > St.Ack
> >
>

Re: Optimizing compactions on super-low-cost HW

Posted by Serega Sheypak <se...@gmail.com>.
> Do you have the system sharing
There are 2 HDDs, 7200rpm, 2TB each. There is a 300GB OS partition on each drive
with mirroring enabled. I can't persuade devops that mirroring could cause
IO issues. What arguments can I bring? They use OS partition mirroring so that when a
disk fails, we can use the other partition to boot the OS and continue to work...

>Do you have to compact? In other words, do you have read SLAs?
Unfortunately, I have a mixed workload from web applications. I need to write
and read, and the SLA is < 50ms.

>How are your read times currently?
Cloudera manager says it's 4K reads per second and 500 writes per second

>Does your working dataset fit in RAM or do
reads have to go to disk?
I have several tables of 500GB each and many small tables of 10-20 GB. Small
tables are loaded hourly/daily using bulkload (we prepare HFiles using MR and move
them to HBase using the bulkload utility). Big tables are used by webapps; they read and
write them.

>It looks like you are running at about three storefiles per column family
is it hbase.hstore.compactionThreshold=3?

>What if you upped the threshold at which minors run?
you mean bump  hbase.hstore.compactionThreshold to 8 or 10?

>Do you have a downtime during which you could schedule compactions?
Unfortunately no. It should work 24/7, and sometimes it doesn't.

>Are you managing the major compactions yourself or are you having hbase do
it for you?
HBase, once a day hbase.hregion.majorcompaction=1day

I can disable WAL. It's ok to lose some data in case of RS failure. I'm
not doing banking transactions.
If I disable WAL, could it help?

2015-05-20 18:04 GMT+03:00 Stack <st...@duboce.net>:

> On Mon, May 18, 2015 at 4:26 PM, Serega Sheypak <se...@gmail.com>
> wrote:
>
> > Hi, we are using extremely cheap HW:
> > 2 HHD 7200
> > 4*2 core (Hyperthreading)
> > 32GB RAM
> >
> > We met serious IO performance issues.
> > We have more or less even distribution of read/write requests. The same
> for
> > datasize.
> >
> > ServerName Request Per Second Read Request Count Write Request Count
> > node01.domain.com,60020,1430172017193 195 171871826 16761699
> > node02.domain.com,60020,1426925053570 24 34314930 16006603
> > node03.domain.com,60020,1430860939797 22 32054801 16913299
> > node04.domain.com,60020,1431975656065 33 1765121 253405
> > node05.domain.com,60020,1430484646409 27 42248883 16406280
> > node07.domain.com,60020,1426776403757 27 36324492 16299432
> > node08.domain.com,60020,1426775898757 26 38507165 13582109
> > node09.domain.com,60020,1430440612531 27 34360873 15080194
> > node11.domain.com,60020,1431989669340 28 44307 13466
> > node12.domain.com,60020,1431927604238 30 5318096 2020855
> > node13.domain.com,60020,1431372874221 29 31764957 15843688
> > node14.domain.com,60020,1429640630771 41 36300097 13049801
> >
> > ServerName Num. Stores Num. Storefiles Storefile Size Uncompressed
> > Storefile
> > Size Index Size Bloom Size
> > node01.domain.com,60020,1430172017193 82 186 1052080m 76496mb 641849k
> > 310111k
> > node02.domain.com,60020,1426925053570 82 179 1062730m 79713mb 649610k
> > 318854k
> > node03.domain.com,60020,1430860939797 82 179 1036597m 76199mb 627346k
> > 307136k
> > node04.domain.com,60020,1431975656065 82 400 1034624m 76405mb 655954k
> > 289316k
> > node05.domain.com,60020,1430484646409 82 185 1111807m 81474mb 688136k
> > 334127k
> > node07.domain.com,60020,1426776403757 82 164 1023217m 74830mb 631774k
> > 296169k
> > node08.domain.com,60020,1426775898757 81 171 1086446m 79933mb 681486k
> > 312325k
> > node09.domain.com,60020,1430440612531 81 160 1073852m 77874mb 658924k
> > 309734k
> > node11.domain.com,60020,1431989669340 81 166 1006322m 75652mb 664753k
> > 264081k
> > node12.domain.com,60020,1431927604238 82 188 1050229m 75140mb 652970k
> > 304137k
> > node13.domain.com,60020,1431372874221 82 178 937557m 70042mb 601684k
> > 257607k
> > node14.domain.com,60020,1429640630771 82 145 949090m 69749mb 592812k
> > 266677k
> >
> >
> > When compaction starts  random node gets I/O 100%, io wait for seconds,
> > even tenth of seconds.
> >
> > What are the approaches to optimize minor and major compactions when you
> > are I/O bound..?
> >
>
> Yeah, with two disks, you will be crimped. Do you have the system sharing
> with hbase/hdfs or is hdfs running on one disk only?
>
> Do you have to compact? In other words, do you have read SLAs?  How are
> your read times currently?  Does your working dataset fit in RAM or do
> reads have to go to disk?  It looks like you are running at about three
> storefiles per column family.  What if you upped the threshold at which
> minors run? Do you have a downtime during which you could schedule
> compactions? Are you managing the major compactions yourself or are you
> having hbase do it for you?
>
> St.Ack
>

Re: Optimizing compactions on super-low-cost HW

Posted by Stack <st...@duboce.net>.
On Mon, May 18, 2015 at 4:26 PM, Serega Sheypak <se...@gmail.com>
wrote:

> Hi, we are using extremely cheap HW:
> 2 HHD 7200
> 4*2 core (Hyperthreading)
> 32GB RAM
>
> We met serious IO performance issues.
> We have more or less even distribution of read/write requests. The same for
> datasize.
>
> ServerName Request Per Second Read Request Count Write Request Count
> node01.domain.com,60020,1430172017193 195 171871826 16761699
> node02.domain.com,60020,1426925053570 24 34314930 16006603
> node03.domain.com,60020,1430860939797 22 32054801 16913299
> node04.domain.com,60020,1431975656065 33 1765121 253405
> node05.domain.com,60020,1430484646409 27 42248883 16406280
> node07.domain.com,60020,1426776403757 27 36324492 16299432
> node08.domain.com,60020,1426775898757 26 38507165 13582109
> node09.domain.com,60020,1430440612531 27 34360873 15080194
> node11.domain.com,60020,1431989669340 28 44307 13466
> node12.domain.com,60020,1431927604238 30 5318096 2020855
> node13.domain.com,60020,1431372874221 29 31764957 15843688
> node14.domain.com,60020,1429640630771 41 36300097 13049801
>
> ServerName Num. Stores Num. Storefiles Storefile Size Uncompressed
> Storefile
> Size Index Size Bloom Size
> node01.domain.com,60020,1430172017193 82 186 1052080m 76496mb 641849k
> 310111k
> node02.domain.com,60020,1426925053570 82 179 1062730m 79713mb 649610k
> 318854k
> node03.domain.com,60020,1430860939797 82 179 1036597m 76199mb 627346k
> 307136k
> node04.domain.com,60020,1431975656065 82 400 1034624m 76405mb 655954k
> 289316k
> node05.domain.com,60020,1430484646409 82 185 1111807m 81474mb 688136k
> 334127k
> node07.domain.com,60020,1426776403757 82 164 1023217m 74830mb 631774k
> 296169k
> node08.domain.com,60020,1426775898757 81 171 1086446m 79933mb 681486k
> 312325k
> node09.domain.com,60020,1430440612531 81 160 1073852m 77874mb 658924k
> 309734k
> node11.domain.com,60020,1431989669340 81 166 1006322m 75652mb 664753k
> 264081k
> node12.domain.com,60020,1431927604238 82 188 1050229m 75140mb 652970k
> 304137k
> node13.domain.com,60020,1431372874221 82 178 937557m 70042mb 601684k
> 257607k
> node14.domain.com,60020,1429640630771 82 145 949090m 69749mb 592812k
> 266677k
>
>
> When compaction starts  random node gets I/O 100%, io wait for seconds,
> even tenth of seconds.
>
> What are the approaches to optimize minor and major compactions when you
> are I/O bound..?
>

Yeah, with two disks, you will be crimped. Do you have the system sharing
disks with hbase/hdfs, or is hdfs running on one disk only?

Do you have to compact? In other words, do you have read SLAs?  How are
your read times currently?  Does your working dataset fit in RAM or do
reads have to go to disk?  It looks like you are running at about three
storefiles per column family.  What if you upped the threshold at which
minors run? Do you have a downtime during which you could schedule
compactions? Are you managing the major compactions yourself or are you
having hbase do it for you?

St.Ack