Posted to user@hbase.apache.org by Otis Gospodnetic <ot...@gmail.com> on 2015/01/07 18:27:55 UTC

1 table, 1 dense CF => N tables, 1 dense CF ?

Hi,

It's been asked before, but I didn't find any *definitive* answers, and a
lot of the answers I found are from a while back.

e.g. Tsuna provided pretty convincing info here:
http://search-hadoop.com/m/xAiiO8ttU2/%2522%2522I+generally+recommend+to+stick+to+a+single+table%2522&subj=Re+One+table+or+multiple+tables+

... but that is from 3 years ago.  Maybe things changed?

Here's our use case:

Data/table layout:
* HBase is used for storing metrics at different granularities (1 min, 5
min, ... - a total of 6 different granularities)
* It's a multi-tenant system
* Keys are carefully crafted and include userId + number, where this number
contains the time and the granularity
* Everything's in 1 table and 1 CF

Access:
* We only access 1 system at a time, for a specific time range, and
specific granularity
* We periodically scan ALL data and delete data older than N days, where N
varies from user to user
* We periodically scan ALL data and merge multiple rows (of the same
granularity) into 1

Question:
Would there be any advantage in having 6 tables - one for each granularity
- instead of having everything in 1 table?
Assume each table would still have just 1 CF and the keys would remain the
same.

Thanks,
Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/

Re: 1 table, 1 dense CF => N tables, 1 dense CF ?

Posted by Ted Yu <yu...@gmail.com>.
w.r.t. one WAL per region server, see HBASE-5699 'Run with > 1 WAL in
HRegionServer' which is in the upcoming 1.0.0 release.
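Once you're on a release that has it, enabling it should be a matter of
something like the following in hbase-site.xml (untested sketch; the property
name is taken from the multi-WAL docs for later releases, so please verify it
against your version):

    <property>
      <!-- assumption: 'multiwal' selects the region-grouping WAL provider from HBASE-5699 -->
      <name>hbase.wal.provider</name>
      <value>multiwal</value>
    </property>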

Cheers

On Wed, Jan 7, 2015 at 5:21 PM, Nick Dimiduk <nd...@gmail.com> wrote:

> Not to dig too deep into ancient history, but Tsuna's comments are mostly
> still relevant today, except for...
>
> You also generally end up with fewer, bigger regions, which is almost
> > always better.  This entails that your RS are writing more data to fewer
> > WALs, which leads to more sequential writes across the board.  You'll end
> > up with fewer HLogs, which is also a good thing.
>
>
> HBase is one WAL per region server and has been for as long as I've paid
> attention. Unless I've missed something, number of tables doesn't change
> this fixed number.
>
> If you use HBase's client (which is most likely the case as the only other
> > alternative is asynchbase), beware that you need to create one HTable
> > instance per table per thread in your application code.
>
>
> You can still write your client application this way, but the preferred
> idiom is to use a single Connection instance from which all these resources
> are shared across HTable instances. This pattern is reinforced in the new
> client API introduced in 1.0
>
> FYI, I think you can write a Compaction coprocessor that implements your
> data expiration policy through normal compaction operations, thereby
> removing the necessity of the (expensive?) scan + write delete pattern
> entirely.
>
> -n
>
> On Wed, Jan 7, 2015 at 9:27 AM, Otis Gospodnetic <
> otis.gospodnetic@gmail.com
> > wrote:
>
> > Hi,
> >
> > It's been asked before, but I didn't find any *definite* answers and a
> lot
> > of answers I found via  are from a whiiiile back.
> >
> > e.g. Tsuna provided pretty convincing info here:
> >
> >
> http://search-hadoop.com/m/xAiiO8ttU2/%2522%2522I+generally+recommend+to+stick+to+a+single+table%2522&subj=Re+One+table+or+multiple+tables+
> >
> > ... but that is from 3 years ago.  Maybe things changed?
> >
> > Here's our use case:
> >
> > Data/table layout:
> > * HBase is used for storing metrics at different granularities (1min, 5
> > min.... - a total of 6 different granularities)
> > * It's a multi-tenant system
> > * Keys are carefully crafted and include userId + number, where this
> number
> > contains the time and the granularity
> > * Everything's in 1 table and 1 CF
> >
> > Access:
> > * We only access 1 system at a time, for a specific time range, and
> > specific granularity
> > * We periodically scan ALL data and delete data older than N days, where
> N
> > varies from user to user
> > * We periodically scan ALL data and merge multiple rows (of the same
> > granularity) into 1
> >
> > Question:
> > Would there be any advantage in having 6 tables - one for each
> granularity
> > - instead of having everything in 1 table?
> > Assume each table would still have just 1 CF and the keys would remain
> the
> > same.
> >
> > Thanks,
> > Otis
> > --
> > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > Solr & Elasticsearch Support * http://sematext.com/
> >
>

Re: 1 table, 1 dense CF => N tables, 1 dense CF ?

Posted by Ted Yu <yu...@gmail.com>.
Thanks for the confirmation, Gary.

The change has been done through HBASE-12834.

Cheers

On Fri, Jan 9, 2015 at 1:01 PM, Gary Helmling <gh...@gmail.com> wrote:

> ScanType is a parameter of RegionObserver preCompact() and
> preCompactScannerOpen().  It seems like anything we are explicitly
> providing to coprocessor hooks should be LimitedPrivate.
>
> On Fri, Jan 9, 2015 at 12:26 PM, Ted Yu <yu...@gmail.com> wrote:
>
> > w.r.t. ScanType, here is the logic used by DefaultCompactor:
> >
> >         ScanType scanType = request.isAllFiles()
> >             ? ScanType.COMPACT_DROP_DELETES
> >             : ScanType.COMPACT_RETAIN_DELETES;
> >
> > BTW ScanType is currently marked InterfaceAudience.Private
> >
> > Should it be marked LimitedPrivate ?
> >
> > Cheers
> >
> > On Fri, Jan 9, 2015 at 12:19 PM, Gary Helmling <gh...@gmail.com>
> > wrote:
> >
> > > >
> > > >
> > > > 2) is more expensive than 1).
> > > > I'm wondering if we could use Compaction Coprocessor for 2)?
> HBaseHUT
> > > > needs to be able to grab N rows and merge them into 1, delete those N
> > > rows,
> > > > and just write that 1 new row.  This N could be several thousand
> rows.
> > > > Could Compaction Coprocessor really be used for that?
> > > >
> > > >
> > > It would depend on the details.  If you're simply aggregating the data
> > into
> > > one row, and:
> > > * the thousands of rows are contiguous in the scan
> > > * you can somehow incrementally update or emit the new row that you
> want
> > to
> > > create so that you don't need to retain all the old rows in memory
> > > * the new row you want to emit would sort sequentially into the same
> > > position
> > >
> > > Then overriding the scanner used for compaction could be a good
> solution.
> > > This would allow you to transform the cells emitted during compaction,
> > > including dropping the cells from the old rows and emitting new
> > > (transformed) cells for the new row.
> > >
> > >
> > > > Also, would that come into play during minor or major compactions or
> > > both?
> > > >
> > > >
> > > You can distinguish between them in your coprocessor hooks based on
> > > ScanType.  So up to you.
> > >
> >
>

Re: 1 table, 1 dense CF => N tables, 1 dense CF ?

Posted by Gary Helmling <gh...@gmail.com>.
ScanType is a parameter of RegionObserver preCompact() and
preCompactScannerOpen().  It seems like anything we are explicitly
providing to coprocessor hooks should be LimitedPrivate.

On Fri, Jan 9, 2015 at 12:26 PM, Ted Yu <yu...@gmail.com> wrote:

> w.r.t. ScanType, here is the logic used by DefaultCompactor:
>
>         ScanType scanType = request.isAllFiles()
>             ? ScanType.COMPACT_DROP_DELETES
>             : ScanType.COMPACT_RETAIN_DELETES;
>
> BTW ScanType is currently marked InterfaceAudience.Private
>
> Should it be marked LimitedPrivate ?
>
> Cheers
>
> On Fri, Jan 9, 2015 at 12:19 PM, Gary Helmling <gh...@gmail.com>
> wrote:
>
> > >
> > >
> > > 2) is more expensive than 1).
> > > I'm wondering if we could use Compaction Coprocessor for 2)?  HBaseHUT
> > > needs to be able to grab N rows and merge them into 1, delete those N
> > rows,
> > > and just write that 1 new row.  This N could be several thousand rows.
> > > Could Compaction Coprocessor really be used for that?
> > >
> > >
> > It would depend on the details.  If you're simply aggregating the data
> into
> > one row, and:
> > * the thousands of rows are contiguous in the scan
> > * you can somehow incrementally update or emit the new row that you want
> to
> > create so that you don't need to retain all the old rows in memory
> > * the new row you want to emit would sort sequentially into the same
> > position
> >
> > Then overriding the scanner used for compaction could be a good solution.
> > This would allow you to transform the cells emitted during compaction,
> > including dropping the cells from the old rows and emitting new
> > (transformed) cells for the new row.
> >
> >
> > > Also, would that come into play during minor or major compactions or
> > both?
> > >
> > >
> > You can distinguish between them in your coprocessor hooks based on
> > ScanType.  So up to you.
> >
>

Re: 1 table, 1 dense CF => N tables, 1 dense CF ?

Posted by Ted Yu <yu...@gmail.com>.
w.r.t. ScanType, here is the logic used by DefaultCompactor:

        ScanType scanType = request.isAllFiles()
            ? ScanType.COMPACT_DROP_DELETES
            : ScanType.COMPACT_RETAIN_DELETES;

BTW, ScanType is currently marked InterfaceAudience.Private.

Should it be marked LimitedPrivate?

Cheers

On Fri, Jan 9, 2015 at 12:19 PM, Gary Helmling <gh...@gmail.com> wrote:

> >
> >
> > 2) is more expensive than 1).
> > I'm wondering if we could use Compaction Coprocessor for 2)?  HBaseHUT
> > needs to be able to grab N rows and merge them into 1, delete those N
> rows,
> > and just write that 1 new row.  This N could be several thousand rows.
> > Could Compaction Coprocessor really be used for that?
> >
> >
> It would depend on the details.  If you're simply aggregating the data into
> one row, and:
> * the thousands of rows are contiguous in the scan
> * you can somehow incrementally update or emit the new row that you want to
> create so that you don't need to retain all the old rows in memory
> * the new row you want to emit would sort sequentially into the same
> position
>
> Then overriding the scanner used for compaction could be a good solution.
> This would allow you to transform the cells emitted during compaction,
> including dropping the cells from the old rows and emitting new
> (transformed) cells for the new row.
>
>
> > Also, would that come into play during minor or major compactions or
> both?
> >
> >
> You can distinguish between them in your coprocessor hooks based on
> ScanType.  So up to you.
>

Re: 1 table, 1 dense CF => N tables, 1 dense CF ?

Posted by Gary Helmling <gh...@gmail.com>.
>
>
> 2) is more expensive than 1).
> I'm wondering if we could use Compaction Coprocessor for 2)?  HBaseHUT
> needs to be able to grab N rows and merge them into 1, delete those N rows,
> and just write that 1 new row.  This N could be several thousand rows.
> Could Compaction Coprocessor really be used for that?
>
>
It would depend on the details.  If you're simply aggregating the data into
one row, and:
* the thousands of rows are contiguous in the scan
* you can somehow incrementally update or emit the new row that you want to
create so that you don't need to retain all the old rows in memory
* the new row you want to emit would sort sequentially into the same
position

Then overriding the scanner used for compaction could be a good solution.
This would allow you to transform the cells emitted during compaction,
including dropping the cells from the old rows and emitting new
(transformed) cells for the new row.
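As a rough, untested sketch of that shape -- assuming the 0.98/1.0-era API,
i.e. BaseRegionObserver#preCompact(ObserverContext, Store, InternalScanner,
ScanType) and an InternalScanner exposing next(List<Cell>), next(List<Cell>,
int) and close(); later releases change these signatures -- it could look
something like the following. The class name and the fixed 30-day cutoff are
placeholders, and the simple drop in transform() is where an HBaseHUT-style
buffer-and-merge would go instead:

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.List;
    import java.util.concurrent.TimeUnit;

    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
    import org.apache.hadoop.hbase.coprocessor.ObserverContext;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
    import org.apache.hadoop.hbase.regionserver.InternalScanner;
    import org.apache.hadoop.hbase.regionserver.ScanType;
    import org.apache.hadoop.hbase.regionserver.Store;

    public class CompactionTransformObserver extends BaseRegionObserver {

      @Override
      public InternalScanner preCompact(ObserverContext<RegionCoprocessorEnvironment> c,
          Store store, final InternalScanner scanner, ScanType scanType) throws IOException {
        // Wrap the scanner the compactor would otherwise use; whatever is left in
        // (or added to) the results list is what gets written to the new store file.
        return new InternalScanner() {
          @Override
          public boolean next(List<Cell> results) throws IOException {
            boolean more = scanner.next(results);
            transform(results);
            return more;
          }

          @Override
          public boolean next(List<Cell> results, int limit) throws IOException {
            boolean more = scanner.next(results, limit);
            transform(results);
            return more;
          }

          @Override
          public void close() throws IOException {
            scanner.close();
          }
        };
      }

      // Simplest possible transformation: drop cells past a placeholder 30-day
      // retention.  An HBaseHUT-style merge would instead buffer the cells of one
      // logical group here and emit a single merged row in their place.
      private void transform(List<Cell> cells) {
        long cutoff = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(30);
        Iterator<Cell> it = cells.iterator();
        while (it.hasNext()) {
          if (it.next().getTimestamp() < cutoff) {
            it.remove();
          }
        }
      }
    }

The observer would then be attached to the table like any other region
coprocessor (e.g. through the table descriptor's coprocessor setting).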


> Also, would that come into play during minor or major compactions or both?
>
>
You can distinguish between them in your coprocessor hooks based on
ScanType.  So up to you.
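Going by the DefaultCompactor check shown earlier in this thread
(request.isAllFiles() ? COMPACT_DROP_DELETES : COMPACT_RETAIN_DELETES), a
small helper for that distinction could be as simple as (assumed sketch):

    import org.apache.hadoop.hbase.regionserver.ScanType;

    public final class CompactionKind {
      private CompactionKind() {}

      // COMPACT_DROP_DELETES is passed when all store files are being rewritten
      // (notably major compactions); COMPACT_RETAIN_DELETES for partial, minor ones.
      public static boolean isAllFilesCompaction(ScanType scanType) {
        return scanType == ScanType.COMPACT_DROP_DELETES;
      }
    }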

Re: 1 table, 1 dense CF => N tables, 1 dense CF ?

Posted by Ted Yu <yu...@gmail.com>.
Otis:
You can find examples of how these methods are used in Phoenix.
Namely:
phoenix-core/src/main/java/org/apache/hadoop/hbase/regionserver/IndexHalfStoreFileReaderGenerator.java
phoenix-core/src/main/java/org/apache/phoenix/coprocessor/UngroupedAggregateRegionObserver.java
phoenix-core/src/main/java/org/apache/phoenix/hbase/index/Indexer.java

FYI

On Fri, Jan 9, 2015 at 12:03 PM, Nick Dimiduk <nd...@gmail.com> wrote:

> I haven't written against this API yet, so I don't know all these answers
> off the top of my head. The interface you're interested in are the
> preCompact* methods in
>
> http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/coprocessor/RegionObserver.html
>
> On Fri, Jan 9, 2015 at 6:35 AM, Otis Gospodnetic <
> otis.gospodnetic@gmail.com
> > wrote:
>
> > Hi,
> >
> > What Nick suggests below about using Compaction Coprocessor sounds
> > potentially very useful for us.  Q below.
> >
> > On Wed, Jan 7, 2015 at 8:21 PM, Nick Dimiduk <nd...@gmail.com> wrote:
> >
> > > Not to dig too deep into ancient history, but Tsuna's comments are
> mostly
> > > still relevant today, except for...
> > >
> > > You also generally end up with fewer, bigger regions, which is almost
> > > > always better.  This entails that your RS are writing more data to
> > fewer
> > > > WALs, which leads to more sequential writes across the board.  You'll
> > end
> > > > up with fewer HLogs, which is also a good thing.
> > >
> > >
> > > HBase is one WAL per region server and has been for as long as I've
> paid
> > > attention. Unless I've missed something, number of tables doesn't
> change
> > > this fixed number.
> > >
> > > If you use HBase's client (which is most likely the case as the only
> > other
> > > > alternative is asynchbase), beware that you need to create one HTable
> > > > instance per table per thread in your application code.
> > >
> > >
> > > You can still write your client application this way, but the preferred
> > > idiom is to use a single Connection instance from which all these
> > resources
> > > are shared across HTable instances. This pattern is reinforced in the
> new
> > > client API introduced in 1.0
> > >
> > > FYI, I think you can write a Compaction coprocessor that implements
> your
> > > data expiration policy through normal compaction operations, thereby
> > > removing the necessity of the (expensive?) scan + write delete pattern
> > > entirely.
> > >
> >
> > We actually do 2 types of full scans:
> > 1) scan everything and delete rows > N days old, where N can be different
> > for different users
> > 2) scan everything and merge multiple rows into 1 row via HBaseHUT -
> > https://github.com/sematext/HBaseHUT
> >
> > 2) is more expensive than 1).
> > I'm wondering if we could use Compaction Coprocessor for 2)?  HBaseHUT
> > needs to be able to grab N rows and merge them into 1, delete those N
> rows,
> > and just write that 1 new row.  This N could be several thousand rows.
> > Could Compaction Coprocessor really be used for that?
> >
> > Also, would that come into play during minor or major compactions or
> both?
> >
> > Thanks,
> > Otis
> > --
> > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > Solr & Elasticsearch Support * http://sematext.com/
> >
> >
> >
> >
> > >
> > > -n
> > >
> > > On Wed, Jan 7, 2015 at 9:27 AM, Otis Gospodnetic <
> > > otis.gospodnetic@gmail.com
> > > > wrote:
> > >
> > > > Hi,
> > > >
> > > > It's been asked before, but I didn't find any *definite* answers and
> a
> > > lot
> > > > of answers I found via  are from a whiiiile back.
> > > >
> > > > e.g. Tsuna provided pretty convincing info here:
> > > >
> > > >
> > >
> >
> http://search-hadoop.com/m/xAiiO8ttU2/%2522%2522I+generally+recommend+to+stick+to+a+single+table%2522&subj=Re+One+table+or+multiple+tables+
> > > >
> > > > ... but that is from 3 years ago.  Maybe things changed?
> > > >
> > > > Here's our use case:
> > > >
> > > > Data/table layout:
> > > > * HBase is used for storing metrics at different granularities
> (1min, 5
> > > > min.... - a total of 6 different granularities)
> > > > * It's a multi-tenant system
> > > > * Keys are carefully crafted and include userId + number, where this
> > > number
> > > > contains the time and the granularity
> > > > * Everything's in 1 table and 1 CF
> > > >
> > > > Access:
> > > > * We only access 1 system at a time, for a specific time range, and
> > > > specific granularity
> > > > * We periodically scan ALL data and delete data older than N days,
> > where
> > > N
> > > > varies from user to user
> > > > * We periodically scan ALL data and merge multiple rows (of the same
> > > > granularity) into 1
> > > >
> > > > Question:
> > > > Would there be any advantage in having 6 tables - one for each
> > > granularity
> > > > - instead of having everything in 1 table?
> > > > Assume each table would still have just 1 CF and the keys would
> remain
> > > the
> > > > same.
> > > >
> > > > Thanks,
> > > > Otis
> > > > --
> > > > Monitoring * Alerting * Anomaly Detection * Centralized Log
> Management
> > > > Solr & Elasticsearch Support * http://sematext.com/
> > > >
> > >
> >
>

Re: 1 table, 1 dense CF => N tables, 1 dense CF ?

Posted by Nick Dimiduk <nd...@gmail.com>.
I haven't written against this API yet, so I don't know all these answers
off the top of my head. The hooks you're interested in are the
preCompact* methods in
http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/coprocessor/RegionObserver.html

On Fri, Jan 9, 2015 at 6:35 AM, Otis Gospodnetic <otis.gospodnetic@gmail.com
> wrote:

> Hi,
>
> What Nick suggests below about using Compaction Coprocessor sounds
> potentially very useful for us.  Q below.
>
> On Wed, Jan 7, 2015 at 8:21 PM, Nick Dimiduk <nd...@gmail.com> wrote:
>
> > Not to dig too deep into ancient history, but Tsuna's comments are mostly
> > still relevant today, except for...
> >
> > You also generally end up with fewer, bigger regions, which is almost
> > > always better.  This entails that your RS are writing more data to
> fewer
> > > WALs, which leads to more sequential writes across the board.  You'll
> end
> > > up with fewer HLogs, which is also a good thing.
> >
> >
> > HBase is one WAL per region server and has been for as long as I've paid
> > attention. Unless I've missed something, number of tables doesn't change
> > this fixed number.
> >
> > If you use HBase's client (which is most likely the case as the only
> other
> > > alternative is asynchbase), beware that you need to create one HTable
> > > instance per table per thread in your application code.
> >
> >
> > You can still write your client application this way, but the preferred
> > idiom is to use a single Connection instance from which all these
> resources
> > are shared across HTable instances. This pattern is reinforced in the new
> > client API introduced in 1.0
> >
> > FYI, I think you can write a Compaction coprocessor that implements your
> > data expiration policy through normal compaction operations, thereby
> > removing the necessity of the (expensive?) scan + write delete pattern
> > entirely.
> >
>
> We actually do 2 types of full scans:
> 1) scan everything and delete rows > N days old, where N can be different
> for different users
> 2) scan everything and merge multiple rows into 1 row via HBaseHUT -
> https://github.com/sematext/HBaseHUT
>
> 2) is more expensive than 1).
> I'm wondering if we could use Compaction Coprocessor for 2)?  HBaseHUT
> needs to be able to grab N rows and merge them into 1, delete those N rows,
> and just write that 1 new row.  This N could be several thousand rows.
> Could Compaction Coprocessor really be used for that?
>
> Also, would that come into play during minor or major compactions or both?
>
> Thanks,
> Otis
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>
>
>
> >
> > -n
> >
> > On Wed, Jan 7, 2015 at 9:27 AM, Otis Gospodnetic <
> > otis.gospodnetic@gmail.com
> > > wrote:
> >
> > > Hi,
> > >
> > > It's been asked before, but I didn't find any *definite* answers and a
> > lot
> > > of answers I found via  are from a whiiiile back.
> > >
> > > e.g. Tsuna provided pretty convincing info here:
> > >
> > >
> >
> http://search-hadoop.com/m/xAiiO8ttU2/%2522%2522I+generally+recommend+to+stick+to+a+single+table%2522&subj=Re+One+table+or+multiple+tables+
> > >
> > > ... but that is from 3 years ago.  Maybe things changed?
> > >
> > > Here's our use case:
> > >
> > > Data/table layout:
> > > * HBase is used for storing metrics at different granularities (1min, 5
> > > min.... - a total of 6 different granularities)
> > > * It's a multi-tenant system
> > > * Keys are carefully crafted and include userId + number, where this
> > number
> > > contains the time and the granularity
> > > * Everything's in 1 table and 1 CF
> > >
> > > Access:
> > > * We only access 1 system at a time, for a specific time range, and
> > > specific granularity
> > > * We periodically scan ALL data and delete data older than N days,
> where
> > N
> > > varies from user to user
> > > * We periodically scan ALL data and merge multiple rows (of the same
> > > granularity) into 1
> > >
> > > Question:
> > > Would there be any advantage in having 6 tables - one for each
> > granularity
> > > - instead of having everything in 1 table?
> > > Assume each table would still have just 1 CF and the keys would remain
> > the
> > > same.
> > >
> > > Thanks,
> > > Otis
> > > --
> > > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > > Solr & Elasticsearch Support * http://sematext.com/
> > >
> >
>

Re: 1 table, 1 dense CF => N tables, 1 dense CF ?

Posted by Otis Gospodnetic <ot...@gmail.com>.
Hi,

What Nick suggests below about using Compaction Coprocessor sounds
potentially very useful for us.  Q below.

On Wed, Jan 7, 2015 at 8:21 PM, Nick Dimiduk <nd...@gmail.com> wrote:

> Not to dig too deep into ancient history, but Tsuna's comments are mostly
> still relevant today, except for...
>
> You also generally end up with fewer, bigger regions, which is almost
> > always better.  This entails that your RS are writing more data to fewer
> > WALs, which leads to more sequential writes across the board.  You'll end
> > up with fewer HLogs, which is also a good thing.
>
>
> HBase is one WAL per region server and has been for as long as I've paid
> attention. Unless I've missed something, number of tables doesn't change
> this fixed number.
>
> If you use HBase's client (which is most likely the case as the only other
> > alternative is asynchbase), beware that you need to create one HTable
> > instance per table per thread in your application code.
>
>
> You can still write your client application this way, but the preferred
> idiom is to use a single Connection instance from which all these resources
> are shared across HTable instances. This pattern is reinforced in the new
> client API introduced in 1.0
>
> FYI, I think you can write a Compaction coprocessor that implements your
> data expiration policy through normal compaction operations, thereby
> removing the necessity of the (expensive?) scan + write delete pattern
> entirely.
>

We actually do 2 types of full scans:
1) scan everything and delete rows > N days old, where N can be different
for different users
2) scan everything and merge multiple rows into 1 row via HBaseHUT -
https://github.com/sematext/HBaseHUT

2) is more expensive than 1).
I'm wondering if we could use Compaction Coprocessor for 2)?  HBaseHUT
needs to be able to grab N rows and merge them into 1, delete those N rows,
and just write that 1 new row.  This N could be several thousand rows.
Could Compaction Coprocessor really be used for that?

Also, would that come into play during minor or major compactions or both?

Thanks,
Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/




>
> -n
>
> On Wed, Jan 7, 2015 at 9:27 AM, Otis Gospodnetic <
> otis.gospodnetic@gmail.com
> > wrote:
>
> > Hi,
> >
> > It's been asked before, but I didn't find any *definite* answers and a
> lot
> > of answers I found via  are from a whiiiile back.
> >
> > e.g. Tsuna provided pretty convincing info here:
> >
> >
> http://search-hadoop.com/m/xAiiO8ttU2/%2522%2522I+generally+recommend+to+stick+to+a+single+table%2522&subj=Re+One+table+or+multiple+tables+
> >
> > ... but that is from 3 years ago.  Maybe things changed?
> >
> > Here's our use case:
> >
> > Data/table layout:
> > * HBase is used for storing metrics at different granularities (1min, 5
> > min.... - a total of 6 different granularities)
> > * It's a multi-tenant system
> > * Keys are carefully crafted and include userId + number, where this
> number
> > contains the time and the granularity
> > * Everything's in 1 table and 1 CF
> >
> > Access:
> > * We only access 1 system at a time, for a specific time range, and
> > specific granularity
> > * We periodically scan ALL data and delete data older than N days, where
> N
> > varies from user to user
> > * We periodically scan ALL data and merge multiple rows (of the same
> > granularity) into 1
> >
> > Question:
> > Would there be any advantage in having 6 tables - one for each
> granularity
> > - instead of having everything in 1 table?
> > Assume each table would still have just 1 CF and the keys would remain
> the
> > same.
> >
> > Thanks,
> > Otis
> > --
> > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > Solr & Elasticsearch Support * http://sematext.com/
> >
>

Re: 1 table, 1 dense CF => N tables, 1 dense CF ?

Posted by Nick Dimiduk <nd...@gmail.com>.
Not to dig too deep into ancient history, but Tsuna's comments are mostly
still relevant today, except for...

You also generally end up with fewer, bigger regions, which is almost
> always better.  This entails that your RS are writing more data to fewer
> WALs, which leads to more sequential writes across the board.  You'll end
> up with fewer HLogs, which is also a good thing.


HBase uses one WAL per region server and has for as long as I've paid
attention. Unless I've missed something, the number of tables doesn't
change this fixed number.

If you use HBase's client (which is most likely the case as the only other
> alternative is asynchbase), beware that you need to create one HTable
> instance per table per thread in your application code.


You can still write your client application this way, but the preferred
idiom is to use a single Connection instance so that these resources are
shared across the HTable instances it hands out. This pattern is reinforced
in the new client API introduced in 1.0.
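Something like this (untested sketch, assuming the 1.0-style
ConnectionFactory / Connection / Table API; the table, family, qualifier and
row key names here are made up):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SharedConnectionExample {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        // One heavyweight Connection per application: it owns the ZooKeeper
        // session, socket pools and metadata caches.
        try (Connection connection = ConnectionFactory.createConnection(conf)) {
          // Table instances are lightweight; grab one per thread/task as needed.
          try (Table metrics = connection.getTable(TableName.valueOf("metrics"))) {
            Put put = new Put(Bytes.toBytes("user42-20150107-1min"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("v"), Bytes.toBytes(1L));
            metrics.put(put);
          }
        }
      }
    }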

FYI, I think you can write a Compaction coprocessor that implements your
data expiration policy through normal compaction operations, thereby
removing the necessity of the (expensive?) scan + write delete pattern
entirely.

-n

On Wed, Jan 7, 2015 at 9:27 AM, Otis Gospodnetic <otis.gospodnetic@gmail.com
> wrote:

> Hi,
>
> It's been asked before, but I didn't find any *definite* answers and a lot
> of answers I found via  are from a whiiiile back.
>
> e.g. Tsuna provided pretty convincing info here:
>
> http://search-hadoop.com/m/xAiiO8ttU2/%2522%2522I+generally+recommend+to+stick+to+a+single+table%2522&subj=Re+One+table+or+multiple+tables+
>
> ... but that is from 3 years ago.  Maybe things changed?
>
> Here's our use case:
>
> Data/table layout:
> * HBase is used for storing metrics at different granularities (1min, 5
> min.... - a total of 6 different granularities)
> * It's a multi-tenant system
> * Keys are carefully crafted and include userId + number, where this number
> contains the time and the granularity
> * Everything's in 1 table and 1 CF
>
> Access:
> * We only access 1 system at a time, for a specific time range, and
> specific granularity
> * We periodically scan ALL data and delete data older than N days, where N
> varies from user to user
> * We periodically scan ALL data and merge multiple rows (of the same
> granularity) into 1
>
> Question:
> Would there be any advantage in having 6 tables - one for each granularity
> - instead of having everything in 1 table?
> Assume each table would still have just 1 CF and the keys would remain the
> same.
>
> Thanks,
> Otis
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>

Re: 1 table, 1 dense CF => N tables, 1 dense CF ?

Posted by Stack <st...@duboce.net>.
On Wed, Jan 7, 2015 at 9:27 AM, Otis Gospodnetic <otis.gospodnetic@gmail.com
> wrote:

>
> Data/table layout:
> * HBase is used for storing metrics at different granularities (1min, 5
> min.... - a total of 6 different granularities)
> * It's a multi-tenant system
> * Keys are carefully crafted and include userId + number, where this number
> contains the time and the granularity
> * Everything's in 1 table and 1 CF
>
> Access:
> * We only access 1 system at a time, for a specific time range, and
> specific granularity
> * We periodically scan ALL data and delete data older than N days, where N
> varies from user to user
> * We periodically scan ALL data and merge multiple rows (of the same
> granularity) into 1
>
>
Are you having a problem Otis that you are trying to solve?



> Question:
> Would there be any advantage in having 6 tables - one for each granularity
> - instead of having everything in 1 table?
>

It could make for less rewriting of data.  If everything is in one table, a
compaction will rewrite all granularities. With separate tables, the coarser
granularities would change less often and so would flush/compact -- be
rewritten -- less often.

You might get a similar effect if you put in place a split policy that splits
regions on a granularity border; e.g. keep all the 1-minute data in one region
and have anything at a coarser granularity go into a different region.
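One way to read that, given a userId-first key, is a policy modeled on the
stock KeyPrefixRegionSplitPolicy that snaps the chosen split point back to a
(userId, granularity) boundary. Rough, untested sketch -- the prefix length is
a made-up assumption about your key layout:

    import java.util.Arrays;
    import org.apache.hadoop.hbase.regionserver.IncreasingToUpperBoundRegionSplitPolicy;

    public class GranularityBorderSplitPolicy extends IncreasingToUpperBoundRegionSplitPolicy {

      // Hypothetical key layout: userId in the first 8 bytes, granularity id in the 9th.
      private static final int PREFIX_LENGTH = 9;

      @Override
      protected byte[] getSplitPoint() {
        byte[] splitPoint = super.getSplitPoint();
        if (splitPoint != null && splitPoint.length > PREFIX_LENGTH) {
          // Truncate to the (userId, granularity) prefix so one granularity's rows
          // for a user never straddle the split point.
          return Arrays.copyOf(splitPoint, PREFIX_LENGTH);
        }
        return splitPoint;
      }
    }

If memory serves, HTableDescriptor#setRegionSplitPolicyClassName lets you set
such a policy per table rather than cluster-wide.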

Do you have a notion of the relative proportions of the different
granularities? (e.g. is the coarsest granularity 10%, or an irrelevant
0.0001%?)

Otherwise, it's as @tsuna says; and yeah, what @nick says regarding
compaction might be worth exploring... it could save you a bunch of churn.

St.Ack



> Assume each table would still have just 1 CF and the keys would remain the
> same.
>