Posted to user@phoenix.apache.org by "Heather, James (ELS)" <ja...@elsevier.com> on 2016/08/03 14:28:50 UTC

Advice on Phoenix config

Hi,

We've got quite a wide table (maybe 50 columns) with about 1 billion rows in it, currently stored in MySQL; we're looking at moving it into Phoenix. The pk there is an autoincrement column, but each row also contains a UUID, which would probably naturally become the pk in Phoenix. There are several other tables that hang off this table, in the sense that the pk for the main table is a foreign key in those other tables. There are also several indexed columns in MySQL that would need to carry over as indexes in Phoenix.

Most of the queries are reads, but maybe 20% of them are writes. Almost all of them are small, doing point lookups or returning a few rows based on one of the indexes.

Can anyone suggest sensible Phoenix/HBase config to get decent performance out of this? Specifically:


  1.  How should we encode the UUID? As BINARY(16)? And if this is the PK, and they are randomly generated UUIDs, presumably salting is unnecessary?
  2.  How many nodes should we expect to need to give us at least as good performance as our MySQL database with 1 billion rows?
  3.  How many regions?
  4.  Presumably this will start to outperform MySQL as the number of rows in the database increases? When we've got 10 billion rows, MySQL might struggle but hopefully Phoenix will be fine?
  5.  Are there any particular HBase configs we should be aware of (RPC timeouts etc.) that we'll need to tweak to get decent performance? This applies partly to the bulk loading process (data migration) at the beginning, but also afterwards when it's released into production.

We'd be extremely grateful for any tips.

James

________________________________

Elsevier Limited. Registered Office: The Boulevard, Langford Lane, Kidlington, Oxford, OX5 1GB, United Kingdom, Registration No. 1982084, Registered in England and Wales.

Re: Advice on Phoenix config

Posted by James Taylor <ja...@apache.org>.
Hi Tom & James,

I don't have any direct experience comparing MySQL to Phoenix, and it's
difficult to generalize about these things as they vary greatly from use
case to use case. It's best to do your own perf testing with your own
queries at representative data sizes; our Pherf tool and YCSB can help
you with this. I'd encourage you to use the latest versions of Phoenix
and HBase, as both projects are improving rapidly. It would be much
appreciated if you'd update us on your findings.

As far as the UUID goes, yes, I've seen BINARY(16) used by others for
this. If your most common query isn't to look up row(s) by UUID, I'd
recommend putting the columns you query most often before the UUID in a
composite primary key. Designing a good composite primary key based on
your most common queries is probably the single most important design
consideration in Phoenix.
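
For illustration, here's a minimal sketch of what that might look like
(the table name, the leading customer_id column, and the other columns
are all hypothetical; substitute whatever your most common queries
actually filter on):

    -- Hypothetical Phoenix DDL: the column queried most often leads
    -- the composite primary key, with the 16-byte UUID after it to
    -- guarantee uniqueness.
    CREATE TABLE main_table (
        customer_id BIGINT NOT NULL,
        uuid        BINARY(16) NOT NULL,
        email       VARCHAR,
        status      VARCHAR,
        CONSTRAINT pk PRIMARY KEY (customer_id, uuid)
    );

A query filtering on customer_id then becomes a range scan over a
contiguous slice of the row key space rather than a full table scan.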

Phoenix + HBase scale horizontally, so as long as your cluster is sized
appropriately, the number of rows can continue growing. The cost of
point lookups and range scans with limits will remain roughly constant,
while full table scans will grow linearly as your data increases.
Secondary indexes mitigate this by turning what would normally be a full
table scan into a range scan along a new primary key axis (at the
expense of increased write times).
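
For example, an index on one of the columns you already index in MySQL
might look like this (the email and status column names are the
hypothetical ones from the sketch above; INCLUDE covers extra columns so
the query can be answered from the index alone):

    -- Hypothetical covered index: queries that filter on email and
    -- select status are served entirely from the index table.
    CREATE INDEX idx_main_email ON main_table (email) INCLUDE (status);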

As far as tuning goes[1], Andrew Purtell put together an excellent
slide deck[2]. Also, take a look at the use case presentations from
PhoenixCon, linked in the comments section here[3]. One important
feature for reducing RPC traffic, the UPDATE_CACHE_FREQUENCY property
set at DDL time, is described here[4].
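
As a sketch, setting it on an existing table looks like this (the value
is in milliseconds, so here the client revalidates its cached table
metadata at most once every 15 minutes; the table name is again the
hypothetical one from above):

    -- Reduce RPC traffic: don't re-check table metadata on every query.
    ALTER TABLE main_table SET UPDATE_CACHE_FREQUENCY = 900000;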

HTH. Thanks,

James

[1] https://phoenix.apache.org/tuning.html
[2] http://phoenix.apache.org/presentations/TuningForOLTP.pdf
[3] http://www.meetup.com/SF-Bay-Area-Apache-Phoenix-Meetup/events/230545182/
[4] https://phoenix.apache.org/#Altering


On Tue, Aug 9, 2016 at 4:43 AM, Squires, Tom (ELS-LON)
<tom.squires@elsevier.com> wrote:

> Hi there,
>
> I am working with James Heather on this - does anyone have any pointers?
>
> Many thanks,
>
> Tom
>

Re: Advice on Phoenix config

Posted by "Squires, Tom (ELS-LON)" <to...@elsevier.com>.
Hi there,

I am working with James Heather on this - does anyone have any pointers?

Many thanks,

Tom

