Posted to user@phoenix.apache.org by Willem Conradie <wi...@pbtgroup.co.za> on 2016/01/15 10:12:23 UTC

Telco HBase POC

Hi,



I am currently consulting at a client with the following requirements.



They want to make detailed data usage CDRs available for customers to verify their data usage against the websites that they visited. In short, this can be seen as an itemised bill for data usage. The data is currently not loaded into an RDBMS due to the volumes of data involved. The proposed solution is to load the data into HBase, running on an HDP cluster, and make it available for querying by the subscribers. It is critical to ensure low latency read access to the subscriber data, which may be exposed to as many as 25 million subscribers. We will first run a scaled down version for a proof of concept, with the intention of it becoming an operational data store. Once the solution is functioning properly for the data usage CDRs, other CDR types will be added, so we need to build a cost-effective, scalable solution.



I am thinking of using Apache Phoenix for the following reasons:



1. Current data loading into the RDBMS is file based (CSV) via a staging server, using the RDBMS file load drivers.

2. Use the Apache Phoenix bin/psql.py script to mimic the above process to load into HBase.

3. Expected data volume: 60 000 files per day
                         1 to 10 MB per file
                         500 million records per day
                         500 GB total volume per day

4. Use the Apache Phoenix client for low latency data retrieval.
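
For point 4, a minimal sketch of what such a low latency lookup could look like through the Phoenix JDBC driver is below. The table name, column names and ZooKeeper quorum are illustrative placeholders only, not the actual schema; the idea is that with userid leading the primary key this should translate into a small range scan.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class UsageLookup {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper quorum; the Phoenix thick client connects via ZooKeeper.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT datetime, txnid, usagearray, uriarray "
                   + "FROM usage_cdr "
                   + "WHERE userid = ? AND datetime BETWEEN ? AND ?")) {
            ps.setString(1, "12345678901");       // subscriber
            ps.setString(2, "20151001000000");    // range start
            ps.setString(3, "20151007235959");    // range end
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("TXNID") + " " + rs.getString("URIARRAY"));
                }
            }
        }
    }
}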



Is Apache Phoenix a suitable candidate for this specific use case?



Regards,

Willem


Re: Telco HBase POC

Posted by James Taylor <ja...@apache.org>.
Thanks for sharing that, Vijay. What are you putting in the "serialization
and processing cost" bucket? Would be good to dive into that, file some
JIRAs (if it's related to Phoenix), and improve this.

    James

On Wed, Jan 20, 2016 at 9:00 AM, Vijay Vangapandu <
VijayVangapandu@eharmony.com> wrote:

> Hi guys,
> We recently migrated one of our user facing use cases to HBase and we are
> using Phoenix as query layer.
> We managed to get a singles record in 30Ms.
>
> Here is the response times breakdown.
> 75th - 29MS
> 95th - 43MS
> 99th - 76 MS
>
> We have about 6Billion records in store and each row contains around 30
> columns.
>
> We are using HortenWorks configuration with few config tweaks.
>
> We enabled the block cache.
>
> Our use case is to get all records associated to user to render in
> List/card view . Each user has on an average 5K records.
>
> Our biggest bottleneck is serialization.
> Above response times are for single record reads from HBase, but with
> serialization and processing cost its averaging at 80MS.
>
>
> Sent from my iPhone
>
> On Jan 20, 2016, at 6:27 AM, Riesland, Zack <Zack.Riesland@sensus.com
> <ma...@sensus.com>> wrote:
>
> I have a similar data pattern and 100ms response time is fairly consistent.
>
> I’ve been trying hard to find the right set of configs to get closer to
> 10-20ms with no luck, but I’m finding that 100ms average is pretty
> reasonable.
>
> From: Willem Conradie [mailto:willem.conradie@pbtgroup.co.za]
> Sent: Wednesday, January 20, 2016 8:31 AM
> To: jamestaylor@apache.org<ma...@apache.org>
> Cc: user@phoenix.apache.org<ma...@phoenix.apache.org>
> Subject: RE: Telco HBase POC
>
> Hi James,
>
> Thanks for being willing to assist.
>
> This is what the input data record will look like (test data):
>
> UserID:     12345678901
> DateTime:   20151006124945
> TXNID:      992194978
> DeviceID:   123456789012345
> IPAddress:  111.111.111.111
> UsageArray: 10-4000:26272:1019324|0-4000:0:0|10-4000:25780:498309|420-4000:152:152|500-500:1258:2098|9001-9001:120:0|0-4000:0:0|502-4000:154:0|10-4000:73750:448374|420-4000:608:608|1-4000:364:550|358-4000:40:52
> URIArray:   www.facebook.com|www.whatsapp.com|www.google.co.nz|ssl.gstatic.com|www.google.com
>
>
> Unique key on record is “UserID,DateTime,TXNID”.
>
> Read access pattern is as follows:
> User queries by UserID, DateTime range to supply them with usage stats
> (derived from ‘UsageArray’) for websites (derived from ‘URIArray’) visited
> over their selected time range.
>
> Just to recap the data volumes:
> Expected data volume :  60 000 files per day
>                                                   1 –to 10 MB per file
>                                                   500 million records per
> day
>                                                    500 GB total volume per
> day
>
> I need  to be flexible in the amount of data stored. Initially it will be
> 5 days, but can increase to 30 days and possibly 90 days.
>
> One concern I have (not founded in any way) with the phoenix client is
> whether it will be able to support data access for above queries within
> 100ms range.
>
> Regards,
> Willem
>
> From: James Taylor [mailto:jamestaylor@apache.org]
> Sent: 19 January 2016 10:07 PM
> To: user <us...@phoenix.apache.org>>
> Subject: Re: Telco HBase POC
>
> Hi Willem,
> Let us know how we can help as you start getting into this, in particular
> with your schema design based on your query requirements.
> Thanks,
> James
>
> On Mon, Jan 18, 2016 at 8:50 AM, Pariksheet Barapatre <
> pbarapatre@gmail.com<ma...@gmail.com>> wrote:
>
> Hi Willem,
>
> Use Phoenix bulk load. I guess your source is csv so phoenixcsvbulk loader
> can be used.
>
> How frequently you want to load these files. If you can wait for certain
> interval to merge these files and map reduce will bulk load to Phoenix
> table.
>
> Cheers
> Pari
> On 18-Jan-2016 4:17 pm, "Willem Conradie" <willem.conradie@pbtgroup.co.za
> <ma...@pbtgroup.co.za>> wrote:
> Hi Pari,
>
> My comments in blue.
>
> Few notes from my experience :
> 1. Use bulk load rather than psql.py. Load larger files(merge) instead of
> small files.
> Are you referring to native HBase bulk load or Phoenix MapReduce bulk
> load? Unfortunately we can’t change how the files are received from source.
> Must we pre-process to merge the files before running the bulk load utility?
>
> 2. Increase HBase block cache
> 3. Turn off HBase auto compaction
> 4. Select primary key correctly
> 5. Don't use salting . As table will be huge, your phoenix query will fork
> may scanners. Try something like hash on userid.
> 6. Define TTL to purge data periodically
>
>
> Regards,
> Willem
>
> From: Pariksheet Barapatre [mailto:pbarapatre@gmail.com<mailto:
> pbarapatre@gmail.com>]
> Sent: 15 January 2016 03:17 PM
> To: user@phoenix.apache.org<ma...@phoenix.apache.org>
> Subject: Re: Telco HBase POC
>
> Hi Willem,
> Looking at your use case. Phoenix would be a handy client.
> Few notes from my experience :
> 1. Use bulk load rather than psql.py. Load larger files(merge) instead of
> small files.
> 2. Increase HBase block cache
> 3. Turn off HBase auto compaction
> 4. Select primary key correctly
> 5. Don't use salting . As table will be huge, your phoenix query will fork
> may scanners. Try something like hash on userid.
> 6. Define TTL to purge data periodically
>
> Cheers
> Pari
>
>
> On 15 January 2016 at 17:48, Pedro Gandola <pedro.gandola@gmail.com
> <ma...@gmail.com>> wrote:
> Hi Willem,
>
> Just to give you my short experience as phoenix user.
>
> I'm using Phoenix4.4 on top of a HBase cluster where I keep 3 billion
> entries.
> In our use case Phoenix is doing very well and it saved a lot of code
> complexity and time. If you guys have already decided that HBase is the way
> to go then having phoenix as a SQL layer it will help a lot, not only in
> terms of code simplicity but It will help you to create and maintain your
> indexes and views which can be hard&costly tasks using the plain HBase API.
> Joining tables it's just a simple SQL join :).
>
> And there are a lot of more useful features that make your life easier
> with HBase.
>
> In terms of performance and depending on the SLAs that you have you need
> to benchmark, however I think your main battles are going to be with HBase,
> JVM GCs, Network, FileSystem, etc...
>
> I would say to give Phoenix a try, for sure.
>
> Cheers
> Pedro
>
> On Fri, Jan 15, 2016 at 9:12 AM, Willem Conradie <
> willem.conradie@pbtgroup.co.za<ma...@pbtgroup.co.za>>
> wrote:
>
>
>
> Hi,
>
>
>
> I am currently consulting at a client with the following requirements.
>
>
>
> They want to make available detailed data usage CDRs for customers to
> verify their data usage against the websites that they visited. In short
> this can be seen as an itemised bill for data usage.  The data is currently
> not loaded into a RDBMS due to the volumes of data involved. The proposed
> solution is to load the data into HBase, running on a HDP cluster, and make
> it available for querying by the subscribers.  It is critical to ensure low
> latency read access to the subscriber data, which possibly will be exposed
> to 25 million subscribers. We will be running a scaled down version first
> for a proof of concept with the intention of it becoming an operational
> data store.  Once the solution is functioning properly for the data usage
> CDRs other CDR types will be added, as such we need  to build a cost
> effective, scalable solution .
>
>
>
> I am thinking of using Apache Phoenix for the following reasons:
>
>
>
> 1.      1. Current data loading into RDBMS is file based (CSV) via a
> staging server using the RDBMS file load drivers
>
> 2.      2.  Use Apache Phoenix   bin/psql.py script to mimic above process
> to load to HBase
>
> 3.       3. Expected data volume :  60 000 files per day
>                                                   1 –to 10 MB per file
>                                                   500 million records per
> day
>                                                    500 GB total volume per
> day
>
>
> 4.        4. Use Apache Phoenix client for low latency data retrieval
>
>
>
> Is Apache Phoenix a suitable candidate for this specific use case?
>
>
>
> Regards,
>
> Willem
>
>
>
>
>
> --
> Cheers,
> Pari
>
>

Re: Telco HBase POC

Posted by Vijay Vangapandu <Vi...@eharmony.com>.
Hi Guys,

There is a small point of confusion in the email below.
There is no deserialization issue in the Apache Phoenix layer. The response time breakdown in the email below is with Phoenix, and it’s pretty good.

What I am talking about is a Java client library we created to deserialize the data from the Phoenix result set into model objects.
We implemented a generic ORM-style library on top of Phoenix for Java object mapping and built support for DSL-style queries, and this is where the extra overhead is.

As I said, I don't see any issue with Phoenix.
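
Roughly, the mapping layer in question does something along the lines of the sketch below (class and column names are simplified placeholders, not our actual code); the reflection and DSL machinery layered on top of this is where the extra cost comes from.

import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Simplified placeholder model object.
class MatchRecord {
    String userId;
    long recordId;
    String payload;
}

class MatchRecordMapper {
    // Turn each row of a Phoenix ResultSet into a model object.
    // The real library drives this via reflection/annotations,
    // which is where the extra serialization overhead shows up.
    static List<MatchRecord> map(ResultSet rs) throws SQLException {
        List<MatchRecord> out = new ArrayList<>();
        while (rs.next()) {
            MatchRecord r = new MatchRecord();
            r.userId = rs.getString("USER_ID");
            r.recordId = rs.getLong("RECORD_ID");
            r.payload = rs.getString("PAYLOAD");
            out.add(r);
        }
        return out;
    }
}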

On Jan 20, 2016, at 9:00 AM, Vijay Vangapandu <Vi...@eharmony.com>> wrote:

Hi guys,
We recently migrated one of our user facing use cases to HBase and we are using Phoenix as query layer.
We managed to get a singles record in 30Ms.

Here is the response times breakdown.
75th - 29MS
95th - 43MS
99th - 76 MS

We have about 6Billion records in store and each row contains around 30 columns.

We are using HortenWorks configuration with few config tweaks.

We enabled the block cache.

Our use case is to get all records associated to user to render in
List/card view . Each user has on an average 5K records.

Our biggest bottleneck is serialization.
Above response times are for single record reads from HBase, but with serialization and processing cost its averaging at 80MS.


Sent from my iPhone

On Jan 20, 2016, at 6:27 AM, Riesland, Zack <Za...@sensus.com>> wrote:

I have a similar data pattern and 100ms response time is fairly consistent.

I’ve been trying hard to find the right set of configs to get closer to 10-20ms with no luck, but I’m finding that 100ms average is pretty reasonable.

From: Willem Conradie [mailto:willem.conradie@pbtgroup.co.za]
Sent: Wednesday, January 20, 2016 8:31 AM
To: jamestaylor@apache.org<ma...@apache.org>
Cc: user@phoenix.apache.org<ma...@phoenix.apache.org>
Subject: RE: Telco HBase POC

Hi James,

Thanks for being willing to assist.

This is what the input data record will look like (test data):

UserID:     12345678901
DateTime:   20151006124945
TXNID:      992194978
DeviceID:   123456789012345
IPAddress:  111.111.111.111
UsageArray: 10-4000:26272:1019324|0-4000:0:0|10-4000:25780:498309|420-4000:152:152|500-500:1258:2098|9001-9001:120:0|0-4000:0:0|502-4000:154:0|10-4000:73750:448374|420-4000:608:608|1-4000:364:550|358-4000:40:52
URIArray:   www.facebook.com|www.whatsapp.com|www.google.co.nz|ssl.gstatic.com|www.google.com


Unique key on record is “UserID,DateTime,TXNID”.

Read access pattern is as follows:
User queries by UserID, DateTime range to supply them with usage stats (derived from ‘UsageArray’) for websites (derived from ‘URIArray’) visited over their selected time range.

Just to recap the data volumes:
Expected data volume :  60 000 files per day
                                                  1 –to 10 MB per file
                                                  500 million records per day
                                                   500 GB total volume per day

I need  to be flexible in the amount of data stored. Initially it will be 5 days, but can increase to 30 days and possibly 90 days.

One concern I have (not founded in any way) with the phoenix client is whether it will be able to support data access for above queries within 100ms range.

Regards,
Willem

From: James Taylor [mailto:jamestaylor@apache.org]
Sent: 19 January 2016 10:07 PM
To: user <us...@phoenix.apache.org>>
Subject: Re: Telco HBase POC

Hi Willem,
Let us know how we can help as you start getting into this, in particular with your schema design based on your query requirements.
Thanks,
James

On Mon, Jan 18, 2016 at 8:50 AM, Pariksheet Barapatre <pb...@gmail.com>> wrote:

Hi Willem,

Use Phoenix bulk load. I guess your source is csv so phoenixcsvbulk loader can be used.

How frequently you want to load these files. If you can wait for certain interval to merge these files and map reduce will bulk load to Phoenix table.

Cheers
Pari

On 18-Jan-2016 4:17 pm, "Willem Conradie" <wi...@pbtgroup.co.za>> wrote:
Hi Pari,

My comments in blue.

Few notes from my experience :
1. Use bulk load rather than psql.py. Load larger files(merge) instead of small files.
Are you referring to native HBase bulk load or Phoenix MapReduce bulk load? Unfortunately we can’t change how the files are received from source. Must we pre-process to merge the files before running the bulk load utility?

2. Increase HBase block cache
3. Turn off HBase auto compaction
4. Select primary key correctly
5. Don't use salting . As table will be huge, your phoenix query will fork may scanners. Try something like hash on userid.
6. Define TTL to purge data periodically


Regards,
Willem

From: Pariksheet Barapatre [mailto:pbarapatre@gmail.com<ma...@gmail.com>]
Sent: 15 January 2016 03:17 PM
To: user@phoenix.apache.org<ma...@phoenix.apache.org>
Subject: Re: Telco HBase POC

Hi Willem,
Looking at your use case. Phoenix would be a handy client.
Few notes from my experience :
1. Use bulk load rather than psql.py. Load larger files(merge) instead of small files.
2. Increase HBase block cache
3. Turn off HBase auto compaction
4. Select primary key correctly
5. Don't use salting . As table will be huge, your phoenix query will fork may scanners. Try something like hash on userid.
6. Define TTL to purge data periodically

Cheers
Pari


On 15 January 2016 at 17:48, Pedro Gandola <pe...@gmail.com>> wrote:
Hi Willem,

Just to give you my short experience as phoenix user.

I'm using Phoenix4.4 on top of a HBase cluster where I keep 3 billion entries.
In our use case Phoenix is doing very well and it saved a lot of code complexity and time. If you guys have already decided that HBase is the way to go then having phoenix as a SQL layer it will help a lot, not only in terms of code simplicity but It will help you to create and maintain your indexes and views which can be hard&costly tasks using the plain HBase API. Joining tables it's just a simple SQL join :).

And there are a lot of more useful features that make your life easier with HBase.

In terms of performance and depending on the SLAs that you have you need to benchmark, however I think your main battles are going to be with HBase, JVM GCs, Network, FileSystem, etc...

I would say to give Phoenix a try, for sure.

Cheers
Pedro

On Fri, Jan 15, 2016 at 9:12 AM, Willem Conradie <wi...@pbtgroup.co.za>> wrote:



Hi,



I am currently consulting at a client with the following requirements.



They want to make available detailed data usage CDRs for customers to verify their data usage against the websites that they visited. In short this can be seen as an itemised bill for data usage.  The data is currently not loaded into a RDBMS due to the volumes of data involved. The proposed solution is to load the data into HBase, running on a HDP cluster, and make it available for querying by the subscribers.  It is critical to ensure low latency read access to the subscriber data, which possibly will be exposed to 25 million subscribers. We will be running a scaled down version first for a proof of concept with the intention of it becoming an operational data store.  Once the solution is functioning properly for the data usage CDRs other CDR types will be added, as such we need  to build a cost effective, scalable solution .



I am thinking of using Apache Phoenix for the following reasons:



1.      1. Current data loading into RDBMS is file based (CSV) via a staging server using the RDBMS file load drivers

2.      2.  Use Apache Phoenix   bin/psql.py script to mimic above process to load to HBase

3.       3. Expected data volume :  60 000 files per day
                                                  1 –to 10 MB per file
                                                  500 million records per day
                                                   500 GB total volume per day


4.        4. Use Apache Phoenix client for low latency data retrieval



Is Apache Phoenix a suitable candidate for this specific use case?



Regards,

Willem






--
Cheers,
Pari


Re: Telco HBase POC

Posted by Vijay Vangapandu <Vi...@eharmony.com>.
Hi guys,
We recently migrated one of our user-facing use cases to HBase and we are using Phoenix as the query layer.
We managed to get a single record in 30 ms.

Here is the response time breakdown:
75th - 29 ms
95th - 43 ms
99th - 76 ms

We have about 6 billion records in the store and each row contains around 30 columns.

We are using the Hortonworks configuration with a few config tweaks.

We enabled the block cache.

Our use case is to get all records associated with a user to render in a list/card view. Each user has on average 5K records.

Our biggest bottleneck is serialization.
The above response times are for single record reads from HBase, but with the serialization and processing cost it averages around 80 ms.
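
For reference, the read path is essentially a parameterized query like the sketch below; the table and column names are placeholders, not our actual schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class UserRecordsQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string and schema for illustration.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zookeeper-quorum");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT * FROM user_records WHERE user_id = ? ORDER BY created_at DESC")) {
            ps.setLong(1, 42L);  // one user, roughly 5K rows back
            int rows = 0;
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    rows++;  // the per-row mapping into model objects is where our extra cost is
                }
            }
            System.out.println("rows: " + rows);
        }
    }
}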


Sent from my iPhone

On Jan 20, 2016, at 6:27 AM, Riesland, Zack <Za...@sensus.com>> wrote:

I have a similar data pattern and 100ms response time is fairly consistent.

I’ve been trying hard to find the right set of configs to get closer to 10-20ms with no luck, but I’m finding that 100ms average is pretty reasonable.

From: Willem Conradie [mailto:willem.conradie@pbtgroup.co.za]
Sent: Wednesday, January 20, 2016 8:31 AM
To: jamestaylor@apache.org<ma...@apache.org>
Cc: user@phoenix.apache.org<ma...@phoenix.apache.org>
Subject: RE: Telco HBase POC

Hi James,

Thanks for being willing to assist.

This is what the input data record will look like (test data):

UserID:     12345678901
DateTime:   20151006124945
TXNID:      992194978
DeviceID:   123456789012345
IPAddress:  111.111.111.111
UsageArray: 10-4000:26272:1019324|0-4000:0:0|10-4000:25780:498309|420-4000:152:152|500-500:1258:2098|9001-9001:120:0|0-4000:0:0|502-4000:154:0|10-4000:73750:448374|420-4000:608:608|1-4000:364:550|358-4000:40:52
URIArray:   www.facebook.com|www.whatsapp.com|www.google.co.nz|ssl.gstatic.com|www.google.com


Unique key on record is “UserID,DateTime,TXNID”.

Read access pattern is as follows:
User queries by UserID, DateTime range to supply them with usage stats (derived from ‘UsageArray’) for websites (derived from ‘URIArray’) visited over their selected time range.

Just to recap the data volumes:
Expected data volume :  60 000 files per day
                                                  1 –to 10 MB per file
                                                  500 million records per day
                                                   500 GB total volume per day

I need  to be flexible in the amount of data stored. Initially it will be 5 days, but can increase to 30 days and possibly 90 days.

One concern I have (not founded in any way) with the phoenix client is whether it will be able to support data access for above queries within 100ms range.

Regards,
Willem

From: James Taylor [mailto:jamestaylor@apache.org]
Sent: 19 January 2016 10:07 PM
To: user <us...@phoenix.apache.org>>
Subject: Re: Telco HBase POC

Hi Willem,
Let us know how we can help as you start getting into this, in particular with your schema design based on your query requirements.
Thanks,
James

On Mon, Jan 18, 2016 at 8:50 AM, Pariksheet Barapatre <pb...@gmail.com>> wrote:

Hi Willem,

Use Phoenix bulk load. I guess your source is csv so phoenixcsvbulk loader can be used.

How frequently you want to load these files. If you can wait for certain interval to merge these files and map reduce will bulk load to Phoenix table.

Cheers
Pari
On 18-Jan-2016 4:17 pm, "Willem Conradie" <wi...@pbtgroup.co.za>> wrote:
Hi Pari,

My comments in blue.

Few notes from my experience :
1. Use bulk load rather than psql.py. Load larger files(merge) instead of small files.
Are you referring to native HBase bulk load or Phoenix MapReduce bulk load? Unfortunately we can’t change how the files are received from source. Must we pre-process to merge the files before running the bulk load utility?

2. Increase HBase block cache
3. Turn off HBase auto compaction
4. Select primary key correctly
5. Don't use salting . As table will be huge, your phoenix query will fork may scanners. Try something like hash on userid.
6. Define TTL to purge data periodically


Regards,
Willem

From: Pariksheet Barapatre [mailto:pbarapatre@gmail.com<ma...@gmail.com>]
Sent: 15 January 2016 03:17 PM
To: user@phoenix.apache.org<ma...@phoenix.apache.org>
Subject: Re: Telco HBase POC

Hi Willem,
Looking at your use case. Phoenix would be a handy client.
Few notes from my experience :
1. Use bulk load rather than psql.py. Load larger files(merge) instead of small files.
2. Increase HBase block cache
3. Turn off HBase auto compaction
4. Select primary key correctly
5. Don't use salting . As table will be huge, your phoenix query will fork may scanners. Try something like hash on userid.
6. Define TTL to purge data periodically

Cheers
Pari


On 15 January 2016 at 17:48, Pedro Gandola <pe...@gmail.com>> wrote:
Hi Willem,

Just to give you my short experience as phoenix user.

I'm using Phoenix4.4 on top of a HBase cluster where I keep 3 billion entries.
In our use case Phoenix is doing very well and it saved a lot of code complexity and time. If you guys have already decided that HBase is the way to go then having phoenix as a SQL layer it will help a lot, not only in terms of code simplicity but It will help you to create and maintain your indexes and views which can be hard&costly tasks using the plain HBase API. Joining tables it's just a simple SQL join :).

And there are a lot of more useful features that make your life easier with HBase.

In terms of performance and depending on the SLAs that you have you need to benchmark, however I think your main battles are going to be with HBase, JVM GCs, Network, FileSystem, etc...

I would say to give Phoenix a try, for sure.

Cheers
Pedro

On Fri, Jan 15, 2016 at 9:12 AM, Willem Conradie <wi...@pbtgroup.co.za>> wrote:



Hi,



I am currently consulting at a client with the following requirements.



They want to make available detailed data usage CDRs for customers to verify their data usage against the websites that they visited. In short this can be seen as an itemised bill for data usage.  The data is currently not loaded into a RDBMS due to the volumes of data involved. The proposed solution is to load the data into HBase, running on a HDP cluster, and make it available for querying by the subscribers.  It is critical to ensure low latency read access to the subscriber data, which possibly will be exposed to 25 million subscribers. We will be running a scaled down version first for a proof of concept with the intention of it becoming an operational data store.  Once the solution is functioning properly for the data usage CDRs other CDR types will be added, as such we need  to build a cost effective, scalable solution .



I am thinking of using Apache Phoenix for the following reasons:



1.      1. Current data loading into RDBMS is file based (CSV) via a staging server using the RDBMS file load drivers

2.      2.  Use Apache Phoenix   bin/psql.py script to mimic above process to load to HBase

3.       3. Expected data volume :  60 000 files per day
                                                  1 –to 10 MB per file
                                                  500 million records per day
                                                   500 GB total volume per day


4.        4. Use Apache Phoenix client for low latency data retrieval



Is Apache Phoenix a suitable candidate for this specific use case?



Regards,

Willem





--
Cheers,
Pari


RE: Telco HBase POC

Posted by "Riesland, Zack" <Za...@sensus.com>.
I have a similar data pattern, and a 100 ms response time is fairly consistent.

I’ve been trying hard to find the right set of configs to get closer to 10-20 ms with no luck, but I’m finding that a 100 ms average is pretty reasonable.

From: Willem Conradie [mailto:willem.conradie@pbtgroup.co.za]
Sent: Wednesday, January 20, 2016 8:31 AM
To: jamestaylor@apache.org
Cc: user@phoenix.apache.org
Subject: RE: Telco HBase POC

Hi James,

Thanks for being willing to assist.

This is what the input data record will look like (test data):

UserID:     12345678901
DateTime:   20151006124945
TXNID:      992194978
DeviceID:   123456789012345
IPAddress:  111.111.111.111
UsageArray: 10-4000:26272:1019324|0-4000:0:0|10-4000:25780:498309|420-4000:152:152|500-500:1258:2098|9001-9001:120:0|0-4000:0:0|502-4000:154:0|10-4000:73750:448374|420-4000:608:608|1-4000:364:550|358-4000:40:52
URIArray:   www.facebook.com|www.whatsapp.com|www.google.co.nz|ssl.gstatic.com|www.google.com


Unique key on record is “UserID,DateTime,TXNID”.

Read access pattern is as follows:
User queries by UserID, DateTime range to supply them with usage stats (derived from ‘UsageArray’) for websites (derived from ‘URIArray’) visited over their selected time range.

Just to recap the data volumes:
Expected data volume :  60 000 files per day
                                                  1 –to 10 MB per file
                                                  500 million records per day
                                                   500 GB total volume per day

I need  to be flexible in the amount of data stored. Initially it will be 5 days, but can increase to 30 days and possibly 90 days.

One concern I have (not founded in any way) with the phoenix client is whether it will be able to support data access for above queries within 100ms range.

Regards,
Willem

From: James Taylor [mailto:jamestaylor@apache.org]
Sent: 19 January 2016 10:07 PM
To: user <us...@phoenix.apache.org>>
Subject: Re: Telco HBase POC

Hi Willem,
Let us know how we can help as you start getting into this, in particular with your schema design based on your query requirements.
Thanks,
James

On Mon, Jan 18, 2016 at 8:50 AM, Pariksheet Barapatre <pb...@gmail.com>> wrote:

Hi Willem,

Use Phoenix bulk load. I guess your source is csv so phoenixcsvbulk loader can be used.

How frequently you want to load these files. If you can wait for certain interval to merge these files and map reduce will bulk load to Phoenix table.

Cheers
Pari
On 18-Jan-2016 4:17 pm, "Willem Conradie" <wi...@pbtgroup.co.za>> wrote:
Hi Pari,

My comments in blue.

Few notes from my experience :
1. Use bulk load rather than psql.py. Load larger files(merge) instead of small files.
Are you referring to native HBase bulk load or Phoenix MapReduce bulk load? Unfortunately we can’t change how the files are received from source. Must we pre-process to merge the files before running the bulk load utility?

2. Increase HBase block cache
3. Turn off HBase auto compaction
4. Select primary key correctly
5. Don't use salting . As table will be huge, your phoenix query will fork may scanners. Try something like hash on userid.
6. Define TTL to purge data periodically


Regards,
Willem

From: Pariksheet Barapatre [mailto:pbarapatre@gmail.com<ma...@gmail.com>]
Sent: 15 January 2016 03:17 PM
To: user@phoenix.apache.org<ma...@phoenix.apache.org>
Subject: Re: Telco HBase POC

Hi Willem,
Looking at your use case. Phoenix would be a handy client.
Few notes from my experience :
1. Use bulk load rather than psql.py. Load larger files(merge) instead of small files.
2. Increase HBase block cache
3. Turn off HBase auto compaction
4. Select primary key correctly
5. Don't use salting . As table will be huge, your phoenix query will fork may scanners. Try something like hash on userid.
6. Define TTL to purge data periodically

Cheers
Pari


On 15 January 2016 at 17:48, Pedro Gandola <pe...@gmail.com>> wrote:
Hi Willem,

Just to give you my short experience as phoenix user.

I'm using Phoenix4.4 on top of a HBase cluster where I keep 3 billion entries.
In our use case Phoenix is doing very well and it saved a lot of code complexity and time. If you guys have already decided that HBase is the way to go then having phoenix as a SQL layer it will help a lot, not only in terms of code simplicity but It will help you to create and maintain your indexes and views which can be hard&costly tasks using the plain HBase API. Joining tables it's just a simple SQL join :).

And there are a lot of more useful features that make your life easier with HBase.

In terms of performance and depending on the SLAs that you have you need to benchmark, however I think your main battles are going to be with HBase, JVM GCs, Network, FileSystem, etc...

I would say to give Phoenix a try, for sure.

Cheers
Pedro

On Fri, Jan 15, 2016 at 9:12 AM, Willem Conradie <wi...@pbtgroup.co.za>> wrote:



Hi,



I am currently consulting at a client with the following requirements.



They want to make available detailed data usage CDRs for customers to verify their data usage against the websites that they visited. In short this can be seen as an itemised bill for data usage.  The data is currently not loaded into a RDBMS due to the volumes of data involved. The proposed solution is to load the data into HBase, running on a HDP cluster, and make it available for querying by the subscribers.  It is critical to ensure low latency read access to the subscriber data, which possibly will be exposed to 25 million subscribers. We will be running a scaled down version first for a proof of concept with the intention of it becoming an operational data store.  Once the solution is functioning properly for the data usage CDRs other CDR types will be added, as such we need  to build a cost effective, scalable solution .



I am thinking of using Apache Phoenix for the following reasons:



1.      1. Current data loading into RDBMS is file based (CSV) via a staging server using the RDBMS file load drivers

2.      2.  Use Apache Phoenix   bin/psql.py script to mimic above process to load to HBase

3.       3. Expected data volume :  60 000 files per day
                                                  1 –to 10 MB per file
                                                  500 million records per day
                                                   500 GB total volume per day


4.        4. Use Apache Phoenix client for low latency data retrieval



Is Apache Phoenix a suitable candidate for this specific use case?



Regards,

Willem





--
Cheers,
Pari


RE: Telco HBase POC

Posted by Willem Conradie <wi...@pbtgroup.co.za>.
Hi James,

Thanks for being willing to assist.

This is what the input data record will look like (test data):

UserID:     12345678901
DateTime:   20151006124945
TXNID:      992194978
DeviceID:   123456789012345
IPAddress:  111.111.111.111
UsageArray: 10-4000:26272:1019324|0-4000:0:0|10-4000:25780:498309|420-4000:152:152|500-500:1258:2098|9001-9001:120:0|0-4000:0:0|502-4000:154:0|10-4000:73750:448374|420-4000:608:608|1-4000:364:550|358-4000:40:52
URIArray:   www.facebook.com|www.whatsapp.com|www.google.co.nz|ssl.gstatic.com|www.google.com


Unique key on record is “UserID,DateTime,TXNID”.

Read access pattern is as follows:
Users query by UserID and a DateTime range to get usage stats (derived from ‘UsageArray’) for the websites (derived from ‘URIArray’) visited over their selected time range.
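
In Phoenix terms I am picturing something like the sketch below. This is a rough draft only; the data types, exact column set and connection details are assumptions I still need to validate, but the intent is a composite primary key leading with UserID so the query above becomes a narrow range scan.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class UsageCdrSchemaSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-quorum")) {
            try (Statement stmt = conn.createStatement()) {
                // Composite primary key leading with USERID so the
                // UserID + DateTime range query stays a narrow range scan.
                stmt.executeUpdate(
                    "CREATE TABLE IF NOT EXISTS USAGE_CDR ("
                  + " USERID     VARCHAR NOT NULL,"
                  + " DATETIME   VARCHAR NOT NULL,"   // could also be a DATE/UNSIGNED_LONG
                  + " TXNID      BIGINT  NOT NULL,"
                  + " DEVICEID   VARCHAR,"
                  + " IPADDRESS  VARCHAR,"
                  + " USAGEARRAY VARCHAR,"
                  + " URIARRAY   VARCHAR,"
                  + " CONSTRAINT PK PRIMARY KEY (USERID, DATETIME, TXNID))");
            }
            // The subscriber-facing query: one user's usage over a selected time range.
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT DATETIME, USAGEARRAY, URIARRAY FROM USAGE_CDR"
                  + " WHERE USERID = ? AND DATETIME BETWEEN ? AND ?")) {
                ps.setString(1, "12345678901");
                ps.setString(2, "20151001000000");
                ps.setString(3, "20151007235959");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("DATETIME") + " " + rs.getString("URIARRAY"));
                    }
                }
            }
        }
    }
}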

Just to recap the data volumes:
Expected data volume: 60 000 files per day
                      1 to 10 MB per file
                      500 million records per day
                      500 GB total volume per day

I need to be flexible in the amount of data stored. Initially it will be 5 days, but this can increase to 30 days and possibly 90 days.

One concern I have (not founded on anything concrete) with the Phoenix client is whether it will be able to support data access for the above queries within the 100 ms range.

Regards,
Willem

From: James Taylor [mailto:jamestaylor@apache.org]
Sent: 19 January 2016 10:07 PM
To: user <us...@phoenix.apache.org>
Subject: Re: Telco HBase POC

Hi Willem,
Let us know how we can help as you start getting into this, in particular with your schema design based on your query requirements.
Thanks,
James

On Mon, Jan 18, 2016 at 8:50 AM, Pariksheet Barapatre <pb...@gmail.com>> wrote:

Hi Willem,

Use Phoenix bulk load. I guess your source is csv so phoenixcsvbulk loader can be used.

How frequently you want to load these files. If you can wait for certain interval to merge these files and map reduce will bulk load to Phoenix table.

Cheers
Pari
On 18-Jan-2016 4:17 pm, "Willem Conradie" <wi...@pbtgroup.co.za>> wrote:
Hi Pari,

My comments in blue.

Few notes from my experience :
1. Use bulk load rather than psql.py. Load larger files(merge) instead of small files.
Are you referring to native HBase bulk load or Phoenix MapReduce bulk load? Unfortunately we can’t change how the files are received from source. Must we pre-process to merge the files before running the bulk load utility?

2. Increase HBase block cache
3. Turn off HBase auto compaction
4. Select primary key correctly
5. Don't use salting . As table will be huge, your phoenix query will fork may scanners. Try something like hash on userid.
6. Define TTL to purge data periodically


Regards,
Willem

From: Pariksheet Barapatre [mailto:pbarapatre@gmail.com<ma...@gmail.com>]
Sent: 15 January 2016 03:17 PM
To: user@phoenix.apache.org<ma...@phoenix.apache.org>
Subject: Re: Telco HBase POC

Hi Willem,
Looking at your use case. Phoenix would be a handy client.
Few notes from my experience :
1. Use bulk load rather than psql.py. Load larger files(merge) instead of small files.
2. Increase HBase block cache
3. Turn off HBase auto compaction
4. Select primary key correctly
5. Don't use salting . As table will be huge, your phoenix query will fork may scanners. Try something like hash on userid.
6. Define TTL to purge data periodically

Cheers
Pari


On 15 January 2016 at 17:48, Pedro Gandola <pe...@gmail.com>> wrote:
Hi Willem,

Just to give you my short experience as phoenix user.

I'm using Phoenix4.4 on top of a HBase cluster where I keep 3 billion entries.
In our use case Phoenix is doing very well and it saved a lot of code complexity and time. If you guys have already decided that HBase is the way to go then having phoenix as a SQL layer it will help a lot, not only in terms of code simplicity but It will help you to create and maintain your indexes and views which can be hard&costly tasks using the plain HBase API. Joining tables it's just a simple SQL join :).

And there are a lot of more useful features that make your life easier with HBase.

In terms of performance and depending on the SLAs that you have you need to benchmark, however I think your main battles are going to be with HBase, JVM GCs, Network, FileSystem, etc...

I would say to give Phoenix a try, for sure.

Cheers
Pedro

On Fri, Jan 15, 2016 at 9:12 AM, Willem Conradie <wi...@pbtgroup.co.za>> wrote:



Hi,



I am currently consulting at a client with the following requirements.



They want to make available detailed data usage CDRs for customers to verify their data usage against the websites that they visited. In short this can be seen as an itemised bill for data usage.  The data is currently not loaded into a RDBMS due to the volumes of data involved. The proposed solution is to load the data into HBase, running on a HDP cluster, and make it available for querying by the subscribers.  It is critical to ensure low latency read access to the subscriber data, which possibly will be exposed to 25 million subscribers. We will be running a scaled down version first for a proof of concept with the intention of it becoming an operational data store.  Once the solution is functioning properly for the data usage CDRs other CDR types will be added, as such we need  to build a cost effective, scalable solution .



I am thinking of using Apache Phoenix for the following reasons:



1.      1. Current data loading into RDBMS is file based (CSV) via a staging server using the RDBMS file load drivers

2.      2.  Use Apache Phoenix   bin/psql.py script to mimic above process to load to HBase

3.       3. Expected data volume :  60 000 files per day
                                                  1 –to 10 MB per file
                                                  500 million records per day
                                                   500 GB total volume per day


4.        4. Use Apache Phoenix client for low latency data retrieval



Is Apache Phoenix a suitable candidate for this specific use case?



Regards,

Willem





--
Cheers,
Pari


Re: Telco HBase POC

Posted by James Taylor <ja...@apache.org>.
Hi Willem,
Let us know how we can help as you start getting into this, in particular
with your schema design based on your query requirements.
Thanks,
James

On Mon, Jan 18, 2016 at 8:50 AM, Pariksheet Barapatre <pb...@gmail.com>
wrote:

> Hi Willem,
>
> Use Phoenix bulk load. I guess your source is csv so phoenixcsvbulk loader
> can be used.
>
> How frequently you want to load these files. If you can wait for certain
> interval to merge these files and map reduce will bulk load to Phoenix
> table.
>
> Cheers
> Pari
> On 18-Jan-2016 4:17 pm, "Willem Conradie" <wi...@pbtgroup.co.za>
> wrote:
>
>> Hi Pari,
>>
>>
>>
>> My comments in blue.
>>
>>
>>
>> Few notes from my experience :
>>
>> 1. Use bulk load rather than psql.py. Load larger files(merge) instead of
>> small files.
>>
>> Are you referring to native HBase bulk load or Phoenix MapReduce bulk
>> load? Unfortunately we can’t change how the files are received from source.
>> Must we pre-process to merge the files before running the bulk load
>> utility?
>>
>>
>>
>> 2. Increase HBase block cache
>>
>> 3. Turn off HBase auto compaction
>>
>> 4. Select primary key correctly
>> 5. Don't use salting . As table will be huge, your phoenix query will
>> fork may scanners. Try something like hash on userid.
>> 6. Define TTL to purge data periodically
>>
>>
>>
>>
>>
>> Regards,
>>
>> Willem
>>
>>
>>
>> *From:* Pariksheet Barapatre [mailto:pbarapatre@gmail.com]
>> *Sent:* 15 January 2016 03:17 PM
>> *To:* user@phoenix.apache.org
>> *Subject:* Re: Telco HBase POC
>>
>>
>>
>> Hi Willem,
>>
>> Looking at your use case. Phoenix would be a handy client.
>>
>> Few notes from my experience :
>>
>> 1. Use bulk load rather than psql.py. Load larger files(merge) instead of
>> small files.
>>
>> 2. Increase HBase block cache
>>
>> 3. Turn off HBase auto compaction
>>
>> 4. Select primary key correctly
>>
>> 5. Don't use salting . As table will be huge, your phoenix query will
>> fork may scanners. Try something like hash on userid.
>>
>> 6. Define TTL to purge data periodically
>>
>>
>>
>> Cheers
>>
>> Pari
>>
>>
>>
>>
>>
>> On 15 January 2016 at 17:48, Pedro Gandola <pe...@gmail.com>
>> wrote:
>>
>> Hi Willem,
>>
>> Just to give you my short experience as phoenix user.
>>
>> I'm using Phoenix4.4 on top of a HBase cluster where I keep 3 billion
>> entries.
>>
>> In our use case Phoenix is doing very well and it saved a lot of code
>> complexity and time. If you guys have already decided that HBase is the way
>> to go then having phoenix as a SQL layer it will help a lot, not only in
>> terms of code simplicity but It will help you to create and maintain your
>> indexes and views which can be hard&costly tasks using the plain HBase API.
>> Joining tables it's just a simple SQL join :).
>>
>>
>>
>> And there are a lot of more useful features that make your life easier
>> with HBase.
>>
>> In terms of performance and depending on the SLAs that you have you need
>> to benchmark, however I think your main battles are going to be with HBase,
>> JVM GCs, Network, FileSystem, etc...
>>
>>
>> I would say to give Phoenix a try, for sure.
>>
>> Cheers
>> Pedro
>>
>>
>>
>> On Fri, Jan 15, 2016 at 9:12 AM, Willem Conradie <
>> willem.conradie@pbtgroup.co.za> wrote:
>>
>>
>>
>> Hi,
>>
>>
>>
>> I am currently consulting at a client with the following requirements.
>>
>>
>>
>> They want to make available detailed data usage CDRs for customers to
>> verify their data usage against the websites that they visited. In short
>> this can be seen as an itemised bill for data usage.  The data is currently
>> not loaded into a RDBMS due to the volumes of data involved. The proposed
>> solution is to load the data into HBase, running on a HDP cluster, and make
>> it available for querying by the subscribers.  It is critical to ensure low
>> latency read access to the subscriber data, which possibly will be exposed
>> to 25 million subscribers. We will be running a scaled down version first
>> for a proof of concept with the intention of it becoming an operational
>> data store.  Once the solution is functioning properly for the data usage
>> CDRs other CDR types will be added, as such we need  to build a cost
>> effective, scalable solution .
>>
>>
>>
>> I am thinking of using Apache Phoenix for the following reasons:
>>
>>
>>
>> 1.      1. Current data loading into RDBMS is file based (CSV) via a
>> staging server using the RDBMS file load drivers
>>
>> 2.      2.  Use Apache Phoenix   bin/psql.py script to mimic above
>> process to load to HBase
>>
>> 3.       3. Expected data volume :  60 000 files per day
>>                                                   1 –to 10 MB per file
>>                                                   500 million records per
>> day
>>                                                    500 GB total volume
>> per day
>>
>>
>> 4.        4. Use Apache Phoenix client for low latency data retrieval
>>
>>
>>
>> Is Apache Phoenix a suitable candidate for this specific use case?
>>
>>
>>
>> Regards,
>>
>> Willem
>>
>>
>>
>>
>>
>>
>>
>>
>> --
>>
>> Cheers,
>>
>> Pari
>>
>

RE: Telco HBase POC

Posted by Pariksheet Barapatre <pb...@gmail.com>.
Hi Willem,

Use Phoenix bulk load. I guess your source is CSV, so the Phoenix CSV bulk loader can be used.

How frequently do you want to load these files? If you can wait for a certain interval to merge the files, the MapReduce job will bulk load them into the Phoenix table.
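
If it helps, driving the MapReduce loader from code looks roughly like the sketch below. The table name, input path and ZooKeeper quorum are placeholders; the same thing can also be launched from the command line with hadoop jar and org.apache.phoenix.mapreduce.CsvBulkLoadTool.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.phoenix.mapreduce.CsvBulkLoadTool;

public class BulkLoadDriver {
    public static void main(String[] args) throws Exception {
        // Placeholder table name, HDFS input path and ZooKeeper quorum.
        int exitCode = ToolRunner.run(new Configuration(), new CsvBulkLoadTool(), new String[] {
            "--table", "USAGE_CDR",
            "--input", "/staging/cdr/2016-01-15/merged.csv",
            "--zookeeper", "zk1,zk2,zk3"
        });
        System.exit(exitCode);
    }
}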

Cheers
Pari
On 18-Jan-2016 4:17 pm, "Willem Conradie" <wi...@pbtgroup.co.za>
wrote:

> Hi Pari,
>
>
>
> My comments in blue.
>
>
>
> Few notes from my experience :
>
> 1. Use bulk load rather than psql.py. Load larger files(merge) instead of
> small files.
>
> Are you referring to native HBase bulk load or Phoenix MapReduce bulk
> load? Unfortunately we can’t change how the files are received from source.
> Must we pre-process to merge the files before running the bulk load
> utility?
>
>
>
> 2. Increase HBase block cache
>
> 3. Turn off HBase auto compaction
>
> 4. Select primary key correctly
> 5. Don't use salting . As table will be huge, your phoenix query will fork
> may scanners. Try something like hash on userid.
> 6. Define TTL to purge data periodically
>
>
>
>
>
> Regards,
>
> Willem
>
>
>
> *From:* Pariksheet Barapatre [mailto:pbarapatre@gmail.com]
> *Sent:* 15 January 2016 03:17 PM
> *To:* user@phoenix.apache.org
> *Subject:* Re: Telco HBase POC
>
>
>
> Hi Willem,
>
> Looking at your use case. Phoenix would be a handy client.
>
> Few notes from my experience :
>
> 1. Use bulk load rather than psql.py. Load larger files(merge) instead of
> small files.
>
> 2. Increase HBase block cache
>
> 3. Turn off HBase auto compaction
>
> 4. Select primary key correctly
>
> 5. Don't use salting . As table will be huge, your phoenix query will fork
> may scanners. Try something like hash on userid.
>
> 6. Define TTL to purge data periodically
>
>
>
> Cheers
>
> Pari
>
>
>
>
>
> On 15 January 2016 at 17:48, Pedro Gandola <pe...@gmail.com>
> wrote:
>
> Hi Willem,
>
> Just to give you my short experience as phoenix user.
>
> I'm using Phoenix4.4 on top of a HBase cluster where I keep 3 billion
> entries.
>
> In our use case Phoenix is doing very well and it saved a lot of code
> complexity and time. If you guys have already decided that HBase is the way
> to go then having phoenix as a SQL layer it will help a lot, not only in
> terms of code simplicity but It will help you to create and maintain your
> indexes and views which can be hard&costly tasks using the plain HBase API.
> Joining tables it's just a simple SQL join :).
>
>
>
> And there are a lot of more useful features that make your life easier
> with HBase.
>
> In terms of performance and depending on the SLAs that you have you need
> to benchmark, however I think your main battles are going to be with HBase,
> JVM GCs, Network, FileSystem, etc...
>
>
> I would say to give Phoenix a try, for sure.
>
> Cheers
> Pedro
>
>
>
> On Fri, Jan 15, 2016 at 9:12 AM, Willem Conradie <
> willem.conradie@pbtgroup.co.za> wrote:
>
>
>
> Hi,
>
>
>
> I am currently consulting at a client with the following requirements.
>
>
>
> They want to make available detailed data usage CDRs for customers to
> verify their data usage against the websites that they visited. In short
> this can be seen as an itemised bill for data usage.  The data is currently
> not loaded into a RDBMS due to the volumes of data involved. The proposed
> solution is to load the data into HBase, running on a HDP cluster, and make
> it available for querying by the subscribers.  It is critical to ensure low
> latency read access to the subscriber data, which possibly will be exposed
> to 25 million subscribers. We will be running a scaled down version first
> for a proof of concept with the intention of it becoming an operational
> data store.  Once the solution is functioning properly for the data usage
> CDRs other CDR types will be added, as such we need  to build a cost
> effective, scalable solution .
>
>
>
> I am thinking of using Apache Phoenix for the following reasons:
>
>
>
> 1.      1. Current data loading into RDBMS is file based (CSV) via a
> staging server using the RDBMS file load drivers
>
> 2.      2.  Use Apache Phoenix   bin/psql.py script to mimic above
> process to load to HBase
>
> 3.       3. Expected data volume :  60 000 files per day
>                                                   1 –to 10 MB per file
>                                                   500 million records per
> day
>                                                    500 GB total volume per
> day
>
>
> 4.        4. Use Apache Phoenix client for low latency data retrieval
>
>
>
> Is Apache Phoenix a suitable candidate for this specific use case?
>
>
>
> Regards,
>
> Willem
>
>
>
>
>
>
>
>
> --
>
> Cheers,
>
> Pari
>

RE: Telco HBase POC

Posted by Willem Conradie <wi...@pbtgroup.co.za>.
Hi Pari,

My comments in blue.

Few notes from my experience :
1. Use bulk load rather than psql.py. Load larger files(merge) instead of small files.
Are you referring to native HBase bulk load or Phoenix MapReduce bulk load? Unfortunately we can’t change how the files are received from source. Must we pre-process to merge the files before running the bulk load utility?

2. Increase HBase block cache
3. Turn off HBase auto compaction
4. Select primary key correctly
5. Don't use salting . As table will be huge, your phoenix query will fork may scanners. Try something like hash on userid.
6. Define TTL to purge data periodically


Regards,
Willem

From: Pariksheet Barapatre [mailto:pbarapatre@gmail.com]
Sent: 15 January 2016 03:17 PM
To: user@phoenix.apache.org
Subject: Re: Telco HBase POC

Hi Willem,
Looking at your use case. Phoenix would be a handy client.
Few notes from my experience :
1. Use bulk load rather than psql.py. Load larger files(merge) instead of small files.
2. Increase HBase block cache
3. Turn off HBase auto compaction
4. Select primary key correctly
5. Don't use salting . As table will be huge, your phoenix query will fork may scanners. Try something like hash on userid.
6. Define TTL to purge data periodically

Cheers
Pari


On 15 January 2016 at 17:48, Pedro Gandola <pe...@gmail.com>> wrote:
Hi Willem,

Just to give you my short experience as phoenix user.

I'm using Phoenix4.4 on top of a HBase cluster where I keep 3 billion entries.
In our use case Phoenix is doing very well and it saved a lot of code complexity and time. If you guys have already decided that HBase is the way to go then having phoenix as a SQL layer it will help a lot, not only in terms of code simplicity but It will help you to create and maintain your indexes and views which can be hard&costly tasks using the plain HBase API. Joining tables it's just a simple SQL join :).

And there are a lot of more useful features that make your life easier with HBase.

In terms of performance and depending on the SLAs that you have you need to benchmark, however I think your main battles are going to be with HBase, JVM GCs, Network, FileSystem, etc...

I would say to give Phoenix a try, for sure.

Cheers
Pedro

On Fri, Jan 15, 2016 at 9:12 AM, Willem Conradie <wi...@pbtgroup.co.za>> wrote:



Hi,



I am currently consulting at a client with the following requirements.



They want to make available detailed data usage CDRs for customers to verify their data usage against the websites that they visited. In short this can be seen as an itemised bill for data usage.  The data is currently not loaded into a RDBMS due to the volumes of data involved. The proposed solution is to load the data into HBase, running on a HDP cluster, and make it available for querying by the subscribers.  It is critical to ensure low latency read access to the subscriber data, which possibly will be exposed to 25 million subscribers. We will be running a scaled down version first for a proof of concept with the intention of it becoming an operational data store.  Once the solution is functioning properly for the data usage CDRs other CDR types will be added, as such we need  to build a cost effective, scalable solution .



I am thinking of using Apache Phoenix for the following reasons:



1.      1. Current data loading into RDBMS is file based (CSV) via a staging server using the RDBMS file load drivers

2.      2.  Use Apache Phoenix   bin/psql.py script to mimic above process to load to HBase

3.       3. Expected data volume :  60 000 files per day
                                                  1 –to 10 MB per file
                                                  500 million records per day
                                                   500 GB total volume per day


4.        4. Use Apache Phoenix client for low latency data retrieval



Is Apache Phoenix a suitable candidate for this specific use case?



Regards,

Willem





--
Cheers,
Pari

Re: Telco HBase POC

Posted by Pariksheet Barapatre <pb...@gmail.com>.
Hi Willem,

Looking at your use case, Phoenix would be a handy client.

A few notes from my experience:
1. Use bulk load rather than psql.py. Load larger (merged) files instead of small files.
2. Increase the HBase block cache.
3. Turn off HBase auto compaction.
4. Select the primary key correctly.
5. Don't use salting. As the table will be huge, your Phoenix query will fork many scanners. Try something like a hash on the userid instead.
6. Define a TTL to purge data periodically.
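
Along the lines of points 4-6, a rough DDL sketch is below (illustrative names only; treat it as an idea, not a tested schema). Instead of SALT_BUCKETS, a small deterministic hash of the userid is kept as the leading key column, and the TTL option lets HBase purge old cells.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CdrTableSketch {
    public static void main(String[] args) throws Exception {
        // Illustrative only: a leading USERHASH column spreads writes without SALT_BUCKETS.
        // Both the loader and the queries must compute the same hash, e.g.
        // WHERE USERHASH = ? AND USERID = ? AND DATETIME BETWEEN ? AND ?
        String ddl =
            "CREATE TABLE IF NOT EXISTS USAGE_CDR ("
          + " USERHASH   INTEGER NOT NULL,"   // e.g. Math.abs(userId.hashCode() % 32)
          + " USERID     VARCHAR NOT NULL,"
          + " DATETIME   VARCHAR NOT NULL,"
          + " TXNID      BIGINT  NOT NULL,"
          + " USAGEARRAY VARCHAR,"
          + " URIARRAY   VARCHAR,"
          + " CONSTRAINT PK PRIMARY KEY (USERHASH, USERID, DATETIME, TXNID)"
          + ") TTL=432000";                   // 5 days, in seconds
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3");
             Statement stmt = conn.createStatement()) {
            stmt.executeUpdate(ddl);
        }
    }
}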

Cheers
Pari


On 15 January 2016 at 17:48, Pedro Gandola <pe...@gmail.com> wrote:

> Hi Willem,
>
> Just to give you my short experience as phoenix user.
>
> I'm using Phoenix4.4 on top of a HBase cluster where I keep 3 billion
> entries.
>
> In our use case Phoenix is doing very well and it saved a lot of code
> complexity and time. If you guys have already decided that HBase is the way
> to go then having phoenix as a SQL layer it will help a lot, not only in
> terms of code simplicity but It will help you to create and maintain your
> indexes and views which can be hard&costly tasks using the plain HBase API.
> Joining tables it's just a simple SQL join :).
>
> And there are a lot of more useful features that make your life easier
> with HBase.
>
> In terms of performance and depending on the SLAs that you have you need
> to benchmark, however I think your main battles are going to be with HBase,
> JVM GCs, Network, FileSystem, etc...
>
> I would say to give Phoenix a try, for sure.
>
> Cheers
> Pedro
>
> On Fri, Jan 15, 2016 at 9:12 AM, Willem Conradie <
> willem.conradie@pbtgroup.co.za> wrote:
>
>>
>> Hi,
>>
>>
>>
>> I am currently consulting at a client with the following requirements.
>>
>>
>>
>> They want to make available detailed data usage CDRs for customers to
>> verify their data usage against the websites that they visited. In short
>> this can be seen as an itemised bill for data usage.  The data is currently
>> not loaded into a RDBMS due to the volumes of data involved. The proposed
>> solution is to load the data into HBase, running on a HDP cluster, and make
>> it available for querying by the subscribers.  It is critical to ensure low
>> latency read access to the subscriber data, which possibly will be exposed
>> to 25 million subscribers. We will be running a scaled down version first
>> for a proof of concept with the intention of it becoming an operational
>> data store.  Once the solution is functioning properly for the data usage
>> CDRs other CDR types will be added, as such we need  to build a cost
>> effective, scalable solution .
>>
>>
>>
>> I am thinking of using Apache Phoenix for the following reasons:
>>
>>
>>
>> 1.      1. Current data loading into RDBMS is file based (CSV) via a
>> staging server using the RDBMS file load drivers
>>
>> 2.      2.  Use Apache Phoenix   bin/psql.py script to mimic above
>> process to load to HBase
>>
>> 3.       3. Expected data volume :  60 000 files per day
>>                                                   1 –to 10 MB per file
>>                                                   500 million records per
>> day
>>                                                    500 GB total volume
>> per day
>>
>>
>> 4.        4. Use Apache Phoenix client for low latency data retrieval
>>
>>
>>
>> Is Apache Phoenix a suitable candidate for this specific use case?
>>
>>
>>
>> Regards,
>>
>> Willem
>>
>>
>


-- 
Cheers,
Pari

RE: Telco HBase POC

Posted by Willem Conradie <wi...@pbtgroup.co.za>.
Thanks for the prompt reply.
From: Pedro Gandola [mailto:pedro.gandola@gmail.com]
Sent: 15 January 2016 02:19 PM
To: user@phoenix.apache.org
Subject: Re: Telco HBase POC

Hi Willem,

Just to give you my short experience as phoenix user.

I'm using Phoenix4.4 on top of a HBase cluster where I keep 3 billion entries.
In our use case Phoenix is doing very well and it saved a lot of code complexity and time. If you guys have already decided that HBase is the way to go then having phoenix as a SQL layer it will help a lot, not only in terms of code simplicity but It will help you to create and maintain your indexes and views which can be hard&costly tasks using the plain HBase API. Joining tables it's just a simple SQL join :).

And there are a lot of more useful features that make your life easier with HBase.

In terms of performance and depending on the SLAs that you have you need to benchmark, however I think your main battles are going to be with HBase, JVM GCs, Network, FileSystem, etc...

I would say to give Phoenix a try, for sure.

Cheers
Pedro

On Fri, Jan 15, 2016 at 9:12 AM, Willem Conradie <wi...@pbtgroup.co.za>> wrote:



Hi,



I am currently consulting at a client with the following requirements.



They want to make available detailed data usage CDRs for customers to verify their data usage against the websites that they visited. In short this can be seen as an itemised bill for data usage.  The data is currently not loaded into a RDBMS due to the volumes of data involved. The proposed solution is to load the data into HBase, running on a HDP cluster, and make it available for querying by the subscribers.  It is critical to ensure low latency read access to the subscriber data, which possibly will be exposed to 25 million subscribers. We will be running a scaled down version first for a proof of concept with the intention of it becoming an operational data store.  Once the solution is functioning properly for the data usage CDRs other CDR types will be added, as such we need  to build a cost effective, scalable solution .



I am thinking of using Apache Phoenix for the following reasons:



1.      1. Current data loading into RDBMS is file based (CSV) via a staging server using the RDBMS file load drivers

2.      2.  Use Apache Phoenix   bin/psql.py script to mimic above process to load to HBase

3.       3. Expected data volume :  60 000 files per day
                                                  1 –to 10 MB per file
                                                  500 million records per day
                                                   500 GB total volume per day


4.        4. Use Apache Phoenix client for low latency data retrieval



Is Apache Phoenix a suitable candidate for this specific use case?



Regards,

Willem



Re: Telco HBase POC

Posted by Pedro Gandola <pe...@gmail.com>.
Hi Willem,

Just to give you my short experience as a Phoenix user.

I'm using Phoenix 4.4 on top of an HBase cluster where I keep 3 billion entries.

In our use case Phoenix is doing very well and it has saved a lot of code complexity and time. If you have already decided that HBase is the way to go, then having Phoenix as a SQL layer will help a lot, not only in terms of code simplicity but also because it will help you create and maintain your indexes and views, which can be hard and costly tasks using the plain HBase API. Joining tables is just a simple SQL join :).
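
For example, a secondary index and a join are just SQL; the tables and columns below are made-up examples, not my real schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class IndexAndJoinExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3");
             Statement stmt = conn.createStatement()) {
            // A covered secondary index, maintained by Phoenix instead of hand-rolled code.
            stmt.executeUpdate(
                "CREATE INDEX IF NOT EXISTS IDX_EVENTS_BY_DEVICE"
              + " ON EVENTS (DEVICE_ID) INCLUDE (EVENT_TIME, PAYLOAD)");
            // A join expressed as plain SQL instead of client-side HBase scans.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT e.EVENT_TIME, d.MODEL"
                  + " FROM EVENTS e JOIN DEVICES d ON e.DEVICE_ID = d.DEVICE_ID"
                  + " WHERE d.MODEL = 'X1'")) {
                while (rs.next()) {
                    System.out.println(rs.getTimestamp(1) + " " + rs.getString(2));
                }
            }
        }
    }
}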

And there are a lot more useful features that make your life easier with HBase.

In terms of performance, depending on the SLAs that you have, you will need to benchmark; however, I think your main battles are going to be with HBase, JVM GCs, the network, the filesystem, etc.

I would say give Phoenix a try, for sure.

Cheers
Pedro

On Fri, Jan 15, 2016 at 9:12 AM, Willem Conradie <
willem.conradie@pbtgroup.co.za> wrote:

>
> Hi,
>
>
>
> I am currently consulting at a client with the following requirements.
>
>
>
> They want to make available detailed data usage CDRs for customers to
> verify their data usage against the websites that they visited. In short
> this can be seen as an itemised bill for data usage.  The data is currently
> not loaded into a RDBMS due to the volumes of data involved. The proposed
> solution is to load the data into HBase, running on a HDP cluster, and make
> it available for querying by the subscribers.  It is critical to ensure low
> latency read access to the subscriber data, which possibly will be exposed
> to 25 million subscribers. We will be running a scaled down version first
> for a proof of concept with the intention of it becoming an operational
> data store.  Once the solution is functioning properly for the data usage
> CDRs other CDR types will be added, as such we need  to build a cost
> effective, scalable solution .
>
>
>
> I am thinking of using Apache Phoenix for the following reasons:
>
>
>
> 1.      1. Current data loading into RDBMS is file based (CSV) via a
> staging server using the RDBMS file load drivers
>
> 2.      2.  Use Apache Phoenix   bin/psql.py script to mimic above
> process to load to HBase
>
> 3.       3. Expected data volume :  60 000 files per day
>                                                   1 –to 10 MB per file
>                                                   500 million records per
> day
>                                                    500 GB total volume per
> day
>
>
> 4.        4. Use Apache Phoenix client for low latency data retrieval
>
>
>
> Is Apache Phoenix a suitable candidate for this specific use case?
>
>
>
> Regards,
>
> Willem
>
>