Posted to user@hbase.apache.org by Mehmet Simsek <nu...@gmail.com> on 2013/02/17 20:33:12 UTC

Row Key Design in a time-based application

Hi,

I want to hold event log data in HBase, but I can't decide on a row key. I must hold the project id and the time, and I will search using the combination of project id and time.

The row key could be:

ProjectId+timeInMs

In a similar application (the open source TSDB), the time is divided by 1000 to round it. I could use this strategy, but I don't know how to decide what the divisor should be: 1000 or 10000.

Why is the time divided by 1000 in that application? Why isn't the time stored without division?

Can you explain this strategy?


Re: Row Key Design in a time-based application

Posted by Michael Segel <mi...@hotmail.com>.
So it's a *shameless* plug? :-)

Depending on the project id, it could be a good key, but it would have to be something more meaningful than just a number. 

To answer the question about time... Timestamps are longs which hold the number of milliseconds since a set epoch (I forget the exact date and time, but you can look it up in the API docs). By dropping the last 3 decimal digits, you're losing precision and rounding to the nearest second. 

Wasn't sure if that was answered before or not...
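To make that concrete, here is a minimal sketch of the ProjectId+time composite key from the original question, rounding the millisecond timestamp down to seconds by dividing by 1000. The class and method names are illustrative only, not from any of the posts:

import org.apache.hadoop.hbase.util.Bytes;

public class EventLogKey {

    // Builds a row key of the form [projectId (4 bytes)][timestampSeconds (8 bytes)].
    // Dividing the millisecond timestamp by 1000 drops the last three decimal
    // digits, i.e. rounds down to the nearest second, as described above.
    public static byte[] rowKey(int projectId, long timestampMs) {
        long timestampSeconds = timestampMs / 1000L;
        return Bytes.add(Bytes.toBytes(projectId), Bytes.toBytes(timestampSeconds));
    }
}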


On Feb 17, 2013, at 7:19 PM, James Taylor <jt...@salesforce.com> wrote:

> Michael is right - Phoenix wouldn't automatically solve these issues for you - it would just a) decrease the amount of code you need to write while still giving you coprocessor-speed performance, and b) give you an industry standard API to read/write your data.
> 
> However, since the date is not the leading part of the key, it wouldn't be a problem for it to be monotonically increasing. If project_id is, then you could reverse the bytes on the way in and on the way out to prevent hot spotting on writes (basically taking the same approach as when you'd use the HBase native APIs). If you wanted to do it in SQL, you could add your own built-in function to Phoenix. I'll blog about how to do this soon.
> 
> James
> http://phoenix-hbase.blogspot.com/
> 
> On 02/17/2013 03:18 PM, Michael Segel wrote:
>> I'm not sure how a SQL interface above HBase will solve some of the issues with regional hot spotting when using time as the key. Or the problem with always adding data to the right of the last row.
>> 
>> The same would apply with the project id, assuming that it too is a number that grows incrementally with each project.
>> On Feb 17, 2013, at 4:50 PM, James Taylor <jt...@salesforce.com> wrote:
>> 
>>> Hello,
>>> Have you considered using Phoenix (https://github.com/forcedotcom/phoenix) for this use case? Phoenix is a SQL layer on top of HBase. For this use case, you'd connect to your cluster like this:
>>> 
>>> Class.forName("com.salesforce.phoenix.jdbc.PhoenixDriver"); // register driver
>>> Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost"); // connect to local HBase
>>> 
>>> Create a table like this (adding additional columns that you want to measure, like txn_count below):
>>> 
>>> conn.createStatement().execute(
>>>    "CREATE TABLE event_log (\n" +
>>>    "     project_id INTEGER NOT NULL, \n" +
>>>    "    time DATE NOT NULL,\n" +
>>>    "    txn_count LONG,\n" +
>>>    "CONSTRAINT pk PRIMARY KEY (project_id, time))");
>>> 
>>> Then to insert data you'd do this:
>>> 
>>> PreparedStatement preparedStmt = conn.prepareStatement(
>>>    "UPSERT INTO event_log VALUES(?,?,0)");
>>> 
>>> and you'd bind the values in JDBC like this:
>>> 
>>> preparedStmt.setInt(1, projectId);
>>> preparedStmt.setDate(2, time);
>>> preparedStmt.execute();
>>> 
>>> conn.commit(); // If upserting many values, you'd want to commit after upserting maybe 1000-10000 rows
>>> 
>>> Then at query time, assuming you want to report on this data by grouping into different "time buckets", you could do as shown below. Phoenix stores your date values at millisecond granularity and you can decide at query time how you'd like to roll it up:
>>> 
>>> // Query with time bucket at the hour granularity
>>> conn.createStatement().execute(
>>>   "SELECT\n" +
>>>   "    project_id, TRUNC(time,'HOUR') as time_bucket, \n" +
>>>   "    MIN(txn_count), MAX(txn_count), AVG(txn_count) FROM event_log\n" +
>>>   "GROUP BY project_id, TRUNC(time,'HOUR')");
>>> 
>>> // Query with time bucket at the day granularity
>>> conn.createStatement().execute(
>>>    "SELECT\n" +
>>>    "    project_id, TRUNC(time,'DAY') as time_bucket,\n" +
>>>    "    MIN(txn_count), MAX(txn_count), AVG(txn_count) FROM event_log\n" +
>>>    "GROUP BY project_id, TRUNC(time,'DAY')");
>>> 
>>> You could, of course, include a WHERE clause in the query to filter based on the range of dates, particular projectIds, etc., like this:
>>> 
>>> preparedStmt = conn.prepareStatement(
>>>    "SELECT\n" +
>>>    "    project_id, TRUNC(time,'DAY') as time_bucket,\n" +
>>>    "    MIN(txn_count), MAX(txn_count), AVG(txn_count) FROM event_log\n" +
>>>    "WHERE project_id IN (?, ?, ?) AND time >= ? AND time < ?\n" +
>>>    "GROUP BY project_id, TRUNC(time,'DAY')");
>>> preparedStmt.setInt(1, projectId1);
>>> preparedStmt.setInt(2, projectId2);
>>> preparedStmt.setInt(3, projectId3);
>>> preparedStmt.setDate(4, beginDate);
>>> preparedStmt.setDate(5, endDate);
>>> preparedStmt.execute();
>>> 
>>> 
>>> HTH.
>>> 
>>> Regards,
>>> 
>>>    James
>>> 
>>> On 02/17/2013 11:33 AM, Mehmet Simsek wrote:
>>>> Hi,
>>>> 
>>>> I want to hold event log data in HBase, but I can't decide on a row key. I must hold the project id and the time, and I will search using the combination of project id and time.
>>>> 
>>>> The row key could be:
>>>> 
>>>> ProjectId+timeInMs
>>>> 
>>>> In a similar application (the open source TSDB), the time is divided by 1000 to round it. I could use this strategy, but I don't know how to decide what the divisor should be: 1000 or 10000.
>>>> 
>>>> Why is the time divided by 1000 in that application? Why isn't the time stored without division?
>>>> 
>>>> Can you explain this strategy?
>>>> 
>>> 
>> Michael Segel  | (m) 312.755.9623
>> 
>> Segel and Associates
>> 
>> 
> 
> 

Michael Segel  | (m) 312.755.9623

Segel and Associates



Re: Row Key Design in a time-based application

Posted by Asaf Mesika <as...@gmail.com>.
How does it avoid hogging the region server's memory when multiple queries with
GROUP BY are executed, given that the aggregation is done in the HBase JVM? I know
that HBase does not handle heap sizes beyond 12G well, and combined with
compactions happening concurrently with queries, it creates memory
competition. Is there any way to control this in Phoenix? Is it scalable
in terms of the number of concurrent queries running in a cluster?

On Monday, February 18, 2013, James Taylor wrote:

> Michael is right - Phoenix wouldn't automatically solve these issues for
> you - it would just a) decrease the amount of code you need to write while
> still giving you coprocessor-speed performance, and b) give you an industry
> standard API to read/write your data.
>
> However, since the date is not the leading part of the key, it wouldn't be
> a problem for it to be monotonically increasing. If project_id is, then you
> could reverse the bytes on the way in and on the way out to prevent hot
> spotting on writes (basically taking the same approach as when you'd use
> the HBase native APIs). If you wanted to do it in SQL, you could add your
> own built-in function to Phoenix. I'll blog about how to do this soon.
>
> James
> http://phoenix-hbase.blogspot.com/
>
> On 02/17/2013 03:18 PM, Michael Segel wrote:
>
>> I'm not sure how a SQL interface above HBase will solve some of the
>> issues with regional hot spotting when using time as the key. Or the
>> problem with always adding data to the right of the last row.
>>
>> The same would apply with the project id, assuming that it too is a
>> number that grows incrementally with each project.
>> On Feb 17, 2013, at 4:50 PM, James Taylor <jt...@salesforce.com> wrote:
>>
>>  Hello,
>>> Have you considered using Phoenix (https://github.com/forcedotcom/phoenix)
>>> for this use case? Phoenix is a SQL layer on top of HBase. For this use
>>> case, you'd connect to your cluster like this:
>>>
>>> Class.forName("com.salesforce.phoenix.jdbc.PhoenixDriver"); // register driver
>>> Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost"); // connect to local HBase
>>>
>>> Create a table like this (adding additional columns that you want to
>>> measure, like txn_count below):
>>>
>>> conn.createStatement().execute(
>>>     "CREATE TABLE event_log (\n" +
>>>     "     project_id INTEGER NOT NULL, \n" +
>>>     "    time DATE NOT NULL,\n" +
>>>     "    txn_count LONG,\n" +
>>>     "CONSTRAINT pk PRIMARY KEY (project_id, time))");
>>>
>>> Then to insert data you'd do this:
>>>
>>> PreparedStatement preparedStmt = conn.prepareStatement(
>>>     "UPSERT INTO event_log VALUES(?,?,0)");
>>>
>>> and you'd bind the values in JDBC like this:
>>>
>>> preparedStmt.setInt(1, projectId);
>>> preparedStmt.setDate(2, time);
>>> preparedStmt.execute();
>>>
>>> conn.commit(); // If upserting many values, you'd want to commit after
>>> upserting maybe 1000-10000 rows
>>>
>>> Then at query time, assuming you want to report on this data by
>>> grouping into different "time buckets", you could do as shown below. Phoenix
>>> stores your date values at millisecond granularity and you can decide at
>>> query time how you'd like to roll it up:
>>>
>>> // Query with time bucket at the hour granularity
>>> conn.createStatement().execute(
>>>    "SELECT\n" +
>>>    "    project_id, TRUNC(time,'HOUR') as time_bucket, \n" +
>>>    "    MIN(txn_count), MAX(txn_count), AVG(txn_count) FROM event_log\n" +
>>>    "GROUP BY project_id, TRUNC(time,'HOUR')");
>>>
>>> // Query with time bucket at the day granularity
>>> conn.createStatement().execute(
>>>     "SELECT\n" +
>>>     "    project_id, TRUNC(time,'DAY') as time_bucket,\n" +
>>>     "    MIN(txn_count), MAX(txn_count), AVG(txn_count) FROM event_log\n" +
>>>     "GROUP BY project_id, TRUNC(time,'DAY')");
>>>
>>> You could, of course, include a WHERE clause in the query to filter based
>>> on the range of dates, particular projectIds, etc., like this:
>>>
>>> preparedStmt = conn.prepareStatement(
>>>     "SELECT\n" +
>>>     "    project_id, TRUNC(time,'DAY') as time_bucket,\n" +
>>>     "    MIN(txn_count), MAX(txn_count), AVG(txn_count) FROM event_log\n" +
>>>     "WHERE project_id IN (?, ?, ?) AND time >= ? AND time < ?\n" +
>>>     "GROUP BY project_id, TRUNC(time,'DAY')");
>>> preparedStmt.setInt(1, projectId1);
>>> preparedStmt.setInt(2, projectId2);
>>> preparedStmt.setInt(3, projectId3);
>>> preparedStmt.setDate(4, beginDate);
>>> preparedStmt.setDate(5, endDate);
>>> preparedStmt.execute();
>>>
>>>
>>> HTH.
>>>
>>> Regards,
>>>
>>>     James
>>>
>>> On 02/17/2013 11:33 AM, Mehmet Simsek wrote:
>>>
>>>> Hi,
>>>>
>>>> I want to hold event log data in HBase, but I can't decide on a row key.
>>>> I must hold the project id and the time, and I will search using the
>>>> combination of project id and time.
>>>>
>>>> The row key could be:
>>>>
>>>> ProjectId+timeInMs
>>>>
>>>> In a similar application (the open source TSDB), the time is divided by
>>>> 1000 to round it. I could use this strategy, but I don't know how to
>>>> decide what the divisor should be: 1000 or 10000.
>>>>
>>>> Why is the time divided by 1000 in that application? Why isn't the time
>>>> stored without division?
>>>>
>>>> Can you explain this strategy?
>>>>
>>>>
>>>  Michael Segel  | (m) 312.755.9623
>>
>> Segel and Associates
>>
>>
>>
>

Re: Row Key Design in a time-based application

Posted by James Taylor <jt...@salesforce.com>.
Michael is right - Phoenix wouldn't automatically solve these issues for 
you - it would just a) decrease the amount of code you need to write 
while still giving you coprocessor-speed performance, and b) give you an 
industry standard API to read/write your data.

However, since the date is not the leading part of the key, it wouldn't 
be a problem for it to be monotonically increasing. If project_id is, 
then you could reverse the bytes on the way in and on the way out to 
prevent hot spotting on writes (basically taking the same approach as 
when you'd use the HBase native APIs). If you wanted to do it in SQL, 
you could add your own built-in function to Phoenix. I'll blog about how 
to do this soon.
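As a rough illustration of the byte-reversal idea (not Phoenix's built-in behavior; the helper below is a hypothetical sketch you would apply symmetrically on write and read):

import org.apache.hadoop.hbase.util.Bytes;

public class KeyReverser {

    // Reverses the byte order of the project_id so that sequential ids no longer
    // sort next to each other, which spreads writes across regions. Applying the
    // same function again on the way out restores the original bytes.
    public static byte[] reverseBytes(int projectId) {
        byte[] bytes = Bytes.toBytes(projectId);
        for (int i = 0, j = bytes.length - 1; i < j; i++, j--) {
            byte tmp = bytes[i];
            bytes[i] = bytes[j];
            bytes[j] = tmp;
        }
        return bytes;
    }
}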

James
http://phoenix-hbase.blogspot.com/

On 02/17/2013 03:18 PM, Michael Segel wrote:
> I'm not sure how a SQL interface above HBase will solve some of the issues with regional hot spotting when using time as the key. Or the problem with always adding data to the right of the last row.
>
> The same would apply with the project id, assuming that it too is a number that grows incrementally with each project.
> On Feb 17, 2013, at 4:50 PM, James Taylor <jt...@salesforce.com> wrote:
>
>> Hello,
>> Have you considered using Phoenix (https://github.com/forcedotcom/phoenix) for this use case? Phoenix is a SQL layer on top of HBase. For this use case, you'd connect to your cluster like this:
>>
>> Class.forName("com.salesforce.phoenix.jdbc.PhoenixDriver"); // register driver
>> Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost"); // connect to local HBase
>>
>> Create a table like this (adding additional columns that you want to measure, like txn_count below):
>>
>> conn.createStatement().execute(
>>     "CREATE TABLE event_log (\n" +
>>     "     project_id INTEGER NOT NULL, \n" +
>>     "    time DATE NOT NULL,\n" +
>>     "    txn_count LONG,\n" +
>>     "CONSTRAINT pk PRIMARY KEY (project_id, time))");
>>
>> Then to insert data you'd do this:
>>
>> PreparedStatement preparedStmt = conn.prepareStatement(
>>     "UPSERT INTO event_log VALUES(?,?,0)");
>>
>> and you'd bind the values in JDBC like this:
>>
>> preparedStmt.setInt(1, projectId);
>> preparedStmt.setDate(2, time);
>> preparedStmt.execute();
>>
>> conn.commit(); // If upserting many values, you'd want to commit after upserting maybe 1000-10000 rows
>>
>> Then at query time, assuming you want to report on this data by grouping into different "time buckets", you could do as shown below. Phoenix stores your date values at millisecond granularity and you can decide at query time how you'd like to roll it up:
>>
>> // Query with time bucket at the hour granularity
>> conn.createStatement().execute(
>>    "SELECT\n" +
>>    "    project_id, TRUNC(time,'HOUR') as time_bucket, \n" +
>>    "    MIN(txn_count), MAX(txn_count), AVG(txn_count) FROM event_log\n" +
>>    "GROUP BY project_id, TRUNC(time,'HOUR')");
>>
>> // Query with time bucket at the day granularity
>> conn.createStatement().execute(
>>     "SELECT\n" +
>>     "    project_id, TRUNC(time,'DAY') as time_bucket,\n" +
>>     "    MIN(txn_count), MAX(txn_count), AVG(txn_count) FROM event_log\n" +
>>     "GROUP BY project_id, TRUNC(time,'DAY')");
>>
>> You could, of course, include a WHERE clause in the query to filter based on the range of dates, particular projectIds, etc., like this:
>>
>> preparedStmt = conn.prepareStatement(
>>     "SELECT\n" +
>>     "    project_id, TRUNC(time,'DAY') as time_bucket,\n" +
>>     "    MIN(txn_count), MAX(txn_count), AVG(txn_count) FROM event_log\n" +
>>     "WHERE project_id IN (?, ?, ?) AND time >= ? AND time < ?\n" +
>>     "GROUP BY project_id, TRUNC(time,'DAY')");
>> preparedStmt.setInt(1, projectId1);
>> preparedStmt.setInt(2, projectId2);
>> preparedStmt.setInt(3, projectId3);
>> preparedStmt.setDate(4, beginDate);
>> preparedStmt.setDate(5, endDate);
>> preparedStmt.execute();
>>
>>
>> HTH.
>>
>> Regards,
>>
>>     James
>>
>> On 02/17/2013 11:33 AM, Mehmet Simsek wrote:
>>> Hi,
>>>
>>> I want to hold event log data in HBase, but I can't decide on a row key. I must hold the project id and the time, and I will search using the combination of project id and time.
>>>
>>> The row key could be:
>>>
>>> ProjectId+timeInMs
>>>
>>> In a similar application (the open source TSDB), the time is divided by 1000 to round it. I could use this strategy, but I don't know how to decide what the divisor should be: 1000 or 10000.
>>>
>>> Why is the time divided by 1000 in that application? Why isn't the time stored without division?
>>>
>>> Can you explain this strategy?
>>>
>>
> Michael Segel  | (m) 312.755.9623
>
> Segel and Associates
>
>


Re: Row Key Design in a time-based application

Posted by Mohammad Tariq <do...@gmail.com>.
Hello Mehmet,

If the ProjectIds are sequential, then it is definitely not a feasible
approach. The division is just to make sure that all the regions are
evenly loaded. You can create pre-split tables to avoid region
hotspotting. Alternatively, hash your row keys so that all
the regionservers receive an equal load.
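For example, here is a minimal sketch of the hashing/salting idea applied to the ProjectId+time key from this thread; the bucket count, class name, and key layout are illustrative assumptions, and the table would be pre-split on the salt byte:

import org.apache.hadoop.hbase.util.Bytes;

public class SaltedEventKey {

    // Hypothetical number of salt buckets, typically chosen to match how many
    // regions you want writes spread across.
    private static final int SALT_BUCKETS = 16;

    // Row key layout: [salt (1 byte)][projectId (4 bytes)][timestampMs (8 bytes)].
    // The salt is derived from the project id, so rows for sequential project ids
    // land in different regions while all rows of one project stay contiguous.
    public static byte[] saltedRowKey(int projectId, long timestampMs) {
        byte salt = (byte) (Math.abs(projectId) % SALT_BUCKETS);
        return Bytes.add(new byte[] { salt },
                         Bytes.toBytes(projectId),
                         Bytes.toBytes(timestampMs));
    }

    // Split points for pre-creating one region per salt bucket (buckets 1..15);
    // these could be passed as the splitKeys argument when creating the table.
    public static byte[][] splitKeys() {
        byte[][] splits = new byte[SALT_BUCKETS - 1][];
        for (int i = 1; i < SALT_BUCKETS; i++) {
            splits[i - 1] = new byte[] { (byte) i };
        }
        return splits;
    }
}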

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Mon, Feb 18, 2013 at 4:48 AM, Michael Segel <mi...@hotmail.com>wrote:

> I'm not sure how a SQL interface above HBase will solve some of the issues
> with regional hot spotting when using time as the key. Or the problem with
> always adding data to the right of the last row.
>
> The same would apply with the project id, assuming that it too is a number
> that grows incrementally with each project.
> On Feb 17, 2013, at 4:50 PM, James Taylor <jt...@salesforce.com> wrote:
>
> > Hello,
> > Have you considered using Phoenix (
> https://github.com/forcedotcom/phoenix) for this use case? Phoenix is a
> SQL layer on top of HBase. For this use case, you'd connect to your cluster
> like this:
> >
> > Class.forName("com.salesforce.phoenix.jdbc.PhoenixDriver"); // register
> driver
> > Connection conn =
> DriverManager.getConnection("jdbc:phoenix:localhost"); // connect to local
> HBase
> >
> > Create a table like this (adding additional columns that you want to
> measure, like txn_count below):
> >
> > conn.createStatement().execute(
> >    "CREATE TABLE event_log (\n" +
> >    "     project_id INTEGER NOT NULL, \n" +
> >    "    time DATE NOT NULL,\n" +
> >    "    txn_count LONG,\n" +
> >    "CONSTRAINT pk PRIMARY KEY (project_id, time))");
> >
> > Then to insert data you'd do this:
> >
> > PreparedStatement preparedStmt = conn.prepareStatement(
> >    "UPSERT INTO event_log VALUES(?,?,0)");
> >
> > and you'd bind the values in JDBC like this:
> >
> > preparedStmt.setInt(1, projectId);
> > preparedStmt.setDate(2, time);
> > preparedStmt.execute();
> >
> > conn.commit(); // If upserting many values, you'd want to commit after
> upserting maybe 1000-10000 rows
> >
> > Then at query time, assuming you want to report on this data by
> grouping into different "time buckets", you could do as shown below. Phoenix
> stores your date values at millisecond granularity and you can decide at
> query time how you'd like to roll it up:
> >
> > // Query with time bucket at the hour granularity
> > conn.createStatement().execute(
> >   "SELECT\n" +
> >   "    project_id, TRUNC(time,'HOUR') as time_bucket, \n" +
> >   "    MIN(txn_count), MAX(txn_count), AVG(txn_count) FROM event_log\n" +
> >   "GROUP BY project_id, TRUNC(time,'HOUR')");
> >
> > // Query with time bucket at the day granularity
> > conn.createStatement().execute(
> >    "SELECT\n" +
> >    "    project_id, TRUNC(time,'DAY') as time_bucket,\n" +
> >    "    MIN(txn_count), MAX(txn_count), AVG(txn_count) FROM event_log\n" +
> >    "GROUP BY project_id, TRUNC(time,'DAY')");
> >
> > You could, of course, include a WHERE clause in the query to filter based
> on the range of dates, particular projectIds, etc., like this:
> >
> > preparedStmt = conn.prepareStatement(
> >    "SELECT\n" +
> >    "    project_id, TRUNC(time,'DAY') as time_bucket,\n" +
> >    "    MIN(txn_count), MAX(txn_count), AVG(txn_count) FROM event_log\n" +
> >    "WHERE project_id IN (?, ?, ?) AND time >= ? AND time < ?\n" +
> >    "GROUP BY project_id, TRUNC(time,'DAY')");
> > preparedStmt.setInt(1, projectId1);
> > preparedStmt.setInt(2, projectId2);
> > preparedStmt.setInt(3, projectId3);
> > preparedStmt.setDate(4, beginDate);
> > preparedStmt.setDate(5, endDate);
> > preparedStmt.execute();
> >
> >
> > HTH.
> >
> > Regards,
> >
> >    James
> >
> > On 02/17/2013 11:33 AM, Mehmet Simsek wrote:
> >> Hi,
> >>
> >> I want to hold event log data in HBase, but I can't decide on a row key. I
> must hold the project id and the time, and I will search using the
> combination of project id and time.
> >>
> >> The row key could be:
> >>
> >> ProjectId+timeInMs
> >>
> >> In a similar application (the open source TSDB), the time is divided by
> 1000 to round it. I could use this strategy, but I don't know how to decide
> what the divisor should be: 1000 or 10000.
> >>
> >> Why is the time divided by 1000 in that application? Why isn't the time
> stored without division?
> >>
> >> Can you explain this strategy?
> >>
> >
> >
>
> Michael Segel  | (m) 312.755.9623
>
> Segel and Associates
>
>
>

Re: Row Key Design in a time-based application

Posted by Michael Segel <mi...@hotmail.com>.
I'm not sure how a SQL interface above HBase will solve some of the issues with regional hot spotting when using time as the key. Or the problem with always adding data to the right of the last row. 

The same would apply with the project id, assuming that it too is a number that grows incrementally with each project. 
On Feb 17, 2013, at 4:50 PM, James Taylor <jt...@salesforce.com> wrote:

> Hello,
> Have you considered using Phoenix (https://github.com/forcedotcom/phoenix) for this use case? Phoenix is a SQL layer on top of HBase. For this use case, you'd connect to your cluster like this:
> 
> Class.forName("com.salesforce.phoenix.jdbc.PhoenixDriver"); // register driver
> Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost"); // connect to local HBase
> 
> Create a table like this (adding additional columns that you want to measure, like txn_count below):
> 
> conn.createStatement().execute(
>    "CREATE TABLE event_log (\n" +
>    "     project_id INTEGER NOT NULL, \n" +
>    "    time DATE NOT NULL,\n" +
>    "    txn_count LONG,\n" +
>    "CONSTRAINT pk PRIMARY KEY (project_id, time))");
> 
> Then to insert data you'd do this:
> 
> PreparedStatement preparedStmt = conn.prepareStatement(
>    "UPSERT INTO event_log VALUES(?,?,0)");
> 
> and you'd bind the values in JDBC like this:
> 
> preparedStmt.setInt(1, projectId);
> preparedStmt.setDate(2, time);
> preparedStmt.execute();
> 
> conn.commit(); // If upserting many values, you'd want to commit after upserting maybe 1000-10000 rows
> 
> Then at query time, assuming you want to report on this data by grouping into different "time buckets", you could do as shown below. Phoenix stores your date values at millisecond granularity and you can decide at query time how you'd like to roll it up:
> 
> // Query with time bucket at the hour granularity
> conn.createStatement().execute(
>   "SELECT\n" +
>   "    project_id, TRUNC(time,'HOUR') as time_bucket, \n" +
>   "    MIN(txn_count), MAX(txn_count), AVG(txn_count) FROM event_log\n" +
>   "GROUP BY project_id, TRUNC(time,'HOUR')");
> 
> // Query with time bucket at the day granularity
> conn.createStatement().execute(
>    "SELECT\n" +
>    "    project_id, TRUNC(time,'DAY') as time_bucket,\n" +
>    "    MIN(txn_count), MAX(txn_count), AVG(txn_count) FROM event_log\n" +
>    "GROUP BY project_id, TRUNC(time,'DAY')");
> 
> You could, of course, include a WHERE clause in the query to filter based on the range of dates, particular projectIds, etc., like this:
> 
> preparedStmt = conn.prepareStatement(
>    "SELECT\n" +
>    "    project_id, TRUNC(time,'DAY') as time_bucket,\n" +
>    "    MIN(txn_count), MAX(txn_count), AVG(txn_count) FROM event_log\n" +
>    "WHERE project_id IN (?, ?, ?) AND time >= ? AND time < ?\n" +
>    "GROUP BY project_id, TRUNC(time,'DAY')");
> preparedStmt.setInt(1, projectId1);
> preparedStmt.setInt(2, projectId2);
> preparedStmt.setInt(3, projectId3);
> preparedStmt.setDate(4, beginDate);
> preparedStmt.setDate(5, endDate);
> preparedStmt.execute();
> 
> 
> HTH.
> 
> Regards,
> 
>    James
> 
> On 02/17/2013 11:33 AM, Mehmet Simsek wrote:
>> Hi,
>> 
>> I want to hold event log data in HBase, but I can't decide on a row key. I must hold the project id and the time, and I will search using the combination of project id and time.
>> 
>> The row key could be:
>> 
>> ProjectId+timeInMs
>> 
>> In a similar application (the open source TSDB), the time is divided by 1000 to round it. I could use this strategy, but I don't know how to decide what the divisor should be: 1000 or 10000.
>> 
>> Why is the time divided by 1000 in that application? Why isn't the time stored without division?
>> 
>> Can you explain this strategy?
>> 
> 
> 

Michael Segel  | (m) 312.755.9623

Segel and Associates



Re: Row Key Design in a time-based application

Posted by James Taylor <jt...@salesforce.com>.
Hello,
Have you considered using Phoenix 
(https://github.com/forcedotcom/phoenix) for this use case? Phoenix is a 
SQL layer on top of HBase. For this use case, you'd connect to your 
cluster like this:

Class.forName("com.salesforce.phoenix.jdbc.PhoenixDriver"); // register driver
Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost"); // connect to local HBase

Create a table like this (adding additional columns that you want to 
measure, like txn_count below):

conn.createStatement().execute(
     "CREATE TABLE event_log (\n" +
     "    project_id INTEGER NOT NULL,\n" +
     "    time DATE NOT NULL,\n" +
     "    txn_count LONG,\n" +
     "CONSTRAINT pk PRIMARY KEY (project_id, time))");

Then to insert data you'd do this:

PreparedStatement preparedStmt = conn.prepareStatement(
     "UPSERT INTO event_log VALUES(?,?,0)");

and you'd bind the values in JDBC like this:

preparedStmt.setInt(1, projectId);
preparedStmt.setDate(2, time);
preparedStmt.execute();

conn.commit(); // If upserting many values, you'd want to commit after 
upserting maybe 1000-10000 rows
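For example, a rough sketch of that batching pattern (the event list, the Event class and its getters, and the batch size are hypothetical, not part of Phoenix; preparedStmt and conn are the objects created above):

int batchSize = 1000;
int count = 0;
for (Event event : events) {          // 'events' and 'Event' are assumed for illustration
    preparedStmt.setInt(1, event.getProjectId());
    preparedStmt.setDate(2, event.getTime());
    preparedStmt.execute();
    if (++count % batchSize == 0) {
        conn.commit();                // flush a batch of upserts to HBase
    }
}
conn.commit();                        // commit any remaining rows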

Then at query time, assuming you want to report on this data by 
grouping into different "time buckets", you could do as shown below. 
Phoenix stores your date values at millisecond granularity and you 
can decide at query time how you'd like to roll it up:

// Query with time bucket at the hour granularity
conn.createStatement().execute(
    "SELECT\n" +
    "    project_id, TRUNC(time,'HOUR') as time_bucket, \n" +
    "    MIN(txn_count), MAX(txn_count), AVG(txn_count) FROM event_log\n" +
    "GROUP BY project_id, TRUNC(time,'HOUR')");

// Query with time bucket at the day granularity
conn.createStatement().execute(
     "SELECT\n" +
     "    project_id, TRUNC(time,'DAY') as time_bucket,\n" +
     "    MIN(txn_count), MAX(txn_count), AVG(txn_count) FROM event_log\n" +
     "GROUP BY project_id, TRUNC(time,'DAY')");

You could, of course, include a WHERE clause in the query to filter based 
on the range of dates, particular projectIds, etc., like this:

preparedStmt = conn.prepareStatement(
     "SELECT\n" +
     "    project_id, TRUNC(time,'DAY') as time_bucket,\n" +
     "    MIN(txn_count), MAX(txn_count), AVG(txn_count) FROM event_log\n" +
     "WHERE project_id IN (?, ?, ?) AND time >= ? AND time < ?\n" +
     "GROUP BY project_id, TRUNC(time,'DAY')");
preparedStmt.setInt(1, projectId1);
preparedStmt.setInt(2, projectId2);
preparedStmt.setInt(3, projectId3);
preparedStmt.setDate(4, beginDate);
preparedStmt.setDate(5, endDate);
preparedStmt.execute();


HTH.

Regards,

     James

On 02/17/2013 11:33 AM, Mehmet Simsek wrote:
> Hi,
>
> I want to hold event log data in HBase, but I can't decide on a row key. I must hold the project id and the time, and I will search using the combination of project id and time.
>
> The row key could be:
>
> ProjectId+timeInMs
>
> In a similar application (the open source TSDB), the time is divided by 1000 to round it. I could use this strategy, but I don't know how to decide what the divisor should be: 1000 or 10000.
>
> Why is the time divided by 1000 in that application? Why isn't the time stored without division?
>
> Can you explain this strategy?
>