Posted to user@phoenix.apache.org by Monil Gandhi <mg...@gmail.com> on 2018/11/16 01:02:20 UTC

Rolling hourly data

Hello,
Currently we have hourly data in our Phoenix table. However, the schema was
designed to perform well for daily data. Increasing the number of rows by
24X has led to degradation of our service over time.
Our current schema is as follows

CREATE TABLE IF NOT EXISTS T1 (sid BIGINT NOT NULL, day DATE NOT NULL, cid
BIGINT NOT NULL, s.count INTEGER, CONSTRAINT PK PRIMARY KEY (sid, day, cid)
) COMPRESSION='SNAPPY'

The query we run is something along the lines of
SELECT sid, cid, CAST(SUM(count) AS BIGINT) AS count FROM PUBLISHER_V4
WHERE sid IN (34535) AND day BETWEEN TO_DATE('2018-07-01', 'yyyy-MM-dd',
'GMT') AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT') GROUP BY sid, cid

Based on our investigation, we have concluded that the main cause is simply
the number of rows being read. I am open to other suggestions.

If the number of rows is indeed the cause,
I am wondering if there is a way to either
1. roll the hourly data up to daily data using views, a secondary index, or
MapReduce (I know MapReduce is possible), or
2. migrate to a new schema where cid is not part of the PK and is instead a
column family. I was unable to find any documentation on this.

Thanks
Monil

Re: Rolling hourly data

Posted by Monil Gandhi <mg...@gmail.com>.
Thanks for the response. I don’t think Phoenix does rollups, or at least not
the version we are using. We are constrained by EMR on AWS, which is on 4.14.

Thomas - what I meant was that I want to write an MR job or “update value”
query for all of the sids, and there are about 200k different values. It would
be very easy if I could write an MR job to select all rows in a day and just
reduce them to one row in a different table based on the 3 primary keys. But
because day is the last part of the primary key, I keep getting errors or the
MR job just takes a long time.

One thing I forgot to mention is that we do not need hourly granularity of
the data at query time. It is purely for “real-time” ingestion and getting
the latest data into the system.
Hence, at this point I am thinking I will generate the daily data in a
separate offline pipeline and just keep updating the same row during
ingestion, along the lines of the sketch below.
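A minimal sketch of that, assuming a separate daily table (the T1_DAILY name,
the sid value, and the dates are just placeholders):

-- Daily rollup table with the same layout as the posted DDL; the name is a placeholder.
CREATE TABLE IF NOT EXISTS T1_DAILY (sid BIGINT NOT NULL, day DATE NOT NULL,
cid BIGINT NOT NULL, s.count INTEGER, CONSTRAINT PK PRIMARY KEY (sid, day, cid)
) COMPRESSION='SNAPPY'

-- Roll one sid's hourly rows for one day into single daily rows. Re-running the
-- same statement simply overwrites those daily rows, so late-arriving hourly
-- data just updates them.
UPSERT INTO T1_DAILY
SELECT sid, TRUNC(day, 'DAY'), cid, CAST(SUM(count) AS INTEGER)
FROM T1
WHERE sid = 34535
AND day >= TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT')
AND day < TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT')
GROUP BY sid, TRUNC(day, 'DAY'), cid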
On Tue, Nov 27, 2018 at 12:19 AM Jonathan Leech <jo...@gmail.com> wrote:

> I would try writing the hourly values as 24 columns in a daily row, or as
> an array type.
>
>  I’m not up to speed on the latest Phoenix features, but if it could
> update a daily sum on the fly that might be ok. If that doesn’t exist yet
> or isn’t performant, it could be done in an Hbase coprocessor.
>
> I would also put the daily sum column in a separate column family in the
> same table, rather than a separate table.
>
> - Jonathan
>
> On Nov 26, 2018, at 5:53 PM, Thomas D'Silva <td...@salesforce.com>
> wrote:
>
> The 1-day aggregate query you are running should work, you might have to
> increase the client thread pool and queue size.
> See
> http://mail-archives.apache.org/mod_mbox/phoenix-user/201607.mbox/%3C577D338D.2080100@gmail.com%3E
>
> On Tue, Nov 20, 2018 at 9:26 PM Monil Gandhi <mg...@gmail.com> wrote:
>
>> Thanks for getting back.
>> The query we run is always an aggregation query. The date range is
>> anytime current to 2 years back.
>> I was looking into UPSERT SELECT, but there does not seem to be an easy
>> way to run it. For instance I thought running something as below would be
>> optimal
>>
>> UPSERT INTO DAILY_PUBLISHER_V4(segment_id, cross_segment_id, day, reach)
>> VALUES (SELECT segment_id, cross_segment_id, TO_DATE('2017-07-01',
>> 'yyyy-MM-dd', 'GMT') as day, CAST(SUM(reach) AS BIGINT) as reach FROM
>> PUBLISHER_V4 WHERE day BETWEEN TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT')
>> AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT') GROUP BY segment_id,
>> cross_segment_id)
>>
>> Hence I tried running the select statement to see if it works. I started
>> to see below error
>> Error: Task
>> org.apache.phoenix.job.JobManager$InstrumentedJobFutureTask@a66e580
>> rejected from org.apache.phoenix.job.JobManager$1@20de05e5[Running, pool
>> size = 128, active threads = 128, queued tasks = 5000, completed tasks =
>> 5153] (state=08000,code=101)
>>
>> When I changed the SELECT query to include a particular sid, the upsert
>> select worked.
>> Hence I think the only way would be for me to run UPSERt for generating
>> daily data for range of sids or segment_id.
>>
>> Did I miss something?
>>
>> On Tue, Nov 20, 2018 at 9:59 AM Thomas D'Silva <td...@salesforce.com>
>> wrote:
>>
>>> Since your PK already leads with (sid, day) I don't think adding a
>>> secondary index will help. Do you generally always run the aggregation
>>> query for the recently inserted data? The row timestamp feature might help
>>> in this case
>>> https://phoenix.apache.org/rowtimestamp.html
>>> If you run the same aggregate queries multiple times then another
>>> approach is to store the coarser daily aggregated data in a separate table
>>> that you can populate using an UPSERT SELECT.
>>> I'm not sure why the explain plan you attached has a CLIENT MERGE SORT,
>>> since you don't have an order by.
>>>
>>> On Fri, Nov 16, 2018 at 9:44 AM Monil Gandhi <mg...@gmail.com> wrote:
>>>
>>>> Here it is
>>>> CLIENT 1-CHUNK PARALLEL 1-WAY RANGE SCAN OVER PUBLISHER_V4
>>>> [-3790006267705642038,'2018-07-01 00:00:00.000'] - [-37900062677056420
>>>> SERVER AGGREGATE INTO DISTINCT ROWS BY [SEGMENT_ID, CROSS_SEGMENT_ID]
>>>> CLIENT MERGE SORT
>>>>
>>>> Note: we have a dedicated phoenix query server
>>>>
>>>> On Thu, Nov 15, 2018 at 5:23 PM Geoffrey Jacoby <gj...@salesforce.com>
>>>> wrote:
>>>>
>>>>> Monil,
>>>>>
>>>>> Could you please post the results of an EXPLAIN plan of your query?
>>>>> For directions how to do this please see
>>>>> http://phoenix.apache.org/explainplan.html
>>>>>
>>>>> Geoffrey Jacoby
>>>>>
>>>>> On Thu, Nov 15, 2018 at 5:02 PM Monil Gandhi <mg...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>> Currently we have hourly data in our phoenix table. However, the
>>>>>> schema was designed to perform well for daily data. Increasing the number
>>>>>> of rows by 24X has lead to degradation of our service over time.
>>>>>> Our current schema is as follows
>>>>>>
>>>>>> CREATE TABLE IF NOT EXISTS T1 (sid BIGINT NOT NULL, day DATE NOT
>>>>>> NULL, cid BIGINT NOT NULL, s.count INTEGER, CONSTRAINT PK PRIMARY KEY (sid,
>>>>>> day, cid) ) COMPRESSION='SNAPPY'
>>>>>>
>>>>>> The query we run is something along the lines of
>>>>>> SELECT sid, cid, CAST(SUM(count) AS BIGINT) AS count FROM
>>>>>> PUBLISHER_V4 WHERE sid IN (34535) AND day BETWEEN TO_DATE('2018-07-01',
>>>>>> 'yyyy-MM-dd', 'GMT') AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT') GROUP
>>>>>> BY sid, cid
>>>>>>
>>>>>> Based on our investigation we have concluded that the main reason is
>>>>>> purely the number of rows that are being read. I am open to other
>>>>>> suggestions
>>>>>>
>>>>>> If number of rows is the case
>>>>>> I am wondering if there is a way to either
>>>>>> 1. to roll hourly data to daily using views, secondary index or map
>>>>>> reduce. I know map reduce is possible.
>>>>>> 2. migrate to a newer schema where cid is not part of pk and is
>>>>>> actually a column family. I was unable to find any kind of documentation on
>>>>>> this.
>>>>>>
>>>>>> Thanks
>>>>>> Monil
>>>>>>
>>>>>

Re: Rolling hourly data

Posted by Jonathan Leech <jo...@gmail.com>.
I would try writing the hourly values as 24 columns in a daily row, or as an array type.

 I’m not up to speed on the latest Phoenix features, but if it could update a daily sum on the fly, that might be OK. If that doesn’t exist yet or isn’t performant, it could be done in an HBase coprocessor.

I would also put the daily sum column in a separate column family in the same table, rather than a separate table.

- Jonathan
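
A rough sketch of that layout (the names are made up; the hourly slots are shown
as an array here, but 24 separate columns would work the same way):

-- One row per (sid, day, cid): hourly values in an array column, with the daily
-- sum in its own column family so it can be read without touching the hourly data.
CREATE TABLE IF NOT EXISTS T1_DAILY_ROWS (sid BIGINT NOT NULL, day DATE NOT NULL,
cid BIGINT NOT NULL, h.hourly_counts INTEGER ARRAY[24], d.daily_count BIGINT,
CONSTRAINT PK PRIMARY KEY (sid, day, cid)) COMPRESSION='SNAPPY'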

> On Nov 26, 2018, at 5:53 PM, Thomas D'Silva <td...@salesforce.com> wrote:
> 
> The 1-day aggregate query you are running should work, you might have to increase the client thread pool and queue size.
> See http://mail-archives.apache.org/mod_mbox/phoenix-user/201607.mbox/%3C577D338D.2080100@gmail.com%3E
> 
>> On Tue, Nov 20, 2018 at 9:26 PM Monil Gandhi <mg...@gmail.com> wrote:
>> Thanks for getting back.
>> The query we run is always an aggregation query. The date range is anytime current to 2 years back. 
>> I was looking into UPSERT SELECT, but there does not seem to be an easy way to run it. For instance I thought running something as below would be optimal
>> 
>> UPSERT INTO DAILY_PUBLISHER_V4(segment_id, cross_segment_id, day, reach) VALUES (SELECT segment_id, cross_segment_id, TO_DATE('2017-07-01', 'yyyy-MM-dd', 'GMT') as day, CAST(SUM(reach) AS BIGINT) as reach FROM PUBLISHER_V4 WHERE day BETWEEN TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT') AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT') GROUP BY segment_id, cross_segment_id)
>> 
>> Hence I tried running the select statement to see if it works. I started to see below error
>> Error: Task org.apache.phoenix.job.JobManager$InstrumentedJobFutureTask@a66e580 rejected from org.apache.phoenix.job.JobManager$1@20de05e5[Running, pool size = 128, active threads = 128, queued tasks = 5000, completed tasks = 5153] (state=08000,code=101)
>> 
>> When I changed the SELECT query to include a particular sid, the upsert select worked.
>> Hence I think the only way would be for me to run UPSERt for generating daily data for range of sids or segment_id.
>> 
>> Did I miss something?
>> 
>>> On Tue, Nov 20, 2018 at 9:59 AM Thomas D'Silva <td...@salesforce.com> wrote:
>>> Since your PK already leads with (sid, day) I don't think adding a secondary index will help. Do you generally always run the aggregation query for the recently inserted data? The row timestamp feature might help in this case 
>>> https://phoenix.apache.org/rowtimestamp.html
>>> If you run the same aggregate queries multiple times then another approach is to store the coarser daily aggregated data in a separate table that you can populate using an UPSERT SELECT.
>>> I'm not sure why the explain plan you attached has a CLIENT MERGE SORT, since you don't have an order by.
>>> 
>>>> On Fri, Nov 16, 2018 at 9:44 AM Monil Gandhi <mg...@gmail.com> wrote:
>>>> Here it is 
>>>> CLIENT 1-CHUNK PARALLEL 1-WAY RANGE SCAN OVER PUBLISHER_V4 [-3790006267705642038,'2018-07-01 00:00:00.000'] - [-37900062677056420 SERVER AGGREGATE INTO DISTINCT ROWS BY [SEGMENT_ID, CROSS_SEGMENT_ID]
>>>> CLIENT MERGE SORT
>>>> 
>>>> Note: we have a dedicated phoenix query server
>>>> 
>>>>> On Thu, Nov 15, 2018 at 5:23 PM Geoffrey Jacoby <gj...@salesforce.com> wrote:
>>>>> Monil,
>>>>> 
>>>>> Could you please post the results of an EXPLAIN plan of your query? For directions how to do this please see http://phoenix.apache.org/explainplan.html
>>>>> 
>>>>> Geoffrey Jacoby
>>>>> 
>>>>>> On Thu, Nov 15, 2018 at 5:02 PM Monil Gandhi <mg...@gmail.com> wrote:
>>>>>> Hello,
>>>>>> Currently we have hourly data in our phoenix table. However, the schema was designed to perform well for daily data. Increasing the number of rows by 24X has lead to degradation of our service over time. 
>>>>>> Our current schema is as follows 
>>>>>> 
>>>>>> CREATE TABLE IF NOT EXISTS T1 (sid BIGINT NOT NULL, day DATE NOT NULL, cid BIGINT NOT NULL, s.count INTEGER, CONSTRAINT PK PRIMARY KEY (sid, day, cid) ) COMPRESSION='SNAPPY'
>>>>>> 
>>>>>> The query we run is something along the lines of 
>>>>>> SELECT sid, cid, CAST(SUM(count) AS BIGINT) AS count FROM PUBLISHER_V4 WHERE sid IN (34535) AND day BETWEEN TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT') AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT') GROUP BY sid, cid
>>>>>> 
>>>>>> Based on our investigation we have concluded that the main reason is purely the number of rows that are being read. I am open to other suggestions
>>>>>> 
>>>>>> If number of rows is the case
>>>>>> I am wondering if there is a way to either
>>>>>> 1. to roll hourly data to daily using views, secondary index or map reduce. I know map reduce is possible.
>>>>>> 2. migrate to a newer schema where cid is not part of pk and is actually a column family. I was unable to find any kind of documentation on this.
>>>>>> 
>>>>>> Thanks
>>>>>> Monil

Re: Rolling hourly data

Posted by Thomas D'Silva <td...@salesforce.com>.
The 1-day aggregate query you are running should work; you might have to
increase the client thread pool and queue size.
See
http://mail-archives.apache.org/mod_mbox/phoenix-user/201607.mbox/%3C577D338D.2080100@gmail.com%3E
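
The relevant client-side settings (in the client's hbase-site.xml) are
phoenix.query.threadPoolSize and phoenix.query.queueSize; the values below are
only illustrative:

phoenix.query.threadPoolSize = 256    (default is 128)
phoenix.query.queueSize = 10000       (default is 5000)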

On Tue, Nov 20, 2018 at 9:26 PM Monil Gandhi <mg...@gmail.com> wrote:

> Thanks for getting back.
> The query we run is always an aggregation query. The date range is anytime
> current to 2 years back.
> I was looking into UPSERT SELECT, but there does not seem to be an easy
> way to run it. For instance I thought running something as below would be
> optimal
>
> UPSERT INTO DAILY_PUBLISHER_V4(segment_id, cross_segment_id, day, reach)
> VALUES (SELECT segment_id, cross_segment_id, TO_DATE('2017-07-01',
> 'yyyy-MM-dd', 'GMT') as day, CAST(SUM(reach) AS BIGINT) as reach FROM
> PUBLISHER_V4 WHERE day BETWEEN TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT')
> AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT') GROUP BY segment_id,
> cross_segment_id)
>
> Hence I tried running the select statement to see if it works. I started
> to see below error
> Error: Task
> org.apache.phoenix.job.JobManager$InstrumentedJobFutureTask@a66e580
> rejected from org.apache.phoenix.job.JobManager$1@20de05e5[Running, pool
> size = 128, active threads = 128, queued tasks = 5000, completed tasks =
> 5153] (state=08000,code=101)
>
> When I changed the SELECT query to include a particular sid, the upsert
> select worked.
> Hence I think the only way would be for me to run UPSERt for generating
> daily data for range of sids or segment_id.
>
> Did I miss something?
>
> On Tue, Nov 20, 2018 at 9:59 AM Thomas D'Silva <td...@salesforce.com>
> wrote:
>
>> Since your PK already leads with (sid, day) I don't think adding a
>> secondary index will help. Do you generally always run the aggregation
>> query for the recently inserted data? The row timestamp feature might help
>> in this case
>> https://phoenix.apache.org/rowtimestamp.html
>> If you run the same aggregate queries multiple times then another
>> approach is to store the coarser daily aggregated data in a separate table
>> that you can populate using an UPSERT SELECT.
>> I'm not sure why the explain plan you attached has a CLIENT MERGE SORT,
>> since you don't have an order by.
>>
>> On Fri, Nov 16, 2018 at 9:44 AM Monil Gandhi <mg...@gmail.com> wrote:
>>
>>> Here it is
>>> CLIENT 1-CHUNK PARALLEL 1-WAY RANGE SCAN OVER PUBLISHER_V4
>>> [-3790006267705642038,'2018-07-01 00:00:00.000'] - [-37900062677056420
>>> SERVER AGGREGATE INTO DISTINCT ROWS BY [SEGMENT_ID, CROSS_SEGMENT_ID]
>>> CLIENT MERGE SORT
>>>
>>> Note: we have a dedicated phoenix query server
>>>
>>> On Thu, Nov 15, 2018 at 5:23 PM Geoffrey Jacoby <gj...@salesforce.com>
>>> wrote:
>>>
>>>> Monil,
>>>>
>>>> Could you please post the results of an EXPLAIN plan of your query? For
>>>> directions how to do this please see
>>>> http://phoenix.apache.org/explainplan.html
>>>>
>>>> Geoffrey Jacoby
>>>>
>>>> On Thu, Nov 15, 2018 at 5:02 PM Monil Gandhi <mg...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>> Currently we have hourly data in our phoenix table. However, the
>>>>> schema was designed to perform well for daily data. Increasing the number
>>>>> of rows by 24X has lead to degradation of our service over time.
>>>>> Our current schema is as follows
>>>>>
>>>>> CREATE TABLE IF NOT EXISTS T1 (sid BIGINT NOT NULL, day DATE NOT NULL,
>>>>> cid BIGINT NOT NULL, s.count INTEGER, CONSTRAINT PK PRIMARY KEY (sid, day,
>>>>> cid) ) COMPRESSION='SNAPPY'
>>>>>
>>>>> The query we run is something along the lines of
>>>>> SELECT sid, cid, CAST(SUM(count) AS BIGINT) AS count FROM PUBLISHER_V4
>>>>> WHERE sid IN (34535) AND day BETWEEN TO_DATE('2018-07-01', 'yyyy-MM-dd',
>>>>> 'GMT') AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT') GROUP BY sid, cid
>>>>>
>>>>> Based on our investigation we have concluded that the main reason is
>>>>> purely the number of rows that are being read. I am open to other
>>>>> suggestions
>>>>>
>>>>> If number of rows is the case
>>>>> I am wondering if there is a way to either
>>>>> 1. to roll hourly data to daily using views, secondary index or map
>>>>> reduce. I know map reduce is possible.
>>>>> 2. migrate to a newer schema where cid is not part of pk and is
>>>>> actually a column family. I was unable to find any kind of documentation on
>>>>> this.
>>>>>
>>>>> Thanks
>>>>> Monil
>>>>>
>>>>

Re: Rolling hourly data

Posted by Monil Gandhi <mg...@gmail.com>.
Thanks for getting back.
The query we run is always an aggregation query, and the date range can be
anywhere from the current date back to 2 years ago.
I was looking into UPSERT SELECT, but there does not seem to be an easy way
to run it. For instance, I thought running something like the statement below
would be optimal:

UPSERT INTO DAILY_PUBLISHER_V4(segment_id, cross_segment_id, day, reach)
VALUES (SELECT segment_id, cross_segment_id, TO_DATE('2017-07-01',
'yyyy-MM-dd', 'GMT') as day, CAST(SUM(reach) AS BIGINT) as reach FROM
PUBLISHER_V4 WHERE day BETWEEN TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT')
AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT') GROUP BY segment_id,
cross_segment_id)

Hence I tried running just the SELECT statement to see if it works, and
started to see the error below:
Error: Task
org.apache.phoenix.job.JobManager$InstrumentedJobFutureTask@a66e580
rejected from org.apache.phoenix.job.JobManager$1@20de05e5[Running, pool
size = 128, active threads = 128, queued tasks = 5000, completed tasks =
5153] (state=08000,code=101)

When I changed the SELECT query to include a particular sid, the UPSERT
SELECT worked.
Hence I think the only way would be for me to run the UPSERT for a range of
sids (segment_ids) to generate the daily data.

Did I miss something?
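
For reference, the same statement in the plain UPSERT SELECT form (the SELECT
given directly, without the VALUES wrapper), bounded to a single segment_id as
in the run that worked; the segment_id value is only illustrative:

UPSERT INTO DAILY_PUBLISHER_V4 (segment_id, cross_segment_id, day, reach)
SELECT segment_id, cross_segment_id,
       TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT'),
       CAST(SUM(reach) AS BIGINT)
FROM PUBLISHER_V4
-- bounding the scan to one segment_id kept the statement under the client
-- thread pool and queue limits
WHERE segment_id = 34535
AND day BETWEEN TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT')
AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT')
GROUP BY segment_id, cross_segment_id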

On Tue, Nov 20, 2018 at 9:59 AM Thomas D'Silva <td...@salesforce.com>
wrote:

> Since your PK already leads with (sid, day) I don't think adding a
> secondary index will help. Do you generally always run the aggregation
> query for the recently inserted data? The row timestamp feature might help
> in this case
> https://phoenix.apache.org/rowtimestamp.html
> If you run the same aggregate queries multiple times then another approach
> is to store the coarser daily aggregated data in a separate table that you
> can populate using an UPSERT SELECT.
> I'm not sure why the explain plan you attached has a CLIENT MERGE SORT,
> since you don't have an order by.
>
> On Fri, Nov 16, 2018 at 9:44 AM Monil Gandhi <mg...@gmail.com> wrote:
>
>> Here it is
>> CLIENT 1-CHUNK PARALLEL 1-WAY RANGE SCAN OVER PUBLISHER_V4
>> [-3790006267705642038,'2018-07-01 00:00:00.000'] - [-37900062677056420
>> SERVER AGGREGATE INTO DISTINCT ROWS BY [SEGMENT_ID, CROSS_SEGMENT_ID]
>> CLIENT MERGE SORT
>>
>> Note: we have a dedicated phoenix query server
>>
>> On Thu, Nov 15, 2018 at 5:23 PM Geoffrey Jacoby <gj...@salesforce.com>
>> wrote:
>>
>>> Monil,
>>>
>>> Could you please post the results of an EXPLAIN plan of your query? For
>>> directions how to do this please see
>>> http://phoenix.apache.org/explainplan.html
>>>
>>> Geoffrey Jacoby
>>>
>>> On Thu, Nov 15, 2018 at 5:02 PM Monil Gandhi <mg...@gmail.com> wrote:
>>>
>>>> Hello,
>>>> Currently we have hourly data in our phoenix table. However, the schema
>>>> was designed to perform well for daily data. Increasing the number of rows
>>>> by 24X has lead to degradation of our service over time.
>>>> Our current schema is as follows
>>>>
>>>> CREATE TABLE IF NOT EXISTS T1 (sid BIGINT NOT NULL, day DATE NOT NULL,
>>>> cid BIGINT NOT NULL, s.count INTEGER, CONSTRAINT PK PRIMARY KEY (sid, day,
>>>> cid) ) COMPRESSION='SNAPPY'
>>>>
>>>> The query we run is something along the lines of
>>>> SELECT sid, cid, CAST(SUM(count) AS BIGINT) AS count FROM PUBLISHER_V4
>>>> WHERE sid IN (34535) AND day BETWEEN TO_DATE('2018-07-01', 'yyyy-MM-dd',
>>>> 'GMT') AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT') GROUP BY sid, cid
>>>>
>>>> Based on our investigation we have concluded that the main reason is
>>>> purely the number of rows that are being read. I am open to other
>>>> suggestions
>>>>
>>>> If number of rows is the case
>>>> I am wondering if there is a way to either
>>>> 1. to roll hourly data to daily using views, secondary index or map
>>>> reduce. I know map reduce is possible.
>>>> 2. migrate to a newer schema where cid is not part of pk and is
>>>> actually a column family. I was unable to find any kind of documentation on
>>>> this.
>>>>
>>>> Thanks
>>>> Monil
>>>>
>>>

Re: Rolling hourly data

Posted by Thomas D'Silva <td...@salesforce.com>.
Since your PK already leads with (sid, day), I don't think adding a
secondary index will help. Do you generally always run the aggregation
query for recently inserted data? The row timestamp feature might help in
this case:
https://phoenix.apache.org/rowtimestamp.html
If you run the same aggregate queries multiple times, then another approach
is to store the coarser daily aggregated data in a separate table that you
can populate using an UPSERT SELECT.
I'm not sure why the explain plan you attached has a CLIENT MERGE SORT,
since you don't have an ORDER BY.
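
For reference, the row timestamp declaration is just a flag on the date column
in the PK constraint. A sketch against your posted DDL, assuming day can be
designated this way and would carry the write time (if the hourly loads set day
to something other than the ingest time, the optimization may not apply):

-- day maps to the HBase cell timestamp, so scans over recent date ranges can
-- skip HFiles that fall entirely outside the queried range.
CREATE TABLE IF NOT EXISTS T1 (sid BIGINT NOT NULL, day DATE NOT NULL,
cid BIGINT NOT NULL, s.count INTEGER,
CONSTRAINT PK PRIMARY KEY (sid, day ROW_TIMESTAMP, cid)) COMPRESSION='SNAPPY'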

On Fri, Nov 16, 2018 at 9:44 AM Monil Gandhi <mg...@gmail.com> wrote:

> Here it is
> CLIENT 1-CHUNK PARALLEL 1-WAY RANGE SCAN OVER PUBLISHER_V4
> [-3790006267705642038,'2018-07-01 00:00:00.000'] - [-37900062677056420
> SERVER AGGREGATE INTO DISTINCT ROWS BY [SEGMENT_ID, CROSS_SEGMENT_ID]
> CLIENT MERGE SORT
>
> Note: we have a dedicated phoenix query server
>
> On Thu, Nov 15, 2018 at 5:23 PM Geoffrey Jacoby <gj...@salesforce.com>
> wrote:
>
>> Monil,
>>
>> Could you please post the results of an EXPLAIN plan of your query? For
>> directions how to do this please see
>> http://phoenix.apache.org/explainplan.html
>>
>> Geoffrey Jacoby
>>
>> On Thu, Nov 15, 2018 at 5:02 PM Monil Gandhi <mg...@gmail.com> wrote:
>>
>>> Hello,
>>> Currently we have hourly data in our phoenix table. However, the schema
>>> was designed to perform well for daily data. Increasing the number of rows
>>> by 24X has lead to degradation of our service over time.
>>> Our current schema is as follows
>>>
>>> CREATE TABLE IF NOT EXISTS T1 (sid BIGINT NOT NULL, day DATE NOT NULL,
>>> cid BIGINT NOT NULL, s.count INTEGER, CONSTRAINT PK PRIMARY KEY (sid, day,
>>> cid) ) COMPRESSION='SNAPPY'
>>>
>>> The query we run is something along the lines of
>>> SELECT sid, cid, CAST(SUM(count) AS BIGINT) AS count FROM PUBLISHER_V4
>>> WHERE sid IN (34535) AND day BETWEEN TO_DATE('2018-07-01', 'yyyy-MM-dd',
>>> 'GMT') AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT') GROUP BY sid, cid
>>>
>>> Based on our investigation we have concluded that the main reason is
>>> purely the number of rows that are being read. I am open to other
>>> suggestions
>>>
>>> If number of rows is the case
>>> I am wondering if there is a way to either
>>> 1. to roll hourly data to daily using views, secondary index or map
>>> reduce. I know map reduce is possible.
>>> 2. migrate to a newer schema where cid is not part of pk and is actually
>>> a column family. I was unable to find any kind of documentation on this.
>>>
>>> Thanks
>>> Monil
>>>
>>

Re: Rolling hourly data

Posted by Monil Gandhi <mg...@gmail.com>.
Here it is
CLIENT 1-CHUNK PARALLEL 1-WAY RANGE SCAN OVER PUBLISHER_V4
[-3790006267705642038,'2018-07-01 00:00:00.000'] - [-37900062677056420
SERVER AGGREGATE INTO DISTINCT ROWS BY [SEGMENT_ID, CROSS_SEGMENT_ID]
CLIENT MERGE SORT

Note: we have a dedicated Phoenix Query Server

On Thu, Nov 15, 2018 at 5:23 PM Geoffrey Jacoby <gj...@salesforce.com>
wrote:

> Monil,
>
> Could you please post the results of an EXPLAIN plan of your query? For
> directions how to do this please see
> http://phoenix.apache.org/explainplan.html
>
> Geoffrey Jacoby
>
> On Thu, Nov 15, 2018 at 5:02 PM Monil Gandhi <mg...@gmail.com> wrote:
>
>> Hello,
>> Currently we have hourly data in our phoenix table. However, the schema
>> was designed to perform well for daily data. Increasing the number of rows
>> by 24X has lead to degradation of our service over time.
>> Our current schema is as follows
>>
>> CREATE TABLE IF NOT EXISTS T1 (sid BIGINT NOT NULL, day DATE NOT NULL,
>> cid BIGINT NOT NULL, s.count INTEGER, CONSTRAINT PK PRIMARY KEY (sid, day,
>> cid) ) COMPRESSION='SNAPPY'
>>
>> The query we run is something along the lines of
>> SELECT sid, cid, CAST(SUM(count) AS BIGINT) AS count FROM PUBLISHER_V4
>> WHERE sid IN (34535) AND day BETWEEN TO_DATE('2018-07-01', 'yyyy-MM-dd',
>> 'GMT') AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT') GROUP BY sid, cid
>>
>> Based on our investigation we have concluded that the main reason is
>> purely the number of rows that are being read. I am open to other
>> suggestions
>>
>> If number of rows is the case
>> I am wondering if there is a way to either
>> 1. to roll hourly data to daily using views, secondary index or map
>> reduce. I know map reduce is possible.
>> 2. migrate to a newer schema where cid is not part of pk and is actually
>> a column family. I was unable to find any kind of documentation on this.
>>
>> Thanks
>> Monil
>>
>

Re: Rolling hourly data

Posted by Geoffrey Jacoby <gj...@salesforce.com>.
Monil,

Could you please post the results of an EXPLAIN plan of your query? For
directions on how to do this, please see
http://phoenix.apache.org/explainplan.html

Geoffrey Jacoby
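
In this case that would be something along the lines of (using the query from
your first mail):

EXPLAIN SELECT sid, cid, CAST(SUM(count) AS BIGINT) AS count FROM PUBLISHER_V4
WHERE sid IN (34535) AND day BETWEEN TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT')
AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT') GROUP BY sid, cid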

On Thu, Nov 15, 2018 at 5:02 PM Monil Gandhi <mg...@gmail.com> wrote:

> Hello,
> Currently we have hourly data in our phoenix table. However, the schema
> was designed to perform well for daily data. Increasing the number of rows
> by 24X has lead to degradation of our service over time.
> Our current schema is as follows
>
> CREATE TABLE IF NOT EXISTS T1 (sid BIGINT NOT NULL, day DATE NOT NULL, cid
> BIGINT NOT NULL, s.count INTEGER, CONSTRAINT PK PRIMARY KEY (sid, day, cid)
> ) COMPRESSION='SNAPPY'
>
> The query we run is something along the lines of
> SELECT sid, cid, CAST(SUM(count) AS BIGINT) AS count FROM PUBLISHER_V4
> WHERE sid IN (34535) AND day BETWEEN TO_DATE('2018-07-01', 'yyyy-MM-dd',
> 'GMT') AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT') GROUP BY sid, cid
>
> Based on our investigation we have concluded that the main reason is
> purely the number of rows that are being read. I am open to other
> suggestions
>
> If number of rows is the case
> I am wondering if there is a way to either
> 1. to roll hourly data to daily using views, secondary index or map
> reduce. I know map reduce is possible.
> 2. migrate to a newer schema where cid is not part of pk and is actually a
> column family. I was unable to find any kind of documentation on this.
>
> Thanks
> Monil
>