Posted to user@kudu.apache.org by tenny susanto <te...@gmail.com> on 2017/02/24 01:08:24 UTC

kudu table design question

I have a table (call this fact_table) that I want to create in Kudu.

I have an equivalent table in Impala/Parquet that is partitioned by print_date_id.

create table impala_fact_table (
company_id INT,
transcount INT)
partitioned by
(print_date_id INT)
STORED AS PARQUET;

so a common query would be:

select  sum(transcount)
from impala_fact_table f
join company_dim c on f.company_id = c.company_id
where c.company_id in (123,456)
and f.print_date_id between 20170101 and 20170202

I created an equivalent of the fact table in kudu:

CREATE TABLE kudu_fact_table (
id STRING,
print_date_id INT,
company_id INT,
transcount INT,
PRIMARY KEY (id, print_date_id)
)
PARTITION BY HASH PARTITIONS 16
STORED AS KUDU
TBLPROPERTIES(
  'kudu.table_name' = 'kudu_fact_table',
  'kudu.master_addresses' = 'myserver:7051'
);

But the performance of the join against this Kudu table is terrible: 2 seconds
with the Impala table vs 126 seconds with the Kudu table.

select  sum(transcount)
from kudu_fact_table f
join company_dim c on f.company_id = c.company_id
where c.company_id in (123,456)
and f.print_date_id between 20170101 and 20170202

How should I design my kudu table so performance is somewhat comparable?

Re: kudu table design question

Posted by da...@gmail.com.
Hi Tenny

  I see in the plan that the Kudu tables are missing statistics, so a first step would be to compute statistics and check the result.
  Also, from the plan I see this particular query is only scanning from 3 Kudu hosts, vs 24 with Parquet. If queries are bound to scan a small range, maybe consider combining hash partitioning with coarser-grained range partitions.
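The first suggestion can be applied directly from Impala; a minimal sketch, using the table name from this thread:

```sql
-- Gather table and column statistics so the planner can pick a
-- sensible join order and strategy.
COMPUTE STATS kudu_fact_table;

-- Verify that row counts and per-column stats are now populated.
SHOW TABLE STATS kudu_fact_table;
SHOW COLUMN STATS kudu_fact_table;
```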

Best
David

Sent from my iPhone


Re: kudu table design question

Posted by tenny susanto <te...@gmail.com>.
Ahh, no, I did not configure any compression on the Kudu table. I will go
check the docs and enable compression.

In the meantime, here are the query profiles against the Impala Parquet table
vs my existing uncompressed Kudu table.
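For reference, a query profile can be captured in impala-shell right after running the query; a sketch:

```sql
-- Run the query of interest first...
select sum(transcount)
from kudu_fact_table f
join company_dim c on f.company_id = c.company_id
where c.company_id in (123, 456)
  and f.print_date_id between 20170101 and 20170202;

-- ...then dump the runtime profile of the most recent query.
PROFILE;
```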





-- 
Regards,

Tenny Susanto

Re: kudu table design question

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Tenny,

Sorry for the delay on this thread. Just wanted to check in and find out
how the experiments are going.

Do you have query profiles of the query against Parquet and the same query
against Kudu? The 15x difference you reported is not expected.

I didn't see it mentioned above in the thread: did you configure any
encoding or compression on the Kudu tables? This is often a source of major
performance differences.
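For context, per-column encoding and compression can be declared in the Impala DDL for Kudu tables. A sketch, assuming an Impala version that supports Kudu column attributes; the column set is taken from the thread, and the specific encoding choices are illustrative, not a recommendation from the original authors:

```sql
CREATE TABLE kudu_fact_table_encoded (
  print_date_id INT ENCODING BIT_SHUFFLE,
  id STRING ENCODING DICT_ENCODING COMPRESSION LZ4,
  company_id INT ENCODING BIT_SHUFFLE,
  transcount INT ENCODING BIT_SHUFFLE,
  PRIMARY KEY (print_date_id, id)
)
PARTITION BY RANGE (print_date_id) (
  PARTITION 20170101 <= VALUES < 20170108
)
STORED AS KUDU;
```

BIT_SHUFFLE already compresses integer columns well on its own, which is why no extra COMPRESSION clause is given for them here.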

-Todd



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: kudu table design question

Posted by tenny susanto <te...@gmail.com>.
On my impala parquet table, each day partition is about 500MB - 1GB.


So using range partitioning by day, query time went down to 35 seconds from 123 seconds.


Query against the impala table is 2 seconds.






-- 
Regards,

Tenny Susanto

Re: kudu table design question

Posted by Dan Burkert <da...@cloudera.com>.
Hi Tenny,

1000 partitions is on the upper end of what I'd recommend - with 3x
replication that's 125 tablet replicas per tablet server (something more
like 20 or 30 would be ideal depending on hardware).  How much data does
each day have?  I would aim for a tablet size on the order of 50GiB, so if
it's not that much per day you could try making week- or month-wide
partitions.  Just bumping the number of partitions and being able to take
advantage of partition pruning should improve the performance tremendously.

In the next release we're adding support for pushdown of IN-list predicates,
which could help your query even more if you could put company_id as the
first component of your primary key.  That being said, I think improved
range partitioning will likely give the most dramatic improvements, and
there's no need to wait.

Week-wide range partitions can be specified like:

 PARTITION 20170101 <= VALUES < 20170108,
 PARTITION 20170108 <= VALUES < 20170115,
 ...
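Put together with hashing to spread writes, a fuller sketch of this schema might look like the following; the column types, hash column, and bucket count are assumptions filled in from the rest of the thread:

```sql
CREATE TABLE kudu_fact_table_by_week (
  print_date_id INT,
  id STRING,
  company_id INT,
  transcount INT,
  PRIMARY KEY (print_date_id, id)
)
-- Hash within each week-wide range partition so inserts for a given
-- week are spread across several tablets.
PARTITION BY HASH (id) PARTITIONS 4,
             RANGE (print_date_id) (
  PARTITION 20170101 <= VALUES < 20170108,
  PARTITION 20170108 <= VALUES < 20170115
)
STORED AS KUDU;
```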


- Dan


Re: kudu table design question

Posted by tenny susanto <te...@gmail.com>.
I have 24 tablet servers.

I added an id column because Kudu requires a primary key to be specified and
my original table has no single unique column. The real table actually has 20
columns; I concatenated 5 of them to build a unique id column and made it
part of the primary key. I tried specifying all 5 columns as the primary key,
but inserts were much slower, so I switched to a 2-column primary key
instead, which seems to improve insert speed.

So this is my new schema, and I will measure query speed with it. If I
partition by day, is 1000 partitions too many? What is the recommended
maximum number of partitions Kudu can handle?

CREATE TABLE kudu_fact_table (
print_date_id INT,
id STRING,
company_id INT,
transcount INT,
PRIMARY KEY (print_date_id, id)
)
PARTITION BY RANGE (print_date_id)
(
  PARTITION VALUE = 20170101,
  PARTITION VALUE = 20170102
  ... (1 partition for each day, and I have 3 years' worth of data)
)
STORED AS KUDU
TBLPROPERTIES(
  'kudu.table_name' = 'kudu_fact_table',
  'kudu.master_addresses' = 'myserver:7051'
);
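
To keep the partition count down, I'm also considering coarser range
partitions combined with hash partitioning, something like this sketch
(untested; the monthly ranges and the 4-bucket hash count are just guesses
on my part):

```sql
-- Sketch only: monthly range partitions plus hash buckets, so roughly
-- 36 ranges x 4 buckets for 3 years instead of ~1000 daily partitions.
CREATE TABLE kudu_fact_table (
print_date_id INT,
id STRING,
company_id INT,
transcount INT,
PRIMARY KEY (print_date_id, id)
)
PARTITION BY HASH (id) PARTITIONS 4,
RANGE (print_date_id)
(
  PARTITION 20170101 <= VALUES < 20170201,
  PARTITION 20170201 <= VALUES < 20170301
  -- ... one partition per month over the full date span
)
STORED AS KUDU
TBLPROPERTIES(
  'kudu.table_name' = 'kudu_fact_table',
  'kudu.master_addresses' = 'myserver:7051'
);
```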



On Thu, Feb 23, 2017 at 6:29 PM, Todd Lipcon <to...@cloudera.com> wrote:

> I'd add that moving the print_date_id to the beginning of the primary key
> in the Kudu fact table would allow each server to do a range scan instead
> of a full scan.
>
> -Todd
>
> On Thu, Feb 23, 2017 at 5:40 PM, Dan Burkert <da...@cloudera.com> wrote:
>
>> Hi Tenny,
>>
>> First off, how many tablet servers are in your cluster?  16 partitions is
>> appropriate for one or maybe two tablet servers, so if your cluster is
>> bigger you could try bumping the number of partitions.
>>
>> Second, the schemas don't look identical, you have an additional 'id'
>> column in the Kudu table, and crucially, it doesn't have any predicates, so
>> this query is doing a full table scan.
>>
>> Finally, the Parquet table is likely able to take advantage of
>> significant partition pruning due to the between clause.  An equivalent in
>> Kudu would be range partitioning on the print_date_id.  You might try doing
>> the same for Kudu.
>>
>> - Dan
>>
>> On Thu, Feb 23, 2017 at 5:08 PM, tenny susanto <te...@gmail.com>
>> wrote:
>>
>>> I have a table (call this fact_table)  that I want to create in kudu.
>>>
>>> I have an equivalent table in impala/parquet that is partitioned by
>>> day_id.
>>>
>>> create table impala_fact_table (
>>> company_id INT,
>>> transcount INT)
>>> partitioned by
>>> (print_date_id INT)
>>> STORED AS PARQUET;
>>>
>>> so a common query would be:
>>>
>>> select  sum(transcount)
>>> from impala_fact_table f
>>> join company_dim c on f.company_id = c.company_id
>>> where c.company_id in (123,456)
>>> and f.print_date_id between 20170101 and 20170202
>>>
>>> I created an equivalent of the fact table in kudu:
>>>
>>> CREATE TABLE kudu_fact_table (
>>> id STRING,
>>> print_date_id INT,
>>> company_id INT,
>>> transcount INT,
>>> PRIMARY KEY (id, print_date_id)
>>> )
>>> PARTITION BY HASH PARTITIONS 16
>>> STORED AS KUDU
>>> TBLPROPERTIES(
>>>   'kudu.table_name' = 'kudu_fact_table',
>>>   'kudu.master_addresses' = 'myserver:7051'
>>> );
>>>
>>> But the performance of the join with this Kudu table is terrible: 2
>>> seconds with the Impala table vs 126 seconds with the Kudu table.
>>>
>>> select  sum(transcount)
>>> from kudu_fact_table f
>>> join company_dim c on f.company_id = c.company_id
>>> where c.company_id in (123,456)
>>> and f.print_date_id between 20170101 and 20170202
>>>
>>> How should I design my kudu table so performance is somewhat comparable?
>>>
>>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



-- 
Regards,

Tenny Susanto

Re: kudu table design question

Posted by Todd Lipcon <to...@cloudera.com>.
I'd add that moving the print_date_id to the beginning of the primary key
in the Kudu fact table would allow each server to do a range scan instead
of a full scan.
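
For illustration (sketch only, reusing your column names), the key
declaration would become:

```sql
-- Leading the primary key with print_date_id lets the BETWEEN predicate
-- bound the scan to a contiguous key range on each tablet.
PRIMARY KEY (print_date_id, id)
```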

-Todd

On Thu, Feb 23, 2017 at 5:40 PM, Dan Burkert <da...@cloudera.com> wrote:

> Hi Tenny,
>
> First off, how many tablet servers are in your cluster?  16 partitions is
> appropriate for one or maybe two tablet servers, so if your cluster is
> bigger you could try bumping the number of partitions.
>
> Second, the schemas don't look identical, you have an additional 'id'
> column in the Kudu table, and crucially, it doesn't have any predicates, so
> this query is doing a full table scan.
>
> Finally, the Parquet table is likely able to take advantage of significant
> partition pruning due to the between clause.  An equivalent in Kudu would
> be range partitioning on the print_date_id.  You might try doing the same
> for Kudu.
>
> - Dan
>
> On Thu, Feb 23, 2017 at 5:08 PM, tenny susanto <te...@gmail.com>
> wrote:
>
>> I have a table (call this fact_table)  that I want to create in kudu.
>>
>> I have an equivalent table in impala/parquet that is partitioned by
>> day_id.
>>
>> create table impala_fact_table (
>> company_id INT,
>> transcount INT)
>> partitioned by
>> (print_date_id INT)
>> STORED AS PARQUET;
>>
>> so a common query would be:
>>
>> select  sum(transcount)
>> from impala_fact_table f
>> join company_dim c on f.company_id = c.company_id
>> where c.company_id in (123,456)
>> and f.print_date_id between 20170101 and 20170202
>>
>> I created an equivalent of the fact table in kudu:
>>
>> CREATE TABLE kudu_fact_table (
>> id STRING,
>> print_date_id INT,
>> company_id INT,
>> transcount INT,
>> PRIMARY KEY (id, print_date_id)
>> )
>> PARTITION BY HASH PARTITIONS 16
>> STORED AS KUDU
>> TBLPROPERTIES(
>>   'kudu.table_name' = 'kudu_fact_table',
>>   'kudu.master_addresses' = 'myserver:7051'
>> );
>>
>> But the performance of the join with this Kudu table is terrible: 2
>> seconds with the Impala table vs 126 seconds with the Kudu table.
>>
>> select  sum(transcount)
>> from kudu_fact_table f
>> join company_dim c on f.company_id = c.company_id
>> where c.company_id in (123,456)
>> and f.print_date_id between 20170101 and 20170202
>>
>> How should I design my kudu table so performance is somewhat comparable?
>>
>>
>


-- 
Todd Lipcon
Software Engineer, Cloudera

Re: kudu table design question

Posted by Dan Burkert <da...@cloudera.com>.
Hi Tenny,

First off, how many tablet servers are in your cluster?  16 partitions is
appropriate for one or maybe two tablet servers, so if your cluster is
bigger you could try bumping the number of partitions.

Second, the schemas don't look identical, you have an additional 'id'
column in the Kudu table, and crucially, it doesn't have any predicates, so
this query is doing a full table scan.

Finally, the Parquet table is likely able to take advantage of significant
partition pruning due to the between clause.  An equivalent in Kudu would
be range partitioning on the print_date_id.  You might try doing the same
for Kudu.
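
As an untested sketch of what that could look like (keeping your column
names, with a couple of example date ranges made up for illustration):

```sql
-- Sketch only: range-partition on print_date_id so the BETWEEN predicate
-- prunes partitions, mirroring the Parquet table's date partitioning.
CREATE TABLE kudu_fact_table (
print_date_id INT,
id STRING,
company_id INT,
transcount INT,
PRIMARY KEY (print_date_id, id)
)
PARTITION BY RANGE (print_date_id)
(
  PARTITION 20170101 <= VALUES < 20170401,
  PARTITION 20170401 <= VALUES < 20170701
  -- ... add ranges to cover the rest of the date span
)
STORED AS KUDU
TBLPROPERTIES(
  'kudu.table_name' = 'kudu_fact_table',
  'kudu.master_addresses' = 'myserver:7051'
);
```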

- Dan

On Thu, Feb 23, 2017 at 5:08 PM, tenny susanto <te...@gmail.com>
wrote:

> I have a table (call this fact_table)  that I want to create in kudu.
>
> I have an equivalent table in impala/parquet that is partitioned by
> day_id.
>
> create table impala_fact_table (
> company_id INT,
> transcount INT)
> partitioned by
> (print_date_id INT)
> STORED AS PARQUET;
>
> so a common query would be:
>
> select  sum(transcount)
> from impala_fact_table f
> join company_dim c on f.company_id = c.company_id
> where c.company_id in (123,456)
> and f.print_date_id between 20170101 and 20170202
>
> I created an equivalent of the fact table in kudu:
>
> CREATE TABLE kudu_fact_table (
> id STRING,
> print_date_id INT,
> company_id INT,
> transcount INT,
> PRIMARY KEY (id, print_date_id)
> )
> PARTITION BY HASH PARTITIONS 16
> STORED AS KUDU
> TBLPROPERTIES(
>   'kudu.table_name' = 'kudu_fact_table',
>   'kudu.master_addresses' = 'myserver:7051'
> );
>
> But the performance of the join with this Kudu table is terrible: 2
> seconds with the Impala table vs 126 seconds with the Kudu table.
>
> select  sum(transcount)
> from kudu_fact_table f
> join company_dim c on f.company_id = c.company_id
> where c.company_id in (123,456)
> and f.print_date_id between 20170101 and 20170202
>
> How should I design my kudu table so performance is somewhat comparable?
>
>