You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@kudu.apache.org by "Ray Liu (rayliu)" <ra...@cisco.com> on 2020/05/06 11:54:34 UTC

Why does partition keys have to be in the primary keys?

We have two pipelines writing to the same table, and that table is ranged partitioned by “day” field.

Each pipeline fills some of the fields in the table with the same key.

But the “day” field in these two pipelines may be different.

Because range partition keys must exist in primary keys, so there will be two records in the result table.

What we want is one complete record.

So my question is why does partition keys have to be in the primary keys?

Is there any workaround for this?

Re: Why does partition keys have to be in the primary keys?

Posted by "Ray Liu (rayliu)" <ra...@cisco.com>.
Hi Alexey,

I don’t think UPSERT works in this case, for example
We have a table that “key” field is the unique id for each record, but in order to use range partition, we have to use key+day as primary key.

First record (key1, day1,value1) arrives and is inserted into table
Later on, another record (key1, day2,value2) arrives.
If we use UPSERT there will be two records in the table
(key1, day1,value1) and (key1, day2,value2)
(day1 changed to day2 because time changes to the next day)

What we want is one record (key1, day1,value2).

Regards
Ray
From: Alexey Serbin <as...@cloudera.com>
Reply-To: "user@kudu.apache.org" <us...@kudu.apache.org>
Date: Thursday, May 7, 2020 at 05:13
To: "user@kudu.apache.org" <us...@kudu.apache.org>
Subject: Re: Why does partition keys have to be in the primary keys?

Hi,

The restriction on the partitioning key to be composed of primary key columns significantly simplifies the design and implementation.

However, I'm not sure I understand why the rules of partitioning come to play here.  To me it looks like the main question is about the schema for the table, i.e. what should be the primary key.  If different pipelines use different values for the 'day' field, but one result record is expected, does it imply that pipelines need to update already existing records?  If so, then maybe use UPSERT instead of INSERT for those pipelines?

I would start with trying to understand what's the primary key for the table to satisfy the requirements.  Once it's clear, I'd think about the partitioning rules for the table.


Thanks,

Alexey

On Wed, May 6, 2020 at 4:54 AM Ray Liu (rayliu) <ra...@cisco.com>> wrote:
We have two pipelines writing to the same table, and that table is ranged partitioned by “day” field.

Each pipeline fills some of the fields in the table with the same key.

But the “day” field in these two pipelines may be different.

Because range partition keys must exist in primary keys, so there will be two records in the result table.

What we want is one complete record.

So my question is why does partition keys have to be in the primary keys?

Is there any workaround for this?

Re: Why does partition keys have to be in the primary keys?

Posted by Mauricio Aristizabal <ma...@impact.com>.
Even if you didn't need 'day' in PK, since it's still range partitioned by
it, you still would end up with 2 records in the 2 partitions.  If day is
something that can 'change' then it can't be part of PK or range partition.

On Wed, May 6, 2020 at 2:13 PM Alexey Serbin <as...@cloudera.com> wrote:

> Hi,
>
> The restriction on the partitioning key to be composed of primary key
> columns significantly simplifies the design and implementation.
>
> However, I'm not sure I understand why the rules of partitioning come to
> play here.  To me it looks like the main question is about the schema for
> the table, i.e. what should be the primary key.  If different pipelines use
> different values for the 'day' field, but one result record is expected,
> does it imply that pipelines need to update already existing records?  If
> so, then maybe use UPSERT instead of INSERT for those pipelines?
>
> I would start with trying to understand what's the primary key for the
> table to satisfy the requirements.  Once it's clear, I'd think about the
> partitioning rules for the table.
>
>
> Thanks,
>
> Alexey
>
> On Wed, May 6, 2020 at 4:54 AM Ray Liu (rayliu) <ra...@cisco.com> wrote:
>
>> We have two pipelines writing to the same table, and that table is ranged
>> partitioned by “day” field.
>>
>>
>> Each pipeline fills some of the fields in the table with the same key.
>>
>>
>>
>> But the “day” field in these two pipelines may be different.
>>
>>
>>
>> Because range partition keys must exist in primary keys, so there will be
>> two records in the result table.
>>
>>
>>
>> What we want is one complete record.
>>
>>
>>
>> So my question is why does partition keys have to be in the primary keys?
>>
>>
>>
>> Is there any workaround for this?
>>
>

-- 
Mauricio Aristizabal
Architect - Data Pipeline
mauricio@impact.com | 323 309 4260
https://impact.com
<https://www.linkedin.com/company/impact-partech/>
<https://www.facebook.com/ImpactParTech/>
<https://twitter.com/impactpartech>
<https://www.youtube.com/c/impactpartech>
<https://go.impact.com/WB-PC-AW-Navigating-Partner-Marketing-Strategy-During-Covid-19.html>

Re: Why does partition keys have to be in the primary keys?

Posted by Alexey Serbin <as...@cloudera.com>.
Hi,

The restriction on the partitioning key to be composed of primary key
columns significantly simplifies the design and implementation.

However, I'm not sure I understand why the rules of partitioning come to
play here.  To me it looks like the main question is about the schema for
the table, i.e. what should be the primary key.  If different pipelines use
different values for the 'day' field, but one result record is expected,
does it imply that pipelines need to update already existing records?  If
so, then maybe use UPSERT instead of INSERT for those pipelines?

I would start with trying to understand what's the primary key for the
table to satisfy the requirements.  Once it's clear, I'd think about the
partitioning rules for the table.


Thanks,

Alexey

On Wed, May 6, 2020 at 4:54 AM Ray Liu (rayliu) <ra...@cisco.com> wrote:

> We have two pipelines writing to the same table, and that table is ranged
> partitioned by “day” field.
>
>
> Each pipeline fills some of the fields in the table with the same key.
>
>
>
> But the “day” field in these two pipelines may be different.
>
>
>
> Because range partition keys must exist in primary keys, so there will be
> two records in the result table.
>
>
>
> What we want is one complete record.
>
>
>
> So my question is why does partition keys have to be in the primary keys?
>
>
>
> Is there any workaround for this?
>