You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@kudu.apache.org by Paul Brannan <pa...@thesystech.com> on 2017/02/15 18:27:03 UTC

Delete row by partial key

I want to delete all rows that match a particular partial key.  For
example, if my schema includes columns "foo", "bar", and "baz" in its
primary key, I want to be able to delete all rows with "foo=16" and
"bar=32", regardless of the value of baz.  If I attempt to apply a
KuduDelete without specifying "baz", I get an error "Illegal state: Key not
specified".

The best I have come up with so far is to do a scan and copy the data
cell-by-cell from the RowPtr returned by the scan into the KuduPartialRow
used by the delete; I don't see any good way in the interface to copy row
data from one to the other without copying cell-by-cell.  The code looks
something like:

  for (auto idx : primary_key_column_indexes) {
    switch(schema.Column(idx).type()) {
    case KuduColumnSchema::INT16: // GetInt16/SetInt16
    case KuduColumnSchema::INT32: // GetInt32/SetInt32
    case KuduColumnSchema::STRING: // GetString/SetString
    // and so on...
    }
  }

Is there a better way?

Re: Delete row by partial key

Posted by Paul Brannan <pa...@thesystech.com>.

I think this makes sense for isset_bitmap_ and owned_strings_bitmap_.  I
have the source checked out from github and will try to build and run the
regression tests so I can play with this idea.

In general though it seems I only want to copy the primary keys (any given
RowPtr might have other columns).  I think I can come up with a way to do
that but it will take some more learning...

Thanks for the help!


On Thu, Feb 23, 2017 at 2:18 AM, Todd Lipcon <to...@cloudera.com> wrote:

> On Thu, Feb 16, 2017 at 8:36 AM, Paul Brannan <paul.brannan@thesystech.com
> > wrote:
>
>> Thank you for the quick response!
>>
>> I do understand the desire to not present in the API the appearance of a
>> feature that isn't really there.  As an example, I was reading KUDU-1291
>> yesterday and realized that since predicates are specified by name and not
>> column index, it's easy to accidentally construct an inefficient scan by
>> omitting the first component of the key.  It's not surprising given how
>> kudu works, but it's not obvious just by looking at the API.
>>
>> I would like to contribute a patch, though I would like to learn more
>> about kudu before I feel comfortable.  In this case, for example, my naive
>> solution may work, but isn't very robust -- since it has to switch on the
>> data type of each cell, it will fail when new data types are added.
>>
>
> I don't think that's too bad - we have lots of places where we have to add
> a new case label when a new type is added, so what's one more? :)
>
>
>>   I have a hunch it's possible to copy row_data_ directly somehow, but I
>> have no idea how to correctly set isset_bitmap_ or owned_strings_bitmap_.
>>
>
> "isset_bitmap_" should probably be set based on the projected columns of
> the scanner.
> "owned_stirngs_bitmap_" indicates which strings in the row are "owned" by
> the PartialRow rather than "referenced". I think to make this API safe,
> we'd probably want to clone all of the string data, which would imply
> setting owned_strings_bitmap_ to 1 bits for those strings.
>
> Hope that helps
> -Todd
>
>
>>
>> On Wed, Feb 15, 2017 at 1:32 PM, Todd Lipcon <to...@cloudera.com> wrote:
>>
>>> Hi Paul,
>>>
>>> You're correct that there is no predicate-based delete currently
>>> available, so you have to scan and then feed the results of the scan back
>>> into your desired mutations/deletes. This is intentional, since right now
>>> we don't have multi-row transactional capabilities, and a "delete by
>>> predicate" API would probably give the false impression that it is
>>> transactional.
>>>
>>> I also think you're right that there isn't a nice way of propagating a
>>> RowResult into a PartialRow as you need to do here. It seems you're a C++
>>> programmer -- any interest in contributing a patch to make this
>>> transformation a bit easier?
>>>
>>> -Todd
>>>
>>>
>>>
>>> On Wed, Feb 15, 2017 at 10:27 AM, Paul Brannan <
>>> paul.brannan@thesystech.com> wrote:
>>>
>>>> I want to delete all rows that match a particular partial key.  For
>>>> example, if my schema includes columns "foo", "bar", and "baz" in its
>>>> primary key, I want to be able to delete all rows with "foo=16" and
>>>> "bar=32", regardless of the value of baz.  If I attempt to apply a
>>>> KuduDelete without specifying "baz", I get an error "Illegal state: Key not
>>>> specified".
>>>>
>>>> The best I have come up with so far is to do a scan and copy the data
>>>> cell-by-cell from the RowPtr returned by the scan into the KuduPartialRow
>>>> used by the delete; I don't see any good way in the interface to copy row
>>>> data from one to the other without copying cell-by-cell.  The code looks
>>>> something like:
>>>>
>>>>   for (auto idx : primary_key_column_indexes) {
>>>>     switch(schema.Column(idx).type()) {
>>>>     case KuduColumnSchema::INT16: // GetInt16/SetInt16
>>>>     case KuduColumnSchema::INT32: // GetInt32/SetInt32
>>>>     case KuduColumnSchema::STRING: // GetString/SetString
>>>>     // and so on...
>>>>     }
>>>>   }
>>>>
>>>> Is there a better way?
>>>>
>>>>
>>>
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>>
>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

Re: Delete row by partial key

Posted by Todd Lipcon <to...@cloudera.com>.

On Thu, Feb 16, 2017 at 8:36 AM, Paul Brannan <pa...@thesystech.com>
wrote:

> Thank you for the quick response!
>
> I do understand the desire to not present in the API the appearance of a
> feature that isn't really there.  As an example, I was reading KUDU-1291
> yesterday and realized that since predicates are specified by name and not
> column index, it's easy to accidentally construct an inefficient scan by
> omitting the first component of the key.  It's not surprising given how
> kudu works, but it's not obvious just by looking at the API.
>
> I would like to contribute a patch, though I would like to learn more
> about kudu before I feel comfortable.  In this case, for example, my naive
> solution may work, but isn't very robust -- since it has to switch on the
> data type of each cell, it will fail when new data types are added.
>

I don't think that's too bad - we have lots of places where we have to add
a new case label when a new type is added, so what's one more? :)


>   I have a hunch it's possible to copy row_data_ directly somehow, but I
> have no idea how to correctly set isset_bitmap_ or owned_strings_bitmap_.
>

"isset_bitmap_" should probably be set based on the projected columns of
the scanner.
"owned_stirngs_bitmap_" indicates which strings in the row are "owned" by
the PartialRow rather than "referenced". I think to make this API safe,
we'd probably want to clone all of the string data, which would imply
setting owned_strings_bitmap_ to 1 bits for those strings.

Hope that helps
-Todd


>
> On Wed, Feb 15, 2017 at 1:32 PM, Todd Lipcon <to...@cloudera.com> wrote:
>
>> Hi Paul,
>>
>> You're correct that there is no predicate-based delete currently
>> available, so you have to scan and then feed the results of the scan back
>> into your desired mutations/deletes. This is intentional, since right now
>> we don't have multi-row transactional capabilities, and a "delete by
>> predicate" API would probably give the false impression that it is
>> transactional.
>>
>> I also think you're right that there isn't a nice way of propagating a
>> RowResult into a PartialRow as you need to do here. It seems you're a C++
>> programmer -- any interest in contributing a patch to make this
>> transformation a bit easier?
>>
>> -Todd
>>
>>
>>
>> On Wed, Feb 15, 2017 at 10:27 AM, Paul Brannan <
>> paul.brannan@thesystech.com> wrote:
>>
>>> I want to delete all rows that match a particular partial key.  For
>>> example, if my schema includes columns "foo", "bar", and "baz" in its
>>> primary key, I want to be able to delete all rows with "foo=16" and
>>> "bar=32", regardless of the value of baz.  If I attempt to apply a
>>> KuduDelete without specifying "baz", I get an error "Illegal state: Key not
>>> specified".
>>>
>>> The best I have come up with so far is to do a scan and copy the data
>>> cell-by-cell from the RowPtr returned by the scan into the KuduPartialRow
>>> used by the delete; I don't see any good way in the interface to copy row
>>> data from one to the other without copying cell-by-cell.  The code looks
>>> something like:
>>>
>>>   for (auto idx : primary_key_column_indexes) {
>>>     switch(schema.Column(idx).type()) {
>>>     case KuduColumnSchema::INT16: // GetInt16/SetInt16
>>>     case KuduColumnSchema::INT32: // GetInt32/SetInt32
>>>     case KuduColumnSchema::STRING: // GetString/SetString
>>>     // and so on...
>>>     }
>>>   }
>>>
>>> Is there a better way?
>>>
>>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Delete row by partial key

Posted by Paul Brannan <pa...@thesystech.com>.

Thank you for the quick response!

I do understand the desire to not present in the API the appearance of a
feature that isn't really there.  As an example, I was reading KUDU-1291
yesterday and realized that since predicates are specified by name and not
column index, it's easy to accidentally construct an inefficient scan by
omitting the first component of the key.  It's not surprising given how
kudu works, but it's not obvious just by looking at the API.

I would like to contribute a patch, though I would like to learn more about
kudu before I feel comfortable.  In this case, for example, my naive
solution may work, but isn't very robust -- since it has to switch on the
data type of each cell, it will fail when new data types are added.  I have
a hunch it's possible to copy row_data_ directly somehow, but I have no
idea how to correctly set isset_bitmap_ or owned_strings_bitmap_.

On Wed, Feb 15, 2017 at 1:32 PM, Todd Lipcon <to...@cloudera.com> wrote:

> Hi Paul,
>
> You're correct that there is no predicate-based delete currently
> available, so you have to scan and then feed the results of the scan back
> into your desired mutations/deletes. This is intentional, since right now
> we don't have multi-row transactional capabilities, and a "delete by
> predicate" API would probably give the false impression that it is
> transactional.
>
> I also think you're right that there isn't a nice way of propagating a
> RowResult into a PartialRow as you need to do here. It seems you're a C++
> programmer -- any interest in contributing a patch to make this
> transformation a bit easier?
>
> -Todd
>
>
>
> On Wed, Feb 15, 2017 at 10:27 AM, Paul Brannan <
> paul.brannan@thesystech.com> wrote:
>
>> I want to delete all rows that match a particular partial key.  For
>> example, if my schema includes columns "foo", "bar", and "baz" in its
>> primary key, I want to be able to delete all rows with "foo=16" and
>> "bar=32", regardless of the value of baz.  If I attempt to apply a
>> KuduDelete without specifying "baz", I get an error "Illegal state: Key not
>> specified".
>>
>> The best I have come up with so far is to do a scan and copy the data
>> cell-by-cell from the RowPtr returned by the scan into the KuduPartialRow
>> used by the delete; I don't see any good way in the interface to copy row
>> data from one to the other without copying cell-by-cell.  The code looks
>> something like:
>>
>>   for (auto idx : primary_key_column_indexes) {
>>     switch(schema.Column(idx).type()) {
>>     case KuduColumnSchema::INT16: // GetInt16/SetInt16
>>     case KuduColumnSchema::INT32: // GetInt32/SetInt32
>>     case KuduColumnSchema::STRING: // GetString/SetString
>>     // and so on...
>>     }
>>   }
>>
>> Is there a better way?
>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

Re: Delete row by partial key

Posted by Todd Lipcon <to...@cloudera.com>.

Hi Paul,

You're correct that there is no predicate-based delete currently available,
so you have to scan and then feed the results of the scan back into your
desired mutations/deletes. This is intentional, since right now we don't
have multi-row transactional capabilities, and a "delete by predicate" API
would probably give the false impression that it is transactional.

I also think you're right that there isn't a nice way of propagating a
RowResult into a PartialRow as you need to do here. It seems you're a C++
programmer -- any interest in contributing a patch to make this
transformation a bit easier?

-Todd

On Wed, Feb 15, 2017 at 10:27 AM, Paul Brannan <pa...@thesystech.com>
wrote:

> I want to delete all rows that match a particular partial key.  For
> example, if my schema includes columns "foo", "bar", and "baz" in its
> primary key, I want to be able to delete all rows with "foo=16" and
> "bar=32", regardless of the value of baz.  If I attempt to apply a
> KuduDelete without specifying "baz", I get an error "Illegal state: Key not
> specified".
>
> The best I have come up with so far is to do a scan and copy the data
> cell-by-cell from the RowPtr returned by the scan into the KuduPartialRow
> used by the delete; I don't see any good way in the interface to copy row
> data from one to the other without copying cell-by-cell.  The code looks
> something like:
>
>   for (auto idx : primary_key_column_indexes) {
>     switch(schema.Column(idx).type()) {
>     case KuduColumnSchema::INT16: // GetInt16/SetInt16
>     case KuduColumnSchema::INT32: // GetInt32/SetInt32
>     case KuduColumnSchema::STRING: // GetString/SetString
>     // and so on...
>     }
>   }
>
> Is there a better way?
>
>

-- 
Todd Lipcon
Software Engineer, Cloudera