Posted to user@kudu.apache.org by Pietro Gentile <pi...@gmail.com> on 2018/06/20 15:31:46 UTC

spark kudu issues

Hi all,

I am currently evaluating Spark with Kudu and am facing the following
issues:

1) If you try to DELETE a row whose key is not present in the table,
you get an exception like this:

java.lang.RuntimeException: failed to write N rows from DataFrame to Kudu;
sample errors: Not found: key not found (error 0)

2) If you try to DELETE a row specifying only a subset of the table's
primary key columns, you get the following:

Caused by: java.lang.RuntimeException: failed to write N rows from
DataFrame to Kudu; sample errors: Invalid argument: No value provided for
key column:

Both use cases work correctly when you interact with Kudu through
Impala.

Any suggestions to overcome these limitations?

Thanks.
Best Regards

Pietro

Re: spark kudu issues

Posted by William Berkeley <wd...@gmail.com>.
What you're seeing is the expected behavior in both cases.

One way to achieve the semantics you want in both situations is to read in
the Kudu table to a data frame, then filter it in Spark SQL to contain just
the rows you want to delete, and then use that dataframe to do the
deletion. There should be no primary key errors with that method unless the
table is concurrently being deleted from.

For 2), what I describe above is what Impala does: it reads in the Kudu
table, finds the full primary keys of the rows matching your partial
specification of the key, and then issues deletes for those rows.
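
The read-filter-delete approach can be sketched in miniature with plain
Python standing in for Spark DataFrames (the table contents and the
`delete_where` helper here are purely illustrative; in real kudu-spark
code the final step would be a call like `KuduContext.deleteRows` on the
filtered DataFrame, which needs a live cluster):

```python
# Toy "table" keyed by a composite primary key (host, metric).
table = {
    ("h1", "cpu"): 0.9,
    ("h1", "mem"): 0.4,
    ("h2", "cpu"): 0.7,
}

def delete_where(table, predicate):
    """Find the full primary keys of rows matching a (possibly partial)
    key predicate, then delete only those rows. Because every deleted
    key was just read from the table, no 'key not found' errors and no
    'No value provided for key column' errors can arise."""
    # The "read the table and filter" step (Spark SQL in the real case).
    keys_to_delete = [k for k in table if predicate(k)]
    # The "issue deletes for those full keys" step.
    for k in keys_to_delete:
        del table[k]
    return keys_to_delete

# Delete by a subset of the key (host == "h1"), like Impala's DELETE:
deleted = delete_where(table, lambda k: k[0] == "h1")
```

Deleting with a predicate that matches nothing simply deletes zero rows
instead of raising, which is the semantics being asked for.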

Note that deletes of multiple rows aren't transactional.

I think having a way to issue deletes that ignores primary key errors is
reasonable, a "delete ignore" analog to insert ignore. I filed KUDU-2482
<https://issues.apache.org/jira/browse/KUDU-2482> for it.

-Will


On Wed, Jun 20, 2018 at 8:31 AM, Pietro Gentile <
pietro.gentile89.developer@gmail.com> wrote:

> Hi all,
>
> I am currently evaluating using Spark with Kudu.
> So I am facing the following issues:
>
> 1) If you try to DELETE a row with a key that is not present on the table
> you will have an Exception like this:
>
> java.lang.RuntimeException: failed to write N rows from DataFrame to Kudu;
> sample errors: Not found: key not found (error 0)
>
> 2) If you try to DELETE a row using a subset of a table key you will face
> the following:
>
> Caused by: java.lang.RuntimeException: failed to write N rows from
> DataFrame to Kudu; sample errors: Invalid argument: No value provided for
> key column:
>
> The use cases presented above are correctly working if you interact with
> kudu using Impala.
>
> Any suggestions to overcome these limitation?
>
> Thanks.
> Best Regards
>
> Pietro
>
