Posted to user@kudu.apache.org by Nabeelah Harris <na...@impact.com> on 2019/03/05 14:33:31 UTC

Check existing range partitions using the Java API

Hi there

Currently, the only method available on KuduTable to check which
partitions already exist is 'KuduTable.getFormattedRangePartitions'.
This however looks to be experimental and only intended for use by
Impala. Other than replicating the logic used in the above-mentioned
method, is there any way I can easily retrieve the range partitions
(or partitions at all) using the Java API? My use-case at the moment
is to create range partitions based on the data I am about to insert,
and to do so I want to first check if that range partition already
exists, to prevent errors.

Thanks
Nabeelah

Re: Check existing range partitions using the Java API

Posted by Grant Henke <gh...@cloudera.com>.

The work to add a public partition info API is tracked in KUDU-1872
<https://issues.apache.org/jira/browse/KUDU-1872>. I agree with Adar that
using the KuduPartitioner to detect rows in uncovered ranges is likely
the best option that exists today.

I don't know of any way to handle the errors coming from the KuduContext. The
writeRows method catches all errors and throws a RuntimeException without
enough context for the caller to handle the individual row errors. It would be
cool if the API allowed the user to provide an error handler function, or at
minimum threw the exception with enough context to handle it. I filed KUDU-2737
<https://issues.apache.org/jira/browse/KUDU-2737> to track supporting row
error handling in KuduContext.
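
As a rough illustration of the "error handler function" idea, here is a
plain-Java sketch. It is not the KuduContext API and not necessarily how
KUDU-2737 would be implemented: RowErrorHandlerSketch, RowSink, writeRows and
the handler signature are all hypothetical names chosen for this example, and
the sink is a toy stand-in for the real write path.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiConsumer;

/** Hypothetical handler-based write loop; not part of Kudu or kudu-spark. */
final class RowErrorHandlerSketch {
    /** Stand-in for whatever actually writes one row and may reject it. */
    interface RowSink<R> {
        void write(R row) throws Exception;
    }

    /**
     * Writes every row. A rejected row is passed to the handler together with
     * its exception, and the loop continues with the next row.
     * Returns the number of rows that were written.
     */
    static <R> int writeRows(List<R> rows, RowSink<R> sink,
                             BiConsumer<R, Exception> onRowError) {
        int written = 0;
        for (R row : rows) {
            try {
                sink.write(row);
                written++;
            } catch (Exception e) {
                onRowError.accept(row, e);
            }
        }
        return written;
    }

    public static void main(String[] args) {
        List<Integer> failed = new ArrayList<>();
        // Toy sink that rejects keys of 100 and above, standing in for a
        // write that hits a non-covered range.
        RowSink<Integer> sink = key -> {
            if (key >= 100) {
                throw new IllegalStateException("non-covered range for key " + key);
            }
        };
        int written = writeRows(Arrays.asList(1, 150, 2), sink,
                (row, error) -> failed.add(row));
        System.out.println(written + " written, failed rows: " + failed);
    }
}
```

The point of the callback shape is that the caller sees each rejected row and
its error, instead of a single exception for the whole batch.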

It's worth mentioning that if you would like to contribute to Kudu and
provide patches for the functionality you need, we would be happy to review
and commit them.

On Wed, Mar 6, 2019 at 3:14 AM Adar Lieber-Dembo <ad...@cloudera.com> wrote:

> FWIW, you can use a newer Kudu client with an older server as we take care
> to preserve backwards compatibility. The decoupling of client and server
> artifacts sort of makes sense anyway, because the server artifacts are
> found on the cluster nodes and the client artifacts are typically
> distributed along with the application.
>
> In any case, I agree that I don't see an obvious way to get at the
> underlying per-row errors if you're using the KuduContext. Maybe someone
> more familiar with the Kudu Spark bindings can chime in with suggestions.

-- 
Grant Henke
Software Engineer | Cloudera
grant@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke

Re: Check existing range partitions using the Java API

Posted by Adar Lieber-Dembo <ad...@cloudera.com>.

FWIW, you can use a newer Kudu client with an older server as we take care
to preserve backwards compatibility. The decoupling of client and server
artifacts sort of makes sense anyway, because the server artifacts are
found on the cluster nodes and the client artifacts are typically
distributed along with the application.

In any case, I agree that I don't see an obvious way to get at the
underlying per-row errors if you're using the KuduContext. Maybe someone
more familiar with the Kudu Spark bindings can chime in with suggestions.

On Wed, Mar 6, 2019 at 12:57 AM Nabeelah Harris <na...@impact.com>
wrote:

> Hi Adar
>
> Thanks
>
> Option 1 isn't really viable, since we're running Cloudera with Kudu 1.7 and
> are therefore using the 1.7 client libraries. Option 2 seems to be the way to
> go, though since I am using KuduContext, I'm not sure that there is a clean way
> for me to check for errors row by row. I tried naively wrapping my
> kuduContext.upsert call in a try...catch and running an alterTable if a
> SparkException is caught. I'm able to catch the SparkException that occurs
> with 'java.lang.RuntimeException: failed to write 1 rows from DataFrame to
> Kudu; sample errors: Not found: non-covered range' on the tasks, but of
> course I still end up with a bunch of failed tasks, and the partition is
> only added once all my tasks have failed.
>
> Do you perhaps have some guidance in this regard?

Re: Check existing range partitions using the Java API

Posted by Nabeelah Harris <na...@impact.com>.

Hi Adar

Thanks

Option 1 isn't really viable, since we're running Cloudera with Kudu 1.7 and
are therefore using the 1.7 client libraries. Option 2 seems to be the way to
go, though since I am using KuduContext, I'm not sure that there is a clean way
for me to check for errors row by row. I tried naively wrapping my
kuduContext.upsert call in a try...catch and running an alterTable if a
SparkException is caught. I'm able to catch the SparkException that occurs
with 'java.lang.RuntimeException: failed to write 1 rows from DataFrame to
Kudu; sample errors: Not found: non-covered range' on the tasks, but of
course I still end up with a bunch of failed tasks, and the partition is
only added once all my tasks have failed.

Do you perhaps have some guidance in this regard?
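
To make the "insert anyway, then repair" idea concrete, here is a plain-Java
sketch of handling rejected rows individually rather than failing the whole
write. It does not use Kudu or kudu-spark: OptimisticWriteSketch and all of
its methods are hypothetical, keys are simplified to a single long column, and
newly added ranges are assumed to have a fixed width of 100.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java model of "write first, repair on error"; not the Kudu API. */
final class OptimisticWriteSketch {
    static final long RANGE_WIDTH = 100L; // assumed width of every range

    // Lower bound (inclusive) -> upper bound (exclusive).
    private final TreeMap<Long, Long> ranges = new TreeMap<>();
    private final List<Long> stored = new ArrayList<>();

    /** Tries every key and returns the ones rejected for lack of a covering range. */
    List<Long> write(List<Long> keys) {
        List<Long> rejected = new ArrayList<>();
        for (long key : keys) {
            Map.Entry<Long, Long> range = ranges.floorEntry(key);
            if (range != null && key < range.getValue()) {
                stored.add(key);
            } else {
                rejected.add(key);
            }
        }
        return rejected;
    }

    /** Adds the fixed-width range containing the key, if it is not present yet. */
    void ensureRangeFor(long key) {
        long lower = Math.floorDiv(key, RANGE_WIDTH) * RANGE_WIDTH;
        ranges.putIfAbsent(lower, lower + RANGE_WIDTH);
    }

    /** Writes once, adds ranges for the rejected keys, then retries only those keys. */
    List<Long> writeWithRepair(List<Long> keys) {
        List<Long> rejected = write(keys);
        for (long key : rejected) {
            ensureRangeFor(key);
        }
        return write(rejected); // empty when every retry succeeded
    }

    List<Long> storedKeys() {
        return stored;
    }

    public static void main(String[] args) {
        OptimisticWriteSketch table = new OptimisticWriteSketch();
        table.ensureRangeFor(0L); // start with a single range [0, 100)
        List<Long> stillRejected =
                table.writeWithRepair(Arrays.asList(5L, 150L, 160L, 250L));
        System.out.println("stored=" + table.storedKeys()
                + " stillRejected=" + stillRejected);
    }
}
```

In this model only the rejected keys are retried after the missing ranges are
added, which is the behaviour that is hard to get when the only signal is one
exception for the whole batch.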

On Wed, Mar 6, 2019 at 7:58 AM Adar Lieber-Dembo <ad...@cloudera.com> wrote:

> Here are some other options:
> 1. Use the new KuduPartitioner class, available in master but not yet
> in any releases. Given a PartialRow (i.e. a row to be inserted), you
> can find its "partition index" and, more importantly for your use
> case, receive an exception if no partition exists for the row.
> 2. Insert the data anyway, and rely on per-row errors to tell you that
> a partition is missing. This is a more "optimistic" approach, but a
> somewhat expensive one at that.
>
> Would either of these work for you?

-- 
Nabeelah Harris
nabeelah.harris@impact.com | https://impact.com

Re: Check existing range partitions using the Java API

Posted by Adar Lieber-Dembo <ad...@cloudera.com>.

Here are some other options:
1. Use the new KuduPartitioner class, available in master but not yet
in any releases. Given a PartialRow (i.e. a row to be inserted), you
can find its "partition index" and, more importantly for your use
case, receive an exception if no partition exists for the row.
2. Insert the data anyway, and rely on per-row errors to tell you that
a partition is missing. This is a more "optimistic" approach, but a
somewhat expensive one at that.

Would either of these work for you?
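
For illustration, the kind of check described in option 1 can be modelled in
plain Java. This is only a sketch of the idea and does not use the Kudu
client: RangeCoverageSketch and its methods are hypothetical names, keys are
simplified to a single long column, and ranges are assumed to be
non-overlapping with an inclusive lower bound and an exclusive upper bound.
With the real client, the KuduPartitioner would take the place of isCovered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java model of range coverage; this is not the Kudu client API. */
final class RangeCoverageSketch {
    // Lower bound (inclusive) -> upper bound (exclusive), non-overlapping.
    private final TreeMap<Long, Long> ranges = new TreeMap<>();

    void addRange(long lowerInclusive, long upperExclusive) {
        ranges.put(lowerInclusive, upperExclusive);
    }

    /** True if some existing range contains the key. */
    boolean isCovered(long key) {
        // The only candidate is the range with the greatest lower bound <= key.
        Map.Entry<Long, Long> candidate = ranges.floorEntry(key);
        return candidate != null && key < candidate.getValue();
    }

    /** Returns the keys that would land in a non-covered range, in input order. */
    List<Long> uncoveredKeys(List<Long> keys) {
        List<Long> missing = new ArrayList<>();
        for (long key : keys) {
            if (!isCovered(key)) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        RangeCoverageSketch table = new RangeCoverageSketch();
        table.addRange(0L, 100L);
        table.addRange(200L, 300L);
        // Keys that need a new range partition before they can be inserted.
        System.out.println(table.uncoveredKeys(Arrays.asList(5L, 100L, 150L, 250L)));
    }
}
```

Checking before writing like this avoids the failed writes of option 2, at
the cost of one lookup per row.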

On Tue, Mar 5, 2019 at 6:33 AM Nabeelah Harris
<na...@impact.com> wrote:
>
> Hi there
>
> Currently, the only method available on KuduTable to check which
> partitions already exist is 'KuduTable.getFormattedRangePartitions'.
> This however looks to be experimental and only intended for use by
> Impala. Other than replicating the logic used in the above-mentioned
> method, is there any way I can easily retrieve the range partitions
> (or partitions at all) using the Java API? My use-case at the moment
> is to create range partitions based on the data I am about to insert,
> and to do so I want to first check if that range partition already
> exists, to prevent errors.
>
> Thanks
> Nabeelah