Posted to dev@phoenix.apache.org by Siddhi Mehta <sm...@gmail.com> on 2015/11/03 01:31:43 UTC

PhoenixHbaseStorage to Skip invalid rows

Hey All,

I wanted to add a notion of skipping invalid rows to PhoenixHbaseStorage,
similar to how the CSVBulkLoad tool has an option to ignore bad rows. I did
some work on the Apache Pig code that allows storers to have a notion of
customizable/configurable error handling: PIG-4704
<https://issues.apache.org/jira/browse/PIG-4704>.

I would like to plug this behavior into PhoenixHbaseStorage and propose some
changes to that end.

*Current Behavior/Problem:*

PhoenixRecordWriter uses executeBatch() to process rows once the batch size
is reached. If there are any client-side validation/syntactic errors, such as
data not fitting the column size, executeBatch() throws an exception and
there is no way to retrieve the valid rows from the batch and retry them. We
discard the whole batch or fail the job, without any error handling.

With auto-commit set to false, execute() also serves the purpose of not
making any RPC calls: it performs a bunch of client-side validation and adds
the row to the client-side mutation cache.

The RPC call is only made on conn.commit().
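To make the failure mode concrete, here is a small self-contained simulation
(not actual Phoenix code; BatchFailureSketch, isValid, and
writeBatchAllOrNothing are hypothetical names, with column-size overflow as a
stand-in for client-side validation errors) of how one bad row poisons the
whole batch under executeBatch()-style semantics:

```java
import java.util.List;

class BatchFailureSketch {
    // Stand-in for client-side validation: a row is "invalid" if it
    // exceeds the column size.
    static boolean isValid(String row, int maxColumnSize) {
        return row.length() <= maxColumnSize;
    }

    // Mirrors executeBatch(): if any row is invalid, the whole batch
    // throws and no per-row information survives -- the valid rows
    // cannot be retrieved and retried.
    static int writeBatchAllOrNothing(List<String> batch, int maxColumnSize) {
        for (String row : batch) {
            if (!isValid(row, maxColumnSize)) {
                throw new IllegalStateException(
                    "Row '" + row + "' failed validation; whole batch is lost");
            }
        }
        return batch.size(); // all rows "written"
    }

    public static void main(String[] args) {
        // A single oversized row ("toolong") fails the entire batch,
        // even though "a" and "bb" were valid.
        try {
            writeBatchAllOrNothing(List.of("a", "toolong", "bb"), 2);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```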

*Proposed Change*

To be able to use configurable error handling and ignore only the failed
records instead of discarding the whole batch, I want to propose changing the
behavior in PhoenixRecordWriter from executeBatch() to execute(), or having a
configuration option to toggle between the two behaviors.

Thoughts?
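The proposed per-row behavior can be sketched as follows. This is a
simulation with hypothetical names (SkipInvalidRowsSketch,
writeSkippingInvalid), not the actual PhoenixRecordWriter patch: each row is
"executed" individually so a bad row can be recorded and skipped, and only
the surviving rows would be flushed by conn.commit():

```java
import java.util.ArrayList;
import java.util.List;

class SkipInvalidRowsSketch {
    // Same stand-in validation as before: row must fit the column size.
    static boolean isValid(String row, int maxColumnSize) {
        return row.length() <= maxColumnSize;
    }

    // Per-row processing: valid rows are accepted (standing in for
    // stmt.execute() queueing a mutation client-side with auto-commit
    // off), invalid rows are collected into 'skipped' for the
    // configurable error handler instead of failing the whole batch.
    static List<String> writeSkippingInvalid(List<String> batch,
                                             int maxColumnSize,
                                             List<String> skipped) {
        List<String> accepted = new ArrayList<>();
        for (String row : batch) {
            if (isValid(row, maxColumnSize)) {
                accepted.add(row);
            } else {
                skipped.add(row);
            }
        }
        // conn.commit() would flush 'accepted' in a single RPC here.
        return accepted;
    }

    public static void main(String[] args) {
        List<String> skipped = new ArrayList<>();
        List<String> accepted =
            writeSkippingInvalid(List.of("a", "toolong", "bb"), 2, skipped);
        System.out.println("accepted=" + accepted + " skipped=" + skipped);
    }
}
```

The trade-off versus executeBatch() is one validation pass per row instead of
per batch, in exchange for per-row error visibility.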

Re: PhoenixHbaseStorage to Skip invalid rows

Posted by Siddhi Mehta <sm...@gmail.com>.
Jira Created: https://issues.apache.org/jira/browse/PHOENIX-2367
I will submit a patch for review soon.

On Tue, Nov 3, 2015 at 10:52 AM, Jan Fernando <jf...@salesforce.com>
wrote:

> +1 on making this change. Can you file a JIRA for it?
>

Re: PhoenixHbaseStorage to Skip invalid rows

Posted by Jan Fernando <jf...@salesforce.com>.
+1 on making this change. Can you file a JIRA for it?
