Posted to dev@phoenix.apache.org by "Josh Mahonin (JIRA)" <ji...@apache.org> on 2015/11/04 15:18:27 UTC

[jira] [Commented] (PHOENIX-2367) Change PhoenixRecordWriter to use execute instead of executeBatch

    [ https://issues.apache.org/jira/browse/PHOENIX-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14989597#comment-14989597 ] 

Josh Mahonin commented on PHOENIX-2367:
---------------------------------------

I like the idea of having this change be configurable.

It sounds like this change may not add any performance overhead, but if it does add a slight one, it would be nice to be able to bypass it. Although per-record validation is certainly a good idea for CSV loading, from the perspective of a Spark/Pig/MapReduce user the written data is frequently required to be well formed by that stage, having already gone through several steps of an execution pipeline.
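
For concreteness, a minimal sketch of how such a toggle might be read from the job configuration. The property name and the surrounding class are hypothetical placeholders for illustration; this issue does not define an actual configuration key.

    import org.apache.hadoop.conf.Configuration;

    public class WriterToggleExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Hypothetical property, for illustration only; defaulting to
            // false keeps the existing executeBatch() path, so current
            // users would see no change in behavior.
            boolean perRecordExecute =
                conf.getBoolean("phoenix.mapreduce.execute.per.record", false);
            System.out.println("per-record execute? " + perRecordExecute);
        }
    }

Defaulting the flag to the existing behavior would let Spark/Pig/MapReduce pipelines that already emit well-formed data opt out of any extra per-record work.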

> Change PhoenixRecordWriter to use execute instead of executeBatch
> -----------------------------------------------------------------
>
>                 Key: PHOENIX-2367
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-2367
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Siddhi Mehta
>            Assignee: Siddhi Mehta
>
> Hey All,
> I wanted to add a notion of skipping invalid rows for PhoenixHbaseStorage, similar to how the CSVBulkLoad tool has an option of ignoring bad rows. I did some work on the Apache Pig code that allows storers to have a notion of customizable/configurable error handling (PIG-4704).
> I wanted to plug this behavior into PhoenixHbaseStorage and propose certain changes to support it.
> Current Behavior/Problem:
> PhoenixRecordWriter makes use of executeBatch() to process rows once the batch size is reached. If there are any client-side validation or syntactic errors, like data not fitting the column size, executeBatch() throws an exception, and there is no way to retrieve the valid rows from the batch and retry them. We discard the whole batch or fail the job without any error handling.
> With auto-commit set to false, execute() also serves the purpose of not making any RPC calls, but it performs a bunch of client-side validation and adds the row to the client-side cache of mutations.
> On conn.commit() we make an RPC call.
> Proposed Change:
> To be able to use configurable error handling and ignore only the failed records instead of discarding the whole batch, I want to propose changing the behavior in PhoenixRecordWriter from executeBatch() to execute(), or having a configuration to toggle between the two behaviors.
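
A minimal, self-contained sketch of the per-record execute() approach described in the quoted proposal, with auto-commit off so that commit() flushes a whole batch in one RPC. The connection URL, table, and column sizes are illustrative assumptions, and the catch-and-skip logging stands in for the configurable error handling from PIG-4704.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class SkipBadRowsExample {
        public static void main(String[] args) throws SQLException {
            // Illustrative URL and table; adjust for a real cluster.
            Connection conn =
                DriverManager.getConnection("jdbc:phoenix:localhost");
            conn.setAutoCommit(false);  // execute() buffers mutations client-side
            PreparedStatement stmt =
                conn.prepareStatement("UPSERT INTO T (ID, NAME) VALUES (?, ?)");

            String[][] rows = {
                {"1", "ok"},
                {"2", "value-too-long-for-a-small-varchar-column"},  // bad row
                {"3", "also ok"},
            };

            final long batchSize = 1000;
            long written = 0;
            for (String[] row : rows) {
                try {
                    stmt.setString(1, row[0]);
                    stmt.setString(2, row[1]);
                    stmt.execute();        // client-side validation; no RPC yet
                    if (++written % batchSize == 0) {
                        conn.commit();     // one RPC flushes the buffered batch
                    }
                } catch (SQLException e) {
                    // executeBatch() would discard the whole batch here;
                    // per-record execute() lets us log and skip just this row.
                    System.err.println("Skipping bad row " + row[0] + ": " + e);
                }
            }
            conn.commit();                 // flush the tail of the batch
            conn.close();
        }
    }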


