Posted to issues@hive.apache.org by "Elliot West (JIRA)" <ji...@apache.org> on 2015/04/01 11:22:52 UTC
[jira] [Updated] (HIVE-10165) Improve hive-hcatalog-streaming extensibility and support updates and deletes.

     [ https://issues.apache.org/jira/browse/HIVE-10165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Elliot West updated HIVE-10165:
-------------------------------
    Description: 
h3. Overview
I'd like to extend the [hive-hcatalog-streaming|https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest] API so that it also supports the writing of record updates and deletes in addition to the already supported inserts.

h3. Motivation
We have many Hadoop processes outside of Hive that merge changed facts into existing datasets. Traditionally we achieve this by: reading in a ground-truth dataset and a modified dataset, grouping by a key, sorting by a sequence and then applying a function to determine inserted, updated, and deleted rows. However, in our current scheme we must rewrite all partitions that may potentially contain changes. In practice the number of mutated records is very small when compared with the records contained in a partition. This approach results in a number of operational issues:
* Excessive amount of write activity required for small data changes.
* Downstream applications cannot robustly read these datasets while they are being updated.
* Due to the scale of the updates (hundreds of partitions) the scope for contention is high.

I believe we can address this problem by instead writing only the changed records to a Hive transactional table. This should drastically reduce the amount of data that we need to write and also provide a means for managing concurrent access to the data. Our existing merge processes can read and retain each record's {{ROW_ID}}/{{RecordIdentifier}} and pass this through to an updated form of the hive-hcatalog-streaming API which will then have the required data to perform an update or insert in a transactional manner. 
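
For illustration only, the classification step described above (group by key, keep the latest version by sequence, then decide between insert, update, and delete) might look like the following sketch. Every name here ({{MergeSketch}}, {{Fact}}, {{Mutation}}, the tombstone flag, and the use of a plain {{String}} in place of a real {{RecordIdentifier}}) is a hypothetical stand-in and is not taken from the patches or from Hive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: derive mutations from a set of changed facts. */
final class MergeSketch {

  enum Op { INSERT, UPDATE, DELETE }

  /** A changed fact; 'sequence' orders competing versions, 'deleted' marks a tombstone. */
  static final class Fact {
    final String key;
    final long sequence;
    final String value;
    final boolean deleted;

    Fact(String key, long sequence, String value, boolean deleted) {
      this.key = key;
      this.sequence = sequence;
      this.value = value;
      this.deleted = deleted;
    }
  }

  /** One operation to write; rowId stands in for a retained ROW_ID and is null for inserts. */
  static final class Mutation {
    final Op op;
    final String rowId;
    final Fact fact;

    Mutation(Op op, String rowId, Fact fact) {
      this.op = op;
      this.rowId = rowId;
      this.fact = fact;
    }
  }

  /**
   * @param rowIdsByKey row identifiers of records already in the ground-truth dataset
   * @param changes     changed facts, possibly several versions per key
   */
  static List<Mutation> classify(Map<String, String> rowIdsByKey, List<Fact> changes) {
    // Group by key and keep only the version with the highest sequence.
    Map<String, Fact> latest = new TreeMap<>();
    for (Fact f : changes) {
      latest.merge(f.key, f, (a, b) -> a.sequence >= b.sequence ? a : b);
    }

    List<Mutation> mutations = new ArrayList<>();
    for (Fact f : latest.values()) {
      String rowId = rowIdsByKey.get(f.key);
      if (rowId == null) {
        // Unknown key: a new record, unless it was deleted before it was ever written.
        if (!f.deleted) {
          mutations.add(new Mutation(Op.INSERT, null, f));
        }
      } else if (f.deleted) {
        mutations.add(new Mutation(Op.DELETE, rowId, f));
      } else {
        mutations.add(new Mutation(Op.UPDATE, rowId, f));
      }
    }
    return mutations;
  }
}
```

Only the resulting mutations, rather than whole partitions, would then need to be written to the transactional table.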

h3. Benefits
* Enables the creation of large-scale dataset merge processes  
* Opens up Hive transactional functionality in an accessible manner to processes that operate outside of Hive.

h3. Implementation
We've patched the API to provide visibility to the underlying {{OrcRecordUpdater}} and allow extension of the {{AbstractRecordWriter}} by third parties outside of the package. We've also updated the user-facing interfaces to provide update and delete functionality. I've provided the modifications as three incremental, cumulative patches. Generally speaking, each patch makes the API less backwards compatible but more consistent with respect to offering updates and deletes as well as writes (inserts). Ideally I hope that all three patches have merit, but only the first patch is absolutely necessary to enable the features we need on the API, and it does so in a backwards compatible way. I'll summarise the contents of each patch:

h4. [^HIVE-10165.0.patch] - Required
This patch contains what we consider to be the minimum set of changes required to allow users to create {{RecordWriter}} subclasses that can insert, update, and delete records. These changes also maintain backwards compatibility at the expense of confusing the API a little. Note that the row representation has been changed from {{byte[]}} to {{Object}}. Within our data processing jobs our records are often available in a strongly typed and decoded form such as a POJO or a Tuple object. Therefore it seems to make sense that we are able to pass this through to the {{OrcRecordUpdater}} without having to go through a {{byte[]}} encoding step. This of course still allows users to use {{byte[]}} if they wish.
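
As a rough illustration of the idea of passing decoded rows as {{Object}} rather than {{byte[]}}, a writer with insert, update, and delete operations could be shaped along the following lines. This is a sketch under assumptions: {{MutationWriter}}, {{UserRow}}, and {{LoggingMutationWriter}} are hypothetical names, the method signatures are not those of the patch, and the in-memory log stands in for delegation to an {{OrcRecordUpdater}}.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical writer shape: rows are passed as Object rather than byte[]. */
interface MutationWriter {
  void insert(long transactionId, Object row);
  void update(long transactionId, Object row);
  void delete(long transactionId, Object row);
}

/** Example POJO row; a real row would also carry its record identifier for updates and deletes. */
final class UserRow {
  final long id;
  final String name;

  UserRow(long id, String name) {
    this.id = id;
    this.name = name;
  }
}

/** In-memory stand-in that logs each operation instead of writing to ORC. */
final class LoggingMutationWriter implements MutationWriter {
  final List<String> log = new ArrayList<>();

  @Override
  public void insert(long transactionId, Object row) {
    log.add(describe("INSERT", transactionId, row));
  }

  @Override
  public void update(long transactionId, Object row) {
    log.add(describe("UPDATE", transactionId, row));
  }

  @Override
  public void delete(long transactionId, Object row) {
    log.add(describe("DELETE", transactionId, row));
  }

  private static String describe(String op, long transactionId, Object row) {
    // The row arrives already decoded, so a cast replaces any byte[] parsing step.
    UserRow r = (UserRow) row;
    return op + " txn=" + transactionId + " id=" + r.id + " name=" + r.name;
  }
}
```

A caller that still holds encoded data could pass a {{byte[]}} as the {{Object}} and decode it inside its own writer implementation.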

h4. [^HIVE-10165.1.patch] - Nice to have
This patch builds on the changes made in the *required* patch and aims to make the API cleaner and more consistent while accommodating updates and inserts. It also adds some logic to prevent the user from submitting multiple operation types to a single {{TransactionBatch}} as we found this creates data inconsistencies within the Hive table. This patch breaks backwards compatibility.
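
The kind of guard described here, rejecting a second operation type within one batch, can be sketched as follows. This is an assumption-laden illustration: {{SingleOperationBatch}} and its methods are hypothetical and do not reproduce the logic or names in the patch or in {{TransactionBatch}}.

```java
/** Hypothetical guard: a batch accepts only the operation type of its first write. */
final class SingleOperationBatch {

  enum Op { INSERT, UPDATE, DELETE }

  private Op batchOp; // null until the first write fixes the batch's operation type
  private int count;

  void write(Op op, Object row) {
    if (batchOp == null) {
      batchOp = op;
    } else if (batchOp != op) {
      throw new IllegalStateException(
          "Batch already holds " + batchOp + " operations; cannot add " + op);
    }
    // A real implementation would hand the row to the underlying writer here.
    count++;
  }

  int size() {
    return count;
  }
}
```

Failing fast on the first mismatched write keeps the check cheap and surfaces the problem before any inconsistent data reaches the table.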

h4. [^HIVE-10165.2.patch] - Nomenclature
This final patch simply renames some of the existing types to more accurately convey their increased responsibilities. The API is no longer writing just new records; it is now also responsible for writing operations that are applied to existing records. This patch breaks backwards compatibility.

h3. Example
I've attached a simple example of typical usage of the API. This is not a patch and is intended as an illustration only: [^ReflectiveOperationWriter.java]

h3. Known issues
I have not yet provided any unit tests for the extended functionality. I fully expect that tests are required and will work on these if my patches have merit.

  was:
h3. Overview
I'd like to extend the [hive-hcatalog-streaming|https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest] API so that it also supports the writing of record updates and deletes in addition to the already supported inserts.

h3. Motivation
We have many Hadoop processes outside of Hive that merge changed facts into existing datasets. Traditionally we achieve this by: reading in a ground-truth dataset and a modified dataset, grouping by a key, sorting by a sequence and then applying a function to determine inserted, updated, and deleted rows. However, in our current scheme we must rewrite all partitions that may potentially contain changes. In practice the number of mutated records is very small when compared with the records contained in a partition. This approach results in a number of operational issues:
* Excessive amount of write activity required for small data changes.
* Downstream applications cannot robustly read these datasets while they are being updated.
* Due to the scale of the updates (hundreds of partitions) the scope for contention is high.

I believe we can address this problem by instead writing only the changed records to a Hive transactional table. This should drastically reduce the amount of data that we need to write and also provide a means for managing concurrent access to the data. Our existing merge processes can read and retain each record's {{ROW_ID}}/{{RecordIdentifier}} and pass this through to an updated form of the hive-hcatalog-streaming API which will then have the required data to perform an update or insert in a transactional manner. 

h3. Benefits
* Enables the creation of large-scale dataset merge processes  
* Opens up Hive transactional functionality in an accessible manner to processes that operate outside of Hive.

h3. Implementation
We've patched the API to provide visibility to the underlying {{OrcRecordUpdater}} and allow extension of the {{AbstractRecordWriter}} by third parties outside of the package. We've also updated the user-facing interfaces to provide update and delete functionality. I've provided the modifications as three incremental patches. Generally speaking, each patch makes the API less backwards compatible but more consistent with respect to offering updates and deletes as well as writes (inserts). Ideally I hope that all three patches have merit, but only the first patch is absolutely necessary to enable the features we need on the API, and it does so in a backwards compatible way. I'll summarise the contents of each patch:

h4. [^HIVE-10165.0.patch] - Required
This patch contains what we consider to be the minimum set of changes required to allow users to create {{RecordWriter}} subclasses that can insert, update, and delete records. These changes also maintain backwards compatibility at the expense of confusing the API a little. Note that the row representation has been changed from {{byte[]}} to {{Object}}. Within our data processing jobs our records are often available in a strongly typed and decoded form such as a POJO or a Tuple object. Therefore it seems to make sense that we are able to pass this through to the {{OrcRecordUpdater}} without having to go through a {{byte[]}} encoding step. This of course still allows users to use {{byte[]}} if they wish.

h4. [^HIVE-10165.1.patch] - Nice to have
This patch builds on the changes made in the *required* patch and aims to make the API cleaner and more consistent while accommodating updates and inserts. It also adds some logic to prevent the user from submitting multiple operation types to a single {{TransactionBatch}} as we found this creates data inconsistencies within the Hive table. This patch breaks backwards compatibility.

h4. [^HIVE-10165.2.patch] - Nomenclature
This final patch simply renames some of the existing types to more accurately convey their increased responsibilities. The API is no longer writing just new records; it is now also responsible for writing operations that are applied to existing records. This patch breaks backwards compatibility.

h3. Example
I've attached a simple example of typical usage of the API. This is not a patch and is intended as an illustration only: [^ReflectiveOperationWriter.java]

h3. Known issues
I have not yet provided any unit tests for the extended functionality. I fully expect that tests are required and will work on these if my patches have merit.


> Improve hive-hcatalog-streaming extensibility and support updates and deletes.
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-10165
>                 URL: https://issues.apache.org/jira/browse/HIVE-10165
>             Project: Hive
>          Issue Type: Improvement
>          Components: HCatalog
>            Reporter: Elliot West
>            Assignee: Alan Gates
>              Labels: streaming_api
>             Fix For: 1.2.0
>
>         Attachments: HIVE-10165.0.patch, HIVE-10165.1.patch, HIVE-10165.2.patch, ReflectiveOperationWriter.java


