You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Elliot West <te...@gmail.com> on 2015/04/01 11:20:37 UTC
Re: Adding update/delete to the hive-hcatalog-streaming API

Hi Alan,

Regarding the streaming changes, I've raised an issue and submitted patches
here: https://issues.apache.org/jira/browse/HIVE-10165

Thanks - Elliot.

On 26 March 2015 at 23:20, Alan Gates <al...@gmail.com> wrote:

>
>
>   Elliot West <te...@gmail.com>
>  March 26, 2015 at 15:58
> Hi Alan,
>
> Yes, this is precisely our situation. The issues I'm having with the
> current API are that I cannot intercept the creation of the
> OrcRecordUpdater to set the recordIdColumn in the AcidOutputFormat.Options
> instance. Additionally, I cannot extend the TransactionBatch interface to
> expose further operations and instead have to encode the operation type in
> the row parameter and decode it in the RecordWriter implementation -
> possible but not very elegant.
>
> Yes, you'd have to make changes in the API itself to make this possible,
> which is fine.  I'm happy to review patches for this.
>
>
> Regarding the merge functionality, is this a new suggested feature of
> Hive? Is there a JIRA I could track?
>
> Not yet, it's just discussions I've had with users like you who'd be
> interested in merge, though from the SQL side rather than streaming.  I am
> not aware of anyone working on it at the moment.
>
> Alan.
>
>
> Thanks - Elliot.
>
>
> On Thursday, March 26, 2015, Alan Gates <al...@gmail.com> wrote:
>    Alan Gates <al...@gmail.com>
>  March 26, 2015 at 15:30
>  Are you saying that when the records arrive you don't know updates from
> inserts and you're already doing processing to determine that?  If so, this
> is exactly the case we'd like to hit with the merge functionality.
>
> If you're already scanning the existing ORC file and obtaining the unique
> identifier (which is the triple (rowid, txnid, bucketid)) and determining
> which records have changed then you could certainly change the streaming
> interface to enable passing the records through  in bulk (along with
> operation markers to indicate insert/update/delete) and write those in a
> delta file in one pass.
>
> Alan.
>
>   Elliot West <te...@gmail.com>
>  March 26, 2015 at 15:10
> Hi, thanks for your quick reply.
>
> I see your point, but in my case would I not have the required
> RecordIdentifiers available as I'm already reading the entire partition to
> determine which records have changed?  Admittedly Hive will not reveal
> the ROW__IDs to me but I assume (incorrectly perhaps) that these are in the
> underlying ORCFile and therefore can be read by my Cascading process. In
> this scenario (where I have the relevant recIds to hand) why would it then
> be necessary to perform additional scans when issuing the deletes/updates?
>
> Apologies if I'm misunderstanding something.
>
> Thanks - Elliot.
>
> On Thursday, March 26, 2015, Alan Gates <al...@gmail.com> wrote:
>   Alan Gates <al...@gmail.com>
>  March 26, 2015 at 14:48
>  The missing piece for adding update and delete to the streaming API is a
> primary key.  Updates and deletes in SQL work by scanning the table or
> partition where the record resides.  This is assumed to be ok since we are
> not supporting transactional workloads and thus update/deletes are assumed
> to be infrequent.  But a need to scan for each update or delete will not
> perform adequately in the streaming case.
>
> I've had a few discussions with others recently who are thinking of adding
> merge like functionality, where you would upload all changes to a temp
> table and then in one scan/transaction apply those changes.  This is a
> common way to handle these situations for data warehouses, and is much
> easier than adding a primary key concept to Hive.
>
> Alan.
>
>   Elliot West <te...@gmail.com>
>  March 26, 2015 at 14:08
> Hi,
>
> I'd like to ascertain if it might be possible to add 'update' and 'delete'
> operations to the hive-hcatalog-streaming API. I've been looking at the API
> with interest for the last week as it appears to have the potential to help
> with some general data processing patterns that are prevalent where I work.
> Ultimately, we continuously load large amounts of data into Hadoop which is
> partitioned by some time interval - usually hour, day, or month depending
> on the data size. However, the records that reside in this data can change.
> We often receive some new information that mutates part of an existing
> record already stored in a partition in HDFS. Typically the amount of
> mutations is very small compared to the number of records in each
> partitions.
>
> To handle this currently we re-read and re-write all partitions that could
> potentially be affected by new data. In practice a single hour's worth of
> new data can require the reading and writing of 1 month's worth of
> partitions. By storing the data in a transactional Hive table I believe
> that we can instead issue updates and deletes for only the affected rows.
> Although we do use Hive for analytics on this data, much of the processing
> that generates and consumes the data is performed using Cascading.
> Therefore I'd like to be able to read and write the data via an API which
> we'd aim to integrate into a Cascading Tap of some description. Our
> Cascading processes could determine the new, updated, and deleted records
> and then use the API to stream these changes to the transactional Hive
> table.
>
> We have most of this working in a proof of concept, but as
> hive-hcatalog-streaming does not expose the delete/update methods of the
> OrcRecordUpdater we've had to hack together something unpleasant based on
> the original API.
>
> As a first step I'd like to check if there is any appetite for adding such
> functionality to the API or if this goes against the original motivations
> of the project? If this suggestion sounds reasonable then I'd be keen to
> help move this forward.
>
> Thanks - Elliot.
>
>
>