Posted to user@hive.apache.org by David Morin <mo...@gmail.com> on 2019/03/11 21:08:46 UTC

How to update Hive ACID tables in Flink

Hello,

I've just implemented a pipeline based on Apache Flink to synchronize
data between MySQL and Hive (transactional + bucketed) on an HDP
cluster. The Flink jobs run on YARN.
I've used ORC files, but without ACID properties.
We've then created external tables over the HDFS directories that contain
these delta ORC files.
MERGE INTO queries are then executed periodically to merge the data into the
Hive target table.
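For illustration, each periodic merge could be driven over JDBC along
these lines (a minimal sketch; the HiveServer2 URL, table names, columns,
and the op flag are hypothetical placeholders, not the actual pipeline):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class PeriodicMerge {
        public static void main(String[] args) throws Exception {
            // Connect to HiveServer2 (needs hive-jdbc on the classpath).
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hiveserver2:10000/default");
                 Statement stmt = conn.createStatement()) {
                // Merge the delta rows (external table over the ORC files
                // written by Flink) into the transactional target table.
                stmt.execute(
                    "MERGE INTO target t USING delta_ext d ON t.id = d.id "
                  + "WHEN MATCHED AND d.op = 'D' THEN DELETE "
                  + "WHEN MATCHED THEN UPDATE SET val = d.val "
                  + "WHEN NOT MATCHED THEN INSERT VALUES (d.id, d.val)");
            }
        }
    }

The op flag is just one way to carry the MySQL change type through the
delta files.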
It works pretty well, but we want to avoid these MERGE queries.
How can I update the ORC files directly from my Flink job?

Thanks,
David

Re: How to update Hive ACID tables in Flink

Posted by David Morin <mo...@gmail.com>.
Yes, I use HDP 2.6.5, so I still have to deal with Hive 2.
The migration to HDP 3 is planned, but not for a couple of months.
So, thanks for your reply; I'll investigate the ACID support for ORC in
Hive 2 more deeply.

On Tue, Mar 12, 2019 at 10:51 PM Alan Gates <al...@gmail.com> wrote:

> That's the old (Hive 2) version of ACID.  In the newer version (Hive 3)
> there's no update, just insert and delete (update is insert + delete).  If
> you're working against Hive 2 what you have is what you want.  If you're
> working against Hive 3 you'll need the newer stuff.
>
> Alan.
>
> On Tue, Mar 12, 2019 at 12:24 PM David Morin <mo...@gmail.com>
> wrote:
>
>> Thanks Alan.
>> Yes, the problem in fact was that this streaming API does not handle
>> updates or deletes.
>> I've used native ORC files, and the next step I plan is to use the
>> ACID support described here:
>> https://orc.apache.org/docs/acid.html
>> The INSERT/UPDATE/DELETE operations seem to be supported:
>> OPERATION  SERIALIZATION
>> INSERT     0
>> UPDATE     1
>> DELETE     2
>> Do you think this approach is suitable?
>>
>>
>>
>> On Tue, Mar 12, 2019 at 7:30 PM Alan Gates <al...@gmail.com> wrote:
>>
>>> Have you looked at Hive's streaming ingest?
>>> https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest
>>> It is designed for this case, though it only handles insert (not
>>> update), so if you need updates you'd have to do the merge as you are
>>> currently doing.
>>>
>>> Alan.
>>>
>>> On Mon, Mar 11, 2019 at 2:09 PM David Morin <mo...@gmail.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> I've just implemented a pipeline based on Apache Flink to synchronize data between MySQL and Hive (transactional + bucketed) on an HDP cluster. The Flink jobs run on YARN.
>>>> I've used ORC files, but without ACID properties.
>>>> We've then created external tables over the HDFS directories that contain
>>>> these delta ORC files.
>>>> MERGE INTO queries are then executed periodically to merge the data into the
>>>> Hive target table.
>>>> It works pretty well, but we want to avoid these MERGE queries.
>>>> How can I update the ORC files directly from my Flink job?
>>>>
>>>> Thanks,
>>>> David
>>>>
>>>>

Re: How to update Hive ACID tables in Flink

Posted by Alan Gates <al...@gmail.com>.
That's the old (Hive 2) version of ACID.  In the newer version (Hive 3)
there's no update, just insert and delete (update is insert + delete).  If
you're working against Hive 2 what you have is what you want.  If you're
working against Hive 3 you'll need the newer stuff.

Alan.

On Tue, Mar 12, 2019 at 12:24 PM David Morin <mo...@gmail.com>
wrote:

> Thanks Alan.
> Yes, the problem in fact was that this streaming API does not handle
> updates or deletes.
> I've used native ORC files, and the next step I plan is to use the ACID
> support described here: https://orc.apache.org/docs/acid.html
> The INSERT/UPDATE/DELETE operations seem to be supported:
> OPERATION  SERIALIZATION
> INSERT     0
> UPDATE     1
> DELETE     2
> Do you think this approach is suitable?
>
>
>
> On Tue, Mar 12, 2019 at 7:30 PM Alan Gates <al...@gmail.com> wrote:
>
>> Have you looked at Hive's streaming ingest?
>> https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest
>> It is designed for this case, though it only handles insert (not update),
>> so if you need updates you'd have to do the merge as you are currently
>> doing.
>>
>> Alan.
>>
>> On Mon, Mar 11, 2019 at 2:09 PM David Morin <mo...@gmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> I've just implemented a pipeline based on Apache Flink to synchronize data between MySQL and Hive (transactional + bucketed) on an HDP cluster. The Flink jobs run on YARN.
>>> I've used ORC files, but without ACID properties.
>>> We've then created external tables over the HDFS directories that contain
>>> these delta ORC files.
>>> MERGE INTO queries are then executed periodically to merge the data into the
>>> Hive target table.
>>> It works pretty well, but we want to avoid these MERGE queries.
>>> How can I update the ORC files directly from my Flink job?
>>>
>>> Thanks,
>>> David
>>>
>>>

Re: How to update Hive ACID tables in Flink

Posted by David Morin <mo...@gmail.com>.
Thanks Alan.
Yes, the problem in fact was that this streaming API does not handle
updates or deletes.
I've used native ORC files, and the next step I plan is to use the ACID
support described here: https://orc.apache.org/docs/acid.html
The INSERT/UPDATE/DELETE operations seem to be supported:
OPERATION  SERIALIZATION
INSERT     0
UPDATE     1
DELETE     2
Do you think this approach is suitable?
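For reference, a minimal sketch of writing that layout with the ORC core
Java API could look like this (the file path, transaction ids, and row
payload are made-up placeholders; only the column layout comes from the
spec linked above):

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    import org.apache.hadoop.hive.ql.exec.vector.StructColumnVector;
    import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    import org.apache.orc.OrcFile;
    import org.apache.orc.TypeDescription;
    import org.apache.orc.Writer;

    public class AcidOrcSketch {
        public static void main(String[] args) throws Exception {
            // ACID event schema from https://orc.apache.org/docs/acid.html:
            // operation (0=INSERT, 1=UPDATE, 2=DELETE), the row id triple
            // (originalTransaction, bucket, rowId), currentTransaction, and
            // the row payload as a nested struct (id/val is hypothetical).
            TypeDescription schema = TypeDescription.fromString(
                "struct<operation:int,originalTransaction:bigint,bucket:int,"
              + "rowId:bigint,currentTransaction:bigint,"
              + "row:struct<id:bigint,val:string>>");

            Writer writer = OrcFile.createWriter(
                new Path("/warehouse/t/delta_0000001_0000001/bucket_00000"),
                OrcFile.writerOptions(new Configuration()).setSchema(schema));

            VectorizedRowBatch batch = schema.createRowBatch();
            int r = batch.size++;
            ((LongColumnVector) batch.cols[0]).vector[r] = 0L;  // INSERT
            ((LongColumnVector) batch.cols[1]).vector[r] = 1L;  // originalTransaction
            ((LongColumnVector) batch.cols[2]).vector[r] = 0L;  // bucket
            ((LongColumnVector) batch.cols[3]).vector[r] = 0L;  // rowId
            ((LongColumnVector) batch.cols[4]).vector[r] = 1L;  // currentTransaction
            StructColumnVector row = (StructColumnVector) batch.cols[5];
            ((LongColumnVector) row.fields[0]).vector[r] = 42L;
            byte[] val = "hello".getBytes(StandardCharsets.UTF_8);
            ((BytesColumnVector) row.fields[1]).setRef(r, val, 0, val.length);
            writer.addRowBatch(batch);
            writer.close();
        }
    }

Whether Hive 2 would accept hand-written delta files like this is exactly
the kind of thing worth verifying first; the sketch only shows the on-disk
event layout.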



On Tue, Mar 12, 2019 at 7:30 PM Alan Gates <al...@gmail.com> wrote:

> Have you looked at Hive's streaming ingest?
> https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest
> It is designed for this case, though it only handles insert (not update),
> so if you need updates you'd have to do the merge as you are currently
> doing.
>
> Alan.
>
> On Mon, Mar 11, 2019 at 2:09 PM David Morin <mo...@gmail.com>
> wrote:
>
>> Hello,
>>
>> I've just implemented a pipeline based on Apache Flink to synchronize data between MySQL and Hive (transactional + bucketed) on an HDP cluster. The Flink jobs run on YARN.
>> I've used ORC files, but without ACID properties.
>> We've then created external tables over the HDFS directories that contain
>> these delta ORC files.
>> MERGE INTO queries are then executed periodically to merge the data into the
>> Hive target table.
>> It works pretty well, but we want to avoid these MERGE queries.
>> How can I update the ORC files directly from my Flink job?
>>
>> Thanks,
>> David
>>
>>

Re: How to update Hive ACID tables in Flink

Posted by Alan Gates <al...@gmail.com>.
Have you looked at Hive's streaming ingest?
https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest
It is designed for this case, though it only handles insert (not update),
so if you need updates you'd have to do the merge as you are currently
doing.
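
The basic usage pattern of that API (the Hive 2.x
org.apache.hive.hcatalog.streaming classes described on that page) boils
down to something like the following sketch; the metastore URI, table,
partition, and column names here are placeholders:

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;
    import org.apache.hive.hcatalog.streaming.DelimitedInputWriter;
    import org.apache.hive.hcatalog.streaming.HiveEndPoint;
    import org.apache.hive.hcatalog.streaming.StreamingConnection;
    import org.apache.hive.hcatalog.streaming.TransactionBatch;

    public class StreamingIngestSketch {
        public static void main(String[] args) throws Exception {
            // Target must be a bucketed ORC table with
            // 'transactional'='true'.
            HiveEndPoint endPt = new HiveEndPoint(
                "thrift://metastore:9083", "default", "target",
                Arrays.asList("2019"));            // partition values
            StreamingConnection conn = endPt.newConnection(true);

            DelimitedInputWriter writer = new DelimitedInputWriter(
                new String[]{"id", "val"}, ",", endPt);

            // Each batch groups several transactions on the same file set.
            TransactionBatch txnBatch = conn.fetchTransactionBatch(10, writer);
            txnBatch.beginNextTransaction();
            txnBatch.write("1,hello".getBytes(StandardCharsets.UTF_8));
            txnBatch.write("2,world".getBytes(StandardCharsets.UTF_8));
            txnBatch.commit();
            txnBatch.close();
            conn.close();
        }
    }

Note the limitation above: the API only writes insert events, which is why
updates still need the merge.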

Alan.

On Mon, Mar 11, 2019 at 2:09 PM David Morin <mo...@gmail.com>
wrote:

> Hello,
>
> I've just implemented a pipeline based on Apache Flink to synchronize data between MySQL and Hive (transactional + bucketed) on an HDP cluster. The Flink jobs run on YARN.
> I've used ORC files, but without ACID properties.
> We've then created external tables over the HDFS directories that contain
> these delta ORC files.
> MERGE INTO queries are then executed periodically to merge the data into the
> Hive target table.
> It works pretty well, but we want to avoid these MERGE queries.
> How can I update the ORC files directly from my Flink job?
>
> Thanks,
> David
>
>