Posted to user@hive.apache.org by Abhishek Gupta <ab...@gmail.com> on 2019/03/06 16:28:10 UTC

Read Hive ACID tables in Spark or Pig

Hi,

Do Hive ACID tables in Hive version 1.2 have the capability of being
read into Apache Pig using HCatLoader, or into Spark using SQLContext?
For Spark, it seems it is only possible to read ACID tables if the table is
fully compacted, i.e. no delta folders exist in any partition. Details are
in the following JIRA:

https://issues.apache.org/jira/browse/SPARK-15348

However, I wanted to know whether reading Hive ACID tables is supported at
all in Apache Pig.

Re: Read Hive ACID tables in Spark or Pig

Posted by Alan Gates <al...@gmail.com>.
If you want to read those tables directly in something other than Hive,
then yes, you need to get the valid writeid list from the metastore for
each table you're reading.  If you want to avoid merging data in, take a
look at Hive's streaming ingest, which allows you to ingest data into Hive
without merges, though it only supports insert, not update.
https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest
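The external-reader flow described above can be sketched as pure logic. This is a simplified model under stated assumptions: `ValidWriteIds` is a hypothetical stand-in for the valid-writeid list a client would fetch from the metastore, not Hive's actual API, and only the directory-filtering step is shown.

```python
# Simplified model of how an external reader (Pig, Spark, ...) could decide
# which ACID delta directories are safe to read, given a valid-writeid list.
# ValidWriteIds is a hypothetical stand-in, NOT Hive's actual API.
import re
from dataclasses import dataclass, field

@dataclass
class ValidWriteIds:
    high_watermark: int                # highest writeid visible to this reader
    exceptions: frozenset = field(default_factory=frozenset)  # open/aborted ids

    def is_valid(self, writeid: int) -> bool:
        return writeid <= self.high_watermark and writeid not in self.exceptions

# Hive names delta directories delta_<min_writeid>_<max_writeid>[_<stmt>]
DELTA_RE = re.compile(r"delta_(\d+)_(\d+)")

def readable_deltas(dir_names, valid):
    """Keep only delta dirs whose whole writeid range is valid.

    A fuller reader would treat delete_delta_* directories the same way and
    then anti-join deletes against inserts; that part is omitted here.
    """
    out = []
    for name in dir_names:
        m = DELTA_RE.match(name)
        if not m:
            continue
        lo, hi = int(m.group(1)), int(m.group(2))
        if all(valid.is_valid(w) for w in range(lo, hi + 1)):
            out.append(name)
    return out

valid = ValidWriteIds(high_watermark=7, exceptions=frozenset({6}))
print(readable_deltas(
    ["delta_0000005_0000005", "delta_0000006_0000006", "delta_0000008_0000008"],
    valid))
# -> ['delta_0000005_0000005']  (6 was aborted, 8 is above the watermark)
```

This is the piece of work Alan notes "no one has done": Hive itself performs exactly this kind of filtering internally before reading delta files.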

Alan.

Re: Read Hive ACID tables in Spark or Pig

Posted by David Morin <mo...@gmail.com>.
Hi,

I've just implemented a pipeline to synchronize data between MySQL and Hive
(transactional + bucketed) on an HDP cluster.
I've used ORC files, but without ACID properties.
We then created external tables on the HDFS directories that contain these
delta ORC files, and MERGE INTO queries are executed periodically to merge
the data into the Hive target table.
It works pretty well, but we want to avoid the use of these MERGE queries.
It's not really clear to me at the moment, but thanks for your links; I'm
going to delve into that point.
To summarize: if I want to avoid these queries, I have to get the valid
transaction list for each table from the Hive Metastore and then read all
the related files.
Is that correct?

Thanks,
David
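For context, the periodic merge step described above looks roughly like this in HiveQL. All table and column names here are invented for illustration (including the `op` change-type flag); the target must be a transactional table, and MERGE requires Hive 2.2+:

```sql
-- Hypothetical names; sketches a CDC-style merge of a staging delta table
-- (with an 'op' column marking deletes) into an ACID target.
MERGE INTO target_db.customers AS t
USING staging_db.customers_delta AS s
ON t.id = s.id
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET name = s.name, updated_at = s.updated_at
WHEN NOT MATCHED AND s.op != 'D' THEN INSERT VALUES (s.id, s.name, s.updated_at);
```

Avoiding this statement means taking over its job: resolving inserts, updates, and deletes against the base data at read time, which is what the valid-writeid discussion in this thread is about.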



Re: Read Hive ACID tables in Spark or Pig

Posted by Nicolas Paris <ni...@riseup.net>.
Thanks Alan for the clarifications.

Hive has made such improvements that it has lost its old friends in the
process. I hope one day all the friends speak together again: Pig, Spark,
and Presto reading and writing ACID side by side.


-- 
nicolas

Re: Read Hive ACID tables in Spark or Pig

Posted by Alan Gates <al...@gmail.com>.
There's only been one significant change in ACID that requires different
implementations.  In ACID v1 delta files contained inserts, updates, and
deletes.  In ACID v2 delta files are split so that inserts are placed in
one file, deletes in another, and updates are an insert plus a delete.
This change was put into Hive 3, so you have to upgrade your ACID tables
when upgrading from Hive 2 to 3.
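Schematically (all paths and ids below are illustrative, not taken from a real table), the v2 layout splits the two kinds of events that v1 mixed into a single delta file:

```
warehouse/t/base_0000010/bucket_00000                        <- compacted base
warehouse/t/delta_0000011_0000011_0000/bucket_00000          <- v2: inserts only
warehouse/t/delete_delta_0000012_0000012_0000/bucket_00000   <- v2: deletes only
```

Under this scheme an update shows up as one row in a delta directory plus a matching row in a delete_delta directory, which is why a v1-era reader cannot interpret v2 tables without changes.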

You can see info on ACID v1 at
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions

You can get a start on understanding ACID v2 with
https://issues.apache.org/jira/browse/HIVE-14035, which has the design
documents.  I don't guarantee the implementation completely matches the
design, but you can at least get an idea of the intent and follow the JIRA
stream from there to see what was implemented.

Alan.


Re: Read Hive ACID tables in Spark or Pig

Posted by Nicolas Paris <ni...@riseup.net>.
Hi,

> The issue is that outside readers don't understand which records in
> the delta files are valid and which are not. Theoretically all this
> is possible, as outside clients could get the valid transaction list
> from the metastore and then read the files, but no one has done this
> work.

I guess each Hive version (1, 2, 3) differs in how it manages delta files,
doesn't it? This means Pig or Spark would need to implement three different
ways of dealing with Hive.

Is there any documentation that would help a developer implement those
specific connectors?

Thanks



-- 
nicolas

Re: Read Hive ACID tables in Spark or Pig

Posted by Abhishek Gupta <ab...@gmail.com>.
Thank you, Alan, for the quick response. We are going ahead with the JDBC
route.
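The JDBC/Thrift route can be sketched with PyHive (one assumption among several possible HiveServer2 clients; host, port, database, and table names are placeholders). Going through HiveServer2 sidesteps the whole delta-file problem, since Hive applies the valid-writeid filtering server-side. The connection lines are commented out so the snippet stays self-contained; the row-shaping helper below them is plain DB-API logic:

```python
# Sketch: read a Hive ACID table through HiveServer2 instead of reading ORC
# files directly. Requires `pip install "pyhive[hive]"` and a live server;
# host and table names are placeholders.
#
# from pyhive import hive
# cursor = hive.connect(host="hs2.example.com", port=10000).cursor()
# cursor.execute("SELECT id, name FROM acid_db.events LIMIT 10")
# rows = cursor.fetchall()
# description = cursor.description

def rows_to_dicts(description, rows):
    """Turn DB-API (description, rows) into dicts keyed by column name.

    Hive prefixes result columns with the table name ("events.id"), so we
    keep only the part after the last dot.
    """
    names = [col[0].split(".")[-1] for col in description]
    return [dict(zip(names, row)) for row in rows]

# With a live connection, cursor.description and fetchall() would feed
# straight in; a tiny stand-in shows the shape of the result:
demo_description = [("events.id", "INT"), ("events.name", "STRING")]
demo_rows = [(1, "a"), (2, "b")]
print(rows_to_dicts(demo_description, demo_rows))
# -> [{'id': 1, 'name': 'a'}, {'id': 2, 'name': 'b'}]
```

The trade-off versus direct file access is throughput: everything is funneled through HiveServer2 rather than read in parallel by the external engine.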


Re: Read Hive ACID tables in Spark or Pig

Posted by Alan Gates <al...@gmail.com>.
Pig is in the same place as Spark, in that the tables need to be compacted
first.  The issue is that outside readers don't understand which records in
the delta files are valid and which are not.

Theoretically all this is possible, as outside clients could get the valid
transaction list from the metastore and then read the files, but no one has
done this work.

Alan.
