Posted to dev@hive.apache.org by Gunther Hagleitner <gu...@apache.org> on 2014/02/16 00:52:02 UTC

Parquet support (HIVE-5783)

I read through the ticket, patch and documentation and would like to
suggest some changes.

As far as I can tell this basically adds parquet SerDes to hive, but the
file format remains external to hive. There is no way for hive devs to
make changes, fix bugs, add or change datatypes, or add features to parquet
itself.

So:

- I suggest we document it as one of the built-in SerDes and not as a
native format like here:
https://cwiki.apache.org/confluence/display/Hive/Parquet (and here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual)
- I vote for the jira to say "Add parquet SerDes to Hive" and not "Native
support"
- I think we should revert the change to the grammar to allow "STORED AS
PARQUET" until we have a mechanism to do that for all SerDes, i.e.: someone
picks up: HIVE-5976. (I also don't think this actually works properly
unless we bundle parquet in hive-exec, which I don't think we want.)
- We should revert the deprecated classes (At least I don't understand how
a first drop needs to add deprecated stuff)

In general though, I'm also confused about why adding this SerDe to the hive
code base is beneficial. It seems to me that this just makes upgrading
Parquet, fixing bugs, etc. more difficult by tying a SerDe release to a Hive
release. To me that outweighs the benefit of a slightly more involved setup
of Hive + serde in the cluster.

Thanks,
Gunther.

Re: Parquet support (HIVE-5783)

Posted by Brock Noland <br...@cloudera.com>.
Hi,

Storage handlers muddle the waters a bit IMO. That interface was
written for storage that is not file-based, e.g. HBase, whereas Avro,
Parquet, Sequence File, etc. are all file-based.

I think we have to be practical about confusion. There are so many
Hadoop newbies out there, almost all of them new to Apache as well,
that there is going to be some confusion. For example, one person who
had been using Hadoop and Hive for a few months said to me "Hive moved
*from* Apache to Hortonworks". At the end of the day, regardless of
what we do, some level of confusion is going to persist amongst those
new to the ecosystem.

With that said, I do think that an overview of "Hive Storage" would be
a great addition to our documentation.

Brock

On Fri, Feb 21, 2014 at 1:27 AM, Lefty Leverenz <le...@gmail.com> wrote:
> This is in the Terminology
> section<https://cwiki.apache.org/confluence/display/Hive/StorageHandlers#StorageHandlers-Terminology>
> of
> the Storage Handlers doc:
>
> Storage handlers introduce a distinction between *native* and
> *non-native* tables.
>> A native table is one which Hive knows how to manage and access without a
>> storage handler; a non-native table is one which requires a storage handler.
>
>
> It goes on to say that non-native tables are created with a STORED BY
> clause (as opposed to a STORED AS clause).
>
> Does that clarify or muddy the waters?
>
>
> -- Lefty
>
>
> On Thu, Feb 20, 2014 at 7:37 PM, Lefty Leverenz <le...@gmail.com> wrote:
>
>> Some of these issues can be addressed in the documentation.  The "File
>> Formats" section of the Language Manual needs an overview, and that might
>> be a good place to explain the differences between Hive-owned formats and
>> external formats.  Or the SerDe doc could be beefed up:  Built-In SerDes<https://cwiki.apache.org/confluence/display/Hive/SerDe#SerDe-Built-inSerDes>
>> .
>>
>> In the meantime, I've added a link to the Avro doc in the "File Formats"
>> list and mentioned Parquet in DDL's Row Format, Storage Format, and SerDe<https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RowFormat,StorageFormat,andSerDe> section:
>>
>> Use STORED AS PARQUET (without ROW FORMAT SERDE) for the Parquet<https://cwiki.apache.org/confluence/display/Hive/Parquet> columnar
>>> storage format in Hive 0.13.0 and later<https://cwiki.apache.org/confluence/display/Hive/Parquet#Parquet-Hive0.13andlater>;
>>> or use ROW FORMAT SERDE ... STORED AS INPUTFORMAT ... OUTPUTFORMAT ... in Hive
>>> 0.10, 0.11, or 0.12<https://cwiki.apache.org/confluence/display/Hive/Parquet#Parquet-Hive0.10-0.12>
>>> .
>>
>>
>> Does that work?
>>
>> -- Lefty
>>
>>
>> On Tue, Feb 18, 2014 at 1:31 PM, Brock Noland <br...@cloudera.com> wrote:
>>
>>> Hi Alan,
>>>
>>> Response is inline, below:
>>>
>>> On Tue, Feb 18, 2014 at 11:49 AM, Alan Gates <ga...@hortonworks.com>
>>> wrote:
>>> > Gunther, is it the case that there is anything extra that needs to be
>>> done to ship Parquet code with Hive right now?  If I read the patch
>>> correctly the Parquet jars were added to the pom and thus will be shipped
>>> as part of Hive.  As long as it works out of the box when a user says
>>> "create table ... stored as parquet" why do we care whether the parquet jar
>>> is owned by Hive or another project?
>>> >
>>> > The concern about feature mismatch in Parquet versus Hive is valid, but
>>> I'm not sure what to do about it other than assure that there are good
>>> error messages.  Users will often want to use non-Hive based storage
>>> formats (Parquet, Avro, etc.).  This means we need a good way to detect at
>>> SQL compile time that the underlying storage doesn't support the indicated
>>> data type and throw a good error.
>>>
>>> Agreed, the error messages should absolutely be good. I will ensure
>>> this is the case via https://issues.apache.org/jira/browse/HIVE-6457
>>>
>>> >
>>> > Also, it's important to be clear going forward about what Hive as a
>>> project is signing up for.  If tomorrow someone decides to add a new
>>> datatype or feature we need to be clear that we expect the contributor to
>>> make this work for Hive owned formats (text, RC, sequence, ORC) but not
>>> necessarily for external formats
>>>
>>> This makes sense to me.
>>>
>>> I'd just like to add that I have a patch available to improve the
>>> hive-exec uber jar and general query speed:
>>> https://issues.apache.org/jira/browse/HIVE-860. Additionally I have a
>>> patch available to finish the generic STORED AS functionality:
>>> https://issues.apache.org/jira/browse/HIVE-5976
>>>
>>> Brock
>>>
>>
>>



-- 
Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org

Re: Parquet support (HIVE-5783)

Posted by Lefty Leverenz <le...@gmail.com>.
This is in the Terminology
section<https://cwiki.apache.org/confluence/display/Hive/StorageHandlers#StorageHandlers-Terminology>
of
the Storage Handlers doc:

Storage handlers introduce a distinction between *native* and
*non-native* tables.
> A native table is one which Hive knows how to manage and access without a
> storage handler; a non-native table is one which requires a storage handler.


It goes on to say that non-native tables are created with a STORED BY
clause (as opposed to a STORED AS clause).
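To make the distinction concrete, here is a sketch of the two DDL forms (the
HBase handler is the canonical non-native example; the table and column names
below are made up for illustration):

```sql
-- Native table: Hive manages the storage directly, no storage handler needed.
CREATE TABLE pages_native (url STRING, views INT)
STORED AS ORC;

-- Non-native table: a storage handler mediates access, declared via STORED BY.
CREATE TABLE pages_hbase (key STRING, views INT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,stats:views")
TBLPROPERTIES ("hbase.table.name" = "pages");
```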

Does that clarify or muddy the waters?


-- Lefty


On Thu, Feb 20, 2014 at 7:37 PM, Lefty Leverenz <le...@gmail.com> wrote:

> Some of these issues can be addressed in the documentation.  The "File
> Formats" section of the Language Manual needs an overview, and that might
> be a good place to explain the differences between Hive-owned formats and
> external formats.  Or the SerDe doc could be beefed up:  Built-In SerDes<https://cwiki.apache.org/confluence/display/Hive/SerDe#SerDe-Built-inSerDes>
> .
>
> In the meantime, I've added a link to the Avro doc in the "File Formats"
> list and mentioned Parquet in DDL's Row Format, Storage Format, and SerDe<https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RowFormat,StorageFormat,andSerDe> section:
>
> Use STORED AS PARQUET (without ROW FORMAT SERDE) for the Parquet<https://cwiki.apache.org/confluence/display/Hive/Parquet> columnar
>> storage format in Hive 0.13.0 and later<https://cwiki.apache.org/confluence/display/Hive/Parquet#Parquet-Hive0.13andlater>;
>> or use ROW FORMAT SERDE ... STORED AS INPUTFORMAT ... OUTPUTFORMAT ... in Hive
>> 0.10, 0.11, or 0.12<https://cwiki.apache.org/confluence/display/Hive/Parquet#Parquet-Hive0.10-0.12>
>> .
>
>
> Does that work?
>
> -- Lefty
>
>
> On Tue, Feb 18, 2014 at 1:31 PM, Brock Noland <br...@cloudera.com> wrote:
>
>> Hi Alan,
>>
>> Response is inline, below:
>>
>> On Tue, Feb 18, 2014 at 11:49 AM, Alan Gates <ga...@hortonworks.com>
>> wrote:
>> > Gunther, is it the case that there is anything extra that needs to be
>> done to ship Parquet code with Hive right now?  If I read the patch
>> correctly the Parquet jars were added to the pom and thus will be shipped
>> as part of Hive.  As long as it works out of the box when a user says
>> "create table ... stored as parquet" why do we care whether the parquet jar
>> is owned by Hive or another project?
>> >
>> > The concern about feature mismatch in Parquet versus Hive is valid, but
>> I'm not sure what to do about it other than assure that there are good
>> error messages.  Users will often want to use non-Hive based storage
>> formats (Parquet, Avro, etc.).  This means we need a good way to detect at
>> SQL compile time that the underlying storage doesn't support the indicated
>> data type and throw a good error.
>>
>> Agreed, the error messages should absolutely be good. I will ensure
>> this is the case via https://issues.apache.org/jira/browse/HIVE-6457
>>
>> >
>> > Also, it's important to be clear going forward about what Hive as a
>> project is signing up for.  If tomorrow someone decides to add a new
>> datatype or feature we need to be clear that we expect the contributor to
>> make this work for Hive owned formats (text, RC, sequence, ORC) but not
>> necessarily for external formats
>>
>> This makes sense to me.
>>
>> I'd just like to add that I have a patch available to improve the
>> hive-exec uber jar and general query speed:
>> https://issues.apache.org/jira/browse/HIVE-860. Additionally I have a
>> patch available to finish the generic STORED AS functionality:
>> https://issues.apache.org/jira/browse/HIVE-5976
>>
>> Brock
>>
>
>

Re: Parquet support (HIVE-5783)

Posted by Lefty Leverenz <le...@gmail.com>.
Some of these issues can be addressed in the documentation.  The "File
Formats" section of the Language Manual needs an overview, and that might
be a good place to explain the differences between Hive-owned formats and
external formats.  Or the SerDe doc could be beefed up:  Built-In
SerDes<https://cwiki.apache.org/confluence/display/Hive/SerDe#SerDe-Built-inSerDes>
.

In the meantime, I've added a link to the Avro doc in the "File Formats"
list and mentioned Parquet in DDL's Row Format, Storage Format, and
SerDe<https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RowFormat,StorageFormat,andSerDe> section:

Use STORED AS PARQUET (without ROW FORMAT SERDE) for the
Parquet<https://cwiki.apache.org/confluence/display/Hive/Parquet>
columnar
> storage format in Hive 0.13.0 and later<https://cwiki.apache.org/confluence/display/Hive/Parquet#Parquet-Hive0.13andlater>;
> or use ROW FORMAT SERDE ... STORED AS INPUTFORMAT ... OUTPUTFORMAT ... in Hive
> 0.10, 0.11, or 0.12<https://cwiki.apache.org/confluence/display/Hive/Parquet#Parquet-Hive0.10-0.12>
> .
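Spelled out, the two forms would look roughly like this (the 0.10-0.12 class
names are from memory, so please verify them against the Parquet wiki page
before relying on them):

```sql
-- Hive 0.13.0 and later: the shorthand added by HIVE-5783.
CREATE TABLE logs (ts BIGINT, line STRING)
STORED AS PARQUET;

-- Hive 0.10-0.12: name the SerDe and input/output formats from the external
-- parquet-hive jar explicitly (class names from memory; verify before use).
CREATE TABLE logs (ts BIGINT, line STRING)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
  INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
  OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat';
```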


Does that work?

-- Lefty


On Tue, Feb 18, 2014 at 1:31 PM, Brock Noland <br...@cloudera.com> wrote:

> Hi Alan,
>
> Response is inline, below:
>
> On Tue, Feb 18, 2014 at 11:49 AM, Alan Gates <ga...@hortonworks.com>
> wrote:
> > Gunther, is it the case that there is anything extra that needs to be
> done to ship Parquet code with Hive right now?  If I read the patch
> correctly the Parquet jars were added to the pom and thus will be shipped
> as part of Hive.  As long as it works out of the box when a user says
> "create table ... stored as parquet" why do we care whether the parquet jar
> is owned by Hive or another project?
> >
> > The concern about feature mismatch in Parquet versus Hive is valid, but
> I'm not sure what to do about it other than assure that there are good
> error messages.  Users will often want to use non-Hive based storage
> formats (Parquet, Avro, etc.).  This means we need a good way to detect at
> SQL compile time that the underlying storage doesn't support the indicated
> data type and throw a good error.
>
> Agreed, the error messages should absolutely be good. I will ensure
> this is the case via https://issues.apache.org/jira/browse/HIVE-6457
>
> >
> > Also, it's important to be clear going forward about what Hive as a
> project is signing up for.  If tomorrow someone decides to add a new
> datatype or feature we need to be clear that we expect the contributor to
> make this work for Hive owned formats (text, RC, sequence, ORC) but not
> necessarily for external formats
>
> This makes sense to me.
>
> I'd just like to add that I have a patch available to improve the
> hive-exec uber jar and general query speed:
> https://issues.apache.org/jira/browse/HIVE-860. Additionally I have a
> patch available to finish the generic STORED AS functionality:
> https://issues.apache.org/jira/browse/HIVE-5976
>
> Brock
>

Re: Parquet support (HIVE-5783)

Posted by Brock Noland <br...@cloudera.com>.
Hi Alan,

Response is inline, below:

On Tue, Feb 18, 2014 at 11:49 AM, Alan Gates <ga...@hortonworks.com> wrote:
> Gunther, is it the case that there is anything extra that needs to be done to ship Parquet code with Hive right now?  If I read the patch correctly the Parquet jars were added to the pom and thus will be shipped as part of Hive.  As long as it works out of the box when a user says "create table ... stored as parquet" why do we care whether the parquet jar is owned by Hive or another project?
>
> The concern about feature mismatch in Parquet versus Hive is valid, but I'm not sure what to do about it other than assure that there are good error messages.  Users will often want to use non-Hive based storage formats (Parquet, Avro, etc.).  This means we need a good way to detect at SQL compile time that the underlying storage doesn't support the indicated data type and throw a good error.

Agreed, the error messages should absolutely be good. I will ensure
this is the case via https://issues.apache.org/jira/browse/HIVE-6457

>
> Also, it's important to be clear going forward about what Hive as a project is signing up for.  If tomorrow someone decides to add a new datatype or feature we need to be clear that we expect the contributor to make this work for Hive owned formats (text, RC, sequence, ORC) but not necessarily for external formats

This makes sense to me.

I'd just like to add that I have a patch available to improve the
hive-exec uber jar and general query speed:
https://issues.apache.org/jira/browse/HIVE-860. Additionally I have a
patch available to finish the generic STORED AS functionality:
https://issues.apache.org/jira/browse/HIVE-5976

Brock

Re: Parquet support (HIVE-5783)

Posted by Alan Gates <ga...@hortonworks.com>.
Gunther, is it the case that there is anything extra that needs to be done to ship Parquet code with Hive right now?  If I read the patch correctly the Parquet jars were added to the pom and thus will be shipped as part of Hive.  As long as it works out of the box when a user says “create table … stored as parquet” why do we care whether the parquet jar is owned by Hive or another project?

The concern about feature mismatch in Parquet versus Hive is valid, but I’m not sure what to do about it other than assure that there are good error messages.  Users will often want to use non-Hive based storage formats (Parquet, Avro, etc.).  This means we need a good way to detect at SQL compile time that the underlying storage doesn’t support the indicated data type and throw a good error.

Also, it’s important to be clear going forward about what Hive as a project is signing up for.  If tomorrow someone decides to add a new datatype or feature we need to be clear that we expect the contributor to make this work for Hive owned formats (text, RC, sequence, ORC) but not necessarily for external formats (Parquet, Avro).  

Alan.

On Feb 17, 2014, at 7:03 PM, Gunther Hagleitner <gh...@hortonworks.com> wrote:

> Brock,
> 
> I'm not trying to "pick winners", I'm merely trying to say that the
> documentation/code should match what's actually there, so folks can make
> informed decisions.
> 
> The issue I have with the word "native" is that people have expectations
> when they hear it and I think these are not met.
> 
> I've had folks ask me why we're switching the default of hive to Parquet.
> This isn't the case obviously, but "native" to most people means just that:
> Hive's primary format. That's why I was asking for a title of "Add Parquet
> SerDe" for the jira. That's the exact same thing that was done for Avro
> under the exact same circumstances:
> https://issues.apache.org/jira/browse/HIVE-895.
> 
> Native also has other associations a) it supports the full data
> model/feature set and b) it's part of hive. Neither is the case and I don't
> think that's just a superficial difference. Support and usability will be
> different. That's why I think the documentation should delineate between
> RC/ORC/etc on one side and Parquet/Avro/etc on the other.
> 
> As mentioned in the jira "STORED AS" was reserved for what's actually part
> of hive (or hadoop core in the case of sequence file as you point out). I
> think there are reasons for that: a) being part of the grammar implies
> native as above b) you need to ship the code bundled in hive-exec for this
> to work (which is *broken* right now) and c) like you said we shouldn't
> pick winners by letting some of them become a keyword and others not. For
> these reasons I think Parquet should use the old syntax at this point. If
> you have a pluggable/configurable way great, but right now we don't have
> that.
> 
> Finally, yes, I am late to this party and I apologize for that. I'm happy
> to make the suggested changes myself, if that's the concern.
> 
> Thanks,
> Gunther.
> 
> 
> 
> On Sun, Feb 16, 2014 at 7:40 PM, Brock Noland <br...@cloudera.com> wrote:
> 
>> Hi Gunther,
>> 
>> Please find my response inline.
>> 
>> On Sat, Feb 15, 2014 at 5:52 PM, Gunther Hagleitner <gu...@apache.org>
>> wrote:
>>> I read through the ticket, patch and documentation
>> 
>> Thank you very much for reading through these items!
>> 
>>> and would like to
>>> suggest some changes.
>> 
>> There was ample time to suggest these changes prior to commit. The
>> JIRA was created three months ago, and the title you object to and the
>> patch was up there over two months ago.
>> 
>>> As far as I can tell this basically adds parquet SerDes to hive, but the
>>> file format remains external to hive. There is no way for hive devs to
>>> make changes, fix bugs, add or change datatypes, or add features to parquet
>>> itself.
>> 
>> As stated in many locations including the JIRA discussed here, we
>> shouldn't be picking winner/loser file formats. We use many external
>> libraries, none of which all Hive developers have the ability to
>> modify. For example, most Hive developers do not have the ability to
>> modify Sequence File. Tez is also an external library which few Hive
>> developers can change.
>> 
>>> So:
>>> 
>>> - I suggest we document it as one of the built-in SerDes and not as a
>>> native format like here:
>>> https://cwiki.apache.org/confluence/display/Hive/Parquet (and here:
>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual)
>>> - I vote for the jira to say "Add parquet SerDes to Hive" and not "Native
>>> support"
>> 
>> The change provides the ability to create a parquet table with Hive,
>> natively. Therefore I don't see the issue you have with the word
>> native.
>> 
>>> - I think we should revert the change to the grammar to allow "STORED AS
>>> PARQUET" until we have a mechanism to do that for all SerDes, i.e.:
>> someone
>>> picks up: HIVE-5976. (I also don't think this actually works properly
>>> unless we bundle parquet in hive-exec, which I don't think we want.)
>> 
>> Again, you could have provided this feedback many moons ago. I am
>> personally interested in HIVE-5976 but it's orthogonal to this issue.
>> That change just makes it easier and cleaner to add STORED AS
>> keywords. The contributors of the Parquet integration are not required
>> to fix Hive. That is our job.
>> 
>>> - We should revert the deprecated classes (At least I don't understand
>> how
>>> a first drop needs to add deprecated stuff)
>> 
>> The deprecated classes are shells (no actual code) to support existing
>> users of Parquet, of which there are many. I see no justification for
>> impacting existing users when the workaround is trivial and
>> non-impacting to any other user.
>> 
>>> In general though, I'm also confused on why adding this SerDe to the hive
>>> code base is beneficial. Seems to me that that just makes upgrading
>>> Parquet, bug fixing, etc more difficult by tying a SerDe release to a
>> Hive
>>> release. To me that outweighs the benefit of a slightly more involved
>> setup
>>> of Hive + serde in the cluster.
>> 
>> The Hive APIs, which are not clearly defined, have changed often in
>> the past few releases making maintaining a file format extremely
>> difficult. For example, 0.12 and 0.13 break most if not all external
>> code bases.
>> 
>> However, beyond that, the community felt it was beneficial to make
>> Parquet easier to use. If you are not interested in Parquet then
>> ignore it as this change does not impact you. Tez integration is
>> something which does not interest myself and many other Hive
>> developers. Indeed other than a few cursory reviews and a few times
>> where I championed the refactoring you guys were doing in order to
>> support Tez, I have ignored the Tez work.
>> 
>> Sincerely,
>> Brock
>> 
> 
> -- 
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity to 
> which it is addressed and may contain information that is confidential, 
> privileged and exempt from disclosure under applicable law. If the reader 
> of this message is not the intended recipient, you are hereby notified that 
> any printing, copying, dissemination, distribution, disclosure or 
> forwarding of this communication is strictly prohibited. If you have 
> received this communication in error, please contact the sender immediately 
> and delete it from your system. Thank You.



Re: Parquet support (HIVE-5783)

Posted by Gunther Hagleitner <gh...@hortonworks.com>.
Brock,

I'm not trying to "pick winners", I'm merely trying to say that the
documentation/code should match what's actually there, so folks can make
informed decisions.

The issue I have with the word "native" is that people have expectations
when they hear it and I think these are not met.

I've had folks ask me why we're switching the default of hive to Parquet.
This isn't the case obviously, but "native" to most people means just that:
Hive's primary format. That's why I was asking for a title of "Add Parquet
SerDe" for the jira. That's the exact same thing that was done for Avro
under the exact same circumstances:
https://issues.apache.org/jira/browse/HIVE-895.

Native also has other associations: a) it supports the full data
model/feature set and b) it's part of hive. Neither is the case and I don't
think that's just a superficial difference. Support and usability will be
different. That's why I think the documentation should delineate between
RC/ORC/etc on one side and Parquet/Avro/etc on the other.

As mentioned in the jira "STORED AS" was reserved for what's actually part
of hive (or hadoop core in the case of sequence file as you point out). I
think there are reasons for that: a) being part of the grammar implies
native as above b) you need to ship the code bundled in hive-exec for this
to work (which is *broken* right now) and c) like you said we shouldn't
pick winners by letting some of them become a keyword and others not. For
these reasons I think Parquet should use the old syntax at this point. If
you have a pluggable/configurable way great, but right now we don't have
that.

Finally, yes, I am late to this party and I apologize for that. I'm happy
to make the suggested changes myself, if that's the concern.

Thanks,
Gunther.



On Sun, Feb 16, 2014 at 7:40 PM, Brock Noland <br...@cloudera.com> wrote:

> Hi Gunther,
>
> Please find my response inline.
>
> On Sat, Feb 15, 2014 at 5:52 PM, Gunther Hagleitner <gu...@apache.org>
> wrote:
> > I read through the ticket, patch and documentation
>
> Thank you very much for reading through these items!
>
> > and would like to
> > suggest some changes.
>
> There was ample time to suggest these changes prior to commit. The
> JIRA was created three months ago, and the title you object to and the
> patch was up there over two months ago.
>
> > As far as I can tell this basically adds parquet SerDes to hive, but the
> > file format remains external to hive. There is no way for hive devs to
> > make changes, fix bugs, add or change datatypes, or add features to parquet
> > itself.
>
> As stated in many locations including the JIRA discussed here, we
> shouldn't be picking winner/loser file formats. We use many external
> libraries, none of which all Hive developers have the ability to
> modify. For example, most Hive developers do not have the ability to
> modify Sequence File. Tez is also an external library which few Hive
> developers can change.
>
> > So:
> >
> > - I suggest we document it as one of the built-in SerDes and not as a
> > native format like here:
> > https://cwiki.apache.org/confluence/display/Hive/Parquet (and here:
> > https://cwiki.apache.org/confluence/display/Hive/LanguageManual)
> > - I vote for the jira to say "Add parquet SerDes to Hive" and not "Native
> > support"
>
> The change provides the ability to create a parquet table with Hive,
> natively. Therefore I don't see the issue you have with the word
> native.
>
> > - I think we should revert the change to the grammar to allow "STORED AS
> > PARQUET" until we have a mechanism to do that for all SerDes, i.e.:
> someone
> > picks up: HIVE-5976. (I also don't think this actually works properly
> > unless we bundle parquet in hive-exec, which I don't think we want.)
>
> Again, you could have provided this feedback many moons ago. I am
> personally interested in HIVE-5976 but it's orthogonal to this issue.
> That change just makes it easier and cleaner to add STORED AS
> keywords. The contributors of the Parquet integration are not required
> to fix Hive. That is our job.
>
> > - We should revert the deprecated classes (At least I don't understand
> how
> > a first drop needs to add deprecated stuff)
>
> The deprecated classes are shells (no actual code) to support existing
> users of Parquet, of which there are many. I see no justification for
> impacting existing users when the workaround is trivial and
> non-impacting to any other user.
>
> > In general though, I'm also confused on why adding this SerDe to the hive
> > code base is beneficial. Seems to me that that just makes upgrading
> > Parquet, bug fixing, etc more difficult by tying a SerDe release to a
> Hive
> > release. To me that outweighs the benefit of a slightly more involved
> setup
> > of Hive + serde in the cluster.
>
> The Hive APIs, which are not clearly defined, have changed often in
> the past few releases making maintaining a file format extremely
> difficult. For example, 0.12 and 0.13 break most if not all external
> code bases.
>
> However, beyond that, the community felt it was beneficial to make
> Parquet easier to use. If you are not interested in Parquet then
> ignore it as this change does not impact you. Tez integration is
> something which does not interest myself and many other Hive
> developers. Indeed other than a few cursory reviews and a few times
> where I championed the refactoring you guys were doing in order to
> support Tez, I have ignored the Tez work.
>
> Sincerely,
> Brock
>


Re: Parquet support (HIVE-5783)

Posted by Brock Noland <br...@cloudera.com>.
Hi Gunther,

Please find my response inline.

On Sat, Feb 15, 2014 at 5:52 PM, Gunther Hagleitner <gu...@apache.org> wrote:
> I read through the ticket, patch and documentation

Thank you very much for reading through these items!

> and would like to
> suggest some changes.

There was ample time to suggest these changes prior to commit. The
JIRA was created three months ago, and the title you object to and the
patch was up there over two months ago.

> As far as I can tell this basically adds parquet SerDes to hive, but the
> file format remains external to hive. There is no way for hive devs to
> make changes, fix bugs, add or change datatypes, or add features to parquet
> itself.

As stated in many locations including the JIRA discussed here, we
shouldn't be picking winner/loser file formats. We use many external
libraries, none of which all Hive developers have the ability to
modify. For example, most Hive developers do not have the ability to
modify Sequence File. Tez is also an external library which few Hive
developers can change.

> So:
>
> - I suggest we document it as one of the built-in SerDes and not as a
> native format like here:
> https://cwiki.apache.org/confluence/display/Hive/Parquet (and here:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual)
> - I vote for the jira to say "Add parquet SerDes to Hive" and not "Native
> support"

The change provides the ability to create a parquet table with Hive,
natively. Therefore I don't see the issue you have with the word
native.

> - I think we should revert the change to the grammar to allow "STORED AS
> PARQUET" until we have a mechanism to do that for all SerDes, i.e.: someone
> picks up: HIVE-5976. (I also don't think this actually works properly
> unless we bundle parquet in hive-exec, which I don't think we want.)

Again, you could have provided this feedback many moons ago. I am
personally interested in HIVE-5976 but it's orthogonal to this issue.
That change just makes it easier and cleaner to add STORED AS
keywords. The contributors of the Parquet integration are not required
to fix Hive. That is our job.

> - We should revert the deprecated classes (At least I don't understand how
> a first drop needs to add deprecated stuff)

The deprecated classes are shells (no actual code) to support existing
users of Parquet, of which there are many. I see no justification for
impacting existing users when the workaround is trivial and
non-impacting to any other user.

> In general though, I'm also confused on why adding this SerDe to the hive
> code base is beneficial. Seems to me that that just makes upgrading
> Parquet, bug fixing, etc more difficult by tying a SerDe release to a Hive
> release. To me that outweighs the benefit of a slightly more involved setup
> of Hive + serde in the cluster.

The Hive APIs, which are not clearly defined, have changed often in
the past few releases, making maintaining a file format extremely
difficult. For example, 0.12 and 0.13 break most if not all external
code bases.

However, beyond that, the community felt it was beneficial to make
Parquet easier to use. If you are not interested in Parquet then
ignore it as this change does not impact you. Tez integration is
something which does not interest myself and many other Hive
developers. Indeed other than a few cursory reviews and a few times
where I championed the refactoring you guys were doing in order to
support Tez, I have ignored the Tez work.

Sincerely,
Brock