Posted to dev@hive.apache.org by Owen O'Malley <om...@apache.org> on 2015/03/19 22:44:00 UTC

ORC separate project

All,
   Over the last year, there have been a fair number of projects that want
to integrate with ORC but don't want a dependency on Hive's exec jar.
Additionally, we've been working on a C++ reader (and soon writer) and it
would be great to host them both in the same project. Toward that end, I'd
like to create a separate ORC project at Apache. There will be lots of
technical details to work out, but I wanted to give the Hive community a
chance to discuss it. Do any of the Hive committers want to be included on
the proposal?

Of the current Hive committers, my list looks like:
* Alan
* Gunther
* Prasanth
* Lefty
* Owen
* Sergey
* Gopal
* Kevin

Did I miss anyone?

Thanks!
   Owen

Re: ORC separate project

Posted by Mostafa Mokhtar <mm...@hortonworks.com>.
Hi Owen,

Please add me as well.

Thanks
Mostafa

On 3/19/15, 3:21 PM, "Xuefu Zhang" <xz...@cloudera.com> wrote:

>Hi Owen,
>
>I'd like to get involved.
>
>Thanks,
>Xuefu


Re: ORC separate project

Posted by Xuefu Zhang <xz...@cloudera.com>.
Hi Owen,

I'd like to get involved.

Thanks,
Xuefu


Re: ORC separate project

Posted by Lefty Leverenz <le...@gmail.com>.
Is there a way to change this to a DISCUSS thread?  Or could everything be
copied into a new thread?  Or just start a new thread with a reference to
this one?

-- Lefty

On Tue, Apr 7, 2015 at 2:26 AM, Brock Noland <br...@apache.org> wrote:

> Hey guys,
>
> Good discussion here. One point of order, I feel like this should be a
> [DISCUSS] thread. Some folks filter on that specific text as it's
> quite standard in Apache to use that subject prefix for big issues
> like this one.
>
> Brock

Re: ORC separate project

Posted by Brock Noland <br...@apache.org>.
Hey guys,

Good discussion here. One point of order, I feel like this should be a
[DISCUSS] thread. Some folks filter on that specific text as it's
quite standard in Apache to use that subject prefix for big issues
like this one.

Brock

On Fri, Apr 3, 2015 at 3:59 PM, Thejas Nair <th...@gmail.com> wrote:
> On Fri, Apr 3, 2015 at 1:25 PM, Lefty Leverenz <le...@gmail.com>
> wrote:
>
>> Hive users who wished to use ORC would obviously need to pull in ORC
>>> artifacts in addition to Hive.
>>>
>>
>> What would happen with Hive features that (currently) only work with ORC?
>> Would they be extended to work with other file formats and stay in Hive?
>> What about future features -- would they have to work with multiple file
>> formats from the get-go?
>>
>
>
> The storage-api module proposed above would lead to clearer storage
> interfaces in hive. That will in turn help to implement such features using
> other storage including parquet, hbase etc.
> The result of this work will not automatically make those features work
> with ORC; somebody would need to do that.
>
> Whether future features would work for all formats would depend on whether
> the new feature needs new functionality to be supported by the storage
> layer. If the feature needs new storage functionality, I would expect new
> interfaces to be defined in hive, and then implemented by the storage
> engines that want to support that feature.
>
> This will not negatively impact experience of users with respect to ORC or
> other storage formats. The way we package parquet in hive, we can package
> ORC as well. In fact, users would more easily be able to upgrade their
> version of ORC being used, as releases can happen independently of each other.

Re: ORC separate project

Posted by Thejas Nair <th...@gmail.com>.
On Fri, Apr 3, 2015 at 1:25 PM, Lefty Leverenz <le...@gmail.com>
wrote:

> Hive users who wished to use ORC would obviously need to pull in ORC
>> artifacts in addition to Hive.
>>
>
> What would happen with Hive features that (currently) only work with ORC?
> Would they be extended to work with other file formats and stay in Hive?
> What about future features -- would they have to work with multiple file
> formats from the get-go?
>


The storage-api module proposed above would lead to clearer storage
interfaces in hive. That will in turn help to implement such features using
other storage including parquet, hbase etc.
The result of this work will not automatically make those features work
with ORC; somebody would need to do that.

Whether future features would work for all formats would depend on whether
the new feature needs new functionality to be supported by the storage
layer. If the feature needs new storage functionality, I would expect new
interfaces to be defined in hive, and then implemented by the storage
engines that want to support that feature.

This will not negatively impact experience of users with respect to ORC or
other storage formats. The way we package parquet in hive, we can package
ORC as well. In fact, users would more easily be able to upgrade their
version of ORC being used, as releases can happen independently of each other.
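
The arrangement described above (interfaces defined in Hive, implemented by
whichever storage engines want to support a feature) can be sketched in Java.
This is only an illustration under stated assumptions: StorageFormat,
supportsPushdown, readFiltered, and the two toy formats are hypothetical names
invented for this sketch, not the storage-api that was proposed or anything in
Hive or ORC.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class StorageApiSketch {

    /** Hypothetical contract that a storage-api style module might define. */
    interface StorageFormat {
        String name();

        /** Mandatory: every format can return all of its rows. */
        List<String> readAll();

        /** Optional feature flag: formats override this if they can filter while reading. */
        default boolean supportsPushdown() {
            return false;
        }

        /** Only meaningful when supportsPushdown() returns true. */
        default List<String> readFiltered(Predicate<String> sarg) {
            throw new UnsupportedOperationException(name() + " has no pushdown");
        }
    }

    /** Toy format that implements the optional feature. */
    static final class PushdownFormat implements StorageFormat {
        private final List<String> rows;

        PushdownFormat(List<String> rows) {
            this.rows = rows;
        }

        public String name() {
            return "pushdown-format";
        }

        public List<String> readAll() {
            return new ArrayList<>(rows);
        }

        @Override
        public boolean supportsPushdown() {
            return true;
        }

        @Override
        public List<String> readFiltered(Predicate<String> sarg) {
            List<String> out = new ArrayList<>();
            for (String row : rows) {
                if (sarg.test(row)) {
                    out.add(row);
                }
            }
            return out;
        }
    }

    /** Toy format that implements only the mandatory part. */
    static final class BasicFormat implements StorageFormat {
        private final List<String> rows;

        BasicFormat(List<String> rows) {
            this.rows = rows;
        }

        public String name() {
            return "basic-format";
        }

        public List<String> readAll() {
            return new ArrayList<>(rows);
        }
    }

    /** Engine-side code: depends on the interface only, never on a concrete format. */
    static List<String> scan(StorageFormat format, Predicate<String> sarg) {
        if (format.supportsPushdown()) {
            return format.readFiltered(sarg);
        }
        // Fallback for formats without the feature: read everything, filter afterwards.
        List<String> out = new ArrayList<>();
        for (String row : format.readAll()) {
            if (sarg.test(row)) {
                out.add(row);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("a1", "b2", "a3");
        Predicate<String> startsWithA = r -> r.startsWith("a");
        System.out.println(scan(new PushdownFormat(rows), startsWithA));
        System.out.println(scan(new BasicFormat(rows), startsWithA));
    }
}
```

Both calls in main return the same rows; the only difference is where the
filtering happens. A feature that needs new storage functionality would, in
this picture, add another optional method to the interface, and each format
would opt in on its own schedule.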

Re: ORC separate project

Posted by Lefty Leverenz <le...@gmail.com>.
I guess I'm echoing previous concerns, in less technical language.  (Should
have reread the thread before sending.)

-- Lefty

On Fri, Apr 3, 2015 at 4:25 PM, Lefty Leverenz <le...@gmail.com>
wrote:

> Hive users who wished to use ORC would obviously need to pull in ORC
>> artifacts in addition to Hive.
>>
>
> What would happen with Hive features that (currently) only work with ORC?
> Would they be extended to work with other file formats and stay in Hive?
> What about future features -- would they have to work with multiple file
> formats from the get-go?
>
> -- Lefty

Re: ORC separate project

Posted by Lefty Leverenz <le...@gmail.com>.
>
> Hive users who wished to use ORC would obviously need to pull in ORC
> artifacts in addition to Hive.
>

What would happen with Hive features that (currently) only work with ORC?
Would they be extended to work with other file formats and stay in Hive?
What about future features -- would they have to work with multiple file
formats from the get-go?

-- Lefty

On Fri, Apr 3, 2015 at 3:51 PM, Alan Gates <al...@gmail.com> wrote:

> A couple of points:
>
> 1) ORC isn't going into the incubator.  The proposal before the board is
> for it to go straight to TLP.  There's no graduation to depend on.
> 2) As currently proposed Hive would not depend on ORC to build.  Hive
> users who wished to use ORC would obviously need to pull in ORC artifacts
> in addition to Hive.  Given this I don't think it makes any sense to fork
> ORC and have it in both places.  This actually seems the worse outcome, as
> the two will inevitably diverge.
>
> Alan.

Re: ORC separate project

Posted by Lefty Leverenz <le...@gmail.com>.
Actually not so -- a spin-off project would have its own PMC and the Hive
PMC wouldn't have any say-so.  Of course, there would be some overlap of
the two PMCs.

(I'm not even sure if the PMC has governance of code, technically.  That
might belong to the committers or the development community.  Well, the PMC
does vote on release candidates so that's a kind of governance.  But the
community is supposed to decide on major issues.)

Anyway under the Apache license, nobody needs permission from the PMC to
grab some code and use it for another purpose.


-- Lefty

On Tue, Apr 7, 2015 at 11:49 PM, Xuefu Zhang <xz...@cloudera.com> wrote:

> If I understood Alan's #2 comment, we are moving existing ORC code out of
> Hive and making it a separate project, which I definitely missed. Since the
> existing Hive PMC has governance over the code, I would expect it's still the
> case even after the spinoff. Obviously the proposal doesn't reflect this.
>
> Thanks,
> Xuefu

Re: ORC separate project

Posted by Xuefu Zhang <xz...@cloudera.com>.
If I understood Alan's #2 comment, we are moving existing ORC code out of
Hive and making it a separate project, which I definitely missed. Since the
existing Hive PMC has governance over the code, I would expect it's still the
case even after the spinoff. Obviously the proposal doesn't reflect this.

Thanks,
Xuefu

On Fri, Apr 3, 2015 at 12:51 PM, Alan Gates <al...@gmail.com> wrote:

> A couple of points:
>
> 1) ORC isn't going into the incubator.  The proposal before the board is
> for it to go straight to TLP.  There's no graduation to depend on.
> 2) As currently proposed Hive would not depend on ORC to build.  Hive
> users who wished to use ORC would obviously need to pull in ORC artifacts
> in addition to Hive.  Given this I don't think it makes any sense to fork
> ORC and have it in both places.  This actually seems the worse outcome, as
> the two will inevitably diverge.
>
> Alan.

Re: ORC separate project

Posted by Alan Gates <al...@gmail.com>.
A couple of points:

1) ORC isn't going into the incubator.  The proposal before the board is 
for it to go straight to TLP.  There's no graduation to depend on.
2) As currently proposed Hive would not depend on ORC to build.  Hive 
users who wished to use ORC would obviously need to pull in ORC
artifacts in addition to Hive.  Given this I don't think it makes any 
sense to fork ORC and have it in both places.  This actually seems the 
worse outcome, as the two will inevitably diverge.

Alan.
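
One common way to get "does not depend on it to build, but can use it when
the artifacts are present" in Java is to look the implementation up by class
name at run time. The sketch below shows that general technique only; the
thread does not say this is how Hive would do it, and the class name
example.missing.OrcLikeFormat is a made-up placeholder for an artifact that
is not installed.

```java
import java.util.Optional;

public class OptionalFormatLoader {

    /**
     * Tries to instantiate a class by name. Returns empty when the class is
     * not on the classpath, i.e. when the optional artifact was not pulled in.
     */
    static Optional<Object> tryLoad(String className) {
        try {
            Class<?> cls = Class.forName(className);
            return Optional.of(cls.getDeclaredConstructor().newInstance());
        } catch (ClassNotFoundException e) {
            // The optional artifact is simply absent; this is not an error.
            return Optional.empty();
        } catch (ReflectiveOperationException e) {
            // The class exists but could not be constructed; surface that loudly.
            throw new IllegalStateException("found " + className + " but could not create it", e);
        }
    }

    public static void main(String[] args) {
        // java.util.ArrayList stands in for a format class that is present.
        System.out.println(tryLoad("java.util.ArrayList").isPresent());
        // Hypothetical name standing in for a format whose artifact is missing.
        System.out.println(tryLoad("example.missing.OrcLikeFormat").isPresent());
    }
}
```

Nothing in this file refers to the optional class at compile time, so the
build succeeds whether or not the artifact exists; only the string name ties
the two together.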

> Xuefu Zhang <ma...@cloudera.com>
> April 3, 2015 at 6:41
> I actually have a different thought to share along the same line.
>
> ORC is not a subproject in Hive. I'm not sure that performing surgery on
> Hive in order to make ORC a TLP is the best we can do. Not only may this
> bring instability to Hive, but it also makes Hive depend on an incubating
> project. Not every project graduates; some of them fail (though I do wish
> ORC success as a TLP).
>
> Instead, I like the idea of forking Hive's ORC as a TLP while Hive keeps
> whatever it has. This way, the new project can do whatever it wants, and
> the Hive community probably doesn't care and has no say in it. Once ORC
> graduates as a TLP, the Hive community can decide whether to go along with
> it and, if so, how to integrate with it.
>
> I think this will quiet the current controversy, help ORC proceed faster
> as a TLP, and leave the decision to the near future.
>
> Thanks,
> Xuefu

Re: ORC separate project

Posted by Xuefu Zhang <xz...@cloudera.com>.
I actually have a different thought to share along the same line.

ORC is not a subproject in Hive. I'm not sure that performing surgery on
Hive in order to make ORC a TLP is the best we can do. Not only may this
bring instability to Hive, but it also makes Hive depend on an incubating
project. Not every project graduates; some of them fail (though I do wish
ORC success as a TLP).

Instead, I like the idea of forking Hive's ORC as a TLP while Hive keeps
whatever it has. This way, the new project can do whatever it wants, and
the Hive community probably doesn't care and has no say in it. Once ORC
graduates as a TLP, the Hive community can decide whether to go along with
it and, if so, how to integrate with it.

I think this will quiet the current controversy, help ORC proceed faster
as a TLP, and leave the decision to the near future.

Thanks,
Xuefu

On Thu, Apr 2, 2015 at 11:54 PM, Szehon Ho <sz...@cloudera.com> wrote:

> I also agree with this goal.
>
> As such, I think we should first see the proposal (JIRA?) for the
> storage-api refactoring and other related work of Orc separating as TLP
> before the actual separation happens, to make sure the separation is not
> done in a way taking us further from this goal.  It may very well be this
> refactoring moves us closer to the goal, but seeing the proposal first
> would give a lot of clarity.
>
> Thanks
> Szehon
>
> On Thu, Apr 2, 2015 at 10:20 PM, Edward Capriolo <ed...@gmail.com>
> wrote:
>
> > To reiterate, one thing I want to avoid is having hive rely on code that
> > sits in several tiny silos across Apache projects, or Apache Licensed but
> > not ASF projects. Hive is a mature TLP with a large number of committers
> > and it would not be a good situation if work often gets bottlenecked
> > because changes had to be made across two projects simultaneously to
> > commit a feature. Especially if the two projects do not share the same
> > committer list.
> >
> > I think if it could be done perfectly, things like ORC, Parquet, whatever
> > would be <provided> scope dependencies, meaning the project can be built
> > without a particular piece but as a whole the project still works. (That
> > might be easier said than done :)
> >
> > On Wed, Apr 1, 2015 at 2:51 PM, Nick Dimiduk <nd...@gmail.com> wrote:
> >
> > > I think the storage-api would be very helpful for HBase integration as
> > > well.
> > >
> > > On Wed, Apr 1, 2015 at 11:22 AM, Owen O'Malley <om...@apache.org>
> > wrote:
> > >
> > > >
> > > >
> > > > On Wed, Apr 1, 2015 at 10:10 AM, Alan Gates <al...@gmail.com>
> > > wrote:
> > > >
> > > >>
> > > >>
> > > >>   Carl Steinbach <cw...@gmail.com>
> > > >>  April 1, 2015 at 0:01
> > > >>
> > > >> Hi Owen,
> > > >>
> > > >> I think you're referring to the following questions I asked last
> week
> > on
> > > >> the PMC mailing list:
> > > >>
> > > >> 1) How much if any of the code for vectorization/sargs/ACID will
> > migrate
> > > >> over to the new ORC project.
> > > >>
> > > >> 2) Will Hive contributors encounter situations where they are
> required
> > > to
> > > >> make changes to ORC in order to complete work on projects related to
> > > >> vectorization/sargs/ACID or other Hive features?
> > > >>
> > > >>  What I'd like to see here is well defined interfaces in Hive so
> that
> > > any
> > > >> storage format that wants can implement them.  Hopefully that means
> > > things
> > > >> like interfaces and utility classes for acid, sargs, and
> vectorization
> > > move
> > > >> into this new Hive module storage-api.  Then Orc, Parquet, etc. can
> > > depend
> > > >> on this module without needing to pull in all of Hive.
> > > >>
> > > >> Then Hive contributors would only be forced to make changes in Orc
> > when
> > > >> they want to implement something in Orc.
> > > >>
> > > >
> > > > Agreed. The goal of the new module keep a clean separation between
> the
> > > > code for ORC and Hive so that vectorization, sargs, and acid are kept
> > in
> > > > Hive and are not moved to or duplicated in the ORC project.
> > > >
> > > > .. Owen
> > > >
> > >
> >
>

Re: ORC separate project

Posted by Szehon Ho <sz...@cloudera.com>.
I also agree with this goal.

As such, I think we should first see the proposal (JIRA?) for the
storage-api refactoring and other related work of Orc separating as TLP
before the actual separation happens, to make sure the separation is not
done in a way taking us further from this goal.  It may very well be this
refactoring moves us closer to the goal, but seeing the proposal first
would give a lot of clarity.

Thanks
Szehon

On Thu, Apr 2, 2015 at 10:20 PM, Edward Capriolo <ed...@gmail.com> wrote:

> To reiterate, one thing I want to avoid is having Hive rely on code that
> sits in several tiny silos across Apache projects, or Apache Licensed but
> not ASF projects. Hive is a mature TLP with a large number of committers
> and it would not be a good situation if work often gets bottlenecked
> because changes had to be made across two projects simultaneously to commit
> a feature. Especially if the two projects do not share the same committer
> list.
>
> I think, if it could be done perfectly, things like ORC, Parquet, or whatever
> would be <provided> scope dependencies, meaning the project can be built
> without a particular piece but as a whole the project still works. (That
> might be easier said than done :)
>
> On Wed, Apr 1, 2015 at 2:51 PM, Nick Dimiduk <nd...@gmail.com> wrote:
>
> > I think the storage-api would be very helpful for HBase integration as
> > well.
> >
> > On Wed, Apr 1, 2015 at 11:22 AM, Owen O'Malley <om...@apache.org> wrote:
> >
> > >
> > >
> > > On Wed, Apr 1, 2015 at 10:10 AM, Alan Gates <al...@gmail.com> wrote:
> > >
> > >>
> > >>
> > >>   Carl Steinbach <cw...@gmail.com>
> > >>  April 1, 2015 at 0:01
> > >>
> > >> Hi Owen,
> > >>
> > >> I think you're referring to the following questions I asked last week
> > >> on the PMC mailing list:
> > >>
> > >> 1) How much if any of the code for vectorization/sargs/ACID will
> > >> migrate over to the new ORC project.
> > >>
> > >> 2) Will Hive contributors encounter situations where they are required
> > >> to make changes to ORC in order to complete work on projects related to
> > >> vectorization/sargs/ACID or other Hive features?
> > >>
> > >>  What I'd like to see here is well defined interfaces in Hive so that
> > >> any storage format that wants can implement them.  Hopefully that means
> > >> things like interfaces and utility classes for acid, sargs, and
> > >> vectorization move into this new Hive module storage-api.  Then Orc,
> > >> Parquet, etc. can depend on this module without needing to pull in all
> > >> of Hive.
> > >>
> > >> Then Hive contributors would only be forced to make changes in Orc
> > >> when they want to implement something in Orc.
> > >>
> > >
> > > Agreed. The goal of the new module is to keep a clean separation between
> > > the code for ORC and Hive so that vectorization, sargs, and acid are kept
> > > in Hive and are not moved to or duplicated in the ORC project.
> > >
> > > .. Owen
> > >
> >
>

Re: ORC separate project

Posted by Edward Capriolo <ed...@gmail.com>.
To reiterate, one thing I want to avoid is having Hive rely on code that
sits in several tiny silos across Apache projects, or Apache Licensed but
not ASF projects. Hive is a mature TLP with a large number of committers
and it would not be a good situation if work often gets bottlenecked
because changes had to be made across two projects simultaneously to commit
a feature. Especially if the two projects do not share the same committer
list.

I think, if it could be done perfectly, things like ORC, Parquet, or whatever
would be <provided> scope dependencies, meaning the project can be built
without a particular piece but as a whole the project still works. (That
might be easier said than done :)
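
As a rough illustration of the "built without a particular piece, still
works as a whole" idea, here is a minimal Java sketch of a runtime analogue.
It is not Maven's <provided> scope itself, and the class name
OptionalFormatsSketch and the candidate class names passed to it are
hypothetical. The core code probes for optional format classes and simply
skips any that are not on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: discovers which optional format classes are present. */
final class OptionalFormatsSketch {

    /** Returns the subset of candidate class names loadable from the classpath. */
    static List<String> available(List<String> candidateClassNames) {
        List<String> found = new ArrayList<>();
        ClassLoader loader = OptionalFormatsSketch.class.getClassLoader();
        for (String name : candidateClassNames) {
            try {
                // 'false' means: locate the class but do not run its static initializers.
                Class.forName(name, false, loader);
                found.add(name);
            } catch (ClassNotFoundException | LinkageError e) {
                // The optional piece is absent or unusable; the core keeps working without it.
            }
        }
        return found;
    }
}
```

A caller would pass the fully qualified names of whatever format entry
points it knows about and enable only the ones that come back.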

On Wed, Apr 1, 2015 at 2:51 PM, Nick Dimiduk <nd...@gmail.com> wrote:

> I think the storage-api would be very helpful for HBase integration as
> well.
>
> On Wed, Apr 1, 2015 at 11:22 AM, Owen O'Malley <om...@apache.org> wrote:
>
> >
> >
> > On Wed, Apr 1, 2015 at 10:10 AM, Alan Gates <al...@gmail.com> wrote:
> >
> >>
> >>
> >>   Carl Steinbach <cw...@gmail.com>
> >>  April 1, 2015 at 0:01
> >>
> >> Hi Owen,
> >>
> >> I think you're referring to the following questions I asked last week on
> >> the PMC mailing list:
> >>
> >> 1) How much if any of the code for vectorization/sargs/ACID will migrate
> >> over to the new ORC project.
> >>
> >> 2) Will Hive contributors encounter situations where they are required
> >> to make changes to ORC in order to complete work on projects related to
> >> vectorization/sargs/ACID or other Hive features?
> >>
> >>  What I'd like to see here is well defined interfaces in Hive so that
> >> any storage format that wants can implement them.  Hopefully that means
> >> things like interfaces and utility classes for acid, sargs, and
> >> vectorization move into this new Hive module storage-api.  Then Orc,
> >> Parquet, etc. can depend on this module without needing to pull in all
> >> of Hive.
> >>
> >> Then Hive contributors would only be forced to make changes in Orc when
> >> they want to implement something in Orc.
> >>
> >
> > Agreed. The goal of the new module is to keep a clean separation between
> > the code for ORC and Hive so that vectorization, sargs, and acid are kept
> > in Hive and are not moved to or duplicated in the ORC project.
> >
> > .. Owen
> >
>

Re: ORC separate project

Posted by Nick Dimiduk <nd...@gmail.com>.
I think the storage-api would be very helpful for HBase integration as well.

On Wed, Apr 1, 2015 at 11:22 AM, Owen O'Malley <om...@apache.org> wrote:

>
>
> On Wed, Apr 1, 2015 at 10:10 AM, Alan Gates <al...@gmail.com> wrote:
>
>>
>>
>>   Carl Steinbach <cw...@gmail.com>
>>  April 1, 2015 at 0:01
>>
>> Hi Owen,
>>
>> I think you're referring to the following questions I asked last week on
>> the PMC mailing list:
>>
>> 1) How much if any of the code for vectorization/sargs/ACID will migrate
>> over to the new ORC project.
>>
>> 2) Will Hive contributors encounter situations where they are required to
>> make changes to ORC in order to complete work on projects related to
>> vectorization/sargs/ACID or other Hive features?
>>
>>  What I'd like to see here is well defined interfaces in Hive so that any
>> storage format that wants can implement them.  Hopefully that means things
>> like interfaces and utility classes for acid, sargs, and vectorization move
>> into this new Hive module storage-api.  Then Orc, Parquet, etc. can depend
>> on this module without needing to pull in all of Hive.
>>
>> Then Hive contributors would only be forced to make changes in Orc when
>> they want to implement something in Orc.
>>
>
> Agreed. The goal of the new module is to keep a clean separation between
> the code for ORC and Hive so that vectorization, sargs, and acid are kept in
> Hive and are not moved to or duplicated in the ORC project.
>
> .. Owen
>

Re: ORC separate project

Posted by Owen O'Malley <om...@apache.org>.
On Wed, Apr 1, 2015 at 10:10 AM, Alan Gates <al...@gmail.com> wrote:

>
>
>   Carl Steinbach <cw...@gmail.com>
>  April 1, 2015 at 0:01
>
> Hi Owen,
>
> I think you're referring to the following questions I asked last week on
> the PMC mailing list:
>
> 1) How much if any of the code for vectorization/sargs/ACID will migrate
> over to the new ORC project.
>
> 2) Will Hive contributors encounter situations where they are required to
> make changes to ORC in order to complete work on projects related to
> vectorization/sargs/ACID or other Hive features?
>
>  What I'd like to see here is well defined interfaces in Hive so that any
> storage format that wants can implement them.  Hopefully that means things
> like interfaces and utility classes for acid, sargs, and vectorization move
> into this new Hive module storage-api.  Then Orc, Parquet, etc. can depend
> on this module without needing to pull in all of Hive.
>
> Then Hive contributors would only be forced to make changes in Orc when
> they want to implement something in Orc.
>

Agreed. The goal of the new module is to keep a clean separation between the
code for ORC and Hive so that vectorization, sargs, and acid are kept in Hive
and are not moved to or duplicated in the ORC project.

.. Owen

Re: ORC separate project

Posted by Alan Gates <al...@gmail.com>.

> Carl Steinbach <ma...@gmail.com>
> April 1, 2015 at 0:01
> Hi Owen,
>
> I think you're referring to the following questions I asked last week on
> the PMC mailing list:
>
> 1) How much if any of the code for vectorization/sargs/ACID will migrate
> over to the new ORC project.
>
> 2) Will Hive contributors encounter situations where they are required to
> make changes to ORC in order to complete work on projects related to
> vectorization/sargs/ACID or other Hive features?

What I'd like to see here is well defined interfaces in Hive so that any
storage format that wants can implement them.  Hopefully that means 
things like interfaces and utility classes for acid, sargs, and 
vectorization move into this new Hive module storage-api.  Then Orc, 
Parquet, etc. can depend on this module without needing to pull in all 
of Hive.

Then Hive contributors would only be forced to make changes in Orc when 
they want to implement something in Orc.

Alan.

Re: ORC separate project

Posted by Carl Steinbach <cw...@gmail.com>.
Hi Owen,

I think you're referring to the following questions I asked last week on
the PMC mailing list:

1) How much if any of the code for vectorization/sargs/ACID will migrate
over to the new ORC project.

2) Will Hive contributors encounter situations where they are required to
make changes to ORC in order to complete work on projects related to
vectorization/sargs/ACID or other Hive features?

Thanks for taking the time to write a response, but I don't think what you
wrote really answers either of these questions.

Some more comments/questions inline:

> One of the concerns that has been mentioned is how to deal with the
> vectorization and SARG APIs.


I'm actually more concerned about what will happen to the code that
provides the implementation for these APIs. Can you comment on that?


> I'd like to propose that we pull the minimal
> set of classes in a new Hive module named "storage-api". This module will
> include VectorizedRowBatch, the various ColumnVector classes, and the SARG
> classes.


"storage-api" implies that there will be a separate "storage-impl" module.
Where will that live?



> It will form the start of an API that high performance storage
> formats can use to integrate with Hive. Both ORC and Parquet can use the
> new API to support vectorization and SARGs without performance destroying
> shims.


I'd like to understand this problem better, but I don't know where to
start. Can you provide a pointer to these "performance destroying shims"?

Thanks.

- Carl

Re: ORC separate project

Posted by Owen O'Malley <om...@apache.org>.
All,

Moving this forward, I'll submit a resolution to the Apache board for the
next meeting.

One of the concerns that has been mentioned is how to deal with the
vectorization and SARG APIs. I'd like to propose that we pull the minimal
set of classes into a new Hive module named "storage-api". This module will
include VectorizedRowBatch, the various ColumnVector classes, and the SARG
classes. It will form the start of an API that high-performance storage
formats can use to integrate with Hive. Both ORC and Parquet can use the
new API to support vectorization and SARGs without performance-destroying
shims. I'll create a JIRA to discuss the idea.
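
To make the shape of such a module a little more concrete, here is a minimal
Java sketch under stated assumptions. Every name in it (LongColumnSketch,
RowBatchSketch, SearchArgSketch, BatchReaderSketch, ChunkedReaderSketch,
StorageApiSketch) is hypothetical and heavily simplified; none of it is the
actual Hive VectorizedRowBatch, ColumnVector, or SARG code. The point is only
that the engine and the format meet at a few small interfaces, and that a
format can use the search argument to skip whole chunks.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a column vector holding long values. */
final class LongColumnSketch {
    final long[] values;
    LongColumnSketch(int capacity) { values = new long[capacity]; }
}

/** Hypothetical row batch: a single column plus the number of valid rows. */
final class RowBatchSketch {
    final LongColumnSketch column;
    int size;
    RowBatchSketch(int capacity) { column = new LongColumnSketch(capacity); }
}

/** Hypothetical search argument, evaluated against per-chunk min/max values. */
interface SearchArgSketch {
    boolean mightMatch(long min, long max);
}

/** The only contract a storage format implements in this sketch. */
interface BatchReaderSketch {
    /** Fills the batch with the next chunk that may match; returns false at end. */
    boolean nextBatch(RowBatchSketch batch, SearchArgSketch sarg);
}

/** Toy in-memory "format": each inner array plays the role of one chunk. */
final class ChunkedReaderSketch implements BatchReaderSketch {
    private final long[][] chunks;
    private int next = 0;
    int chunksRead = 0;

    ChunkedReaderSketch(long[][] chunks) { this.chunks = chunks; }

    @Override
    public boolean nextBatch(RowBatchSketch batch, SearchArgSketch sarg) {
        while (next < chunks.length) {
            long[] chunk = chunks[next++];
            if (chunk.length == 0) {
                continue; // nothing to read in an empty chunk
            }
            long min = Arrays.stream(chunk).min().getAsLong();
            long max = Arrays.stream(chunk).max().getAsLong();
            if (!sarg.mightMatch(min, max)) {
                continue; // the whole chunk is skipped without copying any rows
            }
            System.arraycopy(chunk, 0, batch.column.values, 0, chunk.length);
            batch.size = chunk.length;
            chunksRead++;
            return true;
        }
        return false;
    }
}

/** Engine-side code: it sees only the interfaces, never the concrete format. */
final class StorageApiSketch {
    static long sumAtLeast(BatchReaderSketch reader, long threshold, int capacity) {
        RowBatchSketch batch = new RowBatchSketch(capacity);
        SearchArgSketch sarg = (min, max) -> max >= threshold;
        long sum = 0;
        while (reader.nextBatch(batch, sarg)) {
            for (int i = 0; i < batch.size; i++) {
                long v = batch.column.values[i];
                if (v >= threshold) {
                    sum += v; // chunk-level skipping is coarse, so rows are still filtered
                }
            }
        }
        return sum;
    }
}
```

In this sketch the batch capacity must be at least as large as the biggest
chunk, since the toy reader copies a whole chunk at a time.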

Thanks!
   Owen

Re: ORC separate project

Posted by Lefty Leverenz <le...@gmail.com>.
Count me in.

-- Lefty


On Thu, Mar 19, 2015 at 9:19 PM, Nick Dimiduk <nd...@gmail.com> wrote:

> This is a great plan, +1!
>
>
> On Thursday, March 19, 2015, Owen O'Malley <om...@apache.org> wrote:
>
>> All,
>>    Over the last year, there has been a fair number of projects that want
>> to integrate with ORC, but don't want a dependence on Hive's exec jar.
>> Additionally, we've been working on a C++ reader (and soon writer) and it
>> would be great to host them both in the same project. Toward that end, I'd
>> like to create a separate ORC project at Apache. There will be lots of
>> technical details to work out, but I wanted to give the Hive community a
>> chance to discuss it. Do any of the Hive committers want to be included on
>> the proposal?
>>
>> Of the current Hive committers, my list looks like:
>> * Alan
>> * Gunther
>> * Prasanth
>> * Lefty
>> * Owen
>> * Sergey
>> * Gopal
>> * Kevin
>>
>> Did I miss anyone?
>>
>> Thanks!
>>    Owen
>>
>

Re: ORC separate project

Posted by Nick Dimiduk <nd...@gmail.com>.
This is a great plan, +1!

On Thursday, March 19, 2015, Owen O'Malley <om...@apache.org> wrote:

> All,
>    Over the last year, there has been a fair number of projects that want
> to integrate with ORC, but don't want a dependence on Hive's exec jar.
> Additionally, we've been working on a C++ reader (and soon writer) and it
> would be great to host them both in the same project. Toward that end, I'd
> like to create a separate ORC project at Apache. There will be lots of
> technical details to work out, but I wanted to give the Hive community a
> chance to discuss it. Do any of the Hive committers want to be included on
> the proposal?
>
> Of the current Hive committers, my list looks like:
> * Alan
> * Gunther
> * Prasanth
> * Lefty
> * Owen
> * Sergey
> * Gopal
> * Kevin
>
> Did I miss anyone?
>
> Thanks!
>    Owen
>

RE: ORC separate project

Posted by "Lalam, Chinna R" <ch...@intel.com>.
Hi Owen,

I'd like to get involved.  Please add me as well.

Thanks,
Chinna Rao Lalam


---------- Forwarded message ----------
From: Owen O'Malley <om...@apache.org>
Date: Fri, Mar 20, 2015 at 3:14 AM
Subject: ORC separate project
To: "dev@hive.apache.org" <de...@hive.apache.org>, Lefty Leverenz <le...@gmail.com>


All,
   Over the last year, there has been a fair number of projects that want
to integrate with ORC, but don't want a dependence on Hive's exec jar.
Additionally, we've been working on a C++ reader (and soon writer) and it
would be great to host them both in the same project. Toward that end, I'd
like to create a separate ORC project at Apache. There will be lots of
technical details to work out, but I wanted to give the Hive community a
chance to discuss it. Do any of the Hive committers want to be included on
the proposal?

Of the current Hive committers, my list looks like:
* Alan
* Gunther
* Prasanth
* Lefty
* Owen
* Sergey
* Gopal
* Kevin

Did I miss anyone?

Thanks!
   Owen