Posted to dev@hive.apache.org by Owen O'Malley <om...@apache.org> on 2016/08/16 00:01:16 UTC

[DISCUSS] Making storage-api a separately released artifact

All,

As part of moving ORC out of Hive, we pulled all of the vectorization
storage and sarg classes into a separate module, which is named
storage-api.  Although it is currently only used by ORC, it could be used
by Parquet or Avro if they wanted to make a fast vectorized reader that
reads directly into Hive's VectorizedRowBatch without needing a shim or
data copy. Note that this is in many ways similar to pulling the Arrow
project out of Drill.

This unfortunately still leaves us with a circular dependency between Hive
and ORC. I'd hoped that storage-api wouldn't change that much, but that
doesn't seem to be happening. As a result, ORC ends up shipping its own
fork of storage-api.

Although we could make a new project for just the storage-api, I think it
would be better to make it a subproject of Hive that is released
independently.

What do others think?

   Owen

Re: [DISCUSS] Making storage-api a separately released artifact

Posted by Sushanth Sowmyan <kh...@gmail.com>.

+1 for having a separate storage-api project to define common interfaces
for people to develop against. It'll make things much easier to develop
against generically.

I'm okay (+0) with the sub-project idea rather than enthusiastic about it,
mostly because I worry that it will encourage laziness, that in practice it
will wind up tied to Hive releases and development, and that over time
assumptions about how Hive works and what is available will bleed in. Still,
any move toward separation will definitely help.


Re: [DISCUSS] Making storage-api a separately released artifact

Posted by Prasanth Jayachandran <pj...@hortonworks.com>.

+1 for making it a subproject with a separate (preferably shorter) release cycle. The module by itself is too small for a separate project. A faster release cycle would also resolve the circular dependency and help other projects make use of vectorization, sarg, bloom filters, etc.

For version management, how about adding another version component after the patch version, i.e. a sub-project version?
Example: 2.2.0.[0] would be storage-api’s release version. Hive would always depend on 2.2.0-SNAPSHOT. I think Maven will let us release modules with different versions. https://dev.c-ware.de/confluence/display/PUBLIC/Releasing+modules+of+a+multi-module+project+with+independent+version+numbers
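
To make the four-part scheme concrete, here is a minimal Java sketch of how such a version string could be parsed, ordered, and mapped back to a Hive release line. The class name FourPartVersion and its methods are hypothetical and exist only for illustration; nothing like this is part of Hive or Maven.

```java
// Hypothetical helper for illustration only; not part of Hive or Maven.
final class FourPartVersion implements Comparable<FourPartVersion> {
    // major, minor, patch, sub-project (the proposed fourth component)
    private final int[] parts;

    private FourPartVersion(int[] parts) {
        this.parts = parts;
    }

    static FourPartVersion parse(String text) {
        String[] fields = text.split("\\.");
        if (fields.length != 4) {
            throw new IllegalArgumentException("expected four numeric fields: " + text);
        }
        int[] parts = new int[4];
        for (int i = 0; i < 4; i++) {
            parts[i] = Integer.parseInt(fields[i]);
        }
        return new FourPartVersion(parts);
    }

    // The Hive release line this storage-api version would be tied to, e.g. "2.2.0".
    String hiveLine() {
        return parts[0] + "." + parts[1] + "." + parts[2];
    }

    @Override
    public int compareTo(FourPartVersion other) {
        for (int i = 0; i < 4; i++) {
            int c = Integer.compare(parts[i], other.parts[i]);
            if (c != 0) {
                return c;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        FourPartVersion first = parse("2.2.0.0");
        FourPartVersion second = parse("2.2.0.1");
        System.out.println(first.compareTo(second) < 0);
        System.out.println(second.hiveLine());
    }
}
```

Under this scheme every storage-api release in the 2.2.0.x series maps to the same Hive line, so only the fourth field distinguishes one storage-api release from the next.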

Thanks
Prasanth 

> On Aug 17, 2016, at 10:46 AM, Alan Gates <al...@gmail.com> wrote:
> 
> +1 for making the API clean and easy for other projects to work with.  A few questions:
> 
> 1) Would this also make it easier for Parquet and others to implement Hive’s ACID interfaces?
> 
> 2) Would we make any attempt to coordinate version numbers between Hive and the storage module, or would a given version of Hive just depend on a given version of the storage module?
> 
> Alan.
> 
>> On Aug 15, 2016, at 17:01, Owen O'Malley <om...@apache.org> wrote:
>> 
>> All,
>> 
>> As part of moving ORC out of Hive, we pulled all of the vectorization
>> storage and sarg classes into a separate module, which is named
>> storage-api.  Although it is currently only used by ORC, it could be used
>> by Parquet or Avro if they wanted to make a fast vectorized reader that
>> read directly in to Hive's VectorizedRowBatch without needing a shim or
>> data copy. Note that this is in many ways similar to pulling the Arrow
>> project out of Drill.
>> 
>> This unfortunately still leaves us with a circular dependency between Hive
>> and ORC. I'd hoped that storage-api wouldn't change that much, but that
>> doesn't seem to be happening. As a result, ORC ends up shipping its own
>> fork of storage-api.
>> 
>> Although we could make a new project for just the storage-api, I think it
>> would be better to make it a subproject of Hive that is released
>> independently.
>> 
>> What do others think?
>> 
>>  Owen
> 
>

Re: [DISCUSS] Making storage-api a separately released artifact

Posted by Matthew McCline <mm...@hortonworks.com>.

For good performance, VectorizedRowBatch does not follow "traditional" object-oriented design rules, for better or worse.  We made a number of member variables public so they can be accessed directly (e.g., in LongColumnVector the long[] vector is public), and we avoided putting an interface in front of the ColumnVector family so that access to the objects is direct and fast.
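
A minimal Java sketch of the layout described above. SketchLongColumn, SketchBatch, and DirectAccessDemo are simplified, hypothetical stand-ins, not the real storage-api classes; the only detail taken from the message is that the long[] array is a public field that callers touch directly.

```java
// Simplified stand-ins for illustration; NOT the real storage-api classes.
final class SketchLongColumn {
    // Public on purpose: callers read and write the array without a getter.
    public final long[] vector;

    SketchLongColumn(int capacity) {
        this.vector = new long[capacity];
    }
}

final class SketchBatch {
    public final SketchLongColumn[] cols;
    public int size; // number of valid rows currently in the batch

    SketchBatch(int numCols, int capacity) {
        cols = new SketchLongColumn[numCols];
        for (int i = 0; i < numCols; i++) {
            cols[i] = new SketchLongColumn(capacity);
        }
    }
}

final class DirectAccessDemo {
    // A hypothetical reader writes decoded values straight into the column array.
    static void fillSequence(SketchBatch batch, long start, int count) {
        long[] v = batch.cols[0].vector;
        for (int i = 0; i < count; i++) {
            v[i] = start + i;
        }
        batch.size = count;
    }

    // A consumer loops over the raw array: no interface dispatch, no boxing.
    static long sum(SketchBatch batch) {
        long[] v = batch.cols[0].vector;
        long total = 0;
        for (int i = 0; i < batch.size; i++) {
            total += v[i];
        }
        return total;
    }

    public static void main(String[] args) {
        SketchBatch batch = new SketchBatch(1, 1024);
        fillSequence(batch, 1, 4);
        System.out.println(sum(batch));
    }
}
```

The trade-off is the one named in the message: the producer and the consumer both depend on the concrete field layout, which is why changes to these classes ripple out to every format that fills them.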

________________________________________
From: Sergio Pena <se...@cloudera.com>
Sent: Friday, August 26, 2016 12:58 PM
To: dev
Subject: Re: [DISCUSS] Making storage-api a separately released artifact

Question:

Wouldn't it be better to move part of the implementations to ORC, Parquet and
Avro, and just have some interfaces and basic implementations in Hive? This
way we could avoid having ORC, Parquet and/or Avro depend on Hive. I saw this
in Parquet, where they created a RowBatch class internally and return it
to Hive; then in Hive we just bind it to the Hive vectorized interface
to support vectorization. It's just an idea; I am not clear exactly what I
am trying to say :)



RE: [DISCUSS] Making storage-api a separately released artifact

Posted by "Xu, Cheng A" <ch...@intel.com>.

Hi Sergio,
For vectorization, this works for most types except decimal. For decimals, the Hive row batch holds an array of HiveDecimal, which can't be initialized on the Parquet side. We have to do a conversion, which hurts performance.
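
A rough Java sketch of why the decimal case costs more. It assumes, purely for illustration, that the reader side holds decimals as unscaled long values with a fixed scale, and it uses java.math.BigDecimal as a stand-in for HiveDecimal; the class DecimalConvertSketch is hypothetical.

```java
import java.math.BigDecimal;

// Hypothetical illustration; BigDecimal stands in for HiveDecimal here.
final class DecimalConvertSketch {
    // Each value has to be wrapped in an object of the batch's decimal type,
    // so conversion is a per-element loop rather than a shared primitive array.
    static BigDecimal[] toObjects(long[] unscaled, int scale) {
        BigDecimal[] out = new BigDecimal[unscaled.length];
        for (int i = 0; i < unscaled.length; i++) {
            out[i] = BigDecimal.valueOf(unscaled[i], scale);
        }
        return out;
    }

    public static void main(String[] args) {
        BigDecimal[] converted = toObjects(new long[] {12345L, -50L}, 2);
        for (BigDecimal d : converted) {
            System.out.println(d);
        }
    }
}
```

A long column, by contrast, can be filled by writing primitives directly into the batch's array, with no per-value object.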

-----Original Message-----
From: Sergio Pena [mailto:sergio.pena@cloudera.com] 
Sent: Saturday, August 27, 2016 3:59 AM
To: dev <de...@hive.apache.org>
Subject: Re: [DISCUSS] Making storage-api a separately released artifact

Question:

Wouldn't it be better to move part of the implementations to ORC, Parquet and Avro, and just have some interfaces and basic implementations in Hive? This way we could avoid having ORC, Parquet and/or Avro depend on Hive. I saw this in Parquet, where they created a RowBatch class internally and return it to Hive; then in Hive we just bind it to the Hive vectorized interface to support vectorization. It's just an idea; I am not clear exactly what I am trying to say :)



Re: [DISCUSS] Making storage-api a separately released artifact

Posted by Sergio Pena <se...@cloudera.com>.

Question:

Wouldn't it be better to move part of the implementations to ORC, Parquet and
Avro, and just have some interfaces and basic implementations in Hive? This
way we could avoid having ORC, Parquet and/or Avro depend on Hive. I saw this
in Parquet, where they created a RowBatch class internally and return it
to Hive; then in Hive we just bind it to the Hive vectorized interface
to support vectorization. It's just an idea; I am not clear exactly what I
am trying to say :)



Re: [DISCUSS] Making storage-api a separately released artifact

Posted by Lefty Leverenz <le...@gmail.com>.

Sergey's idea is creative, although it leads to confusion about JIRA fix
versions.  Issues would be given fix versions based on assumptions about
whether SA or Hive will be released first.  (That's hard to predict when
it's months away.)
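
A minimal Java sketch of the run-through numbering Sergey proposed, to show where the fix-version ambiguity comes from. SharedVersionLine is a hypothetical class written for this illustration; the release sequence in main mirrors the example in his message (storage-api 2.2, storage-api 2.3, Hive 2.4, storage-api 2.5).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a single version line shared by two components.
final class SharedVersionLine {
    private final int major;
    private int nextMinor;
    private final List<String> history = new ArrayList<>();

    SharedVersionLine(int major, int firstMinor) {
        this.major = major;
        this.nextMinor = firstMinor;
    }

    // Whichever component releases next consumes the next minor number.
    String release(String component) {
        String tag = component + " " + major + "." + nextMinor;
        nextMinor++;
        history.add(tag);
        return tag;
    }

    List<String> history() {
        return new ArrayList<>(history);
    }

    public static void main(String[] args) {
        SharedVersionLine line = new SharedVersionLine(2, 2);
        line.release("storage-api");
        line.release("storage-api");
        line.release("hive");
        line.release("storage-api");
        System.out.println(line.history());
    }
}
```

The number a component receives depends entirely on the order of the release calls, so a fix version assigned to a JIRA issue before the release is a guess about that order.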

Keeping the version numbers tied together is very appealing.  Would it be
possible to have incompatible changes in SA force a bump in the Hive
release number?  Hm, I guess that means Hive would need a release at the
same time as SA, but only for incompatible changes.

What's the likelihood of another subproject getting spun off eventually?
If that happened, the 4th minor version wouldn't make sense.  A 5th minor
version wouldn't work either.

-- Lefty


On Fri, Aug 19, 2016 at 9:46 PM, Sergey Shelukhin <se...@hortonworks.com>
wrote:

> I am suggesting we always skip the number. So only one component gets the
> next one :) In your example Hive trunk would be 2.3, and if SA is released
> again it would become 2.4. Otherwise we’d need a compat table cause
> versions will be totally out of sync.
>
> On 16/8/19, 16:31, "Owen O'Malley" <om...@apache.org> wrote:
>
> >That won't necessarily work, especially in the beginning. If we release SA
> >2.2.0 and use it for Hive trunk with the assumption that the next Hive
> >release will be 2.2. What do we do when we need to make an incompatible
> >change in SA? I guess we could release SA as 2.3.0 and when hive makes its
> >next release skip over Hive 2.2 and go straight to Hive 2.3.0. In general
> >I
> >think that we'd be better off with the release numbers not tied together.
> >
> >.. Owen
> >
> >On Fri, Aug 19, 2016 at 4:14 PM, Sergey Shelukhin <sergey@hortonworks.com
> >
> >wrote:
> >
> >> Can we just run the versions thru? I.e. increment it every time but
> >> release only one component (or both if they happen to align I guess).
> >> E.g. storage-api will be released at 2.2, and say 2.3 if it moves fast,
> >> then Hive 2.4, then storage-api 2.5, etc.
> >> That might make it easier to reason about compatibility because the
> >>order
> >> is obvious.
> >>
> >> On 16/8/19, 09:04, "Sergio Pena" <se...@cloudera.com> wrote:
> >>
> >> >I see Parquet is currently using the SearchArgument class for
> >>predicates
> >> >push down.
> >> >Will this class be part of the new sub-module or project?
> >> >
> >> >Following Sushanth idea, can we have other API interfaces in the new
> >> >project that other components can use?
> >> >Perhaps having this may be a good reason to create a project.
> >> >
> >> >I'm -1 with the 4th minor version. As Owen mentioned, changing the 4th
> >> >version number for incompatible changes is ugly and confusing.
> >> >I like the new project idea more, +1, but  the storage-api may be too
> >> >small
> >> >for a new project.
> >> >
> >> >- Sergio
> >> >
> >> >On Wed, Aug 17, 2016 at 2:05 PM, Owen O'Malley <om...@apache.org>
> >> wrote:
> >> >
> >> >> On Wed, Aug 17, 2016 at 10:46 AM, Alan Gates <al...@gmail.com>
> >> >>wrote:
> >> >>
> >> >> > +1 for making the API clean and easy for other projects to work
> >>with.
> >> >> A
> >> >> > few questions:
> >> >> >
> >> >> > 1) Would this also make it easier for Parquet and others to
> >>implement
> >> >> > Hive’s ACID interfaces?
> >> >> >
> >> >>
> >> >> Currently the ACID interfaces haven't been moved over to storage-api,
> >> >> although it would make sense to do so at some point.
> >> >>
> >> >>
> >> >> >
> >> >> > 2) Would we make any attempt to coordinate version numbers between
> >> >>Hive
> >> >> > and the storage module, or would a given version of Hive just
> >>depend
> >> >>on a
> >> >> > given version of the storage module?
> >> >> >
> >> >>
> >> >> The two options that I see are:
> >> >>
> >> >> * Let the numbers run separately starting from 2.2.0.
> >> >> * Tie the numbers together with an additional level of versioning
> >>(eg.
> >> >> 2.2.0.0).
> >> >>
> >> >> I think that letting the two version numbers diverge is better in the
> >> >>long
> >> >> term. For example, if you need to make an incompatible change, it is
> >> >>pretty
> >> >> ugly to do it as a fourth level version number (eg. an incompatible
> >> >>change
> >> >> from 2.2.0.0 to 2.2.0.1). At the beginning, I expect that storage-api
> >> >>would
> >> >> move faster than Hive, but as it stabilizes I expect it might start
> >> >>moving
> >> >> slower than Hive.
> >> >>
> >> >> I'd propose that we have Hive's build use a released version of
> >> >>storage-api
> >> >> rather than a snapshot.
> >> >>
> >> >> Thoughts?
> >> >>
> >> >>    Owen
> >> >>
> >> >>
> >> >> > Alan.
> >> >> >
> >> >> > > On Aug 15, 2016, at 17:01, Owen O'Malley <om...@apache.org>
> >> wrote:
> >> >> > >
> >> >> > > All,
> >> >> > >
> >> >> > > As part of moving ORC out of Hive, we pulled all of the
> >> >>vectorization
> >> >> > > storage and sarg classes into a separate module, which is named
> >> >> > > storage-api.  Although it is currently only used by ORC, it
> >>could be
> >> >> used
> >> >> > > by Parquet or Avro if they wanted to make a fast vectorized
> >>reader
> >> >>that
> >> >> > > read directly in to Hive's VectorizedRowBatch without needing a
> >> >>shim or
> >> >> > > data copy. Note that this is in many ways similar to pulling the
> >> >>Arrow
> >> >> > > project out of Drill.
> >> >> > >
> >> >> > > This unfortunately still leaves us with a circular dependency
> >> >>between
> >> >> > Hive
> >> >> > > and ORC. I'd hoped that storage-api wouldn't change that much,
> >>but
> >> >>that
> >> >> > > doesn't seem to be happening. As a result, ORC ends up shipping
> >>its
> >> >>own
> >> >> > > fork of storage-api.
> >> >> > >
> >> >> > > Although we could make a new project for just the storage-api, I
> >> >>think
> >> >> it
> >> >> > > would be better to make it a subproject of Hive that is released
> >> >> > > independently.
> >> >> > >
> >> >> > > What do others think?
> >> >> > >
> >> >> > >   Owen
> >> >> >
> >> >> >
> >> >>
> >>
> >>
>
>

Re: [DISCUSS] Making storage-api a separately released artifact

Posted by Sergey Shelukhin <se...@hortonworks.com>.

I am suggesting we always skip the number, so only one component gets the
next one :) In your example Hive trunk would be 2.3, and if SA is released
again it would become 2.4. Otherwise we’d need a compat table, because the
versions will be totally out of sync.
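
A minimal sketch of what such a compat table might look like if the two
version numbers ran independently. The version pairs and the helper name
required_storage_api are hypothetical, invented only for illustration; they
do not describe real releases.

```python
# Hypothetical compatibility table: Hive release -> storage-api release it
# was built against. The pairs below are made up for illustration only.
COMPAT_TABLE = {
    "2.2.0": "2.3.0",
    "2.3.0": "2.5.1",
}


def required_storage_api(hive_version: str) -> str:
    """Look up the storage-api version recorded for a given Hive version."""
    try:
        return COMPAT_TABLE[hive_version]
    except KeyError:
        # With independent numbering the answer cannot be derived from the
        # Hive version itself; it has to be recorded somewhere.
        raise LookupError(
            f"no storage-api version recorded for Hive {hive_version}"
        ) from None


print(required_storage_api("2.2.0"))  # prints 2.3.0
```

The point of the sketch is only that, once the numbers diverge, the mapping
has to be maintained by hand rather than read off the version strings.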

On 16/8/19, 16:31, "Owen O'Malley" <om...@apache.org> wrote:

>That won't necessarily work, especially in the beginning. If we release SA
>2.2.0 and use it for Hive trunk with the assumption that the next Hive
>release will be 2.2. What do we do when we need to make an incompatible
>change in SA? I guess we could release SA as 2.3.0 and when hive makes its
>next release skip over Hive 2.2 and go straight to Hive 2.3.0. In general
>I
>think that we'd be better off with the release numbers not tied together.
>
>.. Owen
>
>On Fri, Aug 19, 2016 at 4:14 PM, Sergey Shelukhin <se...@hortonworks.com>
>wrote:
>
>> Can we just run the versions thru? I.e. increment it every time but
>> release only one component (or both if they happen to align I guess).
>> E.g. storage-api will be released at 2.2, and say 2.3 if it moves fast,
>> then Hive 2.4, then storage-api 2.5, etc.
>> That might make it easier to reason about compatibility because the
>>order
>> is obvious.
>>
>> On 16/8/19, 09:04, "Sergio Pena" <se...@cloudera.com> wrote:
>>
>> >I see Parquet is currently using the SearchArgument class for
>>predicates
>> >push down.
>> >Will this class be part of the new sub-module or project?
>> >
>> >Following Sushanth idea, can we have other API interfaces in the new
>> >project that other components can use?
>> >Perhaps having this may be a good reason to create a project.
>> >
>> >I'm -1 with the 4th minor version. As Owen mentioned, changing the 4th
>> >version number for incompatible changes is ugly and confusing.
>> >I like the new project idea more, +1, but  the storage-api may be too
>> >small
>> >for a new project.
>> >
>> >- Sergio
>> >
>> >On Wed, Aug 17, 2016 at 2:05 PM, Owen O'Malley <om...@apache.org>
>> wrote:
>> >
>> >> On Wed, Aug 17, 2016 at 10:46 AM, Alan Gates <al...@gmail.com>
>> >>wrote:
>> >>
>> >> > +1 for making the API clean and easy for other projects to work
>>with.
>> >> A
>> >> > few questions:
>> >> >
>> >> > 1) Would this also make it easier for Parquet and others to
>>implement
>> >> > Hive’s ACID interfaces?
>> >> >
>> >>
>> >> Currently the ACID interfaces haven't been moved over to storage-api,
>> >> although it would make sense to do so at some point.
>> >>
>> >>
>> >> >
>> >> > 2) Would we make any attempt to coordinate version numbers between
>> >>Hive
>> >> > and the storage module, or would a given version of Hive just
>>depend
>> >>on a
>> >> > given version of the storage module?
>> >> >
>> >>
>> >> The two options that I see are:
>> >>
>> >> * Let the numbers run separately starting from 2.2.0.
>> >> * Tie the numbers together with an additional level of versioning
>>(eg.
>> >> 2.2.0.0).
>> >>
>> >> I think that letting the two version numbers diverge is better in the
>> >>long
>> >> term. For example, if you need to make an incompatible change, it is
>> >>pretty
>> >> ugly to do it as a fourth level version number (eg. an incompatible
>> >>change
>> >> from 2.2.0.0 to 2.2.0.1). At the beginning, I expect that storage-api
>> >>would
>> >> move faster than Hive, but as it stabilizes I expect it might start
>> >>moving
>> >> slower than Hive.
>> >>
>> >> I'd propose that we have Hive's build use a released version of
>> >>storage-api
>> >> rather than a snapshot.
>> >>
>> >> Thoughts?
>> >>
>> >>    Owen
>> >>
>> >>
>> >> > Alan.
>> >> >
>> >> > > On Aug 15, 2016, at 17:01, Owen O'Malley <om...@apache.org>
>> wrote:
>> >> > >
>> >> > > All,
>> >> > >
>> >> > > As part of moving ORC out of Hive, we pulled all of the
>> >>vectorization
>> >> > > storage and sarg classes into a separate module, which is named
>> >> > > storage-api.  Although it is currently only used by ORC, it
>>could be
>> >> used
>> >> > > by Parquet or Avro if they wanted to make a fast vectorized
>>reader
>> >>that
>> >> > > read directly in to Hive's VectorizedRowBatch without needing a
>> >>shim or
>> >> > > data copy. Note that this is in many ways similar to pulling the
>> >>Arrow
>> >> > > project out of Drill.
>> >> > >
>> >> > > This unfortunately still leaves us with a circular dependency
>> >>between
>> >> > Hive
>> >> > > and ORC. I'd hoped that storage-api wouldn't change that much,
>>but
>> >>that
>> >> > > doesn't seem to be happening. As a result, ORC ends up shipping
>>its
>> >>own
>> >> > > fork of storage-api.
>> >> > >
>> >> > > Although we could make a new project for just the storage-api, I
>> >>think
>> >> it
>> >> > > would be better to make it a subproject of Hive that is released
>> >> > > independently.
>> >> > >
>> >> > > What do others think?
>> >> > >
>> >> > >   Owen
>> >> >
>> >> >
>> >>
>>
>>

Re: [DISCUSS] Making storage-api a separately released artifact

Posted by Owen O'Malley <om...@apache.org>.

That won't necessarily work, especially in the beginning. Suppose we release
SA 2.2.0 and use it for Hive trunk with the assumption that the next Hive
release will be 2.2. What do we do when we need to make an incompatible
change in SA? I guess we could release SA as 2.3.0 and, when Hive makes its
next release, skip over Hive 2.2 and go straight to Hive 2.3.0. In general I
think that we'd be better off with the release numbers not tied together.

.. Owen

On Fri, Aug 19, 2016 at 4:14 PM, Sergey Shelukhin <se...@hortonworks.com>
wrote:

> Can we just run the versions thru? I.e. increment it every time but
> release only one component (or both if they happen to align I guess).
> E.g. storage-api will be released at 2.2, and say 2.3 if it moves fast,
> then Hive 2.4, then storage-api 2.5, etc.
> That might make it easier to reason about compatibility because the order
> is obvious.
>
> On 16/8/19, 09:04, "Sergio Pena" <se...@cloudera.com> wrote:
>
> >I see Parquet is currently using the SearchArgument class for predicates
> >push down.
> >Will this class be part of the new sub-module or project?
> >
> >Following Sushanth idea, can we have other API interfaces in the new
> >project that other components can use?
> >Perhaps having this may be a good reason to create a project.
> >
> >I'm -1 with the 4th minor version. As Owen mentioned, changing the 4th
> >version number for incompatible changes is ugly and confusing.
> >I like the new project idea more, +1, but  the storage-api may be too
> >small
> >for a new project.
> >
> >- Sergio
> >
> >On Wed, Aug 17, 2016 at 2:05 PM, Owen O'Malley <om...@apache.org>
> wrote:
> >
> >> On Wed, Aug 17, 2016 at 10:46 AM, Alan Gates <al...@gmail.com>
> >>wrote:
> >>
> >> > +1 for making the API clean and easy for other projects to work with.
> >> A
> >> > few questions:
> >> >
> >> > 1) Would this also make it easier for Parquet and others to implement
> >> > Hive’s ACID interfaces?
> >> >
> >>
> >> Currently the ACID interfaces haven't been moved over to storage-api,
> >> although it would make sense to do so at some point.
> >>
> >>
> >> >
> >> > 2) Would we make any attempt to coordinate version numbers between
> >>Hive
> >> > and the storage module, or would a given version of Hive just depend
> >>on a
> >> > given version of the storage module?
> >> >
> >>
> >> The two options that I see are:
> >>
> >> * Let the numbers run separately starting from 2.2.0.
> >> * Tie the numbers together with an additional level of versioning (eg.
> >> 2.2.0.0).
> >>
> >> I think that letting the two version numbers diverge is better in the
> >>long
> >> term. For example, if you need to make an incompatible change, it is
> >>pretty
> >> ugly to do it as a fourth level version number (eg. an incompatible
> >>change
> >> from 2.2.0.0 to 2.2.0.1). At the beginning, I expect that storage-api
> >>would
> >> move faster than Hive, but as it stabilizes I expect it might start
> >>moving
> >> slower than Hive.
> >>
> >> I'd propose that we have Hive's build use a released version of
> >>storage-api
> >> rather than a snapshot.
> >>
> >> Thoughts?
> >>
> >>    Owen
> >>
> >>
> >> > Alan.
> >> >
> >> > > On Aug 15, 2016, at 17:01, Owen O'Malley <om...@apache.org>
> wrote:
> >> > >
> >> > > All,
> >> > >
> >> > > As part of moving ORC out of Hive, we pulled all of the
> >>vectorization
> >> > > storage and sarg classes into a separate module, which is named
> >> > > storage-api.  Although it is currently only used by ORC, it could be
> >> used
> >> > > by Parquet or Avro if they wanted to make a fast vectorized reader
> >>that
> >> > > read directly in to Hive's VectorizedRowBatch without needing a
> >>shim or
> >> > > data copy. Note that this is in many ways similar to pulling the
> >>Arrow
> >> > > project out of Drill.
> >> > >
> >> > > This unfortunately still leaves us with a circular dependency
> >>between
> >> > Hive
> >> > > and ORC. I'd hoped that storage-api wouldn't change that much, but
> >>that
> >> > > doesn't seem to be happening. As a result, ORC ends up shipping its
> >>own
> >> > > fork of storage-api.
> >> > >
> >> > > Although we could make a new project for just the storage-api, I
> >>think
> >> it
> >> > > would be better to make it a subproject of Hive that is released
> >> > > independently.
> >> > >
> >> > > What do others think?
> >> > >
> >> > >   Owen
> >> >
> >> >
> >>
>
>

Re: [DISCUSS] Making storage-api a separately released artifact

Posted by Sergey Shelukhin <se...@hortonworks.com>.

Can we just run the versions through? I.e. increment the number every time but
release only one component (or both if they happen to align, I guess).
E.g. storage-api will be released at 2.2, and say 2.3 if it moves fast,
then Hive 2.4, then storage-api 2.5, etc.
That might make it easier to reason about compatibility because the order
is obvious.
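
A minimal sketch of this interleaved numbering, assuming a single counter
shared by both components. The class name SharedVersionCounter and its fields
are hypothetical; nothing like it exists in Hive's build.

```python
from dataclasses import dataclass, field


@dataclass
class SharedVersionCounter:
    """Hands out minor versions from one sequence shared by both components."""

    major: int = 2
    next_minor: int = 2  # the example above starts at 2.2
    history: list = field(default_factory=list)

    def release(self, component: str) -> str:
        # Every release, from either component, consumes the next number,
        # so the release order can be read directly from the versions.
        version = f"{self.major}.{self.next_minor}"
        self.next_minor += 1
        self.history.append((component, version))
        return version


counter = SharedVersionCounter()
for component in ["storage-api", "storage-api", "hive", "storage-api"]:
    print(component, counter.release(component))
# prints:
# storage-api 2.2
# storage-api 2.3
# hive 2.4
# storage-api 2.5
```

Under this scheme each component sees gaps in its own version history, which
is the price of the ordering being obvious.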

On 16/8/19, 09:04, "Sergio Pena" <se...@cloudera.com> wrote:

>I see Parquet is currently using the SearchArgument class for predicates
>push down.
>Will this class be part of the new sub-module or project?
>
>Following Sushanth idea, can we have other API interfaces in the new
>project that other components can use?
>Perhaps having this may be a good reason to create a project.
>
>I'm -1 with the 4th minor version. As Owen mentioned, changing the 4th
>version number for incompatible changes is ugly and confusing.
>I like the new project idea more, +1, but  the storage-api may be too
>small
>for a new project.
>
>- Sergio
>
>On Wed, Aug 17, 2016 at 2:05 PM, Owen O'Malley <om...@apache.org> wrote:
>
>> On Wed, Aug 17, 2016 at 10:46 AM, Alan Gates <al...@gmail.com>
>>wrote:
>>
>> > +1 for making the API clean and easy for other projects to work with.
>> A
>> > few questions:
>> >
>> > 1) Would this also make it easier for Parquet and others to implement
>> > Hive’s ACID interfaces?
>> >
>>
>> Currently the ACID interfaces haven't been moved over to storage-api,
>> although it would make sense to do so at some point.
>>
>>
>> >
>> > 2) Would we make any attempt to coordinate version numbers between
>>Hive
>> > and the storage module, or would a given version of Hive just depend
>>on a
>> > given version of the storage module?
>> >
>>
>> The two options that I see are:
>>
>> * Let the numbers run separately starting from 2.2.0.
>> * Tie the numbers together with an additional level of versioning (eg.
>> 2.2.0.0).
>>
>> I think that letting the two version numbers diverge is better in the
>>long
>> term. For example, if you need to make an incompatible change, it is
>>pretty
>> ugly to do it as a fourth level version number (eg. an incompatible
>>change
>> from 2.2.0.0 to 2.2.0.1). At the beginning, I expect that storage-api
>>would
>> move faster than Hive, but as it stabilizes I expect it might start
>>moving
>> slower than Hive.
>>
>> I'd propose that we have Hive's build use a released version of
>>storage-api
>> rather than a snapshot.
>>
>> Thoughts?
>>
>>    Owen
>>
>>
>> > Alan.
>> >
>> > > On Aug 15, 2016, at 17:01, Owen O'Malley <om...@apache.org> wrote:
>> > >
>> > > All,
>> > >
>> > > As part of moving ORC out of Hive, we pulled all of the
>>vectorization
>> > > storage and sarg classes into a separate module, which is named
>> > > storage-api.  Although it is currently only used by ORC, it could be
>> used
>> > > by Parquet or Avro if they wanted to make a fast vectorized reader
>>that
>> > > read directly in to Hive's VectorizedRowBatch without needing a
>>shim or
>> > > data copy. Note that this is in many ways similar to pulling the
>>Arrow
>> > > project out of Drill.
>> > >
>> > > This unfortunately still leaves us with a circular dependency
>>between
>> > Hive
>> > > and ORC. I'd hoped that storage-api wouldn't change that much, but
>>that
>> > > doesn't seem to be happening. As a result, ORC ends up shipping its
>>own
>> > > fork of storage-api.
>> > >
>> > > Although we could make a new project for just the storage-api, I
>>think
>> it
>> > > would be better to make it a subproject of Hive that is released
>> > > independently.
>> > >
>> > > What do others think?
>> > >
>> > >   Owen
>> >
>> >
>>

Re: [DISCUSS] Making storage-api a separately released artifact

Posted by Sergio Pena <se...@cloudera.com>.

I see Parquet is currently using the SearchArgument class for predicate
pushdown.
Will this class be part of the new sub-module or project?

Following Sushanth's idea, can we have other API interfaces in the new
project that other components can use?
Having these may be a good reason to create a project.

I'm -1 on the fourth-level version number. As Owen mentioned, changing the
fourth version number for incompatible changes is ugly and confusing.
I like the new project idea more, +1, but the storage-api may be too small
for a new project.

- Sergio

On Wed, Aug 17, 2016 at 2:05 PM, Owen O'Malley <om...@apache.org> wrote:

> On Wed, Aug 17, 2016 at 10:46 AM, Alan Gates <al...@gmail.com> wrote:
>
> > +1 for making the API clean and easy for other projects to work with.  A
> > few questions:
> >
> > 1) Would this also make it easier for Parquet and others to implement
> > Hive’s ACID interfaces?
> >
>
> Currently the ACID interfaces haven't been moved over to storage-api,
> although it would make sense to do so at some point.
>
>
> >
> > 2) Would we make any attempt to coordinate version numbers between Hive
> > and the storage module, or would a given version of Hive just depend on a
> > given version of the storage module?
> >
>
> The two options that I see are:
>
> * Let the numbers run separately starting from 2.2.0.
> * Tie the numbers together with an additional level of versioning (eg.
> 2.2.0.0).
>
> I think that letting the two version numbers diverge is better in the long
> term. For example, if you need to make an incompatible change, it is pretty
> ugly to do it as a fourth level version number (eg. an incompatible change
> from 2.2.0.0 to 2.2.0.1). At the beginning, I expect that storage-api would
> move faster than Hive, but as it stabilizes I expect it might start moving
> slower than Hive.
>
> I'd propose that we have Hive's build use a released version of storage-api
> rather than a snapshot.
>
> Thoughts?
>
>    Owen
>
>
> > Alan.
> >
> > > On Aug 15, 2016, at 17:01, Owen O'Malley <om...@apache.org> wrote:
> > >
> > > All,
> > >
> > > As part of moving ORC out of Hive, we pulled all of the vectorization
> > > storage and sarg classes into a separate module, which is named
> > > storage-api.  Although it is currently only used by ORC, it could be
> used
> > > by Parquet or Avro if they wanted to make a fast vectorized reader that
> > > read directly in to Hive's VectorizedRowBatch without needing a shim or
> > > data copy. Note that this is in many ways similar to pulling the Arrow
> > > project out of Drill.
> > >
> > > This unfortunately still leaves us with a circular dependency between
> > Hive
> > > and ORC. I'd hoped that storage-api wouldn't change that much, but that
> > > doesn't seem to be happening. As a result, ORC ends up shipping its own
> > > fork of storage-api.
> > >
> > > Although we could make a new project for just the storage-api, I think
> it
> > > would be better to make it a subproject of Hive that is released
> > > independently.
> > >
> > > What do others think?
> > >
> > >   Owen
> >
> >
>

Re: [DISCUSS] Making storage-api a separately released artifact

Posted by Owen O'Malley <om...@apache.org>.

On Wed, Aug 17, 2016 at 10:46 AM, Alan Gates <al...@gmail.com> wrote:

> +1 for making the API clean and easy for other projects to work with.  A
> few questions:
>
> 1) Would this also make it easier for Parquet and others to implement
> Hive’s ACID interfaces?
>

Currently the ACID interfaces haven't been moved over to storage-api,
although it would make sense to do so at some point.

>
> 2) Would we make any attempt to coordinate version numbers between Hive
> and the storage module, or would a given version of Hive just depend on a
> given version of the storage module?
>

The two options that I see are:

* Let the numbers run separately starting from 2.2.0.
* Tie the numbers together with an additional level of versioning (e.g.
2.2.0.0).

I think that letting the two version numbers diverge is better in the long
term. For example, if you need to make an incompatible change, it is pretty
ugly to do it as a fourth-level version number (e.g. an incompatible change
from 2.2.0.0 to 2.2.0.1). At the beginning, I expect that storage-api would
move faster than Hive, but as it stabilizes I expect it might start moving
slower than Hive.
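
A rough sketch of why the fourth-level number is awkward. It assumes a
convention, not stated anywhere in Hive's policy, that two versions agreeing
on their leading two components are expected to be compatible; the helper
names parse and looks_compatible are hypothetical.

```python
def parse(version: str) -> tuple:
    """Split a dotted version string into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


def looks_compatible(old: str, new: str, significant: int = 2) -> bool:
    """Assumed convention: versions that agree on their leading
    `significant` components are expected to be compatible."""
    return parse(old)[:significant] == parse(new)[:significant]


# Tied numbering: the first three components follow Hive, so an incompatible
# storage-api change can only bump the fourth, and the break is invisible.
print(looks_compatible("2.2.0.0", "2.2.0.1"))  # prints True

# Independent numbering: storage-api can bump its own minor version to
# signal the break.
print(looks_compatible("2.2.0", "2.3.0"))  # prints False
```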

I'd propose that we have Hive's build use a released version of storage-api
rather than a snapshot.

Thoughts?

   Owen

> Alan.
>
> > On Aug 15, 2016, at 17:01, Owen O'Malley <om...@apache.org> wrote:
> >
> > All,
> >
> > As part of moving ORC out of Hive, we pulled all of the vectorization
> > storage and sarg classes into a separate module, which is named
> > storage-api.  Although it is currently only used by ORC, it could be used
> > by Parquet or Avro if they wanted to make a fast vectorized reader that
> > read directly in to Hive's VectorizedRowBatch without needing a shim or
> > data copy. Note that this is in many ways similar to pulling the Arrow
> > project out of Drill.
> >
> > This unfortunately still leaves us with a circular dependency between
> Hive
> > and ORC. I'd hoped that storage-api wouldn't change that much, but that
> > doesn't seem to be happening. As a result, ORC ends up shipping its own
> > fork of storage-api.
> >
> > Although we could make a new project for just the storage-api, I think it
> > would be better to make it a subproject of Hive that is released
> > independently.
> >
> > What do others think?
> >
> >   Owen
>
>

Re: [DISCUSS] Making storage-api a separately released artifact

Posted by Alan Gates <al...@gmail.com>.

+1 for making the API clean and easy for other projects to work with.  A few questions:

1) Would this also make it easier for Parquet and others to implement Hive’s ACID interfaces?

2) Would we make any attempt to coordinate version numbers between Hive and the storage module, or would a given version of Hive just depend on a given version of the storage module?

Alan.

> On Aug 15, 2016, at 17:01, Owen O'Malley <om...@apache.org> wrote:
> 
> All,
> 
> As part of moving ORC out of Hive, we pulled all of the vectorization
> storage and sarg classes into a separate module, which is named
> storage-api.  Although it is currently only used by ORC, it could be used
> by Parquet or Avro if they wanted to make a fast vectorized reader that
> read directly in to Hive's VectorizedRowBatch without needing a shim or
> data copy. Note that this is in many ways similar to pulling the Arrow
> project out of Drill.
> 
> This unfortunately still leaves us with a circular dependency between Hive
> and ORC. I'd hoped that storage-api wouldn't change that much, but that
> doesn't seem to be happening. As a result, ORC ends up shipping its own
> fork of storage-api.
> 
> Although we could make a new project for just the storage-api, I think it
> would be better to make it a subproject of Hive that is released
> independently.
> 
> What do others think?
> 
>   Owen