Posted to user@pig.apache.org by Olga Natkovich <ol...@yahoo-inc.com> on 2011/11/07 20:15:50 UTC

[DISCUSSION]Pig releases with different versions of Hadoop

Hi,

In the past we have for the most part avoided supporting multiple versions of Hadoop with the same version of Pig. This is about to change with the release of Hadoop 23, so we need to come up with a strategy for how to support that. There are a couple of issues to consider:


(1)    Version numbering. Encoding the Hadoop version in the last component of the Pig version number seems to make sense; the details of the encoding need to be hashed out.

(2)    Code changes required to support different versions of Hadoop. This time around we made an effort to make sure that the same code can work with both. In the future that might not work, and we would need to figure out how to maintain different code bases, most likely with additional branches off of the main release branch.

(3)    Anything else we need to consider?

Olga

Re: [DISCUSSION]Pig releases with different versions of Hadoop

Posted by Russell Jurney <ru...@gmail.com>.
Option 2 is consistent with 'Pigs eat anything.'

Russell Jurney
twitter.com/rjurney
russell.jurney@gmail.com
datasyndrome.com


Re: [DISCUSSION]Pig releases with different versions of Hadoop

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
I suspect it might be easier / more maintainable / still useful to only
officially support a couple of versions (and test on both), with "best
effort" support for others. So, for example, the current de-facto situation
is to support 0.20.2 (currently the only "officially supported" version), and
maybe 0.20.205 (which I am guessing is what Hortonworks devs / customers
are mostly running). We can say that we provide "best effort" compatibility
for CDH{2,3}. In the future, I see this shifting to "official" support for
0.20.205 and 0.23, with "best effort" compatibility for 0.22 and CDH{3,4}.
Compile-time switches can control which hadoop version you build for.

Pig should expose some way to programmatically determine which version of
hadoop it was compiled against (and what version of Pig it is).
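One way to expose that is to stamp the build information into the jar and read it back at runtime. A minimal sketch follows; the resource name `pig-build-info.properties` and its keys are invented for illustration, not actual Pig artifacts:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// Hypothetical build-info reader: assumes the build writes a
// pig-build-info.properties resource (with pig.version and
// hadoop.version keys) into the jar at compile time.
class BuildInfo {
    private static final Properties PROPS = new Properties();

    static {
        try (InputStream in =
                 BuildInfo.class.getResourceAsStream("/pig-build-info.properties")) {
            if (in != null) {
                PROPS.load(in);
            }
        } catch (IOException e) {
            // Fall through: the getters report "unknown".
        }
    }

    static String pigVersion() {
        return PROPS.getProperty("pig.version", "unknown");
    }

    static String hadoopVersion() {
        return PROPS.getProperty("hadoop.version", "unknown");
    }
}
```

If the resource is missing (for instance in a hand-rolled build), both getters fall back to "unknown" rather than failing.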

Ideally, we could rely on BigTop to help with ensuring a reasonable
compatibility level with the "best effort" versions.

I suspect maintaining a separate release for every hadoop version, given
the number of them, is going to be unmaintainable.

D


Re: [DISCUSSION]Pig releases with different versions of Hadoop

Posted by Alan Gates <ga...@hortonworks.com>.
On Nov 7, 2011, at 11:15 AM, Olga Natkovich wrote:

> Hi,
> 
> In the past we have for the most part avoided supporting multiple versions of Hadoop with the same version of Pig. This is about to change with release of Hadoop 23. We need to come up with a strategy on how to support that. There are a couple of issues to consider:
> 
> 
> (1)    Version numbering. Seems like encoding the information in the last version number makes sense. The details of the encoding need to be hashed out

I can see two options.  One is to do major.minor.patch.hadoopversion, so for example 0.10.1.h23 and 0.10.1.h20.  The problem I see with that is we *have* to guarantee that they have the same functionality.  That is, 0.10.1 has all the same patches regardless of which Hadoop version it targets (excepting perhaps patches specific to a particular Hadoop version); the only difference is which one it's compiled for.  Another problem is that this will proliferate versions, cluttering up our website, confusing our users, and forcing the PMC into vote after vote.
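For concreteness, tooling could pull the Hadoop tag back out of such a version string with a simple pattern. This is a sketch of the scheme proposed in this thread, not a format any released Pig artifact actually uses:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Parser for the proposed major.minor.patch.hadoopversion scheme,
// e.g. "0.10.1.h23". The format is this thread's proposal only.
class PigVersionString {
    private static final Pattern FORMAT =
        Pattern.compile("(\\d+)\\.(\\d+)\\.(\\d+)\\.h(\\d+)");

    /** Returns the Hadoop tag (e.g. "23"), or null if the string
     *  doesn't follow the proposed scheme. */
    static String hadoopTag(String version) {
        Matcher m = FORMAT.matcher(version);
        return m.matches() ? m.group(4) : null;
    }
}
```

A plain "0.10.1" yields null, which is how a download script could tell a version-tagged artifact from an untagged one.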

The second option would be to rework the pig package so that it has the jars for both, with the pig shell script figuring out, based on the Hadoop it finds, which version is being used.  This has the nice feature of guaranteeing the same features, but it has a few downsides.  One, it bloats our package (since it's carrying multiple jars).  Two, what happens when someone wants to add support for a new version (say Hadoop 22) to an existing release?  Three, a release manager must now have access to all versions of Hadoop we claim to cover, or wait for help from those who do, in order to test a release.

Hive chose the second option, and dealt with the bloating issue by isolating all the version specific code in one jar.  

We could deal with the concern of adding new versions to an existing release by saying it's not allowed: if you want to add a new supported version, you create a new Pig version.  This will devolve into situations like versions 0.10 and 0.12 working on 20 and 23 but 0.11 working on 22, which will be horribly confusing for our users.

I think the third issue, testability, is going to mean certain Pig versions only support certain Hadoop versions without that being explicitly marked anywhere.  Again, I think this is really bad.

So I vote for the major.minor.patch.hadoopversion solution, though I think we should work hard to make it clear to users how to select the right version of Pig when downloading it.


> 
> (2)    Code changes required to support different version of Hadoop. This time around we made an effort to make sure that the same code can work with both. In the future that might not work and we would need to figure out how to maintain different code base. Most likely we would have to have additional branches off of main release branch

Hopefully we can continue to do this via conditional compilation.  Having different branches isn't maintainable: how do I push a Hadoop-version-specific patch to the next release?  We'd get an ever-growing collection of patches that have to be applied on a Hadoop-specific branch for every release.  We need to continue the rule that any patch must apply to trunk, even when it's version specific.
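The usual shape of this on trunk is a shims layer: all version-specific calls go through one interface, each Hadoop version gets its own implementation in its own source tree, and the build compiles in exactly one. The names below are hypothetical, not Pig's actual classes, and the in-memory counter map is a stand-in for a real delegation to a Hadoop API:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative shims interface; real version-specific differences
// (e.g. how counters are fetched) would hide behind methods like this.
interface HadoopShims {
    long counterValue(String group, String name);
}

// Stand-in for a 0.20 implementation, living in its own source tree
// (say src/shims/hadoop20) so only one impl is compiled per build.
class Hadoop20Shims implements HadoopShims {
    private final Map<String, Long> counters = new HashMap<String, Long>();

    void record(String group, String name, long value) {
        counters.put(group + ":" + name, value);
    }

    public long counterValue(String group, String name) {
        Long v = counters.get(group + ":" + name);
        return v == null ? 0L : v;
    }

    // Tiny self-contained demonstration of the shim in use.
    static long demo() {
        Hadoop20Shims shims = new Hadoop20Shims();
        shims.record("task", "records", 42L);
        return shims.counterValue("task", "records");
    }
}
```

Since every implementation lives on trunk, a version-specific patch still applies to trunk; only the build's choice of source tree differs.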

> 
> (3)    Anything else we need to consider?
> 
> Olga

Alan.

Re: [DISCUSSION]Pig releases with different versions of Hadoop

Posted by Daniel Dai <da...@hortonworks.com>.
Hi, Alejandro,
I understand your concern, but creating multiple pig.jars is inevitable. See
my comments below.

Daniel

On Mon, Nov 7, 2011 at 11:40 AM, Alejandro Abdelnur <tu...@cloudera.com>wrote:

> Hi Olga,
>
> Regarding #1, does this mean we'd have a build of Pig X for each
> version of Hadoop we support? It seems to me this would be a bit
> complex to maintain.
>
Yes. Currently we only plan to support 20.x and 23 (there is some work
for hadoop 22 in PIG-2277 <https://issues.apache.org/jira/browse/PIG-2277>,
but I don't know how it will end up). This is complex, but I cannot see how
we can avoid it. Hopefully hadoop will converge and become API stable, so
that we don't need to do this trick in future hadoop releases.

>
> Regarding #2, If Hadoop does a good job at maintaining public API
> backwards compatibility and Pig uses only Hadoop public API we would
> be good.
>
That's not true, at least for the new APIs in 23.

>
> Regarding #3, still I can see potential issues (from my experience
> with Hadoop-Oozie) where the API did not change but the behavior did.
> This means we'll have to be able to if/then/else within Pig whenever
> necessary based on the version of Hadoop.
>
We already do such tricks where we can solve the version divergence using
if/then/else or reflection; in those cases we only need to maintain one
pig.jar. However, there are some static dependencies which cannot be
resolved by these tricks, and that's why we do need a shims layer and have
to generate a different pig.jar for each version of hadoop.
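A minimal sketch of the reflection side of this trick: when the divergence can be probed at runtime, one pig.jar can branch on what it finds. The JobContext class-vs-interface check below is one well-known heuristic of this kind; treat the exact class name and the claim about which release changed it as assumptions, not settled fact:

```java
// Runtime probe for which Hadoop line is on the classpath, using only
// reflection so the probe itself links against no Hadoop classes.
class HadoopVersionProbe {
    /** True if the 0.23-era MapReduce API appears to be present. */
    static boolean looksLikeHadoop23() {
        try {
            // Assumption: JobContext is a concrete class in 0.20 but
            // became an interface in the 0.23 line.
            return Class.forName("org.apache.hadoop.mapreduce.JobContext")
                        .isInterface();
        } catch (ClassNotFoundException e) {
            return false;  // no Hadoop on the classpath at all
        }
    }
}
```

What this cannot solve, as noted above, is a static dependency: if Pig code must extend or implement a type whose shape changed, the decision has to be made at compile time, hence the shims layer and per-version jars.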

>
> A possible way of addressing this would be:
>
> * Pig should use the 'hadoop' command to run Pig (this would help to cleanly
> bring the Hadoop dependencies into the classpath).
>
We've already done that in PIG-2239.


> * Pig could have a whitelist of Hadoop version it supports and fail if
> the current hadoop version is not supported (we could use version
> regex/ranges)
> * (what I'm suggesting in #3 above) Pig could use the Hadoop version
> as a code selector whenever necessary.
>
> Thanks.
>
> Alejandro
>

Re: [DISCUSSION]Pig releases with different versions of Hadoop

Posted by Alejandro Abdelnur <tu...@cloudera.com>.
Hi Olga,

Regarding #1, does this mean we'd have a build of Pig X for each
version of Hadoop we support? It seems to me this would be a bit
complex to maintain.

Regarding #2, If Hadoop does a good job at maintaining public API
backwards compatibility and Pig uses only Hadoop public API we would
be good.

Regarding #3, still I can see potential issues (from my experience
with Hadoop-Oozie) where the API did not change but the behavior did.
This means we'll have to be able to if/then/else within Pig whenever
necessary based on the version of Hadoop.

A possible way of addressing this would be:

* Pig should use the 'hadoop' command to run Pig (this would help to cleanly
bring the Hadoop dependencies into the classpath).
* Pig could have a whitelist of Hadoop version it supports and fail if
the current hadoop version is not supported (we could use version
regex/ranges)
* (what I'm suggesting in #3 above) Pig could use the Hadoop version
as a code selector whenever necessary.
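The whitelist idea could be as small as a regex checked once at startup. The supported set below is invented for illustration (any 0.20.x or 0.23.x release), not an actual Pig compatibility statement:

```java
import java.util.regex.Pattern;

// Fail-fast whitelist sketch: refuse to start against a Hadoop
// version this build does not claim to support.
class HadoopWhitelist {
    // Illustrative range: any 0.20.x or 0.23.x release.
    private static final Pattern SUPPORTED =
        Pattern.compile("0\\.(20|23)\\..*");

    static boolean isSupported(String hadoopVersion) {
        return SUPPORTED.matcher(hadoopVersion).matches();
    }

    static void check(String hadoopVersion) {
        if (!isSupported(hadoopVersion)) {
            throw new IllegalStateException(
                "Unsupported Hadoop version: " + hadoopVersion);
        }
    }
}
```

Failing fast with an explicit message is friendlier than the obscure runtime errors users would otherwise hit mid-job on a mismatched cluster.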

Thanks.

Alejandro
