Posted to user@pig.apache.org by Geoffrey Gallaway <ge...@geoffeg.org> on 2011/01/19 23:24:07 UTC

Managing pig script jar dependencies

I'm looking for some suggestions and ideas for how to handle JAR
dependencies in a production environment.

Most of the pig scripts I write require multiple JAR files. For instance, I
have a pig script that processes some data through a Solr instance which
requires my Solr UDF and some solr, lucene and apache commons jars. These
pig scripts are stored in a git repo and that git repo is deployed to our
production cluster. Obviously we don't want to store the jars in git; I'd
rather store them in our mvn repo with the rest of the jars the company
uses.

The plan is to have a maven pom.xml for each pig script that defines which
jars that pig script depends on. A shell script will then call "mvn
dependency:copy-dependencies -DoutputDirectory=pig-jars" before calling the
actual pig command to run the script. Given that, I'm trying to figure out
the best solution to a few questions.

* For development I'd like to store the pig jar (pig-0.7.0-core.jar) in
maven but there is no pom.xml for that jar (easily fixed) and that jar
contains all the java prerequisites (javax.servlet, apache commons, etc)
which seem to be making maven unhappy when I try to import it into the maven
company repo. Is there a pig-only jar?

* What do other people use to deploy their code to various systems? Check in
jars with the code? Keep jars in a separate, network-based directory?
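
The wrapper plan described above (a per-script pom.xml, then "mvn dependency:copy-dependencies", then pig) could be sketched roughly as follows. This is only an illustration, written in Python rather than shell. The function names, the ".generated" file suffix and the idea of prepending REGISTER statements to the script are assumptions made for the example and are not part of the plan as stated.

```python
import glob
import os
import subprocess


def register_statements(jar_paths):
    """Return one Pig REGISTER statement per jar, in sorted order."""
    return ["REGISTER %s;" % path for path in sorted(jar_paths)]


def prepare_script(script_text, jar_paths):
    """Prepend REGISTER statements so the script itself need not list jars."""
    header = "\n".join(register_statements(jar_paths))
    return header + "\n" + script_text if header else script_text


def run(script_path, jar_dir="pig-jars"):
    """Copy dependencies with Maven, then run the generated Pig script."""
    # Same Maven goal as in the plan; expects a pom.xml in the working dir.
    subprocess.check_call(
        ["mvn", "dependency:copy-dependencies",
         "-DoutputDirectory=" + jar_dir])
    jars = glob.glob(os.path.join(jar_dir, "*.jar"))
    with open(script_path) as src:
        generated = prepare_script(src.read(), jars)
    generated_path = script_path + ".generated"  # hypothetical naming
    with open(generated_path, "w") as dst:
        dst.write(generated)
    subprocess.check_call(["pig", "-f", generated_path])
```

Generating the REGISTER lines keeps the jar list in one place (the pom.xml), so the script in git never has to name individual jars.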

Geoff
-- 
Sent from my email client.

Re: Managing pig script jar dependencies

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
The shade plugin is useful only up to a point. For example, signed jars do
not survive shading in my experience (the BouncyCastle library comes to mind).
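
A signed jar carries signature files in META-INF/ that record digests of the original jar's contents, so once its classes are merged into an "uber" JAR those records generally no longer match. A minimal sketch for spotting such jars before shading is shown below. The function names are made up for this example, and the suffix list is an assumption about which signature file types are of interest.

```python
import zipfile

# Signature-related file types stored directly under META-INF/.
SIGNATURE_SUFFIXES = (".SF", ".RSA", ".DSA", ".EC")


def signature_entries(jar_file):
    """Return META-INF signature entries of a jar (path or file object)."""
    found = []
    with zipfile.ZipFile(jar_file) as jar:
        for name in jar.namelist():
            upper = name.upper()
            if not upper.startswith("META-INF/"):
                continue
            if "/" in name[len("META-INF/"):]:
                continue  # only files directly inside META-INF/ count
            if upper.endswith(SIGNATURE_SUFFIXES):
                found.append(name)
    return found


def is_signed(jar_file):
    """True if the jar appears to carry a signature."""
    return bool(signature_entries(jar_file))
```

Jars flagged this way are candidates for being kept out of the shaded artifact and shipped alongside it instead.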

-d

On Thu, Jan 20, 2011 at 10:44 PM, Erik Onnen <eo...@gmail.com> wrote:

> As a new member to the list, I offer our lone data point. We use the maven
> shade plugin: http://maven.apache.org/plugins/maven-shade-plugin/
>
> Shade produces an "uber" JAR with an optional declared main class.
>
> On the up side, for a reasonable number of dependencies (in our case ~40),
> it just works and results in a single JAR. We're lucky enough that across
> the board, we can use one JAR for launching a message consumer, a Hadoop
> job, and a Pig job.
>
> That said, there are two caveats we've encountered:
> * System dependencies aren't rolled into the "uber" JAR - if you want
> something to be in the deployment artifact, you need to at a minimum put it
> into your local repo - we do this via bash scripting for HBase 0.90.0 for
> example.
> * Conflicts - so far we've managed to do a maven dependency:tree and
> exclude conflicting dependencies, but I'm sure there is a point where that
> will not work any more.
>
> I'd love to hear how others are solving the problem, so far this has worked
> for us.
>
> -erik

Re: Managing pig script jar dependencies

Posted by Alejandro Abdelnur <tu...@cloudera.com>.
In Oozie we ran into a similar problem.

As workflows with Pig actions proliferate, the lib/ directory of each
workflow app had to contain Pig and its dependent JARs. This becomes a
nightmare to maintain as the number of workflow apps increases.

The approach to solve this was to add to oozie the concept of a sharelib/
directory in HDFS.

Then copy to the sharelib/ all the JARs you want to use across multiple
workflow applications.

When submitting a workflow you can specify the sharelib/ dir you want to use,
or you can tell Oozie to use the system sharelib/ (the default one).

Oozie then adds all the JARs in the specified sharelib/ to the distributed
cache for the Pig job.

The benefit of this approach is that JAR files are stored only once in HDFS
and can be managed and updated globally. And users won't miss a JAR by
mistake.

This feature is coming in Oozie 2.3

Pig could easily have a -sharelib option that points to an HDFS sharelib/
directory thus achieving the same.

<ad>
BTW, as Oozie supports submitting Pig jobs through Oozie, by doing 'oozie pig
-f ....' you get the feature for free, plus Oozie becomes a Pig server (you
get a job ID and can track progress later), all without having to write a
workflow.
</ad>

Hope this helps.

Alejandro


On Fri, Jan 21, 2011 at 2:44 PM, Erik Onnen <eo...@gmail.com> wrote:

> As a new member to the list, I offer our lone data point. We use the maven
> shade plugin: http://maven.apache.org/plugins/maven-shade-plugin/
>
> Shade produces an "uber" JAR with an optional declared main class.
>
> On the up side, for a reasonable number of dependencies (in our case ~40),
> it just works and results in a single JAR. We're lucky enough that across
> the board, we can use one JAR for launching a message consumer, a Hadoop
> job, and a Pig job.
>
> That said, there are two caveats we've encountered:
> * System dependencies aren't rolled into the "uber" JAR - if you want
> something to be in the deployment artifact, you need to at a minimum put it
> into your local repo - we do this via bash scripting for HBase 0.90.0 for
> example.
> * Conflicts - so far we've managed to do a maven dependency:tree and
> exclude conflicting dependencies, but I'm sure there is a point where that
> will not work any more.
>
> I'd love to hear how others are solving the problem, so far this has worked
> for us.
>
> -erik

Re: Managing pig script jar dependencies

Posted by Erik Onnen <eo...@gmail.com>.
As a new member to the list, I offer our lone data point. We use the maven
shade plugin: http://maven.apache.org/plugins/maven-shade-plugin/

Shade produces an "uber" JAR with an optional declared main class.

On the up side, for a reasonable number of dependencies (in our case ~40),
it just works and results in a single JAR. We're lucky enough that across
the board, we can use one JAR for launching a message consumer, a Hadoop
job, and a Pig job.

That said, there are two caveats we've encountered:
* System dependencies aren't rolled into the "uber" JAR - if you want
something to be in the deployment artifact, you need to at a minimum put it
into your local repo - we do this via bash scripting for HBase 0.90.0 for
example.
* Conflicts - so far we've managed to do a maven dependency:tree and exclude
conflicting dependencies, but I'm sure there is a point where that will not
work any more.
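
To illustrate the conflict caveat: one way to spot candidates for exclusion is to collect the dependency coordinates printed by dependency:tree (possibly from several projects) and look for artifacts that appear with more than one version. The sketch below assumes coordinates of the form groupId:artifactId:type:version:scope, optionally with a classifier before the version. The function name is made up for this example.

```python
from collections import defaultdict


def find_version_conflicts(coordinates):
    """Map 'groupId:artifactId' to its versions when more than one appears."""
    versions = defaultdict(set)
    for coord in coordinates:
        parts = coord.strip().split(":")
        if len(parts) < 5:
            continue  # not a full dependency coordinate; ignore it
        group, artifact = parts[0], parts[1]
        version = parts[-2]  # scope comes last, version just before it
        versions[group + ":" + artifact].add(version)
    return {key: sorted(found) for key, found in versions.items()
            if len(found) > 1}
```

Each key in the result is an artifact that would need an exclusion (or a pinned version) before everything is rolled into a single JAR.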

I'd love to hear how others are solving the problem, so far this has worked
for us.

-erik


On Thu, Jan 20, 2011 at 7:31 PM, Kaluskar, Sanjay
<skaluskar@informatica.com> wrote:

> Hi Dmitriy,
>
> Well, what I have is still experimental & not in any product. But, yes
> we can compile to a Pig script. I try to use the native relational
> operators where possible & use UDFs in other cases.
>
> I don't understand which conflicts you are referring to. Initially, I
> was trying to create a single jar (containing all the 300 dependencies)
> using the maven-dependency-plugin (BTW that seems to be the recommended
> approach & should work in many cases) but it turned out that some of our
> internal components had conflicting file names for some of the resources
> (should probably be fixed!). My current approach works better because I
> don't try to re-package any dependency. Yes, startup times are slow - of
> course, I am open to other ideas :-)

RE: Managing pig script jar dependencies

Posted by "Kaluskar, Sanjay" <sk...@informatica.com>.
Hi Dmitriy,

Well, what I have is still experimental & not in any product. But, yes
we can compile to a Pig script. I try to use the native relational
operators where possible & use UDFs in other cases.

I don't understand which conflicts you are referring to. Initially, I
was trying to create a single jar (containing all the 300 dependencies)
using the maven-dependency-plugin (BTW that seems to be the recommended
approach & should work in many cases) but it turned out that some of our
internal components had conflicting file names for some of the resources
(should probably be fixed!). My current approach works better because I
don't try to re-package any dependency. Yes, startup times are slow - of
course, I am open to other ideas :-)
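
The resource-name clash described above can be checked for before attempting to build a single jar. The following is a minimal sketch, assuming the jars are available locally. The function name is made up for this example, and skipping everything under META-INF/ is a simplification (every jar has its own manifest there).

```python
import zipfile
from collections import defaultdict


def conflicting_entries(jars):
    """Return entry names that occur in more than one jar.

    `jars` maps a label to a jar path or file object. The result maps each
    clashing entry name to the labels of the jars that contain it.
    """
    owners = defaultdict(list)
    for label, jar_file in sorted(jars.items()):
        with zipfile.ZipFile(jar_file) as jar:
            for name in set(jar.namelist()):
                # Skip directories and per-jar metadata.
                if name.endswith("/") or name.startswith("META-INF/"):
                    continue
                owners[name].append(label)
    return {name: labels for name, labels in owners.items()
            if len(labels) > 1}
```

An empty result suggests the jars can be merged without one resource silently overwriting another.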

-----Original Message-----
From: Dmitriy Ryaboy [mailto:dvryaboy@gmail.com] 
Sent: 21 January 2011 07:57
To: user@pig.apache.org
Subject: Re: Managing pig script jar dependencies

Sanjay,
Informatica compiles to Pig now, eh? Interesting...
How do you handle jar conflicts if you bundle the whole lot? Doesn't
this cost you a lot on job startup time?

Dmitriy



Re: Managing pig script jar dependencies

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Sanjay,
Informatica compiles to Pig now, eh? Interesting...
How do you handle jar conflicts if you bundle the whole lot? Doesn't this
cost you a lot on job startup time?

Dmitriy


On Thu, Jan 20, 2011 at 5:41 PM, Kaluskar, Sanjay
<skaluskar@informatica.com> wrote:

> I have a similar problem and I can tell you what I am doing currently,
> just in case it is useful. I have a tool that generates PIG scripts from
> some other representation (Informatica mappings), and in many cases the
> scripts also call UDFs that depend on about 300 jars & 580 native
> libraries. Additionally, I generate a jar for each PIG script that
> contains the UDFs called from that script. I add the latter jar in the
> script in a register statement. But registering the 300 jars that the
> UDFs depend on individually is error prone & tedious; so I have
> automated that part. I have a top-level jar that includes all the 300
> jars on its Class-path in the MANIFEST.MF and I add this top-level jar
> to the classpath. I generate that (top-level jar) using maven's assembly
> plugin. I also generate a zip of everything (jars, native libs) using
> maven's assembly plugin and use dist cache to distribute it and add the
> native libs to the LD_LIBRARY_PATH.

RE: Managing pig script jar dependencies

Posted by "Kaluskar, Sanjay" <sk...@informatica.com>.
I have a similar problem and I can tell you what I am doing currently,
just in case it is useful. I have a tool that generates PIG scripts from
some other representation (Informatica mappings), and in many cases the
scripts also call UDFs that depend on about 300 jars & 580 native
libraries. Additionally, I generate a jar for each PIG script that
contains the UDFs called from that script. I add the latter jar in the
script in a register statement. But registering the 300 jars that the
UDFs depend on individually is error prone & tedious; so I have
automated that part. I have a top-level jar that includes all the 300
jars on its Class-path in the MANIFEST.MF and I add this top-level jar
to the classpath. I generate that (top-level jar) using maven's assembly
plugin. I also generate a zip of everything (jars, native libs) using
maven's assembly plugin and use dist cache to distribute it and add the
native libs to the LD_LIBRARY_PATH.
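
For readers unfamiliar with the top-level jar trick: the jar contains little more than a manifest whose Class-Path attribute lists the other jars, separated by spaces. Manifest lines are limited to 72 bytes, and longer values continue on lines that start with a single space. The sketch below builds such a manifest by hand, purely as an illustration. It is not how the assembly plugin does it, the function name is made up, and it assumes ASCII file names.

```python
def manifest_with_class_path(jar_names, width=72):
    """Build MANIFEST.MF text whose Class-Path lists the given jars."""
    header = "Class-Path: " + " ".join(jar_names)
    # The first physical line may use the full width.
    lines = [header[:width]]
    rest = header[width:]
    while rest:
        # Continuation lines start with one space, which counts
        # toward the line-length limit.
        lines.append(" " + rest[:width - 1])
        rest = rest[width - 1:]
    return "Manifest-Version: 1.0\r\n" + "\r\n".join(lines) + "\r\n"
```

With about 300 entries the attribute spans many continuation lines, which is one reason to generate it rather than maintain it by hand.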

-----Original Message-----
From: Dmitriy Ryaboy [mailto:dvryaboy@gmail.com] 
Sent: 21 January 2011 05:57
To: user@pig.apache.org
Subject: Re: Managing pig script jar dependencies

This is becoming a bigger problem for us as well, as use of Pig becomes
more varied across the company.
Would love to hear what others have found to work for them.

D


Re: Managing pig script jar dependencies

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
This is becoming a bigger problem for us as well, as use of Pig becomes more
varied across the company.
Would love to hear what others have found to work for them.

D

On Wed, Jan 19, 2011 at 2:24 PM, Geoffrey Gallaway <ge...@geoffeg.org> wrote:

> I'm looking for some suggestions and ideas for how to handle JAR
> dependencies in a production environment.
>
> Most of the pig scripts I write require multiple JAR files. For instance, I
> have a pig script that processes some data through a Solr instance which
> requires my Solr UDF and some solr, lucene and apache commons jars. These
> pig scripts are stored in a git repo and that git repo is deployed to our
> production cluster. Obviously we don't want to store the jars in git; I'd
> rather store them in our mvn repo with the rest of the jars the company
> uses.
>
> The plan is to have a maven pom.xml for each pig script that defines which
> jars that pig script depends on. A shell script will then call "mvn
> dependency:copy-dependencies -DoutputDirectory=pig-jars" before calling the
> actual pig command to run the script. Given that, I'm trying to figure out
> the best solution to a few questions.
>
> * For development I'd like to store the pig jar (pig-0.7.0-core.jar) in
> maven but there is no pom.xml for that jar (easily fixed) and that jar
> contains all the java prerequisites (javax.servlet, apache commons, etc)
> which seem to be making maven unhappy when I try to import it into the
> maven company repo. Is there a pig-only jar?
>
> * What do other people use to deploy their code to various systems? Check
> in jars with the code? Keep jars in a separate, network-based directory?
>
> Geoff
> --
> Sent from my email client.
>

Re: Managing pig script jar dependencies

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
We have a bootstrap command that copies all libraries of our maven assembly
to a location in HDFS (actually, we use the maven groupId and artifactId of
our assembly in the hierarchical path to ensure each client has the jars of
exactly the same assembly build available on the backend).

We also use Grunt embedding instead of server embedding. Grunt has
invaluable preprocessing capabilities compared to PigServer. Basically we
kick off a java client that has Grunt integrated and knows its maven build
number, so it knows which HDFS locations to pass on to Pig for the jars. This
is a little bit of a hack over Grunt, but it's only perhaps a hundred lines
longer than doing the same thing with a PigServer.
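
A rough sketch of the path scheme described above is given below. It is not the actual bootstrap command. The base directory, the function names, the splitting of the groupId into path segments and the use of a build identifier as the last segment are all assumptions made for illustration.

```python
import posixpath


def hdfs_lib_dir(group_id, artifact_id, build, base="/apps/libs"):
    """HDFS directory holding the jars of one specific assembly build."""
    # Assumption: groupId dots become path segments, as in a Maven repo.
    segments = group_id.split(".") + [artifact_id, build]
    return posixpath.join(base, *segments)


def upload_commands(local_jars, group_id, artifact_id, build):
    """Return `hadoop fs -put` argument lists, one per jar (not executed).

    The target directory is assumed to exist already.
    """
    target = hdfs_lib_dir(group_id, artifact_id, build)
    return [["hadoop", "fs", "-put", jar, target + "/"]
            for jar in local_jars]
```

Because the build identifier is part of the path, a client built from one assembly never picks up jars uploaded by a different build.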

-dmitriy

On Wed, Jan 19, 2011 at 2:24 PM, Geoffrey Gallaway <ge...@geoffeg.org> wrote:

> I'm looking for some suggestions and ideas for how to handle JAR
> dependencies in a production environment.
>
>
> * What do other people use to deploy their code to various systems? Check
> in jars with the code? Keep jars in a separate, network-based directory?
>
> Geoff
> --
> Sent from my email client.
>