Posted to user@pig.apache.org by Eric Tschetter <ec...@gmail.com> on 2010/04/08 04:04:39 UTC

Dependency Management

I'm writing because I was having some issues with the register command
not working the way I expected it to.  Specifically, it seemed like the
way a jar was specified as an entry in the -cp list passed to the java
command that ran org.apache.pig.Main determined the path string that
register would accept.

I.e.

with

java -cp some-jar.jar:pig.jar org.apache.pig.Main

I had to write

register some-jar.jar;

with

java -cp ../lib/some-jar.jar:pig.jar org.apache.pig.Main

I had to write

register ../lib/some-jar.jar;

This means that my register commands are directly affected by how the
files are laid out on disk.  But one of the benefits of having a
classpath that magically adds things is that my java code doesn't care
about where the dependencies are on disk; they are just there.  I like
that model.

So, in order to obtain that model (i.e. make the code agnostic about
where its dependencies live), I assumed that pig must register its own
jar when submitting a job with hadoop.  So, I promptly created an uber
jar, splatting pig.jar along with the dependencies for my UDFs and
LOAD functions (thank you jarjar!) and then ran

java -cp uber-jar.jar org.apache.pig.Main some-script.pig
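
(For anyone who wants to reproduce the uber jar without jarjar, a
plain-shell sketch -- it assumes pig.jar and some-jar.jar sit in the
current directory and ignores manifest merging:

  mkdir uber && cd uber
  unzip -oq ../pig.jar
  unzip -oq ../some-jar.jar
  cd .. && jar cf uber-jar.jar -C uber .

jarjar additionally rewrites package names to avoid version clashes,
which plain unzip/jar obviously doesn't do.)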

To my surprise, however, the jobs kept failing on my hadoop cluster,
complaining about not being able to find my LOAD function.  Confused,
I grabbed the jar from the JT and looked at it, and sure enough, none
of the LOAD/UDF stuff was in it.  After hitting up a couple of
resources for background on why this would be happening, I found out
that pig actually creates a brand new jar file including all of the
various dependencies and submits that.  Unfortunately, however, it
doesn't include the whole jar it was loaded from.  It purposely
filters out all packages except "org.apache.pig" and
"org.apache.tools.bzip2r", meaning that it filters out everything else
in my uber jar, thus thwarting my plans to manage my own dependencies
without any external power changing them.
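
(You can see the filtering for yourself -- a sketch, assuming you've
pulled the submitted jar down from the JT as job.jar; the staging path
varies by hadoop version, and MyLoadFunc stands in for your actual
LOAD function class:

  jar tf job.jar | grep -c 'org/apache/pig'   # plenty of entries
  jar tf job.jar | grep -c 'MyLoadFunc'       # 0, despite the uber jar

)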

I've now patched a local copy of branch-0.6 to stop filtering out the
pig-containing jar and just push it up to the JT.  This is purely a
client-side change, so I am fine with maintaining my own local fork of
pig in order to have this functionality.  But my questions to the list
are two-fold:

1) How much measured performance gain do you get out of paring down an
8MB jar into a 3MB jar and sending that out to the various task
trackers?  If anyone is running pig jobs (e.g. filter, group, join)
against, say, 1GB of data, they could probably get results back faster
by keeping the data as a flat text file on their own box and processing
it there.  So, even if we take 2GB of data as the minimum amount of
data we are working on, it would take pushing that jar to 400 different
task trackers before a 5MB difference equals the size of the data.
However, if you have 400 task trackers, you are on the larger end of
the spectrum as far as cluster size is concerned and are very likely
working on more than 2GB of data.  Also, if you are working on just
2GB of data, even assuming a low block size of 64MB, your job will run
on at most 32 task trackers (or 128 with a speculative execution
factor of 4), meaning that the amount of data sent out because of the
jar is still less than the actual data.  So, I'm wondering if there
have been tests run showing that there actually is a significant gain
from paring down the jar files?
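
(To make the arithmetic above concrete:

  5MB saved per copy x 400 task trackers = 2000MB ~= 2GB
  2GB / 64MB block size = 32 map tasks (x4 speculative = 128)
  128 tasks x 5MB = 640MB, still well under the 2GB of input

The 400 trackers and the x4 speculation factor are the assumptions
stated above, not measurements.)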

2) Would a patch that stops filtering down the pig-containing jar, and
instead includes whatever jar pig happened to be loaded from, be
accepted were I to open a JIRA and contribute it back?  Or should I
just accept that I'm in the minority and maintain a local fork?

--Eric Tschetter

Re: Dependency Management

Posted by hc busy <hc...@gmail.com>.
Well, a 20MB jar file on HDFS isn't that helpful either unless you have
its replication jacked up to equal the number of machines, or something
like that.
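
(If you do go that route, bumping replication on a shared jar is one
line -- a sketch, with the path and factor made up:

  hadoop fs -setrep -w 20 /lib/myStuff.jar

The -w flag just waits for the replication to actually complete.)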


On Wed, Apr 14, 2010 at 5:56 PM, Scott Carey <sc...@richrelevance.com> wrote:

> [...]
> I would be happy defining a smaller jar with just my UDF classes for
> REGISTER and placing the remaining stuff in HDFS and using Hadoop's
> distributed cache.
> [...]

Re: Dependency Management

Posted by Scott Carey <sc...@richrelevance.com>.
On Apr 7, 2010, at 7:04 PM, Eric Tschetter wrote:

> 
> 2) Would a patch that stops filtering down the pig-containing jar, and
> instead includes whatever jar pig happened to be loaded from, be
> accepted were I to open a JIRA and contribute it back?  Or should I
> just accept that I'm in the minority and maintain a local fork?
> 

I am curious about what the approaches are for Pig here.  Does REGISTER use the distributed cache at all?
Can you use the distributed cache for your use case?

I would be happy defining a smaller jar with just my UDF classes for REGISTER and placing the remaining stuff in HDFS and using Hadoop's distributed cache.
It would also help to let REGISTER get files from HDFS.  The docs don't say that it's possible, but 
REGISTER hdfs://server:port/lib/myStuff.jar would be nice.

I REGISTER a 28MB jar file (and piggybank) now, and pushing that around and re-packaging it contributes significantly to latency.
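
(For comparison, a plain MapReduce job gets this via the generic
options -- a sketch; it assumes the driver uses ToolRunner so the
generic options get parsed, and MyDriver / my-udfs.jar are
placeholders:

  hadoop jar my-job.jar MyDriver -libjars my-udfs.jar in/ out/

-libjars ships the jars through the distributed cache onto the task
classpath, so letting REGISTER take an hdfs:// URI would save both the
upload and the re-packaging.)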


> --Eric Tschetter


Re: Dependency Management

Posted by Thejas Nair <te...@yahoo-inc.com>.
In 0.7, you can specify the jars you want to register on the command
line with -Dpig.additional.jars=".."
See - https://issues.apache.org/jira/browse/PIG-1226

You can even specify all the jars on your classpath in that property.
Would that feature help?
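
(Tied back to the original invocation, that would look something like
this -- a sketch; per PIG-1226 the property takes a colon-separated
list of jar paths, and other.jar is just an illustrative second entry:

  java -cp pig.jar -Dpig.additional.jars=some-jar.jar:../lib/other.jar \
      org.apache.pig.Main some-script.pig

The property behaves like a register statement for each listed jar.)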

I don't have numbers for the performance gain achieved by sending a
smaller jar to all the nodes.  But since this file needs to be made
available on each node, I think there will be some impact on the
startup time of the query.

-Thejas


On 4/7/10 7:04 PM, "Eric Tschetter" <ec...@gmail.com> wrote:

> [original message quoted in full; snipped -- see the top of the thread]