Posted to dev@pig.apache.org by "Eric Tschetter (JIRA)" <ji...@apache.org> on 2010/07/22 01:35:50 UTC

[jira] Created: (PIG-1511) Pig removes packages from its own jar when building the JAR to ship to Hadoop

Pig removes packages from its own jar when building the JAR to ship to Hadoop
-----------------------------------------------------------------------------

                 Key: PIG-1511
                 URL: https://issues.apache.org/jira/browse/PIG-1511
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.7.0
            Reporter: Eric Tschetter
         Attachments: pig-1511.diff

Pig generates a new jar file to ship over to Hadoop.  Pig has a couple of packages whitelisted that it includes from its own jar.  Pig throws away everything else.

I package all of my dependencies into a single jar file.  Pig is included in this jar file.  I do it this way because my code needs to run reliably and reproducibly in production.  Pig throws away all of my dependencies.

I don't know what the performance gain is of shaving ~5MB off of a jar that is pushed to a job tracker once and then used to run over 100s of GB of data.  The overhead is minimal on my cluster.
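
To make the behavior described above concrete, here is a minimal sketch of whitelist-based jar filtering. It is not Pig's actual code: the whitelist prefix, the helper names (build_job_jar, make_jar, entry_names), and the com/example dependency are all hypothetical and only illustrate how a bundled dependency gets dropped.

```python
import io
import zipfile

# Hypothetical whitelist of package prefixes; not Pig's actual list.
WHITELIST = ("org/apache/pig/",)


def build_job_jar(source_jar_bytes, whitelist=WHITELIST):
    """Copy only whitelisted entries from a source jar into a new in-memory jar."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(source_jar_bytes)) as src, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            if info.filename.startswith(tuple(whitelist)):
                dst.writestr(info, src.read(info.filename))
            # Every other entry, including bundled dependencies, is dropped.
    return out.getvalue()


def make_jar(names):
    """Build a tiny in-memory jar with empty entries, for demonstration."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        for name in names:
            z.writestr(name, b"")
    return buf.getvalue()


def entry_names(jar_bytes):
    """Return the sorted entry names of an in-memory jar."""
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as z:
        return sorted(z.namelist())


fat_jar = make_jar([
    "org/apache/pig/Main.class",
    "com/example/deps/Helper.class",  # hypothetical bundled dependency
])
print(entry_names(build_job_jar(fat_jar)))
```

In this sketch only the org/apache/pig entry survives, which mirrors the complaint: anything packaged alongside Pig in the same jar is lost unless its package is whitelisted.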

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1511) Pig removes packages from its own jar when building the JAR to ship to Hadoop

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891648#action_12891648 ] 

Alan Gates commented on PIG-1511:
---------------------------------

The issue there is that blacklists are hard to maintain.  Every time someone adds a package to Pig, they have to remember to add it to that blacklist.

If you register your jar, Pig will wrap it up and take it along.  Does this not work for your use case?
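
The maintenance trade-off can be sketched as two filter predicates. This is an illustration only, assuming filtering by package-path prefix; the prefixes and entry names below are hypothetical and are not taken from Pig.

```python
def keep_with_whitelist(entry, whitelist):
    """Ship an entry only if it matches a listed prefix."""
    return entry.startswith(tuple(whitelist))


def keep_with_blacklist(entry, blacklist):
    """Ship an entry unless it matches a listed prefix."""
    return not entry.startswith(tuple(blacklist))


# Hypothetical package prefixes, for illustration only.
whitelist = ["org/apache/pig/backend/"]
blacklist = ["org/apache/pig/tools/"]

# A newly added package that nobody has listed anywhere yet,
# and a user dependency bundled into the same jar.
new_entry = "org/apache/pig/newpkg/Thing.class"
user_entry = "com/example/deps/Helper.class"
entries = [new_entry, user_entry]

shipped_by_whitelist = [e for e in entries if keep_with_whitelist(e, whitelist)]
shipped_by_blacklist = [e for e in entries if keep_with_blacklist(e, blacklist)]

print("whitelist ships:", shipped_by_whitelist)
print("blacklist ships:", shipped_by_blacklist)
```

With a whitelist, an unlisted entry is dropped by default, so bundled dependencies disappear.  With a blacklist, an unlisted entry ships by default, so the jar grows whenever someone forgets to update the list.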

[jira] Commented: (PIG-1511) Pig removes packages from its own jar when building the JAR to ship to Hadoop

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892568#action_12892568 ] 

Hadoop QA commented on PIG-1511:
--------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12450112/pig-1511.diff
  against trunk revision 979362.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/354/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/354/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/354/console

This message is automatically generated.

[jira] Updated: (PIG-1511) Pig removes packages from its own jar when building the JAR to ship to Hadoop

Posted by "Eric Tschetter (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Tschetter updated PIG-1511:
--------------------------------

    Attachment: pig-1511.diff

[jira] Commented: (PIG-1511) Pig removes packages from its own jar when building the JAR to ship to Hadoop

Posted by "Eric Tschetter (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891430#action_12891430 ] 

Eric Tschetter commented on PIG-1511:
-------------------------------------

Why would I use Hadoop to process data where the overhead of sending a 5MB
jar file over the wire is larger than the amount of data I am
processing?  There are tools called grep, sort, uniq, sed, cut, awk, and
join that make working with data that small simple...

That said, if shaving off a few MB of the jar file is that important,
perhaps change it from a whitelist to a blacklist.

On Jul 22, 2010 10:50 AM, "Alan Gates (JIRA)" <ji...@apache.org> wrote:


    [ https://issues.apache.org/jira/browse/PIG-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891261#action_12891261 ]

Alan Gates commented on PIG-1511:
---------------------------------

We don't want to do this by default.  In a couple of instances keeping the
size of this jar down is more important.  One, when the number of tasks
being used is very large, since that jar is being copied once to each task,
and two when the job itself is quite small and the setup costs become a
concern.

[jira] Commented: (PIG-1511) Pig removes packages from its own jar when building the JAR to ship to Hadoop

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891261#action_12891261 ] 

Alan Gates commented on PIG-1511:
---------------------------------

We don't want to do this by default.  In a couple of cases, keeping the size of this jar down is more important: first, when the number of tasks is very large, since the jar is copied once to each task; and second, when the job itself is quite small and the setup costs become a concern.
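
A back-of-the-envelope sketch of the two cases, assuming (as stated above) that the jar is copied once per task.  The ~5MB figure comes from the issue description; the task counts and input sizes are hypothetical, and the function names are made up for this example.

```python
def extra_transfer_mb(extra_jar_mb, num_tasks):
    """Extra data moved if the larger jar is copied once to each task."""
    return extra_jar_mb * num_tasks


def overhead_ratio(extra_jar_mb, num_tasks, input_mb):
    """Extra jar traffic relative to the amount of input data processed."""
    return extra_transfer_mb(extra_jar_mb, num_tasks) / input_mb


# Case one: many tasks (hypothetical 10,000 tasks over 100 GB of input).
many_tasks = overhead_ratio(extra_jar_mb=5, num_tasks=10_000, input_mb=100 * 1024)

# Case two: a tiny job (hypothetical single task over 1 MB of input).
tiny_job = overhead_ratio(extra_jar_mb=5, num_tasks=1, input_mb=1)

print(f"many tasks: extra jar traffic is {many_tasks:.2f}x the input size")
print(f"tiny job:   extra jar traffic is {tiny_job:.2f}x the input size")
```

Under these assumed numbers the extra jar traffic is a meaningful fraction of the input in both cases, while a job with few tasks and hundreds of GB of input would see a negligible ratio.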

[jira] Updated: (PIG-1511) Pig removes packages from its own jar when building the JAR to ship to Hadoop

Posted by "Eric Tschetter (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Tschetter updated PIG-1511:
--------------------------------

    Status: Patch Available  (was: Open)

Here is the patch.
