Posted to dev@pig.apache.org by "Viraj Bhat (JIRA)" <ji...@apache.org> on 2009/12/10 02:00:27 UTC

[jira] Created: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

set default_parallelism construct does not set the number of reducers correctly
-------------------------------------------------------------------------------

                 Key: PIG-1144
                 URL: https://issues.apache.org/jira/browse/PIG-1144
             Project: Pig
          Issue Type: Bug
          Components: impl
    Affects Versions: 0.7.0
         Environment: Hadoop 20 cluster with multi-node installation
            Reporter: Viraj Bhat
             Fix For: 0.7.0


Hi all,
 I have a Pig script where I set the parallelism using the following set construct: "set default_parallel 100". I modified "MRPrinter.java" to print out the parallelism:
{code}
...
public void visitMROp(MapReduceOper mr) {
    mStream.println("MapReduce node " + mr.getOperatorKey().toString()
            + " Parallelism " + mr.getRequestedParallelism());
}
...
{code}

When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY statement.

Attaching the script and the explain output.

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789135#action_12789135 ] 

Hadoop QA commented on PIG-1144:
--------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12427670/PIG-1144-2.patch
  against trunk revision 889346.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/114/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/114/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/114/console

This message is automatically generated.

> set default_parallelism construct does not set the number of reducers correctly
> -------------------------------------------------------------------------------
>
>                 Key: PIG-1144
>                 URL: https://issues.apache.org/jira/browse/PIG-1144
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>         Environment: Hadoop 20 cluster with multi-node installation
>            Reporter: Viraj Bhat
>            Assignee: Daniel Dai
>             Fix For: 0.7.0
>
>         Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch, PIG-1144-2.patch
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set construct: "set default_parallel 100". I modified "MRPrinter.java" to print out the parallelism:
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
> }
> ...
> {code}
> When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY statement.
> Attaching the script and the explain output.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1144:
----------------------------

    Status: Patch Available  (was: Open)

> set default_parallelism construct does not set the number of reducers correctly
> -------------------------------------------------------------------------------
>
>                 Key: PIG-1144
>                 URL: https://issues.apache.org/jira/browse/PIG-1144
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>         Environment: Hadoop 20 cluster with multi-node installation
>            Reporter: Viraj Bhat
>            Assignee: Daniel Dai
>             Fix For: 0.7.0
>
>         Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch, PIG-1144-2.patch, PIG-1144-3.patch
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set construct: "set default_parallel 100". I modified "MRPrinter.java" to print out the parallelism:
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
> }
> ...
> {code}
> When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY statement.
> Attaching the script and the explain output.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791112#action_12791112 ] 

Hadoop QA commented on PIG-1144:
--------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12427792/PIG-1144-3.patch
  against trunk revision 890596.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/125/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/125/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/125/console

This message is automatically generated.

> set default_parallelism construct does not set the number of reducers correctly
> -------------------------------------------------------------------------------
>
>                 Key: PIG-1144
>                 URL: https://issues.apache.org/jira/browse/PIG-1144
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>         Environment: Hadoop 20 cluster with multi-node installation
>            Reporter: Viraj Bhat
>            Assignee: Daniel Dai
>             Fix For: 0.6.0
>
>         Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch, PIG-1144-2.patch, PIG-1144-3.patch
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set construct: "set default_parallel 100". I modified "MRPrinter.java" to print out the parallelism:
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
> }
> ...
> {code}
> When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY statement.
> Attaching the script and the explain output.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1144:
----------------------------

    Status: Open  (was: Patch Available)

> set default_parallelism construct does not set the number of reducers correctly
> -------------------------------------------------------------------------------
>
>                 Key: PIG-1144
>                 URL: https://issues.apache.org/jira/browse/PIG-1144
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>         Environment: Hadoop 20 cluster with multi-node installation
>            Reporter: Viraj Bhat
>            Assignee: Daniel Dai
>             Fix For: 0.7.0
>
>         Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch, PIG-1144-2.patch, PIG-1144-3.patch
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set construct: "set default_parallel 100". I modified "MRPrinter.java" to print out the parallelism:
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
> }
> ...
> {code}
> When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY statement.
> Attaching the script and the explain output.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1144:
----------------------------

    Status: Patch Available  (was: Open)

> set default_parallelism construct does not set the number of reducers correctly
> -------------------------------------------------------------------------------
>
>                 Key: PIG-1144
>                 URL: https://issues.apache.org/jira/browse/PIG-1144
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>         Environment: Hadoop 20 cluster with multi-node installation
>            Reporter: Viraj Bhat
>            Assignee: Daniel Dai
>             Fix For: 0.6.0
>
>         Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch, PIG-1144-2.patch, PIG-1144-3.patch, PIG-1144-4.patch
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set construct: "set default_parallel 100". I modified "MRPrinter.java" to print out the parallelism:
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
> }
> ...
> {code}
> When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY statement.
> Attaching the script and the explain output.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788447#action_12788447 ] 

Daniel Dai commented on PIG-1144:
---------------------------------

I found the root cause of the problem. For every sort job, we hard-code the parallelism as 1 if the user does not use the PARALLEL keyword. We should instead leave the parallelism as -1 in this case; the later code will then detect it and use the default_parallel value.
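
To make the intended behavior concrete, here is a minimal, self-contained sketch; the class and method names are illustrative, not the actual Pig source. Treating -1 as "unset" lets default_parallel apply later, whereas the old hard-coded 1 short-circuits it.
{code}
// Illustrative sketch only -- assumed names, not the actual Pig code.
// -1 marks the parallelism as "unset" so default_parallel can still apply.
public class ParallelismSketch {
    static final int UNSET = -1;

    static int resolve(int requested, int defaultParallel) {
        if (requested > 0) return requested;             // explicit PARALLEL
        if (defaultParallel > 0) return defaultParallel; // set default_parallel
        return 1;                                        // fall back to one reducer
    }

    public static void main(String[] args) {
        // Before the fix the sort job arrived here with requested == 1,
        // so "set default_parallel 100" was never consulted.
        System.out.println(resolve(1, 100));     // 1: the reported bug
        System.out.println(resolve(UNSET, 100)); // 100: the intended behavior
    }
}
{code}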

> set default_parallelism construct does not set the number of reducers correctly
> -------------------------------------------------------------------------------
>
>                 Key: PIG-1144
>                 URL: https://issues.apache.org/jira/browse/PIG-1144
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.7.0
>         Environment: Hadoop 20 cluster with multi-node installation
>            Reporter: Viraj Bhat
>             Fix For: 0.7.0
>
>         Attachments: brokenparallel.out, genericscript_broken_parallel.pig
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set construct: "set default_parallel 100". I modified "MRPrinter.java" to print out the parallelism:
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
> }
> ...
> {code}
> When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY statement.
> Attaching the script and the explain output.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788706#action_12788706 ] 

Hadoop QA commented on PIG-1144:
--------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12427558/PIG-1144-1.patch
  against trunk revision 888852.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/112/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/112/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/112/console

This message is automatically generated.

> set default_parallelism construct does not set the number of reducers correctly
> -------------------------------------------------------------------------------
>
>                 Key: PIG-1144
>                 URL: https://issues.apache.org/jira/browse/PIG-1144
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>         Environment: Hadoop 20 cluster with multi-node installation
>            Reporter: Viraj Bhat
>            Assignee: Daniel Dai
>             Fix For: 0.7.0
>
>         Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set construct: "set default_parallel 100". I modified "MRPrinter.java" to print out the parallelism:
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
> }
> ...
> {code}
> When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY statement.
> Attaching the script and the explain output.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1144:
----------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

Patch committed to both trunk and the 0.6 branch.

> set default_parallelism construct does not set the number of reducers correctly
> -------------------------------------------------------------------------------
>
>                 Key: PIG-1144
>                 URL: https://issues.apache.org/jira/browse/PIG-1144
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>         Environment: Hadoop 20 cluster with multi-node installation
>            Reporter: Viraj Bhat
>            Assignee: Daniel Dai
>             Fix For: 0.6.0
>
>         Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch, PIG-1144-2.patch, PIG-1144-3.patch, PIG-1144-4.patch
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set construct: "set default_parallel 100". I modified "MRPrinter.java" to print out the parallelism:
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
> }
> ...
> {code}
> When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY statement.
> Attaching the script and the explain output.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789020#action_12789020 ] 

Alan Gates commented on PIG-1144:
---------------------------------

The previous code was trying to read the default parallelism from the JobClient rather than setting it to -1, as the current code does.  This seems strange, but was there a reason for that?

> set default_parallelism construct does not set the number of reducers correctly
> -------------------------------------------------------------------------------
>
>                 Key: PIG-1144
>                 URL: https://issues.apache.org/jira/browse/PIG-1144
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>         Environment: Hadoop 20 cluster with multi-node installation
>            Reporter: Viraj Bhat
>            Assignee: Daniel Dai
>             Fix For: 0.7.0
>
>         Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set construct: "set default_parallel 100". I modified "MRPrinter.java" to print out the parallelism:
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
> }
> ...
> {code}
> When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY statement.
> Attaching the script and the explain output.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1144:
----------------------------

    Affects Version/s:     (was: 0.7.0)
                       0.6.0
               Status: Patch Available  (was: Open)

> set default_parallelism construct does not set the number of reducers correctly
> -------------------------------------------------------------------------------
>
>                 Key: PIG-1144
>                 URL: https://issues.apache.org/jira/browse/PIG-1144
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>         Environment: Hadoop 20 cluster with multi-node installation
>            Reporter: Viraj Bhat
>            Assignee: Daniel Dai
>             Fix For: 0.7.0
>
>         Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set construct: "set default_parallel 100". I modified "MRPrinter.java" to print out the parallelism:
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
> }
> ...
> {code}
> When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY statement.
> Attaching the script and the explain output.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai reassigned PIG-1144:
-------------------------------

    Assignee: Daniel Dai

> set default_parallelism construct does not set the number of reducers correctly
> -------------------------------------------------------------------------------
>
>                 Key: PIG-1144
>                 URL: https://issues.apache.org/jira/browse/PIG-1144
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>         Environment: Hadoop 20 cluster with multi-node installation
>            Reporter: Viraj Bhat
>            Assignee: Daniel Dai
>             Fix For: 0.7.0
>
>         Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set construct: "set default_parallel 100". I modified "MRPrinter.java" to print out the parallelism:
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
> }
> ...
> {code}
> When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY statement.
> Attaching the script and the explain output.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1144:
----------------------------

    Attachment: PIG-1144-4.patch

Included logic for local mode. However, this only covers the case where the user runs with "-x local". If the user does not pass the -x option and falls into local mode because there is no Hadoop configuration file in the CLASSPATH, Pig has no way to detect that, and the order-by job will fail.
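
As a rough illustration of the guard this adds (a hypothetical sketch, not the actual patch), an explicit "-x local" run pins the order-by to a single reducer; note that nothing in a sketch like this can observe an implicit fall-back to local mode caused by a missing Hadoop configuration.
{code}
// Hypothetical sketch -- the enum and method names are illustrative.
public class LocalModeSketch {
    enum ExecType { LOCAL, MAPREDUCE }

    static int reducersFor(ExecType mode, int resolvedParallelism) {
        // An explicit "-x local" run can only ever use one reducer.
        return mode == ExecType.LOCAL ? 1 : resolvedParallelism;
    }

    public static void main(String[] args) {
        System.out.println(reducersFor(ExecType.LOCAL, 100));     // 1
        System.out.println(reducersFor(ExecType.MAPREDUCE, 100)); // 100
    }
}
{code}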

> set default_parallelism construct does not set the number of reducers correctly
> -------------------------------------------------------------------------------
>
>                 Key: PIG-1144
>                 URL: https://issues.apache.org/jira/browse/PIG-1144
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>         Environment: Hadoop 20 cluster with multi-node installation
>            Reporter: Viraj Bhat
>            Assignee: Daniel Dai
>             Fix For: 0.6.0
>
>         Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch, PIG-1144-2.patch, PIG-1144-3.patch, PIG-1144-4.patch
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set construct: "set default_parallel 100". I modified "MRPrinter.java" to print out the parallelism:
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
> }
> ...
> {code}
> When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY statement.
> Attaching the script and the explain output.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788456#action_12788456 ] 

Daniel Dai commented on PIG-1144:
---------------------------------

Leaving the parallelism at -1 is not a solution here. When setting up the sampling job, we already need to know how many reducers we are going to use, so we must apply default_parallel well before JobControlCompiler.
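
To see why the count is needed early, here is a sketch with assumed names (not Pig's actual sampler): the sampling job that precedes an order-by derives one key-range boundary per reducer, so the reducer count must be fixed when that job is compiled, before JobControlCompiler ever runs.
{code}
// Illustrative sketch with assumed names -- not Pig's actual sampler.
import java.util.Arrays;

public class SamplerSketch {
    // Pick reducers-1 boundaries out of a sorted sample of keys.
    static int[] rangeBoundaries(int[] sortedSample, int reducers) {
        int[] bounds = new int[reducers - 1];
        for (int i = 1; i < reducers; i++) {
            bounds[i - 1] = sortedSample[i * sortedSample.length / reducers];
        }
        return bounds;
    }

    public static void main(String[] args) {
        int[] sample = {1, 3, 5, 7, 9, 11, 13, 15};
        // The boundary count depends directly on the reducer count,
        // which is why -1 ("decide later") cannot work at this stage.
        System.out.println(Arrays.toString(rangeBoundaries(sample, 4))); // [5, 9, 13]
    }
}
{code}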

> set default_parallelism construct does not set the number of reducers correctly
> -------------------------------------------------------------------------------
>
>                 Key: PIG-1144
>                 URL: https://issues.apache.org/jira/browse/PIG-1144
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>         Environment: Hadoop 20 cluster with multi-node installation
>            Reporter: Viraj Bhat
>             Fix For: 0.7.0
>
>         Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set construct: "set default_parallel 100". I modified "MRPrinter.java" to print out the parallelism:
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
> }
> ...
> {code}
> When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY statement.
> Attaching the script and the explain output.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788434#action_12788434 ] 

Daniel Dai commented on PIG-1144:
---------------------------------

Hi Viraj,
The default parallelism is set in JobControlCompiler, which runs after MRCompiler. Also, if you just run explain, this code will not be invoked. Did you see on a real cluster that it actually uses one reducer?

> set default_parallelism construct does not set the number of reducers correctly
> -------------------------------------------------------------------------------
>
>                 Key: PIG-1144
>                 URL: https://issues.apache.org/jira/browse/PIG-1144
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.7.0
>         Environment: Hadoop 20 cluster with multi-node installation
>            Reporter: Viraj Bhat
>             Fix For: 0.7.0
>
>         Attachments: brokenparallel.out, genericscript_broken_parallel.pig
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set construct: "set default_parallel 100". I modified "MRPrinter.java" to print out the parallelism:
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
> }
> ...
> {code}
> When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY statement.
> Attaching the script and the explain output.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792288#action_12792288 ] 

Olga Natkovich commented on PIG-1144:
-------------------------------------

+1; patch looks good!

> set default_parallelism construct does not set the number of reducers correctly
> -------------------------------------------------------------------------------
>
>                 Key: PIG-1144
>                 URL: https://issues.apache.org/jira/browse/PIG-1144
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>         Environment: Hadoop 20 cluster with multi-node installation
>            Reporter: Viraj Bhat
>            Assignee: Daniel Dai
>             Fix For: 0.6.0
>
>         Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch, PIG-1144-2.patch, PIG-1144-3.patch, PIG-1144-4.patch
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set construct: "set default_parallel 100". I modified "MRPrinter.java" to print out the parallelism:
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
> }
> ...
> {code}
> When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY statement.
> Attaching the script and the explain output.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1144:
----------------------------

    Status: Open  (was: Patch Available)

> set default_parallelism construct does not set the number of reducers correctly
> -------------------------------------------------------------------------------
>
>                 Key: PIG-1144
>                 URL: https://issues.apache.org/jira/browse/PIG-1144
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>         Environment: Hadoop 20 cluster with multi-node installation
>            Reporter: Viraj Bhat
>            Assignee: Daniel Dai
>             Fix For: 0.6.0
>
>         Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch, PIG-1144-2.patch, PIG-1144-3.patch, PIG-1144-4.patch
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set construct: "set default_parallel 100". I modified "MRPrinter.java" to print out the parallelism:
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
> }
> ...
> {code}
> When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY statement.
> Attaching the script and the explain output.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

Posted by "Viraj Bhat (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788439#action_12788439 ] 

Viraj Bhat commented on PIG-1144:
---------------------------------

Hi Daniel,
One more thing to note is that the last sort M/R job has a parallelism of 1. Should it not be -1?
Viraj

> set default_parallelism construct does not set the number of reducers correctly
> -------------------------------------------------------------------------------
>
>                 Key: PIG-1144
>                 URL: https://issues.apache.org/jira/browse/PIG-1144
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.7.0
>         Environment: Hadoop 20 cluster with multi-node installation
>            Reporter: Viraj Bhat
>             Fix For: 0.7.0
>
>         Attachments: brokenparallel.out, genericscript_broken_parallel.pig
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set construct: "set default_parallel 100". I modified "MRPrinter.java" to print out the parallelism:
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
> }
> ...
> {code}
> When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY statement.
> Attaching the script and the explain output.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792251#action_12792251 ] 

Hadoop QA commented on PIG-1144:
--------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12428331/PIG-1144-4.patch
  against trunk revision 891499.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/137/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/137/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/137/console

This message is automatically generated.

> set default_parallelism construct does not set the number of reducers correctly
> -------------------------------------------------------------------------------
>
>                 Key: PIG-1144
>                 URL: https://issues.apache.org/jira/browse/PIG-1144
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>         Environment: Hadoop 20 cluster with multi-node installation
>            Reporter: Viraj Bhat
>            Assignee: Daniel Dai
>             Fix For: 0.6.0
>
>         Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch, PIG-1144-2.patch, PIG-1144-3.patch, PIG-1144-4.patch
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set construct: "set default_parallel 100". I modified "MRPrinter.java" to print out the parallelism:
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
> }
> ...
> {code}
> When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY statement.
> Attaching the script and the explain output.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1144:
----------------------------

    Attachment: PIG-1144-1.patch

> set default_parallelism construct does not set the number of reducers correctly
> -------------------------------------------------------------------------------
>
>                 Key: PIG-1144
>                 URL: https://issues.apache.org/jira/browse/PIG-1144
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>         Environment: Hadoop 20 cluster with multi-node installation
>            Reporter: Viraj Bhat
>             Fix For: 0.7.0
>
>         Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set construct: "set default_parallel 100". I modified "MRPrinter.java" to print out the parallelism:
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
> }
> ...
> {code}
> When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY statement.
> Attaching the script and the explain output.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789734#action_12789734 ] 

Hadoop QA commented on PIG-1144:
--------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12427792/PIG-1144-3.patch
  against trunk revision 889870.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/120/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/120/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/120/console

This message is automatically generated.

> set default_parallelism construct does not set the number of reducers correctly
> -------------------------------------------------------------------------------
>
>                 Key: PIG-1144
>                 URL: https://issues.apache.org/jira/browse/PIG-1144
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>         Environment: Hadoop 20 cluster with multi-node installation
>            Reporter: Viraj Bhat
>            Assignee: Daniel Dai
>             Fix For: 0.7.0
>
>         Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch, PIG-1144-2.patch, PIG-1144-3.patch
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set construct: "set default_parallel 100". I modified "MRPrinter.java" to print out the parallelism:
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
> }
> ...
> {code}
> When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY statement.
> Attaching the script and the explain output.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

Posted by "Viraj Bhat (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788481#action_12788481 ] 

Viraj Bhat commented on PIG-1144:
---------------------------------

Hi Daniel,
 Thanks again for your input. This is more of a performance issue that users do not detect until they see the single-reducer job fail in the sort phase; they simply assume that the default_parallel keyword will do the trick.
Viraj

> set default_parallelism construct does not set the number of reducers correctly
> -------------------------------------------------------------------------------
>
>                 Key: PIG-1144
>                 URL: https://issues.apache.org/jira/browse/PIG-1144
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>         Environment: Hadoop 20 cluster with multi-node installation
>            Reporter: Viraj Bhat
>            Assignee: Daniel Dai
>             Fix For: 0.7.0
>
>         Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set construct: "set default_parallel 100". I modified "MRPrinter.java" to print out the parallelism:
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
> }
> ...
> {code}
> When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY statement.
> Attaching the script and the explain output.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1144:
--------------------------------

    Status: Open  (was: Patch Available)

> set default_parallelism construct does not set the number of reducers correctly
> -------------------------------------------------------------------------------
>
>                 Key: PIG-1144
>                 URL: https://issues.apache.org/jira/browse/PIG-1144
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>         Environment: Hadoop 20 cluster with multi-node installation
>            Reporter: Viraj Bhat
>            Assignee: Daniel Dai
>             Fix For: 0.7.0
>
>         Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch, PIG-1144-2.patch, PIG-1144-3.patch
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set construct: "set default_parallel 100". I modified "MRPrinter.java" to print out the parallelism:
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
> }
> ...
> {code}
> When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY statement.
> Attaching the script and the explain output.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1144:
--------------------------------

    Fix Version/s:     (was: 0.7.0)
                   0.6.0

> set default_parallelism construct does not set the number of reducers correctly
> -------------------------------------------------------------------------------
>
>                 Key: PIG-1144
>                 URL: https://issues.apache.org/jira/browse/PIG-1144
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>         Environment: Hadoop 20 cluster with multi-node installation
>            Reporter: Viraj Bhat
>            Assignee: Daniel Dai
>             Fix For: 0.6.0
>
>         Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch, PIG-1144-2.patch, PIG-1144-3.patch
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set construct: "set default_parallel 100". I modified "MRPrinter.java" to print out the parallelism:
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
> }
> ...
> {code}
> When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY statement.
> Attaching the script and the explain output.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1144:
----------------------------

    Attachment: PIG-1144-3.patch

Changed the patch to take "mapred.reduce.tasks" into account. The hierarchy for determining the parallelism is (see the sketch after this list):
1. the PARALLEL keyword
2. default_parallel
3. the mapred.reduce.tasks system property
4. the default value of 1
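
A minimal sketch of that precedence; the method and the Properties-based configuration lookup are assumptions for illustration, not the actual patch.
{code}
// Illustrative sketch of the precedence above -- assumed names only.
import java.util.Properties;

public class ReducerCountSketch {
    static int reducers(int requestedParallel, int defaultParallel, Properties conf) {
        if (requestedParallel > 0) return requestedParallel;      // 1. PARALLEL keyword
        if (defaultParallel > 0) return defaultParallel;          // 2. set default_parallel
        String fromConf = conf.getProperty("mapred.reduce.tasks");
        if (fromConf != null) return Integer.parseInt(fromConf);  // 3. system property
        return 1;                                                 // 4. default
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("mapred.reduce.tasks", "10");
        System.out.println(reducers(-1, -1, conf));  // 10: the property applies
        System.out.println(reducers(-1, 100, conf)); // 100: default_parallel wins
        System.out.println(reducers(25, 100, conf)); // 25: PARALLEL wins
    }
}
{code}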

> set default_parallelism construct does not set the number of reducers correctly
> -------------------------------------------------------------------------------
>
>                 Key: PIG-1144
>                 URL: https://issues.apache.org/jira/browse/PIG-1144
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>         Environment: Hadoop 20 cluster with multi-node installation
>            Reporter: Viraj Bhat
>            Assignee: Daniel Dai
>             Fix For: 0.7.0
>
>         Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch, PIG-1144-2.patch, PIG-1144-3.patch
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set construct: "set default_parallel 100". I modified "MRPrinter.java" to print out the parallelism:
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
> }
> ...
> {code}
> When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY statement.
> Attaching the script and the explain output.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1144:
----------------------------

    Status: Open  (was: Patch Available)

> set default_parallelism construct does not set the number of reducers correctly
> -------------------------------------------------------------------------------
>
>                 Key: PIG-1144
>                 URL: https://issues.apache.org/jira/browse/PIG-1144
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>         Environment: Hadoop 20 cluster with multi-node installation
>            Reporter: Viraj Bhat
>            Assignee: Daniel Dai
>             Fix For: 0.7.0
>
>         Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch, PIG-1144-2.patch
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set construct: "set default_parallel 100". I modified "MRPrinter.java" to print out the parallelism:
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
> }
> ...
> {code}
> When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY statement.
> Attaching the script and the explain output.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1144:
--------------------------------

    Status: Patch Available  (was: Open)

Resubmitting to rerun the tests.

> set default_parallelism construct does not set the number of reducers correctly
> -------------------------------------------------------------------------------
>
>                 Key: PIG-1144
>                 URL: https://issues.apache.org/jira/browse/PIG-1144
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>         Environment: Hadoop 20 cluster with multi-node installation
>            Reporter: Viraj Bhat
>            Assignee: Daniel Dai
>             Fix For: 0.7.0
>
>         Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch, PIG-1144-2.patch, PIG-1144-3.patch
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set construct: "set default_parallel 100". I modified "MRPrinter.java" to print out the parallelism:
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
> }
> ...
> {code}
> When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY statement.
> Attaching the script and the explain output.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791120#action_12791120 ] 

Olga Natkovich commented on PIG-1144:
-------------------------------------

I think the code that always sets parallelism to 1 in local mode has been unintentionally removed.
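
For illustration, a minimal sketch of the kind of guard being described, assuming a PigContext is available at the point where parallelism is decided (hypothetical placement and names, not the removed code itself):

{code}
// Hypothetical sketch: in local mode, force a single reducer
// regardless of any PARALLEL clause or default_parallel setting.
private int adjustForLocalMode(PigContext pigContext, int parallelism) {
    if (pigContext.getExecType() == ExecType.LOCAL) {
        return 1;  // local mode always runs with one reducer
    }
    return parallelism;
}
{code}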

> set default_parallelism construct does not set the number of reducers correctly
> -------------------------------------------------------------------------------
>
>                 Key: PIG-1144
>                 URL: https://issues.apache.org/jira/browse/PIG-1144
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>         Environment: Hadoop 20 cluster with multi-node installation
>            Reporter: Viraj Bhat
>            Assignee: Daniel Dai
>             Fix For: 0.6.0
>
>         Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch, PIG-1144-2.patch, PIG-1144-3.patch
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set construct: "set default_parallel 100". I modified "MRPrinter.java" to print out the parallelism:
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
> }
> ...
> {code}
> When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY statement.
> Attaching the script and the explain output.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1144:
----------------------------

    Status: Patch Available  (was: Open)

> set default_parallelism construct does not set the number of reducers correctly
> -------------------------------------------------------------------------------
>
>                 Key: PIG-1144
>                 URL: https://issues.apache.org/jira/browse/PIG-1144
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>         Environment: Hadoop 20 cluster with multi-node installation
>            Reporter: Viraj Bhat
>            Assignee: Daniel Dai
>             Fix For: 0.7.0
>
>         Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch, PIG-1144-2.patch
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set construct: "set default_parallel 100". I modified "MRPrinter.java" to print out the parallelism:
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
> }
> ...
> {code}
> When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY statement.
> Attaching the script and the explain output.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1144:
----------------------------

    Attachment: PIG-1144-2.patch

I think the reason is that the quantile job needs to know how many reducers we are going to use in order to decide which tuples to write into quantilesFile. The number of reducers is a constant field of the plan; we cannot pass -1 and let Hadoop decide the parallelism later. The fix takes default_parallel as that constant if the user does not use the PARALLEL keyword. It applies to both order by and skew join. Merge join and FRJoin are map-only, and regular join was already taken care of in the original code. Attaching the patch again; nothing has changed except for a new test case for skew join.
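
For illustration, here is a minimal sketch of that fallback logic, using a hypothetical helper name rather than the actual patch code:

{code}
// Hypothetical sketch, not the actual PIG-1144 patch: choose the constant
// reducer count that the quantile/sampling job will plan around.
private int resolveParallelism(int requestedParallelism, int defaultParallel) {
    if (requestedParallelism > 0) {
        return requestedParallelism;  // an explicit PARALLEL on the statement wins
    }
    if (defaultParallel > 0) {
        return defaultParallel;       // fall back to "set default_parallel N"
    }
    return 1;                         // otherwise keep the old single-reducer default
}
{code}

The essential point is that the sampler must see a concrete, positive reducer count at compile time, because the quantile boundaries it emits depend on that count.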

> set default_parallelism construct does not set the number of reducers correctly
> -------------------------------------------------------------------------------
>
>                 Key: PIG-1144
>                 URL: https://issues.apache.org/jira/browse/PIG-1144
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>         Environment: Hadoop 20 cluster with multi-node installation
>            Reporter: Viraj Bhat
>            Assignee: Daniel Dai
>             Fix For: 0.7.0
>
>         Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch, PIG-1144-2.patch
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set construct: "set default_parallel 100". I modified "MRPrinter.java" to print out the parallelism:
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
> }
> ...
> {code}
> When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY statement.
> Attaching the script and the explain output.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

Posted by "Viraj Bhat (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Viraj Bhat updated PIG-1144:
----------------------------

    Attachment: brokenparallel.out
                genericscript_broken_parallel.pig

Script and explain output

> set default_parallelism construct does not set the number of reducers correctly
> -------------------------------------------------------------------------------
>
>                 Key: PIG-1144
>                 URL: https://issues.apache.org/jira/browse/PIG-1144
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.7.0
>         Environment: Hadoop 20 cluster with multi-node installation
>            Reporter: Viraj Bhat
>             Fix For: 0.7.0
>
>         Attachments: brokenparallel.out, genericscript_broken_parallel.pig
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set construct: "set default_parallel 100". I modified "MRPrinter.java" to print out the parallelism:
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
> }
> ...
> {code}
> When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY statement.
> Attaching the script and the explain output.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

Posted by "Viraj Bhat (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788436#action_12788436 ] 

Viraj Bhat commented on PIG-1144:
---------------------------------

This happens on a real cluster, where the sorting job did not complete because it ran with a single reducer.

> set default_parallelism construct does not set the number of reducers correctly
> -------------------------------------------------------------------------------
>
>                 Key: PIG-1144
>                 URL: https://issues.apache.org/jira/browse/PIG-1144
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.7.0
>         Environment: Hadoop 20 cluster with multi-node installation
>            Reporter: Viraj Bhat
>             Fix For: 0.7.0
>
>         Attachments: brokenparallel.out, genericscript_broken_parallel.pig
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set construct: "set default_parallel 100". I modified "MRPrinter.java" to print out the parallelism:
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
> }
> ...
> {code}
> When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY statement.
> Attaching the script and the explain output.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791004#action_12791004 ] 

Olga Natkovich commented on PIG-1144:
-------------------------------------

This has performance consequences, so we need to get it into the 0.6.0 release.

> set default_parallelism construct does not set the number of reducers correctly
> -------------------------------------------------------------------------------
>
>                 Key: PIG-1144
>                 URL: https://issues.apache.org/jira/browse/PIG-1144
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>         Environment: Hadoop 20 cluster with multi-node installation
>            Reporter: Viraj Bhat
>            Assignee: Daniel Dai
>             Fix For: 0.6.0
>
>         Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch, PIG-1144-2.patch, PIG-1144-3.patch
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set construct: "set default_parallel 100". I modified "MRPrinter.java" to print out the parallelism:
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
> }
> ...
> {code}
> When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY statement.
> Attaching the script and the explain output.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.