You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@pig.apache.org by "Arun C Murthy (JIRA)" <ji...@apache.org> on 2010/02/22 19:20:28 UTC

[jira] Created: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Safe-guards against misconfigured Pig scripts without PARALLEL keyword
----------------------------------------------------------------------

                 Key: PIG-1249
                 URL: https://issues.apache.org/jira/browse/PIG-1249
             Project: Pig
          Issue Type: Improvement
            Reporter: Arun C Murthy
            Priority: Critical


It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.

We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892871#action_12892871 ] 

Olga Natkovich commented on PIG-1249:
-------------------------------------

Hi Jeff, 

Thanks for the quick response. I will review and commit the patch. I am going to add a log statement for the reduce value that has been computed. I will also copy your doc comment from the src to the JIRA to assist our doc writer

> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG-1249_5.patch, PIG_1249_2.patch, PIG_1249_3.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871358#action_12871358 ] 

Alan Gates commented on PIG-1249:
---------------------------------

1. In this code, what happens if a loader is not loading from a file (like an HBase loader)? It looks to me like it will end up throwing an IOException when it tries to stat the 'file' which won't exist and that will cause Pig to die. Ideally in this case it should decide that it cannot make a rational estimate and not try to estimate.

{color:blue}
It won't throw IOException when file doesn't exit, getTotalInputFileSize will return 0 if not loading from file or file doesn't exit. And the final estimated reducer number will be 1.

{color}
{color:red}
Could we add a test to test this?  I think it would be good to assure it works in this situation.  Maybe you could take one of the tests that uses the Hbase loader.
{color}

2. I'm curious where the values of ~1GB per reducer and 999 reducers came from.

{color:blue}
These two numbers is what Hive use, I'm not sure how they came from. Maybe from their experience.
{color}
{color:red}
ok, good enough.  We can adjust them later if we need to.
{color}

3. Does this estimate apply only to the first job or to all jobs?

{color:blue}
It will apply to all the jobs
{color}
{color:red}
Eventually we should change this to do the estimation on the fly in the JobControlCompiler.  Since most queries tend to aggregate data down after a number of steps I suspect that using the initial input to estimate the entire query will mean that the final results are parallelized too widely.  But this is better than the current situation where they aren't parallelized at all.
{color}

4. How does this work in the case of joins, where there are multiple inputs to a job?

{color:blue}
it will estimate the reducer number according the all the inputs files' size
{color}
{color:red}
cool
{color}

So other than testing the non-file case I'm +1 on this patch.


> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249.patch, PIG_1249_2.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Posted by "Jeff Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Zhang updated PIG-1249:
----------------------------

    Status: Patch Available  (was: Open)

> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG-1249_5.patch, PIG_1249_2.patch, PIG_1249_3.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875551#action_12875551 ] 

Hadoop QA commented on PIG-1249:
--------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12446173/PIG-1249-4.patch
  against trunk revision 951229.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 5 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/329/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/329/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/329/console

This message is automatically generated.

> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG_1249_2.patch, PIG_1249_3.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Posted by "Jeff Zhang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838227#action_12838227 ] 

Jeff Zhang commented on PIG-1249:
---------------------------------

+1, And I  find that hive can estimate the reducer number according the input size. This is a really useful feature.

> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Arun C Murthy
>            Priority: Critical
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1249:
--------------------------------

    Release Note: 
In the previous versions of Pig, if the number of reducers was not specified (via PARALLEL or default_parallel), the value of 1 was used which in many cases was not a good choice and caused severe performance problems.

In Pig 0.8.0, a simple heuristic is used to come up with a better number based on the size of the input data. There are several parameters that the user can control:

pig.exec.reducers.bytes.per.reducer - define number of input bytes per reduce; default value is 1000*1000*1000 (1GB)
pig.exec.reducers.max - defines the upper bound on the number of reducers; default is 999

The formula is very simple:

#reducers = MIN (pig.exec.reducers.max, total input size (in bytes) / bytes per reducer.

This is a very simplistic formula that we would need to improve over time. Note, that the computed value takes all inputs within the script into account and applies the computed value to all the jobs within Pig script.

Note that this is not a backward compatible change and set default_parallel to restore the value to 1

  was:
In the previous versions of Pig, if the number of reducers was not specified (via PARALLEL or default_parallelism), the value of 1 was used which in many cases was not a good choice and caused severe performance problems.

In Pig 0.8.0, a simple heuristic is used to come up with a better number based on the size of the input data. There are several parameters that the user can control:

pig.exec.reducers.bytes.per.reducer - define number of input bytes per reduce; default value is 1000*1000*1000 (1GB)
pig.exec.reducers.max - defines the upper bound on the number of reducers; default is 999

The formula is very simple:

#reducers = MIN (pig.exec.reducers.max, total input size (in bytes) / bytes per reducer.

This is a very simplistic formula that we would need to improve over time. Note, that the computed value takes all inputs within the script into account and applies the computed value to all the jobs within Pig script.


> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG-1249_5.patch, PIG_1249_2.patch, PIG_1249_3.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Posted by "Jeff Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Zhang updated PIG-1249:
----------------------------

    Attachment: PIG-1249.patch

The current idea is borrowed from hive, use the input file size to estimate the reducer number.
Two parameters can been set for this purpose
pig.exec.reducers.bytes.per.reducer  // the number of bytes of input for each reducer
pig.exec.reducers.max                          // the max number of reducer number

This only work for hdfs, won't work for other data source such as hbase or cassandra.



> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868893#action_12868893 ] 

Thejas M Nair commented on PIG-1249:
------------------------------------

If default_parallel has not been set, the patch sets a new default number of reducers based on input file sizes.
If the 'input' specified in the load statement is not an hdfs file, it fail to find the file size and default of 1 reduce will be used.

The next steps in automatically determining number of reducers (which can be addressed in separate jiras) are -
1. Determining different number of reducers for each MR job of a pig-query, based on the input size for the MR job.
2. Extending this functionality to load functions that don't take hdfs files as input. We can look at using LoadMetaData.getStatistics() .

Comments on the patch -
If default_parallel is specified, the number of reducers doesn't need to be determined.
{code}

estimateNumberOfReducers(conf,mro);
if (pigContext.defaultParallel > 0)
       conf.set("mapred.reduce.tasks", ""+pigContext.defaultParallel);

{code}
can be changed to 
{code}
if (pigContext.defaultParallel > 0)
     conf.set("mapred.reduce.tasks", ""+pigContext.defaultParallel);
else
     estimateNumberOfReducers(conf,mro);
{code}

Everything else looks good.

Hudson still seems to be having problems. I am currently running unit tests with this patch.



> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Assigned: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Posted by "Jeff Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Zhang reassigned PIG-1249:
-------------------------------

    Assignee: Jeff Zhang

> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1249:
--------------------------------

    Release Note: 
In the previous versions of Pig, if the number of reducers was not specified (via PARALLEL or default_parallelism), the value of 1 was used which in many cases was not a good choice and caused severe performance problems.

In Pig 0.8.0, a simple heuristic is used to come up with a better number based on the size of the input data. There are several parameters that the user can control:

pig.exec.reducers.bytes.per.reducer - define number of input bytes per reduce; default value is 1000*1000*1000 (1GB)
pig.exec.reducers.max - defines the upper bound on the number of reducers; default is 999

The formula is very simple:

#reducers = MIN (pig.exec.reducers.max, total input size (in bytes) / bytes per reducer.

This is a very simplistic formula that we would need to improve over time. Note, that the computed value takes all inputs within the script into account and applies the computed value to all the jobs within Pig script.

> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG-1249_5.patch, PIG_1249_2.patch, PIG_1249_3.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893446#action_12893446 ] 

Olga Natkovich commented on PIG-1249:
-------------------------------------

Comments for the documentation:

+    /**
+     * Currently the estimation of reducer number is only applied to HDFS, The estimation is based on the input size of data storage on HDFS.
+     * Two parameters can been configured for the estimation, one is pig.exec.reducers.max which constrain the maximum number of reducer task (default is 999). The other
+     * is pig.exec.reducers.bytes.per.reducer(default value is 1000*1000*1000) which means the how much data can been handled for each reducer.
+     * e.g. the following is your pig script
+     * a = load '/data/a';
+     * b = load '/data/b';
+     * c = join a by $0, b by $0;
+     * store c into '/tmp';
+     *
+     * The size of /data/a is 1000*1000*1000, and size of /data/b is 2*1000*1000*1000.
+     * Then the estimated reducer number is (1000*1000*1000+2*1000*1000*1000)/(1000*1000*1000)=3


> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG-1249_5.patch, PIG_1249_2.patch, PIG_1249_3.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Posted by "Jeff Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Zhang updated PIG-1249:
----------------------------

               Status: Patch Available  (was: Open)
    Affects Version/s: 0.8.0

> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Posted by "Jeff Zhang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870948#action_12870948 ] 

Jeff Zhang commented on PIG-1249:
---------------------------------

Response to Alan's questions,

   1.  In this code, what happens if a loader is not loading from a file (like an HBase loader)? It looks to me like it will end up throwing an IOException when it tries to stat the 'file' which won't exist and that will cause Pig to die. Ideally in this case it should decide that it cannot make a rational estimate and not try to estimate.
   {color:blue}
             It won't throw IOException when file doesn't exit,  getTotalInputFileSize will return 0 if not loading from file or file doesn't exit. And the final estimated reducer number will be 1.
   {color}
   2. I'm curious where the values of ~1GB per reducer and 999 reducers came from.
   {color:blue}
            These two numbers is what Hive use, I'm not sure how they came from. Maybe from their experience.
   {color}
   3. Does this estimate apply only to the first job or to all jobs?
   {color:blue}
   It will apply to all the jobs
   {color}
   4. How does this work in the case of joins, where there are multiple inputs to a job?
   {color:blue}
   it will estimate the reducer number according the all the inputs files' size 
   {color}


> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249.patch, PIG_1249_2.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Posted by "Jeff Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Zhang updated PIG-1249:
----------------------------

    Status: Open  (was: Patch Available)

> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG-1249_5.patch, PIG_1249_2.patch, PIG_1249_3.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Posted by "Jeff Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Zhang updated PIG-1249:
----------------------------

    Attachment: PIG_1249_2.patch

Clear the logic. 
first check the PARALLE in query, 
if not set, then check the defaultParallel in PigContext, 
if not set, do estimation of reducer number.

> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249.patch, PIG_1249_2.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1249:
--------------------------------

        Status: Resolved  (was: Patch Available)
    Resolution: Fixed

Patch committed. Thanks Jeff!

> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG-1249_5.patch, PIG_1249_2.patch, PIG_1249_3.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-1249:
----------------------------

    Attachment: PIG-1249-4.patch

Patch with merge conflict resolution.

> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG_1249_2.patch, PIG_1249_3.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Posted by "Jeff Zhang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874903#action_12874903 ] 

Jeff Zhang commented on PIG-1249:
---------------------------------

Alan,Thanks for your help.

> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG_1249_2.patch, PIG_1249_3.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Posted by "Jeff Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Zhang updated PIG-1249:
----------------------------

    Status: Open  (was: Patch Available)

> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249.patch, PIG_1249_2.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-1249:
----------------------------

    Status: Open  (was: Patch Available)

The latest patch doesn't apply because of a merge conflict.  I'll attach a patch that addresses this.

> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249.patch, PIG_1249_2.patch, PIG_1249_3.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Posted by "Jeff Zhang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892697#action_12892697 ] 

Jeff Zhang commented on PIG-1249:
---------------------------------

Olga, I generated the patch for the latest trunk. And add doc for in method estimateNumberOfReducers in JobControlCompiler. If you need anything else, feel free to tell me.



> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG-1249_5.patch, PIG_1249_2.patch, PIG_1249_3.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870896#action_12870896 ] 

Alan Gates commented on PIG-1249:
---------------------------------

Questions/Comments:

# In this code, what happens if a loader is not loading from a file (like an HBase loader)?  It looks to me like it will end up throwing an IOException when it tries to stat the 'file' which won't exist and that will cause Pig to die.  Ideally in this case it should decide that it cannot make a rational estimate and not try to estimate.
# I'm curious where the values of ~1GB per reducer and 999 reducers came from.
# Does this estimate apply only to the first job or to all jobs?
# How does this work in the case of joins, where there are multiple inputs to a job?

> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249.patch, PIG_1249_2.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Posted by "Jeff Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Zhang updated PIG-1249:
----------------------------

    Attachment: PIG-1249_5.patch

> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG-1249_5.patch, PIG_1249_2.patch, PIG_1249_3.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868905#action_12868905 ] 

Alan Gates commented on PIG-1249:
---------------------------------

One thing we want to be sure of is that if users explicitly set parallel to 1, we don't override it.  From reviewing the above code it isn't clear to me whether that's the case here or not.

> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892873#action_12892873 ] 

Hadoop QA commented on PIG-1249:
--------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12450579/PIG-1249_5.patch
  against trunk revision 979503.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 5 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/359/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/359/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/359/console

This message is automatically generated.

> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG-1249_5.patch, PIG_1249_2.patch, PIG_1249_3.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868931#action_12868931 ] 

Dmitriy V. Ryaboy commented on PIG-1249:
----------------------------------------

This is a good spot to leverage pinning options that were added to operators a while back. The parser would pin the parallel option if it encounters the PARALLEL keyword, and the code in this patch wouldn't get invoked unless parallelism is not pinned.

> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891789#action_12891789 ] 

Olga Natkovich commented on PIG-1249:
-------------------------------------

Jeff, sorry this patch did not get much attention in a while. Can I ask you to do the following:

(1) Regenrate the patch for the latest trunk and make sure that the tests are passing and we get no additional warnings
(2) Add a docs comment that describes in one place what are the exact heuristics, when they are applied and how they can be influenced. I will ask our doc writer to incorporate this information in Pig 0.8.0 documentation
(3) If it is not already done, can we log the value that will be used so that the user knows what is happenning

Thanks!

> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG_1249_2.patch, PIG_1249_3.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Posted by "Jeff Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Zhang updated PIG-1249:
----------------------------

    Attachment: PIG_1249_3.patch

Update the patch, including testcase of non-dfs input and do path checking when doing estimation



> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249.patch, PIG_1249_2.patch, PIG_1249_3.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872322#action_12872322 ] 

Alan Gates commented on PIG-1249:
---------------------------------

+1, new test looks good.

Hudson is still having troubles.  We should run the "ant test" and "ant test-patch" directives manually.

> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249.patch, PIG_1249_2.patch, PIG_1249_3.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-1249:
----------------------------

    Status: Patch Available  (was: Open)

> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG_1249_2.patch, PIG_1249_3.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872196#action_12872196 ] 

Hadoop QA commented on PIG-1249:
--------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12445559/PIG_1249_3.patch
  against trunk revision 948526.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 5 new or modified tests.

    -1 patch.  The patch command could not apply the patch.

Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h1.grid.sp2.yahoo.net/6/console

This message is automatically generated.

> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249.patch, PIG_1249_2.patch, PIG_1249_3.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891788#action_12891788 ] 

Olga Natkovich commented on PIG-1249:
-------------------------------------

Ashutosh,

First, the changes are not going to be in framework till Hadoop 22 and I don't think we want to wait that far as we are seeing quite a few problems on our cluster. Second, I think we want to take a direction with pig of setting things up for users. Of course, we don't have stats right now to do so accurately but I think this is a step in the right direction

> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG_1249_2.patch, PIG_1249_3.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1249:
--------------------------------

    Fix Version/s: 0.8.0

> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Arun C Murthy
>            Priority: Critical
>             Fix For: 0.8.0
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887283#action_12887283 ] 

Ashutosh Chauhan commented on PIG-1249:
---------------------------------------

Map-reduce framework has a jira related to this issue.  https://issues.apache.org/jira/browse/MAPREDUCE-1521 It has two implications for Pig:

1) We need to reconsider whether we still want Pig to set number of reducers on user's behalf. We can choose not to "intelligently" choose # of reducers and let framework fail the  job which doesn't "correctly" specify # of reducers. Then, Pig is out of this guessing game and users are forced by framework to correctly specify # of reducers. 

2) Now that MR framework will fail the job based on configured limits, operators where Pig does compute and set number of reducers (like skewed join etc.) should now be aware of those limits so that # of reducers computed by them fall within those limits.

> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG_1249_2.patch, PIG_1249_3.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

Posted by "Jeff Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Zhang updated PIG-1249:
----------------------------

    Status: Patch Available  (was: Open)

> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249.patch, PIG_1249_2.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.