Posted to dev@pig.apache.org by "Alan Gates (JIRA)" <ji...@apache.org> on 2011/03/23 01:54:05 UTC

[jira] [Created] (PIG-1932) GFCross should allow the user to set the DEFAULT_PARALLELISM value

GFCross should allow the user to set the DEFAULT_PARALLELISM value
------------------------------------------------------------------

                 Key: PIG-1932
                 URL: https://issues.apache.org/jira/browse/PIG-1932
             Project: Pig
          Issue Type: Bug
          Components: impl
    Affects Versions: 0.8.0
            Reporter: Alan Gates
            Priority: Minor


The internal UDF GFCross uses a final static int, DEFAULT_PARALLELISM, to determine how wide to spread the records in a cross.  It is currently hard-wired to 96, and there are no comments in the code explaining how that value was chosen.  Despite the name, this value is not necessarily related to the reduce parallelism controlled by the parallel clause.  Instead, it controls how many artificial join key values are generated and how many times each record is duplicated before going through the join.  The higher it is set, the more key values there are (and thus the less likely the cross is to run out of memory), but also the more times each record is duplicated in the map phase before being sent to the reduce phase.
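
To illustrate the trade-off described above, here is a minimal sketch of crossing two inputs through artificial join keys.  It is not the GFCross source: the class and method names are hypothetical, it assumes exactly two inputs and a square grid of keys, and it takes each record's slot as a parameter where a real implementation would presumably pick it at random.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the real GFCross: two inputs, keys laid out on a square grid.
class CrossKeySketch {
    // Side length of the key grid: the numInputs-th root of the parallelism, rounded up.
    static int groupsPerInput(int parallelism, int numInputs) {
        return (int) Math.ceil(Math.pow(parallelism, 1.0 / numInputs));
    }

    // Keys for one record: fix the slot on the record's own axis and
    // emit one copy of the record for every position on the other axis.
    static List<int[]> keysFor(int inputIndex, int slot, int side) {
        List<int[]> keys = new ArrayList<>();
        for (int other = 0; other < side; other++) {
            keys.add(inputIndex == 0 ? new int[] {slot, other} : new int[] {other, slot});
        }
        return keys;
    }

    public static void main(String[] args) {
        int side = groupsPerInput(96, 2);
        System.out.println("grid side = " + side
                + ", key values = " + (side * side)
                + ", copies per record = " + keysFor(0, 3, side).size());
    }
}
```

Under these assumptions, a value of 96 gives a 10 x 10 grid: 100 key values, with each record copied 10 times.  Any record from the first input shares exactly one key with any record from the second, which is what lets an ordinary join produce the full cross.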

We should keep the default value at 96 but allow a property to override it.

We cannot use a constructor argument here because the use of the UDF is not exposed to the user, so the user has no opportunity to pass a constructor argument to it.
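
The property override proposed above could look roughly like the following.  This is only a sketch: the property name is invented for illustration and is not a real Pig setting, and the fallback behavior for missing or invalid values is an assumption.

```java
import java.util.Properties;

// Hypothetical sketch: resolves the cross parallelism from a property,
// falling back to the existing default of 96.
class CrossParallelismConfig {
    static final int DEFAULT_PARALLELISM = 96;
    // Assumed property name for illustration only; not Pig's actual key.
    static final String PROP = "example.cross.parallelism";

    static int resolve(Properties props) {
        String raw = props.getProperty(PROP);
        if (raw == null) {
            return DEFAULT_PARALLELISM;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            // Non-positive values make no sense here, so keep the default.
            return value > 0 ? value : DEFAULT_PARALLELISM;
        } catch (NumberFormatException e) {
            return DEFAULT_PARALLELISM;
        }
    }
}
```

Reading the value from a property rather than a constructor fits the constraint above, since the user never instantiates the UDF directly.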

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (PIG-1932) GFCross should allow the user to set the DEFAULT_PARALLELISM value

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-1932:
----------------------------

    Attachment: PIG-1932_2.patch

The commit-test unit tests pass.  Results of test-patch:

     [exec] -1 overall.
     [exec]
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec]
     [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
     [exec]
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec]
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec]
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec]
     [exec]     -1 release audit.  The applied patch generated 552 release audit warnings (more than the trunk's current 550 warnings).
     [exec]

The release audit warnings are due to the new file and the javadoc changes in GFCross.


[jira] [Updated] (PIG-1932) GFCross should allow the user to set the DEFAULT_PARALLELISM value

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-1932:
----------------------------

    Fix Version/s: 0.9.0
         Assignee: Alan Gates
           Status: Patch Available  (was: Open)


[jira] [Updated] (PIG-1932) GFCross should allow the user to set the DEFAULT_PARALLELISM value

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-1932:
----------------------------

    Status: Open  (was: Patch Available)

Daniel convinced me that I should use the parallelism value from the cross, since what really matters here is how many join groups it creates.  You want to create enough groups to keep each reducer busy.
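
A rough sketch of why the number of groups should follow the cross's parallelism.  The names are hypothetical, the cross is assumed to have two inputs, and groups are assumed to be dealt to reducers round-robin, which real hash partitioning only approximates.

```java
// Hypothetical sketch: how many reducers get work for a given number of join groups.
class CrossGroupSpread {
    // Join groups for a two-input cross: a square grid at least as large as the parallelism.
    static int groupCount(int parallelism) {
        int side = (int) Math.ceil(Math.sqrt(parallelism));
        return side * side;
    }

    // Count reducers that receive at least one group when group g goes to reducer g % reducers.
    static int busyReducers(int groups, int reducers) {
        boolean[] busy = new boolean[reducers];
        int count = 0;
        for (int g = 0; g < groups; g++) {
            int r = g % reducers;
            if (!busy[r]) {
                busy[r] = true;
                count++;
            }
        }
        return count;
    }
}
```

With a fixed value of 96 this yields 100 groups, so a cross run with 200 reducers would leave at least half of them idle.  Deriving the group count from the requested parallelism (for example, 225 groups for 200 reducers) gives every reducer something to do under these assumptions.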


[jira] [Updated] (PIG-1932) GFCross should allow the user to set the DEFAULT_PARALLELISM value

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-1932:
----------------------------

    Status: Patch Available  (was: Open)


[jira] [Commented] (PIG-1932) GFCross should allow the user to set the DEFAULT_PARALLELISM value

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012579#comment-13012579 ] 

Daniel Dai commented on PIG-1932:
---------------------------------

+1


[jira] [Updated] (PIG-1932) GFCross should allow the user to set the DEFAULT_PARALLELISM value

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-1932:
----------------------------

    Attachment: PIG-1932.patch

Unit tests pass.  Results of test-patch:

     [exec] -1 overall.
     [exec]
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec]
     [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
     [exec]
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec]
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec]
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec]
     [exec]     -1 release audit.  The applied patch generated 545 release audit warnings (more than the trunk's current 544 warnings).
     [exec]

The new release audit warning is because I added a file.


[jira] [Updated] (PIG-1932) GFCross should allow the user to set the DEFAULT_PARALLELISM value

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-1932:
----------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Patch 2 checked in.
