Posted to dev@pig.apache.org by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org> on 2012/08/23 00:06:42 UTC

[jira] [Created] (PIG-2888) Improve performance of POPartialAgg

Dmitriy V. Ryaboy created PIG-2888:
--------------------------------------

             Summary: Improve performance of POPartialAgg
                 Key: PIG-2888
                 URL: https://issues.apache.org/jira/browse/PIG-2888
             Project: Pig
          Issue Type: Improvement
            Reporter: Dmitriy V. Ryaboy


During performance testing, we found that POPartialAgg can cause performance degradation for Pig jobs when the Algebraic UDFs it's being applied to aren't well suited to the operator's assumptions. Changing the implementation to a more flexible hash-based model can provide significant performance improvements.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2888) Improve performance of POPartialAgg

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13442613#comment-13442613 ] 

Dmitriy V. Ryaboy commented on PIG-2888:
----------------------------------------

None of the PigMix queries hit the particular bad behavior this is meant to address. I've verified that the speed is on par with the previous implementation for those "good" use cases.

Here is a script for which Pig with this patch finishes in 57 seconds, while without the patch, it takes 13 mins 48 secs:

{code}
rmf tmp/delme
l = load 'data.txt';
x = foreach l generate $0 as l, (int) (RANDOM() * 10000) as num; 
g = foreach (group x by num % 100) { d = distinct x.num; generate SUM(d); }
store g into 'tmp/delme';
{code}

The data file contains about 7 million rows, 1 letter each.
This is an intentionally skewed example, but we've encountered similar problems with real data, particularly when grouping by high-cardinality columns like user_id and subsequently performing algebraic operations on nested distincts.
                

[jira] [Updated] (PIG-2888) Improve performance of POPartialAgg

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy updated PIG-2888:
-----------------------------------

    Attachment: partialagg_patch_5.patch
    

[jira] [Updated] (PIG-2888) Improve performance of POPartialAgg

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy updated PIG-2888:
-----------------------------------

    Attachment: partialagg_patch_1.patch

The attached patch is an initial pass at this implementation. Reading it as a diff may be hard, since about 70% of the code in POPartialAgg changed; I recommend applying it to a git branch and looking at the class directly.

I have not implemented memory-based triggering yet; for now it relies on hardcoded limits on the number of tuples in the caches.

I have also not implemented the functionality to automatically turn off hash-based aggregation.

Tests (except the memory setting related tests) pass.

Test runs on synthetic data both in local mode and on a cluster produced correct data.

Cluster runs indicate significant improvement in overall speed of execution when using this approach.
                

        

[jira] [Updated] (PIG-2888) Improve performance of POPartialAgg

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy updated PIG-2888:
-----------------------------------

    Attachment: partialagg_patch_3.patch

Minor logging and spill perf improvements (reusing the iterator, forcing an agg if any list gets too big, being slightly more clever about hashmap sizing).
                

[jira] [Updated] (PIG-2888) Improve performance of POPartialAgg

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy updated PIG-2888:
-----------------------------------

    Attachment: PIG-2888.final.patch

Attaching final patch, committing under Julien's +1.

Three changes added:
1) Missed a test class earlier which needed the static string moved to PigConfiguration. Done now.

2) A slight change to build.xml to ensure junit3 comes before junit4 in the test classpath. Otherwise the build occasionally failed to compile, depending on environment.

3) An unrelated fix to TestPOCast, which was failing and thus preventing me from passing test-commit.


                

[jira] [Commented] (PIG-2888) Improve performance of POPartialAgg

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443793#comment-13443793 ] 

Dmitriy V. Ryaboy commented on PIG-2888:
----------------------------------------

bq. There's a "pig.exec.nocombiner" that was not replaced by a constant.

Fixed.

bq. It would be nice to have a consistent way of getting booleans (and floats) from the conf

Feels like scope creep... maybe in another ticket? I don't want to get into how to design that around Properties, Configurations, and PigConfigurations.

bq. some of the class description was still applicable
Added better docs.

bq. what is the reason for this particular value?

Bad math :). Fixed the math and added an explanation of how I got there.

bq. Don't you want a visitor to just list them all once and set the count? That way you would not have to worry about keeping a reference on them.

I could do that, but this feels much cleaner -- no visitors, no serialization, no changes to the MRCompiler/JCCompiler, very self-contained, and works at runtime instead of having to be preset by the planner.
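To illustrate the self-registration idea in the reply above, here is a minimal sketch (not the code from the patch) of operators registering themselves in a weak-keyed map, so the count of live instances can be read at runtime without pinning them in memory. The class name LiveOperatorRegistry and its methods are hypothetical, and plain Object stands in for POPartialAgg; the patch itself is quoted in this thread as using a static WeakHashMap<POPartialAgg, Byte>.

```java
import java.util.Map;
import java.util.WeakHashMap;

/** Hypothetical registry; plain Object stands in for POPartialAgg. */
final class LiveOperatorRegistry {
    // Weak keys: an entry disappears once its operator is no longer strongly reachable.
    private final Map<Object, Byte> live = new WeakHashMap<>();

    /** Called by an operator on itself when it is constructed or first used. */
    synchronized void register(Object operator) {
        live.put(operator, (byte) 0); // the value is a placeholder and is never read
    }

    /** Number of registered operators that have not been garbage collected. */
    synchronized int liveCount() {
        return live.size();
    }

    public static void main(String[] args) {
        LiveOperatorRegistry registry = new LiveOperatorRegistry();
        Object[] operators = { new Object(), new Object(), new Object() };
        for (Object op : operators) {
            registry.register(op);
        }
        System.out.println(registry.liveCount() + " of " + operators.length + " operators live");
    }
}
```

If set semantics are preferred, java.util.Collections.newSetFromMap wrapped around a WeakHashMap gives a weak set without needing a placeholder value.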

bq. +0.5 so that it is never 0 ? Math.max(1, ...) is more readable.

No, +0.5 so that it's a round() instead of floor()
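A small sketch of the distinction being made here, assuming non-negative inputs: casting a double to int truncates, adding 0.5 before the cast rounds to nearest, and neither guarantees a non-zero result. The class and method names are hypothetical and are not taken from the patch.

```java
/** Hypothetical helpers contrasting truncation, rounding, and a floor of 1. */
final class ThresholdRounding {
    /** Plain cast: drops the fractional part (same as floor for non-negative x). */
    static int truncate(double x) {
        return (int) x;
    }

    /** The idiom from the quoted code: rounds a non-negative x to the nearest int. */
    static int roundHalfUp(double x) {
        return (int) (0.5 + x);
    }

    /** Rounding alone can still yield 0; Math.max adds an explicit floor of 1. */
    static int roundAtLeastOne(double x) {
        return Math.max(1, roundHalfUp(x));
    }

    public static void main(String[] args) {
        System.out.println(truncate(2.7));        // fractional part dropped
        System.out.println(roundHalfUp(2.7));     // rounded to nearest
        System.out.println(roundHalfUp(0.3));     // rounding does not prevent zero
        System.out.println(roundAtLeastOne(0.3)); // explicit lower bound
    }
}
```

So the +0.5 changes which integer a fractional threshold lands on, while a guaranteed minimum would need a separate Math.max(1, ...) around the expression.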

bq. LOG.info() should be wrapped in if (LOG.isInfoEnabled()) { ... } for perf
Done for places where it matters (functions invoked more than once and messages where args are not constants)

bq. in aggregateSecondLevel() can't the processedInputMap be reused?

No -- aggregate() adds to the list of tuples in the target map, we want to overwrite in this case.

bq. in getMinOutputReductionFromProp(), if minReduction <= 0 it should throw an exception.

Added a log message instead. 
                

[jira] [Commented] (PIG-2888) Improve performance of POPartialAgg

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439904#comment-13439904 ] 

Dmitriy V. Ryaboy commented on PIG-2888:
----------------------------------------

The current implementation makes two key assumptions that are frequently violated in real-life datasets and scripts:

1) The intermediate UDF is cheap to invoke
2) Records come in mostly-grouped order (records with the same key tend to follow each other).

When condition 2 is not satisfied, POPartialAgg winds up calling the intermediate UDF on all accumulated values so far for a given key, plus a new tuple, for every single tuple it sees. This causes a significant performance degradation.

Instead, we propose accumulating tuples across the board until a memory threshold is reached. Once this threshold is reached, all keys and tuples are fed into the intermediate UDF and the results are put into a second-level map (presumably having been significantly shrunk by the intermediate UDF). This repeats until the second-level map hits its threshold, at which point *it* is summarized and its values replaced with the aggregated ones. If after such a reduction the memory occupied by the hashmap is still near the threshold, the results are returned to the regular MR pipeline.
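As a rough illustration of the proposal above, here is a minimal sketch and not the patch itself. It makes several simplifying assumptions: thresholds are tuple counts rather than memory estimates, a long sum stands in for the intermediate UDF, "still near the threshold" is reduced to "still at or above it", and a plain map stands in for the downstream MR pipeline. All class, field, and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not Pig code: a long sum stands in for the intermediate UDF. */
final class TwoTierPartialAggSketch {
    private final int firstTierThreshold;   // raw values buffered before a first-level pass
    private final int secondTierThreshold;  // partial results buffered before a second-level pass
    private final Map<String, List<Long>> rawMap = new HashMap<>();
    private final Map<String, List<Long>> processedMap = new HashMap<>();
    // Stands in for the downstream MR pipeline, which merges whatever the map side emits.
    private final Map<String, Long> emitted = new HashMap<>();
    private int rawCount = 0;
    private int processedCount = 0;

    TwoTierPartialAggSketch(int firstTierThreshold, int secondTierThreshold) {
        this.firstTierThreshold = firstTierThreshold;
        this.secondTierThreshold = secondTierThreshold;
    }

    /** Buffer one input value; aggregate only when the first-tier threshold is hit. */
    void add(String key, long value) {
        rawMap.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        if (++rawCount >= firstTierThreshold) {
            aggregateFirstLevel();
        }
    }

    private static long combine(List<Long> values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    /** Run the stand-in UDF once per key and append each result to the second-level map. */
    private void aggregateFirstLevel() {
        for (Map.Entry<String, List<Long>> e : rawMap.entrySet()) {
            processedMap.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(combine(e.getValue()));
            processedCount++;
        }
        rawMap.clear();
        rawCount = 0;
        if (processedCount >= secondTierThreshold) {
            aggregateSecondLevel();
        }
    }

    /** Overwrite each key's partial results with a single combined value. */
    private void aggregateSecondLevel() {
        for (Map.Entry<String, List<Long>> e : processedMap.entrySet()) {
            List<Long> single = new ArrayList<>();
            single.add(combine(e.getValue()));
            e.setValue(single);
        }
        processedCount = processedMap.size();
        if (processedCount >= secondTierThreshold) {
            spill(); // still at the threshold after reduction: hand results downstream
        }
    }

    private void spill() {
        for (Map.Entry<String, List<Long>> e : processedMap.entrySet()) {
            for (long v : e.getValue()) {
                emitted.merge(e.getKey(), v, Long::sum);
            }
        }
        processedMap.clear();
        processedCount = 0;
    }

    /** Flush both tiers at end of input and return the merged downstream view. */
    Map<String, Long> finish() {
        aggregateFirstLevel();
        aggregateSecondLevel();
        spill();
        return emitted;
    }

    public static void main(String[] args) {
        TwoTierPartialAggSketch agg = new TwoTierPartialAggSketch(1000, 300);
        for (int i = 0; i < 10000; i++) {
            agg.add("k" + (i % 100), 1L); // keys arrive interleaved, not grouped
        }
        System.out.println(agg.finish().get("k7"));
    }
}
```

The point of the design is that the stand-in UDF runs once per key per batch rather than once per incoming tuple, so interleaved keys no longer trigger repeated re-aggregation.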
                

        

[jira] [Updated] (PIG-2888) Improve performance of POPartialAgg

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy updated PIG-2888:
-----------------------------------

    Attachment: partialagg_patch_6.patch

Same, but with the offending test removed.
                

[jira] [Assigned] (PIG-2888) Improve performance of POPartialAgg

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy reassigned PIG-2888:
--------------------------------------

    Assignee: Dmitriy V. Ryaboy
    

        

[jira] [Updated] (PIG-2888) Improve performance of POPartialAgg

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy updated PIG-2888:
-----------------------------------

       Resolution: Fixed
    Fix Version/s: 0.11
           Status: Resolved  (was: Patch Available)
    

[jira] [Updated] (PIG-2888) Improve performance of POPartialAgg

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy updated PIG-2888:
-----------------------------------

    Attachment: partialagg_patch_2.patch

Attaching a second version. It's ready for review.

This takes care of memory estimation (and actually looks at number of operators, doesn't just hardcode a magic "3"), and turns off if reduction is insufficient.
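The two behaviors mentioned above can be sketched roughly as follows. This is an illustration under stated assumptions, not the logic in the patch: the memory budget is assumed to be split evenly across live operators, and "reduction" is assumed to mean the ratio of tuples in to tuples out. Every name here is hypothetical.

```java
/** Hypothetical helpers; the even split and the ratio test are assumptions, not Pig's code. */
final class PartialAggHeuristicsSketch {
    /** Share of a total memory budget for one operator, given how many are live. */
    static long perOperatorBudget(long totalBudgetBytes, int liveOperators) {
        return totalBudgetBytes / Math.max(1, liveOperators); // avoid dividing by zero
    }

    /** Keep aggregating only if input was shrunk by at least the required factor. */
    static boolean reductionIsSufficient(long tuplesIn, long tuplesOut, double minReduction) {
        if (tuplesOut == 0) {
            return true; // nothing emitted yet, so there is no evidence of poor reduction
        }
        return (double) tuplesIn / tuplesOut >= minReduction;
    }

    public static void main(String[] args) {
        System.out.println(perOperatorBudget(300L, 3));
        System.out.println(reductionIsSufficient(1000L, 50L, 10.0));
        System.out.println(reductionIsSufficient(1000L, 500L, 10.0));
    }
}
```

When the check fails, the operator would stop buffering and pass tuples straight through, leaving aggregation to the later stages.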

Would love to get third-party verification of the speed improvements. Maybe someone who has recent PigMix results can rerun with this patch?

One of the test cases (TestPOPartialAgg.testPartialMultiInput1HashMemEmpty) still fails, because it assumes that even if no memory is allocated to internal cached bags, consecutive keys still get aggregated. That's an assumption that's pretty specific to the old implementation. Does anyone think that feature is critical? If not, I would like to remove the test.
                

[jira] [Commented] (PIG-2888) Improve performance of POPartialAgg

Posted by "Julien Le Dem (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13442974#comment-13442974 ] 

Julien Le Dem commented on PIG-2888:
------------------------------------

Awesome. Stop now or it will start to be negative soon.
Comments:
* There's a "pig.exec.nocombiner" that was not replaced by a constant.
* It would be nice to have a consistent way of getting booleans (and floats) from the conf. Something like:
{noformat}
// Sketch of a static helper on PigConfiguration; a missing key reads as false.
public static boolean getBoolean(Properties p, String key) {
  return "true".equals(p.getProperty(key, "false"));
}
{noformat}
* some of the class description was still applicable
{noformat}
/**
 * Do partial aggregation in map plan. It uses a hash-map to aggregate. 
 * ...
 */
 public class POPartialAgg extends PhysicalOperator {
{noformat}
* what is the reason for this particular value?
{noformat}
 private static final int MAX_LIST_SIZE = 1 << 13 - 1;
{noformat}
* It looks like this could be a HashSet as the value never gets used (but there's no WeakHashSet so I guess I got my answer). It could as well be WeakHashMap<POPartialAgg, ?>. Don't you want a visitor to just list them all once and set the count? That way you would not have to worry about keeping a reference on them.
{noformat}
private static final WeakHashMap<POPartialAgg, Byte> ALL_POPARTS = new WeakHashMap<POPartialAgg, Byte>();
{noformat}
* +0.5 so that it is never 0 ? Math.max(1, ...) is more readable.
{noformat}
 firstTierThreshold = (int) (0.5 + totalTuples * (1f - (1f / sizeReduction)));
 secondTierThreshold = (int) (0.5 + totalTuples *  (1f / sizeReduction));
{noformat}
* LOG.info() should be wrapped in if (LOG.isInfoEnabled()) { ... } for perf
* in aggregateSecondLevel() can't the processedInputMap be reused?
* in getMinOutputReductionFromProp(), if minReduction <= 0 it should throw an exception.
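On the MAX_LIST_SIZE item in the list above: in Java, additive operators bind tighter than shift operators, so the constant as quoted evaluates as 1 << (13 - 1). The thread does not say which value was intended; the short sketch below only shows what each spelling produces. The class and constant names are hypothetical.

```java
/** Shows how Java parses the quoted constant versus an explicitly parenthesized form. */
final class ShiftPrecedence {
    static final int AS_WRITTEN = 1 << 13 - 1;    // parsed as 1 << (13 - 1)
    static final int SHIFT_FIRST = (1 << 13) - 1; // shift first, then subtract

    public static void main(String[] args) {
        System.out.println("1 << 13 - 1   = " + AS_WRITTEN);
        System.out.println("(1 << 13) - 1 = " + SHIFT_FIRST);
    }
}
```

The first form yields 4096 and the second yields 8191, so parentheses are worth adding whenever a shift and an arithmetic operator appear in the same expression.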


                

[jira] [Updated] (PIG-2888) Improve performance of POPartialAgg

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy updated PIG-2888:
-----------------------------------

    Status: Patch Available  (was: Open)
    

[jira] [Commented] (PIG-2888) Improve performance of POPartialAgg

Posted by "Julien Le Dem (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13444279#comment-13444279 ] 

Julien Le Dem commented on PIG-2888:
------------------------------------

+1
                

[jira] [Updated] (PIG-2888) Improve performance of POPartialAgg

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy updated PIG-2888:
-----------------------------------

    Attachment: partialagg_patch_4.patch

Significant improvements to transitions from raw to processed map. Better mem utilization estimation. Better logging.

While profiling, also noticed an inordinate amount of time being spent in Distinct$Initial's bag registration, fixed that.

The task that I cited earlier as taking 57 seconds with this patch? It now takes 30 seconds. I also saw a 40% speed improvement over the older version of this patch on a production job.

Please review :).
                