Posted to dev@pig.apache.org by "Sergey (JIRA)" <ji...@apache.org> on 2013/08/02 12:51:48 UTC
[jira] [Commented] (PIG-3409) org.apache.pig.data.DefaultTuple hashcode perfomance issue
[ https://issues.apache.org/jira/browse/PIG-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13727559#comment-13727559 ]
Sergey commented on PIG-3409:
-----------------------------
http://bigdatapath.com/wp-content/uploads/2013/08/hash_code_perfomance_issue.png
Here is a VisualVM screenshot taken while running in local mode.
I'm joining 100 MB of data with another 100 MB of data using a replicated join on 4 int fields.
In cluster mode with 18 reducers, 32 cores, and -Xmx set to 3072 MB per task, it takes ~30 min to join 6 GB of data (6 GB/18 per task) with 100 MB of data on four fields.
> org.apache.pig.data.DefaultTuple hashcode perfomance issue
> ----------------------------------------------------------
>
> Key: PIG-3409
> URL: https://issues.apache.org/jira/browse/PIG-3409
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: 0.11
> Reporter: Sergey
> Priority: Critical
> Original Estimate: 3h
> Remaining Estimate: 3h
>
> I've run into a serious performance issue.
> Please see the VisualVM screenshot.
> Here is hashCode implementation from the class:
> {code}
>     @Override
>     public int hashCode() {
>         int hash = 17;
>         for (Iterator<Object> it = mFields.iterator(); it.hasNext();) {
>             Object o = it.next();
>             if (o != null) {
>                 hash = 31 * hash + o.hashCode();
>             }
>         }
>         return hash;
>     }
> {code}
> I don't see any reason here to iterate over the whole tuple, aggregate the hash value, and only then return it.
> I can fix it if it's possible for me to take part in the dev process. I'm new to it :(
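For illustration only, here is a minimal sketch of one possible direction: keep the same per-field hash combination as the quoted method, but cache the result and recompute it only after the tuple changes. Iterating over all fields at least once is hard to avoid, since equal tuples must produce equal hash codes. The class CachedHashTuple and its methods are hypothetical and are not part of Pig. The sketch also assumes that field values are not mutated in place after being stored. Whether caching would actually help this join depends on where the time is really spent.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class for illustration only; it is NOT Pig's DefaultTuple.
final class CachedHashTuple {
    private final List<Object> fields = new ArrayList<>();
    private int cachedHash;     // result of the last full hash computation
    private boolean hashValid;  // false whenever fields changed since then

    void append(Object value) {
        fields.add(value);
        hashValid = false;      // invalidate the cached hash
    }

    void set(int index, Object value) {
        fields.set(index, value);
        hashValid = false;      // invalidate the cached hash
    }

    Object get(int index) {
        return fields.get(index);
    }

    @Override
    public int hashCode() {
        if (!hashValid) {
            // Same combination as the quoted method: seed 17, multiplier 31,
            // null fields skipped.
            int hash = 17;
            for (Object o : fields) {
                if (o != null) {
                    hash = 31 * hash + o.hashCode();
                }
            }
            cachedHash = hash;
            hashValid = true;
        }
        return cachedHash;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof CachedHashTuple)) {
            return false;
        }
        return fields.equals(((CachedHashTuple) other).fields);
    }
}
```

With this shape, repeated lookups of the same key object pay for the iteration once, and any append or set forces a fresh computation on the next hashCode() call.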