Posted to dev@hive.apache.org by "Siying Dong (JIRA)" <ji...@apache.org> on 2010/11/19 21:10:13 UTC

[jira] Created: (HIVE-1802) Encode MapReduce Shuffling Keys Differently for Single string/bigint Key

Encode MapReduce Shuffling Keys Differently for  Single string/bigint Key
-------------------------------------------------------------------------

                 Key: HIVE-1802
                 URL: https://issues.apache.org/jira/browse/HIVE-1802
             Project: Hive
          Issue Type: Improvement
            Reporter: Siying Dong
            Assignee: Siying Dong


Delimiters are not needed when there is only one shuffling key, and without delimiters there is nothing to escape either. We can save some CPU time on serialization and shuffle slightly less data, which reduces memory footprint and network traffic.

There is also a bug in group-by: we mistakenly append a -1 to the end of the key and pay for one more unnecessary mem-copy. This can be fixed easily.
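
To make the saving concrete, here is a minimal Java sketch that compares a delimiter-safe string encoding with a raw single-key encoding. The escape scheme (0x00 and 0x01 escaped with a 0x01 prefix, 0x00 as terminator) and all class and method names are assumptions for illustration, not code from the patch; null handling is left out.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class SingleKeyEncoding {
    // Assumed delimiter-safe encoding: escape 0x00 and 0x01, then terminate with 0x00.
    static byte[] encodeDelimited(String key) {
        byte[] utf8 = key.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream(utf8.length + 1);
        for (byte b : utf8) {
            if (b == 0 || b == 1) {
                out.write(1);      // escape marker
                out.write(b + 1);  // 0x00 -> 0x01 0x01, 0x01 -> 0x01 0x02
            } else {
                out.write(b);
            }
        }
        out.write(0);              // field terminator
        return out.toByteArray();
    }

    // Single-key encoding: the raw UTF-8 bytes, no escaping and no terminator.
    static byte[] encodeSingle(String key) {
        return key.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String key = "a\u0000b";
        System.out.println(encodeDelimited(key).length); // 5
        System.out.println(encodeSingle(key).length);    // 3
    }
}
```

With a single key, the per-byte escape check and the terminator byte both disappear, which is where the CPU and shuffle-size savings would come from.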


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-1802) Encode MapReduce Shuffling Keys Differently for Single string/bigint Key

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964932#action_12964932 ] 

Siying Dong commented on HIVE-1802:
-----------------------------------

Yongqiang, after some face-to-face discussion, are you OK with going with this approach now?



[jira] Commented: (HIVE-1802) Encode MapReduce Shuffling Keys Differently for Single string/bigint Key

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935045#action_12935045 ] 

Siying Dong commented on HIVE-1802:
-----------------------------------

I still don't get it. Isn't that what this patch is doing?



[jira] Updated: (HIVE-1802) Encode MapReduce Shuffling Keys Differently for Single string/bigint Key

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach updated HIVE-1802:
---------------------------------

    Status: Open  (was: Patch Available)

The patch doesn't apply cleanly to trunk. Can you please rebase it? Thanks.



[jira] Commented: (HIVE-1802) Encode MapReduce Shuffling Keys Differently for Single string/bigint Key

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934957#action_12934957 ] 

He Yongqiang commented on HIVE-1802:
------------------------------------

For a single Text key in a join, I think your patch still needs an array copy. For a single Text key in a group-by, the array copy is not needed.

I mean the new code would only handle a single Text key in group-by, where we can avoid the array copy.

For the other cases, maybe we can optimize BinarySortableSerDe to use array copy instead of write()?



[jira] Updated: (HIVE-1802) Encode MapReduce Shuffling Keys Differently for Single string/bigint Key

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siying Dong updated HIVE-1802:
------------------------------

    Attachment: HIVE-1802.2.patch

Refactored PlanUtils a little. I couldn't come up with a straightforward way to refactor it into a factory and keep it clear. Instead, I tried to break PlanUtils up into several classes, grouping the methods by whether they return TableDesc, ReduceSinkDesc, or something else. Hopefully it is easier to maintain this way.

No functional change from the previous patch.



[jira] Commented: (HIVE-1802) Encode MapReduce Shuffling Keys Differently for Single string/bigint Key

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934778#action_12934778 ] 

Siying Dong commented on HIVE-1802:
-----------------------------------

Yongqiang, I didn't quite get it. The single-key case applies to both group-by and join, and those are the only two cases we are processing. We are avoiding the array copy in those cases. That is exactly what we are doing here.

Are you suggesting we should optimize the other cases too? It would be nice if we could. I couldn't come up with a way to let BinarySortableSerDe use array copy. The problem is that to make the binary sort order match the key order, we need a delimiter, and to have a delimiter, strings must be encoded so that the delimiter is escaped. Any better ideas?
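
The ordering argument can be illustrated with a small Java sketch. Without a delimiter, the key pairs ("a", "bc") and ("ab", "c") serialize to the same bytes; with an escaped terminator they stay distinct and compare in key order. The escape scheme and all names here are assumptions for illustration, not the actual BinarySortableSerDe code.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class MultiKeyOrdering {
    // Naive: concatenate the raw UTF-8 bytes of each key with no delimiter.
    static byte[] concatRaw(String... keys) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String k : keys) {
            byte[] b = k.getBytes(StandardCharsets.UTF_8);
            out.write(b, 0, b.length);
        }
        return out.toByteArray();
    }

    // Assumed scheme: escape 0x00/0x01 inside each key, terminate each key with 0x00.
    static byte[] concatEscaped(String... keys) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String k : keys) {
            for (byte b : k.getBytes(StandardCharsets.UTF_8)) {
                if (b == 0 || b == 1) {
                    out.write(1);
                    out.write(b + 1);
                } else {
                    out.write(b);
                }
            }
            out.write(0);
        }
        return out.toByteArray();
    }

    // Unsigned lexicographic comparison, as a raw byte comparator would do.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // Different key pairs collide when no delimiter is written.
        System.out.println(Arrays.equals(concatRaw("a", "bc"), concatRaw("ab", "c")));
        // With the terminator, ("a", "bc") sorts before ("ab", "c").
        System.out.println(compareBytes(concatEscaped("a", "bc"), concatEscaped("ab", "c")) < 0);
    }
}
```

Because every byte has to be inspected for escaping, this kind of encoding does not lend itself to a single bulk array copy.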



[jira] Updated: (HIVE-1802) Encode MapReduce Shuffling Keys Differently for Single string/bigint Key

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siying Dong updated HIVE-1802:
------------------------------

    Status: Patch Available  (was: Open)



[jira] Updated: (HIVE-1802) Encode MapReduce Shuffling Keys Differently for Single string/bigint Key

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-1802:
-----------------------------

    Status: Open  (was: Patch Available)



[jira] Commented: (HIVE-1802) Encode MapReduce Shuffling Keys Differently for Single string/bigint Key

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965021#action_12965021 ] 

He Yongqiang commented on HIVE-1802:
------------------------------------

Yes, I am OK with the current approach.

(Moving forward, we still need to figure out a better way that is easier to maintain and extend. For example, we may want to separate the serdes used for group-by and join. If we do that with the current approach, we will need 4 serdes for reduce-sink.)



[jira] Commented: (HIVE-1802) Encode MapReduce Shuffling Keys Differently for Single string/bigint Key

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934380#action_12934380 ] 

Namit Jain commented on HIVE-1802:
----------------------------------

The code looks OK, but it is not very easy to add new serdes this way.
Can you refactor the PlanUtils change into a factory, so that further changes are easy to add?
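
One possible shape for such a factory is a list of rules tried in order, with BinarySortableSerDe as the fallback. This is only a sketch of the idea: the class, the rule representation, and the serde names SingleStringKeySerDe and SingleBigintKeySerDe are hypothetical, not names taken from the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class KeySerDeFactory {
    // One rule: a predicate over the key column types and the serde to use on a match.
    private static final class Rule {
        final Predicate<List<String>> matches;
        final String serdeName;

        Rule(Predicate<List<String>> matches, String serdeName) {
            this.matches = matches;
            this.serdeName = serdeName;
        }
    }

    private final List<Rule> rules = new ArrayList<>();
    private final String fallback;

    KeySerDeFactory(String fallback) {
        this.fallback = fallback;
    }

    // Adding a new specialized serde is one register() call.
    KeySerDeFactory register(Predicate<List<String>> matches, String serdeName) {
        rules.add(new Rule(matches, serdeName));
        return this;
    }

    // The first matching rule wins; otherwise use the general-purpose serde.
    String choose(List<String> keyTypes) {
        for (Rule r : rules) {
            if (r.matches.test(keyTypes)) {
                return r.serdeName;
            }
        }
        return fallback;
    }

    // Hypothetical default rules for the two single-key cases discussed in this issue.
    static KeySerDeFactory defaults() {
        return new KeySerDeFactory("BinarySortableSerDe")
            .register(t -> t.size() == 1 && t.get(0).equals("string"), "SingleStringKeySerDe")
            .register(t -> t.size() == 1 && t.get(0).equals("bigint"), "SingleBigintKeySerDe");
    }

    public static void main(String[] args) {
        KeySerDeFactory f = defaults();
        System.out.println(f.choose(Arrays.asList("string")));
        System.out.println(f.choose(Arrays.asList("string", "bigint")));
    }
}
```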



[jira] Updated: (HIVE-1802) Encode MapReduce Shuffling Keys Differently for Single string/bigint Key

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siying Dong updated HIVE-1802:
------------------------------

    Status: Patch Available  (was: Open)



[jira] Commented: (HIVE-1802) Encode MapReduce Shuffling Keys Differently for Single string/bigint Key

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934382#action_12934382 ] 

He Yongqiang commented on HIVE-1802:
------------------------------------

I think we only need serialize here, no? Can we make it simpler? I mean only handling the case where there is a single key of type Text, and only for group-by. In that case we can avoid an array copy.

But if it is a join, or there are multiple keys in the group-by, we need the array copy anyway. The problem with BinarySortableSerDe is that it uses write() to write bytes. Can we make BinarySortableSerDe use array copy instead? Maybe we can use some Java NIO classes, like ByteBuffer?
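
The difference between per-byte write() calls and a bulk copy can be sketched in plain Java as follows. It uses only java.io.ByteArrayOutputStream and java.nio.ByteBuffer; the class and method names are hypothetical, and this is not how BinarySortableSerDe is actually structured.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class BulkCopySketch {
    // Byte at a time, the way a stream-oriented serializer might emit a value.
    static byte[] writePerByte(byte[] src) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(src.length);
        for (byte b : src) {
            out.write(b);
        }
        return out.toByteArray();
    }

    // Bulk: ByteBuffer.put(byte[], int, int) copies the whole range in one call.
    static byte[] writeBulk(byte[] src) {
        ByteBuffer buf = ByteBuffer.allocate(src.length);
        buf.put(src, 0, src.length);
        return buf.array(); // backing array of a heap buffer, exactly src.length bytes
    }

    public static void main(String[] args) {
        byte[] key = "some-key".getBytes(StandardCharsets.UTF_8);
        // Both paths produce the same bytes; only the copying strategy differs.
        System.out.println(Arrays.equals(writePerByte(key), writeBulk(key)));
    }
}
```

A bulk copy only applies when the bytes can be written unmodified; an encoding that has to inspect each byte for escaping cannot use it directly.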



[jira] Commented: (HIVE-1802) Encode MapReduce Shuffling Keys Differently for Single string/bigint Key

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934980#action_12934980 ] 

Siying Dong commented on HIVE-1802:
-----------------------------------

Previously, any group-by needed 2 mem-copies: one from the Text objects to the buffer, and one to add an extra tag to the end of the buffer.
Now, a single Text key takes no mem-copy (except the first byte is 0), and multiple keys need one (from the Text objects to the buffer).

Previously, a join needed 2 mem-copies: one from Text to the buffer, and one to add the tag.
Now, a single Text key needs one copy from the buffer to add the tag. The other cases still need two copies.
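
The copy that a tag costs can be pictured like this: appending one byte to a key means allocating a buffer one byte longer and copying the key into it. A minimal sketch, with hypothetical names and a plain byte[] standing in for the bytes held by the Text object:

```java
class TagAppendSketch {
    // Copies the first `length` bytes of the key into a new buffer and appends the tag.
    // This is the single copy a tagged key still has to pay for.
    static byte[] keyWithTag(byte[] key, int length, byte tag) {
        byte[] out = new byte[length + 1];
        System.arraycopy(key, 0, out, 0, length);
        out[length] = tag;
        return out;
    }

    public static void main(String[] args) {
        byte[] backing = {10, 20, 30, 99}; // only the first 3 bytes are valid key data
        byte[] tagged = keyWithTag(backing, 3, (byte) 1);
        System.out.println(tagged.length); // 4
    }
}
```

When no tag is appended, the key bytes can be handed over as they are, which is why dropping the stray trailing byte in group-by removes a copy.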



[jira] Updated: (HIVE-1802) Encode MapReduce Shuffling Keys Differently for Single string/bigint Key

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siying Dong updated HIVE-1802:
------------------------------

    Attachment: HIVE-1802.1.patch

1. Two more SerDes, used only for encoding a single string key and a single bigint key, respectively.
2. When generating the reduce plan, identify the single string and single bigint cases and write the corresponding serde into the plan.
3. Add a test that uses a bigint as the key.
4. Fix the bug where FF is added to the end of group-by keys, costing one more mem-copy.
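
For the single bigint case in item 1, one common way to make byte-wise comparison agree with numeric order is to write the value big-endian with the sign bit flipped. The sketch below shows that technique; whether the patch uses this exact layout is an assumption, and the class and method names are hypothetical.

```java
class SingleBigintKey {
    // 8 bytes, big-endian, sign bit flipped so negatives sort before positives.
    static byte[] encode(long v) {
        long flipped = v ^ Long.MIN_VALUE;
        byte[] out = new byte[8];
        for (int i = 7; i >= 0; i--) {
            out[i] = (byte) flipped;
            flipped >>>= 8;
        }
        return out;
    }

    // Inverse of encode().
    static long decode(byte[] b) {
        long v = 0;
        for (int i = 0; i < 8; i++) {
            v = (v << 8) | (b[i] & 0xffL);
        }
        return v ^ Long.MIN_VALUE;
    }

    // Unsigned lexicographic comparison, as a raw byte comparator would do.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // Byte order matches numeric order: -1 < 0 < 1.
        System.out.println(compareBytes(encode(-1L), encode(0L)) < 0);
        System.out.println(compareBytes(encode(0L), encode(1L)) < 0);
        System.out.println(decode(encode(42L)));
    }
}
```

A fixed-width value like this needs neither a delimiter nor escaping, since its length is always known.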




[jira] Updated: (HIVE-1802) Encode MapReduce Shuffling Keys Differently for Single string/bigint Key

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach updated HIVE-1802:
---------------------------------

    Component/s: Query Processor



[jira] Commented: (HIVE-1802) Encode MapReduce Shuffling Keys Differently for Single string/bigint Key

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935043#action_12935043 ] 

He Yongqiang commented on HIVE-1802:
------------------------------------

>>For any Group by, we needed 2 mem-copies. One from Text objects to buffer, one add an extra tag to the end of the buffer.
I think for join we will need an array copy to put a tag at the end.

I mean that optimizing BinarySortableSerDe might be a better way to handle the cases that need an array copy.
The code can be cleaner and simpler if we only optimize the single Text key case in group-by here, and put the other optimizations in BinarySortableSerDe.
