Posted to issues@hive.apache.org by "Thomas Friedrich (JIRA)" <ji...@apache.org> on 2016/08/12 23:08:20 UTC

[jira] [Commented] (HIVE-14533) improve performance of enforceMaxLength in HiveCharWritable/HiveVarcharWritable

    [ https://issues.apache.org/jira/browse/HIVE-14533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419657#comment-15419657 ] 

Thomas Friedrich commented on HIVE-14533:
-----------------------------------------

The patch adds a check to enforceMaxLength so that maxLength is enforced only when the value actually needs to change. This check can be done without decoding the string, so it avoids decoding every value unnecessarily.

HiveVarcharWritable: if (value.getLength()>maxLength && getCharacterLength()>maxLength)
- value.getLength is the number of bytes of the string
- maxLength is the max number of characters
For single-byte characters, the number of bytes equals the number of characters. For multi-byte characters, the number of characters is less than the number of bytes. If the number of bytes is not larger than maxLength, then the string has at most maxLength characters and we don't have to truncate it. If the number of bytes is larger than maxLength, we need to compare the character length with maxLength. We could just compare getCharacterLength()>maxLength in every case, but getCharacterLength calls getTextUtfLength, which takes more time than just comparing the byte length with maxLength.
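
As a rough illustration of the two-step check above, here is a minimal standalone sketch. It does not use the Hive or Hadoop classes: the class VarcharLengthCheck and the helpers utf8CharacterLength and needsTruncation are hypothetical names, and the character count is assumed to be the number of UTF-8 start bytes, which may differ in detail from what getTextUtfLength does.

```java
import java.nio.charset.StandardCharsets;

public class VarcharLengthCheck {

    // Hypothetical stand-in for getCharacterLength(): counts characters in valid
    // UTF-8 by counting every byte that is not a continuation byte (10xxxxxx).
    static int utf8CharacterLength(byte[] utf8) {
        int count = 0;
        for (byte b : utf8) {
            if ((b & 0xC0) != 0x80) {
                count++;
            }
        }
        return count;
    }

    // Cheap byte-length comparison first; the character count is only computed
    // when the byte length exceeds maxLength.
    static boolean needsTruncation(byte[] utf8, int maxLength) {
        return utf8.length > maxLength && utf8CharacterLength(utf8) > maxLength;
    }

    public static void main(String[] args) {
        byte[] ascii = "hello".getBytes(StandardCharsets.UTF_8);         // 5 bytes, 5 characters
        byte[] accented = "h\u00e9llo".getBytes(StandardCharsets.UTF_8); // 6 bytes, 5 characters

        System.out.println(needsTruncation(ascii, 5));    // false: byte check alone rules it out
        System.out.println(needsTruncation(accented, 5)); // false: 6 bytes, but only 5 characters
        System.out.println(needsTruncation(accented, 4)); // true: 5 characters > 4
    }
}
```

The short-circuiting && is what keeps the common case cheap: for values whose byte length already fits, the character count is never computed.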

HiveCharWritable: if (getCharacterLength()!=maxLength)
For char values, we can only compare the number of characters with maxLength, and if it's different we need to call set to enforce the right length. This ensures we get the padded value if the string is not long enough and the truncated value if it's longer. If we were to compare the bytes (value.getLength()) with maxLength, then it might not enforce the maxLength if multi-byte characters are involved.
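
To make the char case concrete, the following standalone sketch shows why a byte comparison is not enough. It is an illustration only, not the Hive code: CharLengthCheck, utf8CharacterLength, needsEnforcement and enforce are hypothetical names, and padding with spaces up to maxLength is an assumption about how the padded value is built.

```java
import java.nio.charset.StandardCharsets;

public class CharLengthCheck {

    // Hypothetical stand-in for getCharacterLength(): counts UTF-8 start bytes.
    static int utf8CharacterLength(byte[] utf8) {
        int count = 0;
        for (byte b : utf8) {
            if ((b & 0xC0) != 0x80) {
                count++;
            }
        }
        return count;
    }

    // A char value only needs work when its character count differs from maxLength.
    static boolean needsEnforcement(byte[] utf8, int maxLength) {
        return utf8CharacterLength(utf8) != maxLength;
    }

    // Assumed behaviour of enforcing a char length: truncate if too long,
    // pad with spaces if too short.
    static String enforce(String s, int maxLength) {
        int chars = s.codePointCount(0, s.length());
        if (chars > maxLength) {
            return s.substring(0, s.offsetByCodePoints(0, maxLength));
        }
        StringBuilder padded = new StringBuilder(s);
        for (int i = chars; i < maxLength; i++) {
            padded.append(' ');
        }
        return padded.toString();
    }

    public static void main(String[] args) {
        String value = "h\u00e9";                               // 2 characters
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);   // 3 bytes

        // Comparing bytes with maxLength = 3 would wrongly report "already the right length".
        System.out.println(utf8.length == 3);                   // true
        System.out.println(needsEnforcement(utf8, 3));          // true: only 2 characters
        System.out.println("[" + enforce(value, 3) + "]");      // padded to 3 characters
    }
}
```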



> improve performance of enforceMaxLength in HiveCharWritable/HiveVarcharWritable
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-14533
>                 URL: https://issues.apache.org/jira/browse/HIVE-14533
>             Project: Hive
>          Issue Type: Improvement
>          Components: Serializers/Deserializers
>    Affects Versions: 1.2.1, 2.1.0
>            Reporter: Thomas Friedrich
>            Assignee: Thomas Friedrich
>            Priority: Minor
>              Labels: performance
>         Attachments: HIVE-14533.patch
>
>
> The enforceMaxLength method in HiveVarcharWritable calls set(getHiveVarchar(), maxLength); and in HiveCharWritable set(getHiveChar(), maxLength); no matter how long the string is. The calls to getHiveVarchar() and getHiveChar() decode the string every time the method is called (Text.toString() calls Text.decode). This can be very expensive and is unnecessary if the string is no longer than maxLength for HiveVarcharWritable or already exactly maxLength characters long for HiveCharWritable.


