Posted to dev@pig.apache.org by "Prashant Kommireddi (JIRA)" <ji...@apache.org> on 2013/03/19 03:29:15 UTC

[jira] [Updated] (PIG-3110) pig corrupts chararrays with trailing whitespace when converting them to long

     [ https://issues.apache.org/jira/browse/PIG-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Kommireddi updated PIG-3110:
-------------------------------------

    Attachment: PIG-3110.patch

The patch contains changes to Utf8StorageConverter and TestConversions. In addition to the changes discussed above, I have also added the following check:

{code}
        if(b == null || b.length == 0) {
            return null;
        }
{code}
If the input byte array is null or empty there is nothing to parse, so we return early and avoid the expensive valueOf(String s) calls.
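
To illustrate where such a guard sits, here is a minimal sketch. The class and the method bytesToLongSketch are hypothetical stand-ins written for this example; they are not the actual Utf8StorageConverter code, and the UTF-8 decoding is an assumption.

```java
import java.nio.charset.StandardCharsets;

public class EmptyInputGuard {
    // Hypothetical stand-in for a bytes-to-long conversion; not Pig's actual code.
    static Long bytesToLongSketch(byte[] b) {
        // Guard from the patch: nothing to parse, so skip the String
        // construction and the valueOf call entirely.
        if (b == null || b.length == 0) {
            return null;
        }
        String s = new String(b, StandardCharsets.UTF_8);
        try {
            return Long.valueOf(s);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(bytesToLongSketch(new byte[0]));
        System.out.println(bytesToLongSketch("42".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Without the guard, an empty input would still build a String and pay for a NumberFormatException before returning null.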

Also, this could be optimized further if the only remaining reason for falling back on Double.valueOf() is to handle floating point input. For floating point numbers, bytesToLong and bytesToInteger currently do the following:
1. Call Integer/Long.valueOf(String).
2. If step 1 results in null, call Double.valueOf(String).
3. Convert the result of step 2 back to Integer/Long.

The input byte array can be identified as floating point up front, which avoids the call in step 1.
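
A rough sketch of both approaches follows. All names here (FallbackSketch, viaFallback, looksLikeFloatingPoint, viaDetection) are hypothetical, and the trim() call in viaDetection is an assumption of this sketch rather than a statement about the patch. The sketch also shows why routing an integer through Double is lossy: Double.valueOf ignores trailing whitespace, but a double cannot represent every 19-digit long exactly.

```java
public class FallbackSketch {
    // Steps 1 to 3 as described above (hypothetical helper, not Pig's source).
    static Long viaFallback(String s) {
        try {
            return Long.valueOf(s);                    // step 1
        } catch (NumberFormatException e) {
            try {
                return Double.valueOf(s).longValue();  // steps 2 and 3
            } catch (NumberFormatException e2) {
                return null;
            }
        }
    }

    // Cheap scan for characters that only occur in decimal floating point
    // literals. Simplification: ignores NaN, Infinity, hex floats and the
    // f/d suffixes that Double.valueOf also accepts.
    static boolean looksLikeFloatingPoint(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '.' || c == 'e' || c == 'E') {
                return true;
            }
        }
        return false;
    }

    // Decide up front which parser to use, so floating point input never
    // pays for a failed Long.valueOf call.
    static Long viaDetection(String s) {
        try {
            if (looksLikeFloatingPoint(s)) {
                return Double.valueOf(s).longValue();
            }
            return Long.valueOf(s.trim());             // trim() is an assumption
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String fromIssue = "1703598819951657279 ";     // note the trailing space
        System.out.println(viaFallback(fromIssue));    // 1703598819951657216
        System.out.println(viaDetection(fromIssue));   // 1703598819951657279
        System.out.println(viaDetection("12.7"));      // 12
    }
}
```

With the value from the issue description, the fallback path reproduces the corrupted result reported there, while the detection path keeps all digits.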

Finally, the above process runs regardless of whether the input byte array is numeric. This is unnecessary for strings like "1234abcd".
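
One possible shape for such a pre-check is sketched below. The class and method names are hypothetical, and the set of accepted characters is an assumption that covers plain decimal integers and floating point literals only (not NaN, Infinity, hex floats or type suffixes). It is a necessary condition rather than a full validation, so a string that passes can still fail to parse.

```java
public class NumericPrecheck {
    // Hypothetical pre-check: returns false as soon as a character is seen
    // that cannot appear in a decimal integer or floating point literal.
    static boolean mayBeNumeric(String s) {
        if (s.isEmpty()) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean ok = (c >= '0' && c <= '9')
                    || c == '+' || c == '-'
                    || c == '.' || c == 'e' || c == 'E';
            if (!ok) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(mayBeNumeric("1234abcd")); // false
        System.out.println(mayBeNumeric("-1.5e3"));   // true
    }
}
```

For "1234abcd" the scan stops at the first letter, so neither valueOf call (nor the exception each would throw) is needed.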

If all agree, we should open another JIRA and optimize these methods further.
                
> pig corrupts chararrays with trailing whitespace when converting them to long
> -----------------------------------------------------------------------------
>
>                 Key: PIG-3110
>                 URL: https://issues.apache.org/jira/browse/PIG-3110
>             Project: Pig
>          Issue Type: Bug
>          Components: data
>    Affects Versions: 0.10.0
>            Reporter: Ido Hadanny
>         Attachments: PIG-3110.patch
>
>
> When converting the following string to long, Pig corrupts it. Data:
> 1703598819951657279 ,44081037
> data1 = load 'data' using CSVLoader as (a: chararray ,b: int);
> data2 = foreach data1 generate (long)a as a;
> dump data2;
> (1703598819951657216)    <--- last 2 digits are corrupted
> data2 = foreach data1 generate (long)TRIM(a) as a;
> dump data2;
> (1703598819951657279)    <--- correct
