Posted to dev@pig.apache.org by "Leo Heska (Created) (JIRA)" <ji...@apache.org> on 2012/03/25 15:25:25 UTC

[jira] [Created] (PIG-2613) Pig substitutes/mangels "upper ASCII" characters (values > 127)

Pig substitutes/mangels "upper ASCII" characters (values > 127)
---------------------------------------------------------------

                 Key: PIG-2613
                 URL: https://issues.apache.org/jira/browse/PIG-2613
             Project: Pig
          Issue Type: Bug
          Components: data, parser
    Affects Versions: 0.8.1
         Environment: linux
            Reporter: Leo Heska


Create a small dummy input file that contains ASCII 254 (decimal) characters. These are often rendered as the thorn character (þ). A sample line looks like this:

   1þ4þaaaþbbbþcccþdddþ7þ8þ9

but your browser may not render that correctly. Hex representation of that sample line:

   31FE34FE616161FE626262FE636363FE646464FE37FE38FE390D0A

or, with spaces added for your convenience in reading:

   31 FE 34 FE 61 61 61 FE 62 62 62 FE 63 63 63 FE 64 64 64 FE 37 FE 38 FE 39 0D 0A

You can see that this is just a sample line of plain ASCII numerals and lower-case letters, separated by the FE (hex) or 254 (decimal) code point.
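For anyone who wants to recreate the input byte for byte rather than copy it from a browser, the hex dump above can be turned back into a file. This is only a sketch: the class and method names are made up for illustration, and the local file name is an assumption (the file would still need to be copied to /test/ on HDFS separately).

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class MakeSampleLine {
    // Convert a hex string such as "31FE34" into the bytes it denotes.
    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // Count how many bytes equal the given unsigned value (0-255).
    static int countByte(byte[] data, int value) {
        int n = 0;
        for (byte b : data) {
            if ((b & 0xFF) == value) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) throws Exception {
        // Hex string taken verbatim from the report, including the trailing CR LF.
        byte[] line = fromHex("31FE34FE616161FE626262FE636363FE646464FE37FE38FE390D0A");
        // Hypothetical local file name, written to the current directory.
        Path p = Files.write(Paths.get("DummyDataTS.txt"), line);
        System.out.println(p + ": " + line.length + " bytes, "
                + countByte(line, 0xFE) + " separator bytes");
    }
}
```

Counting the 0xFE bytes on the raw file this way gives a baseline to compare against whatever Pig reports after the load.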

Now load, like this:

   dummyts = load '/test/DummyDataTS.txt' using PigStorage(',') as (line:chararray);

A dump 

   dump dummyts;
   
shows this:

   (1�4�aaa�bbb�ccc�ddd�7�8�9)

The problem does not seem to be with the dump. I have written a UDF that counts characters in the line and returns TRUE if the character count is correct. When I do this:
 
   fd = filter dummyts by CountRight(line, 254, 8);

which is saying "validate that there are 8 instances of the ASCII 254 code point/character" I get no results. When I do this:

   fd1 = filter dummyts by CountRight(line, 97, 3);

which says "validate that there are three instances of the 'a' (ASCII 97) character", the results are perfect.
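The UDF source is not included in this report. As a rough illustration of the check it performs, here is a minimal sketch of the counting logic as a plain static method (not an actual Pig EvalFunc; the class and method names are hypothetical). It also assumes, without confirmation from the Pig code, that the loader decodes the raw bytes as UTF-8 before the UDF sees them.

```java
import java.nio.charset.StandardCharsets;

public class CountRightSketch {
    // Hypothetical stand-in for the reporter's UDF: true when `line`
    // contains exactly `expected` occurrences of code point `codePoint`.
    static boolean countRight(String line, int codePoint, int expected) {
        return line.codePoints().filter(cp -> cp == codePoint).count() == expected;
    }

    // The sample line's raw bytes (without the trailing CR LF): ISO-8859-1
    // maps U+00FE to the single byte 0xFE, matching the hex dump.
    static byte[] sampleBytes() {
        return "1\u00FE4\u00FEaaa\u00FEbbb\u00FEccc\u00FEddd\u00FE7\u00FE8\u00FE9"
                .getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Assumption: the raw bytes are decoded as UTF-8 on load.
        String decoded = new String(sampleBytes(), StandardCharsets.UTF_8);
        System.out.println("254 x 8:    " + countRight(decoded, 254, 8));
        System.out.println("97 x 3:     " + countRight(decoded, 97, 3));
        System.out.println("U+FFFD x 8: " + countRight(decoded, 0xFFFD, 8));
    }
}
```

Under that assumption the sketch matches the filter results described here: the check for 254 fails, the check for 97 passes, and the eight separators show up as U+FFFD instead.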

It looks like something in Pig's load is changing instances of ASCII 254 to the following three characters:

   �
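One possible explanation for seeing three characters, offered as an assumption rather than something confirmed in Pig's code: a lone 0xFE byte is never valid UTF-8, so a UTF-8 decoder substitutes the replacement character U+FFFD, and U+FFFD itself takes three bytes when written back out as UTF-8. A viewer that treats those three bytes as a single-byte encoding would then show three characters. The sketch below (class name hypothetical) walks through that round trip in plain Java.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementBytes {
    // Render bytes as upper-case hex pairs separated by spaces.
    static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Decode raw bytes as UTF-8 (malformed input becomes U+FFFD),
    // then encode the resulting string as UTF-8 again.
    static byte[] roundTripUtf8(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] reencoded = roundTripUtf8(new byte[] { (byte) 0xFE });
        System.out.println("bytes after round trip: " + hex(reencoded));
        // Viewed as a single-byte encoding, the same bytes are three characters.
        String asLatin1 = new String(reencoded, StandardCharsets.ISO_8859_1);
        System.out.println("characters when read as ISO-8859-1: " + asLatin1.length());
    }
}
```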



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (PIG-2613) Pig substitutes/mangles "upper ASCII" characters (values > 127)

Posted by "Daniel Dai (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238198#comment-13238198 ] 

Daniel Dai commented on PIG-2613:
---------------------------------

Can you attach your input?
                

       

[jira] [Updated] (PIG-2613) Pig substitutes/mangles "upper ASCII" characters (values > 127)

Posted by "Leo Heska (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leo Heska updated PIG-2613:
---------------------------

    Summary: Pig substitutes/mangles "upper ASCII" characters (values > 127)  (was: Pig substitutes/mangels "upper ASCII" characters (values > 127))
    

       

[jira] [Commented] (PIG-2613) Pig substitutes/mangles "upper ASCII" characters (values > 127)

Posted by "Leo Heska (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238263#comment-13238263 ] 

Leo Heska commented on PIG-2613:
--------------------------------

Sample input file attached, though the one line of input already included in the original report is sufficient to reproduce the problem.
                

       

[jira] [Commented] (PIG-2613) Pig substitutes/mangles "upper ASCII" characters (values > 127)

Posted by "Leo Heska (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238947#comment-13238947 ] 

Leo Heska commented on PIG-2613:
--------------------------------

Forgot to mention - this sounds very much like HIVE-237:

   https://issues.apache.org/jira/browse/HIVE-237

Is this just a different presentation of the same problem?

HIVE-237 was closed with status "Won't Fix."

Could it be that both Hive and Pig store/handle only characters x00 - x7F (decimal 0 - 127) in strings?
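A quick way to probe that question outside Pig is to check how the same byte behaves under different decodings in plain Java. The sketch below (class and method names hypothetical) makes no claim about what Pig or Hive do internally. It only shows that a Java string can hold code point 254, and that whether the byte survives depends on the encoding assumed when decoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingCheck {
    // True when decoding as UTF-8 and re-encoding as UTF-8 reproduces the input.
    static boolean survivesUtf8(byte[] data) {
        String s = new String(data, StandardCharsets.UTF_8);
        return Arrays.equals(data, s.getBytes(StandardCharsets.UTF_8));
    }

    // Decode a single byte as ISO-8859-1 and return the resulting code point.
    static int latin1CodePoint(byte b) {
        return new String(new byte[] { b }, StandardCharsets.ISO_8859_1).codePointAt(0);
    }

    public static void main(String[] args) {
        byte[] rawThorn = { (byte) 0xFE };                               // as in the sample file
        byte[] utf8Thorn = "\u00FE".getBytes(StandardCharsets.UTF_8);    // two bytes: C3 BE

        System.out.println("raw 0xFE survives UTF-8 round trip: " + survivesUtf8(rawThorn));
        System.out.println("C3 BE survives UTF-8 round trip:    " + survivesUtf8(utf8Thorn));
        System.out.println("0xFE decoded as ISO-8859-1:         " + latin1CodePoint((byte) 0xFE));
    }
}
```

If the load path does assume UTF-8, re-encoding the input file as UTF-8 (so each separator becomes C3 BE) would be one thing to try; that is a suggestion to test, not a confirmed fix.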
                

       

[jira] [Updated] (PIG-2613) Pig substitutes/mangles "upper ASCII" characters (values > 127)

Posted by "Leo Heska (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leo Heska updated PIG-2613:
---------------------------

    Attachment: DummyDataTS.txt
    