Posted to dev@pig.apache.org by "Leo Heska (Updated) (JIRA)" <ji...@apache.org> on 2012/03/25 15:25:25 UTC
[jira] [Updated] (PIG-2613) Pig substitutes/mangles "upper ASCII"
characters (values > 127)
[ https://issues.apache.org/jira/browse/PIG-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Leo Heska updated PIG-2613:
---------------------------
Summary: Pig substitutes/mangles "upper ASCII" characters (values > 127) (was: Pig substitutes/mangels "upper ASCII" characters (values > 127))
> Pig substitutes/mangles "upper ASCII" characters (values > 127)
> ---------------------------------------------------------------
>
> Key: PIG-2613
> URL: https://issues.apache.org/jira/browse/PIG-2613
> Project: Pig
> Issue Type: Bug
> Components: data, parser
> Affects Versions: 0.8.1
> Environment: linux
> Reporter: Leo Heska
>
> Create a small dummy input file that contains ASCII 254 (decimal) characters. These are often rendered as the thorn character. A sample line looks like this:
> 1þ4þaaaþbbbþcccþdddþ7þ8þ9
> but your browser may not render that correctly. Hex representation of that sample line:
> 31FE34FE616161FE626262FE636363FE646464FE37FE38FE390D0A
> or, with spaces added for your convenience in reading:
> 31 FE 34 FE 61 61 61 FE 62 62 62 FE 63 63 63 FE 64 64 64 FE 37 FE 38 FE 39 0D 0A
> You can see that this is just a sample line of plain ASCII numerals and lower-case letters, separated by the FE (hex) or 254 (decimal) code point.
> Now load, like this:
> dummyts = load '/test/DummyDataTS.txt' using PigStorage(',') as (line:chararray);
> A dump
> dump dummyts;
>
> shows this:
> (1�4�aaa�bbb�ccc�ddd�7�8�9)
> The problem does not seem to be with the dump. I have written a UDF that counts characters in the line and returns TRUE if the character count is correct. When I do this:
>
> fd = filter dummyts by CountRight(line, 254, 8);
> which says "validate that there are 8 instances of the ASCII 254 code point/character", I get no results. When I do this:
> fd1 = filter dummyts by CountRight(line, 97, 3);
> which says "validate that there are three instances of the 'a' (ASCII 97) character", the results are perfect.
> It looks like something in Pig's load is changing instances of ASCII 254 to the following three characters:
> �
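
The symptoms described above are consistent with the 0xFE bytes being decoded as UTF-8, where a lone 0xFE is invalid and is typically replaced by U+FFFD (whose own UTF-8 encoding is the three bytes EF BF BD). The following is a minimal sketch of that hypothesis in Python, not Pig's actual load code: the UTF-8 decoding step is an assumption, and count_code_point is a hypothetical stand-in for the reporter's CountRight UDF, which is not shown in the report.

```python
# Sample line from the report, rebuilt from the hex dump given in the issue.
raw = bytes.fromhex("31FE34FE616161FE626262FE636363FE646464FE37FE38FE390D0A")


def count_code_point(text, code_point):
    """Count occurrences of one code point in text.

    Hypothetical stand-in for the reporter's CountRight UDF.
    """
    return text.count(chr(code_point))


# Assumption: the loader decodes bytes as UTF-8 and substitutes U+FFFD
# for any byte sequence that is not valid UTF-8 (a lone 0xFE is not).
as_utf8 = raw.decode("utf-8", errors="replace")

# For comparison: a single-byte decoding maps byte 0xFE to code point 254.
as_latin1 = raw.decode("iso-8859-1")

thorn_in_utf8 = count_code_point(as_utf8, 254)           # the "no results" case
a_in_utf8 = count_code_point(as_utf8, 97)                # plain ASCII is unaffected
replaced_in_utf8 = count_code_point(as_utf8, 0xFFFD)     # replacement characters
thorn_in_latin1 = count_code_point(as_latin1, 254)       # count the reporter expected

# U+FFFD written back out as UTF-8 is three bytes, which a Latin-1
# viewer would display as three separate characters.
replacement_bytes = "\ufffd".encode("utf-8")

print(thorn_in_utf8, a_in_utf8, replaced_in_utf8, thorn_in_latin1)
print(replacement_bytes.hex())
```

If this hypothesis holds, a filter that looks for code point 254 finds nothing because each delimiter has already become U+FFFD by the time the UDF sees the chararray, while the count for 'a' is unaffected. Loading the field as bytearray and handling the delimiter at the byte level would be one way to check this.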