Posted to dev@pig.apache.org by "Viraj Bhat (JIRA)" <ji...@apache.org> on 2008/10/21 03:22:44 UTC

[jira] Created: (PIG-504) Illustrate and Dump do not seem to work correctly for files containing utf8

Illustrate and Dump do not seem to work correctly for files containing utf8
---------------------------------------------------------------------------

                 Key: PIG-504
                 URL: https://issues.apache.org/jira/browse/PIG-504
             Project: Pig
          Issue Type: Bug
          Components: impl
    Affects Versions: types_branch
         Environment: Hadoop 18
            Reporter: Viraj Bhat


For the snippet of code which runs on the latest types branch
{code}
A = load 'utf8.txt' using PigStorage() as (text: chararray);
illustrate A;
{code}

results in this output being produced

---------------------------------
| A     | text: bytearray cn: 1 | 
---------------------------------
|       | ????????????????      | 
---------------------------------

Three observations:
1) text should be chararray, not bytearray.
2) cn: 1 should be removed from the display
3) The value for text, "???????????????", is not displayed properly

Now replacing illustrate with dump
{code}
A = load 'utf8.txt' using PigStorage() as (text: chararray);
dump A;
{code}

produces (??????)


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-504) Illustrate and Dump do not seem to work correctly for files containing utf8

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-504:
-------------------------------

    Fix Version/s: types_branch

> Illustrate and Dump do not seem to work correctly for files containing utf8
> ---------------------------------------------------------------------------
>
>                 Key: PIG-504
>                 URL: https://issues.apache.org/jira/browse/PIG-504
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>         Environment: Hadoop 18
>            Reporter: Viraj Bhat
>             Fix For: types_branch
>
>         Attachments: 504.patch, utf8.txt
>
>
> For the snippet of code which runs on the latest types branch. (utf8.txt attached)
> {code}
> A = load 'utf8.txt' using PigStorage() as (t1: chararray);
> illustrate A;
> {code}
> results in this output being produced
> -------------------------------
> | A     | t1: bytearray cn: 1 | 
> -------------------------------
> |       | gabriella??         | 
> -------------------------------
> Three observations:
> 1) text should be chararray, not bytearray.
> 2) cn: 1 should be removed from the display
> 3) Value for text is "username??" is not displayed properly
> Now replacing illustrate with dump
> {code}
> A = load 'utf8.txt' using PigStorage() as (t1: chararray);
> dump A;
> {code}
> (david?)
> (rachel?)
> (jessica?)
> (sarah?)
> (katie?)
> (wendy?)
> (david?)
> (priscilla?)
> (oscar?)
> (xavier?)
> ..some more. 
> The utf8 characters after username are not displayed correctly but instead substituted by ?.



[jira] Updated: (PIG-504) Illustrate and Dump do not seem to work correctly for files containing utf8

Posted by "Viraj Bhat (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Viraj Bhat updated PIG-504:
---------------------------

    Attachment: utf8.txt

utf8.txt for verifying the problem



[jira] Updated: (PIG-504) Illustrate and Dump do not seem to work correctly for files containing utf8

Posted by "Shubham Chopra (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shubham Chopra updated PIG-504:
-------------------------------

    Attachment: 504.patch

Comments inline:
1) text should be chararray, not bytearray.
-> Pig casts the data only when it is used. PigStorage loads data as bytearrays. If the data is used, the PlanOptimizer inserts a foreach after the load that casts the data to the specified data types. So, for a script like the following, illustrate shows bytearray for the load and chararray only after the foreach:
{code}
a = load 'utf8.txt' as (x:chararray);
b = foreach a generate x;
illustrate b;

------------------------
| a     | x: bytearray | 
------------------------
|       | quinnφ       | 
------------------------
------------------------
| b     | x: chararray | 
------------------------
|       | quinn?       | 
------------------------
{code}

2) cn: 1 should be removed from the display
-> I had used the toString method of the schema. I have modified that toString method in the attached patch. I would request Santhosh to have a look at it.

3) Value for text is "username??" is not displayed properly
-> The data types use the toString method of the object. The default charset used might be machine dependent. I am not sure why the decision was taken to go with the default charset instead of UTF-8 or UTF-16. I would request Alan to comment on it.
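
A minimal sketch of how the '?' substitution in observation 3 can arise. This is not Pig's code: the class and method names are hypothetical, and it assumes that the charset in effect on the affected machine cannot represent the non-ASCII character (US-ASCII is used here as a stand-in). In Java, String.getBytes(Charset) replaces every unmappable character with the charset's replacement byte, which is '?' for US-ASCII.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Pig.
class CharsetDemo {
    // Round-trips text through the given charset. Characters that the
    // charset cannot represent are replaced during encoding, so they
    // come back as the charset's replacement character.
    static String renderAs(String text, Charset target) {
        byte[] encoded = text.getBytes(target);
        return new String(encoded, target);
    }

    public static void main(String[] args) {
        // "quinn" followed by GREEK SMALL LETTER PHI, written as an escape
        // so the source file itself stays ASCII.
        String value = "quinn\u03c6";

        // US-ASCII cannot represent the last character.
        System.out.println(renderAs(value, StandardCharsets.US_ASCII));

        // UTF-8 can represent it; what the terminal shows still depends
        // on the terminal's own encoding.
        System.out.println(renderAs(value, StandardCharsets.UTF_8));
    }
}
```

Here renderAs(value, StandardCharsets.US_ASCII) returns "quinn?", matching the chararray row above, while the UTF-8 round trip returns the string unchanged.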



[jira] Resolved: (PIG-504) Illustrate and Dump do not seem to work correctly for files containing utf8

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich resolved PIG-504.
--------------------------------

    Resolution: Fixed

The dump part is addressed by PIG-497.

The illustrate part just requires an environment variable change:

export LANG="en_US.UTF-8"
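
One way to check whether a given environment would let the JVM emit such characters is a small helper like the sketch below. It is hypothetical and not part of Pig: it asks a charset's encoder whether it can encode a sample string, and main reports the answer for the JVM's default charset. How the default charset is derived from LANG depends on the platform and the JVM version, so the link to the export above is an assumption to verify locally.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Pig.
class LocaleCheck {
    // Returns true if every character of text can be encoded in cs.
    // Note: a few charsets do not support encoding at all, in which case
    // newEncoder() throws UnsupportedOperationException.
    static boolean canDisplay(Charset cs, String text) {
        return cs.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // "quinn" followed by GREEK SMALL LETTER PHI.
        String sample = "quinn\u03c6";
        Charset def = Charset.defaultCharset();
        System.out.println("default charset: " + def);
        System.out.println("can encode sample: " + canDisplay(def, sample));
    }
}
```

Running it once before and once after changing LANG shows whether the setting changes what the JVM reports on that machine.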



[jira] Commented: (PIG-504) Illustrate and Dump do not seem to work correctly for files containing utf8

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641933#action_12641933 ] 

Olga Natkovich commented on PIG-504:
------------------------------------

Patch committed. Keeping the issue open until the type is fixed.



[jira] Commented: (PIG-504) Illustrate and Dump do not seem to work correctly for files containing utf8

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641416#action_12641416 ] 

Olga Natkovich commented on PIG-504:
------------------------------------

PIG-497 deals with the dump issue. This JIRA should be just about illustrate.



[jira] Commented: (PIG-504) Illustrate and Dump do not seem to work correctly for files containing utf8

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641417#action_12641417 ] 

Olga Natkovich commented on PIG-504:
------------------------------------

Shubham: regarding (1), illustrate should do the same thing as describe. If you look at describe, you will see that it says chararray.



[jira] Updated: (PIG-504) Illustrate and Dump do not seem to work correctly for files containing utf8

Posted by "Viraj Bhat (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Viraj Bhat updated PIG-504:
---------------------------

    Description: 
For the snippet of code which runs on the latest types branch
{code}
A = load 'utf8.txt' using PigStorage() as (t1: chararray);
illustrate A;
{code}

results in this output being produced
-------------------------------
| A     | t1: bytearray cn: 1 | 
-------------------------------
|       | gabriella??         | 
-------------------------------

Three observations:
1) text should be chararray, not bytearray.
2) cn: 1 should be removed from the display
3) Value for text is "username??" is not displayed properly

Now replacing illustrate with dump
{code}
A = load 'utf8.txt' using PigStorage() as (t1: chararray);
dump A;
{code}

(david?)
(rachel?)
(jessica?)
(sarah?)
(katie?)
(wendy?)
(david?)
(priscilla?)
(oscar?)
(xavier?)
..some more. The utf8 characters after username are not displayed.

  was:
For the snippet of code which runs on the latest type branch
{code}
A = load 'utf8.txt' using PigStorage() as (text: chararray);
illustrate A;
{code}

results in this output being produced

---------------------------------
| A     | text: bytearray cn: 1 | 
---------------------------------
|       | ????????????????      | 
---------------------------------

Three observations:
1) text should be chararray, not bytearray.
2) cn: 1 should be removed from the display
3) Value for text is "???????????????" is not displayed properly

Now replacing illustrate with dump
{code}
A = load 'utf8.txt' using PigStorage() as (text: chararray);
dump A;
{code}

produces (??????)





[jira] Updated: (PIG-504) Illustrate and Dump do not seem to work correctly for files containing utf8

Posted by "Viraj Bhat (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Viraj Bhat updated PIG-504:
---------------------------

    Description: 
For the snippet of code which runs on the latest types branch. (utf8.txt attached)
{code}
A = load 'utf8.txt' using PigStorage() as (t1: chararray);
illustrate A;
{code}

results in this output being produced
-------------------------------
| A     | t1: bytearray cn: 1 | 
-------------------------------
|       | gabriella??         | 
-------------------------------

Three observations:
1) text should be chararray, not bytearray.
2) cn: 1 should be removed from the display
3) Value for text is "username??" is not displayed properly

Now replacing illustrate with dump
{code}
A = load 'utf8.txt' using PigStorage() as (t1: chararray);
dump A;
{code}

(david?)
(rachel?)
(jessica?)
(sarah?)
(katie?)
(wendy?)
(david?)
(priscilla?)
(oscar?)
(xavier?)
..some more. 

The utf8 characters after username are not displayed correctly but instead substituted by ?.

  was:
For the snippet of code which runs on the latest types branch
{code}
A = load 'utf8.txt' using PigStorage() as (t1: chararray);
illustrate A;
{code}

results in this output being produced
-------------------------------
| A     | t1: bytearray cn: 1 | 
-------------------------------
|       | gabriella??         | 
-------------------------------

Three observations:
1) text should be chararray, not bytearray.
2) cn: 1 should be removed from the display
3) Value for text is "username??" is not displayed properly

Now replacing illustrate with dump
{code}
A = load 'utf8.txt' using PigStorage() as (t1: chararray);
dump A;
{code}

(david?)
(rachel?)
(jessica?)
(sarah?)
(katie?)
(wendy?)
(david?)
(priscilla?)
(oscar?)
(xavier?)
..some more. The utf8 characters after username are not displayed.

