Posted to dev@pig.apache.org by "Viraj Bhat (JIRA)" <ji...@apache.org> on 2008/10/21 03:22:44 UTC
[jira] Created: (PIG-504) Illustrate and Dump do not seem to work
correctly for files containing utf8
Illustrate and Dump do not seem to work correctly for files containing utf8
---------------------------------------------------------------------------
Key: PIG-504
URL: https://issues.apache.org/jira/browse/PIG-504
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: types_branch
Environment: Hadoop 18
Reporter: Viraj Bhat
The following snippet, run on the latest types branch,
{code}
A = load 'utf8.txt' using PigStorage() as (text: chararray);
illustrate A;
{code}
produces this output:
---------------------------------
| A     | text: bytearray cn: 1 |
---------------------------------
|       | ????????????????      |
---------------------------------
Three observations:
1) text should be chararray, not bytearray.
2) cn: 1 should be removed from the display.
3) The value for text ("???????????????") is not displayed properly.
Replacing illustrate with dump
{code}
A = load 'utf8.txt' using PigStorage() as (text: chararray);
dump A;
{code}
produces (??????)
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-504) Illustrate and Dump do not seem to work
correctly for files containing utf8
Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Olga Natkovich updated PIG-504:
-------------------------------
Fix Version/s: types_branch
> Illustrate and Dump do not seem to work correctly for files containing utf8
> ---------------------------------------------------------------------------
>
> Key: PIG-504
> URL: https://issues.apache.org/jira/browse/PIG-504
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: types_branch
> Environment: Hadoop 18
> Reporter: Viraj Bhat
> Fix For: types_branch
>
> Attachments: 504.patch, utf8.txt
>
>
> The following snippet, run on the latest types branch (utf8.txt attached),
> {code}
> A = load 'utf8.txt' using PigStorage() as (t1: chararray);
> illustrate A;
> {code}
> produces this output:
> -------------------------------
> | A     | t1: bytearray cn: 1 |
> -------------------------------
> |       | gabriella??         |
> -------------------------------
> Three observations:
> 1) t1 should be chararray, not bytearray.
> 2) cn: 1 should be removed from the display.
> 3) The value for t1 ("username??") is not displayed properly.
> Replacing illustrate with dump
> {code}
> A = load 'utf8.txt' using PigStorage() as (t1: chararray);
> dump A;
> {code}
> produces
> (david?)
> (rachel?)
> (jessica?)
> (sarah?)
> (katie?)
> (wendy?)
> (david?)
> (priscilla?)
> (oscar?)
> (xavier?)
> ..some more.
> The utf8 characters after the username are not displayed correctly; they are replaced by ?.
[jira] Updated: (PIG-504) Illustrate and Dump do not seem to work
correctly for files containing utf8
Posted by "Viraj Bhat (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Viraj Bhat updated PIG-504:
---------------------------
Attachment: utf8.txt
utf8.txt for verifying the problem
[jira] Updated: (PIG-504) Illustrate and Dump do not seem to work
correctly for files containing utf8
Posted by "Shubham Chopra (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shubham Chopra updated PIG-504:
-------------------------------
Attachment: 504.patch
Comments inline
1) text should be chararray, not bytearray.
-> Pig casts data only when it is used. PigStorage loads data as bytearrays; if a field is used, the PlanOptimizer inserts a foreach after the load that casts the data to the declared data types. So a script like the following shows bytearray for the load and chararray after the foreach:
{{
a = load 'utf8.txt' as (x:chararray);
b = foreach a generate x;
illustrate b;
------------------------
| a     | x: bytearray |
------------------------
|       | quinnφ       |
------------------------
------------------------
| b     | x: chararray |
------------------------
|       | quinn?       |
------------------------
}}
2) cn: 1 should be removed from the display
-> I had used the schema's toString method, and I have modified that method in the attached patch. I would ask Santhosh to have a look at it.
3) Value for text is "username??" is not displayed properly
-> The data types use the object's toString method, and the default charset it relies on may be machine-dependent. I am not sure why the decision was taken to go with the default charset instead of UTF-8 or UTF-16. I would ask Alan to comment on it.
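To make point 3 concrete, here is a minimal Java sketch of how a charset that cannot represent a character ends up substituting ?. It is not Pig's code; the class and method names (CharsetRoundTrip, roundTrip) are made up for illustration, and US-ASCII merely stands in for whatever default charset a given machine might have.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetRoundTrip {
    // Encode text with the given charset, then decode it with the same charset.
    // String.getBytes(Charset) replaces unmappable characters with the
    // charset's replacement byte, which is '?' for US-ASCII.
    static String roundTrip(String text, Charset cs) {
        byte[] encoded = text.getBytes(cs);
        return new String(encoded, cs);
    }

    public static void main(String[] args) {
        // "quinn" followed by the Greek small letter phi (hypothetical sample value)
        String value = "quinn\u03c6";

        // A charset that cannot represent phi loses the character.
        System.out.println(roundTrip(value, StandardCharsets.US_ASCII).equals("quinn?"));
        // UTF-8 preserves it.
        System.out.println(roundTrip(value, StandardCharsets.UTF_8).equals(value));
        // Whatever this JVM picked as its default; this can differ by machine.
        System.out.println(Charset.defaultCharset());
    }
}
```

The booleans are printed instead of the strings themselves so that the result does not depend on the console's own encoding.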
[jira] Resolved: (PIG-504) Illustrate and Dump do not seem to work
correctly for files containing utf8
Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Olga Natkovich resolved PIG-504.
--------------------------------
Resolution: Fixed
The dump part is addressed by PIG-497.
Illustrate just requires an environment variable change:
export LANG="en_US.UTF-8"
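To check what the JVM actually sees after this change, something like the following could be run from the same shell. This is only a sketch: the class name DefaultCharsetCheck is made up, and how LANG maps to the default charset depends on the JVM version and platform (newer JVMs may default to UTF-8 regardless of the locale).

```java
import java.nio.charset.Charset;

class DefaultCharsetCheck {
    // Summarize the charset-related settings this JVM started with.
    static String describe() {
        return "LANG=" + System.getenv("LANG")
            + " file.encoding=" + System.getProperty("file.encoding")
            + " defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```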
[jira] Commented: (PIG-504) Illustrate and Dump do not seem to work
correctly for files containing utf8
Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641933#action_12641933 ]
Olga Natkovich commented on PIG-504:
------------------------------------
Patch committed. Keeping the issue open until the type is fixed.
[jira] Commented: (PIG-504) Illustrate and Dump do not seem to work
correctly for files containing utf8
Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641416#action_12641416 ]
Olga Natkovich commented on PIG-504:
------------------------------------
PIG-497 deals with the dump issue. This JIRA should be just about illustrate.
[jira] Commented: (PIG-504) Illustrate and Dump do not seem to work
correctly for files containing utf8
Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641417#action_12641417 ]
Olga Natkovich commented on PIG-504:
------------------------------------
Shubham: regarding (1), illustrate should do the same thing as describe. If you look at describe, you will see that it says chararray.
[jira] Updated: (PIG-504) Illustrate and Dump do not seem to work
correctly for files containing utf8
Posted by "Viraj Bhat (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Viraj Bhat updated PIG-504:
---------------------------
Description:
The following snippet, run on the latest types branch,
{code}
A = load 'utf8.txt' using PigStorage() as (t1: chararray);
illustrate A;
{code}
produces this output:
-------------------------------
| A     | t1: bytearray cn: 1 |
-------------------------------
|       | gabriella??         |
-------------------------------
Three observations:
1) t1 should be chararray, not bytearray.
2) cn: 1 should be removed from the display.
3) The value for t1 ("username??") is not displayed properly.
Replacing illustrate with dump
{code}
A = load 'utf8.txt' using PigStorage() as (t1: chararray);
dump A;
{code}
produces
(david?)
(rachel?)
(jessica?)
(sarah?)
(katie?)
(wendy?)
(david?)
(priscilla?)
(oscar?)
(xavier?)
..some more. The utf8 characters after the username are not displayed.
was:
For the snippet of code which runs on the latest type branch
{code}
A = load 'utf8.txt' using PigStorage() as (text: chararray);
illustrate A;
{code}
results in this output being produced
---------------------------------
| A | text: bytearray cn: 1 |
---------------------------------
| | ???????????????? |
---------------------------------
Three observations:
1) text should be chararray, not bytearray.
2) cn: 1 should be removed from the display
3) Value for text is "???????????????" is not displayed properly
Now replacing illustrate with dump
{code}
A = load 'utf8.txt' using PigStorage() as (text: chararray);
dump A;
{code}
produces (??????)
[jira] Updated: (PIG-504) Illustrate and Dump do not seem to work
correctly for files containing utf8
Posted by "Viraj Bhat (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Viraj Bhat updated PIG-504:
---------------------------
Description:
The following snippet, run on the latest types branch (utf8.txt attached),
{code}
A = load 'utf8.txt' using PigStorage() as (t1: chararray);
illustrate A;
{code}
produces this output:
-------------------------------
| A     | t1: bytearray cn: 1 |
-------------------------------
|       | gabriella??         |
-------------------------------
Three observations:
1) t1 should be chararray, not bytearray.
2) cn: 1 should be removed from the display.
3) The value for t1 ("username??") is not displayed properly.
Replacing illustrate with dump
{code}
A = load 'utf8.txt' using PigStorage() as (t1: chararray);
dump A;
{code}
produces
(david?)
(rachel?)
(jessica?)
(sarah?)
(katie?)
(wendy?)
(david?)
(priscilla?)
(oscar?)
(xavier?)
..some more.
The utf8 characters after the username are not displayed correctly; they are replaced by ?.
was:
For the snippet of code which runs on the latest types branch
{code}
A = load 'utf8.txt' using PigStorage() as (t1: chararray);
illustrate A;
{code}
results in this output being produced
-------------------------------
| A | t1: bytearray cn: 1 |
-------------------------------
| | gabriella?? |
-------------------------------
Three observations:
1) text should be chararray, not bytearray.
2) cn: 1 should be removed from the display
3) Value for text is "username??" is not displayed properly
Now replacing illustrate with dump
{code}
A = load 'utf8.txt' using PigStorage() as (t1: chararray);
dump A;
{code}
(david?)
(rachel?)
(jessica?)
(sarah?)
(katie?)
(wendy?)
(david?)
(priscilla?)
(oscar?)
(xavier?)
..some more. The utf8 characters after username are not displayed.