Posted to user@pig.apache.org by jeremiah rounds <ro...@gmail.com> on 2012/08/13 23:49:03 UTC

Can anyone give me a hint about this column behavior?

Greetings,

I am new to Pig.  I am trying to get to know it on a laptop with
Hadoop 0.20.2 installed, running in local mode.  I have prior experience
with Hadoop, but my error is so weird that I figure I botched the Pig
install or something.

Here is what my problem distills down to:

$ pig -x local -M


grunt> set pig.splitCombination false;
grunt> cat ERROR_9999_.csv
11,21,31
12,22,32
13,23,33
14,24,34
15,25,35



grunt> raw = load 'ERROR_9999_.csv' USING PigStorage(',',
'-tagsource') AS (file: chararray, col1: chararray,col2: chararray,
col3: chararray);
grunt> dump raw;
(ERROR_9999_.csv,11,21,31)
(ERROR_9999_.csv,12,22,32)
(ERROR_9999_.csv,13,23,33)
(ERROR_9999_.csv,14,24,34)
(ERROR_9999_.csv,15,25,35)

grunt> s1 = FOREACH raw GENERATE  col1, col2, col3;
grunt> dump s1;
(ERROR_9999_.csv,21,31)
(ERROR_9999_.csv,22,32)
(ERROR_9999_.csv,23,33)
(ERROR_9999_.csv,24,34)
(ERROR_9999_.csv,25,35)


Now obviously you wouldn't tag on the filename only to immediately
project it away, but this is a distilled, repeatable case that captures
my issue from a larger project: col1 has become the filename, even
though in raw it was a two-digit number stored as a chararray.
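
One check that points at pruning rather than the data: if every field is
projected, the pruner has nothing to remove, so I would expect the
alignment to come back correct.  A minimal sketch (s0 is just an
illustrative alias):

grunt> s0 = FOREACH raw GENERATE file, col1, col2, col3;
grunt> dump s0;

I would expect that to match the dump of raw above.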

The describes go like this:
grunt> describe raw;
raw: {file: chararray,col1: chararray,col2: chararray,col3: chararray}
grunt> describe s1;
s1: {col1: chararray,col2: chararray,col3: chararray}

There is an explain at the end of this email in case it is useful to
anyone.  I have figured out that the issue seems related to -tagsource
combined with column pruning.  Is that indicative of anything I might
have done wrong in the install?
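
For comparison, if -tagsource really is the trigger, a load without it
(so there is no file column at all) should not show the problem.  A
minimal sketch, with raw2 and s2 as illustrative aliases:

grunt> raw2 = load 'ERROR_9999_.csv' USING PigStorage(',') AS (col1: chararray, col2: chararray, col3: chararray);
grunt> s2 = FOREACH raw2 GENERATE col2, col3;
grunt> dump s2;

Here I would expect (21,31), (22,32), and so on, with no filename
anywhere.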


Thanks,
Jeremiah

grunt> explain s1
2012-08-13 17:47:28,315 [main] INFO
org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns
pruned for raw: $0
initialized
#-----------------------------------------------
# New Logical Plan:
#-----------------------------------------------
s1: (Name: LOStore Schema:
col1#41:chararray,col2#42:chararray,col3#43:chararray)ColumnPrune:InputUids=[42,
43, 41]ColumnPrune:OutputUids=[42, 43, 41]
|
|---s1: (Name: LOForEach Schema:
col1#41:chararray,col2#42:chararray,col3#43:chararray)
    |   |
    |   (Name: LOGenerate[false,false,false] Schema:
col1#41:chararray,col2#42:chararray,col3#43:chararray)
    |   |   |
    |   |   (Name: Cast Type: chararray Uid: 41)
    |   |   |
    |   |   |---col1:(Name: Project Type: bytearray Uid: 41 Input: 0
Column: (*))
    |   |   |
    |   |   (Name: Cast Type: chararray Uid: 42)
    |   |   |
    |   |   |---col2:(Name: Project Type: bytearray Uid: 42 Input: 1
Column: (*))
    |   |   |
    |   |   (Name: Cast Type: chararray Uid: 43)
    |   |   |
    |   |   |---col3:(Name: Project Type: bytearray Uid: 43 Input: 2
Column: (*))
    |   |
    |   |---(Name: LOInnerLoad[0] Schema: col1#41:bytearray)
    |   |
    |   |---(Name: LOInnerLoad[1] Schema: col2#42:bytearray)
    |   |
    |   |---(Name: LOInnerLoad[2] Schema: col3#43:bytearray)
    |
    |---raw: (Name: LOLoad Schema:
col1#41:bytearray,col2#42:bytearray,col3#43:bytearray)ColumnPrune:RequiredColumns=[1,
2, 3]ColumnPrune:InputUids=[42, 43, 41]ColumnPrune:OutputUids=[42, 43,
41]RequiredFields:[1, 2, 3]

#-----------------------------------------------
# Physical Plan:
#-----------------------------------------------
s1: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-40
|
|---s1: New For Each(false,false,false)[bag] - scope-39
    |   |
    |   Cast[chararray] - scope-31
    |   |
    |   |---Project[bytearray][0] - scope-30
    |   |
    |   Cast[chararray] - scope-34
    |   |
    |   |---Project[bytearray][1] - scope-33
    |   |
    |   Cast[chararray] - scope-37
    |   |
    |   |---Project[bytearray][2] - scope-36
    |
    |---raw: Load(file:///home/jrounds/Documents/12summer/paper/ERROR_9999_.csv:PigStorage(',','-tagsource'))
- scope-29

2012-08-13 17:47:28,321 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler
- File concatenation threshold: 100 optimistic? false
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-41
Map Plan
s1: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-40
|
|---s1: New For Each(false,false,false)[bag] - scope-39
    |   |
    |   Cast[chararray] - scope-31
    |   |
    |   |---Project[bytearray][0] - scope-30
    |   |
    |   Cast[chararray] - scope-34
    |   |
    |   |---Project[bytearray][1] - scope-33
    |   |
    |   Cast[chararray] - scope-37
    |   |
    |   |---Project[bytearray][2] - scope-36
    |
    |---raw: Load(file:///home/jrounds/Documents/12summer/paper/ERROR_9999_.csv:PigStorage(',','-tagsource'))
- scope-29--------
Global sort: false
----------------

Re: Can anyone give me a hint about this column behavior?

Posted by Bill Graham <bi...@gmail.com>.
This seems like a bug in PigStorage. Would you mind opening a JIRA with the
steps to reproduce that you've included here?

thanks,
Bill

-- 
*Note that I'm no longer using my Yahoo! email address. Please email me at
billgraham@gmail.com going forward.*

Re: Can anyone give me a hint about this column behavior?

Posted by jeremiah rounds <ro...@gmail.com>.
Greetings pig users,

This is a follow-up to my previous post (the original message above).


I was able to make this column error go away by using the start-up:
pig -x local -M -t ColumnMapKeyPrune


I have no more insight than that; I only tried it because someone else
reported that their column-oriented error went away with that command-line
switch.  I restarted Pig twice, with and without the -t, to verify that
the error went away and came back accordingly.


With  pig -x local -M -t ColumnMapKeyPrune I get:
grunt> dump s1;
(11,21,31)
(12,22,32)
(13,23,33)
(14,24,34)
(15,25,35)


With pig -x local -M I get:
grunt> dump s1;
(ERROR_9999_.csv,21,31)
(ERROR_9999_.csv,22,32)
(ERROR_9999_.csv,23,33)
(ERROR_9999_.csv,24,34)
(ERROR_9999_.csv,25,35)
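
In case it helps anyone searching later: -t is the short form of Pig's
-optimizer_off option, so (assuming the same build) the equivalent
long-form start-up should be:

$ pig -x local -M -optimizer_off ColumnMapKeyPrune

Note that this turns the ColumnMapKeyPrune optimization off for the
whole session, so it is a workaround rather than a fix.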



