Posted to user@pig.apache.org by Jonathan Packer <js...@columbia.edu> on 2012/06/22 21:09:49 UTC

Pig creates wrong schema after dereferencing nested tuple fields

I'm running into a strange problem where Pig is not detecting the schema
correctly when one dereferences multiple fields from a nested tuple. I
wanted to check whether this was a bug I should file on JIRA or whether the
multiple dereference syntax is deprecated or something.

The following script fails:

data = LOAD 'test_data.txt' USING PigStorage() AS (f1: int, f2: int, f3: int, f4: int);

nested = FOREACH data GENERATE f1, (f2, f3, f4) AS nested_tuple;

dereferenced = FOREACH nested GENERATE f1, nested_tuple.(f2, f3);
DESCRIBE dereferenced;

uses_dereferenced = FOREACH dereferenced GENERATE nested_tuple.f3;
DESCRIBE uses_dereferenced;

The schema of "dereferenced" should be {f1: int, nested_tuple: (f2: int,
f3: int)}, but DESCRIBE reports {f1: int, f2: int} instead. When DUMP is
used, however, the data does come out in the form of the correct schema, e.g.:

(1,(2,3))
(5,(6,7))
...

This is not just a problem with DESCRIBE. Because the schema is incorrect,
the reference to "nested_tuple" in the "uses_dereferenced" statement is
considered to be invalid, and the script fails to run. The error is:

Invalid field projection. Projected field [nested_tuple] does not exist in
schema: f1:int,f2:int.

Re: Pig creates wrong schema after dereferencing nested tuple fields

Posted by Russell Jurney <ru...@gmail.com>.
Try this:

data = LOAD 'test_data.txt' USING PigStorage() AS (f1: int, f2: int, f3: int, f4: int);

nested = FOREACH data GENERATE f1, (f2, f3, f4) AS nested_tuple;

dereferenced = FOREACH nested GENERATE f1, nested_tuple.(f2, f3) as tuple_two;
DESCRIBE dereferenced;

uses_dereferenced = FOREACH dereferenced GENERATE tuple_two.f3 as f3;
DESCRIBE uses_dereferenced;


I find that naming fields explicitly can sometimes work around bugs like this.
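
If the alias alone does not help, another possible workaround is to dereference each field on its own and rebuild the tuple, so the multi-field form nested_tuple.(f2, f3) is never used. This is an untested sketch that reuses the relations from the script above. Whether it avoids the schema problem on a given Pig version is an assumption, and the inner field names f2 and f3 are only expected to carry over, so check the DESCRIBE output:

```pig
-- Assumes 'nested' is defined as in the script above.
-- Dereference one field at a time and rebuild the tuple with the same
-- parenthesized tuple syntax that was used to create nested_tuple.
dereferenced = FOREACH nested GENERATE f1, (nested_tuple.f2, nested_tuple.f3) AS nested_tuple;
DESCRIBE dereferenced;

-- If the schema above is reported correctly, this projection should resolve.
uses_dereferenced = FOREACH dereferenced GENERATE nested_tuple.f3;
DESCRIBE uses_dereferenced;
```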

-- 
Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com