Posted to dev@pig.apache.org by Santhosh Srinivasan <sm...@yahoo-inc.com> on 2008/08/29 18:49:35 UTC

Flattening bags and tuples without a known schema

This email discusses a use case of flattening a bag or tuple when the
schema of the bag or tuple is not known, i.e., null.

When UDFs return bags or tuples (complex type), the schema of the
complex type can be declared via the outputSchema method of the UDF. By
default, the outputSchema method in EvalFunc (the abstract base class)
returns null. When users try to flatten the output of the UDF, the
schema of the flattened column cannot be determined. An example follows.

E.g.: 

--myudf returns a bag whose schema is null, i.e., not declared
B = foreach A generate flatten(myudf), $1 as x;

In the above example, since the schema of the bag returned by myudf is
not known, we have two possible options:

1. Erring on the side of safety, set the schema of the flattened column
to be a bytearray. While this is a safe assumption, UDF authors who know
exactly what the UDF returns will try to access the individual elements
directly. For example, if myudf returned a bag of tuples with 3 elements
each, the following would be a plausible use case:

C = foreach B generate $2 as mycolumn;

At this point, the safe assumption that the flattened column is a
single column of type bytearray yields {bytearray, x: bytearray} as the
schema for B. As a result, statement C will fail with a parse exception
for out-of-bounds access, because $2 refers to a third column that this
two-column schema does not have.

Because UDF authors have complete knowledge of the UDF's return values,
they should override the outputSchema method in the UDF to ensure
correct schemas. The other option is to specify the schema as part of
the "AS" clause in the generate statement, i.e.,

B = foreach A generate flatten(myudf) as (name: chararray, age: int,
gpa: float), $1 as x;

2. Set the schema of the foreach to be unknown, i.e., null. The bag
returned by the UDF could contain an arbitrary number of columns, making
it impossible to assign the correct column number to the other
expression, x, in the generate clause. In all likelihood, this will
break existing Pig scripts such as:

B = foreach A generate flatten(myudf), $1 as x;
C = foreach B generate $1 + x;
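
To make the difference between the two options concrete, here is a
minimal sketch in Java. It is a toy model only, not Pig's schema code:
the class FlattenSchemaSketch, its methods, and the representation of a
schema as a list of "alias: type" strings are all hypothetical, and a
null list stands for an unknown schema. It also assumes that a
positional reference cannot be checked at parse time when the schema is
null.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of the two options; none of these names exist in Pig.
public class FlattenSchemaSketch {

    // Option 1: an undeclared (null) UDF schema collapses to one bytearray column.
    public static List<String> option1(List<String> udfColumns, List<String> otherColumns) {
        List<String> schema = new ArrayList<>();
        if (udfColumns == null) {
            schema.add("bytearray");
        } else {
            schema.addAll(udfColumns);
        }
        schema.addAll(otherColumns);
        return schema;
    }

    // Option 2: an undeclared (null) UDF schema makes the whole foreach schema null.
    public static List<String> option2(List<String> udfColumns, List<String> otherColumns) {
        if (udfColumns == null) {
            return null;
        }
        List<String> schema = new ArrayList<>(udfColumns);
        schema.addAll(otherColumns);
        return schema;
    }

    // Assumption: with a null schema nothing can be checked at parse time,
    // so a positional reference such as $2 is let through.
    public static boolean positionPassesParseCheck(List<String> schema, int position) {
        return schema == null || (position >= 0 && position < schema.size());
    }

    // An alias such as x can only be resolved against a known schema.
    public static boolean aliasResolves(List<String> schema, String alias) {
        if (schema == null) {
            return false;
        }
        for (String column : schema) {
            if (column.startsWith(alias + ":")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> others = Arrays.asList("x: bytearray");

        List<String> b1 = option1(null, others);
        System.out.println("option 1 schema for B: " + b1);
        System.out.println("option 1, $2 passes parse check: " + positionPassesParseCheck(b1, 2));

        List<String> b2 = option2(null, others);
        System.out.println("option 2 schema for B: " + b2);
        System.out.println("option 2, alias x resolves: " + aliasResolves(b2, "x"));

        // With a declared schema (outputSchema or an AS clause) both options agree.
        List<String> declared = Arrays.asList("name: chararray", "age: int", "gpa: float");
        System.out.println("declared schema for B: " + option1(declared, others));
    }
}
```

In this model, option 1 rejects $2 because B has only two columns, while
option 2 loses the alias x because B has no schema at all. Once the
schema is declared, both options produce the same four columns.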


Currently, I have an implementation for option 1. Any
thoughts/suggestions/comments are welcome.

Thanks,
Santhosh

Re: Flattening bags and tuples without a known schema

Posted by Alan Gates <ga...@yahoo-inc.com>.
I vote for option 2, as it is consistent with other pig operations.  
When we load a file and no schema is given, we make no assumptions.  
When we union two relations with differing schemas, the resulting 
relation has no schema.  I think it makes sense to do the same thing 
here.  If the user happens to know his UDF's schema, he can provide it 
via an AS clause.  I agree that this will break some scripts but it is 
consistent with the rest of the way we do things.
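
The consistency argument can be sketched as a pair of rules. This is an
illustrative Java model, not Pig's implementation: the class
NullSchemaConsistencySketch and its methods are hypothetical, a null
list stands for "no schema", and two points are assumptions rather than
anything stated above: that a union of identical schemas keeps that
schema, and that an AS clause takes precedence over a schema declared
by the UDF.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of "no schema means no assumptions"; these names do not exist in Pig.
public class NullSchemaConsistencySketch {

    // Union: differing (or missing) schemas give a relation with no schema.
    // Assumption: identical schemas are kept as-is.
    public static List<String> unionSchema(List<String> left, List<String> right) {
        if (left == null || right == null) {
            return null;
        }
        return left.equals(right) ? left : null;
    }

    // Flatten under option 2: use what the user or the UDF declared, else nothing.
    // Assumption: an AS clause wins over the UDF's declared schema.
    public static List<String> flattenSchema(List<String> declaredByUdf, List<String> asClause) {
        if (asClause != null) {
            return asClause;
        }
        return declaredByUdf;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("name: chararray", "age: int");
        List<String> b = Arrays.asList("name: chararray", "gpa: float");

        System.out.println("union of differing schemas: " + unionSchema(a, b));
        System.out.println("flatten, nothing declared: " + flattenSchema(null, null));
        System.out.println("flatten, AS clause given: " + flattenSchema(null, a));
    }
}
```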

Alan.
