Posted to dev@pig.apache.org by Alan Gates <ga...@yahoo-inc.com> on 2008/09/05 01:45:30 UTC
Re: Flattening bags and tuples without a known schema

I vote for option 2, as it is consistent with other Pig operations.
When we load a file and no schema is given, we make no assumptions.
When we union two relations with differing schemas, the resulting
relation has no schema.  I think it makes sense to do the same thing
here.  If the user happens to know his UDF's schema, he can provide it
via an AS clause.  I agree that this will break some scripts, but it is
consistent with the way we do everything else.

Alan.

Santhosh Srinivasan wrote:
> This email discusses a use case of flattening a bag or tuple when the
> schema of the bag or tuple is not known, i.e., null.
>
> When UDFs return bags or tuples (complex type), the schema of the
> complex type can be declared via the outputSchema method of the UDF. By
> default, the outputSchema method in EvalFunc (the abstract base class)
> returns null. When users try to flatten the output of the UDF, the
> schema of the flattened column cannot be determined. An example follows.
>
> E.g.: 
>
> --myudf returns a bag whose schema is null, i.e., not declared
> B = foreach A generate flatten(myudf), $1 as x;
>
> In the above example, since the schema of the bag returned by myudf is
> not known, we have two possible options:
>
> 1. Erring on the side of safety, set the schema of the flattened column
> to be a single bytearray. While this is a safe assumption, UDF authors
> who know the exact return value of the UDF will try to access the
> elements accordingly. For example, if myudf returned a bag of tuples
> with 3 elements each, the following might be a possible use case:
>
> C = foreach B generate $2 as mycolumn;
>
> At this point, the safe assumption that the flattened column is a
> single column of type bytearray will generate {bytearray, x: bytearray}
> as the schema for B. As a result, statement C will generate a parse
> exception for out-of-bounds access, since $2 refers to a third column
> and the schema has only two.
>
> Since UDF authors have complete knowledge of the UDF's return values,
> they should override the outputSchema method in the UDF to ensure
> correct schemas. The other option is to specify the schema as part of
> the "AS" clause in the generate statement, i.e.,
>
> B = foreach A generate flatten(myudf) as (name: chararray, age: int,
> gpa: float), $1 as x;
>
> 2. Set the schema of the foreach to be unknown or null. The bag returned
> by the UDF could contain an arbitrary number of columns, making it
> impossible to determine the column position of the other expression, x,
> in the generate clause. In all likelihood, this will break existing Pig
> scripts such as:
>
> B = foreach A generate flatten(myudf), $1 as x;
> C = foreach B generate $1 + x;
>
>
> Currently, I have an implementation for option 1. Any
> thoughts/suggestions/comments are welcome.
>
> Thanks,
> Santhosh
>
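
To make the difference between the two options in the thread above concrete, here is a minimal Python sketch of the schema bookkeeping. It is a toy model and not Pig's implementation: the names foreach_schema, positional_ok and alias_ok are hypothetical, and it assumes that a null schema allows positional references but cannot resolve aliases.

```python
from typing import List, Optional, Tuple

Field = Tuple[Optional[str], str]   # (alias or None, type name)
Schema = Optional[List[Field]]      # None models an unknown (null) schema


def foreach_schema(flattened: Schema, others: List[Field], option: int) -> Schema:
    """Schema of `foreach ... generate flatten(udf), <others>` in this toy model."""
    if flattened is not None:
        # The UDF's outputSchema or an AS clause declared the fields.
        return flattened + others
    if option == 1:
        # Option 1: treat the flattened output as one unnamed bytearray column.
        return [(None, "bytearray")] + others
    # Option 2: the whole relation has no schema.
    return None


def positional_ok(schema: Schema, index: int) -> bool:
    """Is $index accepted up front? An unknown schema cannot be checked."""
    return schema is None or 0 <= index < len(schema)


def alias_ok(schema: Schema, alias: str) -> bool:
    """Can a named column be resolved? Only if a schema exists and contains it."""
    return schema is not None and any(name == alias for name, _ in schema)


x = [("x", "bytearray")]
option1 = foreach_schema(None, x, option=1)
option2 = foreach_schema(None, x, option=2)
declared = foreach_schema(
    [("name", "chararray"), ("age", "int"), ("gpa", "float")], x, option=1
)
```

In this model, option 1 rejects $2 on B because the schema has only two columns, option 2 accepts any positional reference but can no longer resolve x, and a declared schema (via outputSchema or an AS clause) supports both kinds of access.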