Posted to user@pig.apache.org by Narayanan K <kn...@gmail.com> on 2014/06/14 23:43:54 UTC

Output Schema of Pig UDF that returns a Tuple

Hi

I am writing a Pig UDF that returns a Tuple, as per
http://wiki.apache.org/pig/UDFManual . I want the output tuple to have
a particular schema, say {name:chararray, age:int}, once I FLATTEN the
UDF's result.

As per the UDFManual, the method below

public Schema outputSchema(Schema input) {
    try {
        Schema tupleSchema = new Schema();
        tupleSchema.add(input.getField(1));
        tupleSchema.add(input.getField(0));
        return new Schema(new Schema.FieldSchema(
                getSchemaName(this.getClass().getName().toLowerCase(), input),
                tupleSchema, DataType.TUPLE));
    } catch (Exception e) {
        return null;
    }
}

gives this.getClass().getName().toLowerCase()::name and
this.getClass().getName().toLowerCase()::age as the fields after I
flatten.

My actual use case has a Tuple whose schema has 100 columns, with
nested bags etc.

Is there some way I can get rid of the prefix on each of the fields?

I just need the schema of the Tuple to be

 { field_name1: datatype1, field_name2: datatype2, .... field_name100: datatype100 }


Thanks
Narayanan

Re: Output Schema of Pig UDF that returns a Tuple

Posted by Lorand Bendig <lb...@gmail.com>.

Hi Narayanan,

The disambiguation string (the null:: prefix) is added by the flatten
operator, not by outputSchema().
--Lorand

On 06/17/2014 02:26 AM, Narayanan K wrote:
> Hi Lorand
>
> Thanks for the reply. My use case has around 100 columns and growing,
> and I didn't want to make the script look ugly and error prone with
> the definition of schema of all 100 columns.
>
> My idea was the UDF will return tuple for each record with a self
> explanatory schema returned by outputSchema() and I can use this to
> write directly into a Hive Table with HCatStorer(). The HCatStorer
> expects same name for each field from the Pig script and hive table
> schema. Hence if the outputSchema() could provide a Tuple with same
> name as the field name instead of a null:: prefix, it will be helpful.
>
>
> Regards
> Narayanan
>

Re: Output Schema of Pig UDF that returns a Tuple

Posted by Narayanan K <kn...@gmail.com>.

Hi Lorand

Thanks for the reply. My use case has around 100 columns and growing,
and I didn't want to make the script ugly and error-prone by
spelling out the schema of all 100 columns.

My idea was that the UDF would return a tuple for each record with a
self-explanatory schema supplied by outputSchema(), which I could then
write directly into a Hive table with HCatStorer(). HCatStorer
expects each field name in the Pig script to match the Hive table
schema. Hence it would be helpful if outputSchema() could provide a
Tuple whose fields keep their own names instead of gaining a null::
prefix.


Regards
Narayanan

Re: Output Schema of Pig UDF that returns a Tuple

Posted by Lorand Bendig <lb...@gmail.com>.

Hi,

If you flatten a tuple/bag, Pig will prefix each field name with a
disambiguation string ([prefix]::). (See:
http://pig.apache.org/docs/r0.12.0/basic.html#disambiguate).
In your example getSchemaName() returns a generated unique name built
from the class name + first input schema field + a unique id. If you want
to get rid of the disambiguation string, you need to explicitly define
the schema when flattening:

Example:

A = load 'data.txt' using PigStorage() as (c:chararray);
B = foreach A generate TOBAG(TOTUPLE($0, 1)) as ({(field1:chararray, field2:int)});
describe B;
B: {bag_0: {(field1: chararray,field2: int)}}

Define schema for flatten:

C = foreach B generate flatten($0) as (field1:chararray, field2:int);
describe C;
C: {field1: chararray,field2: int}
D = foreach C generate field1;
...
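
With around 100 columns, typing the "as (...)" clause by hand is the tedious part. A minimal sketch of generating it instead is shown here, assuming the field names and Pig type names are already available in order (for example from the Hive table definition). The class FlattenSchemaClause and its method flattenAs are hypothetical names used for illustration; they are not part of Pig.

```java
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of Pig: produces the text of a
// "FLATTEN(expr) AS (name:type, ...)" clause to paste into a script.
final class FlattenSchemaClause {

    // fields must iterate in column order (e.g. a LinkedHashMap);
    // keys are field names, values are Pig type names such as "chararray".
    static String flattenAs(String expression, Map<String, String> fields) {
        StringJoiner joiner =
                new StringJoiner(", ", "FLATTEN(" + expression + ") AS (", ")");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            joiner.add(field.getKey() + ":" + field.getValue());
        }
        return joiner.toString();
    }
}
```

Calling flattenAs("$0", fields) with a LinkedHashMap holding field1 -> chararray and field2 -> int yields the same clause as in the example above. Only simple type names are handled in this sketch; nested bag or tuple types would have to be supplied as ready-made type strings.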

However, if the original column name (field1) is unique within the 
schema, you can refer to it by this name, rather than using the 
disambiguated form (bag_0::field1), so you don't need to explicitly set 
the schema:

C = foreach B generate flatten($0);
describe C;
C: {bag_0::field1: chararray,bag_0::field2: int}
D = foreach C generate field1;  --refers to bag_0::field1
...
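
If a later step needs the aliases themselves to be unprefixed (the HCatStorer case mentioned in this thread is one such situation, going by the description given there), an explicit rename is still required. The following is a rough sketch of generating that rename statement from the prefixed aliases reported by describe. FlattenAliasRenamer, stripPrefix and renameStatement are hypothetical names, not Pig APIs, and the sketch assumes the short names are unique once the prefix is removed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Pig: builds a FOREACH statement that
// renames flattened aliases such as "bag_0::field1" back to "field1".
final class FlattenAliasRenamer {

    // Returns the part of the alias after the last "::", or the alias
    // unchanged when it carries no disambiguation prefix.
    static String stripPrefix(String alias) {
        int idx = alias.lastIndexOf("::");
        return idx < 0 ? alias : alias.substring(idx + 2);
    }

    // Builds e.g. "D = FOREACH C GENERATE bag_0::field1 AS field1, ...;".
    // Assumes the stripped names do not collide with each other.
    static String renameStatement(String outRelation, String inRelation,
                                  List<String> aliases) {
        List<String> parts = new ArrayList<>();
        for (String alias : aliases) {
            parts.add(alias + " AS " + stripPrefix(alias));
        }
        return outRelation + " = FOREACH " + inRelation
                + " GENERATE " + String.join(", ", parts) + ";";
    }
}
```

For the relation C above, renameStatement("D", "C", ...) with the two aliases bag_0::field1 and bag_0::field2 produces a single FOREACH line that can be pasted into the script or written to an included file.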

Hope this helps!
--Lorand

