Posted to user@pig.apache.org by "Dan DeCapria, CivicScience" <da...@civicscience.com> on 2013/03/11 20:35:53 UTC

Pigify Data Input to UDF for Unit Testing

First poster here! Really excited to get some feedback and contribute to
Pig!

I am attempting to simplify the UDF input process in the context of scaling
JUnit testing. Previously, to create a valid Pig input for my UDFs for
JUnit testing, I have had to build each layer/nesting of the Pig input from
org.apache.pig.data.* constructs, for each use case to unit test. I am
looking for a quick methodology to simplify this process and to scale to
additional unit tests. A use case is defined below:

Assume the input schema is defined a priori. Assume also that the
outputSchema is properly defined in the UDF to be unit tested. By running
illustrate on the prior Pig process, I obtain the InputData in the form of
InputSchema for the UDF under test. Conceptually, the unit testing
approach is as follows:

InputSchema
bag_a:bag{tuple_b:tuple(tuple_c1:tuple(tuple_d1:tuple(field_a:chararray,field_b:chararray)),field_e:chararray)}

OutputSchema
bag_a:bag{tuple_b:tuple(tuple_c1:tuple(tuple_d1:tuple(field_a:chararray,field_b:chararray),tuple_d2:tuple(field_c:chararray,field_d:chararray)),field_e:chararray)}

Prior (non-scalable) methodology:
Create bag_a DataBag.
Create tuple_b Tuple.
Create tuple_c1 Tuple.
Create tuple_d1 Tuple.
Append data field_a to tuple_d1. Append data field_b to tuple_d1.
Append tuple_d1 to tuple_c1.
Append tuple_c1 to tuple_b. Append data field_e to tuple_b.
Append tuple_b to bag_a.
Unit test UDF(bag_a). //

Is there a way to 'pigify' the InputSchema data String, as it appears from
illustrate of the prior Pig process, so that it can be fed into the
UDF(InputData) without my having to perform the prior methodology
explicitly? An ideal solution would take the form:

Awesome methodology:
String_of_data_in_inputFormat:
 bag_a:bag{tuple_b:tuple(tuple_c1:tuple(tuple_d1:tuple(field_a:chararray,field_b:chararray)),field_e:chararray)}
DataBag bag_a = pigify(String_of_data_in_inputFormat);
unit test UDF(bag_a). //

Thanks in advance,

-Dan DeCapria

Re: Pigify Data Input to UDF for Unit Testing

Posted by "Dan DeCapria, CivicScience" <da...@civicscience.com>.
Bumping, and simplifying the use case.

In Java, given a filled DataBag bag, invoke a SerDe-like operation, such
that:

Assume DataBag bag instantiated and filled;
String bag_string = bag.toString();
DataBag new_bag = some_deserializer(bag_string);
new_bag.equals(bag) returns true;

Is there an off-the-shelf method or process 'some_deserializer()' to go from
the .toString() String back to an org.apache.pig.data.DataBag?
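
To make the round trip concrete, here is a minimal sketch of what a
some_deserializer() could look like for the plain-text form, using only the
Java standard library. PigTextParser and Node are hypothetical names, not Pig
classes: Node is just a stand-in for DataBag/Tuple, every leaf stays a String
(no schema is applied), and field values that themselves contain commas,
parentheses, or braces are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, NOT part of Pig: parses text such as "{(((a,b)),e)}"
// into nested Node objects that print back to the same text.
final class PigTextParser {

    // Stand-in for DataBag ('{') or Tuple ('('); leaves are plain Strings.
    static final class Node {
        final char open;
        final List<Object> items = new ArrayList<>();

        Node(char open) { this.open = open; }

        @Override public String toString() {
            StringBuilder sb = new StringBuilder().append(open);
            for (int i = 0; i < items.size(); i++) {
                if (i > 0) sb.append(',');
                sb.append(items.get(i));
            }
            return sb.append(open == '{' ? '}' : ')').toString();
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof Node)) return false;
            Node other = (Node) o;
            return open == other.open && items.equals(other.items);
        }

        @Override public int hashCode() { return 31 * open + items.hashCode(); }
    }

    private final String text;
    private int pos = 0;

    private PigTextParser(String text) { this.text = text; }

    // Returns a Node for a bag or tuple, or a String for a bare field.
    static Object parse(String text) {
        PigTextParser p = new PigTextParser(text.trim());
        Object value = p.parseValue();
        if (p.pos != p.text.length()) {
            throw new IllegalArgumentException("Unexpected input at offset " + p.pos);
        }
        return value;
    }

    private Object parseValue() {
        if (pos < text.length()) {
            char c = text.charAt(pos);
            if (c == '{') return parseSequence('{', '}');
            if (c == '(') return parseSequence('(', ')');
        }
        return parseField();
    }

    // Parses a comma-separated list of values between open and close.
    private Node parseSequence(char open, char close) {
        Node node = new Node(open);
        pos++; // consume the opening delimiter
        if (peek() == close) { pos++; return node; } // "{}" or "()" is empty
        while (true) {
            node.items.add(parseValue());
            char c = peek();
            pos++;
            if (c == close) return node;
            if (c != ',') {
                throw new IllegalArgumentException("Expected ',' or '" + close + "' at offset " + (pos - 1));
            }
        }
    }

    // Reads a leaf up to the next ',', ')' or '}' (may be empty).
    private String parseField() {
        int start = pos;
        while (pos < text.length() && ",)}".indexOf(text.charAt(pos)) < 0) pos++;
        return text.substring(start, pos);
    }

    private char peek() {
        if (pos >= text.length()) {
            throw new IllegalArgumentException("Unexpected end of input");
        }
        return text.charAt(pos);
    }
}
```

With this sketch, PigTextParser.parse(s).toString() gives back s for inputs
such as {(((a,b)),e)}. Mapping Node onto real DataBag and Tuple instances,
and casting leaves to the types in the schema, would be a further step.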

Many Thanks!