Posted to user@pig.apache.org by "Dan DeCapria, CivicScience" <da...@civicscience.com> on 2013/03/21 16:57:20 UTC

Utf8StorageConverter Not Handling Empty Tuples Properly

For Pig 0.10.1, I came across a use case for the caster method
*Utf8StorageConverter.consumeTuple()* in which passing an empty tuple to the
caster did not produce a valid empty tuple. The output was instead a tuple
object containing a single empty DataByteArray. I believe this warrants a
discussion of the set-theoretic form of the empty states in this caster's
pre- and post-conditions. An empty tuple is the empty set ∅, just as the
empty bag is the empty set ∅ (see
https://en.wikipedia.org/wiki/Tuple#Tuples_as_nested_sets). For Pig, I
believe ∅ translates to TupleFactory.getInstance().newTuple() for tuples
and to BagFactory.getInstance().newDefaultBag() for bags, and not to null
objects.

Use Case:

String string_input = "()";
String string_schema = "t1:tuple()";
Tuple t1 = this.tuple_factory.newTuple();
Utf8StorageConverter caster = new Utf8StorageConverter();
LogicalSchema ls = Utils.parseSchema(string_schema);
ResourceSchema rs = new ResourceSchema(ls);
ResourceSchema.ResourceFieldSchema[] fields = rs.getFields();
Object result = CastUtils.convertToType(caster,
        string_input.getBytes("UTF-8"), fields[0], fields[0].getType());
// this will fail, as consumeTuple() is logically ill-defined for empty tuples
org.junit.Assert.assertEquals(t1, result);


The Object result has, in effect, the form "t1:tuple(a:bytearray)", which is
incorrect; it should be "t1:tuple()". In other words, the result contains
one field of type DataByteArray with size 0.
Upon examining the code block, a relatively easy fix would be a conditional
on lines 170-171, converting to:

*src/org/apache/pig/builtin/Utf8StorageConverter.java:*
170: DataByteArray value = new DataByteArray(mOut.toByteArray());
171: if (value.size() > 0) { //  non-empty tuple condition
172:     t.append(value);
173: }
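
To make the effect of that guard concrete, here is a minimal standalone
sketch in plain Java. It is not the Pig source: the class EmptyTupleSketch,
the method parseFields and the flag guardEmpty are hypothetical names, and
the parser is a simplified stand-in that splits a tuple literal on top-level
commas and returns the raw field strings, assuming the untyped (bytearray)
case only.

```java
import java.util.ArrayList;
import java.util.List;

public class EmptyTupleSketch {

    // Hypothetical stand-in for the untyped branch of a tuple parser.
    // guardEmpty == false: the last field is always appended (behaviour described above).
    // guardEmpty == true:  an empty last field is skipped (the proposed conditional).
    static List<String> parseFields(String input, boolean guardEmpty) {
        int n = input.length();
        if (n < 2 || input.charAt(0) != '(' || input.charAt(n - 1) != ')') {
            throw new IllegalArgumentException("not a tuple literal: " + input);
        }
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0; // nesting level of inner tuples, bags and maps
        for (int i = 1; i < n; i++) {
            char c = input.charAt(i);
            if (i == n - 1) {
                // Closing ')' of the outer tuple: this is where the guard applies.
                if (!guardEmpty || current.length() > 0) {
                    fields.add(current.toString());
                }
            } else if (c == ',' && depth == 0) {
                // Top-level separator: finish the current field.
                fields.add(current.toString());
                current.setLength(0);
            } else {
                if (c == '(' || c == '{' || c == '[') {
                    depth++;
                } else if (c == ')' || c == '}' || c == ']') {
                    depth--;
                }
                current.append(c);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parseFields("()", false).size());   // prints 1
        System.out.println(parseFields("()", true).size());    // prints 0
        System.out.println(parseFields("(a,(b,c))", true));    // prints [a, (b,c)]
        System.out.println(parseFields("(a,)", true).size());  // prints 1
    }
}
```

In this sketch the guard turns "()" into a tuple with no fields, as intended.
It also drops a trailing empty field, so "(a,)" yields one field rather than
two; whether that side effect is acceptable in the real caster is a separate
question worth checking against the actual code.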


With this fix in place, the following unit tests pass:
//  empty tuple test
String string_input = "()";
String string_schema = "t1:tuple()";
Tuple t1 = this.tuple_factory.newTuple();
Utf8StorageConverter caster = new Utf8StorageConverter();
LogicalSchema ls = Utils.parseSchema(string_schema);
ResourceSchema rs = new ResourceSchema(ls);
ResourceSchema.ResourceFieldSchema[] fields = rs.getFields();
Object result = CastUtils.convertToType(caster,
        string_input.getBytes("UTF-8"), fields[0], fields[0].getType());
// with the code-block fix, success as empty
org.junit.Assert.assertEquals(t1, result);

//  for reference, the same approach with empty DataBags
String string_input = "{}";
String string_schema = "b1:bag{}";
DataBag b1 = this.bag_factory.newDefaultBag();
Utf8StorageConverter caster = new Utf8StorageConverter();
LogicalSchema ls = Utils.parseSchema(string_schema);
ResourceSchema rs = new ResourceSchema(ls);
ResourceSchema.ResourceFieldSchema[] fields = rs.getFields();
Object result = CastUtils.convertToType(caster,
        string_input.getBytes("UTF-8"), fields[0], fields[0].getType());
// success as empty, no modifications required
org.junit.Assert.assertEquals(b1, result);


Thoughts on this discussion point?

-Dan