Posted to user@pig.apache.org by Tiago Macambira <ti...@chaordicsystems.com> on 2012/08/15 16:11:36 UTC

Python UDF and ClassCastException errors

Hi there.

I am having some trouble using Pig with a Python UDF
and I was wondering if someone could shed some light on what I am doing
wrong. It seems that Pig has trouble handling a bag of
tuples returned by a Python UDF, as I am getting the following
ClassCastException:

java.lang.ClassCastException: org.apache.pig.data.BinSedesTuple cannot
be cast to org.apache.pig.data.DataBag

Below I pasted the simplest code I could come up with that
exemplifies what I am trying to do.


A = LOAD 'data' AS (url:chararray,outlink:chararray);
-- (www.ccc.com,www.hjk.com)
-- (www.ddd.com,www.xyz.org)
-- (www.aaa.com,www.cvn.org)
-- (www.www.com,www.kpt.net)
-- (www.www.com,www.xyz.org)
-- (www.ddd.com,www.xyz.org)
B = GROUP A BY url;
-- (www.aaa.com,{(www.aaa.com,www.cvn.org)})
-- (www.ccc.com,{(www.ccc.com,www.hjk.com)})
-- (www.ddd.com,{(www.ddd.com,www.xyz.org),(www.ddd.com,www.xyz.org)})
-- (www.www.com,{(www.www.com,www.kpt.net),(www.www.com,www.xyz.org)})
C = foreach B generate group, COUNT(A);
dump C;
-- (www.aaa.com,1)
-- (www.ccc.com,1)
-- (www.ddd.com,2)
-- (www.www.com,2)

-- OK, fine 'till here. Let's try with a UDF:

register 'my_udf.py' using jython as my_udf;
-- @outputSchema("res:{t:(value:chararray)}")
-- def test_list_of_items():
--     return tuple([i for i in range(5)])

H = foreach B generate group, my_udf.test_list_of_items();
DUMP H;
-- (www.aaa.com,(0,1,2,3,4))
-- (www.ccc.com,(0,1,2,3,4))
-- (www.ddd.com,(0,1,2,3,4))
-- (www.www.com,(0,1,2,3,4))
DESCRIBE H;
-- H: {group: chararray,res: {t: (value: chararray)}}
I = FOREACH H generate group, COUNT(res);
DUMP I;
-- POWWWW!
-- org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing count in COUNT
--	at org.apache.pig.builtin.COUNT.exec(COUNT.java:74)
-- (...)
-- Caused by: java.lang.ClassCastException: org.apache.pig.data.BinSedesTuple cannot be cast to org.apache.pig.data.DataBag




Background: Most of the data I have to work with is generated by
Hadoop Streaming apps and consists of tab-separated pairs of
JSON-encoded data. If I am not mistaken, this is not something that
can be parsed by JsonLoader or ElephantBird directly, so I wrote a
Python UDF to decode the JSON data, but I was unable to use it as
expected in Pig code. I kept getting strange errors even though my
data had the same schema as the data used in some documentation
examples and I was doing exactly the same kind of manipulations.
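
For readers with similar data, here is a minimal sketch of what decoding one
such line could look like in plain Python. It assumes each line is a
key<TAB>value pair with both halves JSON-encoded; the helper name
decode_pair and the sample line are hypothetical, not taken from the actual
UDF. Whether the json module is available inside a Jython UDF depends on the
Jython version bundled with Pig.

```python
import json


def decode_pair(line):
    # Hypothetical helper: split on the first tab only, so tabs inside
    # the JSON value do not break the pair apart.
    key_json, value_json = line.rstrip("\n").split("\t", 1)
    # Decode each half independently and return them as a 2-tuple.
    return (json.loads(key_json), json.loads(value_json))


# Made-up sample line, for illustration only.
sample_line = '"www.ddd.com"\t{"outlink": "www.xyz.org"}\n'
print(decode_pair(sample_line))
```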

Cheers.

Tiago Alves Macambira

Re: Python UDF and ClassCastException errors

Posted by Tiago Macambira <ti...@chaordicsystems.com>.
+user@pig, so others can avoid this silly, newbie mistake.

On Wed, Aug 15, 2012 at 11:21 AM, Philipp Pahl <ph...@gmail.com> wrote:
> Hi Tiago,
>
> I didn't test it, but have you tried to put [] around the tuple?

Wow! Worked like a charm! I definitely owe you a beer :)

I wrote a small test function to get acquainted with what is a
valid return value for a UDF and what isn't. Just a one-line
function with a single return statement: `def test(): return [(1)]`.
Unfortunately I got trapped by a common Python pitfall: [(1)] != [(1,)].
Since Pig doesn't handle [(1)] as a valid Python UDF return value, I
started to think one could not return a list from a Python UDF.
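
For anyone who has not run into this pitfall before, a short sketch in plain
Python (no Pig needed; the variable names are only for illustration): the
parentheses alone do not create a tuple, the trailing comma does.

```python
# Parentheses around a single value are just grouping.
not_a_tuple = (1)        # the integer 1
one_tuple = (1,)         # a tuple with one element

list_of_ints = [(1)]     # same as [1]
list_of_tuples = [(1,)]  # a list holding a one-element tuple

print(type(not_a_tuple).__name__)      # int
print(type(one_tuple).__name__)        # tuple
print(list_of_ints == list_of_tuples)  # False
```

So [(1)] is a list of plain integers, not a list of tuples, which is why it
does not look like a bag of tuples to Pig.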

> @outputSchema("res:{t:(value:chararray)}")
> -- def test_list_of_items():
> --     return [ tuple([i for i in range(5)]) ]
>
> as it is done in the collectBag function here:
> https://cwiki.apache.org/PIG/udfsusingscriptinglanguages.html

I completely missed that sample function in the documentation. My bad. :(
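
For the archives, a sketch of how my_udf.py could look with the fix applied.
This is a variant of the suggestion above, not exactly what was posted: it
returns one single-field tuple per item, with string values, so that the
result lines up with the declared schema. It relies on the type mapping
described in the Pig documentation, where a Python list becomes a bag and a
Python tuple becomes a tuple. The fallback definition of outputSchema is my
own addition so the file can also be run outside Pig; when registered with
"using jython", Pig supplies the decorator.

```python
try:
    outputSchema  # supplied by Pig when the script is registered
except NameError:
    # Fallback (assumption, for running outside Pig): a no-op decorator.
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap


@outputSchema("res:{t:(value:chararray)}")
def test_list_of_items():
    # The outer list is the bag; each inner one-element tuple is a row.
    # Note the trailing comma that makes (str(i),) a real tuple.
    return [(str(i),) for i in range(5)]
```

With this version the declared type of res and the returned value are both
bags, so COUNT(res) should no longer hit the ClassCastException.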

But, anyway, sorry for the newbie question. And thanks :)

Cheers.

Tiago Alves Macambira