Posted to user@pig.apache.org by Travis Crawford <tr...@gmail.com> on 2010/01/16 22:50:33 UTC

cast to tuple errors

Hey pig gurus -

I'm having an issue with cast-to-tuple errors, such as:

ERROR 2999: Unexpected internal error.
org.apache.pig.data.DefaultDataBag$DefaultDataBagIterator cannot be cast to
org.apache.pig.data.Tuple

Any help understanding where I've gone wrong would be appreciated!


DETAILS:

Given some Apache logs I'd like to see the percentage of responses by
response code by minute. Basically, I'd like to generate the following:

"""
day_hour_min  response_code  response_code_count  total_responses  response_code_pct
200101011458            200                    9               10                0.9
200101011458            503                    1               10                0.1
"""


These are the steps I'm using, as reported by describe. Note that `counted' looks correct.

"""
data: {date: chararray,hour: chararray,minute: chararray,response_code: chararray}

grouped_by_minute: {group: (date: chararray,hour: chararray,minute: chararray),data: {date: chararray,hour: chararray,minute: chararray,response_code: chararray}}

grouped_by_minute_by_response_code: {group: (date: chararray,hour: chararray,minute: chararray,response_code: chararray),data: {date: chararray,hour: chararray,minute: chararray,response_code: chararray}}

counted_by_minute: {group: (date: chararray,hour: chararray,minute: chararray),count: long}
counted_by_minute_by_response_code: {group: (date: chararray,hour: chararray,minute: chararray,response_code: chararray),count: long}

counted: {counted_by_minute_by_response_code::group: (date: chararray,hour: chararray,minute: chararray,response_code: chararray),counted_by_minute_by_response_code::count: long,counted_by_minute::group: (date: chararray,hour: chararray,minute: chararray),counted_by_minute::count: long}
"""


Everything works up until my join, where illustrate gives the above
exception. Strangely, I can store the output but it only contains the date,
hour, and minute fields -- missing the counts. For example:

"""
20100110 1 9 20100110 1 9
"""

counted = join
  counted_by_minute_by_response_code by (group.date, group.hour, group.minute),
  counted_by_minute by (group.date, group.hour, group.minute)
  parallel 1;


I've tried writing this a few ways now and always have an issue when
referencing members of the group tuple. For example, I concatenated
date+hour+minute into a single timebucket field and got one step further, but
then ran into what I believe is the same issue when doing the following:

counted_pct = foreach counted generate
  counted_by_minute_by_response_code::group.timebucket as timebucket,
  counted_by_minute_by_response_code::group.response_code as response_code,
  counted_by_minute_by_response_code::count as response_code_count,
  counted_by_minute::count as response_code_count_total,
  (float)counted_by_minute_by_response_code::count / (float)counted_by_minute::count as response_code_pct;

Here I got "java.lang.ClassCastException: java.lang.String cannot be cast to
org.apache.pig.data.Tuple" when referencing timebucket or response_code.
Removing those two items allowed the script to complete (although the output
was not very useful).

Any thoughts on what the problem might be?

Thanks!
Travis

Re: cast to tuple errors

Posted by Rekha Joshi <re...@yahoo-inc.com>.
It would help if you could provide your Pig version and the full script.

However, I suspect that dump/store after the join would work fine in your case, as would explain/describe, and that the issue is only in illustrate.
There are some issues logged for illustrate behavior, e.g. PIG-534. The ClassCastException you get with the other format can be seen under PIG-449.

Cheers,
/R
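
For anyone hitting the same errors, one workaround that may be worth trying is to flatten the group tuple right after each group, so that the join and the final foreach reference plain top-level fields instead of members of `group`. The sketch below is untested against the Pig version in question. The load path, the default loader, and the aliases `cnt` and `total` are assumptions, since the load step of the original script was not posted.

```pig
-- Hypothetical input: the path and schema are assumed from the describe output.
data = LOAD 'access_log_fields'
    AS (date:chararray, hour:chararray, minute:chararray, response_code:chararray);

-- Total responses per minute; FLATTEN turns the group tuple into top-level fields.
grouped_by_minute = GROUP data BY (date, hour, minute);
counted_by_minute = FOREACH grouped_by_minute GENERATE
    FLATTEN(group) AS (date, hour, minute),
    COUNT(data) AS total;

-- Responses per minute per response code, flattened the same way.
grouped_by_minute_by_response_code = GROUP data BY (date, hour, minute, response_code);
counted_by_minute_by_response_code = FOREACH grouped_by_minute_by_response_code GENERATE
    FLATTEN(group) AS (date, hour, minute, response_code),
    COUNT(data) AS cnt;

-- Join on plain fields rather than on group.date, group.hour, group.minute.
counted = JOIN
    counted_by_minute_by_response_code BY (date, hour, minute),
    counted_by_minute BY (date, hour, minute);

-- Disambiguate the joined fields with the relation:: prefix and compute the share.
counted_pct = FOREACH counted GENERATE
    counted_by_minute_by_response_code::date AS date,
    counted_by_minute_by_response_code::hour AS hour,
    counted_by_minute_by_response_code::minute AS minute,
    counted_by_minute_by_response_code::response_code AS response_code,
    counted_by_minute_by_response_code::cnt AS response_code_count,
    counted_by_minute::total AS total_responses,
    (double)counted_by_minute_by_response_code::cnt
        / (double)counted_by_minute::total AS response_code_pct;
```

Whether this avoids the illustrate exception is not confirmed here. It only removes the nested `group.field` references that the failing statements have in common.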

