Posted to user@pig.apache.org by Travis Crawford <tr...@gmail.com> on 2010/01/16 22:50:33 UTC
cast to tuple errors
Hey pig gurus -
I'm having an issue with cast-to-tuple errors, such as:
ERROR 2999: Unexpected internal error.
org.apache.pig.data.DefaultDataBag$DefaultDataBagIterator cannot be cast to
org.apache.pig.data.Tuple
Any help understanding where I've gone wrong would be appreciated!
DETAILS:
Given some Apache logs I'd like to see the percentage of responses by
response code by minute. Basically, I'd like to generate the following:
"""
day_hour_min  response_code  response_code_count  total_responses  response_code_pct
200101011458  200            9                    10               0.9
200101011458  503            1                    10               0.1
"""
Here is what describe reports for each step I'm using. Note `counted' looks correct.
"""
data: {date: chararray,hour: chararray,minute: chararray,response_code: chararray}
grouped_by_minute: {group: (date: chararray,hour: chararray,minute: chararray),data: {date: chararray,hour: chararray,minute: chararray,response_code: chararray}}
grouped_by_minute_by_response_code: {group: (date: chararray,hour: chararray,minute: chararray,response_code: chararray),data: {date: chararray,hour: chararray,minute: chararray,response_code: chararray}}
counted_by_minute: {group: (date: chararray,hour: chararray,minute: chararray),count: long}
counted_by_minute_by_response_code: {group: (date: chararray,hour: chararray,minute: chararray,response_code: chararray),count: long}
counted: {counted_by_minute_by_response_code::group: (date: chararray,hour: chararray,minute: chararray,response_code: chararray),counted_by_minute_by_response_code::count: long,counted_by_minute::group: (date: chararray,hour: chararray,minute: chararray),counted_by_minute::count: long}
"""
Everything works up until my join, where illustrate gives the above
exception. Strangely, I can store the output but it only contains the date,
hour, and minute fields -- missing the counts. For example:
"""
20100110 1 9 20100110 1 9
"""
counted = join
    counted_by_minute_by_response_code by (group.date, group.hour, group.minute),
    counted_by_minute by (group.date, group.hour, group.minute)
    parallel 1;
I've tried writing this a few ways now and always hit an issue when
referencing members of the group tuple. For example, I concatenated
date+hour+minute together and got one step further, but then ran into what I
believe is the same issue when doing the following:
counted_pct = foreach counted generate
    counted_by_minute_by_response_code::group.timebucket as timebucket,
    counted_by_minute_by_response_code::group.response_code as response_code,
    counted_by_minute_by_response_code::count as response_code_count,
    counted_by_minute::count as response_code_count_total,
    (float)counted_by_minute_by_response_code::count / (float)counted_by_minute::count as response_code_pct;
Here I got "java.lang.ClassCastException: java.lang.String cannot be cast to
org.apache.pig.data.Tuple" when referencing timebucket or response_code.
Removing those two items allowed the script to complete (although the output
was not very useful).
Any thoughts on what the problem might be?
Thanks!
Travis
Re: cast to tuple errors
Posted by Rekha Joshi <re...@yahoo-inc.com>.
It would help if you could provide your Pig version and the full script.
That said, I suspect dump/store after the join would work fine in your case, as would explain/describe, and that the issue is only in illustrate.
There are some issues logged for illustrate behavior, e.g. PIG-534. The ClassCastException you get with the other format can be seen under PIG-449.
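Separately from the illustrate issues, one idea that may be worth trying is to flatten the group tuple right after each count, so that the join and the final foreach refer to plain fields rather than members of `group`. The sketch below is untested against your data and assumes `data` is already loaded with the schema that describe reports. The relation names (grouped_min, per_minute, grouped_min_code, per_minute_code, joined) and the count field names (total, code_count) are made up for illustration, and I cannot confirm that this avoids the exception on your Pig version.

```pig
-- Assumption: `data` has the schema (date, hour, minute, response_code),
-- all chararray. Relation and count field names here are hypothetical.

-- Total responses per minute. FLATTEN turns the group tuple into plain fields.
grouped_min = group data by (date, hour, minute);
per_minute = foreach grouped_min generate
    flatten(group) as (date, hour, minute),
    COUNT(data) as total;

-- Responses per minute per response code, flattened the same way.
grouped_min_code = group data by (date, hour, minute, response_code);
per_minute_code = foreach grouped_min_code generate
    flatten(group) as (date, hour, minute, response_code),
    COUNT(data) as code_count;

-- Join on plain fields instead of group.date, group.hour, group.minute.
joined = join
    per_minute_code by (date, hour, minute),
    per_minute by (date, hour, minute);

-- Build the time bucket and the percentage from top-level fields only.
counted_pct = foreach joined generate
    CONCAT(per_minute_code::date,
           CONCAT(per_minute_code::hour, per_minute_code::minute)) as day_hour_min,
    per_minute_code::response_code as response_code,
    per_minute_code::code_count as response_code_count,
    per_minute::total as total_responses,
    (double)per_minute_code::code_count / (double)per_minute::total as response_code_pct;
```

If the join output looks right with dump or store, that would also suggest the remaining failure is specific to illustrate.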
Cheers,
/R