Posted to user@hive.apache.org by Time Less <ti...@gmail.com> on 2012/01/19 22:27:18 UTC

Odd Behaviour with get_json() (or perhaps with explode(array))

We are running this query:

select name, sum_id
from (
  select name, players,
         array(player1, player2, player3, player4, player5, player6, player7,
               player8) arr
  from (
    select name,
           get_json_object(roster_json, '$.memberList.playerId') players,
           get_json_object(roster_json, '$.memberList.playerId[0]') player1,
           get_json_object(roster_json, '$.memberList.playerId[1]') player2,
           get_json_object(roster_json, '$.memberList.playerId[2]') player3,
           get_json_object(roster_json, '$.memberList.playerId[3]') player4,
           get_json_object(roster_json, '$.memberList.playerId[4]') player5,
           get_json_object(roster_json, '$.memberList.playerId[5]') player6,
           get_json_object(roster_json, '$.memberList.playerId[6]') player7,
           get_json_object(roster_json, '$.memberList.playerId[7]') player8
    from team
    -- limit 1000000 -- some large number to make sure it runs
  ) t2
  where player8 is not null -- 8 members
) t1
lateral view explode(arr) member as sum_id
;

Notice the commented-out LIMIT in the innermost subquery. When we attempt
to run the query, we get this error:

Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201112131753_22484, Tracking URL = http://laxhadoop1-001:50030/jobdetails.jsp?jobid=job_201112131753_22484
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=laxhadoop1-001:54311 -kill job_201112131753_22484
2012-01-17 11:06:21,579 Stage-1 map = 0%,  reduce = 0%
2012-01-17 11:06:45,745 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201112131753_22484 with errors
java.lang.RuntimeException: Error while reading from task log url
        at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:130)
        at org.apache.hadoop.hive.ql.exec.ExecDriver.showJobFailDebugInfo(ExecDriver.java:889)
        at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:680)
        at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:123)
        at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:130)
        at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
        at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1063)
        at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:900)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:748)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:209)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:286)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:513)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
Caused by: java.io.IOException: Server returned HTTP response code: 400 for URL: http://laxhadoop1-004:50060/tasklog?taskid=attempt_201112131753_22484_m_000000_0&all=true
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1436)
        at java.net.URL.openStream(URL.java:1010)
        at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:120)
        ... 16 more
Ended Job = job_201112131753_22484 with exception 'java.lang.RuntimeException(Error while reading from task log url)'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask

If we uncomment the LIMIT clause, the query runs WITHOUT ERROR and gives
us the output we expect. Since the innermost subquery is only about 150k
rows, the LIMIT has no effect other than working around this bug. I don't
know the command to determine the version of Hive we're running, so
hopefully the installed RPM answers that:

$ rpm -qa | grep hive
hadoop-hive-0.7.1+42.4-2
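
In case it helps anyone reproduce this, a stripped-down query like the
following (hypothetical, untested against our data) should exercise only the
LATERAL VIEW/explode() path, with no get_json_object() involved; if it also
fails without a LIMIT, that would point at the lateral view rather than the
JSON UDF:

```
-- Hypothetical sanity check: explode a literal array against the same table.
select name, sum_id
from team
lateral view explode(array('a', 'b', 'c')) member as sum_id;
```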

Does anyone have a clue about what's going on here?

-- 
Tim Ellis
Riot Games

Re: Odd Behaviour with get_json() (or perhaps with explode(array))

Posted by Time Less <ti...@gmail.com>.
It might be worth noting that this query is already a work-around:
get_json_object() doesn't return an array when the key we specify points at
a JSON array, but instead returns a string. The BI user wants to do the
following:

select get_json_object(roster_json, '$.memberList.playerId') players

And he wants the result in the "players" column to be an actual array. The
output is instead a plain old string that looks like
"[player1,player2,...,player8]". That's why he's building an array in the
inner query that he subsequently explodes in the outer one.
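
One other work-around we could try (a sketch only, untested on our data, and
it assumes the playerId values contain no embedded commas or brackets) is to
strip the surrounding brackets from the string that get_json_object()
returns and split on commas, which yields a real array<string> without
enumerating eight indices:

```
-- Sketch: regexp_replace strips the leading '[' and trailing ']',
-- then split(..., ',') turns the remainder into array<string>.
select name, sum_id
from (
  select name,
         split(regexp_replace(
                 get_json_object(roster_json, '$.memberList.playerId'),
                 '^\\[|\\]$', ''),
               ',') arr
  from team
) t
lateral view explode(arr) member as sum_id;
```

This sidesteps both the eight hand-written index lookups and the
"player8 is not null" filter, though if the elements come back
double-quoted they would still need another regexp_replace to clean up.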

We think this is probably a bug, but we're really not sure. Does anyone
have any feedback? Does it look like a bug or an error on the part of our
BI team?


On Thu, Jan 19, 2012 at 1:27 PM, Time Less <ti...@gmail.com> wrote:

> We are running this query:
> [...]
> Does anyone have a clue about what's going on here?
>
> --
> Tim Ellis
> Riot Games


-- 
Tim