You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Time Less <ti...@gmail.com> on 2012/01/19 22:27:18 UTC
Odd Behaviour with get_json() (or perhaps with explode(array))
We are running this query:
select name, sum_id
from (
select name, players,
array(player1, player2, player3, player4, player5, player6, player7,
player8) arr
from (
select name,
get_json_object(roster_json, '$.memberList.playerId') players,
get_json_object(roster_json, '$.memberList.playerId\[0]') player1,
get_json_object(roster_json, '$.memberList.playerId\[1]') player2,
get_json_object(roster_json, '$.memberList.playerId\[2]') player3,
get_json_object(roster_json, '$.memberList.playerId\[3]') player4,
get_json_object(roster_json, '$.memberList.playerId\[4]') player5,
get_json_object(roster_json, '$.memberList.playerId\[5]') player6,
get_json_object(roster_json, '$.memberList.playerId\[6]') player7,
get_json_object(roster_json, '$.memberList.playerId\[7]') player8
from team
*-- limit 1000000 -- some large number to make sure it runs*
) t2
where player8 is not null -- 8 members
) t1
lateral view explode(arr) member as sum_id
;
Notice the commented-out LIMIT in the innermost subquery. When we attempt
to run the query, we get this error:
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201112131753_22484, Tracking URL =
http://laxhadoop1-001:50030/jobdetails.jsp?jobid=job_201112131753_22484
Kill Command = /usr/lib/hadoop/bin/hadoop job
-Dmapred.job.tracker=laxhadoop1-001:54311 -kill job_201112131753_22484
2012-01-17 11:06:21,579 Stage-1 map = 0%, reduce = 0%
2012-01-17 11:06:45,745 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201112131753_22484 with errors
java.lang.RuntimeException: Error while reading from task log url
at
org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:130)
at
org.apache.hadoop.hive.ql.exec.ExecDriver.showJobFailDebugInfo(ExecDriver.java:889)
at
org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:680)
at
org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:123)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:130)
at
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1063)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:900)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:748)
at
org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:209)
at
org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:286)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:513)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
Caused by: java.io.IOException: Server returned HTTP response code: 400 for
URL:
http://laxhadoop1-004:50060/tasklog?taskid=attempt_201112131753_22484_m_000000_0&all=true
at
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1436)
at java.net.URL.openStream(URL.java:1010)
at
org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:120)
... 16 more
Ended Job = job_201112131753_22484 with exception
'java.lang.RuntimeException(Error while reading from task log url)'
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.MapRedTask
If we uncomment the LIMIT clause, then the query *runs WITHOUT ERROR* and
gives us the output we expect. However, the innermost subquery is about
150k rows of data, so the LIMIT has no effect other than working around
this bug. I don't know the command to determine the version of Hive we're
running, so hopefully seeing the installed RPM will determine that:
$ rpm -qa | grep hive
hadoop-hive-0.7.1+42.4-2
Does anyone have a clue about what's going on here?
--
Tim Ellis
Riot Games
Re: Odd Behaviour with get_json() (or perhaps with explode(array))
Posted by Time Less <ti...@gmail.com>.
It might be worth noting that this is already a work-around that
get_json_object() doesn't return an array if the key we're specifying is an
array, but instead returns a string. The BI user wants to do the following:
select get_json_object(roster_json, '$.memberList.playerId') players
And he wants the result in the "players" column to be an actual array. The
output is instead a plain old string that looks like
"[player1,player2,...,player8]". That's why he's building an array on the
inner query that he subsequently explodes on the outer one.
We think this is probably a bug, but we're really not sure. Does anyone
have any feedback? Does it look like a bug or an error on the part of our
BI team?
On Thu, Jan 19, 2012 at 1:27 PM, Time Less <ti...@gmail.com> wrote:
> We are running this query:
>
> select name, sum_id
> from (
> select name, players,
> array(player1, player2, player3, player4, player5, player6, player7,
> player8) arr
> from (
> select name,
> get_json_object(roster_json, '$.memberList.playerId') players,
> get_json_object(roster_json, '$.memberList.playerId\[0]') player1,
> get_json_object(roster_json, '$.memberList.playerId\[1]') player2,
> get_json_object(roster_json, '$.memberList.playerId\[2]') player3,
> get_json_object(roster_json, '$.memberList.playerId\[3]') player4,
> get_json_object(roster_json, '$.memberList.playerId\[4]') player5,
> get_json_object(roster_json, '$.memberList.playerId\[5]') player6,
> get_json_object(roster_json, '$.memberList.playerId\[6]') player7,
> get_json_object(roster_json, '$.memberList.playerId\[7]') player8
> from team
> *-- limit 1000000 -- some large number to make sure it runs*
> ) t2
> where player8 is not null -- 8 members
> ) t1
> lateral view explode(arr) member as sum_id
> ;
>
> Notice the commented-out LIMIT in the innermost subquery. When we attempt
> to run the query, we get this error:
>
> Total MapReduce jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks is set to 0 since there's no reduce operator
> Starting Job = job_201112131753_22484, Tracking URL =
> http://laxhadoop1-001:50030/jobdetails.jsp?jobid=job_201112131753_22484
> Kill Command = /usr/lib/hadoop/bin/hadoop job
> -Dmapred.job.tracker=laxhadoop1-001:54311 -kill job_201112131753_22484
> 2012-01-17 11:06:21,579 Stage-1 map = 0%, reduce = 0%
> 2012-01-17 11:06:45,745 Stage-1 map = 100%, reduce = 100%
> Ended Job = job_201112131753_22484 with errors
> java.lang.RuntimeException: Error while reading from task log url
> at
> org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:130)
> at
> org.apache.hadoop.hive.ql.exec.ExecDriver.showJobFailDebugInfo(ExecDriver.java:889)
> at
> org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:680)
> at
> org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:123)
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:130)
> at
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
> at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1063)
> at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:900)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:748)
> at
> org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:209)
> at
> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:286)
> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:513)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> Caused by: java.io.IOException: Server returned HTTP response code: 400
> for URL:
> http://laxhadoop1-004:50060/tasklog?taskid=attempt_201112131753_22484_m_000000_0&all=true
> at
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1436)
> at java.net.URL.openStream(URL.java:1010)
> at
> org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:120)
> ... 16 more
> Ended Job = job_201112131753_22484 with exception
> 'java.lang.RuntimeException(Error while reading from task log url)'
> FAILED: Execution Error, return code 1 from
> org.apache.hadoop.hive.ql.exec.MapRedTask
>
> If we uncomment the LIMIT clause, then the query *runs WITHOUT ERROR* and
> gives us the output we expect. However, the innermost subquery is about
> 150k rows of data, so the LIMIT has no effect other than working around
> this bug. I don't know the command to determine the version of Hive we're
> running, so hopefully seeing the installed RPM will determine that:
>
> $ rpm -qa | grep hive
> hadoop-hive-0.7.1+42.4-2
>
> Does anyone have a clue about what's going on here?
>
> --
> Tim Ellis
> Riot Games
>
>
--
Tim