You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Stephen Sprague <sp...@gmail.com> on 2016/08/31 03:28:50 UTC
hive.root.logger influencing query plan?? so it's not so
Hi guys,
I've banged my head on this one all day and i need to surrender. I have a
query that hangs (never returns). However, when i turn on logging to DEBUG
level it works. I'm stumped. I include here the query, the different
query plans (with the only thing different being the log level) and a
traceback on the query that hangs.
* the query
#!/bin/bash
sql="
explain SELECT *
FROM raw_logs.fact_www_access_datekey_by_hour
WHERE date_key = 20160828 -- partition key
and lower(http_string) = 'foo' -- always evaluates to false
limit 1 -- if i remove limit 1 it works. I expect
an empty result set.
;
"
#hive -hiveconf hive.root.logger=ERROR,console -e "$sql" >stdout.bad
2>stderr.bad #hangs
hive -hiveconf hive.root.logger=DEBUG,console -e "$sql" >stdout.good
2>stderr.good #works
* explain plan for the one that hangs (ERROR,console)
STAGE DEPENDENCIES:
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-0
Fetch Operator
limit: 1
Processor Tree:
TableScan
alias: fact_www_access_datekey_by_hour
Statistics: Num rows: 144355924 Data size: 158755813572 Basic
stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (lower(http_string) = 'foo') (type: boolean)
Statistics: Num rows: 72177962 Data size: 79377906786 Basic
stats: COMPLETE Column stats: NONE
Select Operator
expressions: payload_year (type: int), payload_month (type:
int), payload_day (type: int), payload_time (type: string),
payload_gmt_offset (type: string), perf_time_micros (type: int),
http_string (type: string), http_type (type: string), http_version (type:
string), http_rc (type: int), http_size (type: int), referrer (type:
string), user_agent (type: string), bot_flag (type: string), ip_list (type:
array<string>), scalar_tags (type: map<string,string>), url_components
(type: array<string>), user_id (type: string), lt_session_id (type:
string), session_id (type: string), log_name (type: string), log_lineno
(type: int), log_line (type: string), web_server (type: string),
location_info (type: map<string,string>), dp_main_url_type (type: string),
dp_inventory_type (type: string), dp_main_url (type:
map<string,array<string>>), dp_tag_url (type: map<string,array<string>>),
dp_referrer_url (type: map<string,array<string>>), hour (type: int),
20160828 (type: int)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5,
_col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15,
_col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24,
_col25, _col26, _col27, _col28, _col29, _col30, _col31
Statistics: Num rows: 72177962 Data size: 79377906786 Basic
stats: COMPLETE Column stats: NONE
Limit
Number of rows: 1
Statistics: Num rows: 1 Data size: 1099 Basic stats:
COMPLETE Column stats: NONE
ListSink
* query plan for the one that works (DEBUG,console)
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: fact_www_access_datekey_by_hour
Statistics: Num rows: 144355924 Data size: 158755813572 Basic
stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (lower(http_string) = 'foo') (type: boolean)
Statistics: Num rows: 72177962 Data size: 79377906786 Basic
stats: COMPLETE Column stats: NONE
Select Operator
expressions: payload_year (type: int), payload_month (type:
int), payload_day (type: int), payload_time (type: string),
payload_gmt_offset (type: string), perf_time_micros (type: int),
http_string (type: string), http_type (type: string), http_version (type:
string), http_rc (type: int), http_size (type: int), referrer (type:
string), user_agent (type: string), bot_flag (type: string), ip_list (type:
array<string>), scalar_tags (type: map<string,string>), url_components
(type: array<string>), user_id (type: string), lt_session_id (type:
string), session_id (type: string), log_name (type: string), log_lineno
(type: int), log_line (type: string), web_server (type: string),
location_info (type: map<string,string>), dp_main_url_type (type: string),
dp_inventory_type (type: string), dp_main_url (type:
map<string,array<string>>), dp_tag_url (type: map<string,array<string>>),
dp_referrer_url (type: map<string,array<string>>), hour (type: int),
20160828 (type: int)
outputColumnNames: _col0, _col1, _col2, _col3, _col4,
_col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14,
_col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23,
_col24, _col25, _col26, _col27, _col28, _col29, _col30, _col31
Statistics: Num rows: 72177962 Data size: 79377906786 Basic
stats: COMPLETE Column stats: NONE
Limit
Number of rows: 1
Statistics: Num rows: 1 Data size: 1099 Basic stats:
COMPLETE Column stats: NONE
File Output Operator
compressed: true
Statistics: Num rows: 1 Data size: 1099 Basic stats:
COMPLETE Column stats: NONE
table:
input format:
org.apache.hadoop.mapred.SequenceFileInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: 1
Processor Tree:
ListSink
* traceback when i press ctl-c on the client
^CFailed with exception java.io.IOException:java.io.IOException: Filesystem
closed
16/08/30 20:16:09 [62cf87a8-2b39-47d2-b234-53631b60d84e main]: ERROR
CliDriver: - Failed with exception java.io.IOException:java.io.IOException:
Filesystem closed
java.io.IOException: java.io.IOException: Filesystem closed
at
org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:517)
at
org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:424)
at
org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:144)
at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1885)
at
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:252)
at
org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:183)
at
org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:399)
at
org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:335)
at
org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:742)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:714)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:794)
at
org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:833)
at
org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:897)
at
org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:700)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at
org.apache.hadoop.io.SequenceFile$Reader.readBlock(SequenceFile.java:2101)
at
org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2508)
at
org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:82)
at
org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:484)
... 16 more
on the one that hangs the job is never submitted to YARN. On the one that
succeeds the job is submitted. Ultimately though why does changing the log
level influence the query plan?
any insights greatly appreciated.
Thanks,
Stephen.
Re: hive.root.logger influencing query plan?? so it's not so
Posted by Gopal Vijayaraghavan <go...@apache.org>.
> I have a query that hangs (never returns). However, when i turn on
>logging to DEBUG level it works. I'm stumped.
Slowing down things does make a lot of stuff work - logging does something
more than slow things down, it actually introduces a synchronization point
(global lock) for each log.<level> call that makes it to the final output.
Running with the extra locked sections might be avoiding a race condition
- that worries me.
But the plans are different - very different.
> Stage: Stage-0
> Fetch Operator
> limit: 1
> Processor Tree:
> ListSink
Crucial bit - the debug logging produces a 2 stage query. No reason why
logging should change the plan there.
> any insights greatly appreciated.
set hive.fetch.task.conversion=none;
I'm guessing that the whole data partition is < 1Gb?
Cheers,
Gopal