Posted to dev@hive.apache.org by "Aditya Kishore (JIRA)" <ji...@apache.org> on 2013/02/01 01:33:13 UTC

[jira] [Commented] (HIVE-3935) New line character in output when sequence file is used for storage and table is empty

    [ https://issues.apache.org/jira/browse/HIVE-3935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13568289#comment-13568289 ] 

Aditya Kishore commented on HIVE-3935:
--------------------------------------

The difference in behavior between 0.7 and 0.9 comes from the different default InputFormat used.

Having said that, the root cause is that the query planner detects this scan query on a virtual column as a metadata-only scan. It converts the scan operation on the partition to a metadata-only scan and sets {{OneNullRowInputFormat}} as the partition's InputFormat, which *always* emits exactly one row, irrespective of whether a partition exists.

This one (null) row is sent to the mapper as input, which forwards it to the reducer, and it eventually ends up in the query result.
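
To make the mechanism concrete, here is a toy Python model of the flow described above. It is only a sketch under stated assumptions: {{one_null_row_input}}, {{regular_input}}, {{run_distinct}} and {{render}} are hypothetical names, not Hive classes or APIs, and the null row is assumed to render as an empty line, which matches the 0x0a reported in the issue.

```python
def one_null_row_input(partitions):
    """Toy stand-in for OneNullRowInputFormat: one null row, whatever the input."""
    # Assumption: the partition list is ignored entirely, so an empty table
    # still yields a row.
    return [None]


def regular_input(partitions):
    """Toy stand-in for a normal scan: one row per stored partition value."""
    return list(partitions)


def run_distinct(rows):
    """Mapper forwards every row; reducer keeps the first of each distinct value."""
    seen = []
    for row in rows:
        if row not in seen:
            seen.append(row)
    return seen


def render(rows):
    """Render the result as text; a null is assumed to print as an empty field."""
    return "".join(("" if row is None else str(row)) + "\n" for row in rows)


empty_table = []  # no partitions at all
with_optimizer = render(run_distinct(one_null_row_input(empty_table)))
without_optimizer = render(run_distinct(regular_input(empty_table)))
```

In this model {{with_optimizer}} is a single newline while {{without_optimizer}} is the empty string, which mirrors the extra line seen on the empty table.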

A temporary workaround for this problem is to disable the MetadataOnlyOptimizer in hive-site.xml:

{code}
<property>
  <name>hive.optimize.metadataonly</name>
  <value>false</value>
</property>
{code}

Although this carries a performance penalty in certain cases, it is better than reverting to HiveInputFormat.
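
If it helps to confirm that the workaround is actually present in a given configuration file, something like the following Python sketch could be used. The helper {{get_hive_property}} and the {{SAMPLE}} text are hypothetical, and the sketch assumes the usual layout of {{<property>}} elements under a {{<configuration>}} root.

```python
import xml.etree.ElementTree as ET


def get_hive_property(xml_text, name):
    """Return the value of a named <property> in hive-site.xml text, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


# Assumption: a minimal hive-site.xml whose root element is <configuration>.
SAMPLE = """<configuration>
  <property>
    <name>hive.optimize.metadataonly</name>
    <value>false</value>
  </property>
</configuration>"""

optimizer_disabled = get_hive_property(SAMPLE, "hive.optimize.metadataonly") == "false"
```

A return value of {{None}} means the file does not set the property at all, in which case Hive's built-in default applies.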
                
> New line character in output when sequence file is used for storage and table is empty
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-3935
>                 URL: https://issues.apache.org/jira/browse/HIVE-3935
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.9.0, 0.10.0
>         Environment: Centos 6.3
>            Reporter: Doodle gum
>
> When a "select distinct" command is issued on an empty table that uses sequence file storage, an extra newline (0x0a) is present in the result set even though the table has no data. This output is not consistent with the result of the same command on Hive 0.7.1 and can cause workflows to fail due to a wrong record count.
> Execution on Hive 0.9 and 0.10
> hive> create table hoge2(col1 string,col2 string) partitioned by (p_part string) stored as sequencefile;
> hive> describe hoge2;
> OK
> col1    string
> col2    string
> p_part  string
> Time taken: 0.24 seconds
> hive> select distinct p_part from hoge2;
> Total MapReduce jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks not specified. Estimated from input data size: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
>   set mapred.reduce.tasks=<number>
> Starting Job = job_201301230112_0001, Tracking URL = http://testcluster2-1:50030/jobdetails.jsp?jobid=job_201301230112_0001
> Kill Command = /opt/mapr/hadoop/hadoop-0.20.2/bin/../bin/hadoop job -Dmapred.job.tracker=maprfs:/// -kill job_201301230112_0001
> Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
> 2013-01-23 02:50:16,843 Stage-1 map = 0%,  reduce = 0%
> 2013-01-23 02:50:26,897 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.13 sec
> 2013-01-23 02:50:27,905 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.13 sec
> 2013-01-23 02:50:28,911 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.13 sec
> 2013-01-23 02:50:29,919 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.13 sec
> 2013-01-23 02:50:30,925 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.13 sec
> 2013-01-23 02:50:31,933 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.13 sec
> 2013-01-23 02:50:32,939 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.13 sec
> 2013-01-23 02:50:33,945 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 1.8 sec
> MapReduce Total cumulative CPU time: 1 seconds 800 msec
> Ended Job = job_201301230112_0001
> MapReduce Jobs Launched:
> Job 0: Map: 1  Reduce: 1   Cumulative CPU: 1.8 sec   MAPRFS Read: 327 MAPRFS Write: 71 SUCCESS
> Total MapReduce CPU Time Spent: 1 seconds 800 msec
> OK
> Time taken: 21.94 seconds
> Result on Hive 0.7.1
> hive> select count(distinct p_part) from hoge3;
> Total MapReduce jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
>   set mapred.reduce.tasks=<number>
> Starting Job = job_201210261659_0019, Tracking URL = http://testcluster1-1:50030/jobdetails.jsp?jobid=job_201210261659_0019
> Kill Command = /opt/mapr/hadoop/hadoop-0.20.2/bin/../bin/hadoop job -Dmapred.job.tracker=maprfs:/// -kill job_201210261659_0019
> 2013-01-23 21:42:01,787 Stage-1 map = 0%,  reduce = 0%
> 2013-01-23 21:42:07,815 Stage-1 map = 100%,  reduce = 0%
> 2013-01-23 21:42:12,835 Stage-1 map = 100%,  reduce = 100%
> Ended Job = job_201210261659_0019
> OK
> 0
> Time taken: 16.637 seconds
> The underlying Hadoop version for Hive 0.9 is Hadoop 1.0.3, and for Hive 0.7 it is 0.20.203.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira