Posted to dev@hive.apache.org by "Chao (JIRA)" <ji...@apache.org> on 2014/11/13 20:01:33 UTC
[jira] [Created] (HIVE-8859) ColumnStatsTask fails because of SparkMapJoinResolver

Chao created HIVE-8859:
--------------------------

             Summary: ColumnStatsTask fails because of SparkMapJoinResolver
                 Key: HIVE-8859
                 URL: https://issues.apache.org/jira/browse/HIVE-8859
             Project: Hive
          Issue Type: Sub-task
          Components: Spark
    Affects Versions: spark-branch
            Reporter: Chao
            Assignee: Chao


The following query fails:

{code}
ANALYZE TABLE src COMPUTE STATISTICS FOR COLUMNS key,value;
{code}

The plan looks like:

{noformat}
STAGE DEPENDENCIES:
  Stage-0 is a root stage
  Stage-2 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Spark
      Edges:
        Reducer 2 <- Map 1 (GROUP, 1)
      DagName: chao_20141113105959_486b4bba-a2da-43c5-bf42-0ee69cd42576:1
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: src
                  Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: key (type: string), value (type: string)
                    outputColumnNames: key, value
                    Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE Column stats: NONE
                    Group By Operator
                      aggregations: compute_stats(key, 16), compute_stats(value, 16)
                      mode: hash
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                      Reduce Output Operator
                        sort order: 
                        Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                        value expressions: _col0 (type: struct<columntype:string,maxlength:bigint,sumlength:bigint,count:bigint,countnulls:bigint,bitvector:string,numbitvectors:int>), _col1 (type: struct<columntype:string,maxlength:bigint,sumlength:bigint,count:bigint,countnulls:bigint,bitvector:string,numbitvectors:int>)
        Reducer 2 
            Reduce Operator Tree:
              Group By Operator
                aggregations: compute_stats(VALUE._col0), compute_stats(VALUE._col1)
                mode: mergepartial
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                Select Operator
                  expressions: _col0 (type: struct<columntype:string,maxlength:bigint,avglength:double,countnulls:bigint,numdistinctvalues:bigint>), _col1 (type: struct<columntype:string,maxlength:bigint,avglength:double,countnulls:bigint,numdistinctvalues:bigint>)
                  outputColumnNames: _col0, _col1
                  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-2
    Column Stats Work
      Column Stats Desc:
          Columns: key, value
          Column Types: string, string
          Table: src
{noformat}

This query fails because {{SparkMapJoinResolver#createSparkTask}} swaps the order of the two tasks in the root task list. That is surprising: if both are root tasks, their order should not matter.
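
To illustrate why list order could still matter for two "root" tasks, here is a minimal sketch. It rests on an assumption this report does not confirm: that root tasks are launched one after another in list order, and that the column stats stage (Stage-2) reads the output of the Spark stage (Stage-0) without declaring a dependency on it. {{SketchTask}}, {{SPARK_STAGE}}, {{COLUMN_STATS_STAGE}} and {{runInListOrder}} are hypothetical names invented for this example, not Hive classes or methods.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class RootTaskOrderSketch {

    /** Hypothetical stand-in for a task; this is NOT Hive's real Task class. */
    interface SketchTask {
        void execute(List<String> scratch);
    }

    /** Plays the role of Stage-0: writes the aggregated stats row to scratch. */
    static final SketchTask SPARK_STAGE =
            scratch -> scratch.add("compute_stats row");

    /** Plays the role of Stage-2: assumes the Stage-0 output already exists. */
    static final SketchTask COLUMN_STATS_STAGE = scratch -> {
        if (scratch.isEmpty()) {
            throw new IllegalStateException("no stats row to read");
        }
    };

    /** Runs the root tasks sequentially in list order; returns true if all succeed. */
    static boolean runInListOrder(List<SketchTask> rootTasks) {
        List<String> scratch = new ArrayList<>();
        try {
            for (SketchTask task : rootTasks) {
                task.execute(scratch);
            }
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        List<SketchTask> roots =
                new ArrayList<>(Arrays.asList(SPARK_STAGE, COLUMN_STATS_STAGE));
        System.out.println("original order ok: " + runInListOrder(roots));

        // Mimic the reported swap of the two entries in the root task list.
        Collections.swap(roots, 0, 1);
        System.out.println("swapped order ok: " + runInListOrder(roots));
    }
}
```

If the assumption holds, the swap itself would only expose the problem. The underlying issue would be the undeclared dependency between the two stages.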
