You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Jone Zhang <jo...@gmail.com> on 2015/12/03 09:09:35 UTC

Why there are two different stages on the same query when i use hive on spark.

Hive1.2.1 on Spark1.4.1

*The first query is:*
set mapred.reduce.tasks=100;
use u_wsd;
insert overwrite table t_sd_ucm_cominfo_incremental partition (ds=20151202)
select t1.uin,t1.clientip from
(select uin,clientip from t_sd_ucm_cominfo_FinalResult where ds=20151202)
t1
left outer join (select uin,clientip from t_sd_ucm_cominfo_FinalResult
where ds=20151201) t2
on t1.uin=t2.uin
where t2.clientip is NULL;

*The second query is:*
set mapred.reduce.tasks=100;
use u_wsd;
insert overwrite table t_sd_ucm_cominfo_incremental partition (ds=20151201)
select t1.uin,t1.clientip from
(select uin,clientip from t_sd_ucm_cominfo_FinalResult where ds=20151201)
t1
left outer join (select uin,clientip from t_sd_ucm_cominfo_FinalResult
where ds=20151130) t2
on t1.uin=t2.uin
where t2.clientip is NULL;

*The attachment show the two query's stages.*
*Here is the partition info*
104.3 M
 /user/hive/warehouse/u_wsd.db/t_sd_ucm_cominfo_finalresult/ds=20151202
110.0 M
 /user/hive/warehouse/u_wsd.db/t_sd_ucm_cominfo_finalresult/ds=20151201
112.6 M
 /user/hive/warehouse/u_wsd.db/t_sd_ucm_cominfo_finalresult/ds=20151130



*Why there are two different stages?*
*The stage1 in first query is very slowly.*

*Thanks.*
*Best wishes.*

Re: Why there are two different stages on the same query when i use hive on spark.

Posted by Xuefu Zhang <xz...@cloudera.com>.
The first stage for 1st query is to build a hash table for map join. It
took 7s to finish. Why do you think it's slow? Of course, it seemed you had
many small files, since there were 100 mappers, so each file would be very
small. This is not good for performance. Also consider using other data
formats other than text.

If a given stage is absurdly slow, check the task or executor statistics on
Spark. You could have a bad node in your cluster, or your data is skewed.
You can check if there is any specific task that takes much longer time
than the rest.

On Thu, Dec 3, 2015 at 6:51 PM, Jone Zhang <jo...@gmail.com> wrote:

> *Thanks for you warning.*
> *The first query is mapjoin and second query is reducejoin.The data format
> is all textInputFormat.*
> *I'll  go to learn more about  mapjoin of **hive on spark** anyway,But
> why** stage1 of  first query in attachment is so slowly?*
>
> *Explain first query:*
> hive (u_wsd)> explain insert overwrite table t_sd_ucm_cominfo_incremental
> partition (ds=20151202)
>             > select t1.uin,t1.clientip from
>             > (select uin,clientip from t_sd_ucm_cominfo_FinalResult where
> ds=20151202) t1
>             > left outer join (select uin,clientip from
> t_sd_ucm_cominfo_FinalResult where ds=20151201) t2
>             > on t1.uin=t2.uin
>             > where t2.clientip is NULL;
> OK
> STAGE DEPENDENCIES:
>   Stage-3 is a root stage
>   Stage-1 depends on stages: Stage-3
>   Stage-0 depends on stages: Stage-1
>   Stage-2 depends on stages: Stage-0
>
> STAGE PLANS:
>   Stage: Stage-3
>     Spark
>       DagName: mqq_20151204103210_fb913c60-c5ed-438f-8efb-950ee5639dd5:2
>       Vertices:
>         Map 2
>             Map Operator Tree:
>                 TableScan
>                   alias: t_sd_ucm_cominfo_finalresult
>                   Statistics: Num rows: 108009 Data size: 2873665 Basic
> stats: COMPLETE Column stats: NONE
>                   Select Operator
>                     expressions: uin (type: string), clientip (type:
> string)
>                     outputColumnNames: _col0, _col1
>                     Statistics: Num rows: 108009 Data size: 2873665 Basic
> stats: COMPLETE Column stats: NONE
>                     Spark HashTable Sink Operator
>                       keys:
>                         0 _col0 (type: string)
>                         1 _col0 (type: string)
>             Local Work:
>               Map Reduce Local Work
>
>   Stage: Stage-1
>     Spark
>       DagName: mqq_20151204103210_fb913c60-c5ed-438f-8efb-950ee5639dd5:1
>       Vertices:
>         Map 1
>             Map Operator Tree:
>                 TableScan
>                   alias: t_sd_ucm_cominfo_finalresult
>                   Statistics: Num rows: 103779 Data size: 2746785 Basic
> stats: COMPLETE Column stats: NONE
>                   Select Operator
>                     expressions: uin (type: string), clientip (type:
> string)
>                     outputColumnNames: _col0, _col1
>                     Statistics: Num rows: 103779 Data size: 2746785 Basic
> stats: COMPLETE Column stats: NONE
>                     Map Join Operator
>                       condition map:
>                            Left Outer Join0 to 1
>                       keys:
>                         0 _col0 (type: string)
>                         1 _col0 (type: string)
>                       outputColumnNames: _col0, _col1, _col3
>                       input vertices:
>                         1 Map 2
>                       Statistics: Num rows: 118809 Data size: 3161031
> Basic stats: COMPLETE Column stats: NONE
>                       Filter Operator
>                         predicate: _col3 is null (type: boolean)
>                         Statistics: Num rows: 59404 Data size: 1580502
> Basic stats: COMPLETE Column stats: NONE
>                         Select Operator
>                           expressions: _col0 (type: string), _col1 (type:
> string)
>                           outputColumnNames: _col0, _col1
>                           Statistics: Num rows: 59404 Data size: 1580502
> Basic stats: COMPLETE Column stats: NONE
>                           File Output Operator
>                             compressed: false
>                             Statistics: Num rows: 59404 Data size: 1580502
> Basic stats: COMPLETE Column stats: NONE
>                             table:
>                                 input format:
> org.apache.hadoop.mapred.TextInputFormat
>                                 output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                                 serde:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                                 name: u_wsd.t_sd_ucm_cominfo_incremental
>             Local Work:
>               Map Reduce Local Work
>
>   Stage: Stage-0
>     Move Operator
>       tables:
>           partition:
>             ds 20151202
>           replace: true
>           table:
>               input format: org.apache.hadoop.mapred.TextInputFormat
>               output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>               serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>               name: u_wsd.t_sd_ucm_cominfo_incremental
>
>   Stage: Stage-2
>     Stats-Aggr Operator
>
>
> *Explain second query:*
> hive (u_wsd)> explain insert overwrite table t_sd_ucm_cominfo_incremental
> partition (ds=20151201)
>             > select t1.uin,t1.clientip from
>             > (select uin,clientip from t_sd_ucm_cominfo_FinalResult where
> ds=20151201) t1
>             > left outer join (select uin,clientip from
> t_sd_ucm_cominfo_FinalResult where ds=20151130) t2
>             > on t1.uin=t2.uin
>             > where t2.clientip is NULL;
> OK
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
>   Stage-2 depends on stages: Stage-0
>
> STAGE PLANS:
>   Stage: Stage-1
>     Spark
>       Edges:
>         Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 100), Map 3
> (PARTITION-LEVEL SORT, 100)
>       DagName: mqq_20151204103243_3eab6e6c-941e-476a-897f-cae97657063e:3
>       Vertices:
>         Map 1
>             Map Operator Tree:
>                 TableScan
>                   alias: t_sd_ucm_cominfo_finalresult
>                   Statistics: Num rows: 108009 Data size: 2873665 Basic
> stats: COMPLETE Column stats: NONE
>                   Select Operator
>                     expressions: uin (type: string), clientip (type:
> string)
>                     outputColumnNames: _col0, _col1
>                     Statistics: Num rows: 108009 Data size: 2873665 Basic
> stats: COMPLETE Column stats: NONE
>                     Reduce Output Operator
>                       key expressions: _col0 (type: string)
>                       sort order: +
>                       Map-reduce partition columns: _col0 (type: string)
>                       Statistics: Num rows: 108009 Data size: 2873665
> Basic stats: COMPLETE Column stats: NONE
>                       value expressions: _col1 (type: string)
>         Map 3
>             Map Operator Tree:
>                 TableScan
>                   alias: t_sd_ucm_cominfo_finalresult
>                   Statistics: Num rows: 590130 Data size: 118026051 Basic
> stats: COMPLETE Column stats: NONE
>                   Select Operator
>                     expressions: uin (type: string), clientip (type:
> string)
>                     outputColumnNames: _col0, _col1
>                     Statistics: Num rows: 590130 Data size: 118026051
> Basic stats: COMPLETE Column stats: NONE
>                     Reduce Output Operator
>                       key expressions: _col0 (type: string)
>                       sort order: +
>                       Map-reduce partition columns: _col0 (type: string)
>                       Statistics: Num rows: 590130 Data size: 118026051
> Basic stats: COMPLETE Column stats: NONE
>                       value expressions: _col1 (type: string)
>         Reducer 2
>             Reduce Operator Tree:
>               Join Operator
>                 condition map:
>                      Left Outer Join0 to 1
>                 keys:
>                   0 _col0 (type: string)
>                   1 _col0 (type: string)
>                 outputColumnNames: _col0, _col1, _col3
>                 Statistics: Num rows: 649143 Data size: 129828658 Basic
> stats: COMPLETE Column stats: NONE
>                 Filter Operator
>                   predicate: _col3 is null (type: boolean)
>                   Statistics: Num rows: 324571 Data size: 64914228 Basic
> stats: COMPLETE Column stats: NONE
>                   Select Operator
>                     expressions: _col0 (type: string), _col1 (type: string)
>                     outputColumnNames: _col0, _col1
>                     Statistics: Num rows: 324571 Data size: 64914228 Basic
> stats: COMPLETE Column stats: NONE
>                     File Output Operator
>                       compressed: false
>                       Statistics: Num rows: 324571 Data size: 64914228
> Basic stats: COMPLETE Column stats: NONE
>                       table:
>                           input format:
> org.apache.hadoop.mapred.TextInputFormat
>                           output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                           serde:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                           name: u_wsd.t_sd_ucm_cominfo_incremental
>
>   Stage: Stage-0
>     Move Operator
>       tables:
>           partition:
>             ds 20151201
>           replace: true
>           table:
>               input format: org.apache.hadoop.mapred.TextInputFormat
>               output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>               serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>               name: u_wsd.t_sd_ucm_cominfo_incremental
>
>   Stage: Stage-2
>     Stats-Aggr Operator
>
> *Thanks.*
>
> 2015-12-03 22:17 GMT+08:00 Xuefu Zhang <xz...@cloudera.com>:
>
>> Can you also attach explain query result? What's your data format?
>>
>> --Xuefu
>>
>> On Thu, Dec 3, 2015 at 12:09 AM, Jone Zhang <jo...@gmail.com>
>> wrote:
>>
>>> Hive1.2.1 on Spark1.4.1
>>>
>>> *The first query is:*
>>> set mapred.reduce.tasks=100;
>>> use u_wsd;
>>> insert overwrite table t_sd_ucm_cominfo_incremental partition (ds=
>>> 20151202)
>>> select t1.uin,t1.clientip from
>>> (select uin,clientip from t_sd_ucm_cominfo_FinalResult where ds=20151202)
>>> t1
>>> left outer join (select uin,clientip from t_sd_ucm_cominfo_FinalResult
>>> where ds=20151201) t2
>>> on t1.uin=t2.uin
>>> where t2.clientip is NULL;
>>>
>>> *The second query is:*
>>> set mapred.reduce.tasks=100;
>>> use u_wsd;
>>> insert overwrite table t_sd_ucm_cominfo_incremental partition (ds=
>>> 20151201)
>>> select t1.uin,t1.clientip from
>>> (select uin,clientip from t_sd_ucm_cominfo_FinalResult where ds=20151201)
>>> t1
>>> left outer join (select uin,clientip from t_sd_ucm_cominfo_FinalResult
>>> where ds=20151130) t2
>>> on t1.uin=t2.uin
>>> where t2.clientip is NULL;
>>>
>>> *The attachment show the two query's stages.*
>>> *Here is the partition info*
>>> 104.3 M
>>>  /user/hive/warehouse/u_wsd.db/t_sd_ucm_cominfo_finalresult/ds=20151202
>>> 110.0 M
>>>  /user/hive/warehouse/u_wsd.db/t_sd_ucm_cominfo_finalresult/ds=20151201
>>> 112.6 M
>>>  /user/hive/warehouse/u_wsd.db/t_sd_ucm_cominfo_finalresult/ds=20151130
>>>
>>>
>>>
>>> *Why there are two different stages?*
>>> *The stage1 in first query is very slowly.*
>>>
>>> *Thanks.*
>>> *Best wishes.*
>>>
>>
>>
>

Re: Why there are two different stages on the same query when i use hive on spark.

Posted by Jone Zhang <jo...@gmail.com>.
*Thanks for you warning.*
*The first query is mapjoin and second query is reducejoin.The data format
is all textInputFormat.*
*I'll  go to learn more about  mapjoin of **hive on spark** anyway,But
why** stage1
of  first query in attachment is so slowly?*

*Explain first query:*
hive (u_wsd)> explain insert overwrite table t_sd_ucm_cominfo_incremental
partition (ds=20151202)
            > select t1.uin,t1.clientip from
            > (select uin,clientip from t_sd_ucm_cominfo_FinalResult where
ds=20151202) t1
            > left outer join (select uin,clientip from
t_sd_ucm_cominfo_FinalResult where ds=20151201) t2
            > on t1.uin=t2.uin
            > where t2.clientip is NULL;
OK
STAGE DEPENDENCIES:
  Stage-3 is a root stage
  Stage-1 depends on stages: Stage-3
  Stage-0 depends on stages: Stage-1
  Stage-2 depends on stages: Stage-0

STAGE PLANS:
  Stage: Stage-3
    Spark
      DagName: mqq_20151204103210_fb913c60-c5ed-438f-8efb-950ee5639dd5:2
      Vertices:
        Map 2
            Map Operator Tree:
                TableScan
                  alias: t_sd_ucm_cominfo_finalresult
                  Statistics: Num rows: 108009 Data size: 2873665 Basic
stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: uin (type: string), clientip (type: string)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 108009 Data size: 2873665 Basic
stats: COMPLETE Column stats: NONE
                    Spark HashTable Sink Operator
                      keys:
                        0 _col0 (type: string)
                        1 _col0 (type: string)
            Local Work:
              Map Reduce Local Work

  Stage: Stage-1
    Spark
      DagName: mqq_20151204103210_fb913c60-c5ed-438f-8efb-950ee5639dd5:1
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: t_sd_ucm_cominfo_finalresult
                  Statistics: Num rows: 103779 Data size: 2746785 Basic
stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: uin (type: string), clientip (type: string)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 103779 Data size: 2746785 Basic
stats: COMPLETE Column stats: NONE
                    Map Join Operator
                      condition map:
                           Left Outer Join0 to 1
                      keys:
                        0 _col0 (type: string)
                        1 _col0 (type: string)
                      outputColumnNames: _col0, _col1, _col3
                      input vertices:
                        1 Map 2
                      Statistics: Num rows: 118809 Data size: 3161031 Basic
stats: COMPLETE Column stats: NONE
                      Filter Operator
                        predicate: _col3 is null (type: boolean)
                        Statistics: Num rows: 59404 Data size: 1580502
Basic stats: COMPLETE Column stats: NONE
                        Select Operator
                          expressions: _col0 (type: string), _col1 (type:
string)
                          outputColumnNames: _col0, _col1
                          Statistics: Num rows: 59404 Data size: 1580502
Basic stats: COMPLETE Column stats: NONE
                          File Output Operator
                            compressed: false
                            Statistics: Num rows: 59404 Data size: 1580502
Basic stats: COMPLETE Column stats: NONE
                            table:
                                input format:
org.apache.hadoop.mapred.TextInputFormat
                                output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                                serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                                name: u_wsd.t_sd_ucm_cominfo_incremental
            Local Work:
              Map Reduce Local Work

  Stage: Stage-0
    Move Operator
      tables:
          partition:
            ds 20151202
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: u_wsd.t_sd_ucm_cominfo_incremental

  Stage: Stage-2
    Stats-Aggr Operator


*Explain second query:*
hive (u_wsd)> explain insert overwrite table t_sd_ucm_cominfo_incremental
partition (ds=20151201)
            > select t1.uin,t1.clientip from
            > (select uin,clientip from t_sd_ucm_cominfo_FinalResult where
ds=20151201) t1
            > left outer join (select uin,clientip from
t_sd_ucm_cominfo_FinalResult where ds=20151130) t2
            > on t1.uin=t2.uin
            > where t2.clientip is NULL;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
  Stage-2 depends on stages: Stage-0

STAGE PLANS:
  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 100), Map 3
(PARTITION-LEVEL SORT, 100)
      DagName: mqq_20151204103243_3eab6e6c-941e-476a-897f-cae97657063e:3
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: t_sd_ucm_cominfo_finalresult
                  Statistics: Num rows: 108009 Data size: 2873665 Basic
stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: uin (type: string), clientip (type: string)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 108009 Data size: 2873665 Basic
stats: COMPLETE Column stats: NONE
                    Reduce Output Operator
                      key expressions: _col0 (type: string)
                      sort order: +
                      Map-reduce partition columns: _col0 (type: string)
                      Statistics: Num rows: 108009 Data size: 2873665 Basic
stats: COMPLETE Column stats: NONE
                      value expressions: _col1 (type: string)
        Map 3
            Map Operator Tree:
                TableScan
                  alias: t_sd_ucm_cominfo_finalresult
                  Statistics: Num rows: 590130 Data size: 118026051 Basic
stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: uin (type: string), clientip (type: string)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 590130 Data size: 118026051 Basic
stats: COMPLETE Column stats: NONE
                    Reduce Output Operator
                      key expressions: _col0 (type: string)
                      sort order: +
                      Map-reduce partition columns: _col0 (type: string)
                      Statistics: Num rows: 590130 Data size: 118026051
Basic stats: COMPLETE Column stats: NONE
                      value expressions: _col1 (type: string)
        Reducer 2
            Reduce Operator Tree:
              Join Operator
                condition map:
                     Left Outer Join0 to 1
                keys:
                  0 _col0 (type: string)
                  1 _col0 (type: string)
                outputColumnNames: _col0, _col1, _col3
                Statistics: Num rows: 649143 Data size: 129828658 Basic
stats: COMPLETE Column stats: NONE
                Filter Operator
                  predicate: _col3 is null (type: boolean)
                  Statistics: Num rows: 324571 Data size: 64914228 Basic
stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: _col0 (type: string), _col1 (type: string)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 324571 Data size: 64914228 Basic
stats: COMPLETE Column stats: NONE
                    File Output Operator
                      compressed: false
                      Statistics: Num rows: 324571 Data size: 64914228
Basic stats: COMPLETE Column stats: NONE
                      table:
                          input format:
org.apache.hadoop.mapred.TextInputFormat
                          output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                          serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                          name: u_wsd.t_sd_ucm_cominfo_incremental

  Stage: Stage-0
    Move Operator
      tables:
          partition:
            ds 20151201
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: u_wsd.t_sd_ucm_cominfo_incremental

  Stage: Stage-2
    Stats-Aggr Operator

*Thanks.*

2015-12-03 22:17 GMT+08:00 Xuefu Zhang <xz...@cloudera.com>:

> Can you also attach explain query result? What's your data format?
>
> --Xuefu
>
> On Thu, Dec 3, 2015 at 12:09 AM, Jone Zhang <jo...@gmail.com>
> wrote:
>
>> Hive1.2.1 on Spark1.4.1
>>
>> *The first query is:*
>> set mapred.reduce.tasks=100;
>> use u_wsd;
>> insert overwrite table t_sd_ucm_cominfo_incremental partition (ds=
>> 20151202)
>> select t1.uin,t1.clientip from
>> (select uin,clientip from t_sd_ucm_cominfo_FinalResult where ds=20151202)
>> t1
>> left outer join (select uin,clientip from t_sd_ucm_cominfo_FinalResult
>> where ds=20151201) t2
>> on t1.uin=t2.uin
>> where t2.clientip is NULL;
>>
>> *The second query is:*
>> set mapred.reduce.tasks=100;
>> use u_wsd;
>> insert overwrite table t_sd_ucm_cominfo_incremental partition (ds=
>> 20151201)
>> select t1.uin,t1.clientip from
>> (select uin,clientip from t_sd_ucm_cominfo_FinalResult where ds=20151201)
>> t1
>> left outer join (select uin,clientip from t_sd_ucm_cominfo_FinalResult
>> where ds=20151130) t2
>> on t1.uin=t2.uin
>> where t2.clientip is NULL;
>>
>> *The attachment show the two query's stages.*
>> *Here is the partition info*
>> 104.3 M
>>  /user/hive/warehouse/u_wsd.db/t_sd_ucm_cominfo_finalresult/ds=20151202
>> 110.0 M
>>  /user/hive/warehouse/u_wsd.db/t_sd_ucm_cominfo_finalresult/ds=20151201
>> 112.6 M
>>  /user/hive/warehouse/u_wsd.db/t_sd_ucm_cominfo_finalresult/ds=20151130
>>
>>
>>
>> *Why there are two different stages?*
>> *The stage1 in first query is very slowly.*
>>
>> *Thanks.*
>> *Best wishes.*
>>
>
>

Re: Why there are two different stages on the same query when i use hive on spark.

Posted by Xuefu Zhang <xz...@cloudera.com>.
Can you also attach explain query result? What's your data format?

--Xuefu

On Thu, Dec 3, 2015 at 12:09 AM, Jone Zhang <jo...@gmail.com> wrote:

> Hive1.2.1 on Spark1.4.1
>
> *The first query is:*
> set mapred.reduce.tasks=100;
> use u_wsd;
> insert overwrite table t_sd_ucm_cominfo_incremental partition (ds=20151202
> )
> select t1.uin,t1.clientip from
> (select uin,clientip from t_sd_ucm_cominfo_FinalResult where ds=20151202)
> t1
> left outer join (select uin,clientip from t_sd_ucm_cominfo_FinalResult
> where ds=20151201) t2
> on t1.uin=t2.uin
> where t2.clientip is NULL;
>
> *The second query is:*
> set mapred.reduce.tasks=100;
> use u_wsd;
> insert overwrite table t_sd_ucm_cominfo_incremental partition (ds=20151201
> )
> select t1.uin,t1.clientip from
> (select uin,clientip from t_sd_ucm_cominfo_FinalResult where ds=20151201)
> t1
> left outer join (select uin,clientip from t_sd_ucm_cominfo_FinalResult
> where ds=20151130) t2
> on t1.uin=t2.uin
> where t2.clientip is NULL;
>
> *The attachment show the two query's stages.*
> *Here is the partition info*
> 104.3 M
>  /user/hive/warehouse/u_wsd.db/t_sd_ucm_cominfo_finalresult/ds=20151202
> 110.0 M
>  /user/hive/warehouse/u_wsd.db/t_sd_ucm_cominfo_finalresult/ds=20151201
> 112.6 M
>  /user/hive/warehouse/u_wsd.db/t_sd_ucm_cominfo_finalresult/ds=20151130
>
>
>
> *Why there are two different stages?*
> *The stage1 in first query is very slowly.*
>
> *Thanks.*
> *Best wishes.*
>