Posted to dev@hive.apache.org by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org> on 2010/08/13 09:45:16 UTC

[jira] Created: (HIVE-1538) FilterOperator is applied twice with ppd on.

FilterOperator is applied twice with ppd on.
--------------------------------------------

                 Key: HIVE-1538
                 URL: https://issues.apache.org/jira/browse/HIVE-1538
             Project: Hadoop Hive
          Issue Type: Bug
          Components: Query Processor
            Reporter: Amareshwari Sriramadasu


With hive.optimize.ppd set to true, the FilterOperator is applied twice, and the second operator always seems to filter zero rows.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-1538) FilterOperator is applied twice with ppd on.

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898120#action_12898120 ] 

Amareshwari Sriramadasu commented on HIVE-1538:
-----------------------------------------------

I see that if a query has a WHERE clause, the FilterOperator is applied twice.

EXPLAIN on a query with a WHERE clause:
hive> explain select * from input1 where input1.key != 10;
{noformat}
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF input1)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (!= (. (TOK_TABLE_OR_COL input1) key) 10))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        input1
          TableScan
            alias: input1
            Filter Operator
              predicate:
                  expr: (key <> 10)
                  type: boolean
              Filter Operator
                predicate:
                    expr: (key <> 10)
                    type: boolean
                Select Operator
                  expressions:
                        expr: key
                        type: int
                        expr: value
                        type: int
                  outputColumnNames: _col0, _col1
                  File Output Operator
                    compressed: false
                    GlobalTableId: 0
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
Time taken: 0.099 seconds
{noformat}

The mapper logs show the same thing: the first FilterOperator does the
filtering, and the second operator always filters zero rows.

{noformat}
....
2010-08-13 13:20:21,451 INFO ExecMapper: 
<MAP>Id =5
  <Children>
    <TS>Id =0
      <Children>
        <FIL>Id =1
          <Children>
            <FIL>Id =2
              <Children>
                <SEL>Id =3
                  <Children>
                    <FS>Id =4
                      <Parent>Id = 3 null<\Parent>
                    <\FS>
                  <\Children>
                  <Parent>Id = 2 null<\Parent>
                <\SEL>
              <\Children>
              <Parent>Id = 1 null<\Parent>
            <\FIL>
          <\Children>
          <Parent>Id = 0 null<\Parent>
        <\FIL>
      <\Children>
      <Parent>Id = 5 null<\Parent>
    <\TS>
  <\Children>
<\MAP>
...
2010-08-13 13:20:21,489 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 5 forwarding 1 rows
2010-08-13 13:20:21,489 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 0 forwarding 1 rows
2010-08-13 13:20:21,600 INFO ExecMapper: ExecMapper: processing 1 rows: used memory = 10765360
2010-08-13 13:20:21,600 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 5 finished. closing... 
2010-08-13 13:20:21,600 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 5 forwarded 1 rows
2010-08-13 13:20:21,600 INFO org.apache.hadoop.hive.ql.exec.MapOperator: DESERIALIZE_ERRORS:0
2010-08-13 13:20:21,600 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 0 finished. closing... 
2010-08-13 13:20:21,600 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 0 forwarded 1 rows
2010-08-13 13:20:21,600 INFO org.apache.hadoop.hive.ql.exec.FilterOperator: 1 finished. closing... 
2010-08-13 13:20:21,600 INFO org.apache.hadoop.hive.ql.exec.FilterOperator: 1 forwarded 0 rows
2010-08-13 13:20:21,600 INFO org.apache.hadoop.hive.ql.exec.FilterOperator: PASSED:0
2010-08-13 13:20:21,600 INFO org.apache.hadoop.hive.ql.exec.FilterOperator: FILTERED:1
2010-08-13 13:20:21,600 INFO org.apache.hadoop.hive.ql.exec.FilterOperator: 2 finished. closing... 
2010-08-13 13:20:21,600 INFO org.apache.hadoop.hive.ql.exec.FilterOperator: 2 forwarded 0 rows
2010-08-13 13:20:21,600 INFO org.apache.hadoop.hive.ql.exec.FilterOperator: PASSED:0
2010-08-13 13:20:21,600 INFO org.apache.hadoop.hive.ql.exec.FilterOperator: FILTERED:0
2010-08-13 13:20:21,600 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 3 finished. closing... 
2010-08-13 13:20:21,600 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 3 forwarded 0 rows
2010-08-13 13:20:21,600 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: 4 finished. closing... 
2010-08-13 13:20:21,600 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: 4 forwarded 0 rows
2010-08-13 13:20:21,600 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: Final Path: FS hdfs://localhost:19000/tmp/hive-amarsri/hive_2010-08-13_13-20-11_483_2065579562420016208/_tmp.-ext-10001/000000_0
2010-08-13 13:20:21,601 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: Writing to temp file: FS hdfs://localhost:19000/tmp/hive-amarsri/hive_2010-08-13_13-20-11_483_2065579562420016208/_tmp.-ext-10001/_tmp.000000_0
2010-08-13 13:20:21,604 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: New Final Path: FS hdfs://localhost:19000/tmp/hive-amarsri/hive_2010-08-13_13-20-11_483_2065579562420016208/_tmp.-ext-10001/000000_0
2010-08-13 13:20:21,629 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 3 Close done
2010-08-13 13:20:21,629 INFO org.apache.hadoop.hive.ql.exec.FilterOperator: 2 Close done
2010-08-13 13:20:21,629 INFO org.apache.hadoop.hive.ql.exec.FilterOperator: 1 Close done
2010-08-13 13:20:21,629 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 0 Close done
2010-08-13 13:20:21,629 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 5 Close done
2010-08-13 13:20:21,629 INFO ExecMapper: ExecMapper: processed 1 rows: used memory = 11454224
...
{noformat}
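
The counters in the log above (FILTERED:1 for operator 1, FILTERED:0 for operator 2) are what one would expect from two identical filters in series. The following is a toy sketch in Python, not Hive code; the class and function names are made up for illustration. It chains two counting filters with the same predicate to show that the second one can never reject a row.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class CountingFilter:
    """Hypothetical stand-in for a filter operator with PASSED/FILTERED counters."""
    predicate: Callable[[dict], bool]
    passed: int = 0
    filtered: int = 0

    def apply(self, rows: Iterable[dict]) -> Iterator[dict]:
        for row in rows:
            if self.predicate(row):
                self.passed += 1
                yield row
            else:
                self.filtered += 1


def key_not_10(row: dict) -> bool:
    # Same predicate as in the EXPLAIN output: key <> 10
    return row["key"] != 10


# Made-up sample rows, only for the illustration.
rows = [{"key": 10, "value": 1}, {"key": 11, "value": 2}, {"key": 10, "value": 3}]

first = CountingFilter(key_not_10)
second = CountingFilter(key_not_10)

# Rows flow through the first filter, then through the identical second filter.
output = list(second.apply(first.apply(rows)))

print("first: ", first.passed, "passed,", first.filtered, "filtered")
print("second:", second.passed, "passed,", second.filtered, "filtered")
```

Every row that reaches the second filter has already satisfied the predicate, so the second filter only adds evaluation cost.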


[jira] Commented: (HIVE-1538) FilterOperator is applied twice with ppd on.

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898122#action_12898122 ] 

Amareshwari Sriramadasu commented on HIVE-1538:
-----------------------------------------------

With hive.optimize.ppd set to false, I see that the FilterOperator is applied only once.
{noformat}
hive> SET hive.optimize.ppd=false;
hive> explain select * from input1 where input1.key != 10;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF input1)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (!= (. (TOK_TABLE_OR_COL input1) key) 10))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        input1
          TableScan
            alias: input1
            Filter Operator
              predicate:
                  expr: (key <> 10)
                  type: boolean
              Select Operator
                expressions:
                      expr: key
                      type: int
                      expr: value
                      type: int
                outputColumnNames: _col0, _col1
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

Time taken: 0.022 seconds
{noformat}


[jira] Commented: (HIVE-1538) FilterOperator is applied twice with ppd on.

Posted by "John Sichi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930247#action_12930247 ] 

John Sichi commented on HIVE-1538:
----------------------------------

You might want to look into HIVE-1342 while you are working on this one.


[jira] Commented: (HIVE-1538) FilterOperator is applied twice with ppd on.

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898144#action_12898144 ] 

Amareshwari Sriramadasu commented on HIVE-1538:
-----------------------------------------------

Also, I observed that the Select Operator is applied twice for a MapJoin query. Is that related to this issue?


[jira] Commented: (HIVE-1538) FilterOperator is applied twice with ppd on.

Posted by "John Sichi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913354#action_12913354 ] 

John Sichi commented on HIVE-1538:
----------------------------------

It would be cool to get this fixed; without it the predicate decomposition I added for HIVE-1226 is pointless.


[jira] Commented: (HIVE-1538) FilterOperator is applied twice with ppd on.

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929976#action_12929976 ] 

Amareshwari Sriramadasu commented on HIVE-1538:
-----------------------------------------------

There are a couple of issues with removing the original filter operator.
# Not all the expressions in the filter predicate may be pushed.
** I'm planning to create a filter operator with the non-final candidates as a child of the original filter op and mark the original filter op for deletion.
# The candidate predicates may not be pushed past some operators. For example, the Outer Join operator does not allow candidates for all aliases; LIMIT/SCRIPT/UDTF operators do not push any predicates.
** I'm planning to create a filter operator with the unpushed predicates, as a child of the operator through which the predicates could not be pushed.

Finally, remove the original filter operators that are marked for deletion.

Thoughts? Any suggestions?
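
The first point of the plan above, splitting a filter into pushed and unpushed parts, can be illustrated with a small Python sketch. This is not the Hive implementation: predicates are plain strings, the function names are hypothetical, and which conjuncts count as pushable is simply supplied by the caller as an assumption.

```python
from typing import List, Optional, Set, Tuple


def split_conjuncts(conjuncts: List[str], pushable: Set[str]) -> Tuple[List[str], List[str]]:
    """Partition the conjuncts of a filter into pushed and residual lists."""
    pushed = [c for c in conjuncts if c in pushable]
    residual = [c for c in conjuncts if c not in pushable]
    return pushed, residual


def rewrite_filter(conjuncts: List[str], pushable: Set[str]) -> Tuple[Optional[str], Optional[str]]:
    """Return (pushed predicate, residual predicate).

    A value of None means no filter is needed at that position, i.e. the
    original filter can be dropped entirely when the residual is None.
    """
    pushed, residual = split_conjuncts(conjuncts, pushable)

    def join(parts: List[str]) -> Optional[str]:
        return " AND ".join(parts) if parts else None

    return join(pushed), join(residual)


# Hypothetical example: only the first conjunct is assumed to be pushable.
pushed, residual = rewrite_filter(
    ["a.key <> 10", "rand() < 0.5"],
    pushable={"a.key <> 10"},
)
print("pushed filter:  ", pushed)
print("residual filter:", residual)
```

When every conjunct is pushable the residual is None, which corresponds to deleting the original filter operator without creating a replacement.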


[jira] Commented: (HIVE-1538) FilterOperator is applied twice with ppd on.

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906666#action_12906666 ] 

Namit Jain commented on HIVE-1538:
----------------------------------

That is right - this has nothing to do with map join.
Whenever a predicate is pushed down, it is also retained, which results in two identical filters.

Is this creating a performance problem? It can definitely be optimized.
I totally agree with your proposed solution.


[jira] Commented: (HIVE-1538) FilterOperator is applied twice with ppd on.

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921603#action_12921603 ] 

Namit Jain commented on HIVE-1538:
----------------------------------

Amareshwari, are you planning to work on this?

We are trying to improve performance, and some profiling showed that fixing this can lead to a lot of improvement.


[jira] Commented: (HIVE-1538) FilterOperator is applied twice with ppd on.

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925777#action_12925777 ] 

Amareshwari Sriramadasu commented on HIVE-1538:
-----------------------------------------------

bq. I think the solution is to collect the operators that contribute the predicates for the "final candidates of predicate pushdown" and remove them from the final operator graph.
This does not work as I thought earlier, because not all the predicates in the FilterOperator may be pushed. We might have to reconstruct the FilterOperator with the un-pushed predicates.


[jira] Commented: (HIVE-1538) FilterOperator is applied twice with ppd on.

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906428#action_12906428 ] 

Amareshwari Sriramadasu commented on HIVE-1538:
-----------------------------------------------

With predicate pushdown on, the final candidates for predicate pushdown are collected for the top operator.
A FilterOperator is then created, with the final candidates, as a child of the TableScanOperator (topOp). But the operators (FilterOperators) whose predicates are pushed down are not removed.
I think the solution is to collect the operators that contribute the predicates for the "final candidates of predicate pushdown" and remove them from the final operator graph.
Thoughts?
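
The behaviour described above can be mimicked with a toy operator tree in Python. This is only a sketch under simplifying assumptions, not Hive's optimizer: the Op class and all function names are hypothetical, predicates are strings, and every predicate is assumed to be fully pushable to the table scan.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Op:
    """Hypothetical operator node: kind is "TS", "FIL", "SEL" or "FS"."""
    kind: str
    predicate: Optional[str] = None
    children: List["Op"] = field(default_factory=list)


def chain(*ops: Op) -> Op:
    """Link operators into a single parent-to-child chain and return the top."""
    for parent, child in zip(ops, ops[1:]):
        parent.children = [child]
    return ops[0]


def collect_predicates(op: Op) -> List[str]:
    """Gather the predicates of all filter operators below op."""
    found = []
    for child in op.children:
        if child.kind == "FIL":
            found.append(child.predicate)
        found.extend(collect_predicates(child))
    return found


def drop_filters(op: Op) -> None:
    """Splice every filter operator below op out of the tree."""
    kept = []
    for child in op.children:
        drop_filters(child)
        if child.kind == "FIL":
            kept.extend(child.children)
        else:
            kept.append(child)
    op.children = kept


def push_down(top: Op, remove_originals: bool) -> None:
    """Create a new filter directly under top from the collected predicates."""
    predicates = collect_predicates(top)
    if not predicates:
        return
    if remove_originals:
        drop_filters(top)
    top.children = [Op("FIL", " and ".join(predicates), top.children)]


def kinds(op: Op) -> List[str]:
    """Operator kinds in depth-first order, for easy comparison."""
    result = [op.kind]
    for child in op.children:
        result.extend(kinds(child))
    return result


def plan() -> Op:
    return chain(Op("TS"), Op("FIL", "(key <> 10)"), Op("SEL"), Op("FS"))


current = plan()
push_down(current, remove_originals=False)  # behaviour reported in this issue

fixed = plan()
push_down(fixed, remove_originals=True)     # behaviour of the proposed fix

print(kinds(current))
print(kinds(fixed))
```

With remove_originals=False the tree ends up as TS, FIL, FIL, SEL, FS, matching the EXPLAIN output reported in this issue; with remove_originals=True only the pushed-down filter remains.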


[jira] Commented: (HIVE-1538) FilterOperator is applied twice with ppd on.

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921964#action_12921964 ] 

Amareshwari Sriramadasu commented on HIVE-1538:
-----------------------------------------------

Namit, I can take this up once I'm done with HIVE-474, i.e. most likely after a week.


[jira] Commented: (HIVE-1538) FilterOperator is applied twice with ppd on.

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921966#action_12921966 ] 

Namit Jain commented on HIVE-1538:
----------------------------------

Thanks, that will be great. It can lead to a substantial improvement (10-15%) on the map side
for a large range of queries.


[jira] Assigned: (HIVE-1538) FilterOperator is applied twice with ppd on.

Posted by "John Sichi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Sichi reassigned HIVE-1538:
--------------------------------

    Assignee: Amareshwari Sriramadasu
