You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@hive.apache.org by "Vladimir Klimontovich (JIRA)" <ji...@apache.org> on 2010/02/15 23:32:27 UTC

[jira] Created: (HIVE-1173) Partition pruner cancels pruning if non-deterministic function present in filtering expression only in joins is present in query

Partition pruner cancels pruning if non-deterministic function present in filtering expression only in joins is present in query
--------------------------------------------------------------------------------------------------------------------------------

                 Key: HIVE-1173
                 URL: https://issues.apache.org/jira/browse/HIVE-1173
             Project: Hadoop Hive
          Issue Type: Bug
    Affects Versions: 0.4.1, 0.4.0
            Reporter: Vladimir Klimontovich


Brief description:

case 1) non-deterministic present in partition condition, joins are present in query => partition pruner doesn't do filtering of partitions based on condition
case 2) non-deterministic present in partition condition, joins aren't present in query => partition pruner do filtering of partitions based on condition
It's quite illogical when pruning depends on presence of joins in query.

Example:

Let's consider following sequence of hive queries:

1) Create non-deterministic function:

create temporary function UDF2 as 'UDF2';
{{
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;

@UDFType(deterministic=false)
	public class UDF2 extends UDF {
		public String evaluate(String val) {
			return val;
		}
	}
}}

2) Create tables

CREATE TABLE Main (
	a STRING,
	b INT
)

PARTITIONED BY(part STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '10'
STORED AS TEXTFILE;

ALTER TABLE Main ADD PARTITION (part="part1") LOCATION "/hive-join-test/part1/";
ALTER TABLE Main ADD PARTITION (part="part2") LOCATION "/hive-join-test/part2/";

CREATE TABLE Joined (
	a STRING,
	f STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '10'
STORED AS TEXTFILE
LOCATION '/hive-join-test/join/';

3) Run first query:

select 
	m.a,
	m.b
from Main m
where
	part > UDF2('part0') AND part = 'part1';

The pruner will work for this query: mapred.input.dir=hdfs://localhost:9000/hive-join-test/part1

4) Run second query (with join):

select 
	m.a,
	j.a,
	m.b
from Main m
join Joined j on
	j.a=m.a
where
	part > UDF2('part0') AND part = 'part1';

Pruner doesn't work: mapred.input.dir=hdfs://localhost:9000/hive-join-test/part1,hdfs://localhost:9000/hive-join-test/part2,hdfs://localhost:9000/hive-join-test/join

5) Also lets try to run query with MAPJOIN hint

select /*+MAPJOIN(j)*/ 
	m.a,
	j.a,
	m.b
from Main m
join Joined j on
	j.a=m.a
where
	part > UDF2('part0') AND part = 'part1';

The result is the same, pruner doesn't work: mapred.input.dir=hdfs://localhost:9000/hive-join-test/part1,hdfs://localhost:9000/hive-join-test/part2





-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-1173) Partition pruner cancels pruning if non-deterministic function present in filtering expression only in joins is present in query

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834438#action_12834438 ] 

Zheng Shao commented on HIVE-1173:
----------------------------------

Can you try condition: "part = 'part1' AND part > UDF2('part0')"

The optimizer might do something different because of the short-circuit calculation of "AND".


> Partition pruner cancels pruning if non-deterministic function present in filtering expression only in joins is present in query
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-1173
>                 URL: https://issues.apache.org/jira/browse/HIVE-1173
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.4.0, 0.4.1
>            Reporter: Vladimir Klimontovich
>
> Brief description:
> case 1) non-deterministic present in partition condition, joins are present in query => partition pruner doesn't do filtering of partitions based on condition
> case 2) non-deterministic present in partition condition, joins aren't present in query => partition pruner do filtering of partitions based on condition
> It's quite illogical when pruning depends on presence of joins in query.
> Example:
> Let's consider following sequence of hive queries:
> 1) Create non-deterministic function:
> create temporary function UDF2 as 'UDF2';
> {{
> import org.apache.hadoop.hive.ql.exec.UDF;
> import org.apache.hadoop.hive.ql.udf.UDFType;
> @UDFType(deterministic=false)
> 	public class UDF2 extends UDF {
> 		public String evaluate(String val) {
> 			return val;
> 		}
> 	}
> }}
> 2) Create tables
> CREATE TABLE Main (
> 	a STRING,
> 	b INT
> )
> PARTITIONED BY(part STRING)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> LINES TERMINATED BY '10'
> STORED AS TEXTFILE;
> ALTER TABLE Main ADD PARTITION (part="part1") LOCATION "/hive-join-test/part1/";
> ALTER TABLE Main ADD PARTITION (part="part2") LOCATION "/hive-join-test/part2/";
> CREATE TABLE Joined (
> 	a STRING,
> 	f STRING
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> LINES TERMINATED BY '10'
> STORED AS TEXTFILE
> LOCATION '/hive-join-test/join/';
> 3) Run first query:
> select 
> 	m.a,
> 	m.b
> from Main m
> where
> 	part > UDF2('part0') AND part = 'part1';
> The pruner will work for this query: mapred.input.dir=hdfs://localhost:9000/hive-join-test/part1
> 4) Run second query (with join):
> select 
> 	m.a,
> 	j.a,
> 	m.b
> from Main m
> join Joined j on
> 	j.a=m.a
> where
> 	part > UDF2('part0') AND part = 'part1';
> Pruner doesn't work: mapred.input.dir=hdfs://localhost:9000/hive-join-test/part1,hdfs://localhost:9000/hive-join-test/part2,hdfs://localhost:9000/hive-join-test/join
> 5) Also lets try to run query with MAPJOIN hint
> select /*+MAPJOIN(j)*/ 
> 	m.a,
> 	j.a,
> 	m.b
> from Main m
> join Joined j on
> 	j.a=m.a
> where
> 	part > UDF2('part0') AND part = 'part1';
> The result is the same, pruner doesn't work: mapred.input.dir=hdfs://localhost:9000/hive-join-test/part1,hdfs://localhost:9000/hive-join-test/part2

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-1173) Partition pruner cancels pruning if non-deterministic function present in filtering expression only in joins is present in query

Posted by "Vladimir Klimontovich (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834710#action_12834710 ] 

Vladimir Klimontovich commented on HIVE-1173:
---------------------------------------------

I just tried "part = 'part1' AND part > UDF2('part0')" condition. Query plan remained the same.

> Partition pruner cancels pruning if non-deterministic function present in filtering expression only in joins is present in query
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-1173
>                 URL: https://issues.apache.org/jira/browse/HIVE-1173
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.4.0, 0.4.1
>            Reporter: Vladimir Klimontovich
>
> Brief description:
> case 1) non-deterministic present in partition condition, joins are present in query => partition pruner doesn't do filtering of partitions based on condition
> case 2) non-deterministic present in partition condition, joins aren't present in query => partition pruner do filtering of partitions based on condition
> It's quite illogical when pruning depends on presence of joins in query.
> Example:
> Let's consider following sequence of hive queries:
> 1) Create non-deterministic function:
> create temporary function UDF2 as 'UDF2';
> {{
> import org.apache.hadoop.hive.ql.exec.UDF;
> import org.apache.hadoop.hive.ql.udf.UDFType;
> @UDFType(deterministic=false)
> 	public class UDF2 extends UDF {
> 		public String evaluate(String val) {
> 			return val;
> 		}
> 	}
> }}
> 2) Create tables
> CREATE TABLE Main (
> 	a STRING,
> 	b INT
> )
> PARTITIONED BY(part STRING)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> LINES TERMINATED BY '10'
> STORED AS TEXTFILE;
> ALTER TABLE Main ADD PARTITION (part="part1") LOCATION "/hive-join-test/part1/";
> ALTER TABLE Main ADD PARTITION (part="part2") LOCATION "/hive-join-test/part2/";
> CREATE TABLE Joined (
> 	a STRING,
> 	f STRING
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> LINES TERMINATED BY '10'
> STORED AS TEXTFILE
> LOCATION '/hive-join-test/join/';
> 3) Run first query:
> select 
> 	m.a,
> 	m.b
> from Main m
> where
> 	part > UDF2('part0') AND part = 'part1';
> The pruner will work for this query: mapred.input.dir=hdfs://localhost:9000/hive-join-test/part1
> 4) Run second query (with join):
> select 
> 	m.a,
> 	j.a,
> 	m.b
> from Main m
> join Joined j on
> 	j.a=m.a
> where
> 	part > UDF2('part0') AND part = 'part1';
> Pruner doesn't work: mapred.input.dir=hdfs://localhost:9000/hive-join-test/part1,hdfs://localhost:9000/hive-join-test/part2,hdfs://localhost:9000/hive-join-test/join
> 5) Also lets try to run query with MAPJOIN hint
> select /*+MAPJOIN(j)*/ 
> 	m.a,
> 	j.a,
> 	m.b
> from Main m
> join Joined j on
> 	j.a=m.a
> where
> 	part > UDF2('part0') AND part = 'part1';
> The result is the same, pruner doesn't work: mapred.input.dir=hdfs://localhost:9000/hive-join-test/part1,hdfs://localhost:9000/hive-join-test/part2

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-1173) Partition pruner cancels pruning if non-deterministic function present in filtering expression only in joins is present in query

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834788#action_12834788 ] 

Namit Jain commented on HIVE-1173:
----------------------------------

I think we should fix this for 0.5 also

> Partition pruner cancels pruning if non-deterministic function present in filtering expression only in joins is present in query
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-1173
>                 URL: https://issues.apache.org/jira/browse/HIVE-1173
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.4.0, 0.4.1
>            Reporter: Vladimir Klimontovich
>
> Brief description:
> case 1) non-deterministic present in partition condition, joins are present in query => partition pruner doesn't do filtering of partitions based on condition
> case 2) non-deterministic present in partition condition, joins aren't present in query => partition pruner do filtering of partitions based on condition
> It's quite illogical when pruning depends on presence of joins in query.
> Example:
> Let's consider following sequence of hive queries:
> 1) Create non-deterministic function:
> create temporary function UDF2 as 'UDF2';
> {{
> import org.apache.hadoop.hive.ql.exec.UDF;
> import org.apache.hadoop.hive.ql.udf.UDFType;
> @UDFType(deterministic=false)
> 	public class UDF2 extends UDF {
> 		public String evaluate(String val) {
> 			return val;
> 		}
> 	}
> }}
> 2) Create tables
> CREATE TABLE Main (
> 	a STRING,
> 	b INT
> )
> PARTITIONED BY(part STRING)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> LINES TERMINATED BY '10'
> STORED AS TEXTFILE;
> ALTER TABLE Main ADD PARTITION (part="part1") LOCATION "/hive-join-test/part1/";
> ALTER TABLE Main ADD PARTITION (part="part2") LOCATION "/hive-join-test/part2/";
> CREATE TABLE Joined (
> 	a STRING,
> 	f STRING
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> LINES TERMINATED BY '10'
> STORED AS TEXTFILE
> LOCATION '/hive-join-test/join/';
> 3) Run first query:
> select 
> 	m.a,
> 	m.b
> from Main m
> where
> 	part > UDF2('part0') AND part = 'part1';
> The pruner will work for this query: mapred.input.dir=hdfs://localhost:9000/hive-join-test/part1
> 4) Run second query (with join):
> select 
> 	m.a,
> 	j.a,
> 	m.b
> from Main m
> join Joined j on
> 	j.a=m.a
> where
> 	part > UDF2('part0') AND part = 'part1';
> Pruner doesn't work: mapred.input.dir=hdfs://localhost:9000/hive-join-test/part1,hdfs://localhost:9000/hive-join-test/part2,hdfs://localhost:9000/hive-join-test/join
> 5) Also lets try to run query with MAPJOIN hint
> select /*+MAPJOIN(j)*/ 
> 	m.a,
> 	j.a,
> 	m.b
> from Main m
> join Joined j on
> 	j.a=m.a
> where
> 	part > UDF2('part0') AND part = 'part1';
> The result is the same, pruner doesn't work: mapred.input.dir=hdfs://localhost:9000/hive-join-test/part1,hdfs://localhost:9000/hive-join-test/part2

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HIVE-1173) Partition pruner cancels pruning if non-deterministic function present in filtering expression only in joins is present in query

Posted by "Vladimir Klimontovich (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vladimir Klimontovich updated HIVE-1173:
----------------------------------------

    Component/s: Query Processor

> Partition pruner cancels pruning if non-deterministic function present in filtering expression only in joins is present in query
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-1173
>                 URL: https://issues.apache.org/jira/browse/HIVE-1173
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.4.0, 0.4.1
>            Reporter: Vladimir Klimontovich
>
> Brief description:
> case 1) non-deterministic present in partition condition, joins are present in query => partition pruner doesn't do filtering of partitions based on condition
> case 2) non-deterministic present in partition condition, joins aren't present in query => partition pruner do filtering of partitions based on condition
> It's quite illogical when pruning depends on presence of joins in query.
> Example:
> Let's consider following sequence of hive queries:
> 1) Create non-deterministic function:
> create temporary function UDF2 as 'UDF2';
> {{
> import org.apache.hadoop.hive.ql.exec.UDF;
> import org.apache.hadoop.hive.ql.udf.UDFType;
> @UDFType(deterministic=false)
> 	public class UDF2 extends UDF {
> 		public String evaluate(String val) {
> 			return val;
> 		}
> 	}
> }}
> 2) Create tables
> CREATE TABLE Main (
> 	a STRING,
> 	b INT
> )
> PARTITIONED BY(part STRING)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> LINES TERMINATED BY '10'
> STORED AS TEXTFILE;
> ALTER TABLE Main ADD PARTITION (part="part1") LOCATION "/hive-join-test/part1/";
> ALTER TABLE Main ADD PARTITION (part="part2") LOCATION "/hive-join-test/part2/";
> CREATE TABLE Joined (
> 	a STRING,
> 	f STRING
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> LINES TERMINATED BY '10'
> STORED AS TEXTFILE
> LOCATION '/hive-join-test/join/';
> 3) Run first query:
> select 
> 	m.a,
> 	m.b
> from Main m
> where
> 	part > UDF2('part0') AND part = 'part1';
> The pruner will work for this query: mapred.input.dir=hdfs://localhost:9000/hive-join-test/part1
> 4) Run second query (with join):
> select 
> 	m.a,
> 	j.a,
> 	m.b
> from Main m
> join Joined j on
> 	j.a=m.a
> where
> 	part > UDF2('part0') AND part = 'part1';
> Pruner doesn't work: mapred.input.dir=hdfs://localhost:9000/hive-join-test/part1,hdfs://localhost:9000/hive-join-test/part2,hdfs://localhost:9000/hive-join-test/join
> 5) Also lets try to run query with MAPJOIN hint
> select /*+MAPJOIN(j)*/ 
> 	m.a,
> 	j.a,
> 	m.b
> from Main m
> join Joined j on
> 	j.a=m.a
> where
> 	part > UDF2('part0') AND part = 'part1';
> The result is the same, pruner doesn't work: mapred.input.dir=hdfs://localhost:9000/hive-join-test/part1,hdfs://localhost:9000/hive-join-test/part2

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.