Posted to issues@spark.apache.org by "Raymond Tang (Jira)" <ji...@apache.org> on 2022/08/24 08:49:00 UTC
[jira] [Updated] (SPARK-40206) Spark SQL Predict Pushdown for Hive Bucketed Table
[ https://issues.apache.org/jira/browse/SPARK-40206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Raymond Tang updated SPARK-40206:
---------------------------------
Description:
Hi team,
I was testing out Hive bucketed table features. One of the benefits, as most documentation suggests, is that a bucketed Hive table can be used for query filter/predicate pushdown to improve query performance.
However, through my exploration, that doesn't seem to be the case. *Can you please help clarify whether Spark SQL supports query optimizations when using a Hive bucketed table?*
How to reproduce the issue:
Create a Hive 3 table using the following DDL:
{code:sql}
create table test_db.bucket_table(user_id int, key string)
comment 'A bucketed table'
partitioned by(country string)
clustered by(user_id) sorted by (key) into 10 buckets
stored as ORC;{code}
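For reference, a minimal sketch (plain Python, no Hive needed) of which bucket file a given user_id should map to. It assumes Hive's default bucketing hash for an int column is the value itself, with negative hashes masked before the modulo, mirroring Hive's (hashCode & Integer.MAX_VALUE) % numBuckets:

```python
# Sketch: how Hive decides which bucket file a row lands in.
# Assumption: for an int clustering column, Hive's default hash is the
# value itself; the mask keeps the result non-negative, as Hive does.
def hive_bucket_id(user_id: int, num_buckets: int = 10) -> int:
    return (user_id & 0x7FFFFFFF) % num_buckets

# With 10 buckets, user_id = 101 maps to bucket 1 under this assumption,
# i.e. a single bucket file inside each partition directory.
print(hive_bucket_id(101))
```

If the assumption holds, a pruned scan for user_id=101 would only need to touch the bucket-1 file in the country=AU partition folder.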
And then insert into this table using the following PySpark script:
{code:python}
from pyspark.sql import SparkSession

appName = "PySpark Hive Bucketing Example"
master = "local"

# Create Spark session with Hive support.
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .enableHiveSupport() \
    .getOrCreate()

# Prepare sample data for inserting into the Hive table.
data = []
countries = ['CN', 'AU']
for i in range(0, 1000):
    data.append([int(i), 'U' + str(i), countries[i % 2]])

df = spark.createDataFrame(data, ['user_id', 'key', 'country'])
df.show()

# Save df to Hive table test_db.bucket_table.
df.write.mode('append').insertInto('test_db.bucket_table') {code}
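As a quick sanity check on the sample data (plain Python, no Spark needed), the alternating-country loop puts user_id 101 into partition country='AU', which is exactly the partition the query filters on:

```python
# Rebuild the same sample rows as the loop above and inspect row 101.
countries = ['CN', 'AU']
data = []
for i in range(0, 1000):
    data.append([int(i), 'U' + str(i), countries[i % 2]])

# user_id 101 is odd, so it alternates into country 'AU'.
print(data[101])
```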
Then query the table using the following script:
{code:python}
from pyspark.sql import SparkSession

appName = "PySpark Hive Bucketing Example"
master = "local"

# Create Spark session with Hive support.
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .enableHiveSupport() \
    .getOrCreate()

df = spark.sql("""select * from test_db.bucket_table
    where country='AU' and user_id=101
""")
df.show()
df.explain(extended=True) {code}
I expected Spark to read from only one bucket file in HDFS, but instead it scanned all bucket files in the partition folder country=AU.
Am I doing something wrong, or is it because Spark doesn't support this? Your guidance and help will be appreciated.
was:
Hi team,
I was testing out Hive bucketed table features. One of the benefits, as most documentation suggests, is that a bucketed Hive table can be used for query filter/predicate pushdown to improve query performance.
However, through my exploration, that doesn't seem to be the case. *Can you please help clarify whether Spark SQL supports query optimizations when using a Hive bucketed table?*
How to reproduce the issue:
Create a Hive 3 table using the following DDL:
{code:sql}
create table test_db.bucket_table(user_id int, key string)
comment 'A bucketed table'
partitioned by(country string)
clustered by(user_id) sorted by (key) into 10 buckets
stored as ORC;{code}
And then insert into this table using the following PySpark script:
{code:python}
from pyspark.sql import SparkSession

appName = "PySpark Hive Bucketing Example"
master = "local"

# Create Spark session with Hive support.
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .enableHiveSupport() \
    .getOrCreate()

# Prepare sample data for inserting into the Hive table.
data = []
countries = ['CN', 'AU']
for i in range(0, 1000):
    data.append([int(i), 'U' + str(i), countries[i % 2]])

df = spark.createDataFrame(data, ['country', 'user_id', 'key'])
df.show()

# Save df to Hive table test_db.bucket_table.
df.write.mode('append').insertInto('test_db.bucket_table') {code}
Then query the table using the following script:
{code:python}
from pyspark.sql import SparkSession

appName = "PySpark Hive Bucketing Example"
master = "local"

# Create Spark session with Hive support.
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .enableHiveSupport() \
    .getOrCreate()

df = spark.sql("""select * from test_db.bucket_table
    where country='AU' and user_id=101
""")
df.show()
df.explain(extended=True) {code}
I expected Spark to read from only one bucket file in HDFS, but instead it scanned all bucket files in the partition folder country=AU.
Am I doing something wrong, or is it because Spark doesn't support this? Your guidance and help will be appreciated.
> Spark SQL Predict Pushdown for Hive Bucketed Table
> --------------------------------------------------
>
> Key: SPARK-40206
> URL: https://issues.apache.org/jira/browse/SPARK-40206
> Project: Spark
> Issue Type: Question
> Components: Spark Core
> Affects Versions: 3.3.0
> Reporter: Raymond Tang
> Priority: Minor
> Labels: hive, hive-buckets, spark
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org