Posted to issues@spark.apache.org by "Raymond Tang (Jira)" <ji...@apache.org> on 2022/08/24 08:49:00 UTC

[jira] [Updated] (SPARK-40206) Spark SQL Predicate Pushdown for Hive Bucketed Table

     [ https://issues.apache.org/jira/browse/SPARK-40206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Tang updated SPARK-40206:
---------------------------------
    Description: 
Hi team,

I was testing out Hive bucketed table features. One of the benefits, as most documentation suggests, is that a bucketed Hive table can be used for query filter/predicate pushdown to improve query performance.

However, through my exploration, that doesn't seem to be true. *Can you please clarify whether Spark SQL supports these query optimizations when using Hive bucketed tables?*

 

How to reproduce the issue:

Create a Hive 3 table using the following DDL:
{code:java}
create table test_db.bucket_table(user_id int, key string) 
comment 'A bucketed table' 
partitioned by(country string) 
clustered by(user_id) sorted by (key) into 10 buckets
stored as ORC;{code}
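Once the table exists, it helps to confirm that Spark actually sees the bucketing metadata from the metastore. A minimal sketch (run from the same Spark session used below; the exact rows in the output depend on the catalog and Hive version):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# The detailed table information is expected to include rows such as
# "Num Buckets", "Bucket Columns" and "Sort Columns".
spark.sql("DESCRIBE FORMATTED test_db.bucket_table").show(100, truncate=False)
{code}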
And then insert into this table using the following PySpark script:
{code:java}
from pyspark.sql import SparkSession

appName = "PySpark Hive Bucketing Example"
master = "local"

# Create a Spark session with Hive support enabled.
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .enableHiveSupport() \
    .getOrCreate()

# Prepare sample data for inserting into the Hive table.
data = []
countries = ['CN', 'AU']
for i in range(0, 1000):
    data.append([int(i), 'U' + str(i), countries[i % 2]])

df = spark.createDataFrame(data, ['user_id', 'key', 'country'])
df.show()

# Save df to Hive table test_db.bucket_table

df.write.mode('append').insertInto('test_db.bucket_table') {code}
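After the insert, it is worth checking how many data files actually landed in each partition directory, since file pruning can only work if the expected bucket layout was produced. A rough sketch, assuming a local-mode setup with the default spark-warehouse location (adjust the path to the real table location shown by DESCRIBE FORMATTED):
{code:python}
import os

# Assumed local warehouse layout; replace with the actual table location.
partition_dir = "spark-warehouse/test_db.db/bucket_table/country=AU"
data_files = [f for f in os.listdir(partition_dir) if not f.startswith("_")]
print(len(data_files), "data files under", partition_dir)
{code}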
Then query the table using the following script:
{code:java}
from pyspark.sql import SparkSession

appName = "PySpark Hive Bucketing Example"
master = "local"

# Create a Spark session with Hive support enabled.
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .enableHiveSupport() \
    .getOrCreate()

df = spark.sql("""select * from test_db.bucket_table
where country='AU' and user_id=101
""")
df.show()
df.explain(extended=True) {code}
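To see exactly which files back this scan, the file list can also be inspected directly. A small sketch (inputFiles() is a best-effort API available since PySpark 3.1; for Hive ORC tables it reflects the converted file-based scan Spark uses by default):
{code:python}
# Best-effort list of the files Spark plans to read for this query.
files = df.inputFiles()
print(len(files), "input files:")
for f in files:
    print(f)
{code}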
I expected Spark to read only one bucket file from HDFS, but instead it scanned all bucket files in the partition folder country=AU.

Am I doing something wrong, or is it that Spark doesn't support this? Your guidance and help will be appreciated.
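For comparison, as far as I can tell Spark's own (datasource) bucketed tables written with bucketBy can get bucket pruning in Spark 3.x. A sketch of that check, with an illustrative table name test_db.spark_bucket_table that is not part of the original reproduction:
{code:python}
# Re-write the same data as a Spark-native bucketed table.
src = spark.table("test_db.bucket_table")
(src.write
    .mode("overwrite")
    .partitionBy("country")
    .bucketBy(10, "user_id")
    .sortBy("key")
    .saveAsTable("test_db.spark_bucket_table"))

df2 = spark.sql("""select * from test_db.spark_bucket_table
where country='AU' and user_id=101
""")
df2.explain()
# With bucket pruning applied, the FileScan node is expected to report
# something like "SelectedBucketsCount: 1 out of 10".
{code}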

 

  was:
Hi team,

I was testing out Hive bucketed table features. One of the benefits, as most documentation suggests, is that a bucketed Hive table can be used for query filter/predicate pushdown to improve query performance.

However, through my exploration, that doesn't seem to be true. *Can you please clarify whether Spark SQL supports these query optimizations when using Hive bucketed tables?*

 

How to reproduce the issue:

Create a Hive 3 table using the following DDL:
{code:java}
create table test_db.bucket_table(user_id int, key string) 
comment 'A bucketed table' 
partitioned by(country string) 
clustered by(user_id) sorted by (key) into 10 buckets
stored as ORC;{code}
And then insert into this table using the following PySpark script:
{code:java}
from pyspark.sql import SparkSession

appName = "PySpark Hive Bucketing Example"
master = "local"

# Create Spark session with Hive supported.
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .enableHiveSupport() \
    .getOrCreate()

# prepare sample data for inserting into hive table
data = []
countries = ['CN', 'AU']
for i in range(0, 1000):
    data.append([int(i),  'U'+str(i), countries[i % 2]])

df = spark.createDataFrame(data, ['country', 'user_id', 'key'])
df.show()

# Save df to Hive table test_db.bucket_table

df.write.mode('append').insertInto('test_db.bucket_table') {code}
Then query the table using the following script:
{code:java}
from pyspark.sql import SparkSession

appName = "PySpark Hive Bucketing Example"
master = "local"

# Create Spark session with Hive supported.
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .enableHiveSupport() \
    .getOrCreate()

df = spark.sql("""select * from test_db.bucket_table
where country='AU' and user_id=101
""")
df.show()
df.explain(extended=True) {code}
I expected Spark to read only one bucket file from HDFS, but instead it scanned all bucket files in the partition folder country=AU.

Am I doing something wrong, or is it that Spark doesn't support this? Your guidance and help will be appreciated.

 


> Spark SQL Predicate Pushdown for Hive Bucketed Table
> ----------------------------------------------------
>
>                 Key: SPARK-40206
>                 URL: https://issues.apache.org/jira/browse/SPARK-40206
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core
>    Affects Versions: 3.3.0
>            Reporter: Raymond Tang
>            Priority: Minor
>              Labels: hive, hive-buckets, spark
>
> Hi team,
> I was testing out Hive bucketed table features. One of the benefits, as most documentation suggests, is that a bucketed Hive table can be used for query filter/predicate pushdown to improve query performance.
> However, through my exploration, that doesn't seem to be true. *Can you please clarify whether Spark SQL supports these query optimizations when using Hive bucketed tables?*
>  
> How to reproduce the issue:
> Create a Hive 3 table using the following DDL:
> {code:java}
> create table test_db.bucket_table(user_id int, key string) 
> comment 'A bucketed table' 
> partitioned by(country string) 
> clustered by(user_id) sorted by (key) into 10 buckets
> stored as ORC;{code}
> And then insert into this table using the following PySpark script:
> {code:java}
> from pyspark.sql import SparkSession
> appName = "PySpark Hive Bucketing Example"
> master = "local"
> # Create Spark session with Hive supported.
> spark = SparkSession.builder \
>     .appName(appName) \
>     .master(master) \
>     .enableHiveSupport() \
>     .getOrCreate()
> # prepare sample data for inserting into hive table
> data = []
> countries = ['CN', 'AU']
> for i in range(0, 1000):
>     data.append([int(i),  'U'+str(i), countries[i % 2]])
> df = spark.createDataFrame(data, ['user_id', 'key', 'country'])
> df.show()
> # Save df to Hive table test_db.bucket_table
> df.write.mode('append').insertInto('test_db.bucket_table') {code}
> Then query the table using the following script:
> {code:java}
> from pyspark.sql import SparkSession
> appName = "PySpark Hive Bucketing Example"
> master = "local"
> # Create Spark session with Hive supported.
> spark = SparkSession.builder \
>     .appName(appName) \
>     .master(master) \
>     .enableHiveSupport() \
>     .getOrCreate()
> df = spark.sql("""select * from test_db.bucket_table
> where country='AU' and user_id=101
> """)
> df.show()
> df.explain(extended=True) {code}
> I expected Spark to read only one bucket file from HDFS, but instead it scanned all bucket files in the partition folder country=AU.
> Am I doing something wrong, or is it that Spark doesn't support this? Your guidance and help will be appreciated.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org