Posted to issues@spark.apache.org by "Drew (Jira)" <ji...@apache.org> on 2022/08/31 07:04:00 UTC
[jira] [Created] (SPARK-40287) Load Data using Spark by a single partition moves entire dataset under same location in S3
Drew created SPARK-40287:
----------------------------
Summary: Load Data using Spark by a single partition moves entire dataset under same location in S3
Key: SPARK-40287
URL: https://issues.apache.org/jira/browse/SPARK-40287
Project: Spark
Issue Type: Question
Components: Spark Core
Affects Versions: 3.2.1
Reporter: Drew
Hello,
I'm experiencing an issue in PySpark when creating a Hive table and loading data into it. I'm using an Amazon S3 bucket as the data location, creating the table stored as parquet, and trying to load data into it for a single partition, but I'm seeing some odd behavior: when I point the LOAD DATA command at the S3 location of the parquet dataset, all of the data is moved into the location specified in my CREATE TABLE command, including the partitions I didn't specify in the LOAD DATA command. For example:
{code:java}
# create a data frame in pyspark with partitions
df = spark.createDataFrame([("a", 1, "x"), ("b", 2, "y"), ("c", 3, "y")], ["c1", "c2", "p"])
# save it to S3
df.write.format("parquet").mode("overwrite").partitionBy("p").save("s3://bucket/data/")
{code}
In the current state, S3 should have a new `data` folder containing one subfolder per partition, each holding its parquet files:
- s3://bucket/data/p=x/
  - part-00001.snappy.parquet
- s3://bucket/data/p=y/
  - part-00002.snappy.parquet
  - part-00003.snappy.parquet
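The layout above follows the Hive-style partition convention: `partitionBy("p")` writes one subdirectory per distinct value of `p`, named `<column>=<value>`, with the partition column itself dropped from the data files. A minimal plain-Python sketch of that grouping (no cluster needed; `partition_dirs` is a hypothetical helper for illustration, not a Spark API):

{code:python}
# Same rows as the createDataFrame call above.
rows = [("a", 1, "x"), ("b", 2, "y"), ("c", 3, "y")]

def partition_dirs(rows, base="s3://bucket/data"):
    """Group rows by the partition column (last field) into
    Hive-style "<column>=<value>/" directory paths."""
    dirs = {}
    for c1, c2, p in rows:
        # The partition value becomes the directory name; the
        # remaining columns are what lands in the parquet files.
        dirs.setdefault(f"{base}/p={p}/", []).append((c1, c2))
    return dirs

layout = partition_dirs(rows)
# Two partition directories: p=x holds one row, p=y holds two.
{code}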
{code:java}
# create new table
spark.sql("create table src (c1 string, c2 int) PARTITIONED BY (p string) STORED AS parquet LOCATION 's3://bucket/new/'")
# load the saved table data from s3 specifying single partition value x
spark.sql("LOAD DATA INPATH 's3://bucket/data/' INTO TABLE src PARTITION (p='x')")
spark.sql("select * from src").show()
# output:
# +---+---+---+
# | c1| c2| p|
# +---+---+---+
# +---+---+---+
{code}
After running the `LOAD DATA` command, the table is left with no data loaded in. Checking S3, the source data we saved earlier has been moved under `s3://bucket/new/`; oddly enough, it also brought the other partitions along with it. The resulting directory structure is listed below:
- s3://bucket/new/
  - p=x/
    - p=x/
      - part-00001.snappy.parquet
    - p=y/
      - part-00002.snappy.parquet
      - part-00003.snappy.parquet
Is this the intended behavior when loading data in from a partitioned parquet dataset? And is the source data supposed to be moved/deleted from its original directory?
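For what it's worth, Hive's LOAD DATA with a PARTITION clause expects the INPATH to contain plain data files rather than partition subdirectories, so a likely workaround is to point INPATH at the single partition's own folder, e.g. `s3://bucket/data/p=x/`. A sketch of that, untested against a live cluster (`load_partition_sql` is a hypothetical helper introduced here to build the statement, not a Spark API):

{code:python}
def load_partition_sql(table, base_path, part_col, part_val):
    """Build a LOAD DATA statement whose INPATH is the partition's
    own subfolder, so only that partition's files are touched."""
    inpath = f"{base_path.rstrip('/')}/{part_col}={part_val}/"
    return (f"LOAD DATA INPATH '{inpath}' "
            f"INTO TABLE {table} PARTITION ({part_col}='{part_val}')")

# On a live session this would be run as:
# spark.sql(load_partition_sql("src", "s3://bucket/data", "p", "x"))
{code}

Note that LOAD DATA with a non-local INPATH moves (rather than copies) the source files, so the disappearance of the source directory itself is consistent with Hive semantics; it is the sweeping-up of the unrelated partitions that looks surprising.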
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org