Posted to issues@spark.apache.org by "Drew (Jira)" <ji...@apache.org> on 2022/08/31 07:04:00 UTC

[jira] [Created] (SPARK-40287) Load Data using Spark by a single partition moves entire dataset under same location in S3

Drew created SPARK-40287:
----------------------------

             Summary: Load Data using Spark by a single partition moves entire dataset under same location in S3
                 Key: SPARK-40287
                 URL: https://issues.apache.org/jira/browse/SPARK-40287
             Project: Spark
          Issue Type: Question
          Components: Spark Core
    Affects Versions: 3.2.1
            Reporter: Drew


Hello,

I'm experiencing an issue in PySpark when creating a Hive table and loading data into it. I'm using an Amazon S3 bucket as the data location, the table is stored as Parquet, and I'm trying to load data into a single partition of the table, but I'm seeing some odd behavior: when I point LOAD DATA at the S3 location of the parquet data, all of the data is moved into the location specified in my CREATE TABLE command, including the partitions I didn't specify in the LOAD DATA command. For example:
{code:java}
# create a data frame in pyspark with partitions
df = spark.createDataFrame([("a", 1, "x"), ("b", 2, "y"), ("c", 3, "y")], ["c1", "c2", "p"])
# save it to S3
df.write.format("parquet").mode("overwrite").partitionBy("p").save("s3://bucket/data/")
{code}

At this point S3 should have a new `data` folder with one subfolder per partition value, each containing that partition's parquet files (a quick read-back check is sketched after the listing):
 - s3://bucket/data/p=x/
    - part-00001.snappy.parquet
 - s3://bucket/data/p=y/
    - part-00002.snappy.parquet
    - part-00003.snappy.parquet
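
To double-check what the write produced, I read the directory back (a quick sketch; the exact number of part files depends on how many tasks wrote each partition):

{code:java}
# Sanity-check the partitioned output by reading it back from S3.
from pyspark.sql import functions as F

written = spark.read.parquet("s3://bucket/data/")
written.groupBy("p").count().show()                             # rows per partition value
written.select(F.input_file_name()).distinct().show(10, False)  # the part files actually written
{code}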

 
{code:java}
# create new table
spark.sql("create table src (c1 string,c2 int) PARTITIONED BY (p string) STORED AS parquet LOCATION 's3://bucket/new/'")
# load the saved table data from s3 specifying single partition value x
spark.sql("LOAD DATA INPATH 's3://bucket/data/'INTO TABLE src PARTITION (p='x')")
spark.sql("select * from src").show()
# output: 
# +---+---+---+
# | c1| c2|  p|
# +---+---+---+
# +---+---+---+
{code}


After running the `load data` command and looking at the table, I'm left with no data loaded in. When I check S3, the data we saved earlier has been moved under `s3://bucket/new/`; oddly enough, it also brought over the partition I didn't specify, resulting in the directory structure listed below (a workaround sketch follows the listing).

- s3://bucket/new/
    - p=x/
        - p=x/
            - part-00001.snappy.parquet
        - p=y/
            - part-00002.snappy.parquet
            - part-00003.snappy.parquet
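
If I'm reading the Hive-style LOAD DATA semantics right, the INPATH for a single-partition load is supposed to point at the directory holding just that partition's files, so I would have expected to need something like the following instead (a sketch of my assumption, not something I've confirmed against the docs):

{code:java}
# Point LOAD DATA at the single partition's directory rather than the dataset root,
# so only the p=x files are moved into the table's p=x partition.
spark.sql("LOAD DATA INPATH 's3://bucket/data/p=x/' INTO TABLE src PARTITION (p='x')")
spark.sql("select * from src").show()
{code}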

Is this the intended behavior when loading data from a partitioned parquet directory? Are the source files supposed to be moved/deleted from the source directory?
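
For comparison, an approach that I believe shouldn't move or delete anything is to register the existing directories as partitions instead of loading them (again just a sketch, assuming the same bucket layout as above; `src2` is a placeholder table name):

{code:java}
# Create the table directly on top of the data Spark already wrote,
# then ask the metastore to discover the partition directories.
spark.sql("""
    CREATE TABLE src2 (c1 STRING, c2 INT)
    PARTITIONED BY (p STRING)
    STORED AS PARQUET
    LOCATION 's3://bucket/data/'
""")
# MSCK REPAIR TABLE scans the table location and adds p=x and p=y as partitions
# without moving or deleting anything under s3://bucket/data/.
spark.sql("MSCK REPAIR TABLE src2")
spark.sql("SELECT * FROM src2").show()
{code}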



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org