Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/07/26 10:00:00 UTC

[jira] [Commented] (SPARK-28505) Add data source option for omitting partitioned columns when saving to file

    [ https://issues.apache.org/jira/browse/SPARK-28505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16893698#comment-16893698 ] 

Hyukjin Kwon commented on SPARK-28505:
--------------------------------------

I don't quite understand. Hive reads partitioned columns back from the directory structure as well.

{code}
scala> val myDF = spark.range(10).selectExpr("id as value1", "id as value2", "id as year", "id as month", "id as day")
myDF: org.apache.spark.sql.DataFrame = [value1: bigint, value2: bigint ... 3 more fields]

scala> myDF.select("value1", "value2", "year","month","day").write.format("csv").option("header", "true").partitionBy("year","month","day").save("/tmp/foo")
{code}

{code}
➜ ~ cd /tmp/foo
➜ foo ls
_SUCCESS year=0 year=1 year=2 year=3 year=4 year=5 year=6 year=7 year=8 year=9
➜ foo tree .
.
├── _SUCCESS
├── year=0
│   └── month=0
│       └── day=0
│           └── part-00001-517c30de-3c23-4ed0-bb44-f5729cd05fec.c000.csv
├── year=1
│   └── month=1
│       └── day=1
│           └── part-00002-517c30de-3c23-4ed0-bb44-f5729cd05fec.c000.csv
├── year=2
│   └── month=2
│       └── day=2
│           └── part-00003-517c30de-3c23-4ed0-bb44-f5729cd05fec.c000.csv
├── year=3
│   └── month=3
│       └── day=3
│           └── part-00004-517c30de-3c23-4ed0-bb44-f5729cd05fec.c000.csv
├── year=4
│   └── month=4
│       └── day=4
│           └── part-00005-517c30de-3c23-4ed0-bb44-f5729cd05fec.c000.csv
├── year=5
│   └── month=5
│       └── day=5
│           └── part-00007-517c30de-3c23-4ed0-bb44-f5729cd05fec.c000.csv
├── year=6
│   └── month=6
│       └── day=6
│           └── part-00008-517c30de-3c23-4ed0-bb44-f5729cd05fec.c000.csv
├── year=7
│   └── month=7
│       └── day=7
│           └── part-00009-517c30de-3c23-4ed0-bb44-f5729cd05fec.c000.csv
├── year=8
│   └── month=8
│       └── day=8
│           └── part-00010-517c30de-3c23-4ed0-bb44-f5729cd05fec.c000.csv
└── year=9
    └── month=9
        └── day=9
            └── part-00011-517c30de-3c23-4ed0-bb44-f5729cd05fec.c000.csv

30 directories, 11 files
➜ foo cat year=0/month=0/day=0/part-00001-517c30de-3c23-4ed0-bb44-f5729cd05fec.c000.csv
value1,value2
0,0
{code}

and Spark doesn't save partitioned columns in its output files either.
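To illustrate why the data files don't need the partition columns: both Spark and Hive encode the values as {{name=value}} directory segments, so a reader can recover them from the path alone. A minimal sketch in plain Python (no Spark required; the helper name is illustrative, not a Spark API):

```python
# Recover Hive-style partition column values from an output file path.
# Spark/Hive encode partition columns as "name=value" directory segments,
# so the values never need to be stored inside the data file itself.

def partition_values(path):
    """Parse 'name=value' segments out of a partitioned output path."""
    values = {}
    for segment in path.split("/"):
        if "=" in segment:
            name, _, value = segment.partition("=")
            values[name] = value
    return values

row_path = "/tmp/foo/year=0/month=0/day=0/part-00001.c000.csv"
print(partition_values(row_path))  # {'year': '0', 'month': '0', 'day': '0'}
```

This is the same mechanism Spark's partition discovery uses when reading the directory back, which is why the CSV above only contains {{value1}} and {{value2}}.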

> Add data source option for omitting partitioned columns when saving to file
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-28505
>                 URL: https://issues.apache.org/jira/browse/SPARK-28505
>             Project: Spark
>          Issue Type: Wish
>          Components: Input/Output, Spark Core
>    Affects Versions: 2.4.4, 3.0.0
>            Reporter: Juarez Rudsatz
>            Priority: Minor
>
> It would be very useful to have an option for omitting the columns used in partitioning from the output when writing to a file data source like csv, avro, parquet, orc or excel.
> Consider the following code:
> {{Dataset<Row> myDF = spark.createDataFrame(myRDD, MyClass.class);}}
>  {{myDF.select("value1", "value2", "year","month","day")}}
>  {{.write().format("csv")}}
>  {{.option("header", "true")}}
>  {{.partitionBy("year","month","day")}}
>  {{.save("hdfs://user/spark/warehouse/csv_output_dir");}}
> This will output many files in separate folders, in a structure like:
> {{csv_output_dir/_SUCCESS}}
>  {{csv_output_dir/year=2019/month=7/day=10/part-00000-ac09671e-5ee3-4479-ae83-5301aa7f424b.c000.csv}}
>  {{csv_output_dir/year=2019/month=7/day=11/part-00000-ac09671e-5ee3-4479-ae83-5301aa7f424b.c000.csv}}
>  {{...}}
> And the output will be something like:
> {{┌──────┬──────┬──────┬───────┬─────┐}}
>  {{│ val1 │ val2 │ year │ month │ day │}}
>  {{├──────┼──────┼──────┼───────┼─────┤}}
>  {{│ 3673 │ 2345 │ 2019 │     7 │ 10  │}}
>  {{│ 2345 │ 3423 │ 2019 │     7 │ 10  │}}
>  {{│ 8765 │ 2423 │ 2019 │     7 │ 10  │}}
>  {{└──────┴──────┴──────┴───────┴─────┘}}
> When using partitioning in HIVE, the output from the same source data will be something like:
> {{┌──────┬──────┐}}
>  {{│ val1 │ val2 │}}
>  {{├──────┼──────┤}}
>  {{│ 3673 │ 2345 │}}
>  {{│ 2345 │ 3423 │}}
>  {{│ 8765 │ 2423 │}}
>  {{└──────┴──────┘}}
> In this case the partitioning columns are not present in the CSV files, but the output files follow the same folder/path structure as exists today.
> Please consider adding an opt-in config for DataFrameWriter that leaves out the partitioning columns, as in the second example.
> The code could be something like:
> {{Dataset<Row> myDF = spark.createDataFrame(myRDD, MyClass.class);}}
>  {{myDF.select("value1", "value2", "year","month","day")}}
>  {{.write().format("csv")}}
>  {{.option("header", "true")}}
>  *{{.option("partition.omit.cols", "true")}}*
>  {{.partitionBy("year","month","day")}}
>  {{.save("hdfs://user/spark/warehouse/csv_output_dir");}}
> Thanks.
>   
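For reference, the semantics the report asks for can be sketched without Spark: bucket rows by their partition key, turn the key into the {{year=.../month=.../day=...}} path, and write only the non-partition columns into each file. A minimal plain-Python sketch (function and file names are illustrative, not a proposed Spark API):

```python
import csv
import io
from collections import defaultdict

# Sketch of the requested behavior: partition values go into the path,
# and only the remaining columns are written into each file's contents.

def write_partitioned(rows, partition_cols):
    """Return a dict mapping partitioned paths to CSV file contents."""
    buckets = defaultdict(list)
    for row in rows:
        key = "/".join(f"{c}={row[c]}" for c in partition_cols)
        buckets[key].append(
            {k: v for k, v in row.items() if k not in partition_cols}
        )
    files = {}
    for key, data in buckets.items():
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=list(data[0]), lineterminator="\n")
        writer.writeheader()
        writer.writerows(data)
        files[f"{key}/part-00000.csv"] = buf.getvalue()
    return files

rows = [
    {"value1": 3673, "value2": 2345, "year": 2019, "month": 7, "day": 10},
    {"value1": 2345, "value2": 3423, "year": 2019, "month": 7, "day": 10},
]
out = write_partitioned(rows, ["year", "month", "day"])
print(out["year=2019/month=7/day=10/part-00000.csv"])
# value1,value2
# 3673,2345
# 2345,3423
```

As the comment above notes, this is already what Spark's {{partitionBy}} does for the file contents; the feature request is effectively about confirming/controlling that behavior via an option.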



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org