Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/07/26 10:00:00 UTC
[jira] [Commented] (SPARK-28505) Add data source option for omitting partitioned columns when saving to file
[ https://issues.apache.org/jira/browse/SPARK-28505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16893698#comment-16893698 ]
Hyukjin Kwon commented on SPARK-28505:
--------------------------------------
I don't quite understand. Hive reads partitioned columns back from the directory paths as well.
{code}
scala> val myDF = spark.range(10).selectExpr("id as value1", "id as value2", "id as year", "id as month", "id as day")
myDF: org.apache.spark.sql.DataFrame = [value1: bigint, value2: bigint ... 3 more fields]
scala> myDF.select("value1", "value2", "year","month","day").write.format("csv").option("header", "true").partitionBy("year","month","day").save("/tmp/foo")
{code}
{code}
➜ ~ cd /tmp/foo
➜ foo ls
_SUCCESS year=0 year=1 year=2 year=3 year=4 year=5 year=6 year=7 year=8 year=9
➜ foo tree .
.
├── _SUCCESS
├── year=0
│   └── month=0
│       └── day=0
│           └── part-00001-517c30de-3c23-4ed0-bb44-f5729cd05fec.c000.csv
├── year=1
│   └── month=1
│       └── day=1
│           └── part-00002-517c30de-3c23-4ed0-bb44-f5729cd05fec.c000.csv
├── year=2
│   └── month=2
│       └── day=2
│           └── part-00003-517c30de-3c23-4ed0-bb44-f5729cd05fec.c000.csv
├── year=3
│   └── month=3
│       └── day=3
│           └── part-00004-517c30de-3c23-4ed0-bb44-f5729cd05fec.c000.csv
├── year=4
│   └── month=4
│       └── day=4
│           └── part-00005-517c30de-3c23-4ed0-bb44-f5729cd05fec.c000.csv
├── year=5
│   └── month=5
│       └── day=5
│           └── part-00007-517c30de-3c23-4ed0-bb44-f5729cd05fec.c000.csv
├── year=6
│   └── month=6
│       └── day=6
│           └── part-00008-517c30de-3c23-4ed0-bb44-f5729cd05fec.c000.csv
├── year=7
│   └── month=7
│       └── day=7
│           └── part-00009-517c30de-3c23-4ed0-bb44-f5729cd05fec.c000.csv
├── year=8
│   └── month=8
│       └── day=8
│           └── part-00010-517c30de-3c23-4ed0-bb44-f5729cd05fec.c000.csv
└── year=9
    └── month=9
        └── day=9
            └── part-00011-517c30de-3c23-4ed0-bb44-f5729cd05fec.c000.csv
30 directories, 11 files
➜ foo cat part-00001-517c30de-3c23-4ed0-bb44-f5729cd05fec.c000.csv
value1,value2
0,0
{code}
and Spark doesn't save partitioned columns in its output files.
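For illustration outside Spark, a minimal Python sketch of the same Hive-style layout shown above: rows are grouped by their partition values, the partition columns are dropped from the data files, and each group lands under nested {{key=value}} directories. The function name {{write_partitioned_csv}} and the single {{part-00000.csv}} file per partition are assumptions for the sketch, not Spark internals.

```python
import csv
import os
from collections import defaultdict

def write_partitioned_csv(rows, partition_cols, out_dir):
    """Write rows as CSV files under Hive-style key=value directories,
    omitting the partition columns from the data files themselves."""
    # Group rows by their partition values, dropping partition columns.
    groups = defaultdict(list)
    for row in rows:
        key = tuple((c, row[c]) for c in partition_cols)
        groups[key].append(
            {k: v for k, v in row.items() if k not in partition_cols}
        )
    # One key=value directory level per partition column, one file per group.
    for key, part_rows in groups.items():
        subdir = os.path.join(out_dir, *[f"{c}={v}" for c, v in key])
        os.makedirs(subdir, exist_ok=True)
        path = os.path.join(subdir, "part-00000.csv")
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(part_rows[0]))
            writer.writeheader()
            writer.writerows(part_rows)
    return out_dir
```

As in the Spark output above, the resulting files contain only {{value1,value2}}; the {{year}}, {{month}} and {{day}} values survive only in the directory names.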
> Add data source option for omitting partitioned columns when saving to file
> ---------------------------------------------------------------------------
>
> Key: SPARK-28505
> URL: https://issues.apache.org/jira/browse/SPARK-28505
> Project: Spark
> Issue Type: Wish
> Components: Input/Output, Spark Core
> Affects Versions: 2.4.4, 3.0.0
> Reporter: Juarez Rudsatz
> Priority: Minor
>
> It would be very useful to have an option for omitting the columns used for partitioning from the output when writing to a file data source such as csv, avro, parquet, orc or excel.
> Consider the following code:
> {{Dataset<Row> myDF = spark.createDataFrame(myRDD, MyClass.class);}}
> {{myDF.select("value1", "value2", "year","month","day")}}
> {{.write().format("csv")}}
> {{.option("header", "true")}}
> {{.partitionBy("year","month","day")}}
> {{.save("hdfs://user/spark/warehouse/csv_output_dir");}}
> This will output many files in separated folders in a structure like:
> {{csv_output_dir/_SUCCESS}}
> {{csv_output_dir/year=2019/month=7/day=10/part-00000-ac09671e-5ee3-4479-ae83-5301aa7f424b.c000.csv}}
> {{csv_output_dir/year=2019/month=7/day=11/part-00000-ac09671e-5ee3-4479-ae83-5301aa7f424b.c000.csv}}
> {{...}}
> And the output will be something like:
> {{┌──────┬──────┬──────┬───────┬─────┐}}
> {{│ val1 │ val2 │ year │ month │ day │}}
> {{├──────┼──────┼──────┼───────┼─────┤}}
> {{│ 3673 │ 2345 │ 2019 │     7 │  10 │}}
> {{│ 2345 │ 3423 │ 2019 │     7 │  10 │}}
> {{│ 8765 │ 2423 │ 2019 │     7 │  10 │}}
> {{└──────┴──────┴──────┴───────┴─────┘}}
> When using partitioning in Hive, the output from the same source data will be something like:
> {{┌──────┬──────┐}}
> {{│ val1 │ val2 │}}
> {{├──────┼──────┤}}
> {{│ 3673 │ 2345 │}}
> {{│ 2345 │ 3423 │}}
> {{│ 8765 │ 2423 │}}
> {{└──────┴──────┘}}
> In this case the partitioning columns are not present in the CSV files. However, the output files follow the same folder/path structure as exists today.
> Please consider adding an opt-in option to DataFrameWriter for leaving out the partitioning columns, as in the second example.
> The code could be something like:
> {{Dataset<Row> myDF = spark.createDataFrame(myRDD, MyClass.class);}}
> {{myDF.select("value1", "value2", "year","month","day")}}
> {{.write().format("csv")}}
> {{.option("header", "true")}}
> *{{.option("partition.omit.cols", "true")}}*
> {{.partitionBy("year","month","day")}}
> {{.save("hdfs://user/spark/warehouse/csv_output_dir");}}
> Thanks.
>
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org