Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2022/06/18 00:40:00 UTC

[jira] [Resolved] (SPARK-39474) Streamline the options for the `.write` method of a Spark DataFrame

     [ https://issues.apache.org/jira/browse/SPARK-39474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-39474.
----------------------------------
    Fix Version/s:     (was: 3.2.1)
       Resolution: Not A Problem

> Streamline the options for the `.write` method of a Spark DataFrame
> -------------------------------------------------------------------
>
>                 Key: SPARK-39474
>                 URL: https://issues.apache.org/jira/browse/SPARK-39474
>             Project: Spark
>          Issue Type: Wish
>          Components: PySpark
>    Affects Versions: 3.2.1
>            Reporter: Chris Mahoney
>            Priority: Minor
>
> Hi Team!
> I'd like a much easier way to optimize my {{delta}} tables. Specifically, I am referring to the SQL command {{OPTIMIZE <table>}}.
> Let me show you the differences:
> *Current:*
> First, run:
> {code:java}
> import pandas as pd
> df = spark.createDataFrame(
>     pd.DataFrame(
>         {'a': [1,2,3,4],
>          'b': ['a','b','c','d']}
>     )
> )
> df.write.mode('overwrite').format('delta').save('./folder'){code}
>  Then, once it's saved, run:
> {code:java}
> CREATE TABLE df USING DELTA LOCATION './folder' {code}
>  Then, once the table is registered, run:
> {code:java}
> OPTIMIZE df
> --or
> OPTIMIZE df ZORDER BY (b) {code}
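> For completeness: these SQL statements can also be issued from Python via {{spark.sql}}, so the whole flow can live in one script (the table name {{df}} here is just the one registered above, and this assumes a SparkSession with Delta support):
> {code:python}
> # Run the same OPTIMIZE statements through the SparkSession instead of a SQL console
> spark.sql("OPTIMIZE df")
> # or
> spark.sql("OPTIMIZE df ZORDER BY (b)")
> {code}
> This avoids switching tools, but it is still a separate step after the write.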
> As you can see, several separate steps are needed, spread across Python and SQL.
> *Future:*
> I'd like to be able to do something like this:
> {code:java}
> import pandas as pd
> df = spark.createDataFrame(
>     pd.DataFrame(
>         {'a': [1,2,3,4],
>          'b': ['a','b','c','d']}
>     )
> )
> df.write.mode('overwrite').format('delta').options(optimize=True).save('./folder')
> #or
> df.write.mode('overwrite').format('delta').options(optimize=True, zorder_by=('b')).save('./folder') {code}
> As you can see, it's much more streamlined, and it keeps the code at a higher level.
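> (Side note: recent Delta Lake releases do expose a programmatic route through the {{delta-spark}} Python API, though it is still a separate call after the write rather than a writer option. A sketch, assuming Delta Lake 2.0+, where {{executeZOrderBy}} is available:)
> {code:python}
> from delta.tables import DeltaTable
>
> # Assumes the DataFrame was already written to './folder' in delta format, as above
> dt = DeltaTable.forPath(spark, './folder')
> dt.optimize().executeCompaction()   # equivalent to: OPTIMIZE <table>
> dt.optimize().executeZOrderBy('b')  # equivalent to: OPTIMIZE <table> ZORDER BY (b)
> {code}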
> Thank you.
>  
> References:
>  * [https://docs.azuredatabricks.net/_static/notebooks/delta/optimize-python.html]
>  * [https://medium.com/@debusinha2009/cheatsheet-on-understanding-zorder-and-optimize-for-your-delta-tables-1556282221d3]
>  * [https://www.cloudiqtech.com/partition-optimize-and-zorder-delta-tables-in-azure-databricks/]
>  * [https://docs.databricks.com/delta/optimizations/file-mgmt.html]
>  * [https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-optimize.html]
>  * [https://stackoverflow.com/questions/65320949/parquet-vs-delta-format-in-azure-data-lake-gen-2-store?_sm_au_=iVV4WjsV0q7WQktrJfsTkK7RqJB10]
>  * [https://www.i-programmer.info/news/197-data-mining/12582-databricks-delta-adds-faster-parquet-import.html#:~:text=Databricks%20says%20Delta%20is%2010,data%20management%2C%20and%20query%20serving]
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org