Posted to issues@spark.apache.org by "Chris Mahoney (Jira)" <ji...@apache.org> on 2022/06/15 02:38:00 UTC

[jira] [Updated] (SPARK-39474) Streamline the options for the `.write` method of a Spark DataFrame

     [ https://issues.apache.org/jira/browse/SPARK-39474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Mahoney updated SPARK-39474:
----------------------------------
    Description: 
Hi Team!


I'd like a much easier way to optimize my `delta` tables. Specifically, I am referring to the SQL command `OPTIMIZE <table>`.

Let me show you the differences:

*Current:*

First, run:

 
{code:python}
import pandas as pd
df = spark.createDataFrame(
    pd.DataFrame(
        {'a': [1,2,3,4],
         'b': ['a','b','c','d']}
    )
)
df.write.mode('overwrite').format('delta').save('./folder'){code}
Then, once it's saved, run:

{code:sql}
CREATE TABLE df USING DELTA LOCATION './folder' {code}
Then, once the table is loaded, run:
{code:sql}
OPTIMIZE df
--or
OPTIMIZE df ZORDER BY (b) {code}
As you can see, this requires several separate steps and a switch from Python to SQL.
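
For reference, these steps can already be chained from a single Python session (assuming Delta Lake 2.0+ for the `DeltaTable.optimize()` API, and a session with Delta's SQL extensions enabled), but it is still far from a single writer option:
{code:python}
from delta.tables import DeltaTable

# Option A: run OPTIMIZE against the path directly, skipping the CREATE TABLE step.
spark.sql("OPTIMIZE delta.`./folder` ZORDER BY (b)")

# Option B: use the DeltaTable Python API (Delta Lake 2.0+).
dt = DeltaTable.forPath(spark, './folder')
dt.optimize().executeCompaction()     # plain compaction
dt.optimize().executeZOrderBy('b')    # or Z-order by column 'b'
{code}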

*Future:*
I'd like to be able to do something like this:
{code:python}
import pandas as pd
df = spark.createDataFrame(
    pd.DataFrame(
        {'a':[1,2,3,4],
         'b':['a','b','c','d']}
    )
)
df.write.mode('overwrite').format('delta').options(optimize=True).save('./folder')
# or, with a Z-order column (hypothetical option name):
df.write.mode('overwrite').format('delta').options(optimize=True, zorderBy='b').save('./folder') {code}
As you can see, this is much more streamlined and keeps the code at a higher level.
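
Until something like the above exists, a small helper can approximate it. This is only a sketch with hypothetical parameter names (`optimize`, `zorder_by`), built on the Delta Lake Python API (Delta Lake 2.0+) rather than on any existing `DataFrameWriter` option:
{code:python}
from delta.tables import DeltaTable

def write_delta(spark, df, path, mode='overwrite', optimize=False, zorder_by=None):
    # Write the DataFrame as a Delta table, then optionally OPTIMIZE it.
    # `optimize` and `zorder_by` are hypothetical convenience parameters,
    # not part of the real DataFrameWriter API.
    df.write.mode(mode).format('delta').save(path)
    if optimize:
        builder = DeltaTable.forPath(spark, path).optimize()
        if zorder_by:
            builder.executeZOrderBy(zorder_by)
        else:
            builder.executeCompaction()

# e.g. write_delta(spark, df, './folder', optimize=True, zorder_by='b')
{code}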
 
Thank you.


> Streamline the options for the `.write` method of a Spark DataFrame
> -------------------------------------------------------------------
>
>                 Key: SPARK-39474
>                 URL: https://issues.apache.org/jira/browse/SPARK-39474
>             Project: Spark
>          Issue Type: Wish
>          Components: PySpark
>    Affects Versions: 3.2.1
>            Reporter: Chris Mahoney
>            Priority: Minor
>             Fix For: 3.2.1
>



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org