Posted to issues@spark.apache.org by "Chris Mahoney (Jira)" <ji...@apache.org> on 2022/06/15 02:38:00 UTC
[jira] [Updated] (SPARK-39474) Streamline the options for the `.write` method of a Spark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-39474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris Mahoney updated SPARK-39474:
----------------------------------
Description:
Hi Team!
I'd like a much easier way to optimize my `delta` tables. Specifically, I am referring to the SQL command `OPTIMIZE <table>`.
Let me show you the differences:
*Current:*
First, run:
{code:python}
import pandas as pd

# Assumes an active SparkSession named `spark` with Delta Lake enabled.
df = spark.createDataFrame(
    pd.DataFrame(
        {'a': [1, 2, 3, 4],
         'b': ['a', 'b', 'c', 'd']}
    )
)
df.write.mode('overwrite').format('delta').save('./folder')
{code}
Then, once it's saved, run:
{code:sql}
CREATE TABLE df USING DELTA LOCATION './folder'
{code}
Then, once the table is created, run:
{code:sql}
OPTIMIZE df
-- or
OPTIMIZE df ZORDER BY (b)
{code}
As you can see, this takes several separate steps.
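For reference, the steps above can be bundled into one helper today. This is only a rough sketch: `write_and_optimize` and `optimize_statement` are names made up for illustration, not part of Spark or Delta Lake, and it assumes a `spark` session with Delta Lake in an environment where the `OPTIMIZE` command is available. It simply issues the same write, `CREATE TABLE`, and `OPTIMIZE` statements shown above.

```python
from typing import Optional, Sequence


def optimize_statement(table: str, zorder_by: Optional[Sequence[str]] = None) -> str:
    """Build the OPTIMIZE statement, optionally with a ZORDER BY clause."""
    stmt = f"OPTIMIZE {table}"
    if zorder_by:
        stmt += " ZORDER BY (" + ", ".join(zorder_by) + ")"
    return stmt


def write_and_optimize(spark, df, path: str, table: str,
                       zorder_by: Optional[Sequence[str]] = None) -> None:
    """Hypothetical helper: write `df` as Delta, register it, then optimize it.

    `spark` is expected to be a SparkSession and `df` a Spark DataFrame;
    both are passed in, so nothing here imports pyspark directly.
    """
    # Step 1: write the DataFrame to the target folder in Delta format.
    df.write.mode("overwrite").format("delta").save(path)
    # Step 2: register a table over that location (no-op if it already exists).
    spark.sql(f"CREATE TABLE IF NOT EXISTS {table} USING DELTA LOCATION '{path}'")
    # Step 3: run OPTIMIZE, with ZORDER BY when columns are given.
    spark.sql(optimize_statement(table, zorder_by))
```

With this in place, the whole flow becomes `write_and_optimize(spark, df, './folder', 'df', zorder_by=['b'])`, but it still has to be copied into every project, which is why a built-in option would be nicer.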
*Future:*
I'd like to be able to do something like this:
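{code:python}
import pandas as pd

df = spark.createDataFrame(
    pd.DataFrame(
        {'a': [1, 2, 3, 4],
         'b': ['a', 'b', 'c', 'd']}
    )
)
# Proposed only: these options do not exist today.
df.write.mode('overwrite').format('delta').options(optimize=True).save('./folder')
# or, to Z-order by column `b` (`zorderBy` is a placeholder name for the proposed option)
df.write.mode('overwrite').format('delta').options(optimize=True, zorderBy='b').save('./folder')
{code}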
As you can see, it's much more streamlined and keeps the code at a higher level.
Thank you.
> Streamline the options for the `.write` method of a Spark DataFrame
> -------------------------------------------------------------------
>
> Key: SPARK-39474
> URL: https://issues.apache.org/jira/browse/SPARK-39474
> Project: Spark
> Issue Type: Wish
> Components: PySpark
> Affects Versions: 3.2.1
> Reporter: Chris Mahoney
> Priority: Minor
> Fix For: 3.2.1