Posted to dev@spark.apache.org by Yikun Jiang <yi...@gmail.com> on 2021/04/22 07:33:19 UTC

[DISCUSS] Multiple columns adding/replacing support in PySpark DataFrame API

Hi, all

*Background:*

Currently, there is a withColumns
<https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402>[1]
method that helps users/devs add or replace multiple columns at once.
But this method is private and not exposed as a public API, which means
users cannot call it directly, and it is also not supported in the
PySpark API.

As a DataFrame user, I can only call withColumn() multiple times:

df.withColumn("key1", col("key1")).withColumn("key2",
col("key2")).withColumn("key3", col("key3"))

rather than:

df.withColumn(["key1", "key2", "key3"], [col("key1"), col("key2"), col("key3")])

Multiple calls incur a higher cost in both developer experience and
performance: each call returns a new DataFrame and adds another projection
to the query plan, and in a PySpark scenario each call also means another
Py4J round trip.

As mentioned
<https://github.com/apache/spark/pull/32276#issuecomment-824461143> by
@Hyukjin, there were some previous discussions on SPARK-12225
<https://issues.apache.org/jira/browse/SPARK-12225> [2].

[1]
https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402
[2] https://issues.apache.org/jira/browse/SPARK-12225

*Potential solution:*
It looks like there are two potential solutions if we want to support this:

1. Introduce a *withColumns* API for Scala/Python.
A separate public withColumns API would be added to the Scala and Python
APIs (a rough sketch of a possible shape is included after this list).

2. Make withColumn accept a *single col* as well as a *list of cols*.
I made an experimental attempt at this for PySpark in
https://github.com/apache/spark/pull/32276
As Maciej pointed out
<https://github.com/apache/spark/pull/32276#pullrequestreview-641280217>, it
would bring some confusion with the naming.
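
For reference, a minimal sketch of what option 1 could look like, assuming
the same (colNames, cols) shape as the existing private method in [1]. The
helper name and the select-based implementation below are purely
illustrative; this simplified version only adds columns and does not handle
replacing existing ones, which the private Dataset.withColumns does:

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

object WithColumnsSketch {
  implicit class RichDataFrame(df: DataFrame) {
    // Add several columns with a single projection instead of N withColumn calls.
    def withColumnsSketch(colNames: Seq[String], cols: Seq[Column]): DataFrame = {
      require(colNames.length == cols.length,
        "colNames and cols must have the same number of elements")
      val newCols = colNames.zip(cols).map { case (name, c) => c.as(name) }
      df.select((col("*") +: newCols): _*)
    }
  }
}

// Usage (adding new columns):
// import WithColumnsSketch._
// df.withColumnsSketch(Seq("key1", "key2", "key3"),
//                      Seq(lit("v1"), lit("v2"), lit("v3")))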


Thanks for reading; feel free to reply if you have any other concerns
or suggestions!


Regards,
Yikun

Re: [DISCUSS] Multiple columns adding/replacing support in PySpark DataFrame API

Posted by Паша <pa...@gmail.com>.
For this purpose I've created my own implicit withColumnsRenamed, which
simply accepts a Map of String→String and calls rename once per entry
(sketched below).
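
A minimal sketch of that kind of helper (the names here are illustrative,
not the original implementation):

import org.apache.spark.sql.DataFrame

object RenameHelpers {
  implicit class WithColumnsRenamedOps(df: DataFrame) {
    // Rename several columns in one go from a Map of oldName -> newName by
    // folding withColumnRenamed over the entries.
    def withColumnsRenamed(renames: Map[String, String]): DataFrame =
      renames.foldLeft(df) { case (acc, (from, to)) =>
        acc.withColumnRenamed(from, to)
      }
  }
}

// Usage:
// import RenameHelpers._
// df.withColumnsRenamed(Map("old1" -> "new1", "old2" -> "new2"))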

Re: [DISCUSS] Multiple columns adding/replacing support in PySpark DataFrame API

Posted by Yikun Jiang <yi...@gmail.com>.
@Saurabh @Mr.Powers Thanks for the input.

I personally prefer introducing `withColumns` because it brings a
friendlier development experience than select(*).

This is the PR to add `withColumns`:
https://github.com/apache/spark/pull/32431

Regards,
Yikun


Re: [DISCUSS] Multiple columns adding/replacing support in PySpark DataFrame API

Posted by Matthew Powers <ma...@gmail.com>.
Thanks for starting this good discussion.  You can add multiple columns
with select to avoid calling withColumn multiple times:

val newCols = Seq(col("*"), lit("val1").as("key1"), lit("val2").as("key2"))
df.select(newCols: _*).show()

withColumns would be a nice interface for less technical Spark users.

Here's a related discussion
<http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-SQL-Python-Scala-and-R-API-Consistency-td30620.html>
on why the maintainers are sometimes hesitant to expand the surface area of
the Spark API (the maintainers are doing a great job keeping the API clean).

As Maciej suggested in that thread, I created a separate project called bebe
<https://github.com/MrPowers/bebe> to expose developer-friendly functions
that the maintainers don't want to include in Spark.  If the Spark
maintainers decide that they don't want to add withColumns to the Spark
API, we can at least add it to bebe.


Re: [DISCUSS] Multiple columns adding/replacing support in PySpark DataFrame API

Posted by Saurabh Chawla <s....@gmail.com>.
Hi All,

I also had a scenario where, at runtime, I needed to loop over a DataFrame
and call withColumn many times.

To be on the safe side, I used reflection to access withColumns and avoid a
java.lang.StackOverflowError.

// Look up the private Dataset.withColumns(Seq[String], Seq[Column]) via reflection
val dataSetClass = Class.forName("org.apache.spark.sql.Dataset")
val newConfigurationMethod =
  dataSetClass.getMethod("withColumns", classOf[Seq[String]], classOf[Seq[Column]])
// Invoke it on the base DataFrame with the column names and Column expressions
newConfigurationMethod.invoke(
  baseDataFrame, columnName, columnValue).asInstanceOf[DataFrame]

It would be great if we could use "withColumns" directly rather than
reflection code like this, or alternatively change the code to merge the
new projection with the existing projection in the plan instead of adding a
new projection every time we call "withColumn".

+1 for exposing the *withColumns*

Regards
Saurabh Chawla
