Posted to dev@spark.apache.org by Yikun Jiang <yi...@gmail.com> on 2021/05/04 07:22:32 UTC

Re: [DISCUSS] Multiple columns adding/replacing support in PySpark DataFrame API

@Saurabh @Mr.Powers Thanks for the input.

I personally prefer to introduce `withColumns` because it brings a more
friendly development experience than select(*).
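For context, a rough Scala sketch of the select(*) alternative mentioned
above (illustrative column names only, not from the actual proposal):

df.select(col("*"), (col("a") + 1).as("a_plus_1"), (col("b") * 2).as("b_times_2"))

Every added column has to be repeated inside a single select call, which gets
unwieldy as the column list grows.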

This is the PR to add `withColumns`:
https://github.com/apache/spark/pull/32431

Regards,
Yikun


Saurabh Chawla <s....@gmail.com> wrote on Fri, Apr 30, 2021 at 1:13 PM:

> Hi All,
>
> I also had a scenario where, at runtime, I needed to loop through a
> dataframe and call withColumn many times.
>
> To be on the safe side, I used reflection to access withColumns and
> prevent a java.lang.StackOverflowError.
>
> // Look up the private Dataset.withColumns(Seq[String], Seq[Column]) method
> // via reflection and invoke it on the base DataFrame.
> val dataSetClass = Class.forName("org.apache.spark.sql.Dataset")
> val withColumnsMethod =
>   dataSetClass.getMethod("withColumns", classOf[Seq[String]], classOf[Seq[Column]])
> withColumnsMethod.invoke(
>   baseDataFrame, columnName, columnValue).asInstanceOf[DataFrame]
>
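> For comparison, a sketch of the same call if withColumns were exposed
> publicly with the same Seq-based signature (hypothetical until the API
> actually lands):
>
> val result = baseDataFrame.withColumns(columnName, columnValue)
>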
> It would be great if we could use "withColumns" directly, as in the sketch
> above, rather than reflection code like this.
> Or, alternatively, change the code to merge the new Project with the
> existing Project in the plan, instead of adding a new Project every time
> "withColumn" is called.
>
> +1 for exposing *withColumns*
>
> Regards
> Saurabh Chawla
>
> On Thu, Apr 22, 2021 at 1:03 PM Yikun Jiang <yi...@gmail.com> wrote:
>
>> Hi, all
>>
>> *Background:*
>>
>> Currently, there is a withColumns
>> <https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402>[1]
>> method to help users/devs add/replace multiple columns at once.
>> But this method is private and not exposed as a public API, which means
>> users cannot call it directly, and it is not available in the PySpark
>> API either.
>>
>> As a dataframe user, I can only call withColumn() multiple times:
>>
>> df.withColumn("key1", col("key1")).withColumn("key2", col("key2")).withColumn("key3", col("key3"))
>>
>> rather than:
>>
>> df.withColumn(["key1", "key2", "key3"], [col("key1"), col("key2"), col("key3")])
>>
>> Multiple calls add cost both to the developer experience and to
>> performance; in a PySpark scenario especially, multiple calls mean
>> multiple py4j round trips.
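>>
>> As a concrete illustration, a common workaround today is to fold over the
>> column list; this Scala sketch (hypothetical column names) still issues
>> one withColumn call, and in PySpark one py4j round trip, per column:
>>
>> val names = Seq("key1", "key2", "key3")
>> val result = names.foldLeft(df)((acc, name) => acc.withColumn(name, col(name)))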
>>
>> As mentioned
>> <https://github.com/apache/spark/pull/32276#issuecomment-824461143> by
>> @Hyukjin, there were some previous discussions on SPARK-12225
>> <https://issues.apache.org/jira/browse/SPARK-12225> [2].
>>
>> [1]
>> https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402
>> [2] https://issues.apache.org/jira/browse/SPARK-12225
>>
>> *Potential solutions:*
>> There appear to be two potential solutions if we want to support this:
>>
>> 1. Introduce a *withColumns* API for Scala/Python.
>> A separate public withColumns API would be added to the Scala and Python
>> APIs (a possible shape is sketched after this list).
>>
>> 2. Make withColumn accept either a *single col* or a *list of cols*.
>> I did an experimental try for PySpark in
>> https://github.com/apache/spark/pull/32276
>> but, as Maciej said
>> <https://github.com/apache/spark/pull/32276#pullrequestreview-641280217>,
>> it would bring some confusion with naming.
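>>
>> For option 1, one possible shape is sketched below (this only mirrors the
>> existing private method; the final signature may well differ):
>>
>> def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame
>>
>> so that the example above becomes:
>>
>> df.withColumns(Seq("key1", "key2", "key3"), Seq(col("key1"), col("key2"), col("key3")))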
>>
>>
>> Thanks for reading; feel free to reply if you have any other concerns or
>> suggestions!
>>
>>
>> Regards,
>> Yikun
>>
>

Re: [DISCUSS] Multiple columns adding/replacing support in PySpark DataFrame API

Posted by Паша <pa...@gmail.com>.
For this purpose I've created my own implicit withColumnsRenamed, which
just accepts a map of string→string and calls rename multiple times.
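A minimal Scala sketch of such an implicit (illustrative names only, not
the exact code):

import org.apache.spark.sql.DataFrame

object DataFrameImplicits {
  implicit class RichDataFrame(val df: DataFrame) extends AnyVal {
    // Apply withColumnRenamed once per (from -> to) entry in the map.
    def withColumnsRenamed(renames: Map[String, String]): DataFrame =
      renames.foldLeft(df) { case (acc, (from, to)) =>
        acc.withColumnRenamed(from, to)
      }
  }
}

// Usage: import DataFrameImplicits._; df.withColumnsRenamed(Map("old" -> "new"))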
