Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/03 20:55:14 UTC

[GitHub] [spark] drernie commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

drernie commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-1029392644


   My experience (and that of others) suggests that repeatedly calling withColumn is highly inefficient:
   
   https://stackoverflow.com/questions/41400504/spark-scala-repeated-calls-to-withcolumn-using-the-same-function-on-multiple-c/41400588#41400588
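   
   For context, a minimal sketch of the repeated-withColumn pattern being discouraged there (df and windowval are illustrative names, defined as in the select example below):
   ```
   from pyspark.sql import functions as F
   
   # Each withColumn call adds another projection to the query plan, which
   # the analyzer must re-resolve; long chains of these get slow.
   for col, col_name in zip(["A", "B", "C"], ["cumA", "cumB", "cumC"]):
       df = df.withColumn(col_name, F.sum(col).over(windowval))
   ```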
   
   The suggested alternative is to use select in a rather non-obvious way:
   ```
   # F is pyspark.sql.functions; df and windowval (a Window spec) are
   # assumed to be defined earlier.
   df.select(
       "*",  # keep all existing columns
       *[
           F.sum(col).over(windowval).alias(col_name)
           for col, col_name in zip(["A", "B", "C"], ["cumA", "cumB", "cumC"])
       ]
   )
   ```
   This usage doesn't even seem to be documented for Python:
   https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.select.html
   
   I would greatly appreciate this API being made available; it would improve both the performance and the reliability of my notebooks.
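   
   For illustration, a sketch of what a multi-column API could look like from Python. The name withColumns and the dict-based signature here are my assumptions, not necessarily what this PR implements:
   ```
   # Hypothetical API shape: one pass over a mapping of new column names
   # to Column expressions, instead of N chained withColumn calls.
   df.withColumns({
       col_name: F.sum(col).over(windowval)
       for col, col_name in zip(["A", "B", "C"], ["cumA", "cumB", "cumC"])
   })
   ```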


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


