Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/03 20:55:14 UTC
[GitHub] [spark] drernie commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
drernie commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-1029392644
My experience (and others') suggests that repeatedly calling withColumn is highly inefficient:
https://stackoverflow.com/questions/41400504/spark-scala-repeated-calls-to-withcolumn-using-the-same-function-on-multiple-c/41400588#41400588
The suggested alternative is using select in a very non-obvious way:
```python
# F is pyspark.sql.functions; windowval is a Window spec defined elsewhere
from pyspark.sql import functions as F

df.select(
    "*",  # keep all existing columns
    *[
        F.sum(col).over(windowval).alias(col_name)
        for col, col_name in zip(["A", "B", "C"], ["cumA", "cumB", "cumC"])
    ],
)
```
That select pattern doesn't even seem to be documented for Python:
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.select.html
I would greatly appreciate this API being made available, as it would improve both the performance and reliability of my notebooks.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
For additional commands, e-mail: reviews-help@spark.apache.org