You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by "WeichenXu123 (via GitHub)" <gi...@apache.org> on 2023/09/12 08:55:59 UTC

[GitHub] [spark] WeichenXu123 opened a new pull request, #42887: [SPARK-45130][CONNECT][ML][PYTHON] Avoid Spark connect ML model to change input pandas dataframe

WeichenXu123 opened a new pull request, #42887:
URL: https://github.com/apache/spark/pull/42887

### What changes were proposed in this pull request?

Currently, to avoid data copy, Spark connect ML model directly changes input pandas dataframe for appending prediction columns. But we can use `pandas_df.copy(deep=False)` to shallow copy it and then append prediction columns in copied dataframe. This is easier for user to use it.

### Why are the changes needed?

This makes `pyspark.ml.connect` model `transform` method has more similar behavior with `pyspark.ml` model, i.e., the input dataframe is intact after `transform` is called. Otherwise user might be surprise at the new behavior and have to change more code to migrate their workload to `pyspark.ml.connect`

### Does this PR introduce _any_ user-facing change?

Yes.
Previous behavior:
In `pyspark.ml.connect`, `model.transform` will append new columns into input pandas dataframe, and return input dataframe object.

Changed behavior:
In `pyspark.ml.connect`, `model.transform` will shallow copy input pandas dataframe and append new columns into shallow copied pandas dataframe, then return copied pandas dataframe.

### How was this patch tested?

Unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42887: [SPARK-45130][CONNECT][ML][PYTHON] Avoid Spark connect ML model to change input pandas dataframe

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #42887:
URL: https://github.com/apache/spark/pull/42887#discussion_r1324374674


##########
python/pyspark/ml/tests/connect/test_legacy_mode_pipeline.py:
##########
@@ -19,6 +19,7 @@
 import tempfile
 import unittest
 import numpy as np
+import pandas as pd

Review Comment:
   ditto



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng closed pull request #42887: [SPARK-45130][CONNECT][ML][PYTHON] Avoid Spark connect ML model to change input pandas dataframe

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.

zhengruifeng closed pull request #42887: [SPARK-45130][CONNECT][ML][PYTHON] Avoid Spark connect ML model to change input pandas dataframe
URL: https://github.com/apache/spark/pull/42887


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org