Posted to reviews@spark.apache.org by "sergioschena-db (via GitHub)" <gi...@apache.org> on 2023/08/07 19:03:40 UTC

[GitHub] [spark] sergioschena-db opened a new pull request, #42382: [WIP][ML] Remove usage of RDD APIs for loads in spark-ml

sergioschena-db opened a new pull request, #42382:
URL: https://github.com/apache/spark/pull/42382

### What changes were proposed in this pull request?

The DataFrame-based spark-ml module still relies on RDD APIs to load the metadata of saved models: it calls `sc.textFile` instead of `spark.read.text`.
I propose refactoring the code to pass a `SparkSession` instead of a `SparkContext` to the internal reader classes, in both the Scala and Python APIs.
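
To illustrate the difference, here is a minimal sketch in Python of the two read styles. It assumes the metadata file holds one JSON line. The helper names `load_metadata_rdd` and `load_metadata_df` are hypothetical and are not part of Spark. Both functions are duck-typed: they only assume that `sc.textFile(path).first()` returns a string and that `spark.read.text(path).first()` returns a row whose first field is that line.

```python
import json


def load_metadata_rdd(sc, metadata_path):
    """RDD-style read: `sc` is expected to behave like a SparkContext."""
    # textFile yields an RDD of lines; first() returns the first line as a str.
    line = sc.textFile(metadata_path).first()
    return json.loads(line)


def load_metadata_df(spark, metadata_path):
    """DataFrame-style read: `spark` is expected to behave like a SparkSession."""
    # read.text yields a DataFrame with a single string column;
    # first() returns a Row, and index 0 is that column.
    row = spark.read.text(metadata_path).first()
    return json.loads(row[0])
```

Both helpers return the same parsed dictionary; only the entry point object changes, which is why the reader classes would need a `SparkSession` rather than a `SparkContext`.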

### Why are the changes needed?

The spark-ml module is supposed to use only DataFrame APIs, so using them for loading model metadata as well is more consistent.

### Does this PR introduce _any_ user-facing change?

The changes should affect only internal Loader classes.

### How was this patch tested?

Unit tests run successfully.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maziyarpanahi commented on a diff in pull request #42382: [ML] Remove usage of RDD APIs for load/save in spark-ml

Posted by "maziyarpanahi (via GitHub)" <gi...@apache.org>.

maziyarpanahi commented on code in PR #42382:
URL: https://github.com/apache/spark/pull/42382#discussion_r1330333290


##########
mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala:
##########
@@ -405,12 +405,14 @@ private[ml] object DefaultParamsWriter {
   def saveMetadata(
       instance: Params,
       path: String,
-      sc: SparkContext,

Review Comment:
   @srowen @WeichenXu123 As we move towards Spark 4.0, I would appreciate it if we could all consider the compatibility of third-party libraries beyond those that Databricks plans to deprecate.




[GitHub] [spark] sergioschena-db commented on a diff in pull request #42382: [ML] Remove usage of RDD APIs for load/save in spark-ml

Posted by "sergioschena-db (via GitHub)" <gi...@apache.org>.

sergioschena-db commented on code in PR #42382:
URL: https://github.com/apache/spark/pull/42382#discussion_r1341006124


##########
python/pyspark/ml/util.py:
##########
@@ -437,7 +437,7 @@ def extractJsonParams(instance: "Params", skipParams: Sequence[str]) -> Dict[str
     def saveMetadata(
         instance: "Params",
         path: str,
-        sc: SparkContext,
+        sparkSession: SparkSession,

Review Comment:
   What if I keep the same method signature, with `sc`, and use the builder inside the method to obtain the SparkSession object?
   ```
       val sparkSession = SparkSession.builder.getOrCreate()
   ```




[GitHub] [spark] sergioschena-db commented on a diff in pull request #42382: [ML] Remove usage of RDD APIs for load/save in spark-ml

Posted by "sergioschena-db (via GitHub)" <gi...@apache.org>.

sergioschena-db commented on code in PR #42382:
URL: https://github.com/apache/spark/pull/42382#discussion_r1330341091


##########
mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala:
##########
@@ -405,12 +405,14 @@ private[ml] object DefaultParamsWriter {
   def saveMetadata(
       instance: Params,
       path: String,
-      sc: SparkContext,

Review Comment:
   @srowen I haven't had the chance to make progress on this one.




[GitHub] [spark] WeichenXu123 commented on pull request #42382: [ML] Remove usage of RDD APIs for load/save in spark-ml

Posted by "WeichenXu123 (via GitHub)" <gi...@apache.org>.

WeichenXu123 commented on PR #42382:
URL: https://github.com/apache/spark/pull/42382#issuecomment-1727681685

   @zhengruifeng 
   
   Can we make the `saveMetadata` interface support both a `sparkContext` and a `sparkSession` argument?
   Within the Spark repo, we would always pass `sparkSession` as the argument.
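
A rough sketch of what "support both" could look like on the Python side, under stated assumptions: the `SparkContext` and `SparkSession` classes below are hypothetical stand-ins defined only so the example runs without PySpark, and `to_session` and `save_metadata` are illustrative names, not the actual pyspark.ml.util functions.

```python
class SparkContext:
    """Hypothetical stand-in for pyspark.SparkContext."""


class SparkSession:
    """Hypothetical stand-in for pyspark.sql.SparkSession."""

    def __init__(self, spark_context):
        self.sparkContext = spark_context


def to_session(sc_or_session):
    """Normalize either argument type to a session."""
    if isinstance(sc_or_session, SparkSession):
        return sc_or_session
    if isinstance(sc_or_session, SparkContext):
        # With real PySpark this step would more likely go through
        # SparkSession.builder.getOrCreate(); here we just wrap the stub.
        return SparkSession(sc_or_session)
    raise TypeError(
        f"expected SparkContext or SparkSession, got {type(sc_or_session).__name__}"
    )


def save_metadata(instance, path, sc_or_session):
    """Hypothetical writer entry point that tolerates both caller styles."""
    spark = to_session(sc_or_session)
    # A real implementation would write the metadata through `spark` here.
    return spark
```

With this shape, third-party callers that still pass a context keep working, while code inside the Spark repo can pass the session directly.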



[GitHub] [spark] srowen commented on pull request #42382: [ML] Remove usage of RDD APIs for load/save in spark-ml

Posted by "srowen (via GitHub)" <gi...@apache.org>.

srowen commented on PR #42382:
URL: https://github.com/apache/spark/pull/42382#issuecomment-1740811949

   (This needs a JIRA too)
   I think that's reasonable too. I _think_ it wouldn't be breaking to make this final option a default one (default of None/null) in all languages, so callers aren't forced to provide it
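
One possible reading of this suggestion, sketched in Python with hypothetical names throughout: the session becomes a trailing parameter that defaults to `None`, so existing positional callers are unaffected. `FakeContext`, `FakeSession`, `derive_session`, and the parameter name `spark_session` are assumptions made for illustration and do not come from the PR.

```python
class FakeSession:
    """Hypothetical stand-in for a SparkSession."""

    def __init__(self, label):
        self.label = label


class FakeContext:
    """Hypothetical stand-in for a SparkContext."""


def derive_session(sc):
    """Hypothetical helper: obtain a session when the caller gave only `sc`."""
    return FakeSession("derived-from-sc")


def save_metadata(instance, path, sc, extra_metadata=None, spark_session=None):
    """Old positional arguments stay in place; the new one is optional and last."""
    if spark_session is None:
        # Existing callers land here and never need to know about the new argument.
        spark_session = derive_session(sc)
    # A real implementation would write the metadata via spark_session.
    return spark_session.label
```

Because the new argument is last and optional, a call such as `save_metadata(model, path, sc)` behaves as before, and updated callers opt in with `spark_session=...`.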



[GitHub] [spark] zhengruifeng commented on pull request #42382: [ML] Remove usage of RDD APIs for load/save in spark-ml

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.

zhengruifeng commented on PR #42382:
URL: https://github.com/apache/spark/pull/42382#issuecomment-1672974095

   Yeah, I also feel it may break downstream libraries, even though it is an internal change.
   
   also cc @WeichenXu123 



[GitHub] [spark] WeichenXu123 commented on a diff in pull request #42382: [ML] Remove usage of RDD APIs for load/save in spark-ml

Posted by "WeichenXu123 (via GitHub)" <gi...@apache.org>.

WeichenXu123 commented on code in PR #42382:
URL: https://github.com/apache/spark/pull/42382#discussion_r1290838126


##########
mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala:
##########
@@ -405,12 +405,14 @@ private[ml] object DefaultParamsWriter {
   def saveMetadata(
       instance: Params,
       path: String,
-      sc: SparkContext,

Review Comment:
   But xgboost4j-spark is not widely used (the recommended replacement is the Python `xgboost.spark`), and we are going to deprecate and remove it from Databricks Runtime.





[GitHub] [spark] srowen commented on a diff in pull request #42382: [ML] Remove usage of RDD APIs for load/save in spark-ml

Posted by "srowen (via GitHub)" <gi...@apache.org>.

srowen commented on code in PR #42382:
URL: https://github.com/apache/spark/pull/42382#discussion_r1288721492


##########
python/pyspark/ml/util.py:
##########
@@ -437,7 +437,7 @@ def extractJsonParams(instance: "Params", skipParams: Sequence[str]) -> Dict[str
     def saveMetadata(
         instance: "Params",
         path: str,
-        sc: SparkContext,
+        sparkSession: SparkSession,

Review Comment:
   Same comment here about retaining compatibility



##########
mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala:
##########
@@ -405,12 +405,14 @@ private[ml] object DefaultParamsWriter {
   def saveMetadata(
       instance: Params,
       path: String,
-      sc: SparkContext,

Review Comment:
   I wonder if we should retain the existing SparkContext method. Third-party libraries would still use it unless they later make a change like the one you're making to Spark ML, so removing it would break them, when it's easy enough to retain (and deprecate?) this shared method. The same goes for loadMetadata.
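
As a minimal sketch of the "retain and deprecate" idea in Python: the old entry point stays, emits a `DeprecationWarning`, and forwards to the new one. The function names and the way a session is obtained from `sc` are hypothetical placeholders, not the actual Spark code.

```python
import warnings


def save_metadata_with_session(instance, path, spark):
    """New entry point (hypothetical name) that takes a session-like object."""
    # A real implementation would write the metadata via `spark`.
    return {"path": path, "spark": spark}


def save_metadata(instance, path, sc):
    """Old entry point, kept so third-party callers do not break."""
    warnings.warn(
        "save_metadata(sc) is deprecated; use save_metadata_with_session",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not this wrapper
    )
    # Placeholder assumption: some session can be derived from the context.
    spark = getattr(sc, "session", sc)
    return save_metadata_with_session(instance, path, spark)
```

Callers of the old function get the same result plus a warning, which gives downstream libraries a release cycle to migrate.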




[GitHub] [spark] srowen commented on a diff in pull request #42382: [ML] Remove usage of RDD APIs for load/save in spark-ml

Posted by "srowen (via GitHub)" <gi...@apache.org>.

srowen commented on code in PR #42382:
URL: https://github.com/apache/spark/pull/42382#discussion_r1290838818


##########
mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala:
##########
@@ -405,12 +405,14 @@ private[ml] object DefaultParamsWriter {
   def saveMetadata(
       instance: Params,
       path: String,
-      sc: SparkContext,

Review Comment:
   I don't think that much is directly relevant to Spark; there is just no reason to break compatibility when it's easy to keep the existing method too




Re: [PR] [ML] Remove usage of RDD APIs for load/save in spark-ml [spark]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on PR #42382:
URL: https://github.com/apache/spark/pull/42382#issuecomment-1882034172

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!



[GitHub] [spark] srowen commented on a diff in pull request #42382: [ML] Remove usage of RDD APIs for load/save in spark-ml

Posted by "srowen (via GitHub)" <gi...@apache.org>.

srowen commented on code in PR #42382:
URL: https://github.com/apache/spark/pull/42382#discussion_r1330338378


##########
mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala:
##########
@@ -405,12 +405,14 @@ private[ml] object DefaultParamsWriter {
   def saveMetadata(
       instance: Params,
       path: String,
-      sc: SparkContext,

Review Comment:
   @sergioschena-db ping for updates on this PR




[GitHub] [spark] zhengruifeng commented on pull request #42382: [ML] Remove usage of RDD APIs for load/save in spark-ml

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.

zhengruifeng commented on PR #42382:
URL: https://github.com/apache/spark/pull/42382#issuecomment-1728616265

   What about adding an implicit conversion `sc -> spark` in ml?
   ml would only use `spark`, and third-party libraries would need to import this implicit conversion.
   
   Or
   
   just keep two `saveMetadata` methods?



[GitHub] [spark] WeichenXu123 commented on a diff in pull request #42382: [ML] Remove usage of RDD APIs for load/save in spark-ml

Posted by "WeichenXu123 (via GitHub)" <gi...@apache.org>.

WeichenXu123 commented on code in PR #42382:
URL: https://github.com/apache/spark/pull/42382#discussion_r1290837783


##########
mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala:
##########
@@ -405,12 +405,14 @@ private[ml] object DefaultParamsWriter {
   def saveMetadata(
       instance: Params,
       path: String,
-      sc: SparkContext,

Review Comment:
   Yes, this might be an issue; e.g. the xgboost4j-spark library is affected:
   https://github.com/dmlc/xgboost/blob/d6385355815005625115b648f6b7dc861eacd47e/jvm-packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/XGBoostClassifier.scala#L494




[GitHub] [spark] srowen commented on pull request #42382: [ML] Remove usage of RDD APIs for load/save in spark-ml

Posted by "srowen (via GitHub)" <gi...@apache.org>.

srowen commented on PR #42382:
URL: https://github.com/apache/spark/pull/42382#issuecomment-1671614095

   See https://spark.apache.org/contributing.html - go ahead and make and link a JIRA.
   I think this would target Spark 4.0?
   Does this relate to Spark Connect, like does it improve compatibility by going through the DF API?
   It seems like an OK change just on principle, as long as it doesn't break things unnecessarily



Re: [PR] [ML] Remove usage of RDD APIs for load/save in spark-ml [spark]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] closed pull request #42382: [ML] Remove usage of RDD APIs for load/save in spark-ml
URL: https://github.com/apache/spark/pull/42382

