Posted to reviews@spark.apache.org by "wbo4958 (via GitHub)" <gi...@apache.org> on 2024/02/23 12:06:55 UTC

[PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

wbo4958 opened a new pull request, #45232:
URL: https://github.com/apache/spark/pull/45232

   ### What changes were proposed in this pull request?
   
   Support stage-level scheduling for PySpark connect DataFrame APIs (mapInPandas and mapInArrow).
   
   
   ### Why are the changes needed?
   
   This is a follow-up to https://github.com/apache/spark/pull/44852.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. Users can pass a ResourceProfile to mapInPandas/mapInArrow through the Spark Connect PySpark client.
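   
   For illustration, a minimal sketch of the user-facing flow (assuming `spark` is already bound to an active Spark Connect session; this mirrors the manual tests shown later in this thread rather than any new API surface):
   
   ```python
   from pyspark.resource import TaskResourceRequests, ResourceProfileBuilder
   
   # Build a ResourceProfile on the client; no SparkContext is required.
   treqs = TaskResourceRequests().cpus(2)
   rp = ResourceProfileBuilder().require(treqs).build
   
   df = spark.range(0, 100, 1, 4)
   # Pass the profile so the mapInArrow stage is scheduled with 2 CPUs per task.
   df.mapInArrow(lambda it: it, df.schema, False, rp).collect()
   ```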
   
   
   ### How was this patch tested?
   
   Pass the CIs and manual tests.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "wbo4958 (via GitHub)" <gi...@apache.org>.
wbo4958 commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1517085186


##########
python/pyspark/resource/tests/test_connect_resources.py:
##########
@@ -0,0 +1,46 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import unittest
+
+from pyspark.resource import ResourceProfileBuilder, TaskResourceRequests
+from pyspark.sql import SparkSession
+
+
+class ResourceProfileTests(unittest.TestCase):
+    def test_profile_before_sc_for_connect(self):
+        rpb = ResourceProfileBuilder()
+        treqs = TaskResourceRequests().cpus(2)
+        # no exception for building ResourceProfile
+        rp = rpb.require(treqs).build

Review Comment:
   Yes, added the check in the test.
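   
   A rough sketch of the kind of check being referred to (a hypothetical example, not the exact test body; the real assertions live in `python/pyspark/resource/tests/test_connect_resources.py`):
   
   ```python
   from pyspark.resource import ResourceProfileBuilder, TaskResourceRequests
   
   rpb = ResourceProfileBuilder()
   treqs = TaskResourceRequests().cpus(2)
   rp = rpb.require(treqs).build
   
   # The task resource requests should be readable back from the built profile,
   # even before any session or profile id has been created.
   assert rp.taskResources["cpus"].amount == 2.0
   ```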





Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "wbo4958 (via GitHub)" <gi...@apache.org>.
wbo4958 commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1519241310


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -892,6 +893,9 @@ message MapPartitions {
 
   // (Optional) Whether to use barrier mode execution or not.
   optional bool is_barrier = 3;
+
+  // (Optional) ResourceProfile id used for the stage level scheduling.
+  optional int32 profile_id = 4;

Review Comment:
   I'm not quite following you on "by uuids".
   
   The basic implementation is:
   
   1. The client creates a ResourceProfile.
   2. When the profile ID of that ResourceProfile is accessed for the first time, the client asks the server to create the ResourceProfile and add it to the server-side ResourceProfileManager; the server returns the profile ID to the client, which sets that id on the client-side ResourceProfile.
   3. mapInPandas/mapInArrow internally carry only the ResourceProfile id, and the server side looks the ResourceProfile up in the ResourceProfileManager by that id (see the sketch after this list).
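   
   To make the flow concrete, here is a small toy model of steps 1-3 (plain Python, not Spark or Connect code; every name in it is made up for illustration):
   
   ```python
   class ServerResourceProfileManager:
       """Stands in for the server-side ResourceProfileManager."""
   
       def __init__(self):
           self._profiles = {}
           self._next_id = 0
   
       def add_profile(self, profile_spec):
           rp_id = self._next_id
           self._next_id += 1
           self._profiles[rp_id] = profile_spec
           return rp_id  # returned to the client (step 2)
   
       def lookup(self, rp_id):
           # used when executing mapInPandas/mapInArrow (step 3)
           return self._profiles[rp_id]
   
   
   class ClientResourceProfile:
       """Stands in for the client-side ResourceProfile."""
   
       def __init__(self, task_requests, server):
           self._task_requests = task_requests
           self._server = server
           self._id = None
   
       @property
       def id(self):
           if self._id is None:  # first access triggers the round trip (step 2)
               self._id = self._server.add_profile(self._task_requests)
           return self._id
   
   
   server = ServerResourceProfileManager()
   rp = ClientResourceProfile({"cpus": 3}, server)
   profile_id = rp.id  # this integer is all the MapPartitions relation needs to carry
   assert server.lookup(profile_id) == {"cpus": 3}
   ```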





Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "wbo4958 (via GitHub)" <gi...@apache.org>.
wbo4958 commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1515485716


##########
python/pyspark/sql/connect/resource/profile.py:
##########
@@ -0,0 +1,69 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from typing import Optional, Dict
+
+from pyspark.resource import ExecutorResourceRequest, TaskResourceRequest
+
+import pyspark.sql.connect.proto as pb2
+
+
+class ResourceProfile:

Review Comment:
   This ResourceProfile is not user-facing; it is primarily used internally to create the resource profile on the server side and retrieve the associated resource profile ID.
   
   
   





Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "grundprinzip (via GitHub)" <gi...@apache.org>.
grundprinzip commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1518491571


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -892,6 +893,9 @@ message MapPartitions {
 
   // (Optional) Whether to use barrier mode execution or not.
   optional bool is_barrier = 3;
+
+  // (Optional) ResourceProfile id used for the stage level scheduling.
+  optional int32 profile_id = 4;

Review Comment:
   Shouldn't these be UUIDs? Just to make sure that any difference is intentional and there is no off-by-one?



##########
connector/connect/common/src/main/protobuf/spark/connect/base.proto:
##########
@@ -1011,5 +1039,7 @@ service SparkConnectService {
 
   // FetchErrorDetails retrieves the matched exception with details based on a provided error id.
   rpc FetchErrorDetails(FetchErrorDetailsRequest) returns (FetchErrorDetailsResponse) {}
-}
 
+  // Build ResourceProfile and get the profile id
+  rpc BuildResourceProfile(BuildResourceProfileRequest) returns (BuildResourceProfileResponse) {}

Review Comment:
   Why does this need to be an extra RPC and not just a command?



##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -892,6 +893,9 @@ message MapPartitions {
 
   // (Optional) Whether to use barrier mode execution or not.
   optional bool is_barrier = 3;
+
+  // (Optional) ResourceProfile id used for the stage level scheduling.
+  optional int32 profile_id = 4;

Review Comment:
   Why do we need an extra RPC for that? Can't you attach the resource profile directly to this call?



##########
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectBuildResourceProfileHandler.scala:
##########
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connect.service
+
+import scala.jdk.CollectionConverters.MapHasAsScala
+
+import io.grpc.stub.StreamObserver
+
+import org.apache.spark.connect.proto
+import org.apache.spark.internal.Logging
+import org.apache.spark.resource.{ExecutorResourceRequest, ResourceProfile, TaskResourceProfile, TaskResourceRequest}
+
+class SparkConnectBuildResourceProfileHandler(
+    responseObserver: StreamObserver[proto.BuildResourceProfileResponse])
+    extends Logging {
+
+  /**
+   * transform the spark connect ResourceProfile to spark ResourceProfile
+   * @param rp
+   *   Spark connect ResourceProfile
+   * @return
+   *   the Spark ResourceProfile
+   */
+  private def transformResourceProfile(rp: proto.ResourceProfile): ResourceProfile = {
+    val ereqs = rp.getExecutorResourcesMap.asScala.map { case (name, res) =>
+      name -> new ExecutorResourceRequest(
+        res.getResourceName,
+        res.getAmount,
+        res.getDiscoveryScript,
+        res.getVendor)
+    }.toMap
+    val treqs = rp.getTaskResourcesMap.asScala.map { case (name, res) =>
+      name -> new TaskResourceRequest(res.getResourceName, res.getAmount)
+    }.toMap
+
+    if (ereqs.isEmpty) {
+      new TaskResourceProfile(treqs)
+    } else {
+      new ResourceProfile(ereqs, treqs)
+    }
+  }
+
+  def handle(request: proto.BuildResourceProfileRequest): Unit = {
+    val holder = SparkConnectService
+      .getOrCreateIsolatedSession(request.getUserContext.getUserId, request.getSessionId)
+
+    val rp = transformResourceProfile(request.getProfile)
+
+    val session = holder.session
+    session.sparkContext.resourceProfileManager.addResourceProfile(rp)

Review Comment:
   How are these cleaned up?





Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "wbo4958 (via GitHub)" <gi...@apache.org>.
wbo4958 commented on PR #45232:
URL: https://github.com/apache/spark/pull/45232#issuecomment-2030753074

   > Please make sure that the follow-up work does not get lost.
   
   Sure, I will get it done.




Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "wbo4958 (via GitHub)" <gi...@apache.org>.
wbo4958 commented on PR #45232:
URL: https://github.com/apache/spark/pull/45232#issuecomment-2029532177

   > In contrast to the regular Spark API, this implementation doesn't manage the lifecycle of the resource request. Can you create a follow-up Jira that removes the resource request from the SparkContext again?
   
   Hi @grundprinzip. Done, created a [JIRA](https://issues.apache.org/jira/browse/SPARK-47668) task.
   
   




Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1510840968


##########
python/pyspark/resource/profile.py:
##########
@@ -114,14 +122,26 @@ def id(self) -> int:
         int
             A unique id of this :class:`ResourceProfile`
         """
-
-        if self._java_resource_profile is not None:
-            return self._java_resource_profile.id()
-        else:
-            raise RuntimeError(
-                "SparkContext must be created to get the id, get the id "
-                "after adding the ResourceProfile to an RDD"
-            )
+        with self._lock:
+            if self._id is None:
+                if self._java_resource_profile is not None:
+                    self._id = self._java_resource_profile.id()
+                else:
+                    from pyspark.sql import is_remote
+
+                    if is_remote():
+                        from pyspark.sql.connect.resource.profile import _ResourceProfile
+
+                        rp = _ResourceProfile(
+                            self._executor_resource_requests, self._task_resource_requests
+                        )
+                        self._id = rp.id

Review Comment:
   Sorry if I missed something, but where is this id being set?





Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "wbo4958 (via GitHub)" <gi...@apache.org>.
wbo4958 commented on PR #45232:
URL: https://github.com/apache/spark/pull/45232#issuecomment-1982322729

   > > Does this PR introduce any user-facing change?
   > > Yes, Users can pass a ResourceProfile to mapInPandas/mapInArrow through the Spark Connect PySpark client.
   > 
   > I think you are adding the ResourceProfile API to Spark Connect for anything to use, correct? I guess since Spark Connect doesn't have SparkContext support, it's only usable by these APIs? It would be nice to have more info in the description about what you are adding and whether it matches the current Python APIs exactly?
   
   There are no APIs changed or added for ResourceProfile; this PR just changes ResourceProfile internally to support Spark Connect. See [here](https://github.com/apache/spark/pull/45232/files/30531aeb7b6eb5f2235f1dabf7b568f3cf49d841#diff-245bd4d0d3aea466e32424806642d225ffa4f07e75ef1d45f7fd7a68fbb3e4c3R118-R141) and [here](https://github.com/apache/spark/pull/45232/files/30531aeb7b6eb5f2235f1dabf7b568f3cf49d841#diff-e2439904a861b6dd69efec067e87ff43070e3b673911ab286aca382897954c2bR167-R469).




Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "wbo4958 (via GitHub)" <gi...@apache.org>.
wbo4958 commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1513987011


##########
python/pyspark/resource/profile.py:
##########
@@ -114,14 +122,26 @@ def id(self) -> int:
         int
             A unique id of this :class:`ResourceProfile`
         """
-
-        if self._java_resource_profile is not None:
-            return self._java_resource_profile.id()
-        else:
-            raise RuntimeError(
-                "SparkContext must be created to get the id, get the id "
-                "after adding the ResourceProfile to an RDD"
-            )
+        with self._lock:
+            if self._id is None:
+                if self._java_resource_profile is not None:
+                    self._id = self._java_resource_profile.id()
+                else:
+                    from pyspark.sql import is_remote
+
+                    if is_remote():
+                        from pyspark.sql.connect.resource.profile import _ResourceProfile
+
+                        rp = _ResourceProfile(
+                            self._executor_resource_requests, self._task_resource_requests
+                        )
+                        self._id = rp.id
+                    else:
+                        raise RuntimeError(
+                            "SparkContext must be created to get the id, get the id "
+                            "after adding the ResourceProfile to an RDD"

Review Comment:
   The above [if](https://github.com/apache/spark/pull/45232/files/95a75af98715a750b78529c1a04867206a99dfe6#diff-245bd4d0d3aea466e32424806642d225ffa4f07e75ef1d45f7fd7a68fbb3e4c3R132) is for the connect case, while this branch is for normal RDD usage. Yeah, you're right, SQL also supports ResourceProfile, so let me change it.





Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "wbo4958 (via GitHub)" <gi...@apache.org>.
wbo4958 commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1519235937


##########
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectBuildResourceProfileHandler.scala:
##########
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connect.service
+
+import scala.jdk.CollectionConverters.MapHasAsScala
+
+import io.grpc.stub.StreamObserver
+
+import org.apache.spark.connect.proto
+import org.apache.spark.internal.Logging
+import org.apache.spark.resource.{ExecutorResourceRequest, ResourceProfile, TaskResourceProfile, TaskResourceRequest}
+
+class SparkConnectBuildResourceProfileHandler(
+    responseObserver: StreamObserver[proto.BuildResourceProfileResponse])
+    extends Logging {
+
+  /**
+   * transform the spark connect ResourceProfile to spark ResourceProfile
+   * @param rp
+   *   Spark connect ResourceProfile
+   * @return
+   *   the Spark ResourceProfile
+   */
+  private def transformResourceProfile(rp: proto.ResourceProfile): ResourceProfile = {
+    val ereqs = rp.getExecutorResourcesMap.asScala.map { case (name, res) =>
+      name -> new ExecutorResourceRequest(
+        res.getResourceName,
+        res.getAmount,
+        res.getDiscoveryScript,
+        res.getVendor)
+    }.toMap
+    val treqs = rp.getTaskResourcesMap.asScala.map { case (name, res) =>
+      name -> new TaskResourceRequest(res.getResourceName, res.getAmount)
+    }.toMap
+
+    if (ereqs.isEmpty) {
+      new TaskResourceProfile(treqs)
+    } else {
+      new ResourceProfile(ereqs, treqs)
+    }
+  }
+
+  def handle(request: proto.BuildResourceProfileRequest): Unit = {
+    val holder = SparkConnectService
+      .getOrCreateIsolatedSession(request.getUserContext.getUserId, request.getSessionId)
+
+    val rp = transformResourceProfile(request.getProfile)
+
+    val session = holder.session
+    session.sparkContext.resourceProfileManager.addResourceProfile(rp)

Review Comment:
   Yeah, neither ResourceProfile nor ResourceProfileManager has any cleanup right now. If you think we need cleanup, we can file another PR for it.





Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "grundprinzip (via GitHub)" <gi...@apache.org>.
grundprinzip commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1540499756


##########
connector/connect/common/src/main/protobuf/spark/connect/base.proto:
##########
@@ -375,6 +375,9 @@ message ExecutePlanResponse {
     // Response type informing if the stream is complete in reattachable execution.
     ResultComplete result_complete = 14;
 
+    // Response for command that creates ResourceProfile.
+    CreateResourceProfileCommandResult create_resource_profile_command_result = 20;

Review Comment:
   Isn't the next ID 17 here?





Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "wbo4958 (via GitHub)" <gi...@apache.org>.
wbo4958 commented on PR #45232:
URL: https://github.com/apache/spark/pull/45232#issuecomment-1963263080

   ## With dynamic allocation enabled.
   
   ``` bash
   start-connect-server.sh --master spark://192.168.0.106:7077 \
      --jars jars/spark-connect_2.13-4.0.0-SNAPSHOT.jar \
      --conf spark.executor.cores=4 \
      --conf spark.task.cpus=1 \
      --conf spark.dynamicAllocation.enabled=true \
      --conf spark.dynamicAllocation.maxExecutors=1
   ```
   
   The above command enables dynamic allocation and caps the maximum number of executors at 1 for testing. Then launch the Spark Connect PySpark client with:
   
   ``` bash
   pyspark --remote "sc://localhost"
   ```
   
   ### TaskResourceProfile without any specific executor request information
   
   Test code,
   
   ```python
   from pyspark.resource import ExecutorResourceRequests, TaskResourceRequests, ResourceProfileBuilder
   
   def filter_func(iterator):
       for pdf in iterator:
           yield pdf
   
   df = spark.range(0, 100, 1, 4)
   
   treqs = TaskResourceRequests().cpus(3)
   rp = ResourceProfileBuilder().require(treqs).build
   df.repartition(3).mapInArrow(lambda iter: iter, df.schema, False, rp).collect()
   ```
   
   The rp here is a TaskResourceProfile without any specific executor request information, so the executor settings fall back to the default ResourceProfile (executor.cores=4).
   
   The above code requests an extra executor with the same `executor.cores/memory` as the default ResourceProfile.
   
   ![0](https://github.com/apache/spark/assets/1320706/abd5a1ad-8564-490f-abc5-a45621496040)
   
   ![1](https://github.com/apache/spark/assets/1320706/340954c5-0dd8-42f3-9368-5dccdd3d7b62)
   
   ![2](https://github.com/apache/spark/assets/1320706/7280b9da-6346-4f81-853a-81e37aaecb9d)
   
   
   
   ### Different executor request information 
   
   ```python
   from pyspark.resource import ExecutorResourceRequests, TaskResourceRequests, ResourceProfileBuilder
   
   def filter_func(iterator):
       for pdf in iterator:
           yield pdf
   
   df = spark.range(0, 100, 1, 4)
   
   ereqs = ExecutorResourceRequests().cores(6)
   treqs = TaskResourceRequests().cpus(5)
   rp = ResourceProfileBuilder().require(treqs).require(ereqs).build
   df.repartition(3).mapInArrow(lambda iter: iter, df.schema, False, rp).collect()
   ```
   
   ![5](https://github.com/apache/spark/assets/1320706/eb104194-c156-4319-ac48-f1d199304501)
   
   ![6](https://github.com/apache/spark/assets/1320706/a514f1a7-4ffd-497d-897f-6b5f786ea86b)
   
   ![7](https://github.com/apache/spark/assets/1320706/7ade7aec-992e-44ed-ae3f-70b25c208243)
   
   




Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "wbo4958 (via GitHub)" <gi...@apache.org>.
wbo4958 commented on PR #45232:
URL: https://github.com/apache/spark/pull/45232#issuecomment-1970509831

   Hi @tgravescs, this PR changes ResourceProfile a little bit to support Connect. Could you help review it? Thanks very much.




Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "tgravescs (via GitHub)" <gi...@apache.org>.
tgravescs commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1516375405


##########
python/pyspark/resource/tests/test_connect_resources.py:
##########
@@ -0,0 +1,46 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import unittest
+
+from pyspark.resource import ResourceProfileBuilder, TaskResourceRequests
+from pyspark.sql import SparkSession
+
+
+class ResourceProfileTests(unittest.TestCase):
+    def test_profile_before_sc_for_connect(self):
+        rpb = ResourceProfileBuilder()
+        treqs = TaskResourceRequests().cpus(2)
+        # no exception for building ResourceProfile
+        rp = rpb.require(treqs).build

Review Comment:
   can the user properly see the executor/task resource requests back from the resource profile?  



##########
python/pyspark/resource/tests/test_connect_resources.py:
##########
@@ -0,0 +1,46 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import unittest
+
+from pyspark.resource import ResourceProfileBuilder, TaskResourceRequests
+from pyspark.sql import SparkSession
+
+
+class ResourceProfileTests(unittest.TestCase):
+    def test_profile_before_sc_for_connect(self):

Review Comment:
   Are any tests needed for error checking/failure cases, to make sure errors get back to the user properly?



##########
python/pyspark/resource/profile.py:
##########
@@ -114,14 +122,23 @@ def id(self) -> int:
         int
             A unique id of this :class:`ResourceProfile`
         """
+        with self._lock:
+            if self._id is None:
+                if self._java_resource_profile is not None:
+                    self._id = self._java_resource_profile.id()
+                else:
+                    from pyspark.sql import is_remote
 
-        if self._java_resource_profile is not None:
-            return self._java_resource_profile.id()
-        else:
-            raise RuntimeError(
-                "SparkContext must be created to get the id, get the id "
-                "after adding the ResourceProfile to an RDD"
-            )
+                    if is_remote():
+                        from pyspark.sql.connect.resource.profile import ResourceProfile
+
+                        rp = ResourceProfile(

Review Comment:
   I haven't reviewed much of the connect code, so I'm probably missing something. We do this to get an id but then we throw it away? I guess this is a Spark Connect ResourceProfile, and when it's actually created it's stored on the SparkContext side, so you just need the id to reference it? Adding a comment here would be good.





Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "wbo4958 (via GitHub)" <gi...@apache.org>.
wbo4958 commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1519232536


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -892,6 +893,9 @@ message MapPartitions {
 
   // (Optional) Whether to use barrier mode execution or not.
   optional bool is_barrier = 3;
+
+  // (Optional) ResourceProfile id used for the stage level scheduling.
+  optional int32 profile_id = 4;

Review Comment:
   Hi @grundprinzip, the user still needs to know the exact ResourceProfile id; if we attach the resource profile directly to this call, it seems we can't get the id back from this call.





Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "wbo4958 (via GitHub)" <gi...@apache.org>.
wbo4958 commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1519232917


##########
connector/connect/common/src/main/protobuf/spark/connect/base.proto:
##########
@@ -1011,5 +1039,7 @@ service SparkConnectService {
 
   // FetchErrorDetails retrieves the matched exception with details based on a provided error id.
   rpc FetchErrorDetails(FetchErrorDetailsRequest) returns (FetchErrorDetailsResponse) {}
-}
 
+  // Build ResourceProfile and get the profile id
+  rpc BuildResourceProfile(BuildResourceProfileRequest) returns (BuildResourceProfileResponse) {}

Review Comment:
   Hi @grundprinzip, really good suggestion; the newest commit moves it to a command.





Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "tgravescs (via GitHub)" <gi...@apache.org>.
tgravescs commented on PR #45232:
URL: https://github.com/apache/spark/pull/45232#issuecomment-1980935581

   > Does this PR introduce any user-facing change?
   > Yes, Users can pass a ResourceProfile to mapInPandas/mapInArrow through the Spark Connect PySpark client.
   
   I think you are adding the ResourceProfile API to Spark Connect for anything to use, correct? I guess since Spark Connect doesn't have SparkContext support, it's only usable by these APIs? It would be nice to have more info in the description about what you are adding and whether it matches the current Python APIs exactly?
   




Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1515455169


##########
python/pyspark/sql/connect/resource/profile.py:
##########
@@ -0,0 +1,69 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from typing import Optional, Dict
+
+from pyspark.resource import ExecutorResourceRequest, TaskResourceRequest
+
+import pyspark.sql.connect.proto as pb2
+
+
+class ResourceProfile:

Review Comment:
   is this class user-facing?
   





Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "wbo4958 (via GitHub)" <gi...@apache.org>.
wbo4958 commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1545872296


##########
connector/connect/common/src/main/protobuf/spark/connect/base.proto:
##########
@@ -375,6 +375,9 @@ message ExecutePlanResponse {
     // Response type informing if the stream is complete in reattachable execution.
     ResultComplete result_complete = 14;
 
+    // Response for command that creates ResourceProfile.
+    CreateResourceProfileCommandResult create_resource_profile_command_result = 20;

Review Comment:
   Hi @grundprinzip, thanks for reviewing; changed it to 17. Could you help review it again?





Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "wbo4958 (via GitHub)" <gi...@apache.org>.
wbo4958 commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1517085041


##########
python/pyspark/resource/tests/test_connect_resources.py:
##########
@@ -0,0 +1,46 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import unittest
+
+from pyspark.resource import ResourceProfileBuilder, TaskResourceRequests
+from pyspark.sql import SparkSession
+
+
+class ResourceProfileTests(unittest.TestCase):
+    def test_profile_before_sc_for_connect(self):

Review Comment:
   Added the error checking.





Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "wbo4958 (via GitHub)" <gi...@apache.org>.
wbo4958 commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1513991688


##########
python/pyspark/resource/profile.py:
##########
@@ -114,14 +122,26 @@ def id(self) -> int:
         int
             A unique id of this :class:`ResourceProfile`
         """
-
-        if self._java_resource_profile is not None:
-            return self._java_resource_profile.id()
-        else:
-            raise RuntimeError(
-                "SparkContext must be created to get the id, get the id "
-                "after adding the ResourceProfile to an RDD"
-            )
+        with self._lock:
+            if self._id is None:
+                if self._java_resource_profile is not None:
+                    self._id = self._java_resource_profile.id()
+                else:
+                    from pyspark.sql import is_remote
+
+                    if is_remote():
+                        from pyspark.sql.connect.resource.profile import _ResourceProfile
+
+                        rp = _ResourceProfile(
+                            self._executor_resource_requests, self._task_resource_requests
+                        )
+                        self._id = rp.id

Review Comment:
   The `self._id` is set to None by default, see https://github.com/apache/spark/pull/45232/files/95a75af98715a750b78529c1a04867206a99dfe6#diff-c570ce2f67f7eef5e5f223c6c638903d2dd364bc2ab36c59a0e7132c742682ffR65
   
   while `rp.id` is set from https://github.com/apache/spark/pull/45232/files/95a75af98715a750b78529c1a04867206a99dfe6#diff-c570ce2f67f7eef5e5f223c6c638903d2dd364bc2ab36c59a0e7132c742682ffR65





Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1514004559


##########
python/pyspark/resource/profile.py:
##########
@@ -114,14 +122,26 @@ def id(self) -> int:
         int
             A unique id of this :class:`ResourceProfile`
         """
-
-        if self._java_resource_profile is not None:
-            return self._java_resource_profile.id()
-        else:
-            raise RuntimeError(
-                "SparkContext must be created to get the id, get the id "
-                "after adding the ResourceProfile to an RDD"
-            )
+        with self._lock:
+            if self._id is None:
+                if self._java_resource_profile is not None:
+                    self._id = self._java_resource_profile.id()
+                else:
+                    from pyspark.sql import is_remote
+
+                    if is_remote():
+                        from pyspark.sql.connect.resource.profile import _ResourceProfile
+
+                        rp = _ResourceProfile(
+                            self._executor_resource_requests, self._task_resource_requests
+                        )
+                        self._id = rp.id

Review Comment:
   Ah, okay. You're getting it from `SparkConnectBuildResourceProfileHandler`.





Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1510814720


##########
connector/connect/common/src/main/protobuf/spark/connect/base.proto:
##########
@@ -967,6 +967,34 @@ message FetchErrorDetailsResponse {
   }
 }
 
+message BuildResourceProfileRequest {

Review Comment:
   This would need @grundprinzip's and @hvanhovell's review.





Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "wbo4958 (via GitHub)" <gi...@apache.org>.
wbo4958 commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1519241310


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -892,6 +893,9 @@ message MapPartitions {
 
   // (Optional) Whether to use barrier mode execution or not.
   optional bool is_barrier = 3;
+
+  // (Optional) ResourceProfile id used for the stage level scheduling.
+  optional int32 profile_id = 4;

Review Comment:
   Hi @grundprinzip,
   
   I'm not quite following you on "by uuids".
   
   The basic implementation is:
   
   1. The client creates a ResourceProfile.
   2. When the profile ID of that ResourceProfile is accessed for the first time, the client asks the server to create the ResourceProfile and add it to the server-side ResourceProfileManager; the server returns the profile ID to the client, which sets that id on the client-side ResourceProfile.
   3. mapInPandas/mapInArrow internally carry only the ResourceProfile id, and the server side looks the ResourceProfile up in the ResourceProfileManager by that id.





Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #45232:
URL: https://github.com/apache/spark/pull/45232#issuecomment-2030829152

   Merged to master.




Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1510818206


##########
python/pyspark/resource/profile.py:
##########
@@ -114,14 +122,26 @@ def id(self) -> int:
         int
             A unique id of this :class:`ResourceProfile`
         """
-
-        if self._java_resource_profile is not None:
-            return self._java_resource_profile.id()
-        else:
-            raise RuntimeError(
-                "SparkContext must be created to get the id, get the id "
-                "after adding the ResourceProfile to an RDD"
-            )
+        with self._lock:
+            if self._id is None:
+                if self._java_resource_profile is not None:
+                    self._id = self._java_resource_profile.id()
+                else:
+                    from pyspark.sql import is_remote
+
+                    if is_remote():
+                        from pyspark.sql.connect.resource.profile import _ResourceProfile
+
+                        rp = _ResourceProfile(
+                            self._executor_resource_requests, self._task_resource_requests
+                        )
+                        self._id = rp.id
+                    else:
+                        raise RuntimeError(
+                            "SparkContext must be created to get the id, get the id "
+                            "after adding the ResourceProfile to an RDD"

Review Comment:
   RDD is a concept that is hidden in Spark Connect. Let's probably avoid this term.





Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #45232:
URL: https://github.com/apache/spark/pull/45232#issuecomment-1976100589

   Looks fine in general




Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "tgravescs (via GitHub)" <gi...@apache.org>.
tgravescs commented on PR #45232:
URL: https://github.com/apache/spark/pull/45232#issuecomment-1985925035

   changes look fine to me 




Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "wbo4958 (via GitHub)" <gi...@apache.org>.
wbo4958 commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1515490084


##########
dev/sparktestsupport/modules.py:
##########
@@ -554,6 +554,7 @@ def __hash__(self):
         "pyspark.resource.profile",
         # unittests
         "pyspark.resource.tests.test_resources",
+        "pyspark.resource.tests.test_connect_resources",

Review Comment:
   This pull request already includes [pyspark.sql.tests.connect.test_resources](https://github.com/apache/spark/pull/45232/files/30531aeb7b6eb5f2235f1dabf7b568f3cf49d841#diff-1ebc67a99155114aa6bced912ce7942956ce784eeb59aadab75b82814684bd27R1031) to test the general mapInPandas/mapInArrow functionality with ResourceProfile. On the other hand, `pyspark.resource.tests.test_connect_resources` is specifically for testing special cases like creating a ResourceProfile before establishing a remote session. Therefore, it seems appropriate to keep the tests in their respective locations.





Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "wbo4958 (via GitHub)" <gi...@apache.org>.
wbo4958 commented on PR #45232:
URL: https://github.com/apache/spark/pull/45232#issuecomment-1996052736

   Hi @grundprinzip, I would be grateful if you could kindly take another look at this PR. Thx.




Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "wbo4958 (via GitHub)" <gi...@apache.org>.
wbo4958 commented on PR #45232:
URL: https://github.com/apache/spark/pull/45232#issuecomment-1989639850

   Hi @grundprinzip, Could you help review it again?




Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #45232: [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile
URL: https://github.com/apache/spark/pull/45232




Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1510824227


##########
python/pyspark/sql/connect/resource/profile.py:
##########
@@ -0,0 +1,69 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from typing import Optional, Dict
+
+from pyspark.resource import ExecutorResourceRequest, TaskResourceRequest
+
+import pyspark.sql.connect.proto as pb2
+
+
+class _ResourceProfile:

Review Comment:
   It's fine without leading `_`. The whole `connect` package isn't supposed to be external.





Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1501943441


##########
python/pyspark/resource/profile.py:
##########
@@ -99,6 +99,11 @@ def __init__(
         _exec_req: Optional[Dict[str, ExecutorResourceRequest]] = None,

Review Comment:
   Let's add a `versionchanged` directive to the `ResourceProfile` docstring, e.g.:
   
   ```
       .. versionchanged:: 4.0.0
           Supports Spark Connect.
   ```
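
   For illustration only, roughly how the directive could sit in the class docstring (the summary line below is a placeholder, not the actual `ResourceProfile` documentation):

   ``` python
   class ResourceProfile:
       """
       Placeholder summary; see python/pyspark/resource/profile.py for the real docstring.

       .. versionchanged:: 4.0.0
           Supports Spark Connect.
       """
   ```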





Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "wbo4958 (via GitHub)" <gi...@apache.org>.
wbo4958 commented on PR #45232:
URL: https://github.com/apache/spark/pull/45232#issuecomment-1963261901

   # Manual tests
   
   
   The manual tests were conducted on a Spark standalone cluster with a single worker that has 6 CPU cores.
   
   ## With dynamic allocation disabled.
   
   ``` bash
   start-connect-server.sh --master spark://192.168.0.106:7077 \
      --jars jars/spark-connect_2.13-4.0.0-SNAPSHOT.jar \
      --conf spark.executor.cores=4 \
      --conf spark.task.cpus=1 \
      --conf spark.dynamicAllocation.enabled=false
   ```
   
   The above command starts the Connect server with 1 executor that has 4 CPU cores and the default `task.cpus = 1`, so by default 4 tasks can run at a time.
   
   And then launch the spark connect pyspark client by
   
   ``` bash
   pyspark --remote "sc://localhost"
   ```
   
   1. `task.cpus=1`
   
   Test code:
   
   ``` python
   from pyspark.resource import ExecutorResourceRequests, TaskResourceRequests, ResourceProfileBuilder
   
   def filter_func(iterator):
       for pdf in iterator:
           yield pdf
   
   df = spark.range(0, 100, 1, 6)
   # Require 1 CPU per task for this stage via a ResourceProfile.
   treqs = TaskResourceRequests().cpus(1)
   rp = ResourceProfileBuilder().require(treqs).build
   df.repartition(3).mapInArrow(lambda iter: iter, df.schema, False, rp).collect()
   ```
   
   With the required `task.cpus=1` and `executor.cores=4` (no executor resources specified, so the default executor configuration is used), up to 4 tasks can run at the same time for this ResourceProfile.
   
   The entire Spark application consists of a single Spark job that is divided into two stages. The first shuffle stage comprises 6 tasks: the first 4 run simultaneously, followed by the last 2.
   
   ![1](https://github.com/apache/spark/assets/1320706/720fef7b-3a72-456f-9c60-01b86011ec84)
   
   The second ResultStage comprises 3 tasks, all of which run simultaneously since the required `task.cpus` is 1.
   
   ![2](https://github.com/apache/spark/assets/1320706/0804d2af-e25d-4f54-906e-cc367a6aa2eb)
   
   2. `task.cpus=2`
   
   Test code:
   
   ``` python
   from pyspark.resource import ExecutorResourceRequests, TaskResourceRequests, ResourceProfileBuilder
   
   def filter_func(iterator):
       for pdf in iterator:
           yield pdf
   
   df = spark.range(0, 100, 1, 6)
   # Require 2 CPUs per task for this stage.
   treqs = TaskResourceRequests().cpus(2)
   rp = ResourceProfileBuilder().require(treqs).build
   df.repartition(3).mapInArrow(lambda iter: iter, df.schema, False, rp).collect()
   ```
   
   With the required `task.cpus=2` and `executor.cores=4` (no executor resources specified, so the default executor configuration is used), up to 2 tasks can run at the same time for this ResourceProfile.
   
   The first shuffle stage behaves the same as in the first test.
   
   The second ResultStage comprises 3 tasks, so the first 2 tasks run concurrently, followed by the last task.
   
   ![3](https://github.com/apache/spark/assets/1320706/870fedb2-52f0-4a54-9b23-f37b9d2a2228)
   
   3. `task.cpus=3`
   
   Test code:
   
   ``` python
   from pyspark.resource import ExecutorResourceRequests, TaskResourceRequests, ResourceProfileBuilder
   
   def filter_func(iterator):
       for pdf in iterator:
           yield pdf
   
   df = spark.range(0, 100, 1, 6)
   # Require 3 CPUs per task for this stage.
   treqs = TaskResourceRequests().cpus(3)
   rp = ResourceProfileBuilder().require(treqs).build
   df.repartition(3).mapInArrow(lambda iter: iter, df.schema, False, rp).collect()
   ```
   
   With the required `task.cpus=3` and `executor.cores=4` (no executor resources specified, so the default executor configuration is used), only 1 task can run at a time for this ResourceProfile.
   
   The first shuffle stage behaves the same as in the first test.
   
   The second ResultStage comprises 3 tasks, all of which run serially.
   
   ![4](https://github.com/apache/spark/assets/1320706/a6b730ab-99d5-4563-a853-0682fcd3a10d)
   
   
   4. `task.cpus=5`
   
   ``` python
   from pyspark.resource import ExecutorResourceRequests, TaskResourceRequests, ResourceProfileBuilder
   
   def filter_func(iterator):
       for pdf in iterator:
           yield pdf
   
   df = spark.range(0, 100, 1, 6)
   # Require 5 CPUs per task, which exceeds the executor's 4 cores.
   treqs = TaskResourceRequests().cpus(5)
   rp = ResourceProfileBuilder().require(treqs).build
   df.repartition(3).mapInArrow(lambda iter: iter, df.schema, False, rp).collect()
   ```
   
   An exception is raised because the requested task CPUs exceed the executor's cores:
   ``` console
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/home/bobwang/github/mytools/spark.home/spark-4.0.0-SNAPSHOT-bin-wbo4958-spark/python/pyspark/sql/connect/dataframe.py", line 1763, in collect
       table, schema = self._to_table()
     File "/home/bobwang/github/mytools/spark.home/spark-4.0.0-SNAPSHOT-bin-wbo4958-spark/python/pyspark/sql/connect/dataframe.py", line 1774, in _to_table
       query = self._plan.to_proto(self._session.client)
     File "/home/bobwang/github/mytools/spark.home/spark-4.0.0-SNAPSHOT-bin-wbo4958-spark/python/pyspark/sql/connect/plan.py", line 127, in to_proto
       plan.root.CopyFrom(self.plan(session))
     File "/home/xxx/github/mytools/spark.home/spark-4.0.0-SNAPSHOT-bin-wbo4958-spark/python/pyspark/sql/connect/plan.py", line 2201, in plan
       plan.map_partitions.profile_id = self._profile.id
     File "/home/bobwang/github/mytools/spark.home/spark-4.0.0-SNAPSHOT-bin-wbo4958-spark/python/pyspark/resource/profile.py", line 132, in id
       rp = _ResourceProfile(
     File "/home/bobwang/github/mytools/spark.home/spark-4.0.0-SNAPSHOT-bin-wbo4958-spark/python/pyspark/sql/connect/resource/profile.py", line 65, in __init__
       self._id = session.client.build_resource_profile(self._remote_profile)
     File "/home/bobwang/github/mytools/spark.home/spark-4.0.0-SNAPSHOT-bin-wbo4958-spark/python/pyspark/sql/connect/client/core.py", line 1741, in build_resource_profile
       resp = self._stub.BuildResourceProfile(req)
     File "/home/bobwang/anaconda3/envs/pyspark/lib/python3.10/site-packages/grpc/_channel.py", line 1160, in __call__
       return _end_unary_response_blocking(state, call, False, None)
     File "/home/bobwang/anaconda3/envs/pyspark/lib/python3.10/site-packages/grpc/_channel.py", line 1003, in _end_unary_response_blocking
       raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
   grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
   	status = StatusCode.INTERNAL
   	details = "The number of cores per executor (=4) has to be >= the number of cpus per task = 5."
   	debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"The number of cores per executor (=4) has to be >= the number of cpus per task = 5.", grpc_status:13, created_time:"2024-02-26T10:42:37.331616664+08:00"}"
   ```
   
   




Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "wbo4958 (via GitHub)" <gi...@apache.org>.
wbo4958 commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1502014261


##########
python/pyspark/resource/profile.py:
##########
@@ -99,6 +99,11 @@ def __init__(
         _exec_req: Optional[Dict[str, ExecutorResourceRequest]] = None,

Review Comment:
   Thx, Done.





Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "wbo4958 (via GitHub)" <gi...@apache.org>.
wbo4958 commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1513993480


##########
python/pyspark/sql/connect/resource/profile.py:
##########
@@ -0,0 +1,69 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from typing import Optional, Dict
+
+from pyspark.resource import ExecutorResourceRequest, TaskResourceRequest
+
+import pyspark.sql.connect.proto as pb2
+
+
+class _ResourceProfile:

Review Comment:
   Done. Thx for the info





Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "wbo4958 (via GitHub)" <gi...@apache.org>.
wbo4958 commented on PR #45232:
URL: https://github.com/apache/spark/pull/45232#issuecomment-2030792790

   Hi @HyukjinKwon, Could you help merge it? Thx




Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "wbo4958 (via GitHub)" <gi...@apache.org>.
wbo4958 commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1517084721


##########
python/pyspark/resource/profile.py:
##########
@@ -114,14 +122,23 @@ def id(self) -> int:
         int
             A unique id of this :class:`ResourceProfile`
         """
+        with self._lock:
+            if self._id is None:
+                if self._java_resource_profile is not None:
+                    self._id = self._java_resource_profile.id()
+                else:
+                    from pyspark.sql import is_remote
 
-        if self._java_resource_profile is not None:
-            return self._java_resource_profile.id()
-        else:
-            raise RuntimeError(
-                "SparkContext must be created to get the id, get the id "
-                "after adding the ResourceProfile to an RDD"
-            )
+                    if is_remote():
+                        from pyspark.sql.connect.resource.profile import ResourceProfile
+
+                        rp = ResourceProfile(

Review Comment:
   Yes, you're right. In remote mode the Python ResourceProfile acts as a proxy for the Spark ResourceProfile on the server side.
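
   A minimal usage sketch of that proxy behaviour (illustrative only; it assumes a remote Spark session is already active, e.g. started via `pyspark --remote "sc://localhost"`):

   ``` python
   from pyspark.resource import TaskResourceRequests, ResourceProfileBuilder

   # Building the profile is purely client-side; nothing is sent to the server yet.
   rp = ResourceProfileBuilder().require(TaskResourceRequests().cpus(2)).build

   # The first access to `id` asks the Connect server to register the real
   # ResourceProfile and caches the returned id on this client-side proxy.
   profile_id = rp.id
   print(profile_id)  # later accesses reuse the cached id
   ```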





Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "wbo4958 (via GitHub)" <gi...@apache.org>.
wbo4958 commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1517084819


##########
python/pyspark/resource/profile.py:
##########
@@ -114,14 +122,23 @@ def id(self) -> int:
         int
             A unique id of this :class:`ResourceProfile`
         """
+        with self._lock:
+            if self._id is None:
+                if self._java_resource_profile is not None:
+                    self._id = self._java_resource_profile.id()
+                else:
+                    from pyspark.sql import is_remote
 
-        if self._java_resource_profile is not None:
-            return self._java_resource_profile.id()
-        else:
-            raise RuntimeError(
-                "SparkContext must be created to get the id, get the id "
-                "after adding the ResourceProfile to an RDD"
-            )
+                    if is_remote():
+                        from pyspark.sql.connect.resource.profile import ResourceProfile
+
+                        rp = ResourceProfile(

Review Comment:
   Done, added a comment.





Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "wbo4958 (via GitHub)" <gi...@apache.org>.
wbo4958 commented on PR #45232:
URL: https://github.com/apache/spark/pull/45232#issuecomment-2009170881

   Hi @grundprinzip, @HyukjinKwon, @zhengruifeng, this PR has been open for a while; could you help review/merge it? Thx




Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "wbo4958 (via GitHub)" <gi...@apache.org>.
wbo4958 commented on PR #45232:
URL: https://github.com/apache/spark/pull/45232#issuecomment-1963270400

   Hi @tgravescs @WeichenXu123 @zhengruifeng @Ngone51, Could you also please help review it. Thx.




Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "wbo4958 (via GitHub)" <gi...@apache.org>.
wbo4958 commented on PR #45232:
URL: https://github.com/apache/spark/pull/45232#issuecomment-1980292725

   Hi @HyukjinKwon, Could you help review again, thx very much.




Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1515448887


##########
dev/sparktestsupport/modules.py:
##########
@@ -554,6 +554,7 @@ def __hash__(self):
         "pyspark.resource.profile",
         # unittests
         "pyspark.resource.tests.test_resources",
+        "pyspark.resource.tests.test_connect_resources",

Review Comment:
   this test is for spark connect, I think we should move it to Module `pyspark_connect`?
   
   maybe we can move the test cases in it to `pyspark.sql.tests.connect.test_resources`?



