You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@celeborn.apache.org by "AngersZhuuuu (via GitHub)" <gi...@apache.org> on 2023/02/02 03:03:34 UTC

[GitHub] [incubator-celeborn] AngersZhuuuu opened a new pull request, #1197: [CELEBORN-265] Integration with Spark3.0 cast class exception of ShuffleHandler

AngersZhuuuu opened a new pull request, #1197:
URL: https://github.com/apache/incubator-celeborn/pull/1197

   ### What changes were proposed in this pull request?
   ```
   Driver stacktrace:
       at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059)
       at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008)
       at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007)
       at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
       at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
       at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
       at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007)
       at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973)
       at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973)
       at scala.Option.foreach(Option.scala:407)
       at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:973)
       at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2239)
       at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
       at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
       at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
       at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775)
       at org.apache.spark.SparkContext.runJob(SparkContext.scala:2114)
       at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:200)
       ... 75 more
   Caused by: java.lang.ClassCastException: org.apache.spark.shuffle.sort.BypassMergeSortShuffleHandle cannot be cast to org.apache.spark.shuffle.celeborn.RssShuffleHandle
       at org.apache.spark.shuffle.celeborn.RssShuffleManager.getReaderForRange(RssShuffleManager.java:248)
       at org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:195)
       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
       at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
       at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
       at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
       at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
       at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
       at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
       at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
       at org.apache.spark.scheduler.Task.run(Task.scala:127)
       at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:463)
       at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:466)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       at java.lang.Thread.run(Thread.java:748) 
   ```
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] AngersZhuuuu commented on pull request #1197: [CELEBORN-265] Integration with Spark3.0 cast class exception of ShuffleHandler

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on PR #1197:
URL: https://github.com/apache/incubator-celeborn/pull/1197#issuecomment-1413097241

   > > > Does it mean that the current CI can not detect the test failures?
   > > 
   > > 
   > > Seems so, In my other pr, it cause a fallback, then meet this issue.
   > 
   > We need UTs for fallback casdes.
   
   Added one UT


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] waitinfuture commented on pull request #1197: [CELEBORN-265] Integration with Spark3.0 cast class exception of ShuffleHandler

Posted by "waitinfuture (via GitHub)" <gi...@apache.org>.
waitinfuture commented on PR #1197:
URL: https://github.com/apache/incubator-celeborn/pull/1197#issuecomment-1413092657

   > > Does it mean that the current CI can not detect the test failures?
   > 
   > Seems so, In my other pr, it cause a fallback, then meet this issue.
   
   We need UTs for fallback casdes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] pan3793 commented on a diff in pull request #1197: [CELEBORN-265] Integration with Spark3.0 cast class exception of ShuffleHandler

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.
pan3793 commented on code in PR #1197:
URL: https://github.com/apache/incubator-celeborn/pull/1197#discussion_r1093983818


##########
tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/ShuffleFallbackSuite.scala:
##########
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.celeborn.tests.spark
+
+import scala.util.Random
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.SparkSession
+import org.scalatest.BeforeAndAfterEach
+import org.scalatest.funsuite.AnyFunSuite
+
+import org.apache.celeborn.client.ShuffleClient
+import org.apache.celeborn.common.protocol.CompressionCodec
+
+class ShuffleFallbackSuite extends AnyFunSuite
+  with SparkTestBase
+  with BeforeAndAfterEach {
+
+  override def beforeEach(): Unit = {
+    ShuffleClient.reset()
+  }
+
+  override def afterEach(): Unit = {
+    System.gc()
+  }
+
+  private def enableRss(conf: SparkConf) = {
+    conf.set("spark.shuffle.manager", "org.apache.spark.shuffle.celeborn.RssShuffleManager")
+      .set("spark.rss.master.address", masterInfo._1.rpcEnv.address.toString)
+      .set("spark.rss.shuffle.split.threshold", "10MB")

Review Comment:
   rss => celeborn ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] AngersZhuuuu commented on a diff in pull request #1197: [CELEBORN-265] Integration with Spark3.0 cast class exception of ShuffleHandler

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on code in PR #1197:
URL: https://github.com/apache/incubator-celeborn/pull/1197#discussion_r1093985063


##########
tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/ShuffleFallbackSuite.scala:
##########
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.celeborn.tests.spark
+
+import scala.util.Random
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.SparkSession
+import org.scalatest.BeforeAndAfterEach
+import org.scalatest.funsuite.AnyFunSuite
+
+import org.apache.celeborn.client.ShuffleClient
+import org.apache.celeborn.common.protocol.CompressionCodec
+
+class ShuffleFallbackSuite extends AnyFunSuite
+  with SparkTestBase
+  with BeforeAndAfterEach {
+
+  override def beforeEach(): Unit = {
+    ShuffleClient.reset()
+  }
+
+  override def afterEach(): Unit = {
+    System.gc()
+  }
+
+  private def enableRss(conf: SparkConf) = {
+    conf.set("spark.shuffle.manager", "org.apache.spark.shuffle.celeborn.RssShuffleManager")
+      .set("spark.rss.master.address", masterInfo._1.rpcEnv.address.toString)
+      .set("spark.rss.shuffle.split.threshold", "10MB")

Review Comment:
   > rss => celeborn ?
   
   Let me refresh all this conf in other pr...so much



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] AngersZhuuuu commented on a diff in pull request #1197: [CELEBORN-265] Integration with Spark3.0 cast class exception of ShuffleHandler

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on code in PR #1197:
URL: https://github.com/apache/incubator-celeborn/pull/1197#discussion_r1093968932


##########
client-spark/spark-3/src/main/java/org/apache/spark/shuffle/celeborn/RssShuffleManager.java:
##########
@@ -244,16 +244,27 @@ public <K, C> ShuffleReader<K, C> getReaderForRange(
       int endPartition,
       TaskContext context,
       ShuffleReadMetricsReporter metrics) {
-    @SuppressWarnings("unchecked")
-    RssShuffleHandle<K, ?, C> h = (RssShuffleHandle<K, ?, C>) handle;
-    return new RssShuffleReader<>(
-        h,
+    if (handle instanceof RssShuffleHandle) {
+      @SuppressWarnings("unchecked")
+      RssShuffleHandle<K, ?, C> h = (RssShuffleHandle<K, ?, C>) handle;
+      return new RssShuffleReader<>(
+          h,
+          startPartition,
+          endPartition,
+          startMapIndex,
+          endMapIndex,
+          context,
+          celebornConf,
+          metrics);
+    }
+    return SparkUtils.getReader(
+        sortShuffleManager(),
+        handle,
+        0,

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] waitinfuture commented on a diff in pull request #1197: [CELEBORN-265] Integration with Spark3.0 cast class exception of ShuffleHandler

Posted by "waitinfuture (via GitHub)" <gi...@apache.org>.
waitinfuture commented on code in PR #1197:
URL: https://github.com/apache/incubator-celeborn/pull/1197#discussion_r1093967903


##########
client-spark/spark-3/src/main/java/org/apache/spark/shuffle/celeborn/RssShuffleManager.java:
##########
@@ -244,16 +244,27 @@ public <K, C> ShuffleReader<K, C> getReaderForRange(
       int endPartition,
       TaskContext context,
       ShuffleReadMetricsReporter metrics) {
-    @SuppressWarnings("unchecked")
-    RssShuffleHandle<K, ?, C> h = (RssShuffleHandle<K, ?, C>) handle;
-    return new RssShuffleReader<>(
-        h,
+    if (handle instanceof RssShuffleHandle) {
+      @SuppressWarnings("unchecked")
+      RssShuffleHandle<K, ?, C> h = (RssShuffleHandle<K, ?, C>) handle;
+      return new RssShuffleReader<>(
+          h,
+          startPartition,
+          endPartition,
+          startMapIndex,
+          endMapIndex,
+          context,
+          celebornConf,
+          metrics);
+    }
+    return SparkUtils.getReader(
+        sortShuffleManager(),
+        handle,
+        0,

Review Comment:
   startMapIndex



##########
client-spark/spark-3/src/main/java/org/apache/spark/shuffle/celeborn/RssShuffleManager.java:
##########
@@ -244,16 +244,27 @@ public <K, C> ShuffleReader<K, C> getReaderForRange(
       int endPartition,
       TaskContext context,
       ShuffleReadMetricsReporter metrics) {
-    @SuppressWarnings("unchecked")
-    RssShuffleHandle<K, ?, C> h = (RssShuffleHandle<K, ?, C>) handle;
-    return new RssShuffleReader<>(
-        h,
+    if (handle instanceof RssShuffleHandle) {
+      @SuppressWarnings("unchecked")
+      RssShuffleHandle<K, ?, C> h = (RssShuffleHandle<K, ?, C>) handle;
+      return new RssShuffleReader<>(
+          h,
+          startPartition,
+          endPartition,
+          startMapIndex,
+          endMapIndex,
+          context,
+          celebornConf,
+          metrics);
+    }
+    return SparkUtils.getReader(
+        sortShuffleManager(),
+        handle,
+        0,
+        Integer.MAX_VALUE,

Review Comment:
   endMapIndex



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] AngersZhuuuu commented on pull request #1197: [CELEBORN-265] Integration with Spark3.0 cast class exception of ShuffleHandler

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on PR #1197:
URL: https://github.com/apache/incubator-celeborn/pull/1197#issuecomment-1413092007

   > Does it mean that the current CI can not detect the test failures?
   
   Seems so, In my other pr, it cause a fallback, then meet this issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] AngersZhuuuu merged pull request #1197: [CELEBORN-265] Integration with Spark3.0 cast class exception of ShuffleHandler

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu merged PR #1197:
URL: https://github.com/apache/incubator-celeborn/pull/1197


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] codecov[bot] commented on pull request #1197: [CELEBORN-265] Integration with Spark3.0 cast class exception of ShuffleHandler

Posted by "codecov[bot] (via GitHub)" <gi...@apache.org>.
codecov[bot] commented on PR #1197:
URL: https://github.com/apache/incubator-celeborn/pull/1197#issuecomment-1413091798

   # [Codecov](https://codecov.io/gh/apache/incubator-celeborn/pull/1197?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1197](https://codecov.io/gh/apache/incubator-celeborn/pull/1197?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (9764adc) into [main](https://codecov.io/gh/apache/incubator-celeborn/commit/021004714bf5c1db4527c686a7d71c4a3a6ca72b?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (0210047) will **decrease** coverage by `0.02%`.
   > The diff coverage is `n/a`.
   
   > :exclamation: Current head 9764adc differs from pull request most recent head 5734765. Consider uploading reports for the commit 5734765 to get more accurate results
   
   ```diff
   @@             Coverage Diff              @@
   ##               main    #1197      +/-   ##
   ============================================
   - Coverage     26.89%   26.86%   -0.02%     
   + Complexity      777      775       -2     
   ============================================
     Files           205      205              
     Lines         17427    17433       +6     
     Branches       1899     1899              
   ============================================
   - Hits           4685     4682       -3     
   - Misses        12419    12425       +6     
   - Partials        323      326       +3     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-celeborn/pull/1197?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...celeborn/service/deploy/master/SlotsAllocator.java](https://codecov.io/gh/apache/incubator-celeborn/pull/1197?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-bWFzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9jZWxlYm9ybi9zZXJ2aWNlL2RlcGxveS9tYXN0ZXIvU2xvdHNBbGxvY2F0b3IuamF2YQ==) | `69.27% <0.00%> (-2.45%)` | :arrow_down: |
   | [...apache/celeborn/service/deploy/worker/Worker.scala](https://codecov.io/gh/apache/incubator-celeborn/pull/1197?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-d29ya2VyL3NyYy9tYWluL3NjYWxhL29yZy9hcGFjaGUvY2VsZWJvcm4vc2VydmljZS9kZXBsb3kvd29ya2VyL1dvcmtlci5zY2FsYQ==) | `0.00% <0.00%> (ø)` | |
   | [...ice/deploy/master/clustermeta/ha/HARaftServer.java](https://codecov.io/gh/apache/incubator-celeborn/pull/1197?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-bWFzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9jZWxlYm9ybi9zZXJ2aWNlL2RlcGxveS9tYXN0ZXIvY2x1c3Rlcm1ldGEvaGEvSEFSYWZ0U2VydmVyLmphdmE=) | `77.93% <0.00%> (+1.36%)` | :arrow_up: |
   
   :mega: We’re building smart automated test selection to slash your CI/CD build times. [Learn more](https://about.codecov.io/iterative-testing/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] pan3793 commented on pull request #1197: [CELEBORN-265] Integration with Spark3.0 cast class exception of ShuffleHandler

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.
pan3793 commented on PR #1197:
URL: https://github.com/apache/incubator-celeborn/pull/1197#issuecomment-1413089857

   Does it mean that the current CI can not detect the test failures?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org