You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "zhengruifeng (via GitHub)" <gi...@apache.org> on 2023/04/14 12:35:49 UTC

[GitHub] [spark] zhengruifeng opened a new pull request, #40793: [WIP][SPARK-43122][TESTS] Reenable TorchDistributorLocalUnitTestsOnConnect and TorchDistributorLocalUnitTestsIIOnConnect

zhengruifeng opened a new pull request, #40793:
URL: https://github.com/apache/spark/pull/40793

   ### What changes were proposed in this pull request?
   `TorchDistributorLocalUnitTestsOnConnect` and `TorchDistributorLocalUnitTestsIIOnConnect` were not stable and occasionally got stuck. However, I can not reproduce the issue locally. So they had been disabled.
   
   This PR is to reenable them, I found that the old tests for Torch set up the connect sessions in `setUp` and close them in `tearDown`, however such session operations are expensive and we should use `setUpClass` and `tearDownClass` instead. After this change, the related tests seems much stable. So I think the root cause is still related to the resources, since TorchDistributor works on barrier mode, when there is not enough resources in Github Action, the tests just keep waiting.
   
   
   ### Why are the changes needed?
   for test coverage
   
   
   ### Does this PR introduce _any_ user-facing change?
   Reenable `TorchDistributorLocalUnitTestsOnConnect` and `TorchDistributorLocalUnitTestsIIOnConnect`
   
   ### How was this patch tested?
   CI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #40793: [SPARK-43122][CONNECT][PYTHON][ML][TESTS] Reenable TorchDistributorLocalUnitTestsOnConnect and TorchDistributorLocalUnitTestsIIOnConnect

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #40793:
URL: https://github.com/apache/spark/pull/40793#issuecomment-1512278648

   I see a failure in `pyspark.ml.clustering`, should be unrelated.
   but let me re-run `pyspark.ml.clustering` and test torch one more time.
   right now, I didn't see the previous torch-related issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40793: [SPARK-43122][TESTS] Reenable TorchDistributorLocalUnitTestsOnConnect and TorchDistributorLocalUnitTestsIIOnConnect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40793:
URL: https://github.com/apache/spark/pull/40793#discussion_r1168440330


##########
python/pyspark/ml/torch/tests/test_distributor.py:
##########
@@ -382,59 +400,51 @@ def test_end_to_end_run_locally(self) -> None:
 
 @unittest.skipIf(not have_torch, "torch is required")
 class TorchDistributorLocalUnitTests(TorchDistributorLocalUnitTestsMixin, unittest.TestCase):
-    def setUp(self) -> None:
-        class_name = self.__class__.__name__
-        conf = self._get_spark_conf()
-        sc = SparkContext("local-cluster[2,2,1024]", class_name, conf=conf)
-        self.spark = SparkSession(sc)
-        self.mnist_dir_path = tempfile.mkdtemp()
+    @classmethod
+    def setUpClass(cls):
+        (gpu_discovery_script_file_name, mnist_dir_path) = set_up_test_dirs()
+        cls.gpu_discovery_script_file_name = gpu_discovery_script_file_name
+        cls.mnist_dir_path = mnist_dir_path
 
-    def tearDown(self) -> None:
-        shutil.rmtree(self.mnist_dir_path)
-        os.unlink(self.gpu_discovery_script_file.name)
-        self.spark.stop()
+        conf = SparkConf()
+        for k, v in get_local_mode_conf().items():
+            conf = conf.set(k, v)
+        conf = conf.set("spark.driver.resource.gpu.discoveryScript", gpu_discovery_script_file_name)
+
+        sc = SparkContext("local-cluster[2,2,1024]", "TorchDistributorLocalUnitTests", conf=conf)
+        cls.spark = SparkSession(sc)
+
+    @classmethod
+    def tearDownClass(cls):
+        shutil.rmtree(cls.mnist_dir_path)
+        os.unlink(cls.gpu_discovery_script_file_name)
+        cls.spark.stop()
 
 
 @unittest.skipIf(not have_torch, "torch is required")
 class TorchDistributorLocalUnitTestsII(TorchDistributorLocalUnitTestsMixin, unittest.TestCase):
-    def setUp(self) -> None:
-        class_name = self.__class__.__name__
-        conf = self._get_spark_conf()
-        sc = SparkContext("local[4]", class_name, conf=conf)
-        self.spark = SparkSession(sc)
-        self.mnist_dir_path = tempfile.mkdtemp()
+    @classmethod
+    def setUpClass(cls):
+        (gpu_discovery_script_file_name, mnist_dir_path) = set_up_test_dirs()
+        cls.gpu_discovery_script_file_name = gpu_discovery_script_file_name

Review Comment:
   ditto:
   
   ```
   (cls.gpu_discovery_script_file_name, cls.mnist_dir_path) = set_up_test_dirs()
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #40793: [SPARK-43122][CONNECT][PYTHON][ML][TESTS] Reenable TorchDistributorLocalUnitTestsOnConnect and TorchDistributorLocalUnitTestsIIOnConnect

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #40793:
URL: https://github.com/apache/spark/pull/40793#discussion_r1168457195


##########
python/pyspark/ml/torch/tests/test_distributor.py:
##########
@@ -382,59 +400,51 @@ def test_end_to_end_run_locally(self) -> None:
 
 @unittest.skipIf(not have_torch, "torch is required")
 class TorchDistributorLocalUnitTests(TorchDistributorLocalUnitTestsMixin, unittest.TestCase):
-    def setUp(self) -> None:
-        class_name = self.__class__.__name__
-        conf = self._get_spark_conf()
-        sc = SparkContext("local-cluster[2,2,1024]", class_name, conf=conf)
-        self.spark = SparkSession(sc)
-        self.mnist_dir_path = tempfile.mkdtemp()
+    @classmethod
+    def setUpClass(cls):
+        (gpu_discovery_script_file_name, mnist_dir_path) = set_up_test_dirs()
+        cls.gpu_discovery_script_file_name = gpu_discovery_script_file_name
+        cls.mnist_dir_path = mnist_dir_path
 
-    def tearDown(self) -> None:
-        shutil.rmtree(self.mnist_dir_path)
-        os.unlink(self.gpu_discovery_script_file.name)
-        self.spark.stop()
+        conf = SparkConf()
+        for k, v in get_local_mode_conf().items():
+            conf = conf.set(k, v)
+        conf = conf.set("spark.driver.resource.gpu.discoveryScript", gpu_discovery_script_file_name)
+
+        sc = SparkContext("local-cluster[2,2,1024]", "TorchDistributorLocalUnitTests", conf=conf)
+        cls.spark = SparkSession(sc)
+
+    @classmethod
+    def tearDownClass(cls):
+        shutil.rmtree(cls.mnist_dir_path)
+        os.unlink(cls.gpu_discovery_script_file_name)
+        cls.spark.stop()
 
 
 @unittest.skipIf(not have_torch, "torch is required")
 class TorchDistributorLocalUnitTestsII(TorchDistributorLocalUnitTestsMixin, unittest.TestCase):
-    def setUp(self) -> None:
-        class_name = self.__class__.__name__
-        conf = self._get_spark_conf()
-        sc = SparkContext("local[4]", class_name, conf=conf)
-        self.spark = SparkSession(sc)
-        self.mnist_dir_path = tempfile.mkdtemp()
+    @classmethod
+    def setUpClass(cls):
+        (gpu_discovery_script_file_name, mnist_dir_path) = set_up_test_dirs()
+        cls.gpu_discovery_script_file_name = gpu_discovery_script_file_name

Review Comment:
   nice



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng closed pull request #40793: [SPARK-43122][CONNECT][PYTHON][ML][TESTS] Reenable TorchDistributorLocalUnitTestsOnConnect and TorchDistributorLocalUnitTestsIIOnConnect

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng closed pull request #40793: [SPARK-43122][CONNECT][PYTHON][ML][TESTS] Reenable TorchDistributorLocalUnitTestsOnConnect and TorchDistributorLocalUnitTestsIIOnConnect
URL: https://github.com/apache/spark/pull/40793


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #40793: [SPARK-43122][TESTS] Reenable TorchDistributorLocalUnitTestsOnConnect and TorchDistributorLocalUnitTestsIIOnConnect

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #40793:
URL: https://github.com/apache/spark/pull/40793#issuecomment-1510991598

   This PR actually only change `setUp & tearDown` to `setUpClass & tearDownClass`.
   
   In all the 6 runs, the PyTorch related tests all passed, so I think it is ready for review.
   
   @HyukjinKwon @dongjoon-hyun @WeichenXu123 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40793: [SPARK-43122][TESTS] Reenable TorchDistributorLocalUnitTestsOnConnect and TorchDistributorLocalUnitTestsIIOnConnect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40793:
URL: https://github.com/apache/spark/pull/40793#discussion_r1168439123


##########
python/pyspark/ml/torch/tests/test_distributor.py:
##########
@@ -382,59 +400,51 @@ def test_end_to_end_run_locally(self) -> None:
 
 @unittest.skipIf(not have_torch, "torch is required")
 class TorchDistributorLocalUnitTests(TorchDistributorLocalUnitTestsMixin, unittest.TestCase):
-    def setUp(self) -> None:
-        class_name = self.__class__.__name__
-        conf = self._get_spark_conf()
-        sc = SparkContext("local-cluster[2,2,1024]", class_name, conf=conf)
-        self.spark = SparkSession(sc)
-        self.mnist_dir_path = tempfile.mkdtemp()
+    @classmethod
+    def setUpClass(cls):
+        (gpu_discovery_script_file_name, mnist_dir_path) = set_up_test_dirs()
+        cls.gpu_discovery_script_file_name = gpu_discovery_script_file_name
+        cls.mnist_dir_path = mnist_dir_path

Review Comment:
   ```suggestion
           cls.gpu_discovery_script_file_name = set_up_test_dirs()
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40793: [SPARK-43122][TESTS] Reenable TorchDistributorLocalUnitTestsOnConnect and TorchDistributorLocalUnitTestsIIOnConnect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40793:
URL: https://github.com/apache/spark/pull/40793#discussion_r1168435358


##########
python/pyspark/ml/tests/connect/test_parity_torch_distributor.py:
##########
@@ -75,24 +84,30 @@ def _get_inputs_for_test_local_training_succeeds(self):
         ]
 
 
-@unittest.skip("unstable, ignore for now")
+@unittest.skipIf(not have_torch, "torch is required")
 class TorchDistributorLocalUnitTestsIIOnConnect(
     TorchDistributorLocalUnitTestsMixin, unittest.TestCase
 ):
-    def setUp(self) -> None:
-        class_name = self.__class__.__name__
-        conf = self._get_spark_conf()
-        builder = SparkSession.builder.appName(class_name)
-        for k, v in conf.getAll():
-            if k not in ["spark.master", "spark.remote", "spark.app.name"]:
-                builder = builder.config(k, v)
-        self.spark = builder.remote("local[4]").getOrCreate()
-        self.mnist_dir_path = tempfile.mkdtemp()
-
-    def tearDown(self) -> None:
-        shutil.rmtree(self.mnist_dir_path)
-        os.unlink(self.gpu_discovery_script_file.name)
-        self.spark.stop()
+    @classmethod
+    def setUpClass(cls):
+        (gpu_discovery_script_file_name, mnist_dir_path) = set_up_test_dirs()
+        cls.gpu_discovery_script_file_name = gpu_discovery_script_file_name
+        cls.mnist_dir_path = mnist_dir_path
+
+        builder = SparkSession.builder.appName("TorchDistributorLocalUnitTestsIIOnConnect")

Review Comment:
   ditto



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #40793: [SPARK-43122][CONNECT][PYTHON][ML][TESTS] Reenable TorchDistributorLocalUnitTestsOnConnect and TorchDistributorLocalUnitTestsIIOnConnect

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #40793:
URL: https://github.com/apache/spark/pull/40793#discussion_r1168454911


##########
python/pyspark/ml/torch/tests/test_distributor.py:
##########
@@ -122,6 +122,43 @@ def train_fn(learning_rate: float) -> Any:
     return train_fn
 
 
+def set_up_test_dirs():
+    gpu_discovery_script_file = tempfile.NamedTemporaryFile(delete=False)
+    gpu_discovery_script_file_name = gpu_discovery_script_file.name
+    gpu_discovery_script_file.write(
+        b'echo {\\"name\\": \\"gpu\\", \\"addresses\\": [\\"0\\",\\"1\\",\\"2\\"]}'
+    )
+    gpu_discovery_script_file.close()
+
+    # create temporary directory for Worker resources coordination
+    tempdir = tempfile.NamedTemporaryFile(delete=False)
+    os.unlink(tempdir.name)
+    os.chmod(

Review Comment:
   actually this `set_up_test_dirs` method just follows the initial logic.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40793: [SPARK-43122][TESTS] Reenable TorchDistributorLocalUnitTestsOnConnect and TorchDistributorLocalUnitTestsIIOnConnect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40793:
URL: https://github.com/apache/spark/pull/40793#discussion_r1168438008


##########
python/pyspark/ml/tests/connect/test_parity_torch_distributor.py:
##########
@@ -107,21 +122,26 @@ def _get_inputs_for_test_local_training_succeeds(self):
 class TorchDistributorDistributedUnitTestsOnConnect(
     TorchDistributorDistributedUnitTestsMixin, unittest.TestCase
 ):
-    def setUp(self) -> None:
-        class_name = self.__class__.__name__
-        conf = self._get_spark_conf()
-        builder = SparkSession.builder.appName(class_name)
-        for k, v in conf.getAll():
-            if k not in ["spark.master", "spark.remote", "spark.app.name"]:
-                builder = builder.config(k, v)
-
-        self.spark = builder.remote("local-cluster[2,2,1024]").getOrCreate()
-        self.mnist_dir_path = tempfile.mkdtemp()
-
-    def tearDown(self) -> None:
-        shutil.rmtree(self.mnist_dir_path)
-        os.unlink(self.gpu_discovery_script_file.name)
-        self.spark.stop()
+    @classmethod
+    def setUpClass(cls):
+        (gpu_discovery_script_file_name, mnist_dir_path) = set_up_test_dirs()
+        cls.gpu_discovery_script_file_name = gpu_discovery_script_file_name
+        cls.mnist_dir_path = mnist_dir_path
+
+        builder = SparkSession.builder.appName("TorchDistributorDistributedUnitTestsOnConnect")

Review Comment:
   ditto



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40793: [SPARK-43122][TESTS] Reenable TorchDistributorLocalUnitTestsOnConnect and TorchDistributorLocalUnitTestsIIOnConnect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40793:
URL: https://github.com/apache/spark/pull/40793#discussion_r1168441615


##########
python/pyspark/ml/torch/tests/test_distributor.py:
##########
@@ -122,6 +122,43 @@ def train_fn(learning_rate: float) -> Any:
     return train_fn
 
 
+def set_up_test_dirs():

Review Comment:
   Ideally we should define another mixin (like `SQLTestUtils`), and inherits it to be consistent but I am fine as are for now since it's just few methods, and they do not assume states.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40793: [SPARK-43122][TESTS] Reenable TorchDistributorLocalUnitTestsOnConnect and TorchDistributorLocalUnitTestsIIOnConnect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40793:
URL: https://github.com/apache/spark/pull/40793#discussion_r1168438544


##########
python/pyspark/ml/torch/tests/test_distributor.py:
##########
@@ -122,6 +122,43 @@ def train_fn(learning_rate: float) -> Any:
     return train_fn
 
 
+def set_up_test_dirs():
+    gpu_discovery_script_file = tempfile.NamedTemporaryFile(delete=False)
+    gpu_discovery_script_file_name = gpu_discovery_script_file.name
+    gpu_discovery_script_file.write(
+        b'echo {\\"name\\": \\"gpu\\", \\"addresses\\": [\\"0\\",\\"1\\",\\"2\\"]}'
+    )
+    gpu_discovery_script_file.close()
+
+    # create temporary directory for Worker resources coordination
+    tempdir = tempfile.NamedTemporaryFile(delete=False)
+    os.unlink(tempdir.name)
+    os.chmod(

Review Comment:
   Mind add a comment what this means?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #40793: [SPARK-43122][CONNECT][PYTHON][ML][TESTS] Reenable TorchDistributorLocalUnitTestsOnConnect and TorchDistributorLocalUnitTestsIIOnConnect

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #40793:
URL: https://github.com/apache/spark/pull/40793#issuecomment-1512379310

   in all the 10 runs, `test_parity_torch_distributor` and `test_distributor` passed without being hanged.
   so I think it is stable enough to re-enable.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40793: [SPARK-43122][TESTS] Reenable TorchDistributorLocalUnitTestsOnConnect and TorchDistributorLocalUnitTestsIIOnConnect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40793:
URL: https://github.com/apache/spark/pull/40793#discussion_r1168436435


##########
python/pyspark/ml/torch/tests/test_distributor.py:
##########
@@ -122,6 +122,43 @@ def train_fn(learning_rate: float) -> Any:
     return train_fn
 
 
+def set_up_test_dirs():
+    gpu_discovery_script_file = tempfile.NamedTemporaryFile(delete=False)
+    gpu_discovery_script_file_name = gpu_discovery_script_file.name
+    gpu_discovery_script_file.write(
+        b'echo {\\"name\\": \\"gpu\\", \\"addresses\\": [\\"0\\",\\"1\\",\\"2\\"]}'

Review Comment:
   no biggie but I would convert a dict to JSON and write it after encoding it.



##########
python/pyspark/ml/torch/tests/test_distributor.py:
##########
@@ -122,6 +122,43 @@ def train_fn(learning_rate: float) -> Any:
     return train_fn
 
 
+def set_up_test_dirs():
+    gpu_discovery_script_file = tempfile.NamedTemporaryFile(delete=False)
+    gpu_discovery_script_file_name = gpu_discovery_script_file.name
+    gpu_discovery_script_file.write(
+        b'echo {\\"name\\": \\"gpu\\", \\"addresses\\": [\\"0\\",\\"1\\",\\"2\\"]}'
+    )
+    gpu_discovery_script_file.close()

Review Comment:
   try finally



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #40793: [SPARK-43122][CONNECT][PYTHON][ML][TESTS] Reenable TorchDistributorLocalUnitTestsOnConnect and TorchDistributorLocalUnitTestsIIOnConnect

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #40793:
URL: https://github.com/apache/spark/pull/40793#discussion_r1168588076


##########
python/pyspark/ml/torch/tests/test_distributor.py:
##########
@@ -122,6 +122,43 @@ def train_fn(learning_rate: float) -> Any:
     return train_fn
 
 
+def set_up_test_dirs():
+    gpu_discovery_script_file = tempfile.NamedTemporaryFile(delete=False)
+    gpu_discovery_script_file_name = gpu_discovery_script_file.name
+    gpu_discovery_script_file.write(
+        b'echo {\\"name\\": \\"gpu\\", \\"addresses\\": [\\"0\\",\\"1\\",\\"2\\"]}'

Review Comment:
   I don't find a easy way to add the `\`, let me just follow existing test for now



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #40793: [WIP][SPARK-43122][TESTS] Reenable TorchDistributorLocalUnitTestsOnConnect and TorchDistributorLocalUnitTestsIIOnConnect

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #40793:
URL: https://github.com/apache/spark/pull/40793#issuecomment-1508448973

   let me try merging commits from master several times, to see whether this fix is stable enough


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #40793: [SPARK-43122][CONNECT][PYTHON][ML][TESTS] Reenable TorchDistributorLocalUnitTestsOnConnect and TorchDistributorLocalUnitTestsIIOnConnect

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #40793:
URL: https://github.com/apache/spark/pull/40793#discussion_r1168573414


##########
python/pyspark/ml/torch/tests/test_distributor.py:
##########
@@ -122,6 +122,43 @@ def train_fn(learning_rate: float) -> Any:
     return train_fn
 
 
+def set_up_test_dirs():
+    gpu_discovery_script_file = tempfile.NamedTemporaryFile(delete=False)
+    gpu_discovery_script_file_name = gpu_discovery_script_file.name
+    gpu_discovery_script_file.write(
+        b'echo {\\"name\\": \\"gpu\\", \\"addresses\\": [\\"0\\",\\"1\\",\\"2\\"]}'
+    )
+    gpu_discovery_script_file.close()

Review Comment:
   got it, will update



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #40793: [SPARK-43122][CONNECT][PYTHON][ML][TESTS] Reenable TorchDistributorLocalUnitTestsOnConnect and TorchDistributorLocalUnitTestsIIOnConnect

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #40793:
URL: https://github.com/apache/spark/pull/40793#issuecomment-1512379955

   merged to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40793: [SPARK-43122][TESTS] Reenable TorchDistributorLocalUnitTestsOnConnect and TorchDistributorLocalUnitTestsIIOnConnect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40793:
URL: https://github.com/apache/spark/pull/40793#discussion_r1168439887


##########
python/pyspark/ml/torch/tests/test_distributor.py:
##########
@@ -382,59 +400,51 @@ def test_end_to_end_run_locally(self) -> None:
 
 @unittest.skipIf(not have_torch, "torch is required")
 class TorchDistributorLocalUnitTests(TorchDistributorLocalUnitTestsMixin, unittest.TestCase):
-    def setUp(self) -> None:
-        class_name = self.__class__.__name__
-        conf = self._get_spark_conf()
-        sc = SparkContext("local-cluster[2,2,1024]", class_name, conf=conf)
-        self.spark = SparkSession(sc)
-        self.mnist_dir_path = tempfile.mkdtemp()
+    @classmethod
+    def setUpClass(cls):
+        (gpu_discovery_script_file_name, mnist_dir_path) = set_up_test_dirs()
+        cls.gpu_discovery_script_file_name = gpu_discovery_script_file_name
+        cls.mnist_dir_path = mnist_dir_path
 
-    def tearDown(self) -> None:
-        shutil.rmtree(self.mnist_dir_path)
-        os.unlink(self.gpu_discovery_script_file.name)
-        self.spark.stop()
+        conf = SparkConf()
+        for k, v in get_local_mode_conf().items():
+            conf = conf.set(k, v)
+        conf = conf.set("spark.driver.resource.gpu.discoveryScript", gpu_discovery_script_file_name)
+
+        sc = SparkContext("local-cluster[2,2,1024]", "TorchDistributorLocalUnitTests", conf=conf)

Review Comment:
   ditto. `cls.__name__`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40793: [SPARK-43122][TESTS] Reenable TorchDistributorLocalUnitTestsOnConnect and TorchDistributorLocalUnitTestsIIOnConnect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40793:
URL: https://github.com/apache/spark/pull/40793#discussion_r1168441615


##########
python/pyspark/ml/torch/tests/test_distributor.py:
##########
@@ -122,6 +122,43 @@ def train_fn(learning_rate: float) -> Any:
     return train_fn
 
 
+def set_up_test_dirs():

Review Comment:
   Ideally we should define another mixin (like `SQLTestUtils`), and inherits it to be consistent but I am fine as are for now since it's just few methods.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40793: [SPARK-43122][TESTS] Reenable TorchDistributorLocalUnitTestsOnConnect and TorchDistributorLocalUnitTestsIIOnConnect

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40793:
URL: https://github.com/apache/spark/pull/40793#discussion_r1168434979


##########
python/pyspark/ml/tests/connect/test_parity_torch_distributor.py:
##########
@@ -33,38 +32,48 @@
     TorchDistributorLocalUnitTestsMixin,
     TorchDistributorDistributedUnitTestsMixin,
     TorchWrapperUnitTestsMixin,
+    set_up_test_dirs,
+    get_local_mode_conf,
+    get_distributed_mode_conf,
 )
 
 
 @unittest.skipIf(not have_torch, "torch is required")
 class TorchDistributorBaselineUnitTestsOnConnect(
     TorchDistributorBaselineUnitTestsMixin, unittest.TestCase
 ):
-    def setUp(self) -> None:
-        self.spark = SparkSession.builder.remote("local[4]").getOrCreate()
+    @classmethod
+    def setUpClass(cls):
+        cls.spark = SparkSession.builder.remote("local[4]").getOrCreate()
 
-    def tearDown(self) -> None:
-        self.spark.stop()
+    @classmethod
+    def tearDownClass(cls):
+        cls.spark.stop()
 
 
-@unittest.skip("unstable, ignore for now")
+@unittest.skipIf(not have_torch, "torch is required")
 class TorchDistributorLocalUnitTestsOnConnect(
     TorchDistributorLocalUnitTestsMixin, unittest.TestCase
 ):
-    def setUp(self) -> None:
-        class_name = self.__class__.__name__
-        conf = self._get_spark_conf()
-        builder = SparkSession.builder.appName(class_name)
-        for k, v in conf.getAll():
-            if k not in ["spark.master", "spark.remote", "spark.app.name"]:
-                builder = builder.config(k, v)
-        self.spark = builder.remote("local-cluster[2,2,1024]").getOrCreate()
-        self.mnist_dir_path = tempfile.mkdtemp()
-
-    def tearDown(self) -> None:
-        shutil.rmtree(self.mnist_dir_path)
-        os.unlink(self.gpu_discovery_script_file.name)
-        self.spark.stop()
+    @classmethod
+    def setUpClass(cls):
+        (gpu_discovery_script_file_name, mnist_dir_path) = set_up_test_dirs()
+        cls.gpu_discovery_script_file_name = gpu_discovery_script_file_name
+        cls.mnist_dir_path = mnist_dir_path
+
+        builder = SparkSession.builder.appName("TorchDistributorLocalUnitTestsOnConnect")

Review Comment:
   `cls.__name__` instead?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org