Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/05/27 10:15:22 UTC

[GitHub] [spark] pralabhkumar opened a new pull request, #36701: [SPARK-39179][PYTHON][TESTS] Improve the test coverage for pyspark/shuffle.py

pralabhkumar opened a new pull request, #36701:
URL: https://github.com/apache/spark/pull/36701

   ### What changes were proposed in this pull request?
   This PR adds test cases for shuffle.py.
   
   ### Why are the changes needed?
   To cover corner test cases and raise the coverage of shuffle.py to close to 90%.
   
   ### Does this PR introduce _any_ user-facing change?
   No, this is a test-only change.
   
   
   ### How was this patch tested?
   CI in this PR should test it out


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36701: [SPARK-39179][PYTHON][TESTS] Improve the test coverage for pyspark/shuffle.py

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36701:
URL: https://github.com/apache/spark/pull/36701#discussion_r888519926


##########
python/pyspark/tests/test_shuffle.py:
##########
@@ -117,6 +169,37 @@ def legit_merge_combiners(x, y):
             m.mergeCombiners(map(lambda x_y1: (x_y1[0], [x_y1[1]]), data))
 
 
+class ExternalGroupByTests(unittest.TestCase):
+    def setUp(self):
+        self.N = 1 << 20
+        values = [i for i in range(self.N)]
+        keys = [i for i in range(2)]
+        import itertools
+
+        self.data = [value for value in itertools.product(keys, values)]
+        self.agg = Aggregator(
+            lambda x: [x], lambda x, y: x.append(y) or x, lambda x, y: x.extend(y) or x
+        )
+
+    def test_medium_dataset(self):
+        # SPARK-39179: Test external group by for medium dataset
+        m = ExternalGroupBy(self.agg, 5, partitions=3)
+        m.mergeValues(self.data)
+        self.assertTrue(m.spills >= 1)
+        self.assertEqual(sum(sum(v) for k, v in m.items()), 2 * sum(range(self.N)))
+
+    def test_dataset_with_keys_are_unsorted(self):
+        # SPARK-39179: Test external group when numbers of keys are greater than SORT KEY Limit.
+        m = ExternalGroupBy(self.agg, 5, partitions=3)
+        try:
+            m.SORT_KEY_LIMIT = 1
+            m.mergeValues(self.data)
+            self.assertTrue(m.spills >= 1)
+            self.assertEqual(sum(sum(v) for k, v in m.items()), 2 * sum(range(self.N)))
+        finally:
+            m.SORT_KEY_LIMIT = 1000

Review Comment:
   Let's probably do it this way:
   
   ```python
   origin = m.SORT_KEY_LIMIT
   try:
       m.SORT_KEY_LIMIT = ...
   finally:
       m.SORT_KEY_LIMIT = origin
   ```
   
   In this way, if somebody changes the default value in the main code, we won't have to update the test as well.
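   Run standalone, the reviewer's save-and-restore pattern looks like this (a minimal sketch; `Settings` is a hypothetical stand-in for a class with a tunable class attribute such as `ExternalGroupBy.SORT_KEY_LIMIT`):
   
   ```python
   # Minimal sketch of the save/restore pattern from the review comment.
   # "Settings" is a hypothetical stand-in for a class like ExternalGroupBy.
   class Settings:
       SORT_KEY_LIMIT = 1000
   
   def run_with_limit(obj, limit):
       origin = obj.SORT_KEY_LIMIT  # remember whatever the current default is
       try:
           obj.SORT_KEY_LIMIT = limit
           return obj.SORT_KEY_LIMIT  # ... the test body would run here ...
       finally:
           obj.SORT_KEY_LIMIT = origin  # restored even if the body raises
   
   s = Settings()
   assert run_with_limit(s, 1) == 1   # the override is visible inside the body
   assert s.SORT_KEY_LIMIT == 1000    # and the original value is back afterwards
   ```
   
   Because the restore sits in `finally`, the original limit comes back even when an assertion inside the body fails.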





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36701: [SPARK-39179][PYTHON][TESTS] Improve the test coverage for pyspark/shuffle.py

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36701:
URL: https://github.com/apache/spark/pull/36701#discussion_r888519237


##########
python/pyspark/tests/test_shuffle.py:
##########
@@ -54,6 +63,49 @@ def test_medium_dataset(self):
         self.assertTrue(m.spills >= 1)
         self.assertEqual(sum(sum(v) for k, v in m.items()), sum(range(self.N)) * 3)
 
+    def test_shuffle_data_with_multiple_locations(self):
+        # SPARK-39179: Test shuffle of data with multiple location also check
+        # shuffle locations get randomized
+
+        with tempfile.TemporaryDirectory() as tempdir1, tempfile.TemporaryDirectory() as tempdir2:
+            os.environ["SPARK_LOCAL_DIRS"] = tempdir1 + "," + tempdir2
+            index_of_tempdir1 = [False, False]
+            for idx in range(10):
+                m = ExternalMerger(self.agg, 20)
+                if m.localdirs[0].startswith(tempdir1):
+                    index_of_tempdir1[0] = True
+                elif m.localdirs[1].startswith(tempdir1):
+                    index_of_tempdir1[1] = True
+                m.mergeValues(self.data)
+                self.assertTrue(m.spills >= 1)
+                self.assertEqual(sum(sum(v) for k, v in m.items()), sum(range(self.N)))
+            self.assertTrue(index_of_tempdir1[0] and (index_of_tempdir1[0] == index_of_tempdir1[1]))
+            del os.environ["SPARK_LOCAL_DIRS"]

Review Comment:
   @pralabhkumar let's probably restore the original value so other tests won't be affected. For example, after this test, `SPARK_LOCAL_DIRS` will be removed even if it had been set beforehand.
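   The restore idea for an environment variable can be sketched as a small helper (hypothetical, not part of the PR; it handles both the previously-set and previously-unset cases):
   
   ```python
   import os
   
   # Sketch of the restore pattern the reviewer asks for: remember the original
   # value (None when unset) and put it back in a finally block, so later tests
   # see the environment they started with. "with_env" is a hypothetical helper.
   def with_env(name, value, body):
       original = os.environ.get(name)  # None if the variable was not set
       os.environ[name] = value
       try:
           return body()
       finally:
           if original is None:
               del os.environ[name]         # was unset before: unset it again
           else:
               os.environ[name] = original  # was set before: restore old value
   
   seen = with_env("SPARK_LOCAL_DIRS", "/tmp/dir1,/tmp/dir2",
                   lambda: os.environ["SPARK_LOCAL_DIRS"])
   # the override is visible inside the body, and undone afterwards
   ```
   
   A test body that constructs `ExternalMerger` could run inside `body`, and the surrounding environment is left exactly as it was found.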





[GitHub] [spark] pralabhkumar commented on a diff in pull request #36701: [SPARK-39179][PYTHON][TESTS] Improve the test coverage for pyspark/shuffle.py

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on code in PR #36701:
URL: https://github.com/apache/spark/pull/36701#discussion_r888632502


##########
python/pyspark/tests/test_shuffle.py:
##########
@@ -54,6 +63,49 @@ def test_medium_dataset(self):
         self.assertTrue(m.spills >= 1)
         self.assertEqual(sum(sum(v) for k, v in m.items()), sum(range(self.N)) * 3)
 
+    def test_shuffle_data_with_multiple_locations(self):
+        # SPARK-39179: Test shuffle of data with multiple location also check
+        # shuffle locations get randomized
+
+        with tempfile.TemporaryDirectory() as tempdir1, tempfile.TemporaryDirectory() as tempdir2:
+            os.environ["SPARK_LOCAL_DIRS"] = tempdir1 + "," + tempdir2
+            index_of_tempdir1 = [False, False]
+            for idx in range(10):
+                m = ExternalMerger(self.agg, 20)
+                if m.localdirs[0].startswith(tempdir1):
+                    index_of_tempdir1[0] = True
+                elif m.localdirs[1].startswith(tempdir1):
+                    index_of_tempdir1[1] = True
+                m.mergeValues(self.data)
+                self.assertTrue(m.spills >= 1)
+                self.assertEqual(sum(sum(v) for k, v in m.items()), sum(range(self.N)))
+            self.assertTrue(index_of_tempdir1[0] and (index_of_tempdir1[0] == index_of_tempdir1[1]))
+            del os.environ["SPARK_LOCAL_DIRS"]

Review Comment:
   @HyukjinKwon, sorry, I didn't understand your suggestion completely. Please help.
   Originally, SPARK_LOCAL_DIRS is not set as an environment variable (it has no default value). So as part of this test it was set, and once the test case completed it was removed at the end, so that the rest of the cases work as before. Let me know if I am missing anything.





[GitHub] [spark] HyukjinKwon commented on pull request #36701: [SPARK-39179][PYTHON][TESTS] Improve the test coverage for pyspark/shuffle.py

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #36701:
URL: https://github.com/apache/spark/pull/36701#issuecomment-1145477772

   LGTM otherwise.





[GitHub] [spark] AmplabJenkins commented on pull request #36701: [SPARK-39179][PYTHON][TESTS] Improve the test coverage for pyspark/shuffle.py

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #36701:
URL: https://github.com/apache/spark/pull/36701#issuecomment-1140376433

   Can one of the admins verify this patch?




[GitHub] [spark] dongjoon-hyun commented on pull request #36701: [SPARK-39179][PYTHON][TESTS] Improve the test coverage for pyspark/shuffle.py

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #36701:
URL: https://github.com/apache/spark/pull/36701#issuecomment-1146909153

   Thank you, @pralabhkumar and @HyukjinKwon . Merged to master for Apache Spark 3.4.




[GitHub] [spark] dongjoon-hyun closed pull request #36701: [SPARK-39179][PYTHON][TESTS] Improve the test coverage for pyspark/shuffle.py

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #36701:  [SPARK-39179][PYTHON][TESTS] Improve the test coverage for pyspark/shuffle.py
URL: https://github.com/apache/spark/pull/36701




[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36701: [SPARK-39179][PYTHON][TESTS] Improve the test coverage for pyspark/shuffle.py

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36701:
URL: https://github.com/apache/spark/pull/36701#discussion_r884364982


##########
python/pyspark/tests/test_shuffle.py:
##########
@@ -54,6 +61,53 @@ def test_medium_dataset(self):
         self.assertTrue(m.spills >= 1)
         self.assertEqual(sum(sum(v) for k, v in m.items()), sum(range(self.N)) * 3)
 
+    def test_shuffle_data_with_multiple_locations(self):
+        # SPARK-39179: Test shuffle of data with multiple location also check
+        # shuffle locations get randomized
+        import tempfile
+
+        with tempfile.TemporaryDirectory(dir="/tmp") as tempdir1, tempfile.TemporaryDirectory(
+            dir="/tmp"
+        ) as tempdir2:
+            import os

Review Comment:
   Let's also import this on the top





[GitHub] [spark] pralabhkumar commented on a diff in pull request #36701: [SPARK-39179][PYTHON][TESTS] Improve the test coverage for pyspark/shuffle.py

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on code in PR #36701:
URL: https://github.com/apache/spark/pull/36701#discussion_r888778272


##########
python/pyspark/tests/test_shuffle.py:
##########
@@ -54,6 +63,59 @@ def test_medium_dataset(self):
         self.assertTrue(m.spills >= 1)
         self.assertEqual(sum(sum(v) for k, v in m.items()), sum(range(self.N)) * 3)
 
+    def test_shuffle_data_with_multiple_locations(self):
+        # SPARK-39179: Test shuffle of data with multiple location also check
+        # shuffle locations get randomized
+
+        with tempfile.TemporaryDirectory() as tempdir1, tempfile.TemporaryDirectory() as tempdir2:
+            original = None
+            if "SPARK_LOCAL_DIRS" in os.environ:
+                original = os.environ["SPARK_LOCAL_DIRS"]

Review Comment:
   done






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36701: [SPARK-39179][PYTHON][TESTS] Improve the test coverage for pyspark/shuffle.py

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36701:
URL: https://github.com/apache/spark/pull/36701#discussion_r884365865


##########
python/pyspark/tests/test_shuffle.py:
##########
@@ -117,6 +171,34 @@ def legit_merge_combiners(x, y):
             m.mergeCombiners(map(lambda x_y1: (x_y1[0], [x_y1[1]]), data))
 
 
+class ExternalGroupByTests(unittest.TestCase):
+    def setUp(self):
+        self.N = 1 << 20
+        values = [i for i in range(self.N)]
+        keys = [i for i in range(2)]
+        import itertools
+
+        self.data = [value for value in itertools.product(keys, values)]
+        self.agg = Aggregator(
+            lambda x: [x], lambda x, y: x.append(y) or x, lambda x, y: x.extend(y) or x
+        )
+
+    def test_medium_dataset(self):
+        # SPARK-39179: Test external group by for medium dataset
+        m = ExternalGroupBy(self.agg, 5, partitions=3)
+        m.mergeValues(self.data)
+        self.assertTrue(m.spills >= 1)
+        self.assertEqual(sum(sum(v) for k, v in m.items()), 2 * sum(range(self.N)))
+
+    def test_dataset_with_keys_are_unsorted(self):
+        # SPARK-39179: Test external group when numbers of keys are greater than SORT KEY Limit.
+        m = ExternalGroupBy(self.agg, 5, partitions=3)
+        m.SORT_KEY_LIMIT = 1

Review Comment:
   Let's restore this to the original value via try/finally.





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36701: [SPARK-39179][PYTHON][TESTS] Improve the test coverage for pyspark/shuffle.py

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36701:
URL: https://github.com/apache/spark/pull/36701#discussion_r884365180


##########
python/pyspark/tests/test_shuffle.py:
##########
@@ -54,6 +61,53 @@ def test_medium_dataset(self):
         self.assertTrue(m.spills >= 1)
         self.assertEqual(sum(sum(v) for k, v in m.items()), sum(range(self.N)) * 3)
 
+    def test_shuffle_data_with_multiple_locations(self):
+        # SPARK-39179: Test shuffle of data with multiple location also check
+        # shuffle locations get randomized
+        import tempfile
+
+        with tempfile.TemporaryDirectory(dir="/tmp") as tempdir1, tempfile.TemporaryDirectory(
+            dir="/tmp"
+        ) as tempdir2:
+            import os
+
+            os.environ["SPARK_LOCAL_DIRS"] = tempdir1 + "," + tempdir2

Review Comment:
   We should probably restore this to the original value after the test (e.g., via try/finally).





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36701: [SPARK-39179][PYTHON][TESTS] Improve the test coverage for pyspark/shuffle.py

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36701:
URL: https://github.com/apache/spark/pull/36701#discussion_r884364883


##########
python/pyspark/tests/test_shuffle.py:
##########
@@ -54,6 +61,53 @@ def test_medium_dataset(self):
         self.assertTrue(m.spills >= 1)
         self.assertEqual(sum(sum(v) for k, v in m.items()), sum(range(self.N)) * 3)
 
+    def test_shuffle_data_with_multiple_locations(self):
+        # SPARK-39179: Test shuffle of data with multiple location also check
+        # shuffle locations get randomized
+        import tempfile
+
+        with tempfile.TemporaryDirectory(dir="/tmp") as tempdir1, tempfile.TemporaryDirectory(

Review Comment:
   I think you can omit `dir="/tmp"`



##########
python/pyspark/tests/test_shuffle.py:
##########
@@ -54,6 +61,53 @@ def test_medium_dataset(self):
         self.assertTrue(m.spills >= 1)
         self.assertEqual(sum(sum(v) for k, v in m.items()), sum(range(self.N)) * 3)
 
+    def test_shuffle_data_with_multiple_locations(self):
+        # SPARK-39179: Test shuffle of data with multiple location also check
+        # shuffle locations get randomized
+        import tempfile

Review Comment:
   Let's import this on the top.





[GitHub] [spark] HyukjinKwon commented on pull request #36701: [SPARK-39179][PYTHON][TESTS] Improve the test coverage for pyspark/shuffle.py

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #36701:
URL: https://github.com/apache/spark/pull/36701#issuecomment-1140600286

   Hey, thanks for working on this. I will take another look in a few more days.




[GitHub] [spark] pralabhkumar commented on pull request #36701: [SPARK-39179][PYTHON][TESTS] Improve the test coverage for pyspark/shuffle.py

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on PR #36701:
URL: https://github.com/apache/spark/pull/36701#issuecomment-1143066365

   @HyukjinKwon I have made the changes as suggested.




[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36701: [SPARK-39179][PYTHON][TESTS] Improve the test coverage for pyspark/shuffle.py

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36701:
URL: https://github.com/apache/spark/pull/36701#discussion_r888738083


##########
python/pyspark/tests/test_shuffle.py:
##########
@@ -54,6 +63,59 @@ def test_medium_dataset(self):
         self.assertTrue(m.spills >= 1)
         self.assertEqual(sum(sum(v) for k, v in m.items()), sum(range(self.N)) * 3)
 
+    def test_shuffle_data_with_multiple_locations(self):
+        # SPARK-39179: Test shuffle of data with multiple location also check
+        # shuffle locations get randomized
+
+        with tempfile.TemporaryDirectory() as tempdir1, tempfile.TemporaryDirectory() as tempdir2:
+            original = None
+            if "SPARK_LOCAL_DIRS" in os.environ:
+                original = os.environ["SPARK_LOCAL_DIRS"]

Review Comment:
   you can use `os.environ.get("SPARK_LOCAL_DIRS", None)`
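   As a quick illustration of the suggested call (using a throwaway variable name rather than `SPARK_LOCAL_DIRS` itself):
   
   ```python
   import os
   
   # os.environ.get returns the current value, or the default (None here) when
   # the variable is unset, replacing the explicit "in os.environ" membership
   # check. "DEMO_LOCAL_DIRS" is a throwaway name used only for illustration.
   os.environ.pop("DEMO_LOCAL_DIRS", None)            # make sure it starts unset
   before = os.environ.get("DEMO_LOCAL_DIRS", None)   # unset -> None
   
   os.environ["DEMO_LOCAL_DIRS"] = "/tmp/x,/tmp/y"
   after = os.environ.get("DEMO_LOCAL_DIRS", None)    # set -> its value
   del os.environ["DEMO_LOCAL_DIRS"]                  # leave the env clean
   ```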




