Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/10/14 04:45:41 UTC

[GitHub] [spark] dchvn opened a new pull request #34281: [SPARK-37002][PYTHON] Introduce the 'compute.check_identical_indices' option

dchvn opened a new pull request #34281:
URL: https://github.com/apache/spark/pull/34281


   ### What changes were proposed in this pull request?
   Introduce the ```compute.check_identical_indices``` option, with the default value of ```True```.
   
   ### Why are the changes needed?
   Checking whether two indices are identical is an expensive operation. ```compute.check_identical_indices``` is introduced to control whether this check is performed.
   
   ### Does this PR introduce _any_ user-facing change?
   ```python
   >>> ps.get_option('compute.check_identical_indices')
   True
   >>> ps.set_option('compute.check_identical_indices', False)
   >>> ps.get_option('compute.check_identical_indices')
   False
   >>> ps.set_option('compute.check_identical_indices', 1)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/u02/spark/python/pyspark/pandas/config.py", line 348, in set_option
       _options_dict[key].validate(value)
     File "/u02/spark/python/pyspark/pandas/config.py", line 104, in validate
       raise TypeError(
   TypeError: The value for option 'compute.check_identical_indices' was <class 'int'>; however, expected types are [<class 'bool'>].
   >>> ps.reset_option('compute.check_identical_indices')
   >>> ps.get_option('compute.check_identical_indices')
   True
   ```
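
The session above relies on each option declaring the types it accepts. The following is a minimal sketch of how such a type-validated option could behave. The `Option` class, the `_options` and `_overrides` dictionaries, and the three helper functions are hypothetical stand-ins written for illustration; they are not the actual `pyspark.pandas.config` implementation.

```python
from typing import Any, Dict, Tuple, Type, Union


class Option:
    """Hypothetical typed option: a key, a default value, and the accepted types."""

    def __init__(self, key: str, default: Any, types: Union[Type, Tuple[Type, ...]]):
        self.key = key
        self.default = default
        self.types = types if isinstance(types, tuple) else (types,)

    def validate(self, value: Any) -> None:
        # isinstance(1, bool) is False, so an int is rejected for a bool option.
        if not isinstance(value, self.types):
            raise TypeError(
                "The value for option '%s' was %s; however, expected types are [%s]."
                % (self.key, type(value), ", ".join(str(t) for t in self.types))
            )


# Hypothetical registry of known options and of user overrides.
_options: Dict[str, Option] = {
    "compute.check_identical_indices": Option(
        "compute.check_identical_indices", default=True, types=bool
    ),
}
_overrides: Dict[str, Any] = {}


def get_option(key: str) -> Any:
    # Fall back to the declared default when no override has been set.
    return _overrides.get(key, _options[key].default)


def set_option(key: str, value: Any) -> None:
    # Validate first so that a rejected value never replaces the current one.
    _options[key].validate(value)
    _overrides[key] = value


def reset_option(key: str) -> None:
    # Dropping the override restores the default.
    _overrides.pop(key, None)
```

Validating before storing means that a rejected value, such as the integer `1` in the session above, leaves the previous setting untouched.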
   ### How was this patch tested?
   Existing tests


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #34281: [SPARK-37002][PYTHON] Introduce the 'compute.eager_check' option

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #34281:
URL: https://github.com/apache/spark/pull/34281#discussion_r731449497



##########
File path: python/pyspark/pandas/config.py
##########
@@ -194,6 +194,18 @@ def validate(self, v: Any) -> None:
         default=False,
         types=bool,
     ),
+    Option(
+        key="compute.check_identical_indices",

Review comment:
       @dchvn would you mind creating an umbrella JIRA to apply this configuration across the whole codebase? What we need to do is basically:
   
   1. Make every input validation like this covered by the new configuration. For example:
   
       ```diff
       - a == b
       + def eager_check(f):  # Utility function
       +     return not config.compute.eager_check or f()  # skip f() when the option is off
       +
       + eager_check(lambda: a == b)
       ```
   
   2. We should check whether the output still makes sense even when the behaviour does not match pandas'. If the output does not make sense, we shouldn't cover it with this configuration.
   
   3. Make this configuration enabled by default so we match the behaviour to pandas' by default.






[GitHub] [spark] AmplabJenkins commented on pull request #34281: [SPARK-37002][PYTHON] Introduce the 'compute.check_identical_indices' option

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34281:
URL: https://github.com/apache/spark/pull/34281#issuecomment-943082073


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48718/
   




[GitHub] [spark] SparkQA commented on pull request #34281: [SPARK-37002][PYTHON] Introduce the 'compute.check_identical_indices' option

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34281:
URL: https://github.com/apache/spark/pull/34281#issuecomment-942988817


   **[Test build #144238 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144238/testReport)** for PR 34281 at commit [`2a11312`](https://github.com/apache/spark/commit/2a11312b02b6301f7f45328a011e8ddd983cc4d0).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins commented on pull request #34281: [SPARK-37002][PYTHON] Introduce the 'compute.check_identical_indices' option

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34281:
URL: https://github.com/apache/spark/pull/34281#issuecomment-943001705


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144238/
   




[GitHub] [spark] itholic commented on pull request #34281: [SPARK-37002][PYTHON] Introduce the 'compute.eager_check' option

Posted by GitBox <gi...@apache.org>.
itholic commented on pull request #34281:
URL: https://github.com/apache/spark/pull/34281#issuecomment-946380487


   LGTM too, once the existing comments are resolved.




[GitHub] [spark] SparkQA commented on pull request #34281: [SPARK-37002][PYTHON] Introduce the 'compute.check_identical_indices' option

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34281:
URL: https://github.com/apache/spark/pull/34281#issuecomment-943003869


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48718/
   




[GitHub] [spark] AmplabJenkins commented on pull request #34281: [SPARK-37002][PYTHON] Introduce the 'compute.eager_check' option

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34281:
URL: https://github.com/apache/spark/pull/34281#issuecomment-945355940


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144343/
   




[GitHub] [spark] HyukjinKwon closed pull request #34281: [SPARK-37002][PYTHON] Introduce the 'compute.eager_check' option

Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #34281:
URL: https://github.com/apache/spark/pull/34281


   




[GitHub] [spark] SparkQA commented on pull request #34281: [SPARK-37002][PYTHON] Introduce the 'compute.eager_check' option

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34281:
URL: https://github.com/apache/spark/pull/34281#issuecomment-945352942


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48822/
   




[GitHub] [spark] dchvn commented on pull request #34281: [SPARK-37002][PYTHON] Introduce the 'compute.eager_check' option

Posted by GitBox <gi...@apache.org>.
dchvn commented on pull request #34281:
URL: https://github.com/apache/spark/pull/34281#issuecomment-947244910


   Thank you!




[GitHub] [spark] HyukjinKwon commented on a change in pull request #34281: [SPARK-37002][PYTHON] Introduce the 'compute.eager_check' option

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #34281:
URL: https://github.com/apache/spark/pull/34281#discussion_r731449689



##########
File path: python/docs/source/user_guide/pandas_on_spark/options.rst
##########
@@ -280,6 +280,13 @@ compute.ordered_head            False          'compute.ordered_head' sets wheth
                                                'compute.ordered_head' is set to True, pandas-on-
                                                Spark performs natural ordering beforehand, but it
                                                will cause a performance overhead.
+compute.eager_check             True           'compute.eager_check' sets whether or not to launch
+                                               some Spark jobs just for the sake of validation. If
+                                               'compute.eager_check' is set to True, pandas-on-Spark
+                                               performs the validation beforehand, but it will cause
+                                               a performance overhead. Otherwise, pandas-on-Spark
+                                               skip the validation and will be slightly different
+                                               from pandas

Review comment:
       In addition, let's make sure to list which APIs are affected here in the description.






[GitHub] [spark] dchvn commented on a change in pull request #34281: [SPARK-37002][PYTHON] Introduce the 'compute.eager_check' option

Posted by GitBox <gi...@apache.org>.
dchvn commented on a change in pull request #34281:
URL: https://github.com/apache/spark/pull/34281#discussion_r731548782



##########
File path: python/pyspark/pandas/config.py
##########
@@ -194,6 +194,18 @@ def validate(self, v: Any) -> None:
         default=False,
         types=bool,
     ),
+    Option(
+        key="compute.check_identical_indices",

Review comment:
       Created, https://issues.apache.org/jira/browse/SPARK-37055






[GitHub] [spark] SparkQA removed a comment on pull request #34281: [SPARK-37002][PYTHON] Introduce the 'compute.check_identical_indices' option

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34281:
URL: https://github.com/apache/spark/pull/34281#issuecomment-942957930


   **[Test build #144238 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144238/testReport)** for PR 34281 at commit [`2a11312`](https://github.com/apache/spark/commit/2a11312b02b6301f7f45328a011e8ddd983cc4d0).




[GitHub] [spark] HyukjinKwon commented on a change in pull request #34281: [SPARK-37002][PYTHON] Introduce the 'compute.check_identical_indices' option

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #34281:
URL: https://github.com/apache/spark/pull/34281#discussion_r730197291



##########
File path: python/pyspark/pandas/config.py
##########
@@ -194,6 +194,18 @@ def validate(self, v: Any) -> None:
         default=False,
         types=bool,
     ),
+    Option(
+        key="compute.check_identical_indices",

Review comment:
       I think we should probably have a configuration like `compute.eager_check`, and apply it everywhere that Spark jobs would otherwise be launched just for the sake of validation. WDYT @ueshin @itholic @xinrong-databricks?






[GitHub] [spark] dchvn commented on a change in pull request #34281: [SPARK-37002][PYTHON] Introduce the 'compute.eager_check' option

Posted by GitBox <gi...@apache.org>.
dchvn commented on a change in pull request #34281:
URL: https://github.com/apache/spark/pull/34281#discussion_r731546496



##########
File path: python/docs/source/user_guide/pandas_on_spark/options.rst
##########
@@ -280,6 +280,13 @@ compute.ordered_head            False          'compute.ordered_head' sets wheth
                                                'compute.ordered_head' is set to True, pandas-on-
                                                Spark performs natural ordering beforehand, but it
                                                will cause a performance overhead.
+compute.eager_check             True           'compute.eager_check' sets whether or not to launch
+                                               some Spark jobs just for the sake of validation. If
+                                               'compute.eager_check' is set to True, pandas-on-Spark
+                                               performs the validation beforehand, but it will cause
+                                               a performance overhead. Otherwise, pandas-on-Spark
+                                               skip the validation and will be slightly different
+                                               from pandas

Review comment:
       I think we should add the affected APIs to the description as they are modified?






[GitHub] [spark] HyukjinKwon commented on pull request #34281: [SPARK-37002][PYTHON] Introduce the 'compute.eager_check' option

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #34281:
URL: https://github.com/apache/spark/pull/34281#issuecomment-946318799


   With the comments above, LGTM




[GitHub] [spark] dchvn commented on a change in pull request #34281: [SPARK-37002][PYTHON] Introduce the 'compute.eager_check' option

Posted by GitBox <gi...@apache.org>.
dchvn commented on a change in pull request #34281:
URL: https://github.com/apache/spark/pull/34281#discussion_r730748668



##########
File path: python/pyspark/pandas/config.py
##########
@@ -194,6 +194,18 @@ def validate(self, v: Any) -> None:
         default=False,
         types=bool,
     ),
+    Option(
+        key="compute.check_identical_indices",

Review comment:
       Updated. Can you take another look? Thanks!








[GitHub] [spark] SparkQA commented on pull request #34281: [SPARK-37002][PYTHON] Introduce the 'compute.eager_check' option

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34281:
URL: https://github.com/apache/spark/pull/34281#issuecomment-945347402


   **[Test build #144343 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144343/testReport)** for PR 34281 at commit [`191774f`](https://github.com/apache/spark/commit/191774f789abccd0692cffc28630405416c703dd).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.






[GitHub] [spark] SparkQA commented on pull request #34281: [SPARK-37002][PYTHON] Introduce the 'compute.check_identical_indices' option

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34281:
URL: https://github.com/apache/spark/pull/34281#issuecomment-942957930


   **[Test build #144238 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144238/testReport)** for PR 34281 at commit [`2a11312`](https://github.com/apache/spark/commit/2a11312b02b6301f7f45328a011e8ddd983cc4d0).




[GitHub] [spark] HyukjinKwon commented on a change in pull request #34281: [SPARK-37002][PYTHON] Introduce the 'compute.eager_check' option

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #34281:
URL: https://github.com/apache/spark/pull/34281#discussion_r732353850



##########
File path: python/docs/source/user_guide/pandas_on_spark/options.rst
##########
@@ -280,6 +280,13 @@ compute.ordered_head            False          'compute.ordered_head' sets wheth
                                                'compute.ordered_head' is set to True, pandas-on-
                                                Spark performs natural ordering beforehand, but it
                                                will cause a performance overhead.
+compute.eager_check             True           'compute.eager_check' sets whether or not to launch
+                                               some Spark jobs just for the sake of validation. If
+                                               'compute.eager_check' is set to True, pandas-on-Spark
+                                               performs the validation beforehand, but it will cause
+                                               a performance overhead. Otherwise, pandas-on-Spark
+                                               skip the validation and will be slightly different
+                                               from pandas

Review comment:
       yup






[GitHub] [spark] dchvn commented on a change in pull request #34281: [SPARK-37002][PYTHON] Introduce the 'compute.eager_check' option

Posted by GitBox <gi...@apache.org>.
dchvn commented on a change in pull request #34281:
URL: https://github.com/apache/spark/pull/34281#discussion_r731548073



##########
File path: python/pyspark/pandas/config.py
##########
@@ -194,6 +194,18 @@ def validate(self, v: Any) -> None:
         default=False,
         types=bool,
     ),
+    Option(
+        key="compute.check_identical_indices",

Review comment:
       Many thanks! I will try to find where we can apply this.






[GitHub] [spark] SparkQA commented on pull request #34281: [SPARK-37002][PYTHON] Introduce the 'compute.check_identical_indices' option

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34281:
URL: https://github.com/apache/spark/pull/34281#issuecomment-943082004


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48718/
   




[GitHub] [spark] SparkQA commented on pull request #34281: [SPARK-37002][PYTHON] Introduce the 'compute.eager_check' option

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34281:
URL: https://github.com/apache/spark/pull/34281#issuecomment-945335307


   **[Test build #144343 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144343/testReport)** for PR 34281 at commit [`191774f`](https://github.com/apache/spark/commit/191774f789abccd0692cffc28630405416c703dd).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #34281: [SPARK-37002][PYTHON] Introduce the 'compute.eager_check' option

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34281:
URL: https://github.com/apache/spark/pull/34281#issuecomment-945355940


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144343/
   




[GitHub] [spark] SparkQA commented on pull request #34281: [SPARK-37002][PYTHON] Introduce the 'compute.eager_check' option

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34281:
URL: https://github.com/apache/spark/pull/34281#issuecomment-945376463


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48822/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #34281: [SPARK-37002][PYTHON] Introduce the 'compute.check_identical_indices' option

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34281:
URL: https://github.com/apache/spark/pull/34281#issuecomment-943082073


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48718/
   




[GitHub] [spark] SparkQA removed a comment on pull request #34281: [SPARK-37002][PYTHON] Introduce the 'compute.eager_check' option

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34281:
URL: https://github.com/apache/spark/pull/34281#issuecomment-945335307


   **[Test build #144343 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144343/testReport)** for PR 34281 at commit [`191774f`](https://github.com/apache/spark/commit/191774f789abccd0692cffc28630405416c703dd).




[GitHub] [spark] AmplabJenkins commented on pull request #34281: [SPARK-37002][PYTHON] Introduce the 'compute.eager_check' option

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34281:
URL: https://github.com/apache/spark/pull/34281#issuecomment-945378902


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48822/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #34281: [SPARK-37002][PYTHON] Introduce the 'compute.eager_check' option

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34281:
URL: https://github.com/apache/spark/pull/34281#issuecomment-945378902


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48822/
   




[GitHub] [spark] HyukjinKwon commented on a change in pull request #34281: [SPARK-37002][PYTHON] Introduce the 'compute.eager_check' option

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #34281:
URL: https://github.com/apache/spark/pull/34281#discussion_r731449497



##########
File path: python/pyspark/pandas/config.py
##########
@@ -194,6 +194,18 @@ def validate(self, v: Any) -> None:
         default=False,
         types=bool,
     ),
+    Option(
+        key="compute.check_identical_indices",

Review comment:
       @dchvn would you mind creating an umbrella JIRA to apply this configuration across the whole codebase? Basically, what we need to do is:
   
   1. Cover every input validation like this one with the new configuration. For example:
   
       ```diff
       - a == b
       + def eager_check(f):
       +     return not config.compute.eager_check and f()
       + eager_check(lambda: a == b)
       ```
   
   2. We should check whether the output still makes sense even when the behaviour does not match pandas'. If the output does not make sense, we shouldn't cover that case with this configuration.
   
   3. Enable this configuration by default so that we match pandas' behaviour by default.






[GitHub] [spark] HyukjinKwon commented on pull request #34281: [SPARK-37002][PYTHON] Introduce the 'compute.eager_check' option

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #34281:
URL: https://github.com/apache/spark/pull/34281#issuecomment-947233668


   Merged to master.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #34281: [SPARK-37002][PYTHON] Introduce the 'compute.check_identical_indices' option

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34281:
URL: https://github.com/apache/spark/pull/34281#issuecomment-943001705


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144238/
   

