You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/04/26 07:12:46 UTC

[GitHub] [spark] sadhen opened a new pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

sadhen opened a new pull request #32332:
URL: https://github.com/apache/spark/pull/32332


   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error message, please read the guideline first:
        https://spark.apache.org/error-message-guidelines.html
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadhen edited a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
sadhen edited a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-829105798


   > BTW, can we avoid doing the verification twice (#32332 (comment)) too?
   
   Yes. The original implementation mixed schema verification with `prepare` to avoid performance loss.
   
   Let me make it clear in pseudo code to describe what I have done:
   
   
   switch schema:
   + case DataType --> required `prepare` -> optional schema verification
   + case None/a list/tuple of names: --> require `schema inferences` and required conversion
   + case StructType --> optional schema verification
   
   
   ---->
   
   
   switch schema:
   + case DataType --> required `prepare` -> optional schema verification
   + case None/a list/tuple of names: --> required `schema inferences` and optional verification and required conversion
   + case StructType --> optional schema verification
   
   
   This part is a bit complicated because, data can be RDD/map/list/... and schema can be DataType/StructType/None/list of names. Too many special cases.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-826520847


   @sadhen, sorry I am trying to understand the fix and problem. Would you mind filling the Pr description and adding a test case for that?
   
   The UDT has its backed SQL type at `sqlType`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadhen commented on a change in pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
sadhen commented on a change in pull request #32332:
URL: https://github.com/apache/spark/pull/32332#discussion_r619935979



##########
File path: python/pyspark/sql/session.py
##########
@@ -695,9 +704,10 @@ def prepare(obj):
             prepare = lambda obj: obj

Review comment:
       See https://github.com/apache/spark/blob/6f782efb044403ab3ca79662fdeca7f1f906e1bf/python/pyspark/sql/pandas/conversion.py#L312
   
   @HyukjinKwon  That's why we need to do verification after the schema is inferred.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] darcy-shen commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
darcy-shen commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-888836831


   Rebase on master and CI has passed. Please re-review @viirya 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] darcy-shen commented on a change in pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
darcy-shen commented on a change in pull request #32332:
URL: https://github.com/apache/spark/pull/32332#discussion_r676400183



##########
File path: python/pyspark/sql/session.py
##########
@@ -695,9 +704,10 @@ def prepare(obj):
             prepare = lambda obj: obj

Review comment:
       > Seems like it does twice when the schema is specified.
   
   When the schema is specified, `_inferSchema` and `_create_verified_converter ` will not be invoked 
   
   https://github.com/apache/spark/blob/c561ee686551690bee689f37ae5bbd75119994d6/python/pyspark/sql/session.py#L490-493

##########
File path: python/pyspark/sql/session.py
##########
@@ -695,9 +704,10 @@ def prepare(obj):
             prepare = lambda obj: obj

Review comment:
       > Seems like it does twice when the schema is specified.
   
   When the schema is specified, `_inferSchema` and `_create_verified_converter ` will not be invoked 
   
   https://github.com/apache/spark/blob/c561ee686551690bee689f37ae5bbd75119994d6/python/pyspark/sql/session.py#L490:493

##########
File path: python/pyspark/sql/session.py
##########
@@ -695,9 +704,10 @@ def prepare(obj):
             prepare = lambda obj: obj

Review comment:
       > Seems like it does twice when the schema is specified.
   
   When the schema is specified, `_inferSchema` and `_create_verified_converter ` will not be invoked 
   
   https://github.com/apache/spark/blob/c561ee686551690bee689f37ae5bbd75119994d6/python/pyspark/sql/session.py#L490:L493




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] darcy-shen commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
darcy-shen commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-889708732


   @HyukjinKwon  @viirya @BryanCutler 
   
   Please do another round of review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #32332:
URL: https://github.com/apache/spark/pull/32332#discussion_r678904955



##########
File path: python/pyspark/sql/session.py
##########
@@ -599,6 +609,9 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr
             the sample ratio of rows used for inferring
         verifySchema : bool, optional
             verify data types of every row against schema. Enabled by default.
+            Specifically, if schema is provided, schema verification will be performed at the
+            preparation phase; if schema is not provided and can be inferred (eg. from UDT),
+            schema verification will be performed at the convertion phase.

Review comment:
       There is no such `preparation` phase exposed to users. I think it is confusing to mention the schema verification is performed at preparation phase. Could we remove it?
   
   E.g. "Specifically, if schema is provided, schema verification will be performed against the provided schema; if schema is not provided and can be inferred (eg. from UDT), schema verification will be performed against the inferred schema."




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-826281828


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137911/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-889583058


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141866/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-826307360


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137917/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #32332:
URL: https://github.com/apache/spark/pull/32332#discussion_r624859613



##########
File path: python/pyspark/sql/session.py
##########
@@ -695,9 +704,10 @@ def prepare(obj):
             prepare = lambda obj: obj

Review comment:
       You could just pass extra boolean, and do the verification only in the case at this line (when the verification is not done).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-826281679






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-889250439


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141846/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-828113580


   Ah, okay. Thanks for clarification. So the problem is we're literally not doing the validation when schema is inferred.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-886507746


   **[Test build #141624 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141624/testReport)** for PR 32332 at commit [`ee519c0`](https://github.com/apache/spark/commit/ee519c00e486c82aaf358e4c49ca43d675637198).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-828885872


   Ah, I meant to disable it only when `schema` is `None`, meaning that keeping the same default behaviour.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-945444898


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144355/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-887157079


   **[Test build #141676 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141676/testReport)** for PR 32332 at commit [`675c254`](https://github.com/apache/spark/commit/675c2540f69b954f37810f41c0d1704d790fe49a).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-889253804


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141847/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-887202286


   **[Test build #141683 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141683/testReport)** for PR 32332 at commit [`8ddf243`](https://github.com/apache/spark/commit/8ddf24309623be15ae6e4a5a9ea52dcfe8069792).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadhen commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
sadhen commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-829112079


   The refactored https://github.com/apache/spark/pull/32320 using `inner_map` to decrease the complexity of RDD and python collection.
   
   The only weird part is:
   ```
               # make sure data could consumed multiple times
               if not isinstance(data, RDD) and not isinstance(data, list):
                   data = list(data)
   ```
   Because some data can only be iterated once.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] darcy-shen commented on a change in pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
darcy-shen commented on a change in pull request #32332:
URL: https://github.com/apache/spark/pull/32332#discussion_r679265861



##########
File path: python/pyspark/sql/tests/test_types.py
##########
@@ -175,6 +175,15 @@ def __init__(self):
         ]
         self.assertEqual(actual, expected)
 
+    # SPARK-35211: inferred schema verification

Review comment:
       Extra unit tests to check the verified schema




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadhen commented on a change in pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
sadhen commented on a change in pull request #32332:
URL: https://github.com/apache/spark/pull/32332#discussion_r619935979



##########
File path: python/pyspark/sql/session.py
##########
@@ -695,9 +704,10 @@ def prepare(obj):
             prepare = lambda obj: obj

Review comment:
       See https://github.com/apache/spark/blob/6f782efb044403ab3ca79662fdeca7f1f906e1bf/python/pyspark/sql/pandas/conversion.py#L312:L314
   
   @HyukjinKwon  That's why we need to do verification after the schema is inferred.

##########
File path: python/pyspark/sql/session.py
##########
@@ -695,9 +704,10 @@ def prepare(obj):
             prepare = lambda obj: obj

Review comment:
       See https://github.com/apache/spark/blob/6f782efb044403ab3ca79662fdeca7f1f906e1bf/python/pyspark/sql/pandas/conversion.py#L312-L314
   
   @HyukjinKwon  That's why we need to do verification after the schema is inferred.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadhen edited a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
sadhen edited a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-828405772


   I tried to turn off `verifySchema` by default:
   ```
   -    def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=True):
   +    def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=False):
   
   ... other doc/tests changes
   ```
   
   But correctness should be much more important than performance. We can but we'd better keep the same semantics for user.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-886486023


   **[Test build #141624 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141624/testReport)** for PR 32332 at commit [`ee519c0`](https://github.com/apache/spark/commit/ee519c00e486c82aaf358e4c49ca43d675637198).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadhen commented on a change in pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
sadhen commented on a change in pull request #32332:
URL: https://github.com/apache/spark/pull/32332#discussion_r619936308



##########
File path: python/pyspark/sql/session.py
##########
@@ -695,9 +704,10 @@ def prepare(obj):
             prepare = lambda obj: obj

Review comment:
       1. create from pandas, schema is a list.
   2. create directly without schema, schema is None
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-889572660


   **[Test build #141866 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141866/testReport)** for PR 32332 at commit [`1807997`](https://github.com/apache/spark/commit/18079975ca65c73b494b1808012ffe0176751a03).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-889309855


   **[Test build #141849 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141849/testReport)** for PR 32332 at commit [`4dd3f57`](https://github.com/apache/spark/commit/4dd3f576812f3695f511600ea10e2016f5f78b09).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] darcy-shen commented on a change in pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
darcy-shen commented on a change in pull request #32332:
URL: https://github.com/apache/spark/pull/32332#discussion_r679585531



##########
File path: python/pyspark/sql/tests/test_dataframe.py
##########
@@ -759,6 +759,30 @@ def test_create_dataframe_from_pandas_with_dst(self):
                 os.environ['TZ'] = orig_env_tz
             time.tzset()
 
+    # SPARK-35211: inferred schema verification
+    @unittest.skipIf(not have_pandas, pandas_requirement_message)  # type: ignore
+    def test_create_dataframe_from_pandas_with_mismatched_udt(self):
+        from pyspark.testing.sqlutils import ExamplePoint
+        import pandas as pd
+        pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
+        # The underlying sqlType of ExamplePoint is ArrayType(DoubleType(), False)
+        # There is a data type mismatch between 1 and DoubleType, a TypeError is expected
+        with self.assertRaises(TypeError):
+            with self.sql_conf({"spark.sql.execution.arrow.pyspark.enabled": "false"}):
+                self.spark.createDataFrame(pdf)
+        # With verifySchema disabled, there will be unexpected behaviours
+        self.assertEquals(self.spark.createDataFrame(pdf, verifySchema=False).collect(),
+                          [Row(point=ExamplePoint(0.0, 0.0)), Row(point=ExamplePoint(0.0, 0.0))])

Review comment:
       added




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-889564827


   **[Test build #141866 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141866/testReport)** for PR 32332 at commit [`1807997`](https://github.com/apache/spark/commit/18079975ca65c73b494b1808012ffe0176751a03).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-886625575


   I think this looks okay (as far as I remember from the last discussion) but would be great to have a second look from @BryanCutler, @ueshin or @viirya.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-888307416


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141775/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #32332:
URL: https://github.com/apache/spark/pull/32332#discussion_r621784594



##########
File path: python/pyspark/sql/session.py
##########
@@ -478,13 +478,22 @@ def _inferSchema(self, rdd, samplingRatio=None, names=None):
             schema = rdd.map(lambda row: _infer_schema(row, names)).reduce(_merge_type)
         return schema
 
-    def _createFromRDD(self, rdd, schema, samplingRatio):
+    def _create_verified_converter(self, schema, verifySchema):
+        converter = _create_converter(schema)
+        verify_func = _make_type_verifier(schema) if verifySchema else lambda _: True
+
+        def verify_and_convert(obj):
+            verify_func(obj)

Review comment:
       I still think we should avoid doing this again though. the verification func is pretty expensive




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-826323626






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-826671823


   **[Test build #137951 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137951/testReport)** for PR 32332 at commit [`1bbbf49`](https://github.com/apache/spark/commit/1bbbf49150a2728a4081cb1ea7ef472c383a5f31).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-889252887


   **[Test build #141847 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141847/testReport)** for PR 32332 at commit [`a752d68`](https://github.com/apache/spark/commit/a752d68ebc3060df8ecbbe10024abb008fd63d9e).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadhen edited a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
sadhen edited a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-826519391


   > I think you should fix _make_type_verifier to handle UDT
   
   In this case, schema is not provided and it will be inferred from the dataset. The inferred schema actually does not contain UDT.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-888285445


   **[Test build #141775 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141775/testReport)** for PR 32332 at commit [`be5e9d8`](https://github.com/apache/spark/commit/be5e9d8dd5d2cc581b54e22b34f5c503981cb942).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] darcy-shen commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
darcy-shen commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-945746425


   Squashed and rebased on the latest master and conflicts resolved again.
   
   @HyukjinKwon @viirya @BryanCutler 
   
   This PR improves user experience, could you review it again?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] closed pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #32332:
URL: https://github.com/apache/spark/pull/32332


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #32332:
URL: https://github.com/apache/spark/pull/32332#discussion_r624859160



##########
File path: python/pyspark/sql/session.py
##########
@@ -695,9 +704,10 @@ def prepare(obj):
             prepare = lambda obj: obj

Review comment:
       Hm, here the verification is done at `prepare` once. After that you do it one more time at L483. Seems like it does twice when the schema is specified.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadhen edited a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
sadhen edited a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-826562737


   One of the unit tests has been described in REPL format above in the PR description part. It would be helpful to understand what this PR actually solved.
   
   Give me some time to add the unit tests for the schema verification part.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-887161989


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141676/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-887202286


   **[Test build #141683 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141683/testReport)** for PR 32332 at commit [`8ddf243`](https://github.com/apache/spark/commit/8ddf24309623be15ae6e4a5a9ea52dcfe8069792).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] darcy-shen commented on a change in pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
darcy-shen commented on a change in pull request #32332:
URL: https://github.com/apache/spark/pull/32332#discussion_r679575072



##########
File path: python/pyspark/sql/session.py
##########
@@ -697,12 +712,17 @@ def prepare(obj):
                 verify_func(obj)
                 return obj,
         else:
+            no_need_to_prepare = True
             prepare = lambda obj: obj
 
         if isinstance(data, RDD):
-            rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
+            rdd, schema = self._createFromRDD(
+                data if no_need_to_prepare else data.map(prepare),

Review comment:
       There are cases when verifySchema is True and the `prepare` function is the identity function.
   
   That's why we need the `no_need_to_prepare` flag.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-889253780


   **[Test build #141847 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141847/testReport)** for PR 32332 at commit [`a752d68`](https://github.com/apache/spark/commit/a752d68ebc3060df8ecbbe10024abb008fd63d9e).
    * This patch **fails Python style tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-1051388711


   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadhen edited a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
sadhen edited a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-829105798


   > BTW, can we avoid doing the verification twice (#32332 (comment)) too?
   
   Yes. The original implementation mixed schema verification with `prepare` to avoid performance loss.
   
   Let me make it clear in pseudo code to describe what I have done:
   
   
   switch schema:
   + case DataType --> require prepare -> optional schema verification
   + case None/a list/tuple of names: --> require schema inferences and conversion
   + case StructType --> optional schema verification
   
   
   ---->
   
   
   switch schema:
   + case DataType --> required `prepare` -> optional schema verification
   + case None/a list/tuple of names: --> required `schema inferences` and optional verification and required conversion
   + case StructType --> optional schema verification
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadhen edited a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
sadhen edited a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-828135383


   > Can we turn off the verifySchema by default when the given schema is None?
   
   
   Yes.
   
   Let me improve this PR and update the corresponding default parameter, doc and unit tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-826281360






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-826281679


   **[Test build #137911 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137911/testReport)** for PR 32332 at commit [`d0a0942`](https://github.com/apache/spark/commit/d0a094245fdba72b97b2b721e1198ed62fd3b3da).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-917792028


   **[Test build #143184 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143184/testReport)** for PR 32332 at commit [`54553fd`](https://github.com/apache/spark/commit/54553fd062798c008254ddebe3987ef983b4e2ad).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-917811752


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143184/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-887211807


   **[Test build #141683 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141683/testReport)** for PR 32332 at commit [`8ddf243`](https://github.com/apache/spark/commit/8ddf24309623be15ae6e4a5a9ea52dcfe8069792).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-945417301


   **[Test build #144355 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144355/testReport)** for PR 32332 at commit [`86d71f9`](https://github.com/apache/spark/commit/86d71f9bcb08b21c1c71edc9f383e6965a07f755).
    * This patch **fails Python style tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-945446988


   **[Test build #144358 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144358/testReport)** for PR 32332 at commit [`5acf69d`](https://github.com/apache/spark/commit/5acf69d00e744eebf99df59dd60f89c7eeeb831a).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-945742469


   **[Test build #144372 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144372/testReport)** for PR 32332 at commit [`21afb7c`](https://github.com/apache/spark/commit/21afb7c118e65f83f31cbac5d89feab31d774286).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-826638087


   **[Test build #137947 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137947/testReport)** for PR 32332 at commit [`849163e`](https://github.com/apache/spark/commit/849163e9090e11055c2df399c419db7c9535e14d).
    * This patch **fails PySpark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadhen commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
sadhen commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-828135383


   > Can we turn off the verifySchema by default when the given schema is None?
   Yes.
   
   Let me improve this PR and update the corresponding default parameter, doc and unit tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] darcy-shen commented on a change in pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
darcy-shen commented on a change in pull request #32332:
URL: https://github.com/apache/spark/pull/32332#discussion_r676401500



##########
File path: python/pyspark/sql/session.py
##########
@@ -478,13 +478,22 @@ def _inferSchema(self, rdd, samplingRatio=None, names=None):
             schema = rdd.map(lambda row: _infer_schema(row, names)).reduce(_merge_type)
         return schema
 
-    def _createFromRDD(self, rdd, schema, samplingRatio):
+    def _create_verified_converter(self, schema, verifySchema):
+        converter = _create_converter(schema)
+        verify_func = _make_type_verifier(schema) if verifySchema else lambda _: True
+
+        def verify_and_convert(obj):
+            verify_func(obj)

Review comment:
       In the latest commit, I've introduce a flag and avoid invoking the identity `prepare` function.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-887217818


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141683/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-886528598


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141624/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #32332:
URL: https://github.com/apache/spark/pull/32332#discussion_r676968916



##########
File path: python/pyspark/sql/session.py
##########
@@ -697,12 +712,19 @@ def prepare(obj):
                 verify_func(obj)
                 return obj,
         else:
+            no_need_to_prepare = True
             prepare = lambda obj: obj
 
-        if isinstance(data, RDD):
-            rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
+        if isinstance(data, RDD) and no_need_to_prepare:
+            rdd, schema = self._createFromRDD(
+                data, schema, verifySchema, samplingRatio)
+        elif isinstance(data, RDD) and (not no_need_to_prepare):
+            rdd, schema = self._createFromRDD(
+                data.map(prepare), schema, verifySchema, samplingRatio)

Review comment:
       I feel adding another `if` against `no_need_to_prepare` under `if isinstance(data, RDD)` looks more readable, e.g.:
   
   ```python
   if isinstance(data, RDD):
     if no_need_to_prepare:
       rdd, schema = self._createFromRDD(
                   data, schema, verifySchema, samplingRatio)
     else:
       rdd, schema = self._createFromRDD(
                   data.map(prepare), schema, verifySchema, samplingRatio)
   else:
     if no_need_to_prepare:
     ...
   ```
   
   
   or
   
   ```python
   if isinstance(data, RDD):
     rdd, schema = self._createFromRDD(
                   data if no_need_to_prepare else data.map(prepare), schema, verifySchema, samplingRatio)
   else:
     ...
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-886486023


   **[Test build #141624 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141624/testReport)** for PR 32332 at commit [`ee519c0`](https://github.com/apache/spark/commit/ee519c00e486c82aaf358e4c49ca43d675637198).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-889249599


   **[Test build #141846 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141846/testReport)** for PR 32332 at commit [`54b81e3`](https://github.com/apache/spark/commit/54b81e319586f0ff8f17904017af1c9c2670d64b).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] darcy-shen commented on a change in pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
darcy-shen commented on a change in pull request #32332:
URL: https://github.com/apache/spark/pull/32332#discussion_r679585610



##########
File path: python/pyspark/sql/session.py
##########
@@ -483,13 +483,22 @@ def _inferSchema(self, rdd, samplingRatio=None, names=None):
                 row, names, infer_dict_as_struct=infer_dict_as_struct)).reduce(_merge_type)
         return schema
 
-    def _createFromRDD(self, rdd, schema, samplingRatio):
+    def _create_verified_converter(self, schema, verifySchema):
+        converter = _create_converter(schema)
+        verify_func = _make_type_verifier(schema) if verifySchema else lambda _: True
+
+        def verify_and_convert(obj):
+            verify_func(obj)
+            return converter(obj)
+        return verify_and_convert

Review comment:
       simplified now




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-889249599


   **[Test build #141846 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141846/testReport)** for PR 32332 at commit [`54b81e3`](https://github.com/apache/spark/commit/54b81e319586f0ff8f17904017af1c9c2670d64b).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-945416743


   **[Test build #144355 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144355/testReport)** for PR 32332 at commit [`86d71f9`](https://github.com/apache/spark/commit/86d71f9bcb08b21c1c71edc9f383e6965a07f755).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-826694852


   **[Test build #137951 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137951/testReport)** for PR 32332 at commit [`1bbbf49`](https://github.com/apache/spark/commit/1bbbf49150a2728a4081cb1ea7ef472c383a5f31).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-826521727


   e.g.) https://github.com/apache/spark/blob/0494dc90af48ce7da0625485a4dc6917a244d580/python/pyspark/testing/sqlutils.py#L77-L79
   
   `ExamplePointUDT` has its backed double array type so the input should better be Python floats.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-826622952


   **[Test build #137947 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137947/testReport)** for PR 32332 at commit [`849163e`](https://github.com/apache/spark/commit/849163e9090e11055c2df399c419db7c9535e14d).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-826299079


   **[Test build #137917 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137917/testReport)** for PR 32332 at commit [`36089e8`](https://github.com/apache/spark/commit/36089e8c7770df47c22d2feb4663adb653c3a80f).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #32332:
URL: https://github.com/apache/spark/pull/32332#discussion_r676969996



##########
File path: python/pyspark/sql/session.py
##########
@@ -681,12 +694,14 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr
 
     def _create_dataframe(self, data, schema, samplingRatio, verifySchema):
         if isinstance(schema, StructType):
+            no_need_to_prepare = not verifySchema
             verify_func = _make_type_verifier(schema) if verifySchema else lambda _: True
 
             def prepare(obj):
                 verify_func(obj)
                 return obj

Review comment:
       Hmm, doesn't `prepare` already verify the schema? Why we need to verify it again in converter?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-888269543


   **[Test build #141775 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141775/testReport)** for PR 32332 at commit [`be5e9d8`](https://github.com/apache/spark/commit/be5e9d8dd5d2cc581b54e22b34f5c503981cb942).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-889293948


   **[Test build #141849 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141849/testReport)** for PR 32332 at commit [`4dd3f57`](https://github.com/apache/spark/commit/4dd3f576812f3695f511600ea10e2016f5f78b09).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadhen commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
sadhen commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-828405772


   I tried to turn off `verifySchema` by default:
   ```
   -    def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=True):
   +    def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=False):
   ```
   
   But correctness should be much more important than performance. We can but we'd better keep the same semantics for user.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadhen commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
sadhen commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-826519391


   > I think you should fix _make_type_verifier to handle UDT
   
   In this case, schema is not provided and it will be inferred from the dataset. The inferred schema actually do not contain UDT.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-945452394


   **[Test build #144359 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144359/testReport)** for PR 32332 at commit [`4639a67`](https://github.com/apache/spark/commit/4639a674d68bcb7faa98b4f411e8c975cbe6f7cb).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-945463879


   **[Test build #144359 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144359/testReport)** for PR 32332 at commit [`4639a67`](https://github.com/apache/spark/commit/4639a674d68bcb7faa98b4f411e8c975cbe6f7cb).
    * This patch **fails PySpark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-945530368


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144359/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-945458175






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-945452394


   **[Test build #144359 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144359/testReport)** for PR 32332 at commit [`4639a67`](https://github.com/apache/spark/commit/4639a674d68bcb7faa98b4f411e8c975cbe6f7cb).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-945416743


   **[Test build #144355 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144355/testReport)** for PR 32332 at commit [`86d71f9`](https://github.com/apache/spark/commit/86d71f9bcb08b21c1c71edc9f383e6965a07f755).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-826281679






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #32332:
URL: https://github.com/apache/spark/pull/32332#discussion_r678905960



##########
File path: python/pyspark/sql/tests/test_dataframe.py
##########
@@ -759,6 +759,22 @@ def test_create_dataframe_from_pandas_with_dst(self):
                 os.environ['TZ'] = orig_env_tz
             time.tzset()
 
+    # SPARK-35211: inferred schema verification
+    @unittest.skipIf(not have_pandas, pandas_requirement_message)  # type: ignore
+    def test_create_dataframe_from_pandas_with_mismatched_udt(self):
+        from pyspark.testing.sqlutils import ExamplePoint
+        import pandas as pd
+        pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
+        # The underlying sqlType of ExamplePoint is ArrayType(DoubleType(), False)
+        # There is a data type mismatch between 1 and DoubleType, a TypeError is expected
+        with self.assertRaises(TypeError):
+            with self.sql_conf({"spark.sql.execution.arrow.pyspark.enabled": "false"}):
+                self.spark.createDataFrame(pdf)
+
+    def test_create_dataframe_with_mismatched_datatype(self):
+        with self.assertRaises(TypeError):
+            self.spark.createDataFrame([1, 2, 3], schema="a double")
+

Review comment:
       Could you call `createDataFrame`  with `verifySchema = False`?

##########
File path: python/pyspark/sql/tests/test_dataframe.py
##########
@@ -759,6 +759,22 @@ def test_create_dataframe_from_pandas_with_dst(self):
                 os.environ['TZ'] = orig_env_tz
             time.tzset()
 
+    # SPARK-35211: inferred schema verification
+    @unittest.skipIf(not have_pandas, pandas_requirement_message)  # type: ignore
+    def test_create_dataframe_from_pandas_with_mismatched_udt(self):
+        from pyspark.testing.sqlutils import ExamplePoint
+        import pandas as pd
+        pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
+        # The underlying sqlType of ExamplePoint is ArrayType(DoubleType(), False)
+        # There is a data type mismatch between 1 and DoubleType, a TypeError is expected
+        with self.assertRaises(TypeError):
+            with self.sql_conf({"spark.sql.execution.arrow.pyspark.enabled": "false"}):
+                self.spark.createDataFrame(pdf)
+
+    def test_create_dataframe_with_mismatched_datatype(self):
+        with self.assertRaises(TypeError):
+            self.spark.createDataFrame([1, 2, 3], schema="a double")
+

Review comment:
       Could you also add a call `createDataFrame`  with `verifySchema = False`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadhen commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
sadhen commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-829105798


   > BTW, can we avoid doing the verification twice (#32332 (comment)) too?
   
   Yes. The original implementation mixed schema verification with `prepare` to avoid performance loss.
   
   Let me make it clear in pseudo code to describe what I have done:
   
   ```
   switch schema:
   + case DataType --> require prepare -> optional schema verification
   + case None/a list/tuple of names: --> require schema inferences and conversion
   + case StructType --> optional schema verification
   ```
   
   ```
   switch schema:
   + case DataType --> require prepare -> optional schema verification
   + case None/a list/tuple of names: --> require schema inferences and optional verification and conversion
   + case StructType --> optional schema verification
   ```
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-888269543


   **[Test build #141775 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141775/testReport)** for PR 32332 at commit [`be5e9d8`](https://github.com/apache/spark/commit/be5e9d8dd5d2cc581b54e22b34f5c503981cb942).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-887007635


   > The result is not correct because of incorrect type conversion.
   
   Could you also post the incorrect result in the PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #32332:
URL: https://github.com/apache/spark/pull/32332#discussion_r619915537



##########
File path: python/pyspark/sql/session.py
##########
@@ -478,13 +478,22 @@ def _inferSchema(self, rdd, samplingRatio=None, names=None):
             schema = rdd.map(lambda row: _infer_schema(row, names)).reduce(_merge_type)
         return schema
 
-    def _createFromRDD(self, rdd, schema, samplingRatio):
+    def _create_verified_converter(self, schema, verifySchema):
+        converter = _create_converter(schema)
+        verify_func = _make_type_verifier(schema) if verifySchema else lambda _: True
+
+        def verify_and_convert(obj):
+            verify_func(obj)

Review comment:
       Hm, why do we need to verify this again? The verification is done at `prepare(data)`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #32332:
URL: https://github.com/apache/spark/pull/32332#discussion_r619915537



##########
File path: python/pyspark/sql/session.py
##########
@@ -478,13 +478,22 @@ def _inferSchema(self, rdd, samplingRatio=None, names=None):
             schema = rdd.map(lambda row: _infer_schema(row, names)).reduce(_merge_type)
         return schema
 
-    def _createFromRDD(self, rdd, schema, samplingRatio):
+    def _create_verified_converter(self, schema, verifySchema):
+        converter = _create_converter(schema)
+        verify_func = _make_type_verifier(schema) if verifySchema else lambda _: True
+
+        def verify_and_convert(obj):
+            verify_func(obj)

Review comment:
       Hm, why do we need to verify this again? The verification is done at `prepare(data)`

##########
File path: python/pyspark/sql/session.py
##########
@@ -695,9 +704,10 @@ def prepare(obj):
             prepare = lambda obj: obj

Review comment:
       Can we fix schema inference to infer the correct schema?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-826281360






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-826714957


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137951/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] darcy-shen commented on a change in pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
darcy-shen commented on a change in pull request #32332:
URL: https://github.com/apache/spark/pull/32332#discussion_r679264313



##########
File path: python/pyspark/sql/session.py
##########
@@ -599,6 +609,9 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr
             the sample ratio of rows used for inferring
         verifySchema : bool, optional
             verify data types of every row against schema. Enabled by default.
+            Specifically, if schema is provided, schema verification will be performed at the
+            preparation phase; if schema is not provided and can be inferred (eg. from UDT),
+            schema verification will be performed at the convertion phase.

Review comment:
       comment updated




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] darcy-shen commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
darcy-shen commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-887132056


   > Could you also post the incorrect result in the PR?
   
   I've copied the incorrect result from JIRA ticket to the PR description part.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-917792028


   **[Test build #143184 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143184/testReport)** for PR 32332 at commit [`54553fd`](https://github.com/apache/spark/commit/54553fd062798c008254ddebe3987ef983b4e2ad).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-889253804


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141847/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] darcy-shen commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
darcy-shen commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-886497761


   Sorry for the late response. Review comments are replied now. And also, I improved the performance by removing the extra identity function call.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-886528598


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141624/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] darcy-shen commented on a change in pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
darcy-shen commented on a change in pull request #32332:
URL: https://github.com/apache/spark/pull/32332#discussion_r676400183



##########
File path: python/pyspark/sql/session.py
##########
@@ -695,9 +704,10 @@ def prepare(obj):
             prepare = lambda obj: obj

Review comment:
       > Seems like it does twice when the schema is specified.
   
   When the schema is specified, `_inferSchema` and `_create_verified_converter ` will not be invoked 
   
   https://github.com/apache/spark/blob/c561ee686551690bee689f37ae5bbd75119994d6/python/pyspark/sql/session.py#L490-L493




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-887161989


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141676/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-945444898


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144355/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-945446988


   **[Test build #144358 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144358/testReport)** for PR 32332 at commit [`5acf69d`](https://github.com/apache/spark/commit/5acf69d00e744eebf99df59dd60f89c7eeeb831a).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-945457941


   **[Test build #144358 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144358/testReport)** for PR 32332 at commit [`5acf69d`](https://github.com/apache/spark/commit/5acf69d00e744eebf99df59dd60f89c7eeeb831a).
    * This patch **fails PySpark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-888307416


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141775/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-889564827


   **[Test build #141866 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141866/testReport)** for PR 32332 at commit [`1807997`](https://github.com/apache/spark/commit/18079975ca65c73b494b1808012ffe0176751a03).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-889250415


   **[Test build #141846 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141846/testReport)** for PR 32332 at commit [`54b81e3`](https://github.com/apache/spark/commit/54b81e319586f0ff8f17904017af1c9c2670d64b).
    * This patch **fails Python style tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-889329568


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141849/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadhen edited a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
sadhen edited a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-829105798


   > BTW, can we avoid doing the verification twice (#32332 (comment)) too?
   
   Yes. The original implementation mixed schema verification with `prepare` to avoid performance loss.
   
   Let me make it clear in pseudo code to describe what I have done:
   
   
   switch schema:
   + case DataType --> require prepare -> optional schema verification
   + case None/a list/tuple of names: --> require `schema inferences` and required conversion
   + case StructType --> optional schema verification
   
   
   ---->
   
   
   switch schema:
   + case DataType --> required `prepare` -> optional schema verification
   + case None/a list/tuple of names: --> required `schema inferences` and optional verification and required conversion
   + case StructType --> optional schema verification
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-945772114


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144372/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-945719863


   **[Test build #144372 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144372/testReport)** for PR 32332 at commit [`21afb7c`](https://github.com/apache/spark/commit/21afb7c118e65f83f31cbac5d89feab31d774286).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-945458175


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144358/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-826671823


   **[Test build #137951 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137951/testReport)** for PR 32332 at commit [`1bbbf49`](https://github.com/apache/spark/commit/1bbbf49150a2728a4081cb1ea7ef472c383a5f31).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-826307360


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137917/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-826281360






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-826302600


   **[Test build #137917 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137917/testReport)** for PR 32332 at commit [`36089e8`](https://github.com/apache/spark/commit/36089e8c7770df47c22d2feb4663adb653c3a80f).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-828886094


   If users explicitly set `verifySchema`, then it should work. Without this PR, it doesn't work right.
   
   And then, we would have to document that `verifySchema` is turned off when `schema` is not specified for legacy reason.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-826714957


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137951/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadhen commented on a change in pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
sadhen commented on a change in pull request #32332:
URL: https://github.com/apache/spark/pull/32332#discussion_r622757279



##########
File path: python/pyspark/sql/session.py
##########
@@ -478,13 +478,22 @@ def _inferSchema(self, rdd, samplingRatio=None, names=None):
             schema = rdd.map(lambda row: _infer_schema(row, names)).reduce(_merge_type)
         return schema
 
-    def _createFromRDD(self, rdd, schema, samplingRatio):
+    def _create_verified_converter(self, schema, verifySchema):
+        converter = _create_converter(schema)
+        verify_func = _make_type_verifier(schema) if verifySchema else lambda _: True
+
+        def verify_and_convert(obj):
+            verify_func(obj)

Review comment:
       https://github.com/apache/spark/blob/132cbf0c8c1a382f33d8d212f931f5956f85a2f9/python/pyspark/sql/session.py#L695
   
   It is not verified twice, previously, it is not verified. See the above `identity` prepare code snippet.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadhen commented on a change in pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
sadhen commented on a change in pull request #32332:
URL: https://github.com/apache/spark/pull/32332#discussion_r619935979



##########
File path: python/pyspark/sql/session.py
##########
@@ -695,9 +704,10 @@ def prepare(obj):
             prepare = lambda obj: obj

Review comment:
       See https://github.com/apache/spark/blob/6f782efb044403ab3ca79662fdeca7f1f906e1bf/python/pyspark/sql/pandas/conversion.py#L312
   
   @HyukjinKwon  That's why we need to do verification after the schema is inferred.

##########
File path: python/pyspark/sql/session.py
##########
@@ -695,9 +704,10 @@ def prepare(obj):
             prepare = lambda obj: obj

Review comment:
       See https://github.com/apache/spark/blob/6f782efb044403ab3ca79662fdeca7f1f906e1bf/python/pyspark/sql/pandas/conversion.py#L312:L314
   
   @HyukjinKwon  That's why we need to do verification after the schema is inferred.

##########
File path: python/pyspark/sql/session.py
##########
@@ -695,9 +704,10 @@ def prepare(obj):
             prepare = lambda obj: obj

Review comment:
       See https://github.com/apache/spark/blob/6f782efb044403ab3ca79662fdeca7f1f906e1bf/python/pyspark/sql/pandas/conversion.py#L312-L314
   
   @HyukjinKwon  That's why we need to do verification after the schema is inferred.

##########
File path: python/pyspark/sql/session.py
##########
@@ -695,9 +704,10 @@ def prepare(obj):
             prepare = lambda obj: obj

Review comment:
       1. create from pandas, schema is a list.
   2. create directly without schema, schema is None
   

##########
File path: python/pyspark/sql/session.py
##########
@@ -695,9 +704,10 @@ def prepare(obj):
             prepare = lambda obj: obj

Review comment:
       it is not schema inference, it just extracts the name.
   
   I think doing verification after the schema is inferred is good idea. Otherwise, we need to change when to infer schema.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadhen commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
sadhen commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-826599366


   @HyukjinKwon  unit tests added.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-917811752


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143184/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #32332:
URL: https://github.com/apache/spark/pull/32332#discussion_r678906847



##########
File path: python/pyspark/sql/tests/test_dataframe.py
##########
@@ -759,6 +759,22 @@ def test_create_dataframe_from_pandas_with_dst(self):
                 os.environ['TZ'] = orig_env_tz
             time.tzset()
 
+    # SPARK-35211: inferred schema verification
+    @unittest.skipIf(not have_pandas, pandas_requirement_message)  # type: ignore
+    def test_create_dataframe_from_pandas_with_mismatched_udt(self):
+        from pyspark.testing.sqlutils import ExamplePoint
+        import pandas as pd
+        pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
+        # The underlying sqlType of ExamplePoint is ArrayType(DoubleType(), False)
+        # There is a data type mismatch between 1 and DoubleType, a TypeError is expected
+        with self.assertRaises(TypeError):
+            with self.sql_conf({"spark.sql.execution.arrow.pyspark.enabled": "false"}):
+                self.spark.createDataFrame(pdf)

Review comment:
       ditto. Could you add `verifySchema = False` call to make sure it won't fail the query?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-889293948


   **[Test build #141849 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141849/testReport)** for PR 32332 at commit [`4dd3f57`](https://github.com/apache/spark/commit/4dd3f576812f3695f511600ea10e2016f5f78b09).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-826281360


   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-826622952


   **[Test build #137947 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137947/testReport)** for PR 32332 at commit [`849163e`](https://github.com/apache/spark/commit/849163e9090e11055c2df399c419db7c9535e14d).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] BryanCutler commented on a change in pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
BryanCutler commented on a change in pull request #32332:
URL: https://github.com/apache/spark/pull/32332#discussion_r679351798



##########
File path: python/pyspark/sql/session.py
##########
@@ -483,13 +483,22 @@ def _inferSchema(self, rdd, samplingRatio=None, names=None):
                 row, names, infer_dict_as_struct=infer_dict_as_struct)).reduce(_merge_type)
         return schema
 
-    def _createFromRDD(self, rdd, schema, samplingRatio):
+    def _create_verified_converter(self, schema, verifySchema):
+        converter = _create_converter(schema)
+        verify_func = _make_type_verifier(schema) if verifySchema else lambda _: True
+
+        def verify_and_convert(obj):
+            verify_func(obj)
+            return converter(obj)
+        return verify_and_convert

Review comment:
       I'm a little confused by this, if `verifySchema` is `False` then could you just return `converter`?

##########
File path: python/pyspark/sql/tests/test_dataframe.py
##########
@@ -759,6 +759,30 @@ def test_create_dataframe_from_pandas_with_dst(self):
                 os.environ['TZ'] = orig_env_tz
             time.tzset()
 
+    # SPARK-35211: inferred schema verification
+    @unittest.skipIf(not have_pandas, pandas_requirement_message)  # type: ignore
+    def test_create_dataframe_from_pandas_with_mismatched_udt(self):
+        from pyspark.testing.sqlutils import ExamplePoint
+        import pandas as pd
+        pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
+        # The underlying sqlType of ExamplePoint is ArrayType(DoubleType(), False)
+        # There is a data type mismatch between 1 and DoubleType, a TypeError is expected
+        with self.assertRaises(TypeError):
+            with self.sql_conf({"spark.sql.execution.arrow.pyspark.enabled": "false"}):
+                self.spark.createDataFrame(pdf)
+        # With verifySchema disabled, there will be unexpected behaviours
+        self.assertEquals(self.spark.createDataFrame(pdf, verifySchema=False).collect(),
+                          [Row(point=ExamplePoint(0.0, 0.0)), Row(point=ExamplePoint(0.0, 0.0))])

Review comment:
       Can you add a positive test case where `verifySchema` is `True` and results are correct?

##########
File path: python/pyspark/sql/session.py
##########
@@ -697,12 +712,17 @@ def prepare(obj):
                 verify_func(obj)
                 return obj,
         else:
+            no_need_to_prepare = True
             prepare = lambda obj: obj
 
         if isinstance(data, RDD):
-            rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
+            rdd, schema = self._createFromRDD(
+                data if no_need_to_prepare else data.map(prepare),

Review comment:
       I think you could just say `data if not verifySchema else ...` and remove `no_need_to_prepare`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] darcy-shen commented on a change in pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
darcy-shen commented on a change in pull request #32332:
URL: https://github.com/apache/spark/pull/32332#discussion_r679264523



##########
File path: python/pyspark/sql/tests/test_dataframe.py
##########
@@ -759,6 +759,22 @@ def test_create_dataframe_from_pandas_with_dst(self):
                 os.environ['TZ'] = orig_env_tz
             time.tzset()
 
+    # SPARK-35211: inferred schema verification
+    @unittest.skipIf(not have_pandas, pandas_requirement_message)  # type: ignore
+    def test_create_dataframe_from_pandas_with_mismatched_udt(self):
+        from pyspark.testing.sqlutils import ExamplePoint
+        import pandas as pd
+        pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
+        # The underlying sqlType of ExamplePoint is ArrayType(DoubleType(), False)
+        # There is a data type mismatch between 1 and DoubleType, a TypeError is expected
+        with self.assertRaises(TypeError):
+            with self.sql_conf({"spark.sql.execution.arrow.pyspark.enabled": "false"}):
+                self.spark.createDataFrame(pdf)

Review comment:
       added




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] darcy-shen commented on a change in pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
darcy-shen commented on a change in pull request #32332:
URL: https://github.com/apache/spark/pull/32332#discussion_r679263849



##########
File path: python/pyspark/sql/tests/test_dataframe.py
##########
@@ -759,6 +759,22 @@ def test_create_dataframe_from_pandas_with_dst(self):
                 os.environ['TZ'] = orig_env_tz
             time.tzset()
 
+    # SPARK-35211: inferred schema verification
+    @unittest.skipIf(not have_pandas, pandas_requirement_message)  # type: ignore
+    def test_create_dataframe_from_pandas_with_mismatched_udt(self):
+        from pyspark.testing.sqlutils import ExamplePoint
+        import pandas as pd
+        pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
+        # The underlying sqlType of ExamplePoint is ArrayType(DoubleType(), False)
+        # There is a data type mismatch between 1 and DoubleType, a TypeError is expected
+        with self.assertRaises(TypeError):
+            with self.sql_conf({"spark.sql.execution.arrow.pyspark.enabled": "false"}):
+                self.spark.createDataFrame(pdf)

Review comment:
       added




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadhen edited a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
sadhen edited a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-829105798


   > BTW, can we avoid doing the verification twice (#32332 (comment)) too?
   
   Yes. The original implementation mixed schema verification with `prepare` to avoid performance loss.
   
   Let me make it clear in pseudo code to describe what I have done:
   
   
   switch schema:
   + case DataType --> required `prepare` -> optional schema verification
   + case None/a list/tuple of names: --> require `schema inferences` and required conversion
   + case StructType --> optional schema verification
   
   
   ---->
   
   
   switch schema:
   + case DataType --> required `prepare` -> optional schema verification
   + case None/a list/tuple of names: --> required `schema inferences` and optional verification and required conversion
   + case StructType --> optional schema verification
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-887217818


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141683/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-971144257


   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-887157079


   **[Test build #141676 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141676/testReport)** for PR 32332 at commit [`675c254`](https://github.com/apache/spark/commit/675c2540f69b954f37810f41c0d1704d790fe49a).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-826520847






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadhen commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
sadhen commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-826519391






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-917801031


   **[Test build #143184 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143184/testReport)** for PR 32332 at commit [`54553fd`](https://github.com/apache/spark/commit/54553fd062798c008254ddebe3987ef983b4e2ad).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-889583058


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141866/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-889252887


   **[Test build #141847 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141847/testReport)** for PR 32332 at commit [`a752d68`](https://github.com/apache/spark/commit/a752d68ebc3060df8ecbbe10024abb008fd63d9e).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] darcy-shen commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
darcy-shen commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-917839939


   Rebased on master, conflicts resolved.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-889250439


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141846/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadhen edited a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
sadhen edited a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-826519391






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #32332:
URL: https://github.com/apache/spark/pull/32332#discussion_r619943638



##########
File path: python/pyspark/sql/session.py
##########
@@ -695,9 +704,10 @@ def prepare(obj):
             prepare = lambda obj: obj

Review comment:
       Can we fix schema inference to infer the correct schema?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-826281679


   **[Test build #137911 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137911/testReport)** for PR 32332 at commit [`d0a0942`](https://github.com/apache/spark/commit/d0a094245fdba72b97b2b721e1198ed62fd3b3da).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadhen commented on a change in pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
sadhen commented on a change in pull request #32332:
URL: https://github.com/apache/spark/pull/32332#discussion_r619948668



##########
File path: python/pyspark/sql/session.py
##########
@@ -695,9 +704,10 @@ def prepare(obj):
             prepare = lambda obj: obj

Review comment:
       it is not schema inference, it just extracts the name.
   
   I think doing verification after the schema is inferred is good idea. Otherwise, we need to change when to infer schema.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] darcy-shen edited a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
darcy-shen edited a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-886497761


   Sorry for the late response. Review comments are replied now. And also, I improved the performance by removing the extra identity function (`prepare`) call.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-826299079


   **[Test build #137917 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137917/testReport)** for PR 32332 at commit [`36089e8`](https://github.com/apache/spark/commit/36089e8c7770df47c22d2feb4663adb653c3a80f).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-828886196


   BTW, can we avoid doing the verification twice (https://github.com/apache/spark/pull/32332#discussion_r619915537) too?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] darcy-shen commented on a change in pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
darcy-shen commented on a change in pull request #32332:
URL: https://github.com/apache/spark/pull/32332#discussion_r677110817



##########
File path: python/pyspark/sql/session.py
##########
@@ -697,12 +712,19 @@ def prepare(obj):
                 verify_func(obj)
                 return obj,
         else:
+            no_need_to_prepare = True
             prepare = lambda obj: obj
 
-        if isinstance(data, RDD):
-            rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
+        if isinstance(data, RDD) and no_need_to_prepare:
+            rdd, schema = self._createFromRDD(
+                data, schema, verifySchema, samplingRatio)
+        elif isinstance(data, RDD) and (not no_need_to_prepare):
+            rdd, schema = self._createFromRDD(
+                data.map(prepare), schema, verifySchema, samplingRatio)

Review comment:
       Refactored, thanks!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-887161873


   **[Test build #141676 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141676/testReport)** for PR 32332 at commit [`675c254`](https://github.com/apache/spark/commit/675c2540f69b954f37810f41c0d1704d790fe49a).
    * This patch **fails PySpark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-826668707


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137947/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-889329568


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141849/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-826281821


   **[Test build #137911 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137911/testReport)** for PR 32332 at commit [`d0a0942`](https://github.com/apache/spark/commit/d0a094245fdba72b97b2b721e1198ed62fd3b3da).
    * This patch **fails Python style tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] darcy-shen commented on a change in pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
darcy-shen commented on a change in pull request #32332:
URL: https://github.com/apache/spark/pull/32332#discussion_r677042167



##########
File path: python/pyspark/sql/session.py
##########
@@ -681,12 +694,14 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr
 
     def _create_dataframe(self, data, schema, samplingRatio, verifySchema):
         if isinstance(schema, StructType):
+            no_need_to_prepare = not verifySchema
             verify_func = _make_type_verifier(schema) if verifySchema else lambda _: True
 
             def prepare(obj):
                 verify_func(obj)
                 return obj

Review comment:
       The prepare phase verification only covers when schema `isinstance` of StructType and DataType.
   
   The conversion phase covers when schema is None and is inferred during the conversion phase.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-945719863


   **[Test build #144372 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144372/testReport)** for PR 32332 at commit [`21afb7c`](https://github.com/apache/spark/commit/21afb7c118e65f83f31cbac5d89feab31d774286).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-945772114


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144372/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-826323626


   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org