Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/10/27 12:33:56 UTC

[GitHub] [spark] pralabhkumar opened a new pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

pralabhkumar opened a new pull request #34401:
URL: https://github.com/apache/spark/pull/34401


   ### What changes were proposed in this pull request?
   `toPandas` now returns the correct dtypes for an empty DataFrame when Arrow is enabled.
   
   ### Why are the changes needed?
   Currently, when Arrow is enabled, `toPandas` on an empty DataFrame returns `object` as the dtype of every column. The dtypes are correct when Arrow is disabled. This PR makes `toPandas` return the correct dtypes when Arrow is enabled and the DataFrame is empty.
   
   
   ### Does this PR introduce any user-facing change?
   Yes. Users will see the correct dtype values.
   
   ### How was this patch tested?
   Unit tests.
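
To illustrate the behaviour described above, here is a minimal sketch using pandas only (no Spark session). The column names are hypothetical; the `from_records` call mirrors the line the patch replaces in `conversion.py`. Because no type information is passed, every column should come back as `object`:

```python
import pandas as pd

# Hypothetical column names, standing in for the columns of a Spark schema.
columns = ["id", "score", "name"]

# Same construction as the pre-patch code path for an empty result:
# only the column names are supplied, so pandas has no types to apply.
pdf = pd.DataFrame.from_records([], columns=columns)

# Collect the dtype of each column as a string for easy inspection.
dtypes = {name: str(dtype) for name, dtype in pdf.dtypes.items()}
print(dtypes)
```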


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r752856092



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -171,7 +171,18 @@ def toPandas(self) -> "PandasDataFrameLike":
                                 pdf[field.name] = _convert_map_items_to_dict(pdf[field.name])
                         return pdf
                     else:
-                        return pd.DataFrame.from_records([], columns=self.columns)
+                        corrected_panda_types = []
+                        for index, field in enumerate(self.schema):
+                            panda_type = PandasConversionMixin._to_corrected_pandas_type(
+                                field.dataType
+                            )
+                            if panda_type is None:
+                                corrected_panda_types.append((tmp_column_names[index], np.object0))
+                            else:
+                                corrected_panda_types.append((tmp_column_names[index], panda_type))
+                        pdf = pd.DataFrame(np.empty(0, dtype=corrected_panda_types), index=[])

Review comment:
       Can we simply create an empty DataFrame and call `df.astype` on it? See, e.g., https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html#pandas.DataFrame.astype
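
A minimal sketch of the two options under discussion, assuming pandas and NumPy only. The `schema` list is a made-up stand-in for the types that `_to_corrected_pandas_type` would return; it is not taken from the patch.

```python
import numpy as np
import pandas as pd

# Hypothetical (column name, target dtype) pairs; not taken from the patch.
schema = [("id", np.int64), ("score", np.float64), ("name", np.object_)]
names = [name for name, _ in schema]

# Option 1 (the reviewer's suggestion): create the empty frame, then cast
# each column with astype, using a {column name: dtype} mapping.
via_astype = pd.DataFrame(columns=names).astype(dict(schema))

# Option 2 (the approach in the diff): an empty NumPy structured array
# already carries one dtype per field, and pandas picks those up.
via_structured = pd.DataFrame(np.empty(0, dtype=schema), index=[])

print(via_astype.dtypes)
print(via_structured.dtypes)
```

Note that the mapping passed to `astype` is keyed by column name, so this form assumes the column names are unique; the diff works with `tmp_column_names`, presumably for that reason.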





[GitHub] [spark] pralabhkumar commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

pralabhkumar commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r753224281



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -171,7 +171,18 @@ def toPandas(self) -> "PandasDataFrameLike":
                                 pdf[field.name] = _convert_map_items_to_dict(pdf[field.name])
                         return pdf
                     else:
-                        return pd.DataFrame.from_records([], columns=self.columns)
+                        corrected_panda_types = []
+                        for index, field in enumerate(self.schema):
+                            panda_type = PandasConversionMixin._to_corrected_pandas_type(
+                                field.dataType
+                            )
+                            if panda_type is None:
+                                corrected_panda_types.append((tmp_column_names[index], np.object0))
+                            else:
+                                corrected_panda_types.append((tmp_column_names[index], panda_type))
+                        pdf = pd.DataFrame(np.empty(0, dtype=corrected_panda_types), index=[])

Review comment:
       @HyukjinKwon I have also resolved the merge conflicts.





[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966923050


   **[Test build #145152 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145152/testReport)** for PR 34401 at commit [`53cecc8`](https://github.com/apache/spark/commit/53cecc8e9cbada95a17579369db382246f0ab021).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] pralabhkumar edited a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

pralabhkumar edited a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966244527


   @HyukjinKwon 
   
   There is another way to do the same thing, using pyarrow. The code is below:
   
   tmp_schema = [
       pyarrow.field(
           tmp_column_names[i],
           to_arrow_type(
               TimestampNTZType()
               if isinstance(self.schema.fields[i].dataType, TimestampType)
               else self.schema.fields[i].dataType
           ),
       )
       for i in range(len(self.columns))
   ]
   
   pdf = (
       pyarrow.schema(tmp_schema)
       .empty_table()
       .to_pandas()
       .set_index([pd.Index([])])
   )
   pdf.columns = self.columns
   
   
   



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966966134


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49623/
   



[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953785178


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144716/
   



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953791621


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49184/
   



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953421841


   **[Test build #144682 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144682/testReport)** for PR 34401 at commit [`5c2630c`](https://github.com/apache/spark/commit/5c2630c6fd303b4ef84a71fc560c1474e777c68a).



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953937749


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49187/
   



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-955133433


   **[Test build #144772 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144772/testReport)** for PR 34401 at commit [`bdd9a9e`](https://github.com/apache/spark/commit/bdd9a9e296ade1e9a9b40d22f60bef11c68f802f).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-955129769


   **[Test build #144772 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144772/testReport)** for PR 34401 at commit [`bdd9a9e`](https://github.com/apache/spark/commit/bdd9a9e296ade1e9a9b40d22f60bef11c68f802f).



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953431358


   **[Test build #144682 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144682/testReport)** for PR 34401 at commit [`5c2630c`](https://github.com/apache/spark/commit/5c2630c6fd303b4ef84a71fc560c1474e777c68a).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953440510


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49151/
   



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966908299


   **[Test build #145152 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145152/testReport)** for PR 34401 at commit [`53cecc8`](https://github.com/apache/spark/commit/53cecc8e9cbada95a17579369db382246f0ab021).



[GitHub] [spark] pralabhkumar edited a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

pralabhkumar edited a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966962435


   ping @HyukjinKwon 
   
    I have simplified the logic and resolved the merge conflicts. Please review the PR.



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON] Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-975606553


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145511/
   



[GitHub] [spark] pralabhkumar commented on pull request #34401: [SPARK-30537][PYTHON] Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

pralabhkumar commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-975647107


   @HyukjinKwon 
   
   The tests have passed. Please merge the PR if everything looks OK.



[GitHub] [spark] HyukjinKwon commented on pull request #34401: [SPARK-30537][PYTHON] Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-976019690


   Thanks for working on this @pralabhkumar.
   
   Merged to master.



[GitHub] [spark] HyukjinKwon commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r737943724



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -151,7 +151,17 @@ def toPandas(self) -> "PandasDataFrameLike":
                                     _convert_map_items_to_dict(pdf[field.name])
                         return pdf
                     else:
-                        return pd.DataFrame.from_records([], columns=self.columns)
+                        pdf = pd.DataFrame.from_records([], columns=self.columns)
+                        df = pd.DataFrame()
+                        for fieldIdx, field in enumerate(self.schema):
+                            pandas_type = \
+                                PandasConversionMixin._to_corrected_pandas_type(field.dataType)
+                            column_name = self.schema[fieldIdx].name
+                            series = pdf.iloc[:, fieldIdx]
+                            if pandas_type is not None:
+                                series = series.astype(pandas_type, copy=False)
+                            df.insert(fieldIdx, column_name, series, allow_duplicates=True)

Review comment:
       Is there a simpler and easier way to assign types to an empty pandas DataFrame? The current approach looks too messy.





[GitHub] [spark] pralabhkumar commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

pralabhkumar commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-954562153


   @HyukjinKwon can you please review the changes?



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966864259


   **[Test build #145146 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145146/testReport)** for PR 34401 at commit [`ba81f80`](https://github.com/apache/spark/commit/ba81f8019ad087a86fdc14234a868a6c99c8c2fc).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966853542


   **[Test build #145146 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145146/testReport)** for PR 34401 at commit [`ba81f80`](https://github.com/apache/spark/commit/ba81f8019ad087a86fdc14234a868a6c99c8c2fc).



[GitHub] [spark] pralabhkumar commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

pralabhkumar commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966962435


   ping @HyukjinKwon 



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966980727







[GitHub] [spark] HyukjinKwon commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-974926232


   Looks pretty good otherwise. cc @BryanCutler and @viirya FYI



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-974081327


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49926/
   



[GitHub] [spark] SparkQA removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953889348


   **[Test build #144718 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144718/testReport)** for PR 34401 at commit [`dce8199`](https://github.com/apache/spark/commit/dce8199c11b160335992daf6858b43adccb4438e).



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953889348


   **[Test build #144718 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144718/testReport)** for PR 34401 at commit [`dce8199`](https://github.com/apache/spark/commit/dce8199c11b160335992daf6858b43adccb4438e).



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-974016971


   **[Test build #145454 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145454/testReport)** for PR 34401 at commit [`4ef49f3`](https://github.com/apache/spark/commit/4ef49f388b13573b2e62681198c5648a81d4ad48).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953830804


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49184/
   



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953912307


   **[Test build #144718 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144718/testReport)** for PR 34401 at commit [`dce8199`](https://github.com/apache/spark/commit/dce8199c11b160335992daf6858b43adccb4438e).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953748744


   **[Test build #144716 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144716/testReport)** for PR 34401 at commit [`1f20700`](https://github.com/apache/spark/commit/1f20700ddf2b05432ec4954064fd487d5fdd44aa).



[GitHub] [spark] SparkQA removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-955129769


   **[Test build #144772 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144772/testReport)** for PR 34401 at commit [`bdd9a9e`](https://github.com/apache/spark/commit/bdd9a9e296ade1e9a9b40d22f60bef11c68f802f).



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-955143081


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49241/
   



[GitHub] [spark] pralabhkumar commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

pralabhkumar commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953139596


   @HyukjinKwon  Please review the PR
   



[GitHub] [spark] pralabhkumar commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

pralabhkumar commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-963071306


   @HyukjinKwon Please find some time to review the PR. It would be really helpful.



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966938696


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49624/
   



[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966980727







[GitHub] [spark] pralabhkumar commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

pralabhkumar commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r738109398



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -151,7 +151,17 @@ def toPandas(self) -> "PandasDataFrameLike":
                                     _convert_map_items_to_dict(pdf[field.name])
                         return pdf
                     else:
-                        return pd.DataFrame.from_records([], columns=self.columns)
+                        pdf = pd.DataFrame.from_records([], columns=self.columns)
+                        df = pd.DataFrame()
+                        for fieldIdx, field in enumerate(self.schema):
+                            pandas_type = \
+                                PandasConversionMixin._to_corrected_pandas_type(field.dataType)
+                            column_name = self.schema[fieldIdx].name
+                            series = pdf.iloc[:, fieldIdx]
+                            if pandas_type is not None:
+                                series = series.astype(pandas_type, copy=False)
+                            df.insert(fieldIdx, column_name, series, allow_duplicates=True)

Review comment:
       @HyukjinKwon Thanks for the comment. Let me try a simpler way.





[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953994729


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49187/
   



[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-955143081


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49241/
   



[GitHub] [spark] SparkQA removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966908299







[GitHub] [spark] pralabhkumar commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

pralabhkumar commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r748109781



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -151,7 +151,17 @@ def toPandas(self) -> "PandasDataFrameLike":
                                     _convert_map_items_to_dict(pdf[field.name])
                         return pdf
                     else:
-                        return pd.DataFrame.from_records([], columns=self.columns)
+                        pdf = pd.DataFrame.from_records([], columns=self.columns)
+                        df = pd.DataFrame()
+                        for fieldIdx, field in enumerate(self.schema):
+                            pandas_type = \
+                                PandasConversionMixin._to_corrected_pandas_type(field.dataType)
+                            column_name = self.schema[fieldIdx].name
+                            series = pdf.iloc[:, fieldIdx]
+                            if pandas_type is not None:
+                                series = series.astype(pandas_type, copy=False)
+                            df.insert(fieldIdx, column_name, series, allow_duplicates=True)

Review comment:
       @HyukjinKwon I have simplified the logic and resolved the merge conflicts. Please review the PR.





[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966966999


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49624/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966851506


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49609/
   



[GitHub] [spark] HyukjinKwon commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r753876468



##########
File path: python/pyspark/sql/tests/test_arrow.py
##########
@@ -205,6 +205,49 @@ def test_toPandas_fallback_disabled(self):
                 with self.assertRaisesRegex(Exception, "Unsupported type"):
                     df.toPandas()
 
+    def test_toPandas_empty_df_arrow_enabled(self):
+        # SPARK-30537 test that toPandas() on an empty dataframe has the correct dtypes
+        # when arrow is enabled
+        from datetime import datetime, date
+        from decimal import Decimal
+
+        schema = StructType(
+            [
+                StructField("a", StringType(), True),
+                StructField("a", IntegerType(), True),
+                StructField("c", TimestampType(), True),

Review comment:
       We will probably also have to add `TimestampNTZ` and `DayTimeIntervalType` to the test. `DayTimeIntervalType` might have an issue (see also https://github.com/apache/spark/pull/34631#discussion_r751947212).





[GitHub] [spark] pralabhkumar commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

pralabhkumar commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r750887784



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -171,7 +171,18 @@ def toPandas(self) -> "PandasDataFrameLike":
                                 pdf[field.name] = _convert_map_items_to_dict(pdf[field.name])
                         return pdf
                     else:
-                        return pd.DataFrame.from_records([], columns=self.columns)
+                        corrected_panda_types = []
+                        for index, field in enumerate(self.schema):
+                            panda_type = PandasConversionMixin._to_corrected_pandas_type(
+                                field.dataType
+                            )
+                            if panda_type is None:
+                                corrected_panda_types.append((tmp_column_names[index], np.object0))
+                            else:
+                                corrected_panda_types.append((tmp_column_names[index], panda_type))
+                        pdf = pd.DataFrame(np.empty(0, dtype=corrected_panda_types), index=[])
+                        pdf.columns = self.columns
+                        return pdf

Review comment:
       @itholic 
   1) Passing the columns when the DataFrame is created (as suggested) changes the dtype of every column to object. Since self.columns can contain duplicate column names, I assign pdf.columns = self.columns only after the DataFrame has been created. This way the dtypes remain intact (it is the same approach used earlier in the code, at L#164).
   
   2) We also need to pass index=[]; otherwise pandas creates RangeIndex(start=0, stop=0, step=1). The empty DataFrame created without Arrow, in the later part of the code, has index=[], so the index is added here to make the empty DataFrame identical with and without Arrow.
   
   Please let me know if this is OK.
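
The two points above can be checked outside Spark with a small standalone sketch. The temporary names (col_0, col_1, col_2), the three dtypes, and the final column list are assumptions chosen for illustration; they stand in for tmp_column_names, the values returned by _to_corrected_pandas_type, and self.columns.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for tmp_column_names and the corrected pandas types.
typed_fields = [
    ("col_0", np.object_),        # e.g. a string column
    ("col_1", np.int32),          # e.g. an integer column
    ("col_2", "datetime64[ns]"),  # e.g. a timestamp column
]

# A zero-length structured array carries one dtype per field.
empty_records = np.empty(0, dtype=typed_fields)

# Without an explicit index, pandas builds a RangeIndex for the empty frame.
pdf_default = pd.DataFrame(empty_records)

# With index=[], the index is built from the empty list instead.
pdf = pd.DataFrame(empty_records, index=[])

# Assign the real (possibly duplicated) names only after construction,
# so the per-column dtypes taken from the structured array are kept.
pdf.columns = ["a", "a", "c"]

print(type(pdf_default.index).__name__)
print([str(t) for t in pdf.dtypes])
```

Renaming after construction is what allows duplicate names here, since a NumPy structured dtype cannot contain two fields with the same name.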
   





[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON] Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-975549058


   **[Test build #145511 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145511/testReport)** for PR 34401 at commit [`8c3de09`](https://github.com/apache/spark/commit/8c3de09fa670aaea5952501ffdc2fbdcf67cad8b).



[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON] Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-975664102


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49983/
   



[GitHub] [spark] SparkQA removed a comment on pull request #34401: [SPARK-30537][PYTHON] Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-975549058


   **[Test build #145511 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145511/testReport)** for PR 34401 at commit [`8c3de09`](https://github.com/apache/spark/commit/8c3de09fa670aaea5952501ffdc2fbdcf67cad8b).



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON] Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-975579552


   **[Test build #145511 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145511/testReport)** for PR 34401 at commit [`8c3de09`](https://github.com/apache/spark/commit/8c3de09fa670aaea5952501ffdc2fbdcf67cad8b).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] pralabhkumar commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

pralabhkumar commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966244527


   @HyukjinKwon 
   
   There is another way to do the same thing using pyarrow; the code is below. I am OK with either approach.
   Please review the PR and suggest which one you prefer. If you are OK with it, I will resolve the merge conflicts.
   
   tmp_schema = [StructField(tmp_column_names[i],
                             TimestampNTZType() if isinstance(self.schema.fields[i].dataType, TimestampType)
                             else self.schema.fields[i].dataType) for i in range(len(self.columns))]
   
   table = pyarrow.Table.from_arrays(arrays = [pyarrow.array([])] * len(self.schema.fields),
                                     schema = to_arrow_schema(StructType(tmp_schema)))
   pdf = table.to_pandas().set_index([pd.Index([])])
   pdf.columns = self.columns
   return pdf
   
   
   
   



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966816420


   **[Test build #145138 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145138/testReport)** for PR 34401 at commit [`152ef5d`](https://github.com/apache/spark/commit/152ef5d52b31f8fa19b40ff8aeb131648e809bbf).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966827020


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49609/
   



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966936555


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49623/
   



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-974001466


   **[Test build #145454 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145454/testReport)** for PR 34401 at commit [`4ef49f3`](https://github.com/apache/spark/commit/4ef49f388b13573b2e62681198c5648a81d4ad48).



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-974053945


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49926/
   



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953826495


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49184/
   



[GitHub] [spark] SparkQA removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953421841


   **[Test build #144682 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144682/testReport)** for PR 34401 at commit [`5c2630c`](https://github.com/apache/spark/commit/5c2630c6fd303b4ef84a71fc560c1474e777c68a).



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953988063


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49187/
   



[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953428100


   Can one of the admins verify this patch?



[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953442544


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144682/
   



[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953942785


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144718/
   



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-955140132


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49241/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-955137081


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144772/
   



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-955136010


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49241/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953785178


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144716/
   



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953765577


   **[Test build #144716 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144716/testReport)** for PR 34401 at commit [`1f20700`](https://github.com/apache/spark/commit/1f20700ddf2b05432ec4954064fd487d5fdd44aa).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-952911841


   Can one of the admins verify this patch?



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966941096







[GitHub] [spark] SparkQA removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-974001466


   **[Test build #145454 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145454/testReport)** for PR 34401 at commit [`4ef49f3`](https://github.com/apache/spark/commit/4ef49f388b13573b2e62681198c5648a81d4ad48).



[GitHub] [spark] itholic commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

itholic commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r750866835



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -171,7 +171,18 @@ def toPandas(self) -> "PandasDataFrameLike":
                                 pdf[field.name] = _convert_map_items_to_dict(pdf[field.name])
                         return pdf
                     else:
-                        return pd.DataFrame.from_records([], columns=self.columns)
+                        corrected_panda_types = []
+                        for index, field in enumerate(self.schema):
+                            panda_type = PandasConversionMixin._to_corrected_pandas_type(
+                                field.dataType
+                            )
+                            if panda_type is None:
+                                corrected_panda_types.append((tmp_column_names[index], np.object0))
+                            else:
+                                corrected_panda_types.append((tmp_column_names[index], panda_type))
+                        pdf = pd.DataFrame(np.empty(0, dtype=corrected_panda_types), index=[])
+                        pdf.columns = self.columns
+                        return pdf

Review comment:
       nit: how about
   
   ```python
   return pd.DataFrame(np.empty(0, dtype=corrected_panda_types), columns=self.columns)
   ```
   
   ??
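
For context on the suggestion above, here is a minimal standalone sketch (not the PR's code) of how a zero-length structured NumPy array can carry per-column dtypes into an empty pandas DataFrame. The field names, dtypes, and final column labels are illustrative assumptions, and `np.object_` is used in place of the `np.object0` alias from the diff, which newer NumPy releases are understood to have removed.

```python
import numpy as np
import pandas as pd

# Hypothetical placeholder field names, each paired with the dtype its column should have.
typed_fields = [("col_0", np.int32), ("col_1", np.float64), ("col_2", np.object_)]

# A zero-length structured array still records one dtype per field.
empty_records = np.empty(0, dtype=typed_fields)

# pandas takes the column dtypes from the structured array, so no rows are needed.
pdf = pd.DataFrame(empty_records)

# Assign the final labels afterwards, mirroring the two-step approach in the diff.
pdf.columns = ["id", "score", "name"]

print(pdf.dtypes)
```

Setting `pdf.columns` after construction keeps the placeholder field names separate from the final labels, which is presumably useful when the final labels are not unique.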





[GitHub] [spark] HyukjinKwon closed pull request #34401: [SPARK-30537][PYTHON] Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

HyukjinKwon closed pull request #34401:
URL: https://github.com/apache/spark/pull/34401


   



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966848699


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49609/
   



[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966905440


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49617/
   



[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966941096







[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966875655


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145146/
   



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966891480


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49617/
   



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966870614


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49617/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-974037455


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145454/
   



[GitHub] [spark] HyukjinKwon commented on a change in pull request #34401: [SPARK-30537][PYTHON] Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r753876555



##########
File path: python/pyspark/sql/tests/test_dataframe.py
##########
@@ -817,17 +818,21 @@ def test_to_pandas_from_empty_dataframe(self):
             CAST('2019-01-01' AS TIMESTAMP_NTZ) AS timestamp_ntz,
             INTERVAL '1563:04' MINUTE TO SECOND AS day_time_interval
             """
-            dtypes_when_nonempty_df = self.spark.sql(sql).toPandas().dtypes
-            dtypes_when_empty_df = self.spark.sql(sql).filter("False").toPandas().dtypes
-            self.assertTrue(np.all(dtypes_when_empty_df == dtypes_when_nonempty_df))
+        arrow_enabled_status = [True, False]

Review comment:
       you can just name it `is_arrow_enabled`





[GitHub] [spark] pralabhkumar commented on pull request #34401: [SPARK-30537][PYTHON] Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

pralabhkumar commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-976132094


   Thx @HyukjinKwon for helping out during the process 



[GitHub] [spark] HyukjinKwon commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r753876555



##########
File path: python/pyspark/sql/tests/test_dataframe.py
##########
@@ -817,17 +818,21 @@ def test_to_pandas_from_empty_dataframe(self):
             CAST('2019-01-01' AS TIMESTAMP_NTZ) AS timestamp_ntz,
             INTERVAL '1563:04' MINUTE TO SECOND AS day_time_interval
             """
-            dtypes_when_nonempty_df = self.spark.sql(sql).toPandas().dtypes
-            dtypes_when_empty_df = self.spark.sql(sql).filter("False").toPandas().dtypes
-            self.assertTrue(np.all(dtypes_when_empty_df == dtypes_when_nonempty_df))
+        arrow_enabled_status = [True, False]

Review comment:
       you can just call `is_arrow_enabled`





[GitHub] [spark] itholic commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

itholic commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r750928501



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -171,7 +171,18 @@ def toPandas(self) -> "PandasDataFrameLike":
                                 pdf[field.name] = _convert_map_items_to_dict(pdf[field.name])
                         return pdf
                     else:
-                        return pd.DataFrame.from_records([], columns=self.columns)
+                        corrected_panda_types = []
+                        for index, field in enumerate(self.schema):
+                            panda_type = PandasConversionMixin._to_corrected_pandas_type(
+                                field.dataType
+                            )
+                            if panda_type is None:
+                                corrected_panda_types.append((tmp_column_names[index], np.object0))
+                            else:
+                                corrected_panda_types.append((tmp_column_names[index], panda_type))
+                        pdf = pd.DataFrame(np.empty(0, dtype=corrected_panda_types), index=[])
+                        pdf.columns = self.columns
+                        return pdf

Review comment:
       Got it!
   
   Then it looks good to me as is.





[GitHub] [spark] HyukjinKwon commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-970986630


   cc @BryanCutler and @itholic FYI



[GitHub] [spark] HyukjinKwon commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953414258


   ok to test



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953457610


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49151/
   



[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953830804


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49184/
   



[GitHub] [spark] pralabhkumar commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

pralabhkumar commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r750873159



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -171,7 +171,18 @@ def toPandas(self) -> "PandasDataFrameLike":
                                 pdf[field.name] = _convert_map_items_to_dict(pdf[field.name])
                         return pdf
                     else:
-                        return pd.DataFrame.from_records([], columns=self.columns)
+                        corrected_panda_types = []
+                        for index, field in enumerate(self.schema):
+                            panda_type = PandasConversionMixin._to_corrected_pandas_type(
+                                field.dataType
+                            )
+                            if panda_type is None:
+                                corrected_panda_types.append((tmp_column_names[index], np.object0))
+                            else:
+                                corrected_panda_types.append((tmp_column_names[index], panda_type))
+                        pdf = pd.DataFrame(np.empty(0, dtype=corrected_panda_types), index=[])
+                        pdf.columns = self.columns
+                        return pdf

Review comment:
       Agreed, making the change.





[GitHub] [spark] pralabhkumar commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

pralabhkumar commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r750887784



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -171,7 +171,18 @@ def toPandas(self) -> "PandasDataFrameLike":
                                 pdf[field.name] = _convert_map_items_to_dict(pdf[field.name])
                         return pdf
                     else:
-                        return pd.DataFrame.from_records([], columns=self.columns)
+                        corrected_panda_types = []
+                        for index, field in enumerate(self.schema):
+                            panda_type = PandasConversionMixin._to_corrected_pandas_type(
+                                field.dataType
+                            )
+                            if panda_type is None:
+                                corrected_panda_types.append((tmp_column_names[index], np.object0))
+                            else:
+                                corrected_panda_types.append((tmp_column_names[index], panda_type))
+                        pdf = pd.DataFrame(np.empty(0, dtype=corrected_panda_types), index=[])
+                        pdf.columns = self.columns
+                        return pdf

Review comment:
       @itholic 
   1) Passing `columns` when creating the DataFrame (as suggested) changes every column's dtype to object. Because `self.columns` can contain duplicate column names, I assign `pdf.columns = self.columns` only after the DataFrame has been created. This way the dtypes remain intact (it is similar to what is done above in the code, L#1654).
   
   2) We also need to pass `index=[]`; otherwise pandas creates `RangeIndex(start=0, stop=0, step=1)`. The empty DF created without Arrow, in the code below, has `index=[]`, so the index is added to get exactly the same empty DF with and without Arrow.
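
The two points above can be illustrated with a minimal pandas/NumPy sketch. This is not the PR's code: the helper name `empty_frame_with_dtypes` and the temporary `col_<i>` field names are hypothetical, and it assumes the target pandas dtypes are already known (in the PR they come from the Spark schema).

```python
import numpy as np
import pandas as pd


def empty_frame_with_dtypes(column_names, dtypes):
    """Build a zero-row DataFrame whose columns keep the requested dtypes.

    Hypothetical helper for illustration only.
    """
    # A NumPy structured dtype rejects duplicate field names, so use
    # unique temporary names first (assumed naming scheme: col_0, col_1, ...).
    tmp_names = [f"col_{i}" for i in range(len(column_names))]
    structured = np.empty(0, dtype=list(zip(tmp_names, dtypes)))

    # index=[] is passed explicitly, mirroring the call in the diff above.
    pdf = pd.DataFrame(structured, index=[])

    # Assign the real (possibly duplicated) names only after creation,
    # so the per-column dtypes are left untouched.
    pdf.columns = column_names
    return pdf


pdf = empty_frame_with_dtypes(
    ["a", "a", "c"], [np.object_, np.int32, "datetime64[ns]"]
)
print(pdf.dtypes)
```

Renaming after construction is the design choice worth noting here: the dtypes are fixed by the structured array, and the later `pdf.columns = ...` assignment only relabels the columns.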
   
   





[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-955137081


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144772/
   



[GitHub] [spark] pralabhkumar commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

pralabhkumar commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r753157519



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -171,7 +171,18 @@ def toPandas(self) -> "PandasDataFrameLike":
                                 pdf[field.name] = _convert_map_items_to_dict(pdf[field.name])
                         return pdf
                     else:
-                        return pd.DataFrame.from_records([], columns=self.columns)
+                        corrected_panda_types = []
+                        for index, field in enumerate(self.schema):
+                            panda_type = PandasConversionMixin._to_corrected_pandas_type(
+                                field.dataType
+                            )
+                            if panda_type is None:
+                                corrected_panda_types.append((tmp_column_names[index], np.object0))
+                            else:
+                                corrected_panda_types.append((tmp_column_names[index], panda_type))
+                        pdf = pd.DataFrame(np.empty(0, dtype=corrected_panda_types), index=[])

Review comment:
       @HyukjinKwon I have refined this as suggested. Please review.





[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-974037455


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145454/
   



[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-974081327


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49926/
   



[GitHub] [spark] HyukjinKwon commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r753876468



##########
File path: python/pyspark/sql/tests/test_arrow.py
##########
@@ -205,6 +205,49 @@ def test_toPandas_fallback_disabled(self):
                 with self.assertRaisesRegex(Exception, "Unsupported type"):
                     df.toPandas()
 
+    def test_toPandas_empty_df_arrow_enabled(self):
+        # SPARK-30537 test that toPandas() on an empty dataframe has the correct dtypes
+        # when arrow is enabled
+        from datetime import datetime, date
+        from decimal import Decimal
+
+        schema = StructType(
+            [
+                StructField("a", StringType(), True),
+                StructField("a", IntegerType(), True),
+                StructField("c", TimestampType(), True),

Review comment:
       We will probably also have to add `TimestampNTZ` and `DayTimeIntervalType` to the test. `DayTimeIntervalType` might have an issue (see also https://github.com/apache/spark/pull/34631/files#r751947212).





[GitHub] [spark] pralabhkumar commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

pralabhkumar commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r738513474



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -151,7 +151,17 @@ def toPandas(self) -> "PandasDataFrameLike":
                                     _convert_map_items_to_dict(pdf[field.name])
                         return pdf
                     else:
-                        return pd.DataFrame.from_records([], columns=self.columns)
+                        pdf = pd.DataFrame.from_records([], columns=self.columns)
+                        df = pd.DataFrame()
+                        for fieldIdx, field in enumerate(self.schema):
+                            pandas_type = \
+                                PandasConversionMixin._to_corrected_pandas_type(field.dataType)
+                            column_name = self.schema[fieldIdx].name
+                            series = pdf.iloc[:, fieldIdx]
+                            if pandas_type is not None:
+                                series = series.astype(pandas_type, copy=False)
+                            df.insert(fieldIdx, column_name, series, allow_duplicates=True)

Review comment:
       @HyukjinKwon I have made the changes. Please review.





[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953464277


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49151/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953428100







[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953994729


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49187/
   



[GitHub] [spark] pralabhkumar commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

pralabhkumar commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r738109398



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -151,7 +151,17 @@ def toPandas(self) -> "PandasDataFrameLike":
                                     _convert_map_items_to_dict(pdf[field.name])
                         return pdf
                     else:
-                        return pd.DataFrame.from_records([], columns=self.columns)
+                        pdf = pd.DataFrame.from_records([], columns=self.columns)
+                        df = pd.DataFrame()
+                        for fieldIdx, field in enumerate(self.schema):
+                            pandas_type = \
+                                PandasConversionMixin._to_corrected_pandas_type(field.dataType)
+                            column_name = self.schema[fieldIdx].name
+                            series = pdf.iloc[:, fieldIdx]
+                            if pandas_type is not None:
+                                series = series.astype(pandas_type, copy=False)
+                            df.insert(fieldIdx, column_name, series, allow_duplicates=True)

Review comment:
       @HyukjinKwon Thanks for the comment. There is a simple way to assign types to an empty DF, but it breaks down when the DF contains duplicate column names. We could check whether the DF has duplicate columns and use the approach above only in that case, falling back to the simple approach otherwise.
   
   Please let me know if that is OK.
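
A small pandas sketch of why duplicate column names get in the way, under stated assumptions: the column names and target dtypes below are made-up sample values, and the loop only mirrors the positional approach in the diff above rather than reproducing the PR's code.

```python
import pandas as pd

# Hypothetical column names and target dtypes; "a" is deliberately duplicated.
column_names = ["a", "a", "c"]
target_dtypes = ["object", "int32", "datetime64[ns]"]

pdf = pd.DataFrame.from_records([], columns=column_names)

# With duplicate labels, label-based selection returns a DataFrame holding
# both "a" columns, so a name -> dtype mapping cannot address them separately.
ambiguous = pdf["a"]

# Positional access sidesteps the ambiguity: cast each column by position
# and re-insert it, explicitly allowing the duplicated label.
typed = pd.DataFrame()
for position, (name, dtype) in enumerate(zip(column_names, target_dtypes)):
    series = pdf.iloc[:, position].astype(dtype)
    typed.insert(position, name, series, allow_duplicates=True)

print(typed.dtypes)
```

Selecting by position with `iloc` and inserting with `allow_duplicates=True` keeps the two `a` columns distinct, which a label-keyed dtype mapping cannot do.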





[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON] Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-975656323


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49983/
   



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON] Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-975598739


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49983/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON] Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-975664102


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49983/
   



[GitHub] [spark] SparkQA removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966806374


   **[Test build #145138 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145138/testReport)** for PR 34401 at commit [`152ef5d`](https://github.com/apache/spark/commit/152ef5d52b31f8fa19b40ff8aeb131648e809bbf).



[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966825178


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145138/
   



[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966851506


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49609/
   



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966853542


   **[Test build #145146 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145146/testReport)** for PR 34401 at commit [`ba81f80`](https://github.com/apache/spark/commit/ba81f8019ad087a86fdc14234a868a6c99c8c2fc).



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966925629


   **[Test build #145153 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145153/testReport)** for PR 34401 at commit [`29215ee`](https://github.com/apache/spark/commit/29215eec008d56c1d84c41bbccb512cb7841c885).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966875655


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145146/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966825178


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145138/
   



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966806374


   **[Test build #145138 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145138/testReport)** for PR 34401 at commit [`152ef5d`](https://github.com/apache/spark/commit/152ef5d52b31f8fa19b40ff8aeb131648e809bbf).



[GitHub] [spark] pralabhkumar edited a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

pralabhkumar edited a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966244527


   @HyukjinKwon 
   
   There is another way to do the same thing using pyarrow; the code is below.
   But since the DataFrame is empty, it is better to do this directly with pandas.
   
   # Build one Arrow field per column, mapping TimestampType to TimestampNTZType.
   tmp_schema = [
       pyarrow.field(
           tmp_column_names[i],
           to_arrow_type(
               TimestampNTZType()
               if isinstance(self.schema.fields[i].dataType, TimestampType)
               else self.schema.fields[i].dataType
           ),
       )
       for i in range(len(self.columns))
   ]
   
   pdf = (
       pyarrow.schema(tmp_schema)
       .empty_table()
       .to_pandas()
       .set_index([pd.Index([])])
   )
   pdf.columns = self.columns
   
   
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966905440


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49617/
   



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966910627


   **[Test build #145153 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145153/testReport)** for PR 34401 at commit [`29215ee`](https://github.com/apache/spark/commit/29215eec008d56c1d84c41bbccb512cb7841c885).



[GitHub] [spark] pralabhkumar commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

pralabhkumar commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-970353316


   Gentle ping @HyukjinKwon. Please review the PR.



[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-954459042


   Can one of the admins verify this patch?



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-952911841


   Can one of the admins verify this patch?



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953942785


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144718/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953464277


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49151/
   



[GitHub] [spark] SparkQA removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953748744


   **[Test build #144716 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144716/testReport)** for PR 34401 at commit [`1f20700`](https://github.com/apache/spark/commit/1f20700ddf2b05432ec4954064fd487d5fdd44aa).



[GitHub] [spark] pralabhkumar commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

pralabhkumar commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-972862689


   Gentle ping @BryanCutler.



[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON] Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-975606553


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145511/
   



[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-974028302


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49926/
   



[GitHub] [spark] pralabhkumar commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

pralabhkumar commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r753157519



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -171,7 +171,18 @@ def toPandas(self) -> "PandasDataFrameLike":
                                 pdf[field.name] = _convert_map_items_to_dict(pdf[field.name])
                         return pdf
                     else:
-                        return pd.DataFrame.from_records([], columns=self.columns)
+                        corrected_panda_types = []
+                        for index, field in enumerate(self.schema):
+                            panda_type = PandasConversionMixin._to_corrected_pandas_type(
+                                field.dataType
+                            )
+                            if panda_type is None:
+                                corrected_panda_types.append((tmp_column_names[index], np.object0))
+                            else:
+                                corrected_panda_types.append((tmp_column_names[index], panda_type))
+                        pdf = pd.DataFrame(np.empty(0, dtype=corrected_panda_types), index=[])

Review comment:
       @HyukjinKwon I have refined this. Please review.
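
As a standalone illustration of the idea in the hunk above, the sketch below builds an empty pandas DataFrame from a zero-length NumPy structured array so that each column keeps its dtype. The column names and dtypes are hypothetical stand-ins for the Spark schema, and the `None` fallback to object mirrors the diff rather than calling any Spark helper.

```python
import numpy as np
import pandas as pd

# Hypothetical (column name, pandas dtype or None) pairs standing in for the
# result of mapping each Spark field to a pandas type.
schema = [("id", np.int32), ("score", np.float64), ("payload", None)]

# Fall back to object when no specific pandas dtype is known, as the diff does.
field_dtypes = [
    (name, np.object_ if dtype is None else dtype) for name, dtype in schema
]

# A zero-length structured array has no rows but one typed field per column.
pdf = pd.DataFrame(np.empty(0, dtype=field_dtypes), index=[])

print(pdf.dtypes)
```

NumPy structured dtypes require unique field names, which is presumably why the hunk keys the dtype list on `tmp_column_names` rather than on the user-facing column names.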





[GitHub] [spark] pralabhkumar commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

pralabhkumar commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-955237982


   @HyukjinKwon Please review the PR and let me know if any further changes are needed.



[GitHub] [spark] pralabhkumar commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

pralabhkumar commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r739656823



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -151,7 +151,17 @@ def toPandas(self) -> "PandasDataFrameLike":
                                     _convert_map_items_to_dict(pdf[field.name])
                         return pdf
                     else:
-                        return pd.DataFrame.from_records([], columns=self.columns)
+                        pdf = pd.DataFrame.from_records([], columns=self.columns)
+                        df = pd.DataFrame()
+                        for fieldIdx, field in enumerate(self.schema):
+                            pandas_type = \
+                                PandasConversionMixin._to_corrected_pandas_type(field.dataType)
+                            column_name = self.schema[fieldIdx].name
+                            series = pdf.iloc[:, fieldIdx]
+                            if pandas_type is not None:
+                                series = series.astype(pandas_type, copy=False)
+                            df.insert(fieldIdx, column_name, series, allow_duplicates=True)

Review comment:
       @HyukjinKwon Please check the logic. Let me know if it's OK.
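
The per-column cast in this earlier revision can be reproduced outside Spark roughly as follows. This is a sketch under assumptions: the schema list is hypothetical (including the deliberately duplicated column name), and plain NumPy dtypes stand in for whatever `_to_corrected_pandas_type` returns.

```python
import numpy as np
import pandas as pd

# Hypothetical schema: (column name, target dtype or None). The repeated
# name "id" shows why the columns are inserted with allow_duplicates=True.
schema = [("id", np.int32), ("id", np.float64), ("name", None)]

# Start from an empty frame whose columns have no specific dtype yet.
names = [name for name, _ in schema]
untyped = pd.DataFrame.from_records([], columns=names)

typed = pd.DataFrame()
for position, (name, dtype) in enumerate(schema):
    # Select by position, since selecting by label is ambiguous for "id".
    series = untyped.iloc[:, position]
    if dtype is not None:
        series = series.astype(dtype)
    typed.insert(position, name, series, allow_duplicates=True)

print(typed.dtypes)
```

Compared with building the frame from a typed array in one step, this version tolerates duplicate column names but needs one cast and one insert per column.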





[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-954459042


   Can one of the admins verify this patch?

