Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/10/27 12:33:56 UTC
[GitHub] [spark] pralabhkumar opened a new pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
pralabhkumar opened a new pull request #34401:
URL: https://github.com/apache/spark/pull/34401
### What changes were proposed in this pull request?
`toPandas` now returns the correct dtypes for an empty DataFrame when Arrow is enabled.
### Why are the changes needed?
Currently, `toPandas` on an empty DataFrame returns `object` as the dtype of every column when Arrow is enabled, whereas the dtypes are correct when Arrow is disabled. This PR makes `toPandas` return the correct dtypes when Arrow is enabled and the DataFrame is empty.
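As a rough illustration of the symptom on the pandas side only (a sketch with made-up column names; it does not involve Spark or Arrow), an empty frame built with `from_records`, the call that this PR's diff replaces, leaves every column as `object`:

```python
import pandas as pd

# Hypothetical column names, standing in for self.columns in toPandas.
columns = ["id", "score"]

# Build an empty frame from no records, passing only the column names.
empty = pd.DataFrame.from_records([], columns=columns)

# No schema information reaches pandas, so every column falls back to object.
print(empty.dtypes)
```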
### Does this PR introduce any user-facing change?
Yes. Users will see the correct dtype values.
### How was this patch tested?
unit tests
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r752856092
##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -171,7 +171,18 @@ def toPandas(self) -> "PandasDataFrameLike":
             pdf[field.name] = _convert_map_items_to_dict(pdf[field.name])
     return pdf
 else:
-    return pd.DataFrame.from_records([], columns=self.columns)
+    corrected_panda_types = []
+    for index, field in enumerate(self.schema):
+        panda_type = PandasConversionMixin._to_corrected_pandas_type(
+            field.dataType
+        )
+        if panda_type is None:
+            corrected_panda_types.append((tmp_column_names[index], np.object0))
+        else:
+            corrected_panda_types.append((tmp_column_names[index], panda_type))
+    pdf = pd.DataFrame(np.empty(0, dtype=corrected_panda_types), index=[])
Review comment:
Can we simply create an empty DataFrame and call `df.astype` on it? See, e.g., https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html#pandas.DataFrame.astype
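A minimal sketch of this suggestion, assuming hypothetical column names and target dtypes (in `toPandas` they would come from the Spark schema; the mapping below is made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical mapping from column name to the desired pandas dtype.
# Columns with no specific pandas type fall back to object.
target_dtypes = {
    "col_0": np.int64,
    "col_1": np.float64,
    "col_2": np.object_,
}

# Create the empty frame first, then cast every column in one call.
pdf = pd.DataFrame(columns=list(target_dtypes)).astype(target_dtypes)

print(pdf.dtypes)
```

Because `astype` takes a dict keyed by column name, this form needs unique column names. That is presumably why the diff works with `tmp_column_names` rather than `self.columns`.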
[GitHub] [spark] pralabhkumar commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r753224281
##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -171,7 +171,18 @@ def toPandas(self) -> "PandasDataFrameLike":
             pdf[field.name] = _convert_map_items_to_dict(pdf[field.name])
     return pdf
 else:
-    return pd.DataFrame.from_records([], columns=self.columns)
+    corrected_panda_types = []
+    for index, field in enumerate(self.schema):
+        panda_type = PandasConversionMixin._to_corrected_pandas_type(
+            field.dataType
+        )
+        if panda_type is None:
+            corrected_panda_types.append((tmp_column_names[index], np.object0))
+        else:
+            corrected_panda_types.append((tmp_column_names[index], panda_type))
+    pdf = pd.DataFrame(np.empty(0, dtype=corrected_panda_types), index=[])
Review comment:
@HyukjinKwon I have also resolved the merge conflicts.
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966923050
**[Test build #145152 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145152/testReport)** for PR 34401 at commit [`53cecc8`](https://github.com/apache/spark/commit/53cecc8e9cbada95a17579369db382246f0ab021).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] pralabhkumar edited a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
pralabhkumar edited a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966244527
@HyukjinKwon
There is another way to do the same thing using pyarrow. The code is below.
tmp_schema = [
    pyarrow.field(
        tmp_column_names[i],
        to_arrow_type(
            TimestampNTZType()
            if isinstance(self.schema.fields[i].dataType, TimestampType)
            else self.schema.fields[i].dataType
        ),
    )
    for i in range(len(self.columns))
]
pdf = (
    pyarrow.schema(tmp_schema)
    .empty_table()
    .to_pandas()
    .set_index([pd.Index([])])
)
pdf.columns = self.columns
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966966134
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49623/
[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953785178
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144716/
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953791621
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49184/
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953421841
**[Test build #144682 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144682/testReport)** for PR 34401 at commit [`5c2630c`](https://github.com/apache/spark/commit/5c2630c6fd303b4ef84a71fc560c1474e777c68a).
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953937749
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49187/
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-955133433
**[Test build #144772 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144772/testReport)** for PR 34401 at commit [`bdd9a9e`](https://github.com/apache/spark/commit/bdd9a9e296ade1e9a9b40d22f60bef11c68f802f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-955129769
**[Test build #144772 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144772/testReport)** for PR 34401 at commit [`bdd9a9e`](https://github.com/apache/spark/commit/bdd9a9e296ade1e9a9b40d22f60bef11c68f802f).
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953431358
**[Test build #144682 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144682/testReport)** for PR 34401 at commit [`5c2630c`](https://github.com/apache/spark/commit/5c2630c6fd303b4ef84a71fc560c1474e777c68a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953440510
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49151/
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966908299
**[Test build #145152 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145152/testReport)** for PR 34401 at commit [`53cecc8`](https://github.com/apache/spark/commit/53cecc8e9cbada95a17579369db382246f0ab021).
[GitHub] [spark] pralabhkumar edited a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
pralabhkumar edited a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966962435
ping @HyukjinKwon
I have simplified the logic and resolved the merge conflicts. Please review the PR.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON] Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-975606553
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145511/
[GitHub] [spark] pralabhkumar commented on pull request #34401: [SPARK-30537][PYTHON] Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-975647107
@HyukjinKwon
The tests have passed. Please merge the PR if everything looks OK.
[GitHub] [spark] HyukjinKwon commented on pull request #34401: [SPARK-30537][PYTHON] Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-976019690
Thanks for working on this @pralabhkumar.
Merged to master.
[GitHub] [spark] HyukjinKwon commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r737943724
##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -151,7 +151,17 @@ def toPandas(self) -> "PandasDataFrameLike":
                 _convert_map_items_to_dict(pdf[field.name])
     return pdf
 else:
-    return pd.DataFrame.from_records([], columns=self.columns)
+    pdf = pd.DataFrame.from_records([], columns=self.columns)
+    df = pd.DataFrame()
+    for fieldIdx, field in enumerate(self.schema):
+        pandas_type = \
+            PandasConversionMixin._to_corrected_pandas_type(field.dataType)
+        column_name = self.schema[fieldIdx].name
+        series = pdf.iloc[:, fieldIdx]
+        if pandas_type is not None:
+            series = series.astype(pandas_type, copy=False)
+        df.insert(fieldIdx, column_name, series, allow_duplicates=True)
Review comment:
Is there a simpler and easier way to assign types to an empty pandas DataFrame? The current way looks too messy.
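One possible answer, and the direction a later revision in this thread takes, is to build the frame from a zero-length NumPy structured array so that no per-column loop over Series is needed. A minimal sketch, assuming hypothetical field names and dtypes rather than a real Spark schema:

```python
import numpy as np
import pandas as pd

# Hypothetical (name, dtype) pairs standing in for the Spark schema.
fields = [("col_0", np.int64), ("col_1", np.float64), ("col_2", np.object_)]

# A zero-length structured array has one typed field per column.
records = np.empty(0, dtype=fields)

# pandas turns each field into a column and keeps the field's dtype.
pdf = pd.DataFrame(records)

print(pdf.dtypes)
```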
[GitHub] [spark] pralabhkumar commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-954562153
@HyukjinKwon can you please review the changes?
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966864259
**[Test build #145146 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145146/testReport)** for PR 34401 at commit [`ba81f80`](https://github.com/apache/spark/commit/ba81f8019ad087a86fdc14234a868a6c99c8c2fc).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966853542
**[Test build #145146 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145146/testReport)** for PR 34401 at commit [`ba81f80`](https://github.com/apache/spark/commit/ba81f8019ad087a86fdc14234a868a6c99c8c2fc).
[GitHub] [spark] pralabhkumar commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966962435
ping @HyukjinKwon
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966980727
[GitHub] [spark] HyukjinKwon commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-974926232
Looks pretty good otherwise. cc @BryanCutler and @viirya FYI
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-974081327
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49926/
[GitHub] [spark] SparkQA removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953889348
**[Test build #144718 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144718/testReport)** for PR 34401 at commit [`dce8199`](https://github.com/apache/spark/commit/dce8199c11b160335992daf6858b43adccb4438e).
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953889348
**[Test build #144718 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144718/testReport)** for PR 34401 at commit [`dce8199`](https://github.com/apache/spark/commit/dce8199c11b160335992daf6858b43adccb4438e).
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-974016971
**[Test build #145454 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145454/testReport)** for PR 34401 at commit [`4ef49f3`](https://github.com/apache/spark/commit/4ef49f388b13573b2e62681198c5648a81d4ad48).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
--
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953830804
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49184/
--
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953912307
**[Test build #144718 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144718/testReport)** for PR 34401 at commit [`dce8199`](https://github.com/apache/spark/commit/dce8199c11b160335992daf6858b43adccb4438e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
--
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953748744
**[Test build #144716 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144716/testReport)** for PR 34401 at commit [`1f20700`](https://github.com/apache/spark/commit/1f20700ddf2b05432ec4954064fd487d5fdd44aa).
--
[GitHub] [spark] SparkQA removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-955129769
**[Test build #144772 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144772/testReport)** for PR 34401 at commit [`bdd9a9e`](https://github.com/apache/spark/commit/bdd9a9e296ade1e9a9b40d22f60bef11c68f802f).
--
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-955143081
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49241/
--
[GitHub] [spark] pralabhkumar commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953139596
@HyukjinKwon Please review the PR
--
[GitHub] [spark] pralabhkumar commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-963071306
@HyukjinKwon Please find some time to review the PR. It would be really helpful.
--
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966938696
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49624/
--
[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966980727
--
[GitHub] [spark] pralabhkumar commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r738109398
##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -151,7 +151,17 @@ def toPandas(self) -> "PandasDataFrameLike":
_convert_map_items_to_dict(pdf[field.name])
return pdf
else:
- return pd.DataFrame.from_records([], columns=self.columns)
+ pdf = pd.DataFrame.from_records([], columns=self.columns)
+ df = pd.DataFrame()
+ for fieldIdx, field in enumerate(self.schema):
+ pandas_type = \
+ PandasConversionMixin._to_corrected_pandas_type(field.dataType)
+ column_name = self.schema[fieldIdx].name
+ series = pdf.iloc[:, fieldIdx]
+ if pandas_type is not None:
+ series = series.astype(pandas_type, copy=False)
+ df.insert(fieldIdx, column_name, series, allow_duplicates=True)
Review comment:
@HyukjinKwon Thanks for the comment. Let me try the simpler way.
--
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953994729
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49187/
--
[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-955143081
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49241/
--
[GitHub] [spark] SparkQA removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966908299
--
[GitHub] [spark] pralabhkumar commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r748109781
##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -151,7 +151,17 @@ def toPandas(self) -> "PandasDataFrameLike":
_convert_map_items_to_dict(pdf[field.name])
return pdf
else:
- return pd.DataFrame.from_records([], columns=self.columns)
+ pdf = pd.DataFrame.from_records([], columns=self.columns)
+ df = pd.DataFrame()
+ for fieldIdx, field in enumerate(self.schema):
+ pandas_type = \
+ PandasConversionMixin._to_corrected_pandas_type(field.dataType)
+ column_name = self.schema[fieldIdx].name
+ series = pdf.iloc[:, fieldIdx]
+ if pandas_type is not None:
+ series = series.astype(pandas_type, copy=False)
+ df.insert(fieldIdx, column_name, series, allow_duplicates=True)
Review comment:
@HyukjinKwon I have simplified the logic and resolved the merge conflicts. Please review the PR.
--
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966966999
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49624/
--
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966851506
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49609/
--
[GitHub] [spark] HyukjinKwon commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r753876468
##########
File path: python/pyspark/sql/tests/test_arrow.py
##########
@@ -205,6 +205,49 @@ def test_toPandas_fallback_disabled(self):
with self.assertRaisesRegex(Exception, "Unsupported type"):
df.toPandas()
+ def test_toPandas_empty_df_arrow_enabled(self):
+ # SPARK-30537 test that toPandas() on an empty dataframe has the correct dtypes
+ # when arrow is enabled
+ from datetime import datetime, date
+ from decimal import Decimal
+
+ schema = StructType(
+ [
+ StructField("a", StringType(), True),
+ StructField("a", IntegerType(), True),
+ StructField("c", TimestampType(), True),
Review comment:
We will probably also have to add `TimestampNTZ` and `DayTimeIntervalType` to the test. `DayTimeIntervalType` might have an issue (see also https://github.com/apache/spark/pull/34631#discussion_r751947212).
--
[GitHub] [spark] pralabhkumar commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r750887784
##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -171,7 +171,18 @@ def toPandas(self) -> "PandasDataFrameLike":
pdf[field.name] = _convert_map_items_to_dict(pdf[field.name])
return pdf
else:
- return pd.DataFrame.from_records([], columns=self.columns)
+ corrected_panda_types = []
+ for index, field in enumerate(self.schema):
+ panda_type = PandasConversionMixin._to_corrected_pandas_type(
+ field.dataType
+ )
+ if panda_type is None:
+ corrected_panda_types.append((tmp_column_names[index], np.object0))
+ else:
+ corrected_panda_types.append((tmp_column_names[index], panda_type))
+ pdf = pd.DataFrame(np.empty(0, dtype=corrected_panda_types), index=[])
+ pdf.columns = self.columns
+ return pdf
Review comment:
@itholic
1) Passing the column names when the DataFrame is created (as suggested) changes every column's dtype to object. Because self.columns can contain duplicate column names, I assign them after the DataFrame has been created, with pdf.columns = self.columns. That way the dtypes stay intact (the same approach is used earlier in the code, at L#164).
2) We also need to pass index=[]; otherwise pandas creates RangeIndex(start=0, stop=0, step=1). The empty DataFrame created without Arrow, in the code below, has index=[], so the index is set here to make the empty DataFrame identical with and without Arrow.
Please let me know if this is OK.
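For readers following the two points above, here is a minimal standalone sketch of the same idea outside Spark. It is an illustration under assumptions, not the code in the PR: the helper name `empty_typed_frame` and the `col_N` placeholder names are hypothetical, and `None` stands in for "no corrected pandas type".

```python
import numpy as np
import pandas as pd


def empty_typed_frame(columns, dtypes):
    """Return an empty DataFrame whose columns keep the requested dtypes.

    Hypothetical helper for illustration only; not part of PySpark.
    """
    # Unique placeholder names: NumPy structured dtypes reject duplicate field names.
    tmp_names = [f"col_{i}" for i in range(len(columns))]
    # None means "no corrected pandas type", so fall back to object.
    fields = [
        (name, np.object_ if dtype is None else dtype)
        for name, dtype in zip(tmp_names, dtypes)
    ]
    # A zero-length structured array carries one dtype per field.
    pdf = pd.DataFrame(np.empty(0, dtype=fields), index=[])
    # Assign the real (possibly duplicated) names only after creation.
    pdf.columns = columns
    return pdf


pdf = empty_typed_frame(["a", "a", "c"], ["int32", None, "float64"])
print(pdf.dtypes)
```

Assigning the names last is what lets the duplicate column `a` coexist with per-column dtypes; the explicit `index=[]` mirrors point 2 above.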
--
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON] Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-975549058
**[Test build #145511 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145511/testReport)** for PR 34401 at commit [`8c3de09`](https://github.com/apache/spark/commit/8c3de09fa670aaea5952501ffdc2fbdcf67cad8b).
--
[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON] Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-975664102
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49983/
--
[GitHub] [spark] SparkQA removed a comment on pull request #34401: [SPARK-30537][PYTHON] Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-975549058
**[Test build #145511 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145511/testReport)** for PR 34401 at commit [`8c3de09`](https://github.com/apache/spark/commit/8c3de09fa670aaea5952501ffdc2fbdcf67cad8b).
--
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON] Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-975579552
**[Test build #145511 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145511/testReport)** for PR 34401 at commit [`8c3de09`](https://github.com/apache/spark/commit/8c3de09fa670aaea5952501ffdc2fbdcf67cad8b).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
--
[GitHub] [spark] pralabhkumar commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966244527
@HyukjinKwon
There is another way to do the same thing using pyarrow; the code is below. I am OK with either approach.
Please review the PR and advise. If you are OK with it, I will resolve the merge conflicts.

    tmp_schema = [
        StructField(
            tmp_column_names[i],
            TimestampNTZType()
            if isinstance(self.schema.fields[i].dataType, TimestampType)
            else self.schema.fields[i].dataType,
        )
        for i in range(len(self.columns))
    ]
    table = pyarrow.Table.from_arrays(
        arrays=[pyarrow.array([])] * len(self.schema.fields),
        schema=to_arrow_schema(StructType(tmp_schema)),
    )
    pdf = table.to_pandas().set_index([pd.Index([])])
    pdf.columns = self.columns
    return pdf
--
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966816420
**[Test build #145138 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145138/testReport)** for PR 34401 at commit [`152ef5d`](https://github.com/apache/spark/commit/152ef5d52b31f8fa19b40ff8aeb131648e809bbf).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
--
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966827020
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49609/
--
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966936555
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49623/
--
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-974001466
**[Test build #145454 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145454/testReport)** for PR 34401 at commit [`4ef49f3`](https://github.com/apache/spark/commit/4ef49f388b13573b2e62681198c5648a81d4ad48).
--
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-974053945
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49926/
--
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953826495
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49184/
[GitHub] [spark] SparkQA removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953421841
**[Test build #144682 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144682/testReport)** for PR 34401 at commit [`5c2630c`](https://github.com/apache/spark/commit/5c2630c6fd303b4ef84a71fc560c1474e777c68a).
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953988063
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49187/
[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953428100
Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953442544
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144682/
[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953942785
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144718/
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-955140132
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49241/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-955137081
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144772/
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-955136010
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49241/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953785178
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144716/
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953765577
**[Test build #144716 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144716/testReport)** for PR 34401 at commit [`1f20700`](https://github.com/apache/spark/commit/1f20700ddf2b05432ec4954064fd487d5fdd44aa).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-952911841
Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966941096
[GitHub] [spark] SparkQA removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-974001466
**[Test build #145454 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145454/testReport)** for PR 34401 at commit [`4ef49f3`](https://github.com/apache/spark/commit/4ef49f388b13573b2e62681198c5648a81d4ad48).
[GitHub] [spark] itholic commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
itholic commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r750866835
##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -171,7 +171,18 @@ def toPandas(self) -> "PandasDataFrameLike":
                                 pdf[field.name] = _convert_map_items_to_dict(pdf[field.name])
                         return pdf
                     else:
-                        return pd.DataFrame.from_records([], columns=self.columns)
+                        corrected_panda_types = []
+                        for index, field in enumerate(self.schema):
+                            panda_type = PandasConversionMixin._to_corrected_pandas_type(
+                                field.dataType
+                            )
+                            if panda_type is None:
+                                corrected_panda_types.append((tmp_column_names[index], np.object0))
+                            else:
+                                corrected_panda_types.append((tmp_column_names[index], panda_type))
+                        pdf = pd.DataFrame(np.empty(0, dtype=corrected_panda_types), index=[])
+                        pdf.columns = self.columns
+                        return pdf
Review comment:
nit: how about
```python
return pd.DataFrame(np.empty(0, dtype=corrected_panda_types), columns=self.columns)
```
??
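For readers following the thread, here is a minimal standalone sketch of the idea in the diff above: a zero-length NumPy structured array carries one dtype per field, so the resulting empty pandas DataFrame keeps typed columns, and the final column names are assigned afterwards. The column names and the dtype list are hypothetical stand-ins for the values the PR derives from `self.schema`, and the sketch omits the `index=[]` argument used in the PR, so treat it as an illustration rather than the PR's exact code.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the schema-derived values used in the diff.
tmp_column_names = ["col_0", "col_1", "col_2"]
final_column_names = ["id", "score", "flag"]
pandas_types = [np.int64, np.float64, np.bool_]

# Approach on the removed line: no dtype information is supplied at all.
untyped = pd.DataFrame.from_records([], columns=final_column_names)

# Approach on the added lines: a zero-length structured array with one
# (name, dtype) pair per column, then a rename to the final column names.
structured_dtype = list(zip(tmp_column_names, pandas_types))
typed = pd.DataFrame(np.empty(0, dtype=structured_dtype))
typed.columns = final_column_names

print(untyped.dtypes)
print(typed.dtypes)
```

Both frames are empty; the difference is only in the column dtypes they report.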
[GitHub] [spark] HyukjinKwon closed pull request #34401: [SPARK-30537][PYTHON] Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #34401:
URL: https://github.com/apache/spark/pull/34401
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966848699
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49609/
[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966905440
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49617/
[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966941096
[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966875655
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145146/
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966891480
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49617/
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966870614
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49617/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-974037455
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145454/
[GitHub] [spark] HyukjinKwon commented on a change in pull request #34401: [SPARK-30537][PYTHON] Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r753876555
##########
File path: python/pyspark/sql/tests/test_dataframe.py
##########
@@ -817,17 +818,21 @@ def test_to_pandas_from_empty_dataframe(self):
CAST('2019-01-01' AS TIMESTAMP_NTZ) AS timestamp_ntz,
INTERVAL '1563:04' MINUTE TO SECOND AS day_time_interval
"""
- dtypes_when_nonempty_df = self.spark.sql(sql).toPandas().dtypes
- dtypes_when_empty_df = self.spark.sql(sql).filter("False").toPandas().dtypes
- self.assertTrue(np.all(dtypes_when_empty_df == dtypes_when_nonempty_df))
+ arrow_enabled_status = [True, False]
Review comment:
you can just name it `is_arrow_enabled`
[GitHub] [spark] pralabhkumar commented on pull request #34401: [SPARK-30537][PYTHON] Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-976132094
Thx @HyukjinKwon for helping out during the process
[GitHub] [spark] HyukjinKwon commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r753876555
##########
File path: python/pyspark/sql/tests/test_dataframe.py
##########
@@ -817,17 +818,21 @@ def test_to_pandas_from_empty_dataframe(self):
CAST('2019-01-01' AS TIMESTAMP_NTZ) AS timestamp_ntz,
INTERVAL '1563:04' MINUTE TO SECOND AS day_time_interval
"""
- dtypes_when_nonempty_df = self.spark.sql(sql).toPandas().dtypes
- dtypes_when_empty_df = self.spark.sql(sql).filter("False").toPandas().dtypes
- self.assertTrue(np.all(dtypes_when_empty_df == dtypes_when_nonempty_df))
+ arrow_enabled_status = [True, False]
Review comment:
you can just call `is_arrow_enabled`
[GitHub] [spark] itholic commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
itholic commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r750928501
##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -171,7 +171,18 @@ def toPandas(self) -> "PandasDataFrameLike":
                                 pdf[field.name] = _convert_map_items_to_dict(pdf[field.name])
                         return pdf
                     else:
-                        return pd.DataFrame.from_records([], columns=self.columns)
+                        corrected_panda_types = []
+                        for index, field in enumerate(self.schema):
+                            panda_type = PandasConversionMixin._to_corrected_pandas_type(
+                                field.dataType
+                            )
+                            if panda_type is None:
+                                corrected_panda_types.append((tmp_column_names[index], np.object0))
+                            else:
+                                corrected_panda_types.append((tmp_column_names[index], panda_type))
+                        pdf = pd.DataFrame(np.empty(0, dtype=corrected_panda_types), index=[])
+                        pdf.columns = self.columns
+                        return pdf
Review comment:
Got it!
Then it looks good to me as is.
[GitHub] [spark] HyukjinKwon commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-970986630
cc @BryanCutler and @itholic FYI
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953414258
ok to test
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953457610
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49151/
[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953830804
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49184/
[GitHub] [spark] pralabhkumar commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r750873159
##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -171,7 +171,18 @@ def toPandas(self) -> "PandasDataFrameLike":
                                 pdf[field.name] = _convert_map_items_to_dict(pdf[field.name])
                         return pdf
                     else:
-                        return pd.DataFrame.from_records([], columns=self.columns)
+                        corrected_panda_types = []
+                        for index, field in enumerate(self.schema):
+                            panda_type = PandasConversionMixin._to_corrected_pandas_type(
+                                field.dataType
+                            )
+                            if panda_type is None:
+                                corrected_panda_types.append((tmp_column_names[index], np.object0))
+                            else:
+                                corrected_panda_types.append((tmp_column_names[index], panda_type))
+                        pdf = pd.DataFrame(np.empty(0, dtype=corrected_panda_types), index=[])
+                        pdf.columns = self.columns
+                        return pdf
Review comment:
Agreed, making the change.
[GitHub] [spark] pralabhkumar commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r750887784
##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -171,7 +171,18 @@ def toPandas(self) -> "PandasDataFrameLike":
                             pdf[field.name] = _convert_map_items_to_dict(pdf[field.name])
                     return pdf
                 else:
-                    return pd.DataFrame.from_records([], columns=self.columns)
+                    corrected_panda_types = []
+                    for index, field in enumerate(self.schema):
+                        panda_type = PandasConversionMixin._to_corrected_pandas_type(
+                            field.dataType
+                        )
+                        if panda_type is None:
+                            corrected_panda_types.append((tmp_column_names[index], np.object0))
+                        else:
+                            corrected_panda_types.append((tmp_column_names[index], panda_type))
+                    pdf = pd.DataFrame(np.empty(0, dtype=corrected_panda_types), index=[])
+                    pdf.columns = self.columns
+                    return pdf
Review comment:
@itholic
1) Passing the columns when the DataFrame is created (as suggested) changes every column's dtype to object. Since self.columns can contain duplicate column names, I assign pdf.columns = self.columns only after the DataFrame has been created. This way the dtypes remain intact (it is the same approach used earlier in the code, L#1654).
2) We also need to pass index=[], because otherwise the DataFrame is created with RangeIndex(start=0, stop=0, step=1). The empty DataFrame created without Arrow, further down in the code, has index=[], so the index is added to make the empty DataFrame identical with and without Arrow.
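The first point can be illustrated with plain pandas. This is a minimal sketch, not the code from the PR: the helper name `empty_typed_frame` and the sample dtypes are hypothetical, and the frame is built from empty typed Series rather than from a NumPy structured array. The only idea carried over from the comment is that unique temporary names are used first and the real, possibly duplicated, names are assigned afterwards.

```python
import numpy as np
import pandas as pd


def empty_typed_frame(names, dtypes):
    """Build an empty DataFrame whose columns keep the requested dtypes.

    `names` may contain duplicates, so unique temporary names are used
    while the frame is built and the real names are assigned afterwards.
    """
    tmp_names = ["col_{}".format(i) for i in range(len(names))]
    pdf = pd.DataFrame(
        {tmp: pd.Series([], dtype=dtype) for tmp, dtype in zip(tmp_names, dtypes)}
    )
    # Renaming after construction leaves the per-column dtypes untouched.
    pdf.columns = names
    return pdf


# Hypothetical schema with a duplicated column name "a".
pdf = empty_typed_frame(["a", "a", "c"], [object, np.int32, "datetime64[ns]"])
print(pdf.dtypes)
```

The index type of the result may differ between pandas versions, which is why the comment above sets the index explicitly; the sketch deliberately makes no claim about it.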
[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-955137081
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144772/
[GitHub] [spark] pralabhkumar commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r753157519
##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -171,7 +171,18 @@ def toPandas(self) -> "PandasDataFrameLike":
                             pdf[field.name] = _convert_map_items_to_dict(pdf[field.name])
                     return pdf
                 else:
-                    return pd.DataFrame.from_records([], columns=self.columns)
+                    corrected_panda_types = []
+                    for index, field in enumerate(self.schema):
+                        panda_type = PandasConversionMixin._to_corrected_pandas_type(
+                            field.dataType
+                        )
+                        if panda_type is None:
+                            corrected_panda_types.append((tmp_column_names[index], np.object0))
+                        else:
+                            corrected_panda_types.append((tmp_column_names[index], panda_type))
+                    pdf = pd.DataFrame(np.empty(0, dtype=corrected_panda_types), index=[])
Review comment:
@HyukjinKwon I have refined this as suggested. Please review.
[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-974037455
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145454/
[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-974081327
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49926/
[GitHub] [spark] HyukjinKwon commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r753876468
##########
File path: python/pyspark/sql/tests/test_arrow.py
##########
@@ -205,6 +205,49 @@ def test_toPandas_fallback_disabled(self):
                 with self.assertRaisesRegex(Exception, "Unsupported type"):
                     df.toPandas()
+    def test_toPandas_empty_df_arrow_enabled(self):
+        # SPARK-30537 test that toPandas() on an empty dataframe has the correct dtypes
+        # when arrow is enabled
+        from datetime import datetime, date
+        from decimal import Decimal
+
+        schema = StructType(
+            [
+                StructField("a", StringType(), True),
+                StructField("a", IntegerType(), True),
+                StructField("c", TimestampType(), True),
Review comment:
We will probably also have to add `TimestampNTZ` and `DayTimeIntervalType` to the test. `DayTimeIntervalType` might have an issue (see also https://github.com/apache/spark/pull/34631/files#r751947212).
[GitHub] [spark] pralabhkumar commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r738513474
##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -151,7 +151,17 @@ def toPandas(self) -> "PandasDataFrameLike":
                                 _convert_map_items_to_dict(pdf[field.name])
                     return pdf
                 else:
-                    return pd.DataFrame.from_records([], columns=self.columns)
+                    pdf = pd.DataFrame.from_records([], columns=self.columns)
+                    df = pd.DataFrame()
+                    for fieldIdx, field in enumerate(self.schema):
+                        pandas_type = \
+                            PandasConversionMixin._to_corrected_pandas_type(field.dataType)
+                        column_name = self.schema[fieldIdx].name
+                        series = pdf.iloc[:, fieldIdx]
+                        if pandas_type is not None:
+                            series = series.astype(pandas_type, copy=False)
+                        df.insert(fieldIdx, column_name, series, allow_duplicates=True)
Review comment:
@HyukjinKwon I have made the changes. Please review.
[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953464277
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49151/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953428100
[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953994729
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49187/
[GitHub] [spark] pralabhkumar commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r738109398
##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -151,7 +151,17 @@ def toPandas(self) -> "PandasDataFrameLike":
                                 _convert_map_items_to_dict(pdf[field.name])
                     return pdf
                 else:
-                    return pd.DataFrame.from_records([], columns=self.columns)
+                    pdf = pd.DataFrame.from_records([], columns=self.columns)
+                    df = pd.DataFrame()
+                    for fieldIdx, field in enumerate(self.schema):
+                        pandas_type = \
+                            PandasConversionMixin._to_corrected_pandas_type(field.dataType)
+                        column_name = self.schema[fieldIdx].name
+                        series = pdf.iloc[:, fieldIdx]
+                        if pandas_type is not None:
+                            series = series.astype(pandas_type, copy=False)
+                        df.insert(fieldIdx, column_name, series, allow_duplicates=True)
Review comment:
@HyukjinKwon Thanks for the comment. There is a simple way to assign types to an empty DataFrame, but the problem arises when the DataFrame contains duplicate column names. We can check whether the DataFrame has duplicate columns and use the above approach in that case, and the simple approach otherwise.
Please let me know if that is OK.
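The duplicate-column case can be sketched in plain pandas as follows. This is an illustration only, not the PR's code: the field list and the helper name `insert_typed_columns` are made up for the example. It relies on `DataFrame.insert(..., allow_duplicates=True)`, the call used in the diff above, to add columns by position so that repeated names do not collide.

```python
import numpy as np
import pandas as pd


def insert_typed_columns(fields):
    """Build an empty DataFrame from (name, dtype) pairs, allowing repeated names."""
    df = pd.DataFrame()
    for position, (name, dtype) in enumerate(fields):
        # Each column is an empty Series that already carries the target dtype;
        # allow_duplicates=True lets a name that is already present be added again.
        df.insert(position, name, pd.Series([], dtype=dtype), allow_duplicates=True)
    return df


# Hypothetical fields: the name "a" appears twice with different dtypes.
df = insert_typed_columns([("a", object), ("a", np.int64), ("c", np.float64)])
print(df.dtypes)
```

Because each column is addressed by position rather than by label, no duplicate check is needed before choosing an approach.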
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON] Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-975656323
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49983/
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON] Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-975598739
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49983/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON] Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-975664102
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49983/
[GitHub] [spark] SparkQA removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966806374
**[Test build #145138 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145138/testReport)** for PR 34401 at commit [`152ef5d`](https://github.com/apache/spark/commit/152ef5d52b31f8fa19b40ff8aeb131648e809bbf).
[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966825178
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145138/
[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966851506
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49609/
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966853542
**[Test build #145146 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145146/testReport)** for PR 34401 at commit [`ba81f80`](https://github.com/apache/spark/commit/ba81f8019ad087a86fdc14234a868a6c99c8c2fc).
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966925629
**[Test build #145153 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145153/testReport)** for PR 34401 at commit [`29215ee`](https://github.com/apache/spark/commit/29215eec008d56c1d84c41bbccb512cb7841c885).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966875655
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145146/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966825178
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145138/
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966806374
**[Test build #145138 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145138/testReport)** for PR 34401 at commit [`152ef5d`](https://github.com/apache/spark/commit/152ef5d52b31f8fa19b40ff8aeb131648e809bbf).
[GitHub] [spark] pralabhkumar edited a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
pralabhkumar edited a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966244527
@HyukjinKwon
There is another way to do the same thing using pyarrow; the code is below.
But since it is an empty DataFrame, it is better to do it directly with pandas.
tmp_schema = [
    pyarrow.field(
        tmp_column_names[i],
        to_arrow_type(
            TimestampNTZType()
            if isinstance(self.schema.fields[i].dataType, TimestampType)
            else self.schema.fields[i].dataType
        ),
    )
    for i in range(len(self.columns))
]
pdf = (
    pyarrow.schema(tmp_schema)
    .empty_table()
    .to_pandas()
    .set_index([pd.Index([])])
)
pdf.columns = self.columns
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966905440
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49617/
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-966910627
**[Test build #145153 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145153/testReport)** for PR 34401 at commit [`29215ee`](https://github.com/apache/spark/commit/29215eec008d56c1d84c41bbccb512cb7841c885).
[GitHub] [spark] pralabhkumar commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-970353316
Gentle ping @HyukjinKwon. Please review the PR.
[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-954459042
Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-952911841
Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953942785
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144718/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953464277
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49151/
[GitHub] [spark] SparkQA removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-953748744
**[Test build #144716 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144716/testReport)** for PR 34401 at commit [`1f20700`](https://github.com/apache/spark/commit/1f20700ddf2b05432ec4954064fd487d5fdd44aa).
[GitHub] [spark] pralabhkumar commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-972862689
Gentle ping @BryanCutler.
[GitHub] [spark] AmplabJenkins commented on pull request #34401: [SPARK-30537][PYTHON] Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-975606553
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145511/
[GitHub] [spark] SparkQA commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-974028302
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49926/
[GitHub] [spark] pralabhkumar commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r753157519
##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -171,7 +171,18 @@ def toPandas(self) -> "PandasDataFrameLike":
                 pdf[field.name] = _convert_map_items_to_dict(pdf[field.name])
         return pdf
     else:
-        return pd.DataFrame.from_records([], columns=self.columns)
+        corrected_panda_types = []
+        for index, field in enumerate(self.schema):
+            panda_type = PandasConversionMixin._to_corrected_pandas_type(
+                field.dataType
+            )
+            if panda_type is None:
+                corrected_panda_types.append((tmp_column_names[index], np.object0))
+            else:
+                corrected_panda_types.append((tmp_column_names[index], panda_type))
+        pdf = pd.DataFrame(np.empty(0, dtype=corrected_panda_types), index=[])
Review comment:
@HyukjinKwon I have refined this. Please review.
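For readers following the thread, the approach in this hunk can be tried outside Spark. The snippet below is only a minimal sketch: the column names and dtypes are hypothetical stand-ins for what `_to_corrected_pandas_type` would return for a Spark schema, and `np.object_` is used in place of the `np.object0` alias from the diff, which newer NumPy releases no longer provide.

```python
import numpy as np
import pandas as pd

# Hypothetical (column name, dtype) pairs standing in for a Spark schema;
# in the PR these come from the schema fields, with object as the fallback.
typed_columns = [
    ("id", np.int32),
    ("score", np.float64),
    ("name", np.object_),
]

# A zero-length structured array carries one dtype per named field.
empty_records = np.empty(0, dtype=typed_columns)

# Building the frame from the structured array keeps each field's dtype,
# even though there are no rows.
pdf = pd.DataFrame(empty_records, index=[])

print(pdf.dtypes)
```

The point of going through a structured array is that the dtypes are fixed before pandas sees any data, so no per-column cast is needed afterwards.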
[GitHub] [spark] pralabhkumar commented on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-955237982
@HyukjinKwon Please review the PR and let me know if any further changes are needed.
[GitHub] [spark] pralabhkumar commented on a change in pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on a change in pull request #34401:
URL: https://github.com/apache/spark/pull/34401#discussion_r739656823
##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -151,7 +151,17 @@ def toPandas(self) -> "PandasDataFrameLike":
                     _convert_map_items_to_dict(pdf[field.name])
         return pdf
     else:
-        return pd.DataFrame.from_records([], columns=self.columns)
+        pdf = pd.DataFrame.from_records([], columns=self.columns)
+        df = pd.DataFrame()
+        for fieldIdx, field in enumerate(self.schema):
+            pandas_type = \
+                PandasConversionMixin._to_corrected_pandas_type(field.dataType)
+            column_name = self.schema[fieldIdx].name
+            series = pdf.iloc[:, fieldIdx]
+            if pandas_type is not None:
+                series = series.astype(pandas_type, copy=False)
+            df.insert(fieldIdx, column_name, series, allow_duplicates=True)
Review comment:
@HyukjinKwon Please check the logic. Let me know if it's OK.
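The idea behind this earlier revision, casting the columns of an empty frame after it is created, can be illustrated with plain pandas. This is a simplified sketch rather than the PR's code: the column names and target dtypes are hypothetical, and it assumes unique column names, whereas the diff works by position and uses `allow_duplicates=True` to tolerate duplicates.

```python
import numpy as np
import pandas as pd

# Hypothetical column names and target dtypes; in the PR the dtypes
# are derived from the Spark schema.
columns = ["id", "score"]
target_types = {"id": np.int32, "score": np.float64}

# An empty frame built from no records has object dtype in every column,
# which is the behaviour the PR sets out to correct.
pdf = pd.DataFrame.from_records([], columns=columns)
dtypes_before = pdf.dtypes.to_dict()

# Casting works on empty columns too, so the dtypes can be fixed up
# after the frame has been created.
typed = pdf.astype(target_types)

print(dtypes_before)
print(typed.dtypes)
```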
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34401: [SPARK-30537][PYTHON], Fix toPandas wrong dtypes when applied on empty DF when Arrow enabled
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34401:
URL: https://github.com/apache/spark/pull/34401#issuecomment-954459042
Can one of the admins verify this patch?