Posted to reviews@spark.apache.org by ueshin <gi...@git.apache.org> on 2017/10/18 06:41:46 UTC
[GitHub] spark pull request #18664: [SPARK-21375][PYSPARK][SQL] Add Date and Timestam...
Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/18664#discussion_r145327316
--- Diff: python/pyspark/serializers.py ---
@@ -259,11 +261,13 @@ def load_stream(self, stream):
         """
         Deserialize ArrowRecordBatches to an Arrow table and return as a list of pandas.Series.
         """
+        from pyspark.sql.types import _check_dataframe_localize_timestamps
         import pyarrow as pa
         reader = pa.open_stream(stream)
         for batch in reader:
-            table = pa.Table.from_batches([batch])
-            yield [c.to_pandas() for c in table.itercolumns()]
+            # NOTE: changed from pa.Columns.to_pandas, timezone issue in conversion fixed in 0.7.1
+            pdf = _check_dataframe_localize_timestamps(batch.to_pandas())
+            yield [c for _, c in pdf.iteritems()]
--- End diff --
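For context, the change replaces a per-column conversion (pa.Table.from_batches followed by to_pandas on each column) with a single batch-level to_pandas() plus a timestamp-localization pass. Below is a minimal sketch of the two paths; the sample data is made up, and it targets the current pyarrow/pandas API (the diff-era pa.open_stream and DataFrame.iteritems have since been renamed):

    import pandas as pd
    import pyarrow as pa

    # Hypothetical sample batch: one integer column, one tz-aware timestamp column.
    batch = pa.RecordBatch.from_arrays(
        [pa.array([1, 2, 3]),
         pa.array(pd.to_datetime(["2017-10-18 06:41:46"] * 3, utc=True))],
        ["id", "ts"])

    # Old path: build an intermediate Table, then convert column by column.
    table = pa.Table.from_batches([batch])
    old_series = [c.to_pandas() for c in table.itercolumns()]

    # New path: one batch-level conversion, then split the DataFrame into Series.
    pdf = batch.to_pandas()
    new_series = [s for _, s in pdf.items()]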
I ran your script locally, too.
- before change:
  - mean: 2.605722
  - min: 2.502404
  - max: 3.045294
- after change:
  - mean: 2.626306
  - min: 2.341781
  - max: 2.742432
I think it's okay to use this workaround.
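For reference, here is a minimal sketch of what a helper of this shape could do; the real _check_dataframe_localize_timestamps lives in pyspark.sql.types, so this rendition is an assumption about its behavior, not Spark's implementation:

    import pandas as pd
    from pandas.api.types import is_datetime64tz_dtype

    def localize_timestamps(pdf):
        """Convert tz-aware timestamp columns to tz-naive local time (assumed behavior)."""
        for name, series in pdf.items():
            if is_datetime64tz_dtype(series.dtype):
                # Shift values to the local zone, then drop the tz info so
                # downstream consumers see naive local timestamps.
                pdf[name] = series.dt.tz_convert("tzlocal()").dt.tz_localize(None)
        return pdf

Localizing once per DataFrame keeps the per-batch overhead small, which is consistent with the near-identical means above.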
---