Posted to commits@spark.apache.org by ue...@apache.org on 2018/01/10 05:00:15 UTC
spark git commit: [SPARK-23018][PYTHON] Fix createDataFrame from Pandas timestamp series assignment
Repository: spark
Updated Branches:
refs/heads/master 6f169ca9e -> 7bcc26668
[SPARK-23018][PYTHON] Fix createDataFrame from Pandas timestamp series assignment
## What changes were proposed in this pull request?
This fixes createDataFrame from Pandas to assign modified timestamp series back only to a copied version of the Pandas DataFrame. Previously, if the Pandas DataFrame was only a reference (e.g. a slice of another), each series would still be assigned back to the reference even if it was not a modified timestamp column. This caused the warning "SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame."
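The warning comes from writing a column into a DataFrame that may be a view of another; copying before assignment makes the write unambiguous. A minimal standalone pandas sketch (not Spark code) of the safe pattern:

```python
import pandas as pd

base = pd.DataFrame({"a": [1, 2, 3]})
slice_ = base.iloc[1:]  # may be a view sharing data with 'base'

# Assigning into 'slice_' directly could raise SettingWithCopyWarning, and the
# write might (or might not) propagate to 'base'. Copying first removes the
# ambiguity: the assignment lands on an independent DataFrame.
safe = slice_.copy()
safe["a"] = safe["a"] * 10

# 'base' is unaffected by the assignment into 'safe'
```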
## How was this patch tested?
Existing tests.
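The copy-once logic in the diff below can be sketched in plain pandas. Here `convert_timestamps_copy_once` and the lambda in the usage example are hypothetical stand-ins for the session code and `_check_series_convert_timestamps_tz_local`; the real helper returns the original series unchanged when no conversion is needed, which is what the identity check relies on:

```python
import pandas as pd

def convert_timestamps_copy_once(pdf, convert):
    # 'convert' returns either the original series object (no change needed)
    # or a new, converted series.
    copied = False
    for column, series in pdf.items():
        s = convert(series)
        if s is not series:  # only act when the series was actually modified
            if not copied:
                # Copy once so a caller-owned (possibly sliced) DataFrame
                # is never written to in place
                pdf = pdf.copy()
                copied = True
            pdf[column] = s
    return pdf

# Usage: only the datetime column triggers the copy; the original DataFrame
# keeps its naive timestamps.
original = pd.DataFrame(
    {"a": [1, 2], "t": pd.to_datetime(["2018-01-10", "2018-01-11"])}
)
converted = convert_timestamps_copy_once(
    original, lambda s: s.dt.tz_localize("UTC") if s.dtype.kind == "M" else s
)
```

If no column is modified, the function returns the caller's DataFrame as-is without copying, matching the patched behavior.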
Author: Bryan Cutler <cu...@gmail.com>
Closes #20213 from BryanCutler/pyspark-createDataFrame-copy-slice-warn-SPARK-23018.
Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7bcc2666
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7bcc2666
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7bcc2666
Branch: refs/heads/master
Commit: 7bcc2666810cefc85dfa0d6679ac7a0de9e23154
Parents: 6f169ca
Author: Bryan Cutler <cu...@gmail.com>
Authored: Wed Jan 10 14:00:07 2018 +0900
Committer: Takuya UESHIN <ue...@databricks.com>
Committed: Wed Jan 10 14:00:07 2018 +0900
----------------------------------------------------------------------
python/pyspark/sql/session.py | 28 +++++++++++++++-------------
1 file changed, 15 insertions(+), 13 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/spark/blob/7bcc2666/python/pyspark/sql/session.py
----------------------------------------------------------------------
diff --git a/python/pyspark/sql/session.py b/python/pyspark/sql/session.py
index 6052fa9..3e45747 100644
--- a/python/pyspark/sql/session.py
+++ b/python/pyspark/sql/session.py
@@ -459,21 +459,23 @@ class SparkSession(object):
# TODO: handle nested timestamps, such as ArrayType(TimestampType())?
if isinstance(field.dataType, TimestampType):
s = _check_series_convert_timestamps_tz_local(pdf[field.name], timezone)
- if not copied and s is not pdf[field.name]:
- # Copy once if the series is modified to prevent the original Pandas
- # DataFrame from being updated
- pdf = pdf.copy()
- copied = True
- pdf[field.name] = s
+ if s is not pdf[field.name]:
+ if not copied:
+ # Copy once if the series is modified to prevent the original
+ # Pandas DataFrame from being updated
+ pdf = pdf.copy()
+ copied = True
+ pdf[field.name] = s
else:
for column, series in pdf.iteritems():
- s = _check_series_convert_timestamps_tz_local(pdf[column], timezone)
- if not copied and s is not pdf[column]:
- # Copy once if the series is modified to prevent the original Pandas
- # DataFrame from being updated
- pdf = pdf.copy()
- copied = True
- pdf[column] = s
+ s = _check_series_convert_timestamps_tz_local(series, timezone)
+ if s is not series:
+ if not copied:
+ # Copy once if the series is modified to prevent the original
+ # Pandas DataFrame from being updated
+ pdf = pdf.copy()
+ copied = True
+ pdf[column] = s
# Convert pandas.DataFrame to list of numpy records
np_records = pdf.to_records(index=False)
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org