Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/09/13 13:57:31 UTC

[GitHub] [spark] pralabhkumar opened a new pull request #33980: [WIP][SPARK-32285]Add PySpark support for nested timestamps with arrow

pralabhkumar opened a new pull request #33980:
URL: https://github.com/apache/spark/pull/33980


   What changes were proposed in this pull request?
   Added nested timestamp support for PySpark with Arrow.

   Why are the changes needed?
   This change is required to convert ArrayType(TimestampType) columns to pandas via Arrow.

   Does this PR introduce any user-facing change?
   Yes. Users will be able to convert DataFrames that contain ArrayType(TimestampType) columns to pandas.

   How was this patch tested?
   Unit tests.
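
   A hypothetical usage example of what this PR enables (schema and values are illustrative, adapted from the tests):

       import datetime

       spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
       df = spark.createDataFrame(
           [([datetime.datetime(2015, 11, 1, 0, 30)],)],
           "timestamp_array array<timestamp>")
       pdf = df.toPandas()  # the ArrayType(TimestampType) column now converts via Arrow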


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] commented on pull request #33980: [SPARK-32285][PYTHON] Add PySpark support for nested timestamps with arrow

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #33980:
URL: https://github.com/apache/spark/pull/33980#issuecomment-1046129975


   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


[GitHub] [spark] pralabhkumar commented on pull request #33980: [SPARK-32285][PYTHON] Add PySpark support for nested timestamps with arrow

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on pull request #33980:
URL: https://github.com/apache/spark/pull/33980#issuecomment-966258120


   Thanks @BryanCutler

   First of all, thanks for taking the time and effort to review the PR.

   ########
    I think there are issues with simply checking the first element in the first value to see if it's a timestamp.
   ########

   Response: I completely agree with you; checking only the first value is not correct. I'll change it.

   ########
   Also it does not seem like much of the additions here could be used to handle other nested types or deeper nesting, which would then all require specialized functions to handle
   ########

   Response: modify_timestamp_array is the method that does the datetime conversion. We can reuse the same code for deeper nesting; its calling method would have to invoke it recursively. However, I agree with you that this code may not be usable for other nested types.
   As per the JIRA, the immediate requirement was an array of timestamps, which is why I only considered ArrayType.

   ########
   I am thinking it would be better to utilize pyarrow for all type checking and flattening of nested columns so that all conversions are done on flat series. Although, I understand there might be some challenges with this approach as well.
   ########

   Response: I tried to keep the code similar to what is already written, e.g., doing the time conversion after converting to pandas. Therefore my approach/design closely follows what is already in the code.

   However, I would like to better understand your approach. If possible, maybe we can have a call so I can understand it better and then code it that way. Please let me know if you are OK with that.

   On the other hand, if you think the current approach is fine, I'll address all the review comments you provided.
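
   For reference, a minimal sketch of the pyarrow-based type checking idea mentioned above (the helper below is hypothetical, not code from this PR): the Arrow type can be walked once, so no data values need to be inspected.

       import pyarrow as pa

       def contains_timestamp(arrow_type: pa.DataType) -> bool:
           # Recurse through list types until a leaf type is reached.
           if pa.types.is_timestamp(arrow_type):
               return True
           if pa.types.is_list(arrow_type):
               return contains_timestamp(arrow_type.value_type)
           return False

       contains_timestamp(pa.list_(pa.timestamp("us")))            # True
       contains_timestamp(pa.list_(pa.list_(pa.timestamp("us"))))  # True, any depth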


[GitHub] [spark] pralabhkumar commented on pull request #33980: [SPARK-32285][PYTHON] Add PySpark support for nested timestamps with arrow

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on pull request #33980:
URL: https://github.com/apache/spark/pull/33980#issuecomment-963070841


   Thanks for your response @BryanCutler. Looking forward to your review.


[GitHub] [spark] pralabhkumar commented on a change in pull request #33980: [WIP][SPARK-32285][PYTHON] Add PySpark support for nested timestamps with arrow

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on a change in pull request #33980:
URL: https://github.com/apache/spark/pull/33980#discussion_r715799109



##########
File path: python/pyspark/sql/tests/test_arrow.py
##########
@@ -118,32 +117,50 @@ def create_pandas_data_frame(self):
         data_dict["4_float_t"] = np.float32(data_dict["4_float_t"])
         return pd.DataFrame(data=data_dict)
 
-    def test_toPandas_fallback_enabled(self):
+    # def test_toPandas_fallback_enabled(self):

Review comment:
    Done. However, I have changed the expected failure type to ArrayType<ArrayType<TimeStamp>> instead of ArrayType<TimeStamp>, since ArrayType<TimeStamp> will no longer throw an exception.




[GitHub] [spark] pralabhkumar commented on pull request #33980: [WIP][SPARK-32285][PYTHON] Add PySpark support for nested timestamps with arrow

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on pull request #33980:
URL: https://github.com/apache/spark/pull/33980#issuecomment-922227237


   @HyukjinKwon

   I have fixed all the issues with the build. Please take a look at the PR.


[GitHub] [spark] pralabhkumar commented on pull request #33980: [SPARK-32285][PYTHON] Add PySpark support for nested timestamps with arrow

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on pull request #33980:
URL: https://github.com/apache/spark/pull/33980#issuecomment-933728879


   @BryanCutler Thanks for finding the time to review it, and sorry for the late reply. I have made changes to simplify the conversion logic. Also, this PR is only for arrays of timestamps; later I can work on MapType and StructType in a separate PR.


[GitHub] [spark] pralabhkumar commented on pull request #33980: [SPARK-32285][PYTHON] Add PySpark support for nested timestamps with arrow

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on pull request #33980:
URL: https://github.com/apache/spark/pull/33980#issuecomment-940223054


   @BryanCutler @HyukjinKwon Gentle ping, please review the PR.


[GitHub] [spark] pralabhkumar commented on pull request #33980: [SPARK-32285][PYTHON] Add PySpark support for nested timestamps with arrow

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on pull request #33980:
URL: https://github.com/apache/spark/pull/33980#issuecomment-953140170


   @BryanCutler

   Please find some time to review the PR; it would be really helpful.


[GitHub] [spark] BryanCutler commented on pull request #33980: [SPARK-32285][PYTHON] Add PySpark support for nested timestamps with arrow

Posted by GitBox <gi...@apache.org>.
BryanCutler commented on pull request #33980:
URL: https://github.com/apache/spark/pull/33980#issuecomment-956648528


   Apologies, I have not had a chance to review again. I will try to do so later this week.


[GitHub] [spark] pralabhkumar commented on a change in pull request #33980: [SPARK-32285][PYTHON] Add PySpark support for nested timestamps with arrow

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on a change in pull request #33980:
URL: https://github.com/apache/spark/pull/33980#discussion_r747440289



##########
File path: python/pyspark/sql/pandas/types.py
##########
@@ -294,8 +303,10 @@ def _check_series_convert_timestamps_localize(
     from pandas.api.types import is_datetime64tz_dtype, is_datetime64_dtype
     from_tz = from_timezone or _get_local_timezone()
     to_tz = to_timezone or _get_local_timezone()
-    # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
-    if is_datetime64tz_dtype(s.dtype):
+    if datatype == ArrayType.__name__:

Review comment:
       Will change it 
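
    For illustration, a minimal sketch of the isinstance-based alternative the reviewer is asking about (hypothetical helper, assuming the DataType object itself is passed through rather than its name):

        from pyspark.sql.types import ArrayType, TimestampType

        def is_array_of_timestamps(dt) -> bool:
            # Check the DataType object directly instead of comparing __name__ strings.
            return isinstance(dt, ArrayType) and isinstance(dt.elementType, TimestampType)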




[GitHub] [spark] pralabhkumar commented on pull request #33980: [SPARK-32285][PYTHON] Add PySpark support for nested timestamps with arrow

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on pull request #33980:
URL: https://github.com/apache/spark/pull/33980#issuecomment-937748367


   @BryanCutler @HyukjinKwon Please review the PR, and let me know if any changes are needed.


[GitHub] [spark] HyukjinKwon commented on a change in pull request #33980: [WIP][SPARK-32285][PYTHON] Add PySpark support for nested timestamps with arrow

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #33980:
URL: https://github.com/apache/spark/pull/33980#discussion_r714430165



##########
File path: python/pyspark/sql/pandas/types.py
##########
@@ -190,10 +191,19 @@ def _check_series_localize_timestamps(s, timezone):
     """
     from pyspark.sql.pandas.utils import require_minimum_pandas_version
     require_minimum_pandas_version()
-
+    from pandas import Series, DatetimeTZDtype
+    from pandas.api.types import is_datetime64_dtype
     from pandas.api.types import is_datetime64tz_dtype
     tz = timezone or _get_local_timezone()
     # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
+    # Check if its Array of TimeStamp
+    if datatype == ArrayType.__name__:

Review comment:
       does it work for nested + nested array of timestamps?




[GitHub] [spark] pralabhkumar commented on a change in pull request #33980: [WIP][SPARK-32285][PYTHON] Add PySpark support for nested timestamps with arrow

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on a change in pull request #33980:
URL: https://github.com/apache/spark/pull/33980#discussion_r715799109



##########
File path: python/pyspark/sql/tests/test_arrow.py
##########
@@ -118,32 +117,50 @@ def create_pandas_data_frame(self):
         data_dict["4_float_t"] = np.float32(data_dict["4_float_t"])
         return pd.DataFrame(data=data_dict)
 
-    def test_toPandas_fallback_enabled(self):
+    # def test_toPandas_fallback_enabled(self):

Review comment:
    Done. However, I have changed the expected failure type to Map<TimeStamp,TimeStamp> instead of ArrayType<TimeStamp>, since ArrayType<TimeStamp> will no longer throw an exception.

##########
File path: python/pyspark/sql/pandas/types.py
##########
@@ -190,10 +191,19 @@ def _check_series_localize_timestamps(s, timezone):
     """
     from pyspark.sql.pandas.utils import require_minimum_pandas_version
     require_minimum_pandas_version()
-
+    from pandas import Series, DatetimeTZDtype
+    from pandas.api.types import is_datetime64_dtype
     from pandas.api.types import is_datetime64tz_dtype
     tz = timezone or _get_local_timezone()
     # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
+    # Check if its Array of TimeStamp

Review comment:
       done

##########
File path: python/pyspark/sql/pandas/types.py
##########
@@ -190,10 +191,19 @@ def _check_series_localize_timestamps(s, timezone):
     """
     from pyspark.sql.pandas.utils import require_minimum_pandas_version
     require_minimum_pandas_version()
-
+    from pandas import Series, DatetimeTZDtype
+    from pandas.api.types import is_datetime64_dtype
     from pandas.api.types import is_datetime64tz_dtype
     tz = timezone or _get_local_timezone()
     # TODO: handle nested timestamps, such as ArrayType(TimestampType())?

Review comment:
       done




[GitHub] [spark] pralabhkumar commented on a change in pull request #33980: [SPARK-32285][PYTHON] Add PySpark support for nested timestamps with arrow

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on a change in pull request #33980:
URL: https://github.com/apache/spark/pull/33980#discussion_r721598237



##########
File path: python/pyspark/sql/pandas/types.py
##########
@@ -296,7 +337,34 @@ def _check_series_convert_timestamps_localize(s, from_timezone, to_timezone):
         return s
 
 
-def _check_series_convert_timestamps_local_tz(s, timezone):
+def __handle_array_of_timestamps(series, to_tz,  from_tz=None):
+    """
+
+    :param series: Pandas series
+    :param to_tz: to timezone
+    :param from_tz: from time zone
+    :return: return series respecting timezone
+    """
+    from pandas.api.types import is_datetime64tz_dtype, is_datetime64_dtype
+    import pandas as pd
+    from pandas import Series
+    data_after_conversion = []
+    for data in series:

Review comment:
    @BryanCutler, I have made some changes: removed all the unnecessary specialized conversion and the creation of a redundant series.




[GitHub] [spark] github-actions[bot] closed pull request #33980: [SPARK-32285][PYTHON] Add PySpark support for nested timestamps with arrow

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #33980:
URL: https://github.com/apache/spark/pull/33980


   


[GitHub] [spark] pralabhkumar commented on pull request #33980: [WIP][SPARK-32285]Add PySpark support for nested timestamps with arrow

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on pull request #33980:
URL: https://github.com/apache/spark/pull/33980#issuecomment-918801328


   > @pralabhkumar mind keeping the GitHub PR template as is? (https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE)

   @HyukjinKwon
   I have updated the PR template. Sorry, since the PR is still in progress, I missed it.
   As stated in the JIRA, if you can do a quick review and let me know whether the direction is correct, I'll complete the rest of the JIRA and improve the PR.


[GitHub] [spark] HyukjinKwon commented on a change in pull request #33980: [WIP][SPARK-32285][PYTHON] Add PySpark support for nested timestamps with arrow

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #33980:
URL: https://github.com/apache/spark/pull/33980#discussion_r714430038



##########
File path: python/pyspark/sql/pandas/types.py
##########
@@ -190,10 +191,19 @@ def _check_series_localize_timestamps(s, timezone):
     """
     from pyspark.sql.pandas.utils import require_minimum_pandas_version
     require_minimum_pandas_version()
-
+    from pandas import Series, DatetimeTZDtype
+    from pandas.api.types import is_datetime64_dtype
     from pandas.api.types import is_datetime64tz_dtype
     tz = timezone or _get_local_timezone()
     # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
+    # Check if its Array of TimeStamp

Review comment:
       typo

##########
File path: python/pyspark/sql/pandas/types.py
##########
@@ -190,10 +191,19 @@ def _check_series_localize_timestamps(s, timezone):
     """
     from pyspark.sql.pandas.utils import require_minimum_pandas_version
     require_minimum_pandas_version()
-
+    from pandas import Series, DatetimeTZDtype
+    from pandas.api.types import is_datetime64_dtype
     from pandas.api.types import is_datetime64tz_dtype
     tz = timezone or _get_local_timezone()
     # TODO: handle nested timestamps, such as ArrayType(TimestampType())?

Review comment:
       remove this




[GitHub] [spark] HyukjinKwon commented on a change in pull request #33980: [WIP][SPARK-32285][PYTHON] Add PySpark support for nested timestamps with arrow

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #33980:
URL: https://github.com/apache/spark/pull/33980#discussion_r714429794



##########
File path: python/pyspark/sql/pandas/types.py
##########
@@ -20,6 +20,7 @@
 pandas instances during the type conversion.
 """
 
+

Review comment:
       Let's remove these newlines.

##########
File path: python/pyspark/sql/tests/test_arrow.py
##########
@@ -118,32 +117,50 @@ def create_pandas_data_frame(self):
         data_dict["4_float_t"] = np.float32(data_dict["4_float_t"])
         return pd.DataFrame(data=data_dict)
 
-    def test_toPandas_fallback_enabled(self):
+    # def test_toPandas_fallback_enabled(self):

Review comment:
       Can you enable the tests back?




[GitHub] [spark] pralabhkumar commented on pull request #33980: [SPARK-32285][PYTHON] Add PySpark support for nested timestamps with arrow

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on pull request #33980:
URL: https://github.com/apache/spark/pull/33980#issuecomment-929450208


   @HyukjinKwon Thanks for the review; I have made the changes and fixed some bugs (also added more description to the PR).

   Please review the PR.


[GitHub] [spark] pralabhkumar commented on a change in pull request #33980: [SPARK-32285][PYTHON] Add PySpark support for nested timestamps with arrow

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on a change in pull request #33980:
URL: https://github.com/apache/spark/pull/33980#discussion_r747442267



##########
File path: python/pyspark/sql/tests/test_arrow.py
##########
@@ -133,18 +133,29 @@ def test_toPandas_fallback_enabled(self):
                         user_warns = [
                             warn.message for warn in warns if isinstance(warn.message, UserWarning)]
                         self.assertTrue(len(user_warns) > 0)
-                        self.assertTrue(
-                            "Attempting non-optimization" in str(user_warns[-1]))
-                        assert_frame_equal(pdf, pd.DataFrame({"a": [[ts]]}))
+                        self.assertTrue("Attempting non-optimization" in str(user_warns[-1]))
+                        assert_frame_equal(pdf, pd.DataFrame({"a": [[[ts]]]}))
 
     def test_toPandas_fallback_disabled(self):
-        schema = StructType([StructField("a", ArrayType(TimestampType()), True)])
-        df = self.spark.createDataFrame([(None,)], schema=schema)
+        ts = datetime.datetime(2015, 11, 1, 0, 30)
+        schema = StructType([StructField("a", ArrayType(ArrayType(TimestampType())), True)])
+        df = self.spark.createDataFrame([([[ts]],)], schema=schema)
         with QuietTest(self.sc):
             with self.warnings_lock:
                 with self.assertRaisesRegex(Exception, 'Unsupported type'):
                     df.toPandas()
 
+    def test_toPandas_array_timestamp(self):
+        schema = StructType([
+            StructField("idx", LongType(), True),
+            StructField("timestamp_array", ArrayType(TimestampType()), True)])
+        data = [(0, [datetime.datetime(1969, 1, 1, 1, 1, 1)]),
+                (2, [datetime.datetime(2100, 3, 3, 3, 3, 3)])]

Review comment:
    I have tested with None, but I agree, and will add more stringent test cases.
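
    For illustration, a hypothetical, more stringent data set along the lines suggested above, mixing an empty array, a null element, and a null row:

        import datetime

        data = [
            (0, [datetime.datetime(1969, 1, 1, 1, 1, 1)]),
            (1, []),                                              # empty array
            (2, [None, datetime.datetime(2100, 3, 3, 3, 3, 3)]),  # null element
            (3, None),                                            # null row
        ]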




[GitHub] [spark] HyukjinKwon commented on a change in pull request #33980: [WIP][SPARK-32285][PYTHON] Add PySpark support for nested timestamps with arrow

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #33980:
URL: https://github.com/apache/spark/pull/33980#discussion_r714430344



##########
File path: python/pyspark/sql/tests/test_arrow.py
##########
@@ -513,12 +531,12 @@ def run_test(num_records, num_parts, max_records, use_delay=False):
                 assert_frame_equal(pdf, pdf_arrow)
 
         cases = [
-            (1024, 512, 2),    # Use large num partitions for more likely collecting out of order
+            (1024, 512, 2),  # Use large num partitions for more likely collecting out of order
             (64, 8, 2, True),  # Use delay in first partition to force collecting out of order
-            (64, 64, 1),       # Test single batch per partition
-            (64, 1, 64),       # Test single partition, single batch
-            (64, 1, 8),        # Test single partition, multiple batches
-            (30, 7, 2),        # Test different sized partitions
+            (64, 64, 1),  # Test single batch per partition
+            (64, 1, 64),  # Test single partition, single batch
+            (64, 1, 8),  # Test single partition, multiple batches
+            (30, 7, 2),  # Test different sized partitions

Review comment:
       let's remove all these unrelated changes.




[GitHub] [spark] pralabhkumar commented on a change in pull request #33980: [WIP][SPARK-32285][PYTHON] Add PySpark support for nested timestamps with arrow

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on a change in pull request #33980:
URL: https://github.com/apache/spark/pull/33980#discussion_r715799826



##########
File path: python/pyspark/sql/pandas/types.py
##########
@@ -190,10 +191,19 @@ def _check_series_localize_timestamps(s, timezone):
     """
     from pyspark.sql.pandas.utils import require_minimum_pandas_version
     require_minimum_pandas_version()
-
+    from pandas import Series, DatetimeTZDtype
+    from pandas.api.types import is_datetime64_dtype
     from pandas.api.types import is_datetime64tz_dtype
     tz = timezone or _get_local_timezone()
     # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
+    # Check if its Array of TimeStamp
+    if datatype == ArrayType.__name__:

Review comment:
    Currently only one level of nesting is handled, so only ArrayType<TimeStamp> will work. This is also what the JIRA suggested (as the immediate need).
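
    For illustration, a minimal sketch of how deeper nesting could reuse a single leaf conversion by recursing through the lists (hypothetical helper, not part of this PR):

        def convert_nested(value, convert_leaf):
            # Recurse through arbitrarily nested lists until leaf timestamps are reached.
            if value is None:
                return None
            if isinstance(value, list):
                return [convert_nested(v, convert_leaf) for v in value]
            return convert_leaf(value)  # e.g. localize a single datetime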




[GitHub] [spark] pralabhkumar commented on a change in pull request #33980: [SPARK-32285][PYTHON] Add PySpark support for nested timestamps with arrow

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on a change in pull request #33980:
URL: https://github.com/apache/spark/pull/33980#discussion_r747442001



##########
File path: python/pyspark/sql/pandas/types.py
##########
@@ -356,7 +389,27 @@ def _convert_map_items_to_dict(s: "PandasSeriesLike") -> "PandasSeriesLike":
     return s.apply(lambda m: None if m is None else {k: v for k, v in m})
 
 
-def _convert_dict_to_map_items(s: "PandasSeriesLike") -> "PandasSeriesLike":
+def _is_series_contain_timestamp(s):
+    """
+    checks whether the series contain Timstamp object
+    :param s: pd.series
+    :return:  True if the series contain timestamp object
+    """
+    from numpy import ndarray
+    import datetime
+    from pandas.api.types import is_datetime64tz_dtype, is_datetime64_dtype
+    if not s.empty and isinstance(s.dtype, object) and \
+            isinstance(s.iloc[0], list) and len(s.iloc[0]) > 0 and \
+            isinstance(s.iloc[0][0], datetime.datetime):

Review comment:
    Agreed. However, for arrays with pyarrow I am checking the dtype of the array. But I think that, as you suggested, it would be better to use pyarrow for the type checking.
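
    For illustration, a small sketch of what pyarrow-side checking and flattening could look like (assumed API usage, not code from this PR):

        import datetime
        import pyarrow as pa

        ts = datetime.datetime(2015, 11, 1, 0, 30)
        arr = pa.array([[ts], None], type=pa.list_(pa.timestamp("us")))

        pa.types.is_list(arr.type)                  # True, known without reading values
        pa.types.is_timestamp(arr.type.value_type)  # True
        flat = arr.flatten()                        # flat timestamp array, ready to convert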




[GitHub] [spark] BryanCutler commented on a change in pull request #33980: [SPARK-32285][PYTHON] Add PySpark support for nested timestamps with arrow

Posted by GitBox <gi...@apache.org>.
BryanCutler commented on a change in pull request #33980:
URL: https://github.com/apache/spark/pull/33980#discussion_r745214776



##########
File path: python/pyspark/sql/pandas/types.py
##########
@@ -294,8 +303,10 @@ def _check_series_convert_timestamps_localize(
     from pandas.api.types import is_datetime64tz_dtype, is_datetime64_dtype
     from_tz = from_timezone or _get_local_timezone()
     to_tz = to_timezone or _get_local_timezone()
-    # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
-    if is_datetime64tz_dtype(s.dtype):
+    if datatype == ArrayType.__name__:

Review comment:
    I don't think using the `__name__` attribute is a good way to check for data types; isn't there another way to check that the data type is an ArrayType?

##########
File path: python/pyspark/sql/pandas/types.py
##########
@@ -356,7 +389,27 @@ def _convert_map_items_to_dict(s: "PandasSeriesLike") -> "PandasSeriesLike":
     return s.apply(lambda m: None if m is None else {k: v for k, v in m})
 
 
-def _convert_dict_to_map_items(s: "PandasSeriesLike") -> "PandasSeriesLike":
+def _is_series_contain_timestamp(s):
+    """
+    checks whether the series contain Timstamp object
+    :param s: pd.series
+    :return:  True if the series contain timestamp object
+    """
+    from numpy import ndarray
+    import datetime
+    from pandas.api.types import is_datetime64tz_dtype, is_datetime64_dtype
+    if not s.empty and isinstance(s.dtype, object) and \
+            isinstance(s.iloc[0], list) and len(s.iloc[0]) > 0 and \
+            isinstance(s.iloc[0][0], datetime.datetime):

Review comment:
       I don't think you can rely on checking the first value to determine the data type

##########
File path: python/pyspark/sql/pandas/types.py
##########
@@ -306,8 +317,28 @@ def _check_series_convert_timestamps_localize(
         return s
 
 
+def modify_timestamp_array(data, to_tz: Optional[str], from_tz: Optional[str] = None
+                           , is_utc: bool = False):
+    import pandas as pd
+    if data is None:
+        return [None]

Review comment:
       Wouldn't this change a value in the series from None to an array with one value that is None?
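
    For illustration, the behavior being flagged, as a hypothetical call against the diff above:

        modify_timestamp_array(None, to_tz="UTC")  # returns [None] instead of None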

##########
File path: python/pyspark/sql/tests/test_arrow.py
##########
@@ -133,18 +133,29 @@ def test_toPandas_fallback_enabled(self):
                         user_warns = [
                             warn.message for warn in warns if isinstance(warn.message, UserWarning)]
                         self.assertTrue(len(user_warns) > 0)
-                        self.assertTrue(
-                            "Attempting non-optimization" in str(user_warns[-1]))
-                        assert_frame_equal(pdf, pd.DataFrame({"a": [[ts]]}))
+                        self.assertTrue("Attempting non-optimization" in str(user_warns[-1]))
+                        assert_frame_equal(pdf, pd.DataFrame({"a": [[[ts]]]}))
 
     def test_toPandas_fallback_disabled(self):
-        schema = StructType([StructField("a", ArrayType(TimestampType()), True)])
-        df = self.spark.createDataFrame([(None,)], schema=schema)
+        ts = datetime.datetime(2015, 11, 1, 0, 30)
+        schema = StructType([StructField("a", ArrayType(ArrayType(TimestampType())), True)])
+        df = self.spark.createDataFrame([([[ts]],)], schema=schema)
         with QuietTest(self.sc):
             with self.warnings_lock:
                 with self.assertRaisesRegex(Exception, 'Unsupported type'):
                     df.toPandas()
 
+    def test_toPandas_array_timestamp(self):
+        schema = StructType([
+            StructField("idx", LongType(), True),
+            StructField("timestamp_array", ArrayType(TimestampType()), True)])
+        data = [(0, [datetime.datetime(1969, 1, 1, 1, 1, 1)]),
+                (2, [datetime.datetime(2100, 3, 3, 3, 3, 3)])]

Review comment:
    The data here is a little too simple; there should be multiple values, including nulls.




[GitHub] [spark] pralabhkumar commented on pull request #33980: [SPARK-32285][PYTHON] Add PySpark support for nested timestamps with arrow

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on pull request #33980:
URL: https://github.com/apache/spark/pull/33980#issuecomment-945989173


   @BryanCutler

   It would be really helpful if you could review the PR.

   cc @HyukjinKwon
   


[GitHub] [spark] HyukjinKwon commented on pull request #33980: [WIP][SPARK-32285]Add PySpark support for nested timestamps with arrow

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #33980:
URL: https://github.com/apache/spark/pull/33980#issuecomment-918689130


   @pralabhkumar mind keeping the GitHub PR template as is? (https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE)


[GitHub] [spark] pralabhkumar commented on a change in pull request #33980: [WIP][SPARK-32285][PYTHON] Add PySpark support for nested timestamps with arrow

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on a change in pull request #33980:
URL: https://github.com/apache/spark/pull/33980#discussion_r715799901



##########
File path: python/pyspark/sql/tests/test_arrow.py
##########
@@ -513,12 +531,12 @@ def run_test(num_records, num_parts, max_records, use_delay=False):
                 assert_frame_equal(pdf, pdf_arrow)
 
         cases = [
-            (1024, 512, 2),    # Use large num partitions for more likely collecting out of order
+            (1024, 512, 2),  # Use large num partitions for more likely collecting out of order
             (64, 8, 2, True),  # Use delay in first partition to force collecting out of order
-            (64, 64, 1),       # Test single batch per partition
-            (64, 1, 64),       # Test single partition, single batch
-            (64, 1, 8),        # Test single partition, multiple batches
-            (30, 7, 2),        # Test different sized partitions
+            (64, 64, 1),  # Test single batch per partition
+            (64, 1, 64),  # Test single partition, single batch
+            (64, 1, 8),  # Test single partition, multiple batches
+            (30, 7, 2),  # Test different sized partitions

Review comment:
       done




[GitHub] [spark] pralabhkumar commented on a change in pull request #33980: [WIP][SPARK-32285][PYTHON] Add PySpark support for nested timestamps with arrow

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on a change in pull request #33980:
URL: https://github.com/apache/spark/pull/33980#discussion_r715798363



##########
File path: python/pyspark/sql/pandas/types.py
##########
@@ -20,6 +20,7 @@
 pandas instances during the type conversion.
 """
 
+

Review comment:
       done




[GitHub] [spark] HyukjinKwon commented on pull request #33980: [SPARK-32285][PYTHON] Add PySpark support for nested timestamps with arrow

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #33980:
URL: https://github.com/apache/spark/pull/33980#issuecomment-929838834


   cc @BryanCutler FYI


[GitHub] [spark] AmplabJenkins commented on pull request #33980: [WIP][SPARK-32285]Add PySpark support for nested timestamps with arrow

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33980:
URL: https://github.com/apache/spark/pull/33980#issuecomment-918218825


   Can one of the admins verify this patch?


[GitHub] [spark] pralabhkumar commented on a change in pull request #33980: [SPARK-32285][PYTHON] Add PySpark support for nested timestamps with arrow

Posted by GitBox <gi...@apache.org>.
pralabhkumar commented on a change in pull request #33980:
URL: https://github.com/apache/spark/pull/33980#discussion_r721598693



##########
File path: python/pyspark/sql/pandas/types.py
##########
@@ -253,14 +278,29 @@ def _check_series_convert_timestamps_internal(s, timezone):
         # >>> str(tz.localize(t))
         # '2015-11-01 01:30:00-05:00'
         tz = timezone or _get_local_timezone()
-        return s.dt.tz_localize(tz, ambiguous=False).dt.tz_convert('UTC')
+        data = s.dt.tz_localize(tz, ambiguous=False).dt.tz_convert('UTC')
+        return __modified_series(data, is_array)
     elif is_datetime64tz_dtype(s.dtype):
-        return s.dt.tz_convert('UTC')
+        data = s.dt.tz_convert('UTC')
+        return __modified_series(data, is_array)
     else:
-        return s
+        return __modified_series(s, is_array)
+
+
+def __modified_series(data, is_array):
+    """
+    :param data: Converted data
+    :param is_array: If the input data type is type of array
+    :return: If input type is array ,then return series with array of data
+    else return series as it is.
+    """
+    from pandas import Series
+    if is_array:
+        return Series([data])

Review comment:
    @BryanCutler Removed the logic.




[GitHub] [spark] BryanCutler commented on a change in pull request #33980: [SPARK-32285][PYTHON] Add PySpark support for nested timestamps with arrow

Posted by GitBox <gi...@apache.org>.
BryanCutler commented on a change in pull request #33980:
URL: https://github.com/apache/spark/pull/33980#discussion_r718964803



##########
File path: python/pyspark/sql/pandas/types.py
##########
@@ -296,7 +337,34 @@ def _check_series_convert_timestamps_localize(s, from_timezone, to_timezone):
         return s
 
 
-def _check_series_convert_timestamps_local_tz(s, timezone):
+def __handle_array_of_timestamps(series, to_tz,  from_tz=None):
+    """
+
+    :param series: Pandas series
+    :param to_tz: to timezone
+    :param from_tz: from time zone
+    :return: return series respecting timezone
+    """
+    from pandas.api.types import is_datetime64tz_dtype, is_datetime64_dtype
+    import pandas as pd
+    from pandas import Series
+    data_after_conversion = []
+    for data in series:

Review comment:
       So this is iterating over each timestamp array in the series, applying conversions each time and then building a new series back from a list? That seems pretty inefficient and looks to be a lot of specialized conversion going on here for just arrays of timestamps.

##########
File path: python/pyspark/sql/pandas/types.py
##########
@@ -253,14 +278,29 @@ def _check_series_convert_timestamps_internal(s, timezone):
         # >>> str(tz.localize(t))
         # '2015-11-01 01:30:00-05:00'
         tz = timezone or _get_local_timezone()
-        return s.dt.tz_localize(tz, ambiguous=False).dt.tz_convert('UTC')
+        data = s.dt.tz_localize(tz, ambiguous=False).dt.tz_convert('UTC')
+        return __modified_series(data, is_array)
     elif is_datetime64tz_dtype(s.dtype):
-        return s.dt.tz_convert('UTC')
+        data = s.dt.tz_convert('UTC')
+        return __modified_series(data, is_array)
     else:
-        return s
+        return __modified_series(s, is_array)
+
+
+def __modified_series(data, is_array):
+    """
+    :param data: Converted data
+    :param is_array: If the input data type is type of array
+    :return: If input type is array ,then return series with array of data
+    else return series as it is.
+    """
+    from pandas import Series
+    if is_array:
+        return Series([data])

Review comment:
    Sorry, I don't quite understand what this function is doing; it looks like it's making a Series with one array?
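
    For illustration, a minimal sketch of a more vectorized alternative to the per-element loop (hypothetical helper; note that null rows would come back as [NaT] here and would need extra masking):

        import pandas as pd

        def convert_array_series(s: pd.Series, tz: str) -> pd.Series:
            flat = pd.to_datetime(s.explode())  # one flat datetime64 series
            flat = flat.dt.tz_localize(tz, ambiguous=False).dt.tz_convert("UTC")
            return flat.groupby(level=0).agg(list)  # rebuild the per-row lists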



