You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "jorisvandenbossche (via GitHub)" <gi...@apache.org> on 2023/05/25 08:59:57 UTC

[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

jorisvandenbossche commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1205199261


##########
python/pyarrow/types.pxi:
##########
@@ -1122,6 +1150,19 @@ cdef class DurationType(DataType):
         """
         return timeunit_to_string(self.duration_type.unit())
 
+    def to_pandas_dtype(self):
+        """
+        Return the equivalent NumPy / Pandas dtype.
+
+        Examples
+        --------
+        >>> import pyarrow as pa
+        >>> d = pa.duration('ms')
+        >>> d.to_pandas_dtype()
+        timedelta64[ms]
+        """
+        return _get_pandas_type(_Type_TIMESTAMP, self.unit)

Review Comment:
   Should this use _Type_DURATION instead? 
   (which also raises the question: is this tested? Although I suppose you haven't yet bothered with updating all the tests)



##########
python/pyarrow/types.pxi:
##########
@@ -40,10 +42,20 @@ cdef dict _pandas_type_map = {
     _Type_HALF_FLOAT: np.float16,
     _Type_FLOAT: np.float32,
     _Type_DOUBLE: np.float64,
-    _Type_DATE32: np.dtype('datetime64[ns]'),
-    _Type_DATE64: np.dtype('datetime64[ns]'),
-    _Type_TIMESTAMP: np.dtype('datetime64[ns]'),
-    _Type_DURATION: np.dtype('timedelta64[ns]'),
+    _Type_DATE32: np.dtype('datetime64[D]'),

Review Comment:
   pandas only supports the range of second to nanoseconds, so for dates we should maybe still default to `datetime64[s]`? (otherwise I assume this conversion would happen anyway on the pandas side)



##########
python/pyarrow/array.pxi:
##########
@@ -1674,8 +1682,11 @@ cdef _array_like_to_pandas(obj, options, types_mapper):
     original_type = obj.type
     name = obj._name
 
-    # ARROW-3789(wesm): Convert date/timestamp types to datetime64[ns]
-    c_options.coerce_temporal_nanoseconds = True
+    # ARROW-33321 reenables support for date/timestamp conversion in pandas >= 2.0
+    from pyarrow.vendored.version import Version
+    if pandas_api.loose_version < Version('2.0.0'):
+        # ARROW-3789(wesm): Convert date/timestamp types to datetime64[ns]
+        c_options.coerce_temporal_nanoseconds = True

Review Comment:
   Should this only be set of the argument was _not_ specified by the user? Or are we fine with forcing this in case of pandas<2.0, since that's what we did in the past anyway, and is also the only useful behaviour (letting the user specify `coerce_temporal_nanoseconds=False` in a conversion to pandas with pandas<2.0 doesn't make much sense, I suppose, since then pandas will convert it to nanoseconds anyhow, and the user never sees the non-nanosecond data). So after writing this: I assume the above is fine ;)



##########
python/pyarrow/array.pxi:
##########
@@ -775,6 +776,11 @@ cdef class _PandasConvertible(_Weakrefable):
             expected to return a pandas ExtensionDtype or ``None`` if the
             default conversion should be used for that type. If you have
             a dictionary mapping, you can pass ``dict.get`` as function.
+        coerce_temporal_nanoseconds : bool, default False
+            A legacy option to coerce date32, date64, datetime, and timestamp
+            types to use nanoseconds when converting to pandas. This was the
+            default behavior in pandas version 1.x. In pandas version 2.0,
+            non-nanosecond time units are now supported.

Review Comment:
   We should probably clarify that this keyword is ignored for conversions to pandas with pandas<2.0, and force set to True (regardless of what the user would specify here)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org