Posted to github@arrow.apache.org by "danepitkin (via GitHub)" <gi...@apache.org> on 2023/05/17 22:33:41 UTC

[GitHub] [arrow] danepitkin opened a new pull request, #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

danepitkin opened a new pull request, #35656:
URL: https://github.com/apache/arrow/pull/35656

   Do not coerce temporal types to nanoseconds when pandas >= 2.0 is imported, since pandas now supports non-nanosecond resolutions.
   
   ### Rationale for this change
   
   Pandas 2.0 introduces proper support for non-nanosecond temporal types.
   
   ### Are these changes tested?
   
   Not yet.
   
   ### Are there any user-facing changes?
   
   Yes, pandas conversion behavior will change when users have pandas >= 2.0.
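   
   For illustration, a minimal sketch of the intended behavior (assuming this PR's conversion rules; the exact output depends on the installed pandas version):
   
   ```python
   import pyarrow as pa
   
   table = pa.table({"ts": pa.array([1, 2, 3], type=pa.timestamp("us"))})
   
   # pandas >= 2.0: the Arrow time unit is preserved
   print(table.to_pandas()["ts"].dtype)  # datetime64[us]
   
   # pandas 1.x: still coerced to nanoseconds
   # datetime64[ns]
   ```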


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on PR #35656:
URL: https://github.com/apache/arrow/pull/35656#issuecomment-1561803046

   > Yes, that approach looks good, and is actually simpler than I thought it would be since we already control this with the single option switch (for the code, the tests will indeed get a bit messier).
   > 
   > I think one question is whether we want to make that option public through the `to_pandas` API, so people could still override it to get nanoseconds if they want (to get back the pre-pandas-2.0 behaviour).
   
   I'll expose this! I agree it's best to allow continued use of the legacy behavior for a while.
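   
   A rough sketch of what the exposed option could look like (assuming the `coerce_temporal_nanoseconds` keyword discussed in this PR):
   
   ```python
   import pyarrow as pa
   
   table = pa.table({"ts": pa.array([1, 2, 3], type=pa.timestamp("s"))})
   
   # Default with pandas >= 2.0: unit is preserved -> datetime64[s]
   df = table.to_pandas()
   
   # Legacy pre-pandas-2.0 behavior: force nanoseconds -> datetime64[ns]
   df_legacy = table.to_pandas(coerce_temporal_nanoseconds=True)
   ```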




[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1255367993


##########
python/pyarrow/array.pxi:
##########
@@ -721,12 +722,16 @@ cdef class _PandasConvertible(_Weakrefable):
         integer_object_nulls : bool, default False
             Cast integers with nulls to objects
         date_as_object : bool, default True
-            Cast dates to objects. If False, convert to datetime64[ns] dtype.
+            Cast dates to objects. If False, convert to datetime64 dtype with
+            the equivalent time unit (if supported). Note: in pandas version
+            < 2.0, only datetime64[ns] conversion is supported.
+            dtype.

Review Comment:
   ```suggestion
               < 2.0, only datetime64[ns] conversion is supported.
   ```
   
   (leftover from previous iteration)



##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1313,6 +1312,15 @@ def _test_write_to_dataset_with_partitions(base_path,
     # Partitioned columns become 'categorical' dtypes
     for col in partition_by:
         output_df[col] = output_df[col].astype('category')
+
+    if schema:
+        expected_date_type = schema.field_by_name('date').type.to_pandas_dtype()
+    else:
+        # Arrow to Pandas v2 will convert date32 to [ms]. Pandas v1 will always
+        # silently coerce to [ns] due to non-[ns] support.
+        expected_date_type = 'datetime64[ms]'

Review Comment:
   This comment is not fully correct, I think (when converting the pandas dataframe to pyarrow, we actually don't have date32, but timestamp type). But then I also don't understand how this test is passing...



##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1313,6 +1312,15 @@ def _test_write_to_dataset_with_partitions(base_path,
     # Partitioned columns become 'categorical' dtypes
     for col in partition_by:
         output_df[col] = output_df[col].astype('category')
+
+    if schema:
+        expected_date_type = schema.field_by_name('date').type.to_pandas_dtype()
+    else:
+        # Arrow to Pandas v2 will convert date32 to [ms]. Pandas v1 will always
+        # silently coerce to [ns] due to non-[ns] support.
+        expected_date_type = 'datetime64[ms]'

Review Comment:
   So what actually happens with pandas 2.x: when we create a DataFrame with datetime64[D], that gets converted to datetime64[s] (closest supported resolution to "D"). Then roundtripping to parquet turns that into "ms" (because "s" is not supported by Parquet)
   
   With older pandas this gets converted to datetime64[ns], comes back from Parquet as "us", and is converted back to "ns" when converting to pandas. But this `astype("datetime64[ms]")` essentially doesn't do anything, i.e. pandas preserves the "ns" because it doesn't support "ms", and hence the test also passes for older pandas.
   
   Maybe it's simpler to just test with a DataFrame of nanoseconds, which now works the same with old and new pandas, and then we don't have to add any comment or astype.
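   
   A minimal sketch of the roundtrip described above (assuming pandas >= 2.0 and a hypothetical file path):
   
   ```python
   import numpy as np
   import pandas as pd
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   df = pd.DataFrame({"date": np.arange("2017-01-01", "2017-01-11",
                                         dtype="datetime64[D]")})
   # pandas 2.x stores this as datetime64[s], the closest supported unit to "D"
   print(df["date"].dtype)  # datetime64[s]
   
   pq.write_table(pa.Table.from_pandas(df), "/tmp/dates.parquet")
   # Parquet has no seconds unit, so the roundtrip comes back as milliseconds
   print(pq.read_table("/tmp/dates.parquet").to_pandas()["date"].dtype)
   # datetime64[ms]
   ```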



##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1313,6 +1312,15 @@ def _test_write_to_dataset_with_partitions(base_path,
     # Partitioned columns become 'categorical' dtypes
     for col in partition_by:
         output_df[col] = output_df[col].astype('category')
+
+    if schema:
+        expected_date_type = schema.field_by_name('date').type.to_pandas_dtype()
+    else:
+        # Arrow to Pandas v2 will convert date32 to [ms]. Pandas v1 will always
+        # silently coerce to [ns] due to non-[ns] support.
+        expected_date_type = 'datetime64[ms]'

Review Comment:
   > Maybe it's simpler to just test with a DataFrame of nanoseconds, which now works the same with old and new pandas, and then we don't have to add any comment or astype.
   
   Hmm, trying that out locally fails (but only with the non-legacy code path), and digging in, it seems that we are still writing Parquet v1 files with the dataset API ...
   Will open a separate issue and PR to fix that first.





[GitHub] [arrow] danepitkin commented on pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on PR #35656:
URL: https://github.com/apache/arrow/pull/35656#issuecomment-1612008685

   I'll first try to merge https://github.com/apache/arrow/pull/36314 and rebase afterwards. That way, the issue can be fixed properly here before merging.




[GitHub] [arrow] danepitkin commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1240164828


##########
python/pyarrow/array.pxi:
##########
@@ -721,12 +722,12 @@ cdef class _PandasConvertible(_Weakrefable):
         integer_object_nulls : bool, default False
             Cast integers with nulls to objects
         date_as_object : bool, default True
-            Cast dates to objects. If False, convert to datetime64[ns] dtype.
+            Cast dates to objects. If False, convert to datetime64 dtype.

Review Comment:
   I will add the case for pandas 1.x! For pandas 2.0, the conversion will match the pyarrow time unit, so it could range from `s` to `ns`. The only surprise from the user's perspective is date32, which translates `Day` to `ms`.
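   
   A sketch of the resulting dtype mapping (assuming the defaults discussed in this thread):
   
   ```python
   import pyarrow as pa
   
   pa.timestamp("s").to_pandas_dtype()   # datetime64[s]
   pa.timestamp("us").to_pandas_dtype()  # datetime64[us]
   pa.date64().to_pandas_dtype()         # datetime64[ms]
   pa.date32().to_pandas_dtype()         # datetime64[ms]  <- the "Day to ms" surprise
   ```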





[GitHub] [arrow] danepitkin commented on pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on PR #35656:
URL: https://github.com/apache/arrow/pull/35656#issuecomment-1615143455

   Will wait to rebase on top of https://github.com/apache/arrow/pull/36137 before merging.




[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1230563898


##########
python/pyarrow/array.pxi:
##########
@@ -775,6 +776,13 @@ cdef class _PandasConvertible(_Weakrefable):
             expected to return a pandas ExtensionDtype or ``None`` if the
             default conversion should be used for that type. If you have
             a dictionary mapping, you can pass ``dict.get`` as function.
+        coerce_temporal_nanoseconds : bool, default False
+            Only applicable to pandas version >= 2.0.
+            A legacy option to coerce date32, date64, datetime, and timestamp

Review Comment:
   ```suggestion
               A legacy option to coerce date32, date64, duration, and timestamp
   ```
   
   ?





[GitHub] [arrow] jorisvandenbossche commented on pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #35656:
URL: https://github.com/apache/arrow/pull/35656#issuecomment-1592890513

   The dask test build has a whole bunch of failures (https://github.com/ursacomputing/crossbow/actions/runs/5275817890/jobs/9541737015), but that is because they have tests for reading/writing parquet, and they will have to update their tests, similarly to how we are updating our own parquet tests here.
   
   All failures show something like 
   
   ```
   E       Attribute "dtype" are different
   E       [left]:  datetime64[ns]
   E       [right]: datetime64[us]
   ```
   
   so that is the expected failure.
   
   (although I am just thinking, the proposed change to update parquet to write nanoseconds by default (https://github.com/apache/arrow/issues/35746) would help keep those tests working as before...)




[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1222675409


##########
python/pyarrow/types.pxi:
##########
@@ -40,10 +42,20 @@ cdef dict _pandas_type_map = {
     _Type_HALF_FLOAT: np.float16,
     _Type_FLOAT: np.float32,
     _Type_DOUBLE: np.float64,
-    _Type_DATE32: np.dtype('datetime64[ns]'),
-    _Type_DATE64: np.dtype('datetime64[ns]'),
-    _Type_TIMESTAMP: np.dtype('datetime64[ns]'),
-    _Type_DURATION: np.dtype('timedelta64[ns]'),
+    _Type_DATE32: np.dtype('datetime64[D]'),

Review Comment:
   > I'm somewhat inclined to convert date32 to [ms] by default so we don't have to add a conversion from [ms] -> [s] when doing a parquet roundtrip
   
   Yes, that sounds like a good idea (then it also gives the same result for date32 and date64)





[GitHub] [arrow] danepitkin commented on pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on PR #35656:
URL: https://github.com/apache/arrow/pull/35656#issuecomment-1583147026

   > For the tz aware update, that also influences the (currently untested) to_numpy behaviour, which you can test with the following change:
   > 
   > ```diff
   > --- a/python/pyarrow/tests/test_array.py
   > +++ b/python/pyarrow/tests/test_array.py
   > @@ -211,9 +211,10 @@ def test_to_numpy_writable():
   >          arr.to_numpy(zero_copy_only=True, writable=True)
   >  
   >  
   > +@pytest.mark.parametrize('tz', [None, "UTC"])
   >  @pytest.mark.parametrize('unit', ['s', 'ms', 'us', 'ns'])
   > -def test_to_numpy_datetime64(unit):
   > -    arr = pa.array([1, 2, 3], pa.timestamp(unit))
   > +def test_to_numpy_datetime64(unit, tz):
   > +    arr = pa.array([1, 2, 3], pa.timestamp(unit, tz=tz))
   >      expected = np.array([1, 2, 3], dtype="datetime64[{}]".format(unit))
   >      np_arr = arr.to_numpy()
   >      np.testing.assert_array_equal(np_arr, expected)
   > ```
   
   Thank you! Updated and it passes out of the gate 🎉 




[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1205211356


##########
python/pyarrow/array.pxi:
##########
@@ -1674,8 +1682,11 @@ cdef _array_like_to_pandas(obj, options, types_mapper):
     original_type = obj.type
     name = obj._name
 
-    # ARROW-3789(wesm): Convert date/timestamp types to datetime64[ns]
-    c_options.coerce_temporal_nanoseconds = True
+    # ARROW-33321 reenables support for date/timestamp conversion in pandas >= 2.0
+    from pyarrow.vendored.version import Version
+    if pandas_api.loose_version < Version('2.0.0'):
+        # ARROW-3789(wesm): Convert date/timestamp types to datetime64[ns]
+        c_options.coerce_temporal_nanoseconds = True

Review Comment:
   Should this only be set if the argument was _not_ specified by the user? Or are we fine with forcing this in case of pandas<2.0, since that's what we did in the past anyway, and is also the only useful behaviour (letting the user specify `coerce_temporal_nanoseconds=False` in a conversion to pandas with pandas<2.0 doesn't make much sense, I suppose, since then pandas will convert it to nanoseconds anyhow, and the user never sees the non-nanosecond data). So after writing this: I assume the above is fine ;)





[GitHub] [arrow] danepitkin commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1223193903


##########
python/pyarrow/types.pxi:
##########
@@ -40,10 +42,20 @@ cdef dict _pandas_type_map = {
     _Type_HALF_FLOAT: np.float16,
     _Type_FLOAT: np.float32,
     _Type_DOUBLE: np.float64,
-    _Type_DATE32: np.dtype('datetime64[ns]'),
-    _Type_DATE64: np.dtype('datetime64[ns]'),
-    _Type_TIMESTAMP: np.dtype('datetime64[ns]'),
-    _Type_DURATION: np.dtype('timedelta64[ns]'),
+    _Type_DATE32: np.dtype('datetime64[D]'),

Review Comment:
   The current tests manipulate the types so the test cases pass, but these were the tests that were originally failing (the tests in their current state are a bit of a mess right now; I need to go back and clean them up once the implementation actually appears to work properly):
   ```
   FAILED pyarrow/tests/parquet/test_dataset.py::test_write_to_dataset_with_partitions[True] - AssertionError: Attributes of DataFrame.iloc[:, 4] (column name="date") are different
   FAILED pyarrow/tests/parquet/test_dataset.py::test_write_to_dataset_with_partitions[False] - AssertionError: Attributes of DataFrame.iloc[:, 4] (column name="date") are different
   FAILED pyarrow/tests/parquet/test_dataset.py::test_write_to_dataset_with_partitions_and_schema[True] - AssertionError: Attributes of DataFrame.iloc[:, 4] (column name="date") are different
   FAILED pyarrow/tests/parquet/test_dataset.py::test_write_to_dataset_with_partitions_and_schema[False] - AssertionError: Attributes of DataFrame.iloc[:, 4] (column name="date") are different
   FAILED pyarrow/tests/parquet/test_dataset.py::test_write_to_dataset_with_partitions_and_index_name[True] - AssertionError: Attributes of DataFrame.iloc[:, 4] (column name="date") are different
   FAILED pyarrow/tests/parquet/test_dataset.py::test_write_to_dataset_with_partitions_and_index_name[False] - AssertionError: Attributes of DataFrame.iloc[:, 4] (column name="date") are different
   FAILED pyarrow/tests/parquet/test_dataset.py::test_write_to_dataset_no_partitions[True] - AssertionError: Attributes of DataFrame.iloc[:, 3] (column name="date") are different
   FAILED pyarrow/tests/parquet/test_dataset.py::test_write_to_dataset_no_partitions[False] - AssertionError: Attributes of DataFrame.iloc[:, 3] (column name="date") are different
   ```





[GitHub] [arrow] jorisvandenbossche commented on pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #35656:
URL: https://github.com/apache/arrow/pull/35656#issuecomment-1625524323

   Thanks @danepitkin!




[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1230556348


##########
python/pyarrow/array.pxi:
##########
@@ -721,12 +722,12 @@ cdef class _PandasConvertible(_Weakrefable):
         integer_object_nulls : bool, default False
             Cast integers with nulls to objects
         date_as_object : bool, default True
-            Cast dates to objects. If False, convert to datetime64[ns] dtype.
+            Cast dates to objects. If False, convert to datetime64 dtype.
         timestamp_as_object : bool, default False
             Cast non-nanosecond timestamps (np.datetime64) to objects. This is
             useful if you have timestamps that don't fit in the normal date
             range of nanosecond timestamps (1678 CE-2262 CE).

Review Comment:
   ```suggestion
               range of nanosecond timestamps (1678 CE-2262 CE) with pandas version older than 2.0.
   ```
   
   or something like that. I think it would be useful to clarify that this keyword now is less important / this range limitation is only relevant for older pandas.



##########
python/pyarrow/array.pxi:
##########
@@ -721,12 +722,12 @@ cdef class _PandasConvertible(_Weakrefable):
         integer_object_nulls : bool, default False
             Cast integers with nulls to objects
         date_as_object : bool, default True
-            Cast dates to objects. If False, convert to datetime64[ns] dtype.
+            Cast dates to objects. If False, convert to datetime64 dtype.

Review Comment:
   Do we want to specify this is [ns] for older pandas and [ms] for pandas >= 2.0?



##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -1199,13 +1223,25 @@ def test_table_convert_date_as_object(self):
 
         table = pa.Table.from_pandas(df, preserve_index=False)
 
-        df_datetime = table.to_pandas(date_as_object=False)
+        df_datetime = table.to_pandas(date_as_object=False,
+                                      coerce_temporal_nanoseconds=coerce_to_ns)
         df_object = table.to_pandas()
 
-        tm.assert_frame_equal(df.astype('datetime64[ns]'), df_datetime,
+        tm.assert_frame_equal(df.astype(expected_type), df_datetime,
                               check_dtype=True)
         tm.assert_frame_equal(df, df_object, check_dtype=True)
 
+    def test_table_coerce_temporal_nanoseconds(self):
+        df = pd.DataFrame({'date': [date(2000, 1, 1)]}, dtype='datetime64[ms]')
+        table = pa.Table.from_pandas(df)

Review Comment:
   We should maybe parametrize this test using date32/date64/timestamp? Or is the date32/64 case tested elsewhere?
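   
   Something like this hypothetical parametrization, for example (a sketch only; names and values are illustrative):
   
   ```python
   import pytest
   import pyarrow as pa
   
   @pytest.mark.parametrize("arrow_type",
                            [pa.date32(), pa.date64(), pa.timestamp("ms")])
   def test_coerce_temporal_nanoseconds(arrow_type):
       # 0 == the epoch in each type's native unit
       table = pa.table({"col": pa.array([0], type=arrow_type)})
       df = table.to_pandas(date_as_object=False,
                            coerce_temporal_nanoseconds=True)
       assert str(df["col"].dtype) == "datetime64[ns]"
   ```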



##########
python/pyarrow/src/arrow/python/arrow_to_pandas.cc:
##########
@@ -1569,7 +1572,31 @@ class DatetimeWriter : public TypedPandasWriter<NPY_DATETIME> {
 };
 
 using DatetimeSecondWriter = DatetimeWriter<TimeUnit::SECOND>;
-using DatetimeMilliWriter = DatetimeWriter<TimeUnit::MILLI>;
+
+class DatetimeMilliWriter : public DatetimeWriter<TimeUnit::MILLI> {
+ public:
+  using DatetimeWriter<TimeUnit::MILLI>::DatetimeWriter;
+
+  Status CopyInto(std::shared_ptr<ChunkedArray> data, int64_t rel_placement) override {
+    Type::type type = data->type()->id();
+    int64_t* out_values = this->GetBlockColumnStart(rel_placement);
+    if (type == Type::DATE32) {
+      // Convert from days since epoch to datetime64[ms]
+      ConvertDatetimeLikeNanos<int32_t, 86400000L>(*data, out_values);

Review Comment:
   Ah, yes, when milliseconds is the target this is probably fine. For the nanoseconds case, though, this gives wrong results. Opened https://github.com/apache/arrow/issues/36084 about that.



##########
python/pyarrow/array.pxi:
##########
@@ -775,6 +776,13 @@ cdef class _PandasConvertible(_Weakrefable):
             expected to return a pandas ExtensionDtype or ``None`` if the
             default conversion should be used for that type. If you have
             a dictionary mapping, you can pass ``dict.get`` as function.
+        coerce_temporal_nanoseconds : bool, default False
+            Only applicable to pandas version >= 2.0.
+            A legacy option to coerce date32, date64, datetime, and timestamp

Review Comment:
   ```suggestion
               A legacy option to coerce date32, date64, and timestamp
   ```





[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1205199261


##########
python/pyarrow/types.pxi:
##########
@@ -1122,6 +1150,19 @@ cdef class DurationType(DataType):
         """
         return timeunit_to_string(self.duration_type.unit())
 
+    def to_pandas_dtype(self):
+        """
+        Return the equivalent NumPy / Pandas dtype.
+
+        Examples
+        --------
+        >>> import pyarrow as pa
+        >>> d = pa.duration('ms')
+        >>> d.to_pandas_dtype()
+        timedelta64[ms]
+        """
+        return _get_pandas_type(_Type_TIMESTAMP, self.unit)

Review Comment:
   Should this use _Type_DURATION instead? 
   (which also raises the question: is this tested? Although I suppose you haven't yet bothered with updating all the tests)
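   
   For reference, the behavior the docstring above promises (so presumably what a fix using `_Type_DURATION` should produce):
   
   ```python
   import pyarrow as pa
   
   pa.duration("ms").to_pandas_dtype()  # timedelta64[ms], not datetime64[ms]
   pa.duration("s").to_pandas_dtype()   # timedelta64[s]
   ```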



##########
python/pyarrow/types.pxi:
##########
@@ -40,10 +42,20 @@ cdef dict _pandas_type_map = {
     _Type_HALF_FLOAT: np.float16,
     _Type_FLOAT: np.float32,
     _Type_DOUBLE: np.float64,
-    _Type_DATE32: np.dtype('datetime64[ns]'),
-    _Type_DATE64: np.dtype('datetime64[ns]'),
-    _Type_TIMESTAMP: np.dtype('datetime64[ns]'),
-    _Type_DURATION: np.dtype('timedelta64[ns]'),
+    _Type_DATE32: np.dtype('datetime64[D]'),

Review Comment:
   pandas only supports the range of seconds to nanoseconds, so for dates we should maybe still default to `datetime64[s]`? (otherwise I assume this conversion would happen anyway on the pandas side)
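   
   A quick sketch of what happens on the pandas side with a "D" unit (pandas 2.x only supports "s" through "ns"):
   
   ```python
   import numpy as np
   import pandas as pd
   
   s = pd.Series(np.array(["2017-01-01"], dtype="datetime64[D]"))
   print(s.dtype)  # datetime64[s] -- "D" is upcast to the closest supported unit
   ```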



##########
python/pyarrow/array.pxi:
##########
@@ -1674,8 +1682,11 @@ cdef _array_like_to_pandas(obj, options, types_mapper):
     original_type = obj.type
     name = obj._name
 
-    # ARROW-3789(wesm): Convert date/timestamp types to datetime64[ns]
-    c_options.coerce_temporal_nanoseconds = True
+    # ARROW-33321 reenables support for date/timestamp conversion in pandas >= 2.0
+    from pyarrow.vendored.version import Version
+    if pandas_api.loose_version < Version('2.0.0'):
+        # ARROW-3789(wesm): Convert date/timestamp types to datetime64[ns]
+        c_options.coerce_temporal_nanoseconds = True

Review Comment:
   Should this only be set if the argument was _not_ specified by the user? Or are we fine with forcing this in case of pandas<2.0, since that's what we did in the past anyway, and is also the only useful behaviour (letting the user specify `coerce_temporal_nanoseconds=False` in a conversion to pandas with pandas<2.0 doesn't make much sense, I suppose, since then pandas will convert it to nanoseconds anyhow, and the user never sees the non-nanosecond data). So after writing this: I assume the above is fine ;)



##########
python/pyarrow/array.pxi:
##########
@@ -775,6 +776,11 @@ cdef class _PandasConvertible(_Weakrefable):
             expected to return a pandas ExtensionDtype or ``None`` if the
             default conversion should be used for that type. If you have
             a dictionary mapping, you can pass ``dict.get`` as function.
+        coerce_temporal_nanoseconds : bool, default False
+            A legacy option to coerce date32, date64, datetime, and timestamp
+            types to use nanoseconds when converting to pandas. This was the
+            default behavior in pandas version 1.x. In pandas version 2.0,
+            non-nanosecond time units are now supported.

Review Comment:
   We should probably clarify that this keyword is ignored for conversions to pandas with pandas<2.0, and force set to True (regardless of what the user would specify here)





[GitHub] [arrow] github-actions[bot] commented on pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35656:
URL: https://github.com/apache/arrow/pull/35656#issuecomment-1614816108

   Revision: a65301e6108976b42641130d411f821895147793
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-47d5a116e0](https://github.com/ursacomputing/crossbow/branches/all?query=actions-47d5a116e0)
   
   |Task|Status|
   |----|------|
   |test-conda-python-3.10-hdfs-2.9.2|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-47d5a116e0-github-test-conda-python-3.10-hdfs-2.9.2)](https://github.com/ursacomputing/crossbow/actions/runs/5424492768/jobs/9863926141)|
   |test-conda-python-3.10-hdfs-3.2.1|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-47d5a116e0-github-test-conda-python-3.10-hdfs-3.2.1)](https://github.com/ursacomputing/crossbow/actions/runs/5424492287/jobs/9863924877)|
   |test-conda-python-3.10-pandas-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-47d5a116e0-github-test-conda-python-3.10-pandas-latest)](https://github.com/ursacomputing/crossbow/actions/runs/5424493091/jobs/9863926992)|
   |test-conda-python-3.10-pandas-nightly|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-47d5a116e0-github-test-conda-python-3.10-pandas-nightly)](https://github.com/ursacomputing/crossbow/actions/runs/5424491471/jobs/9863922908)|
   |test-conda-python-3.10-spark-master|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-47d5a116e0-github-test-conda-python-3.10-spark-master)](https://github.com/ursacomputing/crossbow/actions/runs/5424490899/jobs/9863921863)|
   |test-conda-python-3.11-dask-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-47d5a116e0-github-test-conda-python-3.11-dask-latest)](https://github.com/ursacomputing/crossbow/actions/runs/5424493393/jobs/9863927816)|
   |test-conda-python-3.11-dask-upstream_devel|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-47d5a116e0-github-test-conda-python-3.11-dask-upstream_devel)](https://github.com/ursacomputing/crossbow/actions/runs/5424492033/jobs/9863924282)|
   |test-conda-python-3.11-pandas-upstream_devel|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-47d5a116e0-github-test-conda-python-3.11-pandas-upstream_devel)](https://github.com/ursacomputing/crossbow/actions/runs/5424491850/jobs/9863923886)|
   |test-conda-python-3.8-pandas-1.0|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-47d5a116e0-github-test-conda-python-3.8-pandas-1.0)](https://github.com/ursacomputing/crossbow/actions/runs/5424491651/jobs/9863923613)|
   |test-conda-python-3.8-spark-v3.1.2|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-47d5a116e0-github-test-conda-python-3.8-spark-v3.1.2)](https://github.com/ursacomputing/crossbow/actions/runs/5424492556/jobs/9863925522)|
   |test-conda-python-3.9-pandas-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-47d5a116e0-github-test-conda-python-3.9-pandas-latest)](https://github.com/ursacomputing/crossbow/actions/runs/5424490779/jobs/9863921578)|
   |test-conda-python-3.9-spark-v3.2.0|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-47d5a116e0-github-test-conda-python-3.9-spark-v3.2.0)](https://github.com/ursacomputing/crossbow/actions/runs/5424493185/jobs/9863927235)|




[GitHub] [arrow] danepitkin commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1220474875


##########
python/pyarrow/types.pxi:
##########
@@ -40,10 +42,20 @@ cdef dict _pandas_type_map = {
     _Type_HALF_FLOAT: np.float16,
     _Type_FLOAT: np.float32,
     _Type_DOUBLE: np.float64,
-    _Type_DATE32: np.dtype('datetime64[ns]'),
-    _Type_DATE64: np.dtype('datetime64[ns]'),
-    _Type_TIMESTAMP: np.dtype('datetime64[ns]'),
-    _Type_DURATION: np.dtype('timedelta64[ns]'),
+    _Type_DATE32: np.dtype('datetime64[D]'),

Review Comment:
   One thing I found is that Parquet only supports [ms], [us], and [ns]. So now several pyarrow dataset tests are failing because datasets with [D]ay units are being converted to [ms] units. I'm somewhat inclined to convert date32 to [ms] by default so we don't have to add a conversion from [ms] -> [s] when doing a parquet roundtrip. Or we just let it happen and modify the tests.
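   
   A minimal sketch of the unit coercion on a Parquet roundtrip (hypothetical path):
   
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   # Parquet timestamps only support ms/us/ns, so seconds are upcast on write
   t = pa.table({"ts": pa.array([0, 1], type=pa.timestamp("s"))})
   pq.write_table(t, "/tmp/ts.parquet")
   print(pq.read_table("/tmp/ts.parquet").schema.field("ts").type)  # timestamp[ms]
   ```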





[GitHub] [arrow] danepitkin commented on pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on PR #35656:
URL: https://github.com/apache/arrow/pull/35656#issuecomment-1579572612

   Current update: the tests failing locally for me are 1) parquet dataset roundtrips, where date32 days are converted to milliseconds instead of seconds because seconds are not supported in parquet, and 2) all TZ-aware timestamps defaulting to nanoseconds (i.e. I need to add support for other time units in C++).
   
   For (1), I mentioned in another comment that we can convert date32 to milliseconds instead of seconds. For (2), I just need to add support, but it's going to grow this PR even larger, unfortunately.
   




[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1223394291


##########
python/pyarrow/src/arrow/python/arrow_to_pandas.cc:
##########
@@ -1569,7 +1572,31 @@ class DatetimeWriter : public TypedPandasWriter<NPY_DATETIME> {
 };
 
 using DatetimeSecondWriter = DatetimeWriter<TimeUnit::SECOND>;
-using DatetimeMilliWriter = DatetimeWriter<TimeUnit::MILLI>;
+
+class DatetimeMilliWriter : public DatetimeWriter<TimeUnit::MILLI> {
+ public:
+  using DatetimeWriter<TimeUnit::MILLI>::DatetimeWriter;
+
+  Status CopyInto(std::shared_ptr<ChunkedArray> data, int64_t rel_placement) override {
+    Type::type type = data->type()->id();
+    int64_t* out_values = this->GetBlockColumnStart(rel_placement);
+    if (type == Type::DATE32) {
+      // Convert from days since epoch to datetime64[ms]
+      ConvertDatetimeLikeNanos<int32_t, 86400000L>(*data, out_values);
+    } else if (type == Type::DATE64) {
+      ConvertDatetimeLikeNanos<int64_t, 1LL>(*data, out_values);
+    } else {
+      const auto& ts_type = checked_cast<const TimestampType&>(*data->type());
+      DCHECK_EQ(TimeUnit::MILLI, ts_type.unit())
+          << "Should only call instances of this writer "
+          << "with arrays of the correct unit";
+      ConvertNumericNullable<int64_t>(*data, kPandasTimestampNull,
+                                      this->GetBlockColumnStart(rel_placement));

Review Comment:
   ```suggestion
         ConvertNumericNullable<int64_t>(*data, kPandasTimestampNull, out_values);
   ```



##########
python/pyarrow/src/arrow/python/arrow_to_pandas.cc:
##########
@@ -1569,7 +1572,31 @@ class DatetimeWriter : public TypedPandasWriter<NPY_DATETIME> {
 };
 
 using DatetimeSecondWriter = DatetimeWriter<TimeUnit::SECOND>;
-using DatetimeMilliWriter = DatetimeWriter<TimeUnit::MILLI>;
+
+class DatetimeMilliWriter : public DatetimeWriter<TimeUnit::MILLI> {
+ public:
+  using DatetimeWriter<TimeUnit::MILLI>::DatetimeWriter;
+
+  Status CopyInto(std::shared_ptr<ChunkedArray> data, int64_t rel_placement) override {
+    Type::type type = data->type()->id();
+    int64_t* out_values = this->GetBlockColumnStart(rel_placement);
+    if (type == Type::DATE32) {
+      // Convert from days since epoch to datetime64[ms]
+      ConvertDatetimeLikeNanos<int32_t, 86400000L>(*data, out_values);
+    } else if (type == Type::DATE64) {
+      ConvertDatetimeLikeNanos<int64_t, 1LL>(*data, out_values);

Review Comment:
   Can probably use `ConvertNumericNullable` here as well (like in the branch below, that avoids doing the multiplication with 1)



##########
python/pyarrow/tests/parquet/common.py:
##########
@@ -176,8 +176,8 @@ def alltypes_sample(size=10000, seed=0, categorical=False):
         # TODO(wesm): Test other timestamp resolutions now that arrow supports
         # them
         'datetime': np.arange("2016-01-01T00:00:00.001", size,
-                              dtype='datetime64[ms]').astype('datetime64[ns]'),
-        'timedelta': np.arange(0, size, dtype="timedelta64[ns]"),
+                              dtype='datetime64[ms]'),

Review Comment:
   We can maybe keep both original ns and new ms resolution? (to test both)



##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -70,7 +70,7 @@ def _alltypes_example(size=100):
         # TODO(wesm): Pandas only support ns resolution, Arrow supports s, ms,
         # us, ns
         'datetime': np.arange("2016-01-01T00:00:00.001", size,
-                              dtype='datetime64[ms]').astype("datetime64[ns]"),
+                              dtype='datetime64[ms]'),

Review Comment:
   Or also here keep two resolutions?



##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -1202,7 +1211,7 @@ def test_table_convert_date_as_object(self):
         df_datetime = table.to_pandas(date_as_object=False)
         df_object = table.to_pandas()
 
-        tm.assert_frame_equal(df.astype('datetime64[ns]'), df_datetime,
+        tm.assert_frame_equal(df.astype('datetime64[ms]'), df_datetime,

Review Comment:
   Do we have coverage for testing that it stays nanoseconds if you specify `coerce_temporal_nanoseconds=True`?



##########
python/pyarrow/src/arrow/python/arrow_to_pandas.cc:
##########
@@ -1569,7 +1572,31 @@ class DatetimeWriter : public TypedPandasWriter<NPY_DATETIME> {
 };
 
 using DatetimeSecondWriter = DatetimeWriter<TimeUnit::SECOND>;
-using DatetimeMilliWriter = DatetimeWriter<TimeUnit::MILLI>;
+
+class DatetimeMilliWriter : public DatetimeWriter<TimeUnit::MILLI> {
+ public:
+  using DatetimeWriter<TimeUnit::MILLI>::DatetimeWriter;
+
+  Status CopyInto(std::shared_ptr<ChunkedArray> data, int64_t rel_placement) override {
+    Type::type type = data->type()->id();
+    int64_t* out_values = this->GetBlockColumnStart(rel_placement);
+    if (type == Type::DATE32) {
+      // Convert from days since epoch to datetime64[ms]
+      ConvertDatetimeLikeNanos<int32_t, 86400000L>(*data, out_values);

Review Comment:
   I am wondering that if we do such naive multiplication here, we can get overflow errors for out-of-bounds timestamps (but this is already the case for the current code converting to nanoseconds, as well, to be clear)
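   
   A small numeric sketch of why the milliseconds target is safe while nanoseconds is not (int64 arithmetic only):
   
   ```python
   import numpy as np
   
   max_date32 = np.int64(2**31 - 1)  # largest possible days-since-epoch value
   print(max_date32 * 86_400_000)    # ~1.9e17 ms, still fits in int64
   # nanoseconds overflow for any date beyond ~2262 (about 106,752 days):
   print(np.int64(10_000_000) * np.int64(86_400_000_000_000))  # wraps: int64 overflow
   ```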





[GitHub] [arrow] github-actions[bot] commented on pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35656:
URL: https://github.com/apache/arrow/pull/35656#issuecomment-1552170544

   * Closes: #33321




[GitHub] [arrow] jorisvandenbossche commented on pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #35656:
URL: https://github.com/apache/arrow/pull/35656#issuecomment-1582537127

   >  For (2), I just need to add support, but it's going to grow this PR even larger unfortunately..
   
   If PR size is a concern, this is also something that could be done as a precursor. It's actually already an issue that shows in conversion to numpy as well:
   
   ```
   # no timezone -> this preserves the unit
   >>> pa.array([1, 2, 3], pa.timestamp('us')).to_numpy()
   array(['1970-01-01T00:00:00.000001', '1970-01-01T00:00:00.000002',
          '1970-01-01T00:00:00.000003'], dtype='datetime64[us]')
   
   # with timezone -> always converts to nanoseconds
   >>> pa.array([1, 2, 3], pa.timestamp('us', tz="Europe/Brussels")).to_numpy()
   ...
   ArrowInvalid: Needed to copy 1 chunks with 0 nulls, but zero_copy_only was True
   
   >>> pa.array([1, 2, 3], pa.timestamp('us', tz="Europe/Brussels")).to_numpy(zero_copy_only=False)
   array(['1970-01-01T00:00:00.000001000', '1970-01-01T00:00:00.000002000',
          '1970-01-01T00:00:00.000003000'], dtype='datetime64[ns]')
   ```
   
   While this could also be perfectly zero-copy to microseconds in the case with a timezone (we just return the underlying UTC values anyway)




[GitHub] [arrow] danepitkin commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1223418022


##########
python/pyarrow/src/arrow/python/arrow_to_pandas.cc:
##########
@@ -1618,31 +1645,37 @@ class DatetimeNanoWriter : public DatetimeWriter<TimeUnit::NANO> {
   }
 };
 
-class DatetimeTZWriter : public DatetimeNanoWriter {
- public:
-  DatetimeTZWriter(const PandasOptions& options, const std::string& timezone,
-                   int64_t num_rows)
-      : DatetimeNanoWriter(options, num_rows, 1), timezone_(timezone) {}
-
- protected:
-  Status GetResultBlock(PyObject** out) override {
-    RETURN_NOT_OK(MakeBlock1D());
-    *out = block_arr_.obj();
-    return Status::OK();
-  }
-
-  Status AddResultMetadata(PyObject* result) override {
-    PyObject* py_tz = PyUnicode_FromStringAndSize(
-        timezone_.c_str(), static_cast<Py_ssize_t>(timezone_.size()));
-    RETURN_IF_PYERROR();
-    PyDict_SetItemString(result, "timezone", py_tz);
-    Py_DECREF(py_tz);
-    return Status::OK();
-  }
+// TODO (do not merge) how to templatize this..

Review Comment:
   I have a TODO to try something better than a #define before merging



##########
python/pyarrow/tests/parquet/common.py:
##########
@@ -176,8 +176,8 @@ def alltypes_sample(size=10000, seed=0, categorical=False):
         # TODO(wesm): Test other timestamp resolutions now that arrow supports
         # them
         'datetime': np.arange("2016-01-01T00:00:00.001", size,
-                              dtype='datetime64[ms]').astype('datetime64[ns]'),
-        'timedelta': np.arange(0, size, dtype="timedelta64[ns]"),
+                              dtype='datetime64[ms]'),
+        'timedelta': np.arange(0, size, dtype="timedelta64[s]"),

Review Comment:
   Reverting https://github.com/apache/arrow/pull/14460



##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1329,7 +1330,7 @@ def _test_write_to_dataset_no_partitions(base_path,
                               'num': list(range(10)),
                               'date': np.arange('2017-01-01', '2017-01-11',
                                                 dtype='datetime64[D]')})
-    output_df["date"] = output_df["date"].astype('datetime64[ns]')
+    output_df["date"] = output_df["date"].astype('datetime64[ms]')

Review Comment:
   Modifying https://github.com/apache/arrow/pull/14460 to use ms, which is the new default conversion for [D]ays.



##########
python/pyarrow/tests/parquet/test_datetime.py:
##########
@@ -153,7 +154,7 @@ def test_coerce_timestamps_truncated(tempdir):
     df_ms = table_ms.to_pandas()
 
     arrays_expected = {'datetime64': [dt_ms, dt_ms]}
-    df_expected = pd.DataFrame(arrays_expected)
+    df_expected = pd.DataFrame(arrays_expected, dtype='datetime64[ms]')

Review Comment:
   Need to specify expected dtype to match the coercion now. 



##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -1105,7 +1113,8 @@ class MyDatetime(datetime):
         assert isinstance(table[0].chunk(0), pa.TimestampArray)
 
         result = table.to_pandas()
-        expected_df = pd.DataFrame({"datetime": date_array})
+        expected_df = pd.DataFrame(
+            {"datetime": pd.Series(date_array, dtype='datetime64[us]')})

Review Comment:
   Again, convert pandas default `ns` to Arrow default `us`



##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -1202,7 +1211,7 @@ def test_table_convert_date_as_object(self):
         df_datetime = table.to_pandas(date_as_object=False)
         df_object = table.to_pandas()
 
-        tm.assert_frame_equal(df.astype('datetime64[ns]'), df_datetime,
+        tm.assert_frame_equal(df.astype('datetime64[ms]'), df_datetime,

Review Comment:
   Not yet, will add!



##########
python/pyarrow/tests/parquet/test_pandas.py:
##########
@@ -344,7 +344,7 @@ def test_index_column_name_duplicate(tempdir, use_legacy_dataset):
         }
     }
     path = str(tempdir / 'data.parquet')
-    dfx = pd.DataFrame(data).set_index('time', drop=False)
+    dfx = pd.DataFrame(data, dtype='datetime64[us]').set_index('time', drop=False)

Review Comment:
   Pandas defaults to `ns`, but Arrow defaults to `us`



##########
python/pyarrow/tests/parquet/test_datetime.py:
##########
@@ -64,12 +64,13 @@ def test_pandas_parquet_datetime_tz(use_legacy_dataset):
 
     arrow_table = pa.Table.from_pandas(df)
 
-    _write_table(arrow_table, f, coerce_timestamps='ms')
+    _write_table(arrow_table, f)

Review Comment:
   We don't need to coerce anything. In pandas 1.x it gets coerced to `ns` anyway. Now in 2.0, it just causes an error.



##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1458,7 +1459,6 @@ def test_write_to_dataset_with_partitions_and_custom_filenames(
                               'nan': [np.nan] * 10,
                               'date': np.arange('2017-01-01', '2017-01-11',
                                                 dtype='datetime64[D]')})
-    output_df["date"] = output_df["date"].astype('datetime64[ns]')

Review Comment:
   Reverting https://github.com/apache/arrow/pull/14460 - no longer needed



##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -1471,7 +1487,7 @@ def test_timestamp_to_pandas_empty_chunked(self):
         # ARROW-7907 table with chunked array with 0 chunks
         table = pa.table({'a': pa.chunked_array([], type=pa.timestamp('us'))})
         result = table.to_pandas()
-        expected = pd.DataFrame({'a': pd.Series([], dtype="datetime64[ns]")})
+        expected = pd.DataFrame({'a': pd.Series([], dtype="datetime64[us]")})

Review Comment:
   This is the Arrow default time unit



##########
python/pyarrow/tests/parquet/test_datetime.py:
##########
@@ -50,7 +50,7 @@
 @pytest.mark.pandas
 @parametrize_legacy_dataset
 def test_pandas_parquet_datetime_tz(use_legacy_dataset):
-    s = pd.Series([datetime.datetime(2017, 9, 6)])
+    s = pd.Series([datetime.datetime(2017, 9, 6)], dtype='datetime64[us]')

Review Comment:
   `us` is the Arrow default, but `ns` is the pandas default if dtype is not specified.



##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1268,7 +1268,7 @@ def _test_write_to_dataset_with_partitions(base_path,
                               'nan': [np.nan] * 10,
                               'date': np.arange('2017-01-01', '2017-01-11',
                                                 dtype='datetime64[D]')})
-    output_df["date"] = output_df["date"].astype('datetime64[ns]')
+    output_df["date"] = output_df["date"].astype('datetime64[ms]')

Review Comment:
   Modifying https://github.com/apache/arrow/pull/14460 to use ms, which is the new default conversion for [D]ays.



##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -1054,7 +1062,7 @@ def test_python_datetime(self):
 
         result = table.to_pandas()
         expected_df = pd.DataFrame({
-            'datetime': date_array
+            'datetime': pd.Series(date_array, dtype='datetime64[us]')

Review Comment:
   Pandas default is `ns`, but Arrow default is `us`
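   
   A quick illustration of the default-unit mismatch (pandas infers `ns` from Python datetimes, Arrow infers `us`):
   
   ```python
   from datetime import datetime
   import pandas as pd
   import pyarrow as pa
   
   date_array = [datetime(2016, 1, 1), datetime(2016, 1, 2)]
   print(pd.Series(date_array).dtype)  # datetime64[ns] (pandas default)
   print(pa.array(date_array).type)    # timestamp[us]  (Arrow default)
   ```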



##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -1438,7 +1454,7 @@ def test_timestamp_to_pandas_ns(self):
     def test_timestamp_to_pandas_out_of_bounds(self):
         # ARROW-7758 check for out of bounds timestamps for non-ns timestamps
 
-        if Version(pd.__version__) >= Version("2.1.0.dev"):
+        if Version(pd.__version__) < Version("2.1.0.dev"):

Review Comment:
   TODO: I need to verify 1) that this test works and 2) which version to actually run it on



##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -2850,7 +2866,7 @@ def test_strided_data_import(self):
         cases.append(boolean_objects)
 
         cases.append(np.arange("2016-01-01T00:00:00.001", N * K,
-                               dtype='datetime64[ms]').astype("datetime64[ns]")

Review Comment:
   Revert https://github.com/apache/arrow/pull/14460



##########
python/pyarrow/tests/parquet/common.py:
##########
@@ -176,8 +176,8 @@ def alltypes_sample(size=10000, seed=0, categorical=False):
         # TODO(wesm): Test other timestamp resolutions now that arrow supports
         # them
         'datetime': np.arange("2016-01-01T00:00:00.001", size,
-                              dtype='datetime64[ms]').astype('datetime64[ns]'),
-        'timedelta': np.arange(0, size, dtype="timedelta64[ns]"),
+                              dtype='datetime64[ms]'),

Review Comment:
   Quickly enabling it does open up a small can of worms:
   ```
   FAILED pyarrow/tests/parquet/test_data_types.py::test_parquet_2_0_roundtrip[None-True] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
   FAILED pyarrow/tests/parquet/test_data_types.py::test_parquet_2_0_roundtrip[None-False] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
   FAILED pyarrow/tests/parquet/test_data_types.py::test_parquet_2_0_roundtrip[1000-True] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
   FAILED pyarrow/tests/parquet/test_data_types.py::test_parquet_2_0_roundtrip[1000-False] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
   FAILED pyarrow/tests/parquet/test_dataset.py::test_write_to_dataset_with_partitions_and_schema[True] - AssertionError: Attributes of DataFrame.iloc[:, 4] (column name="date") are different
   FAILED pyarrow/tests/parquet/test_dataset.py::test_write_to_dataset_with_partitions_and_schema[False] - AssertionError: Attributes of DataFrame.iloc[:, 4] (column name="date") are different
   FAILED pyarrow/tests/parquet/test_metadata.py::test_parquet_metadata_api - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
   FAILED pyarrow/tests/parquet/test_metadata.py::test_compare_schemas - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
   FAILED pyarrow/tests/parquet/test_pandas.py::test_pandas_parquet_custom_metadata - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
   FAILED pyarrow/tests/parquet/test_pandas.py::test_pandas_parquet_column_multiindex[True] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
   FAILED pyarrow/tests/parquet/test_pandas.py::test_pandas_parquet_column_multiindex[False] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
   FAILED pyarrow/tests/parquet/test_pandas.py::test_pandas_parquet_2_0_roundtrip_read_pandas_no_index_written[True] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
   FAILED pyarrow/tests/parquet/test_pandas.py::test_pandas_parquet_2_0_roundtrip_read_pandas_no_index_written[False] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
   FAILED pyarrow/tests/parquet/test_parquet_file.py::test_iter_batches_columns_reader[300] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
   FAILED pyarrow/tests/parquet/test_parquet_file.py::test_iter_batches_columns_reader[1000] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
   FAILED pyarrow/tests/parquet/test_parquet_file.py::test_iter_batches_columns_reader[1300] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
   FAILED pyarrow/tests/parquet/test_parquet_file.py::test_iter_batches_reader[1000] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
   ```
   
   Maybe better for a follow-up PR?
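
   For context, the failing value has sub-millisecond precision, which is why the default safe cast refuses. A quick standalone illustration (not part of the PR):

   ```python
   import pyarrow as pa

   arr = pa.array([1451606400001000001], pa.timestamp("ns"))

   # The default (safe) cast refuses because the trailing nanoseconds
   # would be silently dropped:
   try:
       arr.cast(pa.timestamp("ms"))
   except pa.ArrowInvalid as exc:
       print(exc)  # Casting ... would lose data: 1451606400001000001

   # An unsafe cast truncates to milliseconds instead:
   print(arr.cast(pa.timestamp("ms"), safe=False))
   ```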



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #35656:
URL: https://github.com/apache/arrow/pull/35656#issuecomment-1625344108

   And with this PR and the Parquet v2.6 update combined, the failures in the dask builds are now much smaller (just one failure that was testing that a timestamp would overflow by being cast to nanoseconds)
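
   That failure mode is easy to reproduce by hand (a standalone illustration):

   ```python
   import pyarrow as pa
   from datetime import datetime

   # datetime64[ns] only covers roughly the years 1677-2262,
   # so a year-1 timestamp cannot be represented after coercion:
   arr = pa.array([datetime(1, 1, 1)], pa.timestamp("s"))
   try:
       arr.cast(pa.timestamp("ns"))
   except pa.ArrowInvalid as exc:
       print(exc)  # ... would result in out of bounds timestamp
   ```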


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1240169931


##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -1199,13 +1223,25 @@ def test_table_convert_date_as_object(self):
 
         table = pa.Table.from_pandas(df, preserve_index=False)
 
-        df_datetime = table.to_pandas(date_as_object=False)
+        df_datetime = table.to_pandas(date_as_object=False,
+                                      coerce_temporal_nanoseconds=coerce_to_ns)
         df_object = table.to_pandas()
 
-        tm.assert_frame_equal(df.astype('datetime64[ns]'), df_datetime,
+        tm.assert_frame_equal(df.astype(expected_type), df_datetime,
                               check_dtype=True)
         tm.assert_frame_equal(df, df_object, check_dtype=True)
 
+    def test_table_coerce_temporal_nanoseconds(self):
+        df = pd.DataFrame({'date': [date(2000, 1, 1)]}, dtype='datetime64[ms]')
+        table = pa.Table.from_pandas(df)

Review Comment:
   It's not, good catch! Will update.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1240332965


##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -1199,13 +1223,25 @@ def test_table_convert_date_as_object(self):
 
         table = pa.Table.from_pandas(df, preserve_index=False)
 
-        df_datetime = table.to_pandas(date_as_object=False)
+        df_datetime = table.to_pandas(date_as_object=False,
+                                      coerce_temporal_nanoseconds=coerce_to_ns)
         df_object = table.to_pandas()
 
-        tm.assert_frame_equal(df.astype('datetime64[ns]'), df_datetime,
+        tm.assert_frame_equal(df.astype(expected_type), df_datetime,
                               check_dtype=True)
         tm.assert_frame_equal(df, df_object, check_dtype=True)
 
+    def test_table_coerce_temporal_nanoseconds(self):

Review Comment:
   Good call! 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1222679059


##########
python/pyarrow/types.pxi:
##########
@@ -40,10 +42,20 @@ cdef dict _pandas_type_map = {
     _Type_HALF_FLOAT: np.float16,
     _Type_FLOAT: np.float32,
     _Type_DOUBLE: np.float64,
-    _Type_DATE32: np.dtype('datetime64[ns]'),
-    _Type_DATE64: np.dtype('datetime64[ns]'),
-    _Type_TIMESTAMP: np.dtype('datetime64[ns]'),
-    _Type_DURATION: np.dtype('timedelta64[ns]'),
+    _Type_DATE32: np.dtype('datetime64[D]'),

Review Comment:
   > several pyarrow dataset tests are failing because datasets with [D]ay units are being converted to [ms] units
   
   Can you point to which test is failing? Because this is about conversion from pyarrow to pandas, right? (not arrow<->parquet roundtrip, which should be able to preserve our date32 type because we store the arrow schema)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #35656:
URL: https://github.com/apache/arrow/pull/35656#issuecomment-1592478751

   @github-actions crossbow submit -g integration


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1243932408


##########
python/pyarrow/tests/parquet/common.py:
##########
@@ -176,8 +176,8 @@ def alltypes_sample(size=10000, seed=0, categorical=False):
         # TODO(wesm): Test other timestamp resolutions now that arrow supports
         # them
         'datetime': np.arange("2016-01-01T00:00:00.001", size,
-                              dtype='datetime64[ms]').astype('datetime64[ns]'),
-        'timedelta': np.arange(0, size, dtype="timedelta64[ns]"),
+                              dtype='datetime64[ms]'),

Review Comment:
   Ok, I am adding and fixing the tests. Just needed to remove the coercion to `ns` in the test cases.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1242896003


##########
python/pyarrow/tests/test_array.py:
##########
@@ -212,8 +212,9 @@ def test_to_numpy_writable():
 
 
 @pytest.mark.parametrize('unit', ['s', 'ms', 'us', 'ns'])
-def test_to_numpy_datetime64(unit):
-    arr = pa.array([1, 2, 3], pa.timestamp(unit))
+@pytest.mark.parametrize('tz', [None, "UTC"])
+def test_to_numpy_datetime64(unit, tz):
+    arr = pa.array([1, 2, 3], pa.timestamp(unit, tz=tz))
     expected = np.array([1, 2, 3], dtype="datetime64[{}]".format(unit))
     np_arr = arr.to_numpy()
     np.testing.assert_array_equal(np_arr, expected)

Review Comment:
   Looks like this is tested in test_pandas.py in `test_timestamps_with_timezone`. It calls the following function on a pandas series, and I confirmed the series is converted into a pyarrow TimestampArray (vs a ChunkedArray):
   
   ```
   def _check_series_roundtrip(s, type_=None, expected_pa_type=None):
       arr = pa.array(s, from_pandas=True, type=type_)
   
       if type_ is not None and expected_pa_type is None:
           expected_pa_type = type_
   
       if expected_pa_type is not None:
           assert arr.type == expected_pa_type
   
       result = pd.Series(arr.to_pandas(), name=s.name)
       tm.assert_series_equal(s, result)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1242896436


##########
python/pyarrow/tests/test_array.py:
##########
@@ -212,8 +212,9 @@ def test_to_numpy_writable():
 
 
 @pytest.mark.parametrize('unit', ['s', 'ms', 'us', 'ns'])
-def test_to_numpy_datetime64(unit):
-    arr = pa.array([1, 2, 3], pa.timestamp(unit))
+@pytest.mark.parametrize('tz', [None, "UTC"])
+def test_to_numpy_datetime64(unit, tz):
+    arr = pa.array([1, 2, 3], pa.timestamp(unit, tz=tz))
     expected = np.array([1, 2, 3], dtype="datetime64[{}]".format(unit))
     np_arr = arr.to_numpy()
     np.testing.assert_array_equal(np_arr, expected)

Review Comment:
   I did add s/ms/us time unit support to the above test case as well!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1212301675


##########
python/pyarrow/types.pxi:
##########
@@ -1122,6 +1150,19 @@ cdef class DurationType(DataType):
         """
         return timeunit_to_string(self.duration_type.unit())
 
+    def to_pandas_dtype(self):
+        """
+        Return the equivalent NumPy / Pandas dtype.
+
+        Examples
+        --------
+        >>> import pyarrow as pa
+        >>> d = pa.duration('ms')
+        >>> d.to_pandas_dtype()
+        timedelta64[ms]
+        """
+        return _get_pandas_type(_Type_TIMESTAMP, self.unit)

Review Comment:
   Great catch! You're correct that this isn't tested yet.



##########
python/pyarrow/types.pxi:
##########
@@ -40,10 +42,20 @@ cdef dict _pandas_type_map = {
     _Type_HALF_FLOAT: np.float16,
     _Type_FLOAT: np.float32,
     _Type_DOUBLE: np.float64,
-    _Type_DATE32: np.dtype('datetime64[ns]'),
-    _Type_DATE64: np.dtype('datetime64[ns]'),
-    _Type_TIMESTAMP: np.dtype('datetime64[ns]'),
-    _Type_DURATION: np.dtype('timedelta64[ns]'),
+    _Type_DATE32: np.dtype('datetime64[D]'),

Review Comment:
   Thank you! NumPy supports the [D]ay unit, but pandas does not.
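
   A quick illustration of the mismatch (the exact unit pandas picks on conversion depends on the pandas version):

   ```python
   import numpy as np
   import pandas as pd

   # NumPy happily represents day-resolution datetimes...
   arr = np.array(["2020-01-01", "2020-01-02"], dtype="datetime64[D]")

   # ...but pandas only supports the s/ms/us/ns units, so the values
   # are converted to a supported unit on the way in (always ns on
   # pandas 1.x):
   s = pd.Series(arr)
   print(arr.dtype, s.dtype)
   ```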



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1257954557


##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -4179,20 +4258,20 @@ def test_to_pandas_extension_dtypes_mapping():
     assert isinstance(result['a'].dtype, pd.PeriodDtype)
 
 
-def test_array_to_pandas():
+@pytest.mark.parametrize("arr",
+                         [pd.period_range("2012-01-01", periods=3, freq="D").array,
+                          pd.interval_range(1, 4).array])

Review Comment:
   Fixed -> https://github.com/apache/arrow/pull/36586



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1220476953


##########
python/pyarrow/types.pxi:
##########
@@ -40,10 +42,20 @@ cdef dict _pandas_type_map = {
     _Type_HALF_FLOAT: np.float16,
     _Type_FLOAT: np.float32,
     _Type_DOUBLE: np.float64,
-    _Type_DATE32: np.dtype('datetime64[ns]'),
-    _Type_DATE64: np.dtype('datetime64[ns]'),
-    _Type_TIMESTAMP: np.dtype('datetime64[ns]'),
-    _Type_DURATION: np.dtype('timedelta64[ns]'),
+    _Type_DATE32: np.dtype('datetime64[D]'),

Review Comment:
   This wasn't a problem before when everything was coerced to [ns], which parquet supports.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1243936015


##########
python/pyarrow/tests/parquet/common.py:
##########
@@ -176,8 +176,8 @@ def alltypes_sample(size=10000, seed=0, categorical=False):
         # TODO(wesm): Test other timestamp resolutions now that arrow supports
         # them
         'datetime': np.arange("2016-01-01T00:00:00.001", size,
-                              dtype='datetime64[ms]').astype('datetime64[ns]'),
-        'timedelta': np.arange(0, size, dtype="timedelta64[ns]"),
+                              dtype='datetime64[ms]'),

Review Comment:
   Good suggestion!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1240315162


##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -1452,26 +1487,27 @@ def test_timestamp_to_pandas_out_of_bounds(self):
 
                 msg = "would result in out of bounds timestamp"
                 with pytest.raises(ValueError, match=msg):
-                    arr.to_pandas()
+                    print(arr.to_pandas(coerce_temporal_nanoseconds=True))

Review Comment:
   🤦 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1240173676


##########
python/pyarrow/array.pxi:
##########
@@ -775,6 +776,13 @@ cdef class _PandasConvertible(_Weakrefable):
             expected to return a pandas ExtensionDtype or ``None`` if the
             default conversion should be used for that type. If you have
             a dictionary mapping, you can pass ``dict.get`` as function.
+        coerce_temporal_nanoseconds : bool, default False
+            Only applicable to pandas version >= 2.0.
+            A legacy option to coerce date32, date64, datetime, and timestamp

Review Comment:
   oof, good catch!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1223491247


##########
python/pyarrow/src/arrow/python/arrow_to_pandas.cc:
##########
@@ -1569,7 +1572,31 @@ class DatetimeWriter : public TypedPandasWriter<NPY_DATETIME> {
 };
 
 using DatetimeSecondWriter = DatetimeWriter<TimeUnit::SECOND>;
-using DatetimeMilliWriter = DatetimeWriter<TimeUnit::MILLI>;
+
+class DatetimeMilliWriter : public DatetimeWriter<TimeUnit::MILLI> {
+ public:
+  using DatetimeWriter<TimeUnit::MILLI>::DatetimeWriter;
+
+  Status CopyInto(std::shared_ptr<ChunkedArray> data, int64_t rel_placement) override {
+    Type::type type = data->type()->id();
+    int64_t* out_values = this->GetBlockColumnStart(rel_placement);
+    if (type == Type::DATE32) {
+      // Convert from days since epoch to datetime64[ms]
+      ConvertDatetimeLikeNanos<int32_t, 86400000L>(*data, out_values);

Review Comment:
   Technically, I think this specific multiplication is fine. INT32_MAX * 86400000 = 1.8554259e+17, while INT64_MAX = 9.223372e+18.
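
   Restating that bound as a quick check:

   ```python
   # days-since-epoch (int32) times ms-per-day cannot overflow int64
   INT32_MAX = 2**31 - 1  # 2147483647
   INT64_MAX = 2**63 - 1  # ~9.223372e18
   assert INT32_MAX * 86_400_000 < INT64_MAX  # ~1.8554e17, comfortably below
   ```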



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on PR #35656:
URL: https://github.com/apache/arrow/pull/35656#issuecomment-1625542199

   Thanks @jorisvandenbossche for the collaboration and support! 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1244142673


##########
python/pyarrow/src/arrow/python/arrow_to_pandas.cc:
##########
@@ -2060,9 +2097,10 @@ static Status GetPandasWriterType(const ChunkedArray& data, const PandasOptions&
     case Type::DATE64:
       if (options.date_as_object) {
         *output_type = PandasWriter::OBJECT;
+      } else if (options.coerce_temporal_nanoseconds) {
+        *output_type = PandasWriter::DATETIME_NANO;
       } else {
-        *output_type = options.coerce_temporal_nanoseconds ? PandasWriter::DATETIME_NANO
-                                                           : PandasWriter::DATETIME_DAY;
+        *output_type = PandasWriter::DATETIME_MILLI;

Review Comment:
   Fixed! I ended up adding a `to_numpy` bool in `PandasOptions`. 
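
   A sketch of the resulting user-facing behaviour (assuming pandas >= 2.0 and default options):

   ```python
   import pyarrow as pa

   arr = pa.array([0, 1], pa.date32())

   # The NumPy path can keep day resolution...
   print(arr.to_numpy(zero_copy_only=False).dtype)   # datetime64[D]

   # ...while the pandas path uses milliseconds, since pandas has no [D] unit:
   print(arr.to_pandas(date_as_object=False).dtype)  # datetime64[ms]
   ```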



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1223272114


##########
python/pyarrow/types.pxi:
##########
@@ -40,10 +42,20 @@ cdef dict _pandas_type_map = {
     _Type_HALF_FLOAT: np.float16,
     _Type_FLOAT: np.float32,
     _Type_DOUBLE: np.float64,
-    _Type_DATE32: np.dtype('datetime64[ns]'),
-    _Type_DATE64: np.dtype('datetime64[ns]'),
-    _Type_TIMESTAMP: np.dtype('datetime64[ns]'),
-    _Type_DURATION: np.dtype('timedelta64[ns]'),
+    _Type_DATE32: np.dtype('datetime64[D]'),

Review Comment:
   I think those test failures are related to the fact that, with our defaults, parquet doesn't support nanoseconds, and we actually don't try to preserve the unit when roundtripping arrow<->parquet:
   
   ```
   In [1]: table = pa.table({"col": pa.array([1, 2, 3], pa.timestamp("s")).cast(pa.timestamp("ns"))})
   
   In [2]: import pyarrow.parquet as pq
   
   In [3]: pq.write_table(table, "test_nanoseconds.parquet")
   
   In [4]: pq.read_table("test_nanoseconds.parquet")
   Out[4]: 
   pyarrow.Table
   col: timestamp[us]
   ----
   col: [[1970-01-01 00:00:01.000000,1970-01-01 00:00:02.000000,1970-01-01 00:00:03.000000]]
   ```
   
   So starting with an arrow table with nanoseconds, the result has microseconds (even though we actually _could_ preserve the original unit, because we store the original arrow schema in the parquet metadata; although that would not be a zero-copy restoration, in contrast to, for example, restoring the timezone or restoring duration from int64, which is done in `ApplyOriginalStorageMetadata`).
   
   So this means that whenever we start with nanoseconds, we get back microseconds after a roundtrip to parquet. And if the roundtrip actually started from pandas using nanoseconds, we now also get microseconds in the pandas result (while before we still got nanoseconds, since we forced that unit in the arrow->pandas conversion step).
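
   For completeness: writing with a newer Parquet format version keeps the nanosecond unit instead of casting down to microseconds (a minimal sketch):

   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq

   table = pa.table({"col": pa.array([1, 2, 3], pa.timestamp("ns"))})

   # Parquet format >= 2.6 has a nanosecond timestamp logical type
   pq.write_table(table, "test_nanoseconds.parquet", version="2.6")
   print(pq.read_table("test_nanoseconds.parquet").schema)  # col: timestamp[ns]
   ```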



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1223520414


##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1313,6 +1312,15 @@ def _test_write_to_dataset_with_partitions(base_path,
     # Partitioned columns become 'categorical' dtypes
     for col in partition_by:
         output_df[col] = output_df[col].astype('category')
+
+    if schema:
+        expected_date_type = schema.field_by_name('date').type.to_pandas_dtype()
+    else:
+        # Arrow to Pandas v2 will convert date32 to [ms]. Pandas v1 will always
+        # silently coerce to [ns] due to non-[ns] support.
+        expected_date_type = 'datetime64[ms]'
+    output_df["date"] = output_df["date"].astype(expected_date_type)

Review Comment:
   There is no issue with reading datasets, I just had to update this test to check whether the resulting pandas df matches the overridden schema types when a schema is applied. There is probably a better way to check than what I've done here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #35656:
URL: https://github.com/apache/arrow/pull/35656#issuecomment-1592881853

   The test build with pandas nightly has one related failure:
   
   ```
    =================================== FAILURES ===================================
   _____ TestConvertDateTimeLikeTypes.test_timestamp_to_pandas_out_of_bounds ______
   
   self = <pyarrow.tests.test_pandas.TestConvertDateTimeLikeTypes object at 0x7f022e57aee0>
   
       def test_timestamp_to_pandas_out_of_bounds(self):
           # ARROW-7758 check for out of bounds timestamps for non-ns timestamps
           # that end up getting coerced into ns timestamps.
       
           for unit in ['s', 'ms', 'us']:
               for tz in [None, 'America/New_York']:
                   arr = pa.array([datetime(1, 1, 1)], pa.timestamp(unit, tz=tz))
                   table = pa.table({'a': arr})
       
                   msg = "would result in out of bounds timestamp"
                   with pytest.raises(ValueError, match=msg):
                       print(arr.to_pandas(coerce_temporal_nanoseconds=True))
       
                   with pytest.raises(ValueError, match=msg):
                       table.to_pandas(coerce_temporal_nanoseconds=True)
       
   >               with pytest.raises(ValueError, match=msg):
   E               Failed: DID NOT RAISE <class 'ValueError'>
   
   opt/conda/envs/arrow/lib/python3.10/site-packages/pyarrow/tests/test_pandas.py:1495: Failed
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1245251210


##########
python/pyarrow/array.pxi:
##########
@@ -721,12 +722,12 @@ cdef class _PandasConvertible(_Weakrefable):
         integer_object_nulls : bool, default False
             Cast integers with nulls to objects
         date_as_object : bool, default True
-            Cast dates to objects. If False, convert to datetime64[ns] dtype.
+            Cast dates to objects. If False, convert to datetime64 dtype.

Review Comment:
   > For pandas 2.0, the conversion will match the pyarrow time unit so it could range from `s` to `ns`
   
   This is for dates (so date32/date64), so isn't that always "ms" unit?
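
   Concretely, under this PR both date types should come out as millisecond datetime64 (a sketch, assuming pandas >= 2.0):

   ```python
   import pyarrow as pa

   # date32 stores days and date64 stores milliseconds, but both map
   # to datetime64[ms] when date_as_object=False:
   for typ in (pa.date32(), pa.date64()):
       print(pa.array([0], typ).to_pandas(date_as_object=False).dtype)
   ```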



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on PR #35656:
URL: https://github.com/apache/arrow/pull/35656#issuecomment-1611753259

   The pandas integration test failure is:
   
   ```
   >               table.column('a').to_pandas(
                       safe=False, coerce_temporal_nanoseconds=True)
   
   opt/conda/envs/arrow/lib/python3.10/site-packages/pyarrow/tests/test_pandas.py:1535: 
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
   pyarrow/array.pxi:868: in pyarrow.lib._PandasConvertible.to_pandas
       ???
   pyarrow/table.pxi:472: in pyarrow.lib.ChunkedArray._to_pandas
       ???
   opt/conda/envs/arrow/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py:866: in __from_arrow__
       array = array.cast(pyarrow.timestamp(unit=self._unit), safe=True)
   pyarrow/table.pxi:556: in pyarrow.lib.ChunkedArray.cast
       ???
   opt/conda/envs/arrow/lib/python3.10/site-packages/pyarrow/compute.py:402: in cast
       return call_function("cast", [arr], options, memory_pool)
   pyarrow/_compute.pyx:572: in pyarrow._compute.call_function
       ???
   pyarrow/_compute.pyx:367: in pyarrow._compute.Function.call
       ???
   pyarrow/error.pxi:144: in pyarrow.lib.pyarrow_internal_check_status
       ???
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
   
   >   ???
   E   pyarrow.lib.ArrowInvalid: Casting from timestamp[s, tz=America/New_York] to timestamp[ns] would result in out of bounds timestamp: -62135596800
   ```
   
   The issue here is that the pyarrow user can specify the `safe=False` option, but pyarrow internally can't pass that option to the pandas `DatetimeTZDtype.__from_arrow__` method: https://github.com/pandas-dev/pandas/blob/main/pandas/core/dtypes/dtypes.py#L897
   
   One option is to either disallow passing options in pyarrow's `to_pandas()` methods for pandas ExtensionDtype objects, or to ignore the `__from_arrow__` method and do the conversion in pyarrow instead.
   
   Another approach is to explicitly add option parameters to `__from_arrow__`, but this adds a higher level of complexity to package integration.
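
   To make the constraint concrete, here is a simplified sketch of the pandas hook (not the actual pandas code): `__from_arrow__` only receives the Arrow data, so there is no channel for pyarrow's `safe=False` to reach the cast.

   ```python
   import pyarrow as pa

   class DatetimeTZDtypeSketch:
       """Simplified stand-in for pandas' DatetimeTZDtype."""

       def __init__(self, unit="ns"):
           self._unit = unit

       def __from_arrow__(self, array):
           # pandas hard-codes a safe cast here; to_pandas(safe=False)
           # cannot influence it through this interface
           return array.cast(pa.timestamp(unit=self._unit), safe=True)
   ```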


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1240319676


##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -1199,13 +1223,25 @@ def test_table_convert_date_as_object(self):
 
         table = pa.Table.from_pandas(df, preserve_index=False)
 
-        df_datetime = table.to_pandas(date_as_object=False)
+        df_datetime = table.to_pandas(date_as_object=False,
+                                      coerce_temporal_nanoseconds=coerce_to_ns)
         df_object = table.to_pandas()
 
-        tm.assert_frame_equal(df.astype('datetime64[ns]'), df_datetime,
+        tm.assert_frame_equal(df.astype(expected_type), df_datetime,
                               check_dtype=True)
         tm.assert_frame_equal(df, df_object, check_dtype=True)
 
+    def test_table_coerce_temporal_nanoseconds(self):
+        df = pd.DataFrame({'date': [date(2000, 1, 1)]}, dtype='datetime64[ms]')
+        table = pa.Table.from_pandas(df)
+        result_df = table.to_pandas(
+            coerce_temporal_nanoseconds=True, date_as_object=False)
+        expected_df = df.astype('datetime64[ns]')
+        tm.assert_frame_equal(result_df, expected_df)
+        result_df = table.to_pandas(
+            coerce_temporal_nanoseconds=False, date_as_object=False)

Review Comment:
   I went ahead and parametrized for dates + timestamps!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on PR #35656:
URL: https://github.com/apache/arrow/pull/35656#issuecomment-1623821030

   @github-actions crossbow submit -g integration


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1257540254


##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -4179,20 +4258,20 @@ def test_to_pandas_extension_dtypes_mapping():
     assert isinstance(result['a'].dtype, pd.PeriodDtype)
 
 
-def test_array_to_pandas():
+@pytest.mark.parametrize("arr",
+                         [pd.period_range("2012-01-01", periods=3, freq="D").array,
+                          pd.interval_range(1, 4).array])

Review Comment:
   @jorisvandenbossche @danepitkin We can't use `pd` here because pandas may not be available.
   This causes an error in a "no pandas" environment: https://github.com/apache/arrow/actions/runs/5496447565/jobs/10016477233
   This PR's CI succeeded because our "Without Pandas" job installed pandas implicitly. That has been fixed by #36542.
   
   Could you open an issue for this and fix it?
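
   For reference, one way to avoid the import-time dependency is to build the pandas objects inside the test rather than in the parametrize list (an illustrative sketch, not necessarily the actual fix in #36586):

   ```python
   import pytest

   @pytest.mark.pandas
   @pytest.mark.parametrize("kind", ["period", "interval"])
   def test_array_to_pandas(kind):
       pd = pytest.importorskip("pandas")
       arr = (pd.period_range("2012-01-01", periods=3, freq="D").array
              if kind == "period" else pd.interval_range(1, 4).array)
       ...  # rest of the test unchanged
   ```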



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on PR #35656:
URL: https://github.com/apache/arrow/pull/35656#issuecomment-1552173251

   I'm looking for early feedback to see if this is the right approach. There are many test cases that will need updating, but I didn't want to tackle them yet in case we take a different approach.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on PR #35656:
URL: https://github.com/apache/arrow/pull/35656#issuecomment-1610256525

   All feedback has been applied! This PR is ready for re-review.
   
   I've added support for passing `coerce_temporal_nanoseconds` to `to_pandas_dtype` internally. I thought it best not to expose this option in the user-facing API since it is legacy default behavior. This does mean we have to check whether we are using an extension type in `to_pandas` and call the extension type's `to_pandas_dtype()` method.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1240318270


##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -1199,13 +1223,25 @@ def test_table_convert_date_as_object(self):
 
         table = pa.Table.from_pandas(df, preserve_index=False)
 
-        df_datetime = table.to_pandas(date_as_object=False)
+        df_datetime = table.to_pandas(date_as_object=False,
+                                      coerce_temporal_nanoseconds=coerce_to_ns)
         df_object = table.to_pandas()
 
-        tm.assert_frame_equal(df.astype('datetime64[ns]'), df_datetime,
+        tm.assert_frame_equal(df.astype(expected_type), df_datetime,
                               check_dtype=True)
         tm.assert_frame_equal(df, df_object, check_dtype=True)
 
+    def test_table_coerce_temporal_nanoseconds(self):
+        df = pd.DataFrame({'date': [date(2000, 1, 1)]}, dtype='datetime64[ms]')
+        table = pa.Table.from_pandas(df)
+        result_df = table.to_pandas(
+            coerce_temporal_nanoseconds=True, date_as_object=False)

Review Comment:
   I went ahead and parametrized for dates + timestamps!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1240426480


##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -67,10 +67,14 @@ def _alltypes_example(size=100):
         'float32': np.arange(size, dtype=np.float32),
         'float64': np.arange(size, dtype=np.float64),
         'bool': np.random.randn(size) > 0,
-        # TODO(wesm): Pandas only support ns resolution, Arrow supports s, ms,
-        # us, ns
-        'datetime': np.arange("2016-01-01T00:00:00.001", size,
-                              dtype='datetime64[ms]').astype("datetime64[ns]"),
+        'datetime[s]': np.arange("2016-01-01T00:00:00.001", size,
+                                 dtype='datetime64[s]'),
+        'datetime[ms]': np.arange("2016-01-01T00:00:00.001", size,
+                                  dtype='datetime64[ms]'),
+        'datetime[us]': np.arange("2016-01-01T00:00:00.001", size,
+                                  dtype='datetime64[us]'),
+        'datetime[ns]': np.arange("2016-01-01T00:00:00.001", size,
+                                  dtype='datetime64[ns]'),

Review Comment:
   Will do!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche merged pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche merged PR #35656:
URL: https://github.com/apache/arrow/pull/35656


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35656:
URL: https://github.com/apache/arrow/pull/35656#issuecomment-1625345337

   Revision: a6487c20bacf1648660f6f0f30cbc023de92dd7b
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-5c8f3f6cad](https://github.com/ursacomputing/crossbow/branches/all?query=actions-5c8f3f6cad)
   
   |Task|Status|
   |----|------|
   |test-conda-python-3.10-hdfs-2.9.2|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-5c8f3f6cad-github-test-conda-python-3.10-hdfs-2.9.2)](https://github.com/ursacomputing/crossbow/actions/runs/5486396312/jobs/9996439304)|
   |test-conda-python-3.10-hdfs-3.2.1|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-5c8f3f6cad-github-test-conda-python-3.10-hdfs-3.2.1)](https://github.com/ursacomputing/crossbow/actions/runs/5486394265/jobs/9996434686)|
   |test-conda-python-3.10-pandas-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-5c8f3f6cad-github-test-conda-python-3.10-pandas-latest)](https://github.com/ursacomputing/crossbow/actions/runs/5486395338/jobs/9996437052)|
   |test-conda-python-3.10-pandas-nightly|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-5c8f3f6cad-github-test-conda-python-3.10-pandas-nightly)](https://github.com/ursacomputing/crossbow/actions/runs/5486396090/jobs/9996438804)|
   |test-conda-python-3.10-spark-master|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-5c8f3f6cad-github-test-conda-python-3.10-spark-master)](https://github.com/ursacomputing/crossbow/actions/runs/5486397111/jobs/9996441260)|
   |test-conda-python-3.11-dask-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-5c8f3f6cad-github-test-conda-python-3.11-dask-latest)](https://github.com/ursacomputing/crossbow/actions/runs/5486396657/jobs/9996440103)|
   |test-conda-python-3.11-dask-upstream_devel|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-5c8f3f6cad-github-test-conda-python-3.11-dask-upstream_devel)](https://github.com/ursacomputing/crossbow/actions/runs/5486395896/jobs/9996438369)|
   |test-conda-python-3.11-pandas-upstream_devel|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-5c8f3f6cad-github-test-conda-python-3.11-pandas-upstream_devel)](https://github.com/ursacomputing/crossbow/actions/runs/5486396815/jobs/9996440532)|
   |test-conda-python-3.8-pandas-1.0|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-5c8f3f6cad-github-test-conda-python-3.8-pandas-1.0)](https://github.com/ursacomputing/crossbow/actions/runs/5486394596/jobs/9996435274)|
   |test-conda-python-3.8-spark-v3.1.2|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-5c8f3f6cad-github-test-conda-python-3.8-spark-v3.1.2)](https://github.com/ursacomputing/crossbow/actions/runs/5486394827/jobs/9996435802)|
   |test-conda-python-3.9-pandas-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-5c8f3f6cad-github-test-conda-python-3.9-pandas-latest)](https://github.com/ursacomputing/crossbow/actions/runs/5486395153/jobs/9996436600)|
   |test-conda-python-3.9-spark-v3.2.0|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-5c8f3f6cad-github-test-conda-python-3.9-spark-v3.2.0)](https://github.com/ursacomputing/crossbow/actions/runs/5486393979/jobs/9996434256)|


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #35656:
URL: https://github.com/apache/arrow/pull/35656#issuecomment-1625342277

   @github-actions crossbow submit -g integration


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35656:
URL: https://github.com/apache/arrow/pull/35656#issuecomment-1623827021

   Revision: 6ffb5e559b7c2a355a4632f7c92b9d1e4b6f5d94
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-40508e3899](https://github.com/ursacomputing/crossbow/branches/all?query=actions-40508e3899)
   
   |Task|Status|
   |----|------|
   |test-conda-python-3.10-hdfs-2.9.2|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-40508e3899-github-test-conda-python-3.10-hdfs-2.9.2)](https://github.com/ursacomputing/crossbow/actions/runs/5476947597/jobs/9975246391)|
   |test-conda-python-3.10-hdfs-3.2.1|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-40508e3899-github-test-conda-python-3.10-hdfs-3.2.1)](https://github.com/ursacomputing/crossbow/actions/runs/5476946753/jobs/9975244066)|
   |test-conda-python-3.10-pandas-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-40508e3899-github-test-conda-python-3.10-pandas-latest)](https://github.com/ursacomputing/crossbow/actions/runs/5476945865/jobs/9975241863)|
   |test-conda-python-3.10-pandas-nightly|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-40508e3899-github-test-conda-python-3.10-pandas-nightly)](https://github.com/ursacomputing/crossbow/actions/runs/5476947752/jobs/9975246957)|
   |test-conda-python-3.10-spark-master|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-40508e3899-github-test-conda-python-3.10-spark-master)](https://github.com/ursacomputing/crossbow/actions/runs/5476946489/jobs/9975243394)|
   |test-conda-python-3.11-dask-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-40508e3899-github-test-conda-python-3.11-dask-latest)](https://github.com/ursacomputing/crossbow/actions/runs/5476945590/jobs/9975241254)|
   |test-conda-python-3.11-dask-upstream_devel|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-40508e3899-github-test-conda-python-3.11-dask-upstream_devel)](https://github.com/ursacomputing/crossbow/actions/runs/5476946222/jobs/9975242622)|
   |test-conda-python-3.11-pandas-upstream_devel|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-40508e3899-github-test-conda-python-3.11-pandas-upstream_devel)](https://github.com/ursacomputing/crossbow/actions/runs/5476947363/jobs/9975245678)|
   |test-conda-python-3.8-pandas-1.0|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-40508e3899-github-test-conda-python-3.8-pandas-1.0)](https://github.com/ursacomputing/crossbow/actions/runs/5476947019/jobs/9975244740)|
   |test-conda-python-3.8-spark-v3.1.2|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-40508e3899-github-test-conda-python-3.8-spark-v3.1.2)](https://github.com/ursacomputing/crossbow/actions/runs/5476949163/jobs/9975250713)|
   |test-conda-python-3.9-pandas-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-40508e3899-github-test-conda-python-3.9-pandas-latest)](https://github.com/ursacomputing/crossbow/actions/runs/5476948531/jobs/9975248887)|
   |test-conda-python-3.9-spark-v3.2.0|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-40508e3899-github-test-conda-python-3.9-spark-v3.2.0)](https://github.com/ursacomputing/crossbow/actions/runs/5476948772/jobs/9975249650)|


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #35656:
URL: https://github.com/apache/arrow/pull/35656#issuecomment-1583106826

   For the tz-aware update, that also influences the (currently untested) to_numpy behaviour, which you can test with the following change:
   
   ```diff
   --- a/python/pyarrow/tests/test_array.py
   +++ b/python/pyarrow/tests/test_array.py
   @@ -211,9 +211,10 @@ def test_to_numpy_writable():
            arr.to_numpy(zero_copy_only=True, writable=True)
    
    
   +@pytest.mark.parametrize('tz', [None, "UTC"])
    @pytest.mark.parametrize('unit', ['s', 'ms', 'us', 'ns'])
   -def test_to_numpy_datetime64(unit):
   -    arr = pa.array([1, 2, 3], pa.timestamp(unit))
   +def test_to_numpy_datetime64(unit, tz):
   +    arr = pa.array([1, 2, 3], pa.timestamp(unit, tz=tz))
        expected = np.array([1, 2, 3], dtype="datetime64[{}]".format(unit))
        np_arr = arr.to_numpy()
        np.testing.assert_array_equal(np_arr, expected)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on PR #35656:
URL: https://github.com/apache/arrow/pull/35656#issuecomment-1614812664

   @github-actions crossbow submit -g integration


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] conbench-apache-arrow[bot] commented on pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "conbench-apache-arrow[bot] (via GitHub)" <gi...@apache.org>.
conbench-apache-arrow[bot] commented on PR #35656:
URL: https://github.com/apache/arrow/pull/35656#issuecomment-1636692283

   Conbench analyzed the 6 benchmark runs on commit `4f56aba3`.
   
   There were 7 benchmark results indicating a performance regression:
   
   - Commit Run on `arm64-t4g-linux-compute` at [2023-07-07 16:29:54Z](http://conbench.ursa.dev/compare/runs/148fbffa139d4f048c9b51e9aec606a5...cfebd818b0de420cba9a28e19a00a742/)
     - [params=num_cols:8/is_partial:0/real_time, source=cpp-micro, suite=arrow-ipc-read-write-benchmark](http://conbench.ursa.dev/compare/benchmarks/064a839012727ce480001a62a3356122...064a83dd9e997314800041fedf9e4eb5)
   
   - Commit Run on `arm64-m6g-linux-compute` at [2023-07-07 16:29:36Z](http://conbench.ursa.dev/compare/runs/da120c9ce3d74b23b01dc175cbe19549...b026b09cd6044f31a6654b066deb45ee/)
     - [params=32768, source=cpp-micro, suite=parquet-encoding-benchmark](http://conbench.ursa.dev/compare/benchmarks/064a837e098177e08000d5a7a89ac45e...064a83dd0427725580009e03bcb9cdb2)
   - and 5 more (see the report linked below)
   
   The [full Conbench report](https://github.com/apache/arrow/runs/15063947042) has more details.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1230607268


##########
python/pyarrow/src/arrow/python/arrow_to_pandas.cc:
##########
@@ -2060,9 +2097,10 @@ static Status GetPandasWriterType(const ChunkedArray& data, const PandasOptions&
     case Type::DATE64:
       if (options.date_as_object) {
         *output_type = PandasWriter::OBJECT;
+      } else if (options.coerce_temporal_nanoseconds) {
+        *output_type = PandasWriter::DATETIME_NANO;
       } else {
-        *output_type = options.coerce_temporal_nanoseconds ? PandasWriter::DATETIME_NANO
-                                                           : PandasWriter::DATETIME_DAY;
+        *output_type = PandasWriter::DATETIME_MILLI;

Review Comment:
   I am just realizing that this is a small regression / behaviour change for `to_numpy`, which also goes through this code path. NumPy does have datetime64[D], so here it can make sense to actually use the "D" unit when converting to a numpy array.
   I am not sure if we want some other option to indicate whether we are converting to numpy or pandas, though.



##########
python/pyarrow/tests/test_array.py:
##########
@@ -212,8 +212,9 @@ def test_to_numpy_writable():
 
 
 @pytest.mark.parametrize('unit', ['s', 'ms', 'us', 'ns'])
-def test_to_numpy_datetime64(unit):
-    arr = pa.array([1, 2, 3], pa.timestamp(unit))
+@pytest.mark.parametrize('tz', [None, "UTC"])
+def test_to_numpy_datetime64(unit, tz):
+    arr = pa.array([1, 2, 3], pa.timestamp(unit, tz=tz))
     expected = np.array([1, 2, 3], dtype="datetime64[{}]".format(unit))
     np_arr = arr.to_numpy()
     np.testing.assert_array_equal(np_arr, expected)

Review Comment:
   Another comment related to this file (but not this specific test): we are testing to_numpy here, but we should also cover the fix to Array.to_pandas to no longer coerce to nanoseconds (since that goes through a different code path at the Cython level compared to Table.to_pandas).
   
   So maybe either repeat this test here with to_pandas(), or add one below where there is already a `test_to_pandas_timezone` test function.
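   
   For example, a sketch of such a companion test (function name assumed; pandas >= 2.0 assumed so that the unit is preserved by default):
   
   ```python
   import numpy as np
   import pandas as pd
   import pandas.testing as tm
   import pyarrow as pa
   import pytest
   
   
   @pytest.mark.parametrize('unit', ['s', 'ms', 'us', 'ns'])
   def test_to_pandas_datetime64(unit):
       # mirrors test_to_numpy_datetime64, but exercises the Array.to_pandas path
       arr = pa.array([1, 2, 3], pa.timestamp(unit))
       expected = pd.Series(np.array([1, 2, 3], dtype="datetime64[{}]".format(unit)))
       tm.assert_series_equal(arr.to_pandas(), expected)
   ```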



##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -1016,13 +1020,17 @@ def test_timestamps_notimezone_nulls(self):
             expected_schema=schema,
         )
 
-    def test_timestamps_with_timezone(self):
+    @pytest.mark.parametrize('unit', ['s', 'ms', 'us', 'ns'])
+    def test_timestamps_with_timezone(self, unit):
+        if Version(pd.__version__) < Version("2.0.0"):
+            # ARROW-3789: Coerce date/timestamp types to datetime64[ns]

Review Comment:
   I suppose this reference to ARROW-3789 was copied from another comment, but I don't think it's very useful here.



##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -1452,26 +1487,27 @@ def test_timestamp_to_pandas_out_of_bounds(self):
 
                 msg = "would result in out of bounds timestamp"
                 with pytest.raises(ValueError, match=msg):
-                    arr.to_pandas()
+                    print(arr.to_pandas(coerce_temporal_nanoseconds=True))

Review Comment:
   ```suggestion
                       arr.to_pandas(coerce_temporal_nanoseconds=True)
   ```



##########
python/pyarrow/tests/test_schema.py:
##########
@@ -45,7 +46,10 @@ def test_type_integers():
 
 
 def test_type_to_pandas_dtype():
-    M8_ns = np.dtype('datetime64[ns]')
+    M8 = np.dtype('datetime64[ms]')
+    if _pandas_api.is_v1():

Review Comment:
   In tests we generally check the actual pandas version, like `Version(pd.__version__) < Version("2.0")` (although this is more convenient in this case, so it's probably fine to keep it here).



##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -1199,13 +1223,25 @@ def test_table_convert_date_as_object(self):
 
         table = pa.Table.from_pandas(df, preserve_index=False)
 
-        df_datetime = table.to_pandas(date_as_object=False)
+        df_datetime = table.to_pandas(date_as_object=False,
+                                      coerce_temporal_nanoseconds=coerce_to_ns)
         df_object = table.to_pandas()
 
-        tm.assert_frame_equal(df.astype('datetime64[ns]'), df_datetime,
+        tm.assert_frame_equal(df.astype(expected_type), df_datetime,
                               check_dtype=True)
         tm.assert_frame_equal(df, df_object, check_dtype=True)
 
+    def test_table_coerce_temporal_nanoseconds(self):
+        df = pd.DataFrame({'date': [date(2000, 1, 1)]}, dtype='datetime64[ms]')
+        table = pa.Table.from_pandas(df)
+        result_df = table.to_pandas(
+            coerce_temporal_nanoseconds=True, date_as_object=False)
+        expected_df = df.astype('datetime64[ns]')
+        tm.assert_frame_equal(result_df, expected_df)
+        result_df = table.to_pandas(
+            coerce_temporal_nanoseconds=False, date_as_object=False)

Review Comment:
   ```suggestion
           result_df = table.to_pandas(coerce_temporal_nanoseconds=False)
   ```



##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -1199,13 +1223,25 @@ def test_table_convert_date_as_object(self):
 
         table = pa.Table.from_pandas(df, preserve_index=False)
 
-        df_datetime = table.to_pandas(date_as_object=False)
+        df_datetime = table.to_pandas(date_as_object=False,
+                                      coerce_temporal_nanoseconds=coerce_to_ns)
         df_object = table.to_pandas()
 
-        tm.assert_frame_equal(df.astype('datetime64[ns]'), df_datetime,
+        tm.assert_frame_equal(df.astype(expected_type), df_datetime,
                               check_dtype=True)
         tm.assert_frame_equal(df, df_object, check_dtype=True)
 
+    def test_table_coerce_temporal_nanoseconds(self):
+        df = pd.DataFrame({'date': [date(2000, 1, 1)]}, dtype='datetime64[ms]')
+        table = pa.Table.from_pandas(df)
+        result_df = table.to_pandas(
+            coerce_temporal_nanoseconds=True, date_as_object=False)

Review Comment:
   ```suggestion
           result_df = table.to_pandas(coerce_temporal_nanoseconds=True)
   ```
   
   We have timestamp types here, so `date_as_object` has no effect (and seeing it would only confuse the reader).



##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -1199,13 +1223,25 @@ def test_table_convert_date_as_object(self):
 
         table = pa.Table.from_pandas(df, preserve_index=False)
 
-        df_datetime = table.to_pandas(date_as_object=False)
+        df_datetime = table.to_pandas(date_as_object=False,
+                                      coerce_temporal_nanoseconds=coerce_to_ns)
         df_object = table.to_pandas()
 
-        tm.assert_frame_equal(df.astype('datetime64[ns]'), df_datetime,
+        tm.assert_frame_equal(df.astype(expected_type), df_datetime,
                               check_dtype=True)
         tm.assert_frame_equal(df, df_object, check_dtype=True)
 
+    def test_table_coerce_temporal_nanoseconds(self):

Review Comment:
   Can you add a similar "test_array_coerce_temporal_nanoseconds"? 
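   
   Something along these lines, for example (a sketch only; the `datetime64[ms]` expectation assumes pandas >= 2.0):
   
   ```python
   from datetime import datetime
   
   import pyarrow as pa
   
   
   def test_array_coerce_temporal_nanoseconds():
       # array-level counterpart of test_table_coerce_temporal_nanoseconds
       arr = pa.array([datetime(2000, 1, 1)], type=pa.timestamp('ms'))
       assert arr.to_pandas(coerce_temporal_nanoseconds=True).dtype == 'datetime64[ns]'
       assert arr.to_pandas(coerce_temporal_nanoseconds=False).dtype == 'datetime64[ms]'
   ```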



##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -67,10 +67,14 @@ def _alltypes_example(size=100):
         'float32': np.arange(size, dtype=np.float32),
         'float64': np.arange(size, dtype=np.float64),
         'bool': np.random.randn(size) > 0,
-        # TODO(wesm): Pandas only support ns resolution, Arrow supports s, ms,
-        # us, ns
-        'datetime': np.arange("2016-01-01T00:00:00.001", size,
-                              dtype='datetime64[ms]').astype("datetime64[ns]"),
+        'datetime[s]': np.arange("2016-01-01T00:00:00.001", size,
+                                 dtype='datetime64[s]'),
+        'datetime[ms]': np.arange("2016-01-01T00:00:00.001", size,
+                                  dtype='datetime64[ms]'),
+        'datetime[us]': np.arange("2016-01-01T00:00:00.001", size,
+                                  dtype='datetime64[us]'),
+        'datetime[ns]': np.arange("2016-01-01T00:00:00.001", size,
+                                  dtype='datetime64[ns]'),

Review Comment:
   I know this was not there before, but can you also add timedelta columns? For timedelta64 (duration on the pyarrow side) as well, pandas now supports multiple resolutions, and we now preserve the resolution on conversion to pandas (with pandas >= 2.0).
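   
   Roughly like this, mirroring the datetime columns above (key names assumed):
   
   ```python
   import numpy as np
   
   size = 100
   
   # one duration column per resolution that pandas >= 2.0 can represent
   timedelta_columns = {
       'timedelta[s]': np.arange(size).astype('timedelta64[s]'),
       'timedelta[ms]': np.arange(size).astype('timedelta64[ms]'),
       'timedelta[us]': np.arange(size).astype('timedelta64[us]'),
       'timedelta[ns]': np.arange(size).astype('timedelta64[ns]'),
   }
   ```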



##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -1171,26 +1187,34 @@ def test_array_types_date_as_object(self):
                 None,
                 date(1970, 1, 1),
                 date(2040, 2, 26)]
-        expected_d = np.array(['2000-01-01', None, '1970-01-01',
-                               '2040-02-26'], dtype='datetime64[D]')
+        expected_days = np.array(['2000-01-01', None, '1970-01-01',
+                                  '2040-02-26'], dtype='datetime64[D]')
 
-        expected_ns = np.array(['2000-01-01', None, '1970-01-01',
-                                '2040-02-26'], dtype='datetime64[ns]')
+        expected_dtype = 'datetime64[ms]'
+        if Version(pd.__version__) < Version("2.0.0"):
+            # ARROW-3789: Coerce date/timestamp types to datetime64[ns]
+            expected_dtype = 'datetime64[ns]'
+
+        expected = np.array(['2000-01-01', None, '1970-01-01',
+                             '2040-02-26'], dtype=expected_dtype)
 
         objects = [pa.array(data),
                    pa.chunked_array([data])]
 
         for obj in objects:
             result = obj.to_pandas()
-            expected_obj = expected_d.astype(object)
+            expected_obj = expected_days.astype(object)
             assert result.dtype == expected_obj.dtype
             npt.assert_array_equal(result, expected_obj)
 
             result = obj.to_pandas(date_as_object=False)
-            assert result.dtype == expected_ns.dtype
-            npt.assert_array_equal(result, expected_ns)
+            assert result.dtype == expected.dtype
+            npt.assert_array_equal(result, expected)
 
-    def test_table_convert_date_as_object(self):
+    @pytest.mark.parametrize("coerce_to_ns,expected_type",
+                             [(False, 'datetime64[ms]'),
+                              (True, 'datetime64[ns]')])

Review Comment:
   Can you add this parametrization also to the test above? (for array.to_pandas)
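   
   I.e. something like this as a standalone variant (sketch; test name assumed, and the `datetime64[ms]` case assumes pandas >= 2.0):
   
   ```python
   from datetime import date
   
   import pyarrow as pa
   import pytest
   
   
   @pytest.mark.parametrize("coerce_to_ns,expected_dtype",
                            [(False, 'datetime64[ms]'),
                             (True, 'datetime64[ns]')])
   def test_array_date_coerce_temporal_nanoseconds(coerce_to_ns, expected_dtype):
       arr = pa.array([date(2000, 1, 1), None])  # inferred as date32
       result = arr.to_pandas(date_as_object=False,
                              coerce_temporal_nanoseconds=coerce_to_ns)
       assert result.dtype == expected_dtype
   ```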



##########
python/pyarrow/types.pxi:
##########
@@ -23,6 +23,8 @@ import re
 import sys
 import warnings
 
+from pyarrow.lib import _pandas_api

Review Comment:
   I think this import shouldn't be needed, as we essentially are "in" pyarrow.lib? (types.pxi gets included in lib.pyx)



##########
python/pyarrow/tests/test_array.py:
##########
@@ -212,8 +212,9 @@ def test_to_numpy_writable():
 
 
 @pytest.mark.parametrize('unit', ['s', 'ms', 'us', 'ns'])
-def test_to_numpy_datetime64(unit):
-    arr = pa.array([1, 2, 3], pa.timestamp(unit))
+@pytest.mark.parametrize('tz', [None, "UTC"])
+def test_to_numpy_datetime64(unit, tz):
+    arr = pa.array([1, 2, 3], pa.timestamp(unit, tz=tz))
     expected = np.array([1, 2, 3], dtype="datetime64[{}]".format(unit))
     np_arr = arr.to_numpy()
     np.testing.assert_array_equal(np_arr, expected)

Review Comment:
   Although maybe you already covered this in test_pandas.py? While many of the tests in `TestConvertDateTimeLikeTypes` use a table for the roundtrip testing, I see there are also a bunch that use arrays towards the end of the test class.



##########
python/pyarrow/types.pxi:
##########
@@ -115,6 +128,21 @@ def _is_primitive(Type type):
     return is_primitive(type)
 
 
+def _get_pandas_type(type, unit=None):
+    if type not in _pandas_type_map:
+        return None
+    if type in [_Type_DATE32, _Type_DATE64, _Type_TIMESTAMP, _Type_DURATION]:
+        if _pandas_api.is_v1():

Review Comment:
   Nitpick: maybe switch the two checks (assuming `is_v1` is cheaper), or put them on a single line as `if _pandas_api.is_v1() and type in ..` (although it probably won't fit on a single line anyway).
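   
   I.e. just the shape of the suggestion (sketch; body elided):
   
   ```python
   def _get_pandas_type(type, unit=None):
       if type not in _pandas_type_map:
           return None
       # cheap pandas-version check first, then the type membership test
       if _pandas_api.is_v1() and type in [
               _Type_DATE32, _Type_DATE64, _Type_TIMESTAMP, _Type_DURATION]:
           ...
   ```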



##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -1437,13 +1478,7 @@ def test_timestamp_to_pandas_ns(self):
 
     def test_timestamp_to_pandas_out_of_bounds(self):
         # ARROW-7758 check for out of bounds timestamps for non-ns timestamps
-
-        if Version(pd.__version__) >= Version("2.1.0.dev"):
-            # GH-35235: test fail due to __from_pyarrow__ being added to pandas
-            # https://github.com/pandas-dev/pandas/pull/52201
-            # Needs: https://github.com/apache/arrow/issues/33321
-            pytest.skip(
-                "Need support converting to non-nano datetime64 for pandas >= 2.0")

Review Comment:
   So this still seems to fail on pandas nightly (https://github.com/apache/arrow/pull/35656#issuecomment-1592881853)





[GitHub] [arrow] danepitkin commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1223485119


##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -70,7 +70,7 @@ def _alltypes_example(size=100):
         # TODO(wesm): Pandas only support ns resolution, Arrow supports s, ms,
         # us, ns
         'datetime': np.arange("2016-01-01T00:00:00.001", size,
-                              dtype='datetime64[ms]').astype("datetime64[ns]"),
+                              dtype='datetime64[ms]'),

Review Comment:
   Added s, ms, us!





[GitHub] [arrow] github-actions[bot] commented on pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35656:
URL: https://github.com/apache/arrow/pull/35656#issuecomment-1592482473

   Revision: c74c67f3a5d9a516e55a4227b24aae908c388bae
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-3f3936e262](https://github.com/ursacomputing/crossbow/branches/all?query=actions-3f3936e262)
   
   |Task|Status|
   |----|------|
   |test-conda-python-3.10-hdfs-2.9.2|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-3f3936e262-github-test-conda-python-3.10-hdfs-2.9.2)](https://github.com/ursacomputing/crossbow/actions/runs/5275818067/jobs/9541737508)|
   |test-conda-python-3.10-hdfs-3.2.1|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-3f3936e262-github-test-conda-python-3.10-hdfs-3.2.1)](https://github.com/ursacomputing/crossbow/actions/runs/5275815949/jobs/9541732093)|
   |test-conda-python-3.10-pandas-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-3f3936e262-github-test-conda-python-3.10-pandas-latest)](https://github.com/ursacomputing/crossbow/actions/runs/5275818302/jobs/9541738176)|
   |test-conda-python-3.10-pandas-nightly|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-3f3936e262-github-test-conda-python-3.10-pandas-nightly)](https://github.com/ursacomputing/crossbow/actions/runs/5275817544/jobs/9541736059)|
   |test-conda-python-3.11-dask-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-3f3936e262-github-test-conda-python-3.11-dask-latest)](https://github.com/ursacomputing/crossbow/actions/runs/5275817890/jobs/9541737015)|
   |test-conda-python-3.11-dask-upstream_devel|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-3f3936e262-github-test-conda-python-3.11-dask-upstream_devel)](https://github.com/ursacomputing/crossbow/actions/runs/5275817303/jobs/9541735457)|
   |test-conda-python-3.11-pandas-upstream_devel|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-3f3936e262-github-test-conda-python-3.11-pandas-upstream_devel)](https://github.com/ursacomputing/crossbow/actions/runs/5275816375/jobs/9541733023)|
   |test-conda-python-3.7-spark-v3.1.2|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-3f3936e262-github-test-conda-python-3.7-spark-v3.1.2)](https://github.com/ursacomputing/crossbow/actions/runs/5275818381/jobs/9541738466)|
   |test-conda-python-3.8-pandas-1.0|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-3f3936e262-github-test-conda-python-3.8-pandas-1.0)](https://github.com/ursacomputing/crossbow/actions/runs/5275817056/jobs/9541734813)|
   |test-conda-python-3.8-spark-v3.2.0|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-3f3936e262-github-test-conda-python-3.8-spark-v3.2.0)](https://github.com/ursacomputing/crossbow/actions/runs/5275817742/jobs/9541736623)|
   |test-conda-python-3.9-pandas-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-3f3936e262-github-test-conda-python-3.9-pandas-latest)](https://github.com/ursacomputing/crossbow/actions/runs/5275816605/jobs/9541733665)|
   |test-conda-python-3.9-spark-master|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-3f3936e262-github-test-conda-python-3.9-spark-master)](https://github.com/ursacomputing/crossbow/actions/runs/5275816188/jobs/9541732548)|




[GitHub] [arrow] danepitkin commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1240418310


##########
python/pyarrow/tests/test_schema.py:
##########
@@ -45,7 +46,10 @@ def test_type_integers():
 
 
 def test_type_to_pandas_dtype():
-    M8_ns = np.dtype('datetime64[ns]')
+    M8 = np.dtype('datetime64[ms]')
+    if _pandas_api.is_v1():

Review Comment:
   I'll go ahead and update. I think this is the better approach!





[GitHub] [arrow] jorisvandenbossche commented on pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #35656:
URL: https://github.com/apache/arrow/pull/35656#issuecomment-1623659400

   @danepitkin I merged the Parquet version 2.6 PR https://github.com/apache/arrow/pull/36137, so this can be updated now




[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1255736346


##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1313,6 +1312,15 @@ def _test_write_to_dataset_with_partitions(base_path,
     # Partitioned columns become 'categorical' dtypes
     for col in partition_by:
         output_df[col] = output_df[col].astype('category')
+
+    if schema:
+        expected_date_type = schema.field_by_name('date').type.to_pandas_dtype()
+    else:
+        # Arrow to Pandas v2 will convert date32 to [ms]. Pandas v1 will always
+        # silently coerce to [ns] due to non-[ns] support.
+        expected_date_type = 'datetime64[ms]'

Review Comment:
   -> https://github.com/apache/arrow/pull/36538





[GitHub] [arrow] danepitkin commented on pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on PR #35656:
URL: https://github.com/apache/arrow/pull/35656#issuecomment-1608437189

   TODO: 
   * Fix Numpy date32 regression
   * Fix Pandas nightly test failure
   
   (all other comments are addressed to the best of my knowledge)




[GitHub] [arrow] danepitkin commented on pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on PR #35656:
URL: https://github.com/apache/arrow/pull/35656#issuecomment-1610260446

   @github-actions crossbow submit -g integration




[GitHub] [arrow] github-actions[bot] commented on pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35656:
URL: https://github.com/apache/arrow/pull/35656#issuecomment-1610263248

   Revision: 641cd2ead27f0c3b2b6946521ac7a62e299ab682
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-db676ddfe3](https://github.com/ursacomputing/crossbow/branches/all?query=actions-db676ddfe3)
   
   |Task|Status|
   |----|------|
   |test-conda-python-3.10-hdfs-2.9.2|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-db676ddfe3-github-test-conda-python-3.10-hdfs-2.9.2)](https://github.com/ursacomputing/crossbow/actions/runs/5394901882/jobs/9796713753)|
   |test-conda-python-3.10-hdfs-3.2.1|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-db676ddfe3-github-test-conda-python-3.10-hdfs-3.2.1)](https://github.com/ursacomputing/crossbow/actions/runs/5394902096/jobs/9796714257)|
   |test-conda-python-3.10-pandas-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-db676ddfe3-github-test-conda-python-3.10-pandas-latest)](https://github.com/ursacomputing/crossbow/actions/runs/5394902773/jobs/9796715909)|
   |test-conda-python-3.10-pandas-nightly|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-db676ddfe3-github-test-conda-python-3.10-pandas-nightly)](https://github.com/ursacomputing/crossbow/actions/runs/5394902616/jobs/9796715480)|
   |test-conda-python-3.10-spark-master|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-db676ddfe3-github-test-conda-python-3.10-spark-master)](https://github.com/ursacomputing/crossbow/actions/runs/5394901539/jobs/9796712929)|
   |test-conda-python-3.11-dask-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-db676ddfe3-github-test-conda-python-3.11-dask-latest)](https://github.com/ursacomputing/crossbow/actions/runs/5394901109/jobs/9796712140)|
   |test-conda-python-3.11-dask-upstream_devel|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-db676ddfe3-github-test-conda-python-3.11-dask-upstream_devel)](https://github.com/ursacomputing/crossbow/actions/runs/5394901335/jobs/9796712519)|
   |test-conda-python-3.11-pandas-upstream_devel|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-db676ddfe3-github-test-conda-python-3.11-pandas-upstream_devel)](https://github.com/ursacomputing/crossbow/actions/runs/5394903091/jobs/9796716760)|
   |test-conda-python-3.8-pandas-1.0|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-db676ddfe3-github-test-conda-python-3.8-pandas-1.0)](https://github.com/ursacomputing/crossbow/actions/runs/5394902453/jobs/9796715029)|
   |test-conda-python-3.8-spark-v3.1.2|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-db676ddfe3-github-test-conda-python-3.8-spark-v3.1.2)](https://github.com/ursacomputing/crossbow/actions/runs/5394902941/jobs/9796716328)|
   |test-conda-python-3.9-pandas-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-db676ddfe3-github-test-conda-python-3.9-pandas-latest)](https://github.com/ursacomputing/crossbow/actions/runs/5394903191/jobs/9796717068)|
   |test-conda-python-3.9-spark-v3.2.0|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-db676ddfe3-github-test-conda-python-3.9-spark-v3.2.0)](https://github.com/ursacomputing/crossbow/actions/runs/5394902292/jobs/9796714712)|




[GitHub] [arrow] danepitkin commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1223194680


##########
python/pyarrow/types.pxi:
##########
@@ -40,10 +42,20 @@ cdef dict _pandas_type_map = {
     _Type_HALF_FLOAT: np.float16,
     _Type_FLOAT: np.float32,
     _Type_DOUBLE: np.float64,
-    _Type_DATE32: np.dtype('datetime64[ns]'),
-    _Type_DATE64: np.dtype('datetime64[ns]'),
-    _Type_TIMESTAMP: np.dtype('datetime64[ns]'),
-    _Type_DURATION: np.dtype('timedelta64[ns]'),
+    _Type_DATE32: np.dtype('datetime64[D]'),

Review Comment:
   ```
       @pytest.mark.filterwarnings("ignore:'ParquetDataset.schema:FutureWarning")
       def _test_write_to_dataset_with_partitions(base_path,
                                                  use_legacy_dataset=True,
                                                  filesystem=None,
                                                  schema=None,
                                                  index_name=None):
           import pandas as pd
           import pandas.testing as tm
   
           import pyarrow.parquet as pq
   
           # ARROW-1400
           output_df = pd.DataFrame({'group1': list('aaabbbbccc'),
                                     'group2': list('eefeffgeee'),
                                     'num': list(range(10)),
                                     'nan': [np.nan] * 10,
                                     'date': np.arange('2017-01-01', '2017-01-11',
                                                       dtype='datetime64[D]')})
           output_df["date"] = output_df["date"]
           cols = output_df.columns.tolist()
           partition_by = ['group1', 'group2']
           output_table = pa.Table.from_pandas(output_df, schema=schema, safe=False,
                                               preserve_index=False)
           pq.write_to_dataset(output_table, base_path, partition_by,
                               filesystem=filesystem,
                               use_legacy_dataset=use_legacy_dataset)
   
           metadata_path = os.path.join(str(base_path), '_common_metadata')
   
           if filesystem is not None:
               with filesystem.open(metadata_path, 'wb') as f:
                   pq.write_metadata(output_table.schema, f)
           else:
               pq.write_metadata(output_table.schema, metadata_path)
   
           # ARROW-2891: Ensure the output_schema is preserved when writing a
           # partitioned dataset
           dataset = pq.ParquetDataset(base_path,
                                       filesystem=filesystem,
                                       validate_schema=True,
                                       use_legacy_dataset=use_legacy_dataset)
           # ARROW-2209: Ensure the dataset schema also includes the partition columns
           if use_legacy_dataset:
               with pytest.warns(FutureWarning, match="'ParquetDataset.schema'"):
                   dataset_cols = set(dataset.schema.to_arrow_schema().names)
           else:
               # NB schema property is an arrow and not parquet schema
               dataset_cols = set(dataset.schema.names)
   
           assert dataset_cols == set(output_table.schema.names)
   
           input_table = dataset.read(use_pandas_metadata=True)
   
           input_df = input_table.to_pandas()
   
           # Read data back in and compare with original DataFrame
           # Partitioned columns added to the end of the DataFrame when read
           input_df_cols = input_df.columns.tolist()
           assert partition_by == input_df_cols[-1 * len(partition_by):]
   
           input_df = input_df[cols]
           # Partitioned columns become 'categorical' dtypes
           for col in partition_by:
               output_df[col] = output_df[col].astype('category')
           # if schema is None and Version(pd.__version__) >= Version("2.0.0"):
           #     output_df['date'] = output_df['date'].astype('datetime64[ms]')
   >       tm.assert_frame_equal(output_df, input_df)
   E       AssertionError: Attributes of DataFrame.iloc[:, 4] (column name="date") are different
   E
   E       Attribute "dtype" are different
   E       [left]:  datetime64[s]
   E       [right]: datetime64[ms]
   ```





[GitHub] [arrow] danepitkin commented on pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on PR #35656:
URL: https://github.com/apache/arrow/pull/35656#issuecomment-1584832895

   Should we run crossbow on this change too? IIRC only committers can trigger it. Does it run the same tests as the nightly builds?

