You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/19 16:19:00 UTC

[GitHub] [arrow] amol- opened a new pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

amol- opened a new pull request #10101:
URL: https://github.com/apache/arrow/pull/10101


   https://issues.apache.org/jira/browse/ARROW-9594


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] amol- commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

amol- commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r621178469



##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -688,6 +688,69 @@ def test_dictionary_from_numpy():
             assert d2[i].as_py() == dictionary[indices[i]]
 
 
+def test_dictionary_to_numpy():
+    expected = pa.array(
+        ["foo", "bar", None, "foo"]
+    ).to_numpy(zero_copy_only=False)
+    a = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, None, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    assert (a.to_numpy(zero_copy_only=False) == expected).all()
+
+    with pytest.raises(pa.ArrowInvalid):
+        # If this would be changed to no longer raise in the future,
+        # ensure to test the actual result because, currently, to_numpy takes
+        # for granted that when zero_copy_only=True there will be no nulls
+        # (it's the decoding of the DictionaryArray that handles the nulls and
+        # this is only activated with zero_copy_only=False)
+        a.to_numpy(zero_copy_only=True)
+
+    anonulls = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, 1, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    expected = pa.array(
+        ["foo", "bar", "bar", "foo"]
+    ).to_numpy(zero_copy_only=False)
+    assert (anonulls.to_numpy(zero_copy_only=False) == expected).all()
+
+    with pytest.raises(pa.ArrowInvalid):
+        anonulls.to_numpy(zero_copy_only=True)
+
+    afloat = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, 1, 0]),
+        pa.array([13.7, 11.0])
+    )
+    expected = pa.array([13.7, 11.0, 11.0, 13.7]).to_numpy()
+    assert (afloat.to_numpy(zero_copy_only=True) == expected).all()
+    assert (afloat.to_numpy(zero_copy_only=False) == expected).all()
+
+    afloat2 = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, None, 0]),
+        pa.array([13.7, 11.0])
+    )
+    expected = pa.array(
+        [13.7, 11.0, None, 13.7]
+    ).to_numpy(zero_copy_only=False)
+    assert np.array_equal(
+        afloat2.to_numpy(zero_copy_only=False),
+        expected,
+        equal_nan=True
+    )
+
+    aints = pa.DictionaryArray.from_arrays(

Review comment:
       It was the test that pointed out that using `NaN` to represent nulls wasn't working (only works in `to_pandas`, not in `to_numpy`) as in case of `numpy` arrays of ints there was no way to store `np.NaN`  in the array. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

pitrou commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r621057457



##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -688,6 +688,69 @@ def test_dictionary_from_numpy():
             assert d2[i].as_py() == dictionary[indices[i]]
 
 
+def test_dictionary_to_numpy():
+    expected = pa.array(
+        ["foo", "bar", None, "foo"]
+    ).to_numpy(zero_copy_only=False)
+    a = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, None, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    assert (a.to_numpy(zero_copy_only=False) == expected).all()
+
+    with pytest.raises(pa.ArrowInvalid):
+        # If this would be changed to no longer raise in the future,
+        # ensure to test the actual result because, currently, to_numpy takes
+        # for granted that when zero_copy_only=True there will be no nulls
+        # (it's the decoding of the DictionaryArray that handles the nulls and
+        # this is only activated with zero_copy_only=False)
+        a.to_numpy(zero_copy_only=True)
+
+    anonulls = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, 1, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    expected = pa.array(
+        ["foo", "bar", "bar", "foo"]
+    ).to_numpy(zero_copy_only=False)
+    assert (anonulls.to_numpy(zero_copy_only=False) == expected).all()
+
+    with pytest.raises(pa.ArrowInvalid):
+        anonulls.to_numpy(zero_copy_only=True)
+
+    afloat = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, 1, 0]),
+        pa.array([13.7, 11.0])
+    )
+    expected = pa.array([13.7, 11.0, 11.0, 13.7]).to_numpy()
+    assert (afloat.to_numpy(zero_copy_only=True) == expected).all()
+    assert (afloat.to_numpy(zero_copy_only=False) == expected).all()
+
+    afloat2 = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, None, 0]),
+        pa.array([13.7, 11.0])
+    )
+    expected = pa.array(
+        [13.7, 11.0, None, 13.7]
+    ).to_numpy(zero_copy_only=False)
+    assert np.array_equal(
+        afloat2.to_numpy(zero_copy_only=False),

Review comment:
       Ok, what about `zero_copy_only=True`?

##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -688,6 +688,69 @@ def test_dictionary_from_numpy():
             assert d2[i].as_py() == dictionary[indices[i]]
 
 
+def test_dictionary_to_numpy():
+    expected = pa.array(
+        ["foo", "bar", None, "foo"]
+    ).to_numpy(zero_copy_only=False)
+    a = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, None, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    assert (a.to_numpy(zero_copy_only=False) == expected).all()
+
+    with pytest.raises(pa.ArrowInvalid):
+        # If this would be changed to no longer raise in the future,
+        # ensure to test the actual result because, currently, to_numpy takes
+        # for granted that when zero_copy_only=True there will be no nulls
+        # (it's the decoding of the DictionaryArray that handles the nulls and
+        # this is only activated with zero_copy_only=False)
+        a.to_numpy(zero_copy_only=True)
+
+    anonulls = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, 1, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    expected = pa.array(
+        ["foo", "bar", "bar", "foo"]
+    ).to_numpy(zero_copy_only=False)
+    assert (anonulls.to_numpy(zero_copy_only=False) == expected).all()
+
+    with pytest.raises(pa.ArrowInvalid):
+        anonulls.to_numpy(zero_copy_only=True)
+
+    afloat = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, 1, 0]),
+        pa.array([13.7, 11.0])
+    )
+    expected = pa.array([13.7, 11.0, 11.0, 13.7]).to_numpy()
+    assert (afloat.to_numpy(zero_copy_only=True) == expected).all()
+    assert (afloat.to_numpy(zero_copy_only=False) == expected).all()
+
+    afloat2 = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, None, 0]),
+        pa.array([13.7, 11.0])
+    )
+    expected = pa.array(
+        [13.7, 11.0, None, 13.7]
+    ).to_numpy(zero_copy_only=False)
+    assert np.array_equal(
+        afloat2.to_numpy(zero_copy_only=False),
+        expected,
+        equal_nan=True
+    )
+
+    aints = pa.DictionaryArray.from_arrays(

Review comment:
       Is this test useful?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

pitrou commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r621055767



##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -688,6 +688,69 @@ def test_dictionary_from_numpy():
             assert d2[i].as_py() == dictionary[indices[i]]
 
 
+def test_dictionary_to_numpy():
+    expected = pa.array(
+        ["foo", "bar", None, "foo"]
+    ).to_numpy(zero_copy_only=False)
+    a = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, None, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    assert (a.to_numpy(zero_copy_only=False) == expected).all()

Review comment:
       `np.testing` will provide more useful error messages.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r618363682



##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -688,6 +688,69 @@ def test_dictionary_from_numpy():
             assert d2[i].as_py() == dictionary[indices[i]]
 
 
+def test_dictionary_to_numpy():
+    expected = pa.array(
+        ["foo", "bar", None, "foo"]
+    ).to_numpy(zero_copy_only=False)
+    a = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, None, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    assert (a.to_numpy(zero_copy_only=False) == expected).all()

Review comment:
       I am not sure if we are very consistent about it, but we also use `np.testing.assert_array_equal` for asserting numpy arrays

##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -688,6 +688,69 @@ def test_dictionary_from_numpy():
             assert d2[i].as_py() == dictionary[indices[i]]
 
 
+def test_dictionary_to_numpy():
+    expected = pa.array(
+        ["foo", "bar", None, "foo"]
+    ).to_numpy(zero_copy_only=False)
+    a = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, None, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    assert (a.to_numpy(zero_copy_only=False) == expected).all()
+
+    with pytest.raises(pa.ArrowInvalid):
+        # to_numpy takes for granted that when zero_copy_only=True
+        # there will be no nulls.
+        # If that changes, nulls handling will have to be updated in to_numpy
+        # as we won't be able to rely anymore on decoding the DictionaryArray
+        # to handle nulls.

Review comment:
       I don't fully understand this comment about nulls. For this specific case isn't it just testing that conversion to numpy can never be done zero-copy for dictionary (nulls or no nulls) and is thus raising an error? (this already worked correctly on master as well)

##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -688,6 +688,69 @@ def test_dictionary_from_numpy():
             assert d2[i].as_py() == dictionary[indices[i]]
 
 
+def test_dictionary_to_numpy():
+    expected = pa.array(
+        ["foo", "bar", None, "foo"]
+    ).to_numpy(zero_copy_only=False)
+    a = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, None, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    assert (a.to_numpy(zero_copy_only=False) == expected).all()
+
+    with pytest.raises(pa.ArrowInvalid):
+        # to_numpy takes for granted that when zero_copy_only=True
+        # there will be no nulls.
+        # If that changes, nulls handling will have to be updated in to_numpy
+        # as we won't be able to rely anymore on decoding the DictionaryArray
+        # to handle nulls.
+        a.to_numpy(zero_copy_only=True)
+
+    anonulls = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, 1, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    expected = pa.array(
+        ["foo", "bar", "bar", "foo"]
+    ).to_numpy(zero_copy_only=False)

Review comment:
       ```suggestion
       expected = np.array(["foo", "bar", "bar", "foo"], dtype=object)
   ```
   
   I think we can just define the expected here? (it's also a bit simpler in code)
   
   (and the same for the others)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] amol- commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

amol- commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r616000507



##########
File path: python/pyarrow/array.pxi
##########
@@ -1170,7 +1170,9 @@ cdef class Array(_PandasConvertible):
         array = PyObject_to_object(out)
 
         if isinstance(array, dict):
+            missings = array["indices"] < 0
             array = np.take(array['dictionary'], array['indices'])
+            array[missings] = np.NaN

Review comment:
       I thought about adding a `if missings.any():` here, but it didn't change performances at all, so not much benefit.
   
   ```
   def withoutcheck():
       arr[mask] = np.NaN
   
   def withcheck():
       if mask.any():
           arr[mask] = np.NaN
   
   >>> timeit.timeit(withoutcheck, number=10000)
   5.555068847000001
   >>> timeit.timeit(withcheck, number=10000)
   5.6257486310000075
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] amol- commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

amol- commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r618353079



##########
File path: cpp/src/arrow/python/arrow_to_pandas.cc
##########
@@ -2229,6 +2229,11 @@ Status ConvertChunkedArrayToPandas(const PandasOptions& options,
         checked_cast<const DictionaryType&>(*arr->type()).value_type();
     RETURN_NOT_OK(DecodeDictionaries(options.pool, dense_type, &arr));
     DCHECK_NE(arr->type()->id(), Type::DICTIONARY);
+
+    // The original Python DictionaryArray won't own the memory anymore
+    // as we actually built a new array when we decoded the DictionaryArray
+    // thus let the final resulting numpy array own the memory through a Capsule
+    py_ref = nullptr;

Review comment:
       @pitrou would you be able to confirm that this is the right fix?
   
   You can see the problem it deals with by checking out this branch, commenting out that line and running the `test_dictionary_to_numpy` test. 
   
   You will see that in such case we produce corrupted data as the subsequent `MakeNumPyView` used the original `DictionaryArray` as the "base" of the new `numpy` array. While the newly built `Array`, resulting by dictionary decoding (who really owns the memory for the numpy array), probably goes away immediately at the end of this method. With this change instead a capsule is made and the returned numpy array becomes the owner of that capsule.
   
   Does that sound like the right solution?
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] amol- commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

amol- commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r616005809



##########
File path: python/pyarrow/array.pxi
##########
@@ -1170,7 +1170,9 @@ cdef class Array(_PandasConvertible):
         array = PyObject_to_object(out)
 
         if isinstance(array, dict):
+            missings = array["indices"] < 0
             array = np.take(array['dictionary'], array['indices'])
+            array[missings] = np.NaN

Review comment:
       I noticed I also have to check for interaction with zero copy. Given that we are ending up _writing_ to `array[missings]` in the end. Nulls should not be around when zero copy is requested, but better to have a test to see that it deals with that well.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] amol- commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

amol- commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r616005809



##########
File path: python/pyarrow/array.pxi
##########
@@ -1170,7 +1170,9 @@ cdef class Array(_PandasConvertible):
         array = PyObject_to_object(out)
 
         if isinstance(array, dict):
+            missings = array["indices"] < 0
             array = np.take(array['dictionary'], array['indices'])
+            array[missings] = np.NaN

Review comment:
       I noticed I also have to check for interaction with zero copy. Given that we are ending up _writing_ to `array[missings]` in the end.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r616786742



##########
File path: python/pyarrow/array.pxi
##########
@@ -1170,7 +1170,13 @@ cdef class Array(_PandasConvertible):
         array = PyObject_to_object(out)
 
         if isinstance(array, dict):
-            array = np.take(array['dictionary'], array['indices'])
+            if zero_copy_only or not self.null_count:
+                # zero_copy doesn't allow for nulls to be in the array
+                array = np.take(array['dictionary'], array['indices'])
+            else:
+                missings = array["indices"] < 0
+                array = np.take(array['dictionary'], array['indices'])
+                array[missings] = np.NaN

Review comment:
       I understand that it can be confusing, but the fact that `da.to_pandas()[1] == NaN` is the case, is a pandas behaviour / implementation detail (pandas doesn't use either NaN or None under the hood for a Categorical, but returns NaN when accessing the value. And, in a potential next version with pandas using nullable dtypes, this would no longer be true). 
   
   I think the relevant cases where we would want consistency is for the different `to_numpy` conversions:
   
   ```
   In [20]: pa.array(["a", "b", None]).to_numpy(zero_copy_only=False)
   Out[20]: array(['a', 'b', None], dtype=object)
   
   In [21]: pa.array(["a", "b", None]).dictionary_encode().to_numpy(zero_copy_only=False)
   Out[21]: array(['a', 'b', 'b'], dtype=object)
   ```
   
   where of course right now the last one is buggy (which is what this PR is solving), but so I would expect those two above to return the same in case of `to_numpy()`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

pitrou commented on pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#issuecomment-828256266


   @amol- If you enable Github Actions and AppVeyor on your fork, it can allow to get CI results more quickly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

pitrou commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r621203945



##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -688,6 +688,69 @@ def test_dictionary_from_numpy():
             assert d2[i].as_py() == dictionary[indices[i]]
 
 
+def test_dictionary_to_numpy():
+    expected = pa.array(
+        ["foo", "bar", None, "foo"]
+    ).to_numpy(zero_copy_only=False)
+    a = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, None, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    assert (a.to_numpy(zero_copy_only=False) == expected).all()
+
+    with pytest.raises(pa.ArrowInvalid):
+        # If this would be changed to no longer raise in the future,
+        # ensure to test the actual result because, currently, to_numpy takes
+        # for granted that when zero_copy_only=True there will be no nulls
+        # (it's the decoding of the DictionaryArray that handles the nulls and
+        # this is only activated with zero_copy_only=False)
+        a.to_numpy(zero_copy_only=True)
+
+    anonulls = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, 1, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    expected = pa.array(
+        ["foo", "bar", "bar", "foo"]
+    ).to_numpy(zero_copy_only=False)
+    assert (anonulls.to_numpy(zero_copy_only=False) == expected).all()
+
+    with pytest.raises(pa.ArrowInvalid):
+        anonulls.to_numpy(zero_copy_only=True)
+
+    afloat = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, 1, 0]),
+        pa.array([13.7, 11.0])
+    )
+    expected = pa.array([13.7, 11.0, 11.0, 13.7]).to_numpy()
+    assert (afloat.to_numpy(zero_copy_only=True) == expected).all()
+    assert (afloat.to_numpy(zero_copy_only=False) == expected).all()
+
+    afloat2 = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, None, 0]),
+        pa.array([13.7, 11.0])
+    )
+    expected = pa.array(
+        [13.7, 11.0, None, 13.7]
+    ).to_numpy(zero_copy_only=False)
+    assert np.array_equal(
+        afloat2.to_numpy(zero_copy_only=False),
+        expected,
+        equal_nan=True
+    )
+
+    aints = pa.DictionaryArray.from_arrays(

Review comment:
       I see no `NaN` in that test?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#issuecomment-822627565


   https://issues.apache.org/jira/browse/ARROW-9594


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] amol- commented on pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

amol- commented on pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#issuecomment-827477462


   The failure here seem to be related to a transient error on S3 tests. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] amol- commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

amol- commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r618381447



##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -688,6 +688,69 @@ def test_dictionary_from_numpy():
             assert d2[i].as_py() == dictionary[indices[i]]
 
 
+def test_dictionary_to_numpy():
+    expected = pa.array(
+        ["foo", "bar", None, "foo"]
+    ).to_numpy(zero_copy_only=False)
+    a = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, None, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    assert (a.to_numpy(zero_copy_only=False) == expected).all()
+
+    with pytest.raises(pa.ArrowInvalid):
+        # to_numpy takes for granted that when zero_copy_only=True
+        # there will be no nulls.
+        # If that changes, nulls handling will have to be updated in to_numpy
+        # as we won't be able to rely anymore on decoding the DictionaryArray
+        # to handle nulls.
+        a.to_numpy(zero_copy_only=True)
+
+    anonulls = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, 1, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    expected = pa.array(
+        ["foo", "bar", "bar", "foo"]
+    ).to_numpy(zero_copy_only=False)

Review comment:
       My goal was mostly to test that "wathever decoding a normal `Array` leads to, decoding a `DictionaryArray` leads to the same exact result". I felt it was more future proof than testing for the actual value. But if you feel it's better to test for the actual resulting numpy array I'll gladly change that.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] amol- commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

amol- commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r616690521



##########
File path: python/pyarrow/array.pxi
##########
@@ -1170,7 +1170,9 @@ cdef class Array(_PandasConvertible):
         array = PyObject_to_object(out)
 
         if isinstance(array, dict):
+            missings = array["indices"] < 0
             array = np.take(array['dictionary'], array['indices'])
+            array[missings] = np.NaN

Review comment:
       Added an optimization based on `zero_copy_only` option (as it doesn't allow nulls) and on `self.null_count` as the null_count is cached ( https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/arrow/array/data.cc#L120-L121 ) thus should frequently not add to the overhead.
   
   Also confirmed that `-1` is used to signal NULL values when converting to `numpy` arrays ( https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/arrow/python/arrow_to_pandas.cc#L1637 )
   
   A further optimization might have been to have `ConvertArrayToPandas`  return the count of null values (as it is invoking `IsValid` on them anyway) but that requires a more widespread change and thus I think should be deferred until proved necessary.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou closed pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

pitrou closed pull request #10101:
URL: https://github.com/apache/arrow/pull/10101


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] amol- commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

amol- commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r621177214



##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -688,6 +688,69 @@ def test_dictionary_from_numpy():
             assert d2[i].as_py() == dictionary[indices[i]]
 
 
+def test_dictionary_to_numpy():
+    expected = pa.array(
+        ["foo", "bar", None, "foo"]
+    ).to_numpy(zero_copy_only=False)
+    a = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, None, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    assert (a.to_numpy(zero_copy_only=False) == expected).all()
+
+    with pytest.raises(pa.ArrowInvalid):
+        # If this would be changed to no longer raise in the future,
+        # ensure to test the actual result because, currently, to_numpy takes
+        # for granted that when zero_copy_only=True there will be no nulls
+        # (it's the decoding of the DictionaryArray that handles the nulls and
+        # this is only activated with zero_copy_only=False)
+        a.to_numpy(zero_copy_only=True)
+
+    anonulls = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, 1, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    expected = pa.array(
+        ["foo", "bar", "bar", "foo"]
+    ).to_numpy(zero_copy_only=False)
+    assert (anonulls.to_numpy(zero_copy_only=False) == expected).all()
+
+    with pytest.raises(pa.ArrowInvalid):
+        anonulls.to_numpy(zero_copy_only=True)
+
+    afloat = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, 1, 0]),
+        pa.array([13.7, 11.0])
+    )
+    expected = pa.array([13.7, 11.0, 11.0, 13.7]).to_numpy()
+    assert (afloat.to_numpy(zero_copy_only=True) == expected).all()
+    assert (afloat.to_numpy(zero_copy_only=False) == expected).all()
+
+    afloat2 = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, None, 0]),
+        pa.array([13.7, 11.0])
+    )
+    expected = pa.array(
+        [13.7, 11.0, None, 13.7]
+    ).to_numpy(zero_copy_only=False)
+    assert np.array_equal(
+        afloat2.to_numpy(zero_copy_only=False),

Review comment:
       The test here has `None` inside, so there is no way to test with `zero_copy_only=True`, the case of floats with `zero_copy_only=true` is already tested in the previous block.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] amol- commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

amol- commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r616774759



##########
File path: python/pyarrow/array.pxi
##########
@@ -1170,7 +1170,13 @@ cdef class Array(_PandasConvertible):
         array = PyObject_to_object(out)
 
         if isinstance(array, dict):
-            array = np.take(array['dictionary'], array['indices'])
+            if zero_copy_only or not self.null_count:
+                # zero_copy doesn't allow for nulls to be in the array
+                array = np.take(array['dictionary'], array['indices'])
+            else:
+                missings = array["indices"] < 0
+                array = np.take(array['dictionary'], array['indices'])
+                array[missings] = np.NaN

Review comment:
       I think that my question was mostly in terms of consistency.
   
   Given
   ```
   da = pa.DictionaryArray.from_arrays(pa.array([0, None]), pa.array(['a']))
   ```
   
   If 
   ```
   da.to_pandas()[1] == NaN
   ```
   
   shouldn't
   ```
   da.to_numpy()[1] == da.to_pandas()[1]
   ``` 
   
   woundn't it be confusing that in one case you get back `NaN` and in the other `None`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] amol- commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

amol- commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r621250569



##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -688,6 +688,69 @@ def test_dictionary_from_numpy():
             assert d2[i].as_py() == dictionary[indices[i]]
 
 
+def test_dictionary_to_numpy():
+    expected = pa.array(
+        ["foo", "bar", None, "foo"]
+    ).to_numpy(zero_copy_only=False)
+    a = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, None, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    assert (a.to_numpy(zero_copy_only=False) == expected).all()
+
+    with pytest.raises(pa.ArrowInvalid):
+        # If this would be changed to no longer raise in the future,
+        # ensure to test the actual result because, currently, to_numpy takes
+        # for granted that when zero_copy_only=True there will be no nulls
+        # (it's the decoding of the DictionaryArray that handles the nulls and
+        # this is only activated with zero_copy_only=False)
+        a.to_numpy(zero_copy_only=True)
+
+    anonulls = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, 1, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    expected = pa.array(
+        ["foo", "bar", "bar", "foo"]
+    ).to_numpy(zero_copy_only=False)
+    assert (anonulls.to_numpy(zero_copy_only=False) == expected).all()
+
+    with pytest.raises(pa.ArrowInvalid):
+        anonulls.to_numpy(zero_copy_only=True)
+
+    afloat = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, 1, 0]),
+        pa.array([13.7, 11.0])
+    )
+    expected = pa.array([13.7, 11.0, 11.0, 13.7]).to_numpy()
+    assert (afloat.to_numpy(zero_copy_only=True) == expected).all()
+    assert (afloat.to_numpy(zero_copy_only=False) == expected).all()
+
+    afloat2 = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, None, 0]),
+        pa.array([13.7, 11.0])
+    )
+    expected = pa.array(
+        [13.7, 11.0, None, 13.7]
+    ).to_numpy(zero_copy_only=False)
+    assert np.array_equal(
+        afloat2.to_numpy(zero_copy_only=False),

Review comment:
       We already verify for that error in the previous `with pytest.raises(pa.ArrowInvalid)` block




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

pitrou commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r621203599



##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -688,6 +688,69 @@ def test_dictionary_from_numpy():
             assert d2[i].as_py() == dictionary[indices[i]]
 
 
+def test_dictionary_to_numpy():
+    expected = pa.array(
+        ["foo", "bar", None, "foo"]
+    ).to_numpy(zero_copy_only=False)
+    a = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, None, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    assert (a.to_numpy(zero_copy_only=False) == expected).all()
+
+    with pytest.raises(pa.ArrowInvalid):
+        # If this would be changed to no longer raise in the future,
+        # ensure to test the actual result because, currently, to_numpy takes
+        # for granted that when zero_copy_only=True there will be no nulls
+        # (it's the decoding of the DictionaryArray that handles the nulls and
+        # this is only activated with zero_copy_only=False)
+        a.to_numpy(zero_copy_only=True)
+
+    anonulls = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, 1, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    expected = pa.array(
+        ["foo", "bar", "bar", "foo"]
+    ).to_numpy(zero_copy_only=False)
+    assert (anonulls.to_numpy(zero_copy_only=False) == expected).all()
+
+    with pytest.raises(pa.ArrowInvalid):
+        anonulls.to_numpy(zero_copy_only=True)
+
+    afloat = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, 1, 0]),
+        pa.array([13.7, 11.0])
+    )
+    expected = pa.array([13.7, 11.0, 11.0, 13.7]).to_numpy()
+    assert (afloat.to_numpy(zero_copy_only=True) == expected).all()
+    assert (afloat.to_numpy(zero_copy_only=False) == expected).all()
+
+    afloat2 = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, None, 0]),
+        pa.array([13.7, 11.0])
+    )
+    expected = pa.array(
+        [13.7, 11.0, None, 13.7]
+    ).to_numpy(zero_copy_only=False)
+    assert np.array_equal(
+        afloat2.to_numpy(zero_copy_only=False),

Review comment:
       What do you mean, "no way to test"? It may return an error, but there should be a test for it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] amol- commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

amol- commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r616743857



##########
File path: python/pyarrow/array.pxi
##########
@@ -1170,7 +1170,13 @@ cdef class Array(_PandasConvertible):
         array = PyObject_to_object(out)
 
         if isinstance(array, dict):
-            array = np.take(array['dictionary'], array['indices'])
+            if zero_copy_only or not self.null_count:
+                # zero_copy doesn't allow for nulls to be in the array
+                array = np.take(array['dictionary'], array['indices'])
+            else:
+                missings = array["indices"] < 0
+                array = np.take(array['dictionary'], array['indices'])
+                array[missings] = np.NaN

Review comment:
       Thanks for catching the `np.NaN` with ints, I actually noticed that yesterday but forgot to deal with it when I got back to this ticket.
   
   Regarding the `None` vs `np.NaN`.
   I think there is an inconsistency already, because I copied the behaviour from `to_pandas` and it seems that `to_pandas` already leads to two different results when used on `DictionaryArray`  or `Array`:
   
   ```
   >>> pa.array(['a', None]).to_pandas().tolist()
   ['a', None]
   >>> pa.array(['a', None]).to_numpy(zero_copy_only=False).tolist()
   ['a', None]
   ```
   
   VS
   
   ```
   >>> pa.DictionaryArray.from_arrays(pa.array([0, None]), pa.array(['a'])).to_pandas().tolist()
   ['a', nan]
   ```
   
   So it seems that for `DictionaryArray` we already used `NaN`, not `None`, which lead to the reason why I used `NaN`.
   
   Should we uniform the behaviour and switch to `None` for `DictionaryArray.to_pandas` too as we do for `Array.to_pandas`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r620260312



##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -688,6 +688,69 @@ def test_dictionary_from_numpy():
             assert d2[i].as_py() == dictionary[indices[i]]
 
 
+def test_dictionary_to_numpy():
+    expected = pa.array(
+        ["foo", "bar", None, "foo"]
+    ).to_numpy(zero_copy_only=False)
+    a = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, None, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    assert (a.to_numpy(zero_copy_only=False) == expected).all()
+
+    with pytest.raises(pa.ArrowInvalid):
+        # to_numpy takes for granted that when zero_copy_only=True
+        # there will be no nulls.
+        # If that changes, nulls handling will have to be updated in to_numpy
+        # as we won't be able to rely anymore on decoding the DictionaryArray
+        # to handle nulls.
+        a.to_numpy(zero_copy_only=True)
+
+    anonulls = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, 1, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    expected = pa.array(
+        ["foo", "bar", "bar", "foo"]
+    ).to_numpy(zero_copy_only=False)

Review comment:
       OK, either way is fine for me




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

pitrou commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r618384121



##########
File path: cpp/src/arrow/python/arrow_to_pandas.cc
##########
@@ -2229,6 +2229,11 @@ Status ConvertChunkedArrayToPandas(const PandasOptions& options,
         checked_cast<const DictionaryType&>(*arr->type()).value_type();
     RETURN_NOT_OK(DecodeDictionaries(options.pool, dense_type, &arr));
     DCHECK_NE(arr->type()->id(), Type::DICTIONARY);
+
+    // The original Python DictionaryArray won't own the memory anymore
+    // as we actually built a new array when we decoded the DictionaryArray
+    // thus let the final resulting numpy array own the memory through a Capsule
+    py_ref = nullptr;

Review comment:
       This does seem right indeed. Kudos for figuring this out :-)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] amol- commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

amol- commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r620359829



##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -688,6 +688,69 @@ def test_dictionary_from_numpy():
             assert d2[i].as_py() == dictionary[indices[i]]
 
 
+def test_dictionary_to_numpy():
+    expected = pa.array(
+        ["foo", "bar", None, "foo"]
+    ).to_numpy(zero_copy_only=False)
+    a = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, None, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    assert (a.to_numpy(zero_copy_only=False) == expected).all()
+
+    with pytest.raises(pa.ArrowInvalid):
+        # to_numpy takes for granted that when zero_copy_only=True
+        # there will be no nulls.
+        # If that changes, nulls handling will have to be updated in to_numpy
+        # as we won't be able to rely anymore on decoding the DictionaryArray
+        # to handle nulls.

Review comment:
       👍  I'll replace the comment




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r616002776



##########
File path: python/pyarrow/array.pxi
##########
@@ -1170,7 +1170,9 @@ cdef class Array(_PandasConvertible):
         array = PyObject_to_object(out)
 
         if isinstance(array, dict):
+            missings = array["indices"] < 0
             array = np.take(array['dictionary'], array['indices'])
+            array[missings] = np.NaN

Review comment:
       Checking the null count of the original arrow array (`self`) might help?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] amol- commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

amol- commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r618353079



##########
File path: cpp/src/arrow/python/arrow_to_pandas.cc
##########
@@ -2229,6 +2229,11 @@ Status ConvertChunkedArrayToPandas(const PandasOptions& options,
         checked_cast<const DictionaryType&>(*arr->type()).value_type();
     RETURN_NOT_OK(DecodeDictionaries(options.pool, dense_type, &arr));
     DCHECK_NE(arr->type()->id(), Type::DICTIONARY);
+
+    // The original Python DictionaryArray won't own the memory anymore
+    // as we actually built a new array when we decoded the DictionaryArray
+    // thus let the final resulting numpy array own the memory through a Capsule
+    py_ref = nullptr;

Review comment:
       @pitrou would you be able to confirm that this is the right fix?
   
   You can see the problem it deals with by checking out this branch, commenting out that line and running the `test_dictionary_to_numpy` test. You will see that in such case we produce corrupted data as the subsequent `MakeNumPyView` used the original `DictionaryArray` as the "base" of the new `numpy` array and the newly built `Array` resulting by dictionary decoding (who really owns the memory for the numpy array) probably goes away immediately at the end of this method. With this change instead a capsule is made and the returned numpy array becomes the owner of that capsule.
   
   Does that sound like the right solution?
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] amol- commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

amol- commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r621242901



##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -688,6 +688,69 @@ def test_dictionary_from_numpy():
             assert d2[i].as_py() == dictionary[indices[i]]
 
 
+def test_dictionary_to_numpy():
+    expected = pa.array(
+        ["foo", "bar", None, "foo"]
+    ).to_numpy(zero_copy_only=False)
+    a = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, None, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    assert (a.to_numpy(zero_copy_only=False) == expected).all()
+
+    with pytest.raises(pa.ArrowInvalid):
+        # If this would be changed to no longer raise in the future,
+        # ensure to test the actual result because, currently, to_numpy takes
+        # for granted that when zero_copy_only=True there will be no nulls
+        # (it's the decoding of the DictionaryArray that handles the nulls and
+        # this is only activated with zero_copy_only=False)
+        a.to_numpy(zero_copy_only=True)
+
+    anonulls = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, 1, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    expected = pa.array(
+        ["foo", "bar", "bar", "foo"]
+    ).to_numpy(zero_copy_only=False)
+    assert (anonulls.to_numpy(zero_copy_only=False) == expected).all()
+
+    with pytest.raises(pa.ArrowInvalid):
+        anonulls.to_numpy(zero_copy_only=True)
+
+    afloat = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, 1, 0]),
+        pa.array([13.7, 11.0])
+    )
+    expected = pa.array([13.7, 11.0, 11.0, 13.7]).to_numpy()
+    assert (afloat.to_numpy(zero_copy_only=True) == expected).all()
+    assert (afloat.to_numpy(zero_copy_only=False) == expected).all()
+
+    afloat2 = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, None, 0]),
+        pa.array([13.7, 11.0])
+    )
+    expected = pa.array(
+        [13.7, 11.0, None, 13.7]
+    ).to_numpy(zero_copy_only=False)
+    assert np.array_equal(
+        afloat2.to_numpy(zero_copy_only=False),
+        expected,
+        equal_nan=True
+    )
+
+    aints = pa.DictionaryArray.from_arrays(

Review comment:
       `None` is what is getting translated to `NaN`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] amol- commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

amol- commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r618353079



##########
File path: cpp/src/arrow/python/arrow_to_pandas.cc
##########
@@ -2229,6 +2229,11 @@ Status ConvertChunkedArrayToPandas(const PandasOptions& options,
         checked_cast<const DictionaryType&>(*arr->type()).value_type();
     RETURN_NOT_OK(DecodeDictionaries(options.pool, dense_type, &arr));
     DCHECK_NE(arr->type()->id(), Type::DICTIONARY);
+
+    // The original Python DictionaryArray won't own the memory anymore
+    // as we actually built a new array when we decoded the DictionaryArray
+    // thus let the final resulting numpy array own the memory through a Capsule
+    py_ref = nullptr;

Review comment:
       @pitrou would you be able to confirm that this is the right fix?
   
   You can see the problem it deals with by checking out this branch, commenting out that line and running the `test_dictionary_to_numpy` test. 
   
   You will see that in such case we produce corrupted data as the subsequent `MakeNumPyView` used the original `DictionaryArray` as the "base" of the new `numpy` array and the newly built `Array` resulting by dictionary decoding (who really owns the memory for the numpy array) probably goes away immediately at the end of this method. With this change instead a capsule is made and the returned numpy array becomes the owner of that capsule.
   
   Does that sound like the right solution?
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r620264160



##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -688,6 +688,69 @@ def test_dictionary_from_numpy():
             assert d2[i].as_py() == dictionary[indices[i]]
 
 
+def test_dictionary_to_numpy():
+    expected = pa.array(
+        ["foo", "bar", None, "foo"]
+    ).to_numpy(zero_copy_only=False)
+    a = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, None, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    assert (a.to_numpy(zero_copy_only=False) == expected).all()
+
+    with pytest.raises(pa.ArrowInvalid):
+        # to_numpy takes for granted that when zero_copy_only=True
+        # there will be no nulls.
+        # If that changes, nulls handling will have to be updated in to_numpy
+        # as we won't be able to rely anymore on decoding the DictionaryArray
+        # to handle nulls.

Review comment:
       I see, understand it now, but still find the original comment a bit confusing. How about something like:
   
   ```
           # If this would be changed to no longer raise in the future, ensure to test the actual result
           # because, currently, to_numpy takes for granted that when zero_copy_only=True
           # there will be no nulls (it's the decoding of the DictionaryArray that handles the nulls and
           # this is only activated with zero_copy_only=False)
   ```
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] amol- commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

amol- commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r621242901



##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -688,6 +688,69 @@ def test_dictionary_from_numpy():
             assert d2[i].as_py() == dictionary[indices[i]]
 
 
+def test_dictionary_to_numpy():
+    expected = pa.array(
+        ["foo", "bar", None, "foo"]
+    ).to_numpy(zero_copy_only=False)
+    a = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, None, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    assert (a.to_numpy(zero_copy_only=False) == expected).all()
+
+    with pytest.raises(pa.ArrowInvalid):
+        # If this would be changed to no longer raise in the future,
+        # ensure to test the actual result because, currently, to_numpy takes
+        # for granted that when zero_copy_only=True there will be no nulls
+        # (it's the decoding of the DictionaryArray that handles the nulls and
+        # this is only activated with zero_copy_only=False)
+        a.to_numpy(zero_copy_only=True)
+
+    anonulls = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, 1, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    expected = pa.array(
+        ["foo", "bar", "bar", "foo"]
+    ).to_numpy(zero_copy_only=False)
+    assert (anonulls.to_numpy(zero_copy_only=False) == expected).all()
+
+    with pytest.raises(pa.ArrowInvalid):
+        anonulls.to_numpy(zero_copy_only=True)
+
+    afloat = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, 1, 0]),
+        pa.array([13.7, 11.0])
+    )
+    expected = pa.array([13.7, 11.0, 11.0, 13.7]).to_numpy()
+    assert (afloat.to_numpy(zero_copy_only=True) == expected).all()
+    assert (afloat.to_numpy(zero_copy_only=False) == expected).all()
+
+    afloat2 = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, None, 0]),
+        pa.array([13.7, 11.0])
+    )
+    expected = pa.array(
+        [13.7, 11.0, None, 13.7]
+    ).to_numpy(zero_copy_only=False)
+    assert np.array_equal(
+        afloat2.to_numpy(zero_copy_only=False),
+        expected,
+        equal_nan=True
+    )
+
+    aints = pa.DictionaryArray.from_arrays(

Review comment:
       `None` is what was getting translated to `NaN`, now becomes `None`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] amol- commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

amol- commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r616774759



##########
File path: python/pyarrow/array.pxi
##########
@@ -1170,7 +1170,13 @@ cdef class Array(_PandasConvertible):
         array = PyObject_to_object(out)
 
         if isinstance(array, dict):
-            array = np.take(array['dictionary'], array['indices'])
+            if zero_copy_only or not self.null_count:
+                # zero_copy doesn't allow for nulls to be in the array
+                array = np.take(array['dictionary'], array['indices'])
+            else:
+                missings = array["indices"] < 0
+                array = np.take(array['dictionary'], array['indices'])
+                array[missings] = np.NaN

Review comment:
       I think that my question was mostly in terms of consistency.
   
   Given
   ```
   da = pa.DictionaryArray.from_arrays(pa.array([0, None]), pa.array(['a']))
   ```
   
   If 
   ```
   da.to_pandas()[1] == NaN
   ```
   
   shouldn't
   ```
   da.to_numpy()[1] == da.to_pandas()[1]
   ```
   
   woundn't it be confusing that in one case you get back `NaN` and in the other `None`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] amol- commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

amol- commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r621178469



##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -688,6 +688,69 @@ def test_dictionary_from_numpy():
             assert d2[i].as_py() == dictionary[indices[i]]
 
 
+def test_dictionary_to_numpy():
+    expected = pa.array(
+        ["foo", "bar", None, "foo"]
+    ).to_numpy(zero_copy_only=False)
+    a = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, None, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    assert (a.to_numpy(zero_copy_only=False) == expected).all()
+
+    with pytest.raises(pa.ArrowInvalid):
+        # If this would be changed to no longer raise in the future,
+        # ensure to test the actual result because, currently, to_numpy takes
+        # for granted that when zero_copy_only=True there will be no nulls
+        # (it's the decoding of the DictionaryArray that handles the nulls and
+        # this is only activated with zero_copy_only=False)
+        a.to_numpy(zero_copy_only=True)
+
+    anonulls = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, 1, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    expected = pa.array(
+        ["foo", "bar", "bar", "foo"]
+    ).to_numpy(zero_copy_only=False)
+    assert (anonulls.to_numpy(zero_copy_only=False) == expected).all()
+
+    with pytest.raises(pa.ArrowInvalid):
+        anonulls.to_numpy(zero_copy_only=True)
+
+    afloat = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, 1, 0]),
+        pa.array([13.7, 11.0])
+    )
+    expected = pa.array([13.7, 11.0, 11.0, 13.7]).to_numpy()
+    assert (afloat.to_numpy(zero_copy_only=True) == expected).all()
+    assert (afloat.to_numpy(zero_copy_only=False) == expected).all()
+
+    afloat2 = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, None, 0]),
+        pa.array([13.7, 11.0])
+    )
+    expected = pa.array(
+        [13.7, 11.0, None, 13.7]
+    ).to_numpy(zero_copy_only=False)
+    assert np.array_equal(
+        afloat2.to_numpy(zero_copy_only=False),
+        expected,
+        equal_nan=True
+    )
+
+    aints = pa.DictionaryArray.from_arrays(

Review comment:
       It was the test that pointed out that using `NaN` to represent nulls wasn't working (only worked in `to_pandas`, not in `to_numpy`) as in case of `numpy` arrays of ints there was no way to store `np.NaN`  in the array. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] amol- commented on pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

amol- commented on pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#issuecomment-828259889


   > @amol- If you enable Github Actions and AppVeyor on your fork, it can allow to get CI results more quickly.
   
   Oh, thanks for the suggestion, I'll try that


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] amol- commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

amol- commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r621251938



##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -688,6 +688,69 @@ def test_dictionary_from_numpy():
             assert d2[i].as_py() == dictionary[indices[i]]
 
 
+def test_dictionary_to_numpy():
+    expected = pa.array(
+        ["foo", "bar", None, "foo"]
+    ).to_numpy(zero_copy_only=False)
+    a = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, None, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    assert (a.to_numpy(zero_copy_only=False) == expected).all()

Review comment:
       replaced the occurrences of `assert (X == Y).all()` with `np.testing.assert_array_equal`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] amol- commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

amol- commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r616743857



##########
File path: python/pyarrow/array.pxi
##########
@@ -1170,7 +1170,13 @@ cdef class Array(_PandasConvertible):
         array = PyObject_to_object(out)
 
         if isinstance(array, dict):
-            array = np.take(array['dictionary'], array['indices'])
+            if zero_copy_only or not self.null_count:
+                # zero_copy doesn't allow for nulls to be in the array
+                array = np.take(array['dictionary'], array['indices'])
+            else:
+                missings = array["indices"] < 0
+                array = np.take(array['dictionary'], array['indices'])
+                array[missings] = np.NaN

Review comment:
       Thanks for catching the `np.NaN` with ints, I actually noticed that yesterday but forgot to deal with it when I got back to this ticket.
   
   Regarding the `None` vs `np.NaN`.
   I think there is an inconsistency already, because I copied the behaviour from `to_pandas` and it seems that `to_pandas` already leads to two different results when used on `DictionaryArray`  or `Array`:
   
   ```
   >>> pa.array(['a', None]).to_pandas().tolist()
   ['a', None]
   >>> pa.array(['a', None]).to_numpy(zero_copy_only=False).tolist()
   ['a', None]
   ```
   
   VS
   
   ```
   >>> pa.DictionaryArray.from_arrays(pa.array([0, None]), pa.array(['a'])).to_pandas().tolist()
   ```
   
   So it seems that for `DictionaryArray` we already used `NaN`, not `None`, which lead to the reason why I used `NaN`.
   
   Should we uniform the behaviour and switch to `None` for `DictionaryArray.to_pandas` too as we do for `Array.to_pandas`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] amol- commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

amol- commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r621924730



##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -688,6 +688,69 @@ def test_dictionary_from_numpy():
             assert d2[i].as_py() == dictionary[indices[i]]
 
 
+def test_dictionary_to_numpy():
+    expected = pa.array(
+        ["foo", "bar", None, "foo"]
+    ).to_numpy(zero_copy_only=False)
+    a = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, None, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    assert (a.to_numpy(zero_copy_only=False) == expected).all()
+
+    with pytest.raises(pa.ArrowInvalid):
+        # If this would be changed to no longer raise in the future,
+        # ensure to test the actual result because, currently, to_numpy takes
+        # for granted that when zero_copy_only=True there will be no nulls
+        # (it's the decoding of the DictionaryArray that handles the nulls and
+        # this is only activated with zero_copy_only=False)
+        a.to_numpy(zero_copy_only=True)
+
+    anonulls = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, 1, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    expected = pa.array(
+        ["foo", "bar", "bar", "foo"]
+    ).to_numpy(zero_copy_only=False)
+    assert (anonulls.to_numpy(zero_copy_only=False) == expected).all()
+
+    with pytest.raises(pa.ArrowInvalid):
+        anonulls.to_numpy(zero_copy_only=True)
+
+    afloat = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, 1, 0]),
+        pa.array([13.7, 11.0])
+    )
+    expected = pa.array([13.7, 11.0, 11.0, 13.7]).to_numpy()
+    assert (afloat.to_numpy(zero_copy_only=True) == expected).all()
+    assert (afloat.to_numpy(zero_copy_only=False) == expected).all()
+
+    afloat2 = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, None, 0]),
+        pa.array([13.7, 11.0])
+    )
+    expected = pa.array(
+        [13.7, 11.0, None, 13.7]
+    ).to_numpy(zero_copy_only=False)
+    assert np.array_equal(
+        afloat2.to_numpy(zero_copy_only=False),
+        expected,
+        equal_nan=True
+    )
+
+    aints = pa.DictionaryArray.from_arrays(

Review comment:
       :+1: done!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r616751489



##########
File path: python/pyarrow/array.pxi
##########
@@ -1170,7 +1170,13 @@ cdef class Array(_PandasConvertible):
         array = PyObject_to_object(out)
 
         if isinstance(array, dict):
-            array = np.take(array['dictionary'], array['indices'])
+            if zero_copy_only or not self.null_count:
+                # zero_copy doesn't allow for nulls to be in the array
+                array = np.take(array['dictionary'], array['indices'])
+            else:
+                missings = array["indices"] < 0
+                array = np.take(array['dictionary'], array['indices'])
+                array[missings] = np.NaN

Review comment:
       > So it seems that for `DictionaryArray` we already used `NaN`, not `None`, which lead to the reason why I used `NaN`.
   
   That's actually the result of pandas' conversion of Categorical to a list, so not controlled by pyarrow. 
   
   For `DictionaryArray.to_pandas` we return a pandas categorical dtype, which natively supports missing values (using -1 in the codes). So for this part, the `None` vs `NaN` is not relevant (outside of our control)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] amol- commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

amol- commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r616795078



##########
File path: python/pyarrow/array.pxi
##########
@@ -1170,7 +1170,13 @@ cdef class Array(_PandasConvertible):
         array = PyObject_to_object(out)
 
         if isinstance(array, dict):
-            array = np.take(array['dictionary'], array['indices'])
+            if zero_copy_only or not self.null_count:
+                # zero_copy doesn't allow for nulls to be in the array
+                array = np.take(array['dictionary'], array['indices'])
+            else:
+                missings = array["indices"] < 0
+                array = np.take(array['dictionary'], array['indices'])
+                array[missings] = np.NaN

Review comment:
       Ok, just wanted to make sure we didn't introduced inconsistencies in the APIs by chance.
   
   I'll go in the direction of having the `to_numpy()` return the same values it would return with a normal `Array` and I'll check what's the impact of converting the `DictionaryArray`  to a plain `Array` before doing the conversion when there are nulls.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r616728155



##########
File path: python/pyarrow/array.pxi
##########
@@ -1170,7 +1170,13 @@ cdef class Array(_PandasConvertible):
         array = PyObject_to_object(out)
 
         if isinstance(array, dict):
-            array = np.take(array['dictionary'], array['indices'])
+            if zero_copy_only or not self.null_count:
+                # zero_copy doesn't allow for nulls to be in the array
+                array = np.take(array['dictionary'], array['indices'])
+            else:
+                missings = array["indices"] < 0
+                array = np.take(array['dictionary'], array['indices'])
+                array[missings] = np.NaN

Review comment:
       This will not work for all data types, I think (eg a dictionary array with integer dictionary). Also, for strings, we might want to use `None` instead of `np.nan`, since this is what we do in basic conversion as well (eg see `pa.array(['a', None]).to_numpy(zero_copy_only=False)`). 
   
   Given those complexities, I am wondering if it might not be easier to *first* convert to a "dense" array before doing the arrow->python conversion.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

pitrou commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r621391796



##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -688,6 +688,69 @@ def test_dictionary_from_numpy():
             assert d2[i].as_py() == dictionary[indices[i]]
 
 
+def test_dictionary_to_numpy():
+    expected = pa.array(
+        ["foo", "bar", None, "foo"]
+    ).to_numpy(zero_copy_only=False)
+    a = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, None, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    assert (a.to_numpy(zero_copy_only=False) == expected).all()
+
+    with pytest.raises(pa.ArrowInvalid):
+        # If this would be changed to no longer raise in the future,
+        # ensure to test the actual result because, currently, to_numpy takes
+        # for granted that when zero_copy_only=True there will be no nulls
+        # (it's the decoding of the DictionaryArray that handles the nulls and
+        # this is only activated with zero_copy_only=False)
+        a.to_numpy(zero_copy_only=True)
+
+    anonulls = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, 1, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    expected = pa.array(
+        ["foo", "bar", "bar", "foo"]
+    ).to_numpy(zero_copy_only=False)
+    assert (anonulls.to_numpy(zero_copy_only=False) == expected).all()
+
+    with pytest.raises(pa.ArrowInvalid):
+        anonulls.to_numpy(zero_copy_only=True)
+
+    afloat = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, 1, 0]),
+        pa.array([13.7, 11.0])
+    )
+    expected = pa.array([13.7, 11.0, 11.0, 13.7]).to_numpy()
+    assert (afloat.to_numpy(zero_copy_only=True) == expected).all()
+    assert (afloat.to_numpy(zero_copy_only=False) == expected).all()
+
+    afloat2 = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, None, 0]),
+        pa.array([13.7, 11.0])
+    )
+    expected = pa.array(
+        [13.7, 11.0, None, 13.7]
+    ).to_numpy(zero_copy_only=False)
+    assert np.array_equal(
+        afloat2.to_numpy(zero_copy_only=False),
+        expected,
+        equal_nan=True
+    )
+
+    aints = pa.DictionaryArray.from_arrays(

Review comment:
       Ok, can you add a comment explaining this? Otherwise it's not obvious to the reader what this test brings.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] amol- commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

amol- commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r618380258



##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -688,6 +688,69 @@ def test_dictionary_from_numpy():
             assert d2[i].as_py() == dictionary[indices[i]]
 
 
+def test_dictionary_to_numpy():
+    expected = pa.array(
+        ["foo", "bar", None, "foo"]
+    ).to_numpy(zero_copy_only=False)
+    a = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, None, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    assert (a.to_numpy(zero_copy_only=False) == expected).all()
+
+    with pytest.raises(pa.ArrowInvalid):
+        # to_numpy takes for granted that when zero_copy_only=True
+        # there will be no nulls.
+        # If that changes, nulls handling will have to be updated in to_numpy
+        # as we won't be able to rely anymore on decoding the DictionaryArray
+        # to handle nulls.

Review comment:
       The purpose of this assertion is mostly to make sure if `zero_copy_only` no longer throws an exception in the future, then the developer notices that this test has to be updated too as we will have to check that nulls are still properly handled even with zero copy (as currently the null handling is dealt with by decoding the dictionary which only happens if you allow copies)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] amol- commented on a change in pull request #10101: ARROW-9594: [Python] Preserve null indexes in DictionaryArray.to_numpy as it's done in DictionaryArray.to_pandas

Posted by GitBox <gi...@apache.org>.

amol- commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r618376431



##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -688,6 +688,69 @@ def test_dictionary_from_numpy():
             assert d2[i].as_py() == dictionary[indices[i]]
 
 
+def test_dictionary_to_numpy():
+    expected = pa.array(
+        ["foo", "bar", None, "foo"]
+    ).to_numpy(zero_copy_only=False)
+    a = pa.DictionaryArray.from_arrays(
+        pa.array([0, 1, None, 0]),
+        pa.array(['foo', 'bar'])
+    )
+    assert (a.to_numpy(zero_copy_only=False) == expected).all()

Review comment:
       I don't have a strong opinion about it, if you want me to change those to `assert_array_equal` I'll gladly do. I'm just used to go for `assert` when using `pytest` but in the end `np.testing` provides a good instrumentation too.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org