You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/24 09:00:21 UTC

[GitHub] [arrow] jorisvandenbossche opened a new pull request #10591: ARROW-13158: [Python] Fix StructScalar contains and repr with duplicate field names

jorisvandenbossche opened a new pull request #10591:
URL: https://github.com/apache/arrow/pull/10591


   This PR does several small things:
   
   - Fix `__contains__` to not check the actual keys (and not rely on `__getitem__` since that also supports integers)
   - Add tests for the case of StructScalar with duplicate field names, and ensure some behaviours that already worked (`__getitem__` with integers, `__iter__`, `keys()`)
   - Changed `list(scalar)` (`__iter__`) to also return the keys for a "null" scalar (since we do allow `s["key"]` in that case, that seems most consistent?)
   - Changed the `repr` to use a tuple instead of dict representation. I know this just postpones a bit the `as_py()` discussion (should it return dict vs tuple?), but at least this ensures that the `repr` doesn't fail on the short term, and so you can inspect the object.
   - I added an explicit `items()` method (overriding the ones from Mapping), because the return value of this can in theory support duplicate fields, and this can be a way to get a list of tuples already. But, this doesn't fully follows the Mapping API as the return value is different (not a `dict_items` object).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10591: ARROW-13158: [Python] Fix StructScalar contains and repr with duplicate field names

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10591:
URL: https://github.com/apache/arrow/pull/10591#discussion_r658129462



##########
File path: python/pyarrow/scalar.pxi
##########
@@ -652,21 +649,42 @@ cdef class StructScalar(Scalar, collections.abc.Mapping):
 
         try:
             return Scalar.wrap(GetResultValue(sp.field(ref)))
-        except ArrowInvalid:
+        except ArrowInvalid as exc:
             if isinstance(key, int):
-                raise IndexError(key)
+                raise IndexError(key) from exc
             else:
-                raise KeyError(key)
+                raise KeyError(key) from exc
 
     def as_py(self):
         """
         Return this value as a Python dict.
         """
         if self.is_valid:
-            return {k: v.as_py() for k, v in self.items()}
+            try:
+                return {k: self[k].as_py() for k in self.keys()}
+            except KeyError:
+                raise ValueError(
+                    "Converting to Python dictionary is not supported when "
+                    "duplicate field names are present")
         else:
             return None
 
+    def _as_py_tuple(self):
+        # a version that returns a tuple instead of dict to support repr/str
+        # with the presence of duplicate field names
+        if self.is_valid:
+            return [(key, self[i].as_py()) for i, key in enumerate(self)]

Review comment:
       `return list(self.items())`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #10591: ARROW-13158: [Python] Fix StructScalar contains and repr with duplicate field names

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10591:
URL: https://github.com/apache/arrow/pull/10591#issuecomment-874153925


   +1 for merging as is. Can you rebase first?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #10591: ARROW-13158: [Python] Fix StructScalar contains and repr with duplicate field names

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #10591:
URL: https://github.com/apache/arrow/pull/10591#issuecomment-873915328


   Any other comments here? (cc @kszucs)
   
   There is the issue of how we should handle "null" struct scalars:
   
   ```
   In [2]: s = pa.scalar(None, type=pa.struct([('a', pa.int64()), ('b', pa.float64())]))
   
   In [3]: s
   Out[3]: <pyarrow.StructScalar: None>
   
   In [4]: s['a']
   Out[4]: <pyarrow.Int64Scalar: None>
   ```
   
   The above is the current behaviour, where you can access a field of a null value, but so that doesn't distinguish from a scalar with null values in the fields:
   
   ```
   In [5]: s2 = pa.scalar({'a': None, 'b': None}, type=pa.struct([('a', pa.int64()), ('b', pa.float64())]))
   
   In [6]: s2
   Out[6]: <pyarrow.StructScalar: {'a': None, 'b': None}>
   
   In [7]: s2['a']
   Out[7]: <pyarrow.Int64Scalar: None>
   ```
   
   But I would leave that discussion out of this PR, I think this PR already fixes some issues and is useful on its own.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10591: ARROW-13158: [Python] Fix StructScalar contains and repr with duplicate field names

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10591:
URL: https://github.com/apache/arrow/pull/10591#discussion_r658130623



##########
File path: python/pyarrow/tests/test_scalars.py
##########
@@ -514,7 +515,7 @@ def test_struct():
         s['non-existent']
 
     s = pa.scalar(None, type=ty)
-    assert list(s) == []
+    assert list(s) == list(s.keys()) == ['x', 'y']

Review comment:
       Yes, it's a bit difficult. @wesm @xhochy  What do you think?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10591: ARROW-13158: [Python] Fix StructScalar contains and repr with duplicate field names

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #10591:
URL: https://github.com/apache/arrow/pull/10591#discussion_r660380916



##########
File path: python/pyarrow/scalar.pxi
##########
@@ -652,21 +649,42 @@ cdef class StructScalar(Scalar, collections.abc.Mapping):
 
         try:
             return Scalar.wrap(GetResultValue(sp.field(ref)))
-        except ArrowInvalid:
+        except ArrowInvalid as exc:
             if isinstance(key, int):
-                raise IndexError(key)
+                raise IndexError(key) from exc
             else:
-                raise KeyError(key)
+                raise KeyError(key) from exc
 
     def as_py(self):
         """
         Return this value as a Python dict.
         """
         if self.is_valid:
-            return {k: v.as_py() for k, v in self.items()}
+            try:
+                return {k: self[k].as_py() for k in self.keys()}

Review comment:
       `self.items()` (currently) returns tuples with pyarrow scalars, not python scalars (so it could also be `{k: v.as_py() for k, v in self.items()}`)

##########
File path: python/pyarrow/scalar.pxi
##########
@@ -652,21 +649,42 @@ cdef class StructScalar(Scalar, collections.abc.Mapping):
 
         try:
             return Scalar.wrap(GetResultValue(sp.field(ref)))
-        except ArrowInvalid:
+        except ArrowInvalid as exc:
             if isinstance(key, int):
-                raise IndexError(key)
+                raise IndexError(key) from exc
             else:
-                raise KeyError(key)
+                raise KeyError(key) from exc
 
     def as_py(self):
         """
         Return this value as a Python dict.
         """
         if self.is_valid:
-            return {k: v.as_py() for k, v in self.items()}
+            try:
+                return {k: self[k].as_py() for k in self.keys()}
+            except KeyError:
+                raise ValueError(
+                    "Converting to Python dictionary is not supported when "
+                    "duplicate field names are present")
         else:
             return None
 
+    def _as_py_tuple(self):
+        # a version that returns a tuple instead of dict to support repr/str
+        # with the presence of duplicate field names
+        if self.is_valid:
+            return [(key, self[i].as_py()) for i, key in enumerate(self)]

Review comment:
       Same here, I need to `as_py()` version of the scalar.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #10591: ARROW-13158: [Python] Fix StructScalar contains and repr with duplicate field names

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10591:
URL: https://github.com/apache/arrow/pull/10591#issuecomment-867467562


   https://issues.apache.org/jira/browse/ARROW-13158


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10591: ARROW-13158: [Python] Fix StructScalar contains and repr with duplicate field names

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10591:
URL: https://github.com/apache/arrow/pull/10591#discussion_r658129744



##########
File path: python/pyarrow/scalar.pxi
##########
@@ -652,21 +649,42 @@ cdef class StructScalar(Scalar, collections.abc.Mapping):
 
         try:
             return Scalar.wrap(GetResultValue(sp.field(ref)))
-        except ArrowInvalid:
+        except ArrowInvalid as exc:
             if isinstance(key, int):
-                raise IndexError(key)
+                raise IndexError(key) from exc
             else:
-                raise KeyError(key)
+                raise KeyError(key) from exc
 
     def as_py(self):
         """
         Return this value as a Python dict.
         """
         if self.is_valid:
-            return {k: v.as_py() for k, v in self.items()}
+            try:
+                return {k: self[k].as_py() for k in self.keys()}

Review comment:
       `return dict(self.items())`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] xhochy commented on a change in pull request #10591: ARROW-13158: [Python] Fix StructScalar contains and repr with duplicate field names

Posted by GitBox <gi...@apache.org>.
xhochy commented on a change in pull request #10591:
URL: https://github.com/apache/arrow/pull/10591#discussion_r658231643



##########
File path: python/pyarrow/tests/test_scalars.py
##########
@@ -514,7 +515,7 @@ def test_struct():
         s['non-existent']
 
     s = pa.scalar(None, type=ty)
-    assert list(s) == []
+    assert list(s) == list(s.keys()) == ['x', 'y']

Review comment:
       I have no deep experience with scalars so I cannot contribute an opinion here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou closed pull request #10591: ARROW-13158: [Python] Fix StructScalar contains and repr with duplicate field names

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #10591:
URL: https://github.com/apache/arrow/pull/10591


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #10591: ARROW-13158: [Python] Fix StructScalar contains and repr with duplicate field names

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10591:
URL: https://github.com/apache/arrow/pull/10591#issuecomment-874153925


   +1 for merging as is. Can you rebase first?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #10591: ARROW-13158: [Python] Fix StructScalar contains and repr with duplicate field names

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #10591:
URL: https://github.com/apache/arrow/pull/10591#issuecomment-873915328


   Any other comments here? (cc @kszucs)
   
   There is the issue of how we should handle "null" struct scalars:
   
   ```
   In [2]: s = pa.scalar(None, type=pa.struct([('a', pa.int64()), ('b', pa.float64())]))
   
   In [3]: s
   Out[3]: <pyarrow.StructScalar: None>
   
   In [4]: s['a']
   Out[4]: <pyarrow.Int64Scalar: None>
   ```
   
   The above is the current behaviour, where you can access a field of a null value, but so that doesn't distinguish from a scalar with null values in the fields:
   
   ```
   In [5]: s2 = pa.scalar({'a': None, 'b': None}, type=pa.struct([('a', pa.int64()), ('b', pa.float64())]))
   
   In [6]: s2
   Out[6]: <pyarrow.StructScalar: {'a': None, 'b': None}>
   
   In [7]: s2['a']
   Out[7]: <pyarrow.Int64Scalar: None>
   ```
   
   But I would leave that discussion out of this PR, I think this PR already fixes some issues and is useful on its own.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou closed pull request #10591: ARROW-13158: [Python] Fix StructScalar contains and repr with duplicate field names

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #10591:
URL: https://github.com/apache/arrow/pull/10591


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10591: ARROW-13158: [Python] Fix StructScalar contains and repr with duplicate field names

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #10591:
URL: https://github.com/apache/arrow/pull/10591#discussion_r660380916



##########
File path: python/pyarrow/scalar.pxi
##########
@@ -652,21 +649,42 @@ cdef class StructScalar(Scalar, collections.abc.Mapping):
 
         try:
             return Scalar.wrap(GetResultValue(sp.field(ref)))
-        except ArrowInvalid:
+        except ArrowInvalid as exc:
             if isinstance(key, int):
-                raise IndexError(key)
+                raise IndexError(key) from exc
             else:
-                raise KeyError(key)
+                raise KeyError(key) from exc
 
     def as_py(self):
         """
         Return this value as a Python dict.
         """
         if self.is_valid:
-            return {k: v.as_py() for k, v in self.items()}
+            try:
+                return {k: self[k].as_py() for k in self.keys()}

Review comment:
       `self.items()` (currently) returns tuples with pyarrow scalars, not python scalars (so it could also be `{k: v.as_py() for k, v in self.items()}`)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10591: ARROW-13158: [Python] Fix StructScalar contains and repr with duplicate field names

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #10591:
URL: https://github.com/apache/arrow/pull/10591#discussion_r660381390



##########
File path: python/pyarrow/scalar.pxi
##########
@@ -652,21 +649,42 @@ cdef class StructScalar(Scalar, collections.abc.Mapping):
 
         try:
             return Scalar.wrap(GetResultValue(sp.field(ref)))
-        except ArrowInvalid:
+        except ArrowInvalid as exc:
             if isinstance(key, int):
-                raise IndexError(key)
+                raise IndexError(key) from exc
             else:
-                raise KeyError(key)
+                raise KeyError(key) from exc
 
     def as_py(self):
         """
         Return this value as a Python dict.
         """
         if self.is_valid:
-            return {k: v.as_py() for k, v in self.items()}
+            try:
+                return {k: self[k].as_py() for k in self.keys()}
+            except KeyError:
+                raise ValueError(
+                    "Converting to Python dictionary is not supported when "
+                    "duplicate field names are present")
         else:
             return None
 
+    def _as_py_tuple(self):
+        # a version that returns a tuple instead of dict to support repr/str
+        # with the presence of duplicate field names
+        if self.is_valid:
+            return [(key, self[i].as_py()) for i, key in enumerate(self)]

Review comment:
       Same here, I need to `as_py()` version of the scalar.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10591: ARROW-13158: [Python] Fix StructScalar contains and repr with duplicate field names

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #10591:
URL: https://github.com/apache/arrow/pull/10591#discussion_r657765844



##########
File path: python/pyarrow/tests/test_scalars.py
##########
@@ -514,7 +515,7 @@ def test_struct():
         s['non-existent']
 
     s = pa.scalar(None, type=ty)
-    assert list(s) == []
+    assert list(s) == list(s.keys()) == ['x', 'y']

Review comment:
       This is a change in behaviour, but it seems more correct to me, since you can still do `s['x']` (which gives you a null scalar). But on the other hand it conflicts with `s.as_py()` being `None` (and not `{'x': None, 'y': None}`)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org