Posted to github@arrow.apache.org by "chairmank (via GitHub)" <gi...@apache.org> on 2023/04/28 20:08:23 UTC

[GitHub] [arrow] chairmank commented on issue #19157: [Python] Implement pa.array() with type=union type

chairmank commented on issue #19157:
URL: https://github.com/apache/arrow/issues/19157#issuecomment-1528037394

   I would also like `pyarrow.array` to automatically convert Python values when a sparse union or dense union type is explicitly specified. I frequently use dense union types to represent data that originated in protocol buffers with `oneof` fields. It is inconvenient to have to implement special handling of this case when the target Arrow schema is known.
   
   Also, I would like to politely observe that example code snippets in previous comments are misleading, because they do not distinguish between child fields that happen to have the same data type.
   
   > [Antoine Pitrou](https://issues.apache.org/jira/browse/ARROW-2774?focusedCommentId=17392943) / @pitrou: I'm still not convinced this is a good idea. Consider `pa.array([1, 2.3])`. Should it return a `union<int64, float64>`?
   > 
   > cc @amol- for advice.
   
   > [Joris Van den Bossche](https://issues.apache.org/jira/browse/ARROW-2774?focusedCommentId=17393143) / @jorisvandenbossche: Agreed that we shouldn't do that by default, but we can keep this issue about actually supporting it? Because now construction of a union array from a python sequence is not even supported when explicitly mentioning the type.
   > 
   > ```python
   > In [52]: typ = pa.union([pa.field("int", "int64"), pa.field("float", "float64")], mode="sparse")
   > 
   > In [53]: pa.array([1, 2.3], type=typ)
   > ...
   > ArrowNotImplementedError: sparse_union
   > ../src/arrow/util/converter.h:265  VisitTypeInline(*visitor.type, &visitor)
   > ../src/arrow/python/python_to_arrow.cc:1015  (MakeConverter<PyConverter, PyConverterTrait>( options.type, options, pool))
   > ```
   
   As an example, consider the following union type:
   ```
   >>> import pyarrow as pa
   >>> string_predicate_type = pa.dense_union([
   ...     pa.field("equals", pa.string(), False),
   ...     pa.field("regexp", pa.string(), False),
   ...     pa.field("is_null", pa.null()),
   ... ])
   >>> string_predicate_type
   DenseUnionType(dense_union<equals: string not null=0, regexp: string not null=1, is_null: null=2>)
   ```
   
   Both `equals` and `regexp` have type string, but they are semantically distinct, so the Python value alone cannot identify the child field. For `pyarrow.array` to convert Python values to the correct child field, the values ought to be tagged with the field name:
   ```
   pa.array([{"equals": "foo"}, {"regexp": "[0-9a-f]{16}"}, {"is_null": None}], type=string_predicate_type)
   ```
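
Until something like that is supported, the special handling I mentioned can be sketched roughly as follows. This is only an illustration under my own assumptions: the helper names `split_tagged_values` and `tagged_to_dense_union` are hypothetical, each input value is assumed to be a dict with exactly one key naming the child field, and type codes are assumed to be the default 0..n-1. The only pyarrow entry point it relies on is the existing `pa.UnionArray.from_dense`.

```python
def split_tagged_values(values, field_names):
    """Split single-key dicts into dense-union pieces (plain Python lists).

    Returns (type_ids, offsets, children), where children[i] holds the
    payloads for field_names[i] in order of appearance.
    """
    index_of = {name: i for i, name in enumerate(field_names)}
    if len(index_of) != len(field_names):
        raise ValueError("field names must be unique to serve as tags")
    type_ids = []
    offsets = []
    children = [[] for _ in field_names]
    for value in values:
        if len(value) != 1:
            raise ValueError(f"expected exactly one tag, got {value!r}")
        tag, payload = next(iter(value.items()))
        child = index_of[tag]  # raises KeyError for an unknown tag
        type_ids.append(child)
        # In a dense union, the offset is the position inside the child array.
        offsets.append(len(children[child]))
        children[child].append(payload)
    return type_ids, offsets, children


def tagged_to_dense_union(values, fields):
    """Build a dense UnionArray from tagged dicts (hypothetical helper).

    `fields` is a list of pyarrow.Field objects, one per union child.
    """
    # Imported here so split_tagged_values stays usable without pyarrow.
    import pyarrow as pa

    names = [f.name for f in fields]
    type_ids, offsets, children = split_tagged_values(values, names)
    return pa.UnionArray.from_dense(
        pa.array(type_ids, type=pa.int8()),
        pa.array(offsets, type=pa.int32()),
        [pa.array(child, type=f.type) for child, f in zip(children, fields)],
        names,
    )
```

Calling `tagged_to_dense_union(values, fields)` with the tagged list above and the three `pa.field(...)` objects from the type definition should give a dense union array with the same field names. I have not verified that the `not null` flags on the child fields survive this route, so the resulting type may not compare equal to `string_predicate_type`.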
   