Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/02/08 12:09:00 UTC
[jira] [Updated] (ARROW-11553) [Python] Make Table.cast(schema) more flexible regarding order of fields / missing fields?

     [ https://issues.apache.org/jira/browse/ARROW-11553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche updated ARROW-11553:
------------------------------------------
    Description: 
Currently, {{Table.cast}} requires the target schema to have exactly the same field names as the table, in the same order (it simply does a {{self.schema.names != target_schema.names: raise ...}} check). Example:

{code:python}
>>> import pyarrow as pa
>>> table = pa.table({'a': [1, 2, 3], 'b': [.1, .2, .3]})
>>> table
pyarrow.Table
a: int64
b: double

>>> schema = pa.schema([('a', pa.int32()), ('b', pa.float32())])
>>> table.cast(schema)
pyarrow.Table
a: int32
b: float

>>> schema2 = pa.schema([('b', pa.float32()), ('a', pa.int32())])
>>> table.cast(schema2)
....
ValueError: Target schema's field names are not matching the table's field names: ['a', 'b'], ['b', 'a']
{code}

Do we want to make this more flexible? Should a different field order be allowed (and if so, should the result follow the order of the passed schema or of the original table)? Should missing fields be allowed (so that the fields of the schema are also used to "subset" the table)?
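
To make the options concrete, here is a rough sketch of one possible behaviour (follow the order of the passed schema, optionally allow subsetting). It is not an agreed design: {{plan_flexible_cast}}, {{flexible_cast}} and the {{allow_subset}} flag are hypothetical names, and the sketch assumes unique field names and that {{Table.select}} is available to reorder and subset columns before the existing strict {{Table.cast}} runs.

```python
def plan_flexible_cast(table_names, target_names, allow_subset=True):
    """Return the column names to select, in target-schema order.

    Hypothetical helper: relaxes the strict name-equality check by
    tolerating a different order and, optionally, table fields that
    are absent from the target schema. Assumes unique field names.
    """
    # Fields the target asks for but the table does not have can never be cast.
    unknown = [name for name in target_names if name not in table_names]
    if unknown:
        raise ValueError(f"Target schema has fields not in the table: {unknown}")

    # Table fields left out of the target are dropped only if subsetting is allowed.
    dropped = [name for name in table_names if name not in target_names]
    if dropped and not allow_subset:
        raise ValueError(f"Target schema is missing table fields: {dropped}")

    return list(target_names)


def flexible_cast(table, target_schema, allow_subset=True):
    """Reorder/subset `table` to match `target_schema`, then cast (sketch)."""
    names = plan_flexible_cast(
        table.schema.names, target_schema.names, allow_subset
    )
    # After select(), the names match the target exactly, so the existing
    # strict Table.cast check passes.
    return table.select(names).cast(target_schema)
```

With the {{table}} and {{schema2}} from the example above, {{flexible_cast(table, schema2)}} would be expected to return the columns in the order {{b}}, {{a}} instead of raising. Following the original table's order instead would only require returning the table's names filtered by the target's names.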


  was:
Currently, {{Table.cast}} requires a new schema with exactly the same names and same order of those names (it simply does a {{self.schema.names != target_schema.names: raise ...}} check). Example:

{code: python}
In [5]: table = pa.table({'a': [1, 2, 3], 'b': [.1, .2, .3]})

In [7]: table
Out[7]: 
pyarrow.Table
a: int64
b: double

In [9]: schema = pa.schema([('a', pa.int32()), ('b', pa.float32())])

In [10]: table.cast(schema)
Out[10]: 
pyarrow.Table
a: int32
b: float

In [11]: schema2 = pa.schema([('b', pa.float32()), ('a', pa.int32())])

In [12]: table.cast(schema2)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-0c712db0c16a> in <module>
----> 1 table.cast(schema2)

~/scipy/repos/arrow/python/pyarrow/table.pxi in pyarrow.lib.Table.cast()

ValueError: Target schema's field names are not matching the table's field names: ['a', 'b'], ['b', 'a']
{code}

Do we want to make this more flexible? Allow different order? (and the follow order of the passed schema or of the original table?) Allow missing fields? (and then use the fields of the schema to "subset" as well?)



> [Python] Make Table.cast(schema) more flexible regarding order of fields / missing fields?
> ------------------------------------------------------------------------------------------
>
>                 Key: ARROW-11553
>                 URL: https://issues.apache.org/jira/browse/ARROW-11553
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Priority: Major


