Posted to dev@arrow.apache.org by Ian Joiner <ia...@gmail.com> on 2021/11/21 09:21:52 UTC

[Python] Shall all-null object type columns in Pandas be converted into Arrow columns with NullType type?

Hi,

When an all NA column from Pandas of the object type is converted to Arrow using Table.from_pandas we get a column of the NullType type as opposed to StringType/BinaryType. Is this actually intended?

Ian Joiner

Re: [Python] Shall all-null object type columns in Pandas be converted into Arrow columns with NullType type?

Posted by Joris Van den Bossche <jo...@gmail.com>.
People ran into similar issues with such all-NA columns with Parquet
as well (with the difference that Parquet actually supports a null
type, but if you have a partitioned dataset, this could lead to
conflicting schemas). The typical workaround is for the user to provide
the schema when writing / converting the data to Arrow. For example,
for this reason, dask added a "schema" keyword to their "to_parquet"
function (https://docs.dask.org/en/latest/generated/dask.dataframe.to_parquet.html),
which also allows specifying the type for just one column, leaving
the others to use the normal type inference.

Now, for ORC writing in Arrow itself, I agree it would be good to
provide a way to write a column of null type.

On Mon, 22 Nov 2021 at 10:52, Antoine Pitrou <an...@python.org> wrote:
>
>
> Le 21/11/2021 à 19:48, Ian Joiner a écrit :
> > I see.
> >
> > Now the question is what we should do about such columns in the ORC writer
> > as well as maybe some other writers since the Null type, as opposed to all
> > Null columns of a numeric or binary type, doesn’t exist in such formats.
>
> We could perhaps add an option to silently turn them into another type,
> but they wouldn't roundtrip properly unless we also serialize the Arrow
> schema as we do in Parquet.

Storing the schema similarly to how we do for Parquet might be a good idea
in general to improve roundtripping? Not only for the null type, but e.g.
also for timestamp resolution and timezones.

>
> For now, people will have to detect such columns and cast them manually,
> I think.
>
> Regards
>
> Antoine.

Re: [Python] Shall all-null object type columns in Pandas be converted into Arrow columns with NullType type?

Posted by Antoine Pitrou <an...@python.org>.
Le 21/11/2021 à 19:48, Ian Joiner a écrit :
> I see.
> 
> Now the question is what we should do about such columns in the ORC writer
> as well as maybe some other writers since the Null type, as opposed to all
> Null columns of a numeric or binary type, doesn’t exist in such formats.

We could perhaps add an option to silently turn them into another type, 
but they wouldn't roundtrip properly unless we also serialize the Arrow 
schema as we do in Parquet.

For now, people will have to detect such columns and cast them manually, 
I think.

Regards

Antoine.

Re: [Python] Shall all-null object type columns in Pandas be converted into Arrow columns with NullType type?

Posted by Ian Joiner <ia...@gmail.com>.
I see.

Now the question is what we should do about such columns in the ORC writer
as well as maybe some other writers since the Null type, as opposed to all
Null columns of a numeric or binary type, doesn’t exist in such formats.

On Sunday, November 21, 2021, Antoine Pitrou <an...@python.org> wrote:

> On Sun, 21 Nov 2021 04:21:52 -0500
> Ian Joiner <ia...@gmail.com> wrote:
> > Hi,
> >
> > When an all NA column from Pandas of the object type is converted to
> Arrow using Table.from_pandas we get a column of the NullType type as
> opposed to StringType/BinaryType. Is this actually intended?
>
> It is. Why would we get a StringType/BinaryType?
>
>
>

Re: [Python] Shall all-null object type columns in Pandas be converted into Arrow columns with NullType type?

Posted by Antoine Pitrou <an...@python.org>.
On Sun, 21 Nov 2021 04:21:52 -0500
Ian Joiner <ia...@gmail.com> wrote:
> Hi,
> 
> When an all NA column from Pandas of the object type is converted to Arrow using Table.from_pandas we get a column of the NullType type as opposed to StringType/BinaryType. Is this actually intended?

It is. Why would we get a StringType/BinaryType?