Posted to user@arrow.apache.org by Grant Williams <gr...@grantwilliams.dev> on 2022/01/28 17:04:26 UTC

Pyarrow uint32() int64() column type mismatch bug?

Hello,

I've found that if you write a file whose schema specifies column A as a
uint32() type, then read the file back and inspect the schema, it shows
column A as int64(). This issue appears to be unique to the uint32() type;
I was unable to reproduce a mismatch with any of the other integer or
float types.

The following is a link to a gist showing a minimal code example and the
output from it:
https://gist.github.com/grantmwilliams/1ceb490312c59e4fb6e4bc15b57e9707.

I'm not sure if this is a problem with the physical datatype being actually
written as int64, or if the metadata for the file is just wrong instead.
Does anyone have any idea what could be causing this? Or whether it's just
a metadata issue or an actual physical type error?

Thanks,
Grant W.
-- 
Grant Williams
Machine Learning Engineer
https://github.com/grantmwilliams/

Re: Pyarrow uint32() int64() column type mismatch bug?

Posted by Micah Kornfield <em...@gmail.com>.

I think I'd be OK with this, though I'm not sure we've defined when we want
to have log messages and when we don't. If you think this is useful, please
open a JIRA and others can chime in.

FWIW, I think in the next release or two we will likely be switching the
default version to 2.4 or 2.6.

-Micah



On Fri, Jan 28, 2022 at 9:31 AM Grant Williams <gr...@grantwilliams.dev>
wrote:

> Thank you, Micah! That makes sense.
>
> Do you have any thoughts about maybe adding a logged warning if a user
> calls write_table() and uint32() is in the given schema?

Re: Pyarrow uint32() int64() column type mismatch bug?

Posted by Grant Williams <gr...@grantwilliams.dev>.

Thank you, Micah! That makes sense.

Do you have any thoughts about maybe adding a logged warning if a user
calls write_table() and uint32() is in the given schema?

On Fri, Jan 28, 2022 at 11:15 AM Micah Kornfield <em...@gmail.com>
wrote:

> Hi Grant,
> This is intended behavior: by default, Parquet files are written using
> version 1 of the logical types. Version 1 does not support annotating fields as
> uint32, so to preserve the values on a round trip they are cast to int64. If
> you wish to keep the type, setting the version kwarg to 2.4 or 2.6 [1]
> should work.
>
> Cheers,
> Micah
>
>
> [1]
> https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html


Re: Pyarrow uint32() int64() column type mismatch bug?

Posted by Micah Kornfield <em...@gmail.com>.

Hi Grant,
This is intended behavior: by default, Parquet files are written using
version 1 of the logical types. Version 1 does not support annotating fields as
uint32, so to preserve the values on a round trip they are cast to int64. If
you wish to keep the type, setting the version kwarg to 2.4 or 2.6 [1]
should work.

Cheers,
Micah


[1]
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html
