Posted to user@arrow.apache.org by Arun Joseph <aj...@gmail.com> on 2022/09/13 15:58:17 UTC

[C++] How to write a null value to a int64 column with Parquet StreamWriter?

Hi all,

I've tried defining my field with the following:

fields.push_back(
  parquet::schema::PrimitiveNode::Make(
    "field_name",
    parquet::Repetition::REQUIRED,
    parquet::Type::INT64,
    parquet::ConvertedType::INT_64)
);

and I'm not sure if it's possible to specify a null value for an int64
column. I understand that C++ ints don't have a null value. I write to the
field with the following:

os << std::numeric_limits<int64_t>::quiet_NaN();

where os is:

parquet::WriterProperties::Builder builder_;
parquet::StreamWriter os {parquet::ParquetFileWriter::Open(outfile_,
schema_, builder_.build())};

This (as expected) writes a 0 for the value. But is there a way to specify
a null value? From my understanding, parquet::Repetition::OPTIONAL is meant
for repeating groups.
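
For reference, a minimal standalone sketch of why a 0 ends up in the file:
std::numeric_limits<T>::quiet_NaN() is only meaningful for types where
has_quiet_NaN is true, and for integer types it returns 0. The function names
below are hypothetical and nothing here touches the Parquet API.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// int64_t has no NaN representation; this is known at compile time.
static_assert(!std::numeric_limits<std::int64_t>::has_quiet_NaN,
              "int64_t cannot represent NaN");

// Returns what the expression from the post evaluates to.
std::int64_t int64_quiet_nan() {
  return std::numeric_limits<std::int64_t>::quiet_NaN();
}

// Runtime check: the "NaN" is indistinguishable from a genuine zero,
// so it cannot serve as a null marker in an int64 column.
void check_nan_is_zero() {
  assert(int64_quiet_nan() == 0);
}
```

So the value reaching the writer is an ordinary 0, and the nullness has to be
expressed some other way than through the integer itself.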

My actual use case is representing a null Linux epoch timestamp in
nanoseconds, e.g. as NaN or NaT in the pandas DataFrame that results from
reading the written Parquet file. It seems that pandas implicitly casts int
columns with nulls to float, but I think Parquet itself is able to represent
a null value here. Is converting the column to a float the only way to
achieve this, or is there a way to specify that a value is null in parquet cpp?

Thank You,
Arun Joseph

Re: [C++] How to write a null value to a int64 column with Parquet StreamWriter?

Posted by Micah Kornfield <em...@gmail.com>.

>
> terminate called after throwing an instance of 'parquet::ParquetException'
>   what():  Column converted type mismatch.  Column 'field_name' has
> converted type[NONE] not 'INT_64'

I think this is probably a bug in the streaming library, which should
also be checking the LogicalType; it has been a while since I looked at the
code.  Nanoseconds aren't supported by ConvertedType, which is a deprecated
concept in Parquet.

> Regarding nullopt compatibility with the ParquetStreamWriter, is that
> something that should work without a template parameter? The compiler error
> that gets thrown is:

I think you need to add an overload for nullopt specifically, since it is a
different type in C++ than the empty optional<int64_t>.
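
The type distinction can be sketched without Parquet at all. ToyWriter below
is a hypothetical stand-in, not the parquet::StreamWriter API, and it uses
C++17 std::optional where the thread uses arrow::util::optional. It assumes
the writer's optional overload is a template on the value type; in that case
a bare nullopt gives the compiler nothing to deduce T from, so only a
dedicated nullopt_t overload makes the expression compile.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical stand-in for a stream writer (NOT parquet::StreamWriter).
// It records what it was asked to write so the behaviour can be inspected.
class ToyWriter {
 public:
  // Plain value overload, analogous to the candidate in the error output.
  ToyWriter& operator<<(std::int64_t v) {
    log_ += std::to_string(v) + ";";
    return *this;
  }

  // Templated optional overload: T must be deducible from the argument.
  template <typename T>
  ToyWriter& operator<<(const std::optional<T>& v) {
    if (v.has_value()) return *this << *v;
    log_ += "null;";
    return *this;
  }

  // Dedicated overload for the nullopt tag type, which carries no T.
  ToyWriter& operator<<(std::nullopt_t) {
    log_ += "null;";
    return *this;
  }

  const std::string& log() const { return log_; }

 private:
  std::string log_;
};

// Writes a value, an empty optional, and a bare nullopt.
std::string demo() {
  ToyWriter w;
  w << std::int64_t{42} << std::optional<std::int64_t>{} << std::nullopt;
  return w.log();
}
```

Removing the nullopt_t overload from this toy class leaves no viable
operator<< for std::nullopt, which is the same kind of "no match" error
reported in this thread.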





Re: [C++] How to write a null value to a int64 column with Parquet StreamWriter?

Posted by Arun Joseph <aj...@gmail.com>.

I've tried the following schema:

            fields.push_back(
                parquet::schema::PrimitiveNode::Make(
                    "field_name", parquet::Repetition::OPTIONAL,
                    parquet::LogicalType::Timestamp(
                        true, parquet::LogicalType::TimeUnit::NANOS),
                    parquet::Type::INT64)
            );

But when I try to insert a value, I get the following exception:

terminate called after throwing an instance of 'parquet::ParquetException'
  what():  Column converted type mismatch.  Column 'field_name' has
converted type[NONE] not 'INT_64'

I don't really understand how ConvertedType vs. LogicalType works with
respect to the two different versions of Make. However, the Make call with
ConvertedType does not seem like it would support Timestamp.

Regarding nullopt compatibility with the ParquetStreamWriter, is that
something that should work without a template parameter? The compiler error
that gets thrown is:
./include/writer.h:181:33: error: no match for ‘operator<<’ (operand types
are ‘parquet::StreamWriter’ and ‘const nonstd::optional_lite::nullopt_t’)
  181 |                     writer_.os_ << arrow::util::nullopt;
      |                     ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~

This is followed by a series of notes in the following format, one for each
of the different types:

/home/ajoseph/local/arrow/include/parquet/stream_writer.h:110:17: note:
candidate: ‘parquet::StreamWriter&
parquet::StreamWriter::operator<<(int64_t)’
  110 |   StreamWriter& operator<<(int64_t v);
      |                 ^~~~~~~~
/home/ajoseph/local/arrow/include/parquet/stream_writer.h:110:36: note:
no known conversion for argument 1 from ‘const
nonstd::optional_lite::nullopt_t’ to ‘int64_t’ {aka ‘long int’}
  110 |   StreamWriter& operator<<(int64_t v);
      |                            ~~~~~~~~^

I can try to contribute a solution, but I've never contributed to an Apache
project before. I can try to take a look this weekend or after work one of
these days if this is an actual issue (since there seems to be a workaround
with arrow::util::optional<int64_t>()).

-- 
Arun Joseph

Re: [C++] How to write a null value to a int64 column with Parquet StreamWriter?

Posted by Micah Kornfield <em...@gmail.com>.

I'm not sure how it works with null elements, but passing a LogicalType of
timestamp with isAdjustedToUtc=true and a nanoseconds unit when creating the
schema would be the most likely thing to work.

The fact that nullopt doesn't work seems like an oversight that might be
nice to address if you would like to contribute to the project.

Re: [C++] How to write a null value to a int64 column with Parquet StreamWriter?

Posted by Arun Joseph <aj...@gmail.com>.

Hi Micah,

I couldn't find arrow::util::Optional::nullopt, but I did find
arrow::util::nullopt, which also did not seem to work. However, I then
found arrow::util::optional<T>() right after, which seems to output NaNs!

I do see that the resulting DataFrame, when loaded in pandas, has the column
dtype float64. Do you know if there is a way to define the schema such
that I can input a uint64_t (Linux epoch time in nanoseconds) and have it
output as datetime64[ns] in parquet cpp?
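
One detail related to the uint64_t input: the Parquet timestamp logical type
annotates the signed INT64 physical type, so an unsigned nanosecond count has
to be converted before it is written. A minimal sketch of a checked
conversion follows; ToSignedNanos is a hypothetical helper name, not part of
Arrow or Parquet, and it uses C++17 std::optional rather than
arrow::util::optional.

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical helper: converts unsigned epoch nanoseconds to the signed
// 64-bit value that an INT64 timestamp column stores. Returns an empty
// optional when the input does not fit in int64_t, which a caller could
// choose to write as a null.
std::optional<std::int64_t> ToSignedNanos(std::uint64_t epoch_nanos) {
  constexpr auto kMax =
      static_cast<std::uint64_t>(std::numeric_limits<std::int64_t>::max());
  if (epoch_nanos > kMax) return std::nullopt;
  return static_cast<std::int64_t>(epoch_nanos);
}
```

Returning an optional keeps the "no valid timestamp" case separate from any
real integer value, instead of relying on a sentinel such as 0.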

Thank You,
Arun

-- 
Arun Joseph

Re: [C++] How to write a null value to a int64 column with Parquet StreamWriter?

Posted by Micah Kornfield <em...@gmail.com>.

Hi Arun,
The schema should use `parquet::Repetition::OPTIONAL`;
`parquet::Repetition::REPEATED` is for repeated groups.  IIRC you can insert
arrow::util::Optional::nullopt into the stream for a null value.

Hope this helps.

Micah
