Posted to jira@arrow.apache.org by "Laurent Erreca (Jira)" <ji...@apache.org> on 2022/09/01 10:09:00 UTC

[jira] [Comment Edited] (ARROW-13438) [C++] Can't use StreamWriter with ToParquetSchema schema

    [ https://issues.apache.org/jira/browse/ARROW-13438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598859#comment-17598859 ] 

Laurent Erreca edited comment on ARROW-13438 at 9/1/22 10:08 AM:
-----------------------------------------------------------------

Hi,

I had a similar issue with a timestamp field:
{code:c++}
...

// Schema definition

parquet::schema::NodeVector fields;
fields.push_back(PrimitiveNode::Make(
    "posdate", Repetition::REQUIRED,
    LogicalType::Timestamp(/*is_adjusted_to_utc=*/false,
                           LogicalType::TimeUnit::MICROS,
                           /*is_from_converted_type=*/false,
                           /*force_set_converted_type=*/false),
    Type::INT64));

std::shared_ptr<GroupNode> schema = std::static_pointer_cast<GroupNode>(
    GroupNode::Make("schema", Repetition::REQUIRED, fields));

...

// Write

parquet::StreamWriter out{
    parquet::ParquetFileWriter::Open(outfile, schema, builder.build())};

// posdate is a timestamp converted to int64

out << posdate; // fails here

{code}
Exception:
{code:bash}
Column converted type mismatch.  Column 'posdate' has converted type[NONE] not 'INT_64'
{code}

Could be a bug in [ColumnCheck|https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/parquet/stream_writer.cc#L200], as mentioned above?

I can bypass this issue by setting isAdjustedToUTC to false and by keeping posdate as a std::chrono::microseconds value, but in that case the output Parquet file's metadata has converted_type set to TIMESTAMP_MICROS.
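
For reference, a minimal sketch of that workaround (untested; it reuses the {{outfile}} and {{builder}} from above, and it sets the fourth {{Timestamp}} argument, {{force_set_converted_type}}, which is what makes the descriptor report a converted type at all):
{code:c++}
// Workaround sketch: keep isAdjustedToUTC = false, but force the converted
// type so StreamWriter's column check sees TIMESTAMP_MICROS instead of NONE.
parquet::schema::NodeVector fields;
fields.push_back(parquet::schema::PrimitiveNode::Make(
    "posdate", parquet::Repetition::REQUIRED,
    parquet::LogicalType::Timestamp(/*is_adjusted_to_utc=*/false,
                                    parquet::LogicalType::TimeUnit::MICROS,
                                    /*is_from_converted_type=*/false,
                                    /*force_set_converted_type=*/true),
    parquet::Type::INT64));

auto schema = std::static_pointer_cast<parquet::schema::GroupNode>(
    parquet::schema::GroupNode::Make("schema", parquet::Repetition::REQUIRED,
                                     fields));

parquet::StreamWriter out{
    parquet::ParquetFileWriter::Open(outfile, schema, builder.build())};

// StreamWriter has an operator<< overload for std::chrono::microseconds,
// so posdate no longer needs a manual int64 conversion.
out << std::chrono::microseconds{1662026880000000};  // example value

// The write succeeds, but the file metadata then reports
// converted_type = TIMESTAMP_MICROS, as noted above.
{code}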


> [C++] Can't use StreamWriter with ToParquetSchema schema
> --------------------------------------------------------
>
>                 Key: ARROW-13438
>                 URL: https://issues.apache.org/jira/browse/ARROW-13438
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 4.0.1
>            Reporter: Vasily Fomin
>            Priority: Major
>
> Hi there,
> First of all, I'm not sure if I'm doing this correctly, as it took a bit of reverse engineering to figure this out. 
> I'm using Arrow 4.0.1 on Ubuntu with C++.
> I followed the streaming example and created:
> {code:cpp}
> #include <cassert>
> #include <chrono>
> #include <cstdint>
> #include <cstring>
> #include <ctime>
> #include <iomanip>
> #include <iostream>
> #include <utility>
> #include "arrow/io/file.h"
> #include "parquet/exception.h"
> #include "parquet/stream_reader.h"
> #include "parquet/stream_writer.h"
> std::shared_ptr<parquet::schema::GroupNode> GetSchema() {
>   parquet::schema::NodeVector fields;
>   fields.push_back(parquet::schema::PrimitiveNode::Make(
>       "int64_field", parquet::Repetition::OPTIONAL, parquet::Type::INT64,
>       parquet::ConvertedType::NONE));
>   return std::static_pointer_cast<parquet::schema::GroupNode>(
>       parquet::schema::GroupNode::Make("schema", parquet::Repetition::REQUIRED, fields));
> }
> int main() {
>   std::shared_ptr<arrow::io::FileOutputStream> outfile;
>   PARQUET_ASSIGN_OR_THROW(
>       outfile,
>       arrow::io::FileOutputStream::Open("parquet-stream-api-example.parquet"));
>   parquet::WriterProperties::Builder builder;
>   parquet::StreamWriter os{parquet::ParquetFileWriter::Open(outfile, GetSchema(), builder.build())};
>   os << int64_t(10);  // throws: converted type mismatch (see below)
>   return 0;
> }
> {code}
> The code terminates with:
> {code:java}
> terminate called after throwing an instance of 'parquet::ParquetException'
>   what():  Column converted type mismatch.  Column 'int64_field' has converted type[NONE] not 'INT_64' {code}
> What I'm not sure about is the {{parquet::ConvertedType::NONE}} part. The example provides this value even for primitives, but is the matching converted type actually necessary? If I do provide it, the code works.
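> For example (a sketch of the change I mean), declaring the node with the matching converted type makes the write succeed:
> {code:cpp}
>   // Same field as above, but with ConvertedType::INT_64 instead of NONE:
>   fields.push_back(parquet::schema::PrimitiveNode::Make(
>       "int64_field", parquet::Repetition::OPTIONAL, parquet::Type::INT64,
>       parquet::ConvertedType::INT_64));
> {code}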
> Now, to the reverse engineering part. I'm trying to write to Parquet using {{StreamWriter}}. {{StreamWriter}} requires a {{parquet::schema::GroupNode}} as the schema, but I begin with an {{arrow::Schema}}. I [found|https://github.com/apache/arrow/blob/e990d177b1f1dec962315487682f613d46be573c/cpp/src/parquet/arrow/writer.cc#L442] that it can be converted to a {{parquet::SchemaDescriptor}} using the {{parquet::arrow::ToParquetSchema}} utility. Looking at the utility's [implementation|https://github.com/apache/arrow/blob/85f192a45755b3f15653fdc0a8fbd788086e125f/cpp/src/parquet/arrow/schema.cc#L322], I can see that {{logical_type}} is set to {{None}}, which equals {{parquet::ConvertedType::NONE}}, and hence the converted schema can't be used due to the issue I described above.
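> Roughly, the conversion path I tried looks like this (a sketch, not a full program; {{arrow_schema}} stands for the {{arrow::Schema}} I start from):
> {code:cpp}
> #include "parquet/arrow/schema.h"
>
> // Sketch: convert an arrow::Schema into a parquet::SchemaDescriptor, then
> // hand its root GroupNode to StreamWriter.
> std::shared_ptr<parquet::SchemaDescriptor> parquet_schema;
> PARQUET_THROW_NOT_OK(parquet::arrow::ToParquetSchema(
>     arrow_schema.get(), *builder.build(), &parquet_schema));
>
> auto root = std::static_pointer_cast<parquet::schema::GroupNode>(
>     parquet_schema->schema_root());
>
> // This writer then throws the same converted-type mismatch on the first
> // value, because the descriptor's converted types are all NONE.
> parquet::StreamWriter os{
>     parquet::ParquetFileWriter::Open(outfile, root, builder.build())};
> {code}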
>  # Do we need to provide {{ConvertedType}} even for primitives?
>  # Is it a bug in the schema conversion utility, or in the [ColumnCheck|https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/parquet/stream_writer.cc#L200] assert?
>  # Or is it expected behavior? If so, what's the suggested approach: building a Parquet schema directly instead of an Arrow schema?
> Thank you,
> Vasily.


