You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by anders johansson <an...@tickup.se> on 2020/12/28 15:51:33 UTC

Fix for bug in parquet stream writer

Hi,

When writing to a primitive node of a logical type not supported by
converted_type (such as parquet::LogicalType::TimeUnit::NANOS), the error
"Column converted type mismatch" is thrown. As I understand it, the
converted_type logic is legacy. The problem is solved by removing

  if (converted_type != node->converted_type()) {
    throw ParquetException("Column converted type mismatch.  Column '" +
node->name() +
                           "' has converted type[" +
                           ConvertedTypeToString(node->converted_type()) +
"] not '" +
                           ConvertedTypeToString(converted_type) + "'");
  }

from StreamWriter::CheckColumn() in src/parquet/stream_writer.cc

BR,
//Anders

Re: Fix for bug in parquet stream writer

Posted by anders johansson <an...@tickup.se>.
Hi,

Thank you for your answer Gawain, an unfinished implementation explains a
lot of the problems we have had using the library.

I 'm not sure I understand your proposed solution correctly but how is it
possible to, for example, differentiate between a uint32 and a decimal
represented using a uint32, using operator overloading?

//A

On Tue, Dec 29, 2020 at 10:20 AM Gawain Bolton <ga...@free.fr>
wrote:

> Hello,
>
> I contributed the parquet StreamWriter class and I can tell you that
> the type checking is not legacy logic and removing it is not the
> solution.
>
> As you have suggested the solution is to add an output operator to add
> support for the type and this can be done by deriving from the
> StreamWriter class.
>
> Ideally though all basic C++ and parquet types should be supported.
>  Users should only have to  add output operators for custom types.
>
> So I think a PR to add missing support for any parquet types  would be
> nice.
>
> Cheers,
>
> Gawain
>
> On Mon, 2020-12-28 at 17:34 +0100, anders johansson wrote:
> > Hi,
> >
> > Not sure if my fix is a correct or permanent solution, depending on your
> > plan regarding backwards compatibility, etc. Someone with a deeper
> > understanding of how the code works or is supposed to work should
> probably
> > have a look at it.
> >
> > A temporary workaround is to inherit from StreamWriter and define your
> own
> > ostream overloads for logical types that are not supported. Example for
> > date 32:
> >
> > class StreamWriterEx : public StreamWriter {
> >
> >   void WriteDate32Raw(int32_t d) {
> >     CheckColumn(Type::INT32, ConvertedType::DATE);
> >     Write<Int32Writer>(d);
> >   }
> > };
> >
> >
> > To determine the expected ConvertedType, run the code once and look at
> the
> > error message before inheriting, or just do without the CheckColumn()
> call.
> >
> > //A
> >
> >
> > On Mon, Dec 28, 2020 at 4:55 PM Wes McKinney <we...@gmail.com>
> wrote:
> >
> > > hi Anders, would you like to open a Jira issue and submit a PR (with
> > > unit test)?
> > >
> > > On Mon, Dec 28, 2020 at 9:51 AM anders johansson
> > > <an...@tickup.se> wrote:
> > > >
> > > > Hi,
> > > >
> > > > When writing to a primitive node of a logical type not supported by
> > > > converted_type (such as parquet::LogicalType::TimeUnit::NANOS), the
> error
> > > > "Column converted type mismatch" is thrown. As I understand it, the
> > > > converted_type logic is legacy. The problem is solved by removing
> > > >
> > > >   if (converted_type != node->converted_type()) {
> > > >     throw ParquetException("Column converted type mismatch.  Column
> '" +
> > > > node->name() +
> > > >                            "' has converted type[" +
> > > >
> ConvertedTypeToString(node->converted_type())
> > > +
> > > > "] not '" +
> > > >                            ConvertedTypeToString(converted_type) +
> "'");
> > > >   }
> > > >
> > > > from StreamWriter::CheckColumn() in src/parquet/stream_writer.cc
> > > >
> > > > BR,
> > > > //Anders
> > >
>

Re: Fix for bug in parquet stream writer

Posted by Gawain Bolton <ga...@free.fr>.
Hello,

I contributed the parquet StreamWriter class and I can tell you that
the type checking is not legacy logic and removing it is not the
solution.

As you have suggested the solution is to add an output operator to add
support for the type and this can be done by deriving from the
StreamWriter class.

Ideally though all basic C++ and parquet types should be supported.
 Users should only have to  add output operators for custom types.

So I think a PR to add missing support for any parquet types  would be
nice.

Cheers,

Gawain

On Mon, 2020-12-28 at 17:34 +0100, anders johansson wrote:
> Hi,
> 
> Not sure if my fix is a correct or permanent solution, depending on your
> plan regarding backwards compatibility, etc. Someone with a deeper
> understanding of how the code works or is supposed to work should probably
> have a look at it.
> 
> A temporary workaround is to inherit from StreamWriter and define your own
> ostream overloads for logical types that are not supported. Example for
> date 32:
> 
> class StreamWriterEx : public StreamWriter {
> 
>   void WriteDate32Raw(int32_t d) {
>     CheckColumn(Type::INT32, ConvertedType::DATE);
>     Write<Int32Writer>(d);
>   }
> };
> 
> 
> To determine the expected ConvertedType, run the code once and look at the
> error message before inheriting, or just do without the CheckColumn() call.
> 
> //A
> 
> 
> On Mon, Dec 28, 2020 at 4:55 PM Wes McKinney <we...@gmail.com> wrote:
> 
> > hi Anders, would you like to open a Jira issue and submit a PR (with
> > unit test)?
> > 
> > On Mon, Dec 28, 2020 at 9:51 AM anders johansson
> > <an...@tickup.se> wrote:
> > > 
> > > Hi,
> > > 
> > > When writing to a primitive node of a logical type not supported by
> > > converted_type (such as parquet::LogicalType::TimeUnit::NANOS), the error
> > > "Column converted type mismatch" is thrown. As I understand it, the
> > > converted_type logic is legacy. The problem is solved by removing
> > > 
> > >   if (converted_type != node->converted_type()) {
> > >     throw ParquetException("Column converted type mismatch.  Column '" +
> > > node->name() +
> > >                            "' has converted type[" +
> > >                            ConvertedTypeToString(node->converted_type())
> > +
> > > "] not '" +
> > >                            ConvertedTypeToString(converted_type) + "'");
> > >   }
> > > 
> > > from StreamWriter::CheckColumn() in src/parquet/stream_writer.cc
> > > 
> > > BR,
> > > //Anders
> > 

Re: Fix for bug in parquet stream writer

Posted by anders johansson <an...@tickup.se>.
Hi,

Not sure if my fix is a correct or permanent solution, depending on your
plan regarding backwards compatibility, etc. Someone with a deeper
understanding of how the code works or is supposed to work should probably
have a look at it.

A temporary workaround is to inherit from StreamWriter and define your own
ostream overloads for logical types that are not supported. Example for
date 32:

class StreamWriterEx : public StreamWriter {

  void WriteDate32Raw(int32_t d) {
    CheckColumn(Type::INT32, ConvertedType::DATE);
    Write<Int32Writer>(d);
  }
};


To determine the expected ConvertedType, run the code once and look at the
error message before inheriting, or just do without the CheckColumn() call.

//A


On Mon, Dec 28, 2020 at 4:55 PM Wes McKinney <we...@gmail.com> wrote:

> hi Anders, would you like to open a Jira issue and submit a PR (with
> unit test)?
>
> On Mon, Dec 28, 2020 at 9:51 AM anders johansson
> <an...@tickup.se> wrote:
> >
> > Hi,
> >
> > When writing to a primitive node of a logical type not supported by
> > converted_type (such as parquet::LogicalType::TimeUnit::NANOS), the error
> > "Column converted type mismatch" is thrown. As I understand it, the
> > converted_type logic is legacy. The problem is solved by removing
> >
> >   if (converted_type != node->converted_type()) {
> >     throw ParquetException("Column converted type mismatch.  Column '" +
> > node->name() +
> >                            "' has converted type[" +
> >                            ConvertedTypeToString(node->converted_type())
> +
> > "] not '" +
> >                            ConvertedTypeToString(converted_type) + "'");
> >   }
> >
> > from StreamWriter::CheckColumn() in src/parquet/stream_writer.cc
> >
> > BR,
> > //Anders
>

Re: Fix for bug in parquet stream writer

Posted by Wes McKinney <we...@gmail.com>.
hi Anders, would you like to open a Jira issue and submit a PR (with
unit test)?

On Mon, Dec 28, 2020 at 9:51 AM anders johansson
<an...@tickup.se> wrote:
>
> Hi,
>
> When writing to a primitive node of a logical type not supported by
> converted_type (such as parquet::LogicalType::TimeUnit::NANOS), the error
> "Column converted type mismatch" is thrown. As I understand it, the
> converted_type logic is legacy. The problem is solved by removing
>
>   if (converted_type != node->converted_type()) {
>     throw ParquetException("Column converted type mismatch.  Column '" +
> node->name() +
>                            "' has converted type[" +
>                            ConvertedTypeToString(node->converted_type()) +
> "] not '" +
>                            ConvertedTypeToString(converted_type) + "'");
>   }
>
> from StreamWriter::CheckColumn() in src/parquet/stream_writer.cc
>
> BR,
> //Anders