Posted to dev@parquet.apache.org by Sandeep Joshi <sa...@gmail.com> on 2017/11/26 13:25:26 UTC

parquet-cpp question : ParquetFileWriter and Arrow schema conversion

This might seem like a dumb question, but I am not yet familiar enough with the API
to figure out how to get around this problem.

I have a pre-defined Arrow schema, which I convert to a Parquet schema using
the "ToParquetSchema" function. This returns a SchemaDescriptor object:
https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/schema.h#L80

ParquetFileWriter, on the other hand, expects a shared_ptr<GroupNode>:
https://github.com/apache/parquet-cpp/blob/master/src/parquet/file/writer.h#L126

SchemaDescriptor can return a raw pointer to the GroupNode, but to pass it to
ParquetFileWriter I need a shared_ptr, which introduces memory-management
complications. I'd rather not create a copy of the GroupNode just to pass it
to ParquetFileWriter.

  // convert the Arrow schema to a Parquet schema
  std::shared_ptr<SchemaDescriptor> parquet_schema;
  std::shared_ptr<::parquet::WriterProperties> properties =
      ::parquet::default_writer_properties();
  ToParquetSchema(arrow_sch.get(), *properties, &parquet_schema);

  // write the Arrow table to Parquet
  parquet::schema::GroupNode* g =
      (parquet::schema::GroupNode*)parquet_schema->group_node();
  std::shared_ptr<parquet::schema::GroupNode> grp_node;
  grp_node.reset(g);  // Don't want to do this!
  std::shared_ptr<::arrow::io::FileOutputStream> sink;
  ::arrow::io::FileOutputStream::Open(path, &sink);
  std::unique_ptr<FileWriter> arrow_writer(
      new FileWriter(pool, ParquetFileWriter::Open(sink, grp_node)));

  arrow_writer->WriteTable(*new_table_ptr, 65536);

Is this an API limitation that no one has hit before? Or am I missing a
better way of writing Parquet files given a pre-defined Arrow schema?
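
A possible workaround would be to wrap the raw pointer in a shared_ptr with a
no-op deleter, so the SchemaDescriptor keeps ownership. This is only an
untested sketch and relies on parquet_schema outliving the writer:

  // Untested sketch: a non-owning shared_ptr via a no-op deleter.
  // The SchemaDescriptor still owns the node, so it must outlive the writer.
  auto* raw = (parquet::schema::GroupNode*)parquet_schema->group_node();
  std::shared_ptr<parquet::schema::GroupNode> grp_node(
      raw, [](parquet::schema::GroupNode*) { /* not owned, do not delete */ });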

-Sandeep

Re: parquet-cpp question : ParquetFileWriter and Arrow schema conversion

Posted by Wes McKinney <we...@gmail.com>.
You can see some sample usages in the Cython wrappers of this code for Python:

https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx
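
Those wrappers go through the parquet::arrow layer in C++. For a one-shot
write there is also a convenience function along these lines (a sketch from
memory of the API, so check parquet/arrow/writer.h for the exact signature;
`table` and `path` are placeholders):

  // Sketch: write an arrow::Table straight to a Parquet file, letting
  // parquet::arrow handle the schema conversion internally.
  std::shared_ptr<::arrow::io::FileOutputStream> sink;
  ::arrow::io::FileOutputStream::Open(path, &sink);
  parquet::arrow::WriteTable(*table, ::arrow::default_memory_pool(), sink,
                             /*chunk_size=*/65536);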

On Mon, Nov 27, 2017 at 6:58 AM, Sandeep Joshi <sa...@gmail.com> wrote:
> Thanks! Is there sample code on how to use these APIs, to learn best
> practices?
>
> I am looking at
> https://github.com/apache/arrow/tree/master/cpp/src/arrow/python
> but that only covers Arrow itself
>
> -Sandeep

Re: parquet-cpp question : ParquetFileWriter and Arrow schema conversion

Posted by Sandeep Joshi <sa...@gmail.com>.
Thanks! Is there sample code on how to use these APIs, to learn best
practices?

I am looking at
https://github.com/apache/arrow/tree/master/cpp/src/arrow/python
but that only covers Arrow itself

-Sandeep

On Sun, Nov 26, 2017 at 9:57 PM, Wes McKinney <we...@gmail.com> wrote:

> I think you want to use parquet::arrow::FileWriter::Open
>
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/writer.h#L112
>
> The implementation is here:
>
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/writer.cc#L992
>
> - Wes

Re: parquet-cpp question : ParquetFileWriter and Arrow schema conversion

Posted by Wes McKinney <we...@gmail.com>.
I think you want to use parquet::arrow::FileWriter::Open

https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/writer.h#L112

The implementation is here:

https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/writer.cc#L992

- Wes
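
A minimal sketch of that approach (argument order recalled from the writer.h
linked above, so verify against your parquet-cpp version; `arrow_schema`,
`table`, and `path` stand in for the variables in the original snippet):

  // Sketch: FileWriter::Open converts the Arrow schema to a Parquet schema
  // internally, so there is no need to call ToParquetSchema or handle the
  // GroupNode yourself.
  std::shared_ptr<::arrow::io::FileOutputStream> sink;
  ::arrow::io::FileOutputStream::Open(path, &sink);

  std::unique_ptr<parquet::arrow::FileWriter> writer;
  parquet::arrow::FileWriter::Open(*arrow_schema, ::arrow::default_memory_pool(),
                                   sink, ::parquet::default_writer_properties(),
                                   &writer);
  writer->WriteTable(*table, 65536);
  writer->Close();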
