Posted to dev@parquet.apache.org by Sandeep Joshi <sa...@gmail.com> on 2017/11/02 07:38:05 UTC

Record Conversion API in parquet-cpp

The parquet-mr version has the Record Conversion API (RecordMaterializer,
RecordConsumer), which can be used to convert rows/tuples to and from the
Parquet columnar format.

https://github.com/apache/parquet-mr/tree/master/parquet-column/src/main/java/org/apache/parquet/io/api
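
For illustration, a C++ analogue of that interface might look roughly like
this (purely hypothetical; parquet-cpp has no such class today, and the
names below just mirror parquet-mr's RecordConsumer):

    // Hypothetical C++ mirror of org.apache.parquet.io.api.RecordConsumer.
    // Nothing like this exists in parquet-cpp yet; names are illustrative.
    #include <cstdint>
    #include <string>

    class RecordConsumer {
     public:
      virtual ~RecordConsumer() = default;
      virtual void StartMessage() = 0;   // begin one record
      virtual void EndMessage() = 0;     // end one record
      virtual void StartField(const std::string& name, int index) = 0;
      virtual void EndField(const std::string& name, int index) = 0;
      virtual void StartGroup() = 0;     // enter a nested group
      virtual void EndGroup() = 0;       // leave a nested group
      virtual void AddInt32(int32_t v) = 0;
      virtual void AddInt64(int64_t v) = 0;
      virtual void AddDouble(double v) = 0;
      virtual void AddBoolean(bool v) = 0;
      virtual void AddBinary(const std::string& v) = 0;
    };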

Are there any plans to add the same functionality to the parquet-cpp
codebase?

I checked JIRA and couldn't find any outstanding issue, although the
GitHub README does say "The 3rd layer would handle reading/writing records."
https://github.com/apache/parquet-cpp/blob/master/README.md/

-Sandeep

Re: Record Conversion API in parquet-cpp

Posted by Sandeep Joshi <sa...@gmail.com>.
Thanks, Uwe! I will go with Arrow.
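
For anyone who lands on this thread later, the Arrow path looks roughly
like the sketch below. It is written against the parquet::arrow API
(WriteTable / OpenFile / ReadTable); exact signatures and error handling
vary between versions, so treat it as illustrative:

    #include <memory>
    #include <arrow/api.h>
    #include <arrow/io/file.h>
    #include <parquet/arrow/reader.h>
    #include <parquet/arrow/writer.h>

    void RoundTrip() {
      // Build a one-column Arrow table in memory (Status checks omitted).
      arrow::Int64Builder builder;
      builder.Append(1);
      builder.Append(2);
      std::shared_ptr<arrow::Array> array;
      builder.Finish(&array);
      auto schema = arrow::schema({arrow::field("x", arrow::int64())});
      auto table = arrow::Table::Make(schema, {array});

      // Write the table out as a Parquet file.
      std::shared_ptr<arrow::io::FileOutputStream> sink;
      arrow::io::FileOutputStream::Open("records.parquet", &sink);
      parquet::arrow::WriteTable(*table, arrow::default_memory_pool(), sink,
                                 /*chunk_size=*/1024);

      // Read it back into an Arrow table.
      std::shared_ptr<arrow::io::ReadableFile> source;
      arrow::io::ReadableFile::Open("records.parquet", &source);
      std::unique_ptr<parquet::arrow::FileReader> reader;
      parquet::arrow::OpenFile(source, arrow::default_memory_pool(), &reader);
      std::shared_ptr<arrow::Table> result;
      reader->ReadTable(&result);
    }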


Re: Record Conversion API in parquet-cpp

Posted by "Uwe L. Korn" <uw...@xhochy.com>.
Hello,

the Arrow API in parquet-cpp is a much more convenient API for Parquet
C++ users. It is tailored for columnar reads & writes but gives you a
high-level interface. We use it either to interact with Pandas or to pull
data to/from databases using Turbodbc. If you can afford, memory-wise,
to load all your data into RAM, it might be simpler for you to convert
the data to Arrow and then use the Arrow API. For Arrow we have
implemented the state machine for the creation of definition and
repetition levels in
https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/writer.cc#L67-L314
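
Conceptually, that state machine flattens nested values into
(value, definition level, repetition level) triples. Here is a standalone
sketch for a schema like repeated group items { optional int64 v; } --
illustrative only, not the actual writer.cc code:

    #include <cstdint>
    #include <vector>

    // One record: a repeated group of optional int64 values.
    // nullptr models a null element. max def level = 2, max rep level = 1.
    using Record = std::vector<const int64_t*>;

    // Shred records into the flat column representation Parquet expects.
    void Shred(const std::vector<Record>& records,
               std::vector<int64_t>* values,
               std::vector<int16_t>* def_levels,
               std::vector<int16_t>* rep_levels) {
      for (const Record& items : records) {
        if (items.empty()) {
          // No occurrences: the column still gets one (def=0, rep=0) entry.
          def_levels->push_back(0);
          rep_levels->push_back(0);
          continue;
        }
        for (size_t i = 0; i < items.size(); ++i) {
          // rep level 0 starts a new record; 1 repeats within the group.
          rep_levels->push_back(i == 0 ? 0 : 1);
          if (items[i] == nullptr) {
            def_levels->push_back(1);  // group entry exists, v is null
          } else {
            def_levels->push_back(2);  // fully defined: a real value follows
            values->push_back(*items[i]);
          }
        }
      }
    }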

Uwe


Re: Record Conversion API in parquet-cpp

Posted by Sandeep Joshi <sa...@gmail.com>.
Uwe,

>> As far as I understand you, you are only looking for the path
records->Parquet?

Yes.  Btw, I am just curious about the Arrow API in parquet-cpp.

If I first convert the records to Arrow and then to Parquet, will nested
schemas work?

While converting from Parquet to records, you need to build an FSM for
reassembly to handle the definition-level and repetition-level vectors.
Where does this happen when you convert from Parquet to Arrow to some
JSON record? My questions are specific to the C++ versions of Arrow and
Parquet.

-Sandeep


Re: Record Conversion API in parquet-cpp

Posted by "Uwe L. Korn" <uw...@xhochy.com>.
Hello Sandeep,

we don't require the same class structure as in parquet-mr. Preferably
the two are very similar, but they may differ. Some of parquet-mr's
interfaces are specifically tailored to fit Hadoop, whereas we don't have
this requirement in the C++ implementation. Still, the interfaces should
be suitable for more generic record conversion. Depending on whether you
know the structure of your records at compile time, using std::tuple<..>
might be a good option; see the sketch below. If you don't know the
structure beforehand, we need a more dynamic interface. I would be happy
to guide you a bit to implement this API in parquet-cpp.
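
To make the compile-time idea concrete, a minimal tuple-driven writer
could look like this. ColumnBuffer and TupleWriter are hypothetical names;
a real implementation would map each buffer onto a
parquet::TypedColumnWriter of the matching physical type:

    #include <cstddef>
    #include <initializer_list>
    #include <tuple>
    #include <utility>
    #include <vector>

    // Hypothetical per-column sink; stands in for a typed column writer.
    template <typename T>
    struct ColumnBuffer { std::vector<T> values; };

    template <typename... Ts>
    class TupleWriter {
     public:
      // Append one row: each tuple element goes to its own column.
      void Write(const std::tuple<Ts...>& row) {
        WriteImpl(row, std::index_sequence_for<Ts...>{});
      }

     private:
      template <std::size_t... I>
      void WriteImpl(const std::tuple<Ts...>& row, std::index_sequence<I...>) {
        // Expand over all columns at compile time.
        (void)std::initializer_list<int>{
            (std::get<I>(columns_).values.push_back(std::get<I>(row)), 0)...};
      }
      std::tuple<ColumnBuffer<Ts>...> columns_;
    };

    // Usage: schema known at compile time as (int64, double).
    //   TupleWriter<int64_t, double> w;
    //   w.Write(std::tuple<int64_t, double>(1, 2.5));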

As far as I understand you, you are only looking for the path
records->Parquet?

Uwe


Re: Record Conversion API in parquet-cpp

Posted by Sandeep Joshi <sa...@gmail.com>.
Hi Wes

We have a rough implementation that does this conversion (currently from
rapidjson to Parquet) which we could contribute; at its core it shreds
JSON values into per-column buffers, roughly like the sketch below.
It will need a shepherd/guide to ensure it aligns with parquet-cpp's
implementation standards.
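
To give an idea of its shape, a much-simplified sketch for a single
optional int64 column is below (the real code handles nesting); the
resulting values and definition levels would then go to the typed
ColumnWriter's WriteBatch():

    #include <cstdint>
    #include <vector>
    #include <rapidjson/document.h>

    // Shred a JSON array of objects into one optional int64 column "x".
    // Error handling omitted.
    void JsonToColumn(const char* json,
                      std::vector<int64_t>* values,
                      std::vector<int16_t>* def_levels) {
      rapidjson::Document doc;
      doc.Parse(json);
      for (const auto& rec : doc.GetArray()) {
        if (rec.HasMember("x") && rec["x"].IsInt64()) {
          def_levels->push_back(1);   // defined: a value is present
          values->push_back(rec["x"].GetInt64());
        } else {
          def_levels->push_back(0);   // missing/null: no value stored
        }
      }
    }

    // e.g. JsonToColumn("[{\"x\": 1}, {}, {\"x\": 3}]", &v, &d);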

Does the class structure in parquet-cpp have to be in one-to-one
correspondence with parquet-mr's?

I noticed that the parquet-mr Record Conversion API has abstract classes
like WriteSupport, ReadSupport, PrimitiveConverter, GroupConverter,
RecordMaterializer, ParquetInputFormat, and ParquetOutputFormat which
have to be implemented. I saw that these classes are currently
implemented by the Avro, Thrift, and Protobuf converters (e.g.
https://github.com/apache/parquet-mr/tree/master/parquet-avro/src/main/java/org/apache/parquet/avro
).

Would parquet-cpp require the exact same framework?

-Sandeep


Re: Record Conversion API in parquet-cpp

Posted by Wes McKinney <we...@gmail.com>.
hi Sandeep,

This is more than welcome to be implemented, though I personally have
no need for it (I almost exclusively work with columnar data / Arrow).
In addition to implementing the decoding to records, we would need to
define a suitable record data structure in C++, which is a decent amount
of work. One possible shape is sketched below.
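
One possible shape for such a record type, purely as a sketch (using
C++17 std::variant for brevity; none of these types exist in
parquet-cpp):

    #include <cstdint>
    #include <map>
    #include <memory>
    #include <string>
    #include <variant>
    #include <vector>

    struct Record;  // a group of named fields
    struct List;    // a repeated field

    // A single dynamically typed value; std::monostate models null.
    struct Value {
      std::variant<std::monostate, bool, int64_t, double, std::string,
                   std::shared_ptr<Record>, std::shared_ptr<List>> v;
    };

    struct Record { std::map<std::string, Value> fields; };
    struct List   { std::vector<Value> items; };

    // Usage sketch:
    //   Record r;
    //   r.fields["id"].v = int64_t{42};
    //   r.fields["name"].v = std::string("sandeep");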

- Wes
