Posted to dev@parquet.apache.org by rahul challapalli <ch...@gmail.com> on 2017/08/25 17:34:04 UTC

Adding a new column after writing a few records

Hi,

I am using the Parquet writer (C++) and I want to know whether I can add a
new column after writing out a few records, but before the close method is
called. An example would be helpful if this is feasible.

Rahul

Re: Adding a new column after writing a few records

Posted by rahul challapalli <ch...@gmail.com>.
Thank you Uwe!

On Tue, Aug 29, 2017 at 12:49 AM, Uwe L. Korn <uw...@xhochy.com> wrote:

> Hello Rahul,
>
> The benefit of using Arrow for the row-wise-to-columnar conversion is
> mainly that the API is much simpler to use than the plain parquet-cpp
> API (see
> https://arrow.apache.org/docs/cpp/md_tutorials_row_wise_conversion.html).
> Performance-wise, there is no difference.
>
> Uwe

Re: Adding a new column after writing a few records

Posted by "Uwe L. Korn" <uw...@xhochy.com>.
Hello Rahul,

The benefit of using Arrow for the row-wise-to-columnar conversion is
mainly that the API is much simpler to use than the plain parquet-cpp
API (see
https://arrow.apache.org/docs/cpp/md_tutorials_row_wise_conversion.html).
Performance-wise, there is no difference.

Uwe

On Tue, Aug 29, 2017, at 09:42 AM, rahul challapalli wrote:
> Thanks for your response, Wes. The example at [1] uses column writers and
> column readers. So for converting row-based data into columnar format, is
> there any benefit to using Arrow? (I am mainly using Parquet for
> compression benefits. Once the data is read, I immediately convert it
> into row-based data.)
> 
> [1]
> https://github.com/apache/parquet-cpp/blob/master/examples/reader-writer.cc

Re: Adding a new column after writing a few records

Posted by rahul challapalli <ch...@gmail.com>.
Thanks for your response, Wes. The example at [1] uses column writers and
column readers. So for converting row-based data into columnar format, is
there any benefit to using Arrow? (I am mainly using Parquet for
compression benefits. Once the data is read, I immediately convert it into
row-based data.)

[1]
https://github.com/apache/parquet-cpp/blob/master/examples/reader-writer.cc

On Mon, Aug 28, 2017 at 1:38 PM, Wes McKinney <we...@gmail.com> wrote:

> hi Rahul,
>
> This is not easy to do in the C++ API right now, because the writer
> must be initialized with a static schema. Theoretically you could
> expand the schema while you are writing the first row group, but it
> would be difficult to make this possible.
>
> The writer API is also designed for writing one column at a time
> instead of one row at a time, so one option for you is to create an
> auxiliary data structure (this is not provided by the Parquet C++
> library) to convert records into columnar form, then write to the
> Parquet writer API once you have appended all your records and know
> the final schema.
>
> - Wes

Re: Adding a new column after writing a few records

Posted by Wes McKinney <we...@gmail.com>.
hi Rahul,

This is not easy to do in the C++ API right now, because the writer
must be initialized with a static schema. Theoretically you could
expand the schema while you are writing the first row group, but it
would be difficult to make this possible.

The writer API is also designed for writing one column at a time
instead of one row at a time, so one option for you is to create an
auxiliary data structure (this is not provided by the Parquet C++
library) to convert records into columnar form, then write to the
Parquet writer API once you have appended all your records and know
the final schema.

- Wes
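
For illustration, the auxiliary data structure described above might look
roughly like the sketch below. It is only a sketch under stated assumptions:
ColumnBuffer, Cell and BuildExample are hypothetical names and are not part of
parquet-cpp, values are limited to int64 to keep it short, and the final step
of handing each buffered column to the Parquet writer is left out. Records are
appended row by row, a column that first appears late is back-filled with
nulls, and the schema is only read off once all records are in.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper types, NOT part of the Parquet C++ library.
// A cell is either null or holds an int64 value.
struct Cell {
  bool is_null;
  std::int64_t value;
};

// Buffers row-wise records in columnar form. Columns may appear at any
// time; earlier rows are back-filled with nulls.
class ColumnBuffer {
 public:
  typedef std::vector<std::pair<std::string, std::int64_t> > Row;
  typedef std::vector<Cell> Column;

  // Append one record given as (column name, value) pairs.
  void AppendRow(const Row& row) {
    for (std::size_t i = 0; i < row.size(); ++i) {
      Column& col = GetOrAddColumn(row[i].first);
      const Cell cell = {false, row[i].second};
      if (col.size() == num_rows_) {
        col.push_back(cell);
      } else {
        col.back() = cell;  // same name repeated in one record: last wins
      }
    }
    ++num_rows_;
    // Columns missing from this record get a null so all stay aligned.
    const Cell null_cell = {true, 0};
    for (std::size_t c = 0; c < columns_.size(); ++c) {
      if (columns_[c].size() < num_rows_) columns_[c].push_back(null_cell);
    }
  }

  std::size_t num_rows() const { return num_rows_; }
  // Column names in order of first appearance: the "final schema".
  const std::vector<std::string>& names() const { return names_; }
  const Column& column(std::size_t i) const { return columns_[i]; }

 private:
  Column& GetOrAddColumn(const std::string& name) {
    std::map<std::string, std::size_t>::iterator it = index_.find(name);
    if (it == index_.end()) {
      const Cell null_cell = {true, 0};
      it = index_.insert(std::make_pair(name, columns_.size())).first;
      names_.push_back(name);
      // Back-fill the new column with nulls for all rows seen so far.
      columns_.push_back(Column(num_rows_, null_cell));
    }
    return columns_[it->second];
  }

  std::size_t num_rows_ = 0;
  std::vector<std::string> names_;
  std::vector<Column> columns_;
  std::map<std::string, std::size_t> index_;
};

// Example: column "c" only shows up in the third record, and the fourth
// record has no value for "b".
inline ColumnBuffer BuildExample() {
  ColumnBuffer buf;
  buf.AppendRow({{"a", 1}, {"b", 10}});
  buf.AppendRow({{"a", 2}, {"b", 20}});
  buf.AppendRow({{"a", 3}, {"b", 30}, {"c", 300}});
  buf.AppendRow({{"a", 4}, {"c", 400}});
  return buf;
}
```

Once all records are appended, names() gives the column list from which a
schema could be built, and each column(i) holds one aligned value per row. In
Parquet the null cells would have to be written as optional fields, so a real
implementation would also need to turn is_null into definition levels.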

On Fri, Aug 25, 2017 at 1:34 PM, rahul challapalli
<ch...@gmail.com> wrote:
> Hi,
>
> I am using the Parquet writer (C++) and I want to know whether I can add a
> new column after writing out a few records, but before the close method is
> called. An example would be helpful if this is feasible.
>
> Rahul