Posted to dev@parquet.apache.org by ALeX Wang <ee...@gmail.com> on 2017/12/28 17:05:23 UTC

What is the correct way to read 1-Dimension bytearray array

Hi,

Assume the column is a one-dimensional ByteArray array (max definition level
1, repetition type repeated).


If I want to read the column values one row at a time, I have to keep reading
(i.e. calling ReadBatch(1, ...)) until I get a value with 'rep_level=0'. At
that point, I can assemble the previously read ByteArrays and return them as
the row.

However, 'ByteArray->ptr' points into the column page memory, which (based on
my understanding) is released when 'HasNext()' is called and the reader moves
to the next page. That means I have to keep a copy of the data behind
'ByteArray->ptr' for every previously read value.
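
For concreteness, the loop I have in mind looks roughly like the sketch below
(untested; 'reader' is the typed parquet::ByteArrayReader for the column, and
the copy into std::string is exactly the part I would like to avoid):

#include <parquet/column_reader.h>  // parquet::ByteArrayReader, parquet::ByteArray
#include <string>
#include <vector>

// Sketch: read the next row of a repeated ByteArray column one value at a time.
std::vector<std::string> ReadOneRow(parquet::ByteArrayReader* reader) {
  std::vector<std::string> row;
  while (reader->HasNext()) {
    parquet::ByteArray value;
    int16_t def_level = 0;
    int16_t rep_level = 0;
    int64_t values_read = 0;
    reader->ReadBatch(1, &def_level, &rep_level, &value, &values_read);
    // rep_level == 0 starts a new row; a real implementation would have to
    // buffer this value instead of dropping it, which adds more bookkeeping.
    if (rep_level == 0 && !row.empty()) break;
    if (values_read > 0) {
      // Copy, because value.ptr points into page memory owned by the reader.
      row.emplace_back(reinterpret_cast<const char*>(value.ptr), value.len);
    }
  }
  return row;
}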

This really seems too complicated to me.
I would like to ask if there is a better way of doing:
   1. Reading a 1D array row by row.
   2. Zero-copy access to 'ByteArray->ptr'.

Thanks a lot,
-- 
Alex Wang,
Open vSwitch developer

Re: What is the correct way to read 1-Dimension bytearray array

Posted by ALeX Wang <ee...@gmail.com>.
Thx for making my day !~ ;D



On 29 December 2017 at 14:47, Wes McKinney <we...@gmail.com> wrote:

> Also, I think you can use the `Scanner` / `TypedScanner` APIs to do
> precisely this:
>
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/column_scanner.h#L88
>
> These do what the API I described in pseudocode does -- didn't
> remember it soon enough for my e-mail.
>
> On Fri, Dec 29, 2017 at 4:11 PM, ALeX Wang <ee...@gmail.com> wrote:
> > Hi Wes,
> >
> > Thanks a lot for your reply, I'll try something as you suggested,
> >
> > Thanks,
> > Alex Wang,
> >
> > On 29 December 2017 at 11:40, Wes McKinney <we...@gmail.com> wrote:
> >
> >> hi Alex,
> >>
> >> I would suggest that you handle batch buffering on the application
> >> side, _not_ calling ReadBatch(1, ...) which will be much slower -- the
> >> parquet-cpp APIs are intended to be used for batch read and writes, so
> >> if you need to read a table row by row, you could create some C++
> >> classes with a particular batch size that manage an internal buffer of
> >> values that have been read from the column.
> >>
> >> As an example, suppose you wish to buffer 1000 values from the column
> >> at a time. Then you could create an API that looks like:
> >>
> >> BufferedColumnReader<ByteArrayType> buffered_reader(batch_reader);
> >> buffered_reader.set_batch_size(1000);
> >>
> >> const ByteArray* val;
> >> while (val = buffered_reader.Next()) {
> >>   // Do something with val
> >> }
> >>
> >> The ByteArray values do not own their data, so if you wish to persist
> >> the memory between (internal) calls to ReadBatch, you will have to
> >> copy the memory someplace else. We do not perform this copy for you in
> >> the low level ReadBatch API because it would hurt performance for
> >> users who wish to put the memory someplace else (like in an Arrow
> >> columnar array buffer)
> >>
> >> I recommend looking at the Apache Arrow-based reader API which does
> >> all this for you including memory management.
> >>
> >> Thanks
> >> Wes
> >>
> >> On Thu, Dec 28, 2017 at 12:05 PM, ALeX Wang <ee...@gmail.com> wrote:
> >> > Hi,
> >> >
> >> > Assume the column type is of 1-Dimension ByteArray array, (definition
> >> > level - 1, and repetition - repeated).
> >> >
> >> >
> >> > If I want to read the column values one row at a time, I have to keep
> >> > read (i.e. calling ReadBatch(1,...)) until getting a value of
> >> > 'rep_level=0'. At that point, I can construct previously read
> >> > ByteArrays and return it as for the row.
> >> >
> >> > However, since 'ByteArray->ptr' points to the column page memory which
> >> > (based on my understanding) will be gone when calling 'HasNext()' and
> >> > move to the next page. So that means i have to maintain a copy of the
> >> > 'ByteArray->ptr' for all the previously read values.
> >> >
> >> > This really seems to me to be too complicated..
> >> > Would like to ask if there is a better way of doing:
> >> >    1. Reading 1D array in row-by-row fashion.
> >> >    2. Zero-copy 'ByteArray->ptr'
> >> >
> >> > Thanks a lot,
> >> > --
> >> > Alex Wang,
> >> > Open vSwitch developer
> >>
> >
> >
> >
> > --
> > Alex Wang,
> > Open vSwitch developer
>



-- 
Alex Wang,
Open vSwitch developer

Re: What is the correct way to read 1-Dimension bytearray array

Posted by Wes McKinney <we...@gmail.com>.
Also, I think you can use the `Scanner` / `TypedScanner` APIs to do
precisely this:

https://github.com/apache/parquet-cpp/blob/master/src/parquet/column_scanner.h#L88

These do what the API I described in pseudocode does -- I just didn't
remember it in time for my earlier e-mail.
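
From memory, usage is something like the sketch below (untested and written
from memory of that header -- check the link above for the exact class and
method signatures; I'm assuming TypedScanner<ByteArrayType> and its
Next(value, def_level, rep_level, is_null) method here):

#include <memory>
#include <parquet/column_scanner.h>

// Sketch: scan a repeated ByteArray column value by value with TypedScanner.
void ScanColumn(std::shared_ptr<parquet::ColumnReader> col_reader) {
  parquet::TypedScanner<parquet::ByteArrayType> scanner(col_reader,
                                                        /*batch_size=*/1000);
  parquet::ByteArray value;
  int16_t def_level = 0;
  int16_t rep_level = 0;
  bool is_null = false;
  while (scanner.HasNext()) {
    scanner.Next(&value, &def_level, &rep_level, &is_null);
    if (rep_level == 0) {
      // This value begins a new row.
    }
    if (!is_null) {
      // value.ptr / value.len are only valid until the scanner refills its
      // internal buffer, so copy the bytes if you need to keep them.
    }
  }
}

The scanner buffers batch_size values internally, so you get the row-by-row
interface without paying the ReadBatch(1, ...) cost.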

On Fri, Dec 29, 2017 at 4:11 PM, ALeX Wang <ee...@gmail.com> wrote:
> Hi Wes,
>
> Thanks a lot for your reply, I'll try something as you suggested,
>
> Thanks,
> Alex Wang,
>
> On 29 December 2017 at 11:40, Wes McKinney <we...@gmail.com> wrote:
>
>> hi Alex,
>>
>> I would suggest that you handle batch buffering on the application
>> side, _not_ calling ReadBatch(1, ...) which will be much slower -- the
>> parquet-cpp APIs are intended to be used for batch read and writes, so
>> if you need to read a table row by row, you could create some C++
>> classes with a particular batch size that manage an internal buffer of
>> values that have been read from the column.
>>
>> As an example, suppose you wish to buffer 1000 values from the column
>> at a time. Then you could create an API that looks like:
>>
>> BufferedColumnReader<ByteArrayType> buffered_reader(batch_reader);
>> buffered_reader.set_batch_size(1000);
>>
>> const ByteArray* val;
>> while (val = buffered_reader.Next()) {
>>   // Do something with val
>> }
>>
>> The ByteArray values do not own their data, so if you wish to persist
>> the memory between (internal) calls to ReadBatch, you will have to
>> copy the memory someplace else. We do not perform this copy for you in
>> the low level ReadBatch API because it would hurt performance for
>> users who wish to put the memory someplace else (like in an Arrow
>> columnar array buffer)
>>
>> I recommend looking at the Apache Arrow-based reader API which does
>> all this for you including memory management.
>>
>> Thanks
>> Wes
>>
>> On Thu, Dec 28, 2017 at 12:05 PM, ALeX Wang <ee...@gmail.com> wrote:
>> > Hi,
>> >
>> > Assume the column type is of 1-Dimension ByteArray array, (definition
>> > level - 1, and repetition - repeated).
>> >
>> >
>> > If I want to read the column values one row at a time, I have to keep
>> > read (i.e. calling ReadBatch(1,...)) until getting a value of
>> > 'rep_level=0'. At that point, I can construct previously read
>> > ByteArrays and return it as for the row.
>> >
>> > However, since 'ByteArray->ptr' points to the column page memory which
>> > (based on my understanding) will be gone when calling 'HasNext()' and
>> > move to the next page. So that means i have to maintain a copy of the
>> > 'ByteArray->ptr' for all the previously read values.
>> >
>> > This really seems to me to be too complicated..
>> > Would like to ask if there is a better way of doing:
>> >    1. Reading 1D array in row-by-row fashion.
>> >    2. Zero-copy 'ByteArray->ptr'
>> >
>> > Thanks a lot,
>> > --
>> > Alex Wang,
>> > Open vSwitch developer
>>
>
>
>
> --
> Alex Wang,
> Open vSwitch developer

Re: What is the correct way to read 1-Dimension bytearray array

Posted by ALeX Wang <ee...@gmail.com>.
Hi Wes,

Thanks a lot for your reply, I'll try something as you suggested,

Thanks,
Alex Wang,

On 29 December 2017 at 11:40, Wes McKinney <we...@gmail.com> wrote:

> hi Alex,
>
> I would suggest that you handle batch buffering on the application
> side, _not_ calling ReadBatch(1, ...) which will be much slower -- the
> parquet-cpp APIs are intended to be used for batch read and writes, so
> if you need to read a table row by row, you could create some C++
> classes with a particular batch size that manage an internal buffer of
> values that have been read from the column.
>
> As an example, suppose you wish to buffer 1000 values from the column
> at a time. Then you could create an API that looks like:
>
> BufferedColumnReader<ByteArrayType> buffered_reader(batch_reader);
> buffered_reader.set_batch_size(1000);
>
> const ByteArray* val;
> while (val = buffered_reader.Next()) {
>   // Do something with val
> }
>
> The ByteArray values do not own their data, so if you wish to persist
> the memory between (internal) calls to ReadBatch, you will have to
> copy the memory someplace else. We do not perform this copy for you in
> the low level ReadBatch API because it would hurt performance for
> users who wish to put the memory someplace else (like in an Arrow
> columnar array buffer)
>
> I recommend looking at the Apache Arrow-based reader API which does
> all this for you including memory management.
>
> Thanks
> Wes
>
> On Thu, Dec 28, 2017 at 12:05 PM, ALeX Wang <ee...@gmail.com> wrote:
> > Hi,
> >
> > Assume the column type is of 1-Dimension ByteArray array, (definition
> > level - 1, and repetition - repeated).
> >
> >
> > If I want to read the column values one row at a time, I have to keep
> > read (i.e. calling ReadBatch(1,...)) until getting a value of
> > 'rep_level=0'. At that point, I can construct previously read
> > ByteArrays and return it as for the row.
> >
> > However, since 'ByteArray->ptr' points to the column page memory which
> > (based on my understanding) will be gone when calling 'HasNext()' and
> > move to the next page. So that means i have to maintain a copy of the
> > 'ByteArray->ptr' for all the previously read values.
> >
> > This really seems to me to be too complicated..
> > Would like to ask if there is a better way of doing:
> >    1. Reading 1D array in row-by-row fashion.
> >    2. Zero-copy 'ByteArray->ptr'
> >
> > Thanks a lot,
> > --
> > Alex Wang,
> > Open vSwitch developer
>



-- 
Alex Wang,
Open vSwitch developer

Re: What is the correct way to read 1-Dimension bytearray array

Posted by Wes McKinney <we...@gmail.com>.
hi Alex,

I would suggest that you handle batch buffering on the application side,
_not_ by calling ReadBatch(1, ...), which will be much slower -- the
parquet-cpp APIs are intended to be used for batch reads and writes. So if
you need to read a table row by row, you could create a small C++ class,
with a particular batch size, that manages an internal buffer of values
that have been read from the column.

As an example, suppose you wish to buffer 1000 values from the column
at a time. Then you could create an API that looks like:

BufferedColumnReader<ByteArrayType> buffered_reader(batch_reader);
buffered_reader.set_batch_size(1000);

const ByteArray* val;
while ((val = buffered_reader.Next()) != nullptr) {
  // Do something with val
}

The ByteArray values do not own their data, so if you wish to persist the
memory between (internal) calls to ReadBatch, you will have to copy the
memory someplace else. We do not perform this copy for you in the low-level
ReadBatch API because it would hurt performance for users who wish to put
the memory someplace else (like in an Arrow columnar array buffer).
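
To make that concrete, a minimal version of such a wrapper might look like
the sketch below (untested; BufferedColumnReader above is just a name I made
up, and here I specialize it for ByteArray and copy each value into a
std::string precisely because of the ownership caveat just mentioned):

#include <parquet/column_reader.h>
#include <string>
#include <vector>

// Hypothetical buffering wrapper, not an existing parquet-cpp class. It pulls
// batch_size levels/values at a time with ReadBatch and hands them out one by
// one, copying each value out of the page buffer so it stays valid.
class BufferedByteArrayReader {
 public:
  explicit BufferedByteArrayReader(parquet::ByteArrayReader* reader,
                                   int64_t batch_size = 1000)
      : reader_(reader), batch_size_(batch_size) {}

  // Returns false when the column is exhausted. *out is cleared for
  // null / empty entries (def_level below the maximum).
  bool Next(std::string* out, int16_t* def_level, int16_t* rep_level) {
    if (level_pos_ == def_levels_.size() && !Refill()) return false;
    *def_level = def_levels_[level_pos_];
    *rep_level = rep_levels_[level_pos_];
    ++level_pos_;
    if (*def_level == reader_->descr()->max_definition_level()) {
      *out = std::move(values_[value_pos_++]);  // a value is present here
    } else {
      out->clear();  // no value at this level
    }
    return true;
  }

 private:
  bool Refill() {
    if (!reader_->HasNext()) return false;
    std::vector<parquet::ByteArray> raw(batch_size_);
    def_levels_.resize(batch_size_);
    rep_levels_.resize(batch_size_);
    int64_t values_read = 0;
    int64_t levels_read = reader_->ReadBatch(batch_size_, def_levels_.data(),
                                             rep_levels_.data(), raw.data(),
                                             &values_read);
    def_levels_.resize(levels_read);
    rep_levels_.resize(levels_read);
    // Copy each value out of the page buffer so it outlives the next
    // ReadBatch call.
    values_.clear();
    for (int64_t i = 0; i < values_read; ++i) {
      values_.emplace_back(reinterpret_cast<const char*>(raw[i].ptr), raw[i].len);
    }
    level_pos_ = 0;
    value_pos_ = 0;
    return levels_read > 0;
  }

  parquet::ByteArrayReader* reader_;
  int64_t batch_size_;
  std::vector<std::string> values_;
  std::vector<int16_t> def_levels_;
  std::vector<int16_t> rep_levels_;
  size_t level_pos_ = 0;
  size_t value_pos_ = 0;
};

With that, row assembly is simple: start a new row whenever rep_level comes
back as 0, and append each non-null value to the current row.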

I recommend looking at the Apache Arrow-based reader API which does
all this for you including memory management.
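
For reference, reading a whole column through that path looks roughly like
this (sketch only -- the exact signatures have moved around between releases,
so treat the names below as approximate and check parquet/arrow/reader.h in
your checkout):

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/reader.h>
#include <parquet/exception.h>

// Sketch: read one column of a Parquet file into an Arrow array, which owns
// its memory, so there is no page-lifetime bookkeeping to do by hand.
std::shared_ptr<arrow::Array> ReadColumnWithArrow(const std::string& path,
                                                  int column_index) {
  std::shared_ptr<arrow::io::ReadableFile> infile;
  PARQUET_THROW_NOT_OK(arrow::io::ReadableFile::Open(path, &infile));

  std::unique_ptr<parquet::arrow::FileReader> reader;
  PARQUET_THROW_NOT_OK(
      parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));

  std::shared_ptr<arrow::Array> column;  // a list-of-binary array here
  PARQUET_THROW_NOT_OK(reader->ReadColumn(column_index, &column));
  return column;
}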

Thanks
Wes

On Thu, Dec 28, 2017 at 12:05 PM, ALeX Wang <ee...@gmail.com> wrote:
> Hi,
>
> Assume the column type is of 1-Dimension ByteArray array, (definition level
> - 1, and repetition - repeated).
>
>
> If I want to read the column values one row at a time, I have to keep read
> (i.e. calling ReadBatch(1,...)) until getting a value of 'rep_level=0'. At
> that point, I can construct previously read ByteArrays and return it as for
> the row.
>
> However, since 'ByteArray->ptr' points to the column page memory which
> (based on my understanding) will be gone when calling 'HasNext()' and move
> to the next page. So that means i have to maintain a copy of the
> 'ByteArray->ptr' for all the previously read values.
>
> This really seems to me to be too complicated..
> Would like to ask if there is a better way of doing:
>    1. Reading 1D array in row-by-row fashion.
>    2. Zero-copy 'ByteArray->ptr'
>
> Thanks a lot,
> --
> Alex Wang,
> Open vSwitch developer