Posted to dev@parquet.apache.org by William Malpica <wi...@blazingdb.com> on 2017/11/02 23:07:26 UTC

Issues using TypedColumnReader::ReadBatchSpaced

Hello,

I am trying to use TypedColumnReader<DType>::ReadBatchSpaced to read
ByteArrays.

In my use case, I am only reading flat data, but it is data with nulls,
which is why I am using ReadBatchSpaced: it is the only way I have found
to read the data and also know which values are null. My code looks
something like this:

int64_t total_values_read = 0;

int64_t valid_bits_offset = 0;
int64_t levels_read = 0;
int64_t values_read = 0;
int64_t null_count = -1;

std::vector<parquet::ByteArray> values(numRecords);
std::vector<int16_t> dresult(numRecords, -1);
std::vector<int16_t> rresult(numRecords, -1);
std::vector<uint8_t> valid_bits(numRecords, 255);

while (total_values_read < numRecords) {
  // Output pointers are advanced by the number of rows read so far.
  int64_t rows_read = parquetTypeReader->ReadBatchSpaced(
      numRecords - total_values_read, dresult.data() + total_values_read,
      rresult.data() + total_values_read, values.data() + total_values_read,
      valid_bits.data() + total_values_read, valid_bits_offset,
      &levels_read, &values_read, &null_count);

  total_values_read += rows_read;
}

When I follow this pattern and need to make multiple calls to
ReadBatchSpaced, I get garbage results in my vector of ByteArrays after
the first call in the loop. If I read a more primitive data type, I do
not have this issue.
So far, the only way I have been able to get this to work is by using an
intermediate buffer to hold the ByteArray data, which looks something
like this:

std::vector<parquet::ByteArray> values(numRecords);
std::vector<parquet::ByteArray> intermediateBuffer(numRecords);
std::vector<int16_t> dresult(numRecords, -1);
std::vector<int16_t> rresult(numRecords, -1);
std::vector<uint8_t> valid_bits(numRecords, 255);

while (total_values_read < numRecords) {
  int64_t rows_read = parquetTypeReader->ReadBatchSpaced(
      numRecords - total_values_read, dresult.data() + total_values_read,
      rresult.data() + total_values_read, intermediateBuffer.data(),
      valid_bits.data() + total_values_read, valid_bits_offset,
      &levels_read, &values_read, &null_count);

  // Copy this chunk out of the intermediate buffer into the result vector.
  std::copy(intermediateBuffer.begin(), intermediateBuffer.begin() + rows_read,
            values.begin() + total_values_read);

  total_values_read += rows_read;
}


Any ideas as to what I am doing incorrectly in my first example? Do I
always need to use an intermediate buffer?

Thanks!

William




William Malpica / VP of Engineering
william@blazingdb.com / 859.619.0708

BlazingDB
www.blazingdb.com


Re: Issues using TypedColumnReader<DType>::ReadBatchSpaced

Posted by "william@blazingdb.com" <wi...@blazingdb.com>.
Hello Uwe,

Yes, this was the same issue that Felipe posted. He posted it on my behalf, since I did not have access to the mailing list. Thanks for the info; it explains the patterns I have been seeing.

Thanks,

William


Re: Issues using TypedColumnReader::ReadBatchSpaced

Posted by "Uwe L. Korn" <uw...@xhochy.com>.
Hello William,

It seems like you hit the problem Felipe mentioned earlier. My response
to that was:

The parquet::ByteArray instances don't own the data they point to, so
their internal pointers can become invalid on the next call to
ReadBatchSpaced. That is true whether or not you go through
intermediateBuffer, so the second code snippet may also fail. To keep
the values usable across calls, you have to copy the bytes out into
storage that you own before the next read.
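
For illustration, a rough and untested sketch of that copy, reusing the
declarations from your first snippet. Note that it passes
total_values_read as valid_bits_offset (a bit offset into the bitmap)
instead of advancing the valid_bits pointer, since valid_bits is a
bitmap, not one byte per value:

// Owns the copied bytes, one slot per row (left empty for nulls).
std::vector<std::string> owned(numRecords);

while (total_values_read < numRecords) {
  int64_t rows_read = parquetTypeReader->ReadBatchSpaced(
      numRecords - total_values_read, dresult.data() + total_values_read,
      rresult.data() + total_values_read, values.data() + total_values_read,
      valid_bits.data(), total_values_read,
      &levels_read, &values_read, &null_count);

  // Copy each valid ByteArray's payload before the next ReadBatchSpaced
  // call can invalidate the internal pointers.
  for (int64_t i = 0; i < rows_read; ++i) {
    int64_t pos = total_values_read + i;
    bool is_valid = (valid_bits[pos / 8] >> (pos % 8)) & 1;
    if (is_valid) {
      const parquet::ByteArray& ba = values[pos];
      owned[pos].assign(reinterpret_cast<const char*>(ba.ptr), ba.len);
    }
  }
  total_values_read += rows_read;
}
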
In general, I can also recommend looking at the parquet_arrow
implementation for how to read files in parquet-cpp:
src/parquet/arrow/reader*. Depending on your use case, it may be simpler
to use that API directly, as Arrow data structures are much easier to
consume and hide some of the implementation details of the Parquet
format.
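
For reference, the high-level read path there looks roughly like this
(again untested, error handling via arrow::Status omitted, the exact
signatures can differ between versions, and "example.parquet" just
stands in for your file):

#include <arrow/io/file.h>
#include <arrow/memory_pool.h>
#include <arrow/table.h>
#include <parquet/arrow/reader.h>

std::shared_ptr<arrow::io::ReadableFile> infile;
arrow::io::ReadableFile::Open("example.parquet", &infile);

std::unique_ptr<parquet::arrow::FileReader> reader;
parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader);

std::shared_ptr<arrow::Table> table;
reader->ReadTable(&table);  // nulls live in each column's validity bitmap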

If that does not solve your problem, feel free to ask more ;)

Uwe
