
Posted to dev@parquet.apache.org by Felipe Aramburu <fe...@blazingdb.com> on 2017/11/02 23:23:11 UTC

Problem reading ByteArray data when reusing buffers

Hello,

I am trying to use TypedColumnReader<DType>::ReadBatchSpaced to read
ByteArrays.

In my use case, I am only reading flat data, but the data contains nulls,
which is why I am using ReadBatchSpaced: it's the only way I have found
to read data and also know which values are null. My code looks something
like this:

int64_t total_values_read = 0;

int64_t valid_bits_offset = 0;
int64_t levels_read = 0;
int64_t values_read = 0;
int64_t null_count = -1;

std::vector<parquet::ByteArray> values(numRecords);
std::vector<int16_t> dresult(numRecords, -1);
std::vector<int16_t> rresult(numRecords, -1);
std::vector<uint8_t> valid_bits(numRecords, 255);

while (total_values_read < numRecords) {
  int64_t rows_read = parquetTypeReader->ReadBatchSpaced(
      numRecords,
      dresult.data() + total_values_read, rresult.data() + total_values_read,
      values.data() + total_values_read, valid_bits.data() + total_values_read,
      valid_bits_offset, &levels_read, &values_read, &null_count);

  total_values_read += rows_read;
}

When I follow this pattern and need to make multiple calls to
ReadBatchSpaced, I can get garbage results in my vector of ByteArrays after
the first call in the loop. When I read a more primitive data type, I do
not have this issue.
So far the only way I have been able to get this to work is by using an
intermediary buffer to hold the ByteArray data (and I also cannot reuse
that buffer, otherwise I also get bad data). It looks something like this:

std::vector<parquet::ByteArray> values(numRecords);
std::vector<int16_t> dresult(numRecords, -1);
std::vector<int16_t> rresult(numRecords, -1);
std::vector<uint8_t> valid_bits(numRecords, 255);

while (total_values_read < numRecords) {
  std::vector<parquet::ByteArray> intermediateBuffer(numRecords);

  int64_t rows_read = parquetTypeReader->ReadBatchSpaced(
      numRecords,
      dresult.data() + total_values_read, rresult.data() + total_values_read,
      intermediateBuffer.data(), valid_bits.data() + total_values_read,
      valid_bits_offset, &levels_read, &values_read, &null_count);

  std::copy(intermediateBuffer.begin(), intermediateBuffer.begin() + rows_read,
            values.begin() + total_values_read);

  total_values_read += rows_read;
}


Any ideas as to what I am doing incorrectly in my first example? Do I
always need to use an intermediate buffer?

Thanks!

William


P.S. A team member of mine tried to send this email to the mailing list, but
it does not seem to have come through. Is there something someone has to do
before they can post to the mailing list?


Re: Problem reading ByteArray data when reusing buffers

Posted by "Uwe L. Korn" <uw...@xhochy.com>.

Hello Felipe,

The parquet::ByteArray instances don't own the data, so their internal
pointer might become invalid on the next call to ReadBatchSpaced. It
should actually make no difference whether you use that intermediateBuffer
or not, so the second code snippet might also fail. In general, I
recommend looking at the parquet_arrow implementation to see how to read
files in parquet-cpp: src/parquet/arrow/reader*. Depending on your use
case, it might also be simpler for you to use this API, as Arrow data
structures are much simpler to consume and hide some of the
implementation details of the Parquet format.

To send mail to the list, you need to be subscribed first. You can
subscribe by sending a mail to dev-subscribe@parquet.apache.org

Uwe
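
To illustrate the ownership point in the reply above: since each ByteArray
only points into memory owned by the reader, one way to keep values across
calls is to deep-copy the bytes into storage you own before asking the
reader for more data. The sketch below is only an illustration under stated
assumptions. ByteArrayView is a hypothetical stand-in that mirrors the
length-plus-pointer shape of parquet::ByteArray so the example compiles
without parquet-cpp, and AppendOwnedCopies is a made-up helper name, not
part of any library.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for parquet::ByteArray: a length plus a NON-owning
// pointer into memory that belongs to someone else (e.g. the column reader).
struct ByteArrayView {
  uint32_t len;
  const uint8_t* ptr;
};

// Deep-copy the bytes behind each view into strings that own their storage.
// Intended to be called right after a read, before the reader is asked for
// more data and possibly recycles the memory the views point into.
void AppendOwnedCopies(const ByteArrayView* views, int64_t count,
                       std::vector<std::string>* out) {
  for (int64_t i = 0; i < count; ++i) {
    out->emplace_back(reinterpret_cast<const char*>(views[i].ptr),
                      views[i].len);
  }
}
```

In the read loop from the question, such a copy would run once per
iteration on the slots that were just filled. Slots flagged as null would
need to be skipped first, since their contents should not be relied on.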
