Posted to dev@parquet.apache.org by "william@blazingdb.com" <wi...@blazingdb.com> on 2017/11/08 15:57:48 UTC

Re: Issues using TypedColumnReader<DType>::ReadBatchSpaced

Hello Uwe,

Yes, this was the same issue that Felipe posted. He posted it on my behalf, since I did not have access to the mailing list. Thanks for the info; it explains the patterns that I have been seeing.

Thanks,

William

On 2017-11-07 07:14, "Uwe L. Korn" <uw...@xhochy.com> wrote: 
> Hello William,
> 
> It seems you hit the problem Felipe mentioned earlier. My response to
> that was:
> 
> The parquet::ByteArray instances don't own their data, so their internal
> pointers might become invalid on the next call to ReadBatchSpaced.
> Whether or not you use that intermediateBuffer should make no
> difference, so the second code snippet might also fail. In general, I
> recommend looking at the parquet_arrow implementation to see how to read
> files in parquet-cpp: src/parquet/arrow/reader*. Depending on your use
> case, it might also be simpler for you to use that API, as Arrow data
> structures are much simpler to consume and hide some of the
> implementation details of the Parquet format.
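
To illustrate the ownership point, here is a minimal sketch that does not use parquet-cpp at all. ByteArrayView is a hypothetical stand-in for parquet::ByteArray (a length plus a non-owning pointer), and FakeReader is a hypothetical reader that reuses one scratch buffer for every batch, which is an assumption about how a decoder might behave rather than a description of the real implementation. ReadAllOwned (also hypothetical) copies each value's bytes into a std::string before asking for the next batch, so the results stay valid afterwards.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for parquet::ByteArray: a length and a NON-owning pointer.
struct ByteArrayView {
  uint32_t len;
  const uint8_t* ptr;
};

// Hypothetical reader that reuses one fixed scratch buffer for every batch.
class FakeReader {
 public:
  explicit FakeReader(std::vector<std::string> rows)
      : rows_(std::move(rows)), scratch_(64) {}

  // Writes up to batch_size views into out. Every view points into scratch_,
  // so views handed out by an earlier call see overwritten bytes afterwards.
  int64_t ReadBatch(int64_t batch_size, ByteArrayView* out) {
    int64_t n = 0;
    size_t used = 0;
    while (n < batch_size && next_ < rows_.size()) {
      const std::string& s = rows_[next_];
      if (used + s.size() > scratch_.size()) break;  // scratch buffer is full
      std::memcpy(scratch_.data() + used, s.data(), s.size());
      out[n] = {static_cast<uint32_t>(s.size()), scratch_.data() + used};
      used += s.size();
      ++n;
      ++next_;
    }
    return n;
  }

 private:
  std::vector<std::string> rows_;
  std::vector<uint8_t> scratch_;
  size_t next_ = 0;
};

// Deep-copies each value into an owned std::string before the next batch is read.
std::vector<std::string> ReadAllOwned(FakeReader& reader, int64_t total,
                                      int64_t batch_size) {
  std::vector<std::string> owned;
  std::vector<ByteArrayView> batch(batch_size);
  while (static_cast<int64_t>(owned.size()) < total) {
    int64_t want =
        std::min(batch_size, total - static_cast<int64_t>(owned.size()));
    int64_t got = reader.ReadBatch(want, batch.data());
    if (got == 0) break;  // nothing left to read
    for (int64_t i = 0; i < got; ++i) {
      owned.emplace_back(reinterpret_cast<const char*>(batch[i].ptr),
                         batch[i].len);
    }
  }
  return owned;
}
```

Copying ByteArrayView structs with std::copy only duplicates the pointer and the length, which is why an intermediate buffer of views would not help in this model; the bytes themselves have to be copied.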
> 
> If that does not solve your problem, feel free to ask more ;)
> 
> Uwe
> 
> On Fri, Nov 3, 2017, at 12:07 AM, William Malpica wrote:
> > Hello,
> > 
> > I am trying to use TypedColumnReader<DType>::ReadBatchSpaced to read
> > ByteArrays.
> > 
> > In my use case, I am only reading flat data, but it is data with nulls,
> > which is why I am using ReadBatchSpaced: it's the only way I have found
> > to read the data and also know which values are null. My code looks
> > something like this:
> > 
> > int64_t total_values_read = 0;
> > 
> > int64_t valid_bits_offset = 0;
> > int64_t levels_read = 0;
> > int64_t values_read = 0;
> > int64_t null_count = -1;
> > 
> > std::vector<parquet::ByteArray> values(numRecords);
> > std::vector<int16_t> dresult(numRecords, -1);
> > std::vector<int16_t> rresult(numRecords, -1);
> > std::vector<uint8_t> valid_bits(numRecords, 255);
> > 
> > while (total_values_read < numRecords){
> >   int64_t rows_read = parquetTypeReader->ReadBatchSpaced(numRecords,
> >       dresult.begin() + total_values_read,
> >       rresult.data() + total_values_read,
> >       values.begin() + total_values_read,
> >       valid_bits.data() + total_values_read,
> >       valid_bits_offset, &levels_read, &values_read, &null_count);
> > 
> >   total_values_read += rows_read;
> > }
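
A note on the valid_bits argument used above: as far as I understand the parquet-cpp header, valid_bits is a bitmap (one bit per value, least-significant bit first, as in Arrow) and valid_bits_offset is an offset in bits, not bytes. Under that assumption, a minimal sketch of how to size and query such a bitmap follows. The helper names BytesForBits, IsValid, and NullPositions are hypothetical and not part of parquet-cpp.

```cpp
#include <cstdint>
#include <vector>

// Number of bytes needed to hold num_bits validity bits.
inline int64_t BytesForBits(int64_t num_bits) { return (num_bits + 7) / 8; }

// True when bit i of the bitmap is set (LSB-first within each byte),
// meaning value i is present rather than null.
inline bool IsValid(const uint8_t* bits, int64_t i) {
  return ((bits[i / 8] >> (i % 8)) & 1) != 0;
}

// Returns the positions (relative to bit_offset) of the null slots among
// the n values whose validity bits start at bit_offset.
std::vector<int64_t> NullPositions(const uint8_t* valid_bits,
                                   int64_t bit_offset, int64_t n) {
  std::vector<int64_t> nulls;
  for (int64_t i = 0; i < n; ++i) {
    if (!IsValid(valid_bits, bit_offset + i)) nulls.push_back(i);
  }
  return nulls;
}
```

With this layout, a loop that reads in several batches would advance the bit offset by the number of values already read instead of advancing the valid_bits pointer by that many bytes.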
> > 
> > When I follow this pattern and need to make multiple calls to
> > ReadBatchSpaced, I can get garbage results in my vector of ByteArrays
> > after the first call in the loop. When I read a more primitive data
> > type, I do not have this issue.
> > So far the only way I have been able to get this to work is by using an
> > intermediary buffer to hold the ByteArray data, which looks something
> > like this:
> > 
> > std::vector<parquet::ByteArray> values(numRecords);
> > std::vector<parquet::ByteArray> intermediateBuffer(numRecords);
> > std::vector<int16_t> dresult(numRecords, -1);
> > std::vector<int16_t> rresult(numRecords, -1);
> > std::vector<uint8_t> valid_bits(numRecords, 255);
> > 
> > while (total_values_read < numRecords){
> >   int64_t rows_read = parquetTypeReader->ReadBatchSpaced(numRecords,
> >       dresult.begin() + total_values_read,
> >       rresult.data() + total_values_read,
> >       &(intermediateBuffer[0]),
> >       valid_bits.data() + total_values_read,
> >       valid_bits_offset, &levels_read, &values_read, &null_count);
> > 
> >   std::copy(intermediateBuffer.begin(),
> >       intermediateBuffer.begin() + rows_read,
> >       values.begin() + total_values_read);
> > 
> >   total_values_read += rows_read;
> > }
> > 
> > 
> > Any ideas as to what I am doing incorrectly in my first example? Do I
> > always need to use an intermediate buffer?
> > 
> > Thanks!
> > 
> > William
> > 
> > 
> > William Malpica / VP of Engineering
> > william@blazingdb.com / 859.619.0708
> > 
> > BlazingDB
> > www.blazingdb.com
>