Posted to dev@parquet.apache.org by Jayjeet Chakraborty <ja...@gmail.com> on 2021/01/05 19:38:51 UTC
Query on incoherent total_byte_size and offset difference calculation results
I am using Apache Arrow to write Parquet files. I am writing an uncompressed, non-dictionary-encoded Parquet file using pyarrow.parquet, but the offsets do not line up when inspected using parquet tools. For example, when I add a row group's offset to its size, the result does not equal the offset of the next row group. Can anyone tell me why this is happening? Also, the difference between the offsets of consecutive row groups is not constant. I can see previously written Parquet files with BIT-PACKED encoding, and in those files the offset/size math works out exactly. I am wondering how to write Parquet files with a similar BIT-PACKED type of encoding now that BIT-PACKED encoding is deprecated. Thanks a lot.
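The check described above (row group offset plus row group size compared against the next row group's offset) can be sketched as follows. This is only an illustration, not code from the original post: `RowGroupSpan`, `offset_gaps`, and the numbers in `example` are all hypothetical, and the values would have to be copied from whatever the inspection tool reports for each row group.

```python
from typing import List, NamedTuple


class RowGroupSpan(NamedTuple):
    """Hypothetical holder for the two footer values being compared."""
    offset: int           # byte offset at which the row group's data starts
    total_byte_size: int  # total_byte_size reported for the row group


def offset_gaps(groups: List[RowGroupSpan]) -> List[int]:
    """For each adjacent pair, return next.offset - (cur.offset + cur.total_byte_size).

    A gap of 0 means the arithmetic lines up exactly; a non-zero gap means
    bytes lie between the two row groups that total_byte_size does not count.
    """
    gaps = []
    for cur, nxt in zip(groups, groups[1:]):
        gaps.append(nxt.offset - (cur.offset + cur.total_byte_size))
    return gaps


# Made-up numbers for illustration only; they do not come from a real file.
example = [
    RowGroupSpan(offset=4, total_byte_size=1000),
    RowGroupSpan(offset=1004, total_byte_size=1000),
    RowGroupSpan(offset=2050, total_byte_size=900),
]
print(offset_gaps(example))  # [0, 46]
```

Printing the per-pair gaps rather than a single pass/fail makes it easier to see whether the mismatch is constant or varies from one row group to the next. Note that the Parquet format defines total_byte_size as the total size of the uncompressed column data in a row group, so a non-zero gap does not by itself mean the file is corrupt.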
Re: Query on incoherent total_byte_size and offset difference calculation results
Posted by Micah Kornfield <em...@gmail.com>.
Hi Jayjeet,
It isn't clear from your description whether the files being produced are
corrupt or can be read but do not match your expectations. Either way some
sample code and a more detailed explanation would be helpful in trying to
figure out where the problem is.
Thanks,
Micah
On Tue, Jan 5, 2021 at 2:17 PM Jayjeet Chakraborty <jayjeetchakraborty25@gmail.com> wrote:
> I am using Apache Arrow to write Parquet files. I am writing an
> uncompressed, non-dictionary-encoded Parquet file using pyarrow.parquet,
> but the offsets do not line up when inspected using parquet tools.
> For example, when I add a row group's offset to its size, the result
> does not equal the offset of the next row group. Can anyone tell me
> why this is happening? Also, the difference between the offsets of
> consecutive row groups is not constant. I can see previously written
> Parquet files with BIT-PACKED encoding, and in those files the
> offset/size math works out exactly. I am wondering how to write Parquet
> files with a similar BIT-PACKED type of encoding now that BIT-PACKED
> encoding is deprecated. Thanks a lot.