Posted to dev@parquet.apache.org by Jayjeet Chakraborty <ja...@gmail.com> on 2021/01/05 19:38:51 UTC

Query on incoherent total_byte_size and offset difference calculation results

I am using Apache Arrow to write Parquet files. I am writing an uncompressed, non-dictionary-encoded Parquet file using pyarrow.parquet, but the offsets do not line up when inspected with parquet-tools. For example, when I add a row group's offset to its size (total_byte_size), the sum does not equal the offset of the next row group. Can anyone tell me why this is happening? Also, the difference between the offsets of consecutive row groups is not constant. I have seen previously written Parquet files that use BIT-PACKED encoding, and in those files the offset/size math works out exactly. How can I write Parquet files with a similar BIT-PACKED-style encoding now that BIT-PACKED encoding is deprecated? Thanks a lot.
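For illustration, the offset/size check described above can be sketched roughly as follows. This is a minimal sketch, not the exact code behind the numbers in this message: the helper names row_group_gaps and read_row_groups are hypothetical, the sample offsets are made up, and it assumes a row group "starts" at the first page of its first column chunk. The pyarrow metadata attributes used (total_byte_size, data_page_offset, dictionary_page_offset) come from pyarrow.parquet.

```python
from typing import List, Tuple


def row_group_gaps(groups: List[Tuple[int, int]]) -> List[int]:
    """For consecutive (start_offset, total_byte_size) pairs, return
    next_start - (start + size). A gap of 0 means the two row groups
    are exactly contiguous according to this arithmetic."""
    gaps = []
    for (start, size), (next_start, _) in zip(groups, groups[1:]):
        gaps.append(next_start - (start + size))
    return gaps


def read_row_groups(path: str) -> List[Tuple[int, int]]:
    """Collect (start_offset, total_byte_size) for each row group.

    Assumption: the start of a row group is taken to be the first page
    of its first column chunk (the dictionary page if one exists,
    otherwise the first data page)."""
    import pyarrow.parquet as pq  # imported lazily; only needed for real files

    meta = pq.ParquetFile(path).metadata
    out = []
    for i in range(meta.num_row_groups):
        rg = meta.row_group(i)
        col = rg.column(0)
        start = col.dictionary_page_offset or col.data_page_offset
        out.append((start, rg.total_byte_size))
    return out


# Made-up numbers, purely to show the arithmetic:
# the first two groups are contiguous, the third starts 46 bytes late.
sample = [(4, 1000), (1004, 1000), (2050, 900)]
print(row_group_gaps(sample))
```

On a real file, row_group_gaps(read_row_groups("example.parquet")) would list the per-row-group discrepancy, where "example.parquet" is a placeholder path.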



Re: Query on incoherent total_byte_size and offset difference calculation results

Posted by Micah Kornfield <em...@gmail.com>.
Hi Jayjeet,
It isn't clear from your description whether the files being produced are
corrupt or can be read but do not match your expectations.  Either way some
sample code and a more detailed explanation would be helpful in trying to
figure out where the problem is.

Thanks,
Micah

On Tue, Jan 5, 2021 at 2:17 PM Jayjeet Chakraborty <
jayjeetchakraborty25@gmail.com> wrote:

> I am using  Apache Arrow to write Parquet files. I am writing an
> uncompressed and non dictionary-encoded parquet file using pyarrow.parquet
> but the offsets are not well aligned when inspected using parquet tools.
> For example when I add up the row group offset with the row group size it
> does not come up to the row group offset of the new rowgroup. Can anyone
> tell why this is happening ? Also the difference between different row
> groups is not constant. I can see previously written parquet files with
> BIT-PACKED encoding and in those files the offset/size math is perfect. I
> am wondering how to write parquet files with similar BIT-PACKED type
> encoding now (when BIT-PACKED encoding is deprecated) ? Thanks a lot