Posted to dev@parquet.apache.org by Wes McKinney <we...@gmail.com> on 2017/09/02 13:33:56 UTC

Re: PARQUET format

hi Joerg,

A ColumnChunk consists of multiple data pages -- the maximum size of
data pages is configurable in many Parquet writer implementations. So
there is a page header for each data page. See
https://github.com/apache/parquet-format#column-chunks

I think the default size is 1MB in parquet-mr:
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java#L65

Same in C++: https://github.com/apache/parquet-cpp/blob/master/src/parquet/properties.h#L81
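With a 1 MiB default page size, the page count for a chunk of plain int32 values can be sketched like this (a rough illustration assuming uncompressed PLAIN encoding and that every page is filled to the limit; real writers also count encoding overhead, so the exact boundaries differ slightly):

```java
public class PageCountSketch {
    public static void main(String[] args) {
        long pageSizeBytes = 1L << 20;  // 1 MiB default data page size
        long valueCount = 2_000_000;    // int32 values in the column
        long bytesPerValue = 4;         // PLAIN-encoded int32

        long valuesPerPage = pageSizeBytes / bytesPerValue;            // 262144
        long pages = (valueCount + valuesPerPage - 1) / valuesPerPage; // ceil
        System.out.println(valuesPerPage + " values/page, " + pages + " pages");
    }
}
```

Eight data pages means eight page headers interleaved in the column chunk, which is roughly where the extra bytes beyond the raw 8,000,000 come from.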

- Wes

On Sat, Sep 2, 2017 at 2:47 AM, Jörg Anders <jr...@ymail.com> wrote:
> Thank you for this helpful answer. I solved it differently:
> I solved it by re-serializing the PageHeader and counting the bytes created.
>
> Question: Is the PageHeader repeated every 1 MByte of data? If so, it
> seems to be undocumented(?)
>
> I assume this because I created a PARQUET file with 1 RowGroup and 1 Column
> of uncompressed INT32 values. It works as long as I write less than 1 MByte
> using the apache-parquet ParquetWriter.
> If I write 2000000 int32 --> 8000000 bytes, the column has 8000285 bytes.
> ColumnMetaData.num_values says: 2000000
> But ColumnMetaData.total_compressed_size =
> ColumnMetaData.total_uncompressed_size = 8000326
> If I subtract the 41-byte PageHeader --> 8000285 bytes. The start position is
> apparently correct.
>
> I wrote bytes 1, 2, 3, 4, 5, ... Thus I could find out at which column offset
> the additional bytes are situated (see below). Could anybody please confirm
> that this is the page header, apparently repeated every 1 MByte?
> Offset decimal, values Hex:
>
> at column offset = 1048584: 15, 0, 15, 88, 80, 80, 1, 15, 88, 80, 80, 1, 2c,
> 15, 82, 80, 20, 15, 0, 15, 8, 15, 8, 1c, 18, 4, 7c, 7d, 7e, 7f, 18, 4, 80,
> 81, 82, 83, 16, 0, 0, 0, 0,
> at column offset = 2097205: 15, 0, 15, 88, 80, 80, 1, 15, 88, 80, 80, 1, 2c,
> 15, 82, 80, 20, 15, 0, 15, 8, 15, 8, 1c, 18, 4, 7c, 7d, 7e, 7f, 18, 4, 80,
> 81, 82, 83, 16, 0, 0, 0, 0,
> at column offset = 3145826: 15, 0, 15, 88, 80, 80, 1, 15, 88, 80, 80, 1, 2c,
> 15, 82, 80, 20, 15, 0, 15, 8, 15, 8, 1c, 18, 4, 7c, 7d, 7e, 7f, 18, 4, 80,
> 81, 82, 83, 16, 0, 0, 0, 0,
> at column offset = 4194447: 15, 0, 15, 88, 80, 80, 1, 15, 88, 80, 80, 1, 2c,
> 15, 82, 80, 20, 15, 0, 15, 8, 15, 8, 1c, 18, 4, 7c, 7d, 7e, 7f, 18, 4, 80,
> 81, 82, 83, 16, 0, 0, 0, 0, 18,
> at column offset = 5243068: 15, 0, 15, 88, 80, 80, 1, 15, 88, 80, 80, 1, 2c,
> 15, 82, 80, 20, 15, 0, 15, 8, 15, 8, 1c, 18, 4, 7c, 7d, 7e, 7f, 18, 4, 80,
> 81, 82, 83, 16, 0, 0, 0, 0, 1c,
> at column offset = 6291689: 15, 0, 15, 88, 80, 80, 1, 15, 88, 80, 80, 1, 2c,
> 15, 82, 80, 20, 15, 0, 15, 8, 15, 8, 1c, 18, 4, 7c, 7d, 7e, 7f, 18, 4, 80,
> 81, 82, 83, 16, 0, 0, 0, 0, 20,
>
>
>
>
>
> Wes McKinney <we...@gmail.com> wrote on Tuesday, 29 August 2017 at
> 21:24:
>
>
>
>
> hi Joerg,
>
> Adding back dev@
>
> I suggest that you build the Parquet C++ library and step through this
> portion of the code in gdb so that you can inspect each of the places
> where the file position is advanced.
>
> Here is the line where the Thrift header is deserialized:
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/file/reader-internal.cc#L77
>
> The number of bytes consumed by Thrift is then used to advance the
> stream. I don't think there's any way to get around this.
>
> Hope this helps,
> Wes
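One way to recover "bytes consumed" in Java, assuming you drive the Thrift deserializer from a stream you control, is to wrap the input in a counting stream and compare positions before and after deserialization. A minimal sketch of the counting part (plain java.io, no Thrift dependency; the class and wiring here are illustrative, not Thrift API):

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Counts every byte handed out, so position-after minus position-before
// gives the exact size of whatever a deserializer consumed.
class CountingInputStream extends FilterInputStream {
    long count = 0;

    CountingInputStream(InputStream in) { super(in); }

    @Override public int read() throws IOException {
        int b = super.read();
        if (b >= 0) count++;
        return b;
    }

    @Override public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) count += n;
        return n;
    }

    public static void main(String[] args) throws IOException {
        CountingInputStream in =
            new CountingInputStream(new ByteArrayInputStream(new byte[64]));
        in.read(new byte[41]);  // e.g. a Thrift decoder pulls 41 header bytes
        System.out.println(in.count);
    }
}
```

The page data would then start at the chunk offset plus count, consistent with the 41-byte header implied by the numbers earlier in the thread.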
>
> On Tue, Aug 29, 2017 at 4:30 AM, Jörg Anders <jr...@ymail.com> wrote:
>> Thank you for this very helpful e-mail. But I'm afraid I have a more basic
>> misunderstanding.
>> I produced a PARQUET file (with apache parquet and Java) with only one
>> column and int32 values 1, 2, 3, 4.
>>
>> If I read the column chunk from the file, starting at chunk.file_offset and
>> reading colmetadata.total_compressed_size bytes, I get
>>
>> 0x15, 0x0, 0x15, 0x20, 0x15, 0x20, 0x2c, 0x15, 0x8, 0x15, 0x0, 0x15, 0x8,
>> 0x15, 0x8, 0x1c, 0x18, 0x4, 0x4, 0x0, 0x0, 0x0, 0x18, 0x4, 0x1, 0x0, 0x0,
>> 0x0, 0x16, 0x0, 0x0,
>> 0x0, 0x0,
>> 0x1, 0x0, 0x0, 0x0, 0x2, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x4, 0x0, 0x0,
>> 0x0,
>>
>> Ok, the last line contains the 4 values. Because the schema is not nested,
>> all repetition and definition levels are 0. I assume the 2 "arrays" are the
>> bytes in the 3rd line.
>>
>> I further assume the first 2 lines are the PageHeader, including the
>> DataPageHeader, Thrift-compact-encoded.
>>
>> And this is the problem: until now I give Thrift, say, 64 bytes.
>>
>> byte[] bf = new byte[64];
>> TMemoryBuffer buffer = new TMemoryBuffer(64);
>>
>> inp.seek(chunk.file_offset);
>>
>> inp.read(bf);  // read the page header bytes (64 is deliberately too many)
>> buffer.write(bf);
>> // convert into Java objects by means of the Thrift compact protocol
>> TCompactProtocol prot = new TCompactProtocol(buffer);
>> PageHeader phead = new PageHeader();  // empty object
>> phead.read(prot);  // fill with values
>>
>>
>> I know 64 bytes is too much, but it works up to a certain number of values.
>>
>> Now I plan to skip the page header. To do this I have to know how much
>> data is consumed by Thrift to reconstruct the PageHeader.
>>
>> I assume I misunderstood this part of PARQUET completely. Could anybody
>> please help?
>>
>> A pointer to some web pages or source code would suffice.
>>
>>
>> Joerg
>>
>>
>>
>>
>>
>>
>>
>> Wes McKinney <we...@gmail.com> wrote on Monday, 28 August 2017 at
>> 17:41:
>>
>>
>> hi Joerg,
>>
>> You can see the logic where we initialize the hybrid RLE-bitpacked
>> decoder in the C++ library here:
>>
>>
>> https://github.com/apache/parquet-cpp/blob/master/src/parquet/column_reader.cc#L38
>>
>> * The bit width of values is determined by the maximum repetition or
>> definition level based on the schema. So for non-nested fields (max
>> definition level 1 or 0) the bit width is 1
>> * The number of bytes in the encoded levels is in the first 4 bytes,
>> so if you want to skip the levels, you can decode the length (see line
>> 46) and then skip that many bytes in the stream
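In Java terms, skipping the levels as described could look like this (a sketch assuming a v1 data page with RLE/bit-packed-hybrid levels, which the format prefixes with a 4-byte little-endian byte count; the names here are illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class SkipLevels {
    // Given page bytes positioned at the start of an encoded levels run,
    // return the offset of the first byte after the levels.
    static int skipLevels(byte[] page, int pos) {
        int len = ByteBuffer.wrap(page, pos, 4)
                            .order(ByteOrder.LITTLE_ENDIAN)
                            .getInt();  // u32 byte count of the encoded levels
        return pos + 4 + len;           // jump over the levels block
    }

    public static void main(String[] args) {
        // 4-byte length = 2, then two level bytes, then the values begin.
        byte[] page = {2, 0, 0, 0, 0x13, 0x05, 0x01, 0x00};
        System.out.println(skipLevels(page, 0));
    }
}
```
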
>>
>> - Wes
>>
>> On Mon, Aug 28, 2017 at 10:43 AM, Jörg Anders
>> <jr...@ymail.com.invalid> wrote:
>>> Hi all!
>>> For some reason I have to know much more about the Parquet format.
>>> Meanwhile I know the column chunk can be found at
>>> FileMetaData.row_groups[x].columns[x].file_offset. If I comprehended
>>> everything right, at file_offset starts the repetition level array,
>>> followed by the definition level array, followed by the column values.
>>> Is there a way to skip both arrays and to go directly to the values?
>>> How long are both arrays? I would read them. But how? All I know is:
>>> they are bit-packed. So I expect there is a MAX_VALUE field or perhaps a
>>> BITS_PER_VALUE field, and if it is 3 then
>>> 7 2 1 0 3
>>> is coded as
>>> 111010001000011
>>> But where is MAX_VALUE or BITS_PER_VALUE?
>>> A URL or pointer to a line in some C(++) or Java code would suffice.
>>> Thank you in advance
>>> Jörg
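As Wes notes above, the bit width is derived from the maximum level in the schema rather than stored as a MAX_VALUE field. A sketch of the computation, reproducing the 3-bit example from the question (note: this concatenates bits MSB-first for readability; Parquet's real bit-packed encoding fills each byte LSB-first):

```java
public class BitWidth {
    // Bits needed for level values in [0, max]: ceil(log2(max + 1)).
    static int bitWidth(int max) {
        return 32 - Integer.numberOfLeadingZeros(max);
    }

    // Conceptual MSB-first packing, matching the example in the question.
    static String pack(int[] levels, int width) {
        StringBuilder out = new StringBuilder();
        for (int v : levels)
            for (int bit = width - 1; bit >= 0; bit--)
                out.append((v >> bit) & 1);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(bitWidth(7));                        // 3
        System.out.println(pack(new int[]{7, 2, 1, 0, 3}, 3));  // 111010001000011
    }
}
```
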
>>
>>
>
>