Posted to dev@parquet.apache.org by Nezih Yigitbasi <ny...@netflix.com.INVALID> on 2015/06/18 21:50:40 UTC

problem reading parquet file

Hi all,

I have generated some test data using the method here
<https://github.com/apache/parquet-mr/blob/master/parquet-benchmarks/src/main/java/org/apache/parquet/benchmarks/DataGenerator.java#L68>.
What I notice is that if I use WriterVersion.PARQUET_2_0, the default block and
page sizes, and GZIP compression (test case 1 below), I cannot read the file
with parquet-tools dump (see stack trace below). When I switch to
PARQUET_1_0 (test case 2 below), I can use the dump tool to read the data.
Weirdly enough, when I reduce the number of rows I create to 1K and use the
PARQUET_2_0 writer again (test case 3), dump still fails, but with a different
exception.

Are these known issues?

Nezih

Test Case 1 [FAILS]

WriterVersion.PARQUET_2_0
default block and page size
GZIP compression
1M rows

Schema:

file schema:   test
--------------------------------------------------------------------------------
binary_field:  REQUIRED BINARY R:0 D:0
int32_field:   REQUIRED INT32 R:0 D:0
int64_field:   REQUIRED INT64 R:0 D:0
boolean_field: REQUIRED BOOLEAN R:0 D:0
float_field:   REQUIRED FLOAT R:0 D:0
double_field:  REQUIRED DOUBLE R:0 D:0
flba_field:    REQUIRED FIXED_LEN_BYTE_ARRAY R:0 D:0
int96_field:   REQUIRED INT96 R:0 D:0

row group 1:   RC:1000000 TS:38744008 OFFSET:4
--------------------------------------------------------------------------------
binary_field:   BINARY GZIP DO:0 FPO:4 SZ:20683253/36526089/1.77 VC:1000000 ENC:DELTA_BYTE_ARRAY
int32_field:    INT32 GZIP DO:0 FPO:20683257 SZ:524/39330/75.06 VC:1000000 ENC:DELTA_BINARY_PACKED
int64_field:    INT64 GZIP DO:0 FPO:20683781 SZ:693/498/0.72 VC:1000000 ENC:PLAIN,RLE_DICTIONARY
boolean_field:  BOOLEAN GZIP DO:0 FPO:20684474 SZ:63/43/0.68 VC:1000000 ENC:RLE
float_field:    FLOAT GZIP DO:0 FPO:20684537 SZ:362/242/0.67 VC:1000000 ENC:PLAIN,RLE_DICTIONARY
double_field:   DOUBLE GZIP DO:0 FPO:20684899 SZ:694/498/0.72 VC:1000000 ENC:PLAIN,RLE_DICTIONARY
flba_field:     FIXED_LEN_BYTE_ARRAY GZIP DO:0 FPO:20685593 SZ:2118034/2176246/1.03 VC:1000000 ENC:DELTA_BYTE_ARRAY
int96_field:    INT96 GZIP DO:0 FPO:22803627 SZ:1413/1062/0.75 VC:1000000 ENC:PLAIN,RLE_DICTIONARY
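
A note on reading the size fields in the dump above. The numbers are consistent with SZ being compressed/uncompressed/ratio, with TS being the sum of the uncompressed column sizes, and with each FPO being the previous FPO plus the previous compressed size (starting at 4, after the file magic). That reading is inferred from the figures themselves rather than taken from the parquet-tools documentation, so treat it as an assumption. A minimal Java sketch that cross-checks it (the class and method names are made up for illustration):

```java
// Sketch: cross-check the size fields in the Test Case 1 dump.
// Assumption (inferred from the numbers, not from parquet-tools docs):
// SZ is compressed/uncompressed/ratio, TS is the sum of uncompressed sizes,
// and each FPO is the previous FPO plus the previous compressed size.
final class DumpSizeCheck {
    // {compressed, uncompressed} per column, copied from the dump.
    static final long[][] SZ = {
        {20683253L, 36526089L}, // binary_field
        {524L, 39330L},         // int32_field
        {693L, 498L},           // int64_field
        {63L, 43L},             // boolean_field
        {362L, 242L},           // float_field
        {694L, 498L},           // double_field
        {2118034L, 2176246L},   // flba_field
        {1413L, 1062L},         // int96_field
    };

    // Sum of the uncompressed sizes; compare with TS in the row group line.
    static long totalUncompressed() {
        long sum = 0;
        for (long[] column : SZ) {
            sum += column[1];
        }
        return sum;
    }

    // Start offset of column `index` if chunks are laid out back to back
    // after the 4-byte "PAR1" magic; compare with FPO.
    static long offsetOf(int index) {
        long offset = 4;
        for (int i = 0; i < index; i++) {
            offset += SZ[i][0];
        }
        return offset;
    }

    // Uncompressed divided by compressed, formatted to two decimals like the dump.
    static String ratio(int index) {
        double r = (double) SZ[index][1] / SZ[index][0];
        return String.format(java.util.Locale.ROOT, "%.2f", r);
    }

    public static void main(String[] args) {
        System.out.println("TS                 = " + totalUncompressed());
        System.out.println("FPO of int96_field = " + offsetOf(7));
        System.out.println("binary_field ratio = " + ratio(0));
    }
}
```

Under this reading, a ratio below 1 (for example int64_field at 0.72) means GZIP made an already tiny column chunk slightly larger.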

parquet-tools dump fails with:

value 377601: R:0 D:0 V:parquet.io.ParquetDecodingException: Can't read value in column [binary_field] BINARY at value 377601 out of 1000000, 1 out of 23600 in currentPage. repetition level: 0, definition level: 0
    at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:462)
    at parquet.column.impl.ColumnReaderImpl.getBinary(ColumnReaderImpl.java:410)
    at parquet.tools.command.DumpCommand.dump(DumpCommand.java:288)
    at parquet.tools.command.DumpCommand.dump(DumpCommand.java:215)
    at parquet.tools.command.DumpCommand.execute(DumpCommand.java:136)
    at parquet.tools.Main.main(Main.java:219)
Caused by: java.lang.ArrayIndexOutOfBoundsException
    at parquet.column.values.deltastrings.DeltaByteArrayReader.readBytes(DeltaByteArrayReader.java:70)
    at parquet.column.impl.ColumnReaderImpl$2$6.read(ColumnReaderImpl.java:307)
    at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:458)
    ... 5 more
Can't read value in column [binary_field] BINARY at value 377601 out of 1000000, 1 out of 23600 in currentPage. repetition level: 0, definition level: 0

Test Case 2 [SUCCEEDS]

WriterVersion.PARQUET_1_0
default block and page size
GZIP compression
1M rows

Schema:

file schema:   test
--------------------------------------------------------------------------------
binary_field:  REQUIRED BINARY R:0 D:0
int32_field:   REQUIRED INT32 R:0 D:0
int64_field:   REQUIRED INT64 R:0 D:0
boolean_field: REQUIRED BOOLEAN R:0 D:0
float_field:   REQUIRED FLOAT R:0 D:0
double_field:  REQUIRED DOUBLE R:0 D:0
flba_field:    REQUIRED FIXED_LEN_BYTE_ARRAY R:0 D:0
int96_field:   REQUIRED INT96 R:0 D:0

row group 1:   RC:1000000 TS:1070161196 OFFSET:4
--------------------------------------------------------------------------------
binary_field:   BINARY GZIP DO:0 FPO:4 SZ:21862183/40004054/1.83 VC:1000000 ENC:PLAIN,BIT_PACKED
int32_field:    INT32 GZIP DO:0 FPO:21862187 SZ:1383313/4000159/2.89 VC:1000000 ENC:PLAIN,BIT_PACKED
int64_field:    INT64 GZIP DO:0 FPO:23245500 SZ:572/397/0.69 VC:1000000 ENC:PLAIN_DICTIONARY,BIT_PACKED
boolean_field:  BOOLEAN GZIP DO:0 FPO:23246072 SZ:188/125032/665.06 VC:1000000 ENC:PLAIN,BIT_PACKED
float_field:    FLOAT GZIP DO:0 FPO:23246260 SZ:273/173/0.63 VC:1000000 ENC:PLAIN_DICTIONARY,BIT_PACKED
double_field:   DOUBLE GZIP DO:0 FPO:23246533 SZ:573/397/0.69 VC:1000000 ENC:PLAIN_DICTIONARY,BIT_PACKED
flba_field:     FIXED_LEN_BYTE_ARRAY GZIP DO:0 FPO:23247106 SZ:3057410/1026030079/335.59 VC:1000000 ENC:PLAIN,BIT_PACKED
int96_field:    INT96 GZIP DO:0 FPO:26304516 SZ:1236/905/0.73 VC:1000000 ENC:PLAIN_DICTIONARY,BIT_PACKED
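
For the fixed-width columns in the PARQUET_1_0 dump above, the uncompressed sizes line up with PLAIN encoding: 4 bytes per INT32 value and 1 bit per BOOLEAN value. The few bytes left over are presumably page-header overhead; that attribution is an assumption, not something the dump states. A small Java sketch of the arithmetic (names are illustrative only):

```java
// Sketch: estimate PLAIN-encoded payload sizes and compare with the
// uncompressed sizes reported in the Test Case 2 dump.
// Assumption: the gap between estimate and report is page-header overhead.
final class PlainSizeEstimate {
    static final long ROWS = 1_000_000L;

    // PLAIN INT32 stores 4 bytes per value.
    static long plainInt32Bytes(long rows) {
        return rows * 4;
    }

    // PLAIN BOOLEAN stores 1 bit per value, rounded up to whole bytes.
    static long plainBooleanBytes(long rows) {
        return (rows + 7) / 8;
    }

    public static void main(String[] args) {
        // Reported uncompressed sizes: int32_field 4000159, boolean_field 125032.
        System.out.println("int32 overhead:   " + (4000159L - plainInt32Bytes(ROWS)));
        System.out.println("boolean overhead: " + (125032L - plainBooleanBytes(ROWS)));
    }
}
```

By the same measure, the PARQUET_2_0 file in test case 1 reports only 39330 uncompressed bytes for int32_field under DELTA_BINARY_PACKED, roughly a hundredth of the PLAIN size.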

Test Case 3 [FAILS]

WriterVersion.PARQUET_2_0
default block and page size
GZIP compression
1K rows

Schema:

file schema:   test
--------------------------------------------------------------------------------
binary_field:  REQUIRED BINARY R:0 D:0
int32_field:   REQUIRED INT32 R:0 D:0
int64_field:   REQUIRED INT64 R:0 D:0
boolean_field: REQUIRED BOOLEAN R:0 D:0
float_field:   REQUIRED FLOAT R:0 D:0
double_field:  REQUIRED DOUBLE R:0 D:0
flba_field:    REQUIRED FIXED_LEN_BYTE_ARRAY R:0 D:0
int96_field:   REQUIRED INT96 R:0 D:0

row group 1:   RC:1000 TS:40502 OFFSET:4
--------------------------------------------------------------------------------
binary_field:   BINARY GZIP DO:0 FPO:4 SZ:21466/36672/1.71 VC:1000 ENC:DELTA_BYTE_ARRAY
int32_field:    INT32 GZIP DO:0 FPO:21470 SZ:70/85/1.21 VC:1000 ENC:DELTA_BINARY_PACKED
int64_field:    INT64 GZIP DO:0 FPO:21540 SZ:106/71/0.67 VC:1000 ENC:RLE_DICTIONARY,PLAIN
boolean_field:  BOOLEAN GZIP DO:0 FPO:21646 SZ:60/40/0.67 VC:1000 ENC:RLE
float_field:    FLOAT GZIP DO:0 FPO:21706 SZ:99/59/0.60 VC:1000 ENC:RLE_DICTIONARY,PLAIN
double_field:   DOUBLE GZIP DO:0 FPO:21805 SZ:107/71/0.66 VC:1000 ENC:RLE_DICTIONARY,PLAIN
flba_field:     FIXED_LEN_BYTE_ARRAY GZIP DO:0 FPO:21912 SZ:2152/3421/1.59 VC:1000 ENC:DELTA_BYTE_ARRAY
int96_field:    INT96 GZIP DO:0 FPO:24064 SZ:114/83/0.73 VC:1000 ENC:RLE_DICTIONARY,PLAIN

parquet-tools dump fails when dumping the fixed len byte array field:

FIXED_LEN_BYTE_ARRAY flba_field
--------------------------------------------------------------------------------
parquet.io.ParquetDecodingException: Encoding DELTA_BYTE_ARRAY is only supported for type BINARY
    at parquet.column.Encoding$7.getValuesReader(Encoding.java:196)
    at parquet.column.impl.ColumnReaderImpl.initDataReader(ColumnReaderImpl.java:537)
    at parquet.column.impl.ColumnReaderImpl.readPageV2(ColumnReaderImpl.java:577)
    at parquet.column.impl.ColumnReaderImpl.access$400(ColumnReaderImpl.java:57)
    at parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:521)
    at parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:513)
    at parquet.column.page.DataPageV2.accept(DataPageV2.java:141)
    at parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:513)
    at parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:505)
    at parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:607)
    at parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:351)
    at parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:66)
    at parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:61)
    at parquet.tools.command.DumpCommand.dump(DumpCommand.java:278)
    at parquet.tools.command.DumpCommand.dump(DumpCommand.java:215)
    at parquet.tools.command.DumpCommand.execute(DumpCommand.java:136)
    at parquet.tools.Main.main(Main.java:219)
Encoding DELTA_BYTE_ARRAY is only supported for type BINARY

Re: problem reading parquet file

Posted by Sergio Pena <se...@cloudera.com>.
The second bug is on https://issues.apache.org/jira/browse/PARQUET-152

The problem is that the dictionary page size is smaller than the fixed-length
byte array. Just make them equal, and you will be able to read that file.

- Sergio

On Thu, Jun 18, 2015 at 3:36 PM, Nezih Yigitbasi <
nyigitbasi@netflix.com.invalid> wrote:

> Yep I will, seemed like a bug to me too.
>
> Thanks,
> Nezih
>
> On Thu, Jun 18, 2015 at 1:33 PM, Ryan Blue <bl...@cloudera.com> wrote:
>
> > The first issue looks like the delta byte array problem:
> >
> >   https://issues.apache.org/jira/browse/PARQUET-246
> >
> > The second one looks like the write side uses delta_byte_array for fixed,
> > but the read side doesn't expect it. File a bug?
> >
> > rb
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Cloudera, Inc.
> >
>

Re: problem reading parquet file

Posted by Nezih Yigitbasi <ny...@netflix.com.INVALID>.
Yep I will, seemed like a bug to me too.

Thanks,
Nezih

On Thu, Jun 18, 2015 at 1:33 PM, Ryan Blue <bl...@cloudera.com> wrote:

> The first issue looks like the delta byte array problem:
>
>   https://issues.apache.org/jira/browse/PARQUET-246
>
> The second one looks like the write side uses delta_byte_array for fixed,
> but the read side doesn't expect it. File a bug?
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
>

Re: problem reading parquet file

Posted by Ryan Blue <bl...@cloudera.com>.
The first issue looks like the delta byte array problem:

   https://issues.apache.org/jira/browse/PARQUET-246
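
For readers following along: the first trace fails inside DeltaByteArrayReader.readBytes on the first value of a page ("1 out of 23600 in currentPage"). DELTA_BYTE_ARRAY stores each value as a prefix length shared with the previous value plus a suffix. One way an ArrayIndexOutOfBoundsException can arise in such a scheme, assuming the writer carried its previous value across a page boundary while the reader starts each page with an empty one, is sketched below in Java. This is a simplified illustration with hypothetical names, not parquet-mr code, and the ticket should be checked for the actual cause.

```java
// Simplified sketch of prefix ("front") coding in the spirit of DELTA_BYTE_ARRAY.
// NOT parquet-mr code: all names here are hypothetical, and real pages store the
// prefix lengths and suffixes in packed, delta-encoded form.
final class DeltaPrefixSketch {
    static final java.nio.charset.Charset UTF8 = java.nio.charset.StandardCharsets.UTF_8;

    // Rebuilds each value as (first prefixLength bytes of the previous value) + suffix.
    static String[] decode(int[] prefixLengths, String[] suffixes, String initialPrevious) {
        byte[] previous = initialPrevious.getBytes(UTF8);
        String[] out = new String[prefixLengths.length];
        for (int i = 0; i < prefixLengths.length; i++) {
            byte[] suffix = suffixes[i].getBytes(UTF8);
            byte[] value = new byte[prefixLengths[i] + suffix.length];
            // Reading past the end of `previous` throws ArrayIndexOutOfBoundsException.
            for (int j = 0; j < prefixLengths[i]; j++) {
                value[j] = previous[j];
            }
            for (int j = 0; j < suffix.length; j++) {
                value[prefixLengths[i] + j] = suffix[j];
            }
            out[i] = new String(value, UTF8);
            previous = value;
        }
        return out;
    }

    public static void main(String[] args) {
        // Page 1: "apple", then "apply" stored as 4 shared bytes + "y".
        String[] page1 = decode(new int[] {0, 4}, new String[] {"apple", "y"}, "");
        System.out.println(String.join(",", page1));

        // Page 2, written relative to the last value of page 1 ("apply").
        int[] prefixes = {4};
        String[] suffixes = {"et"};
        System.out.println(decode(prefixes, suffixes, "apply")[0]);

        try {
            // A reader that starts the page with no previous value.
            decode(prefixes, suffixes, "");
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("first value of page failed: " + e.getClass().getSimpleName());
        }
    }
}
```

In this sketch the page decodes correctly only when writer and reader agree on whether the previous value survives a page boundary.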

The second one looks like the write side uses delta_byte_array for 
fixed, but the read side doesn't expect it. File a bug?
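
The shape of that mismatch can be sketched in a few lines of Java: a write path that picks the delta encoding for both byte-array types, and a read path that accepts it for BINARY only. Everything below is hypothetical and only mirrors the error message from the trace; it is not how parquet-mr is structured.

```java
// Hypothetical sketch of a write/read encoding mismatch.
// None of these names are parquet-mr's; only the error text comes from the trace.
final class EncodingCheckSketch {
    enum Type { BINARY, FIXED_LEN_BYTE_ARRAY, INT32 }
    enum Encoding { PLAIN, DELTA_BYTE_ARRAY }

    // Write side: chooses the delta encoding for both byte-array types.
    static Encoding chooseEncoding(Type type) {
        switch (type) {
            case BINARY:
            case FIXED_LEN_BYTE_ARRAY:
                return Encoding.DELTA_BYTE_ARRAY;
            default:
                return Encoding.PLAIN;
        }
    }

    // Read side: accepts the delta encoding for BINARY only.
    static String openReader(Type type, Encoding encoding) {
        if (encoding == Encoding.DELTA_BYTE_ARRAY && type != Type.BINARY) {
            throw new IllegalStateException(
                "Encoding DELTA_BYTE_ARRAY is only supported for type BINARY");
        }
        return encoding + " reader for " + type;
    }

    public static void main(String[] args) {
        Type type = Type.FIXED_LEN_BYTE_ARRAY;
        Encoding written = chooseEncoding(type);
        try {
            System.out.println(openReader(type, written));
        } catch (IllegalStateException e) {
            System.out.println("read failed: " + e.getMessage());
        }
    }
}
```

Either side could be changed to close the gap: the writer could avoid the encoding for fixed-length columns, or the reader could accept it.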

rb

-- 
Ryan Blue
Software Engineer
Cloudera, Inc.