Posted to dev@parquet.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2016/12/23 17:07:58 UTC
[jira] [Commented] (PARQUET-816) [C++] Failure decoding sample dict-encoded file from parquet-compatibility project
[ https://issues.apache.org/jira/browse/PARQUET-816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15773286#comment-15773286 ]
Wes McKinney commented on PARQUET-816:
--------------------------------------
[~mrocklin] I tracked down the source of this bug.
parquet-mr 1.2.8 and lower has a bug that causes it to write incorrect column chunk metadata to the Parquet file. Impala added an explicit workaround for this (see https://github.com/apache/incubator-impala/blob/88448d1d4ab31eaaf82f764b36dc7d11d4c63c32/be/src/exec/hdfs-parquet-scanner.cc#L1227). You didn't hit this bug in the fastparquet Python implementation because it doesn't use the {{total_compressed_size}} field to read the entire column chunk into memory before beginning decoding.
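To illustrate the kind of reader-side workaround meant here, the following is a minimal sketch and not Impala's or parquet-cpp's actual code. The function names, the padding bound {{MAX_DICT_HEADER_PAD}}, and the choice to treat an unparseable parquet-mr version as affected are all assumptions made for the example. The idea is to read a few extra bytes past the recorded chunk length when the writer is known to under-report it, without reading past the end of the file.

```python
import re

# Assumed upper bound (in bytes) on how much the recorded size may be short.
# This value is illustrative, not taken from any real implementation.
MAX_DICT_HEADER_PAD = 100


def needs_size_padding(created_by):
    """Return True if the writer string looks like parquet-mr 1.2.8 or lower."""
    if not created_by.startswith("parquet-mr"):
        return False
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", created_by)
    if match is None:
        # Assumption: with no parseable version, be conservative and pad.
        return True
    return tuple(int(part) for part in match.groups()) <= (1, 2, 8)


def chunk_read_length(chunk_start, total_compressed_size, file_length, created_by):
    """Number of bytes to read for a column chunk, padded for affected writers."""
    length = total_compressed_size
    if needs_size_padding(created_by):
        bytes_remaining = file_length - (chunk_start + length)
        # Never read past the end of the file.
        length += max(0, min(MAX_DICT_HEADER_PAD, bytes_remaining))
    return length
```

Padding like this only helps if the decoder stops once it has consumed the expected number of values, so that the extra trailing bytes are ignored.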
In this particular file, the dictionary page header is 15 bytes, and the entire column chunk is:
15 (dict page header) + 277 (dictionary) + 17 (data page header) + 28 (data page) bytes, making 337 bytes.
But the metadata says the column chunk is only 322 bytes -- the dict page header size got dropped from the accounting.
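The accounting above can be checked with a few lines of Python. The byte counts are the ones quoted in this comment; the variable names are only for illustration.

```python
# Sizes in bytes of each part of the column chunk, as quoted above.
pages = [
    ("dict page header", 15),
    ("dictionary", 277),
    ("data page header", 17),
    ("data page", 28),
]

# Size recorded in the column chunk metadata (total_compressed_size).
recorded_size = 322

actual_size = sum(size for _, size in pages)
shortfall = actual_size - recorded_size

print(actual_size)  # 337
print(shortfall)    # 15, exactly the dict page header size
```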
> [C++] Failure decoding sample dict-encoded file from parquet-compatibility project
> ----------------------------------------------------------------------------------
>
> Key: PARQUET-816
> URL: https://issues.apache.org/jira/browse/PARQUET-816
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Reporter: Wes McKinney
> Attachments: nation.dict.parquet
>
>
> See attached. This throws an exception when read:
> {code}
> $ debug/parquet_reader nation.dict.parquet
> File statistics:
> Version: 1
> Created By: parquet-mr
> Total rows: 25
> Number of RowGroups: 1
> Number of Real Columns: 4
> Number of Columns: 4
> Number of Selected Columns: 4
> Column 0: nation_key (INT32)
> Column 1: name (BYTE_ARRAY)
> Column 2: region_key (INT32)
> Column 3: comment_col (BYTE_ARRAY)
> --- Row Group 0 ---
> --- Total Bytes 0 ---
> rows: 25---
> Column 0
> , values: 25 Statistics Not Set
> compression: UNCOMPRESSED, encodings:
> uncompressed size: 125, compressed size: 125
> Column 1
> , values: 25 Statistics Not Set
> compression: UNCOMPRESSED, encodings:
> uncompressed size: 322, compressed size: 322
> Column 2
> , values: 25 Statistics Not Set
> compression: UNCOMPRESSED, encodings:
> uncompressed size: 125, compressed size: 125
> Column 3
> , values: 25 Statistics Not Set
> compression: UNCOMPRESSED, encodings:
> uncompressed size: 2002, compressed size: 2002
> nation_key name region_key comment_col
> 0 Parquet error: Unexpected end of stream.
> {code}
> However, I checked that I can read this file with Impala:
> {code}
> In [13]: hdfs.put('/tmp/nation-dict-test/test.parq', 'nation.dict.parquet')
> Out[13]: '/tmp/nation-dict-test/test.parq'
> In [14]: pf = con.parquet_file('/tmp/nation-dict-test')
> In [15]: pf.execute()
> Out[15]:
> nation_key name region_key \
> 0 0 ALGERIA 0
> 1 1 ARGENTINA 1
> 2 2 BRAZIL 1
> 3 3 CANADA 1
> 4 4 EGYPT 4
> 5 5 ETHIOPIA 0
> 6 6 FRANCE 3
> 7 7 GERMANY 3
> 8 8 INDIA 2
> 9 9 INDONESIA 2
> 10 10 IRAN 4
> 11 11 IRAQ 4
> 12 12 JAPAN 2
> 13 13 JORDAN 4
> 14 14 KENYA 0
> 15 15 MOROCCO 0
> 16 16 MOZAMBIQUE 0
> 17 17 PERU 1
> 18 18 CHINA 2
> 19 19 ROMANIA 3
> 20 20 SAUDI ARABIA 4
> 21 21 VIETNAM 2
> 22 22 RUSSIA 3
> 23 23 UNITED KINGDOM 3
> 24 24 UNITED STATES 1
> comment_col
> 0 haggle. carefully final deposits detect slyly...
> 1 al foxes promise slyly according to the regula...
> 2 y alongside of the pending deposits. carefully...
> 3 eas hang ironic, silent packages. slyly regula...
> 4 y above the carefully unusual theodolites. fin...
> 5 ven packages wake quickly. regu
> 6 refully final requests. regular, ironi
> 7 l platelets. regular accounts x-ray: unusual, ...
> 8 ss excuses cajole slyly across the packages. d...
> 9 slyly express asymptotes. regular deposits ha...
> 10 efully alongside of the slyly final dependenci...
> 11 nic deposits boost atop the quickly final requ...
> 12 ously. final, express gifts cajole a
> 13 ic deposits are blithely about the carefully r...
> 14 pending excuses haggle furiously deposits. pe...
> 15 rns. blithely bold courts among the closely re...
> 16 s. ironic, unusual asymptotes wake blithely r
> 17 platelets. blithely pending dependencies use f...
> 18 c dependencies. furiously express notornis sle...
> 19 ular asymptotes are about the furious multipli...
> 20 ts. silent requests haggle. closely express pa...
> 21 hely enticingly express accounts. even, final
> 22 requests against the platelets use never acco...
> 23 eans boost carefully special requests. account...
> 24 y final packages. slow foxes cajole quickly. q...
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)