Posted to dev@parquet.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2016/12/23 17:07:58 UTC

[jira] [Commented] (PARQUET-816) [C++] Failure decoding sample dict-encoded file from parquet-compatibility project

    [ https://issues.apache.org/jira/browse/PARQUET-816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15773286#comment-15773286 ] 

Wes McKinney commented on PARQUET-816:
--------------------------------------

[~mrocklin] I tracked down the source of this bug. 

There's a bug in parquet-mr 1.2.8 and earlier that causes the column chunk metadata written to the Parquet file to be incorrect. Impala inserted an explicit workaround for this (see https://github.com/apache/incubator-impala/blob/88448d1d4ab31eaaf82f764b36dc7d11d4c63c32/be/src/exec/hdfs-parquet-scanner.cc#L1227). You didn't hit this bug in the fastparquet Python implementation because you aren't using the {{total_compressed_size}} field to read the entire column chunk into memory before beginning decoding.

In this particular file, the dictionary page header is 15 bytes, and the entire column chunk is:

15 (dict page header) + 277 (dictionary) + 17 (data page header) + 28 (data page) bytes, making 337 bytes. 

But the metadata says the column chunk is only 322 bytes -- the dict page header size got dropped from the accounting. 
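
To make the accounting concrete, here is a minimal Python sketch of the byte arithmetic above, plus one way a reader could compensate. This is not the Impala or parquet-cpp code: the {{bytes_to_read}} helper, its parameters, and the {{ASSUMED_MAX_DICT_HEADER_SIZE}} padding value are hypothetical names and numbers chosen for illustration. It also assumes the writer version is available to the reader as a dotted string.

```python
# Page sizes quoted in the comment above, in bytes.
DICT_PAGE_HEADER = 15
DICT_PAGE = 277
DATA_PAGE_HEADER = 17
DATA_PAGE = 28

# What is actually on disk for the column chunk.
actual_chunk_size = DICT_PAGE_HEADER + DICT_PAGE + DATA_PAGE_HEADER + DATA_PAGE

# What total_compressed_size reports for the "name" column (Column 1).
metadata_chunk_size = 322

# The shortfall is exactly the dictionary page header.
missing_bytes = actual_chunk_size - metadata_chunk_size

# Hypothetical upper bound on a dictionary page header, used as padding.
# The value is an assumption for this sketch, not taken from any library.
ASSUMED_MAX_DICT_HEADER_SIZE = 100


def parse_version(version):
    """Turn a dotted version string such as '1.2.8' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def bytes_to_read(total_compressed_size, writer_app, writer_version,
                  has_dictionary_page, bytes_left_in_file):
    """Return how many bytes to read for a column chunk (hypothetical helper).

    Pads the size for files written by parquet-mr 1.2.8 or earlier that have
    a dictionary page, and never reads past the end of the file.
    """
    size = total_compressed_size
    if (writer_app == "parquet-mr" and has_dictionary_page
            and parse_version(writer_version) <= (1, 2, 8)):
        size += ASSUMED_MAX_DICT_HEADER_SIZE
    return min(size, bytes_left_in_file)
```

With the numbers from this file, {{actual_chunk_size}} is 337 and {{missing_bytes}} is 15, so any padding of at least 15 bytes lets the reader buffer the whole chunk. Clamping to the bytes remaining in the file keeps the padded read from running off the end.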

> [C++] Failure decoding sample dict-encoded file from parquet-compatibility project
> ----------------------------------------------------------------------------------
>
>                 Key: PARQUET-816
>                 URL: https://issues.apache.org/jira/browse/PARQUET-816
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>            Reporter: Wes McKinney
>         Attachments: nation.dict.parquet
>
>
> See attached. This throws an exception when read:
> {code}
> $ debug/parquet_reader nation.dict.parquet 
> File statistics:
> Version: 1
> Created By: parquet-mr
> Total rows: 25
> Number of RowGroups: 1
> Number of Real Columns: 4
> Number of Columns: 4
> Number of Selected Columns: 4
> Column 0: nation_key (INT32)
> Column 1: name (BYTE_ARRAY)
> Column 2: region_key (INT32)
> Column 3: comment_col (BYTE_ARRAY)
> --- Row Group 0 ---
> --- Total Bytes 0 ---
>   rows: 25---
> Column 0
> , values: 25  Statistics Not Set
>   compression: UNCOMPRESSED, encodings: 
>   uncompressed size: 125, compressed size: 125
> Column 1
> , values: 25  Statistics Not Set
>   compression: UNCOMPRESSED, encodings: 
>   uncompressed size: 322, compressed size: 322
> Column 2
> , values: 25  Statistics Not Set
>   compression: UNCOMPRESSED, encodings: 
>   uncompressed size: 125, compressed size: 125
> Column 3
> , values: 25  Statistics Not Set
>   compression: UNCOMPRESSED, encodings: 
>   uncompressed size: 2002, compressed size: 2002
> nation_key              name                    region_key              comment_col             
> 0                       Parquet error: Unexpected end of stream.
> {code}
> However, I checked that I can read this file with Impala:
> {code}
> In [13]: hdfs.put('/tmp/nation-dict-test/test.parq', 'nation.dict.parquet')
> Out[13]: '/tmp/nation-dict-test/test.parq'
> In [14]: pf = con.parquet_file('/tmp/nation-dict-test')
> In [15]: pf.execute()
> Out[15]: 
>     nation_key            name  region_key  \
> 0            0         ALGERIA           0   
> 1            1       ARGENTINA           1   
> 2            2          BRAZIL           1   
> 3            3          CANADA           1   
> 4            4           EGYPT           4   
> 5            5        ETHIOPIA           0   
> 6            6          FRANCE           3   
> 7            7         GERMANY           3   
> 8            8           INDIA           2   
> 9            9       INDONESIA           2   
> 10          10            IRAN           4   
> 11          11            IRAQ           4   
> 12          12           JAPAN           2   
> 13          13          JORDAN           4   
> 14          14           KENYA           0   
> 15          15         MOROCCO           0   
> 16          16      MOZAMBIQUE           0   
> 17          17            PERU           1   
> 18          18           CHINA           2   
> 19          19         ROMANIA           3   
> 20          20    SAUDI ARABIA           4   
> 21          21         VIETNAM           2   
> 22          22          RUSSIA           3   
> 23          23  UNITED KINGDOM           3   
> 24          24   UNITED STATES           1   
>                                           comment_col  
> 0    haggle. carefully final deposits detect slyly...  
> 1   al foxes promise slyly according to the regula...  
> 2   y alongside of the pending deposits. carefully...  
> 3   eas hang ironic, silent packages. slyly regula...  
> 4   y above the carefully unusual theodolites. fin...  
> 5                     ven packages wake quickly. regu  
> 6              refully final requests. regular, ironi  
> 7   l platelets. regular accounts x-ray: unusual, ...  
> 8   ss excuses cajole slyly across the packages. d...  
> 9    slyly express asymptotes. regular deposits ha...  
> 10  efully alongside of the slyly final dependenci...  
> 11  nic deposits boost atop the quickly final requ...  
> 12               ously. final, express gifts cajole a  
> 13  ic deposits are blithely about the carefully r...  
> 14   pending excuses haggle furiously deposits. pe...  
> 15  rns. blithely bold courts among the closely re...  
> 16      s. ironic, unusual asymptotes wake blithely r  
> 17  platelets. blithely pending dependencies use f...  
> 18  c dependencies. furiously express notornis sle...  
> 19  ular asymptotes are about the furious multipli...  
> 20  ts. silent requests haggle. closely express pa...  
> 21     hely enticingly express accounts. even, final   
> 22   requests against the platelets use never acco...  
> 23  eans boost carefully special requests. account...  
> 24  y final packages. slow foxes cajole quickly. q...  
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)