You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by "Deneche A. Hakim (JIRA)" <ji...@apache.org> on 2015/02/25 01:32:04 UTC

[jira] [Created] (DRILL-2300) problems reading DECIMAL from parquet file

Deneche A. Hakim created DRILL-2300:
---------------------------------------

             Summary: problems reading DECIMAL from parquet file
                 Key: DRILL-2300
                 URL: https://issues.apache.org/jira/browse/DRILL-2300
             Project: Apache Drill
          Issue Type: Bug
          Components: Storage - Parquet
    Affects Versions: 0.7.0
            Reporter: Deneche A. Hakim
            Assignee: Steven Phillips
             Fix For: Future


There are several issues related to DECIMAL columns in parquet files, depending on which reader is used "simple reader" or "complex reader" and if the columns are OPTIONAL or REQUIRED.

I have the following {{test.json}} file:
{code}
{ "a": "1" }
{ "a": "1" }
{ "a": "1" }
{code}

I created a parquet file using the following query:
{noformat}
CREATE TABLE dfs.tmp.`test_all_decimal` AS 
  SELECT 
    CAST(a AS DECIMAL(9,6)) decimal9_opt, 
    CAST('1' AS DECIMAL(9,6)) decimal9_req, 
    CAST(a AS DECIMAL(18, 8)) decimal18_opt, 
    CAST('1' AS DECIMAL(18, 8)) decimal18_req, 
    CAST(a AS DECIMAL(28,10)) decimal28_col, 
    CAST('1' AS DECIMAL(28,10)) decimal28_req, 
    CAST(a AS DECIMAL(38,10)) decimal38_col, 
    CAST('1' AS DECIMAL(38,10)) decimal38_req 
  FROM dfs.data.`test_char.json`;
{noformat}

This creates a parquet file with the following metadata, retrieved using {{parquet tools}}:
{noformat}
creator:       parquet-mr 

file schema:   root 
-----------------------------------------------------------------------------------------------------
decimal9_opt:  OPTIONAL INT32 O:DECIMAL R:0 D:1
decimal9_req:  REQUIRED INT32 O:DECIMAL R:0 D:0
decimal18_opt: OPTIONAL INT64 O:DECIMAL R:0 D:1
decimal18_req: REQUIRED INT64 O:DECIMAL R:0 D:0
decimal28_col: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
decimal28_req: REQUIRED FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:0
decimal38_col: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
decimal38_req: REQUIRED FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:0

row group 1:   RC:3 TS:636 
-----------------------------------------------------------------------------------------------------
decimal9_opt:   INT32 SNAPPY DO:0 FPO:4 SZ:62/58/0.94 VC:3 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
decimal9_req:   INT32 SNAPPY DO:0 FPO:66 SZ:56/52/0.93 VC:3 ENC:PLAIN_DICTIONARY,BIT_PACKED
decimal18_opt:  INT64 SNAPPY DO:0 FPO:122 SZ:74/70/0.95 VC:3 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
decimal18_req:  INT64 SNAPPY DO:0 FPO:196 SZ:68/64/0.94 VC:3 ENC:PLAIN_DICTIONARY,BIT_PACKED
decimal28_col:  FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:264 SZ:72/91/1.26 VC:3 ENC:RLE,BIT_PACKED,PLAIN
decimal28_req:  FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:336 SZ:66/85/1.29 VC:3 ENC:BIT_PACKED,PLAIN
decimal38_col:  FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:402 SZ:80/111/1.39 VC:3 ENC:RLE,BIT_PACKED,PLAIN
decimal38_req:  FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:482 SZ:77/105/1.36 VC:3 ENC:BIT_PACKED,PLAIN
{noformat}

If we disable dictionary encoding:
{noformat}
alter session set `store.parquet.enable_dictionary_encoding` = false;
{noformat}

We will get a simlar parquet file but with {{PLAIN}} encoding instead of {{PLAIN_DICTIONARY}} for {{DECIMAL9}} and {{DECIMAL18}} columns.

When using the "simple" parquet reader, with dictionary encoding enabled,  
The following query returns wrong results for {{DECIMAL28/REQUIRED}} and {{DECIMAL38/REQUIRED}} (we can't read {{DECIMAL9}} nor {{DECIMAL18}} columns because of DRILL-2262):
{noformat}
select decimal28_col, decimal28_req, decimal38_col, decimal38_req from dfs.tmp.`test_all_decimal`;
+---------------+---------------+---------------+---------------+
| decimal28_opt | decimal28_req | decimal38_opt | decimal38_req |
+---------------+---------------+---------------+---------------+
| 1.0000000000  | 100000000.0000000000 | 1.0000000000  | 100000000.0000000000 |
| 1.0000000000  | 100000000.0000000000 | 1.0000000000  | 100000000.0000000000 |
| 1.0000000000  | 100000000.0000000000 | 1.0000000000  | 100000000.0000000000 |
+---------------+---------------+---------------+---------------+
{noformat}
 
When dictionary encoding is disabled, the following query eturns wrong results for {{DECIMAL28/REQUIRED}} and {{DECIMAL38/REQUIRED}}:
{noformat}
select decimal9_opt, decimal9_req, decimal18_opt, decimal18_req, decimal28_opt, decimal28_req, decimal38_opt, decimal38_req from dfs.tmp.`test_all_decimal_nodictionary`;
+--------------+--------------+---------------+---------------+---------------+---------------+---------------+---------------+
| decimal9_opt | decimal9_req | decimal18_opt | decimal18_req | decimal28_opt | decimal28_req | decimal38_opt | decimal38_req |
+--------------+--------------+---------------+---------------+---------------+---------------+---------------+---------------+
| 1.000000     | 1.000000     | 1.00000000    | 1.00000000    | 1.0000000000  | 100000000.0000000000 | 1.0000000000  | 100000000.0000000000 |
| 1.000000     | 1.000000     | 1.00000000    | 1.00000000    | 1.0000000000  | 100000000.0000000000 | 1.0000000000  | 100000000.0000000000 |
| 1.000000     | 1.000000     | 1.00000000    | 1.00000000    | 1.0000000000  | 100000000.0000000000 | 1.0000000000  | 100000000.0000000000 |
+--------------+--------------+---------------+---------------+---------------+---------------+---------------+---------------+
{noformat}

When using the "complex" reader:
{noformat}
alter session set `store.parquet.use_new_reader` = true;
{noformat}

We can't read {{DECIMAL28}} nor {{DECIMAL38}} because of DRILL-2220.

the following query gives wrong results for {DECIMAL9}} and {{DECIMAL18}} no matter if dictionary encoding is enabled or disabled:
{noformat}
select decimal9_opt, decimal9_req, decimal18_opt, decimal18_req from dfs.tmp.`test_all_decimal`;
+--------------+--------------+---------------+---------------+
| decimal9_opt | decimal9_req | decimal18_opt | decimal18_req |
+--------------+--------------+---------------+---------------+
| 1000000      | 1000000      | 100000000     | 100000000     |
| 1000000      | 1000000      | 100000000     | 100000000     |
| 1000000      | 1000000      | 100000000     | 100000000     |
+--------------+--------------+---------------+---------------+
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)