Posted to issues@drill.apache.org by "Adam Gilmore (JIRA)" <ji...@apache.org> on 2015/02/23 02:31:11 UTC

[jira] [Updated] (DRILL-2286) Parquet compression causes read errors

     [ https://issues.apache.org/jira/browse/DRILL-2286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Gilmore updated DRILL-2286:
--------------------------------
    Description: 
From what I can see, since compression was added to the Parquet writer, read errors can occur.

Basically, types like timestamp and decimal are stored as int64 with converted-type metadata.  It appears that when the column is compressed, the reader tries to read the int64 values into a vector of timestamp/decimal type, which causes a cast error.
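
For reference, "int64 with some metadata" at the Parquet schema level means a physical INT64 column annotated with a converted type such as DECIMAL.  The following is a minimal sketch using parquet-mr's schema builder (assuming a parquet-column artifact with the org.apache.parquet packages is on the classpath; the versions bundled with Drill at the time used the older parquet.* package names):

{code}
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.OriginalType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Types;

public class DecimalSchemaSketch {
    public static void main(String[] args) {
        // A decimal(18,8) column: physically int64, logically DECIMAL(18,8)
        // via the converted-type annotation.
        MessageType schema = Types.buildMessage()
            .optional(PrimitiveTypeName.INT64)
                .as(OriginalType.DECIMAL).precision(18).scale(8)
                .named("a")
            .named("test");

        // Printing the schema shows the column as an annotated int64,
        // e.g. "optional int64 a (DECIMAL(18,8))".
        System.out.println(schema);
    }
}
{code}

A reader therefore has to pick the in-memory vector type from the converted type, not just from the physical int64.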

Here's the JSON file I'm using:

{code}
{ "a": 1.5 }
{ "a": 3.5 }
{ "a": 1.5 }
{ "a": 2.5 }
{ "a": 1.5 }
{ "a": 5.5 }
{ "a": 1.5 }
{ "a": 6.0 }
{ "a": 1.5 }
{code}

Now create a Parquet table like so:

{code}
create table dfs.tmp.test as (select cast(a as decimal(18,8)) from dfs.tmp.`test.json`)
{code}

Querying the table then fails:

{code}
0: jdbc:drill:zk=local> select * from dfs.tmp.test;
Query failed: RemoteRpcException: Failure while running fragment., org.apache.drill.exec.vector.NullableDecimal18Vector cannot be cast to org.apache.drill.exec.vector.NullableBigIntVector [ 91e23d42-fa06-4429-b78e-3ff32352e660 on ...:31010 ]
[ 91e23d42-fa06-4429-b78e-3ff32352e660 on ...:31010 ]

Error: exception while executing query: Failure while executing query. (state=,code=0)
{code}

The same error occurs for timestamp columns.

The relevant code is in ColumnReaderFactory: when the column chunk is encoded, it creates the reader based on the physical type of the column (in this case int64) rather than the converted type (timestamp/decimal).
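
To make the failure mode concrete, here is a self-contained model of that dispatch.  The class and method names below are stand-ins for illustration only (this is not Drill's actual ColumnReaderFactory or vector code): choosing the reader from the physical type alone pairs an int64 reader with a decimal vector, and the cast fails in the same way as the query above.

{code}
public class ReaderDispatchSketch {

    // Stand-in types -- not the real Drill classes.
    interface ValueVector {}
    static class NullableBigIntVector implements ValueVector {}
    static class NullableDecimal18Vector implements ValueVector {}

    enum PhysicalType { INT64 }
    enum ConvertedType { NONE, DECIMAL }

    // Buggy dispatch: only the physical Parquet type is consulted, so a
    // decimal column (physically int64) gets a reader that casts the
    // destination vector to NullableBigIntVector.
    static void readChunk(PhysicalType physical, ValueVector dest) {
        if (physical == PhysicalType.INT64) {
            NullableBigIntVector v = (NullableBigIntVector) dest; // ClassCastException for decimal columns
            // ... decode int64 values into v ...
        }
    }

    // Dispatch that also consults the converted (logical) type.
    static void readChunkByConvertedType(PhysicalType physical, ConvertedType logical, ValueVector dest) {
        if (physical == PhysicalType.INT64 && logical == ConvertedType.DECIMAL) {
            NullableDecimal18Vector v = (NullableDecimal18Vector) dest;
            // ... decode int64 values and write them out as decimal(18,8) ...
        } else if (physical == PhysicalType.INT64) {
            NullableBigIntVector v = (NullableBigIntVector) dest;
            // ... decode plain int64 values ...
        }
    }

    public static void main(String[] args) {
        ValueVector decimalVector = new NullableDecimal18Vector();
        try {
            readChunk(PhysicalType.INT64, decimalVector);
        } catch (ClassCastException e) {
            System.out.println("physical-type dispatch: " + e); // mirrors the cast error above
        }
        readChunkByConvertedType(PhysicalType.INT64, ConvertedType.DECIMAL, decimalVector);
        System.out.println("converted-type dispatch: no cast error");
    }
}
{code}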

This is pretty severe, as compression now appears to be enabled by default.  Note that with only 1-2 records in the JSON file, the writer doesn't bother compressing and the queries then work fine.

> Parquet compression causes read errors
> --------------------------------------
>
>                 Key: DRILL-2286
>                 URL: https://issues.apache.org/jira/browse/DRILL-2286
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 0.8.0
>            Reporter: Adam Gilmore
>            Assignee: Steven Phillips
>            Priority: Critical
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)