Posted to dev@drill.apache.org by "Robert V (JIRA)" <ji...@apache.org> on 2018/03/04 15:42:00 UTC

[jira] [Created] (DRILL-6209) Spark generated Parquet file reading fails when 'store.parquet.reader.int96_as_timestamp' is used

Robert V created DRILL-6209:
-------------------------------

             Summary: Spark generated Parquet file reading fails when 'store.parquet.reader.int96_as_timestamp' is used
                 Key: DRILL-6209
                 URL: https://issues.apache.org/jira/browse/DRILL-6209
             Project: Apache Drill
          Issue Type: Bug
          Components: Storage - Parquet
    Affects Versions: 1.13.0
         Environment: * The Parquet files that failed to query were generated using Apache Spark 2.2.1 (Spark SQL library) on AWS EMR.
 * Drill was set up on a Mac OS X El Capitan system running Java 8.
            Reporter: Robert V
         Attachments: error-stacktrace.txt, successful-log.txt

Reading Parquet files that were generated by Apache Spark 2.2.1 and contain a Timestamp column might fail when the 'store.parquet.reader.int96_as_timestamp' option is enabled.

Query that fails:
{code:java}
ALTER SESSION SET `store.parquet.reader.int96_as_timestamp` = TRUE;

SELECT t.* FROM dfs.`/Workspace/Data/part-00000-3b8917e1-0bdb-4b34-90c5-1ca667e06767-c000.snappy.parquet` t;
{code}


Query that succeeds:
{code:java}
ALTER SESSION SET `store.parquet.reader.int96_as_timestamp` = FALSE;

SELECT CONVERT_FROM(t.date_time, 'TIMESTAMP_IMPALA') AS ts, t.* FROM dfs.`/Workspace/Data/part-00000-3b8917e1-0bdb-4b34-90c5-1ca667e06767-c000.snappy.parquet` t;
{code}
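
For context on what the workaround converts: the following is a minimal Python sketch of how a 12-byte INT96 timestamp is commonly laid out by Impala and Spark writers, assuming 8 little-endian bytes of nanoseconds within the day followed by 4 little-endian bytes holding the Julian day number. The function name decode_int96_timestamp is hypothetical, and this is not Drill's implementation; it only illustrates the encoding that CONVERT_FROM(..., 'TIMESTAMP_IMPALA') is expected to interpret.

```python
import struct
from datetime import datetime, timedelta, timezone

# Julian day number of 1970-01-01, the Unix epoch.
JULIAN_DAY_OF_UNIX_EPOCH = 2440588


def decode_int96_timestamp(raw: bytes) -> datetime:
    """Decode a 12-byte INT96 value into a UTC datetime (hypothetical helper)."""
    if len(raw) != 12:
        raise ValueError("an INT96 timestamp must be exactly 12 bytes")
    # Assumed layout: little-endian int64 nanoseconds of day, then uint32 Julian day.
    nanos_of_day, julian_day = struct.unpack("<qI", raw)
    days_since_epoch = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    # datetime only has microsecond resolution, so sub-microsecond digits are dropped.
    return datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(
        days=days_since_epoch, microseconds=nanos_of_day // 1000
    )


# Usage: build a raw value for Julian day 2458182 at 15:42:00 and decode it.
nanos = (15 * 3600 + 42 * 60) * 10**9
example = decode_int96_timestamp(struct.pack("<qI", nanos, 2458182))
print(example.isoformat())
```

A helper like this can be used to sanity-check individual raw values extracted from a problematic file without sharing the file itself.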


See the attached logs.
I'm not able to upload sample Parquet files because they contain sensitive information.
The Parquet files that failed were generated by an Apache Spark job on AWS EMR. Their sizes are in the range of hundreds of megabytes.
Parquet files generated by a local Spark installation could be read without errors. However, they contained only a few rows, so they are not an accurate comparison with the larger data set.
 
The bug is present in the current master branch (1.13.0 candidate version).
 
This issue is related to [DRILL-5097|https://issues.apache.org/jira/browse/DRILL-5097].


