Posted to issues@drill.apache.org by "Arina Ielchiieva (Jira)" <ji...@apache.org> on 2019/10/04 12:28:00 UTC
[jira] [Commented] (DRILL-7291) parquet with compression gzip doesn't work well

    [ https://issues.apache.org/jira/browse/DRILL-7291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944457#comment-16944457 ] 

Arina Ielchiieva commented on DRILL-7291:
-----------------------------------------

[~benj641] I checked on the latest master but could not reproduce the issue: I used the attached file and created three tables from it with different compression settings (none, snappy, gzip), and all of them returned correct results. Could you please re-check, or provide more details on how to reproduce the issue?
{noformat}
  @Test
  public void t() throws Exception {
    String sql = "select * from cp.`store/parquet/complex/0_0_0.parquet` limit 2"; //2000
    queryBuilder().sql(sql).print();

    client.alterSession(ExecConstants.OUTPUT_FORMAT_OPTION, "parquet");
    queryBuilder().sql("use dfs.tmp").run();
    client.alterSession(ExecConstants.PARQUET_WRITER_COMPRESSION_TYPE, "none");
    queryBuilder().sql("create table none_p as select * from cp.`store/parquet/complex/0_0_0.parquet`").run();
    client.alterSession(ExecConstants.PARQUET_WRITER_COMPRESSION_TYPE, "snappy");
    queryBuilder().sql("create table snappy_p as select * from cp.`store/parquet/complex/0_0_0.parquet`").run();
    client.alterSession(ExecConstants.PARQUET_WRITER_COMPRESSION_TYPE, "gzip");
    queryBuilder().sql("create table gzip_p as select * from cp.`store/parquet/complex/0_0_0.parquet`").run();

    System.out.println("none_p");
    queryBuilder().sql("select * from none_p where crc32 = 'B1251D8B'").print();

    System.out.println("snappy_p");
    queryBuilder().sql("select * from snappy_p where crc32 = 'B1251D8B'").print();

    System.out.println("gzip_p");
    queryBuilder().sql("select * from gzip_p where crc32 = 'B1251D8B'").print();
  }
{noformat}
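
When re-checking, it may help to compare the per-table counts programmatically rather than by reading the printed output. Below is a minimal sketch, assuming the counts have already been collected by whatever means is convenient. The class and method names are hypothetical and not part of Drill; the sample numbers are the `Code` = '' counts quoted in the issue description.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Drill: flags tables whose row count for a
// given predicate differs from the count of a baseline table.
class CompressionCountCheck {

  // Returns the names of all tables whose count differs from the baseline's count.
  static List<String> divergingTables(String baseline, Map<String, Long> counts) {
    Long expected = counts.get(baseline);
    if (expected == null) {
      throw new IllegalArgumentException("No count for baseline table: " + baseline);
    }
    List<String> diverging = new ArrayList<>();
    for (Map.Entry<String, Long> e : counts.entrySet()) {
      if (!expected.equals(e.getValue())) {
        diverging.add(e.getKey());
      }
    }
    return diverging;
  }

  public static void main(String[] args) {
    // Counts for WHERE `Code` = '' as reported in the issue description.
    Map<String, Long> counts = new LinkedHashMap<>();
    counts.put("file_pqt", 0L);
    counts.put("file_snappy_pqt", 0L);
    counts.put("file_gzip_pqt", 14744966L);
    System.out.println(divergingTables("file_pqt", counts)); // prints [file_gzip_pqt]
  }
}
```

An empty result would mean every table agrees with the uncompressed baseline for that predicate.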

> parquet with compression gzip doesn't work well
> -----------------------------------------------
>
>                 Key: DRILL-7291
>                 URL: https://issues.apache.org/jira/browse/DRILL-7291
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.15.0, 1.16.0
>            Reporter: benj
>            Priority: Major
>         Attachments: 0_0_0.parquet
>
>
> Creating a Parquet file with compression=gzip produces bad results.
> Example:
>  * input: file_pqt (compression=none)
> {code:sql}
> ALTER SESSION SET `store.format`='parquet';
> ALTER SESSION SET `store.parquet.compression` = 'snappy';
> CREATE TABLE ....`file_snappy_pqt` 
>  AS(SELECT * FROM ....`file_pqt`);
> ALTER SESSION SET `store.parquet.compression` = 'gzip';
> CREATE TABLE ....`file_gzip_pqt` 
>  AS(SELECT * FROM ....`file_pqt`);{code}
> Then compare the content of the different parquet files:
> {code:sql}
> ALTER SESSION SET `store.parquet.use_new_reader` = true;
> SELECT COUNT(*) FROM ....`file_pqt`;        => 15728036
> SELECT COUNT(*) FROM ....`file_snappy_pqt`; => 15728036
> SELECT COUNT(*) FROM ....`file_gzip_pqt`;   => 15728036
> => OK
> SELECT COUNT(*) FROM ....`file_pqt` WHERE `Code` = '';        => 0
> SELECT COUNT(*) FROM ....`file_snappy_pqt` WHERE `Code` = ''; => 0
> SELECT COUNT(*) FROM ....`file_gzip_pqt` WHERE `Code` = '';   => 14744966
> => NOK
> SELECT COUNT(*) FROM ....`file_pqt` WHERE `Code2` = '';        => 0
> SELECT COUNT(*) FROM ....`file_snappy_pqt` WHERE `Code2` = ''; => 0
> SELECT COUNT(*) FROM ....`file_gzip_pqt` WHERE `Code2` = '';   => 14744921
> => NOK{code}
> _(There is no NULL value in these files.)_
>  _(With exec.storage.enable_v3_text_reader=true it gives the same results.)_
> So although the gzip Parquet file contains the right number of rows, the values in its columns are not identical to those in the other files.
> Some seemingly random values in the _gzip parquet_ are reduced to empty strings.
> I think the problem is in the reader and not the writer, because:
> {code:sql}
> SELECT COUNT(*) FROM ....`file_pqt` WHERE `CRC32` = 'B33D600C';      => 2
> SELECT COUNT(*) FROM ....`file_gzip_pqt` WHERE `CRC32` = 'B33D600C'; => 0
> {code}
> but
> {code:java}
> hadoop jar parquet-tools-1.10.0.jar cat file_gzip_pqt/1_0_0.parquet | grep -c "B33D600C"
> 2019-06-12 11:45:23,738 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 3597092 records.
> 2019-06-12 11:45:23,739 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
> 2019-06-12 11:45:23,804 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
> 2019-06-12 11:45:23,805 INFO compress.CodecPool: Got brand-new decompressor [.gz]
> 2019-06-12 11:45:23,815 INFO hadoop.InternalParquetRecordReader: block read in memory in 76 ms. row count = 3597092
> 2
> {code}
> So the values are indeed present in the _Apache Parquet_ file but cannot be read via _Apache Drill_.
> Attached is an extract (the original file is 2.2 GB) that produces the same behaviour.


