You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by rajrahul <gi...@git.apache.org> on 2018/03/14 06:43:29 UTC

[GitHub] drill pull request #1166: DRILL-6016 - Fix for Error reading INT96 created b...

GitHub user rajrahul opened a pull request:

    https://github.com/apache/drill/pull/1166

    DRILL-6016 - Fix for Error reading INT96 created by Apache Spark

    This fixes DRILL-6016 where drill was failing to read int96 generated by Apache Spark even after setting store.parquet.reader.int96_as_timestamp to true.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rajrahul/drill DRILL-6016

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/1166.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1166
    
----
commit 6d74884f314638e8d95e0c6ea3431222cb308ec5
Author: Rahul Raj <ra...@...>
Date:   2018-03-14T06:35:45Z

    DRILL-6016 - Fix for Error reading INT96 created by Apache Spark

----


---

[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...

Posted by rajrahul <gi...@git.apache.org>.
Github user rajrahul commented on the issue:

    https://github.com/apache/drill/pull/1166
  
    @vdiravka I have made the changes. Please have a look.


---

[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...

Posted by rajrahul <gi...@git.apache.org>.
Github user rajrahul commented on the issue:

    https://github.com/apache/drill/pull/1166
  
    @parthchandra please use the link https://github.com/rajrahul/files/raw/master/result.tar.gz
    The files are present inside result/parquet/latest.


---

[GitHub] drill pull request #1166: DRILL-6016 - Fix for Error reading INT96 created b...

Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1166#discussion_r178071020
  
    --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java ---
    @@ -35,6 +36,7 @@
     import java.util.Map;
     
     import com.google.common.base.Joiner;
    +import mockit.integration.junit4.JMockit;
    --- End diff --
    
    the same


---

[GitHub] drill pull request #1166: DRILL-6016 - Fix for Error reading INT96 created b...

Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1166#discussion_r177154051
  
    --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java ---
    @@ -797,6 +797,24 @@ public void testImpalaParquetBinaryAsTimeStamp_DictChange() throws Exception {
         }
       }
     
    +  @Test
    +  public void testSparkParquetBinaryAsTimeStamp_DictChange() throws Exception {
    +    try {
    +      mockUtcDateTimeZone();
    --- End diff --
    
    It doesn't work without `@RunWith(JMockit.class)`. 
    Also please enable above test case `testImpalaParquetBinaryAsTimeStamp_DictChange` with the same change. And be sure that tests pass in the other time zone.


---

[GitHub] drill pull request #1166: DRILL-6016 - Fix for Error reading INT96 created b...

Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1166#discussion_r178070635
  
    --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java ---
    @@ -780,17 +783,42 @@ public void testImpalaParquetBinaryAsVarBinary_DictChange() throws Exception {
       Test the reading of a binary field as drill timestamp where data is in dictionary _and_ non-dictionary encoded pages
        */
       @Test
    -  @Ignore("relies on particular time zone, works for UTC")
       public void testImpalaParquetBinaryAsTimeStamp_DictChange() throws Exception {
         try {
           testBuilder()
               .sqlQuery("select int96_ts from dfs.`parquet/int96_dict_change` order by int96_ts")
               .optionSettingQueriesForTestQuery(
                   "alter session set `%s` = true", ExecConstants.PARQUET_READER_INT96_AS_TIMESTAMP)
               .ordered()
    -          .csvBaselineFile("testframework/testParquetReader/testInt96DictChange/q1.tsv")
    -          .baselineTypes(TypeProtos.MinorType.TIMESTAMP)
               .baselineColumns("int96_ts")
    +          .baselineValues(new DateTime(convertToLocalTimestamp("1970-01-01 00:00:01.000")))
    --- End diff --
    
    One baselineValue is enough. Please use `where` in the query.


---

[GitHub] drill pull request #1166: DRILL-6016 - Fix for Error reading INT96 created b...

Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1166#discussion_r178456861
  
    --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java ---
    @@ -61,6 +60,7 @@
     import org.junit.runners.Parameterized;
     
     @RunWith(Parameterized.class)
    +
    --- End diff --
    
    ok, just remove it


---

[GitHub] drill pull request #1166: DRILL-6016 - Fix for Error reading INT96 created b...

Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1166#discussion_r178255942
  
    --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java ---
    @@ -780,17 +780,31 @@ public void testImpalaParquetBinaryAsVarBinary_DictChange() throws Exception {
       Test the reading of a binary field as drill timestamp where data is in dictionary _and_ non-dictionary encoded pages
        */
       @Test
    -  @Ignore("relies on particular time zone, works for UTC")
       public void testImpalaParquetBinaryAsTimeStamp_DictChange() throws Exception {
         try {
           testBuilder()
    -          .sqlQuery("select int96_ts from dfs.`parquet/int96_dict_change` order by int96_ts")
    +          .sqlQuery("select min(int96_ts) date_value from dfs.`parquet/int96_dict_change`")
    --- End diff --
    
    Did you try WHERE statement?


---

[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...

Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on the issue:

    https://github.com/apache/drill/pull/1166
  
    @parthchandra I have compared meta of files from `TestParquetWriter.testImpalaParquetBinaryAsTimeStamp_DictChange` and the meta from Rahul's dataset and found that test case indeed makes a query from two parquet files: one is dictionary encoded and other isn't. But the dataMode of column is `Optional`, that's why `Nullable` column reader is used.
    Rahul's dataset contains `required` mode for INT96 column. This is a difference. Therefore other non-nullable column reader is necessary. 
    
    But I believe we have some mess in names of that column readers. Maybe to make some refactoring would be a good point. What do you think? For example to remove `Dictionary` prefixes from nested classes, but to leave it for top class name.


---

[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...

Posted by parthchandra <gi...@git.apache.org>.
Github user parthchandra commented on the issue:

    https://github.com/apache/drill/pull/1166
  
    @rajrahul this link is good. As expected, the int96 column is dictionary encoded. 
    Is it possible for you to extract just a couple of records from this file and then use that for a unit test? 
    see [TestParquetWriter.testImpalaParquetBinaryAsTimeStamp_DictChange](https://github.com/apache/drill/blob/master/exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java#L784)
    
    @vdiravka TestParquetWriter.testImpalaParquetBinaryAsTimeStamp_DictChange also uses an int96 that is dictionary encoded. Any idea whether (and why) it might be going thru a different code path?
    



---

[GitHub] drill pull request #1166: DRILL-6016 - Fix for Error reading INT96 created b...

Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1166#discussion_r178456935
  
    --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java ---
    @@ -780,17 +780,31 @@ public void testImpalaParquetBinaryAsVarBinary_DictChange() throws Exception {
       Test the reading of a binary field as drill timestamp where data is in dictionary _and_ non-dictionary encoded pages
        */
       @Test
    -  @Ignore("relies on particular time zone, works for UTC")
       public void testImpalaParquetBinaryAsTimeStamp_DictChange() throws Exception {
         try {
           testBuilder()
    -          .sqlQuery("select int96_ts from dfs.`parquet/int96_dict_change` order by int96_ts")
    +          .sqlQuery("select min(int96_ts) date_value from dfs.`parquet/int96_dict_change`")
    --- End diff --
    
    It is just more obvious what result is expected. But using MIN is ok.


---

[GitHub] drill pull request #1166: DRILL-6016 - Fix for Error reading INT96 created b...

Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1166#discussion_r178255699
  
    --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java ---
    @@ -61,6 +60,7 @@
     import org.junit.runners.Parameterized;
     
     @RunWith(Parameterized.class)
    +
    --- End diff --
    
    new line?


---

[GitHub] drill pull request #1166: DRILL-6016 - Fix for Error reading INT96 created b...

Posted by rajrahul <gi...@git.apache.org>.
Github user rajrahul commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1166#discussion_r178290675
  
    --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java ---
    @@ -780,17 +780,31 @@ public void testImpalaParquetBinaryAsVarBinary_DictChange() throws Exception {
       Test the reading of a binary field as drill timestamp where data is in dictionary _and_ non-dictionary encoded pages
        */
       @Test
    -  @Ignore("relies on particular time zone, works for UTC")
       public void testImpalaParquetBinaryAsTimeStamp_DictChange() throws Exception {
         try {
           testBuilder()
    -          .sqlQuery("select int96_ts from dfs.`parquet/int96_dict_change` order by int96_ts")
    +          .sqlQuery("select min(int96_ts) date_value from dfs.`parquet/int96_dict_change`")
    --- End diff --
    
    I did not try a WHERE statement, MIN was used to select a single record to compare. Was there any specific reason to use WHERE?


---

[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...

Posted by rajrahul <gi...@git.apache.org>.
Github user rajrahul commented on the issue:

    https://github.com/apache/drill/pull/1166
  
    The schema given below creates the issue, as @vdiravka pointed int96 is marked required here. This parquet was generated with an older version of spark and is included in the test case.
    
    ```
    message spark_schema {
      optional binary article_no (UTF8);
      optional binary qty (UTF8);
      required int96 run_date;
    }
    
    ```
    Newer spark version created the schema below where int96 has become optional.
    
    ```
    message spark_schema {
      optional binary country (UTF8);
      optional double sales;
      optional int96 targetDate;
    }
    ```


---

[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...

Posted by rajrahul <gi...@git.apache.org>.
Github user rajrahul commented on the issue:

    https://github.com/apache/drill/pull/1166
  
    @vdiravka I have made similar changes for testSparkParquetBinaryAsTimeStamp_DictChange, testHiveParquetTimestampAsInt96_basic and testImpalaParquetBinaryAsTimeStamp_DictChange. All tests are passing, please have a look.


---

[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...

Posted by parthchandra <gi...@git.apache.org>.
Github user parthchandra commented on the issue:

    https://github.com/apache/drill/pull/1166
  
    +1. LGTM


---

[GitHub] drill pull request #1166: DRILL-6016 - Fix for Error reading INT96 created b...

Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1166#discussion_r178072549
  
    --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java ---
    @@ -27,6 +27,7 @@
     import java.math.BigDecimal;
     import java.nio.file.Paths;
     import java.sql.Date;
    +import java.sql.Timestamp;
    --- End diff --
    
    unused import?


---

[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...

Posted by parthchandra <gi...@git.apache.org>.
Github user parthchandra commented on the issue:

    https://github.com/apache/drill/pull/1166
  
    @rajrahul, thanks for submitting the patch. It looks good. I guess we missed dictionary encoded int96 timestamps (even though timestamps with nanosecond precision) are the one thing that should never, ever, be dictionary encoded!
    
    Just to make sure, I tried the use the sample file in DRILL-6016, but I could not even unzip it! Can you please check and see if the file is correct? WE can use that to create the unit test as well.



---

[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...

Posted by priteshm <gi...@git.apache.org>.
Github user priteshm commented on the issue:

    https://github.com/apache/drill/pull/1166
  
    @parthchandra would you please review this?


---

[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...

Posted by rajrahul <gi...@git.apache.org>.
Github user rajrahul commented on the issue:

    https://github.com/apache/drill/pull/1166
  
    @parthchandra @vdiravka  I do not have a test case for this. I have manually verified the scenario with and without the patch. The sample input file is attached with https://issues.apache.org/jira/browse/DRILL-6016.
    



---

[GitHub] drill pull request #1166: DRILL-6016 - Fix for Error reading INT96 created b...

Posted by rajrahul <gi...@git.apache.org>.
Github user rajrahul commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1166#discussion_r178324303
  
    --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java ---
    @@ -61,6 +60,7 @@
     import org.junit.runners.Parameterized;
     
     @RunWith(Parameterized.class)
    +
    --- End diff --
    
    Actually not required, tried to add another RunWith for Mocking and removed later on leaving the newline.


---

[GitHub] drill pull request #1166: DRILL-6016 - Fix for Error reading INT96 created b...

Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1166#discussion_r178009588
  
    --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java ---
    @@ -797,6 +797,24 @@ public void testImpalaParquetBinaryAsTimeStamp_DictChange() throws Exception {
         }
       }
     
    +  @Test
    +  public void testSparkParquetBinaryAsTimeStamp_DictChange() throws Exception {
    +    try {
    +      mockUtcDateTimeZone();
    --- End diff --
    
    I see that in `TestParquetWriter` only one parameter is used - `repeat`. I think you can replace `Parameterized` running of this. test with simple variable. 
    Other approach - you can write programmatically using of JMockit.
    
    But I prefer not to use mocks if possible. So try to use `convertToLocalTimestamp`. By using it you can enable also `testHiveParquetTimestampAsInt96_basic` test and `testImpalaParquetBinaryAsTimeStamp_DictChange` with removing redundant rows.


---

[GitHub] drill pull request #1166: DRILL-6016 - Fix for Error reading INT96 created b...

Posted by rajrahul <gi...@git.apache.org>.
Github user rajrahul commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1166#discussion_r177950795
  
    --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java ---
    @@ -797,6 +797,24 @@ public void testImpalaParquetBinaryAsTimeStamp_DictChange() throws Exception {
         }
       }
     
    +  @Test
    +  public void testSparkParquetBinaryAsTimeStamp_DictChange() throws Exception {
    +    try {
    +      mockUtcDateTimeZone();
    --- End diff --
    
    @vdiravka your thoughts on comment above?


---

[GitHub] drill pull request #1166: DRILL-6016 - Fix for Error reading INT96 created b...

Posted by rajrahul <gi...@git.apache.org>.
Github user rajrahul commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1166#discussion_r177318780
  
    --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java ---
    @@ -797,6 +797,24 @@ public void testImpalaParquetBinaryAsTimeStamp_DictChange() throws Exception {
         }
       }
     
    +  @Test
    +  public void testSparkParquetBinaryAsTimeStamp_DictChange() throws Exception {
    +    try {
    +      mockUtcDateTimeZone();
    --- End diff --
    
    I could see two ways of doing this within the code itself.
    1. Mock and run with UTC, and compare the results in UTC as in TestCastFunctions#testToDateForTimeStamp. Since TestParquetWriter already has a RunWith annotation, we might have to create another class and move both the methods.
    2. Run with the JVM timezone(no mocking) and compare the results after a 'convertToLocalTimestamp' as in TestParquetWriter#testInt96TimeStampValueWidth
    
    Approach 2 does not used fixed UTC timezone. Which approach do you suggest?


---

[GitHub] drill pull request #1166: DRILL-6016 - Fix for Error reading INT96 created b...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/drill/pull/1166


---

[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...

Posted by rajrahul <gi...@git.apache.org>.
Github user rajrahul commented on the issue:

    https://github.com/apache/drill/pull/1166
  
    @parthchandra I will create a unit test with few time stamp fields.


---

[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...

Posted by parthchandra <gi...@git.apache.org>.
Github user parthchandra commented on the issue:

    https://github.com/apache/drill/pull/1166
  
    @rajrahul thanks for making all the changes (and of course for the fix)!


---

[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...

Posted by rajrahul <gi...@git.apache.org>.
Github user rajrahul commented on the issue:

    https://github.com/apache/drill/pull/1166
  
    @vdiravka removed the extra line.


---

[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...

Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on the issue:

    https://github.com/apache/drill/pull/1166
  
    @rajrahul Unit test from your PR relies on particular timezone similar to `TestParquetWriter.testImpalaParquetBinaryAsTimeStamp_DictChange`.
    
    Could you please edit test case for working within any time zone?
    Please see this PR #904 for more details. 


---

[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...

Posted by rajrahul <gi...@git.apache.org>.
Github user rajrahul commented on the issue:

    https://github.com/apache/drill/pull/1166
  
    @vdiravka Done. Please review.


---

[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...

Posted by rajrahul <gi...@git.apache.org>.
Github user rajrahul commented on the issue:

    https://github.com/apache/drill/pull/1166
  
    @parthchandra @vdiravka I have added the test case using the same parquet file(2.9k bytes). I tried creating a smaller file using Spark, but could not replicate the behavior. I have rebased the changes on the same commit and PR.


---