You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by rajrahul <gi...@git.apache.org> on 2018/03/14 06:43:29 UTC
[GitHub] drill pull request #1166: DRILL-6016 - Fix for Error reading INT96 created b...
GitHub user rajrahul opened a pull request:
https://github.com/apache/drill/pull/1166
DRILL-6016 - Fix for Error reading INT96 created by Apache Spark
This fixes DRILL-6016 where drill was failing to read int96 generated by Apache Spark even after setting store.parquet.reader.int96_as_timestamp to true.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/rajrahul/drill DRILL-6016
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/drill/pull/1166.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1166
----
commit 6d74884f314638e8d95e0c6ea3431222cb308ec5
Author: Rahul Raj <ra...@...>
Date: 2018-03-14T06:35:45Z
DRILL-6016 - Fix for Error reading INT96 created by Apache Spark
----
---
[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...
Posted by rajrahul <gi...@git.apache.org>.
Github user rajrahul commented on the issue:
https://github.com/apache/drill/pull/1166
@vdiravka I have made the changes. Please have a look.
---
[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...
Posted by rajrahul <gi...@git.apache.org>.
Github user rajrahul commented on the issue:
https://github.com/apache/drill/pull/1166
@parthchandra please use the link https://github.com/rajrahul/files/raw/master/result.tar.gz
The files are present inside result/parquet/latest.
---
[GitHub] drill pull request #1166: DRILL-6016 - Fix for Error reading INT96 created b...
Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on a diff in the pull request:
https://github.com/apache/drill/pull/1166#discussion_r178071020
--- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java ---
@@ -35,6 +36,7 @@
import java.util.Map;
import com.google.common.base.Joiner;
+import mockit.integration.junit4.JMockit;
--- End diff --
the same
---
[GitHub] drill pull request #1166: DRILL-6016 - Fix for Error reading INT96 created b...
Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on a diff in the pull request:
https://github.com/apache/drill/pull/1166#discussion_r177154051
--- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java ---
@@ -797,6 +797,24 @@ public void testImpalaParquetBinaryAsTimeStamp_DictChange() throws Exception {
}
}
+ @Test
+ public void testSparkParquetBinaryAsTimeStamp_DictChange() throws Exception {
+ try {
+ mockUtcDateTimeZone();
--- End diff --
It doesn't work without `@RunWith(JMockit.class)`.
Also please enable above test case `testImpalaParquetBinaryAsTimeStamp_DictChange` with the same change. And be sure that tests pass in the other time zone.
---
[GitHub] drill pull request #1166: DRILL-6016 - Fix for Error reading INT96 created b...
Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on a diff in the pull request:
https://github.com/apache/drill/pull/1166#discussion_r178070635
--- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java ---
@@ -780,17 +783,42 @@ public void testImpalaParquetBinaryAsVarBinary_DictChange() throws Exception {
Test the reading of a binary field as drill timestamp where data is in dictionary _and_ non-dictionary encoded pages
*/
@Test
- @Ignore("relies on particular time zone, works for UTC")
public void testImpalaParquetBinaryAsTimeStamp_DictChange() throws Exception {
try {
testBuilder()
.sqlQuery("select int96_ts from dfs.`parquet/int96_dict_change` order by int96_ts")
.optionSettingQueriesForTestQuery(
"alter session set `%s` = true", ExecConstants.PARQUET_READER_INT96_AS_TIMESTAMP)
.ordered()
- .csvBaselineFile("testframework/testParquetReader/testInt96DictChange/q1.tsv")
- .baselineTypes(TypeProtos.MinorType.TIMESTAMP)
.baselineColumns("int96_ts")
+ .baselineValues(new DateTime(convertToLocalTimestamp("1970-01-01 00:00:01.000")))
--- End diff --
One baselineValue is enough. Please use `where` in the query.
---
[GitHub] drill pull request #1166: DRILL-6016 - Fix for Error reading INT96 created b...
Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on a diff in the pull request:
https://github.com/apache/drill/pull/1166#discussion_r178456861
--- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java ---
@@ -61,6 +60,7 @@
import org.junit.runners.Parameterized;
@RunWith(Parameterized.class)
+
--- End diff --
ok, just remove it
---
[GitHub] drill pull request #1166: DRILL-6016 - Fix for Error reading INT96 created b...
Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on a diff in the pull request:
https://github.com/apache/drill/pull/1166#discussion_r178255942
--- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java ---
@@ -780,17 +780,31 @@ public void testImpalaParquetBinaryAsVarBinary_DictChange() throws Exception {
Test the reading of a binary field as drill timestamp where data is in dictionary _and_ non-dictionary encoded pages
*/
@Test
- @Ignore("relies on particular time zone, works for UTC")
public void testImpalaParquetBinaryAsTimeStamp_DictChange() throws Exception {
try {
testBuilder()
- .sqlQuery("select int96_ts from dfs.`parquet/int96_dict_change` order by int96_ts")
+ .sqlQuery("select min(int96_ts) date_value from dfs.`parquet/int96_dict_change`")
--- End diff --
Did you try WHERE statement?
---
[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...
Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on the issue:
https://github.com/apache/drill/pull/1166
@parthchandra I have compared meta of files from `TestParquetWriter.testImpalaParquetBinaryAsTimeStamp_DictChange` and the meta from Rahul's dataset and found that test case indeed makes a query from two parquet files: one is dictionary encoded and other isn't. But the dataMode of column is `Optional`, that's why `Nullable` column reader is used.
Rahul's dataset contains `required` mode for INT96 column. This is a difference. Therefore other non-nullable column reader is necessary.
But I believe we have some mess in names of that column readers. Maybe to make some refactoring would be a good point. What do you think? For example to remove `Dictionary` prefixes from nested classes, but to leave it for top class name.
---
[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...
Posted by parthchandra <gi...@git.apache.org>.
Github user parthchandra commented on the issue:
https://github.com/apache/drill/pull/1166
@rajrahul this link is good. As expected, the int96 column is dictionary encoded.
Is it possible for you to extract just a couple of records from this file and then use that for a unit test?
see [TestParquetWriter.testImpalaParquetBinaryAsTimeStamp_DictChange](https://github.com/apache/drill/blob/master/exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java#L784)
@vdiravka TestParquetWriter.testImpalaParquetBinaryAsTimeStamp_DictChange also uses an int96 that is dictionary encoded. Any idea whether (and why) it might be going thru a different code path?
---
[GitHub] drill pull request #1166: DRILL-6016 - Fix for Error reading INT96 created b...
Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on a diff in the pull request:
https://github.com/apache/drill/pull/1166#discussion_r178456935
--- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java ---
@@ -780,17 +780,31 @@ public void testImpalaParquetBinaryAsVarBinary_DictChange() throws Exception {
Test the reading of a binary field as drill timestamp where data is in dictionary _and_ non-dictionary encoded pages
*/
@Test
- @Ignore("relies on particular time zone, works for UTC")
public void testImpalaParquetBinaryAsTimeStamp_DictChange() throws Exception {
try {
testBuilder()
- .sqlQuery("select int96_ts from dfs.`parquet/int96_dict_change` order by int96_ts")
+ .sqlQuery("select min(int96_ts) date_value from dfs.`parquet/int96_dict_change`")
--- End diff --
It is just more obvious what result is expected. But using MIN is ok.
---
[GitHub] drill pull request #1166: DRILL-6016 - Fix for Error reading INT96 created b...
Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on a diff in the pull request:
https://github.com/apache/drill/pull/1166#discussion_r178255699
--- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java ---
@@ -61,6 +60,7 @@
import org.junit.runners.Parameterized;
@RunWith(Parameterized.class)
+
--- End diff --
new line?
---
[GitHub] drill pull request #1166: DRILL-6016 - Fix for Error reading INT96 created b...
Posted by rajrahul <gi...@git.apache.org>.
Github user rajrahul commented on a diff in the pull request:
https://github.com/apache/drill/pull/1166#discussion_r178290675
--- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java ---
@@ -780,17 +780,31 @@ public void testImpalaParquetBinaryAsVarBinary_DictChange() throws Exception {
Test the reading of a binary field as drill timestamp where data is in dictionary _and_ non-dictionary encoded pages
*/
@Test
- @Ignore("relies on particular time zone, works for UTC")
public void testImpalaParquetBinaryAsTimeStamp_DictChange() throws Exception {
try {
testBuilder()
- .sqlQuery("select int96_ts from dfs.`parquet/int96_dict_change` order by int96_ts")
+ .sqlQuery("select min(int96_ts) date_value from dfs.`parquet/int96_dict_change`")
--- End diff --
I did not try a WHERE statement, MIN was used to select a single record to compare. Was there any specific reason to use WHERE?
---
[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...
Posted by rajrahul <gi...@git.apache.org>.
Github user rajrahul commented on the issue:
https://github.com/apache/drill/pull/1166
The schema given below creates the issue, as @vdiravka pointed int96 is marked required here. This parquet was generated with an older version of spark and is included in the test case.
```
message spark_schema {
optional binary article_no (UTF8);
optional binary qty (UTF8);
required int96 run_date;
}
```
Newer spark version created the schema below where int96 has become optional.
```
message spark_schema {
optional binary country (UTF8);
optional double sales;
optional int96 targetDate;
}
```
---
[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...
Posted by rajrahul <gi...@git.apache.org>.
Github user rajrahul commented on the issue:
https://github.com/apache/drill/pull/1166
@vdiravka I have made similar changes for testSparkParquetBinaryAsTimeStamp_DictChange, testHiveParquetTimestampAsInt96_basic and testImpalaParquetBinaryAsTimeStamp_DictChange. All tests are passing, please have a look.
---
[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...
Posted by parthchandra <gi...@git.apache.org>.
Github user parthchandra commented on the issue:
https://github.com/apache/drill/pull/1166
+1. LGTM
---
[GitHub] drill pull request #1166: DRILL-6016 - Fix for Error reading INT96 created b...
Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on a diff in the pull request:
https://github.com/apache/drill/pull/1166#discussion_r178072549
--- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java ---
@@ -27,6 +27,7 @@
import java.math.BigDecimal;
import java.nio.file.Paths;
import java.sql.Date;
+import java.sql.Timestamp;
--- End diff --
unused import?
---
[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...
Posted by parthchandra <gi...@git.apache.org>.
Github user parthchandra commented on the issue:
https://github.com/apache/drill/pull/1166
@rajrahul, thanks for submitting the patch. It looks good. I guess we missed dictionary encoded int96 timestamps (even though timestamps with nanosecond precision) are the one thing that should never, ever, be dictionary encoded!
Just to make sure, I tried the use the sample file in DRILL-6016, but I could not even unzip it! Can you please check and see if the file is correct? WE can use that to create the unit test as well.
---
[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...
Posted by priteshm <gi...@git.apache.org>.
Github user priteshm commented on the issue:
https://github.com/apache/drill/pull/1166
@parthchandra would you please review this?
---
[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...
Posted by rajrahul <gi...@git.apache.org>.
Github user rajrahul commented on the issue:
https://github.com/apache/drill/pull/1166
@parthchandra @vdiravka I do not have a test case for this. I have manually verified the scenario with and without the patch. The sample input file is attached with https://issues.apache.org/jira/browse/DRILL-6016.
---
[GitHub] drill pull request #1166: DRILL-6016 - Fix for Error reading INT96 created b...
Posted by rajrahul <gi...@git.apache.org>.
Github user rajrahul commented on a diff in the pull request:
https://github.com/apache/drill/pull/1166#discussion_r178324303
--- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java ---
@@ -61,6 +60,7 @@
import org.junit.runners.Parameterized;
@RunWith(Parameterized.class)
+
--- End diff --
Actually not required, tried to add another RunWith for Mocking and removed later on leaving the newline.
---
[GitHub] drill pull request #1166: DRILL-6016 - Fix for Error reading INT96 created b...
Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on a diff in the pull request:
https://github.com/apache/drill/pull/1166#discussion_r178009588
--- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java ---
@@ -797,6 +797,24 @@ public void testImpalaParquetBinaryAsTimeStamp_DictChange() throws Exception {
}
}
+ @Test
+ public void testSparkParquetBinaryAsTimeStamp_DictChange() throws Exception {
+ try {
+ mockUtcDateTimeZone();
--- End diff --
I see that in `TestParquetWriter` only one parameter is used - `repeat`. I think you can replace `Parameterized` running of this. test with simple variable.
Other approach - you can write programmatically using of JMockit.
But I prefer not to use mocks if possible. So try to use `convertToLocalTimestamp`. By using it you can enable also `testHiveParquetTimestampAsInt96_basic` test and `testImpalaParquetBinaryAsTimeStamp_DictChange` with removing redundant rows.
---
[GitHub] drill pull request #1166: DRILL-6016 - Fix for Error reading INT96 created b...
Posted by rajrahul <gi...@git.apache.org>.
Github user rajrahul commented on a diff in the pull request:
https://github.com/apache/drill/pull/1166#discussion_r177950795
--- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java ---
@@ -797,6 +797,24 @@ public void testImpalaParquetBinaryAsTimeStamp_DictChange() throws Exception {
}
}
+ @Test
+ public void testSparkParquetBinaryAsTimeStamp_DictChange() throws Exception {
+ try {
+ mockUtcDateTimeZone();
--- End diff --
@vdiravka your thoughts on comment above?
---
[GitHub] drill pull request #1166: DRILL-6016 - Fix for Error reading INT96 created b...
Posted by rajrahul <gi...@git.apache.org>.
Github user rajrahul commented on a diff in the pull request:
https://github.com/apache/drill/pull/1166#discussion_r177318780
--- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java ---
@@ -797,6 +797,24 @@ public void testImpalaParquetBinaryAsTimeStamp_DictChange() throws Exception {
}
}
+ @Test
+ public void testSparkParquetBinaryAsTimeStamp_DictChange() throws Exception {
+ try {
+ mockUtcDateTimeZone();
--- End diff --
I could see two ways of doing this within the code itself.
1. Mock and run with UTC, and compare the results in UTC as in TestCastFunctions#testToDateForTimeStamp. Since TestParquetWriter already has a RunWith annotation, we might have to create another class and move both the methods.
2. Run with the JVM timezone(no mocking) and compare the results after a 'convertToLocalTimestamp' as in TestParquetWriter#testInt96TimeStampValueWidth
Approach 2 does not used fixed UTC timezone. Which approach do you suggest?
---
[GitHub] drill pull request #1166: DRILL-6016 - Fix for Error reading INT96 created b...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/drill/pull/1166
---
[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...
Posted by rajrahul <gi...@git.apache.org>.
Github user rajrahul commented on the issue:
https://github.com/apache/drill/pull/1166
@parthchandra I will create a unit test with few time stamp fields.
---
[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...
Posted by parthchandra <gi...@git.apache.org>.
Github user parthchandra commented on the issue:
https://github.com/apache/drill/pull/1166
@rajrahul thanks for making all the changes (and of course for the fix)!
---
[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...
Posted by rajrahul <gi...@git.apache.org>.
Github user rajrahul commented on the issue:
https://github.com/apache/drill/pull/1166
@vdiravka removed the extra line.
---
[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...
Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on the issue:
https://github.com/apache/drill/pull/1166
@rajrahul Unit test from your PR relies on particular timezone similar to `TestParquetWriter.testImpalaParquetBinaryAsTimeStamp_DictChange`.
Could you please edit test case for working within any time zone?
Please see this PR #904 for more details.
---
[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...
Posted by rajrahul <gi...@git.apache.org>.
Github user rajrahul commented on the issue:
https://github.com/apache/drill/pull/1166
@vdiravka Done. Please review.
---
[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...
Posted by rajrahul <gi...@git.apache.org>.
Github user rajrahul commented on the issue:
https://github.com/apache/drill/pull/1166
@parthchandra @vdiravka I have added the test case using the same parquet file(2.9k bytes). I tried creating a smaller file using Spark, but could not replicate the behavior. I have rebased the changes on the same commit and PR.
---