You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@orc.apache.org by GitBox <gi...@apache.org> on 2020/08/22 00:00:24 UTC

[GitHub] [orc] omalley opened a new pull request #540: ORC-370. Use LocalDate and day of epoch instead of java's Date for column statistics

omalley opened a new pull request #540:
URL: https://github.com/apache/orc/pull/540


   This PR adds 4 new methods to the DateColumnStatistics.
   
   * getMinimumLocalDate -> LocalDate
   * getMinimumDayOfEpoch -> int
   
   and the matching getMaximum functions. The current functions are deprecated. The DateColumnStatisticsImpl moves the minimum and maximum values to int instead of Integer, so they can be modified. The initial values are Integer.MAX_VALUE and MIN_VALUE.
   
   The PPD code is changed to use LocalDate instead of Date.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on pull request #540: ORC-661. Use LocalDate and day of epoch instead of java's Date for column statistics

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #540:
URL: https://github.com/apache/orc/pull/540#issuecomment-680128544


   Sure~ Thank you, @omalley .


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun edited a comment on pull request #540: ORC-661. Use LocalDate and day of epoch instead of java's Date for column statistics

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #540:
URL: https://github.com/apache/orc/pull/540#issuecomment-678792031


   Hi, @omalley . The committed wrong PR seems to mislead us. Since we cannot change commit log in this case, can we use a refresh new JIRA id instead of ORC-370?
   ```
   ~o:master $ git log --oneline master | grep ORC-370
   44a6002f ORC-370:[C++] Reset BooleanRleDecoder remainingbits to zero on seek
   ~o:master $ git log --oneline branch-1.6 | grep ORC-370
   44a6002f ORC-370:[C++] Reset BooleanRleDecoder remainingbits to zero on seek
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] omalley commented on pull request #540: ORC-661. Use LocalDate and day of epoch instead of java's Date for column statistics

Posted by GitBox <gi...@apache.org>.
omalley commented on pull request #540:
URL: https://github.com/apache/orc/pull/540#issuecomment-680303799


   Ok, I made a few changes as well as updating the tests.
   1. I moved from LocalDate to ChronoLocalDate, which is agnostic to the Chronology (eg. Hybrid vs Proleptic).
   2. The LocalDate or HybridDate returned from get(Minimum/Maximum)LocalDate depend on the calendar that is selected.
   3. We handle the conversion from ChronoLocalDate to string explicitly so that don't get the chronology.
   
   We also need an update to storage-api to allow ChronoLocalDate in SARG predicates.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] omalley commented on pull request #540: ORC-370. Use LocalDate and day of epoch instead of java's Date for column statistics

Posted by GitBox <gi...@apache.org>.
omalley commented on pull request #540:
URL: https://github.com/apache/orc/pull/540#issuecomment-679227153


   Ugh, that is unfortunate. Thank you for the catch, @dongjoon-hyun !


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] omalley commented on pull request #540: ORC-661. Use LocalDate and day of epoch instead of java's Date for column statistics

Posted by GitBox <gi...@apache.org>.
omalley commented on pull request #540:
URL: https://github.com/apache/orc/pull/540#issuecomment-680104614


   @dongjoon-hyun let me write some more unit tests to make sure things are working correctly, but yes this should fix the timezone sensitive Date problem.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] omalley merged pull request #540: ORC-661. Use LocalDate and day of epoch instead of java's Date for column statistics

Posted by GitBox <gi...@apache.org>.
omalley merged pull request #540:
URL: https://github.com/apache/orc/pull/540


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on a change in pull request #540: ORC-661. Use LocalDate and day of epoch instead of java's Date for column statistics

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #540:
URL: https://github.com/apache/orc/pull/540#discussion_r475823344



##########
File path: java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java
##########
@@ -801,12 +805,17 @@ private static Comparable getBaseObjectForComparison(PredicateLeaf.Type type,
           return Boolean.valueOf(obj.toString());
         }
       case DATE:
-        if (obj instanceof Date) {
+        if (obj instanceof LocalDate) {
           return obj;
+        } else if (obj instanceof java.sql.Date) {
+          return ((java.sql.Date) obj).toLocalDate();
+        } else if (obj instanceof Date) {
+          return LocalDateTime.ofInstant(((Date) obj).toInstant(),
+              ZoneOffset.UTC).toLocalDate();

Review comment:
       So, this is the fix for the following?
   ```
   ORC PPD evaluation for Date type uses java.sql.Date for min/max comparison causing incorrect results.
   Date.compareTo uses millis offset which can return incorrect results depending on the timezone. 
   Running the testcase in HIVE-19726 passed in Los Angeles but failed in Paris
   as Date.compareTo return 0 for Los Angeles but returned -1 for Paris.
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on pull request #540: ORC-370. Use LocalDate and day of epoch instead of java's Date for column statistics

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #540:
URL: https://github.com/apache/orc/pull/540#issuecomment-678792031


   Hi, @omalley . The committed wrong PR seems to mislead us. Since we cannot commit log in this case, can we use a refresh new JIRA id instead of ORC-370?
   ```
   ~o:master $ git log --oneline master | grep ORC-370
   44a6002f ORC-370:[C++] Reset BooleanRleDecoder remainingbits to zero on seek
   ~o:master $ git log --oneline branch-1.6 | grep ORC-370
   44a6002f ORC-370:[C++] Reset BooleanRleDecoder remainingbits to zero on seek
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org