You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by vdiravka <gi...@git.apache.org> on 2016/09/22 13:47:43 UTC

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

GitHub user vdiravka opened a pull request:

    https://github.com/apache/drill/pull/595

    DRILL-4203: Parquet File. Date is stored wrongly

    Drill was writing non-standard dates into parquet files for all releases
    before this commit. The values have been read correctly by Drill, but
    external tools like Spark reading the files will see corrupted values for
    all dates that have been written by Drill.
    
    This change corrects the behavior of the Drill parquet writer to correctly
    store dates in the format given in the parquet specification.
    
    To maintain compatibility with old files, the parquet reader code has been
    updated to check for the old format and automatically shift the
    corrupted values into corrected ones automatically.
    
    The test cases included here should ensure that all files produced by
    historical versions of Drill will continue to return the same values
    they had in previous releases. For compatibility with external tools, any
    old files with corrupted dates can be re-written using the CREATE TABLE AS
    command (as the writer will now only produce the specification-compliant
    values, even if after reading out of older corrupt files, one
    new extra field "is.date.correct = true" will be included into the parquet meta
    information of files and into drill metadata cache files).
    
    While the old behavior was a consistent shift into an unlikely range to be
    used in a modern database (over 10,000 years in the future), these are
    still valid date values. In the case where these may have been written
    into files intentionally, and we cannot be certain from the metadata if
    Drill produced the files, an option is included to turn off the auto-correction.
    Use of this option is assumed to be extremely unlikely, but it is included for
    completeness.
    
    One small fix in the ParquetGroupScan to accommodate changes in master that changed
    when metadata is read.
    
    Added new tests for bugs (revealed by the regression suite) with old and new
    parquet (binary) files for new tests, updated metadata cache files accordingly.
    
    Removed unnecessary double conversion of value with Julian day.
    
    Added ability to correct corrupted dates for parquet files with the second
    version old metadata cache file as well.
    
    Fix DrillVersionInfo to make it provide a valid version number even during
    the unit tests. This is now a build-time generated class, rather than one
    that looks on the classpath for META-INF files. (This pattern
    for file generation with parameters passed from the POM files
    was borrowed from parquet-mr)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vdiravka/drill DRILL-4203

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/595.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #595
    
----
commit 6f816742d773a1696b5329472c2465a79e35140c
Author: Vitalii Diravka <vi...@gmail.com>
Date:   2016-09-22T13:44:37Z

    DRILL-4203: Parquet File. Date is stored wrongly
    
    Drill was writing non-standard dates into parquet files for all releases
    before this commit. The values have been read correctly by Drill, but
    external tools like Spark reading the files will see corrupted values for
    all dates that have been written by Drill.
    
    This change corrects the behavior of the Drill parquet writer to correctly
    store dates in the format given in the parquet specification.
    
    To maintain compatibility with old files, the parquet reader code has been
    updated to check for the old format and automatically shift the
    corrupted values into corrected ones automatically.
    
    The test cases included here should ensure that all files produced by
    historical versions of Drill will continue to return the same values
    they had in previous releases. For compatibility with external tools, any
    old files with corrupted dates can be re-written using the CREATE TABLE AS
    command (as the writer will now only produce the specification-compliant
    values, even if after reading out of older corrupt files, one
    new extra field "is.date.correct = true" will be included into the parquet meta
    information of files and into drill metadata cache files).
    
    While the old behavior was a consistent shift into an unlikely range to be
    used in a modern database (over 10,000 years in the future), these are
    still valid date values. In the case where these may have been written
    into files intentionally, and we cannot be certain from the metadata if
    Drill produced the files, an option is included to turn off the auto-correction.
    Use of this option is assumed to be extremely unlikely, but it is included for
    completeness.
    
    One small fix in the ParquetGroupScan to accommodate changes in master that changed
    when metadata is read.
    
    Added new tests for bugs (revealed by the regression suite) with old and new
    parquet (binary) files for new tests, updated metadata cache files accordingly.
    
    Removed unnecessary double conversion of value with Julian day.
    
    Added ability to correct corrupted dates for parquet files with the second
    version old metadata cache file as well.
    
    Fix DrillVersionInfo to make it provide a valid version number even during
    the unit tests. This is now a build-time generated class, rather than one
    that looks on the classpath for META-INF files. (This pattern
    for file generation with parameters passed from the POM files
    was borrowed from parquet-mr)

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill issue #595: DRILL-4203: Parquet File. Date is stored wrongly

Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on the issue:

    https://github.com/apache/drill/pull/595
  
    Thanks @jaltekruse.
    @parthchandra Could you please take a look?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on a diff in the pull request:

    https://github.com/apache/drill/pull/595#discussion_r81636854
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/ParquetRecordReader.java ---
    @@ -207,6 +229,31 @@ public OperatorContext getOperatorContext() {
         return operatorContext;
       }
     
    +  /**
    --- End diff --
    
    It is just Jason's minor refactoring. Is it ok? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/drill/pull/595


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

Posted by jaltekruse <gi...@git.apache.org>.
Github user jaltekruse commented on a diff in the pull request:

    https://github.com/apache/drill/pull/595#discussion_r82274813
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java ---
    @@ -184,17 +201,15 @@ public static DateCorruptionStatus detectCorruptDates(ParquetMetadata footer,
               if (parsedCreatedByVersion.hasSemanticVersion()) {
                 SemanticVersion semVer = parsedCreatedByVersion.getSemanticVersion();
                 String pre = semVer.pre + "";
    -            if (semVer != null && semVer.major == 1 && semVer.minor == 8 && semVer.patch == 1 && pre.contains("drill")) {
    --- End diff --
    
    I don't know if you are guarding against a null value for semVer at a higher level, but I'm pretty sure the reason I added it is that older versions of parquet files can be lacking this field. I assume with all of the tests cases that were added we should be okay, but it might be better to keep it in defensively. If this field isn't set in the metadata, it shouldn't make it an invalid file, but if we get an NPE it will stop query processing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill issue #595: DRILL-4203: Parquet File. Date is stored wrongly

Posted by parthchandra <gi...@git.apache.org>.
Github user parthchandra commented on the issue:

    https://github.com/apache/drill/pull/595
  
    @vdiravka can you repost this PR with the commits split into Jason's original work and then with your fixes on top? We should give credit where it is due. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on a diff in the pull request:

    https://github.com/apache/drill/pull/595#discussion_r81625440
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java ---
    @@ -935,6 +972,11 @@ public ColumnTypeMetadata_v2 getColumnTypeInfo(String[] name) {
         @JsonIgnore @Override public ParquetTableMetadataBase clone() {
           return new ParquetTableMetadata_v2(files, directories, columnTypeInfo);
         }
    +
    +    @JsonIgnore @Override public boolean isDateCorrect() {
    +      return isDateCorrect;
    --- End diff --
    
    If metadata cache file is existed Drill reads it instead of retrieving metadata from multiple Parquet files. In the case when it was generated with drill after this commit the value of isDateCorrect will be true. In the case when it was generated with drill before this commit the isDateCorrect field in metadata cache file will be absent and value of this will be false in ParquetTableMetadata_v2.
    And according to this value we just define DateCorruptionStatus (you can see more in ParquetReaderUtility.correctDatesInMetadataCache()). The leftover way of data checking in the cache was not changed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill issue #595: DRILL-4203: Parquet File. Date is stored wrongly

Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on the issue:

    https://github.com/apache/drill/pull/595
  
    @jaltekruse Could you please review?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

Posted by parthchandra <gi...@git.apache.org>.
Github user parthchandra commented on a diff in the pull request:

    https://github.com/apache/drill/pull/595#discussion_r82690304
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/FixedByteAlignedReader.java ---
    @@ -161,7 +160,7 @@ void addNext(int start, int index) {
             intValue = readIntLittleEndian(bytebuf, start);
           }
     
    -      mutator.set(index, DateTimeUtils.fromJulianDay(intValue + ParquetReaderUtility.CORRECT_CORRUPT_DATE_SHIFT));
    +      mutator.set(index, (intValue - ParquetReaderUtility.CORRECT_CORRUPT_DATE_SHIFT) * DateTimeConstants.MILLIS_PER_DAY);
    --- End diff --
    
    Reviewing this math can give anyone a headache :(. But it is correct.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

Posted by jaltekruse <gi...@git.apache.org>.
Github user jaltekruse commented on a diff in the pull request:

    https://github.com/apache/drill/pull/595#discussion_r82291973
  
    --- Diff: contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/HiveDrillNativeScanBatchCreator.java ---
    @@ -123,6 +123,7 @@ public ScanBatch getBatch(FragmentContext context, HiveDrillNativeParquetSubScan
               // in the first row group
               ParquetReaderUtility.DateCorruptionStatus containsCorruptDates =
                   ParquetReaderUtility.detectCorruptDates(parquetMetadata, config.getColumns(), true);
    +          logger.info(containsCorruptDates.toString());
    --- End diff --
    
    Sounds good


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill issue #595: DRILL-4203: Parquet File. Date is stored wrongly

Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on the issue:

    https://github.com/apache/drill/pull/595
  
    The branch is rebased to the master version.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill issue #595: DRILL-4203: Parquet File. Date is stored wrongly

Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on the issue:

    https://github.com/apache/drill/pull/595
  
    @parthchandra Done. Now it involves 2 Jason original commits (only with resolved git version conflicts) and one my commit on top.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

Posted by parthchandra <gi...@git.apache.org>.
Github user parthchandra commented on a diff in the pull request:

    https://github.com/apache/drill/pull/595#discussion_r81240983
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java ---
    @@ -104,6 +104,9 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS
               logger.trace("ParquetTrace,Read Footer,{},{},{},{},{},{},{}", "", e.getPath(), "", 0, 0, 0, timeToRead);
               footers.put(e.getPath(), footer );
             }
    +        boolean autoCorrectCorruptDates = rowGroupScan.formatConfig.autoCorrectCorruptDates;
    --- End diff --
    
    It would be nice to have a log message here if we are going to autocorrect dates.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

Posted by jaltekruse <gi...@git.apache.org>.
Github user jaltekruse commented on a diff in the pull request:

    https://github.com/apache/drill/pull/595#discussion_r82292241
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java ---
    @@ -107,6 +107,7 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS
             boolean autoCorrectCorruptDates = rowGroupScan.formatConfig.autoCorrectCorruptDates;
             ParquetReaderUtility.DateCorruptionStatus containsCorruptDates = ParquetReaderUtility.detectCorruptDates(footers.get(e.getPath()), rowGroupScan.getColumns(),
                     autoCorrectCorruptDates);
    +        logger.info(containsCorruptDates.toString());
    --- End diff --
    
    addressed, in other similar comment, this is good


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on a diff in the pull request:

    https://github.com/apache/drill/pull/595#discussion_r81625584
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -664,27 +667,37 @@ private void init(MetadataContext metaContext) throws IOException {
           }
           if (metaPath != null && fs.exists(metaPath)) {
             usedMetadataCache = true;
    -        parquetTableMetadata = Metadata.readBlockMeta(fs, metaPath.toString(), metaContext);
    +        if (parquetTableMetadata == null) {
    +          parquetTableMetadata = Metadata.readBlockMeta(fs, metaPath.toString(), metaContext, formatConfig);
    +        }
           } else {
    -        parquetTableMetadata = Metadata.getParquetTableMetadata(fs, p.toString());
    +        parquetTableMetadata = Metadata.getParquetTableMetadata(fs, p.toString(), formatConfig);
           }
         } else {
           Path p = Path.getPathWithoutSchemeAndAuthority(new Path(selectionRoot));
           Path metaPath = new Path(p, Metadata.METADATA_FILENAME);
           if (fs.isDirectory(new Path(selectionRoot)) && fs.exists(metaPath)) {
             usedMetadataCache = true;
             if (parquetTableMetadata == null) {
    -          parquetTableMetadata = Metadata.readBlockMeta(fs, metaPath.toString(), metaContext);
    +          parquetTableMetadata = Metadata.readBlockMeta(fs, metaPath.toString(), metaContext, formatConfig);
             }
             if (fileSet != null) {
    -          parquetTableMetadata = removeUnneededRowGroups(parquetTableMetadata);
    +          if (parquetTableMetadata == null) {
    --- End diff --
    
    agree. It is redundant. Changed in the last commit.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

Posted by jaltekruse <gi...@git.apache.org>.
Github user jaltekruse commented on a diff in the pull request:

    https://github.com/apache/drill/pull/595#discussion_r82273282
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java ---
    @@ -918,18 +916,22 @@ public void setMax(Object max) {
         @JsonProperty public ConcurrentHashMap<ColumnTypeMetadata_v2.Key, ColumnTypeMetadata_v2> columnTypeInfo;
         @JsonProperty List<ParquetFileMetadata_v2> files;
         @JsonProperty List<String> directories;
    -    @JsonProperty String drillVersion;
    --- End diff --
    
    This sounds reasonable, having both is okay.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

Posted by parthchandra <gi...@git.apache.org>.
Github user parthchandra commented on a diff in the pull request:

    https://github.com/apache/drill/pull/595#discussion_r81244832
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/ParquetRecordReader.java ---
    @@ -207,6 +229,31 @@ public OperatorContext getOperatorContext() {
         return operatorContext;
       }
     
    +  /**
    --- End diff --
    
    This seems completely unrelated to the bug


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

Posted by parthchandra <gi...@git.apache.org>.
Github user parthchandra commented on a diff in the pull request:

    https://github.com/apache/drill/pull/595#discussion_r81234794
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java ---
    @@ -935,6 +972,11 @@ public ColumnTypeMetadata_v2 getColumnTypeInfo(String[] name) {
         @JsonIgnore @Override public ParquetTableMetadataBase clone() {
           return new ParquetTableMetadata_v2(files, directories, columnTypeInfo);
         }
    +
    +    @JsonIgnore @Override public boolean isDateCorrect() {
    +      return isDateCorrect;
    --- End diff --
    
    What is this for? Do you bypass data checking in the cache if this is true? What if it is not specified in the file?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

Posted by jaltekruse <gi...@git.apache.org>.
Github user jaltekruse commented on a diff in the pull request:

    https://github.com/apache/drill/pull/595#discussion_r82277733
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java ---
    @@ -107,6 +107,7 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS
             boolean autoCorrectCorruptDates = rowGroupScan.formatConfig.autoCorrectCorruptDates;
             ParquetReaderUtility.DateCorruptionStatus containsCorruptDates = ParquetReaderUtility.detectCorruptDates(footers.get(e.getPath()), rowGroupScan.getColumns(),
                     autoCorrectCorruptDates);
    +        logger.info(containsCorruptDates.toString());
    --- End diff --
    
    same as above, not a very useful logging statement.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill issue #595: DRILL-4203: Parquet File. Date is stored wrongly

Posted by tushu1232 <gi...@git.apache.org>.
Github user tushu1232 commented on the issue:

    https://github.com/apache/drill/pull/595
  
    How can I get to test this fix .Thanks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

Posted by parthchandra <gi...@git.apache.org>.
Github user parthchandra commented on a diff in the pull request:

    https://github.com/apache/drill/pull/595#discussion_r81238209
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -664,27 +667,37 @@ private void init(MetadataContext metaContext) throws IOException {
           }
           if (metaPath != null && fs.exists(metaPath)) {
             usedMetadataCache = true;
    -        parquetTableMetadata = Metadata.readBlockMeta(fs, metaPath.toString(), metaContext);
    +        if (parquetTableMetadata == null) {
    +          parquetTableMetadata = Metadata.readBlockMeta(fs, metaPath.toString(), metaContext, formatConfig);
    +        }
           } else {
    -        parquetTableMetadata = Metadata.getParquetTableMetadata(fs, p.toString());
    +        parquetTableMetadata = Metadata.getParquetTableMetadata(fs, p.toString(), formatConfig);
           }
         } else {
           Path p = Path.getPathWithoutSchemeAndAuthority(new Path(selectionRoot));
           Path metaPath = new Path(p, Metadata.METADATA_FILENAME);
           if (fs.isDirectory(new Path(selectionRoot)) && fs.exists(metaPath)) {
             usedMetadataCache = true;
             if (parquetTableMetadata == null) {
    -          parquetTableMetadata = Metadata.readBlockMeta(fs, metaPath.toString(), metaContext);
    +          parquetTableMetadata = Metadata.readBlockMeta(fs, metaPath.toString(), metaContext, formatConfig);
             }
             if (fileSet != null) {
    -          parquetTableMetadata = removeUnneededRowGroups(parquetTableMetadata);
    +          if (parquetTableMetadata == null) {
    --- End diff --
    
    You've already made sure to read ParquetTableMetadata if it was null, so you probably needn't read it again


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on a diff in the pull request:

    https://github.com/apache/drill/pull/595#discussion_r82287507
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java ---
    @@ -184,17 +201,15 @@ public static DateCorruptionStatus detectCorruptDates(ParquetMetadata footer,
               if (parsedCreatedByVersion.hasSemanticVersion()) {
                 SemanticVersion semVer = parsedCreatedByVersion.getSemanticVersion();
                 String pre = semVer.pre + "";
    -            if (semVer != null && semVer.major == 1 && semVer.minor == 8 && semVer.patch == 1 && pre.contains("drill")) {
    --- End diff --
    
    If `semVer `will be null we get NPE above, where `semVer.pre` is called. 
    It happened with tests on our regression test framework. That's why I added `hasSemanticVersion()` above. So if this method return `true`, `semVer `cann't be `NULL` according `org.apache.parquet.SemanticVersion` and the following `semVer!=null` checking is redundant.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill issue #595: DRILL-4203: Parquet File. Date is stored wrongly

Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on the issue:

    https://github.com/apache/drill/pull/595
  
    @tushu1232 Until this fix is merged into the master you can clone my fork repository (https://github.com/vdiravka/drill), switch to the DRILL-4203 branch and build the project.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

Posted by parthchandra <gi...@git.apache.org>.
Github user parthchandra commented on a diff in the pull request:

    https://github.com/apache/drill/pull/595#discussion_r81240013
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java ---
    @@ -18,16 +18,62 @@
     package org.apache.drill.exec.store.parquet;
     
     import org.apache.drill.common.exceptions.UserException;
    +import org.apache.drill.common.expression.PathSegment;
    +import org.apache.drill.common.expression.SchemaPath;
     import org.apache.drill.exec.planner.physical.PlannerSettings;
     import org.apache.drill.exec.server.options.OptionManager;
    +import org.apache.drill.exec.store.AbstractRecordReader;
    +import org.apache.drill.exec.store.ParquetOutputRecordWriter;
     import org.apache.drill.exec.work.ExecErrorConstants;
    +import org.apache.parquet.SemanticVersion;
    +import org.apache.parquet.VersionParser;
    +import org.apache.parquet.column.ColumnDescriptor;
    +import org.apache.parquet.column.statistics.Statistics;
    +import org.apache.parquet.format.ConvertedType;
    +import org.apache.parquet.format.FileMetaData;
    +import org.apache.parquet.format.SchemaElement;
    +import org.apache.parquet.format.converter.ParquetMetadataConverter;
    +import org.apache.parquet.hadoop.ParquetFileWriter;
    +import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
    +import org.apache.parquet.hadoop.metadata.ColumnPath;
    +import org.apache.parquet.hadoop.metadata.ParquetMetadata;
    +import org.apache.parquet.schema.OriginalType;
    +import org.joda.time.Chronology;
    +import org.joda.time.DateTimeConstants;
    +
    +import java.util.Arrays;
    +import java.util.HashMap;
    +import java.util.List;
    +import java.util.Map;
     
     /*
      * Utility class where we can capture common logic between the two parquet readers
      */
     public class ParquetReaderUtility {
       private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(ParquetReaderUtility.class);
     
    +  // Note the negation symbol in the beginning
    +  public static final double CORRECT_CORRUPT_DATE_SHIFT = -ParquetOutputRecordWriter.JULIAN_DAY_EPOC - 0.5;
    --- End diff --
    
    Is the the number by which you must correct the incorrect values? If so, please add a comment. This had me very confused for a while.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on a diff in the pull request:

    https://github.com/apache/drill/pull/595#discussion_r82290736
  
    --- Diff: contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/HiveDrillNativeScanBatchCreator.java ---
    @@ -123,6 +123,7 @@ public ScanBatch getBatch(FragmentContext context, HiveDrillNativeParquetSubScan
               // in the first row group
               ParquetReaderUtility.DateCorruptionStatus containsCorruptDates =
                   ParquetReaderUtility.detectCorruptDates(parquetMetadata, config.getColumns(), true);
    +          logger.info(containsCorruptDates.toString());
    --- End diff --
    
    But it is not a regular Boolean value, `DateCorruptionStatus` is an enum with overrided toString method. For example how log file looks after querying hive parquet file with correct date values:
    
    `2016-10-06 23:32:30,598 [280920f1-e362-136e-0fdd-24779fef2c4a:frag:0:0] INFO  o.a.d.e.s.p.ParquetScanBatchCreator - It is determined from metadata that the date values are definitely CORRECT`
    
    Is it enought?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on a diff in the pull request:

    https://github.com/apache/drill/pull/595#discussion_r81722838
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java ---
    @@ -104,6 +104,9 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS
               logger.trace("ParquetTrace,Read Footer,{},{},{},{},{},{},{}", "", e.getPath(), "", 0, 0, 0, timeToRead);
               footers.put(e.getPath(), footer );
             }
    +        boolean autoCorrectCorruptDates = rowGroupScan.formatConfig.autoCorrectCorruptDates;
    --- End diff --
    
    We are going there only to detect corrupt dates according the option in the parquet format config. 
    I add the log message below regarding DateCorruptionStatus and also in other places where we `detectCorruptDates`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

Posted by jaltekruse <gi...@git.apache.org>.
Github user jaltekruse commented on a diff in the pull request:

    https://github.com/apache/drill/pull/595#discussion_r82292165
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java ---
    @@ -184,17 +201,15 @@ public static DateCorruptionStatus detectCorruptDates(ParquetMetadata footer,
               if (parsedCreatedByVersion.hasSemanticVersion()) {
                 SemanticVersion semVer = parsedCreatedByVersion.getSemanticVersion();
                 String pre = semVer.pre + "";
    -            if (semVer != null && semVer.major == 1 && semVer.minor == 8 && semVer.patch == 1 && pre.contains("drill")) {
    --- End diff --
    
    Sorry I should have looked through the rest of the changes, thank you for fixing this


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

Posted by jaltekruse <gi...@git.apache.org>.
Github user jaltekruse commented on a diff in the pull request:

    https://github.com/apache/drill/pull/595#discussion_r82272848
  
    --- Diff: contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/HiveDrillNativeScanBatchCreator.java ---
    @@ -123,6 +123,7 @@ public ScanBatch getBatch(FragmentContext context, HiveDrillNativeParquetSubScan
               // in the first row group
               ParquetReaderUtility.DateCorruptionStatus containsCorruptDates =
                   ParquetReaderUtility.detectCorruptDates(parquetMetadata, config.getColumns(), true);
    +          logger.info(containsCorruptDates.toString());
    --- End diff --
    
    not a very useful logging statement to check in, should at least contain a message about what this value is. This will just print a boolean value. If it was just for local debugging it can be removed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

Posted by jaltekruse <gi...@git.apache.org>.
Github user jaltekruse commented on a diff in the pull request:

    https://github.com/apache/drill/pull/595#discussion_r81396800
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java ---
    @@ -918,18 +916,22 @@ public void setMax(Object max) {
         @JsonProperty public ConcurrentHashMap<ColumnTypeMetadata_v2.Key, ColumnTypeMetadata_v2> columnTypeInfo;
         @JsonProperty List<ParquetFileMetadata_v2> files;
         @JsonProperty List<String> directories;
    -    @JsonProperty String drillVersion;
    --- End diff --
    
    I had intentionally added the drill version here assuming that it would be good information to have around if a similar issue ever comes up the the future, as well as provide all of the info we need to have an explicit flag that the dates have become correct. For this to work completely, this commit should be the last commit right before a release (it could be a point release). Any particular reason that we would want to not write it into the file?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on a diff in the pull request:

    https://github.com/apache/drill/pull/595#discussion_r81736781
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java ---
    @@ -918,18 +916,22 @@ public void setMax(Object max) {
         @JsonProperty public ConcurrentHashMap<ColumnTypeMetadata_v2.Key, ColumnTypeMetadata_v2> columnTypeInfo;
         @JsonProperty List<ParquetFileMetadata_v2> files;
         @JsonProperty List<String> directories;
    -    @JsonProperty String drillVersion;
    --- End diff --
    
    I agree regarding future similar issues. So I returned this field to `ParquetTableMetadataBase` in the new commit. 
    But to determine that the parquet file is new and definitely with correct date values we have a new flag in parquet metadata "is.date.correct = true". Thus, it is more easy to determine corrupted date values and there is no need to wait the end of release to merge these changes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---