You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by amansinha100 <gi...@git.apache.org> on 2016/01/29 02:18:29 UTC

[GitHub] drill pull request: DRILL-4287: During initial DrillTable creation...

GitHub user amansinha100 opened a pull request:

    https://github.com/apache/drill/pull/345

    DRILL-4287: During initial DrillTable creation don't read the metadat…

    …a cache file; instead do it during ParquetGroupScan.
    
    Maintain state in FileSelection to keep track of whether certain operations have been done on that selection.
    
    Remove ParquetFileSelection since its only purpose was to carry the metadata cache information which is not needed anymore.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/amansinha100/incubator-drill DRILL-4287

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/345.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #345
    
----
commit a479137911bc6d4821a138789db254d6eee43316
Author: Aman Sinha <as...@maprtech.com>
Date:   2016-01-18T18:26:59Z

    DRILL-4287: During initial DrillTable creation don't read the metadata cache file; instead do it during ParquetGroupScan.
    
    Maintain state in FileSelection to keep track of whether certain operations have been done on that selection.
    
    Remove ParquetFileSelection since its only purpose was to carry the metadata cache information which is not needed anymore.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-4287: During initial DrillTable creation...

Posted by amansinha100 <gi...@git.apache.org>.
Github user amansinha100 closed the pull request at:

    https://github.com/apache/drill/pull/345


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-4287: During initial DrillTable creation...

Posted by jinfengni <gi...@git.apache.org>.
Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/345#discussion_r51518560
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -338,8 +354,14 @@ private boolean hasSingleValue(ColumnMetadata columnChunkMetaData) {
         return (columnChunkMetaData != null) && (columnChunkMetaData.hasSingleValue());
       }
     
    +  @Override
    --- End diff --
    
    JsonIgnore is needed here? 



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-4287: During initial DrillTable creation...

Posted by adeneche <gi...@git.apache.org>.
Github user adeneche commented on a diff in the pull request:

    https://github.com/apache/drill/pull/345#discussion_r52516038
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -182,10 +183,24 @@ public ParquetGroupScan( //
         }
     
         this.selectionRoot = selectionRoot;
    -    if (selection instanceof ParquetFileSelection) {
    -      final ParquetFileSelection pfs = ParquetFileSelection.class.cast(selection);
    -      this.parquetTableMetadata = pfs.getParquetMetadata();
    +
    +    FileSelection newSelection = null;
    +    if (!selection.isExpanded()) {
    +      FileStatus firstPath = selection.getFirstPath(fs);
    +      Path p = new Path(firstPath.getPath(), Metadata.METADATA_FILENAME);
    +      if (!fs.exists(p)) { // no metadata cache
    +        if (selection.checkedForDirectories() && selection.hasDirectories()) {
    --- End diff --
    
    This won't work if `checkedForDirectories==false`, right ?
    `hasDirectories()` already uses `checkedForDirectories` internally, the following should work:
     
        if (selection.hasDirectories()) {


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-4287: During initial DrillTable creation...

Posted by jinfengni <gi...@git.apache.org>.
Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/345#discussion_r51518520
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FileSelection.java ---
    @@ -118,13 +133,34 @@ public boolean apply(@Nullable FileStatus status) {
           }
         }));
     
    -    return create(nonDirectories, null, selectionRoot);
    +    final FileSelection fileSel = create(nonDirectories, null, selectionRoot);
    --- End diff --
    
    Does it make sense to pass in the list of Files in addition to the list of FileStatus? I thought the implicit assumption for FileSelection is list of files goes first before FileStatus. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-4287: During initial DrillTable creation...

Posted by amansinha100 <gi...@git.apache.org>.
Github user amansinha100 commented on the pull request:

    https://github.com/apache/drill/pull/345#issuecomment-183933804
  
    Closing this PR in favor of updated one #376 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-4287: During initial DrillTable creation...

Posted by adeneche <gi...@git.apache.org>.
Github user adeneche commented on a diff in the pull request:

    https://github.com/apache/drill/pull/345#discussion_r52519003
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -521,6 +543,25 @@ public void setEndpointByteMap(EndpointByteMap byteMap) {
         }
       }
     
    +  private FileSelection
    +  getSelectionFromMetadataCache(DrillFileSystem fs, FileSelection selection) throws IOException {
    +    FileStatus metaRootDir = selection.getFirstPath(fs);
    +    Path metaFilePath = new Path(metaRootDir.getPath(), Metadata.METADATA_FILENAME);
    +
    +    // get the metadata for the directory by reading the metadata file
    +    Metadata.ParquetTableMetadataBase metadata  = Metadata.readBlockMeta(fs, metaFilePath.toString());
    +    List<String> fileNames = Lists.newArrayList();
    +    for (Metadata.ParquetFileMetadata file : metadata.getFiles()) {
    +      fileNames.add(file.getPath());
    +    }
    +    // when creating the file selection, set the selection root in the form /a/b instead of
    +    // file:/a/b.  The reason is that the file names above have been created in the form
    +    // /a/b/c.parquet and the format of the selection root must match that of the file names
    +    // otherwise downstream operations such as partition pruning can break.
    +    final Path metaRootPath = Path.getPathWithoutSchemeAndAuthority(metaRootDir.getPath());
    +    return FileSelection.create(selection.getStatuses(fs), fileNames, metaRootPath.toString());
    --- End diff --
    
    `FileSelection.create()` expects either a list of statuses or a list of filenames, but not both.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-4287: During initial DrillTable creation...

Posted by jinfengni <gi...@git.apache.org>.
Github user jinfengni commented on the pull request:

    https://github.com/apache/drill/pull/345#issuecomment-178746517
  
    Overall, looks good to me. 
    
    +1



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-4287: During initial DrillTable creation...

Posted by amansinha100 <gi...@git.apache.org>.
Github user amansinha100 commented on a diff in the pull request:

    https://github.com/apache/drill/pull/345#discussion_r51528494
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -338,8 +354,14 @@ private boolean hasSingleValue(ColumnMetadata columnChunkMetaData) {
         return (columnChunkMetaData != null) && (columnChunkMetaData.hasSingleValue());
       }
     
    +  @Override
    --- End diff --
    
    Yes, I suppose you encountered a failure in some tests due to this...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-4287: During initial DrillTable creation...

Posted by jinfengni <gi...@git.apache.org>.
Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/345#discussion_r51612851
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -338,8 +354,14 @@ private boolean hasSingleValue(ColumnMetadata columnChunkMetaData) {
         return (columnChunkMetaData != null) && (columnChunkMetaData.hasSingleValue());
       }
     
    +  @Override
    --- End diff --
    
    Right. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-4287: During initial DrillTable creation...

Posted by amansinha100 <gi...@git.apache.org>.
Github user amansinha100 commented on a diff in the pull request:

    https://github.com/apache/drill/pull/345#discussion_r51528367
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FileSelection.java ---
    @@ -118,13 +133,34 @@ public boolean apply(@Nullable FileStatus status) {
           }
         }));
     
    -    return create(nonDirectories, null, selectionRoot);
    +    final FileSelection fileSel = create(nonDirectories, null, selectionRoot);
    --- End diff --
    
    I did not change the existing create() call .. it was passing null for the list of files.  I think it is ok because FileStatus contains the Path which reflects the file name.  It also seems that if either the list of files is null or the list of statuses is null, the functions getFiles() or getStatuses() can compute them on the first invocation and save for future use.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-4287: During initial DrillTable creation...

Posted by jinfengni <gi...@git.apache.org>.
Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/345#discussion_r51612792
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FileSelection.java ---
    @@ -118,13 +133,34 @@ public boolean apply(@Nullable FileStatus status) {
           }
         }));
     
    -    return create(nonDirectories, null, selectionRoot);
    +    final FileSelection fileSel = create(nonDirectories, null, selectionRoot);
    --- End diff --
    
    Agreed that it's existing create() call. It seems fine to keep it as is.  
    
    This new FileSelection is special, since it has excluded directories by going through the list of file status recursively. In that sense, it's hasDirectories = false, checkedForDirectories=true, and the list of files are already available while going through the file status list. If a FileSelection always has to call containsDirectory() or getFile() eventually, then it might be better to store them when they are available. But I guess it's minor thing to consider. 



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---