Posted to reviews@spark.apache.org by nongli <gi...@git.apache.org> on 2016/01/05 00:47:53 UTC

[GitHub] spark pull request: [SPARK-12636][SQL] Update UnsafeRowParquetReco...

GitHub user nongli opened a pull request:

    https://github.com/apache/spark/pull/10581

    [SPARK-12636][SQL] Update UnsafeRowParquetRecordReader to support reading files directly.

    As noted in the code, this change is to make this component easier to test in isolation.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nongli/spark spark-12636

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10581.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10581
    
----
commit 7eeff58298ceac076779a5cae05ca674ed0ac51a
Author: Nong <no...@gmail.com>
Date:   2015-12-31T22:45:30Z

    [SPARK-12636][SQL] Update UnsafeRowParquetRecordReader to support reading paths directly.
    
    As noted in the code, this change is to make this component easier to
    test in isolation.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org



Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10581#issuecomment-168865733
  
    **[Test build #48696 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48696/consoleFull)** for PR 10581 at commit [`7eeff58`](https://github.com/apache/spark/commit/7eeff58298ceac076779a5cae05ca674ed0ac51a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.





Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10581#issuecomment-168866407
  
    Merged build finished. Test PASSed.





Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/10581#issuecomment-168915398
  
    The high-level change looks good to me, although I have to admit I'm not familiar with the details of Parquet.
    
    cc @nongli and @liancheng 





Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10581#issuecomment-168849732
  
    **[Test build #48696 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48696/consoleFull)** for PR 10581 at commit [`7eeff58`](https://github.com/apache/spark/commit/7eeff58298ceac076779a5cae05ca674ed0ac51a).





Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/10581#issuecomment-169143525
  
    Merging this to unblock the follow-up PR (the remaining comment could be addressed there).





Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10581#issuecomment-168866409
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48696/





Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10581#discussion_r48898645
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java ---
    @@ -125,20 +129,80 @@ public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptCont
                     + " in range " + split.getStart() + ", " + split.getEnd());
           }
         }
    -    MessageType fileSchema = footer.getFileMetaData().getSchema();
    +    this.fileSchema = footer.getFileMetaData().getSchema();
         Map<String, String> fileMetadata = footer.getFileMetaData().getKeyValueMetaData();
    -    this.readSupport = getReadSupportInstance(
    +    ReadSupport<T> readSupport = getReadSupportInstance(
             (Class<? extends ReadSupport<T>>) getReadSupportClass(configuration));
         ReadSupport.ReadContext readContext = readSupport.init(new InitContext(
             taskAttemptContext.getConfiguration(), toSetMultiMap(fileMetadata), fileSchema));
         this.requestedSchema = readContext.getRequestedSchema();
    -    this.fileSchema = fileSchema;
    +    this.sparkSchema = new CatalystSchemaConverter(configuration).convert(requestedSchema);
         this.reader = new ParquetFileReader(configuration, file, blocks, requestedSchema.getColumns());
         for (BlockMetaData block : blocks) {
           this.totalRowCount += block.getRowCount();
         }
       }
     
    +  /**
    +   * Returns the list of files at 'path' recursively. This skips files that are ignored normally
    +   * by MapReduce.
    +   */
    +  public static List<String> listDirectory(File path) throws IOException {
    --- End diff --
    
    Is this only used by tests?
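
For readers following the question above: the diff is cut off at the method signature, so the body of `listDirectory` is not shown in this thread. The following is only a minimal sketch of what the Javadoc describes, not the code from the PR. It assumes that "files that are ignored normally by MapReduce" means files whose names start with `.` or `_` (the hidden-file convention used by Hadoop's `FileInputFormat`). The class name `ListDirectorySketch` is hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper class; the PR places the method in
// SpecificParquetRecordReaderBase.
public class ListDirectorySketch {
  /**
   * Returns the absolute paths of all regular files under 'path', recursing
   * into subdirectories. Files whose names start with '.' or '_' are skipped
   * (ASSUMPTION: this is the "ignored by MapReduce" rule the Javadoc means).
   */
  public static List<String> listDirectory(File path) throws IOException {
    List<String> result = new ArrayList<>();
    if (path.isDirectory()) {
      File[] children = path.listFiles();
      if (children == null) {
        // listFiles() returns null when the directory cannot be read.
        throw new IOException("Could not list directory: " + path);
      }
      for (File child : children) {
        result.addAll(listDirectory(child));
      }
    } else if (path.isFile()) {
      String name = path.getName();
      boolean hidden = name.isEmpty()
          || name.charAt(0) == '.'
          || name.charAt(0) == '_';
      if (!hidden) {
        result.add(path.getAbsolutePath());
      }
    }
    return result;
  }
}
```

In this sketch the name filter applies to files only, so a directory whose name starts with `_` or `.` is still traversed. Whether the PR's version behaves the same way cannot be determined from the truncated diff.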





Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/10581#issuecomment-169140437
  
    LGTM, except for one minor comment.





Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/10581





Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on the pull request:

    https://github.com/apache/spark/pull/10581#issuecomment-169134087
  
    This doesn't really do much; it just makes this component creatable without the Hadoop machinery.

