You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by aarondav <gi...@git.apache.org> on 2014/08/04 09:35:21 UTC

[GitHub] spark pull request: [SPARK-2824/2825][SQL] Work towards separating...

GitHub user aarondav opened a pull request:

    https://github.com/apache/spark/pull/1764

    [SPARK-2824/2825][SQL] Work towards separating data location from format

    Currently, there is a fundamental assumption in SparkSQL that a Parquet
    table is stored at a certain Hadoop path and that a Metastore table is
    stored within the Hive warehouse. However, the fact that a table is
    Parquet or serialized as an object file is independent of where the data
    is actually located.
    
    This patch attempts to work towards creating a cleaner separation between
    where the data is located and the format the data is in by introducing
    two concepts: a TableFormat and a TableLocation. This abstraction
    enables code like the following:
    
    ```scala
    val myTable = // ...
    myTable.saveAsTable("myTable", classOf[ParquetFormat])
    hql("SELECT * FROM myTable").collect // reads from Parquet!
    
    // Also allows expansion of file-writing later:
    myTable.saveAsFile("/my/file", classOf[ParquetFormat])
    ```
    
    Additionally, this allows us to trivially support external tables with arbitrary formats.
    
    However, this PR doesn't attempt to make any radical changes. Parquet files still only
    support being written to a single Hadoop directory, but this can be part of a Hive table or
    a normal directory. The MetastoreRelation still requires living within the Metastore because
    it relies heavily on the metadata there. The hope of this patch is that it enables the two linked
    features ([SPARK-2824](https://issues.apache.org/jira/browse/SPARK-2824) and [SPARK-2825](https://issues.apache.org/jira/browse/SPARK-2825)) while adding a useful abstraction for the future.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/aarondav/spark hive

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1764.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1764
    
----
commit b18a4ae3c9bbe8f917ccc7e9aeb1ece25d54bc46
Author: Aaron Davidson <aa...@databricks.com>
Date:   2014-08-02T19:55:41Z

    [SPARK-2824/2825][SQL] Work towards separating data location from format
    
    Currently, there is a fundamental assumption in SparkSQL that a Parquet
    table is stored at a certain Hadoop path and that a Metastore table is
    stored within the Hive warehouse. However, the fact that a table is
    Parquet or serialized as an object file is independent of where the data
    is actually located.
    
    This patch attempts to work towards creating a cleaner separation between
    where the data is located and the format the data is in by introducing
    two concepts: a TableFormat and a TableLocation. This abstraction
    enables code like the following:
    
    ```scala
    val myTable = //
    myTable.saveAsTable("myTable", classOf[ParquetFormat])
    hql("SELECT * FROM myTable").collect // reads from Parquet!
    
    // Also allows expansion of file-writing later:
    myTable.saveAsFile("/my/file", classOf[ParquetFormat])
    ```
    
    Additionally, this allows us to trivially support external tables
    with arbitrary formats.
    
    However, this PR doesn't attempt to make any radical changes. Parquet files still only
    support being written to a single Hadoop directory, but this can be part of a Hive table or
    a normal directory. The MetastoreRelation still requires living within the Metastore because
    it relies heavily on the metadata there. The hope of this patch is that it enables the two linked
    features ([SPARK-2824](https://issues.apache.org/jira/browse/SPARK-2824) and [SPARK-2825](https://issues.apache.org/jira/browse/SPARK-2825)) while adding a useful abstraction for the
    future.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2824/2825][SQL] Work towards separating...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1764#issuecomment-51026425
  
    QA tests have started for PR 1764. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17849/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2824/2825][SQL] Work towards separating...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1764#issuecomment-51026778
  
    QA results for PR 1764:<br>- This patch FAILED unit tests.<br>- This patch merges cleanly<br>- This patch adds the following public classes (experimental):<br>class ParquetFormat(sqlContext: SQLContext, conf: Configuration) extends TableFormat {<br>class MetastoreFormat(hiveContext: HiveContext) extends TableFormat {<br>case class HiveTableLocation(<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17850/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2824/2825][SQL] Work towards separating...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1764#issuecomment-51026733
  
    QA tests have started for PR 1764. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17850/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2824/2825][SQL] Work towards separating...

Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1764#discussion_r15743336
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---
    @@ -353,15 +356,14 @@ private[parquet] object ParquetTypesConverter extends Logging {
        * in the parent directory. If so, this is used. Else we read the actual footer at the given
        * location.
        * @param origPath The path at which we expect one (or more) Parquet files.
    -   * @param configuration The Hadoop configuration to use.
    +   * @param conf The Hadoop configuration to use.
        * @return The `ParquetMetadata` containing among other things the schema.
        */
    -  def readMetaData(origPath: Path, configuration: Option[Configuration]): ParquetMetadata = {
    +  def readMetaData(origPath: Path, conf: Configuration): ParquetMetadata = {
         if (origPath == null) {
           throw new IllegalArgumentException("Unable to read Parquet metadata: path is null")
         }
         val job = new Job()
    -    val conf = configuration.getOrElse(ContextUtil.getConfiguration(job))
    --- End diff --
    
    I wanted to get rid of the optional configuration, but perhaps I should put this back. Making a new Job and then asking for a Configuration doesn't seem like it'd be more useful than just constructing a new Configuration, though, at least in terms of the properties set.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2824/2825][SQL] Work towards separating...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1764#issuecomment-51112955
  
    QA tests have started for PR 1764. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17871/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2824/2825][SQL] Work towards separating...

Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1764#discussion_r15743237
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/TableFormat.scala ---
    @@ -0,0 +1,53 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.catalyst.plans.physical
    --- End diff --
    
    Not certain what the right place for this class is...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2824/2825][SQL] Work towards separating...

Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1764#discussion_r15743214
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala ---
    @@ -266,7 +266,8 @@ package object dsl {
     
       object plans {  // scalastyle:ignore
         implicit class DslLogicalPlan(val logicalPlan: LogicalPlan) extends LogicalPlanFunctions {
    -      def writeToFile(path: String) = WriteToFile(path, logicalPlan)
    +    // Writing to files doesn't make sense without Hadoop in scope (unless we assume local fs)
    +//      def writeToFile(path: String) = WriteToFile(path, classOf[??], logicalPlan)
    --- End diff --
    
    Not sure what to do with this, it doesn't seem to make a lot of sense in the catalyst package unless catalyst wants to depend on Hadoop.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2824/2825][SQL] Work towards separating...

Posted by aarondav <gi...@git.apache.org>.
Github user aarondav closed the pull request at:

    https://github.com/apache/spark/pull/1764


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2824/2825][SQL] Work towards separating...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1764#issuecomment-51122520
  
    QA results for PR 1764:<br>- This patch FAILED unit tests.<br>- This patch merges cleanly<br>- This patch adds the following public classes (experimental):<br>abstract class TableFormat {<br>class ParquetFormat(sqlContext: SQLContext, conf: Configuration) extends TableFormat {<br>class MetastoreFormat(hiveContext: HiveContext) extends TableFormat {<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17871/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2824/2825][SQL] Work towards separating...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1764#issuecomment-51026471
  
    QA results for PR 1764:<br>- This patch FAILED unit tests.<br>- This patch merges cleanly<br>- This patch adds the following public classes (experimental):<br>class ParquetFormat(sqlContext: SQLContext, conf: Configuration) extends TableFormat {<br>class MetastoreFormat(hiveContext: HiveContext) extends TableFormat {<br>case class HiveTableLocation(<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17849/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org