You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by AndreSchumacher <gi...@git.apache.org> on 2014/03/21 16:33:31 UTC

[GitHub] spark pull request: Spark parquet improvements

GitHub user AndreSchumacher opened a pull request:

    https://github.com/apache/spark/pull/195

    Spark parquet improvements

    A few improvements to the Parquet support for SQL queries:
    - Instead of files a ParquetRelation is now backed by a directory, which simplifies importing data from other
      sources
    - InsertIntoParquetTable operation now supports switching between overwriting or appending (at least in
      HiveQL)
    - tests now use the new API
    - Parquet logging can be set to WARNING level (Default)
    - Default compression for Parquet files (GZIP, as in parquet-mr)


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/AndreSchumacher/spark spark_parquet_improvements

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/195.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #195
    
----
commit 14a1d2c1a5df4475ffccaf3ce36a41f2234ec3b7
Author: Andre Schumacher <an...@iki.fi>
Date:   2014-03-18T13:23:26Z

    Changing ParquetRelation underlying data from file to dir

commit d6630d408cfadd6ef67767f4794f57b7ebe1c605
Author: Andre Schumacher <an...@iki.fi>
Date:   2014-03-19T06:04:37Z

    Optional overwrite when inserting into ParquetRelation

commit d1d3639d8c45f93dd485628cb02ab5ef1dccc93e
Author: Andre Schumacher <an...@iki.fi>
Date:   2014-03-19T11:32:51Z

    Update of Parquet tests to new API

commit 233e67f5571b9be38c04c0205ed522455cd91661
Author: Andre Schumacher <an...@iki.fi>
Date:   2014-03-19T17:15:09Z

    Implementing appending to existing ParquetRelation

commit b8abe01a54090191e89d49f0960096b0844b3fd7
Author: Andre Schumacher <an...@iki.fi>
Date:   2014-03-14T19:09:15Z

    Setting Parquet log level

commit 0a07f05e094e99317913c619c9ff08bc45cc933c
Author: Andre Schumacher <an...@iki.fi>
Date:   2014-03-21T13:14:55Z

    Adding Parquet debug parameter and default compression

commit 05ed2477718798fd373c9b478f25518bf8919381
Author: Andre Schumacher <an...@iki.fi>
Date:   2014-03-21T15:26:12Z

    Adding future example

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-39421940
  
    Merged build finished. All automated tests passed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/195#discussion_r11002153
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala ---
    @@ -210,3 +224,48 @@ case class InsertIntoParquetTable(
       }
     }
     
    +// TODO: this will be able to append to directories it created itself, not necessarily
    --- End diff --
    
    What is the complication here?  Can we detect it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38704074
  
     Build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AndreSchumacher <gi...@git.apache.org>.

Github user AndreSchumacher commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38701638
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-39186013
  
    Merged build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/195#discussion_r10856726
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala ---
    @@ -72,16 +73,43 @@ case class ParquetRelation(val tableName: String, val path: String) extends Base
       /** Output **/
       override val output = attributes
     
    +  /** Name (dummy value) */
    +  // TODO: rethink whether ParquetRelation should inherit from BaseRelation
    +  // (currently required to re-use HiveStrategies but should be removed)
    +  override def tableName = "parquet"
    +
       // Parquet files have no concepts of keys, therefore no Partitioner
       // Note: we could allow Block level access; needs to be thought through
       override def isPartitioned = false
     }
     
     object ParquetRelation {
    +  // change this to enable/disable Parquet logging
    +  var DEBUG: Boolean = false
    +
    +  // TODO: consider redirecting Parquet's log output to log4j logger and
    +  // using config file for log settings
    +  def setParquetLogLevel() {
    +    val level: Level = if (DEBUG) Level.FINEST else Level.WARNING
    --- End diff --
    
    In spark we use slf4j for logging and by default we output to log4j. Are you finding that parquet logs are going through JUL when you run catalyst? This means something is not set-up correctly... ideally everything should be routed through slf4j.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by chutium <gi...@git.apache.org>.

Github user chutium commented on a diff in the pull request:

    https://github.com/apache/spark/pull/195#discussion_r15936076
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala ---
    @@ -72,16 +71,56 @@ case class ParquetRelation(val tableName: String, val path: String) extends Base
       /** Output **/
       override val output = attributes
     
    +  /** Name (dummy value) */
    +  // TODO: rethink whether ParquetRelation should inherit from BaseRelation
    +  // (currently required to re-use HiveStrategies but should be removed)
    +  override def tableName = "parquet"
    +
       // Parquet files have no concepts of keys, therefore no Partitioner
       // Note: we could allow Block level access; needs to be thought through
       override def isPartitioned = false
     }
     
    -object ParquetRelation {
    +private[sql] object ParquetRelation {
    +  // change this to enable/disable Parquet logging
    +  var DEBUG: Boolean = false
    +
    +  def setParquetLogLevel() {
    +    // Note: Parquet does not use forwarding to parent loggers which
    +    // is required for the JUL-SLF4J bridge to work. Also there is
    +    // a default logger that appends to Console which needs to be
    +    // reset.
    +    import org.slf4j.bridge.SLF4JBridgeHandler
    +    import java.util.logging.Logger
    +    import java.util.logging.LogManager
    +
    +    val loggerNames = Seq(
    +      "parquet.hadoop.ColumnChunkPageWriteStore",
    +      "parquet.hadoop.InternalParquetRecordWriter",
    +      "parquet.hadoop.ParquetRecordReader",
    +      "parquet.hadoop.ParquetInputFormat",
    +      "parquet.hadoop.ParquetOutputFormat",
    +      "parquet.hadoop.ParquetFileReader",
    +      "parquet.hadoop.InternalParquetRecordReader",
    +      "parquet.hadoop.codec.CodecConfig")
    +    LogManager.getLogManager.reset()
    +    SLF4JBridgeHandler.install()
    +    for(name <- loggerNames) {
    +      val logger = Logger.getLogger(name)
    +      logger.setParent(Logger.getGlobal)
    +      logger.setUseParentHandlers(true)
    +    }
    +  }
     
       // The element type for the RDDs that this relation maps to.
       type RowType = org.apache.spark.sql.catalyst.expressions.GenericMutableRow
     
    +  // The compression type
    +  type CompressionType = parquet.hadoop.metadata.CompressionCodecName
    +
    +  // The default compression
    +  val defaultCompression = CompressionCodecName.GZIP
    --- End diff --
    
    Hi @AndreSchumacher and @marmbrus , it seems we can use a hadoop config property to change this ```defaultCompression```, in ```createEmpty``` method, there is a check:
    ```
        if (conf.get(ParquetOutputFormat.COMPRESSION) == null) {
          conf.set(ParquetOutputFormat.COMPRESSION, ParquetRelation.defaultCompression.name())
        }
    ```
    
    but it is a hadoop config property, not a spark config property, so we can not simply set this property in conf/spark-defaults.conf


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: Spark parquet improvements

Posted by AndreSchumacher <gi...@git.apache.org>.

Github user AndreSchumacher commented on a diff in the pull request:

    https://github.com/apache/spark/pull/195#discussion_r11071279
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala ---
    @@ -40,7 +40,7 @@ import java.util.Date
      * Parquet table scan operator. Imports the file that backs the given
      * [[ParquetRelation]] as a RDD[Row].
      */
    -case class ParquetTableScan(
    +private[sql] case class ParquetTableScan(
         @transient output: Seq[Attribute],
         @transient relation: ParquetRelation,
         @transient columnPruningPred: Option[Expression])(
    --- End diff --
    
    @jerryshao I can't comment on the ``SparkEquiInnerJoin`` code but I was also under the impression these fields would only be used to build the execution plan. I'm fine with removing the transient if needed though (@rxin suggested to make these transient, any comments?)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38988222
  
    Merged build started. Build is starting -or- tests failed to complete.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38942145
  
    Build is starting -or- tests failed to complete.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13550/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38836511
  
    One or more automated tests failed
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13521/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AndreSchumacher <gi...@git.apache.org>.

Github user AndreSchumacher commented on a diff in the pull request:

    https://github.com/apache/spark/pull/195#discussion_r10892494
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala ---
    @@ -72,16 +73,43 @@ case class ParquetRelation(val tableName: String, val path: String) extends Base
       /** Output **/
       override val output = attributes
     
    +  /** Name (dummy value) */
    +  // TODO: rethink whether ParquetRelation should inherit from BaseRelation
    +  // (currently required to re-use HiveStrategies but should be removed)
    +  override def tableName = "parquet"
    +
       // Parquet files have no concepts of keys, therefore no Partitioner
       // Note: we could allow Block level access; needs to be thought through
       override def isPartitioned = false
     }
     
     object ParquetRelation {
    +  // change this to enable/disable Parquet logging
    +  var DEBUG: Boolean = false
    +
    +  // TODO: consider redirecting Parquet's log output to log4j logger and
    +  // using config file for log settings
    +  def setParquetLogLevel() {
    +    val level: Level = if (DEBUG) Level.FINEST else Level.WARNING
    --- End diff --
    
    Now I'm actually reading this here: <em>j.u.l. to SLF4J translation can seriously increase the cost of disabled logging statements (60 fold or 6000%)</em> Apparently there is a way to void this by using logback (a fork of log4j?). Parquet does fairly low-level logging and relies statements on these now being compiled as I understand. Any opinions? I could try if it would work via logback or see how this would degrade performance.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-39304970
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-39215197
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-39457180
  
    Merged build finished. All automated tests passed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38704075
  
    Build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/195#discussion_r10903544
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala ---
    @@ -72,16 +73,43 @@ case class ParquetRelation(val tableName: String, val path: String) extends Base
       /** Output **/
       override val output = attributes
     
    +  /** Name (dummy value) */
    +  // TODO: rethink whether ParquetRelation should inherit from BaseRelation
    +  // (currently required to re-use HiveStrategies but should be removed)
    +  override def tableName = "parquet"
    +
       // Parquet files have no concepts of keys, therefore no Partitioner
       // Note: we could allow Block level access; needs to be thought through
       override def isPartitioned = false
     }
     
     object ParquetRelation {
    +  // change this to enable/disable Parquet logging
    +  var DEBUG: Boolean = false
    +
    +  // TODO: consider redirecting Parquet's log output to log4j logger and
    +  // using config file for log settings
    +  def setParquetLogLevel() {
    +    val level: Level = if (DEBUG) Level.FINEST else Level.WARNING
    --- End diff --
    
    Hmmm, yeah if the are logging statements in the critical path of producing tuples then this is probably something we want to avoid.  Otherwise the performance penalty probably doesn't matter. Can you confirm if that is the case.  
    
    If logback is sufficiently close to log4j then we might be consider switching.  An alternative would be to just leave the logging coming out of JUL.  Any thoughts @pwendell ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/195#discussion_r10859898
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala ---
    @@ -210,3 +220,47 @@ case class InsertIntoParquetTable(
       }
     }
     
    +// TODO: this will be able to append to directories it created itself, not necessarily
    +// to imported ones
    +class AppendingParquetOutputFormat(offset: Int) extends parquet.hadoop.ParquetOutputFormat[Row] {
    --- End diff --
    
    would it make sense to make this private?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AndreSchumacher <gi...@git.apache.org>.

Github user AndreSchumacher commented on a diff in the pull request:

    https://github.com/apache/spark/pull/195#discussion_r10867922
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala ---
    @@ -72,16 +73,43 @@ case class ParquetRelation(val tableName: String, val path: String) extends Base
       /** Output **/
       override val output = attributes
     
    +  /** Name (dummy value) */
    +  // TODO: rethink whether ParquetRelation should inherit from BaseRelation
    +  // (currently required to re-use HiveStrategies but should be removed)
    +  override def tableName = "parquet"
    +
       // Parquet files have no concepts of keys, therefore no Partitioner
       // Note: we could allow Block level access; needs to be thought through
       override def isPartitioned = false
     }
     
     object ParquetRelation {
    +  // change this to enable/disable Parquet logging
    +  var DEBUG: Boolean = false
    +
    +  // TODO: consider redirecting Parquet's log output to log4j logger and
    +  // using config file for log settings
    +  def setParquetLogLevel() {
    +    val level: Level = if (DEBUG) Level.FINEST else Level.WARNING
    --- End diff --
    
    Thanks, that was exactly what I was looking for. Will try to find out what is going wrong there.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38830140
  
     Build triggered. One or more automated tests failed


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38988220
  
     Merged build triggered. Build is starting -or- tests failed to complete.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-39402493
  
    Also sorry for commenting in a weird way that doesn't show up on the PR page... oops!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-39221975
  
    Merged build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/195#discussion_r10859905
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala ---
    @@ -210,3 +220,47 @@ case class InsertIntoParquetTable(
       }
     }
     
    +// TODO: this will be able to append to directories it created itself, not necessarily
    +// to imported ones
    +class AppendingParquetOutputFormat(offset: Int) extends parquet.hadoop.ParquetOutputFormat[Row] {
    +  // override to accept existing directories as valid output directory
    +  override def checkOutputSpecs(job: JobContext): Unit = {}
    +
    +  // override to choose output filename so not overwrite existing ones
    +  override def getDefaultWorkFile(context: TaskAttemptContext, extension: String): Path = {
    +    val taskId: TaskID = context.getTaskAttemptID.getTaskID
    +    val partition: Int = taskId.getId
    +    val filename = s"part-r-${partition + offset}.parquet"
    +    val committer: FileOutputCommitter =
    +      getOutputCommitter(context).asInstanceOf[FileOutputCommitter]
    +    new Path(committer.getWorkPath, filename)
    +  }
    +}
    +
    +object FileSystemHelper {
    --- End diff --
    
    should this be `private` or `private sql`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/195#discussion_r10857590
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala ---
    @@ -72,16 +73,43 @@ case class ParquetRelation(val tableName: String, val path: String) extends Base
       /** Output **/
       override val output = attributes
     
    +  /** Name (dummy value) */
    +  // TODO: rethink whether ParquetRelation should inherit from BaseRelation
    +  // (currently required to re-use HiveStrategies but should be removed)
    +  override def tableName = "parquet"
    +
       // Parquet files have no concepts of keys, therefore no Partitioner
       // Note: we could allow Block level access; needs to be thought through
       override def isPartitioned = false
     }
     
     object ParquetRelation {
    +  // change this to enable/disable Parquet logging
    +  var DEBUG: Boolean = false
    +
    +  // TODO: consider redirecting Parquet's log output to log4j logger and
    +  // using config file for log settings
    +  def setParquetLogLevel() {
    +    val level: Level = if (DEBUG) Level.FINEST else Level.WARNING
    --- End diff --
    
    https://github.com/apache/spark/blob/master/project/SparkBuild.scala#L295


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38711583
  
    Build finished.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by jerryshao <gi...@git.apache.org>.

Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/195#discussion_r11053203
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala ---
    @@ -40,7 +40,7 @@ import java.util.Date
      * Parquet table scan operator. Imports the file that backs the given
      * [[ParquetRelation]] as a RDD[Row].
      */
    -case class ParquetTableScan(
    +private[sql] case class ParquetTableScan(
         @transient output: Seq[Attribute],
         @transient relation: ParquetRelation,
         @transient columnPruningPred: Option[Expression])(
    --- End diff --
    
    Hi @AndreSchumacher , I'm not sure why these three parameters need to be transient. In my test, `ParquetTableScan` needs to be serialized to ExecuteBackend to generate keys in `SparkEquiInnerJoin`, and this transient will lead the `output` to be null value. I'm not sure what's your design purpose, maybe I miss something. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-39418187
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-39513179
  
    merged. thx!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38696187
  
    One or more automated tests failed
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13470/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38989836
  
    Merged build finished. Build is starting -or- tests failed to complete.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-39314774
  
    Merged build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38936429
  
    Merged build started. Build is starting -or- tests failed to complete.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AndreSchumacher <gi...@git.apache.org>.

Github user AndreSchumacher commented on a diff in the pull request:

    https://github.com/apache/spark/pull/195#discussion_r10975266
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala ---
    @@ -72,16 +73,43 @@ case class ParquetRelation(val tableName: String, val path: String) extends Base
       /** Output **/
       override val output = attributes
     
    +  /** Name (dummy value) */
    +  // TODO: rethink whether ParquetRelation should inherit from BaseRelation
    +  // (currently required to re-use HiveStrategies but should be removed)
    +  override def tableName = "parquet"
    +
       // Parquet files have no concepts of keys, therefore no Partitioner
       // Note: we could allow Block level access; needs to be thought through
       override def isPartitioned = false
     }
     
     object ParquetRelation {
    +  // change this to enable/disable Parquet logging
    +  var DEBUG: Boolean = false
    +
    +  // TODO: consider redirecting Parquet's log output to log4j logger and
    +  // using config file for log settings
    +  def setParquetLogLevel() {
    +    val level: Level = if (DEBUG) Level.FINEST else Level.WARNING
    --- End diff --
    
    @marmbrus So I read up on this a bit more and as I understand the log4j bridge should not increase the log output rate so there shouldn't be any performance issues (confirmed that by a microbenchmark). The upper bound is Parquet's own default log level (INFO) and log4j can only filter from there. I got the bridge working now, too, although it still requires setting the forwarding logger after Parquet has initialized its own default logger. Anyway.. it's better than before and I will push the commits shortly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/195#discussion_r11002023
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala ---
    @@ -72,16 +71,56 @@ case class ParquetRelation(val tableName: String, val path: String) extends Base
       /** Output **/
       override val output = attributes
     
    +  /** Name (dummy value) */
    +  // TODO: rethink whether ParquetRelation should inherit from BaseRelation
    --- End diff --
    
    This isn't true anymore after my patch right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-39450898
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38945561
  
    Merged build finished. Build is starting -or- tests failed to complete.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38830164
  
    Build started. One or more automated tests failed


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38836510
  
    Build finished. One or more automated tests failed


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-39418179
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38945563
  
    Build is starting -or- tests failed to complete.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13551/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38695928
  
     Build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AndreSchumacher <gi...@git.apache.org>.

Github user AndreSchumacher commented on a diff in the pull request:

    https://github.com/apache/spark/pull/195#discussion_r11072105
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala ---
    @@ -210,3 +224,48 @@ case class InsertIntoParquetTable(
       }
     }
     
    +// TODO: this will be able to append to directories it created itself, not necessarily
    --- End diff --
    
    I added a warning now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/195#discussion_r11002178
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
    @@ -132,7 +132,7 @@ trait HiveStrategies {
                       relation.asInstanceOf[MetastoreRelation],
                       None)(hiveContext)
                   }
    -              case ParquetRelation(_, _) => {
    +              case ParquetRelation(_) => {
    --- End diff --
    
    I think this is all gone now,  can you rebase against master?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-39215180
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38935838
  
     Merged build triggered. Build is starting -or- tests failed to complete.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by jerryshao <gi...@git.apache.org>.

Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/195#discussion_r11053408
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala ---
    @@ -40,7 +40,7 @@ import java.util.Date
      * Parquet table scan operator. Imports the file that backs the given
      * [[ParquetRelation]] as a RDD[Row].
      */
    -case class ParquetTableScan(
    +private[sql] case class ParquetTableScan(
         @transient output: Seq[Attribute],
         @transient relation: ParquetRelation,
         @transient columnPruningPred: Option[Expression])(
    --- End diff --
    
    Hi Michael,  actually I met NPE in my test. I think this code in `SparkEquiInnerJoin` will be executed in ExecuteBackend.
    
        val rightWithKeys = right.execute().mapPartitions { iter =>
          val generateRightKeys = new Projection(rightKeys, right.output)
          iter.map(row => (generateRightKeys(row), row.copy()))
        }
    
    As `right.output` is needed when when running mapPartitions, so this `right` will be serialized to ExecuteBackend, which may introduce the error I mentioned before.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38942144
  
    Merged build finished. Build is starting -or- tests failed to complete.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-39500398
  
    LGTM
    
    @pwendell this is ready to merge.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38711584
  
    One or more automated tests failed
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13471/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-39421942
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13720/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/195


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-39108623
  
    Yes, that sounds like a good solution.  `reset()` should clear all state from a `HiveContext`.  Note that this may require us to make the table registration more explicit (now I believe it only happens once when the singleton is initialized).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38988906
  
    Build is starting -or- tests failed to complete.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13569/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/195#discussion_r11002041
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala ---
    @@ -72,16 +71,56 @@ case class ParquetRelation(val tableName: String, val path: String) extends Base
       /** Output **/
       override val output = attributes
     
    +  /** Name (dummy value) */
    +  // TODO: rethink whether ParquetRelation should inherit from BaseRelation
    +  // (currently required to re-use HiveStrategies but should be removed)
    +  override def tableName = "parquet"
    +
       // Parquet files have no concepts of keys, therefore no Partitioner
       // Note: we could allow Block level access; needs to be thought through
       override def isPartitioned = false
     }
     
    -object ParquetRelation {
    +private[sql] object ParquetRelation {
    +  // change this to enable/disable Parquet logging
    +  var DEBUG: Boolean = false
    --- End diff --
    
    could we just take the desired level as an argument to the function below.  Maybe with a default value?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38988905
  
    Merged build finished. Build is starting -or- tests failed to complete.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AndreSchumacher <gi...@git.apache.org>.

Github user AndreSchumacher commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-39418052
  
    @marmbrus Great, thanks a lot. I will go through those comments and your PR and extend the documentation.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AndreSchumacher <gi...@git.apache.org>.

Github user AndreSchumacher commented on a diff in the pull request:

    https://github.com/apache/spark/pull/195#discussion_r10867923
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala ---
    @@ -210,3 +220,47 @@ case class InsertIntoParquetTable(
       }
     }
     
    +// TODO: this will be able to append to directories it created itself, not necessarily
    +// to imported ones
    +class AppendingParquetOutputFormat(offset: Int) extends parquet.hadoop.ParquetOutputFormat[Row] {
    +  // override to accept existing directories as valid output directory
    +  override def checkOutputSpecs(job: JobContext): Unit = {}
    +
    +  // override to choose output filename so not overwrite existing ones
    +  override def getDefaultWorkFile(context: TaskAttemptContext, extension: String): Path = {
    +    val taskId: TaskID = context.getTaskAttemptID.getTaskID
    +    val partition: Int = taskId.getId
    +    val filename = s"part-r-${partition + offset}.parquet"
    +    val committer: FileOutputCommitter =
    +      getOutputCommitter(context).asInstanceOf[FileOutputCommitter]
    +    new Path(committer.getWorkPath, filename)
    +  }
    +}
    +
    +object FileSystemHelper {
    --- End diff --
    
    OK, yes, good point.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38989837
  
    Build is starting -or- tests failed to complete.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13570/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38989505
  
    Merged build started. Build is starting -or- tests failed to complete.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/195#discussion_r11002083
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala ---
    @@ -72,16 +71,56 @@ case class ParquetRelation(val tableName: String, val path: String) extends Base
       /** Output **/
       override val output = attributes
     
    +  /** Name (dummy value) */
    +  // TODO: rethink whether ParquetRelation should inherit from BaseRelation
    +  // (currently required to re-use HiveStrategies but should be removed)
    +  override def tableName = "parquet"
    +
       // Parquet files have no concepts of keys, therefore no Partitioner
       // Note: we could allow Block level access; needs to be thought through
       override def isPartitioned = false
     }
     
    -object ParquetRelation {
    +private[sql] object ParquetRelation {
    +  // change this to enable/disable Parquet logging
    +  var DEBUG: Boolean = false
    +
    +  def setParquetLogLevel() {
    +    // Note: Parquet does not use forwarding to parent loggers which
    +    // is required for the JUL-SLF4J bridge to work. Also there is
    +    // a default logger that appends to Console which needs to be
    +    // reset.
    +    import org.slf4j.bridge.SLF4JBridgeHandler
    +    import java.util.logging.Logger
    +    import java.util.logging.LogManager
    +
    +    val loggerNames = Seq(
    +      "parquet.hadoop.ColumnChunkPageWriteStore",
    +      "parquet.hadoop.InternalParquetRecordWriter",
    +      "parquet.hadoop.ParquetRecordReader",
    +      "parquet.hadoop.ParquetInputFormat",
    +      "parquet.hadoop.ParquetOutputFormat",
    +      "parquet.hadoop.ParquetFileReader",
    +      "parquet.hadoop.InternalParquetRecordReader",
    +      "parquet.hadoop.codec.CodecConfig")
    +    LogManager.getLogManager.reset()
    +    SLF4JBridgeHandler.install()
    +    for(name <- loggerNames) {
    +      val logger = Logger.getLogger(name)
    +      logger.setParent(Logger.getGlobal)
    +      logger.setUseParentHandlers(true)
    +    }
    +  }
     
       // The element type for the RDDs that this relation maps to.
       type RowType = org.apache.spark.sql.catalyst.expressions.GenericMutableRow
     
    +  // The compression type
    +  type CompressionType = parquet.hadoop.metadata.CompressionCodecName
    +
    +  // The default compression
    +  val defaultCompression = CompressionCodecName.GZIP
    --- End diff --
    
    Not for this PR, but this should probably be an argument of WriteToFile or ParquetRelation or something.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38935866
  
    Merged build started. Build is starting -or- tests failed to complete.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-39186014
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13627/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/195#discussion_r11053247
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala ---
    @@ -40,7 +40,7 @@ import java.util.Date
      * Parquet table scan operator. Imports the file that backs the given
      * [[ParquetRelation]] as a RDD[Row].
      */
    -case class ParquetTableScan(
    +private[sql] case class ParquetTableScan(
         @transient output: Seq[Attribute],
         @transient relation: ParquetRelation,
         @transient columnPruningPred: Option[Expression])(
    --- End diff --
    
    I believe these parameters are only used when actually building the RDD, which occurs locally in the execute() method.  Once the RDD is built they should no longer be needed.  Are you seeing exceptions when running queries?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38989502
  
     Merged build triggered. Build is starting -or- tests failed to complete.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AndreSchumacher <gi...@git.apache.org>.

Github user AndreSchumacher commented on a diff in the pull request:

    https://github.com/apache/spark/pull/195#discussion_r11072079
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala ---
    @@ -72,16 +71,56 @@ case class ParquetRelation(val tableName: String, val path: String) extends Base
       /** Output **/
       override val output = attributes
     
    +  /** Name (dummy value) */
    +  // TODO: rethink whether ParquetRelation should inherit from BaseRelation
    +  // (currently required to re-use HiveStrategies but should be removed)
    +  override def tableName = "parquet"
    +
       // Parquet files have no concepts of keys, therefore no Partitioner
       // Note: we could allow Block level access; needs to be thought through
       override def isPartitioned = false
     }
     
    -object ParquetRelation {
    +private[sql] object ParquetRelation {
    +  // change this to enable/disable Parquet logging
    +  var DEBUG: Boolean = false
    +
    +  def setParquetLogLevel() {
    +    // Note: Parquet does not use forwarding to parent loggers which
    +    // is required for the JUL-SLF4J bridge to work. Also there is
    +    // a default logger that appends to Console which needs to be
    +    // reset.
    +    import org.slf4j.bridge.SLF4JBridgeHandler
    +    import java.util.logging.Logger
    +    import java.util.logging.LogManager
    +
    +    val loggerNames = Seq(
    +      "parquet.hadoop.ColumnChunkPageWriteStore",
    +      "parquet.hadoop.InternalParquetRecordWriter",
    +      "parquet.hadoop.ParquetRecordReader",
    +      "parquet.hadoop.ParquetInputFormat",
    +      "parquet.hadoop.ParquetOutputFormat",
    +      "parquet.hadoop.ParquetFileReader",
    +      "parquet.hadoop.InternalParquetRecordReader",
    +      "parquet.hadoop.codec.CodecConfig")
    +    LogManager.getLogManager.reset()
    +    SLF4JBridgeHandler.install()
    +    for(name <- loggerNames) {
    +      val logger = Logger.getLogger(name)
    +      logger.setParent(Logger.getGlobal)
    +      logger.setUseParentHandlers(true)
    +    }
    +  }
     
       // The element type for the RDDs that this relation maps to.
       type RowType = org.apache.spark.sql.catalyst.expressions.GenericMutableRow
     
    +  // The compression type
    +  type CompressionType = parquet.hadoop.metadata.CompressionCodecName
    +
    +  // The default compression
    +  val defaultCompression = CompressionCodecName.GZIP
    --- End diff --
    
    Good point, I was thinking the same but did not want to introduce lots of Parquet dependencies on the logical plan side. Maybe there should be some Spark system level compression types.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38696186
  
    Build finished.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/195#discussion_r10859923
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala ---
    @@ -210,3 +220,47 @@ case class InsertIntoParquetTable(
       }
     }
     
    +// TODO: this will be able to append to directories it created itself, not necessarily
    +// to imported ones
    +class AppendingParquetOutputFormat(offset: Int) extends parquet.hadoop.ParquetOutputFormat[Row] {
    +  // override to accept existing directories as valid output directory
    +  override def checkOutputSpecs(job: JobContext): Unit = {}
    +
    +  // override to choose output filename so not overwrite existing ones
    +  override def getDefaultWorkFile(context: TaskAttemptContext, extension: String): Path = {
    +    val taskId: TaskID = context.getTaskAttemptID.getTaskID
    +    val partition: Int = taskId.getId
    +    val filename = s"part-r-${partition + offset}.parquet"
    +    val committer: FileOutputCommitter =
    +      getOutputCommitter(context).asInstanceOf[FileOutputCommitter]
    +    new Path(committer.getWorkPath, filename)
    +  }
    +}
    +
    +object FileSystemHelper {
    --- End diff --
    
    good call, I think everything other than the physical/logical operator should be private[sql]


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-39450885
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38298439
  
    Merged build finished.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AndreSchumacher <gi...@git.apache.org>.

Github user AndreSchumacher commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38288151
  
    @marmbrus : would be great if you could have a look at the changes. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38291880
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/195#discussion_r10856974
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala ---
    @@ -72,16 +73,43 @@ case class ParquetRelation(val tableName: String, val path: String) extends Base
       /** Output **/
       override val output = attributes
     
    +  /** Name (dummy value) */
    +  // TODO: rethink whether ParquetRelation should inherit from BaseRelation
    +  // (currently required to re-use HiveStrategies but should be removed)
    +  override def tableName = "parquet"
    +
       // Parquet files have no concepts of keys, therefore no Partitioner
       // Note: we could allow Block level access; needs to be thought through
       override def isPartitioned = false
     }
     
     object ParquetRelation {
    +  // change this to enable/disable Parquet logging
    +  var DEBUG: Boolean = false
    +
    +  // TODO: consider redirecting Parquet's log output to log4j logger and
    +  // using config file for log settings
    +  def setParquetLogLevel() {
    +    val level: Level = if (DEBUG) Level.FINEST else Level.WARNING
    --- End diff --
    
    Another clarification here from Patrick that I didn't understand before: there should already be something that forwards java util logging to slf4j.  Then in testing we use log4j with does the actual logging for slf4j.  So, in theory, we should be able to control parquet logging from the file in test/resources also.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AndreSchumacher <gi...@git.apache.org>.

Github user AndreSchumacher commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-39102778
  
    @marmbrus I finally found out why the one test fails. It seems that the SchemaRDD's that are registered inside SimpleCatalog are not removed after the tests (the Hive catalog is automatically cleaned when Hive context is reset). I guess the same should happen here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38298441
  
    One or more automated tests failed
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13321/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38324790
  
    Hey Andre, I haven't had a chance to look at this closely, but I do really like the idea of this functionality. Some thoughts:
     - Right now it looks like this is exposing catalyst `DataType` to end users.  While this maybe okay, the original intention was to keep the catalyst API purely internal to leave things flexible in the future.  Therefore, we may want to create wrapper classes in the public API that don't expose so many details (for example dataTypes have ClassTags that talk about the underlying JVM datatype that we use during execution).  Another option (maybe in addition to a programatic api) would be use use HiveQL strings to describe schema (there is already a converter).  At the very least we need to mark this method experimental though.  A task for me right now is to create a Java API for Spark SQL, and part of this will be deciding how to expose schema to end users.
     - I looked the test that is failing.  I'm confused how `InsertIntoParquetTable` is sneaking into the Hive test cases.  Is something getting mixed up during analysis?  Are we just failing to reset some state in the reset function of TestHive?
     - The above aside, I think the issue here is the second parameter list to `InsertIntoParquetTable`.  It is correct to keep the spark context in the second list so it is not consider part of the case class as far as equality is concerned, but arguments in other parameter lists are not included in the 'magic' tree copying and so need to be added manually `override def otherCopyArgs = sc :: Nil`.
     - You may already handle these cases correctly, but a few things that came to mind.
      - What happens to existing RDDs when a table is appended to?  Is there an a case when recomputing will rescan the directory looking for files and cause the RDD to change?  Will this always happen?
      - What happens if multiple users append to a parquet "table", stored on HDFS for example, at the same time?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-39314778
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13679/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AndreSchumacher <gi...@git.apache.org>.

Github user AndreSchumacher commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-39457091
  
    @marmbrus Thanks again for the comments and changes. I shuffled the imports and removed the additions to SQLContext. Also moved the insert examples to the tests rather (they are not well tested so maybe a bit too experimental for the "official" list of examples). Note that I had to do some small changes to the SQLParser to parse the INSERT. It's again rebased and squashed, close to being ready I think.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38989047
  
     Merged build triggered. Build is starting -or- tests failed to complete.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-39185786
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-39402465
  
    Hey @AndreSchumacher,
    
    This is a pretty cool feature!
    
    I think this is close to merging once the comments are addressed.  I also made a PR against your PR to fix the testcase problem and roll back the catalog changes.  We should probably also update the documentation for Parquet to describe this new functionality and any caveats (i.e., is it going to corrupt data if two people try to write at once?).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AndreSchumacher <gi...@git.apache.org>.

Github user AndreSchumacher commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38989484
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AndreSchumacher <gi...@git.apache.org>.

Github user AndreSchumacher commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38695808
  
    @marmbrus @pwendell  Thanks a lot for the detailed comments. Here are some thoughts regarding the points you raised above:
    * Exposing ``DataType`` to users is probably not a good idea, at least not the interval JVM type. Some kind of matching against data types should be supported (IMHO). Maybe more thought could go into which part of ``DataType`` should be exposed? Also it seems that currently there is no notion of a primitive type as in Parquet (the closest thing to that is a ``NativeType``; maybe that could be renamed or extended?). I will hold off with adding schema strings until the other APIs have stabilized somewhat. OK?
    * I have no idea why that one test failed (and can't find the error output anymore in Jenkins). Locally I cannot reproduce this failure (but other not in Jenkins show up which I believe are unrelated so I don't trust my local installation). Once this build has completed I try to find the root cause.
    * I added otherCopyArgs to the two operators; please let me know if I missed anything else.
    * About the INSERT INTO TABLE operation without overwrite:
        - I tried to find out how Hive deals with this situation. Unless the Metastore somehow locks access to tables I don't see many checks on HDFS level (but I may have just missed them). Data is written to temporary files and then moved to its final destination, possibly renaming the destination files if they alreadt exist. We could try to implement something similar, which could mitigate (but not avoid) the concurrent write access situation.
        - Also, data could be saved in directories only accessible to the user that created them. But that would be kind of a strong limitation.
        - Currently the child RDD is returned unchanged, which may be different from the RDD that can be built from the Parquet files in the destination directory. This could be changed to always doing a scan after an INSERT INTO TABLE so that the actual RDD is returned. Maybe that would be less confusing to users?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38291879
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-39221977
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13632/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38936414
  
     Merged build triggered. Build is starting -or- tests failed to complete.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/195#discussion_r11094163
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala ---
    @@ -40,7 +40,7 @@ import java.util.Date
      * Parquet table scan operator. Imports the file that backs the given
      * [[ParquetRelation]] as a RDD[Row].
      */
    -case class ParquetTableScan(
    +private[sql] case class ParquetTableScan(
         @transient output: Seq[Attribute],
         @transient relation: ParquetRelation,
         @transient columnPruningPred: Option[Expression])(
    --- End diff --
    
    @jerryshao,  thanks a lot for pointing this out!  I have reproduced locally.  JIRA here:
    https://spark-project.atlassian.net/browse/SPARK-1353
    
    @AndreSchumacher, can you take a look?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38695929
  
    Build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-39304956
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38989050
  
    Merged build started. Build is starting -or- tests failed to complete.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-39457183
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13724/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-39185768
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---