Posted to reviews@spark.apache.org by saucam <gi...@git.apache.org> on 2015/02/09 10:09:01 UTC

[GitHub] spark pull request: SPARK-5684: Pass in partition name along with ...

GitHub user saucam opened a pull request:

    https://github.com/apache/spark/pull/4469

    SPARK-5684: Pass in partition name along with location information, as the location can be different (that is may not contain the partition keys)

    When the partition keys are parsed from the partition locations in parquetRelations, it is assumed that the location path string always contains the partition keys. That is not true: a different location can be specified when adding a partition to the table, and reading from such a partition then fails with a "key not found" exception. To reproduce:
    
    Create a partitioned parquet table:
    create table test_table (dummy string) partitioned by (timestamp bigint) stored as parquet;
    Add a partition to the table and specify a different location:
    alter table test_table add partition (timestamp=9) location '/data/pth/different'
    Run a simple select * query.
    We get an exception:
    15/02/09 08:27:25 ERROR thriftserver.SparkSQLDriver: Failed in [select * from db4_mi2mi_binsrc1_default limit 5]
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21.0 failed 1 times, most recent failure: Lost task 0.0 in stage 21.0 (TID 21, localhost): java.util.NoSuchElementException: key not found: timestamp
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:58)
    at scala.collection.MapLike$class.apply(MapLike.scala:141)
    at scala.collection.AbstractMap.apply(Map.scala:58)
    at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$6.apply(ParquetTableOperations.scala:141)
    at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$6.apply(ParquetTableOperations.scala:141)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
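
The failure mode described above can be reproduced outside Spark with a minimal sketch. It assumes that partition values are recovered purely by scanning the location for `key=value` path segments. The names `PartitionPathSketch` and `partitionValuesFromPath`, and the `/warehouse/test_table/...` path, are hypothetical and chosen for illustration; this is not the code in Spark's Parquet support.

```scala
import scala.util.Try

// Illustrative sketch only: the object and method names are hypothetical and
// this is not the code in Spark's Parquet support.
object PartitionPathSketch {
  // Collect every "key=value" path segment into a map.
  def partitionValuesFromPath(path: String): Map[String, String] =
    path.split("/").toSeq.flatMap { segment =>
      segment.split("=", 2) match {
        case Array(key, value) if key.nonEmpty => Some(key -> value)
        case _                                 => None
      }
    }.toMap

  def main(args: Array[String]): Unit = {
    // Conventional location (assumed layout): the key is part of the path.
    val conventional = partitionValuesFromPath("/warehouse/test_table/timestamp=9")
    println(conventional("timestamp"))

    // Custom location from "alter table ... location": there is no key=value
    // segment, so Map.apply throws NoSuchElementException for "timestamp".
    val custom = partitionValuesFromPath("/data/pth/different")
    println(Try(custom("timestamp")))
  }
}
```

Under this assumption the custom location yields an empty map, which is why passing the partition name alongside the location avoids the lookup failure.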


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/saucam/spark partition_bug

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4469.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4469
    
----
commit 5aeeb6db8a3651b7b13d641ec0ed0dea21025438
Author: Yash Datta <ya...@guavus.com>
Date:   2015-02-09T08:53:40Z

    SPARK-5684: Pass in partition name along with location information, as the location can be different (that is may not contain the partition keys)

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5684][SQL]: Pass in partition name alon...

Posted by saucam <gi...@git.apache.org>.
Github user saucam commented on the pull request:

    https://github.com/apache/spark/pull/4469#issuecomment-89613886
  
    Hi @marmbrus, this is a pretty common scenario in production: the data is generated in some directory, and partitions are later added to tables using alter table <tablename> add partition (<col>=value) location <directory where the data was generated (a path that does not contain partition key=value)>
    In the old parquet path in v1.2.1, this is not possible.
    It is doable in the new parquet path in Spark 1.3, though.




[GitHub] spark pull request: [SPARK-5684][SQL]: Pass in partition name alon...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4469#issuecomment-75280063
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27778/
    Test PASSed.




[GitHub] spark pull request: [SPARK-5684][SQL]: Pass in partition name alon...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4469#issuecomment-73783234
  
      [Test build #27224 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27224/consoleFull) for   PR 4469 at commit [`30fdcec`](https://github.com/apache/spark/commit/30fdcecf34bedb6cdeaf296c34673d9f0e94ad3c).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5684][SQL]: Pass in partition name alon...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4469#issuecomment-73783244
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27224/
    Test PASSed.




[GitHub] spark pull request: [SPARK-5684][SQL]: Pass in partition name alon...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/4469




[GitHub] spark pull request: SPARK-5684: Pass in partition name along with ...

Posted by saucam <gi...@git.apache.org>.
Github user saucam commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4469#discussion_r24316073
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -310,7 +310,10 @@ class SQLContext(@transient val sparkContext: SparkContext)
       @scala.annotation.varargs
       def parquetFile(path: String, paths: String*): DataFrame =
         if (conf.parquetUseDataSourceApi) {
    -      baseRelationToDataFrame(parquet.ParquetRelation2(path +: paths, Map.empty)(this))
    +      // not fixed for ParquetRelation2 !
    +      val sPaths = path +: paths
    +      baseRelationToDataFrame(parquet.ParquetRelation2(sPaths.map(p => 
    +        p.split("->").head), Map.empty)(this))
    --- End diff --
    
    Please suggest how to proceed in the case of ParquetRelation2.
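
For readers following the diff, here is a minimal sketch of what the `split("->").head` expression does. It assumes each entry is encoded as `<location>-><partition name>`; the thread only shows that the location is the part before `->`, so the shape of the suffix and the names `PathEntrySketch` and `locationOf` are hypothetical.

```scala
// Illustrative sketch only: the entry format after "->" is an assumption,
// and PathEntrySketch / locationOf are hypothetical names.
object PathEntrySketch {
  // Keep only the text before the first "->"; an entry without the
  // separator is returned unchanged.
  def locationOf(entry: String): String = entry.split("->").head

  def main(args: Array[String]): Unit = {
    val entries = Seq("/data/pth/different->timestamp=9", "/data/plain")
    entries.map(locationOf).foreach(println)
  }
}
```

Stripping the suffix this way lets a relation that only understands plain paths keep working, at the cost of discarding the partition name for that code path.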




[GitHub] spark pull request: SPARK-5684: Pass in partition name along with ...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/4469#issuecomment-73557240
  
    Mind tagging this with [SQL] so it can get properly sorted?




[GitHub] spark pull request: [SPARK-5684][SQL]: Pass in partition name alon...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/4469#issuecomment-89091714
  
    Hey @saucam, I'm pretty hesitant to make big changes to branch-1.2 unless a lot of users are reporting a problem. Do the problems you describe still exist in branch-1.3, or should we close this issue?




[GitHub] spark pull request: [SPARK-5684][SQL]: Pass in partition name alon...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4469#issuecomment-73771810
  
      [Test build #27224 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27224/consoleFull) for   PR 4469 at commit [`30fdcec`](https://github.com/apache/spark/commit/30fdcecf34bedb6cdeaf296c34673d9f0e94ad3c).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5684][SQL]: Pass in partition name alon...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/4469#issuecomment-73773136
  
    Hey @saucam, partitioning in the old Parquet support is quite limited (it only handles one partition column, whose type must be INT). PR #4308 and upcoming follow-up PRs aim to provide full support for multi-level partitioning and schema merging. Also, Parquet tables converted from Hive metastore tables will retain the schema and location information inherited from the metastore. We plan to deprecate the old Parquet implementation in favor of the new Parquet data source in 1.3, and would like to remove the old one once the new implementation has proved to be stable enough.




[GitHub] spark pull request: [SPARK-5684][SQL]: Pass in partition name alon...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/4469#issuecomment-73771721
  
    ok to test




[GitHub] spark pull request: [SPARK-5684][SQL]: Pass in partition name alon...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4469#issuecomment-75280057
  
      [Test build #27778 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27778/consoleFull) for   PR 4469 at commit [`2dd9dbb`](https://github.com/apache/spark/commit/2dd9dbb0f8f469ad52fdc15bbfa9a6cedda64445).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5684][SQL]: Pass in partition name alon...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4469#issuecomment-89633239
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29712/
    Test PASSed.




[GitHub] spark pull request: [SPARK-5684][SQL]: Pass in partition name alon...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4469#issuecomment-75263263
  
      [Test build #27778 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27778/consoleFull) for   PR 4469 at commit [`2dd9dbb`](https://github.com/apache/spark/commit/2dd9dbb0f8f469ad52fdc15bbfa9a6cedda64445).
     * This patch merges cleanly.




[GitHub] spark pull request: SPARK-5684: Pass in partition name along with ...

Posted by saucam <gi...@git.apache.org>.
Github user saucam commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4469#discussion_r24315891
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/dataTypes.scala ---
    @@ -362,7 +362,7 @@ case object BooleanType extends NativeType with PrimitiveType {
      * @group dataType
      */
     @DeveloperApi
    -case object TimestampType extends NativeType {
    +case object TimestampType extends NativeType with PrimitiveType {
    --- End diff --
    
    This change is needed because, when the table is partitioned on a timestamp-type column, the parquet iterator returns a GenericRow, due to this check in ParquetTypes.scala:
    
    def isPrimitiveType(ctype: DataType): Boolean =
        classOf[PrimitiveType] isAssignableFrom ctype.getClass
    
    and in ParquetConverter.scala we have:
    
     protected[parquet] def createRootConverter(
          parquetSchema: MessageType,
          attributes: Seq[Attribute]): CatalystConverter = {
        // For non-nested types we use the optimized Row converter
        if (attributes.forall(a => ParquetTypesConverter.isPrimitiveType(a.dataType))) {
          new CatalystPrimitiveRowConverter(attributes.toArray)
        } else {
          new CatalystGroupConverter(attributes.toArray)
        }
      }
    
    which fails later here:
    
       new Iterator[Row] {
              def hasNext = iter.hasNext
              def next() = {
                val row = iter.next()._2.asInstanceOf[SpecificMutableRow]
    
    throwing a ClassCastException: GenericRow cannot be cast to SpecificMutableRow.
    
    Am I missing something here?
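
The effect of mixing in `PrimitiveType` can be seen in a small self-contained model. The types below are simplified stand-ins, not the real Catalyst classes; `PrimitiveCheckSketch`, `TimestampTypeBefore`, `TimestampTypeAfter`, and `usesPrimitiveRowConverter` are hypothetical names. Only the `isAssignableFrom` check and the `forall` condition follow the snippets quoted above.

```scala
// Simplified stand-ins for the Catalyst type hierarchy; these are not the
// real Spark classes, only enough structure to exercise the check.
object PrimitiveCheckSketch {
  sealed trait DataType
  trait PrimitiveType extends DataType

  case object IntegerType extends PrimitiveType
  // Before the change: no PrimitiveType marker.
  case object TimestampTypeBefore extends DataType
  // After the change: the marker trait is mixed in.
  case object TimestampTypeAfter extends DataType with PrimitiveType

  // Same reflective test as the snippet quoted from ParquetTypes.scala.
  def isPrimitiveType(ctype: DataType): Boolean =
    classOf[PrimitiveType].isAssignableFrom(ctype.getClass)

  // Mirrors the condition in createRootConverter: the optimized row
  // converter is chosen only when every column type is primitive.
  def usesPrimitiveRowConverter(types: Seq[DataType]): Boolean =
    types.forall(isPrimitiveType)

  def main(args: Array[String]): Unit = {
    println(usesPrimitiveRowConverter(Seq(IntegerType, TimestampTypeBefore)))
    println(usesPrimitiveRowConverter(Seq(IntegerType, TimestampTypeAfter)))
  }
}
```

In this model a single column without the marker trait makes the condition false for the whole schema, and adding the marker makes it true again, which matches the choice between the group converter and the primitive row converter described in the comment.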





[GitHub] spark pull request: [SPARK-5684][SQL]: Pass in partition name alon...

Posted by saucam <gi...@git.apache.org>.
Github user saucam commented on the pull request:

    https://github.com/apache/spark/pull/4469#issuecomment-77828501
  
    Hi @liancheng, any update on this one? I think it will be useful for people using Spark 1.2.1, since the old parquet path might suit their needs better in that version.





[GitHub] spark pull request: SPARK-5684: Pass in partition name along with ...

Posted by saucam <gi...@git.apache.org>.
Github user saucam commented on the pull request:

    https://github.com/apache/spark/pull/4469#issuecomment-73478985
  
    @liancheng please suggest ...




[GitHub] spark pull request: [SPARK-5684][SQL]: Pass in partition name alon...

Posted by saucam <gi...@git.apache.org>.
Github user saucam commented on the pull request:

    https://github.com/apache/spark/pull/4469#issuecomment-73839872
  
    Hi @liancheng, thanks for the comments. We are using spark-1.2.1, where the old parquet support is used. Can this be merged so that we have proper partitioning with different locations as well? I tried partitioning on 2 columns and it worked fine (with this patch also applied, for specifying a different location).
    
    On a different note, when I create a parquet table with a smallint type in Spark, the schema used in parquet shows 'int32 type'. Is that by design in Spark, or is it a parquet limitation?




[GitHub] spark pull request: SPARK-5684: Pass in partition name along with ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4469#issuecomment-73477340
  
    Can one of the admins verify this patch?

