Posted to reviews@spark.apache.org by tejasapatil <gi...@git.apache.org> on 2017/08/20 00:06:04 UTC

[GitHub] spark pull request #19001: [SPARK-19256][SQL] Hive bucketing support

GitHub user tejasapatil opened a pull request:

    https://github.com/apache/spark/pull/19001

    [SPARK-19256][SQL] Hive bucketing support

    ## What changes were proposed in this pull request?
    
    This PR implements both read and write side changes for supporting hive bucketing in Spark. I had initially created a PR for just the write side changes (https://github.com/apache/spark/pull/18954) for simplicity. If reviewers want to review reader and writer side changes separately, I am happy to wait for the writer side PR to get merged and then send a new PR for reader side changes.
    
    ### Semantics for read:
    - `outputPartitioning` while scanning a Hive table would be the set of bucketing columns (whether the table is partitioned or not, and whether you are reading a single partition or multiple partitions).
    - `outputOrdering` would be the sort columns (more precisely, the prefix subset of the `sort columns` being read from the table).
    - When reading multiple Hive partitions of the table, there would be multiple files per bucket, so the rows of a bucket are no longer globally sorted. Thus we would have to ignore the sort information.
    - See the documentation in `HiveTableScanExec`, where `outputPartitioning` and `outputOrdering` are populated, for the nitty-gritty details.
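
The read-side rules above can be sketched roughly as follows. This is a minimal, Spark-free model and not the code in `HiveTableScanExec`: `BucketSpec`, `ScanProperties`, `HiveScanSemantics`, and `hivePartitionsRead` are hypothetical names introduced only for illustration, and a non-partitioned table is assumed to count as a single partition.

```scala
// Hypothetical, Spark-free model of the read-side rules listed above.
// None of these names come from the PR; they only illustrate the rules.
final case class BucketSpec(bucketColumns: Seq[String], sortColumns: Seq[String])
final case class ScanProperties(outputPartitioning: Seq[String], outputOrdering: Seq[String])

object HiveScanSemantics {
  // Assumption: `hivePartitionsRead` is 1 for a non-partitioned table.
  def describe(
      spec: BucketSpec,
      columnsRead: Set[String],
      hivePartitionsRead: Int): ScanProperties = {
    val ordering =
      if (hivePartitionsRead > 1) {
        // Multiple files per bucket: the sort information is ignored.
        Seq.empty[String]
      } else {
        // Keep only the leading sort columns that are actually being read.
        spec.sortColumns.takeWhile(c => columnsRead.contains(c))
      }
    // Partitioning is always reported on the bucketing columns.
    ScanProperties(spec.bucketColumns, ordering)
  }
}
```

In this model, a scan that reads only the first sort column of a single partition still reports an ordering on that column, while a scan over several Hive partitions reports no ordering at all.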
    
    ### Semantics for write:
    - If the Hive table is bucketed, then the INSERT node expects the child distribution to be based on the hash of the bucket columns. Otherwise the required distribution is empty. (For comparison, with Spark native bucketing the required distribution is not enforced regardless of whether the table is bucketed; this saves the shuffle in comparison with Hive.)
    - Sort ordering for the INSERT node over a Hive bucketed table is determined as follows:
    
    | Insert type   | Normal table | Bucketed table |
    | ------------- | ------------- | ------------- |
    | non-partitioned insert  | Nil | sort columns |
    | static partition   | Nil | sort columns |
    | dynamic partitions   | partition columns | (partition columns + bucketId + sort columns) |
    
    Just to compare how sort ordering is expressed for Spark native bucketing:
    
    | Table type   | Normal table | Bucketed table |
    | ------------- | ------------- | ------------- |
    |  sort ordering | partition columns | (partition columns + bucketId + sort columns) |
    
    Why is there a difference? With Hive, since bucketed insertions need a shuffle, the sort ordering can be relaxed for both the non-partitioned and static partition cases. Every RDD partition gets rows corresponding to a single bucket, so those can be written to the corresponding output file after the sort. In the case of dynamic partitions, the rows need to be routed to the appropriate partition, which makes it similar to Spark's constraints.
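
The two tables above can be read as a small decision function. The following is a minimal sketch under stated assumptions, not the PR's implementation: it models columns as plain strings, and `InsertKind`, `TableLayout`, and `RequiredOrdering` are hypothetical names used only for illustration.

```scala
// Hypothetical, Spark-free model of the sort-ordering tables above.
sealed trait InsertKind
case object NonPartitioned extends InsertKind
case object StaticPartition extends InsertKind
case object DynamicPartitions extends InsertKind

final case class TableLayout(
    partitionColumns: Seq[String],
    sortColumns: Seq[String],
    bucketed: Boolean)

object RequiredOrdering {
  // Placeholder for the bucket id expression.
  val BucketId = "bucketId"

  // Ordering required by an INSERT into a Hive table (first table).
  def hive(kind: InsertKind, t: TableLayout): Seq[String] = (kind, t.bucketed) match {
    case (DynamicPartitions, true)  => t.partitionColumns ++ Seq(BucketId) ++ t.sortColumns
    case (DynamicPartitions, false) => t.partitionColumns
    case (_, true)                  => t.sortColumns
    case (_, false)                 => Nil
  }

  // Ordering expressed for Spark native bucketing (second table).
  def sparkNative(t: TableLayout): Seq[String] =
    if (t.bucketed) t.partitionColumns ++ Seq(BucketId) ++ t.sortColumns
    else t.partitionColumns
}
```

Only the dynamic-partition case of the Hive rule coincides with the Spark native rule, which matches the explanation above.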
    
    - Only `Overwrite` mode is allowed for Hive bucketed tables, as any other mode would break the bucketing guarantees of the table. This differs from how Spark bucketing works.
    - With this PR, if no files are created for empty buckets, the query will fail. Creation of empty files will be supported in a coming iteration. This differs from Spark bucketing, which does NOT need files for empty buckets.
    
    ### Summary of changes done:
    - `ClusteredDistribution` and `HashPartitioning` are modified to store the hashing function used.
    - `RunnableCommand`s can now express the required distribution and ordering. This is used by `ExecutedCommandExec`, which runs these commands.
      - The good thing about this is that I could remove the logic for enforcing sort ordering inside `FileFormatWriter`, which felt out of place. Ideally, this kind of addition of physical nodes should be done within the planner, which is what happens with this PR.
    - `InsertIntoHiveTable` enforces both distribution and sort ordering
    - `InsertIntoHadoopFsRelationCommand` enforces sort ordering ONLY (and not the distribution)
    - Fixed a bug due to which any alter command on a bucketed table (e.g. updating stats) would wipe out the bucketing spec from the metastore. This made insertion into a bucketed table a non-idempotent operation.
    - `HiveTableScanExec` populates `outputPartitioning` and `outputOrdering` based on table's metadata, configs and the query
    - `HadoopTableReader` uses `BucketizedSparkInputFormat` for bucketed reads
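
The idea of letting a command declare its required ordering, and leaving it to the planner to add a sort only when needed, can be sketched as below. This is a simplified, Spark-free model and not the actual `RunnableCommand` or `EnsureRequirements` code: `Plan`, `Scan`, `Sort`, `WriteCommand`, and `Planner` are hypothetical names, and orderings are modelled as column-name lists.

```scala
// Hypothetical, Spark-free model: plans expose an output ordering,
// and commands declare the ordering they need from their child.
sealed trait Plan { def outputOrdering: Seq[String] }
final case class Scan(outputOrdering: Seq[String]) extends Plan
final case class Sort(order: Seq[String], child: Plan) extends Plan {
  def outputOrdering: Seq[String] = order
}

trait WriteCommand {
  // Empty by default: the command does not care about input ordering.
  def requiredOrdering: Seq[String] = Nil
}

object Planner {
  // The requirement is met when it is a prefix of the child's ordering.
  def satisfies(required: Seq[String], actual: Seq[String]): Boolean =
    actual.startsWith(required)

  // Add a sort node only when the child does not already satisfy the command.
  def ensureOrdering(cmd: WriteCommand, child: Plan): Plan =
    if (satisfies(cmd.requiredOrdering, child.outputOrdering)) child
    else Sort(cmd.requiredOrdering, child)
}
```

With this shape, the writer itself no longer needs to compare orderings or wrap its input in a sort; a command that leaves `requiredOrdering` empty is passed through unchanged.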
    
    ## How was this patch tested?
    
    - Added new unit tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tejasapatil/spark bucket_read

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19001.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19001
    
----
commit a4a7ac941f9d53a5aeff1e9b9a19dbf611e54ac2
Author: Tejas Patil <te...@fb.com>
Date:   2017-08-03T22:57:54Z

    bucketed writer implementation

commit 5aee02b1754eaf7535bcd121eee4d31cb61e65d5
Author: Tejas Patil <te...@fb.com>
Date:   2017-08-15T23:27:06Z

    Move `requiredOrdering` into RunnableCommand instead of `FileFormatWriter`

commit 9b8f0842eb5b61e6ae1a9fc76aebe9ff88c2a39b
Author: Tejas Patil <te...@fb.com>
Date:   2017-08-16T23:54:48Z

    print only the files names in error message instead of entire FileStatus object

commit 02d87119f60db4db3e141b2f72365b09b45d9647
Author: Tejas Patil <te...@fb.com>
Date:   2017-08-16T18:33:13Z

    Reader side changes for hive bucketing

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80879/
    Test FAILed.


---

[GitHub] spark pull request #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by chrysan <gi...@git.apache.org>.

Github user chrysan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19001#discussion_r175350408
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala ---
    @@ -156,40 +144,14 @@ object FileFormatWriter extends Logging {
           statsTrackers = statsTrackers
         )
     
    -    // We should first sort by partition columns, then bucket id, and finally sorting columns.
    -    val requiredOrdering = partitionColumns ++ bucketIdExpression ++ sortColumns
    -    // the sort order doesn't matter
    -    val actualOrdering = plan.outputOrdering.map(_.child)
    -    val orderingMatched = if (requiredOrdering.length > actualOrdering.length) {
    -      false
    -    } else {
    -      requiredOrdering.zip(actualOrdering).forall {
    -        case (requiredOrder, childOutputOrder) =>
    -          requiredOrder.semanticEquals(childOutputOrder)
    -      }
    -    }
    -
         SQLExecution.checkSQLExecutionId(sparkSession)
     
         // This call shouldn't be put into the `try` block below because it only initializes and
         // prepares the job, any exception thrown from here shouldn't cause abortJob() to be called.
         committer.setupJob(job)
     
         try {
    -      val rdd = if (orderingMatched) {
    -        plan.execute()
    -      } else {
    -        // SPARK-21165: the `requiredOrdering` is based on the attributes from analyzed plan, and
    -        // the physical plan may have different attribute ids due to optimizer removing some
    -        // aliases. Here we bind the expression ahead to avoid potential attribute ids mismatch.
    -        val orderingExpr = requiredOrdering
    -          .map(SortOrder(_, Ascending))
    -          .map(BindReferences.bindReference(_, outputSpec.outputColumns))
    -        SortExec(
    --- End diff --
    
    Removing SortExec here and adding it in the EnsureRequirements strategy will have an impact on many other DataWritingCommands that depend on FileFormatWriter, such as CreateDataSourceTableAsSelectCommand. To fix this, code changes are needed in those DataWritingCommand implementations to export requiredDistribution and requiredOrdering.


---


[GitHub] spark pull request #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19001#discussion_r183950505
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/commands.scala ---
    @@ -43,7 +44,13 @@ trait RunnableCommand extends Command {
       // `ExecutedCommand` during query planning.
       lazy val metrics: Map[String, SQLMetric] = Map.empty
     
    -  def run(sparkSession: SparkSession): Seq[Row]
    +  def run(sparkSession: SparkSession, children: Seq[SparkPlan]): Seq[Row] = {
    --- End diff --
    
    `ExecutedCommandExec` doesn't call it.


---


[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    Retest this please.


---

[GitHub] spark pull request #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19001#discussion_r134103839
  
    --- Diff: sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/BucketizedSparkRecordReader.java ---
    @@ -0,0 +1,147 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.hadoop.hive.ql.io;
    +
    +import org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil;
    --- End diff --
    
    Hi, @tejasapatil 
    Is this the only actual Hive dependency? Without this, it seems that `BucketizedSparkInputFormat` and `BucketizedSparkRecordReader` can be promoted to `sql/core`.


---

[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    **[Test build #80879 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80879/testReport)** for PR 19001 at commit [`02d8711`](https://github.com/apache/spark/commit/02d87119f60db4db3e141b2f72365b09b45d9647).


---

[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80900/
    Test FAILed.


---

[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    Merged build finished. Test FAILed.


---

[GitHub] spark pull request #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19001#discussion_r183950638
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala ---
    @@ -156,40 +144,14 @@ object FileFormatWriter extends Logging {
           statsTrackers = statsTrackers
         )
     
    -    // We should first sort by partition columns, then bucket id, and finally sorting columns.
    -    val requiredOrdering = partitionColumns ++ bucketIdExpression ++ sortColumns
    --- End diff --
    
    Can we send an individual PR to do this? i.e. do the sorting via `requiredOrdering` instead of doing it manually.


---


[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    https://github.com/apache/spark/pull/19080 is improving the distribution semantic in planner. Will wait for that to get in.


---

[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    Merged build finished. Test FAILed.


---

[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    **[Test build #86097 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86097/testReport)** for PR 19001 at commit [`d37eb8b`](https://github.com/apache/spark/commit/d37eb8b3359981756c923948fe12833a56b61865).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    **[Test build #86097 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86097/testReport)** for PR 19001 at commit [`d37eb8b`](https://github.com/apache/spark/commit/d37eb8b3359981756c923948fe12833a56b61865).


---


[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    Merged build finished. Test FAILed.


---


[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    Jenkins retest this please


---

[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    Merged build finished. Test FAILed.


---


[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    **[Test build #81005 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81005/testReport)** for PR 19001 at commit [`a30b6ce`](https://github.com/apache/spark/commit/a30b6ceefe7066c644975b55567fe60a74f5aa4f).


---

[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    Merged build finished. Test PASSed.


---

[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81005/
    Test PASSed.


---

[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    **[Test build #80908 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80908/testReport)** for PR 19001 at commit [`02d8711`](https://github.com/apache/spark/commit/02d87119f60db4db3e141b2f72365b09b45d9647).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `      throw new IOException(\"Cannot find class \" + inputFormatClassName, e);`
      * `      throw new IOException(\"Unable to find the InputFormat class \" + inputFormatClassName, e);`


---

[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    **[Test build #80908 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80908/testReport)** for PR 19001 at commit [`02d8711`](https://github.com/apache/spark/commit/02d87119f60db4db3e141b2f72365b09b45d9647).


---

[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    Now that https://github.com/apache/spark/pull/19080 has been merged to trunk, I am rebasing this PR. A small part of this PR is put in https://github.com/apache/spark/pull/20206 and ready for review.


---


[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    Merged build finished. Test PASSed.


---

[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    **[Test build #86085 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86085/testReport)** for PR 19001 at commit [`d37eb8b`](https://github.com/apache/spark/commit/d37eb8b3359981756c923948fe12833a56b61865).


---


[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    **[Test build #80885 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80885/testReport)** for PR 19001 at commit [`02d8711`](https://github.com/apache/spark/commit/02d87119f60db4db3e141b2f72365b09b45d9647).


---

[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    cc @cloud-fan @gatorsmile @sameeragarwal for review


---


[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    **[Test build #86013 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86013/testReport)** for PR 19001 at commit [`7b8a072`](https://github.com/apache/spark/commit/7b8a0729b38ba2fbdc1c4359fcb82a1b6cde5b5c).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `      throw new IOException(\"Cannot find class \" + inputFormatClassName, e);`
      * `      throw new IOException(\"Unable to find the InputFormat class \" + inputFormatClassName, e);`


---


[GitHub] spark pull request #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil closed the pull request at:

    https://github.com/apache/spark/pull/19001


---


[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    Hi all, any updates on this PR?


---


[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    Jenkins retest this please


---


[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    **[Test build #86074 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86074/testReport)** for PR 19001 at commit [`3c367a0`](https://github.com/apache/spark/commit/3c367a08fa5290081e82d45ea7bf564277f196b0).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `      throw new IOException(\"Cannot find class \" + inputFormatClassName, e);`
      * `      throw new IOException(\"Unable to find the InputFormat class \" + inputFormatClassName, e);`


---


[GitHub] spark pull request #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19001#discussion_r134106172
  
    --- Diff: sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/BucketizedSparkRecordReader.java ---
    @@ -0,0 +1,147 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.hadoop.hive.ql.io;
    +
    +import org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil;
    --- End diff --
    
    What do we gain by moving these classes to `sql/core`, given that they are quite specific to Hive? I don't see any use case besides Hive benefiting from them, so I decided to keep them in `sql/hive` and keep `sql/core` cleaner.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80908/
    Test PASSed.


---

[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    cc @cloud-fan @gatorsmile @sameeragarwal @rxin


---

[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    **[Test build #86085 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86085/testReport)** for PR 19001 at commit [`d37eb8b`](https://github.com/apache/spark/commit/d37eb8b3359981756c923948fe12833a56b61865).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    Merged build finished. Test FAILed.


---


[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    I will close this for now


---


[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86097/
    Test PASSed.


---


[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    **[Test build #81005 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81005/testReport)** for PR 19001 at commit [`a30b6ce`](https://github.com/apache/spark/commit/a30b6ceefe7066c644975b55567fe60a74f5aa4f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `      throw new IOException(\"Cannot find class \" + inputFormatClassName, e);`
      * `      throw new IOException(\"Unable to find the InputFormat class \" + inputFormatClassName, e);`


---

[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    Jenkins retest this please


---


[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    The R failure looks unrelated to this change.
    ```
    1. Error: spark.logit (@test_mllib_classification.R#288) -----------------------
    java.lang.IllegalArgumentException: requirement failed: The input column stridx_c3082b343085 should have at least two distinct values.
    ```


---

[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    **[Test build #80900 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80900/testReport)** for PR 19001 at commit [`02d8711`](https://github.com/apache/spark/commit/02d87119f60db4db3e141b2f72365b09b45d9647).


---

[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    **[Test build #86074 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86074/testReport)** for PR 19001 at commit [`3c367a0`](https://github.com/apache/spark/commit/3c367a08fa5290081e82d45ea7bf564277f196b0).


---


[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    **[Test build #80900 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80900/testReport)** for PR 19001 at commit [`02d8711`](https://github.com/apache/spark/commit/02d87119f60db4db3e141b2f72365b09b45d9647).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `      throw new IOException(\"Cannot find class \" + inputFormatClassName, e);`
      * `      throw new IOException(\"Unable to find the InputFormat class \" + inputFormatClassName, e);`


---

[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    **[Test build #86013 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86013/testReport)** for PR 19001 at commit [`7b8a072`](https://github.com/apache/spark/commit/7b8a0729b38ba2fbdc1c4359fcb82a1b6cde5b5c).


---


[GitHub] spark pull request #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by chrysan <gi...@git.apache.org>.

Github user chrysan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19001#discussion_r178433504
  
    --- Diff: sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/BucketizedSparkInputFormat.java ---
    @@ -0,0 +1,107 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.hadoop.hive.ql.io;
    +
    +import org.apache.hadoop.fs.FileStatus;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.io.Writable;
    +import org.apache.hadoop.io.WritableComparable;
    +import org.apache.hadoop.mapred.*;
    +import org.apache.hadoop.util.StringUtils;
    +
    +import java.io.IOException;
    +import java.util.Arrays;
    +
    +import static org.apache.hadoop.mapreduce.lib.input.FileInputFormat.INPUT_DIR;
    +
    +/**
    + * A {@link InputFormat} implementation for reading bucketed tables.
    + *
    + * We cannot directly use {@link BucketizedHiveInputFormat} from Hive as it depends on the
    + * map-reduce plan to get required information for split generation.
    + */
    +public class BucketizedSparkInputFormat<K extends WritableComparable, V extends Writable>
    +        extends BucketizedHiveInputFormat<K, V> {
    +
    +  private static final String FILE_INPUT_FORMAT = "file.inputformat";
    +
    +  @Override
    +  public RecordReader getRecordReader(
    +      InputSplit split,
    +      JobConf job,
    +      Reporter reporter) throws IOException {
    +
    +    BucketizedHiveInputSplit hsplit = (BucketizedHiveInputSplit) split;
    +    String inputFormatClassName = null;
    +    Class inputFormatClass = null;
    +
    +    try {
    +      inputFormatClassName = hsplit.inputFormatClassName();
    +      inputFormatClass = job.getClassByName(inputFormatClassName);
    +    } catch (ClassNotFoundException e) {
    +      throw new IOException("Cannot find class " + inputFormatClassName, e);
    +    }
    +
    +    InputFormat inputFormat = getInputFormatFromCache(inputFormatClass, job);
    +    return new BucketizedSparkRecordReader<>(inputFormat, hsplit, job, reporter);
    +  }
    +
    +  @Override
    +  public InputSplit[] getSplits(JobConf job, int numBuckets) throws IOException {
    +    final String inputFormatClassName = job.get(FILE_INPUT_FORMAT);
    +    final String[] inputDirs = job.get(INPUT_DIR).split(StringUtils.COMMA_STR);
    +
    +    if (inputDirs.length != 1) {
    +      throw new IOException(this.getClass().getCanonicalName() +
    +        " expects only one input directory. " + inputDirs.length +
    +        " directories detected : " + Arrays.toString(inputDirs));
    +    }
    +
    +    final String inputDir = inputDirs[0];
    +    final Path inputPath = new Path(inputDir);
    +    final JobConf newJob = new JobConf(job);
    +    final FileStatus[] listStatus = this.listStatus(newJob, inputPath);
    +    final InputSplit[] result = new InputSplit[numBuckets];
    +
    +    if (listStatus.length != 0 && listStatus.length != numBuckets) {
    +      throw new IOException("Bucketed path was expected to have " + numBuckets + " files but " +
    +        listStatus.length + " files are present. Path = " + inputPath);
    +    }
    +
    +    try {
    +      final Class<?> inputFormatClass = Class.forName(inputFormatClassName);
    +      final InputFormat inputFormat = getInputFormatFromCache(inputFormatClass, job);
    +      newJob.setInputFormat(inputFormat.getClass());
    +
    +      for (int i = 0; i < numBuckets; i++) {
    +        final FileStatus fileStatus = listStatus[i];
    --- End diff --
    
    This logic depends on the files being listed in bucket order; otherwise the RDD partitions to be joined cannot be zipped correctly. The logic here should be fixed to reorder the listed files.
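
One way the reordering asked for above might look is sketched here. It is a minimal illustration rather than the PR's code: it works on plain file names instead of Hadoop `FileStatus` objects so that it is self-contained, the class and method names (`BucketFileOrder`, `bucketIdOf`, `sortByBucketId`) are hypothetical, and it assumes Hive-style bucket file names that begin with a zero-padded bucket id followed by an underscore (for example `000003_0`).

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the PR.
class BucketFileOrder {
  // Assumption: the file name starts with the bucket id, e.g. "000003_0".
  private static final Pattern BUCKET_PREFIX = Pattern.compile("^(\\d+)_");

  // Extracts the bucket id encoded at the start of a bucket file name.
  static int bucketIdOf(String fileName) {
    Matcher m = BUCKET_PREFIX.matcher(fileName);
    if (!m.find()) {
      throw new IllegalArgumentException("Not a bucket file name: " + fileName);
    }
    return Integer.parseInt(m.group(1));
  }

  // Returns a sorted copy so that index i holds the file for bucket i,
  // regardless of the order in which the file system listed the files.
  static String[] sortByBucketId(String[] fileNames) {
    String[] sorted = fileNames.clone();
    Arrays.sort(sorted, Comparator.comparingInt(BucketFileOrder::bucketIdOf));
    return sorted;
  }
}
```

With the listing sorted this way, the `listStatus[i]` lookup in the loop would refer to bucket `i` even if the file system returned the files in arbitrary order.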


---


[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86074/
    Test FAILed.


---


[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    Merged build finished. Test FAILed.


---

[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    retest this please


---

[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    ping @cloud-fan @gatorsmile


---

[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    With the simplified distribution semantics, I think it's much easier to support Hive bucketing. We only need to create a `HiveHashPartitioning`, implemented similarly to `HashPartitioning` but without satisfying `HashPartitionedDistribution`, and then we can avoid the shuffle for bucketed Hive tables in many cases, such as aggregate, repartitionBy, and broadcast join.
    
    For non-broadcast joins, we could potentially support this once we make the hash function configurable for `HashPartitionedDistribution`.
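
To illustrate why the hash function matters for non-broadcast joins, the sketch below compares two bucket-id computations for integer keys. It is a simplified illustration under stated assumptions, not code from the PR: the Hive side assumes an `int` hashes to its own value and the bucket is `(hash & Integer.MAX_VALUE) % numBuckets`, and the Spark side assumes a Murmur3 (x86, 32-bit) hash with seed 42 followed by a non-negative modulo. The class and method names are hypothetical.

```java
// Hypothetical comparison of two bucket-id schemes for int keys.
class BucketIdSketch {
  // Assumed Hive-style bucketing: an int hashes to itself, sign bit masked off.
  static int hiveBucketId(int key, int numBuckets) {
    return (key & Integer.MAX_VALUE) % numBuckets;
  }

  // Assumed Spark-style bucketing: Murmur3 with seed 42, then positive modulo.
  static int murmur3BucketId(int key, int numBuckets) {
    return Math.floorMod(murmur3HashInt(key, 42), numBuckets);
  }

  // Murmur3 x86 32-bit hash of a single 4-byte int.
  static int murmur3HashInt(int input, int seed) {
    int k1 = input * 0xcc9e2d51;
    k1 = Integer.rotateLeft(k1, 15);
    k1 *= 0x1b873593;

    int h1 = seed ^ k1;
    h1 = Integer.rotateLeft(h1, 13);
    h1 = h1 * 5 + 0xe6546b64;

    // Finalization mix; 4 is the input length in bytes.
    h1 ^= 4;
    h1 ^= h1 >>> 16;
    h1 *= 0x85ebca6b;
    h1 ^= h1 >>> 13;
    h1 *= 0xc2b2ae35;
    h1 ^= h1 >>> 16;
    return h1;
  }
}
```

Because the two functions generally place the same key in different buckets, data bucketed one way cannot simply be zipped, bucket by bucket, with data shuffled the other way; both sides of a join must agree on the hash function.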


---

[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80885/
    Test FAILed.


---

[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86013/
    Test FAILed.


---


[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86085/
    Test FAILed.


---


[GitHub] spark pull request #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19001#discussion_r134120888
  
    --- Diff: sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/BucketizedSparkRecordReader.java ---
    @@ -0,0 +1,147 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.hadoop.hive.ql.io;
    +
    +import org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil;
    --- End diff --
    
    I see. Thanks.


---

[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    **[Test build #80885 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80885/testReport)** for PR 19001 at commit [`02d8711`](https://github.com/apache/spark/commit/02d87119f60db4db3e141b2f72365b09b45d9647).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `      throw new IOException(\"Cannot find class \" + inputFormatClassName, e);`
      * `      throw new IOException(\"Unable to find the InputFormat class \" + inputFormatClassName, e);`


---

[GitHub] spark pull request #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by chrysan <gi...@git.apache.org>.

Github user chrysan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19001#discussion_r175640958
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
    @@ -184,6 +189,43 @@ case class InsertIntoHadoopFsRelationCommand(
         Seq.empty[Row]
       }
     
    +  private def getBucketIdExpression(dataColumns: Seq[Attribute]): Option[Expression] = {
    +    bucketSpec.map { spec =>
    +      val bucketColumns = spec.bucketColumnNames.map(c => dataColumns.find(_.name == c).get)
    +      // Use `HashPartitioning.partitionIdExpression` as our bucket id expression, so that we can
    +      // guarantee the data distribution is same between shuffle and bucketed data source, which
    +      // enables us to only shuffle one side when join a bucketed table and a normal one.
    +      HashPartitioning(
    +        bucketColumns,
    +        spec.numBuckets,
    +        classOf[Murmur3Hash]
    +      ).partitionIdExpression
    +    }
    +  }
    +
    +  /**
    +   * How is `requiredOrdering` determined ?
    --- End diff --
    
    Why does the definition of `requiredOrdering` here differ from that in `InsertIntoHiveTable`?


---


[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    **[Test build #80879 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80879/testReport)** for PR 19001 at commit [`02d8711`](https://github.com/apache/spark/commit/02d87119f60db4db3e141b2f72365b09b45d9647).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `      throw new IOException(\"Cannot find class \" + inputFormatClassName, e);`
      * `      throw new IOException(\"Unable to find the InputFormat class \" + inputFormatClassName, e);`


---

[GitHub] spark pull request #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19001#discussion_r183949990
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala ---
    @@ -39,7 +40,10 @@ case class SortMergeJoinExec(
         joinType: JoinType,
         condition: Option[Expression],
         left: SparkPlan,
    -    right: SparkPlan) extends BinaryExecNode with CodegenSupport {
    +    right: SparkPlan,
    +    requiredNumPartitions: Option[Int] = None,
    +    hashingFunctionClass: Class[_ <: HashExpression[Int]] = classOf[Murmur3Hash])
    --- End diff --
    
    I think this can be done in a follow-up. For the first version, we can just add a `HiveHashPartitioning`, which can satisfy `ClusteredDistribution` (saving the shuffle for aggregates) but not `HashClusteredDistribution` (so it cannot save the shuffle for joins).
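
The distinction drawn in the comment above can be sketched with a small standalone model. This is only an illustration under stated assumptions: the types below are simplified stand-ins that reuse Spark's names but represent keys as plain strings, they are not Spark's actual classes, and `HiveHashPartitioning` is the hypothetical partitioning proposed in the comment, not an existing API.

```scala
// Simplified stand-ins for Spark's physical-plan types (assumption: keys are plain
// column names; the real Spark classes work on Catalyst expressions).
sealed trait Distribution
// Rows with equal keys must be co-located; any deterministic hash function will do.
final case class ClusteredDistribution(keys: Seq[String]) extends Distribution
// Rows must be placed by one specific hash function so that both join sides line up.
final case class HashClusteredDistribution(keys: Seq[String]) extends Distribution

sealed trait Partitioning {
  def satisfies(required: Distribution): Boolean
}

// Models the Murmur3-based partitioning produced by a Spark shuffle.
final case class HashPartitioning(keys: Seq[String], numPartitions: Int) extends Partitioning {
  def satisfies(required: Distribution): Boolean = required match {
    case ClusteredDistribution(k)     => keys.nonEmpty && keys.forall(k.contains)
    case HashClusteredDistribution(k) => keys == k
  }
}

// Hypothetical: data laid out by Hive's hash function. Equal keys are still
// co-located, but bucket ids do not match those of a Murmur3 shuffle.
final case class HiveHashPartitioning(keys: Seq[String], numBuckets: Int) extends Partitioning {
  def satisfies(required: Distribution): Boolean = required match {
    case ClusteredDistribution(k)     => keys.nonEmpty && keys.forall(k.contains)
    case HashClusteredDistribution(_) => false // never compatible with a Murmur3 shuffle
  }
}

val hive = HiveHashPartitioning(Seq("user_id"), 8)
val aggregateOk = hive.satisfies(ClusteredDistribution(Seq("user_id")))     // shuffle can be skipped
val joinOk      = hive.satisfies(HashClusteredDistribution(Seq("user_id"))) // shuffle still needed
```

In this model an aggregate over the bucket columns could reuse the Hive layout as-is, while a join would still shuffle the Hive side, which is the trade-off suggested for the first version.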


---


[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    Jenkins retest this please


---


[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19001
  
    Merged build finished. Test PASSed.


---
