Posted to issues@carbondata.apache.org by GitBox <gi...@apache.org> on 2020/02/24 09:39:49 UTC

[GitHub] [carbondata] Zhangshunyu opened a new pull request #3637: [WIP] Support Bucket Table

Zhangshunyu opened a new pull request #3637: [WIP] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637
 
 
    ### Why is this PR needed?
    Support bucket tables consistent with Spark Parquet, to improve join performance by avoiding a shuffle on the bucket column. Also fix bugs.
    
    ### What changes were proposed in this PR?
    Support bucket tables consistent with Spark Parquet, to improve join performance by avoiding a shuffle on the bucket column. Also fix bugs.
       
    ### Does this PR introduce any user interface change?
    - No
   
    ### Is any new testcase added?
    - Yes
   
       
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [WIP] Support Bucket Table

Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on issue #3637: [WIP] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590275188
 
 
   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2127/
   

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592499351
 
 
   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2230/
   

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-593813133
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2274/
   

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592918302
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2238/
   

[GitHub] [carbondata] ravipesala commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

Posted by GitBox <gi...@apache.org>.
ravipesala commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592974580
 
 
   @Zhangshunyu Another way is to let Spark do the bucketing, the same way the partitioner is implemented. In fact, we can add the bucketing directly into the partition flow. Not many changes are needed in that case.

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592490434
 
 
   Build Failed  with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/530/
   

[GitHub] [carbondata] Zhangshunyu commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table

Posted by GitBox <gi...@apache.org>.
Zhangshunyu commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r384909642
 
 

 ##########
 File path: processing/src/main/java/org/apache/carbondata/processing/loading/partition/impl/BucketMurmur3HashPartitionerImpl.java
 ##########
 @@ -0,0 +1,181 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.carbondata.processing.loading.partition.impl;
+
+import java.util.List;
+
+import org.apache.carbondata.common.annotations.InterfaceAudience;
+import org.apache.carbondata.core.datastore.row.CarbonRow;
+import org.apache.carbondata.core.metadata.datatype.DataType;
+import org.apache.carbondata.core.metadata.datatype.DataTypes;
+import org.apache.carbondata.core.metadata.schema.table.column.ColumnSchema;
+import org.apache.carbondata.core.unsafe.hash.Murmur3_x86_32;
+import org.apache.carbondata.core.unsafe.types.UTF8String;
+import org.apache.carbondata.processing.loading.partition.Partitioner;
+
+/**
+ * Bucket Hash partitioner implementation using Murmur3_x86_32, it keep the same hash value as
+ * spark for given input.
+ */
+@InterfaceAudience.Internal
+public class BucketMurmur3HashPartitionerImpl implements Partitioner<CarbonRow> {
+
+  private int numberOfBuckets;
+
+  private Hash[] hashes;
+
+  public BucketMurmur3HashPartitionerImpl(List<Integer> indexes, List<ColumnSchema> columnSchemas,
+                                          int numberOfBuckets) {
+    this.numberOfBuckets = numberOfBuckets;
+    hashes = new Hash[indexes.size()];
+    for (int i = 0; i < indexes.size(); i++) {
+      DataType dataType = columnSchemas.get(i).getDataType();
+      if (dataType == DataTypes.LONG || dataType == DataTypes.DOUBLE) {
+        hashes[i] = new LongHash(indexes.get(i));
+      } else if (dataType == DataTypes.SHORT || dataType == DataTypes.INT ||
+          dataType == DataTypes.FLOAT || dataType == DataTypes.BOOLEAN) {
+        hashes[i] = new IntegralHash(indexes.get(i));
+      } else if (DataTypes.isDecimal(dataType)) {
+        hashes[i] = new DecimalHash(indexes.get(i));
+      } else if (dataType == DataTypes.TIMESTAMP) {
 
 Review comment:
   @Indhumathi27 If we use a different hash for this data type, the hash value will differ from Spark's, and join results will not match those from Parquet tables, etc.
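
To illustrate why the hash has to match per data type, here is a minimal, self-contained sketch of a Spark-style bucket id computation. It assumes the 32-bit Murmur3 mixing steps, a seed of 42, and a non-negative modulo over the bucket count, which is how Spark is generally understood to assign bucket ids. The class and method names (`BucketIdSketch`, `bucketIdForInt`, `bucketIdForLong`) are hypothetical and are not taken from this PR.

```java
final class BucketIdSketch {
  private static final int C1 = 0xcc9e2d51;
  private static final int C2 = 0x1b873593;
  // Assumed seed: the default seed of Spark's Murmur3 hash expression.
  static final int SEED = 42;

  // Standard Murmur3 32-bit block scrambling.
  private static int mixK1(int k1) {
    k1 *= C1;
    k1 = Integer.rotateLeft(k1, 15);
    k1 *= C2;
    return k1;
  }

  // Standard Murmur3 32-bit state update.
  private static int mixH1(int h1, int k1) {
    h1 ^= k1;
    h1 = Integer.rotateLeft(h1, 13);
    return h1 * 5 + 0xe6546b64;
  }

  // Standard Murmur3 32-bit finalizer; length is the input size in bytes.
  private static int fmix(int h1, int length) {
    h1 ^= length;
    h1 ^= h1 >>> 16;
    h1 *= 0x85ebca6b;
    h1 ^= h1 >>> 13;
    h1 *= 0xc2b2ae35;
    h1 ^= h1 >>> 16;
    return h1;
  }

  // A 4-byte value is hashed as a single block.
  public static int hashInt(int input, int seed) {
    return fmix(mixH1(seed, mixK1(input)), 4);
  }

  // An 8-byte value is hashed as two blocks: low word first, then high word.
  public static int hashLong(long input, int seed) {
    int low = (int) input;
    int high = (int) (input >>> 32);
    int h1 = mixH1(seed, mixK1(low));
    h1 = mixH1(h1, mixK1(high));
    return fmix(h1, 8);
  }

  // Non-negative modulo, so negative hash values still map to a valid bucket.
  public static int pmod(int hash, int n) {
    return ((hash % n) + n) % n;
  }

  public static int bucketIdForInt(int value, int numBuckets) {
    return pmod(hashInt(value, SEED), numBuckets);
  }

  public static int bucketIdForLong(long value, int numBuckets) {
    return pmod(hashLong(value, SEED), numBuckets);
  }

  public static void main(String[] args) {
    // The same number stored as INT and as LONG goes through different hash paths.
    System.out.println("int  7 -> bucket " + bucketIdForInt(7, 8));
    System.out.println("long 7 -> bucket " + bucketIdForLong(7L, 8));
  }
}
```

Because the 4-byte and 8-byte paths mix a different number of blocks, a writer that picks the wrong path for a column can place rows in different buckets than Spark expects, which is the kind of mismatch the comment above warns about.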

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592957640
 
 
   Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/543/
   

[GitHub] [carbondata] Zhangshunyu edited a comment on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

Posted by GitBox <gi...@apache.org>.
Zhangshunyu edited a comment on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-593037009
 
 
   > @Zhangshunyu Another way is to let Spark do the bucketing, the same way the partitioner is implemented. In fact, we can add the bucketing directly into the partition flow. Not many changes are needed in that case.
   
   @ravipesala Is Guava's murmur hash the same as the one Spark uses?
   
   > @Zhangshunyu It was a supported feature earlier, but unfortunately the code got removed some time back. Anyway, Spark changed the hashing technique for creating buckets, so we cannot rely on our own hashing anymore.
   > I see a lot of code got copied from Spark just to get the hashing. That is not recommended, because if they change it in the future it will break again. They also follow the industry-standard murmur hash. So please use the Guava library to do the murmur hashing. Please don't copy code unnecessarily from Spark.
   
   @ravipesala Spark's hash is based on Guava's but is not identical to Guava's implementation. As for future changes in Spark: if we want to keep the same hash code as Spark, maybe we can depend on the spark-unsafe jar directly, selected by Spark version, just as Carbon already depends on different Spark versions.
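
The gap between Spark's hashing and a textbook Murmur3 can be sketched as follows. The sketch assumes, based on the Spark source around the time of this discussion, that Spark's legacy byte hashing mixes each trailing byte (those after the last full 4-byte word) as its own block, whereas standard Murmur3, as implemented in Guava, folds the trailing bytes into one partial block. The class and method names are hypothetical, and neither method is copied from Spark or Guava.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class MurmurTailSketch {
  private static final int C1 = 0xcc9e2d51;
  private static final int C2 = 0x1b873593;

  private static int mixK1(int k1) {
    k1 *= C1;
    k1 = Integer.rotateLeft(k1, 15);
    k1 *= C2;
    return k1;
  }

  private static int mixH1(int h1, int k1) {
    h1 ^= k1;
    h1 = Integer.rotateLeft(h1, 13);
    return h1 * 5 + 0xe6546b64;
  }

  private static int fmix(int h1, int length) {
    h1 ^= length;
    h1 ^= h1 >>> 16;
    h1 *= 0x85ebca6b;
    h1 ^= h1 >>> 13;
    h1 *= 0xc2b2ae35;
    h1 ^= h1 >>> 16;
    return h1;
  }

  // Shared by both variants: mix every full little-endian 4-byte word.
  private static int hashAligned(byte[] data, int alignedLength, int seed) {
    ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
    int h1 = seed;
    for (int i = 0; i < alignedLength; i += 4) {
      h1 = mixH1(h1, mixK1(buf.getInt(i)));
    }
    return h1;
  }

  // Standard Murmur3 tail: pack the 1 to 3 trailing bytes (unsigned) into one
  // partial block and XOR it into the state without the rotate/multiply step.
  public static int standardStyle(byte[] data, int seed) {
    int aligned = data.length - data.length % 4;
    int h1 = hashAligned(data, aligned, seed);
    int k1 = 0;
    for (int i = data.length - 1; i >= aligned; i--) {
      k1 = (k1 << 8) | (data[i] & 0xff);
    }
    h1 ^= mixK1(k1);
    return fmix(h1, data.length);
  }

  // Assumed Spark legacy tail: each trailing byte is sign-extended to an int
  // and mixed as if it were a full block.
  public static int sparkLegacyStyle(byte[] data, int seed) {
    int aligned = data.length - data.length % 4;
    int h1 = hashAligned(data, aligned, seed);
    for (int i = aligned; i < data.length; i++) {
      h1 = mixH1(h1, mixK1(data[i]));
    }
    return fmix(h1, data.length);
  }

  public static void main(String[] args) {
    byte[] alignedInput = "abcd".getBytes(StandardCharsets.UTF_8);
    byte[] unalignedInput = "abcde".getBytes(StandardCharsets.UTF_8);
    System.out.println(standardStyle(alignedInput, 42) + " vs " + sparkLegacyStyle(alignedInput, 42));
    System.out.println(standardStyle(unalignedInput, 42) + " vs " + sparkLegacyStyle(unalignedInput, 42));
  }
}
```

Under these assumptions the two variants agree whenever the input length is a multiple of four and can diverge otherwise, so an off-the-shelf Murmur3 would not be a drop-in replacement for string bucket columns.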

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-593054571
 
 
   Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/547/
   

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [WIP] Support Bucket Table

Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on issue #3637: [WIP] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590296415
 
 
   Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/432/
   

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table

Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-591801851
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2205/
   

[GitHub] [carbondata] jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

Posted by GitBox <gi...@apache.org>.
jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386770338
 
 

 ##########
 File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonScanRDD.scala
 ##########
 @@ -96,7 +96,7 @@ class CarbonScanRDD[T: ClassTag](
 
   private var directFill = false
 
-  private val bucketedTable = tableInfo.getFactTable.getBucketingInfo
+  private val bucketInfo = tableInfo.getFactTable.getBucketingInfo
 
 Review comment:
   Does the executor side need this? Can we make it a local variable in internalCompute?

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [WIP][CARBONDATA-3721] Support Bucket Table

Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on issue #3637: [WIP][CARBONDATA-3721] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590662074
 
 
   Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/449/
   

[GitHub] [carbondata] jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

Posted by GitBox <gi...@apache.org>.
jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386771262
 
 

 ##########
 File path: processing/pom.xml
 ##########
 @@ -45,6 +45,11 @@
       <artifactId>spark-sql_${scala.binary.version}</artifactId>
       <version>${spark.version}</version>
     </dependency>
+    <dependency>
+      <groupId>org.apache.spark</groupId>
 
 Review comment:
   We should try to avoid depending on Spark in the core modules, including carbondata-core, carbondata-hadoop, and carbondata-processing.
   Can we avoid adding this?

[GitHub] [carbondata] asfgit closed pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

Posted by GitBox <gi...@apache.org>.
asfgit closed pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637
 
 
   

[GitHub] [carbondata] Zhangshunyu commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

Posted by GitBox <gi...@apache.org>.
Zhangshunyu commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386817746
 
 

 ##########
 File path: processing/pom.xml
 ##########
 @@ -45,6 +45,11 @@
       <artifactId>spark-sql_${scala.binary.version}</artifactId>
       <version>${spark.version}</version>
     </dependency>
+    <dependency>
+      <groupId>org.apache.spark</groupId>
 
 Review comment:
   @jackylk To keep join results correct against Parquet bucket tables, we need to hash the data of each data type with the same methods Spark uses, so this code is needed. There are two options:
   1. Copy the code from Spark. That is about 2,000 lines, and whenever Spark changes it we would have to change our copy as well, so it is not a good choice. For more details, please check the conversations above.
   2. Depend on the spark-unsafe jar. We depend on just one Spark jar, and changes across Spark versions do not affect us because the version is controlled in the pom.

[GitHub] [carbondata] Zhangshunyu commented on issue #3637: [WIP][CARBONDATA-3721] Support Bucket Table

Posted by GitBox <gi...@apache.org>.
Zhangshunyu commented on issue #3637: [WIP][CARBONDATA-3721] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590775160
 
 
   retest this please

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [WIP][CARBONDATA-3721] Support Bucket Table

Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on issue #3637: [WIP][CARBONDATA-3721] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590739256
 
 
   Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/456/
   

[GitHub] [carbondata] jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

Posted by GitBox <gi...@apache.org>.
jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386768628
 
 

 ##########
 File path: integration/flink/src/test/scala/org/apache/carbon/flink/TestCarbonWriter.scala
 ##########
 @@ -20,20 +20,23 @@ package org.apache.carbon.flink
 import java.util.Properties
 
 import org.apache.carbondata.core.constants.CarbonCommonConstants
+import org.apache.carbondata.core.datastore.filesystem.{CarbonFile, CarbonFileFilter}
 import org.apache.flink.api.common.restartstrategy.RestartStrategies
 import org.apache.flink.core.fs.Path
 import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
 import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
-import org.apache.spark.sql.Row
+import org.apache.spark.sql.{CarbonEnv, Row}
 import org.apache.spark.sql.test.util.QueryTest
-
 import org.apache.carbondata.core.datastore.impl.FileFactory
 import org.apache.carbondata.core.util.CarbonProperties
 import org.apache.carbondata.core.util.path.CarbonTablePath
+import org.apache.spark.sql.execution.exchange.Exchange
 
 class TestCarbonWriter extends QueryTest {
 
   val tableName = "test_flink"
+  val tableName2 = "insert_bucket_table"
 
 Review comment:
   ```suggestion
     val bucketTableName = "insert_bucket_table"
   ```

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-593061151
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2247/
   

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [WIP][CARBONDATA-3721] Support Bucket Table

Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on issue #3637: [WIP][CARBONDATA-3721] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590753572
 
 
   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2157/
   

[GitHub] [carbondata] jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

Posted by GitBox <gi...@apache.org>.
jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386769585
 
 

 ##########
 File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonMergerRDD.scala
 ##########
 @@ -584,36 +591,53 @@ class CarbonMergerRDD[K, V](
     logInfo("no.of.nodes where data present=" + nodeBlockMap.size())
     defaultParallelism = sparkContext.defaultParallelism
 
-    // Create Spark Partition for each task and assign blocks
-    nodeBlockMap.asScala.foreach { case (nodeName, splitList) =>
-      val taskSplitList = new java.util.ArrayList[NodeInfo](0)
-      nodeTaskBlocksMap.put(nodeName, taskSplitList)
-      var blockletCount = 0
-      splitList.asScala.foreach { splitInfo =>
-        val splitsPerNode = splitInfo.asInstanceOf[CarbonInputSplitTaskInfo]
-        blockletCount = blockletCount + splitsPerNode.getCarbonInputSplitList.size()
-        taskSplitList.add(
-          NodeInfo(splitsPerNode.getTaskId, splitsPerNode.getCarbonInputSplitList.size()))
-
-        if (blockletCount != 0) {
-          val taskInfo = splitInfo.asInstanceOf[CarbonInputSplitTaskInfo]
-          val multiBlockSplit = if (null == rangeColumn || singleRange) {
-            new CarbonMultiBlockSplit(
-              taskInfo.getCarbonInputSplitList,
-              Array(nodeName))
-          } else {
-            var splitListForRange = new util.ArrayList[CarbonInputSplit]()
-            new CarbonMultiBlockSplit(
-              splitListForRange,
-              Array(nodeName))
+    if (bucketInfo != null) {
 
 Review comment:
   Extract this new logic into a private function so that internalCompute does not become too big.

[GitHub] [carbondata] jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

Posted by GitBox <gi...@apache.org>.
jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386769957
 
 

 ##########
 File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonMergerRDD.scala
 ##########
 @@ -91,6 +91,7 @@ class CarbonMergerRDD[K, V](
   var singleRange = false
   var expressionMapForRangeCol: util.Map[Integer, Expression] = null
   var broadCastSplits: Broadcast[CarbonInputSplitWrapper] = null
+  val bucketInfo = carbonLoadModel.getCarbonDataLoadSchema.getCarbonTable.getBucketingInfo
 
 Review comment:
   It seems the executor side does not need this variable, so move it into internalCompute and make it a local variable.

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-593790391
 
 
   Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/570/
   

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592555376
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2233/
   

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592966050
 
 
   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2243/
   

[GitHub] [carbondata] jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

Posted by GitBox <gi...@apache.org>.
jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386769957
 
 

 ##########
 File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonMergerRDD.scala
 ##########
 @@ -91,6 +91,7 @@ class CarbonMergerRDD[K, V](
   var singleRange = false
   var expressionMapForRangeCol: util.Map[Integer, Expression] = null
   var broadCastSplits: Broadcast[CarbonInputSplitWrapper] = null
+  val bucketInfo = carbonLoadModel.getCarbonDataLoadSchema.getCarbonTable.getBucketingInfo
 
 Review comment:
   It seems the executor side does not need this variable, so please move it into internalCompute and make it a local variable.


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [WIP][CARBONDATA-3721] Support Bucket Table

Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on issue #3637: [WIP][CARBONDATA-3721] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590679747
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2149/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-593038010
 
 
   Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/545/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-594404737
 
 
   Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/600/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [WIP] Support Bucket Table

Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on issue #3637: [WIP] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590245487
 
 
   Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/427/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [WIP][CARBONDATA-3721] Support Bucket Table

Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on issue #3637: [WIP][CARBONDATA-3721] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590784776
 
 
   Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/459/
   


[GitHub] [carbondata] Zhangshunyu edited a comment on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table

Posted by GitBox <gi...@apache.org>.
Zhangshunyu edited a comment on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592454295
 
 
   @ravipesala Please check all the new test cases added in TableBucketingTestCase and the comment I added in the PR description. We have this feature, but it does not work as expected:
   1. All data is stored in one file; it is not clustered in the current code.
   2. A join with Parquet returns the wrong result. Even between Carbon tables, string values use different hash codes, so the join results mismatch. We should use the same hash method as Spark and keep a consistent value for the same input.
   3. After compaction, data is stored in the file of bucket id 0.
   4. The new insert flow does not work for bucket tables.
   5. For the others, please check the test cases added.
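
For reference, point 2 asks for a hash that gives the same value as Spark for the same input. A minimal sketch of what that could look like for an int bucket column, assuming Spark's convention of a Murmur3 x86_32 hash with seed 42 followed by a non-negative modulo over the bucket count. The class and method names are hypothetical and are not taken from this PR:

```java
// Hypothetical sketch, not code from this PR.
final class BucketIdSketch {
  // Assumption: 42 is the seed Spark uses when hashing for partitioning.
  static final int SEED = 42;

  private static int mixK1(int k1) {
    k1 *= 0xcc9e2d51;
    k1 = Integer.rotateLeft(k1, 15);
    k1 *= 0x1b873593;
    return k1;
  }

  private static int mixH1(int h1, int k1) {
    h1 ^= k1;
    h1 = Integer.rotateLeft(h1, 13);
    return h1 * 5 + 0xe6546b64;
  }

  private static int fmix(int h1, int length) {
    h1 ^= length;
    h1 ^= h1 >>> 16;
    h1 *= 0x85ebca6b;
    h1 ^= h1 >>> 13;
    h1 *= 0xc2b2ae35;
    h1 ^= h1 >>> 16;
    return h1;
  }

  /** Murmur3 x86_32 of a single 4-byte int. */
  static int hashInt(int input, int seed) {
    return fmix(mixH1(seed, mixK1(input)), 4);
  }

  /** Maps a value to a bucket id in [0, numberOfBuckets). */
  static int bucketId(int value, int numberOfBuckets) {
    // floorMod keeps the result non-negative even when the hash is negative.
    return Math.floorMod(hashInt(value, SEED), numberOfBuckets);
  }

  public static void main(String[] args) {
    for (int v = 0; v < 5; v++) {
      System.out.println(v + " -> bucket " + bucketId(v, 4));
    }
  }
}
```

Two tables can only join without a shuffle if both sides map equal keys to the same bucket id, which is why the hash function, the seed and the modulo all have to agree.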


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592544098
 
 
   Build Failed with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/533/
   


[GitHub] [carbondata] jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

Posted by GitBox <gi...@apache.org>.
jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386768101
 
 

 ##########
 File path: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java
 ##########
 @@ -2379,4 +2379,18 @@ private CarbonCommonConstants() {
    */
   public static final String CARBON_SI_SEGMENT_MERGE_DEFAULT = "false";
 
+  /**
+   * Hash method of bucket table
+   */
+  public static final String BUCKET_HASH_METHOD = "bucket_hash_method";
+  public static final String BUCKET_HASH_METHOD_DEFAULT = "spark_hash_expression";
+  public static final String BUCKET_HASH_METHOD_SPARK_EXPRESSION = "spark_hash_expression";
+  public static final String BUCKET_HASH_METHOD_NATIVE = "native";
+
+  /**
+   * bucket properties
+   */
+  public static final String BUCKET_COLUMNS = "bucketcolumns";
 
 Review comment:
   Can we follow the bucket table syntax from Hive?


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-594409277
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2307/
   


[GitHub] [carbondata] Indhumathi27 commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table

Posted by GitBox <gi...@apache.org>.
Indhumathi27 commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r384289659
 
 

 ##########
 File path: processing/src/main/java/org/apache/carbondata/processing/loading/partition/impl/BucketMurmur3HashPartitionerImpl.java
 ##########
 @@ -0,0 +1,181 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.carbondata.processing.loading.partition.impl;
+
+import java.util.List;
+
+import org.apache.carbondata.common.annotations.InterfaceAudience;
+import org.apache.carbondata.core.datastore.row.CarbonRow;
+import org.apache.carbondata.core.metadata.datatype.DataType;
+import org.apache.carbondata.core.metadata.datatype.DataTypes;
+import org.apache.carbondata.core.metadata.schema.table.column.ColumnSchema;
+import org.apache.carbondata.core.unsafe.hash.Murmur3_x86_32;
+import org.apache.carbondata.core.unsafe.types.UTF8String;
+import org.apache.carbondata.processing.loading.partition.Partitioner;
+
+/**
+ * Bucket Hash partitioner implementation using Murmur3_x86_32, it keep the same hash value as
+ * spark for given input.
+ */
+@InterfaceAudience.Internal
+public class BucketMurmur3HashPartitionerImpl implements Partitioner<CarbonRow> {
+
+  private int numberOfBuckets;
+
+  private Hash[] hashes;
+
+  public BucketMurmur3HashPartitionerImpl(List<Integer> indexes, List<ColumnSchema> columnSchemas,
+                                          int numberOfBuckets) {
+    this.numberOfBuckets = numberOfBuckets;
+    hashes = new Hash[indexes.size()];
+    for (int i = 0; i < indexes.size(); i++) {
+      DataType dataType = columnSchemas.get(i).getDataType();
+      if (dataType == DataTypes.LONG || dataType == DataTypes.DOUBLE) {
+        hashes[i] = new LongHash(indexes.get(i));
+      } else if (dataType == DataTypes.SHORT || dataType == DataTypes.INT ||
+          dataType == DataTypes.FLOAT || dataType == DataTypes.BOOLEAN) {
+        hashes[i] = new IntegralHash(indexes.get(i));
+      } else if (DataTypes.isDecimal(dataType)) {
+        hashes[i] = new DecimalHash(indexes.get(i));
+      } else if (dataType == DataTypes.TIMESTAMP) {
 
 Review comment:
   What about a hash for the Date type?
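
For context on the Date question: Spark represents a date internally as the number of days since 1970-01-01 and, as far as I know, hashes that value like any other int. A minimal sketch of the conversion a Date branch would need, assuming the int hash itself is supplied by the caller. `DateHashSketch` and its methods are hypothetical and are not part of this PR:

```java
import java.time.LocalDate;
import java.util.function.IntBinaryOperator;

// Hypothetical sketch, not code from this PR.
final class DateHashSketch {
  /** Days since 1970-01-01, the int form assumed for a date column. */
  static int toEpochDays(LocalDate date) {
    return Math.toIntExact(date.toEpochDay());
  }

  /** Hashes a date by delegating to an int hash of the form (value, seed) -> hash. */
  static int hashDate(LocalDate date, IntBinaryOperator intHash, int seed) {
    return intHash.applyAsInt(toEpochDays(date), seed);
  }

  public static void main(String[] args) {
    // Stand-in hash for demonstration only; a real partitioner would pass Murmur3.
    IntBinaryOperator standIn = (value, seed) -> 31 * seed + value;
    System.out.println(hashDate(LocalDate.of(2020, 2, 24), standIn, 42));
  }
}
```

Under that assumption, a Date column could reuse the same int hash path as SHORT and INT once the value is converted to epoch days.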


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [WIP][CARBONDATA-3721] Support Bucket Table

Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on issue #3637: [WIP][CARBONDATA-3721] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590821550
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2160/
   


[GitHub] [carbondata] Zhangshunyu commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

Posted by GitBox <gi...@apache.org>.
Zhangshunyu commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-593037009
 
 
   > @Zhangshunyu The other way is to let Spark do the bucketing, like how the partitioner is implemented. In fact, we can add the bucketing directly into the partition flow. Not many changes are needed in that case.
   
   @ravipesala Is Guava's murmur hash the same as the one Spark uses?
   
   > @Zhangshunyu It was a supported feature earlier, but unfortunately the code got removed some time back. Anyway, Spark changed the hashing technique used when creating buckets, so we cannot rely on our own hashing anymore.
   > I see a lot of code was copied from Spark just to get the hashing. This is not recommended, because if they change it in the future it will break again. They follow the industry-standard murmur hash as well, so please use the Guava library to do the murmur hashing. Please don't copy code unnecessarily from Spark.
   
   Spark uses the Guava hash, but its implementation is not exactly the same as Guava's. As for future changes in Spark: if we want to keep the same hash code as Spark, maybe we can depend on the spark-unsafe jar directly, based on the Spark version, just as Carbon already depends on different Spark versions.
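
The difference mentioned here can be made concrete. A minimal sketch, assuming the commonly described behaviour: standard Murmur3 x86_32 (as in Guava's `Hashing.murmur3_32`) folds the 1 to 3 trailing bytes into a single block, whereas Spark's legacy byte hashing is understood to mix each trailing byte as its own block. Both variants are written out for illustration only; the class and method names are hypothetical, and the Spark-style variant is an approximation that should be checked against the Spark version in use:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not code from this PR or from Spark.
final class Murmur3TailSketch {
  private static int mixK1(int k1) {
    k1 *= 0xcc9e2d51;
    k1 = Integer.rotateLeft(k1, 15);
    k1 *= 0x1b873593;
    return k1;
  }

  private static int mixH1(int h1, int k1) {
    h1 ^= k1;
    h1 = Integer.rotateLeft(h1, 13);
    return h1 * 5 + 0xe6546b64;
  }

  private static int fmix(int h1, int length) {
    h1 ^= length;
    h1 ^= h1 >>> 16;
    h1 *= 0x85ebca6b;
    h1 ^= h1 >>> 13;
    h1 *= 0xc2b2ae35;
    h1 ^= h1 >>> 16;
    return h1;
  }

  // Hashes the 4-byte-aligned prefix as little-endian ints; shared by both variants.
  private static int hashBody(byte[] data, int alignedLength, int seed) {
    int h1 = seed;
    for (int i = 0; i < alignedLength; i += 4) {
      int k1 = (data[i] & 0xff) | ((data[i + 1] & 0xff) << 8)
          | ((data[i + 2] & 0xff) << 16) | ((data[i + 3] & 0xff) << 24);
      h1 = mixH1(h1, mixK1(k1));
    }
    return h1;
  }

  /** Standard Murmur3 x86_32: trailing bytes are combined into one block. */
  static int standard(byte[] data, int seed) {
    int aligned = data.length - data.length % 4;
    int h1 = hashBody(data, aligned, seed);
    int k1 = 0;
    for (int i = data.length - 1; i >= aligned; i--) {
      k1 = (k1 << 8) | (data[i] & 0xff);
    }
    h1 ^= mixK1(k1);
    return fmix(h1, data.length);
  }

  /** Assumed Spark-legacy style: each trailing (signed) byte is mixed separately. */
  static int sparkLegacyStyle(byte[] data, int seed) {
    int aligned = data.length - data.length % 4;
    int h1 = hashBody(data, aligned, seed);
    for (int i = aligned; i < data.length; i++) {
      h1 = mixH1(h1, mixK1(data[i]));
    }
    return fmix(h1, data.length);
  }

  public static void main(String[] args) {
    byte[] unaligned = "abc".getBytes(StandardCharsets.UTF_8);
    System.out.println("standard:     " + standard(unaligned, 42));
    System.out.println("legacy style: " + sparkLegacyStyle(unaligned, 42));
  }
}
```

The two variants agree whenever the input length is a multiple of 4 and generally diverge otherwise, which is the kind of mismatch that would put the same string key into different buckets on the two sides of a join.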


[GitHub] [carbondata] ravipesala commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

Posted by GitBox <gi...@apache.org>.
ravipesala commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592972205
 
 
   @Zhangshunyu It was a supported feature earlier, but unfortunately the code got removed some time back. Anyway, Spark changed the hashing technique used when creating buckets, so we cannot rely on our own hashing anymore.
   I see a lot of code was copied from Spark just to get the hashing. This is not recommended, because if they change it in the future it will break again. They follow the industry-standard murmur hash as well, so please use the Guava library to do the murmur hashing. Please don't copy code unnecessarily from Spark.


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592917163
 
 
   Build Failed with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/539/
   


[GitHub] [carbondata] Zhangshunyu commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table

Posted by GitBox <gi...@apache.org>.
Zhangshunyu commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592454295
 
 
   @ravipesala Please check all the new test cases added in TableBucketingTestCase and the comment I added in the PR description.
   1. All data is stored in one file.
   2. A join with Parquet returns the wrong result.
   3. After compaction, data is stored in the file of bucket id 0.
   4. The new insert flow does not work.
   5. For the others, please check the test cases added.


[GitHub] [carbondata] jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

Posted by GitBox <gi...@apache.org>.
jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386767939
 
 

 ##########
 File path: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java
 ##########
 @@ -2379,4 +2379,18 @@ private CarbonCommonConstants() {
    */
   public static final String CARBON_SI_SEGMENT_MERGE_DEFAULT = "false";
 
+  /**
+   * Hash method of bucket table
+   */
+  public static final String BUCKET_HASH_METHOD = "bucket_hash_method";
+  public static final String BUCKET_HASH_METHOD_DEFAULT = "spark_hash_expression";
+  public static final String BUCKET_HASH_METHOD_SPARK_EXPRESSION = "spark_hash_expression";
+  public static final String BUCKET_HASH_METHOD_NATIVE = "native";
+
+  /**
+   * bucket properties
+   */
+  public static final String BUCKET_COLUMNS = "bucketcolumns";
 
 Review comment:
   Are these for table properties? I suggest changing it to "bucket_columns".


[GitHub] [carbondata] Zhangshunyu commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

Posted by GitBox <gi...@apache.org>.
Zhangshunyu commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386785923
 
 

 ##########
 File path: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java
 ##########
 @@ -2379,4 +2379,18 @@ private CarbonCommonConstants() {
    */
   public static final String CARBON_SI_SEGMENT_MERGE_DEFAULT = "false";
 
+  /**
+   * Hash method of bucket table
+   */
+  public static final String BUCKET_HASH_METHOD = "bucket_hash_method";
+  public static final String BUCKET_HASH_METHOD_DEFAULT = "spark_hash_expression";
+  public static final String BUCKET_HASH_METHOD_SPARK_EXPRESSION = "spark_hash_expression";
+  public static final String BUCKET_HASH_METHOD_NATIVE = "native";
+
+  /**
+   * bucket properties
+   */
+  public static final String BUCKET_COLUMNS = "bucketcolumns";
 
 Review comment:
   @jackylk OK, I will change the property name. Both the table property and the Hive syntax are already supported.
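
To illustrate the two forms under discussion, here is a minimal sketch that builds both DDL variants as strings. `CLUSTERED BY (...) INTO n BUCKETS` is the Hive bucketing syntax; the property form uses the suggested `bucket_columns` name together with a `bucket_number` property, which is an assumption here and does not appear in the diff above. `BucketDdlSketch`, the table name and the columns are hypothetical:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not code from this PR.
final class BucketDdlSketch {
  /** Hive-style bucketing clause. */
  static String hiveStyle(String table, String columns, List<String> bucketColumns, int buckets) {
    return "CREATE TABLE " + table + " (" + columns + ") "
        + "CLUSTERED BY (" + String.join(", ", bucketColumns) + ") INTO " + buckets + " BUCKETS";
  }

  /** Table-property style; 'bucket_number' is an assumed property name. */
  static String propertyStyle(String table, String columns, List<String> bucketColumns, int buckets) {
    return "CREATE TABLE " + table + " (" + columns + ") "
        + "TBLPROPERTIES ('bucket_columns'='" + String.join(",", bucketColumns)
        + "', 'bucket_number'='" + buckets + "')";
  }

  public static void main(String[] args) {
    List<String> bucketColumns = Arrays.asList("name");
    System.out.println(hiveStyle("t1", "id INT, name STRING", bucketColumns, 4));
    System.out.println(propertyStyle("t1", "id INT, name STRING", bucketColumns, 4));
  }
}
```

Any storage-format clause is deliberately left out of both statements; it would need to be added for the target engine.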


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-593050200
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2245/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [WIP] Support Bucket Table

Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on issue #3637: [WIP] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590640601
 
 
   Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/447/
   


[GitHub] [carbondata] ravipesala commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table

Posted by GitBox <gi...@apache.org>.
ravipesala commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592452095
 
 
   @Zhangshunyu Bucketing is already supported in Carbon. I wonder why all this code is added again to support it. If we are facing any issues, please first add the test cases that are not working, or raise a JIRA.


[GitHub] [carbondata] Zhangshunyu edited a comment on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table

Posted by GitBox <gi...@apache.org>.
Zhangshunyu edited a comment on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592454295
 
 
   @ravipesala Please check all the new test cases added in TableBucketingTestCase and the comment I added in the PR description. We have this feature, but it does not work as expected:
   1. All data is stored in one file.
   2. A join with Parquet returns the wrong result.
   3. After compaction, data is stored in the file of bucket id 0.
   4. The new insert flow does not work.
   5. For the others, please check the test cases added.


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table

Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-591780044
 
 
   Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/506/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [WIP] Support Bucket Table

Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on issue #3637: [WIP] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590333619
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2132/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table

Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-591972424
 
 
   Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/512/
   
