Posted to dev@hive.apache.org by "xiaoyu wang (Commented) (JIRA)" <ji...@apache.org> on 2012/02/02 06:23:53 UTC

[jira] [Commented] (HIVE-2775) allow the number of files to be a multiple of bucketed table

    [ https://issues.apache.org/jira/browse/HIVE-2775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13198542#comment-13198542 ] 

xiaoyu wang commented on HIVE-2775:
-----------------------------------

{code}
index d0ff67e..bcddc5b 100644
@@ -349,7 +349,25 @@ public class Partition implements Serializable {
    * we are just storing it as a property of the table as a short term measure.
    */
   public int getBucketCount() {
-    return table.getNumBuckets();
+      int logicalBucketNumber = table.getNumBuckets();
+      String pathPattern = this.getPartitionPath().toString() + "/*";
+      try {
+          FileSystem fs = FileSystem.get(this.table.getDataLocation(),Hive.get().getConf());
+          FileStatus srcs[] = fs.globStatus(new Path(pathPattern));
+          int physicalBucketNumber = srcs.length;
+          if (physicalBucketNumber % logicalBucketNumber == 0) {
+              return physicalBucketNumber;
+          } else {
+              throw new RuntimeException("Cannot get bucket count for table " + this.table.getTableName() +
+                      " logical bucket is " + logicalBucketNumber + " physical bucket number is " + physicalBucketNumber);
+          }
+      }catch (Exception e)
+      {
+          throw new RuntimeException("Cannot get bucket count for table " + this.table.getTableName(), e) ;
+      }
+
+
+//    return table.getNumBuckets();
     /*
      * TODO: Keeping this code around for later use when we will support
      * sampling on tables which are not created with CLUSTERED INTO clause
{code}
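
For discussion, the divisibility check in the patch can be looked at separately from the filesystem lookup. The following is only a sketch under stated assumptions: `BucketCountCheck` and `resolveBucketCount` are hypothetical names that do not exist in Hive, and the guard against a non-positive logical bucket count is an addition of this sketch (the patch as posted would hit an integer division error in that case and wrap it in a RuntimeException).

```java
// Hypothetical helper, not part of Hive or of the patch above.
// It isolates the "physical count must be a multiple of logical count" rule.
final class BucketCountCheck {

    // logicalBuckets: bucket count recorded in the metastore for the table.
    // physicalFiles:  number of files found in the partition directory.
    static int resolveBucketCount(int logicalBuckets, int physicalFiles) {
        // Assumption of this sketch: a non-positive value means "not bucketed".
        if (logicalBuckets <= 0) {
            throw new IllegalArgumentException(
                    "logical bucket count must be positive, got " + logicalBuckets);
        }
        // Same condition as the patch, written with the remainder operator.
        if (physicalFiles % logicalBuckets != 0) {
            throw new IllegalStateException(
                    "logical bucket is " + logicalBuckets
                            + " physical bucket number is " + physicalFiles);
        }
        return physicalFiles;
    }

    public static void main(String[] args) {
        // 64 files for a table declared with 32 buckets is accepted.
        System.out.println(resolveBucketCount(32, 64));
        try {
            // 48 is not a multiple of 32, so this is rejected.
            resolveBucketCount(32, 48);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Keeping the arithmetic in a pure function like this would also make the rule easy to unit test without a FileSystem instance.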
                
> allow the number of files to be a multiple of bucketed table
> ------------------------------------------------------------
>
>                 Key: HIVE-2775
>                 URL: https://issues.apache.org/jira/browse/HIVE-2775
>             Project: Hive
>          Issue Type: New Feature
>          Components: Metastore
>            Reporter: xiaoyu wang
>
> Currently, a Hive bucketed table requires the number of files to match the bucket number in order for sampling to be correct. This is very restrictive: e.g., we can only populate the table using a fixed number of reducers, which can be a bottleneck. 
> The idea is to introduce the concepts of a "physical bucket" and a "logical bucket". The "physical bucket" count is the number of files, and the "logical bucket" count is the number of buckets stored in the metadata for the bucketed table. By allowing the "physical bucket" count to be a multiple of the "logical bucket" count, we can still sample correctly while scaling up. 
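
To illustrate why a multiple is enough for correct sampling, here is a small sketch. It assumes (this is not stated in the issue) that each row is written to physical file number `hash mod P`, where P is the physical file count, and that logical bucket membership is `hash mod L`. Under that assumption, when L divides P, file i only contains rows of logical bucket `i mod L`. The class and method names are hypothetical and are not Hive APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not Hive code.
final class LogicalBucketMapping {

    // Logical bucket served by physical file index i (assumes L divides P).
    static int logicalBucketOf(int physicalIndex, int logicalBuckets) {
        return physicalIndex % logicalBuckets;
    }

    // All physical file indexes that together make up one logical bucket.
    static List<Integer> filesForLogicalBucket(int logicalBucket,
                                               int logicalBuckets,
                                               int physicalFiles) {
        List<Integer> files = new ArrayList<>();
        for (int i = logicalBucket; i < physicalFiles; i += logicalBuckets) {
            files.add(i);
        }
        return files;
    }

    public static void main(String[] args) {
        int logical = 4;   // buckets declared in the metadata
        int physical = 8;  // files actually written, a multiple of 4

        // Check the identity (h mod P) mod L == h mod L on a few hash values.
        int[] hashes = {0, 7, 13, 42, -5, 1000003};
        for (int h : hashes) {
            int file = Math.floorMod(h, physical);
            int viaFile = logicalBucketOf(file, logical);
            int direct = Math.floorMod(h, logical);
            if (viaFile != direct) {
                throw new AssertionError("mismatch for hash " + h);
            }
        }

        // Sampling logical bucket 1 would read these physical files.
        System.out.println(filesForLogicalBucket(1, logical, physical));
    }
}
```

If the writer assigns rows to files in some other way, the mapping above does not apply and sampling would need a different rule.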
