Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/03/24 17:01:44 UTC

[GitHub] [iceberg] dmgcodevil opened a new issue #2375: How to update number of bucket in partition spec ?

dmgcodevil opened a new issue #2375:
URL: https://github.com/apache/iceberg/issues/2375


   I have a partition spec that looks like this:
   
   ```scala
   PartitionSpec.builderFor(Schemas.enrichedTicks)
           .day(FRONTDOOR_TIMESTAMP)
           .bucket(SECURITY_ID, 10)
           .build()
   ```
   
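   I then tried to change the bucket count with `table.updateSpec()`:
   
   ```scala
   // bucket is the static import org.apache.iceberg.expressions.Expressions.bucket
   table.updateSpec()
         .addField(bucket(FieldNames.SECURITY_ID, 1000))
         .commit()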
   ```
   
   I expected this to update the existing partition field, but instead it added a second one:
   
   ```json
   "partition-spec" : [ {
       "name" : "frontdoor_timestamp_day",
       "transform" : "day",
       "source-id" : 19,
       "field-id" : 1000
     }, {
       "name" : "security_id_bucket",
       "transform" : "bucket[10]",
       "source-id" : 1,
       "field-id" : 1001
     },
       {
       "name" : "security_id_bucket_1000",
       "transform" : "bucket[1000]",
       "source-id" : 1,
       "field-id" : 1002
     }
     ]
   ```
   This resulted in the following directory structure in S3:
   
   ```
   security_id_bucket=0
   ---------------------/security_id_bucket_1000=0
   ---------------------/security_id_bucket_1000=1
   ---------------------/security_id_bucket_1000=...
   ---------------------/security_id_bucket_1000=1000
   security_id_bucket=1
   ---------------------/security_id_bucket_1000=0
   ---------------------/security_id_bucket_1000=1
   ---------------------/security_id_bucket_1000=...
   ---------------------/security_id_bucket_1000=1000
   ....
   ```
   
   Actually, I expected the call to fail because I thought I was reusing the same partition name. However, `updateSpec().addField(bucket(FieldNames.SECURITY_ID, 1000))` appends `_bucket_<n>` to the source name (`security_id` in our case), whereas
   `PartitionSpec.builderFor(...).bucket(SECURITY_ID, 10)` appends only `_bucket`, without the number of buckets.
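   
   To make the naming difference concrete, here is a minimal sketch that only models the default names observed in the metadata above. It does not call Iceberg, and `builderDefaultName` and `updateSpecDefaultName` are hypothetical helpers written for illustration, not part of the library:
   
   ```java
   // Illustration only: models the default partition field names seen in the
   // metadata in this issue. It does not call Iceberg; both helpers are hypothetical.
   public class PartitionFieldNames {
   
       // Name observed for PartitionSpec.builderFor(...).bucket(source, n):
       // "security_id_bucket" (the bucket count is not part of the name).
       public static String builderDefaultName(String sourceColumn) {
           return sourceColumn + "_bucket";
       }
   
       // Name observed for updateSpec().addField(bucket(source, n)):
       // "security_id_bucket_1000" (the bucket count is appended).
       public static String updateSpecDefaultName(String sourceColumn, int numBuckets) {
           return sourceColumn + "_bucket_" + numBuckets;
       }
   
       public static void main(String[] args) {
           String oldName = builderDefaultName("security_id");
           String newName = updateSpecDefaultName("security_id", 1000);
           System.out.println(oldName);
           System.out.println(newName);
           // The two names differ, so no "name used more than once" check is triggered.
           System.out.println("collides: " + oldName.equals(newName));
       }
   }
   ```
   
   Because the two default names differ, the second field is accepted as a new partition field instead of being rejected as a duplicate, which would be consistent with the extra `security_id_bucket_1000` entry shown earlier.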
   
   Then I tried the following:
   
   ```scala
   table.updateSpec()
         .removeField("security_id_bucket")
         .addField("security_id_bucket", bucket(FieldNames.SECURITY_ID, 1000)).commit()
   ```
   
   Got this error:
   
   ```
   java.lang.IllegalArgumentException: Cannot use partition name more than once: security_id_bucket
   ```
   
   Next, I tried removing the old field and adding the new one without an explicit name:
   
   ```scala
   table.updateSpec()
         .removeField("security_id_bucket")
         .addField(bucket(FieldNames.SECURITY_ID, 1000))
         .commit()
   ```
   
   Got the following partition spec:
   
   ```json
   {
       "spec-id" : 1,
       "fields" : [ {
         "name" : "frontdoor_timestamp_day",
         "transform" : "day",
         "source-id" : 19,
         "field-id" : 1000
       }, {
         "name" : "security_id_bucket",
         "transform" : "void",
         "source-id" : 1,
         "field-id" : 1001
       }, {
         "name" : "security_id_bucket_1000",
         "transform" : "bucket[1000]",
         "source-id" : 1,
         "field-id" : 1002
       } ]
     }
   ```
   
   When I tried to query the table from Presto, I got this error:
   
   `Query 20210324_052724_00036_giqjj failed: Unsupported partition transform: 1001: security_id_bucket: void(1)`
   
   Also, the directory structure in BCS looked like this:
   
   ```
   security_id_bucket=null/
   ---------------------/security_id_bucket_1000=0
   ---------------------/security_id_bucket_1000=1
   ```
   
   At this point, I gave up and edited `xxx.metadata.json` manually:
   
   ```json
   "partition-spec" : [ {
       "name" : "frontdoor_timestamp_day",
       "transform" : "day",
       "source-id" : 19,
       "field-id" : 1000
     }, {
       "name" : "security_id_bucket",
       "transform" : "bucket[1000]",
       "source-id" : 1,
       "field-id" : 1001
     } ],
     "default-spec-id" : 1,
     "partition-specs" : [ {
       "spec-id" : 0,
       "fields" : [ {
         "name" : "frontdoor_timestamp_day",
         "transform" : "day",
         "source-id" : 19,
         "field-id" : 1000
       }, {
         "name" : "security_id_bucket",
         "transform" : "bucket[10]",
         "source-id" : 1,
         "field-id" : 1001
       } ]
     }, {
       "spec-id" : 1,
       "fields" : [ {
         "name" : "frontdoor_timestamp_day",
         "transform" : "day",
         "source-id" : 19,
         "field-id" : 1000
       }, {
         "name" : "security_id_bucket",
         "transform" : "bucket[1000]",
         "source-id" : 1,
         "field-id" : 1001
       } ]
     } ]
   ```
   What is the proper way to change the number of buckets using the Iceberg API?
   
   

