Posted to issues@paimon.apache.org by "gnailJC (via GitHub)" <gi...@apache.org> on 2023/11/24 01:44:59 UTC

[I] [Bug] dynamic-bucket.* configurations don't work on Spark [incubator-paimon]

gnailJC opened a new issue, #2385:
URL: https://github.com/apache/incubator-paimon/issues/2385

   ### Search before asking
   
   - [X] I searched in the [issues](https://github.com/apache/incubator-paimon/issues) and found nothing similar.
   
   
   ### Paimon version
   
   [paimon-spark-3.3-0.6-20231122.093342-69.jar](https://repository.apache.org/content/groups/snapshots/org/apache/paimon/paimon-spark-3.3/0.6-SNAPSHOT/paimon-spark-3.3-0.6-20231122.093342-69.jar)
   
   ### Compute Engine
   
   https://www.apache.org/dyn/closer.lua/spark/spark-3.3.3/spark-3.3.3-bin-hadoop3.tgz
   
   ### Minimal reproduce step
   
   ```python
   spark.sql("""
           CREATE TABLE IF NOT EXISTS paimon.tdr.ods_tbl (
             _id STRING NOT NULL,
             update_time TIMESTAMP)
           USING paimon
           TBLPROPERTIES (
             'bucket' = '-1',
             'dynamic-bucket.assigner-parallelism' = '32',
             'dynamic-bucket.target-row-num' = '2000000',
             'merge-engine' = 'partial-update',
             'path' = '',
             'primary-key' = '_id',
             'tag.creation-period' = 'daily',
             'tag.num-retained-max' = '30',
             'write.merge-schema' = 'true',
             'write.merge-schema.explicit-cast' = 'true')
      """).show()
   
   
    # Take the Paimon table's schema so the MongoDB read matches it
    schema = spark.sql('select * from paimon.tdr.ods_tbl limit 0').schema
   
   full_data_reader = (
       spark.read
       .format('mongodb')
       .schema(schema)
       .option('database', '')
       .option('collection', '')
       .option('connection.uri', '')
   )
   full_data_df = full_data_reader.load()
   
   full_data_writer = (
       full_data_df.write
       .option('write-buffer-size', '256MB')
       .option('target-file-size', '256MB')
       .option('num-sorted-run.stop-trigger', '2147483647')
       .option('sort-spill-threshold', '2')
       .option('write-buffer-spillable', 'true')
   )
   full_data_writer.save(
      'oss://**', format='paimon', mode='append'
   )
   ```
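
   Before comparing bucket counts, it is worth confirming that the `dynamic-bucket.*` properties were actually persisted on the table. A minimal check, reusing the catalog and table names from the repro above (assuming `SHOW TBLPROPERTIES` resolves through the Paimon catalog):

    ```python
    # Show which TBLPROPERTIES were persisted on the Paimon table.
    # If the dynamic-bucket.* keys are missing here, they were dropped
    # at CREATE TABLE time rather than ignored by the writer.
    spark.sql("SHOW TBLPROPERTIES paimon.tdr.ods_tbl").show(50, truncate=False)
    ```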
   
   `full_data_df.count()` is almost 20 million rows.
   
   However, 200+ buckets were generated in the end, not 32 (`dynamic-bucket.assigner-parallelism`) or roughly 10 (20 million rows divided by `dynamic-bucket.target-row-num` = 2 million).
   
   Is this the expected behavior?
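
   For reference, one way to confirm the actual bucket count from inside Spark, assuming Paimon's `$files` system table is accessible through the catalog (a sketch, not part of the original repro):

    ```python
    # Count distinct bucket ids recorded in the table's files metadata.
    # The `$files` system table and its `bucket` column are assumptions
    # about the Paimon version in use; adjust the name if it differs.
    spark.sql(
        "SELECT COUNT(DISTINCT bucket) AS bucket_count "
        "FROM paimon.tdr.`ods_tbl$files`"
    ).show()
    ```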
   
   ### What doesn't meet your expectations?
   
   I expected `dynamic-bucket.assigner-parallelism` and `dynamic-bucket.target-row-num` to control the number of buckets during dynamic bucket initialization.
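
   That expectation amounts to a simple division of the row count by the target rows per bucket; a back-of-the-envelope version using the numbers from the repro:

    ```python
    import math

    total_rows = 20_000_000       # approximate full_data_df.count()
    target_row_num = 2_000_000    # dynamic-bucket.target-row-num
    assigner_parallelism = 32     # dynamic-bucket.assigner-parallelism

    # Either bound is far below the 200+ buckets actually observed.
    print(math.ceil(total_rows / target_row_num))  # 10
    print(assigner_parallelism)                    # 32
    ```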
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [ ] I'm willing to submit a PR!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@paimon.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Bug] dynamic-bucket.* configurations don't work on Spark [incubator-paimon]

Posted by "zhuangchong (via GitHub)" <gi...@apache.org>.
zhuangchong closed issue #2385: [Bug] dynamic-bucket.* configurations don't work on Spark
URL: https://github.com/apache/incubator-paimon/issues/2385




Re: [I] [Bug] dynamic-bucket.* configurations don't work on Spark [incubator-paimon]

Posted by "zhuangchong (via GitHub)" <gi...@apache.org>.
zhuangchong commented on issue #2385:
URL: https://github.com/apache/incubator-paimon/issues/2385#issuecomment-1825157557

   This PR has been completed, so I will close this issue.
   
   

