Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/12/07 08:30:02 UTC

[GitHub] [iceberg] xloya opened a new issue #3682: Should a partition transform based on a fixed number of rows be added for large, continuously growing datasets with id as the primary key?

xloya opened a new issue #3682:
URL: https://github.com/apache/iceberg/issues/3682


   I have a question about writing an extremely large amount of data into Iceberg: we have a scenario where 2 billion rows need to be written to Iceberg, with CDC records consumed through Flink. These rows have only a Long auto-increment id as the primary key, and the table is likely to grow to 4 billion rows in the future.  
   I took a look at the current partition transforms, and none of them seems well suited to this kind of non-temporal primary-key scenario with a large amount of data that needs to be updated.  
   Could we partition by the auto-increment primary key using a specified number of rows per partition? Or do you have any better solutions for reference? Thanks!
   cc @rdblue @kbendick @jackye1995 @RussellSpitzer 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] xloya commented on issue #3682: Should a partition transform based on a fixed number of rows be added for large, continuously growing datasets with id as the primary key?

Posted by GitBox <gi...@apache.org>.
xloya commented on issue #3682:
URL: https://github.com/apache/iceberg/issues/3682#issuecomment-988464646


   > > For example, we could partition with 500,000 rows as a threshold and write each row into the partition corresponding to its id: ids 1 - 500,000 go into partition_1, ids 500,001 - 1,000,000 into partition_2, and so on.
   > 
   > This sounds like you want your data range-partitioned. Would the `truncate` transform solve the issue?
   
   Hey Jack, thanks for the reply! I had misunderstood the `truncate` transform; after actually testing it, it is indeed feasible. Thanks!




[GitHub] [iceberg] xloya commented on issue #3682: Should a partition transform based on a fixed number of rows be added for large, continuously growing datasets with id as the primary key?

Posted by GitBox <gi...@apache.org>.
xloya commented on issue #3682:
URL: https://github.com/apache/iceberg/issues/3682#issuecomment-987686693


   For example, we could partition with 500,000 rows as a threshold and write each row into the partition corresponding to its id.




[GitHub] [iceberg] jackye1995 commented on issue #3682: Should a partition transform based on a fixed number of rows be added for large, continuously growing datasets with id as the primary key?

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on issue #3682:
URL: https://github.com/apache/iceberg/issues/3682#issuecomment-988444444


   > For example, we could partition with 500,000 rows as a threshold and write each row into the partition corresponding to its id: ids 1 - 500,000 go into partition_1, ids 500,001 - 1,000,000 into partition_2, and so on.
   
   This sounds like you want your data range-partitioned. Would the `truncate` transform solve the issue?
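   For integer columns, the Iceberg spec defines the `truncate` transform with width W as v - (v % W), which floors each value to a multiple of W, so consecutive id ranges land in the same partition. A minimal sketch of that mapping in plain Python (not the Iceberg library itself; the 500,000 width matches the example above):

   ```python
   def truncate(width: int, value: int) -> int:
       """Iceberg-style integer truncate: floor value to a multiple of width."""
       return value - (value % width)

   # With width 500_000, ids 0..499_999 share partition value 0,
   # ids 500_000..999_999 share partition value 500_000, and so on.
   for example_id in (1, 499_999, 500_000, 1_234_567):
       print(example_id, "->", truncate(500_000, example_id))
   ```

   Note the boundaries differ slightly from the 1 - 500,000 ranges described above (partitions start at multiples of the width, not at 1), but the effect is the same: a bounded number of rows per partition. Check the Iceberg docs for the exact DDL syntax for a `truncate` partition spec in your engine version.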




[GitHub] [iceberg] xloya edited a comment on issue #3682: Should a partition transform based on a fixed number of rows be added for large, continuously growing datasets with id as the primary key?

Posted by GitBox <gi...@apache.org>.
xloya edited a comment on issue #3682:
URL: https://github.com/apache/iceberg/issues/3682#issuecomment-987686693


   For example, we could partition with 500,000 rows as a threshold and write each row into the partition corresponding to its id: ids 1 - 500,000 go into partition_1, ids 500,001 - 1,000,000 into partition_2, and so on.




[GitHub] [iceberg] xloya closed issue #3682: Should a partition transform based on a fixed number of rows be added for large, continuously growing datasets with id as the primary key?

Posted by GitBox <gi...@apache.org>.
xloya closed issue #3682:
URL: https://github.com/apache/iceberg/issues/3682


   

