Posted to issues@iceberg.apache.org by "LucasRoesler (via GitHub)" <gi...@apache.org> on 2023/05/19 15:16:31 UTC

[GitHub] [iceberg] LucasRoesler commented on issue #7657: Unable to append to table after cancelled optmization procedure call

LucasRoesler commented on issue #7657:
URL: https://github.com/apache/iceberg/issues/7657#issuecomment-1554741515

   We suspected the `rewrite-datafiles` call because it is the only change we know of that happened around the time we stopped loading data.
   
   Regarding 
   >  If you are confident it does not need to be shuffled you can always set the mode to none.
   
   We are currently just grabbing a batch of messages from Kafka and then setting the loaded time to the current timestamp. I suspect we don't need a shuffle here, since all of the data should just go to the end of the current partition. I am not 100% sure, though.
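
   For reference, a minimal sketch of what setting the mode to none could look like as a table property. The helper name `distribution_mode_sql` and the table name `db.events` are placeholders for illustration, not our actual setup; the helper only builds the Spark SQL statement and does not run it.

```python
# Sketch only: builds the ALTER TABLE statement that sets Iceberg's
# `write.distribution-mode` table property. `db.events` below is a
# hypothetical table name, not the table from this issue.

VALID_MODES = {"none", "hash", "range"}


def distribution_mode_sql(table: str, mode: str = "none") -> str:
    """Return Spark SQL that sets write.distribution-mode on `table`."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown distribution mode: {mode!r}")
    return (
        f"ALTER TABLE {table} "
        f"SET TBLPROPERTIES ('write.distribution-mode'='{mode}')"
    )


if __name__ == "__main__":
    # With a live session this string would be passed to spark.sql(...).
    print(distribution_mode_sql("db.events"))
```

   With an active Spark session the resulting string would be passed to `spark.sql(...)`; whether `none` is actually safe for this table is still the open question above.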
   
   
   Regarding the version, we did update Iceberg from 1.2.0 to 1.2.1 while testing various fixes for this, but it had been running 1.2.0 for a couple of weeks or longer. If I read the changelog correctly, 1.2.0 already had this change to `write.distribution-mode`.
   
   
   
   Part of the reason we suspected that it was trying to sort the full table is that the full table is the only thing with as much data as the job claimed to have loaded. At one point it had sorted through 69 million rows. The Kafka topic simply didn't have that much data in it for the time period specified.
   
   For example, this is one of the screenshots I could find in our Slack:
   <details>
   
   ![image](https://github.com/apache/iceberg/assets/891889/79fc2e95-843b-4713-a358-3135d05d5080)
   
   </details>
   
   Compare that to the same screen for the currently running stream; I can't find any that show more rows loaded:
   <details>
   
   ![image](https://github.com/apache/iceberg/assets/891889/41a3d4da-1019-40cd-a011-fc5282db17ea)
   
   </details>
   
   
   Additionally, a single partition doesn't really have enough data to account for the size of that `ExistingRDD`.
   
   This is the metadata for the last handful of partitions in the new table:
   <details>
   
   ![image](https://github.com/apache/iceberg/assets/891889/b43f679d-7157-4aca-8bc2-b662727c41df)
   
   </details>
   
   For comparison, the same partition data for the original table
   
   <details>
   
   ![image](https://github.com/apache/iceberg/assets/891889/84a56164-fd03-4bd7-a377-024475b32f2a)
   
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

