Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/11/04 21:24:56 UTC

[GitHub] [iceberg] can-sun opened a new issue, #6125: Encountered throttling when writing to S3 without repartitioning

can-sun opened a new issue, #6125:
URL: https://github.com/apache/iceberg/issues/6125

   ### Apache Iceberg version
   
   0.14.1
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   I am using the code snippet below to batch-write data to my S3 bucket, and I ran into S3 throttling:
   
   ```
   org.apache.spark.SparkException: Job aborted due to stage failure: Task 590 in stage 7.0 failed 4 times, most recent failure: Lost task 590.3 in stage 7.0 (TID 1750) (172.34.25.153 executor 108): software.amazon.awssdk.services.s3.model.S3Exception: Please reduce your request rate. (Service: S3, Status Code: 503, Request ID: 0MYS30NPVXFFRM9R, Extended Request ID:  ****)
   ```
   
   Code snippet:
   
   ```scala
   dataFrame
           .sortWithinPartitions(col(eventTimeFeatureName))
           .writeTo(f"$dataCatalogName.$dataBaseName.`$tableName`")
           .option("compression", "none")
           .append()
   ```
   
   `$dataCatalogName.$dataBaseName.$tableName` is the Iceberg table I created in Glue; it is partitioned by truncating the column `eventTimeFeatureName`. We also followed the practice described [here](https://iceberg.apache.org/docs/latest/aws/#object-store-file-layout) and set `'write.object-storage.enabled'=true`. We verified that the Parquet files are written to S3 locations with random prefixes. However, we still hit persistent S3 throttling failures.
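
   For reference, a minimal sketch of how that table property can be set through Spark SQL. `ObjectStorageProps` and `enableObjectStorageSql` are hypothetical names used only for illustration, and the catalog, database, and table identifiers are placeholders rather than our real ones. The helper only builds the statement; with a live session the string would be passed to `spark.sql(...)`.

   ```scala
   // Hypothetical helper (not an Iceberg or Spark API): builds the DDL that
   // enables the object-store file layout on an existing Iceberg table.
   object ObjectStorageProps {
     def enableObjectStorageSql(catalog: String, database: String, table: String): String =
       s"ALTER TABLE $catalog.$database.`$table` " +
         "SET TBLPROPERTIES ('write.object-storage.enabled'='true')"

     def main(args: Array[String]): Unit = {
       // Placeholder identifiers, assumed for the example.
       val ddl = enableObjectStorageSql("my_catalog", "my_db", "my_table")
       println(ddl)
       // With a SparkSession named `spark` in scope: spark.sql(ddl)
     }
   }
   ```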
   
   The data file we used for testing is about 8 GB, and the `eventTimeFeatureName` column spans one year. To reduce the number of files written, I repartitioned the input DataFrame, and the write then succeeds. However, I believe this will greatly impact performance.
   
   ```scala
   tempDataFrame
           .withColumn("trunc_event_time", trunc(col(eventTimeFeatureName), "yyyy-MM-dd"))
           .repartition(col("trunc_event_time"))
           .drop(col("trunc_event_time"))
           .sortWithinPartitions(col(eventTimeFeatureName))
           .writeTo(f"$dataCatalogName.$dataBaseName.`$tableName`")
           .option("compression", "none")
           .append()
   ```
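
   As a rough illustration of why the repartition reduces the file count, the sketch below estimates an upper bound on the number of data files per append. It assumes that each write task produces one file per table partition it holds rows for, that the table has day-granularity partitions over one year, and that there are roughly 600 write tasks (the stack trace only shows a task index of 590). `maxDataFiles` is a hypothetical helper, not a Spark or Iceberg API.

   ```scala
   // Hypothetical helper for a back-of-the-envelope estimate only.
   object FileCountEstimate {
     /** Upper bound on data files for one append, assuming each write task
       * opens one file per table partition it has rows for. */
     def maxDataFiles(writeTasks: Int, partitionsPerTask: Int): Long =
       writeTasks.toLong * partitionsPerTask.toLong

     def main(args: Array[String]): Unit = {
       val tasks = 600           // assumed task count
       val dailyPartitions = 365 // assumed: one table partition per day for a year

       // Without repartitioning, every task may hold rows for every day.
       val unclustered = maxDataFiles(tasks, dailyPartitions)
       // After repartitioning by the truncated date, each day lands in a single
       // task, so there is at most one file per day.
       val clustered = maxDataFiles(dailyPartitions, 1)

       println(s"without repartition: up to $unclustered files")
       println(s"with repartition:    up to $clustered files")
     }
   }
   ```

   Under these assumptions the bound drops from 219000 files to 365, which would explain why the number of S3 PUT requests falls enough to avoid the 503 responses.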
   
   Does the Iceberg team have any suggestions or best practices we can follow?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] github-actions[bot] commented on issue #6125: Encountered throttling when writing to S3 without repartitioning

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on issue #6125:
URL: https://github.com/apache/iceberg/issues/6125#issuecomment-1533902841

   This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.


[GitHub] [iceberg] github-actions[bot] closed issue #6125: Encountered throttling when writing to S3 without repartitioning

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed issue #6125: Encountered throttling when writing to S3 without repartitioning
URL: https://github.com/apache/iceberg/issues/6125


[GitHub] [iceberg] github-actions[bot] commented on issue #6125: Encountered throttling when writing to S3 without repartitioning

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on issue #6125:
URL: https://github.com/apache/iceberg/issues/6125#issuecomment-1552237935

   This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.

