Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/05/08 15:18:37 UTC

[GitHub] [arrow] westonpace commented on issue #34892: [C++] Mechanism for throttling remote filesystems to avoid rate limiting

westonpace commented on issue #34892:
URL: https://github.com/apache/arrow/issues/34892#issuecomment-1538550562

   > Imagine a scenario where you have nearly continuous influx of data, which you need to render into parquet and store on S3. A backoff strategy works fine and well for a single write, but when you have loads of data incoming, if you get rate limited, and you backoff, you risk falling behind to a point where it's very difficult to catch up.
   
   > This is, of course, hypothetical, but it illustrates that whilst throttling and retry with backoff would be very useful for 90% of use cases (and I would certainly appreciate them, I just do not possess the programming skill to implement them here :( ), there are some niche circumstances where we may need to consider batching writes more efficiently.
   
   The dataset writer itself issues one "Write" call per row group.  You can batch those into larger writes by setting the `min_rows_per_group` option on the write call.
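   
   As a rough sketch of what that looks like from the Python front end (assuming you are writing through `pyarrow.dataset.write_dataset`; the table, bucket, and region below are placeholders):
   
   ```python
   import pyarrow as pa
   import pyarrow.dataset as ds
   from pyarrow import fs
   
   # Placeholder data and destination; substitute your own.
   table = pa.table({"x": list(range(1_000_000))})
   s3 = fs.S3FileSystem(region="us-east-1")
   
   ds.write_dataset(
       table,
       "my-bucket/my-prefix",
       filesystem=s3,
       format="parquet",
       # Each row group becomes one "Write" call, so larger row groups
       # mean fewer calls against the filesystem.
       min_rows_per_group=100_000,
       max_rows_per_group=1_000_000,
   )
   ```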
   
   The S3 filesystem itself will batch incoming writes until it has accumulated 5 MB of data before uploading a part.  This is controlled by the constant `kMinimumPartUpload`.  Given that S3 supposedly allows 5,500 requests per second, that would seem to imply a limit of about 27.5 GB/s (5,500 requests/s × 5 MB per request), which I assume is more than enough.
   
   It's also possible, if you have many partitions and a low `max_open_files` limit, that many small parquet files are being created.  So you might check whether that is happening (and increase the allowed number of open files if it is); one way to check is sketched below.
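   
   A sketch of that check (assuming the same `pyarrow.fs.S3FileSystem` and placeholder bucket/prefix as above): list what was written and look at the file sizes.
   
   ```python
   from pyarrow import fs
   
   s3 = fs.S3FileSystem(region="us-east-1")  # placeholder region
   
   # List everything under the output prefix and summarize file sizes.
   infos = s3.get_file_info(fs.FileSelector("my-bucket/my-prefix", recursive=True))
   sizes = sorted(i.size for i in infos if i.type == fs.FileType.File)
   if sizes:
       print(f"{len(sizes)} files, median size {sizes[len(sizes) // 2]} bytes")
   ```
   
   If that shows a large number of tiny files, raising `max_open_files` on the write call should help.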
   
   Again, I think more investigation is needed.  How many writes per second are actually being issued?  
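   One way to get that number (a sketch; `initialize_s3` has to run before the S3 filesystem is first used, and where the AWS SDK writes its log depends on your environment) is to turn up the S3 log level and count the requests in the SDK's log output:
   
   ```python
   from pyarrow.fs import initialize_s3, S3LogLevel
   
   # Must be called before any S3 filesystem is created in this process.
   initialize_s3(log_level=S3LogLevel.Debug)
   
   # ...then run the dataset write as usual; the AWS SDK will now log the
   # HTTP requests it issues, which you can count per second.
   ```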

