Posted to issues@iceberg.apache.org by "namrathamyske (via GitHub)" <gi...@apache.org> on 2023/06/23 00:22:27 UTC

[GitHub] [iceberg] namrathamyske opened a new issue, #7885: Rate limiting feature for structured streaming

namrathamyske opened a new issue, #7885:
URL: https://github.com/apache/iceberg/issues/7885

   ### Apache Iceberg version
   
   main (development)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   This concerns the rate limiting for structured streaming added in https://github.com/apache/iceberg/pull/4479.
   According to the unit test at https://github.com/apache/iceberg/pull/4479/files#diff-26782bf5c27f69e5cc9cd4a9363f601a97d1c9f97fe0c1a7fb927da7c60c014fR169, the stream gets stuck if SparkReadOptions.STREAMING_MAX_ROWS_PER_MICRO_BATCH is not respected. This is a major blocker to consuming this feature: once the stream is stuck, it does not advance any further, even when new snapshots come in.
   
   E.g.:
   STREAMING_MAX_ROWS_PER_MICRO_BATCH = 2
   Snapshot1 - (2 records, 1 file) - read fully in Microbatch-1
   Snapshot2 - (3 records, 1 file) - can never be read, as 3 records > STREAMING_MAX_ROWS_PER_MICRO_BATCH (stuck forever)
   Snapshot3 - 3 records
   Please let me know whether this is the intended behavior or whether it is expected to change.
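
The scenario above can be reproduced with a small toy model. This is only a sketch of the behavior as described in this issue, not Iceberg's actual planner: the function name `plan_micro_batch` and the representation of each data file by its record count are assumptions made for illustration. Whole files are admitted in order, and planning stops as soon as the next file would push the batch past the limit.

```python
from typing import List, Tuple


def plan_micro_batch(pending_file_rows: List[int], max_rows: int) -> Tuple[List[int], List[int]]:
    """Toy planner (hypothetical, not Iceberg code).

    Admits whole files in order while the batch stays within max_rows.
    Files are never split, so a file with more rows than max_rows can
    never be admitted under this policy.
    Returns (admitted_files, still_pending_files).
    """
    admitted: List[int] = []
    total = 0
    for rows in pending_file_rows:
        if total + rows > max_rows:
            break  # the next file would exceed the limit, so stop here
        admitted.append(rows)
        total += rows
    return admitted, pending_file_rows[len(admitted):]


# Files from the example: Snapshot1 (2 rows), Snapshot2 (3 rows), Snapshot3 (3 rows).
pending = [2, 3, 3]
batch_1, pending = plan_micro_batch(pending, max_rows=2)
batch_2, pending = plan_micro_batch(pending, max_rows=2)
batch_3, pending = plan_micro_batch(pending, max_rows=2)
print(batch_1, batch_2, batch_3, pending)  # [2] [] [] [3, 3]
```

In this model every batch after the first is empty, and the two 3-row files stay pending no matter how many times planning is retried.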
   
   @singhpk234 @jackye1995 @RussellSpitzer @rdblue 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] namrathamyske closed issue #7885: Rate limiting feature for structured streaming

Posted by "namrathamyske (via GitHub)" <gi...@apache.org>.
namrathamyske closed issue #7885: Rate limiting feature for structured streaming
URL: https://github.com/apache/iceberg/issues/7885




[GitHub] [iceberg] singhpk234 commented on issue #7885: Rate limiting feature for structured streaming

Posted by "singhpk234 (via GitHub)" <gi...@apache.org>.
singhpk234 commented on issue #7885:
URL: https://github.com/apache/iceberg/issues/7885#issuecomment-1607914583

   +1 to @RussellSpitzer's point. At present we don't read a file partially, which makes a file the smallest unit that can be streamed, so we should at least make sure that the number of records per micro-batch > the record count of the largest data file.




[GitHub] [iceberg] namrathamyske commented on issue #7885: Rate limiting feature for structured streaming

Posted by "namrathamyske (via GitHub)" <gi...@apache.org>.
namrathamyske commented on issue #7885:
URL: https://github.com/apache/iceberg/issues/7885#issuecomment-1608311520

   Thanks for responding. @RussellSpitzer @singhpk234 




[GitHub] [iceberg] jackieo168 commented on issue #7885: Rate limiting feature for structured streaming

Posted by "jackieo168 (via GitHub)" <gi...@apache.org>.
jackieo168 commented on issue #7885:
URL: https://github.com/apache/iceberg/issues/7885#issuecomment-1658755211

   Hi @RussellSpitzer and @singhpk234, just to follow up on this: it's not always practical to determine the record count of the largest data file ahead of time and adjust `SparkReadOptions.STREAMING_MAX_ROWS_PER_MICRO_BATCH` each time. Would it be possible to instead load the full file in such scenarios?
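
One way to picture the suggestion above is a planner that treats the limit as soft for the first pending file only. The sketch below is purely hypothetical: `plan_with_full_file_fallback` is an illustrative name, each file is represented only by its record count, and nothing here reflects an existing Iceberg option or implementation.

```python
from typing import List, Tuple


def plan_with_full_file_fallback(pending_file_rows: List[int], max_rows: int) -> Tuple[List[int], List[int]]:
    """Hypothetical policy: always admit the first pending file in full.

    The first file is taken even if it alone exceeds max_rows, so the
    stream always makes progress. Later files are admitted only while
    the batch stays within max_rows.
    Returns (admitted_files, still_pending_files).
    """
    admitted: List[int] = []
    total = 0
    for rows in pending_file_rows:
        # Only enforce the limit once the batch already holds a file.
        if admitted and total + rows > max_rows:
            break
        admitted.append(rows)
        total += rows
    return admitted, pending_file_rows[len(admitted):]


# Same files as in the issue description, with a limit of 2 rows.
pending = [2, 3, 3]
batches = []
while pending:
    batch, pending = plan_with_full_file_fallback(pending, max_rows=2)
    batches.append(batch)
print(batches)  # [[2], [3], [3]]
```

The trade-off in this sketch is that the limit stops being a hard cap: a batch can exceed `max_rows` by up to the size of one oversized file, in exchange for the stream never stalling.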




[GitHub] [iceberg] RussellSpitzer commented on issue #7885: Rate limiting feature for structured streaming

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on issue #7885:
URL: https://github.com/apache/iceberg/issues/7885#issuecomment-1603638790

   Yes, I believe the smallest unit that is allowed to be streamed is a single data file, so the rate limit must be larger than the largest possible data file.
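
A pre-flight check along these lines could flag a limit that is too small before the stream is started. This is a minimal sketch, assuming the per-file record counts have already been collected (for example from the `record_count` column of the table's `files` metadata table); the helper names are hypothetical. It treats a file whose record count equals the limit as readable, as Snapshot1 in the issue's example suggests, which is an assumption rather than a confirmed rule.

```python
from typing import Iterable, List


def oversized_files(record_counts: Iterable[int], max_rows: int) -> List[int]:
    """Return the record counts of files that could never fit in one micro-batch."""
    return [n for n in record_counts if n > max_rows]


def smallest_safe_limit(record_counts: Iterable[int]) -> int:
    """Smallest limit that still admits every file (0 if there are no files)."""
    return max(record_counts, default=0)


# Record counts of the three files from the issue's example.
counts = [2, 3, 3]
print(oversized_files(counts, max_rows=2))  # [3, 3]
print(smallest_safe_limit(counts))          # 3
```

Because new files keep arriving in a streaming table, a check like this only covers the files that exist when it runs.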

