You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2021/11/17 19:03:34 UTC

[GitHub] [pinot] mneedham opened a new issue #7785: [Docs] Realtime segment thresholds

mneedham opened a new issue #7785:
URL: https://github.com/apache/pinot/issues/7785


   Update the docs to explain the interaction between the various realtime segment thresholds
   
   https://docs.pinot.apache.org/configuration-reference/table#realtime-table-config
   
   Diogo Baeder:
   ```
   Hi folks! I have a question about segment thresholds. If I have something like this:
                       'realtime.segment.flush.threshold.rows': '10000',
                       'realtime.segment.flush.threshold.time': '24h',
                       'realtime.segment.flush.desired.size': '100M',
   does this mean that the first value that gets reached from the above ones determines that the segment will be flushed? Or is it the last value reached that determines that? For example, if a segment has been filling for 24h already, but has only 200 rows and 10M in size, does it get flushed because it reached the 24h mark?
   ```
   
   Mark:
   ```
   The High Level Realtime Segment Data Manager
   Checks realtime.segment.flush.threshold.rows on every record that gets processed. If that value is exceeded it flushes.
   If it hasn't been exceeded then there's a task that checks once a minute if realtime.segment.flush.threshold.time has been exceeded and flushes if it has. That task also checks the rows threshold too  in case it was missed by the first check I guess.
   If the time threshold has been exceeded and no new documents have been indexed it won't flush the segment. But if there have been > 0 documents indexed it will flush.
   And then the Low Level Realtime Segment Data Manager  checks:
   realtime.segment.flush.desired.size gets translated into a number of rows threshold based on the size of each row in the previous segment.
    * The formula used to compute new number of rows is:
    * targetNumRows = ideal_segment_size * (a * current_rows_to_size_ratio + b * previous_rows_to_size_ratio)
    * where a = 0.25, b = 0.75, prev ratio= ratio collected over all previous segment completions
   I'm not entirely sure how it switches between those two data managers, but I expect @Neha Pawar or @Mayank will know.
   But assuming that it's using the high level one, for your example:
   For example, if a segment has been filling for 24h already, but has only 200 rows and 10M in size, does it get flushed because it reached the 24h mark?
   It would flush at the 24h mark from my understanding.
   ```
   
   Kishore:
   ```
   Yes, what ever reaches first.. but note that under the hood there are only two thresholds
   
   rows
   Time
    Size gets converted into rows by looking at previous segments.. it takes a few iterations for Pinot to get this right as it needs to learn the mapping between rows and size
   ```
   
   Mayank:
   ```
   Typically whatever threshold meets first is honored. The only exception is desired size, which take effect only when rows is set to zero (as in the docs).
   ```
   
   Neha:
   ```
   Rows has to be 0 for size to take effect
   No matter what size you specify, it starts off with 100k rows, and then slowly ramps up the rows to get to the desired size
   If you specify only rows, and no size, then the rows get divided amongst all consuming partitions on a server (so if you specify rows 10k, and use 1 server and have 3 partitions and 1 replica, each segment will use 3333 as rows threshold)
   Regardless of whether you go with size or just rows, you can (and always should) set a time threshold, so that it serves as the ultimate safety check.
   Having said all this, typically you can just go with 0 rows, 24h time, and 200M size
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] mcvsubbu commented on issue #7785: [Docs] Realtime segment thresholds

Posted by GitBox <gi...@apache.org>.
mcvsubbu commented on issue #7785:
URL: https://github.com/apache/pinot/issues/7785#issuecomment-971891864


   @mneedham you may want to take a look at this
   https://docs.pinot.apache.org/operators/operating-pinot/tuning/realtime#tuning-realtime-performance


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] mneedham commented on issue #7785: [Docs] Realtime segment thresholds

Posted by GitBox <gi...@apache.org>.
mneedham commented on issue #7785:
URL: https://github.com/apache/pinot/issues/7785#issuecomment-971905640


   @mcvsubbu ah cool. So maybe we rather just need to link to that page from this one https://docs.pinot.apache.org/configuration-reference/table#realtime-table-config


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org