Posted to commits@pinot.apache.org by "walterddr (via GitHub)" <gi...@apache.org> on 2023/11/01 16:24:33 UTC

[I] [multistage][bug] block splitter estimation is way off [pinot]

walterddr opened a new issue, #11921:
URL: https://github.com/apache/pinot/issues/11921

   When we send data over the mailboxes, we estimate the data size and cut the inbound messages into chunks. However, the row size is estimated as:
   
   ```
   block.getDataSchema().getColumnNames().length * MEDIAN_COLUMN_SIZE_BYTES;
   ```
   ```
   // Use estimated row size, this estimate is not accurate and is used to estimate numRowsPerChunk only.
   int estimatedRowSizeInBytes = block.getDataSchema().getColumnNames().length * MEDIAN_COLUMN_SIZE_BYTES;
   int numRowsPerChunk = maxBlockSize / estimatedRowSizeInBytes;
   while (currentRow < totalNumRows) {
     List<Object[]> chunk = allRows.subList(currentRow, Math.min(currentRow + numRowsPerChunk, allRows.size()));
   ```
   This is not an accurate estimate when there is a high-cardinality string/bytes column whose values can be very large.
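
To make the size of the gap concrete, here is a rough worked example of the arithmetic. The constant value, the block size limit, and the row shape are all assumptions chosen for illustration and are not taken from the issue; `ChunkEstimateDemo` and its methods are hypothetical names.

```java
// Illustration only: every concrete value below is an assumption, not Pinot's actual configuration.
class ChunkEstimateDemo {
  // Assumed per-column size used by the fixed estimate.
  static final int MEDIAN_COLUMN_SIZE_BYTES = 8;

  // Mirrors the arithmetic of the fixed estimate: columns * constant, then rows per chunk.
  static int numRowsPerChunk(int numColumns, int maxBlockSize) {
    int estimatedRowSizeInBytes = numColumns * MEDIAN_COLUMN_SIZE_BYTES;
    return maxBlockSize / estimatedRowSizeInBytes;
  }

  // Payload of one chunk if every row really occupies actualRowSizeInBytes.
  static long actualChunkBytes(int numRowsPerChunk, long actualRowSizeInBytes) {
    return numRowsPerChunk * actualRowSizeInBytes;
  }

  public static void main(String[] args) {
    int maxBlockSize = 4 * 1024 * 1024;           // hypothetical 4 MiB limit
    int rows = numRowsPerChunk(2, maxBlockSize);  // 2 columns, estimated at 16 bytes per row
    long actual = actualChunkBytes(rows, 8 + 10_000L); // one long plus a ~10 KB string per row
    System.out.println(rows + " rows per chunk, about " + actual + " bytes of actual payload");
  }
}
```

Under these assumed numbers, a chunk sized for 4 MiB would carry roughly 2.6 GB, several hundred times the intended limit.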
   
   A simple solution is to use the first row to estimate the row size when variable-length columns are found, but:
   - there's no easy way to tell the cardinality
   - it is expensive to compute the size of an `Object[]` row, since that requires looping through every value.
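
A minimal sketch of the first-row idea, for discussion only. It is not Pinot's implementation: `SampledRowSizeSketch`, its methods, and the 8-byte fallback for non-variable-length values are hypothetical, and it inherits both caveats above (one row may not be representative, and sizing it means visiting every value in that row).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: size chunks from the first row instead of a per-column constant.
class SampledRowSizeSketch {
  // Assumed size for anything that is not a String or byte[].
  static final int FIXED_WIDTH_SIZE_BYTES = 8;

  static int estimateValueSize(Object value) {
    if (value instanceof String) {
      // Encoding the string allocates; this is part of the cost mentioned above.
      return ((String) value).getBytes(StandardCharsets.UTF_8).length;
    }
    if (value instanceof byte[]) {
      return ((byte[]) value).length;
    }
    return FIXED_WIDTH_SIZE_BYTES;
  }

  static int estimateRowSize(Object[] row) {
    int size = 0;
    for (Object value : row) {
      size += estimateValueSize(value);
    }
    return Math.max(size, 1); // avoid dividing by zero for an empty row
  }

  static List<List<Object[]>> split(List<Object[]> allRows, int maxBlockSize) {
    List<List<Object[]>> chunks = new ArrayList<>();
    if (allRows.isEmpty()) {
      return chunks;
    }
    // Always emit at least one row per chunk, even if a single row exceeds the limit.
    int numRowsPerChunk = Math.max(1, maxBlockSize / estimateRowSize(allRows.get(0)));
    for (int currentRow = 0; currentRow < allRows.size(); currentRow += numRowsPerChunk) {
      chunks.add(allRows.subList(currentRow, Math.min(currentRow + numRowsPerChunk, allRows.size())));
    }
    return chunks;
  }
}
```

Sampling a handful of rows rather than only the first would soften the representativeness problem, at the price of more work per block.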


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [I] [multistage][bug] block splitter estimation is way off [pinot]

Posted by "walterddr (via GitHub)" <gi...@apache.org>.
walterddr commented on issue #11921:
URL: https://github.com/apache/pinot/issues/11921#issuecomment-1789261759

   I think this is the root cause of #11919.

