You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by "gianm (via GitHub)" <gi...@apache.org> on 2023/05/18 15:23:21 UTC

[GitHub] [druid] gianm opened a new pull request, #14310: MSQ: Change default clusterStatisticsMergeMode to SEQUENTIAL.

gianm opened a new pull request, #14310:
URL: https://github.com/apache/druid/pull/14310

   This is an undocumented parameter that controls how cluster-by statistics are merged. In PARALLEL mode, statistics are gathered from workers all at once. In SEQUENTIAL mode, statistics are gathered time chunk by time chunk. This improves accuracy for jobs with many time chunks, and reduces memory usage.
   
   The main downside of SEQUENTIAL is that it can take longer, but in most situations I've seen, PARALLEL is only really usable in cases where the sketches are small enough that SEQUENTIAL would also run relatively quickly. So it seems like SEQUENTIAL is a better default.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] cryptoe commented on pull request #14310: MSQ: Change default clusterStatisticsMergeMode to SEQUENTIAL.

Posted by "cryptoe (via GitHub)" <gi...@apache.org>.
cryptoe commented on PR #14310:
URL: https://github.com/apache/druid/pull/14310#issuecomment-1605208928

   We should also change
   * MSQTestBase#245 to PARALLEL_MERGE_CONTEXT since now the default has changed/ 
   
   This can be done in another PR as well


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm merged pull request #14310: MSQ: Change default clusterStatisticsMergeMode to SEQUENTIAL.

Posted by "gianm (via GitHub)" <gi...@apache.org>.
gianm merged PR #14310:
URL: https://github.com/apache/druid/pull/14310


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] cryptoe commented on pull request #14310: MSQ: Change default clusterStatisticsMergeMode to SEQUENTIAL.

Posted by "cryptoe (via GitHub)" <gi...@apache.org>.
cryptoe commented on PR #14310:
URL: https://github.com/apache/druid/pull/14310#issuecomment-1607135373

   @gianm Quite possible we missed this case since we throw InsertCannotByEmptyFault only after getting the partition boundaries. 
   
   ```
       if (isTimeBucketed && partitionBoundaries.equals(ClusterByPartitions.oneUniversalPartition())) {
               throw new MSQException(new InsertCannotBeEmptyFault(task.getDataSource()));
             } else {
               log.info("Query [%s] generating %d segments.", queryDef.getQueryId(), partitionBoundaries.size());
             }
   
   
   ```
   
   The fix would be here WorkerSketcherFetcher#235 . Need to check if CompleteKeyStatisticsInformation is empty. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on pull request #14310: MSQ: Change default clusterStatisticsMergeMode to SEQUENTIAL.

Posted by "gianm (via GitHub)" <gi...@apache.org>.
gianm commented on PR #14310:
URL: https://github.com/apache/druid/pull/14310#issuecomment-1606687657

   > We should also change
   > 
   >     * MSQTestBase#245 to PARALLEL_MERGE_CONTEXT since now the default has changed/
   > 
   > 
   > This can be done in another PR as well
   
   Ah, I'll do it in this patch, since it needs updates anyway due to one of the test cases failing. Currently, the test case `MSQFaultsTest.testInsertCannotBeEmptyFault` is timing out, I suppose because the sequential fetching logic doesn't handle the case where no workers have any time chunks. @cryptoe any idea why that might be happening? If not, I'll take a deeper look into it soon.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] cryptoe commented on pull request #14310: MSQ: Change default clusterStatisticsMergeMode to SEQUENTIAL.

Posted by "cryptoe (via GitHub)" <gi...@apache.org>.
cryptoe commented on PR #14310:
URL: https://github.com/apache/druid/pull/14310#issuecomment-1556601264

   Segments cuts for sequential would be atleast equal to segment cuts in parallel mode. I have seen cases where the job runs 30% faster when changing modes from parallel to sequential when number of workers was 1000. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on pull request #14310: MSQ: Change default clusterStatisticsMergeMode to SEQUENTIAL.

Posted by "gianm (via GitHub)" <gi...@apache.org>.
gianm commented on PR #14310:
URL: https://github.com/apache/druid/pull/14310#issuecomment-1607815000

   > The fix would be here WorkerSketcherFetcher#235 . Need to check if CompleteKeyStatisticsInformation is empty.
   
   TY, I added a block there that registers `ClusterByStatisticsSnapshot.empty()` for all workers if `completeKeyStatisticsInformation.getTimeSegmentVsWorkerMap().isEmpty()`. Please let me know if this looks good.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org