You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2022/09/06 16:20:53 UTC

[GitHub] [druid] 599166320 opened a new pull request, #13031: ScanQuery supports multi column orderBy queries

599166320 opened a new pull request, #13031:
URL: https://github.com/apache/druid/pull/13031

   
   <!-- Thanks for trying to help us make Apache Druid be the best it can be! Please fill out as much of the following information as is possible (where relevant, and remove it when irrelevant) to help make the intention and scope of this PR clear in order to ease review. -->
   
   <!-- Please read the doc for contribution (https://github.com/apache/druid/blob/master/CONTRIBUTING.md) before making this PR. Also, once you open a PR, please _avoid using force pushes and rebasing_ since these make it difficult for reviewers to see what you've changed in response to their reviews. See [the 'If your pull request shows conflicts with master' section](https://github.com/apache/druid/blob/master/CONTRIBUTING.md#if-your-pull-request-shows-conflicts-with-master) for more details. -->
   
   Fixes [#12958](https://github.com/apache/druid/issues/12958).
   
   <!-- Replace XXXX with the id of the issue fixed in this PR. Remove this section if there is no corresponding issue. Don't reference the issue in the title of this pull-request. -->
   
   <!-- If you are a committer, follow the PR action item checklist for committers:
   https://github.com/apache/druid/blob/master/dev/committer-instructions.md#pr-and-issue-action-item-checklist-for-committers. -->
   
   ### Description
   
   <!-- Describe the goal of this PR, what problem are you fixing. If there is a corresponding issue (referenced above), it's not necessary to repeat the description here, however, you may choose to keep one summary sentence. -->
   
   <!-- Describe your patch: what did you change in code? How did you fix the problem? -->
   
   <!-- If there are several relatively logically separate changes in this PR, create a mini-section for each of them. For example: -->
   
   #### Scanquery supports multi column orderby queries to meet the requirements of sorting detailed data
   #### MultiColumnSorter add more tests and tips
   
   <!--
   In each section, please describe design decisions made, including:
    - Choice of algorithms
    - Behavioral aspects. What configuration values are acceptable? How are corner cases and error conditions handled, such as when there are insufficient resources?
    - Class organization and design (how the logic is split between classes, inheritance, composition, design patterns)
    - Method organization and design (how the logic is split between methods, parameters and return types)
    - Naming (class, method, API, configuration, HTTP endpoint, names of emitted metrics)
   -->
   
   
   <!-- It's good to describe an alternative design (or mention an alternative name) for every design (or naming) decision point and compare the alternatives with the designs that you've implemented (or the names you've chosen) to highlight the advantages of the chosen designs and names. -->
   
   <!-- If there was a discussion of the design of the feature implemented in this PR elsewhere (e. g. a "Proposal" issue, any other issue, or a thread in the development mailing list), link to that discussion from this PR description and explain what have changed in your final design compared to your original proposal or the consensus version in the end of the discussion. If something hasn't changed since the original discussion, you can omit a detailed discussion of those aspects of the design here, perhaps apart from brief mentioning for the sake of readability of this PR description. -->
   
   <!-- Some of the aspects mentioned above may be omitted for simple and small changes. -->
   
   <hr>
   
   ##### Key changed/added classes in this PR
    * `MultiColumnSorter`
    * `ScanQuery`
    * `ScanQueryEngine`
    * `ScanQueryLimitRowIterator`
    * `ScanQueryOrderByLimitRowIterator`
    * `ScanQueryQueryToolChest`
    * `ScanQueryRunnerFactory`
    * `MultiColumnSorterTests`
    * `ScanQueryResultOrderingTest`
    * `ScanQueryTest`
    * `DruidQuery`
   
   <hr>
   
   <!-- Check the items by putting "x" in the brackets for the done things. Not all of these items apply to every PR. Remove the items which are not done or not relevant to the PR. None of the items from the checklist below are strictly necessary, but it would be very helpful if you at least self-review the PR. -->
   
   This PR has:
   - [x] been self-reviewed.
      - [x] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [x] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [x] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [x] added integration tests.
   - [x] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] 599166320 commented on pull request #13031: ScanQuery supports multi column orderBy queries

Posted by GitBox <gi...@apache.org>.
599166320 commented on PR #13031:
URL: https://github.com/apache/druid/pull/13031#issuecomment-1240942976

   > [599166320](https://github.com/599166320), thanks for your contribution! The native query engines are quite complex. See the comments for a few things to think about.
   > 
   > As it turns out, we have been discussing the idea of using an operator approach to native queries. There is an issue with the discussion, and a now-closed PR with a first approach. Adding this sort operation would be _so_ much simpler using that operator approach. That PR actually converted the scan query to use operators.
   > 
   > So, an alternative to this approach is to get that operator-based version of scan query committed, then add a sort operator. The merge operators that exist can be reused for the merge part of the sort/merge operation.
   > 
   > Finally, it is worth noting that if the user asks to sort by a single dimension, and Druid has already sorted that dimension, perhaps we can reuse that sort order to avoid the need to sort again. If the key is compound (multiple keys), then we would need to sort. This is the kind of "optimization" decision that the operator approach makes easier.
   
   
   
   > 
   
   
   Thank you for your wonderful introduction to `operator-based` and `Volcano` (# 12641). This scheme is very helpful to the optimization of query engines. I also see that Druid is moving in this direction.
   
   
   To get back to the point, if the user arranges the order in advance, it really does not need to be sorted again. At present, this PR temporarily solves the orderby problem of Druid ordinary columns. If it contains sorted columns, such as`__ime `, and the sorting algorithm mentioned in this PR will not be executed.
   
   The logic of this judgment is mainly encapsulated in `ScanQuery.scanOrderByNonTime`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] paul-rogers commented on pull request #13031: ScanQuery supports multi column orderBy queries

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on PR #13031:
URL: https://github.com/apache/druid/pull/13031#issuecomment-1264124521

   @599166320, I had time to take another look at this PR and noticed you closed it. I wonder why? Working on a better version to appear in a later commit? Else, it would be great to get your work into Druid.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] 599166320 commented on a diff in pull request #13031: ScanQuery supports multi column orderBy queries

Posted by GitBox <gi...@apache.org>.
599166320 commented on code in PR #13031:
URL: https://github.com/apache/druid/pull/13031#discussion_r966220104


##########
processing/src/main/java/org/apache/druid/query/scan/ScanQueryEngine.java:
##########
@@ -252,7 +260,295 @@ public void cleanup(Iterator<ScanResultValue> iterFromMake)
             ))
     );
   }
+  private Sequence<ScanResultValue> getScanOrderByResultValueSequence(

Review Comment:
   > 
   
   
   This is a good idea. Query engines should not be responsible for sorting. Let me see how to insert a "Sorting Query runner".
   
   I think that currently, orderby is implemented in a violent way. If we follow the  Druid pattern , the insertion and sorting position is not suitable.
   
   The current processing method is as follows:
   
   Segments are sorted in `ScanqueryEngine.process`.
   
   Data nodes are sorted in `ScanqueryRunnerFactory.mergeRunners`.
   
   The broker is sorted in `ScanqueryQueryToolCheck.mergeReults`
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] 599166320 commented on a diff in pull request #13031: ScanQuery supports multi column orderBy queries

Posted by GitBox <gi...@apache.org>.
599166320 commented on code in PR #13031:
URL: https://github.com/apache/druid/pull/13031#discussion_r967662342


##########
processing/src/main/java/org/apache/druid/query/scan/ScanQueryEngine.java:
##########
@@ -252,7 +260,295 @@ public void cleanup(Iterator<ScanResultValue> iterFromMake)
             ))
     );
   }
+  private Sequence<ScanResultValue> getScanOrderByResultValueSequence(

Review Comment:
   > 
   
   Today I thought about why I changed it to this way in the first place.
   Take a look, in fact `orderByTimeLimit` and `orderByMultiColumnLimit` are similar.
   ```
   //past execution path
   if(orderByTimeLimit){
     //...
   }else{
     //...
   }
   
   //current execution path
   if(orderByMultiColumnLimit){
     //
   }else if(orderByTimeLimit){
     ///
   }else{
     ///
   }
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] 599166320 commented on a diff in pull request #13031: ScanQuery supports multi column orderBy queries

Posted by GitBox <gi...@apache.org>.
599166320 commented on code in PR #13031:
URL: https://github.com/apache/druid/pull/13031#discussion_r966246719


##########
processing/src/main/java/org/apache/druid/query/scan/ScanQueryOrderByLimitRowIterator.java:
##########
@@ -0,0 +1,110 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.scan;
+
+import com.google.common.collect.Iterators;
+import org.apache.druid.collections.MultiColumnSorter;
+import org.apache.druid.java.util.common.UOE;
+import org.apache.druid.query.QueryPlus;
+import org.apache.druid.query.QueryRunner;
+import org.apache.druid.query.context.ResponseContext;
+
+
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.stream.Collectors;
+
+public class ScanQueryOrderByLimitRowIterator extends ScanQueryLimitRowIterator

Review Comment:
   > I would suggest not combing the order by operation with the limit operation. The typical way this is done is to sort in the leaf nodes (or, if we like, in the root node), then start pulling from the result set. The limit (and offset) sit on top, and control what we pull.
   
   I think for unordered column sorting, it needs to traverse all the data. If it puts orderby and limit together, it can control the use of memory and avoid excessive memory use.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] 599166320 commented on a diff in pull request #13031: ScanQuery supports multi column orderBy queries

Posted by GitBox <gi...@apache.org>.
599166320 commented on code in PR #13031:
URL: https://github.com/apache/druid/pull/13031#discussion_r966252278


##########
processing/src/main/java/org/apache/druid/collections/MultiColumnSorter.java:
##########
@@ -0,0 +1,113 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.collections;
+
+import com.google.common.annotations.VisibleForTesting;
+import com.google.common.collect.MinMaxPriorityQueue;
+import com.google.common.collect.Ordering;
+import org.apache.druid.java.util.common.ISE;
+
+import java.util.Comparator;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Objects;
+
+public class MultiColumnSorter<T>

Review Comment:
   > 
   
   It is indeed "multi column merger". In many places, the TOPK algorithm is often used in this way. I want to know what better multi column sorting method you can recommend?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] lgtm-com[bot] commented on pull request #13031: ScanQuery supports multi column orderBy queries

Posted by GitBox <gi...@apache.org>.
lgtm-com[bot] commented on PR #13031:
URL: https://github.com/apache/druid/pull/13031#issuecomment-1251793243

   This pull request **introduces 2 alerts** when merging 4c9f40f43f60e8a3f3bdbfc5826c6bda72eb1950 into b9edfe34a40ef033f3d9c362227fc83827c5b2f1 - [view on LGTM.com](https://lgtm.com/projects/g/apache/druid/rev/pr-89cb21d12fbf3c97e4de81416b64028bc7dd156b)
   
   **new alerts:**
   
   * 2 for Dereferenced variable may be null


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] lgtm-com[bot] commented on pull request #13031: ScanQuery supports multi column orderBy queries

Posted by GitBox <gi...@apache.org>.
lgtm-com[bot] commented on PR #13031:
URL: https://github.com/apache/druid/pull/13031#issuecomment-1242926275

   This pull request **introduces 2 alerts** when merging 6448b4bab752a7b9f51e4c43fa229e9951b1d3e2 into 4bde50e683e649c57a5d1a147bb15c6286e9eb97 - [view on LGTM.com](https://lgtm.com/projects/g/apache/druid/rev/pr-b4b372211aeb8db4b7a9b9876a38423fd186a195)
   
   **new alerts:**
   
   * 2 for Dereferenced variable may be null


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] 599166320 commented on a diff in pull request #13031: ScanQuery supports multi column orderBy queries

Posted by GitBox <gi...@apache.org>.
599166320 commented on code in PR #13031:
URL: https://github.com/apache/druid/pull/13031#discussion_r966252278


##########
processing/src/main/java/org/apache/druid/collections/MultiColumnSorter.java:
##########
@@ -0,0 +1,113 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.collections;
+
+import com.google.common.annotations.VisibleForTesting;
+import com.google.common.collect.MinMaxPriorityQueue;
+import com.google.common.collect.Ordering;
+import org.apache.druid.java.util.common.ISE;
+
+import java.util.Comparator;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Objects;
+
+public class MultiColumnSorter<T>

Review Comment:
   > "multi column merger"
   
   It is indeed "multi column merger". In many places, the TOPK algorithm is often used in this way. I want to know what better multi column sorting method you can recommend?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] 599166320 commented on a diff in pull request #13031: ScanQuery supports multi column orderBy queries

Posted by GitBox <gi...@apache.org>.
599166320 commented on code in PR #13031:
URL: https://github.com/apache/druid/pull/13031#discussion_r966252278


##########
processing/src/main/java/org/apache/druid/collections/MultiColumnSorter.java:
##########
@@ -0,0 +1,113 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.collections;
+
+import com.google.common.annotations.VisibleForTesting;
+import com.google.common.collect.MinMaxPriorityQueue;
+import com.google.common.collect.Ordering;
+import org.apache.druid.java.util.common.ISE;
+
+import java.util.Comparator;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Objects;
+
+public class MultiColumnSorter<T>

Review Comment:
   > 
   
    In many places, the TOPK algorithm is often used in this way. I want to know what better multi column sorting method you can recommend?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] paul-rogers commented on a diff in pull request #13031: ScanQuery supports multi column orderBy queries

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on code in PR #13031:
URL: https://github.com/apache/druid/pull/13031#discussion_r965322734


##########
processing/src/main/java/org/apache/druid/query/scan/ScanQueryEngine.java:
##########
@@ -252,7 +260,295 @@ public void cleanup(Iterator<ScanResultValue> iterFromMake)
             ))
     );
   }
+  private Sequence<ScanResultValue> getScanOrderByResultValueSequence(

Review Comment:
   Is the scan query engine the right place for a sort? Scan queries are distributed. Each data node scans a separate segment. Should we do a sort per data node? I suppose yes, since then the Broker need only merge.
   
   The Druid pattern seems to be to insert another `QueryRunner`/`Sequence` pair for a new task. So, we'd want to insert a "sorting query runner" somewhere so that the sorting operation can be separate from the "segment scan" operation. This split avoids making the scan query engine more complex than it already is.
   
   Also, where should the sort occur? Per cursor? Per segment? Per node? If per node, then only the Broker merges. If per node (i.e., if node 1 scans 5 segments, sort the results of all 5), then the operation has to be done above the scan query engine level: at the point where we have visibility to all the data for the node.
   
   If per segment, then we have to merge per-node and in the Broker.
   
   I wonder, does this draft handle these cases?



##########
processing/src/main/java/org/apache/druid/query/scan/ScanQueryOrderByLimitRowIterator.java:
##########
@@ -0,0 +1,110 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.scan;
+
+import com.google.common.collect.Iterators;
+import org.apache.druid.collections.MultiColumnSorter;
+import org.apache.druid.java.util.common.UOE;
+import org.apache.druid.query.QueryPlus;
+import org.apache.druid.query.QueryRunner;
+import org.apache.druid.query.context.ResponseContext;
+
+
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.stream.Collectors;
+
+public class ScanQueryOrderByLimitRowIterator extends ScanQueryLimitRowIterator

Review Comment:
   I would suggest not combing the order by operation with the limit operation. The typical way this is done is to sort in the leaf nodes (or, if we like, in the root node), then start pulling from the result set. The limit (and offset) sit on top, and control what we pull.



##########
processing/src/main/java/org/apache/druid/collections/MultiColumnSorter.java:
##########
@@ -0,0 +1,113 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.collections;
+
+import com.google.common.annotations.VisibleForTesting;
+import com.google.common.collect.MinMaxPriorityQueue;
+import com.google.common.collect.Ordering;
+import org.apache.druid.java.util.common.ISE;
+
+import java.util.Comparator;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Objects;
+
+public class MultiColumnSorter<T>

Review Comment:
   This class is based on a priority queue. Thus, it would seem to be a "multi column merger" of previously-sorted values. Or, are we using a priority queue to sort?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] 599166320 closed pull request #13031: ScanQuery supports multi column orderBy queries

Posted by GitBox <gi...@apache.org>.
599166320 closed pull request #13031: ScanQuery supports multi column orderBy queries
URL: https://github.com/apache/druid/pull/13031


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] 599166320 commented on pull request #13031: ScanQuery supports multi column orderBy queries

Posted by GitBox <gi...@apache.org>.
599166320 commented on PR #13031:
URL: https://github.com/apache/druid/pull/13031#issuecomment-1264574543

   > 
   
   
   I'm making some improvements. I'll reopen this pr later


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] 599166320 commented on a diff in pull request #13031: ScanQuery supports multi column orderBy queries

Posted by GitBox <gi...@apache.org>.
599166320 commented on code in PR #13031:
URL: https://github.com/apache/druid/pull/13031#discussion_r967662342


##########
processing/src/main/java/org/apache/druid/query/scan/ScanQueryEngine.java:
##########
@@ -252,7 +260,295 @@ public void cleanup(Iterator<ScanResultValue> iterFromMake)
             ))
     );
   }
+  private Sequence<ScanResultValue> getScanOrderByResultValueSequence(

Review Comment:
   > 
   
   Today I thought about why I changed it to this way in the first place.
   Take a look, in fact `orderByTimeLimit` and `orderByMultiColumnLimit` are similar.
   ```
   //past execution path
   if(orderByTimeLimit){
     //...
   }else{
     //...
   }
   
   //current execution path
   if(orderByMultiColumnLimit){
     //
   }else if(orderByTimeLimit){
     ///
   }else{
     ///
   }
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] lgtm-com[bot] commented on pull request #13031: ScanQuery supports multi column orderBy queries

Posted by GitBox <gi...@apache.org>.
lgtm-com[bot] commented on PR #13031:
URL: https://github.com/apache/druid/pull/13031#issuecomment-1257208496

   This pull request **introduces 2 alerts** when merging 995bcf16cd74cb94929b711fd58f2beae7ac486b into 0bfa81b7df4b0ae76ac45497b007b6857acb419f - [view on LGTM.com](https://lgtm.com/projects/g/apache/druid/rev/pr-a519c1b1c080f8d5b87a1391383cbc93b3e46112)
   
   **new alerts:**
   
   * 2 for Dereferenced variable may be null


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] paul-rogers commented on a diff in pull request #13031: ScanQuery supports multi column orderBy queries

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on code in PR #13031:
URL: https://github.com/apache/druid/pull/13031#discussion_r985002018


##########
processing/src/main/java/org/apache/druid/query/scan/OrderByLimitQueryRunner.java:
##########
@@ -0,0 +1,248 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.scan;
+
+import com.google.common.base.Preconditions;
+import com.google.common.collect.Iterables;
+import com.google.common.collect.Iterators;
+import com.google.common.collect.Sets;
+import org.apache.druid.collections.MultiColumnSorter;
+import org.apache.druid.java.util.common.ISE;
+import org.apache.druid.java.util.common.JodaUtils;
+import org.apache.druid.java.util.common.granularity.Granularities;
+import org.apache.druid.java.util.common.guava.Sequence;
+import org.apache.druid.java.util.common.guava.Sequences;
+import org.apache.druid.query.Query;
+import org.apache.druid.query.QueryContexts;
+import org.apache.druid.query.QueryMetrics;
+import org.apache.druid.query.QueryPlus;
+import org.apache.druid.query.QueryRunner;
+import org.apache.druid.query.context.ResponseContext;
+import org.apache.druid.query.filter.Filter;
+import org.apache.druid.segment.Cursor;
+import org.apache.druid.segment.Segment;
+import org.apache.druid.segment.StorageAdapter;
+import org.apache.druid.segment.VirtualColumn;
+import org.apache.druid.segment.column.ColumnHolder;
+import org.apache.druid.segment.filter.Filters;
+import org.apache.druid.timeline.SegmentId;
+import org.joda.time.Interval;
+
+import javax.annotation.Nullable;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Comparator;
+import java.util.List;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+class OrderByLimitQueryRunner implements QueryRunner<ScanResultValue>

Review Comment:
   I made an earlier comment about the complexity that results from combining sorting in with cursor handling, since cursor handling is already complex. I've been thinking about what I could suggest instead, until we get the operator-based solution working. Here's an idea.
   
   Leave the cursor layer as-is: it produces a set of rows as Java arrays. If the query runner sees that a sort is requested, insert another `Sequence` layer above the cursor. The native query stuff is a bit messy, but here's the gist.
   
   Each cursor represents one slice of time. If the sort spec starts with `__time`, then the results from each cursor can be concatenated to produce the sorted result for a segment.
   
   If the key isn't `__time`, then do a two-level sort-merge. Use the sorter from above to sort the results from each cursor. Hold on to all these sorted lists until all cursors are done. Then, merge the results using the priority queue mechanism you have in the above file. Voila!, a sorted per-segment result. Results from multiple segments must be merged, but the code to check for merges already exists. You may have to modify it to handle your custom merge.
   
   The one drawback of this, mentioned above, is memory pressure. But, since there is no good solution at present, we can start with ignoring the memory issue.
   
   This layered trick is what we've done for operators; so your solution will have all the pieces needed when we convert this code: we'll just wrap your solution in an operator interface rather than a `Sequence` interface.



##########
processing/src/main/java/org/apache/druid/collections/MultiColumnSorter.java:
##########
@@ -0,0 +1,113 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.collections;
+
+import com.google.common.annotations.VisibleForTesting;
+import com.google.common.collect.MinMaxPriorityQueue;
+import com.google.common.collect.Ordering;
+import org.apache.druid.java.util.common.ISE;
+
+import java.util.Comparator;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Objects;
+
+public class MultiColumnSorter<T>

Review Comment:
   Sorting is a tricky beast! First, there are the memory issues. In the worst case, someone might sort all columns for all rows in a segment, which might create excessive memory use. Druid has no good solution at present, we'd just have to be aware of the issue.
   
   For a sort algorithm, there are many. Back in Apache Drill, we found that the Hadoop sorter worked better than the stock Java sorter. MSQ introduced a disk-based "super sorter", which is overkill here.
   
   There are probably some sorts going on in ingestion, but a quick search of the code didn't uncover anything obvious. Perhaps a message in the Druid dev Slack channel would offer some pointers.
   
   In any event, the sorter can be wrapped in an interface, since sorts are the kind of things that can always be improved. For this use case, perhaps we can assume all rows have the same schema, and are Java objects in an array, and just use the Java sort options, along with a comparator that compares columns in key order.



##########
processing/src/main/java/org/apache/druid/query/scan/OrderByLimitSequence.java:
##########
@@ -0,0 +1,198 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.scan;
+
+import com.google.common.collect.Lists;
+import org.apache.druid.java.util.common.DateTimes;
+import org.apache.druid.java.util.common.StringUtils;
+import org.apache.druid.java.util.common.UOE;
+import org.apache.druid.java.util.common.guava.BaseSequence;
+import org.apache.druid.query.QueryTimeoutException;
+import org.apache.druid.query.context.ResponseContext;
+import org.apache.druid.segment.BaseObjectColumnValueSelector;
+import org.apache.druid.segment.Cursor;
+import org.apache.druid.segment.column.ColumnHolder;
+import org.apache.druid.timeline.SegmentId;
+
+import java.util.ArrayList;
+import java.util.Iterator;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.NoSuchElementException;
+import java.util.Set;
+
+class OrderByLimitSequence extends BaseSequence<ScanResultValue, Iterator<ScanResultValue>>

Review Comment:
   Once you have the basic sort/merge working, then you can add limits. There are two cases.
   
   In the sort phase, it is necessary to sort the entire result set to apply the limit after sorting. So, you can simply plop a limit sequence on top of your order sequence, which is what we do in the operator case.
   
   Then, in the merge, you can just have one sequence do the merge, ignorant of a limit. Have a limit sequence sit on top. The limit sequence just stops reading from the merge, and closes it, when it has reached the limit.
   
   In this way, each operation is separate and simple. You don't need to have the sort and merge also know how to do the limit.
   
   It would be the query runner that works out how to stack these sequences. We'd have to work out exactly which one does the deed, but my guess is that it is the one that works with the scan engine: it just inserts extra sequences as needed for sort/merge/limit.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] lgtm-com[bot] commented on pull request #13031: ScanQuery supports multi column orderBy queries

Posted by GitBox <gi...@apache.org>.
lgtm-com[bot] commented on PR #13031:
URL: https://github.com/apache/druid/pull/13031#issuecomment-1238528213

   This pull request **introduces 2 alerts** when merging e169a0746ab1e6b7e287241fc763f473d4e889cd into 66545a0f3d1acd752da5f3ebb581b06c71ea186d - [view on LGTM.com](https://lgtm.com/projects/g/apache/druid/rev/pr-6922307d9cbb95d6e04a7484195a8c855ed41086)
   
   **new alerts:**
   
   * 2 for Dereferenced variable may be null


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] lgtm-com[bot] commented on pull request #13031: ScanQuery supports multi column orderBy queries

Posted by GitBox <gi...@apache.org>.
lgtm-com[bot] commented on PR #13031:
URL: https://github.com/apache/druid/pull/13031#issuecomment-1242944108

   This pull request **introduces 2 alerts** when merging 1466bf7c5adca532e768b2f8fa77063245c88a84 into 4bde50e683e649c57a5d1a147bb15c6286e9eb97 - [view on LGTM.com](https://lgtm.com/projects/g/apache/druid/rev/pr-d55d790389dd8885197f38c9ff7fc9b96c8f1b99)
   
   **new alerts:**
   
   * 2 for Dereferenced variable may be null


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org