You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by GitBox <gi...@apache.org> on 2020/05/04 22:55:45 UTC

[GitHub] [drill] paul-rogers opened a new pull request #2075: DRILL-7730: Improve web query efficiency

paul-rogers opened a new pull request #2075:
URL: https://github.com/apache/drill/pull/2075


   # DRILL-7730](https://issues.apache.org/jira/browse/DRILL-7730): Improve web query efficiency
   
   ## Description
   
   Drill provides a REST API to run queries: `http://<host>:8047/query` and `/query.json`. This PR improves the memory efficiency of these queries.
   
   Drill runs queries as a DAG of operators, rooted on the "Screen" operator. The Screen operator takes each output batch of the query and hands it over to a `UserClientConnection` object. The original design is that `UserClientConnection` corresponded to an RPC connection. So, the Screen operator converted the vectors in the outgoing batch into a `QueryWritableBatch` which is an ordered list of buffers ready to send via Netty.
   
   When the REST API was added, the simplest thing was to add a new REST-specific version of `UserClientConnection`, called `WebUserConnection`. Rather than sending our list of buffers off to the network, the web version converts the buffers back into a set of value vectors using the same deserialization code used in the Drill client. However, that deserialization code needs the data in the form of a single large buffer. So, the REST code copies the entire batch from the list of buffers into one large direct memory buffer. Then it converts that back into vectors.
   
   Clearly, all this work simply gets us back where we started: the Screen operator has a batch of vectors, the `WebUserConnection` recreates them, consuming lots of memory and CPU in the process. All of this work occurs in the query thread (not the REST request thread), making the query more costly than necessary.
   
   So, the major part of this PR is to avoid the copy: allow the REST code to work with the batch given to Screen.
   
   This is done by creating a new level of indirection, the `QueryDataPackage` class. Now, Screen simply wraps the outgoing batch of vectors in a data package and hands that off to the `UserClientConnection`. The RPC version calls a method which does the conversion from vectors into a list of buffers. But, the REST version calls a different method which returns the original batch of vectors. Voila, no more copying and no more extra direct memory overhead.
   
   The `WebUserConnection` use the vectors to create three on-heap structures: a list of column names, a list of column types, and a list of maps of rows. The rows are particularly inefficient and will be addressed in a separate PR. As it turns out, the code that handled the column and metadata list had a bug: every incoming batch of data would append another copy to the in-memory list, resulting in many redundant objects. That bug is fixed in this PR.
   
   The work to understand all this resulted in "grand tour" of parts of Drill. Much code cleanup resulted. Also, WebUserConnection` is split into two classes as part of the next phase (removing the on-heap buffered results.)
   
   ## Documentation
   
   N/A: the user visible behavior of Drill is unchanged (though REST queries might be a bit faster.)
   
   ## Testing
   
   Reran all unit tests. Though, to be fair, the test suite include basically no tests of the REST API. The test run instead ensured that nothing was broken in the main RPC pathway.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] paul-rogers commented on pull request #2075: DRILL-7730: Improve web query efficiency

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on pull request #2075:
URL: https://github.com/apache/drill/pull/2075#issuecomment-625527650


   Looks like we've got a dependency issue that breaks the build:
   
   ```
   Processing triggers for man-db (2.8.3-2ubuntu0.1) ...
   ~/work/drill ~/work/drill/drill
   --2020-05-07 22:17:59--  https://github.com/protocolbuffers/protobuf/releases/download/v3.11.1/protobuf-java-3.11.1.zip
   Resolving github.com (github.com)... 140.82.112.3
   Connecting to github.com (github.com)|140.82.112.3|:443... connected.
   HTTP request sent, awaiting response... 302 Found
   Location: https://github-production-release-asset-2e65be.s3.amazonaws.com/23357588...
   ...
   HTTP request sent, awaiting response... 403 Forbidden
   2020-05-07 22:17:59 ERROR 403: Forbidden.
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] cgivre commented on a change in pull request #2075: DRILL-7730: Improve web query efficiency

Posted by GitBox <gi...@apache.org>.
cgivre commented on a change in pull request #2075:
URL: https://github.com/apache/drill/pull/2075#discussion_r421832470



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/client/DumpCat.java
##########
@@ -250,7 +250,7 @@ private void showSingleBatch (VectorAccessibleSerializable vcSerializable, boole
       System.out.println(getBatchMetaInfo(vcSerializable).toString());
 
       System.out.println("Schema Information");
-      for (final VectorWrapper w : vectorContainer) {
+      for (final VectorWrapper<?> w : vectorContainer) {
         final MaterializedField field = w.getValueVector().getField();
         System.out.println (String.format("name : %s, minor_type : %s, data_mode : %s",

Review comment:
       Ah ok.  




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] cgivre merged pull request #2075: DRILL-7730: Improve web query efficiency

Posted by GitBox <gi...@apache.org>.
cgivre merged pull request #2075:
URL: https://github.com/apache/drill/pull/2075


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] paul-rogers commented on pull request #2075: DRILL-7730: Improve web query efficiency

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on pull request #2075:
URL: https://github.com/apache/drill/pull/2075#issuecomment-625524714


   @cgivre, thanks much for the review. Answered the question and addressed the format suggestion.
   
   Looks like the builds are failing due to timeouts unrelated, I hope, to this PR. As far as I can tell, no tests actually use the JSON Rest API. Let's see what happens when I push the new commit. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] cgivre commented on pull request #2075: DRILL-7730: Improve web query efficiency

Posted by GitBox <gi...@apache.org>.
cgivre commented on pull request #2075:
URL: https://github.com/apache/drill/pull/2075#issuecomment-625530647


   From the code perspective: LGTM +1.  I'm going to rerun the checkstyle and see if it works.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] paul-rogers commented on a change in pull request #2075: DRILL-7730: Improve web query efficiency

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on a change in pull request #2075:
URL: https://github.com/apache/drill/pull/2075#discussion_r421824639



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/client/DumpCat.java
##########
@@ -250,7 +250,7 @@ private void showSingleBatch (VectorAccessibleSerializable vcSerializable, boole
       System.out.println(getBatchMetaInfo(vcSerializable).toString());
 
       System.out.println("Schema Information");
-      for (final VectorWrapper w : vectorContainer) {
+      for (final VectorWrapper<?> w : vectorContainer) {
         final MaterializedField field = w.getValueVector().getField();
         System.out.println (String.format("name : %s, minor_type : %s, data_mode : %s",

Review comment:
       `DumpCat` is a command-line tool. All I did here was fix some warnings.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] cgivre commented on a change in pull request #2075: DRILL-7730: Improve web query efficiency

Posted by GitBox <gi...@apache.org>.
cgivre commented on a change in pull request #2075:
URL: https://github.com/apache/drill/pull/2075#discussion_r421729696



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/client/DumpCat.java
##########
@@ -250,7 +250,7 @@ private void showSingleBatch (VectorAccessibleSerializable vcSerializable, boole
       System.out.println(getBatchMetaInfo(vcSerializable).toString());
 
       System.out.println("Schema Information");
-      for (final VectorWrapper w : vectorContainer) {
+      for (final VectorWrapper<?> w : vectorContainer) {
         final MaterializedField field = w.getValueVector().getField();
         System.out.println (String.format("name : %s, minor_type : %s, data_mode : %s",

Review comment:
       Do we want a `System.out.println()` here?

##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/LogsResources.java
##########
@@ -190,9 +190,9 @@ public int compareTo(Log log) {
 
   @XmlRootElement
   public class LogContent {
-    private String name;
-    private Collection<String> lines;
-    private int maxLines;
+    private final String name;
+    private final Collection<String> lines;
+    private final int maxLines;
 
     @JsonCreator
     public LogContent (@JsonProperty("name") String name, @JsonProperty("lines") Collection<String> lines, @JsonProperty("maxLines") int maxLines) {

Review comment:
       nit: New lines here perhaps?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org