You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by GitBox <gi...@apache.org> on 2021/01/21 02:39:44 UTC

[GitHub] [drill] paul-rogers opened a new pull request #2149: DRILL-7733: Use streaming for REST JSON queries

paul-rogers opened a new pull request #2149:
URL: https://github.com/apache/drill/pull/2149


   
   
   
   
   # [DRILL-7733](https://issues.apache.org/jira/browse/DRILL-7733): Use streaming for REST JSON queries
   
   ## Description
   
   Modifies the REST API to stream JSON query results rather than buffering the entire result set in memory as was previously required. The buffering limited the size of query which could be run using the REST API: users would run out of memory. With the streaming solution, data is fed directly from the query result to a JSON encoder and then back to the HTTP client with no buffering.
   
   Note that Drill has historically put the result schema *after* data. The reasoning was likely that the query schema can change many times during a query run (with different fragments returning batches with differing schemas.) The schema-at-end model allows the schemas to be merged.
   
   However, with streaming, the schema-at-end model forces the client to buffer the entire result set if the client needs the schema. A good improvement would be to send the (first batch) schema *before* the data. Drill would somehow have to deal with schema changes. As it turns out,  ODBC and JDBC clients send the schema before data and thus suffer from the same schema-change problem described here. We've avoided having to address the ODBC/JDBC issue, so maybe it won't be a problem in practice for the REST API if we send the first batch schema before data. In any event, that would be a (simple) separate enhancement.
   
   Refactors the existing JSON writer to work with the result set mechanism which is then used as the implementation for streaming.
   
   Refactors the internals of the REST API to allow for traditional "batch" responses and the new streaming responses.
   
   Revises the date/time methods for the row set API to use Java classes rather than Joda. Required to integrate properly with the
   JSON writer. The Joda Period class remains as there is no Java equivalent. Most of the changed files, in fact, are for this date/time change.
   
   A recent PR added get/set float methods to the row set API. This change was redundant and added a large volume of code to avoid a single-instruction cast and so is questionable. However, since we made it, we need to make it work. This PR fixes a few holes found during this work.
   
   ## Documentation
   
   The streaming form of JSON output is used only for REST queries: `query.json`. It is not used for HTML. The change is invisible to the user except that there is no longer a limit to the size of query results that the REST API can return.
   
   The Joda-to-Java time implementation change should be transparent to users except in one very specific case: if users have created a provided schema that includes a date/time format string. Such strings must be updated to Java date/time format. Provided schema is, however, an obscure feature so it is likely any users are affected.
   
   ## Testing
   
   Most changes are for the Joda replacement. All tests were rerun and updated as needed. Drill previously had no unit tests for the REST API. This PR adds a few simple tests, and instructions for how to quickly use the test to do ad-hoc tests.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] luocooong commented on pull request #2149: DRILL-7733: Use streaming for REST JSON queries

Posted by GitBox <gi...@apache.org>.
luocooong commented on pull request #2149:
URL: https://github.com/apache/drill/pull/2149#issuecomment-767217817


   @paul-rogers BTW, Could you please explain the advantages of using `Java Time` instead of `Joda Time`? Does this change take effect only for EVF framework?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] paul-rogers commented on pull request #2149: DRILL-7733: Use streaming for REST JSON queries

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on pull request #2149:
URL: https://github.com/apache/drill/pull/2149#issuecomment-767360757


   @luocooong, thanks for the question. There are three advantages of using the Java rather than Joda time.
   
   First, Joda was originally written to extend the limited Java date/time classes. Most of the Joda functionality, however, was added to Java 8 and so Joda has been obsolete since Java 8. Drill no longer builds on Java 7, so the need for Joda no longer exists. (Except for the Joda `Period` class, which does not exist in Java for some reason.)
   
   The second reason had to due with the JSON writer used for streaming. I can't recall the details (this code was written 6+ months ago), but I think the underlying JSON classes used Java time, so I would either have to convert from Joda to Java on each value; or convert the column accessors to use Java. I chose the latter for the reasons above.
   
   The third reason is that the accessor (and other EVF classes) are visible to plugins. By using Joda, we force plugin developers to use obsolete libraries and to remember the differences between Joda and Java time classes, which is a nuisance. By using Java time, we reduce the burden on plugin developers somewhat.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] cgivre merged pull request #2149: DRILL-7733: Use streaming for REST JSON queries

Posted by GitBox <gi...@apache.org>.
cgivre merged pull request #2149:
URL: https://github.com/apache/drill/pull/2149


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] vvysotskyi commented on a change in pull request #2149: DRILL-7733: Use streaming for REST JSON queries

Posted by GitBox <gi...@apache.org>.
vvysotskyi commented on a change in pull request #2149:
URL: https://github.com/apache/drill/pull/2149#discussion_r563865455



##########
File path: exec/java-exec/src/test/java/org/apache/drill/exec/server/rest/FileVerifier.java
##########
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.server.rest;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertNotNull;
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+
+import java.io.BufferedReader;
+import java.io.File;
+import java.io.FileNotFoundException;
+import java.io.FileReader;
+import java.io.IOException;
+import java.io.Reader;
+import java.io.StringReader;
+import java.net.URL;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+import java.util.regex.PatternSyntaxException;
+
+/**
+ * Verifier for execution plans. A handy tool to ensure that the
+ * planner produces the expected plan given some set of conditions.
+ * <p>
+ * The test works by comparing the actual {@code EXPLAIN} output
+ * to a "golden" file with the expected ("golden") plan.
+ * <p>
+ * To create a test, just write it and let it fail due to a missing file.
+ * The output will go to the console. Inspect it. If it looks good,
+ * copy the plan to the golden file and run again.
+ * <p>
+ * If comparison fails, you can optionally ask the verifier to write the
+ * output to {@code /tmp/drill/test} so you can compare the golden and actual
+ * outputs using your favorite diff tool to understand changes. If the changes
+ * are expected, use that same IDE to copy changes from the actual
+ * to the golden file.
+ * <p>
+ * The JSON properties of the serialized classes are all controlled
+ * to have a fixed order to ensure that files compare across test
+ * runs. If you see spurious failures do to changed JSON order, consider
+ * adding a {@code @JsonPropertyOrder} tag to enforce a consistent order.
+ * <p>
+ * Lines can be prefixed with with "!" to indicate that they are a regex.

Review comment:
       ```suggestion
    * Lines can be prefixed with "!" to indicate that they are a regex.
   ```

##########
File path: contrib/format-esri/src/main/java/org/apache/drill/exec/store/esri/ShpBatchReader.java
##########
@@ -59,7 +59,7 @@
 
   private FileSplit split;
   private ResultSetLoader loader;
-  private final ShpReaderConfig readerConfig;
+//  private final ShpReaderConfig readerConfig;

Review comment:
       Please remove these commented lines. Looks like the `ShpReaderConfig` class also may be removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] paul-rogers commented on pull request #2149: DRILL-7733: Use streaming for REST JSON queries

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on pull request #2149:
URL: https://github.com/apache/drill/pull/2149#issuecomment-767987440


   @luocooong, the commits are squashed. How are we doing commits these days? Since I'm a committer, should I push the commit myself, or do we still do periodic batch commits? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] paul-rogers edited a comment on pull request #2149: DRILL-7733: Use streaming for REST JSON queries

Posted by GitBox <gi...@apache.org>.
paul-rogers edited a comment on pull request #2149:
URL: https://github.com/apache/drill/pull/2149#issuecomment-767360757


   @luocooong, thanks for the question. There are three advantages of using the Java rather than Joda time.
   
   First, Joda was originally written to extend the limited Java date/time classes. Most of the Joda functionality, however, was added to Java 8 and so Joda has been obsolete since Java 8. Drill no longer builds on Java 7, so the need for Joda no longer exists. (Except for the Joda `Period` class, which does not exist in Java for some reason.)
   
   The second reason had to due with the JSON writer used for streaming. I can't recall the details (this code was written 6+ months ago), but I think the underlying JSON classes used Java time, so I would either have to convert from Joda to Java on each value; or convert the column accessors to use Java. I chose the latter for the reasons above.
   
   The third reason is that the accessor (and other EVF classes) are visible to plugins. By using Joda, we force plugin developers to use obsolete libraries and to remember the differences between Joda and Java time classes, which is a nuisance. By using Java time, we reduce the burden on plugin developers somewhat.
   
   As noted in the comment at the top, this change will affect anyone who used the provided schema feature and provided a custom date/time format. I'm guessing that no one has actually used that feature given that we never saw any questions about it. Moving forward, it is better, if anyone does use the custom date/time format, that they do so with Java's format rather than the obsolete Joda formats.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] luocooong commented on pull request #2149: DRILL-7733: Use streaming for REST JSON queries

Posted by GitBox <gi...@apache.org>.
luocooong commented on pull request #2149:
URL: https://github.com/apache/drill/pull/2149#issuecomment-764540825


   @paul-rogers  It's nice to see your new PR since last commit.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] vvysotskyi commented on a change in pull request #2149: DRILL-7733: Use streaming for REST JSON queries

Posted by GitBox <gi...@apache.org>.
vvysotskyi commented on a change in pull request #2149:
URL: https://github.com/apache/drill/pull/2149#discussion_r563865455



##########
File path: exec/java-exec/src/test/java/org/apache/drill/exec/server/rest/FileVerifier.java
##########
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.server.rest;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertNotNull;
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+
+import java.io.BufferedReader;
+import java.io.File;
+import java.io.FileNotFoundException;
+import java.io.FileReader;
+import java.io.IOException;
+import java.io.Reader;
+import java.io.StringReader;
+import java.net.URL;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+import java.util.regex.PatternSyntaxException;
+
+/**
+ * Verifier for execution plans. A handy tool to ensure that the
+ * planner produces the expected plan given some set of conditions.
+ * <p>
+ * The test works by comparing the actual {@code EXPLAIN} output
+ * to a "golden" file with the expected ("golden") plan.
+ * <p>
+ * To create a test, just write it and let it fail due to a missing file.
+ * The output will go to the console. Inspect it. If it looks good,
+ * copy the plan to the golden file and run again.
+ * <p>
+ * If comparison fails, you can optionally ask the verifier to write the
+ * output to {@code /tmp/drill/test} so you can compare the golden and actual
+ * outputs using your favorite diff tool to understand changes. If the changes
+ * are expected, use that same IDE to copy changes from the actual
+ * to the golden file.
+ * <p>
+ * The JSON properties of the serialized classes are all controlled
+ * to have a fixed order to ensure that files compare across test
+ * runs. If you see spurious failures do to changed JSON order, consider
+ * adding a {@code @JsonPropertyOrder} tag to enforce a consistent order.
+ * <p>
+ * Lines can be prefixed with with "!" to indicate that they are a regex.

Review comment:
       ```suggestion
    * Lines can be prefixed with "!" to indicate that they are a regex.
   ```

##########
File path: contrib/format-esri/src/main/java/org/apache/drill/exec/store/esri/ShpBatchReader.java
##########
@@ -59,7 +59,7 @@
 
   private FileSplit split;
   private ResultSetLoader loader;
-  private final ShpReaderConfig readerConfig;
+//  private final ShpReaderConfig readerConfig;

Review comment:
       Please remove these commented lines. Looks like the `ShpReaderConfig` class also may be removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] luocooong commented on pull request #2149: DRILL-7733: Use streaming for REST JSON queries

Posted by GitBox <gi...@apache.org>.
luocooong commented on pull request #2149:
URL: https://github.com/apache/drill/pull/2149#issuecomment-767902482


   @paul-rogers Please squash commits and ready to merge. looking forward to your next contribution. thanks


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] luocooong commented on pull request #2149: DRILL-7733: Use streaming for REST JSON queries

Posted by GitBox <gi...@apache.org>.
luocooong commented on pull request #2149:
URL: https://github.com/apache/drill/pull/2149#issuecomment-767217817


   @paul-rogers BTW, Could you please explain the advantages of using `Java Time` instead of `Joda Time`? Does this change take effect only for EVF framework?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] luocooong commented on pull request #2149: DRILL-7733: Use streaming for REST JSON queries

Posted by GitBox <gi...@apache.org>.
luocooong commented on pull request #2149:
URL: https://github.com/apache/drill/pull/2149#issuecomment-764540825


   @paul-rogers  It's nice to see your new PR since last commit.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] luocooong commented on pull request #2149: DRILL-7733: Use streaming for REST JSON queries

Posted by GitBox <gi...@apache.org>.
luocooong commented on pull request #2149:
URL: https://github.com/apache/drill/pull/2149#issuecomment-768057508


    @paul-rogers  Be merged. Recently cgivre  went to the moon (just kidding, he is fine). i think speed up the review and merge of the new PR is will be more friendly to new contributors.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] paul-rogers commented on pull request #2149: DRILL-7733: Use streaming for REST JSON queries

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on pull request #2149:
URL: https://github.com/apache/drill/pull/2149#issuecomment-767191625


   @vvysotskyi, thanks much for the review. I've made the requested changes. It appeared the builds for the prior version failed for reasons related to the environment; perhaps someone wants to take a look.
   
   I would suggest that someone try out the fixes with something that makes heavy use of the JSON REST API. While I've done simple functional testing, it would be good to get additional experience with a wider range of use cases. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] paul-rogers commented on pull request #2149: DRILL-7733: Use streaming for REST JSON queries

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on pull request #2149:
URL: https://github.com/apache/drill/pull/2149#issuecomment-767191625


   @vvysotskyi, thanks much for the review. I've made the requested changes. It appeared the builds for the prior version failed for reasons related to the environment; perhaps someone wants to take a look.
   
   I would suggest that someone try out the fixes with something that makes heavy use of the JSON REST API. While I've done simple functional testing, it would be good to get additional experience with a wider range of use cases. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org