Posted to issues@drill.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2021/01/21 02:40:00 UTC
[jira] [Commented] (DRILL-7733) Use streaming for REST JSON queries

    [ https://issues.apache.org/jira/browse/DRILL-7733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17268995#comment-17268995 ] 

ASF GitHub Bot commented on DRILL-7733:
---------------------------------------

paul-rogers opened a new pull request #2149:
URL: https://github.com/apache/drill/pull/2149


   # [DRILL-7733](https://issues.apache.org/jira/browse/DRILL-7733): Use streaming for REST JSON queries
   
   ## Description
   
   Modifies the REST API to stream JSON query results rather than buffering the entire result set in memory, as was previously required. That buffering limited the size of the result set the REST API could return: larger queries ran out of memory. With the streaming solution, data is fed directly from the query result to a JSON encoder and then back to the HTTP client with no buffering.
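To illustrate the difference between buffering and streaming, here is a minimal sketch. It is not the code in this PR: the class `StreamingJsonSketch`, its methods, and the simplified `{"rows":[...]}` shape are all hypothetical, and batches are modeled as plain Java lists of maps rather than Drill record batches. The point is only that each batch is encoded and flushed before the next one is fetched, so memory use is bounded by one batch.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and JSON shape are assumptions, not Drill's API.
final class StreamingJsonSketch {

    // Encodes rows batch by batch; only the current batch is held in memory.
    static void streamRows(Iterator<List<Map<String, Object>>> batches, Writer out)
            throws IOException {
        out.write("{\"rows\":[");
        boolean firstRow = true;
        while (batches.hasNext()) {
            for (Map<String, Object> row : batches.next()) {
                if (!firstRow) {
                    out.write(",");
                }
                firstRow = false;
                writeRow(row, out);
            }
            out.flush(); // hand the encoded batch to the client before fetching the next
        }
        out.write("]}");
        out.flush();
    }

    // Writes one row as a JSON object. Non-finite numbers are not handled here.
    static void writeRow(Map<String, Object> row, Writer out) throws IOException {
        out.write("{");
        boolean firstField = true;
        for (Map.Entry<String, Object> field : row.entrySet()) {
            if (!firstField) {
                out.write(",");
            }
            firstField = false;
            out.write(quote(field.getKey()));
            out.write(":");
            Object value = field.getValue();
            if (value == null) {
                out.write("null");
            } else if (value instanceof Number || value instanceof Boolean) {
                out.write(value.toString());
            } else {
                out.write(quote(value.toString()));
            }
        }
        out.write("}");
    }

    // Minimal JSON string escaping: quotes, backslashes, and control characters.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    // Demo helper: streams into a StringWriter so the output can be inspected.
    static String demo(List<List<Map<String, Object>>> batches) {
        StringWriter out = new StringWriter();
        try {
            streamRows(batches.iterator(), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

In a real server the `Writer` would wrap the HTTP response stream instead of a `StringWriter`; the demo helper exists only to make the sketch easy to try.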
   
   Note that Drill has historically put the result schema *after* the data. The reasoning was likely that the query schema can change many times during a query run (with different fragments returning batches with differing schemas). The schema-at-end model allows the schemas to be merged.
   
   However, with streaming, the schema-at-end model forces the client to buffer the entire result set if the client needs the schema. A good improvement would be to send the (first batch) schema *before* the data. Drill would somehow have to deal with schema changes. As it turns out, ODBC and JDBC clients send the schema before data and thus suffer from the same schema-change problem described here. We've avoided having to address the ODBC/JDBC issue, so maybe it won't be a problem in practice for the REST API if we send the first batch schema before data. In any event, that would be a (simple) separate enhancement.
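The schema-first idea can be sketched as follows. This is purely illustrative and not part of this PR: `SchemaFirstSketch`, its `Batch` class, and the event strings are hypothetical, and reporting a later schema change as an explicit event is just one possible policy, not something Drill does today.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of "first-batch schema before data"; not Drill code.
final class SchemaFirstSketch {

    // Stand-in for a record batch: column names plus row values.
    static final class Batch {
        final List<String> columns;
        final List<List<Object>> rows;

        Batch(List<String> columns, List<List<Object>> rows) {
            this.columns = columns;
            this.rows = rows;
        }
    }

    // Returns the ordered events a client would see: the schema of the first
    // batch, then rows, with an explicit marker if a later batch differs.
    static List<String> emit(Iterator<Batch> batches) {
        List<String> events = new ArrayList<>();
        List<String> schema = null;
        while (batches.hasNext()) {
            Batch batch = batches.next();
            if (schema == null) {
                schema = batch.columns;
                events.add("schema " + schema);
            } else if (!schema.equals(batch.columns)) {
                schema = batch.columns;
                events.add("schema-change " + schema);
            }
            for (List<Object> row : batch.rows) {
                events.add("row " + row);
            }
        }
        return events;
    }
}
```

With this ordering a client that needs the schema can start processing rows immediately, at the cost of having to handle (or reject) a mid-stream schema change.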
   
   Refactors the existing JSON writer to work with the result set mechanism which is then used as the implementation for streaming.
   
   Refactors the internals of the REST API to allow for traditional "batch" responses and the new streaming responses.
   
   Revises the date/time methods for the row set API to use Java classes rather than Joda. This is required to integrate properly with the JSON writer. The Joda Period class remains as there is no Java equivalent. Most of the changed files, in fact, are for this date/time change.
   
   A recent PR added get/set float methods to the row set API. This change was redundant and added a large volume of code to avoid a single-instruction cast and so is questionable. However, since we made it, we need to make it work. This PR fixes a few holes found during this work.
   
   ## Documentation
   
   The streaming form of JSON output is used only for REST queries: `query.json`. It is not used for HTML. The change is invisible to the user except that there is no longer a limit to the size of query results that the REST API can return.
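For reference, a client request to `query.json` might be built as in the sketch below. The class `QueryJsonRequestSketch` and its methods are hypothetical, the host and port are placeholders, and the request body assumes the usual `queryType`/`query` fields of Drill's REST API. It uses the JDK's `java.net.http` classes (Java 11 or later) and only constructs the request; nothing is sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical client-side sketch; builds the request without sending it.
final class QueryJsonRequestSketch {

    // JSON body for a SQL query, with minimal string escaping.
    static String body(String sql) {
        StringBuilder sb = new StringBuilder("{\"queryType\":\"SQL\",\"query\":\"");
        for (int i = 0; i < sql.length(); i++) {
            char c = sql.charAt(i);
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.append("\"}").toString();
    }

    // POST request against <baseUrl>/query.json, e.g. baseUrl = "http://localhost:8047".
    static HttpRequest build(String baseUrl, String sql) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/query.json"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body(sql)))
                .build();
    }
}
```

To benefit from streaming on the client side as well, the response can be consumed incrementally, for example with `HttpClient.send(request, HttpResponse.BodyHandlers.ofInputStream())` feeding a streaming JSON parser, rather than reading the whole body into a string.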
   
   The Joda-to-Java time implementation change should be transparent to users except in one very specific case: if users have created a provided schema that includes a date/time format string. Such strings must be updated to Java date/time format. Provided schema is, however, an obscure feature, so it is unlikely any users are affected.
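As a quick way to check a format string after migration, the sketch below parses sample values using `java.time.format.DateTimeFormatter`. It assumes that "Java date/time format" here means `DateTimeFormatter` patterns; the class `JavaTimeFormatSketch` and its methods are hypothetical helpers, not part of Drill. Common patterns such as `yyyy-MM-dd HH:mm:ss` are spelled the same way in both libraries, but some letters differ (in `java.time`, for instance, `Y` is the week-based year), so it is worth testing each format string against real data.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for trying out java.time patterns; not Drill code.
final class JavaTimeFormatSketch {

    // Parses a date-only value; throws DateTimeParseException on mismatch.
    static LocalDate parseDate(String text, String pattern) {
        return LocalDate.parse(text, DateTimeFormatter.ofPattern(pattern));
    }

    // Parses a date-time value without zone; throws DateTimeParseException on mismatch.
    static LocalDateTime parseTimestamp(String text, String pattern) {
        return LocalDateTime.parse(text, DateTimeFormatter.ofPattern(pattern));
    }
}
```

For example, `JavaTimeFormatSketch.parseDate("21/01/2021", "dd/MM/yyyy")` yields the date 2021-01-21, and a pattern that does not match the data fails fast with a `DateTimeParseException`.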
   
   ## Testing
   
   Most changes are for the Joda replacement. All tests were rerun and updated as needed. Drill previously had no unit tests for the REST API. This PR adds a few simple tests, along with instructions for using them to run quick ad-hoc tests.




> Use streaming for REST JSON queries
> -----------------------------------
>
>                 Key: DRILL-7733
>                 URL: https://issues.apache.org/jira/browse/DRILL-7733
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.17.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Major
>             Fix For: 1.19.0
>
>
> Several users on the user and dev mail lists have complained about the memory overhead when running a REST JSON query: {{http://node:8047/query.json}}. The current implementation buffers the entire result set in memory, then lets Jersey/Jetty convert the results to JSON. The result is very heavy heap use for larger query result sets.
> This ticket requests a change to use streaming. As each batch arrives at the Screen operator, convert that batch to JSON and directly stream the results to the client network connection, much as is done for the native client connection.
> For backward compatibility, the form of the JSON must be the same as the current API.


