You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by GitBox <gi...@apache.org> on 2021/03/14 04:54:11 UTC

[GitHub] [drill] cgivre opened a new pull request #2189: DRILL-7877 - Streaming REST API Fails to Send Multiple Batches

cgivre opened a new pull request #2189:
URL: https://github.com/apache/drill/pull/2189


   # [DRILL-7877](https://issues.apache.org/jira/browse/DRILL-7877): DRILL-7877 - Streaming REST API Fails to Send Multiple Batches
   
   ## Description
   
   [DRILL-7733](https://github.com/apache/drill/pull/2149) introduced the streaming reader for the REST API for Drill.  Unfortunately, a minor bug was introduced as well in that if the user executed a query that produced more than one batch, the schema would be passed in the first batch with no rows, then the rows for future batches would be ignored in the results.
   
   This minor fix corrects this situation. 
   
   The behavior can be verified by executing the query below in Postman via the Drill REST API.  
   
   ```sql
   SELECT position_title, COUNT(*) as pc 
   FROM cp.`employee.json` 
   GROUP BY position_title
   ```
   
   ## Documentation
   No user facing changes.
   
   ## Testing
   Added an additional unit test and tested manually. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] cgivre commented on a change in pull request #2189: DRILL-7877 - Streaming REST API Fails to Send Multiple Batches

Posted by GitBox <gi...@apache.org>.
cgivre commented on a change in pull request #2189:
URL: https://github.com/apache/drill/pull/2189#discussion_r595980302



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/stream/StreamingHttpConnection.java
##########
@@ -104,11 +104,9 @@ public void outputAvailable(OutputStream out) throws IOException {
   public void sendData(RpcOutcomeListener<Ack> listener, QueryDataPackage data) {
     VectorContainer batch = data.batch();
     try {
-      if (batchCount == 0) {
-        batchHolder = new BatchHolder(batch);
-        reader = new PushResultSetReaderImpl(batchHolder);
-        startSignal.await();
-      }
+      batchHolder = new BatchHolder(batch);
+      reader = new PushResultSetReaderImpl(batchHolder);
+      startSignal.await();

Review comment:
       @paul-rogers, @vdiravka 
   Any additional thoughts? I'd like to get this committed if possible, as this issue is breaking our UI.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] luocooong commented on pull request #2189: DRILL-7877 - Streaming REST API Fails to Send Multiple Batches

Posted by GitBox <gi...@apache.org>.
luocooong commented on pull request #2189:
URL: https://github.com/apache/drill/pull/2189#issuecomment-799430766


   I hope everyone staying safe in COVID storm.
   As it happens, cgivre has been commit  this PR when I work on this issues. thanks


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] paul-rogers commented on a change in pull request #2189: DRILL-7877 - Streaming REST API Fails to Send Multiple Batches

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on a change in pull request #2189:
URL: https://github.com/apache/drill/pull/2189#discussion_r593859759



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/stream/StreamingHttpConnection.java
##########
@@ -104,11 +104,9 @@ public void outputAvailable(OutputStream out) throws IOException {
   public void sendData(RpcOutcomeListener<Ack> listener, QueryDataPackage data) {
     VectorContainer batch = data.batch();
     try {
-      if (batchCount == 0) {
-        batchHolder = new BatchHolder(batch);
-        reader = new PushResultSetReaderImpl(batchHolder);
-        startSignal.await();
-      }
+      batchHolder = new BatchHolder(batch);
+      reader = new PushResultSetReaderImpl(batchHolder);
+      startSignal.await();

Review comment:
       Thanks for fixing this. I'd been working with Curtis on how to debug/test the problem. For example, I suggested using the mock data source to create a test data set of arbitrary size, and using the `ExampleTest` tools for enabling logging to be able to see what's happening.
   
   I don't think this fix is correct; I think the original version is closer to what is needed. My memory is hazy on this, but I do suspect the lines above should be done only once. In particular, I worry that having multiple batch holders may lead to memory issues.
   
   Perhaps do a bit more debugging. Clearly the above code is being called once for each batch, or this fix would not work. Why don't the following three lines do what we think they will? Is there some handshake needed in the batch holder? Or, is the consumer doing something wrong with the holder so that it looks like we need to create a new one each time?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] vvysotskyi commented on a change in pull request #2189: DRILL-7877 - Streaming REST API Fails to Send Multiple Batches

Posted by GitBox <gi...@apache.org>.
vvysotskyi commented on a change in pull request #2189:
URL: https://github.com/apache/drill/pull/2189#discussion_r596263674



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/stream/StreamingHttpConnection.java
##########
@@ -104,11 +104,9 @@ public void outputAvailable(OutputStream out) throws IOException {
   public void sendData(RpcOutcomeListener<Ack> listener, QueryDataPackage data) {
     VectorContainer batch = data.batch();
     try {
-      if (batchCount == 0) {
-        batchHolder = new BatchHolder(batch);
-        reader = new PushResultSetReaderImpl(batchHolder);
-        startSignal.await();
-      }
+      batchHolder = new BatchHolder(batch);
+      reader = new PushResultSetReaderImpl(batchHolder);
+      startSignal.await();

Review comment:
       The actual reason why this issue appears is because of `ReaderIndex.rowCount` initialized on the first batch with 0 value and this value is left for the next batches. The ideal fix would be to enhance the logic to update `ReaderIndex.rowCount` when calling `RowSetReaderImpl.newBatch()`. But I don't see significant issues with an existing solution.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] cgivre commented on a change in pull request #2189: DRILL-7877 - Streaming REST API Fails to Send Multiple Batches

Posted by GitBox <gi...@apache.org>.
cgivre commented on a change in pull request #2189:
URL: https://github.com/apache/drill/pull/2189#discussion_r593894152



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/stream/StreamingHttpConnection.java
##########
@@ -104,11 +104,9 @@ public void outputAvailable(OutputStream out) throws IOException {
   public void sendData(RpcOutcomeListener<Ack> listener, QueryDataPackage data) {
     VectorContainer batch = data.batch();
     try {
-      if (batchCount == 0) {
-        batchHolder = new BatchHolder(batch);
-        reader = new PushResultSetReaderImpl(batchHolder);
-        startSignal.await();
-      }
+      batchHolder = new BatchHolder(batch);
+      reader = new PushResultSetReaderImpl(batchHolder);
+      startSignal.await();

Review comment:
       @paul-rogers 
   Thanks for taking a look at this.  When I was stepping through this (full disclosure, I got my second COVID shot on Friday and was really groggy yesterday), I noticed that the issue seemed to be on this line:
   https://github.com/apache/drill/blob/4e514c214091b21c334f0d3b2c14c1efed381850/exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/stream/StreamingHttpConnection.java#L112
   
   Let's say that you have a query that returns `n` batches `b1...bn`.   As it was originally implemented, the first batch would emit the headers with an empty array for `rows`.   So far so good.  
   
   When you hit the next batch (`b2`), nothing was ever being done with `b2` and most importantly, `b1` was being sent again to `emitBatch()`.   If you notice, we create a new `VectorContainer` from the incoming data, but if those lines are skipped, nothing happens with that `VectorContainer`.   The end result is that `b1` gets emitted `n` times.  Since `b1` has no rows, you get an empty dataset. 
   
   Like I said, I was pretty groggy yesterday, so I'm only about 78.346% sure that's what's happening.  
   
   If the concern is an OOM situation, one option might be to change the scope of the `reader` and `batchHolder` variables so that they are local to that function.  They don't seem to be used outside of the `sendData` function. 
   
    
   Tagging @Daddykong for additional sanity check.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] cgivre merged pull request #2189: DRILL-7877 - Streaming REST API Fails to Send Multiple Batches

Posted by GitBox <gi...@apache.org>.
cgivre merged pull request #2189:
URL: https://github.com/apache/drill/pull/2189


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org