You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/02/02 05:25:00 UTC

[jira] [Commented] (DRILL-8037) Add V2 JSON Format Plugin based on EVF

    [ https://issues.apache.org/jira/browse/DRILL-8037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17485573#comment-17485573 ] 

ASF GitHub Bot commented on DRILL-8037:
---------------------------------------

paul-rogers commented on pull request #2364:
URL: https://github.com/apache/drill/pull/2364#issuecomment-1027598044


   @vdiravka reached out to me on this bug. His explanation:
   
   > The issue is the schema is changed for the second batch and it is reported by SchemaTracker#isSameSchema
   I suppose V1 reader just compared the same field type, but the new one compares vectors by identity. So schemaChange is right thing for that test case. But Drill Hash aggregate doesn't support schema change.
   
   Here is my analysis:
   
   You are tight that the `SchemaTracker` is the thing in EVF which looks for schema changes. The rule is pretty simple: if the the kind of vector differs from the previous batch, then a schema change occurred. This is based, in part, on the observation that the sort operator can't handle any schema changes at all: not INT/BIGINT, not NOT NULL to NULL. Sort combines vectors, and if they are of a different type, the values won't fit together. The same is probably true of the hash agg.
   
   OK, so we agree that a type change is bad. What you're describing is the same type, but a different vector. This is a bug in several ways, and not the way you might think.
   
   First, it seems logical that if column x is a non-null INT in one batch, and is a non-null INT in the next batch, that no schema change has occurred. But, Drill will surprise you. Most operators assume that the vector itself has not changed: only the data stored within the vector. This is a subtle point, so let me expand it.
   
   A vector is a permanent holder or a column. The vector holds one or more data buffers. Those buffers change from batch to batch. Think of it as a bucket brigade: the old-style way that people fought fires. A chain of people forms (the vectors). They hand buckets one to the next (the data buffers).
   
   Now, I was pretty surprised when I first discovered this. So was Boaz when he wrote hash agg. But, most of Drill's code assumes that, in setup, the code binds to a specific vector. In the next batch, that same binding is valid; only the data changes. This shows up in code gen and other places. In fact, a goodly part of EVF is concerned with this need for "vector continuity" even across readers. (Let that sink in: two readers that have nothing to do with each other must somehow still share common vectors. That alone should tell us the vector design in Drill is a bit strange and needs rethinking. But, that's another topic.)
   
   So, now let's review the bug. On the one hand, `SchemaTracker` has noticed the vectors changed, and is (correctly) telling the downstream operators that they must redo their bindings to the new vector.
   
   But, on the other hand, `SchemaTracker` has been presented with two different vectors for the same column. That should never happen unless there is an actual type change (non-null to null, INT to BIGINT, etc.) There are elaborate mechanisms to reuse vectors from one reader to the next.
   
   So, the first thing to check is if the types are at all different. (If the "major types" differ.) If so, then we have a legitimate schema change that Drill can't handle.
   
   Now, if you find that the types are the same, then we have a bug in EVF, specifically in the `ResultVectorCacheImpl` class, since that's the thing which is supposed to do the magic.
   
   Let me ask another question. When this occurs, is it in the first batch of the second (or later) reader? If so, then that reader should have pulled the vector from the cache. The question is, why didn't it?
   
   If the error occurs within the first reader, then the bug is more subtle. How could the JSON reader accomplish that? The `ResultSetLoader` hides all that detail, and it defers to the `ResultVectorCacheImpl `for vector continuity.
   
   Yet another question is why these tests didn't fail 3 months ago when the tests first ran. And, why did the tests pass way back when I wrote this code? Did anything change elsewhere that could trigger this? I don't see how, but we should probably ask the question. 
   
   Fortunately, these are unit tests, so we should be able to sort out the issues without too much trouble.
   
   I'll also note that there were a few fixes to EVF V2 that I made in my PR, but those were around a subtle but in implicit columns.
   
   Suggestion: give the above a shot to see if you can see the problem. Otherwise, tomorrow I'll try to squeeze in time to grab this branch and do some debugging. After all, I wrote this stuff originally, so I should try to make it work.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Add V2 JSON Format Plugin based on EVF
> --------------------------------------
>
>                 Key: DRILL-8037
>                 URL: https://issues.apache.org/jira/browse/DRILL-8037
>             Project: Apache Drill
>          Issue Type: Sub-task
>            Reporter: Vitalii Diravka
>            Assignee: Vitalii Diravka
>            Priority: Major
>
> This adds new V2 beta JSON Format Plugin based on the "Extended Vector Framework".
> This is follow up DRILL-6953 (was closed with the decision to merge it by small pieces).
> So it is based on [https://github.com/apache/drill/pull/1913] and [https://github.com/paul-rogers/drill/tree/DRILL-6953-rev2] work.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)