You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by sachouche <gi...@git.apache.org> on 2018/02/01 03:15:28 UTC

[GitHub] drill pull request #1106: DRILL-6129: Fixed query failure due to nested colu...

GitHub user sachouche opened a pull request:

    https://github.com/apache/drill/pull/1106

    DRILL-6129: Fixed query failure due to nested column data type change

    Problem Description -
    - The Drillbit was able to successfully send batches containing different metadata (for nested columns)
    - This was the case when one or multiple scanners were involved
    - The issue happened within the client where value vectors are cached across batches
    - The load(...) API is responsible for updating values vectors when a new batch arrives
    - The RecordBatchLoader class is used to detect schema changes ; if this is the case, then previous value vectors are discarded and new ones created
    - There is a bug with the current implementation where only first level columns are compared
    
    Fix -
    - The fix is to improve the schema diff logic by including nested columns

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sachouche/drill DRILL-6129

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/1106.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1106
    
----
commit 9ffb41f509cd2531e7f3cdf89a66605ec0fdf7a4
Author: Salim Achouche <sa...@...>
Date:   2018-02-01T02:59:58Z

    DRILL-6129: Fixed query failure due to nested column data type change

----


---

[GitHub] drill issue #1106: DRILL-6129: Fixed query failure due to nested column data...

Posted by sachouche <gi...@git.apache.org>.
Github user sachouche commented on the issue:

    https://github.com/apache/drill/pull/1106
  
    Thank you Aman and Paul for reviewing this JIRA. I have created a new JIRA [DRILL-6131](https://issues.apache.org/jira/browse/DRILL-6131) to track the new SchemaComparator utility for sharing schema comparison logic.


---

[GitHub] drill pull request #1106: DRILL-6129: Fixed query failure due to nested colu...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/drill/pull/1106


---

[GitHub] drill issue #1106: DRILL-6129: Fixed query failure due to nested column data...

Posted by priteshm <gi...@git.apache.org>.
Github user priteshm commented on the issue:

    https://github.com/apache/drill/pull/1106
  
    @amansinha100 can you please review it?


---

[GitHub] drill issue #1106: DRILL-6129: Fixed query failure due to nested column data...

Posted by amansinha100 <gi...@git.apache.org>.
Github user amansinha100 commented on the issue:

    https://github.com/apache/drill/pull/1106
  
    Seems ok to fix the RecordBatchLoader.isSameSchema() since it is missing the checks for nested columns.  In order to do the consolidation that Paul suggested,  you might want to open an enhancement JIRA.  Since MaterializedField and RecordBatchLoader have separate class hierarchies, there's not a direct way to have a single method do the comparison.   One option is to create a utility 'SchemaComparator' class that incorporates various static utility methods and Javadoc the rules for the comparisons. 
    
    So, I am +1  on this change. 


---

[GitHub] drill issue #1106: DRILL-6129: Fixed query failure due to nested column data...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/1106
  
    Note that a similar bug was recently fixed in (as I recall) the Merge Receiver. As part of this fix, would be good to either:
    
    1. Determine if we have more copies of this logic besides the Merge Receiver (previously fixed) and the client code (fixed here.)
    2. Refactor the code so that all use cases use a common set of code for this task.
    
    In any event, would be good to compare this code with that done in the Merge Receiver to ensure that we are using a common approach. See `exec/java-exec/src/main/java/org/apache/drill/exec/record/BatchSchema.java` in PR #968.
    
    The two sets of code appear similar, depending on what `isSameSchema()` does with a list of `MaterializedField`s. But, please take a look.


---

[GitHub] drill issue #1106: DRILL-6129: Fixed query failure due to nested column data...

Posted by sachouche <gi...@git.apache.org>.
Github user sachouche commented on the issue:

    https://github.com/apache/drill/pull/1106
  
    Thanks @paul-rogers for the information; I went through the PR and noticed the following:
    
    - BatchSchema invokes MaterializedField.isEquivalent()
    - With my fix, both methods consider nested columns but they have several differences
    
    1) RecordBatchLoader requires sameness as this information is used to reuse the value vectors; if old and new batch are deemed same, then the value vectors are reloaded using the load(...) API. The metadata better be the same or a runtime exception will occur
    
    2) RecordBatchLoader isSame(...) API compares two different java objects: SerializedField (obtained from protobufs) and already materialized value vectors MaterializedField
    
    3) RecordBatchLoader isSame(...) API tolerates unordered fields (within the same level) but not MaterializedField.isEquivalent() method
    
    4) MaterializedField.isEquivalent() ignores hidden columns such "$bits" and "$offsets" but not RecordBatchLoader isSame(...)
    
    I think moving forward, the best way to prevent bugs with regard to schema changes is by maintaining a document that establishes all the rules. This will allow QA to refine their tests and catch current limitations.  



---

[GitHub] drill pull request #1106: DRILL-6129: Fixed query failure due to nested colu...

Posted by sachouche <gi...@git.apache.org>.
Github user sachouche closed the pull request at:

    https://github.com/apache/drill/pull/1106


---

[GitHub] drill pull request #1106: DRILL-6129: Fixed query failure due to nested colu...

Posted by sachouche <gi...@git.apache.org>.
GitHub user sachouche reopened a pull request:

    https://github.com/apache/drill/pull/1106

    DRILL-6129: Fixed query failure due to nested column data type change

    Problem Description -
    - The Drillbit was able to successfully send batches containing different metadata (for nested columns)
    - This was the case when one or multiple scanners were involved
    - The issue happened within the client where value vectors are cached across batches
    - The load(...) API is responsible for updating values vectors when a new batch arrives
    - The RecordBatchLoader class is used to detect schema changes ; if this is the case, then previous value vectors are discarded and new ones created
    - There is a bug with the current implementation where only first level columns are compared
    
    Fix -
    - The fix is to improve the schema diff logic by including nested columns

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sachouche/drill DRILL-6129

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/1106.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1106
    
----
commit 9ffb41f509cd2531e7f3cdf89a66605ec0fdf7a4
Author: Salim Achouche <sa...@...>
Date:   2018-02-01T02:59:58Z

    DRILL-6129: Fixed query failure due to nested column data type change

----


---