Posted to issues@drill.apache.org by "Parth Chandra (JIRA)" <ji...@apache.org> on 2018/04/02 09:18:00 UTC

[jira] [Commented] (DRILL-6223) Drill fails on Schema changes

    [ https://issues.apache.org/jira/browse/DRILL-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16422042#comment-16422042 ] 

Parth Chandra commented on DRILL-6223:
--------------------------------------

{quote}To your point about compensation logic in the context of Schema Changes
 * Why do you think it is ok to dynamically include new columns?
 * Yet it is not ok to exclude them?
{quote}

Usually, in real-world data with dynamically changing schemas, new columns are added and not removed.
{quote}
 * Consider a batch of 32k rows
 * A VV with null integer values will require 32kb (bits) + 32kb * 4 = 160kb
 * Each missing column will require that much memory per mini-fragment
{quote}

One of the guarantees provided by value vectors is that elements can be accessed by index in constant time (or, in the case of nested elements, in O(m), where m is the level of nesting). The representation is based on providing this guarantee. It comes at the cost of additional memory usage, which is a deliberate tradeoff.
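
To make the tradeoff concrete, here is a minimal sketch in Java. It is not Drill's actual value vector code: the class name NullableIntVectorSketch and its methods are hypothetical, and it assumes one byte per null flag plus one 4-byte slot per row, which is the layout under which the quoted 32kb + 32kb * 4 figure works out. Because every row owns a fixed-width slot, a lookup is a single array access, and the same fixed-width layout is why an all-null column still costs the full allocation.

```java
// Hypothetical illustration only; these are not Drill's ValueVector classes.
final class NullableIntVectorSketch {
    // Assumed layout: one byte per row, 1 = value present, 0 = null.
    private final byte[] isSet;
    // One 4-byte slot per row, allocated even when the row is null.
    private final int[] values;

    NullableIntVectorSketch(int rowCount) {
        isSet = new byte[rowCount];
        values = new int[rowCount];
    }

    void set(int index, int value) {
        isSet[index] = 1;
        values[index] = value;
    }

    // Constant time: one array lookup, regardless of how many nulls precede it.
    boolean isNull(int index) {
        return isSet[index] == 0;
    }

    // Constant time: the slot position is simply the row index.
    int get(int index) {
        if (isNull(index)) {
            throw new IllegalStateException("null value at index " + index);
        }
        return values[index];
    }

    // Bytes needed for a batch under the assumed layout: flags plus int slots.
    static long bytesFor(int rowCount) {
        return (long) rowCount + (long) rowCount * Integer.BYTES;
    }

    public static void main(String[] args) {
        int rows = 32 * 1024; // the 32k-row batch from the quoted example
        NullableIntVectorSketch vector = new NullableIntVectorSketch(rows);
        vector.set(20_000, 42);
        System.out.println(vector.get(20_000));
        System.out.println(vector.isNull(5));
        System.out.println(bytesFor(rows) / 1024 + " KB per column");
    }
}
```

A representation that stored only the non-null values would shrink an all-null column to almost nothing, but finding row i would then require counting the set flags before it (or keeping an extra index), which gives up the plain indexed access described above.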
{quote}This is unless (similarly to the implicit columns) we optimize the VV storage representation or / and push the column preservation to higher layers such as the client or foreman
{quote}
It would be wonderful to improve vectors to use much less memory while providing the same guarantees. A proposal would be welcome. 

 

> Drill fails on Schema changes 
> ------------------------------
>
>                 Key: DRILL-6223
>                 URL: https://issues.apache.org/jira/browse/DRILL-6223
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Relational Operators
>    Affects Versions: 1.10.0, 1.12.0
>            Reporter: salim achouche
>            Assignee: salim achouche
>            Priority: Major
>             Fix For: 1.14.0
>
>
> Drill query fails when selecting all columns from a complex nested data file set (Parquet). There are differences in schema among the files:
>  * The Parquet files exhibit differences both at the first level and within nested data types
>  * A select * will not cause an exception but using a limit clause will
>  * Note also this issue seems to happen only when multiple Drillbit minor fragments are involved (concurrency higher than one)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)