Posted to issues@drill.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2015/12/01 20:09:10 UTC

[jira] [Commented] (DRILL-4053) Reduce metadata cache file size

    [ https://issues.apache.org/jira/browse/DRILL-4053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034379#comment-15034379 ] 

ASF GitHub Bot commented on DRILL-4053:
---------------------------------------

Github user parthchandra commented on the pull request:

    https://github.com/apache/drill/pull/254#issuecomment-161064958
  
    Updated the patch. @StevenMPhillips, @jacques-n, could you do a quick review?
    The cache code now uses the same file name and reads the metadata appropriately. Note that, as a result, v1 and v2 of the cache file cannot coexist (which also means that Drill 1.3 and earlier are no longer forward compatible).
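A minimal sketch of what reading version information from a single cache file name could look like. The class, the "metadata_version" field, and the POJO names mentioned in the comments are assumptions for illustration, not Drill's actual Metadata code; the idea is simply to peek at a version field in the JSON before binding the file to a v1 or v2 object model, treating a missing field as v1.

    // Hypothetical sketch: distinguishing cache file versions when v1 and v2
    // share one file name. Names are illustrative, not Drill's actual classes.
    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    import java.io.File;
    import java.io.IOException;

    public class CacheVersionProbe {

      private static final ObjectMapper MAPPER = new ObjectMapper();

      /** Reads only the version field so the caller can pick the right deserializer. */
      static String readCacheVersion(File cacheFile) throws IOException {
        JsonNode root = MAPPER.readTree(cacheFile);
        JsonNode version = root.get("metadata_version");
        // Older (v1) files may not carry an explicit version field at all.
        return version == null ? "v1" : version.asText();
      }

      public static void main(String[] args) throws IOException {
        File cacheFile = new File(args[0]); // path to the directory's metadata cache file
        System.out.println("Cache file version: " + readCacheVersion(cacheFile));
        // A real reader would now bind the JSON to the matching object model,
        // e.g. a v1 vs. v2 ParquetTableMetadata POJO (names assumed).
      }
    }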


> Reduce metadata cache file size
> -------------------------------
>
>                 Key: DRILL-4053
>                 URL: https://issues.apache.org/jira/browse/DRILL-4053
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Metadata
>    Affects Versions: 1.3.0
>            Reporter: Parth Chandra
>            Assignee: Parth Chandra
>             Fix For: 1.4.0
>
>
> The parquet metadata cache file contains a fair amount of redundant metadata, which bloats the size of the cache file. Two things we can reduce (the second is sketched below):
> 1) The schema is repeated for every row group. We can keep a single merged schema instead (similar to what was discussed for the insert-into functionality).
> 2) The max and min values in the stats are useful for partition pruning only when the two values are the same. We can keep just the maxValue, and only when it equals the minValue.
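A minimal sketch of the second reduction, using illustrative Jackson POJOs (class and field names are assumptions, not Drill's actual metadata classes): the column entry carries a single mxValue that is populated only when a row group's min equals its max, and Jackson omits the field from the JSON when it is null, so nothing is written for columns that cannot be used for pruning.

    // Illustrative sketch only: store one value per column, and only when min == max.
    import com.fasterxml.jackson.annotation.JsonInclude;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class MetadataSketch {

      @JsonInclude(JsonInclude.Include.NON_NULL)
      public static class ColumnMetadata {
        public String[] name;
        public Long nulls;
        // Set only when the row group's min equals its max; left null otherwise,
        // so Jackson drops the field from the serialized JSON entirely.
        public Object mxValue;

        static ColumnMetadata of(String[] name, Object min, Object max, Long nulls) {
          ColumnMetadata c = new ColumnMetadata();
          c.name = name;
          c.nulls = nulls;
          c.mxValue = (min != null && min.equals(max)) ? max : null;
          return c;
        }
      }

      public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // min == max: the value is kept and can drive partition pruning.
        System.out.println(mapper.writeValueAsString(
            ColumnMetadata.of(new String[] {"dir0"}, "2015", "2015", 0L)));
        // min != max: the value is omitted, shrinking the cache file.
        System.out.println(mapper.writeValueAsString(
            ColumnMetadata.of(new String[] {"amount"}, 1, 99, 0L)));
      }
    }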



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)