Posted to dev@parquet.apache.org by "Ryan Blue (JIRA)" <ji...@apache.org> on 2016/04/17 02:28:25 UTC

[jira] [Resolved] (PARQUET-580) Potentially unnecessary creation of large int[] in IntList for columns that aren't used

     [ https://issues.apache.org/jira/browse/PARQUET-580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan Blue resolved PARQUET-580.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 1.9.0

Thanks for fixing this, [~pnarang]!

As for scaling the array up from a smaller initial size, I think [~alexlevenson] may be able to comment on why this was 64k to begin with. I haven't done extensive testing, but I definitely recommend talking with him if you'd like to change the way that works. Thanks!

> Potentially unnecessary creation of large int[] in IntList for columns that aren't used
> ---------------------------------------------------------------------------------------
>
>                 Key: PARQUET-580
>                 URL: https://issues.apache.org/jira/browse/PARQUET-580
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Piyush Narang
>            Assignee: Piyush Narang
>            Priority: Minor
>             Fix For: 1.9.0
>
>
> While importing a dataset with a lot of columns (a few thousand) that weren't being used, we noticed that we ended up allocating a lot of unnecessary int arrays (each 64K in size), because we create an IntList object for every column. The heap footprint of all those int[]s turned out to be around 2GB, which results in some jobs OOMing. This seems unnecessary for columns that might not be used.
> Also wondering whether 64K is the right size to start with. A potential improvement would be to allocate the int[]s in IntList so that their size ramps up gradually. Rather than creating arrays of size 64K each time (potentially wasteful if there are only a few hundred bytes), we could create, say, a 4K int[], then an 8K int[] when it fills up, and so on until we reach 64K, at which point the behavior is the same as in the current implementation.
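
For illustration, here is a minimal sketch of the two ideas in the description: allocating nothing until a column actually receives a value, and doubling the slab size from 4K up to a 64K cap. It is not Parquet's actual IntList and not necessarily how PARQUET-580 was fixed. The class name RampingIntList, its fields, and its methods are hypothetical, and sizes are assumed to be counted in int elements rather than bytes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not Parquet's IntList. Sizes are in int elements (assumption). */
class RampingIntList {
    static final int INITIAL_SLAB_SIZE = 4 * 1024;  // assumed starting size
    static final int MAX_SLAB_SIZE = 64 * 1024;     // the 64K cap from the description

    private final List<int[]> fullSlabs = new ArrayList<>();
    private int[] currentSlab;                      // stays null until the first add()
    private int currentSlabPos;
    private int nextSlabSize = INITIAL_SLAB_SIZE;
    private long size;

    /** Appends a value, allocating a slab only when one is actually needed. */
    void add(int value) {
        if (currentSlab == null) {
            allocateSlab();                         // lazy first allocation
        } else if (currentSlabPos == currentSlab.length) {
            fullSlabs.add(currentSlab);             // current slab is full, retire it
            allocateSlab();
        }
        currentSlab[currentSlabPos++] = value;
        size++;
    }

    /** Allocates the next slab and doubles the following slab size, up to the cap. */
    private void allocateSlab() {
        currentSlab = new int[nextSlabSize];
        currentSlabPos = 0;
        nextSlabSize = Math.min(nextSlabSize * 2, MAX_SLAB_SIZE);
    }

    /** Returns the value at the given position, walking the variable-sized slabs. */
    int get(long index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        long remaining = index;
        for (int[] slab : fullSlabs) {
            if (remaining < slab.length) {
                return slab[(int) remaining];
            }
            remaining -= slab.length;
        }
        return currentSlab[(int) remaining];
    }

    /** Number of values added so far. */
    long size() {
        return size;
    }

    /** Number of slabs allocated so far (zero for an unused column). */
    int slabCount() {
        return fullSlabs.size() + (currentSlab == null ? 0 : 1);
    }

    /** Total int elements allocated so far, used or not. */
    long allocatedInts() {
        long total = currentSlab == null ? 0 : currentSlab.length;
        for (int[] slab : fullSlabs) {
            total += slab.length;
        }
        return total;
    }
}
```

In this sketch an unused column holds no int[] at all, and a column with only a handful of values holds a single 4K slab instead of a 64K one. Once the slab size reaches 64K, every further slab is 64K, which matches the fixed-size behavior described above. The trade-off is that get() has to walk the slabs, because they no longer share one size.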


