Posted to dev@drill.apache.org by paul-rogers <gi...@git.apache.org> on 2017/06/08 19:06:43 UTC

[GitHub] drill issue #846: DRILL-5544: Out of heap running CTAS against text delimite...

Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/846
  
    Chatted with Parth, who mentioned that Parquet page sizes are typically on the order of 1 MB, perhaps up to 8 MB; 16 MB is too large.
    
    The concern raised in earlier comments was that if we buffer, say, 256 MB of data per file and run many writes in parallel, we will use too much memory.
    
    But if we buffer only one page at a time, and we keep the page size on the order of 1-2 MB, then even with 100 parallel writers we use only about 200 MB, which is fine; see the sketch below.
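    
    A back-of-the-envelope check of that math (a minimal, hypothetical sketch; the real writer count and page size depend on the workload and on Drill's settings):
    
    ```java
    // Worst-case buffered memory if each parallel writer holds one page.
    // The numbers below are illustrative, not Drill defaults.
    public class PageBufferBudget {
      public static void main(String[] args) {
        long pageSizeBytes = 2L * 1024 * 1024;  // ~2 MB page buffer per writer
        int parallelWriters = 100;              // e.g. 100 concurrent fragments
        long worstCase = pageSizeBytes * parallelWriters;
        System.out.printf("Worst case: %,d bytes (~%d MB)%n",
            worstCase, worstCase >> 20);        // ~200 MB, as argued above
      }
    }
    ```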
    
    In this case, the direct memory solution is fine. (But please check performance.)
    
    However, if we are running out of memory, I wonder whether we are failing to control the page size and letting pages grow too large. Did you happen to check the size of the pages we are writing? One way to check is sketched below.
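    
    A low-tech check is to read the file footer and look at the column-chunk sizes, which bound the pages inside each chunk (a minimal sketch against the parquet-hadoop footer API; I believe `parquet-tools dump` also prints per-page sizes if finer detail is needed):
    
    ```java
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.format.converter.ParquetMetadataConverter;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.BlockMetaData;
    import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
    import org.apache.parquet.hadoop.metadata.ParquetMetadata;
    
    public class ChunkSizes {
      public static void main(String[] args) throws Exception {
        // Reads only the footer; no row data is materialized.
        ParquetMetadata meta = ParquetFileReader.readFooter(
            new Configuration(), new Path(args[0]),
            ParquetMetadataConverter.NO_FILTER);
        for (BlockMetaData block : meta.getBlocks()) {
          for (ColumnChunkMetaData col : block.getColumns()) {
            System.out.printf("%s: %,d bytes on disk, %,d uncompressed%n",
                col.getPath(), col.getTotalSize(),
                col.getTotalUncompressedSize());
          }
        }
      }
    }
    ```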
    
    If the pages are too big, let's file another JIRA ticket to fix that problem so that we have a complete solution.
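    
    For reference, in plain parquet-hadoop the knob is the writer's page size, as in the sketch below; Drill's own writer sets up its page store differently, so this only illustrates where the limit lives, not Drill's code path (the file name and schema are made up). I believe the user-facing setting in Drill is the `store.parquet.page-size` session option.
    
    ```java
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.example.ExampleParquetWriter;
    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.MessageTypeParser;
    
    public class SmallPageWriter {
      public static void main(String[] args) throws Exception {
        // Made-up one-column schema, purely for illustration.
        MessageType schema = MessageTypeParser.parseMessageType(
            "message example { required int32 id; }");
        ParquetWriter<Group> writer = ExampleParquetWriter
            .builder(new Path("/tmp/out.parquet")) // made-up output path
            .withType(schema)
            .withPageSize(1024 * 1024)             // cap each page at ~1 MB
            .build();
        writer.close();
      }
    }
    ```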
    
    Once we confirm that we are writing small pages (or file that JIRA if not), I'll change my vote from +0 to +1.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes to, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---