Posted to issues-all@impala.apache.org by "Bikramjeet Vig (JIRA)" <ji...@apache.org> on 2019/04/18 22:23:00 UTC

[jira] [Assigned] (IMPALA-7380) Untracked memory for file metadata like AvroHeader accumulates until end of query

     [ https://issues.apache.org/jira/browse/IMPALA-7380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikramjeet Vig reassigned IMPALA-7380:
--------------------------------------

    Assignee: Alice Fan  (was: Yongjun Zhang)

> Untracked memory for file metadata like AvroHeader accumulates until end of query
> ---------------------------------------------------------------------------------
>
>                 Key: IMPALA-7380
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7380
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Tim Armstrong
>            Assignee: Alice Fan
>            Priority: Major
>              Labels: resource-management
>
> HdfsScanNodeBase maintains a map of per-file metadata objects for use by different scan ranges from the same file, e.g. AvroFileHeader. These are not cleaned up until the end of the query.
> Note that because of IMPALA-6932 this doesn't necessarily increase peak memory significantly (because the headers are all accumulated during the header-parsing phase anyway).
> We should track the number of scanners remaining for each file and delete the headers when we no longer need them.
> h2. How to reproduce 
> Create an Avro table with a large number of files (e.g. 10000).
> Run an Avro scan on a single node:
> {code}
> set num_nodes=1;
> select * from table where foo = 'bar';
> {code}
> Observe on the /memz debug page that untracked memory grows substantially during the scan, then drops once the query finishes or is cancelled.
> h2. Proposed fix 
> Values from HdfsScanNodeBase::per_file_metadata_ should be removed, and the metadata object deleted, once all scanners for that file/partition combination have finished. We already know the expected number of scan ranges per file from HdfsFileDesc::splits, so we can delete the object once all scan ranges for the file have finished.
> I can see two options here, both of which involve evicting members from per_file_metadata_ at different points:
> # unique ownership: per_file_metadata_ owns the metadata objects via a unique_ptr and maintains a refcount that is decremented by the scanner when it is done (e.g. by BaseSequenceScanner::Close()). 
> # shared ownership: per_file_metadata_ stores a shared_ptr and maintains a count of outstanding scan ranges that is decremented as each scanner takes its own copy of the shared_ptr; once the count hits zero the map entry is erased, and the shared_ptr refcount frees the object when the last scanner finishes. 
> I think #1 is better since it's more consistent with our usual memory management. The nice thing about #2 though is that the interaction with the scanners is simpler.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org