Posted to issues@drill.apache.org by "Chris Westin (JIRA)" <ji...@apache.org> on 2015/10/01 00:47:04 UTC

[jira] [Commented] (DRILL-3874) flattening large JSON objects consumes too much direct memory

    [ https://issues.apache.org/jira/browse/DRILL-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14939000#comment-14939000 ] 

Chris Westin commented on DRILL-3874:
-------------------------------------

The problem with the flatten() operation is the following:
The current operator design doesn't provide a means to push down the set of output fields that are of interest. When records pass through an operator, the result includes all of the original fields as well as all of the newly added computed fields that the operator produces; the current design requires a subsequent project (see the query profile) to filter out the fields you don't want. As a result, when you flatten a 4MB record, the output includes the original 4MB record plus whatever you extracted. In the sample given, that means that internally, for each of the 20,000 elements in the array being flattened, you get a copy of the original 4MB record along with that one array element.
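
Roughly, the shape of the problem looks like the sketch below (plain Java with made-up names, not actual Drill code). Note that in Drill's columnar value vectors the original fields are materialized again for every output row rather than shared by reference, which is what makes the duplication expensive:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of flatten without field pushdown: every output row carries
    // the untouched original record alongside one array element; the unwanted fields
    // are only removed later by the downstream project operator.
    public class FlattenSketch {
        static class OutputRow {
            final byte[] originalRecord; // the full ~4MB source record, repeated per row
            final Object arrayElement;   // one of the 20,000 array elements
            OutputRow(byte[] originalRecord, Object arrayElement) {
                this.originalRecord = originalRecord;
                this.arrayElement = arrayElement;
            }
        }

        static List<OutputRow> flatten(byte[] originalRecord, List<Object> array) {
            List<OutputRow> out = new ArrayList<OutputRow>(array.size());
            for (Object element : array) {
                // In this sketch the record is shared by reference; in the real columnar
                // representation its bytes are written into the output vectors per row.
                out.add(new OutputRow(originalRecord, element));
            }
            return out;
        }
    }
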
We have some internal checks that prevent us from making any single value vector too large; when that happens, we flush what we've processed so far. One of those checks is going off after about 2,000 of the flattened elements are processed. Note that 2,000 x 4MB is about 8GB – this is why the peak memory usage for that operator in the profile is 8GB.
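
The arithmetic behind that number, as a quick sanity check (a standalone sketch, not the actual limit logic):

    // Rough estimate of the peak: the batch only gets flushed after the size check
    // fires, which in this case is after roughly 2,000 flattened rows, each of which
    // still carries the ~4MB original record.
    public class PeakEstimate {
        public static void main(String[] args) {
            long recordBytes = 4L * 1024 * 1024; // ~4MB source record repeated per row
            int rowsBeforeFlush = 2000;          // rows processed before the check fires
            long peakBytes = rowsBeforeFlush * recordBytes;
            System.out.printf("peak ~= %.1f GB%n",
                    peakBytes / (1024.0 * 1024 * 1024));
            // Prints "peak ~= 7.8 GB", matching the ~8GB reported in the query profile.
        }
    }
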
The better, longer-term fix requires some re-architecting of the operators so that we can push down the list of fields of interest and avoid passing the original fields through in this way. But that's a much bigger change that will affect a lot of the system.
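
To give a feel for the direction (purely illustrative; this interface does not exist in Drill today), pushdown could amount to letting the consumer declare which columns it actually needs:

    import java.util.Set;

    // Hypothetical interface only: the downstream project tells the producing operator
    // which columns will actually be read, so flatten could avoid copying the original
    // 4MB worth of fields into every output row.
    interface ProjectionPushdown {
        void pushProjectedColumns(Set<String> columnNames);
    }
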
In the short term, in order to avoid using up so much memory, we need to know how much memory has been consumed so far during the flattening process. Unfortunately, our value vector implementation does not provide a way to ask for that until after the vector has been populated; by then it's too late, because a huge amount of memory has already been used. I'm adding something that will let us ask for that information while the vector is being filled. However, querying it comes with a performance penalty. To mitigate that, the flattening process will also get an adaptive algorithm that estimates memory usage by asking for this information only periodically. The sampling period will be chosen based on the large records seen in the past and how much memory they took up. A query like this one will start out optimistically, but if it hits an initial memory limit, we start sampling and monitoring this information periodically to keep memory use from growing too large.
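
As a sketch of the adaptive idea (all names and numbers are illustrative, not the actual implementation): the expensive "how much is allocated so far?" query is only made every N rows, and N shrinks once a limit is hit so that large records are caught before the batch grows too far.

    import java.util.function.LongSupplier;

    // Illustrative sketch of periodic, adaptive memory checking during flatten.
    public class AdaptiveMemoryCheck {
        private final long memoryLimitBytes;
        private int interval = 1024;    // optimistic: check rarely at first
        private int rowsSinceCheck = 0;

        public AdaptiveMemoryCheck(long memoryLimitBytes) {
            this.memoryLimitBytes = memoryLimitBytes;
        }

        // Called after each flattened row is written; returns true when the current
        // batch should be flushed.
        public boolean onRowWritten(LongSupplier allocatedBytes) {
            if (++rowsSinceCheck < interval) {
                return false;                         // skip the expensive size query
            }
            rowsSinceCheck = 0;
            if (allocatedBytes.getAsLong() >= memoryLimitBytes) {
                interval = Math.max(1, interval / 2); // sample more often after a hit
                return true;                          // flush what we have so far
            }
            return false;
        }
    }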

> flattening large JSON objects consumes too much direct memory
> -------------------------------------------------------------
>
>                 Key: DRILL-3874
>                 URL: https://issues.apache.org/jira/browse/DRILL-3874
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Data Types
>    Affects Versions: 1.1.0
>            Reporter: Chris Westin
>            Assignee: Chris Westin
>
> A JSON record has a field whose value is an array with 20,000 elements; the record's size is 4MB. A select is used to flatten this. The query profile reports that the peak memory utilization was 8GB, most of it used by the flatten. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)