Posted to issues@drill.apache.org by "Jinfeng Ni (JIRA)" <ji...@apache.org> on 2017/03/27 16:59:41 UTC

[jira] [Commented] (DRILL-5384) Sort cannot directly access map members, causes a data copy

    [ https://issues.apache.org/jira/browse/DRILL-5384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15943639#comment-15943639 ] 

Jinfeng Ni commented on DRILL-5384:
-----------------------------------

It seems that the first argument is true; the other two may be only partially true.

It's true that Drill adds a Project operator. However, it's not true that Drill has to copy the data out of the map vector for a path such as "customer.id" in your example. If you look at the generated code for the Project operator, you can see that it merely does a vector transfer.

As a matter of fact, Drill does not differentiate between a top-level column reference like "col1" and a nested field in an n-level map, such as "col2.b.c.d". Drill evaluates the expression and copies the data only when a map is an element of an array (a repeated map), for instance "col3.a.b[100].c.d[20].f". For such a schema path, on the other hand, it is not clear to me how your proposed approach would work without any copy until I see a design or implementation.
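
To make the transfer-versus-copy distinction concrete, here is a toy sketch. It is not Drill's value vector API or its generated code: ToyIntVector, transferTo and copyTo are hypothetical names invented for this illustration. The only point is that a transfer hands over the existing backing buffer, while a copy allocates new storage and duplicates every value.

```java
import java.util.Arrays;

public class TransferSketch {

    /** Toy stand-in for a value vector. Hypothetical; not Drill's API. */
    static final class ToyIntVector {
        int[] buffer; // backing storage; null once ownership has moved away

        ToyIntVector(int[] buffer) {
            this.buffer = buffer;
        }

        /** Transfer: the target takes over the same buffer; no values are copied. */
        void transferTo(ToyIntVector target) {
            target.buffer = this.buffer;
            this.buffer = null;
        }

        /** Copy: the target gets newly allocated storage holding duplicated values. */
        void copyTo(ToyIntVector target) {
            target.buffer = Arrays.copyOf(this.buffer, this.buffer.length);
        }
    }

    public static void main(String[] args) {
        int[] ids = {10, 11, 12};

        // Stand-in for projecting a nested field to a top-level column by transfer.
        ToyIntVector nested = new ToyIntVector(ids);
        ToyIntVector projected = new ToyIntVector(null);
        nested.transferTo(projected);
        System.out.println(projected.buffer == ids); // true: same array, nothing copied

        // Stand-in for the case where the values really have to be copied.
        ToyIntVector copied = new ToyIntVector(null);
        projected.copyTo(copied);
        System.out.println(copied.buffer == ids);    // false: a second array now exists
    }
}
```

In this toy model only the copy path doubles the memory held for the column; the transfer path leaves a single buffer with a new owner.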





> Sort cannot directly access map members, causes a data copy
> -----------------------------------------------------------
>
>                 Key: DRILL-5384
>                 URL: https://issues.apache.org/jira/browse/DRILL-5384
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>            Priority: Minor
>
> Suppose we have a JSON structure for "orders" like this:
> {code}
> { customer: { id: 10, name: "fred" },
>   order: { id: 20, product: "Frammis 1000" } }
> {code}
> Suppose I want to sort by customer.id. Today, Drill will project customer.id up to the top level as a temporary, hidden field. Drill will copy the data from the customer.id vector to this new temporary field. Drill then sorts on the temporary column, and uses another project to remove the columns.
> Clearly, this works, but it has a cost:
> * Two extra project operators.
> * Extra memory copy.
> * Sort must buffer both the original and copied data. This can double memory use in the worst case.
> All of this is done simply to avoid having to reference "customer.id" in the sort.
> But, as explained in DRILL-5376, maps are just nested tuples; there is no need to copy the data, because the data is already right there in a value vector. The problem is that Drill's map implementation makes it hard for the generated code to get at the "customer.id" vector.
> This ticket asks to allow the sort to work directly with nested scalars to avoid the overhead explained above. To do this:
> 1. Fix nested scalar access to allow the generated code to easily access a nested scalar.
> 2. Allow a sort key of the form "customer.id".
> 3. Modify the planner to generate such sort keys instead of the dual projects.
> The result will be a leaner, faster sort operation when sorting on scalars within a map.
>   



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)