Posted to issues-all@impala.apache.org by "ASF subversion and git services (JIRA)" <ji...@apache.org> on 2018/09/05 20:13:00 UTC

[jira] [Commented] (IMPALA-4268) buffer more than a batch of rows at coordinator

    [ https://issues.apache.org/jira/browse/IMPALA-4268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604884#comment-16604884 ] 

ASF subversion and git services commented on IMPALA-4268:
---------------------------------------------------------

Commit b288a6af2eda9631b2bad91896ae4bfd2a3fdf30 in impala's branch refs/heads/master from [~tarmstrong@cloudera.com]
[ https://git-wip-us.apache.org/repos/asf?p=impala.git;h=b288a6a ]

IMPALA-7477: Batch-oriented query set construction

Rework the row-by-row construction of query result sets in PlanRootSink
so that it materialises an output column at a time. Make some minor
optimisations like preallocating output vectors and initialising
strings more efficiently.
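
As an illustration only (this is not the code from the commit), column-at-a-time
materialisation with preallocated output vectors can be sketched as below. The
Row and ColumnarResult types and the AppendBatch function are hypothetical names
invented for this sketch; they are not Impala's RowBatch or QueryResultSet
interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical input row with one integer and one string column.
struct Row {
  int64_t id;
  std::string name;
};

// Hypothetical columnar result set: one output vector per column.
struct ColumnarResult {
  std::vector<int64_t> ids;
  std::vector<std::string> names;
};

// Appends a whole batch to 'out', one output column at a time,
// instead of touching every column for each row in turn.
void AppendBatch(const std::vector<Row>& batch, ColumnarResult* out) {
  assert(out != nullptr);
  // Preallocate once per batch so the loops below do not reallocate.
  out->ids.reserve(out->ids.size() + batch.size());
  out->names.reserve(out->names.size() + batch.size());
  // First pass: materialise the integer column.
  for (const Row& row : batch) out->ids.push_back(row.id);
  // Second pass: materialise the string column, constructing each
  // string in place from a pointer and a length.
  for (const Row& row : batch) {
    out->names.emplace_back(row.name.data(), row.name.size());
  }
}
```

Each loop touches a single output vector, which keeps the inner loops tight; a
real implementation would dispatch on the column type once per column rather
than once per value.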

My intent is both to make this faster and to make the QueryResultSet
interface better before IMPALA-4268 does a bunch of surgery on this
part of the code.

Testing:
Ran core tests.

Perf:
Downloaded tpch_parquet.orders via JDBC driver.
Before: 3.01s, After: 2.57s.

Downloaded l_orderkey from tpch_parquet.lineitem.
Before: 1.21s, After: 1.08s.

Change-Id: Ibc87a84c34935d0d5841c7f5528eb802527fa809
Reviewed-on: http://gerrit.cloudera.org:8080/11297
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> buffer more than a batch of rows at coordinator
> -----------------------------------------------
>
>                 Key: IMPALA-4268
>                 URL: https://issues.apache.org/jira/browse/IMPALA-4268
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 2.8.0
>            Reporter: Henry Robinson
>            Priority: Major
>              Labels: resource-management
>         Attachments: rows-produced-histogram.png
>
>
> In IMPALA-2905, we are introducing a {{PlanRootSink}} that handles the production of output rows at the root of a plan.
> The implementation in IMPALA-2905 has the plan execute in a separate thread from the consumer, which calls {{GetNext()}} to retrieve the rows. However, the sender thread will block until {{GetNext()}} is called, so that there are no complications about memory usage and ownership due to having several batches in flight at one time.
> However, this also leads to many context switches, as each {{GetNext()}} call yields to the sender to produce the rows. If the sender were to fill a buffer asynchronously, the consumer could pull out of that buffer without taking a context switch in many cases (and the extra buffering might smooth out any performance spikes due to client delays, which currently directly affect plan execution).
> The tricky part is managing the mismatch between the size of the row batches processed in {{Send()}} and the size of the fetch result asked for by the client. The sender materializes output rows in a {{QueryResultSet}} that is owned by the coordinator. That is not currently a splittable object; instead it contains the actual RPC response struct that will hit the wire when the RPC completes. An asynchronous sender cannot know the fetch size, which may change on every fetch call. So the {{GetNext()}} implementation would need to be able to split the {{QueryResultSet}} to match the requested fetch size, and handle stitching together several {{QueryResultSets}}, without doing extra copies.
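
The buffering idea described above can be sketched roughly as follows. This is
a minimal illustration under stated assumptions, not Impala code: RowBuffer and
its Send(), Close() and GetNext() methods are hypothetical names, rows are
modelled as plain strings, and rows are moved one by one, so the sketch does not
address the requirement of splitting and stitching result sets without extra
copies.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical buffer between the sender (plan execution thread)
// and the consumer (client fetch thread).
class RowBuffer {
 public:
  explicit RowBuffer(size_t capacity) : capacity_(capacity) {
    assert(capacity > 0);
  }

  // Sender side: blocks only while the buffer is full. The limit is
  // soft, since a whole batch is admitted once there is any room.
  void Send(std::vector<std::string> batch) {
    std::unique_lock<std::mutex> lock(mu_);
    not_full_.wait(lock, [this] { return rows_.size() < capacity_; });
    for (std::string& row : batch) rows_.push_back(std::move(row));
    not_empty_.notify_one();
  }

  // Sender side: signals that no more rows will arrive.
  void Close() {
    std::lock_guard<std::mutex> lock(mu_);
    closed_ = true;
    not_empty_.notify_all();
  }

  // Consumer side: returns up to 'max_rows' rows, regardless of how
  // the sender batched them. For max_rows > 0, an empty result means
  // the sender has closed the buffer and all rows were consumed.
  std::vector<std::string> GetNext(size_t max_rows) {
    std::unique_lock<std::mutex> lock(mu_);
    not_empty_.wait(lock, [this] { return !rows_.empty() || closed_; });
    const size_t n = std::min(max_rows, rows_.size());
    std::vector<std::string> out;
    out.reserve(n);
    for (size_t i = 0; i < n; ++i) {
      out.push_back(std::move(rows_.front()));
      rows_.pop_front();
    }
    not_full_.notify_one();
    return out;
  }

 private:
  const size_t capacity_;
  std::mutex mu_;
  std::condition_variable not_full_;
  std::condition_variable not_empty_;
  std::deque<std::string> rows_;
  bool closed_ = false;
};
```

With this shape the consumer only waits when the buffer is empty, and a single
fetch can span rows from more than one Send() batch. A zero-copy version would
have to hand out slices of the buffered result sets instead of moving
individual rows.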



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
