You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@impala.apache.org by "Michael Ho (JIRA)" <ji...@apache.org> on 2018/01/24 23:28:00 UTC

[jira] [Resolved] (IMPALA-6395) Allow the accumulated row batch size of a data sink to be tunable

     [ https://issues.apache.org/jira/browse/IMPALA-6395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Ho resolved IMPALA-6395.
--------------------------------
       Resolution: Fixed
    Fix Version/s: Impala 3.0

Fixed at commit: https://github.com/apache/impala/commit/92fd7e56eb83b67f7f4ec836ebf8e495c75cbdb7

This change adds a flag to control the maximum size in bytes
a row batch in a data stream sender's channel can accumulate
before the row batch is sent over the wire. Increasing this
value will better amortize the cost of compression and RPC
per row batch. The default value is 16KB per channel.

Change-Id: I385f4b7a0671bb2d7872bee60d476c375680b5c2
Reviewed-on: http://gerrit.cloudera.org:8080/9026
Reviewed-by: Michael Ho <kw...@cloudera.com>
Tested-by: Impala Public Jenkins




> Allow the accumulated row batch size of a data sink to be tunable
> -----------------------------------------------------------------
>
>                 Key: IMPALA-6395
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6395
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Distributed Exec
>    Affects Versions: Impala 2.12.0
>            Reporter: Michael Ho
>            Assignee: Michael Ho
>            Priority: Minor
>             Fix For: Impala 3.0
>
>
> During scale testing, it was noticed that tuning the size of the accumulated row batches in data stream sender will affect the performance of Impala. This is understandable as a larger row batch will amortize the cost of compression and RPC in general. The default value is 16KB per channel. Experiment in a 38 node cluster with 48 concurrent users running 10TB TPC-DS shows about 5% improvement in query-per-hour when bumping the default value to 512KB. This is a tradeoff between memory consumption and performance. Having this flag allows us to tune for performance more easily.
> {noformat}
>       if (FLAGS_use_krpc) {
>         *sink = pool->Add(new KrpcDataStreamSender(fragment_instance_ctx.sender_id,
>             row_desc, thrift_sink.stream_sink, fragment_ctx.destinations, 16 * 1024,
>             state));
>       } else {
>         // TODO: figure out good buffer size based on size of output row
>         *sink = pool->Add(new DataStreamSender(fragment_instance_ctx.sender_id, row_desc,
>             thrift_sink.stream_sink, fragment_ctx.destinations, 16 * 1024, state));
>       }
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)