Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2023/11/21 16:56:00 UTC

[jira] [Commented] (IMPALA-12549) Adjust estimations for small strings

    [ https://issues.apache.org/jira/browse/IMPALA-12549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788482#comment-17788482 ] 

ASF subversion and git services commented on IMPALA-12549:
----------------------------------------------------------

Commit ae848a6cefc59b027644d7ea54ab365593f4fc6e in impala's branch refs/heads/master from Zoltan Borok-Nagy
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=ae848a6ce ]

IMPALA-12373: Small String Optimization for StringValue

This patch implements the Small String Optimization (SSO) for
StringValue objects. SSO is a well-known optimization in the C++
world that is used by most string implementations (STL string,
boost string, Folly string, etc.).

The old layout of the StringValue was:
  char* ptr;  // 8 bytes
  int len;    // 4 bytes

We also add the __packed__ attribute to the StringValue class, which
means there is no padding between 'ptr' and 'len', nor after 'len'.
I.e. StringValue objects take 12 bytes. This means with SSO we can use
11 bytes to store small strings. In this case the last byte is used to
store the length.

Small string layout:
  char[11] buf;
  unsigned char len;
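
To make the byte arithmetic concrete, here is a minimal sketch of the two
layouts sharing the same 12 bytes. The names LongRep, SmallRep and StringRep
are hypothetical and are not taken from the patch; the sketch assumes a
64-bit platform and a compiler that supports the GCC/Clang __packed__
attribute.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical names for illustration; not Impala's actual declarations.

// Long representation: pointer plus 32-bit length, packed to 12 bytes.
struct __attribute__((__packed__)) LongRep {
  char* ptr;    // 8 bytes on a 64-bit platform
  int32_t len;  // 4 bytes
};

// Small representation: 11 inline characters plus a one-byte length.
struct __attribute__((__packed__)) SmallRep {
  char buf[11];
  unsigned char len;
};

// Both representations occupy the same 12 bytes.
union StringRep {
  LongRep long_rep;
  SmallRep small_rep;
};

static_assert(sizeof(LongRep) == 12, "no padding after 'len'");
static_assert(sizeof(SmallRep) == 12, "11 data bytes + 1 length byte");
static_assert(sizeof(StringRep) == 12, "the union adds no extra bytes");

// The one-byte small length sits on the last byte of the 4-byte long length.
static_assert(offsetof(SmallRep, len) ==
                  offsetof(LongRep, len) + sizeof(int32_t) - 1,
              "the last byte is shared by both length fields");
```

Without the packed attribute, a typical 64-bit ABI would pad the long
representation to 16 bytes, so the static_asserts above would fail.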

We also need an indicator bit (which tells whether the long
representation or the small representation is active) in the last byte,
at a position shared by LONG_STRING.len and SMALL_STRING.len. On
little-endian architectures this is the most significant bit (MSB) of
both LONG_STRING.len and SMALL_STRING.len. On big-endian architectures
this would be the least significant bit (LSB) of both LONG_STRING.len
and SMALL_STRING.len. Since Impala can currently only be built on
little-endian architectures, this patch only adds code for such
platforms. Moreover, systems that use big endian usually support little
endian as well.
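
The little-endian case can be checked with a small sketch. All helper names
below are hypothetical, and the choice that a set bit means "small" is an
assumption made for illustration; the commit message does not say which
value of the bit selects which representation.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helpers for illustration; not taken from the patch.
constexpr unsigned char kIndicatorBit = 0x80;  // MSB of the last byte

// Returns the last of the four bytes of a 32-bit length as laid out in memory.
inline unsigned char LastByteInMemory(uint32_t len) {
  unsigned char bytes[sizeof(len)];
  std::memcpy(bytes, &len, sizeof(len));
  return bytes[sizeof(len) - 1];
}

// On little-endian machines the low-order byte is stored first, so the last
// byte of the value 1 is zero.
inline bool IsLittleEndian() { return LastByteInMemory(1u) == 0; }

// Assumption: indicator bit set means the small representation is active.
inline bool IsSmall(unsigned char last_byte) {
  return (last_byte & kIndicatorBit) != 0;
}

// Stores a small length (0..11) together with the indicator bit.
inline unsigned char EncodeSmallLen(unsigned len) {
  assert(len <= 11);
  return static_cast<unsigned char>(kIndicatorBit | len);
}

inline unsigned DecodeSmallLen(unsigned char last_byte) {
  return last_byte & 0x7Fu;
}
```

On a little-endian machine, LastByteInMemory(0x80000000u) is 0x80, i.e. the
MSB of the 32-bit length and the MSB of the one-byte small length are the
same physical bit. Under the assumption above, a long length must stay below
2^31 so that this bit remains clear.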

This patch adds SmallableString which implements the above on an
on-demand basis. I.e. all string objects start with the long
representation, then the string object can be explicitly asked to try
to smallify itself. This is because I didn't want to introduce too much
change in behavior. This way we can try to smallify only at certain
points (e.g. DeepCopy()), and we can also smallify all strings in a
tuple at once. The latter means that once we've done that for a tuple,
subsequent smallifications can return at the first small string
encountered (because we can assume that all other string slots are also
smallified).
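
The on-demand behavior and the per-tuple early exit can be sketched as
follows. SketchString and SmallifyTuple are hypothetical stand-ins, not the
SmallableString class from the patch; the sketch assumes a 64-bit
little-endian platform and that a set MSB in the last byte marks a small
string.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

static_assert(sizeof(char*) == 8, "sketch assumes 64-bit pointers");

// Hypothetical 12-byte string slot; not Impala's SmallableString.
class SketchString {
 public:
  static constexpr uint32_t kSmallLimit = 11;

  // Every string starts out in the long representation.
  SketchString(const char* ptr, uint32_t len) {
    assert(len < 0x80000000u);  // keep the indicator bit clear
    std::memcpy(bytes_, &ptr, 8);
    std::memcpy(bytes_ + 8, &len, 4);
  }

  bool IsSmall() const { return (bytes_[11] & 0x80) != 0; }

  uint32_t Len() const {
    if (IsSmall()) return bytes_[11] & 0x7Fu;
    uint32_t len;
    std::memcpy(&len, bytes_ + 8, 4);
    return len;
  }

  const char* Ptr() const {
    if (IsSmall()) return reinterpret_cast<const char*>(bytes_);
    const char* ptr;
    std::memcpy(&ptr, bytes_, 8);
    return ptr;
  }

  // Moves the characters inline if they fit. Returns true if the string is
  // in the small representation afterwards.
  bool TrySmallify() {
    if (IsSmall()) return true;
    const uint32_t len = Len();
    if (len > kSmallLimit) return false;
    unsigned char tmp[kSmallLimit];
    // Read the characters before overwriting the bytes that hold the pointer.
    if (len > 0) std::memcpy(tmp, Ptr(), len);
    if (len > 0) std::memcpy(bytes_, tmp, len);
    bytes_[11] = static_cast<unsigned char>(0x80 | len);
    return true;
  }

  std::string ToString() const { return std::string(Ptr(), Len()); }

 private:
  unsigned char bytes_[12];
};

// Smallifies all string slots of one tuple. If a small string is found, the
// tuple is assumed to have been processed already, so the loop stops early.
inline void SmallifyTuple(std::vector<SketchString>& slots) {
  for (SketchString& s : slots) {
    if (s.IsSmall()) return;
    s.TrySmallify();
  }
}
```

Note that the early exit is only valid if strings are never smallified
individually before the whole-tuple pass; otherwise a later slot could be
skipped.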

Benefits:
 * lower memory and CPU cache consumption
 * smaller serialization buffers to compress
 * less data to send over the network
 * less data to spill to disk

Measurements:
I used TPCH(30) with the following query:

  select * from lineitem a, lineitem b
  where a.l_orderkey = b.l_orderkey and
        a.l_orderkey * b.l_orderkey < 1000

The above query generates significant network traffic and does spilling.
The query selects all 16 columns out of which 6.5 columns contain small
strings. I.e. this kind of data is a good candidate for this
optimization but also not unrealistic.

This improves the following numbers:
Total query time:             5m16s    --> 3m40s
Total CPU time:               12m9s    --> 10m17s
Bytes sent over the network:  54.17 GB --> 41.76 GB
Data had to be spilled:       14.66 GB --> 9.42 GB
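
For a quick sense of scale, the relative reductions can be computed from the
numbers above. The helper names are hypothetical and only exist for this
sketch.

```cpp
#include <cassert>

// Relative reduction from 'before' to 'after', in percent.
inline double ReductionPercent(double before, double after) {
  assert(before > 0);
  return (before - after) / before * 100.0;
}

// Converts a duration given as minutes and seconds into seconds.
inline double Seconds(int minutes, int seconds) {
  return minutes * 60.0 + seconds;
}
```

With the measurements above, ReductionPercent(Seconds(5, 16), Seconds(3, 40))
is roughly 30% for total query time, about 15% for CPU time, about 23% for
bytes sent over the network, and about 36% for spilled data.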

On the standard benchmarks I measured the following:

TPC-H:
+----------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-----------------------+---------+------------+------------+----------------+
| TPCH(42) | parquet / none / none | 3.57    | -3.06%     | 2.40       | -1.34%         |
+----------+-----------------------+---------+------------+------------+----------------+

TPC-DS:
+-----------+-----------------------+---------+------------+------------+----------------+
| Workload  | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+-----------+-----------------------+---------+------------+------------+----------------+
| TPCDS(30) | parquet / none / none | 2.09    | -0.92%     | 0.76       | -1.17%         |
+-----------+-----------------------+---------+------------+------------+----------------+

Testing:
 * There was a query in 'spilling.test' that used to spill, but now it
   doesn't (at least in upstream GVO; in other environments the test
   passes). I cannot lower the buffer_pool_limit under the min
   reservation, so I cannot make it spill without IMPALA-12549, which
   will update the estimations and min reservations.
 * Other tests that wanted to spill were modified to work on larger
   data sets
 * Added a few backend tests in string-value-test
 * Added new complex types tests that have deeply nested small/long
   strings
 * Existing tests pass

Change-Id: I741c3a5f12ab620b6b64b57d4c89b5f8e056efd3
Reviewed-on: http://gerrit.cloudera.org:8080/20496
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Adjust estimations for small strings
> ------------------------------------
>
>                 Key: IMPALA-12549
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12549
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Frontend
>            Reporter: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: performance
>
> With small strings, the queries consume less memory.
> We should adjust the memory estimations / min reservations to take the small strings into account.
> At first we can be conservative, i.e. only take them into account for columns with max size less than the small string limit.
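
One possible reading of the conservative rule, as a sketch only: a string
column is charged for out-of-line character data unless its maximum size
guarantees that every value fits inline. The function name, the parameters
and the 12/11 byte constants are assumptions for illustration (the constants
follow the layout described in the commit message); the actual planner code
lives in the frontend and may look quite different.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical estimate of the bytes needed per value of a string column.
// max_len < 0 means the maximum size of the column is unknown.
inline double EstimateStringBytes(double avg_len, int64_t max_len) {
  assert(avg_len >= 0);
  constexpr int64_t kSlotSize = 12;    // fixed-size part of every string slot
  constexpr int64_t kSmallLimit = 11;  // assumed inline capacity in bytes
  // Conservative: only skip the out-of-line data if every value is
  // guaranteed to fit into the slot itself.
  const bool always_small = max_len >= 0 && max_len <= kSmallLimit;
  return kSlotSize + (always_small ? 0.0 : avg_len);
}
```

For example, a column with max size 10 would be estimated at 12 bytes per
value, while a column with average length 8 but max size 40 would still be
estimated at 20 bytes per value.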



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
