Posted to issues-all@impala.apache.org by "Zoltán Borók-Nagy (Jira)" <ji...@apache.org> on 2023/10/03 15:11:00 UTC

[jira] [Created] (IMPALA-12479) Adaptive rowbatch serialization for duplicated STRING values

Zoltán Borók-Nagy created IMPALA-12479:
------------------------------------------

Summary: Adaptive rowbatch serialization for duplicated STRING values
Key: IMPALA-12479
URL: https://issues.apache.org/jira/browse/IMPALA-12479
Project: IMPALA
Issue Type: Bug
Components: Backend
Reporter: Zoltán Borók-Nagy

When we serialize row batches, or put them into BufferedTupleStreams, we always deep copy every tuple into a flat structure. This means we write the fixed-length fields first, followed by all the varlen fields.

For low-NDV strings this means we make a lot of copies of the same string values.

There is de-duplication in RowBatch::serialize(), but it only applies to whole tuples, not slots:

[https://github.com/apache/impala/blob/ae14d78c8f2e8366e67aa8c39f0c02c60862905e/be/src/runtime/row-batch.cc#L234]

We could try to implement adaptive de-duplication of adjacent STRING slots. We would sample the string slots of the first N (e.g. 10000) tuples and check the ratio of adjacent identical string values. Memory overhead should be negligible, as we would use a single counter per string slot.

If we find that the ratio is quite high, e.g. >0.5, then we would de-duplicate the affected string slots, i.e. adjacent identical string values would reuse the previous string value's pointer.
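
The pointer reuse could look roughly like the sketch below. It is only an illustration under simplifying assumptions: SerializedString, SerializedColumn and SerializeStrings are hypothetical names, a single string column is serialized on its own, and an offset/length pair stands in for Impala's actual string slot layout.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a serialized string slot: an offset into the
// var-len buffer plus a length (not Impala's real StringValue layout).
struct SerializedString {
  std::uint32_t offset;
  std::uint32_t len;
};

struct SerializedColumn {
  std::vector<SerializedString> slots;  // one fixed-length entry per tuple
  std::string varlen_data;              // flat var-len buffer
};

// Serializes one string column. With dedup_adjacent set, a value equal to
// the one in the previous tuple is not copied again; its slot points at the
// bytes already written for the previous tuple.
SerializedColumn SerializeStrings(const std::vector<std::string_view>& values,
                                  bool dedup_adjacent) {
  SerializedColumn out;
  out.slots.reserve(values.size());
  for (std::size_t i = 0; i < values.size(); ++i) {
    if (dedup_adjacent && i > 0 && values[i] == values[i - 1]) {
      const SerializedString prev = out.slots.back();
      out.slots.push_back(prev);  // reuse the previous offset and length
      continue;
    }
    const SerializedString slot{
        static_cast<std::uint32_t>(out.varlen_data.size()),
        static_cast<std::uint32_t>(values[i].size())};
    out.varlen_data.append(values[i]);  // deep copy of the string bytes
    out.slots.push_back(slot);
  }
  return out;
}
```

For the values "file1", "file1", "file1", "file2" this writes 10 bytes of var-len data with de-duplication enabled and 20 bytes without it, while the fixed-length part stays the same size.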

This optimization could improve compression times a lot (because we would need to compress much smaller buffers), and memory consumption of non-SCAN fragments (which always work on duplicated string data).

This could be very beneficial for any low-NDV string column, and extremely useful for Iceberg position delete records, where we have a lot of adjacent duplicated long strings.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org