Posted to issues-all@impala.apache.org by "Csaba Ringhofer (JIRA)" <ji...@apache.org> on 2019/04/19 15:21:00 UTC

[jira] [Updated] (IMPALA-340) Improve internal format of strings

     [ https://issues.apache.org/jira/browse/IMPALA-340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Csaba Ringhofer updated IMPALA-340:
-----------------------------------
    Description: 
We currently store string data outside of a Tuple, with the string slot taking up 16 bytes (4 bytes length, 8 bytes pointer, -4 bytes padding- (UPDATE: IMPALA-7367 removed the padding)), which is hugely wasteful.
We need 2 improvements:
* a more compact string slot: Intel architectures only use 48 bits of a 64-bit address, and strings are usually smaller than 64K, so the length fits in 16 bits; when that holds, we should pack a string slot into 64 bits total
* in-line representation of strings: schemas we've seen often use strings as ids (which then also show up as foreign keys and are used heavily in joins), and those are typically smaller than 8 bytes; in that case, we could simply store the actual data in the string slot itself
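
The compact slot in the first item could be sketched roughly as follows. This is only an illustration under stated assumptions, not Impala's actual StringValue: the class name PackedStringSlot and its methods are hypothetical, and the sketch assumes user-space addresses whose upper 16 bits are zero (addresses in the upper canonical half would need sign extension on unpacking). Strings that do not fit are rejected, so a caller would need a fallback representation.

```cpp
#include <cstdint>

// Hypothetical 64-bit string slot: the low 48 bits hold the pointer and
// the high 16 bits hold the length. Illustration only, not Impala code.
class PackedStringSlot {
 public:
  static constexpr int kPtrBits = 48;
  static constexpr uint64_t kPtrMask = (uint64_t{1} << kPtrBits) - 1;
  static constexpr uint64_t kMaxLen = (uint64_t{1} << 16) - 1;  // 65535

  // Packs ptr and len into *out. Returns false (leaving *out unchanged)
  // if the length needs more than 16 bits or the address more than 48.
  static bool TryPack(const char* ptr, uint64_t len, PackedStringSlot* out) {
    const uint64_t addr = reinterpret_cast<uintptr_t>(ptr);
    if (len > kMaxLen || (addr & ~kPtrMask) != 0) return false;
    out->bits_ = (len << kPtrBits) | addr;
    return true;
  }

  // Recovers the pointer. Assumes the upper 16 address bits were zero.
  const char* ptr() const {
    return reinterpret_cast<const char*>(
        static_cast<uintptr_t>(bits_ & kPtrMask));
  }

  // Recovers the length from the high 16 bits.
  uint32_t len() const { return static_cast<uint32_t>(bits_ >> kPtrBits); }

 private:
  uint64_t bits_ = 0;
};

static_assert(sizeof(PackedStringSlot) == 8, "slot must be 64 bits");
```

Packing costs a shift and a mask on every access, which is the kind of trade-off a benchmark would have to quantify.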

See benchmarks/string-benchmark.cc.

See IMP-148 for more details.
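
The in-line representation proposed above might look roughly like the sketch below. It is an assumption-laden illustration, not Impala's implementation: InlineStringSlot, TryStore and SlotsEqual are hypothetical names, and the layout (one length byte plus up to 7 data bytes) is just one way to fit strings smaller than 8 bytes into a 64-bit slot. Unused bytes are zeroed so that two equal strings produce identical slot bits.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical 8-byte slot that stores a short string in place:
// one length byte followed by up to 7 bytes of data (no NUL terminator).
struct InlineStringSlot {
  static constexpr size_t kMaxInlineLen = 7;

  uint8_t len;               // number of valid bytes in data
  char data[kMaxInlineLen];  // string bytes, zero-padded

  // Copies n bytes from src into *out. Returns false if n exceeds the
  // in-line capacity; the caller would then use an out-of-line string.
  static bool TryStore(const char* src, size_t n, InlineStringSlot* out) {
    if (n > kMaxInlineLen) return false;
    out->len = static_cast<uint8_t>(n);
    std::memset(out->data, 0, sizeof(out->data));  // deterministic padding
    if (n > 0) std::memcpy(out->data, src, n);
    return true;
  }

  std::string ToString() const { return std::string(data, len); }
};

static_assert(sizeof(InlineStringSlot) == 8, "slot must be 64 bits");

// With zeroed padding, equality is a single 64-bit integer comparison.
inline bool SlotsEqual(const InlineStringSlot& a, const InlineStringSlot& b) {
  uint64_t x, y;
  std::memcpy(&x, &a, sizeof(x));
  std::memcpy(&y, &b, sizeof(y));
  return x == y;
}
```

For id-like join keys, a comparison of this kind avoids dereferencing a pointer, which may be where most of the benefit would come from.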

  was:
We currently store string data outside of a Tuple, with the string slot taking up 16 bytes (4 bytes length, 8 bytes pointer, 4 bytes padding), which is hugely wasteful.
We need 2 improvements:
* a more compact string slot: Intel architectures only use 48 bits of a 64-bit address, and strings are usually smaller than 64K, so the length fits in 16 bits; when that holds, we should pack a string slot into 64 bits total
* in-line representation of strings: schemas we've seen often use strings as ids (which then also show up as foreign keys and are used heavily in joins), and those are typically smaller than 8 bytes; in that case, we could simply store the actual data in the string slot itself

See benchmarks/string-benchmark.cc.

See IMP-148 for more details.


> Improve internal format of strings
> ----------------------------------
>
>                 Key: IMPALA-340
>                 URL: https://issues.apache.org/jira/browse/IMPALA-340
>             Project: IMPALA
>          Issue Type: Task
>          Components: Backend
>    Affects Versions: Impala 1.0
>            Reporter: Nong Li
>            Priority: Minor
>              Labels: performance
>
> We currently store string data outside of a Tuple, with the string slot taking up 16 bytes (4 bytes length, 8 bytes pointer, -4 bytes padding- (UPDATE: IMPALA-7367 removed the padding)), which is hugely wasteful.
> We need 2 improvements:
> * a more compact string slot: Intel architectures only use 48 bits of a 64-bit address, and strings are usually smaller than 64K, so the length fits in 16 bits; when that holds, we should pack a string slot into 64 bits total
> * in-line representation of strings: schemas we've seen often use strings as ids (which then also show up as foreign keys and are used heavily in joins), and those are typically smaller than 8 bytes; in that case, we could simply store the actual data in the string slot itself
> See benchmarks/string-benchmark.cc.
> See IMP-148 for more details.


