You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@impala.apache.org by Alexander Saydakov <sa...@verizonmedia.com> on 2021/08/12 20:57:34 UTC

Apache DataSketches integration

Hi Impala development community,
I am a member of Apache DataSketches team. I looked at the DataSketches
functions in Impala and I have a few questions and suggestions.

Regarding the dependency. I see that the current approach is to copy all
files from Datasketches into a single pile. Is there a better way?

I see that you are using version 3.0.0 currently. We released version 3.1.0
about a month ago. It has some important bug fixes, and a Theta sketch
feature to avoid some cost of deserialization. This feature can speed up
Theta union operations.

Regarding your approach to serializing sketches. We have two ways of
serializing and deserializing sketches: stream and bytes. You are using
bytes for deserializing, but a temporary stringstream for serializing
https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1624
just to copy the internals of this stream as bytes (using memcopy) in the
StringStreamToStringVal function
https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/datasketches-common.cc#L80
I would think it should be faster to use serialization to bytes, and the
StringStreamToStringVal function can be eliminated.

Regarding deserialization. I see in some cases that a sketch constructor is
called just to replace this instance with a deserialized one. This extra
construction seems unnecessary.
https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1819

Looking at this DsHllMerge function
https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1759
it seems that the merge is done pairwise. Is it possible to arrange this
process as init, multiple merges and finalize (serialize) at the end? It is
quite costly to initialize a union, update it with two sketches and then
call get_result(). If many such merges happen, the overhead of initializing
a fresh union and finalizing it for each pair can be substantial.

Regards,
Alexander.

Re: [E] Re: Apache DataSketches integration

Posted by Alexander Saydakov <sa...@verizonmedia.com>.
I am away for a few days. I will have a look soon.
Thank you.

On Mon, Aug 16, 2021 at 9:44 AM Quanlong Huang <hu...@gmail.com>
wrote:

> Thank Fucun for creating the JIRAs!
>
> Regarding the dependency. I see that the current approach is to copy all
>> files from Datasketches into a single pile. Is there a better way?
>
> Is there a historical reason that we don't add DataSketches into the
> native-toolchain?
>
> Regards,
> Quanlong
>
>
> On Mon, Aug 16, 2021 at 3:00 PM fucun chu <im...@gmail.com> wrote:
>
>> Hi,
>>
>> 1. Upgrade DataSketches to version 3.1.0
>> Issue tracking: https://issues.apache.org/jira/browse/IMPALA-10863
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_IMPALA-2D10863&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=2lu5xmqLIi4uz1vevF2ODxyyofcIfYI3Nup18Na-KEo&s=iBQtxz4UGDBMoYcnKDNYATQynGLPLZBstgSkA277VRo&e=>
>>
>> 2. Use bytes to serializing sketches
>> Impala currently does not support the BINARY data type, we can write
>> sketches as binary instead of strings once it's supported:
>> https://issues.apache.org/jira/browse/IMPALA-9482
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_IMPALA-2D9482&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=2lu5xmqLIi4uz1vevF2ODxyyofcIfYI3Nup18Na-KEo&s=32MbroPcSDWqe2FIkHG2aRBRa8JaOkG6LWbAFbFVcdY&e=>
>> .
>>
>> 3. Regarding deserialization
>> Using std::unique_ptr can reduce unnecessary constructor, here is to solve
>> the problem that `compact_theta_sketch` cannot be directly constructed.
>>
>> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L2104
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L2104&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=2lu5xmqLIi4uz1vevF2ODxyyofcIfYI3Nup18Na-KEo&s=i6YX1VlcX-Q-y5JX3UNYOUl_1Xtw9vMeq9teRzx7KQg&e=>
>>
>> 4. DsHllMerge function optimization
>> One solution is to use hll_union instead of hll_sketch, which provides
>> support for updating hll_sketch sketches and primitive data types, But
>> other union sketches (cpc theta) do not provide primitive type support.
>>
>> https://github.com/apache/datasketches-cpp/blob/master/hll/include/hll.hpp#L521-L581
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_datasketches-2Dcpp_blob_master_hll_include_hll.hpp-23L521-2DL581&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=2lu5xmqLIi4uz1vevF2ODxyyofcIfYI3Nup18Na-KEo&s=qed-km3TAPsXHtV0ITYtnSySRTmWflHeYwqLjVmUfdQ&e=>
>>
>> https://issues.apache.org/jira/browse/IMPALA-10864
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_IMPALA-2D10864&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=2lu5xmqLIi4uz1vevF2ODxyyofcIfYI3Nup18Na-KEo&s=4UbfiVJC5jw0Ig6qpNu7tHzoRUI0A8ZWlqlODyYzWDI&e=>
>>
>> Regards.
>> Fucun
>>
>> Alexander Saydakov <sa...@verizonmedia.com> 于2021年8月13日周五 上午4:58写道:
>>
>> > Hi Impala development community,
>> > I am a member of Apache DataSketches team. I looked at the DataSketches
>> > functions in Impala and I have a few questions and suggestions.
>> >
>> > Regarding the dependency. I see that the current approach is to copy all
>> > files from Datasketches into a single pile. Is there a better way?
>> >
>> > I see that you are using version 3.0.0 currently. We released version
>> 3.1.0
>> > about a month ago. It has some important bug fixes, and a Theta sketch
>> > feature to avoid some cost of deserialization. This feature can speed up
>> > Theta union operations.
>> >
>> > Regarding your approach to serializing sketches. We have two ways of
>> > serializing and deserializing sketches: stream and bytes. You are using
>> > bytes for deserializing, but a temporary stringstream for serializing
>> >
>> >
>> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1624
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L1624&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=2lu5xmqLIi4uz1vevF2ODxyyofcIfYI3Nup18Na-KEo&s=KXZIqKu1hXSFCI2PLEXAYueCXGUkEyHYU-_kKQLGah0&e=>
>> > just to copy the internals of this stream as bytes (using memcopy) in
>> the
>> > StringStreamToStringVal function
>> >
>> >
>> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/datasketches-common.cc#L80
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_datasketches-2Dcommon.cc-23L80&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=2lu5xmqLIi4uz1vevF2ODxyyofcIfYI3Nup18Na-KEo&s=GsWAdaRvX0pzxSsKvulW_VAU4I8cy1OGwabCdU8-7uA&e=>
>> > I would think it should be faster to use serialization to bytes, and the
>> > StringStreamToStringVal function can be eliminated.
>> >
>> > Regarding deserialization. I see in some cases that a sketch
>> constructor is
>> > called just to replace this instance with a deserialized one. This extra
>> > construction seems unnecessary.
>> >
>> >
>> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1819
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L1819&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=2lu5xmqLIi4uz1vevF2ODxyyofcIfYI3Nup18Na-KEo&s=5GZQAUF5LKvMP1QcjFxrWXWDQSdX2ICzm7POKMLi-1Y&e=>
>> >
>> > Looking at this DsHllMerge function
>> >
>> >
>> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1759
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L1759&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=2lu5xmqLIi4uz1vevF2ODxyyofcIfYI3Nup18Na-KEo&s=AVmB8Varo7CKBuIObB9AYVGbIyU1Jhmi4Xh7CkLfLJk&e=>
>> > it seems that the merge is done pairwise. Is it possible to arrange this
>> > process as init, multiple merges and finalize (serialize) at the end?
>> It is
>> > quite costly to initialize a union, update it with two sketches and then
>> > call get_result(). If many such merges happen, the overhead of
>> initializing
>> > a fresh union and finalizing it for each pair can be substantial.
>> >
>> > Regards,
>> > Alexander.
>> >
>>
>

Re: [E] Re: Apache DataSketches integration

Posted by Alexander Saydakov <sa...@verizonmedia.com>.
I submitted a pull request with some changes I tried to explain here.
https://github.com/apache/impala/pull/30

There are still open questions for me regarding:
- better dependency mechanism
- updating dependency to the latest 3.1.0
- process flow in aggregate functions (avoiding overhead of pairwise merge)


On Tue, Aug 24, 2021 at 7:36 PM Alexander Saydakov <
saydakov@verizonmedia.com> wrote:

> I am afraid that I was misunderstood regarding a few points. Let me try to
> clarify.
>
> Regarding serialization using bytes as opposed to a stream. This has
> nothing to do with BINARY data type in Impala.
> Currently I see in the Impala code something like this (simplified):
> std::stringstream tmp;
> sketch.serialize(tmp);
> std::string str = tmp.str(); // in StringStreamToStringVal
> StringVal result(context, str.size());
> memcpy(result.ptr, str.c_str(), str.size());
>
> You could do it faster like this:
> auto bytes = sketch.serialize();
> StringVal result(context, bytes.size());
> memcpy(result.ptr, bytes.data() bytes.size());
>
> Regarding unnecessary constructor during deserialization. I see a code
> like this (HLL is an example, but the pattern is the same):
> datasketches::hll_sketch src_sketch(DS_SKETCH_CONFIG, DS_HLL_TYPE); //
> construct an empty sketch, which is not needed
> DeserializeDsSketch(src, &src_sketch); // pass it into a function, which
> will replace it by an assignment (hopefully a move, not copy)
> // in the function
> *sketch = T::deserialize((void*)serialized_sketch.ptr,
> serialized_sketch.len);
>
> This can be accomplished like so avoiding unnecessary constructor:
> datasketches::hll_sketch src_sketch =
> datasketches::hll_sketch::deserialize((void*)serialized_sketch.ptr,
> serialized_sketch.len);
>
> Regarding merge (not just HLL). I am talking about the overall process
> flow. Looking at this UDA documentation:
>
> https://github.com/apache/impala/blob/b1ca0894462e4b9c040fda41622f8b97b1ec15e1/be/src/udf/udf.h#L340
> it seems to me that it should be possible to have a flow like Init(),
> Merge() ... Merge(), Finalize()
> So why the implementation here finalizes each time?:
>
> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1759
> Instantiating a union for each pairwise merge and especially calling
> get_result() has a lot of overhead.
>
>
>
> On Wed, Aug 18, 2021 at 12:49 AM Gabor Kaszab <ga...@apache.org>
> wrote:
>
>> Hey Quanlong,
>> Initially I added the first library as copying all the required files
>> into the same dir. It was ~1.5 years ago so my memories are faint but as I
>> remember there was an issue with the library having compilation issues if
>> we kept the structure of the files. As a Quick workaround came the idea to
>> copy the files to a common dir just to unblock us and start working on
>> adding sketching functionalities.
>> This was the first commit: https://gerrit.cloudera.org/#/c/15746/
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__gerrit.cloudera.org_-23_c_15746_&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=Gux0Patz_D0kWZBeFg7ElKCiUah7Q9_PLHogwHpGZuk&e=>
>> We can open a Jira for sure to move the Datasketches library to toolchain
>> and let's see if anyone has some free cycles to take care of that.
>> https://issues.apache.org/jira/browse/IMPALA-10868
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_IMPALA-2D10868&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=ixGxYvTuy2tUFFHn2T7Y8Mhev5G5f2D9_LdIMlWD-ZM&e=>
>>
>> Cheers,
>> Gabor
>>
>> On Mon, Aug 16, 2021 at 3:44 PM Quanlong Huang <hu...@gmail.com>
>> wrote:
>>
>>> Thank Fucun for creating the JIRAs!
>>>
>>> Regarding the dependency. I see that the current approach is to copy all
>>> > files from Datasketches into a single pile. Is there a better way?
>>>
>>> Is there a historical reason that we don't add DataSketches into the
>>> native-toolchain?
>>>
>>> Regards,
>>> Quanlong
>>>
>>>
>>> On Mon, Aug 16, 2021 at 3:00 PM fucun chu <im...@gmail.com> wrote:
>>>
>>> > Hi,
>>> >
>>> > 1. Upgrade DataSketches to version 3.1.0
>>> > Issue tracking: https://issues.apache.org/jira/browse/IMPALA-10863
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_IMPALA-2D10863&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=hT4PA0h2zT7dWD90xKlsww9OV5B8BrLbDtUiCfrxc24&e=>
>>> >
>>> > 2. Use bytes to serializing sketches
>>> > Impala currently does not support the BINARY data type, we can write
>>> > sketches as binary instead of strings once it's supported:
>>> > https://issues.apache.org/jira/browse/IMPALA-9482
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_IMPALA-2D9482&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=XUz0h0DnWnGcE0Y3LQ1efRqEnrkWERW5jbdT9piEmE8&e=>
>>> .
>>> >
>>> > 3. Regarding deserialization
>>> > Using std::unique_ptr can reduce unnecessary constructor, here is to
>>> solve
>>> > the problem that `compact_theta_sketch` cannot be directly constructed.
>>> >
>>> >
>>> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L2104
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L2104&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=a4LLYpJUgaAFcpgL9zFKqHeKgVffcERTdJP2623G-TQ&e=>
>>> >
>>> > 4. DsHllMerge function optimization
>>> > One solution is to use hll_union instead of hll_sketch, which provides
>>> > support for updating hll_sketch sketches and primitive data types, But
>>> > other union sketches (cpc theta) do not provide primitive type support.
>>> >
>>> >
>>> https://github.com/apache/datasketches-cpp/blob/master/hll/include/hll.hpp#L521-L581
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_datasketches-2Dcpp_blob_master_hll_include_hll.hpp-23L521-2DL581&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=lkmVnnuioQT2zQgmBUbcts8c86LUALm2Oj6eSYogqfY&e=>
>>> >
>>> > https://issues.apache.org/jira/browse/IMPALA-10864
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_IMPALA-2D10864&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=rbOAZsy1yN_8nMOmcr4BbDEOS67DSBeE1ArHwnZXsR0&e=>
>>> >
>>> > Regards.
>>> > Fucun
>>> >
>>> > Alexander Saydakov <sa...@verizonmedia.com> 于2021年8月13日周五 上午4:58写道:
>>> >
>>> > > Hi Impala development community,
>>> > > I am a member of Apache DataSketches team. I looked at the
>>> DataSketches
>>> > > functions in Impala and I have a few questions and suggestions.
>>> > >
>>> > > Regarding the dependency. I see that the current approach is to copy
>>> all
>>> > > files from Datasketches into a single pile. Is there a better way?
>>> > >
>>> > > I see that you are using version 3.0.0 currently. We released version
>>> > 3.1.0
>>> > > about a month ago. It has some important bug fixes, and a Theta
>>> sketch
>>> > > feature to avoid some cost of deserialization. This feature can
>>> speed up
>>> > > Theta union operations.
>>> > >
>>> > > Regarding your approach to serializing sketches. We have two ways of
>>> > > serializing and deserializing sketches: stream and bytes. You are
>>> using
>>> > > bytes for deserializing, but a temporary stringstream for serializing
>>> > >
>>> > >
>>> >
>>> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1624
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L1624&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=gR9Ktn2vYHkI1ey_3E5_C6Ox9DU61vHxn5T6qH8wmHM&e=>
>>> > > just to copy the internals of this stream as bytes (using memcopy)
>>> in the
>>> > > StringStreamToStringVal function
>>> > >
>>> > >
>>> >
>>> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/datasketches-common.cc#L80
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_datasketches-2Dcommon.cc-23L80&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=4SRfbNoCr2si6yIXDPrXS6hN__TGtw9U1hbmnj85I1c&e=>
>>> > > I would think it should be faster to use serialization to bytes, and
>>> the
>>> > > StringStreamToStringVal function can be eliminated.
>>> > >
>>> > > Regarding deserialization. I see in some cases that a sketch
>>> constructor
>>> > is
>>> > > called just to replace this instance with a deserialized one. This
>>> extra
>>> > > construction seems unnecessary.
>>> > >
>>> > >
>>> >
>>> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1819
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L1819&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=NbstsM1-dJh1LimtdmtM98ohuL6WtiulY47qGopnmFE&e=>
>>> > >
>>> > > Looking at this DsHllMerge function
>>> > >
>>> > >
>>> >
>>> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1759
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L1759&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=Yj4JWLjD9OCNAuLI2ZcTb-EZj0StX4lJ8ODziVrgbxg&e=>
>>> > > it seems that the merge is done pairwise. Is it possible to arrange
>>> this
>>> > > process as init, multiple merges and finalize (serialize) at the
>>> end? It
>>> > is
>>> > > quite costly to initialize a union, update it with two sketches and
>>> then
>>> > > call get_result(). If many such merges happen, the overhead of
>>> > initializing
>>> > > a fresh union and finalizing it for each pair can be substantial.
>>> > >
>>> > > Regards,
>>> > > Alexander.
>>> > >
>>> >
>>>
>>

Re: [E] Re: Apache DataSketches integration

Posted by Alexander Saydakov <sa...@verizonmedia.com>.
I am afraid that I was misunderstood regarding a few points. Let me try to
clarify.

Regarding serialization using bytes as opposed to a stream. This has
nothing to do with BINARY data type in Impala.
Currently I see in the Impala code something like this (simplified):
std::stringstream tmp;
sketch.serialize(tmp);
std::string str = tmp.str(); // in StringStreamToStringVal
StringVal result(context, str.size());
memcpy(result.ptr, str.c_str(), str.size());

You could do it faster like this:
auto bytes = sketch.serialize();
StringVal result(context, bytes.size());
memcpy(result.ptr, bytes.data() bytes.size());

Regarding unnecessary constructor during deserialization. I see a code like
this (HLL is an example, but the pattern is the same):
datasketches::hll_sketch src_sketch(DS_SKETCH_CONFIG, DS_HLL_TYPE); //
construct an empty sketch, which is not needed
DeserializeDsSketch(src, &src_sketch); // pass it into a function, which
will replace it by an assignment (hopefully a move, not copy)
// in the function
*sketch = T::deserialize((void*)serialized_sketch.ptr,
serialized_sketch.len);

This can be accomplished like so avoiding unnecessary constructor:
datasketches::hll_sketch src_sketch =
datasketches::hll_sketch::deserialize((void*)serialized_sketch.ptr,
serialized_sketch.len);

Regarding merge (not just HLL). I am talking about the overall process
flow. Looking at this UDA documentation:
https://github.com/apache/impala/blob/b1ca0894462e4b9c040fda41622f8b97b1ec15e1/be/src/udf/udf.h#L340
it seems to me that it should be possible to have a flow like Init(),
Merge() ... Merge(), Finalize()
So why the implementation here finalizes each time?:
https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1759
Instantiating a union for each pairwise merge and especially calling
get_result() has a lot of overhead.



On Wed, Aug 18, 2021 at 12:49 AM Gabor Kaszab <ga...@apache.org>
wrote:

> Hey Quanlong,
> Initially I added the first library as copying all the required files into
> the same dir. It was ~1.5 years ago so my memories are faint but as I
> remember there was an issue with the library having compilation issues if
> we kept the structure of the files. As a Quick workaround came the idea to
> copy the files to a common dir just to unblock us and start working on
> adding sketching functionalities.
> This was the first commit: https://gerrit.cloudera.org/#/c/15746/
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__gerrit.cloudera.org_-23_c_15746_&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=Gux0Patz_D0kWZBeFg7ElKCiUah7Q9_PLHogwHpGZuk&e=>
> We can open a Jira for sure to move the Datasketches library to toolchain
> and let's see if anyone has some free cycles to take care of that.
> https://issues.apache.org/jira/browse/IMPALA-10868
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_IMPALA-2D10868&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=ixGxYvTuy2tUFFHn2T7Y8Mhev5G5f2D9_LdIMlWD-ZM&e=>
>
> Cheers,
> Gabor
>
> On Mon, Aug 16, 2021 at 3:44 PM Quanlong Huang <hu...@gmail.com>
> wrote:
>
>> Thank Fucun for creating the JIRAs!
>>
>> Regarding the dependency. I see that the current approach is to copy all
>> > files from Datasketches into a single pile. Is there a better way?
>>
>> Is there a historical reason that we don't add DataSketches into the
>> native-toolchain?
>>
>> Regards,
>> Quanlong
>>
>>
>> On Mon, Aug 16, 2021 at 3:00 PM fucun chu <im...@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > 1. Upgrade DataSketches to version 3.1.0
>> > Issue tracking: https://issues.apache.org/jira/browse/IMPALA-10863
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_IMPALA-2D10863&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=hT4PA0h2zT7dWD90xKlsww9OV5B8BrLbDtUiCfrxc24&e=>
>> >
>> > 2. Use bytes to serializing sketches
>> > Impala currently does not support the BINARY data type, we can write
>> > sketches as binary instead of strings once it's supported:
>> > https://issues.apache.org/jira/browse/IMPALA-9482
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_IMPALA-2D9482&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=XUz0h0DnWnGcE0Y3LQ1efRqEnrkWERW5jbdT9piEmE8&e=>
>> .
>> >
>> > 3. Regarding deserialization
>> > Using std::unique_ptr can reduce unnecessary constructor, here is to
>> solve
>> > the problem that `compact_theta_sketch` cannot be directly constructed.
>> >
>> >
>> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L2104
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L2104&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=a4LLYpJUgaAFcpgL9zFKqHeKgVffcERTdJP2623G-TQ&e=>
>> >
>> > 4. DsHllMerge function optimization
>> > One solution is to use hll_union instead of hll_sketch, which provides
>> > support for updating hll_sketch sketches and primitive data types, But
>> > other union sketches (cpc theta) do not provide primitive type support.
>> >
>> >
>> https://github.com/apache/datasketches-cpp/blob/master/hll/include/hll.hpp#L521-L581
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_datasketches-2Dcpp_blob_master_hll_include_hll.hpp-23L521-2DL581&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=lkmVnnuioQT2zQgmBUbcts8c86LUALm2Oj6eSYogqfY&e=>
>> >
>> > https://issues.apache.org/jira/browse/IMPALA-10864
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_IMPALA-2D10864&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=rbOAZsy1yN_8nMOmcr4BbDEOS67DSBeE1ArHwnZXsR0&e=>
>> >
>> > Regards.
>> > Fucun
>> >
>> > Alexander Saydakov <sa...@verizonmedia.com> 于2021年8月13日周五 上午4:58写道:
>> >
>> > > Hi Impala development community,
>> > > I am a member of Apache DataSketches team. I looked at the
>> DataSketches
>> > > functions in Impala and I have a few questions and suggestions.
>> > >
>> > > Regarding the dependency. I see that the current approach is to copy
>> all
>> > > files from Datasketches into a single pile. Is there a better way?
>> > >
>> > > I see that you are using version 3.0.0 currently. We released version
>> > 3.1.0
>> > > about a month ago. It has some important bug fixes, and a Theta sketch
>> > > feature to avoid some cost of deserialization. This feature can speed
>> up
>> > > Theta union operations.
>> > >
>> > > Regarding your approach to serializing sketches. We have two ways of
>> > > serializing and deserializing sketches: stream and bytes. You are
>> using
>> > > bytes for deserializing, but a temporary stringstream for serializing
>> > >
>> > >
>> >
>> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1624
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L1624&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=gR9Ktn2vYHkI1ey_3E5_C6Ox9DU61vHxn5T6qH8wmHM&e=>
>> > > just to copy the internals of this stream as bytes (using memcopy) in
>> the
>> > > StringStreamToStringVal function
>> > >
>> > >
>> >
>> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/datasketches-common.cc#L80
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_datasketches-2Dcommon.cc-23L80&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=4SRfbNoCr2si6yIXDPrXS6hN__TGtw9U1hbmnj85I1c&e=>
>> > > I would think it should be faster to use serialization to bytes, and
>> the
>> > > StringStreamToStringVal function can be eliminated.
>> > >
>> > > Regarding deserialization. I see in some cases that a sketch
>> constructor
>> > is
>> > > called just to replace this instance with a deserialized one. This
>> extra
>> > > construction seems unnecessary.
>> > >
>> > >
>> >
>> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1819
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L1819&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=NbstsM1-dJh1LimtdmtM98ohuL6WtiulY47qGopnmFE&e=>
>> > >
>> > > Looking at this DsHllMerge function
>> > >
>> > >
>> >
>> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1759
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L1759&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=Yj4JWLjD9OCNAuLI2ZcTb-EZj0StX4lJ8ODziVrgbxg&e=>
>> > > it seems that the merge is done pairwise. Is it possible to arrange
>> this
>> > > process as init, multiple merges and finalize (serialize) at the end?
>> It
>> > is
>> > > quite costly to initialize a union, update it with two sketches and
>> then
>> > > call get_result(). If many such merges happen, the overhead of
>> > initializing
>> > > a fresh union and finalizing it for each pair can be substantial.
>> > >
>> > > Regards,
>> > > Alexander.
>> > >
>> >
>>
>

Re: Apache DataSketches integration

Posted by Gabor Kaszab <ga...@apache.org>.
Hey Quanlong,
Initially I added the first library as copying all the required files into
the same dir. It was ~1.5 years ago so my memories are faint but as I
remember there was an issue with the library having compilation issues if
we kept the structure of the files. As a Quick workaround came the idea to
copy the files to a common dir just to unblock us and start working on
adding sketching functionalities.
This was the first commit: https://gerrit.cloudera.org/#/c/15746/
We can open a Jira for sure to move the Datasketches library to toolchain
and let's see if anyone has some free cycles to take care of that.
https://issues.apache.org/jira/browse/IMPALA-10868

Cheers,
Gabor

On Mon, Aug 16, 2021 at 3:44 PM Quanlong Huang <hu...@gmail.com>
wrote:

> Thank Fucun for creating the JIRAs!
>
> Regarding the dependency. I see that the current approach is to copy all
> > files from Datasketches into a single pile. Is there a better way?
>
> Is there a historical reason that we don't add DataSketches into the
> native-toolchain?
>
> Regards,
> Quanlong
>
>
> On Mon, Aug 16, 2021 at 3:00 PM fucun chu <im...@gmail.com> wrote:
>
> > Hi,
> >
> > 1. Upgrade DataSketches to version 3.1.0
> > Issue tracking: https://issues.apache.org/jira/browse/IMPALA-10863
> >
> > 2. Use bytes to serializing sketches
> > Impala currently does not support the BINARY data type, we can write
> > sketches as binary instead of strings once it's supported:
> > https://issues.apache.org/jira/browse/IMPALA-9482.
> >
> > 3. Regarding deserialization
> > Using std::unique_ptr can reduce unnecessary constructor, here is to
> solve
> > the problem that `compact_theta_sketch` cannot be directly constructed.
> >
> >
> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L2104
> >
> > 4. DsHllMerge function optimization
> > One solution is to use hll_union instead of hll_sketch, which provides
> > support for updating hll_sketch sketches and primitive data types, But
> > other union sketches (cpc theta) do not provide primitive type support.
> >
> >
> https://github.com/apache/datasketches-cpp/blob/master/hll/include/hll.hpp#L521-L581
> >
> > https://issues.apache.org/jira/browse/IMPALA-10864
> >
> > Regards.
> > Fucun
> >
> > Alexander Saydakov <sa...@verizonmedia.com> 于2021年8月13日周五 上午4:58写道:
> >
> > > Hi Impala development community,
> > > I am a member of Apache DataSketches team. I looked at the DataSketches
> > > functions in Impala and I have a few questions and suggestions.
> > >
> > > Regarding the dependency. I see that the current approach is to copy
> all
> > > files from Datasketches into a single pile. Is there a better way?
> > >
> > > I see that you are using version 3.0.0 currently. We released version
> > 3.1.0
> > > about a month ago. It has some important bug fixes, and a Theta sketch
> > > feature to avoid some cost of deserialization. This feature can speed
> up
> > > Theta union operations.
> > >
> > > Regarding your approach to serializing sketches. We have two ways of
> > > serializing and deserializing sketches: stream and bytes. You are using
> > > bytes for deserializing, but a temporary stringstream for serializing
> > >
> > >
> >
> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1624
> > > just to copy the internals of this stream as bytes (using memcopy) in
> the
> > > StringStreamToStringVal function
> > >
> > >
> >
> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/datasketches-common.cc#L80
> > > I would think it should be faster to use serialization to bytes, and
> the
> > > StringStreamToStringVal function can be eliminated.
> > >
> > > Regarding deserialization. I see in some cases that a sketch
> constructor
> > is
> > > called just to replace this instance with a deserialized one. This
> extra
> > > construction seems unnecessary.
> > >
> > >
> >
> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1819
> > >
> > > Looking at this DsHllMerge function
> > >
> > >
> >
> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1759
> > > it seems that the merge is done pairwise. Is it possible to arrange
> this
> > > process as init, multiple merges and finalize (serialize) at the end?
> It
> > is
> > > quite costly to initialize a union, update it with two sketches and
> then
> > > call get_result(). If many such merges happen, the overhead of
> > initializing
> > > a fresh union and finalizing it for each pair can be substantial.
> > >
> > > Regards,
> > > Alexander.
> > >
> >
>

Re: Apache DataSketches integration

Posted by Quanlong Huang <hu...@gmail.com>.
Thank Fucun for creating the JIRAs!

Regarding the dependency. I see that the current approach is to copy all
> files from Datasketches into a single pile. Is there a better way?

Is there a historical reason that we don't add DataSketches into the
native-toolchain?

Regards,
Quanlong


On Mon, Aug 16, 2021 at 3:00 PM fucun chu <im...@gmail.com> wrote:

> Hi,
>
> 1. Upgrade DataSketches to version 3.1.0
> Issue tracking: https://issues.apache.org/jira/browse/IMPALA-10863
>
> 2. Use bytes to serializing sketches
> Impala currently does not support the BINARY data type, we can write
> sketches as binary instead of strings once it's supported:
> https://issues.apache.org/jira/browse/IMPALA-9482.
>
> 3. Regarding deserialization
> Using std::unique_ptr can reduce unnecessary constructor, here is to solve
> the problem that `compact_theta_sketch` cannot be directly constructed.
>
> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L2104
>
> 4. DsHllMerge function optimization
> One solution is to use hll_union instead of hll_sketch, which provides
> support for updating hll_sketch sketches and primitive data types, But
> other union sketches (cpc theta) do not provide primitive type support.
>
> https://github.com/apache/datasketches-cpp/blob/master/hll/include/hll.hpp#L521-L581
>
> https://issues.apache.org/jira/browse/IMPALA-10864
>
> Regards.
> Fucun
>
> Alexander Saydakov <sa...@verizonmedia.com> 于2021年8月13日周五 上午4:58写道:
>
> > Hi Impala development community,
> > I am a member of Apache DataSketches team. I looked at the DataSketches
> > functions in Impala and I have a few questions and suggestions.
> >
> > Regarding the dependency. I see that the current approach is to copy all
> > files from Datasketches into a single pile. Is there a better way?
> >
> > I see that you are using version 3.0.0 currently. We released version
> 3.1.0
> > about a month ago. It has some important bug fixes, and a Theta sketch
> > feature to avoid some cost of deserialization. This feature can speed up
> > Theta union operations.
> >
> > Regarding your approach to serializing sketches. We have two ways of
> > serializing and deserializing sketches: stream and bytes. You are using
> > bytes for deserializing, but a temporary stringstream for serializing
> >
> >
> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1624
> > just to copy the internals of this stream as bytes (using memcopy) in the
> > StringStreamToStringVal function
> >
> >
> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/datasketches-common.cc#L80
> > I would think it should be faster to use serialization to bytes, and the
> > StringStreamToStringVal function can be eliminated.
> >
> > Regarding deserialization. I see in some cases that a sketch constructor
> is
> > called just to replace this instance with a deserialized one. This extra
> > construction seems unnecessary.
> >
> >
> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1819
> >
> > Looking at this DsHllMerge function
> >
> >
> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1759
> > it seems that the merge is done pairwise. Is it possible to arrange this
> > process as init, multiple merges and finalize (serialize) at the end? It
> is
> > quite costly to initialize a union, update it with two sketches and then
> > call get_result(). If many such merges happen, the overhead of
> initializing
> > a fresh union and finalizing it for each pair can be substantial.
> >
> > Regards,
> > Alexander.
> >
>

Re: Apache DataSketches integration

Posted by fucun chu <im...@gmail.com>.
Hi,

1. Upgrade DataSketches to version 3.1.0
Issue tracking: https://issues.apache.org/jira/browse/IMPALA-10863

2. Use bytes to serializing sketches
Impala currently does not support the BINARY data type, we can write
sketches as binary instead of strings once it's supported:
https://issues.apache.org/jira/browse/IMPALA-9482.

3. Regarding deserialization
Using std::unique_ptr can reduce unnecessary constructor, here is to solve
the problem that `compact_theta_sketch` cannot be directly constructed.
https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L2104

4. DsHllMerge function optimization
One solution is to use hll_union instead of hll_sketch, which provides
support for updating hll_sketch sketches and primitive data types, But
other union sketches (cpc theta) do not provide primitive type support.
https://github.com/apache/datasketches-cpp/blob/master/hll/include/hll.hpp#L521-L581

https://issues.apache.org/jira/browse/IMPALA-10864

Regards.
Fucun

Alexander Saydakov <sa...@verizonmedia.com> 于2021年8月13日周五 上午4:58写道:

> Hi Impala development community,
> I am a member of Apache DataSketches team. I looked at the DataSketches
> functions in Impala and I have a few questions and suggestions.
>
> Regarding the dependency. I see that the current approach is to copy all
> files from Datasketches into a single pile. Is there a better way?
>
> I see that you are using version 3.0.0 currently. We released version 3.1.0
> about a month ago. It has some important bug fixes, and a Theta sketch
> feature to avoid some cost of deserialization. This feature can speed up
> Theta union operations.
>
> Regarding your approach to serializing sketches. We have two ways of
> serializing and deserializing sketches: stream and bytes. You are using
> bytes for deserializing, but a temporary stringstream for serializing
>
> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1624
> just to copy the internals of this stream as bytes (using memcopy) in the
> StringStreamToStringVal function
>
> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/datasketches-common.cc#L80
> I would think it should be faster to use serialization to bytes, and the
> StringStreamToStringVal function can be eliminated.
>
> Regarding deserialization. I see in some cases that a sketch constructor is
> called just to replace this instance with a deserialized one. This extra
> construction seems unnecessary.
>
> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1819
>
> Looking at this DsHllMerge function
>
> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1759
> it seems that the merge is done pairwise. Is it possible to arrange this
> process as init, multiple merges and finalize (serialize) at the end? It is
> quite costly to initialize a union, update it with two sketches and then
> call get_result(). If many such merges happen, the overhead of initializing
> a fresh union and finalizing it for each pair can be substantial.
>
> Regards,
> Alexander.
>