Posted to users@datasketches.apache.org by Gabor Kaszab <ga...@apache.org> on 2020/04/27 13:19:19 UTC

Apache Impala integration with DataSketches HLL (C++)

Hey,

I'm an Apache Impala (a distributed, fast SQL query engine for big data)
contributor and recently started working on pulling in HLL sketching from
DataSketches. I managed to put a PoC together where Impala runs a
count(distinct) estimate on a column of a table and, in the background,
uses the DataSketches HLL algorithm from apache/incubator-datasketches-cpp
to produce the results.

My first question: given that the order in which items are provided to
datasketches::hll_sketch is not deterministic, is it normal behaviour that
I get a different estimate for the same dataset each time I run my query?
I'm trying to figure out whether this is due to an issue in my code or is a
normal characteristic of the DataSketches C++ library.

My second question: if Hive uses the Hive connectors from DataSketches and
Impala uses the provided C++ library, is it guaranteed that a sketch written
by either of these systems can be read correctly by the other? I see binary
compatibility mentioned on the official web page; I just wanted to
double-check whether there are any exceptions to this.

Cheers,
Gabor

Re: Apache Impala integration with DataSketches HLL (C++)

Posted by leerho <le...@gmail.com>.
Hi Gabor,

> My quick question would be that taking into account that the order of the
> items provided to datasketches:hll_sketch is not deterministic is it normal
> behaviour that for the same dataset I get a different estimate each time I
> run my query?
> I'm trying to figure out if this is due to some issues with my code or
> normal characteristics of the C++ library of DataSketches.


Please refer to our documentation, specifically where we discuss
Data Insensitivity
<https://datasketches.apache.org/docs/Architecture/SketchCriteria.html>
and Order Sensitivity
<https://datasketches.apache.org/docs/Architecture/OrderSensitivity.html>.

In general, because sketches are probabilistic algorithms and often depend
on internal randomization, Absolute Order Insensitivity (AOI) is not
guaranteed. What is guaranteed is that the result will be within the
specified error bounds with the specified confidence, which is what we call
Bounded Order Insensitivity (BOI).  Even though some of the sketch
algorithms can be AOI under certain conditions, we do not recommend that
you depend on that in your testing.  Instead of comparing exactly against
some previous estimate, it is better to check that your new estimate is
within the specified error bounds and confidence interval for that sketch
and its configuration.
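
As an illustration of that kind of test, here is a minimal C++ sketch. The
struct and both helper functions are hypothetical names made up for this
example and are not part of the DataSketches API. It assumes the estimate and
its bounds have already been read from the sketch (in datasketches-cpp these
would presumably come from get_estimate(), get_lower_bound() and
get_upper_bound() on hll_sketch; check the header for the exact signatures).
The interval-overlap check between two runs is a simple heuristic, not a
statistical guarantee.

```cpp
// Hypothetical container for values read from a sketch after a query run.
// Not a DataSketches type; introduced only for this example.
struct EstimateWithBounds {
  double estimate;     // the sketch's cardinality estimate
  double lower_bound;  // lower bound at the chosen confidence level
  double upper_bound;  // upper bound at the chosen confidence level
};

// Returns true when a known exact count lies inside the interval reported
// by the sketch. This replaces an exact-equality check on the estimate.
bool exact_count_within_bounds(const EstimateWithBounds& run,
                               double exact_count) {
  return run.lower_bound <= exact_count && exact_count <= run.upper_bound;
}

// Heuristic: two runs over the same data are treated as consistent when
// their reported intervals overlap, even if the estimates differ.
bool runs_are_consistent(const EstimateWithBounds& a,
                         const EstimateWithBounds& b) {
  return a.lower_bound <= b.upper_bound && b.lower_bound <= a.upper_bound;
}
```

A test would then assert exact_count_within_bounds(run, known_count) for a
dataset whose true distinct count is known, and tolerate different estimates
between runs as long as the bounds still hold.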

> My second question would be that in case Hive uses the Hive connectors from
> DataSketches and Impala uses the provided C++ library is it guaranteed that
> whatever sketch is written by any of these systems it can be correctly read
> with the other? I see binary compatibility mentioned on the official web
> page just wanted to double check if there are any exceptions to this.


We do our best to guarantee "binary compatibility" across C++, Java and
Python and are doing a lot of cross-language testing to ensure that.  What
this means, for example, is that a sketch generated in C++ and serialized
to its binary image can be deserialized and read in Java, or vice versa.

Note that our fundamental serialization format is an array of bytes. It is
up to the specific environment to choose how it transports this array of
bytes to other systems without corruption.  Typical transport schemes
include Base64, Kafka, ProtoBuf, etc.
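
To make the transport point concrete, below is a minimal, self-contained C++
sketch of a Base64 round trip over a byte vector. The byte vector stands in
for a serialized sketch image; how that image is obtained from the library is
not shown. The function names are made up for this example and are not part
of DataSketches. The only property that matters is that the decoded bytes are
identical to the bytes that were encoded.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Standard Base64 alphabet (RFC 4648), with '=' used for padding.
static const char kAlphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Encodes raw bytes as Base64 text that is safe to carry in text channels.
std::string base64_encode(const std::vector<std::uint8_t>& bytes) {
  std::string out;
  out.reserve(((bytes.size() + 2) / 3) * 4);
  for (std::size_t i = 0; i < bytes.size(); i += 3) {
    const std::size_t rem = bytes.size() - i;  // bytes left, at least 1
    // Pack up to three bytes into a 24-bit group, zero-filling the tail.
    std::uint32_t n = static_cast<std::uint32_t>(bytes[i]) << 16;
    if (rem > 1) n |= static_cast<std::uint32_t>(bytes[i + 1]) << 8;
    if (rem > 2) n |= static_cast<std::uint32_t>(bytes[i + 2]);
    out += kAlphabet[(n >> 18) & 63];
    out += kAlphabet[(n >> 12) & 63];
    out += (rem > 1) ? kAlphabet[(n >> 6) & 63] : '=';
    out += (rem > 2) ? kAlphabet[n & 63] : '=';
  }
  return out;
}

// Maps one Base64 character to its 6-bit value, rejecting anything else.
static std::uint32_t decode_char(char c) {
  if (c >= 'A' && c <= 'Z') return static_cast<std::uint32_t>(c - 'A');
  if (c >= 'a' && c <= 'z') return static_cast<std::uint32_t>(c - 'a') + 26;
  if (c >= '0' && c <= '9') return static_cast<std::uint32_t>(c - '0') + 52;
  if (c == '+') return 62;
  if (c == '/') return 63;
  throw std::invalid_argument("invalid Base64 character");
}

// Decodes padded Base64 text back into the original bytes.
std::vector<std::uint8_t> base64_decode(const std::string& text) {
  if (text.size() % 4 != 0) {
    throw std::invalid_argument("Base64 length must be a multiple of 4");
  }
  std::vector<std::uint8_t> out;
  out.reserve(text.size() / 4 * 3);
  for (std::size_t i = 0; i < text.size(); i += 4) {
    std::uint32_t n = 0;
    int pad = 0;
    for (int j = 0; j < 4; ++j) {
      const char c = text[i + j];
      n <<= 6;
      if (c == '=') {
        ++pad;
      } else if (pad > 0) {
        throw std::invalid_argument("data after Base64 padding");
      } else {
        n |= decode_char(c);
      }
    }
    // Padding may only appear in the final group, and at most twice.
    if (pad > 2 || (pad > 0 && i + 4 != text.size())) {
      throw std::invalid_argument("misplaced Base64 padding");
    }
    out.push_back(static_cast<std::uint8_t>((n >> 16) & 0xFF));
    if (pad < 2) out.push_back(static_cast<std::uint8_t>((n >> 8) & 0xFF));
    if (pad < 1) out.push_back(static_cast<std::uint8_t>(n & 0xFF));
  }
  return out;
}
```

A sender would call base64_encode on the serialized bytes and the receiver
would call base64_decode before handing the bytes to its own deserializer;
comparing the two byte vectors for equality is a cheap check that the
transport did not corrupt the image.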

I am pleased that you are integrating DataSketches into Impala.  Please
continue to post questions to us if you need further help!

Lee.

