Posted to dev@datasketches.apache.org by David Cromberge <da...@permutive.com> on 2021/01/14 19:48:52 UTC

Jaccard similarity for Tuple sketches

Hi Everyone,

We have found the Jaccard similarity utility to be a useful way to measure the similarity/dissimilarity between theta sketches.

My use case involves using summaries from tuple sketches for both mean and distinct summary counts.

The Java library, however, does not provide an implementation that is able to compare:
- two tuple sketches
- a theta and a tuple sketch

I would like to know if it makes sense to implement a utility for tuple sketches similar to the one found in the theta package. Clearly this would only apply to the hashed values and not the summaries.
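
To make the idea concrete, here is a rough illustration of what "only the hashed values, not the summaries" would mean. It is a sketch under stated assumptions and not the DataSketches API: plain Java maps (hash key to summary) stand in for tuple sketches, a plain set of hash keys stands in for a theta sketch, and the result is an exact Jaccard index rather than an estimate with bounds. The class and method names (KeyJaccard, ofKeys, tupleVsTuple, thetaVsTuple) are hypothetical.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustration only: plain collections stand in for sketches.
// This does no sampling or estimation and is not the library's implementation.
final class KeyJaccard {

  private KeyJaccard() {}

  // Exact Jaccard index |A ∩ B| / |A ∪ B| over two sets of hash keys.
  static double ofKeys(Set<Long> a, Set<Long> b) {
    if (a.isEmpty() && b.isEmpty()) {
      return 1.0; // convention chosen for this example: two empty sets count as identical
    }
    Set<Long> intersection = new HashSet<>(a);
    intersection.retainAll(b);
    int unionSize = a.size() + b.size() - intersection.size();
    return (double) intersection.size() / unionSize;
  }

  // Tuple-like vs tuple-like: only the keys are compared, summaries are never read.
  static double tupleVsTuple(Map<Long, ?> a, Map<Long, ?> b) {
    return ofKeys(a.keySet(), b.keySet());
  }

  // Theta-like vs tuple-like: the tuple side again contributes only its keys.
  static double thetaVsTuple(Set<Long> theta, Map<Long, ?> tuple) {
    return ofKeys(theta, tuple.keySet());
  }
}
```

Two maps that share two of four distinct keys give the same result whatever their summary values are, which is the behaviour I have in mind for the tuple case.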

I would be happy to submit a pull request if adding this makes sense and would benefit other users.

Thanks in advance for any feedback,
David
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@datasketches.apache.org
For additional commands, e-mail: dev-help@datasketches.apache.org


Re: Jaccard similarity for Tuple sketches

Posted by leerho <le...@gmail.com>.
David,

Please go ahead and propose something. Instead of opening a PR against
master, why don't you create a branch and do your work there?

Lee.
