Posted to user@accumulo.apache.org by David Medinets <da...@gmail.com> on 2014/05/16 19:54:46 UTC

Tracking cardinality in Accumulo

If I have the following simple set of data:

NAME John
NAME Jake
NAME John
NAME Mary

I want to end up with the following:

NAME 3

I'm thinking that perhaps a HyperLogLog approach should work. See
http://en.wikipedia.org/wiki/HyperLogLog for more information.

Has anyone done this before in Accumulo?
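
For reference, here is the essence of that approach as a minimal sketch, using the stream-lib library that comes up later in the thread (the library choice and the 1% error parameter are assumptions, not something settled in the discussion):

    import com.clearspring.analytics.stream.cardinality.HyperLogLog;

    public class NameCardinality {
        public static void main(String[] args) {
            // 1% target relative standard deviation; the sketch's memory
            // footprint is fixed no matter how many values are offered.
            HyperLogLog hll = new HyperLogLog(0.01);
            for (String name : new String[] {"John", "Jake", "John", "Mary"}) {
                hll.offer(name);
            }
            // Duplicates are absorbed by the sketch, so this prints "NAME 3"
            // (at this tiny scale the estimate is effectively exact).
            System.out.println("NAME " + hll.cardinality());
        }
    }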

Re: Tracking cardinality in Accumulo

Posted by William Slacum <wi...@accumulo.net>.
Yes. It will be less useful if you can't scan only the newest data, as
you'll be recombining the same pieces of data on subsequent runs.
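
For example, one way to restrict a scan to entries written after the previous run is Accumulo's TimestampFilter; a hedged sketch, where the method name and the cutoff bookkeeping are assumptions:

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.IteratorSetting;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.client.TableNotFoundException;
    import org.apache.accumulo.core.iterators.user.TimestampFilter;
    import org.apache.accumulo.core.security.Authorizations;

    public class NewDataScan {
        // Returns a scanner that only sees entries strictly newer than
        // the timestamp recorded at the end of the previous run.
        static Scanner newerThan(Connector conn, String table, long lastRunMillis)
                throws TableNotFoundException {
            Scanner scanner = conn.createScanner(table, new Authorizations());
            IteratorSetting filter =
                    new IteratorSetting(30, "newOnly", TimestampFilter.class);
            TimestampFilter.setStart(filter, lastRunMillis, false); // exclusive start
            scanner.addScanIterator(filter);
            return scanner;
        }
    }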


Re: Tracking cardinality in Accumulo

Posted by David Medinets <da...@gmail.com>.
> What's the expected size of your unique key set? Thousands? Millions? Billions?

This project is something to occupy me in my spare time, and it's intended to
explore aspects of Accumulo that I haven't needed to use yet. In the past,
I simply ran a map-reduce job using the word-counting technique.

tl;dr - The expected size of the unique key set would be in the millions.
Too large to calculate on the fly for a web application.
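
(For scale: a HyperLogLog sketch with 2^14 registers occupies only a few
kilobytes yet carries a standard error of roughly 1.04/sqrt(2^14) ≈ 0.8%,
whether the field holds thousands or billions of distinct values, so millions
of unique keys fit comfortably in a sketch kept in memory.)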


Re: Tracking cardinality in Accumulo

Posted by Marc Parisi <ma...@accumulo.net>.
Whoops, sorry for the empty response, but I'm new to e-mail. The register set
within an HLL sketch supports union directly, and intersection can be derived
from it, so you should be able to estimate cardinality without re-reading the
data. In effect, you can segment your estimation and still keep the error
below roughly 2%.

Union is straightforward, whereas intersection follows from inclusion-exclusion:
|FIELD_1 INTERSECT FIELD_2| = |FIELD_1| + |FIELD_2| - |FIELD_1 UNION FIELD_2|
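
A sketch of that arithmetic with stream-lib (field names and contents are
illustrative): union comes from merging the sketches, and the intersection
estimate falls out of inclusion-exclusion.

    import com.clearspring.analytics.stream.cardinality.CardinalityMergeException;
    import com.clearspring.analytics.stream.cardinality.HyperLogLog;
    import com.clearspring.analytics.stream.cardinality.ICardinality;

    public class HllSetArithmetic {
        public static void main(String[] args) throws CardinalityMergeException {
            HyperLogLog field1 = new HyperLogLog(0.01);
            HyperLogLog field2 = new HyperLogLog(0.01);
            for (String s : new String[] {"John", "Jake", "Mary"}) field1.offer(s);
            for (String s : new String[] {"Mary", "Paul"}) field2.offer(s);

            // Union is supported directly: merge the register sets.
            ICardinality union = field1.merge(field2);

            // Inclusion-exclusion: |F1 INTERSECT F2| = |F1| + |F2| - |F1 UNION F2|
            long intersection =
                    field1.cardinality() + field2.cardinality() - union.cardinality();
            // Prints union=4 intersection=1 for these inputs.
            System.out.println("union=" + union.cardinality()
                    + " intersection=" + intersection);
        }
    }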


Re: Tracking cardinality in Accumulo

Posted by Corey Nolet <cj...@gmail.com>.
What's the expected size of your unique key set? Thousands? Millions?
Billions?

You could probably use a table structure similar to
https://github.com/calrissian/accumulo-recipes/tree/master/store/metrics-store
but just have it emit 1's instead of summing them.

I'm thinking maybe your mappings could be like this:
group=anything, type=NAME, name=John (etc...)

Perhaps a ColumnQualifierGrouping iterator could be applied at scan time to
add up the cardinalities for the qualifiers over the time range being
scanned, with cardinalities across different time units aggregated
client-side.
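
There is no stock ColumnQualifierGrouping iterator, but Accumulo's Combiner
base class could host the idea. A hedged sketch, assuming each Value stores a
serialized stream-lib HyperLogLog rather than a plain count:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.iterators.Combiner;

    import com.clearspring.analytics.stream.cardinality.CardinalityMergeException;
    import com.clearspring.analytics.stream.cardinality.HyperLogLog;

    // Hypothetical scan-time combiner: versions of the same key are merged
    // by union-ing their HLL register sets instead of summing counts.
    public class HllCombiner extends Combiner {
        @Override
        public Value reduce(Key key, Iterator<Value> iter) {
            try {
                HyperLogLog merged = null;
                while (iter.hasNext()) {
                    HyperLogLog hll = HyperLogLog.Builder.build(iter.next().get());
                    if (merged == null) {
                        merged = hll;
                    } else {
                        merged.addAll(hll);
                    }
                }
                return new Value(merged.getBytes());
            } catch (IOException | CardinalityMergeException e) {
                throw new RuntimeException("failed to merge HLL sketches", e);
            }
        }
    }

The client would attach it with an IteratorSetting (scoped via
Combiner.setColumns), then deserialize the returned bytes with
HyperLogLog.Builder.build(...) and call cardinality(), merging sketches from
different time units the same way.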




Re: Tracking cardinality in Accumulo

Posted by David Medinets <da...@gmail.com>.
Yes, the data has not yet been ingested. I can control the table structure,
hopefully by integrating (or extending) the D4M schema.

I'm leaning towards using https://github.com/addthis/stream-lib as part of
the ingest process. At start-up, existing tables would be analyzed to find
cardinality; then, as records are ingested, the cardinality would be adjusted
as needed. I don't yet know how to store the cardinality information so that
restarting the ingest process doesn't require re-processing all the data.
Still researching.
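
One hedged option for that last piece, assuming a dedicated table named
"cardinality" (the table, family, and qualifier names here are invented):
periodically checkpoint the serialized sketch, then rebuild it from those
bytes at start-up instead of re-reading the data.

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;

    import com.clearspring.analytics.stream.cardinality.HyperLogLog;

    public class SketchCheckpoint {
        // Persist the current sketch for one field, e.g. "NAME".
        static void checkpoint(Connector conn, String field, HyperLogLog hll)
                throws Exception {
            BatchWriter writer =
                    conn.createBatchWriter("cardinality", new BatchWriterConfig());
            try {
                Mutation m = new Mutation(field);                  // row = field name
                m.put("sketch", "hll", new Value(hll.getBytes())); // serialized registers
                writer.addMutation(m);
            } finally {
                writer.close();
            }
        }
    }

At start-up the ingest process would read that cell back and rebuild the
sketch with HyperLogLog.Builder.build(value.get()), so only records ingested
since the last checkpoint need to be offered again.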


Re: Tracking cardinality in Accumulo

Posted by Corey Nolet <cj...@gmail.com>.
Can we assume this data has not yet been ingested? Do you have control over
the way in which you structure your table?


