Posted to user@storm.apache.org by ch...@gmail.com on 2014/01/21 14:31:59 UTC

Re[2]: Compute the top 100 million in the total 10 billion data efficiently.

 Hi Ted,
Sorry for my ambiguous question. I did mean the top 1% based on the score attached to the 10 billion tuples.
You mentioned an approximate algorithm. That's great! I will check it out later. But is there a way to calculate it in a precise way?
Thanks.

Sent from myMail for iOS

Tuesday, 21 January 2014, 20:54 +0800 from Ted Dunning <te...@gmail.com>:
Top what?

Most frequent? Or the top 1% based on some score attached to the tuples?

The latter is trivial. The former less so.

If you have the score problem, you just need to use an approximate quantile algorithm like t-digest to get a continuous estimate of the 99th percentile.

For the most frequent problem there are other approximation algorithms but you may have some issues with the number of hits that you are looking for.
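
For a sense of what the score-based route looks like in code, here is a rough sketch (mine, not from the thread) that streams scores into a t-digest and reads off the 99th-percentile estimate; anything scoring above that estimate is approximately in the top 1%. It assumes the com.tdunning:t-digest Java library, whose factory and method names may differ a little between versions:

    import com.tdunning.math.stats.TDigest;

    import java.util.Random;

    public class QuantileSketch {
        public static void main(String[] args) {
            // Compression ~100 keeps the digest small while staying accurate in the tails.
            TDigest digest = TDigest.createMergingDigest(100);

            // Stand-in for the real stream of 10 billion tuple scores.
            Random rnd = new Random(42);
            for (int i = 0; i < 1_000_000; i++) {
                digest.add(rnd.nextGaussian());
            }

            // Scores above this estimate are (approximately) in the top 1%.
            double p99 = digest.quantile(0.99);
            System.out.println("estimated 99th percentile = " + p99);
        }
    }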

Sent from my iPhone

> On Jan 20, 2014, at 21:58, churly lin <churylin@gmail.com> wrote:
>
> Hi all:
> Recently, this question occurred to me: how can I compute the top 100 million out of a total of 10 billion records efficiently using Storm?
>
> The 10 billion records are the input of the topology, with the top 100 million records as the output.

Re: Re[2]: Compute the top 100 million in the total 10 billion data efficiently.

Posted by churly lin <ch...@gmail.com>.
Thanks, Klausen ^ ^. I will check it out.


2014/1/22 Klausen Schaefersinho <kl...@gmail.com>

> There is even a free version online:
>
> http://www.liaad.up.pt/area/jgama/DataStreams-CRC.pdf
>
>
>
> On Wed, Jan 22, 2014 at 11:02 AM, Klausen Schaefersinho <
> klaus.schaefers@gmail.com> wrote:
>
>> Hi,
>>
>> you can have a look at "Knowledge Discovery from Data Streams" by Joao
>> Gama. It gives a very good and solid introduction to the topic of stream
>> mining.
>>
>> Regards,
>>
>> Klaus
>>
>>
>>
>> On Wed, Jan 22, 2014 at 10:35 AM, Ted Dunning <te...@gmail.com> wrote:
>>
>>>
>>> On Tue, Jan 21, 2014 at 7:31 AM, <ch...@gmail.com> wrote:
>>>
>>>> You mentioned an approximate algorithm. That's great! I will check it
>>>> out later. But is there a way to calculate it in a precise way?
>>>
>>>
>>> If you want to select the 1% largest numbers, then you have a few
>>> choices.
>>>
>>> If you have memory for the full set, you can sort.
>>>
>>> If you have room to keep 1% of the samples in memory, you need to do 100
>>> passes.
>>>
>>> If you are willing to accept small errors, then you can do it in a
>>> single pass.
>>>
>>> These trade-offs are not optional, but are theorems.
>>>
>>>
>>>
>>
>

Re: Re[2]: Compute the top 100 million in the total 10 billion data efficiently.

Posted by Klausen Schaefersinho <kl...@gmail.com>.
There is even a free version online:

http://www.liaad.up.pt/area/jgama/DataStreams-CRC.pdf



On Wed, Jan 22, 2014 at 11:02 AM, Klausen Schaefersinho <
klaus.schaefers@gmail.com> wrote:

> Hi,
>
> you can have a look at "Knowledge Discovery from Data Streams" by Joao
> Gama. It gives a very good and solid introduction to the topic of stream
> mining.
>
> Regards,
>
> Klaus
>
>
>
> On Wed, Jan 22, 2014 at 10:35 AM, Ted Dunning <te...@gmail.com> wrote:
>
>>
>> On Tue, Jan 21, 2014 at 7:31 AM, <ch...@gmail.com> wrote:
>>
>>> You mentioned an approximate algorithm. That's great! I will check it out
>>> later. But is there a way to calculate it in a precise way?
>>
>>
>> If you want to select the 1% largest numbers, then you have a few choices.
>>
>> If you have memory for the full set, you can sort.
>>
>> If you have room to keep 1% of the samples in memory, you need to do 100
>> passes.
>>
>> If you are willing to accept small errors, then you can do it in a single
>> pass.
>>
>> These trade-offs are not optional, but are theorems.
>>
>>
>>
>

Re: Re[2]: Compute the top 100 million in the total 10 billion data efficiently.

Posted by Klausen Schaefersinho <kl...@gmail.com>.
Hi,

you can have a look at "Knowledge Discovery from Data Streams" by Joao
Gama. It gives a very good and solid introduction to the topic of stream
mining.

Regards,

Klaus



On Wed, Jan 22, 2014 at 10:35 AM, Ted Dunning <te...@gmail.com> wrote:

>
> On Tue, Jan 21, 2014 at 7:31 AM, <ch...@gmail.com> wrote:
>
>> You mentioned an approximate algorithm. That's great! I will check it out
>> later. But is there a way to calculate it in a precise way?
>
>
> If you want to select the 1% largest numbers, then you have a few choices.
>
> If you have memory for the full set, you can sort.
>
> If you have room to keep 1% of the samples in memory, you need to do 100
> passes.
>
> If you are willing to accept small errors, then you can do it in a single
> pass.
>
> These trade-offs are not optional, but are theorems.
>
>
>

Re: Re[2]: Compute the top 100 million in the total 10 billion data efficiently.

Posted by Nathan Leung <nc...@gmail.com>.
You don't need the full set in RAM. You only need to keep the largest 100M
in RAM, but you would need to keep them in a sorted data structure. If your
RAM is tight, you can keep keys only, then extract the data in a second pass.
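
As a rough sketch of that idea (mine, not Nathan's code): a bounded min-heap keeps only the K largest scores seen so far, so memory is proportional to K rather than to the 10 billion inputs. K would be 100 million for the original question, and the entries would be (key, score) pairs if the second extraction pass is planned; plain doubles and a tiny K are used here only to keep the example short.

    import java.util.PriorityQueue;
    import java.util.Random;

    public class TopK {
        private final int k;
        // Min-heap: the smallest retained score sits on top and is evicted first.
        private final PriorityQueue<Double> heap = new PriorityQueue<>();

        public TopK(int k) {
            this.k = k;
        }

        public void offer(double score) {
            if (heap.size() < k) {
                heap.add(score);
            } else if (score > heap.peek()) { // beats the smallest retained score
                heap.poll();
                heap.add(score);
            }
        }

        public static void main(String[] args) {
            TopK top = new TopK(1000);
            Random rnd = new Random(7);
            for (int i = 0; i < 100_000; i++) {
                top.offer(rnd.nextDouble());
            }
            // After the pass, top.heap holds the 1000 largest scores seen.
            System.out.println("smallest retained score: " + top.heap.peek());
        }
    }
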
On Jan 22, 2014 4:36 AM, "Ted Dunning" <te...@gmail.com> wrote:

>
> On Tue, Jan 21, 2014 at 7:31 AM, <ch...@gmail.com> wrote:
>
>> You mentioned an approximate algorithm. That's great! I will check it out
>> later. But is there a way to calculate it in a precise way?
>
>
> If you want to select the 1% largest numbers, then you have a few choices.
>
> If you have memory for the full set, you can sort.
>
> If you have room to keep 1% of the samples in memory, you need to do 100
> passes.
>
> If you are willing to accept small errors, then you can do it in a single
> pass.
>
> These trade-offs are not optional, but are theorems.
>
>
>

Re: Re[2]: Compute the top 100 million in the total 10 billion data efficiently.

Posted by Ted Dunning <te...@gmail.com>.
On Tue, Jan 21, 2014 at 7:31 AM, <ch...@gmail.com> wrote:

> You mentioned an approximate algorithm. That's great! I will check it out
> later. But is there a way to calculate it in a precise way?


If you want to select the 1% largest numbers, then you have a few choices.

If you have memory for the full set, you can sort.

If you have room to keep 1% of the samples in memory, you need to do 100
passes.

If you are willing to accept small errors, then you can do it in a single
pass.

These trade-offs are not optional, but are theorems.
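
To put rough numbers on those trade-offs (my arithmetic, not from the thread, assuming something like 16 bytes per (key, score) pair): 1% of 10 billion is 100 million records, so the keep-1%-in-memory option needs on the order of 1.6 GB, while holding the full 10 billion for an exact sort needs on the order of 160 GB before any object overhead. That gap is why the single-pass approximate route tends to be the attractive one for a streaming topology.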