
Posted to user@flink.apache.org by shashank agarwal <sh...@gmail.com> on 2017/07/31 16:01:29 UTC

Can I use a lot of keyed states or should I use 1 big keyed state?

Hello,

I have to compute results based on a lot of historical data: parameters like
total transactions in the last 1 month, last 1 day, last 1 hour, etc., by
email id, IP, mobile, name, address, zipcode, etc.

So my question is: is it the right approach to create keyed state by email,
mobile, zipcode, etc., or should I create 1 big mapped state (BS) and then
process that BS, maybe in a process function or by applying some loop and
filter logic in a window or process function?

My main worry is that I will end up with millions of states, because there
can be millions of unique emails, phone numbers, or zipcodes if I create
keyed state by email, phone, etc.

Am I right? Does this impact performance, or is this the wrong approach?
Which approach would you suggest for this use case?


-- 
Thanks Regards

SHASHANK AGARWAL
 ---  Trying to mobilize the things....

Re: Can I use a lot of keyed states or should I use 1 big keyed state?

Posted by shashank agarwal <sh...@gmail.com>.

Thanks, Aljoscha and Stephan, for clearing up my doubt.




On Wed, Aug 9, 2017 at 7:37 PM, Aljoscha Krettek <al...@apache.org>
wrote:

> Hi,
>
> If you have one keyed state, say "count by email id", and many different
> keys, you will only have one column family in RocksDB (or one hash table).
> Actually, a lot of users have hundreds of millions of different keys for
> some states.
>
> Best,
> Aljoscha
>



Re: Can I use a lot of keyed states or should I use 1 big keyed state?

Posted by Aljoscha Krettek <al...@apache.org>.

Hi,

If you have one keyed state, say "count by email id", and many different keys, you will only have one column family in RocksDB (or one hash table). Actually, a lot of users have hundreds of millions of different keys for some states.

Best,
Aljoscha 
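
To make the point above concrete, here is a toy model in plain Python. It is not Flink code and not how Flink is implemented; the class ToyKeyedBackend and its method update_count are hypothetical names invented for this sketch. It only illustrates the idea that a registered state corresponds to one table, and that every distinct key is a row in that table rather than a table of its own.

```python
from collections import defaultdict


class ToyKeyedBackend:
    """Toy stand-in for a keyed state backend (hypothetical, not Flink code)."""

    def __init__(self):
        # One table per registered state name; each table maps key -> value.
        self.tables = defaultdict(dict)

    def update_count(self, state_name, key):
        """Increment the counter stored for `key` in the table `state_name`."""
        table = self.tables[state_name]
        table[key] = table.get(key, 0) + 1
        return table[key]


backend = ToyKeyedBackend()
for i in range(100_000):
    # 10,000 distinct email ids, each seen 10 times.
    backend.update_count("count by email id", f"user{i % 10_000}@example.com")

print(len(backend.tables))                       # 1 table for the one state
print(len(backend.tables["count by email id"]))  # 10000 rows, one per key
```

In this model, adding more distinct email ids only adds rows to the existing table; only registering another state (another name) adds a table.
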
> On 2. Aug 2017, at 14:59, shashank agarwal <sh...@gmail.com> wrote:
> 
> If I create a keyed state ("count by email id") and the keyed stream has 10 unique email ids:
> 
> Will it create 1 column family or hash table?
> 
> Or will it create 10 column families or hash tables?
> 
> Can I have millions of unique email ids in that keyed state?

Re: Can I use a lot of keyed states or should I use 1 big keyed state?

Posted by shashank agarwal <sh...@gmail.com>.

If I create a keyed state ("count by email id") and the keyed stream has 10
unique email ids:

Will it create 1 column family or hash table?

Or will it create 10 column families or hash tables?

Can I have millions of unique email ids in that keyed state?



On Tue, Aug 1, 2017 at 2:59 AM, shashank agarwal <sh...@gmail.com>
wrote:

> OK, let me check my understanding with an example:
>
> If I create a keyed state named "total count by email" for
> key (project id + email), then it will create a single hash table or column
> family "total count by email", all the unique email ids will be rows of
> that single hash table or column family, and I can then store millions of
> unique email ids in it.
>
> Does that mean it will create only a single state object for all unique
> email ids?



Re: Can I use a lot of keyed states or should I use 1 big keyed state?

Posted by shashank agarwal <sh...@gmail.com>.

OK, let me check my understanding with an example:

If I create a keyed state named "total count by email" for
key (project id + email), then it will create a single hash table or column
family "total count by email", all the unique email ids will be rows of
that single hash table or column family, and I can then store millions of
unique email ids in it.

Does that mean it will create only a single state object for all unique
email ids?




On Tue, Aug 1, 2017 at 1:53 AM, Stephan Ewen <se...@apache.org> wrote:

> Each keyed state in Flink is a hashtable or a column family in RocksDB.
> Having too many of those is not memory efficient.
>
> Having fewer states is better, if you can adapt your schema that way.
>
> I would also look into "MapState", which is an efficient way to have "sub
> keys" under a keyed state.
>
> Stephan



Re: Can I use a lot of keyed states or should I use 1 big keyed state?

Posted by Stephan Ewen <se...@apache.org>.

Each keyed state in Flink is a hashtable or a column family in RocksDB.
Having too many of those is not memory efficient.

Having fewer states is better, if you can adapt your schema that way.

I would also look into "MapState", which is an efficient way to have "sub
keys" under a keyed state.

Stephan
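
As a rough illustration of the "sub keys" idea, here is a toy model in plain Python. It is not Flink's MapState API; the class ToyMapState, its methods, and the choice of hourly time buckets as sub keys are all assumptions made for this sketch and do not come from the thread. One map-valued state holds, per email id, a small map from hour bucket to transaction count, and the last-hour, last-day, and last-month totals are derived from that one map instead of from three separate states.

```python
from collections import defaultdict

HOUR = 3600  # bucket width in seconds (an arbitrary choice for this sketch)


class ToyMapState:
    """Toy model of one map-valued keyed state: key -> {sub key -> value}."""

    def __init__(self):
        self.rows = defaultdict(dict)  # a single table shared by all keys

    def add(self, key, ts):
        """Count one transaction for `key` in the hour bucket containing `ts`."""
        bucket = ts - ts % HOUR
        entries = self.rows[key]
        entries[bucket] = entries.get(bucket, 0) + 1

    def total(self, key, now, window):
        """Sum buckets of `key` that start in the last `window` seconds.

        The result is only accurate to bucket granularity.
        """
        entries = self.rows.get(key, {})
        return sum(n for b, n in entries.items() if now - window <= b <= now)


state = ToyMapState()
now = 1000 * HOUR
state.add("a@example.com", now - 10)          # inside the last hour
state.add("a@example.com", now - 5 * HOUR)    # inside the last day
state.add("a@example.com", now - 5 * HOUR)
state.add("a@example.com", now - 100 * HOUR)  # inside the last 30 days
state.add("b@example.com", now - 10)

print(state.total("a@example.com", now, HOUR))            # 1
print(state.total("a@example.com", now, 24 * HOUR))       # 3
print(state.total("a@example.com", now, 30 * 24 * HOUR))  # 4
```

A real job would also need to drop buckets older than the longest window so the per-key map does not grow without bound; that cleanup is left out here.
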

