Posted to user@flink.apache.org by Chakravarthy varaga <ch...@gmail.com> on 2017/01/05 13:21:37 UTC

Increasing parallelism skews/increases overall job processing time linearly

Hi All,

I have a job as attached.

I have a 16-core blade running RHEL 7. The TaskManager default number of
slots is set to 1. The source is a Kafka stream, and each of the 2
source topics has 2 partitions.


*What I notice is that when I deploy the job with parallelism=2, the total
processing time doubles compared to when the same job is deployed with
parallelism=1. It increases linearly with the parallelism.*
Since the number of slots is set to 1 per TM, I would assume that the job
would be processed in parallel in 2 different TMs and that each consumer in
each TM is connected to 1 partition of the topic. This should therefore
have kept the overall processing time the same or lower!

The co-flatmap connects the 2 streams and uses ValueState (checkpointed in
FS). I think this state is distributed among the TMs. My understanding is
that looking up value state across TMs could be costly. Do you sense
something wrong here?
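
For reference, the topology is roughly of the following shape. This is only a
minimal sketch with hypothetical topic names, schemas and key fields (the real
job is in the attachment, and exact constructor signatures vary a bit between
Flink versions):

import java.util.Properties;

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;
import org.apache.flink.util.Collector;

public class CvpJobSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // ValueState is checkpointed to a filesystem backend, as described above.
        env.setStateBackend(new FsStateBackend("file:///tmp/flink-checkpoints"));
        env.enableCheckpointing(60000);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "cvp-job");

        // Two Kafka 0.9 sources, one per topic (each topic has 2 partitions).
        DataStream<String> events = env.addSource(
                new FlinkKafkaConsumer09<>("events", new SimpleStringSchema(), props));
        DataStream<String> rules = env.addSource(
                new FlinkKafkaConsumer09<>("rules", new SimpleStringSchema(), props));

        // The two streams are connected by keyBy and combined in a co-flatmap
        // that keeps per-key ValueState.
        KeySelector<String, String> byFirstField = new KeySelector<String, String>() {
            @Override
            public String getKey(String value) {
                return value.split(",")[0];
            }
        };

        events.connect(rules)
              .keyBy(byFirstField, byFirstField)
              .flatMap(new MatchingCoFlatMap())
              .print();   // stand-in for the real Kafka sink

        env.execute("cvp-job-sketch");
    }

    // Joins the latest "rule" seen for a key against incoming events.
    public static class MatchingCoFlatMap
            extends RichCoFlatMapFunction<String, String, String> {

        private transient ValueState<String> lastRule;

        @Override
        public void open(Configuration parameters) {
            lastRule = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("last-rule", String.class));
        }

        @Override
        public void flatMap1(String event, Collector<String> out) throws Exception {
            String rule = lastRule.value();
            if (rule != null) {
                out.collect(event + " matched " + rule);
            }
        }

        @Override
        public void flatMap2(String rule, Collector<String> out) throws Exception {
            lastRule.update(rule);
        }
    }
}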

Best Regards
CVP

Re: Increasing parallelism skews/increases overall job processing time linearly

Posted by Chakravarthy varaga <ch...@gmail.com>.
Hi Till,

    Thanks for your response.
    The results are the same with:
        4 CPUs (8 cores in total)
        Kafka partitions = 4 per topic
        parallelism for the job = 3
        task slots per TM = 4

    Basically this Flink application consumes (Kafka source) from 2 topics
and produces (Kafka sink) onto 1 topic. On one source topic the event load
is 100K events/sec, while the other source sees 1 event per hour...
    I'm wondering whether parallelism takes effect on multiple sources
irrespective of the partition count.

    What I did was configure 1 partition for the 2nd topic (1 event/hour)
and 4 partitions for the 100K events/sec topic, then deployed the job with
parallelism 3, and the results are the same...
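
    To be explicit about what I mean by enabling parallelism on the sources,
a minimal sketch (hypothetical topic names, reusing the env and props from the
sketch in my first mail) that pins each source's parallelism to its topic's
partition count rather than relying on the job-wide setting:

env.setParallelism(3);   // job-wide default parallelism

// 4 partitions -> up to 4 consumer subtasks can actually receive data
DataStream<String> busySource = env
        .addSource(new FlinkKafkaConsumer09<>("events", new SimpleStringSchema(), props))
        .setParallelism(4);

// 1 partition and ~1 event/hour -> a single consumer subtask is enough
DataStream<String> quietSource = env
        .addSource(new FlinkKafkaConsumer09<>("rules", new SimpleStringSchema(), props))
        .setParallelism(1);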

Best Regards
CVP


Re: Increasing parallelism skews/increases overall job processing time linearly

Posted by Till Rohrmann <tr...@apache.org>.
Hi CVP,

changing the parallelism from 1 to 2 with every TM having only one slot
will inevitably introduce another network shuffle operation between the
sources and the keyed co-flatmap. This might be the source of your
slowdown, because before, everything was running on one machine without any
network communication (apart from reading from Kafka).
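
One way to see where the extra shuffle comes in is to print the execution plan
before submitting the job; the ship strategy on each edge (FORWARD vs. HASH)
shows the network repartitioning that keyBy introduces once the parallelism is
larger than 1. A minimal sketch:

// Build the topology as usual, then dump the plan (JSON) before env.execute().
// Each operator is listed with its parallelism, and each input edge carries a
// ship strategy such as FORWARD (chained/local) or HASH (network shuffle).
System.out.println(env.getExecutionPlan());

// The JSON can also be pasted into the Flink plan visualizer for a graphical view.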

Do you also observe a further degradation when increasing the parallelism
from 2 to 4, for example (given that you've increased the number of topic
partitions to at least the maximum parallelism in your topology)?

Cheers,
Till


Re: Increasing parallelism skews/increases overall job processing time linearly

Posted by Chakravarthy varaga <ch...@gmail.com>.
Hi Guys,

    I understand that you are extremely busy, but any pointers here are
highly appreciated so that I can move forward and conclude this activity!

Best Regards
CVP


Re: Increasing parallelism skews/increases overall job processing time linearly

Posted by Chakravarthy varaga <ch...@gmail.com>.
Is there anything that I could check or collect for you to help with the investigation?


Re: Increasing parallelism skews/increases overall job processing time linearly

Posted by Chakravarthy varaga <ch...@gmail.com>.
Hi Stephan,

- The Kafka version is 0.9.0.1; the connector is FlinkKafkaConsumer09.
- The FlatMap and CoFlatMap are connected by keyBy.
- No data is broadcast, and the data does not explode with the parallelism.

CVP


Re: Increasing parallelism skews/increases overall job processing time linearly

Posted by Stephan Ewen <se...@apache.org>.
Hi!

You are right, parallelism 2 should be faster than parallelism 1 ;-) As
Chen Qin pointed out, having only 2 Kafka partitions may prevent further
scale-out.

A few things to check:
  - How are you connecting the FlatMap and CoFlatMap? Default, keyBy, or
broadcast? (See the sketch after this list.)
  - Broadcast, for example, would multiply the data based on the
parallelism, which can lead to a slowdown when the network saturates.
  - Are you using the standard Kafka source (and which Kafka version)?
  - Is there any part of the program that multiplies data/effort with
higher parallelism (does the FlatMap explode data based on the parallelism)?
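
To make the keyBy vs. broadcast distinction concrete, a rough sketch (the
stream and function names are hypothetical, not taken from the attached job):

// keyBy: records of both streams are hash-partitioned on their key, so a given
// key's events and rules always meet in the same parallel subtask. Each record
// crosses the network at most once, and keyed ValueState can be used.
events.connect(rules)
      .keyBy(eventKey, ruleKey)              // two KeySelectors with the same key type
      .flatMap(new KeyedJoinCoFlatMap());

// broadcast: every record of the broadcast stream is replicated to ALL parallel
// subtasks of the co-flatmap, so its traffic grows linearly with the parallelism.
// Fine for a tiny control stream, expensive for a 100K events/sec stream, and
// keyed state is not available on a non-keyed connection.
events.connect(rules.broadcast())
      .flatMap(new NonKeyedJoinCoFlatMap());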

Stephan



Re: Increasing parallelism skews/increases overall job processing time linearly

Posted by Chen Qin <qi...@gmail.com>.
Just noticed there are only two partitions per topic. Regardless of how large the parallelism is set, at most two of the consumer subtasks will get a partition assigned.



Re: Increasing parallelism skews/increases overall job processing time linearly

Posted by Chakravarthy varaga <ch...@gmail.com>.
Hi All,

    Any updates on this?

Best Regards
CVP
