Posted to dev@samza.apache.org by Dotan Patrich <do...@fortscale.com> on 2014/10/28 16:34:14 UTC

Samza threads issues

Hi All,

I ran into problems caused by a single user having too many threads on
Linux (CentOS). Investigating further, it turned out that the JVM spawns
over 31 threads per process for GC. With about 18 Samza processes running
on the machine, we soon got close to the 1000-thread limit per user.
I am thinking of running the Samza JVM with SerialGC instead of parallel
GC to avoid having so many threads in the environment. In addition, in
theory this seems a better fit for situations where we prefer throughput
over latency in a single-core environment (which is roughly what each of
our Samza tasks is assigned).
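
As a rough check on the numbers above, here is a minimal Python sketch of
the thread budget. The figures (18 processes, about 31 GC threads each, a
limit of about 1000) come from this message; the function name is
hypothetical, and the sketch assumes the serial collector adds no dedicated
GC worker threads. It ignores every non-GC thread in each process.

```python
def gc_thread_budget(processes, gc_threads_per_process, user_thread_limit):
    """Return (total GC threads, threads left under the per-user limit)."""
    total_gc = processes * gc_threads_per_process
    return total_gc, user_thread_limit - total_gc


# Figures from the message: 18 processes, roughly 31 GC threads each.
parallel_total, parallel_left = gc_thread_budget(18, 31, 1000)

# Assumption: with the serial collector there are no extra GC worker threads.
serial_total, serial_left = gc_thread_budget(18, 0, 1000)

print(parallel_total, parallel_left)  # 558 442
print(serial_total, serial_left)      # 0 1000
```

Under these assumptions GC threads alone take more than half of the
per-user budget, before counting application and JVM housekeeping threads.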

Before doing so, I would really appreciate your insights: has anyone
encountered this issue before? Is changing the GC to serial a good
solution?

Thanks,
Dotan

Re: Samza threads issues

Posted by Chris Riccomini <cr...@linkedin.com.INVALID>.
Hey Dotan,

> should we increase the Kafka topic sizes to accommodate incoming data
>during these time gaps as opposed to the parallel GC?

You'll have to experiment and see. I doubt it, though.
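
One way to frame that experiment: Kafka keeps messages on the broker for
the retention window whether or not a consumer is running, so a pause only
costs data if the consumer falls behind by more than that window. The
Python sketch below compares a pause with the retention window. All figures
and function names are hypothetical, chosen only to illustrate the
arithmetic.

```python
def backlog_during_pause(incoming_bytes_per_sec, pause_secs):
    """Bytes that arrive on the topic while the consumer is paused."""
    return incoming_bytes_per_sec * pause_secs


def share_of_retention(pause_secs, retention_secs):
    """Fraction of the retention window consumed by a single pause."""
    return pause_secs / retention_secs


# Hypothetical figures: 20 MB/s incoming, a 2 s pause, 7 days of retention.
rate = 20_000_000
pause = 2
retention = 7 * 24 * 3600

backlog = backlog_during_pause(rate, pause)
share = share_of_retention(pause, retention)

print(backlog)  # 40000000
print(share)
```

With numbers like these the backlog is tiny next to what the topic already
retains, which fits the doubt expressed above. Real pause times have to be
measured.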

> Or on a broader aspect - What are the best practices to measure and set
>the right size for the Kafka topics? Can anyone share their experience on
>that?

There's a lot that goes into this. Some things to consider:

1. Peak bytes/sec throughput.
2. Retention policy for the topic.
3. Parallelism requirements for consumers.

At LinkedIn, we start with a default of 8 partitions per topic and size up
as needed. The need could be that partitions are running too hot (on either
reads or writes), that the partitions are too large on disk (retention
policy), or that the downstream consumers can't keep up because their
processing is slower than the messages/sec arriving on the partition.
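
The three factors above can be combined into a rough estimate. The Python
sketch below is an illustration only, not LinkedIn's method: the
per-partition throughput and disk limits are hypothetical thresholds you
would pick for your own brokers, and using the peak rate for the disk
estimate gives an upper bound rather than a typical value.

```python
import math


def partitions_needed(peak_bytes_per_sec, retention_secs,
                      max_partition_bytes_per_sec, max_partition_disk_bytes,
                      consumer_parallelism, default=8):
    """Estimate a partition count from throughput, retention and consumers."""
    # Enough partitions that no single one runs too hot at peak.
    by_throughput = math.ceil(peak_bytes_per_sec / max_partition_bytes_per_sec)
    # Enough partitions that none grows too large on disk within retention.
    retained_bytes = peak_bytes_per_sec * retention_secs
    by_disk = math.ceil(retained_bytes / max_partition_disk_bytes)
    # Start from the default and size up to the largest requirement.
    return max(default, by_throughput, by_disk, consumer_parallelism)


# Hypothetical busy topic: 20 MB/s peak, 7 days retention,
# 5 MB/s and 500 GB allowed per partition, 12 parallel consumers.
busy = partitions_needed(20_000_000, 7 * 24 * 3600,
                         5_000_000, 500_000_000_000, 12)

# Hypothetical small topic: 1 MB/s peak, 1 day retention, 4 consumers.
small = partitions_needed(1_000_000, 24 * 3600,
                          5_000_000, 500_000_000_000, 4)

print(busy)   # 25
print(small)  # 8
```

In the first case disk size under the retention policy is the binding
constraint; in the second the default of 8 is already enough.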

Cheers,
Chris



Re: Samza threads issues

Posted by Dotan Patrich <do...@fortscale.com>.
Thanks Chris,
We will test our product using SerialGC to see how it behaves.

One concern I have is the Kafka topic sizes. Assuming "stop-the-world"
GC pauses will be more noticeable with SerialGC than with parallel GC,
should we increase the Kafka topic sizes to accommodate the data that
arrives during these pauses?
Or, more broadly: what are the best practices for measuring and setting
the right size for Kafka topics? Can anyone share their experience with
that?

Thanks,
Dotan


Re: Samza threads issues

Posted by Chris Riccomini <cr...@linkedin.com.INVALID>.
Hey Dotan,

We run all of our jobs using SerialGC by default. For a few of our
higher-throughput jobs, we've had better luck with parallel GC or G1, but
in general, serial works fine.
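
For reference, the collectors mentioned here map to standard HotSpot flags.
The Python sketch below builds a JVM options string for each choice. It
assumes the options reach the container through the job's task.opts
property; the helper name and the extra heap flag are illustrative only.

```python
# HotSpot flags that select each collector.
GC_FLAGS = {
    "serial": "-XX:+UseSerialGC",
    "parallel": "-XX:+UseParallelGC",
    "g1": "-XX:+UseG1GC",
}


def jvm_opts(collector="serial", extra=()):
    """Build a JVM options string for the chosen collector."""
    if collector not in GC_FLAGS:
        raise ValueError("unknown collector: " + collector)
    return " ".join([GC_FLAGS[collector], *extra])


# Assumption: the job config passes JVM options via task.opts.
print("task.opts=" + jvm_opts())                  # task.opts=-XX:+UseSerialGC
print("task.opts=" + jvm_opts("g1", ["-Xmx1g"]))  # task.opts=-XX:+UseG1GC -Xmx1g
```

Serial is the default in this sketch, so a higher-throughput job opts in to
parallel or G1 explicitly.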

Cheers,
Chris
