Posted to dev@samza.apache.org by Liu Bo <di...@gmail.com> on 2016/02/01 04:58:53 UTC

samza gc tuning, what about serial + serial old?

Hi group

We are trying to migrate our current streaming pipeline to Samza. Our
pipeline has several NLP modules, such as segmentation and POS tagging, plus
a lot of score calculation. Each process normally needs 8~10 GB of memory.

Our goal is high throughput, so we use Parallel Scavenge + Parallel Old in
our current setup. We've tried G1 on Java 8u65, but it's not as good for
throughput.

My question: since Samza is designed to run on one core, does that mean
Serial + Serial Old is the best garbage collector for Samza? On paper, the
serial collector is more efficient.

If it's not, could someone share their experience with Samza GC tuning for
discussion? Thanks in advance.

-- 
All the best

Liu Bo

Re: samza gc tuning, what about serial + serial old?

Posted by Tao Feng <fe...@gmail.com>.

Hi Bo,

I think an application usually would not want to use the Serial GC, which is
designed for uniprocessor machines. With an 8~10 GB heap, the stop-the-world
(STW) time with Serial GC could be quite long. Even though Samza is designed
for one core, as you mentioned, if the upstream message rate is high you
would still need multiple Samza containers to consume the data and avoid
message lag (falling behind the producer's offset). If full GCs (~10s each)
happen frequently, the QPS could drop very low, and I imagine it would be
hard for the Samza job to keep up with the upstream message rate.
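
As a rough back-of-the-envelope sketch of the throughput argument above (not
a measurement from a real job): the ~10s pause is the figure mentioned in
this message, while the 60s interval between full GCs, the 5000 msg/s peak
rate, and the GcPauseMath class itself are assumptions made up for
illustration.

```java
// Hypothetical helper, not part of Samza: estimates how much processing
// capacity is left when stop-the-world pauses recur at a fixed interval.
final class GcPauseMath {

    /**
     * Fraction of wall-clock time available for processing when one pause of
     * pauseSeconds happens in every cycle of cycleSeconds (cycle includes the pause).
     */
    static double availableFraction(double pauseSeconds, double cycleSeconds) {
        if (pauseSeconds < 0 || cycleSeconds <= 0 || pauseSeconds > cycleSeconds) {
            throw new IllegalArgumentException("need 0 <= pause <= cycle and cycle > 0");
        }
        return (cycleSeconds - pauseSeconds) / cycleSeconds;
    }

    /** Average message rate once the pauses are accounted for. */
    static double effectiveRate(double peakMessagesPerSecond,
                                double pauseSeconds,
                                double cycleSeconds) {
        return peakMessagesPerSecond * availableFraction(pauseSeconds, cycleSeconds);
    }

    public static void main(String[] args) {
        // Assumed numbers: a 10 s full GC once per 60 s, 5000 msg/s while running.
        System.out.printf("available fraction: %.3f%n", availableFraction(10, 60));
        System.out.printf("effective rate: %.1f msg/s%n", effectiveRate(5000, 10, 60));
    }
}
```

If the effective rate is below the upstream produce rate, the container
falls further behind on every cycle, which is the lag scenario described
above.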

Thanks,
-Tao

Re: samza gc tuning, what about serial + serial old?

Posted by Liu Bo <di...@gmail.com>.

Hi Yi

Thanks for the reply.

The other thing I just recalled is the question of how long a
stop-the-world pause a program can tolerate.

One of our teams ran into a situation where they had to use CMS in a
throughput-oriented pipeline. They maintain a Storm cluster with a heavy
workload. Parallel full GCs took so long that ZooKeeper considered the
worker node dead and kicked it out, which led to a lot of Kafka rebalancing.
They had to use CMS to reduce stop-the-world time for quite a while, until
G1 came out.

If long full GC pauses, tens of seconds for example, aren't a problem for
Samza on the framework side, Serial + Serial Old sounds good to me. ;-)
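
One way to get a feel for how long a process actually stalls is a small
probe thread that sleeps for a short interval and records how much longer
than requested it was away. This is only a sketch of that idea: the
PauseProbe class, the 6000 ms session timeout, and the iteration counts are
assumptions for illustration, not values from our cluster or from
ZooKeeper's defaults.

```java
// Hypothetical probe, not part of Storm, Samza, or ZooKeeper.
final class PauseProbe {

    /**
     * Sleeps sleepMillis at a time and returns the largest overshoot seen, in ms.
     * A long stop-the-world pause shows up as a sleep that returns far too late.
     */
    static long maxStallMillis(int iterations, long sleepMillis) {
        long worstNanos = 0;
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop probing
                break;
            }
            long overshoot = System.nanoTime() - start - sleepMillis * 1_000_000L;
            worstNanos = Math.max(worstNanos, overshoot);
        }
        return worstNanos / 1_000_000L;
    }

    /** True if a stall of this length could outlast the given session timeout. */
    static boolean risksSessionExpiry(long stallMillis, long sessionTimeoutMillis) {
        return stallMillis >= sessionTimeoutMillis;
    }

    public static void main(String[] args) {
        long sessionTimeoutMillis = 6000; // assumed value, check your own client config
        long stall = maxStallMillis(200, 10);
        System.out.println("max observed stall (ms): " + stall);
        System.out.println("risks session expiry: "
                + risksSessionExpiry(stall, sessionTimeoutMillis));
    }
}
```

The probe also picks up stalls from sources other than GC (for example OS
scheduling), so it gives an upper bound on what the application sees rather
than a pure GC pause time.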

-- 
All the best

Liu Bo

Re: samza gc tuning, what about serial + serial old?

Posted by Yi Pan <ni...@gmail.com>.

Hi, Bo,

That's an interesting question. Since we have opened up the task.opts
option so that users can set whatever GC configuration suits their Samza
jobs, we don't really have a "recommended" GC. The best choice probably
also depends on the application's usage pattern. Our perf partner Tao
Feng @LinkedIn may have more insights.
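
For illustration, a job could carry a line such as
task.opts=-Xmx10g -XX:+UseParallelGC -XX:+UseParallelOldGC in its config
(the heap size and collector choice here are assumptions matching the
Java 8 setup described in the question, not a recommendation). A minimal
sketch for confirming from inside the container which collector flags and
collectors the JVM ended up with, using the standard java.lang.management
API; the GcConfigCheck class is hypothetical:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic class, not part of Samza.
final class GcConfigCheck {

    /** Keeps only collector-selection flags of the form -XX:+Use...GC. */
    static List<String> gcFlags(List<String> jvmArgs) {
        List<String> flags = new ArrayList<>();
        for (String arg : jvmArgs) {
            if (arg.startsWith("-XX:+Use") && arg.endsWith("GC")) {
                flags.add(arg);
            }
        }
        return flags;
    }

    /** Names of the collectors the running JVM has registered. */
    static List<String> activeCollectors() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // Flags the JVM was started with, e.g. those passed through task.opts.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("GC flags: " + gcFlags(jvmArgs));
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + " collections=" + gc.getCollectionCount()
                    + " totalMillis=" + gc.getCollectionTime());
        }
    }
}
```

Logging this once at task start makes it easier to compare runs with
different collectors, since the collection counts and cumulative times come
from the same source for each of them.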

@Tao, do you have any comments on this?

-Yi
