You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@samza.apache.org by Michael Ravits <mi...@gmail.com> on 2015/05/21 13:03:00 UTC

Number of partitions

Hi,

I wonder what are the considerations I need to account for in regard to the
number of partitions in input topics for Samza.
When testing with a 500 partitions topic with one Samza job I noticed the
start up time to be very long.
Are there any problems that might occur when dealing with this number of
partitions?

Thanks,
Michael

Re: Number of partitions

Posted by Michael Ravits <mi...@gmail.com>.
Hey Garry,

Thanks for the good advice. I'm definitely going to read the confluent blog
post.
I actually went back and re-read the section on containers before reading
your reply and realized I was mixing between jobs and containers..


Thanks,
Michael

On Thu, May 21, 2015 at 10:50 PM, Garry Turkington <
g.turkington@improvedigital.com> wrote:

> Hi,
>
> The other variable to think about here is the task to container mapping.
> Each job will indeed have 1 task per input partition in the underlying
> topic but you can then spread those 500 instances across multiple
> containers in your Yarn grid:
>
>
> http://samza.apache.org/learn/documentation/0.9/container/samza-container.html
>
> I'd also suggest thinking about throughput requirements in terms of both
> the Kafka and Samza perspectives. Great blog post from one of the Confluent
> guys here:
>
>
> http://blog.confluent.io/2015/03/12/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/
>
> For me I have Kafka brokers with JBOD disks so my starting partition count
> was number of brokers * number of disks. I went with that, discovered it
> wasn't hitting my needed throughput and had to up that several times.
> Currently I have around 160 partitions for the  high throughput topics
> (700K msgs/sec) on a 5 broker cluster.
>
> So over-partitioning is a good thing and gives additional throughput and
> more flexible growth but if you are ramping that too soon you are likely
> requiring growth of your job container count. Look at your throughput
> requirements and hardware assets, pick a starting point and test your
> assumptions. If you are  like me you'll find most of them are wrong. :)
>
> Garry
>
> -----Original Message-----
> From: Lukas Steiblys [mailto:lukas@doubledutch.me]
> Sent: 21 May 2015 20:20
> To: dev@samza.apache.org
> Subject: Re: Number of partitions
>
> Each job will get all the partitions and each task (500 of them) within
> the job will get 1 partition. So there will be 500 processes working
> through the log.
>
> I'd try to figure out what your scaling needs are for the next 2-3 years
> and then calculate your resource requirements accordingly (how many
> parallel executing tasks you would need). If you need to split, it is not
> trivial, but doable.
>
> Lukas
>
> -----Original Message-----
> From: Michael Ravits
> Sent: Thursday, May 21, 2015 11:17 AM
> To: dev@samza.apache.org
> Subject: Re: Number of partitions
>
> Well, since the number of partitions can't be changed after the system
> starts running I wanted to have the flexibility to grow a lot without
> stopping for upgrade.
> Just wonder what would be a tolerable number for Samza.
> For example if I'd start with 5 jobs, each will get 100 partitions. Is
> this reasonable? Or too much for a single job instance?
>
> On Thu, May 21, 2015 at 7:46 PM, Lukas Steiblys <lu...@doubledutch.me>
> wrote:
>
> > 500 is a bit extreme unless you're planning on running the job on some
> > 200 machines and try to exploit their full power. I personally run 4
> > in production for our system processing 100 messages/s and there's
> > plenty of room to grow.
> >
> > Lukas
> >
> > On Thursday, May 21, 2015, Michael Ravits <mi...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I wonder what are the considerations I need to account for in regard
> > > to
> > the
> > > number of partitions in input topics for Samza.
> > > When testing with a 500 partitions topic with one Samza job I
> > > noticed the start up time to be very long.
> > > Are there any problems that might occur when dealing with this
> > > number of partitions?
> > >
> > > Thanks,
> > > Michael
> > >
> >
>
>
> -----
> No virus found in this message.
> Checked by AVG - www.avg.com
> Version: 2014.0.4800 / Virus Database: 4311/9817 - Release Date: 05/19/15
>

Re: Number of partitions

Posted by Davide Simoncelli <ne...@gmail.com>.
Garry,

Thanks for sharing. That was an interesting read!

I Have few questions for you. How did you test your needed throughput? Only at Kafka level or with your Samza job as well? Did you measure it with metrics?

Thanks

Davide

> On 21 May 2015, at 8:50 pm, Garry Turkington <g....@improvedigital.com> wrote:
> 
> Hi,
> 
> The other variable to think about here is the task to container mapping. Each job will indeed have 1 task per input partition in the underlying topic but you can then spread those 500 instances across multiple containers in your Yarn grid:
> 
> http://samza.apache.org/learn/documentation/0.9/container/samza-container.html
> 
> I'd also suggest thinking about throughput requirements in terms of both the Kafka and Samza perspectives. Great blog post from one of the Confluent guys here:
> 
> http://blog.confluent.io/2015/03/12/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/
> 
> For me I have Kafka brokers with JBOD disks so my starting partition count was number of brokers * number of disks. I went with that, discovered it wasn't hitting my needed throughput and had to up that several times. Currently I have around 160 partitions for the  high throughput topics (700K msgs/sec) on a 5 broker cluster.
> 
> So over-partitioning is a good thing and gives additional throughput and more flexible growth but if you are ramping that too soon you are likely requiring growth of your job container count. Look at your throughput requirements and hardware assets, pick a starting point and test your assumptions. If you are  like me you'll find most of them are wrong. :)
> 
> Garry
> 
> -----Original Message-----
> From: Lukas Steiblys [mailto:lukas@doubledutch.me] 
> Sent: 21 May 2015 20:20
> To: dev@samza.apache.org
> Subject: Re: Number of partitions
> 
> Each job will get all the partitions and each task (500 of them) within the job will get 1 partition. So there will be 500 processes working through the log.
> 
> I'd try to figure out what your scaling needs are for the next 2-3 years and then calculate your resource requirements accordingly (how many parallel executing tasks you would need). If you need to split, it is not trivial, but doable.
> 
> Lukas
> 
> -----Original Message-----
> From: Michael Ravits
> Sent: Thursday, May 21, 2015 11:17 AM
> To: dev@samza.apache.org
> Subject: Re: Number of partitions
> 
> Well, since the number of partitions can't be changed after the system starts running I wanted to have the flexibility to grow a lot without stopping for upgrade.
> Just wonder what would be a tolerable number for Samza.
> For example if I'd start with 5 jobs, each will get 100 partitions. Is this reasonable? Or too much for a single job instance?
> 
> On Thu, May 21, 2015 at 7:46 PM, Lukas Steiblys <lu...@doubledutch.me>
> wrote:
> 
>> 500 is a bit extreme unless you're planning on running the job on some 
>> 200 machines and try to exploit their full power. I personally run 4 
>> in production for our system processing 100 messages/s and there's 
>> plenty of room to grow.
>> 
>> Lukas
>> 
>> On Thursday, May 21, 2015, Michael Ravits <mi...@gmail.com> wrote:
>> 
>>> Hi,
>>> 
>>> I wonder what are the considerations I need to account for in regard 
>>> to
>> the
>>> number of partitions in input topics for Samza.
>>> When testing with a 500 partitions topic with one Samza job I 
>>> noticed the start up time to be very long.
>>> Are there any problems that might occur when dealing with this 
>>> number of partitions?
>>> 
>>> Thanks,
>>> Michael
>>> 
>> 
> 
> 
> -----
> No virus found in this message.
> Checked by AVG - www.avg.com
> Version: 2014.0.4800 / Virus Database: 4311/9817 - Release Date: 05/19/15


RE: Number of partitions

Posted by Garry Turkington <g....@improvedigital.com>.
Hi,

The other variable to think about here is the task to container mapping. Each job will indeed have 1 task per input partition in the underlying topic but you can then spread those 500 instances across multiple containers in your Yarn grid:

http://samza.apache.org/learn/documentation/0.9/container/samza-container.html

I'd also suggest thinking about throughput requirements in terms of both the Kafka and Samza perspectives. Great blog post from one of the Confluent guys here:

http://blog.confluent.io/2015/03/12/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/

For me I have Kafka brokers with JBOD disks so my starting partition count was number of brokers * number of disks. I went with that, discovered it wasn't hitting my needed throughput and had to up that several times. Currently I have around 160 partitions for the  high throughput topics (700K msgs/sec) on a 5 broker cluster.

So over-partitioning is a good thing and gives additional throughput and more flexible growth but if you are ramping that too soon you are likely requiring growth of your job container count. Look at your throughput requirements and hardware assets, pick a starting point and test your assumptions. If you are  like me you'll find most of them are wrong. :)

Garry

-----Original Message-----
From: Lukas Steiblys [mailto:lukas@doubledutch.me] 
Sent: 21 May 2015 20:20
To: dev@samza.apache.org
Subject: Re: Number of partitions

Each job will get all the partitions and each task (500 of them) within the job will get 1 partition. So there will be 500 processes working through the log.

I'd try to figure out what your scaling needs are for the next 2-3 years and then calculate your resource requirements accordingly (how many parallel executing tasks you would need). If you need to split, it is not trivial, but doable.

Lukas

-----Original Message-----
From: Michael Ravits
Sent: Thursday, May 21, 2015 11:17 AM
To: dev@samza.apache.org
Subject: Re: Number of partitions

Well, since the number of partitions can't be changed after the system starts running I wanted to have the flexibility to grow a lot without stopping for upgrade.
Just wonder what would be a tolerable number for Samza.
For example if I'd start with 5 jobs, each will get 100 partitions. Is this reasonable? Or too much for a single job instance?

On Thu, May 21, 2015 at 7:46 PM, Lukas Steiblys <lu...@doubledutch.me>
wrote:

> 500 is a bit extreme unless you're planning on running the job on some 
> 200 machines and try to exploit their full power. I personally run 4 
> in production for our system processing 100 messages/s and there's 
> plenty of room to grow.
>
> Lukas
>
> On Thursday, May 21, 2015, Michael Ravits <mi...@gmail.com> wrote:
>
> > Hi,
> >
> > I wonder what are the considerations I need to account for in regard 
> > to
> the
> > number of partitions in input topics for Samza.
> > When testing with a 500 partitions topic with one Samza job I 
> > noticed the start up time to be very long.
> > Are there any problems that might occur when dealing with this 
> > number of partitions?
> >
> > Thanks,
> > Michael
> >
> 


-----
No virus found in this message.
Checked by AVG - www.avg.com
Version: 2014.0.4800 / Virus Database: 4311/9817 - Release Date: 05/19/15

Re: Number of partitions

Posted by Lukas Steiblys <lu...@doubledutch.me>.
Each job will get all the partitions and each task (500 of them) within the 
job will get 1 partition. So there will be 500 processes working through the 
log.

I'd try to figure out what your scaling needs are for the next 2-3 years and 
then calculate your resource requirements accordingly (how many parallel 
executing tasks you would need). If you need to split, it is not trivial, 
but doable.

Lukas

-----Original Message----- 
From: Michael Ravits
Sent: Thursday, May 21, 2015 11:17 AM
To: dev@samza.apache.org
Subject: Re: Number of partitions

Well, since the number of partitions can't be changed after the system
starts running I wanted to have the flexibility to grow a lot without
stopping for upgrade.
Just wonder what would be a tolerable number for Samza.
For example if I'd start with 5 jobs, each will get 100 partitions. Is this
reasonable? Or too much for a single job instance?

On Thu, May 21, 2015 at 7:46 PM, Lukas Steiblys <lu...@doubledutch.me>
wrote:

> 500 is a bit extreme unless you're planning on running the job on some 200
> machines and try to exploit their full power. I personally run 4 in
> production for our system processing 100 messages/s and there's plenty of
> room to grow.
>
> Lukas
>
> On Thursday, May 21, 2015, Michael Ravits <mi...@gmail.com> wrote:
>
> > Hi,
> >
> > I wonder what are the considerations I need to account for in regard to
> the
> > number of partitions in input topics for Samza.
> > When testing with a 500 partitions topic with one Samza job I noticed 
> > the
> > start up time to be very long.
> > Are there any problems that might occur when dealing with this number of
> > partitions?
> >
> > Thanks,
> > Michael
> >
> 


Re: Number of partitions

Posted by Michael Ravits <mi...@gmail.com>.
Well, since the number of partitions can't be changed after the system
starts running I wanted to have the flexibility to grow a lot without
stopping for upgrade.
Just wonder what would be a tolerable number for Samza.
For example if I'd start with 5 jobs, each will get 100 partitions. Is this
reasonable? Or too much for a single job instance?

On Thu, May 21, 2015 at 7:46 PM, Lukas Steiblys <lu...@doubledutch.me>
wrote:

> 500 is a bit extreme unless you're planning on running the job on some 200
> machines and try to exploit their full power. I personally run 4 in
> production for our system processing 100 messages/s and there's plenty of
> room to grow.
>
> Lukas
>
> On Thursday, May 21, 2015, Michael Ravits <mi...@gmail.com> wrote:
>
> > Hi,
> >
> > I wonder what are the considerations I need to account for in regard to
> the
> > number of partitions in input topics for Samza.
> > When testing with a 500 partitions topic with one Samza job I noticed the
> > start up time to be very long.
> > Are there any problems that might occur when dealing with this number of
> > partitions?
> >
> > Thanks,
> > Michael
> >
>

Re: Number of partitions

Posted by Lukas Steiblys <lu...@doubledutch.me>.
500 is a bit extreme unless you're planning on running the job on some 200
machines and try to exploit their full power. I personally run 4 in
production for our system processing 100 messages/s and there's plenty of
room to grow.

Lukas

On Thursday, May 21, 2015, Michael Ravits <mi...@gmail.com> wrote:

> Hi,
>
> I wonder what are the considerations I need to account for in regard to the
> number of partitions in input topics for Samza.
> When testing with a 500 partitions topic with one Samza job I noticed the
> start up time to be very long.
> Are there any problems that might occur when dealing with this number of
> partitions?
>
> Thanks,
> Michael
>