You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@samza.apache.org by Lukas Steiblys <lu...@doubledutch.me> on 2015/10/20 00:49:17 UTC

Multithreading ThreadJobFactory

I have been thinking lately about the most non-invasive way to add multithreading capabilities to ThreadJobFactory, as that is the main method we run our jobs in production. Looking at the master branch code in Git, I have found the following:
  a.. The best way would be to simply spin up a new thread for each container. 
  b.. The number of containers can already be specified using the configuration property job.container.count. 
  c.. I can construct a new SamzaContainer for each containerModel returned from coordinator.jobModel.getContainers in ThreadJobFactory. 
  d.. I can pass a list of these containers into ThreadJob constructor modifying it to accept an array of Runnables. 
  e.. For each runnable, it would create a new thread and start it in the submit method of ThreadJob.
This should start up a new thread for each container and group the tasks using the appropriate TaskNameGrouper.

Any ideas on what I might have missed? Are there any other potential solutions? Would this be a good patch for Samza in general?

Lukas

Re: Multithreading ThreadJobFactory

Posted by Yi Pan <ni...@gmail.com>.
Hi, Lukas,

Sorry to reply late. My comments below

On Tue, Oct 20, 2015 at 7:14 AM, Lukas Steiblys <lu...@doubledutch.me>
wrote:

>
> >> I have been thinking lately about the most non-invasive way to add
> >> multithreading capabilities to ThreadJobFactory, as that is the main
> >> method
> >> we run our jobs in production. Looking at the master branch code in Git,
> >> I
> >> have found the following:
> >>   a.. The best way would be to simply spin up a new thread for each
> >> container.
> >>   b.. The number of containers can already be specified using the
> >> configuration property job.container.count.
> >>   c.. I can construct a new SamzaContainer for each containerModel
> >> returned from coordinator.jobModel.getContainers in ThreadJobFactory.
>

I like this idea. So I guess this would be a simplified version of
parallelized Samza jobs within one JVM. It could be useful for users that
need multiple parallel threads but does not need to scale the job across
multiple machines. We do have a plan to work on a similar multi-thread
model in a distributed environment in LinkedIn. The main difference here
would be the coordinator would need to have a distributed version of
implementation to include partition assignment, checkpoint, changelog
initialization. It would be ideal if the coordinator's API can stay the
same in both single node mode and distributed mode. I would be able to
comment more when we have a sketchy design doc.


> >>   d.. I can pass a list of these containers into ThreadJob constructor
> >> modifying it to accept an array of Runnables.
> >>   e.. For each runnable, it would create a new thread and start it in
> the
> >> submit method of ThreadJob.
> >> This should start up a new thread for each container and group the tasks
> >> using the appropriate TaskNameGrouper.
>

Wouldn't it make more sense to have the TaskNameGrouper be part of the
coordinator logic as well? And each container will just take a
ContainerModel to start. That way, the container logic would be much more
simplified.


> >>
> >> Any ideas on what I might have missed? Are there any other potential
> >> solutions? Would this be a good patch for Samza in general?
> >>
> >> Lukas
> >>
> >
>

Definitely worth to push forward in that direction. This is aligned w/ our
effort to build a standalone execution model of Samza.

Thanks a lot for putting so much thoughts in this direction.

-Yi

Re: Multithreading ThreadJobFactory

Posted by Lukas Steiblys <lu...@doubledutch.me>.
Kartik,

This is for the case when you don't use YARN. ThreadJob runs locally
and simply spins up a single thread for all tasks right now.

Lukas

On 10/20/15, Kartik Paramasivam <kp...@linkedin.com.invalid> wrote:
> We have been wanting to do something similar at LinkedIn.  We however
> haven't thought through the details.
>
> if container == thread.. then we would need to change the AppMaster to
> request the appropriate number of Yarn 'containers' (processes) .. i.e. we
> would have to decouple the process count from the yarn.Containers.Count ..
>
> Basically wouldn't we have to come up with a new setting Yarn.ProcessCount
> ?
>
> On Mon, Oct 19, 2015 at 3:49 PM, Lukas Steiblys <lu...@doubledutch.me>
> wrote:
>
>> I have been thinking lately about the most non-invasive way to add
>> multithreading capabilities to ThreadJobFactory, as that is the main
>> method
>> we run our jobs in production. Looking at the master branch code in Git,
>> I
>> have found the following:
>>   a.. The best way would be to simply spin up a new thread for each
>> container.
>>   b.. The number of containers can already be specified using the
>> configuration property job.container.count.
>>   c.. I can construct a new SamzaContainer for each containerModel
>> returned from coordinator.jobModel.getContainers in ThreadJobFactory.
>>   d.. I can pass a list of these containers into ThreadJob constructor
>> modifying it to accept an array of Runnables.
>>   e.. For each runnable, it would create a new thread and start it in the
>> submit method of ThreadJob.
>> This should start up a new thread for each container and group the tasks
>> using the appropriate TaskNameGrouper.
>>
>> Any ideas on what I might have missed? Are there any other potential
>> solutions? Would this be a good patch for Samza in general?
>>
>> Lukas
>>
>

Re: Multithreading ThreadJobFactory

Posted by Kartik Paramasivam <kp...@linkedin.com.INVALID>.
We have been wanting to do something similar at LinkedIn.  We however
haven't thought through the details.

if container == thread.. then we would need to change the AppMaster to
request the appropriate number of Yarn 'containers' (processes) .. i.e. we
would have to decouple the process count from the yarn.Containers.Count ..

Basically wouldn't we have to come up with a new setting Yarn.ProcessCount
?

On Mon, Oct 19, 2015 at 3:49 PM, Lukas Steiblys <lu...@doubledutch.me>
wrote:

> I have been thinking lately about the most non-invasive way to add
> multithreading capabilities to ThreadJobFactory, as that is the main method
> we run our jobs in production. Looking at the master branch code in Git, I
> have found the following:
>   a.. The best way would be to simply spin up a new thread for each
> container.
>   b.. The number of containers can already be specified using the
> configuration property job.container.count.
>   c.. I can construct a new SamzaContainer for each containerModel
> returned from coordinator.jobModel.getContainers in ThreadJobFactory.
>   d.. I can pass a list of these containers into ThreadJob constructor
> modifying it to accept an array of Runnables.
>   e.. For each runnable, it would create a new thread and start it in the
> submit method of ThreadJob.
> This should start up a new thread for each container and group the tasks
> using the appropriate TaskNameGrouper.
>
> Any ideas on what I might have missed? Are there any other potential
> solutions? Would this be a good patch for Samza in general?
>
> Lukas
>