You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@samza.apache.org by Anh Thu Vu <vu...@gmail.com> on 2014/02/06 16:30:35 UTC

Some questions about Samza from a newbie

Hi everyone,

I've just started using Samza, so pardon me if my questions are obvious.

I've looked through the example hello-samza. So here are some questions I
have when trying to write an application on Samza:
1) I can't seem to figure out how to define the number of partitions for a
stream.

2) Will a StreamTask stop? (when its input streams are all exhausted). How
to trigger this and stop the task when all messages are processed?

Thanks a lot, guys!

Casey

RE: Some questions about Samza from a newbie

Posted by Garry Turkington <g....@improvedigital.com>.
Casey,

Chris did respond to your email, link to the archive below in case it hit your spam filter or similar:

http://mail-archives.apache.org/mod_mbox/incubator-samza-dev/201402.mbox/%3C1B43C7411DB20E47AB0FB62E7262B80179BAB245%40ESV4-MBX01.linkedin.biz%3E

Garry

-----Original Message-----
From: Anh Thu Vu [mailto:vuanhthu888@gmail.com] 
Sent: 10 February 2014 09:54
To: dev@samza.incubator.apache.org
Subject: Re: Some questions about Samza from a newbie

Hi everyone,

Can someone help me with the two questions?

Thanks in advance,
Casey


On Thu, Feb 6, 2014 at 4:30 PM, Anh Thu Vu <vu...@gmail.com> wrote:

> Hi everyone,
>
> I've just started using Samza, so pardon me if my questions are obvious.
>
> I've looked through the example hello-samza. So here are some 
> questions I have when trying to write an application on Samza:
> 1) I can't seem to figure out how to define the number of partitions 
> for a stream.
>
> 2) Will a StreamTask stop? (when its input streams are all exhausted). 
> How to trigger this and stop the task when all messages are processed?
>
> Thanks a lot, guys!
>
> Casey
>

-----
No virus found in this message.
Checked by AVG - www.avg.com
Version: 2014.0.4259 / Virus Database: 3697/7080 - Release Date: 02/10/14

Re: Some questions about Samza from a newbie

Posted by Anh Thu Vu <vu...@gmail.com>.
Hi everyone,

Can someone help me with the two questions?

Thanks in advance,
Casey


On Thu, Feb 6, 2014 at 4:30 PM, Anh Thu Vu <vu...@gmail.com> wrote:

> Hi everyone,
>
> I've just started using Samza, so pardon me if my questions are obvious.
>
> I've looked through the example hello-samza. So here are some questions I
> have when trying to write an application on Samza:
> 1) I can't seem to figure out how to define the number of partitions for a
> stream.
>
> 2) Will a StreamTask stop? (when its input streams are all exhausted). How
> to trigger this and stop the task when all messages are processed?
>
> Thanks a lot, guys!
>
> Casey
>

Re: Some questions about Samza from a newbie

Posted by Chris Riccomini <cr...@linkedin.com>.
Hey Anh,

1. The number of partitions for your Samza job is determined by the number
of partitions from the input stream that you're reading from. If your
Samza job has task.inputs=foo, and the stream foo has 4 partitions, then
your Samza job will have 4 partitions. These partitions are grouped into
"containers" (Java processes). Each container owns some number of the
partitions for the Samza job. If you had two containers
(yarn.container.count=2), and your foo had 4 partitions, then each Samza
container would get two partitions: container 0 would get partitions 0 and
2, container 1 would get partitions 1 and 3.

2. Samza's model of a stream is infinite. When a StreamTask has read all
messages for a stream, it will simply wait until more messages are
available, and then process them when they arrive. Samza currently has no
built-in way to stop a container when a StreamTask has read all of the
messages in a stream, and "caught up". We actually have a need for this at
LinkedIn, so I'm planning to open a JIRA for this. My current line of
thinking is to add a BootstrapTask, which has a callback to let you know
when your stream has "caught up". Something like:

interface BootstrapTask {
  public void bootstrapped(SystemStreamPartition
bootstrappedSystemStreamPartition, Set<SystemStreamPartition>
unbootstrappedSystemStreamPartitions, MessageCollector, collector,
TaskCoordinator coordinator);
}

With an interface like this, you would call coordinator.shutdown() when
unbootstrappedSystemStreamPartitions.size() == 0 (you've bootstrapped all
system stream partitions). This doesn't exist yet, but I'll be working on
it this week, most likely.

In general, I recommend reading:

 * Background 
<http://samza.incubator.apache.org/learn/documentation/0.7.0/introduction/b
ackground.html>
 * Concepts 
<http://samza.incubator.apache.org/learn/documentation/0.7.0/introduction/c
oncepts.html>
 * Architecture 
<http://samza.incubator.apache.org/learn/documentation/0.7.0/introduction/a
rchitecture.html>

And watching:

  http://www.infoq.com/presentations/samza-linkedin

Cheers,
Chris

On 2/6/14 7:30 AM, "Anh Thu Vu" <vu...@gmail.com> wrote:

>Hi everyone,
>
>I've just started using Samza, so pardon me if my questions are obvious.
>
>I've looked through the example hello-samza. So here are some questions I
>have when trying to write an application on Samza:
>1) I can't seem to figure out how to define the number of partitions for a
>stream.
>
>2) Will a StreamTask stop? (when its input streams are all exhausted). How
>to trigger this and stop the task when all messages are processed?
>
>Thanks a lot, guys!
>
>Casey