Posted to dev@samza.apache.org by Martin Kleppmann <mk...@linkedin.com> on 2014/02/04 18:51:33 UTC

Making hello-samza easier to get started with

I love the hello-samza project -- it's quite magical to run a bunch of commands and see real data flow through the example job. Great idea to use Wikipedia's IRC feed!

However, I feel the setup process is still a bit intimidating and fragile. I just wanted to bounce around some ideas about how we could make it quicker to get started:

• YARN is very heavyweight (100MB download). Could we avoid using YARN in hello-samza, in favour of LocalJobFactory? Does Kafka have a local mode for development that doesn't require Zookeeper? The fewer dependencies the better.

• The Vagrant bootstrap script was quite broken -- I submitted a pull request (https://github.com/linkedin/hello-samza/pull/18) which should hopefully fix it.

• I somehow got my setup into a bad state (where YARN was running but its web UI wouldn't load); I think it happened because I ran `vagrant up` at the same time as `bin/grid bootstrap` outside of the VM, and the two processes trampled on each other. Deleting the 'deploy' directory and starting from a clean slate fixed it. Can we isolate Vagrant and local-OS bootstrap from each other?

• Can we make task logs go to stdout by default? Logs provide reassurance that something is happening, and at the moment you have to dig around somewhere in the deploy directory to find the log files.

• Can we shorten the commands? Having to unpack the .tar.gz file and then copy/paste a scary long run-job.sh line makes the process feel arcane, and obscures what is really happening. Perhaps just a shell script wrapper for run-job.sh or a maven goal would do it.

• Would it be possible to have maven download the dependencies, rather than bin/grid calling curl on random URLs? Somehow it feels weird to have a script download and run random code off the internet (although of course that's what every package manager does, it's irrational). It would also avoid re-downloading everything in case you decide to blow away the deploy directory.

What do you think? Please chime in. I'm happy to work on these things, just wanted to get a read on what people think first.

Martin

Re: Making hello-samza easier to get started with

Posted by Chris Riccomini <cr...@linkedin.com>.

Hey Guys,

I think I agree with Sriram on this one.

It seems to me that, if we move the Vagrant stuff out into a separate
repo, hello-samza is pretty straightforward. Copying and pasting a long
line isn't that scary to me; I just ignore it and follow the directions,
but that's just my personality. :)

Cheers,
Chris

On 2/5/14 8:20 AM, "Sriram" <sr...@gmail.com> wrote:

>- I am not convinced that LocalJobFactory should be the default mode for
>hello-samza. The target users for Samza are developers. Showing how
>awesome it is to set up Samza with Kafka and YARN and consume wiki edit
>events in 5 - 10 minutes is really the big win. I don't think we gain
>much in reducing this time to 1 minute. I am also not a fan of having
>many ways to do a quickstart, which is my next point.
>
>- Having three ways to do a quickstart defeats the purpose, and I vote for
>moving Vagrant out into another repo. However, I do think the default
>should use YARN as mentioned above. I don't see a value add in making
>it LocalJobFactory.
>
>> On Feb 5, 2014, at 6:21 AM, Martin Kleppmann <mk...@linkedin.com>
>>wrote:
>> 
>> Hi Chris,
>> 
>> On 4 Feb 2014, at 19:05, Chris Riccomini <cr...@linkedin.com>
>>wrote:
>>>> [...]
>>>> • YARN is very heavyweight (100MB download). Could we avoid using YARN in
>>>> hello-samza, in favour of LocalJobFactory? Does Kafka have a local mode
>>>> for development that doesn't require Zookeeper? The fewer dependencies
>>>> the better.
>>> 
>>> On the one hand, I agree with you that it's annoying to have so many
>>> dependencies get pulled in. On the other hand, these systems are
>>> non-trivial to install, and getting them up and running, and showing the
>>> full power of Samza is a big deal. When I wrote hello-samza, I originally
>>> was just going to use LocalJobFactory, and not even use Kafka. This would
>>> have eliminated all dependencies. I opted against this because I felt like
>>> it gave a much poorer feel of what Samza was, and how it worked in the
>>> real world. For example, having the AM dashboard is really helpful, and
>>> allows us to illustrate what containers are, etc.
>> 
>> I agree that it's good to show the full power of Samza, and make it
>>easy to get started with YARN etc. But that raises the question: who is
>>hello-samza intended for?
>> 
>> - Is it for somebody who just saw a link to the Samza website in a
>>tweet, but who hasn't read the documentation yet, and who just wants to
>>quickly decide whether to invest more time into finding out about Samza?
>>(The "2-minute-quickly-playing-around" use case)
>> 
>> - Or is it for somebody who has already decided to try Samza, and wants
>>a reference project as a starting point for their own project? (The
>>"1-hour-experimentation" use case)
>> 
>> Both are valid use cases. The fact that "Hello Samza" appears as the
>>very first item in the website navigation suggests that it's intended
>>for the first case, whereas the full-on YARN install is more appropriate
>>to the second case.
>> 
>> In that light, I'd like to suggest the following:
>> 
>> - We move both the Vagrant setup and bin/grid into a separate
>>repository (call it "samza-instant-grid" or something like that). Since
>>the Vagrant setup depends on bin/grid, it makes sense for the two to be
>>in the same repository. That repo doesn't contain a particular Samza job
>>-- it's focused on the purpose of getting to a working YARN+Kafka+ZK
>>setup as quickly as possible, either on the local OS or inside a VM.
>> 
>> - We change hello-samza to use LocalJobFactory by default, for instant
>>gratification of people who are completely new to Samza. And at the end
>>of the hello-samza instructions we say something like: "Congratulations,
>>you've run your first Samza job! But it was running in local mode, which
>>is only for development, and doesn't have the resource isolation or
>>fault tolerance features of a real Samza deployment. Check out
>>[samza-instant-grid](LINK) to set up a miniature Samza cluster on your
>>machine in 10 minutes. You can then deploy
>>samza-job-package/target/samza-job-package-0.7.0-dist.tar.gz to your
>>local cluster, and see the same job running in a YARN container."
>> 
>> That would allow hello-samza to satisfy both the
>>2-minute-quickly-playing-around use case and the 1-hour-experimentation
>>use case. And it would have the side benefit of showing how to set up a
>>project to use both local mode for development (which I think is
>>genuinely useful) and also generate an artifact that is deployable to
>>YARN.
>> 
>> Does that make sense?
>> 
>>>> • I somehow got my setup into a bad state (where YARN was running but its
>>>> web UI wouldn't load); I think it happened because I ran `vagrant up` at
>>>> the same time as `bin/grid bootstrap` outside of the VM, and the two
>>>> processes trampled on each other. Deleting the 'deploy' directory and
>>>> starting from a clean slate fixed it. Can we isolate Vagrant and local-OS
>>>> bootstrap from each other?
>>> 
>>> Yea, we really need to think this through. Originally, we only had local
>>> bin/grid (no Vagrant). Now, we have two different ways to run hello-samza,
>>> which is really confusing (especially since the README only talks about
>>> Vagrant, and the Samza website only talks about local mode). Jakob and I
>>> were talking about this as well. It seems like a good thing to move the
>>> Vagrant stuff somewhere else, and be clear about the two different ways of
>>> bootstrapping. Not quite sure about the best way to do this, but Jakob had
>>> some thoughts.
>> 
>> Jakob, would be interested to hear what you think.
>> 
>>>> • Can we make task logs go to stdout by default? Logs provide reassurance
>>>> that something is happening, and at the moment you have to dig around
>>>> somewhere in the deploy directory to find the log files.
>>> 
>>> Not quite sure what you mean here. You mean the ZK/YARN/Kafka logs?
>> 
>> The run-job.sh commands currently give no visual feedback as to what is
>>happening -- you just start it, but then the job disappears into a
>>'black hole'. You can start the kafka-console-consumer to see the output
>>of a job, or you can find it on the YARN web UI, but a more immediate
>>form of feedback would be for the job's startup logs to appear on stdout.
>> 
>> I noticed a file deploy/samza/undefined-samza-container-name.log, which
>>included some info from the Samza job starting up, such as the MOTD sent
>>by the Wikipedia IRC gateway after connecting. That's the kind of output
>>I was thinking of.
>> 
>> Showing logs on stdout probably makes most sense when a job is run
>>through LocalJobFactory. If a job is deployed to YARN, it's
>>understandable that the logs are not shown (because they are generated
>>in a different process, potentially on a different machine).
>> 
>>>> • Can we shorten the commands? Having to unpack the .tar.gz file and then
>>>> copy/paste a scary long run-job.sh line makes the process feel arcane,
>>>> and obscures what is really happening. Perhaps just a shell script
>>>> wrapper for run-job.sh or a maven goal would do it.
>>> 
>>> Regarding the mkdir and .tar.gz unpacking, we should just do this as part
>>> of `mvn package`. If you want to make that change, I'm all for it.
>>> 
>>> As for hiding the run-job.sh, I'm not as convinced of getting rid of it. I
>>> kind of like exposing how Samza actually works to the developer, so they
>>> know. Hiding it behind some one-off script doesn't really help them
>>> understand Samza (of course the same argument could be made for hiding
>>> YARN/ZK/Kafka behind bin/grid). Perhaps we just need more documentation in
>>> the walkthrough about what this command does and what the parameters are?
>> 
>> If run-job.sh is part of samza-instant-grid, I think it's ok to keep it
>>as-is, and document it.
>> 
>> For the 2-minute-quickly-playing-around use case, I fear that a long
>>command mentioning factories is more confusing than enlightening. Am I
>>right in thinking that when using LocalJobFactory, run-job.sh is not
>>needed?
>> 
>>>> • Would it be possible to have maven download the dependencies, rather
>>>> than bin/grid calling curl on random URLs? Somehow it feels weird to have
>>>> a script download and run random code off the internet (although of
>>>> course that's what every package manager does, it's irrational). It would
>>>> also avoid re-downloading everything in case you decide to blow away the
>>>> deploy directory.
>>> 
>>> Not sure about this. All of this stuff is up in Apache's HTTP servers, but
>>> I'm not sure if the release packages for these projects are published into
>>> Maven central (I'm nearly 100% certain that Kafka isn't). If they're not,
>>> then having Maven download the packages is no different than having the
>>> shell script do it.
>>> 
>>> One alternative would be to have the bin/grid script cache the files
>>> locally somewhere, so that blowing away the deploy directory doesn't
>>> trigger a re-download of YARN/ZK/Kafka again.
>> 
>> Ok, having the shell script cache the files in another directory sounds
>>good. I'm happy to make that change.
>> 
>> Cheers,
>> Martin
>>

Re: Making hello-samza easier to get started with

Posted by Sriram <sr...@gmail.com>.

- I am not convinced that LocalJobFactory should be the default mode for hello-samza. The target users for Samza are developers. Showing how awesome it is to set up Samza with Kafka and YARN and consume wiki edit events in 5 - 10 minutes is really the big win. I don't think we gain much in reducing this time to 1 minute. I am also not a fan of having many ways to do a quickstart, which is my next point.

- Having three ways to do a quickstart defeats the purpose, and I vote for moving Vagrant out into another repo. However, I do think the default should use YARN as mentioned above. I don't see a value add in making it LocalJobFactory.

> On Feb 5, 2014, at 6:21 AM, Martin Kleppmann <mk...@linkedin.com> wrote:
> 
> Hi Chris,
> 
> On 4 Feb 2014, at 19:05, Chris Riccomini <cr...@linkedin.com> wrote:
>>> [...]
>>> • YARN is very heavyweight (100MB download). Could we avoid using YARN in
>>> hello-samza, in favour of LocalJobFactory? Does Kafka have a local mode
>>> for development that doesn't require Zookeeper? The fewer dependencies
>>> the better.
>> 
>> On the one hand, I agree with you that it's annoying to have so many
>> dependencies get pulled in. On the other hand, these systems are
>> non-trivial to install, and getting them up and running, and showing the
>> full power of Samza is a big deal. When I wrote hello-samza, I originally
>> was just going to use LocalJobFactory, and not even use Kafka. This would
>> have eliminated all dependencies. I opted against this because I felt like
>> it gave a much poorer feel of what Samza was, and how it worked in the
>> real world. For example, having the AM dashboard is really helpful, and
>> allows us to illustrate what containers are, etc.
> 
> I agree that it's good to show the full power of Samza, and make it easy to get started with YARN etc. But that raises the question: who is hello-samza intended for?
> 
> - Is it for somebody who just saw a link to the Samza website in a tweet, but who hasn't read the documentation yet, and who just wants to quickly decide whether to invest more time into finding out about Samza? (The "2-minute-quickly-playing-around" use case)
> 
> - Or is it for somebody who has already decided to try Samza, and wants a reference project as a starting point for their own project? (The "1-hour-experimentation" use case)
> 
> Both are valid use cases. The fact that "Hello Samza" appears as the very first item in the website navigation suggests that it's intended for the first case, whereas the full-on YARN install is more appropriate to the second case.
> 
> In that light, I'd like to suggest the following:
> 
> - We move both the Vagrant setup and bin/grid into a separate repository (call it "samza-instant-grid" or something like that). Since the Vagrant setup depends on bin/grid, it makes sense for the two to be in the same repository. That repo doesn't contain a particular Samza job -- it's focused on the purpose of getting to a working YARN+Kafka+ZK setup as quickly as possible, either on the local OS or inside a VM.
> 
> - We change hello-samza to use LocalJobFactory by default, for instant gratification of people who are completely new to Samza. And at the end of the hello-samza instructions we say something like: "Congratulations, you've run your first Samza job! But it was running in local mode, which is only for development, and doesn't have the resource isolation or fault tolerance features of a real Samza deployment. Check out [samza-instant-grid](LINK) to set up a miniature Samza cluster on your machine in 10 minutes. You can then deploy samza-job-package/target/samza-job-package-0.7.0-dist.tar.gz to your local cluster, and see the same job running in a YARN container."
> 
> That would allow hello-samza to satisfy both the 2-minute-quickly-playing-around use case and the 1-hour-experimentation use case. And it would have the side benefit of showing how to set up a project to use both local mode for development (which I think is genuinely useful) and also generate an artifact that is deployable to YARN.
> 
> Does that make sense?
> 
>>> • I somehow got my setup into a bad state (where YARN was running but its
>>> web UI wouldn't load); I think it happened because I ran `vagrant up` at
>>> the same time as `bin/grid bootstrap` outside of the VM, and the two
>>> processes trampled on each other. Deleting the 'deploy' directory and
>>> starting from a clean slate fixed it. Can we isolate Vagrant and local-OS
>>> bootstrap from each other?
>> 
>> Yea, we really need to think this through. Originally, we only had local
>> bin/grid (no Vagrant). Now, we have two different ways to run hello-samza,
>> which is really confusing (especially since the README only talks about
>> Vagrant, and the Samza website only talks about local mode). Jakob and I
>> were talking about this as well. It seems like a good thing to move the
>> Vagrant stuff somewhere else, and be clear about the two different ways of
>> bootstrapping. Not quite sure about the best way to do this, but Jakob had
>> some thoughts.
> 
> Jakob, would be interested to hear what you think.
> 
>>> • Can we make task logs go to stdout by default? Logs provide reassurance
>>> that something is happening, and at the moment you have to dig around
>>> somewhere in the deploy directory to find the log files.
>> 
>> Not quite sure what you mean here. You mean the ZK/YARN/Kafka logs?
> 
> The run-job.sh commands currently give no visual feedback as to what is happening -- you just start it, but then the job disappears into a 'black hole'. You can start the kafka-console-consumer to see the output of a job, or you can find it on the YARN web UI, but a more immediate form of feedback would be for the job's startup logs to appear on stdout.
> 
> I noticed a file deploy/samza/undefined-samza-container-name.log, which included some info from the Samza job starting up, such as the MOTD sent by the Wikipedia IRC gateway after connecting. That's the kind of output I was thinking of.
> 
> Showing logs on stdout probably makes most sense when a job is run through LocalJobFactory. If a job is deployed to YARN, it's understandable that the logs are not shown (because they are generated in a different process, potentially on a different machine).
> 
>>> • Can we shorten the commands? Having to unpack the .tar.gz file and then
>>> copy/paste a scary long run-job.sh line makes the process feel arcane,
>>> and obscures what is really happening. Perhaps just a shell script
>>> wrapper for run-job.sh or a maven goal would do it.
>> 
>> Regarding the mkdir and .tar.gz unpacking, we should just do this as part
>> of `mvn package`. If you want to make that change, I'm all for it.
>> 
>> As for hiding the run-job.sh, I'm not as convinced of getting rid of it. I
>> kind of like exposing how Samza actually works to the developer, so they
>> know. Hiding it behind some one-off script doesn't really help them
>> understand Samza (of course the same argument could be made for hiding
>> YARN/ZK/Kafka behind bin/grid). Perhaps we just need more documentation in
>> the walkthrough about what this command does and what the parameters are?
> 
> If run-job.sh is part of samza-instant-grid, I think it's ok to keep it as-is, and document it.
> 
> For the 2-minute-quickly-playing-around use case, I fear that a long command mentioning factories is more confusing than enlightening. Am I right in thinking that when using LocalJobFactory, run-job.sh is not needed?
> 
>>> • Would it be possible to have maven download the dependencies, rather
>>> than bin/grid calling curl on random URLs? Somehow it feels weird to have
>>> a script download and run random code off the internet (although of
>>> course that's what every package manager does, it's irrational). It would
>>> also avoid re-downloading everything in case you decide to blow away the
>>> deploy directory.
>> 
>> Not sure about this. All of this stuff is up in Apache's HTTP servers, but
>> I'm not sure if the release packages for these projects are published into
>> Maven central (I'm nearly 100% certain that Kafka isn't). If they're not,
>> then having Maven download the packages is no different than having the
>> shell script do it.
>> 
>> One alternative would be to have the bin/grid script cache the files
>> locally somewhere, so that blowing away the deploy directory doesn't
>> trigger a re-download of YARN/ZK/Kafka again.
> 
> Ok, having the shell script cache the files in another directory sounds good. I'm happy to make that change.
> 
> Cheers,
> Martin
>

Re: Making hello-samza easier to get started with

Posted by Martin Kleppmann <mk...@linkedin.com>.

Hi Chris,

On 4 Feb 2014, at 19:05, Chris Riccomini <cr...@linkedin.com> wrote:
>> [...]
>> • YARN is very heavyweight (100MB download). Could we avoid using YARN in
>> hello-samza, in favour of LocalJobFactory? Does Kafka have a local mode
>> for development that doesn't require Zookeeper? The fewer dependencies
>> the better.
> 
> On the one hand, I agree with you that it's annoying to have so many
> dependencies get pulled in. On the other hand, these systems are
> non-trivial to install, and getting them up and running, and showing the
> full power of Samza is a big deal. When I wrote hello-samza, I originally
> was just going to use LocalJobFactory, and not even use Kafka. This would
> have eliminated all dependencies. I opted against this because I felt like
> it gave a much poorer feel of what Samza was, and how it worked in the
> real world. For example, having the AM dashboard is really helpful, and
> allows us to illustrate what containers are, etc.

I agree that it's good to show the full power of Samza, and make it easy to get started with YARN etc. But that raises the question: who is hello-samza intended for?

- Is it for somebody who just saw a link to the Samza website in a tweet, but who hasn't read the documentation yet, and who just wants to quickly decide whether to invest more time into finding out about Samza? (The "2-minute-quickly-playing-around" use case)

- Or is it for somebody who has already decided to try Samza, and wants a reference project as a starting point for their own project? (The "1-hour-experimentation" use case)

Both are valid use cases. The fact that "Hello Samza" appears as the very first item in the website navigation suggests that it's intended for the first case, whereas the full-on YARN install is more appropriate to the second case.

In that light, I'd like to suggest the following:

- We move both the Vagrant setup and bin/grid into a separate repository (call it "samza-instant-grid" or something like that). Since the Vagrant setup depends on bin/grid, it makes sense for the two to be in the same repository. That repo doesn't contain a particular Samza job -- it's focused on the purpose of getting to a working YARN+Kafka+ZK setup as quickly as possible, either on the local OS or inside a VM.

- We change hello-samza to use LocalJobFactory by default, for instant gratification of people who are completely new to Samza. And at the end of the hello-samza instructions we say something like: "Congratulations, you've run your first Samza job! But it was running in local mode, which is only for development, and doesn't have the resource isolation or fault tolerance features of a real Samza deployment. Check out [samza-instant-grid](LINK) to set up a miniature Samza cluster on your machine in 10 minutes. You can then deploy samza-job-package/target/samza-job-package-0.7.0-dist.tar.gz to your local cluster, and see the same job running in a YARN container."

That would allow hello-samza to satisfy both the 2-minute-quickly-playing-around use case and the 1-hour-experimentation use case. And it would have the side benefit of showing how to set up a project to use both local mode for development (which I think is genuinely useful) and also generate an artifact that is deployable to YARN.

Does that make sense?

>> • I somehow got my setup into a bad state (where YARN was running but its
>> web UI wouldn't load); I think it happened because I ran `vagrant up` at
>> the same time as `bin/grid bootstrap` outside of the VM, and the two
>> processes trampled on each other. Deleting the 'deploy' directory and
>> starting from a clean slate fixed it. Can we isolate Vagrant and local-OS
>> bootstrap from each other?
> 
> Yea, we really need to think this through. Originally, we only had local
> bin/grid (no Vagrant). Now, we have two different ways to run hello-samza,
> which is really confusing (especially since the README only talks about
> Vagrant, and the Samza website only talks about local mode). Jakob and I
> were talking about this as well. It seems like a good thing to move the
> Vagrant stuff somewhere else, and be clear about the two different ways of
> bootstrapping. Not quite sure about the best way to do this, but Jakob had
> some thoughts.

Jakob, would be interested to hear what you think.

>> • Can we make task logs go to stdout by default? Logs provide reassurance
>> that something is happening, and at the moment you have to dig around
>> somewhere in the deploy directory to find the log files.
> 
> Not quite sure what you mean here. You mean the ZK/YARN/Kafka logs?

The run-job.sh commands currently give no visual feedback as to what is happening -- you just start it, but then the job disappears into a 'black hole'. You can start the kafka-console-consumer to see the output of a job, or you can find it on the YARN web UI, but a more immediate form of feedback would be for the job's startup logs to appear on stdout.

I noticed a file deploy/samza/undefined-samza-container-name.log, which included some info from the Samza job starting up, such as the MOTD sent by the Wikipedia IRC gateway after connecting. That's the kind of output I was thinking of.

Showing logs on stdout probably makes most sense when a job is run through LocalJobFactory. If a job is deployed to YARN, it's understandable that the logs are not shown (because they are generated in a different process, potentially on a different machine).
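
Until logs appear on stdout, one workaround is to follow the log files in the deploy directory. This is a minimal sketch, assuming the logs sit directly under deploy/samza as the file mentioned above does. The helper name `job_log_files` is hypothetical and not part of hello-samza.

```shell
#!/usr/bin/env bash
# Sketch only: job_log_files is a hypothetical helper, not part of hello-samza.

# job_log_files [DIR]
# Prints every *.log file directly under DIR (default: deploy/samza),
# one per line, sorted by name.
job_log_files() {
  local dir="${1:-deploy/samza}"
  find "$dir" -maxdepth 1 -type f -name '*.log' | sort
}
```

Something like `job_log_files deploy/samza | xargs tail -F` would then stream whatever the job has written so far.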

>> • Can we shorten the commands? Having to unpack the .tar.gz file and then
>> copy/paste a scary long run-job.sh line makes the process feel arcane,
>> and obscures what is really happening. Perhaps just a shell script
>> wrapper for run-job.sh or a maven goal would do it.
> 
> Regarding the mkdir and .tar.gz unpacking, we should just do this as part
> of `mvn package`. If you want to make that change, I'm all for it.
> 
> As for hiding the run-job.sh, I'm not as convinced of getting rid of it. I
> kind of like exposing how Samza actually works to the developer, so they
> know. Hiding it behind some one-off script doesn't really help them
> understand Samza (of course the same argument could be made for hiding
> YARN/ZK/Kafka behind bin/grid). Perhaps we just need more documentation in
> the walkthrough about what this command does and what the parameters are?

If run-job.sh is part of samza-instant-grid, I think it's ok to keep it as-is, and document it.

For the 2-minute-quickly-playing-around use case, I fear that a long command mentioning factories is more confusing than enlightening. Am I right in thinking that when using LocalJobFactory, run-job.sh is not needed?

>> • Would it be possible to have maven download the dependencies, rather
>> than bin/grid calling curl on random URLs? Somehow it feels weird to have
>> a script download and run random code off the internet (although of
>> course that's what every package manager does, it's irrational). It would
>> also avoid re-downloading everything in case you decide to blow away the
>> deploy directory.
> 
> Not sure about this. All of this stuff is up in Apache's HTTP servers, but
> I'm not sure if the release packages for these projects are published into
> Maven central (I'm nearly 100% certain that Kafka isn't). If they're not,
> then having Maven download the packages is no different than having the
> shell script do it.
> 
> One alternative would be to have the bin/grid script cache the files
> locally somewhere, so that blowing away the deploy directory doesn't
> trigger a re-download of YARN/ZK/Kafka again.

Ok, having the shell script cache the files in another directory sounds good. I'm happy to make that change.

Cheers,
Martin

Re: Making hello-samza easier to get started with

Posted by Chris Riccomini <cr...@linkedin.com>.

Hey Martin,

Responses inline.

The things in your list that I'm most excited about are:

1. Split Vagrant out.
2. Collapsing mkdir and tar -xvf into `mvn package`.
3. Making bin/grid cache downloads outside of the deploy directory.

I'm not really opposed to some of the other stuff, but we need to think it
through more (and probably need feedback from others).

Cheers,
Chris

On 2/4/14 9:51 AM, "Martin Kleppmann" <mk...@linkedin.com> wrote:

>I love the hello-samza project -- it's quite magical to run a bunch of
>commands and see real data flow through the example job. Great idea to
>use Wikipedia's IRC feed!
>
>However, I feel the setup process is still a bit intimidating and
>fragile. I just wanted to bounce around some ideas about how we could
>make it quicker to get started:
>
>• YARN is very heavyweight (100MB download). Could we avoid using YARN in
>hello-samza, in favour of LocalJobFactory? Does Kafka have a local mode
>for development that doesn't require Zookeeper? The fewer dependencies
>the better.

On the one hand, I agree with you that it's annoying to have so many
dependencies get pulled in. On the other hand, these systems are
non-trivial to install, and getting them up and running, and showing the
full power of Samza is a big deal. When I wrote hello-samza, I originally
was just going to use LocalJobFactory, and not even use Kafka. This would
have eliminated all dependencies. I opted against this because I felt like
it gave a much poorer feel of what Samza was, and how it worked in the
real world. For example, having the AM dashboard is really helpful, and
allows us to illustrate what containers are, etc.

>
>• The Vagrant bootstrap script was quite broken -- I submitted a pull
>request (https://github.com/linkedin/hello-samza/pull/18) which should
>hopefully fix it.

Took a look. Looks good to me. Will merge if no one has any objections.

>
>• I somehow got my setup into a bad state (where YARN was running but its
>web UI wouldn't load); I think it happened because I ran `vagrant up` at
>the same time as `bin/grid bootstrap` outside of the VM, and the two
>processes trampled on each other. Deleting the 'deploy' directory and
>starting from a clean slate fixed it. Can we isolate Vagrant and local-OS
>bootstrap from each other?

Yea, we really need to think this through. Originally, we only had local
bin/grid (no Vagrant). Now, we have two different ways to run hello-samza,
which is really confusing (especially since the README only talks about
Vagrant, and the Samza website only talks about local mode). Jakob and I
were talking about this as well. It seems like a good thing to move the
Vagrant stuff somewhere else, and be clear about the two different ways of
bootstrapping. Not quite sure about the best way to do this, but Jakob had
some thoughts.

>
>• Can we make task logs go to stdout by default? Logs provide reassurance
>that something is happening, and at the moment you have to dig around
>somewhere in the deploy directory to find the log files.

Not quite sure what you mean here. You mean the ZK/YARN/Kafka logs?

>
>• Can we shorten the commands? Having to unpack the .tar.gz file and then
>copy/paste a scary long run-job.sh line makes the process feel arcane,
>and obscures what is really happening. Perhaps just a shell script
>wrapper for run-job.sh or a maven goal would do it.

Regarding the mkdir and .tar.gz unpacking, we should just do this as part
of `mvn package`. If you want to make that change, I'm all for it.

As for hiding the run-job.sh, I'm not as convinced of getting rid of it. I
kind of like exposing how Samza actually works to the developer, so they
know. Hiding it behind some one-off script doesn't really help them
understand Samza (of course the same argument could be made for hiding
YARN/ZK/Kafka behind bin/grid). Perhaps we just need more documentation in
the walkthrough about what this command does and what the parameters are?
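
For reference, the mkdir and unpack step could be sketched as the small helper below, whether it ends up inside `mvn package` or in a script. The function name `unpack_job_package` is hypothetical. The tarball path and the deploy/samza destination in the usage note are the ones mentioned elsewhere in this thread. It is a rough illustration of the manual step, not how the Maven build would do it.

```shell
#!/usr/bin/env bash
# Sketch only: unpack_job_package is a hypothetical helper, not part of hello-samza.

# unpack_job_package TARBALL DEST_DIR
# Creates DEST_DIR if needed and extracts the gzipped job package into it.
unpack_job_package() {
  local tarball="$1" dest="$2"
  if [ ! -f "$tarball" ]; then
    echo "missing $tarball; run 'mvn package' first" >&2
    return 1
  fi
  mkdir -p "$dest"
  tar -xzf "$tarball" -C "$dest"
}
```

Usage would be along the lines of `unpack_job_package samza-job-package/target/samza-job-package-0.7.0-dist.tar.gz deploy/samza`, after which run-job.sh is invoked exactly as the walkthrough describes.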

>
>• Would it be possible to have maven download the dependencies, rather
>than bin/grid calling curl on random URLs? Somehow it feels weird to have
>a script download and run random code off the internet (although of
>course that's what every package manager does, it's irrational). It would
>also avoid re-downloading everything in case you decide to blow away the
>deploy directory.

Not sure about this. All of this stuff is up in Apache's HTTP servers, but
I'm not sure if the release packages for these projects are published into
Maven central (I'm nearly 100% certain that Kafka isn't). If they're not,
then having Maven download the packages is no different than having the
shell script do it.

One alternative would be to have the bin/grid script cache the files
locally somewhere, so that blowing away the deploy directory doesn't
trigger a re-download of YARN/ZK/Kafka again.
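
A minimal sketch of that caching idea, assuming bin/grid fetches each package with curl. The names `DOWNLOAD_CACHE_DIR` and `download_cached`, and the default cache location, are hypothetical and not taken from the real script.

```shell
#!/usr/bin/env bash
# Sketch only: DOWNLOAD_CACHE_DIR and download_cached are hypothetical names,
# not part of the real bin/grid.

# Cache location outside the deploy directory, overridable from the environment.
DOWNLOAD_CACHE_DIR="${DOWNLOAD_CACHE_DIR:-$HOME/.samza/download}"

# download_cached URL
# Prints the path of the cached file, fetching it with curl only if it is
# not already in the cache.
download_cached() {
  local url="$1"
  local file
  file="$DOWNLOAD_CACHE_DIR/$(basename "$url")"
  mkdir -p "$DOWNLOAD_CACHE_DIR"
  if [ ! -f "$file" ]; then
    # Download to a temporary name so an interrupted transfer is never
    # mistaken for a complete file on the next run.
    curl -fsSL "$url" -o "$file.part" || return 1
    mv "$file.part" "$file"
  fi
  echo "$file"
}
```

With something like this, deleting the deploy directory only costs a re-extract, because the tarballs themselves live elsewhere.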

>
>What do you think? Please chime in. I'm happy to work on these things,
>just wanted to get a read on what people think first.
>
>Martin
>