Posted to user@spark.apache.org by shyla deshpande <de...@gmail.com> on 2017/04/07 00:04:46 UTC

What is the best way to run a scheduled spark batch job on AWS EC2 ?

I want to run a Spark batch job, maybe hourly, on AWS EC2. What is the
easiest way to do this? Thanks

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

Posted by shyla deshpande <de...@gmail.com>.
Thanks everyone for sharing your ideas. Very useful; I appreciate it.

On Fri, Apr 7, 2017 at 10:40 AM, Sam Elamin <hu...@gmail.com> wrote:

> Definitely agree with Gourav there. I wouldn't want Jenkins to run my
> workflow. Seems to me that you would only be using Jenkins for its
> scheduling capabilities.
>
> Yes you can run tests but you wouldn't want it to run your orchestration
> of jobs
>
> What happens if Jenkins goes down for any particular reason? How do you
> have the conversation with your stakeholders that your pipeline is not
> working and they don't have data because the build server is going through
> an upgrade?
>
> However, to be fair, I understand what you are saying, Steve: if someone is
> in a place where they only have access to Jenkins and have to go through
> hoops to set up or get access to new instances, then engineers will do what
> they always do: find ways to game the system to get their work done.
>
>
>
>
> On Fri, 7 Apr 2017 at 16:17, Gourav Sengupta <go...@gmail.com>
> wrote:
>
>> Hi Steve,
>>
>> Why would you ever do that? You are suggesting the use of a CI tool as a
>> workflow and orchestration engine.
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Fri, Apr 7, 2017 at 4:07 PM, Steve Loughran <st...@hortonworks.com>
>> wrote:
>>
>>> If you have Jenkins set up for some CI workflow, that can do scheduled
>>> builds and tests. Works well if you can do some build test before even
>>> submitting it to a remote cluster
>>>
>>> On 7 Apr 2017, at 10:15, Sam Elamin <hu...@gmail.com> wrote:
>>>
>>> Hi Shyla
>>>
>>> You have multiple options really some of which have been already listed
>>> but let me try and clarify
>>>
>>> Assuming you have a spark application in a jar you have a variety of
>>> options
>>>
>>> You have to have an existing spark cluster that is either running on EMR
>>> or somewhere else.
>>>
>>> *Super simple / hacky*
>>> Cron job on EC2 that calls a simple shell script that does a spark-submit
>>> to a Spark cluster, OR create or add a step to an EMR cluster
>>>
>>> *More Elegant*
>>> Airflow/Luigi/AWS Data Pipeline (Which is just CRON in the UI ) that
>>> will do the above step but have scheduling and potential backfilling and
>>> error handling(retries,alerts etc)
>>>
>>> AWS are coming out with Glue <https://aws.amazon.com/glue/> soon that
>>> does some Spark jobs, but I do not think it's available worldwide just yet
>>>
>>> Hope I cleared things up
>>>
>>> Regards
>>> Sam
>>>
>>>
>>> On Fri, Apr 7, 2017 at 6:05 AM, Gourav Sengupta <
>>> gourav.sengupta@gmail.com> wrote:
>>>
>>>> Hi Shyla,
>>>>
>>>> why would you want to schedule a spark job in EC2 instead of EMR?
>>>>
>>>> Regards,
>>>> Gourav
>>>>
>>>> On Fri, Apr 7, 2017 at 1:04 AM, shyla deshpande <
>>>> deshpandeshyla@gmail.com> wrote:
>>>>
>>>>> I want to run a spark batch job maybe hourly on AWS EC2 .  What is the
>>>>> easiest way to do this. Thanks
>>>>>
>>>>
>>>>
>>>
>>>
>>

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

Posted by Sumona Routh <su...@gmail.com>.
Hi Sam,
I would absolutely be interested in reading a blog write-up of how you are
doing this. We have pieced together a relatively decent pipeline ourselves,
in Jenkins, but have many kinks to work out. We also have some new
requirements to start running side-by-side comparisons of different
versions of the Spark job, so this introduces some additional complexity
for us, and opens up the opportunity to redesign this approach.

To answer the question of "what if Jenkins is down," we simply added our
Jenkins to our application monitoring (AppD currently, NewRelic in future) and
CloudWatch to monitor the instance. We manage our own Jenkins box, so there
are no hoops to do what we want in this respect.

Thanks!
Sumona

On Tue, Apr 11, 2017 at 7:20 AM Sam Elamin <hu...@gmail.com> wrote:

> Hi Steve
>
>
> Thanks for the detailed response, I think this problem doesn't have an
> industry standard solution as of yet and I am sure a lot of people would
> benefit from the discussion
>
> I realise now what you are saying so thanks for clarifying, that said let
> me try and explain how we approached the problem
>
> There are 2 problems you highlighted: the first is moving the code from
> SCM to prod, and the other is ensuring the data your code uses is correct.
> (using the latest data from prod)
>
>
> *"how do you get your code from SCM into production?"*
>
> We currently have our pipeline being run via airflow, we have our dags in
> S3, with regards to how we get our code from SCM to production
>
> 1) Jenkins build that builds our spark applications and runs tests
> 2) Once the first build is successful we trigger another build to copy the
> dags to an s3 folder
>
> We then routinely sync this folder to the local airflow dags folder every
> X amount of mins
>
> Re test data
> *" but what's your strategy for test data: that's always the troublespot."*
>
> Our application is using versioning against the data, so we expect the
> source data to be in a certain version and the output data to also be in a
> certain version
>
> We have a test resources folder that we have following the same convention
> of versioning - this is the data that our application tests use - to ensure
> that the data is in the correct format
>
> so for example if we have Table X with version 1 that depends on data from
> Table A and B also version 1, we run our spark application then ensure the
> transformed table X has the correct columns and row values
>
> Then when we have a new version 2 of the source data or adding a new
> column in Table X (version 2), we generate a new version of the data and
> ensure the tests are updated
>
> That way we ensure any new version of the data has tests against it
>
> *"I've never seen any good strategy there short of "throw it at a copy of
> the production dataset"."*
>
> I agree which is why we have a sample of the production data and version
> the schemas we expect the source and target data to look like.
>
> If people are interested I am happy writing a blog about it in the hopes
> this helps people build more reliable pipelines
>
> Kind Regards
> Sam
>
>
>
>
>
>
>
>
>
>
> On Tue, Apr 11, 2017 at 11:31 AM, Steve Loughran <st...@hortonworks.com>
> wrote:
>
>
> On 7 Apr 2017, at 18:40, Sam Elamin <hu...@gmail.com> wrote:
>
> Definitely agree with gourav there. I wouldn't want jenkins to run my work
> flow. Seems to me that you would only be using jenkins for its scheduling
> capabilities
>
>
> Maybe I was just looking at this differently
>
> Yes you can run tests but you wouldn't want it to run your orchestration
> of jobs
>
> What happens if Jenkins goes down for any particular reason? How do you
> have the conversation with your stakeholders that your pipeline is not
> working and they don't have data because the build server is going through
> an upgrade or going through an upgrade
>
>
>
> Well, I wouldn't use it as a replacement for Oozie, but I'd certainly
> consider as the pipeline for getting your code out to the cluster, so you
> don't have to explain why you just pushed out something broken
>
> As example, here's Renault's pipeline as discussed last week in Munich
> https://flic.kr/p/Tw3Emu
>
> However to be fair I understand what you are saying Steve if someone is in
> a place where you only have access to jenkins and have to go through hoops
> to setup:get access to new instances then engineers will do what they
> always do, find ways to game the system to get their work done
>
>
>
>
> This isn't about trying to "Game the system", this is about what makes a
> replicable workflow for getting code into production, either at the press
> of a button or as part of a scheduled "we push out an update every night,
> rerun the deployment tests and then switch over to the new installation"
> mech.
>
> Put differently: how do you get your code from SCM into production? Not
> just for CI, but what's your strategy for test data: that's always the
> troublespot. Random selection of rows may work, although it will skip the
> odd outlier (high-unicode char in what should be a LATIN-1 field, time set
> to 0, etc), and for work joining > 1 table, you need rows which join well.
> I've never seen any good strategy there short of "throw it at a copy of the
> production dataset".
>
>
> -Steve
>
>
>
>
>
>
> On Fri, 7 Apr 2017 at 16:17, Gourav Sengupta <go...@gmail.com>
> wrote:
>
> Hi Steve,
>
> Why would you ever do that? You are suggesting the use of a CI tool as a
> workflow and orchestration engine.
>
> Regards,
> Gourav Sengupta
>
> On Fri, Apr 7, 2017 at 4:07 PM, Steve Loughran <st...@hortonworks.com>
> wrote:
>
> If you have Jenkins set up for some CI workflow, that can do scheduled
> builds and tests. Works well if you can do some build test before even
> submitting it to a remote cluster
>
> On 7 Apr 2017, at 10:15, Sam Elamin <hu...@gmail.com> wrote:
>
> Hi Shyla
>
> You have multiple options really some of which have been already listed
> but let me try and clarify
>
> Assuming you have a spark application in a jar you have a variety of
> options
>
> You have to have an existing spark cluster that is either running on EMR
> or somewhere else.
>
> *Super simple / hacky*
> Cron job on EC2 that calls a simple shell script that does a spark-submit
> to a Spark Cluster OR create or add step to an EMR cluster
>
> *More Elegant*
> Airflow/Luigi/AWS Data Pipeline (Which is just CRON in the UI ) that will
> do the above step but have scheduling and potential backfilling and error
> handling(retries,alerts etc)
>
> AWS are coming out with glue <https://aws.amazon.com/glue/> soon that
> does some Spark jobs but I do not think its available worldwide just yet
>
> Hope I cleared things up
>
> Regards
> Sam
>
>
> On Fri, Apr 7, 2017 at 6:05 AM, Gourav Sengupta <gourav.sengupta@gmail.com
> > wrote:
>
> Hi Shyla,
>
> why would you want to schedule a spark job in EC2 instead of EMR?
>
> Regards,
> Gourav
>
> On Fri, Apr 7, 2017 at 1:04 AM, shyla deshpande <de...@gmail.com>
> wrote:
>
> I want to run a spark batch job maybe hourly on AWS EC2 .  What is the
> easiest way to do this. Thanks
>
>
>
>
>
>
>
>

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

Posted by "lucas.gary@gmail.com" <lu...@gmail.com>.
"Building data products is a very different discipline from that of
building software."

That is a fundamentally incorrect assumption.

There will always be a need for figuring out how to apply said principles,
but saying 'we're different' has always turned out to be incorrect and I
have seen no reason to think otherwise for data products.

At some point it always comes down to 'how do I get this to my customer, in
a reliable and repeatable fashion'.  The CI/CD patterns that we've come to
rely on are designed to optimize that process.

I have seen no evidence that 'data products' don't benefit from those
practices and I have definitely seen evidence that not following those
patterns has had substantial costs.

Of course there's always a balancing act in the early phases of discovery,
but at some point the needle swings from "Do I have a valuable product?"
to "How do I get this to customers?"

Gary Lucas

On 12 April 2017 at 10:46, Steve Loughran <st...@hortonworks.com> wrote:

>
> On 12 Apr 2017, at 17:25, Gourav Sengupta <go...@gmail.com>
> wrote:
>
> Hi,
>
> Your answer is like saying, I know how to code in assembly level language
> and I am going to build the next GUI in assembly level code and I think
> that there is a genuine functional requirement to see a color of a button
> in green on the screen.
>
>
> well, I reserve the right to have incomplete knowledge, and look forward
> to improving it.
>
> Perhaps it may be pertinent to read the first preface of a CI/ CD book and
> realize to what kind of software development disciplines is it applicable
> to.
>
>
> the original introduction on CI was probably Fowler's Cruise Control
> article,
> https://martinfowler.com/articles/originalContinuousIntegration.html
>
> "The key is to automate absolutely everything and run the process so often
> that integration errors are found quickly"
>
> Java Development with Ant, 2003, looks at Cruise Control, Anthill and
> Gump, again, with that focus on team coding and automated regression
> testing, both of unit tests, and, with things like HttpUnit, web UIs.
> There's no discussion of "Data" per-se, though databases are implicit.
>
> Apache Gump [Sam Ruby, 2001] was designed to address a single problem "get
> the entire ASF project portfolio to build and test against the latest build
> of everything else". Lots of finger pointing there, especially when
> something foundational like Ant or Xerces did bad.
>
> AFAIK, the earliest known in-print reference to Continuous Deployment is
> the HP Labs 2002 paper, *Making Web Services that Work*. That introduced
> the concept with a focus on automating deployment, staging testing and
> treating ops problems as use cases for which engineers could often write
> tests for, and, perhaps, even design their applications to support. "We are
> exploring extending this model to one we term Continuous Deployment —after
> passing the local test suite, a service can be automatically deployed to a
> public staging server for stress and acceptance testing by physically
> remote calling parties"
>
> At this time, the applications weren't modern "big data" apps as they
> didn't have affordable storage or the tools to schedule work over it. It
> wasn't that the people writing the books and papers looked at big data and
> said "not for us", it just wasn't on their horizons. 1TB was a lot of
> storage in those days, not a high-end SSD.
>
> Otherwise your approach is just another line of defense in saving your job
> by applying an impertinent, incorrect, and outdated skill and tool to a
> problem.
>
>
> please be a bit more constructive here; the ASF code of conduct encourages
> empathy and collaboration. https://www.apache.org/foundation/
> policies/conduct . Thanks.
>
>
> Building data products is a very different discipline from that of
> building software.
>
>
> Which is why we need to consider how to take what are core methodologies
> for software and apply them, and, where appropriate, supersede them with
> new workflows, ideas, technologies. But doing so with an understanding of
> the reasoning behind today's tools and workflows. I'm really interested in
> how do we get from experimental notebook code to something usable in
> production, pushing it out, finding the dirty-data-problems before it goes
> live, etc, etc. I do think today's tools have been outgrown by the
> applications we now build, and am thinking not so much "which tools to
> use', but one step further, "what are the new tools and techniques to
> use?".
>
> I look forward to whatever insight people have here.
>
>
> My genuine advice to everyone in all spheres of activities will be to
> first understand the problem to solve before solving it and definitely
> before selecting the tools to solve it, otherwise you will land up with a
> bowl of soup and fork in hand and argue that CI/ CD is still applicable to
> building data products and data warehousing.
>
>
> I concur
>
> Regards,
> Gourav
>
>
> -Steve
>
> On Wed, Apr 12, 2017 at 12:42 PM, Steve Loughran <st...@hortonworks.com>
> wrote:
>
>>
>> On 11 Apr 2017, at 20:46, Gourav Sengupta <go...@gmail.com>
>> wrote:
>>
>> And once again JAVA programmers are trying to solve a data analytics and
>> data warehousing problem using programming paradigms. It genuinely a pain
>> to see this happen.
>>
>>
>>
>> While I'm happy to be faulted for treating things as software processes,
>> having a full automated mechanism for testing the latest code before
>> production is something I'd consider foundational today. This is what
>> "Contiunous Deployment" was about when it was first conceived. Does it mean
>> you should blindly deploy that way? well, not if you worry about security,
>> but having that review process and then a final manual "deploy" button can
>> address that.
>>
>> Cloud infras let you integrate cluster instantiation to the process;
>> which helps you automate things like "stage the deployment in some new VMs,
>> run acceptance tests (*), then switch the load balancer over to the new
>> cluster, being ready to switch back if you need. I've not tried that with
>> streaming apps though; I don't know how to do it there. Boot the new
>> cluster off checkpointed state requires deserialization to work, which
>> can't be guaranteed if you are changing the objects which get serialized.
>>
>> I'd argue then, it's not a problem which has already been solved by data
>> analytics/warehousing —though if you've got pointers there, I'd be
>> grateful. Always good to see work by others. Indeed, the telecoms industry
>> have led the way in testing and HA deployment: if you look at Erlang you
>> can see a system designed with hot upgrades in mind, the way java code "add
>> a JAR to a web server" never was.
>>
>> -Steve
>>
>>
>> (*) do always make sure this is the test cluster with a snapshot of test
>> data, not production machines/data. There are always horror stories there.
>>
>>
>> Regards,
>> Gourav
>>
>> On Tue, Apr 11, 2017 at 2:20 PM, Sam Elamin <hu...@gmail.com>
>> wrote:
>>
>>> Hi Steve
>>>
>>>
>>> Thanks for the detailed response, I think this problem doesn't have an
>>> industry standard solution as of yet and I am sure a lot of people would
>>> benefit from the discussion
>>>
>>> I realise now what you are saying so thanks for clarifying, that said
>>> let me try and explain how we approached the problem
>>>
>>> There are 2 problems you highlighted: the first is moving the code from
>>> SCM to prod, and the other is ensuring the data your code uses is correct.
>>> (using the latest data from prod)
>>>
>>>
>>> *"how do you get your code from SCM into production?"*
>>>
>>> We currently have our pipeline being run via airflow, we have our dags
>>> in S3, with regards to how we get our code from SCM to production
>>>
>>> 1) Jenkins build that builds our spark applications and runs tests
>>> 2) Once the first build is successful we trigger another build to copy
>>> the dags to an s3 folder
>>>
>>> We then routinely sync this folder to the local airflow dags folder
>>> every X amount of mins
>>>
>>> Re test data
>>> *" but what's your strategy for test data: that's always the
>>> troublespot."*
>>>
>>> Our application is using versioning against the data, so we expect the
>>> source data to be in a certain version and the output data to also be in a
>>> certain version
>>>
>>> We have a test resources folder that we have following the same
>>> convention of versioning - this is the data that our application tests use
>>> - to ensure that the data is in the correct format
>>>
>>> so for example if we have Table X with version 1 that depends on data
>>> from Table A and B also version 1, we run our spark application then ensure
>>> the transformed table X has the correct columns and row values
>>>
>>> Then when we have a new version 2 of the source data or adding a new
>>> column in Table X (version 2), we generate a new version of the data and
>>> ensure the tests are updated
>>>
>>> That way we ensure any new version of the data has tests against it
>>>
>>> *"I've never seen any good strategy there short of "throw it at a copy
>>> of the production dataset"."*
>>>
>>> I agree which is why we have a sample of the production data and version
>>> the schemas we expect the source and target data to look like.
>>>
>>> If people are interested I am happy writing a blog about it in the hopes
>>> this helps people build more reliable pipelines
>>>
>>>
>> Love to see that.
>>
>> Kind Regards
>>> Sam
>>>
>>
>>
>
>

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

Posted by Steve Loughran <st...@hortonworks.com>.
On 12 Apr 2017, at 17:25, Gourav Sengupta <go...@gmail.com>> wrote:

Hi,

Your answer is like saying, I know how to code in assembly level language and I am going to build the next GUI in assembly level code and I think that there is a genuine functional requirement to see a color of a button in green on the screen.


well, I reserve the right to have incomplete knowledge, and look forward to improving it.

Perhaps it may be pertinent to read the first preface of a CI/ CD book and realize to what kind of software development disciplines is it applicable to.

the original introduction on CI was probably Fowler's Cruise Control article,
https://martinfowler.com/articles/originalContinuousIntegration.html

"The key is to automate absolutely everything and run the process so often that integration errors are found quickly"

Java Development with Ant, 2003, looks at Cruise Control, Anthill and Gump, again, with that focus on team coding and automated regression testing, both of unit tests, and, with things like HttpUnit, web UIs. There's no discussion of "Data" per-se, though databases are implicit.

Apache Gump [Sam Ruby, 2001] was designed to address a single problem "get the entire ASF project portfolio to build and test against the latest build of everything else". Lots of finger pointing there, especially when something foundational like Ant or Xerces did bad.

AFAIK, the earliest known in-print reference to Continuous Deployment is the HP Labs 2002 paper, Making Web Services that Work. That introduced the concept with a focus on automating deployment, staging testing and treating ops problems as use cases for which engineers could often write tests, and, perhaps, even design their applications to support. "We are exploring extending this model to one we term Continuous Deployment —after passing the local test suite, a service can be automatically deployed to a public staging server for stress and acceptance testing by physically remote calling parties"

At this time, the applications weren't modern "big data" apps as they didn't have affordable storage or the tools to schedule work over it. It wasn't that the people writing the books and papers looked at big data and said "not for us", it just wasn't on their horizons. 1TB was a lot of storage in those days, not a high-end SSD.

Otherwise your approach is just another line of defense in saving your job by applying an impertinent, incorrect, and outdated skill and tool to a problem.


please be a bit more constructive here; the ASF code of conduct encourages empathy and collaboration. https://www.apache.org/foundation/policies/conduct . Thanks.


Building data products is a very different discipline from that of building software.


Which is why we need to consider how to take what are core methodologies for software and apply them, and, where appropriate, supersede them with new workflows, ideas, technologies. But doing so with an understanding of the reasoning behind today's tools and workflows. I'm really interested in how we get from experimental notebook code to something usable in production, pushing it out, finding the dirty-data problems before it goes live, etc, etc. I do think today's tools have been outgrown by the applications we now build, and am thinking not so much "which tools to use", but one step further, "what are the new tools and techniques to use?".

I look forward to whatever insight people have here.


My genuine advice to everyone in all spheres of activities will be to first understand the problem to solve before solving it and definitely before selecting the tools to solve it, otherwise you will land up with a bowl of soup and fork in hand and argue that CI/ CD is still applicable to building data products and data warehousing.


I concur

Regards,
Gourav


-Steve

On Wed, Apr 12, 2017 at 12:42 PM, Steve Loughran <st...@hortonworks.com>> wrote:

On 11 Apr 2017, at 20:46, Gourav Sengupta <go...@gmail.com>> wrote:

And once again JAVA programmers are trying to solve a data analytics and data warehousing problem using programming paradigms. It is genuinely a pain to see this happen.



While I'm happy to be faulted for treating things as software processes, having a fully automated mechanism for testing the latest code before production is something I'd consider foundational today. This is what "Continuous Deployment" was about when it was first conceived. Does it mean you should blindly deploy that way? Well, not if you worry about security, but having that review process and then a final manual "deploy" button can address that.

Cloud infras let you integrate cluster instantiation into the process, which helps you automate things like "stage the deployment in some new VMs, run acceptance tests (*), then switch the load balancer over to the new cluster, being ready to switch back if you need". I've not tried that with streaming apps though; I don't know how to do it there. Booting the new cluster off checkpointed state requires deserialization to work, which can't be guaranteed if you are changing the objects which get serialized.

I'd argue then, it's not a problem which has already been solved by data analytics/warehousing —though if you've got pointers there, I'd be grateful. Always good to see work by others. Indeed, the telecoms industry have led the way in testing and HA deployment: if you look at Erlang you can see a system designed with hot upgrades in mind, the way Java code "add a JAR to a web server" never was.

-Steve


(*) do always make sure this is the test cluster with a snapshot of test data, not production machines/data. There are always horror stories there.


Regards,
Gourav

On Tue, Apr 11, 2017 at 2:20 PM, Sam Elamin <hu...@gmail.com>> wrote:
Hi Steve


Thanks for the detailed response, I think this problem doesn't have an industry standard solution as of yet and I am sure a lot of people would benefit from the discussion

I realise now what you are saying so thanks for clarifying, that said let me try and explain how we approached the problem

There are 2 problems you highlighted: the first is moving the code from SCM to prod, and the other is ensuring the data your code uses is correct (using the latest data from prod).


"how do you get your code from SCM into production?"

We currently have our pipeline being run via airflow, we have our dags in S3, with regards to how we get our code from SCM to production

1) Jenkins build that builds our spark applications and runs tests
2) Once the first build is successful we trigger another build to copy the dags to an s3 folder

We then routinely sync this folder to the local airflow dags folder every X amount of mins

Re test data
" but what's your strategy for test data: that's always the troublespot."

Our application is using versioning against the data, so we expect the source data to be in a certain version and the output data to also be in a certain version

We have a test resources folder that we have following the same convention of versioning - this is the data that our application tests use - to ensure that the data is in the correct format

so for example if we have Table X with version 1 that depends on data from Table A and B also version 1, we run our spark application then ensure the transformed table X has the correct columns and row values

Then when we have a new version 2 of the source data or adding a new column in Table X (version 2), we generate a new version of the data and ensure the tests are updated

That way we ensure any new version of the data has tests against it

"I've never seen any good strategy there short of "throw it at a copy of the production dataset"."

I agree which is why we have a sample of the production data and version the schemas we expect the source and target data to look like.

If people are interested I am happy writing a blog about it in the hopes this helps people build more reliable pipelines


Love to see that.

Kind Regards
Sam




Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,

Your answer is like saying: I know how to code in assembly-level language,
I am going to build the next GUI in assembly-level code, and I think
that there is a genuine functional requirement to see the color of a button
in green on the screen.

Perhaps it may be pertinent to read the first preface of a CI/CD book and
realize what kind of software development disciplines it is applicable
to. Otherwise your approach is just another line of defense in saving your
job by applying an impertinent, incorrect, and outdated skill and tool to a
problem.

Building data products is a very different discipline from that of building
software.

My genuine advice to everyone in all spheres of activity is to first
understand the problem to solve before solving it, and definitely before
selecting the tools to solve it; otherwise you will end up with a bowl of
soup and a fork in hand and argue that CI/CD is still applicable to building
data products and data warehousing.

Regards,
Gourav

On Wed, Apr 12, 2017 at 12:42 PM, Steve Loughran <st...@hortonworks.com>
wrote:

>
> On 11 Apr 2017, at 20:46, Gourav Sengupta <go...@gmail.com>
> wrote:
>
> And once again JAVA programmers are trying to solve a data analytics and
> data warehousing problem using programming paradigms. It genuinely a pain
> to see this happen.
>
>
>
> While I'm happy to be faulted for treating things as software processes,
> having a full automated mechanism for testing the latest code before
> production is something I'd consider foundational today. This is what
> "Contiunous Deployment" was about when it was first conceived. Does it mean
> you should blindly deploy that way? well, not if you worry about security,
> but having that review process and then a final manual "deploy" button can
> address that.
>
> Cloud infras let you integrate cluster instantiation to the process; which
> helps you automate things like "stage the deployment in some new VMs, run
> acceptance tests (*), then switch the load balancer over to the new
> cluster, being ready to switch back if you need. I've not tried that with
> streaming apps though; I don't know how to do it there. Boot the new
> cluster off checkpointed state requires deserialization to work, which
> can't be guaranteed if you are changing the objects which get serialized.
>
> I'd argue then, it's not a problem which has already been solved by data
> analytics/warehousing —though if you've got pointers there, I'd be
> grateful. Always good to see work by others. Indeed, the telecoms industry
> have led the way in testing and HA deployment: if you look at Erlang you
> can see a system designed with hot upgrades in mind, the way java code "add
> a JAR to a web server" never was.
>
> -Steve
>
>
> (*) do always make sure this is the test cluster with a snapshot of test
> data, not production machines/data. There are always horror stories there.
>
>
> Regards,
> Gourav
>
> On Tue, Apr 11, 2017 at 2:20 PM, Sam Elamin <hu...@gmail.com>
> wrote:
>
>> Hi Steve
>>
>>
>> Thanks for the detailed response, I think this problem doesn't have an
>> industry standard solution as of yet and I am sure a lot of people would
>> benefit from the discussion
>>
>> I realise now what you are saying so thanks for clarifying, that said let
>> me try and explain how we approached the problem
>>
>> There are 2 problems you highlighted: the first is moving the code from
>> SCM to prod, and the other is ensuring the data your code uses is correct.
>> (using the latest data from prod)
>>
>>
>> *"how do you get your code from SCM into production?"*
>>
>> We currently have our pipeline being run via airflow, we have our dags in
>> S3, with regards to how we get our code from SCM to production
>>
>> 1) Jenkins build that builds our spark applications and runs tests
>> 2) Once the first build is successful we trigger another build to copy
>> the dags to an s3 folder
>>
>> We then routinely sync this folder to the local airflow dags folder every
>> X amount of mins
>>
>> Re test data
>> *" but what's your strategy for test data: that's always the
>> troublespot."*
>>
>> Our application is using versioning against the data, so we expect the
>> source data to be in a certain version and the output data to also be in a
>> certain version
>>
>> We have a test resources folder that we have following the same
>> convention of versioning - this is the data that our application tests use
>> - to ensure that the data is in the correct format
>>
>> so for example if we have Table X with version 1 that depends on data
>> from Table A and B also version 1, we run our spark application then ensure
>> the transformed table X has the correct columns and row values
>>
>> Then when we have a new version 2 of the source data or adding a new
>> column in Table X (version 2), we generate a new version of the data and
>> ensure the tests are updated
>>
>> That way we ensure any new version of the data has tests against it
>>
>> *"I've never seen any good strategy there short of "throw it at a copy of
>> the production dataset"."*
>>
>> I agree which is why we have a sample of the production data and version
>> the schemas we expect the source and target data to look like.
>>
>> If people are interested I am happy writing a blog about it in the hopes
>> this helps people build more reliable pipelines
>>
>>
> Love to see that.
>
> Kind Regards
>> Sam
>>
>
>

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

Posted by Steve Loughran <st...@hortonworks.com>.
On 11 Apr 2017, at 20:46, Gourav Sengupta <go...@gmail.com>> wrote:

And once again JAVA programmers are trying to solve a data analytics and data warehousing problem using programming paradigms. It is genuinely a pain to see this happen.



While I'm happy to be faulted for treating things as software processes, having a fully automated mechanism for testing the latest code before production is something I'd consider foundational today. This is what "Continuous Deployment" was about when it was first conceived. Does it mean you should blindly deploy that way? Well, not if you worry about security, but having that review process and then a final manual "deploy" button can address that.

Cloud infras let you integrate cluster instantiation into the process, which helps you automate things like "stage the deployment in some new VMs, run acceptance tests (*), then switch the load balancer over to the new cluster, being ready to switch back if you need". I've not tried that with streaming apps though; I don't know how to do it there. Booting the new cluster off checkpointed state requires deserialization to work, which can't be guaranteed if you are changing the objects which get serialized.

I'd argue then, it's not a problem which has already been solved by data analytics/warehousing —though if you've got pointers there, I'd be grateful. Always good to see work by others. Indeed, the telecoms industry have led the way in testing and HA deployment: if you look at Erlang you can see a system designed with hot upgrades in mind, the way Java code "add a JAR to a web server" never was.

-Steve


(*) do always make sure this is the test cluster with a snapshot of test data, not production machines/data. There are always horror stories there.


Regards,
Gourav

On Tue, Apr 11, 2017 at 2:20 PM, Sam Elamin <hu...@gmail.com>> wrote:
Hi Steve


Thanks for the detailed response, I think this problem doesn't have an industry standard solution as of yet and I am sure a lot of people would benefit from the discussion

I realise now what you are saying so thanks for clarifying, that said let me try and explain how we approached the problem

There are 2 problems you highlighted: the first is moving the code from SCM to prod, and the other is ensuring the data your code uses is correct (using the latest data from prod).


"how do you get your code from SCM into production?"

We currently have our pipeline being run via airflow, we have our dags in S3, with regards to how we get our code from SCM to production

1) Jenkins build that builds our spark applications and runs tests
2) Once the first build is successful we trigger another build to copy the dags to an s3 folder

We then routinely sync this folder to the local airflow dags folder every X amount of mins

Re test data
" but what's your strategy for test data: that's always the troublespot."

Our application is using versioning against the data, so we expect the source data to be in a certain version and the output data to also be in a certain version

We have a test resources folder that we have following the same convention of versioning - this is the data that our application tests use - to ensure that the data is in the correct format

so for example if we have Table X with version 1 that depends on data from Table A and B also version 1, we run our spark application then ensure the transformed table X has the correct columns and row values

Then when we have a new version 2 of the source data or adding a new column in Table X (version 2), we generate a new version of the data and ensure the tests are updated

That way we ensure any new version of the data has tests against it

"I've never seen any good strategy there short of "throw it at a copy of the production dataset"."

I agree which is why we have a sample of the production data and version the schemas we expect the source and target data to look like.

If people are interested I am happy writing a blog about it in the hopes this helps people build more reliable pipelines


Love to see that.

Kind Regards
Sam


Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

Posted by Gourav Sengupta <go...@gmail.com>.
And once again JAVA programmers are trying to solve a data analytics and
data warehousing problem using programming paradigms. It is genuinely a pain
to see this happen.


Regards,
Gourav

On Tue, Apr 11, 2017 at 2:20 PM, Sam Elamin <hu...@gmail.com> wrote:

> Hi Steve
>
>
> Thanks for the detailed response, I think this problem doesn't have an
> industry standard solution as of yet and I am sure a lot of people would
> benefit from the discussion
>
> I realise now what you are saying so thanks for clarifying, that said let
> me try and explain how we approached the problem
>
> There are 2 problems you highlighted: the first is moving the code from
> SCM to prod, and the other is ensuring the data your code uses is correct.
> (using the latest data from prod)
>
>
> *"how do you get your code from SCM into production?"*
>
> We currently have our pipeline being run via airflow, we have our dags in
> S3, with regards to how we get our code from SCM to production
>
> 1) Jenkins build that builds our spark applications and runs tests
> 2) Once the first build is successful we trigger another build to copy the
> dags to an s3 folder
>
> We then routinely sync this folder to the local airflow dags folder every
> X amount of mins
>
> Re test data
> *" but what's your strategy for test data: that's always the troublespot."*
>
> Our application is using versioning against the data, so we expect the
> source data to be in a certain version and the output data to also be in a
> certain version
>
> We have a test resources folder that we have following the same convention
> of versioning - this is the data that our application tests use - to ensure
> that the data is in the correct format
>
> so for example if we have Table X with version 1 that depends on data from
> Table A and B also version 1, we run our spark application then ensure the
> transformed table X has the correct columns and row values
>
> Then when we have a new version 2 of the source data or adding a new
> column in Table X (version 2), we generate a new version of the data and
> ensure the tests are updated
>
> That way we ensure any new version of the data has tests against it
>
> *"I've never seen any good strategy there short of "throw it at a copy of
> the production dataset"."*
>
> I agree which is why we have a sample of the production data and version
> the schemas we expect the source and target data to look like.
>
> If people are interested I am happy writing a blog about it in the hopes
> this helps people build more reliable pipelines
>
> Kind Regards
> Sam
>
>
>
>
>
>
>
>
>
>
> On Tue, Apr 11, 2017 at 11:31 AM, Steve Loughran <st...@hortonworks.com>
> wrote:
>
>>
>> On 7 Apr 2017, at 18:40, Sam Elamin <hu...@gmail.com> wrote:
>>
>> Definitely agree with gourav there. I wouldn't want jenkins to run my
>> work flow. Seems to me that you would only be using jenkins for its
>> scheduling capabilities
>>
>>
>> Maybe I was just looking at this differently
>>
>> Yes you can run tests but you wouldn't want it to run your orchestration
>> of jobs
>>
>> What happens if Jenkins goes down for any particular reason? How do you
>> have the conversation with your stakeholders that your pipeline is not
>> working and they don't have data because the build server is going through
>> an upgrade or going through an upgrade
>>
>>
>>
>> Well, I wouldn't use it as a replacement for Oozie, but I'd certainly
>> consider as the pipeline for getting your code out to the cluster, so you
>> don't have to explain why you just pushed out something broken
>>
>> As example, here's Renault's pipeline as discussed last week in Munich
>> https://flic.kr/p/Tw3Emu
>>
>> However to be fair I understand what you are saying Steve if someone is
>> in a place where you only have access to jenkins and have to go through
>> hoops to setup:get access to new instances then engineers will do what they
>> always do, find ways to game the system to get their work done
>>
>>
>>
>>
>> This isn't about trying to "Game the system", this is about what makes a
>> replicable workflow for getting code into production, either at the press
>> of a button or as part of a scheduled "we push out an update every night,
>> rerun the deployment tests and then switch over to the new installation"
>> mech.
>>
>> Put differently: how do you get your code from SCM into production? Not
>> just for CI, but what's your strategy for test data: that's always the
>> troublespot. Random selection of rows may work, although it will skip the
>> odd outlier (high-unicode char in what should be a LATIN-1 field, time set
>> to 0, etc), and for work joining > 1 table, you need rows which join well.
>> I've never seen any good strategy there short of "throw it at a copy of the
>> production dataset".
>>
>>
>> -Steve
>>
>>
>>
>>
>>
>>
>> On Fri, 7 Apr 2017 at 16:17, Gourav Sengupta <go...@gmail.com>
>> wrote:
>>
>>> Hi Steve,
>>>
>>> Why would you ever do that? You are suggesting the use of a CI tool as a
>>> workflow and orchestration engine.
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>> On Fri, Apr 7, 2017 at 4:07 PM, Steve Loughran <st...@hortonworks.com>
>>> wrote:
>>>
>>>> If you have Jenkins set up for some CI workflow, that can do scheduled
>>>> builds and tests. Works well if you can do some build test before even
>>>> submitting it to a remote cluster
>>>>
>>>> On 7 Apr 2017, at 10:15, Sam Elamin <hu...@gmail.com> wrote:
>>>>
>>>> Hi Shyla
>>>>
>>>> You have multiple options really some of which have been already listed
>>>> but let me try and clarify
>>>>
>>>> Assuming you have a spark application in a jar you have a variety of
>>>> options
>>>>
>>>> You have to have an existing spark cluster that is either running on
>>>> EMR or somewhere else.
>>>>
>>>> *Super simple / hacky*
>>>> Cron job on EC2 that calls a simple shell script that does a spark-submit
>>>> to a Spark Cluster OR create or add a step to an EMR cluster
>>>>
>>>> *More Elegant*
>>>> Airflow/Luigi/AWS Data Pipeline (Which is just CRON in the UI ) that
>>>> will do the above step but have scheduling and potential backfilling and
>>>> error handling(retries,alerts etc)
>>>>
>>>> AWS are coming out with glue <https://aws.amazon.com/glue/> soon that
>>>> does some Spark jobs but I do not think its available worldwide just yet
>>>>
>>>> Hope I cleared things up
>>>>
>>>> Regards
>>>> Sam
>>>>
>>>>
>>>> On Fri, Apr 7, 2017 at 6:05 AM, Gourav Sengupta <
>>>> gourav.sengupta@gmail.com> wrote:
>>>>
>>>>> Hi Shyla,
>>>>>
>>>>> why would you want to schedule a spark job in EC2 instead of EMR?
>>>>>
>>>>> Regards,
>>>>> Gourav
>>>>>
>>>>> On Fri, Apr 7, 2017 at 1:04 AM, shyla deshpande <
>>>>> deshpandeshyla@gmail.com> wrote:
>>>>>
>>>>>> I want to run a spark batch job maybe hourly on AWS EC2 .  What is
>>>>>> the easiest way to do this. Thanks
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

Posted by Sam Elamin <hu...@gmail.com>.
Hi Steve


Thanks for the detailed response, I think this problem doesn't have an
industry standard solution as of yet and I am sure a lot of people would
benefit from the discussion

I realise now what you are saying so thanks for clarifying, that said let
me try and explain how we approached the problem

There are 2 problems you highlighted: the first is moving the code from SCM
to prod, and the other is ensuring the data your code uses is correct
(using the latest data from prod).


*"how do you get your code from SCM into production?"*

We currently have our pipeline being run via Airflow, and we have our DAGs in
S3. With regards to how we get our code from SCM to production:

1) Jenkins build that builds our spark applications and runs tests
2) Once the first build is successful we trigger another build to copy the
dags to an s3 folder

We then routinely sync this folder to the local Airflow DAGs folder every X
minutes.
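
As an illustration only (the bucket and paths below are made-up placeholders,
not the real ones), that sync can be as small as a script run from cron on the
Airflow box:

    # Hypothetical DAG-sync script: pull the latest DAG definitions from S3 into
    # the local Airflow dags folder, run every few minutes from a cron entry.
    import subprocess

    DAGS_BUCKET = "s3://my-company-airflow/dags/"   # placeholder bucket/prefix
    LOCAL_DAGS = "/home/airflow/airflow/dags/"      # placeholder dags folder

    # "aws s3 sync" copies only changed files and removes deleted ones, so the
    # scheduler picks up new or updated DAGs on its next parse of the folder.
    subprocess.check_call(
        ["aws", "s3", "sync", DAGS_BUCKET, LOCAL_DAGS, "--delete"]
    )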

Re test data
*" but what's your strategy for test data: that's always the troublespot."*

Our application is using versioning against the data, so we expect the
source data to be in a certain version and the output data to also be in a
certain version

We have a test resources folder that follows the same convention of
versioning - this is the data that our application tests use - to ensure
that the data is in the correct format.

So for example, if we have Table X at version 1 that depends on data from
Tables A and B, also at version 1, we run our Spark application and then
ensure the transformed Table X has the correct columns and row values.

Then, when we have a new version 2 of the source data, or we add a new column
to Table X (version 2), we generate a new version of the data and ensure
the tests are updated.

That way we ensure any new version of the data has tests against it
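
A sketch of what such a test could look like with PySpark and pytest; the
versioned resource layout, column names and the build_table_x function are
hypothetical stand-ins, not the actual application code:

    # Hypothetical schema/contract test against versioned test resources.
    import pytest
    from pyspark.sql import SparkSession

    @pytest.fixture(scope="session")
    def spark():
        return (SparkSession.builder
                .master("local[2]")
                .appName("table-x-v1-test")
                .getOrCreate())

    def build_table_x(table_a, table_b):
        # Placeholder join standing in for the real Spark application logic.
        return table_a.join(table_b, on="customer_id", how="inner")

    def test_table_x_v1_schema(spark):
        # Versioned inputs: Table A and Table B, both at version 1.
        table_a = spark.read.parquet("src/test/resources/table_a/v1")
        table_b = spark.read.parquet("src/test/resources/table_b/v1")

        table_x = build_table_x(table_a, table_b)

        # The contract for Table X version 1: the expected set of columns
        # and a value check on a known row.
        assert set(table_x.columns) == {"customer_id", "order_total", "updated_at"}
        assert table_x.filter(table_x.customer_id == "c-001").count() == 1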

*"I've never seen any good strategy there short of "throw it at a copy of
the production dataset"."*

I agree which is why we have a sample of the production data and version
the schemas we expect the source and target data to look like.

If people are interested, I am happy to write a blog about it in the hope
that it helps people build more reliable pipelines.

Kind Regards
Sam










On Tue, Apr 11, 2017 at 11:31 AM, Steve Loughran <st...@hortonworks.com>
wrote:

>
> On 7 Apr 2017, at 18:40, Sam Elamin <hu...@gmail.com> wrote:
>
> Definitely agree with gourav there. I wouldn't want jenkins to run my work
> flow. Seems to me that you would only be using jenkins for its scheduling
> capabilities
>
>
> Maybe I was just looking at this differently
>
> Yes you can run tests but you wouldn't want it to run your orchestration
> of jobs
>
> What happens if Jenkins goes down for any particular reason? How do you
> have the conversation with your stakeholders that your pipeline is not
> working and they don't have data because the build server is going through
> an upgrade or going through an upgrade
>
>
>
> Well, I wouldn't use it as a replacement for Oozie, but I'd certainly
> consider as the pipeline for getting your code out to the cluster, so you
> don't have to explain why you just pushed out something broken
>
> As example, here's Renault's pipeline as discussed last week in Munich
> https://flic.kr/p/Tw3Emu
>
> However to be fair I understand what you are saying Steve if someone is in
> a place where you only have access to jenkins and have to go through hoops
> to setup:get access to new instances then engineers will do what they
> always do, find ways to game the system to get their work done
>
>
>
>
> This isn't about trying to "Game the system", this is about what makes a
> replicable workflow for getting code into production, either at the press
> of a button or as part of a scheduled "we push out an update every night,
> rerun the deployment tests and then switch over to the new installation"
> mech.
>
> Put differently: how do you get your code from SCM into production? Not
> just for CI, but what's your strategy for test data: that's always the
> troublespot. Random selection of rows may work, although it will skip the
> odd outlier (high-unicode char in what should be a LATIN-1 field, time set
> to 0, etc), and for work joining > 1 table, you need rows which join well.
> I've never seen any good strategy there short of "throw it at a copy of the
> production dataset".
>
>
> -Steve
>
>
>
>
>
>
> On Fri, 7 Apr 2017 at 16:17, Gourav Sengupta <go...@gmail.com>
> wrote:
>
>> Hi Steve,
>>
>> Why would you ever do that? You are suggesting the use of a CI tool as a
>> workflow and orchestration engine.
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Fri, Apr 7, 2017 at 4:07 PM, Steve Loughran <st...@hortonworks.com>
>> wrote:
>>
>>> If you have Jenkins set up for some CI workflow, that can do scheduled
>>> builds and tests. Works well if you can do some build test before even
>>> submitting it to a remote cluster
>>>
>>> On 7 Apr 2017, at 10:15, Sam Elamin <hu...@gmail.com> wrote:
>>>
>>> Hi Shyla
>>>
>>> You have multiple options really some of which have been already listed
>>> but let me try and clarify
>>>
>>> Assuming you have a spark application in a jar you have a variety of
>>> options
>>>
>>> You have to have an existing spark cluster that is either running on EMR
>>> or somewhere else.
>>>
>>> *Super simple / hacky*
>>> Cron job on EC2 that calls a simple shell script that does a spark-submit
>>> to a Spark Cluster OR create or add a step to an EMR cluster
>>>
>>> *More Elegant*
>>> Airflow/Luigi/AWS Data Pipeline (Which is just CRON in the UI ) that
>>> will do the above step but have scheduling and potential backfilling and
>>> error handling(retries,alerts etc)
>>>
>>> AWS are coming out with glue <https://aws.amazon.com/glue/> soon that
>>> does some Spark jobs but I do not think its available worldwide just yet
>>>
>>> Hope I cleared things up
>>>
>>> Regards
>>> Sam
>>>
>>>
>>> On Fri, Apr 7, 2017 at 6:05 AM, Gourav Sengupta <
>>> gourav.sengupta@gmail.com> wrote:
>>>
>>>> Hi Shyla,
>>>>
>>>> why would you want to schedule a spark job in EC2 instead of EMR?
>>>>
>>>> Regards,
>>>> Gourav
>>>>
>>>> On Fri, Apr 7, 2017 at 1:04 AM, shyla deshpande <
>>>> deshpandeshyla@gmail.com> wrote:
>>>>
>>>>> I want to run a spark batch job maybe hourly on AWS EC2 .  What is the
>>>>> easiest way to do this. Thanks
>>>>>
>>>>
>>>>
>>>
>>>
>>
>

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

Posted by Steve Loughran <st...@hortonworks.com>.
On 7 Apr 2017, at 18:40, Sam Elamin <hu...@gmail.com>> wrote:

Definitely agree with gourav there. I wouldn't want jenkins to run my work flow. Seems to me that you would only be using jenkins for its scheduling capabilities


Maybe I was just looking at this differently

Yes you can run tests but you wouldn't want it to run your orchestration of jobs

What happens if Jenkins goes down for any particular reason? How do you have the conversation with your stakeholders that your pipeline is not working and they don't have data because the build server is going through an upgrade?



Well, I wouldn't use it as a replacement for Oozie, but I'd certainly consider it as the pipeline for getting your code out to the cluster, so you don't have to explain why you just pushed out something broken.

As example, here's Renault's pipeline as discussed last week in Munich https://flic.kr/p/Tw3Emu

However, to be fair, I understand what you are saying, Steve: if someone is in a place where they only have access to Jenkins and have to go through hoops to set up or get access to new instances, then engineers will do what they always do: find ways to game the system to get their work done.



This isn't about trying to "Game the system", this is about what makes a replicable workflow for getting code into production, either at the press of a button or as part of a scheduled "we push out an update every night, rerun the deployment tests and then switch over to the new installation" mech.

Put differently: how do you get your code from SCM into production? Not just for CI, but what's your strategy for test data: that's always the troublespot. Random selection of rows may work, although it will skip the odd outlier (high-unicode char in what should be a LATIN-1 field, time set to 0, etc), and for work joining > 1 table, you need rows which join well. I've never seen any good strategy there short of "throw it at a copy of the production dataset".
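
As one illustration of the join problem: a sample can be made to join by
sampling the join keys first and then filtering each table down to those keys.
A rough PySpark sketch with made-up table and key names; it keeps joins intact,
though it still misses the odd outlier:

    # Hypothetical "sample that still joins": sample the join keys once, then
    # filter every table down to those keys so sampled rows still match up.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("joinable-sample").getOrCreate()

    orders = spark.read.parquet("s3://prod-bucket/orders")        # placeholder paths
    customers = spark.read.parquet("s3://prod-bucket/customers")

    # Take ~1% of the distinct join keys rather than 1% of each table's rows.
    sampled_keys = orders.select("customer_id").distinct().sample(False, 0.01)

    # Left-semi joins keep only the rows whose key survived the sample.
    orders_sample = orders.join(sampled_keys, "customer_id", "left_semi")
    customers_sample = customers.join(sampled_keys, "customer_id", "left_semi")

    orders_sample.write.parquet("s3://test-bucket/orders_sample")
    customers_sample.write.parquet("s3://test-bucket/customers_sample")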


-Steve






On Fri, 7 Apr 2017 at 16:17, Gourav Sengupta <go...@gmail.com>> wrote:
Hi Steve,

Why would you ever do that? You are suggesting the use of a CI tool as a workflow and orchestration engine.

Regards,
Gourav Sengupta

On Fri, Apr 7, 2017 at 4:07 PM, Steve Loughran <st...@hortonworks.com>> wrote:
If you have Jenkins set up for some CI workflow, that can do scheduled builds and tests. Works well if you can do some build test before even submitting it to a remote cluster

On 7 Apr 2017, at 10:15, Sam Elamin <hu...@gmail.com>> wrote:

Hi Shyla

You have multiple options really some of which have been already listed but let me try and clarify

Assuming you have a spark application in a jar you have a variety of options

You have to have an existing spark cluster that is either running on EMR or somewhere else.

Super simple / hacky
Cron job on EC2 that calls a simple shell script that does a spark-submit to a Spark cluster, OR create or add a step to an EMR cluster

More Elegant
Airflow/Luigi/AWS Data Pipeline (Which is just CRON in the UI ) that will do the above step but have scheduling and potential backfilling and error handling(retries,alerts etc)

AWS are coming out with Glue <https://aws.amazon.com/glue/> soon that does some Spark jobs, but I do not think it's available worldwide just yet

Hope I cleared things up

Regards
Sam


On Fri, Apr 7, 2017 at 6:05 AM, Gourav Sengupta <go...@gmail.com>> wrote:
Hi Shyla,

why would you want to schedule a spark job in EC2 instead of EMR?

Regards,
Gourav

On Fri, Apr 7, 2017 at 1:04 AM, shyla deshpande <de...@gmail.com>> wrote:
I want to run a spark batch job maybe hourly on AWS EC2 .  What is the easiest way to do this. Thanks






Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

Posted by Sam Elamin <hu...@gmail.com>.
Definitely agree with Gourav there. I wouldn't want Jenkins to run my
workflow. Seems to me that you would only be using Jenkins for its scheduling
capabilities.

Yes you can run tests but you wouldn't want it to run your orchestration of
jobs

What happens if Jenkins goes down for any particular reason? How do you
have the conversation with your stakeholders that your pipeline is not
working and they don't have data because the build server is going through
an upgrade?

However, to be fair, I understand what you are saying, Steve: if someone is in
a place where they only have access to Jenkins and have to go through hoops
to set up or get access to new instances, then engineers will do what they
always do: find ways to game the system to get their work done.




On Fri, 7 Apr 2017 at 16:17, Gourav Sengupta <go...@gmail.com>
wrote:

> Hi Steve,
>
> Why would you ever do that? You are suggesting the use of a CI tool as a
> workflow and orchestration engine.
>
> Regards,
> Gourav Sengupta
>
> On Fri, Apr 7, 2017 at 4:07 PM, Steve Loughran <st...@hortonworks.com>
> wrote:
>
>> If you have Jenkins set up for some CI workflow, that can do scheduled
>> builds and tests. Works well if you can do some build test before even
>> submitting it to a remote cluster
>>
>> On 7 Apr 2017, at 10:15, Sam Elamin <hu...@gmail.com> wrote:
>>
>> Hi Shyla
>>
>> You have multiple options really some of which have been already listed
>> but let me try and clarify
>>
>> Assuming you have a spark application in a jar you have a variety of
>> options
>>
>> You have to have an existing spark cluster that is either running on EMR
>> or somewhere else.
>>
>> *Super simple / hacky*
>> Cron job on EC2 that calls a simple shell script that does a spark-submit
>> to a Spark Cluster OR create or add step to an EMR cluster
>>
>> *More Elegant*
>> Airflow/Luigi/AWS Data Pipeline (Which is just CRON in the UI ) that will
>> do the above step but have scheduling and potential backfilling and error
>> handling(retries,alerts etc)
>>
>> AWS are coming out with glue <https://aws.amazon.com/glue/> soon that
>> does some Spark jobs but I do not think its available worldwide just yet
>>
>> Hope I cleared things up
>>
>> Regards
>> Sam
>>
>>
>> On Fri, Apr 7, 2017 at 6:05 AM, Gourav Sengupta <
>> gourav.sengupta@gmail.com> wrote:
>>
>>> Hi Shyla,
>>>
>>> why would you want to schedule a spark job in EC2 instead of EMR?
>>>
>>> Regards,
>>> Gourav
>>>
>>> On Fri, Apr 7, 2017 at 1:04 AM, shyla deshpande <
>>> deshpandeshyla@gmail.com> wrote:
>>>
>>>> I want to run a spark batch job maybe hourly on AWS EC2 .  What is the
>>>> easiest way to do this. Thanks
>>>>
>>>
>>>
>>
>>
>

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

Posted by Gourav Sengupta <go...@gmail.com>.
Hi Steve,

Why would you ever do that? You are suggesting the use of a CI tool as a
workflow and orchestration engine.

Regards,
Gourav Sengupta

On Fri, Apr 7, 2017 at 4:07 PM, Steve Loughran <st...@hortonworks.com>
wrote:

> If you have Jenkins set up for some CI workflow, that can do scheduled
> builds and tests. Works well if you can do some build test before even
> submitting it to a remote cluster
>
> On 7 Apr 2017, at 10:15, Sam Elamin <hu...@gmail.com> wrote:
>
> Hi Shyla
>
> You have multiple options really some of which have been already listed
> but let me try and clarify
>
> Assuming you have a spark application in a jar you have a variety of
> options
>
> You have to have an existing spark cluster that is either running on EMR
> or somewhere else.
>
> *Super simple / hacky*
> Cron job on EC2 that calls a simple shell script that does a spark-submit
> to a Spark Cluster OR create or add step to an EMR cluster
>
> *More Elegant*
> Airflow/Luigi/AWS Data Pipeline (Which is just CRON in the UI ) that will
> do the above step but have scheduling and potential backfilling and error
> handling(retries,alerts etc)
>
> AWS are coming out with glue <https://aws.amazon.com/glue/> soon that
> does some Spark jobs but I do not think its available worldwide just yet
>
> Hope I cleared things up
>
> Regards
> Sam
>
>
> On Fri, Apr 7, 2017 at 6:05 AM, Gourav Sengupta <gourav.sengupta@gmail.com
> > wrote:
>
>> Hi Shyla,
>>
>> why would you want to schedule a spark job in EC2 instead of EMR?
>>
>> Regards,
>> Gourav
>>
>> On Fri, Apr 7, 2017 at 1:04 AM, shyla deshpande <deshpandeshyla@gmail.com
>> > wrote:
>>
>>> I want to run a spark batch job maybe hourly on AWS EC2 .  What is the
>>> easiest way to do this. Thanks
>>>
>>
>>
>
>

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

Posted by Steve Loughran <st...@hortonworks.com>.
If you have Jenkins set up for some CI workflow, it can do scheduled builds and tests. That works well if you can run some build tests before even submitting the job to a remote cluster.

On 7 Apr 2017, at 10:15, Sam Elamin <hu...@gmail.com>> wrote:

Hi Shyla

You have multiple options really some of which have been already listed but let me try and clarify

Assuming you have a spark application in a jar you have a variety of options

You have to have an existing spark cluster that is either running on EMR or somewhere else.

Super simple / hacky
Cron job on EC2 that calls a simple shell script that does a spark-submit to a Spark cluster, OR create or add a step to an EMR cluster

More Elegant
Airflow/Luigi/AWS Data Pipeline (Which is just CRON in the UI ) that will do the above step but have scheduling and potential backfilling and error handling(retries,alerts etc)

AWS are coming out with Glue <https://aws.amazon.com/glue/> soon that does some Spark jobs, but I do not think it's available worldwide just yet

Hope I cleared things up

Regards
Sam


On Fri, Apr 7, 2017 at 6:05 AM, Gourav Sengupta <go...@gmail.com>> wrote:
Hi Shyla,

why would you want to schedule a spark job in EC2 instead of EMR?

Regards,
Gourav

On Fri, Apr 7, 2017 at 1:04 AM, shyla deshpande <de...@gmail.com>> wrote:
I want to run a spark batch job maybe hourly on AWS EC2 .  What is the easiest way to do this. Thanks




Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

Posted by Sam Elamin <hu...@gmail.com>.
Hi Shyla

You have multiple options really, some of which have already been listed, but
let me try and clarify.

Assuming you have a Spark application in a jar, you have a variety of options.

You have to have an existing Spark cluster that is running either on EMR or
somewhere else.

*Super simple / hacky*
Cron job on EC2 that calls a simple shell script that does a spark-submit
to a Spark cluster, OR create or add a step to an EMR cluster.
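
As a rough illustration of this option, here is a minimal Python sketch of the
kind of script an hourly cron entry could invoke; the cluster ID, jar location,
master URL and class name below are made-up placeholders rather than anything
from this thread, and the EMR variant assumes boto3 is configured with
credentials that are allowed to call EMR:

    #!/usr/bin/env python
    # Hypothetical hourly submitter: either shell out to spark-submit against an
    # existing cluster, or add a step to a running EMR cluster via boto3.
    import subprocess
    import boto3

    JAR = "s3://my-bucket/jars/my-spark-job.jar"   # placeholder artifact location
    MAIN_CLASS = "com.example.HourlyBatchJob"      # placeholder main class
    EMR_CLUSTER_ID = "j-XXXXXXXXXXXXX"             # placeholder EMR cluster id

    def submit_to_spark_cluster():
        # The "simple shell script that does a spark-submit" variant.
        subprocess.check_call([
            "spark-submit",
            "--master", "spark://my-master:7077",  # placeholder master URL
            "--class", MAIN_CLASS,
            JAR,
        ])

    def add_emr_step():
        # The "add a step to an EMR cluster" variant, using command-runner.jar.
        emr = boto3.client("emr")
        emr.add_job_flow_steps(
            JobFlowId=EMR_CLUSTER_ID,
            Steps=[{
                "Name": "hourly-batch",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster",
                             "--class", MAIN_CLASS, JAR],
                },
            }],
        )

    if __name__ == "__main__":
        add_emr_step()  # or submit_to_spark_cluster()

A crontab entry such as "0 * * * * python /opt/jobs/hourly_submit.py" (again, a
made-up path) then provides the hourly trigger, with none of the retry or
alerting behaviour the more elegant option below gives you.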

*More Elegant*
Airflow/Luigi/AWS Data Pipeline (which is just cron in a UI) that will
do the above step but adds scheduling, potential backfilling, and error
handling (retries, alerts, etc.).
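
For the Airflow route, a minimal hourly DAG could look something like this
sketch (Airflow 1.x style imports; the jar, class, owner and alert address are
assumptions for illustration, not settings from this thread):

    # Hypothetical Airflow DAG: hourly spark-submit with retries and email alerts.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    default_args = {
        "owner": "data-eng",
        "retries": 2,                          # error handling: retry twice
        "retry_delay": timedelta(minutes=10),
        "email_on_failure": True,
        "email": ["oncall@example.com"],       # placeholder alert address
        "start_date": datetime(2017, 4, 1),
    }

    dag = DAG(
        "hourly_spark_batch",
        default_args=default_args,
        schedule_interval="@hourly",
        catchup=True,                          # allows backfilling of missed runs
    )

    submit = BashOperator(
        task_id="spark_submit",
        bash_command=(
            "spark-submit --deploy-mode cluster "
            "--class com.example.HourlyBatchJob "
            "s3://my-bucket/jars/my-spark-job.jar "
            "{{ ds }}"                         # pass the run date to the job
        ),
        dag=dag,
    )

A SparkSubmitOperator or an EMR step operator could stand in for the
BashOperator if you prefer; the shape stays the same: one scheduled task, with
retries, alerting and backfill handled by the scheduler rather than by cron.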

AWS are coming out with Glue <https://aws.amazon.com/glue/> soon that does
some Spark jobs, but I do not think it's available worldwide just yet.

Hope I cleared things up

Regards
Sam


On Fri, Apr 7, 2017 at 6:05 AM, Gourav Sengupta <go...@gmail.com>
wrote:

> Hi Shyla,
>
> why would you want to schedule a spark job in EC2 instead of EMR?
>
> Regards,
> Gourav
>
> On Fri, Apr 7, 2017 at 1:04 AM, shyla deshpande <de...@gmail.com>
> wrote:
>
>> I want to run a spark batch job maybe hourly on AWS EC2 .  What is the
>> easiest way to do this. Thanks
>>
>
>

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

Posted by Gourav Sengupta <go...@gmail.com>.
Hi Shyla,

Why would you want to schedule a Spark job on EC2 instead of EMR?

Regards,
Gourav

On Fri, Apr 7, 2017 at 1:04 AM, shyla deshpande <de...@gmail.com>
wrote:

> I want to run a spark batch job maybe hourly on AWS EC2 .  What is the
> easiest way to do this. Thanks
>

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

Posted by Yash Sharma <ya...@gmail.com>.
Hi Shyla,
We could suggest based on what you're trying to do exactly. But with the
given information: if you have your Spark job ready, you could schedule it
via any scheduling framework like Airflow, Celery, or cron, based on how
simple/complex you want your workflow to be.

Cheers,
Yash



On Fri, 7 Apr 2017 at 10:04 shyla deshpande <de...@gmail.com>
wrote:

> I want to run a spark batch job maybe hourly on AWS EC2 .  What is the
> easiest way to do this. Thanks
>