Posted to user@spark.apache.org by Sam Elamin <hu...@gmail.com> on 2017/04/12 20:11:16 UTC

Deploying Spark Applications. Best Practices And Patterns

Hi All,

Really useful information on this thread. We moved a bit off topic, since
the initial question was how to schedule Spark jobs in AWS, but I do think
there are loads of great insights here within the community, so I have
renamed the subject to "Deploying Spark Applications. Best Practices".

I honestly think this is a great opportunity to share knowledge and views
on what best practices are, and hopefully it will lead to people building
and shipping reliable, scalable data pipelines.

Also, can I reiterate what Steve mentioned: this mailing list is meant to
help people, so please let's keep our feedback constructive :)

Here is my 2 cents:

As someone who has come from a web development background and transitioned
into the data space, I can assure you that I completely agree with Steve
that consistent, repeatable deployments are essential for shipping software
reliably. It is the very essence of Continuous Deployment.

As a side note, the difference between continuous deployment and continuous
delivery is that switch to turn the lights on: you can continuously deploy
to an environment, but it is a business decision whether you flick the
switch that makes a feature available to your customers. This is the idea
behind dark launches and canary releases.

I have seen my fair share of "cowboy" shops where deploying your
application means a manual copy/paste by a human, and that is absolutely
the worst way to deploy things: when humans are involved in any step of
your pipeline, things can and inevitably will go wrong.

This is exactly why CI tools have been brought into play: to integrate all
your code across teams/branches and ensure the code compiles and the tests
pass, regardless of what the code is meant to do. Whether it's a website,
application, library or framework is irrelevant.

Not everyone is going to agree with this, but in my humble opinion "big
data" products are in their infancy; traditional ETL consisted of bespoke
Python scripts that at most did simple transformations. There is no right
way, and there certainly isn't an industry standard as of yet. There are
definitely loads of wrong ways, and I am sure we have all seen/done our
fair share of "horror stories", as Steve eloquently put it. Tools like
Talend and Pentaho came into play to try and simplify this process: the UI
basically said point me at a database, click, click, click, and you have a
pipeline in place.

When it comes to scheduling Spark jobs, you can either submit to an already
running cluster using things like Oozie or bash scripts, or have a workflow
manager like Airflow or AWS Data Pipeline create new clusters for you. We
went down the second route to stay with the whole immutable infrastructure /
"treat your servers as cattle, not pets" approach.
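
For anyone curious, the "new cluster per run" route boils down to a call
like the sketch below (shown directly against boto3; Airflow's EMR
operators and AWS Data Pipeline essentially wrap the same API). The bucket
names, release label and instance sizes here are made-up placeholders:

import boto3

emr = boto3.client("emr", region_name="eu-west-1")

# Spin up a throwaway cluster that runs a single spark-submit step and
# then terminates itself (the "cattle, not pets" approach).
response = emr.run_job_flow(
    Name="nightly-transform",
    ReleaseLabel="emr-5.4.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,
        # Terminate the cluster as soon as the step finishes
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "run-spark-app",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://my-artifacts/jobs/transform.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Started cluster", response["JobFlowId"])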

We are facing two problems with this at the moment:

1) Testing and versioning of data in Spark applications: We solved this by
using Holden's spark-testing-base, which works amazingly well. However, the
nature of data warehousing means people want ALL the columns available, so
you have to go against your nature as an engineer to keep things simple.
The mentality of an analyst or a data scientist is to throw the kitchen
sink in: literally any available data should end up in the final
transformed table. This basically means you either do not test the
generated data, or your code becomes super verbose and coupled, making it a
nightmare to maintain, which defeats the purpose of testing in the first
place. Not to mention the nuances of the data sources coming in, e.g. data
arriving in the wrong shape, wrong order or wrong format, or in some cases
not at all. You need to test for all of that and deal with it, or you will
get burnt in production. You do not want to be on that call when your
stakeholders are asking why their reports are not updated or, worse, are
showing no data!
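
To make that concrete, here is a rough sketch of the kind of test we end up
writing. I've shown it with plain pytest and a local SparkSession rather
than the spark-testing-base helpers, and transform_orders plus the column
names are hypothetical stand-ins for a real job:

import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    # A small local session is enough for schema and shape assertions.
    return (SparkSession.builder
            .master("local[2]")
            .appName("transform-tests")
            .getOrCreate())


def transform_orders(df):
    # Stand-in for the real transformation under test.
    return df.dropna(subset=["order_id"]).withColumnRenamed("amt", "amount")


def test_transform_drops_bad_rows_and_keeps_expected_columns(spark):
    source = spark.createDataFrame(
        [("o-1", 10.0), (None, 99.0)],   # second row arrives without an id
        ["order_id", "amt"],
    )

    result = transform_orders(source)

    assert result.columns == ["order_id", "amount"]
    assert result.count() == 1           # the malformed row was dropped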

2) Testing and deploying the workflow manager: We needed to ensure
deployments were easy. We were basically masochists here, i.e. if
deployment is painful, do it more often until it isn't. The problem is
there isn't a clean way to test Airflow other than running the DAGs
themselves, so we had to parameterise them to push test data through our
pipeline and ensure that the transformed tables were generated correctly (a
simple S3 lookup for us). We are still in the early days here, so I'm happy
to hear feedback on how to do it better.
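
For what it's worth, the "simple S3 lookup" and the parameterisation boil
down to something like the sketch below. The bucket and conf keys are
placeholders; the test harness triggers the DAG with a test location (e.g.
via airflow trigger_dag with a JSON conf payload) so the same DAG
definition runs against fixture data:

import boto3


def output_exists(bucket, prefix):
    """Return True if the run wrote at least one object under the prefix."""
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    return resp.get("KeyCount", 0) > 0


# In the DAG itself the input/output locations come from dag_run.conf, so a
# test run can point the pipeline at fixture data and a scratch output path.
def check_transformed_tables(**context):
    prefix = context["dag_run"].conf["output_prefix"]
    assert output_exists("my-data-bucket", prefix), (
        "expected transformed tables under s3://my-data-bucket/" + prefix)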


I realise this is a very, very long email and would probably be better
explained in a blog post, but hey, this is the gist of it. If people are
still interested I can write it up as a blog post, adding code samples and
nice diagrams!

Kind Regards
Sam

On Wed, Apr 12, 2017 at 7:33 PM, lucas.gary@gmail.com <lu...@gmail.com>
wrote:

> "Building data products is a very different discipline from that of
> building software."
>
> That is a fundamentally incorrect assumption.
>
> There will always be a need for figuring out how to apply said principles,
> but saying 'we're different' has always turned out to be incorrect and I
> have seen no reason to think otherwise for data products.
>
> At some point it always comes down to 'how do I get this to my customer,
> in a reliable and repeatable fashion'.  The CI/CD patterns that we've come
> to rely on are designed to optimize that process.
>
> I have seen no evidence that 'data products' don't benefit from those
> practices and I have definitely seen evidence that not following those
> patterns has had substantial costs.
>
> Of course there's always a balancing act in the early phases of discovery,
> but at some point the needle swings from: "Do I have a valuable product"
> to: "How do I get this to customers"
>
> Gary Lucas
>
> On 12 April 2017 at 10:46, Steve Loughran <st...@hortonworks.com> wrote:
>
>>
>> On 12 Apr 2017, at 17:25, Gourav Sengupta <go...@gmail.com>
>> wrote:
>>
>> Hi,
>>
>> Your answer is like saying: I know how to code in assembly language,
>> and I am going to build the next GUI in assembly code, and I think
>> that there is a genuine functional requirement to see the color of a
>> button as green on the screen.
>>
>>
>> well, I reserve the right to have incomplete knowledge, and look forward
>> to improving it.
>>
>> Perhaps it may be pertinent to read the first preface of a CI/CD book
>> and realize what kind of software development disciplines it is
>> applicable to.
>>
>>
>> the original introduction to CI was probably Fowler's Cruise Control
>> article:
>> https://martinfowler.com/articles/originalContinuousIntegration.html
>>
>> "The key is to automate absolutely everything and run the process so
>> often that integration errors are found quickly"
>>
>> Java Development with Ant, 2003, looks at Cruise Control, Anthill and
>> Gump, again with that focus on team coding and automated regression
>> testing, both of unit tests and, with things like HttpUnit, web UIs.
>> There's no discussion of "Data" per se, though databases are implicit.
>>
>> Apache Gump [Sam Ruby, 2001] was designed to address a single problem
>> "get the entire ASF project portfolio to build and test against the latest
>> build of everything else". Lots of finger pointing there, especially when
>> something foundational like Ant or Xerces did bad.
>>
>> AFAIK, the earliest known in-print reference to Continuous Deployment is
>> the HP Labs 2002 paper, *Making Web Services that Work*. That introduced
>> the concept with a focus on automating deployment, staging testing and
>> treating ops problems as use cases which engineers could often write
>> tests for and, perhaps, even design their applications to support. "We are
>> exploring extending this model to one we term Continuous Deployment —after
>> passing the local test suite, a service can be automatically deployed to a
>> public staging server for stress and acceptance testing by physically
>> remote calling parties"
>>
>> At this time, the applications weren't modern "big data" apps as they
>> didn't have affordable storage or the tools to schedule work over it. It
>> wasn't that the people writing the books and papers looked at big data and
>> said "not for us", it just wasn't on their horizons. 1TB was a lot of
>> storage in those days, not a high-end SSD.
>>
>> Otherwise your approach is just another line of defense in saving your
>> job by applying an impertinent, incorrect, and outdated skill and tool to a
>> problem.
>>
>>
>> Please be a bit more constructive here; the ASF code of conduct
>> encourages empathy and collaboration:
>> https://www.apache.org/foundation/policies/conduct . Thanks.
>>
>>
>> Building data products is a very different discipline from that of
>> building software.
>>
>>
>> Which is why we need to consider how to take what are core methodologies
>> for software and apply them, and, where appropriate, supersede them with
>> new workflows, ideas and technologies, but doing so with an understanding
>> of the reasoning behind today's tools and workflows. I'm really interested
>> in how we get from experimental notebook code to something usable in
>> production, pushing it out, finding the dirty-data problems before it goes
>> live, etc. I do think today's tools have been outgrown by the
>> applications we now build, and am thinking not so much "which tools to
>> use", but one step further, "what are the new tools and techniques to
>> use?".
>>
>> I look forward to whatever insight people have here.
>>
>>
>> My genuine advice to everyone in all spheres of activities will be to
>> first understand the problem to solve before solving it and definitely
>> before selecting the tools to solve it, otherwise you will land up with a
>> bowl of soup and fork in hand and argue that CI/ CD is still applicable to
>> building data products and data warehousing.
>>
>>
>> I concur
>>
>> Regards,
>> Gourav
>>
>>
>> -Steve
>>
>> On Wed, Apr 12, 2017 at 12:42 PM, Steve Loughran <st...@hortonworks.com>
>> wrote:
>>
>>>
>>> On 11 Apr 2017, at 20:46, Gourav Sengupta <go...@gmail.com>
>>> wrote:
>>>
>>> And once again JAVA programmers are trying to solve a data analytics and
>>> data warehousing problem using programming paradigms. It is genuinely a
>>> pain to see this happen.
>>>
>>>
>>>
>>> While I'm happy to be faulted for treating things as software processes,
>>> having a fully automated mechanism for testing the latest code before
>>> production is something I'd consider foundational today. This is what
>>> "Continuous Deployment" was about when it was first conceived. Does it
>>> mean you should blindly deploy that way? Well, not if you worry about
>>> security, but having that review process and then a final manual "deploy"
>>> button can address that.
>>>
>>> Cloud infras let you integrate cluster instantiation into the process,
>>> which helps you automate things like "stage the deployment in some new
>>> VMs, run acceptance tests (*), then switch the load balancer over to the
>>> new cluster, being ready to switch back if you need to". I've not tried
>>> that with streaming apps though; I don't know how to do it there. Booting
>>> the new cluster off checkpointed state requires deserialization to work,
>>> which can't be guaranteed if you are changing the objects which get
>>> serialized.
>>>
>>> I'd argue then, it's not a problem which has already been solved by data
>>> analytics/warehousing, though if you've got pointers there, I'd be
>>> grateful. Always good to see work by others. Indeed, the telecoms industry
>>> have led the way in testing and HA deployment: if you look at Erlang you
>>> can see a system designed with hot upgrades in mind, the way Java's "add
>>> a JAR to a web server" model never was.
>>>
>>> -Steve
>>>
>>>
>>> (*) do always make sure this is the test cluster with a snapshot of test
>>> data, not production machines/data. There are always horror stories there.
>>>
>>>
>>> Regards,
>>> Gourav
>>>
>>> On Tue, Apr 11, 2017 at 2:20 PM, Sam Elamin <hu...@gmail.com>
>>> wrote:
>>>
>>>> Hi Steve
>>>>
>>>>
>>>> Thanks for the detailed response, I think this problem doesn't have an
>>>> industry standard solution as of yet and I am sure a lot of people would
>>>> benefit from the discussion
>>>>
>>>> I realise now what you are saying, so thanks for clarifying. That said,
>>>> let me try and explain how we approached the problem.
>>>>
>>>> There are 2 problems you highlighted: the first is moving the code from
>>>> SCM to prod, and the other is ensuring the data your code uses is correct
>>>> (using the latest data from prod).
>>>>
>>>>
>>>> *"how do you get your code from SCM into production?"*
>>>>
>>>> We currently run our pipeline via Airflow and keep our DAGs in S3. With
>>>> regard to how we get our code from SCM to production:
>>>>
>>>> 1) A Jenkins build builds our Spark applications and runs tests
>>>> 2) Once the first build is successful, we trigger another build to copy
>>>> the DAGs to an S3 folder
>>>>
>>>> We then routinely sync this folder to the local Airflow DAGs folder
>>>> every X minutes
>>>>
>>>> Re test data
>>>> *" but what's your strategy for test data: that's always the
>>>> troublespot."*
>>>>
>>>> Our application is using versioning against the data, so we expect the
>>>> source data to be in a certain version and the output data to also be in a
>>>> certain version
>>>>
>>>> We have a test resources folder that follows the same versioning
>>>> convention - this is the data that our application tests use - to ensure
>>>> that the data is in the correct format
>>>>
>>>> So, for example, if we have Table X at version 1 that depends on data
>>>> from Tables A and B, also at version 1, we run our Spark application and
>>>> then ensure the transformed Table X has the correct columns and row values
>>>>
>>>> Then when we have a new version 2 of the source data, or add a new
>>>> column to Table X (version 2), we generate a new version of the data and
>>>> ensure the tests are updated
>>>>
>>>> That way we ensure any new version of the data has tests against it
>>>>
>>>> *"I've never seen any good strategy there short of "throw it at a copy
>>>> of the production dataset"."*
>>>>
>>>> I agree, which is why we have a sample of the production data and
>>>> version the schemas we expect the source and target data to conform to.
>>>>
>>>> If people are interested I am happy to write a blog post about it, in
>>>> the hope that it helps people build more reliable pipelines
>>>>
>>>>
>>> Love to see that.
>>>
>>> Kind Regards
>>>> Sam
>>>>
>>>
>>>
>>
>>
>

Re: Deploying Spark Applications. Best Practices And Patterns

Posted by Daniel Siegmann <ds...@securityscorecard.io>.
On Wed, Apr 12, 2017 at 4:11 PM, Sam Elamin <hu...@gmail.com> wrote:

>
> When it comes to scheduling Spark jobs, you can either submit to an
> already running cluster using things like Oozie or bash scripts, or have a
> workflow manager like Airflow or AWS Data Pipeline create new clusters for
> you. We went down the second route to stay with the whole immutable
> infrastructure / "treat your servers as cattle, not pets" approach.
>

A great overview. I just want to point out that Airflow can also submit
jobs to an existing cluster if you prefer to have a shared cluster (which
may be ideal if you have a bunch of smaller jobs to complete). Do keep in
mind that if you are using the EMR operator that uses the EMR add-step API,
the steps will be submitted to YARN one at a time.
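
To illustrate that, the add-step route is roughly the call below (the
cluster id and S3 paths are placeholders); EMR queues the steps on the
cluster and hands them to YARN one after another:

import boto3

emr = boto3.client("emr", region_name="eu-west-1")

# Add a spark-submit step to an already running (shared) cluster.
emr.add_job_flow_steps(
    JobFlowId="j-EXAMPLECLUSTERID",   # placeholder cluster id
    Steps=[{
        "Name": "hourly-aggregation",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://my-artifacts/jobs/aggregate.py"],
        },
    }],
)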