Posted to dev@airflow.apache.org by Maxime Beauchemin <ma...@gmail.com> on 2017/01/21 00:38:41 UTC

Article: The Rise of the Data Engineer

Hey I just published an article about the "Data Engineer" role in modern
organizations and thought it could be of interest to this community.

https://medium.com/@maximebeauchemin/the-rise-of-the-data-engineer-91be18f1e603#.5rkm4htnf

Max

Re: Article: The Rise of the Data Engineer

Posted by Gerard Toonstra <gt...@gmail.com>.
> Per this email thread, it almost sounds like a slack team/discourse for
> data engineering might be useful.

I certainly would not mind getting more knowledge on this topic, and I'd
like to be invited to such a Slack group (or Google group).

Re: Article: The Rise of the Data Engineer

Posted by Brian Van Klaveren <bv...@slac.stanford.edu>.
There's also MonetDB (https://www.monetdb.org) and Greenplum, depending on your data size, which support columnar tables if you want to get your feet wet. If your data is actually more array-like, you might try out SciDB.

Per this email thread, it almost sounds like a slack team/discourse for data engineering might be useful.


Re: Article: The Rise of the Data Engineer

Posted by Rob Goretsky <ro...@gmail.com>.
@Gerard - I mentioned Vertica just as one of the first examples of a system
that offers columnar storage.  You might actually see a significant benefit
using columnar storage with even a smaller table, as small as a few GB -
Columnar storage works well if you have wide fact tables with many columns
and often query on just a few of those columns.  The downside to columnar
storage is that if you often SELECT *, or many, of the columns from the
table at once, it will actually be slower than if you had stored the data
in traditional 'row-based' storage.  Also, updates and deletes can be
slower with columnar storage, so it works best if you have wide,
INSERT-only fact tables.   That said, I think there are better options than
Vertica on the market today for getting your feet wet with columnar
storage.  If AWS is an option for you, then Redshift offers this out of the
box, and would let you run your POC for as little as $0.25 an hour.
Parquet is basically columnar storage for Hadoop.  Other more traditional
data warehouse vendors like Netezza and Teradata also offer columnar
storage as an option.
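
To make the column-pruning point concrete, here's a minimal sketch using
pandas with the PyArrow Parquet engine (nothing from a real system - the
file name, row count and column names are all made up):

    import numpy as np
    import pandas as pd

    # Hypothetical wide fact table: 50 columns, but most queries touch 2-3.
    df = pd.DataFrame({"col_%d" % i: np.random.rand(100000)
                       for i in range(50)})
    df.to_parquet("facts.parquet")  # columnar layout, compressed per column

    # Reading only the needed columns scans just those column chunks...
    subset = pd.read_parquet("facts.parquet", columns=["col_0", "col_1"])

    # ...while a SELECT *-style read reassembles all 50 columns, which is
    # where traditional row-based storage can come out ahead.
    full = pd.read_parquet("facts.parquet")

The same access pattern is what engines like Vertica and Redshift exploit
internally.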


Re: Article: The Rise of the Data Engineer

Posted by Boris Tyukin <bo...@boristyukin.com>.
Max, really really nice post and I like your style of writing - please
continue sharing your experience and inspire many of us working in more
traditional environments ;) I shared your post with our leadership and
hopefully we will have data engineers soon on our team! As far as UI vs.
coding, I am not sure I fully agree: if we look at software development
history, we will see times when programming was the only answer and
required hardcore professionals like you, but then came commercial
applications which were very visual and lowered the skillset required.
Informatica, SSIS and others became hugely popular and many people swear
they save time if you know how to use them. I am pretty sure we will see
new tools in Big Data arena as well (AtScale is one example) that make
things easier for less skilled developers and users.

It is also good timing for me, as my company is evaluating the Informatica
Big Data Management add-on (which competes with Talend Big Data) - I am not
sold yet
on why we would need it if we can do much more with Python and Spark and
Hive. But the key point Informatica folks make is to lower the requirements
for the skills of developers and to leverage existing skills with
Informatica and SQL. I think this is important because this is exactly why
SQL is still a huge player in Big Data world - people love SQL, they can do
a lot with SQL and they want to use their SQL experience they've built over
their careers.

The dimensional modelling question you raised is also very interesting but
very arguable. I was thinking about it before and still have not come to
believe that flat tables are the way to go. You said it yourself that there
is still a place for a highly accurate (certified) enterprise-wide
warehouse, and one still needs to spend a lot of time thinking about use
cases and the design to support them. I am not sure I like the abundance of
de-normalized tables
in Big Data world but I do see your point about SCDs and all the pain to
maintain a traditional star schema DW. But dimensional modelling is not
really about maintenance or making life easier for ETL developers - IMHO it
is about structuring data to simplify business and data analytics. It is
about rigorous process to conform data from multiple source systems. It is
about data quality and trust. Finally it is about better performing DW (by
nature of RDBMS which are very good at joining tables by foreign keys) -
the last benefit though is not relevant in Hadoop since we can reprocess or
query data more efficiently.

Gerard, why would you do that? If you already have the skills with SQL
Server and your DWH is tiny (I run a 500GB DWH in SQL Server on a weak
machine), you should be fine with SQL Server. The only issue is that you
cannot support fast BI queries. But if you have an Enterprise license, you
can easily dump your tables into a Tabular in-memory cube and most of your
queries will run in under 2 seconds. Vertica is cool, but the learning
curve is pretty steep, and it really shines on big de-normalized tables, as
join performance is not that good. I work with a large healthcare vendor
and they have TB-size tables in their Vertica db - most of them are
flattened out, but they still have dimensions and facts, just fewer than
you would normally have with a traditional star schema design.




Re: Article: The Rise of the Data Engineer

Posted by Gerard Toonstra <gt...@gmail.com>.
You mentioned Vertica and Parquet. Is it recommended to use these newer
tools even when the DWH is not Big Data size (about 150GB)?

So there are a couple of good benefits, but are there any downsides or
disadvantages you have to take into account when comparing Vertica vs. SQL
Server, for example?

If you really recommend Vertica over SQL Server, I'm looking at doing a PoC
here to see where it goes...

Rgds,

Gerard



Re: Article: The Rise of the Data Engineer

Posted by Maxime Beauchemin <ma...@gmail.com>.
Glad to hear the article resonated with you! I just now got interviewed on
a podcast on this very subject, it should be up sometime this week:
https://itunes.apple.com/us/podcast/data-engineering-podcast/id1193040557

It's less structured than the article, but you can hear me babble about
data engineering and say semi-outrageous things about data scientists if
you have the patience to sit through it :)

I totally agree about SQL, it's the one solid constant in this ever
changing space.

Screw SCDs!

Max


Re: Article: The Rise of the Data Engineer

Posted by Rob Goretsky <ro...@gmail.com>.
Maxime,
Just wanted to thank you for writing this article - much like the original
articles by Jeff Hammerbacher and DJ Patil coining the term "Data
Scientist", I feel this article stands as a great explanation of what the
title of "Data Engineer" means today..  As someone who has been working in
this role before the title existed, many of the points here rang true about
how the technology and tools have evolved..

I started my career working with graphical ETL tools (Informatica) and
could never shake the feeling that I could get a lot more done, with a more
maintainable set of processes, if I could just write reusable functions in
any programming language and then keep them in a shared library.  Instead,
what the GUI tools forced upon us were massive Wiki documents laying out
'the 9 steps you need to follow perfectly in order to build a proper
Informatica workflow', which developers would painfully need to follow
along with, rather than being able to encapsulate the things that didn't
change in one central 'function' to pass in parameters for the things that
varied from the defaults.
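
To illustrate the idea with a hypothetical sketch (not our production code
- the factory function, shell scripts and table names are all made up), the
invariant steps can live once in a shared library and each workflow passes
in only what varies:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    def make_load_dag(dag_id, source_table, target_table, schedule='@daily'):
        # The "9 steps" live here once; callers supply only what varies.
        dag = DAG(dag_id, start_date=datetime(2017, 1, 1),
                  schedule_interval=schedule)
        extract = BashOperator(
            task_id='extract',
            bash_command='extract.sh {{ params.src }}',
            params={'src': source_table},
            dag=dag)
        load = BashOperator(
            task_id='load',
            bash_command='load.sh {{ params.tgt }}',
            params={'tgt': target_table},
            dag=dag)
        extract.set_downstream(load)
        return dag

    # Each concrete workflow is one call instead of a wiki-driven rebuild:
    orders_dag = make_load_dag('load_orders', 'raw.orders', 'dwh.f_orders')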

I also spent a lot of time early in my career trying to design data
warehouse tables using the Kimball methodology with star schemas and all
dimensions extracted out to separate dimension tables.  As columnar storage
formats with compression became available (Vertica/Parquet/etc), I started
gravitating more towards the idea that I could just store the raw string
dimension data in the fact table directly, denormalized, but it always felt
like I was breaking the 'purist' rules on how to design data warehouse
schemas 'the right way'.  So in that regard, thanks for validating my
feeling that it's OK to keep denormalized dimension data directly in fact
tables - it definitely makes our queries easier to write, and as you
mentioned, has the added benefit of helping you avoid all of that SCD fun!
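
As a toy illustration of the difference (hypothetical tables, with pandas
standing in for the warehouse), the denormalized layout trades a join and
its surrogate-key bookkeeping for a single-table scan:

    import pandas as pd

    # Star schema: fact rows carry surrogate keys into a dimension table,
    # so every report query repeats the same join.
    f_sales = pd.DataFrame({"customer_key": [1, 2, 1],
                            "amount": [10.0, 20.0, 5.0]})
    d_customer = pd.DataFrame({"customer_key": [1, 2],
                               "region": ["East", "West"]})
    by_region = (f_sales.merge(d_customer, on="customer_key")
                 .groupby("region")["amount"].sum())

    # Denormalized: the raw string attribute lives in the fact table, so
    # the query is a single-table group-by - and no SCD key management.
    f_sales_flat = pd.DataFrame({"customer_region": ["East", "West", "East"],
                                 "amount": [10.0, 20.0, 5.0]})
    by_region_flat = f_sales_flat.groupby("customer_region")["amount"].sum()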

We're about to put Airflow into production at my company (MLB.com) for a
handful of DAGs to start, so it will be running alongside our existing
Informatica server running 500+ workflows nightly.  But I can already see
the writing on the wall - it's really hard for us to find talented
engineers with Informatica experience along with more general computer
engineering backgrounds (many seem to have specialized in purely
Informatica) -  so our newer engineers come in with strong Python/SQL
backgrounds and have been gravitating towards building newer jobs in
Airflow...

One item that I think deserves addition to this article is the continuing
prevalence of SQL.   Many technologies have changed, but SQL has persisted
(pun intended?).  We went through a phase for a few years where it looked
like the tide was turning to MapReduce, Pig, or other languages for
accessing and aggregating data..  But now it seems even the "NoSQL" data
stores have added SQL layers on top, and we have more SQL engines for
Hadoop than I can count.   SQL is easy to learn but tougher to master, so
to me the two main languages in any modern Data Engineer's toolbelt are SQL
and a scripting language (Python/Ruby).  I think it's amazing that with
so much change in every aspect of how we do data warehousing, SQL has stood
the test of time...

Anyways, thanks again for writing this up, I'll definitely be sharing it
with my team!

-Rob


Re: Article: The Rise of the Data Engineer

Posted by Laura Lorenz <ll...@industrydive.com>.
👍
