Posted to dev@airflow.apache.org by Gerard Toonstra <gt...@gmail.com> on 2019/05/31 13:57:55 UTC

Project Genevere

Hi all,

We're running Airflow and writing data to SQL Server. Python and SQL Server
aren't a marriage made in heaven, and we've literally spent hours fixing
driver issues and installations, and working around things that should be
easy, time that's better spent writing application code.

I looked at Sqoop, but that only runs on Hadoop. I didn't find anything
open source / standalone to use as a "generic data pump", so I started a new
project for that (let me know if something does exist). The overall idea is
that an Airflow operator generates a command file for this process and then
spawns a Java process to run the pipeline.
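
To make that concrete, here's a minimal sketch of the operator side (the
command file keys, the genevere.jar name and the --command-file flag are
illustrative placeholders, not the actual Genevere interface):

    import json
    import subprocess

    from airflow.models import BaseOperator


    class DataPumpOperator(BaseOperator):
        """Writes a command file describing the copy job, then runs the Java pipeline."""

        def __init__(self, source_conn_id, target_conn_id, table, command_file, **kwargs):
            super(DataPumpOperator, self).__init__(**kwargs)
            self.source_conn_id = source_conn_id
            self.target_conn_id = target_conn_id
            self.table = table
            self.command_file = command_file

        def execute(self, context):
            # Describe the copy job in a command file the Java process reads.
            job = {
                "source": self.source_conn_id,
                "target": self.target_conn_id,
                "table": self.table,
            }
            with open(self.command_file, "w") as f:
                json.dump(job, f)

            # Spawn the Java process (hypothetical jar name and flag);
            # a non-zero exit code fails the Airflow task.
            subprocess.check_call(
                ["java", "-jar", "genevere.jar", "--command-file", self.command_file]
            )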

It's available here:

https://github.com/gtoonstra/genevere

If you have similar issues with your environment, maybe this helps you out.
And if you want to contribute, just reach out!

Rgds,

Gerard

Re: Project Genevere

Posted by Gerard Toonstra <gt...@gmail.com>.
There have been some discussions on the pymssql project about abandoning it
in favor of pyodbc, and that's still ongoing. MS has decided to support
pyodbc, so that's what we chose in the end.
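
For what it's worth, the pyodbc side is small once the driver is installed;
a minimal sketch, assuming Microsoft's ODBC Driver 17 for SQL Server and
placeholder connection details:

    import pyodbc

    # Placeholder server/database/credentials; assumes the Microsoft
    # ODBC Driver 17 for SQL Server is installed on the machine.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=myserver;DATABASE=mydb;UID=myuser;PWD=mypassword"
    )
    cursor = conn.cursor()
    cursor.fast_executemany = True  # batch parameterized inserts instead of row-by-row round trips
    cursor.executemany(
        "INSERT INTO dbo.my_table (id, name) VALUES (?, ?)",
        [(1, "a"), (2, "b")],
    )
    conn.commit()
    conn.close()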

We did build a wrapper around bcp, which is the fastest way of transferring
the data, but it took a month to stabilize. It's very unfriendly in how it
reports errors, so we had to invest some time in parsing the error and data
files so you can at least see what's going on.
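
Stripped of the error classification, the core of such a wrapper is roughly
this (server, credentials and separator are placeholders):

    import subprocess

    def run_bcp_import(table, data_file, error_file):
        """Run a character-mode bcp import and surface whatever bcp rejected."""
        result = subprocess.run(
            [
                "bcp", table, "in", data_file,
                "-S", "myserver", "-d", "mydb", "-U", "myuser", "-P", "mypassword",
                "-c",              # character mode
                "-t", "|",         # field terminator
                "-e", error_file,  # rows that bcp rejects end up in this file
            ],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            # bcp's own output is terse, so attach the rejected rows to the error.
            rejected = ""
            try:
                with open(error_file) as f:
                    rejected = f.read()
            except OSError:
                pass
            raise RuntimeError(
                "bcp failed:\n%s\n%s\n%s" % (result.stdout, result.stderr, rejected)
            )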

There are some nasty flaws with bcp:

* You end up iterating over all rows AND columns to coerce the data into the
right shape, mostly because we read from a different database platform into
MSSQL, so CPU usage goes up (see the sketch after this list).
* It requires a local file, which means all data is written to and read from
disk once, generating disk I/O spikes. It basically creates one unnecessary
copy in the process, so you need the throughput and IOPS to handle that.
* bcp doesn't have proper encoding support, so it's all ASCII.
* Error reporting is poor.
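
The reshaping loop from the first point is essentially this (the conversion
rules are made up for illustration; the real mapping depends on the source
platform):

    from datetime import date, datetime
    from decimal import Decimal

    def to_bcp_field(value):
        # Turn one source value into the text form the bcp data file expects.
        if value is None:
            return ""
        if isinstance(value, (datetime, date)):
            return value.isoformat()
        if isinstance(value, Decimal):
            return str(value)
        return str(value)

    def write_bcp_file(rows, path, sep="|"):
        # Every row and every column passes through Python, which is where the CPU goes.
        with open(path, "w") as f:
            for row in rows:
                f.write(sep.join(to_bcp_field(value) for value in row) + "\n")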

I added a link to some command file examples in the main README; I think
they demonstrate how it works.

@Tao: Yes, I had a look at Gobblin, but that's almost a complete ETL platform
on its own. I wanted something concise and simple.




On Fri, May 31, 2019 at 10:52 PM Felipe Elias Lolas Isuani <
felipe.elias013@gmail.com> wrote:

> Hi Gerard!
>
> Why not use the bcp in/out command directly with Airflow? We currently wrap
> the bcp in/out command into a Hook/Operator for working with SQL Server.
> Obviously you need to install the tools on the Airflow workers.
>
> Cheers,
> Philip
>

Re: Project Genevere

Posted by Felipe Elias Lolas Isuani <fe...@gmail.com>.
Hi Gerard!

Why not use the bcp in/out command directly with Airflow? We currently wrap the bcp in/out command into a Hook/Operator for working with SQL Server. Obviously you need to install the tools on the Airflow workers.
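
For illustration, a minimal sketch of that approach, with placeholder servers,
credentials and paths (not our actual Hook/Operator):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    # Assumes mssql-tools (which provides bcp) is installed on the worker.
    with DAG("bcp_copy_example", start_date=datetime(2019, 5, 1), schedule_interval=None) as dag:
        # Export the source table to a flat file with bcp out (character mode, '|' separator).
        bcp_out = BashOperator(
            task_id="bcp_out",
            bash_command=(
                "bcp dbo.source_table out /tmp/source_table.dat "
                "-S source-server -d source_db -U myuser -P mypassword -c -t '|'"
            ),
        )

        # Load the same file into the target table with bcp in.
        bcp_in = BashOperator(
            task_id="bcp_in",
            bash_command=(
                "bcp dbo.target_table in /tmp/source_table.dat "
                "-S target-server -d target_db -U myuser -P mypassword -c -t '|'"
            ),
        )

        bcp_out >> bcp_in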

Cheers,
Philip


Re: Project Genevere

Posted by Tao Feng <fe...@gmail.com>.
Hi Gerard,

Good to know :) FYI, LinkedIn created
https://github.com/apache/incubator-gobblin as a data ingestion framework
which could potentially solve your use case. But I am not sure it provides
support for SQL Server as a sink.

Best,
-Tao
