Posted to dev@arrow.apache.org by Brendan Niebruegge <br...@microsoft.com.INVALID> on 2020/05/19 22:16:59 UTC

Arrow Flight connector for SQL Server

Hi everyone,

I wanted to informally introduce myself. My name is Brendan Niebruegge, I'm a Software Engineer in our SQL Server extensibility team here at Microsoft. I am leading an effort to explore how we could integrate Arrow Flight with SQL Server. We think this could be a very interesting integration that would both benefit SQL Server and the Arrow community. We are very early in our thoughts so I thought it best to reach out here and see if you had any thoughts or suggestions for me. What would be the best way to socialize my thoughts to date? I am keen to learn and deepen my knowledge of Arrow as well so please let me know how I can be of help to the community.

Please feel free to reach out anytime (email:brnieb@microsoft.com)

Thanks,
Brendan Niebruegge


RE: [EXTERNAL] Re: Arrow Flight connector for SQL Server

Posted by Brendan Niebruegge <br...@microsoft.com.INVALID>.
First, thanks everyone for the warm welcome into this community!

I also want to thank Ryan/Uwe for the initial pointers and information.

Jacques, I'll share a little bit about my project planning. At this specific point in time, I am focusing on a simple POC of single-stream output. My first goal is to ensure a strong, flexible foundation in the SQL Engine that can easily adapt to any Flight changes with minimal overhead. From the SQL Engine/Extensibility point of view, though, parallel output streams should be relatively cheap for us once single stream is complete. Today, in parallelized execution of external scripts via sp_execute_external_script, we map every SQL query worker thread to an external process, effectively giving us the degree of parallelism (DOP) of the query execution (QE) itself. Leveraging QE, we can partially control the DOP using query options (MAXDOP), but this is just a query hint for the maximum number of workers, so unfortunately the client does not have complete control over the exact DOP at this time. After output, we will start to look at integrating Flight on the input side, but not much has been designed there yet.
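
To make the parallel-output idea concrete, here is a minimal sketch (not tied to any existing SQL Server implementation) of how a client could consume one Flight stream per endpoint using pyarrow.flight. The server address, the command format, and the mapping of endpoints to worker threads are assumptions for illustration only.

from concurrent.futures import ThreadPoolExecutor

import pyarrow as pa
import pyarrow.flight as flight


def read_endpoint(endpoint):
    # Each endpoint may be served from a different location (e.g. one per
    # worker process); fall back to the main server when none is advertised.
    location = endpoint.locations[0] if endpoint.locations else "grpc://sqlserver-host:8815"
    client = flight.FlightClient(location)
    return client.do_get(endpoint.ticket).read_all()


def fetch_parallel(query):
    client = flight.FlightClient("grpc://sqlserver-host:8815")  # assumed address
    info = client.get_flight_info(flight.FlightDescriptor.for_command(query.encode()))
    # One stream per endpoint; the endpoint count would reflect the effective
    # DOP chosen by the engine (e.g. influenced by MAXDOP).
    with ThreadPoolExecutor(max_workers=max(1, len(info.endpoints))) as pool:
        tables = list(pool.map(read_endpoint, info.endpoints))
    return pa.concat_tables(tables)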

Ryan, I would be very happy to discuss the standardization of the exchange of metadata, connection parameters, etc., as this will be super helpful for me. There are also a couple of non-SQL-specific options I've thought about which could provide some potential benefits, such as specifying a subset of output endpoints for a specific connection/query; this might allow for some client-side load balancing or performance benefits if endpoints and clients are co-located. Feel free to add me to any thread, or we can set up a time to call.
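
As a rough illustration of that endpoint-subset idea, the following sketch filters the endpoints advertised in a FlightInfo down to those co-located with the client; the host list and the string-match heuristic are purely hypothetical.

PREFERRED_HOSTS = ("localhost", "10.0.0.5")  # assumed co-located hosts


def select_endpoints(info):
    # `info` is a pyarrow.flight.FlightInfo, e.g. from client.get_flight_info().
    # Prefer co-located endpoints, falling back to everything advertised.
    def co_located(endpoint):
        for loc in endpoint.locations:
            uri = loc.uri.decode() if isinstance(loc.uri, bytes) else str(loc.uri)
            if any(host in uri for host in PREFERRED_HOSTS):
                return True
        return False

    preferred = [ep for ep in info.endpoints if co_located(ep)]
    return preferred or list(info.endpoints)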

Here are some reference links that provide more depth on the SQL specifics mentioned:
SQL Server Extensibility Framework Architecture: https://docs.microsoft.com/en-us/sql/machine-learning/concepts/extensibility-framework?view=sql-server-ver15
Stored procedure sp_execute_external_script: https://docs.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/sp-execute-external-script-transact-sql?view=sql-server-ver15
SQL worker threads: https://docs.microsoft.com/en-us/sql/relational-databases/thread-and-task-architecture-guide?view=sql-server-ver15
SQL query hints (ex. MAXDOP): https://docs.microsoft.com/en-us/sql/t-sql/queries/hints-transact-sql-query?view=sql-server-ver15

Thanks,
Brendan Niebruegge

From: Ryan Murray <ry...@dremio.com>
Sent: Friday, May 22, 2020 7:22 AM
To: dev@arrow.apache.org
Cc: Brendan Niebruegge <br...@microsoft.com>; Jasraj Dange <ja...@microsoft.com>; James Rowland-Jones <jr...@microsoft.com>; Jarupat Jisarojito <Ja...@microsoft.com>
Subject: [EXTERNAL] Re: Arrow Flight connector for SQL Server

Hey Brendan,

As Jacques promised, here are a few things to act as pointers for your work on Flight:
Our early-release Flight connector[1]; this fully supports single Flight streams and partially supports parallel streams.
I also have a Spark DataSourceV2 client which may be of interest to you[2].

Both links make use of the 'doAction' part of the Flight API spec[3] to negotiate parallel vs. single stream, among other things. However, this is done in an ad-hoc manner, and finding a way to standardise this for the exchange of metadata, catalog info, connection parameters, etc. is for me an important next step toward making a Flight-based protocol that is equivalent to ODBC/JDBC. I would be happy to discuss further if you have any thoughts on the topic.

Best,
Ryan

[1] https://github.com/dremio-hub/dremio-flight-connector
[2] https://github.com/rymurr/flight-spark-source
[3] https://github.com/apache/arrow/blob/master/format/Flight.proto


-----Original Message-----
From: Jacques Nadeau <ja...@apache.org>
Sent: Tuesday, May 19, 2020 7:05 PM
To: dev <de...@arrow.apache.org>
Subject: [EXTERNAL] Re: Arrow Flight connector for SQL Server

Hey Brendan,

Welcome to the community. At Dremio we've exposed Flight as an input and
output for SQL result datasets. I'll have one of our guys share some
details. I think a couple of questions we've been struggling with include
how to standardize additional metadata operations, what the prepare
behavior should be, and whether there is a way to standardize exposure of a
Flight path as an extension of both JDBC and ODBC.

Can you share more about whether you're initially more focused on input or
output, and parallel or single stream?

Thanks and welcome
Jacques

On Thu, May 21, 2020 at 3:08 PM Uwe L. Korn <uw...@xhochy.com> wrote:
Hello Brendan,

welcome to the community. In addition to the folks at Dremio, I wanted to make you aware of the Python ODBC client library https://github.com/blue-yonder/turbodbc which provides a high-performance ODBC<->Arrow adapter. It is especially popular with MS SQL Server users as the fastest known way to retrieve query results as DataFrames in Python from SQL Server, considerably faster than pandas.read_sql or using pyodbc directly.

While it is the fastest known approach, I can tell that there is still a lot of CPU time spent in the ODBC driver "transforming" results so that they match the ODBC interface. At least here, one could possibly get a lot better performance when retrieving large columnar results from SQL Server by going through Arrow Flight as an interface instead of being constrained to the less efficient ODBC for this use case. Currently there is a performance difference of 50x between reading the data from a Parquet file and reading the same data from a table in SQL Server (a simple SELECT, no filtering). As the client CPU is at 100% for nearly the full retrieval time, using a more efficient protocol for data transfer could roughly translate into a 10x speedup.

Best,
Uwe

On Wed, May 20, 2020, at 12:16 AM, Brendan Niebruegge wrote:
> Hi everyone,
>
> I wanted to informally introduce myself. My name is Brendan Niebruegge,
> I'm a Software Engineer in our SQL Server extensibility team here at
> Microsoft. I am leading an effort to explore how we could integrate
> Arrow Flight with SQL Server. We think this could be a very interesting
> integration that would both benefit SQL Server and the Arrow community.
> We are very early in our thoughts so I thought it best to reach out
> here and see if you had any thoughts or suggestions for me. What would
> be the best way to socialize my thoughts to date? I am keen to learn
> and deepen my knowledge of Arrow as well so please let me know how I
> can be of help to the community.
>
> Please feel free to reach out anytime (email:brnieb@microsoft.com)
>
> Thanks,
> Brendan Niebruegge
>
>

Re: Arrow Flight connector for SQL Server

Posted by Ryan Murray <ry...@dremio.com>.
Hey Brendan,

As Jacques promised, here are a few things to act as pointers for your work
on Flight:
Our early-release Flight connector[1]; this fully supports single Flight
streams and partially supports parallel streams.
I also have a Spark DataSourceV2 client which may be of interest to you[2].

Both links make use of the 'doAction' part of the Flight API spec[3] to
negotiate parallel vs. single stream, among other things. However, this is
done in an ad-hoc manner, and finding a way to standardise this for the
exchange of metadata, catalog info, connection parameters, etc. is for me an
important next step toward making a Flight-based protocol that is equivalent
to ODBC/JDBC. I would be happy to discuss further if you have any thoughts on
the topic.
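
To give a flavour of that kind of do_action negotiation, here is a minimal
sketch using pyarrow.flight. The action name "PARALLEL", its payload, and the
server address are hypothetical; each server currently defines its own
convention.

import json

import pyarrow.flight as flight

client = flight.FlightClient("grpc://flight-server:47470")  # assumed address

# Hypothetical action asking the server to produce parallel streams.
request = flight.Action("PARALLEL", json.dumps({"enabled": True}).encode())
for result in client.do_action(request):
    print(result.body.to_pybytes())  # server-defined acknowledgement

# If the server honours the request, get_flight_info should then advertise
# one endpoint per parallel stream for a given query.
info = client.get_flight_info(
    flight.FlightDescriptor.for_command(b"SELECT * FROM some_table"))
print(len(info.endpoints), "endpoint(s) advertised")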

Best,
Ryan

[1] https://github.com/dremio-hub/dremio-flight-connector
[2] https://github.com/rymurr/flight-spark-source
[3] https://github.com/apache/arrow/blob/master/format/Flight.proto

On Thu, May 21, 2020 at 3:08 PM Uwe L. Korn <uw...@xhochy.com> wrote:

> Hello Brendan,
>
> welcome to the community. In addition to the folks at Dremio, I wanted to
> make you aware of the Python ODBC client library
> https://github.com/blue-yonder/turbodbc which provides a high-performance
> ODBC<->Arrow adapter. It is especially popular with MS SQL Server users as
> the fastest known way to retrieve query results as DataFrames in Python
> from SQL Server, considerably faster than pandas.read_sql or using pyodbc
> directly.
>
> While it is the fastest known approach, I can tell that there is still a
> lot of CPU time spent in the ODBC driver "transforming" results so that
> they match the ODBC interface. At least here, one could possibly get a lot
> better performance when retrieving large columnar results from SQL Server
> by going through Arrow Flight as an interface instead of being constrained
> to the less efficient ODBC for this use case. Currently there is a
> performance difference of 50x between reading the data from a Parquet file
> and reading the same data from a table in SQL Server (a simple SELECT, no
> filtering). As the client CPU is at 100% for nearly the full retrieval
> time, using a more efficient protocol for data transfer could roughly
> translate into a 10x speedup.
>
> Best,
> Uwe
>
> On Wed, May 20, 2020, at 12:16 AM, Brendan Niebruegge wrote:
> > Hi everyone,
> >
> > I wanted to informally introduce myself. My name is Brendan Niebruegge,
> > I'm a Software Engineer in our SQL Server extensibility team here at
> > Microsoft. I am leading an effort to explore how we could integrate
> > Arrow Flight with SQL Server. We think this could be a very interesting
> > integration that would both benefit SQL Server and the Arrow community.
> > We are very early in our thoughts so I thought it best to reach out
> > here and see if you had any thoughts or suggestions for me. What would
> > be the best way to socialize my thoughts to date? I am keen to learn
> > and deepen my knowledge of Arrow as well so please let me know how I
> > can be of help to the community.
> >
> > Please feel free to reach out anytime (email:brnieb@microsoft.com)
> >
> > Thanks,
> > Brendan Niebruegge
> >
> >
>

Re: Arrow Flight connector for SQL Server

Posted by "Uwe L. Korn" <uw...@xhochy.com>.
Hello Brendan,

welcome to the community. In addition to the folks at Dremio, I wanted to make you aware of the Python ODBC client library https://github.com/blue-yonder/turbodbc which provides a high-performance ODBC<->Arrow adapter. It is especially popular with MS SQL Server users as the fastest known way to retrieve query results as DataFrames in Python from SQL Server, considerably faster than pandas.read_sql or using pyodbc directly.

While it is the fastest known approach, I can tell that there is still a lot of CPU time spent in the ODBC driver "transforming" results so that they match the ODBC interface. At least here, one could possibly get a lot better performance when retrieving large columnar results from SQL Server by going through Arrow Flight as an interface instead of being constrained to the less efficient ODBC for this use case. Currently there is a performance difference of 50x between reading the data from a Parquet file and reading the same data from a table in SQL Server (a simple SELECT, no filtering). As the client CPU is at 100% for nearly the full retrieval time, using a more efficient protocol for data transfer could roughly translate into a 10x speedup.
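
For context, a rough sketch of the kind of comparison described above
(turbodbc's Arrow path vs. reading the same data from Parquet); the DSN,
table name, and file name are placeholders, not a reproduction of the actual
benchmark.

import time

import pyarrow.parquet as pq
from turbodbc import connect, make_options

options = make_options(prefer_unicode=True)  # often needed for MS SQL Server
connection = connect(dsn="MSSQL", turbodbc_options=options)  # assumed DSN
cursor = connection.cursor()

start = time.perf_counter()
cursor.execute("SELECT * FROM dbo.some_table")  # simple SELECT, no filtering
table_from_sql = cursor.fetchallarrow()  # pyarrow.Table via the ODBC driver
print("turbodbc/ODBC:", time.perf_counter() - start, "s")

start = time.perf_counter()
table_from_parquet = pq.read_table("some_table.parquet")  # same data on disk
print("Parquet:      ", time.perf_counter() - start, "s")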

Best,
Uwe

On Wed, May 20, 2020, at 12:16 AM, Brendan Niebruegge wrote:
> Hi everyone,
> 
> I wanted to informally introduce myself. My name is Brendan Niebruegge, 
> I'm a Software Engineer in our SQL Server extensibility team here at 
> Microsoft. I am leading an effort to explore how we could integrate 
> Arrow Flight with SQL Server. We think this could be a very interesting 
> integration that would both benefit SQL Server and the Arrow community. 
> We are very early in our thoughts so I thought it best to reach out 
> here and see if you had any thoughts or suggestions for me. What would 
> be the best way to socialize my thoughts to date? I am keen to learn 
> and deepen my knowledge of Arrow as well so please let me know how I 
> can be of help to the community.
> 
> Please feel free to reach out anytime (email:brnieb@microsoft.com)
> 
> Thanks,
> Brendan Niebruegge
> 
>

Re: Arrow Flight connector for SQL Server

Posted by Jacques Nadeau <ja...@apache.org>.
Hey Brendan,

Welcome to the community. At Dremio we've exposed Flight as an input and
output for SQL result datasets. I'll have one of our guys share some
details. I think a couple of questions we've been struggling with include
how to standardize additional metadata operations, what the prepare
behavior should be, and whether there is a way to standardize exposure of a
Flight path as an extension of both JDBC and ODBC.

Can you share more about whether you're initially more focused on input or
output, and parallel or single stream?

Thanks and welcome
Jacques

On Tue, May 19, 2020, 3:17 PM Brendan Niebruegge
<br...@microsoft.com.invalid> wrote:

> Hi everyone,
>
> I wanted to informally introduce myself. My name is Brendan Niebruegge,
> I'm a Software Engineer in our SQL Server extensibility team here at
> Microsoft. I am leading an effort to explore how we could integrate Arrow
> Flight with SQL Server. We think this could be a very interesting
> integration that would both benefit SQL Server and the Arrow community. We
> are very early in our thoughts so I thought it best to reach out here and
> see if you had any thoughts or suggestions for me. What would be the best
> way to socialize my thoughts to date? I am keen to learn and deepen my
> knowledge of Arrow as well so please let me know how I can be of help to
> the community.
>
> Please feel free to reach out anytime (email:brnieb@microsoft.com)
>
> Thanks,
> Brendan Niebruegge
>
>