Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/11/29 08:39:27 UTC

[GitHub] [arrow] nevi-me edited a comment on pull request #8731: [Rust] [RFC] Native Rust Arrow SQL IO

nevi-me edited a comment on pull request #8731:
URL: https://github.com/apache/arrow/pull/8731#issuecomment-735361244


   Hi @alamb @andygrove, I've had a look at what RDBC offers and what I could use from it today (that is, without adding additional functionality there).
   
   I have enough feedback to start addressing some questions and comments, so I'll start working on a Google Doc, which can be accessed at https://docs.google.com/document/d/1temGm6lb4pI_OeZ68qgytWKVYabqtpfBye6KNf9jb2s/edit?usp=sharing (CC @jorgecarleitao).
   
   **Summary**
   
   I think sufficient interop and connectors into SQL databases already exist, but the challenge with them is often that they're based on high-level `ResultSet`/`Row` conversions, which carry some overhead. (This is not to say they're slow; I've seen, and been in awe of, Spark saturating a network pipe while reading from and writing to Oracle via JDBC; but I want to go faster than that if possible.)
   The aim of building a 'native' Arrow SQL interface is to try to bypass those conversions and instead go straight to the binary protocols. So I'm targeting performance over convenience.
   
   Without building more functionality into `rdbc`, the current state is that we might not be able to reuse anything/much from it, because:
   * We want to use binary readers/writers where possible, to derive their performance benefits
   * Exposing a common interface for them would be a challenge, and it would be an abstraction over too narrow a use-case
   * In order to use the binary readers/writers, we either need the underlying DB connection (e.g. `postgres::{Client, Connection}`), or we might need to go further down to the wire protocol
   * We don't need transaction support or statement creation, and we don't intend to expose any DB-specific details beyond "give me a table name or SQL query and I'll give you its Arrow data, and give me Arrow data + how I should write it, and I'll write it" (a rough sketch of what such an interface might look like follows this list)
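
   To make that concrete, here's a minimal sketch of the kind of interface I have in mind. The trait and method names (`SqlDataSource`, `schema`, `read`, `write`) are hypothetical and not part of this PR; the point is simply that callers only ever see Arrow types, never a database connection:

   ```rust
   use arrow::datatypes::SchemaRef;
   use arrow::error::Result;
   use arrow::record_batch::RecordBatch;

   /// Hypothetical shape of the proposed interface: hand it a table name or a SQL
   /// query and get Arrow data back; hand it Arrow data and it writes it out.
   /// The database connection stays an implementation detail behind the trait.
   pub trait SqlDataSource {
       /// Infer the Arrow schema of a table or query without reading its data.
       fn schema(&mut self, table_or_query: &str) -> Result<SchemaRef>;
       /// Execute the query (or scan the table) and return the results as record batches.
       fn read(&mut self, table_or_query: &str, batch_size: usize) -> Result<Vec<RecordBatch>>;
       /// Write record batches to the named table.
       fn write(&mut self, table: &str, batches: &[RecordBatch]) -> Result<()>;
   }
   ```
   A postgres-backed implementation would hold its `postgres::Client` privately and decode the binary wire data into the batches it returns.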
   
   **RDBC and Binary Database Protocols**
   
   ```rust
   /// Represents database driver that can be shared between threads, and can therefore implement
   /// a connection pool
   pub trait Driver: Sync + Send {
       /// Create a connection to the database. Note that connections are intended to be used
       /// in a single thread since most database connections are not thread-safe
       fn connect(&self, url: &str) -> Result<Box<dyn Connection>>;
   }
   ```
   In the Arrow SQL implementation, we never expose the connection to the user, so this remains internal, because all we ever share is the `arrow::_::SchemaRef` and `arrow::_::RecordBatch`. If we use the `Driver` above, it would be used internally to fetch data from the database and return it as batches, which can still work.
   
   The mechanism of the Arrow SQL interaction is that we take a SQL query or table name, execute it, and return bulk data; this will often require us to get as close to the wire protocol as necessary, thereby avoiding `rdbc::ResultSet`, `postgres::Row`, etc.
   As an example, in the postgres example on this PR, we use the [postgres::CopyOutReader](https://docs.rs/postgres/0.18.1/postgres/struct.CopyOutReader.html) to read the query results in a binary format (https://www.postgresql.org/docs/13/sql-copy.html), then convert this into Arrow data by turning the network-endian (big-endian) bytes of each value into the little-endian types that Arrow requires (and building up a null buffer along the way).
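
   To illustrate that decoding step (this is a sketch, not code from this PR), here's how one nullable `int4` field could be read from the `COPY ... BINARY` stream, following the documented field layout of a 4-byte big-endian length (`-1` signalling NULL) followed by the big-endian value bytes; the helper name is made up:

   ```rust
   use std::convert::TryInto;

   /// Decode one nullable int4 field from a PostgreSQL COPY BINARY tuple.
   /// Returns the decoded value (None for SQL NULL) and the number of bytes
   /// consumed, or None if the buffer is too short or malformed.
   fn read_int4_field(buf: &[u8]) -> Option<(Option<i32>, usize)> {
       // The field length arrives in network (big-endian) order; -1 means NULL.
       let len = i32::from_be_bytes(buf.get(0..4)?.try_into().ok()?);
       if len == -1 {
           // NULL: no value bytes follow; this becomes a bit in the Arrow null buffer.
           return Some((None, 4));
       }
       // The value bytes are also big-endian; convert to the native-endian i32 Arrow wants.
       let value = i32::from_be_bytes(buf.get(4..4 + len as usize)?.try_into().ok()?);
       Some((Some(value), 4 + len as usize))
   }
   ```
   Each decoded `Some`/`None` would then feed something like an `arrow::array::Int32Builder`, which accumulates the values and the null bitmap for the column of the resulting `RecordBatch`.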
   
   MySQL has a similar high-level/frontend protocol, which I was hoping to leverage, but other databases don't always expose the same. MSSQL has the [Tabular Data Stream](https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-tds/893fcc7e-8a39-4b3c-815a-773b7b982c50), and the `tiberius` crate would enable us to tap into TDS without needing to serialize `TDS::ROW` data through `tiberius::Row` > `rdbc::ResultSet` > `arrow::_::RecordBatch`.
   
   **DataFusion**
   
   The intention is not to immediately build an interface into DataFusion as this PR has done; I was merely showing how one could consume data from SQL databases with the current state of the PR.
   
   > Given the relative immaturity of DataFusion, I think it is more likely to be used as a source rather than a federation engine that connects multiple sources together (e.g. Presto is much more mature and targeted at such use cases).
   
   For now, one could always materialize the SQL results into `Vec<RecordBatch>`, and use the in-memory table source, if they really want to use DataFusion instead of their SQL engine. 
   So @alamb yeah, we're far from a federated system like Presto, but I'd like us to get there someday.
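
   For concreteness, here's a rough sketch of that pattern. It's written against a recent DataFusion release (so it uses `SessionContext`, tokio and `MemTable::try_new`, which postdate the `ExecutionContext`-era API this comment refers to), and the batches are built by hand where a real program would have materialized them from a database:

   ```rust
   use std::sync::Arc;

   use datafusion::arrow::array::Int32Array;
   use datafusion::arrow::datatypes::{DataType, Field, Schema};
   use datafusion::arrow::record_batch::RecordBatch;
   use datafusion::datasource::MemTable;
   use datafusion::prelude::SessionContext;

   #[tokio::main]
   async fn main() -> datafusion::error::Result<()> {
       // Stand-in for batches materialized from a SQL database query.
       let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int32, false)]));
       let batch = RecordBatch::try_new(
           schema.clone(),
           vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
       )?;

       // Register the batches as an in-memory table and query them with DataFusion.
       let table = MemTable::try_new(schema, vec![vec![batch]])?;
       let ctx = SessionContext::new();
       ctx.register_table("orders", Arc::new(table))?;
       ctx.sql("SELECT id FROM orders WHERE id > 1").await?.show().await?;
       Ok(())
   }
   ```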
   
   I don't underestimate the amount of effort we'd need to put into the logical plan and optimizer, especially as that could require us to be able to translate logical plans to SQL statements.
   
   The Java ecosystem has Calcite for this, which makes things easier for Presto, Spark, etc.
   
   A good way of evaluating this work in the context of DataFusion could be to ask:
   
   "If DataFusion was created at a time where we had Arrow IO/interfaces for every datasource imaginable (Parquet, CSV, JSON, SQL, NoSQL, etc.), what 'query engine' capability would we want to reside in DF, and what functionality would we leave to users of DF like Ballista?"
   
   Lastly, @andygrove, Databricks have been working on a native vectorized query execution engine for Spark called Photon. There's not much detail about it as it's not open-sourced, but it's written in C++.
   I could see DataFusion as its OSS alternative when hooked in via FFI, and I would see the "Arrow SQL" work as a (hopefully optimized and faster) SQL interface that could be an alternative to Spark's JDBC datasources.
   
   ___
   
   In closing, the long-term goal is to build the various building blocks that one needs for native Rust-based analytics tools. It'll take us years, but as we go along, we'll try to carve out some territory and build some RPC/FFI interfaces, in the hope that with Rust's low memory footprint and interop, we can complement C and C++ applications in being embedded into, and sometimes replacing, some high-footprint tools.
   
   This is one such ambition, one that tries to do this from as clean a slate as is feasible.

