Posted to commits@arrow.apache.org by ag...@apache.org on 2021/04/18 16:58:39 UTC

[arrow-datafusion] branch master updated: duplicate DataFusion README for now to fix build

This is an automated email from the ASF dual-hosted git repository.

agrove pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-datafusion.git


The following commit(s) were added to refs/heads/master by this push:
     new 1675efb  duplicate DataFusion README for now to fix build
1675efb is described below

commit 1675efb9eda5be377e1581bc779d7d170fdd51d7
Author: Andy Grove <an...@gmail.com>
AuthorDate: Sun Apr 18 10:58:27 2021 -0600

    duplicate DataFusion README for now to fix build
---
 datafusion/README.md | 356 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 356 insertions(+)

diff --git a/datafusion/README.md b/datafusion/README.md
new file mode 100644
index 0000000..9e6b7a2
--- /dev/null
+++ b/datafusion/README.md
@@ -0,0 +1,356 @@
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# DataFusion
+
+<img src="datafusion/docs/images/DataFusion-Logo-Dark.svg" width="256"/>
+
+DataFusion is an extensible query execution framework, written in
+Rust, that uses [Apache Arrow](https://arrow.apache.org) as its
+in-memory format.
+
+DataFusion provides both an SQL and a DataFrame API for building
+logical query plans, as well as a query optimizer and an execution
+engine capable of parallel execution against partitioned data sources
+(CSV and Parquet) using threads.
+
+## Use Cases
+
+DataFusion is used to create modern, fast and efficient data
+pipelines, ETL processes, and database systems, which need the
+performance of Rust and Apache Arrow and want to provide their users
+the convenience of an SQL interface or a DataFrame API.
+
+## Why DataFusion?
+
+* *High Performance*: Leveraging Rust and Arrow's memory model, DataFusion achieves very high performance
+* *Easy to Connect*: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
+* *Easy to Embed*: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific use case
+* *High Quality*: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems
+
+## Known Uses
+
+Here are some of the projects known to use DataFusion:
+
+* [Ballista](https://github.com/ballista-compute/ballista) Distributed Compute Platform
+* [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust)
+* [Cube.js](https://github.com/cube-js/cube.js)
+* [datafusion-python](https://pypi.org/project/datafusion)
+* [delta-rs](https://github.com/delta-io/delta-rs)
+* [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database
+* [ROAPI](https://github.com/roapi/roapi)
+
+(if you know of another project, please submit a PR to add a link!)
+
+## Example Usage
+
+Run a SQL query against data stored in a CSV:
+
+```rust
+use datafusion::prelude::*;
+use arrow::util::pretty::print_batches;
+use arrow::record_batch::RecordBatch;
+
+#[tokio::main]
+async fn main() -> datafusion::error::Result<()> {
+  // register the table
+  let mut ctx = ExecutionContext::new();
+  ctx.register_csv("example", "tests/example.csv", CsvReadOptions::new())?;
+
+  // create a plan to run a SQL query
+  let df = ctx.sql("SELECT a, MIN(b) FROM example GROUP BY a LIMIT 100")?;
+
+  // execute and print results
+  let results: Vec<RecordBatch> = df.collect().await?;
+  print_batches(&results)?;
+  Ok(())
+}
+```
+
+Use the DataFrame API to process data stored in a CSV:
+
+```rust
+use datafusion::prelude::*;
+use arrow::util::pretty::print_batches;
+use arrow::record_batch::RecordBatch;
+
+#[tokio::main]
+async fn main() -> datafusion::error::Result<()> {
+  // create the dataframe
+  let mut ctx = ExecutionContext::new();
+  let df = ctx.read_csv("tests/example.csv", CsvReadOptions::new())?;
+
+  let df = df.filter(col("a").lt_eq(col("b")))?
+           .aggregate(vec![col("a")], vec![min(col("b"))])?
+           .limit(100)?;
+
+  // execute and print results
+  let results: Vec<RecordBatch> = df.collect().await?;
+  print_batches(&results)?;
+  Ok(())
+}
+```
+
+Both of these examples will produce:
+
+```text
++---+--------+
+| a | MIN(b) |
++---+--------+
+| 1 | 2      |
++---+--------+
+```
+
+
+
+## Using DataFusion as a library
+
+DataFusion is [published on crates.io](https://crates.io/crates/datafusion), and is [well documented on docs.rs](https://docs.rs/datafusion/).
+
+To get started, add the following to your `Cargo.toml` file:
+
+```toml
+[dependencies]
+datafusion = "4.0.0-SNAPSHOT"
+```
+
+## Using DataFusion as a binary
+
+DataFusion also includes a simple command-line interactive SQL utility. See the [CLI reference](docs/cli.md) for more information.
+
+# Status
+
+## General
+
+- [x] SQL Parser
+- [x] SQL Query Planner
+- [x] Query Optimizer
+  - [x] Constant folding
+  - [x] Join reordering
+  - [x] Limit pushdown
+  - [x] Projection pushdown
+  - [x] Predicate pushdown
+- [x] Type coercion
+- [x] Parallel query execution
+
+## SQL Support
+
+- [x] Projection
+- [x] Filter (WHERE)
+- [x] Filter post-aggregate (HAVING)
+- [x] Limit
+- [x] Aggregate
+- [x] Common math functions
+- [x] cast
+- [x] try_cast
+- PostgreSQL-compatible string functions
+  - [x] ascii
+  - [x] bit_length
+  - [x] btrim
+  - [x] char_length
+  - [x] character_length
+  - [x] chr
+  - [x] concat
+  - [x] concat_ws
+  - [x] initcap
+  - [x] left
+  - [x] length
+  - [x] lpad
+  - [x] ltrim
+  - [x] octet_length
+  - [x] regexp_replace
+  - [x] repeat
+  - [x] replace
+  - [x] reverse
+  - [x] right
+  - [x] rpad
+  - [x] rtrim
+  - [x] split_part
+  - [x] starts_with
+  - [x] strpos
+  - [x] substr
+  - [x] to_hex
+  - [x] translate
+  - [x] trim
+- Miscellaneous/Boolean functions
+  - [x] nullif
+- Common date/time functions
+  - [ ] Basic date functions
+  - [ ] Basic time functions
+  - [x] Basic timestamp functions
+- Nested functions
+  - [x] Array of columns
+- [x] Schema Queries
+  - [x] SHOW TABLES
+  - [x] SHOW COLUMNS
+  - [x] information_schema.{tables, columns}
+  - [ ] information_schema other views
+- [x] Sorting
+- [ ] Nested types
+- [ ] Lists
+- [x] Subqueries
+- [x] Common table expressions
+- [ ] Set Operations
+  - [x] UNION ALL
+  - [ ] UNION
+  - [ ] INTERSECT
+  - [ ] MINUS
+- [x] Joins
+  - [x] INNER JOIN
+  - [ ] CROSS JOIN
+  - [ ] OUTER JOIN
+- [ ] Window
+
+## Data Sources
+
+- [x] CSV
+- [x] Parquet primitive types
+- [ ] Parquet nested types
+
+
+## Extensibility
+
+DataFusion is designed to be extensible at all points. To that end, you can provide your own custom:
+
+- [x] User Defined Functions (UDFs)
+- [x] User Defined Aggregate Functions (UDAFs)
+- [x] User Defined Table Source (`TableProvider`) for tables
+- [x] User Defined `Optimizer` passes (plan rewrites)
+- [x] User Defined `LogicalPlan` nodes
+- [x] User Defined `ExecutionPlan` nodes
+
+
+# Supported SQL
+
+This library currently supports many SQL constructs, including:
+
+* `CREATE EXTERNAL TABLE X STORED AS PARQUET LOCATION '...';` to register a table's location
+* `SELECT ... FROM ...` together with any expression
+* `ALIAS` to name an expression
+* `CAST` to change types, including e.g. `Timestamp(Nanosecond, None)`
+* Most mathematical unary and binary expressions, such as `+`, `/`, `sqrt`, `tan`, `>=`
+* `WHERE` to filter
+* `GROUP BY` together with one of the following aggregations: `MIN`, `MAX`, `COUNT`, `SUM`, `AVG`
+* `ORDER BY` together with an expression and optional `ASC` or `DESC` and also optional `NULLS FIRST` or `NULLS LAST`
+
+
+## Supported Functions
+
+DataFusion strives to implement a subset of the [PostgreSQL SQL dialect](https://www.postgresql.org/docs/current/functions.html) where possible. We explicitly choose a single dialect to maximize interoperability with other tools and allow reuse of the PostgreSQL documents and tutorials as much as possible.
+
+Currently, only a subset of the PostgreSQL dialect is implemented, and we will document any deviations.
+
+## Schema Metadata / Information Schema Support
+
+DataFusion supports showing metadata about the available tables. This information can be accessed using the views of the ISO SQL `information_schema` schema or the DataFusion-specific `SHOW TABLES` and `SHOW COLUMNS` commands.
+
+More information can be found in the [Postgres docs](https://www.postgresql.org/docs/13/infoschema-schema.html).
+
+
+To show the tables available for use in DataFusion, use the `SHOW TABLES` command or the `information_schema.tables` view:
+
+```sql
+> show tables;
++---------------+--------------------+------------+------------+
+| table_catalog | table_schema       | table_name | table_type |
++---------------+--------------------+------------+------------+
+| datafusion    | public             | t          | BASE TABLE |
+| datafusion    | information_schema | tables     | VIEW       |
++---------------+--------------------+------------+------------+
+
+> select * from information_schema.tables;
+
++---------------+--------------------+------------+--------------+
+| table_catalog | table_schema       | table_name | table_type   |
++---------------+--------------------+------------+--------------+
+| datafusion    | public             | t          | BASE TABLE   |
+| datafusion    | information_schema | TABLES     | SYSTEM TABLE |
++---------------+--------------------+------------+--------------+
+```
+
+To show the schema of a table in DataFusion, use the `SHOW COLUMNS` command or the `information_schema.columns` view:
+
+```sql
+> show columns from t;
++---------------+--------------+------------+-------------+-----------+-------------+
+| table_catalog | table_schema | table_name | column_name | data_type | is_nullable |
++---------------+--------------+------------+-------------+-----------+-------------+
+| datafusion    | public       | t          | a           | Int32     | NO          |
+| datafusion    | public       | t          | b           | Utf8      | NO          |
+| datafusion    | public       | t          | c           | Float32   | NO          |
++---------------+--------------+------------+-------------+-----------+-------------+
+
+>   select table_name, column_name, ordinal_position, is_nullable, data_type from information_schema.columns;
++------------+-------------+------------------+-------------+-----------+
+| table_name | column_name | ordinal_position | is_nullable | data_type |
++------------+-------------+------------------+-------------+-----------+
+| t          | a           | 0                | NO          | Int32     |
+| t          | b           | 1                | NO          | Utf8      |
+| t          | c           | 2                | NO          | Float32   |
++------------+-------------+------------------+-------------+-----------+
+```
+
+
+
+## Supported Data Types
+
+DataFusion uses Arrow, and thus the Arrow type system, for query
+execution. The SQL types from
+[sqlparser-rs](https://github.com/ballista-compute/sqlparser-rs/blob/main/src/ast/data_type.rs#L57)
+are mapped to Arrow types according to the following table:
+
+
+| SQL Data Type   | Arrow DataType                   |
+| --------------- | -------------------------------- |
+| `CHAR`          | `Utf8`                           |
+| `VARCHAR`       | `Utf8`                           |
+| `UUID`          | *Not yet supported*              |
+| `CLOB`          | *Not yet supported*              |
+| `BINARY`        | *Not yet supported*              |
+| `VARBINARY`     | *Not yet supported*              |
+| `DECIMAL`       | `Float64`                        |
+| `FLOAT`         | `Float32`                        |
+| `SMALLINT`      | `Int16`                          |
+| `INT`           | `Int32`                          |
+| `BIGINT`        | `Int64`                          |
+| `REAL`          | `Float64`                        |
+| `DOUBLE`        | `Float64`                        |
+| `BOOLEAN`       | `Boolean`                        |
+| `DATE`          | `Date32`                         |
+| `TIME`          | `Time64(TimeUnit::Millisecond)`  |
+| `TIMESTAMP`     | `Date64`                         |
+| `INTERVAL`      | *Not yet supported*              |
+| `REGCLASS`      | *Not yet supported*              |
+| `TEXT`          | *Not yet supported*              |
+| `BYTEA`         | *Not yet supported*              |
+| `CUSTOM`        | *Not yet supported*              |
+| `ARRAY`         | *Not yet supported*              |
+
+
+# Architecture Overview
+
+There is no formal document describing DataFusion's architecture yet, but the following presentations offer a good overview of its components and how they interact.
+
+* (March 2021): The DataFusion architecture is described in *Query Engine Design and the Rust-Based DataFusion in Apache Arrow*: [recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content starts ~15 minutes in) and [slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934)
+* (February 2021): How DataFusion is used within the Ballista Project is described in *Ballista: Distributed Compute with Rust and Apache Arrow*: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)
+
+
+# Developer's guide
+
+Please see the [Developer's Guide](DEVELOPERS.md) for information about developing DataFusion.
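
The SQL-to-Arrow mapping listed in the README's "Supported Data Types" table can be sketched as a small lookup function. This is a minimal, standalone illustration and not DataFusion's planner code: the function name `arrow_type_for` is hypothetical, and the Arrow types are returned as plain strings so that the sketch needs no external crates. Types the table marks as "Not yet supported", as well as unknown names, are assumed to map to `None`.

```rust
/// Hypothetical helper (not part of DataFusion): returns the Arrow type name
/// that the README table lists for a SQL type name, or `None` where the table
/// says "Not yet supported" or the name is unknown.
fn arrow_type_for(sql_type: &str) -> Option<&'static str> {
    // Normalise case so that "varchar" and "VARCHAR" are treated alike.
    match sql_type.to_ascii_uppercase().as_str() {
        "CHAR" | "VARCHAR" => Some("Utf8"),
        "DECIMAL" | "REAL" | "DOUBLE" => Some("Float64"),
        "FLOAT" => Some("Float32"),
        "SMALLINT" => Some("Int16"),
        "INT" => Some("Int32"),
        "BIGINT" => Some("Int64"),
        "BOOLEAN" => Some("Boolean"),
        "DATE" => Some("Date32"),
        "TIME" => Some("Time64(TimeUnit::Millisecond)"),
        "TIMESTAMP" => Some("Date64"),
        // UUID, CLOB, BINARY, VARBINARY, INTERVAL, REGCLASS, TEXT, BYTEA,
        // CUSTOM, ARRAY, and anything else not listed in the table.
        _ => None,
    }
}

fn main() {
    for name in ["INT", "varchar", "TIMESTAMP", "UUID"] {
        match arrow_type_for(name) {
            Some(arrow) => println!("{} -> {}", name, arrow),
            None => println!("{} -> not yet supported", name),
        }
    }
}
```

Returning `Option` keeps the "not yet supported" rows explicit at the call site; a real planner would more likely return an error carrying the offending type name.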