Posted to commits@arrow.apache.org by al...@apache.org on 2023/01/17 21:03:58 UTC
[arrow-datafusion] branch master updated: Update main DataFusion README (#4903)
This is an automated email from the ASF dual-hosted git repository.
alamb pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-datafusion.git
The following commit(s) were added to refs/heads/master by this push:
new b756d053e Update main DataFusion README (#4903)
b756d053e is described below
commit b756d053e11e971eff8bb9e6ce0a2bf6dbf1177f
Author: Andrew Lamb <an...@nerdnetworks.org>
AuthorDate: Tue Jan 17 22:03:52 2023 +0100
Update main DataFusion README (#4903)
* Update main DataFusion README
* todos
* add kamu
* Apply suggestions from code review
Co-authored-by: Andy Grove <an...@gmail.com>
Co-authored-by: jakevin <ja...@gmail.com>
* Add note about databend
* Wordsmithing
* Update README.md
Co-authored-by: Liang-Chi Hsieh <vi...@gmail.com>
Co-authored-by: Andy Grove <an...@gmail.com>
Co-authored-by: jakevin <ja...@gmail.com>
Co-authored-by: Liang-Chi Hsieh <vi...@gmail.com>
---
README.md | 112 ++++++++++++++++++++++++++-------
docs/source/user-guide/introduction.md | 10 +--
2 files changed, 94 insertions(+), 28 deletions(-)
diff --git a/README.md b/README.md
index ceceb4296..237c21aa4 100644
--- a/README.md
+++ b/README.md
@@ -21,45 +21,91 @@
<img src="docs/source/_static/images/DataFusion-Logo-Background-White.svg" width="256" alt="logo"/>
-DataFusion is an extensible query planning, optimization, and execution framework, written in
-Rust, that uses [Apache Arrow](https://arrow.apache.org) as its
+DataFusion is a very fast, extensible query engine for building high-quality data-centric systems in
+[Rust](http://rustlang.org), using the [Apache Arrow](https://arrow.apache.org)
in-memory format.
+DataFusion offers SQL and DataFrame APIs, excellent [performance](https://benchmark.clickhouse.com/), built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community.
+
[![Coverage Status](https://codecov.io/gh/apache/arrow-datafusion/rust/branch/master/graph/badge.svg)](https://codecov.io/gh/apache/arrow-datafusion?branch=master)
## Features
-- SQL query planner with support for multiple SQL dialects
-- DataFrame API
-- Parquet, CSV, JSON, and Avro file formats are supported natively. Custom
- file formats can be supported by implementing a `TableProvider` trait.
-- Supports popular object stores, including AWS S3, Azure Blob
- Storage, and Google Cloud Storage. There are extension points for implementing
- custom object stores.
+- Feature-rich [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html) and [DataFrame API](https://arrow.apache.org/datafusion/user-guide/dataframe.html)
+- Blazingly fast, vectorized, multi-threaded, streaming execution engine.
+- Native support for Parquet, CSV, JSON, and Avro file formats. Support
+ for custom file formats and non-file data sources via the `TableProvider` trait.
+- Many extension points: user-defined scalar/aggregate/window functions, DataSources, SQL,
+ other query languages, custom plan and execution nodes, optimizer passes, and more.
+- Streaming, asynchronous IO directly from popular object stores, including AWS S3,
+ Azure Blob Storage, and Google Cloud Storage. Other storage systems are supported via the
+ `ObjectStore` trait.
+- [Excellent Documentation](https://docs.rs/datafusion/latest) and a
+ [welcoming community](https://arrow.apache.org/datafusion/community/communication.html).
+- A state-of-the-art query optimizer with projection and filter pushdown, sort-aware optimizations,
+ automatic join reordering, expression coercion, and more.
+- Permissive Apache 2.0 License, Apache Software Foundation governance
+- Written in [Rust](https://www.rust-lang.org/), a modern systems language with development
+ productivity similar to Java or Golang and the performance of C++, and one that is
+ [loved by programmers everywhere](https://insights.stackoverflow.com/survey/2021#technology-most-loved-dreaded-and-wanted).
## Use Cases
-DataFusion is modular in design with many extension points and can be
-used without modification as an embedded query engine and can also provide
-a foundation for building new systems. Here are some example use cases:
+DataFusion can be used without modification as an embedded SQL
+engine or can be customized and used as a foundation for
+building new systems. Here are some examples of systems built using DataFusion:
+
+- Specialized analytical database systems such as [CeresDB] and more general Spark-like systems such as [Ballista].
+- New query language engines such as [prql-query] and accelerators such as [VegaFusion]
+- Research platform for new Database Systems, such as [Flock]
+- SQL support for another library, such as [dask sql]
+- Streaming data platforms such as [Synnada]
+- Tools for reading / sorting / transcoding Parquet, CSV, Avro, and JSON files such as [qv]
+- A faster Spark runtime replacement, such as [blaze-rs][blaze]
-- DataFusion can be used as a SQL query planner and query optimizer, providing
- optimized logical plans that can then be mapped to other execution engines.
-- DataFusion is used to create modern, fast and efficient data
- pipelines, ETL processes, and database systems, which need the
- performance of Rust and Apache Arrow and want to provide their users
- the convenience of an SQL interface or a DataFrame API.
+By using DataFusion, these projects are free to focus on their specific
+features and avoid reimplementing general (but still necessary)
+features such as an expression representation, standard optimizations,
+execution plans, file format support, etc.
## Why DataFusion?
-- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion achieves very high performance
+- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion is very fast.
- _Easy to Connect_: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
- _Easy to Embed_: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific use case
- _High Quality_: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.
+## Comparisons with other projects
+
+Here is a comparison with similar projects that may help you understand
+when DataFusion might or might not be suitable for your needs:
+
+- [DuckDB](http://www.duckdb.org) is an open source, in-process analytic database.
+ Like DataFusion, it supports very fast execution, both from its custom file format
+ and directly from Parquet files. Unlike DataFusion, it is written in C/C++ and it
+ is primarily used directly by users as a serverless database and query system rather
+ than as a library for building such database systems.
+
+- [Polars](http://pola.rs): Polars is one of the fastest DataFrame
+ libraries at the time of writing. Like DataFusion, it is also
+ written in Rust and uses the Apache Arrow memory model, but unlike
+ DataFusion, it provides neither SQL support nor as many extension points.
+
+- [Facebook Velox](https://engineering.fb.com/2022/08/31/open-source/velox/)
+ is an execution engine. Like DataFusion, Velox aims to
+ provide a reusable foundation for building database-like systems. Unlike DataFusion,
+ it is written in C/C++ and does not include a SQL frontend or planning/optimization
+ framework.
+
+- [DataBend](https://github.com/datafuselabs/databend) is a complete
+ database system. Like DataFusion, it is also written in Rust and
+ utilizes the Apache Arrow memory model, but unlike DataFusion it
+ targets end-users rather than developers of other database systems.
+
## DataFusion Community Extensions
-There are a number of community projects that extend DataFusion or provide integrations with other systems.
+There are a number of community projects that extend DataFusion or
+provide integrations with other systems.
### Language Bindings
@@ -99,9 +145,29 @@ Here are some of the projects known to use DataFusion:
- [Tensorbase](https://github.com/tensorbase/tensorbase)
- [VegaFusion](https://vegafusion.io/) Server-side acceleration for the [Vega](https://vega.github.io/) visualization grammar
-(if you know of another project, please submit a PR to add a link!)
-
-## Example Usage
+[ballista]: https://github.com/apache/arrow-ballista
+[blaze]: https://github.com/blaze-init/blaze
+[ceresdb]: https://github.com/CeresDB/ceresdb
+[cloudfuse buzz]: https://github.com/cloudfuse-io/buzz-rust
+[cnosdb]: https://github.com/cnosdb/cnosdb
+[cube store]: https://github.com/cube-js/cube.js/tree/master/rust
+[dask sql]: https://github.com/dask-contrib/dask-sql
+[datafusion-tui]: https://github.com/datafusion-contrib/datafusion-tui
+[delta-rs]: https://github.com/delta-io/delta-rs
+[flock]: https://github.com/flock-lab/flock
+[kamu]: https://github.com/kamu-data/kamu-cli
+[greptime db]: https://github.com/GreptimeTeam/greptimedb
+[influxdb iox]: https://github.com/influxdata/influxdb_iox
+[parseable]: https://github.com/parseablehq/parseable
+[prql-query]: https://github.com/prql/prql-query
+[qv]: https://github.com/timvw/qv
+[roapi]: https://github.com/roapi/roapi
+[seafowl]: https://github.com/splitgraph/seafowl
+[synnada]: https://synnada.ai/
+[tensorbase]: https://github.com/tensorbase/tensorbase
+[vegafusion]: https://vegafusion.io/ "if you know of another project, please submit a PR to add a link!"
+
+## Examples
Please see the [example usage](https://arrow.apache.org/datafusion/user-guide/example-usage.html) in the user guide and the [datafusion-examples](https://github.com/apache/arrow-datafusion/tree/master/datafusion-examples) crate for more information on how to use DataFusion.
diff --git a/docs/source/user-guide/introduction.md b/docs/source/user-guide/introduction.md
index e16504091..64b6be9d2 100644
--- a/docs/source/user-guide/introduction.md
+++ b/docs/source/user-guide/introduction.md
@@ -23,10 +23,10 @@ DataFusion is an extensible query execution framework, written in
Rust, that uses [Apache Arrow](https://arrow.apache.org) as its
in-memory format.
-DataFusion supports both an SQL and a DataFrame API for building
-logical query plans as well as a query optimizer and execution engine
-capable of parallel execution against partitioned data sources (CSV
-and Parquet) using threads.
+DataFusion supports SQL and a DataFrame API for building logical query
+plans, an extensive query optimizer, and a multi-threaded parallel
+execution engine for processing partitioned data sources
+such as CSV and Parquet files extremely quickly.
## Use Cases
@@ -37,7 +37,7 @@ the convenience of an SQL interface or a DataFrame API.
## Why DataFusion?
-- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion achieves very high performance
+- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion is very fast.
- _Easy to Connect_: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
- _Easy to Embed_: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific usecase
- _High Quality_: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.