Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/04/11 11:24:14 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request, #5962: [DOCS]: consolidate doc site content simplify navbar

alamb opened a new pull request, #5962:
URL: https://github.com/apache/arrow-datafusion/pull/5962

   # Which issue does this PR close?
   Closes https://github.com/apache/arrow-datafusion/issues/5935
   
   # Rationale for this change
   The main page / index of the https://arrow.apache.org/datafusion/ site is somewhat disorganized and redundant, and it has so many entries that it is causing issues such as https://github.com/apache/arrow-datafusion/issues/5935
   
   Here is a screenshot of the current site:
   ![Screenshot 2023-04-11 at 7 20 00 AM](https://user-images.githubusercontent.com/490673/231146819-e9a80483-30ee-4667-993a-5eb49eb5ef4b.png)
   
   
   # What changes are included in this PR?
   1. Consolidate some top level pages into lower level pages ("improve the site navigation") to reduce the size of the initial index and organize the content better. 
   2. Various small improvements
   
   # Are these changes tested?
   
   I rendered the site locally and it looks better to me:
   
   ![Screenshot 2023-04-11 at 7 22 45 AM](https://user-images.githubusercontent.com/490673/231146741-e213638b-c7a9-4df0-b8e0-5a0d05624c04.png)
   
   
   
   # Are there any user-facing changes?
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5962: [DOCS]: consolidate doc site content simplify navbar

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5962:
URL: https://github.com/apache/arrow-datafusion/pull/5962#discussion_r1164620034


##########
docs/source/contributor-guide/architecture.md:
##########
@@ -20,7 +20,8 @@
 # Architecture
 
 DataFusion's code structure and organization is described in the
-[Crate Documentation], to keep it as close to the source as
-possible.
+[crates.io documentation], to keep it as close to the source as
+possible. You can find the most up to date version in the [source code].

Review Comment:
   Filed https://github.com/apache/arrow-datafusion/issues/5981





[GitHub] [arrow-datafusion] waynexia commented on a diff in pull request #5962: [DOCS]: consolidate doc site content simplify navbar

Posted by "waynexia (via GitHub)" <gi...@apache.org>.
waynexia commented on code in PR #5962:
URL: https://github.com/apache/arrow-datafusion/pull/5962#discussion_r1164944084


##########
docs/source/user-guide/example-usage.md:
##########
@@ -141,3 +141,112 @@ async fn main() -> datafusion::error::Result<()> {
 | 1 | 2      |
 +---+--------+
 ```
+
+# Using DataFusion as a library
+
+## Create a new project
+
+```shell
+cargo new hello_datafusion
+```
+
+```shell
+$ cd hello_datafusion
+$ tree .
+.
+├── Cargo.toml
+└── src
+    └── main.rs
+
+1 directory, 2 files
+```
+
+## Default Configuration
+
+DataFusion is [published on crates.io](https://crates.io/crates/datafusion), and is [well documented on docs.rs](https://docs.rs/datafusion/).
+
+To get started, add the following to your `Cargo.toml` file:
+
+```toml
+[dependencies]
+datafusion = "11.0"
+```
+
+## Create a main function
+
+Update the main.rs file with your first datafusion application based on [Example usage](https://arrow.apache.org/datafusion/user-guide/example-usage.html)
+
+```rust
+use datafusion::prelude::*;
+
+#[tokio::main]
+async fn main() -> datafusion::error::Result<()> {
+  // register the table
+  let ctx = SessionContext::new();
+  ctx.register_csv("test", "<PATH_TO_YOUR_CSV_FILE>", CsvReadOptions::new()).await?;
+
+  // create a plan to run a SQL query
+  let df = ctx.sql("SELECT * FROM test").await?;
+
+  // execute and print results
+  df.show().await?;
+  Ok(())
+}
+```
+
+## Extensibility
+
+DataFusion is designed to be extensible at all points. To that end, you can provide your own custom:
+
+- [x] User Defined Functions (UDFs)
+- [x] User Defined Aggregate Functions (UDAFs)
+- [x] User Defined Table Source (`TableProvider`) for tables
+- [x] User Defined `Optimizer` passes (plan rewrites)
+- [x] User Defined `LogicalPlan` nodes
+- [x] User Defined `ExecutionPlan` nodes
+
+## Rust Version Compatibility
+
+This crate is tested with the latest stable version of Rust. We do not currently test against other, older versions of the Rust compiler.
+
+## Optimized Configuration
+
+For an optimized build several steps are required. First, use the below in your `Cargo.toml`. It is
+worth noting that using the settings in the `[profile.release]` section will significantly increase the build time.
+
+```toml
+[dependencies]
+datafusion = { version = "11.0" , features = ["simd"]}

Review Comment:
   Sounds good, I filed #5983 





[GitHub] [arrow-datafusion] alamb commented on pull request #5962: [DOCS]: consolidate doc site content simplify navbar

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #5962:
URL: https://github.com/apache/arrow-datafusion/pull/5962#issuecomment-1505877935

   Ok, I think this is better than what is on main so I will merge it in. We clearly have a ways to go to have wonderful docs




[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5962: [DOCS]: consolidate doc site content simplify navbar

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5962:
URL: https://github.com/apache/arrow-datafusion/pull/5962#discussion_r1164574067


##########
docs/source/user-guide/faq.md:
##########
@@ -29,3 +29,37 @@ model and computational kernels. It is designed to run within a single process,
 for parallel query execution.
 
 [Ballista](https://github.com/apache/arrow-ballista) is a distributed compute platform built on DataFusion.
+
+# How does DataFusion Compare with `XYZ`?
+
+When compared to similar systems, DataFusion typically is:
+
+1. Targeted at developers, rather than end users / data scientists.
+2. Designed to be embedded, rather than a complete file based SQL system.
+3. Governed by the [Apache Software Foundation](https://www.apache.org/) process, rather than a single company or individual.
+4. Implemented in `Rust`, rather than `C/C++`
+
+Here is a comparison with similar projects that may help understand
+when DataFusion might be be suitable and unsuitable for your needs:
+
+- [DuckDB](http://www.duckdb.org) is an open source, in process analytic database.
+  Like DataFusion, it supports very fast execution, both from its custom file format
+  and directly from parquet files. Unlike DataFusion, it is written in C/C++ and it
+  is primarily used directly by users as a serverless database and query system rather
+  than as a library for building such database systems.
+
+- [Polars](http://pola.rs): Polars is one of the fastest DataFrame
+  libraries at the time of writing. Like DataFusion, it is also
+  written in Rust and uses the Apache Arrow memory model, but unlike
+  DataFusion it does not provide SQL nor as many extension points.

Review Comment:
   Yes I think you are correct. 





[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5962: [DOCS]: consolidate doc site content simplify navbar

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5962:
URL: https://github.com/apache/arrow-datafusion/pull/5962#discussion_r1164578596


##########
docs/source/contributor-guide/architecture.md:
##########
@@ -20,7 +20,8 @@
 # Architecture
 
 DataFusion's code structure and organization is described in the
-[Crate Documentation], to keep it as close to the source as
-possible.
+[crates.io documentation], to keep it as close to the source as
+possible. You can find the most up to date version in the [source code].

Review Comment:
   I think it is a great idea -- thank you @waynexia 
   
   I actually think if we could build those API docs as part of the https://github.com/apache/arrow-datafusion/blob/main/docs build, they would "automatically" get hosted on https://arrow.apache.org/datafusion/
   
   https://arrow.apache.org/datafusion/ is published via some ASF mechanism that is similar to github pages
   
   Specifically, this workflow
   
   https://github.com/apache/arrow-datafusion/blob/388f9ec3e7f7c09dac56ee0fe074ca97a6af9d44/.github/workflows/docs.yaml#L12-L64
   
   pushes to the https://github.com/apache/arrow-datafusion/tree/asf-site branch which then gets hosted via this magic yaml: 
   
   https://github.com/apache/arrow-datafusion/blob/388f9ec3e7f7c09dac56ee0fe074ca97a6af9d44/.asf.yaml#L48-L52





[GitHub] [arrow-datafusion] Jefffrey commented on a diff in pull request #5962: [DOCS]: consolidate doc site content simplify navbar

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.
Jefffrey commented on code in PR #5962:
URL: https://github.com/apache/arrow-datafusion/pull/5962#discussion_r1162735593


##########
docs/source/user-guide/example-usage.md:
##########
@@ -141,3 +141,112 @@ async fn main() -> datafusion::error::Result<()> {
 | 1 | 2      |
 +---+--------+
 ```
+
+# Using DataFusion as a library
+
+## Create a new project
+
+```shell
+cargo new hello_datafusion
+```
+
+```shell
+$ cd hello_datafusion
+$ tree .
+.
+├── Cargo.toml
+└── src
+    └── main.rs
+
+1 directory, 2 files
+```
+
+## Default Configuration
+
+DataFusion is [published on crates.io](https://crates.io/crates/datafusion), and is [well documented on docs.rs](https://docs.rs/datafusion/).
+
+To get started, add the following to your `Cargo.toml` file:
+
+```toml
+[dependencies]
+datafusion = "11.0"

Review Comment:
   Bump to the latest version here? (Ditto anywhere else a version is mentioned.)



##########
docs/source/user-guide/example-usage.md:
##########
@@ -141,3 +141,112 @@ async fn main() -> datafusion::error::Result<()> {
 | 1 | 2      |
 +---+--------+
 ```
+
+# Using DataFusion as a library
+
+## Create a new project
+
+```shell
+cargo new hello_datafusion
+```
+
+```shell
+$ cd hello_datafusion
+$ tree .
+.
+├── Cargo.toml
+└── src
+    └── main.rs
+
+1 directory, 2 files
+```
+
+## Default Configuration
+
+DataFusion is [published on crates.io](https://crates.io/crates/datafusion), and is [well documented on docs.rs](https://docs.rs/datafusion/).
+
+To get started, add the following to your `Cargo.toml` file:
+
+```toml
+[dependencies]
+datafusion = "11.0"
+```
+
+## Create a main function
+
+Update the main.rs file with your first datafusion application based on [Example usage](https://arrow.apache.org/datafusion/user-guide/example-usage.html)
+
+```rust
+use datafusion::prelude::*;
+
+#[tokio::main]
+async fn main() -> datafusion::error::Result<()> {
+  // register the table
+  let ctx = SessionContext::new();
+  ctx.register_csv("test", "<PATH_TO_YOUR_CSV_FILE>", CsvReadOptions::new()).await?;
+
+  // create a plan to run a SQL query
+  let df = ctx.sql("SELECT * FROM test").await?;
+
+  // execute and print results
+  df.show().await?;
+  Ok(())
+}
+```

Review Comment:
   This example feels redundant with the example code in the sections above.



##########
docs/source/user-guide/faq.md:
##########
@@ -29,3 +29,37 @@ model and computational kernels. It is designed to run within a single process,
 for parallel query execution.
 
 [Ballista](https://github.com/apache/arrow-ballista) is a distributed compute platform built on DataFusion.
+
+# How does DataFusion Compare with `XYZ`?
+
+When compared to similar systems, DataFusion typically is:
+
+1. Targeted at developers, rather than end users / data scientists.
+2. Designed to be embedded, rather than a complete file based SQL system.
+3. Governed by the [Apache Software Foundation](https://www.apache.org/) process, rather than a single company or individual.
+4. Implemented in `Rust`, rather than `C/C++`
+
+Here is a comparison with similar projects that may help understand
+when DataFusion might be be suitable and unsuitable for your needs:
+
+- [DuckDB](http://www.duckdb.org) is an open source, in process analytic database.

Review Comment:
   change to https link?



##########
docs/source/user-guide/example-usage.md:
##########
@@ -141,3 +141,112 @@ async fn main() -> datafusion::error::Result<()> {
 | 1 | 2      |
 +---+--------+
 ```
+
+# Using DataFusion as a library
+
+## Create a new project
+
+```shell
+cargo new hello_datafusion
+```
+
+```shell
+$ cd hello_datafusion
+$ tree .
+.
+├── Cargo.toml
+└── src
+    └── main.rs
+
+1 directory, 2 files
+```
+
+## Default Configuration
+
+DataFusion is [published on crates.io](https://crates.io/crates/datafusion), and is [well documented on docs.rs](https://docs.rs/datafusion/).
+
+To get started, add the following to your `Cargo.toml` file:
+
+```toml
+[dependencies]
+datafusion = "11.0"
+```
+
+## Create a main function
+
+Update the main.rs file with your first datafusion application based on [Example usage](https://arrow.apache.org/datafusion/user-guide/example-usage.html)

Review Comment:
   Is this a self-link to this same page?



##########
docs/source/user-guide/faq.md:
##########
@@ -29,3 +29,37 @@ model and computational kernels. It is designed to run within a single process,
 for parallel query execution.
 
 [Ballista](https://github.com/apache/arrow-ballista) is a distributed compute platform built on DataFusion.
+
+# How does DataFusion Compare with `XYZ`?
+
+When compared to similar systems, DataFusion typically is:
+
+1. Targeted at developers, rather than end users / data scientists.
+2. Designed to be embedded, rather than a complete file based SQL system.
+3. Governed by the [Apache Software Foundation](https://www.apache.org/) process, rather than a single company or individual.
+4. Implemented in `Rust`, rather than `C/C++`
+
+Here is a comparison with similar projects that may help understand
+when DataFusion might be be suitable and unsuitable for your needs:
+
+- [DuckDB](http://www.duckdb.org) is an open source, in process analytic database.
+  Like DataFusion, it supports very fast execution, both from its custom file format
+  and directly from parquet files. Unlike DataFusion, it is written in C/C++ and it
+  is primarily used directly by users as a serverless database and query system rather
+  than as a library for building such database systems.
+
+- [Polars](http://pola.rs): Polars is one of the fastest DataFrame
+  libraries at the time of writing. Like DataFusion, it is also
+  written in Rust and uses the Apache Arrow memory model, but unlike
+  DataFusion it does not provide SQL nor as many extension points.
+
+- [Facebook Velox](https://engineering.fb.com/2022/08/31/open-source/velox/)

Review Comment:
   could switch to github link: https://github.com/facebookincubator/velox since this link seems dead



##########
docs/source/user-guide/faq.md:
##########
@@ -29,3 +29,37 @@ model and computational kernels. It is designed to run within a single process,
 for parallel query execution.
 
 [Ballista](https://github.com/apache/arrow-ballista) is a distributed compute platform built on DataFusion.
+
+# How does DataFusion Compare with `XYZ`?
+
+When compared to similar systems, DataFusion typically is:
+
+1. Targeted at developers, rather than end users / data scientists.
+2. Designed to be embedded, rather than a complete file based SQL system.
+3. Governed by the [Apache Software Foundation](https://www.apache.org/) process, rather than a single company or individual.
+4. Implemented in `Rust`, rather than `C/C++`
+
+Here is a comparison with similar projects that may help understand
+when DataFusion might be be suitable and unsuitable for your needs:
+
+- [DuckDB](http://www.duckdb.org) is an open source, in process analytic database.
+  Like DataFusion, it supports very fast execution, both from its custom file format
+  and directly from parquet files. Unlike DataFusion, it is written in C/C++ and it
+  is primarily used directly by users as a serverless database and query system rather
+  than as a library for building such database systems.
+
+- [Polars](http://pola.rs): Polars is one of the fastest DataFrame
+  libraries at the time of writing. Like DataFusion, it is also
+  written in Rust and uses the Apache Arrow memory model, but unlike
+  DataFusion it does not provide SQL nor as many extension points.

Review Comment:
   Change to an https URL.
   
   Also, I think Polars might support SQL now, according to their docs: https://pola-rs.github.io/polars-book/user-guide/sql.html





[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5962: [DOCS]: consolidate doc site content simplify navbar

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5962:
URL: https://github.com/apache/arrow-datafusion/pull/5962#discussion_r1164586488


##########
docs/source/user-guide/example-usage.md:
##########
@@ -141,3 +141,112 @@ async fn main() -> datafusion::error::Result<()> {
 | 1 | 2      |
 +---+--------+
 ```
+
+# Using DataFusion as a library
+
+## Create a new project
+
+```shell
+cargo new hello_datafusion
+```
+
+```shell
+$ cd hello_datafusion
+$ tree .
+.
+├── Cargo.toml
+└── src
+    └── main.rs
+
+1 directory, 2 files
+```
+
+## Default Configuration
+
+DataFusion is [published on crates.io](https://crates.io/crates/datafusion), and is [well documented on docs.rs](https://docs.rs/datafusion/).
+
+To get started, add the following to your `Cargo.toml` file:
+
+```toml
+[dependencies]
+datafusion = "11.0"
+```
+
+## Create a main function
+
+Update the main.rs file with your first datafusion application based on [Example usage](https://arrow.apache.org/datafusion/user-guide/example-usage.html)

Review Comment:
   Yes, that is a good catch. This whole page has some non-trivial redundancy. I will try to fix it up.



##########
docs/source/user-guide/faq.md:
##########
@@ -29,3 +29,37 @@ model and computational kernels. It is designed to run within a single process,
 for parallel query execution.
 
 [Ballista](https://github.com/apache/arrow-ballista) is a distributed compute platform built on DataFusion.
+
+# How does DataFusion Compare with `XYZ`?
+
+When compared to similar systems, DataFusion typically is:
+
+1. Targeted at developers, rather than end users / data scientists.
+2. Designed to be embedded, rather than a complete file based SQL system.
+3. Governed by the [Apache Software Foundation](https://www.apache.org/) process, rather than a single company or individual.
+4. Implemented in `Rust`, rather than `C/C++`
+
+Here is a comparison with similar projects that may help understand
+when DataFusion might be be suitable and unsuitable for your needs:
+
+- [DuckDB](http://www.duckdb.org) is an open source, in process analytic database.

Review Comment:
   in c18786332



##########
docs/source/user-guide/example-usage.md:
##########
@@ -141,3 +141,112 @@ async fn main() -> datafusion::error::Result<()> {
 | 1 | 2      |
 +---+--------+
 ```
+
+# Using DataFusion as a library
+
+## Create a new project
+
+```shell
+cargo new hello_datafusion
+```
+
+```shell
+$ cd hello_datafusion
+$ tree .
+.
+├── Cargo.toml
+└── src
+    └── main.rs
+
+1 directory, 2 files
+```
+
+## Default Configuration
+
+DataFusion is [published on crates.io](https://crates.io/crates/datafusion), and is [well documented on docs.rs](https://docs.rs/datafusion/).
+
+To get started, add the following to your `Cargo.toml` file:
+
+```toml
+[dependencies]
+datafusion = "11.0"
+```
+
+## Create a main function
+
+Update the main.rs file with your first datafusion application based on [Example usage](https://arrow.apache.org/datafusion/user-guide/example-usage.html)
+
+```rust
+use datafusion::prelude::*;
+
+#[tokio::main]
+async fn main() -> datafusion::error::Result<()> {
+  // register the table
+  let ctx = SessionContext::new();
+  ctx.register_csv("test", "<PATH_TO_YOUR_CSV_FILE>", CsvReadOptions::new()).await?;
+
+  // create a plan to run a SQL query
+  let df = ctx.sql("SELECT * FROM test").await?;
+
+  // execute and print results
+  df.show().await?;
+  Ok(())
+}
+```
+
+## Extensibility
+
+DataFusion is designed to be extensible at all points. To that end, you can provide your own custom:
+
+- [x] User Defined Functions (UDFs)
+- [x] User Defined Aggregate Functions (UDAFs)
+- [x] User Defined Table Source (`TableProvider`) for tables
+- [x] User Defined `Optimizer` passes (plan rewrites)
+- [x] User Defined `LogicalPlan` nodes
+- [x] User Defined `ExecutionPlan` nodes
+
+## Rust Version Compatibility
+
+This crate is tested with the latest stable version of Rust. We do not currently test against other, older versions of the Rust compiler.
+
+## Optimized Configuration
+
+For an optimized build several steps are required. First, use the below in your `Cargo.toml`. It is
+worth noting that using the settings in the `[profile.release]` section will significantly increase the build time.
+
+```toml
+[dependencies]
+datafusion = { version = "11.0" , features = ["simd"]}

Review Comment:
   Maybe we could update the script here to automatically clean it up: https://github.com/apache/arrow-datafusion/blob/main/dev/update_datafusion_versions.py
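
As a rough illustration of the kind of cleanup suggested above, here is a minimal Python sketch that rewrites the DataFusion version in `Cargo.toml`-style snippets embedded in a docs page. It is not taken from `dev/update_datafusion_versions.py`; the function name `bump_datafusion_version`, the regular expression, and the version string used in the usage note are all hypothetical and only show the idea.

```python
import re

# Hypothetical helper, NOT the logic of dev/update_datafusion_versions.py.
# Matches dependency lines of either form:
#   datafusion = "11.0"
#   datafusion = { version = "11.0" , features = ["simd"]}
_DEP_RE = re.compile(
    r'^(?P<prefix>[ \t]*datafusion[ \t]*=[ \t]*(?:\{[ \t]*version[ \t]*=[ \t]*)?")'
    r'(?P<version>[^"]+)'
    r'(?P<suffix>".*)$',
    re.MULTILINE,
)


def bump_datafusion_version(text: str, new_version: str) -> str:
    """Return `text` with every matched datafusion version replaced."""
    return _DEP_RE.sub(
        lambda m: m.group("prefix") + new_version + m.group("suffix"),
        text,
    )
```

Calling `bump_datafusion_version(page_text, "22.0.0")` (the version here is a placeholder) would leave everything except the quoted version untouched, so feature lists such as `features = ["simd"]` survive the rewrite. Lines for other crates, for example `datafusion-cli`, do not match the pattern and are left as they are.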



##########
docs/source/user-guide/example-usage.md:
##########
@@ -141,3 +141,112 @@ async fn main() -> datafusion::error::Result<()> {
 | 1 | 2      |
 +---+--------+
 ```
+
+# Using DataFusion as a library
+
+## Create a new project
+
+```shell
+cargo new hello_datafusion
+```
+
+```shell
+$ cd hello_datafusion
+$ tree .
+.
+├── Cargo.toml
+└── src
+    └── main.rs
+
+1 directory, 2 files
+```
+
+## Default Configuration
+
+DataFusion is [published on crates.io](https://crates.io/crates/datafusion), and is [well documented on docs.rs](https://docs.rs/datafusion/).
+
+To get started, add the following to your `Cargo.toml` file:
+
+```toml
+[dependencies]
+datafusion = "11.0"
+```
+
+## Create a main function
+
+Update the main.rs file with your first datafusion application based on [Example usage](https://arrow.apache.org/datafusion/user-guide/example-usage.html)
+
+```rust
+use datafusion::prelude::*;
+
+#[tokio::main]
+async fn main() -> datafusion::error::Result<()> {
+  // register the table
+  let ctx = SessionContext::new();
+  ctx.register_csv("test", "<PATH_TO_YOUR_CSV_FILE>", CsvReadOptions::new()).await?;
+
+  // create a plan to run a SQL query
+  let df = ctx.sql("SELECT * FROM test").await?;
+
+  // execute and print results
+  df.show().await?;
+  Ok(())
+}
+```

Review Comment:
   I agree -- removed





[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5962: [DOCS]: consolidate doc site content simplify navbar

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5962:
URL: https://github.com/apache/arrow-datafusion/pull/5962#discussion_r1162673316


##########
docs/source/index.rst:
##########
@@ -50,22 +49,17 @@ community.
 
    user-guide/introduction
    user-guide/example-usage
-   user-guide/users

Review Comment:
   I consolidated the content of these pages into other pages



##########
docs/source/index.rst:
##########
@@ -50,22 +49,17 @@ community.
 
    user-guide/introduction
    user-guide/example-usage
-   user-guide/users
-   user-guide/comparison
-   user-guide/integration
-   user-guide/library
    user-guide/cli
    user-guide/dataframe
    user-guide/expressions
    user-guide/sql/index
    user-guide/configs
    user-guide/faq
-   Rust Crate Documentation <https://docs.rs/crate/datafusion/>
 
 .. _toc.contributor-guide:
 
 .. toctree::
-   :maxdepth: 2

Review Comment:
   This stops listing H2 headings (`##`) on the main table of contents



##########
docs/source/index.rst:
##########
@@ -50,22 +49,17 @@ community.
 
    user-guide/introduction
    user-guide/example-usage
-   user-guide/users
-   user-guide/comparison
-   user-guide/integration
-   user-guide/library
    user-guide/cli
    user-guide/dataframe
    user-guide/expressions
    user-guide/sql/index
    user-guide/configs
    user-guide/faq
-   Rust Crate Documentation <https://docs.rs/crate/datafusion/>

Review Comment:
   This was redundant with the crates.io link above





[GitHub] [arrow-datafusion] alamb merged pull request #5962: [DOCS]: consolidate doc site content simplify navbar

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb merged PR #5962:
URL: https://github.com/apache/arrow-datafusion/pull/5962




[GitHub] [arrow-datafusion] alamb commented on pull request #5962: [DOCS]: consolidate doc site content simplify navbar

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #5962:
URL: https://github.com/apache/arrow-datafusion/pull/5962#issuecomment-1505842199

   > Hope you don't mind but took the opportunity to review the docs and point out a few parts that are outdated/could be improved
   
   Not at all - thank you @Jefffrey  and thank you @waynexia  




[GitHub] [arrow-datafusion] waynexia commented on a diff in pull request #5962: [DOCS]: consolidate doc site content simplify navbar

Posted by "waynexia (via GitHub)" <gi...@apache.org>.
waynexia commented on code in PR #5962:
URL: https://github.com/apache/arrow-datafusion/pull/5962#discussion_r1163748404


##########
docs/source/contributor-guide/architecture.md:
##########
@@ -20,7 +20,8 @@
 # Architecture
 
 DataFusion's code structure and organization is described in the
-[Crate Documentation], to keep it as close to the source as
-possible.
+[crates.io documentation], to keep it as close to the source as
+possible. You can find the most up to date version in the [source code].

Review Comment:
   What do you think about hosting the latest documentation generated from the source code on GitHub Pages (or another static site host)? For example, [greptimedb.rs](https://greptimedb.rs) is generated from https://github.com/GreptimeTeam/greptimedb/deployments/activity_log?environment=github-pages



##########
docs/source/user-guide/example-usage.md:
##########
@@ -141,3 +141,112 @@ async fn main() -> datafusion::error::Result<()> {
 | 1 | 2      |
 +---+--------+
 ```
+
+# Using DataFusion as a library
+
+## Create a new project
+
+```shell
+cargo new hello_datafusion
+```
+
+```shell
+$ cd hello_datafusion
+$ tree .
+.
+├── Cargo.toml
+└── src
+    └── main.rs
+
+1 directory, 2 files
+```
+
+## Default Configuration
+
+DataFusion is [published on crates.io](https://crates.io/crates/datafusion), and is [well documented on docs.rs](https://docs.rs/datafusion/).
+
+To get started, add the following to your `Cargo.toml` file:
+
+```toml
+[dependencies]
+datafusion = "11.0"
+```
+
+## Create a main function
+
+Update the main.rs file with your first datafusion application based on [Example usage](https://arrow.apache.org/datafusion/user-guide/example-usage.html)
+
+```rust
+use datafusion::prelude::*;
+
+#[tokio::main]
+async fn main() -> datafusion::error::Result<()> {
+  // register the table
+  let ctx = SessionContext::new();
+  ctx.register_csv("test", "<PATH_TO_YOUR_CSV_FILE>", CsvReadOptions::new()).await?;
+
+  // create a plan to run a SQL query
+  let df = ctx.sql("SELECT * FROM test").await?;
+
+  // execute and print results
+  df.show().await?;
+  Ok(())
+}
+```
+
+## Extensibility
+
+DataFusion is designed to be extensible at all points. To that end, you can provide your own custom:
+
+- [x] User Defined Functions (UDFs)
+- [x] User Defined Aggregate Functions (UDAFs)
+- [x] User Defined Table Source (`TableProvider`) for tables
+- [x] User Defined `Optimizer` passes (plan rewrites)
+- [x] User Defined `LogicalPlan` nodes
+- [x] User Defined `ExecutionPlan` nodes
+
+## Rust Version Compatibility
+
+This crate is tested with the latest stable version of Rust. We do not currently test against other, older versions of the Rust compiler.
+
+## Optimized Configuration
+
+For an optimized build several steps are required. First, use the below in your `Cargo.toml`. It is
+worth noting that using the settings in the `[profile.release]` section will significantly increase the build time.
+
+```toml
+[dependencies]
+datafusion = { version = "11.0" , features = ["simd"]}

Review Comment:
   This is also outdated. I wonder whether there is some way to render code or files directly from GitHub, so that we would render our example code instead of updating this file every time. I found [this](https://github.blog/2017-08-15-introducing-embedded-code-snippets/), but it looks like it only works inside GitHub.



##########
docs/source/user-guide/faq.md:
##########
@@ -29,3 +29,37 @@ model and computational kernels. It is designed to run within a single process,
 for parallel query execution.
 
 [Ballista](https://github.com/apache/arrow-ballista) is a distributed compute platform built on DataFusion.
+
+# How does DataFusion Compare with `XYZ`?
+
+When compared to similar systems, DataFusion typically is:
+
+1. Targeted at developers, rather than end users / data scientists.
+2. Designed to be embedded, rather than a complete file based SQL system.
+3. Governed by the [Apache Software Foundation](https://www.apache.org/) process, rather than a single company or individual.
+4. Implemented in `Rust`, rather than `C/C++`
+
+Here is a comparison with similar projects that may help understand
+when DataFusion might be be suitable and unsuitable for your needs:
+
+- [DuckDB](http://www.duckdb.org) is an open source, in process analytic database.
+  Like DataFusion, it supports very fast execution, both from its custom file format
+  and directly from parquet files. Unlike DataFusion, it is written in C/C++ and it
+  is primarily used directly by users as a serverless database and query system rather
+  than as a library for building such database systems.
+
+- [Polars](http://pola.rs): Polars is one of the fastest DataFrame
+  libraries at the time of writing. Like DataFusion, it is also
+  written in Rust and uses the Apache Arrow memory model, but unlike
+  DataFusion it does not provide SQL nor as many extension points.
+
+- [Facebook Velox](https://engineering.fb.com/2022/08/31/open-source/velox/)
+  is an execution engine. Like DataFusion, Velox aims to
+  provide a reusable foundation for building database-like systems. Unlike DataFusion,
+  it is written in C/C++ and does not include a SQL frontend or planning /optimization

Review Comment:
   ```suggestion
     it is written in C/C++ and does not include a SQL frontend or planning / optimization
   ```


