You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/11 12:52:09 UTC
[GitHub] [arrow-datafusion] tustvold opened a new issue, #2199: Morsel-Driven Parallelism Using Rayon
tustvold opened a new issue, #2199:
URL: https://github.com/apache/arrow-datafusion/issues/2199
A proposal for reformulating the parallelism story within DataFusion to use a [morsel-driven](https://db.in.tum.de/~leis/papers/morsels.pdf) approach based on [rayon](https://docs.rs/rayon/latest/rayon/). More details, background, and discussion can be found in the proposal document [here](https://docs.google.com/document/d/1txX60thXn1tQO1ENNT8rwfU3cXLofa7ZccnvP4jD6AA), please feel free to comment there.
The keys highlights are:
* Decouples parallelism from the partitioning expressed in the physical plan, allowing for:
* Better handling of imbalanced partitions
* Adaptive parallelism based on compute availability at execution time
* Parallelism within a partition, such as decoding parquet columns in parallel, [parallel sort](https://docs.rs/rayon/latest/rayon/slice/trait.ParallelSliceMut.html), etc...
* The first step to reducing the complexity associated with the current futures-based concurrency model
* Improvements to thread-locality, observability and performance
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb closed issue #2199: Morsel-Driven Parallelism Using Rayon
Posted by GitBox <gi...@apache.org>.
alamb closed issue #2199: Morsel-Driven Parallelism Using Rayon
URL: https://github.com/apache/arrow-datafusion/issues/2199
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] JasonLi-cn commented on issue #2199: Morsel-Driven Parallelism Using Rayon
Posted by GitBox <gi...@apache.org>.
JasonLi-cn commented on issue #2199:
URL: https://github.com/apache/arrow-datafusion/issues/2199#issuecomment-1263228061
1. binary code
```rust
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::arrow::util::pretty::print_batches;
use datafusion::error::Result;
use datafusion::prelude::*;
use datafusion::scheduler::Scheduler;
use futures::{StreamExt, TryStreamExt};
use std::env;
#[tokio::main]
async fn main() -> Result<()> {
let name = "test_table";
let mut args = env::args();
args.next();
let table_path = args.next().expect("parquet file");
let sql = &args.next().expect("sql");
let using_scheduler = args.next().is_some();
// create local session context
let config = SessionConfig::new()
.with_information_schema(true)
.with_target_partitions(4);
let context = SessionContext::with_config(config);
// register parquet file with the execution context
context
.register_parquet(name, &table_path, ParquetReadOptions::default())
.await?;
let task = context.task_ctx();
let query = context.sql(sql).await.unwrap();
let plan = query.create_physical_plan().await.unwrap();
println!("Start query, using scheduler {}", using_scheduler);
let now = std::time::Instant::now();
let results = if using_scheduler {
let scheduler = Scheduler::new(4);
let stream = scheduler.schedule(plan, task).unwrap().stream();
let results: Vec<RecordBatch> = stream.try_collect().await.unwrap();
results
} else {
context.sql(sql).await?.collect().await?
};
let elapsed = now.elapsed().as_millis();
println!("End query, elapsed {} ms", elapsed);
print_batches(&results)?;
Ok(())
}
/// Execute sql
async fn plan_and_collect(
context: &SessionContext,
sql: &str,
) -> Result<Vec<RecordBatch>> {
context.sql(sql).await?.collect().await
}
```
2. test data
- format: parquet
- number of files: 4
- rows: 16405852 * 4 = 65623408
- number of columns: 6
- schema: uint32, uint32, uint32, uint32, string, uint32
3. test result
SQLs:
```sql
select count(distinct column0) from test_table;
select * from test_table order by column5 limit 10;
```
The performance is similar with and without the Scheduler! Is there a problem with where I use it?
@tustvold
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] tustvold commented on issue #2199: Morsel-Driven Parallelism Using Rayon
Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #2199:
URL: https://github.com/apache/arrow-datafusion/issues/2199#issuecomment-1263282629
Yes that is expected, I've had to park working on this for a bit in favour of some other things. See #2504 for the follow on work
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] JasonLi-cn commented on issue #2199: Morsel-Driven Parallelism Using Rayon
Posted by GitBox <gi...@apache.org>.
JasonLi-cn commented on issue #2199:
URL: https://github.com/apache/arrow-datafusion/issues/2199#issuecomment-1263332649
> Yes that is expected, I've had to park working on this for a bit in favour of some other things. See #2504 for the follow on work
Ok, thanks!
By the way, do you know the difference between [ClickHouse's Query Execution Pipeline](https://presentations.clickhouse.com/meetup24/5.%20Clickhouse%20query%20execution%20pipeline%20changes/#clickhouse-query-execution-pipeline) and Datafusion's Execution model(likes vectorized volcano model)? And what are the advantages of ClickHouse?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org