Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/24 23:46:53 UTC

[GitHub] [arrow-ballista] andygrove commented on a diff in pull request #41: MINOR: Improve developer docs

andygrove commented on code in PR #41:
URL: https://github.com/apache/arrow-ballista/pull/41#discussion_r881063740


##########
docs/developer/architecture.md:
##########
@@ -36,24 +35,15 @@ stage cannot start until its child query stages have completed.
 Each query stage has one or more partitions that can be processed in parallel by the available
 executors in the cluster. This is the basic unit of scalability in Ballista.
 
-The following diagram shows the flow of requests and responses between the client, scheduler, and executor
-processes.
-
-## Distributed Scheduler Overview
-
-Ballista uses the DataFusion query execution framework to create a physical plan and then transforms it into a
-distributed physical plan by breaking the query down into stages whenever the partitioning scheme changes.
+The output of each query stage is persisted to disk and future query stages will request this data from the executors
+that produced it. The persisted output will be partitioned according to the partitioning scheme that was defined for
+the query stage and this typically differs from the partitioning scheme of the query stage that will consume this
+intermediate output since it is the changes in partitioning in the plan that define the query stage boundaries.
 
-Specifically, any `RepartitionExec` operator is replaced with an `UnresolvedShuffleExec` and the child operator
-of the repartition operator is wrapped in a `ShuffleWriterExec` operator and scheduled for execution.
+This exchange of data between query stages is called a "shuffle exchange" in Apache Spark.
 
-Each executor polls the scheduler for the next task to run. Tasks are currently always `ShuffleWriterExec` operators

Review Comment:
   This seemed like it was diving into too much technical detail for an introduction.


