
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/15 17:26:32 UTC

[GitHub] [arrow-ballista] andygrove opened a new issue, #353: Cannot run benchmarks when using docker-compose

andygrove opened a new issue, #353:
URL: https://github.com/apache/arrow-ballista/issues/353

   **Describe the bug**
   When I run Ballista in docker-compose and then run the benchmarks, every benchmark query finishes very quickly and returns a result set with zero rows and zero columns. The executed plans contain `EmptyExec` instead of `ParquetExec`.
   
   **To Reproduce**
   
   ```bash
   cargo build --release
   docker-compose up --build
   ```
   
   Then run the benchmarks using the instructions in the repo.
   
   **Expected behavior**
   The planner should not replace `ParquetExec` with `EmptyExec`.
   
   **Additional context**
   None
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-ballista] andygrove commented on issue #353: Cannot run benchmarks when using docker-compose

Posted by GitBox <gi...@apache.org>.

andygrove commented on issue #353:
URL: https://github.com/apache/arrow-ballista/issues/353#issuecomment-1279840164

   I now suspect this is somehow related to the `eliminate_filter` optimization rule inserting an `EmptyRelation`.



[GitHub] [arrow-ballista] andygrove closed issue #353: Scheduler silently replaces `ParquetExec` with `EmptyExec` if data path is not correctly mounted in container

Posted by GitBox <gi...@apache.org>.

andygrove closed issue #353: Scheduler silently replaces `ParquetExec` with `EmptyExec` if data path is not correctly mounted in container
URL: https://github.com/apache/arrow-ballista/issues/353



[GitHub] [arrow-ballista] andygrove commented on issue #353: Cannot run benchmarks when using docker-compose

Posted by GitBox <gi...@apache.org>.

andygrove commented on issue #353:
URL: https://github.com/apache/arrow-ballista/issues/353#issuecomment-1279841055

   I think I now realize what the root issue is: I was running the benchmark against a data set that was not mounted into the containers running under docker-compose. I would expect this to cause the query to fail, but somehow the optimizer determines that no rows can match the filter and simply removes the table scan!



[GitHub] [arrow-ballista] andygrove commented on issue #353: Cannot run benchmarks when using docker-compose

Posted by GitBox <gi...@apache.org>.

andygrove commented on issue #353:
URL: https://github.com/apache/arrow-ballista/issues/353#issuecomment-1279890759

   This is where the `EmptyExec` comes from:
   
   ```rust
   impl TableProvider for ListingTable {
       // ... (other trait methods omitted in this excerpt)

       async fn scan(
           &self,
           ctx: &SessionState,
           projection: &Option<Vec<usize>>,
           filters: &[Expr],
           limit: Option<usize>,
       ) -> Result<Arc<dyn ExecutionPlan>> {
           let (partitioned_file_lists, statistics) =
               self.list_files_for_scan(ctx, filters, limit).await?;

           // if no files need to be read, return an `EmptyExec`
           if partitioned_file_lists.is_empty() {
               let schema = self.schema();
               let projected_schema = project_schema(&schema, projection.as_ref())?;
               return Ok(Arc::new(EmptyExec::new(false, projected_schema)));
           }
           // ... (excerpt; the rest of `scan` is omitted)
   ```
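
   The excerpt above treats "no files listed" the same way whether the table location is empty or does not exist at all. As a rough illustration of how the two cases could be told apart, here is a minimal standard-library sketch. It is not DataFusion code: the `Listing` type and the `list_table_files` function are hypothetical names invented for this example, and it only looks at a local filesystem path.

   ```rust
   use std::fs;
   use std::io;
   use std::path::{Path, PathBuf};

   /// Outcome of listing a table location.
   /// Hypothetical type for illustration, not a DataFusion API.
   #[derive(Debug, PartialEq)]
   pub enum Listing {
       /// The location exists but holds no Parquet files: a genuinely empty table.
       Empty,
       /// The location exists and holds at least one Parquet file.
       Files(Vec<PathBuf>),
   }

   /// Lists `.parquet` files under `table_path`.
   /// A missing path is reported as an error instead of an empty listing.
   pub fn list_table_files(table_path: &Path) -> io::Result<Listing> {
       if !table_path.exists() {
           return Err(io::Error::new(
               io::ErrorKind::NotFound,
               format!("table path does not exist: {}", table_path.display()),
           ));
       }
       // A path that points at a single file is treated as a one-file table.
       if table_path.is_file() {
           return Ok(Listing::Files(vec![table_path.to_path_buf()]));
       }
       let mut files = Vec::new();
       for entry in fs::read_dir(table_path)? {
           let path = entry?.path();
           if path.extension().and_then(|ext| ext.to_str()) == Some("parquet") {
               files.push(path);
           }
       }
       // Sort so the result does not depend on directory iteration order.
       files.sort();
       Ok(if files.is_empty() {
           Listing::Empty
       } else {
           Listing::Files(files)
       })
   }
   ```

   Under this sketch, a caller could still map `Listing::Empty` to an empty plan while surfacing the `NotFound` error to the user, which is the failure that would have pointed at the unmounted data path.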



[GitHub] [arrow-ballista] avantgardnerio commented on issue #353: Scheduler silently replaces `ParquetExec` with `EmptyExec` if data path is not correctly mounted in container

Posted by GitBox <gi...@apache.org>.

avantgardnerio commented on issue #353:
URL: https://github.com/apache/arrow-ballista/issues/353#issuecomment-1282660270

   > Is it a bug? 
   
   IMO, absolutely. I think few users would expect this behavior, and most would spend quite a bit of time tracking it down. I can't think of another piece of software that treats a missing file the same way as an empty one.



[GitHub] [arrow-ballista] andygrove commented on issue #353: Cannot run benchmarks when using docker-compose

Posted by GitBox <gi...@apache.org>.

andygrove commented on issue #353:
URL: https://github.com/apache/arrow-ballista/issues/353#issuecomment-1279836013

   Scheduler has this optimized logical plan:
   
   ```
   Calculated optimized plan: Sort: lineitem.l_returnflag ASC NULLS LAST, lineitem.l_linestatus ASC NULLS LAST
   ballista-scheduler_1  |   Projection: lineitem.l_returnflag, lineitem.l_linestatus, SUM(lineitem.l_quantity) AS sum_qty, SUM(lineitem.l_extendedprice) AS sum_base_price, SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount) AS sum_disc_price, SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount * Int64(1) + lineitem.l_tax) AS sum_charge, AVG(lineitem.l_quantity) AS avg_qty, AVG(lineitem.l_extendedprice) AS avg_price, AVG(lineitem.l_discount) AS avg_disc, COUNT(UInt8(1)) AS count_order
   ballista-scheduler_1  |     Aggregate: groupBy=[[lineitem.l_returnflag, lineitem.l_linestatus]], aggr=[[SUM(lineitem.l_quantity), SUM(lineitem.l_extendedprice), SUM(lineitem.l_extendedprice * Float64(1) - lineitem.l_discountFloat64(1) - lineitem.l_discountlineitem.l_discountFloat64(1)lineitem.l_extendedprice AS lineitem.l_extendedprice * Float64(1) - lineitem.l_discount) AS SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount), SUM(lineitem.l_extendedprice * Float64(1) - lineitem.l_discountFloat64(1) - lineitem.l_discountlineitem.l_discountFloat64(1)lineitem.l_extendedprice AS lineitem.l_extendedprice * Float64(1) - lineitem.l_discount * Float64(1) + lineitem.l_tax) AS SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount * Int64(1) + lineitem.l_tax), AVG(lineitem.l_quantity), AVG(lineitem.l_extendedprice), AVG(lineitem.l_discount), COUNT(UInt8(1))]]
   ballista-scheduler_1  |       Projection: lineitem.l_extendedprice * Float64(1) - lineitem.l_discount AS lineitem.l_extendedprice * Float64(1) - lineitem.l_discountFloat64(1) - lineitem.l_discountlineitem.l_discountFloat64(1)lineitem.l_extendedprice, lineitem.l_quantity, lineitem.l_extendedprice, lineitem.l_discount, lineitem.l_tax, lineitem.l_returnflag, lineitem.l_linestatus
   ballista-scheduler_1  |         Filter: lineitem.l_shipdate <= Date32("10471")
   ballista-scheduler_1  |           TableScan: lineitem projection=[l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate], partial_filters=[lineitem.l_shipdate <= Date32("10471")]
   ```
   
   Then the following code fails:
   
   ```rust
   let plan = session_ctx.create_physical_plan(&optimized_plan).await?;
   let x = format!("{:?}", plan);
   assert!(!x.contains("EmptyExec"));
   ```
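
   The same substring check can be expressed as a guard that returns an error rather than panicking. This is only a sketch: `ensure_no_empty_exec` is a hypothetical helper, it works on the rendered plan text (either the `{:?}` output or the indented display form), and a plain substring match would also flag an `EmptyExec` that is legitimately part of a plan.

   ```rust
   /// Returns an error naming the offending line if the rendered plan text
   /// mentions `EmptyExec`. Hypothetical helper, not part of Ballista.
   pub fn ensure_no_empty_exec(plan_text: &str) -> Result<(), String> {
       for (idx, line) in plan_text.lines().enumerate() {
           if line.contains("EmptyExec") {
               return Err(format!(
                   "plan line {} contains EmptyExec (is the data path readable?): {}",
                   idx + 1,
                   line.trim()
               ));
           }
       }
       Ok(())
   }
   ```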



[GitHub] [arrow-ballista] andygrove commented on issue #353: Cannot run benchmarks when using docker-compose

Posted by GitBox <gi...@apache.org>.

andygrove commented on issue #353:
URL: https://github.com/apache/arrow-ballista/issues/353#issuecomment-1279836912

   The physical plan has this:
   
   ```
   input: FilterExec { predicate: BinaryExpr { left: Column { name: "l_shipdate", index: 6 }, op: LtEq, right: Literal { value: Date32("10471") } }, 
   input: RepartitionExec { 
   input: EmptyExec { produce_one_row: false, schema: Schema { fields: [Field { name: "l_quantity"
   ```



[GitHub] [arrow-ballista] Dandandan commented on issue #353: Scheduler silently replaces `ParquetExec` with `EmptyExec` if data path is not correctly mounted in container

Posted by GitBox <gi...@apache.org>.

Dandandan commented on issue #353:
URL: https://github.com/apache/arrow-ballista/issues/353#issuecomment-1282765761

   I believe Spark / Presto / etc. commonly return an empty result when given a path/table without any files (on object storage). This makes sense for an empty table.
   
   Looking at the example, though, it shows an actual file that has been listed, so in that case I agree we should return an error.



[GitHub] [arrow-ballista] andygrove commented on issue #353: Cannot run benchmarks when using docker-compose

Posted by GitBox <gi...@apache.org>.

andygrove commented on issue #353:
URL: https://github.com/apache/arrow-ballista/issues/353#issuecomment-1279791290

   I deleted all images and rebuilt, but the issue persists. When running in docker-compose, the scheduler shows:
   
   ```
   ShuffleWriterExec: Some(Hash([Column { name: "l_returnflag", index: 0 }, Column { name: "l_linestatus", index: 1 }], 2))
     AggregateExec: mode=Partial, gby=[l_returnflag@5 as l_returnflag, l_linestatus@6 as l_linestatus], aggr=[SUM(lineitem.l_quantity), SUM(lineitem.l_extendedprice), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount * Int64(1) + lineitem.l_tax), AVG(lineitem.l_quantity), AVG(lineitem.l_extendedprice), AVG(lineitem.l_discount), COUNT(UInt8(1))]
       ProjectionExec: expr=[l_extendedprice@1 * 1 - l_discount@2 as lineitem.l_extendedprice * Float64(1) - lineitem.l_discountFloat64(1) - lineitem.l_discountlineitem.l_discountFloat64(1)lineitem.l_extendedprice, l_quantity@0 as l_quantity, l_extendedprice@1 as l_extendedprice, l_discount@2 as l_discount, l_tax@3 as l_tax, l_returnflag@4 as l_returnflag, l_linestatus@5 as l_linestatus]
         CoalesceBatchesExec: target_batch_size=4096
           FilterExec: l_shipdate@6 <= 10471
             EmptyExec: produce_one_row=false
   ```
   
   When running the scheduler outside of docker-compose, I see:
   
   ```
   =========ResolvedStage[stage_id=1.0, partitions=1]=========
   ShuffleWriterExec: Some(Hash([Column { name: "l_returnflag", index: 0 }, Column { name: "l_linestatus", index: 1 }], 2))
     AggregateExec: mode=Partial, gby=[l_returnflag@5 as l_returnflag, l_linestatus@6 as l_linestatus], aggr=[SUM(lineitem.l_quantity), SUM(lineitem.l_extendedprice), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount * Int64(1) + lineitem.l_tax), AVG(lineitem.l_quantity), AVG(lineitem.l_extendedprice), AVG(lineitem.l_discount), COUNT(UInt8(1))]
       ProjectionExec: expr=[l_extendedprice@1 * 1 - l_discount@2 as lineitem.l_extendedprice * Float64(1) - lineitem.l_discountFloat64(1) - lineitem.l_discountlineitem.l_discountFloat64(1)lineitem.l_extendedprice, l_quantity@0 as l_quantity, l_extendedprice@1 as l_extendedprice, l_discount@2 as l_discount, l_tax@3 as l_tax, l_returnflag@4 as l_returnflag, l_linestatus@5 as l_linestatus]
         CoalesceBatchesExec: target_batch_size=4096
           FilterExec: l_shipdate@6 <= 10471
             ParquetExec: limit=None, partitions=[mnt/bigdata/tpch/sf1-parquet/lineitem/part-0.parquet], predicate=l_shipdate_min@0 <= 10471, projection=[l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate]
   ```
   



[GitHub] [arrow-ballista] andygrove commented on issue #353: Cannot run benchmarks when using docker-compose

Posted by GitBox <gi...@apache.org>.

andygrove commented on issue #353:
URL: https://github.com/apache/arrow-ballista/issues/353#issuecomment-1279789279

   My local Docker images do not seem to have been rebuilt:
   
   ```
   $ docker images | grep "^ballista"
   ballista-benchmarks                                   latest                 cb19c8c820b4   12 minutes ago   220MB
   ballista-executor                                     latest                 3270ba3c5215   42 minutes ago   160MB
   ballista-scheduler                                    latest                 63cf61dfc749   27 hours ago     231MB
   ballista-builder                                      latest                 b062501e26f6   3 weeks ago      1.56GB
   ```



[GitHub] [arrow-ballista] mingmwang commented on issue #353: Scheduler silently replaces `ParquetExec` with `EmptyExec` if data path is not correctly mounted in container

Posted by GitBox <gi...@apache.org>.

mingmwang commented on issue #353:
URL: https://github.com/apache/arrow-ballista/issues/353#issuecomment-1281025932

   Is it a bug? Currently we do not have a Catalog service. If the data path does not exist, I think it is valid to return an empty relation.

