Posted to issues@arrow.apache.org by "Andy Grove (JIRA)" <ji...@apache.org> on 2019/07/13 22:08:00 UTC
[jira] [Updated] (ARROW-5946) [Rust] [DataFusion] Projection push down with aggregate producing incorrect results
[ https://issues.apache.org/jira/browse/ARROW-5946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andy Grove updated ARROW-5946:
------------------------------
Description:
I was testing some queries with the 0.14 release and noticed that the projected schema for a table scan is completely wrong, although the results of the query are not necessarily wrong.
{code:java}
// schema for nyctaxi csv files
let schema = Schema::new(vec![
    Field::new("VendorID", DataType::Utf8, true),
    Field::new("tpep_pickup_datetime", DataType::Utf8, true),
    Field::new("tpep_dropoff_datetime", DataType::Utf8, true),
    Field::new("passenger_count", DataType::Utf8, true),
    Field::new("trip_distance", DataType::Float64, true),
    Field::new("RatecodeID", DataType::Utf8, true),
    Field::new("store_and_fwd_flag", DataType::Utf8, true),
    Field::new("PULocationID", DataType::Utf8, true),
    Field::new("DOLocationID", DataType::Utf8, true),
    Field::new("payment_type", DataType::Utf8, true),
    Field::new("fare_amount", DataType::Float64, true),
    Field::new("extra", DataType::Float64, true),
    Field::new("mta_tax", DataType::Float64, true),
    Field::new("tip_amount", DataType::Float64, true),
    Field::new("tolls_amount", DataType::Float64, true),
    Field::new("improvement_surcharge", DataType::Float64, true),
    Field::new("total_amount", DataType::Float64, true),
]);
let mut ctx = ExecutionContext::new();
ctx.register_csv("tripdata", "file.csv", &schema, true);
let optimized_plan = ctx.create_logical_plan(
    "SELECT passenger_count, MIN(fare_amount), MAX(fare_amount) \
     FROM tripdata GROUP BY passenger_count").unwrap();{code}
The projected schema in the table scan has the first two columns from the schema (VendorID and tpep_pickup_datetime) rather than passenger_count and fare_amount.
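To make the expected behavior concrete, here is a minimal sketch in plain Rust that is independent of DataFusion's actual optimizer code. The names `expected_projection` and `TRIPDATA_COLUMNS` are hypothetical and introduced only for illustration. The sketch assumes that a pushed-down projection should contain exactly the schema indices of the columns the query references.

```rust
/// Column names of the tripdata schema from this report, in declaration order.
/// (Hypothetical constant, used only for this illustration.)
const TRIPDATA_COLUMNS: [&str; 17] = [
    "VendorID",
    "tpep_pickup_datetime",
    "tpep_dropoff_datetime",
    "passenger_count",
    "trip_distance",
    "RatecodeID",
    "store_and_fwd_flag",
    "PULocationID",
    "DOLocationID",
    "payment_type",
    "fare_amount",
    "extra",
    "mta_tax",
    "tip_amount",
    "tolls_amount",
    "improvement_surcharge",
    "total_amount",
];

/// Hypothetical helper (not a DataFusion API): map the column names a query
/// references to their indices in the table schema, sorted and de-duplicated.
/// Names that do not appear in the schema are skipped.
fn expected_projection(schema: &[&str], referenced: &[&str]) -> Vec<usize> {
    let mut indices: Vec<usize> = referenced
        .iter()
        .filter_map(|name| schema.iter().position(|col| col == name))
        .collect();
    indices.sort_unstable();
    indices.dedup();
    indices
}
```

For the query above, which references passenger_count and fare_amount, this helper returns the indices 3 and 10, whereas the projected schema reported here contains the columns at indices 0 and 1.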
was:
I was testing some queries with the 0.14 release and noticed that the projected schema for a table scan is completely wrong, although the results of the query are not necessarily wrong.
{code:java}
// schema for nyctaxi csv files
let schema = Schema::new(vec![
    Field::new("VendorID", DataType::Utf8, true),
    Field::new("tpep_pickup_datetime", DataType::Utf8, true),
    Field::new("tpep_dropoff_datetime", DataType::Utf8, true),
    Field::new("passenger_count", DataType::Utf8, true),
    Field::new("trip_distance", DataType::Float64, true),
    Field::new("RatecodeID", DataType::Utf8, true),
    Field::new("store_and_fwd_flag", DataType::Utf8, true),
    Field::new("PULocationID", DataType::Utf8, true),
    Field::new("DOLocationID", DataType::Utf8, true),
    Field::new("payment_type", DataType::Utf8, true),
    Field::new("fare_amount", DataType::Float64, true),
    Field::new("extra", DataType::Float64, true),
    Field::new("mta_tax", DataType::Float64, true),
    Field::new("tip_amount", DataType::Float64, true),
    Field::new("tolls_amount", DataType::Float64, true),
    Field::new("improvement_surcharge", DataType::Float64, true),
    Field::new("total_amount", DataType::Float64, true),
]);
let mut ctx = ExecutionContext::new();
ctx.register_csv("tripdata", "file.csv", &schema, true);
let optimized_plan = ctx.create_logical_plan(
    "SELECT passenger_count, MIN(fare_amount), MAX(fare_amount) \
     FROM tripdata GROUP BY passenger_count").unwrap();{code}
> [Rust] [DataFusion] Projection push down with aggregate producing incorrect results
> -----------------------------------------------------------------------------------
>
> Key: ARROW-5946
> URL: https://issues.apache.org/jira/browse/ARROW-5946
> Project: Apache Arrow
> Issue Type: Bug
> Components: Rust, Rust - DataFusion
> Affects Versions: 0.14.0
> Reporter: Andy Grove
> Assignee: Andy Grove
> Priority: Major
> Fix For: 1.0.0
>
>
> I was testing some queries with the 0.14 release and noticed that the projected schema for a table scan is completely wrong, although the results of the query are not necessarily wrong.
>
> {code:java}
> // schema for nyctaxi csv files
> let schema = Schema::new(vec![
>     Field::new("VendorID", DataType::Utf8, true),
>     Field::new("tpep_pickup_datetime", DataType::Utf8, true),
>     Field::new("tpep_dropoff_datetime", DataType::Utf8, true),
>     Field::new("passenger_count", DataType::Utf8, true),
>     Field::new("trip_distance", DataType::Float64, true),
>     Field::new("RatecodeID", DataType::Utf8, true),
>     Field::new("store_and_fwd_flag", DataType::Utf8, true),
>     Field::new("PULocationID", DataType::Utf8, true),
>     Field::new("DOLocationID", DataType::Utf8, true),
>     Field::new("payment_type", DataType::Utf8, true),
>     Field::new("fare_amount", DataType::Float64, true),
>     Field::new("extra", DataType::Float64, true),
>     Field::new("mta_tax", DataType::Float64, true),
>     Field::new("tip_amount", DataType::Float64, true),
>     Field::new("tolls_amount", DataType::Float64, true),
>     Field::new("improvement_surcharge", DataType::Float64, true),
>     Field::new("total_amount", DataType::Float64, true),
> ]);
> let mut ctx = ExecutionContext::new();
> ctx.register_csv("tripdata", "file.csv", &schema, true);
> let optimized_plan = ctx.create_logical_plan(
>     "SELECT passenger_count, MIN(fare_amount), MAX(fare_amount) \
>      FROM tripdata GROUP BY passenger_count").unwrap();{code}
> The projected schema in the table scan has the first two columns from the schema (VendorID and tpep_pickup_datetime) rather than passenger_count and fare_amount.
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)