Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/28 22:01:11 UTC

[GitHub] [arrow-datafusion] cube2222 opened a new issue #2109: Almost 100x slowdown on 0.7.0 with CSV file

cube2222 opened a new issue #2109:
URL: https://github.com/apache/arrow-datafusion/issues/2109


   **Describe the bug**
   I'm running benchmarks for [OctoSQL](https://github.com/cube2222/octosql) and datafusion-cli is one of the tools I compare against. The previous version I used (0.6.0 I think) finished the benchmark in 1.5 seconds. The new version takes 100 (!!!) seconds. It also prints "0 rows in set", which makes me think this is a CSV decoder regression.
   
   This is based on the nyc yellow taxi dataset.
   
   **To Reproduce**
   ```bash
   curl https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-04.csv -o taxi.csv
   
   echo "CREATE EXTERNAL TABLE taxi
   STORED AS CSV
   WITH HEADER ROW
   LOCATION './taxi.csv';
   
   SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi GROUP BY passenger_count" > datafusion_commands.txt
   
   datafusion-cli -f datafusion_commands.txt
   ```
   
   **Expected behavior**
   Datafusion is supposed to be blazingly fast.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] jychen7 commented on issue #2109: Almost 100x slowdown on 0.7.0 with CSV file due to parsing entire file to infer schema

Posted by GitBox <gi...@apache.org>.

jychen7 commented on issue #2109:
URL: https://github.com/apache/arrow-datafusion/issues/2109#issuecomment-1086655243


   > I agree we can set a default const of 1000 on CsvFormat and use it across the code base. To align with the other file formats, we should probably set it for JSON as well? (Avro and Parquet are not applicable.)
   
   I will try to create a PR this weekend and see if we can do our first minor release next weekend.



[GitHub] [arrow-datafusion] jychen7 commented on issue #2109: Almost 100x slowdown on 0.7.0 with CSV file

Posted by GitBox <gi...@apache.org>.

jychen7 commented on issue #2109:
URL: https://github.com/apache/arrow-datafusion/issues/2109#issuecomment-1083923099


   I agree we can set a default const of `1000` on `CsvFormat` and use it across the code base.
   https://github.com/apache/arrow-datafusion/blob/c43b9ab9922ccbaaf6fe6f27e3d31201989edb1e/datafusion/core/src/datasource/file_format/csv.rs#L46-L54



[GitHub] [arrow-datafusion] jychen7 commented on issue #2109: Almost 100x slowdown on 0.7.0 with CSV file

Posted by GitBox <gi...@apache.org>.

jychen7 commented on issue #2109:
URL: https://github.com/apache/arrow-datafusion/issues/2109#issuecomment-1083882640


   Interesting: on the `master` branch as of now, the default `1000` is set in `CsvReadOptions`
   https://github.com/apache/arrow-datafusion/blob/c43b9ab9922ccbaaf6fe6f27e3d31201989edb1e/datafusion/core/src/execution/options.rs#L56-L66
   
   It should be passed along when converted to `ListingOptions`; I need to investigate more.
   https://github.com/apache/arrow-datafusion/blob/c43b9ab9922ccbaaf6fe6f27e3d31201989edb1e/datafusion/core/src/execution/options.rs#L106-L114
   
   ---
   
   But I agree we can set a default const of `1000` and use it across the code base.
   https://github.com/apache/arrow-datafusion/blob/c43b9ab9922ccbaaf6fe6f27e3d31201989edb1e/datafusion/core/src/datasource/file_format/csv.rs#L46-L54



[GitHub] [arrow-datafusion] matthewmturner commented on issue #2109: Almost 100x slowdown on 0.7.0 with CSV file

Posted by GitBox <gi...@apache.org>.

matthewmturner commented on issue #2109:
URL: https://github.com/apache/arrow-datafusion/issues/2109#issuecomment-1082557831


   Indeed, I was able to reproduce the performance regression building from source:
   
   Master (maybe a few commits behind, I haven't pulled the latest in a few days)
   ```
   DataFusion CLI v7.0.0
   ❯ CREATE EXTERNAL TABLE taxi STORED AS CSV WITH HEADER ROW LOCATION './taxi.csv';
   0 rows in set. Query took 84.245 seconds.
   ```
   
   7.0.0
   ```
   DataFusion CLI v7.0.0
   ❯ CREATE EXTERNAL TABLE taxi STORED AS CSV WITH HEADER ROW LOCATION './taxi.csv';
   0 rows in set. Query took 112.486 seconds.
   ```
   
   6.0.0 (I think the version shown when launching is wrong)
   ```
   DataFusion CLI v5.1.0-SNAPSHOT
   
   ❯ CREATE EXTERNAL TABLE taxi STORED AS CSV WITH HEADER ROW LOCATION './taxi.csv';
   0 rows in set. Query took 2.645 seconds.
   ```
   
   My guess is that this could be coming from an arrow-rs related change (which handles IO), but I haven't been tracking all the changes there in detail lately.  I likely won't have time to dig into this more for a few days.
   
   @alamb does anything come to mind?



[GitHub] [arrow-datafusion] alamb commented on issue #2109: Almost 100x slowdown on 0.7.0 with CSV file

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #2109:
URL: https://github.com/apache/arrow-datafusion/issues/2109#issuecomment-1083442669


   > @alamb does anything come to mind?
   
   Sometimes I have seen this kind of regression when there is no `BufReader` when doing IO. Let me poke at it a bit
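
For readers unfamiliar with the pattern mentioned above, here is a minimal sketch of wrapping a reader in `BufReader` using only the Rust standard library. The function name `count_lines` is hypothetical and is not taken from DataFusion or arrow-rs; the sketch only illustrates the general idea and does not claim that a missing `BufReader` was the cause of this issue.

```rust
use std::io::{BufRead, BufReader, Read};

/// Counts the lines available from any `Read` source.
/// Wrapping the source in `BufReader` means the underlying reader is
/// asked for data in large chunks instead of many tiny reads.
/// (`count_lines` is an illustrative name, not a DataFusion API.)
fn count_lines<R: Read>(source: R) -> std::io::Result<usize> {
    let reader = BufReader::new(source);
    let mut count = 0;
    for line in reader.lines() {
        line?; // propagate any I/O or UTF-8 error
        count += 1;
    }
    Ok(count)
}
```

The same function works for a `std::fs::File` or for an in-memory `std::io::Cursor`, since both implement `Read`.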



[GitHub] [arrow-datafusion] alamb commented on issue #2109: Almost 100x slowdown on 0.7.0 with CSV file

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #2109:
URL: https://github.com/apache/arrow-datafusion/issues/2109#issuecomment-1083477675


   I think something in the DF 7.0 line made the number of lines used to infer the schema configurable, and the default changed to use "the whole file".
   
   Thus, in 7.0 the datafusion-cli appears to be parsing the entire CSV file to do schema inference.  
   
   When I applied the following diff, the time went from **131.012 seconds** locally to **0.076 seconds**.
   
   ```diff
   (arrow_dev) alamb@MacBook-Pro-2:~/Software/arrow-datafusion$ git diff
   diff --git a/datafusion/core/src/datasource/file_format/csv.rs b/datafusion/core/src/datasource/file_format/csv.rs
   index 29ca84a12..c0a6307e8 100644
   --- a/datafusion/core/src/datasource/file_format/csv.rs
   +++ b/datafusion/core/src/datasource/file_format/csv.rs
   @@ -95,7 +95,7 @@ impl FileFormat for CsvFormat {
        async fn infer_schema(&self, mut readers: ObjectReaderStream) -> Result<SchemaRef> {
            let mut schemas = vec![];
    
   -        let mut records_to_read = self.schema_infer_max_rec.unwrap_or(std::usize::MAX);
   +        let mut records_to_read = self.schema_infer_max_rec.unwrap_or(1000);
    
            while let Some(obj_reader) = readers.next().await {
                let mut reader = obj_reader?.sync_reader()?;
   (arrow_dev) alamb@MacBook-Pro-2:~/Software/arrow-datafusion$ 
   ```
   
   I suggest we change the default value of `schema_infer_max_rec` to something sensible like 100 or 1000. I think it is exceedingly rare to need to use more than this.
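
A minimal sketch of the suggested default, assuming a simplified stand-in for `CsvFormat`. The struct `CsvFormatSketch`, the method `records_to_read`, and the constant name are hypothetical; only the field name `schema_infer_max_rec` and the value 1000 come from the diff above.

```rust
/// Hypothetical constant holding the default suggested in the thread.
const DEFAULT_SCHEMA_INFER_MAX_REC: usize = 1000;

/// Simplified stand-in for the real `CsvFormat` (an assumption for
/// illustration; the real struct has more fields).
struct CsvFormatSketch {
    /// `None` means the caller did not set a limit.
    schema_infer_max_rec: Option<usize>,
}

impl CsvFormatSketch {
    /// Number of records to scan during schema inference.
    /// Falls back to a bounded default instead of `usize::MAX`.
    fn records_to_read(&self) -> usize {
        self.schema_infer_max_rec
            .unwrap_or(DEFAULT_SCHEMA_INFER_MAX_REC)
    }
}
```

With this shape, an unset limit scans at most 1000 records, while an explicit `Some(n)` is still honored.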
   
   FYI @jychen7  if you are looking for good candidates for changes to backport for a 7.1 type release, this would be one :)



[GitHub] [arrow-datafusion] jychen7 edited a comment on issue #2109: Almost 100x slowdown on 0.7.0 with CSV file

Posted by GitBox <gi...@apache.org>.

jychen7 edited a comment on issue #2109:
URL: https://github.com/apache/arrow-datafusion/issues/2109#issuecomment-1083882640


   Interesting: on the `master` branch as of now, the default `1000` is set in `CsvReadOptions`
   https://github.com/apache/arrow-datafusion/blob/c43b9ab9922ccbaaf6fe6f27e3d31201989edb1e/datafusion/core/src/execution/options.rs#L56-L66
   
   It should be passed along when converted to `ListingOptions`; I need to investigate more why it is ignored during DDL.
   https://github.com/apache/arrow-datafusion/blob/c43b9ab9922ccbaaf6fe6f27e3d31201989edb1e/datafusion/core/src/execution/options.rs#L106-L114
   
   ---
   
   But I agree we can set a default const of `1000` and use it across the code base.
   https://github.com/apache/arrow-datafusion/blob/c43b9ab9922ccbaaf6fe6f27e3d31201989edb1e/datafusion/core/src/datasource/file_format/csv.rs#L46-L54



[GitHub] [arrow-datafusion] jychen7 edited a comment on issue #2109: Almost 100x slowdown on 0.7.0 with CSV file

Posted by GitBox <gi...@apache.org>.

jychen7 edited a comment on issue #2109:
URL: https://github.com/apache/arrow-datafusion/issues/2109#issuecomment-1083882640


   Interesting: on the `master` branch as of now, the default `1000` is set in `CsvReadOptions` and should be passed along when `CsvReadOptions` is converted to `ListingOptions`
   https://github.com/apache/arrow-datafusion/blob/c43b9ab9922ccbaaf6fe6f27e3d31201989edb1e/datafusion/core/src/execution/options.rs#L56-L66
   https://github.com/apache/arrow-datafusion/blob/c43b9ab9922ccbaaf6fe6f27e3d31201989edb1e/datafusion/core/src/execution/options.rs#L106-L114
   
   However, during DDL, the context uses `CsvFormat::default()` directly, without `CsvReadOptions`, to generate `ListingOptions`
   https://github.com/apache/arrow-datafusion/blob/c43b9ab9922ccbaaf6fe6f27e3d31201989edb1e/datafusion/core/src/execution/context.rs#L220-L223
   https://github.com/apache/arrow-datafusion/blob/c43b9ab9922ccbaaf6fe6f27e3d31201989edb1e/datafusion/core/src/execution/context.rs#L239-L245
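
The two code paths described in this comment can be sketched as follows. All type and function names here (`FormatSketch`, `ReadOptionsSketch`, `to_format`, `format_for_ddl`) are hypothetical simplifications and not the real DataFusion definitions; the sketch only shows how a default set on the options type is lost when the format is built with `Default` directly.

```rust
/// Simplified stand-in for `CsvFormat`; `None` means "no limit set".
#[derive(Default, Debug, PartialEq)]
struct FormatSketch {
    schema_infer_max_rec: Option<usize>,
}

/// Simplified stand-in for `CsvReadOptions`, which carries the 1000 default.
struct ReadOptionsSketch {
    schema_infer_max_records: usize,
}

impl Default for ReadOptionsSketch {
    fn default() -> Self {
        Self { schema_infer_max_records: 1000 }
    }
}

impl ReadOptionsSketch {
    /// Read-API path: the limit from the options is forwarded to the format.
    fn to_format(&self) -> FormatSketch {
        FormatSketch {
            schema_infer_max_rec: Some(self.schema_infer_max_records),
        }
    }
}

/// DDL path in this sketch: the options are bypassed, so the limit stays
/// `None` and a later `unwrap_or(usize::MAX)` would scan the whole file.
fn format_for_ddl() -> FormatSketch {
    FormatSketch::default()
}
```

In this sketch, `ReadOptionsSketch::default().to_format()` carries `Some(1000)`, while `format_for_ddl()` carries `None`.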
   



[GitHub] [arrow-datafusion] jychen7 edited a comment on issue #2109: Almost 100x slowdown on 0.7.0 with CSV file

Posted by GitBox <gi...@apache.org>.

jychen7 edited a comment on issue #2109:
URL: https://github.com/apache/arrow-datafusion/issues/2109#issuecomment-1083923099


   I agree we can set a default const of `1000` on `CsvFormat` and use it across the code base. To align with the other file formats, we should probably set it for JSON as well? (Avro and Parquet are not applicable.)
   
   https://github.com/apache/arrow-datafusion/blob/c43b9ab9922ccbaaf6fe6f27e3d31201989edb1e/datafusion/core/src/datasource/file_format/csv.rs#L46-L54
   https://github.com/apache/arrow-datafusion/blob/c43b9ab9922ccbaaf6fe6f27e3d31201989edb1e/datafusion/core/src/datasource/file_format/json.rs#L48-L50



[GitHub] [arrow-datafusion] matthewmturner commented on issue #2109: Almost 100x slowdown on 0.7.0 with CSV file

Posted by GitBox <gi...@apache.org>.

matthewmturner commented on issue #2109:
URL: https://github.com/apache/arrow-datafusion/issues/2109#issuecomment-1081290792


   Hi @cube2222 - thanks for the report.  A few quick things that come to mind (I haven't had a chance to try replicating or digging into it though, so sorry if these are obvious or you already tried):
   
   
   1. Can you confirm you are running with the --release flag?
   2. The 0 rows in set likely comes from the CREATE EXTERNAL TABLE command. 
   3. I don't see a closing semicolon for the actual query, which may prevent it from being run.
   
   



[GitHub] [arrow-datafusion] cube2222 edited a comment on issue #2109: Almost 100x slowdown on 0.7.0 with CSV file

Posted by GitBox <gi...@apache.org>.

cube2222 edited a comment on issue #2109:
URL: https://github.com/apache/arrow-datafusion/issues/2109#issuecomment-1081461585


   Hey @matthewmturner, sure:
   1. I've installed datafusion-cli through homebrew, I'm assuming that's compiled with --release.
   2. Yes, it does.
   3. Running interactively it's already the CREATE EXTERNAL TABLE statement that takes ages. But you're right, the final query didn't run. With the semicolon, the final query takes 0.4 seconds (so as expected). So it's basically 100 second CREATE EXTERNAL TABLE + 0.4 second query. 
   
   Here's the full output:
   ```
   datafusion-cli
   DataFusion CLI v7.0.0
   ❯ CREATE EXTERNAL TABLE taxi
   STORED AS CSV
   WITH HEADER ROW
   LOCATION './taxi.csv';
   0 rows in set. Query took 106.102 seconds.
   ❯ SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi GROUP BY passenger_count;
   +-----------------+-----------------+------------------------+
   | passenger_count | COUNT(UInt8(1)) | AVG(taxi.total_amount) |
   +-----------------+-----------------+------------------------+
   | 4               | 25510           | 18.452774990199917     |
   | 9               | 1               | 113.6                  |
   | 0               | 42228           | 17.021401676612687     |
   | 5               | 50291           | 17.27092481756182      |
   | 8               | 2               | 95.705                 |
   |                 | 128020          | 32.237151148258164     |
   | 2               | 286461          | 18.097587071189274     |
   | 3               | 72852           | 17.915395871081138     |
   | 1               | 1533197         | 17.6418833065818       |
   | 6               | 32623           | 17.600296416638567     |
   | 7               | 2               | 87.17                  |
   +-----------------+-----------------+------------------------+
   11 rows in set. Query took 0.385 seconds.
   ❯
   ``` 



[GitHub] [arrow-datafusion] cube2222 commented on issue #2109: Almost 100x slowdown on 0.7.0 with CSV file

Posted by GitBox <gi...@apache.org>.

cube2222 commented on issue #2109:
URL: https://github.com/apache/arrow-datafusion/issues/2109#issuecomment-1083590895


   Glad it's useful!



[GitHub] [arrow-datafusion] Dandandan commented on issue #2109: Almost 100x slowdown on 0.7.0 with CSV file due to parsing entire file to infer schema

Posted by GitBox <gi...@apache.org>.

Dandandan commented on issue #2109:
URL: https://github.com/apache/arrow-datafusion/issues/2109#issuecomment-1086799631


   Great find!👍
   
   Another thing that might be useful in the future is to optimize inferring the types.
   
   It makes sense that it is slower than parsing the CSV, given that we don't know the types, but it sounds like it shouldn't be ~100x as slow.
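
To make the cost concrete, here is a deliberately naive sketch of CSV type inference. It is not how arrow-rs implements inference; the names `ColType`, `infer_cell`, `widen`, and `infer_schema` are hypothetical. It shows that every cell is tested against several candidate types, so the work grows with the number of records examined, which is why the `max_records` cap matters.

```rust
/// Candidate column types, ordered from narrowest to widest.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColType {
    Int,
    Float,
    Text,
}

/// Tries each candidate type in turn; empty cells fall through to `Text`.
fn infer_cell(cell: &str) -> ColType {
    if cell.parse::<i64>().is_ok() {
        ColType::Int
    } else if cell.parse::<f64>().is_ok() {
        ColType::Float
    } else {
        ColType::Text
    }
}

/// Returns the wider of two types so a column can hold both values.
fn widen(a: ColType, b: ColType) -> ColType {
    use ColType::*;
    match (a, b) {
        (Text, _) | (_, Text) => Text,
        (Float, _) | (_, Float) => Float,
        _ => Int,
    }
}

/// Infers one type per column from at most `max_records` rows.
/// Assumes simple comma-separated rows without quoting.
fn infer_schema(rows: &[&str], max_records: usize) -> Vec<ColType> {
    let mut schema: Vec<ColType> = Vec::new();
    for row in rows.iter().take(max_records) {
        for (i, cell) in row.split(',').enumerate() {
            let t = infer_cell(cell.trim());
            if i < schema.len() {
                schema[i] = widen(schema[i], t);
            } else {
                schema.push(t);
            }
        }
    }
    schema
}
```

Capping `max_records` trades accuracy for speed: a value that only appears late in the file (such as a string in an otherwise integer column) is missed by the sample.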



[GitHub] [arrow-datafusion] jychen7 edited a comment on issue #2109: Almost 100x slowdown on 0.7.0 with CSV file

Posted by GitBox <gi...@apache.org>.

jychen7 edited a comment on issue #2109:
URL: https://github.com/apache/arrow-datafusion/issues/2109#issuecomment-1083882640


   Interesting: on the `master` branch as of now, the default `1000` is set in `CsvReadOptions` and should be passed along when `CsvReadOptions` is converted to `ListingOptions`
   https://github.com/apache/arrow-datafusion/blob/c43b9ab9922ccbaaf6fe6f27e3d31201989edb1e/datafusion/core/src/execution/options.rs#L56-L66
   https://github.com/apache/arrow-datafusion/blob/c43b9ab9922ccbaaf6fe6f27e3d31201989edb1e/datafusion/core/src/execution/options.rs#L106-L114
   
   However, during DDL, the context uses `CsvFormat::default()` directly, without `CsvReadOptions`, to generate `ListingOptions`.
   https://github.com/apache/arrow-datafusion/blob/c43b9ab9922ccbaaf6fe6f27e3d31201989edb1e/datafusion/core/src/execution/context.rs#L220-L223
   https://github.com/apache/arrow-datafusion/blob/c43b9ab9922ccbaaf6fe6f27e3d31201989edb1e/datafusion/core/src/execution/context.rs#L239-L245
   
   ---
   
   I haven't read the 6.0.0 code yet to see why it is fast.



[GitHub] [arrow-datafusion] alamb commented on issue #2109: Almost 100x slowdown on 0.7.0 with CSV file

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #2109:
URL: https://github.com/apache/arrow-datafusion/issues/2109#issuecomment-1083477862


   Thanks for the report @cube2222 



[GitHub] [arrow-datafusion] jychen7 edited a comment on issue #2109: Almost 100x slowdown on 0.7.0 with CSV file

Posted by GitBox <gi...@apache.org>.

jychen7 edited a comment on issue #2109:
URL: https://github.com/apache/arrow-datafusion/issues/2109#issuecomment-1083882640


   Interesting: on the `master` branch as of now, the default `1000` is set in `CsvReadOptions` and should be passed along when `CsvReadOptions` is converted to `ListingOptions`
   https://github.com/apache/arrow-datafusion/blob/c43b9ab9922ccbaaf6fe6f27e3d31201989edb1e/datafusion/core/src/execution/options.rs#L56-L66
   https://github.com/apache/arrow-datafusion/blob/c43b9ab9922ccbaaf6fe6f27e3d31201989edb1e/datafusion/core/src/execution/options.rs#L106-L114
   
   However, during DDL, the context uses `CsvFormat::default()` directly, without `CsvReadOptions`, to generate `ListingOptions`
   https://github.com/apache/arrow-datafusion/blob/c43b9ab9922ccbaaf6fe6f27e3d31201989edb1e/datafusion/core/src/execution/context.rs#L220-L223
   https://github.com/apache/arrow-datafusion/blob/c43b9ab9922ccbaaf6fe6f27e3d31201989edb1e/datafusion/core/src/execution/context.rs#L239-L245
   
   ---
   
   I agree we can set a default const of `1000` and use it across the code base.
   https://github.com/apache/arrow-datafusion/blob/c43b9ab9922ccbaaf6fe6f27e3d31201989edb1e/datafusion/core/src/datasource/file_format/csv.rs#L46-L54



[GitHub] [arrow-datafusion] jychen7 edited a comment on issue #2109: Almost 100x slowdown on 0.7.0 with CSV file

Posted by GitBox <gi...@apache.org>.

jychen7 edited a comment on issue #2109:
URL: https://github.com/apache/arrow-datafusion/issues/2109#issuecomment-1083923099


   I agree we can set a default const of `1000` on `CsvFormat` and use it across the code base. We should probably set it for JSON as well?
   
   https://github.com/apache/arrow-datafusion/blob/c43b9ab9922ccbaaf6fe6f27e3d31201989edb1e/datafusion/core/src/datasource/file_format/csv.rs#L46-L54
   https://github.com/apache/arrow-datafusion/blob/c43b9ab9922ccbaaf6fe6f27e3d31201989edb1e/datafusion/core/src/datasource/file_format/json.rs#L48-L50


