Posted to github@arrow.apache.org by "comphead (via GitHub)" <gi...@apache.org> on 2023/04/19 19:07:12 UTC

[GitHub] [arrow-datafusion] comphead opened a new pull request, #6065: Optimize row hash

comphead opened a new pull request, #6065:
URL: https://github.com/apache/arrow-datafusion/pull/6065

   # Which issue does this PR close?
   
   
   Closes #6064.
   
   # Rationale for this change
   `slice_and_maybe_filter` spends CPU cycles on vector allocations and can be improved.
   
   # What changes are included in this PR?
   Rewrite `slice_and_maybe_filter` to avoid excessive vector allocations.
   
   # Are these changes tested?
   Yes
   
   # Are there any user-facing changes?
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6065: Optimize row hash

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6065:
URL: https://github.com/apache/arrow-datafusion/pull/6065#issuecomment-1520267181

   I will test this PR using https://github.com/alamb/datafusion-benchmarking/blob/main/bench.sh




[GitHub] [arrow-datafusion] alamb commented on pull request #6065: Optimize row hash

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6065:
URL: https://github.com/apache/arrow-datafusion/pull/6065#issuecomment-1520556535

   Thanks @comphead and @mingmwang




[GitHub] [arrow-datafusion] mingmwang commented on pull request #6065: Optimize row hash

Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on PR #6065:
URL: https://github.com/apache/arrow-datafusion/pull/6065#issuecomment-1518931085

   @comphead 
   Sorry, I tested this PR locally on my Mac and do not see any performance improvement on tpch-q17 (high-cardinality aggregation); instead there is about a 10% regression.
   
   
   **Before this PR**:
   
   q17 (sf=1)
   Running benchmarks with the following options: DataFusionBenchmarkOpt { query: Some(17), debug: false, iterations: 3, partitions: 1, batch_size: 8192, path: "./parquet_data", file_format: "parquet", mem_table: false, output_path: None, disable_statistics: true, enable_scheduler: false }
   Query 17 iteration 0 took 1942.3 ms and returned 1 rows
   Query 17 iteration 1 took 1918.5 ms and returned 1 rows
   Query 17 iteration 2 took 1940.6 ms and returned 1 rows
   Query 17 avg time: 1933.79 ms
   
   q17 (sf=10)
   Running benchmarks with the following options: DataFusionBenchmarkOpt { query: Some(17), debug: false, iterations: 3, partitions: 1, batch_size: 8192, path: "./parquet_data10", file_format: "parquet", mem_table: false, output_path: None, disable_statistics: true, enable_scheduler: false }
   Query 17 iteration 0 took 32352.1 ms and returned 1 rows
   Query 17 iteration 1 took 32210.5 ms and returned 1 rows
   Query 17 iteration 2 took 32003.7 ms and returned 1 rows
   Query 17 avg time: 32188.74 ms
   
   
   **With this PR**:
   q17 (sf=1)
   Running benchmarks with the following options: DataFusionBenchmarkOpt { query: Some(17), debug: false, iterations: 3, partitions: 1, batch_size: 8192, path: "./parquet_data", file_format: "parquet", mem_table: false, output_path: None, disable_statistics: true, enable_scheduler: false }
   Query 17 iteration 0 took 2159.3 ms and returned 1 rows
   Query 17 iteration 1 took 2106.7 ms and returned 1 rows
   Query 17 iteration 2 took 2111.4 ms and returned 1 rows
   Query 17 avg time: 2125.82 ms
   
   q17 (sf=10)
   Running benchmarks with the following options: DataFusionBenchmarkOpt { query: Some(17), debug: false, iterations: 3, partitions: 1, batch_size: 8192, path: "./parquet_data10", file_format: "parquet", mem_table: false, output_path: None, disable_statistics: true, enable_scheduler: false }
   Query 17 iteration 0 took 34505.3 ms and returned 1 rows
   Query 17 iteration 1 took 34240.6 ms and returned 1 rows
   Query 17 iteration 2 took 33967.1 ms and returned 1 rows
   Query 17 avg time: 34237.68 ms
   
   -------------------------------------
   
   How this was tested:
   
   ```rust
   if matches!(self.mode, AggregateMode::Partial | AggregateMode::Single)
                   && normal_aggr_input_values.is_empty()
                   && normal_filter_values.is_empty()
                   && groups_with_rows.len() >= batch.num_rows() / 10
   
   ```
   
   Change the magic number from `10` to `1`:
   
   ```rust
   if matches!(self.mode, AggregateMode::Partial | AggregateMode::Single)
                   && normal_aggr_input_values.is_empty()
                   && normal_filter_values.is_empty()
                   && groups_with_rows.len() >= batch.num_rows() / 1
   
   ```
   
   This forces the accumulator update to use `update_accumulators_using_batch()`, which calls `slice_and_maybe_filter()`.




[GitHub] [arrow-datafusion] comphead commented on pull request #6065: Optimize row hash

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on PR #6065:
URL: https://github.com/apache/arrow-datafusion/pull/6065#issuecomment-1520534020

   I have also double-checked on my side with the latest main, which includes the row_hash optimizations from @mingmwang.
   Testing is against a 13 GB `hits.parquet` file, 2 runs (cold run, warm run).
   
   datafusion-cli built in release mode
   Machine: 2.4 GHz 8-core Intel Core i9, macOS
   
   With this PR
   ```
   DataFusion CLI v23.0.0
   ❯ CREATE EXTERNAL TABLE hits STORED AS PARQUET location 'hits.parquet';
   0 rows in set. Query took 0.079 seconds.
   ❯ SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), AVG("ResolutionWidth") FROM hits GROUP BY "WatchID", "ClientIP" ORDER BY c DESC LIMIT 10;
   +---------------------+-------------+---+---------------------+---------------------------+
   | WatchID             | ClientIP    | c | SUM(hits.IsRefresh) | AVG(hits.ResolutionWidth) |
   +---------------------+-------------+---+---------------------+---------------------------+
   | 6655575552203051303 | 1611957945  | 2 | 0                   | 1638.0                    |
   | 7904046282518428963 | 1509330109  | 2 | 0                   | 1368.0                    |
   | 8566928176839891583 | -1402644643 | 2 | 0                   | 1368.0                    |
   | 7224410078130478461 | -776509581  | 2 | 0                   | 1368.0                    |
   | 7968574085024155935 | -986722817  | 1 | 0                   | 1990.0                    |
   | 8683116696854507598 | -1887352109 | 1 | 0                   | 1917.0                    |
   | 6071982018954122379 | 1154898388  | 1 | 0                   | 1638.0                    |
   | 7044330683984323480 | -765736418  | 1 | 0                   | 1750.0                    |
   | 5170668904757974782 | 580435115   | 1 | 0                   | 1087.0                    |
   | 7121372218861667575 | -888761092  | 1 | 0                   | 1368.0                    |
   +---------------------+-------------+---+---------------------+---------------------------+
   10 rows in set. Query took 51.300 seconds.
   ❯ SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), AVG("ResolutionWidth") FROM hits GROUP BY "WatchID", "ClientIP" ORDER BY c DESC LIMIT 10;
   +---------------------+-------------+---+---------------------+---------------------------+
   | WatchID             | ClientIP    | c | SUM(hits.IsRefresh) | AVG(hits.ResolutionWidth) |
   +---------------------+-------------+---+---------------------+---------------------------+
   | 6655575552203051303 | 1611957945  | 2 | 0                   | 1638.0                    |
   | 7904046282518428963 | 1509330109  | 2 | 0                   | 1368.0                    |
   | 8566928176839891583 | -1402644643 | 2 | 0                   | 1368.0                    |
   | 7224410078130478461 | -776509581  | 2 | 0                   | 1368.0                    |
   | 5811527790243312578 | 56896075    | 1 | 0                   | 1368.0                    |
   | 5998597912391672099 | 807012274   | 1 | 0                   | 1996.0                    |
   | 6071982018954122379 | 1154898388  | 1 | 0                   | 1638.0                    |
   | 7044330683984323480 | -765736418  | 1 | 0                   | 1750.0                    |
   | 5170668904757974782 | 580435115   | 1 | 0                   | 1087.0                    |
   | 7121372218861667575 | -888761092  | 1 | 0                   | 1368.0                    |
   +---------------------+-------------+---+---------------------+---------------------------+
   10 rows in set. Query took 47.066 seconds.
   ```
   
   Without this PR
   ```
   DataFusion CLI v23.0.0
   ❯ CREATE EXTERNAL TABLE hits STORED AS PARQUET location 'hits.parquet';
   0 rows in set. Query took 0.066 seconds.
   ❯ SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), AVG("ResolutionWidth") FROM hits GROUP BY "WatchID", "ClientIP" ORDER BY c DESC LIMIT 10;
   +---------------------+-------------+---+---------------------+---------------------------+
   | WatchID             | ClientIP    | c | SUM(hits.IsRefresh) | AVG(hits.ResolutionWidth) |
   +---------------------+-------------+---+---------------------+---------------------------+
   | 6655575552203051303 | 1611957945  | 2 | 0                   | 1638.0                    |
   | 7904046282518428963 | 1509330109  | 2 | 0                   | 1368.0                    |
   | 8566928176839891583 | -1402644643 | 2 | 0                   | 1368.0                    |
   | 7224410078130478461 | -776509581  | 2 | 0                   | 1368.0                    |
   | 7876723297163966966 | -1495669395 | 1 | 0                   | 1750.0                    |
   | 9012818526311489736 | 1663619136  | 1 | 0                   | 1638.0                    |
   | 6071982018954122379 | 1154898388  | 1 | 0                   | 1638.0                    |
   | 7044330683984323480 | -765736418  | 1 | 0                   | 1750.0                    |
   | 5170668904757974782 | 580435115   | 1 | 0                   | 1087.0                    |
   | 7121372218861667575 | -888761092  | 1 | 0                   | 1368.0                    |
   +---------------------+-------------+---+---------------------+---------------------------+
   10 rows in set. Query took 50.776 seconds.
   ❯ SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), AVG("ResolutionWidth") FROM hits GROUP BY "WatchID", "ClientIP" ORDER BY c DESC LIMIT 10;
   +---------------------+-------------+---+---------------------+---------------------------+
   | WatchID             | ClientIP    | c | SUM(hits.IsRefresh) | AVG(hits.ResolutionWidth) |
   +---------------------+-------------+---+---------------------+---------------------------+
   | 6655575552203051303 | 1611957945  | 2 | 0                   | 1638.0                    |
   | 7904046282518428963 | 1509330109  | 2 | 0                   | 1368.0                    |
   | 8566928176839891583 | -1402644643 | 2 | 0                   | 1368.0                    |
   | 7224410078130478461 | -776509581  | 2 | 0                   | 1368.0                    |
   | 5981193970486754997 | -1725568709 | 1 | 0                   | 1917.0                    |
   | 6089584153122015492 | 1209163516  | 1 | 0                   | 1996.0                    |
   | 6071982018954122379 | 1154898388  | 1 | 0                   | 1638.0                    |
   | 7044330683984323480 | -765736418  | 1 | 0                   | 1750.0                    |
   | 5170668904757974782 | 580435115   | 1 | 0                   | 1087.0                    |
   | 7121372218861667575 | -888761092  | 1 | 0                   | 1368.0                    |
   +---------------------+-------------+---+---------------------+---------------------------+
   10 rows in set. Query took 48.886 seconds.
   ```
   
   Before #6003, DataFusion took 62 and 58 seconds respectively.
   That means @mingmwang's fixes improved performance and `slice_and_maybe_filter` is no longer a bottleneck, so this PR doesn't bring any benefit. I will close it. Thanks all for participating.






[GitHub] [arrow-datafusion] Dandandan commented on a diff in pull request #6065: Optimize row hash

Posted by "Dandandan (via GitHub)" <gi...@apache.org>.
Dandandan commented on code in PR #6065:
URL: https://github.com/apache/arrow-datafusion/pull/6065#discussion_r1174404141


##########
datafusion/core/src/physical_plan/aggregates/row_hash.rs:
##########
@@ -772,22 +773,22 @@ fn slice_and_maybe_filter(
     filter_opt: Option<&Arc<dyn Array>>,
     offsets: &[usize],
 ) -> Result<Vec<ArrayRef>> {
-    let sliced_arrays: Vec<ArrayRef> = aggr_array
-        .iter()
-        .map(|array| array.slice(offsets[0], offsets[1] - offsets[0]))
-        .collect();
+    let null_array = Arc::new(NullArray::new(0)) as ArrayRef;
+    let mut sliced_arrays: Vec<ArrayRef> = vec![null_array; aggr_array.len()];
 
-    let filtered_arrays = match filter_opt.as_ref() {
-        Some(f) => {
-            let sliced = f.slice(offsets[0], offsets[1] - offsets[0]);
-            let filter_array = as_boolean_array(&sliced)?;
+    if let Some(f) = filter_opt {
+        let sliced = f.slice(offsets[0], offsets[1] - offsets[0]);
+        let filter_array = as_boolean_array(&sliced)?;
 
-            sliced_arrays
-                .iter()
-                .map(|array| filter(array, filter_array).unwrap())
-                .collect::<Vec<ArrayRef>>()
+        for (i, arr) in aggr_array.iter().enumerate() {
+            let sliced = &arr.slice(offsets[0], offsets[1] - offsets[0]);
+            sliced_arrays[i] = filter(sliced, filter_array).unwrap();
+        }
+    } else {
+        for (i, arr) in aggr_array.iter().enumerate() {
+            sliced_arrays[i] = arr.slice(offsets[0], offsets[1] - offsets[0]);
         }
-        None => sliced_arrays,
-    };
-    Ok(filtered_arrays)
+    }

Review Comment:
   With `zip` one might even use `collect` into a `Vec` again to avoid pre-initializing the `Vec`.
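   
   A minimal sketch of that suggestion (illustrative only, not code from this PR; it assumes the surrounding `slice_and_maybe_filter` context shown in the diff):
   
   ```rust
   // Build the output with `collect` instead of pre-filling a Vec with
   // placeholder arrays; slice iterators are TrustedLen, so `collect`
   // performs a single exactly-sized allocation.
   let len = offsets[1] - offsets[0];
   let sliced_arrays: Vec<ArrayRef> = if let Some(f) = filter_opt {
       let sliced = f.slice(offsets[0], len);
       let filter_array = as_boolean_array(&sliced)?;
       aggr_array
           .iter()
           .map(|array| filter(&array.slice(offsets[0], len), filter_array).unwrap())
           .collect()
   } else {
       aggr_array
           .iter()
           .map(|array| array.slice(offsets[0], len))
           .collect()
   };
   Ok(sliced_arrays)
   ```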





[GitHub] [arrow-datafusion] comphead commented on a diff in pull request #6065: Optimize row hash

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on code in PR #6065:
URL: https://github.com/apache/arrow-datafusion/pull/6065#discussion_r1171842885


##########
datafusion/core/src/physical_plan/aggregates/row_hash.rs:
##########
@@ -772,22 +773,22 @@ fn slice_and_maybe_filter(
     filter_opt: Option<&Arc<dyn Array>>,
     offsets: &[usize],
 ) -> Result<Vec<ArrayRef>> {
-    let sliced_arrays: Vec<ArrayRef> = aggr_array
-        .iter()
-        .map(|array| array.slice(offsets[0], offsets[1] - offsets[0]))
-        .collect();
+    let null_array = Arc::new(NullArray::new(0)) as ArrayRef;
+    let mut sliced_arrays: Vec<ArrayRef> = vec![null_array; aggr_array.len()];

Review Comment:
   Another optimization: we traverse the input collection only once, instead of twice as in the original implementation.





[GitHub] [arrow-datafusion] alamb commented on pull request #6065: Optimize row hash

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6065:
URL: https://github.com/apache/arrow-datafusion/pull/6065#issuecomment-1520484534

   Here are some benchmark results:
   
   ```
   ++ echo '****** TPCH SF1 (Parquet) ******'
   ****** TPCH SF1 (Parquet) ******
   ++ python3 /home/alamb/arrow-datafusion/benchmarks/compare.py /home/alamb/benchmarking/optimize_row_hash/tpch_sf1_parquet_main.json /home/alamb/benchmarking/optimize_row_hash/t\
   pch_sf1_parquet_branch.json
   ┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
   ┃ Query        ┃ /home/alamb… ┃ /home/alamb… ┃       Change ┃
   ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
   │ QQuery 1     │    1641.50ms │    1668.52ms │    no change │
   │ QQuery 2     │     433.09ms │     457.15ms │ 1.06x slower │
   │ QQuery 3     │     551.05ms │     560.14ms │    no change │
   │ QQuery 4     │     212.82ms │     214.71ms │    no change │
   │ QQuery 5     │     722.18ms │     725.48ms │    no change │
   │ QQuery 6     │     454.35ms │     459.32ms │    no change │
   │ QQuery 7     │    1277.69ms │    1256.32ms │    no change │
   │ QQuery 8     │     714.88ms │     714.78ms │    no change │
   │ QQuery 9     │    1333.88ms │    1365.51ms │    no change │
   │ QQuery 10    │     792.20ms │     818.54ms │    no change │
   │ QQuery 11    │     345.96ms │     351.47ms │    no change │
   │ QQuery 12    │     333.29ms │     336.69ms │    no change │
   │ QQuery 13    │    1375.21ms │    1462.85ms │ 1.06x slower │
   │ QQuery 14    │     465.92ms │     453.73ms │    no change │
   │ QQuery 15    │     432.71ms │     458.58ms │ 1.06x slower │
   │ QQuery 16    │     340.73ms │     358.66ms │ 1.05x slower │
   │ QQuery 17    │    3961.24ms │    4538.85ms │ 1.15x slower │
   │ QQuery 18    │    3493.70ms │    3737.18ms │ 1.07x slower │
   │ QQuery 19    │     761.64ms │     765.50ms │    no change │
   │ QQuery 20    │    1304.52ms │    1478.34ms │ 1.13x slower │
   │ QQuery 21    │    1621.47ms │    1718.01ms │ 1.06x slower │
   │ QQuery 22    │     188.55ms │     189.26ms │    no change │
   └──────────────┴──────────────┴──────────────┴──────────────┘
   ++ echo '****** TPCH SF1 (mem) ******'
   ****** TPCH SF1 (mem) ******
   ++ python3 /home/alamb/arrow-datafusion/benchmarks/compare.py /home/alamb/benchmarking/optimize_row_hash/tpch_sf1_mem_main.json /home/alamb/benchmarking/optimize_row_hash/tpch_\
   sf1_mem_branch.json
   ┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
   ┃ Query        ┃           -o ┃           -o ┃        Change ┃
   ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
   │ QQuery 1     │     929.87ms │     927.61ms │     no change │
   │ QQuery 2     │     312.32ms │     335.58ms │  1.07x slower │
   │ QQuery 3     │     187.35ms │     174.29ms │ +1.07x faster │
   │ QQuery 4     │      99.81ms │     104.31ms │     no change │
   │ QQuery 5     │     475.38ms │     465.07ms │     no change │
   │ QQuery 6     │      38.34ms │      37.14ms │     no change │
   │ QQuery 7     │    1076.25ms │    1165.47ms │  1.08x slower │
   │ QQuery 8     │     254.25ms │     242.57ms │     no change │
   │ QQuery 9     │     631.52ms │     611.30ms │     no change │
   │ QQuery 10    │     335.70ms │     346.45ms │     no change │
   │ QQuery 11    │     297.94ms │     293.44ms │     no change │
   │ QQuery 12    │     154.28ms │     153.63ms │     no change │
   │ QQuery 13    │     842.44ms │     983.07ms │  1.17x slower │
   │ QQuery 14    │      55.30ms │      58.83ms │  1.06x slower │
   │ QQuery 15    │     128.91ms │     123.69ms │     no change │
   │ QQuery 16    │     265.96ms │     258.43ms │     no change │
   │ QQuery 17    │    3514.38ms │    4024.28ms │  1.15x slower │
   │ QQuery 18    │    3158.26ms │    3294.42ms │     no change │
   │ QQuery 19    │     145.06ms │     150.17ms │     no change │
   │ QQuery 20    │    1093.86ms │    1163.19ms │  1.06x slower │
   │ QQuery 21    │    1437.97ms │    1483.74ms │     no change │
   │ QQuery 22    │     151.64ms │     139.58ms │ +1.09x faster │
   └──────────────┴──────────────┴──────────────┴───────────────┘
   ```
   
   
   For reference, here is the same benchmark run against `main` itself: 
   
   ```
   ****** TPCH SF1 (Parquet) ******
   + python3 /home/alamb/arrow-datafusion/benchmarks/compare.py /home/alamb/benchmarking/alamb-main/tpch_sf1_parquet_main.json /home/alamb/benchmarking/alamb-main/tpch_sf1_parquet\
   _branch.json
   ┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
   ┃ Query        ┃ /home/alamb… ┃ /home/alamb… ┃        Change ┃
   ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
   │ QQuery 1     │    1430.86ms │    1423.29ms │     no change │
   │ QQuery 2     │     399.75ms │     405.00ms │     no change │
   │ QQuery 3     │     520.40ms │     525.56ms │     no change │
   │ QQuery 4     │     218.29ms │     223.87ms │     no change │
   │ QQuery 5     │     693.57ms │     685.46ms │     no change │
   │ QQuery 6     │     416.62ms │     423.02ms │     no change │
   │ QQuery 7     │    1258.17ms │    1243.79ms │     no change │
   │ QQuery 8     │     690.25ms │     687.29ms │     no change │
   │ QQuery 9     │    1304.02ms │    1288.01ms │     no change │
   │ QQuery 10    │     770.91ms │     748.94ms │     no change │
   │ QQuery 11    │     356.32ms │     336.55ms │ +1.06x faster │
   │ QQuery 12    │     335.14ms │     329.12ms │     no change │
   │ QQuery 13    │    1170.83ms │    1146.78ms │     no change │
   │ QQuery 14    │     422.25ms │     421.47ms │     no change │
   │ QQuery 15    │     391.14ms │     381.71ms │     no change │
   │ QQuery 16    │     348.38ms │     344.13ms │     no change │
   │ QQuery 17    │    2860.96ms │    2838.27ms │     no change │
   │ QQuery 18    │    3726.11ms │    3734.67ms │     no change │
   │ QQuery 19    │     728.53ms │     737.35ms │     no change │
   │ QQuery 20    │    1250.75ms │    1208.06ms │     no change │
   │ QQuery 21    │    1688.40ms │    1757.45ms │     no change │
   │ QQuery 22    │     192.36ms │     190.43ms │     no change │
   └──────────────┴──────────────┴──────────────┴───────────────┘
   + echo '****** TPCH SF1 (mem) ******'
   ****** TPCH SF1 (mem) ******
   + python3 /home/alamb/arrow-datafusion/benchmarks/compare.py /home/alamb/benchmarking/alamb-main/tpch_sf1_mem_main.json /home/alamb/benchmarking/alamb-main/tpch_sf1_mem_branch.\
   json
   ┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
   ┃ Query        ┃           -o ┃           -o ┃        Change ┃
   ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
   │ QQuery 1     │     759.07ms │     770.73ms │     no change │
   │ QQuery 2     │     269.81ms │     291.05ms │  1.08x slower │
   │ QQuery 3     │     180.67ms │     161.61ms │ +1.12x faster │
   │ QQuery 4     │     105.16ms │     105.46ms │     no change │
   │ QQuery 5     │     467.46ms │     466.11ms │     no change │
   │ QQuery 6     │      38.08ms │      42.72ms │  1.12x slower │
   │ QQuery 7     │    1170.05ms │    1147.43ms │     no change │
   │ QQuery 8     │     249.06ms │     238.74ms │     no change │
   │ QQuery 9     │     613.38ms │     609.98ms │     no change │
   │ QQuery 10    │     342.67ms │     327.23ms │     no change │
   │ QQuery 11    │     279.84ms │     281.69ms │     no change │
   │ QQuery 12    │     143.94ms │     146.57ms │     no change │
   │ QQuery 13    │     676.22ms │     668.79ms │     no change │
   │ QQuery 14    │      53.06ms │      51.73ms │     no change │
   │ QQuery 15    │      98.47ms │      92.68ms │ +1.06x faster │
   │ QQuery 16    │     244.93ms │     257.73ms │  1.05x slower │
   │ QQuery 17    │    2473.33ms │    2503.47ms │     no change │
   │ QQuery 18    │    3150.26ms │    3169.32ms │     no change │
   │ QQuery 19    │     154.90ms │     150.53ms │     no change │
   │ QQuery 20    │     969.12ms │     929.69ms │     no change │
   │ QQuery 21    │    1476.10ms │    1457.34ms │     no change │
   │ QQuery 22    │     148.71ms │     143.20ms │     no change │
   └──────────────┴──────────────┴──────────────┴───────────────┘
   ```




[GitHub] [arrow-datafusion] comphead commented on pull request #6065: Optimize row hash

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on PR #6065:
URL: https://github.com/apache/arrow-datafusion/pull/6065#issuecomment-1515249274

   Performance improvement with the optimized `datafusion::physical_plan::aggregates::row_hash::slice_and_maybe_filter`: ~25% gain (was 68 sec, now 50 sec).
   
   2.4 GHz 8-Core Intel Core i9
   
   Query: https://github.com/apache/arrow-datafusion/issues/5969#issue-1663320375






[GitHub] [arrow-datafusion] ozankabak commented on a diff in pull request #6065: Optimize row hash

Posted by "ozankabak (via GitHub)" <gi...@apache.org>.
ozankabak commented on code in PR #6065:
URL: https://github.com/apache/arrow-datafusion/pull/6065#discussion_r1174222013


##########
datafusion/core/src/physical_plan/aggregates/row_hash.rs:
##########
@@ -772,22 +773,22 @@ fn slice_and_maybe_filter(
     filter_opt: Option<&Arc<dyn Array>>,
     offsets: &[usize],
 ) -> Result<Vec<ArrayRef>> {
-    let sliced_arrays: Vec<ArrayRef> = aggr_array
-        .iter()
-        .map(|array| array.slice(offsets[0], offsets[1] - offsets[0]))
-        .collect();
+    let null_array = Arc::new(NullArray::new(0)) as ArrayRef;
+    let mut sliced_arrays: Vec<ArrayRef> = vec![null_array; aggr_array.len()];
 
-    let filtered_arrays = match filter_opt.as_ref() {
-        Some(f) => {
-            let sliced = f.slice(offsets[0], offsets[1] - offsets[0]);
-            let filter_array = as_boolean_array(&sliced)?;
+    if let Some(f) = filter_opt {
+        let sliced = f.slice(offsets[0], offsets[1] - offsets[0]);
+        let filter_array = as_boolean_array(&sliced)?;
 
-            sliced_arrays
-                .iter()
-                .map(|array| filter(array, filter_array).unwrap())
-                .collect::<Vec<ArrayRef>>()
+        for (i, arr) in aggr_array.iter().enumerate() {
+            let sliced = &arr.slice(offsets[0], offsets[1] - offsets[0]);
+            sliced_arrays[i] = filter(sliced, filter_array).unwrap();
+        }
+    } else {
+        for (i, arr) in aggr_array.iter().enumerate() {
+            sliced_arrays[i] = arr.slice(offsets[0], offsets[1] - offsets[0]);
         }
-        None => sliced_arrays,
-    };
-    Ok(filtered_arrays)
+    }

Review Comment:
   I think writing these loops as a zip of `aggr_array.iter()` and `sliced_arrays.iter_mut()`, and avoiding the `sliced_arrays[i]` access inside the loop, can (1) make the code a little more idiomatic, and (2) result in less implicit bounds-checking at run-time.
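   
   For illustration, a sketch of the filter branch written that way (hypothetical, not code from the PR; `filter_array` as in the diff above):
   
   ```rust
   // Pair each input array with its pre-allocated output slot; the zip
   // iterator carries the length, so no indexing or bounds checks remain.
   for (arr, out) in aggr_array.iter().zip(sliced_arrays.iter_mut()) {
       let sliced = arr.slice(offsets[0], offsets[1] - offsets[0]);
       *out = filter(&sliced, filter_array).unwrap();
   }
   ```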





[GitHub] [arrow-datafusion] mingmwang commented on pull request #6065: Optimize row hash

Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on PR #6065:
URL: https://github.com/apache/arrow-datafusion/pull/6065#issuecomment-1518931880

    cargo run --release --bin tpch -- benchmark datafusion --iterations 3 --path ./parquet_data10 --format parquet --query 17 -n 1 --disable-statistics
   
   My Mac:
   MacBook Pro (16-inch, 2021)
   Apple M1 Max
   Memory 64 GB




[GitHub] [arrow-datafusion] yahoNanJing commented on a diff in pull request #6065: Optimize row hash

Posted by "yahoNanJing (via GitHub)" <gi...@apache.org>.
yahoNanJing commented on code in PR #6065:
URL: https://github.com/apache/arrow-datafusion/pull/6065#discussion_r1174503169


##########
datafusion/core/src/physical_plan/aggregates/row_hash.rs:
##########
@@ -772,22 +773,22 @@ fn slice_and_maybe_filter(
     filter_opt: Option<&Arc<dyn Array>>,
     offsets: &[usize],
 ) -> Result<Vec<ArrayRef>> {
-    let sliced_arrays: Vec<ArrayRef> = aggr_array
-        .iter()
-        .map(|array| array.slice(offsets[0], offsets[1] - offsets[0]))
-        .collect();
+    let null_array = Arc::new(NullArray::new(0)) as ArrayRef;
+    let mut sliced_arrays: Vec<ArrayRef> = vec![null_array; aggr_array.len()];
 
-    let filtered_arrays = match filter_opt.as_ref() {
-        Some(f) => {
-            let sliced = f.slice(offsets[0], offsets[1] - offsets[0]);
-            let filter_array = as_boolean_array(&sliced)?;
+    if let Some(f) = filter_opt {
+        let sliced = f.slice(offsets[0], offsets[1] - offsets[0]);
+        let filter_array = as_boolean_array(&sliced)?;
 
-            sliced_arrays
-                .iter()
-                .map(|array| filter(array, filter_array).unwrap())
-                .collect::<Vec<ArrayRef>>()
+        for (i, arr) in aggr_array.iter().enumerate() {
+            let sliced = &arr.slice(offsets[0], offsets[1] - offsets[0]);
+            sliced_arrays[i] = filter(sliced, filter_array).unwrap();

Review Comment:
   Actually, I don't think it's a good idea to do more than one thing in a loop, especially since slicing an Array is not such a lightweight operation. I would prefer the previous implementation.





[GitHub] [arrow-datafusion] mingmwang commented on pull request #6065: Optimize row hash

Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on PR #6065:
URL: https://github.com/apache/arrow-datafusion/pull/6065#issuecomment-1515657193

   I will do some tests locally tomorrow.




[GitHub] [arrow-datafusion] comphead commented on a diff in pull request #6065: Optimize row hash

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on code in PR #6065:
URL: https://github.com/apache/arrow-datafusion/pull/6065#discussion_r1171840645


##########
datafusion/core/src/physical_plan/aggregates/row_hash.rs:
##########
@@ -772,22 +773,22 @@ fn slice_and_maybe_filter(
     filter_opt: Option<&Arc<dyn Array>>,
     offsets: &[usize],
 ) -> Result<Vec<ArrayRef>> {
-    let sliced_arrays: Vec<ArrayRef> = aggr_array
-        .iter()
-        .map(|array| array.slice(offsets[0], offsets[1] - offsets[0]))
-        .collect();
+    let null_array = Arc::new(NullArray::new(0)) as ArrayRef;
+    let mut sliced_arrays: Vec<ArrayRef> = vec![null_array; aggr_array.len()];

Review Comment:
   The key point of this PR is to get rid of extra allocations (https://github.com/apache/arrow-datafusion/issues/5969#issuecomment-1504585917), which allows a 20-25% speed gain.
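   
   For context, a sketch of the allocation pattern in question, mirroring the original filter path in the diff above (each `collect` allocates a fresh `Vec`):
   
   ```rust
   // 1st Vec allocation: sliced copies of every aggregate array.
   let sliced_arrays: Vec<ArrayRef> = aggr_array
       .iter()
       .map(|array| array.slice(offsets[0], offsets[1] - offsets[0]))
       .collect();
   // 2nd Vec allocation: a second Vec holding the filtered results.
   let filtered_arrays: Vec<ArrayRef> = sliced_arrays
       .iter()
       .map(|array| filter(array, filter_array).unwrap())
       .collect();
   ```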
   
   





[GitHub] [arrow-datafusion] mingmwang commented on pull request #6065: Optimize row hash

Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on PR #6065:
URL: https://github.com/apache/arrow-datafusion/pull/6065#issuecomment-1518940220

   From the flame graph, the hot path of `slice_and_maybe_filter` is `SpecFromIter`.
   And the hot path of `SpecFromIter::from_iter` is the slicing of the Arrow arrays, not the memory allocations of the Vec.
   
   
   ```rust
   impl<T> SpecFromIter<T, IntoIter<T>> for Vec<T> {
       fn from_iter(iterator: IntoIter<T>) -> Self {
           // A common case is passing a vector into a function which immediately
           // re-collects into a vector. We can short circuit this if the IntoIter
           // has not been advanced at all.
           // When it has been advanced We can also reuse the memory and move the data to the front.
           // But we only do so when the resulting Vec wouldn't have more unused capacity
           // than creating it through the generic FromIterator implementation would. That limitation
           // is not strictly necessary as Vec's allocation behavior is intentionally unspecified.
           // But it is a conservative choice.
           let has_advanced = iterator.buf.as_ptr() as *const _ != iterator.ptr;
           if !has_advanced || iterator.len() >= iterator.cap / 2 {
               unsafe {
                   let it = ManuallyDrop::new(iterator);
                   if has_advanced {
                       ptr::copy(it.ptr, it.buf.as_ptr(), it.len());
                   }
                   return Vec::from_raw_parts(it.buf.as_ptr(), it.len(), it.cap);
               }
           }
   
           let mut vec = Vec::new();
           // must delegate to spec_extend() since extend() itself delegates
           // to spec_from for empty Vecs
           vec.spec_extend(iterator);
           vec
       }
   }
   ```
   
   




[GitHub] [arrow-datafusion] mingmwang commented on pull request #6065: Optimize row hash

Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on PR #6065:
URL: https://github.com/apache/arrow-datafusion/pull/6065#issuecomment-1519479869

   Can someone else help test and verify this on other machines?
   I suspect it is very machine-specific, and I cannot explain the performance regression either.




[GitHub] [arrow-datafusion] comphead closed pull request #6065: Optimize row hash

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead closed pull request #6065: Optimize row hash
URL: https://github.com/apache/arrow-datafusion/pull/6065




[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6065: Optimize row hash

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6065:
URL: https://github.com/apache/arrow-datafusion/pull/6065#discussion_r1174395356


##########
datafusion/core/src/physical_plan/aggregates/row_hash.rs:
##########
@@ -772,22 +773,22 @@ fn slice_and_maybe_filter(
     filter_opt: Option<&Arc<dyn Array>>,
     offsets: &[usize],
 ) -> Result<Vec<ArrayRef>> {
-    let sliced_arrays: Vec<ArrayRef> = aggr_array
-        .iter()
-        .map(|array| array.slice(offsets[0], offsets[1] - offsets[0]))
-        .collect();
+    let null_array = Arc::new(NullArray::new(0)) as ArrayRef;
+    let mut sliced_arrays: Vec<ArrayRef> = vec![null_array; aggr_array.len()];

Review Comment:
   See https://github.com/apache/arrow-datafusion/issues/5969#issuecomment-1513959517 -- reported performance improvement





[GitHub] [arrow-datafusion] mingmwang commented on pull request #6065: Optimize row hash

Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on PR #6065:
URL: https://github.com/apache/arrow-datafusion/pull/6065#issuecomment-1518934173

   @yahoNanJing 




[GitHub] [arrow-datafusion] comphead commented on pull request #6065: Optimize row hash

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on PR #6065:
URL: https://github.com/apache/arrow-datafusion/pull/6065#issuecomment-1519174734

   Thanks folks for the feedback
   
   @mingmwang this change was tested before your PR #6003 was merged, and only for q32. I will retest against the latest codebase soon with other benchmarks.
   
   @ozankabak mutating by dereferencing sounds good to me, I will try it out.
   
   @yahoNanJing I'm not following your point: are you saying two passes, each doing one operation, are faster than one pass doing two operations per iteration?
   
   I will retest soon and share results. If there is still no perf benefit after #6003, I will close the PR.




[GitHub] [arrow-datafusion] comphead commented on pull request #6065: Optimize row hash

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on PR #6065:
URL: https://github.com/apache/arrow-datafusion/pull/6065#issuecomment-1517995622

   > I will do some tests locally tomorrow.
   
   Hi @mingmwang, did you get a chance to test it?




[GitHub] [arrow-datafusion] mingmwang commented on pull request #6065: Optimize row hash

Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on PR #6065:
URL: https://github.com/apache/arrow-datafusion/pull/6065#issuecomment-1518960238

   https://doc.rust-lang.org/std/iter/trait.TrustedLen.html#impl-TrustedLen-for-Iter%3C'_,+T%3E-1
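   
   Presumably the relevant point: iterators over slices implement `TrustedLen`, so `collect` on a `map` over them can size the output `Vec` exactly up front. A toy sketch:
   
   ```rust
   fn double_all(xs: &[i32]) -> Vec<i32> {
       // `xs.iter().map(...)` is TrustedLen, so `collect` performs one
       // exactly-sized allocation instead of growing the Vec incrementally.
       xs.iter().map(|x| x * 2).collect()
   }
   ```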
   




[GitHub] [arrow-datafusion] Dandandan commented on a diff in pull request #6065: Optimize row hash

Posted by "Dandandan (via GitHub)" <gi...@apache.org>.
Dandandan commented on code in PR #6065:
URL: https://github.com/apache/arrow-datafusion/pull/6065#discussion_r1171833111


##########
datafusion/core/src/physical_plan/aggregates/row_hash.rs:
##########
@@ -772,22 +773,22 @@ fn slice_and_maybe_filter(
     filter_opt: Option<&Arc<dyn Array>>,
     offsets: &[usize],
 ) -> Result<Vec<ArrayRef>> {
-    let sliced_arrays: Vec<ArrayRef> = aggr_array
-        .iter()
-        .map(|array| array.slice(offsets[0], offsets[1] - offsets[0]))
-        .collect();
+    let null_array = Arc::new(NullArray::new(0)) as ArrayRef;
+    let mut sliced_arrays: Vec<ArrayRef> = vec![null_array; aggr_array.len()];

Review Comment:
   I don't see why this should be faster? 🤔


