You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "yahoNanJing (via GitHub)" <gi...@apache.org> on 2023/04/10 11:23:14 UTC

[GitHub] [arrow-datafusion] yahoNanJing opened a new pull request, #5941: Issue 5940

yahoNanJing opened a new pull request, #5941:
URL: https://github.com/apache/arrow-datafusion/pull/5941

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #5940.
   
   # Rationale for this change
   
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   After applying patches based on #5866 and #5904, based on the flame graph for q17 generated by 
   `sudo CARGO_PROFILE_RELEASE_DEBUG=true cargo flamegraph --freq 500 --bin tpch -- benchmark datafusion --path ./data-parquet/ --format parquet --partitions 1 -q 17 --iterations 10`
   
   
   
   # What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are these changes tested?
   
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   # Are there any user-facing changes?
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] yahoNanJing commented on a diff in pull request #5941: Remove unnecessary equality check for JoinHashMap - Issue 5940

Posted by "yahoNanJing (via GitHub)" <gi...@apache.org>.
yahoNanJing commented on code in PR #5941:
URL: https://github.com/apache/arrow-datafusion/pull/5941#discussion_r1162238740


##########
datafusion/core/src/physical_plan/joins/hash_join.rs:
##########
@@ -594,16 +593,17 @@ pub fn update_hash(
     let hash_values = create_hashes(&keys_values, random_state, hashes_buffer)?;
 
     // insert hashes to key of the hashmap
-    for (row, hash_value) in hash_values.iter().enumerate() {
-        let item = hash_map
-            .0
-            .get_mut(*hash_value, |(hash, _)| *hash_value == *hash);
+    let row_start = offset;
+    let row_end = offset + hash_values.len();
+    for (row, hash_value) in (row_start..row_end).zip(hash_values.iter()) {
+        // the hash value is the key, always true
+        let item = hash_map.0.get_mut(*hash_value, |_| true);

Review Comment:
   Thanks @Dandandan for pointing it out. You are right. It's my bad. The RawTable utilizes the open addressing policy for looking for matched entries. I'll revert this commit.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Dandandan commented on pull request #5941: Remove unnecessary equality check for JoinHashMap - Issue 5940

Posted by "Dandandan (via GitHub)" <gi...@apache.org>.
Dandandan commented on PR #5941:
URL: https://github.com/apache/arrow-datafusion/pull/5941#issuecomment-1502178618

   It would be nice to show whether we are not getting regressions for other queries.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Change back SmallVec to Vec for JoinHashMap - Issue 5940 [arrow-datafusion]

Posted by "Dandandan (via GitHub)" <gi...@apache.org>.
Dandandan closed pull request #5941: Change back SmallVec to Vec for JoinHashMap - Issue 5940
URL: https://github.com/apache/arrow-datafusion/pull/5941


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #5941: Remove unnecessary equality check for JoinHashMap - Issue 5940

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #5941:
URL: https://github.com/apache/arrow-datafusion/pull/5941#issuecomment-1502057461

   There is a CI check https://github.com/apache/arrow-datafusion/actions/runs/4657156069/jobs/8241558786?pr=5941 appears to be failing
   
   Given that we have a benchmark that improves (q17) I think we should merge this PR, assuming we can resolve the CI failure satisfactorily. 
   
   Thank you for this @yahoNanJing 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Change back SmallVec to Vec for JoinHashMap - Issue 5940 [arrow-datafusion]

Posted by "Dandandan (via GitHub)" <gi...@apache.org>.
Dandandan commented on PR #5941:
URL: https://github.com/apache/arrow-datafusion/pull/5941#issuecomment-1900409350

   We have improved the datastructure of join in other ways already, so closing the PR in order to clean it up


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] yahoNanJing commented on pull request #5941: Change back SmallVec to Vec for JoinHashMap - Issue 5940

Posted by "yahoNanJing (via GitHub)" <gi...@apache.org>.
yahoNanJing commented on PR #5941:
URL: https://github.com/apache/arrow-datafusion/pull/5941#issuecomment-1502741913

   Thanks @Dandandan and @alamb. I tested this PR with other sqls from TPCH. It seems it's not always true of bringing performance benefits. I think we can hold on this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] yahoNanJing commented on pull request #5941: Issue 5940

Posted by "yahoNanJing (via GitHub)" <gi...@apache.org>.
yahoNanJing commented on PR #5941:
URL: https://github.com/apache/arrow-datafusion/pull/5941#issuecomment-1501717963

   Hi @Dandandan and @alamb, could you help review this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Dandandan commented on a diff in pull request #5941: Issue 5940

Posted by "Dandandan (via GitHub)" <gi...@apache.org>.
Dandandan commented on code in PR #5941:
URL: https://github.com/apache/arrow-datafusion/pull/5941#discussion_r1161804695


##########
datafusion/core/src/physical_plan/joins/hash_join.rs:
##########
@@ -543,7 +542,7 @@ async fn collect_left_input(
         )
     })? / 7)
         .next_power_of_two();
-    // 32 bytes per `(u64, SmallVec<[u64; 1]>)`
+    // 32 bytes per `(u64, Vec<[u64; 1]>)`

Review Comment:
   This probably is a trade-off, joins with unique keys will probably get slower.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #5941: Change back SmallVec to Vec for JoinHashMap - Issue 5940

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #5941:
URL: https://github.com/apache/arrow-datafusion/pull/5941#issuecomment-1503092951

   Marking to draft so we don't accidentally merge it


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Dandandan commented on a diff in pull request #5941: Remove unnecessary equality check for JoinHashMap - Issue 5940

Posted by "Dandandan (via GitHub)" <gi...@apache.org>.
Dandandan commented on code in PR #5941:
URL: https://github.com/apache/arrow-datafusion/pull/5941#discussion_r1161987716


##########
datafusion/core/src/physical_plan/joins/hash_join.rs:
##########
@@ -594,16 +593,17 @@ pub fn update_hash(
     let hash_values = create_hashes(&keys_values, random_state, hashes_buffer)?;
 
     // insert hashes to key of the hashmap
-    for (row, hash_value) in hash_values.iter().enumerate() {
-        let item = hash_map
-            .0
-            .get_mut(*hash_value, |(hash, _)| *hash_value == *hash);
+    let row_start = offset;
+    let row_end = offset + hash_values.len();
+    for (row, hash_value) in (row_start..row_end).zip(hash_values.iter()) {
+        // the hash value is the key, always true
+        let item = hash_map.0.get_mut(*hash_value, |_| true);

Review Comment:
   The equality check might still be necessary.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org