You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/11/20 13:23:26 UTC

[GitHub] [arrow] vertexclique opened a new pull request #8722: ARROW-10664: [Rust] Implement AVX512 sort

vertexclique opened a new pull request #8722:
URL: https://github.com/apache/arrow/pull/8722


   Before:
   ```
   sort 2^10               time:   [94.137 us 94.154 us 94.174 us]                      
   Found 2 outliers among 100 measurements (2.00%)
     2 (2.00%) high mild
   
   sort 2^12               time:   [483.03 us 483.08 us 483.13 us]                      
   Found 8 outliers among 100 measurements (8.00%)
     1 (1.00%) low severe
     2 (2.00%) low mild
     4 (4.00%) high mild
     1 (1.00%) high severe
   
   sort nulls 2^10         time:   [59.782 us 59.800 us 59.818 us]                            
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high mild
   
   sort nulls 2^12         time:   [296.84 us 296.89 us 296.97 us]                            
   Found 4 outliers among 100 measurements (4.00%)
     3 (3.00%) high mild
     1 (1.00%) high severe
   
   ```
   
   After:
   ```
   sort 2^10               time:   [73.098 us 73.119 us 73.148 us]                      
                           change: [-22.404% -22.381% -22.356%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 8 outliers among 100 measurements (8.00%)
     3 (3.00%) low mild
     1 (1.00%) high mild
     4 (4.00%) high severe
   
   sort 2^12               time:   [354.82 us 354.98 us 355.23 us]                      
                           change: [-26.550% -26.530% -26.505%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 8 outliers among 100 measurements (8.00%)
     1 (1.00%) low mild
     3 (3.00%) high mild
     4 (4.00%) high severe
   
   sort nulls 2^10         time:   [53.570 us 53.577 us 53.585 us]                            
                           change: [-10.407% -10.378% -10.350%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 6 outliers among 100 measurements (6.00%)
     3 (3.00%) low mild
     3 (3.00%) high mild
   
   sort nulls 2^12         time:   [238.46 us 238.50 us 238.55 us]                            
                           change: [-19.670% -19.650% -19.627%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 5 outliers among 100 measurements (5.00%)
     1 (1.00%) low severe
     2 (2.00%) high mild
     2 (2.00%) high severe
   
   ```
   
   Note: Waiting for other bit or PR to merge to rebase on top of it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8722: ARROW-10664: [Rust] Implement AVX512 sort

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8722:
URL: https://github.com/apache/arrow/pull/8722#issuecomment-731171915


   https://issues.apache.org/jira/browse/ARROW-10664


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jhorstmann commented on a change in pull request #8722: ARROW-10664: [Rust] Implement AVX512 sort

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on a change in pull request #8722:
URL: https://github.com/apache/arrow/pull/8722#discussion_r528238296



##########
File path: rust/arrow/src/compute/kernels/sort.rs
##########
@@ -222,6 +224,106 @@ impl Default for SortOptions {
     }
 }
 
+#[cfg(feature = "avx512")]
+/// Sort primitive values
+fn sort_primitive<T>(
+    values: &ArrayRef,
+    value_indices: Vec<u32>,
+    null_indices: Vec<u32>,
+    nan_indices: Vec<u32>,
+    options: &SortOptions,
+) -> Result<UInt32Array>
+where
+    T: ArrowPrimitiveType,
+    T::Native: std::cmp::PartialOrd,
+{
+    let values = as_primitive_array::<T>(values);
+    let descending = options.descending;
+
+    let mut nulls = null_indices;
+    let mut nans = nan_indices;
+
+    let perm_exch_width = PERMUTE_EXCHANGE_WIDTH * 2;
+
+    let valids = if crate::util::bit_util::is_power_of_two(values.len())
+        && values.len() > perm_exch_width
+    {
+        let value_data = value_indices
+            .iter()
+            .copied()
+            .map(|e| e as i64)
+            .collect::<Vec<i64>>();
+        let value_data = unsafe {
+            if !descending {
+                avx512_vec_sort_i64(&value_data)
+            } else {
+                let mut d = avx512_vec_sort_i64(&value_data);
+                d.reverse();
+                nans.reverse();
+                nulls.reverse();
+                d
+            }
+        };
+        // create tuples after the actual sorting

Review comment:
       I must be missing something, but how does this work if the goal of the function is to sort the indices based on the values?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #8722: ARROW-10664: [Rust] Implement AVX512 sort

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #8722:
URL: https://github.com/apache/arrow/pull/8722#discussion_r528807575



##########
File path: rust/arrow/src/arch/avx512.rs
##########
@@ -41,6 +45,173 @@ pub(crate) unsafe fn avx512_bin_or(left: &[u8], right: &[u8], res: &mut [u8]) {
     std::ptr::copy_nonoverlapping(s, d, std::mem::size_of::<__m512i>());
 }
 
+///
+/// Sorting network for a single SIMD vector of i64s
+#[target_feature(enable = "avx512f")]
+pub(crate) unsafe fn avx512_vec_sort_i64_single<'a>(input: &[i64]) -> [i64; 8] {
+    use core::arch::x86_64::{
+        __m512i, _mm512_loadu_epi64, _mm512_mask_mov_epi64, _mm512_max_epi64,
+        _mm512_min_epi64, _mm512_permutexvar_epi64, _mm512_set_epi64,
+    };
+
+    // First wiring's permute exchange for the sorting network
+    let mut inp: __m512i = _mm512_loadu_epi64(input.as_ptr() as *const _);
+    let idxnn1: __m512i = _mm512_set_epi64(6, 7, 4, 5, 2, 3, 0, 1);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn1, inp);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, inp);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, inp);
+    inp = _mm512_mask_mov_epi64(wire_n_min, 0xAA, wire_n_max);
+
+    // Second wiring's permute exchange for the sorting network
+    let idxnn2: __m512i = _mm512_set_epi64(4, 5, 6, 7, 0, 1, 2, 3);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn2, inp);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, inp);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, inp);
+    inp = _mm512_mask_mov_epi64(wire_n_min, 0xCC, wire_n_max);
+
+    // Third wiring's permute exchange for the sorting network
+    let idxnn3: __m512i = _mm512_set_epi64(6, 7, 4, 5, 2, 3, 0, 1);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn3, inp);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, inp);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, inp);
+    inp = _mm512_mask_mov_epi64(wire_n_min, 0xAA, wire_n_max);
+
+    // Fourth wiring's permute exchange, does forwarding.
+    let idxnn4: __m512i = _mm512_set_epi64(0, 1, 2, 3, 4, 5, 6, 7);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn4, inp);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, inp);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, inp);
+    inp = _mm512_mask_mov_epi64(wire_n_min, 0xF0, wire_n_max);
+
+    // Fifth wiring's permute exchange for the sorting network
+    let idxnn5: __m512i = _mm512_set_epi64(5, 4, 7, 6, 1, 0, 3, 2);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn5, inp);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, inp);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, inp);
+    inp = _mm512_mask_mov_epi64(wire_n_min, 0xCC, wire_n_max);
+
+    // Sixth wiring's permute exchange for the sorting network
+    let idxnn6: __m512i = _mm512_set_epi64(6, 7, 4, 5, 2, 3, 0, 1);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn6, inp);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, inp);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, inp);
+    inp = _mm512_mask_mov_epi64(wire_n_min, 0xAA, wire_n_max);
+
+    std::mem::transmute(inp)
+}
+
+///
+/// Sorting network with SIMD merger for two SIMD vector of i64s
+#[target_feature(enable = "avx512f")]
+pub(crate) unsafe fn avx512_vec_sort_i64_double(
+    left: &[i64],
+    right: &[i64],
+) -> [[i64; 8]; 2] {
+    use core::arch::x86_64::{
+        __m512i, _mm512_loadu_epi64, _mm512_mask_mov_epi64, _mm512_max_epi64,
+        _mm512_min_epi64, _mm512_permutexvar_epi64, _mm512_set_epi64,
+    };
+
+    let (l, r) = (
+        avx512_vec_sort_i64_single(left),
+        avx512_vec_sort_i64_single(right),
+    );
+
+    let mut l: __m512i = _mm512_loadu_epi64(l.as_ptr() as *const _);
+    let mut r: __m512i = _mm512_loadu_epi64(r.as_ptr() as *const _);
+
+    // Full blend of the both vector wires
+    let idxnn1: __m512i = _mm512_set_epi64(0, 1, 2, 3, 4, 5, 6, 7);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn1, l);
+    l = _mm512_min_epi64(r, wire_n);
+    r = _mm512_max_epi64(r, wire_n);
+
+    // Carries on with normal sorting network operation
+    let idxnn2: __m512i = _mm512_set_epi64(3, 2, 1, 0, 7, 6, 5, 4);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn2, l);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, l);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, l);
+    l = _mm512_mask_mov_epi64(wire_n_min, 0xF0, wire_n_max); // 0x33
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn2, r);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, r);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, r);
+    r = _mm512_mask_mov_epi64(wire_n_min, 0xF0, wire_n_max); // 0x33
+
+    let idxnn3: __m512i = _mm512_set_epi64(5, 4, 7, 6, 1, 0, 3, 2);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn3, l);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, l);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, l);
+    l = _mm512_mask_mov_epi64(wire_n_min, 0xCC, wire_n_max);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn3, r);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, r);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, r);
+    r = _mm512_mask_mov_epi64(wire_n_min, 0xCC, wire_n_max);
+
+    let idxnn4: __m512i = _mm512_set_epi64(6, 7, 4, 5, 2, 3, 0, 1);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn4, l);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, l);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, l);
+    l = _mm512_mask_mov_epi64(wire_n_min, 0xAA, wire_n_max);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn4, r);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, r);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, r);
+    r = _mm512_mask_mov_epi64(wire_n_min, 0xAA, wire_n_max);
+
+    let lf: [i64; 8] = std::mem::transmute(l);
+    let rf: [i64; 8] = std::mem::transmute(r);
+
+    [lf, rf]
+}
+
+///
+/// Permute exchange width for the AVX-512 SIMD application
+pub(crate) const PERMUTE_EXCHANGE_WIDTH: usize = 8;
+
+///
+/// Merge layer for sorting network
+fn merger_net(mut input: Vec<i64>) -> Vec<i64> {
+    let half = input.len() / 2;
+    if half > PERMUTE_EXCHANGE_WIDTH {
+        (0..half).into_iter().for_each(|e| unsafe {
+            if input[e] > input[e + half] {
+                let pl: *mut i64 = &mut input[e];
+                let pr: *mut i64 = &mut input[e + half];
+                std::ptr::swap(pl, pr);
+            }
+        });
+        merger_net(input[..half].to_vec());
+        merger_net(input[half..].to_vec());
+    }
+    input
+}
+
+///
+/// Cold path marker for hinting the CPU for the further optimizations.
+#[inline]
+#[cold]
+fn cold() {}
+
+///
+/// Size independent sorter for any vector which is power of two.
+pub(crate) unsafe fn avx512_vec_sort_i64(input: &[i64]) -> Vec<i64> {
+    if (input.len() / 2) == PERMUTE_EXCHANGE_WIDTH {

Review comment:
       Might use here `let half = input.len() / 2` as well? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #8722: ARROW-10664: [Rust] Implement AVX512 sort

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #8722:
URL: https://github.com/apache/arrow/pull/8722#discussion_r528935157



##########
File path: rust/arrow/src/arch/avx512.rs
##########
@@ -41,6 +45,173 @@ pub(crate) unsafe fn avx512_bin_or(left: &[u8], right: &[u8], res: &mut [u8]) {
     std::ptr::copy_nonoverlapping(s, d, std::mem::size_of::<__m512i>());
 }
 
+///
+/// Sorting network for a single SIMD vector of i64s
+#[target_feature(enable = "avx512f")]
+pub(crate) unsafe fn avx512_vec_sort_i64_single<'a>(input: &[i64]) -> [i64; 8] {
+    use core::arch::x86_64::{
+        __m512i, _mm512_loadu_epi64, _mm512_mask_mov_epi64, _mm512_max_epi64,
+        _mm512_min_epi64, _mm512_permutexvar_epi64, _mm512_set_epi64,
+    };
+
+    // First wiring's permute exchange for the sorting network
+    let mut inp: __m512i = _mm512_loadu_epi64(input.as_ptr() as *const _);
+    let idxnn1: __m512i = _mm512_set_epi64(6, 7, 4, 5, 2, 3, 0, 1);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn1, inp);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, inp);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, inp);
+    inp = _mm512_mask_mov_epi64(wire_n_min, 0xAA, wire_n_max);
+
+    // Second wiring's permute exchange for the sorting network
+    let idxnn2: __m512i = _mm512_set_epi64(4, 5, 6, 7, 0, 1, 2, 3);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn2, inp);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, inp);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, inp);
+    inp = _mm512_mask_mov_epi64(wire_n_min, 0xCC, wire_n_max);
+
+    // Third wiring's permute exchange for the sorting network
+    let idxnn3: __m512i = _mm512_set_epi64(6, 7, 4, 5, 2, 3, 0, 1);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn3, inp);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, inp);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, inp);
+    inp = _mm512_mask_mov_epi64(wire_n_min, 0xAA, wire_n_max);
+
+    // Fourth wiring's permute exchange, does forwarding.
+    let idxnn4: __m512i = _mm512_set_epi64(0, 1, 2, 3, 4, 5, 6, 7);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn4, inp);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, inp);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, inp);
+    inp = _mm512_mask_mov_epi64(wire_n_min, 0xF0, wire_n_max);
+
+    // Fifth wiring's permute exchange for the sorting network
+    let idxnn5: __m512i = _mm512_set_epi64(5, 4, 7, 6, 1, 0, 3, 2);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn5, inp);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, inp);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, inp);
+    inp = _mm512_mask_mov_epi64(wire_n_min, 0xCC, wire_n_max);
+
+    // Sixth wiring's permute exchange for the sorting network
+    let idxnn6: __m512i = _mm512_set_epi64(6, 7, 4, 5, 2, 3, 0, 1);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn6, inp);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, inp);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, inp);
+    inp = _mm512_mask_mov_epi64(wire_n_min, 0xAA, wire_n_max);
+
+    std::mem::transmute(inp)
+}
+
+///
+/// Sorting network with SIMD merger for two SIMD vector of i64s
+#[target_feature(enable = "avx512f")]
+pub(crate) unsafe fn avx512_vec_sort_i64_double(
+    left: &[i64],
+    right: &[i64],
+) -> [[i64; 8]; 2] {
+    use core::arch::x86_64::{
+        __m512i, _mm512_loadu_epi64, _mm512_mask_mov_epi64, _mm512_max_epi64,
+        _mm512_min_epi64, _mm512_permutexvar_epi64, _mm512_set_epi64,
+    };
+
+    let (l, r) = (
+        avx512_vec_sort_i64_single(left),
+        avx512_vec_sort_i64_single(right),
+    );
+
+    let mut l: __m512i = _mm512_loadu_epi64(l.as_ptr() as *const _);
+    let mut r: __m512i = _mm512_loadu_epi64(r.as_ptr() as *const _);
+
+    // Full blend of the both vector wires
+    let idxnn1: __m512i = _mm512_set_epi64(0, 1, 2, 3, 4, 5, 6, 7);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn1, l);
+    l = _mm512_min_epi64(r, wire_n);
+    r = _mm512_max_epi64(r, wire_n);
+
+    // Carries on with normal sorting network operation
+    let idxnn2: __m512i = _mm512_set_epi64(3, 2, 1, 0, 7, 6, 5, 4);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn2, l);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, l);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, l);
+    l = _mm512_mask_mov_epi64(wire_n_min, 0xF0, wire_n_max); // 0x33
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn2, r);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, r);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, r);
+    r = _mm512_mask_mov_epi64(wire_n_min, 0xF0, wire_n_max); // 0x33
+
+    let idxnn3: __m512i = _mm512_set_epi64(5, 4, 7, 6, 1, 0, 3, 2);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn3, l);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, l);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, l);
+    l = _mm512_mask_mov_epi64(wire_n_min, 0xCC, wire_n_max);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn3, r);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, r);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, r);
+    r = _mm512_mask_mov_epi64(wire_n_min, 0xCC, wire_n_max);
+
+    let idxnn4: __m512i = _mm512_set_epi64(6, 7, 4, 5, 2, 3, 0, 1);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn4, l);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, l);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, l);
+    l = _mm512_mask_mov_epi64(wire_n_min, 0xAA, wire_n_max);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn4, r);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, r);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, r);
+    r = _mm512_mask_mov_epi64(wire_n_min, 0xAA, wire_n_max);
+
+    let lf: [i64; 8] = std::mem::transmute(l);
+    let rf: [i64; 8] = std::mem::transmute(r);
+
+    [lf, rf]
+}
+
+///
+/// Permute exchange width for the AVX-512 SIMD application
+pub(crate) const PERMUTE_EXCHANGE_WIDTH: usize = 8;
+
+///
+/// Merge layer for sorting network
+fn merger_net(mut input: Vec<i64>) -> Vec<i64> {
+    let half = input.len() / 2;
+    if half > PERMUTE_EXCHANGE_WIDTH {
+        (0..half).into_iter().for_each(|e| unsafe {
+            if input[e] > input[e + half] {
+                let pl: *mut i64 = &mut input[e];
+                let pr: *mut i64 = &mut input[e + half];
+                std::ptr::swap(pl, pr);
+            }
+        });
+        merger_net(input[..half].to_vec());

Review comment:
       Doesn't this create a lot of intermediate vecs / recursion? I guess it could be written manually and with one bigger allocation?

##########
File path: rust/arrow/src/arch/avx512.rs
##########
@@ -41,6 +45,173 @@ pub(crate) unsafe fn avx512_bin_or(left: &[u8], right: &[u8], res: &mut [u8]) {
     std::ptr::copy_nonoverlapping(s, d, std::mem::size_of::<__m512i>());
 }
 
+///
+/// Sorting network for a single SIMD vector of i64s
+#[target_feature(enable = "avx512f")]
+pub(crate) unsafe fn avx512_vec_sort_i64_single<'a>(input: &[i64]) -> [i64; 8] {
+    use core::arch::x86_64::{
+        __m512i, _mm512_loadu_epi64, _mm512_mask_mov_epi64, _mm512_max_epi64,
+        _mm512_min_epi64, _mm512_permutexvar_epi64, _mm512_set_epi64,
+    };
+
+    // First wiring's permute exchange for the sorting network
+    let mut inp: __m512i = _mm512_loadu_epi64(input.as_ptr() as *const _);
+    let idxnn1: __m512i = _mm512_set_epi64(6, 7, 4, 5, 2, 3, 0, 1);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn1, inp);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, inp);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, inp);
+    inp = _mm512_mask_mov_epi64(wire_n_min, 0xAA, wire_n_max);
+
+    // Second wiring's permute exchange for the sorting network
+    let idxnn2: __m512i = _mm512_set_epi64(4, 5, 6, 7, 0, 1, 2, 3);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn2, inp);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, inp);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, inp);
+    inp = _mm512_mask_mov_epi64(wire_n_min, 0xCC, wire_n_max);
+
+    // Third wiring's permute exchange for the sorting network
+    let idxnn3: __m512i = _mm512_set_epi64(6, 7, 4, 5, 2, 3, 0, 1);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn3, inp);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, inp);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, inp);
+    inp = _mm512_mask_mov_epi64(wire_n_min, 0xAA, wire_n_max);
+
+    // Fourth wiring's permute exchange, does forwarding.
+    let idxnn4: __m512i = _mm512_set_epi64(0, 1, 2, 3, 4, 5, 6, 7);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn4, inp);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, inp);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, inp);
+    inp = _mm512_mask_mov_epi64(wire_n_min, 0xF0, wire_n_max);
+
+    // Fifth wiring's permute exchange for the sorting network
+    let idxnn5: __m512i = _mm512_set_epi64(5, 4, 7, 6, 1, 0, 3, 2);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn5, inp);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, inp);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, inp);
+    inp = _mm512_mask_mov_epi64(wire_n_min, 0xCC, wire_n_max);
+
+    // Sixth wiring's permute exchange for the sorting network
+    let idxnn6: __m512i = _mm512_set_epi64(6, 7, 4, 5, 2, 3, 0, 1);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn6, inp);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, inp);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, inp);
+    inp = _mm512_mask_mov_epi64(wire_n_min, 0xAA, wire_n_max);
+
+    std::mem::transmute(inp)
+}
+
+///
+/// Sorting network with SIMD merger for two SIMD vector of i64s
+#[target_feature(enable = "avx512f")]
+pub(crate) unsafe fn avx512_vec_sort_i64_double(
+    left: &[i64],
+    right: &[i64],
+) -> [[i64; 8]; 2] {
+    use core::arch::x86_64::{
+        __m512i, _mm512_loadu_epi64, _mm512_mask_mov_epi64, _mm512_max_epi64,
+        _mm512_min_epi64, _mm512_permutexvar_epi64, _mm512_set_epi64,
+    };
+
+    let (l, r) = (
+        avx512_vec_sort_i64_single(left),
+        avx512_vec_sort_i64_single(right),
+    );
+
+    let mut l: __m512i = _mm512_loadu_epi64(l.as_ptr() as *const _);
+    let mut r: __m512i = _mm512_loadu_epi64(r.as_ptr() as *const _);
+
+    // Full blend of the both vector wires
+    let idxnn1: __m512i = _mm512_set_epi64(0, 1, 2, 3, 4, 5, 6, 7);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn1, l);
+    l = _mm512_min_epi64(r, wire_n);
+    r = _mm512_max_epi64(r, wire_n);
+
+    // Carries on with normal sorting network operation
+    let idxnn2: __m512i = _mm512_set_epi64(3, 2, 1, 0, 7, 6, 5, 4);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn2, l);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, l);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, l);
+    l = _mm512_mask_mov_epi64(wire_n_min, 0xF0, wire_n_max); // 0x33
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn2, r);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, r);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, r);
+    r = _mm512_mask_mov_epi64(wire_n_min, 0xF0, wire_n_max); // 0x33
+
+    let idxnn3: __m512i = _mm512_set_epi64(5, 4, 7, 6, 1, 0, 3, 2);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn3, l);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, l);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, l);
+    l = _mm512_mask_mov_epi64(wire_n_min, 0xCC, wire_n_max);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn3, r);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, r);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, r);
+    r = _mm512_mask_mov_epi64(wire_n_min, 0xCC, wire_n_max);
+
+    let idxnn4: __m512i = _mm512_set_epi64(6, 7, 4, 5, 2, 3, 0, 1);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn4, l);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, l);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, l);
+    l = _mm512_mask_mov_epi64(wire_n_min, 0xAA, wire_n_max);
+    let wire_n: __m512i = _mm512_permutexvar_epi64(idxnn4, r);
+    let wire_n_min: __m512i = _mm512_min_epi64(wire_n, r);
+    let wire_n_max: __m512i = _mm512_max_epi64(wire_n, r);
+    r = _mm512_mask_mov_epi64(wire_n_min, 0xAA, wire_n_max);
+
+    let lf: [i64; 8] = std::mem::transmute(l);
+    let rf: [i64; 8] = std::mem::transmute(r);
+
+    [lf, rf]
+}
+
+///
+/// Permute exchange width for the AVX-512 SIMD application
+pub(crate) const PERMUTE_EXCHANGE_WIDTH: usize = 8;
+
+///
+/// Merge layer for sorting network
+fn merger_net(mut input: Vec<i64>) -> Vec<i64> {
+    let half = input.len() / 2;
+    if half > PERMUTE_EXCHANGE_WIDTH {
+        (0..half).into_iter().for_each(|e| unsafe {
+            if input[e] > input[e + half] {
+                let pl: *mut i64 = &mut input[e];
+                let pr: *mut i64 = &mut input[e + half];
+                std::ptr::swap(pl, pr);
+            }
+        });
+        merger_net(input[..half].to_vec());
+        merger_net(input[half..].to_vec());
+    }
+    input
+}
+
+///
+/// Cold path marker for hinting the CPU for the further optimizations.
+#[inline]
+#[cold]
+fn cold() {}
+
+///
+/// Size independent sorter for any vector which is power of two.
+pub(crate) unsafe fn avx512_vec_sort_i64(input: &[i64]) -> Vec<i64> {
+    if (input.len() / 2) == PERMUTE_EXCHANGE_WIDTH {
+        let v: Vec<&[i64]> = input.chunks_exact(PERMUTE_EXCHANGE_WIDTH).collect();
+        let x = avx512_vec_sort_i64_double(&v[0], &v[1]);
+        [x[0], x[1]].concat()
+    } else {
+        if (input.len() / 2) == 0 {
+            cold();
+            input.to_vec()
+        } else {
+            let mut it = input.chunks_exact(input.len() / 2);
+            let l = avx512_vec_sort_i64(it.next().unwrap());

Review comment:
       Here as well?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] vertexclique commented on a change in pull request #8722: ARROW-10664: [Rust] Implement AVX512 sort

Posted by GitBox <gi...@apache.org>.
vertexclique commented on a change in pull request #8722:
URL: https://github.com/apache/arrow/pull/8722#discussion_r528217013



##########
File path: rust/arrow/src/arch/avx512.rs
##########
@@ -41,6 +41,155 @@ pub(crate) unsafe fn avx512_bin_or(left: &[u8], right: &[u8], res: &mut [u8]) {
     std::ptr::copy_nonoverlapping(s, d, std::mem::size_of::<__m512i>());
 }
 
+#[target_feature(enable = "avx512f")]
+pub(crate) unsafe fn avx512_vec_sort_i64_single<'a>(input: &[i64]) -> [i64; 8] {

Review comment:
       Added for each method also comments for the operation blocks.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] vertexclique commented on pull request #8722: ARROW-10664: [Rust] Implement AVX512 sort

Posted by GitBox <gi...@apache.org>.
vertexclique commented on pull request #8722:
URL: https://github.com/apache/arrow/pull/8722#issuecomment-759441550


   @alamb Thanks for reaching out! I don't have time to work on these PRs. Closing.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on a change in pull request #8722: ARROW-10664: [Rust] Implement AVX512 sort

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #8722:
URL: https://github.com/apache/arrow/pull/8722#discussion_r527926242



##########
File path: rust/arrow/src/arch/avx512.rs
##########
@@ -41,6 +41,155 @@ pub(crate) unsafe fn avx512_bin_or(left: &[u8], right: &[u8], res: &mut [u8]) {
     std::ptr::copy_nonoverlapping(s, d, std::mem::size_of::<__m512i>());
 }
 
+#[target_feature(enable = "avx512f")]
+pub(crate) unsafe fn avx512_vec_sort_i64_single<'a>(input: &[i64]) -> [i64; 8] {

Review comment:
       May you please add some comments to the functions (and/or inline) so it's a bit easier for us to follow the code. Thanks




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] vertexclique closed pull request #8722: ARROW-10664: [Rust] Implement AVX512 sort

Posted by GitBox <gi...@apache.org>.
vertexclique closed pull request #8722:
URL: https://github.com/apache/arrow/pull/8722


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jhorstmann commented on a change in pull request #8722: ARROW-10664: [Rust] Implement AVX512 sort

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on a change in pull request #8722:
URL: https://github.com/apache/arrow/pull/8722#discussion_r528238698



##########
File path: rust/arrow/src/compute/kernels/sort.rs
##########
@@ -853,6 +955,18 @@ mod tests {
         );
     }
 
+    #[test]
+    fn test_sort_primitives_large() {
+        let data = [0, 1, 2, 3_u8]
+            .repeat(100_000)

Review comment:
       Could you add another test where the input array has a power of two length?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8722: ARROW-10664: [Rust] Implement AVX512 sort

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8722:
URL: https://github.com/apache/arrow/pull/8722#issuecomment-759439321


   @vertexclique  --  Given the imminent Arrow 3.0 release, I am trying to clean up older Rust PRs and see if the authors have plans to move them forward. 
   
   Do you plan on working on this PR in the near future? If not, should we close this PR until there is time to make progress? Thanks again for your contributions so far. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org