Posted to jira@arrow.apache.org by "Josiah (Jira)" <ji...@apache.org> on 2020/09/10 15:20:00 UTC

[jira] [Updated] (ARROW-9965) [Java] Buffer capacity calculations are slow for fixed-width vectors

     [ https://issues.apache.org/jira/browse/ARROW-9965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josiah updated ARROW-9965:
--------------------------
     Attachment: after_patch_profile_prof_perfasm_unsafe_true
                 before_patch_profile_prof_perfasm_unsafe_true
    Description: 
It turns out that setSafe performs an expensive integer division when computing buffer capacity; specifically, it divides by the field's type width, which is not a compile-time constant. Although the width is typically a power of 2 for alignment reasons, the JIT does not compile the division down to a bit shift.

This is done here: https://github.com/apache/arrow/blob/175c53d0b17708312bfd1494c65824f690a6cc9a/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java#L189
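
To make the problem concrete, here is a minimal hypothetical sketch (not Arrow code): because the type width is an ordinary final instance field rather than a compile-time constant, the JIT emits an actual division instruction instead of strength-reducing it to a shift.
{code}
// Hypothetical illustration of the difference between a runtime divisor and a constant one.
class CapacitySketch {
  private final long bufferCapacityBytes;
  private final int typeWidth; // e.g. 4 for an int vector, but only known at construction time

  CapacitySketch(long bufferCapacityBytes, int typeWidth) {
    this.bufferCapacityBytes = bufferCapacityBytes;
    this.typeWidth = typeWidth;
  }

  long capacityWithFieldDivisor() {
    // Divisor is a field value: the JIT emits an idiv for this division.
    return bufferCapacityBytes / typeWidth;
  }

  long capacityWithConstantDivisor() {
    // Divisor is a literal power of two: the JIT strength-reduces this (no idiv).
    return bufferCapacityBytes / 4;
  }
}
{code}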
 
Forcing a bit shift results in a large speedup in benchmarks. With bounds checks turned off (they affect another portion of set), microbenchmarks show that setting vector elements via setSafe becomes ~174% faster (almost 3 times; compare the setIntDirectly rows below, 9.563 to 3.490 us/op). With bounds checks on, the improvement drops to about 73% (Amdahl's law).

We currently use setSafe in a hot loop to populate Arrow vectors in an internal data-intensive service. In the future we would prefer a more specialized vector append interface that skips the remaining indirection and bit-manipulation instructions without manipulating the exposed (native) memory directly.

Here is the detailed analysis:
Tests were run on a machine with an Intel Core i7-8700K. The code was compiled with JDK 8 and run with the latest repo-provided JDK 14 on Ubuntu 20.04.
{code}
Benchmark results with arrow.enable_unsafe_memory_access=false, patch NOT applied
# JMH version: 1.21
# VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
# VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
# VM options: -Darrow.enable_unsafe_memory_access=false
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectly
*snip*
Benchmark                         Mode  Cnt   Score   Error  Units
IntBenchmarks.setIntDirectly      avgt   15  13.853 ± 0.058  us/op
IntBenchmarks.setWithValueHolder  avgt   15  15.045 ± 0.040  us/op
IntBenchmarks.setWithWriter       avgt   15  21.621 ± 0.197  us/op

Benchmark results with arrow.enable_unsafe_memory_access=false, patch applied

# JMH version: 1.21
# VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
# VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
# VM options: -Darrow.enable_unsafe_memory_access=false
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectly
*snip*
Benchmark                         Mode  Cnt  Score   Error  Units
IntBenchmarks.setIntDirectly      avgt   15  7.964 ± 0.030  us/op
IntBenchmarks.setWithValueHolder  avgt   15  9.145 ± 0.031  us/op
IntBenchmarks.setWithWriter       avgt   15  8.029 ± 0.051  us/op

Benchmark results with arrow.enable_unsafe_memory_access=true, patch NOT applied

# JMH version: 1.21
# VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
# VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
# VM options: -Darrow.enable_unsafe_memory_access=true
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectly

Benchmark                         Mode  Cnt   Score   Error  Units
IntBenchmarks.setIntDirectly      avgt   15   9.563 ± 0.335  us/op
IntBenchmarks.setWithValueHolder  avgt   15   9.266 ± 0.064  us/op
IntBenchmarks.setWithWriter       avgt   15  18.806 ± 0.154  us/op

Benchmark results with arrow.enable_unsafe_memory_access=true, patch applied
# JMH version: 1.21
# VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
# VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
# VM options: -Darrow.enable_unsafe_memory_access=true
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectly
Benchmark                         Mode  Cnt  Score   Error  Units
IntBenchmarks.setIntDirectly      avgt   15  3.490 ± 0.175  us/op
IntBenchmarks.setWithValueHolder  avgt   15  3.806 ± 0.015  us/op
IntBenchmarks.setWithWriter       avgt   15  5.490 ± 0.304  us/op
{code}
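
For anyone who wants to reproduce this setup programmatically, a minimal JMH runner for the same configuration could look like the sketch below (an assumed setup on my part: the benchmark class and the arrow.enable_unsafe_memory_access flag come from the output above, the fork count is inferred from Cnt = 15, and the rest is plain JMH API).
{code}
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.profile.LinuxPerfAsmProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import org.openjdk.jmh.runner.options.TimeValue;

public class IntBenchmarksRunner {
  public static void main(String[] args) throws RunnerException {
    Options opts = new OptionsBuilder()
        .include("org.apache.arrow.vector.IntBenchmarks")
        .mode(Mode.AverageTime)
        .timeUnit(TimeUnit.MICROSECONDS)
        .warmupIterations(5).warmupTime(TimeValue.seconds(10))
        .measurementIterations(5).measurementTime(TimeValue.seconds(10))
        .forks(3) // 3 forks x 5 measurement iterations = Cnt 15 in the tables above
        // Flip to "false" to re-enable bounds checks.
        .jvmArgsAppend("-Darrow.enable_unsafe_memory_access=true")
        .addProfiler(LinuxPerfAsmProfiler.class) // JMH's perfasm profiler
        .build();
    new Runner(opts).run();
  }
}
{code}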

I determined this by running the built-in Arrow JMH benchmarks on the 8700K, with CPU frequency scaling left in its default state. The numbers for setting a value seemed off, so I reran the benchmarks with JMH's `prof=perfasm` option, which emits annotated assembly for detected hotspots. Here is the relevant section:

{code}
0.06% │ ││ 0x00007f5a7f7beb6f: mov 0x30(%r12,%rsi,8),%rax ; implicit exception: dispatches to 0x00007f5a7f7bef28
│ ││ ;*getfield length {reexecute=0 rethrow=0 return_oop=0}
│ ││ ; - org.apache.arrow.memory.ArrowBuf::capacity@1 (line 138)
│ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueBufferValueCapacity@4 (line 189)
│ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueCapacity@1 (line 185)
│ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::handleSafe@2 (line 817)
│ ││ ; - org.apache.arrow.vector.IntVector::setSafe@2 (line 223)
│ ││ ; - org.apache.arrow.vector.IntBenchmarks::setWithValueHolder@51 (line 77)
│ ││ ; - org.apache.arrow.vector.generated.IntBenchmarks_setWithValueHolder_jmhTest::setWithValueHolder_avgt_jmhStub@15 (line 234)
0.14% │ ││ 0x00007f5a7f7beb74: movsxd 0x10(%r12,%rdi,8),%rcx ;*i2l {reexecute=0 rethrow=0 return_oop=0}
│ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueBufferValueCapacity@11 (line 189)
│ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueCapacity@1 (line 185)
│ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::handleSafe@2 (line 817)
│ ││ ; - org.apache.arrow.vector.IntVector::setSafe@2 (line 223)
│ ││ ; - org.apache.arrow.vector.IntBenchmarks::setWithValueHolder@51 (line 77)
│ ││ ; - org.apache.arrow.vector.generated.IntBenchmarks_setWithValueHolder_jmhTest::setWithValueHolder_avgt_jmhStub@15 (line 234)
*snip*
1.43% ││ │ ││ 0x00007f5a7f7beb9b: idivq %rcx,%rax ;*ldiv {reexecute=0 rethrow=0 return_oop=0}
││ │ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueBufferValueCapacity@12 (line 189)
││ │ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueCapacity@1 (line 185)
││ │ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::handleSafe@2 (line 817)
││ │ ││ ; - org.apache.arrow.vector.IntVector::setSafe@2 (line 223)
││ │ ││ ; - org.apache.arrow.vector.IntBenchmarks::setWithValueHolder@51 (line 77)
││ │ ││ ; - org.apache.arrow.vector.generated.IntBenchmarks_setWithValueHolder_jmhTest::setWithValueHolder_avgt_jmhStub@15 (line 234)
68.16% ││ ↘ ││ 0x00007f5a7f7beb9e: cmp $0x7fffffff,%rax
││ ││ 0x00007f5a7f7beba5: jnle 0x7f5a7f7bec8c ;*ifgt {reexecute=0 rethrow=0 return_oop=0}
││ ││ ; - java.lang.Math::min@3 (line 1552)
││ ││ ; - org.apache.arrow.memory.util.LargeMemoryUtil::capAtMaxInt@4 (line 44)
││ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueBufferValueCapacity@13 (line 189)
││ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueCapacity@1 (line 185)
││ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::handleSafe@2 (line 817)
││ ││ ; - org.apache.arrow.vector.IntVector::setSafe@2 (line 223)
││ ││ ; - org.apache.arrow.vector.IntBenchmarks::setWithValueHolder@51 (line 77)
││ ││ ; - org.apache.arrow.vector.generated.IntBenchmarks_setWithValueHolder_jmhTest::setWithValueHolder_avgt_jmhStub@15 (line 234)
{code}

The hot instruction is misattributed, probably due to perf event skid (the 68.16% lands on the cmp immediately after the idivq). But integer division is known to be expensive in hardware. We can verify this with Agner Fog's instruction tables: https://www.agner.org/optimize/instruction_tables.pdf . Searching for idiv there shows high latency and reciprocal throughput numbers, as expected.

After noting all of this, we can apply the following patch, which produces the speedups shown above:
{code}
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java
index ee47f6dd8..0c9a57bf9 100644
--- a/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java
+++ b/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java
@@ -47,6 +47,7 @@ import io.netty.util.internal.PlatformDependent;
public abstract class BaseFixedWidthVector extends BaseValueVector
implements FixedWidthVector, FieldVector, VectorDefinitionSetter {
private final int typeWidth;
+ private final int typeWidthAsExactBitShiftOrNegOne;

protected int lastValueCapacity;

@@ -66,6 +67,7 @@ public abstract class BaseFixedWidthVector extends BaseValueVector
public BaseFixedWidthVector(Field field, final BufferAllocator allocator, final int typeWidth) {
super(allocator);
this.typeWidth = typeWidth;
+ this.typeWidthAsExactBitShiftOrNegOne = log2ExactOrNeg1(typeWidth);
this.field = field;
valueCount = 0;
allocationMonitor = 0;
@@ -186,7 +188,13 @@ public abstract class BaseFixedWidthVector extends BaseValueVector
}

private int getValueBufferValueCapacity() {
- return capAtMaxInt(valueBuffer.capacity() / typeWidth);
+ if (typeWidthAsExactBitShiftOrNegOne == -1) {
+ // Slow path: integer division is very expensive, and this code is part of
+ // setSafe's hot loop.
+ // The JVM did not optimize the division into a bit shift.
+ return capAtMaxInt(valueBuffer.capacity() / typeWidth);
+ }
+ return capAtMaxInt(valueBuffer.capacity() >> typeWidthAsExactBitShiftOrNegOne);
}

private int getValidityBufferValueCapacity() {
@@ -903,4 +911,12 @@ public abstract class BaseFixedWidthVector extends BaseValueVector
return visitor.visit(this, value);
}

+ private int log2ExactOrNeg1(int x) {
+ final boolean isPowerOfTwo = x > 0 & (x & (x - 1)) == 0;
+ if (!isPowerOfTwo) {
+ return -1;
+ }
+ return (Integer.SIZE - 1) - Integer.numberOfLeadingZeros(x);
+ }
+
}
{code}
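
As a quick sanity check of the helper (a standalone sketch, not part of the patch; log2ExactOrNeg1 is reproduced here as a static method for illustration), the shift path agrees with plain division for power-of-two widths, and non-power-of-two widths fall back to the divide path:
{code}
// Standalone sanity check: shift-based capacity equals plain division for
// power-of-two widths; non-power-of-two widths return -1 and use division.
public class ShiftEquivalenceCheck {
  static int log2ExactOrNeg1(int x) {
    final boolean isPowerOfTwo = x > 0 & (x & (x - 1)) == 0;
    return isPowerOfTwo ? (Integer.SIZE - 1) - Integer.numberOfLeadingZeros(x) : -1;
  }

  public static void main(String[] args) {
    final long capacityBytes = 4096;
    for (int width : new int[] {1, 2, 4, 8, 16, 3, 12}) {
      int shift = log2ExactOrNeg1(width);
      long viaDivision = capacityBytes / width;
      long viaShift = (shift == -1) ? viaDivision : (capacityBytes >> shift);
      System.out.printf("width=%d shift=%d division=%d shifted=%d%n",
          width, shift, viaDivision, viaShift);
      if (viaDivision != viaShift) {
        throw new AssertionError("mismatch for width " + width);
      }
    }
  }
}
{code}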

Attached is the generated assembly as printed by JMH, before and after. I renamed some variables for clarity after generating the results, but the logic is unchanged.

I also did a quick test with JDK 8, which is where I originally ran the benchmarks; the idiv instruction was present there too.

An initial version of this patch cached the computed value instead; it produces about the same speedup.

Are people fine with this approach?



> [Java] Buffer capacity calculations are slow for fixed-width vectors
> --------------------------------------------------------------------
>
>                 Key: ARROW-9965
>                 URL: https://issues.apache.org/jira/browse/ARROW-9965
>             Project: Apache Arrow
>          Issue Type: Improvement
>            Reporter: Josiah
>            Priority: Minor
>         Attachments: after_patch_profile_prof_perfasm_unsafe_true, before_patch_profile_prof_perfasm_unsafe_true
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)