You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Josiah (Jira)" <ji...@apache.org> on 2020/09/10 15:18:00 UTC
[jira] [Created] (ARROW-9965) [Java] Buffer capacity calculations are slow for fixed-width vectors

Josiah created ARROW-9965:
-----------------------------

             Summary: [Java] Buffer capacity calculations are slow for fixed-width vectors
                 Key: ARROW-9965
                 URL: https://issues.apache.org/jira/browse/ARROW-9965
             Project: Apache Arrow
          Issue Type: Improvement
            Reporter: Josiah


It turns out that setSafe performs a very expensive integer division when trying to compute buffer capacity; specifically, it divides by the field size, which isn't hardcoded. Although it is typically a power of 2 for alignment reasons, this doesn't compile down to a bitshift.

This is done here: https://github.com/apache/arrow/blob/175c53d0b17708312bfd1494c65824f690a6cc9a/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java#L189
 
Forcing a bitshift operation results in a large speedup in benchmarks. When turning off bounds checks (which affects another portion of set), microbenchmarks indicate that setting the elements of a vector via setSafe is increased by ~174% (almost 3 times faster). With bounds checks on, this is reduced to a 73% increase (Amdahl's).

We use setSafe right now in a hot loop to set Arrow vectors in an internal data-intensive service (for now), although in the future, we would prefer a more specialized vector append interface to skip all the other indirection and bit manipulation instructions, while not directly manipulating the exposed (native) memory.

Here is the detailed analysis:
Tests were run on a machine with an Intel 8700k. Compiled with JDK 8, and run with the latest repo-provided JDK 14 on Ubuntu 20.04.
{code}
Benchmark results with arrow.enable_unsafe_memory_access=false, patch NOT applied
# JMH version: 1.21
# VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
# VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
# VM options: -Darrow.enable_unsafe_memory_access=false
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectly
*snip*
Benchmark Mode Cnt Score Error Units
IntBenchmarks.setIntDirectly avgt 15 13.853 ± 0.058 us/op
IntBenchmarks.setWithValueHolder avgt 15 15.045 ± 0.040 us/op
IntBenchmarks.setWithWriter avgt 15 21.621 ± 0.197 us/op

Benchmark results with arrow.enable_unsafe_memory_access=false, patch applied

# JMH version: 1.21
# VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
# VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
# VM options: -Darrow.enable_unsafe_memory_access=false
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectly
*snip*
Benchmark Mode Cnt Score Error Units
IntBenchmarks.setIntDirectly avgt 15 7.964 ± 0.030 us/op
IntBenchmarks.setWithValueHolder avgt 15 9.145 ± 0.031 us/op
IntBenchmarks.setWithWriter avgt 15 8.029 ± 0.051 us/op

Benchmark results with arrow.enable_unsafe_memory_access=true, patch NOT applied

# JMH version: 1.21
# VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
# VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
# VM options: -Darrow.enable_unsafe_memory_access=true
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectl

Benchmark Mode Cnt Score Error Units
IntBenchmarks.setIntDirectly avgt 15 9.563 ± 0.335 us/op
IntBenchmarks.setWithValueHolder avgt 15 9.266 ± 0.064 us/op
IntBenchmarks.setWithWriter avgt 15 18.806 ± 0.154 us/op

Benchmark results with arrow.enable_unsafe_memory_access=true, patch applied
# JMH version: 1.21
# VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
# VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
# VM options: -Darrow.enable_unsafe_memory_access=true
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectly
Benchmark Mode Cnt Score Error Units
IntBenchmarks.setIntDirectly avgt 15 3.490 ± 0.175 us/op
IntBenchmarks.setWithValueHolder avgt 15 3.806 ± 0.015 us/op
IntBenchmarks.setWithWriter avgt 15 5.490 ± 0.304 us/op
{code}

I determined this by running the built-in Arrow JMH benchmarks on an 8700k. I left the CPU frequency scaling in the default state. The numbers seemed off for setting a value. I reran the benchmarks with the `prof=perfasm` option in JMH, which emitted annotated assembly for detected hotspots. Here is the relevant section:

{code}
0.06% │ ││ 0x00007f5a7f7beb6f: mov 0x30(%r12,%rsi,8),%rax ; implicit exception: dispatches to 0x00007f5a7f7bef28
│ ││ ;*getfield length {reexecute=0 rethrow=0 return_oop=0}
│ ││ ; - org.apache.arrow.memory.ArrowBuf::capacity@1 (line 138)
│ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueBufferValueCapacity@4 (line 189)
│ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueCapacity@1 (line 185)
│ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::handleSafe@2 (line 817)
│ ││ ; - org.apache.arrow.vector.IntVector::setSafe@2 (line 223)
│ ││ ; - org.apache.arrow.vector.IntBenchmarks::setWithValueHolder@51 (line 77)
│ ││ ; - org.apache.arrow.vector.generated.IntBenchmarks_setWithValueHolder_jmhTest::setWithValueHolder_avgt_jmhStub@15 (line 234)
0.14% │ ││ 0x00007f5a7f7beb74: movsxd 0x10(%r12,%rdi,8),%rcx ;*i2l {reexecute=0 rethrow=0 return_oop=0}
│ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueBufferValueCapacity@11 (line 189)
│ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueCapacity@1 (line 185)
│ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::handleSafe@2 (line 817)
│ ││ ; - org.apache.arrow.vector.IntVector::setSafe@2 (line 223)
│ ││ ; - org.apache.arrow.vector.IntBenchmarks::setWithValueHolder@51 (line 77)
│ ││ ; - org.apache.arrow.vector.generated.IntBenchmarks_setWithValueHolder_jmhTest::setWithValueHolder_avgt_jmhStub@15 (line 234)
*snip*
1.43% ││ │ ││ 0x00007f5a7f7beb9b: idivq %rcx,%rax ;*ldiv {reexecute=0 rethrow=0 return_oop=0}
││ │ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueBufferValueCapacity@12 (line 189)
││ │ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueCapacity@1 (line 185)
││ │ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::handleSafe@2 (line 817)
││ │ ││ ; - org.apache.arrow.vector.IntVector::setSafe@2 (line 223)
││ │ ││ ; - org.apache.arrow.vector.IntBenchmarks::setWithValueHolder@51 (line 77)
││ │ ││ ; - org.apache.arrow.vector.generated.IntBenchmarks_setWithValueHolder_jmhTest::setWithValueHolder_avgt_jmhStub@15 (line 234)
68.16% ││ ↘ ││ 0x00007f5a7f7beb9e: cmp $0x7fffffff,%rax
││ ││ 0x00007f5a7f7beba5: jnle 0x7f5a7f7bec8c ;*ifgt {reexecute=0 rethrow=0 return_oop=0}
││ ││ ; - java.lang.Math::min@3 (line 1552)
││ ││ ; - org.apache.arrow.memory.util.LargeMemoryUtil::capAtMaxInt@4 (line 44)
││ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueBufferValueCapacity@13 (line 189)
││ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueCapacity@1 (line 185)
││ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::handleSafe@2 (line 817)
││ ││ ; - org.apache.arrow.vector.IntVector::setSafe@2 (line 223)
││ ││ ; - org.apache.arrow.vector.IntBenchmarks::setWithValueHolder@51 (line 77)
││ ││ ; - org.apache.arrow.vector.generated.IntBenchmarks_setWithValueHolder_jmhTest::setWithValueHolder_avgt_jmhStub@15 (line 234)
{code}

The hot instruction is misattributed, probably due to event instruction skid. But integer division is known to be expensive to implement. We can verify this with Agner Fog's instruction tables: https://www.agner.org/optimize/instruction_tables.pdf . Searching for idiv gives high numbers in the table provided, as expected.

After noting all of this, we can apply the following patch, which produces the speedup as above:
{code}
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java
index ee47f6dd8..0c9a57bf9 100644
--- a/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java
+++ b/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java
@@ -47,6 +47,7 @@ import io.netty.util.internal.PlatformDependent;
public abstract class BaseFixedWidthVector extends BaseValueVector
implements FixedWidthVector, FieldVector, VectorDefinitionSetter {
private final int typeWidth;
+ private final int typeWidthAsExactBitShiftOrNegOne;

protected int lastValueCapacity;

@@ -66,6 +67,7 @@ public abstract class BaseFixedWidthVector extends BaseValueVector
public BaseFixedWidthVector(Field field, final BufferAllocator allocator, final int typeWidth) {
super(allocator);
this.typeWidth = typeWidth;
+ this.typeWidthAsExactBitShiftOrNegOne = log2ExactOrNeg1(typeWidth);
this.field = field;
valueCount = 0;
allocationMonitor = 0;
@@ -186,7 +188,13 @@ public abstract class BaseFixedWidthVector extends BaseValueVector
}

private int getValueBufferValueCapacity() {
- return capAtMaxInt(valueBuffer.capacity() / typeWidth);
+ if (typeWidthAsExactBitShiftOrNegOne == -1) {
+ // Slow path - integral division is very very expensive, and code here is part of
+ // setSafe's hot loop
+ // The JVM did not optimize integral division into bit shifts
+ return capAtMaxInt(valueBuffer.capacity() / typeWidth);
+ }
+ return capAtMaxInt(valueBuffer.capacity() >> typeWidthAsExactBitShiftOrNegOne);
}

private int getValidityBufferValueCapacity() {
@@ -903,4 +911,12 @@ public abstract class BaseFixedWidthVector extends BaseValueVector
return visitor.visit(this, value);
}

+ private int log2ExactOrNeg1(int x) {
+ final boolean isPowerOfTwo = x > 0 & (x & (x - 1)) == 0;
+ if (!isPowerOfTwo) {
+ return -1;
+ }
+ return (Integer.SIZE - 1) - Integer.numberOfLeadingZeros(x);
+ }
+
}
{code}

I also did a quick test with JDK 8 - this was where I originally ran the benchmarks. The idiv instruction was present there too.

An initial version of this patch cached the value - this produces about the same speedup.

Are people fine with this approach?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)