You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by Eric Wohlstadter <wo...@gmail.com> on 2018/07/23 21:50:23 UTC

[JAVA] SIMD vectorized fill of ArrowBuf from Java primitive type array?

Hi all,
  I work on a project that uses Arrow streaming format to transfer data
between Java processes.
We're also following the progress on Java support for Plasma, and may
decide use Plasma also.

We typically uses a pattern like this to fill Arrow vectors from Java
arrays:
----
int[] inputValues = ...;
boolean[] nullInputValues = ...;

org.apache.arrow.vector.IntVector vector = ...;
for(int i = 0; i < inputValues.size; i++) {
  if(nullInputValues[i]) {
    vector.setNull(i);
  } else {
    vector.set(i, inputValues[i]);
  }
}
----

Obviously the JIT won't be able to vectorize this loop. Does anyone know if
there is another way to achieve this which
would be vectorized?

Here is a pseudo-code mockup of what I was thinking about, is this approach
worth pursuing?

The idea is to try to convert input into Arrow format in a vectorized loop,
and then use sun.misc.Unsafe to copy the
converted on-heap input to an off-heap valueBuffer.

I'll ignore the details of the validityBuffer here, since it would follow
along the same lines:

----
int[] inputValues = ...;
org.apache.arrow.vector.IntVector vector = ...;

for(int i = 0; i < inputValues.size; i++) {
  //convert inputValues[i] to little-endian
  //this conversion can be SIMD vectorized?
}
UNSAFE.copyMemory(
  inputValues,
  0,
  null,
  vector.getDataBuffer().memoryAddress(),
  sizeof(Integer.class) * inputValues.size
);
----

Thanks for any feedback about details I may be misunderstanding, which
would make this approach infeasible.

Re: [JAVA] SIMD vectorized fill of ArrowBuf from Java primitive type array?

Posted by Siddharth Teotia <si...@dremio.com>.
Also look here
<https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/sabot/op/copier/FieldBufferCopier.java#L37>
to see how validity and data are copied independently between two vectors
bypassing all Arrow APIs and directly manipulating memory. The link points
to copying of data buffer. Further down in the file, you will see BitCopier
to copy validity bits.

On Mon, Jul 23, 2018 at 5:19 PM, Siddharth Teotia <si...@dremio.com>
wrote:

> Eric, you can take a look here
> <https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/sabot/op/common/ht2/Pivots.java#L176>
> how we try to optimize the copy (validity and data) in/out of vectors. We
> try to start with word-wise copy (64 column values and thus 64 validity
> bits) and then accordingly branch. Similar to this there are other examples
> of manipulating off heap buffers through PlatformDependent APIs -- which I
> think is same as using sun.misc.UNSAFE as the former eventually uses the
> latter underneath.
>
> In my opinion, we should take a look at  vector APIs and see where can we
> possibly eliminate branches. I did some of it earlier -- as an example see
> this
> <https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/IntVector.java#L144>
> for branch free cell-to-cell copy between two columns. The idea is to copy
> junk data disregarding the validity bit. As long as the validity bit is
> copied correctly, we are good.
>
> Couple of other things have been there on my todo list but haven't yet
> gotten to them. Like for your example, we should remove the branch here
> <https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/IntVector.java#L277>
> The caller is already telling us that validity bit is set or not (1 or 0 in
> parameter isSet). So just following is enough to set the value (Null or non
> null) at any cell and I think this will speed up the tight for loop in your
> application. There is no need to check whether isSet is 1 or 0. We should
> be simply setting it.
>
> PS:
>
> I have long been a proponent of implementing native SIMD acceleration
> support into Arrow libraries and have plugged this shamelessly in emails
> every now and then (like this one). Use cases like yours and several others
> can then be natively supported by Arrow for most optimal execution.
>
> On several occasions we have seen topic of SIMD acceleration support with
> Arrow coming up on mailing list and I think it's high time we should do
> something about it and build kernels  which can do simple tight loop
> operations like sum (vector1, vector2, num_values) and several others
> extremely efficiently.
>
>
> On Mon, Jul 23, 2018 at 4:44 PM, Wes McKinney <we...@gmail.com> wrote:
>
>> hi Eric,
>>
>> Antoine recently did some work on faster bitsetting in C++ by
>> unrolling the main loop to set one byte at a time
>>
>> https://github.com/apache/arrow/blob/27b869ae5df31f3be61e76e
>> 99999d96ea7d9b557/cpp/src/arrow/util/bit-util.h#L598
>>
>> This yielded major speedups when setting a lot of bits. A similar
>> strategy should be possible in Java for your use case. We speculated
>> that it could be made even faster by eliminating the branch in the
>> bit-setting assignments (the g() | left_branch : right_branch
>> statements). If you dig around in the Dremio codebase you can find
>> plenty of low level off-heap memory manipulation that may be helpful
>> (others may be able to comment).
>>
>> If some utilities could be developed here in the Arrow Java codebase
>> for common benefit, that would be great.
>>
>> Otherwise copying the values data without branching is an obvious
>> optimization. Others may have ideas
>>
>> - Wes
>>
>> On Mon, Jul 23, 2018 at 5:50 PM, Eric Wohlstadter <wo...@gmail.com>
>> wrote:
>> > Hi all,
>> >   I work on a project that uses Arrow streaming format to transfer data
>> > between Java processes.
>> > We're also following the progress on Java support for Plasma, and may
>> > decide use Plasma also.
>> >
>> > We typically uses a pattern like this to fill Arrow vectors from Java
>> > arrays:
>> > ----
>> > int[] inputValues = ...;
>> > boolean[] nullInputValues = ...;
>> >
>> > org.apache.arrow.vector.IntVector vector = ...;
>> > for(int i = 0; i < inputValues.size; i++) {
>> >   if(nullInputValues[i]) {
>> >     vector.setNull(i);
>> >   } else {
>> >     vector.set(i, inputValues[i]);
>> >   }
>> > }
>> > ----
>> >
>> > Obviously the JIT won't be able to vectorize this loop. Does anyone
>> know if
>> > there is another way to achieve this which
>> > would be vectorized?
>> >
>> > Here is a pseudo-code mockup of what I was thinking about, is this
>> approach
>> > worth pursuing?
>> >
>> > The idea is to try to convert input into Arrow format in a vectorized
>> loop,
>> > and then use sun.misc.Unsafe to copy the
>> > converted on-heap input to an off-heap valueBuffer.
>> >
>> > I'll ignore the details of the validityBuffer here, since it would
>> follow
>> > along the same lines:
>> >
>> > ----
>> > int[] inputValues = ...;
>> > org.apache.arrow.vector.IntVector vector = ...;
>> >
>> > for(int i = 0; i < inputValues.size; i++) {
>> >   //convert inputValues[i] to little-endian
>> >   //this conversion can be SIMD vectorized?
>> > }
>> > UNSAFE.copyMemory(
>> >   inputValues,
>> >   0,
>> >   null,
>> >   vector.getDataBuffer().memoryAddress(),
>> >   sizeof(Integer.class) * inputValues.size
>> > );
>> > ----
>> >
>> > Thanks for any feedback about details I may be misunderstanding, which
>> > would make this approach infeasible.
>>
>
>

Re: [JAVA] SIMD vectorized fill of ArrowBuf from Java primitive type array?

Posted by Siddharth Teotia <si...@dremio.com>.
Eric, you can take a look here
<https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/sabot/op/common/ht2/Pivots.java#L176>
how we try to optimize the copy (validity and data) in/out of vectors. We
try to start with word-wise copy (64 column values and thus 64 validity
bits) and then accordingly branch. Similar to this there are other examples
of manipulating off heap buffers through PlatformDependent APIs -- which I
think is same as using sun.misc.UNSAFE as the former eventually uses the
latter underneath.

In my opinion, we should take a look at  vector APIs and see where can we
possibly eliminate branches. I did some of it earlier -- as an example see
this
<https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/IntVector.java#L144>
for branch free cell-to-cell copy between two columns. The idea is to copy
junk data disregarding the validity bit. As long as the validity bit is
copied correctly, we are good.

Couple of other things have been there on my todo list but haven't yet
gotten to them. Like for your example, we should remove the branch here
<https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/IntVector.java#L277>
The caller is already telling us that validity bit is set or not (1 or 0 in
parameter isSet). So just following is enough to set the value (Null or non
null) at any cell and I think this will speed up the tight for loop in your
application. There is no need to check whether isSet is 1 or 0. We should
be simply setting it.

PS:

I have long been a proponent of implementing native SIMD acceleration
support into Arrow libraries and have plugged this shamelessly in emails
every now and then (like this one). Use cases like yours and several others
can then be natively supported by Arrow for most optimal execution.

On several occasions we have seen topic of SIMD acceleration support with
Arrow coming up on mailing list and I think it's high time we should do
something about it and build kernels  which can do simple tight loop
operations like sum (vector1, vector2, num_values) and several others
extremely efficiently.


On Mon, Jul 23, 2018 at 4:44 PM, Wes McKinney <we...@gmail.com> wrote:

> hi Eric,
>
> Antoine recently did some work on faster bitsetting in C++ by
> unrolling the main loop to set one byte at a time
>
> https://github.com/apache/arrow/blob/27b869ae5df31f3be61e76e99999d9
> 6ea7d9b557/cpp/src/arrow/util/bit-util.h#L598
>
> This yielded major speedups when setting a lot of bits. A similar
> strategy should be possible in Java for your use case. We speculated
> that it could be made even faster by eliminating the branch in the
> bit-setting assignments (the g() | left_branch : right_branch
> statements). If you dig around in the Dremio codebase you can find
> plenty of low level off-heap memory manipulation that may be helpful
> (others may be able to comment).
>
> If some utilities could be developed here in the Arrow Java codebase
> for common benefit, that would be great.
>
> Otherwise copying the values data without branching is an obvious
> optimization. Others may have ideas
>
> - Wes
>
> On Mon, Jul 23, 2018 at 5:50 PM, Eric Wohlstadter <wo...@gmail.com>
> wrote:
> > Hi all,
> >   I work on a project that uses Arrow streaming format to transfer data
> > between Java processes.
> > We're also following the progress on Java support for Plasma, and may
> > decide use Plasma also.
> >
> > We typically uses a pattern like this to fill Arrow vectors from Java
> > arrays:
> > ----
> > int[] inputValues = ...;
> > boolean[] nullInputValues = ...;
> >
> > org.apache.arrow.vector.IntVector vector = ...;
> > for(int i = 0; i < inputValues.size; i++) {
> >   if(nullInputValues[i]) {
> >     vector.setNull(i);
> >   } else {
> >     vector.set(i, inputValues[i]);
> >   }
> > }
> > ----
> >
> > Obviously the JIT won't be able to vectorize this loop. Does anyone know
> if
> > there is another way to achieve this which
> > would be vectorized?
> >
> > Here is a pseudo-code mockup of what I was thinking about, is this
> approach
> > worth pursuing?
> >
> > The idea is to try to convert input into Arrow format in a vectorized
> loop,
> > and then use sun.misc.Unsafe to copy the
> > converted on-heap input to an off-heap valueBuffer.
> >
> > I'll ignore the details of the validityBuffer here, since it would follow
> > along the same lines:
> >
> > ----
> > int[] inputValues = ...;
> > org.apache.arrow.vector.IntVector vector = ...;
> >
> > for(int i = 0; i < inputValues.size; i++) {
> >   //convert inputValues[i] to little-endian
> >   //this conversion can be SIMD vectorized?
> > }
> > UNSAFE.copyMemory(
> >   inputValues,
> >   0,
> >   null,
> >   vector.getDataBuffer().memoryAddress(),
> >   sizeof(Integer.class) * inputValues.size
> > );
> > ----
> >
> > Thanks for any feedback about details I may be misunderstanding, which
> > would make this approach infeasible.
>

Re: [JAVA] SIMD vectorized fill of ArrowBuf from Java primitive type array?

Posted by Wes McKinney <we...@gmail.com>.
hi Eric,

Antoine recently did some work on faster bitsetting in C++ by
unrolling the main loop to set one byte at a time

https://github.com/apache/arrow/blob/27b869ae5df31f3be61e76e99999d96ea7d9b557/cpp/src/arrow/util/bit-util.h#L598

This yielded major speedups when setting a lot of bits. A similar
strategy should be possible in Java for your use case. We speculated
that it could be made even faster by eliminating the branch in the
bit-setting assignments (the g() | left_branch : right_branch
statements). If you dig around in the Dremio codebase you can find
plenty of low level off-heap memory manipulation that may be helpful
(others may be able to comment).

If some utilities could be developed here in the Arrow Java codebase
for common benefit, that would be great.

Otherwise copying the values data without branching is an obvious
optimization. Others may have ideas

- Wes

On Mon, Jul 23, 2018 at 5:50 PM, Eric Wohlstadter <wo...@gmail.com> wrote:
> Hi all,
>   I work on a project that uses Arrow streaming format to transfer data
> between Java processes.
> We're also following the progress on Java support for Plasma, and may
> decide use Plasma also.
>
> We typically uses a pattern like this to fill Arrow vectors from Java
> arrays:
> ----
> int[] inputValues = ...;
> boolean[] nullInputValues = ...;
>
> org.apache.arrow.vector.IntVector vector = ...;
> for(int i = 0; i < inputValues.size; i++) {
>   if(nullInputValues[i]) {
>     vector.setNull(i);
>   } else {
>     vector.set(i, inputValues[i]);
>   }
> }
> ----
>
> Obviously the JIT won't be able to vectorize this loop. Does anyone know if
> there is another way to achieve this which
> would be vectorized?
>
> Here is a pseudo-code mockup of what I was thinking about, is this approach
> worth pursuing?
>
> The idea is to try to convert input into Arrow format in a vectorized loop,
> and then use sun.misc.Unsafe to copy the
> converted on-heap input to an off-heap valueBuffer.
>
> I'll ignore the details of the validityBuffer here, since it would follow
> along the same lines:
>
> ----
> int[] inputValues = ...;
> org.apache.arrow.vector.IntVector vector = ...;
>
> for(int i = 0; i < inputValues.size; i++) {
>   //convert inputValues[i] to little-endian
>   //this conversion can be SIMD vectorized?
> }
> UNSAFE.copyMemory(
>   inputValues,
>   0,
>   null,
>   vector.getDataBuffer().memoryAddress(),
>   sizeof(Integer.class) * inputValues.size
> );
> ----
>
> Thanks for any feedback about details I may be misunderstanding, which
> would make this approach infeasible.