Posted to dev@arrow.apache.org by Jorge Cardoso Leitão <jo...@gmail.com> on 2021/09/06 17:09:31 UTC

[Question] Allocations along 64 byte cache lines

Hi,

We have a whole section related to byte alignment (
https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding)
recommending 64-byte alignment and referring to Intel's manual.

Do we have evidence that this alignment helps (besides Intel's claims)?

I am asking because, going through arrow-rs, I see that we use an alignment
of 128 bytes (following the stream prefetch recommendation from Intel [1]).

I recently experimented with changing it to 64 bytes and also to the native
alignment (i.e. an i32 aligned to 4 bytes), and I observed no difference
in performance when compiled for "skylake-avx512".

Specifically, I performed two types of tests, a "random sum" where we
compute the sum of the values taken at random indices, and "sum", where we
sum all values of the array (buffer[1] of the primitive array), both for
arrays ranging from 2^10 to 2^25 elements. I was expecting that, at least in
the latter, prefetching would help, but I do not observe any difference.
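
For concreteness, the two kernels are essentially the following. This is a
simplified, dependency-free sketch rather than the actual arrow-rs benches;
the LCG index generation is only there to avoid an external rand dependency.

// "sum" reads the values buffer sequentially; "random sum" gathers values
// at pseudo-random indices. Timing is elided; the real benches measure the
// two calls for sizes 2^10 .. 2^25 (note: the largest sizes need a few
// hundred MB of RAM).
fn sum(values: &[i32]) -> i64 {
    values.iter().map(|&v| v as i64).sum()
}

fn random_sum(values: &[i32], indices: &[usize]) -> i64 {
    indices.iter().map(|&i| values[i] as i64).sum()
}

fn main() {
    for exp in 10..=25u32 {
        let len = 1usize << exp;
        let values: Vec<i32> = (0..len as i32).collect();
        // Cheap LCG so the sketch has no external dependencies.
        let mut state = 0x9E3779B97F4A7C15u64;
        let indices: Vec<usize> = (0..len)
            .map(|_| {
                state = state.wrapping_mul(6364136223846793005).wrapping_add(1);
                (state >> 33) as usize % len
            })
            .collect();
        let _ = sum(&values);
        let _ = random_sum(&values, &indices);
    }
}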

I was wondering if anyone:

* has observed an equivalent behavior,
* knows of a good benchmark where these things matter, or
* has an explanation.

Thanks a lot!

Best,
Jorge

[1]
https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf,
sec. 3.7.3, page 162

Re: [Question] Allocations along 64 byte cache lines

Posted by Jed Brown <je...@jedbrown.org>.
Jorge Cardoso Leitão <jo...@gmail.com> writes:

> Yes, I expect aligned SIMD loads to be faster.
>
> My understanding is that we do not need an alignment requirement for this,
> though: split the buffer in 3, [unaligned][aligned][unaligned], use aligned
> loads for the middle and un-aligned (or not even SIMD) for the prefix and
> suffix. This is generic over the size of the SIMD and buffer slicing, where
> alignment can be lost. Or am I missing something?

If you add two arrays with different alignment, the [aligned] portions don't
"line up", so you're always pulling unaligned loads from one of the arrays.
This interaction between arrays is usually the rationale when HPC software
decides to specify alignment. It may not be "worth it" to Arrow. If you have
a high arithmetic-intensity operation, you can afford to pack into aligned
tiles (all GEMM-type implementations do this).
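
As a rough illustration (hypothetical buffers, not Arrow code): the quantity
that matters for a vectorized a[i] + b[i] loop is the relative misalignment
of the two value buffers, and no choice of starting offset can fix it when
it is non-zero.

// Relative misalignment of two buffers with respect to a given SIMD width.
// If this is non-zero, at most one of the two streams can use aligned loads
// in an element-wise a[i] + b[i] kernel.
fn relative_misalignment(a: &[f64], b: &[f64], simd_bytes: usize) -> usize {
    let pa = a.as_ptr() as usize % simd_bytes;
    let pb = b.as_ptr() as usize % simd_bytes;
    (pb + simd_bytes - pa) % simd_bytes
}

fn main() {
    let a = vec![1.0f64; 1024];
    let b = vec![2.0f64; 1024];
    println!("relative misalignment (64B): {} bytes",
             relative_misalignment(&a, &b, 64));
}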

Re: [Question] Allocations along 64 byte cache lines

Posted by Jorge Cardoso Leitão <jo...@gmail.com>.
Thanks Yibo,

Yes, I expect aligned SIMD loads to be faster.

My understanding is that we do not need an alignment requirement for this,
though: split the buffer in 3 parts, [unaligned][aligned][unaligned], use
aligned loads for the middle and unaligned loads (or not even SIMD) for the
prefix and suffix. This is generic over the SIMD width and over buffer
slicing, where alignment can be lost. Or am I missing something?
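
In Rust terms, something like the following (a minimal sketch; `Chunk` is a
hypothetical 64-byte block, and the middle loop is plain scalar code where a
real kernel would use aligned SIMD loads):

// slice::align_to splits any slice into [unaligned][aligned][unaligned],
// regardless of how the buffer was allocated or sliced.
#[repr(C, align(64))]
struct Chunk([i32; 16]); // 16 * 4 bytes = one cache line

fn sum(values: &[i32]) -> i64 {
    // Safety: Chunk is plain old data with the same bit layout as [i32; 16],
    // so reinterpreting the aligned middle part is valid.
    let (prefix, middle, suffix) = unsafe { values.align_to::<Chunk>() };
    let mut acc: i64 = prefix.iter().map(|&v| v as i64).sum();
    for chunk in middle {
        // This loop only ever sees 64-byte-aligned data.
        acc += chunk.0.iter().map(|&v| v as i64).sum::<i64>();
    }
    acc + suffix.iter().map(|&v| v as i64).sum::<i64>()
}

fn main() {
    let v: Vec<i32> = (0..1000).collect();
    assert_eq!(sum(&v[3..]), (3..1000).map(|x| x as i64).sum::<i64>());
    println!("ok");
}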

Best,
Jorge





On Wed, Sep 8, 2021 at 4:26 AM Yibo Cai <yi...@arm.com> wrote:

> Thanks Jorge,
>
> I'm wondering if the 64 bytes alignment requirement is for cache or for
> simd register(avx512?).
>
> For simd, looks register width alignment does helps.
> E.g., _mm_load_si128 can only load 128 bits aligned data, it performs
> better than _mm_loadu_si128, which supports unaligned load.
>
> Again, be very skeptical to the benchmark :)
> https://quick-bench.com/q/NxyDu89azmKJmiVxF29Ei8FybWk
>
>
> On 9/7/21 7:16 PM, Jorge Cardoso Leitão wrote:
> > Thanks,
> >
> > I think that the alignment requirement in IPC is different from this one:
> > we enforce 8/64 byte alignment when serializing for IPC, but we (only)
> > recommend 64 byte alignment in memory addresses (at least this is my
> > understanding from the above link).
> >
> > I did test adding two arrays and the result is independent of the
> alignment
> > (on my machine, compiler, etc).
> >
> > Yibo, thanks a lot for that example. I am unsure whether it captures the
> > cache alignment concept, though: in the example we are reading a long (8
> > bytes) from a pointer that is not aligned with 8 bytes (63 % 8 != 0),
> which
> > is both slow and often undefined behavior. I think that the bench we want
> > is to change 63 to 64-8 (which is still not 64-bytes cache aligned but
> > aligned with a long), the difference vanishes (under the same gotchas
> that
> > you mentioned) https://quick-bench.com/q/EKIpQFJsAogSHXXLqamoWSTy-eE.
> > Alternatively, add an int32 with an offset of 4.
> >
> > I benched both with explicit (via intrinsics) SIMD and without (i.e. let
> > the compiler do it for us), and the alignment does not impact the
> benches.
> >
> > Best,
> > Jorge
> >
> > [1] https://stackoverflow.com/a/27184001/931303
> >
> >
> >
> >
> >
> > On Tue, Sep 7, 2021 at 4:29 AM Yibo Cai <yi...@arm.com> wrote:
> >
> >> Did a quick bench of accessing long buffer not 8 bytes aligned. Giving
> >> enough conditions, looks it does shows unaligned access has some penalty
> >> over aligned access. But I don't think this is an issue in practice.
> >>
> >> Please be very skeptical to this benchmark. It's hard to get it right
> >> given the complexity of hardware, compiler, benchmark tool and env.
> >>
> >> https://quick-bench.com/q/GmyqRk6saGfRu8XnMUyoSXs4SCk
> >>
> >>
> >> On 9/7/21 7:55 AM, Micah Kornfield wrote:
> >>>>
> >>>> My own impression is that the emphasis may be slightly exagerated. But
> >>>> perhaps some other benchmarks would prove differently.
> >>>
> >>>
> >>> This is probably true.  [1] is the original mailing list discussion.  I
> >>> think lack of measurable differences and high overhead for 64 byte
> >>> alignment was the reason for relaxing to 8 byte alignment.
> >>>
> >>> Specifically, I performed two types of tests, a "random sum" where we
> >>>> compute the sum of the values taken at random indices, and "sum",
> where
> >> we
> >>>> sum all values of the array (buffer[1] of the primitive array), both
> for
> >>>> array ranging from 2^10 to 2^25 elements. I was expecting that, at
> >> least in
> >>>> the latter, prefetching would help, but I do not observe any
> difference.
> >>>
> >>>
> >>> The most likely place I think where this could make a difference would
> be
> >>> for operations on wider types (Decimal128 and Decimal256).   Another
> >> place
> >>> where I think alignment could help is when adding two primitive arrays
> >> (it
> >>> sounds like this was summing a single array?).
> >>>
> >>> [1]
> >>>
> >>
> https://lists.apache.org/thread.html/945b65fb4bc8bcdab695b572f9e9c2dca4cd89012fdbd896a6f2d886%401460092304%40%3Cdev.arrow.apache.org%3E
> >>>
> >>> On Mon, Sep 6, 2021 at 3:05 PM Antoine Pitrou <an...@python.org>
> >> wrote:
> >>>
> >>>>
> >>>> Le 06/09/2021 à 23:20, Jorge Cardoso Leitão a écrit :
> >>>>> Thanks a lot Antoine for the pointers. Much appreciated!
> >>>>>
> >>>>> Generally, it should not hurt to align allocations to 64 bytes
> anyway,
> >>>>>> since you are generally dealing with large enough data that the
> >>>>>> (small) memory overhead doesn't matter.
> >>>>>
> >>>>> Not for performance. However, 64 byte alignment in Rust requires
> >>>>> maintaining a custom container, a custom allocator, and the inability
> >> to
> >>>>> interoperate with `std::Vec` and the ecosystem that is based on it,
> >> since
> >>>>> std::Vec allocates with alignment T (.e.g int32), not 64 bytes. For
> >>>> anyone
> >>>>> interested, the background for this is this old PR [1] in this in
> >> arrow2
> >>>>> [2].
> >>>>
> >>>> I see. In the C++ implementation, we are not compatible with the
> default
> >>>> allocator either (but C++ allocators as defined by the standard
> library
> >>>> don't support resizing, which doesn't make them terribly useful for
> >>>> Arrow anyway).
> >>>>
> >>>>> Neither myself in micro benches nor Ritchie from polars (query
> engine)
> >> in
> >>>>> large scale benches observe any difference in the archs we have
> >>>> available.
> >>>>> This is not consistent with the emphasis we put on the memory
> >> alignments
> >>>>> discussion [3], and I am trying to understand the root cause for this
> >>>>> inconsistency.
> >>>>
> >>>> My own impression is that the emphasis may be slightly exagerated. But
> >>>> perhaps some other benchmarks would prove differently.
> >>>>
> >>>>> By prefetching I mean implicit; no intrinsics involved.
> >>>>
> >>>> Well, I'm not aware that implicit prefetching depends on alignment.
> >>>>
> >>>> Regards
> >>>>
> >>>> Antoine.
> >>>>
> >>>
> >>
> >
>

Re: [Question] Allocations along 64 byte cache lines

Posted by Yibo Cai <yi...@arm.com>.
Thanks Jorge,

I'm wondering if the 64-byte alignment requirement is for the cache or for
SIMD registers (AVX-512?).

For SIMD, it looks like register-width alignment does help.
E.g., _mm_load_si128 can only load 128-bit-aligned data; it performs
better than _mm_loadu_si128, which supports unaligned loads.

Again, be very skeptical of the benchmark :)
https://quick-bench.com/q/NxyDu89azmKJmiVxF29Ei8FybWk
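
The same pair of intrinsics is exposed in Rust's core::arch, for anyone
following along from arrow-rs; a tiny illustrative sketch (not a benchmark):

// _mm_load_si128 requires a 16-byte-aligned pointer (unaligned input is
// UB); _mm_loadu_si128 accepts any pointer. SSE2 is part of the x86_64
// baseline, so these are callable inside an unsafe block.
#[cfg(target_arch = "x86_64")]
fn demo() {
    use std::arch::x86_64::{__m128i, _mm_load_si128, _mm_loadu_si128};

    let data: Vec<i64> = vec![1, 2, 3, 4];
    let ptr = data.as_ptr() as *const __m128i;
    unsafe {
        // Vec<i64> is only guaranteed 8-byte alignment, so only the
        // unaligned load is unconditionally correct here.
        let _v = _mm_loadu_si128(ptr);
        if (ptr as usize) % 16 == 0 {
            let _v = _mm_load_si128(ptr); // aligned load, only when legal
        }
    }
}

fn main() {
    #[cfg(target_arch = "x86_64")]
    demo();
}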


On 9/7/21 7:16 PM, Jorge Cardoso Leitão wrote:
> Thanks,
> 
> I think that the alignment requirement in IPC is different from this one:
> we enforce 8/64 byte alignment when serializing for IPC, but we (only)
> recommend 64 byte alignment in memory addresses (at least this is my
> understanding from the above link).
> 
> I did test adding two arrays and the result is independent of the alignment
> (on my machine, compiler, etc).
> 
> Yibo, thanks a lot for that example. I am unsure whether it captures the
> cache alignment concept, though: in the example we are reading a long (8
> bytes) from a pointer that is not aligned with 8 bytes (63 % 8 != 0), which
> is both slow and often undefined behavior. I think that the bench we want
> is to change 63 to 64-8 (which is still not 64-bytes cache aligned but
> aligned with a long), the difference vanishes (under the same gotchas that
> you mentioned) https://quick-bench.com/q/EKIpQFJsAogSHXXLqamoWSTy-eE.
> Alternatively, add an int32 with an offset of 4.
> 
> I benched both with explicit (via intrinsics) SIMD and without (i.e. let
> the compiler do it for us), and the alignment does not impact the benches.
> 
> Best,
> Jorge
> 
> [1] https://stackoverflow.com/a/27184001/931303
> 
> 
> 
> 
> 
> On Tue, Sep 7, 2021 at 4:29 AM Yibo Cai <yi...@arm.com> wrote:
> 
>> Did a quick bench of accessing long buffer not 8 bytes aligned. Giving
>> enough conditions, looks it does shows unaligned access has some penalty
>> over aligned access. But I don't think this is an issue in practice.
>>
>> Please be very skeptical to this benchmark. It's hard to get it right
>> given the complexity of hardware, compiler, benchmark tool and env.
>>
>> https://quick-bench.com/q/GmyqRk6saGfRu8XnMUyoSXs4SCk
>>
>>
>> On 9/7/21 7:55 AM, Micah Kornfield wrote:
>>>>
>>>> My own impression is that the emphasis may be slightly exagerated. But
>>>> perhaps some other benchmarks would prove differently.
>>>
>>>
>>> This is probably true.  [1] is the original mailing list discussion.  I
>>> think lack of measurable differences and high overhead for 64 byte
>>> alignment was the reason for relaxing to 8 byte alignment.
>>>
>>> Specifically, I performed two types of tests, a "random sum" where we
>>>> compute the sum of the values taken at random indices, and "sum", where
>> we
>>>> sum all values of the array (buffer[1] of the primitive array), both for
>>>> array ranging from 2^10 to 2^25 elements. I was expecting that, at
>> least in
>>>> the latter, prefetching would help, but I do not observe any difference.
>>>
>>>
>>> The most likely place I think where this could make a difference would be
>>> for operations on wider types (Decimal128 and Decimal256).   Another
>> place
>>> where I think alignment could help is when adding two primitive arrays
>> (it
>>> sounds like this was summing a single array?).
>>>
>>> [1]
>>>
>> https://lists.apache.org/thread.html/945b65fb4bc8bcdab695b572f9e9c2dca4cd89012fdbd896a6f2d886%401460092304%40%3Cdev.arrow.apache.org%3E
>>>
>>> On Mon, Sep 6, 2021 at 3:05 PM Antoine Pitrou <an...@python.org>
>> wrote:
>>>
>>>>
>>>> Le 06/09/2021 à 23:20, Jorge Cardoso Leitão a écrit :
>>>>> Thanks a lot Antoine for the pointers. Much appreciated!
>>>>>
>>>>> Generally, it should not hurt to align allocations to 64 bytes anyway,
>>>>>> since you are generally dealing with large enough data that the
>>>>>> (small) memory overhead doesn't matter.
>>>>>
>>>>> Not for performance. However, 64 byte alignment in Rust requires
>>>>> maintaining a custom container, a custom allocator, and the inability
>> to
>>>>> interoperate with `std::Vec` and the ecosystem that is based on it,
>> since
>>>>> std::Vec allocates with alignment T (.e.g int32), not 64 bytes. For
>>>> anyone
>>>>> interested, the background for this is this old PR [1] in this in
>> arrow2
>>>>> [2].
>>>>
>>>> I see. In the C++ implementation, we are not compatible with the default
>>>> allocator either (but C++ allocators as defined by the standard library
>>>> don't support resizing, which doesn't make them terribly useful for
>>>> Arrow anyway).
>>>>
>>>>> Neither myself in micro benches nor Ritchie from polars (query engine)
>> in
>>>>> large scale benches observe any difference in the archs we have
>>>> available.
>>>>> This is not consistent with the emphasis we put on the memory
>> alignments
>>>>> discussion [3], and I am trying to understand the root cause for this
>>>>> inconsistency.
>>>>
>>>> My own impression is that the emphasis may be slightly exagerated. But
>>>> perhaps some other benchmarks would prove differently.
>>>>
>>>>> By prefetching I mean implicit; no intrinsics involved.
>>>>
>>>> Well, I'm not aware that implicit prefetching depends on alignment.
>>>>
>>>> Regards
>>>>
>>>> Antoine.
>>>>
>>>
>>
> 

Re: [Question] Allocations along 64 byte cache lines

Posted by Jorge Cardoso Leitão <jo...@gmail.com>.
Thanks,

I think that the alignment requirement in IPC is different from this one:
we enforce 8/64 byte alignment when serializing for IPC, but we (only)
recommend 64 byte alignment in memory addresses (at least this is my
understanding from the above link).

I did test adding two arrays and the result is independent of the alignment
(on my machine, compiler, etc).

Yibo, thanks a lot for that example. I am unsure whether it captures the
cache alignment concept, though: in the example we are reading a long (8
bytes) from a pointer that is not aligned to 8 bytes (63 % 8 != 0), which
is both slow and often undefined behavior. I think the bench we want is to
change 63 to 64-8 (which is still not 64-byte cache aligned, but is aligned
for a long); with that change the difference vanishes (under the same
gotchas that you mentioned):
https://quick-bench.com/q/EKIpQFJsAogSHXXLqamoWSTy-eE.
Alternatively, add an int32 with an offset of 4.
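
To make the offsets concrete in Rust terms (illustrative only, not the
actual quick-bench code):

fn main() {
    let buf = vec![0u8; 1024];
    let base = buf.as_ptr();
    unsafe {
        // Offset 63 is misaligned for an 8-byte integer (63 % 8 != 0):
        // a plain read through `*const u64` would be UB, so it has to be
        // read_unaligned (and may be slower).
        let _a: u64 = base.add(63).cast::<u64>().read_unaligned();

        // Offset 56 = 64 - 8 is 8-byte aligned relative to the base, though
        // still not cache-line aligned. A normal read is fine only if the
        // base itself is 8-byte aligned, which Vec<u8> does not guarantee.
        let p = base.add(56).cast::<u64>();
        let _b: u64 = if (p as usize) % 8 == 0 {
            p.read()
        } else {
            p.read_unaligned()
        };
    }
}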

I benched both with explicit SIMD (via intrinsics) and without (i.e. letting
the compiler do it for us), and the alignment does not impact the benches.

Best,
Jorge

[1] https://stackoverflow.com/a/27184001/931303





On Tue, Sep 7, 2021 at 4:29 AM Yibo Cai <yi...@arm.com> wrote:

> Did a quick bench of accessing long buffer not 8 bytes aligned. Giving
> enough conditions, looks it does shows unaligned access has some penalty
> over aligned access. But I don't think this is an issue in practice.
>
> Please be very skeptical to this benchmark. It's hard to get it right
> given the complexity of hardware, compiler, benchmark tool and env.
>
> https://quick-bench.com/q/GmyqRk6saGfRu8XnMUyoSXs4SCk
>
>
> On 9/7/21 7:55 AM, Micah Kornfield wrote:
> >>
> >> My own impression is that the emphasis may be slightly exagerated. But
> >> perhaps some other benchmarks would prove differently.
> >
> >
> > This is probably true.  [1] is the original mailing list discussion.  I
> > think lack of measurable differences and high overhead for 64 byte
> > alignment was the reason for relaxing to 8 byte alignment.
> >
> > Specifically, I performed two types of tests, a "random sum" where we
> >> compute the sum of the values taken at random indices, and "sum", where
> we
> >> sum all values of the array (buffer[1] of the primitive array), both for
> >> array ranging from 2^10 to 2^25 elements. I was expecting that, at
> least in
> >> the latter, prefetching would help, but I do not observe any difference.
> >
> >
> > The most likely place I think where this could make a difference would be
> > for operations on wider types (Decimal128 and Decimal256).   Another
> place
> > where I think alignment could help is when adding two primitive arrays
> (it
> > sounds like this was summing a single array?).
> >
> > [1]
> >
> https://lists.apache.org/thread.html/945b65fb4bc8bcdab695b572f9e9c2dca4cd89012fdbd896a6f2d886%401460092304%40%3Cdev.arrow.apache.org%3E
> >
> > On Mon, Sep 6, 2021 at 3:05 PM Antoine Pitrou <an...@python.org>
> wrote:
> >
> >>
> >> Le 06/09/2021 à 23:20, Jorge Cardoso Leitão a écrit :
> >>> Thanks a lot Antoine for the pointers. Much appreciated!
> >>>
> >>> Generally, it should not hurt to align allocations to 64 bytes anyway,
> >>>> since you are generally dealing with large enough data that the
> >>>> (small) memory overhead doesn't matter.
> >>>
> >>> Not for performance. However, 64 byte alignment in Rust requires
> >>> maintaining a custom container, a custom allocator, and the inability
> to
> >>> interoperate with `std::Vec` and the ecosystem that is based on it,
> since
> >>> std::Vec allocates with alignment T (.e.g int32), not 64 bytes. For
> >> anyone
> >>> interested, the background for this is this old PR [1] in this in
> arrow2
> >>> [2].
> >>
> >> I see. In the C++ implementation, we are not compatible with the default
> >> allocator either (but C++ allocators as defined by the standard library
> >> don't support resizing, which doesn't make them terribly useful for
> >> Arrow anyway).
> >>
> >>> Neither myself in micro benches nor Ritchie from polars (query engine)
> in
> >>> large scale benches observe any difference in the archs we have
> >> available.
> >>> This is not consistent with the emphasis we put on the memory
> alignments
> >>> discussion [3], and I am trying to understand the root cause for this
> >>> inconsistency.
> >>
> >> My own impression is that the emphasis may be slightly exagerated. But
> >> perhaps some other benchmarks would prove differently.
> >>
> >>> By prefetching I mean implicit; no intrinsics involved.
> >>
> >> Well, I'm not aware that implicit prefetching depends on alignment.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >
>

Re: [Question] Allocations along 64 byte cache lines

Posted by Yibo Cai <yi...@arm.com>.
Did a quick bench of accessing a buffer of longs that is not 8-byte aligned.
Given the right conditions, it looks like unaligned access does have some
penalty over aligned access. But I don't think this is an issue in practice.

Please be very skeptical of this benchmark. It's hard to get it right
given the complexity of the hardware, compiler, benchmark tool and environment.

https://quick-bench.com/q/GmyqRk6saGfRu8XnMUyoSXs4SCk


On 9/7/21 7:55 AM, Micah Kornfield wrote:
>>
>> My own impression is that the emphasis may be slightly exagerated. But
>> perhaps some other benchmarks would prove differently.
> 
> 
> This is probably true.  [1] is the original mailing list discussion.  I
> think lack of measurable differences and high overhead for 64 byte
> alignment was the reason for relaxing to 8 byte alignment.
> 
> Specifically, I performed two types of tests, a "random sum" where we
>> compute the sum of the values taken at random indices, and "sum", where we
>> sum all values of the array (buffer[1] of the primitive array), both for
>> array ranging from 2^10 to 2^25 elements. I was expecting that, at least in
>> the latter, prefetching would help, but I do not observe any difference.
> 
> 
> The most likely place I think where this could make a difference would be
> for operations on wider types (Decimal128 and Decimal256).   Another place
> where I think alignment could help is when adding two primitive arrays (it
> sounds like this was summing a single array?).
> 
> [1]
> https://lists.apache.org/thread.html/945b65fb4bc8bcdab695b572f9e9c2dca4cd89012fdbd896a6f2d886%401460092304%40%3Cdev.arrow.apache.org%3E
> 
> On Mon, Sep 6, 2021 at 3:05 PM Antoine Pitrou <an...@python.org> wrote:
> 
>>
>> Le 06/09/2021 à 23:20, Jorge Cardoso Leitão a écrit :
>>> Thanks a lot Antoine for the pointers. Much appreciated!
>>>
>>> Generally, it should not hurt to align allocations to 64 bytes anyway,
>>>> since you are generally dealing with large enough data that the
>>>> (small) memory overhead doesn't matter.
>>>
>>> Not for performance. However, 64 byte alignment in Rust requires
>>> maintaining a custom container, a custom allocator, and the inability to
>>> interoperate with `std::Vec` and the ecosystem that is based on it, since
>>> std::Vec allocates with alignment T (.e.g int32), not 64 bytes. For
>> anyone
>>> interested, the background for this is this old PR [1] in this in arrow2
>>> [2].
>>
>> I see. In the C++ implementation, we are not compatible with the default
>> allocator either (but C++ allocators as defined by the standard library
>> don't support resizing, which doesn't make them terribly useful for
>> Arrow anyway).
>>
>>> Neither myself in micro benches nor Ritchie from polars (query engine) in
>>> large scale benches observe any difference in the archs we have
>> available.
>>> This is not consistent with the emphasis we put on the memory alignments
>>> discussion [3], and I am trying to understand the root cause for this
>>> inconsistency.
>>
>> My own impression is that the emphasis may be slightly exagerated. But
>> perhaps some other benchmarks would prove differently.
>>
>>> By prefetching I mean implicit; no intrinsics involved.
>>
>> Well, I'm not aware that implicit prefetching depends on alignment.
>>
>> Regards
>>
>> Antoine.
>>
> 

Re: [Question] Allocations along 64 byte cache lines

Posted by Micah Kornfield <em...@gmail.com>.
>
> My own impression is that the emphasis may be slightly exagerated. But
> perhaps some other benchmarks would prove differently.


This is probably true.  [1] is the original mailing list discussion.  I
think the lack of measurable differences and the high overhead of 64-byte
alignment were the reasons for relaxing to 8-byte alignment.

Specifically, I performed two types of tests, a "random sum" where we
> compute the sum of the values taken at random indices, and "sum", where we
> sum all values of the array (buffer[1] of the primitive array), both for
> array ranging from 2^10 to 2^25 elements. I was expecting that, at least in
> the latter, prefetching would help, but I do not observe any difference.


The most likely place where I think this could make a difference would be
operations on wider types (Decimal128 and Decimal256). Another place where
I think alignment could help is when adding two primitive arrays (it sounds
like this was summing a single array?).

[1]
https://lists.apache.org/thread.html/945b65fb4bc8bcdab695b572f9e9c2dca4cd89012fdbd896a6f2d886%401460092304%40%3Cdev.arrow.apache.org%3E

On Mon, Sep 6, 2021 at 3:05 PM Antoine Pitrou <an...@python.org> wrote:

>
> Le 06/09/2021 à 23:20, Jorge Cardoso Leitão a écrit :
> > Thanks a lot Antoine for the pointers. Much appreciated!
> >
> > Generally, it should not hurt to align allocations to 64 bytes anyway,
> >> since you are generally dealing with large enough data that the
> >> (small) memory overhead doesn't matter.
> >
> > Not for performance. However, 64 byte alignment in Rust requires
> > maintaining a custom container, a custom allocator, and the inability to
> > interoperate with `std::Vec` and the ecosystem that is based on it, since
> > std::Vec allocates with alignment T (.e.g int32), not 64 bytes. For
> anyone
> > interested, the background for this is this old PR [1] in this in arrow2
> > [2].
>
> I see. In the C++ implementation, we are not compatible with the default
> allocator either (but C++ allocators as defined by the standard library
> don't support resizing, which doesn't make them terribly useful for
> Arrow anyway).
>
> > Neither myself in micro benches nor Ritchie from polars (query engine) in
> > large scale benches observe any difference in the archs we have
> available.
> > This is not consistent with the emphasis we put on the memory alignments
> > discussion [3], and I am trying to understand the root cause for this
> > inconsistency.
>
> My own impression is that the emphasis may be slightly exagerated. But
> perhaps some other benchmarks would prove differently.
>
> > By prefetching I mean implicit; no intrinsics involved.
>
> Well, I'm not aware that implicit prefetching depends on alignment.
>
> Regards
>
> Antoine.
>

Re: [Question] Allocations along 64 byte cache lines

Posted by Antoine Pitrou <an...@python.org>.
On 06/09/2021 at 23:20, Jorge Cardoso Leitão wrote:
> Thanks a lot Antoine for the pointers. Much appreciated!
> 
> Generally, it should not hurt to align allocations to 64 bytes anyway,
>> since you are generally dealing with large enough data that the
>> (small) memory overhead doesn't matter.
> 
> Not for performance. However, 64 byte alignment in Rust requires
> maintaining a custom container, a custom allocator, and the inability to
> interoperate with `std::Vec` and the ecosystem that is based on it, since
> std::Vec allocates with alignment T (.e.g int32), not 64 bytes. For anyone
> interested, the background for this is this old PR [1] in this in arrow2
> [2].

I see. In the C++ implementation, we are not compatible with the default 
allocator either (but C++ allocators as defined by the standard library 
don't support resizing, which doesn't make them terribly useful for 
Arrow anyway).

> Neither myself in micro benches nor Ritchie from polars (query engine) in
> large scale benches observe any difference in the archs we have available.
> This is not consistent with the emphasis we put on the memory alignments
> discussion [3], and I am trying to understand the root cause for this
> inconsistency.

My own impression is that the emphasis may be slightly exaggerated. But 
perhaps some other benchmarks would prove differently.

> By prefetching I mean implicit; no intrinsics involved.

Well, I'm not aware that implicit prefetching depends on alignment.

Regards

Antoine.

Re: [Question] Allocations along 64 byte cache lines

Posted by Eduardo Ponce <ed...@gmail.com>.
To add to Antoine's points: besides data alignment being beneficial for
reducing cache line reads/writes and, overall, for using the cache more
effectively, another key point is the use of vector (SIMD) registers.
Although recent CPUs can load unaligned data into vector registers at speeds
similar to aligned data, it is always recommended to have your data aligned
so that reads/writes to vector registers occur via aligned instructions.
Also, there are cases where alignment is required by a specific library or
API, so you are forced to abide by its alignment rules.
In general, making your data align well with the memory and CPU hardware is
more efficient than not. That is why C structs are padded, why some memory
allocators round allocations up to a multiple of the cache line/page size,
etc. I am glad that Arrow was designed with memory alignment in mind,
because this will make adding more vectorization functionality easier.
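
For a concrete picture of the padding/alignment rules, a small illustration
(in Rust rather than C, but the layout behaviour is analogous):

use std::mem::{align_of, size_of};

#[repr(C)]
struct Padded {
    a: u8,  // 1 byte + 3 bytes of padding
    b: u32, // must start at a 4-byte boundary
}

#[repr(C, align(64))]
struct CacheLineAligned {
    a: u8, // padded out to a full 64-byte cache line
}

fn main() {
    // Prints "Padded: size 8, align 4" and
    // "CacheLineAligned: size 64, align 64".
    println!("Padded: size {}, align {}",
             size_of::<Padded>(), align_of::<Padded>());
    println!("CacheLineAligned: size {}, align {}",
             size_of::<CacheLineAligned>(), align_of::<CacheLineAligned>());
}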

~Eduardo

On Mon, Sep 6, 2021 at 5:21 PM Jorge Cardoso Leitão <
jorgecarleitao@gmail.com> wrote:

> Thanks a lot Antoine for the pointers. Much appreciated!
>
> Generally, it should not hurt to align allocations to 64 bytes anyway,
> > since you are generally dealing with large enough data that the
> > (small) memory overhead doesn't matter.
> >
>
> Not for performance. However, 64 byte alignment in Rust requires
> maintaining a custom container, a custom allocator, and the inability to
> interoperate with `std::Vec` and the ecosystem that is based on it, since
> std::Vec allocates with alignment T (.e.g int32), not 64 bytes. For anyone
> interested, the background for this is this old PR [1] in this in arrow2
> [2].
>
> Neither myself in micro benches nor Ritchie from polars (query engine) in
> large scale benches observe any difference in the archs we have available.
> This is not consistent with the emphasis we put on the memory alignments
> discussion [3], and I am trying to understand the root cause for this
> inconsistency.
>
> By prefetching I mean implicit; no intrinsics involved.
>
> Best,
> Jorge
>
> [1] https://github.com/apache/arrow/pull/8796
> [2] https://github.com/jorgecarleitao/arrow2/pull/385
> [2]
>
> https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding
>
>
>
>
>
> On Mon, Sep 6, 2021 at 6:51 PM Antoine Pitrou <an...@python.org> wrote:
>
> >
> > Le 06/09/2021 à 19:45, Antoine Pitrou a écrit :
> > >
> > >> Specifically, I performed two types of tests, a "random sum" where we
> > >> compute the sum of the values taken at random indices, and "sum",
> where
> > we
> > >> sum all values of the array (buffer[1] of the primitive array), both
> for
> > >> array ranging from 2^10 to 2^25 elements. I was expecting that, at
> > least in
> > >> the latter, prefetching would help, but I do not observe any
> difference.
> > >
> > > By prefetching, you mean explicit prefetching using intrinsics?
> > > Modern CPUs are very good at implicit prefetching, they are able to
> > > detect memory access patterns and optimize for them. Implicit
> > > prefetching would only possibly help if your access pattern is
> > > complicated (for example you're walking a chain of pointers).
> >
> > Oops: *explicit* prefecting would only possibly help.... sorry.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > > If your
> > > access is sequential, there is zero reason to prefetch explicitly
> > > nowadays, AFAIK.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> >
>

Re: [Question] Allocations along 64 byte cache lines

Posted by Jorge Cardoso Leitão <jo...@gmail.com>.
Thanks a lot Antoine for the pointers. Much appreciated!

Generally, it should not hurt to align allocations to 64 bytes anyway,
> since you are generally dealing with large enough data that the
> (small) memory overhead doesn't matter.
>

Not for performance. However, 64-byte alignment in Rust requires
maintaining a custom container and a custom allocator, and it means we
cannot interoperate with `std::Vec` and the ecosystem that is built on it,
since std::Vec allocates with the alignment of T (e.g. int32), not 64 bytes.
For anyone interested, the background for this is this old PR [1] and this
one in arrow2 [2].
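
For context, this is roughly the part a custom container has to do by hand
(a heavily simplified sketch; `AlignedBuffer` is illustrative, and the real
arrow-rs/arrow2 code also handles growth, reallocation, zero-copy slicing,
etc.):

use std::alloc::{alloc, dealloc, Layout};

struct AlignedBuffer {
    ptr: *mut u8,
    layout: Layout,
}

impl AlignedBuffer {
    fn new(size: usize) -> Self {
        // This is what Vec<T> will not do for us: Vec aligns to
        // align_of::<T>(), not to a cache line.
        let layout = Layout::from_size_align(size.max(1), 64).unwrap();
        let ptr = unsafe { alloc(layout) };
        assert!(!ptr.is_null(), "allocation failed");
        AlignedBuffer { ptr, layout }
    }
}

impl Drop for AlignedBuffer {
    fn drop(&mut self) {
        unsafe { dealloc(self.ptr, self.layout) }
    }
}

fn main() {
    let buf = AlignedBuffer::new(4096);
    assert_eq!(buf.ptr as usize % 64, 0);
    println!("64-byte aligned allocation at {:p}", buf.ptr);
}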

Neither I (in micro benches) nor Ritchie from polars (a query engine, in
large-scale benches) observe any difference on the archs we have available.
This is not consistent with the emphasis we put on the memory alignment
discussion [3], and I am trying to understand the root cause of this
inconsistency.

By prefetching I mean implicit; no intrinsics involved.

Best,
Jorge

[1] https://github.com/apache/arrow/pull/8796
[2] https://github.com/jorgecarleitao/arrow2/pull/385
[3]
https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding





On Mon, Sep 6, 2021 at 6:51 PM Antoine Pitrou <an...@python.org> wrote:

>
> Le 06/09/2021 à 19:45, Antoine Pitrou a écrit :
> >
> >> Specifically, I performed two types of tests, a "random sum" where we
> >> compute the sum of the values taken at random indices, and "sum", where
> we
> >> sum all values of the array (buffer[1] of the primitive array), both for
> >> array ranging from 2^10 to 2^25 elements. I was expecting that, at
> least in
> >> the latter, prefetching would help, but I do not observe any difference.
> >
> > By prefetching, you mean explicit prefetching using intrinsics?
> > Modern CPUs are very good at implicit prefetching, they are able to
> > detect memory access patterns and optimize for them. Implicit
> > prefetching would only possibly help if your access pattern is
> > complicated (for example you're walking a chain of pointers).
>
> Oops: *explicit* prefecting would only possibly help.... sorry.
>
> Regards
>
> Antoine.
>
>
> > If your
> > access is sequential, there is zero reason to prefetch explicitly
> > nowadays, AFAIK.
> >
> > Regards
> >
> > Antoine.
> >
> >
>

Re: [Question] Allocations along 64 byte cache lines

Posted by Antoine Pitrou <an...@python.org>.
On 06/09/2021 at 19:45, Antoine Pitrou wrote:
> 
>> Specifically, I performed two types of tests, a "random sum" where we
>> compute the sum of the values taken at random indices, and "sum", where we
>> sum all values of the array (buffer[1] of the primitive array), both for
>> array ranging from 2^10 to 2^25 elements. I was expecting that, at least in
>> the latter, prefetching would help, but I do not observe any difference.
> 
> By prefetching, you mean explicit prefetching using intrinsics?
> Modern CPUs are very good at implicit prefetching, they are able to
> detect memory access patterns and optimize for them. Implicit
> prefetching would only possibly help if your access pattern is
> complicated (for example you're walking a chain of pointers).

Oops: *explicit* prefetching would only possibly help.... sorry.

Regards

Antoine.


> If your
> access is sequential, there is zero reason to prefetch explicitly
> nowadays, AFAIK.
> 
> Regards
> 
> Antoine.
> 
> 

Re: [Question] Allocations along 64 byte cache lines

Posted by Antoine Pitrou <an...@python.org>.
On Mon, 6 Sep 2021 18:09:31 +0100
Jorge Cardoso Leitão <jo...@gmail.com> wrote:
> Hi,
> 
> We have a whole section related to byte alignment (
> https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding)
> recommending 64 byte alignment and referring to intel's manual.
> 
> Do we have evidence that this alignment helps (besides intel claims)?

I don't know if there is strong evidence for it. Modern CPUs are much
better at cache-unaligned accesses than they used to be. It doesn't
necessarily mean that such accesses are always free, however. It will
certainly vary depending on the CPU model, but also depending on the
workload (a compute-bound workload will of course suffer much less from
any hypothetical alignment issue).

Basically, depending on the CPU, an unaligned access *may* require
more resources than an aligned access. For example, an unaligned AVX512
access would always straddle two 64-byte cache lines, and therefore
issue two cache reads instead of one. But perhaps your CPU is capable
of two cache reads per clock anyway? In this case, the problem would
only show if you try to issue two AVX512 reads at once, which would
require four cache reads in the unaligned case (say, you're adding two
vectors instead of reduce-summing a single one).
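
Put as a back-of-the-envelope calculation (a trivial sketch, not Arrow
code):

// Count how many 64-byte cache lines a single load touches.
fn cache_lines_touched(addr: usize, load_bytes: usize) -> usize {
    let first = addr / 64;
    let last = (addr + load_bytes - 1) / 64;
    last - first + 1
}

fn main() {
    // A 64-byte (AVX512) load:
    assert_eq!(cache_lines_touched(0, 64), 1); // aligned: one cache read
    assert_eq!(cache_lines_touched(8, 64), 2); // unaligned: always two
    // Adding two unaligned streams therefore needs four cache reads per
    // iteration instead of two, which only hurts if cache reads are the
    // bottleneck.
    println!("ok");
}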

Generally, it should not hurt to align allocations to 64 bytes anyway,
since you are generally dealing with large enough data that the
(small) memory overhead doesn't matter.

> Specifically, I performed two types of tests, a "random sum" where we
> compute the sum of the values taken at random indices, and "sum", where we
> sum all values of the array (buffer[1] of the primitive array), both for
> array ranging from 2^10 to 2^25 elements. I was expecting that, at least in
> the latter, prefetching would help, but I do not observe any difference.

By prefetching, you mean explicit prefetching using intrinsics?
Modern CPUs are very good at implicit prefetching, they are able to
detect memory access patterns and optimize for them. Implicit
prefetching would only possibly help if your access pattern is
complicated (for example you're walking a chain of pointers). If your
access is sequential, there is zero reason to prefetch explicitly
nowadays, AFAIK.

Regards

Antoine.