Posted to dev@arrow.apache.org by Li Jin <ic...@gmail.com> on 2022/04/12 19:39:53 UTC

Perf/Benchmark for temporal operations

Hello!

We recently noticed unexpectedly low performance from Arrow's temporal
operation kernels (in particular, CeilTemporal). The throughput we see is
around 1.4-1.8 GB/s, which is much lower than adding a constant to a float
column (~9 GB/s). This is a bit surprising because CeilTemporal is similar
to a numeric round operation, so we are wondering whether there are
benchmarks around this and where the issue might be.

Thanks!
Li

Re: Perf/Benchmark for temporal operations

Posted by Rok Mihevc <ro...@gmail.com>.
I've opened a PR for temporal benchmarks:
https://github.com/apache/arrow/pull/12997
Please chime in if more benchmarks are needed.

Results for the first run are here:
https://conbench.ursa.dev/runs/019c6f9cdd82415382280c89be122b58/

Rok

Re: Perf/Benchmark for temporal operations

Posted by Benson Muite <be...@emailplus.org>.
On 4/13/22 7:58 PM, Rok Mihevc wrote:
> Thanks for describing the use case Li!
> 
>> The examples we ran are on UTC timestamp without any timezone
>> complications, perhaps there is room for short circuits when there are no
>> timezone complications...
> 
> I think using UTC zoned timestamp array might currently behave as a
> regular timezoned timestamp array and use the zoned path.
> However, setting timezone="" should use a non-zoned computation path.
> See here: https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/temporal_internal.h#L233
> 
> Rok
> 

For many of the kernels, a comparison against measured memory bandwidth,
for example as obtained with likwid-bench [1], would be a good test of the
quality of the implementation. However, getting close to memory bandwidth
typically requires SIMD, and many initial implementations do not use SIMD
operations, which in Arrow is currently done mostly through the xsimd
library [2]. Maybe this is something to add to the developer documentation?
There has been a related discussion on the list about xsimd adoption in
the Arrow codebase.

[1] https://github.com/RRZE-HPC/likwid/wiki/Likwid-Bench
[2] https://github.com/xtensor-stack/xsimd
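
[Editor's note: as a rough illustration of the kind of upper bound Benson describes, a crude single-core copy-bandwidth estimate can be made with NumPy; this is only a stand-in for a proper likwid-bench measurement, useful for setting expectations.]

```python
import time

import numpy as np

# Time a large array copy a few times and keep the fastest run.
src = np.ones(10_000_000)   # ~80 MB of float64
dst = np.empty_like(src)
best = float("inf")
for _ in range(3):
    start = time.perf_counter()
    np.copyto(dst, src)
    best = min(best, time.perf_counter() - start)
# A copy reads one buffer and writes another, so count both.
print(f"approx. copy bandwidth: {2 * src.nbytes / best / 1e9:.1f} GB/s")
```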

Re: Perf/Benchmark for temporal operations

Posted by Rok Mihevc <ro...@gmail.com>.
Thanks for describing the use case Li!

> The examples we ran are on UTC timestamp without any timezone
> complications, perhaps there is room for short circuits when there are no
> timezone complications...

I think a UTC-zoned timestamp array might currently behave like any other
timezoned timestamp array and take the zoned code path.
However, setting timezone="" should use the non-zoned computation path.
See here: https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/temporal_internal.h#L233

Rok

Re: Perf/Benchmark for temporal operations

Posted by Li Jin <ic...@gmail.com>.
Thanks both for the reply. It's understandable that those kernels might not
be optimized yet, considering the current state of Arrow compute.

> The temporal rounding operations operate on localized times taking into
account the timestamp's timezone, which is why they're more
computationally intensive that raw floating point operations.

The examples we ran are on UTC timestamp without any timezone
complications, perhaps there is room for short circuits when there are no
timezone complications...

> Which operation in particular did you benchmark? Is it part of a
significant workload for you or did you just try it out of curiosity?

We are trying to evaluate baseline performance of "Temporal Round" +
"GroupBy Aggregation" (round a stream of time series data to 5-minute
intervals and aggregate) and noticed this issue.

This is not an urgent issue, I am asking here mostly because I'd like to
understand what might be going on. The responses have been helpful. Thank
you!

On Wed, Apr 13, 2022 at 3:28 AM Antoine Pitrou <an...@python.org> wrote:

>
> Hello Li,
>
> The temporal rounding operations operate on localized times taking into
> account the timestamp's timezone, which is why they're more
> computationally intensive that raw floating point operations.
>
> Which operation in particular did you benchmark? Is it part of a
> significant workload for you or did you just try it out of curiosity?
>
> Regards
>
> Antoine.
>
>
>
>
> Le 12/04/2022 à 22:31, Li Jin a écrit :
> > Thanks David!
> >
> > I am not yet familiar with the implementation of this kernel so I am
> > hoping someone more familiar with kernels can shed some light on this. I
> > wonder if this is kind of expected performance (comparing to similar
> kernel
> > perf) or maybe something with the RoundTemporal implementation seems off?
> >
> > Steven (who ran the test) computed around 500 CPU cycles / value which
> > seems more than what is needed but I am not an expert on the kernels so
> > want to hear more thoughts from the dev.
> >
> > Li
> >
> > On Tue, Apr 12, 2022 at 4:19 PM David Li <li...@apache.org> wrote:
> >
> >> While we do track benchmarks for each commit on Conbench [1] it seems we
> >> lack benchmarks for the temporal operations. I filed ARROW-16173 [2].
> >>
> >> They do do a bit more work than just a round (especially if they need to
> >> handle time zones).
> >>
> >> [1]: https://conbench.ursa.dev/
> >> [2]: https://issues.apache.org/jira/browse/ARROW-16173
> >>
> >> -David
> >>
> >> On Tue, Apr 12, 2022, at 15:40, Li Jin wrote:
> >>> Sorry I should have mentioned this is the Arrow C++ compute kernels.
> >>>
> >>> On Tue, Apr 12, 2022 at 3:39 PM Li Jin <ic...@gmail.com> wrote:
> >>>
> >>>> Hello!
> >>>>
> >>>> We recently noticed unexpected performance with Arrow's temporal
> >>>> operation kernels (in particular, CeilTemporal). The perf we see are
> >> around
> >>>> 1.4-1.8 Gb / s. This seems to be much lower than adding a constant to
> a
> >>>> float column (~9Gb/s). This is a bit unexpected because CeilTemporal
> is
> >>>> similar to a numeric round operation so we are wondering if there are
> >> some
> >>>> benchmarks around this and where the issue might be?
> >>>>
> >>>> Thanks!
> >>>> Li
> >>>>
> >>
> >
>

Re: Perf/Benchmark for temporal operations

Posted by Antoine Pitrou <an...@python.org>.
Hello Li,

The temporal rounding operations operate on localized times, taking into
account the timestamp's timezone, which is why they're more
computationally intensive than raw floating point operations.

Which operation in particular did you benchmark? Is it part of a 
significant workload for you or did you just try it out of curiosity?

Regards

Antoine.




Le 12/04/2022 à 22:31, Li Jin a écrit :
> Thanks David!
> 
> I am not yet familiar with the implementation of this kernel so I am
> hoping someone more familiar with kernels can shed some light on this. I
> wonder if this is kind of expected performance (comparing to similar kernel
> perf) or maybe something with the RoundTemporal implementation seems off?
> 
> Steven (who ran the test) computed around 500 CPU cycles / value which
> seems more than what is needed but I am not an expert on the kernels so
> want to hear more thoughts from the dev.
> 
> Li
> 
> On Tue, Apr 12, 2022 at 4:19 PM David Li <li...@apache.org> wrote:
> 
>> While we do track benchmarks for each commit on Conbench [1] it seems we
>> lack benchmarks for the temporal operations. I filed ARROW-16173 [2].
>>
>> They do do a bit more work than just a round (especially if they need to
>> handle time zones).
>>
>> [1]: https://conbench.ursa.dev/
>> [2]: https://issues.apache.org/jira/browse/ARROW-16173
>>
>> -David
>>
>> On Tue, Apr 12, 2022, at 15:40, Li Jin wrote:
>>> Sorry I should have mentioned this is the Arrow C++ compute kernels.
>>>
>>> On Tue, Apr 12, 2022 at 3:39 PM Li Jin <ic...@gmail.com> wrote:
>>>
>>>> Hello!
>>>>
>>>> We recently noticed unexpected performance with Arrow's temporal
>>>> operation kernels (in particular, CeilTemporal). The perf we see are
>> around
>>>> 1.4-1.8 Gb / s. This seems to be much lower than adding a constant to a
>>>> float column (~9Gb/s). This is a bit unexpected because CeilTemporal is
>>>> similar to a numeric round operation so we are wondering if there are
>> some
>>>> benchmarks around this and where the issue might be?
>>>>
>>>> Thanks!
>>>> Li
>>>>
>>
> 

Re: Perf/Benchmark for temporal operations

Posted by Rok Mihevc <ro...@gmail.com>.
Hi Li,

I've implemented most of the temporal rounding logic. The kernels have not
really been optimized at all yet, as they are pretty new and not completely
finished (ambiguous behaviour due to DST [1], rounding origin point [2],
etc.). Most of the effort so far went into building test sets and getting
the right results. Given that, I'm actually positively surprised by the
1:5 ratio compared to float addition.

David's proposal for benchmarking is a great starting point; I'll look
into it next. An easy optimization right now would be better templating [3],
and perhaps simplified rounding for sub-hour units.

[1] https://github.com/apache/arrow/pull/12528
[2] https://github.com/apache/arrow/pull/12657
[3] https://issues.apache.org/jira/browse/ARROW-15787

Rok

On Tue, Apr 12, 2022 at 10:32 PM Li Jin <ic...@gmail.com> wrote:
>
> Thanks David!
>
> I am not yet familiar with the implementation of this kernel so I am
> hoping someone more familiar with kernels can shed some light on this. I
> wonder if this is kind of expected performance (comparing to similar kernel
> perf) or maybe something with the RoundTemporal implementation seems off?
>
> Steven (who ran the test) computed around 500 CPU cycles / value which
> seems more than what is needed but I am not an expert on the kernels so
> want to hear more thoughts from the dev.
>
> Li
>
> On Tue, Apr 12, 2022 at 4:19 PM David Li <li...@apache.org> wrote:
>
> > While we do track benchmarks for each commit on Conbench [1] it seems we
> > lack benchmarks for the temporal operations. I filed ARROW-16173 [2].
> >
> > They do do a bit more work than just a round (especially if they need to
> > handle time zones).
> >
> > [1]: https://conbench.ursa.dev/
> > [2]: https://issues.apache.org/jira/browse/ARROW-16173
> >
> > -David
> >
> > On Tue, Apr 12, 2022, at 15:40, Li Jin wrote:
> > > Sorry I should have mentioned this is the Arrow C++ compute kernels.
> > >
> > > On Tue, Apr 12, 2022 at 3:39 PM Li Jin <ic...@gmail.com> wrote:
> > >
> > >> Hello!
> > >>
> > >> We recently noticed unexpected performance with Arrow's temporal
> > >> operation kernels (in particular, CeilTemporal). The perf we see are
> > around
> > >> 1.4-1.8 Gb / s. This seems to be much lower than adding a constant to a
> > >> float column (~9Gb/s). This is a bit unexpected because CeilTemporal is
> > >> similar to a numeric round operation so we are wondering if there are
> > some
> > >> benchmarks around this and where the issue might be?
> > >>
> > >> Thanks!
> > >> Li
> > >>
> >

Re: Perf/Benchmark for temporal operations

Posted by Li Jin <ic...@gmail.com>.
Thanks David!

I am not yet familiar with the implementation of this kernel, so I am
hoping someone more familiar with the kernels can shed some light. I
wonder whether this is roughly the expected performance (compared to
similar kernels) or whether something in the RoundTemporal implementation
is off.

Steven (who ran the test) computed around 500 CPU cycles per value, which
seems like more than should be needed, but I am not an expert on the
kernels, so I want to hear more thoughts from the dev list.

Li

On Tue, Apr 12, 2022 at 4:19 PM David Li <li...@apache.org> wrote:

> While we do track benchmarks for each commit on Conbench [1] it seems we
> lack benchmarks for the temporal operations. I filed ARROW-16173 [2].
>
> They do do a bit more work than just a round (especially if they need to
> handle time zones).
>
> [1]: https://conbench.ursa.dev/
> [2]: https://issues.apache.org/jira/browse/ARROW-16173
>
> -David
>
> On Tue, Apr 12, 2022, at 15:40, Li Jin wrote:
> > Sorry I should have mentioned this is the Arrow C++ compute kernels.
> >
> > On Tue, Apr 12, 2022 at 3:39 PM Li Jin <ic...@gmail.com> wrote:
> >
> >> Hello!
> >>
> >> We recently noticed unexpected performance with Arrow's temporal
> >> operation kernels (in particular, CeilTemporal). The perf we see are
> around
> >> 1.4-1.8 Gb / s. This seems to be much lower than adding a constant to a
> >> float column (~9Gb/s). This is a bit unexpected because CeilTemporal is
> >> similar to a numeric round operation so we are wondering if there are
> some
> >> benchmarks around this and where the issue might be?
> >>
> >> Thanks!
> >> Li
> >>
>

Re: Perf/Benchmark for temporal operations

Posted by David Li <li...@apache.org>.
While we do track benchmarks for each commit on Conbench [1], it seems we lack benchmarks for the temporal operations. I filed ARROW-16173 [2].

They do do a bit more work than just a round (especially if they need to handle time zones).

[1]: https://conbench.ursa.dev/
[2]: https://issues.apache.org/jira/browse/ARROW-16173

-David

On Tue, Apr 12, 2022, at 15:40, Li Jin wrote:
> Sorry I should have mentioned this is the Arrow C++ compute kernels.
>
> On Tue, Apr 12, 2022 at 3:39 PM Li Jin <ic...@gmail.com> wrote:
>
>> Hello!
>>
>> We recently noticed unexpected performance with Arrow's temporal
>> operation kernels (in particular, CeilTemporal). The perf we see are around
>> 1.4-1.8 Gb / s. This seems to be much lower than adding a constant to a
>> float column (~9Gb/s). This is a bit unexpected because CeilTemporal is
>> similar to a numeric round operation so we are wondering if there are some
>> benchmarks around this and where the issue might be?
>>
>> Thanks!
>> Li
>>

Re: Perf/Benchmark for temporal operations

Posted by Li Jin <ic...@gmail.com>.
Sorry, I should have mentioned this is about the Arrow C++ compute kernels.

On Tue, Apr 12, 2022 at 3:39 PM Li Jin <ic...@gmail.com> wrote:

> Hello!
>
> We recently noticed unexpected performance with Arrow's temporal
> operation kernels (in particular, CeilTemporal). The perf we see are around
> 1.4-1.8 Gb / s. This seems to be much lower than adding a constant to a
> float column (~9Gb/s). This is a bit unexpected because CeilTemporal is
> similar to a numeric round operation so we are wondering if there are some
> benchmarks around this and where the issue might be?
>
> Thanks!
> Li
>