Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2021/11/09 16:14:00 UTC
[jira] [Commented] (IMPALA-10984) Improve performance of FROM_UNIXTIME function.
[ https://issues.apache.org/jira/browse/IMPALA-10984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441260#comment-17441260 ]
ASF subversion and git services commented on IMPALA-10984:
----------------------------------------------------------
Commit df42225f5c8f65a66192b47612d17259d1b2dc6c in impala's branch refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=df42225 ]
IMPALA-10984: Improve TimestampValue to String casting
TimestampValue::ToString was implemented by concatenating the results of
boost::gregorian::to_iso_extended_string and
boost::posix_time::to_simple_string using a stringstream. This involves
multiple string allocations and copies, and might contend on the lock
within tcmalloc::CentralFreeList. FROM_UNIXTIME and CAST expressions that
touch this function can be inefficient if the expression is evaluated
for millions of rows.
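The allocation pattern described above can be illustrated with a minimal
sketch. It is not the Impala code: DateToIsoString and TimeToSimpleString
are hypothetical stand-ins for the two boost functions, and SlowToString
is a hypothetical name for the concatenating caller.

```cpp
#include <cstdio>
#include <sstream>
#include <string>

// Hypothetical stand-in for boost::gregorian::to_iso_extended_string.
// Returns a freshly constructed std::string on every call.
std::string DateToIsoString(int year, int month, int day) {
  char buf[40];
  std::snprintf(buf, sizeof(buf), "%04d-%02d-%02d", year, month, day);
  return std::string(buf);
}

// Hypothetical stand-in for boost::posix_time::to_simple_string.
std::string TimeToSimpleString(int hour, int minute, int second) {
  char buf[40];
  std::snprintf(buf, sizeof(buf), "%02d:%02d:%02d", hour, minute, second);
  return std::string(buf);
}

// Concatenates the two pieces through a stringstream. Each call builds two
// temporary strings, fills the stream's internal buffer, and copies the
// result out again in str(), so several allocations may occur per row.
std::string SlowToString(int year, int month, int day,
                         int hour, int minute, int second) {
  std::stringstream ss;
  ss << DateToIsoString(year, month, day) << " "
     << TimeToSimpleString(hour, minute, second);
  return ss.str();
}
```

Calling SlowToString(2012, 1, 1, 9, 10, 11) yields "2012-01-01 09:10:11".
Whether each temporary actually reaches the allocator depends on the
standard library's string implementation.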
This patch adds the method TimestampValue::ToStringVal and reimplements
TimestampValue::ToString by supplying a default DateTimeFormatContext if
no pattern is specified. "yyyy-MM-dd HH:mm:ss" is picked as the
default format if the time_ component does not have fractional seconds.
Otherwise, "yyyy-MM-dd HH:mm:ss.SSSSSSSSS" is picked as the default
format. The chosen DateTimeFormatContext is then passed to
TimestampParser::Format along with date_ and time_ to be formatted into
the string representation. The int-to-string conversion in
TimestampParser::Format is replaced with FastInt32ToBufferLeft.
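As a rough illustration of the idea, and not the actual patch, the sketch
below picks between the two default layouts based on the fractional
seconds and writes digits directly into a fixed-size buffer.
SimpleTimestamp, AppendPadded, FormatTimestamp and kMaxFormattedLen are
hypothetical names; AppendPadded merely stands in for a fast integer
writer and is not FastInt32ToBufferLeft.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical simplified timestamp; Impala's TimestampValue stores its
// date_ and time_ components differently.
struct SimpleTimestamp {
  int year, month, day;
  int hour, minute, second;
  int32_t nanos;  // fractional seconds, assumed to be in 0..999999999
};

// Length of "yyyy-MM-dd HH:mm:ss.SSSSSSSSS".
const std::size_t kMaxFormattedLen = 29;

// Writes `value` as exactly `width` zero-padded decimal digits starting at
// `out` and returns the position just past the last digit written.
char* AppendPadded(char* out, uint32_t value, int width) {
  for (int i = width - 1; i >= 0; --i) {
    out[i] = static_cast<char>('0' + value % 10);
    value /= 10;
  }
  return out + width;
}

// Formats `ts` into `buf` (at least kMaxFormattedLen bytes) and returns the
// number of characters written. No heap allocation happens here.
std::size_t FormatTimestamp(const SimpleTimestamp& ts, char* buf) {
  char* p = buf;
  p = AppendPadded(p, static_cast<uint32_t>(ts.year), 4);   *p++ = '-';
  p = AppendPadded(p, static_cast<uint32_t>(ts.month), 2);  *p++ = '-';
  p = AppendPadded(p, static_cast<uint32_t>(ts.day), 2);    *p++ = ' ';
  p = AppendPadded(p, static_cast<uint32_t>(ts.hour), 2);   *p++ = ':';
  p = AppendPadded(p, static_cast<uint32_t>(ts.minute), 2); *p++ = ':';
  p = AppendPadded(p, static_cast<uint32_t>(ts.second), 2);
  if (ts.nanos != 0) {
    // Fractional seconds present: use the longer default layout.
    *p++ = '.';
    p = AppendPadded(p, static_cast<uint32_t>(ts.nanos), 9);
  }
  return static_cast<std::size_t>(p - buf);
}

// Convenience wrapper: one stack buffer, one string construction.
std::string ToString(const SimpleTimestamp& ts) {
  char buf[kMaxFormattedLen];
  std::size_t len = FormatTimestamp(ts, buf);
  return std::string(buf, len);
}
```

With nanos set to 0 the result has the 19-character
"yyyy-MM-dd HH:mm:ss" shape; any non-zero nanos value produces the
29-character form with nine fractional digits.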
We ran a set of expression benchmarks on a machine with an Intel(R)
Core(TM) i7-4790 CPU @ 3.60GHz. This patch gives a > 10X performance
improvement for CAST timestamp to string and for FROM_UNIXTIME without a
date-time pattern. The detailed results before and after the patch
follow.
Before the patch:
FromUnixCodegen: Function                  10%ile  50%ile  90%ile      10%ile      50%ile      90%ile
                                                                   (relative)  (relative)  (relative)
-----------------------------------------------------------------------------------------------------
literal                                      36.7      37    37.3          1X          1X          1X
cast(now() as string)                        2.31    2.31    2.33     0.0628X     0.0623X     0.0626X
cast(now() as string format 'Y .SSSSS')      16.9    17.5    17.5      0.459X      0.472X      0.471X
from_unixtime(0,'yyyy-MM-dd HH:mm:ss')        6.3     6.3    6.37      0.171X       0.17X      0.171X
from_unixtime(0,'yyyy-MM-dd')                11.8    11.8      12       0.32X       0.32X      0.322X
from_unixtime(0)                             2.36     2.4     2.4     0.0644X     0.0648X     0.0644X
After the patch:
FromUnixCodegen: Function                  10%ile  50%ile  90%ile      10%ile      50%ile      90%ile
                                                                   (relative)  (relative)  (relative)
-----------------------------------------------------------------------------------------------------
literal                                      37.7    38.1    38.4          1X          1X          1X
cast(now() as string)                        29.9    30.1    30.2      0.794X       0.79X      0.787X
cast(now() as string format 'Y .SSSSS')      61.1    61.3    61.6       1.62X       1.61X       1.61X
from_unixtime(0,'yyyy-MM-dd HH:mm:ss')       33.6    33.8    34.2      0.892X      0.887X      0.892X
from_unixtime(0,'yyyy-MM-dd')                50.5    50.6    50.9       1.34X       1.33X       1.33X
from_unixtime(0)                               34    34.2    34.5      0.902X      0.896X      0.898X
The literal expression used as the baseline in this benchmark is
"cast('2012-01-01 09:10:11.123456789' as timestamp)".
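The "> 10X" figure can be checked against the 50%ile columns of the two
tables above. The sketch below copies those values and divides after by
before. It assumes that higher values are better, which the relative
columns suggest; Row, kRows and Speedup are hypothetical names used only
for this illustration.

```cpp
#include <cstddef>

// One benchmark row: the 50%ile value before and after the patch, copied
// from the tables above. Higher is assumed to be better.
struct Row {
  const char* function;
  double before;
  double after;
};

const Row kRows[] = {
    {"cast(now() as string)", 2.31, 30.1},
    {"cast(now() as string format 'Y .SSSSS')", 17.5, 61.3},
    {"from_unixtime(0,'yyyy-MM-dd HH:mm:ss')", 6.3, 33.8},
    {"from_unixtime(0,'yyyy-MM-dd')", 11.8, 50.6},
    {"from_unixtime(0)", 2.4, 34.2},
};
const std::size_t kNumRows = sizeof(kRows) / sizeof(kRows[0]);

// Speedup factor of a row: the value after the patch divided by the value
// before it.
double Speedup(const Row& row) { return row.after / row.before; }
```

This gives roughly 13.0X for cast(now() as string) and 14.25X for
from_unixtime(0), the two cases without a date-time pattern. The
expressions with an explicit pattern improve by roughly 3.5X to 5.4X. The
literal baseline also moved slightly between the two runs (37 to 38.1), so
these ratios are approximate.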
This patch also updates the numbers in expr-benchmark for
BenchmarkTimestampFunctions and tidies up expr-benchmark a bit by
clearing its MemPool between benchmark iterations so that it does not
run out of memory.
Testing:
- Pass core tests.
Change-Id: I4fcb4545d9c9a3fdb38c4db58bb4b1321a429d61
Reviewed-on: http://gerrit.cloudera.org:8080/17980
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Csaba Ringhofer <cs...@cloudera.com>
Reviewed-by: Csaba Ringhofer <cs...@cloudera.com>
> Improve performance of FROM_UNIXTIME function.
> ----------------------------------------------
>
> Key: IMPALA-10984
> URL: https://issues.apache.org/jira/browse/IMPALA-10984
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Affects Versions: Impala 4.0.0
> Reporter: Riza Suminto
> Assignee: Riza Suminto
> Priority: Major
>
> FROM_UNIXTIME function is implemented by calling TimestampValue::ToString() in TimestampFunctions::FromUnix().
> We found that evaluation of TimestampValue::ToString() can get stuck on the tcmalloc::CentralFreeList lock, as shown in this pstack output:
>
> {code:java}
> #0 0x000000000277d81a in base::internal::SpinLockDelay(int volatile*, int, int) ()
> #1 0x00000000027d17f9 in SpinLock::SlowLock() ()
> #2 0x000000000287a399 in tcmalloc::CentralFreeList::RemoveRange(void**, void**, int) ()
> #3 0x00000000028882f3 in tcmalloc::ThreadCache::FetchFromCentralCache(unsigned long, unsigned long) ()
> #4 0x00000000029c5e88 in tc_newarray ()
> #5 0x00007faedc677169 in std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) () from /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p4948.16676264/lib/impala/lib/libstdc++.so.6
> #6 0x0000000000f769de in impala::TimestampValue::ToString() const ()
> #7 0x00007faeb317e08e in ?? ()
> #8 0x00007fad62af6068 in ?? ()
> #9 0x00007faedc8c20c0 in ?? () from /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p4948.16676264/lib/impala/lib/libstdc++.so.6
> #10 0x0000000000000000 in ?? ()
> {code}
>
> This is presumably due to the combined use of stringstream, boost::gregorian::to_iso_extended_string, and boost::posix_time::to_simple_string, which involves multiple string allocations and copies.
> This can be problematic when FROM_UNIXTIME is evaluated for millions of rows.
> We should come up with a better implementation that involves less string allocation and copying.