Posted to issues@calcite.apache.org by "neoremind (Jira)" <ji...@apache.org> on 2020/04/18 15:48:00 UTC
[jira] [Commented] (CALCITE-3873) Use global caching for ReflectiveVisitDispatcher implementation
[ https://issues.apache.org/jira/browse/CALCITE-3873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17086507#comment-17086507 ]
neoremind commented on CALCITE-3873:
------------------------------------
The following conclusion is copied from the GitHub PR.
I have added a new benchmark case as a baseline that disables Guava caching. The results are below (code is in [my branch|https://github.com/neoremind/calcite/blob/refvisit_benchmark/ubenchmark/src/jmh/java/org/apache/calcite/benchmarks/ReflectiveVisitDispatcherTest2.java]).
{noformat}
Benchmark                                                           (guavaCacheMaxSize)  Mode  Cnt    Score   Error  Units
ReflectiveVisitDispatcherTest102.testGlobalCachingUsingGuavaCache                     0  avgt    5  116.446 ± 8.957  ns/op
ReflectiveVisitDispatcherTest102.testGlobalCachingUsingGuavaCache                   128  avgt    5  133.678 ± 8.251  ns/op
ReflectiveVisitDispatcherTest102.testGlobalCachingUsingHashMap                        0  avgt    5   73.213 ± 7.614  ns/op
ReflectiveVisitDispatcherTest102.testInstanceCachingWithReflection                    0  avgt    5  227.130 ± 9.033  ns/op
{noformat}
The table below shows which operations each solution performs per dispatch.
|| ||look up method||get from map||put into map||avg time cost||
|{{testGlobalCachingUsingGuavaCache}} with {{guavaCacheMaxSize = 0}} (cache disabled; method looked up on every call)|Y| | |116.446 ns/op|
|{{testGlobalCachingUsingGuavaCache}} with {{guavaCacheMaxSize = 128}} (global cache backed by Guava cache)| |Y| |133.678 ns/op|
|{{testGlobalCachingUsingHashMap}} (global cache backed by JDK ConcurrentHashMap)| |Y| |73.213 ns/op|
|{{testInstanceCachingWithReflection}} (original implementation, caching per instance)|Y|Y|Y|227.130 ns/op|
The conclusions we can draw:
# Guava cache overhead is noticeably higher than JDK ConcurrentHashMap's. However, ConcurrentHashMap is not size-bounded, which is why it was rejected in the earlier discussion.
# Guava cache overhead is even higher than looking up the method on every call. This is what surprised me.
# The original implementation is the slowest because it combines all three operations: get from the map, look up the method, then put it into the map. That works well within a single instance, but every Calcite invocation pays the method-lookup cost again, which is what the PR tries to improve.
The current master version does improve performance while balancing memory usage, though it is still slower than 1) caching in a JDK ConcurrentHashMap (which is not size-bounded), and 2) looking up the method directly on each call with no map operations.
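For illustration, here is a minimal, hypothetical sketch of the global-cache idea benchmarked above (not Calcite's actual code; the class and key format are made up). It uses an unbounded JDK {{ConcurrentHashMap}}; the bounded Guava variant would replace the map with {{CacheBuilder.newBuilder().maximumSize(128).build()}}:

```java
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * Hypothetical sketch of a process-wide method cache, the fastest
 * variant in the benchmark above. The reflective lookup runs at most
 * once per key; all later calls are a single map hit.
 */
public final class GlobalMethodCache {
  private static final ConcurrentMap<String, Method> CACHE =
      new ConcurrentHashMap<>();

  /** Looks up targetClass.methodName(argClasses) once, then serves hits from the map. */
  public static Method lookup(
      Class<?> targetClass, String methodName, Class<?>... argClasses) {
    String key = targetClass.getName() + '#' + methodName
        + Arrays.toString(argClasses);
    return CACHE.computeIfAbsent(key, k -> {
      try {
        return targetClass.getMethod(methodName, argClasses);
      } catch (NoSuchMethodException e) {
        throw new IllegalStateException(e);
      }
    });
  }

  public static void main(String[] args) {
    Method m1 = lookup(StringBuilder.class, "append", String.class);
    Method m2 = lookup(StringBuilder.class, "append", String.class);
    // true: computeIfAbsent stores and returns the same Method instance
    System.out.println(m1 == m2);
  }
}
```

Swapping the map for a bounded Guava cache adds eviction bookkeeping on every {{get}}, which is where the extra ~60 ns/op in the benchmark comes from.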
I have also benchmarked {{MethodHandle}} and {{LambdaMetafactory}}. The results are below.
{noformat}
Benchmark                                                                                Mode  Cnt      Score       Error  Units
ReflectiveVisitDispatcherTest.testGlobalCaching                                          avgt    3    137.270 ±    32.275  ns/op
ReflectiveVisitDispatcherTest.testInstanceCachingWithLambdaFactory                       avgt    3  92459.660 ± 56777.698  ns/op
ReflectiveVisitDispatcherTest.testInstanceCachingWithLambdaFactoryThreadLocalInitialize  avgt    3    199.687 ±    24.755  ns/op
ReflectiveVisitDispatcherTest.testInstanceCachingWithMethodHandle                        avgt    3   2421.131 ±   633.756  ns/op
ReflectiveVisitDispatcherTest.testInstanceCachingWithMethodHandleThreadLocalInitialize   avgt    3    218.616 ±    41.754  ns/op
ReflectiveVisitDispatcherTest.testInstanceCachingWithReflection                          avgt    3    224.603 ±    22.770  ns/op
ReflectiveVisitDispatcherTest.testInstanceCachingWithReflectionThreadLocalInitialize     avgt    3    218.406 ±     9.620  ns/op
{noformat}
From the results, the setup work is expensive for both {{MethodHandle}} and {{LambdaMetafactory}} (which is built on MethodHandle). To eliminate the initialization cost, I used a ThreadLocal to initialize once per thread; the results confirm that, in terms of time cost, LambdaMetafactory < Reflection <= MethodHandle, but the gap is very small.
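To make the "initialize once, invoke many times" point concrete, here is a small hypothetical sketch (the names are mine, not the benchmark's). The expensive step is resolving the {{MethodHandle}} via {{findVirtual}}; once resolved (here in a static initializer, in the benchmark via a ThreadLocal), each invocation is cheap:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

/**
 * Hypothetical sketch: resolve a MethodHandle for String#length once,
 * then reuse it. Resolution dominates the cost; invocation of a
 * pre-resolved handle is comparable to plain reflection.
 */
public final class HandleReuse {
  private static final MethodHandle LENGTH;

  static {
    try {
      // Expensive part: done exactly once per process here.
      LENGTH = MethodHandles.lookup()
          .findVirtual(String.class, "length", MethodType.methodType(int.class));
    } catch (ReflectiveOperationException e) {
      throw new ExceptionInInitializerError(e);
    }
  }

  /** Cheap part: invoking the pre-resolved handle. */
  public static int length(String s) {
    try {
      return (int) LENGTH.invokeExact(s);
    } catch (Throwable t) {
      throw new RuntimeException(t);
    }
  }
}
```

The benchmark's {{*ThreadLocalInitialize}} variants amortize the same resolution cost per thread instead of per call, which is why they fall back into the ~200 ns/op range.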
Given that, sticking with the current version looks like a good option. I hope this analysis and work help anyone interested in the topic.
Many thanks to Vladimir, Haisheng, Danny, Chunwei, and Stamatis for kindly taking the time to review my code :)
> Use global caching for ReflectiveVisitDispatcher implementation
> ---------------------------------------------------------------
>
> Key: CALCITE-3873
> URL: https://issues.apache.org/jira/browse/CALCITE-3873
> Project: Calcite
> Issue Type: Improvement
> Components: core
> Affects Versions: 1.22.0
> Reporter: neoremind
> Priority: Minor
> Labels: pull-request-available
> Attachments: ReflectVisitorDispatcherTest.java, jmh_result.txt, pic1.svg, pic2.svg
>
> Time Spent: 10h
> Remaining Estimate: 0h
>
> Out of curiosity, I used a flame graph to profile a simple query. The code snippet is below.
> {code:java}
> String sql = "select empno, gender, name from EMPS where name = 'John'";
> Connection connection = null;
> Statement statement = null;
> try {
>   Properties info = new Properties();
>   info.put("model", jsonPath("smart"));
>   connection = DriverManager.getConnection("jdbc:calcite:", info);
>   String x = null;
>   for (int i = 0; i < 50000; i++) {
>     statement = connection.createStatement();
>     final ResultSet resultSet = statement.executeQuery(sql);
>     while (resultSet.next()) {
>       x = resultSet.getInt(1)
>           + resultSet.getString(2)
>           + resultSet.getString(3);
>     }
>   }
> } catch (SQLException e) {
>   e.printStackTrace();
> } finally {
>   close(connection, statement);
> }
> {code}
>
> I attach the generated flame graph [^pic1.svg]. A rough breakdown of where the time goes:
> {noformat}
> 3% on sql2rel
> 9% on query optimizing
> 62% on code gen and implementation
> 20% on result set iterating and checking
> …
> {noformat}
> Hope this graph is informative. Since I only started learning Calcite recently, I cannot tell where to begin tuning, but one tiny point in the graph caught my attention: there are many reflection invocations in _Prepare#trimUnusedFields_. So I spent some time trying to mitigate that small overhead.
> I optimized _ReflectiveVisitDispatcher_ by introducing a global, size-limited _Guava_ cache for methods, and I added full unit tests for _ReflectUtil_.
> I counted the references to _ReflectUtil#createMethodDispatcher_ and _ReflectUtil#createDispatcher_ (see below): 68 possible invocations in total, so the cache size is naturally bounded. By caching all the methods for the lifetime of the process, we can eliminate the overhead of reflective method lookup.
> {code:java}
> org.apache.calcite.rel.rel2sql.RelToSqlConverter: 18 possible invocations.
> org.apache.calcite.sql2rel.RelDecorrelator: 15 possible invocations.
> org.apache.calcite.sql2rel.RelFieldTrimmer: 11 possible invocations.
> org.apache.calcite.sql2rel.RelStructuredTypeFlattener.RewriteRelVisitor: 22 possible invocations.
> org.apache.calcite.interpreter.Interpreter.CompilerImpl: 2 possible invocations.
> {code}
> Before this change, caching was per _ReflectiveVisitDispatcher_ instance; with global caching, different _ReflectiveVisitDispatcher_ instances, even in different threads, can reuse the cached methods.
> See [^pic2.svg]: after tuning, _trimUnusedFields_ takes only 0.64% of the sampling time, compared with 1.38% previously. I think this will help in many more places.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)