Posted to issues@calcite.apache.org by "neoremind (Jira)" <ji...@apache.org> on 2020/04/18 15:48:00 UTC
[jira] [Commented] (CALCITE-3873) Use global caching for ReflectiveVisitDispatcher implementation
[ https://issues.apache.org/jira/browse/CALCITE-3873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17086507#comment-17086507 ]
neoremind commented on CALCITE-3873:
------------------------------------
The following conclusion is copied from the GitHub PR.
I have added a new benchmark case as a baseline that disables Guava caching. The results are below (code is in [my branch|https://github.com/neoremind/calcite/blob/refvisit_benchmark/ubenchmark/src/jmh/java/org/apache/calcite/benchmarks/ReflectiveVisitDispatcherTest2.java]).
{noformat}
Benchmark                                                           (guavaCacheMaxSize)  Mode  Cnt    Score   Error  Units
ReflectiveVisitDispatcherTest102.testGlobalCachingUsingGuavaCache                     0  avgt    5  116.446 ± 8.957  ns/op
ReflectiveVisitDispatcherTest102.testGlobalCachingUsingGuavaCache                   128  avgt    5  133.678 ± 8.251  ns/op
ReflectiveVisitDispatcherTest102.testGlobalCachingUsingHashMap                        0  avgt    5   73.213 ± 7.614  ns/op
ReflectiveVisitDispatcherTest102.testInstanceCachingWithReflection                    0  avgt    5  227.130 ± 9.033  ns/op
{noformat}
The table below shows which operations each solution performs per dispatch.
|| ||look up method||get from map||put into map||avg time cost||
|{{testGlobalCachingUsingGuavaCache}} with {{guavaCacheMaxSize = 0}} (cache disabled; method looked up on every call)|Y| | |116.446 ns/op|
|{{testGlobalCachingUsingGuavaCache}} with {{guavaCacheMaxSize = 128}} (global cache backed by Guava cache)| |Y| |133.678 ns/op|
|{{testGlobalCachingUsingHashMap}} (global cache backed by JDK ConcurrentHashMap)| |Y| |73.213 ns/op|
|{{testInstanceCachingWithReflection}} (original implementation, caching per instance)|Y|Y|Y|227.130 ns/op|
The conclusions we can draw:
# Guava cache overhead is noticeably higher than JDK ConcurrentHashMap's. However, ConcurrentHashMap is not size-bounded, which is why it was rejected in the earlier discussion.
# Guava cache overhead is even higher than looking up the method on every call. This is what surprised me.
# The original implementation is the slowest because it combines all three operations: get from the map, look up the method, then put it into the map. That works well within a single instance, but every Calcite invocation pays the method-lookup cost again, which is what the PR tries to improve.
The current master version does improve performance while balancing memory usage, though it is still slower than 1) caching in a JDK ConcurrentHashMap (which is not size-bounded), and 2) looking up the method directly on each call with no map operations.
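For illustration, here is a minimal, hypothetical sketch of the global-cache idea benchmarked above (not Calcite's actual code; the class and key format are made up). It uses an unbounded JDK {{ConcurrentHashMap}}; the bounded Guava variant would replace the map with {{CacheBuilder.newBuilder().maximumSize(128).build()}}:

```java
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * Hypothetical sketch of a process-wide method cache, the fastest
 * variant in the benchmark above. The reflective lookup runs at most
 * once per key; all later calls are a single map hit.
 */
public final class GlobalMethodCache {
  private static final ConcurrentMap<String, Method> CACHE =
      new ConcurrentHashMap<>();

  /** Looks up targetClass.methodName(argClasses) once, then serves hits from the map. */
  public static Method lookup(
      Class<?> targetClass, String methodName, Class<?>... argClasses) {
    String key = targetClass.getName() + '#' + methodName
        + Arrays.toString(argClasses);
    return CACHE.computeIfAbsent(key, k -> {
      try {
        return targetClass.getMethod(methodName, argClasses);
      } catch (NoSuchMethodException e) {
        throw new IllegalStateException(e);
      }
    });
  }

  public static void main(String[] args) {
    Method m1 = lookup(StringBuilder.class, "append", String.class);
    Method m2 = lookup(StringBuilder.class, "append", String.class);
    // true: computeIfAbsent stores and returns the same Method instance
    System.out.println(m1 == m2);
  }
}
```

Swapping the map for a bounded Guava cache adds eviction bookkeeping on every {{get}}, which is where the extra ~60 ns/op in the benchmark comes from.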
I have also benchmarked {{MethodHandle}} and {{LambdaMetafactory}}. The results are below.
{noformat}
Benchmark                                                                                Mode  Cnt      Score       Error  Units
ReflectiveVisitDispatcherTest.testGlobalCaching                                          avgt    3    137.270 ±    32.275  ns/op
ReflectiveVisitDispatcherTest.testInstanceCachingWithLambdaFactory                       avgt    3  92459.660 ± 56777.698  ns/op
ReflectiveVisitDispatcherTest.testInstanceCachingWithLambdaFactoryThreadLocalInitialize  avgt    3    199.687 ±    24.755  ns/op
ReflectiveVisitDispatcherTest.testInstanceCachingWithMethodHandle                        avgt    3   2421.131 ±   633.756  ns/op
ReflectiveVisitDispatcherTest.testInstanceCachingWithMethodHandleThreadLocalInitialize   avgt    3    218.616 ±    41.754  ns/op
ReflectiveVisitDispatcherTest.testInstanceCachingWithReflection                          avgt    3    224.603 ±    22.770  ns/op
ReflectiveVisitDispatcherTest.testInstanceCachingWithReflectionThreadLocalInitialize     avgt    3    218.406 ±     9.620  ns/op
{noformat}
From the results, the setup work is expensive for both {{MethodHandle}} and {{LambdaMetafactory}} (which is built on MethodHandle). To eliminate the initialization cost, I used a ThreadLocal to initialize once per thread; the results confirm that, in terms of time cost, LambdaMetafactory < Reflection <= MethodHandle, but the gap is very small.
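To make the "initialize once, invoke many times" point concrete, here is a small hypothetical sketch (the names are mine, not the benchmark's). The expensive step is resolving the {{MethodHandle}} via {{findVirtual}}; once resolved (here in a static initializer, in the benchmark via a ThreadLocal), each invocation is cheap:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

/**
 * Hypothetical sketch: resolve a MethodHandle for String#length once,
 * then reuse it. Resolution dominates the cost; invocation of a
 * pre-resolved handle is comparable to plain reflection.
 */
public final class HandleReuse {
  private static final MethodHandle LENGTH;

  static {
    try {
      // Expensive part: done exactly once per process here.
      LENGTH = MethodHandles.lookup()
          .findVirtual(String.class, "length", MethodType.methodType(int.class));
    } catch (ReflectiveOperationException e) {
      throw new ExceptionInInitializerError(e);
    }
  }

  /** Cheap part: invoking the pre-resolved handle. */
  public static int length(String s) {
    try {
      return (int) LENGTH.invokeExact(s);
    } catch (Throwable t) {
      throw new RuntimeException(t);
    }
  }
}
```

The benchmark's {{*ThreadLocalInitialize}} variants amortize the same resolution cost per thread instead of per call, which is why they fall back into the ~200 ns/op range.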
Given that, sticking with the current version looks like a good option. I hope this analysis and work help anyone interested in the topic.
Many thanks to Vladimir, Haisheng, Danny, Chunwei, and Stamatis for kindly taking the time to review my code :)
> Use global caching for ReflectiveVisitDispatcher implementation
> ---------------------------------------------------------------
>
> Key: CALCITE-3873
> URL: https://issues.apache.org/jira/browse/CALCITE-3873
> Project: Calcite
> Issue Type: Improvement
> Components: core
> Affects Versions: 1.22.0
> Reporter: neoremind
> Priority: Minor
> Labels: pull-request-available
> Attachments: ReflectVisitorDispatcherTest.java, jmh_result.txt, pic1.svg, pic2.svg
>
> Time Spent: 10h
> Remaining Estimate: 0h
>
> Out of curiosity, I used a flame graph to profile a simple query. The code snippet is below.
> {code:java}
> String sql = "select empno, gender, name from EMPS where name = 'John'";
> Connection connection = null;
> Statement statement = null;
> try {
>   Properties info = new Properties();
>   info.put("model", jsonPath("smart"));
>   connection = DriverManager.getConnection("jdbc:calcite:", info);
>   String x = null;
>   for (int i = 0; i < 50000; i++) {
>     statement = connection.createStatement();
>     final ResultSet resultSet = statement.executeQuery(sql);
>     while (resultSet.next()) {
>       x = resultSet.getInt(1)
>           + resultSet.getString(2)
>           + resultSet.getString(3);
>     }
>   }
> } catch (SQLException e) {
>   e.printStackTrace();
> } finally {
>   close(connection, statement);
> }
> {code}
>
> I attach the generated flame graph [^pic1.svg]. A rough breakdown of where the time goes:
> {noformat}
> 3% on sql2rel
> 9% on query optimizing
> 62% on code gen and implementation
> 20% on result set iterating and checking
> …
> {noformat}
> Hope this graph is informative. Since I only started learning Calcite recently, I cannot tell where to begin tuning, but one tiny point in the graph caught my attention: there are many reflection invocations in _Prepare#trimUnusedFields_. So I spent some time trying to mitigate that small overhead.
> I optimized _ReflectiveVisitDispatcher_ by introducing a global, size-limited _Guava_ cache for methods, and I added full unit tests for _ReflectUtil_.
> I counted the references to _ReflectUtil#createMethodDispatcher_ and _ReflectUtil#createDispatcher_ (see below): 68 possible invocations in total, so the cache size is naturally bounded. By caching all the methods for the lifetime of the process, we can eliminate the overhead of reflective method lookup.
> {code:java}
> org.apache.calcite.rel.rel2sql.RelToSqlConverter: 18 possible invocations.
> org.apache.calcite.sql2rel.RelDecorrelator: 15 possible invocations.
> org.apache.calcite.sql2rel.RelFieldTrimmer: 11 possible invocations.
> org.apache.calcite.sql2rel.RelStructuredTypeFlattener.RewriteRelVisitor: 22 possible invocations.
> org.apache.calcite.interpreter.Interpreter.CompilerImpl: 2 possible invocations.
> {code}
> Before this change, caching was per _ReflectiveVisitDispatcher_ instance; with global caching, different _ReflectiveVisitDispatcher_ instances, even in different threads, can reuse the cached methods.
> See [^pic2.svg]: after tuning, _trimUnusedFields_ takes only 0.64% of the sampling time, compared with 1.38% previously. I think this will help in many more places.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)