You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@calcite.apache.org by Julian Hyde <jh...@apache.org> on 2016/01/19 02:12:49 UTC

Re-working the metadata framework

In https://issues.apache.org/jira/browse/CALCITE-794 <https://issues.apache.org/jira/browse/CALCITE-794> we added an extra parameter to each metadata call so that we could detect cyclic metadata calls, and potentially to cache results so that a given statistic is never computed more than once during a metadata call. But the overhead of making calls into the metadata framework is still very high. It shows up as a big fraction of the time spent optimizing complex queries. I am working on https://issues.apache.org/jira/browse/CALCITE-604 <https://issues.apache.org/jira/browse/CALCITE-604>, which aims to fix that.

I am working on 604 while the release is closing, and I thought some of you would be be interested to know where I am going.

We use reflection to make calls. This is necessary because the types of metadata (e.g. selectivity, row count, unique keys, predicates) are extensible, you can have multiple providers for each kind of metadata, each provider has different methods for various RelNode sub-types, and we want to be able to inherit handler methods (e.g. getUniqueKeys(Aggregate, boolean) handles getUniqueKeys(LogicalAggregate, boolean) because there is no handler method for LogicalAggregate.

Initially I thought we’d use MethodHandle, which is a lot faster than method invocation by reflection. MethodHandle.invoke has some flexibility based on the types of its arguments, but I realized we’d still have to dispatch to multiple underlying providers (e.g. the built-in provider and the Hive provider). And we have other inefficiencies such as calling UnboundMetadata.bind(RelNode, RelMetadataQuery) to create a short-lived object every single call.

So, now I am looking at using Janino to generate a dispatcher. Consider just one kind of metadata, UniqueKeys. We already have a “signature” interface:

public interface UniqueKeys extends Metadata {
  Set<ImmutableBitSet> getUniqueKeys(boolean ignoreNulls);
}

I have added a handler interface:

interface UniqueKeysHandler {
  Set<ImmutableBitSet> getUniqueKeys(RelNode r, RelMetadataQuery mq, boolean ignoreNulls);
}

Now, given a set of metadata providers and the set of all known RelNode sub-type, I can use Janino to generate a handler at run time:

class UniqueKeysHandlerImpl implements UniqueKeysHandlerImpl {
  final RelMdUniqueKeys provider0;
  final HiveUniqueKeys provider1;

  UniqueKeysHandlerImpl(RelMdUniqueKeys provider0, HiveUniqueKeys provider1) {
    this.provider0 = provider0;
    this.provider1 = provider1;
  }

  public Set<ImmutableBitSet> getUniqueKeys(RelNode r,
      RelMetadataQuery mq, boolean ignoreNulls) {
    switch (r.getClass().getName()) {
    case "org.apache.calcite.rel.logical.LogicalAggregate":
      return provider0.getUniqueKeys((Aggregate) r, mq, ignoreNulls);
    case "org.apache.calcite.rel.core.Aggregate":
      return provider0.getUniqueKeys(r, mq, ignoreNulls);
    case “ org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveAggregate":
      return provider1.getUniqueKeys(r, mq, ignoreNulls);
    default:
      throw NoHandler.INSTANCE;
    }
  }
}

The entry point in RelMetadataQuery changes from

public Set<ImmutableBitSet> getUniqueKeys(RelNode rel,
    boolean ignoreNulls) {
  final BuiltInMetadata.UniqueKeys metadata =
      rel.metadata(BuiltInMetadata.UniqueKeys.class, this);
  return metadata.getUniqueKeys(ignoreNulls);
}

to

public Set<ImmutableBitSet> getUniqueKeys(RelNode rel,
    boolean ignoreNulls) {
  for (;;) {
    try {
      return uniqueKeysHandler.getUniqueKeys(rel, this, ignoreNulls);
    } catch (NoHandler e) {
      uniqueKeysHandler = metadataProvider.revise(rel.getClass(),
          BuiltInMetadata.UniqueKeys.Handler.class);
    }
  }
}

The “NoHandler” exception occurs very rarely — only when a kind of RelNode is seen that hasn’t been seen before in this JVM instance — but gives the handler chance to regenerate itself.

The result is a very direct path from the caller (generally a RelOptRule) to the provider: two calls, and we don’t even need to box the arguments.

I don’t think there will be any API changes, but note that the metadata interfaces (eg. UniqueKeys) and RelNode.metadata(Class<M> metadataClass, RelMetadataQuery mq) are not used anymore.

Julian

Re: Re-working the metadata framework

Posted by Julian Hyde <jh...@apache.org>.

The fix has been available for review for a couple of weeks now.

Jacques, Vladimir (and anyone else),

Do you need more time to review? If not I’d like to check in.

Julian

> On Jan 22, 2016, at 2:51 PM, Julian Hyde <jh...@apache.org> wrote:
> 
> 
>> On Jan 22, 2016, at 1:52 PM, Vladimir Sitnikov <si...@gmail.com> wrote:
>> 
>> I want something like
>> 
>> @Setup
>> public void ...
>> 
>> @Benchmark
>> public Object columnUniqueness() {
>> return metadataProvider.getColumnUniqueness(whateverArgs);
>> }
>> 
> 
> RelMetadataTest pretty much does that. Copy it, add the @Benchmark annotations, and maybe move some code into setup methods.
> 
> 
>> Julian> it sounds like you're
>> Julian>talking about perfect hashing
>> 
>> I am.
>> FIY: switch(String) is compiled to switch(s.hashCode()) kind of thing.
>> Well, the requirements are slightly different, thus map might work
>> better for us.
> 
> I did consider using switch (r.getClass().getName()), except that janino doesn’t support it. But since we have all of the Class objects when we are compiling the code, we also know their identity hash codes (i.e. their addresses) and we could construct a perfect hash. We could replace 
> 
>  switch (relClasses.indexOf(r.getClass())
> 
> with
> 
>  switch (System.identityHashCode(r.getClass()) % 137)
> 
> (replacing 137 with some value chosen by an algorithm)
> 
> Julian

Re: Re-working the metadata framework

Posted by Julian Hyde <jh...@apache.org>.

> On Jan 22, 2016, at 1:52 PM, Vladimir Sitnikov <si...@gmail.com> wrote:
> 
> I want something like
> 
> @Setup
> public void ...
> 
> @Benchmark
> public Object columnUniqueness() {
>  return metadataProvider.getColumnUniqueness(whateverArgs);
> }
> 

RelMetadataTest pretty much does that. Copy it, add the @Benchmark annotations, and maybe move some code into setup methods.

> Julian> it sounds like you're
> Julian>talking about perfect hashing
> 
> I am.
> FIY: switch(String) is compiled to switch(s.hashCode()) kind of thing.
> Well, the requirements are slightly different, thus map might work
> better for us.

I did consider using switch (r.getClass().getName()), except that janino doesn’t support it. But since we have all of the Class objects when we are compiling the code, we also know their identity hash codes (i.e. their addresses) and we could construct a perfect hash. We could replace 

  switch (relClasses.indexOf(r.getClass())

with

  switch (System.identityHashCode(r.getClass()) % 137)

(replacing 137 with some value chosen by an algorithm)

Julian

Re: Re-working the metadata framework

Posted by Vladimir Sitnikov <si...@gmail.com>.

Julian>I'm struggling to find an end-to-end case

I mean metadataQuery-specific tests here, not end-to-end ones.

In other words, I want something like

@Setup
public void ...

@Benchmark
public Object columnUniqueness() {
  return metadataProvider.getColumnUniqueness(whateverArgs);
}

As far as I understand, we can have that kind of benchmark to evaluate
the performance of metadata query, we can know how much memory a call
allocates, etc, etc.
As soon as we have a benchmark, we can actually measure different
implementations (indexOf vs Map vs switch).

I assume we can somehow setup the whole thing, so metadata query is a
finite number of calls. Can we?

Julian> it sounds like you're
Julian>talking about perfect hashing

I am.
FIY: switch(String) is compiled to switch(s.hashCode()) kind of thing.
Well, the requirements are slightly different, thus map might work
better for us.

Vladimir

Re: Re-working the metadata framework

Posted by Jacques Nadeau <ja...@apache.org>.

Thanks for sharing your thoughts and goals. I'm very engaged in this but
would like some time to review the code and really think through the
problem. Can we keep this open for a while? I also think that as Vladimir
asked, we should come up with some concrete ways to evaluate the impact of
various techniques.

On Thu, Jan 21, 2016 at 4:56 PM, Julian Hyde <jh...@apache.org> wrote:

> I can’t see any requirements that we can drop. Here are the requirements:
>
> * We need multiple providers (e.g. built-in and Drill),
> * It’s not essential but it’s very convenient to find providers by looking
> for methods that are like the implemented signature except that the first
> argument can be a sub-class
> * We need to detect cycles and cache previously computed results
> * We need to be able to flush the cache
> * We need to allow kinds of metadata to be defined at run time (although
> it’s certainly more convenient if we can add an accessor method to
> RelMetadataQuery)
> * We need to be able to create composite providers.
> * We need to be able to add providers at run time (and indeed use
> different combinations of providers during the planning of a single query).
> * We can’t assume that we know ahead of time the full set of RelNode
> sub-classes for which we will ask for metadata.
> * We need to be able to unwind or see through HepRelVertex or RelSubset
> nodes that are introduced by the Hep and Volcano planning engines.
> * For RelSubset we need to combine the constituent rels’ metadata (e.g.
> take the min of the MaxRowCount, or the union of the inferred predicates).
>
> I’ve dropped the requirement that you create an instance of the metadata
> class by “binding” a RelNode and a RelMetadataQuery. Now for each metadata
> interface you define a handler interface that has an extra 2 parameters. So
> we no longer need to create temporary “bound” objects.
>
> Given all those requirements we have a complex dispatch problem.
> Reflection solved it but it was difficult to understand AND inefficient.
> Code generation is difficult to understand but yields code that is easy to
> understand (if you turn on -Dcalcite.debug) and is efficient.
>
> My dev branch is complete and ready for review. Take a look at
> https://issues.apache.org/jira/browse/CALCITE-604 for an example of the
> generated code; I think it clarifies things a lot.
>
> Julian
>
>
>
> > On Jan 18, 2016, at 9:31 PM, Jacques Nadeau <ja...@apache.org> wrote:
> >
> > I read your email.  My thought was that we should evaluate the real
> > requirements. Maybe we need to shed some capabilities to achieve better
> > performance. Can you break down the issue into a set of example patterns?
> > In general, it seems like we may be trying to be too flexible. I
> understand
> > the desire to maintain as much backwards compatibility as possible but
> > think we may want to come up with a simpler metaphor. My goal is to come
> up
> > with the right design rather than trying to optimize what might be the
> > wrong design. Maybe you've done all the thinking and believe the existing
> > metaphors are exactly right but I think we should get consensus around
> this
> > the requirements before jumping to implementation approaches such as code
> > generation.
> > On Jan 18, 2016 5:40 PM, "Julian Hyde" <jh...@apache.org> wrote:
> >
> >> I describe in paragraph 3 why we CURRENTLY use reflection. In short:
> very
> >> complex dispatch requirements. The rest of the message is how I plan to
> >> phase it out. Plain java only has one dispatch mechanism (virtual
> methods)
> >> so isn’t going to cut it.
> >>
> >>> On Jan 18, 2016, at 5:23 PM, Jacques Nadeau <ja...@apache.org>
> wrote:
> >>>
> >>> Can you go into more detail to why reflection is needed? It seems like
> we
> >>> could get away from reflection by sharing interfaces, etc.
> >>>
> >>> On Mon, Jan 18, 2016 at 5:12 PM, Julian Hyde <jh...@apache.org> wrote:
> >>>
> >>>> In https://issues.apache.org/jira/browse/CALCITE-794 <
> >>>> https://issues.apache.org/jira/browse/CALCITE-794> we added an extra
> >>>> parameter to each metadata call so that we could detect cyclic
> metadata
> >>>> calls, and potentially to cache results so that a given statistic is
> >> never
> >>>> computed more than once during a metadata call. But the overhead of
> >> making
> >>>> calls into the metadata framework is still very high. It shows up as a
> >> big
> >>>> fraction of the time spent optimizing complex queries. I am working on
> >>>> https://issues.apache.org/jira/browse/CALCITE-604 <
> >>>> https://issues.apache.org/jira/browse/CALCITE-604>, which aims to fix
> >>>> that.
> >>>>
> >>>> I am working on 604 while the release is closing, and I thought some
> of
> >>>> you would be be interested to know where I am going.
> >>>>
> >>>> We use reflection to make calls. This is necessary because the types
> of
> >>>> metadata (e.g. selectivity, row count, unique keys, predicates) are
> >>>> extensible, you can have multiple providers for each kind of metadata,
> >> each
> >>>> provider has different methods for various RelNode sub-types, and we
> >> want
> >>>> to be able to inherit handler methods (e.g. getUniqueKeys(Aggregate,
> >>>> boolean) handles getUniqueKeys(LogicalAggregate, boolean) because
> there
> >> is
> >>>> no handler method for LogicalAggregate.
> >>>>
> >>>> Initially I thought we’d use MethodHandle, which is a lot faster than
> >>>> method invocation by reflection. MethodHandle.invoke has some
> >> flexibility
> >>>> based on the types of its arguments, but I realized we’d still have to
> >>>> dispatch to multiple underlying providers (e.g. the built-in provider
> >> and
> >>>> the Hive provider). And we have other inefficiencies such as calling
> >>>> UnboundMetadata.bind(RelNode, RelMetadataQuery) to create a
> short-lived
> >>>> object every single call.
> >>>>
> >>>> So, now I am looking at using Janino to generate a dispatcher.
> Consider
> >>>> just one kind of metadata, UniqueKeys. We already have a “signature”
> >>>> interface:
> >>>>
> >>>> public interface UniqueKeys extends Metadata {
> >>>> Set<ImmutableBitSet> getUniqueKeys(boolean ignoreNulls);
> >>>> }
> >>>>
> >>>> I have added a handler interface:
> >>>>
> >>>> interface UniqueKeysHandler {
> >>>> Set<ImmutableBitSet> getUniqueKeys(RelNode r, RelMetadataQuery mq,
> >>>> boolean ignoreNulls);
> >>>> }
> >>>>
> >>>> Now, given a set of metadata providers and the set of all known
> RelNode
> >>>> sub-type, I can use Janino to generate a handler at run time:
> >>>>
> >>>> class UniqueKeysHandlerImpl implements UniqueKeysHandlerImpl {
> >>>> final RelMdUniqueKeys provider0;
> >>>> final HiveUniqueKeys provider1;
> >>>>
> >>>> UniqueKeysHandlerImpl(RelMdUniqueKeys provider0, HiveUniqueKeys
> >>>> provider1) {
> >>>>   this.provider0 = provider0;
> >>>>   this.provider1 = provider1;
> >>>> }
> >>>>
> >>>> public Set<ImmutableBitSet> getUniqueKeys(RelNode r,
> >>>>     RelMetadataQuery mq, boolean ignoreNulls) {
> >>>>   switch (r.getClass().getName()) {
> >>>>   case "org.apache.calcite.rel.logical.LogicalAggregate":
> >>>>     return provider0.getUniqueKeys((Aggregate) r, mq, ignoreNulls);
> >>>>   case "org.apache.calcite.rel.core.Aggregate":
> >>>>     return provider0.getUniqueKeys(r, mq, ignoreNulls);
> >>>>   case “
> >>>>
> org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveAggregate":
> >>>>     return provider1.getUniqueKeys(r, mq, ignoreNulls);
> >>>>   default:
> >>>>     throw NoHandler.INSTANCE;
> >>>>   }
> >>>> }
> >>>> }
> >>>>
> >>>> The entry point in RelMetadataQuery changes from
> >>>>
> >>>> public Set<ImmutableBitSet> getUniqueKeys(RelNode rel,
> >>>>   boolean ignoreNulls) {
> >>>> final BuiltInMetadata.UniqueKeys metadata =
> >>>>     rel.metadata(BuiltInMetadata.UniqueKeys.class, this);
> >>>> return metadata.getUniqueKeys(ignoreNulls);
> >>>> }
> >>>>
> >>>> to
> >>>>
> >>>> public Set<ImmutableBitSet> getUniqueKeys(RelNode rel,
> >>>>   boolean ignoreNulls) {
> >>>> for (;;) {
> >>>>   try {
> >>>>     return uniqueKeysHandler.getUniqueKeys(rel, this, ignoreNulls);
> >>>>   } catch (NoHandler e) {
> >>>>     uniqueKeysHandler = metadataProvider.revise(rel.getClass(),
> >>>>         BuiltInMetadata.UniqueKeys.Handler.class);
> >>>>   }
> >>>> }
> >>>> }
> >>>>
> >>>> The “NoHandler” exception occurs very rarely — only when a kind of
> >> RelNode
> >>>> is seen that hasn’t been seen before in this JVM instance — but gives
> >> the
> >>>> handler chance to regenerate itself.
> >>>>
> >>>> The result is a very direct path from the caller (generally a
> >> RelOptRule)
> >>>> to the provider: two calls, and we don’t even need to box the
> arguments.
> >>>>
> >>>> I don’t think there will be any API changes, but note that the
> metadata
> >>>> interfaces (eg. UniqueKeys) and RelNode.metadata(Class<M>
> metadataClass,
> >>>> RelMetadataQuery mq) are not used anymore.
> >>>>
> >>>> Julian
> >>>>
> >>>>
> >>
> >>
>
>

Re: Re-working the metadata framework

Posted by Julian Hyde <jh...@apache.org>.

> Do you have test code at hand? (I mean the boilerplate code to prepare
> the environment for "getXXX" metadata call)

The easiest thing is to run RelMetadataTest; run with
"-Dcalcite.debug" so see the code generated. Add extra "mq.xxx" calls
and you'll be calling into the generated handler.

I'm struggling to find an end-to-end case (i.e. one that parses,
optimizes and executes a SQL statement) that demonstrates a clear
performance difference. One test is JdbcTest.testJoinFiveWay. The new
metadata seems to improve runs from 17s to 15s seconds.

By the way, that test makes 200k metadata calls. Here's how I found out:

diff --git a/core/src/main/java/org/apache/calcite/rel/metadata/JaninoRelMetadataProvider.java
b/core/src/main/java/org/apache/calcite/rel/metadata/JaninoRelMetadataProvider.java
index 8e246ee..3227436 100644
--- a/core/src/main/java/org/apache/calcite/rel/metadata/JaninoRelMetadataProvider.java
+++ b/core/src/main/java/org/apache/calcite/rel/metadata/JaninoRelMetadataProvider.java
@@ -91,6 +91,8 @@
  * a class that dispatches to the underlying providers.
  */
 public class JaninoRelMetadataProvider implements RelMetadataProvider {
+  public static int N;
+
   private final RelMetadataProvider provider;

   // Constants and static fields
@@ -242,6 +244,9 @@ public static JaninoRelMetadataProvider
of(RelMetadataProvider provider) {
           .append("      org.apache.calcite.rel.metadata.RelMetadataQuery mq");
       paramList(buff, method.e)
           .append(") {\n");
+      buff.append("    System.out.println(\"").append(method.e.getName())
+          .append(" \" + (").append(JaninoRelMetadataProvider.class.getName())
+          .append(".N++));\n");
       buff.append("    final java.util.List key = ")
           .append(
               (method.e.getParameterTypes().length < 4

Other counts:
* RelMetadataTest 9,301
* RelOptRulesTest 4,474
* JdbcTest.testJoinFiveWay 200,385
* JdbcTest.testJoinManyWay 200,457
* JdbcTest 715,204 (not including testJoinManyWay, testJoinFiveWay)
* CalciteSuite 8,248,498

> Have you tested compilation into switch over a hashcode?

I haven't tested switching using a hashCode (it sounds like you're
talking about perfect hashing, or something similar). It's definitely
well worth investigating.

I also thought of replacing

 switch (relClasses.indexOf(r.getClass()))

with

  Map<Class, Integer> relClassIds = new IdentityHashMap<>();
  ...
  switch (relClassIds.get(r.getClass())

because at present we scan a list of ~80 classes each call.

Julian

On Fri, Jan 22, 2016 at 11:57 AM, Vladimir Sitnikov
<si...@gmail.com> wrote:
> Julian>My dev branch is complete and ready for review.
>
> Should we conduct some performance tests?
> It would be interesting to compare old vs new, and to identify the
> bottlenecks of the new approach.
>
> Do you have test code at hand? (I mean the boilerplate code to prepare
> the environment for "getXXX" metadata call)
> Have you planned adding perf tests?
>
> I'm not up to speed with metadata queries, however I can help with
> benchmarks/analysis.
>
> Have you tested compilation into switch over a hashcode?
> I mean
>
> switch(class.hashCode()%37) { // this 37 might be picked on a
> case-by-case basis to minimize collisions
>   case 1: // LogicalAggregate or HiveProject
>   case 23: // LogicalProject
> ...
> }
>
> Vladimir

Re: Re-working the metadata framework

Posted by Vladimir Sitnikov <si...@gmail.com>.

Julian>My dev branch is complete and ready for review.

Should we conduct some performance tests?
It would be interesting to compare old vs new, and to identify the
bottlenecks of the new approach.

Do you have test code at hand? (I mean the boilerplate code to prepare
the environment for "getXXX" metadata call)
Have you planned adding perf tests?

I'm not up to speed with metadata queries, however I can help with
benchmarks/analysis.

Have you tested compilation into switch over a hashcode?
I mean

switch(class.hashCode()%37) { // this 37 might be picked on a
case-by-case basis to minimize collisions
  case 1: // LogicalAggregate or HiveProject
  case 23: // LogicalProject
...
}

Vladimir

Re: Re-working the metadata framework

Posted by Julian Hyde <jh...@apache.org>.

I can’t see any requirements that we can drop. Here are the requirements:

* We need multiple providers (e.g. built-in and Drill),
* It’s not essential but it’s very convenient to find providers by looking for methods that are like the implemented signature except that the first argument can be a sub-class
* We need to detect cycles and cache previously computed results
* We need to be able to flush the cache
* We need to allow kinds of metadata to be defined at run time (although it’s certainly more convenient if we can add an accessor method to RelMetadataQuery)
* We need to be able to create composite providers.
* We need to be able to add providers at run time (and indeed use different combinations of providers during the planning of a single query).
* We can’t assume that we know ahead of time the full set of RelNode sub-classes for which we will ask for metadata.
* We need to be able to unwind or see through HepRelVertex or RelSubset nodes that are introduced by the Hep and Volcano planning engines.
* For RelSubset we need to combine the constituent rels’ metadata (e.g. take the min of the MaxRowCount, or the union of the inferred predicates).

I’ve dropped the requirement that you create an instance of the metadata class by “binding” a RelNode and a RelMetadataQuery. Now for each metadata interface you define a handler interface that has an extra 2 parameters. So we no longer need to create temporary “bound” objects.

Given all those requirements we have a complex dispatch problem. Reflection solved it but it was difficult to understand AND inefficient. Code generation is difficult to understand but yields code that is easy to understand (if you turn on -Dcalcite.debug) and is efficient.

My dev branch is complete and ready for review. Take a look at https://issues.apache.org/jira/browse/CALCITE-604 for an example of the generated code; I think it clarifies things a lot.

Julian



> On Jan 18, 2016, at 9:31 PM, Jacques Nadeau <ja...@apache.org> wrote:
> 
> I read your email.  My thought was that we should evaluate the real
> requirements. Maybe we need to shed some capabilities to achieve better
> performance. Can you break down the issue into a set of example patterns?
> In general, it seems like we may be trying to be too flexible. I understand
> the desire to maintain as much backwards compatibility as possible but
> think we may want to come up with a simpler metaphor. My goal is to come up
> with the right design rather than trying to optimize what might be the
> wrong design. Maybe you've done all the thinking and believe the existing
> metaphors are exactly right but I think we should get consensus around this
> the requirements before jumping to implementation approaches such as code
> generation.
> On Jan 18, 2016 5:40 PM, "Julian Hyde" <jh...@apache.org> wrote:
> 
>> I describe in paragraph 3 why we CURRENTLY use reflection. In short: very
>> complex dispatch requirements. The rest of the message is how I plan to
>> phase it out. Plain java only has one dispatch mechanism (virtual methods)
>> so isn’t going to cut it.
>> 
>>> On Jan 18, 2016, at 5:23 PM, Jacques Nadeau <ja...@apache.org> wrote:
>>> 
>>> Can you go into more detail to why reflection is needed? It seems like we
>>> could get away from reflection by sharing interfaces, etc.
>>> 
>>> On Mon, Jan 18, 2016 at 5:12 PM, Julian Hyde <jh...@apache.org> wrote:
>>> 
>>>> In https://issues.apache.org/jira/browse/CALCITE-794 <
>>>> https://issues.apache.org/jira/browse/CALCITE-794> we added an extra
>>>> parameter to each metadata call so that we could detect cyclic metadata
>>>> calls, and potentially to cache results so that a given statistic is
>> never
>>>> computed more than once during a metadata call. But the overhead of
>> making
>>>> calls into the metadata framework is still very high. It shows up as a
>> big
>>>> fraction of the time spent optimizing complex queries. I am working on
>>>> https://issues.apache.org/jira/browse/CALCITE-604 <
>>>> https://issues.apache.org/jira/browse/CALCITE-604>, which aims to fix
>>>> that.
>>>> 
>>>> I am working on 604 while the release is closing, and I thought some of
>>>> you would be be interested to know where I am going.
>>>> 
>>>> We use reflection to make calls. This is necessary because the types of
>>>> metadata (e.g. selectivity, row count, unique keys, predicates) are
>>>> extensible, you can have multiple providers for each kind of metadata,
>> each
>>>> provider has different methods for various RelNode sub-types, and we
>> want
>>>> to be able to inherit handler methods (e.g. getUniqueKeys(Aggregate,
>>>> boolean) handles getUniqueKeys(LogicalAggregate, boolean) because there
>> is
>>>> no handler method for LogicalAggregate.
>>>> 
>>>> Initially I thought we’d use MethodHandle, which is a lot faster than
>>>> method invocation by reflection. MethodHandle.invoke has some
>> flexibility
>>>> based on the types of its arguments, but I realized we’d still have to
>>>> dispatch to multiple underlying providers (e.g. the built-in provider
>> and
>>>> the Hive provider). And we have other inefficiencies such as calling
>>>> UnboundMetadata.bind(RelNode, RelMetadataQuery) to create a short-lived
>>>> object every single call.
>>>> 
>>>> So, now I am looking at using Janino to generate a dispatcher. Consider
>>>> just one kind of metadata, UniqueKeys. We already have a “signature”
>>>> interface:
>>>> 
>>>> public interface UniqueKeys extends Metadata {
>>>> Set<ImmutableBitSet> getUniqueKeys(boolean ignoreNulls);
>>>> }
>>>> 
>>>> I have added a handler interface:
>>>> 
>>>> interface UniqueKeysHandler {
>>>> Set<ImmutableBitSet> getUniqueKeys(RelNode r, RelMetadataQuery mq,
>>>> boolean ignoreNulls);
>>>> }
>>>> 
>>>> Now, given a set of metadata providers and the set of all known RelNode
>>>> sub-type, I can use Janino to generate a handler at run time:
>>>> 
>>>> class UniqueKeysHandlerImpl implements UniqueKeysHandlerImpl {
>>>> final RelMdUniqueKeys provider0;
>>>> final HiveUniqueKeys provider1;
>>>> 
>>>> UniqueKeysHandlerImpl(RelMdUniqueKeys provider0, HiveUniqueKeys
>>>> provider1) {
>>>>   this.provider0 = provider0;
>>>>   this.provider1 = provider1;
>>>> }
>>>> 
>>>> public Set<ImmutableBitSet> getUniqueKeys(RelNode r,
>>>>     RelMetadataQuery mq, boolean ignoreNulls) {
>>>>   switch (r.getClass().getName()) {
>>>>   case "org.apache.calcite.rel.logical.LogicalAggregate":
>>>>     return provider0.getUniqueKeys((Aggregate) r, mq, ignoreNulls);
>>>>   case "org.apache.calcite.rel.core.Aggregate":
>>>>     return provider0.getUniqueKeys(r, mq, ignoreNulls);
>>>>   case “
>>>> org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveAggregate":
>>>>     return provider1.getUniqueKeys(r, mq, ignoreNulls);
>>>>   default:
>>>>     throw NoHandler.INSTANCE;
>>>>   }
>>>> }
>>>> }
>>>> 
>>>> The entry point in RelMetadataQuery changes from
>>>> 
>>>> public Set<ImmutableBitSet> getUniqueKeys(RelNode rel,
>>>>   boolean ignoreNulls) {
>>>> final BuiltInMetadata.UniqueKeys metadata =
>>>>     rel.metadata(BuiltInMetadata.UniqueKeys.class, this);
>>>> return metadata.getUniqueKeys(ignoreNulls);
>>>> }
>>>> 
>>>> to
>>>> 
>>>> public Set<ImmutableBitSet> getUniqueKeys(RelNode rel,
>>>>   boolean ignoreNulls) {
>>>> for (;;) {
>>>>   try {
>>>>     return uniqueKeysHandler.getUniqueKeys(rel, this, ignoreNulls);
>>>>   } catch (NoHandler e) {
>>>>     uniqueKeysHandler = metadataProvider.revise(rel.getClass(),
>>>>         BuiltInMetadata.UniqueKeys.Handler.class);
>>>>   }
>>>> }
>>>> }
>>>> 
>>>> The “NoHandler” exception occurs very rarely — only when a kind of
>> RelNode
>>>> is seen that hasn’t been seen before in this JVM instance — but gives
>> the
>>>> handler chance to regenerate itself.
>>>> 
>>>> The result is a very direct path from the caller (generally a
>> RelOptRule)
>>>> to the provider: two calls, and we don’t even need to box the arguments.
>>>> 
>>>> I don’t think there will be any API changes, but note that the metadata
>>>> interfaces (eg. UniqueKeys) and RelNode.metadata(Class<M> metadataClass,
>>>> RelMetadataQuery mq) are not used anymore.
>>>> 
>>>> Julian
>>>> 
>>>> 
>> 
>>

Re: Re-working the metadata framework

Posted by Jacques Nadeau <ja...@apache.org>.

I read your email.  My thought was that we should evaluate the real
requirements. Maybe we need to shed some capabilities to achieve better
performance. Can you break down the issue into a set of example patterns?
In general, it seems like we may be trying to be too flexible. I understand
the desire to maintain as much backwards compatibility as possible but
think we may want to come up with a simpler metaphor. My goal is to come up
with the right design rather than trying to optimize what might be the
wrong design. Maybe you've done all the thinking and believe the existing
metaphors are exactly right but I think we should get consensus around this
the requirements before jumping to implementation approaches such as code
generation.
On Jan 18, 2016 5:40 PM, "Julian Hyde" <jh...@apache.org> wrote:

> I describe in paragraph 3 why we CURRENTLY use reflection. In short: very
> complex dispatch requirements. The rest of the message is how I plan to
> phase it out. Plain java only has one dispatch mechanism (virtual methods)
> so isn’t going to cut it.
>
> > On Jan 18, 2016, at 5:23 PM, Jacques Nadeau <ja...@apache.org> wrote:
> >
> > Can you go into more detail to why reflection is needed? It seems like we
> > could get away from reflection by sharing interfaces, etc.
> >
> > On Mon, Jan 18, 2016 at 5:12 PM, Julian Hyde <jh...@apache.org> wrote:
> >
> >> In https://issues.apache.org/jira/browse/CALCITE-794 <
> >> https://issues.apache.org/jira/browse/CALCITE-794> we added an extra
> >> parameter to each metadata call so that we could detect cyclic metadata
> >> calls, and potentially to cache results so that a given statistic is
> never
> >> computed more than once during a metadata call. But the overhead of
> making
> >> calls into the metadata framework is still very high. It shows up as a
> big
> >> fraction of the time spent optimizing complex queries. I am working on
> >> https://issues.apache.org/jira/browse/CALCITE-604 <
> >> https://issues.apache.org/jira/browse/CALCITE-604>, which aims to fix
> >> that.
> >>
> >> I am working on 604 while the release is closing, and I thought some of
> >> you would be be interested to know where I am going.
> >>
> >> We use reflection to make calls. This is necessary because the types of
> >> metadata (e.g. selectivity, row count, unique keys, predicates) are
> >> extensible, you can have multiple providers for each kind of metadata,
> each
> >> provider has different methods for various RelNode sub-types, and we
> want
> >> to be able to inherit handler methods (e.g. getUniqueKeys(Aggregate,
> >> boolean) handles getUniqueKeys(LogicalAggregate, boolean) because there
> is
> >> no handler method for LogicalAggregate.
> >>
> >> Initially I thought we’d use MethodHandle, which is a lot faster than
> >> method invocation by reflection. MethodHandle.invoke has some
> flexibility
> >> based on the types of its arguments, but I realized we’d still have to
> >> dispatch to multiple underlying providers (e.g. the built-in provider
> and
> >> the Hive provider). And we have other inefficiencies such as calling
> >> UnboundMetadata.bind(RelNode, RelMetadataQuery) to create a short-lived
> >> object every single call.
> >>
> >> So, now I am looking at using Janino to generate a dispatcher. Consider
> >> just one kind of metadata, UniqueKeys. We already have a “signature”
> >> interface:
> >>
> >> public interface UniqueKeys extends Metadata {
> >>  Set<ImmutableBitSet> getUniqueKeys(boolean ignoreNulls);
> >> }
> >>
> >> I have added a handler interface:
> >>
> >> interface UniqueKeysHandler {
> >>  Set<ImmutableBitSet> getUniqueKeys(RelNode r, RelMetadataQuery mq,
> >> boolean ignoreNulls);
> >> }
> >>
> >> Now, given a set of metadata providers and the set of all known RelNode
> >> sub-type, I can use Janino to generate a handler at run time:
> >>
> >> class UniqueKeysHandlerImpl implements UniqueKeysHandlerImpl {
> >>  final RelMdUniqueKeys provider0;
> >>  final HiveUniqueKeys provider1;
> >>
> >>  UniqueKeysHandlerImpl(RelMdUniqueKeys provider0, HiveUniqueKeys
> >> provider1) {
> >>    this.provider0 = provider0;
> >>    this.provider1 = provider1;
> >>  }
> >>
> >>  public Set<ImmutableBitSet> getUniqueKeys(RelNode r,
> >>      RelMetadataQuery mq, boolean ignoreNulls) {
> >>    switch (r.getClass().getName()) {
> >>    case "org.apache.calcite.rel.logical.LogicalAggregate":
> >>      return provider0.getUniqueKeys((Aggregate) r, mq, ignoreNulls);
> >>    case "org.apache.calcite.rel.core.Aggregate":
> >>      return provider0.getUniqueKeys(r, mq, ignoreNulls);
> >>    case “
> >> org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveAggregate":
> >>      return provider1.getUniqueKeys(r, mq, ignoreNulls);
> >>    default:
> >>      throw NoHandler.INSTANCE;
> >>    }
> >>  }
> >> }
> >>
> >> The entry point in RelMetadataQuery changes from
> >>
> >> public Set<ImmutableBitSet> getUniqueKeys(RelNode rel,
> >>    boolean ignoreNulls) {
> >>  final BuiltInMetadata.UniqueKeys metadata =
> >>      rel.metadata(BuiltInMetadata.UniqueKeys.class, this);
> >>  return metadata.getUniqueKeys(ignoreNulls);
> >> }
> >>
> >> to
> >>
> >> public Set<ImmutableBitSet> getUniqueKeys(RelNode rel,
> >>    boolean ignoreNulls) {
> >>  for (;;) {
> >>    try {
> >>      return uniqueKeysHandler.getUniqueKeys(rel, this, ignoreNulls);
> >>    } catch (NoHandler e) {
> >>      uniqueKeysHandler = metadataProvider.revise(rel.getClass(),
> >>          BuiltInMetadata.UniqueKeys.Handler.class);
> >>    }
> >>  }
> >> }
> >>
> >> The “NoHandler” exception occurs very rarely — only when a kind of
> RelNode
> >> is seen that hasn’t been seen before in this JVM instance — but gives
> the
> >> handler chance to regenerate itself.
> >>
> >> The result is a very direct path from the caller (generally a
> RelOptRule)
> >> to the provider: two calls, and we don’t even need to box the arguments.
> >>
> >> I don’t think there will be any API changes, but note that the metadata
> >> interfaces (eg. UniqueKeys) and RelNode.metadata(Class<M> metadataClass,
> >> RelMetadataQuery mq) are not used anymore.
> >>
> >> Julian
> >>
> >>
>
>

Re: Re-working the metadata framework

Posted by Julian Hyde <jh...@apache.org>.

I describe in paragraph 3 why we CURRENTLY use reflection. In short: very complex dispatch requirements. The rest of the message is how I plan to phase it out. Plain java only has one dispatch mechanism (virtual methods) so isn’t going to cut it.

> On Jan 18, 2016, at 5:23 PM, Jacques Nadeau <ja...@apache.org> wrote:
> 
> Can you go into more detail to why reflection is needed? It seems like we
> could get away from reflection by sharing interfaces, etc.
> 
> On Mon, Jan 18, 2016 at 5:12 PM, Julian Hyde <jh...@apache.org> wrote:
> 
>> In https://issues.apache.org/jira/browse/CALCITE-794 <
>> https://issues.apache.org/jira/browse/CALCITE-794> we added an extra
>> parameter to each metadata call so that we could detect cyclic metadata
>> calls, and potentially to cache results so that a given statistic is never
>> computed more than once during a metadata call. But the overhead of making
>> calls into the metadata framework is still very high. It shows up as a big
>> fraction of the time spent optimizing complex queries. I am working on
>> https://issues.apache.org/jira/browse/CALCITE-604 <
>> https://issues.apache.org/jira/browse/CALCITE-604>, which aims to fix
>> that.
>> 
>> I am working on 604 while the release is closing, and I thought some of
>> you would be be interested to know where I am going.
>> 
>> We use reflection to make calls. This is necessary because the types of
>> metadata (e.g. selectivity, row count, unique keys, predicates) are
>> extensible, you can have multiple providers for each kind of metadata, each
>> provider has different methods for various RelNode sub-types, and we want
>> to be able to inherit handler methods (e.g. getUniqueKeys(Aggregate,
>> boolean) handles getUniqueKeys(LogicalAggregate, boolean) because there is
>> no handler method for LogicalAggregate.
>> 
>> Initially I thought we’d use MethodHandle, which is a lot faster than
>> method invocation by reflection. MethodHandle.invoke has some flexibility
>> based on the types of its arguments, but I realized we’d still have to
>> dispatch to multiple underlying providers (e.g. the built-in provider and
>> the Hive provider). And we have other inefficiencies such as calling
>> UnboundMetadata.bind(RelNode, RelMetadataQuery) to create a short-lived
>> object every single call.
>> 
>> So, now I am looking at using Janino to generate a dispatcher. Consider
>> just one kind of metadata, UniqueKeys. We already have a “signature”
>> interface:
>> 
>> public interface UniqueKeys extends Metadata {
>>  Set<ImmutableBitSet> getUniqueKeys(boolean ignoreNulls);
>> }
>> 
>> I have added a handler interface:
>> 
>> interface UniqueKeysHandler {
>>  Set<ImmutableBitSet> getUniqueKeys(RelNode r, RelMetadataQuery mq,
>> boolean ignoreNulls);
>> }
>> 
>> Now, given a set of metadata providers and the set of all known RelNode
>> sub-type, I can use Janino to generate a handler at run time:
>> 
>> class UniqueKeysHandlerImpl implements UniqueKeysHandlerImpl {
>>  final RelMdUniqueKeys provider0;
>>  final HiveUniqueKeys provider1;
>> 
>>  UniqueKeysHandlerImpl(RelMdUniqueKeys provider0, HiveUniqueKeys
>> provider1) {
>>    this.provider0 = provider0;
>>    this.provider1 = provider1;
>>  }
>> 
>>  public Set<ImmutableBitSet> getUniqueKeys(RelNode r,
>>      RelMetadataQuery mq, boolean ignoreNulls) {
>>    switch (r.getClass().getName()) {
>>    case "org.apache.calcite.rel.logical.LogicalAggregate":
>>      return provider0.getUniqueKeys((Aggregate) r, mq, ignoreNulls);
>>    case "org.apache.calcite.rel.core.Aggregate":
>>      return provider0.getUniqueKeys(r, mq, ignoreNulls);
>>    case “
>> org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveAggregate":
>>      return provider1.getUniqueKeys(r, mq, ignoreNulls);
>>    default:
>>      throw NoHandler.INSTANCE;
>>    }
>>  }
>> }
>> 
>> The entry point in RelMetadataQuery changes from
>> 
>> public Set<ImmutableBitSet> getUniqueKeys(RelNode rel,
>>    boolean ignoreNulls) {
>>  final BuiltInMetadata.UniqueKeys metadata =
>>      rel.metadata(BuiltInMetadata.UniqueKeys.class, this);
>>  return metadata.getUniqueKeys(ignoreNulls);
>> }
>> 
>> to
>> 
>> public Set<ImmutableBitSet> getUniqueKeys(RelNode rel,
>>    boolean ignoreNulls) {
>>  for (;;) {
>>    try {
>>      return uniqueKeysHandler.getUniqueKeys(rel, this, ignoreNulls);
>>    } catch (NoHandler e) {
>>      uniqueKeysHandler = metadataProvider.revise(rel.getClass(),
>>          BuiltInMetadata.UniqueKeys.Handler.class);
>>    }
>>  }
>> }
>> 
>> The “NoHandler” exception occurs very rarely — only when a kind of RelNode
>> is seen that hasn’t been seen before in this JVM instance — but gives the
>> handler chance to regenerate itself.
>> 
>> The result is a very direct path from the caller (generally a RelOptRule)
>> to the provider: two calls, and we don’t even need to box the arguments.
>> 
>> I don’t think there will be any API changes, but note that the metadata
>> interfaces (eg. UniqueKeys) and RelNode.metadata(Class<M> metadataClass,
>> RelMetadataQuery mq) are not used anymore.
>> 
>> Julian
>> 
>>

Re: Re-working the metadata framework

Posted by Jacques Nadeau <ja...@apache.org>.

Can you go into more detail to why reflection is needed? It seems like we
could get away from reflection by sharing interfaces, etc.

On Mon, Jan 18, 2016 at 5:12 PM, Julian Hyde <jh...@apache.org> wrote:

> In https://issues.apache.org/jira/browse/CALCITE-794 <
> https://issues.apache.org/jira/browse/CALCITE-794> we added an extra
> parameter to each metadata call so that we could detect cyclic metadata
> calls, and potentially to cache results so that a given statistic is never
> computed more than once during a metadata call. But the overhead of making
> calls into the metadata framework is still very high. It shows up as a big
> fraction of the time spent optimizing complex queries. I am working on
> https://issues.apache.org/jira/browse/CALCITE-604 <
> https://issues.apache.org/jira/browse/CALCITE-604>, which aims to fix
> that.
>
> I am working on 604 while the release is closing, and I thought some of
> you would be be interested to know where I am going.
>
> We use reflection to make calls. This is necessary because the types of
> metadata (e.g. selectivity, row count, unique keys, predicates) are
> extensible, you can have multiple providers for each kind of metadata, each
> provider has different methods for various RelNode sub-types, and we want
> to be able to inherit handler methods (e.g. getUniqueKeys(Aggregate,
> boolean) handles getUniqueKeys(LogicalAggregate, boolean) because there is
> no handler method for LogicalAggregate.
>
> Initially I thought we’d use MethodHandle, which is a lot faster than
> method invocation by reflection. MethodHandle.invoke has some flexibility
> based on the types of its arguments, but I realized we’d still have to
> dispatch to multiple underlying providers (e.g. the built-in provider and
> the Hive provider). And we have other inefficiencies such as calling
> UnboundMetadata.bind(RelNode, RelMetadataQuery) to create a short-lived
> object every single call.
>
> So, now I am looking at using Janino to generate a dispatcher. Consider
> just one kind of metadata, UniqueKeys. We already have a “signature”
> interface:
>
> public interface UniqueKeys extends Metadata {
>   Set<ImmutableBitSet> getUniqueKeys(boolean ignoreNulls);
> }
>
> I have added a handler interface:
>
> interface UniqueKeysHandler {
>   Set<ImmutableBitSet> getUniqueKeys(RelNode r, RelMetadataQuery mq,
> boolean ignoreNulls);
> }
>
> Now, given a set of metadata providers and the set of all known RelNode
> sub-type, I can use Janino to generate a handler at run time:
>
> class UniqueKeysHandlerImpl implements UniqueKeysHandlerImpl {
>   final RelMdUniqueKeys provider0;
>   final HiveUniqueKeys provider1;
>
>   UniqueKeysHandlerImpl(RelMdUniqueKeys provider0, HiveUniqueKeys
> provider1) {
>     this.provider0 = provider0;
>     this.provider1 = provider1;
>   }
>
>   public Set<ImmutableBitSet> getUniqueKeys(RelNode r,
>       RelMetadataQuery mq, boolean ignoreNulls) {
>     switch (r.getClass().getName()) {
>     case "org.apache.calcite.rel.logical.LogicalAggregate":
>       return provider0.getUniqueKeys((Aggregate) r, mq, ignoreNulls);
>     case "org.apache.calcite.rel.core.Aggregate":
>       return provider0.getUniqueKeys(r, mq, ignoreNulls);
>     case “
> org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveAggregate":
>       return provider1.getUniqueKeys(r, mq, ignoreNulls);
>     default:
>       throw NoHandler.INSTANCE;
>     }
>   }
> }
>
> The entry point in RelMetadataQuery changes from
>
> public Set<ImmutableBitSet> getUniqueKeys(RelNode rel,
>     boolean ignoreNulls) {
>   final BuiltInMetadata.UniqueKeys metadata =
>       rel.metadata(BuiltInMetadata.UniqueKeys.class, this);
>   return metadata.getUniqueKeys(ignoreNulls);
> }
>
> to
>
> public Set<ImmutableBitSet> getUniqueKeys(RelNode rel,
>     boolean ignoreNulls) {
>   for (;;) {
>     try {
>       return uniqueKeysHandler.getUniqueKeys(rel, this, ignoreNulls);
>     } catch (NoHandler e) {
>       uniqueKeysHandler = metadataProvider.revise(rel.getClass(),
>           BuiltInMetadata.UniqueKeys.Handler.class);
>     }
>   }
> }
>
> The “NoHandler” exception occurs very rarely — only when a kind of RelNode
> is seen that hasn’t been seen before in this JVM instance — but gives the
> handler chance to regenerate itself.
>
> The result is a very direct path from the caller (generally a RelOptRule)
> to the provider: two calls, and we don’t even need to box the arguments.
>
> I don’t think there will be any API changes, but note that the metadata
> interfaces (eg. UniqueKeys) and RelNode.metadata(Class<M> metadataClass,
> RelMetadataQuery mq) are not used anymore.
>
> Julian
>
>