Posted to dev@calcite.apache.org by "Hartman, Trevor" <th...@ebay.com> on 2014/11/18 19:27:23 UTC

sample project demonstrating type errors on hierarchical Map data structure

As I've said in a few other threads, I'm trying to add an adapter for a custom database with a hierarchical dataset.

Overall, it seems like I'm having to do a lot of gymnastics to get Lists of projected fields, filter conditions, and aggregations into my Enumerator, where all the pushdown impl happens.

Sorry for my lack of understanding; I'm having a very hard time figuring out how to implement the adapter. I've read much of the Calcite integration in the sources of optiq-csv (I mostly understand that), optiq-javabean (mostly understand), Mongo adapter (mostly do not understand), and Kylin (mostly do not understand).

I did notice optiq-csv and optiq-javabean use `call.transformTo` in their RelOptRule subclass onMatch methods, while Mongo uses `implementor.add(op.left, op.right)`. Not sure what's going on there, or which style I should use.

Aside from not really understanding the components of Calcite, I'm blocked by an error about types that I don't understand, which I posted in my other thread a few days ago. I'm able to get queries working on a simple Enumerator without pushdown/RelOptRules, but I run into problems when trying to implement a ProjectRel. I created a separate project that reproduces the error [1]. Sample test failure output is in the README. It's about 200 lines of Scala on a dummy nested Map dataset [2].

I know there's a lot going on with the move to Apache, and this may be a bad time for newcomers, but I'd love to help out where I can with docs or blog posts if/when I better understand.

Thanks,
Trevor

[1] https://github.com/devth/calcite-map-demo
[2] https://github.com/devth/calcite-map-demo/blob/master/src/main/scala/devth/calcite/MapEnumerator.scala#L15-25

Re: sample project demonstrating type errors on hierarchical Map data structure

Posted by Vladimir Sitnikov <si...@gmail.com>.

> more performant for me to flatten the data myself in my query loop rather than constructing an intermediate Map

This makes sense.

> At this point I'm not sure how to "peel a Map", but I think I can figure that part out.
I've no idea why I used "peel". It should be "skin a Map", shouldn't it?
Anyway, I hope you are on track.

Vladimir

Re: sample project demonstrating type errors on hierarchical Map data structure

Posted by "Hartman, Trevor" <th...@ebay.com>.

calcite.debug works now with your syntax.

>>> The question is if your DB returns Iterator<Map> or if it returns
>>> something closer to ResultSet (i.e. Iterator<Object[]>).

I have complete control over the data structure my database returns, so it could be either one, but an Iterator<Map> would be much easier.

Here's a small example of what my database's query API currently looks like (without actually accumulating any results), illustrating the hierarchical nature (a single listing has many transactions):

while (listingsSchema.next) {
  val listingId: Long = listingsSchema.listingId
  val transactions = listingsSchema.transactions
  while (transactions.next) {
    val transactionId: Long = transactions.transactionId
    val quantitySold: Int = transactions.quantitySold
  }
}

In reality, there are many fields and up to 3 levels in the hierarchy (but another schema may be even deeper). It's up to me to efficiently accumulate those in an appropriate data structure. I could certainly return a Map, and that would be easiest since it naturally represents the hierarchy.

> Why do you feel you can make it more efficient?

I'm assuming it would be more performant for me to flatten the data myself in my query loop rather than constructing an intermediate Map which will only then be flattened by Calcite. At this point I'm not sure how to "peel a Map", but I think I can figure that part out.
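
For illustration, here is a minimal Java sketch of the "flatten it myself" idea, assuming the nested listing/transaction loop above. The `Listing` and `Transaction` classes and the `flatten` method are hypothetical stand-ins, not part of the real database API or of Calcite; the point is only that each (listing, transaction) pair becomes one flat Object[] row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class HierarchyFlattener {
  /** Hypothetical child record: one transaction of a listing. */
  static final class Transaction {
    final long transactionId;
    final int quantitySold;

    Transaction(long transactionId, int quantitySold) {
      this.transactionId = transactionId;
      this.quantitySold = quantitySold;
    }
  }

  /** Hypothetical parent record: a listing with its transactions. */
  static final class Listing {
    final long listingId;
    final List<Transaction> transactions;

    Listing(long listingId, List<Transaction> transactions) {
      this.listingId = listingId;
      this.transactions = transactions;
    }
  }

  /** Emits one flat row {listingId, transactionId, quantitySold} per transaction. */
  static List<Object[]> flatten(List<Listing> listings) {
    List<Object[]> rows = new ArrayList<>();
    for (Listing listing : listings) {
      for (Transaction tx : listing.transactions) {
        rows.add(new Object[] {listing.listingId, tx.transactionId, tx.quantitySold});
      }
    }
    return rows;
  }

  public static void main(String[] args) {
    List<Listing> listings = Arrays.asList(
        new Listing(1L, Arrays.asList(new Transaction(10L, 2), new Transaction(11L, 5))));
    for (Object[] row : flatten(listings)) {
      System.out.println(Arrays.toString(row));
    }
  }
}
```

Note that in this sketch a listing without transactions produces no rows at all; whether that is the desired behaviour depends on the queries the adapter has to serve.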

> You invoke newProject=oldProject.copy(..., new MapProjectRel(new
> MapTableScan())); call.transformTo(newProject); and it creates a new
> project relation with the same project expressions and new input.
> Does it make sense?

Yes, thank you!

Trevor

Re: sample project demonstrating type errors on hierarchical Map data structure

Posted by Vladimir Sitnikov <si...@gmail.com>.

>I added systemProperties, but neither seem to make a difference in maven output.

Can you please put the following _literally_ in your pom.xml?
I have no idea what the proper way is, since I have zero Scala
experience, but the syntax below works for me somehow.
<configuration>
  <systemProperties><calcite.debug>true</calcite.debug></systemProperties>
...
Here's what I get with it:
https://gist.github.com/vlsi/c9142d16bf5177da8ca6 (note
<<.getTable("foo")).project(>> part in the log)

>1. Wouldn't it be more efficient for me to transform results into a flat data structure (A2 approach) myself rather than letting Calcite do it?

Why do you feel you can make it more efficient?

Can you please read the following question again, and see if it makes
more sense to you now? I am sorry to ask it again, but I suspect
it did not make much sense to you in the first iteration :)
>> The question is if your DB returns Iterator<Map> or if it returns
>> something closer to ResultSet (i.e. Iterator<Object[]>).

In other words:
1) I see no reason in "optimization" of toy examples with Maps.
Well, having different examples is really helpful for understanding
things, however I do not feel "developing 1001 ways of peeling a Map
from no background" is a good starter. It might be a good tutorial,
however it is not that easy to code unless you understand what result
you want.

Finally, there is no way to optimize an already existing Map: you just
return it, and that is the fastest processing, with no additional
overhead.

2) If you have your use-case (e.g. database with specific API), you
might get better speed if you solve your problem, since it is real and
measurable.
I am afraid you did not clarify if you have any specific use-case/DB
API in mind.

3) If you want to explore the possibilities (e.g. implement both A1
and A2 approaches and maybe some other(!) ), that is perfectly fine
and we are all ears for the improvement of Calcite examples.

>2. If I were to go the `ProjectRel(MapTableScan) =transformTo => ProjectRel(MapProjectRel(MapTableScan))` route, how would I actually build that structure? Currently I only understand how to convert one Rel (ProjectRel) to my own (MapProjectRel).

You invoke newProject=oldProject.copy(..., new MapProjectRel(new
MapTableScan())); call.transformTo(newProject); and it creates a new
project relation with the same project expressions and new input.
Does it make sense?
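
To make the shape of that rewrite concrete, here is a toy Java sketch that does not use the Calcite API at all. `Node`, `copyWithInput` and `pushProject` are hypothetical names invented for illustration, and handing the projection expressions to the toy MapProjectRel is an assumption too. It only models the idea: the outer project keeps its expressions and gets a new input.

```java
import java.util.Arrays;
import java.util.List;

final class ToyRewrite {
  /** Toy relational node: a name, projection expressions, and at most one input. */
  static final class Node {
    final String name;
    final List<String> exprs;
    final Node input; // null for a leaf such as a table scan

    Node(String name, List<String> exprs, Node input) {
      this.name = name;
      this.exprs = exprs;
      this.input = input;
    }

    /** Models the idea of copy(...): same node and expressions, new input. */
    Node copyWithInput(Node newInput) {
      return new Node(name, exprs, newInput);
    }

    @Override public String toString() {
      return input == null ? name : name + "(" + input + ")";
    }
  }

  /** Rewrites ProjectRel(MapTableScan) to ProjectRel(MapProjectRel(MapTableScan)). */
  static Node pushProject(Node project) {
    boolean matches = project.name.equals("ProjectRel")
        && project.input != null
        && project.input.name.equals("MapTableScan");
    if (!matches) {
      return project; // nothing to do, so the toy rule does not fire again
    }
    // Assumption: the pushed-down node receives the same expressions to fetch.
    Node mapProject = new Node("MapProjectRel", project.exprs, project.input);
    return project.copyWithInput(mapProject);
  }

  public static void main(String[] args) {
    Node scan = new Node("MapTableScan", Arrays.<String>asList(), null);
    Node project = new Node("ProjectRel", Arrays.asList("_MAP['name']"), scan);
    System.out.println(pushProject(project));
  }
}
```

Because the rewritten tree has MapProjectRel rather than MapTableScan directly under the project, applying the toy rule a second time leaves the tree unchanged.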


Vladimir

Re: sample project demonstrating type errors on hierarchical Map data structure

Posted by "Hartman, Trevor" <th...@ebay.com>.

> In pom.xml, you should use systemProperties, not systemPropertyVariables.

systemProperties is deprecated in favor of systemPropertyVariables [1]. I added systemProperties, but neither seem to make a difference in maven output.

> If you go with A1 approach I would recommend the following rule:
> ProjectRel(MapTableScan) =transformTo=>
> ProjectRel(MapProjectRel(MapTableScan)). Note how ProjectRel with
> original expressions is _retained_.
> This allows you: "to return just a _MAP from MapProjectRel" and allow
> ProjectRel (i.e. EnumerableCalcRel) transform the map to "columns" format.

Implementing computeSelfCost as you suggested caused my MapProjectRel to fire. And now I understand what you mean about just returning a _MAP and allowing ProjectRel to transform the map to columns. Two questions regarding that:

1. Wouldn't it be more efficient for me to transform results into a flat data structure (A2 approach) myself rather than letting Calcite do it?
2. If I were to go the `ProjectRel(MapTableScan) =transformTo => ProjectRel(MapProjectRel(MapTableScan))` route, how would I actually build that structure? Currently I only understand how to convert one Rel (ProjectRel) to my own (MapProjectRel).

Thanks for the advice!

Trevor

[1] http://maven.apache.org/surefire/maven-surefire-plugin/examples/system-properties.html

Re: sample project demonstrating type errors on hierarchical Map data structure

Posted by Vladimir Sitnikov <si...@gmail.com>.

> Could that explain it, or should I expect MapProjectRel to be in the tree?

In general, yes, you should expect it.

If you go with A1 approach I would recommend the following rule:
ProjectRel(MapTableScan) =transformTo=>
ProjectRel(MapProjectRel(MapTableScan)). Note how ProjectRel with
original expressions is _retained_.
This allows you: "to return just a _MAP from MapProjectRel" and allow
ProjectRel (i.e. EnumerableCalcRel) transform the map to "columns"
format.

The final plan should be
EnumerableCalcRel(MapToEnumerableConverter(MapProjectRel(MapTableScan))).


Vladimir

Re: sample project demonstrating type errors on hierarchical Map data structure

Posted by Vladimir Sitnikov <si...@gmail.com>.

In pom.xml, you should use systemProperties, not systemPropertyVariables.

I think your MapProject is not used because of its cost.
Did you try implementing computeSelfCost like
super.computeSelfCost(planner).multiplyBy(0.1);?

Vladimir

Re: sample project demonstrating type errors on hierarchical Map data structure

Posted by "Hartman, Trevor" <th...@ebay.com>.

After many hours of stepping through the Mongo adapter in my debugger, I've tried to recreate the way it works with ProjectRel only at this point [1].

One issue I'm having is the `implement` method in my MapProjectRel is never called. I ran explain and saw it's not in the tree (I verified that MapProjectRule is registered [2]):

0: jdbc:calcite:model=src/test/resources/mode> explain plan for select _MAP['name'] from "map"."foo";
+------+
| PLAN |
+------+
| EnumerableCalcRel(expr#0=[{inputs}], expr#1=['name'], expr#2=[ITEM($t0, $t1)], EXPR$0=[$t2])
  MapToEnumerableConverter
    MapTableScan(table=[[map, foo]])
 |
+------+


When I run explain on Mongo adapter with zips model:

0: jdbc:calcite:model=mongodb/src/test/resour> explain plan for select state from zips;
+------+
| PLAN |
+------+
| MongoToEnumerableConverter
  MongoProjectRel(STATE=[CAST(ITEM($0, 'state')):VARCHAR(2) CHARACTER SET "ISO-8859-1" COLLATE "ISO-8859-1$en_US$primary"])
    MongoTableScan(table=[[mongo_raw, zips]])
 |
+------+


One difference is that the Mongo query is over a view, which avoids the _MAP syntax. Could that explain it, or should I expect MapProjectRel to be in the tree?

Also, I'm not sure why mine has EnumerableCalcRel as the root whereas Mongo has MongoToEnumerableConverter.


[1] https://github.com/devth/calcite-map-demo/tree/master/src/main/scala/devth/calcite
[2] https://github.com/devth/calcite-map-demo/blob/master/src/main/scala/devth/calcite/MapTableScan.scala#L45

Re: sample project demonstrating type errors on hierarchical Map data structure

Posted by "Hartman, Trevor" <th...@ebay.com>.

Vladimir, thank you for the detailed response.

Yang, thanks for the suggestion; I was not aware of Calcite's JavaRules.

I'm working through these.

Trevor

Re: sample project demonstrating type errors on hierarchical Map data structure

Posted by Vladimir Sitnikov <si...@gmail.com>.

Hi,

> I created a separate project that reproduces the error

If you add
<systemProperties><calcite.debug>true</calcite.debug></systemProperties> to
   scalatest-maven-plugin it might help debugging code generation issues.
It might inspire you in terms of where those "project: Method =
classOf[MapTable].getMethod("project", classOf[JList[String]])"
finally land.

I believe the main problem with your MapProjectRule is the rule does
not honor row types.
Without MapProjectRule your project compiles and seems to work fine.

1) The source MapTable has some row type. In your case it is MAP(varchar, any)
2) When the project is applied, it might change the row type (think of
it like the set and types of columns). So when a rule replaces (i.e.
call.transformTo) one relation with another it must keep the row type
intact. Otherwise Calcite won't know what to do with "new columns" or
the ones that "disappeared". So, if you replace
ProjectRel(MapTableScan) with MapTableScan2, you must ensure
MapTableScan2 returns exactly the same row type that ProjectRel does.

> optiq-csv and optiq-javabean use `call.transformTo` in their RelOptRule subclass onMatch methods
> while Mongo uses `implementor.add(op.left, op.right)`. Not sure what's going on there, or which style I should use

Those are different concepts.
1) transformTo is used for rules that replace relation nodes. For
instance, a rule that detects Project(TableScan) and replaces it with a
single SmartTableScan.

2) `implementor.add(op.left, op.right)` and `implementor.result(` are
used for the final code generation. For instance, let's assume you
managed somehow to create SmartTableScan for your "hierarchical
dataset". To compute the result of a particular scan you need to call
some API.
That is the purpose of those implementors.
In calcite-mongo, the `implementor` builds up a Mongo query. In the
`enumerable` adapter, the `implementors` build some Java source that
performs the required calls.

Try to come up with the API Calcite should use to call your database.
The question is if your DB returns Iterator<Map> or if it returns
something closer to ResultSet (i.e. Iterator<Object[]>).

There are two basic approaches:
A1) "Let your database return just a subset of the required data as a
Map". Calcite will project the map to flat "table row" afterwards via
its ProjectRel operation. This will be pretty much similar to the code
you have without MapProjectRule.
A1.1) In MapProjectRule you'll need to rewrite ProjectRel(MapTableScan)
to ProjectRel(MapTableScan2) where MapTableScan2 includes the set of
required projections.
You'll need to make sure your rule does not fire for no reason (e.g.
override `matches` and recheck if there are more projections to push).
A1.2) In `MapTableScan.implement` pass the required projections to
your "database" to make it return a map with just a subset of the
data.

A2) "Let your database project the query to the flat format".
For instance, for query `select _MAP['name'], _MAP['address']['city']
from "foo"` your enumerator would have to return Object[]{name, city}
for each row. If your database accepts queries like Iterator<Object[]>
selectData(projections), then this might be better than A1 since you
won't have to reconvert the DB response back to Map format.
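
As a rough Java sketch of what "project to the flat format" means for the query above: each projection is treated as a path of keys into the nested map, and each row becomes an Object[] with one slot per path. `MapPathProjector`, `lookup` and `project` are hypothetical names for illustration only, and returning null for a missing key is an assumption, not something Calcite prescribes.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MapPathProjector {
  /** Follows a path such as ["address", "city"]; returns null if a step is missing. */
  static Object lookup(Map<String, Object> row, List<String> path) {
    Object current = row;
    for (String key : path) {
      if (!(current instanceof Map)) {
        return null; // assumption: a missing or non-map step yields null
      }
      current = ((Map<?, ?>) current).get(key);
    }
    return current;
  }

  /** Builds one flat row with one value per requested path. */
  static Object[] project(Map<String, Object> row, List<List<String>> paths) {
    Object[] flat = new Object[paths.size()];
    for (int i = 0; i < flat.length; i++) {
      flat[i] = lookup(row, paths.get(i));
    }
    return flat;
  }

  public static void main(String[] args) {
    Map<String, Object> address = new HashMap<>();
    address.put("city", "Portland");
    Map<String, Object> row = new HashMap<>();
    row.put("name", "Ann");
    row.put("address", address);

    List<List<String>> paths = Arrays.asList(
        Arrays.asList("name"),
        Arrays.asList("address", "city"));
    System.out.println(Arrays.toString(project(row, paths)));
  }
}
```

A real adapter following A2 would ideally never build the Map in the first place and would fill the Object[] straight from the database cursor; the sketch only shows the target shape of a row.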

A2.1) Set a breakpoint in `MapProjectRule.onMatch` and check the
rowType of the `ProjectRel`.
A2.2) Teach MapTableScan to take the set of projections as a
constructor argument and teach it to use the set of projections in its
`deriveRowType`.
A2.3) In MapProjectRule you'll need to rewrite ProjectRel(MapTableScan)
to MapTableScan2 where MapTableScan2 effectively scans the table and
returns it back in a flat row format (no maps).
A2.4) In `MapTableScan.implement` pass the projections to your
"database". Here your `enumerator` should return not t
A2.5) A tricky point: if you are required to return just a single
column, do not return it as Object[]{column}; return just the value.
This is why CsvEnumerator has two separate converters:
https://github.com/apache/incubator-calcite/blob/master/example/csv/src/main/java/org/apache/calcite/adapter/csv/CsvEnumerator.java#L87-92
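
The single-column point can be sketched in plain Java as follows. This is not the CsvEnumerator code; `RowConverters` and `forFields` are hypothetical names, and the sketch only illustrates choosing between a "bare value" converter and an "Object[]" converter based on the number of projected fields.

```java
import java.util.Arrays;
import java.util.function.Function;

final class RowConverters {
  /**
   * Picks a converter for the given projected field indexes.
   * One field: return the bare value. Several fields: return an Object[].
   */
  static Function<Object[], Object> forFields(int[] fields) {
    if (fields.length == 1) {
      final int only = fields[0];
      return source -> source[only]; // bare value, not Object[]{value}
    }
    return source -> {
      Object[] row = new Object[fields.length];
      for (int i = 0; i < fields.length; i++) {
        row[i] = source[fields[i]];
      }
      return row;
    };
  }

  public static void main(String[] args) {
    Object[] source = {"a", "b", "c"};
    System.out.println(forFields(new int[] {1}).apply(source));
    System.out.println(Arrays.toString((Object[]) forFields(new int[] {2, 0}).apply(source)));
  }
}
```

Choosing the converter once, up front, keeps the per-row path free of a length check.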

Vladimir