You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@calcite.apache.org by Pranav Deshpande <de...@gmail.com> on 2022/08/01 22:41:38 UTC

How does one leverage Calcite to federate queries over multiple underlying engines?

Hi Team,
Say that I have a query which references 5 tables, and 3 of these are in
Cassandra and 1 is a CSV file. Out of these, the cassandra tables are split
across 2 different cassandra DB instances.

How does calcite federation work in this case?

Do I have to instantiate multiple connections? Is there an example for
this? How does calcite know which tablescan/filter etc. to call?

Regards,
Pranav

Re: How does one leverage Calcite to federate queries over multiple underlying engines?

Posted by Gavin Ray <ra...@gmail.com>.

I've been meaning to write and publish something on this since I had the
same question + it seems to come up often

There are essentially two important things you need to know about Calcite
to work with federated data:

1. Calcite schemas work like filesystem hierarchies. You have a
"rootSchema" which is like the system root.
    Datasources you want to query need to be added as child schemas of this
rootSchema.\

In the case of something like a standard DB (Postgres/MySQL), what you'll
get may not be one schema, but a "tree" of schemas.
What I mean by this is that there are wrapper/utility methods to give you
back a "Calcite" representation of the database structure as schemas

For a Postgres DB, that has a structure like:

                my_postgres_db1
                     /              \
             public           schema_2

You'll get a Schema that has two child schemas.
To do this, you use JdbcCatalogSchema: JdbcCatalogSchema (Apache Calcite
API)
<https://calcite.apache.org/javadocAggregate/org/apache/calcite/adapter/jdbc/JdbcCatalogSchema.html>

For a DB like MySQL, which doesn't have the notion of "Databases",
you'd instead use JdbcSchema: JdbcSchema (Apache Calcite API)
<https://calcite.apache.org/javadocAggregate/org/apache/calcite/adapter/jdbc/JdbcSchema.html>

2. Don't close your Calcite Connection object. It's stateful, and whatever
you've done to add new datasources to it will vanish if the connection is
closed.
    This is especially important to keep in mind because it's
AutoCloseable. Don't use try-with-resources blocks with it, or call .close()

You can see an example of some of this stuff here:
https://github.com/GavinRay97/GraphQLCalcite/blob/92b18a850d5546403b9760d96dc5b3a59533f3ca/src/main/kotlin/calcite/CalciteSchemaManager.kt#L26-L60

Please ignore "addDatabase" manually iterating schemas and not using
JdbcCatalogSchema
Also the methods which run manual RelNode expressions instead of SQL aren't
necessary

I asked about this and Julian said that running SQL through a
java.sql.Connection object amounts to the same thing

Hope this overview helps, let me know if anything is not clear.
Also if anyone else has advice on this topic, I'd be keen to hear it too
(or if I've made mistakes here)


On Tue, Aug 2, 2022 at 2:58 PM Pranav Deshpande <
deshpande.v.pranav@gmail.com> wrote:

> Hi Team,
> Say that I have a query which references 5 tables, and 3 of these are in
> Cassandra and 1 is a CSV file. Out of these, the cassandra tables are split
> across 2 different cassandra DB instances.
>
> How does calcite federation work in this case?
>
> Do I have to instantiate multiple connections? Is there an example for
> this? How does calcite know which tablescan/filter etc. to call?
>
> Regards,
> Pranav
>