You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@calcite.apache.org by Gavin Ray <ra...@gmail.com> on 2022/01/27 15:25:11 UTC

How to handle the fact that different databases have different ideas of what a "db"/"schema" are?

My RDBMS experience is nearly exclusively Postgres
While working on this project, I've made the assumption that the structure
of a database is:

Database -> Schema -> Table

It turns out that this isn't accurate. In MySQL for instance, "Schema" is
an alias for "DB".
From the below StackOverflow answer, it seems like this is all over the
place:

https://stackoverflow.com/a/7944489/13485494

I have a "CalciteSchemaManager" object which has a "rootSchema" to which
all datasources are attached
This "rootSchema" is used to generate the GraphQL API and types

It seems like I have two options, and I'm not sure which is a better design:

1. Force all datasources to conform to (Database -> Schema -> Table)

This means that adding a new MySQL database, would generate ("mysql_db" ->
"root" (fake schema) -> "some_table")
Adding a CSV schema too, would be something like ("csv_datasource" ->
"root" -> "some_csv_file")

2. Have an irregular data shape. Datasources can be of arbitrary sub-schema
depth.

Example Postgres: ("pg_db_1" -> "public" -> "user")
Example MySQL:   ("mysql_db_1" -> "user")
Example CSV: ("some_csv_file") or maybe ("csv_source_1" -> "some_csv_file")

What do you folks think I ought to do?
Thank you =)

Re: How to handle the fact that different databases have different ideas of what a "db"/"schema" are?

Posted by Gavin Ray <ra...@gmail.com>.
Jacques, thank you. I appreciate the weigh-in with your experience, and that
link is brilliant.

This is exactly what I am trying to do. I tried to write something like
this but
it felt shaky, I wasn't confident in it.

It took a JDBC "DataSource" and then used metadata to iterate
catalogs/schemas
and add them to itself as a child "JdbcSchema", returning them in
"getSubschemaMap()". Creating a wrapping default/root catalog if necessary.

The code here is a much better version of what I was trying to do it seems.
I am
going to unabashedly steal the overall implementation/design patterns here
=)

On Mon, Jan 31, 2022 at 8:06 PM Jacques Nadeau <ja...@apache.org> wrote:

> If I recall correctly, my experience is you have to code per underlying
> database. I believe there are only a few patterns but I don't think there
> is sufficient odbc/jdbc info to answer behavior reliably (or at least in a
> way that feels correct/native to each database). For example, I believe
> some databases require catalog selection at the connection level and so
> while a catalog concept exists, you have to use different connections for
> each catalog whereas other databases expose catalog within a connection.
>
> This is all quite old thinking. I remember writing some simpler logic for
> this here years ago at [1]. Note that code is long sense changed but wanted
> to call it.
>
> [1]
>
> https://github.com/apache/drill/blob/18a1ae4d31cd502d2f792b331fefeb0ed2106c53/contrib/storage-jdbc/src/main/java/org/apache/drill/exec/store/jdbc/JdbcStoragePlugin.java#L301
>
>
> On Thu, Jan 27, 2022 at 7:38 PM Gavin Ray <ra...@gmail.com> wrote:
>
> > The filesystem hierarchy is a great analogy, I understand it much better
> > now I think -- thank you.
> >
> > This seems like a problem with potentially very brittle solutions. Using
> > your explanation I was able to get it to work,
> > with very terrible logic that says "If catalogs are all null, then it's
> > one-level so make a fake "root" schema to hold it in the rootSchema"
> >
> > But this doesn't feel so good. Not 100% sure what the best thing to do
> here
> > is, but at least you've cleared up what's going on.
> >
> > On Thu, Jan 27, 2022 at 6:41 PM Julian Hyde <jh...@gmail.com>
> > wrote:
> >
> > > Let’s not worry about the names, and say that some DBs have two
> namespace
> > > levels and others have just one.
> > >
> > > Calcite’s word for a namespace is ’schema’. Calcite schemas are
> arranged
> > > in a hierarchy, like a filesystem, so there is no preferred depth. Any
> > > schema can contain both tables and (sub)schemas. So you can easily
> built
> > a
> > > one- or two-level namespace structure, or whatever you want.
> > >
> > > Calcite’s catalog has a single ‘root schema’ (analogous to the root
> > > directory, ‘/‘ in file systems), and you can get to anything else from
> > > there.
> > >
> > > In JDBC parlance, a a level 1 namespace is called ‘catalog’, and a
> level
> > 2
> > > namespace is a ’schema’. If a DB has a one-level namespace then catalog
> > > will be null, or the empty string, or something.
> > >
> > > If you’re running an Avatica JDBC server backed by a particular Calcite
> > > root schema, and you want your database to look like a one-level or
> > > two-level database, we probably don’t make it particularly easy.
> > >
> > > Julian
> > >
> > >
> > > > On Jan 27, 2022, at 7:25 AM, Gavin Ray <ra...@gmail.com>
> wrote:
> > > >
> > > > My RDBMS experience is nearly exclusively Postgres
> > > > While working on this project, I've made the assumption that the
> > > structure
> > > > of a database is:
> > > >
> > > > Database -> Schema -> Table
> > > >
> > > > It turns out that this isn't accurate. In MySQL for instance,
> "Schema"
> > is
> > > > an alias for "DB".
> > > > From the below StackOverflow answer, it seems like this is all over
> the
> > > > place:
> > > >
> > > > https://stackoverflow.com/a/7944489/13485494
> > > >
> > > > I have a "CalciteSchemaManager" object which has a "rootSchema" to
> > which
> > > > all datasources are attached
> > > > This "rootSchema" is used to generate the GraphQL API and types
> > > >
> > > > It seems like I have two options, and I'm not sure which is a better
> > > design:
> > > >
> > > > 1. Force all datasources to conform to (Database -> Schema -> Table)
> > > >
> > > > This means that adding a new MySQL database, would generate
> ("mysql_db"
> > > ->
> > > > "root" (fake schema) -> "some_table")
> > > > Adding a CSV schema too, would be something like ("csv_datasource" ->
> > > > "root" -> "some_csv_file")
> > > >
> > > > 2. Have an irregular data shape. Datasources can be of arbitrary
> > > sub-schema
> > > > depth.
> > > >
> > > > Example Postgres: ("pg_db_1" -> "public" -> "user")
> > > > Example MySQL:   ("mysql_db_1" -> "user")
> > > > Example CSV: ("some_csv_file") or maybe ("csv_source_1" ->
> > > "some_csv_file")
> > > >
> > > > What do you folks think I ought to do?
> > > > Thank you =)
> > >
> > >
> >
>

Re: How to handle the fact that different databases have different ideas of what a "db"/"schema" are?

Posted by Jacques Nadeau <ja...@apache.org>.
If I recall correctly, my experience is you have to code per underlying
database. I believe there are only a few patterns but I don't think there
is sufficient odbc/jdbc info to answer behavior reliably (or at least in a
way that feels correct/native to each database). For example, I believe
some databases require catalog selection at the connection level and so
while a catalog concept exists, you have to use different connections for
each catalog whereas other databases expose catalog within a connection.

This is all quite old thinking. I remember writing some simpler logic for
this here years ago at [1]. Note that code is long sense changed but wanted
to call it.

[1]
https://github.com/apache/drill/blob/18a1ae4d31cd502d2f792b331fefeb0ed2106c53/contrib/storage-jdbc/src/main/java/org/apache/drill/exec/store/jdbc/JdbcStoragePlugin.java#L301


On Thu, Jan 27, 2022 at 7:38 PM Gavin Ray <ra...@gmail.com> wrote:

> The filesystem hierarchy is a great analogy, I understand it much better
> now I think -- thank you.
>
> This seems like a problem with potentially very brittle solutions. Using
> your explanation I was able to get it to work,
> with very terrible logic that says "If catalogs are all null, then it's
> one-level so make a fake "root" schema to hold it in the rootSchema"
>
> But this doesn't feel so good. Not 100% sure what the best thing to do here
> is, but at least you've cleared up what's going on.
>
> On Thu, Jan 27, 2022 at 6:41 PM Julian Hyde <jh...@gmail.com>
> wrote:
>
> > Let’s not worry about the names, and say that some DBs have two namespace
> > levels and others have just one.
> >
> > Calcite’s word for a namespace is ’schema’. Calcite schemas are arranged
> > in a hierarchy, like a filesystem, so there is no preferred depth. Any
> > schema can contain both tables and (sub)schemas. So you can easily built
> a
> > one- or two-level namespace structure, or whatever you want.
> >
> > Calcite’s catalog has a single ‘root schema’ (analogous to the root
> > directory, ‘/‘ in file systems), and you can get to anything else from
> > there.
> >
> > In JDBC parlance, a a level 1 namespace is called ‘catalog’, and a level
> 2
> > namespace is a ’schema’. If a DB has a one-level namespace then catalog
> > will be null, or the empty string, or something.
> >
> > If you’re running an Avatica JDBC server backed by a particular Calcite
> > root schema, and you want your database to look like a one-level or
> > two-level database, we probably don’t make it particularly easy.
> >
> > Julian
> >
> >
> > > On Jan 27, 2022, at 7:25 AM, Gavin Ray <ra...@gmail.com> wrote:
> > >
> > > My RDBMS experience is nearly exclusively Postgres
> > > While working on this project, I've made the assumption that the
> > structure
> > > of a database is:
> > >
> > > Database -> Schema -> Table
> > >
> > > It turns out that this isn't accurate. In MySQL for instance, "Schema"
> is
> > > an alias for "DB".
> > > From the below StackOverflow answer, it seems like this is all over the
> > > place:
> > >
> > > https://stackoverflow.com/a/7944489/13485494
> > >
> > > I have a "CalciteSchemaManager" object which has a "rootSchema" to
> which
> > > all datasources are attached
> > > This "rootSchema" is used to generate the GraphQL API and types
> > >
> > > It seems like I have two options, and I'm not sure which is a better
> > design:
> > >
> > > 1. Force all datasources to conform to (Database -> Schema -> Table)
> > >
> > > This means that adding a new MySQL database, would generate ("mysql_db"
> > ->
> > > "root" (fake schema) -> "some_table")
> > > Adding a CSV schema too, would be something like ("csv_datasource" ->
> > > "root" -> "some_csv_file")
> > >
> > > 2. Have an irregular data shape. Datasources can be of arbitrary
> > sub-schema
> > > depth.
> > >
> > > Example Postgres: ("pg_db_1" -> "public" -> "user")
> > > Example MySQL:   ("mysql_db_1" -> "user")
> > > Example CSV: ("some_csv_file") or maybe ("csv_source_1" ->
> > "some_csv_file")
> > >
> > > What do you folks think I ought to do?
> > > Thank you =)
> >
> >
>

Re: How to handle the fact that different databases have different ideas of what a "db"/"schema" are?

Posted by Gavin Ray <ra...@gmail.com>.
The filesystem hierarchy is a great analogy, I understand it much better
now I think -- thank you.

This seems like a problem with potentially very brittle solutions. Using
your explanation I was able to get it to work,
with very terrible logic that says "If catalogs are all null, then it's
one-level so make a fake "root" schema to hold it in the rootSchema"

But this doesn't feel so good. Not 100% sure what the best thing to do here
is, but at least you've cleared up what's going on.

On Thu, Jan 27, 2022 at 6:41 PM Julian Hyde <jh...@gmail.com> wrote:

> Let’s not worry about the names, and say that some DBs have two namespace
> levels and others have just one.
>
> Calcite’s word for a namespace is ’schema’. Calcite schemas are arranged
> in a hierarchy, like a filesystem, so there is no preferred depth. Any
> schema can contain both tables and (sub)schemas. So you can easily built a
> one- or two-level namespace structure, or whatever you want.
>
> Calcite’s catalog has a single ‘root schema’ (analogous to the root
> directory, ‘/‘ in file systems), and you can get to anything else from
> there.
>
> In JDBC parlance, a a level 1 namespace is called ‘catalog’, and a level 2
> namespace is a ’schema’. If a DB has a one-level namespace then catalog
> will be null, or the empty string, or something.
>
> If you’re running an Avatica JDBC server backed by a particular Calcite
> root schema, and you want your database to look like a one-level or
> two-level database, we probably don’t make it particularly easy.
>
> Julian
>
>
> > On Jan 27, 2022, at 7:25 AM, Gavin Ray <ra...@gmail.com> wrote:
> >
> > My RDBMS experience is nearly exclusively Postgres
> > While working on this project, I've made the assumption that the
> structure
> > of a database is:
> >
> > Database -> Schema -> Table
> >
> > It turns out that this isn't accurate. In MySQL for instance, "Schema" is
> > an alias for "DB".
> > From the below StackOverflow answer, it seems like this is all over the
> > place:
> >
> > https://stackoverflow.com/a/7944489/13485494
> >
> > I have a "CalciteSchemaManager" object which has a "rootSchema" to which
> > all datasources are attached
> > This "rootSchema" is used to generate the GraphQL API and types
> >
> > It seems like I have two options, and I'm not sure which is a better
> design:
> >
> > 1. Force all datasources to conform to (Database -> Schema -> Table)
> >
> > This means that adding a new MySQL database, would generate ("mysql_db"
> ->
> > "root" (fake schema) -> "some_table")
> > Adding a CSV schema too, would be something like ("csv_datasource" ->
> > "root" -> "some_csv_file")
> >
> > 2. Have an irregular data shape. Datasources can be of arbitrary
> sub-schema
> > depth.
> >
> > Example Postgres: ("pg_db_1" -> "public" -> "user")
> > Example MySQL:   ("mysql_db_1" -> "user")
> > Example CSV: ("some_csv_file") or maybe ("csv_source_1" ->
> "some_csv_file")
> >
> > What do you folks think I ought to do?
> > Thank you =)
>
>

Re: How to handle the fact that different databases have different ideas of what a "db"/"schema" are?

Posted by Julian Hyde <jh...@gmail.com>.
Let’s not worry about the names, and say that some DBs have two namespace levels and others have just one.

Calcite’s word for a namespace is ’schema’. Calcite schemas are arranged in a hierarchy, like a filesystem, so there is no preferred depth. Any schema can contain both tables and (sub)schemas. So you can easily built a one- or two-level namespace structure, or whatever you want.

Calcite’s catalog has a single ‘root schema’ (analogous to the root directory, ‘/‘ in file systems), and you can get to anything else from there.

In JDBC parlance, a a level 1 namespace is called ‘catalog’, and a level 2 namespace is a ’schema’. If a DB has a one-level namespace then catalog will be null, or the empty string, or something.

If you’re running an Avatica JDBC server backed by a particular Calcite root schema, and you want your database to look like a one-level or two-level database, we probably don’t make it particularly easy.

Julian


> On Jan 27, 2022, at 7:25 AM, Gavin Ray <ra...@gmail.com> wrote:
> 
> My RDBMS experience is nearly exclusively Postgres
> While working on this project, I've made the assumption that the structure
> of a database is:
> 
> Database -> Schema -> Table
> 
> It turns out that this isn't accurate. In MySQL for instance, "Schema" is
> an alias for "DB".
> From the below StackOverflow answer, it seems like this is all over the
> place:
> 
> https://stackoverflow.com/a/7944489/13485494
> 
> I have a "CalciteSchemaManager" object which has a "rootSchema" to which
> all datasources are attached
> This "rootSchema" is used to generate the GraphQL API and types
> 
> It seems like I have two options, and I'm not sure which is a better design:
> 
> 1. Force all datasources to conform to (Database -> Schema -> Table)
> 
> This means that adding a new MySQL database, would generate ("mysql_db" ->
> "root" (fake schema) -> "some_table")
> Adding a CSV schema too, would be something like ("csv_datasource" ->
> "root" -> "some_csv_file")
> 
> 2. Have an irregular data shape. Datasources can be of arbitrary sub-schema
> depth.
> 
> Example Postgres: ("pg_db_1" -> "public" -> "user")
> Example MySQL:   ("mysql_db_1" -> "user")
> Example CSV: ("some_csv_file") or maybe ("csv_source_1" -> "some_csv_file")
> 
> What do you folks think I ought to do?
> Thank you =)