You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by Andrew Pilloud <ap...@google.com> on 2018/04/23 23:55:53 UTC

[SQL] Reconciling Beam SQL Environments with Calcite Schema

I'm working on updating our Beam DDL code to use the DDL execution
functionality that recently merged into core calcite. This enables us to
take advantage of Calcite JDBC as a way to use Beam SQL. As part of that I
need to reconcile the Beam SQL Environments with the Calcite Schema (which
is calcite's environment). We currently have copies of our tables in the
Beam meta/store, Calcite Schema, BeamSqlEnv, and BeamQueryPlanner. I have a
pending PR which merges the later two to just use the Calcite Schema copy.
Merging the Beam MetaStore and Calcite Schema isn't as simple. I have two
options I'm looking for feedback on:

1. Make Calcite Schema authoritative and demote MetaStore to be something
more like a Calcite TableFactory. Calcite Schema already implements the
semantics of our InMemoryMetaStore. If the Store interface is just over
built, this approach would result in a significant reduction in code. This
would however eliminate the CRUD part of the interface leaving just the
buildBeamSqlTable function.

2. Pass the Beam MetaStore into Calcite wrapped with a class translating to
Calcite Schema (like we do already with tables). Instead of copying tables
into the Calcite Schema we would pass in Beam meta/store as the source of
truth and Calcite would manipulate tables directly in the Beam meta/store.
This is a bit more complicated but retains the ability for DDL operations
to be processed by a custom MetaStore.

Thoughts?

Andrew

Re: [SQL] Reconciling Beam SQL Environments with Calcite Schema

Posted by Andrew Pilloud <ap...@google.com>.
Reviews are wrapping up, this will probably merge Monday if I don't hear
from anyone else. One more TableProvider API change after review feedback:
getTables now returns Map<String, Table> instead of Set<Table>.

Andrew

On Thu, May 3, 2018 at 10:41 AM Andrew Pilloud <ap...@google.com> wrote:

> Ok, I've finished with this change. Didn't get reviews on the early
> cleanup PRs, so I've pushed all these changes into the first cleanup PR:
> https://github.com/apache/beam/pull/5224
>
> Andrew
>
> On Tue, May 1, 2018 at 10:35 AM Andrew Pilloud <ap...@google.com>
> wrote:
>
>> I'm just starting to move forward on this. Looking at my team's short
>> term needs for SQL, option one would be good enough, however I agree with
>> Kenn that we want something like option two eventually. I also don't want
>> to break existing users and it sounds like there is at least one custom
>> MetaStore not in beam. So my plan is to go with option two and simplify the
>> interface where functionality loss will not result.
>>
>> There is a common set of operations between the MetaStore and the
>> TableProvider. I'd like to make MetaStore inherit the interface of
>> TableProvider. Most operations we need (createTable, dropTable, listTables)
>> are already identical between the two, and so this will have no impact on
>> custom implementations. The buildBeamSqlTable operation does differ: the
>> MetaStore takes a table name, the TableProvider takes a table object.
>> However everything calling this API already has the full table object, so I
>> would like to simplify this interface by passing the table object in both
>> cases. Objections?
>>
>> Andrew
>>
>> On Tue, Apr 24, 2018 at 9:27 AM James <xu...@gmail.com> wrote:
>>
>>> Kenn: yes, MetaStore is user-facing, Users can choose to implement their
>>> own MetaStore, currently only an InMemory implementation in Beam CodeBase.
>>>
>>> Andrew: I like the second option, since it "retains the ability for DDL
>>> operations to be processed by a custom MetaStore.", IMO we should have the
>>> DDL ability as a fully functional SQL.
>>>
>>> On Tue, Apr 24, 2018 at 10:28 PM Kenneth Knowles <kl...@google.com> wrote:
>>>
>>>> Can you say more about how the metastore is used? I presume it is or
>>>> will be user-facing, so are Beam SQL users already providing their own?
>>>>
>>>> I'm sure we want something like that eventually to support things like
>>>> Apache Atlas and HCatalog, IIUC for the "create if needed" logic when using
>>>> Beam SQL to create a derived data set. But I don't think we should build
>>>> out those code paths until we have at least one non-in-memory
>>>> implementation.
>>>>
>>>> Just a really high level $0.02.
>>>>
>>>> Kenn
>>>>
>>>> On Mon, Apr 23, 2018 at 4:56 PM Andrew Pilloud <ap...@google.com>
>>>> wrote:
>>>>
>>>>> I'm working on updating our Beam DDL code to use the DDL execution
>>>>> functionality that recently merged into core calcite. This enables us to
>>>>> take advantage of Calcite JDBC as a way to use Beam SQL. As part of that I
>>>>> need to reconcile the Beam SQL Environments with the Calcite Schema (which
>>>>> is calcite's environment). We currently have copies of our tables in the
>>>>> Beam meta/store, Calcite Schema, BeamSqlEnv, and BeamQueryPlanner. I have a
>>>>> pending PR which merges the later two to just use the Calcite Schema copy.
>>>>> Merging the Beam MetaStore and Calcite Schema isn't as simple. I have
>>>>> two options I'm looking for feedback on:
>>>>>
>>>>> 1. Make Calcite Schema authoritative and demote MetaStore to be
>>>>> something more like a Calcite TableFactory. Calcite Schema already
>>>>> implements the semantics of our InMemoryMetaStore. If the Store interface
>>>>> is just over built, this approach would result in a significant reduction
>>>>> in code. This would however eliminate the CRUD part of the interface
>>>>> leaving just the buildBeamSqlTable function.
>>>>>
>>>>> 2. Pass the Beam MetaStore into Calcite wrapped with a class
>>>>> translating to Calcite Schema (like we do already with tables). Instead of
>>>>> copying tables into the Calcite Schema we would pass in Beam meta/store as
>>>>> the source of truth and Calcite would manipulate tables directly in the
>>>>> Beam meta/store. This is a bit more complicated but retains the ability for
>>>>> DDL operations to be processed by a custom MetaStore.
>>>>>
>>>>> Thoughts?
>>>>>
>>>>> Andrew
>>>>>
>>>>

Re: [SQL] Reconciling Beam SQL Environments with Calcite Schema

Posted by Andrew Pilloud <ap...@google.com>.
Ok, I've finished with this change. Didn't get reviews on the early cleanup
PRs, so I've pushed all these changes into the first cleanup PR:
https://github.com/apache/beam/pull/5224

Andrew

On Tue, May 1, 2018 at 10:35 AM Andrew Pilloud <ap...@google.com> wrote:

> I'm just starting to move forward on this. Looking at my team's short term
> needs for SQL, option one would be good enough, however I agree with Kenn
> that we want something like option two eventually. I also don't want to
> break existing users and it sounds like there is at least one custom
> MetaStore not in beam. So my plan is to go with option two and simplify the
> interface where functionality loss will not result.
>
> There is a common set of operations between the MetaStore and the
> TableProvider. I'd like to make MetaStore inherit the interface of
> TableProvider. Most operations we need (createTable, dropTable, listTables)
> are already identical between the two, and so this will have no impact on
> custom implementations. The buildBeamSqlTable operation does differ: the
> MetaStore takes a table name, the TableProvider takes a table object.
> However everything calling this API already has the full table object, so I
> would like to simplify this interface by passing the table object in both
> cases. Objections?
>
> Andrew
>
> On Tue, Apr 24, 2018 at 9:27 AM James <xu...@gmail.com> wrote:
>
>> Kenn: yes, MetaStore is user-facing, Users can choose to implement their
>> own MetaStore, currently only an InMemory implementation in Beam CodeBase.
>>
>> Andrew: I like the second option, since it "retains the ability for DDL
>> operations to be processed by a custom MetaStore.", IMO we should have the
>> DDL ability as a fully functional SQL.
>>
>> On Tue, Apr 24, 2018 at 10:28 PM Kenneth Knowles <kl...@google.com> wrote:
>>
>>> Can you say more about how the metastore is used? I presume it is or
>>> will be user-facing, so are Beam SQL users already providing their own?
>>>
>>> I'm sure we want something like that eventually to support things like
>>> Apache Atlas and HCatalog, IIUC for the "create if needed" logic when using
>>> Beam SQL to create a derived data set. But I don't think we should build
>>> out those code paths until we have at least one non-in-memory
>>> implementation.
>>>
>>> Just a really high level $0.02.
>>>
>>> Kenn
>>>
>>> On Mon, Apr 23, 2018 at 4:56 PM Andrew Pilloud <ap...@google.com>
>>> wrote:
>>>
>>>> I'm working on updating our Beam DDL code to use the DDL execution
>>>> functionality that recently merged into core calcite. This enables us to
>>>> take advantage of Calcite JDBC as a way to use Beam SQL. As part of that I
>>>> need to reconcile the Beam SQL Environments with the Calcite Schema (which
>>>> is calcite's environment). We currently have copies of our tables in the
>>>> Beam meta/store, Calcite Schema, BeamSqlEnv, and BeamQueryPlanner. I have a
>>>> pending PR which merges the later two to just use the Calcite Schema copy.
>>>> Merging the Beam MetaStore and Calcite Schema isn't as simple. I have
>>>> two options I'm looking for feedback on:
>>>>
>>>> 1. Make Calcite Schema authoritative and demote MetaStore to be
>>>> something more like a Calcite TableFactory. Calcite Schema already
>>>> implements the semantics of our InMemoryMetaStore. If the Store interface
>>>> is just over built, this approach would result in a significant reduction
>>>> in code. This would however eliminate the CRUD part of the interface
>>>> leaving just the buildBeamSqlTable function.
>>>>
>>>> 2. Pass the Beam MetaStore into Calcite wrapped with a class
>>>> translating to Calcite Schema (like we do already with tables). Instead of
>>>> copying tables into the Calcite Schema we would pass in Beam meta/store as
>>>> the source of truth and Calcite would manipulate tables directly in the
>>>> Beam meta/store. This is a bit more complicated but retains the ability for
>>>> DDL operations to be processed by a custom MetaStore.
>>>>
>>>> Thoughts?
>>>>
>>>> Andrew
>>>>
>>>

Re: [SQL] Reconciling Beam SQL Environments with Calcite Schema

Posted by Andrew Pilloud <ap...@google.com>.
I'm just starting to move forward on this. Looking at my team's short term
needs for SQL, option one would be good enough, however I agree with Kenn
that we want something like option two eventually. I also don't want to
break existing users and it sounds like there is at least one custom
MetaStore not in beam. So my plan is to go with option two and simplify the
interface where functionality loss will not result.

There is a common set of operations between the MetaStore and the
TableProvider. I'd like to make MetaStore inherit the interface of
TableProvider. Most operations we need (createTable, dropTable, listTables)
are already identical between the two, and so this will have no impact on
custom implementations. The buildBeamSqlTable operation does differ: the
MetaStore takes a table name, the TableProvider takes a table object.
However everything calling this API already has the full table object, so I
would like to simplify this interface by passing the table object in both
cases. Objections?

Andrew

On Tue, Apr 24, 2018 at 9:27 AM James <xu...@gmail.com> wrote:

> Kenn: yes, MetaStore is user-facing, Users can choose to implement their
> own MetaStore, currently only an InMemory implementation in Beam CodeBase.
>
> Andrew: I like the second option, since it "retains the ability for DDL
> operations to be processed by a custom MetaStore.", IMO we should have the
> DDL ability as a fully functional SQL.
>
> On Tue, Apr 24, 2018 at 10:28 PM Kenneth Knowles <kl...@google.com> wrote:
>
>> Can you say more about how the metastore is used? I presume it is or will
>> be user-facing, so are Beam SQL users already providing their own?
>>
>> I'm sure we want something like that eventually to support things like
>> Apache Atlas and HCatalog, IIUC for the "create if needed" logic when using
>> Beam SQL to create a derived data set. But I don't think we should build
>> out those code paths until we have at least one non-in-memory
>> implementation.
>>
>> Just a really high level $0.02.
>>
>> Kenn
>>
>> On Mon, Apr 23, 2018 at 4:56 PM Andrew Pilloud <ap...@google.com>
>> wrote:
>>
>>> I'm working on updating our Beam DDL code to use the DDL execution
>>> functionality that recently merged into core calcite. This enables us to
>>> take advantage of Calcite JDBC as a way to use Beam SQL. As part of that I
>>> need to reconcile the Beam SQL Environments with the Calcite Schema (which
>>> is calcite's environment). We currently have copies of our tables in the
>>> Beam meta/store, Calcite Schema, BeamSqlEnv, and BeamQueryPlanner. I have a
>>> pending PR which merges the later two to just use the Calcite Schema copy.
>>> Merging the Beam MetaStore and Calcite Schema isn't as simple. I have
>>> two options I'm looking for feedback on:
>>>
>>> 1. Make Calcite Schema authoritative and demote MetaStore to be
>>> something more like a Calcite TableFactory. Calcite Schema already
>>> implements the semantics of our InMemoryMetaStore. If the Store interface
>>> is just over built, this approach would result in a significant reduction
>>> in code. This would however eliminate the CRUD part of the interface
>>> leaving just the buildBeamSqlTable function.
>>>
>>> 2. Pass the Beam MetaStore into Calcite wrapped with a class translating
>>> to Calcite Schema (like we do already with tables). Instead of copying
>>> tables into the Calcite Schema we would pass in Beam meta/store as the
>>> source of truth and Calcite would manipulate tables directly in the Beam
>>> meta/store. This is a bit more complicated but retains the ability for DDL
>>> operations to be processed by a custom MetaStore.
>>>
>>> Thoughts?
>>>
>>> Andrew
>>>
>>

Re: [SQL] Reconciling Beam SQL Environments with Calcite Schema

Posted by James <xu...@gmail.com>.
Kenn: yes, MetaStore is user-facing, Users can choose to implement their
own MetaStore, currently only an InMemory implementation in Beam CodeBase.

Andrew: I like the second option, since it "retains the ability for DDL
operations to be processed by a custom MetaStore.", IMO we should have the
DDL ability as a fully functional SQL.

On Tue, Apr 24, 2018 at 10:28 PM Kenneth Knowles <kl...@google.com> wrote:

> Can you say more about how the metastore is used? I presume it is or will
> be user-facing, so are Beam SQL users already providing their own?
>
> I'm sure we want something like that eventually to support things like
> Apache Atlas and HCatalog, IIUC for the "create if needed" logic when using
> Beam SQL to create a derived data set. But I don't think we should build
> out those code paths until we have at least one non-in-memory
> implementation.
>
> Just a really high level $0.02.
>
> Kenn
>
> On Mon, Apr 23, 2018 at 4:56 PM Andrew Pilloud <ap...@google.com>
> wrote:
>
>> I'm working on updating our Beam DDL code to use the DDL execution
>> functionality that recently merged into core calcite. This enables us to
>> take advantage of Calcite JDBC as a way to use Beam SQL. As part of that I
>> need to reconcile the Beam SQL Environments with the Calcite Schema (which
>> is calcite's environment). We currently have copies of our tables in the
>> Beam meta/store, Calcite Schema, BeamSqlEnv, and BeamQueryPlanner. I have a
>> pending PR which merges the later two to just use the Calcite Schema copy.
>> Merging the Beam MetaStore and Calcite Schema isn't as simple. I have
>> two options I'm looking for feedback on:
>>
>> 1. Make Calcite Schema authoritative and demote MetaStore to be
>> something more like a Calcite TableFactory. Calcite Schema already
>> implements the semantics of our InMemoryMetaStore. If the Store interface
>> is just over built, this approach would result in a significant reduction
>> in code. This would however eliminate the CRUD part of the interface
>> leaving just the buildBeamSqlTable function.
>>
>> 2. Pass the Beam MetaStore into Calcite wrapped with a class translating
>> to Calcite Schema (like we do already with tables). Instead of copying
>> tables into the Calcite Schema we would pass in Beam meta/store as the
>> source of truth and Calcite would manipulate tables directly in the Beam
>> meta/store. This is a bit more complicated but retains the ability for DDL
>> operations to be processed by a custom MetaStore.
>>
>> Thoughts?
>>
>> Andrew
>>
>

Re: [SQL] Reconciling Beam SQL Environments with Calcite Schema

Posted by Kenneth Knowles <kl...@google.com>.
Can you say more about how the metastore is used? I presume it is or will
be user-facing, so are Beam SQL users already providing their own?

I'm sure we want something like that eventually to support things like
Apache Atlas and HCatalog, IIUC for the "create if needed" logic when using
Beam SQL to create a derived data set. But I don't think we should build
out those code paths until we have at least one non-in-memory
implementation.

Just a really high level $0.02.

Kenn

On Mon, Apr 23, 2018 at 4:56 PM Andrew Pilloud <ap...@google.com> wrote:

> I'm working on updating our Beam DDL code to use the DDL execution
> functionality that recently merged into core calcite. This enables us to
> take advantage of Calcite JDBC as a way to use Beam SQL. As part of that I
> need to reconcile the Beam SQL Environments with the Calcite Schema (which
> is calcite's environment). We currently have copies of our tables in the
> Beam meta/store, Calcite Schema, BeamSqlEnv, and BeamQueryPlanner. I have a
> pending PR which merges the later two to just use the Calcite Schema copy.
> Merging the Beam MetaStore and Calcite Schema isn't as simple. I have two
> options I'm looking for feedback on:
>
> 1. Make Calcite Schema authoritative and demote MetaStore to be something
> more like a Calcite TableFactory. Calcite Schema already implements the
> semantics of our InMemoryMetaStore. If the Store interface is just over
> built, this approach would result in a significant reduction in code. This
> would however eliminate the CRUD part of the interface leaving just the
> buildBeamSqlTable function.
>
> 2. Pass the Beam MetaStore into Calcite wrapped with a class translating
> to Calcite Schema (like we do already with tables). Instead of copying
> tables into the Calcite Schema we would pass in Beam meta/store as the
> source of truth and Calcite would manipulate tables directly in the Beam
> meta/store. This is a bit more complicated but retains the ability for DDL
> operations to be processed by a custom MetaStore.
>
> Thoughts?
>
> Andrew
>