Posted to dev@apex.apache.org by Chinmay Kolhatkar <ch...@datatorrent.com> on 2016/10/05 11:56:25 UTC

Re: APEXMALHAR-1818 : Apex-Calcite Integration

Dear Community,

On the review-only PR (https://github.com/apache/apex-malhar/pull/432) that
I created for the first phase of Calcite integration, I've received good
feedback for improvement from Julian (the Calcite PMC Chair), Tushar and
Yogi. All of the comments have been addressed.

I think it's time to remove the "Review Only" tag from this PR and propose
it for inclusion in Apex Malhar.

Please share your opinion.

Thanks,
Chinmay.


On Wed, Sep 28, 2016 at 9:03 PM, Chinmay Kolhatkar <ch...@datatorrent.com>
wrote:

> Dear Community,
>
> I've created a review only PR for first phase of calcite integration.
> https://github.com/apache/apex-malhar/pull/432
>
> Features implemented in this PR are as follows:
> 1. SELECT STATEMENT
> 2. INSERT STATEMENT
> 3. INNER JOIN with non-empty equi join condition
> 4. WHERE clause
> 5. SCALAR functions implemented in Calcite are ready to use
> 6. Custom scalar functions can be registered.
> 7. Endpoint can be File, Kafka, or Streaming Port for both input and
> output
> 8. CSV data format implemented for both input and output sides.
> 9. Static loading of the Calcite JDBC driver.
> 10. Tested in local as well as cluster mode.
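Feature 6 above (registering custom scalar functions) works by pointing the environment at a public static method, as the registerFunction call later in this thread illustrates. Below is a minimal, self-contained sketch of such a function; the class name here is hypothetical (the thread itself uses FileEndpointTest.class), and only the general mechanism is shown:

```java
// Hypothetical holder class for custom scalar functions. Calcite exposes
// a public static method as a SQL scalar function, mapping SQL arguments
// to Java parameters by position.
public class CustomFunctions {

  // Illustrative function mirroring the APEXCONCAT usage shown later in
  // this thread: simple string concatenation of the two arguments.
  public static String apex_concat_str(String s1, String s2) {
    return s1 + s2;
  }
}
```

It would then be registered along the lines of `.registerFunction("APEXCONCAT", CustomFunctions.class, "apex_concat_str")`, after which SQL statements can call `APEXCONCAT(...)`.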
>
> I request everyone to please review the PR and provide feedback.
>
> Thanks,
> Chinmay.
>
>
> On Tue, Sep 20, 2016 at 11:16 AM, Chinmay Kolhatkar <
> chinmay@datatorrent.com> wrote:
>
>> Hi All,
>>
>> I wanted to give a quick update on Apex-Calcite integration work.
>>
>> Currently I'm able to run a SQL statement as a DAG against registered table
>> abstractions of data endpoint and message type.
>>
>> Here is the SQL support that is currently implemented:
>> 1. Data Endpoint (Source/Destination):
>>    - File
>>    - Kafka
>> 2. Message Types from Data endpoint (source/destination):
>>    - CSV
>> 3. SQL Functionality Support:
>>    - SELECT (Projection) - Select from Source
>>    - INSERT - Insert into Destination
>>    - WHERE (Filter)
>>    - Scalar functions which are provided in Calcite core
>>    - Custom scalar functions can be defined and provided to SQL.
>> 4. A table can be defined as an abstraction of a Data Endpoint (source/dest)
>> and message type
>>
>> Currently, Calcite integration with Apex is exposed as a small piece of
>> boilerplate code in populateDAG, as follows:
>>
>> SQLExecEnvironment.getEnvironment(dag)
>>     .registerTable("ORDERS", new KafkaEndpoint(broker, sourceTopic,
>>         new CSVMessageFormat(schemaIn)))
>>     .registerTable("SALES", new KafkaEndpoint(broker, destTopic,
>>         new CSVMessageFormat(schemaOut)))
>>     .registerFunction("APEXCONCAT", FileEndpointTest.class, "apex_concat_str")
>>     .executeSQL("INSERT INTO SALES " +
>>         "SELECT STREAM ROWTIME, " +
>>         "FLOOR(ROWTIME TO DAY), " +
>>         "APEXCONCAT('OILPAINT', SUBSTRING(PRODUCT, 6, 7)) " +
>>         "FROM ORDERS WHERE ID > 3 " +
>>         "AND PRODUCT LIKE 'paint%'");
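For readability, the concatenated string passed to executeSQL above assembles into the following statement (joined verbatim from the string literals in the snippet):

```sql
INSERT INTO SALES
SELECT STREAM ROWTIME,
       FLOOR(ROWTIME TO DAY),
       APEXCONCAT('OILPAINT', SUBSTRING(PRODUCT, 6, 7))
FROM ORDERS
WHERE ID > 3 AND PRODUCT LIKE 'paint%'
```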
>>
>> Following is a video recording of the demo of the Apex-Calcite integration:
>> https://drive.google.com/open?id=0B_Tb-ZDtsUHeUVM5NWRYSFg0Z3c
>>
>> Currently I'm working on adding the inner join functionality.
>> Once it is implemented, I think the code will be ready for a Review Only
>> PR for the first cut of Calcite integration.
>>
>> Please share your opinion on the above.
>>
>> Thanks,
>> Chinmay.
>>
>>
>> On Fri, Aug 12, 2016 at 9:55 PM, Chinmay Kolhatkar <
>> chinmay@datatorrent.com> wrote:
>>
>>> Hi All,
>>>
>>> I wanted to give an update on the Apex-Calcite integration work being
>>> done, for visibility and feedback from the community.
>>>
>>> In the first phase, the target is to use the Calcite core library for SQL
>>> parsing and transformation of relational algebra to Apex-specific
>>> components (operators).
>>> Once this is achieved, one will be able to define inputs and outputs
>>> using a Calcite model file and define the processing from input to output
>>> using a SQL statement.
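For context, the Calcite model file mentioned here is a JSON document. The general shape below follows Calcite's documented model format; the schema/table names, factory class, and operand keys are placeholders for illustration, not the actual classes from this work:

```json
{
  "version": "1.0",
  "defaultSchema": "APEX",
  "schemas": [
    {
      "name": "APEX",
      "tables": [
        {
          "name": "ORDERS",
          "type": "custom",
          "factory": "com.example.HypotheticalTableFactory",
          "operand": { "file": "input.csv" }
        }
      ]
    }
  ]
}
```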
>>>
>>> The status of the above work as of now is as follows:
>>> 1. I'm able to traverse the relational algebra for a simple select statement.
>>> 2. A DAG is generated for the simple statement SELECT STREAM * FROM
>>> TABLE.
>>> 3. The DAG is getting validated.
>>> 4. Operators are being configured with properties and streams; the schema
>>> is also being set using the TUPLE_CLASS attr. For the schema, the class is
>>> generated on the fly and put on the classpath using the LIBRARY_JAR attr.
>>> 5. Able to run the generated DAG in local mode.
>>> 6. The code is currently being developed here (WIP):
>>> For ease of development, and with the code being fairly large, I've
>>> added a new module, malhar-sql, in Malhar in my fork. But I'm open to
>>> other suggestions here.
>>> https://github.com/chinmaykolhatkar/apex-malhar/tree/calcite/sql
>>>
>>> Next steps:
>>> 1. Run the generated DAG in distributed mode.
>>> 2. Expand the source and destination definitions (Calcite model file) to
>>> include Kafka as a source and destination.
>>> 3. Expand the scope to include the filter operator (WHERE clause, and
>>> HAVING too if possible) and inner join when it gets merged.
>>> 4. Write extensive unit tests for the above.
>>>
>>> I'll send an update on this thread at each logical milestone.
>>>
>>> I request the community to provide feedback on the above approach/targets
>>> and, if possible, take a look at the code at the above link.
>>>
>>> Thanks,
>>> Chinmay.
>>>
>>>
>>
>

Re: APEXMALHAR-1818 : Apex-Calcite Integration

Posted by Thomas Weise <th...@apache.org>.
This is great news. Given the size of this PR, we should probably allow for
more time for review and feedback.

