Posted to dev@calcite.apache.org by Rajat Venkatesh <rv...@qubole.com> on 2016/01/21 07:39:30 UTC

Introducing Quark - Mat. Views & Cubes across DBs

Amogh and I (developers at Qubole) have been working on a project - Quark -
https://github.com/qubole/quark/ - to provide a unified view of data spread
across many databases. Two concrete examples:
1. Hot data is stored in a data warehouse (Redshift, Vertica etc) and cold
data is stored in HDFS and accessed through Apache Hive.
2. Cubes are stored in Redshift and the base tables are stored in HDFS.
Similarly, base tables may be stored in Redshift and cubes in Postgres.

Data analysts will query hot data or cubes but often have to cross over to
the cold data or base tables. At scale this setup gets complicated in
multiple dimensions (no pun intended): analysts have to keep track of which
dataset to use, and they have to be trained to use different technologies
and interfaces.

So there is a requirement to provide a single interface to the data spread
across multiple data stores, e.g. through Tableau or Apache Zeppelin.
Quark is an optimizer based on Apache Calcite that models these
relationships as materialized views or lattices and reroutes queries to the
optimal dataset. Note that Quark is *not* a federation engine. It does not
join data across databases. It can integrate with Presto or Hive for
federation but the preferred option is to run a query in a single database.

This is an example where materialized views and cubes are setup between
Hive (on EMR) and Redshift:
https://github.com/qubole/quark/blob/master/examples/EMR.md
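To give a concrete flavor of the idea, here is a minimal sketch of what a
materialized-view entry in a JSON configuration could look like. The schema,
field names and table names below are illustrative assumptions for this
post, not Quark's actual format; see the linked example for the real setup.

```json
{
  "views": [
    {
      "name": "PAGE_VIEWS_HOT",
      "table": "REDSHIFT.PUBLIC.PAGE_VIEWS_RECENT",
      "query": "SELECT * FROM HIVE.DEFAULT.PAGE_VIEWS WHERE DT >= '2016-01-01'"
    }
  ]
}
```

The intent is that a query against the Hive base table whose filter is
subsumed by the view's predicate can be rerouted to the Redshift copy.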

Quark relies quite a bit on Apache Calcite for the heavy lifting. It uses
the optimizer for determining the dataset and database to run the queries
on. It uses the execution engine to run the query. It is distributed as a
JDBC jar and it uses Avatica for the JDBC implementation. We are very
grateful to the community and Julian Hyde for all the help. We have made a
few small contributions as part of building Quark. We are pushing Lattices
and the Materialization Service to their limits and have made major
changes. We will start a thread discussing the issues we faced and designs
to solve them. Hopefully we can contribute the work back to the project
through the usual process.

The cube use case is very similar to Apache Kylin. I looked at Kylin quite
a bit initially and learned a lot from it. A couple of big differences are
that Quark does not build or maintain cubes and it does not insist on Hive,
Hadoop and HBase. These requirements were driven by our users who already
have an ETL process to build cubes. They've also decided on the
technologies, e.g. Redshift, Postgres or Oracle. It looks like there is a
push to generalize Apache Kylin -
https://issues.apache.org/jira/browse/KYLIN-1351. Finally, we have use cases
outside of OLAP cubes, like supporting copies of data that are sorted
differently (like Vertica's projections) or stored in a fast DWH.

To summarize, we have built another application on top of Apache Calcite
that our users are very excited about. It's another strong vote for the
quality and utility of Apache Calcite. We are very happy to be part of the
community and hope to contribute back.

Re: Introducing Quark - Mat. Views & Cubes across DBs

Posted by Sarnath <st...@gmail.com>.
Sure. This is great. I will catch up in the first week of Feb on the Quark
dev group. Thanks. It's great news. Sometimes laziness pays ;) I mean, I
had wanted to do this myself. Now we have Quark and it can do the heavy
lifting! Thanks!

Re: Introducing Quark - Mat. Views & Cubes across DBs

Posted by Rajat Venkatesh <rv...@qubole.com>.
Yes. There are a DataSourceFactory and a DataSource interface defined. An
ElasticSearch integration has to define a concrete class and register it
through the JSON config. Quark will delegate the query to ElasticSearch if
the cube can be used. We are very interested in supporting ElasticSearch.
Can we discuss more on quark-dev@googlegroups.com ?
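For illustration, registering such a data source through the JSON config
might look roughly like the sketch below. The factory class name, property
keys and overall schema here are assumptions made up for this example, not
Quark's actual configuration format.

```json
{
  "dataSources": [
    {
      "name": "ES_CUBES",
      "type": "ELASTICSEARCH",
      "factory": "com.example.quark.ElasticSearchFactory",
      "url": "http://localhost:9200"
    }
  ]
}
```

The concrete factory class would then be responsible for translating the
rerouted query into ElasticSearch requests.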


Re: Introducing Quark - Mat. Views & Cubes across DBs

Posted by Sarnath <st...@gmail.com>.
I.e. my base table is in Hive and the cube is stored in ElasticSearch. Can
I now benefit from Quark?

Re: Introducing Quark - Mat. Views & Cubes across DBs

Posted by Sarnath <st...@gmail.com>.
Yes I got that part. From my side, I have a cube built in ElasticSearch.
Can I benefit from Quark? If so, how do I interface the cube to Quark?

Re: Introducing Quark - Mat. Views & Cubes across DBs

Posted by Rajat Venkatesh <rv...@qubole.com>.
Thanks. I don't think I have understood your question. As the ETL engineer
who builds and maintains the cube, your workflow will not change; there is
an additional step to keep Quark metadata updated.
As an analyst, you don't have to know about the cube. You submit queries on
the base tables and Quark will route them appropriately. Quark is a JDBC
jar, so analysts can use sqlline or Zeppelin or other JDBC apps to use
Quark.
Hope that answers your question.
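As a rough sketch of what the analyst-side setup could look like from the
command line (the jar names, driver class and JDBC URL format below are
guesses for illustration, not Quark's documented interface):

```
# Hypothetical invocation: jar names, driver class and URL format
# are assumptions, not Quark's documented interface.
java -cp "quark-jdbc.jar:sqlline.jar" sqlline.SqlLine \
  -d com.qubole.quark.jdbc.QuarkDriver \
  -u "jdbc:quark:json:/path/to/quark-model.json" \
  -n analyst -p password
```

From there the analyst writes plain SQL against the base tables and the
optimizer decides where each query actually runs.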

Re: Introducing Quark - Mat. Views & Cubes across DBs

Posted by Sarnath <st...@gmail.com>.
Hi Rajat,
Congrats! What interfaces do you support for lattice providers? I.e. if I
create a lattice, how do I benefit from your software?
Best,
Sarnath

Re: Introducing Quark - Mat. Views & Cubes across DBs

Posted by Rajat Venkatesh <rv...@qubole.com>.
Hi Hongbin,
Thanks for the interest. We are still maturing as an open source project :)
I am setting up a Google group. I'll send out the info once it's in place.
WRT the roadmap, the current focus is on maturity. It was an experimental
project for the longest time and now we are starting to onboard users, so
we are unearthing issues. For example, I mentioned the scalability of the
materialization service in another thread. In the medium term, we are
planning to handle incremental cubes better, support federation after
optimizing the query (by delegating execution to another engine like
Presto), and support more engines.
Qubole is definitely invested in the project for the foreseeable future.
I'll set up the Google group and I can answer any specific questions you
have.

On Thu, Jan 21, 2016 at 12:51 PM hongbin ma <ma...@apache.org> wrote:

> hi Rajat
>
> this is Hongbin Ma from Apache Kylin. I'm very interested in Quark, which
> in my opinion shares a lot in common with Kylin. Actually I believe Kylin
> itself may benefit from Quark, too. Can you also please share your roadmap
> with the community? (People may be very interested in how sustainably your
> corp can invest in Quark, etc.)
>
> Do you have a dev mailing list now? I'd love to contribute to the project.
> A mailing list is what I personally prefer.

Re: Introducing Quark - Mat. Views & Cubes across DBs

Posted by hongbin ma <ma...@apache.org>.
hi Rajat

this is Hongbin Ma from Apache Kylin. I'm very interested in Quark, which
in my opinion shares a lot in common with Kylin. Actually I believe Kylin
itself may benefit from Quark, too. Can you also please share your roadmap
with the community? (People may be very interested in how sustainably your
corp can invest in Quark, etc.)

Do you have a dev mailing list now? I'd love to contribute to the project.
A mailing list is what I personally prefer.





-- 
Regards,

*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone