Posted to dev@mrql.apache.org by "Edward J. Yoon" <ed...@apache.org> on 2013/12/11 03:33:33 UTC

[DISCUSS] the future direction of MRQL

All,

Since there are too many similar projects, I'd like to suggest that we
change the future direction of MRQL to a powerful *analytics* query
language on top of Hadoop beyond ETL processing. In my eyes,
supporting multiple platforms (MapReduce, Hama, Spark, etc.) also
seems pointless. WDYT?

-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: [DISCUSS] the future direction of MRQL

Posted by "Edward J. Yoon" <ed...@apache.org>.

Thanks for your clarification.

The only thing I worry about is that most people perceive MRQL as
another SQL on Hadoop. For example, look at the first sentence on our
website: 'MRQL is a query processing and optimization system for
large-scale, distributed data analysis, built on top of Apache Hadoop,
Hama, and Spark'. I think we need to push the scientific and complex
side of our project.

> big data analysis but soon this may change. I think people will start
> using fault-tolerant in-memory distributed systems for data analysis,

Agreed. BTW, should we continue to support MapReduce in the future?

> difference from others. The fact that it can run on multiple platforms
> is a big plus, but is secondary. Currently, most people use Hadoop for

A flexible and extensible execution-engine architecture can be a
plus. However, that model should be based on contributions from
diverse contributors. Since we're at a very early stage, I think we
should focus on the main execution engine. That's why I think
supporting multiple platforms is meaningless (at this time).

WDYT?

On Thu, Dec 12, 2013 at 12:15 AM, Leonidas Fegaras <fe...@cse.uta.edu> wrote:
> Thanks Edward,
> Our biggest concern is that there is no activity in the user@mrql
> list. Does this mean that there is no one using MRQL or that nobody
> posts any messages? Is there a way to get the number of people
> registered in this list? Can we also get the number of times MRQL has
> been downloaded from Apache mirrors after its first release? It was
> hoped that after the first release people would start downloading MRQL
> and would register at the user@mrql list to ask questions, report bugs, ask
> for new features, etc. It hasn't happened yet. Maybe it's too soon.
>
> There are other query languages for big data analysis in ASF. All
> except MRQL are SQL-based data warehousing systems for Hadoop
> (e.g., Hive and Tajo). MRQL is a query system for complex data analysis,
> including machine learning and scientific computing. This is the main
> difference from others. The fact that it can run on multiple platforms
> is a big plus, but is secondary. Currently, most people use Hadoop for
> big data analysis but soon this may change. I think people will start
> using fault-tolerant in-memory distributed systems for data analysis,
> such as Spark. Hama too may play a big role. So supporting multiple
> platforms will allow users to deploy applications using MRQL very fast
> and experiment with all these platforms without having to change the
> query. The whole idea of expressing distributed applications using an
> SQL-like query system is rapid and easy prototyping, without
> sacrificing performance. So performance is a very important factor.
> If MRQL is slow, nobody will use it. I think in this area, we are doing
> an excellent job because of the very advanced optimizer that allows
> operations such as matrix multiplication to be done using very fast
> algorithms.
>
> Leonidas Fegaras

-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: [DISCUSS] the future direction of MRQL

Posted by "Edward J. Yoon" <ed...@apache.org>.

>>  We need to do something drastic.

+1

>> Now Cloudera seems to want to support both Hadoop and Spark
>> But I doubt that Cloudera will be interested in MRQL.

Frankly, they are likely to have their own roadmap.

In my eyes, the war for dominance is still in progress. If you find it
difficult to join there, feel free to consider Hama as well. :-)


On Thu, Jan 23, 2014 at 9:40 AM, Leonidas Fegaras <fe...@cse.uta.edu> wrote:
> Hi Mark,
> I totally agree that Spark is the next big thing, but how can we get
> involved? I am not sure what integrating with Spark means. Are you
> proposing to make MRQL a subproject of Spark? I would love to see
> that happen, but Spark must agree first. Of course, we can always ask
> them if they are interested. Spark is ready to become a top Apache
> project soon, and has tremendous momentum and support from industry.
> MRQL on the other hand needs to grow considerably in order to
> graduate. So becoming a subproject of Spark will really help MRQL.
> We need to be careful how to sell MRQL to Spark. Spark already has an
> ML library (MLlib) for ML/scientific computations but I don't think
> they do any optimizations. On the other hand, MRQL does
> inter-operator optimizations, such as fusing a matrix multiplication
> with matrix transpose into a single operation. This may be a selling
> point.
>
> Your first point about using MRQL as a glue on top of multiple
> platforms, so that users can experiment with multiple platforms
> (Spark, MR, Hama, ...) without changing the query, was one of the
> original motivations behind the design of MRQL. Doing this
> automatically is a little more complex: MRQL must look at the
> available resources and take into consideration the quality-of-service
> specs given by the user (e.g., low latency or better fault tolerance?)
> to choose the best available platform to run the query. This can be
> a nice feature to add to MRQL. Now Cloudera seems to want to support
> both Hadoop and Spark on the same distribution. So MRQL can serve as
> a common language for this. But I doubt that Cloudera will be interested
> in MRQL. They have invested in their own query language (Impala), and
> I am guessing they will try to port it to Spark, so queries can run on
> both systems.
>
> So, it is good that we have this discussion. We need to do something
> drastic. And I think we need to act now, because of this tremendous
> interest in Spark and since MRQL already supports Spark.
> What do you think?
> Leonidas

-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: [DISCUSS] the future direction of MRQL

Posted by Leonidas Fegaras <fe...@cse.uta.edu>.

Hi Mark,
I totally agree that Spark is the next big thing, but how can we get
involved? I am not sure what integrating with Spark means. Are you
proposing to make MRQL a subproject of Spark? I would love to see
that happen, but Spark must agree first. Of course, we can always ask
them if they are interested. Spark is ready to become a top Apache
project soon, and has tremendous momentum and support from industry.
MRQL on the other hand needs to grow considerably in order to
graduate. So becoming a subproject of Spark will really help MRQL.
We need to be careful how to sell MRQL to Spark. Spark already has an
ML library (MLlib) for ML/scientific computations but I don't think
they do any optimizations. On the other hand, MRQL does
inter-operator optimizations, such as fusing a matrix multiplication
with matrix transpose into a single operation. This may be a selling
point.
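
As an illustration only (this is not MRQL's optimizer or the code it
generates), the following Python sketch shows what fusing a matrix
multiplication with a transpose means. The function names transpose,
matmul, and matmul_transposed are hypothetical. The fused version
computes A x B^T by reading the rows of B directly, so the transposed
matrix is never built as an intermediate result.

```python
def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain product: c[i][j] = sum over k of a[i][k] * b[k][j]."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def matmul_transposed(a, b):
    """Fused a x transpose(b): c[i][j] = sum over k of a[i][k] * b[j][k].

    Rows of b are used directly, so transpose(b) is never materialized.
    """
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b]
            for row_a in a]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

# Two operators (transpose, then multiply) versus one fused operator.
assert matmul(a, transpose(b)) == matmul_transposed(a, b)
```

In a distributed setting the saving would come from skipping the extra
pass that writes and re-reads the transposed matrix, which this
single-machine sketch does not model.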

Your first point about using MRQL as a glue on top of multiple
platforms, so that users can experiment with multiple platforms
(Spark, MR, Hama, ...) without changing the query, was one of the
original motivations behind the design of MRQL. Doing this
automatically is a little more complex: MRQL must look at the
available resources and take into consideration the quality-of-service
specs given by the user (e.g., low latency or better fault tolerance?)
to choose the best available platform to run the query. This can be
a nice feature to add to MRQL. Now Cloudera seems to want to support
both Hadoop and Spark on the same distribution. So MRQL can serve as
a common language for this. But I doubt that Cloudera will be interested
in MRQL. They have invested in their own query language (Impala), and
I am guessing they will try to port it to Spark, so queries can run on
both systems.
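
The automatic platform choice described above could be sketched roughly
as follows. This is a hypothetical illustration, not an existing MRQL
feature: the Platform class, the choose_platform function, and the
capability flags assigned to each platform are all assumptions made up
for the example, not measurements of the real systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Platform:
    name: str
    low_latency: bool      # placeholder flag, not a benchmark result
    fault_tolerant: bool   # placeholder flag, not a benchmark result
    available: bool        # whether the cluster currently offers it


def choose_platform(platforms, want_low_latency=False,
                    want_fault_tolerance=False):
    """Pick the available platform matching the most requested QoS specs."""
    candidates = [p for p in platforms if p.available]
    if not candidates:
        raise ValueError("no platform is available")

    def score(p):
        # Count how many of the requested properties this platform offers.
        return (int(want_low_latency and p.low_latency)
                + int(want_fault_tolerance and p.fault_tolerant))

    # max() keeps the first candidate when scores tie.
    return max(candidates, key=score)


# Invented capability flags, for illustration only.
platforms = [
    Platform("MapReduce", low_latency=False, fault_tolerant=True, available=True),
    Platform("Spark", low_latency=True, fault_tolerant=True, available=False),
    Platform("Hama", low_latency=True, fault_tolerant=False, available=True),
]

best = choose_platform(platforms, want_low_latency=True)
```

The query itself would stay unchanged; only the target platform would
be picked from the available resources and the user's specs.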

So, it is good that we have this discussion. We need to do something
drastic. And I think we need to act now, because of this tremendous
interest in Spark and since MRQL already supports Spark.
What do you think?
Leonidas

On 01/22/2014 04:06 PM, Mark Wall wrote:
> Happy New Year! I see MRQL in the same space as an optimizer inside a
> traditional RDBMS: orchestrating, optimizing and redirecting user
> requests to the most appropriate layer of abstraction to fulfill a
> request, without the end user even worrying about whether a query kicks
> off a Spark query that runs MR in the background with HAMA. The fact is
> this: the open source community will continue to add new point
> solutions to address certain deficiencies in processing data that is
> (i) high volume, (ii) high variety or (iii) high velocity, or a
> combination. Hadoop succeeds in (i) but fails in (iii) wherever access
> to a continuous 'in pipe' processing stream is required, for instance
> online learning algorithms and clickstreams. This is where Storm and
> Spark mini-batch processing can potentially fill the gap, but there is
> no overarching single project that integrates the lot.
>
> To increase usage of MRQL, one option would be to integrate with
> Hive/Impala/Presto and/or even Hue and Solr. Users should have a
> choice of API that is most appropriate for the application at hand.
> The real achievement of MRQL would be to provide query optimization
> of ML and scientific computation, plus a user interface, across all
> layers in the stack: Hadoop/HAMA for batch and bulk processing, and
> Storm and Spark for continuous queries and optimizations. Big Data
> democracy is all about putting the power of open source into the hands
> of as many end users as possible, not just the devs / ML / scientific
> computing community.
>
> I vote we integrate with Spark ASAP. Cloudera just supported it, and
> Mike Olson announced it is the future direction of MapReduce.

Re: [DISCUSS] the future direction of MRQL

Posted by Mark Wall <ms...@apache.org>.

Happy New Year! I see MRQL in the same space as an optimizer inside a
traditional RDBMS: orchestrating, optimizing and redirecting user requests
to the most appropriate layer of abstraction to fulfill a request, without
the end user even worrying about whether a query kicks off a Spark query
that runs MR in the background with HAMA. The fact is this: the open source
community will continue to add new point solutions to address certain
deficiencies in processing data that is (i) high volume, (ii) high variety
or (iii) high velocity, or a combination. Hadoop succeeds in (i) but fails
in (iii) wherever access to a continuous 'in pipe' processing stream is
required, for instance online learning algorithms and clickstreams. This
is where Storm and Spark mini-batch processing can potentially fill the
gap, but there is no overarching single project that integrates the lot.

To increase usage of MRQL, one option would be to integrate with
Hive/Impala/Presto and/or even Hue and Solr. Users should have a choice of
API that is most appropriate for the application at hand. The real
achievement of MRQL would be to provide query optimization of ML and
scientific computation, plus a user interface, across all layers in the
stack: Hadoop/HAMA for batch and bulk processing, and Storm and Spark for
continuous queries and optimizations. Big Data democracy is all about
putting the power of open source into the hands of as many end users as
possible, not just the devs / ML / scientific computing community.

I vote we integrate with Spark ASAP. Cloudera just supported it, and Mike
Olson announced it is the future direction of MapReduce.

On 12 December 2013 02:15, Leonidas Fegaras <fe...@cse.uta.edu> wrote:

> Thanks Edward,
> Our biggest concern is that there is no activity in the user@mrql
> list. Does this mean that there is no one using MRQL or that nobody
> posts any messages? Is there a way to get the number of people
> registered in this list? Can we also get the number of times MRQL has
> been downloaded from Apache mirrors after its first release? It was
> hoped that after the first release people would start downloading MRQL
> and would register at the user@mrql list to ask questions, report bugs, ask
> for new features, etc. It hasn't happened yet. Maybe it's too soon.
>
> There are other query languages for big data analysis in ASF. All
> except MRQL are SQL-based data warehousing systems for Hadoop
> (e.g., Hive and Tajo). MRQL is a query system for complex data analysis,
> including machine learning and scientific computing. This is the main
> difference from others. The fact that it can run on multiple platforms
> is a big plus, but is secondary. Currently, most people use Hadoop for
> big data analysis but soon this may change. I think people will start
> using fault-tolerant in-memory distributed systems for data analysis,
> such as Spark. Hama too may play a big role. So supporting multiple
> platforms will allow users to deploy applications using MRQL very fast
> and experiment with all these platforms without having to change the
> query. The whole idea of expressing distributed applications using an
> SQL-like query system is rapid and easy prototyping, without
> sacrificing performance. So performance is a very important factor.
> If MRQL is slow, nobody will use it. I think in this area, we are doing
> an excellent job because of the very advanced optimizer that allows
> operations such as matrix multiplication to be done using very fast
> algorithms.
>
> Leonidas Fegaras

Re: [DISCUSS] the future direction of MRQL

Posted by Leonidas Fegaras <fe...@cse.uta.edu>.

Thanks Edward,
Our biggest concern is that there is no activity in the user@mrql
list. Does this mean that there is no one using MRQL or that nobody
posts any messages? Is there a way to get the number of people
registered in this list? Can we also get the number of times MRQL has
been downloaded from Apache mirrors after its first release? It was
hoped that after the first release people would start downloading MRQL
and would register at the user@mrql list to ask questions, report bugs, ask
for new features, etc. It hasn't happened yet. Maybe it's too soon.

There are other query languages for big data analysis in ASF. All
except MRQL are SQL-based data warehousing systems for Hadoop
(e.g., Hive and Tajo). MRQL is a query system for complex data analysis,
including machine learning and scientific computing. This is the main
difference from others. The fact that it can run on multiple platforms
is a big plus, but is secondary. Currently, most people use Hadoop for
big data analysis but soon this may change. I think people will start
using fault-tolerant in-memory distributed systems for data analysis,
such as Spark. Hama too may play a big role. So supporting multiple
platforms will allow users to deploy applications using MRQL very fast
and experiment with all these platforms without having to change the
query. The whole idea of expressing distributed applications using an
SQL-like query system is rapid and easy prototyping, without
sacrificing performance. So performance is a very important factor.
If MRQL is slow, nobody will use it. I think in this area, we are doing
an excellent job because of the very advanced optimizer that allows
operations such as matrix multiplication to be done using very fast
algorithms.

Leonidas Fegaras

On 12/10/2013 08:33 PM, Edward J. Yoon wrote:
> All,
>
> Since there are too many similar projects, I'd like to suggest that we
> change the future direction of MRQL to a powerful *analytics* query
> language on top of Hadoop beyond ETL processing. In my eyes,
> supporting multiple platforms (MapReduce, Hama, Spark, etc.) also
> seems pointless. WDYT?
>