Posted to dev@flink.apache.org by Leonidas Fegaras <fe...@cse.uta.edu> on 2014/08/27 22:49:36 UTC

MRQL on Flink

Hello,
I would like to let you know that Apache MRQL can now run queries on Flink.
MRQL is a query processing and optimization system for large-scale,
distributed data analysis, built on top of Apache Hadoop/map-reduce,
Hama, Spark, and now Flink. MRQL queries are SQL-like but not SQL:
they can work on complex, user-defined data (such as JSON and XML) and
can express complex computations (such as PageRank and matrix factorization).

MRQL on Flink has been tested in local mode and on a small Yarn cluster.

Here are the directions on how to build the latest MRQL snapshot:

git clone https://git-wip-us.apache.org/repos/asf/incubator-mrql.git mrql
cd mrql
mvn -Pyarn clean install

To run it on your cluster, edit conf/mrql-env.sh and set the Java,
Hadoop, and Flink installation directories.
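
As a rough illustration, the settings in conf/mrql-env.sh might end up
looking like the sketch below; the variable names and paths are only an
example (not copied from the shipped file), so check the comments in
conf/mrql-env.sh for the exact names it expects:

# hypothetical example values -- adjust the names and paths to your installation
JAVA_HOME=/usr/lib/jvm/java-7-openjdk
HADOOP_HOME=/opt/hadoop-2.2.0
FLINK_HOME=/opt/flink-0.6.0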

Here is how to run PageRank. First, you need to generate a random
graph and store it in a file using the MRQL query RMAT.mrql:

bin/mrql.flink -local queries/RMAT.mrql 1000 10000

This will create a graph with 1K nodes and 10K edges using the RMAT
algorithm, remove duplicate edges, and store the graph in the binary
file graph.bin. Then run PageRank in Flink mode using:

bin/mrql.flink -local queries/pagerank.mrql

To run MRQL/Flink on a Yarn cluster, first start a Flink session on
Yarn by running the script yarn-session.sh, for example:

${FLINK_HOME}/bin/yarn-session.sh -n 8

This will print the name of the Flink JobManager, which can then be used as follows:

export FLINK_MASTER=name-of-the-Flink-JobManager
bin/mrql.flink -dist -nodes 16 queries/RMAT.mrql 1000000 10000000

This will create a graph with 1M nodes and 10M edges using RMAT on 16
nodes (slaves). You can adjust these numbers to fit your cluster.
Then, run PageRank using:

bin/mrql.flink -dist -nodes 16 queries/pagerank.mrql

The MRQL project page is at: http://mrql.incubator.apache.org/

Let me know if you have any questions.
Leonidas Fegaras


Re: MRQL on Flink

Posted by Kostas Tzoumas <kt...@apache.org>.
This is awesome, looking forward to trying it out!

Leonidas, do you have any feedback for the Flink community? How easy was it
to implement the translation? Anything you struggled with?

Kostas




On Thu, Aug 28, 2014 at 12:06 AM, Ufuk Celebi <uc...@apache.org> wrote:

> Awesome, indeed! Looking forward to trying it out. :)
>
>
> On Wed, Aug 27, 2014 at 10:52 PM, Sebastian Schelter <ss...@apache.org>
> wrote:
>
> > Awesome!

Re: MRQL on Flink

Posted by Fabian Hueske <fh...@gmail.com>.
That's really cool!

I'm also curious about your experience with Flink. Did you find major
obstacles that you needed to overcome for the integration?
Is there some write-up / report available somewhere (maybe in JIRA) that
discusses the integration? Are you using Flink's full operator set or do
you compile everything into Map and Reduce?

Best, Fabian


2014-08-28 7:37 GMT+02:00 Aljoscha Krettek <al...@apache.org>:

> Very nice indeed! How well is this tested? Can it already run all the
> example queries you have? Can you say anything about the performance
> of the different underlying execution engines?
>
> On Thu, Aug 28, 2014 at 12:58 AM, Stephan Ewen <se...@apache.org> wrote:
> > Wow, that is impressive!

Re: MRQL on Flink

Posted by Leonidas Fegaras <fe...@cse.uta.edu>.
Thanks,
Yes, it will be beneficial for both projects to cross-link. We may need
to hold off on an announcement until I make the system more stable.
I forgot to mention that having a query processor on top of Flink which
doesn't use the Flink optimizer much may be a bit unfair to Flink (when
we compare Flink to Spark). Flink shines best when the data are
relational and we use its special relational methods. MRQL uses custom
data only, which doesn't leave many opportunities for the Flink
optimizer. Nevertheless, I may improve the MRQL Flink evaluator in the
future to recognize cases where the data are flat, so it can call the
relational Flink methods instead of the generic methods. This will
require lots of work.
I used multiple jars when I was on a Flink snapshot, but later switched
to the uberjar after the first Flink release. I will make it support both.
The MapReduce in the plan is not the Hadoop map-reduce; it's a physical
plan operator whose functionality is equivalent to the Hadoop
map-reduce. It can easily be translated into a Flink flatMap with a
groupBy.
Leonidas

On 08/28/2014 07:32 AM, Robert Metzger wrote:
> Amazing.
> In my opinion, we should cross-link our projects on the websites. Maybe we
> should add a section on our website where we list projects we depend on and
> projects depending on us.
> A little blog post / news on our website (once a MRQL release with Flink
> support is out) can also draw some attention to this great work!
>
> I've tried following your instructions and found one issue with Java 8 on
> the way: https://issues.apache.org/jira/browse/MRQL-46
> I think the classpath setup of the mrql scripts assumes that the user has a
> flink yarn uberjar file (one fat-jar with everything). I've first tried it
> with a regular "hadoop2" build of flink.
> We should probably generalize the classpath setup there a bit (to include
> all "flink-" prefixed jar files into the classpath).
>
> After I've sorted out these issues, mrql was working.
> Is the local mode actually using Flink's local execution?
> The output said:
> Apache MRQL version 0.9.4 (interpreted local Flink mode using 2 tasks)
> Query type: ( int, int, int, int ) -> ( int, int )
> Query type: !bag(( int, int ))
> Physical plan:
> MapReduce:
>     input: Generator
>
> In particular the "MapReduce" there was confusing me.
> I hope to find some more time soon to look closer into the MRQL query
> language.
>
> Robert


Re: MRQL on Flink

Posted by Robert Metzger <rm...@apache.org>.
Amazing.
In my opinion, we should cross-link our projects on the websites. Maybe we
should add a section on our website where we list projects we depend on and
projects depending on us.
A little blog post / news item on our website (once an MRQL release with Flink
support is out) can also draw some attention to this great work!

I've tried following your instructions and found one issue with Java 8 on
the way: https://issues.apache.org/jira/browse/MRQL-46
I think the classpath setup of the mrql scripts assumes that the user has a
Flink Yarn uberjar file (one fat-jar with everything). I first tried it
with a regular "hadoop2" build of Flink.
We should probably generalize the classpath setup there a bit (to include
all "flink-" prefixed jar files into the classpath).
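
A generic version could look roughly like the sketch below; this is just
an illustration of the idea (not the actual code in the mrql scripts),
and it assumes FLINK_HOME points at a Flink installation with its jars
under lib/:

# hypothetical sketch: put every flink-*.jar from the Flink lib dir on the classpath
for jar in ${FLINK_HOME}/lib/flink-*.jar; do
  CLASSPATH=${CLASSPATH}:${jar}
done
export CLASSPATH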

After I sorted out these issues, MRQL was working.
Is the local mode actually using Flink's local execution?
The output said:
Apache MRQL version 0.9.4 (interpreted local Flink mode using 2 tasks)
Query type: ( int, int, int, int ) -> ( int, int )
Query type: !bag(( int, int ))
Physical plan:
MapReduce:
   input: Generator

In particular the "MapReduce" there was confusing me.
I hope to find some more time soon to look closer into the MRQL query
language.

Robert




Re: MRQL on Flink

Posted by Stephan Ewen <se...@apache.org>.
Hey Leonidas, Edward, and Communities!

There are some serious efforts going on in Flink to improve the runtime,
optimize serialization, and change the way that the API lets you use your
types, specify keys, etc.

I believe that will take some more weeks to be in place. After that, it
would be really interesting for us to look at how higher-level languages
(like MRQL) could make use of that, and possibly do a before/after comparison.

I would like to ping you later with respect to that, if you are interested!

Greetings,
Stephan



Re: MRQL on Flink

Posted by Fabian Hueske <fh...@apache.org>.
Hi Edward,

that sounds very interesting!
Let us know if you have any problems setting up or configuring Flink. We'll
be very happy to help.

Cheers, Fabian



Re: MRQL on Flink

Posted by "Edward J. Yoon" <ed...@apache.org>.
Cool!

>>> Very nice indeed! How well is this tested? Can it already run all the
>>> example queries you have? Can you say anything about the performance
>>> of the different underlying execution engines?

Recently I have been planning a performance benchmark for the new Hama
release. I might be able to generate some comparison table between Spark,
Hama, and Flink.

On Fri, Aug 29, 2014 at 12:13 AM, Leonidas Fegaras <fe...@cse.uta.edu> wrote:
> I neglected to mention that this is still work in progress (!). It has all
> the necessary parts to work with Flink but still has bugs and obviously
> needs lots of performance tuning. The reason I announced it early is to get
> feedback and hopefully bug reports from the dev@flink. But I must say you
> already gave me a lot of encouragement. Thanks!
> The major component missing in this system is to work with HDFS on
> distributed mode by default. Now, it uses the local file system (which is
> NFS shared by workers) on both local and distributed mode, which is terribly
> inefficient. For local mode, I want to have the local working directory as
> the default for relative paths (I think this works OK). For distributed
> mode, I want the HDFS and the user home on HDFS to be the default. I will
> try to fix this and have a workable system for Yarn by the end of this
> weekend. The local mode works fine now, I think.
> It was easy to port the MRQL physical operators to Flink DataSet methods; I
> have done something similar for Spark. The components that took me long to
> develop were the DataSources and the DataSinks. All the other MRQL backends
> use the hadoop HDFS. So I had to copy some of my files from my core system
> that uses HDFS to the Flink backend, change their names, and use the Flink
> filesystem packages (which are very similar to Hadoop HDFS). Another problem
> was that I had heavily used Hadoop Sequential files to store results for the
> other backends. So I had to switch to Flink's BinaryOutputFormat. The
> DataSinks in Flink are not very convenient. I wish there was a DataSink that
> contains an Iterator so that we can use the results for purposes other than
> storing them in files. Also, compared to Spark, there are very few ways to
> send results from workers to the master node after execution. Custom
> aggregators still have a bug when the aggregation result is a custom class
> (it's a serialization problem: the class of the deserialized result doesn't
> match the expected class, although they have the same name). In general, I
> encountered some problems with serialization: sometimes I couldn't use inner
> classes for the Flink functional parameters and I had to define them as
> static classes. Another thing that took me a couple of days to fix was to
> dump data from an Iterator to a Flink Binary file. Dumping the iterator data
> into a vector first was not feasible because these data may be huge. First,
> I tried to use the fromCollection method, but it required that the Iterator
> be serializable (It doesn't make sense; how do you make an Iterator
> serializable?) Then I used the following hack:
>
>  BinaryOutputFormat of = new BinaryOutputFormat();
>  of.setOutputFilePath(path);
>  of.open(0,2);
>  ...
> It took me a while to find that I need to put of.open(0,2) instead of
> of.open(0,1). Why do we need 2 tasks?
> So, thanks for your encouragement. I will try to fix some of these bugs by
> Monday and have a system that performs well on Yarn.
> Leonidas
>>>>>>>
>>>>>>>
>



-- 
Best Regards, Edward J. Yoon
CEO at DataSayer Co., Ltd.

Re: MRQL on Flink

Posted by Leonidas Fegaras <fe...@cse.uta.edu>.
I neglected to mention that this is still work in progress (!). It has
all the necessary parts to work with Flink, but it still has bugs and
obviously needs lots of performance tuning. The reason I announced it
early is to get feedback and hopefully bug reports from dev@flink. But
I must say you already gave me a lot of encouragement. Thanks!

The major missing component in this system is support for HDFS in
distributed mode by default. Right now it uses the local file system
(which is NFS-shared by the workers) in both local and distributed
mode, which is terribly inefficient. For local mode, I want the local
working directory to be the default for relative paths (I think this
works OK). For distributed mode, I want HDFS, with the user's home
directory on HDFS, to be the default. I will try to fix this and have
a workable system for Yarn by the end of this weekend. The local mode
works fine now, I think.
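A minimal sketch of that default-path resolution, using the Hadoop
FileSystem API (the class and method names here are invented for
illustration only; this is not MRQL code):

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  class PathDefaults {
      // Resolve a possibly relative path against the default file system:
      // absolute paths are only qualified, relative paths are placed under
      // the user's home directory (on HDFS when fs.defaultFS points to it).
      static Path resolve ( String name ) throws IOException {
          Configuration conf = new Configuration();  // reads fs.defaultFS
          FileSystem fs = FileSystem.get(conf);
          Path p = new Path(name);
          return p.isAbsolute() ? fs.makeQualified(p)
                                : new Path(fs.getHomeDirectory(),p);
      }
  }

With something like this, a relative name such as graph.bin resolves
under the user's home on whatever file system fs.defaultFS points to.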
It was easy to port the MRQL physical operators to Flink DataSet
methods; I have done something similar for Spark. The components that
took me the longest to develop were the DataSources and the DataSinks.
All the other MRQL backends use Hadoop HDFS, so I had to copy some of
my files from my core system that uses HDFS to the Flink backend,
change their names, and use the Flink filesystem packages (which are
very similar to Hadoop HDFS). Another problem was that I had heavily
used Hadoop sequence files to store results for the other backends, so
I had to switch to Flink's BinaryOutputFormat. The DataSinks in Flink
are not very convenient: I wish there were a DataSink that exposes an
Iterator, so that we could use the results for purposes other than
storing them in files. Also, compared to Spark, there are very few
ways to send results from the workers to the master node after
execution. Custom aggregators still have a bug when the aggregation
result is a custom class (it's a serialization problem: the class of
the deserialized result doesn't match the expected class, although
they have the same name). In general, I encountered some problems with
serialization: sometimes I couldn't use inner classes for the Flink
functional parameters and I had to define them as static classes.
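The static-class workaround looks roughly like this (a sketch with
invented names; the exact package of the function types differed a bit
between the 0.6-era Java API and later releases):

  import org.apache.flink.api.common.functions.MapFunction;

  class Queries {
      // A non-static inner (or anonymous) class carries a hidden reference
      // to its enclosing instance, which can break serialization of the
      // closure; a static nested class avoids that.
      public static class AddOne implements MapFunction<Integer,Integer> {
          public Integer map ( Integer x ) { return x+1; }
      }
      // usage:  someDataSet.map(new AddOne())
  }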
Another thing that took me a couple of days to fix was dumping data
from an Iterator to a Flink binary file. Dumping the iterator data
into a vector first was not feasible because the data may be huge.
First, I tried to use the fromCollection method, but it required that
the Iterator be serializable (which doesn't make sense; how do you
make an Iterator serializable?). Then I used the following hack:

  BinaryOutputFormat of = new BinaryOutputFormat();
  of.setOutputFilePath(path);   // path is a Flink filesystem Path
  of.open(0,2);                 // open(taskNumber,numTasks)
  ...
It took me a while to find that I needed to use of.open(0,2) instead of
of.open(0,1). Why do we need 2 tasks?
So, thanks for your encouragement. I will try to fix some of these bugs 
by Monday and have a system that performs well on Yarn.
Leonidas


On 08/28/2014 03:58 AM, Fabian Hueske wrote:
> That's really cool!
>
> I'm also curious about your experience with Flink. Did you find major
> obstacles that you needed to overcome for the integration?
> Is there some write-up / report available somewhere (maybe in JIRA) that
> discusses the integration? Are you using Flink's full operator set or do
> you compile everything into Map and Reduce?
>
> Best, Fabian
>
>
> 2014-08-28 7:37 GMT+02:00 Aljoscha Krettek <al...@apache.org>:
>
>> Very nice indeed! How well is this tested? Can it already run all the
>> example queries you have? Can you say anything about the performance
>> of the different underlying execution engines?
>>
>> On Thu, Aug 28, 2014 at 12:58 AM, Stephan Ewen <se...@apache.org> wrote:
>>> Wow, that is impressive!
>>>
>>>
>>> On Thu, Aug 28, 2014 at 12:06 AM, Ufuk Celebi <uc...@apache.org> wrote:
>>>
>>>> Awesome, indeed! Looking forward to trying it out. :)
>>>>
>>>>
>>>> On Wed, Aug 27, 2014 at 10:52 PM, Sebastian Schelter <ss...@apache.org>
>>>> wrote:
>>>>
>>>>> Awesome!
>>>>>
>>>>>
>>>>> 2014-08-27 13:49 GMT-07:00 Leonidas Fegaras <fe...@cse.uta.edu>:
>>>>>
>>>>>> Hello,
>>>>>> I would like to let you know that Apache MRQL can now run queries on
>>>>> Flink.
>>>>>> MRQL is a query processing and optimization system for large-scale,
>>>>>> distributed data analysis, built on top of Apache Hadoop/map-reduce,
>>>>>> Hama, Spark, and now Flink. MRQL queries are SQL-like but not SQL.
>>>>>> They can work on complex, user-defined data (such as JSON and XML)
>> and
>>>>>> can express complex queries (such as pagerank and matrix
>>>> factorization).
>>>>>> MRQL on Flink has been tested on local mode and on a small Yarn
>>>> cluster.
>>>>>> Here are the directions on how to build the latest MRQL snapshot:
>>>>>>
>>>>>> git clone
>> https://git-wip-us.apache.org/repos/asf/incubator-mrql.git
>>>>> mrql
>>>>>> cd mrql
>>>>>> mvn -Pyarn clean install
>>>>>>
>>>>>> To make it run on your cluster, edit conf/mrql-env.sh and set the
>>>>>> Java, the Hadoop, and the Flink installation directories.
>>>>>>
>>>>>> Here is how to run PageRank. First, you need to generate a random
>>>>>> graph and store it in a file using the MRQL query RMAT.mrql:
>>>>>>
>>>>>> bin/mrql.flink -local queries/RMAT.mrql 1000 10000
>>>>>>
>>>>>> This will create a graph with 1K nodes and 10K edges using the RMAT
>>>>>> algorithm, will remove duplicate edges, and will store the graph in
>>>>>> the binary file graph.bin. Then, run PageRank on Flink mode using:
>>>>>>
>>>>>> bin/mrql.flink -local queries/pagerank.mrql
>>>>>>
>>>>>> To run MRQL/Flink on a Yarn cluster, first start the Flink container
>>>>>> on Yarn by running the script yarn-session.sh, such as:
>>>>>>
>>>>>> ${FLINK_HOME}/bin/yarn-session.sh -n 8
>>>>>>
>>>>>> This will print the name of the Flink JobManager, which can be used
>> in:
>>>>>> export FLINK_MASTER=name-of-the-Flink-JobManager
>>>>>> bin/mrql.flink -dist -nodes 16 queries/RMAT.mrql 1000000 10000000
>>>>>>
>>>>>> This will create a graph with 1M nodes and 10M edges using RMAT on
>> 16
>>>>>> nodes (slaves). You can adjust these numbers to fit your cluster.
>>>>>> Then, run PageRank using:
>>>>>>
>>>>>> bin/mrql.flink -dist -nodes 16 queries/pagerank.mrql
>>>>>>
>>>>>> The MRQL project page is at: http://mrql.incubator.apache.org/
>>>>>>
>>>>>> Let me know if you have any questions.
>>>>>> Leonidas Fegaras
>>>>>>
>>>>>>


Re: MRQL on Flink

Posted by Fabian Hueske <fh...@apache.org>.
That's really cool!

I'm also curious about your experience with Flink. Did you find major
obstacles that you needed to overcome for the integration?
Is there some write-up / report available somewhere (maybe in JIRA) that
discusses the integration? Are you using Flink's full operator set or do
you compile everything into Map and Reduce?

Best, Fabian


2014-08-28 7:37 GMT+02:00 Aljoscha Krettek <al...@apache.org>:

> Very nice indeed! How well is this tested? Can it already run all the
> example queries you have? Can you say anything about the performance
> of the different underlying execution engines?
>
> On Thu, Aug 28, 2014 at 12:58 AM, Stephan Ewen <se...@apache.org> wrote:
> > Wow, that is impressive!
> >
> >
> > On Thu, Aug 28, 2014 at 12:06 AM, Ufuk Celebi <uc...@apache.org> wrote:
> >
> >> Awesome, indeed! Looking forward to trying it out. :)
> >>
> >>
> >> On Wed, Aug 27, 2014 at 10:52 PM, Sebastian Schelter <ss...@apache.org>
> >> wrote:
> >>
> >> > Awesome!
> >> >
> >> >
> >> > 2014-08-27 13:49 GMT-07:00 Leonidas Fegaras <fe...@cse.uta.edu>:
> >> >
> >> > > Hello,
> >> > > I would like to let you know that Apache MRQL can now run queries on
> >> > Flink.
> >> > > MRQL is a query processing and optimization system for large-scale,
> >> > > distributed data analysis, built on top of Apache Hadoop/map-reduce,
> >> > > Hama, Spark, and now Flink. MRQL queries are SQL-like but not SQL.
> >> > > They can work on complex, user-defined data (such as JSON and XML)
> and
> >> > > can express complex queries (such as pagerank and matrix
> >> factorization).
> >> > >
> >> > > MRQL on Flink has been tested on local mode and on a small Yarn
> >> cluster.
> >> > >
> >> > > Here are the directions on how to build the latest MRQL snapshot:
> >> > >
> >> > > git clone
> https://git-wip-us.apache.org/repos/asf/incubator-mrql.git
> >> > mrql
> >> > > cd mrql
> >> > > mvn -Pyarn clean install
> >> > >
> >> > > To make it run on your cluster, edit conf/mrql-env.sh and set the
> >> > > Java, the Hadoop, and the Flink installation directories.
> >> > >
> >> > > Here is how to run PageRank. First, you need to generate a random
> >> > > graph and store it in a file using the MRQL query RMAT.mrql:
> >> > >
> >> > > bin/mrql.flink -local queries/RMAT.mrql 1000 10000
> >> > >
> >> > > This will create a graph with 1K nodes and 10K edges using the RMAT
> >> > > algorithm, will remove duplicate edges, and will store the graph in
> >> > > the binary file graph.bin. Then, run PageRank on Flink mode using:
> >> > >
> >> > > bin/mrql.flink -local queries/pagerank.mrql
> >> > >
> >> > > To run MRQL/Flink on a Yarn cluster, first start the Flink container
> >> > > on Yarn by running the script yarn-session.sh, such as:
> >> > >
> >> > > ${FLINK_HOME}/bin/yarn-session.sh -n 8
> >> > >
> >> > > This will print the name of the Flink JobManager, which can be used
> in:
> >> > >
> >> > > export FLINK_MASTER=name-of-the-Flink-JobManager
> >> > > bin/mrql.flink -dist -nodes 16 queries/RMAT.mrql 1000000 10000000
> >> > >
> >> > > This will create a graph with 1M nodes and 10M edges using RMAT on
> 16
> >> > > nodes (slaves). You can adjust these numbers to fit your cluster.
> >> > > Then, run PageRank using:
> >> > >
> >> > > bin/mrql.flink -dist -nodes 16 queries/pagerank.mrql
> >> > >
> >> > > The MRQL project page is at: http://mrql.incubator.apache.org/
> >> > >
> >> > > Let me know if you have any questions.
> >> > > Leonidas Fegaras
> >> > >
> >> > >
> >> >
> >>
>

Re: MRQL on Flink

Posted by Aljoscha Krettek <al...@apache.org>.
Very nice indeed! How well is this tested? Can it already run all the
example queries you have? Can you say anything about the performance
of the different underlying execution engines?

On Thu, Aug 28, 2014 at 12:58 AM, Stephan Ewen <se...@apache.org> wrote:
> Wow, that is impressive!
>
>
> On Thu, Aug 28, 2014 at 12:06 AM, Ufuk Celebi <uc...@apache.org> wrote:
>
>> Awesome, indeed! Looking forward to trying it out. :)
>>
>>
>> On Wed, Aug 27, 2014 at 10:52 PM, Sebastian Schelter <ss...@apache.org>
>> wrote:
>>
>> > Awesome!
>> >
>> >
>> > 2014-08-27 13:49 GMT-07:00 Leonidas Fegaras <fe...@cse.uta.edu>:
>> >
>> > > Hello,
>> > > I would like to let you know that Apache MRQL can now run queries on
>> > Flink.
>> > > MRQL is a query processing and optimization system for large-scale,
>> > > distributed data analysis, built on top of Apache Hadoop/map-reduce,
>> > > Hama, Spark, and now Flink. MRQL queries are SQL-like but not SQL.
>> > > They can work on complex, user-defined data (such as JSON and XML) and
>> > > can express complex queries (such as pagerank and matrix
>> factorization).
>> > >
>> > > MRQL on Flink has been tested on local mode and on a small Yarn
>> cluster.
>> > >
>> > > Here are the directions on how to build the latest MRQL snapshot:
>> > >
>> > > git clone https://git-wip-us.apache.org/repos/asf/incubator-mrql.git
>> > mrql
>> > > cd mrql
>> > > mvn -Pyarn clean install
>> > >
>> > > To make it run on your cluster, edit conf/mrql-env.sh and set the
>> > > Java, the Hadoop, and the Flink installation directories.
>> > >
>> > > Here is how to run PageRank. First, you need to generate a random
>> > > graph and store it in a file using the MRQL query RMAT.mrql:
>> > >
>> > > bin/mrql.flink -local queries/RMAT.mrql 1000 10000
>> > >
>> > > This will create a graph with 1K nodes and 10K edges using the RMAT
>> > > algorithm, will remove duplicate edges, and will store the graph in
>> > > the binary file graph.bin. Then, run PageRank on Flink mode using:
>> > >
>> > > bin/mrql.flink -local queries/pagerank.mrql
>> > >
>> > > To run MRQL/Flink on a Yarn cluster, first start the Flink container
>> > > on Yarn by running the script yarn-session.sh, such as:
>> > >
>> > > ${FLINK_HOME}/bin/yarn-session.sh -n 8
>> > >
>> > > This will print the name of the Flink JobManager, which can be used in:
>> > >
>> > > export FLINK_MASTER=name-of-the-Flink-JobManager
>> > > bin/mrql.flink -dist -nodes 16 queries/RMAT.mrql 1000000 10000000
>> > >
>> > > This will create a graph with 1M nodes and 10M edges using RMAT on 16
>> > > nodes (slaves). You can adjust these numbers to fit your cluster.
>> > > Then, run PageRank using:
>> > >
>> > > bin/mrql.flink -dist -nodes 16 queries/pagerank.mrql
>> > >
>> > > The MRQL project page is at: http://mrql.incubator.apache.org/
>> > >
>> > > Let me know if you have any questions.
>> > > Leonidas Fegaras
>> > >
>> > >
>> >
>>

Re: MRQL on Flink

Posted by Stephan Ewen <se...@apache.org>.
Wow, that is impressive!


On Thu, Aug 28, 2014 at 12:06 AM, Ufuk Celebi <uc...@apache.org> wrote:

> Awesome, indeed! Looking forward to trying it out. :)
>
>
> On Wed, Aug 27, 2014 at 10:52 PM, Sebastian Schelter <ss...@apache.org>
> wrote:
>
> > Awesome!
> >
> >
> > 2014-08-27 13:49 GMT-07:00 Leonidas Fegaras <fe...@cse.uta.edu>:
> >
> > > Hello,
> > > I would like to let you know that Apache MRQL can now run queries on
> > Flink.
> > > MRQL is a query processing and optimization system for large-scale,
> > > distributed data analysis, built on top of Apache Hadoop/map-reduce,
> > > Hama, Spark, and now Flink. MRQL queries are SQL-like but not SQL.
> > > They can work on complex, user-defined data (such as JSON and XML) and
> > > can express complex queries (such as pagerank and matrix
> factorization).
> > >
> > > MRQL on Flink has been tested on local mode and on a small Yarn
> cluster.
> > >
> > > Here are the directions on how to build the latest MRQL snapshot:
> > >
> > > git clone https://git-wip-us.apache.org/repos/asf/incubator-mrql.git
> > mrql
> > > cd mrql
> > > mvn -Pyarn clean install
> > >
> > > To make it run on your cluster, edit conf/mrql-env.sh and set the
> > > Java, the Hadoop, and the Flink installation directories.
> > >
> > > Here is how to run PageRank. First, you need to generate a random
> > > graph and store it in a file using the MRQL query RMAT.mrql:
> > >
> > > bin/mrql.flink -local queries/RMAT.mrql 1000 10000
> > >
> > > This will create a graph with 1K nodes and 10K edges using the RMAT
> > > algorithm, will remove duplicate edges, and will store the graph in
> > > the binary file graph.bin. Then, run PageRank on Flink mode using:
> > >
> > > bin/mrql.flink -local queries/pagerank.mrql
> > >
> > > To run MRQL/Flink on a Yarn cluster, first start the Flink container
> > > on Yarn by running the script yarn-session.sh, such as:
> > >
> > > ${FLINK_HOME}/bin/yarn-session.sh -n 8
> > >
> > > This will print the name of the Flink JobManager, which can be used in:
> > >
> > > export FLINK_MASTER=name-of-the-Flink-JobManager
> > > bin/mrql.flink -dist -nodes 16 queries/RMAT.mrql 1000000 10000000
> > >
> > > This will create a graph with 1M nodes and 10M edges using RMAT on 16
> > > nodes (slaves). You can adjust these numbers to fit your cluster.
> > > Then, run PageRank using:
> > >
> > > bin/mrql.flink -dist -nodes 16 queries/pagerank.mrql
> > >
> > > The MRQL project page is at: http://mrql.incubator.apache.org/
> > >
> > > Let me know if you have any questions.
> > > Leonidas Fegaras
> > >
> > >
> >
>

Re: MRQL on Flink

Posted by Ufuk Celebi <uc...@apache.org>.
Awesome, indeed! Looking forward to trying it out. :)


On Wed, Aug 27, 2014 at 10:52 PM, Sebastian Schelter <ss...@apache.org> wrote:

> Awesome!
>
>
> 2014-08-27 13:49 GMT-07:00 Leonidas Fegaras <fe...@cse.uta.edu>:
>
> > Hello,
> > I would like to let you know that Apache MRQL can now run queries on
> Flink.
> > MRQL is a query processing and optimization system for large-scale,
> > distributed data analysis, built on top of Apache Hadoop/map-reduce,
> > Hama, Spark, and now Flink. MRQL queries are SQL-like but not SQL.
> > They can work on complex, user-defined data (such as JSON and XML) and
> > can express complex queries (such as pagerank and matrix factorization).
> >
> > MRQL on Flink has been tested on local mode and on a small Yarn cluster.
> >
> > Here are the directions on how to build the latest MRQL snapshot:
> >
> > git clone https://git-wip-us.apache.org/repos/asf/incubator-mrql.git
> mrql
> > cd mrql
> > mvn -Pyarn clean install
> >
> > To make it run on your cluster, edit conf/mrql-env.sh and set the
> > Java, the Hadoop, and the Flink installation directories.
> >
> > Here is how to run PageRank. First, you need to generate a random
> > graph and store it in a file using the MRQL query RMAT.mrql:
> >
> > bin/mrql.flink -local queries/RMAT.mrql 1000 10000
> >
> > This will create a graph with 1K nodes and 10K edges using the RMAT
> > algorithm, will remove duplicate edges, and will store the graph in
> > the binary file graph.bin. Then, run PageRank on Flink mode using:
> >
> > bin/mrql.flink -local queries/pagerank.mrql
> >
> > To run MRQL/Flink on a Yarn cluster, first start the Flink container
> > on Yarn by running the script yarn-session.sh, such as:
> >
> > ${FLINK_HOME}/bin/yarn-session.sh -n 8
> >
> > This will print the name of the Flink JobManager, which can be used in:
> >
> > export FLINK_MASTER=name-of-the-Flink-JobManager
> > bin/mrql.flink -dist -nodes 16 queries/RMAT.mrql 1000000 10000000
> >
> > This will create a graph with 1M nodes and 10M edges using RMAT on 16
> > nodes (slaves). You can adjust these numbers to fit your cluster.
> > Then, run PageRank using:
> >
> > bin/mrql.flink -dist -nodes 16 queries/pagerank.mrql
> >
> > The MRQL project page is at: http://mrql.incubator.apache.org/
> >
> > Let me know if you have any questions.
> > Leonidas Fegaras
> >
> >
>

Re: MRQL on Flink

Posted by Sebastian Schelter <ss...@apache.org>.
Awesome!


2014-08-27 13:49 GMT-07:00 Leonidas Fegaras <fe...@cse.uta.edu>:

> Hello,
> I would like to let you know that Apache MRQL can now run queries on Flink.
> MRQL is a query processing and optimization system for large-scale,
> distributed data analysis, built on top of Apache Hadoop/map-reduce,
> Hama, Spark, and now Flink. MRQL queries are SQL-like but not SQL.
> They can work on complex, user-defined data (such as JSON and XML) and
> can express complex queries (such as pagerank and matrix factorization).
>
> MRQL on Flink has been tested on local mode and on a small Yarn cluster.
>
> Here are the directions on how to build the latest MRQL snapshot:
>
> git clone https://git-wip-us.apache.org/repos/asf/incubator-mrql.git mrql
> cd mrql
> mvn -Pyarn clean install
>
> To make it run on your cluster, edit conf/mrql-env.sh and set the
> Java, the Hadoop, and the Flink installation directories.
>
> Here is how to run PageRank. First, you need to generate a random
> graph and store it in a file using the MRQL query RMAT.mrql:
>
> bin/mrql.flink -local queries/RMAT.mrql 1000 10000
>
> This will create a graph with 1K nodes and 10K edges using the RMAT
> algorithm, will remove duplicate edges, and will store the graph in
> the binary file graph.bin. Then, run PageRank on Flink mode using:
>
> bin/mrql.flink -local queries/pagerank.mrql
>
> To run MRQL/Flink on a Yarn cluster, first start the Flink container
> on Yarn by running the script yarn-session.sh, such as:
>
> ${FLINK_HOME}/bin/yarn-session.sh -n 8
>
> This will print the name of the Flink JobManager, which can be used in:
>
> export FLINK_MASTER=name-of-the-Flink-JobManager
> bin/mrql.flink -dist -nodes 16 queries/RMAT.mrql 1000000 10000000
>
> This will create a graph with 1M nodes and 10M edges using RMAT on 16
> nodes (slaves). You can adjust these numbers to fit your cluster.
> Then, run PageRank using:
>
> bin/mrql.flink -dist -nodes 16 queries/pagerank.mrql
>
> The MRQL project page is at: http://mrql.incubator.apache.org/
>
> Let me know if you have any questions.
> Leonidas Fegaras
>
>