Posted to user@hama.apache.org by Leonidas Fegaras <fe...@cse.uta.edu> on 2012/08/23 21:41:47 UTC

[ANNOUNCEMENT] A query system for BSP processing

Dear Hama users,
I am pleased to announce that the MRQL query processing system can now
evaluate SQL-like queries on a Hama cluster. MRQL is available at:

http://lambda.uta.edu/mrql/

MRQL (the Map-Reduce Query Language) is an SQL-like query language for
large-scale, distributed data analysis. MRQL is powerful enough to
express most common data analysis tasks over many different kinds of
raw data, including hierarchical data and nested collections, such as
XML data. MRQL can run in two modes: in MR (Map-Reduce) mode using
Apache Hadoop and in BSP (Bulk Synchronous Parallel) mode using Apache
Hama. Both modes use Apache's HDFS to read and write their data.

Note that the BSP mode is currently experimental (not fine-tuned yet)
and lacks any fault tolerance (if an error occurs, the entire job must
be restarted). Due to our limited resources, MRQL has only been tested
on a small cluster (7 nodes/28 cores). We compared the BSP mode with
the MR mode by evaluating a pagerank query over a small graph (100K
nodes, 1M edges) and found that BSP mode is about 4.5 times faster
than the MR mode. Please let me know if you'd like to contribute to
this project by testing MRQL on a larger cluster.
Best regards,
Leonidas Fegaras
University of Texas at Arlington

Re: [ANNOUNCEMENT] A query system for BSP processing

Posted by Thomas Jungblut <th...@gmail.com>.

Let's feature this project on our site and in our wiki.

2012/9/11 Leonidas Fegaras <fe...@cse.uta.edu>

> I created a project on Github:
> https://github.com/fegaras/mrql.git
>
> Thank you for your help
> Leonidas Fegaras

Re: [ANNOUNCEMENT] A query system for BSP processing

Posted by Leonidas Fegaras <fe...@cse.uta.edu>.

I don't have any plans for supporting such queries, but I would like to 
try new applications.
Leonidas

On 09/13/2012 06:20 AM, Edward J. Yoon wrote:
> Just curious, is there a plan to support sophisticated queries for
> unstructured spatial datasets?
>
> On Wed, Sep 12, 2012 at 4:13 AM, Leonidas Fegaras <fe...@cse.uta.edu> wrote:
>> I created a project on Github:
>> https://github.com/fegaras/mrql.git
>>
>> Thank you for your help
>> Leonidas Fegaras
>>
>>
>> On Sep 7, 2012, at 11:20 AM, Thomas Jungblut wrote:
>>
>>> Yep, a subproject would be the alternative.
>>> In this case we would give you PMC and committer rights so you can
>>> actively work on that.
>>> However this would make the mapreduce part more or less useless, so if you
>>> want to go the hybrid way, feel free to submit an incubation request.
>>>
>>> 2012/9/7 Suraj Menon <su...@apache.org>
>>>
>>>> I think Thomas has a point. How about making it a sub-module/sub-project
>>>> of Hama for now? If/when it gains enough community support to become a
>>>> top-level project, you can fork it as a separate project.
>>>> I am not completely aware of the procedures and requirements for getting
>>>> an external project accepted as a sub-project.
>>>> We can look into it if you are ready to take this route.
>>>>
>>>>> Could you please send me a link for setting up an open-source Apache
>>>>> project?
>>>> If I am right this is what you are looking for -
>>>> http://incubator.apache.org/guides/proposal.html
>>>> http://incubator.apache.org/sitemap.html
>>>>
>>>> Good luck,
>>>> Suraj
>>>>
>>>> On Fri, Sep 7, 2012 at 11:40 AM, Thomas Jungblut
>>>> <th...@gmail.com> wrote:
>>>>
>>>>> Although I think this is a great project, I think that you will not meet
>>>>> the requirements.
>>>>> You need a community and a charter to get it into the incubation.
>>>>>
>>>>> What about hosting it on Github?
>>>>>
>>>>> 2012/9/7 Leonidas Fegaras <fe...@cse.uta.edu>
>>>>>
>>>>>> Yes, this is a great idea. I have used GIT on my own server but I don't
>>>>>> know how to do this for ASF. Could you please send me a link for setting
>>>>>> up an open-source Apache project?
>>>>>>
>>>>>>
>>>>>> On 09/05/2012 10:51 AM, Edward J. Yoon wrote:
>>>>>>
>>>>>>> If you can open source this then I'm sure the ASF community can help
>>>>>>> you and make this software better.
>>>>>>>
>>>>>>> Pls feel free to ask us if you need any assistance donating source
>>>>>>> code to the ASF or contributing to the Hama project in the future.
>>>>>>>
>>>>>>> On Thu, Aug 30, 2012 at 11:40 PM, Leonidas Fegaras <fegaras@cse.uta.edu>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Yes sure. I have fixed the bug with the repeat stopping condition but I
>>>>>>>> have only tested pagerank on my small cluster. I still need to fix the
>>>>>>>> k-means clustering (it's a special case because you improve a fixed
>>>>>>>> number of points).
>>>>>>>> Leonidas
>>>>>>>>
>>>>>>>>
>>>>>>>> On Aug 30, 2012, at 9:02 AM, Edward J. Yoon wrote:
>>>>>>>>
>>>>>>>>> Shall we work together?
>>>>>>>>>
>>>>>>>>> On Fri, Aug 24, 2012 at 9:01 PM, Leonidas Fegaras <fegaras@cse.uta.edu>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thank you very much for your interest and for testing my system.
>>>>>>>>>> It seems that my release was premature: it worked for some random
>>>>>>>>>> data but not for others. It's a minor logical error that I will try
>>>>>>>>>> to fix in the next few days. The problem is with the stopping
>>>>>>>>>> condition of the repeat expression that calculates the new pagerank
>>>>>>>>>> from the old. It must stop only when ALL peers reach the specified
>>>>>>>>>> precision. This is done by having those peers that need to continue
>>>>>>>>>> send a message to the others to continue. It seems that now, when
>>>>>>>>>> all peers agree at the same time, the program works fine. But if one
>>>>>>>>>> finishes sooner, instead of continuing the repeat loop, it runs away
>>>>>>>>>> to the next BSP step that follows the repeat, then exits prematurely
>>>>>>>>>> and the system hangs. The casting errors are due to the run-away
>>>>>>>>>> peers executing the wrong BSP steps and reading the wrong messages.
>>>>>>>>>> Queries without repeat, though, are OK.
>>>>>>>>>> By the way, I had a problem exchanging large amounts of data during
>>>>>>>>>> sync (I discussed this with Thomas). My solution was to break a BSP
>>>>>>>>>> superstep into multiple substeps so that each substep can handle a
>>>>>>>>>> max number of messages. Of course my program has to collect all
>>>>>>>>>> messages in a vector in memory. When the vector is too big, it is
>>>>>>>>>> spilled to a local file. This moved the problem from the Hama side
>>>>>>>>>> to my side and allowed me to handle larger data, especially in
>>>>>>>>>> joins. I think this problem of exchanging large amounts of data
>>>>>>>>>> during a superstep is currently a weakness of Hama.
>>>>>>>>>> Leonidas
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 08/24/2012 04:15 AM, Thomas Jungblut wrote:
>>>>>>>>>>
>>>>>>>>>>> BTW, should we feature this on our website?
>>>>>>>>>>>
>>>>>>>>>>> 2012/8/24 Thomas Jungblut <thomas.jungblut@gmail.com>
>>>>>>>>>>>
>>>>>>>>>>>> Hi Leonidas!
>>>>>>>>>>>>
>>>>>>>>>>>> I have to admit that I have known what is going on (and had to keep
>>>>>>>>>>>> silent), but I have to say: Thank you very much!
>>>>>>>>>>>> This will help many people write BSPs in an easier way.
>>>>>>>>>>>>
>>>>>>>>>>>> Of course this is not as fast as native BSP code; Hive and Pig
>>>>>>>>>>>> suffer from the same problems in MR.
>>>>>>>>>>>> But it gives people the opportunity to develop faster and get their
>>>>>>>>>>>> code into production with just a minor time expense.
>>>>>>>>>>>>
>>>>>>>>>>>> And I think that we will gladly help you improve the BSP part of
>>>>>>>>>>>> your framework. At least I would ;)
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>
>>>>>>>>>>>> 2012/8/24 Edward J. Yoon <ed...@apache.org>
>>>>>>>>>>>>
>>>>>>>>>>>>> Here are a few test results of mine on Oracle BDA (40G/s InfiniBand
>>>>>>>>>>>>> network).
>>>>>>>>>>>>>
>>>>>>>>>>>>> It seems slower than our PageRank example.
>>>>>>>>>>>>>
>>>>>>>>>>>>> P.S., There are some errors so I couldn't test large-scale.
>>>>>>>>>>>>> (java.lang.ClassCastException: hadoop.mrql.MR_int cannot be cast to
>>>>>>>>>>>>> hadoop.mrql.Inv and java.lang.Error: Cannot clear a non-materialized
>>>>>>>>>>>>> sequence ..., etc.)
>>>>>>>>>>>>>
>>>>>>>>>>>>> == 100K nodes and 1M edges ==
>>>>>>>>>>>>>
>>>>>>>>>>>>> *** Using 10 BSP tasks (out of a max 10). Each task will handle about 2383611 bytes of input data.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Run time: 30.384 secs
>>>>>>>>>>>>>
>>>>>>>>>>>>> *** Using 20 BSP tasks (out of a max 20). Each task will handle about 1191805 bytes of input data.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Run time: 24.412 secs
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Aug 24, 2012 at 9:36 AM, Edward J. Yoon <ed...@apache.org>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Wow, very interesting. I'm going to install and test on my large
>>>>>>>>>>>>>> cluster.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Best Regards, Edward J. Yoon
>>>>>>>>>>>>>> @eddieyoon
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Best Regards, Edward J. Yoon
>>>>>>>>>>>>> @eddieyoon
>>>>>>>>>>>>>
>>>>>>>>>>>>> .
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Best Regards, Edward J. Yoon
>>>>>>>>> @eddieyoon
>>>>>>>>>
>>>>>>>>
>>>>>>>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon
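
The stopping rule that Leonidas describes in his Aug 24 message in this
thread (the repeat loop may end only when ALL peers have reached the
specified precision, and peers that need to continue send a message to the
others) can be illustrated with a small single-process simulation. This is
only a sketch under assumptions: it is plain Python, not MRQL code and not
the Hama API, and the names bsp_repeat, step, and precision are hypothetical.
Each list element stands for one peer's local state.

```python
from typing import Callable, List, Tuple


def bsp_repeat(states: List[float],
               step: Callable[[float], float],
               precision: float,
               max_supersteps: int = 100) -> Tuple[List[float], int]:
    """Simulate a BSP repeat loop that all peers leave in the same superstep.

    Hypothetical illustration only: nothing here uses MRQL or Hama.
    Returns the final per-peer states and the number of supersteps run.
    """
    n = len(states)
    for superstep in range(1, max_supersteps + 1):
        # One inbox per peer; it holds the ids of peers that voted "continue".
        inboxes: List[List[int]] = [[] for _ in range(n)]
        new_states = []
        for peer, old in enumerate(states):
            new = step(old)
            new_states.append(new)
            if abs(new - old) > precision:
                # Not converged locally: send "continue" to every peer,
                # including this one.
                for target in range(n):
                    inboxes[target].append(peer)
        states = new_states
        # Barrier: the messages become visible to their receivers here.
        # Each peer decides from its inbox, never from its local flag alone,
        # so a peer that converged early still iterates with the rest.
        decisions = [len(inbox) > 0 for inbox in inboxes]
        if not any(decisions):
            return states, superstep
    return states, max_supersteps


final_states, supersteps = bsp_repeat([1.0, 100.0], lambda x: x / 2, 0.5)
print(supersteps, final_states)  # prints: 8 [0.00390625, 0.390625]
```

With these inputs the first peer is within the precision after the first
superstep, but it keeps iterating until the second peer also converges in
superstep 8, so no peer runs ahead to the step that follows the loop.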

Re: [ANNOUNCEMENT] A query system for BSP processing

Posted by "Edward J. Yoon" <ed...@apache.org>.

Just curious, is there a plan to support sophisticated queries for
unstructured spatial datasets?

On Wed, Sep 12, 2012 at 4:13 AM, Leonidas Fegaras <fe...@cse.uta.edu> wrote:
> I created a project on Github:
> https://github.com/fegaras/mrql.git
>
> Thank you for your help
> Leonidas Fegaras



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: [ANNOUNCEMENT] A query system for BSP processing

Posted by Leonidas Fegaras <fe...@cse.uta.edu>.

I created a project on Github:
https://github.com/fegaras/mrql.git

Thank you for your help
Leonidas Fegaras

On Sep 7, 2012, at 11:20 AM, Thomas Jungblut wrote:

> Yep, a subproject would be the alternative.
> In this case we would give you PMC and committer rights so you can actively
> work on that.
> However this would make the mapreduce part more or less useless, so if you
> want to go the hybrid way, feel free to submit an incubation request.
>
> 2012/9/7 Suraj Menon <su...@apache.org>
>
>> I think Thomas has a point. How about making it a sub-module/sub-project of
>> Hama for now? If/When it gains enough community support to make it a top
>> level project, you can fork it as a separate project.
>> I am not completely aware of the procedures and requirements for getting
>> external project as sub-project.
>> We can look into it if you are ready to take this route.
>>
>>> Could you please send me a link for setting up an open-source Apache
>>> project?
>> If I am right this is what you are looking for -
>> http://incubator.apache.org/guides/proposal.html
>> http://incubator.apache.org/sitemap.html
>>
>> Good luck,
>> Suraj
>>
>> On Fri, Sep 7, 2012 at 11:40 AM, Thomas Jungblut
>> <th...@gmail.com> wrote:
>>
>>> Although I think this is a great project, I think that you will not meet
>>> the requirements.
>>> You need a community and a charter to get it into the incubation.
>>>
>>> What about hosting it on Github?
>>>
>>> 2012/9/7 Leonidas Fegaras <fe...@cse.uta.edu>
>>>
>>>> Yes, this is a great idea. I have used GIT on my own server but I don't
>>>> know how to do this for ASF. Could you please send me a link for setting up
>>>> an open-source Apache project?
>>>>
>>>>
>>>> On 09/05/2012 10:51 AM, Edward J. Yoon wrote:
>>>>
>>>>> If you can open source this then I'm sure the ASF community can help
>>>>> you and make this software better.
>>>>>
>>>>> Pls feel free to ask us if you need any assistance donating source
>>>>> code to the ASF or contributing to the Hama project in the future.
>>>>>
>>>>> On Thu, Aug 30, 2012 at 11:40 PM, Leonidas Fegaras <fegaras@cse.uta.edu>
>>>>> wrote:
>>>>>
>>>>>> Yes sure. I have fixed the bug with the repeat stopping condition but I have
>>>>>> only tested pagerank on my small cluster. I still need to fix the k-means
>>>>>> clustering (it's a special case because you improve a fixed number of
>>>>>> points).
>>>>>> Leonidas
>>>>>>
>>>>>>
>>>>>> On Aug 30, 2012, at 9:02 AM, Edward J. Yoon wrote:
>>>>>>
>>>>>>> Shall we work together?
>>>>>>>
>>>>>>> On Fri, Aug 24, 2012 at 9:01 PM, Leonidas Fegaras <fegaras@cse.uta.edu>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thank you very much for your interest and for testing my system.
>>>>>>>> It seems that my release was premature: It worked for some random data but
>>>>>>>> didn't for some others. It's a minor logical error that I will try to fix in
>>>>>>>> the next few days. The problem is with the stopping condition of the repeat
>>>>>>>> expression that calculates the new pagerank from the old. It must stop if
>>>>>>>> ALL peers reach the specified precision. This is done by having those peers
>>>>>>>> that need to continue send a message to others to continue. It seems that
>>>>>>>> now when all peers agree at the same time, the program works fine. But if
>>>>>>>> one finishes sooner, instead of continuing the repeat loop, it runs away to
>>>>>>>> the next BSP step that follows the repeat, then exits prematurely and the
>>>>>>>> system hangs. The casting errors are due to the run-away peers executing the
>>>>>>>> wrong BSP steps reading wrong messages. Queries without repeat though are
>>>>>>>> OK.
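
The stopping rule in the quoted message above (the repeat loop may end only when ALL peers have reached the precision, and peers that still need to iterate tell the others to continue) can be illustrated with a small single-process simulation. This is only a sketch, not MRQL or Hama code: the names `superstep_votes` and `local_only_decisions`, the per-peer error values, and the list-based inboxes are all assumptions made for illustration, and a real BSP program would place a barrier synchronization between sending and reading.

```python
def superstep_votes(local_errors, precision):
    """Simulate one vote exchange; return each peer's continue/stop decision."""
    num_peers = len(local_errors)
    inboxes = [[] for _ in range(num_peers)]

    # Every peer that has not yet reached the precision sends a "continue"
    # message to all peers, itself included.
    for sender, error in enumerate(local_errors):
        if error > precision:
            for inbox in inboxes:
                inbox.append(sender)

    # A real BSP program would synchronize at a barrier here, so that all
    # messages are delivered before any peer reads its inbox.

    # A peer continues if it received at least one "continue" message, so
    # every peer reaches the same decision.
    return [len(inbox) > 0 for inbox in inboxes]


def local_only_decisions(local_errors, precision):
    """Roughly the failure mode: each peer decides from its own error alone."""
    return [error > precision for error in local_errors]
```

For example, with per-peer errors `[0.5, 0.001, 0.2]` and precision `0.01`, `superstep_votes` returns `[True, True, True]` while `local_only_decisions` returns `[True, False, True]`; in the second case peer 1 would leave the loop while the others keep iterating.
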
>>>>>>>> By the way, I had a problem exchanging a large amount of data during sync (I
>>>>>>>> discussed this with Thomas). My solution was to break a BSP superstep
>>>>>>>> into multiple substeps so that each substep can handle a max number of
>>>>>>>> messages. Of course my program has to collect all messages in a vector in
>>>>>>>> memory. When the vector is too big, it is spilled to a local file. This
>>>>>>>> moved the problem from the Hama side to my side and allowed me to handle
>>>>>>>> larger data, especially in joins. I think this problem of exchanging a large
>>>>>>>> amount of data during a superstep is currently a weakness of Hama.
>>>>>>>> Leonidas
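
The workaround described in the quoted message above (splitting a superstep into substeps with a bounded number of messages, and spilling the collected messages to a local file when the in-memory vector grows too large) might look roughly like the sketch below. It is not the MRQL implementation: `substeps`, `SpillingBuffer`, the limits, and the use of Python's `pickle` and `tempfile` modules are assumptions chosen for illustration.

```python
import os
import pickle
import tempfile


def substeps(messages, max_per_substep):
    """Yield successive batches of at most max_per_substep messages."""
    for start in range(0, len(messages), max_per_substep):
        yield messages[start:start + max_per_substep]


class SpillingBuffer:
    """Collect messages in memory and spill them to a temporary file when full."""

    def __init__(self, max_in_memory):
        self.max_in_memory = max_in_memory
        self.memory = []
        self.spill = None   # temporary file, created on the first spill
        self.spilled = 0    # number of messages written to the file

    def add(self, message):
        self.memory.append(message)
        if len(self.memory) > self.max_in_memory:
            self._flush()

    def _flush(self):
        if self.spill is None:
            self.spill = tempfile.TemporaryFile()
        self.spill.seek(0, os.SEEK_END)  # always append after earlier spills
        for message in self.memory:
            pickle.dump(message, self.spill)
        self.spilled += len(self.memory)
        self.memory = []

    def __iter__(self):
        # Replay spilled messages first (they are older), then the in-memory tail.
        if self.spill is not None:
            self.spill.seek(0)
            for _ in range(self.spilled):
                yield pickle.load(self.spill)
        yield from self.memory
```

A receiver would call `add` for every message of every substep and iterate over the buffer once the last substep has finished; the arrival order is preserved because spilled messages are replayed before the ones still in memory.
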
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 08/24/2012 04:15 AM, Thomas Jungblut wrote:
>>>>>>>>
>>>>>>>>> BTW, should we feature this on our website?
>>>>>>>>>
>>>>>>>>> 2012/8/24 Thomas Jungblut <thomas.jungblut@gmail.com>
>>>>>>>>>
>>>>>>>>>> Hi Leonidas!
>>>>>>>>>>
>>>>>>>>>> I have to admit that I have known what is going on (and had to keep
>>>>>>>>>> silent), but I have to say: Thank you very much!
>>>>>>>>>> This will help many people write BSPs in an easier way.
>>>>>>>>>>
>>>>>>>>>> Of course this is not as fast as native BSP code; Hive and Pig
>>>>>>>>>> suffer from the same problems in MR.
>>>>>>>>>> But it gives people the opportunity to develop faster and get their
>>>>>>>>>> code in production with just a minor time expense.
>>>>>>>>>>
>>>>>>>>>> And I think that we will gladly help you improve the BSP part of
>>>>>>>>>> your framework. At least I would ;)
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>> 2012/8/24 Edward J. Yoon <ed...@apache.org>
>>>>>>>>>>
>>>>>>>>>>> Here are my few test results on Oracle BDA (40G/s infiniband network).
>>>>>>>>>>>
>>>>>>>>>>> It seems slower than our PageRank example.
>>>>>>>>>>>
>>>>>>>>>>> P.S., There are some errors so I couldn't test large-scale.
>>>>>>>>>>> (java.lang.ClassCastException: hadoop.mrql.MR_int cannot be cast to
>>>>>>>>>>> hadoop.mrql.Inv and java.lang.Error: Cannot clear a non-materialized
>>>>>>>>>>> sequence ..., etc.)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> == 100K nodes and 1M edges ==
>>>>>>>>>>>
>>>>>>>>>>> *** Using 10 BSP tasks (out of a max 10). Each task will handle about
>>>>>>>>>>> 2383611 bytes of input data.
>>>>>>>>>>>
>>>>>>>>>>> Run time: 30.384 secs
>>>>>>>>>>>
>>>>>>>>>>> *** Using 20 BSP tasks (out of a max 20). Each task will handle about
>>>>>>>>>>> 1191805 bytes of input data.
>>>>>>>>>>>
>>>>>>>>>>> Run time: 24.412 secs
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Aug 24, 2012 at 9:36 AM, Edward J. Yoon
>>>>>>>>>>> <ed...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Wow, very interesting. I'm going to install and test on my large
>>>>>>>>>>>> cluster.
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Aug 24, 2012 at 4:41 AM, Leonidas Fegaras
>>>>>>>>>>>> <fe...@cse.uta.edu> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Dear Hama users,
>>>>>>>>>>>>> I am pleased to announce that the MRQL query processing system can now
>>>>>>>>>>>>> evaluate SQL-like queries on a Hama cluster. MRQL is available at:
>>>>>>>>>>>>>
>>>>>>>>>>>>> http://lambda.uta.edu/mrql/
>>>>>>>>>>>>>
>>>>>>>>>>>>> MRQL (the Map-Reduce Query Language) is an SQL-like query language for
>>>>>>>>>>>>> large-scale, distributed data analysis. MRQL is powerful enough to
>>>>>>>>>>>>> express most common data analysis tasks over many different kinds of
>>>>>>>>>>>>> raw data, including hierarchical data and nested collections, such as
>>>>>>>>>>>>> XML data. MRQL can run in two modes: in MR (Map-Reduce) mode using
>>>>>>>>>>>>> Apache Hadoop and in BSP (Bulk Synchronous Parallel) mode using Apache
>>>>>>>>>>>>> Hama. Both modes use Apache's HDFS to read and write their data.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Note that, the BSP mode is currently experimental (not fine-tuned yet)
>>>>>>>>>>>>> and lacks any fault-tolerance (if an error occurs, the entire job must
>>>>>>>>>>>>> be restarted). Due to our limited resources, MRQL has only been tested
>>>>>>>>>>>>> on a small cluster (7-nodes/28-cores). We compared the BSP mode with
>>>>>>>>>>>>> the MR mode by evaluating a pagerank query over a small graph (100K
>>>>>>>>>>>>> nodes, 1M edges) and found that BSP mode is about 4.5 times faster
>>>>>>>>>>>>> than the MR mode. Please let me know if you'd like to contribute to
>>>>>>>>>>>>> this project by testing MRQL on a larger cluster.
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>> Leonidas Fegaras
>>>>>>>>>>>>> University of Texas at Arlington
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Best Regards, Edward J. Yoon
>>>>>>>>>>>> @eddieyoon
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Best Regards, Edward J. Yoon
>>>>>>>>>>> @eddieyoon
>>>>>>>>>>>
>>>>>>>>>>> .
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Best Regards, Edward J. Yoon
>>>>>>> @eddieyoon
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>

Re: [ANNOUNCEMENT] A query system for BSP processing

Posted by Thomas Jungblut <th...@gmail.com>.

Yep, a subproject would be the alternative.
In this case we would give you PMC and committer rights so you can actively
work on that.
However this would make the mapreduce part more or less useless, so if you
want to go the hybrid way, feel free to submit an incubation request.

2012/9/7 Suraj Menon <su...@apache.org>

> I think Thomas has a point. How about making it a sub-module/sub-project of
> Hama for now? If/When it gains enough community support to make it a top
> level project, you can fork it as a separate project.
> I am not completely aware of the procedures and requirements for getting
> external project as sub-project.
> We can look into it if you are ready to take this route.
>
> > Could you please send me a link for setting up an open-source Apache
> > project?
> If I am right this is what you are looking for -
> http://incubator.apache.org/guides/proposal.html
> http://incubator.apache.org/sitemap.html
>
> Good luck,
> Suraj

Re: [ANNOUNCEMENT] A query system for BSP processing

Posted by Suraj Menon <su...@apache.org>.

I think Thomas has a point. How about making it a sub-module/sub-project of
Hama for now? If/When it gains enough community support to make it a top
level project, you can fork it as a separate project.
I am not completely aware of the procedures and requirements for adopting
an external project as a sub-project.
We can look into it if you are ready to take this route.

> Could you please send me a link for setting up an open-source Apache
> project?
If I am right this is what you are looking for -
http://incubator.apache.org/guides/proposal.html
http://incubator.apache.org/sitemap.html

Good luck,
Suraj

On Fri, Sep 7, 2012 at 11:40 AM, Thomas Jungblut
<th...@gmail.com> wrote:

> Although I think this is a great project, I think that you will not meet
> the requirements.
> You need a community and a charter to get it into the incubation.
>
> What about hosting it on Github?

Re: [ANNOUNCEMENT] A query system for BSP processing

Posted by Thomas Jungblut <th...@gmail.com>.

Although I think this is a great project, I think that you will not meet
the requirements.
You need a community and a charter to get it into incubation.

What about hosting it on Github?

2012/9/7 Leonidas Fegaras <fe...@cse.uta.edu>

> Yes, this is a great idea. I have used GIT on my own server but I don't
> know how to do this for ASF. Could you please send me a link for setting up
> an open-source Apache project?
>
>>>>>>>> On Fri, Aug 24, 2012 at 9:36 AM, Edward J. Yoon
>>>>>>>> <ed...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Wow, very interesting. I'm going to install and test on my large
>>>>>>>>>
>>>>>>>>
>>>>>>>> cluster.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Aug 24, 2012 at 4:41 AM, Leonidas Fegaras
>>>>>>>>> <fe...@cse.uta.edu>
>>>>>>>>>
>>>>>>>>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Dear Hama users,
>>>>>>>>>> I am pleased to announce that the MRQL query processing system can
>>>>>>>>>> now
>>>>>>>>>> evaluate SQL-like queries on a Hama cluster. MRQL is available at:
>>>>>>>>>>
>>>>>>>>>> http://lambda.uta.edu/mrql/
>>>>>>>>>>
>>>>>>>>>> MRQL (the Map-Reduce Query Language) is an SQL-like query language
>>>>>>>>>> for
>>>>>>>>>> large-scale, distributed data analysis. MRQL is powerful enough to
>>>>>>>>>> express most common data analysis tasks over many different kinds
>>>>>>>>>> of
>>>>>>>>>> raw data, including hierarchical data and nested collections, such
>>>>>>>>>> as
>>>>>>>>>> XML data. MRQL can run in two modes: in MR (Map-Reduce) mode using
>>>>>>>>>> Apache Hadoop and in BSP (Bulk Synchronous Parallel) mode using
>>>>>>>>>> Apache
>>>>>>>>>> Hama. Both modes use Apache's HDFS to read and write their data.
>>>>>>>>>>
>>>>>>>>>> Note that, the BSP mode is currently experimental (not fine-tuned
>>>>>>>>>> yet)
>>>>>>>>>> and lacks any fault-tolerance (if an error occurs, the entire job
>>>>>>>>>> must
>>>>>>>>>> be restarted). Due to our limited resources, MRQL has only been
>>>>>>>>>> tested
>>>>>>>>>> on a small cluster (7-nodes/28-cores). We compared the BSP mode
>>>>>>>>>> with
>>>>>>>>>> the MR mode by evaluating a pagerank query over a small graph
>>>>>>>>>> (100K
>>>>>>>>>> nodes, 1M edges) and found that BSP mode is about 4.5 times faster
>>>>>>>>>> than the MR mode. Please let me know if you'd like to contribute
>>>>>>>>>> to
>>>>>>>>>> this project by testing MRQL on a larger cluster.
>>>>>>>>>> Best regards,
>>>>>>>>>> Leonidas Fegaras
>>>>>>>>>> University of Texas at Arlington
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Best Regards, Edward J. Yoon
>>>>>>>>> @eddieyoon
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best Regards, Edward J. Yoon
>>>>>>>> @eddieyoon
>>>>>>>>
>>>>>>>>  .
>>>>>>
>>>>>>
>>>>
>>>> --
>>>> Best Regards, Edward J. Yoon
>>>> @eddieyoon
>>>>
>>>
>>>
>>
>>
>

Re: [ANNOUNCEMENT] A query system for BSP processing

Posted by Leonidas Fegaras <fe...@cse.uta.edu>.

Yes, this is a great idea. I have used Git on my own server, but I don't
know how to do this at the ASF. Could you please send me a link on setting
up an open-source Apache project?

On 09/05/2012 10:51 AM, Edward J. Yoon wrote:
> If you can open-source this, then I'm sure the ASF community can help
> you and make this software better.
>
> Please feel free to ask us if you need any assistance donating source
> code to the ASF or contributing to the Hama project in the future.

Re: [ANNOUNCEMENT] A query system for BSP processing

Posted by "Edward J. Yoon" <ed...@apache.org>.

If you can open-source this, then I'm sure the ASF community can help
you and make this software better.

Please feel free to ask us if you need any assistance donating source
code to the ASF or contributing to the Hama project in the future.

On Thu, Aug 30, 2012 at 11:40 PM, Leonidas Fegaras <fe...@cse.uta.edu> wrote:
> Yes, sure. I have fixed the bug in the repeat stopping condition, but I
> have only tested PageRank on my small cluster. I still need to fix the
> k-means clustering (it's a special case because you improve a fixed
> number of points).
> Leonidas



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: [ANNOUNCEMENT] A query system for BSP processing

Posted by Leonidas Fegaras <fe...@cse.uta.edu>.

Yes, sure. I have fixed the bug in the repeat stopping condition, but
I have only tested PageRank on my small cluster. I still need to fix
the k-means clustering (it's a special case because you improve a
fixed number of points).
Leonidas

On Aug 30, 2012, at 9:02 AM, Edward J. Yoon wrote:

> Shall we work together?

Re: [ANNOUNCEMENT] A query system for BSP processing

Posted by "Edward J. Yoon" <ed...@apache.org>.

Shall we work together?

On Fri, Aug 24, 2012 at 9:01 PM, Leonidas Fegaras <fe...@cse.uta.edu> wrote:
> Thank you very much for your interest and for testing my system.
> It seems that my release was premature: it worked for some random data
> sets but not for others. It's a minor logic error that I will try to
> fix in the next few days. The problem is with the stopping condition of
> the repeat expression that calculates the new pagerank from the old. It
> must stop only when ALL peers reach the specified precision. This is
> done by having those peers that need to continue send a message telling
> the others to continue. Right now, the program works fine when all
> peers agree at the same time. But if one peer finishes sooner, instead
> of continuing the repeat loop, it runs away to the next BSP step that
> follows the repeat, then exits prematurely, and the system hangs. The
> casting errors are due to the runaway peers executing the wrong BSP
> steps and reading the wrong messages. Queries without repeat are OK,
> though.
> By the way, I had a problem exchanging large amounts of data during sync
> (I discussed this with Thomas). My solution was to break a BSP
> superstep into multiple substeps so that each substep handles at most a
> fixed maximum number of messages. Of course, my program has to collect
> all messages in a vector in memory. When the vector gets too big, it is
> spilled to a local file. This moved the problem from the Hama side to my
> side and allowed me to handle larger data, especially in joins. I think
> this problem of exchanging large amounts of data during a superstep is
> currently a weakness of Hama.
> Leonidas



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: [ANNOUNCEMENT] A query system for BSP processing

Posted by Leonidas Fegaras <fe...@cse.uta.edu>.

Thank you very much for your interest and for testing my system.
It seems that my release was premature: it worked for some random data
sets but not for others. It's a minor logic error that I will try to
fix in the next few days. The problem is with the stopping condition of
the repeat expression that calculates the new pagerank from the old. It
must stop only when ALL peers reach the specified precision. This is
done by having those peers that need to continue send a message telling
the others to continue. Right now, the program works fine when all
peers agree at the same time. But if one peer finishes sooner, instead
of continuing the repeat loop, it runs away to the next BSP step that
follows the repeat, then exits prematurely, and the system hangs. The
casting errors are due to the runaway peers executing the wrong BSP
steps and reading the wrong messages. Queries without repeat are OK,
though.
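
The stopping rule described above can be sketched as a small simulation.
This is plain Python, not MRQL or Hama code; the function names and the
sample error values are hypothetical and only illustrate why every peer
has to base its decision on the "continue" messages it receives rather
than on its own local precision.

```python
from typing import List


def global_stop_step(errors: List[List[float]], precision: float) -> int:
    """Return how many supersteps run before ALL peers stop together.

    errors[step][peer] is the local change that `peer` observes in `step`.
    """
    num_peers = len(errors[0])
    for step, step_errors in enumerate(errors):
        # Send phase: every peer still above the precision sends a
        # "continue" vote to all peers, itself included.
        inbox = [0] * num_peers
        for sender in range(num_peers):
            if step_errors[sender] > precision:
                for receiver in range(num_peers):
                    inbox[receiver] += 1
        # The barrier (sync) would sit here. Afterwards each peer reads
        # only its own inbox; all inboxes hold the same number of votes,
        # so the peers leave the loop in the same superstep.
        if all(votes == 0 for votes in inbox):
            return step + 1
    return len(errors)


def local_stop_step(errors: List[List[float]], precision: float, peer: int) -> int:
    """Faulty variant: `peer` stops as soon as its own error is small."""
    for step, step_errors in enumerate(errors):
        if step_errors[peer] <= precision:
            return step + 1
    return len(errors)


errors = [
    [0.50, 0.20, 0.90],
    [0.05, 0.20, 0.30],  # peer 0 has converged, peers 1 and 2 have not
    [0.01, 0.02, 0.20],
    [0.01, 0.01, 0.01],  # every peer is within the precision
]
print(global_stop_step(errors, 0.1))    # 4: all peers leave together
print(local_stop_step(errors, 0.1, 0))  # 2: peer 0 alone would run away
```

With the local rule, peer 0 leaves the loop two supersteps before the
others, which corresponds to the runaway behaviour described above.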
By the way, I had a problem exchanging large amounts of data during sync
(I discussed this with Thomas). My solution was to break a BSP
superstep into multiple substeps so that each substep handles at most a
fixed maximum number of messages. Of course, my program has to collect
all messages in a vector in memory. When the vector gets too big, it is
spilled to a local file. This moved the problem from the Hama side to my
side and allowed me to handle larger data, especially in joins. I think
this problem of exchanging large amounts of data during a superstep is
currently a weakness of Hama.
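
A minimal sketch of the substep-and-spill idea, assuming a single
receiver and messages that can be pickled. It is plain Python rather
than MRQL or Hama code; `split_into_substeps`, `SpillingBuffer`,
`receive_superstep`, and both size limits are hypothetical names chosen
for illustration.

```python
import pickle
import tempfile
from typing import Any, Iterator, List


def split_into_substeps(messages: List[Any], max_messages: int) -> List[List[Any]]:
    """Cut one superstep's messages into substeps of bounded size."""
    if max_messages <= 0:
        raise ValueError("max_messages must be positive")
    return [messages[i:i + max_messages]
            for i in range(0, len(messages), max_messages)]


class SpillingBuffer:
    """Collects messages in memory and spills full batches to a local file."""

    def __init__(self, max_in_memory: int) -> None:
        self.max_in_memory = max_in_memory
        self.memory: List[Any] = []
        self.spill_file = tempfile.TemporaryFile()  # local scratch file
        self.spilled = 0

    def add(self, message: Any) -> None:
        self.memory.append(message)
        if len(self.memory) >= self.max_in_memory:
            # Append the in-memory batch to the end of the spill file.
            self.spill_file.seek(0, 2)
            for item in self.memory:
                pickle.dump(item, self.spill_file)
            self.spilled += len(self.memory)
            self.memory = []

    def __iter__(self) -> Iterator[Any]:
        # Replay spilled messages in arrival order, then the in-memory tail.
        self.spill_file.seek(0)
        for _ in range(self.spilled):
            yield pickle.load(self.spill_file)
        yield from self.memory


def receive_superstep(messages: List[Any], max_messages: int,
                      max_in_memory: int) -> SpillingBuffer:
    buffer = SpillingBuffer(max_in_memory)
    for chunk in split_into_substeps(messages, max_messages):
        # In a real BSP job a barrier would separate consecutive substeps.
        for message in chunk:
            buffer.add(message)
    return buffer


buffer = receive_superstep(list(range(10)), max_messages=3, max_in_memory=4)
print(list(buffer) == list(range(10)), buffer.spilled, len(buffer.memory))
```

The substep size bounds how much is exchanged per sync, while the spill
threshold bounds the receiver's memory; the two limits are independent.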
Leonidas

On 08/24/2012 04:15 AM, Thomas Jungblut wrote:
> BTW, should we feature this on our website?

Re: [ANNOUNCEMENT] A query system for BSP processing

Posted by Thomas Jungblut <th...@gmail.com>.

BTW, should we feature this on our website?

2012/8/24 Thomas Jungblut <th...@gmail.com>

> Hi Leonidas!
>
> I have to admit that I have known what is going on (and had to keep
> silent), but I have to say: Thank you very much!
> This will help many people writing BSPs in a more easier way.
>
> Of course this is not as fast as the native BSP code, Hive and Pig suffer
> from the same problems in MR.
> But it gives people the opportunity to develop faster and get their code
> in production with just a minor time expense.
>
> And I think, that we will help you gladly on improving the BSP part of
> your framework. At least I would do ;)
>
> Thanks!
>
> 2012/8/24 Edward J. Yoon <ed...@apache.org>
>
> Here's my few test results on Oracle BDA (40G/s infiniband network).
>> It seems slow than our PageRank example.
>>
>> P.S., There are some errors so I couldn't test large-scale.
>> (java.lang.ClassCastException: hadoop.mrql.MR_int cannot be cast to
>> hadoop.mrql.Inv and java.lang.Error: Cannot clear a non-materialized
>> sequence ..., etc.)
>>
>>
>>
>> == 100K nodes and 1M edges ==
>>
>> *** Using 10 BSP tasks (out of a max 10). Each task will handle about
>> 2383611 bytes of input data.
>>
>> Run time: 30.384 secs
>>
>> *** Using 20 BSP tasks (out of a max 20). Each task will handle about
>> 1191805 bytes of input data.
>>
>> Run time: 24.412 secs
>>
>> On Fri, Aug 24, 2012 at 9:36 AM, Edward J. Yoon <ed...@apache.org>
>> wrote:
>> > Wow, very interesting. I'm going to install and test on my large
>> cluster.
>> >
>> > On Fri, Aug 24, 2012 at 4:41 AM, Leonidas Fegaras <fe...@cse.uta.edu>
>> wrote:
>> >> Dear Hama users,
>> >> I am pleased to announce that the MRQL query processing system can now
>> >> evaluate SQL-like queries on a Hama cluster. MRQL is available at:
>> >>
>> >> http://lambda.uta.edu/mrql/
>> >>
>> >> MRQL (the Map-Reduce Query Language) is an SQL-like query language for
>> >> large-scale, distributed data analysis. MRQL is powerful enough to
>> >> express most common data analysis tasks over many different kinds of
>> >> raw data, including hierarchical data and nested collections, such as
>> >> XML data. MRQL can run in two modes: in MR (Map-Reduce) mode using
>> >> Apache Hadoop and in BSP (Bulk Synchronous Parallel) mode using Apache
>> >> Hama. Both modes use Apache's HDFS to read and write their data.
>> >>
>> >> Note that, the BSP mode is currently experimental (not fine-tuned yet)
>> >> and lacks any fault-tolerance (if an error occurs, the entire job must
>> >> be restarted). Due to our limited resources, MRQL has only been tested
>> >> on a small cluster (7-nodes/28-cores). We compared the BSP mode with
>> >> the MR mode by evaluating a pagerank query over a small graph (100K
>> >> nodes, 1M edges) and found that BSP mode is about 4.5 times faster
>> >> than the MR mode. Please let me know if you'd like to contribute to
>> >> this project by testing MRQL on a larger cluster.
>> >> Best regards,
>> >> Leonidas Fegaras
>> >> University of Texas at Arlington
>> >>
>> >
>> >
>> >
>> > --
>> > Best Regards, Edward J. Yoon
>> > @eddieyoon
>>
>>
>>
>> --
>> Best Regards, Edward J. Yoon
>> @eddieyoon
>>
>
>

Re: [ANNOUNCEMENT] A query system for BSP processing

Posted by Thomas Jungblut <th...@gmail.com>.

Hi Leonidas!

I have to admit that I knew this was in the works (and had to keep
silent), but I have to say: Thank you very much!
This will help many people write BSP programs more easily.

Of course, this is not as fast as native BSP code; Hive and Pig suffer
from the same problem in MR.
But it gives people the opportunity to develop faster and get their code
into production with only a minor time investment.

And I think we will gladly help you improve the BSP part of your
framework. At least I would ;)

Thanks!

2012/8/24 Edward J. Yoon <ed...@apache.org>

> Here's my few test results on Oracle BDA (40G/s infiniband network).
> It seems slow than our PageRank example.
>
> P.S., There are some errors so I couldn't test large-scale.
> (java.lang.ClassCastException: hadoop.mrql.MR_int cannot be cast to
> hadoop.mrql.Inv and java.lang.Error: Cannot clear a non-materialized
> sequence ..., etc.)
>
>
>
> == 100K nodes and 1M edges ==
>
> *** Using 10 BSP tasks (out of a max 10). Each task will handle about
> 2383611 bytes of input data.
>
> Run time: 30.384 secs
>
> *** Using 20 BSP tasks (out of a max 20). Each task will handle about
> 1191805 bytes of input data.
>
> Run time: 24.412 secs
>
> On Fri, Aug 24, 2012 at 9:36 AM, Edward J. Yoon <ed...@apache.org>
> wrote:
> > Wow, very interesting. I'm going to install and test on my large cluster.
> >
> > On Fri, Aug 24, 2012 at 4:41 AM, Leonidas Fegaras <fe...@cse.uta.edu>
> wrote:
> >> Dear Hama users,
> >> I am pleased to announce that the MRQL query processing system can now
> >> evaluate SQL-like queries on a Hama cluster. MRQL is available at:
> >>
> >> http://lambda.uta.edu/mrql/
> >>
> >> MRQL (the Map-Reduce Query Language) is an SQL-like query language for
> >> large-scale, distributed data analysis. MRQL is powerful enough to
> >> express most common data analysis tasks over many different kinds of
> >> raw data, including hierarchical data and nested collections, such as
> >> XML data. MRQL can run in two modes: in MR (Map-Reduce) mode using
> >> Apache Hadoop and in BSP (Bulk Synchronous Parallel) mode using Apache
> >> Hama. Both modes use Apache's HDFS to read and write their data.
> >>
> >> Note that, the BSP mode is currently experimental (not fine-tuned yet)
> >> and lacks any fault-tolerance (if an error occurs, the entire job must
> >> be restarted). Due to our limited resources, MRQL has only been tested
> >> on a small cluster (7-nodes/28-cores). We compared the BSP mode with
> >> the MR mode by evaluating a pagerank query over a small graph (100K
> >> nodes, 1M edges) and found that BSP mode is about 4.5 times faster
> >> than the MR mode. Please let me know if you'd like to contribute to
> >> this project by testing MRQL on a larger cluster.
> >> Best regards,
> >> Leonidas Fegaras
> >> University of Texas at Arlington
> >>
> >
> >
> >
> > --
> > Best Regards, Edward J. Yoon
> > @eddieyoon
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon
>

Re: [ANNOUNCEMENT] A query system for BSP processing

Posted by "Edward J. Yoon" <ed...@apache.org>.

Here are a few test results of mine on Oracle BDA (40 Gb/s InfiniBand network).
It seems slower than our PageRank example.

P.S. I hit some errors, so I couldn't test at large scale
(java.lang.ClassCastException: hadoop.mrql.MR_int cannot be cast to
hadoop.mrql.Inv and java.lang.Error: Cannot clear a non-materialized
sequence ..., etc.).



== 100K nodes and 1M edges ==

*** Using 10 BSP tasks (out of a max 10). Each task will handle about
2383611 bytes of input data.

Run time: 30.384 secs

*** Using 20 BSP tasks (out of a max 20). Each task will handle about
1191805 bytes of input data.

Run time: 24.412 secs
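
As a rough reading of the two runs above, doubling the BSP tasks from 10
to 20 cut the run time by only about a fifth. The short Python sketch
below works out the implied speedup and parallel efficiency. The
`scaling` helper and its names are hypothetical, and it assumes the two
run times are directly comparable (same query, same input, single
measurements).

```python
# Run times reported above for the 100K-node / 1M-edge test.
# Keys are the number of BSP tasks, values are seconds.
runs = {10: 30.384, 20: 24.412}


def scaling(runs, base_tasks, more_tasks):
    """Return (speedup, efficiency) when going from base_tasks to more_tasks.

    speedup    = time with fewer tasks / time with more tasks
    efficiency = speedup / (factor by which the task count grew)
    """
    speedup = runs[base_tasks] / runs[more_tasks]
    efficiency = speedup / (more_tasks / base_tasks)
    return speedup, efficiency


speedup, efficiency = scaling(runs, 10, 20)
print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
```

Two data points are too few for firm conclusions. With an input this
small (about 24 MB in total), fixed start-up and synchronization costs
may account for much of the run time.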

On Fri, Aug 24, 2012 at 9:36 AM, Edward J. Yoon <ed...@apache.org> wrote:
> Wow, very interesting. I'm going to install and test on my large cluster.
>
> On Fri, Aug 24, 2012 at 4:41 AM, Leonidas Fegaras <fe...@cse.uta.edu> wrote:
>> Dear Hama users,
>> I am pleased to announce that the MRQL query processing system can now
>> evaluate SQL-like queries on a Hama cluster. MRQL is available at:
>>
>> http://lambda.uta.edu/mrql/
>>
>> MRQL (the Map-Reduce Query Language) is an SQL-like query language for
>> large-scale, distributed data analysis. MRQL is powerful enough to
>> express most common data analysis tasks over many different kinds of
>> raw data, including hierarchical data and nested collections, such as
>> XML data. MRQL can run in two modes: in MR (Map-Reduce) mode using
>> Apache Hadoop and in BSP (Bulk Synchronous Parallel) mode using Apache
>> Hama. Both modes use Apache's HDFS to read and write their data.
>>
>> Note that, the BSP mode is currently experimental (not fine-tuned yet)
>> and lacks any fault-tolerance (if an error occurs, the entire job must
>> be restarted). Due to our limited resources, MRQL has only been tested
>> on a small cluster (7-nodes/28-cores). We compared the BSP mode with
>> the MR mode by evaluating a pagerank query over a small graph (100K
>> nodes, 1M edges) and found that BSP mode is about 4.5 times faster
>> than the MR mode. Please let me know if you'd like to contribute to
>> this project by testing MRQL on a larger cluster.
>> Best regards,
>> Leonidas Fegaras
>> University of Texas at Arlington
>>
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: [ANNOUNCEMENT] A query system for BSP processing

Posted by "Edward J. Yoon" <ed...@apache.org>.

Wow, very interesting. I'm going to install and test it on my large cluster.

On Fri, Aug 24, 2012 at 4:41 AM, Leonidas Fegaras <fe...@cse.uta.edu> wrote:
> Dear Hama users,
> I am pleased to announce that the MRQL query processing system can now
> evaluate SQL-like queries on a Hama cluster. MRQL is available at:
>
> http://lambda.uta.edu/mrql/
>
> MRQL (the Map-Reduce Query Language) is an SQL-like query language for
> large-scale, distributed data analysis. MRQL is powerful enough to
> express most common data analysis tasks over many different kinds of
> raw data, including hierarchical data and nested collections, such as
> XML data. MRQL can run in two modes: in MR (Map-Reduce) mode using
> Apache Hadoop and in BSP (Bulk Synchronous Parallel) mode using Apache
> Hama. Both modes use Apache's HDFS to read and write their data.
>
> Note that, the BSP mode is currently experimental (not fine-tuned yet)
> and lacks any fault-tolerance (if an error occurs, the entire job must
> be restarted). Due to our limited resources, MRQL has only been tested
> on a small cluster (7-nodes/28-cores). We compared the BSP mode with
> the MR mode by evaluating a pagerank query over a small graph (100K
> nodes, 1M edges) and found that BSP mode is about 4.5 times faster
> than the MR mode. Please let me know if you'd like to contribute to
> this project by testing MRQL on a larger cluster.
> Best regards,
> Leonidas Fegaras
> University of Texas at Arlington
>



-- 
Best Regards, Edward J. Yoon
@eddieyoon