Posted to dev@mrql.apache.org by "Leonidas Fegaras (JIRA)" <ji...@apache.org> on 2015/01/26 19:43:35 UTC

[jira] [Updated] (MRQL-63) Add support for MRQL streaming in spark streaming mode

     [ https://issues.apache.org/jira/browse/MRQL-63?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leonidas Fegaras updated MRQL-63:
---------------------------------
    Description: 
This patch introduces a major extension to MRQL, called MRQL streaming.
We can now run continuous MRQL queries on streams of data.
Currently, it works on Spark Streaming only but we may add support for Flink Streaming and/or Storm in the future.
It has been tested in Spark local mode and in Spark distributed mode on a Yarn cluster.

MRQL now supports window-based streaming over a sliding window of a given time interval. To enable MRQL streaming, add the parameter "-stream t" to the mrql command, where t is the time interval in milliseconds. MRQL will then process the new batch of data in the input streams every t milliseconds.
A stream source in MRQL takes the form stream(...), which has the same parameters as the source(...) form. For example:
{code:SQL}
select (k,avg(p.Y))
from p in stream(binary,"tmp/points.bin")
group by k: p.X;
{code}
This query processes all sequence files in the directory tmp/points.bin and then checks this directory every t milliseconds for new files. When a new file is added to the directory (or the modification time of an existing file changes), it processes the new files. A query may read multiple files and may contain both stream and regular data sources. If there is at least one stream source, the query becomes continuous (it never stops). One may dump the output stream to binary or CSV files using the existing MRQL syntax:
{code:SQL}
store "tmp/out" from e
{code}
This dumps the output of the continuous query e into tmp/out/f1, tmp/out/f2, ... etc.
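Since stream and regular data sources may be mixed in one query, a continuous query can, for example, join a stream against a static data set. The following sketch is illustrative only: the static source tmp/static.bin and its X field are hypothetical, not part of this patch:
{code:SQL}
-- join the points stream against a hypothetical static file;
-- the stream source is re-read every t milliseconds, which keeps
-- the query continuous
select (k, avg(p.Y))
from p in stream(binary,"tmp/points.bin"),
     s in source(binary,"tmp/static.bin")
where p.X = s.X
group by k: p.X;
{code}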

Example for testing:
First create data:
{quote}
mrql.spark -local queries/points.mrql 100
{quote}
Then run the continuous query:
{quote}
mrql.spark -local -stream 1000 queries/streaming.mrql
{quote}
In a separate terminal, you can type:
{quote}
touch tmp/points.bin/part-00000
{quote}
to process a new batch of data.
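The touch step above can be scripted to simulate a continuous feed. A minimal sketch, assuming the setup above (tmp/points.bin created by points.mrql, and a running query started with -stream 1000, so the sleep covers at least one interval):

```shell
# Simulate new batches arriving: each new or touched file in the watched
# directory is picked up on the next -stream interval of the running query.
mkdir -p tmp/points.bin
for i in 1 2 3; do
  touch "tmp/points.bin/part-0000$i"  # new modification time => new batch
  sleep 1                             # wait at least one -stream interval
done
```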



> Add support for MRQL streaming in spark streaming mode
> ------------------------------------------------------
>
>                 Key: MRQL-63
>                 URL: https://issues.apache.org/jira/browse/MRQL-63
>             Project: MRQL
>          Issue Type: Improvement
>          Components: Streaming
>    Affects Versions: 0.9.4
>            Reporter: Leonidas Fegaras
>            Assignee: Leonidas Fegaras
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)