Posted to user@spark.apache.org by Oded Maimon <od...@scene53.com> on 2015/07/12 15:49:04 UTC

Few basic spark questions

Hi All,
We are evaluating Spark for real-time analytics. What we are trying to do
is the following:

   - READER APP - use a custom receiver (written in Scala) to get data
   from RabbitMQ
   - ANALYZER APP - use a SparkR application to read the data (windowed),
   analyze it every minute, and save the results inside Spark
   - OUTPUT APP - use a Spark application (Scala/Java/Python) to read the
   results from R every X minutes and send the data to a few external
   systems

Basically, at the end I would like to have the READER COMPONENT as an app
that always consumes the data and keeps it in Spark,
have as many ANALYZER COMPONENTS as my data scientists want, and have one
OUTPUT APP that will read the ANALYZER results and send them to any
relevant system.

What is the right way to do it?

Thanks,
Oded.


Re: Few basic spark questions

Posted by Feynman Liang <fl...@databricks.com>.
You could implement the receiver as a Spark Streaming Receiver
<https://spark.apache.org/docs/latest/streaming-programming-guide.html#input-dstreams-and-receivers>;
the data received would be available for any streaming applications which
operate on DStreams (e.g. Streaming KMeans
<https://spark.apache.org/docs/latest/mllib-clustering.html#streaming-k-means>
).
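
A minimal, untested sketch of what such a receiver could look like, assuming
the RabbitMQ Java client (the host and queue names are placeholders, and
error handling / restart logic is elided):

import com.rabbitmq.client.{AMQP, ConnectionFactory, DefaultConsumer, Envelope}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class RabbitMQReceiver(host: String, queue: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Consume on a separate thread so that onStart() returns quickly
    new Thread("RabbitMQ Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {} // connection teardown elided in this sketch

  private def receive(): Unit = {
    val factory = new ConnectionFactory()
    factory.setHost(host)
    val channel = factory.newConnection().createChannel()
    channel.basicConsume(queue, true, new DefaultConsumer(channel) {
      override def handleDelivery(tag: String, env: Envelope,
          props: AMQP.BasicProperties, body: Array[Byte]): Unit = {
        // store() hands the message to Spark, which replicates it and
        // makes it available to DStream transformations
        store(new String(body, "UTF-8"))
      }
    })
  }
}

A streaming app would then pick it up with
ssc.receiverStream(new RabbitMQReceiver("localhost", "events")).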

On Tue, Jul 14, 2015 at 8:31 AM, Oded Maimon <od...@scene53.com> wrote:

> Hi,
> Thanks for all the help.
> I'm still missing something very basic.
>
> If I won't use SparkR, which doesn't support streaming (I will use MLlib
> instead, as Debasish suggested), and I have my Scala receiver working, how
> should the receiver save the data in memory? I do see the store method, so
> if I use it, how can I read the data from a different Spark Scala/Java
> application? How do I find/query this data?
>
>
> Regards,
> Oded Maimon
> Scene53.
>
> On Tue, Jul 14, 2015 at 12:35 AM, Feynman Liang <fl...@databricks.com>
> wrote:
>
>> Sorry; I think I may have used poor wording. SparkR will let you use R to
>> analyze the data, but it has to be loaded into memory using SparkR (see SparkR
>> DataSources
>> <http://people.apache.org/~pwendell/spark-releases/latest/sparkr.html>).
>> You will still have to write a Java receiver to store the data into some
>> tabular datastore (e.g. Hive) before loading them as SparkR DataFrames and
>> performing the analysis.
>>
>> R-specific questions such as windowing in R should go to R-help@; you
>> won't be able to use window since that is a Spark Streaming method.
>>
>> On Mon, Jul 13, 2015 at 2:23 PM, Oded Maimon <od...@scene53.com> wrote:
>>
>>> You are helping me understand things here a lot.
>>>
>>> I believe I have 3 last questions:
>>>
>>> If I use a Java receiver to get the data, how should I save it in memory?
>>> Using the store command or another command?
>>>
>>> Once stored, how can R read that data?
>>>
>>> Can I use the window command in R? I guess not, because it is a streaming
>>> command, right? Is there any other way to window the data?
>>>
>>> Sent from iPhone
>>>
>>>
>>>
>>>
>>> On Mon, Jul 13, 2015 at 2:07 PM -0700, "Feynman Liang" <
>>> fliang@databricks.com> wrote:
>>>
>>>  If you use SparkR then you can analyze the data that's currently in
>>>> memory with R; otherwise you will have to write to disk (e.g. HDFS).
>>>>
>>>> On Mon, Jul 13, 2015 at 1:45 PM, Oded Maimon <od...@scene53.com> wrote:
>>>>
>>>>> Thanks again.
>>>>> What I'm missing is where I can store the data. Can I store it in
>>>>> Spark memory and then use R to analyze it? Or should I use HDFS? Are
>>>>> there any other places where I can save the data?
>>>>>
>>>>> What would you suggest?
>>>>>
>>>>> Thanks...
>>>>>
>>>>> Sent from iPhone
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Jul 13, 2015 at 1:41 PM -0700, "Feynman Liang" <
>>>>> fliang@databricks.com> wrote:
>>>>>
>>>>>  If you don't require true streaming processing and need to use R for
>>>>>> analysis, SparkR on a custom data source seems to fit your use case.
>>>>>>
>>>>>> On Mon, Jul 13, 2015 at 1:06 PM, Oded Maimon <od...@scene53.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi, thanks for replying!
>>>>>>> I want to do the entire process in stages: get the data using Java
>>>>>>> or Scala, because they are the only languages that support custom
>>>>>>> receivers; keep the data <somewhere>; use R to analyze it; keep the
>>>>>>> results <somewhere>; output the data to different systems.
>>>>>>>
>>>>>>> I thought that <somewhere> could be Spark memory, using RDDs or
>>>>>>> DStreams... But could it be that I need to keep it in HDFS to run the
>>>>>>> entire process in stages?
>>>>>>>
>>>>>>> Sent from iPhone

Re: Few basic spark questions

Posted by Oded Maimon <od...@scene53.com>.
Hi,
Thanks for all the help.
I'm still missing something very basic.

If I won't use SparkR, which doesn't support streaming (I will use MLlib
instead, as Debasish suggested), and I have my Scala receiver working, how
should the receiver save the data in memory? I do see the store method, so
if I use it, how can I read the data from a different Spark Scala/Java
application? How do I find/query this data?


Regards,
Oded Maimon
Scene53.


Re: Few basic spark questions

Posted by Debasish Das <de...@gmail.com>.
What do you need in SparkR that MLlib / ML don't have? Most of the basic
analysis that you need on a stream can be done through MLlib components...
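
For example, a streaming k-means sketch along those lines (untested; the
cluster count, feature dimension, and comma-separated input format are all
assumptions):

import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.dstream.DStream

// 'lines' would be the DStream produced by the custom RabbitMQ receiver
def clusterStream(lines: DStream[String]): Unit = {
  // Parse each message into a dense feature vector
  val features = lines.map(s => Vectors.dense(s.split(',').map(_.toDouble)))

  val model = new StreamingKMeans()
    .setK(3)                    // number of clusters (assumption)
    .setDecayFactor(1.0)        // weight all past data equally
    .setRandomCenters(dim = 2, weight = 0.0)

  model.trainOn(features)           // update cluster centers on each batch
  model.predictOn(features).print() // emit cluster assignments
}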

Re: Basic Spark SQL question

Posted by Ron Gonzalez <zl...@yahoo.com.INVALID>.
Cool thanks. Will take a look...

Sent from my iPhone


Re: Basic Spark SQL question

Posted by Michael Armbrust <mi...@databricks.com>.
I'd look at the JDBC server (a long-running YARN job you can submit queries
to):

https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server
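
Once the Thrift server is up, any JDBC client can talk to it. A minimal,
untested sketch using the Hive JDBC driver (hive-jdbc on the classpath is
assumed; host, port, credentials, and table name are placeholders):

import java.sql.DriverManager

object ThriftServerQuery {
  def main(args: Array[String]): Unit = {
    // Register the Hive JDBC driver (needed with older driver versions)
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    // Default Thrift server endpoint; adjust for your cluster
    val conn = DriverManager.getConnection(
      "jdbc:hive2://localhost:10000/default", "user", "")
    try {
      val rs = conn.createStatement().executeQuery(
        "SELECT count(*) FROM my_table") // hypothetical table
      while (rs.next()) println(rs.getLong(1))
    } finally {
      conn.close()
    }
  }
}

Since the server is long-running, each query skips application startup,
which is what cuts the per-query latency.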


Re: Basic Spark SQL question

Posted by Jerrick Hoang <je...@gmail.com>.
Well, for ad hoc queries you can use the CLI.


Basic Spark SQL question

Posted by Ron Gonzalez <zl...@yahoo.com.INVALID>.
Hi,
   I have a question about Spark SQL. Is there a way to use Spark SQL on
YARN without having to submit a job?
   The bottom line here is that I want to reduce the latency of running
queries as jobs. I know that the default Spark SQL submission runs like a
job, but I was wondering if it's possible to run queries the way one would
with a regular DB like MySQL or Oracle.

Thanks,
Ron




Re: Few basic spark questions

Posted by Feynman Liang <fl...@databricks.com>.
Sorry; I think I may have used poor wording. SparkR will let you use R to
analyze the data, but it has to be loaded into memory using SparkR (see SparkR
DataSources
<http://people.apache.org/~pwendell/spark-releases/latest/sparkr.html>).
You will still have to write a Java receiver to store the data into some
tabular datastore (e.g. Hive) before loading them as SparkR DataFrames and
performing the analysis.

R-specific questions such as windowing in R should go to R-help@; you won't
be able to use window since that is a Spark Streaming method.
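
For instance, the receiver's DStream could be appended to Parquet on each
batch and then loaded from SparkR with read.df. An untested sketch of the
write side (the Event schema and the HDFS path are assumptions):

import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.dstream.DStream

case class Event(ts: Long, value: Double) // assumed message schema

// Append each micro-batch to a Parquet directory; SparkR can later read it
// with: df <- read.df(sqlContext, "hdfs:///data/events", "parquet")
def persist(events: DStream[Event], sqlContext: SQLContext): Unit = {
  import sqlContext.implicits._
  events.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      rdd.toDF().write.mode("append").parquet("hdfs:///data/events")
    }
  }
}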


Re: Few basic spark questions

Posted by Feynman Liang <fl...@databricks.com>.
Hi Oded,

I'm not sure I completely understand your question, but it sounds like you
could have the READER receiver produce a DStream which is
windowed/processed in Spark Streaming, using foreachRDD to do the OUTPUT.
However, streaming in SparkR is not currently supported (SPARK-6803
<https://issues.apache.org/jira/browse/SPARK-6803>), so I'm not too sure
how ANALYZER would fit in.
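
The READER-to-OUTPUT part could look roughly like this (an untested sketch;
the window sizes and the output sink are placeholder assumptions, and
RabbitMQReceiver is the custom receiver sketched earlier in the thread):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ReaderToOutput {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("reader-to-output")
    val ssc = new StreamingContext(conf, Seconds(10)) // 10s micro-batches

    // READER: the custom RabbitMQ receiver
    val lines = ssc.receiverStream(new RabbitMQReceiver("localhost", "events"))

    // ANALYZER-style window: every 60s, look at the last 60s of data
    val windowed = lines.window(Seconds(60), Seconds(60))

    // OUTPUT: push each processed batch to an external system
    windowed.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // open a connection to the external system here (placeholder)
        records.foreach(println)
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}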

Feynman


Re: Few basic spark questions

Posted by Oded Maimon <od...@scene53.com>.
Any help or ideas will be appreciated :)
Thanks


Regards,
Oded Maimon
Scene53.
