Posted to user@spark.apache.org by Marius Soutier <mp...@gmail.com> on 2014/09/12 11:23:48 UTC

Serving data

Hi there,

I’m pretty new to Spark, and so far I’ve written my jobs the same way I wrote Scalding jobs - one-off jobs that read data from HDFS, count words, and write the counts back to HDFS.

Now I want to display these counts in a dashboard. Since Spark allows you to cache RDDs in memory and you have to explicitly terminate your app (and there’s even a new JDBC server in 1.1), I’m assuming it’s possible to keep an app running indefinitely and query an in-memory RDD from the outside (via SparkSQL, for example).

Is this how others are using Spark? Or are you just dumping job results into message queues or databases?


Thanks
- Marius


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org
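
The pattern described above (keep the driver alive, cache the counts, and query them through SparkSQL) could look roughly like the sketch below. This is only an illustration against the Spark 1.1-era SQLContext API; the input path, table name, and object name are made up.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object ServingApp {
      case class WordCount(word: String, total: Long)

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("serving-data"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.createSchemaRDD  // implicit RDD[case class] -> SchemaRDD (Spark 1.1)

        // One-off word count, kept cached for repeated queries.
        val counts = sc.textFile("hdfs:///data/input")   // hypothetical path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1L))
          .reduceByKey(_ + _)
          .map { case (word, total) => WordCount(word, total) }
          .cache()

        counts.registerTempTable("word_counts")

        // While this app keeps running, anything with access to sqlContext
        // (an embedded HTTP endpoint, the Thrift/JDBC server, ...) can serve queries:
        sqlContext.sql("SELECT word, total FROM word_counts ORDER BY total DESC LIMIT 10")
          .collect()
          .foreach(println)

        // Block so the driver, and with it the cached RDD, stays alive.
        Thread.currentThread().join()
      }
    }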


Re: Serving data

Posted by Marius Soutier <mp...@gmail.com>.
No, you’re right, that’s exactly what I’m doing right now. The choice would have been *either* Parquet *or* a database.

What’s unfortunate is that apparently this only works with Play Framework 2.2, not 2.3, because of incompatible Akka versions.

On 16.09.2014, at 16:37, Yana Kadiyska <ya...@gmail.com> wrote:

> If your dashboard is doing ajax/pull requests against say a REST API you can always create a Spark context in your rest service and use SparkSQL to query over the parquet files. The parquet files are already on disk so it seems silly to write both to parquet and to a DB...unless I'm missing something in your setup.


Re: Serving data

Posted by Yana Kadiyska <ya...@gmail.com>.
If your dashboard is doing ajax/pull requests against, say, a REST API, you
can always create a Spark context in your REST service and use SparkSQL to
query over the Parquet files. The Parquet files are already on disk, so it
seems silly to write both to Parquet and to a DB... unless I'm missing
something in your setup.

On Tue, Sep 16, 2014 at 4:18 AM, Marius Soutier <mp...@gmail.com> wrote:

> Writing to Parquet and querying the result via SparkSQL works great
> (except for some strange SQL parser errors). However the problem remains,
> how do I get that data back to a dashboard. So I guess I’ll have to use a
> database after all.
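
A minimal sketch of this suggestion, a long-lived service that owns a SparkContext and answers dashboard queries straight from the Parquet files, might look like the following. The HTTP layer is left out; the path, table name, and column types are assumptions, and the API is the Spark 1.1-era SQLContext.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object ParquetQueryService {
      private val sc = new SparkContext(new SparkConf().setAppName("dashboard-queries"))
      private val sqlContext = new SQLContext(sc)

      // Register the Parquet output of the batch job once, at startup (hypothetical path).
      sqlContext.parquetFile("hdfs:///output/word_counts.parquet")
        .registerTempTable("word_counts")

      // A REST handler (Play, Spray, ...) would call this and render the rows as JSON.
      def topWords(limit: Int): Array[(String, Long)] =
        sqlContext.sql(s"SELECT word, total FROM word_counts ORDER BY total DESC LIMIT $limit")
          .collect()
          .map(row => (row.getString(0), row.getLong(1)))
    }

A controller then only needs to call ParquetQueryService.topWords(10) and serialize the result; whether that plays nicely with a given Play/Akka version is a separate question, as the thread notes.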

Re: Serving data

Posted by Marius Soutier <mp...@gmail.com>.
Writing to Parquet and querying the result via SparkSQL works great (except for some strange SQL parser errors). However, the problem remains: how do I get that data back to a dashboard? So I guess I’ll have to use a database after all.



Re: Serving data

Posted by Marius Soutier <mp...@gmail.com>.
Nice, I’ll check it out. At first glance, writing Parquet files seems to be a bit complicated.

On 15.09.2014, at 13:54, andy petrella <an...@gmail.com> wrote:

> nope.
> It's an efficient storage for genomics data :-D
> 
> aℕdy ℙetrella
> about.me/noootsab


Re: Serving data

Posted by andy petrella <an...@gmail.com>.
nope.
It's an efficient storage for genomics data :-D

aℕdy ℙetrella
about.me/noootsab

On Mon, Sep 15, 2014 at 1:52 PM, Marius Soutier <mp...@gmail.com> wrote:

> So you are living the dream of using HDFS as a database? ;)

Re: Serving data

Posted by Marius Soutier <mp...@gmail.com>.
So you are living the dream of using HDFS as a database? ;)

On 15.09.2014, at 13:50, andy petrella <an...@gmail.com> wrote:

> I'm using Parquet in ADAM, and I can say that it works pretty fine!
> Enjoy ;-)
> 
> aℕdy ℙetrella
> about.me/noootsab


Re: Serving data

Posted by andy petrella <an...@gmail.com>.
I'm using Parquet in ADAM, and I can say that it works pretty fine!
Enjoy ;-)

aℕdy ℙetrella
about.me/noootsab

On Mon, Sep 15, 2014 at 1:41 PM, Marius Soutier <mp...@gmail.com> wrote:

> Thank you guys, I’ll try Parquet and if that’s not quick enough I’ll go
> the usual route with either read-only or normal database.

Re: Serving data

Posted by Marius Soutier <mp...@gmail.com>.
Thank you guys, I’ll try Parquet, and if that’s not quick enough I’ll go the usual route with either a read-only or a normal database.

On 13.09.2014, at 12:45, andy petrella <an...@gmail.com> wrote:

> however, the cache is not guaranteed to remain, if other jobs are launched in the cluster and require more memory than what's left in the overall caching memory, previous RDDs will be discarded.
> 
> Using an off heap cache like tachyon as a dump repo can help.
> 
> In general, I'd say that using a persistent sink (like Cassandra for instance) is best.
> 
> my .2¢
> 
> 
> aℕdy ℙetrella
> about.me/noootsab


Re: Serving data

Posted by andy petrella <an...@gmail.com>.
However, the cache is not guaranteed to remain: if other jobs are launched
in the cluster and require more memory than what's left in the overall
caching memory, previously cached RDDs will be discarded.

Using an off-heap cache like Tachyon as a dump repository can help.

In general, I'd say that using a persistent sink (like Cassandra for
instance) is best.

my .2¢


aℕdy ℙetrella
about.me/noootsab

On Sat, Sep 13, 2014 at 9:20 AM, Mayur Rustagi <ma...@gmail.com>
wrote:

> You can cache data in memory & query it using Spark Job Server.
> Most folks dump data down to a queue/db for retrieval
> You can batch up data & store into parquet partitions as well. & query it
> using another SparkSQL  shell, JDBC driver in SparkSQL is part 1.1 i
> believe.
> --
> Regards,
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi
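
To make the two options concrete, here is a small sketch that reuses the hypothetical counts RDD from the word-count example under the original post. The off-heap level was Tachyon-backed and experimental in Spark 1.x; the Cassandra line assumes the DataStax spark-cassandra-connector is on the classpath, and the keyspace/table names are invented.

    import org.apache.spark.storage.StorageLevel
    import com.datastax.spark.connector._  // assumption: spark-cassandra-connector dependency

    // Option 1: cache off-heap so the data is less likely to be evicted
    // when other jobs compete for executor memory.
    counts.persist(StorageLevel.OFF_HEAP)

    // Option 2: write to a persistent sink and serve the dashboard from there.
    counts.saveToCassandra("dashboard", "word_counts")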

Re: Serving data

Posted by Mayur Rustagi <ma...@gmail.com>.
You can cache data in memory and query it using Spark Job Server.

Most folks dump data down to a queue/DB for retrieval.

You can also batch up data and store it into Parquet partitions, and query it using another SparkSQL shell; the JDBC server in SparkSQL is part of 1.1, I believe.
-- 
Regards,
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi

On Fri, Sep 12, 2014 at 2:54 PM, Marius Soutier <mp...@gmail.com> wrote:

> Hi there,
> I’m pretty new to Spark, and so far I’ve written my jobs the same way I wrote Scalding jobs - one-off, read data from HDFS, count words, write counts back to HDFS.
> Now I want to display these counts in a dashboard. Since Spark allows to cache RDDs in-memory and you have to explicitly terminate your app (and there’s even a new JDBC server in 1.1), I’m assuming it’s possible to keep an app running indefinitely and query an in-memory RDD from the outside (via SparkSQL for example).
> Is this how others are using Spark? Or are you just dumping job results into message queues or databases?
> Thanks
> - Marius
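
A sketch of this batching approach, again with invented paths and reusing sc and counts from the example under the original post: each run writes its own Parquet directory, and a separate SparkSQL context (or the spark-sql shell, or the Thrift JDBC server that ships with 1.1) reads it back for ad-hoc queries.

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicit RDD[case class] -> SchemaRDD

    // Batch side: write each run into its own directory (layout is made up).
    val runDate = "2014-09-13"
    counts.saveAsParquetFile(s"hdfs:///output/word_counts/run=$runDate")

    // Serving side: read the partition back and register it for SQL queries.
    sqlContext.parquetFile(s"hdfs:///output/word_counts/run=$runDate")
      .registerTempTable("word_counts")
    sqlContext.sql("SELECT word, total FROM word_counts ORDER BY total DESC LIMIT 10")
      .collect()
      .foreach(println)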