Posted to dev@spark.apache.org by "fightfate@163.com" <fi...@163.com> on 2015/11/09 07:02:47 UTC

OLAP query using spark dataframe with cassandra

Hi, community

We are especially interested in this kind of integration, following some slides from [1]. The SMACK stack (Spark + Mesos + Akka + Cassandra + Kafka)

seems a good implementation of the lambda architecture in the open-source world, especially for non-Hadoop-based cluster environments. As we see it,

the advantages are:

1 the flexibility and scalability of the Spark DataFrame API, which also makes a perfect complement to Apache Cassandra's native CQL features.

2 availability of both streaming and batch processing on a single stack.

3 both compatibility and usability for Spark with Cassandra, including seamless integration with job scheduling and resource management.
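
As an illustration of point 1, here is a minimal sketch of the DataFrame integration, assuming the DataStax spark-cassandra-connector is on the classpath; the keyspace, table and column names are hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Spark 1.x style setup; spark.cassandra.connection.host points at the C* cluster.
val conf = new SparkConf().set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Expose a Cassandra table as a DataFrame through the connector's data source.
val events = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "analytics", "table" -> "events"))
  .load()

// Run an aggregation that plain CQL cannot express directly.
events.filter(events("day") === "2015-11-09")
  .groupBy("event_type")
  .count()
  .show()
```

This is only a sketch of the read path; a real OLAP workload would also tune partitioning and predicate pushdown.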

Our only concern is OLAP query performance, which is mainly caused by frequent aggregation work across large, daily-growing tables, for

both Spark SQL and Cassandra. I can see that the use case in [1] relies on FiloDB for columnar storage and query performance, but we have no further

knowledge of it.

The question is: does anyone have such a use case today, especially in a production environment? We would be interested in your architecture for designing this

OLAP engine using Spark + Cassandra. How do you think this scenario compares with a traditional OLAP cube design, like Apache Kylin or

Pentaho Mondrian?

Best Regards,

Sun.


[1]  http://www.slideshare.net/planetcassandra/cassandra-summit-2014-interactive-olap-queries-using-apache-cassandra-and-spark



fightfate@163.com

Re: OLAP query using spark dataframe with cassandra

Posted by KylinPOC <sa...@gmail.com>.
That's great.
Good to know the roadmap of Kylin.



--
View this message in context: http://apache-kylin-incubating.74782.x6.nabble.com/Re-OLAP-query-using-spark-dataframe-with-cassandra-tp2285p2291.html
Sent from the Apache Kylin (Incubating) mailing list archive at Nabble.com.

Re: OLAP query using spark dataframe with cassandra

Posted by David Morales <dm...@stratio.com>.
Hi there,

Please consider our real-time aggregation engine, Sparkta, which is fully
open source (Apache 2 license).

Here you have some slides about the project:

   - http://www.slideshare.net/Stratio/strata-sparkta

And the source code:


   - https://github.com/Stratio/sparkta

Sparkta is a real-time aggregation engine based on Spark Streaming. You can
define your aggregation policy in a declarative way and also choose the
output of your rollups. In addition, you can store the raw data and
transform data on the fly, among other features.

When working with Cassandra, the Lucene integration that we have also
released at Stratio could be useful:


   - http://www.slideshare.net/Stratio/cassandra-meetup-20150217
   - https://github.com/Stratio/cassandra-lucene-index


It is ready for use with Spark SQL or in your CQL queries.
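
A rough sketch of how such an index might be queried from CQL (here via the DataStax Java driver from Scala); the schema, the dummy `lucene` column convention and the JSON search syntax are assumptions based on Stratio's documentation of that era, so check the project README for the exact form:

```scala
import com.datastax.driver.core.Cluster
import scala.collection.JavaConverters._

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("analytics")

// The Lucene index is created on a dummy text column, e.g.:
//   CREATE CUSTOM INDEX events_idx ON events (lucene)
//     USING 'com.stratio.cassandra.lucene.Index';

// A full-text filter is then expressed as a JSON document in CQL.
val rows = session.execute(
  """SELECT * FROM events
    |WHERE lucene = '{ filter : { type : "match", field : "user", value : "sun" } }'""".stripMargin)

rows.asScala.foreach(println)
session.close(); cluster.close()
```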

We are now working on a SQL layer for working with the cubes in a flexible
way, but it is not available at the moment.

Do not hesitate to contact us if you have any questions.


Regards.


-- 

David Morales de Frías  ::  +34 607 010 411 :: @dmoralesdf
<https://twitter.com/dmoralesdf>


<http://www.stratio.com/>
Vía de las dos Castillas, 33, Ática 4, 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 828 6473 // www.stratio.com // *@stratiobd
<https://twitter.com/StratioBD>*

Re: OLAP query using spark dataframe with cassandra

Posted by danielcsant <dc...@stratio.com>.
You can also evaluate Stratio Sparkta. It is a real-time aggregation tool
based on Spark Streaming.
It is able to write to Cassandra and to other databases like MongoDB,
Elasticsearch, etc. It is prepared to deploy these aggregations on Mesos,
so it may fit your needs.

There is no query layer yet that abstracts the analytics part of OLAP, but
one is on the roadmap.

Disclaimer: I work on this product.



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/OLAP-query-using-spark-dataframe-with-cassandra-tp15082p15113.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: OLAP query using spark dataframe with cassandra

Posted by Luke Han <lu...@gmail.com>.
Some friends referred me to this thread about OLAP/Kylin and Spark...

Here are my 2 cents.

If you are trying to set up OLAP, Apache Kylin should be a good candidate
for you to evaluate.

The project has been in development for more than 2 years and is about to
graduate to an Apache Top Level Project [1].
There are already many production deployments, including eBay, Exponential,
JD.com, VIP.com and others; refer to the powered-by page [2].

Apache Kylin's Spark engine is also on the way; there is a discussion
about tuning its performance [3].

A variety of clients are available to interact with Kylin through ANSI
SQL, including Tableau, Zeppelin, Pentaho/Mondrian and Saiku/Mondrian, and
Excel/PowerBI support will roll out this week.
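
For reference, that ANSI SQL access goes through Kylin's JDBC driver; a minimal sketch follows (the host, project, table and credentials are illustrative, following the URL format in the Kylin docs):

```scala
import java.sql.DriverManager

// Kylin ships a JDBC driver; URL format: jdbc:kylin://<host>:<port>/<project>
Class.forName("org.apache.kylin.jdbc.Driver")
val conn = DriverManager.getConnection(
  "jdbc:kylin://kylin-host:7070/learn_kylin", "ADMIN", "KYLIN")

val stmt = conn.createStatement()
val rs = stmt.executeQuery(
  "SELECT part_dt, SUM(price) AS total FROM kylin_sales GROUP BY part_dt")

// Iterate the pre-aggregated cube results.
while (rs.next()) {
  println(s"${rs.getString(1)} -> ${rs.getDouble(2)}")
}
rs.close(); stmt.close(); conn.close()
```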

Apache Kylin is young but mature, with validation at huge scale (one of
the biggest cubes at eBay contains 85+ billion rows, and the production
platform's 90th-percentile query latency is a few seconds).

Streaming OLAP is coming in Kylin v2.0 with a pluggable architecture;
there is already one real production case inside eBay, please refer to our
design deck [4].

Everyone is very welcome to join and contribute to Kylin as an OLAP engine
for Big Data :-)

Please feel free to contact our community or me with any questions.

Thanks.

1. http://s.apache.org/bah
2. http://kylin.incubator.apache.org/community/poweredby.html
3. http://s.apache.org/lHA
4.
http://www.slideshare.net/lukehan/1-apache-kylin-deep-dive-streaming-and-plugin-architecture-apache-kylin-meetup-shanghai
5. http://kylin.io


Best Regards!
---------------------

Luke Han

On Tue, Nov 10, 2015 at 2:56 AM, tsh <ts...@timshenkao.su> wrote:

> Hi,
>
> I'm in the same position right now: we are going to implement something
> like OLAP BI + machine learning explorations on the same cluster.
> Well, the question is quite ambivalent: on one hand, we have terabytes of
> versatile data and the need to build something like cubes (Hive and Hive
> on HBase are unsatisfactory). On the other hand, our users are accustomed
> to Tableau + Vertica.
> So, right now I am considering the following choices:
> 1) Platfora (not free, I don't know the price right now) + Spark
> 2) AtScale + Tableau (not free, I don't know the price right now) + Spark
> 3) Apache Kylin (young project?) + Spark on YARN + Kafka + Flume + some
> storage
> 4) Apache Phoenix + Apache HBase + Mondrian + Spark on YARN + Kafka +
> Flume (has anybody used it in production?)
> 5) Spark + Tableau (cubes?)
>
> For myself, I decided not to dive into Mesos. Cassandra is hard to
> configure; you'll have to dedicate a special employee to support it.
>
> I'll be glad to hear other ideas & propositions, as we are at the
> beginning of the process too.
>
> Sincerely yours, Tim Shenkao
>
>
> On 11/09/2015 09:46 AM, fightfate@163.com wrote:
>
> Hi,
>
> Thanks for the suggestions. Actually, we are now evaluating and
> stress-testing Spark SQL on Cassandra while
>
> trying to define our business models. FWIW, the solution mentioned here
> is different from a traditional OLAP
>
> cube engine, right? So we are hesitating over the direction to choose
> for our OLAP architecture.
>
> We are happy to hear more use cases from this community.
>
> Best,
> Sun.
>
> ------------------------------
> fightfate@163.com
>
>
> *From:* Jörn Franke <jo...@gmail.com>
> *Date:* 2015-11-09 14:40
> *To:* fightfate@163.com
> *CC:* user <us...@spark.apache.org>; dev <de...@spark.apache.org>
> *Subject:* Re: OLAP query using spark dataframe with cassandra
>
> Is there any distributor supporting these software components in
> combination? If not, and your core business is not software, then you may
> want to look for something else, because it might not make sense to build
> up internal know-how in all of these areas.
>
> In any case, it all depends highly on your data and queries. You will
> have to do your own experiments.
>
> On 09 Nov 2015, at 07:02, "fightfate@163.com" <fi...@163.com> wrote:
>


Re: Re: OLAP query using spark dataframe with cassandra

Posted by Andrés Ivaldi <ia...@gmail.com>.
Hi,

We have been evaluating Apache Kylin. How flexible is it? I mean, we need
to create the cube structure dynamically and populate it from different
sources. Processing time is not too important; what matters is the
response time of queries.

Thanks.

On Mon, Nov 9, 2015 at 11:01 PM, fightfate@163.com <fi...@163.com>
wrote:



-- 
Ing. Ivaldi Andres

Re: Re: OLAP query using spark dataframe with cassandra

Posted by "fightfate@163.com" <fi...@163.com>.
Hi,

Based on my experience, I would recommend option 3), using Apache Kylin, for your requirements.

This is a suggestion based on the open-source world.

As for the Cassandra point, I accept your advice about the need for dedicated support. But the community is very

open and convenient for prompt responses.



fightfate@163.com
 
From: tsh
Date: 2015-11-10 02:56
To: fightfate@163.com; user; dev
Subject: Re: OLAP query using spark dataframe with cassandra


Re: Re: OLAP query using spark dataframe with cassandra

Posted by "fightfate@163.com" <fi...@163.com>.
Hi,

According to my experience, I would recommend option 3) using Apache Kylin for your requirements. 

This is a suggestion based on the open-source world. 

For the per cassandra thing, I accept your advice for the special support thing. But the community is very

open and convinient for prompt response. 



fightfate@163.com
 
From: tsh
Date: 2015-11-10 02:56
To: fightfate@163.com; user; dev
Subject: Re: OLAP query using spark dataframe with cassandra
Hi,

I'm in the same position right now: we are going to implement something like OLAP BI + machine learning explorations on the same cluster.
Well, the question is quite ambivalent: on the one hand, we have terabytes of versatile data and the need to build something like cubes (Hive and Hive on HBase are unsatisfactory). On the other, our users are accustomed to Tableau + Vertica.
So, right now I am considering the following choices:
1) Platfora (not free, I don't know the price right now) + Spark
2) AtScale + Tableau (not free, I don't know the price right now) + Spark
3) Apache Kylin (young project?) + Spark on YARN + Kafka + Flume + some storage
4) Apache Phoenix + Apache HBase + Mondrian + Spark on YARN + Kafka + Flume (has anybody used it in production?)
5) Spark + Tableau (cubes?)

For myself, I decided not to dive into Mesos. Cassandra is hard to configure; you'll have to dedicate a special employee to support it.

I'll be glad to hear other ideas & propositions as we are at the beginning of the process too.

Sincerely yours, Tim Shenkao

On 11/09/2015 09:46 AM, fightfate@163.com wrote:
Hi, 

Thanks for the suggestions. Actually we are now evaluating and stress-testing Spark SQL on Cassandra, while

trying to define business models. FWIW, the solution mentioned here is different from a traditional OLAP

cube engine, right? So we are hesitating over the right direction for the OLAP architecture.

And we are happy to hear more use cases from this community.

Best,
Sun. 



fightfate@163.com
 
From: Jörn Franke
Date: 2015-11-09 14:40
To: fightfate@163.com
CC: user; dev
Subject: Re: OLAP query using spark dataframe with cassandra

Is there any distributor supporting these software components in combination? If not, and your core business is not software, then you may want to look for something else, because it might not make sense to build up internal know-how in all of these areas.

In any case, it all depends highly on your data and queries. You will have to do your own experiments.

On 09 Nov 2015, at 07:02, "fightfate@163.com" <fi...@163.com> wrote:



Re: Re: OLAP query using spark dataframe with cassandra

Posted by Andrés Ivaldi <ia...@gmail.com>.
Hi,
Cassandra looks very interesting and it seems to fit well, but it looks
like it needs too much work to get a proper configuration, which depends
on the data. What we need is a generic structure with as little
configuration as possible, because the end users don't have the know-how
for that.

Please let me know if we made a bad interpretation of Cassandra, so we
can take a look at it again.

Best Regards!!


On Mon, Nov 9, 2015 at 11:11 PM, fightfate@163.com <fi...@163.com>
wrote:

> Hi,
>
> Have you ever considered cassandra as a replacement ? We are now almost
> the seem usage as your engine, e.g. using mysql to store
>
> initial aggregated data. Can you share more about your kind of Cube
> queries ? We are very interested in that arch too : )
>
> Best,
> Sun.
> ------------------------------
> fightfate@163.com
>
>
> *From:* Andrés Ivaldi <ia...@gmail.com>
> *Date:* 2015-11-10 07:03
> *To:* tsh <ts...@timshenkao.su>
> *CC:* fightfate@163.com; user <us...@spark.apache.org>; dev
> <de...@spark.apache.org>
> *Subject:* Re: OLAP query using spark dataframe with cassandra
> Hi,
> I'm also considering something similar. Plain Spark is too slow for my
> case; a possible solution is to use Spark as a multiple-source connector
> and basic transformation layer, then persist the information (currently
> in an RDBMS). After that, with our engine, we build a kind of cube
> query, and the result is processed again by Spark, adding machine
> learning.
> Our missing part is replacing the RDBMS with something more suitable and
> scalable; we don't care about pre-processing the information if, after
> pre-processing, the queries are fast.
>
> Regards
>
>
>
> --
> Ing. Ivaldi Andres
>
>


-- 
Ing. Ivaldi Andres

Re: Re: OLAP query using spark dataframe with cassandra

Posted by "fightfate@163.com" <fi...@163.com>.
Hi, 

Have you ever considered Cassandra as a replacement? We now have almost the same usage as your engine, e.g. using MySQL to store

initially aggregated data. Can you share more about your kind of cube queries? We are very interested in that architecture too : )

Best,
Sun.


fightfate@163.com
 
From: Andrés Ivaldi
Date: 2015-11-10 07:03
To: tsh
CC: fightfate@163.com; user; dev
Subject: Re: OLAP query using spark dataframe with cassandra
Hi,
I'm also considering something similar. Plain Spark is too slow for my case; a possible solution is to use Spark as a multiple-source connector and basic transformation layer, then persist the information (currently in an RDBMS). After that, with our engine, we build a kind of cube query, and the result is processed again by Spark, adding machine learning.
Our missing part is replacing the RDBMS with something more suitable and scalable; we don't care about pre-processing the information if, after pre-processing, the queries are fast.

Regards

On Mon, Nov 9, 2015 at 3:56 PM, tsh <ts...@timshenkao.su> wrote:




-- 
Ing. Ivaldi Andres


RE: First project in scala IDE : first problem

Posted by didier vila <vi...@hotmail.com>.
All, I identified the reason for my problem. Regards. D

From: viladidier@hotmail.com
To: user@spark.apache.org
Subject: First project in scala IDE : first problem
Date: Mon, 9 Nov 2015 23:39:55 +0000




All,
This is my first run with Scala and Maven on Spark, using the Scala IDE on my single computer.
I have the following problem.
Thanks in advance
Didier
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/11/09 23:30:52 INFO SparkContext: Running Spark version 1.4.0
15/11/09 23:30:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/11/09 23:30:53 INFO SecurityManager: Changing view acls to: dv186010
15/11/09 23:30:53 INFO SecurityManager: Changing modify acls to: dv186010
15/11/09 23:30:53 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(dv186010); users with modify permissions: Set(dv186010)
15/11/09 23:30:54 INFO Slf4jLogger: Slf4jLogger started
15/11/09 23:30:54 INFO Remoting: Starting remoting
15/11/09 23:30:54 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.28.1:59209]
15/11/09 23:30:54 INFO Utils: Successfully started service 'sparkDriver' on port 59209.
15/11/09 23:30:54 INFO SparkEnv: Registering MapOutputTracker
15/11/09 23:30:54 INFO SparkEnv: Registering BlockManagerMaster
15/11/09 23:30:54 INFO DiskBlockManager: Created local directory at C:\Users\DV186010\AppData\Local\Temp\spark-123c0ccc-d677-4fae-b9fd-41b9b243905e\blockmgr-65d29cdd-d04f-48f4-85ba-3df96ee4aca7
15/11/09 23:30:54 INFO MemoryStore: MemoryStore started with capacity 1955.6 MB
15/11/09 23:30:54 INFO HttpFileServer: HTTP File server directory is C:\Users\DV186010\AppData\Local\Temp\spark-123c0ccc-d677-4fae-b9fd-41b9b243905e\httpd-62fbcdb8-3fbd-4206-9235-6a9586a3a6a1
15/11/09 23:30:54 INFO HttpServer: Starting HTTP Server
15/11/09 23:30:54 INFO Utils: Successfully started service 'HTTP file server' on port 59210.
15/11/09 23:30:54 INFO SparkEnv: Registering OutputCommitCoordinator
15/11/09 23:30:55 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/11/09 23:30:55 INFO SparkUI: Started SparkUI at http://192.168.28.1:4040
15/11/09 23:30:55 INFO Executor: Starting executor ID driver on host localhost
15/11/09 23:30:55 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 59229.
15/11/09 23:30:55 INFO NettyBlockTransferService: Server created on 59229
15/11/09 23:30:55 INFO BlockManagerMaster: Trying to register BlockManager
15/11/09 23:30:55 INFO BlockManagerMasterEndpoint: Registering block manager localhost:59229 with 1955.6 MB RAM, BlockManagerId(driver, localhost, 59229)
15/11/09 23:30:55 INFO BlockManagerMaster: Registered BlockManager
15/11/09 23:30:56 INFO MemoryStore: ensureFreeSpace(110248) called with curMem=0, maxMem=2050605711
15/11/09 23:30:56 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 107.7 KB, free 1955.5 MB)
15/11/09 23:30:56 INFO MemoryStore: ensureFreeSpace(10090) called with curMem=110248, maxMem=2050605711
15/11/09 23:30:56 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 9.9 KB, free 1955.5 MB)
15/11/09 23:30:56 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:59229 (size: 9.9 KB, free: 1955.6 MB)
15/11/09 23:30:56 INFO SparkContext: Created broadcast 0 from textFile at WordCount.scala:15
15/11/09 23:30:57 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
    at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:278)
    at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:300)
    at org.apache.hadoop.util.Shell.<clinit>(Shell.java:293)
    at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
    at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:362)
    at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:978)
    at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:978)
    at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
    at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
    at scala.Option.map(Option.scala:145)
    at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:65)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$reduceByKey$3.apply(PairRDDFunctions.scala:290)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$reduceByKey$3.apply(PairRDDFunctions.scala:290)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
    at org.apache.spark.rdd.PairRDDFunctions.reduceByKey(PairRDDFunctions.scala:289)
    at org.test.spark.WordCount$.main(WordCount.scala:21)
    at org.test.spark.WordCount.main(WordCount.scala)
15/11/09 23:30:57 INFO FileInputFormat: Total input paths to process : 1
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/C:/Users/DV186010/scalaworkspace/spark/food.count.txt already exists
    at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1089)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1065)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1065)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1065)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:989)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:965)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:965)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:965)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:897)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:897)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:897)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:896)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1400)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1379)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1379)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
    at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1379)
    at org.test.spark.WordCount$.main(WordCount.scala:22)
    at org.test.spark.WordCount.main(WordCount.scala)
15/11/09 23:30:57 INFO SparkContext: Invoking stop() from shutdown hook
15/11/09 23:30:57 INFO SparkUI: Stopped Spark web UI at http://192.168.28.1:4040
15/11/09 23:30:57 INFO DAGScheduler: Stopping DAGScheduler
15/11/09 23:30:57 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
15/11/09 23:30:57 INFO Utils: path = C:\Users\DV186010\AppData\Local\Temp\spark-123c0ccc-d677-4fae-b9fd-41b9b243905e\blockmgr-65d29cdd-d04f-48f4-85ba-3df96ee4aca7, already present as root for deletion.
15/11/09 23:30:57 INFO MemoryStore: MemoryStore cleared
15/11/09 23:30:57 INFO BlockManager: BlockManager stopped
15/11/09 23:30:57 INFO BlockManagerMaster: BlockManagerMaster stopped
15/11/09 23:30:57 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
15/11/09 23:30:57 INFO SparkContext: Successfully stopped SparkContext
15/11/09 23:30:57 INFO Utils: Shutdown hook called
15/11/09 23:30:57 INFO Utils: Deleting directory C:\Users\DV186010\AppData\Local\Temp\spark-123c0ccc-d677-4fae-b9fd-41b9b243905e
15/11/09 23:30:57 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
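For what it's worth, the trace above shows two separate issues: a winutils.exe warning (Spark on Windows expects HADOOP_HOME to point at a directory containing bin\winutils.exe) and the actual failure, a FileAlreadyExistsException, because saveAsTextFile refuses to write to an output path left over from a previous run. A minimal Python sketch of the usual workaround, clearing the output directory before re-running (the path here is illustrative):

```python
import os
import shutil
import tempfile

def clear_output_dir(path):
    """Delete a previous run's output directory so a Hadoop-style writer
    (e.g. Spark's saveAsTextFile) will not fail with
    FileAlreadyExistsException."""
    if os.path.isdir(path):
        shutil.rmtree(path)

# Simulate output left behind by an earlier run.
out = os.path.join(tempfile.mkdtemp(), "food.count.txt")
os.makedirs(out)

clear_output_dir(out)
assert not os.path.exists(out)  # safe to write the new output now
```

Calling the helper on a path that does not exist is a no-op, so it is safe to run unconditionally before every job.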

 		 	   		   		 	   		  

First project in scala IDE : first problem

Posted by didier vila <vi...@hotmail.com>.
All,
This is my first run with Scala and Maven on Spark, using the Scala IDE on my single computer.
I have the following problem.
Thanks in advance
Didier
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/11/09 23:30:52 INFO SparkContext: Running Spark version 1.4.0
15/11/09 23:30:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/11/09 23:30:53 INFO SecurityManager: Changing view acls to: dv186010
15/11/09 23:30:53 INFO SecurityManager: Changing modify acls to: dv186010
15/11/09 23:30:53 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(dv186010); users with modify permissions: Set(dv186010)
15/11/09 23:30:54 INFO Slf4jLogger: Slf4jLogger started
15/11/09 23:30:54 INFO Remoting: Starting remoting
15/11/09 23:30:54 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.28.1:59209]
15/11/09 23:30:54 INFO Utils: Successfully started service 'sparkDriver' on port 59209.
15/11/09 23:30:54 INFO SparkEnv: Registering MapOutputTracker
15/11/09 23:30:54 INFO SparkEnv: Registering BlockManagerMaster
15/11/09 23:30:54 INFO DiskBlockManager: Created local directory at C:\Users\DV186010\AppData\Local\Temp\spark-123c0ccc-d677-4fae-b9fd-41b9b243905e\blockmgr-65d29cdd-d04f-48f4-85ba-3df96ee4aca7
15/11/09 23:30:54 INFO MemoryStore: MemoryStore started with capacity 1955.6 MB
15/11/09 23:30:54 INFO HttpFileServer: HTTP File server directory is C:\Users\DV186010\AppData\Local\Temp\spark-123c0ccc-d677-4fae-b9fd-41b9b243905e\httpd-62fbcdb8-3fbd-4206-9235-6a9586a3a6a1
15/11/09 23:30:54 INFO HttpServer: Starting HTTP Server
15/11/09 23:30:54 INFO Utils: Successfully started service 'HTTP file server' on port 59210.
15/11/09 23:30:54 INFO SparkEnv: Registering OutputCommitCoordinator
15/11/09 23:30:55 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/11/09 23:30:55 INFO SparkUI: Started SparkUI at http://192.168.28.1:4040
15/11/09 23:30:55 INFO Executor: Starting executor ID driver on host localhost
15/11/09 23:30:55 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 59229.
15/11/09 23:30:55 INFO NettyBlockTransferService: Server created on 59229
15/11/09 23:30:55 INFO BlockManagerMaster: Trying to register BlockManager
15/11/09 23:30:55 INFO BlockManagerMasterEndpoint: Registering block manager localhost:59229 with 1955.6 MB RAM, BlockManagerId(driver, localhost, 59229)
15/11/09 23:30:55 INFO BlockManagerMaster: Registered BlockManager
15/11/09 23:30:56 INFO MemoryStore: ensureFreeSpace(110248) called with curMem=0, maxMem=2050605711
15/11/09 23:30:56 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 107.7 KB, free 1955.5 MB)
15/11/09 23:30:56 INFO MemoryStore: ensureFreeSpace(10090) called with curMem=110248, maxMem=2050605711
15/11/09 23:30:56 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 9.9 KB, free 1955.5 MB)
15/11/09 23:30:56 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:59229 (size: 9.9 KB, free: 1955.6 MB)
15/11/09 23:30:56 INFO SparkContext: Created broadcast 0 from textFile at WordCount.scala:15
15/11/09 23:30:57 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
	at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:278)
	at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:300)
	at org.apache.hadoop.util.Shell.<clinit>(Shell.java:293)
	at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
	at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:362)
	...
	at org.apache.spark.rdd.PairRDDFunctions.reduceByKey(PairRDDFunctions.scala:289)
	at org.test.spark.WordCount$.main(WordCount.scala:21)
	at org.test.spark.WordCount.main(WordCount.scala)
15/11/09 23:30:57 INFO FileInputFormat: Total input paths to process : 1
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/C:/Users/DV186010/scalaworkspace/spark/food.count.txt already exists
	at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1089)
	...
	at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1379)
	at org.test.spark.WordCount$.main(WordCount.scala:22)
	at org.test.spark.WordCount.main(WordCount.scala)
15/11/09 23:30:57 INFO SparkContext: Invoking stop() from shutdown hook
15/11/09 23:30:57 INFO SparkUI: Stopped Spark web UI at http://192.168.28.1:4040
15/11/09 23:30:57 INFO DAGScheduler: Stopping DAGScheduler
15/11/09 23:30:57 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
15/11/09 23:30:57 INFO Utils: path = C:\Users\DV186010\AppData\Local\Temp\spark-123c0ccc-d677-4fae-b9fd-41b9b243905e\blockmgr-65d29cdd-d04f-48f4-85ba-3df96ee4aca7, already present as root for deletion.
15/11/09 23:30:57 INFO MemoryStore: MemoryStore cleared
15/11/09 23:30:57 INFO BlockManager: BlockManager stopped
15/11/09 23:30:57 INFO BlockManagerMaster: BlockManagerMaster stopped
15/11/09 23:30:57 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
15/11/09 23:30:57 INFO SparkContext: Successfully stopped SparkContext
15/11/09 23:30:57 INFO Utils: Shutdown hook called
15/11/09 23:30:57 INFO Utils: Deleting directory C:\Users\DV186010\AppData\Local\Temp\spark-123c0ccc-d677-4fae-b9fd-41b9b243905e
15/11/09 23:30:57 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
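
The trace above shows two common issues when running Spark locally on Windows: the missing winutils.exe binary (typically addressed by installing winutils and setting HADOOP_HOME or hadoop.home.dir), and a FileAlreadyExistsException, because Hadoop's output format refuses to overwrite an existing output directory. A minimal sketch of clearing a stale output directory before re-running the job (the path is taken from the log; the Spark call in the comment is illustrative, not part of this snippet):

```scala
import java.nio.file.{Files, Path, Paths}
import java.util.Comparator

object OutputCleanup {
  // Recursively delete a directory if it exists, so that a rerun of
  // rdd.saveAsTextFile(path) does not fail with FileAlreadyExistsException.
  def deleteRecursively(path: Path): Unit =
    if (Files.exists(path)) {
      val stream = Files.walk(path)
      try
        stream
          .sorted(Comparator.reverseOrder[Path]()) // children before parents
          .forEach(p => Files.delete(p))
      finally stream.close()
    }

  def main(args: Array[String]): Unit = {
    val out = Paths.get("food.count.txt")
    deleteRecursively(out)
    // ...then run the Spark job, e.g. counts.saveAsTextFile(out.toString)
  }
}
```

Deleting up front is simpler than catching the exception afterwards, since Spark checks the output path before launching any tasks.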


Re: OLAP query using spark dataframe with cassandra

Posted by Ted Yu <yu...@gmail.com>.
Please consider using a NoSQL engine such as HBase. 

Cheers

> On Nov 9, 2015, at 3:03 PM, Andrés Ivaldi <ia...@gmail.com> wrote:
> 
> Hi,
> I'm also considering something similar, Spark plain is too slow for my case, a possible solution is use Spark as Multiple Source connector and basic transformation layer, then persist the information (actually is a RDBM), after that with our engine we build a kind of Cube queries, and the result is processed again by Spark adding Machine Learning.
> Our Missing part is reemplace the RDBM with something more suitable and scalable than RDBM, dont care about pre processing information if after pre processing the queries are fast.
> 
> Regards
> 
>> On Mon, Nov 9, 2015 at 3:56 PM, tsh <ts...@timshenkao.su> wrote:
>> Hi,
>> 
>> I'm in the same position right now: we are going to implement something like OLAP BI + Machine Learning explorations on the same cluster.
>> Well, the question is quite ambivalent: from one hand, we have     terabytes of versatile data and the necessity to make something like cubes (Hive and Hive on HBase are unsatisfactory). From the other, our users get accustomed to Tableau + Vertica. 
>> So, right now I consider the following choices:
>> 1) Platfora (not free, I don't know price right now) + Spark
>> 2) AtScale + Tableau(not free, I don't know price right now) + Spark
>> 3) Apache Kylin (young project?) + Spark on YARN + Kafka + Flume + some storage
>> 4) Apache Phoenix + Apache HBase + Mondrian + Spark on YARN + Kafka + Flume (has somebody use it in production?)
>> 5) Spark + Tableau  (cubes?)
>> 
>> For myself, I decided not to dive into Mesos. Cassandra is hardly configurable, you'll have to dedicate special employee to support it.
>> 
>> I'll be glad to hear other ideas & propositions as we are at the beginning of the process too.
>> 
>> Sincerely yours, Tim Shenkao
>> 
>> 
>>> On 11/09/2015 09:46 AM, fightfate@163.com wrote:
>>> Hi, 
>>> 
>>> Thanks for suggesting. Actually we are now evaluating and stressing the spark sql on cassandra, while
>>> 
>>> trying to define business models. FWIW, the solution mentioned here is different from traditional OLAP
>>> 
>>> cube engine, right ? So we are hesitating on the common sense or direction choice of olap architecture. 
>>> 
>>> And we are happy to hear more use case from this community. 
>>> 
>>> Best,
>>> Sun. 
>>> 
>>> fightfate@163.com
>>>  
>>> From: Jörn Franke
>>> Date: 2015-11-09 14:40
>>> To: fightfate@163.com
>>> CC: user; dev
>>> Subject: Re: OLAP query using spark dataframe with cassandra
>>> 
>>> Is there any distributor supporting these software components in combination? If no and your core business is not software then you may want to look for something else, because it might not make sense to build up internal know-how in all of these areas.
>>> 
>>> In any case - it depends all highly on your data and queries. You will have to do your own experiments.
>>> 
>>> On 09 Nov 2015, at 07:02, "fightfate@163.com" <fi...@163.com> wrote:
>>> 
>>>> Hi, community
>>>> 
>>>> We are specially interested about this featural integration according to some slides from [1]. The SMACK(Spark+Mesos+Akka+Cassandra+Kafka)
>>>> 
>>>> seems good implementation for lambda architecure in the open-source world, especially non-hadoop based cluster environment. As we can see, 
>>>> 
>>>> the advantages obviously consist of :
>>>> 
>>>> 1 the feasibility and scalability of spark datafram api, which can also make a perfect complement for Apache Cassandra native cql feature.
>>>> 
>>>> 2 both streaming and batch process availability using the ALL-STACK thing, cool.
>>>> 
>>>> 3 we can both achieve compacity and usability for spark with cassandra, including seemlessly integrating with job scheduling and resource management.
>>>> 
>>>> Only one concern goes to the OLAP query performance issue, which mainly caused by frequent aggregation work between daily increased large tables, for 
>>>> 
>>>> both spark sql and cassandra. I can see that the [1] use case facilitates FiloDB to achieve columnar storage and query performance, but we had nothing more 
>>>> 
>>>> knowledge. 
>>>> 
>>>> Question is : Any guy had such use case for now, especially using in your production environment ? Would be interested in your architeture for designing this 
>>>> 
>>>> OLAP engine using spark +  cassandra. What do you think the comparison between the scenario with traditional OLAP cube design? Like Apache Kylin or 
>>>> 
>>>> pentaho mondrian ? 
>>>> 
>>>> Best Regards,
>>>> 
>>>> Sun.
>>>> 
>>>> 
>>>> [1]  http://www.slideshare.net/planetcassandra/cassandra-summit-2014-interactive-olap-queries-using-apache-cassandra-and-spark
>>>> 
>>>> fightfate@163.com
> 
> 
> 
> -- 
> Ing. Ivaldi Andres

Re: OLAP query using spark dataframe with cassandra

Posted by Andrés Ivaldi <ia...@gmail.com>.
Hi,
I'm also considering something similar; plain Spark is too slow for my
case. A possible solution is to use Spark as a multiple-source connector
and basic transformation layer, then persist the information (currently to
an RDBMS). After that, our engine builds a kind of cube query on top, and
the result is processed again by Spark, adding machine learning.
The missing piece is replacing the RDBMS with something more suitable and
scalable; we don't mind pre-processing the information, as long as the
queries are fast afterwards.

Regards
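
As a plain-Scala illustration of the "pre-aggregate once, query fast" idea described above (all names here are hypothetical; in Spark SQL this would be a groupBy/agg over a DataFrame before persisting the result to the serving store):

```scala
// A fact row as it might arrive from one of the connected sources.
case class Sale(region: String, product: String, amount: Double)

object CubeSketch {
  // Roll up (region, product) -> total amount, analogous to
  // df.groupBy("region", "product").agg(sum("amount")) in Spark SQL.
  // The pre-computed map plays the role of the persisted cube cell set.
  def rollup(rows: Seq[Sale]): Map[(String, String), Double] =
    rows.groupBy(r => (r.region, r.product))
        .map { case (key, group) => key -> group.map(_.amount).sum }
}
```

Once materialized, a "cube query" is a cheap lookup or a small re-aggregation over these cells rather than a scan of the raw facts.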

On Mon, Nov 9, 2015 at 3:56 PM, tsh <ts...@timshenkao.su> wrote:

> Hi,
>
> I'm in the same position right now: we are going to implement something
> like OLAP BI + Machine Learning explorations on the same cluster.
> Well, the question is quite ambivalent: from one hand, we have terabytes
> of versatile data and the necessity to make something like cubes (Hive and
> Hive on HBase are unsatisfactory). From the other, our users get accustomed
> to Tableau + Vertica.
> So, right now I consider the following choices:
> 1) Platfora (not free, I don't know price right now) + Spark
> 2) AtScale + Tableau(not free, I don't know price right now) + Spark
> 3) Apache Kylin (young project?) + Spark on YARN + Kafka + Flume + some
> storage
> 4) Apache Phoenix + Apache HBase + Mondrian + Spark on YARN + Kafka +
> Flume (has somebody use it in production?)
> 5) Spark + Tableau  (cubes?)
>
> For myself, I decided not to dive into Mesos. Cassandra is hardly
> configurable, you'll have to dedicate special employee to support it.
>
> I'll be glad to hear other ideas & propositions as we are at the beginning
> of the process too.
>
> Sincerely yours, Tim Shenkao
>
>
> On 11/09/2015 09:46 AM, fightfate@163.com wrote:
>
> Hi,
>
> Thanks for suggesting. Actually we are now evaluating and stressing the
> spark sql on cassandra, while
>
> trying to define business models. FWIW, the solution mentioned here is
> different from traditional OLAP
>
> cube engine, right ? So we are hesitating on the common sense or direction
> choice of olap architecture.
>
> And we are happy to hear more use case from this community.
>
> Best,
> Sun.
>
> ------------------------------
> fightfate@163.com
>
>
> *From:* Jörn Franke <jo...@gmail.com>
> *Date:* 2015-11-09 14:40
> *To:* fightfate@163.com
> *CC:* user <us...@spark.apache.org>; dev <de...@spark.apache.org>
> *Subject:* Re: OLAP query using spark dataframe with cassandra
>
> Is there any distributor supporting these software components in
> combination? If no and your core business is not software then you may want
> to look for something else, because it might not make sense to build up
> internal know-how in all of these areas.
>
> In any case - it depends all highly on your data and queries. You will
> have to do your own experiments.
>
> On 09 Nov 2015, at 07:02, "fightfate@163.com" <fi...@163.com> wrote:
>
> Hi, community
>
> We are specially interested about this featural integration according to
> some slides from [1]. The SMACK(Spark+Mesos+Akka+Cassandra+Kafka)
>
> seems good implementation for lambda architecure in the open-source world,
> especially non-hadoop based cluster environment. As we can see,
>
> the advantages obviously consist of :
>
> 1 the feasibility and scalability of spark datafram api, which can also
> make a perfect complement for Apache Cassandra native cql feature.
>
> 2 both streaming and batch process availability using the ALL-STACK thing,
> cool.
>
> 3 we can both achieve compacity and usability for spark with cassandra,
> including seemlessly integrating with job scheduling and resource
> management.
>
> Only one concern goes to the OLAP query performance issue, which mainly
> caused by frequent aggregation work between daily increased large tables,
> for
>
> both spark sql and cassandra. I can see that the [1] use case facilitates
> FiloDB to achieve columnar storage and query performance, but we had
> nothing more
>
> knowledge.
>
> Question is : Any guy had such use case for now, especially using in your
> production environment ? Would be interested in your architeture for
> designing this
>
> OLAP engine using spark +  cassandra. What do you think the comparison
> between the scenario with traditional OLAP cube design? Like Apache Kylin
> or
>
> pentaho mondrian ?
>
> Best Regards,
>
> Sun.
>
>
> [1]
> <http://www.slideshare.net/planetcassandra/cassandra-summit-2014-interactive-olap-queries-using-apache-cassandra-and-spark>
> http://www.slideshare.net/planetcassandra/cassandra-summit-2014-interactive-olap-queries-using-apache-cassandra-and-spark
>
> ------------------------------
> fightfate@163.com
>
>
>


-- 
Ing. Ivaldi Andres

Re: OLAP query using spark dataframe with cassandra

Posted by Luke Han <lu...@gmail.com>.
Some friends referred me to this thread about OLAP/Kylin and Spark...

Here are my 2 cents.

If you are trying to set up OLAP, Apache Kylin is one good option for
you to evaluate.

The project has been in development for more than 2 years and is about to
graduate to an Apache Top Level Project [1].
There are already many production deployments, including eBay, Exponential,
JD.com, VIP.com and others; refer to the powered-by page [2].

Apache Kylin's Spark engine is also on the way; there is a discussion about
tuning its performance [3].

A variety of clients are available to interact with Kylin via ANSI
SQL, including Tableau, Zeppelin, Pentaho/Mondrian and Saiku/Mondrian, and
Excel/PowerBI support will roll out this week.
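
Since Kylin speaks ANSI SQL over JDBC, a client-side query could be sketched as below (hedged: host, project, table, column names and credentials are placeholders, and running it assumes the Kylin JDBC driver jar is on the classpath; only the SQL-building helper is exercised offline):

```scala
import java.sql.DriverManager

object KylinQuery {
  // Build a simple top-N aggregation query as a plain string.
  def topNSql(table: String, dim: String, measure: String, n: Int): String =
    s"SELECT $dim, SUM($measure) AS total FROM $table " +
    s"GROUP BY $dim ORDER BY total DESC LIMIT $n"

  // Not executed here: requires a running Kylin instance and
  // org.apache.kylin.jdbc.Driver on the classpath.
  def run(): Unit = {
    val conn = DriverManager.getConnection(
      "jdbc:kylin://kylin-host:7070/my_project", "ADMIN", "KYLIN") // placeholders
    try {
      val rs = conn.createStatement()
        .executeQuery(topNSql("sales_fact", "region", "amount", 10))
      while (rs.next()) println(s"${rs.getString(1)} -> ${rs.getDouble(2)}")
    } finally conn.close()
  }
}
```

Because the interface is plain JDBC, the same query text works unchanged from Tableau, Zeppelin or Mondrian.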

Apache Kylin is young but mature, validated at scale (the biggest
cube at eBay contains 85+ billion rows, with 90th-percentile query latency
on the production platform of a few seconds).

Streaming OLAP is coming in Kylin v2.0 with a pluggable architecture;
there is already one real production case inside eBay. Please refer to our
design deck [4].

Everyone is welcome to join and contribute to Kylin as an OLAP
engine for Big Data :-)

Please feel free to contact our community or me with any questions.

Thanks.

1. http://s.apache.org/bah
2. http://kylin.incubator.apache.org/community/poweredby.html
3. http://s.apache.org/lHA
4.
http://www.slideshare.net/lukehan/1-apache-kylin-deep-dive-streaming-and-plugin-architecture-apache-kylin-meetup-shanghai
5. http://kylin.io


Best Regards!
---------------------

Luke Han

On Tue, Nov 10, 2015 at 2:56 AM, tsh <ts...@timshenkao.su> wrote:

> Hi,
>
> I'm in the same position right now: we are going to implement something
> like OLAP BI + Machine Learning explorations on the same cluster.
> Well, the question is quite ambivalent: from one hand, we have terabytes
> of versatile data and the necessity to make something like cubes (Hive and
> Hive on HBase are unsatisfactory). From the other, our users get accustomed
> to Tableau + Vertica.
> So, right now I consider the following choices:
> 1) Platfora (not free, I don't know price right now) + Spark
> 2) AtScale + Tableau(not free, I don't know price right now) + Spark
> 3) Apache Kylin (young project?) + Spark on YARN + Kafka + Flume + some
> storage
> 4) Apache Phoenix + Apache HBase + Mondrian + Spark on YARN + Kafka +
> Flume (has somebody use it in production?)
> 5) Spark + Tableau  (cubes?)
>
> For myself, I decided not to dive into Mesos. Cassandra is hardly
> configurable, you'll have to dedicate special employee to support it.
>
> I'll be glad to hear other ideas & propositions as we are at the beginning
> of the process too.
>
> Sincerely yours, Tim Shenkao
>
>
> On 11/09/2015 09:46 AM, fightfate@163.com wrote:
>
> Hi,
>
> Thanks for suggesting. Actually we are now evaluating and stressing the
> spark sql on cassandra, while
>
> trying to define business models. FWIW, the solution mentioned here is
> different from traditional OLAP
>
> cube engine, right ? So we are hesitating on the common sense or direction
> choice of olap architecture.
>
> And we are happy to hear more use case from this community.
>
> Best,
> Sun.
>
> ------------------------------
> fightfate@163.com
>
>
> *From:* Jörn Franke <jo...@gmail.com>
> *Date:* 2015-11-09 14:40
> *To:* fightfate@163.com
> *CC:* user <us...@spark.apache.org>; dev <de...@spark.apache.org>
> *Subject:* Re: OLAP query using spark dataframe with cassandra
>
> Is there any distributor supporting these software components in
> combination? If no and your core business is not software then you may want
> to look for something else, because it might not make sense to build up
> internal know-how in all of these areas.
>
> In any case - it depends all highly on your data and queries. You will
> have to do your own experiments.
>
> On 09 Nov 2015, at 07:02, "fightfate@163.com" <fi...@163.com> wrote:
>
> Hi, community
>
> We are specially interested about this featural integration according to
> some slides from [1]. The SMACK(Spark+Mesos+Akka+Cassandra+Kafka)
>
> seems good implementation for lambda architecure in the open-source world,
> especially non-hadoop based cluster environment. As we can see,
>
> the advantages obviously consist of :
>
> 1 the feasibility and scalability of spark datafram api, which can also
> make a perfect complement for Apache Cassandra native cql feature.
>
> 2 both streaming and batch process availability using the ALL-STACK thing,
> cool.
>
> 3 we can both achieve compacity and usability for spark with cassandra,
> including seemlessly integrating with job scheduling and resource
> management.
>
> Only one concern goes to the OLAP query performance issue, which mainly
> caused by frequent aggregation work between daily increased large tables,
> for
>
> both spark sql and cassandra. I can see that the [1] use case facilitates
> FiloDB to achieve columnar storage and query performance, but we had
> nothing more
>
> knowledge.
>
> Question is : Any guy had such use case for now, especially using in your
> production environment ? Would be interested in your architeture for
> designing this
>
> OLAP engine using spark +  cassandra. What do you think the comparison
> between the scenario with traditional OLAP cube design? Like Apache Kylin
> or
>
> pentaho mondrian ?
>
> Best Regards,
>
> Sun.
>
>
> [1]
> <http://www.slideshare.net/planetcassandra/cassandra-summit-2014-interactive-olap-queries-using-apache-cassandra-and-spark>
> http://www.slideshare.net/planetcassandra/cassandra-summit-2014-interactive-olap-queries-using-apache-cassandra-and-spark
>
> ------------------------------
> fightfate@163.com
>
>
>


Re: OLAP query using spark dataframe with cassandra

Posted by tsh <ts...@timshenkao.su>.
Hi,

I'm in the same position right now: we are going to implement something
like OLAP BI + machine learning exploration on the same cluster.
The question is quite ambivalent: on the one hand, we have terabytes
of heterogeneous data and a need for something like cubes (Hive
and Hive on HBase are unsatisfactory). On the other, our users are
accustomed to Tableau + Vertica.
So, right now I am considering the following choices:
1) Platfora (not free, I don't know the price right now) + Spark
2) AtScale + Tableau (not free, I don't know the price right now) + Spark
3) Apache Kylin (young project?) + Spark on YARN + Kafka + Flume + some
storage
4) Apache Phoenix + Apache HBase + Mondrian + Spark on YARN + Kafka +
Flume (has anybody used it in production?)
5) Spark + Tableau (cubes?)

For myself, I decided not to dive into Mesos. Cassandra is hard to
configure; you'll have to dedicate a specialist to support it.

I'll be glad to hear other ideas and proposals, as we are at the
beginning of the process too.

Sincerely yours, Tim Shenkao
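The cube-vs-on-the-fly trade-off being debated in this thread can be illustrated with a toy sketch (plain Python, hypothetical data and names, not tied to any of the engines above): a cube engine such as Kylin or Mondrian pays the aggregation cost once at build time, while a DataFrame-style query pays it on every scan.

```python
# Sketch: contrast on-the-fly aggregation (Spark SQL / DataFrame style)
# with a precomputed rollup lookup (cube-engine style) on a toy fact table.
from collections import defaultdict

fact_rows = [  # (day, product, amount) -- toy fact table
    ("2015-11-01", "A", 10),
    ("2015-11-01", "B", 5),
    ("2015-11-02", "A", 7),
    ("2015-11-02", "B", 3),
]

def query_on_the_fly(rows, day):
    # Scan and aggregate at query time; cost grows with table size.
    return sum(amount for d, _, amount in rows if d == day)

def build_day_cube(rows):
    # Precompute the per-day rollup once; queries become O(1) lookups.
    cube = defaultdict(int)
    for d, _, amount in rows:
        cube[d] += amount
    return dict(cube)

cube = build_day_cube(fact_rows)
assert query_on_the_fly(fact_rows, "2015-11-01") == cube["2015-11-01"] == 15
```

The lookup answers instantly but only for the dimensions rolled up in advance; the on-the-fly query handles arbitrary predicates at scan cost, which is exactly the tension between Kylin-style cubes and Spark SQL over Cassandra.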

On 11/09/2015 09:46 AM, fightfate@163.com wrote:
> Hi,
>
> Thanks for the suggestion. We are actually evaluating and stress-testing
> Spark SQL on Cassandra, while
>
> trying to define our business models. FWIW, the solution mentioned here is
> different from a traditional OLAP
>
> cube engine, right? So we are hesitant about the overall direction of our
> OLAP architecture.
>
> And we would be happy to hear more use cases from this community.
>
> Best,
> Sun.
>
> ------------------------------------------------------------------------
> fightfate@163.com
>
>     *From:* Jörn Franke <ma...@gmail.com>
>     *Date:* 2015-11-09 14:40
>     *To:* fightfate@163.com <ma...@163.com>
>     *CC:* user <ma...@spark.apache.org>; dev
>     <ma...@spark.apache.org>
>     *Subject:* Re: OLAP query using spark dataframe with cassandra
>
>     Is there any distributor supporting these software components in
>     combination? If not, and your core business is not software, then you
>     may want to look for something else, because it might not make
>     sense to build up internal know-how in all of these areas.
>
>     In any case, it all depends on your data and queries. You
>     will have to run your own experiments.
>
>     On 09 Nov 2015, at 07:02, "fightfate@163.com
>     <ma...@163.com>" <fightfate@163.com
>     <ma...@163.com>> wrote:
>
>>     Hi, community
>>
>>     We are especially interested in this integration, based on some
>>     slides from [1]. The SMACK (Spark+Mesos+Akka+Cassandra+Kafka) stack
>>
>>     seems like a good implementation of the lambda architecture in the
>>     open-source world, especially for non-Hadoop-based cluster
>>     environments.
>>
>>     The obvious advantages consist of:
>>
>>     1. the feasibility and scalability of the Spark DataFrame API, which
>>     can also be a perfect complement to Apache Cassandra's native CQL
>>     features.
>>
>>     2. availability of both streaming and batch processing on the full
>>     stack.
>>
>>     3. we can achieve both compactness and usability for Spark with
>>     Cassandra, including seamless integration with job scheduling and
>>     resource management.
>>
>>     Our only concern is OLAP query performance, which is mainly affected
>>     by frequent aggregation work over large, daily-growing tables, for
>>     both Spark SQL and Cassandra. I can see that the use case in [1]
>>     uses FiloDB for columnar storage and query performance, but we have
>>     no further knowledge of it.
>>
>>     Our question is: has anyone had such a use case, especially in a
>>     production environment? We would be interested in your architecture
>>     for designing an OLAP engine using Spark + Cassandra. How do you
>>     think this scenario compares with a traditional OLAP cube design,
>>     like Apache Kylin or Pentaho Mondrian?
>>
>>     Best Regards,
>>
>>     Sun.
>>
>>
>>     [1]
>>     http://www.slideshare.net/planetcassandra/cassandra-summit-2014-interactive-olap-queries-using-apache-cassandra-and-spark
>>
>>     ------------------------------------------------------------------------
>>     fightfate@163.com <ma...@163.com>
>


Re: Re: OLAP query using spark dataframe with cassandra

Posted by "fightfate@163.com" <fi...@163.com>.
Hi, 

Thanks for the suggestion. We are actually evaluating and stress-testing Spark SQL on Cassandra, while

trying to define our business models. FWIW, the solution mentioned here is different from a traditional OLAP

cube engine, right? So we are hesitant about the overall direction of our OLAP architecture.

And we would be happy to hear more use cases from this community.

Best,
Sun.



fightfate@163.com
 
From: Jörn Franke
Date: 2015-11-09 14:40
To: fightfate@163.com
CC: user; dev
Subject: Re: OLAP query using spark dataframe with cassandra

Is there any distributor supporting these software components in combination? If not, and your core business is not software, then you may want to look for something else, because it might not make sense to build up internal know-how in all of these areas.

In any case, it all depends on your data and queries. You will have to run your own experiments.

On 09 Nov 2015, at 07:02, "fightfate@163.com" <fi...@163.com> wrote:

Hi, community

We are especially interested in this integration, based on some slides from [1]. The SMACK (Spark+Mesos+Akka+Cassandra+Kafka) stack

seems like a good implementation of the lambda architecture in the open-source world, especially for non-Hadoop-based cluster environments.

The obvious advantages consist of:

1. the feasibility and scalability of the Spark DataFrame API, which can also be a perfect complement to Apache Cassandra's native CQL features.

2. availability of both streaming and batch processing on the full stack.

3. we can achieve both compactness and usability for Spark with Cassandra, including seamless integration with job scheduling and resource management.

Our only concern is OLAP query performance, which is mainly affected by frequent aggregation work over large, daily-growing tables, for

both Spark SQL and Cassandra. I can see that the use case in [1] uses FiloDB for columnar storage and query performance, but we have no further

knowledge of it.

Our question is: has anyone had such a use case, especially in a production environment? We would be interested in your architecture for designing an

OLAP engine using Spark + Cassandra. How do you think this scenario compares with a traditional OLAP cube design, like Apache Kylin or Pentaho Mondrian?

Best Regards,

Sun.


[1]  http://www.slideshare.net/planetcassandra/cassandra-summit-2014-interactive-olap-queries-using-apache-cassandra-and-spark



fightfate@163.com

Re: OLAP query using spark dataframe with cassandra

Posted by Jörn Franke <jo...@gmail.com>.
Is there any distributor supporting these software components in combination? If not, and your core business is not software, then you may want to look for something else, because it might not make sense to build up internal know-how in all of these areas.

In any case, it all depends on your data and queries. You will have to run your own experiments.
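As a concrete starting point for such experiments, one simple engine-agnostic pattern (a plain-Python sketch; all names here are illustrative, not part of any Spark or Cassandra API) is to wrap each candidate query strategy in a callable and compare median latencies on the same workload:

```python
# Minimal timing harness: run a query callable several times and report
# the median latency, smoothing out one-off warm-up effects.
import time
import statistics

def median_latency(query_fn, runs=5):
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        query_fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# Example workload: a full-scan aggregation vs. a precomputed result,
# standing in for "query the raw table" vs. "query a prebuilt rollup".
data = list(range(100_000))
t_scan = median_latency(lambda: sum(data))   # aggregate at query time
total = sum(data)                            # precomputed once, up front
t_lookup = median_latency(lambda: total)     # cube-style constant-time lookup

# The precomputed lookup should be far cheaper than the repeated scan.
assert t_lookup <= t_scan
```

In a real evaluation the callables would issue the actual Spark SQL or CQL queries against production-sized data; the harness shape stays the same.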

> On 09 Nov 2015, at 07:02, "fightfate@163.com" <fi...@163.com> wrote:
> 
> Hi, community
> 
> We are especially interested in this integration, based on some slides from [1]. The SMACK (Spark+Mesos+Akka+Cassandra+Kafka) stack
> 
> seems like a good implementation of the lambda architecture in the open-source world, especially for non-Hadoop-based cluster environments.
> 
> The obvious advantages consist of:
> 
> 1. the feasibility and scalability of the Spark DataFrame API, which can also be a perfect complement to Apache Cassandra's native CQL features.
> 
> 2. availability of both streaming and batch processing on the full stack.
> 
> 3. we can achieve both compactness and usability for Spark with Cassandra, including seamless integration with job scheduling and resource management.
> 
> Our only concern is OLAP query performance, which is mainly affected by frequent aggregation work over large, daily-growing tables, for both Spark SQL and Cassandra. I can see that the use case in [1] uses FiloDB for columnar storage and query performance, but we have no further knowledge of it.
> 
> Our question is: has anyone had such a use case, especially in a production environment? We would be interested in your architecture for designing an OLAP engine using Spark + Cassandra. How do you think this scenario compares with a traditional OLAP cube design, like Apache Kylin or Pentaho Mondrian?
> 
> Best Regards,
> 
> Sun.
> 
> 
> [1]  http://www.slideshare.net/planetcassandra/cassandra-summit-2014-interactive-olap-queries-using-apache-cassandra-and-spark
> 
> fightfate@163.com
