Posted to user@spark.apache.org by unk1102 <um...@gmail.com> on 2016/01/05 19:14:52 UTC

Spark on Apache Ignite?

Hi, has anybody tried Spark on Apache Ignite and had success with it? It seems
promising: https://ignite.apache.org/



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Apache-Ingnite-tp25884.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Spark on Apache Ignite?

Posted by Ravi Kora <ra...@airisdata.com>.
We have been using Ignite with Spark for one of our use cases, relying on Ignite's shared RDD feature. The following links should get you started in that direction. We have been using it for a basic use case and it works fine so far, though there is not a whole lot of documentation on the Spark-Ignite integration. One pain point we observed is that it throws serialization errors when used with non-basic data types (UDTs, etc.).

https://apacheignite.readme.io/docs/shared-rdd
https://apacheignite.readme.io/docs/testing-integration-with-spark-shell
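
For reference, here is a minimal sketch of the shared-RDD usage those pages describe. The IgniteContext/IgniteRDD classes come from the ignite-spark module, but the cache name is illustrative and the exact generic signatures have shifted between Ignite releases, so treat it as a starting point rather than copy-paste code:

import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.spark.IgniteContext
import org.apache.spark.{SparkConf, SparkContext}

object SharedRddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ignite-shared-rdd"))

    // Wraps the SparkContext and starts an Ignite node alongside each executor.
    val ic = new IgniteContext(sc, () => new IgniteConfiguration())

    // An IgniteRDD is a live view over the named Ignite cache: mutable, and
    // visible to other Spark applications (or non-Spark clients) using that cache.
    val shared = ic.fromCache[Int, Int]("sharedNumbers")

    // Write pairs from an ordinary Spark RDD into the cache...
    shared.savePairs(sc.parallelize(1 to 1000).map(i => (i, i * i)))

    // ...and read them back through normal RDD operations.
    println(shared.filter(_._2 > 500000).count())

    sc.stop()
  }
}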

-Ravi




RE: Spark on Apache Ignite?

Posted by Umesh Kacha <um...@gmail.com>.
Hi Nate, thanks much. I have exactly the same use cases you mentioned. My
Spark job does heavy writing involving group-bys and huge data shuffling.
Can you please provide any pointers on how I can take my existing Spark job,
which currently runs on YARN, and run it on Ignite? Please guide. Thanks
again.

RE: Spark on Apache Ignite?

Posted by na...@reactor8.com.
We started playing with Ignite backing our Hadoop, Hive and Spark services, and
we are looking to move to it as our default for deployments going forward. It's
still early, but so far it has been pretty nice, and we are excited about the
flexibility it will provide for our particular use cases.

I'd say in general it's worth looking into if your data workloads:

a) are a mix of read/write, or heavy write at times
b) need write/read access to the data from services/apps outside of your Spark
workloads (old Hadoop jobs, custom apps, etc.)
c) involve strings of Spark jobs that could benefit from caching your data
across them (think usage similar to Tachyon)
d) include Spark SQL queries that could benefit from indexing and mutability
(see point (a) about mixed read/write)

If your data is read-exclusive and very batch oriented, and your workloads
are strictly Spark based, the benefits will be smaller and Ignite would probably
act as more of a Tachyon replacement, since many of the features outside
of RDD caching won't be leveraged.
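
To make points (a) and (d) a bit more concrete, here is a hedged sketch. The cache and type names are illustrative, and the SQL part assumes the cache was configured with indexed types (omitted here); the idea is that the cache behind an IgniteRDD can be rewritten in place, and SQL over it is executed by Ignite rather than by scanning the whole RDD in Spark:

import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.spark.IgniteContext
import org.apache.spark.{SparkConf, SparkContext}

object MixedReadWriteSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ignite-mixed-rw"))
    val ic = new IgniteContext(sc, () => new IgniteConfiguration())

    // Point (a): the cache behind an IgniteRDD is mutable, so later jobs can
    // update existing entries instead of producing a new dataset.
    // (Depending on the Ignite release, savePairs also takes an explicit
    // overwrite flag controlling this behaviour.)
    val counters = ic.fromCache[Int, Int]("counters")
    counters.savePairs(sc.parallelize(1 to 100).map(k => (k, k)))

    // Point (d): the query is pushed to Ignite, which can use its own indexes
    // (this requires the cache to be configured with indexed types, e.g. via
    // CacheConfiguration.setIndexedTypes; omitted here for brevity).
    val hot = counters.sql("select _key, _val from Integer where _val > ?", 50)
    hot.show()

    sc.stop()
  }
}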




RE: Spark on Apache Ignite?

Posted by "Boavida, Rodrigo" <Ro...@Aspect.com>.
I also had a quick look and agree it's not very clear. I believe that reading through the clustering logic and the replication settings gives a good idea of how it works:
https://apacheignite.readme.io/docs/cluster
I believe it integrates with Hadoop and other file-based systems for persisting when needed; I'm not sure about the details of how it recovers.
Also, a resource manager such as Mesos can add recoverability, at least for scenarios where there isn't any state to recover.

Resilience is a feature, and not every use case needs it. For example, I'm currently considering Ignite for caching transient data, where we need to share RDDs between different Spark contexts, with one context producing the data and the other consuming it.
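
For that producer/consumer scenario, one way the consuming side can look is sketched below. Everything here is illustrative rather than from this thread: the Spring config file path, the cache name, and the assumption of an Ignite cluster that outlives any single Spark application; generic signatures of IgniteContext/fromCache also vary between releases.

import org.apache.ignite.spark.IgniteContext
import org.apache.spark.{SparkConf, SparkContext}

object ConsumerSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ignite-consumer"))

    // IgniteContext can also be built from a Spring XML file; here it is
    // assumed to configure the nodes started by this job as clients of an
    // already-running Ignite cluster.
    val ic = new IgniteContext(sc, "config/ignite-client.xml")

    // Attach to the cache a different Spark application populated. The data
    // lives in Ignite, not in either SparkContext, so it is visible here even
    // if the producing application has already finished.
    val shared = ic.fromCache[Int, Int]("sharedNumbers")
    println(s"entries visible from this context: ${shared.count()}")

    sc.stop()
  }
}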


Re: Spark on Apache Ignite?

Posted by Koert Kuipers <ko...@tresata.com>.
Where is Ignite's resilience/fault-tolerance design documented?
I cannot find it. I would generally stay away from it if fault tolerance
is an afterthought.


Re: Spark on Apache Ignite?

Posted by RodrigoB <ro...@aspect.com>.
Although I haven't worked explicitly with either, they do seem to differ in
design and consequently in usage scenarios.

Ignite is claimed to be a pure in-memory distributed database.
With Ignite, updating existing keys is self-managed, compared with Tachyon: in
Tachyon, once a value is created for a given key it becomes immutable, so you
either delete and insert it again or manage/update the Tachyon keys yourself.
Also, Tachyon's resilience design is based on the underlying file system
(typically Hadoop), which means that if a node goes down, the lost data can
only be recovered if it had first been persisted to the corresponding file
partition.
With Ignite, there is no master dependency like there is with Tachyon; my
understanding is that in Tachyon API calls depend on the master's availability.
I believe Ignite has some replication options, which are more aligned with an
in-memory datastore.
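
To make the key-update point concrete, here is a small sketch using Ignite's plain key-value API outside of Spark (the cache name is illustrative and the node uses default configuration): a second put on the same key simply replaces the value, with no delete-and-reinsert step.

import org.apache.ignite.Ignition

object KeyUpdateSketch {
  def main(args: Array[String]): Unit = {
    // Starts a local Ignite node with default configuration.
    val ignite = Ignition.start()

    val cache = ignite.getOrCreateCache[Int, String]("demo")

    cache.put(1, "first value")
    cache.put(1, "second value")  // overwrites in place; the entry stays mutable

    println(cache.get(1))         // prints "second value"

    Ignition.stop(true)
  }
}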

If you are looking to persist some RDD output into an in-memory store
and query it outside of Spark, on paper Ignite sounds like the better
solution.

Since you asked about Ignite's benefits, that was the focus of my
response. Tachyon has its own benefits, like community support and the
Spark lineage persistence integration. If you are doing batch-based
processing and want to persist Spark RDDs fast, Tachyon is your friend.

Hope this helps.

Thanks,
Rod



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Apache-Ingnite-tp25884p25933.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org