Posted to user@spark.apache.org by Dmitry Goldenberg <dg...@gmail.com> on 2016/01/11 21:14:06 UTC

Best practices for sharing/maintaining large resource files for Spark jobs

We have a bunch of Spark jobs deployed and a few large resource files, such
as a dictionary for lookups or a statistical model.

Right now, these are deployed as part of the Spark job jars, which will
eventually make those jars too bloated to deploy.

What are some of the best practices to consider for maintaining and sharing
large resource files like these?

Thanks.

Re: Best practices for sharing/maintaining large resource files for Spark jobs

Posted by Dmitry Goldenberg <dg...@gmail.com>.
Another recommendation from folks on this list was Apache Ignite.  Its
In-Memory File System (
https://ignite.apache.org/features/igfs.html) looks quite interesting.
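
From the Spark side, I'd imagine it would look something like the sketch
below -- the igfs:// URI format, host, and path are my guesses, and it
assumes Ignite's Hadoop-compatible file system is configured so that Hadoop
can resolve the igfs:// scheme:

    // In spark-shell (sc is the shell's SparkContext); URI and path are guesses.
    val dictionary = sc.textFile("igfs://igfs@localhost/resources/dictionary.txt")
    println(s"dictionary entries: ${dictionary.count()}")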

On Thu, Jan 14, 2016 at 7:54 AM, Dmitry Goldenberg <dgoldenberg123@gmail.com
> wrote:

> OK so it looks like Tachyon is a cluster memory plugin marked as
> "experimental" in Spark.
>
> In any case, we've got a few requirements for the system we're working on
> which may drive the decision for how to implement large resource file
> management.
>
> The system is a framework of N data analyzers which take incoming
> documents as input and transform them or extract some data out of those
> documents.  These analyzers can be chained together which makes it a great
> case for processing with RDD's and a set of map/filter types of Spark
> functions. There's already an established framework API which we want to
> preserve.  This means that most likely, we'll create a relatively thin
> "binding" layer for exposing these analyzers as well-documented functions
> to the end-users who want to use them in a Spark based distributed
> computing environment.
>
> We also want to, ideally, hide the complexity of how these resources are
> loaded from the end-users who will be writing the actual Spark jobs that
> utilize the Spark "binding" functions that we provide.
>
> So, for managing large numbers of small, medium, or large resource files,
> we're considering the below options, with a variety of pros and cons
> attached, from the following perspectives:
>
> a) persistence - where do the resources reside initially;
> b) loading - what are the mechanics for loading of these resources;
> c) caching and sharing across worker nodes.
>
> Possible options:
>
> 1. Load each resource into a broadcast variable. Considering that we have
> scores if not hundreds of these resource files, maintaining that many
> broadcast variables seems like a complexity that's going to be hard to
> manage. We'd also need a translation layer between the broadcast variables
> and the internal API that would want to "speak" InputStream's rather than
> broadcast variables.
>
> 2. Load resources into RDD's and perform join's against them from our
> incoming document data RDD's, thus achieving the effect of a value lookup
> from the resources.  While this seems like a very Spark'y way of doing
> things, the lookup mechanics seem quite non-trivial, especially because
> some of the resources aren't going to be pure dictionaries; they may be
> statistical models.  Additionally, this forces us to utilize Spark's
> semantics for handling of these resources which means a potential rewrite
> of our internal product API. That would be a hard option to go with.
>
> 3. Pre-install all the needed resources on each of the worker nodes;
> retrieve the needed resources from the file system and load them into
> memory as needed. Ideally, the resources would only be installed once, on
> the Spark driver side; we'd want to avoid having to pre-install all these
> files on each node. However, we've done this as an exercise and this
> approach works OK.
>
> 4. Pre-load all the resources into HDFS or S3 i.e. into some distributed
> persistent store; load them into cluster memory from there, as necessary.
> Presumably this could be a pluggable store with a common API exposed.
> Since our framework is an OEM'able product, we could plug and play with a
> variety of such persistent stores via Java's FileSystem/URL scheme handler
> API's.
>
> 5. Implement a Resource management server, with a RESTful interface on
> top. Under the covers, this could be a wrapper on top of #4.  Potentially
> unnecessary if we have a solid persistent store API as per #4.
>
> 6. Beyond persistence, caching also has to be considered for these
> resources. We've considered Tachyon (especially since it's pluggable into
> Spark), Redis, and the like. Ideally, I would think we'd want resources to
> be loaded into the cluster memory as needed; paged in/out on-demand in an
> LRU fashion.  From this perspective, it's not yet clear to me what the best
> option(s) would be. Any thoughts / recommendations would be appreciated.
>
>
>
>
>
> On Tue, Jan 12, 2016 at 3:04 PM, Dmitry Goldenberg <
> dgoldenberg123@gmail.com> wrote:
>
>> Thanks, Gene.
>>
>> Does Spark use Tachyon under the covers anyway for implementing its
>> "cluster memory" support?
>>
>> It seems that the practice I hear the most about is the idea of loading
>> resources as RDD's and then doing join's against them to achieve the lookup
>> effect.
>>
>> The other approach would be to load the resources into broadcast
>> variables but I've heard concerns about memory.  Could we run out of memory
>> if we load too much into broadcast vars?  Is there any memory_to_disk/spill
>> to disk capability for broadcast variables in Spark?
>>
>>
>> On Tue, Jan 12, 2016 at 11:19 AM, Gene Pang <ge...@gmail.com> wrote:
>>
>>> Hi Dmitry,
>>>
>>> Yes, Tachyon can help with your use case. You can read and write to
>>> Tachyon via the filesystem api (
>>> http://tachyon-project.org/documentation/File-System-API.html). There
>>> is a native Java API as well as a Hadoop-compatible API. Spark is also able
>>> to interact with Tachyon via the Hadoop-compatible API, so Spark jobs can
>>> read input files from Tachyon and write output files to Tachyon.
>>>
>>> I hope that helps,
>>> Gene
>>>
>>> On Tue, Jan 12, 2016 at 4:26 AM, Dmitry Goldenberg <
>>> dgoldenberg123@gmail.com> wrote:
>>>
>>>> I'd guess that if the resources are broadcast Spark would put them into
>>>> Tachyon...
>>>>
>>>> On Jan 12, 2016, at 7:04 AM, Dmitry Goldenberg <
>>>> dgoldenberg123@gmail.com> wrote:
>>>>
>>>> Would it make sense to load them into Tachyon and read and broadcast
>>>> them from there since Tachyon is already a part of the Spark stack?
>>>>
>>>> If so I wonder if I could do that Tachyon read/write via a Spark API?
>>>>
>>>>
>>>> On Jan 12, 2016, at 2:21 AM, Sabarish Sasidharan <
>>>> sabarish.sasidharan@manthan.com> wrote:
>>>>
>>>> One option could be to store them as blobs in a cache like Redis and
>>>> then read + broadcast them from the driver. Or you could store them in HDFS
>>>> and read + broadcast from the driver.
>>>>
>>>> Regards
>>>> Sab
>>>>
>>>> On Tue, Jan 12, 2016 at 1:44 AM, Dmitry Goldenberg <
>>>> dgoldenberg123@gmail.com> wrote:
>>>>
>>>>> We have a bunch of Spark jobs deployed and a few large resource files
>>>>> such as e.g. a dictionary for lookups or a statistical model.
>>>>>
>>>>> Right now, these are deployed as part of the Spark jobs which will
>>>>> eventually make the mongo-jars too bloated for deployments.
>>>>>
>>>>> What are some of the best practices to consider for maintaining and
>>>>> sharing large resource files like these?
>>>>>
>>>>> Thanks.
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Architect - Big Data
>>>> Ph: +91 99805 99458
>>>>
>>>> Manthan Systems | *Company of the year - Analytics (2014 Frost and
>>>> Sullivan India ICT)*
>>>> +++
>>>>
>>>>
>>>
>>
>

Re: Best practices for sharing/maintaining large resource files for Spark jobs

Posted by Gene Pang <ge...@gmail.com>.
Hi Dmitry,

I am not familiar with all of the details you have just described, but I
think Tachyon should be able to help you.

If you store all of your resource files in HDFS or S3 or both, you can run
Tachyon to use those storage systems as the under storage (
http://tachyon-project.org/documentation/Mounting-and-Transparent-Naming.html).
Then, you can run Spark to read from Tachyon as an HDFS-compatible file
system and do your processing. If you ever need to write data back out, you
can write it out as a file back into Tachyon, which can also be reflected
into your original store (HDFS, S3). The caching in Tachyon will be done
on-demand and you can use the LRU policy for eviction (or a customized
policy
http://tachyon-project.org/documentation/Tiered-Storage-on-Tachyon.html). I
think this might sound like your option #4. I hope this helps!
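
Roughly, from the Spark side it would look something like this (the
tachyon://host:port address and the paths are placeholders, and it assumes
the Tachyon client jar is on the Spark classpath):

    // In spark-shell (sc is the shell's SparkContext); host/port and paths are placeholders.
    val dict = sc.textFile("tachyon://tachyon-master:19998/resources/dictionary.tsv")
    val lookup = dict.map { line =>
      val Array(term, value) = line.split("\t", 2)   // assumes term<TAB>value lines
      (term, value)
    }
    // Writing back out also goes through Tachyon; with the under storage
    // mounted, the output can be persisted to HDFS or S3 as well.
    lookup.saveAsTextFile("tachyon://tachyon-master:19998/resources/dictionary-out")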

Thanks,
Gene

On Thu, Jan 14, 2016 at 4:54 AM, Dmitry Goldenberg <dgoldenberg123@gmail.com
> wrote:

> OK so it looks like Tachyon is a cluster memory plugin marked as
> "experimental" in Spark.
>
> In any case, we've got a few requirements for the system we're working on
> which may drive the decision for how to implement large resource file
> management.
>
> The system is a framework of N data analyzers which take incoming
> documents as input and transform them or extract some data out of those
> documents.  These analyzers can be chained together which makes it a great
> case for processing with RDD's and a set of map/filter types of Spark
> functions. There's already an established framework API which we want to
> preserve.  This means that most likely, we'll create a relatively thin
> "binding" layer for exposing these analyzers as well-documented functions
> to the end-users who want to use them in a Spark based distributed
> computing environment.
>
> We also want to, ideally, hide the complexity of how these resources are
> loaded from the end-users who will be writing the actual Spark jobs that
> utilize the Spark "binding" functions that we provide.
>
> So, for managing large numbers of small, medium, or large resource files,
> we're considering the below options, with a variety of pros and cons
> attached, from the following perspectives:
>
> a) persistence - where do the resources reside initially;
> b) loading - what are the mechanics for loading of these resources;
> c) caching and sharing across worker nodes.
>
> Possible options:
>
> 1. Load each resource into a broadcast variable. Considering that we have
> scores if not hundreds of these resource files, maintaining that many
> broadcast variables seems like a complexity that's going to be hard to
> manage. We'd also need a translation layer between the broadcast variables
> and the internal API that would want to "speak" InputStream's rather than
> broadcast variables.
>
> 2. Load resources into RDD's and perform join's against them from our
> incoming document data RDD's, thus achieving the effect of a value lookup
> from the resources.  While this seems like a very Spark'y way of doing
> things, the lookup mechanics seem quite non-trivial, especially because
> some of the resources aren't going to be pure dictionaries; they may be
> statistical models.  Additionally, this forces us to utilize Spark's
> semantics for handling of these resources which means a potential rewrite
> of our internal product API. That would be a hard option to go with.
>
> 3. Pre-install all the needed resources on each of the worker nodes;
> retrieve the needed resources from the file system and load them into
> memory as needed. Ideally, the resources would only be installed once, on
> the Spark driver side; we'd want to avoid having to pre-install all these
> files on each node. However, we've done this as an exercise and this
> approach works OK.
>
> 4. Pre-load all the resources into HDFS or S3 i.e. into some distributed
> persistent store; load them into cluster memory from there, as necessary.
> Presumably this could be a pluggable store with a common API exposed.
> Since our framework is an OEM'able product, we could plug and play with a
> variety of such persistent stores via Java's FileSystem/URL scheme handler
> API's.
>
> 5. Implement a Resource management server, with a RESTful interface on
> top. Under the covers, this could be a wrapper on top of #4.  Potentially
> unnecessary if we have a solid persistent store API as per #4.
>
> 6. Beyond persistence, caching also has to be considered for these
> resources. We've considered Tachyon (especially since it's pluggable into
> Spark), Redis, and the like. Ideally, I would think we'd want resources to
> be loaded into the cluster memory as needed; paged in/out on-demand in an
> LRU fashion.  From this perspective, it's not yet clear to me what the best
> option(s) would be. Any thoughts / recommendations would be appreciated.
>
>
>
>
>
> On Tue, Jan 12, 2016 at 3:04 PM, Dmitry Goldenberg <
> dgoldenberg123@gmail.com> wrote:
>
>> Thanks, Gene.
>>
>> Does Spark use Tachyon under the covers anyway for implementing its
>> "cluster memory" support?
>>
>> It seems that the practice I hear the most about is the idea of loading
>> resources as RDD's and then doing join's against them to achieve the lookup
>> effect.
>>
>> The other approach would be to load the resources into broadcast
>> variables but I've heard concerns about memory.  Could we run out of memory
>> if we load too much into broadcast vars?  Is there any memory_to_disk/spill
>> to disk capability for broadcast variables in Spark?
>>
>>
>> On Tue, Jan 12, 2016 at 11:19 AM, Gene Pang <ge...@gmail.com> wrote:
>>
>>> Hi Dmitry,
>>>
>>> Yes, Tachyon can help with your use case. You can read and write to
>>> Tachyon via the filesystem api (
>>> http://tachyon-project.org/documentation/File-System-API.html). There
>>> is a native Java API as well as a Hadoop-compatible API. Spark is also able
>>> to interact with Tachyon via the Hadoop-compatible API, so Spark jobs can
>>> read input files from Tachyon and write output files to Tachyon.
>>>
>>> I hope that helps,
>>> Gene
>>>
>>> On Tue, Jan 12, 2016 at 4:26 AM, Dmitry Goldenberg <
>>> dgoldenberg123@gmail.com> wrote:
>>>
>>>> I'd guess that if the resources are broadcast Spark would put them into
>>>> Tachyon...
>>>>
>>>> On Jan 12, 2016, at 7:04 AM, Dmitry Goldenberg <
>>>> dgoldenberg123@gmail.com> wrote:
>>>>
>>>> Would it make sense to load them into Tachyon and read and broadcast
>>>> them from there since Tachyon is already a part of the Spark stack?
>>>>
>>>> If so I wonder if I could do that Tachyon read/write via a Spark API?
>>>>
>>>>
>>>> On Jan 12, 2016, at 2:21 AM, Sabarish Sasidharan <
>>>> sabarish.sasidharan@manthan.com> wrote:
>>>>
>>>> One option could be to store them as blobs in a cache like Redis and
>>>> then read + broadcast them from the driver. Or you could store them in HDFS
>>>> and read + broadcast from the driver.
>>>>
>>>> Regards
>>>> Sab
>>>>
>>>> On Tue, Jan 12, 2016 at 1:44 AM, Dmitry Goldenberg <
>>>> dgoldenberg123@gmail.com> wrote:
>>>>
>>>>> We have a bunch of Spark jobs deployed and a few large resource files
>>>>> such as e.g. a dictionary for lookups or a statistical model.
>>>>>
>>>>> Right now, these are deployed as part of the Spark jobs which will
>>>>> eventually make the mongo-jars too bloated for deployments.
>>>>>
>>>>> What are some of the best practices to consider for maintaining and
>>>>> sharing large resource files like these?
>>>>>
>>>>> Thanks.
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Architect - Big Data
>>>> Ph: +91 99805 99458
>>>>
>>>> Manthan Systems | *Company of the year - Analytics (2014 Frost and
>>>> Sullivan India ICT)*
>>>> +++
>>>>
>>>>
>>>
>>
>

Re: Best practices for sharing/maintaining large resource files for Spark jobs

Posted by Dmitry Goldenberg <dg...@gmail.com>.
OK, so it looks like Tachyon is a cluster-memory plugin marked as
"experimental" in Spark.

In any case, we've got a few requirements for the system we're working on
which may drive the decision for how to implement large resource file
management.

The system is a framework of N data analyzers that take incoming documents
as input and transform them or extract some data out of them.  These
analyzers can be chained together, which makes them a great fit for
processing with RDDs and a set of map/filter-style Spark functions.
There's already an established framework API which we want to preserve.
This means that, most likely, we'll create a relatively thin "binding" layer
that exposes these analyzers as well-documented functions to the end-users
who want to use them in a Spark-based distributed computing environment.

Ideally, we also want to hide from the end-users the complexity of how
these resources are loaded; those users will be writing the actual Spark
jobs that use the "binding" functions we provide.

So, for managing large numbers of small, medium, or large resource files,
we're considering the options below, each with its own pros and cons, from
the following perspectives:

a) persistence - where the resources reside initially;
b) loading - the mechanics for loading these resources;
c) caching and sharing across worker nodes.

Possible options:

1. Load each resource into a broadcast variable (a minimal sketch follows
this list). Considering that we have scores if not hundreds of these
resource files, maintaining that many broadcast variables seems hard to
manage. We'd also need a translation layer between the broadcast variables
and the internal API, which would want to "speak" InputStreams rather than
broadcast variables.

2. Load resources into RDDs and perform joins against them from our
incoming document-data RDDs, thus achieving the effect of a value lookup
from the resources.  While this seems like a very Spark-like way of doing
things, the lookup mechanics seem quite non-trivial, especially because
some of the resources aren't going to be pure dictionaries; they may be
statistical models.  Additionally, this forces us to use Spark's semantics
for handling these resources, which means a potential rewrite of our
internal product API. That would be a hard option to go with.

3. Pre-install all the needed resources on each of the worker nodes;
retrieve the needed resources from the file system and load them into
memory as needed. Ideally, the resources would only be installed once, on
the Spark driver side; we'd want to avoid having to pre-install all these
files on each node. However, we've done this as an exercise and this
approach works OK.

4. Pre-load all the resources into HDFS or S3, i.e. into some distributed
persistent store; load them into cluster memory from there, as necessary.
Presumably this could be a pluggable store with a common API exposed.
Since our framework is an OEM-able product, we could plug and play with a
variety of such persistent stores via Java's FileSystem/URL scheme handler
APIs.

5. Implement a resource-management server with a RESTful interface on top.
Under the covers, this could be a wrapper around #4, and it is potentially
unnecessary if we have a solid persistent store API as per #4.

6. Beyond persistence, caching also has to be considered for these
resources. We've looked at Tachyon (especially since it's pluggable into
Spark), Redis, and the like. Ideally, I would think we'd want resources to
be loaded into cluster memory as needed and paged in/out on demand in an
LRU fashion.  From this perspective, it's not yet clear to me what the best
option(s) would be. Any thoughts / recommendations would be appreciated.
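
For option 1, what we have in mind is roughly the following sketch (the
resource path, the incoming-docs path, and the InputStream wrapper are made
up for illustration):

    // In spark-shell (sc is the shell's SparkContext); paths and names are made up.
    import java.io.{ByteArrayInputStream, InputStream}
    import scala.io.Source

    // Driver side: read the resource bytes once, then broadcast them.
    val dictBytes: Array[Byte] =
      java.nio.file.Files.readAllBytes(java.nio.file.Paths.get("/opt/resources/dictionary.txt"))
    val dictBroadcast = sc.broadcast(dictBytes)

    // Thin "translation layer": the internal API wants InputStreams, so each
    // task re-wraps the broadcast bytes instead of seeing Spark types.
    def asStream(bytes: Array[Byte]): InputStream = new ByteArrayInputStream(bytes)

    val docs = sc.textFile("hdfs:///data/incoming-docs")
    val analyzed = docs.mapPartitions { partition =>
      val dict = Source.fromInputStream(asStream(dictBroadcast.value)).getLines().toSet
      partition.map(doc => (doc, doc.split("\\s+").count(dict.contains)))
    }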





On Tue, Jan 12, 2016 at 3:04 PM, Dmitry Goldenberg <dgoldenberg123@gmail.com
> wrote:

> Thanks, Gene.
>
> Does Spark use Tachyon under the covers anyway for implementing its
> "cluster memory" support?
>
> It seems that the practice I hear the most about is the idea of loading
> resources as RDD's and then doing join's against them to achieve the lookup
> effect.
>
> The other approach would be to load the resources into broadcast variables
> but I've heard concerns about memory.  Could we run out of memory if we
> load too much into broadcast vars?  Is there any memory_to_disk/spill to
> disk capability for broadcast variables in Spark?
>
>
> On Tue, Jan 12, 2016 at 11:19 AM, Gene Pang <ge...@gmail.com> wrote:
>
>> Hi Dmitry,
>>
>> Yes, Tachyon can help with your use case. You can read and write to
>> Tachyon via the filesystem api (
>> http://tachyon-project.org/documentation/File-System-API.html). There is
>> a native Java API as well as a Hadoop-compatible API. Spark is also able to
>> interact with Tachyon via the Hadoop-compatible API, so Spark jobs can read
>> input files from Tachyon and write output files to Tachyon.
>>
>> I hope that helps,
>> Gene
>>
>> On Tue, Jan 12, 2016 at 4:26 AM, Dmitry Goldenberg <
>> dgoldenberg123@gmail.com> wrote:
>>
>>> I'd guess that if the resources are broadcast Spark would put them into
>>> Tachyon...
>>>
>>> On Jan 12, 2016, at 7:04 AM, Dmitry Goldenberg <dg...@gmail.com>
>>> wrote:
>>>
>>> Would it make sense to load them into Tachyon and read and broadcast
>>> them from there since Tachyon is already a part of the Spark stack?
>>>
>>> If so I wonder if I could do that Tachyon read/write via a Spark API?
>>>
>>>
>>> On Jan 12, 2016, at 2:21 AM, Sabarish Sasidharan <
>>> sabarish.sasidharan@manthan.com> wrote:
>>>
>>> One option could be to store them as blobs in a cache like Redis and
>>> then read + broadcast them from the driver. Or you could store them in HDFS
>>> and read + broadcast from the driver.
>>>
>>> Regards
>>> Sab
>>>
>>> On Tue, Jan 12, 2016 at 1:44 AM, Dmitry Goldenberg <
>>> dgoldenberg123@gmail.com> wrote:
>>>
>>>> We have a bunch of Spark jobs deployed and a few large resource files
>>>> such as e.g. a dictionary for lookups or a statistical model.
>>>>
>>>> Right now, these are deployed as part of the Spark jobs which will
>>>> eventually make the mongo-jars too bloated for deployments.
>>>>
>>>> What are some of the best practices to consider for maintaining and
>>>> sharing large resource files like these?
>>>>
>>>> Thanks.
>>>>
>>>
>>>
>>>
>>> --
>>>
>>> Architect - Big Data
>>> Ph: +91 99805 99458
>>>
>>> Manthan Systems | *Company of the year - Analytics (2014 Frost and
>>> Sullivan India ICT)*
>>> +++
>>>
>>>
>>
>

Re: Best practices for sharing/maintaining large resource files for Spark jobs

Posted by Dmitry Goldenberg <dg...@gmail.com>.
Thanks, Gene.

Does Spark use Tachyon under the covers anyway for implementing its
"cluster memory" support?

It seems that the practice I hear about the most is the idea of loading
resources as RDDs and then doing joins against them to achieve the lookup
effect.
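
E.g., roughly (made-up paths and record format):

    // In spark-shell (sc is the shell's SparkContext); paths/format are made up.
    val dictionary = sc.textFile("hdfs:///resources/dictionary.tsv")
      .map { line => val Array(term, value) = line.split("\t", 2); (term, value) }

    // Key the incoming documents by the terms we want to look up...
    val docTerms = sc.textFile("hdfs:///data/incoming-docs")
      .flatMap(doc => doc.split("\\s+").map(term => (term, doc)))

    // ...and the join yields (term, (doc, value)) -- the "lookup" effect.
    val enriched = docTerms.join(dictionary)
    enriched.take(5).foreach(println)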

The other approach would be to load the resources into broadcast variables,
but I've heard concerns about memory.  Could we run out of memory if we load
too much into broadcast vars?  Is there any memory-to-disk / spill-to-disk
capability for broadcast variables in Spark?


On Tue, Jan 12, 2016 at 11:19 AM, Gene Pang <ge...@gmail.com> wrote:

> Hi Dmitry,
>
> Yes, Tachyon can help with your use case. You can read and write to
> Tachyon via the filesystem api (
> http://tachyon-project.org/documentation/File-System-API.html). There is
> a native Java API as well as a Hadoop-compatible API. Spark is also able to
> interact with Tachyon via the Hadoop-compatible API, so Spark jobs can read
> input files from Tachyon and write output files to Tachyon.
>
> I hope that helps,
> Gene
>
> On Tue, Jan 12, 2016 at 4:26 AM, Dmitry Goldenberg <
> dgoldenberg123@gmail.com> wrote:
>
>> I'd guess that if the resources are broadcast Spark would put them into
>> Tachyon...
>>
>> On Jan 12, 2016, at 7:04 AM, Dmitry Goldenberg <dg...@gmail.com>
>> wrote:
>>
>> Would it make sense to load them into Tachyon and read and broadcast them
>> from there since Tachyon is already a part of the Spark stack?
>>
>> If so I wonder if I could do that Tachyon read/write via a Spark API?
>>
>>
>> On Jan 12, 2016, at 2:21 AM, Sabarish Sasidharan <
>> sabarish.sasidharan@manthan.com> wrote:
>>
>> One option could be to store them as blobs in a cache like Redis and then
>> read + broadcast them from the driver. Or you could store them in HDFS and
>> read + broadcast from the driver.
>>
>> Regards
>> Sab
>>
>> On Tue, Jan 12, 2016 at 1:44 AM, Dmitry Goldenberg <
>> dgoldenberg123@gmail.com> wrote:
>>
>>> We have a bunch of Spark jobs deployed and a few large resource files
>>> such as e.g. a dictionary for lookups or a statistical model.
>>>
>>> Right now, these are deployed as part of the Spark jobs which will
>>> eventually make the mongo-jars too bloated for deployments.
>>>
>>> What are some of the best practices to consider for maintaining and
>>> sharing large resource files like these?
>>>
>>> Thanks.
>>>
>>
>>
>>
>> --
>>
>> Architect - Big Data
>> Ph: +91 99805 99458
>>
>> Manthan Systems | *Company of the year - Analytics (2014 Frost and
>> Sullivan India ICT)*
>> +++
>>
>>
>

Re: Best practices for sharing/maintaining large resource files for Spark jobs

Posted by Gene Pang <ge...@gmail.com>.
Hi Dmitry,

Yes, Tachyon can help with your use case. You can read and write to Tachyon
via the filesystem API (
http://tachyon-project.org/documentation/File-System-API.html). There is a
native Java API as well as a Hadoop-compatible API. Spark is also able to
interact with Tachyon via the Hadoop-compatible API, so Spark jobs can read
input files from Tachyon and write output files to Tachyon.

I hope that helps,
Gene

On Tue, Jan 12, 2016 at 4:26 AM, Dmitry Goldenberg <dgoldenberg123@gmail.com
> wrote:

> I'd guess that if the resources are broadcast Spark would put them into
> Tachyon...
>
> On Jan 12, 2016, at 7:04 AM, Dmitry Goldenberg <dg...@gmail.com>
> wrote:
>
> Would it make sense to load them into Tachyon and read and broadcast them
> from there since Tachyon is already a part of the Spark stack?
>
> If so I wonder if I could do that Tachyon read/write via a Spark API?
>
>
> On Jan 12, 2016, at 2:21 AM, Sabarish Sasidharan <
> sabarish.sasidharan@manthan.com> wrote:
>
> One option could be to store them as blobs in a cache like Redis and then
> read + broadcast them from the driver. Or you could store them in HDFS and
> read + broadcast from the driver.
>
> Regards
> Sab
>
> On Tue, Jan 12, 2016 at 1:44 AM, Dmitry Goldenberg <
> dgoldenberg123@gmail.com> wrote:
>
>> We have a bunch of Spark jobs deployed and a few large resource files
>> such as e.g. a dictionary for lookups or a statistical model.
>>
>> Right now, these are deployed as part of the Spark jobs which will
>> eventually make the mongo-jars too bloated for deployments.
>>
>> What are some of the best practices to consider for maintaining and
>> sharing large resource files like these?
>>
>> Thanks.
>>
>
>
>
> --
>
> Architect - Big Data
> Ph: +91 99805 99458
>
> Manthan Systems | *Company of the year - Analytics (2014 Frost and
> Sullivan India ICT)*
> +++
>
>

Re: Best practices for sharing/maintaining large resource files for Spark jobs

Posted by Dmitry Goldenberg <dg...@gmail.com>.
I'd guess that if the resources are broadcast, Spark would put them into Tachyon...

> On Jan 12, 2016, at 7:04 AM, Dmitry Goldenberg <dg...@gmail.com> wrote:
> 
> Would it make sense to load them into Tachyon and read and broadcast them from there since Tachyon is already a part of the Spark stack?
> 
> If so I wonder if I could do that Tachyon read/write via a Spark API?
> 
> 
>> On Jan 12, 2016, at 2:21 AM, Sabarish Sasidharan <sa...@manthan.com> wrote:
>> 
>> One option could be to store them as blobs in a cache like Redis and then read + broadcast them from the driver. Or you could store them in HDFS and read + broadcast from the driver.
>> 
>> Regards
>> Sab
>> 
>>> On Tue, Jan 12, 2016 at 1:44 AM, Dmitry Goldenberg <dg...@gmail.com> wrote:
>>> We have a bunch of Spark jobs deployed and a few large resource files such as e.g. a dictionary for lookups or a statistical model.
>>> 
>>> Right now, these are deployed as part of the Spark jobs which will eventually make the mongo-jars too bloated for deployments.
>>> 
>>> What are some of the best practices to consider for maintaining and sharing large resource files like these?
>>> 
>>> Thanks.
>> 
>> 
>> 
>> -- 
>> 
>> Architect - Big Data
>> Ph: +91 99805 99458
>> 
>> Manthan Systems | Company of the year - Analytics (2014 Frost and Sullivan India ICT)
>> +++

Re: Best practices for sharing/maintaining large resource files for Spark jobs

Posted by Dmitry Goldenberg <dg...@gmail.com>.
Would it make sense to load them into Tachyon and read and broadcast them from there since Tachyon is already a part of the Spark stack?

If so I wonder if I could do that Tachyon read/write via a Spark API?


> On Jan 12, 2016, at 2:21 AM, Sabarish Sasidharan <sa...@manthan.com> wrote:
> 
> One option could be to store them as blobs in a cache like Redis and then read + broadcast them from the driver. Or you could store them in HDFS and read + broadcast from the driver.
> 
> Regards
> Sab
> 
>> On Tue, Jan 12, 2016 at 1:44 AM, Dmitry Goldenberg <dg...@gmail.com> wrote:
>> We have a bunch of Spark jobs deployed and a few large resource files such as e.g. a dictionary for lookups or a statistical model.
>> 
>> Right now, these are deployed as part of the Spark jobs which will eventually make the mongo-jars too bloated for deployments.
>> 
>> What are some of the best practices to consider for maintaining and sharing large resource files like these?
>> 
>> Thanks.
> 
> 
> 
> -- 
> 
> Architect - Big Data
> Ph: +91 99805 99458
> 
> Manthan Systems | Company of the year - Analytics (2014 Frost and Sullivan India ICT)
> +++

Re: Best practices for sharing/maintaining large resource files for Spark jobs

Posted by Sabarish Sasidharan <sa...@manthan.com>.
One option could be to store them as blobs in a cache like Redis and then
read + broadcast them from the driver. Or you could store them in HDFS and
read + broadcast from the driver.
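
A sketch of the HDFS variant (the path is a placeholder; a Redis GET through
a client such as Jedis would follow the same read-then-broadcast shape):

    // In spark-shell (sc is the shell's SparkContext); the path is a placeholder.
    import org.apache.hadoop.fs.{FileSystem, Path}

    val path = new Path("hdfs:///resources/model.bin")
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val in = fs.open(path)
    val modelBytes = try {
      val buf = new Array[Byte](fs.getFileStatus(path).getLen.toInt)
      in.readFully(buf)   // FSDataInputStream extends DataInputStream
      buf
    } finally {
      in.close()
    }
    // Broadcast once from the driver; executors use modelBroadcast.value in their tasks.
    val modelBroadcast = sc.broadcast(modelBytes)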

Regards
Sab

On Tue, Jan 12, 2016 at 1:44 AM, Dmitry Goldenberg <dgoldenberg123@gmail.com
> wrote:

> We have a bunch of Spark jobs deployed and a few large resource files such
> as e.g. a dictionary for lookups or a statistical model.
>
> Right now, these are deployed as part of the Spark jobs which will
> eventually make the mongo-jars too bloated for deployments.
>
> What are some of the best practices to consider for maintaining and
> sharing large resource files like these?
>
> Thanks.
>



-- 

Architect - Big Data
Ph: +91 99805 99458

Manthan Systems | *Company of the year - Analytics (2014 Frost and Sullivan
India ICT)*
+++

Re: Best practices for sharing/maintaining large resource files for Spark jobs

Posted by Gavin Yue <yu...@gmail.com>.
Has anyone used Ignite in a production system?


On Mon, Jan 11, 2016 at 11:44 PM, Jörn Franke <jo...@gmail.com> wrote:

> You can look at ignite as a HDFS cache or for  storing rdds.
>
> > On 11 Jan 2016, at 21:14, Dmitry Goldenberg <dg...@gmail.com>
> wrote:
> >
> > We have a bunch of Spark jobs deployed and a few large resource files
> such as e.g. a dictionary for lookups or a statistical model.
> >
> > Right now, these are deployed as part of the Spark jobs which will
> eventually make the mongo-jars too bloated for deployments.
> >
> > What are some of the best practices to consider for maintaining and
> sharing large resource files like these?
> >
> > Thanks.
>
>
>

Re: Best practices for sharing/maintaining large resource files for Spark jobs

Posted by Jörn Franke <jo...@gmail.com>.
Ignite can also cache RDDs.
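
For example, via Ignite's Spark module, roughly as follows (class/method
names are from memory and the cache name is made up -- please check the
Ignite docs for the exact API in your version):

    // In spark-shell with the ignite-spark module on the classpath; names are from memory.
    import org.apache.ignite.configuration.IgniteConfiguration
    import org.apache.ignite.spark.IgniteContext

    val igniteContext = new IgniteContext(sc, () => new IgniteConfiguration())

    // An IgniteRDD is a live view over an Ignite cache, shared across jobs.
    val sharedDict = igniteContext.fromCache[String, String]("dictionaryCache")

    // Populate it once from a plain Spark RDD; later jobs can read or join against it.
    val dictionary = sc.textFile("hdfs:///resources/dictionary.tsv")
      .map { line => val Array(term, value) = line.split("\t", 2); (term, value) }
    sharedDict.savePairs(dictionary)
    println(s"cached entries: ${sharedDict.count()}")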

> On 12 Jan 2016, at 13:06, Dmitry Goldenberg <dg...@gmail.com> wrote:
> 
> Jorn, you said Ignite or ... ? What was the second choice you were thinking of? It seems that got omitted.
> 
>> On Jan 12, 2016, at 2:44 AM, Jörn Franke <jo...@gmail.com> wrote:
>> 
>> You can look at ignite as a HDFS cache or for  storing rdds. 
>> 
>>> On 11 Jan 2016, at 21:14, Dmitry Goldenberg <dg...@gmail.com> wrote:
>>> 
>>> We have a bunch of Spark jobs deployed and a few large resource files such as e.g. a dictionary for lookups or a statistical model.
>>> 
>>> Right now, these are deployed as part of the Spark jobs which will eventually make the mongo-jars too bloated for deployments.
>>> 
>>> What are some of the best practices to consider for maintaining and sharing large resource files like these?
>>> 
>>> Thanks.



Re: Best practices for sharing/maintaining large resource files for Spark jobs

Posted by Dmitry Goldenberg <dg...@gmail.com>.
Jörn, you said Ignite or ...? What was the second choice you were thinking of? It seems that part got omitted.

> On Jan 12, 2016, at 2:44 AM, Jörn Franke <jo...@gmail.com> wrote:
> 
> You can look at ignite as a HDFS cache or for  storing rdds. 
> 
>> On 11 Jan 2016, at 21:14, Dmitry Goldenberg <dg...@gmail.com> wrote:
>> 
>> We have a bunch of Spark jobs deployed and a few large resource files such as e.g. a dictionary for lookups or a statistical model.
>> 
>> Right now, these are deployed as part of the Spark jobs which will eventually make the mongo-jars too bloated for deployments.
>> 
>> What are some of the best practices to consider for maintaining and sharing large resource files like these?
>> 
>> Thanks.



Re: Best practices for sharing/maintaining large resource files for Spark jobs

Posted by Jörn Franke <jo...@gmail.com>.
You can look at Ignite as an HDFS cache or for storing RDDs.

> On 11 Jan 2016, at 21:14, Dmitry Goldenberg <dg...@gmail.com> wrote:
> 
> We have a bunch of Spark jobs deployed and a few large resource files such as e.g. a dictionary for lookups or a statistical model.
> 
> Right now, these are deployed as part of the Spark jobs which will eventually make the mongo-jars too bloated for deployments.
> 
> What are some of the best practices to consider for maintaining and sharing large resource files like these?
> 
> Thanks.
