Posted to user@spark.apache.org by Vibhor Banga <vi...@gmail.com> on 2014/05/30 16:21:14 UTC

Using Spark on Data size larger than Memory size

Hi all,

I am planning to use Spark with HBase, where I generate an RDD by reading data
from an HBase table.

I want to know: in the case when the size of the HBase table grows larger
than the size of the RAM available in the cluster, will the application fail,
or will there just be an impact on performance?

Any thoughts in this direction will be helpful and are welcome.

Thanks,
-Vibhor

Re: Using Spark on Data size larger than Memory size

Posted by Andrew Ash <an...@andrewash.com>.
If an individual partition becomes too large to fit in memory, then the
usual approach would be to repartition into more partitions so that each one
is smaller. Hopefully it would then fit.
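
For illustration only, a minimal sketch of that approach (Spark 1.x Scala API;
the data source, app name, and partition count below are made up, not taken
from this thread):

import org.apache.spark.{SparkConf, SparkContext}

object RepartitionSketch {
  def main(args: Array[String]): Unit = {
    // local[4] is only here so the sketch runs standalone; in a real job the
    // master comes from spark-submit.
    val sc = new SparkContext(
      new SparkConf().setAppName("repartition-sketch").setMaster("local[4]"))

    // Stand-in for an RDD scanned out of HBase (in practice that would come
    // from sc.newAPIHadoopRDD with a TableInputFormat).
    val bigRdd = sc.parallelize(1 to 1000000)

    // More partitions means each one is smaller, so per-partition work such
    // as mapPartitions() is less likely to exhaust memory.
    val smaller = bigRdd.repartition(400)
    println(smaller.partitions.length) // 400

    sc.stop()
  }
}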

Re: Using Spark on Data size larger than Memory size

Posted by Roger Hoover <ro...@gmail.com>.
Andrew,

Thank you.  I'm using mapPartitions() but, as you say, it requires that
every partition fit in memory.  This will work for now but may not always,
so I was wondering about another way.

Thanks,

Roger



Re: Using Spark on Data size larger than Memory size

Posted by Andrew Ash <an...@andrewash.com>.
Hi Roger,

You should be able to sort within partitions using the rdd.mapPartitions()
method, and that shouldn't require holding all data in memory at once.  It
does require holding the entire partition in memory though.  Do you need
the partition to never be held in memory all at once?
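
A rough sketch of that pattern (Scala; the pair data is invented and an
existing SparkContext named sc is assumed). Note that toArray is exactly the
step that materializes the whole partition:

// Sort records within each partition, not globally.
val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("d", 4), ("c", 3)), numSlices = 2)

val sortedWithinPartitions = pairs.mapPartitions(
  iter => iter.toArray.sortBy(_._1).iterator, // holds one whole partition in memory
  preservesPartitioning = true)

// glom() turns each partition into an array so the per-partition order is visible.
sortedWithinPartitions.glom().collect().foreach(p => println(p.mkString(", ")))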

As for the work that Aaron mentioned, I think he might be referring to the
discussion and code surrounding
https://issues.apache.org/jira/browse/SPARK-983

Cheers!
Andrew



Re: Using Spark on Data size larger than Memory size

Posted by Roger Hoover <ro...@gmail.com>.
I think it would be very handy to be able to specify that you want sorting
during a partitioning stage.



Re: Using Spark on Data size larger than Memory size

Posted by Roger Hoover <ro...@gmail.com>.
Hi Aaron,

When you say that sorting is being worked on, can you elaborate a little
more please?

In particular, I want to sort the items within each partition (not
globally) without necessarily bringing them all into memory at once.

Thanks,

Roger



Re: Using Spark on Data size larger than Memory size

Posted by Allen Chang <al...@yahoo.com>.
Thanks. We've run into timeout issues at scale as well. We were able to
work around them by setting the following JVM options:

-Dspark.akka.askTimeout=300
-Dspark.akka.timeout=300
-Dspark.worker.timeout=300

NOTE: these JVM options *must* be set on the worker nodes (and not just the
driver/master) for the settings to take effect.
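
For anyone preferring to set these from application code rather than raw JVM
flags, a hedged sketch (Spark 1.x Scala API; the app name is invented). The
two spark.akka.* settings are ordinary Spark properties, so in principle they
can also go through SparkConf, which Spark ships to the executors;
spark.worker.timeout is read by the standalone Master/Worker daemons
themselves, so it stays a daemon-side JVM option as described above:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("timeout-tuning-sketch")   // hypothetical app name
  .set("spark.akka.askTimeout", "300")   // seconds
  .set("spark.akka.timeout", "300")
// Master/deploy settings are assumed to come from spark-submit.
val sc = new SparkContext(conf)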

Allen

Re: Using Spark on Data size larger than Memory size

Posted by Surendranauth Hiraman <su...@velos.io>.
My team has been using DISK_ONLY. The challenge with this approach is
knowing when to unpersist if your job creates a lot of intermediate data.
The "right solution" would be to mark a transient RDD as being capable of
spilling to disk, rather than having to persist it to force this behavior.
Hopefully that will be added at some point, now that Iterable is available
in the PairRDDFunctions API.

The other thing that was important for us was setting the executor memory
to the right level because it seems some intermediate buffers can be large.

We are currently using 16 GB for spark.executor.memory and 18 GB for
SPARK_WORKER_MEMORY. Parallelism (spark.default.parallelism) seems to have
an impact, though we are still working on tuning that.

We are using 16 executors/workers.
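
As a rough, illustrative sketch of that kind of setup (Scala; the numbers
mirror the ones above, the job itself is invented, and settings like executor
memory can equally be set outside the code):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("disk-only-sketch")            // hypothetical app name
  .set("spark.executor.memory", "16g")
  .set("spark.default.parallelism", "128")   // illustrative value, tune per cluster
// Master/deploy settings are assumed to come from spark-submit.
val sc = new SparkContext(conf)

// A large intermediate RDD kept on disk only, then explicitly dropped once
// downstream stages have consumed it.
val intermediate = sc.parallelize(1 to 1000000)
  .map(i => (i % 1000, i.toLong))
  .persist(StorageLevel.DISK_ONLY)

println(intermediate.count())
intermediate.unpersist()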

Our test input size is about 10 GB but we generate up to a total of 500GB
of intermediate and final data.

Right now, we have gotten past our memory issues and we are now facing a
communication timeout issue in some long-tail tasks, so that's something to
watch out for.

If you come up with anything else, please let us know. :-)

-Suren


SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@v <su...@sociocast.com>elos.io
W: www.velos.io




Re: Using Spark on Data size larger than Memory size

Posted by Allen Chang <al...@yahoo.com>.
Thanks for the clarification.

What is the proper way to configure RDDs when your aggregate data size
exceeds your available working memory size? In particular, in addition to
typical operations, I'm performing cogroups, joins, and coalesces/shuffles.

I see that the default storage level for RDDs is MEMORY_ONLY. Do I just need
to set the storage level for all of my RDDs to something like
MEMORY_AND_DISK? Do I need to do anything else to get graceful behavior in
the presence of coalesces/shuffles, cogroups, and joins?
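
For reference, the storage-level change being asked about looks like the
sketch below (Scala; the RDD is invented and an existing SparkContext named
sc is assumed). Whether anything more is needed for shuffles and joins is
addressed in the reply above.

import org.apache.spark.storage.StorageLevel

// MEMORY_AND_DISK keeps partitions in memory when they fit and writes the
// rest to local disk; the default MEMORY_ONLY instead drops (and later
// recomputes) partitions that don't fit.
val cached = sc.parallelize(1 to 100000)
  .map(i => (i % 100, i))
  .persist(StorageLevel.MEMORY_AND_DISK)

println(cached.count())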

Thanks,
Allen

Re: Using Spark on Data size larger than Memory size

Posted by Vibhor Banga <vi...@gmail.com>.
Aaron, thank you for your response and for clarifying things.

-Vibhor



Re: Using Spark on Data size larger than Memory size

Posted by Aaron Davidson <il...@gmail.com>.
There is no fundamental issue if you're running on data that is larger than
cluster memory size. Many operations can stream data through, and thus
memory usage is independent of input data size. Certain operations require
an entire *partition* (not dataset) to fit in memory, but there are not
many instances of this left (sorting comes to mind, and this is being
worked on).

In general, one problem with Spark today is that you *can* OOM under
certain configurations, and it's possible you'll need to change from the
default configuration if you're doing very memory-intensive jobs.
However, there are very few cases where Spark would simply fail as a matter
of course -- for instance, you can always increase the number of partitions
to decrease the size of any given one, or repartition data to eliminate
skew.

Regarding impact on performance, as Mayur said, there may absolutely be an
impact depending on your jobs. If you're doing a join on a very large
amount of data with few partitions, then we'll have to spill to disk. If
you can't cache your working set of data in memory, you will also see a
performance degradation. Spark enables the use of memory to make things
fast, but if you just don't have enough memory, it won't be terribly fast.
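
To make the partition-count point concrete, a small hedged sketch (Scala; the
data is invented, an existing SparkContext named sc is assumed, and the import
is what pre-1.3 Spark needs for pair-RDD operations):

import org.apache.spark.SparkContext._   // PairRDDFunctions implicits (Spark 1.x)

val left  = sc.parallelize((1 to 100000).map(i => (i, i * 2)))
val right = sc.parallelize((1 to 100000).map(i => (i, i.toString)))

// Passing an explicit partition count to the join keeps each reduce-side
// partition smaller, at the cost of more tasks and more shuffle files.
// Skewed inputs can also be spread out beforehand with repartition().
val joined = left.join(right, 512)

println(joined.partitions.length) // 512
println(joined.count())           // 100000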



Re: Using Spark on Data size larger than Memory size

Posted by Mayur Rustagi <ma...@gmail.com>.
Clearly there will be an impact on performance, but frankly it depends on
what you are trying to achieve with the dataset.

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>




Re: Using Spark on Data size larger than Memory size

Posted by Vibhor Banga <vi...@gmail.com>.
Any inputs would be really helpful.

Thanks,
-Vibhor


-- 
Vibhor Banga
Software Development Engineer
Flipkart Internet Pvt. Ltd., Bangalore