Posted to user@spark.apache.org by Pierre Borckmans <pi...@realimpactanalytics.com> on 2014/04/10 18:07:14 UTC

Behaviour of caching when dataset does not fit into memory

Hi there,

Just playing around in the Spark shell, I am a bit confused by the performance I observe when the dataset does not fit into memory:

- I load a dataset with roughly 500 million rows.
- I do a count; it takes about 20 seconds.
- Now if I cache the RDD and do a count again (which will try to cache the data again), it takes roughly 90 seconds (the fraction cached is only 25%).
	=> Is this expected? Roughly 5 times slower when caching and not enough RAM is available?
- The subsequent calls to count are also really slow: about 90 seconds as well.
	=> I can see that the first 25% of tasks are fast (the ones dealing with data in memory), but then it gets really slow…

Am I missing something?
I thought performance would scale roughly linearly with the amount of data that fits into memory…
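
For concreteness, a minimal sketch of the kind of shell session described above; the path, input format, and timings are placeholders for illustration, not the actual job:

    // Hypothetical Spark shell session illustrating the scenario above.
    // "hdfs:///data/events" is a placeholder path, not the real dataset.
    val data = sc.textFile("hdfs:///data/events")   // ~500 million rows

    data.count()        // first count: streams from the source (~20 s reported)

    data.cache()        // mark the RDD for caching (default MEMORY_ONLY)
    data.count()        // recomputes and tries to cache; only ~25% fits in memory

    data.count()        // later counts: cached partitions come from memory,
                        // the remaining ones are recomputed from the source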

Thanks for your help!

Cheers





Pierre Borckmans

RealImpact Analytics | Brussels Office
www.realimpactanalytics.com | pierre.borckmans@realimpactanalytics.com

FR +32 485 91 87 31 | Skype pierre.borckmans






Re: Behaviour of caching when dataset does not fit into memory

Posted by Mayur Rustagi <ma...@gmail.com>.
One reason could be that Spark uses scratch disk space for intermediate
calculations, so as you perform calculations that intermediate data needs to
be flushed before memory can be used for other operations.
A second issue could be that large intermediate data pushes more of the RDD
onto disk (something I see a lot in warehouse use cases).
Can you check in the Storage tab how much of the RDD is in memory on each
subsequent count, and how much intermediate data is generated each time?
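
As a side note, one way to inspect the cached fraction outside the web UI is the RDD storage info exposed on the SparkContext; this is only a sketch, and the exact RDDInfo field names may vary slightly between Spark versions:

    // Rough sketch: print how much of each persisted RDD is held in memory vs. on disk.
    sc.getRDDStorageInfo.foreach { info =>
      println(s"${info.name}: ${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
              s"mem=${info.memSize} bytes, disk=${info.diskSize} bytes")
    }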

Re: Behaviour of caching when dataset does not fit into memory

Posted by Pierre Borckmans <pi...@realimpactanalytics.com>.
Hi Matei,

Could you enlighten us on this please?

Thanks

Pierre



Re: Behaviour of caching when dataset does not fit into memory

Posted by Jérémy Subtil <je...@gmail.com>.
Hi Xusen,

I was convinced the cache() method would involve in-memory operations only
and have nothing to do with disk, as the underlying default cache strategy
is MEMORY_ONLY. Am I missing something?
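
For reference, cache() is shorthand for persist with the MEMORY_ONLY storage level; the sketch below (placeholder paths, not a real job) contrasts it with MEMORY_AND_DISK, where partitions that do not fit are spilled to local disk instead of being recomputed:

    import org.apache.spark.storage.StorageLevel

    // cache() is equivalent to persist(StorageLevel.MEMORY_ONLY):
    // partitions that do not fit in memory are not cached at all and are
    // recomputed from the lineage on the next action.
    val rddA = sc.textFile("hdfs:///data/a").cache()

    // MEMORY_AND_DISK spills partitions that do not fit in memory to
    // local disk instead of recomputing them.
    val rddB = sc.textFile("hdfs:///data/b").persist(StorageLevel.MEMORY_AND_DISK)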



Re: Behaviour of caching when dataset does not fit into memory

Posted by 尹绪森 <yi...@gmail.com>.
Hi Pierre,

1. cache() costs time to move data from disk to memory, so please do not
use cache() if your job is not an iterative one.

2. If your dataset is larger than the available memory, then a replacement
strategy will exchange data between memory and disk.
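
A small sketch of point 1, with a placeholder path: caching only pays for itself when the same RDD is read more than once, as in an iterative job:

    // Single pass: cache() would only add overhead here.
    val onePass = sc.textFile("hdfs:///data/events")
    onePass.count()

    // Iterative reuse: later passes can read the already-cached partitions.
    val reused = sc.textFile("hdfs:///data/events").cache()
    for (i <- 1 to 10) {
      reused.filter(_.length > i).count()   // repeated actions over the same RDD
    }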




-- 
Best Regards
-----------------------------------
Xusen Yin    尹绪森
Intel Labs China
Homepage: http://yinxusen.github.io/