Posted to user@spark.apache.org by Harit Vishwakarma <ha...@gmail.com> on 2015/07/17 13:03:10 UTC

Spark APIs memory usage?

Hi,

I used the createDataFrame API of SQLContext in Python and am getting an
OutOfMemoryException. I am wondering whether it creates the whole DataFrame in
memory?
I did not find any documentation describing the memory usage of Spark APIs.
The existing documentation is nice, but a little more detail (especially on
memory usage, data distribution, etc.) would really help.

-- 
Regards
Harit Vishwakarma

Re: Spark APIs memory usage?

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
This is what happens when you create a DataFrame
<https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L430>:
in your case, rdd1.values().flatMap(fun) will be executed
<https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L127>
when you create the df. Can you run just rdd1.values().flatMap(fun).count(),
or a save, to check that it executes without any problems?
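
A minimal sketch of that check (assuming an existing SparkContext sc; the tiny
pair RDD and the lambda below are only placeholders for your real rdd1 and fun):

rdd1 = sc.parallelize([(1, "a b"), (2, "c d")])     # placeholder for the real rdd1
fun = lambda s: s.split()                           # placeholder for the real fun

rdd2 = rdd1.values().flatMap(fun)                   # lazy: nothing executes yet
print(rdd2.count())                                 # count() forces fun to run on every record
# rdd2.saveAsTextFile("hdfs:///tmp/flatmap-check")  # or write it out instead of counting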

Thanks
Best Regards


Re: Spark APIs memory usage?

Posted by Harit Vishwakarma <ha...@gmail.com>.
Even if I remove the numpy calls (no matrices loaded), the same exception
still occurs.
Can anyone tell me what createDataFrame does internally? Are there any
alternatives to it?
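
One thing I plan to try, in case schema inference is the issue: passing an
explicit schema (or a samplingRatio) to createDataFrame, roughly like the
sketch below. The column names and types are only placeholders for whatever
fun actually emits.

from pyspark.sql.types import StructType, StructField, LongType, StringType

# Placeholder schema: replace with the actual fields produced by fun.
schema = StructType([
    StructField("id",    LongType(),   False),
    StructField("token", StringType(), True),
])

df = sqlCtx.createDataFrame(rdd2, schema)                 # no schema inference over the data
# df = sqlCtx.createDataFrame(rdd2, samplingRatio=0.01)   # or infer from a 1% sample only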



-- 
Regards
Harit Vishwakarma

Re: Spark APIs memory usage?

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
I suspect it's numpy filling up the memory.
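
For a rough sense of scale: a dense 10000 x 10000 float64 matrix is 800 MB, so
the three matrices alone are about 2.4 GB on the driver before Spark allocates
anything (assuming the default float64 dtype):

import numpy as np

m = np.zeros((10000, 10000))     # float64 by default: 8 bytes per element
print(m.nbytes / 1e6)            # 800.0 MB for one matrix
print(3 * m.nbytes / 1e9)        # 2.4 GB for all three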

Thanks
Best Regards


Re: Spark APIs memory usage?

Posted by Harit Vishwakarma <ha...@gmail.com>.
1. load 3 matrices of size ~ 10000 X 10000 using numpy.
2. rdd2 = rdd1.values().flatMap( fun )  # rdd1 has roughly 10^7 tuples
3. df = sqlCtx.createDataFrame(rdd2)
4. df.save() # in parquet format

It throws the exception in the createDataFrame() call. I don't know what
exactly it is creating: everything in memory? Or can I make it persist to
storage while it is being created?
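
To make the question concrete, here is my current understanding of where work
actually happens in those steps (please correct me if this is wrong; the
output path is a placeholder):

rdd2 = rdd1.values().flatMap(fun)   # (2) lazy transformation, nothing runs yet

# (3) without a schema, createDataFrame has to look at some of rdd2's data to
# infer one, so part of the flatMap already runs here
df = sqlCtx.createDataFrame(rdd2)

# (4) the save triggers the full job; as far as I understand, the DataFrame is
# not held in memory unless it is explicitly cached/persisted
df.save("hdfs:///tmp/out.parquet", source="parquet")   # placeholder path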

Thanks




-- 
Regards
Harit Vishwakarma

Re: Spark APIs memory usage?

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Can you paste the code? How much memory does your system have and how big
is your dataset? Did you try df.persist(StorageLevel.MEMORY_AND_DISK)?
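
If not, a minimal sketch (the path is a placeholder; partitions that do not
fit in memory are stored on disk instead):

from pyspark import StorageLevel

df = sqlCtx.createDataFrame(rdd2)
df.persist(StorageLevel.MEMORY_AND_DISK)                # spill to disk what doesn't fit in memory
df.save("hdfs:///tmp/out.parquet", source="parquet")    # placeholder path
df.unpersist()                                          # free the cache when done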

Thanks
Best Regards


Re: Spark APIs memory usage?

Posted by Harit Vishwakarma <ha...@gmail.com>.
Thanks.
The code is running on a single machine,
and that still doesn't answer my question.



-- 
Regards
Harit Vishwakarma

Re: Spark APIs memory usage?

Posted by ayan guha <gu...@gmail.com>.
You can bump up the number of partitions while creating the RDD you are using
for the df.
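
Something like this (data, fun and the 1000 below are placeholders to tune):

rdd1 = sc.parallelize(data, numSlices=1000)           # more slices when creating the RDD
# or repartition an existing RDD before building the df:
rdd2 = rdd1.values().flatMap(fun).repartition(1000)
df = sqlCtx.createDataFrame(rdd2)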