Posted to user@spark.apache.org by sam smith <qu...@gmail.com> on 2023/03/11 18:34:48 UTC

What could be the cause of an execution freeze on Hadoop for small datasets?

Hello guys,

I am launching a Spark program from code (client mode) to run on Hadoop.
If I call methods such as show(), count() or collectAsList() on the dataset
(they are displayed in the Spark UI) after performing heavy transformations
on the columns, the execution freezes on Hadoop, regardless of the dataset
size (which is intriguing for small datasets!).
Any idea what could be causing this kind of issue?
Note that if I call collectAsList() on the dataset at the beginning of the
program (before performing the transformations on the columns), the method
returns results correctly.

Thanks.
Regards
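
A point that may be relevant to the behaviour described above: Spark Dataset
transformations are lazy, so a method such as show(), count() or
collectAsList() placed after the transformations is what actually runs them,
while the same call at the beginning of the program has only the initial read
to do. The sketch below is a plain-Java analogy built on java.util.stream,
not Spark code; the class LazyPipelineDemo and the helper heavyTransform are
hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyPipelineDemo {
    // Records the order in which the steps actually run.
    static final List<String> log = new ArrayList<>();

    // Hypothetical stand-in for a "heavy" column transformation.
    static int heavyTransform(int value) {
        log.add("transform " + value);
        return value * 10;
    }

    static List<Integer> run() {
        log.clear();
        // Declaring the transformation does no work yet
        // (comparable to a Spark transformation).
        Stream<Integer> pipeline =
                Stream.of(1, 2, 3).map(LazyPipelineDemo::heavyTransform);
        log.add("pipeline declared");
        // The terminal operation (comparable to a Spark action)
        // is what executes every pending step.
        List<Integer> result = pipeline.collect(Collectors.toList());
        log.add("collected");
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run());
        System.out.println(log);
    }
}
```

Running main prints [10, 20, 30] followed by the log
[pipeline declared, transform 1, transform 2, transform 3, collected]:
nothing is transformed until the terminal operation runs. By analogy, a
freeze observed at show() or count() may originate in the transformations
that precede it rather than in the call itself.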

Re: What could be the cause of an execution freeze on Hadoop for small datasets?

Posted by sam smith <qu...@gmail.com>.
" In this case your program may work because effectively you are not using
the spark in yarn on the hadoop cluster "

I am actually using YARN, as mentioned (client mode).
I already know that, but the problem is not limited to collectAsList: the
execution also freezes, for example, when calling save() on the dataset
(after the transformations; before them, save() works fine).

I hope the question is now clearer for anybody reading.

On Sat, 11 Mar 2023 at 20:15, Mich Talebzadeh <mi...@gmail.com> wrote:

> collectAsList brings all the data into the driver which is a single JVM
> on a single node. In this case your program may work because effectively
> you are not using the spark in yarn on the hadoop cluster. The benefit of
> Spark is that you can process a large amount of data using the memory and
> processors across multiple executors on multiple nodes.

Re: What could be the cause of an execution freeze on Hadoop for small datasets?

Posted by Mich Talebzadeh <mi...@gmail.com>.
collectAsList brings all the data into the driver, which is a single JVM on
a single node. In this case your program may work because effectively you
are not using the spark in yarn on the hadoop cluster. The benefit of Spark
is that you can process a large amount of data using the memory and
processors across multiple executors on multiple nodes.
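
To make the difference concrete, here is a toy model in plain Java (not
Spark code): each inner list stands in for a partition held by an executor,
collectToDriver mimics what collectAsList() does with the rows, and
countOnExecutors mimics an aggregate such as count(). All names are
hypothetical, and the model ignores scheduling, serialization and memory
limits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CollectVsCountDemo {
    // Toy data: each inner list plays the role of one partition on an executor.
    static final List<List<String>> PARTITIONS = Arrays.asList(
            Arrays.asList("a", "b"),
            Arrays.asList("c"),
            Arrays.asList("d", "e", "f"));

    // Like collectAsList(): every row of every partition is copied
    // into a single list held by the "driver".
    static List<String> collectToDriver(List<List<String>> partitions) {
        List<String> driverCopy = new ArrayList<>();
        for (List<String> partition : partitions) {
            driverCopy.addAll(partition);
        }
        return driverCopy;
    }

    // Like count(): each partition contributes one number,
    // so only small values reach the "driver".
    static long countOnExecutors(List<List<String>> partitions) {
        long total = 0;
        for (List<String> partition : partitions) {
            total += partition.size();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(collectToDriver(PARTITIONS));
        System.out.println(countOnExecutors(PARTITIONS));
    }
}
```

Both calls account for the same 6 rows, but the first one materializes every
row in a single list, which is the cost described above once the dataset is
large. In Spark itself the per-partition work runs on the executors in both
cases; what differs is how much data travels back to the driver.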


HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 11 Mar 2023 at 19:01, sam smith <qu...@gmail.com> wrote:

> not sure what you mean by your question, but it is not helping in any case

Re: What could be the cause of an execution freeze on Hadoop for small datasets?

Posted by sam smith <qu...@gmail.com>.
Not sure what you mean by your question, but it is not helping in any case.


On Sat, 11 Mar 2023 at 19:54, Mich Talebzadeh <mi...@gmail.com> wrote:

> ... To note that if I execute collectAsList on the dataset at the
> beginning of the program....
>
> What do you think collectAsList does?

Re: What could be the cause of an execution freeze on Hadoop for small datasets?

Posted by Mich Talebzadeh <mi...@gmail.com>.
> ... To note that if I execute collectAsList on the dataset at the beginning
> of the program....

What do you think collectAsList does?






