Posted to user@spark.apache.org by Olivier Girardot <o....@lateral-thoughts.com> on 2015/08/25 09:39:21 UTC
Re: Spark stages very slow to complete
I have pretty much the same "symptoms": the computation itself is pretty
fast, but most of my job's run time is spent in JavaToPython steps (~15 min).
I'm using Spark 1.5.0-rc1 with DataFrames and ML Pipelines.
Any insights into what exactly these steps are?
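As a rough sanity check on the task metrics quoted below, here is a small sketch that adds up the overhead rows of the median column and compares them with the median task duration. The numbers are copied from the quoted table; the dictionary and variable names are only illustrative, and 1.7 min is rounded to 102 s.

```python
# Median column of the quoted task-metrics table, in milliseconds.
# "Duration" is 1.7 min, taken here as 102 s (an approximation).
median_ms = {
    "Duration": 102_000,
    "Scheduler Delay": 4,
    "Task Deserialization Time": 3,
    "GC Time": 0,
    "Result Serialization Time": 0,
    "Getting Result Time": 0,
}

# Sum every reported overhead row, i.e. everything except the duration itself.
overhead_ms = sum(v for name, v in median_ms.items() if name != "Duration")

# Whatever is left is time the UI does not break down any further.
remaining_ms = median_ms["Duration"] - overhead_ms
share = overhead_ms / median_ms["Duration"]

print(f"reported overhead: {overhead_ms} ms")
print(f"not broken down:   {remaining_ms} ms")
print(f"overhead share:    {share:.4%}")
```

If this reading of the table is right, the listed overheads explain only a few milliseconds per task, so nearly all of the duration falls inside the task body itself, which this table does not break down further.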
2015-06-02 9:18 GMT+02:00 Karlson <ks...@siberie.de>:
> Hi, the code is a few hundred lines of Python. I can try to compose a
> minimal example as soon as I find the time, though. Any ideas until then?
>
>
>> Would you mind posting the code?
>>
>> On 2 Jun 2015 00:53, "Karlson" <ks...@siberie.de> wrote:
>>
>>> Hi,
>>>
>>> In all my (PySpark) Spark jobs that become somewhat more involved, I am
>>> experiencing the issue that some stages take a very long time to complete
>>> and sometimes don't complete at all. This clearly correlates with the size
>>> of my input data. Looking at the stage details for one such stage, I am
>>> wondering where Spark spends all this time. Take this table of the stage's
>>> task metrics for example:
>>>
>>> Metric                     Min          25th percentile  Median       75th percentile  Max
>>> Duration                   1.4 min      1.5 min          1.7 min      1.9 min          2.3 min
>>> Scheduler Delay            1 ms         3 ms             4 ms         5 ms             23 ms
>>> Task Deserialization Time  1 ms         2 ms             3 ms         8 ms             22 ms
>>> GC Time                    0 ms         0 ms             0 ms         0 ms             0 ms
>>> Result Serialization Time  0 ms         0 ms             0 ms         0 ms             1 ms
>>> Getting Result Time        0 ms         0 ms             0 ms         0 ms             0 ms
>>> Input Size / Records       23.9 KB / 1  24.0 KB / 1      24.1 KB / 1  24.1 KB / 1      24.3 KB / 1
>>>
>>> Why is the overall duration almost 2 min? Where is all this time spent
>>> when no progress of the stage is visible? The progress bar simply displays
>>> 0 succeeded tasks for a very long time before sometimes slowly progressing.
>>>
>>> Also, the name of the stage displayed above is `javaToPython at null:-1`,
>>> which I find very uninformative. I don't even know which action exactly is
>>> responsible for this stage. Does anyone experience similar issues or have
>>> any advice for me?
>>>
>>> Thanks!
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>> For additional commands, e-mail: user-help@spark.apache.org
>>>
>>>
>>>
--
*Olivier Girardot* | Associé
o.girardot@lateral-thoughts.com
+33 6 24 09 17 94