Posted to user@spark.apache.org by "Sanders, Isaac B" <sa...@rose-hulman.edu> on 2016/01/21 16:35:01 UTC

10hrs of Scheduler Delay

Hey all,

I am a CS student in the United States working on my senior thesis.

My thesis uses Spark, and I am encountering some trouble.

I am using https://github.com/alitouka/spark_dbscan, and to determine parameters I am running the utility class it supplies, org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver.

I am on a 10-node cluster: one machine with 8 cores and 32 GB of memory, and nine machines with 6 cores and 16 GB of memory each.

I have only 442M of data, which seems like it should be trivial, but the job stalls at the last stage.

It was stuck in Scheduler Delay for 10 hours overnight, and I have tried a number of things over the last couple of days, but nothing seems to be helping.

I have tried (see the sketch after this list):
- Increasing heap sizes and numbers of cores
- More/fewer executors with different amounts of resources
- Kryo serialization
- FAIR scheduling
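
For concreteness, the items above look roughly like this in driver code (a sketch only: every value below is an illustrative placeholder, not the exact setting I used):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the knobs I have been varying; all numbers are placeholders.
val conf = new SparkConf()
  .setAppName("dbscan-nearest-neighbor-exploration")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // Kryo serialization
  .set("spark.executor.instances", "9") // more/fewer executors
  .set("spark.executor.cores", "4")     // numbers of cores
  .set("spark.executor.memory", "8g")   // heap sizes
  .set("spark.scheduler.mode", "FAIR")  // FAIR scheduling
val sc = new SparkContext(conf)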

It doesn’t seem like it should require this much. Any ideas?

- Isaac

Re: 10hrs of Scheduler Delay

Posted by "Sanders, Isaac B" <sa...@rose-hulman.edu>.
I have run the driver on a smaller dataset (k=2, n=5,000) and it worked quickly and didn’t hang like this. This dataset is closer to k=10, n=4.4M, but I am using more resources on this one.

- Isaac


Re: 10hrs of Scheduler Delay

Posted by Ted Yu <yu...@gmail.com>.
You may have seen the following on the GitHub page:

Latest commit 50fdf0e on Feb 22, 2015

That was 11 months ago.

Can you search for a similar algorithm that runs on Spark and is newer?

If nothing is found, consider running the tests that come with the project to determine whether the delay is intrinsic.

Cheers


Re: 10hrs of Scheduler Delay

Posted by "Sanders, Isaac B" <sa...@rose-hulman.edu>.
That thread seems to be moving; it oscillates between a few different traces… Maybe it is working, but it seems odd that it would take this long.

This is third-party code, and after looking at some of it, I think it might not be as Spark-y as it could be.

I linked it below. I don’t know a lot about Spark, so it might be fine, but I have my suspicions.

https://github.com/alitouka/spark_dbscan/blob/master/src/src/main/scala/org/alitouka/spark/dbscan/exploratoryAnalysis/DistanceToNearestNeighborDriver.scala
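
To make that suspicion a bit more concrete, this is roughly the shape I would expect a fully distributed version of that exploratory step to take. It is only a sketch, not the library's actual code; Point and euclidean here are stand-ins I made up:

import org.apache.spark.rdd.RDD

// Hypothetical sketch: each point's distance to its nearest neighbour within a
// partition, computed entirely inside mapPartitions so no large collection is
// pulled back to the driver. Point and euclidean are stand-ins, not spark_dbscan types.
case class Point(coords: Array[Double])

def euclidean(a: Point, b: Point): Double =
  math.sqrt(a.coords.zip(b.coords).map { case (x, y) => (x - y) * (x - y) }.sum)

def nearestNeighborDistances(points: RDD[Point]): RDD[Double] =
  points.mapPartitions { iter =>
    val pts = iter.toArray
    pts.iterator.map { p =>
      var best = Double.MaxValue
      pts.foreach { q =>
        if (!(q eq p)) {            // skip the point itself (by reference)
          val d = euclidean(p, q)
          if (d < best) best = d
        }
      }
      best                          // stays Double.MaxValue if the partition has one point
    }
  }

If the real implementation instead gathers large intermediate collections on the driver, that could explain the stall, but that is only a guess on my part.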

- Isaac


Re: 10hrs of Scheduler Delay

Posted by Ted Yu <yu...@gmail.com>.
You may have noticed the following - does this indicate prolonged computation in your code?

org.apache.commons.math3.util.MathArrays.distance(MathArrays.java:205)
org.apache.commons.math3.ml.distance.EuclideanDistance.compute(EuclideanDistance.java:34)
org.alitouka.spark.dbscan.spatial.DistanceCalculation$class.calculateDistance(DistanceCalculation.scala:15)
org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver$.calculateDistance(DistanceToNearestNeighborDriver.scala:16)
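
If it helps to judge whether those distance calls could plausibly account for hours of work, a quick single-threaded timing along these lines would show how many calls per second one core manages (a rough sketch; the point count and dimensionality below are arbitrary, not taken from your data):

import org.apache.commons.math3.ml.distance.EuclideanDistance

// Rough local micro-benchmark of the distance call that dominates the stack trace.
// Sizes are arbitrary placeholders; adjust the dimensionality to match your data.
object DistanceTiming {
  def main(args: Array[String]): Unit = {
    val dist = new EuclideanDistance
    val rnd = new scala.util.Random(42)
    val points = Array.fill(100000)(Array.fill(10)(rnd.nextDouble()))
    var checksum = 0.0
    val start = System.nanoTime()
    var i = 0
    while (i < points.length - 1) {
      checksum += dist.compute(points(i), points(i + 1))
      i += 1
    }
    val elapsedMs = (System.nanoTime() - start) / 1e6
    println(s"${points.length - 1} distance calls in $elapsedMs ms (checksum $checksum)")
  }
}

If one core gets through millions of these per second, the 10-hour stall is more likely in how the work is being distributed than in the arithmetic itself.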



Re: 10hrs of Scheduler Delay

Posted by "Sanders, Isaac B" <sa...@rose-hulman.edu>.
Hadoop is: HDP 2.3.2.0-2950

Here is a gist (pastebin) of my versions en masse and a stack trace: https://gist.github.com/isaacsanders/2e59131758469097651b

Thanks


Re: 10hrs of Scheduler Delay

Posted by Ted Yu <yu...@gmail.com>.
Looks like you were running on YARN.

What Hadoop version are you using?

Can you capture a few stack traces of the AppMaster during the delay and pastebin them?

Thanks


Re: 10hrs of Scheduler Delay

Posted by "Sanders, Isaac B" <sa...@rose-hulman.edu>.
The Spark version is 1.4.1.

The logs are full of standard fare: nothing like an exception or even any interesting [INFO] lines.

Here is the script I am using: https://gist.github.com/isaacsanders/660f480810fbc07d4df2

Thanks
Isaac


Re: 10hrs of Scheduler Delay

Posted by Ted Yu <yu...@gmail.com>.
Can you provide a bit more information?

- the command line used to submit the Spark job
- the version of Spark
- anything interesting from the driver/executor logs?

Thanks
