Posted to dev@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2015/04/23 19:03:51 UTC

Task runs on only one machine

Using Spark Streaming to create a large volume of small nano-batch input files, ~4k per file, thousands of ‘part-xxxxx’ files. When reading the nano-batch files and doing a cooccurrence calculation, my tasks run only on the machine where the driver was launched. I’m launching in “yarn-client” mode. The RDD is created using sc.textFile(“list of thousands of files”).

The driver calls sc.textFile, then creates several intermediate RDDs and finally a DrmRdd[Int], which goes into the cooccurrence calculation. From the read onward, all tasks run only on the machine where the driver was launched.

What would cause the read to occur only on the machine that launched the driver? I’ve seen this with and without Yarn.

Do I need to do something to the RDD after reading? Has some partitioning factor been applied to all derived RDDs?
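
A rough sketch of the read, with placeholder paths and a hypothetical partition count (the real file list comes from the streaming output):

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical driver setup; in practice this runs in "yarn-client" mode.
    val sc = new SparkContext(new SparkConf().setAppName("cooccurrence-read"))

    // Placeholder for the comma-separated list of nano-batch files.
    val fileList = "hdfs:///nano-batches/part-00000,hdfs:///nano-batches/part-00001"

    // textFile accepts a minPartitions hint; one thing to try is asking for
    // at least one partition per available core.
    val lines = sc.textFile(fileList, minPartitions = sc.defaultParallelism)

    // Or repartition after the read to spread the data across the cluster
    // before the cooccurrence step.
    val spread = lines.repartition(sc.defaultParallelism * 2)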

Re: Task runs on only one machine

Posted by Pat Ferrel <pa...@occamsmachete.com>.
That’s a good point. We’re starting with small data and increasing, so I think of it as large, but right now it isn’t. I overstated the thousands; it’s only hundreds of files right now.

I bet that’s it.
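
A quick way to confirm it (assuming the RDD from the read is called lines, as in the sketch in my first message) would be to check the partition count before the cooccurrence step:

    // Check how many partitions the read actually produced.
    println(s"input partitions: ${lines.partitions.size}")
    println(s"default parallelism: ${sc.defaultParallelism}")

    // If it comes back as 1 (or very few), widen it before computing
    // cooccurrence; the floor of 8 here is arbitrary.
    val wider = lines.repartition(math.max(sc.defaultParallelism, 8))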

On Apr 23, 2015, at 10:26 AM, Andrew Musselman <an...@gmail.com> wrote:

Not sure about your specific situation but it reminds me of wondering why a
job only has one mapper assigned to it; is the total dataset big enough to
require partitioning?

On Thursday, April 23, 2015, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Using Spark streaming to create a large volume of small nano-batch input
> files, ~4k per file, thousands of ‘part-xxxxx’ files.  When reading the
> nano-batch files and doing a cooccurrence calculation my tasks run only on
> the machine where it was launched. I’m launching in “yarn-client” mode. The
> rdd is created using sc.textFile(“list of thousand files”)
> 
> The driver launches the sc.textFile then creates several intermediate rdds
> and finally a DrmRdd[Int]. This goes into cooccurrence. From the read
> onward, all tasks run only on the machine where the driver was launched.
> 
> What would cause the read to occur only on the machine that launched the
> driver? I’ve seen this with and without Yarn.
> 
> Do I need to do something to the RDD after reading? Has some partition
> factor been applied to all derived rdds?


Re: Task runs on only one machine

Posted by Andrew Musselman <an...@gmail.com>.
Not sure about your specific situation but it reminds me of wondering why a
job only has one mapper assigned to it; is the total dataset big enough to
require partitioning?

On Thursday, April 23, 2015, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Using Spark streaming to create a large volume of small nano-batch input
> files, ~4k per file, thousands of ‘part-xxxxx’ files.  When reading the
> nano-batch files and doing a cooccurrence calculation my tasks run only on
> the machine where it was launched. I’m launching in “yarn-client” mode. The
> rdd is created using sc.textFile(“list of thousand files”)
>
> The driver launches the sc.textFile then creates several intermediate rdds
> and finally a DrmRdd[Int]. This goes into cooccurrence. From the read
> onward, all tasks run only on the machine where the driver was launched.
>
> What would cause the read to occur only on the machine that launched the
> driver? I’ve seen this with and without Yarn.
>
> Do I need to do something to the RDD after reading? Has some partition
> factor been applied to all derived rdds?