Posted to user@spark.apache.org by Raajen <ra...@gmail.com> on 2016/07/01 21:46:22 UTC

Spark driver assigning splits to incorrect workers

I would like to use Spark on a non-distributed file system but am having
trouble getting the driver to assign tasks to the workers that are local to
the files. I have extended InputSplit to create my own version of FileSplit,
so that each worker gets a bit more information than the default FileSplit
provides. I thought that the driver would assign splits based on their
locality, but I have found that the driver sends these splits to workers
seemingly at random -- even the very first split will go to a node with a
different IP than the split specifies. I can see that I am providing the
right node address via getLocations(). I also set spark.locality.wait to a
high value, but the same misassignment keeps happening.
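
For reference, the locality wait was raised roughly like this (a minimal
sketch; the app name and the exact values are only illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("bin-reader")               // placeholder name
  .set("spark.locality.wait", "30s")      // default is 3s; wait much longer
  .set("spark.locality.wait.node", "30s") // before giving up on a node-local slot
val sc = new SparkContext(conf)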

I am using newAPIHadoopFile to create my RDD. InputFormat is creating the
required splits, but not all splits refer to the same file or the same
worker IP. 

What else can I check, or change, to force the driver to send these tasks to
the right workers?

Thanks!



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-driver-assigning-splits-to-incorrect-workers-tp27261.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Spark driver assigning splits to incorrect workers

Posted by Raajen Patel <ra...@gmail.com>.
Hi Ted,

Thanks for your response. Perhaps this will help: I am trying to
access/read binary files stored over a series of servers.

Line used to build RDD:
val BIN_pairRDD: RDD[(BIN_Key, BIN_Value)] =
  spark.newAPIHadoopFile("not.used", classOf[BIN_InputFormat],
    classOf[BIN_Key], classOf[BIN_Value], config)

In order to support this, we have the following custom classes:
- BIN_Key and BIN_Value as the paired entry for the RDD
- BIN_RecordReader and BIN_FileSplit to handle the special splits
- BIN_FileSplit overrides getLocations() and getLocationInfo(), and we have
verified that the right IP address is being sent to Spark (a rough sketch of
the split is shown below).
- BIN_InputFormat queries a database for details about every split to be
created, namely which file to read and the IP address where that file is
local.
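
The split class has roughly this shape (a trimmed-down sketch; the real
BIN_FileSplit carries extra fields read from the database, and the hosts
array is just whatever IP the database reports for the file):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapred.SplitLocationInfo
import org.apache.hadoop.mapreduce.lib.input.FileSplit

class BIN_FileSplit(file: Path, start: Long, length: Long, hosts: Array[String])
    extends FileSplit(file, start, length, hosts) {

  // No-arg constructor so Hadoop/Spark can re-instantiate the split when it
  // is deserialized on an executor.
  def this() = this(null, 0L, 0L, Array.empty[String])

  // Hosts where this split's file is local; the driver reads these when it
  // schedules the corresponding task.
  override def getLocations(): Array[String] = hosts

  // Richer locality info: same hosts, data on disk rather than in memory.
  override def getLocationInfo(): Array[SplitLocationInfo] =
    hosts.map(h => new SplitLocationInfo(h, false))
}

BIN_InputFormat.getSplits just builds a list of these, one per (file, IP)
row returned by the database.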

When it works:
- No problems running a local job
- No problems running in a cluster when there is 1 computer as Master and
another computer with 3 workers along with the files to process.

When it fails:
- When running in a cluster with multiple workers and files spread across
multiple computers, tasks are not assigned to the nodes where the files are
local.

Thanks,
Raajen

Re: Spark driver assigning splits to incorrect workers

Posted by Ted Yu <yu...@gmail.com>.
I guess you extended some InputFormat to provide locality information.

Can you share a code snippet?

Which non-distributed file system are you using ?

Thanks
