Posted to user@spark.apache.org by asylvest <ad...@gmail.com> on 2014/07/07 02:29:10 UTC

Controlling amount of data sent to slaves

I'm in the process of evaluating Spark to see if it's a fit for my
CPU-intensive application.  Many operations in my chain are highly
parallelizable, but some require a minimum number of rows of an input image
in order to operate.  Is there a way to give Spark a minimum and/or maximum
chunk size to send to a worker (in bytes, rows, or whatever units I need to
convert to)?  In addition, some steps in the chain may need to know
what range of the input rows they contain (for example, in order to apply
the correct transformation to the input data, they may need to look at and
use some other metadata which varies per row).  Is there a way for Spark to
provide some sort of index as to where this data lies in the larger global
space?
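
For concreteness, here's a rough sketch of the kind of control I'm hoping
exists.  The RDD calls mapPartitionsWithIndex and zipWithIndex are my best
guesses at the relevant API from skimming the docs, and parallelize() is just
standing in for however I'd actually load my image rows, so please treat this
as an illustration of the question rather than something I know is right:

    import org.apache.spark.{SparkConf, SparkContext}

    object RowRangeSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("row-range-sketch").setMaster("local[4]"))

        // Stand-in for my image: each element is one row of pixels.
        // parallelize() with numSlices lets me pick how many partitions
        // (and therefore roughly how many rows) each task gets.
        val rows = sc.parallelize(0 until 1024, numSlices = 8)  // ~128 rows per task

        // zipWithIndex tags each row with its global row number.
        val indexed = rows.zipWithIndex()   // (rowData, globalRowIndex)

        // mapPartitionsWithIndex hands each task a whole block of rows plus
        // the partition id, so the task can see exactly which range of the
        // full image it holds and look up the per-row metadata it needs.
        val summary = indexed.mapPartitionsWithIndex { (partId, it) =>
          val block = it.toArray
          if (block.isEmpty) Iterator.empty
          else {
            val first = block.head._2
            val last  = block.last._2
            // ... apply the transform that needs rows [first, last] here ...
            Iterator((partId, first, last, block.length))
          }
        }

        summary.collect().foreach(println)
        sc.stop()
      }
    }

If there's a better-supported way to bound the size of each block in bytes
rather than picking a partition count by hand, that's really what I'm after.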

The examples I found are based on manipulating text files in ways that are
simplistic enough that they don't need to know these things - it's great
that you can do something this flexible in just a few lines of code, but I'm
hoping there's a lower-level API that allows finer-grained control.  An example
showing some of this would be
great, but a pointer to the appropriate function/class to take a look at
works too.


