Posted to dev@spark.apache.org by sim <si...@swoop.com> on 2015/09/15 01:36:05 UTC

RDD API patterns

I'd like to get some feedback on an API design issue pertaining to RDDs. 

The design goal of avoiding RDD nesting, which I agree with, leads the methods
operating on subsets of an RDD (not necessarily partitions) to use Iterable
as an abstraction. The mapPartitions and groupBy* family of methods are good
examples. The problem with that API choice is that developers very quickly
lose the benefits of the RDD API, independent of partitioning. 

Consider two very simple problems that demonstrate the issue. The input is
the same for all: an RDD of integers that has been grouped into odd and
even.

1. Sample the odds at 10% and the evens at 20%. Trivial, as stratified
sampling (sampleByKey) is built into PairRDDFunctions.

2. Sample at 10% if there are more than 1,000 elements in a group and at 20%
otherwise. Suddenly, the problem becomes much harder. The sub-groups are
no longer RDDs, and we can't use the RDD sampling API.
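
For concreteness, here is roughly how the two problems look with today's
API. This is an untested sketch; oddEven: RDD[(Boolean, Int)], keyed by
evenness, is an assumed input.

// Problem 1: one line, because sampleByKey is built into PairRDDFunctions.
val sampled1 = oddEven.sampleByKey(
  withReplacement = false,
  fractions = Map(true -> 0.2, false -> 0.1))

// Problem 2: the groups are plain Iterables, so the RDD sampling API is
// gone and we have to hand-roll Bernoulli sampling inside each group.
import scala.util.Random
val sampled2 = oddEven.groupByKey().flatMap { case (key, group) =>
  val fraction = if (group.size > 1000) 0.1 else 0.2
  val rng = new Random(key.hashCode)
  group.filter(_ => rng.nextDouble() < fraction).map(v => (key, v))
}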

Note that the only reason the first problem is easy is that it was built
into Spark. If that hadn't happened, implementing it with the higher-level API
abstractions wouldn't have been easy. As more and more people use Spark for
ever more diverse sets of problems, the likelihood that the RDD APIs provide
pre-existing high-level abstractions will diminish. 

How do you feel about this? Do you think it is desirable to lose all
high-level RDD API abstractions the very moment we group an RDD or call
mapPartitions? Does the goal of no nested RDDs mean there are absolutely no
high-level abstractions that we can expose via the Iterables borne of RDDs?

I'd love your thoughts.

/Sim
http://linkedin.com/in/simeons





Re: RDD API patterns

Posted by sim <si...@swoop.com>.
Thanks everyone for the comments! I waited for more replies to come before I
responded as I was interested in the community's opinion. 

The thread I'm noticing in this thread (pun intended) is that most responses
focus on the nested RDD issue. I think we all agree that it is problematic
for many reasons, including not just implementation complexity but also
user-facing complexity (programming model, processing patterns, debugging,
etc.).

What about a much simpler approach? Rather than producing RDDs with
Iterable[T], why not produce RDDs with SparkIterable[T]? Then, let's look at
the *RDD APIs and decide which methods would be useful to have there. The
simplest rule I can think of is: anything that does not involve context, job
or partitioning in any form, thereby implicitly protecting the
RDD/partitioning abstractions underneath. Instead of returning RDDs, these
functions on SparkIterable would produce SparkIterables. 

The benefits are significant: API consistency, programming model
simplicity/consistency, greater leverage of non-trivial community code such
as sampling, approximate counting, reuse of user code for reducers, etc. The
cost is only the implementation effort of the new methods. No change to the
process model. No nested RDDs.

Here is a quick list of the methods we can do this for. Not all need to be
available at once: this is directional. This list is alphabetical, /not/ in
priority order of value.

*From RDD*
aggregate
count
countApprox
countApproxDistinct
countByValue
countByValueApprox
pipe
randomSplit
sample
sortBy
takeOrdered
takeSample
treeAggregate
treeReduce
union
zipWithUniqueId

*From PairRDDFunctions*
aggregateByKey
combineByKey
countApproxDistinctByKey
countByKey
flatMapValues
foldByKey
groupByKey
keys
lookup
mapValues
reduceByKey
sampleByKey
sampleByKeyExact
values

Here is another way to look at this: I am not sure why these methods, whose
signatures have nothing to do with partitions or partitioning, were defined
directly on RDDs as opposed to in some abstract trait. How a method is
implemented is a separate concern from how APIs should be designed. Had
these methods been put in a trait early on in the life of Spark, it would
have been natural to expose them to Spark-backed Iterables. Since this was
not done, we look at them and tell ourselves "we can't do this because we
can't have RDD nesting", which is not the real issue, as the implementations
of these methods in a SparkIterable don't need access to any RDD APIs. In
fact, many implementations would be one-liners using the underlying Scala
Iterable API:

// Purely for illustration: a PairRDDFunctions-style wrapper, no nesting involved
implicit class PairSparkIterableFunctions[K, V](self: SparkIterable[(K, V)]) {
  def groupByKey() =
    self.groupBy(_._1).map { case (k, valuePairs) => (k, valuePairs.map(_._2)) }
}
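
With something like that in place, problem 2 from my first message would
read naturally. This is hypothetical usage, since SparkIterable does not
exist today:

// Hypothetical: numbers: RDD[Int] is an assumed input. groupBy yields
// SparkIterable groups, so count() and sample() remain available inside
// them, and flatMapValues ties it together.
val sampled = numbers
  .groupBy(_ % 2 == 0)
  .flatMapValues { group =>
    if (group.count() > 1000) group.sample(withReplacement = false, 0.1)
    else group.sample(withReplacement = false, 0.2)
  }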

The Spark community has decided that the RDD methods are important for large
scale data processing so let's make them available to all data Spark touches
while avoiding the nested RDD mess. 

What do you think about this approach?

P.S. In a previous life I built developer tools, APIs and standards used by
over a million enterprise developers. One of the lessons I learned was that
simple, high-level APIs based on consistent patterns substantially
accelerate the growth of communities. Conversely, lack of either high-level
abstractions or consistency introduces friction. Because of the iterative
nature of development, even small amounts of friction meaningfully slow down
adoption. Further, simplicity of high-level APIs and consistency always beat
capability & performance in terms of how the mass of developers make
technology choices. I have found no exceptions to this, which is why I
wanted to bring the issue with the RDD API up here.






Re: RDD API patterns

Posted by Reynold Xin <rx...@databricks.com>.
I'm not sure what we can do here. Nested RDDs are a pain to implement,
support, and explain. The programming model is not well explored.

Maybe a UDAF interface that allows going through the data twice?



Re: RDD API patterns

Posted by sim <si...@swoop.com>.
Aniket, yes, I've done the separate file trick. :) Still, I think we can
solve this problem without nested RDDs. 





Re: RDD API patterns

Posted by sim <si...@swoop.com>.
Juan, thanks for sharing this. I am facing what looks like a similar issue
having to do with variable grouped upsampling (sampling some groups at
different rates, sometimes > 100%). I will study the approach you took.

As for the topic of this thread, I think it is important to separate two
issues:

- Logical RDD-style operations on Iterables
- Physical RDD-style operations on partitioned data

Issues related to nested RDDs, jobs and the scheduler only apply to the
latter unless we want to heavily optimize the performance of the former. I
wouldn't do that until we see enough usage of the former to know what's
worth optimizing.

Thanks,
Sim





Re: RDD API patterns

Posted by Juan Rodríguez Hortalá <ju...@gmail.com>.
Hi,

That reminds me of a previous discussion about splitting an RDD into
several RDDs
http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-split-into-multiple-RDDs-td11877.html.
There you can see some simple code to convert RDD[(K, V)] into Map[K, RDD[V]]
through several filters. On top of that maybe you could build an
abstraction that simulates nested RDDs, as a proof of concept, forgetting
for now about performance. But the main problem I've found is that the
Spark scheduler gets stuck when you have a huge number of very small RDDs,
or at least that is what happened several versions ago
http://mail-archives.us.apache.org/mod_mbox/spark-user/201502.mbox/%3CCAMAsSdJ+bzV++Cr44eDV-CPChr-1X-A+y2vmtUgwc0uX91fvKg@mail.gmail.com%3E
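
A rough sketch of that kind of split (not the code from that thread, and it
assumes the set of distinct keys is small enough to collect to the driver):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// One filter per key; every resulting RDD re-reads the parent lineage,
// which is why this scales badly when there are many keys.
def splitByKey[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]): Map[K, RDD[V]] = {
  val keys = rdd.keys.distinct().collect()
  keys.map(k => k -> rdd.filter(_._1 == k).values).toMap
}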

Just my two cents






Re: RDD API patterns

Posted by Aniket <an...@gmail.com>.
I agree that this is an issue, but I am afraid supporting RDD nesting would be
hard and perhaps would require rearchitecting Spark. For now, you may have to
use workarounds like storing each group in a separate file, processing each
file as a separate RDD and finally merging the results into a single RDD.

I know it's painful and I share the pain :)

Thanks,
Aniket






Re: RDD API patterns

Posted by sim <si...@swoop.com>.
Robin, my point exactly. When an API is valuable, let's expose it in a way
that it can be used easily for all data Spark touches. It should not require
much development work to make the sampling logic work for an Iterable as
opposed to an RDD.





Re: RDD API patterns

Posted by sim <si...@swoop.com>.
Juan, I wouldn't go as far as suggesting we switch from programming using
RDDs to using SparkIterable. For example, all methods involving context,
jobs or partitions should only be part of the RDD API and not part of
SparkIterable. That said, the Spark community would benefit from a
consistent set of APIs for both RDDs and Iterables inside RDDs.

You raise an important point about performance analysis & guarantees.
Reasoning about performance should not be any more complicated than
reasoning about the performance of the code that works with Iterables
generated by mapPartitions or groupByKey today. 

However, it is important to not confuse users about what object they are
working with: an RDD, which supports the SparkIterable API, vs. an iterable
part of an RDD, which also supports the SparkIterable API (e.g., one that
mapPartitions generates). Therefore, RDD transformation APIs should continue
to return RDDs, as they do today.

Thank you for your implementation pointers. The Scala type system is
certainly flexible enough to support SparkIterable. If we get more consensus
that this is a good direction, I'd love to do a Skype session with you to
evaluate implementation options.

Best,
Sim





Re: RDD API patterns

Posted by "Evan R. Sparks" <ev...@gmail.com>.
Mike,

I believe the reason you're seeing near-identical performance on the
gradient computations is twofold:
1) Gradient computations for GLM models are computationally pretty cheap
from a FLOPs/byte read perspective. They are essentially a BLAS "gemv" call
in the dense case, which is well known to be bound by memory bandwidth on
modern processors. So, you're basically paying the cost of a scan of the
points you've sampled to do the gradient computation.
2) The default sampling mechanism used by the GradientDescent optimizer in
MLlib is implemented via RDD.sample, which does reservoir sampling on each
partition. This requires a full scan of each partition at every iteration
to collect the samples.

So - you're going to pay the cost of a scan to do the sampling anyway, and
the gradient computation is essentially free at this point (and can be
pipelined, etc.).

It is quite possible to improve #2 by coming up with a better sampling
algorithm. One easy algorithm would be to assume the data is already
randomly shuffled (or do that once) and then use the first
miniBatchFraction*partitionSize records on the first iteration, the second
set on the second iteration, and so on. You could prototype this algorithm
pretty easily by converting your data to an RDD[Array[DenseVector]] and
doing some bookkeeping at each iteration.
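
A hedged sketch of that idea (names and structure assumed; this is not
MLlib code): shuffle once up front, then take a deterministic slice of each
partition per iteration.

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Rough sketch of the "already shuffled, take the i-th slice" approach.
// Assumes `data` was randomly shuffled once; `iteration` is 0-based.
def sliceMiniBatch[T: ClassTag](data: RDD[T], miniBatchFraction: Double,
                                iteration: Int): RDD[T] =
  data.mapPartitions { iter =>
    // Materializing the partition here is the per-iteration cost that
    // caching an RDD[Array[...]] up front would avoid.
    val part = iter.toArray
    if (part.isEmpty) Iterator.empty
    else {
      val batchSize = math.max(1, (part.length * miniBatchFraction).toInt)
      val start = (iteration * batchSize) % part.length
      // Wrap around when a slice runs off the end of the partition.
      (0 until batchSize).iterator.map(i => part((start + i) % part.length))
    }
  }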

That said - eventually the overheads of the platform catch up to you. As a
rule of thumb I estimate about 50ms/iteration as a floor for things like
task serialization and other platform overheads. You've got to balance how
much computation you want to do vs. the amount of time you want to spend
waiting for the platform.

- Evan


Re: RDD API patterns

Posted by Mike Hynes <91...@gmail.com>.
Hello Devs,

This email concerns some timing results for a treeAggregate in
computing a (stochastic) gradient over an RDD of labelled points, as
is currently done in the MLlib optimization routine for SGD.

In SGD, the underlying RDD is downsampled by a fraction f \in (0,1],
and the subgradients over all the instances in the downsampled RDD are
aggregated to the driver as a dense vector. However, we have noticed
some unusual behaviour when f < 1: it takes the same amount of time to
compute the stochastic gradient for a stochastic minibatch as it does
for a full batch (f = 1).
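
For readers who haven't looked at that code path, the pattern being timed is
roughly the following. This is a simplified sketch with a hypothetical
addGradientInPlace helper, not the actual MLlib code:

// data: RDD[LabeledPoint] is assumed; f is the minibatch fraction.
val f = 1e-3
val seed = 42L
val dim = 30000000  // problem dimension from the experiment
val gradient: Array[Double] = data
  .sample(withReplacement = false, f, seed)
  .treeAggregate(new Array[Double](dim))(  // dense zero vector sent back to the driver
    seqOp  = (acc, point) => { addGradientInPlace(acc, point); acc },  // hypothetical helper
    combOp = (a, b) => { var i = 0; while (i < dim) { a(i) += b(i); i += 1 }; a },
    depth  = 4)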

Attached are two plots of the mean task timing metrics for each level
in the aggregation, which has been performed with 4 levels (level 4 is
the final level, in which the results are communicated to the driver).
16 nodes are used, and the RDD has 256 partitions. We run in (client)
standalone mode. Here, the total time for the tasks is shown (\tau)
alongside the execution time (not counting GC),
serialization/deserialization time, the GC time, and the difference
between tau and all other times, assumed to be variable
IO/communication/waiting time. The RDD in this case is a labelled
point representation of the KDD Bridge to Algebra dataset, with 20M
(sparse) instances and a problem dimension of 30M. The sparsity of the
instances is very high---each individual instance vector may have only
a hundred nonzeros. All metrics have been taken from the JSON Spark
event logs.

The plot gradient_f1.pdf shows the times for a gradient computation
with f = 1, and gradient_f-3.pdf shows the same metrics with f = 1e-3.
For other f values in {1e-1 1e-2 ... 1e-5}, the same effect is
observed.

What I would like to mention about these plots, and ask if anyone has
experience with, is the following:
1. The times are essentially identical; I would have thought that
downsampling the RDD before aggregating the subgradients would at
least reduce the execution time required, if not the
communication/serialization times.
2. The serialization time in level 4 is almost entirely from the
result serialization to the driver, and not the task deserialization.
In each level of the treeAggregation, however, the local (dense)
gradients have to be communicated between compute nodes, so I am
surprised that it takes so much longer to return the vectors to the
driver.

I initially wondered if the large IO overhead in the last stage had
anything to do with client mode vs cluster mode, since, from what I
understand, only a single core is allocated to the driver thread in
client mode. However, when running tests in the two modes, I have
previously seen no appreciable difference in the running time for
other (admittedly smaller) problems. Furthermore, I am still very
confused about why the execution time for each task is just as large
for the downsampled RDD. It seems unlikely that sampling each
partition would be as expensive as the gradient computations, even for
sparse feature vectors.

If anyone has experience working with the sampling in minibatch SGD or
has tested the scalability of the treeAggregation operation for
vectors, I'd really appreciate your thoughts.

Thanks,
Mike

Re: RDD API patterns

Posted by Juan Rodríguez Hortalá <ju...@gmail.com>.
Hi Sim,

I understand that what you propose is defining a trait SparkIterable (and
also PairSparkIterable for RDDs of pairs) that encapsulates the methods in
RDDs, and then programming using that trait instead of RDD. That is similar to
programming using scala.collection.GenSeq to abstract from using a
sequential or parallel Seq. This new trait SparkIterable would be needed to
cover methods in RDDs that are not present in GenSeq and other standard
traits. I understand you suggest implementing it using wrapper classes and
implicit conversions, like in PairRDDFunctions, in order to see both RDD,
Iterable and other classes as SparkIterable. That reminds me of type
classes
http://danielwestheide.com/blog/2013/02/06/the-neophytes-guide-to-scala-part-12-type-classes.html,
which could be a similar approach. I think it would be interesting to know
if some standard type classes like for example those in
https://non.github.io/cats//typeclasses.html could be of use here.
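
As a very rough illustration of the type-class flavour (all names here are
hypothetical), the capability could live outside the data types, with one
instance for RDD and one for plain local collections:

import scala.util.Random
import org.apache.spark.rdd.RDD

// Hypothetical sketch: a type class capturing a slice of the RDD-style API.
trait SparkIterable[F[_]] {
  def count[A](fa: F[A]): Long
  def sample[A](fa: F[A], fraction: Double, seed: Long): F[A]
}

object SparkIterable {
  implicit val rddInstance: SparkIterable[RDD] = new SparkIterable[RDD] {
    def count[A](fa: RDD[A]): Long = fa.count()
    def sample[A](fa: RDD[A], fraction: Double, seed: Long): RDD[A] =
      fa.sample(withReplacement = false, fraction, seed)
  }
  implicit val localInstance: SparkIterable[Iterable] = new SparkIterable[Iterable] {
    def count[A](fa: Iterable[A]): Long = fa.size.toLong
    def sample[A](fa: Iterable[A], fraction: Double, seed: Long): Iterable[A] = {
      val rng = new Random(seed)
      fa.filter(_ => rng.nextDouble() < fraction)
    }
  }
}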

A downside I find in this approach is that it would be more difficult to
reason about the performance of programs, and to write them to obtain the
best performance, if we don't know whether a SparkIterable is a distributed
RDD or a node-local collection, which for example might even be indexed. Or
we might avoid accessing a SparkIterable from a closure in a map because we
don't know if we are in the driver or in a worker. That could hinder the
development of efficient programs, but this is not very surprising, because
the trade-off between abstraction level and performance is always there in
programming anyway.

Anyway I find your idea very interesting, I think it could be developed
into a nice library

Greetings,

Juan





Re: RDD API patterns

Posted by sim <si...@swoop.com>.
@debasish83, yes, there are many ways to optimize and work around the
limitation of no nested RDDs. The point of this thread is to discuss the API
patterns of Spark in order to make the platform more accessible to lots of
developers solving interesting problems quickly. We can get API consistency
without resorting to simulations of nested RDDs.





Re: RDD API patterns

Posted by Debasish Das <de...@gmail.com>.
RDD nesting can lead to recursive nesting... I would like to know the use
case and why join can't support it... You can always expose an API over an
RDD and access that in another RDD's mapPartitions... use an external data
source like HBase, Cassandra or Redis to support the API...

For your case, group by and then apply the logic... collect each group's
sample in a Seq and then look it up if you are doing one at a time... if
doing all of them, try joining them... the pattern is common if every key is
i.i.d. and you are cross-validating a model for each key on an 80% train /
20% test split...

We are looking to fit it into the pipeline flow... with minor mods it will fit.

Re: RDD API patterns

Posted by robineast <ro...@xense.co.uk>.
I'm not sure the problem is quite as bad as you state. Both sampleByKey and
sampleByKeyExact are implemented using a function from
StratifiedSamplingUtils which does one of two things depending on whether
the exact implementation is needed. The exact version requires double the
number of lines of code (17) compared to the non-exact version and has to do
extra passes over the data to get, for example, the counts per key.

As far as I can see your problem 2 and sampleByKeyExact are very similar and
could be solved the same way. It has been decided that sampleByKeyExact is a
widely useful function and so is provided out of the box as part of the
PairRDD API. I don't see any reason why your problem 2 couldn't be provided
in the same way as part of the API if there was the demand for it. 

An alternative design would perhaps be something like an extension to
PairRDD, let's call it TwoPassPairRDD, where certain information for the key
could be provided along with an Iterable, e.g. the counts for the key. Both
sampleByKeyExact and your problem 2 could be implemented in a few fewer lines
of code, as sketched below.
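
As a concrete sketch using today's PairRDDFunctions (rather than the
hypothetical TwoPassPairRDD), a two-pass version of problem 2 might look
like this, assuming grouped: RDD[(Boolean, Int)] keyed by evenness:

// First pass counts per key, second pass samples each key at its own rate.
val fractions: Map[Boolean, Double] =
  grouped.countByKey().mapValues(c => if (c > 1000) 0.1 else 0.2).toMap
val sampled = grouped.sampleByKey(withReplacement = false, fractions)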


