Posted to user@spark.apache.org by Koert Kuipers <ko...@tresata.com> on 2014/03/13 21:39:07 UTC

combining operations elegantly

Not that long ago there was a nice example on here about how to combine
multiple operations on a single RDD: basically, if you want to do a
count() and something else, how to roll them into a single job. I think
Patrick Wendell gave the examples.

I can't find them anymore... Patrick, can you please repost? Thanks!

Re: combining operations elegantly

Posted by Richard Siebeling <rs...@gmail.com>.
Hi guys,

Thanks for the information, I'll give it a try with Algebird.
Thanks again,
Richard

@Patrick, thanks for the release calendar


On Mon, Mar 24, 2014 at 12:16 AM, Patrick Wendell <pw...@gmail.com> wrote:

> Hey All,
>
> I think the old thread is here:
> https://groups.google.com/forum/#!msg/spark-users/gVtOp1xaPdU/Uyy9cQz9H_8J
>
> The method proposed in that thread is to create a utility class for
> doing single-pass aggregations. Using Algebird is a pretty good way to
> do this and is a bit more flexible since you don't need to create a
> new utility each time you want to do this.
>
> In Spark 1.0 and later you will be able to do this more elegantly with
> the schema support:
> myRDD.groupBy('user).select(Sum('clicks) as 'clicks,
> Average('duration) as 'duration)
>
> and it will use a single pass automatically... but that's not quite
> released yet :)
>
> - Patrick
>
>
>
>
> On Sun, Mar 23, 2014 at 1:31 PM, Koert Kuipers <ko...@tresata.com> wrote:
> > i currently typically do something like this:
> >
> > scala> val rdd = sc.parallelize(1 to 10)
> > scala> import com.twitter.algebird.Operators._
> > scala> import com.twitter.algebird.{Max, Min}
> > scala> rdd.map{ x => (
> >      |   1L,
> >      |   Min(x),
> >      |   Max(x),
> >      |   x
> >      | )}.reduce(_ + _)
> > res0: (Long, com.twitter.algebird.Min[Int], com.twitter.algebird.Max[Int],
> > Int) = (10,Min(1),Max(10),55)
> >
> > however for this you need twitter algebird dependency. without that you have
> > to code the reduce function on the tuples yourself...
> >
> > another example with 2 columns, where i do conditional count for first
> > column, and simple sum for second:
> > scala> sc.parallelize((1 to 10).zip(11 to 20)).map{ case (x, y) => (
> >      |   if (x > 5) 1 else 0,
> >      |   y
> >      | )}.reduce(_ + _)
> > res3: (Int, Int) = (5,155)
> >
> >
> >
> > On Sun, Mar 23, 2014 at 2:26 PM, Richard Siebeling <rsiebeling@gmail.com>
> > wrote:
> >>
> >> Hi Koert, Patrick,
> >>
> >> do you already have an elegant solution to combine multiple operations on
> >> a single RDD?
> >> Say for example that I want to do a sum over one column, a count and an
> >> average over another column,
> >>
> >> thanks in advance,
> >> Richard
> >>
> >>
> >> On Mon, Mar 17, 2014 at 8:20 AM, Richard Siebeling <rsiebeling@gmail.com>
> >> wrote:
> >>>
> >>> Patrick, Koert,
> >>>
> >>> I'm also very interested in these examples, could you please post them if
> >>> you find them?
> >>> thanks in advance,
> >>> Richard
> >>>
> >>>
> >>> On Thu, Mar 13, 2014 at 9:39 PM, Koert Kuipers <ko...@tresata.com> wrote:
> >>>>
> >>>> not that long ago there was a nice example on here about how to combine
> >>>> multiple operations on a single RDD. so basically if you want to do a
> >>>> count() and something else, how to roll them into a single job. i think
> >>>> patrick wendell gave the examples.
> >>>>
> >>>> i cant find them anymore.... patrick can you please repost? thanks!
> >>>
> >>>
> >>
> >
>

Re: combining operations elegantly

Posted by Patrick Wendell <pw...@gmail.com>.
Hey All,

I think the old thread is here:
https://groups.google.com/forum/#!msg/spark-users/gVtOp1xaPdU/Uyy9cQz9H_8J

The method proposed in that thread is to create a utility class for
doing single-pass aggregations. Using Algebird is a pretty good way to
do this and is a bit more flexible since you don't need to create a
new utility each time you want to do this.

In Spark 1.0 and later you will be able to do this more elegantly with
the schema support:
myRDD.groupBy('user).select(Sum('clicks) as 'clicks,
Average('duration) as 'duration)

and it will use a single pass automatically... but that's not quite
released yet :)

- Patrick
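
(A minimal sketch of the single-pass "utility" idea described above, not the
code from the old thread: a small helper case class folded over the RDD with
aggregate(), so count, sum, min and max come out of one job without any extra
dependency. The Stats class and its field names are made up for illustration;
sc is the spark-shell SparkContext.)

case class Stats(count: Long, sum: Long, min: Int, max: Int) {
  // fold one more element into the running statistics
  def add(x: Int): Stats =
    Stats(count + 1, sum + x, math.min(min, x), math.max(max, x))
  // merge two partial results (typically one per partition)
  def merge(that: Stats): Stats =
    Stats(count + that.count, sum + that.sum,
      math.min(min, that.min), math.max(max, that.max))
}

val zero  = Stats(0L, 0L, Int.MaxValue, Int.MinValue)
val stats = sc.parallelize(1 to 10).aggregate(zero)(
  (s, x) => s.add(x),    // seqOp: fold elements within a partition
  (a, b) => a.merge(b)   // combOp: merge per-partition results
)
// stats == Stats(10, 55, 1, 10): count, sum, min and max in a single job

The same shape generalizes: add a field per statistic and the number of jobs
stays at one.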




On Sun, Mar 23, 2014 at 1:31 PM, Koert Kuipers <ko...@tresata.com> wrote:
> i currently typically do something like this:
>
> scala> val rdd = sc.parallelize(1 to 10)
> scala> import com.twitter.algebird.Operators._
> scala> import com.twitter.algebird.{Max, Min}
> scala> rdd.map{ x => (
>      |   1L,
>      |   Min(x),
>      |   Max(x),
>      |   x
>      | )}.reduce(_ + _)
> res0: (Long, com.twitter.algebird.Min[Int], com.twitter.algebird.Max[Int],
> Int) = (10,Min(1),Max(10),55)
>
> however for this you need twitter algebird dependency. without that you have
> to code the reduce function on the tuples yourself...
>
> another example with 2 columns, where i do conditional count for first
> column, and simple sum for second:
> scala> sc.parallelize((1 to 10).zip(11 to 20)).map{ case (x, y) => (
>      |   if (x > 5) 1 else 0,
>      |   y
>      | )}.reduce(_ + _)
> res3: (Int, Int) = (5,155)
>
>
>
> On Sun, Mar 23, 2014 at 2:26 PM, Richard Siebeling <rs...@gmail.com>
> wrote:
>>
>> Hi Koert, Patrick,
>>
>> do you already have an elegant solution to combine multiple operations on
>> a single RDD?
>> Say for example that I want to do a sum over one column, a count and an
>> average over another column,
>>
>> thanks in advance,
>> Richard
>>
>>
>> On Mon, Mar 17, 2014 at 8:20 AM, Richard Siebeling <rs...@gmail.com>
>> wrote:
>>>
>>> Patrick, Koert,
>>>
>>> I'm also very interested in these examples, could you please post them if
>>> you find them?
>>> thanks in advance,
>>> Richard
>>>
>>>
>>> On Thu, Mar 13, 2014 at 9:39 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>
>>>> not that long ago there was a nice example on here about how to combine
>>>> multiple operations on a single RDD. so basically if you want to do a
>>>> count() and something else, how to roll them into a single job. i think
>>>> patrick wendell gave the examples.
>>>>
>>>> i cant find them anymore.... patrick can you please repost? thanks!
>>>
>>>
>>
>

Re: combining operations elegantly

Posted by Koert Kuipers <ko...@tresata.com>.
I currently typically do something like this:

scala> val rdd = sc.parallelize(1 to 10)
scala> import com.twitter.algebird.Operators._
scala> import com.twitter.algebird.{Max, Min}
scala> rdd.map{ x => (
     |   1L,
     |   Min(x),
     |   Max(x),
     |   x
     | )}.reduce(_ + _)
res0: (Long, com.twitter.algebird.Min[Int], com.twitter.algebird.Max[Int],
Int) = (10,Min(1),Max(10),55)

However, for this you need the Twitter Algebird dependency. Without that you
have to code the reduce function on the tuples yourself...

Another example with two columns, where I do a conditional count for the first
column and a simple sum for the second:
scala> sc.parallelize((1 to 10).zip(11 to 20)).map{ case (x, y) => (
     |   if (x > 5) 1 else 0,
     |   y
     | )}.reduce(_ + _)
res3: (Int, Int) = (5,155)
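
(For completeness, a sketch of the "code the reduce function on the tuples
yourself" route for the first example, i.e. the same single-pass
count/min/max/sum without the Algebird dependency; the merge logic is simply
written out by hand.)

val rdd = sc.parallelize(1 to 10)
val (count, min, max, sum) = rdd
  .map(x => (1L, x, x, x))    // per element: (count, min, max, sum)
  .reduce { case ((c1, mn1, mx1, s1), (c2, mn2, mx2, s2)) =>
    (c1 + c2, math.min(mn1, mn2), math.max(mx1, mx2), s1 + s2)
  }
// (count, min, max, sum) == (10, 1, 10, 55)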



On Sun, Mar 23, 2014 at 2:26 PM, Richard Siebeling <rs...@gmail.com> wrote:

> Hi Koert, Patrick,
>
> do you already have an elegant solution to combine multiple operations on
> a single RDD?
> Say for example that I want to do a sum over one column, a count and an
> average over another column,
>
> thanks in advance,
> Richard
>
>
> On Mon, Mar 17, 2014 at 8:20 AM, Richard Siebeling <rs...@gmail.com> wrote:
>
>> Patrick, Koert,
>>
>> I'm also very interested in these examples, could you please post them if
>> you find them?
>> thanks in advance,
>> Richard
>>
>>
>> On Thu, Mar 13, 2014 at 9:39 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> not that long ago there was a nice example on here about how to combine
>>> multiple operations on a single RDD. so basically if you want to do a
>>> count() and something else, how to roll them into a single job. i think
>>> patrick wendell gave the examples.
>>>
>>> i cant find them anymore.... patrick can you please repost? thanks!
>>>
>>
>>
>

Re: combining operations elegantly

Posted by Richard Siebeling <rs...@gmail.com>.
Hi Koert, Patrick,

Do you already have an elegant solution to combine multiple operations on a
single RDD?
Say, for example, that I want to do a sum over one column, and a count and an
average over another column.

Thanks in advance,
Richard
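
(A sketch of one way to do exactly that in a single pass, using the same
hand-rolled tuple reduce shown in Koert's reply above; the two-column RDD here
is made-up sample data, and the average falls out of the (sum, count) pair at
the end.)

// made-up sample data standing in for an RDD of (colA, colB) rows
val rows = sc.parallelize(Seq((1, 10.0), (2, 20.0), (3, 30.0)))
val (sumA, countB, sumB) = rows
  .map { case (a, b) => (a.toLong, 1L, b) }    // per row: (sum of A, count, sum of B)
  .reduce { case ((s1, c1, t1), (s2, c2, t2)) => (s1 + s2, c1 + c2, t1 + t2) }
val avgB = sumB / countB
// sumA == 6, countB == 3, avgB == 20.0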


On Mon, Mar 17, 2014 at 8:20 AM, Richard Siebeling <rs...@gmail.com> wrote:

> Patrick, Koert,
>
> I'm also very interested in these examples, could you please post them if
> you find them?
> thanks in advance,
> Richard
>
>
> On Thu, Mar 13, 2014 at 9:39 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> not that long ago there was a nice example on here about how to combine
>> multiple operations on a single RDD. so basically if you want to do a
>> count() and something else, how to roll them into a single job. i think
>> patrick wendell gave the examples.
>>
>> i cant find them anymore.... patrick can you please repost? thanks!
>>
>
>

Re: combining operations elegantly

Posted by Richard Siebeling <rs...@gmail.com>.
Patrick, Koert,

I'm also very interested in these examples; could you please post them if
you find them?
Thanks in advance,
Richard


On Thu, Mar 13, 2014 at 9:39 PM, Koert Kuipers <ko...@tresata.com> wrote:

> not that long ago there was a nice example on here about how to combine
> multiple operations on a single RDD. so basically if you want to do a
> count() and something else, how to roll them into a single job. i think
> patrick wendell gave the examples.
>
> i cant find them anymore.... patrick can you please repost? thanks!
>