Posted to user@spark.apache.org by bdamos <am...@adobe.com> on 2014/07/11 23:19:13 UTC

How to separate a subset of an RDD by day?

Hi, I have an RDD that represents data over a time interval and I want
to select some subinterval of my data and partition it by day
based on a unix time field in the data.
What is the best way to do this with Spark?

I have currently implemented two solutions, both of which seem suboptimal.
Solution 1 is to filter the subinterval from the overall data set,
and then to filter each day out of this filtered data set.
However, this causes the same data in the subset to be filtered many times.
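
Roughly, Solution 1 looks like this (a sketch; the record type and its
unix-seconds time field are placeholder names):

// Filter the subinterval once, then filter each day out of it again.
val sub = rdd.filter(r => r.time >= start && r.time < end)
val byDay = (start / 86400 to (end - 1) / 86400).map { day =>
  (day, sub.filter(r => r.time / 86400 == day)) // re-scans sub once per day
}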

Solution 2 is to map the objects into a pair RDD where the
key is the number of the day in the interval, then group by
key, collect, and parallelize the resulting grouped data.
However, I worry collecting large data sets is going to be
a serious performance bottleneck.
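
And Solution 2, under the same assumptions (groupByKey needs the
pair-RDD implicits from SparkContext._ on 1.0):

import org.apache.spark.SparkContext._

// Key by day, group, collect to the driver, re-parallelize each group.
val grouped = rdd.map(r => (r.time / 86400, r)).groupByKey().collect()
val byDay = grouped.map { case (day, recs) =>
  (day, sc.parallelize(recs.toSeq)) // every group round-trips through the driver
}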

A small query using Solution 1 takes 13 seconds to run, and the same
query using Solution 2 takes 10 seconds to run,
but I think this can be further improved.
Does anybody have any suggestions on the best way to separate
a subset of data by day?

Thanks,
Brandon.




Re: How to separate a subset of an RDD by day?

Posted by Soumya Simanta <so...@gmail.com>.
If you are on the 1.0.0 release you can also try converting your RDD to a
SchemaRDD and running a groupBy there. The Spark SQL optimizer "may" yield
better results. It's worth a try at least.
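
A minimal sketch of that against the 1.0 API (the Event case class and
its fields are invented for illustration):

import org.apache.spark.sql.SQLContext

case class Event(day: Long, time: Long, payload: String)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD // implicit RDD[Event] => SchemaRDD

rdd.map(r => Event(r.time / 86400, r.time, r.payload)).registerAsTable("events")
val perDay = sqlContext.sql("SELECT day, COUNT(1) FROM events GROUP BY day")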


On Fri, Jul 11, 2014 at 5:24 PM, Soumya Simanta <so...@gmail.com>
wrote:

>>
>> Solution 2 is to map the objects into a pair RDD where the
>> key is the number of the day in the interval, then group by
>> key, collect, and parallelize the resulting grouped data.
>> However, I worry collecting large data sets is going to be
>> a serious performance bottleneck.
>>
>>
> Why do you have to do a "collect"? You can do a groupBy and then write
> the grouped data to disk again.
>

Re: How to separate a subset of an RDD by day?

Posted by Soumya Simanta <so...@gmail.com>.
> I think my best option is to partition my data in directories by day
> before running my Spark application, and then direct
> my Spark application to load RDDs from each directory when
> I want to load a date range. How does this sound?

If your upstream system can write data by day then it makes perfect sense
to do that and load (into Spark) only the data that is required for
processing. This also saves you the filter step and hopefully time and
memory. If you want to get back the bigger dataset you can always join
multiple days of data (RDDs) together.
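
For example, re-assembling a range from per-day data is just a union
(paths invented):

val days = Seq("2014-07-09", "2014-07-10", "2014-07-11")
val range = sc.union(days.map(d => sc.textFile("hdfs:///events/" + d)))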

Re: How to separate a subset of an RDD by day?

Posted by bdamos <am...@adobe.com>.
ssimanta wrote
>> Solution 2 is to map the objects into a pair RDD where the
>> key is the number of the day in the interval, then group by
>> key, collect, and parallelize the resulting grouped data.
>> However, I worry collecting large data sets is going to be
>> a serious performance bottleneck.
> Why do you have to do a "collect"? You can do a groupBy and then write
> the grouped data to disk again.

I want to process the resulting data sets as RDDs,
and groupBy only returns each group as a plain Iterable.
Thanks for the idea to write the grouped data back to disk.
I think my best option is to partition my data in directories by day
before running my Spark application, and then direct
my Spark application to load RDDs from each directory when
I want to load a date range. How does this sound?
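
For what it's worth, a sketch of that layout (directory names invented):

// One directory per day, written before the Spark job runs:
//   events/2014-07-09/part-*
//   events/2014-07-10/part-*
// textFile accepts a comma-separated list of paths, so a date range is
// just the matching day directories joined together:
val range = sc.textFile("events/2014-07-09,events/2014-07-10")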




Re: How to separate a subset of an RDD by day?

Posted by Soumya Simanta <so...@gmail.com>.
>
> Solution 2 is to map the objects into a pair RDD where the
> key is the number of the day in the interval, then group by
> key, collect, and parallelize the resulting grouped data.
> However, I worry collecting large data sets is going to be
> a serious performance bottleneck.
>
>
Why do you have to do a "collect"? You can do a groupBy and then write
the grouped data to disk again.
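
Something along these lines (a sketch; Event stands in for the real
record type):

import org.apache.spark.SparkContext._

val grouped = rdd.map(r => (r.time / 86400, r)).groupByKey()
grouped.saveAsObjectFile("events-by-day") // written by the executors, no collect
val reloaded = sc.objectFile[(Long, Iterable[Event])]("events-by-day")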

Re: How to separate a subset of an RDD by day?

Posted by Sean Owen <so...@cloudera.com>.
On Fri, Jul 11, 2014 at 10:53 PM, bdamos <am...@adobe.com> wrote:
> I didn't make it clear in my first message that I want to obtain an RDD
> instead of an Iterable, and will be doing map-reduce-like operations on
> the data by day. My problem is that groupBy returns an RDD[(K, Iterable[T])],
> but I really want an RDD[(K, RDD[T])].
> Is there a better approach to this?

Yeah, you can't have an RDD of RDDs. Why does it need to be an RDD --
because a day could have a huge amount of data? If not, plain Scala
collections have map and reduce methods and the like too.
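
For example, the grouped Iterables can be reduced with ordinary
collection methods (the value field here is invented):

import org.apache.spark.SparkContext._ // pair-RDD implicits for mapValues

rdd.groupBy(_.time / 86400)                  // RDD[(Long, Iterable[Rec])]
   .mapValues(recs => recs.map(_.value).sum) // plain Scala map/sum per day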

I think that if you really want RDDs you can just make a series of
them, with some code like

(start/86400 to end/86400).map(day =>
  (day, rdd.filter(rec => rec.time >= day*86400 && rec.time < (day+1)*86400)))

I think that's your solution 1. I don't imagine it's that bad if this
is what you need to do.

Re: How to separate a subset of an RDD by day?

Posted by bdamos <am...@adobe.com>.
Sean Owen wrote
> Can you not just filter the range you want, then groupBy
> timestamp/86400? That sounds like your solution 1 and is about as
> fast as it gets, I think. Are you thinking you would have to filter
> out each day individually from there, and that's why it would be slow?
> I don't think that's needed. You also don't need to map to pairs.

I didn't make it clear in my first message that I want to obtain an RDD
instead of an Iterable, and will be doing map-reduce-like operations on
the data by day. My problem is that groupBy returns an RDD[(K, Iterable[T])],
but I really want an RDD[(K, RDD[T])].
Is there a better approach to this?

I'm leaning towards partitioning my data by day on disk, since all of my
queries will always process data per day.
However, the only problem I see with partitioning the data on disk is that
it limits my system to working cleanly in a single timezone.




Re: How to separate a subset of an RDD by day?

Posted by Sean Owen <so...@cloudera.com>.
Can you not just filter the range you want, then groupBy
timestamp/86400? That sounds like your solution 1 and is about as
fast as it gets, I think. Are you thinking you would have to filter
out each day individually from there, and that's why it would be slow?
I don't think that's needed. You also don't need to map to pairs.
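
In code, roughly:

rdd.filter(r => r.time >= start && r.time < end)
   .groupBy(_.time / 86400) // one pass; yields RDD[(Long, Iterable[Rec])]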

On Fri, Jul 11, 2014 at 10:19 PM, bdamos <am...@adobe.com> wrote:
> Hi, I have an RDD that represents data over a time interval and I want
> to select some subinterval of my data and partition it by day
> based on a unix time field in the data.
> What is the best way to do this with Spark?
>
> I have currently implemented two solutions, both of which seem suboptimal.
> Solution 1 is to filter the subinterval from the overall data set,
> and then to filter each day out of this filtered data set.
> However, this causes the same data in the subset to be filtered many times.
>
> Solution 2 is to map the objects into a pair RDD where the
> key is the number of the day in the interval, then group by
> key, collect, and parallelize the resulting grouped data.
> However, I worry collecting large data sets is going to be
> a serious performance bottleneck.
>
> A small query using Solution 1 takes 13 seconds to run, and the same
> query using Solution 2 takes 10 seconds to run,
> but I think this can be further improved.
> Does anybody have any suggestions on the best way to separate
> a subset of data by day?
>
> Thanks,
> Brandon.