Posted to user@spark.apache.org by sparkx <ya...@yang-cs.com> on 2015/03/26 17:40:38 UTC
Combining Many RDDs
Hi,
I have a Spark job and a dataset of 0.5 million items. For each item, the job
performs some computation (joining a shared external dataset, if that
matters) and produces an RDD containing 20-500 result items. I would now like
to combine all these RDDs and run the next job on the result. I have found
that the computation itself is quite fast, but combining the RDDs takes
much longer.
val result = data     // 0.5M data items
  .map(compute(_))    // Produces an RDD - fast
  .reduce(_ ++ _)     // Combining RDDs - slow
I have also tried to collect results from compute(_) and use a flatMap, but
that is also slow.
Is there an efficient way to do this? I am thinking about writing the
results to HDFS and reading them back from disk for the next job, but I am
not sure whether that is the preferred way in Spark.
Thank you.
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Combining-Many-RDDs-tp22243.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org
Re: Combining Many RDDs
Posted by Noorul Islam K M <no...@noorul.com>.
Yang Chen <ya...@yang-cs.com> writes:
> Hi Noorul,
>
> Thank you for your suggestion. I tried that, but ran out of memory. I did
> some search and found some suggestions
> that we should try to avoid rdd.union(
> http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark
> ).
> I will try to come up with some other ways.
>
I think you are using rdd.union(), but I was referring to
SparkContext.union(). I am not sure how many RDDs you have, but I had
no memory issues when I used it to combine 2000 RDDs. Having said that,
I had other performance issues with the Spark Cassandra connector.
Thanks and Regards
Noorul
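To make the distinction between the two unions concrete, here is a minimal sketch that builds the same result both ways. It is not taken from the thread: the object name UnionSketch, the helper names, the local[2] master, and the toy data are all assumptions for illustration. Only ++, RDD.union, and SparkContext.union come from Spark itself.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object UnionSketch {
  // Hypothetical stand-in for the many small RDDs discussed in the thread.
  def smallRdds(sc: SparkContext, n: Int): Seq[RDD[Int]] =
    (1 to n).map(i => sc.parallelize(Seq(i, i * 10)))

  // Pairwise combination: ++ is an alias for RDD.union, so every step wraps
  // the previous result and the lineage gets one level deeper per RDD.
  def chained(rdds: Seq[RDD[Int]]): RDD[Int] = rdds.reduce(_ ++ _)

  // Single call: one union whose parents are all of the inputs.
  def flat(sc: SparkContext, rdds: Seq[RDD[Int]]): RDD[Int] = sc.union(rdds)

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a standalone run.
    val conf = new SparkConf().setAppName("union-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdds = smallRdds(sc, 100)
      println(s"chained=${chained(rdds).count()} flat=${flat(sc, rdds).count()}")
    } finally sc.stop()
  }
}
```

Both versions hold the same elements; the difference is the shape of the lineage the driver has to build and schedule. Whether either form copes with 0.5M input RDDs is a separate question that this sketch does not answer.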
Re: Combining Many RDDs
Posted by Yang Chen <ya...@yang-cs.com>.
Hi Kelvin,
Thank you. That works for me. I wrote my own joins that produced Scala
collections, instead of using rdd.join.
Regards,
Yang
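The reply above mentions hand-written joins that produce Scala collections instead of rdd.join. The thread does not show that code, so the following is only a sketch of one way it could look. The names BroadcastJoinSketch, joinLocally, and run are hypothetical, and shipping the shared dataset as a broadcast variable is an assumption that only holds if it fits in executor memory.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object BroadcastJoinSketch {
  // Hypothetical hand-written join: look the key up in an in-memory map and
  // return a Scala collection, so no per-item RDD is ever created.
  def joinLocally(key: Int, external: Map[Int, Seq[String]]): Seq[(Int, String)] =
    external.getOrElse(key, Seq.empty).map(v => (key, v))

  def run(sc: SparkContext, data: Seq[Int],
          external: Map[Int, Seq[String]]): RDD[(Int, String)] = {
    // Assumption: the shared dataset is small enough to broadcast.
    val shared = sc.broadcast(external)
    sc.parallelize(data).flatMap(k => joinLocally(k, shared.value))
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a standalone run.
    val conf = new SparkConf().setAppName("join-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val external = Map(1 -> Seq("a", "b"), 2 -> Seq("c"))
      run(sc, Seq(1, 2, 3), external).collect().foreach(println)
    } finally sc.stop()
  }
}
```

Keys with no match in the external map simply contribute nothing, which mirrors an inner join.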
On Thu, Mar 26, 2015 at 5:51 PM, Kelvin Chu <2d...@gmail.com> wrote:
> Hi, I used union() before and yes it may be slow sometimes. I _guess_ your
> variable 'data' is a Scala collection and compute() returns an RDD. Right?
> If yes, I tried the approach below to operate on one RDD only during the
> whole computation (Yes, I also saw that too many RDD hurt performance).
>
> Change compute() to return Scala collection instead of RDD.
>
> val result = sc.parallelize(data) // Create and partition the 0.5M items in a single RDD.
>   .flatMap(compute(_)) // You still have only one RDD with each item joined with external data already
>
> Hope this help.
>
> Kelvin
--
Yang Chen
Dept. of CISE, University of Florida
Mail: yang@yang-cs.com
Web: www.cise.ufl.edu/~yang
Re: Combining Many RDDs
Posted by Kelvin Chu <2d...@gmail.com>.
Hi, I have used union() before, and yes, it can be slow. I _guess_ your
variable 'data' is a Scala collection and compute() returns an RDD, right?
If so, I tried the approach below, which operates on a single RDD for the
whole computation (yes, I also saw that too many RDDs hurt performance).
Change compute() to return a Scala collection instead of an RDD.
val result = sc.parallelize(data) // Create and partition the 0.5M items in a single RDD.
  .flatMap(compute(_)) // You still have only one RDD, with each item already joined with the external data
Hope this helps.
Kelvin
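A self-contained version of the suggestion above might look like the following sketch. The compute() body here is a hypothetical placeholder (the real one joins against the external dataset), and the object name, local[2] master, and input range are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object FlatMapSketch {
  // Hypothetical compute(): returns a plain Scala collection, not an RDD.
  def compute(item: Int): Seq[(Int, Int)] =
    (1 to 3).map(k => (item, item * k))

  // One RDD from start to finish: partition the inputs once, then flatten
  // each item's results into that same RDD.
  def run(sc: SparkContext, data: Seq[Int]): RDD[(Int, Int)] =
    sc.parallelize(data).flatMap(compute(_))

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a standalone run.
    val conf = new SparkConf().setAppName("flatmap-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      println(run(sc, 1 to 1000).count())
    } finally sc.stop()
  }
}
```

Because compute() now runs inside the executors, anything it needs (such as the external dataset) has to be reachable from the closure, which is why it can no longer create or use RDDs itself.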
On Thu, Mar 26, 2015 at 2:37 PM, Yang Chen <ya...@yang-cs.com> wrote:
> Hi Mark,
>
> That's true, but in neither way can I combine the RDDs, so I have to avoid
> unions.
>
> Thanks,
> Yang
Re: Combining Many RDDs
Posted by Yang Chen <ya...@yang-cs.com>.
Hi Mark,
That's true, but I cannot combine the RDDs either way, so I have to avoid
unions.
Thanks,
Yang
On Thu, Mar 26, 2015 at 5:31 PM, Mark Hamstra <ma...@clearstorydata.com>
wrote:
> RDD#union is not the same thing as SparkContext#union
Re: Combining Many RDDs
Posted by Mark Hamstra <ma...@clearstorydata.com>.
RDD#union is not the same thing as SparkContext#union
On Thu, Mar 26, 2015 at 2:27 PM, Yang Chen <ya...@yang-cs.com> wrote:
> Hi Noorul,
>
> Thank you for your suggestion. I tried that, but ran out of memory. I did
> some search and found some suggestions
> that we should try to avoid rdd.union(
> http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark
> ).
> I will try to come up with some other ways.
>
> Thank you,
> Yang
Re: Combining Many RDDs
Posted by Yang Chen <ya...@yang-cs.com>.
Hi Noorul,
Thank you for your suggestion. I tried that, but ran out of memory. I
searched and found suggestions that we should try to avoid rdd.union
(http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark).
I will try to come up with another approach.
Thank you,
Yang
On Thu, Mar 26, 2015 at 1:13 PM, Noorul Islam K M <no...@noorul.com> wrote:
> Are you looking for SparkContext.union() [1] ?
>
> This is not performing well with spark cassandra connector. I am not
> sure whether this will help you.
>
> Thanks and Regards
> Noorul
>
> [1]
> http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.SparkContext
>
Re: Combining Many RDDs
Posted by Noorul Islam K M <no...@noorul.com>.
sparkx <ya...@yang-cs.com> writes:
> Hi,
>
> I have a Spark job and a dataset of 0.5 Million items. Each item performs
> some sort of computation (joining a shared external dataset, if that does
> matter) and produces an RDD containing 20-500 result items. Now I would like
> to combine all these RDDs and perform a next job. What I have found out is
> that the computation itself is quite fast, but combining these RDDs takes
> much longer time.
>
> val result = data // 0.5M data items
> .map(compute(_)) // Produces an RDD - fast
> .reduce(_ ++ _) // Combining RDDs - slow
>
> I have also tried to collect results from compute(_) and use a flatMap, but
> that is also slow.
>
> Is there a way to efficiently do this? I'm thinking about writing this
> result to HDFS and reading from disk for the next job, but am not sure if
> that's a preferred way in Spark.
>
Are you looking for SparkContext.union() [1]?
It does not perform well with the Spark Cassandra connector. I am not
sure whether this will help you.
Thanks and Regards
Noorul
[1] http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.SparkContext