Posted to user@spark.apache.org by sparkx <ya...@yang-cs.com> on 2015/03/26 17:40:38 UTC

Combining Many RDDs

Hi,

I have a Spark job and a dataset of 0.5 million items. For each item I perform
some computation (a join against a shared external dataset, if that matters)
that produces an RDD containing 20-500 result items. Now I would like to
combine all of these RDDs and run a subsequent job. What I have found is that
the computation itself is quite fast, but combining the RDDs takes much
longer.

    val result = data        // 0.5M data items
      .map(compute(_))   // Produces an RDD - fast
      .reduce(_ ++ _)      // Combining RDDs - slow

I have also tried collecting the results of compute(_) and using flatMap, but
that is also slow.

Is there a way to do this efficiently? I'm thinking about writing the result
to HDFS and reading it back from disk for the next job, but I am not sure
whether that's the preferred approach in Spark.

Thank you.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Combining-Many-RDDs-tp22243.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Combining Many RDDs

Posted by Noorul Islam K M <no...@noorul.com>.
Yang Chen <ya...@yang-cs.com> writes:

> Hi Noorul,
>
> Thank you for your suggestion. I tried that, but ran out of memory. I did
> some search and found some suggestions
> that we should try to avoid rdd.union(
> http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark
> ).
> I will try to come up with some other ways.
>

I think you are using rdd.union(), but I was referring to
SparkContext.union(). I am not sure how many RDDs you have, but I had no
issues with memory when I used it to combine 2000 RDDs. Having said that,
I had other performance issues with the Spark Cassandra connector.
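
A minimal sketch of the difference, assuming (as in the original snippet)
that 'data' is a local Scala collection, compute(x) returns an RDD, sc is
the SparkContext, and Result is just a placeholder element type:

    import org.apache.spark.rdd.RDD

    // reduce(_ ++ _) applies RDD#union pairwise, so 0.5M inputs build
    // roughly 0.5M nested UnionRDDs and a very deep lineage - that chain
    // is usually what makes the combine step slow.
    val chained: RDD[Result] = data.map(compute(_)).reduce(_ ++ _)

    // SparkContext#union takes the whole sequence at once and produces a
    // single UnionRDD over all of the parts.
    val combined: RDD[Result] = sc.union(data.map(compute(_)))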

Thanks and Regards
Noorul

>
> On Thu, Mar 26, 2015 at 1:13 PM, Noorul Islam K M <no...@noorul.com> wrote:
>
>> sparkx <ya...@yang-cs.com> writes:
>>
>> > Hi,
>> >
>> > I have a Spark job and a dataset of 0.5 Million items. Each item performs
>> > some sort of computation (joining a shared external dataset, if that does
>> > matter) and produces an RDD containing 20-500 result items. Now I would
>> like
>> > to combine all these RDDs and perform a next job. What I have found out
>> is
>> > that the computation itself is quite fast, but combining these RDDs takes
>> > much longer time.
>> >
>> >     val result = data        // 0.5M data items
>> >       .map(compute(_))   // Produces an RDD - fast
>> >       .reduce(_ ++ _)      // Combining RDDs - slow
>> >
>> > I have also tried to collect results from compute(_) and use a flatMap,
>> but
>> > that is also slow.
>> >
>> > Is there a way to efficiently do this? I'm thinking about writing this
>> > result to HDFS and reading from disk for the next job, but am not sure if
>> > that's a preferred way in Spark.
>> >
>>
>> Are you looking for SparkContext.union() [1] ?
>>
>> This is not performing well with spark cassandra connector. I am not
>> sure whether this will help you.
>>
>> Thanks and Regards
>> Noorul
>>
>> [1]
>> http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.SparkContext
>>



Re: Combining Many RDDs

Posted by Yang Chen <ya...@yang-cs.com>.
Hi Kelvin,

Thank you. That works for me. I wrote my own joins that produced Scala
collections, instead of using rdd.join.
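
One way to do that kind of local join, for example, is to broadcast the
external dataset and look items up inside compute(), so that compute()
returns a plain Scala collection and the whole job stays in a single RDD.
This is only a sketch; Item, Result, and the map contents below are
made-up placeholders:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    case class Item(id: String, keys: Seq[String])
    case class Result(itemId: String, key: String, value: String)

    def run(sc: SparkContext, data: Seq[Item]): RDD[Result] = {
      // Placeholder for the shared external dataset; it has to be small
      // enough to broadcast to every executor.
      val external: Map[String, String] = Map("k1" -> "v1", "k2" -> "v2")
      val externalBc = sc.broadcast(external)

      // The "join" happens locally against the broadcast map, and a plain
      // Scala collection is returned instead of an RDD.
      def compute(item: Item): Seq[Result] =
        item.keys.flatMap { k =>
          externalBc.value.get(k).map(v => Result(item.id, k, v))
        }

      // One RDD for the whole computation: partition the items, then flatMap.
      sc.parallelize(data).flatMap(compute(_))
    }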

Regards,
Yang

On Thu, Mar 26, 2015 at 5:51 PM, Kelvin Chu <2d...@gmail.com> wrote:

> Hi, I used union() before and yes it may be slow sometimes. I _guess_ your
> variable 'data' is a Scala collection and compute() returns an RDD. Right?
> If yes, I tried the approach below to operate on one RDD only during the
> whole computation (Yes, I also saw that too many RDD hurt performance).
>
> Change compute() to return Scala collection instead of RDD.
>
>     val result = sc.parallelize(data)        // Create and partition the
> 0.5M items in a single RDD.
>       .flatMap(compute(_))   // You still have only one RDD with each item
> joined with external data already
>
> Hope this help.
>
> Kelvin
>
> On Thu, Mar 26, 2015 at 2:37 PM, Yang Chen <ya...@yang-cs.com> wrote:
>
>> Hi Mark,
>>
>> That's true, but in neither way can I combine the RDDs, so I have to
>> avoid unions.
>>
>> Thanks,
>> Yang
>>
>> On Thu, Mar 26, 2015 at 5:31 PM, Mark Hamstra <ma...@clearstorydata.com>
>> wrote:
>>
>>> RDD#union is not the same thing as SparkContext#union
>>>
>>> On Thu, Mar 26, 2015 at 2:27 PM, Yang Chen <ya...@yang-cs.com> wrote:
>>>
>>>> Hi Noorul,
>>>>
>>>> Thank you for your suggestion. I tried that, but ran out of memory. I
>>>> did some search and found some suggestions
>>>> that we should try to avoid rdd.union(
>>>> http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark
>>>> ).
>>>> I will try to come up with some other ways.
>>>>
>>>> Thank you,
>>>> Yang
>>>>
>>>> On Thu, Mar 26, 2015 at 1:13 PM, Noorul Islam K M <no...@noorul.com>
>>>> wrote:
>>>>
>>>>> sparkx <ya...@yang-cs.com> writes:
>>>>>
>>>>> > Hi,
>>>>> >
>>>>> > I have a Spark job and a dataset of 0.5 Million items. Each item
>>>>> performs
>>>>> > some sort of computation (joining a shared external dataset, if that
>>>>> does
>>>>> > matter) and produces an RDD containing 20-500 result items. Now I
>>>>> would like
>>>>> > to combine all these RDDs and perform a next job. What I have found
>>>>> out is
>>>>> > that the computation itself is quite fast, but combining these RDDs
>>>>> takes
>>>>> > much longer time.
>>>>> >
>>>>> >     val result = data        // 0.5M data items
>>>>> >       .map(compute(_))   // Produces an RDD - fast
>>>>> >       .reduce(_ ++ _)      // Combining RDDs - slow
>>>>> >
>>>>> > I have also tried to collect results from compute(_) and use a
>>>>> flatMap, but
>>>>> > that is also slow.
>>>>> >
>>>>> > Is there a way to efficiently do this? I'm thinking about writing
>>>>> this
>>>>> > result to HDFS and reading from disk for the next job, but am not
>>>>> sure if
>>>>> > that's a preferred way in Spark.
>>>>> >
>>>>>
>>>>> Are you looking for SparkContext.union() [1] ?
>>>>>
>>>>> This is not performing well with spark cassandra connector. I am not
>>>>> sure whether this will help you.
>>>>>
>>>>> Thanks and Regards
>>>>> Noorul
>>>>>
>>>>> [1]
>>>>> http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.SparkContext
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Yang Chen
>>>> Dept. of CISE, University of Florida
>>>> Mail: yang@yang-cs.com
>>>> Web: www.cise.ufl.edu/~yang
>>>>
>>>
>>>
>>
>>
>> --
>> Yang Chen
>> Dept. of CISE, University of Florida
>> Mail: yang@yang-cs.com
>> Web: www.cise.ufl.edu/~yang
>>
>
>


-- 
Yang Chen
Dept. of CISE, University of Florida
Mail: yang@yang-cs.com
Web: www.cise.ufl.edu/~yang

Re: Combining Many RDDs

Posted by Kelvin Chu <2d...@gmail.com>.
Hi, I used union() before and yes, it can be slow sometimes. I _guess_ your
variable 'data' is a Scala collection and compute() returns an RDD, right?
If so, I tried the approach below so that the whole computation operates on
a single RDD (yes, I also saw that too many RDDs hurt performance).

Change compute() to return a Scala collection instead of an RDD:

    // Create and partition the 0.5M items in a single RDD.
    val result = sc.parallelize(data)
      // You still have only one RDD, with each item already joined with
      // the external data.
      .flatMap(compute(_))
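
If the default parallelism gives too few (or too many) partitions for the
0.5M items, the partition count can also be set explicitly, e.g. (400 is
just an illustrative number):

    val result = sc.parallelize(data, 400)
      .flatMap(compute(_))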

Hope this helps.

Kelvin

On Thu, Mar 26, 2015 at 2:37 PM, Yang Chen <ya...@yang-cs.com> wrote:

> Hi Mark,
>
> That's true, but in neither way can I combine the RDDs, so I have to avoid
> unions.
>
> Thanks,
> Yang
>
> On Thu, Mar 26, 2015 at 5:31 PM, Mark Hamstra <ma...@clearstorydata.com>
> wrote:
>
>> RDD#union is not the same thing as SparkContext#union
>>
>> On Thu, Mar 26, 2015 at 2:27 PM, Yang Chen <ya...@yang-cs.com> wrote:
>>
>>> Hi Noorul,
>>>
>>> Thank you for your suggestion. I tried that, but ran out of memory. I
>>> did some search and found some suggestions
>>> that we should try to avoid rdd.union(
>>> http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark
>>> ).
>>> I will try to come up with some other ways.
>>>
>>> Thank you,
>>> Yang
>>>
>>> On Thu, Mar 26, 2015 at 1:13 PM, Noorul Islam K M <no...@noorul.com>
>>> wrote:
>>>
>>>> sparkx <ya...@yang-cs.com> writes:
>>>>
>>>> > Hi,
>>>> >
>>>> > I have a Spark job and a dataset of 0.5 Million items. Each item
>>>> performs
>>>> > some sort of computation (joining a shared external dataset, if that
>>>> does
>>>> > matter) and produces an RDD containing 20-500 result items. Now I
>>>> would like
>>>> > to combine all these RDDs and perform a next job. What I have found
>>>> out is
>>>> > that the computation itself is quite fast, but combining these RDDs
>>>> takes
>>>> > much longer time.
>>>> >
>>>> >     val result = data        // 0.5M data items
>>>> >       .map(compute(_))   // Produces an RDD - fast
>>>> >       .reduce(_ ++ _)      // Combining RDDs - slow
>>>> >
>>>> > I have also tried to collect results from compute(_) and use a
>>>> flatMap, but
>>>> > that is also slow.
>>>> >
>>>> > Is there a way to efficiently do this? I'm thinking about writing this
>>>> > result to HDFS and reading from disk for the next job, but am not
>>>> sure if
>>>> > that's a preferred way in Spark.
>>>> >
>>>>
>>>> Are you looking for SparkContext.union() [1] ?
>>>>
>>>> This is not performing well with spark cassandra connector. I am not
>>>> sure whether this will help you.
>>>>
>>>> Thanks and Regards
>>>> Noorul
>>>>
>>>> [1]
>>>> http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.SparkContext
>>>>
>>>
>>>
>>>
>>> --
>>> Yang Chen
>>> Dept. of CISE, University of Florida
>>> Mail: yang@yang-cs.com
>>> Web: www.cise.ufl.edu/~yang
>>>
>>
>>
>
>
> --
> Yang Chen
> Dept. of CISE, University of Florida
> Mail: yang@yang-cs.com
> Web: www.cise.ufl.edu/~yang
>

Re: Combining Many RDDs

Posted by Yang Chen <ya...@yang-cs.com>.
Hi Mark,

That's true, but I cannot combine the RDDs either way, so I have to avoid
unions.

Thanks,
Yang

On Thu, Mar 26, 2015 at 5:31 PM, Mark Hamstra <ma...@clearstorydata.com>
wrote:

> RDD#union is not the same thing as SparkContext#union
>
> On Thu, Mar 26, 2015 at 2:27 PM, Yang Chen <ya...@yang-cs.com> wrote:
>
>> Hi Noorul,
>>
>> Thank you for your suggestion. I tried that, but ran out of memory. I did
>> some search and found some suggestions
>> that we should try to avoid rdd.union(
>> http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark
>> ).
>> I will try to come up with some other ways.
>>
>> Thank you,
>> Yang
>>
>> On Thu, Mar 26, 2015 at 1:13 PM, Noorul Islam K M <no...@noorul.com>
>> wrote:
>>
>>> sparkx <ya...@yang-cs.com> writes:
>>>
>>> > Hi,
>>> >
>>> > I have a Spark job and a dataset of 0.5 Million items. Each item
>>> performs
>>> > some sort of computation (joining a shared external dataset, if that
>>> does
>>> > matter) and produces an RDD containing 20-500 result items. Now I
>>> would like
>>> > to combine all these RDDs and perform a next job. What I have found
>>> out is
>>> > that the computation itself is quite fast, but combining these RDDs
>>> takes
>>> > much longer time.
>>> >
>>> >     val result = data        // 0.5M data items
>>> >       .map(compute(_))   // Produces an RDD - fast
>>> >       .reduce(_ ++ _)      // Combining RDDs - slow
>>> >
>>> > I have also tried to collect results from compute(_) and use a
>>> flatMap, but
>>> > that is also slow.
>>> >
>>> > Is there a way to efficiently do this? I'm thinking about writing this
>>> > result to HDFS and reading from disk for the next job, but am not sure
>>> if
>>> > that's a preferred way in Spark.
>>> >
>>>
>>> Are you looking for SparkContext.union() [1] ?
>>>
>>> This is not performing well with spark cassandra connector. I am not
>>> sure whether this will help you.
>>>
>>> Thanks and Regards
>>> Noorul
>>>
>>> [1]
>>> http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.SparkContext
>>>
>>
>>
>>
>> --
>> Yang Chen
>> Dept. of CISE, University of Florida
>> Mail: yang@yang-cs.com
>> Web: www.cise.ufl.edu/~yang
>>
>
>


-- 
Yang Chen
Dept. of CISE, University of Florida
Mail: yang@yang-cs.com
Web: www.cise.ufl.edu/~yang

Re: Combining Many RDDs

Posted by Mark Hamstra <ma...@clearstorydata.com>.
RDD#union is not the same thing as SparkContext#union

On Thu, Mar 26, 2015 at 2:27 PM, Yang Chen <ya...@yang-cs.com> wrote:

> Hi Noorul,
>
> Thank you for your suggestion. I tried that, but ran out of memory. I did
> some search and found some suggestions
> that we should try to avoid rdd.union(
> http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark
> ).
> I will try to come up with some other ways.
>
> Thank you,
> Yang
>
> On Thu, Mar 26, 2015 at 1:13 PM, Noorul Islam K M <no...@noorul.com>
> wrote:
>
>> sparkx <ya...@yang-cs.com> writes:
>>
>> > Hi,
>> >
>> > I have a Spark job and a dataset of 0.5 Million items. Each item
>> performs
>> > some sort of computation (joining a shared external dataset, if that
>> does
>> > matter) and produces an RDD containing 20-500 result items. Now I would
>> like
>> > to combine all these RDDs and perform a next job. What I have found out
>> is
>> > that the computation itself is quite fast, but combining these RDDs
>> takes
>> > much longer time.
>> >
>> >     val result = data        // 0.5M data items
>> >       .map(compute(_))   // Produces an RDD - fast
>> >       .reduce(_ ++ _)      // Combining RDDs - slow
>> >
>> > I have also tried to collect results from compute(_) and use a flatMap,
>> but
>> > that is also slow.
>> >
>> > Is there a way to efficiently do this? I'm thinking about writing this
>> > result to HDFS and reading from disk for the next job, but am not sure
>> if
>> > that's a preferred way in Spark.
>> >
>>
>> Are you looking for SparkContext.union() [1] ?
>>
>> This is not performing well with spark cassandra connector. I am not
>> sure whether this will help you.
>>
>> Thanks and Regards
>> Noorul
>>
>> [1]
>> http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.SparkContext
>>
>
>
>
> --
> Yang Chen
> Dept. of CISE, University of Florida
> Mail: yang@yang-cs.com
> Web: www.cise.ufl.edu/~yang
>

Re: Combining Many RDDs

Posted by Yang Chen <ya...@yang-cs.com>.
Hi Noorul,

Thank you for your suggestion. I tried that, but ran out of memory. I did
some searching and found suggestions that rdd.union should be avoided
(http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark).
I will try to come up with some other way.

Thank you,
Yang

On Thu, Mar 26, 2015 at 1:13 PM, Noorul Islam K M <no...@noorul.com> wrote:

> sparkx <ya...@yang-cs.com> writes:
>
> > Hi,
> >
> > I have a Spark job and a dataset of 0.5 Million items. Each item performs
> > some sort of computation (joining a shared external dataset, if that does
> > matter) and produces an RDD containing 20-500 result items. Now I would
> like
> > to combine all these RDDs and perform a next job. What I have found out
> is
> > that the computation itself is quite fast, but combining these RDDs takes
> > much longer time.
> >
> >     val result = data        // 0.5M data items
> >       .map(compute(_))   // Produces an RDD - fast
> >       .reduce(_ ++ _)      // Combining RDDs - slow
> >
> > I have also tried to collect results from compute(_) and use a flatMap,
> but
> > that is also slow.
> >
> > Is there a way to efficiently do this? I'm thinking about writing this
> > result to HDFS and reading from disk for the next job, but am not sure if
> > that's a preferred way in Spark.
> >
>
> Are you looking for SparkContext.union() [1] ?
>
> This is not performing well with spark cassandra connector. I am not
> sure whether this will help you.
>
> Thanks and Regards
> Noorul
>
> [1]
> http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.SparkContext
>



-- 
Yang Chen
Dept. of CISE, University of Florida
Mail: yang@yang-cs.com
Web: www.cise.ufl.edu/~yang

Re: Combining Many RDDs

Posted by Noorul Islam K M <no...@noorul.com>.
sparkx <ya...@yang-cs.com> writes:

> Hi,
>
> I have a Spark job and a dataset of 0.5 Million items. Each item performs
> some sort of computation (joining a shared external dataset, if that does
> matter) and produces an RDD containing 20-500 result items. Now I would like
> to combine all these RDDs and perform a next job. What I have found out is
> that the computation itself is quite fast, but combining these RDDs takes
> much longer time.
>
>     val result = data        // 0.5M data items
>       .map(compute(_))   // Produces an RDD - fast
>       .reduce(_ ++ _)      // Combining RDDs - slow
>
> I have also tried to collect results from compute(_) and use a flatMap, but
> that is also slow.
>
> Is there a way to efficiently do this? I'm thinking about writing this
> result to HDFS and reading from disk for the next job, but am not sure if
> that's a preferred way in Spark.
>

Are you looking for SparkContext.union() [1]?

It did not perform well with the Spark Cassandra connector in my case, so I
am not sure whether it will help you.

Thanks and Regards
Noorul

[1] http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.SparkContext
