Posted to user@spark.apache.org by Joe L <se...@yahoo.com> on 2014/04/21 20:23:16 UTC

Spark is slow

It is claimed that Spark is 10x to 100x faster than MapReduce and Hive,
but since I started using it I haven't seen any such speedup. It is
taking 2 minutes to run map and join tasks over just 2 GB of data, whereas
Hive took just a few seconds to join two tables over the same data. And I
haven't gotten any answers to my questions. I don't understand the purpose
of this group, and there isn't enough documentation on Spark and its
usage.




Re: Spark is slow

Posted by Nicholas Chammas <ni...@gmail.com>.
How long are the count() steps taking? And how many partitions are pairs1
and triples initially divided into? You can check this with
pairs1._jrdd.splits().size(), for example.
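(In more recent Spark releases, pairs1.getNumPartitions() is the public
way to check the partition count.)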

If you just need to count the number of distinct keys, is it faster if you
did the following instead of groupByKey().count()?

g1 = pairs1.map(lambda kv: kv[0]).distinct().count()
g2 = triples.map(lambda kv: kv[0]).distinct().count()
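
(The likely reason this helps: groupByKey() shuffles and materializes
every value for each key just to count the groups, while distinct() on
the keys only shuffles the keys themselves.)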

Nick


On Mon, Apr 21, 2014 at 10:42 PM, Joe L <se...@yahoo.com> wrote:

> g1 = pairs1.groupByKey().count()
> pairs1 = pairs1.groupByKey(g1).cache()
> g2 = triples.groupByKey().count()
> pairs2 = triples.groupByKey(g2)
>
> pairs = pairs2.join(pairs1)
>
> Hi, I want to implement a hash-partitioned join as shown above, but
> somehow it is taking very long to run. As I understand it, the join above
> should be performed locally, since both RDDs are partitioned the same
> way, right? After we partition them, matching keys should reside on the
> same node. So isn't the join supposed to be fast when we partition by
> keys? Thank you.
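
On the hash-partitioning question above: note that the argument to
groupByKey() is a number of partitions, not a number of groups, so
passing the distinct-key count g1 forces a second full shuffle and can
create a very large number of partitions. Here is a minimal sketch of an
explicitly co-partitioned join in PySpark; the toy data and the partition
count of 8 are purely illustrative:

from pyspark import SparkContext

sc = SparkContext(appName="copartitioned-join-sketch")

# Toy (key, value) RDDs standing in for pairs1 and triples.
pairs1 = sc.parallelize([(1, "a"), (2, "b"), (3, "c")])
triples = sc.parallelize([(1, "x"), (2, "y"), (4, "z")])

# Hash-partition both sides into the SAME number of partitions and
# cache the results, so the join below sees its inputs already
# co-partitioned and does not have to re-shuffle either of them.
p1 = pairs1.partitionBy(8).cache()
p2 = triples.partitionBy(8).cache()

# With matching partitioners, join() becomes a narrow, per-partition
# operation: matching keys are already on the same node.
joined = p1.join(p2)
print(joined.collect())  # pairs for the shared keys 1 and 2

Whether the join actually avoids a shuffle depends on both sides having
the same partitioner and partition count; the shuffle read/write metrics
in the Spark UI will show whether that is the case.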

Re: Spark is slow

Posted by John Meagher <jo...@gmail.com>.
Yahoo made some changes that drive mailing list posts into spam
folders:  http://www.virusbtn.com/blog/2014/04_15.xml

On Mon, Apr 21, 2014 at 2:50 PM, Marcelo Vanzin <va...@cloudera.com> wrote:
> Hi Joe,
>
> On Mon, Apr 21, 2014 at 11:23 AM, Joe L <se...@yahoo.com> wrote:
>> And I haven't gotten any answers to my questions.
>
> One thing that might explain that is that, at least for me, all (and I
> mean *all*) of your messages are ending up in my GMail spam folder,
> with a warning that GMail can't verify they really come from
> yahoo.com.
>
> No idea why that's happening or how to fix it.
>
> --
> Marcelo

Re: Spark is slow

Posted by Nicholas Chammas <ni...@gmail.com>.
I'm seeing the same thing as Marcelo, Joe. All your mail is going to my
Spam folder. :(

With regard to your questions, I would suggest in general adding some more
technical detail to them. It will be difficult for people to give you
suggestions if all they are told is "Spark is slow". How does your Spark
setup differ from your MR/Hive setup? What operations are you running? What
do you see in the Spark UI? What have you tried in order to isolate or
identify the reason for the slowness? Etc.

Nick


Re: Spark is slow

Posted by Marcelo Vanzin <va...@cloudera.com>.
Hi Joe,

On Mon, Apr 21, 2014 at 11:23 AM, Joe L <se...@yahoo.com> wrote:
> And I haven't gotten any answers to my questions.

One thing that might explain that is that, at least for me, all (and I
mean *all*) of your messages are ending up in my GMail spam folder,
with a warning that GMail can't verify they really come from
yahoo.com.

No idea why that's happening or how to fix it.

-- 
Marcelo

Re: Spark is slow

Posted by Sam Bessalah <sa...@gmail.com>.
Why don't you start by explaining what kind of operations you're running
on Spark that you expected to be faster than Hadoop MapReduce? Maybe we
could start there. And yes, this mailing list is very busy since many
people are getting into Spark; it's hard to answer everyone.
On 21 Apr 2014 20:23, "Joe L" <se...@yahoo.com> wrote:

> It is claimed that Spark is 10x to 100x faster than MapReduce and Hive,
> but since I started using it I haven't seen any such speedup. It is
> taking 2 minutes to run map and join tasks over just 2 GB of data,
> whereas Hive took just a few seconds to join two tables over the same
> data. And I haven't gotten any answers to my questions. I don't
> understand the purpose of this group, and there isn't enough
> documentation on Spark and its usage.