Posted to user@spark.apache.org by Guillaume Guy <gu...@gmail.com> on 2015/03/04 12:52:22 UTC

Re: Speed Benchmark

Hi Davies:

As agreed, here is the output of the profiler. Do you see anything
suspicious?


This is the code that was run (in pyspark, with the configuration from my
earlier mail and the profile setting enabled):

input = sc.textFile(inputFile)  # inputFile: HDFS path to the ~10GB text file
input.count()  # full scan of the file; this is the slow step
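
For completeness, the profile flag was passed at launch, roughly as in the
comment below (a sketch that just restates the settings from my first mail
plus the flag Davies suggested), and the accumulated profile can then be
printed from the shell:

# pyspark launched roughly as:
#   pyspark --master yarn-client --num-executors 3 --executor-cores 2 \
#           --driver-memory 5g --executor-memory 2g \
#           --conf spark.python.profile=true
sc.show_profiles()  # prints the collected Python-side profile for each stage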


[image: Inline image 1 - screenshot of the profiler output]


Best,

Guillaume Guy

 +1 919 - 972 - 8750

On Sat, Feb 28, 2015 at 8:13 AM, Davies Liu <da...@databricks.com> wrote:

> No, it should not be that slow. On my Mac, it took 1.4 minutes to do
> `rdd.count()` on a 4.3G text file (25 MB/s per CPU).
>
> Could you turn on profiling in pyspark to see what happened in the Python
> process?
>
> spark.python.profile = true
>
> On Fri, Feb 27, 2015 at 4:14 PM, Guillaume Guy
> <gu...@gmail.com> wrote:
> > It is a simple text file.
> >
> > I'm not using SQL, just doing a rdd.count() on it. Does the bug affect it?
> >
> >
> > On Friday, February 27, 2015, Davies Liu <da...@databricks.com> wrote:
> >>
> >> What is this dataset? A text file or a Parquet file?
> >>
> >> There is an issue with serialization in Spark SQL which makes it very
> >> slow (see https://issues.apache.org/jira/browse/SPARK-6055); it will be
> >> fixed very soon.
> >>
> >> Davies
> >>
> >> On Fri, Feb 27, 2015 at 1:59 PM, Guillaume Guy
> >> <gu...@gmail.com> wrote:
> >> > Hi Sean:
> >> >
> >> > Thanks for your feedback. Scala is much faster: the count completes in
> >> > ~1 minute (vs. 17 minutes in pyspark). I would expect Scala to be 2-5x
> >> > faster, but this gap seems to be more than that. Is that also your
> >> > conclusion?
> >> >
> >> > Thanks.
> >> >
> >> >
> >> > Best,
> >> >
> >> > Guillaume Guy
> >> >  +1 919 - 972 - 8750
> >> >
> >> > On Fri, Feb 27, 2015 at 9:12 AM, Sean Owen <so...@cloudera.com> wrote:
> >> >>
> >> >> That's very slow, and there are a lot of possible explanations. The
> >> >> first one that comes to mind is: I assume your YARN and HDFS are on
> >> >> the same machines, but are you running executors on all HDFS nodes
> >> >> when you run this? If not, a lot of these reads could be remote.
> >> >>
> >> >> You have 6 executor slots, but your data exists in 96 blocks on HDFS.
> >> >> You could read with up to 96-way parallelism. You say you're CPU-bound,
> >> >> though; normally I'd wonder if this was simply a case of under-using
> >> >> parallelism.
> >> >>
> >> >> I also wonder if the bottleneck is something to do with pyspark in
> >> >> this case; might be good to just try it in the spark-shell to check.
> >> >>
> >> >> On Fri, Feb 27, 2015 at 2:00 PM, Guillaume Guy
> >> >> <gu...@gmail.com> wrote:
> >> >> > Dear Spark users:
> >> >> >
> >> >> > I want to see if anyone has an idea of the performance to expect on a
> >> >> > small cluster.
> >> >> >
> >> >> > Reading from HDFS, what should be the performance of a count()
> >> >> > operation on a 10GB RDD with 100M rows using pyspark? I looked at the
> >> >> > CPU usage; all 6 cores are at 100%.
> >> >> >
> >> >> > Details:
> >> >> >
> >> >> > master yarn-client
> >> >> > num-executors 3
> >> >> > executor-cores 2
> >> >> > driver-memory 5g
> >> >> > executor-memory 2g
> >> >> > Distribution: Cloudera
> >> >> >
> >> >> > I also attached the screenshot.
> >> >> >
> >> >> > Right now, I'm at 17 minutes, which seems quite slow. Any idea what
> >> >> > decent performance with a similar configuration would look like?
> >> >> >
> >> >> > If it's way off, I would appreciate any pointers as to ways to improve
> >> >> > performance.
> >> >> >
> >> >> > Thanks.
> >> >> >
> >> >> > Best,
> >> >> >
> >> >> > Guillaume
> >> >> >
> >> >> >
> >> >> >
> >> >> > ---------------------------------------------------------------------
> >> >> > To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> >> >> > For additional commands, e-mail: user-help@spark.apache.org
> >> >
> >> >
> >
> >
> >
> > --
> >
> > Best,
> >
> > Guillaume Guy
> >  +1 919 - 972 - 8750
> >
>
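
To put Sean's point about parallelism into numbers, the partition and slot
counts can be checked directly from the pyspark shell. A minimal sketch, with
inputFile again standing in for the real HDFS path:

input = sc.textFile(inputFile)
print(input.getNumPartitions())  # roughly one partition per HDFS block (~96 here)
print(sc.defaultParallelism)     # total executor cores registered (~6 with the settings above)

With only 6 cores working through ~96 tasks, raising --num-executors and/or
--executor-cores (hardware permitting) is what would let the read use that
extra parallelism.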