Posted to dev@flink.apache.org by Xinhui Tian <ti...@ict.ac.cn> on 2015/07/20 09:47:11 UTC

Benchmarks of Flink, supporting Flink in BigDataBench

Hello, everyone.

I'm a PhD student from the Institute of Computing Technology, Chinese
Academy of Sciences. Our team has released a benchmark for big data systems
called BigDataBench, which has become an industry-standard big data
benchmark in China. You can find our work on this website:
http://prof.ict.ac.cn/BigDataBench/

We are now planning to support Flink in our benchmark, which would provide a
set of workloads across different domains and an objective comparison with
systems such as Spark and Hadoop. However, we are new to Flink, so we would
like to ask for your advice on the benchmark design. The first thing is to
decide which workloads should be added to our benchmark and which domains we
should pay more attention to.

The attachment is a preliminary plan, which lists some workloads that have
already been implemented in the Spark version. We plan to first implement
these workloads on Flink and then evaluate the two systems. Does anyone have
any advice on this list? We would be very grateful for any ideas.
BigDataBench_for_Flink.docx
<http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/file/n7079/BigDataBench_for_Flink.docx>  

Thanks ;)




Re: Benchmarks of Flink, supporting Flink in BigDataBench

Posted by Fabian Hueske <fh...@gmail.com>.
Hi,

welcome to the Flink community and thanks for including Flink in your
benchmark suite! That's really exciting news :-)

Most of the jobs that you listed in your preliminary plan are available as
example programs in Flink's code base [1].
However, you should know that these examples are NOT tuned for performance
but rather for easy understanding and for showcasing certain features.

If your Flink implementations are available online (e.g., on GitHub), we
could assist with some performance tuning.

Thank you,
Fabian

[1]
https://github.com/apache/flink/tree/master/flink-examples/flink-java-examples/src/main/java/org/apache/flink/examples/java
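
To give a rough idea of what those examples look like, here is a minimal,
untuned sketch in the style of the batch examples (Java DataSet API); the
input and output paths are placeholders:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class WordCountSketch {

    public static void main(String[] args) throws Exception {
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Placeholder input path; the real benchmark would read its generated data set.
        DataSet<String> text = env.readTextFile("hdfs:///path/to/input");

        DataSet<Tuple2<String, Integer>> counts = text
            .flatMap(new Tokenizer())
            .groupBy(0)   // group by the word (field 0 of the tuple)
            .sum(1);      // sum up the counts (field 1)

        counts.writeAsCsv("hdfs:///path/to/output", "\n", " ");
        env.execute("WordCount sketch");
    }

    // Splits each line into lower-cased (word, 1) pairs.
    public static final class Tokenizer
            implements FlatMapFunction<String, Tuple2<String, Integer>> {

        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String token : line.toLowerCase().split("\\W+")) {
                if (token.length() > 0) {
                    out.collect(new Tuple2<>(token, 1));
                }
            }
        }
    }
}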



Re: Benchmarks of Flink, supporting Flink in BigDataBench

Posted by Stephan Ewen <se...@apache.org>.
Hi!

Thanks for reaching out and adding Flink to BigDataBench.

The plan you sent looks like a nice first draft. It consists pretty much
only of batch jobs. Here are a few ideas for what you could add as batch jobs:

 - Joins are something people seem to do a lot with these systems, so a 2-3
table join would be a nice addition (a minimal sketch follows after this list)

 - For batch algorithms, it is often interesting to scale them beyond
memory (we have seen that a lot from users)

 - For graph algorithms, you can try incremental versions (see here:
http://data-artisans.com/data-analysis-with-flink.html)
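
For the join item above, a minimal sketch of a two-table join in the Java
DataSet API could look like the following. The table schemas (pageURL and
pageRank for rankings; sourceIP, destURL, and adRevenue for user visits) and
the paths are only assumptions for illustration:

import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;

public class JoinSketch {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // rankings: (pageURL, pageRank)
        DataSet<Tuple2<String, Integer>> rankings = env
            .readCsvFile("hdfs:///path/to/rankings")
            .types(String.class, Integer.class);

        // uservisits: (sourceIP, destURL, adRevenue)
        DataSet<Tuple3<String, String, Double>> visits = env
            .readCsvFile("hdfs:///path/to/uservisits")
            .types(String.class, String.class, Double.class);

        // Join on rankings.pageURL == visits.destURL and emit (sourceIP, pageRank, adRevenue).
        DataSet<Tuple3<String, Integer, Double>> joined = rankings
            .join(visits)
            .where(0)    // pageURL in rankings
            .equalTo(1)  // destURL in uservisits
            .with(new JoinFunction<Tuple2<String, Integer>,
                                   Tuple3<String, String, Double>,
                                   Tuple3<String, Integer, Double>>() {
                @Override
                public Tuple3<String, Integer, Double> join(
                        Tuple2<String, Integer> rank,
                        Tuple3<String, String, Double> visit) {
                    return new Tuple3<>(visit.f0, rank.f1, visit.f2);
                }
            });

        joined.writeAsCsv("hdfs:///path/to/join-output");
        env.execute("Two-table join sketch");
    }
}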



On the streaming side, it is harder, as the systems are very different
there and not every system can do everything.
For Flink, some ideas would be:
  - Streaming Grep
  - Streaming pattern detection (see
https://github.com/StephanEwen/flink-demos/tree/master/streaming-state-machine
)
  - Streaming word count (a minimal sketch follows after this list)
  - For streaming jobs, it is often interesting to play with fault tolerance
enabled / disabled
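
As a starting point for the streaming word count, here is a minimal sketch
with the DataStream API. The socket source (host/port) stands in for whatever
data generator the benchmark uses, and the checkpoint interval is only an
example value (drop the enableCheckpointing call to test with fault tolerance
disabled):

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class StreamingWordCountSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 5 seconds; remove this line to benchmark without fault tolerance.
        env.enableCheckpointing(5000);

        // Placeholder source; replace with the benchmark's data generator.
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        lines
            .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                    for (String word : line.split("\\s+")) {
                        if (!word.isEmpty()) {
                            out.collect(new Tuple2<>(word, 1));
                        }
                    }
                }
            })
            .keyBy(0)  // key by the word
            .sum(1)    // running count per word
            .print();

        env.execute("Streaming WordCount sketch");
    }
}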



A few generic comments on Flink for performance testing:

 - The Java API is usually slightly faster than the Scala API, but only by
a bit.
 - Tuples (Java) and case classes (Scala) usually beat POJOs in performance.
 - If your implementation allows it, turning on object reuse mode can gain
some performance (a sketch of the switch follows below).
 - If you implement sorting / TeraSort, have a look here for how to make
sure that Flink handles the Hadoop types efficiently:
http://eastcirclek.blogspot.kr/2015/06/terasort-for-spark-and-flink-with-range.html
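
In case it helps, here is a hedged sketch of the object reuse switch. In
recent Flink versions it is exposed as enableObjectReuse() on the
ExecutionConfig; please double-check the exact method name against the
version you benchmark:

import org.apache.flink.api.java.ExecutionEnvironment;

public class ObjectReuseSketch {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Reuse record objects across user-function calls instead of allocating new
        // ones; only safe if your functions do not hold on to input objects.
        env.getConfig().enableObjectReuse();

        // ... build and run the benchmark job on this environment ...
    }
}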

Greetings,
Stephan




Re: Benchmarks of Flink, supporting Flink in BigDataBench

Posted by hawin <ha...@gmail.com>.
Hi Xinhui,

As Stephan mentioned for the batch jobs, a 2-3 table join would be a nice
addition.
Can we use the same queries from the AMPLab Spark benchmark below to
implement it? (A Flink sketch of the first query follows after the link.)
Thanks.


For example:
1. Scan Query
SELECT pageURL, pageRank FROM rankings WHERE pageRank > X

2. Aggregation Query
SELECT SUBSTR(sourceIP, 1, X), SUM(adRevenue) FROM uservisits GROUP BY
SUBSTR(sourceIP, 1, X)


3. Join Query
SELECT sourceIP, totalRevenue, avgPageRank
FROM
  (SELECT sourceIP,
          AVG(pageRank) as avgPageRank,
          SUM(adRevenue) as totalRevenue
    FROM Rankings AS R, UserVisits AS UV
    WHERE R.pageURL = UV.destURL
       AND UV.visitDate BETWEEN Date('1980-01-01') AND Date('X')
    GROUP BY UV.sourceIP)
  ORDER BY totalRevenue DESC LIMIT 1

https://amplab.cs.berkeley.edu/benchmark/
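
To make the mapping concrete, here is a hedged sketch of how the scan query
(SELECT pageURL, pageRank FROM rankings WHERE pageRank > X) could be
expressed with Flink's Java DataSet API. The input path, the (pageURL,
pageRank) schema, and the way X is passed in are assumptions for
illustration:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class ScanQuerySketch {

    public static void main(String[] args) throws Exception {
        // The pageRank threshold X, passed as the first program argument.
        final int x = Integer.parseInt(args[0]);

        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // rankings: (pageURL, pageRank)
        DataSet<Tuple2<String, Integer>> rankings = env
            .readCsvFile("hdfs:///path/to/rankings")
            .types(String.class, Integer.class);

        rankings
            .filter(r -> r.f1 > x)                       // WHERE pageRank > X
            .writeAsCsv("hdfs:///path/to/scan-output");  // keeps (pageURL, pageRank)

        env.execute("Scan query sketch");
    }
}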



