Posted to user@flink.apache.org by Robert Schmidtke <ro...@gmail.com> on 2016/04/13 17:03:08 UTC

Flink performance pre-packaged vs. self-compiled

Hi everyone,

I'm using Flink 0.10.2 for some benchmarks and had to make some small
changes to Flink, which led me to compile and run it myself. This is
when I noticed a performance difference between the pre-packaged Flink
version that I downloaded from the web (
http://archive.apache.org/dist/flink/flink-0.10.2/flink-0.10.2-bin-hadoop27.tgz)
and the release-0.10 branch that I built myself (mvn
-Dhadoop.version=2.7.1 -Dscala-2.11 -DskipTests -Drat.skip=true clean
install, using Maven 3.0.4).

I ran a version of TeraSort (https://github.com/eastcirclek/terasort)
and noticed that the pre-packaged version of Flink performs 10-20% better
than the one I built myself. The only changes I made are in the CliFrontend
and take effect after the job has finished running, so I would rule out bad
programming on my side.

Has anyone come across this before? Or could you provide me with clearer
build instructions in order to reproduce the downloadable archive as
closely as possible? Thanks in advance!

Robert

-- 
My GPG Key ID: 336E2680

Re: Flink performance pre-packaged vs. self-compiled

Posted by Robert Schmidtke <ro...@gmail.com>.

You're obviously right, the configs were different. In the downloaded
version I had set off-heap memory to true, whereas in the version I
compiled myself this one-time change to flink-conf.yaml was overwritten by
recompiling. I have fixed it now and performance is the same.
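
A mismatch like this can be caught by diffing the flink-conf.yaml of both
installations before benchmarking. Below is a minimal Python sketch, assuming
the files contain only flat "key: value" lines. The helper names and the sample
file contents are hypothetical, and taskmanager.memory.off-heap is assumed to be
the key behind the off-heap setting mentioned above; the thread itself does not
name it.

```python
def parse_flink_conf(text):
    """Parse flat 'key: value' lines, skipping blank lines and '#' comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so values like host:port survive.
        key, sep, value = line.partition(":")
        if sep:
            conf[key.strip()] = value.strip()
    return conf


def diff_conf(a, b):
    """Return {key: (value_in_a, value_in_b)} for keys whose values differ."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Hypothetical file contents, for illustration only.
downloaded = """
# pre-packaged install
taskmanager.memory.off-heap: true
taskmanager.numberOfTaskSlots: 4
"""
self_built = """
taskmanager.numberOfTaskSlots: 4
"""

differences = diff_conf(parse_flink_conf(downloaded), parse_flink_conf(self_built))
for key, (old, new) in differences.items():
    print(f"{key}: downloaded={old!r} self-built={new!r}")
```

In practice the two strings would be read from the conf/ directory of each
installation rather than written inline.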

For the record, I had 30 GiB of TeraGen'd data and submitted the job with
these options:

-m yarn-cluster \
  -yn 10 \
  -ys 4 \
  -p 40 \
  -yjm 3072 \
  -ytm 4096

Each of the nodes has 64 GiB of RAM, and the job repeatedly ran in 27 s.
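
As a sanity check on these numbers: -yn is the number of YARN containers
(TaskManagers), -ys the number of slots per TaskManager, -p the job
parallelism, and -yjm / -ytm the JobManager and TaskManager memory in MB. The
short Python calculation below is only a sketch with illustrative variable
names; it shows that the parallelism matches the number of available slots and
derives the aggregate sort throughput from the figures quoted above.

```python
containers = 10       # -yn
slots_per_tm = 4      # -ys
parallelism = 40      # -p
tm_memory_mb = 4096   # -ytm
data_gib = 30         # TeraGen'd input size
runtime_s = 27        # reported job runtime

total_slots = containers * slots_per_tm
total_tm_memory_gib = containers * tm_memory_mb / 1024
throughput_gib_s = data_gib / runtime_s

print(f"slots: {total_slots} (parallelism {parallelism})")
print(f"TaskManager memory in total: {total_tm_memory_gib:.0f} GiB")
print(f"sort throughput: {throughput_gib_s:.2f} GiB/s")
```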

Thanks and sorry for not having checked the obvious ...

Robert

On Thu, Apr 14, 2016 at 10:23 PM, Ovidiu-Cristian MARCU <
ovidiu-cristian.marcu@inria.fr> wrote:

> Hi,
>
> Your assumption may not hold for the TeraSort use case with
> eastcirclek's implementation.
> How many times did you run your program?
> It would be helpful to give more details about your experiment, such as
> the configuration and dataset size.
>
> Best,
> Ovidiu
>
> On 14 Apr 2016, at 17:14, Robert Schmidtke <ro...@gmail.com> wrote:
>
> I have tried multiple Maven and Scala versions, but to no avail. I can't
> seem to achieve the performance of the downloaded archive. I am stumped by
> this and will need to do more experiments when I have more time.
>
> Robert
>
> On Thu, Apr 14, 2016 at 1:13 PM, Robert Schmidtke <ro...@gmail.com>
> wrote:
>
>> Hi Robert,
>>
>> thanks for the hint! Looks like something I could have figured out myself
>> -.-" I'll let you know if I find something.
>>
>> Robert
>>
>> On Thu, Apr 14, 2016 at 1:06 PM, Robert Metzger <rm...@apache.org>
>> wrote:
>>
>>> Hi Robert,
>>>
>>> check out the tools/create_release_files.sh file in the source tree.
>>> There you can see how we are building the release binaries.
>>> It would be quite interesting to find out what caused the performance
>>> difference.


-- 
My GPG Key ID: 336E2680

Re: Flink performance pre-packaged vs. self-compiled

Posted by Ovidiu-Cristian MARCU <ov...@inria.fr>.

Hi,

Your assumption may not hold for the TeraSort use case with eastcirclek's implementation.
How many times did you run your program?
It would be helpful to give more details about your experiment, such as the configuration and dataset size.

Best,
Ovidiu
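
The number of runs matters because a 10-20% gap can disappear in run-to-run
noise. Below is a minimal Python sketch of comparing two sets of repeated
runtimes. The timings are made-up placeholders, not measurements from this
thread, the helper name is hypothetical, and the final check is a crude
heuristic rather than a formal statistical test.

```python
from statistics import mean, stdev


def summarize(label, runtimes_s):
    """Print a one-line summary and return (mean, sample standard deviation)."""
    m, s = mean(runtimes_s), stdev(runtimes_s)
    print(f"{label}: mean={m:.1f}s stdev={s:.1f}s over {len(runtimes_s)} runs")
    return m, s


# Placeholder timings in seconds (hypothetical, for illustration only).
prepackaged = [27.0, 28.0, 26.0, 27.0, 27.0]
self_built = [31.0, 30.0, 33.0, 32.0, 29.0]

m_pre, s_pre = summarize("pre-packaged", prepackaged)
m_own, s_own = summarize("self-built", self_built)

slowdown = (m_own - m_pre) / m_pre
# Crude check: does the gap exceed the combined spread of both samples?
clearly_different = abs(m_own - m_pre) > (s_pre + s_own)
print(f"slowdown: {slowdown:.1%}, clearly different: {clearly_different}")
```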


Re: Flink performance pre-packaged vs. self-compiled

Posted by Robert Schmidtke <ro...@gmail.com>.

I have tried multiple Maven and Scala versions, but to no avail. I can't
seem to achieve the performance of the downloaded archive. I am stumped by
this and will need to do more experiments when I have more time.

Robert

-- 
My GPG Key ID: 336E2680

Re: Flink performance pre-packaged vs. self-compiled

Posted by Robert Schmidtke <ro...@gmail.com>.

Hi Robert,

thanks for the hint! Looks like something I could have figured out myself
-.-" I'll let you know if I find something.

Robert

-- 
My GPG Key ID: 336E2680

Re: Flink performance pre-packaged vs. self-compiled

Posted by Robert Metzger <rm...@apache.org>.

Hi Robert,

check out the tools/create_release_files.sh file in the source tree. There
you can see how we are building the release binaries.
It would be quite interesting to find out what caused the performance
difference.
