Posted to dev@flink.apache.org by Arvid Heise <ar...@ververica.com> on 2020/02/26 14:49:14 UTC

Re: Batch Flink Job S3 write performance vs Spark

Fair benchmarks are notoriously difficult to set up.

Usually, it's easy to find a workload where one system shines, and as its
vendor you report that. Then the competitor benchmarks a different use
case where their system outperforms ours. In the end, customers are more
confused than before.

You should do your own benchmarks for your own workloads. That is the only
reliable way.

In the end, both systems use similar setups, and improvements in one system
are often incorporated into the other with some delay, so there should be
no ground-breaking differences between two systems that run on Java and
use the same set of libraries.
Of course, if one system has a very specific optimization for your use
case, it could be much faster.

On Mon, Feb 24, 2020 at 11:26 PM sri hari kali charan Tummala <
kali.tummala@gmail.com> wrote:

> Hi All,
>
> have a question did anyone compared the performance of Flink batch job
> writing to s3 vs spark writing to s3?
>
> --
> Thanks & Regards
> Sri Tummala
>
>

Re: Batch Flink Job S3 write performance vs Spark

Posted by sri hari kali charan Tummala <ka...@gmail.com>.

Sorry for being lazy; I should have gone through the Flink source code.

On Wed, Feb 26, 2020 at 9:35 AM sri hari kali charan Tummala <
kali.tummala@gmail.com> wrote:

> Ok, thanks for the clarification.
>
> On Wed, Feb 26, 2020 at 9:22 AM Arvid Heise <ar...@ververica.com> wrote:
>
>> Exactly. We use the hadoop-fs as an indirection on top of that, but Spark
>> probably does the same.
>>
>> On Wed, Feb 26, 2020 at 3:52 PM sri hari kali charan Tummala <
>> kali.tummala@gmail.com> wrote:
>>
>>> Thank you  (the two systems running on Java and using the same set of
>>> libraries), so from my understanding, Flink uses AWS SDK behind the scenes
>>> same as spark.
>>>
>>> On Wed, Feb 26, 2020 at 8:49 AM Arvid Heise <ar...@ververica.com> wrote:
>>>
>>>> Fair benchmarks are notoriously difficult to setup.
>>>>
>>>> Usually, it's easy to find a workload where one system shines and as
>>>> its vendor you report that. Then, the competitor benchmarks a different use
>>>> case where his system outperforms ours. In the end, customers are more
>>>> confused than before.
>>>>
>>>> You should do your own benchmarks for your own workloads. That is the
>>>> only reliable way.
>>>>
>>>> In the end, both systems use similar setups and improvements in one
>>>> system are often also incorporated into the other system with some delay,
>>>> such that there should be no ground-breaking differences between the two
>>>> systems running on Java and using the same set of libraries.
>>>> Of course, if one system has a very specific optimization for your use
>>>> case, that could be much faster.
>>>>
>>>>
>>>> On Mon, Feb 24, 2020 at 11:26 PM sri hari kali charan Tummala <
>>>> kali.tummala@gmail.com> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> have a question did anyone compared the performance of Flink batch job
>>>>> writing to s3 vs spark writing to s3?
>>>>>
>>>>> --
>>>>> Thanks & Regards
>>>>> Sri Tummala
>>>>>
>>>>>
>>>
>>> --
>>> Thanks & Regards
>>> Sri Tummala
>>>
>>>
>
> --
> Thanks & Regards
> Sri Tummala
>
>

-- 
Thanks & Regards
Sri Tummala

Re: Batch Flink Job S3 write performance vs Spark

Posted by sri hari kali charan Tummala <ka...@gmail.com>.

Ok, thanks for the clarification.

On Wed, Feb 26, 2020 at 9:22 AM Arvid Heise <ar...@ververica.com> wrote:

> Exactly. We use the hadoop-fs as an indirection on top of that, but Spark
> probably does the same.
>
> On Wed, Feb 26, 2020 at 3:52 PM sri hari kali charan Tummala <
> kali.tummala@gmail.com> wrote:
>
>> Thank you  (the two systems running on Java and using the same set of
>> libraries), so from my understanding, Flink uses AWS SDK behind the scenes
>> same as spark.
>>
>> On Wed, Feb 26, 2020 at 8:49 AM Arvid Heise <ar...@ververica.com> wrote:
>>
>>> Fair benchmarks are notoriously difficult to setup.
>>>
>>> Usually, it's easy to find a workload where one system shines and as its
>>> vendor you report that. Then, the competitor benchmarks a different use
>>> case where his system outperforms ours. In the end, customers are more
>>> confused than before.
>>>
>>> You should do your own benchmarks for your own workloads. That is the
>>> only reliable way.
>>>
>>> In the end, both systems use similar setups and improvements in one
>>> system are often also incorporated into the other system with some delay,
>>> such that there should be no ground-breaking differences between the two
>>> systems running on Java and using the same set of libraries.
>>> Of course, if one system has a very specific optimization for your use
>>> case, that could be much faster.
>>>
>>>
>>> On Mon, Feb 24, 2020 at 11:26 PM sri hari kali charan Tummala <
>>> kali.tummala@gmail.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> have a question did anyone compared the performance of Flink batch job
>>>> writing to s3 vs spark writing to s3?
>>>>
>>>> --
>>>> Thanks & Regards
>>>> Sri Tummala
>>>>
>>>>
>>
>> --
>> Thanks & Regards
>> Sri Tummala
>>
>>

-- 
Thanks & Regards
Sri Tummala

Re: Batch Flink Job S3 write performance vs Spark

Posted by Arvid Heise <ar...@ververica.com>.

Exactly. We use the hadoop-fs as an indirection on top of that, but Spark
probably does the same.

On Wed, Feb 26, 2020 at 3:52 PM sri hari kali charan Tummala <
kali.tummala@gmail.com> wrote:

> Thank you  (the two systems running on Java and using the same set of
> libraries), so from my understanding, Flink uses AWS SDK behind the scenes
> same as spark.
>
> On Wed, Feb 26, 2020 at 8:49 AM Arvid Heise <ar...@ververica.com> wrote:
>
>> Fair benchmarks are notoriously difficult to setup.
>>
>> Usually, it's easy to find a workload where one system shines and as its
>> vendor you report that. Then, the competitor benchmarks a different use
>> case where his system outperforms ours. In the end, customers are more
>> confused than before.
>>
>> You should do your own benchmarks for your own workloads. That is the
>> only reliable way.
>>
>> In the end, both systems use similar setups and improvements in one
>> system are often also incorporated into the other system with some delay,
>> such that there should be no ground-breaking differences between the two
>> systems running on Java and using the same set of libraries.
>> Of course, if one system has a very specific optimization for your use
>> case, that could be much faster.
>>
>>
>> On Mon, Feb 24, 2020 at 11:26 PM sri hari kali charan Tummala <
>> kali.tummala@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> have a question did anyone compared the performance of Flink batch job
>>> writing to s3 vs spark writing to s3?
>>>
>>> --
>>> Thanks & Regards
>>> Sri Tummala
>>>
>>>
>
> --
> Thanks & Regards
> Sri Tummala
>
>

Re: Batch Flink Job S3 write performance vs Spark

Posted by sri hari kali charan Tummala <ka...@gmail.com>.

Thank you. Since the two systems run on Java and use the same set of
libraries, my understanding is that Flink uses the AWS SDK behind the
scenes, the same as Spark.

On Wed, Feb 26, 2020 at 8:49 AM Arvid Heise <ar...@ververica.com> wrote:

> Fair benchmarks are notoriously difficult to setup.
>
> Usually, it's easy to find a workload where one system shines and as its
> vendor you report that. Then, the competitor benchmarks a different use
> case where his system outperforms ours. In the end, customers are more
> confused than before.
>
> You should do your own benchmarks for your own workloads. That is the only
> reliable way.
>
> In the end, both systems use similar setups and improvements in one system
> are often also incorporated into the other system with some delay, such
> that there should be no ground-breaking differences between the two systems
> running on Java and using the same set of libraries.
> Of course, if one system has a very specific optimization for your use
> case, that could be much faster.
>
>
> On Mon, Feb 24, 2020 at 11:26 PM sri hari kali charan Tummala <
> kali.tummala@gmail.com> wrote:
>
>> Hi All,
>>
>> have a question did anyone compared the performance of Flink batch job
>> writing to s3 vs spark writing to s3?
>>
>> --
>> Thanks & Regards
>> Sri Tummala
>>
>>

-- 
Thanks & Regards
Sri Tummala