Posted to user@flink.apache.org by Konstantinos Kallas <ko...@hotmail.com> on 2019/10/01 02:18:46 UTC

[SURVEY] What is the most subtle/hard to catch bug that people have seen?

Hi everyone.

I wanted to ask Flink users what are the most subtle Flink bugs that 
people have witnessed. The cause of the bugs could be anything (e.g. 
wrong assumptions on data, parallelism of non-parallel operator, simple 
mistakes).

We are developing a testing framework for Flink and it would be 
interesting to have examples of difficult to spot bugs to evaluate our 
testing framework on.

Thanks,
Konstantinos Kallas

Re: [SURVEY] What is the most subtle/hard to catch bug that people have seen?

Posted by Konstantinos Kallas <ko...@hotmail.com>.
Hi Piotrek,

Thank you very much for your feedback. I am mostly interested in bugs in Flink applications.

Especially ones that are hard to notice -- either because they don't always occur, or because they don't crash the program, but instead they subtly affect its output, or lead to deadlocks.

Best,

Konstantinos

On 1/10/19 5:45 AM, Piotr Nowojski wrote:
Hi,

Are you asking about bugs in Flink, in libraries that Flink is using or bugs in applications that were using Flink? From my perspective/what I have seen:

The most problematic bugs while developing features for Flink:

Deadlocks & data losses caused by concurrency issues in the network stack after changing some trivial things in new data notifications.
Data visibility issues for concurrent writes/reads when implementing the S3 connector.

The most problematic bug/type of bugs in the dependencies:

Deadlocks in the external connector (for example https://issues.apache.org/jira/browse/KAFKA-6132 ). Integration with external systems is always difficult. If you add concurrency issues to the mix…

The most problematic bug in the Flink application:

Being unaware that, for some reason, code unknown to me was interrupting (SIGINT) the threads spawned by a custom SourceFunction that were emitting the data while the job was backpressured. Very rarely, this caused record serialisation to be interrupted in the middle, which showed up on the downstream receiver as deserialisation errors.

Piotrek

On 1 Oct 2019, at 04:18, Konstantinos Kallas <ko...@hotmail.com> wrote:

Hi everyone.

I wanted to ask Flink users what are the most subtle Flink bugs that
people have witnessed. The cause of the bugs could be anything (e.g.
wrong assumptions on data, parallelism of non-parallel operator, simple
mistakes).

We are developing a testing framework for Flink and it would be
interesting to have examples of difficult to spot bugs to evaluate our
testing framework on.

Thanks,
Konstantinos Kallas


Re: [SURVEY] What is the most subtle/hard to catch bug that people have seen?

Posted by Konstantinos Kallas <ko...@hotmail.com>.
Hi Jan,

Thanks a lot for that pointer, that is very interesting.

Best,

Konstantinos

On 1/10/19 6:02 AM, Jan Lukavský wrote:
Hi,

I'd add another one regarding Java hashCode() and its practical usability for distributed systems [1], although practically all (Java based) data processing systems rely on it.

One bug directly related to this that I once saw was that using an Enum inside another object used as a partitioning key results in really hard to debug bugs, mostly because during local testing everything works just fine; the problem arises only when multiple JVMs are involved. This is caused by the fact that hashCode() of an Enum is the identity hash code inherited from Object, which is not stable across JVMs.

Jan

[1] https://martin.kleppmann.com/2012/06/18/java-hashcode-unsafe-for-distributed-systems.html

On 10/1/19 11:45 AM, Piotr Nowojski wrote:
Hi,

Are you asking about bugs in Flink, in libraries that Flink is using or bugs in applications that were using Flink? From my perspective/what I have seen:

The most problematic bugs while developing features for Flink:

    Deadlocks & data losses caused by concurrency issues in the network stack after changing some trivial things in new data notifications.
    Data visibility issues for concurrent writes/reads when implementing the S3 connector.

The most problematic bug/type of bugs in the dependencies:

    Deadlocks in the external connector (for example https://issues.apache.org/jira/browse/KAFKA-6132 ). Integration with external systems is always difficult. If you add concurrency issues to the mix…

The most problematic bug in the Flink application:

    Being unaware that, for some reason, code unknown to me was interrupting (SIGINT) the threads spawned by a custom SourceFunction that were emitting the data while the job was backpressured. Very rarely, this caused record serialisation to be interrupted in the middle, which showed up on the downstream receiver as deserialisation errors.

Piotrek

On 1 Oct 2019, at 04:18, Konstantinos Kallas <ko...@hotmail.com> wrote:

Hi everyone.

I wanted to ask Flink users what are the most subtle Flink bugs that
people have witnessed. The cause of the bugs could be anything (e.g.
wrong assumptions on data, parallelism of non-parallel operator, simple
mistakes).

We are developing a testing framework for Flink and it would be
interesting to have examples of difficult to spot bugs to evaluate our
testing framework on.

Thanks,
Konstantinos Kallas


Re: [SURVEY] What is the most subtle/hard to catch bug that people have seen?

Posted by Jan Lukavský <je...@seznam.cz>.
Hi,

I'd add another one regarding Java hashCode() and its practical 
usability for distributed systems [1], although practically all (Java 
based) data processing systems rely on it.

One bug directly related to this that I once saw was that using an Enum
inside another object used as a partitioning key results in really hard to
debug bugs, mostly because during local testing everything works just
fine; the problem arises only when multiple JVMs are involved. This is
caused by the fact that hashCode() of an Enum is the identity hash code
inherited from Object, which is not stable across JVMs.

Jan

[1] 
https://martin.kleppmann.com/2012/06/18/java-hashcode-unsafe-for-distributed-systems.html
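
Editor's illustration of the Enum point above, as a minimal plain-Java sketch rather than code from the thread. The class names (EnumKeyHashDemo, UnstableKey, StableKey) and the Color enum are hypothetical. It assumes the partitioning key is a small value class; the unstable variant hashes the enum directly, while the stable variant hashes the enum's name(), which is a String and therefore hashes identically in every JVM.

```java
import java.util.Objects;

class EnumKeyHashDemo {

    enum Color { RED, GREEN, BLUE }

    // Hypothetical composite key whose hash mixes in the enum's identity hash.
    static final class UnstableKey {
        final String id;
        final Color color;

        UnstableKey(String id, Color color) {
            this.id = id;
            this.color = color;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof UnstableKey)) {
                return false;
            }
            UnstableKey other = (UnstableKey) o;
            return id.equals(other.id) && color == other.color;
        }

        @Override
        public int hashCode() {
            // Objects.hash calls color.hashCode(), which Enum inherits from Object.
            return Objects.hash(id, color);
        }
    }

    // Same key, but the hash only depends on values that are equal in every JVM.
    static final class StableKey {
        final String id;
        final Color color;

        StableKey(String id, Color color) {
            this.id = id;
            this.color = color;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof StableKey)) {
                return false;
            }
            StableKey other = (StableKey) o;
            return id.equals(other.id) && color == other.color;
        }

        @Override
        public int hashCode() {
            // String.hashCode() is specified by the JDK, so this is reproducible.
            return Objects.hash(id, color.name());
        }
    }

    public static void main(String[] args) {
        System.out.println("unstable: " + new UnstableKey("user-1", Color.RED).hashCode());
        System.out.println("stable:   " + new StableKey("user-1", Color.RED).hashCode());
    }
}
```

The "unstable" value is not guaranteed to match between two JVM processes, so records with equal keys may be routed to different partitions depending on which JVM computed the hash. Within a single JVM both variants behave consistently, which is why local tests pass.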

On 10/1/19 11:45 AM, Piotr Nowojski wrote:
> Hi,
>
> Are you asking about bugs in Flink, in libraries that Flink is using or bugs in applications that were using Flink? From my perspective/what I have seen:
>
> The most problematic bugs while developing features for Flink:
>
> 	Deadlocks & data losses caused by concurrency issues in the network stack after changing some trivial things in new data notifications.
> 	Data visibility issues for concurrent writes/reads when implementing the S3 connector.
>
> The most problematic bug/type of bugs in the dependencies:
>
> 	Deadlocks in the external connector (for example https://issues.apache.org/jira/browse/KAFKA-6132 ). Integration with external systems is always difficult. If you add concurrency issues to the mix…
>
> The most problematic bug in the Flink application:
>
> 	Being unaware that, for some reason, code unknown to me was interrupting (SIGINT) the threads spawned by a custom SourceFunction that were emitting the data while the job was backpressured. Very rarely, this caused record serialisation to be interrupted in the middle, which showed up on the downstream receiver as deserialisation errors.
>
> Piotrek
>
>> On 1 Oct 2019, at 04:18, Konstantinos Kallas <ko...@hotmail.com> wrote:
>>
>> Hi everyone.
>>
>> I wanted to ask Flink users what are the most subtle Flink bugs that
>> people have witnessed. The cause of the bugs could be anything (e.g.
>> wrong assumptions on data, parallelism of non-parallel operator, simple
>> mistakes).
>>
>> We are developing a testing framework for Flink and it would be
>> interesting to have examples of difficult to spot bugs to evaluate our
>> testing framework on.
>>
>> Thanks,
>> Konstantinos Kallas
>

Re: [SURVEY] What is the most subtle/hard to catch bug that people have seen?

Posted by Piotr Nowojski <pi...@ververica.com>.
Hi,

Are you asking about bugs in Flink, in libraries that Flink is using or bugs in applications that were using Flink? From my perspective/what I have seen:

The most problematic bugs while developing features for Flink:

	Deadlocks & data losses caused by concurrency issues in the network stack after changing some trivial things in new data notifications.
	Data visibility issues for concurrent writes/reads when implementing the S3 connector.

The most problematic bug/type of bugs in the dependencies:

	Deadlocks in the external connector (for example https://issues.apache.org/jira/browse/KAFKA-6132 ). Integration with external systems is always difficult. If you add concurrency issues to the mix…

The most problematic bug in the Flink application:

	Being unaware that, for some reason, code unknown to me was interrupting (SIGINT) the threads spawned by a custom SourceFunction that were emitting the data while the job was backpressured. Very rarely, this caused record serialisation to be interrupted in the middle, which showed up on the downstream receiver as deserialisation errors.

Piotrek
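
Editor's illustration of the last bug class above, as a plain-Java analogy and not Flink's network stack or the actual application code. It assumes the interruption was a Java Thread.interrupt() on the emitting thread. All names (InterruptedEmitDemo, emit, the "H"/"P" frames) are hypothetical: a record is written as a header frame followed by a payload frame into a bounded queue that stands in for a backpressured channel, and the emitter swallows the interrupt on purpose.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class InterruptedEmitDemo {

    // Writes one record as two frames: header first, then payload.
    static void emit(BlockingQueue<String> channel, String header, String payload) {
        try {
            channel.put(header);
            channel.put(payload); // blocks while the channel is full
        } catch (InterruptedException e) {
            // Deliberate bug: the interrupt is swallowed, so the record
            // can be left half-written and the emitter simply carries on.
        }
    }

    static List<String> run() {
        BlockingQueue<String> channel = new ArrayBlockingQueue<>(1);
        Thread emitter = new Thread(() -> {
            emit(channel, "H1", "P1");
            emit(channel, "H2", "P2");
        });
        emitter.start();

        List<String> received = new ArrayList<>();
        try {
            // Wait until H1 is in the channel. The channel is now full and
            // nobody is reading, so the put of P1 cannot complete.
            while (channel.size() < 1) {
                Thread.sleep(1);
            }
            emitter.interrupt(); // stands in for the unknown interrupting code

            // The downstream reader now drains whatever was actually written.
            for (int i = 0; i < 3; i++) {
                received.add(channel.take());
            }
            emitter.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return received;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In this sketch the reader receives H1 followed directly by H2: the payload of the first record never arrives, which is the kind of framing damage a downstream deserialiser would report as a corrupt record. Without the interrupt the same code delivers every header with its payload, so the problem only appears when the interrupt lands while the emitter is stalled mid-record.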

> On 1 Oct 2019, at 04:18, Konstantinos Kallas <ko...@hotmail.com> wrote:
> 
> Hi everyone.
> 
> I wanted to ask Flink users what are the most subtle Flink bugs that 
> people have witnessed. The cause of the bugs could be anything (e.g. 
> wrong assumptions on data, parallelism of non-parallel operator, simple 
> mistakes).
> 
> We are developing a testing framework for Flink and it would be 
> interesting to have examples of difficult to spot bugs to evaluate our 
> testing framework on.
> 
> Thanks,
> Konstantinos Kallas

