You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@flink.apache.org by Gyula Fóra <gy...@gmail.com> on 2018/05/04 08:27:54 UTC

PartitionNotFoundException after deployment

Hi Ufuk,

Do you have any quick idea what could cause this problems in flink 1.4.2?
Seems like one operator takes too long to deploy and downstream tasks error
out on partition not found. This only seems to happen when the job is
restored from state and in fact that operator has some keyed and operator
state as well.

Deploying the same job from empty state works well. We tried increasing
the taskmanager.network.request-backoff.max that didnt help.

It would be great if you have some pointers where to look further, I havent
seen this happening before.

Thank you!
Gyula

The errror:
org.apache.flink.runtime.io.network.partition.: Partition
4c5e9cd5dd410331103f51127996068a@b35ef4ffe25e3d17c5d6051ebe2860cd not found.
    at
org.apache.flink.runtime.io.network.partition.ResultPartitionManager.createSubpartitionView(ResultPartitionManager.java:77)
    at
org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel.requestSubpartition(LocalInputChannel.java:115)
    at
org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel$1.run(LocalInputChannel.java:159)
    at java.util.TimerThread.mainLoop(Timer.java:555)
    at java.util.TimerThread.run(Timer.java:505)

Re: PartitionNotFoundException after deployment

Posted by Nico Kruber <ni...@data-artisans.com>.

Hi Gyula,
as a follow-up, you may be interested in
https://issues.apache.org/jira/browse/FLINK-9413


Nico

On 04/05/18 15:36, Gyula Fóra wrote:
> Looks pretty clear that one operator takes too long to start (even on
> the UI it shows it in the created state for far too long). Any idea what
> might cause this delay? It actually often crashes on Akka ask timeout
> during scheduling the node.
> 
> Gyula
> 
> Piotr Nowojski <piotr@data-artisans.com
> <ma...@data-artisans.com>> ezt írta (időpont: 2018. máj. 4., P,
> 15:33):
> 
>     Ufuk: I don’t know why.
> 
>     +1 for your other suggestions.
> 
>     Piotrek
> 
>     > On 4 May 2018, at 14:52, Ufuk Celebi <ufuk@data-artisans.com
>     <ma...@data-artisans.com>> wrote:
>     >
>     > Hey Gyula!
>     >
>     > I'm including Piotr and Nico (cc'd) who have worked on the network
>     > stack in the last releases.
>     >
>     > Registering the network structures including the intermediate results
>     > actually happens **before** any state is restored. I'm not sure why
>     > this reproducibly happens when you restore state. @Nico, Piotr: any
>     > ideas here?
>     >
>     > In general I think what happens here is the following:
>     > - a task requests the result of a local upstream producer, but that
>     > one has not registered its intermediate result yet
>     > - this should result in a retry of the request with some backoff
>     > (controlled via the config params you mention
>     > taskmanager.network.request-backoff.max,
>     > taskmanager.network.request-backoff.initial)
>     >
>     > As a first step I would set logging to DEBUG and check the TM logs for
>     > messages like "Retriggering partition request {}:{}."
>     >
>     > You can also check the SingleInputGate code which has the logic for
>     > retriggering requests.
>     >
>     > – Ufuk
>     >
>     >
>     > On Fri, May 4, 2018 at 10:27 AM, Gyula Fóra <gyula.fora@gmail.com
>     <ma...@gmail.com>> wrote:
>     >> Hi Ufuk,
>     >>
>     >> Do you have any quick idea what could cause this problems in
>     flink 1.4.2?
>     >> Seems like one operator takes too long to deploy and downstream
>     tasks error
>     >> out on partition not found. This only seems to happen when the job is
>     >> restored from state and in fact that operator has some keyed and
>     operator
>     >> state as well.
>     >>
>     >> Deploying the same job from empty state works well. We tried
>     increasing the
>     >> taskmanager.network.request-backoff.max that didnt help.
>     >>
>     >> It would be great if you have some pointers where to look
>     further, I havent
>     >> seen this happening before.
>     >>
>     >> Thank you!
>     >> Gyula
>     >>
>     >> The errror:
>     >> org.apache.flink.runtime.io
>     <http://org.apache.flink.runtime.io>.network.partition.: Partition
>     >> 4c5e9cd5dd410331103f51127996068a@b35ef4ffe25e3d17c5d6051ebe2860cd
>     not found.
>     >>    at
>     >> org.apache.flink.runtime.io
>     <http://org.apache.flink.runtime.io>.network.partition.ResultPartitionManager.createSubpartitionView(ResultPartitionManager.java:77)
>     >>    at
>     >> org.apache.flink.runtime.io
>     <http://org.apache.flink.runtime.io>.network.partition.consumer.LocalInputChannel.requestSubpartition(LocalInputChannel.java:115)
>     >>    at
>     >> org.apache.flink.runtime.io
>     <http://org.apache.flink.runtime.io>.network.partition.consumer.LocalInputChannel$1.run(LocalInputChannel.java:159)
>     >>    at java.util.TimerThread.mainLoop(Timer.java:555)
>     >>    at java.util.TimerThread.run(Timer.java:505)
>     >
>     >
>     >
>     > --
>     > Data Artisans GmbH | Stresemannstr. 121a | 10963 Berlin
>     >
>     > info@data-artisans.com <ma...@data-artisans.com>
>     > +49-30-43208879 <tel:+49%2030%2043208879>
>     >
>     > Registered at Amtsgericht Charlottenburg - HRB 158244 B
>     > Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen
> 

-- 
Nico Kruber | Software Engineer
data Artisans

Follow us @dataArtisans
--
Join Flink Forward - The Apache Flink Conference
Stream Processing | Event Driven | Real Time
--
Data Artisans GmbH | Stresemannstr. 121A,10963 Berlin, Germany
data Artisans, Inc. | 1161 Mission Street, San Francisco, CA-94103, USA
--
Data Artisans GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen

Re: PartitionNotFoundException after deployment

Posted by Gyula Fóra <gy...@gmail.com>.

Looks pretty clear that one operator takes too long to start (even on the
UI it shows it in the created state for far too long). Any idea what might
cause this delay? It actually often crashes on Akka ask timeout during
scheduling the node.

Gyula

Piotr Nowojski <pi...@data-artisans.com> ezt írta (időpont: 2018. máj. 4.,
P, 15:33):

> Ufuk: I don’t know why.
>
> +1 for your other suggestions.
>
> Piotrek
>
> > On 4 May 2018, at 14:52, Ufuk Celebi <uf...@data-artisans.com> wrote:
> >
> > Hey Gyula!
> >
> > I'm including Piotr and Nico (cc'd) who have worked on the network
> > stack in the last releases.
> >
> > Registering the network structures including the intermediate results
> > actually happens **before** any state is restored. I'm not sure why
> > this reproducibly happens when you restore state. @Nico, Piotr: any
> > ideas here?
> >
> > In general I think what happens here is the following:
> > - a task requests the result of a local upstream producer, but that
> > one has not registered its intermediate result yet
> > - this should result in a retry of the request with some backoff
> > (controlled via the config params you mention
> > taskmanager.network.request-backoff.max,
> > taskmanager.network.request-backoff.initial)
> >
> > As a first step I would set logging to DEBUG and check the TM logs for
> > messages like "Retriggering partition request {}:{}."
> >
> > You can also check the SingleInputGate code which has the logic for
> > retriggering requests.
> >
> > – Ufuk
> >
> >
> > On Fri, May 4, 2018 at 10:27 AM, Gyula Fóra <gy...@gmail.com>
> wrote:
> >> Hi Ufuk,
> >>
> >> Do you have any quick idea what could cause this problems in flink
> 1.4.2?
> >> Seems like one operator takes too long to deploy and downstream tasks
> error
> >> out on partition not found. This only seems to happen when the job is
> >> restored from state and in fact that operator has some keyed and
> operator
> >> state as well.
> >>
> >> Deploying the same job from empty state works well. We tried increasing
> the
> >> taskmanager.network.request-backoff.max that didnt help.
> >>
> >> It would be great if you have some pointers where to look further, I
> havent
> >> seen this happening before.
> >>
> >> Thank you!
> >> Gyula
> >>
> >> The errror:
> >> org.apache.flink.runtime.io.network.partition.: Partition
> >> 4c5e9cd5dd410331103f51127996068a@b35ef4ffe25e3d17c5d6051ebe2860cd not
> found.
> >>    at
> >> org.apache.flink.runtime.io
> .network.partition.ResultPartitionManager.createSubpartitionView(ResultPartitionManager.java:77)
> >>    at
> >> org.apache.flink.runtime.io
> .network.partition.consumer.LocalInputChannel.requestSubpartition(LocalInputChannel.java:115)
> >>    at
> >> org.apache.flink.runtime.io
> .network.partition.consumer.LocalInputChannel$1.run(LocalInputChannel.java:159)
> >>    at java.util.TimerThread.mainLoop(Timer.java:555)
> >>    at java.util.TimerThread.run(Timer.java:505)
> >
> >
> >
> > --
> > Data Artisans GmbH | Stresemannstr. 121a | 10963 Berlin
> >
> > info@data-artisans.com
> > +49-30-43208879 <+49%2030%2043208879>
> >
> > Registered at Amtsgericht Charlottenburg - HRB 158244 B
> > Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen
>
>

Re: PartitionNotFoundException after deployment

Posted by Piotr Nowojski <pi...@data-artisans.com>.

Ufuk: I don’t know why.

+1 for your other suggestions.

Piotrek

> On 4 May 2018, at 14:52, Ufuk Celebi <uf...@data-artisans.com> wrote:
> 
> Hey Gyula!
> 
> I'm including Piotr and Nico (cc'd) who have worked on the network
> stack in the last releases.
> 
> Registering the network structures including the intermediate results
> actually happens **before** any state is restored. I'm not sure why
> this reproducibly happens when you restore state. @Nico, Piotr: any
> ideas here?
> 
> In general I think what happens here is the following:
> - a task requests the result of a local upstream producer, but that
> one has not registered its intermediate result yet
> - this should result in a retry of the request with some backoff
> (controlled via the config params you mention
> taskmanager.network.request-backoff.max,
> taskmanager.network.request-backoff.initial)
> 
> As a first step I would set logging to DEBUG and check the TM logs for
> messages like "Retriggering partition request {}:{}."
> 
> You can also check the SingleInputGate code which has the logic for
> retriggering requests.
> 
> – Ufuk
> 
> 
> On Fri, May 4, 2018 at 10:27 AM, Gyula Fóra <gy...@gmail.com> wrote:
>> Hi Ufuk,
>> 
>> Do you have any quick idea what could cause this problems in flink 1.4.2?
>> Seems like one operator takes too long to deploy and downstream tasks error
>> out on partition not found. This only seems to happen when the job is
>> restored from state and in fact that operator has some keyed and operator
>> state as well.
>> 
>> Deploying the same job from empty state works well. We tried increasing the
>> taskmanager.network.request-backoff.max that didnt help.
>> 
>> It would be great if you have some pointers where to look further, I havent
>> seen this happening before.
>> 
>> Thank you!
>> Gyula
>> 
>> The errror:
>> org.apache.flink.runtime.io.network.partition.: Partition
>> 4c5e9cd5dd410331103f51127996068a@b35ef4ffe25e3d17c5d6051ebe2860cd not found.
>>    at
>> org.apache.flink.runtime.io.network.partition.ResultPartitionManager.createSubpartitionView(ResultPartitionManager.java:77)
>>    at
>> org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel.requestSubpartition(LocalInputChannel.java:115)
>>    at
>> org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel$1.run(LocalInputChannel.java:159)
>>    at java.util.TimerThread.mainLoop(Timer.java:555)
>>    at java.util.TimerThread.run(Timer.java:505)
> 
> 
> 
> -- 
> Data Artisans GmbH | Stresemannstr. 121a | 10963 Berlin
> 
> info@data-artisans.com
> +49-30-43208879
> 
> Registered at Amtsgericht Charlottenburg - HRB 158244 B
> Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen

Re: PartitionNotFoundException after deployment

Posted by Ufuk Celebi <uf...@data-artisans.com>.

Hey Gyula!

I'm including Piotr and Nico (cc'd) who have worked on the network
stack in the last releases.

Registering the network structures including the intermediate results
actually happens **before** any state is restored. I'm not sure why
this reproducibly happens when you restore state. @Nico, Piotr: any
ideas here?

In general I think what happens here is the following:
- a task requests the result of a local upstream producer, but that
one has not registered its intermediate result yet
- this should result in a retry of the request with some backoff
(controlled via the config params you mention
taskmanager.network.request-backoff.max,
taskmanager.network.request-backoff.initial)

As a first step I would set logging to DEBUG and check the TM logs for
messages like "Retriggering partition request {}:{}."

You can also check the SingleInputGate code which has the logic for
retriggering requests.

– Ufuk


On Fri, May 4, 2018 at 10:27 AM, Gyula Fóra <gy...@gmail.com> wrote:
> Hi Ufuk,
>
> Do you have any quick idea what could cause this problems in flink 1.4.2?
> Seems like one operator takes too long to deploy and downstream tasks error
> out on partition not found. This only seems to happen when the job is
> restored from state and in fact that operator has some keyed and operator
> state as well.
>
> Deploying the same job from empty state works well. We tried increasing the
> taskmanager.network.request-backoff.max that didnt help.
>
> It would be great if you have some pointers where to look further, I havent
> seen this happening before.
>
> Thank you!
> Gyula
>
> The errror:
> org.apache.flink.runtime.io.network.partition.: Partition
> 4c5e9cd5dd410331103f51127996068a@b35ef4ffe25e3d17c5d6051ebe2860cd not found.
>     at
> org.apache.flink.runtime.io.network.partition.ResultPartitionManager.createSubpartitionView(ResultPartitionManager.java:77)
>     at
> org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel.requestSubpartition(LocalInputChannel.java:115)
>     at
> org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel$1.run(LocalInputChannel.java:159)
>     at java.util.TimerThread.mainLoop(Timer.java:555)
>     at java.util.TimerThread.run(Timer.java:505)



-- 
Data Artisans GmbH | Stresemannstr. 121a | 10963 Berlin

info@data-artisans.com
+49-30-43208879

Registered at Amtsgericht Charlottenburg - HRB 158244 B
Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen