Posted to user@flink.apache.org by Timur Fayruzov <ti...@gmail.com> on 2016/04/26 03:24:33 UTC

Job hangs

Hello,

Now I'm at the stage where my job seems to completely hang. The source code is
attached (it won't compile, but I think it gives a very good idea of what
happens). Unfortunately, I can't provide the datasets. Most of them are
about 100-500MM records, which I try to process on an EMR cluster with 40
tasks and 6GB of memory each.

It was working for smaller input sizes. Any idea on what I can do
differently is appreciated.

Thanks,
Timur

Re: Job hangs

Posted by Fabian Hueske <fh...@gmail.com>.
Hi Timur,

I had a look at the plan you shared.
I could not find any flow that branches and merges again, a pattern that
is prone to causing deadlocks.

However, I noticed that the plan performs a lot of partitioning steps.
You might want to have a look at forwarded field annotations, which can help
reduce the partitioning and sorting steps [1].
This might help with complex jobs such as yours.

Best, Fabian

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/batch/index.html#semantic-annotations
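
As a concrete illustration of [1], here is a minimal sketch of a forwarded-fields
hint in the Java DataSet API. The tuple types and field names are made up for the
example, since the actual job code is not shown here:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.functions.FunctionAnnotation.ForwardedFields;
import org.apache.flink.api.java.tuple.Tuple2;

// Declares that field f0 (the join key) passes through the mapper unchanged,
// so the optimizer can reuse an existing partitioning instead of repartitioning.
@ForwardedFields("f0")
public class NormalizeValue
        implements MapFunction<Tuple2<String, String>, Tuple2<String, String>> {
    @Override
    public Tuple2<String, String> map(Tuple2<String, String> in) {
        return new Tuple2<>(in.f0, in.f1.trim().toLowerCase());
    }
}

The same hint can also be attached at the call site, e.g.
input.map(new NormalizeValue()).withForwardedFields("f0"), before the mapped
data flows into a join keyed on field 0.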


2016-04-27 10:57 GMT+02:00 Vasiliki Kalavri <va...@gmail.com>:

> Hi Timur,
>
> I've previously seen large batch jobs hang because of join deadlocks. We
> should have fixed those problems, but we might have missed some corner
> case. Did you check whether there was any cpu activity when the job hangs?
> Can you try running htop on the taskmanager machines and see if they're
> idle?
>
> Cheers,
> -Vasia.
>
> On 27 April 2016 at 02:48, Timur Fayruzov <ti...@gmail.com>
> wrote:
>
>> Robert, Ufuk, logs, execution plan and a screenshot of the console are in
>> the archive:
>> https://www.dropbox.com/s/68gyl6f3rdzn7o1/debug-stuck.tar.gz?dl=0
>>
>> Note that when I looked in the backpressure view I saw back pressure
>> 'high' on following paths:
>>
>> Input->code_line:123,124->map->join
>> Input->code_line:134,135->map->join
>> Input->code_line:121->map->join
>>
>> Unfortunately, I was not able to take thread dumps nor heap dumps
>> (neither kill -3, jstack nor jmap worked, some Amazon AMI problem I assume).
>>
>> Hope that helps.
>>
>> Please, let me know if I can assist you in any way. Otherwise, I probably
>> would not be actively looking at this problem.
>>
>> Thanks,
>> Timur
>>
>>
>> On Tue, Apr 26, 2016 at 8:11 AM, Ufuk Celebi <uc...@apache.org> wrote:
>>
>>> Can you please further provide the execution plan via
>>>
>>> env.getExecutionPlan()
>>>
>>>
>>>
>>> On Tue, Apr 26, 2016 at 4:23 PM, Timur Fayruzov
>>> <ti...@gmail.com> wrote:
>>> > Hello Robert,
>>> >
>>> > I observed progress for 2 hours(meaning numbers change on dashboard),
>>> and
>>> > then I waited for 2 hours more. I'm sure it had to spill at some
>>> point, but
>>> > I figured 2h is enough time.
>>> >
>>> > Thanks,
>>> > Timur
>>> >
>>> > On Apr 26, 2016 1:35 AM, "Robert Metzger" <rm...@apache.org> wrote:
>>> >>
>>> >> Hi Timur,
>>> >>
>>> >> thank you for sharing the source code of your job. That is helpful!
>>> >> Its a large pipeline with 7 joins and 2 co-groups. Maybe your job is
>>> much
>>> >> more IO heavy with the larger input data because all the joins start
>>> >> spilling?
>>> >> Our monitoring, in particular for batch jobs is really not very
>>> advanced..
>>> >> If we had some monitoring showing the spill status, we would maybe
>>> see that
>>> >> the job is still running.
>>> >>
>>> >> How long did you wait until you declared the job hanging?
>>> >>
>>> >> Regards,
>>> >> Robert
>>> >>
>>> >>
>>> >> On Tue, Apr 26, 2016 at 10:11 AM, Ufuk Celebi <uc...@apache.org> wrote:
>>> >>>
>>> >>> No.
>>> >>>
>>> >>> If you run on YARN, the YARN logs are the relevant ones for the
>>> >>> JobManager and TaskManager. The client log submitting the job should
>>> >>> be found in /log.
>>> >>>
>>> >>> – Ufuk
>>> >>>
>>> >>> On Tue, Apr 26, 2016 at 10:06 AM, Timur Fayruzov
>>> >>> <ti...@gmail.com> wrote:
>>> >>> > I will do it my tomorrow. Logs don't show anything unusual. Are
>>> there
>>> >>> > any
>>> >>> > logs besides what's in flink/log and yarn container logs?
>>> >>> >
>>> >>> > On Apr 26, 2016 1:03 AM, "Ufuk Celebi" <uc...@apache.org> wrote:
>>> >>> >
>>> >>> > Hey Timur,
>>> >>> >
>>> >>> > is it possible to connect to the VMs and get stack traces of the
>>> Flink
>>> >>> > processes as well?
>>> >>> >
>>> >>> > We can first have a look at the logs, but the stack traces will be
>>> >>> > helpful if we can't figure out what the issue is.
>>> >>> >
>>> >>> > – Ufuk
>>> >>> >
>>> >>> > On Tue, Apr 26, 2016 at 9:42 AM, Till Rohrmann <
>>> trohrmann@apache.org>
>>> >>> > wrote:
>>> >>> >> Could you share the logs with us, Timur? That would be very
>>> helpful.
>>> >>> >>
>>> >>> >> Cheers,
>>> >>> >> Till
>>> >>> >>
>>> >>> >> On Apr 26, 2016 3:24 AM, "Timur Fayruzov" <
>>> timur.fairuzov@gmail.com>
>>> >>> >> wrote:
>>> >>> >>>
>>> >>> >>> Hello,
>>> >>> >>>
>>> >>> >>> Now I'm at the stage where my job seem to completely hang. Source
>>> >>> >>> code is
>>> >>> >>> attached (it won't compile but I think gives a very good idea of
>>> what
>>> >>> >>> happens). Unfortunately I can't provide the datasets. Most of
>>> them
>>> >>> >>> are
>>> >>> >>> about
>>> >>> >>> 100-500MM records, I try to process on EMR cluster with 40 tasks
>>> 6GB
>>> >>> >>> memory
>>> >>> >>> for each.
>>> >>> >>>
>>> >>> >>> It was working for smaller input sizes. Any idea on what I can do
>>> >>> >>> differently is appreciated.
>>> >>> >>>
>>> >>> >>> Thans,
>>> >>> >>> Timur
>>> >>
>>> >>
>>> >
>>>
>>
>>
>

Re: Job hangs

Posted by Vasiliki Kalavri <va...@gmail.com>.
Hi Timur,

I've previously seen large batch jobs hang because of join deadlocks. We
should have fixed those problems, but we might have missed some corner
case. Did you check whether there was any CPU activity when the job hangs?
Can you try running htop on the TaskManager machines and see if they're
idle?

Cheers,
-Vasia.

On 27 April 2016 at 02:48, Timur Fayruzov <ti...@gmail.com> wrote:

> Robert, Ufuk, logs, execution plan and a screenshot of the console are in
> the archive:
> https://www.dropbox.com/s/68gyl6f3rdzn7o1/debug-stuck.tar.gz?dl=0
>
> Note that when I looked in the backpressure view I saw back pressure
> 'high' on following paths:
>
> Input->code_line:123,124->map->join
> Input->code_line:134,135->map->join
> Input->code_line:121->map->join
>
> Unfortunately, I was not able to take thread dumps nor heap dumps (neither
> kill -3, jstack nor jmap worked, some Amazon AMI problem I assume).
>
> Hope that helps.
>
> Please, let me know if I can assist you in any way. Otherwise, I probably
> would not be actively looking at this problem.
>
> Thanks,
> Timur
>
>
> On Tue, Apr 26, 2016 at 8:11 AM, Ufuk Celebi <uc...@apache.org> wrote:
>
>> Can you please further provide the execution plan via
>>
>> env.getExecutionPlan()
>>
>>
>>
>> On Tue, Apr 26, 2016 at 4:23 PM, Timur Fayruzov
>> <ti...@gmail.com> wrote:
>> > Hello Robert,
>> >
>> > I observed progress for 2 hours(meaning numbers change on dashboard),
>> and
>> > then I waited for 2 hours more. I'm sure it had to spill at some point,
>> but
>> > I figured 2h is enough time.
>> >
>> > Thanks,
>> > Timur
>> >
>> > On Apr 26, 2016 1:35 AM, "Robert Metzger" <rm...@apache.org> wrote:
>> >>
>> >> Hi Timur,
>> >>
>> >> thank you for sharing the source code of your job. That is helpful!
>> >> Its a large pipeline with 7 joins and 2 co-groups. Maybe your job is
>> much
>> >> more IO heavy with the larger input data because all the joins start
>> >> spilling?
>> >> Our monitoring, in particular for batch jobs is really not very
>> advanced..
>> >> If we had some monitoring showing the spill status, we would maybe see
>> that
>> >> the job is still running.
>> >>
>> >> How long did you wait until you declared the job hanging?
>> >>
>> >> Regards,
>> >> Robert
>> >>
>> >>
>> >> On Tue, Apr 26, 2016 at 10:11 AM, Ufuk Celebi <uc...@apache.org> wrote:
>> >>>
>> >>> No.
>> >>>
>> >>> If you run on YARN, the YARN logs are the relevant ones for the
>> >>> JobManager and TaskManager. The client log submitting the job should
>> >>> be found in /log.
>> >>>
>> >>> – Ufuk
>> >>>
>> >>> On Tue, Apr 26, 2016 at 10:06 AM, Timur Fayruzov
>> >>> <ti...@gmail.com> wrote:
>> >>> > I will do it my tomorrow. Logs don't show anything unusual. Are
>> there
>> >>> > any
>> >>> > logs besides what's in flink/log and yarn container logs?
>> >>> >
>> >>> > On Apr 26, 2016 1:03 AM, "Ufuk Celebi" <uc...@apache.org> wrote:
>> >>> >
>> >>> > Hey Timur,
>> >>> >
>> >>> > is it possible to connect to the VMs and get stack traces of the
>> Flink
>> >>> > processes as well?
>> >>> >
>> >>> > We can first have a look at the logs, but the stack traces will be
>> >>> > helpful if we can't figure out what the issue is.
>> >>> >
>> >>> > – Ufuk
>> >>> >
>> >>> > On Tue, Apr 26, 2016 at 9:42 AM, Till Rohrmann <
>> trohrmann@apache.org>
>> >>> > wrote:
>> >>> >> Could you share the logs with us, Timur? That would be very
>> helpful.
>> >>> >>
>> >>> >> Cheers,
>> >>> >> Till
>> >>> >>
>> >>> >> On Apr 26, 2016 3:24 AM, "Timur Fayruzov" <
>> timur.fairuzov@gmail.com>
>> >>> >> wrote:
>> >>> >>>
>> >>> >>> Hello,
>> >>> >>>
>> >>> >>> Now I'm at the stage where my job seem to completely hang. Source
>> >>> >>> code is
>> >>> >>> attached (it won't compile but I think gives a very good idea of
>> what
>> >>> >>> happens). Unfortunately I can't provide the datasets. Most of them
>> >>> >>> are
>> >>> >>> about
>> >>> >>> 100-500MM records, I try to process on EMR cluster with 40 tasks
>> 6GB
>> >>> >>> memory
>> >>> >>> for each.
>> >>> >>>
>> >>> >>> It was working for smaller input sizes. Any idea on what I can do
>> >>> >>> differently is appreciated.
>> >>> >>>
>> >>> >>> Thans,
>> >>> >>> Timur
>> >>
>> >>
>> >
>>
>
>

Re: Job hangs

Posted by Timur Fayruzov <ti...@gmail.com>.
Robert, Ufuk: the logs, execution plan, and a screenshot of the console are in
the archive:
https://www.dropbox.com/s/68gyl6f3rdzn7o1/debug-stuck.tar.gz?dl=0

Note that when I looked in the back pressure view I saw back pressure 'high'
on the following paths:

Input->code_line:123,124->map->join
Input->code_line:134,135->map->join
Input->code_line:121->map->join

Unfortunately, I was not able to take thread dumps or heap dumps (neither
kill -3, jstack, nor jmap worked; some Amazon AMI problem, I assume).

Hope that helps.

Please let me know if I can assist you in any way. Otherwise, I probably
will not be actively looking at this problem.

Thanks,
Timur


On Tue, Apr 26, 2016 at 8:11 AM, Ufuk Celebi <uc...@apache.org> wrote:

> Can you please further provide the execution plan via
>
> env.getExecutionPlan()
>
>
>
> On Tue, Apr 26, 2016 at 4:23 PM, Timur Fayruzov
> <ti...@gmail.com> wrote:
> > Hello Robert,
> >
> > I observed progress for 2 hours(meaning numbers change on dashboard), and
> > then I waited for 2 hours more. I'm sure it had to spill at some point,
> but
> > I figured 2h is enough time.
> >
> > Thanks,
> > Timur
> >
> > On Apr 26, 2016 1:35 AM, "Robert Metzger" <rm...@apache.org> wrote:
> >>
> >> Hi Timur,
> >>
> >> thank you for sharing the source code of your job. That is helpful!
> >> Its a large pipeline with 7 joins and 2 co-groups. Maybe your job is
> much
> >> more IO heavy with the larger input data because all the joins start
> >> spilling?
> >> Our monitoring, in particular for batch jobs is really not very
> advanced..
> >> If we had some monitoring showing the spill status, we would maybe see
> that
> >> the job is still running.
> >>
> >> How long did you wait until you declared the job hanging?
> >>
> >> Regards,
> >> Robert
> >>
> >>
> >> On Tue, Apr 26, 2016 at 10:11 AM, Ufuk Celebi <uc...@apache.org> wrote:
> >>>
> >>> No.
> >>>
> >>> If you run on YARN, the YARN logs are the relevant ones for the
> >>> JobManager and TaskManager. The client log submitting the job should
> >>> be found in /log.
> >>>
> >>> – Ufuk
> >>>
> >>> On Tue, Apr 26, 2016 at 10:06 AM, Timur Fayruzov
> >>> <ti...@gmail.com> wrote:
> >>> > I will do it my tomorrow. Logs don't show anything unusual. Are there
> >>> > any
> >>> > logs besides what's in flink/log and yarn container logs?
> >>> >
> >>> > On Apr 26, 2016 1:03 AM, "Ufuk Celebi" <uc...@apache.org> wrote:
> >>> >
> >>> > Hey Timur,
> >>> >
> >>> > is it possible to connect to the VMs and get stack traces of the
> Flink
> >>> > processes as well?
> >>> >
> >>> > We can first have a look at the logs, but the stack traces will be
> >>> > helpful if we can't figure out what the issue is.
> >>> >
> >>> > – Ufuk
> >>> >
> >>> > On Tue, Apr 26, 2016 at 9:42 AM, Till Rohrmann <trohrmann@apache.org
> >
> >>> > wrote:
> >>> >> Could you share the logs with us, Timur? That would be very helpful.
> >>> >>
> >>> >> Cheers,
> >>> >> Till
> >>> >>
> >>> >> On Apr 26, 2016 3:24 AM, "Timur Fayruzov" <timur.fairuzov@gmail.com
> >
> >>> >> wrote:
> >>> >>>
> >>> >>> Hello,
> >>> >>>
> >>> >>> Now I'm at the stage where my job seem to completely hang. Source
> >>> >>> code is
> >>> >>> attached (it won't compile but I think gives a very good idea of
> what
> >>> >>> happens). Unfortunately I can't provide the datasets. Most of them
> >>> >>> are
> >>> >>> about
> >>> >>> 100-500MM records, I try to process on EMR cluster with 40 tasks
> 6GB
> >>> >>> memory
> >>> >>> for each.
> >>> >>>
> >>> >>> It was working for smaller input sizes. Any idea on what I can do
> >>> >>> differently is appreciated.
> >>> >>>
> >>> >>> Thans,
> >>> >>> Timur
> >>
> >>
> >
>

Re: Job hangs

Posted by Ufuk Celebi <uc...@apache.org>.
Can you please further provide the execution plan via

env.getExecutionPlan()
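
For reference, a minimal sketch of how the plan could be dumped from the driver
program (the class name is only an example; the real data flow has to be built
on env first):

import org.apache.flink.api.java.ExecutionEnvironment;

public class DumpPlan {
    public static void main(String[] args) throws Exception {
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // ... build the same data flow as in the real job ...

        // Returns the optimizer plan as a JSON string; it can be shared on
        // the list or inspected in the plan visualizer.
        System.out.println(env.getExecutionPlan());
    }
}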



On Tue, Apr 26, 2016 at 4:23 PM, Timur Fayruzov
<ti...@gmail.com> wrote:
> Hello Robert,
>
> I observed progress for 2 hours(meaning numbers change on dashboard), and
> then I waited for 2 hours more. I'm sure it had to spill at some point, but
> I figured 2h is enough time.
>
> Thanks,
> Timur
>
> On Apr 26, 2016 1:35 AM, "Robert Metzger" <rm...@apache.org> wrote:
>>
>> Hi Timur,
>>
>> thank you for sharing the source code of your job. That is helpful!
>> Its a large pipeline with 7 joins and 2 co-groups. Maybe your job is much
>> more IO heavy with the larger input data because all the joins start
>> spilling?
>> Our monitoring, in particular for batch jobs is really not very advanced..
>> If we had some monitoring showing the spill status, we would maybe see that
>> the job is still running.
>>
>> How long did you wait until you declared the job hanging?
>>
>> Regards,
>> Robert
>>
>>
>> On Tue, Apr 26, 2016 at 10:11 AM, Ufuk Celebi <uc...@apache.org> wrote:
>>>
>>> No.
>>>
>>> If you run on YARN, the YARN logs are the relevant ones for the
>>> JobManager and TaskManager. The client log submitting the job should
>>> be found in /log.
>>>
>>> – Ufuk
>>>
>>> On Tue, Apr 26, 2016 at 10:06 AM, Timur Fayruzov
>>> <ti...@gmail.com> wrote:
>>> > I will do it my tomorrow. Logs don't show anything unusual. Are there
>>> > any
>>> > logs besides what's in flink/log and yarn container logs?
>>> >
>>> > On Apr 26, 2016 1:03 AM, "Ufuk Celebi" <uc...@apache.org> wrote:
>>> >
>>> > Hey Timur,
>>> >
>>> > is it possible to connect to the VMs and get stack traces of the Flink
>>> > processes as well?
>>> >
>>> > We can first have a look at the logs, but the stack traces will be
>>> > helpful if we can't figure out what the issue is.
>>> >
>>> > – Ufuk
>>> >
>>> > On Tue, Apr 26, 2016 at 9:42 AM, Till Rohrmann <tr...@apache.org>
>>> > wrote:
>>> >> Could you share the logs with us, Timur? That would be very helpful.
>>> >>
>>> >> Cheers,
>>> >> Till
>>> >>
>>> >> On Apr 26, 2016 3:24 AM, "Timur Fayruzov" <ti...@gmail.com>
>>> >> wrote:
>>> >>>
>>> >>> Hello,
>>> >>>
>>> >>> Now I'm at the stage where my job seem to completely hang. Source
>>> >>> code is
>>> >>> attached (it won't compile but I think gives a very good idea of what
>>> >>> happens). Unfortunately I can't provide the datasets. Most of them
>>> >>> are
>>> >>> about
>>> >>> 100-500MM records, I try to process on EMR cluster with 40 tasks 6GB
>>> >>> memory
>>> >>> for each.
>>> >>>
>>> >>> It was working for smaller input sizes. Any idea on what I can do
>>> >>> differently is appreciated.
>>> >>>
>>> >>> Thans,
>>> >>> Timur
>>
>>
>

Re: Job hangs

Posted by Timur Fayruzov <ti...@gmail.com>.
Hello Robert,

I observed progress for 2 hours (meaning the numbers changed on the dashboard),
and then I waited for 2 more hours. I'm sure it had to spill at some point, but
I figured 2h is enough time.

Thanks,
Timur
On Apr 26, 2016 1:35 AM, "Robert Metzger" <rm...@apache.org> wrote:

> Hi Timur,
>
> thank you for sharing the source code of your job. That is helpful!
> Its a large pipeline with 7 joins and 2 co-groups. Maybe your job is much
> more IO heavy with the larger input data because all the joins start
> spilling?
> Our monitoring, in particular for batch jobs is really not very advanced..
> If we had some monitoring showing the spill status, we would maybe see that
> the job is still running.
>
> How long did you wait until you declared the job hanging?
>
> Regards,
> Robert
>
>
> On Tue, Apr 26, 2016 at 10:11 AM, Ufuk Celebi <uc...@apache.org> wrote:
>
>> No.
>>
>> If you run on YARN, the YARN logs are the relevant ones for the
>> JobManager and TaskManager. The client log submitting the job should
>> be found in /log.
>>
>> – Ufuk
>>
>> On Tue, Apr 26, 2016 at 10:06 AM, Timur Fayruzov
>> <ti...@gmail.com> wrote:
>> > I will do it my tomorrow. Logs don't show anything unusual. Are there
>> any
>> > logs besides what's in flink/log and yarn container logs?
>> >
>> > On Apr 26, 2016 1:03 AM, "Ufuk Celebi" <uc...@apache.org> wrote:
>> >
>> > Hey Timur,
>> >
>> > is it possible to connect to the VMs and get stack traces of the Flink
>> > processes as well?
>> >
>> > We can first have a look at the logs, but the stack traces will be
>> > helpful if we can't figure out what the issue is.
>> >
>> > – Ufuk
>> >
>> > On Tue, Apr 26, 2016 at 9:42 AM, Till Rohrmann <tr...@apache.org>
>> wrote:
>> >> Could you share the logs with us, Timur? That would be very helpful.
>> >>
>> >> Cheers,
>> >> Till
>> >>
>> >> On Apr 26, 2016 3:24 AM, "Timur Fayruzov" <ti...@gmail.com>
>> >> wrote:
>> >>>
>> >>> Hello,
>> >>>
>> >>> Now I'm at the stage where my job seem to completely hang. Source
>> code is
>> >>> attached (it won't compile but I think gives a very good idea of what
>> >>> happens). Unfortunately I can't provide the datasets. Most of them are
>> >>> about
>> >>> 100-500MM records, I try to process on EMR cluster with 40 tasks 6GB
>> >>> memory
>> >>> for each.
>> >>>
>> >>> It was working for smaller input sizes. Any idea on what I can do
>> >>> differently is appreciated.
>> >>>
>> >>> Thans,
>> >>> Timur
>>
>
>

Re: Job hangs

Posted by Robert Metzger <rm...@apache.org>.
Hi Timur,

thank you for sharing the source code of your job. That is helpful!
It's a large pipeline with 7 joins and 2 co-groups. Maybe your job is much
more I/O-heavy with the larger input data because all the joins start
spilling?
Our monitoring, in particular for batch jobs, is really not very advanced.
If we had some monitoring showing the spill status, we would maybe see that
the job is still running.

How long did you wait until you declared the job hanging?

Regards,
Robert


On Tue, Apr 26, 2016 at 10:11 AM, Ufuk Celebi <uc...@apache.org> wrote:

> No.
>
> If you run on YARN, the YARN logs are the relevant ones for the
> JobManager and TaskManager. The client log submitting the job should
> be found in /log.
>
> – Ufuk
>
> On Tue, Apr 26, 2016 at 10:06 AM, Timur Fayruzov
> <ti...@gmail.com> wrote:
> > I will do it my tomorrow. Logs don't show anything unusual. Are there any
> > logs besides what's in flink/log and yarn container logs?
> >
> > On Apr 26, 2016 1:03 AM, "Ufuk Celebi" <uc...@apache.org> wrote:
> >
> > Hey Timur,
> >
> > is it possible to connect to the VMs and get stack traces of the Flink
> > processes as well?
> >
> > We can first have a look at the logs, but the stack traces will be
> > helpful if we can't figure out what the issue is.
> >
> > – Ufuk
> >
> > On Tue, Apr 26, 2016 at 9:42 AM, Till Rohrmann <tr...@apache.org>
> wrote:
> >> Could you share the logs with us, Timur? That would be very helpful.
> >>
> >> Cheers,
> >> Till
> >>
> >> On Apr 26, 2016 3:24 AM, "Timur Fayruzov" <ti...@gmail.com>
> >> wrote:
> >>>
> >>> Hello,
> >>>
> >>> Now I'm at the stage where my job seem to completely hang. Source code
> is
> >>> attached (it won't compile but I think gives a very good idea of what
> >>> happens). Unfortunately I can't provide the datasets. Most of them are
> >>> about
> >>> 100-500MM records, I try to process on EMR cluster with 40 tasks 6GB
> >>> memory
> >>> for each.
> >>>
> >>> It was working for smaller input sizes. Any idea on what I can do
> >>> differently is appreciated.
> >>>
> >>> Thans,
> >>> Timur
>

Re: Job hangs

Posted by Ufuk Celebi <uc...@apache.org>.
No.

If you run on YARN, the YARN logs are the relevant ones for the
JobManager and TaskManager. The log of the client that submitted the job should
be found in /log.

– Ufuk

On Tue, Apr 26, 2016 at 10:06 AM, Timur Fayruzov
<ti...@gmail.com> wrote:
> I will do it my tomorrow. Logs don't show anything unusual. Are there any
> logs besides what's in flink/log and yarn container logs?
>
> On Apr 26, 2016 1:03 AM, "Ufuk Celebi" <uc...@apache.org> wrote:
>
> Hey Timur,
>
> is it possible to connect to the VMs and get stack traces of the Flink
> processes as well?
>
> We can first have a look at the logs, but the stack traces will be
> helpful if we can't figure out what the issue is.
>
> – Ufuk
>
> On Tue, Apr 26, 2016 at 9:42 AM, Till Rohrmann <tr...@apache.org> wrote:
>> Could you share the logs with us, Timur? That would be very helpful.
>>
>> Cheers,
>> Till
>>
>> On Apr 26, 2016 3:24 AM, "Timur Fayruzov" <ti...@gmail.com>
>> wrote:
>>>
>>> Hello,
>>>
>>> Now I'm at the stage where my job seem to completely hang. Source code is
>>> attached (it won't compile but I think gives a very good idea of what
>>> happens). Unfortunately I can't provide the datasets. Most of them are
>>> about
>>> 100-500MM records, I try to process on EMR cluster with 40 tasks 6GB
>>> memory
>>> for each.
>>>
>>> It was working for smaller input sizes. Any idea on what I can do
>>> differently is appreciated.
>>>
>>> Thans,
>>> Timur

Re: Job hangs

Posted by Timur Fayruzov <ti...@gmail.com>.
I will do it by tomorrow. The logs don't show anything unusual. Are there any
logs besides what's in flink/log and the YARN container logs?
On Apr 26, 2016 1:03 AM, "Ufuk Celebi" <uc...@apache.org> wrote:

Hey Timur,

is it possible to connect to the VMs and get stack traces of the Flink
processes as well?

We can first have a look at the logs, but the stack traces will be
helpful if we can't figure out what the issue is.

– Ufuk

On Tue, Apr 26, 2016 at 9:42 AM, Till Rohrmann <tr...@apache.org> wrote:
> Could you share the logs with us, Timur? That would be very helpful.
>
> Cheers,
> Till
>
> On Apr 26, 2016 3:24 AM, "Timur Fayruzov" <ti...@gmail.com>
wrote:
>>
>> Hello,
>>
>> Now I'm at the stage where my job seem to completely hang. Source code is
>> attached (it won't compile but I think gives a very good idea of what
>> happens). Unfortunately I can't provide the datasets. Most of them are
about
>> 100-500MM records, I try to process on EMR cluster with 40 tasks 6GB
memory
>> for each.
>>
>> It was working for smaller input sizes. Any idea on what I can do
>> differently is appreciated.
>>
>> Thans,
>> Timur

Re: Job hangs

Posted by Ufuk Celebi <uc...@apache.org>.
Hey Timur,

is it possible to connect to the VMs and get stack traces of the Flink
processes as well?

We can first have a look at the logs, but the stack traces will be
helpful if we can't figure out what the issue is.

– Ufuk

On Tue, Apr 26, 2016 at 9:42 AM, Till Rohrmann <tr...@apache.org> wrote:
> Could you share the logs with us, Timur? That would be very helpful.
>
> Cheers,
> Till
>
> On Apr 26, 2016 3:24 AM, "Timur Fayruzov" <ti...@gmail.com> wrote:
>>
>> Hello,
>>
>> Now I'm at the stage where my job seem to completely hang. Source code is
>> attached (it won't compile but I think gives a very good idea of what
>> happens). Unfortunately I can't provide the datasets. Most of them are about
>> 100-500MM records, I try to process on EMR cluster with 40 tasks 6GB memory
>> for each.
>>
>> It was working for smaller input sizes. Any idea on what I can do
>> differently is appreciated.
>>
>> Thans,
>> Timur

Re: Job hangs

Posted by Till Rohrmann <tr...@apache.org>.
Could you share the logs with us, Timur? That would be very helpful.

Cheers,
Till
On Apr 26, 2016 3:24 AM, "Timur Fayruzov" <ti...@gmail.com> wrote:

> Hello,
>
> Now I'm at the stage where my job seem to completely hang. Source code is
> attached (it won't compile but I think gives a very good idea of what
> happens). Unfortunately I can't provide the datasets. Most of them are
> about 100-500MM records, I try to process on EMR cluster with 40 tasks 6GB
> memory for each.
>
> It was working for smaller input sizes. Any idea on what I can do
> differently is appreciated.
>
> Thans,
> Timur
>