Posted to user@beam.apache.org by Sam Bourne <sa...@gmail.com> on 2020/09/22 17:35:48 UTC

Flink JobService on k8s

Hello beam community!

I’m looking for some help solving an issue running a beam job on flink
using --environment_type DOCKER.

I have a flink cluster running in kubernetes, configured so the taskworkers
can run docker [1]. I’m attempting to deploy a flink jobserver in the
cluster. The issue is that the jobserver does not provide the proper
endpoints to the SDK harness when it submits the job to flink. It typically
provides something like localhost:34567, using the hostname the grpc server
was bound to. There is a jobserver flag --job-host that will bind the grpc
server to the provided hostname, but I cannot seem to get it to bind to
the k8s jobservice Service name [2]. I’ve tried different flavors of FQDNs
but haven’t had any luck.

[main] INFO org.apache.beam.runners.jobsubmission.JobServerDriver - ArtifactStagingService started on flink-beam-jobserver:8098
[main] WARN org.apache.beam.runners.jobsubmission.JobServerDriver - Exception during job server creation
java.io.IOException: Failed to bind
...

Does anyone have some experience with this that could help provide some
guidance?

Cheers,
Sam

[1] https://github.com/sambvfx/beam-flink-k8s
[2]
https://github.com/sambvfx/beam-flink-k8s/blob/master/k8s/beam-flink-jobserver.yaml#L29
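
For reference, the kind of Deployment fragment in question [2] boils down
to something like this (a minimal sketch; the image tag, names, and flag
values are illustrative rather than copied from the actual manifest):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-beam-jobserver
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flink-beam-jobserver
  template:
    metadata:
      labels:
        app: flink-beam-jobserver
    spec:
      containers:
        - name: jobserver
          # illustrative tag; match it to your Flink version
          image: apache/beam_flink1.10_job_server:2.24.0
          args:
            - --flink-master=flink-jobmanager:8081
            # binding the grpc server to the k8s Service name is the
            # part that fails with java.io.IOException: Failed to bind
            - --job-host=flink-beam-jobserver
            - --job-port=8099
            - --artifacts-dir=/tmp/beam-artifact-staging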

Re: Flink JobService on k8s

Posted by Sam Bourne <sa...@gmail.com>.
Thank you Kyle for clarifying things for me. I've confirmed it works by
simply sharing the artifact staging volume between the jobserver and
taskmanager pods. This works fine with the dind setup using the docker
environment.

Thanks again,
Sam
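
For anyone who finds this thread later, a minimal sketch of that shared
volume, assuming a hostPath volume for simplicity (all names here are
illustrative):

# defined in both the jobserver and taskmanager pod specs
volumes:
  - name: beam-artifact-staging
    # hostPath only works while both pods land on the same node;
    # a ReadWriteMany PVC or a distributed filesystem is the safer choice
    hostPath:
      path: /tmp/beam-artifact-staging
      type: DirectoryOrCreate

# mounted at the same path in each container, matching the jobserver's
# --artifacts-dir
containers:
  - name: jobserver   # likewise in the taskmanager container
    volumeMounts:
      - name: beam-artifact-staging
        mountPath: /tmp/beam-artifact-staging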


Re: Flink JobService on k8s

Posted by Kyle Weaver <kc...@google.com>.
> It was my understanding that the client first uploads the artifacts to
> the jobserver and then the SDK harness will pull in these artifacts from
> the jobserver over a gRPC port.

Not quite. The artifact endpoint is for communicating which artifacts are
needed, and where to find them. But the SDK harness pulls the actual
artifacts itself.

> Do the jobserver and the taskmanager need to share the artifact staging
> volume?

More precisely, the job server and the SDK harness need to share the
artifact staging volume (which is why we generally recommend using a
distributed filesystem for this purpose if possible).

General note: there is never any direct communication between the job
server and the SDK harness. Usually it goes Beam job server -> Flink job
manager -> Flink task manager -> Beam SDK harness.
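
In that picture the client only ever talks to the job server's own
endpoints, so on k8s it is enough for a Service to expose them; a sketch,
assuming the default port numbers:

apiVersion: v1
kind: Service
metadata:
  name: flink-beam-jobserver
spec:
  selector:
    app: flink-beam-jobserver
  ports:
    - name: job-service        # pipeline submission (default 8099)
      port: 8099
    - name: artifact-staging   # client-side artifact upload (default 8098)
      port: 8098
    - name: expansion          # cross-language expansion (default 8097)
      port: 8097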


Re: Flink JobService on k8s

Posted by Sam Bourne <sa...@gmail.com>.
It was my understanding that the client first uploads the artifacts to the
jobserver and then the SDK harness will pull in these artifacts from the
jobserver over a gRPC port.

I see the artifacts on the jobserver while the job is attempting to run:

root@flink-beam-jobserver-9fccb99b8-6mhtq:/tmp/beam-artifact-staging/3024e5d862fef831e830945b2d3e4e9511e0423bfb9c48de75aa2b3b67decce4

Do the jobserver and the taskmanager need to share the artifact staging
volume?


Re: Flink JobService on k8s

Posted by Kyle Weaver <kc...@google.com>.
> rpc error: code = Unknown desc = ; failed to retrieve chunk for
> /tmp/staged/pickled_main_session

Are you sure that's due to a networking issue, and not a problem with the
filesystem / volume mounting?


Re: Flink JobService on k8s

Posted by Sam Bourne <sa...@gmail.com>.
> I would not be surprised if there was something weird going on with Docker
> in Docker. The defaults mostly work fine when an external SDK harness is
> used [1].
>
> Can you provide more information on the exception you got? (I'm
> particularly interested in the line number).

The actual error is a bit tricky to find, but if you monitor the docker
logs from within the taskmanager pod you can see it failing when the SDK
harness boot.go attempts to pull the artifacts from the artifact
endpoint [1]:
[1]
https://github.com/apache/beam/blob/master/sdks/python/container/boot.go#L139

2020/09/22 22:07:51 Initializing python harness: /opt/apache/beam/boot --id=1-1 --provision_endpoint=localhost:45775
2020/09/22 22:07:59 Failed to retrieve staged files: failed to retrieve /tmp/staged in 3 attempts: failed to retrieve chunk for /tmp/staged/pickled_main_session
    caused by:
rpc error: code = Unknown desc = ; failed to retrieve chunk for /tmp/staged/pickled_main_session

I can hit the jobserver fine from my taskmanager pod, as well as from
within an SDK container I spin up manually (with --network host):

root@flink-taskmanager-56c6bdb6dd-49md7:/opt/apache/beam# ping flink-beam-jobserver
PING flink-beam-jobserver.default.svc.cluster.local (10.103.16.129) 56(84) bytes of data.

I don’t see how this would work if the endpoint hostname is localhost. I’ll
explore how this is working in the flink-on-k8s-operator.

Thanks for taking a look!
Sam


Re: Flink JobService on k8s

Posted by Kyle Weaver <kc...@google.com>.
> The issue is that the jobserver does not provide the proper endpoints to
> the SDK harness when it submits the job to flink.

I would not be surprised if there was something weird going on with Docker
in Docker. The defaults mostly work fine when an external SDK harness is
used [1].

Can you provide more information on the exception you got? (I'm
particularly interested in the line number).

> The issue is that the jobserver does not provide the proper endpoints to
> the SDK harness when it submits the job to flink.

More information about this failure mode would be helpful as well.

[1]
https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/examples/beam/with_job_server/beam_job_server.yaml
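
For reference, the external-harness setup in [1] amounts to running the
Beam SDK worker pool as a sidecar of the taskmanager; a rough sketch
(image tag illustrative, 50000 is the documented default port):

# sidecar container in the flink taskmanager pod spec
- name: beam-worker-pool
  image: apache/beam_python3.7_sdk:2.24.0
  args: ["--worker_pool"]
  ports:
    - containerPort: 50000
# the pipeline is then submitted with
# --environment_type=EXTERNAL --environment_config=localhost:50000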

