Posted to user@flink.apache.org by Andrew Otto <ot...@wikimedia.org> on 2023/02/10 14:08:12 UTC

Pyflink Side Output Question and/or suggested documentation change

I have a question about side outputs and OutputTags in PyFlink. The docs
<https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/side_output/>
say we are supposed to

yield output_tag, value

Docs then say:
> For retrieving the side output stream you use getSideOutput(OutputTag) on
> the result of the DataStream operation.

From this, I'd expect that calling datastream.get_side_output would be
optional. However, it seems that if you do not call
datastream.get_side_output, the main datastream still contains the record
destined for the output tag, as a Tuple(output_tag, value).
This caused me great confusion for a while, as my downstream tasks would
break because of the unexpected Tuple type of the record.

Here's an example of the failure using side output and ProcessFunction in
the word count example.
<https://gist.github.com/ottomata/001df5df72eb1224c01c9827399fcbd7#file-pyflink_sideout_fail_word_count_example-py-L86-L100>

I'd expect that just yielding an output_tag would route those records to
a different datastream, but apparently this is not the case unless you call
get_side_output.
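
To make the behavior described above concrete, here is a small plain-Python
sketch. It is not PyFlink and makes no claim about PyFlink internals; the
OutputTag class, rejected_tag, and the run_* helpers are hypothetical
stand-ins that only model what I observed: tagged records stay in the main
output as (output_tag, value) tuples unless something splits them out.

```python
# Plain-Python illustration only. Nothing here is the PyFlink API;
# every name below is a hypothetical stand-in.

class OutputTag:
    """Hypothetical stand-in for pyflink.datastream.OutputTag."""
    def __init__(self, tag_id):
        self.tag_id = tag_id


rejected_tag = OutputTag("rejected")


def process_element(word):
    """Mimics a ProcessFunction body: short words go to the side output."""
    if len(word) < 3:
        yield rejected_tag, word  # tagged record, a 2-tuple
    else:
        yield word                # ordinary main-output record


def run_without_split(words):
    """Models the reported case where get_side_output is never called."""
    return [record for word in words for record in process_element(word)]


def run_with_split(words, tag):
    """Models the case where tagged records are separated from the main output."""
    main, side = [], []
    for record in run_without_split(words):
        if isinstance(record, tuple) and len(record) == 2 and record[0] is tag:
            side.append(record[1])
        else:
            main.append(record)
    return main, side


words = ["to", "be", "or", "not"]
mixed = run_without_split(words)                  # tuples mixed with plain strings
main, side = run_with_split(words, rejected_tag)  # cleanly separated
```

In the mixed list, a downstream step that expects a plain string receives a
tuple for every tagged record, which is the kind of type surprise that broke
my downstream tasks.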

If this is the expected behavior, perhaps the docs should be updated to say
so?

-Andrew Otto
 Wikimedia Foundation

Re: Pyflink Side Output Question and/or suggested documentation change

Posted by Andrew Otto <ot...@wikimedia.org>.

Thank you!

On Mon, Feb 13, 2023 at 5:55 AM Dian Fu <di...@gmail.com> wrote:

> Thanks Andrew, I think this is a valid suggestion. I will update the
> documentation~
>
> Regards,
> Dian
>

Re: Pyflink Side Output Question and/or suggested documentation change

Posted by Dian Fu <di...@gmail.com>.

Thanks Andrew, I think this is a valid suggestion. I will update the
documentation~

Regards,
Dian
