You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by Ruoyun Huang <ru...@google.com> on 2018/11/22 00:39:24 UTC

To create a WordCount-SideInput.java example?

Hi,

I am working on sideInput support in java reference runner (ULR) JIRA-2928
[1].
Although there are inline code snippet example [2] and unit tests [3], I
did not find
a good place showing a working example of SideInput(please correct me if I
am wrong).
I am thinking of creating one more WordCount example under example folder
[2].
In particular, in this example we show variants of a) sideinputs as a
scalar AND multimap, b) from pipeline data or created within code and c)
[OPTIONAL?] Streaming versus batch, if there are differences (this I am not
sure yet).

In the meanwhile, JIRA-2928 can also easily rely on such an example to
validate behaviors between portable/non-portable runners.

Would like to double check if is this a reasonable idea.

Even though SideInput is just one of our many many features, my
justification is that, it is commonly used, thus having a one-stop example
make it easier for new users.  That being said, is there a reason not to
have yet another WordCount example? (Another idea is to extend existing
WordCount.java, but that breaks its simplicity.)

If it is a good change to have, any suggestion on what else to include?

Thanks!

[1] https://issues.apache.org/jira/browse/BEAM-2928
[2]
sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ParDo.java#L160
[3]
sdks/java/harness/src/test/java/org/apache/beam/fn/harness/state/MultimapSideInputTest.java
[4] examples/java/src/main/java/org/apache/beam/examples

-- 
================
Ruoyun  Huang

Re: To create a WordCount-SideInput.java example?

Posted by Kenneth Knowles <ke...@apache.org>.
On Mon, Nov 26, 2018 at 1:32 PM Ruoyun Huang <ru...@google.com> wrote:

> Thanks Kenneth. Didn't look into subfolders, let me read a bit more.  And
> will look into the tests Luke pointed out as well.
>
> To make sure I understand your comments of "Side inputs _are_ different in
> streaming as *you* have to ...", are you saying either: 1) a user needs
> to use/treat SideInput API differently when handling streaming case, OR 2)
> Beam developments had to do the underlying implementations differently?
>

2) each runner has to do the execution differently

so

1) users might need to know about this, since their main input will wait
for side input data or for the window to expire; in batch the window is
always ready so there is no waiting

Kenn


>
>
>
> On Wed, Nov 21, 2018 at 7:50 PM Kenneth Knowles <ke...@apache.org> wrote:
>
>> I like the idea of a/many very simple example(s) of side inputs. There
>> are existing examples that use side inputs:
>>
>> $ cd examples/java/src/main/java/org/apache/beam/examples
>> $ grep -r withSideInput .
>> ./complete/TfIdf.java:                  .withSideInputs(totalDocuments));
>> ./complete/game/GameStats.java:
>> .withSideInputs(globalMeanScore));
>> ./complete/game/GameStats.java:
>> .withSideInputs(spammersView))
>> ./cookbook/FilterExamples.java:
>> .withSideInputs(globalMeanTemp));
>>
>> From just this grep It looks like all but one are broadcast scalar
>> values. I have not looked at them to see if they are too complex or too
>> trivial.
>>
>> Side inputs _are_ different in streaming as you have to pause the main
>> input or push back elements until a side input is ready for a window.
>>
>> I would suggest multiple simple examples each showing one way of using
>> side inputs. A particular thing to demonstrated might be a triggered
>> Combine.perKey() and tutorial that it requires a View.asMultimap() because
>> triggers result in duplicate entries for a key.
>>
>
>> Kenn
>>
>> On Wed, Nov 21, 2018 at 4:40 PM Ruoyun Huang <ru...@google.com> wrote:
>>
>>> Hi,
>>>
>>> I am working on sideInput support in java reference runner (ULR)
>>> JIRA-2928 [1].
>>> Although there are inline code snippet example [2] and unit tests [3], I
>>> did not find
>>> a good place showing a working example of SideInput(please correct me if
>>> I am wrong).
>>> I am thinking of creating one more WordCount example under example
>>> folder [2].
>>> In particular, in this example we show variants of a) sideinputs as a
>>> scalar AND multimap, b) from pipeline data or created within code and c)
>>> [OPTIONAL?] Streaming versus batch, if there are differences (this I am not
>>> sure yet).
>>>
>>> In the meanwhile, JIRA-2928 can also easily rely on such an example to
>>> validate behaviors between portable/non-portable runners.
>>>
>>> Would like to double check if is this a reasonable idea.
>>>
>>> Even though SideInput is just one of our many many features, my
>>> justification is that, it is commonly used, thus having a one-stop example
>>> make it easier for new users.  That being said, is there a reason not to
>>> have yet another WordCount example? (Another idea is to extend existing
>>> WordCount.java, but that breaks its simplicity.)
>>>
>>> If it is a good change to have, any suggestion on what else to include?
>>>
>>> Thanks!
>>>
>>> [1] https://issues.apache.org/jira/browse/BEAM-2928
>>> [2]
>>> sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ParDo.java#L160
>>> [3]
>>> sdks/java/harness/src/test/java/org/apache/beam/fn/harness/state/MultimapSideInputTest.java
>>> [4] examples/java/src/main/java/org/apache/beam/examples
>>>
>>> --
>>> ================
>>> Ruoyun  Huang
>>>
>>>
>
> --
> ================
> Ruoyun  Huang
>
>

Re: To create a WordCount-SideInput.java example?

Posted by Ruoyun Huang <ru...@google.com>.
Thanks Kenneth. Didn't look into subfolders, let me read a bit more.  And
will look into the tests Luke pointed out as well.

To make sure I understand your comments of "Side inputs _are_ different in
streaming as *you* have to ...", are you saying either: 1) a user needs to
use/treat SideInput API differently when handling streaming case, OR 2)
Beam developments had to do the underlying implementations differently?


On Wed, Nov 21, 2018 at 7:50 PM Kenneth Knowles <ke...@apache.org> wrote:

> I like the idea of a/many very simple example(s) of side inputs. There are
> existing examples that use side inputs:
>
> $ cd examples/java/src/main/java/org/apache/beam/examples
> $ grep -r withSideInput .
> ./complete/TfIdf.java:                  .withSideInputs(totalDocuments));
> ./complete/game/GameStats.java:
> .withSideInputs(globalMeanScore));
> ./complete/game/GameStats.java:
> .withSideInputs(spammersView))
> ./cookbook/FilterExamples.java:
> .withSideInputs(globalMeanTemp));
>
> From just this grep It looks like all but one are broadcast scalar values.
> I have not looked at them to see if they are too complex or too trivial.
>
> Side inputs _are_ different in streaming as you have to pause the main
> input or push back elements until a side input is ready for a window.
>
> I would suggest multiple simple examples each showing one way of using
> side inputs. A particular thing to demonstrated might be a triggered
> Combine.perKey() and tutorial that it requires a View.asMultimap() because
> triggers result in duplicate entries for a key.
>

> Kenn
>
> On Wed, Nov 21, 2018 at 4:40 PM Ruoyun Huang <ru...@google.com> wrote:
>
>> Hi,
>>
>> I am working on sideInput support in java reference runner (ULR)
>> JIRA-2928 [1].
>> Although there are inline code snippet example [2] and unit tests [3], I
>> did not find
>> a good place showing a working example of SideInput(please correct me if
>> I am wrong).
>> I am thinking of creating one more WordCount example under example folder
>> [2].
>> In particular, in this example we show variants of a) sideinputs as a
>> scalar AND multimap, b) from pipeline data or created within code and c)
>> [OPTIONAL?] Streaming versus batch, if there are differences (this I am not
>> sure yet).
>>
>> In the meanwhile, JIRA-2928 can also easily rely on such an example to
>> validate behaviors between portable/non-portable runners.
>>
>> Would like to double check if is this a reasonable idea.
>>
>> Even though SideInput is just one of our many many features, my
>> justification is that, it is commonly used, thus having a one-stop example
>> make it easier for new users.  That being said, is there a reason not to
>> have yet another WordCount example? (Another idea is to extend existing
>> WordCount.java, but that breaks its simplicity.)
>>
>> If it is a good change to have, any suggestion on what else to include?
>>
>> Thanks!
>>
>> [1] https://issues.apache.org/jira/browse/BEAM-2928
>> [2]
>> sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ParDo.java#L160
>> [3]
>> sdks/java/harness/src/test/java/org/apache/beam/fn/harness/state/MultimapSideInputTest.java
>> [4] examples/java/src/main/java/org/apache/beam/examples
>>
>> --
>> ================
>> Ruoyun  Huang
>>
>>

-- 
================
Ruoyun  Huang

Re: To create a WordCount-SideInput.java example?

Posted by Lukasz Cwik <lc...@google.com>.
Examples are good for showing users how to use certain concepts but we
should stick with ValidatesRunner tests for ensuring that runners / SDKs
implement concepts correctly. We have several ValidatesRunner side input
tests in ParDoTest.java[1], ViewTest.java[2], and sideinputs_test.py[3]
that could be used to validate the ULR.

1:
https://github.com/apache/beam/blob/658630d8f75281f50dc434f2e062e2819ebeff84/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/ParDoTest.java#L710
2:
https://github.com/apache/beam/blob/master/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/ViewTest.java
3:
https://github.com/apache/beam/blob/70259e13e83ce3e2bb8d9ee9731e09f9ff456b05/sdks/python/apache_beam/transforms/sideinputs_test.py

On Wed, Nov 21, 2018 at 7:50 PM Kenneth Knowles <ke...@apache.org> wrote:

> I like the idea of a/many very simple example(s) of side inputs. There are
> existing examples that use side inputs:
>
> $ cd examples/java/src/main/java/org/apache/beam/examples
> $ grep -r withSideInput .
> ./complete/TfIdf.java:                  .withSideInputs(totalDocuments));
> ./complete/game/GameStats.java:
> .withSideInputs(globalMeanScore));
> ./complete/game/GameStats.java:
> .withSideInputs(spammersView))
> ./cookbook/FilterExamples.java:
> .withSideInputs(globalMeanTemp));
>
> From just this grep It looks like all but one are broadcast scalar values.
> I have not looked at them to see if they are too complex or too trivial.
>
> Side inputs _are_ different in streaming as you have to pause the main
> input or push back elements until a side input is ready for a window.
>
> I would suggest multiple simple examples each showing one way of using
> side inputs. A particular thing to demonstrated might be a triggered
> Combine.perKey() and tutorial that it requires a View.asMultimap() because
> triggers result in duplicate entries for a key.
>
> Kenn
>
> On Wed, Nov 21, 2018 at 4:40 PM Ruoyun Huang <ru...@google.com> wrote:
>
>> Hi,
>>
>> I am working on sideInput support in java reference runner (ULR)
>> JIRA-2928 [1].
>> Although there are inline code snippet example [2] and unit tests [3], I
>> did not find
>> a good place showing a working example of SideInput(please correct me if
>> I am wrong).
>> I am thinking of creating one more WordCount example under example folder
>> [2].
>> In particular, in this example we show variants of a) sideinputs as a
>> scalar AND multimap, b) from pipeline data or created within code and c)
>> [OPTIONAL?] Streaming versus batch, if there are differences (this I am not
>> sure yet).
>>
>> In the meanwhile, JIRA-2928 can also easily rely on such an example to
>> validate behaviors between portable/non-portable runners.
>>
>> Would like to double check if is this a reasonable idea.
>>
>> Even though SideInput is just one of our many many features, my
>> justification is that, it is commonly used, thus having a one-stop example
>> make it easier for new users.  That being said, is there a reason not to
>> have yet another WordCount example? (Another idea is to extend existing
>> WordCount.java, but that breaks its simplicity.)
>>
>> If it is a good change to have, any suggestion on what else to include?
>>
>> Thanks!
>>
>> [1] https://issues.apache.org/jira/browse/BEAM-2928
>> [2]
>> sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ParDo.java#L160
>> [3]
>> sdks/java/harness/src/test/java/org/apache/beam/fn/harness/state/MultimapSideInputTest.java
>> [4] examples/java/src/main/java/org/apache/beam/examples
>>
>> --
>> ================
>> Ruoyun  Huang
>>
>>

Re: To create a WordCount-SideInput.java example?

Posted by Kenneth Knowles <ke...@apache.org>.
I like the idea of a/many very simple example(s) of side inputs. There are
existing examples that use side inputs:

$ cd examples/java/src/main/java/org/apache/beam/examples
$ grep -r withSideInput .
./complete/TfIdf.java:                  .withSideInputs(totalDocuments));
./complete/game/GameStats.java:
.withSideInputs(globalMeanScore));
./complete/game/GameStats.java:
.withSideInputs(spammersView))
./cookbook/FilterExamples.java:
.withSideInputs(globalMeanTemp));

From just this grep It looks like all but one are broadcast scalar values.
I have not looked at them to see if they are too complex or too trivial.

Side inputs _are_ different in streaming as you have to pause the main
input or push back elements until a side input is ready for a window.

I would suggest multiple simple examples each showing one way of using side
inputs. A particular thing to demonstrated might be a triggered
Combine.perKey() and tutorial that it requires a View.asMultimap() because
triggers result in duplicate entries for a key.

Kenn

On Wed, Nov 21, 2018 at 4:40 PM Ruoyun Huang <ru...@google.com> wrote:

> Hi,
>
> I am working on sideInput support in java reference runner (ULR) JIRA-2928
> [1].
> Although there are inline code snippet example [2] and unit tests [3], I
> did not find
> a good place showing a working example of SideInput(please correct me if I
> am wrong).
> I am thinking of creating one more WordCount example under example folder
> [2].
> In particular, in this example we show variants of a) sideinputs as a
> scalar AND multimap, b) from pipeline data or created within code and c)
> [OPTIONAL?] Streaming versus batch, if there are differences (this I am not
> sure yet).
>
> In the meanwhile, JIRA-2928 can also easily rely on such an example to
> validate behaviors between portable/non-portable runners.
>
> Would like to double check if is this a reasonable idea.
>
> Even though SideInput is just one of our many many features, my
> justification is that, it is commonly used, thus having a one-stop example
> make it easier for new users.  That being said, is there a reason not to
> have yet another WordCount example? (Another idea is to extend existing
> WordCount.java, but that breaks its simplicity.)
>
> If it is a good change to have, any suggestion on what else to include?
>
> Thanks!
>
> [1] https://issues.apache.org/jira/browse/BEAM-2928
> [2]
> sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ParDo.java#L160
> [3]
> sdks/java/harness/src/test/java/org/apache/beam/fn/harness/state/MultimapSideInputTest.java
> [4] examples/java/src/main/java/org/apache/beam/examples
>
> --
> ================
> Ruoyun  Huang
>
>