Posted to dev@beam.apache.org by Lukasz Cwik <lc...@google.com.INVALID> on 2016/07/12 13:24:05 UTC

Re: [jira] [Commented] (BEAM-434) When examples write output to file it creates many output files instead of one

If we go with any option that restricts the number of output files, then
the example should explain what that option does and why restricting the
shard count is generally not considered a good thing.
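
To make the trade-off concrete, here is a minimal plain-Java sketch (not
Beam code) of what a shard-count option does to the output. The class
ShardedWriteSketch and its helper methods are hypothetical and exist only
for illustration. The --numOutputShards flag and its default of 1 come from
the proposal quoted below, and the "-SSSSS-of-NNNNN" file naming is an
assumption modelled on TextIO's default shard template. In Beam itself the
corresponding knobs would be withNumShards(n) or withoutSharding() on the
TextIO write transform.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical illustration only; none of these names exist in Beam. */
final class ShardedWriteSketch {

    /** Reads --numOutputShards=N from args; defaults to 1 as proposed in the thread. */
    static int parseNumOutputShards(String[] args) {
        final String flag = "--numOutputShards=";
        for (String arg : args) {
            if (arg.startsWith(flag)) {
                int n = Integer.parseInt(arg.substring(flag.length()));
                if (n < 1) {
                    throw new IllegalArgumentException("numOutputShards must be >= 1");
                }
                return n;
            }
        }
        return 1;
    }

    /** Formats a shard file name in the assumed "-SSSSS-of-NNNNN" style. */
    static String shardName(String prefix, int shard, int numShards) {
        return String.format(Locale.ROOT, "%s-%05d-of-%05d", prefix, shard, numShards);
    }

    /** Splits records across numShards buckets by hash; each bucket stands for one file. */
    static List<List<String>> partition(List<String> records, int numShards) {
        List<List<String>> shards = new ArrayList<>();
        for (int i = 0; i < numShards; i++) {
            shards.add(new ArrayList<>());
        }
        for (String record : records) {
            shards.get(Math.floorMod(record.hashCode(), numShards)).add(record);
        }
        return shards;
    }

    public static void main(String[] args) {
        int numShards = parseNumOutputShards(args);
        List<String> counts = Arrays.asList("the: 3", "quick: 1", "fox: 2");
        List<List<String>> shards = partition(counts, numShards);
        for (int i = 0; i < shards.size(); i++) {
            System.out.println(shardName("/path/to/output", i, numShards) + " <- " + shards.get(i));
        }
    }
}
```

Run with no flag, everything lands in a single /path/to/output-00000-of-00001.
With --numOutputShards=3 the same records are spread over three files, which
is the behaviour the example text would need to explain.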

On Tue, Jul 12, 2016 at 2:11 AM, Amit Sela (JIRA) <ji...@apache.org> wrote:

>
>     [
> https://issues.apache.org/jira/browse/BEAM-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15372225#comment-15372225
> ]
>
> Amit Sela commented on BEAM-434:
> --------------------------------
>
> I sort of prefer option 2, but with the user able to pass the numShards
> configuration (which may need a better name).
> As I mentioned in the PR, if we want the examples to give a simple result
> while still keeping in the user's mind that multiple shards are something
> to consider, we could add a --numShards option to the examples code with
> a default of 1 (or 3).
> If we want users to know about multiple output shards, why should we
> keep the examples "pure"?
>
> How about adding an option named "--numOutputShards" with a default value
> of 1 (or 3, I could live with 3 :) ) and documenting it in the examples
> README? That gives a better experience in terms of "seeing" the output
> while keeping multiple shards "on the table". As a bonus, the Travis CI
> tests could still run with as many shards as we want (I wanted the
> examples to be easy enough, but I definitely didn't want that for Travis!)
>
> WDYT?
>
>
> > When examples write output to file it creates many output files instead of one
> > ------------------------------------------------------------------------------
> >
> >                 Key: BEAM-434
> >                 URL: https://issues.apache.org/jira/browse/BEAM-434
> >             Project: Beam
> >          Issue Type: Bug
> >          Components: examples-java
> >            Reporter: Amit Sela
> >            Assignee: Amit Sela
> >            Priority: Minor
> >
> > When using `TextIO.Write.to("/path/to/output")` without any restrictions
> > on the number of shards, it might generate many output files (depending
> > on your input). For WordCount, for example, you'll get as many output
> > files as unique words in your input.
> > Since I think examples are expected to run in a friendly manner that lets
> > you "see" what they do, rather than being optimized for performance, I
> > suggest using `withoutSharding()` when writing the example output to a
> > file.
> > Examples I could find that behave this way:
> > org.apache.beam.examples.WordCount
> > org.apache.beam.examples.complete.TfIdf
> > org.apache.beam.examples.cookbook.DeDupExample
>