Posted to dev@apex.apache.org by Bhupesh Chawda <bh...@datatorrent.com> on 2017/02/15 12:02:29 UTC

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

To better understand the use case for control tuples in batch, I am
creating a prototype for a batch application using the File Input and File
Output operators.

To enable basic batch processing, I am proposing the following changes to
the File Input and File Output operators:
1. File Input operator emits a watermark each time it opens and closes a
file. These can be "start file" and "end file" watermarks which include the
corresponding file names. The "start file" tuple should be sent before any
of the data from that file flows.
2. File Input operator can be configured to end the application after a
single scan or n scans of the directory (a batch). At this point the operator
emits the final watermark (the end-of-application control tuple), which also
shuts down the application.
3. The File Output operator handles these control tuples. The "start file"
watermark initializes the file name for the incoming tuples; the "end file"
watermark forces finalization of that file.
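
As a rough illustration of points 1 and 3, the control tuple could be a small
class along these lines (the class and method names are illustrative only, not
an existing Malhar or Apex API):

public class FileControlTuple
{
  public enum Type { START_FILE, END_FILE }

  private Type type;
  private String fileName;

  private FileControlTuple()
  {
    // no-arg constructor for Kryo serialization
  }

  public FileControlTuple(Type type, String fileName)
  {
    this.type = type;
    this.fileName = fileName;
  }

  public Type getType() { return type; }

  public String getFileName() { return fileName; }
}

The File Input operator would emit a START_FILE tuple before the first record
of a file and an END_FILE tuple after the last one; on END_FILE the File
Output operator would run its existing finalization step (renaming the .tmp
part file) for that file name.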

The user would be able to enable the operators to send only those
watermarks that are needed in the application. If none of the options are
configured, the operators behave as in a streaming application.

There are a few challenges in the implementation where the input operator
is partitioned. In this case, the correlation between the start/end for a
file and the data tuples for that file is lost. Hence we need to maintain
the filename as part of each tuple in the pipeline.
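
A minimal sketch of carrying the file name with every record (again, just an
illustrative wrapper, not an existing operator contract):

public class FileRecord<T>
{
  private String fileName; // file this record was read from
  private T record;        // the actual payload

  private FileRecord()
  {
    // no-arg constructor for Kryo serialization
  }

  public FileRecord(String fileName, T record)
  {
    this.fileName = fileName;
    this.record = record;
  }

  public String getFileName() { return fileName; }

  public T getRecord() { return record; }
}

With this, any partition of the File Output operator can route a record to the
correct part file, regardless of which input partition read it.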

The "start file" and "end file" control tuples in this example are
temporary names for watermarks. We can have generic "start batch" / "end
batch" tuples which could be used for other use cases as well. The Final
watermark is common and serves the same purpose in each case.

Please let me know your thoughts on this.

~ Bhupesh



On Wed, Jan 18, 2017 at 12:22 AM, Bhupesh Chawda <bh...@datatorrent.com>
wrote:

> Yes, this can be part of operator configuration. Given this, defining a
> batch application would mean configuring the connectors (mostly the input
> operator) in the application for the desired behavior. Similarly, use cases
> other than batch could be achieved in the same way.
>
> We may also need to take care of the following:
> 1. Make sure that the watermarks or control tuples are consistent across
> sources. Meaning an HDFS sink should be able to interpret the watermark
> tuple sent out by, say, a JDBC source.
> 2. In addition to I/O connectors, we should also look at the need for
> processing operators to understand some of the control tuples / watermarks.
> For example, we may want to reset the operator behavior on arrival of some
> watermark tuple.
>
> ~ Bhupesh
>
> On Tue, Jan 17, 2017 at 9:59 PM, Thomas Weise <th...@apache.org> wrote:
>
>> The HDFS source can operate in two modes, bounded or unbounded. If you scan
>> only once, then it should emit the final watermark after it is done.
>> Otherwise it would emit watermarks based on a policy (file names etc.).
>> The mechanism to generate the marks may depend on the type of source and
>> the user needs to be able to influence/configure it.
>>
>> Thomas
>>
>>
>> On Tue, Jan 17, 2017 at 5:03 AM, Bhupesh Chawda <bh...@datatorrent.com>
>> wrote:
>>
>> > Hi Thomas,
>> >
>> > I am not sure that I completely understand your suggestion. Are you
>> > suggesting to broaden the scope of the proposal to treat all sources as
>> > bounded as well as unbounded?
>> >
>> > In case of Apex, we treat all sources as unbounded sources. Even bounded
>> > sources like the HDFS file source are treated as unbounded, by scanning
>> > the input directory repeatedly.
>> >
>> > Let's consider the HDFS file source as an example:
>> > In this case, if we treat it as a bounded source, we can define hooks
>> > which allow us to detect the end of the file and send the "final
>> > watermark". We could also consider the HDFS file source as a streaming
>> > source and define hooks which send watermarks based on different kinds
>> > of windows.
>> >
>> > Please correct me if I misunderstand.
>> >
>> > ~ Bhupesh
>> >
>> >
>> > On Mon, Jan 16, 2017 at 9:23 PM, Thomas Weise <th...@apache.org> wrote:
>> >
>> > > Bhupesh,
>> > >
>> > > Please see how that can be solved in a unified way using windows and
>> > > watermarks. It is bounded data vs. unbounded data. In Beam for example,
>> > > you can use the "global window" and the final watermark to accomplish
>> > > what you are looking for. Batch is just a special case of streaming
>> > > where the source emits the final watermark.
>> > >
>> > > Thanks,
>> > > Thomas
>> > >
>> > >
>> > > On Mon, Jan 16, 2017 at 1:02 AM, Bhupesh Chawda <bhupesh@datatorrent.com>
>> > > wrote:
>> > >
>> > > > Yes, if the user needs to develop a batch application, then batch
>> > > > aware operators need to be used in the application.
>> > > > The nature of the application is mostly controlled by the input and
>> > > > the output operators used in the application.
>> > > >
>> > > > For example, consider an application which needs to filter records
>> > > > in an input file and store the filtered records in another file.
>> > > > The nature of this app is to end once the entire file is processed.
>> > > > Following things are expected of the application:
>> > > >
>> > > >    1. Once the input data is over, finalize the output file from
>> > > >    .tmp files. - Responsibility of output operator
>> > > >    2. End the application, once the data is read and processed -
>> > > >    Responsibility of input operator
>> > > >
>> > > > These functions are essential to allow the user to do higher level
>> > > > operations like scheduling or running a workflow of batch
>> > > > applications.
>> > > >
>> > > > I am not sure about intermediate (processing) operators, as there is
>> > > > no change in their functionality for batch use cases. Perhaps,
>> > > > allowing running multiple batches in a single application may
>> > > > require similar changes in processing operators as well.
>> > > >
>> > > > ~ Bhupesh
>> > > >
>> > > > On Mon, Jan 16, 2017 at 2:19 PM, Priyanka Gugale <priyag@apache.org>
>> > > > wrote:
>> > > >
>> > > > > Will it make an impression on the user that, if he has a batch use
>> > > > > case, he has to use batch aware operators only? If so, is that what
>> > > > > we expect? I am not aware of how we implement the batch scenario, so
>> > > > > this might be a basic question.
>> > > > >
>> > > > > -Priyanka
>> > > > >
>> > > > > On Mon, Jan 16, 2017 at 12:02 PM, Bhupesh Chawda <
>> > > > > bhupesh@datatorrent.com> wrote:
>> > > > >
>> > > > > > Hi All,
>> > > > > >
>> > > > > > While design / implementation for custom control tuples is
>> > > > > > ongoing, I thought it would be a good idea to consider its
>> > > > > > usefulness in one of the use cases - batch applications.
>> > > > > >
>> > > > > > This is a proposal to adapt / extend existing operators in the
>> > > > > > Apache Apex Malhar library so that it is easy to use them in
>> > > > > > batch use cases. Naturally, this would be applicable for only a
>> > > > > > subset of operators like File, JDBC and NoSQL databases.
>> > > > > > For example, for a file based store (say HDFS store), we could
>> > > > > > have FileBatchInput and FileBatchOutput operators which allow
>> > > > > > easy integration into a batch application. These operators would
>> > > > > > be extended from their existing implementations and would be
>> > > > > > "Batch Aware", in that they may understand the meaning of some
>> > > > > > specific control tuples that flow through the DAG. Start batch
>> > > > > > and end batch seem to be the obvious candidates that come to
>> > > > > > mind. On receipt of such control tuples, they may try to modify
>> > > > > > the behavior of the operator - to reinitialize some metrics or
>> > > > > > finalize an output file for example.
>> > > > > >
>> > > > > > We can discuss the potential control tuples and actions in
>> > > > > > detail, but first I would like to understand the views of the
>> > > > > > community for this proposal.
>> > > > > >
>> > > > > > ~ Bhupesh
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by AJAY GUPTA <aj...@gmail.com>.
Hi all,

As per the discussion on this thread, we can conclude with the following
points.


1) Batch control tuples will need to be handled separately from watermark
tuples.
2) Add support for start batch and stop batch control tuples
3) Add support for reset control tuples to indicate to windowed operators
to reset the watermark.
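
As a rough sketch of how the start/stop batch handling could look, a batch
aware operator might dispatch on the two tuple types as below. The tuple
classes and the handler method are illustrative only; the actual custom
control tuple API being designed in Apex core may differ:

class StartBatchControlTuple
{
  long batchId;
}

class EndBatchControlTuple
{
  long batchId;
}

public class BatchAwareOperatorSketch
{
  public void processControlTuple(Object tuple)
  {
    if (tuple instanceof StartBatchControlTuple) {
      beginNewBatch(((StartBatchControlTuple)tuple).batchId);
    } else if (tuple instanceof EndBatchControlTuple) {
      finalizeBatch(((EndBatchControlTuple)tuple).batchId);
    }
  }

  private void beginNewBatch(long batchId)
  {
    // (re)initialize per-batch state: metrics, output file names, etc.
  }

  private void finalizeBatch(long batchId)
  {
    // flush and finalize everything belonging to the finished batch,
    // e.g. finalize files, commit a JDBC transaction, or reset a
    // windowed operator's watermark (point 3)
  }
}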



Ajay

On Thu, May 11, 2017 at 10:13 PM, Bhupesh Chawda <bh...@datatorrent.com>
wrote:

> I think we should have the support to allow running multiple batches
> through the same DAG.
>
> Resetting the watermark to the initial watermark seems like a good idea.
> The windowed operator needs to understand the start/end batch control tuple
> and reset the watermark.
>
> ~ Bhupesh
>
>
>
> _______________________________________________________
>
> Bhupesh Chawda
>
> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>
> www.datatorrent.com  |  apex.apache.org
>
>
>
> On Thu, May 11, 2017 at 7:37 PM, Thomas Weise <th...@apache.org> wrote:
>
> > Usually batches are processed by different instances of a topology. First
> > of all we should agree that running multiple batches through the same DAG
> > *in sequence* is something that we want to address.
> >
> > If yes, then there is the reset problem you are referring to, and it only
> > occurs when you want to do event time processing per batch, because here
> > you cannot repurpose the time component to segregate batches.
> >
> > 2 possible options that come to mind:
> >
> > - Reset the watermark to the initial watermark, which basically means
> that
> > instead of a "shutdown" tuple there is a "reset" tuple.
> > - Schedule separate operators/pipelines for each batch run.
> >
> > Thomas
> >
> >
> > On Tue, May 9, 2017 at 5:25 AM, AJAY GUPTA <aj...@gmail.com> wrote:
> >
> > > After some discussion and trying out the approach discussed above, it
> > > seems we would need to separate out the concepts of Watermarks and
> > > Batch Control tuples.
> > > The windowed operator needs to be modified to understand batch control
> > > tuples.
> > >
> > > Even if we have watermark tuples which also include batch information,
> > > the windowed operator will fail when the source data is event time
> > > based. This is because in this scenario, there are two notions of time
> > > in the watermark:
> > > 1. Time used to denote the file / batch boundary
> > > 2. The event time in the data.
> > >
> > > For this reason, it makes sense to separate the concepts of batch
> > > tuples (start something / end something) from the watermark tuples
> > > (which essentially deal with event times).
> > >
> > > We could argue for having a watermark tuple indicating the end of the
> > > batch - a final watermark (with time = Long.MAX) which would finalize
> > > all windows in the windowed operator. However, if a next batch needs to
> > > be processed subsequently by the same windowed operator, we would need
> > > to reset the state of the operator as it has moved ahead in the event
> > > time domain. The batch control tuples can do this resetting of state
> > > (in other words, preparation for processing a new batch of data).
> > >
> > > As an example, consider telecom data logs for the same 24 hours of two
> > > regions (A and B) which are to be processed as a batch. After
> > > processing data records from region A, a "final" watermark would be
> > > emitted indicating the end of all data from region A. Now, unless we
> > > clear the windowed operator's state information (current watermark,
> > > data storage), the data records from region B will not be processed.
> > > In such a scenario, receiving an end batch control tuple can indicate
> > > to the operator to reset its state.
> > >
> > >
> > > Ajay
> > >
> > >
> > > On Sun, Apr 30, 2017 at 4:32 AM, Vlad Rozov <v....@datatorrent.com>
> > > wrote:
> > >
> > > > public static class Pojo implements Tuple
> > > > {
> > > >   @Override
> > > >   public Object getValue()
> > > >   {
> > > >     return this;
> > > >   }
> > > > }
> > > >
> > > > @Override
> > > > public void populateDAG(DAG dag, Configuration conf)
> > > > {
> > > >   CsvParser csvParser = dag.addOperator("csvParser",
> CsvParser.class);
> > > >   WindowedOperatorImpl<Pojo, Pojo, Pojo> windowedOperator =
> > > > dag.addOperator("windowOperator", WindowedOperatorImpl.class);
> > > >   dag.addStream("csvToWindowed", csvParser.out, new
> > > > InputPort[]{windowedOperator.input});
> > > > }
> > > >
> > > >
> > > > Thank you,
> > > >
> > > > Vlad
> > > >
> > > > On 4/29/17 15:20, AJAY GUPTA wrote:
> > > >
> > > >> Even this will not work because the output port of CsvParser is of
> > > >> type Object. Even though Customer extends Tuple<Object>, it will
> > > >> still fail to work since Tuple<Object> gets output as Object.
> > > >>
> > > >> *DefaultOutputPort<Object> output = new DefaultOutputPort<Object>();*
> > > >>
> > > >> The input port type at windowed operator with InputT = Object :
> > > >> *DefaultInputPort<Tuple<Object>>*
> > > >>
> > > >>
> > > >> Ajay
> > > >>
> > > >>
> > > >> On Sun, Apr 30, 2017 at 1:45 AM, Vlad Rozov <
> v.rozov@datatorrent.com>
> > > >> wrote:
> > > >>
> > > >>> Use Object in place of InputT in the WindowedOperatorImpl. Cast
> > > >>> Object to the actual type of InputT at runtime. Introducing an
> > > >>> operator just to do a cast is not a good design decision, IMO.
> > > >>>
> > > >>> Thank you,
> > > >>> Vlad
> > > >>>
> > > >>> Sent from my iPhone
> > > >>>
> > > >>> On Apr 29, 2017, at 02:50, AJAY GUPTA <aj...@gmail.com> wrote:
> > > >>>
> > > >>>> I am using WindowedOperatorImpl and it is declared as follows.
> > > >>>>
> > > >>>> WindowedOperatorImpl<InputT, AccumulationType, OutputType>
> > > >>>> windowedOperator = new WindowedOperatorImpl<>();
> > > >>>>
> > > >>>> In my application scenario, the InputT is Customer POJO which is
> > > >>>> getting output as an Object by CsvParser.
> > > >>>>
> > > >>>>
> > > >>>> Ajay
> > > >>>>
> > > >>>> On Fri, Apr 28, 2017 at 11:53 PM, Vlad Rozov <v.rozov@datatorrent.com>
> > > >>>> wrote:
> > > >>>>
> > > >>>>> How do you declare WindowedOperator?
> > > >>>>>
> > > >>>>> Thank you,
> > > >>>>>
> > > >>>>> Vlad
> > > >>>>>
> > > >>>>>
> > > >>>>> On 4/28/17 10:35, AJAY GUPTA wrote:
> > > >>>>>
> > > >>>>>> Vlad,
> > > >>>>>>
> > > >>>>>> The approach you suggested doesn't work because the CSVParser
> > > >>>>>> outputs Object Data Type irrespective of the POJO class being
> > > >>>>>> emitted.
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> Ajay
> > > >>>>>>
> > > >>>>>> On Fri, Apr 28, 2017 at 8:13 PM, Vlad Rozov <v.rozov@datatorrent.com>
> > > >>>>>> wrote:
> > > >>>>>>
> > > >>>>>>> Make your POJO class implement WindowedOperator Tuple interface
> > > >>>>>>> (it may return itself in getValue()).
> > > >>>>>>>
> > > >>>>>>> Thank you,
> > > >>>>>>>
> > > >>>>>>> Vlad
> > > >>>>>>>
> > > >>>>>>> On 4/28/17 02:44, AJAY GUPTA wrote:
> > > >>>>>>>
> > > >>>>>>>> Hi All,
> > > >>>>>>>>
> > > >>>>>>>> I am creating an application which is using Windowed Operator.
> > > >>>>>>>> This application involves CsvParser operator emitting a POJO
> > > >>>>>>>> object which is to be passed as input to WindowedOperator. The
> > > >>>>>>>> WindowedOperator requires an instance of Tuple class as input :
> > > >>>>>>>> *public final transient DefaultInputPort<Tuple<InputT>>
> > > >>>>>>>> input = new DefaultInputPort<Tuple<InputT>>() *
> > > >>>>>>>>
> > > >>>>>>>> Due to this, the addStream cannot work as the output of
> > > >>>>>>>> CsvParser's output port is not compatible with input port type
> > > >>>>>>>> of WindowedOperator.
> > > >>>>>>>> One way to solve this problem is to have an operator between
> > > >>>>>>>> the above two operators as a convertor.
> > > >>>>>>>> I would like to know if there is any other more generic
> > > >>>>>>>> approach to solve this problem without writing a new Operator
> > > >>>>>>>> for every new application using Windowed Operators.
> > > >>>>>>>>
> > > >>>>>>>> Thanks,
> > > >>>>>>>> Ajay
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> On Thu, Mar 23, 2017 at 5:25 PM, Bhupesh Chawda <
> > > >>>>>>>> bhupesh@datatorrent.com> wrote:
> > > >>>>>>>>
> > > >>>>>>>>> Hi All,
> > > >>>>>>>>>
> > > >>>>>>>>> I think we have some agreement on the way we should use
> > > >>>>>>>>> control tuples for File I/O operators to support batch.
> > > >>>>>>>>>
> > > >>>>>>>>> In order to have more operators in Malhar support this
> > > >>>>>>>>> paradigm, I think we should also look at store operators -
> > > >>>>>>>>> JDBC, Cassandra, HBase etc.
> > > >>>>>>>>> The case with these operators is simpler as most of these do
> > > >>>>>>>>> not poll the sources (except JDBC poller operator) and just
> > > >>>>>>>>> stop once they have read a fixed amount of data. In other
> > > >>>>>>>>> words, these are inherently batch sources.
> > > >>>>>>>>> The only change that we should add to these operators is to
> > > >>>>>>>>> shut down the DAG once the reading of data is done. For a
> > > >>>>>>>>> windowed operator this would mean a Global window with a final
> > > >>>>>>>>> watermark before the DAG is shut down.
> > > >>>>>>>>>
> > > >>>>>>>>> ~ Bhupesh
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> _______________________________________________________
> > > >>>>>>>>>
> > > >>>>>>>>> Bhupesh Chawda
> > > >>>>>>>>>
> > > >>>>>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> > > >>>>>>>>>
> > > >>>>>>>>> www.datatorrent.com  |  apex.apache.org
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> On Tue, Feb 28, 2017 at 10:59 PM, Bhupesh Chawda <
> > > >>>>>>>>> bhupesh@datatorrent.com> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>>> Hi Thomas,
> > > >>>>>>>>>>
> > > >>>>>>>>>> Even though the windowing operator is not just "event time",
> > > >>>>>>>>>> it seems it is too much dependent on the "time" attribute of
> > > >>>>>>>>>> the incoming tuple. This is the reason we had to model the
> > > >>>>>>>>>> file index as a timestamp to solve the batch case for files.
> > > >>>>>>>>>> Perhaps we should work on increasing the scope of the
> > > >>>>>>>>>> windowed operator to consider other types of windows as well.
> > > >>>>>>>>>> The Sequence option suggested by David seems to be something
> > > >>>>>>>>>> in that direction.
> > > >>>>>>>>>>
> > > >>>>>>>>>> ~ Bhupesh
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> _______________________________________________________
> > > >>>>>>>>>>
> > > >>>>>>>>>> Bhupesh Chawda
> > > >>>>>>>>>>
> > > >>>>>>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> > > >>>>>>>>>>
> > > >>>>>>>>>> www.datatorrent.com  |  apex.apache.org
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Tue, Feb 28, 2017 at 10:48 PM, Thomas Weise <thw@apache.org>
> > > >>>>>>>>>> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>>> That's correct, we are looking at a generalized approach for
> > > >>>>>>>>>>> state management vs. a series of special cases.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> And to be clear, windowing does not imply event time,
> > > >>>>>>>>>>> otherwise it would be "EventTimeOperator" :-)
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Thomas
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> On Tue, Feb 28, 2017 at 9:11 AM, Bhupesh Chawda <
> > > >>>>>>>>>>> bhupesh@datatorrent.com> wrote:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> Hi David,
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> I went through the discussion, but it seems like it is more
> > > >>>>>>>>>>>> on the event time watermark handling as opposed to batches.
> > > >>>>>>>>>>>> What we are trying to do is have watermarks serve the
> > > >>>>>>>>>>>> purpose of demarcating batches using control tuples. Since
> > > >>>>>>>>>>>> each batch is separate from others, we would like to have
> > > >>>>>>>>>>>> stateful processing within a batch, but not across batches.
> > > >>>>>>>>>>>> At the same time, we would like to do this in a manner
> > > >>>>>>>>>>>> which is consistent with the windowing mechanism provided
> > > >>>>>>>>>>>> by the windowed operator. This will allow us to treat a
> > > >>>>>>>>>>>> single batch as a (bounded) stream and apply all the event
> > > >>>>>>>>>>>> time windowing concepts in that time span.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> For example, let's say I need to process data for a day (24
> > > >>>>>>>>>>>> hours) as a single batch. The application is still
> > > >>>>>>>>>>>> streaming in nature: it would end the batch after a day and
> > > >>>>>>>>>>>> start a new batch the next day. At the same time, I would
> > > >>>>>>>>>>>> be able to have early trigger firings every minute as well
> > > >>>>>>>>>>>> as drop any data which is say, 5 mins late. All this within
> > > >>>>>>>>>>>> a single day.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> ~ Bhupesh
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> _______________________________________________________
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Bhupesh Chawda
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> www.datatorrent.com  |  apex.apache.org
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> On Tue, Feb 28, 2017 at 9:27 PM, David Yan <
> > > >>>>>>>>>>>> davidyan@gmail.com> wrote:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> There is a discussion in the Flink mailing list about
> > > >>>>>>>>>>>>> key-based watermarks. I think it's relevant to our use
> > > >>>>>>>>>>>>> case here.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> https://lists.apache.org/thread.html/2b90d5b1d5e2654212cfbbcc6510ef424bbafc4fadb164bd5aff9216@%3Cdev.flink.apache.org%3E
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> David
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> On Tue, Feb 28, 2017 at 2:13 AM, Bhupesh Chawda <
> > > >>>>>>>>>>>>> bhupesh@datatorrent.com> wrote:
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Hi David,
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> If using time window does not seem appropriate, we can
> > > >>>>>>>>>>>>>> have another class which is more suited for such
> > > >>>>>>>>>>>>>> sequential and distinct windows. Perhaps, a CustomWindow
> > > >>>>>>>>>>>>>> option can be introduced which takes in a window id. The
> > > >>>>>>>>>>>>>> purpose of this window option could be to translate the
> > > >>>>>>>>>>>>>> window id into appropriate timestamps.
> > > >>>>>>>>>>>>>> Another option would be to go with a custom
> > > >>>>>>>>>>>>>> timestampExtractor for such tuples which translates each
> > > >>>>>>>>>>>>>> unique file name to a distinct timestamp while using time
> > > >>>>>>>>>>>>>> windows in the windowed operator.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> ~ Bhupesh
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> _______________________________________________________
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Bhupesh Chawda
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> www.datatorrent.com  |  apex.apache.org
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> On Tue, Feb 28, 2017 at 12:28 AM, David Yan <
> > > >>>>>>>>>>>>>> davidyan@gmail.com> wrote:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> I now see your rationale on putting the filename in the
> > > >>>>>>>>>>>>>>> window.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> As far as I understand, the reasons why the filename is
> > > >>>>>>>>>>>>>>> not part of the key and the Global Window is not used
> > > >>>>>>>>>>>>>>> are:
> > > >>>>>>>>>>>>>>> 1) The files are processed in sequence, not in parallel
> > > >>>>>>>>>>>>>>> 2) The windowed operator should not keep the state
> > > >>>>>>>>>>>>>>> associated with the file when the processing of the file
> > > >>>>>>>>>>>>>>> is done
> > > >>>>>>>>>>>>>>> 3) The trigger should be fired for the file when a file
> > > >>>>>>>>>>>>>>> is done processing.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> However, if the file is just a sequence and has nothing
> > > >>>>>>>>>>>>>>> to do with a timestamp, assigning a timestamp to a file
> > > >>>>>>>>>>>>>>> is not an intuitive thing to do and would just create
> > > >>>>>>>>>>>>>>> confusion for the users, especially when it's used as an
> > > >>>>>>>>>>>>>>> example for new users.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> How about having a separate class called SequenceWindow?
> > > >>>>>>>>>>>>>>> And perhaps TimeWindow can inherit from it?
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> David
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> On Mon, Feb 27, 2017 at 8:58 AM, Thomas Weise <
> > > thw@apache.org
> > > >>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> On Mon, Feb 27, 2017 at 8:50 AM, Bhupesh Chawda <
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> bhupesh@datatorrent.com
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> I think my comments related to count based windows
> might
> > be
> > > >>>>>>>>>>>>>>>> causing
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> confusion. Let's not discuss count based scenarios for
> > > now.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Just want to make sure we are on the same page wrt. the
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> "each
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> file
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> is a
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> batch" use case. As mentioned by Thomas, the each tuple
> from
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> same
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> file
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> has the same timestamp (which is just a sequence number)
> > and
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> that
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> helps
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> keep tuples from each file in a separate window.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Yes, in this case it is a sequence number, but it could
> > be a
> > > >>>>>>>>>>>>>>>> time
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> stamp
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>> also, depending on the file naming convention. And if it
> > was
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> event
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> time
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>> processing, the watermark would be derived from records
> > > within
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> file.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Agreed, the source should have a mechanism to control the
> > > time
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> stamp
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> extraction along with everything else pertaining to the
> > > >>>>>>>>>>>>>> watermark
> > > >>>>>>>>>>>>>> generation.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> We could also implement a "timestampExtractor"
> interface
> > to
> > > >>>>>>>>>>>>>>>> identify
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>> timestamp (sequence number) for a file.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> ~ Bhupesh
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> ______________________________
> > _________________________
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Bhupesh Chawda
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> www.datatorrent.com  |  apex.apache.org
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> On Mon, Feb 27, 2017 at 9:52 PM, Thomas Weise <
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> thw@apache.org
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> I don't think this is a use case for count based window.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> We have multiple files that are retrieved in a sequence
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> and
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> there
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> is
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> no
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> knowledge of the number of records per file. The
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> requirement is
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> to
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> aggregate each file separately and emit the aggregate
> > when
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> file
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> is
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> read
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> fully. There is no concept of "end of something" for
> an
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> individual
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> key
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> and
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> global window isn't applicable.
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> However, as already explained and implemented by
> > > Bhupesh,
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> this
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> can
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> be
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> solved using watermark and window (in this case the
> window
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> timestamp
> > > >>>>>>>>>>>>>>>> isn't
> > > >>>>>>>>>>>>>>>> a timestamp, but a file sequence, but that doesn't
> > matter.
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Thomas
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> On Mon, Feb 27, 2017 at 8:05 AM, David Yan <
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> davidyan@gmail.com
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> I don't think this is the way to go. Global Window only
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> means
> > > >>>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> timestamp
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> does not matter (or that there is no timestamp). It does
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> not
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> necessarily
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> mean it's a large batch. Unless there is some notion
> of
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> event
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> time
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> for
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> each
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> file, you don't want to embed the file into the
> window
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> itself.
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> If you want the result broken up by file name, and
> if
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> files
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> are
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> to
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> be
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> processed in parallel, I think making the file name
> be
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> part
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> of
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> key
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> is
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> the way to go. I think it's very confusing if we
> > somehow
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> make
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> file
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> to
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> be part of the window.
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> For count-based window, it's not implemented yet and
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> you're
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> welcome
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> to
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> add
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> that feature. In case of count-based windows, there
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> would
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> be
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> no
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> notion
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> of
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> time and you probably only trigger at the end of each
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> window.
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> In
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> case
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> of count-based windows, the watermark only matters for
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> batch
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> since
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> you
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> need
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> a way to know when the batch has ended (if the count
> is
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> 10,
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> number
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> of
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> tuples in the batch is let's say 105, you need a way to
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> end
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> last
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> window
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> with 5 tuples).
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> David
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> On Mon, Feb 27, 2017 at 2:41 AM, Bhupesh Chawda <
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> bhupesh@datatorrent.com
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> Hi David,
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> Thanks for your comments.
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> The wordcount example that I created based on the
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> windowed
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> operator
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> does
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> processing of word counts per file (each file as a
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> separate
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> batch),
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> i.e.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> process counts for each file and dump into separate
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> files.
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> As I understand Global window is for one large
> batch;
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> i.e.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> all
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> incoming
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> data falls into the same batch. This could not be
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> processed
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> using
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> GlobalWindow option as we need more than one windows.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> In
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> this
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> case, I
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> configured the windowed operator to have time windows
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> of
> > > >>>>>>>>>>>>>>>>>> 1ms
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> each
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> and
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> passed data for each file with increasing timestamps:
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> (file1,
> > > >>>>>>>>>>>>>>>>>> 1),
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> (file2,
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> 2) and so on. Is there a better way of handling this
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> scenario?
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> Regarding (2 - count based windows), I think there
> is
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> a
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> trigger
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> option
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> to
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> process count based windows. In case I want to process
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> every
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> 1000
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> tuples
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> as
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> a batch, I could set the Trigger option to
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> CountTrigger
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> with
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> accumulation set to Discarding. Is this correct?
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> I agree that (4. Final Watermark) can be done using
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Global
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> window.
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> ​~ Bhupesh​
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> ______________________________
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> _________________________
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> Bhupesh Chawda
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> www.datatorrent.com  |  apex.apache.org
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> On Mon, Feb 27, 2017 at 12:18 PM, David Yan <
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> davidyan@gmail.com>
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> I'm worried that we are making the watermark concept
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> too
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> complicated.
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> Watermarks should simply just tell you what windows
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> can
> > > >>>>>>>>>>>>>>>>>>> be
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> considered
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> complete.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Point 2 is basically a count-based window.
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Watermarks
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> do
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> not
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> play a
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> role
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> here because the window is always complete at the
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> n-th
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> tuple.
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> If I understand correctly, point 3 is for batch
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> processing
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> of
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> files.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Unless
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> the files contain timed events, it sounds to be that
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> this
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> can
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> be
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> achieved
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> with just a Global Window. For signaling EOF, a
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> watermark
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> with
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> a
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> +infinity
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> timestamp can be used so that triggers will be fired
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> upon
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> receipt
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> of
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> that
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> watermark.
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> For point 4, just like what I mentioned above, can
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> be
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> achieved
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> with a
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> watermark with a +infinity timestamp.
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> David
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> On Sat, Feb 18, 2017 at 8:04 AM, Bhupesh Chawda <
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> bhupesh@datatorrent.com
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Hi Thomas,
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> For an input operator which is supposed to
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> generate
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> watermarks
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> for
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> downstream operators, I can think about the
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> following
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> watermarks
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> that
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> operator can emit:
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> 1. Time based watermarks (the high watermark /
> low
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> watermark)
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> 2. Number of tuple based watermarks (Every n
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> tuples)
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> 3. File based watermarks (Start file, end file)
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> 4. Final watermark
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> File based watermarks seem to be applicable for
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> batch
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> (file
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> based)
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> as
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> well,
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> and hence I thought of looking at these first.
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> Does
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> this
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> seem
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> to
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> be
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> in
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> line
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> with the thought process?
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> ~ Bhupesh
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> ______________________________
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> _________________________
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> Bhupesh Chawda
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> Software Engineer
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> www.datatorrent.com  |  apex.apache.org
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> On Thu, Feb 16, 2017 at 10:37 AM, Thomas Weise <
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> thw@apache.org
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> I don't think this should be designed based on a
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> simplistic
> > > >>>>>>>>>>>>>>>>>>>>> file
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> input-output scenario. It would be good to
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> include a
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> stateful
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> transformation based on event time.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> More complex pipelines contain stateful
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> transformations
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> that
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> depend
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> on
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> windowing and watermarks. I think we need a
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> watermark
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> concept
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> that
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> is
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> based
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> on progress in event time (or other monotonic
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> increasing
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> sequence)
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> that
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> other operators can generically work with.
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> Note that even file input in many cases can
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> produce
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> time
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> based
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> watermarks,
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> for example when you read part files that are
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> bound
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> by
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> event
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> time.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> Thomas
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> On Wed, Feb 15, 2017 at 4:02 AM, Bhupesh Chawda
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> <
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> bhupesh@datatorrent.com
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>
> > >
> >
>

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by Bhupesh Chawda <bh...@datatorrent.com>.
I think we should have the support to allow running multiple batches
through the same DAG.

Resetting the watermark to the initial watermark seems like a good idea.
The windowed operator needs to understand the start/end batch control tuple
and reset the watermark.
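
A rough sketch of that reset step, assuming a hypothetical hook on the
windowed operator (the field and method names below are illustrative, not the
current WindowedOperatorImpl API):

public class WindowedBatchResetSketch
{
  private long initialWatermark;
  private long currentWatermark;

  // called when the end-batch (or "reset") control tuple arrives
  public void resetForNextBatch()
  {
    fireRemainingTriggers();              // emit whatever the finished batch still owes
    clearWindowState();                   // drop per-window state of the finished batch
    currentWatermark = initialWatermark;  // so the next batch's event times are not "late"
  }

  private void fireRemainingTriggers()
  {
    // placeholder: the real operator would fire pending triggers here
  }

  private void clearWindowState()
  {
    // placeholder: the real operator would clear its windowed storage here
  }
}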

~ Bhupesh



_______________________________________________________

Bhupesh Chawda

E: bhupesh@datatorrent.com | Twitter: @bhupeshsc

www.datatorrent.com  |  apex.apache.org



On Thu, May 11, 2017 at 7:37 PM, Thomas Weise <th...@apache.org> wrote:

> Usually batches are processed by different instances of a topology. First
> of all we should agree that running multiple batches through the same DAG
> *in sequence* is something that we want to address.
>
> If yes, then there is the reset problem you are referring to, and it only
> occurs when you want to do event time processing per batch, because here
> you cannot repurpose the time component to segregate batches.
>
> 2 possible options that come to mind:
>
> - Reset the watermark to the initial watermark, which basically means that
> instead of a "shutdown" tuple there is a "reset" tuple.
> - Schedule separate operators/pipelines for each batch run.
>
> Thomas
>
>
> On Tue, May 9, 2017 at 5:25 AM, AJAY GUPTA <aj...@gmail.com> wrote:
>
> > After some discussion and trying out the approach discussed above, it
> seems
> > we would need to separate out the concepts of Watermarks and Batch
> Control
> > tuples.
> > The windowed operator needs to be modified to understand batch control
> > tuples.
> >
> > Even if we have watermark tuples which also include batch information,
> > windowed operator will fail when the source data is event time based.
> This
> > is because in this scenario, there are two notions of time in the
> > watermark:
> > 1. Time used to denote the file / batch boundary
> > 2. The event time in the data.
> >
> > For this reason, it makes sense to separate the concepts of batch tuples
> > (start something / end something) from the watermark tuples (which
> > essentially deal with event times).
> >
> > We could argue having a watermark tuple indicating end of the batch - a
> > final watermark (with time = Long.MAX) which would finalize all windows
> in
> > the windowed operator. However, now, if a next batch needs to be
> processed
> > subsequently by the same windowed operator, we would need to reset the
> > state of the operator as it has moved ahead in the event time domain. The
> > batch control tuples can do this resetting of state (in other words,
> > preparation for processing a new batch of data).
> >
> > As an example, consider telecom data logs for same 24 hrs of 2 regions (A
> > and B) which are to be processed as a batch. After processing data
> records
> > from region A, a "final" watermark would be emitted indicating end of all
> > data from region A. Now, unless we clear the windowed operator's state
> > information (current watermark, data storage) from the windowed operator,
> > the data records from region B will not be processed. In such scenario,
> > receiving an end batch control tuple can indicate the operator to reset
> its
> > state.
> >
> >
> > Ajay

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by Thomas Weise <th...@apache.org>.
Usually, batches are processed by different instances of a topology. First of
all, we should agree that running multiple batches through the same DAG *in
sequence* is something that we want to address.

If yes, then there is the reset problem you are referring to, and it only
occurs when you want to do event time processing per batch, because here
you cannot repurpose the time component to segregate batches.

Two possible options come to mind:

- Reset the watermark to the initial watermark, which basically means that
instead of a "shutdown" tuple there is a "reset" tuple.
- Schedule separate operators/pipelines for each batch run.
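
A minimal sketch of the first option, in plain Java. The names here (Watermark,
ResetTuple, ShutdownTuple, BatchAwareWindowedCount) are illustrative only and
not Apex or Malhar APIs: a "reset" tuple clears the per-batch window state and
rolls the watermark back to its initial value, while a "shutdown" tuple keeps
the final-watermark semantics discussed earlier in the thread.

import java.util.HashMap;
import java.util.Map;

// Hypothetical control tuples; not Apex/Malhar classes.
class Watermark { final long time; Watermark(long time) { this.time = time; } }
class ResetTuple { }     // "reset": prepare the operator for the next batch
class ShutdownTuple { }  // "shutdown": no further batches will follow

class BatchAwareWindowedCount
{
  private static final long INITIAL_WATERMARK = Long.MIN_VALUE;
  private static final long WINDOW_MS = 60_000L;  // 1-minute event-time windows

  private long currentWatermark = INITIAL_WATERMARK;
  private final Map<Long, Long> countsPerWindow = new HashMap<>();

  void processData(long eventTimeMs)
  {
    if (eventTimeMs <= currentWatermark) {
      return;  // late for the current batch; drop or divert
    }
    countsPerWindow.merge(eventTimeMs / WINDOW_MS, 1L, Long::sum);
  }

  void processWatermark(Watermark mark)
  {
    // Normal event-time progress within the current batch.
    currentWatermark = Math.max(currentWatermark, mark.time);
    // ... fire triggers for windows that end at or before currentWatermark ...
  }

  void processControl(Object control)
  {
    if (control instanceof ResetTuple) {
      // End of one batch: emit whatever remains, then clear per-batch state and
      // roll the watermark back so the next batch starts in a clean event-time
      // domain instead of looking "late".
      countsPerWindow.clear();
      currentWatermark = INITIAL_WATERMARK;
    } else if (control instanceof ShutdownTuple) {
      // Final-watermark semantics: finalize everything; the DAG can shut down.
      countsPerWindow.clear();
    }
  }
}

The trade-off against the second option is reuse versus isolation: a reset keeps
the same operator instances alive across batches, whereas separately scheduled
pipelines start every batch with nothing to clear.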

Thomas


On Tue, May 9, 2017 at 5:25 AM, AJAY GUPTA <aj...@gmail.com> wrote:

> After some discussion and trying out the approach discussed above, it seems
> we would need to separate out the concepts of Watermarks and Batch Control
> tuples.
> The windowed operator needs to be modified to understand batch control
> tuples.
>
> Even if we have watermark tuples which also include batch information, the
> windowed operator will fail when the source data is event-time based. This
> is because in this scenario, there are two notions of time in the
> watermark:
> 1. Time used to denote the file / batch boundary
> 2. The event time in the data.
>
> For this reason, it makes sense to separate the concepts of batch tuples
> (start something / end something) from the watermark tuples (which
> essentially deal with event times).
>
> We could argue for having a watermark tuple indicating the end of the batch - a
> final watermark (with time = Long.MAX) which would finalize all windows in
> the windowed operator. However, now, if a next batch needs to be processed
> subsequently by the same windowed operator, we would need to reset the
> state of the operator as it has moved ahead in the event time domain. The
> batch control tuples can do this resetting of state (in other words,
> preparation for processing a new batch of data).
>
> As an example, consider telecom data logs for the same 24 hrs of 2 regions (A
> and B) which are to be processed as a batch. After processing data records
> from region A, a "final" watermark would be emitted indicating end of all
> data from region A. Now, unless we clear the windowed operator's state
> information (current watermark, data storage) from the windowed operator,
> the data records from region B will not be processed. In such a scenario,
> receiving an end-batch control tuple can signal the operator to reset its
> state.
>
>
> Ajay
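
To make the separation of batch control tuples from event-time watermarks
concrete, here is a small sketch in plain Java built around the region A /
region B scenario above. All names (StreamEvent, StartBatch, EndBatch,
EventTimeWatermark, DataRecord, TwoRegionBatches) are hypothetical and not
Apex or Malhar classes; batch boundaries travel as control tuples while
watermarks carry only event time.

import java.util.Arrays;
import java.util.List;

// Hypothetical tuple kinds: batch boundaries are control tuples, not timestamps.
interface StreamEvent { }

class StartBatch implements StreamEvent { final String batchId; StartBatch(String id) { batchId = id; } }
class EndBatch implements StreamEvent { final String batchId; EndBatch(String id) { batchId = id; } }
class EventTimeWatermark implements StreamEvent { final long timeMs; EventTimeWatermark(long t) { timeMs = t; } }

class DataRecord implements StreamEvent
{
  final String region;
  final long eventTimeMs;  // event time carried by the record itself
  DataRecord(String region, long eventTimeMs) { this.region = region; this.eventTimeMs = eventTimeMs; }
}

public class TwoRegionBatches
{
  static long hour(int h) { return h * 3_600_000L; }

  public static void main(String[] args)
  {
    // The same 24 hours of event time, processed as two consecutive batches.
    // Without the EndBatch/StartBatch pair, region B's records would look
    // "late" to a windowed operator that has already seen region A's watermark.
    List<StreamEvent> stream = Arrays.asList(
        new StartBatch("region-A"),
        new DataRecord("A", hour(1)), new DataRecord("A", hour(23)),
        new EventTimeWatermark(hour(24)),
        new EndBatch("region-A"),      // reset state, roll the watermark back
        new StartBatch("region-B"),
        new DataRecord("B", hour(2)),
        new EventTimeWatermark(hour(24)),
        new EndBatch("region-B"));

    stream.forEach(e -> System.out.println(e.getClass().getSimpleName()));
  }
}

A windowed operator consuming such a stream keeps its usual event-time behavior
(triggers, allowed lateness) inside each batch and treats StartBatch/EndBatch
purely as state-lifecycle signals, which is the resetting behavior described
above.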
>
>
> On Sun, Apr 30, 2017 at 4:32 AM, Vlad Rozov <v....@datatorrent.com>
> wrote:
>
> > public static class Pojo implements Tuple
> > {
> >   @Override
> >   public Object getValue()
> >   {
> >     return this;
> >   }
> > }
> >
> > @Override
> > public void populateDAG(DAG dag, Configuration conf)
> > {
> >   CsvParser csvParser = dag.addOperator("csvParser", CsvParser.class);
> >   WindowedOperatorImpl<Pojo, Pojo, Pojo> windowedOperator =
> > dag.addOperator("windowOperator", WindowedOperatorImpl.class);
> >   dag.addStream("csvToWindowed", csvParser.out, new
> > InputPort[]{windowedOperator.input});
> > }
> >
> >
> > Thank you,
> >
> > Vlad
> >
> > On 4/29/17 15:20, AJAY GUPTA wrote:
> >
> >> Even this will not work because the output port of CsvParser is of type
> >> Object. Even though Customer extends Tuple<Object>, it will still fail to
> >> work since Tuple<Object> gets output as Object.
> >>
> >> *DefaultOutputPort<Object> output = new DefaultOutputPort<Object>();*
> >>
> >> The input port type at windowed operator with InputT = Object :
> >> *DefaultInputPort<Tuple<Object>>*
> >>
> >>
> >> Ajay
> >>
> >>
> >> On Sun, Apr 30, 2017 at 1:45 AM, Vlad Rozov <v....@datatorrent.com>
> >> wrote:
> >>
> >>> Use Object in place of InputT in the WindowedOperatorImpl. Cast Object to
> >>> the actual type of InputT at runtime. Introducing an operator just to do a
> >>> cast is not a good design decision, IMO.
> >>>
> >>> Thank you,
> >>> Vlad
> >>>
> >>> Sent from my iPhone
> >>>
> >>> On Apr 29, 2017, at 02:50, AJAY GUPTA <aj...@gmail.com> wrote:
> >>>>
> >>>> I am using WindowedOperatorImpl and it is declared as follows.
> >>>>
> >>>> WindowedOperatorImpl<InputT, AccumulationType, OutputType> windowedOperator
> >>>>     = new WindowedOperatorImpl<>();
> >>>>
> >>>> In my application scenario, the InputT is Customer POJO which is
> getting
> >>>> output as an Object by CsvParser.
> >>>>
> >>>>
> >>>> Ajay
> >>>>
> >>>> On Fri, Apr 28, 2017 at 11:53 PM, Vlad Rozov <v.rozov@datatorrent.com
> >
> >>>> wrote:
> >>>>
> >>>> How do you declare WindowedOperator?
> >>>>>
> >>>>> Thank you,
> >>>>>
> >>>>> Vlad
> >>>>>
> >>>>>
> >>>>> On 4/28/17 10:35, AJAY GUPTA wrote:
> >>>>>>
> >>>>>> Vlad,
> >>>>>>
> >>>>>> The approach you suggested doesn't work because the CSVParser
> outputs
> >>>>>> Object Data Type irrespective of the POJO class being emitted.
> >>>>>>
> >>>>>>
> >>>>>> Ajay
> >>>>>>
> >>>>>> On Fri, Apr 28, 2017 at 8:13 PM, Vlad Rozov <
> v.rozov@datatorrent.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>> Make your POJO class implement WindowedOperator Tuple interface (it
> >>>>>> may
> >>>>>>
> >>>>>>> return itself in getValue()).
> >>>>>>>
> >>>>>>> Thank you,
> >>>>>>>
> >>>>>>> Vlad
> >>>>>>>
> >>>>>>> On 4/28/17 02:44, AJAY GUPTA wrote:
> >>>>>>>
> >>>>>>> Hi All,
> >>>>>>>
> >>>>>>>> I am creating an application which is using Windowed Operator.
> This
> >>>>>>>> application involves CsvParser operator emitting a POJO object
> which
> >>>>>>>>
> >>>>>>> is
> >>>
> >>>> to
> >>>>>>>> be passed as input to WindowedOperator. The WindowedOperator
> >>>>>>>>
> >>>>>>> requires an
> >>>
> >>>> instance of Tuple class as input :
> >>>>>>>> *public final transient DefaultInputPort<Tuple<InputT>>
> >>>>>>>> input = new DefaultInputPort<Tuple<InputT>>() *
> >>>>>>>>
> >>>>>>>> Due to this, the addStream cannot work as the output of
> CsvParser's
> >>>>>>>> output
> >>>>>>>> port is not compatible with input port type of WindowedOperator.
> >>>>>>>> One way to solve this problem is to have an operator between the
> >>>>>>>>
> >>>>>>> above
> >>>
> >>>> two
> >>>>>>>> operators as a convertor.
> >>>>>>>> I would like to know if there is any other more generic approach
> to
> >>>>>>>> solve
> >>>>>>>> this problem without writing a new Operator for every new
> >>>>>>>> application
> >>>>>>>> using
> >>>>>>>> Windowed Operators.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Ajay
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Thu, Mar 23, 2017 at 5:25 PM, Bhupesh Chawda <
> >>>>>>>> bhupesh@datatorrent.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> Hi All,
> >>>>>>>>
> >>>>>>>> I think we have some agreement on the way we should use control
> >>>>>>>>>
> >>>>>>>> tuples
> >>>
> >>>> for
> >>>>>>>>> File I/O operators to support batch.
> >>>>>>>>>
> >>>>>>>>> In order to have more operators in Malhar, support this
> paradigm, I
> >>>>>>>>> think
> >>>>>>>>> we should also look at store operators - JDBC, Cassandra, HBase
> >>>>>>>>> etc.
> >>>>>>>>> The case with these operators is simpler as most of these do not
> >>>>>>>>>
> >>>>>>>> poll
> >>>
> >>>> the
> >>>>>>>>> sources (except JDBC poller operator) and just stop once they
> have
> >>>>>>>>> read a
> >>>>>>>>> fixed amount of data. In other words, these are inherently batch
> >>>>>>>>> sources.
> >>>>>>>>> The only change that we should add to these operators is to shut
> >>>>>>>>>
> >>>>>>>> down
> >>>
> >>>> the
> >>>>>>>>> DAG once the reading of data is done. For a windowed operator
> this
> >>>>>>>>> would
> >>>>>>>>> mean a Global window with a final watermark before the DAG is
> shut
> >>>>>>>>> down.
> >>>>>>>>>
> >>>>>>>>> ~ Bhupesh
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> _______________________________________________________
> >>>>>>>>>
> >>>>>>>>> Bhupesh Chawda
> >>>>>>>>>
> >>>>>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> >>>>>>>>>
> >>>>>>>>> www.datatorrent.com  |  apex.apache.org
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Tue, Feb 28, 2017 at 10:59 PM, Bhupesh Chawda <
> >>>>>>>>> bhupesh@datatorrent.com>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Thomas,
> >>>>>>>>>
> >>>>>>>>> Even though the windowing operator is not just "event time", it
> >>>>>>>>>>
> >>>>>>>>> seems
> >>>
> >>>> it
> >>>>>>>>>> is too much dependent on the "time" attribute of the incoming
> >>>>>>>>>>
> >>>>>>>>> tuple.
> >>>
> >>>> This
> >>>>>>>>>> is the reason we had to model the file index as a timestamp to
> >>>>>>>>>>
> >>>>>>>>> solve
> >>>
> >>>> the
> >>>>>>>>>> batch case for files.
> >>>>>>>>>> Perhaps we should work on increasing the scope of the windowed
> >>>>>>>>>> operator
> >>>>>>>>>>
> >>>>>>>>>> to
> >>>>>>>>>>
> >>>>>>>>> consider other types of windows as well. The Sequence option
> >>>>>>>>>
> >>>>>>>> suggested
> >>>
> >>>> by
> >>>>>>>>>> David seems to be something in that direction.
> >>>>>>>>>>
> >>>>>>>>>> ~ Bhupesh
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> _______________________________________________________
> >>>>>>>>>>
> >>>>>>>>>> Bhupesh Chawda
> >>>>>>>>>>
> >>>>>>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> >>>>>>>>>>
> >>>>>>>>>> www.datatorrent.com  |  apex.apache.org
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Feb 28, 2017 at 10:48 PM, Thomas Weise <th...@apache.org>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> That's correct, we are looking at a generalized approach for
> state
> >>>>>>>>>>
> >>>>>>>>>> management vs. a series of special cases.
> >>>>>>>>>>>
> >>>>>>>>>>> And to be clear, windowing does not imply event time, otherwise
> >>>>>>>>>>> it
> >>>>>>>>>>> would
> >>>>>>>>>>> be
> >>>>>>>>>>> "EventTimeOperator" :-)
> >>>>>>>>>>>
> >>>>>>>>>>> Thomas
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Feb 28, 2017 at 9:11 AM, Bhupesh Chawda <
> >>>>>>>>>>>
> >>>>>>>>>>> bhupesh@datatorrent.com>
> >>>>>>>>>>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi David,
> >>>>>>>>>>>
> >>>>>>>>>>> I went through the discussion, but it seems like it is more on
> >>>>>>>>>>>>
> >>>>>>>>>>> the
> >>>
> >>>> event
> >>>>>>>>>>>>
> >>>>>>>>>>> time watermark handling as opposed to batches. What we are
> trying
> >>>>>>>>>>
> >>>>>>>>> to
> >>>
> >>>> do
> >>>>>>>>>>>
> >>>>>>>>>>> is
> >>>>>>>>>>
> >>>>>>>>>> have watermarks serve the purpose of demarcating batches using
> >>>>>>>>>>>
> >>>>>>>>>>>> control
> >>>>>>>>>>>> tuples. Since each batch is separate from others, we would
> like
> >>>>>>>>>>>>
> >>>>>>>>>>> to
> >>>
> >>>> have
> >>>>>>>>>>>>
> >>>>>>>>>>> stateful processing within a batch, but not across batches.
> >>>>>>>>>>
> >>>>>>>>>> At the same time, we would like to do this in a manner which is
> >>>>>>>>>>>
> >>>>>>>>>>>> consistent
> >>>>>>>>>>>>
> >>>>>>>>>>> with the windowing mechanism provided by the windowed operator.
> >>>>>>>>>>>
> >>>>>>>>>> This
> >>>
> >>>> will
> >>>>>>>>>>>>
> >>>>>>>>>>> allow us to treat a single batch as a (bounded) stream and
> apply
> >>>>>>>>>>>
> >>>>>>>>>> all
> >>>
> >>>> the
> >>>>>>>>>>>>
> >>>>>>>>>>> event time windowing concepts in that time span.
> >>>>>>>>>>
> >>>>>>>>>> For example, let's say I need to process data for a day (24
> >>>>>>>>>>>
> >>>>>>>>>> hours) as
> >>>
> >>>> a
> >>>>>>>>>>>>
> >>>>>>>>>>> single batch. The application is still streaming in nature: it
> >>>>>>>>>>
> >>>>>>>>> would
> >>>
> >>>> end
> >>>>>>>>>>>
> >>>>>>>>>>> the batch after a day and start a new batch the next day. At
> the
> >>>>>>>>>>
> >>>>>>>>> same
> >>>
> >>>> time,
> >>>>>>>>>>>
> >>>>>>>>>>> I would be able to have early trigger firings every minute as
> >>>>>>>>>>>
> >>>>>>>>>> well as
> >>>
> >>>> drop
> >>>>>>>>>>>>
> >>>>>>>>>>> any data which is say, 5 mins late. All this within a single
> day.
> >>>>>>>>>>>
> >>>>>>>>>>>> ~ Bhupesh
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> _______________________________________________________
> >>>>>>>>>>>>
> >>>>>>>>>>>> Bhupesh Chawda
> >>>>>>>>>>>>
> >>>>>>>>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> >>>>>>>>>>>>
> >>>>>>>>>>>> www.datatorrent.com  |  apex.apache.org
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, Feb 28, 2017 at 9:27 PM, David Yan <
> davidyan@gmail.com>
> >>>>>>>>>>>>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>> There is a discussion in the Flink mailing list about key-based
> >>>>>>>>>>
> >>>>>>>>>> watermarks.
> >>>>>>>>>>>
> >>>>>>>>>>>> I think it's relevant to our use case here.
> >>>>>>>>>>>>
> >>>>>>>>>>>>> https://lists.apache.org/thread.html/2b90d5b1d5e2654212cfbbcc6510ef424bbafc4fadb164bd5aff9216@%3Cdev.flink.apache.org%3E
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> David
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Tue, Feb 28, 2017 at 2:13 AM, Bhupesh Chawda <
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> bhupesh@datatorrent.com
> >>>>>>>>>>>>>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi David,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> If using time window does not seem appropriate, we can have
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> another
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> class
> >>>>>>>>>>>>
> >>>>>>>>>>> which is more suited for such sequential and distinct windows.
> >>>>>>>>>>>
> >>>>>>>>>>>> Perhaps, a
> >>>>>>>>>>>>> CustomWindow option can be introduced which takes in a window
> >>>>>>>>>>>>>
> >>>>>>>>>>>> id.
> >>>
> >>>> The
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> purpose of this window option could be to translate the
> window
> >>>>>>>>>>>> id
> >>>>>>>>>>>>
> >>>>>>>>>>>> into
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> appropriate timestamps.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Another option would be to go with a custom timestampExtractor
> >>>>>>>>>>>>>
> >>>>>>>>>>>> for
> >>>
> >>>> such
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> tuples which translates the each unique file name to a
> distinct
> >>>>>>>>>>>>
> >>>>>>>>>>>> timestamp
> >>>>>>>>>>>>> while using time windows in the windowed operator.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> ~ Bhupesh
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> _______________________________________________________
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Bhupesh Chawda
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> www.datatorrent.com  |  apex.apache.org
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Tue, Feb 28, 2017 at 12:28 AM, David Yan <
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> davidyan@gmail.com>
> >>>
> >>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> I now see your rationale on putting the filename in the
> window.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> As far as I understand, the reasons why the filename is not
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> part
> >>>
> >>>> of
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>
> >>>>>>>>>>>> key
> >>>>>>>>>>>>
> >>>>>>>>>>>>> and the Global Window is not used are:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 1) The files are processed in sequence, not in parallel
> >>>>>>>>>>>>>>> 2) The windowed operator should not keep the state
> associated
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> with
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>
> >>>>>>>>>>>> file
> >>>>>>>>>>>
> >>>>>>>>>>>> when the processing of the file is done
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 3) The trigger should be fired for the file when a file is
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>> done
> >>>
> >>>> processing.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>> However, if the file is just a sequence has nothing to do
> with
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> a
> >>>
> >>>> timestamp,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>> assigning a timestamp to a file is not an intuitive thing to
> >>>>>>>>>>>>>> do
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>> would
> >>>>>>>>>>>>> just create confusions to the users, especially when it's
> used
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> as
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> an
> >>>>>>>>>>>>>
> >>>>>>>>>>>> example for new users.
> >>>>>>>>>>>
> >>>>>>>>>>>> How about having a separate class called SequenceWindow? And
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> perhaps
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>> TimeWindow can inherit from it?
> >>>>>>>>>>>>> David
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Mon, Feb 27, 2017 at 8:58 AM, Thomas Weise <
> thw@apache.org
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Mon, Feb 27, 2017 at 8:50 AM, Bhupesh Chawda <
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> bhupesh@datatorrent.com
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I think my comments related to count based windows might be
> >>>>>>>>>>>>>>>> causing
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> confusion. Let's not discuss count based scenarios for
> now.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Just want to make sure we are on the same page wrt. the
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> "each
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> file
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>> is a
> >>>>>>>>>>>
> >>>>>>>>>>>> batch" use case. As mentioned by Thomas, the each tuple from
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> same
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>> file
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> has the same timestamp (which is just a sequence number) and
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> helps
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>> keep tuples from each file in a separate window.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Yes, in this case it is a sequence number, but it could be a
> >>>>>>>>>>>>>>>> time
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> stamp
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> also, depending on the file naming convention. And if it was
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> event
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> time
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> processing, the watermark would be derived from records
> within
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> file.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> Agreed, the source should have a mechanism to control the
> time
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> stamp
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> extraction along with everything else pertaining to the
> >>>>>>>>>>>>>> watermark
> >>>>>>>>>>>>>> generation.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> We could also implement a "timestampExtractor" interface to
> >>>>>>>>>>>>>>>> identify
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>> timestamp (sequence number) for a file.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> ~ Bhupesh
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> _______________________________________________________
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Bhupesh Chawda
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> www.datatorrent.com  |  apex.apache.org
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Mon, Feb 27, 2017 at 9:52 PM, Thomas Weise <
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> thw@apache.org
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I don't think this is a use case for count based window.
> >>>>>>>>>>>
> >>>>>>>>>>>> We have multiple files that are retrieved in a sequence
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> there
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> is
> >>>>>>>>>>>
> >>>>>>>>>>>> no
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> knowledge of the number of records per file. The
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> requirement is
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> aggregate each file separately and emit the aggregate when
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> file
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> read
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> fully. There is no concept of "end of something" for an
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> individual
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> key
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> global window isn't applicable.

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by AJAY GUPTA <aj...@gmail.com>.
After some discussion and trying out the approach discussed above, it seems
we would need to separate the concepts of watermarks and batch control
tuples. The windowed operator needs to be modified to understand batch
control tuples.

Even if we have watermark tuples which also include batch information, the
windowed operator will fail when the source data is event-time based. This
is because, in this scenario, there are two notions of time in the watermark:
1. The time used to denote the file / batch boundary
2. The event time in the data.

For this reason, it makes sense to separate the concepts of batch control
tuples (start batch / end batch) from the watermark tuples (which
essentially deal with event times).

We could argue for having a watermark tuple indicating the end of the batch - a
final watermark (with time = Long.MAX_VALUE) which would finalize all windows in
the windowed operator. However, if a subsequent batch then needs to be processed
by the same windowed operator, we would need to reset the operator's state, as it
has already moved ahead in the event time domain. The batch control tuples can do
this resetting of state (in other words, prepare the operator for processing a new
batch of data).

As an example, consider telecom data logs for the same 24 hours from two
regions (A and B), where each region's logs are processed as a separate batch.
After processing the data records from region A, a "final" watermark would be
emitted indicating the end of all data from region A. Now, unless we clear the
windowed operator's state (current watermark, data storage), the data records
from region B will not be processed. In such a scenario, receiving an end-batch
control tuple can signal the operator to reset its state.
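
To make the reset behaviour concrete, here is a minimal, self-contained Java
sketch of the idea. This is not the actual Malhar WindowedOperator API; the
BatchControlTuple / StartBatch / EndBatch types and the method names are
hypothetical and only illustrate how an end-batch control tuple could finalize
the current windows and clear per-batch state so that the next batch (e.g.
region B) starts from a clean event-time domain:

import java.util.HashMap;
import java.util.Map;

// Hypothetical batch control tuples, kept separate from event-time watermarks.
interface BatchControlTuple {}
class StartBatch implements BatchControlTuple {}
class EndBatch implements BatchControlTuple {}

// Sketch of the per-batch state a windowed operator would keep and reset.
class BatchAwareWindowedState
{
  private long currentWatermark = Long.MIN_VALUE;
  private final Map<Long, Long> countsPerWindow = new HashMap<>();

  // Hypothetical hook: called when a batch control tuple arrives.
  void processBatchControl(BatchControlTuple control)
  {
    if (control instanceof StartBatch) {
      resetBatchState();                 // prepare for the new batch (e.g. region B)
    } else if (control instanceof EndBatch) {
      emitFinalTriggers();               // finalize all windows of the current batch
      resetBatchState();                 // same effect as a Long.MAX_VALUE watermark plus a reset
    }
  }

  // Event-time watermarks only advance time within the current batch.
  void processWatermark(long timestamp)
  {
    currentWatermark = Math.max(currentWatermark, timestamp);
  }

  // Per-tuple processing: assign the tuple to a 1-minute event-time window.
  void processTuple(long eventTimeMillis)
  {
    countsPerWindow.merge(eventTimeMillis / 60_000L, 1L, Long::sum);
  }

  private void emitFinalTriggers()
  {
    countsPerWindow.forEach((window, count) ->
        System.out.println("window " + window + " -> " + count));
  }

  private void resetBatchState()
  {
    currentWatermark = Long.MIN_VALUE;   // the event-time domain starts over for the next batch
    countsPerWindow.clear();
  }
}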


Ajay


On Sun, Apr 30, 2017 at 4:32 AM, Vlad Rozov <v....@datatorrent.com> wrote:

> public static class Pojo implements Tuple
> {
>   @Override
>   public Object getValue()
>   {
>     return this;
>   }
> }
>
> @Override
> public void populateDAG(DAG dag, Configuration conf)
> {
>   CsvParser csvParser = dag.addOperator("csvParser", CsvParser.class);
>   WindowedOperatorImpl<Pojo, Pojo, Pojo> windowedOperator =
> dag.addOperator("windowOperator", WindowedOperatorImpl.class);
>   dag.addStream("csvToWindowed", csvParser.out, new
> InputPort[]{windowedOperator.input});
> }
>
>
> Thank you,
>
> Vlad
>
> On 4/29/17 15:20, AJAY GUPTA wrote:
>
>> Even this will not work because the output port of CsvParser is of type
>> Object. Even though Customer extends Tuple<Object>, it will still fail to
>> work since Tuple<Object> gets output as Object.
>>
>> *DefaultOutputPort<Object> output = new DefaultOutputPort<Object>();*
>>
>> The input port type at windowed operator with InputT = Object :
>> *DefaultInputPort<Tuple<Object>>*
>>
>>
>> Ajay
>>
>>
>> On Sun, Apr 30, 2017 at 1:45 AM, Vlad Rozov <v....@datatorrent.com>
>> wrote:
>>
>> Use Object in place of InputT in the WindowedOperatorImpl. Cast Object to
>>> the actual type of InputT at runtime. Introducing an operator just to do
>>> a
>>> cast is not a good design decision, IMO.
>>>
>>> Thank you,
>>> Vlad
>>>
>>> Sent from iPhone
>>>
>>> On Apr 29, 2017, at 02:50, AJAY GUPTA <aj...@gmail.com> wrote:
>>>>
>>>> I am using WindowedOperatorImpl and it is declared as follows.
>>>>
>>>> WindowedOperatorImpl<InputT, AccumulationType, OutputType>
>>>>
>>> windowedOperator
>>>
>>>> = new WindowedOperatorImpl<>();
>>>>
>>>> In my application scenario, the InputT is Customer POJO which is getting
>>>> output as an Object by CsvParser.
>>>>
>>>>
>>>> Ajay
>>>>
>>>> On Fri, Apr 28, 2017 at 11:53 PM, Vlad Rozov <v....@datatorrent.com>
>>>> wrote:
>>>>
>>>> How do you declare WindowedOperator?
>>>>>
>>>>> Thank you,
>>>>>
>>>>> Vlad
>>>>>
>>>>>
>>>>> On 4/28/17 10:35, AJAY GUPTA wrote:
>>>>>>
>>>>>> Vlad,
>>>>>>
>>>>>> The approach you suggested doesn't work because the CSVParser outputs
>>>>>> Object Data Type irrespective of the POJO class being emitted.
>>>>>>
>>>>>>
>>>>>> Ajay
>>>>>>
>>>>>> On Fri, Apr 28, 2017 at 8:13 PM, Vlad Rozov <v....@datatorrent.com>
>>>>>> wrote:
>>>>>>
>>>>>> Make your POJO class implement WindowedOperator Tuple interface (it
>>>>>> may
>>>>>>
>>>>>>> return itself in getValue()).
>>>>>>>
>>>>>>> Thank you,
>>>>>>>
>>>>>>> Vlad
>>>>>>>
>>>>>>> On 4/28/17 02:44, AJAY GUPTA wrote:
>>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>>> I am creating an application which is using Windowed Operator. This
>>>>>>>> application involves CsvParser operator emitting a POJO object which
>>>>>>>>
>>>>>>> is
>>>
>>>> to
>>>>>>>> be passed as input to WindowedOperator. The WindowedOperator
>>>>>>>>
>>>>>>> requires an
>>>
>>>> instance of Tuple class as input :
>>>>>>>> *public final transient DefaultInputPort<Tuple<InputT>>
>>>>>>>> input = new DefaultInputPort<Tuple<InputT>>() *
>>>>>>>>
>>>>>>>> Due to this, the addStream cannot work as the output of CsvParser's
>>>>>>>> output
>>>>>>>> port is not compatible with input port type of WindowedOperator.
>>>>>>>> One way to solve this problem is to have an operator between the
>>>>>>>>
>>>>>>> above
>>>
>>>> two
>>>>>>>> operators as a convertor.
>>>>>>>> I would like to know if there is any other more generic approach to
>>>>>>>> solve
>>>>>>>> this problem without writing a new Operator for every new
>>>>>>>> application
>>>>>>>> using
>>>>>>>> Windowed Operators.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ajay
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Mar 23, 2017 at 5:25 PM, Bhupesh Chawda <
>>>>>>>> bhupesh@datatorrent.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> I think we have some agreement on the way we should use control
>>>>>>>>>
>>>>>>>> tuples
>>>
>>>> for
>>>>>>>>> File I/O operators to support batch.
>>>>>>>>>
>>>>>>>>> In order to have more operators in Malhar, support this paradigm, I
>>>>>>>>> think
>>>>>>>>> we should also look at store operators - JDBC, Cassandra, HBase
>>>>>>>>> etc.
>>>>>>>>> The case with these operators is simpler as most of these do not
>>>>>>>>>
>>>>>>>> poll
>>>
>>>> the
>>>>>>>>> sources (except JDBC poller operator) and just stop once they have
>>>>>>>>> read a
>>>>>>>>> fixed amount of data. In other words, these are inherently batch
>>>>>>>>> sources.
>>>>>>>>> The only change that we should add to these operators is to shut
>>>>>>>>>
>>>>>>>> down
>>>
>>>> the
>>>>>>>>> DAG once the reading of data is done. For a windowed operator this
>>>>>>>>> would
>>>>>>>>> mean a Global window with a final watermark before the DAG is shut
>>>>>>>>> down.
>>>>>>>>>
>>>>>>>>> ~ Bhupesh
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________________
>>>>>>>>>
>>>>>>>>> Bhupesh Chawda
>>>>>>>>>
>>>>>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>>>>>>>>>
>>>>>>>>> www.datatorrent.com  |  apex.apache.org
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Feb 28, 2017 at 10:59 PM, Bhupesh Chawda <
>>>>>>>>> bhupesh@datatorrent.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi Thomas,
>>>>>>>>>
>>>>>>>>> Even though the windowing operator is not just "event time", it
>>>>>>>>>>
>>>>>>>>> seems
>>>
>>>> it
>>>>>>>>>> is too much dependent on the "time" attribute of the incoming
>>>>>>>>>>
>>>>>>>>> tuple.
>>>
>>>> This
>>>>>>>>>> is the reason we had to model the file index as a timestamp to
>>>>>>>>>>
>>>>>>>>> solve
>>>
>>>> the
>>>>>>>>>> batch case for files.
>>>>>>>>>> Perhaps we should work on increasing the scope of the windowed
>>>>>>>>>> operator
>>>>>>>>>>
>>>>>>>>>> to
>>>>>>>>>>
>>>>>>>>> consider other types of windows as well. The Sequence option
>>>>>>>>>
>>>>>>>> suggested
>>>
>>>> by
>>>>>>>>>> David seems to be something in that direction.
>>>>>>>>>>
>>>>>>>>>> ~ Bhupesh
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________________
>>>>>>>>>>
>>>>>>>>>> Bhupesh Chawda
>>>>>>>>>>
>>>>>>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>>>>>>>>>>
>>>>>>>>>> www.datatorrent.com  |  apex.apache.org
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 28, 2017 at 10:48 PM, Thomas Weise <th...@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> That's correct, we are looking at a generalized approach for state
>>>>>>>>>>
>>>>>>>>>> management vs. a series of special cases.
>>>>>>>>>>>
>>>>>>>>>>> And to be clear, windowing does not imply event time, otherwise
>>>>>>>>>>> it
>>>>>>>>>>> would
>>>>>>>>>>> be
>>>>>>>>>>> "EventTimeOperator" :-)
>>>>>>>>>>>
>>>>>>>>>>> Thomas
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 28, 2017 at 9:11 AM, Bhupesh Chawda <
>>>>>>>>>>>
>>>>>>>>>>> bhupesh@datatorrent.com>
>>>>>>>>>>>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi David,
>>>>>>>>>>>
>>>>>>>>>>> I went through the discussion, but it seems like it is more on
>>>>>>>>>>>>
>>>>>>>>>>> the
>>>
>>>> event
>>>>>>>>>>>>
>>>>>>>>>>> time watermark handling as opposed to batches. What we are trying
>>>>>>>>>>
>>>>>>>>> to
>>>
>>>> do
>>>>>>>>>>>
>>>>>>>>>>> is
>>>>>>>>>>
>>>>>>>>>> have watermarks serve the purpose of demarcating batches using
>>>>>>>>>>>
>>>>>>>>>>>> control
>>>>>>>>>>>> tuples. Since each batch is separate from others, we would like
>>>>>>>>>>>>
>>>>>>>>>>> to
>>>
>>>> have
>>>>>>>>>>>>
>>>>>>>>>>> stateful processing within a batch, but not across batches.
>>>>>>>>>>
>>>>>>>>>> At the same time, we would like to do this in a manner which is
>>>>>>>>>>>
>>>>>>>>>>>> consistent
>>>>>>>>>>>>
>>>>>>>>>>> with the windowing mechanism provided by the windowed operator.
>>>>>>>>>>>
>>>>>>>>>> This
>>>
>>>> will
>>>>>>>>>>>>
>>>>>>>>>>> allow us to treat a single batch as a (bounded) stream and apply
>>>>>>>>>>>
>>>>>>>>>> all
>>>
>>>> the
>>>>>>>>>>>>
>>>>>>>>>>> event time windowing concepts in that time span.
>>>>>>>>>>
>>>>>>>>>> For example, let's say I need to process data for a day (24
>>>>>>>>>>>
>>>>>>>>>> hours) as
>>>
>>>> a
>>>>>>>>>>>>
>>>>>>>>>>> single batch. The application is still streaming in nature: it
>>>>>>>>>>
>>>>>>>>> would
>>>
>>>> end
>>>>>>>>>>>
>>>>>>>>>>> the batch after a day and start a new batch the next day. At the
>>>>>>>>>>
>>>>>>>>> same
>>>
>>>> time,
>>>>>>>>>>>
>>>>>>>>>>> I would be able to have early trigger firings every minute as
>>>>>>>>>>>
>>>>>>>>>> well as
>>>
>>>> drop
>>>>>>>>>>>>
>>>>>>>>>>> any data which is say, 5 mins late. All this within a single day.
>>>>>>>>>>>
>>>>>>>>>>>> ~ Bhupesh
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________________
>>>>>>>>>>>>
>>>>>>>>>>>> Bhupesh Chawda
>>>>>>>>>>>>
>>>>>>>>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>>>>>>>>>>>>
>>>>>>>>>>>> www.datatorrent.com  |  apex.apache.org
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Feb 28, 2017 at 9:27 PM, David Yan <da...@gmail.com>
>>>>>>>>>>>>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>> There is a discussion in the Flink mailing list about key-based
>>>>>>>>>>
>>>>>>>>>> watermarks.
>>>>>>>>>>>
>>>>>>>>>>>> I think it's relevant to our use case here.
>>>>>>>>>>>>
>>>>>>>>>>>>> https://lists.apache.org/thread.html/2b90d5b1d5e2654212cfbbc
>>>>>>>>>>>>> c6510ef
>>>>>>>>>>>>> 424bbafc4fadb164bd5aff9216@%3Cdev.flink.apache.org%3E
>>>>>>>>>>>>>
>>>>>>>>>>>>> David
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Feb 28, 2017 at 2:13 AM, Bhupesh Chawda <
>>>>>>>>>>>>>
>>>>>>>>>>>>> bhupesh@datatorrent.com
>>>>>>>>>>>>>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi David,
>>>>>>>>>>>>>
>>>>>>>>>>>>> If using time window does not seem appropriate, we can have
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> another
>>>>>>>>>>>>>>
>>>>>>>>>>>>> class
>>>>>>>>>>>>
>>>>>>>>>>> which is more suited for such sequential and distinct windows.
>>>>>>>>>>>
>>>>>>>>>>>> Perhaps, a
>>>>>>>>>>>>> CustomWindow option can be introduced which takes in a window
>>>>>>>>>>>>>
>>>>>>>>>>>> id.
>>>
>>>> The
>>>>>>>>>>>>>
>>>>>>>>>>>>> purpose of this window option could be to translate the window
>>>>>>>>>>>> id
>>>>>>>>>>>>
>>>>>>>>>>>> into
>>>>>>>>>>>>>
>>>>>>>>>>>>> appropriate timestamps.
>>>>>>>>>>>>
>>>>>>>>>>>> Another option would be to go with a custom timestampExtractor
>>>>>>>>>>>>>
>>>>>>>>>>>> for
>>>
>>>> such
>>>>>>>>>>>>>>
>>>>>>>>>>>>> tuples which translates the each unique file name to a distinct
>>>>>>>>>>>>
>>>>>>>>>>>> timestamp
>>>>>>>>>>>>> while using time windows in the windowed operator.
>>>>>>>>>>>>>
>>>>>>>>>>>>> ~ Bhupesh
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> _______________________________________________________
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Bhupesh Chawda
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> www.datatorrent.com  |  apex.apache.org
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Feb 28, 2017 at 12:28 AM, David Yan <
>>>>>>>>>>>>>>
>>>>>>>>>>>>> davidyan@gmail.com>
>>>
>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>> I now see your rationale on putting the filename in the window.
>>>>>>>>>>>>>
>>>>>>>>>>>>> As far as I understand, the reasons why the filename is not
>>>>>>>>>>>>>>
>>>>>>>>>>>>> part
>>>
>>>> of
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>
>>>>>>>>>>>> key
>>>>>>>>>>>>
>>>>>>>>>>>>> and the Global Window is not used are:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1) The files are processed in sequence, not in parallel
>>>>>>>>>>>>>>> 2) The windowed operator should not keep the state associated
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>
>>>>>>>>>>>> file
>>>>>>>>>>>
>>>>>>>>>>>> when the processing of the file is done
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 3) The trigger should be fired for the file when a file is
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> done
>>>
>>>> processing.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> However, if the file is just a sequence has nothing to do with
>>>>>>>>>>>>>>
>>>>>>>>>>>>> a
>>>
>>>> timestamp,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> assigning a timestamp to a file is not an intuitive thing to
>>>>>>>>>>>>>> do
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> would
>>>>>>>>>>>>> just create confusions to the users, especially when it's used
>>>>>>>>>>>>>
>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> an
>>>>>>>>>>>>>
>>>>>>>>>>>> example for new users.
>>>>>>>>>>>
>>>>>>>>>>>> How about having a separate class called SequenceWindow? And
>>>>>>>>>>>>>
>>>>>>>>>>>>>> perhaps
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> TimeWindow can inherit from it?
>>>>>>>>>>>>> David
>>>>>>>>>>>>>

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by Vlad Rozov <v....@datatorrent.com>.
public static class Pojo implements Tuple
{
   @Override
   public Object getValue()
   {
     return this;
   }
}

@Override
public void populateDAG(DAG dag, Configuration conf)
{
   CsvParser csvParser = dag.addOperator("csvParser", CsvParser.class);
   WindowedOperatorImpl<Pojo, Pojo, Pojo> windowedOperator = dag.addOperator("windowOperator", WindowedOperatorImpl.class);
   dag.addStream("csvToWindowed", csvParser.out, new InputPort[]{windowedOperator.input});
}
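
For illustration, the Customer POJO mentioned earlier in the thread could follow
the same pattern, so that no converter operator is needed between CsvParser and
the windowed operator. The Customer class and its fields below are assumed for
the example; they are not taken from the actual application:

public static class Customer implements Tuple
{
  // Illustrative fields only; the real POJO is defined by the application.
  public String name;
  public long accountId;

  @Override
  public Object getValue()
  {
    return this;
  }
}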


Thank you,

Vlad

On 4/29/17 15:20, AJAY GUPTA wrote:
> Even this will not work because the output port of CsvParser is of type
> Object. Even though Customer extends Tuple<Object>, it will still fail to
> work since Tuple<Object> gets output as Object.
>
> *DefaultOutputPort<Object> output = new DefaultOutputPort<Object>();*
>
> The input port type at windowed operator with InputT = Object :
> *DefaultInputPort<Tuple<Object>>*
>
>
> Ajay
>
>
> On Sun, Apr 30, 2017 at 1:45 AM, Vlad Rozov <v....@datatorrent.com> wrote:
>
>> Use Object in place of InputT in the WindowedOperatorImpl. Cast Object to
>> the actual type of InputT at runtime. Introducing an operator just to do a
>> cast is not a good design decision, IMO.
>>
>> Thank you,
>> Vlad
>>
>> Sent from iPhone
>>
>>> On Apr 29, 2017, at 02:50, AJAY GUPTA <aj...@gmail.com> wrote:
>>>
>>> I am using WindowedOperatorImpl and it is declared as follows.
>>>
>>> WindowedOperatorImpl<InputT, AccumulationType, OutputType>
>> windowedOperator
>>> = new WindowedOperatorImpl<>();
>>>
>>> In my application scenario, the InputT is Customer POJO which is getting
>>> output as an Object by CsvParser.
>>>
>>>
>>> Ajay
>>>
>>> On Fri, Apr 28, 2017 at 11:53 PM, Vlad Rozov <v....@datatorrent.com>
>>> wrote:
>>>
>>>> How do you declare WindowedOperator?
>>>>
>>>> Thank you,
>>>>
>>>> Vlad
>>>>
>>>>
>>>>> On 4/28/17 10:35, AJAY GUPTA wrote:
>>>>>
>>>>> Vlad,
>>>>>
>>>>> The approach you suggested doesn't work because the CSVParser outputs
>>>>> Object Data Type irrespective of the POJO class being emitted.
>>>>>
>>>>>
>>>>> Ajay
>>>>>
>>>>> On Fri, Apr 28, 2017 at 8:13 PM, Vlad Rozov <v....@datatorrent.com>
>>>>> wrote:
>>>>>
>>>>> Make your POJO class implement WindowedOperator Tuple interface (it may
>>>>>> return itself in getValue()).
>>>>>>
>>>>>> Thank you,
>>>>>>
>>>>>> Vlad
>>>>>>
>>>>>> On 4/28/17 02:44, AJAY GUPTA wrote:
>>>>>>
>>>>>> Hi All,
>>>>>>> I am creating an application which is using Windowed Operator. This
>>>>>>> application involves CsvParser operator emitting a POJO object which
>> is
>>>>>>> to
>>>>>>> be passed as input to WindowedOperator. The WindowedOperator
>> requires an
>>>>>>> instance of Tuple class as input :
>>>>>>> *public final transient DefaultInputPort<Tuple<InputT>>
>>>>>>> input = new DefaultInputPort<Tuple<InputT>>() *
>>>>>>>
>>>>>>> Due to this, the addStream cannot work as the output of CsvParser's
>>>>>>> output
>>>>>>> port is not compatible with input port type of WindowedOperator.
>>>>>>> One way to solve this problem is to have an operator between the
>> above
>>>>>>> two
>>>>>>> operators as a convertor.
>>>>>>> I would like to know if there is any other more generic approach to
>>>>>>> solve
>>>>>>> this problem without writing a new Operator for every new application
>>>>>>> using
>>>>>>> Windowed Operators.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ajay
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Mar 23, 2017 at 5:25 PM, Bhupesh Chawda <
>>>>>>> bhupesh@datatorrent.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>>> I think we have some agreement on the way we should use control
>> tuples
>>>>>>>> for
>>>>>>>> File I/O operators to support batch.
>>>>>>>>
>>>>>>>> In order to have more operators in Malhar, support this paradigm, I
>>>>>>>> think
>>>>>>>> we should also look at store operators - JDBC, Cassandra, HBase etc.
>>>>>>>> The case with these operators is simpler as most of these do not
>> poll
>>>>>>>> the
>>>>>>>> sources (except JDBC poller operator) and just stop once they have
>>>>>>>> read a
>>>>>>>> fixed amount of data. In other words, these are inherently batch
>>>>>>>> sources.
>>>>>>>> The only change that we should add to these operators is to shut
>> down
>>>>>>>> the
>>>>>>>> DAG once the reading of data is done. For a windowed operator this
>>>>>>>> would
>>>>>>>> mean a Global window with a final watermark before the DAG is shut
>>>>>>>> down.
>>>>>>>>
>>>>>>>> ~ Bhupesh
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________________
>>>>>>>>
>>>>>>>> Bhupesh Chawda
>>>>>>>>
>>>>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>>>>>>>>
>>>>>>>> www.datatorrent.com  |  apex.apache.org
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Feb 28, 2017 at 10:59 PM, Bhupesh Chawda <
>>>>>>>> bhupesh@datatorrent.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi Thomas,
>>>>>>>>
>>>>>>>>> Even though the windowing operator is not just "event time", it
>> seems
>>>>>>>>> it
>>>>>>>>> is too much dependent on the "time" attribute of the incoming
>> tuple.
>>>>>>>>> This
>>>>>>>>> is the reason we had to model the file index as a timestamp to
>> solve
>>>>>>>>> the
>>>>>>>>> batch case for files.
>>>>>>>>> Perhaps we should work on increasing the scope of the windowed
>>>>>>>>> operator
>>>>>>>>>
>>>>>>>>> to
>>>>>>>> consider other types of windows as well. The Sequence option
>> suggested
>>>>>>>>> by
>>>>>>>>> David seems to be something in that direction.
>>>>>>>>>
>>>>>>>>> ~ Bhupesh
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________________
>>>>>>>>>
>>>>>>>>> Bhupesh Chawda
>>>>>>>>>
>>>>>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>>>>>>>>>
>>>>>>>>> www.datatorrent.com  |  apex.apache.org
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Feb 28, 2017 at 10:48 PM, Thomas Weise <th...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> That's correct, we are looking at a generalized approach for state
>>>>>>>>>
>>>>>>>>>> management vs. a series of special cases.
>>>>>>>>>>
>>>>>>>>>> And to be clear, windowing does not imply event time, otherwise it
>>>>>>>>>> would
>>>>>>>>>> be
>>>>>>>>>> "EventTimeOperator" :-)
>>>>>>>>>>
>>>>>>>>>> Thomas
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 28, 2017 at 9:11 AM, Bhupesh Chawda <
>>>>>>>>>>
>>>>>>>>>> bhupesh@datatorrent.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi David,
>>>>>>>>>>
>>>>>>>>>>> I went through the discussion, but it seems like it is more on
>> the
>>>>>>>>>>> event
>>>>>>>>> time watermark handling as opposed to batches. What we are trying
>> to
>>>>>>>>>> do
>>>>>>>>>>
>>>>>>>>> is
>>>>>>>>>
>>>>>>>>>> have watermarks serve the purpose of demarcating batches using
>>>>>>>>>>> control
>>>>>>>>>>> tuples. Since each batch is separate from others, we would like
>> to
>>>>>>>>>>> have
>>>>>>>>> stateful processing within a batch, but not across batches.
>>>>>>>>>
>>>>>>>>>> At the same time, we would like to do this in a manner which is
>>>>>>>>>>> consistent
>>>>>>>>>> with the windowing mechanism provided by the windowed operator.
>> This
>>>>>>>>>>> will
>>>>>>>>>> allow us to treat a single batch as a (bounded) stream and apply
>> all
>>>>>>>>>>> the
>>>>>>>>> event time windowing concepts in that time span.
>>>>>>>>>
>>>>>>>>>> For example, let's say I need to process data for a day (24
>> hours) as
>>>>>>>>>>> a
>>>>>>>>> single batch. The application is still streaming in nature: it
>> would
>>>>>>>>>> end
>>>>>>>>>>
>>>>>>>>> the batch after a day and start a new batch the next day. At the
>> same
>>>>>>>>>> time,
>>>>>>>>>>
>>>>>>>>>> I would be able to have early trigger firings every minute as
>> well as
>>>>>>>>>>> drop
>>>>>>>>>> any data which is say, 5 mins late. All this within a single day.
>>>>>>>>>>> ~ Bhupesh
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________________
>>>>>>>>>>>
>>>>>>>>>>> Bhupesh Chawda
>>>>>>>>>>>


Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by AJAY GUPTA <aj...@gmail.com>.
Even this will not work, because the output port of CsvParser is of type
Object. Even though Customer extends Tuple<Object>, it will still fail,
since the Tuple<Object> is emitted as a plain Object:

*DefaultOutputPort<Object> output = new DefaultOutputPort<Object>();*

The input port type at the windowed operator, with InputT = Object, is:
*DefaultInputPort<Tuple<Object>>*


Ajay
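
A minimal sketch of the mismatch described above, assuming a Customer POJO and
simplified port declarations (illustrative only, not taken from the Malhar
sources):

import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import org.apache.apex.malhar.lib.window.Tuple;

// Sketch only: shows why dag.addStream(...) cannot connect the two ports.
public class PortMismatchSketch
{
  // CsvParser-style output port: the parsed POJO leaves the operator typed as Object.
  public final transient DefaultOutputPort<Object> output = new DefaultOutputPort<Object>();

  // WindowedOperatorImpl-style input port with InputT = Object.
  public final transient DefaultInputPort<Tuple<Object>> input = new DefaultInputPort<Tuple<Object>>()
  {
    @Override
    public void process(Tuple<Object> tuple)
    {
      // windowed processing would consume the tuple here
    }
  };

  // dag.addStream("parsed", output, input) does not compile:
  // an OutputPort<Object> cannot feed an InputPort<Tuple<Object>>, even when
  // the emitted object itself implements Tuple<Object>.
}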


On Sun, Apr 30, 2017 at 1:45 AM, Vlad Rozov <v....@datatorrent.com> wrote:

> Use Object in place of InputT in the WindowedOperatorImpl. Cast Object to
> the actual type of InputT at runtime. Introducing an operator just to do a
> cast is not a good design decision, IMO.
>
> Thank you,
> Vlad
>
> Sent from iPhone
>
> > On Apr 29, 2017, at 02:50, AJAY GUPTA <aj...@gmail.com> wrote:
> >
> > I am using WindowedOperatorImpl and it is declared as follows.
> >
> > WindowedOperatorImpl<InputT, AccumulationType, OutputType>
> windowedOperator
> > = new WindowedOperatorImpl<>();
> >
> > In my application scenario, the InputT is Customer POJO which is getting
> > output as an Object by CsvParser.
> >
> >
> > Ajay
> >
> > On Fri, Apr 28, 2017 at 11:53 PM, Vlad Rozov <v....@datatorrent.com>
> > wrote:
> >
> >> How do you declare WindowedOperator?
> >>
> >> Thank you,
> >>
> >> Vlad
> >>
> >>
> >>> On 4/28/17 10:35, AJAY GUPTA wrote:
> >>>
> >>> Vlad,
> >>>
> >>> The approach you suggested doesn't work because the CSVParser outputs
> >>> Object Data Type irrespective of the POJO class being emitted.
> >>>
> >>>
> >>> Ajay
> >>>
> >>> On Fri, Apr 28, 2017 at 8:13 PM, Vlad Rozov <v....@datatorrent.com>
> >>> wrote:
> >>>
> >>> Make your POJO class implement WindowedOperator Tuple interface (it may
> >>>> return itself in getValue()).
> >>>>
> >>>> Thank you,
> >>>>
> >>>> Vlad
> >>>>
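
A minimal sketch of the suggestion quoted above, assuming a Customer POJO with
illustrative fields (not actual Malhar code):

import org.apache.apex.malhar.lib.window.Tuple;

// Sketch only: the POJO is its own Tuple, so it can be emitted directly to a
// DefaultInputPort<Tuple<Customer>> without a converter operator.
public class Customer implements Tuple<Customer>
{
  private String name;   // assumed field
  private long amount;   // assumed field

  @Override
  public Customer getValue()
  {
    return this;
  }
}
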
> >>>> On 4/28/17 02:44, AJAY GUPTA wrote:
> >>>>
> >>>> Hi All,
> >>>>>
> >>>>> I am creating an application which is using Windowed Operator. This
> >>>>> application involves CsvParser operator emitting a POJO object which is
> >>>>> to be passed as input to WindowedOperator. The WindowedOperator requires
> >>>>> an instance of Tuple class as input:
> >>>>> *public final transient DefaultInputPort<Tuple<InputT>>
> >>>>> input = new DefaultInputPort<Tuple<InputT>>() *
> >>>>>
> >>>>> Due to this, the addStream cannot work as the output of CsvParser's
> >>>>> output port is not compatible with input port type of WindowedOperator.
> >>>>> One way to solve this problem is to have an operator between the above
> >>>>> two operators as a convertor.
> >>>>> I would like to know if there is any other more generic approach to solve
> >>>>> this problem without writing a new Operator for every new application
> >>>>> using Windowed Operators.
> >>>>>
> >>>>> Thanks,
> >>>>> Ajay
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Thu, Mar 23, 2017 at 5:25 PM, Bhupesh Chawda <
> >>>>> bhupesh@datatorrent.com>
> >>>>> wrote:
> >>>>>
> >>>>> Hi All,
> >>>>>
> >>>>>> I think we have some agreement on the way we should use control tuples
> >>>>>> for File I/O operators to support batch.
> >>>>>>
> >>>>>> In order to have more operators in Malhar support this paradigm, I think
> >>>>>> we should also look at store operators - JDBC, Cassandra, HBase etc.
> >>>>>> The case with these operators is simpler as most of these do not poll the
> >>>>>> sources (except JDBC poller operator) and just stop once they have read a
> >>>>>> fixed amount of data. In other words, these are inherently batch sources.
> >>>>>> The only change that we should add to these operators is to shut down the
> >>>>>> DAG once the reading of data is done. For a windowed operator this would
> >>>>>> mean a Global window with a final watermark before the DAG is shut down.
> >>>>>>
> >>>>>> ~ Bhupesh
> >>>>>>
> >>>>>>
> >>>>>> _______________________________________________________
> >>>>>>
> >>>>>> Bhupesh Chawda
> >>>>>>
> >>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> >>>>>>
> >>>>>> www.datatorrent.com  |  apex.apache.org
> >>>>>>
> >>>>>>
> >>>>>>

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by Vlad Rozov <v....@datatorrent.com>.
Use Object in place of InputT in the WindowedOperatorImpl. Cast Object to the actual type of InputT at runtime. Introducing an operator just to do a cast is not a good design decision, IMO.

Thank you,
Vlad

Sent from iPhone
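
A rough sketch of this approach as it might appear inside populateDAG(),
assuming the Customer POJO from this thread with a hypothetical getAmount()
accessor; the Accumulation methods below are written from memory of the Malhar
interface and may differ in detail:

import org.apache.apex.malhar.lib.window.Accumulation;
import org.apache.apex.malhar.lib.window.impl.WindowedOperatorImpl;

// Declare the operator with Object as InputT so the CsvParser output port can
// be connected, and recover the concrete type with a runtime cast inside the
// accumulation instead of adding a converter operator.
WindowedOperatorImpl<Object, Long, Long> windowedOperator = new WindowedOperatorImpl<>();
windowedOperator.setAccumulation(new Accumulation<Object, Long, Long>()
{
  @Override
  public Long defaultAccumulatedValue()
  {
    return 0L;
  }

  @Override
  public Long accumulate(Long accumulated, Object input)
  {
    Customer customer = (Customer) input;   // cast Object to the actual InputT at runtime
    return accumulated + customer.getAmount();
  }

  @Override
  public Long merge(Long a, Long b)
  {
    return a + b;
  }

  @Override
  public Long getOutput(Long accumulated)
  {
    return accumulated;
  }

  @Override
  public Long getRetraction(Long value)
  {
    return -value;
  }
});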

> On Apr 29, 2017, at 02:50, AJAY GUPTA <aj...@gmail.com> wrote:
> 
> I am using WindowedOperatorImpl and it is declared as follows.
> 
> WindowedOperatorImpl<InputT, AccumulationType, OutputType> windowedOperator
> = new WindowedOperatorImpl<>();
> 
> In my application scenario, the InputT is Customer POJO which is getting
> output as an Object by CsvParser.
> 
> 
> Ajay
> 
> On Fri, Apr 28, 2017 at 11:53 PM, Vlad Rozov <v....@datatorrent.com>
> wrote:
> 
>> How do you declare WindowedOperator?
>> 
>> Thank you,
>> 
>> Vlad
>> 
>> 
>>> On 4/28/17 10:35, AJAY GUPTA wrote:
>>> 
>>> Vlad,
>>> 
>>> The approach you suggested doesn't work because the CSVParser outputs
>>> Object Data Type irrespective of the POJO class being emitted.
>>> 
>>> 
>>> Ajay
>>> 
>>> On Fri, Apr 28, 2017 at 8:13 PM, Vlad Rozov <v....@datatorrent.com>
>>> wrote:
>>> 
>>> Make your POJO class implement WindowedOperator Tuple interface (it may
>>>> return itself in getValue()).
>>>> 
>>>> Thank you,
>>>> 
>>>> Vlad
>>>> 
>>>> On 4/28/17 02:44, AJAY GUPTA wrote:
>>>> 
>>>> Hi All,
>>>>> 
>>>>> I am creating an application which is using Windowed Operator. This
>>>>> application involves CsvParser operator emitting a POJO object which is
>>>>> to
>>>>> be passed as input to WindowedOperator. The WindowedOperator requires an
>>>>> instance of Tuple class as input :
>>>>> *public final transient DefaultInputPort<Tuple<InputT>>
>>>>> input = new DefaultInputPort<Tuple<InputT>>() *
>>>>> 
>>>>> Due to this, the addStream cannot work as the output of CsvParser's
>>>>> output
>>>>> port is not compatible with input port type of WindowedOperator.
>>>>> One way to solve this problem is to have an operator between the above
>>>>> two
>>>>> operators as a convertor.
>>>>> I would like to know if there is any other more generic approach to
>>>>> solve
>>>>> this problem without writing a new Operator for every new application
>>>>> using
>>>>> Windowed Operators.
>>>>> 
>>>>> Thanks,
>>>>> Ajay
>>>>> 
>>>>> 
>>>>> 
>>>>> On Thu, Mar 23, 2017 at 5:25 PM, Bhupesh Chawda <
>>>>> bhupesh@datatorrent.com>
>>>>> wrote:
>>>>> 
>>>>> Hi All,
>>>>> 
>>>>>> I think we have some agreement on the way we should use control tuples
>>>>>> for
>>>>>> File I/O operators to support batch.
>>>>>> 
> >>>>>> In order to have more operators in Malhar support this paradigm, I
>>>>>> think
>>>>>> we should also look at store operators - JDBC, Cassandra, HBase etc.
>>>>>> The case with these operators is simpler as most of these do not poll
>>>>>> the
>>>>>> sources (except JDBC poller operator) and just stop once they have
>>>>>> read a
>>>>>> fixed amount of data. In other words, these are inherently batch
>>>>>> sources.
>>>>>> The only change that we should add to these operators is to shut down
>>>>>> the
>>>>>> DAG once the reading of data is done. For a windowed operator this
>>>>>> would
>>>>>> mean a Global window with a final watermark before the DAG is shut
>>>>>> down.
>>>>>> 
>>>>>> ~ Bhupesh
>>>>>> 
>>>>>> 
>>>>>> _______________________________________________________
>>>>>> 
>>>>>> Bhupesh Chawda
>>>>>> 
>>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>>>>>> 
>>>>>> www.datatorrent.com  |  apex.apache.org
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Tue, Feb 28, 2017 at 10:59 PM, Bhupesh Chawda <
>>>>>> bhupesh@datatorrent.com>
>>>>>> wrote:
>>>>>> 
>>>>>> Hi Thomas,
>>>>>> 
>>>>>>> Even though the windowing operator is not just "event time", it seems
>>>>>>> it
>>>>>>> is too much dependent on the "time" attribute of the incoming tuple.
>>>>>>> This
>>>>>>> is the reason we had to model the file index as a timestamp to solve
>>>>>>> the
>>>>>>> batch case for files.
>>>>>>> Perhaps we should work on increasing the scope of the windowed
>>>>>>> operator
>>>>>>> 
>>>>>>> to
>>>>>> 
>>>>>> consider other types of windows as well. The Sequence option suggested
>>>>>>> by
>>>>>>> David seems to be something in that direction.
>>>>>>> 
>>>>>>> ~ Bhupesh
>>>>>>> 
>>>>>>> 
>>>>>>> _______________________________________________________
>>>>>>> 
>>>>>>> Bhupesh Chawda
>>>>>>> 
>>>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>>>>>>> 
>>>>>>> www.datatorrent.com  |  apex.apache.org
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Feb 28, 2017 at 10:48 PM, Thomas Weise <th...@apache.org>
>>>>>>> wrote:
>>>>>>> 
>>>>>>> That's correct, we are looking at a generalized approach for state
>>>>>>> 
>>>>>>>> management vs. a series of special cases.
>>>>>>>> 
>>>>>>>> And to be clear, windowing does not imply event time, otherwise it
>>>>>>>> would
>>>>>>>> be
>>>>>>>> "EventTimeOperator" :-)
>>>>>>>> 
>>>>>>>> Thomas
>>>>>>>> 
>>>>>>>> On Tue, Feb 28, 2017 at 9:11 AM, Bhupesh Chawda <
>>>>>>>> 
>>>>>>>> bhupesh@datatorrent.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi David,
>>>>>>>> 
>>>>>>>>> I went through the discussion, but it seems like it is more on the
>>>>>>>>> 
>>>>>>>>> event
>>>>>>>> 
>>>>>>> time watermark handling as opposed to batches. What we are trying to
>>>>>>> 
>>>>>>>> do
>>>>>>>> 
>>>>>>> is
>>>>>>> 
>>>>>>>> have watermarks serve the purpose of demarcating batches using
>>>>>>>>> control
>>>>>>>>> tuples. Since each batch is separate from others, we would like to
>>>>>>>>> 
>>>>>>>>> have
>>>>>>>> 
>>>>>>> stateful processing within a batch, but not across batches.
>>>>>>> 
>>>>>>>> At the same time, we would like to do this in a manner which is
>>>>>>>>> 
>>>>>>>>> consistent
>>>>>>>> 
>>>>>>>> with the windowing mechanism provided by the windowed operator. This
>>>>>>>>> 
>>>>>>>>> will
>>>>>>>> 
>>>>>>>> allow us to treat a single batch as a (bounded) stream and apply all
>>>>>>>>> 
>>>>>>>>> the
>>>>>>>> 
>>>>>>> event time windowing concepts in that time span.
>>>>>>> 
>>>>>>>> For example, let's say I need to process data for a day (24 hours) as
>>>>>>>>> 
>>>>>>>>> a
>>>>>>>> 
>>>>>>> single batch. The application is still streaming in nature: it would
>>>>>>> 
>>>>>>>> end
>>>>>>>> 
>>>>>>> the batch after a day and start a new batch the next day. At the same
>>>>>>> 
>>>>>>>> time,
>>>>>>>> 
>>>>>>>> I would be able to have early trigger firings every minute as well as
>>>>>>>>> 
>>>>>>>>> drop
>>>>>>>> 
>>>>>>>> any data which is say, 5 mins late. All this within a single day.
>>>>>>>>> 
>>>>>>>>> ~ Bhupesh
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> _______________________________________________________
>>>>>>>>> 
>>>>>>>>> Bhupesh Chawda
>>>>>>>>> 
>>>>>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>>>>>>>>> 
>>>>>>>>> www.datatorrent.com  |  apex.apache.org
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Tue, Feb 28, 2017 at 9:27 PM, David Yan <da...@gmail.com>
>>>>>>>>> 
>>>>>>>>> wrote:
>>>>>>>> 
>>>>>>> There is a discussion in the Flink mailing list about key-based
>>>>>>> 
>>>>>>>> watermarks.
>>>>>>>>> 
>>>>>>>>> I think it's relevant to our use case here.
>>>>>>>>>> https://lists.apache.org/thread.html/2b90d5b1d5e2654212cfbbc
>>>>>>>>>> c6510ef
>>>>>>>>>> 424bbafc4fadb164bd5aff9216@%3Cdev.flink.apache.org%3E
>>>>>>>>>> 
>>>>>>>>>> David
>>>>>>>>>> 
>>>>>>>>>> On Tue, Feb 28, 2017 at 2:13 AM, Bhupesh Chawda <
>>>>>>>>>> 
>>>>>>>>>> bhupesh@datatorrent.com
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi David,
>>>>>>>>>> 
>>>>>>>>>>> If using time window does not seem appropriate, we can have
>>>>>>>>>>> 
>>>>>>>>>>> another
>>>>>>>>>> 
>>>>>>>>> class
>>>>>>> 
>>>>>>>> which is more suited for such sequential and distinct windows.
>>>>>>>>>> Perhaps, a
>>>>>>>>>> CustomWindow option can be introduced which takes in a window id.
>>>>>>>>>> The
>>>>>>>>>> 
>>>>>>>>> purpose of this window option could be to translate the window id
>>>>>>>>> 
>>>>>>>>>> into
>>>>>>>>>> 
>>>>>>>>> appropriate timestamps.
>>>>>>>>> 
>>>>>>>>>> Another option would be to go with a custom timestampExtractor for
>>>>>>>>>>> 
>>>>>>>>>>> such
>>>>>>>>>> 
>>>>>>>>> tuples which translates the each unique file name to a distinct
>>>>>>>>> 
>>>>>>>>>> timestamp
>>>>>>>>>> while using time windows in the windowed operator.
>>>>>>>>>> 
>>>>>>>>>>> ~ Bhupesh
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> _______________________________________________________
>>>>>>>>>>> 
>>>>>>>>>>> Bhupesh Chawda
>>>>>>>>>>> 
>>>>>>>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>>>>>>>>>>> 
>>>>>>>>>>> www.datatorrent.com  |  apex.apache.org
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, Feb 28, 2017 at 12:28 AM, David Yan <da...@gmail.com>
>>>>>>>>>>> 
>>>>>>>>>>> wrote:
>>>>>>>>>> I now see your rationale on putting the filename in the window.
>>>>>>>>>> 
>>>>>>>>>>> As far as I understand, the reasons why the filename is not part
>>>>>>>>>>>> 
>>>>>>>>>>>> of
>>>>>>>>>>> 
>>>>>>>>>> the
>>>>>>>>> 
>>>>>>>>> key
>>>>>>>>>> 
>>>>>>>>>>> and the Global Window is not used are:
>>>>>>>>>>>> 
>>>>>>>>>>>> 1) The files are processed in sequence, not in parallel
>>>>>>>>>>>> 2) The windowed operator should not keep the state associated
>>>>>>>>>>>> 
>>>>>>>>>>>> with
>>>>>>>>>>> 
>>>>>>>>>> the
>>>>>>> 
>>>>>>>> file
>>>>>>>>>> 
>>>>>>>>>>> when the processing of the file is done
>>>>>>>>>>>> 3) The trigger should be fired for the file when a file is done
>>>>>>>>>>>> 
>>>>>>>>>>>> processing.
>>>>>>>>>>> 
>>>>>>>>>>> However, if the file is just a sequence has nothing to do with a
>>>>>>>>>>>> 
>>>>>>>>>>>> timestamp,
>>>>>>>>>>> 
>>>>>>>>>>> assigning a timestamp to a file is not an intuitive thing to do
>>>>>>>>>>>> 
>>>>>>>>>>>> and
>>>>>>>>>>> 
>>>>>>>>>> would
>>>>>>>>> 
>>>>>>>>>> just create confusions to the users, especially when it's used
>>>>>>>>>>> as
>>>>>>>>>>> 
>>>>>>>>>> an
>>>>>>> 
>>>>>>>> example for new users.
>>>>>>>>> 
>>>>>>>>>> How about having a separate class called SequenceWindow? And
>>>>>>>>>>>> 
>>>>>>>>>>>> perhaps
>>>>>>>>>>> 
>>>>>>>>>> TimeWindow can inherit from it?
>>>>>>>>> 
>>>>>>>>>> David
>>>>>>>>>>>> 
>>>>>>>>>>>> On Mon, Feb 27, 2017 at 8:58 AM, Thomas Weise <th...@apache.org>
>>>>>>>>>>>> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>> On Mon, Feb 27, 2017 at 8:50 AM, Bhupesh Chawda <
>>>>>>>>>> 
>>>>>>>>>>> bhupesh@datatorrent.com
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> I think my comments related to count based windows might be
>>>>>>>>>>>>> causing
>>>>>>>>>>>>> 
>>>>>>>>>>>> confusion. Let's not discuss count based scenarios for now.
>>>>>>>>>> 
>>>>>>>>>>> Just want to make sure we are on the same page wrt. the
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> "each
>>>>>>>>>>>>> 
>>>>>>>>>>>> file
>>>>>>> 
>>>>>>>> is a
>>>>>>>>>> 
>>>>>>>>>>> batch" use case. As mentioned by Thomas, the each tuple from
>>>>>>>>>>>> 
>>>>>>>>>>>>> the
>>>>>>>>>>>>> 
>>>>>>>>>>>> same
>>>>>>>>> 
>>>>>>>>>> file
>>>>>>>>>>> 
>>>>>>>>>>>> has the same timestamp (which is just a sequence number) and
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> that
>>>>>>>>>>>>> 
>>>>>>>>>>>> helps
>>>>>>>>> 
>>>>>>>>>> keep tuples from each file in a separate window.
>>>>>>>>>>>> 
>>>>>>>>>>>>> Yes, in this case it is a sequence number, but it could be a
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> time
>>>>>>>>>>>> 
>>>>>>>>>>> stamp
>>>>>>>>> 

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by AJAY GUPTA <aj...@gmail.com>.
I am using WindowedOperatorImpl and it is declared as follows.

WindowedOperatorImpl<InputT, AccumulationType, OutputType> windowedOperator
= new WindowedOperatorImpl<>();

In my application scenario, InputT is a Customer POJO, which the CsvParser emits as an Object.
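
For reference, below is a minimal sketch of the "converter in between" approach
mentioned earlier in this thread. It is only an illustration and not an existing
Malhar operator: the class name is made up, it assumes Malhar's
Tuple.TimestampedTuple wrapper, and it simply stamps each POJO with processing
time (an event-time field extracted from the POJO could be used instead).

import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.common.util.BaseOperator;
import org.apache.apex.malhar.lib.window.Tuple;

// Hypothetical adapter that wraps each incoming POJO into a windowed-operator
// tuple so the stream can be connected to WindowedOperatorImpl's input port.
public class PojoToWindowedTupleConverter<T> extends BaseOperator
{
  public final transient DefaultOutputPort<Tuple.TimestampedTuple<T>> output =
      new DefaultOutputPort<>();

  public final transient DefaultInputPort<T> input = new DefaultInputPort<T>()
  {
    @Override
    public void process(T pojo)
    {
      // Wrap the POJO with a processing-time timestamp and pass it downstream.
      output.emit(new Tuple.TimestampedTuple<>(System.currentTimeMillis(), pojo));
    }
  };
}

With something along these lines the stream would be wired as CsvParser ->
converter -> WindowedOperatorImpl, which is essentially the generic-converter
question raised earlier in this thread.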


Ajay

On Fri, Apr 28, 2017 at 11:53 PM, Vlad Rozov <v....@datatorrent.com>
wrote:

> How do you declare WindowedOperator?
>
> Thank you,
>
> Vlad
>
>
> On 4/28/17 10:35, AJAY GUPTA wrote:
>
>> Vlad,
>>
>> The approach you suggested doesn't work because the CSVParser outputs
>> Object Data Type irrespective of the POJO class being emitted.
>>
>>
>> Ajay
>>
>> On Fri, Apr 28, 2017 at 8:13 PM, Vlad Rozov <v....@datatorrent.com>
>> wrote:
>>
>> Make your POJO class implement WindowedOperator Tuple interface (it may
>>> return itself in getValue()).
>>>
>>> Thank you,
>>>
>>> Vlad
>>>
>>> On 4/28/17 02:44, AJAY GUPTA wrote:
>>>
>>> Hi All,
>>>>
>>>> I am creating an application which is using Windowed Operator. This
>>>> application involves CsvParser operator emitting a POJO object which is
>>>> to
>>>> be passed as input to WindowedOperator. The WindowedOperator requires an
>>>> instance of Tuple class as input :
>>>> *public final transient DefaultInputPort<Tuple<InputT>>
>>>> input = new DefaultInputPort<Tuple<InputT>>() *
>>>>
>>>> Due to this, the addStream cannot work as the output of CsvParser's
>>>> output
>>>> port is not compatible with input port type of WindowedOperator.
>>>> One way to solve this problem is to have an operator between the above
>>>> two
>>>> operators as a convertor.
>>>> I would like to know if there is any other more generic approach to
>>>> solve
>>>> this problem without writing a new Operator for every new application
>>>> using
>>>> Windowed Operators.
>>>>
>>>> Thanks,
>>>> Ajay
>>>>
>>>>
>>>>
>>>> On Thu, Mar 23, 2017 at 5:25 PM, Bhupesh Chawda <
>>>> bhupesh@datatorrent.com>
>>>> wrote:
>>>>
>>>> Hi All,
>>>>
>>>>> I think we have some agreement on the way we should use control tuples
>>>>> for
>>>>> File I/O operators to support batch.
>>>>>
>>>>> In order to have more operators in Malhar support this paradigm, I think
>>>>> we should also look at store operators - JDBC, Cassandra, HBase etc.
>>>>> The case with these operators is simpler as most of these do not poll
>>>>> the
>>>>> sources (except JDBC poller operator) and just stop once they have
>>>>> read a
>>>>> fixed amount of data. In other words, these are inherently batch
>>>>> sources.
>>>>> The only change that we should add to these operators is to shut down
>>>>> the
>>>>> DAG once the reading of data is done. For a windowed operator this
>>>>> would
>>>>> mean a Global window with a final watermark before the DAG is shut
>>>>> down.
>>>>>
>>>>> ~ Bhupesh
>>>>>
>>>>>
>>>>> _______________________________________________________
>>>>>
>>>>> Bhupesh Chawda
>>>>>
>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>>>>>
>>>>> www.datatorrent.com  |  apex.apache.org
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Feb 28, 2017 at 10:59 PM, Bhupesh Chawda <
>>>>> bhupesh@datatorrent.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Thomas,
>>>>>>
>>>>>> Even though the windowing operator is not just "event time", it seems it
>>>>>> is too much dependent on the "time" attribute of the incoming tuple. This
>>>>>> is the reason we had to model the file index as a timestamp to solve the
>>>>>> batch case for files.
>>>>>> Perhaps we should work on increasing the scope of the windowed operator to
>>>>>> consider other types of windows as well. The Sequence option suggested by
>>>>>> David seems to be something in that direction.
>>>>>>
>>>>>> ~ Bhupesh
>>>>>>
>>>>>>
>>>>>> _______________________________________________________
>>>>>>
>>>>>> Bhupesh Chawda
>>>>>>
>>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>>>>>>
>>>>>> www.datatorrent.com  |  apex.apache.org
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 28, 2017 at 10:48 PM, Thomas Weise <th...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>> That's correct, we are looking at a generalized approach for state
>>>>>>
>>>>>>> management vs. a series of special cases.
>>>>>>>
>>>>>>> And to be clear, windowing does not imply event time, otherwise it
>>>>>>> would
>>>>>>> be
>>>>>>> "EventTimeOperator" :-)
>>>>>>>
>>>>>>> Thomas
>>>>>>>
>>>>>>> On Tue, Feb 28, 2017 at 9:11 AM, Bhupesh Chawda <bhupesh@datatorrent.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi David,
>>>>>>>>
>>>>>>>> I went through the discussion, but it seems like it is more on the event
>>>>>>>> time watermark handling as opposed to batches. What we are trying to do is
>>>>>>>> have watermarks serve the purpose of demarcating batches using control
>>>>>>>> tuples. Since each batch is separate from others, we would like to have
>>>>>>>> stateful processing within a batch, but not across batches.
>>>>>>>>
>>>>>>>> At the same time, we would like to do this in a manner which is consistent
>>>>>>>> with the windowing mechanism provided by the windowed operator. This will
>>>>>>>> allow us to treat a single batch as a (bounded) stream and apply all the
>>>>>>>> event time windowing concepts in that time span.
>>>>>>>>
>>>>>>>> For example, let's say I need to process data for a day (24 hours) as a
>>>>>>>> single batch. The application is still streaming in nature: it would end
>>>>>>>> the batch after a day and start a new batch the next day. At the same time,
>>>>>>>> I would be able to have early trigger firings every minute as well as drop
>>>>>>>> any data which is say, 5 mins late. All this within a single day.
>>>>>>>>
>>>>>>>> ~ Bhupesh
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________________
>>>>>>>>
>>>>>>>> Bhupesh Chawda
>>>>>>>>
>>>>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>>>>>>>>
>>>>>>>> www.datatorrent.com  |  apex.apache.org
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Feb 28, 2017 at 9:27 PM, David Yan <da...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> There is a discussion in the Flink mailing list about key-based watermarks.
>>>>>>>>> I think it's relevant to our use case here.
>>>>>>>>> https://lists.apache.org/thread.html/2b90d5b1d5e2654212cfbbcc6510ef424bbafc4fadb164bd5aff9216@%3Cdev.flink.apache.org%3E
>>>>>>>>>
>>>>>>>>> David
>>>>>>>>>
>>>>>>>>> On Tue, Feb 28, 2017 at 2:13 AM, Bhupesh Chawda <bhupesh@datatorrent.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi David,
>>>>>>>>>>
>>>>>>>>>> If using time window does not seem appropriate, we can have another class
>>>>>>>>>> which is more suited for such sequential and distinct windows. Perhaps, a
>>>>>>>>>> CustomWindow option can be introduced which takes in a window id. The
>>>>>>>>>> purpose of this window option could be to translate the window id into
>>>>>>>>>> appropriate timestamps.
>>>>>>>>>>
>>>>>>>>>> Another option would be to go with a custom timestampExtractor for such
>>>>>>>>>> tuples which translates each unique file name to a distinct timestamp
>>>>>>>>>> while using time windows in the windowed operator.
>>>>>>>>>>
>>>>>>>>>> ~ Bhupesh
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________________
>>>>>>>>>>
>>>>>>>>>> Bhupesh Chawda
>>>>>>>>>>
>>>>>>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>>>>>>>>>>
>>>>>>>>>> www.datatorrent.com  |  apex.apache.org
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by Vlad Rozov <v....@datatorrent.com>.
How do you declare WindowedOperator?

Thank you,

Vlad

On 4/28/17 10:35, AJAY GUPTA wrote:
> Vlad,
>
> The approach you suggested doesn't work because the CSVParser outputs
> Object Data Type irrespective of the POJO class being emitted.
>
>
> Ajay
>
> On Fri, Apr 28, 2017 at 8:13 PM, Vlad Rozov <v....@datatorrent.com> wrote:
>
>> Make your POJO class implement WindowedOperator Tuple interface (it may
>> return itself in getValue()).
>>
>> Thank you,
>>
>> Vlad
>>
>> On 4/28/17 02:44, AJAY GUPTA wrote:
>>
>>> Hi All,
>>>
>>> I am creating an application which is using Windowed Operator. This
>>> application involves CsvParser operator emitting a POJO object which is to
>>> be passed as input to WindowedOperator. The WindowedOperator requires an
>>> instance of Tuple class as input :
>>> *public final transient DefaultInputPort<Tuple<InputT>>
>>> input = new DefaultInputPort<Tuple<InputT>>() *
>>>
>>> Due to this, the addStream cannot work as the output of CsvParser's output
>>> port is not compatible with input port type of WindowedOperator.
>>> One way to solve this problem is to have an operator between the above two
>>> operators as a convertor.
>>> I would like to know if there is any other more generic approach to solve
>>> this problem without writing a new Operator for every new application
>>> using
>>> Windowed Operators.
>>>
>>> Thanks,
>>> Ajay
>>>
>>>
>>>
>>> On Thu, Mar 23, 2017 at 5:25 PM, Bhupesh Chawda <bh...@datatorrent.com>
>>> wrote:
>>>
>>> Hi All,
>>>> I think we have some agreement on the way we should use control tuples
>>>> for
>>>> File I/O operators to support batch.
>>>>
>>>> In order to have more operators in Malhar support this paradigm, I think
>>>> we should also look at store operators - JDBC, Cassandra, HBase etc.
>>>> The case with these operators is simpler as most of these do not poll the
>>>> sources (except JDBC poller operator) and just stop once they have read a
>>>> fixed amount of data. In other words, these are inherently batch sources.
>>>> The only change that we should add to these operators is to shut down the
>>>> DAG once the reading of data is done. For a windowed operator this would
>>>> mean a Global window with a final watermark before the DAG is shut down.
>>>>
>>>> ~ Bhupesh
>>>>
>>>>
>>>> _______________________________________________________
>>>>
>>>> Bhupesh Chawda
>>>>
>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>>>>
>>>> www.datatorrent.com  |  apex.apache.org
>>>>
>>>>
>>>>
>>>> On Tue, Feb 28, 2017 at 10:59 PM, Bhupesh Chawda <
>>>> bhupesh@datatorrent.com>
>>>> wrote:
>>>>
>>>> Hi Thomas,
>>>>> Even though the windowing operator is not just "event time", it seems it
>>>>> is too much dependent on the "time" attribute of the incoming tuple.
>>>>> This
>>>>> is the reason we had to model the file index as a timestamp to solve the
>>>>> batch case for files.
>>>>> Perhaps we should work on increasing the scope of the windowed operator
>>>>>
>>>> to
>>>>
>>>>> consider other types of windows as well. The Sequence option suggested
>>>>> by
>>>>> David seems to be something in that direction.
>>>>>
>>>>> ~ Bhupesh
>>>>>
>>>>>
>>>>> _______________________________________________________
>>>>>
>>>>> Bhupesh Chawda
>>>>>
>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>>>>>
>>>>> www.datatorrent.com  |  apex.apache.org
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Feb 28, 2017 at 10:48 PM, Thomas Weise <th...@apache.org> wrote:
>>>>>
>>>>> That's correct, we are looking at a generalized approach for state
>>>>>> management vs. a series of special cases.
>>>>>>
>>>>>> And to be clear, windowing does not imply event time, otherwise it
>>>>>> would
>>>>>> be
>>>>>> "EventTimeOperator" :-)
>>>>>>
>>>>>> Thomas
>>>>>>
>>>>>> On Tue, Feb 28, 2017 at 9:11 AM, Bhupesh Chawda <
>>>>>>
>>>>> bhupesh@datatorrent.com>
>>>>> wrote:
>>>>>> Hi David,
>>>>>>> I went through the discussion, but it seems like it is more on the
>>>>>>>
>>>>>> event
>>>>> time watermark handling as opposed to batches. What we are trying to
>>>>>> do
>>>>> is
>>>>>>> have watermarks serve the purpose of demarcating batches using control
>>>>>>> tuples. Since each batch is separate from others, we would like to
>>>>>>>
>>>>>> have
>>>>> stateful processing within a batch, but not across batches.
>>>>>>> At the same time, we would like to do this in a manner which is
>>>>>>>
>>>>>> consistent
>>>>>>
>>>>>>> with the windowing mechanism provided by the windowed operator. This
>>>>>>>
>>>>>> will
>>>>>>
>>>>>>> allow us to treat a single batch as a (bounded) stream and apply all
>>>>>>>
>>>>>> the
>>>>> event time windowing concepts in that time span.
>>>>>>> For example, let's say I need to process data for a day (24 hours) as
>>>>>>>
>>>>>> a
>>>>> single batch. The application is still streaming in nature: it would
>>>>>> end
>>>>> the batch after a day and start a new batch the next day. At the same
>>>>>> time,
>>>>>>
>>>>>>> I would be able to have early trigger firings every minute as well as
>>>>>>>
>>>>>> drop
>>>>>>
>>>>>>> any data which is say, 5 mins late. All this within a single day.
>>>>>>>
>>>>>>> ~ Bhupesh
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________________
>>>>>>>
>>>>>>> Bhupesh Chawda
>>>>>>>
>>>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>>>>>>>
>>>>>>> www.datatorrent.com  |  apex.apache.org
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Feb 28, 2017 at 9:27 PM, David Yan <da...@gmail.com>
>>>>>>>
>>>>>> wrote:
>>>>> There is a discussion in the Flink mailing list about key-based
>>>>>>> watermarks.
>>>>>>>
>>>>>>>> I think it's relevant to our use case here.
>>>>>>>> https://lists.apache.org/thread.html/2b90d5b1d5e2654212cfbbcc6510ef
>>>>>>>> 424bbafc4fadb164bd5aff9216@%3Cdev.flink.apache.org%3E
>>>>>>>>
>>>>>>>> David
>>>>>>>>
>>>>>>>> On Tue, Feb 28, 2017 at 2:13 AM, Bhupesh Chawda <
>>>>>>>>
>>>>>>> bhupesh@datatorrent.com
>>>>>>> wrote:
>>>>>>>> Hi David,
>>>>>>>>> If using time window does not seem appropriate, we can have
>>>>>>>>>
>>>>>>>> another
>>>>> class
>>>>>>>> which is more suited for such sequential and distinct windows.
>>>>>>>> Perhaps, a
>>>>>>>> CustomWindow option can be introduced which takes in a window id.
>>>>>>>> The
>>>>>>> purpose of this window option could be to translate the window id
>>>>>>>> into
>>>>>>> appropriate timestamps.
>>>>>>>>> Another option would be to go with a custom timestampExtractor for
>>>>>>>>>
>>>>>>>> such
>>>>>>>> tuples which translates each unique file name to a distinct
>>>>>>>> timestamp
>>>>>>>> while using time windows in the windowed operator.
>>>>>>>>> ~ Bhupesh
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________________
>>>>>>>>>
>>>>>>>>> Bhupesh Chawda
>>>>>>>>>
>>>>>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>>>>>>>>>
>>>>>>>>> www.datatorrent.com  |  apex.apache.org
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Feb 28, 2017 at 12:28 AM, David Yan <da...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I now see your rationale on putting the filename in the window.
>>>>>>>>>>
>>>>>>>>>> As far as I understand, the reasons why the filename is not part of the key
>>>>>>>>>> and the Global Window is not used are:
>>>>>>>>>>
>>>>>>>>>> 1) The files are processed in sequence, not in parallel
>>>>>>>>>> 2) The windowed operator should not keep the state associated with the file
>>>>>>>>>> when the processing of the file is done
>>>>>>>>>> 3) The trigger should be fired for the file when a file is done processing.
>>>>>>>>>>
>>>>>>>>>> However, if the file is just a sequence that has nothing to do with a
>>>>>>>>>> timestamp, assigning a timestamp to a file is not an intuitive thing to do
>>>>>>>>>> and would just create confusion for the users, especially when it's used as
>>>>>>>>>> an example for new users.
>>>>>>>>>>
>>>>>>>>>> How about having a separate class called SequenceWindow? And perhaps
>>>>>>>>>> TimeWindow can inherit from it?
>>>>>>>>>>
>>>>>>>>>> David
>>>>>>>>>>
>>>>>>>>>> On Mon, Feb 27, 2017 at 8:58 AM, Thomas Weise <th...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> On Mon, Feb 27, 2017 at 8:50 AM, Bhupesh Chawda <bhupesh@datatorrent.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I think my comments related to count based windows might be causing
>>>>>>>>>>>> confusion. Let's not discuss count based scenarios for now.
>>>>>>>>>>>>
>>>>>>>>>>>> Just want to make sure we are on the same page wrt. the "each file is a
>>>>>>>>>>>> batch" use case. As mentioned by Thomas, each tuple from the same file has
>>>>>>>>>>>> the same timestamp (which is just a sequence number) and that helps keep
>>>>>>>>>>>> tuples from each file in a separate window.
>>>>>>>>>>>
>>>>>>>>>>> Yes, in this case it is a sequence number, but it could be a time stamp
>>>>>>>>>>> also, depending on the file naming convention. And if it was event time
>>>>>>>>>>> processing, the watermark would be derived from records within the file.
>>>>>>>>>>>
>>>>>>>>>>> Agreed, the source should have a mechanism to control the time stamp
>>>>>>>>>>> extraction along with everything else pertaining to the watermark
>>>>>>>>>>> generation.
>>>>>>>>>>>
>>>>>>>>>>>> We could also implement a "timestampExtractor" interface to identify the
>>>>>>>>>>>> timestamp (sequence number) for a file.
>>>>>>>>>>>>
>>>>>>>>>>>> ~ Bhupesh
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________________
>>>>>>>>>>>>
>>>>>>>>>>>> Bhupesh Chawda
>>>>>>>>>>>>
>>>>>>>>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>>>>>>>>>>>>
>>>>>>>>>>>> www.datatorrent.com  |  apex.apache.org
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Feb 27, 2017 at 9:52 PM, Thomas Weise <thw@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I don't think this is a use case for count based window.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We have multiple files that are retrieved in a sequence and there is no
>>>>>>>>>>>>> knowledge of the number of records per file. The requirement is to
>>>>>>>>>>>>> aggregate each file separately and emit the aggregate when the file is read
>>>>>>>>>>>>> fully. There is no concept of "end of something" for an individual key and
>>>>>>>>>>>>> global window isn't applicable.
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, as already explained and implemented by Bhupesh, this can be
>>>>>>>>>>>>> solved using watermark and window (in this case the window timestamp isn't
>>>>>>>>>>>>> a timestamp, but a file sequence, but that doesn't matter).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thomas
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Feb 27, 2017 at 8:05 AM, David Yan <davidyan@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I don't think this is the way to go. Global Window only means the
>>>>>>>>>>>>>> timestamp does not matter (or that there is no timestamp). It does not
>>>>>>>>>>>>>> necessarily mean it's a large batch. Unless there is some notion of event
>>>>>>>>>>>>>> time for each file, you don't want to embed the file into the window
>>>>>>>>>>>>>> itself. If you want the result broken up by file name, and if the files
>>>>>>>>>>>>>> are to be processed in parallel, I think making the file name be part of
>>>>>>>>>>>>>> the key is the way to go. I think it's very confusing if we somehow make
>>>>>>>>>>>>>> the file to be part of the window.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For count-based window, it's not implemented yet and you're welcome to add
>>>>>>>>>>>>>> that feature. In case of count-based windows, there would be no notion of
>>>>>>>>>>>>>> time and you probably only trigger at the end of each window. In the case
>>>>>>>>>>>>>> of count-based windows, the watermark only matters for batch since you
>>>>>>>>>>>>>> need a way to know when the batch has ended (if the count is 10, the
>>>>>>>>>>>>>> number of tuples in the batch is let's say 105, you need a way to end the
>>>>>>>>>>>>>> last window with 5 tuples).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> David
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Feb 27, 2017 at 2:41 AM, Bhupesh Chawda <bhupesh@datatorrent.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi David,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for your comments.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The wordcount example that I created based on the windowed operator does
>>>>>>>>>>>>>>> processing of word counts per file (each file as a separate batch), i.e.
>>>>>>>>>>>>>>> process counts for each file and dump into separate files.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As I understand, Global window is for one large batch; i.e. all incoming
>>>>>>>>>>>>>>> data falls into the same batch. This could not be processed using the
>>>>>>>>>>>>>>> GlobalWindow option as we need more than one window. In this case, I
>>>>>>>>>>>>>>> configured the windowed operator to have time windows of 1ms each and
>>>>>>>>>>>>>>> passed data for each file with increasing timestamps: (file1, 1),
>>>>>>>>>>>>>>> (file2, 2) and so on. Is there a better way of handling this scenario?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regarding (2 - count based windows), I think there is a trigger option to
>>>>>>>>>>>>>>> process count based windows. In case I want to process every 1000 tuples
>>>>>>>>>>>>>>> as a batch, I could set the Trigger option to CountTrigger with the
>>>>>>>>>>>>>>> accumulation set to Discarding. Is this correct?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I agree that (4. Final Watermark) can be done using Global window.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ~ Bhupesh
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> _______________________________________________________
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Bhupesh Chawda
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> www.datatorrent.com  |  apex.apache.org
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Feb 27, 2017 at 12:18 PM, David Yan <davidyan@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm worried that we are making the watermark concept too complicated.
>>>>>>>>>>>>>>>> Watermarks should simply just tell you what windows can be considered
>>>>>>>>>>>>>>>> complete.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Point 2 is basically a count-based window. Watermarks do not play a role
>>>>>>>>>>>>>>>> here because the window is always complete at the n-th tuple.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If I understand correctly, point 3 is for batch processing of files.
>>>>>>>>>>>>>>>> Unless the files contain timed events, it sounds to me that this can be
>>>>>>>>>>>>>>>> achieved with just a Global Window. For signaling EOF, a watermark with a
>>>>>>>>>>>>>>>> +infinity timestamp can be used so that triggers will be fired upon
>>>>>>>>>>>>>>>> receipt of that watermark.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For point 4, just like what I mentioned above, it can be achieved with a
>>>>>>>>>>>>>>>> watermark with a +infinity timestamp.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> David
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, Feb 18, 2017 at 8:04 AM, Bhupesh Chawda <bhupesh@datatorrent.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Thomas,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For an input operator which is supposed to generate watermarks for
>>>>>>>>>>>>>>>>> downstream operators, I can think about the following watermarks that the
>>>>>>>>>>>>>>>>> operator can emit:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1. Time based watermarks (the high watermark / low watermark)
>>>>>>>>>>>>>>>>> 2. Number of tuple based watermarks (Every n tuples)
>>>>>>>>>>>>>>>>> 3. File based watermarks (Start file, end file)
>>>>>>>>>>>>>>>>> 4. Final watermark
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> File based watermarks seem to be applicable for batch (file based) as
>>>>>>>>>>>>>>>>> well, and hence I thought of looking at these first. Does this seem to be
>>>>>>>>>>>>>>>>> in line with the thought process?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ~ Bhupesh
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> _______________________________________________________
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Bhupesh Chawda
>>>>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> www.datatorrent.com  |  apex.apache.org
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Feb 16, 2017 at 10:37 AM, Thomas Weise <thw@apache.org> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I don't think this should be designed based on a simplistic file
>>>>>>>>>>>>>>>>>> input-output scenario. It would be good to include a stateful
>>>>>>>>>>>>>>>>>> transformation based on event time.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> More complex pipelines contain stateful transformations that depend on
>>>>>>>>>>>>>>>>>> windowing and watermarks. I think we need a watermark concept that is
>>>>>>>>>>>>>>>>>> based on progress in event time (or other monotonic increasing sequence)
>>>>>>>>>>>>>>>>>> that other operators can generically work with.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Note that even file input in many cases can produce time based
>>>>>>>>>>>>>>>>>> watermarks, for example when you read part files that are bound by event
>>>>>>>>>>>>>>>>>> time.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Thomas
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Feb 15, 2017 at 4:02 AM, Bhupesh Chawda <
>>>>>>>>>>>>>>>>>> bhupesh@datatorrent.com


Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by AJAY GUPTA <aj...@gmail.com>.
Vlad,

The approach you suggested doesn't work because the CsvParser's output port
emits the Object type, irrespective of the POJO class being emitted.


Ajay
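
A rough sketch of the converter route mentioned in the quoted message below
(all names are made up, and Tuple.TimestampedTuple is assumed to be the wrapper
in the Malhar window package, so verify against the actual API before use):

import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.common.util.BaseOperator;
import org.apache.apex.malhar.lib.window.Tuple;

// Sits between CsvParser and WindowedOperator and wraps each parsed POJO
// into a Tuple so that the Tuple-typed input port can accept it.
public class PojoToTupleConverter<T> extends BaseOperator
{
  public final transient DefaultOutputPort<Tuple.TimestampedTuple<T>> output =
      new DefaultOutputPort<>();

  public final transient DefaultInputPort<Object> input = new DefaultInputPort<Object>()
  {
    @Override
    @SuppressWarnings("unchecked")
    public void process(Object pojo)
    {
      // Processing time is used as the tuple timestamp here; an event-time field
      // of the POJO (or a file/batch sequence number) could be used instead.
      output.emit(new Tuple.TimestampedTuple<>(System.currentTimeMillis(), (T)pojo));
    }
  };
}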

On Fri, Apr 28, 2017 at 8:13 PM, Vlad Rozov <v....@datatorrent.com> wrote:

> Make your POJO class implement the WindowedOperator's Tuple interface (it may
> return itself in getValue()).
>
> Thank you,
>
> Vlad
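
For reference, a minimal sketch of that suggestion as stated (assuming the
Malhar window Tuple interface exposes a single getValue() method); as noted at
the top of this message, it does not by itself fix the Object-typed output of
CsvParser:

import org.apache.apex.malhar.lib.window.Tuple;

// Illustrative only: the POJO implements Tuple and returns itself, so it can
// be fed directly to a Tuple-typed input port.
public class WordEvent implements Tuple<WordEvent>
{
  private String word;
  private long count;

  @Override
  public WordEvent getValue()
  {
    return this;
  }
}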
>
> On 4/28/17 02:44, AJAY GUPTA wrote:
>
>> Hi All,
>>
>> I am creating an application which is using Windowed Operator. This
>> application involves CsvParser operator emitting a POJO object which is to
>> be passed as input to WindowedOperator. The WindowedOperator requires an
>> instance of Tuple class as input:
>> public final transient DefaultInputPort<Tuple<InputT>>
>> input = new DefaultInputPort<Tuple<InputT>>()
>>
>> Due to this, addStream cannot work, as the type of CsvParser's output
>> port is not compatible with the input port type of WindowedOperator.
>> One way to solve this problem is to have an operator between the above two
>> operators acting as a converter.
>> I would like to know if there is any other more generic approach to solve
>> this problem without writing a new Operator for every new application
>> using
>> Windowed Operators.
>>
>> Thanks,
>> Ajay
>>
>>
>>
>> On Thu, Mar 23, 2017 at 5:25 PM, Bhupesh Chawda <bh...@datatorrent.com>
>> wrote:
>>
>> Hi All,
>>>
>>> I think we have some agreement on the way we should use control tuples
>>> for
>>> File I/O operators to support batch.
>>>
>>> In order to have more operators in Malhar support this paradigm, I think
>>> we should also look at store operators - JDBC, Cassandra, HBase etc.
>>> The case with these operators is simpler as most of these do not poll the
>>> sources (except JDBC poller operator) and just stop once they have read a
>>> fixed amount of data. In other words, these are inherently batch sources.
>>> The only change that we should add to these operators is to shut down the
>>> DAG once the reading of data is done. For a windowed operator this would
>>> mean a Global window with a final watermark before the DAG is shut down.
>>>
>>> ~ Bhupesh
>>>
>>>
>>> _______________________________________________________
>>>
>>> Bhupesh Chawda
>>>
>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>>>
>>> www.datatorrent.com  |  apex.apache.org
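
A rough sketch of that shape for a bounded store source (WatermarkImpl,
ControlTuple and ShutdownException are assumptions about the Apex/Malhar API
and should be checked against the actual classes):

import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.InputOperator;
import com.datatorrent.api.Operator.ShutdownException;
import com.datatorrent.common.util.BaseOperator;
import org.apache.apex.malhar.lib.window.ControlTuple;
import org.apache.apex.malhar.lib.window.Tuple;
import org.apache.apex.malhar.lib.window.impl.WatermarkImpl;

// Reads a fixed data set, emits a final (+infinity) watermark, then shuts the
// DAG down. Sketch only.
public class BoundedStoreInput extends BaseOperator implements InputOperator
{
  public final transient DefaultOutputPort<Tuple.TimestampedTuple<String>> output =
      new DefaultOutputPort<>();
  public final transient DefaultOutputPort<ControlTuple> controlOutput =
      new DefaultOutputPort<>();

  private boolean done = false;

  @Override
  public void emitTuples()
  {
    if (!done) {
      // read the bounded source and emit its records here ...
      done = true;
    } else {
      controlOutput.emit(new WatermarkImpl(Long.MAX_VALUE)); // final watermark
      throw new ShutdownException();                         // end the application
    }
  }
}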
>>>
>>>
>>>
>>> On Tue, Feb 28, 2017 at 10:59 PM, Bhupesh Chawda <
>>> bhupesh@datatorrent.com>
>>> wrote:
>>>
>>> Hi Thomas,
>>>>
>>>> Even though the windowing operator is not just "event time", it seems it
>>>> is too much dependent on the "time" attribute of the incoming tuple.
>>>> This
>>>> is the reason we had to model the file index as a timestamp to solve the
>>>> batch case for files.
>>>> Perhaps we should work on increasing the scope of the windowed operator to
>>>> consider other types of windows as well. The Sequence option suggested by
>>>> David seems to be something in that direction.
>>>>
>>>> ~ Bhupesh
>>>>
>>>>
>>>> _______________________________________________________
>>>>
>>>> Bhupesh Chawda
>>>>
>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>>>>
>>>> www.datatorrent.com  |  apex.apache.org
>>>>
>>>>
>>>>
>>>> On Tue, Feb 28, 2017 at 10:48 PM, Thomas Weise <th...@apache.org> wrote:
>>>>
>>>> That's correct, we are looking at a generalized approach for state
>>>>> management vs. a series of special cases.
>>>>>
>>>>> And to be clear, windowing does not imply event time, otherwise it would
>>>>> be "EventTimeOperator" :-)
>>>>>
>>>>> Thomas
>>>>>
>>>>> On Tue, Feb 28, 2017 at 9:11 AM, Bhupesh Chawda <
>>>>>
>>>> bhupesh@datatorrent.com>
>>>
>>>> wrote:
>>>>>
>>>>>> Hi David,
>>>>>>
>>>>>> I went through the discussion, but it seems like it is more on the event
>>>>>> time watermark handling as opposed to batches. What we are trying to do is
>>>>>> have watermarks serve the purpose of demarcating batches using control
>>>>>> tuples. Since each batch is separate from others, we would like to have
>>>>>> stateful processing within a batch, but not across batches.
>>>>>> At the same time, we would like to do this in a manner which is consistent
>>>>>> with the windowing mechanism provided by the windowed operator. This will
>>>>>> allow us to treat a single batch as a (bounded) stream and apply all the
>>>>>> event time windowing concepts in that time span.
>>>>>>
>>>>>> For example, let's say I need to process data for a day (24 hours) as a
>>>>>> single batch. The application is still streaming in nature: it would end
>>>>>> the batch after a day and start a new batch the next day. At the same time,
>>>>>> I would be able to have early trigger firings every minute as well as drop
>>>>>> any data which is say, 5 mins late. All this within a single day.
>>>>>>
>>>>>> ~ Bhupesh
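
Expressed with the windowed operator's options (assuming the Malhar
WindowOption/TriggerOption builders; type parameters are omitted and the
method names may differ slightly), the day-long batch above would look
roughly like:

import org.joda.time.Duration;
import org.apache.apex.malhar.lib.window.TriggerOption;
import org.apache.apex.malhar.lib.window.WindowOption;
import org.apache.apex.malhar.lib.window.impl.WindowedOperatorImpl;

// One day per batch, early firings every minute, and data more than five
// minutes late is dropped. Sketch only, not a tested configuration.
public class DayBatchWindowing
{
  @SuppressWarnings({"rawtypes", "unchecked"})
  public static void configure(WindowedOperatorImpl op)
  {
    op.setWindowOption(new WindowOption.TimeWindows(Duration.standardDays(1)));
    op.setTriggerOption(TriggerOption.AtWatermark()
        .withEarlyFiringsAtEvery(Duration.standardMinutes(1))
        .discardingFiredPanes());
    op.setAllowedLateness(Duration.standardMinutes(5));
  }
}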
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________________
>>>>>>
>>>>>> Bhupesh Chawda
>>>>>>
>>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>>>>>>
>>>>>> www.datatorrent.com  |  apex.apache.org
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 28, 2017 at 9:27 PM, David Yan <da...@gmail.com>
>>>>>>
>>>>> wrote:
>>>
>>>> There is a discussion in the Flink mailing list about key-based
>>>>>>>
>>>>>> watermarks.
>>>>>>
>>>>>>> I think it's relevant to our use case here.
>>>>>>> https://lists.apache.org/thread.html/2b90d5b1d5e2654212cfbbcc6510ef
>>>>>>> 424bbafc4fadb164bd5aff9216@%3Cdev.flink.apache.org%3E
>>>>>>>
>>>>>>> David

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by David Yan <da...@gmail.com>.
Maybe we should take this to the Beam mailing list and see what people
think about how this problem can be solved using watermarks and windowing? I
think we will get some good suggestions.

David

On Tue, Feb 28, 2017 at 8:17 AM, Thomas Weise <th...@apache.org> wrote:

> I think that discussion is related, but in our example we have many keys
> that belong to a file that all fall into a common window boundary.
>
> WRT the naming of the window ("sequence" etc.), it depends on the use case
> and the operators should be kept generic. It could be a sequence that is
> generated, derived from streaming window, from event data etc.
>
> Thanks,
> Thomas
>
>
> On Tue, Feb 28, 2017 at 7:57 AM, David Yan <da...@gmail.com> wrote:
>
> > There is a discussion in the Flink mailing list about key-based
> watermarks.
> > I think it's relevant to our use case here.
> > https://lists.apache.org/thread.html/2b90d5b1d5e2654212cfbbcc6510ef
> > 424bbafc4fadb164bd5aff9216@%3Cdev.flink.apache.org%3E
> >
> > David
> >
> > On Tue, Feb 28, 2017 at 2:13 AM, Bhupesh Chawda <bhupesh@datatorrent.com
> >
> > wrote:
> >
> > > Hi David,
> > >
> > > If using time window does not seem appropriate, we can have another
> class
> > > which is more suited for such sequential and distinct windows.
> Perhaps, a
> > > CustomWindow option can be introduced which takes in a window id. The
> > > purpose of this window option could be to translate the window id into
> > > appropriate timestamps.
> > >
> > > Another option would be to go with a custom timestampExtractor for such
> > > tuples which translates each unique file name to a distinct
> timestamp
> > > while using time windows in the windowed operator.
> > >
> > > ~ Bhupesh
> > >
> > >
> > > _______________________________________________________
> > >
> > > Bhupesh Chawda
> > >
> > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> > >
> > > www.datatorrent.com  |  apex.apache.org
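
The "timestampExtractor" mentioned above is not an existing Malhar interface;
a hypothetical shape for it (names purely illustrative) could be:

// Maps each file (or batch) to a distinct, monotonically increasing
// "timestamp" so that time windows can keep the batches apart.
public interface TimestampExtractor<T>
{
  long extractTimestamp(String fileName, T tuple);
}

// Example: consecutive sequence numbers in the order files are first seen.
class FileSequenceExtractor<T> implements TimestampExtractor<T>
{
  private final java.util.Map<String, Long> sequences = new java.util.HashMap<>();
  private long nextSequence = 0;

  @Override
  public long extractTimestamp(String fileName, T tuple)
  {
    return sequences.computeIfAbsent(fileName, k -> nextSequence++);
  }
}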
> > >
> > >
> > >
> > > On Tue, Feb 28, 2017 at 12:28 AM, David Yan <da...@gmail.com>
> wrote:
> > >
> > > > I now see your rationale on putting the filename in the window.
> > > > As far as I understand, the reasons why the filename is not part of
> the
> > > key
> > > > and the Global Window is not used are:
> > > >
> > > > 1) The files are processed in sequence, not in parallel
> > > > 2) The windowed operator should not keep the state associated with
> the
> > > file
> > > > when the processing of the file is done
> > > > 3) The trigger should be fired for the file when a file is done
> > > processing.
> > > >
> > > > However, if the file is just a sequence that has nothing to do with a
> > > timestamp,
> > > > assigning a timestamp to a file is not an intuitive thing to do and
> > would
> > > > just create confusion for the users, especially when it's used as an
> > > > example for new users.
> > > >
> > > > How about having a separate class called SequenceWindow? And perhaps
> > > > TimeWindow can inherit from it?
> > > >
> > > > David
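
Purely as an illustration of the SequenceWindow idea above (not an existing
class; how it would plug into the windowed operator's Window abstraction is
left open):

// Identifies a window by a sequence number (e.g. a file index) rather than by
// a time range.
public class SequenceWindow
{
  private final long sequence;

  public SequenceWindow(long sequence)
  {
    this.sequence = sequence;
  }

  public long getSequence()
  {
    return sequence;
  }

  @Override
  public boolean equals(Object o)
  {
    return o instanceof SequenceWindow && ((SequenceWindow)o).sequence == sequence;
  }

  @Override
  public int hashCode()
  {
    return Long.hashCode(sequence);
  }
}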
> > > >
> > > > On Mon, Feb 27, 2017 at 8:58 AM, Thomas Weise <th...@apache.org>
> wrote:
> > > >
> > > > > On Mon, Feb 27, 2017 at 8:50 AM, Bhupesh Chawda <
> > > bhupesh@datatorrent.com
> > > > >
> > > > > wrote:
> > > > >
> > > > > > I think my comments related to count based windows might be
> causing
> > > > > > confusion. Let's not discuss count based scenarios for now.
> > > > > >
> > > > > > Just want to make sure we are on the same page wrt. the "each
> file
> > > is a
> > > > > > batch" use case. As mentioned by Thomas, the each tuple from the
> > same
> > > > > file
> > > > > > has the same timestamp (which is just a sequence number) and that
> > > helps
> > > > > > keep tuples from each file in a separate window.
> > > > > >
> > > > >
> > > > > Yes, in this case it is a sequence number, but it could be a time
> > stamp
> > > > > also, depending on the file naming convention. And if it was event
> > time
> > > > > processing, the watermark would be derived from records within the
> > > file.
> > > > >
> > > > > Agreed, the source should have a mechanism to control the time
> stamp
> > > > > extraction along with everything else pertaining to the watermark
> > > > > generation.
> > > > >
> > > > >
> > > > > > We could also implement a "timestampExtractor" interface to
> > identify
> > > > the
> > > > > > timestamp (sequence number) for a file.
> > > > > >
> > > > > > ~ Bhupesh
> > > > > >
> > > > > >
> > > > > > _______________________________________________________
> > > > > >
> > > > > > Bhupesh Chawda
> > > > > >
> > > > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> > > > > >
> > > > > > www.datatorrent.com  |  apex.apache.org
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Mon, Feb 27, 2017 at 9:52 PM, Thomas Weise <th...@apache.org>
> > > wrote:
> > > > > >
> > > > > > > I don't think this is a use case for count based window.
> > > > > > >
> > > > > > > We have multiple files that are retrieved in a sequence and
> there
> > > is
> > > > no
> > > > > > > knowledge of the number of records per file. The requirement is
> > to
> > > > > > > aggregate each file separately and emit the aggregate when the
> > file
> > > > is
> > > > > > read
> > > > > > > fully. There is no concept of "end of something" for an
> > individual
> > > > key
> > > > > > and
> > > > > > > global window isn't applicable.
> > > > > > >
> > > > > > > However, as already explained and implemented by Bhupesh, this
> > can
> > > be
> > > > > > > solved using watermark and window (in this case the window
> > > timestamp
> > > > > > isn't
> > > > > > > a timestamp, but a file sequence), but that doesn't matter.
> > > > > > >
> > > > > > > Thomas
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Feb 27, 2017 at 8:05 AM, David Yan <davidyan@gmail.com
> >
> > > > wrote:
> > > > > > >
> > > > > > > > I don't think this is the way to go. Global Window only means
> > the
> > > > > > > timestamp
> > > > > > > > does not matter (or that there is no timestamp). It does not
> > > > > > necessarily
> > > > > > > > mean it's a large batch. Unless there is some notion of event
> > > time
> > > > > for
> > > > > > > each
> > > > > > > > file, you don't want to embed the file into the window
> itself.
> > > > > > > >
> > > > > > > > If you want the result broken up by file name, and if the
> files
> > > are
> > > > > to
> > > > > > be
> > > > > > > > processed in parallel, I think making the file name be part
> of
> > > the
> > > > > key
> > > > > > is
> > > > > > > > the way to go. I think it's very confusing if we somehow make
> > the
> > > > > file
> > > > > > to
> > > > > > > > be part of the window.
> > > > > > > >
> > > > > > > > For count-based window, it's not implemented yet and you're
> > > welcome
> > > > > to
> > > > > > > add
> > > > > > > > that feature. In case of count-based windows, there would be
> no
> > > > > notion
> > > > > > of
> > > > > > > > time and you probably only trigger at the end of each window.
> > In
> > > > the
> > > > > > case
> > > > > > > > of count-based windows, the watermark only matters for batch
> > > since
> > > > > you
> > > > > > > need
> > > > > > > > a way to know when the batch has ended (if the count is 10,
> the
> > > > > number
> > > > > > of
> > > > > > > > tuples in the batch is let's say 105, you need a way to end
> the
> > > > last
> > > > > > > window
> > > > > > > > with 5 tuples).
> > > > > > > >
> > > > > > > > David
> > > > > > > >
> > > > > > > > On Mon, Feb 27, 2017 at 2:41 AM, Bhupesh Chawda <
> > > > > > bhupesh@datatorrent.com
> > > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi David,
> > > > > > > > >
> > > > > > > > > Thanks for your comments.
> > > > > > > > >
> > > > > > > > > The wordcount example that I created based on the windowed
> > > > operator
> > > > > > > does
> > > > > > > > > processing of word counts per file (each file as a separate
> > > > batch),
> > > > > > > i.e.
> > > > > > > > > process counts for each file and dump into separate files.
> > > > > > > > > As I understand Global window is for one large batch; i.e.
> > all
> > > > > > incoming
> > > > > > > > > data falls into the same batch. This could not be processed
> > > using
> > > > > > > > > GlobalWindow option as we need more than one window. In
> this
> > > > > case, I
> > > > > > > > > configured the windowed operator to have time windows of
> 1ms
> > > each
> > > > > and
> > > > > > > > > passed data for each file with increasing timestamps:
> (file1,
> > > 1),
> > > > > > > (file2,
> > > > > > > > > 2) and so on. Is there a better way of handling this
> > scenario?
> > > > > > > > >
> > > > > > > > > Regarding (2 - count based windows), I think there is a
> > trigger
> > > > > > option
> > > > > > > to
> > > > > > > > > process count based windows. In case I want to process
> every
> > > 1000
> > > > > > > tuples
> > > > > > > > as
> > > > > > > > > a batch, I could set the Trigger option to CountTrigger
> with
> > > the
> > > > > > > > > accumulation set to Discarding. Is this correct?
> > > > > > > > >
> > > > > > > > > I agree that (4. Final Watermark) can be done using Global
> > > > window.
> > > > > > > > >
> > > > > > > > > ​~ Bhupesh​
> > > > > > > > >
> > > > > > > > > _______________________________________________________
> > > > > > > > >
> > > > > > > > > Bhupesh Chawda
> > > > > > > > >
> > > > > > > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> > > > > > > > >
> > > > > > > > > www.datatorrent.com  |  apex.apache.org
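
A sketch of the per-file stamping described above (Tuple.TimestampedTuple and
KeyValPair are assumed Malhar types): with 1 ms time windows on the windowed
operator, giving every tuple of a file the same sequence number as its
"timestamp" puts each file in its own window.

import org.apache.apex.malhar.lib.window.Tuple;
import com.datatorrent.lib.util.KeyValPair;

// Helper used by the file reader in this scheme: all tuples read from the
// current file are stamped with that file's sequence number.
public class FileSequenceStamper
{
  private long fileSequence = -1;

  public void onStartFile()
  {
    fileSequence++;
  }

  public Tuple.TimestampedTuple<KeyValPair<String, Long>> stamp(String word)
  {
    return new Tuple.TimestampedTuple<>(fileSequence, new KeyValPair<>(word, 1L));
  }
}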
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Mon, Feb 27, 2017 at 12:18 PM, David Yan <
> > > davidyan@gmail.com>
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > I'm worried that we are making the watermark concept too
> > > > > > complicated.
> > > > > > > > > >
> > > > > > > > > > Watermarks should simply just tell you what windows can
> be
> > > > > > considered
> > > > > > > > > > complete.
> > > > > > > > > >
> > > > > > > > > > Point 2 is basically a count-based window. Watermarks do
> > not
> > > > > play a
> > > > > > > > role
> > > > > > > > > > here because the window is always complete at the n-th
> > tuple.
> > > > > > > > > >
> > > > > > > > > > If I understand correctly, point 3 is for batch
> processing
> > of
> > > > > > files.
> > > > > > > > > Unless
> > > > > > > > > > the files contain timed events, it sounds to me that this
> > can
> > > > be
> > > > > > > > achieved
> > > > > > > > > > with just a Global Window. For signaling EOF, a watermark
> > > with
> > > > a
> > > > > > > > > +infinity
> > > > > > > > > > timestamp can be used so that triggers will be fired upon
> > > > receipt
> > > > > > of
> > > > > > > > that
> > > > > > > > > > watermark.
> > > > > > > > > >
> > > > > > > > > > For point 4, just like what I mentioned above, can be
> > > achieved
> > > > > > with a
> > > > > > > > > > watermark with a +infinity timestamp.
> > > > > > > > > >
> > > > > > > > > > David
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Sat, Feb 18, 2017 at 8:04 AM, Bhupesh Chawda <
> > > > > > > > bhupesh@datatorrent.com
> > > > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Thomas,
> > > > > > > > > > >
> > > > > > > > > > > For an input operator which is supposed to generate
> > > > watermarks
> > > > > > for
> > > > > > > > > > > downstream operators, I can think about the following
> > > > > watermarks
> > > > > > > that
> > > > > > > > > the
> > > > > > > > > > > operator can emit:
> > > > > > > > > > > 1. Time based watermarks (the high watermark / low
> > > watermark)
> > > > > > > > > > > 2. Number of tuple based watermarks (Every n tuples)
> > > > > > > > > > > 3. File based watermarks (Start file, end file)
> > > > > > > > > > > 4. Final watermark
> > > > > > > > > > >
> > > > > > > > > > > File based watermarks seem to be applicable for batch
> > (file
> > > > > > based)
> > > > > > > as
> > > > > > > > > > well,
> > > > > > > > > > > and hence I thought of looking at these first. Does
> this
> > > seem
> > > > > to
> > > > > > be
> > > > > > > > in
> > > > > > > > > > line
> > > > > > > > > > > with the thought process?
> > > > > > > > > > >
> > > > > > > > > > > ~ Bhupesh
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > ______________________________
> _________________________
> > > > > > > > > > >
> > > > > > > > > > > Bhupesh Chawda
> > > > > > > > > > >
> > > > > > > > > > > Software Engineer
> > > > > > > > > > >
> > > > > > > > > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> > > > > > > > > > >
> > > > > > > > > > > www.datatorrent.com  |  apex.apache.org
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Feb 16, 2017 at 10:37 AM, Thomas Weise <
> > > > thw@apache.org
> > > > > >
> > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > I don't think this should be designed based on a
> > > simplistic
> > > > > > file
> > > > > > > > > > > > input-output scenario. It would be good to include a
> > > > stateful
> > > > > > > > > > > > transformation based on event time.
> > > > > > > > > > > >
> > > > > > > > > > > > More complex pipelines contain stateful
> transformations
> > > > that
> > > > > > > depend
> > > > > > > > > on
> > > > > > > > > > > > windowing and watermarks. I think we need a watermark
> > > > concept
> > > > > > > that
> > > > > > > > is
> > > > > > > > > > > based
> > > > > > > > > > > > on progress in event time (or other monotonic
> > increasing
> > > > > > > sequence)
> > > > > > > > > that
> > > > > > > > > > > > other operators can generically work with.
> > > > > > > > > > > >
> > > > > > > > > > > > Note that even file input in many cases can produce
> > time
> > > > > based
> > > > > > > > > > > watermarks,
> > > > > > > > > > > > for example when you read part files that are bound
> by
> > > > event
> > > > > > > time.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Thomas
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Feb 15, 2017 at 4:02 AM, Bhupesh Chawda <
> > > > > > > > > > bhupesh@datatorrent.com
> > > > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > For better understanding the use case for control tuples in batch, I am
> > > > > > > > > > > > > creating a prototype for a batch application using File Input and File
> > > > > > > > > > > > > Output operators.
> > > > > > > > > > > > >
> > > > > > > > > > > > > To enable basic batch processing for File IO operators, I am proposing the
> > > > > > > > > > > > > following changes to File input and output operators:
> > > > > > > > > > > > > 1. File Input operator emits a watermark each time it opens and closes a
> > > > > > > > > > > > > file. These can be "start file" and "end file" watermarks which include the
> > > > > > > > > > > > > corresponding file names. The "start file" tuple should be sent before any
> > > > > > > > > > > > > of the data from that file flows.
> > > > > > > > > > > > > 2. File Input operator can be configured to end the application after a
> > > > > > > > > > > > > single or n scans of the directory (a batch). This is where the operator
> > > > > > > > > > > > > emits the final watermark (the end of application control tuple). This will
> > > > > > > > > > > > > also shutdown the application.
> > > > > > > > > > > > > 3. The File output operator handles these control tuples. "Start file"
> > > > > > > > > > > > > initializes the file name for the incoming tuples. "End file" watermark
> > > > > > > > > > > > > forces a finalize on that file.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The user would be able to enable the operators to send only those
> > > > > > > > > > > > > watermarks that are needed in the application. If none of the options are
> > > > > > > > > > > > > configured, the operators behave as in a streaming application.
> > > > > > > > > > > > >
> > > > > > > > > > > > > There are a few challenges in the implementation where the input operator
> > > > > > > > > > > > > is partitioned. In this case, the correlation between the start/end for a
> > > > > > > > > > > > > file and the data tuples for that file is lost. Hence we need to maintain
> > > > > > > > > > > > > the filename as part of each tuple in the pipeline.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The "start file" and "end file" control tuples in this example are
> > > > > > > > > > > > > temporary names for watermarks. We can have generic "start batch" / "end
> > > > > > > > > > > > > batch" tuples which could be used for other use cases as well. The Final
> > > > > > > > > > > > > watermark is common and serves the same purpose in each case.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Please let me know your thoughts on this.
> > > > > > > > > > > > >
> > > > > > > > > > > > > ~ Bhupesh
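
A sketch of how the output side of that proposal might react to the "start
file" / "end file" watermarks (the control tuple types are hypothetical and
requestFinalize is assumed to be the finalize hook on Malhar's
AbstractFileOutputOperator; verify before relying on it):

import com.datatorrent.lib.io.fs.AbstractFileOutputOperator;

// Hypothetical control tuple types carrying the file name.
class StartFileWatermark { String fileName; }
class EndFileWatermark { String fileName; }

// Batch-aware file writer sketch: "start file" fixes the output file name,
// "end file" asks the base class to finalize it (rename away from .tmp).
public class BatchAwareFileOutput extends AbstractFileOutputOperator<String>
{
  private String currentFileName;

  public void processWatermark(Object control)
  {
    if (control instanceof StartFileWatermark) {
      currentFileName = ((StartFileWatermark)control).fileName;
    } else if (control instanceof EndFileWatermark) {
      requestFinalize(currentFileName);  // assumed finalize hook
    }
  }

  @Override
  protected String getFileName(String tuple)
  {
    return currentFileName;
  }

  @Override
  protected byte[] getBytesForTuple(String tuple)
  {
    return (tuple + "\n").getBytes();
  }
}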
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Jan 18, 2017 at 12:22 AM, Bhupesh Chawda <
> > > > > > > > > > > > bhupesh@datatorrent.com>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Yes, this can be part of operator configuration.
> > > Given
> > > > > > this,
> > > > > > > > for
> > > > > > > > > a
> > > > > > > > > > > user
> > > > > > > > > > > > > to
> > > > > > > > > > > > > > define a batch application, would mean
> configuring
> > > the
> > > > > > > > connectors
> > > > > > > > > > > > (mostly
> > > > > > > > > > > > > > the input operator) in the application for the
> > > desired
> > > > > > > > behavior.
> > > > > > > > > > > > > Similarly,
> > > > > > > > > > > > > > there can be other use cases that can be achieved
> > > other
> > > > > > than
> > > > > > > > > batch.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > We may also need to take care of the following:
> > > > > > > > > > > > > > 1. Make sure that the watermarks or control
> tuples
> > > are
> > > > > > > > consistent
> > > > > > > > > > > > across
> > > > > > > > > > > > > > sources. Meaning an HDFS sink should be able to
> > > > interpret
> > > > > > the
> > > > > > > > > > > watermark
> > > > > > > > > > > > > > tuple sent out by, say, a JDBC source.
> > > > > > > > > > > > > > 2. In addition to I/O connectors, we should also
> > look
> > > > at
> > > > > > the
> > > > > > > > need
> > > > > > > > > > for
> > > > > > > > > > > > > > processing operators to understand some of the
> > > control
> > > > > > > tuples /
> > > > > > > > > > > > > watermarks.
> > > > > > > > > > > > > > For example, we may want to reset the operator
> > > behavior
> > > > > on
> > > > > > > > > arrival
> > > > > > > > > > of
> > > > > > > > > > > > > some
> > > > > > > > > > > > > > watermark tuple.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > ~ Bhupesh
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Tue, Jan 17, 2017 at 9:59 PM, Thomas Weise <
> > > > > > > thw@apache.org>
> > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >> The HDFS source can operate in two modes,
> bounded
> > or
> > > > > > > > unbounded.
> > > > > > > > > If
> > > > > > > > > > > you
> > > > > > > > > > > > > >> scan
> > > > > > > > > > > > > >> only once, then it should emit the final
> watermark
> > > > after
> > > > > > it
> > > > > > > is
> > > > > > > > > > done.
> > > > > > > > > > > > > >> Otherwise it would emit watermarks based on a
> > policy
> > > > > > (files
> > > > > > > > > names
> > > > > > > > > > > > etc.).
> > > > > > > > > > > > > >> The mechanism to generate the marks may depend
> on
> > > the
> > > > > type
> > > > > > > of
> > > > > > > > > > source
> > > > > > > > > > > > and
> > > > > > > > > > > > > >> the user needs to be able to influence/configure
> > it.
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> Thomas
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> On Tue, Jan 17, 2017 at 5:03 AM, Bhupesh Chawda
> <
> > > > > > > > > > > > > bhupesh@datatorrent.com>
> > > > > > > > > > > > > >> wrote:
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> > Hi Thomas,
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > I am not sure that I completely understand
> your
> > > > > > > suggestion.
> > > > > > > > > Are
> > > > > > > > > > > you
> > > > > > > > > > > > > >> > suggesting to broaden the scope of the
> proposal
> > to
> > > > > treat
> > > > > > > all
> > > > > > > > > > > sources
> > > > > > > > > > > > > as
> > > > > > > > > > > > > >> > bounded as well as unbounded?
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > In case of Apex, we treat all sources as
> > unbounded
> > > > > > > sources.
> > > > > > > > > Even
> > > > > > > > > > > > > bounded
> > > > > > > > > > > > > >> > sources like HDFS file source is treated as
> > > > unbounded
> > > > > by
> > > > > > > > means
> > > > > > > > > > of
> > > > > > > > > > > > > >> scanning
> > > > > > > > > > > > > >> > the input directory repeatedly.
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > Let's consider HDFS file source for example:
> > > > > > > > > > > > > >> > In this case, if we treat it as a bounded
> > source,
> > > we
> > > > > can
> > > > > > > > > define
> > > > > > > > > > > > hooks
> > > > > > > > > > > > > >> which
> > > > > > > > > > > > > >> > allows us to detect the end of the file and
> send
> > > the
> > > > > > > "final
> > > > > > > > > > > > > watermark".
> > > > > > > > > > > > > >> We
> > > > > > > > > > > > > >> > could also consider HDFS file source as a
> > > streaming
> > > > > > source
> > > > > > > > and
> > > > > > > > > > > > define
> > > > > > > > > > > > > >> hooks
> > > > > > > > > > > > > >> > which send watermarks based on different kinds
> > of
> > > > > > windows.
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > Please correct me if I misunderstand.
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > ~ Bhupesh
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > On Mon, Jan 16, 2017 at 9:23 PM, Thomas Weise
> <
> > > > > > > > thw@apache.org
> > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > > Bhupesh,
> > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > >> > > Please see how that can be solved in a
> unified
> > > way
> > > > > > using
> > > > > > > > > > windows
> > > > > > > > > > > > and
> > > > > > > > > > > > > >> > > watermarks. It is bounded data vs. unbounded
> > > data.
> > > > > In
> > > > > > > Beam
> > > > > > > > > for
> > > > > > > > > > > > > >> example,
> > > > > > > > > > > > > >> > you
> > > > > > > > > > > > > >> > > can use the "global window" and the final
> > > > watermark
> > > > > to
> > > > > > > > > > > accomplish
> > > > > > > > > > > > > what
> > > > > > > > > > > > > >> > you
> > > > > > > > > > > > > >> > > are looking for. Batch is just a special
> case
> > of
> > > > > > > streaming
> > > > > > > > > > where
> > > > > > > > > > > > the
> > > > > > > > > > > > > >> > source
> > > > > > > > > > > > > >> > > emits the final watermark.
> > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > >> > > Thanks,
> > > > > > > > > > > > > >> > > Thomas
> > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > >> > > On Mon, Jan 16, 2017 at 1:02 AM, Bhupesh
> > Chawda
> > > <
> > > > > > > > > > > > > >> bhupesh@datatorrent.com
> > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > >> > > wrote:
> > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > >> > > > Yes, if the user needs to develop a batch
> > > > > > application,
> > > > > > > > > then
> > > > > > > > > > > > batch
> > > > > > > > > > > > > >> aware
> > > > > > > > > > > > > >> > > > operators need to be used in the
> > application.
> > > > > > > > > > > > > >> > > > The nature of the application is mostly
> > > > controlled
> > > > > > by
> > > > > > > > the
> > > > > > > > > > > input
> > > > > > > > > > > > > and
> > > > > > > > > > > > > >> the
> > > > > > > > > > > > > >> > > > output operators used in the application.
> > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > >> > > > For example, consider an application which
> > > needs
> > > > > to
> > > > > > > > filter
> > > > > > > > > > > > records
> > > > > > > > > > > > > >> in a
> > > > > > > > > > > > > >> > > > input file and store the filtered records
> in
> > > > > another
> > > > > > > > file.
> > > > > > > > > > The
> > > > > > > > > > > > > >> nature
> > > > > > > > > > > > > >> > of
> > > > > > > > > > > > > >> > > > this app is to end once the entire file is
> > > > > > processed.
> > > > > > > > > > > Following
> > > > > > > > > > > > > >> things
> > > > > > > > > > > > > >> > > are
> > > > > > > > > > > > > >> > > > expected of the application:
> > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > >> > > >    1. Once the input data is over,
> finalize
> > > the
> > > > > > output
> > > > > > > > > file
> > > > > > > > > > > from
> > > > > > > > > > > > > >> .tmp
> > > > > > > > > > > > > >> > > >    files. - Responsibility of output
> > operator
> > > > > > > > > > > > > >> > > >    2. End the application, once the data
> is
> > > read
> > > > > and
> > > > > > > > > > > processed -
> > > > > > > > > > > > > >> > > >    Responsibility of input operator
> > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > >> > > > These functions are essential to allow the
> > > user
> > > > to
> > > > > > do
> > > > > > > > > higher
> > > > > > > > > > > > level
> > > > > > > > > > > > > >> > > > operations like scheduling or running a
> > > workflow
> > > > > of
> > > > > > > > batch
> > > > > > > > > > > > > >> applications.
> > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > >> > > > I am not sure about intermediate
> > (processing)
> > > > > > > operators,
> > > > > > > > > as
> > > > > > > > > > > > there
> > > > > > > > > > > > > >> is no
> > > > > > > > > > > > > >> > > > change in their functionality for batch
> use
> > > > cases.
> > > > > > > > > Perhaps,
> > > > > > > > > > > > > allowing
> > > > > > > > > > > > > >> > > > running multiple batches in a single
> > > application
> > > > > may
> > > > > > > > > require
> > > > > > > > > > > > > similar
> > > > > > > > > > > > > >> > > > changes in processing operators as well.
> > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > >> > > > ~ Bhupesh
> > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > >> > > > On Mon, Jan 16, 2017 at 2:19 PM, Priyanka
> > > > Gugale <
> > > > > > > > > > > > > priyag@apache.org
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > > > wrote:
> > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > >> > > > > Will it make an impression on user that,
> > if
> > > he
> > > > > > has a
> > > > > > > > > batch
> > > > > > > > > > > > > >> usecase he
> > > > > > > > > > > > > >> > > has
> > > > > > > > > > > > > >> > > > > to use batch aware operators only? If
> so,
> > is
> > > > > that
> > > > > > > what
> > > > > > > > > we
> > > > > > > > > > > > > expect?
> > > > > > > > > > > > > >> I
> > > > > > > > > > > > > >> > am
> > > > > > > > > > > > > >> > > > not
> > > > > > > > > > > > > >> > > > > aware of how do we implement batch
> > scenario
> > > so
> > > > > > this
> > > > > > > > > might
> > > > > > > > > > > be a
> > > > > > > > > > > > > >> basic
> > > > > > > > > > > > > >> > > > > question.
> > > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > > >> > > > > -Priyanka
> > > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > > >> > > > > On Mon, Jan 16, 2017 at 12:02 PM,
> Bhupesh
> > > > > Chawda <
> > > > > > > > > > > > > >> > > > bhupesh@datatorrent.com>
> > > > > > > > > > > > > >> > > > > wrote:
> > > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > > >> > > > > > Hi All,
> > > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > > >> > > > > > While design / implementation for
> custom
> > > > > control
> > > > > > > > > tuples
> > > > > > > > > > is
> > > > > > > > > > > > > >> > ongoing, I
> > > > > > > > > > > > > >> > > > > > thought it would be a good idea to
> > > consider
> > > > > its
> > > > > > > > > > usefulness
> > > > > > > > > > > > in
> > > > > > > > > > > > > >> one
> > > > > > > > > > > > > >> > of
> > > > > > > > > > > > > >> > > > the
> > > > > > > > > > > > > >> > > > > > use cases -  batch applications.
> > > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > > >> > > > > > This is a proposal to adapt / extend
> > > > existing
> > > > > > > > > operators
> > > > > > > > > > in
> > > > > > > > > > > > the
> > > > > > > > > > > > > >> > Apache
> > > > > > > > > > > > > >> > > > > Apex
> > > > > > > > > > > > > >> > > > > > Malhar library so that it is easy to
> use
> > > > them
> > > > > in
> > > > > > > > batch
> > > > > > > > > > use
> > > > > > > > > > > > > >> cases.
> > > > > > > > > > > > > >> > > > > > Naturally, this would be applicable
> for
> > > > only a
> > > > > > > > subset
> > > > > > > > > of
> > > > > > > > > > > > > >> operators
> > > > > > > > > > > > > >> > > like
> > > > > > > > > > > > > >> > > > > > File, JDBC and NoSQL databases.
> > > > > > > > > > > > > >> > > > > > For example, for a file based store,
> > (say
> > > > HDFS
> > > > > > > > store),
> > > > > > > > > > we
> > > > > > > > > > > > > could
> > > > > > > > > > > > > >> > have
> > > > > > > > > > > > > >> > > > > > FileBatchInput and FileBatchOutput
> > > operators
> > > > > > which
> > > > > > > > > allow
> > > > > > > > > > > > easy
> > > > > > > > > > > > > >> > > > integration
> > > > > > > > > > > > > >> > > > > > into a batch application. These
> > operators
> > > > > would
> > > > > > be
> > > > > > > > > > > extended
> > > > > > > > > > > > > from
> > > > > > > > > > > > > >> > > their
> > > > > > > > > > > > > >> > > > > > existing implementations and would be
> > > "Batch
> > > > > > > Aware",
> > > > > > > > > in
> > > > > > > > > > > that
> > > > > > > > > > > > > >> they
> > > > > > > > > > > > > >> > may
> > > > > > > > > > > > > >> > > > > > understand the meaning of some
> specific
> > > > > control
> > > > > > > > tuples
> > > > > > > > > > > that
> > > > > > > > > > > > > flow
> > > > > > > > > > > > > >> > > > through
> > > > > > > > > > > > > >> > > > > > the DAG. Start batch and end batch
> seem
> > to
> > > > be
> > > > > > the
> > > > > > > > > > obvious
> > > > > > > > > > > > > >> > candidates
> > > > > > > > > > > > > >> > > > that
> > > > > > > > > > > > > >> > > > > > come to mind. On receipt of such
> control
> > > > > tuples,
> > > > > > > > they
> > > > > > > > > > may
> > > > > > > > > > > > try
> > > > > > > > > > > > > to
> > > > > > > > > > > > > >> > > modify
> > > > > > > > > > > > > >> > > > > the
> > > > > > > > > > > > > >> > > > > > behavior of the operator - to
> > reinitialize
> > > > > some
> > > > > > > > > metrics
> > > > > > > > > > or
> > > > > > > > > > > > > >> finalize
> > > > > > > > > > > > > >> > > an
> > > > > > > > > > > > > >> > > > > > output file for example.
> > > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > > >> > > > > > We can discuss the potential control
> > > tuples
> > > > > and
> > > > > > > > > actions
> > > > > > > > > > in
> > > > > > > > > > > > > >> detail,
> > > > > > > > > > > > > >> > > but
> > > > > > > > > > > > > >> > > > > > first I would like to understand the
> > views
> > > > of
> > > > > > the
> > > > > > > > > > > community
> > > > > > > > > > > > > for
> > > > > > > > > > > > > >> > this
> > > > > > > > > > > > > >> > > > > > proposal.
> > > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > > >> > > > > > ~ Bhupesh
> > > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by Thomas Weise <th...@apache.org>.
I think that discussion is related, but in our example we have many keys
that belong to a file that all fall into a common window boundary.

WRT the naming of the window ("sequence" etc.), it depends on the use case
and the operators should be kept generic. It could be a sequence that is
generated, derived from streaming window, from event data etc.

Thanks,
Thomas
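
To make the "other monotonic increasing sequence" point concrete, the value
that drives the watermark can be abstracted away from event time entirely
(the names below are illustrative only, not part of any existing API):

// The value fed into watermarks only has to be monotonically increasing.
public interface WatermarkSource
{
  long currentWatermark();
}

// Driven by event time observed in the data.
class EventTimeWatermarkSource implements WatermarkSource
{
  private long maxEventTime = Long.MIN_VALUE;

  public void observe(long eventTime)
  {
    maxEventTime = Math.max(maxEventTime, eventTime);
  }

  @Override
  public long currentWatermark()
  {
    return maxEventTime;
  }
}

// Driven by a generated sequence, e.g. one increment per file or per batch.
class SequenceWatermarkSource implements WatermarkSource
{
  private long sequence;

  public void advance()
  {
    sequence++;
  }

  @Override
  public long currentWatermark()
  {
    return sequence;
  }
}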


On Tue, Feb 28, 2017 at 7:57 AM, David Yan <da...@gmail.com> wrote:

> There is a discussion in the Flink mailing list about key-based watermarks.
> I think it's relevant to our use case here.
> https://lists.apache.org/thread.html/2b90d5b1d5e2654212cfbbcc6510ef
> 424bbafc4fadb164bd5aff9216@%3Cdev.flink.apache.org%3E
>
> David
>
> On Tue, Feb 28, 2017 at 2:13 AM, Bhupesh Chawda <bh...@datatorrent.com>
> wrote:
>
> > Hi David,
> >
> > If using time window does not seem appropriate, we can have another class
> > which is more suited for such sequential and distinct windows. Perhaps, a
> > CustomWindow option can be introduced which takes in a window id. The
> > purpose of this window option could be to translate the window id into
> > appropriate timestamps.
> >
> > Another option would be to go with a custom timestampExtractor for such
> > tuples which translates each unique file name to a distinct timestamp
> > while using time windows in the windowed operator.
> >
> > ~ Bhupesh
> >
> >
> > _______________________________________________________
> >
> > Bhupesh Chawda
> >
> > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> >
> > www.datatorrent.com  |  apex.apache.org
> >
> >
> >
> > On Tue, Feb 28, 2017 at 12:28 AM, David Yan <da...@gmail.com> wrote:
> >
> > > I now see your rationale on putting the filename in the window.
> > > As far as I understand, the reasons why the filename is not part of the
> > key
> > > and the Global Window is not used are:
> > >
> > > 1) The files are processed in sequence, not in parallel
> > > 2) The windowed operator should not keep the state associated with the
> > file
> > > when the processing of the file is done
> > > 3) The trigger should be fired for the file when a file is done
> > processing.
> > >
> > > However, if the file is just a sequence that has nothing to do with a
> > > timestamp, assigning a timestamp to a file is not an intuitive thing to
> > > do and would just create confusion for users, especially when it's used
> > > as an example for new users.
> > >
> > > How about having a separate class called SequenceWindow? And perhaps
> > > TimeWindow can inherit from it?
> > >
> > > David
> > >
> > > On Mon, Feb 27, 2017 at 8:58 AM, Thomas Weise <th...@apache.org> wrote:
> > >
> > > > On Mon, Feb 27, 2017 at 8:50 AM, Bhupesh Chawda <
> > bhupesh@datatorrent.com
> > > >
> > > > wrote:
> > > >
> > > > > I think my comments related to count based windows might be causing
> > > > > confusion. Let's not discuss count based scenarios for now.
> > > > >
> > > > > Just want to make sure we are on the same page wrt. the "each file
> > is a
> > > > > batch" use case. As mentioned by Thomas, the each tuple from the
> same
> > > > file
> > > > > has the same timestamp (which is just a sequence number) and that
> > helps
> > > > > keep tuples from each file in a separate window.
> > > > >
> > > >
> > > > Yes, in this case it is a sequence number, but it could be a time
> stamp
> > > > also, depending on the file naming convention. And if it was event
> time
> > > > processing, the watermark would be derived from records within the
> > file.
> > > >
> > > > Agreed, the source should have a mechanism to control the time stamp
> > > > extraction along with everything else pertaining to the watermark
> > > > generation.
> > > >
> > > >
> > > > > We could also implement a "timestampExtractor" interface to
> identify
> > > the
> > > > > timestamp (sequence number) for a file.
> > > > >
> > > > > ~ Bhupesh
> > > > >
> > > > >
> > > > > _______________________________________________________
> > > > >
> > > > > Bhupesh Chawda
> > > > >
> > > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> > > > >
> > > > > www.datatorrent.com  |  apex.apache.org
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Feb 27, 2017 at 9:52 PM, Thomas Weise <th...@apache.org>
> > wrote:
> > > > >
> > > > > > I don't think this is a use case for count based window.
> > > > > >
> > > > > > We have multiple files that are retrieved in a sequence and there
> > is
> > > no
> > > > > > knowledge of the number of records per file. The requirement is
> to
> > > > > > aggregate each file separately and emit the aggregate when the
> file
> > > is
> > > > > read
> > > > > > fully. There is no concept of "end of something" for an
> individual
> > > key
> > > > > and
> > > > > > global window isn't applicable.
> > > > > >
> > > > > > However, as already explained and implemented by Bhupesh, this
> can
> > be
> > > > > > solved using watermark and window (in this case the window
> > timestamp
> > > > > isn't
> > > > > > a timestamp, but a file sequence, but that doesn't matter).
> > > > > >
> > > > > > Thomas
> > > > > >
> > > > > >
> > > > > > On Mon, Feb 27, 2017 at 8:05 AM, David Yan <da...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > > I don't think this is the way to go. Global Window only means
> the
> > > > > > timestamp
> > > > > > > does not matter (or that there is no timestamp). It does not
> > > > > necessarily
> > > > > > > mean it's a large batch. Unless there is some notion of event
> > time
> > > > for
> > > > > > each
> > > > > > > file, you don't want to embed the file into the window itself.
> > > > > > >
> > > > > > > If you want the result broken up by file name, and if the files
> > are
> > > > to
> > > > > be
> > > > > > > processed in parallel, I think making the file name be part of
> > the
> > > > key
> > > > > is
> > > > > > > the way to go. I think it's very confusing if we somehow make
> the
> > > > file
> > > > > to
> > > > > > > be part of the window.
> > > > > > >
> > > > > > > For count-based window, it's not implemented yet and you're
> > welcome
> > > > to
> > > > > > add
> > > > > > > that feature. In case of count-based windows, there would be no
> > > > notion
> > > > > of
> > > > > > > time and you probably only trigger at the end of each window.
> In
> > > the
> > > > > case
> > > > > > > of count-based windows, the watermark only matters for batch
> > since
> > > > you
> > > > > > need
> > > > > > > a way to know when the batch has ended (if the count is 10, the
> > > > number
> > > > > of
> > > > > > > tuples in the batch is let's say 105, you need a way to end the
> > > last
> > > > > > window
> > > > > > > with 5 tuples).
> > > > > > >
> > > > > > > David
> > > > > > >
> > > > > > > On Mon, Feb 27, 2017 at 2:41 AM, Bhupesh Chawda <
> > > > > bhupesh@datatorrent.com
> > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi David,
> > > > > > > >
> > > > > > > > Thanks for your comments.
> > > > > > > >
> > > > > > > > The wordcount example that I created based on the windowed
> > > operator
> > > > > > does
> > > > > > > > processing of word counts per file (each file as a separate
> > > batch),
> > > > > > i.e.
> > > > > > > > process counts for each file and dump into separate files.
> > > > > > > > As I understand Global window is for one large batch; i.e.
> all
> > > > > incoming
> > > > > > > > data falls into the same batch. This could not be processed
> > using
> > > > > > > > GlobalWindow option as we need more than one windows. In this
> > > > case, I
> > > > > > > > configured the windowed operator to have time windows of 1ms
> > each
> > > > and
> > > > > > > > passed data for each file with increasing timestamps: (file1,
> > 1),
> > > > > > (file2,
> > > > > > > > 2) and so on. Is there a better way of handling this
> scenario?
> > > > > > > >
> > > > > > > > Regarding (2 - count based windows), I think there is a
> trigger
> > > > > option
> > > > > > to
> > > > > > > > process count based windows. In case I want to process every
> > 1000
> > > > > > tuples
> > > > > > > as
> > > > > > > > a batch, I could set the Trigger option to CountTrigger with
> > the
> > > > > > > > accumulation set to Discarding. Is this correct?
> > > > > > > >
> > > > > > > > I agree that (4. Final Watermark) can be done using Global
> > > window.
> > > > > > > >
> > > > > > > > ​~ Bhupesh​
> > > > > > > >
> > > > > > > > _______________________________________________________
> > > > > > > >
> > > > > > > > Bhupesh Chawda
> > > > > > > >
> > > > > > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> > > > > > > >
> > > > > > > > www.datatorrent.com  |  apex.apache.org
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Feb 27, 2017 at 12:18 PM, David Yan <
> > davidyan@gmail.com>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > I'm worried that we are making the watermark concept too
> > > > > complicated.
> > > > > > > > >
> > > > > > > > > Watermarks should simply just tell you what windows can be
> > > > > considered
> > > > > > > > > complete.
> > > > > > > > >
> > > > > > > > > Point 2 is basically a count-based window. Watermarks do
> not
> > > > play a
> > > > > > > role
> > > > > > > > > here because the window is always complete at the n-th
> tuple.
> > > > > > > > >
> > > > > > > > > If I understand correctly, point 3 is for batch processing
> of
> > > > > files.
> > > > > > > > Unless
> > > > > > > > > the files contain timed events, it sounds to be that this
> can
> > > be
> > > > > > > achieved
> > > > > > > > > with just a Global Window. For signaling EOF, a watermark
> > with
> > > a
> > > > > > > > +infinity
> > > > > > > > > timestamp can be used so that triggers will be fired upon
> > > receipt
> > > > > of
> > > > > > > that
> > > > > > > > > watermark.
> > > > > > > > >
> > > > > > > > > For point 4, just like what I mentioned above, can be
> > achieved
> > > > > with a
> > > > > > > > > watermark with a +infinity timestamp.
> > > > > > > > >
> > > > > > > > > David
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Sat, Feb 18, 2017 at 8:04 AM, Bhupesh Chawda <
> > > > > > > bhupesh@datatorrent.com
> > > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Thomas,
> > > > > > > > > >
> > > > > > > > > > For an input operator which is supposed to generate
> > > watermarks
> > > > > for
> > > > > > > > > > downstream operators, I can think about the following
> > > > watermarks
> > > > > > that
> > > > > > > > the
> > > > > > > > > > operator can emit:
> > > > > > > > > > 1. Time based watermarks (the high watermark / low
> > watermark)
> > > > > > > > > > 2. Number of tuple based watermarks (Every n tuples)
> > > > > > > > > > 3. File based watermarks (Start file, end file)
> > > > > > > > > > 4. Final watermark
> > > > > > > > > >
> > > > > > > > > > File based watermarks seem to be applicable for batch
> (file
> > > > > based)
> > > > > > as
> > > > > > > > > well,
> > > > > > > > > > and hence I thought of looking at these first. Does this
> > seem
> > > > to
> > > > > be
> > > > > > > in
> > > > > > > > > line
> > > > > > > > > > with the thought process?
> > > > > > > > > >
> > > > > > > > > > ~ Bhupesh
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > _______________________________________________________
> > > > > > > > > >
> > > > > > > > > > Bhupesh Chawda
> > > > > > > > > >
> > > > > > > > > > Software Engineer
> > > > > > > > > >
> > > > > > > > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> > > > > > > > > >
> > > > > > > > > > www.datatorrent.com  |  apex.apache.org
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Thu, Feb 16, 2017 at 10:37 AM, Thomas Weise <
> > > thw@apache.org
> > > > >
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > I don't think this should be designed based on a
> > simplistic
> > > > > file
> > > > > > > > > > > input-output scenario. It would be good to include a
> > > stateful
> > > > > > > > > > > transformation based on event time.
> > > > > > > > > > >
> > > > > > > > > > > More complex pipelines contain stateful transformations
> > > that
> > > > > > depend
> > > > > > > > on
> > > > > > > > > > > windowing and watermarks. I think we need a watermark
> > > concept
> > > > > > that
> > > > > > > is
> > > > > > > > > > based
> > > > > > > > > > > on progress in event time (or other monotonic
> increasing
> > > > > > sequence)
> > > > > > > > that
> > > > > > > > > > > other operators can generically work with.
> > > > > > > > > > >
> > > > > > > > > > > Note that even file input in many cases can produce
> time
> > > > based
> > > > > > > > > > watermarks,
> > > > > > > > > > > for example when you read part files that are bound by
> > > event
> > > > > > time.
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Thomas
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Feb 15, 2017 at 4:02 AM, Bhupesh Chawda <
> > > > > > > > > bhupesh@datatorrent.com
> > > > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > For better understanding the use case for control
> > tuples
> > > in
> > > > > > > batch,
> > > > > > > > ​I
> > > > > > > > > > am
> > > > > > > > > > > > creating a prototype for a batch application using
> File
> > > > Input
> > > > > > and
> > > > > > > > > File
> > > > > > > > > > > > Output operators.
> > > > > > > > > > > >
> > > > > > > > > > > > To enable basic batch processing for File IO
> > operators, I
> > > > am
> > > > > > > > > proposing
> > > > > > > > > > > the
> > > > > > > > > > > > following changes to File input and output operators:
> > > > > > > > > > > > 1. File Input operator emits a watermark each time it
> > > opens
> > > > > and
> > > > > > > > > closes
> > > > > > > > > > a
> > > > > > > > > > > > file. These can be "start file" and "end file"
> > watermarks
> > > > > which
> > > > > > > > > include
> > > > > > > > > > > the
> > > > > > > > > > > > corresponding file names. The "start file" tuple
> should
> > > be
> > > > > sent
> > > > > > > > > before
> > > > > > > > > > > any
> > > > > > > > > > > > of the data from that file flows.
> > > > > > > > > > > > 2. File Input operator can be configured to end the
> > > > > application
> > > > > > > > > after a
> > > > > > > > > > > > single or n scans of the directory (a batch). This is
> > > where
> > > > > the
> > > > > > > > > > operator
> > > > > > > > > > > > emits the final watermark (the end of application
> > control
> > > > > > tuple).
> > > > > > > > > This
> > > > > > > > > > > will
> > > > > > > > > > > > also shutdown the application.
> > > > > > > > > > > > 3. The File output operator handles these control
> > tuples.
> > > > > > "Start
> > > > > > > > > file"
> > > > > > > > > > > > initializes the file name for the incoming tuples.
> "End
> > > > file"
> > > > > > > > > watermark
> > > > > > > > > > > > forces a finalize on that file.
> > > > > > > > > > > >
> > > > > > > > > > > > The user would be able to enable the operators to
> send
> > > only
> > > > > > those
> > > > > > > > > > > > watermarks that are needed in the application. If
> none
> > of
> > > > the
> > > > > > > > options
> > > > > > > > > > are
> > > > > > > > > > > > configured, the operators behave as in a streaming
> > > > > application.
> > > > > > > > > > > >
> > > > > > > > > > > > There are a few challenges in the implementation
> where
> > > the
> > > > > > input
> > > > > > > > > > operator
> > > > > > > > > > > > is partitioned. In this case, the correlation between
> > the
> > > > > > > start/end
> > > > > > > > > > for a
> > > > > > > > > > > > file and the data tuples for that file is lost. Hence
> > we
> > > > need
> > > > > > to
> > > > > > > > > > maintain
> > > > > > > > > > > > the filename as part of each tuple in the pipeline.
> > > > > > > > > > > >
> > > > > > > > > > > > The "start file" and "end file" control tuples in
> this
> > > > > example
> > > > > > > are
> > > > > > > > > > > > temporary names for watermarks. We can have generic
> > > "start
> > > > > > > batch" /
> > > > > > > > > > "end
> > > > > > > > > > > > batch" tuples which could be used for other use cases
> > as
> > > > > well.
> > > > > > > The
> > > > > > > > > > Final
> > > > > > > > > > > > watermark is common and serves the same purpose in
> each
> > > > case.
> > > > > > > > > > > >
> > > > > > > > > > > > Please let me know your thoughts on this.
> > > > > > > > > > > >
> > > > > > > > > > > > ~ Bhupesh
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Jan 18, 2017 at 12:22 AM, Bhupesh Chawda <
> > > > > > > > > > > bhupesh@datatorrent.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Yes, this can be part of operator configuration.
> > Given
> > > > > this,
> > > > > > > for
> > > > > > > > a
> > > > > > > > > > user
> > > > > > > > > > > > to
> > > > > > > > > > > > > define a batch application, would mean configuring
> > the
> > > > > > > connectors
> > > > > > > > > > > (mostly
> > > > > > > > > > > > > the input operator) in the application for the
> > desired
> > > > > > > behavior.
> > > > > > > > > > > > Similarly,
> > > > > > > > > > > > > there can be other use cases that can be achieved
> > other
> > > > > than
> > > > > > > > batch.
> > > > > > > > > > > > >
> > > > > > > > > > > > > We may also need to take care of the following:
> > > > > > > > > > > > > 1. Make sure that the watermarks or control tuples
> > are
> > > > > > > consistent
> > > > > > > > > > > across
> > > > > > > > > > > > > sources. Meaning an HDFS sink should be able to
> > > interpret
> > > > > the
> > > > > > > > > > watermark
> > > > > > > > > > > > > tuple sent out by, say, a JDBC source.
> > > > > > > > > > > > > 2. In addition to I/O connectors, we should also
> look
> > > at
> > > > > the
> > > > > > > need
> > > > > > > > > for
> > > > > > > > > > > > > processing operators to understand some of the
> > control
> > > > > > tuples /
> > > > > > > > > > > > watermarks.
> > > > > > > > > > > > > For example, we may want to reset the operator
> > behavior
> > > > on
> > > > > > > > arrival
> > > > > > > > > of
> > > > > > > > > > > > some
> > > > > > > > > > > > > watermark tuple.
> > > > > > > > > > > > >
> > > > > > > > > > > > > ~ Bhupesh
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, Jan 17, 2017 at 9:59 PM, Thomas Weise <
> > > > > > thw@apache.org>
> > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > >> The HDFS source can operate in two modes, bounded
> or
> > > > > > > unbounded.
> > > > > > > > If
> > > > > > > > > > you
> > > > > > > > > > > > >> scan
> > > > > > > > > > > > >> only once, then it should emit the final watermark
> > > after
> > > > > it
> > > > > > is
> > > > > > > > > done.
> > > > > > > > > > > > >> Otherwise it would emit watermarks based on a
> policy
> > > > > (files
> > > > > > > > names
> > > > > > > > > > > etc.).
> > > > > > > > > > > > >> The mechanism to generate the marks may depend on
> > the
> > > > type
> > > > > > of
> > > > > > > > > source
> > > > > > > > > > > and
> > > > > > > > > > > > >> the user needs to be able to influence/configure
> it.
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> Thomas
> > > > > > > > > > > > >>
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> On Tue, Jan 17, 2017 at 5:03 AM, Bhupesh Chawda <
> > > > > > > > > > > > bhupesh@datatorrent.com>
> > > > > > > > > > > > >> wrote:
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> > Hi Thomas,
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > I am not sure that I completely understand your
> > > > > > suggestion.
> > > > > > > > Are
> > > > > > > > > > you
> > > > > > > > > > > > >> > suggesting to broaden the scope of the proposal
> to
> > > > treat
> > > > > > all
> > > > > > > > > > sources
> > > > > > > > > > > > as
> > > > > > > > > > > > >> > bounded as well as unbounded?
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > In case of Apex, we treat all sources as
> unbounded
> > > > > > sources.
> > > > > > > > Even
> > > > > > > > > > > > bounded
> > > > > > > > > > > > >> > sources like HDFS file source is treated as
> > > unbounded
> > > > by
> > > > > > > means
> > > > > > > > > of
> > > > > > > > > > > > >> scanning
> > > > > > > > > > > > >> > the input directory repeatedly.
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > Let's consider HDFS file source for example:
> > > > > > > > > > > > >> > In this case, if we treat it as a bounded
> source,
> > we
> > > > can
> > > > > > > > define
> > > > > > > > > > > hooks
> > > > > > > > > > > > >> which
> > > > > > > > > > > > >> > allows us to detect the end of the file and send
> > the
> > > > > > "final
> > > > > > > > > > > > watermark".
> > > > > > > > > > > > >> We
> > > > > > > > > > > > >> > could also consider HDFS file source as a
> > streaming
> > > > > source
> > > > > > > and
> > > > > > > > > > > define
> > > > > > > > > > > > >> hooks
> > > > > > > > > > > > >> > which send watermarks based on different kinds
> of
> > > > > windows.
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > Please correct me if I misunderstand.
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > ~ Bhupesh
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > On Mon, Jan 16, 2017 at 9:23 PM, Thomas Weise <
> > > > > > > thw@apache.org
> > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > > Bhupesh,
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > Please see how that can be solved in a unified
> > way
> > > > > using
> > > > > > > > > windows
> > > > > > > > > > > and
> > > > > > > > > > > > >> > > watermarks. It is bounded data vs. unbounded
> > data.
> > > > In
> > > > > > Beam
> > > > > > > > for
> > > > > > > > > > > > >> example,
> > > > > > > > > > > > >> > you
> > > > > > > > > > > > >> > > can use the "global window" and the final
> > > watermark
> > > > to
> > > > > > > > > > accomplish
> > > > > > > > > > > > what
> > > > > > > > > > > > >> > you
> > > > > > > > > > > > >> > > are looking for. Batch is just a special case
> of
> > > > > > streaming
> > > > > > > > > where
> > > > > > > > > > > the
> > > > > > > > > > > > >> > source
> > > > > > > > > > > > >> > > emits the final watermark.
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > Thanks,
> > > > > > > > > > > > >> > > Thomas
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > On Mon, Jan 16, 2017 at 1:02 AM, Bhupesh
> Chawda
> > <
> > > > > > > > > > > > >> bhupesh@datatorrent.com
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > wrote:
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > > Yes, if the user needs to develop a batch
> > > > > application,
> > > > > > > > then
> > > > > > > > > > > batch
> > > > > > > > > > > > >> aware
> > > > > > > > > > > > >> > > > operators need to be used in the
> application.
> > > > > > > > > > > > >> > > > The nature of the application is mostly
> > > controlled
> > > > > by
> > > > > > > the
> > > > > > > > > > input
> > > > > > > > > > > > and
> > > > > > > > > > > > >> the
> > > > > > > > > > > > >> > > > output operators used in the application.
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > For example, consider an application which
> > needs
> > > > to
> > > > > > > filter
> > > > > > > > > > > records
> > > > > > > > > > > > >> in a
> > > > > > > > > > > > >> > > > input file and store the filtered records in
> > > > another
> > > > > > > file.
> > > > > > > > > The
> > > > > > > > > > > > >> nature
> > > > > > > > > > > > >> > of
> > > > > > > > > > > > >> > > > this app is to end once the entire file is
> > > > > processed.
> > > > > > > > > > Following
> > > > > > > > > > > > >> things
> > > > > > > > > > > > >> > > are
> > > > > > > > > > > > >> > > > expected of the application:
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > >    1. Once the input data is over, finalize
> > the
> > > > > output
> > > > > > > > file
> > > > > > > > > > from
> > > > > > > > > > > > >> .tmp
> > > > > > > > > > > > >> > > >    files. - Responsibility of output
> operator
> > > > > > > > > > > > >> > > >    2. End the application, once the data is
> > read
> > > > and
> > > > > > > > > > processed -
> > > > > > > > > > > > >> > > >    Responsibility of input operator
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > These functions are essential to allow the
> > user
> > > to
> > > > > do
> > > > > > > > higher
> > > > > > > > > > > level
> > > > > > > > > > > > >> > > > operations like scheduling or running a
> > workflow
> > > > of
> > > > > > > batch
> > > > > > > > > > > > >> applications.
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > I am not sure about intermediate
> (processing)
> > > > > > operators,
> > > > > > > > as
> > > > > > > > > > > there
> > > > > > > > > > > > >> is no
> > > > > > > > > > > > >> > > > change in their functionality for batch use
> > > cases.
> > > > > > > > Perhaps,
> > > > > > > > > > > > allowing
> > > > > > > > > > > > >> > > > running multiple batches in a single
> > application
> > > > may
> > > > > > > > require
> > > > > > > > > > > > similar
> > > > > > > > > > > > >> > > > changes in processing operators as well.
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > ~ Bhupesh
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > On Mon, Jan 16, 2017 at 2:19 PM, Priyanka
> > > Gugale <
> > > > > > > > > > > > priyag@apache.org
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > > > wrote:
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > > Will it make an impression on user that,
> if
> > he
> > > > > has a
> > > > > > > > batch
> > > > > > > > > > > > >> usecase he
> > > > > > > > > > > > >> > > has
> > > > > > > > > > > > >> > > > > to use batch aware operators only? If so,
> is
> > > > that
> > > > > > what
> > > > > > > > we
> > > > > > > > > > > > expect?
> > > > > > > > > > > > >> I
> > > > > > > > > > > > >> > am
> > > > > > > > > > > > >> > > > not
> > > > > > > > > > > > >> > > > > aware of how do we implement batch
> scenario
> > so
> > > > > this
> > > > > > > > might
> > > > > > > > > > be a
> > > > > > > > > > > > >> basic
> > > > > > > > > > > > >> > > > > question.
> > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > >> > > > > -Priyanka
> > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > >> > > > > On Mon, Jan 16, 2017 at 12:02 PM, Bhupesh
> > > > Chawda <
> > > > > > > > > > > > >> > > > bhupesh@datatorrent.com>
> > > > > > > > > > > > >> > > > > wrote:
> > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > >> > > > > > Hi All,
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > > While design / implementation for custom
> > > > control
> > > > > > > > tuples
> > > > > > > > > is
> > > > > > > > > > > > >> > ongoing, I
> > > > > > > > > > > > >> > > > > > thought it would be a good idea to
> > consider
> > > > its
> > > > > > > > > usefulness
> > > > > > > > > > > in
> > > > > > > > > > > > >> one
> > > > > > > > > > > > >> > of
> > > > > > > > > > > > >> > > > the
> > > > > > > > > > > > >> > > > > > use cases -  batch applications.
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > > This is a proposal to adapt / extend
> > > existing
> > > > > > > > operators
> > > > > > > > > in
> > > > > > > > > > > the
> > > > > > > > > > > > >> > Apache
> > > > > > > > > > > > >> > > > > Apex
> > > > > > > > > > > > >> > > > > > Malhar library so that it is easy to use
> > > them
> > > > in
> > > > > > > batch
> > > > > > > > > use
> > > > > > > > > > > > >> cases.
> > > > > > > > > > > > >> > > > > > Naturally, this would be applicable for
> > > only a
> > > > > > > subset
> > > > > > > > of
> > > > > > > > > > > > >> operators
> > > > > > > > > > > > >> > > like
> > > > > > > > > > > > >> > > > > > File, JDBC and NoSQL databases.
> > > > > > > > > > > > >> > > > > > For example, for a file based store,
> (say
> > > HDFS
> > > > > > > store),
> > > > > > > > > we
> > > > > > > > > > > > could
> > > > > > > > > > > > >> > have
> > > > > > > > > > > > >> > > > > > FileBatchInput and FileBatchOutput
> > operators
> > > > > which
> > > > > > > > allow
> > > > > > > > > > > easy
> > > > > > > > > > > > >> > > > integration
> > > > > > > > > > > > >> > > > > > into a batch application. These
> operators
> > > > would
> > > > > be
> > > > > > > > > > extended
> > > > > > > > > > > > from
> > > > > > > > > > > > >> > > their
> > > > > > > > > > > > >> > > > > > existing implementations and would be
> > "Batch
> > > > > > Aware",
> > > > > > > > in
> > > > > > > > > > that
> > > > > > > > > > > > >> they
> > > > > > > > > > > > >> > may
> > > > > > > > > > > > >> > > > > > understand the meaning of some specific
> > > > control
> > > > > > > tuples
> > > > > > > > > > that
> > > > > > > > > > > > flow
> > > > > > > > > > > > >> > > > through
> > > > > > > > > > > > >> > > > > > the DAG. Start batch and end batch seem
> to
> > > be
> > > > > the
> > > > > > > > > obvious
> > > > > > > > > > > > >> > candidates
> > > > > > > > > > > > >> > > > that
> > > > > > > > > > > > >> > > > > > come to mind. On receipt of such control
> > > > tuples,
> > > > > > > they
> > > > > > > > > may
> > > > > > > > > > > try
> > > > > > > > > > > > to
> > > > > > > > > > > > >> > > modify
> > > > > > > > > > > > >> > > > > the
> > > > > > > > > > > > >> > > > > > behavior of the operator - to
> reinitialize
> > > > some
> > > > > > > > metrics
> > > > > > > > > or
> > > > > > > > > > > > >> finalize
> > > > > > > > > > > > >> > > an
> > > > > > > > > > > > >> > > > > > output file for example.
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > > We can discuss the potential control
> > tuples
> > > > and
> > > > > > > > actions
> > > > > > > > > in
> > > > > > > > > > > > >> detail,
> > > > > > > > > > > > >> > > but
> > > > > > > > > > > > >> > > > > > first I would like to understand the
> views
> > > of
> > > > > the
> > > > > > > > > > community
> > > > > > > > > > > > for
> > > > > > > > > > > > >> > this
> > > > > > > > > > > > >> > > > > > proposal.
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > > ~ Bhupesh
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >>
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by Vlad Rozov <v....@datatorrent.com>.
Make your POJO class implement the WindowedOperator's Tuple interface (it may 
return itself in getValue()).
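
For illustration, a minimal sketch of that approach could look like the
following (assuming Malhar's org.apache.apex.malhar.lib.window.Tuple
interface exposes a single getValue() method; the POJO fields below are
made up):

    import org.apache.apex.malhar.lib.window.Tuple;

    // Hypothetical POJO emitted by the parser; it implements Malhar's Tuple
    // interface directly, so it can be connected to the WindowedOperator's
    // input port without an extra converter operator.
    public class LineRecord implements Tuple<LineRecord>
    {
      private String word;
      private long count;

      // default constructor kept for serialization
      public LineRecord()
      {
      }

      @Override
      public LineRecord getValue()
      {
        // the payload of the tuple is the POJO itself
        return this;
      }

      public String getWord() { return word; }
      public void setWord(String word) { this.word = word; }
      public long getCount() { return count; }
      public void setCount(long count) { this.count = count; }
    }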

Thank you,

Vlad

On 4/28/17 02:44, AJAY GUPTA wrote:
> Hi All,
>
> I am creating an application which uses the Windowed Operator. This
> application involves a CsvParser operator emitting a POJO object which is to
> be passed as input to the WindowedOperator. The WindowedOperator requires an
> instance of the Tuple class as input:
> *public final transient DefaultInputPort<Tuple<InputT>>
> input = new DefaultInputPort<Tuple<InputT>>() *
>
> Due to this, addStream cannot work, as CsvParser's output port is not
> compatible with the input port type of WindowedOperator.
> One way to solve this problem is to have an operator between the above two
> operators as a convertor.
> I would like to know if there is any other more generic approach to solve
> this problem without writing a new Operator for every new application using
> Windowed Operators.
>
> Thanks,
> Ajay
>
>
>
> On Thu, Mar 23, 2017 at 5:25 PM, Bhupesh Chawda <bh...@datatorrent.com>
> wrote:
>
>> Hi All,
>>
>> I think we have some agreement on the way we should use control tuples for
>> File I/O operators to support batch.
>>
>> In order to have more operators in Malhar support this paradigm, I think
>> we should also look at store operators - JDBC, Cassandra, HBase etc.
>> The case with these operators is simpler as most of these do not poll the
>> sources (except JDBC poller operator) and just stop once they have read a
>> fixed amount of data. In other words, these are inherently batch sources.
>> The only change we need to make to these operators is to shut down the
>> DAG once the reading of data is done. For a windowed operator this would
>> mean a Global window with a final watermark before the DAG is shut down.
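
A rough sketch of the shutdown part for such a bounded store reader is shown
below, assuming the usual Apex pattern of throwing ShutdownException from
emitTuples() once the bounded read is complete; the class and method names
(including fetchAllRows) are illustrative placeholders, not Malhar code:

    import com.datatorrent.api.DefaultOutputPort;
    import com.datatorrent.api.InputOperator;
    import com.datatorrent.api.ShutdownException;
    import com.datatorrent.common.util.BaseOperator;
    import java.util.Collections;

    // Illustrative bounded "store" reader: it emits everything it reads once
    // and then asks the engine to shut the DAG down, giving the batch-style
    // behavior described above.
    public class BoundedStoreReader extends BaseOperator implements InputOperator
    {
      public final transient DefaultOutputPort<String> output = new DefaultOutputPort<>();

      private boolean done = false;

      @Override
      public void emitTuples()
      {
        if (done) {
          // reading finished in an earlier window; request DAG shutdown
          throw new ShutdownException();
        }
        for (String row : fetchAllRows()) {
          output.emit(row);
        }
        done = true;
      }

      private Iterable<String> fetchAllRows()
      {
        // placeholder for the actual JDBC / Cassandra / HBase read
        return Collections.emptyList();
      }
    }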
>>
>> ~ Bhupesh
>>
>>
>> _______________________________________________________
>>
>> Bhupesh Chawda
>>
>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>>
>> www.datatorrent.com  |  apex.apache.org
>>
>>
>>
>> On Tue, Feb 28, 2017 at 10:59 PM, Bhupesh Chawda <bh...@datatorrent.com>
>> wrote:
>>
>>> Hi Thomas,
>>>
>>> Even though the windowing operator is not just "event time", it seems
>>> too dependent on the "time" attribute of the incoming tuple. This
>>> is the reason we had to model the file index as a timestamp to solve the
>>> batch case for files.
>>> Perhaps we should work on increasing the scope of the windowed operator
>> to
>>> consider other types of windows as well. The Sequence option suggested by
>>> David seems to be something in that direction.
>>>
>>> ~ Bhupesh
>>>
>>>
>>> _______________________________________________________
>>>
>>> Bhupesh Chawda
>>>
>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>>>
>>> www.datatorrent.com  |  apex.apache.org
>>>
>>>
>>>
>>> On Tue, Feb 28, 2017 at 10:48 PM, Thomas Weise <th...@apache.org> wrote:
>>>
>>>> That's correct, we are looking at a generalized approach for state
>>>> management vs. a series of special cases.
>>>>
>>>> And to be clear, windowing does not imply event time, otherwise it would
>>>> be
>>>> "EventTimeOperator" :-)
>>>>
>>>> Thomas
>>>>
>>>> On Tue, Feb 28, 2017 at 9:11 AM, Bhupesh Chawda <
>> bhupesh@datatorrent.com>
>>>> wrote:
>>>>
>>>>> Hi David,
>>>>>
>>>>> I went through the discussion, but it seems like it is more on the
>> event
>>>>> time watermark handling as opposed to batches. What we are trying to
>> do
>>>> is
>>>>> have watermarks serve the purpose of demarcating batches using control
>>>>> tuples. Since each batch is separate from others, we would like to
>> have
>>>>> stateful processing within a batch, but not across batches.
>>>>> At the same time, we would like to do this in a manner which is
>>>> consistent
>>>>> with the windowing mechanism provided by the windowed operator. This
>>>> will
>>>>> allow us to treat a single batch as a (bounded) stream and apply all
>> the
>>>>> event time windowing concepts in that time span.
>>>>>
>>>>> For example, let's say I need to process data for a day (24 hours) as
>> a
>>>>> single batch. The application is still streaming in nature: it would
>> end
>>>>> the batch after a day and start a new batch the next day. At the same
>>>> time,
>>>>> I would be able to have early trigger firings every minute as well as
>>>> drop
>>>>> any data which is say, 5 mins late. All this within a single day.
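
For reference, a windowed-operator configuration along those lines might look
roughly like the sketch below (assuming the WindowOption/TriggerOption API of
the Malhar windowed operator; accumulation and state storage setup are
omitted, and method names should be verified against the library):

    import org.apache.apex.malhar.lib.window.TriggerOption;
    import org.apache.apex.malhar.lib.window.WindowOption;
    import org.apache.apex.malhar.lib.window.impl.WindowedOperatorImpl;
    import org.joda.time.Duration;

    public class DailyBatchWindowConfig
    {
      // Sketch: day-long windows (one day = one batch), early trigger
      // firings every minute, and 5 minutes of allowed lateness.
      public static void configure(WindowedOperatorImpl<?, ?, ?> op)
      {
        op.setWindowOption(new WindowOption.TimeWindows(Duration.standardDays(1)));
        op.setTriggerOption(TriggerOption.AtWatermark()
            .withEarlyFiringsAtEvery(Duration.standardMinutes(1))
            .accumulatingFiredPanes());
        op.setAllowedLateness(Duration.standardMinutes(5));
      }
    }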
>>>>>
>>>>> ~ Bhupesh
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________________
>>>>>
>>>>> Bhupesh Chawda
>>>>>
>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>>>>>
>>>>> www.datatorrent.com  |  apex.apache.org
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Feb 28, 2017 at 9:27 PM, David Yan <da...@gmail.com>
>> wrote:
>>>>>> There is a discussion in the Flink mailing list about key-based
>>>>> watermarks.
>>>>>> I think it's relevant to our use case here.
>>>>>> https://lists.apache.org/thread.html/2b90d5b1d5e2654212cfbbcc6510ef
>>>>>> 424bbafc4fadb164bd5aff9216@%3Cdev.flink.apache.org%3E
>>>>>>
>>>>>> David
>>>>>>
>>>>>> On Tue, Feb 28, 2017 at 2:13 AM, Bhupesh Chawda <
>>>> bhupesh@datatorrent.com
>>>>>> wrote:
>>>>>>
>>>>>>> Hi David,
>>>>>>>
>>>>>>> If using time window does not seem appropriate, we can have
>> another
>>>>> class
>>>>>>> which is more suited for such sequential and distinct windows.
>>>>> Perhaps, a
>>>>>>> CustomWindow option can be introduced which takes in a window id.
>>>> The
>>>>>>> purpose of this window option could be to translate the window id
>>>> into
>>>>>>> appropriate timestamps.
>>>>>>>
>>>>>>> Another option would be to go with a custom timestampExtractor for
>>>> such
>>>>>>> tuples which translates the each unique file name to a distinct
>>>>> timestamp
>>>>>>> while using time windows in the windowed operator.
>>>>>>>
>>>>>>> ~ Bhupesh
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________________
>>>>>>>
>>>>>>> Bhupesh Chawda
>>>>>>>
>>>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>>>>>>>
>>>>>>> www.datatorrent.com  |  apex.apache.org
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Feb 28, 2017 at 12:28 AM, David Yan <da...@gmail.com>
>>>>> wrote:
>>>>>>>> I now see your rationale on putting the filename in the window.
>>>>>>>> As far as I understand, the reasons why the filename is not part
>>>> of
>>>>> the
>>>>>>> key
>>>>>>>> and the Global Window is not used are:
>>>>>>>>
>>>>>>>> 1) The files are processed in sequence, not in parallel
>>>>>>>> 2) The windowed operator should not keep the state associated
>> with
>>>>> the
>>>>>>> file
>>>>>>>> when the processing of the file is done
>>>>>>>> 3) The trigger should be fired for the file when a file is done
>>>>>>> processing.
>>>>>>>> However, if the file is just a sequence has nothing to do with a
>>>>>>> timestamp,
>>>>>>>> assigning a timestamp to a file is not an intuitive thing to do
>>>> and
>>>>>> would
>>>>>>>> just create confusions to the users, especially when it's used
>> as
>>>> an
>>>>>>>> example for new users.
>>>>>>>>
>>>>>>>> How about having a separate class called SequenceWindow? And
>>>> perhaps
>>>>>>>> TimeWindow can inherit from it?
>>>>>>>>
>>>>>>>> David
>>>>>>>>
>>>>>>>> On Mon, Feb 27, 2017 at 8:58 AM, Thomas Weise <th...@apache.org>
>>>>> wrote:
>>>>>>>>> On Mon, Feb 27, 2017 at 8:50 AM, Bhupesh Chawda <
>>>>>>> bhupesh@datatorrent.com
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I think my comments related to count based windows might be
>>>>> causing
>>>>>>>>>> confusion. Let's not discuss count based scenarios for now.
>>>>>>>>>>
>>>>>>>>>> Just want to make sure we are on the same page wrt. the
>> "each
>>>>> file
>>>>>>> is a
>>>>>>>>>> batch" use case. As mentioned by Thomas, the each tuple from
>>>> the
>>>>>> same
>>>>>>>>> file
>>>>>>>>>> has the same timestamp (which is just a sequence number) and
>>>> that
>>>>>>> helps
>>>>>>>>>> keep tuples from each file in a separate window.
>>>>>>>>>>
>>>>>>>>> Yes, in this case it is a sequence number, but it could be a
>>>> time
>>>>>> stamp
>>>>>>>>> also, depending on the file naming convention. And if it was
>>>> event
>>>>>> time
>>>>>>>>> processing, the watermark would be derived from records within
>>>> the
>>>>>>> file.
>>>>>>>>> Agreed, the source should have a mechanism to control the time
>>>>> stamp
>>>>>>>>> extraction along with everything else pertaining to the
>>>> watermark
>>>>>>>>> generation.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> We could also implement a "timestampExtractor" interface to
>>>>>> identify
>>>>>>>> the
>>>>>>>>>> timestamp (sequence number) for a file.
>>>>>>>>>>
>>>>>>>>>> ~ Bhupesh
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________________
>>>>>>>>>>
>>>>>>>>>> Bhupesh Chawda
>>>>>>>>>>
>>>>>>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>>>>>>>>>>
>>>>>>>>>> www.datatorrent.com  |  apex.apache.org
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Feb 27, 2017 at 9:52 PM, Thomas Weise <
>> thw@apache.org
>>>>>>> wrote:
>>>>>>>>>>> I don't think this is a use case for count based window.
>>>>>>>>>>>
>>>>>>>>>>> We have multiple files that are retrieved in a sequence
>> and
>>>>> there
>>>>>>> is
>>>>>>>> no
>>>>>>>>>>> knowledge of the number of records per file. The
>>>> requirement is
>>>>>> to
>>>>>>>>>>> aggregate each file separately and emit the aggregate when
>>>> the
>>>>>> file
>>>>>>>> is
>>>>>>>>>> read
>>>>>>>>>>> fully. There is no concept of "end of something" for an
>>>>>> individual
>>>>>>>> key
>>>>>>>>>> and
>>>>>>>>>>> global window isn't applicable.
>>>>>>>>>>>
>>>>>>>>>>> However, as already explained and implemented by Bhupesh,
>>>> this
>>>>>> can
>>>>>>> be
>>>>>>>>>>> solved using watermark and window (in this case the window
>>>>>>> timestamp
>>>>>>>>>> isn't
>>>>>>>>>>> a timestamp, but a file sequence, but that doesn't matter.
>>>>>>>>>>>
>>>>>>>>>>> Thomas
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Feb 27, 2017 at 8:05 AM, David Yan <
>>>> davidyan@gmail.com
>>>>>>>> wrote:
>>>>>>>>>>>> I don't think this is the way to go. Global Window only
>>>> means
>>>>>> the
>>>>>>>>>>> timestamp
>>>>>>>>>>>> does not matter (or that there is no timestamp). It does
>>>> not
>>>>>>>>>> necessarily
>>>>>>>>>>>> mean it's a large batch. Unless there is some notion of
>>>> event
>>>>>>> time
>>>>>>>>> for
>>>>>>>>>>> each
>>>>>>>>>>>> file, you don't want to embed the file into the window
>>>>> itself.
>>>>>>>>>>>> If you want the result broken up by file name, and if
>> the
>>>>> files
>>>>>>> are
>>>>>>>>> to
>>>>>>>>>> be
>>>>>>>>>>>> processed in parallel, I think making the file name be
>>>> part
>>>>> of
>>>>>>> the
>>>>>>>>> key
>>>>>>>>>> is
>>>>>>>>>>>> the way to go. I think it's very confusing if we somehow
>>>> make
>>>>>> the
>>>>>>>>> file
>>>>>>>>>> to
>>>>>>>>>>>> be part of the window.
>>>>>>>>>>>>
>>>>>>>>>>>> For count-based window, it's not implemented yet and
>>>> you're
>>>>>>> welcome
>>>>>>>>> to
>>>>>>>>>>> add
>>>>>>>>>>>> that feature. In case of count-based windows, there
>> would
>>>> be
>>>>> no
>>>>>>>>> notion
>>>>>>>>>> of
>>>>>>>>>>>> time and you probably only trigger at the end of each
>>>> window.
>>>>>> In
>>>>>>>> the
>>>>>>>>>> case
>>>>>>>>>>>> of count-based windows, the watermark only matters for
>>>> batch
>>>>>>> since
>>>>>>>>> you
>>>>>>>>>>> need
>>>>>>>>>>>> a way to know when the batch has ended (if the count is
>>>> 10,
>>>>> the
>>>>>>>>> number
>>>>>>>>>> of
>>>>>>>>>>>> tuples in the batch is let's say 105, you need a way to
>>>> end
>>>>> the
>>>>>>>> last
>>>>>>>>>>> window
>>>>>>>>>>>> with 5 tuples).
>>>>>>>>>>>>
>>>>>>>>>>>> David
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Feb 27, 2017 at 2:41 AM, Bhupesh Chawda <
>>>>>>>>>> bhupesh@datatorrent.com
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi David,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for your comments.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The wordcount example that I created based on the
>>>> windowed
>>>>>>>> operator
>>>>>>>>>>> does
>>>>>>>>>>>>> processing of word counts per file (each file as a
>>>> separate
>>>>>>>> batch),
>>>>>>>>>>> i.e.
>>>>>>>>>>>>> process counts for each file and dump into separate
>>>> files.
>>>>>>>>>>>>> As I understand Global window is for one large batch;
>>>> i.e.
>>>>>> all
>>>>>>>>>> incoming
>>>>>>>>>>>>> data falls into the same batch. This could not be
>>>> processed
>>>>>>> using
>>>>>>>>>>>>> GlobalWindow option as we need more than one windows.
>> In
>>>>> this
>>>>>>>>> case, I
>>>>>>>>>>>>> configured the windowed operator to have time windows
>> of
>>>>> 1ms
>>>>>>> each
>>>>>>>>> and
>>>>>>>>>>>>> passed data for each file with increasing timestamps:
>>>>> (file1,
>>>>>>> 1),
>>>>>>>>>>> (file2,
>>>>>>>>>>>>> 2) and so on. Is there a better way of handling this
>>>>>> scenario?
>>>>>>>>>>>>> Regarding (2 - count based windows), I think there is
>> a
>>>>>> trigger
>>>>>>>>>> option
>>>>>>>>>>> to
>>>>>>>>>>>>> process count based windows. In case I want to process
>>>>> every
>>>>>>> 1000
>>>>>>>>>>> tuples
>>>>>>>>>>>> as
>>>>>>>>>>>>> a batch, I could set the Trigger option to
>> CountTrigger
>>>>> with
>>>>>>> the
>>>>>>>>>>>>> accumulation set to Discarding. Is this correct?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I agree that (4. Final Watermark) can be done using
>>>> Global
>>>>>>>> window.
>>>>>>>>>>>>> ~ Bhupesh
>>>>>>>>>>>>>
>>>>>>>>>>>>> ______________________________
>> _________________________
>>>>>>>>>>>>> Bhupesh Chawda
>>>>>>>>>>>>>
>>>>>>>>>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>>>>>>>>>>>>>
>>>>>>>>>>>>> www.datatorrent.com  |  apex.apache.org
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Feb 27, 2017 at 12:18 PM, David Yan <
>>>>>>> davidyan@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> I'm worried that we are making the watermark concept
>>>> too
>>>>>>>>>> complicated.
>>>>>>>>>>>>>> Watermarks should simply just tell you what windows
>>>> can
>>>>> be
>>>>>>>>>> considered
>>>>>>>>>>>>>> complete.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Point 2 is basically a count-based window.
>> Watermarks
>>>> do
>>>>>> not
>>>>>>>>> play a
>>>>>>>>>>>> role
>>>>>>>>>>>>>> here because the window is always complete at the
>> n-th
>>>>>> tuple.
>>>>>>>>>>>>>> If I understand correctly, point 3 is for batch
>>>>> processing
>>>>>> of
>>>>>>>>>> files.
>>>>>>>>>>>>> Unless
>>>>>>>>>>>>>> the files contain timed events, it sounds to be that
>>>> this
>>>>>> can
>>>>>>>> be
>>>>>>>>>>>> achieved
>>>>>>>>>>>>>> with just a Global Window. For signaling EOF, a
>>>> watermark
>>>>>>> with
>>>>>>>> a
>>>>>>>>>>>>> +infinity
>>>>>>>>>>>>>> timestamp can be used so that triggers will be fired
>>>> upon
>>>>>>>> receipt
>>>>>>>>>> of
>>>>>>>>>>>> that
>>>>>>>>>>>>>> watermark.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For point 4, just like what I mentioned above, can
>> be
>>>>>>> achieved
>>>>>>>>>> with a
>>>>>>>>>>>>>> watermark with a +infinity timestamp.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> David
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, Feb 18, 2017 at 8:04 AM, Bhupesh Chawda <
>>>>>>>>>>>> bhupesh@datatorrent.com
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Thomas,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For an input operator which is supposed to
>> generate
>>>>>>>> watermarks
>>>>>>>>>> for
>>>>>>>>>>>>>>> downstream operators, I can think about the
>>>> following
>>>>>>>>> watermarks
>>>>>>>>>>> that
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> operator can emit:
>>>>>>>>>>>>>>> 1. Time based watermarks (the high watermark / low
>>>>>>> watermark)
>>>>>>>>>>>>>>> 2. Number of tuple based watermarks (Every n
>> tuples)
>>>>>>>>>>>>>>> 3. File based watermarks (Start file, end file)
>>>>>>>>>>>>>>> 4. Final watermark
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> File based watermarks seem to be applicable for
>>>> batch
>>>>>> (file
>>>>>>>>>> based)
>>>>>>>>>>> as
>>>>>>>>>>>>>> well,
>>>>>>>>>>>>>>> and hence I thought of looking at these first.
>> Does
>>>>> this
>>>>>>> seem
>>>>>>>>> to
>>>>>>>>>> be
>>>>>>>>>>>> in
>>>>>>>>>>>>>> line
>>>>>>>>>>>>>>> with the thought process?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ~ Bhupesh
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ______________________________
>>>>> _________________________
>>>>>>>>>>>>>>> Bhupesh Chawda
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> www.datatorrent.com  |  apex.apache.org
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Feb 16, 2017 at 10:37 AM, Thomas Weise <
>>>>>>>> thw@apache.org
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> I don't think this should be designed based on a
>>>>>>> simplistic
>>>>>>>>>> file
>>>>>>>>>>>>>>>> input-output scenario. It would be good to
>>>> include a
>>>>>>>> stateful
>>>>>>>>>>>>>>>> transformation based on event time.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> More complex pipelines contain stateful
>>>>> transformations
>>>>>>>> that
>>>>>>>>>>> depend
>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>> windowing and watermarks. I think we need a
>>>> watermark
>>>>>>>> concept
>>>>>>>>>>> that
>>>>>>>>>>>> is
>>>>>>>>>>>>>>> based
>>>>>>>>>>>>>>>> on progress in event time (or other monotonic
>>>>>> increasing
>>>>>>>>>>> sequence)
>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>> other operators can generically work with.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Note that even file input in many cases can
>>>> produce
>>>>>> time
>>>>>>>>> based
>>>>>>>>>>>>>>> watermarks,
>>>>>>>>>>>>>>>> for example when you read part files that are
>>>> bound
>>>>> by
>>>>>>>> event
>>>>>>>>>>> time.
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Thomas
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Feb 15, 2017 at 4:02 AM, Bhupesh Chawda
>> <
>>>>>>>>>>>>>> bhupesh@datatorrent.com
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For better understanding the use case for
>>>> control
>>>>>>> tuples
>>>>>>>> in
>>>>>>>>>>>> batch,
>>>>>>>>>>>>> I
>>>>>>>>>>>>>>> am
>>>>>>>>>>>>>>>>> creating a prototype for a batch application
>>>> using
>>>>>> File
>>>>>>>>> Input
>>>>>>>>>>> and
>>>>>>>>>>>>>> File
>>>>>>>>>>>>>>>>> Output operators.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> To enable basic batch processing for File IO
>>>>>>> operators, I
>>>>>>>>> am
>>>>>>>>>>>>>> proposing
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> following changes to File input and output
>>>>> operators:
>>>>>>>>>>>>>>>>> 1. File Input operator emits a watermark each
>>>> time
>>>>> it
>>>>>>>> opens
>>>>>>>>>> and
>>>>>>>>>>>>>> closes
>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>> file. These can be "start file" and "end file"
>>>>>>> watermarks
>>>>>>>>>> which
>>>>>>>>>>>>>> include
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> corresponding file names. The "start file"
>> tuple
>>>>>> should
>>>>>>>> be
>>>>>>>>>> sent
>>>>>>>>>>>>>> before
>>>>>>>>>>>>>>>> any
>>>>>>>>>>>>>>>>> of the data from that file flows.
>>>>>>>>>>>>>>>>> 2. File Input operator can be configured to
>> end
>>>> the
>>>>>>>>>> application
>>>>>>>>>>>>>> after a
>>>>>>>>>>>>>>>>> single or n scans of the directory (a batch).
>>>> This
>>>>> is
>>>>>>>> where
>>>>>>>>>> the
>>>>>>>>>>>>>>> operator
>>>>>>>>>>>>>>>>> emits the final watermark (the end of
>>>> application
>>>>>>> control
>>>>>>>>>>> tuple).
>>>>>>>>>>>>>> This
>>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>> also shutdown the application.
>>>>>>>>>>>>>>>>> 3. The File output operator handles these
>>>> control
>>>>>>> tuples.
>>>>>>>>>>> "Start
>>>>>>>>>>>>>> file"
>>>>>>>>>>>>>>>>> initializes the file name for the incoming
>>>> tuples.
>>>>>> "End
>>>>>>>>> file"
>>>>>>>>>>>>>> watermark
>>>>>>>>>>>>>>>>> forces a finalize on that file.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The user would be able to enable the operators
>>>> to
>>>>>> send
>>>>>>>> only
>>>>>>>>>>> those
>>>>>>>>>>>>>>>>> watermarks that are needed in the application.
>>>> If
>>>>>> none
>>>>>>> of
>>>>>>>>> the
>>>>>>>>>>>>> options
>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>> configured, the operators behave as in a
>>>> streaming
>>>>>>>>>> application.
>>>>>>>>>>>>>>>>> There are a few challenges in the
>> implementation
>>>>>> where
>>>>>>>> the
>>>>>>>>>>> input
>>>>>>>>>>>>>>> operator
>>>>>>>>>>>>>>>>> is partitioned. In this case, the correlation
>>>>> between
>>>>>>> the
>>>>>>>>>>>> start/end
>>>>>>>>>>>>>>> for a
>>>>>>>>>>>>>>>>> file and the data tuples for that file is
>> lost.
>>>>> Hence
>>>>>>> we
>>>>>>>>> need
>>>>>>>>>>> to
>>>>>>>>>>>>>>> maintain
>>>>>>>>>>>>>>>>> the filename as part of each tuple in the
>>>> pipeline.
>>>>>>>>>>>>>>>>> The "start file" and "end file" control tuples
>>>> in
>>>>>> this
>>>>>>>>>> example
>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>> temporary names for watermarks. We can have
>>>> generic
>>>>>>>> "start
>>>>>>>>>>>> batch" /
>>>>>>>>>>>>>>> "end
>>>>>>>>>>>>>>>>> batch" tuples which could be used for other
>> use
>>>>> cases
>>>>>>> as
>>>>>>>>>> well.
>>>>>>>>>>>> The
>>>>>>>>>>>>>>> Final
>>>>>>>>>>>>>>>>> watermark is common and serves the same
>> purpose
>>>> in
>>>>>> each
>>>>>>>>> case.
>>>>>>>>>>>>>>>>> Please let me know your thoughts on this.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ~ Bhupesh
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Jan 18, 2017 at 12:22 AM, Bhupesh
>>>> Chawda <
>>>>>>>>>>>>>>>> bhupesh@datatorrent.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Yes, this can be part of operator
>>>> configuration.
>>>>>>> Given
>>>>>>>>>> this,
>>>>>>>>>>>> for
>>>>>>>>>>>>> a
>>>>>>>>>>>>>>> user
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> define a batch application, would mean
>>>>> configuring
>>>>>>> the
>>>>>>>>>>>> connectors
>>>>>>>>>>>>>>>> (mostly
>>>>>>>>>>>>>>>>>> the input operator) in the application for
>> the
>>>>>>> desired
>>>>>>>>>>>> behavior.
>>>>>>>>>>>>>>>>> Similarly,
>>>>>>>>>>>>>>>>>> there can be other use cases that can be
>>>> achieved
>>>>>>> other
>>>>>>>>>> than
>>>>>>>>>>>>> batch.
>>>>>>>>>>>>>>>>>> We may also need to take care of the
>>>> following:
>>>>>>>>>>>>>>>>>> 1. Make sure that the watermarks or control
>>>>> tuples
>>>>>>> are
>>>>>>>>>>>> consistent
>>>>>>>>>>>>>>>> across
>>>>>>>>>>>>>>>>>> sources. Meaning an HDFS sink should be able
>>>> to
>>>>>>>> interpret
>>>>>>>>>> the
>>>>>>>>>>>>>>> watermark
>>>>>>>>>>>>>>>>>> tuple sent out by, say, a JDBC source.
>>>>>>>>>>>>>>>>>> 2. In addition to I/O connectors, we should
>>>> also
>>>>>> look
>>>>>>>> at
>>>>>>>>>> the
>>>>>>>>>>>> need
>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>> processing operators to understand some of
>> the
>>>>>>> control
>>>>>>>>>>> tuples /
>>>>>>>>>>>>>>>>> watermarks.
>>>>>>>>>>>>>>>>>> For example, we may want to reset the
>> operator
>>>>>>> behavior
>>>>>>>>> on
>>>>>>>>>>>>> arrival
>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>> some
>>>>>>>>>>>>>>>>>> watermark tuple.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ~ Bhupesh
>>>>>>>>>>>>>>>>>>

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by AJAY GUPTA <aj...@gmail.com>.
Hi All,

I am creating an application which uses the Windowed Operator. The
application has a CsvParser operator emitting a POJO that is to be passed as
input to the WindowedOperator. The WindowedOperator, however, requires an
instance of the Tuple class on its input port:

  public final transient DefaultInputPort<Tuple<InputT>> input =
      new DefaultInputPort<Tuple<InputT>>()

Due to this, addStream cannot connect the two operators, since the type of
CsvParser's output port is not compatible with the input port type of the
WindowedOperator.
One way to solve this problem is to place a converter operator between the
two.
I would like to know if there is a more generic approach that avoids writing
a new converter operator for every application that uses the Windowed
Operator.
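
For reference, the kind of converter I currently have in mind looks roughly
like the following. This is an untested sketch: the class names
(Tuple.TimestampedTuple, BaseOperator, the port classes) are what I believe
the Malhar windowed support uses, and PojoToWindowedTuple is just a
placeholder name, so please correct me if I have the API wrong.

  import org.apache.apex.malhar.lib.window.Tuple;

  import com.datatorrent.api.DefaultInputPort;
  import com.datatorrent.api.DefaultOutputPort;
  import com.datatorrent.common.util.BaseOperator;

  // Wraps every incoming POJO into a Tuple.TimestampedTuple so that the
  // stream can be connected to the WindowedOperator's input port. A concrete
  // subclass may be needed so that the generic port types resolve during DAG
  // validation.
  public class PojoToWindowedTuple<T> extends BaseOperator
  {
    public final transient DefaultOutputPort<Tuple.TimestampedTuple<T>> output =
        new DefaultOutputPort<>();

    public final transient DefaultInputPort<T> input = new DefaultInputPort<T>()
    {
      @Override
      public void process(T pojo)
      {
        // Using processing time as the timestamp for now; a pluggable
        // timestamp extractor could be substituted for event time.
        output.emit(new Tuple.TimestampedTuple<>(System.currentTimeMillis(), pojo));
      }
    };
  }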

Thanks,
Ajay



On Thu, Mar 23, 2017 at 5:25 PM, Bhupesh Chawda <bh...@datatorrent.com>
wrote:

> Hi All,
>
> I think we have some agreement on the way we should use control tuples for
> File I/O operators to support batch.
>
> In order to have more operators in Malhar, support this paradigm, I think
> we should also look at store operators - JDBC, Cassandra, HBase etc.
> The case with these operators is simpler as most of these do not poll the
> sources (except JDBC poller operator) and just stop once they have read a
> fixed amount of data. In other words, these are inherently batch sources.
> The only change that we should add to these operators is to shut down the
> DAG once the reading of data is done. For a windowed operator this would
> mean a Global window with a final watermark before the DAG is shut down.
>
> ~ Bhupesh
>
>
> _______________________________________________________
>
> Bhupesh Chawda
>
> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>
> www.datatorrent.com  |  apex.apache.org
>
>
>
> On Tue, Feb 28, 2017 at 10:59 PM, Bhupesh Chawda <bh...@datatorrent.com>
> wrote:
>
> > Hi Thomas,
> >
> > Even though the windowing operator is not just "event time", it seems it
> > is too much dependent on the "time" attribute of the incoming tuple. This
> > is the reason we had to model the file index as a timestamp to solve the
> > batch case for files.
> > Perhaps we should work on increasing the scope of the windowed operator
> to
> > consider other types of windows as well. The Sequence option suggested by
> > David seems to be something in that direction.
> >
> > ~ Bhupesh
> >
> >
> > _______________________________________________________
> >
> > Bhupesh Chawda
> >
> > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> >
> > www.datatorrent.com  |  apex.apache.org
> >
> >
> >
> > On Tue, Feb 28, 2017 at 10:48 PM, Thomas Weise <th...@apache.org> wrote:
> >
> >> That's correct, we are looking at a generalized approach for state
> >> management vs. a series of special cases.
> >>
> >> And to be clear, windowing does not imply event time, otherwise it would
> >> be
> >> "EventTimeOperator" :-)
> >>
> >> Thomas
> >>
> >> On Tue, Feb 28, 2017 at 9:11 AM, Bhupesh Chawda <
> bhupesh@datatorrent.com>
> >> wrote:
> >>
> >> > Hi David,
> >> >
> >> > I went through the discussion, but it seems like it is more on the
> event
> >> > time watermark handling as opposed to batches. What we are trying to
> do
> >> is
> >> > have watermarks serve the purpose of demarcating batches using control
> >> > tuples. Since each batch is separate from others, we would like to
> have
> >> > stateful processing within a batch, but not across batches.
> >> > At the same time, we would like to do this in a manner which is
> >> consistent
> >> > with the windowing mechanism provided by the windowed operator. This
> >> will
> >> > allow us to treat a single batch as a (bounded) stream and apply all
> the
> >> > event time windowing concepts in that time span.
> >> >
> >> > For example, let's say I need to process data for a day (24 hours) as
> a
> >> > single batch. The application is still streaming in nature: it would
> end
> >> > the batch after a day and start a new batch the next day. At the same
> >> time,
> >> > I would be able to have early trigger firings every minute as well as
> >> drop
> >> > any data which is say, 5 mins late. All this within a single day.
> >> >
> >> > ~ Bhupesh
> >> >
> >> >
> >> >
> >> > _______________________________________________________
> >> >
> >> > Bhupesh Chawda
> >> >
> >> > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> >> >
> >> > www.datatorrent.com  |  apex.apache.org
> >> >
> >> >
> >> >
> >> > On Tue, Feb 28, 2017 at 9:27 PM, David Yan <da...@gmail.com>
> wrote:
> >> >
> >> > > There is a discussion in the Flink mailing list about key-based
> >> > watermarks.
> >> > > I think it's relevant to our use case here.
> >> > > https://lists.apache.org/thread.html/2b90d5b1d5e2654212cfbbcc6510ef
> >> > > 424bbafc4fadb164bd5aff9216@%3Cdev.flink.apache.org%3E
> >> > >
> >> > > David
> >> > >
> >> > > On Tue, Feb 28, 2017 at 2:13 AM, Bhupesh Chawda <
> >> bhupesh@datatorrent.com
> >> > >
> >> > > wrote:
> >> > >
> >> > > > Hi David,
> >> > > >
> >> > > > If using time window does not seem appropriate, we can have
> another
> >> > class
> >> > > > which is more suited for such sequential and distinct windows.
> >> > Perhaps, a
> >> > > > CustomWindow option can be introduced which takes in a window id.
> >> The
> >> > > > purpose of this window option could be to translate the window id
> >> into
> >> > > > appropriate timestamps.
> >> > > >
> >> > > > Another option would be to go with a custom timestampExtractor for
> >> such
> >> > > > tuples which translates the each unique file name to a distinct
> >> > timestamp
> >> > > > while using time windows in the windowed operator.
> >> > > >
> >> > > > ~ Bhupesh
> >> > > >
> >> > > >
> >> > > > _______________________________________________________
> >> > > >
> >> > > > Bhupesh Chawda
> >> > > >
> >> > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> >> > > >
> >> > > > www.datatorrent.com  |  apex.apache.org
> >> > > >
> >> > > >
> >> > > >
> >> > > > On Tue, Feb 28, 2017 at 12:28 AM, David Yan <da...@gmail.com>
> >> > wrote:
> >> > > >
> >> > > > > I now see your rationale on putting the filename in the window.
> >> > > > > As far as I understand, the reasons why the filename is not part
> >> of
> >> > the
> >> > > > key
> >> > > > > and the Global Window is not used are:
> >> > > > >
> >> > > > > 1) The files are processed in sequence, not in parallel
> >> > > > > 2) The windowed operator should not keep the state associated
> with
> >> > the
> >> > > > file
> >> > > > > when the processing of the file is done
> >> > > > > 3) The trigger should be fired for the file when a file is done
> >> > > > processing.
> >> > > > >
> >> > > > > However, if the file is just a sequence has nothing to do with a
> >> > > > timestamp,
> >> > > > > assigning a timestamp to a file is not an intuitive thing to do
> >> and
> >> > > would
> >> > > > > just create confusions to the users, especially when it's used
> as
> >> an
> >> > > > > example for new users.
> >> > > > >
> >> > > > > How about having a separate class called SequenceWindow? And
> >> perhaps
> >> > > > > TimeWindow can inherit from it?
> >> > > > >
> >> > > > > David
> >> > > > >
> >> > > > > On Mon, Feb 27, 2017 at 8:58 AM, Thomas Weise <th...@apache.org>
> >> > wrote:
> >> > > > >
> >> > > > > > On Mon, Feb 27, 2017 at 8:50 AM, Bhupesh Chawda <
> >> > > > bhupesh@datatorrent.com
> >> > > > > >
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > > > I think my comments related to count based windows might be
> >> > causing
> >> > > > > > > confusion. Let's not discuss count based scenarios for now.
> >> > > > > > >
> >> > > > > > > Just want to make sure we are on the same page wrt. the
> "each
> >> > file
> >> > > > is a
> >> > > > > > > batch" use case. As mentioned by Thomas, the each tuple from
> >> the
> >> > > same
> >> > > > > > file
> >> > > > > > > has the same timestamp (which is just a sequence number) and
> >> that
> >> > > > helps
> >> > > > > > > keep tuples from each file in a separate window.
> >> > > > > > >
> >> > > > > >
> >> > > > > > Yes, in this case it is a sequence number, but it could be a
> >> time
> >> > > stamp
> >> > > > > > also, depending on the file naming convention. And if it was
> >> event
> >> > > time
> >> > > > > > processing, the watermark would be derived from records within
> >> the
> >> > > > file.
> >> > > > > >
> >> > > > > > Agreed, the source should have a mechanism to control the time
> >> > stamp
> >> > > > > > extraction along with everything else pertaining to the
> >> watermark
> >> > > > > > generation.
> >> > > > > >
> >> > > > > >
> >> > > > > > > We could also implement a "timestampExtractor" interface to
> >> > > identify
> >> > > > > the
> >> > > > > > > timestamp (sequence number) for a file.
> >> > > > > > >
> >> > > > > > > ~ Bhupesh
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > _______________________________________________________
> >> > > > > > >
> >> > > > > > > Bhupesh Chawda
> >> > > > > > >
> >> > > > > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> >> > > > > > >
> >> > > > > > > www.datatorrent.com  |  apex.apache.org
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > On Mon, Feb 27, 2017 at 9:52 PM, Thomas Weise <
> thw@apache.org
> >> >
> >> > > > wrote:
> >> > > > > > >
> >> > > > > > > > I don't think this is a use case for count based window.
> >> > > > > > > >
> >> > > > > > > > We have multiple files that are retrieved in a sequence
> and
> >> > there
> >> > > > is
> >> > > > > no
> >> > > > > > > > knowledge of the number of records per file. The
> >> requirement is
> >> > > to
> >> > > > > > > > aggregate each file separately and emit the aggregate when
> >> the
> >> > > file
> >> > > > > is
> >> > > > > > > read
> >> > > > > > > > fully. There is no concept of "end of something" for an
> >> > > individual
> >> > > > > key
> >> > > > > > > and
> >> > > > > > > > global window isn't applicable.
> >> > > > > > > >
> >> > > > > > > > However, as already explained and implemented by Bhupesh,
> >> this
> >> > > can
> >> > > > be
> >> > > > > > > > solved using watermark and window (in this case the window
> >> > > > timestamp
> >> > > > > > > isn't
> >> > > > > > > > a timestamp, but a file sequence, but that doesn't matter.
> >> > > > > > > >
> >> > > > > > > > Thomas
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > On Mon, Feb 27, 2017 at 8:05 AM, David Yan <
> >> davidyan@gmail.com
> >> > >
> >> > > > > wrote:
> >> > > > > > > >
> >> > > > > > > > > I don't think this is the way to go. Global Window only
> >> means
> >> > > the
> >> > > > > > > > timestamp
> >> > > > > > > > > does not matter (or that there is no timestamp). It does
> >> not
> >> > > > > > > necessarily
> >> > > > > > > > > mean it's a large batch. Unless there is some notion of
> >> event
> >> > > > time
> >> > > > > > for
> >> > > > > > > > each
> >> > > > > > > > > file, you don't want to embed the file into the window
> >> > itself.
> >> > > > > > > > >
> >> > > > > > > > > If you want the result broken up by file name, and if
> the
> >> > files
> >> > > > are
> >> > > > > > to
> >> > > > > > > be
> >> > > > > > > > > processed in parallel, I think making the file name be
> >> part
> >> > of
> >> > > > the
> >> > > > > > key
> >> > > > > > > is
> >> > > > > > > > > the way to go. I think it's very confusing if we somehow
> >> make
> >> > > the
> >> > > > > > file
> >> > > > > > > to
> >> > > > > > > > > be part of the window.
> >> > > > > > > > >
> >> > > > > > > > > For count-based window, it's not implemented yet and
> >> you're
> >> > > > welcome
> >> > > > > > to
> >> > > > > > > > add
> >> > > > > > > > > that feature. In case of count-based windows, there
> would
> >> be
> >> > no
> >> > > > > > notion
> >> > > > > > > of
> >> > > > > > > > > time and you probably only trigger at the end of each
> >> window.
> >> > > In
> >> > > > > the
> >> > > > > > > case
> >> > > > > > > > > of count-based windows, the watermark only matters for
> >> batch
> >> > > > since
> >> > > > > > you
> >> > > > > > > > need
> >> > > > > > > > > a way to know when the batch has ended (if the count is
> >> 10,
> >> > the
> >> > > > > > number
> >> > > > > > > of
> >> > > > > > > > > tuples in the batch is let's say 105, you need a way to
> >> end
> >> > the
> >> > > > > last
> >> > > > > > > > window
> >> > > > > > > > > with 5 tuples).
> >> > > > > > > > >
> >> > > > > > > > > David
> >> > > > > > > > >
> >> > > > > > > > > On Mon, Feb 27, 2017 at 2:41 AM, Bhupesh Chawda <
> >> > > > > > > bhupesh@datatorrent.com
> >> > > > > > > > >
> >> > > > > > > > > wrote:
> >> > > > > > > > >
> >> > > > > > > > > > Hi David,
> >> > > > > > > > > >
> >> > > > > > > > > > Thanks for your comments.
> >> > > > > > > > > >
> >> > > > > > > > > > The wordcount example that I created based on the
> >> windowed
> >> > > > > operator
> >> > > > > > > > does
> >> > > > > > > > > > processing of word counts per file (each file as a
> >> separate
> >> > > > > batch),
> >> > > > > > > > i.e.
> >> > > > > > > > > > process counts for each file and dump into separate
> >> files.
> >> > > > > > > > > > As I understand Global window is for one large batch;
> >> i.e.
> >> > > all
> >> > > > > > > incoming
> >> > > > > > > > > > data falls into the same batch. This could not be
> >> processed
> >> > > > using
> >> > > > > > > > > > GlobalWindow option as we need more than one windows.
> In
> >> > this
> >> > > > > > case, I
> >> > > > > > > > > > configured the windowed operator to have time windows
> of
> >> > 1ms
> >> > > > each
> >> > > > > > and
> >> > > > > > > > > > passed data for each file with increasing timestamps:
> >> > (file1,
> >> > > > 1),
> >> > > > > > > > (file2,
> >> > > > > > > > > > 2) and so on. Is there a better way of handling this
> >> > > scenario?
> >> > > > > > > > > >
> >> > > > > > > > > > Regarding (2 - count based windows), I think there is
> a
> >> > > trigger
> >> > > > > > > option
> >> > > > > > > > to
> >> > > > > > > > > > process count based windows. In case I want to process
> >> > every
> >> > > > 1000
> >> > > > > > > > tuples
> >> > > > > > > > > as
> >> > > > > > > > > > a batch, I could set the Trigger option to
> CountTrigger
> >> > with
> >> > > > the
> >> > > > > > > > > > accumulation set to Discarding. Is this correct?
> >> > > > > > > > > >
> >> > > > > > > > > > I agree that (4. Final Watermark) can be done using
> >> Global
> >> > > > > window.
> >> > > > > > > > > >
> >> > > > > > > > > > ​~ Bhupesh​
> >> > > > > > > > > >
> >> > > > > > > > > > ______________________________
> _________________________
> >> > > > > > > > > >
> >> > > > > > > > > > Bhupesh Chawda
> >> > > > > > > > > >
> >> > > > > > > > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> >> > > > > > > > > >
> >> > > > > > > > > > www.datatorrent.com  |  apex.apache.org
> >> > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > > > On Mon, Feb 27, 2017 at 12:18 PM, David Yan <
> >> > > > davidyan@gmail.com>
> >> > > > > > > > wrote:
> >> > > > > > > > > >
> >> > > > > > > > > > > I'm worried that we are making the watermark concept
> >> too
> >> > > > > > > complicated.
> >> > > > > > > > > > >
> >> > > > > > > > > > > Watermarks should simply just tell you what windows
> >> can
> >> > be
> >> > > > > > > considered
> >> > > > > > > > > > > complete.
> >> > > > > > > > > > >
> >> > > > > > > > > > > Point 2 is basically a count-based window.
> Watermarks
> >> do
> >> > > not
> >> > > > > > play a
> >> > > > > > > > > role
> >> > > > > > > > > > > here because the window is always complete at the
> n-th
> >> > > tuple.
> >> > > > > > > > > > >
> >> > > > > > > > > > > If I understand correctly, point 3 is for batch
> >> > processing
> >> > > of
> >> > > > > > > files.
> >> > > > > > > > > > Unless
> >> > > > > > > > > > > the files contain timed events, it sounds to be that
> >> this
> >> > > can
> >> > > > > be
> >> > > > > > > > > achieved
> >> > > > > > > > > > > with just a Global Window. For signaling EOF, a
> >> watermark
> >> > > > with
> >> > > > > a
> >> > > > > > > > > > +infinity
> >> > > > > > > > > > > timestamp can be used so that triggers will be fired
> >> upon
> >> > > > > receipt
> >> > > > > > > of
> >> > > > > > > > > that
> >> > > > > > > > > > > watermark.
> >> > > > > > > > > > >
> >> > > > > > > > > > > For point 4, just like what I mentioned above, can
> be
> >> > > > achieved
> >> > > > > > > with a
> >> > > > > > > > > > > watermark with a +infinity timestamp.
> >> > > > > > > > > > >
> >> > > > > > > > > > > David
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > On Sat, Feb 18, 2017 at 8:04 AM, Bhupesh Chawda <
> >> > > > > > > > > bhupesh@datatorrent.com
> >> > > > > > > > > > >
> >> > > > > > > > > > > wrote:
> >> > > > > > > > > > >
> >> > > > > > > > > > > > Hi Thomas,
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > For an input operator which is supposed to
> generate
> >> > > > > watermarks
> >> > > > > > > for
> >> > > > > > > > > > > > downstream operators, I can think about the
> >> following
> >> > > > > > watermarks
> >> > > > > > > > that
> >> > > > > > > > > > the
> >> > > > > > > > > > > > operator can emit:
> >> > > > > > > > > > > > 1. Time based watermarks (the high watermark / low
> >> > > > watermark)
> >> > > > > > > > > > > > 2. Number of tuple based watermarks (Every n
> tuples)
> >> > > > > > > > > > > > 3. File based watermarks (Start file, end file)
> >> > > > > > > > > > > > 4. Final watermark
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > File based watermarks seem to be applicable for
> >> batch
> >> > > (file
> >> > > > > > > based)
> >> > > > > > > > as
> >> > > > > > > > > > > well,
> >> > > > > > > > > > > > and hence I thought of looking at these first.
> Does
> >> > this
> >> > > > seem
> >> > > > > > to
> >> > > > > > > be
> >> > > > > > > > > in
> >> > > > > > > > > > > line
> >> > > > > > > > > > > > with the thought process?
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > ~ Bhupesh
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > ______________________________
> >> > _________________________
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > Bhupesh Chawda
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > Software Engineer
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > www.datatorrent.com  |  apex.apache.org
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > On Thu, Feb 16, 2017 at 10:37 AM, Thomas Weise <
> >> > > > > thw@apache.org
> >> > > > > > >
> >> > > > > > > > > wrote:
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > > I don't think this should be designed based on a
> >> > > > simplistic
> >> > > > > > > file
> >> > > > > > > > > > > > > input-output scenario. It would be good to
> >> include a
> >> > > > > stateful
> >> > > > > > > > > > > > > transformation based on event time.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > More complex pipelines contain stateful
> >> > transformations
> >> > > > > that
> >> > > > > > > > depend
> >> > > > > > > > > > on
> >> > > > > > > > > > > > > windowing and watermarks. I think we need a
> >> watermark
> >> > > > > concept
> >> > > > > > > > that
> >> > > > > > > > > is
> >> > > > > > > > > > > > based
> >> > > > > > > > > > > > > on progress in event time (or other monotonic
> >> > > increasing
> >> > > > > > > > sequence)
> >> > > > > > > > > > that
> >> > > > > > > > > > > > > other operators can generically work with.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > Note that even file input in many cases can
> >> produce
> >> > > time
> >> > > > > > based
> >> > > > > > > > > > > > watermarks,
> >> > > > > > > > > > > > > for example when you read part files that are
> >> bound
> >> > by
> >> > > > > event
> >> > > > > > > > time.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > Thanks,
> >> > > > > > > > > > > > > Thomas
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > On Wed, Feb 15, 2017 at 4:02 AM, Bhupesh Chawda
> <
> >> > > > > > > > > > > bhupesh@datatorrent.com
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > > For better understanding the use case for
> >> control
> >> > > > tuples
> >> > > > > in
> >> > > > > > > > > batch,
> >> > > > > > > > > > ​I
> >> > > > > > > > > > > > am
> >> > > > > > > > > > > > > > creating a prototype for a batch application
> >> using
> >> > > File
> >> > > > > > Input
> >> > > > > > > > and
> >> > > > > > > > > > > File
> >> > > > > > > > > > > > > > Output operators.
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > To enable basic batch processing for File IO
> >> > > > operators, I
> >> > > > > > am
> >> > > > > > > > > > > proposing
> >> > > > > > > > > > > > > the
> >> > > > > > > > > > > > > > following changes to File input and output
> >> > operators:
> >> > > > > > > > > > > > > > 1. File Input operator emits a watermark each
> >> time
> >> > it
> >> > > > > opens
> >> > > > > > > and
> >> > > > > > > > > > > closes
> >> > > > > > > > > > > > a
> >> > > > > > > > > > > > > > file. These can be "start file" and "end file"
> >> > > > watermarks
> >> > > > > > > which
> >> > > > > > > > > > > include
> >> > > > > > > > > > > > > the
> >> > > > > > > > > > > > > > corresponding file names. The "start file"
> tuple
> >> > > should
> >> > > > > be
> >> > > > > > > sent
> >> > > > > > > > > > > before
> >> > > > > > > > > > > > > any
> >> > > > > > > > > > > > > > of the data from that file flows.
> >> > > > > > > > > > > > > > 2. File Input operator can be configured to
> end
> >> the
> >> > > > > > > application
> >> > > > > > > > > > > after a
> >> > > > > > > > > > > > > > single or n scans of the directory (a batch).
> >> This
> >> > is
> >> > > > > where
> >> > > > > > > the
> >> > > > > > > > > > > > operator
> >> > > > > > > > > > > > > > emits the final watermark (the end of
> >> application
> >> > > > control
> >> > > > > > > > tuple).
> >> > > > > > > > > > > This
> >> > > > > > > > > > > > > will
> >> > > > > > > > > > > > > > also shutdown the application.
> >> > > > > > > > > > > > > > 3. The File output operator handles these
> >> control
> >> > > > tuples.
> >> > > > > > > > "Start
> >> > > > > > > > > > > file"
> >> > > > > > > > > > > > > > initializes the file name for the incoming
> >> tuples.
> >> > > "End
> >> > > > > > file"
> >> > > > > > > > > > > watermark
> >> > > > > > > > > > > > > > forces a finalize on that file.
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > The user would be able to enable the operators
> >> to
> >> > > send
> >> > > > > only
> >> > > > > > > > those
> >> > > > > > > > > > > > > > watermarks that are needed in the application.
> >> If
> >> > > none
> >> > > > of
> >> > > > > > the
> >> > > > > > > > > > options
> >> > > > > > > > > > > > are
> >> > > > > > > > > > > > > > configured, the operators behave as in a
> >> streaming
> >> > > > > > > application.
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > There are a few challenges in the
> implementation
> >> > > where
> >> > > > > the
> >> > > > > > > > input
> >> > > > > > > > > > > > operator
> >> > > > > > > > > > > > > > is partitioned. In this case, the correlation
> >> > between
> >> > > > the
> >> > > > > > > > > start/end
> >> > > > > > > > > > > > for a
> >> > > > > > > > > > > > > > file and the data tuples for that file is
> lost.
> >> > Hence
> >> > > > we
> >> > > > > > need
> >> > > > > > > > to
> >> > > > > > > > > > > > maintain
> >> > > > > > > > > > > > > > the filename as part of each tuple in the
> >> pipeline.
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > The "start file" and "end file" control tuples
> >> in
> >> > > this
> >> > > > > > > example
> >> > > > > > > > > are
> >> > > > > > > > > > > > > > temporary names for watermarks. We can have
> >> generic
> >> > > > > "start
> >> > > > > > > > > batch" /
> >> > > > > > > > > > > > "end
> >> > > > > > > > > > > > > > batch" tuples which could be used for other
> use
> >> > cases
> >> > > > as
> >> > > > > > > well.
> >> > > > > > > > > The
> >> > > > > > > > > > > > Final
> >> > > > > > > > > > > > > > watermark is common and serves the same
> purpose
> >> in
> >> > > each
> >> > > > > > case.
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Please let me know your thoughts on this.
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > ~ Bhupesh
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > On Wed, Jan 18, 2017 at 12:22 AM, Bhupesh
> >> Chawda <
> >> > > > > > > > > > > > > bhupesh@datatorrent.com>
> >> > > > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Yes, this can be part of operator
> >> configuration.
> >> > > > Given
> >> > > > > > > this,
> >> > > > > > > > > for
> >> > > > > > > > > > a
> >> > > > > > > > > > > > user
> >> > > > > > > > > > > > > > to
> >> > > > > > > > > > > > > > > define a batch application, would mean
> >> > configuring
> >> > > > the
> >> > > > > > > > > connectors
> >> > > > > > > > > > > > > (mostly
> >> > > > > > > > > > > > > > > the input operator) in the application for
> the
> >> > > > desired
> >> > > > > > > > > behavior.
> >> > > > > > > > > > > > > > Similarly,
> >> > > > > > > > > > > > > > > there can be other use cases that can be
> >> achieved
> >> > > > other
> >> > > > > > > than
> >> > > > > > > > > > batch.
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > We may also need to take care of the
> >> following:
> >> > > > > > > > > > > > > > > 1. Make sure that the watermarks or control
> >> > tuples
> >> > > > are
> >> > > > > > > > > consistent
> >> > > > > > > > > > > > > across
> >> > > > > > > > > > > > > > > sources. Meaning an HDFS sink should be able
> >> to
> >> > > > > interpret
> >> > > > > > > the
> >> > > > > > > > > > > > watermark
> >> > > > > > > > > > > > > > > tuple sent out by, say, a JDBC source.
> >> > > > > > > > > > > > > > > 2. In addition to I/O connectors, we should
> >> also
> >> > > look
> >> > > > > at
> >> > > > > > > the
> >> > > > > > > > > need
> >> > > > > > > > > > > for
> >> > > > > > > > > > > > > > > processing operators to understand some of
> the
> >> > > > control
> >> > > > > > > > tuples /
> >> > > > > > > > > > > > > > watermarks.
> >> > > > > > > > > > > > > > > For example, we may want to reset the
> operator
> >> > > > behavior
> >> > > > > > on
> >> > > > > > > > > > arrival
> >> > > > > > > > > > > of
> >> > > > > > > > > > > > > > some
> >> > > > > > > > > > > > > > > watermark tuple.
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > ~ Bhupesh
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > On Tue, Jan 17, 2017 at 9:59 PM, Thomas
> Weise
> >> <
> >> > > > > > > > thw@apache.org>
> >> > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > >> The HDFS source can operate in two modes,
> >> > bounded
> >> > > or
> >> > > > > > > > > unbounded.
> >> > > > > > > > > > If
> >> > > > > > > > > > > > you
> >> > > > > > > > > > > > > > >> scan
> >> > > > > > > > > > > > > > >> only once, then it should emit the final
> >> > watermark
> >> > > > > after
> >> > > > > > > it
> >> > > > > > > > is
> >> > > > > > > > > > > done.
> >> > > > > > > > > > > > > > >> Otherwise it would emit watermarks based
> on a
> >> > > policy
> >> > > > > > > (files
> >> > > > > > > > > > names
> >> > > > > > > > > > > > > etc.).
> >> > > > > > > > > > > > > > >> The mechanism to generate the marks may
> >> depend
> >> > on
> >> > > > the
> >> > > > > > type
> >> > > > > > > > of
> >> > > > > > > > > > > source
> >> > > > > > > > > > > > > and
> >> > > > > > > > > > > > > > >> the user needs to be able to
> >> influence/configure
> >> > > it.
> >> > > > > > > > > > > > > > >>
> >> > > > > > > > > > > > > > >> Thomas
> >> > > > > > > > > > > > > > >>
> >> > > > > > > > > > > > > > >>
> >> > > > > > > > > > > > > > >> On Tue, Jan 17, 2017 at 5:03 AM, Bhupesh
> >> Chawda
> >> > <
> >> > > > > > > > > > > > > > bhupesh@datatorrent.com>
> >> > > > > > > > > > > > > > >> wrote:
> >> > > > > > > > > > > > > > >>
> >> > > > > > > > > > > > > > >> > Hi Thomas,
> >> > > > > > > > > > > > > > >> >
> >> > > > > > > > > > > > > > >> > I am not sure that I completely
> understand
> >> > your
> >> > > > > > > > suggestion.
> >> > > > > > > > > > Are
> >> > > > > > > > > > > > you
> >> > > > > > > > > > > > > > >> > suggesting to broaden the scope of the
> >> > proposal
> >> > > to
> >> > > > > > treat
> >> > > > > > > > all
> >> > > > > > > > > > > > sources
> >> > > > > > > > > > > > > > as
> >> > > > > > > > > > > > > > >> > bounded as well as unbounded?
> >> > > > > > > > > > > > > > >> >
> >> > > > > > > > > > > > > > >> > In case of Apex, we treat all sources as
> >> > > unbounded
> >> > > > > > > > sources.
> >> > > > > > > > > > Even
> >> > > > > > > > > > > > > > bounded
> >> > > > > > > > > > > > > > >> > sources like HDFS file source is treated
> as
> >> > > > > unbounded
> >> > > > > > by
> >> > > > > > > > > means
> >> > > > > > > > > > > of
> >> > > > > > > > > > > > > > >> scanning
> >> > > > > > > > > > > > > > >> > the input directory repeatedly.
> >> > > > > > > > > > > > > > >> >
> >> > > > > > > > > > > > > > >> > Let's consider HDFS file source for
> >> example:
> >> > > > > > > > > > > > > > >> > In this case, if we treat it as a bounded
> >> > > source,
> >> > > > we
> >> > > > > > can
> >> > > > > > > > > > define
> >> > > > > > > > > > > > > hooks
> >> > > > > > > > > > > > > > >> which
> >> > > > > > > > > > > > > > >> > allows us to detect the end of the file
> and
> >> > send
> >> > > > the
> >> > > > > > > > "final
> >> > > > > > > > > > > > > > watermark".
> >> > > > > > > > > > > > > > >> We
> >> > > > > > > > > > > > > > >> > could also consider HDFS file source as a
> >> > > > streaming
> >> > > > > > > source
> >> > > > > > > > > and
> >> > > > > > > > > > > > > define
> >> > > > > > > > > > > > > > >> hooks
> >> > > > > > > > > > > > > > >> > which send watermarks based on different
> >> kinds
> >> > > of
> >> > > > > > > windows.
> >> > > > > > > > > > > > > > >> >
> >> > > > > > > > > > > > > > >> > Please correct me if I misunderstand.
> >> > > > > > > > > > > > > > >> >
> >> > > > > > > > > > > > > > >> > ~ Bhupesh
> >> > > > > > > > > > > > > > >> >
> >> > > > > > > > > > > > > > >> >
> >> > > > > > > > > > > > > > >> > On Mon, Jan 16, 2017 at 9:23 PM, Thomas
> >> Weise
> >> > <
> >> > > > > > > > > thw@apache.org
> >> > > > > > > > > > >
> >> > > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > > >> >
> >> > > > > > > > > > > > > > >> > > Bhupesh,
> >> > > > > > > > > > > > > > >> > >
> >> > > > > > > > > > > > > > >> > > Please see how that can be solved in a
> >> > unified
> >> > > > way
> >> > > > > > > using
> >> > > > > > > > > > > windows
> >> > > > > > > > > > > > > and
> >> > > > > > > > > > > > > > >> > > watermarks. It is bounded data vs.
> >> unbounded
> >> > > > data.
> >> > > > > > In
> >> > > > > > > > Beam
> >> > > > > > > > > > for
> >> > > > > > > > > > > > > > >> example,
> >> > > > > > > > > > > > > > >> > you
> >> > > > > > > > > > > > > > >> > > can use the "global window" and the
> final
> >> > > > > watermark
> >> > > > > > to
> >> > > > > > > > > > > > accomplish
> >> > > > > > > > > > > > > > what
> >> > > > > > > > > > > > > > >> > you
> >> > > > > > > > > > > > > > >> > > are looking for. Batch is just a
> special
> >> > case
> >> > > of
> >> > > > > > > > streaming
> >> > > > > > > > > > > where
> >> > > > > > > > > > > > > the
> >> > > > > > > > > > > > > > >> > source
> >> > > > > > > > > > > > > > >> > > emits the final watermark.
> >> > > > > > > > > > > > > > >> > >
> >> > > > > > > > > > > > > > >> > > Thanks,
> >> > > > > > > > > > > > > > >> > > Thomas
> >> > > > > > > > > > > > > > >> > >
> >> > > > > > > > > > > > > > >> > >
> >> > > > > > > > > > > > > > >> > > On Mon, Jan 16, 2017 at 1:02 AM,
> Bhupesh
> >> > > Chawda
> >> > > > <
> >> > > > > > > > > > > > > > >> bhupesh@datatorrent.com
> >> > > > > > > > > > > > > > >> > >
> >> > > > > > > > > > > > > > >> > > wrote:
> >> > > > > > > > > > > > > > >> > >
> >> > > > > > > > > > > > > > >> > > > Yes, if the user needs to develop a
> >> batch
> >> > > > > > > application,
> >> > > > > > > > > > then
> >> > > > > > > > > > > > > batch
> >> > > > > > > > > > > > > > >> aware
> >> > > > > > > > > > > > > > >> > > > operators need to be used in the
> >> > > application.
> >> > > > > > > > > > > > > > >> > > > The nature of the application is
> mostly
> >> > > > > controlled
> >> > > > > > > by
> >> > > > > > > > > the
> >> > > > > > > > > > > > input
> >> > > > > > > > > > > > > > and
> >> > > > > > > > > > > > > > >> the
> >> > > > > > > > > > > > > > >> > > > output operators used in the
> >> application.
> >> > > > > > > > > > > > > > >> > > >
> >> > > > > > > > > > > > > > >> > > > For example, consider an application
> >> which
> >> > > > needs
> >> > > > > > to
> >> > > > > > > > > filter
> >> > > > > > > > > > > > > records
> >> > > > > > > > > > > > > > >> in a
> >> > > > > > > > > > > > > > >> > > > input file and store the filtered
> >> records
> >> > in
> >> > > > > > another
> >> > > > > > > > > file.
> >> > > > > > > > > > > The
> >> > > > > > > > > > > > > > >> nature
> >> > > > > > > > > > > > > > >> > of
> >> > > > > > > > > > > > > > >> > > > this app is to end once the entire
> >> file is
> >> > > > > > > processed.
> >> > > > > > > > > > > > Following
> >> > > > > > > > > > > > > > >> things
> >> > > > > > > > > > > > > > >> > > are
> >> > > > > > > > > > > > > > >> > > > expected of the application:
> >> > > > > > > > > > > > > > >> > > >
> >> > > > > > > > > > > > > > >> > > >    1. Once the input data is over,
> >> > finalize
> >> > > > the
> >> > > > > > > output
> >> > > > > > > > > > file
> >> > > > > > > > > > > > from
> >> > > > > > > > > > > > > > >> .tmp
> >> > > > > > > > > > > > > > >> > > >    files. - Responsibility of output
> >> > > operator
> >> > > > > > > > > > > > > > >> > > >    2. End the application, once the
> >> data
> >> > is
> >> > > > read
> >> > > > > > and
> >> > > > > > > > > > > > processed -
> >> > > > > > > > > > > > > > >> > > >    Responsibility of input operator
> >> > > > > > > > > > > > > > >> > > >
> >> > > > > > > > > > > > > > >> > > > These functions are essential to
> allow
> >> the
> >> > > > user
> >> > > > > to
> >> > > > > > > do
> >> > > > > > > > > > higher
> >> > > > > > > > > > > > > level
> >> > > > > > > > > > > > > > >> > > > operations like scheduling or
> running a
> >> > > > workflow
> >> > > > > > of
> >> > > > > > > > > batch
> >> > > > > > > > > > > > > > >> applications.
> >> > > > > > > > > > > > > > >> > > >
> >> > > > > > > > > > > > > > >> > > > I am not sure about intermediate
> >> > > (processing)
> >> > > > > > > > operators,
> >> > > > > > > > > > as
> >> > > > > > > > > > > > > there
> >> > > > > > > > > > > > > > >> is no
> >> > > > > > > > > > > > > > >> > > > change in their functionality for
> batch
> >> > use
> >> > > > > cases.
> >> > > > > > > > > > Perhaps,
> >> > > > > > > > > > > > > > allowing
> >> > > > > > > > > > > > > > >> > > > running multiple batches in a single
> >> > > > application
> >> > > > > > may
> >> > > > > > > > > > require
> >> > > > > > > > > > > > > > similar
> >> > > > > > > > > > > > > > >> > > > changes in processing operators as
> >> well.
> >> > > > > > > > > > > > > > >> > > >
> >> > > > > > > > > > > > > > >> > > > ~ Bhupesh
> >> > > > > > > > > > > > > > >> > > >
> >> > > > > > > > > > > > > > >> > > > On Mon, Jan 16, 2017 at 2:19 PM,
> >> Priyanka
> >> > > > > Gugale <
> >> > > > > > > > > > > > > > priyag@apache.org
> >> > > > > > > > > > > > > > >> >
> >> > > > > > > > > > > > > > >> > > > wrote:
> >> > > > > > > > > > > > > > >> > > >
> >> > > > > > > > > > > > > > >> > > > > Will it make an impression on user
> >> that,
> >> > > if
> >> > > > he
> >> > > > > > > has a
> >> > > > > > > > > > batch
> >> > > > > > > > > > > > > > >> usecase he
> >> > > > > > > > > > > > > > >> > > has
> >> > > > > > > > > > > > > > >> > > > > to use batch aware operators only?
> If
> >> > so,
> >> > > is
> >> > > > > > that
> >> > > > > > > > what
> >> > > > > > > > > > we
> >> > > > > > > > > > > > > > expect?
> >> > > > > > > > > > > > > > >> I
> >> > > > > > > > > > > > > > >> > am
> >> > > > > > > > > > > > > > >> > > > not
> >> > > > > > > > > > > > > > >> > > > > aware of how do we implement batch
> >> > > scenario
> >> > > > so
> >> > > > > > > this
> >> > > > > > > > > > might
> >> > > > > > > > > > > > be a
> >> > > > > > > > > > > > > > >> basic
> >> > > > > > > > > > > > > > >> > > > > question.
> >> > > > > > > > > > > > > > >> > > > >
> >> > > > > > > > > > > > > > >> > > > > -Priyanka
> >> > > > > > > > > > > > > > >> > > > >
> >> > > > > > > > > > > > > > >> > > > > On Mon, Jan 16, 2017 at 12:02 PM,
> >> > Bhupesh
> >> > > > > > Chawda <
> >> > > > > > > > > > > > > > >> > > > bhupesh@datatorrent.com>
> >> > > > > > > > > > > > > > >> > > > > wrote:
> >> > > > > > > > > > > > > > >> > > > >
> >> > > > > > > > > > > > > > >> > > > > > Hi All,
> >> > > > > > > > > > > > > > >> > > > > >
> >> > > > > > > > > > > > > > >> > > > > > While design / implementation for
> >> > custom
> >> > > > > > control
> >> > > > > > > > > > tuples
> >> > > > > > > > > > > is
> >> > > > > > > > > > > > > > >> > ongoing, I
> >> > > > > > > > > > > > > > >> > > > > > thought it would be a good idea
> to
> >> > > > consider
> >> > > > > > its
> >> > > > > > > > > > > usefulness
> >> > > > > > > > > > > > > in
> >> > > > > > > > > > > > > > >> one
> >> > > > > > > > > > > > > > >> > of
> >> > > > > > > > > > > > > > >> > > > the
> >> > > > > > > > > > > > > > >> > > > > > use cases -  batch applications.
> >> > > > > > > > > > > > > > >> > > > > >
> >> > > > > > > > > > > > > > >> > > > > > This is a proposal to adapt /
> >> extend
> >> > > > > existing
> >> > > > > > > > > > operators
> >> > > > > > > > > > > in
> >> > > > > > > > > > > > > the
> >> > > > > > > > > > > > > > >> > Apache
> >> > > > > > > > > > > > > > >> > > > > Apex
> >> > > > > > > > > > > > > > >> > > > > > Malhar library so that it is easy
> >> to
> >> > use
> >> > > > > them
> >> > > > > > in
> >> > > > > > > > > batch
> >> > > > > > > > > > > use
> >> > > > > > > > > > > > > > >> cases.
> >> > > > > > > > > > > > > > >> > > > > > Naturally, this would be
> applicable
> >> > for
> >> > > > > only a
> >> > > > > > > > > subset
> >> > > > > > > > > > of
> >> > > > > > > > > > > > > > >> operators
> >> > > > > > > > > > > > > > >> > > like
> >> > > > > > > > > > > > > > >> > > > > > File, JDBC and NoSQL databases.
> >> > > > > > > > > > > > > > >> > > > > > For example, for a file based
> >> store,
> >> > > (say
> >> > > > > HDFS
> >> > > > > > > > > store),
> >> > > > > > > > > > > we
> >> > > > > > > > > > > > > > could
> >> > > > > > > > > > > > > > >> > have
> >> > > > > > > > > > > > > > >> > > > > > FileBatchInput and
> FileBatchOutput
> >> > > > operators
> >> > > > > > > which
> >> > > > > > > > > > allow
> >> > > > > > > > > > > > > easy
> >> > > > > > > > > > > > > > >> > > > integration
> >> > > > > > > > > > > > > > >> > > > > > into a batch application. These
> >> > > operators
> >> > > > > > would
> >> > > > > > > be
> >> > > > > > > > > > > > extended
> >> > > > > > > > > > > > > > from
> >> > > > > > > > > > > > > > >> > > their
> >> > > > > > > > > > > > > > >> > > > > > existing implementations and
> would
> >> be
> >> > > > "Batch
> >> > > > > > > > Aware",
> >> > > > > > > > > > in
> >> > > > > > > > > > > > that
> >> > > > > > > > > > > > > > >> they
> >> > > > > > > > > > > > > > >> > may
> >> > > > > > > > > > > > > > >> > > > > > understand the meaning of some
> >> > specific
> >> > > > > > control
> >> > > > > > > > > tuples
> >> > > > > > > > > > > > that
> >> > > > > > > > > > > > > > flow
> >> > > > > > > > > > > > > > >> > > > through
> >> > > > > > > > > > > > > > >> > > > > > the DAG. Start batch and end
> batch
> >> > seem
> >> > > to
> >> > > > > be
> >> > > > > > > the
> >> > > > > > > > > > > obvious
> >> > > > > > > > > > > > > > >> > candidates
> >> > > > > > > > > > > > > > >> > > > that
> >> > > > > > > > > > > > > > >> > > > > > come to mind. On receipt of such
> >> > control
> >> > > > > > tuples,
> >> > > > > > > > > they
> >> > > > > > > > > > > may
> >> > > > > > > > > > > > > try
> >> > > > > > > > > > > > > > to
> >> > > > > > > > > > > > > > >> > > modify
> >> > > > > > > > > > > > > > >> > > > > the
> >> > > > > > > > > > > > > > >> > > > > > behavior of the operator - to
> >> > > reinitialize
> >> > > > > > some
> >> > > > > > > > > > metrics
> >> > > > > > > > > > > or
> >> > > > > > > > > > > > > > >> finalize
> >> > > > > > > > > > > > > > >> > > an
> >> > > > > > > > > > > > > > >> > > > > > output file for example.
> >> > > > > > > > > > > > > > >> > > > > >
> >> > > > > > > > > > > > > > >> > > > > > We can discuss the potential
> >> control
> >> > > > tuples
> >> > > > > > and
> >> > > > > > > > > > actions
> >> > > > > > > > > > > in
> >> > > > > > > > > > > > > > >> detail,
> >> > > > > > > > > > > > > > >> > > but
> >> > > > > > > > > > > > > > >> > > > > > first I would like to understand
> >> the
> >> > > views
> >> > > > > of
> >> > > > > > > the
> >> > > > > > > > > > > > community
> >> > > > > > > > > > > > > > for
> >> > > > > > > > > > > > > > >> > this
> >> > > > > > > > > > > > > > >> > > > > > proposal.
> >> > > > > > > > > > > > > > >> > > > > >
> >> > > > > > > > > > > > > > >> > > > > > ~ Bhupesh
> >> > > > > > > > > > > > > > >> > > > > >
> >> > > > > > > > > > > > > > >> > > > >
> >> > > > > > > > > > > > > > >> > > >
> >> > > > > > > > > > > > > > >> > >
> >> > > > > > > > > > > > > > >> >
> >> > > > > > > > > > > > > > >>
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by Bhupesh Chawda <bh...@datatorrent.com>.
Hi All,

I think we have some agreement on the way we should use control tuples for
File I/O operators to support batch.

In order to have more operators in Malhar support this paradigm, I think we
should also look at the store operators - JDBC, Cassandra, HBase etc.
The case with these operators is simpler, as most of them do not poll the
source (except the JDBC poller operator) and just stop once they have read a
fixed amount of data. In other words, they are inherently batch sources.
The only change we need to add to these operators is to shut down the DAG
once the reading of data is done. For a windowed operator downstream, this
would mean a Global window with a final watermark before the DAG is shut
down.
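
To make the input-side change concrete, something along these lines is what I
have in mind. This is a sketch only, not compiled; the names WatermarkImpl,
ControlTuple and Operator.ShutdownException are from memory, and
AbstractBatchStoreInput is a placeholder, so treat it as pseudocode if those
names do not match the code.

  import org.apache.apex.malhar.lib.window.ControlTuple;
  import org.apache.apex.malhar.lib.window.impl.WatermarkImpl;

  import com.datatorrent.api.DefaultOutputPort;
  import com.datatorrent.api.InputOperator;
  import com.datatorrent.api.Operator.ShutdownException;
  import com.datatorrent.common.util.BaseOperator;

  public abstract class AbstractBatchStoreInput<T> extends BaseOperator implements InputOperator
  {
    public final transient DefaultOutputPort<T> output = new DefaultOutputPort<>();
    public final transient DefaultOutputPort<ControlTuple> watermarkOutput =
        new DefaultOutputPort<>();

    private boolean done = false;
    private boolean finalWatermarkSent = false;

    // Reads the next chunk from the store (JDBC / Cassandra / HBase) and emits
    // it on the output port; returns false once the fixed amount of data has
    // been read.
    protected abstract boolean emitNextChunk();

    @Override
    public void emitTuples()
    {
      if (!done) {
        done = !emitNextChunk();
      } else if (finalWatermarkSent) {
        // The final watermark went out with the previous streaming window, so
        // the DAG can now be shut down.
        throw new ShutdownException();
      }
    }

    @Override
    public void endWindow()
    {
      if (done && !finalWatermarkSent) {
        // Final watermark: everything up to +infinity is complete.
        watermarkOutput.emit(new WatermarkImpl(Long.MAX_VALUE));
        finalWatermarkSent = true;
      }
    }
  }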

~ Bhupesh


_______________________________________________________

Bhupesh Chawda

E: bhupesh@datatorrent.com | Twitter: @bhupeshsc

www.datatorrent.com  |  apex.apache.org



On Tue, Feb 28, 2017 at 10:59 PM, Bhupesh Chawda <bh...@datatorrent.com>
wrote:

> Hi Thomas,
>
> Even though the windowing operator is not just "event time", it seems it
> is too much dependent on the "time" attribute of the incoming tuple. This
> is the reason we had to model the file index as a timestamp to solve the
> batch case for files.
> Perhaps we should work on increasing the scope of the windowed operator to
> consider other types of windows as well. The Sequence option suggested by
> David seems to be something in that direction.
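
To make the above point concrete: in the prototype, each file is assigned a
monotonically increasing sequence number, and that number is used as the
"timestamp" of every tuple from that file. A timestamp extractor for this
would look roughly like the sketch below. The interface and names are
illustrative only; nothing like this exists in Malhar today.

  // Maps each file name to a distinct, monotonically increasing "timestamp"
  // so that tuples from different files land in different windows.
  public class FileSequenceExtractor
  {
    private final java.util.Map<String, Long> sequences = new java.util.HashMap<>();
    private long nextSequence = 0;

    public long getTimestamp(String fileName)
    {
      // A new file gets the next sequence number; tuples from the same file
      // always map to the same value.
      return sequences.computeIfAbsent(fileName, f -> nextSequence++);
    }
  }

A SequenceWindow, as suggested, would essentially formalize this mapping
inside the windowed operator instead of leaving it to the upstream operator.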
>
> ~ Bhupesh
>
>
> _______________________________________________________
>
> Bhupesh Chawda
>
> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>
> www.datatorrent.com  |  apex.apache.org
>
>
>
> On Tue, Feb 28, 2017 at 10:48 PM, Thomas Weise <th...@apache.org> wrote:
>
>> That's correct, we are looking at a generalized approach for state
>> management vs. a series of special cases.
>>
>> And to be clear, windowing does not imply event time, otherwise it would
>> be
>> "EventTimeOperator" :-)
>>
>> Thomas
>>
>> On Tue, Feb 28, 2017 at 9:11 AM, Bhupesh Chawda <bh...@datatorrent.com>
>> wrote:
>>
>> > Hi David,
>> >
>> > I went through the discussion, but it seems like it is more on the event
>> > time watermark handling as opposed to batches. What we are trying to do
>> is
>> > have watermarks serve the purpose of demarcating batches using control
>> > tuples. Since each batch is separate from others, we would like to have
>> > stateful processing within a batch, but not across batches.
>> > At the same time, we would like to do this in a manner which is
>> consistent
>> > with the windowing mechanism provided by the windowed operator. This
>> will
>> > allow us to treat a single batch as a (bounded) stream and apply all the
>> > event time windowing concepts in that time span.
>> >
>> > For example, let's say I need to process data for a day (24 hours) as a
>> > single batch. The application is still streaming in nature: it would end
>> > the batch after a day and start a new batch the next day. At the same
>> time,
>> > I would be able to have early trigger firings every minute as well as
>> drop
>> > any data which is say, 5 mins late. All this within a single day.
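
For the record, this is roughly the windowed operator configuration I have in
mind for the day-as-a-batch example, assuming windowedOperator is a
WindowedOperatorImpl added to the DAG. It is a sketch from memory, not
something I have run; the WindowOption / TriggerOption method names may not
be exact.

  // One day per window, fire an early pane every minute, and drop anything
  // that arrives more than 5 minutes after the day's watermark has passed.
  // Duration here is org.joda.time.Duration.
  windowedOperator.setWindowOption(new WindowOption.TimeWindows(Duration.standardDays(1)));
  windowedOperator.setTriggerOption(TriggerOption.AtWatermark()
      .withEarlyFiringsAtEvery(Duration.standardMinutes(1))
      .discardingFiredPanes());
  windowedOperator.setAllowedLateness(Duration.standardMinutes(5));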
>> >
>> > ~ Bhupesh
>> >
>> >
>> >
>> > _______________________________________________________
>> >
>> > Bhupesh Chawda
>> >
>> > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>> >
>> > www.datatorrent.com  |  apex.apache.org
>> >
>> >
>> >
>> > On Tue, Feb 28, 2017 at 9:27 PM, David Yan <da...@gmail.com> wrote:
>> >
>> > > There is a discussion in the Flink mailing list about key-based
>> > watermarks.
>> > > I think it's relevant to our use case here.
>> > > https://lists.apache.org/thread.html/2b90d5b1d5e2654212cfbbcc6510ef
>> > > 424bbafc4fadb164bd5aff9216@%3Cdev.flink.apache.org%3E
>> > >
>> > > David
>> > >
>> > > On Tue, Feb 28, 2017 at 2:13 AM, Bhupesh Chawda <
>> bhupesh@datatorrent.com
>> > >
>> > > wrote:
>> > >
>> > > > Hi David,
>> > > >
>> > > > If using time window does not seem appropriate, we can have another
>> > class
>> > > > which is more suited for such sequential and distinct windows.
>> > Perhaps, a
>> > > > CustomWindow option can be introduced which takes in a window id.
>> The
>> > > > purpose of this window option could be to translate the window id
>> into
>> > > > appropriate timestamps.
>> > > >
>> > > > Another option would be to go with a custom timestampExtractor for
>> such
>> > > > tuples which translates the each unique file name to a distinct
>> > timestamp
>> > > > while using time windows in the windowed operator.
>> > > >
>> > > > ~ Bhupesh
>> > > >
>> > > >
>> > > > _______________________________________________________
>> > > >
>> > > > Bhupesh Chawda
>> > > >
>> > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>> > > >
>> > > > www.datatorrent.com  |  apex.apache.org
>> > > >
>> > > >
>> > > >
>> > > > On Tue, Feb 28, 2017 at 12:28 AM, David Yan <da...@gmail.com>
>> > wrote:
>> > > >
>> > > > > I now see your rationale on putting the filename in the window.
>> > > > > As far as I understand, the reasons why the filename is not part
>> of
>> > the
>> > > > key
>> > > > > and the Global Window is not used are:
>> > > > >
>> > > > > 1) The files are processed in sequence, not in parallel
>> > > > > 2) The windowed operator should not keep the state associated with
>> > the
>> > > > file
>> > > > > when the processing of the file is done
>> > > > > 3) The trigger should be fired for the file when a file is done
>> > > > processing.
>> > > > >
>> > > > > However, if the file is just a sequence has nothing to do with a
>> > > > timestamp,
>> > > > > assigning a timestamp to a file is not an intuitive thing to do
>> and
>> > > would
>> > > > > just create confusions to the users, especially when it's used as
>> an
>> > > > > example for new users.
>> > > > >
>> > > > > How about having a separate class called SequenceWindow? And
>> perhaps
>> > > > > TimeWindow can inherit from it?
>> > > > >
>> > > > > David
>> > > > >
>> > > > > On Mon, Feb 27, 2017 at 8:58 AM, Thomas Weise <th...@apache.org>
>> > wrote:
>> > > > >
>> > > > > > On Mon, Feb 27, 2017 at 8:50 AM, Bhupesh Chawda <
>> > > > bhupesh@datatorrent.com
>> > > > > >
>> > > > > > wrote:
>> > > > > >
>> > > > > > > I think my comments related to count based windows might be
>> > causing
>> > > > > > > confusion. Let's not discuss count based scenarios for now.
>> > > > > > >
>> > > > > > > Just want to make sure we are on the same page wrt. the "each
>> > file
>> > > > is a
>> > > > > > > batch" use case. As mentioned by Thomas, the each tuple from
>> the
>> > > same
>> > > > > > file
>> > > > > > > has the same timestamp (which is just a sequence number) and
>> that
>> > > > helps
>> > > > > > > keep tuples from each file in a separate window.
>> > > > > > >
>> > > > > >
>> > > > > > Yes, in this case it is a sequence number, but it could be a
>> time
>> > > stamp
>> > > > > > also, depending on the file naming convention. And if it was
>> event
>> > > time
>> > > > > > processing, the watermark would be derived from records within
>> the
>> > > > file.
>> > > > > >
>> > > > > > Agreed, the source should have a mechanism to control the time
>> > stamp
>> > > > > > extraction along with everything else pertaining to the
>> watermark
>> > > > > > generation.
>> > > > > >
>> > > > > >
>> > > > > > > We could also implement a "timestampExtractor" interface to
>> > > identify
>> > > > > the
>> > > > > > > timestamp (sequence number) for a file.
>> > > > > > >
>> > > > > > > ~ Bhupesh
>> > > > > > >
>> > > > > > >
>> > > > > > > _______________________________________________________
>> > > > > > >
>> > > > > > > Bhupesh Chawda
>> > > > > > >
>> > > > > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>> > > > > > >
>> > > > > > > www.datatorrent.com  |  apex.apache.org
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > On Mon, Feb 27, 2017 at 9:52 PM, Thomas Weise <thw@apache.org
>> >
>> > > > wrote:
>> > > > > > >
>> > > > > > > > I don't think this is a use case for count based window.
>> > > > > > > >
>> > > > > > > > We have multiple files that are retrieved in a sequence and
>> > there
>> > > > is
>> > > > > no
>> > > > > > > > knowledge of the number of records per file. The
>> requirement is
>> > > to
>> > > > > > > > aggregate each file separately and emit the aggregate when
>> the
>> > > file
>> > > > > is
>> > > > > > > read
>> > > > > > > > fully. There is no concept of "end of something" for an
>> > > individual
>> > > > > key
>> > > > > > > and
>> > > > > > > > global window isn't applicable.
>> > > > > > > >
>> > > > > > > > However, as already explained and implemented by Bhupesh,
>> this
>> > > can
>> > > > be
>> > > > > > > > solved using watermark and window (in this case the window
>> > > > timestamp
>> > > > > > > isn't
>> > > > > > > > a timestamp, but a file sequence, but that doesn't matter).
>> > > > > > > >
>> > > > > > > > Thomas
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > On Mon, Feb 27, 2017 at 8:05 AM, David Yan <
>> davidyan@gmail.com
>> > >
>> > > > > wrote:
>> > > > > > > >
>> > > > > > > > > I don't think this is the way to go. Global Window only
>> means
>> > > the
>> > > > > > > > timestamp
>> > > > > > > > > does not matter (or that there is no timestamp). It does
>> not
>> > > > > > > necessarily
>> > > > > > > > > mean it's a large batch. Unless there is some notion of
>> event
>> > > > time
>> > > > > > for
>> > > > > > > > each
>> > > > > > > > > file, you don't want to embed the file into the window
>> > itself.
>> > > > > > > > >
>> > > > > > > > > If you want the result broken up by file name, and if the
>> > files
>> > > > are
>> > > > > > to
>> > > > > > > be
>> > > > > > > > > processed in parallel, I think making the file name be
>> part
>> > of
>> > > > the
>> > > > > > key
>> > > > > > > is
>> > > > > > > > > the way to go. I think it's very confusing if we somehow
>> make
>> > > the
>> > > > > > file
>> > > > > > > to
>> > > > > > > > > be part of the window.
>> > > > > > > > >
>> > > > > > > > > For count-based window, it's not implemented yet and
>> you're
>> > > > welcome
>> > > > > > to
>> > > > > > > > add
>> > > > > > > > > that feature. In case of count-based windows, there would
>> be
>> > no
>> > > > > > notion
>> > > > > > > of
>> > > > > > > > > time and you probably only trigger at the end of each
>> window.
>> > > In
>> > > > > the
>> > > > > > > case
>> > > > > > > > > of count-based windows, the watermark only matters for
>> batch
>> > > > since
>> > > > > > you
>> > > > > > > > need
>> > > > > > > > > a way to know when the batch has ended (if the count is
>> 10,
>> > the
>> > > > > > number
>> > > > > > > of
>> > > > > > > > > tuples in the batch is let's say 105, you need a way to
>> end
>> > the
>> > > > > last
>> > > > > > > > window
>> > > > > > > > > with 5 tuples).
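>> > > > > > > > >
>> > > > > > > > > (To illustrate that last point with hypothetical pseudo-code;
>> > > > > > > > > none of these methods exist today:)
>> > > > > > > > >
>> > > > > > > > >   // count = 10, 105 tuples: ten full windows plus a partial one;
>> > > > > > > > >   // only the final watermark lets the last window (5 tuples) fire
>> > > > > > > > >   if (finalWatermarkReceived && tuplesInCurrentWindow > 0) {
>> > > > > > > > >     fireTrigger(currentWindow);   // emit the partial last window
>> > > > > > > > >     closeWindow(currentWindow);
>> > > > > > > > >   }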
>> > > > > > > > >
>> > > > > > > > > David
>> > > > > > > > >
>> > > > > > > > > On Mon, Feb 27, 2017 at 2:41 AM, Bhupesh Chawda <
>> > > > > > > bhupesh@datatorrent.com
>> > > > > > > > >
>> > > > > > > > > wrote:
>> > > > > > > > >
>> > > > > > > > > > Hi David,
>> > > > > > > > > >
>> > > > > > > > > > Thanks for your comments.
>> > > > > > > > > >
>> > > > > > > > > > The wordcount example that I created based on the
>> windowed
>> > > > > operator
>> > > > > > > > does
>> > > > > > > > > > processing of word counts per file (each file as a
>> separate
>> > > > > batch),
>> > > > > > > > i.e.
>> > > > > > > > > > process counts for each file and dump into separate
>> files.
>> > > > > > > > > > As I understand Global window is for one large batch;
>> i.e.
>> > > all
>> > > > > > > incoming
>> > > > > > > > > > data falls into the same batch. This could not be
>> processed
>> > > > using
>> > > > > > > > > > GlobalWindow option as we need more than one window. In
>> > this
>> > > > > > case, I
>> > > > > > > > > > configured the windowed operator to have time windows of
>> > 1ms
>> > > > each
>> > > > > > and
>> > > > > > > > > > passed data for each file with increasing timestamps:
>> > (file1,
>> > > > 1),
>> > > > > > > > (file2,
>> > > > > > > > > > 2) and so on. Is there a better way of handling this
>> > > scenario?
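>> > > > > > > > > >
>> > > > > > > > > > A rough sketch of that configuration (approximate API usage,
>> > > > > > > > > > only to illustrate the file-index-as-timestamp idea):
>> > > > > > > > > >
>> > > > > > > > > >   // one 1 ms time window per file
>> > > > > > > > > >   windowedOperator.setWindowOption(
>> > > > > > > > > >       new WindowOption.TimeWindows(Duration.millis(1)));
>> > > > > > > > > >
>> > > > > > > > > >   // upstream, every tuple from the n-th file carries n as its
>> > > > > > > > > >   // "timestamp", e.g. (file1, 1), (file2, 2), ...
>> > > > > > > > > >   output.emit(new Tuple.TimestampedTuple<>(fileIndex, word));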
>> > > > > > > > > >
>> > > > > > > > > > Regarding (2 - count based windows), I think there is a
>> > > trigger
>> > > > > > > option
>> > > > > > > > to
>> > > > > > > > > > process count based windows. In case I want to process
>> > every
>> > > > 1000
>> > > > > > > > tuples
>> > > > > > > > > as
>> > > > > > > > > > a batch, I could set the Trigger option to CountTrigger
>> > with
>> > > > the
>> > > > > > > > > > accumulation set to Discarding. Is this correct?
>> > > > > > > > > >
>> > > > > > > > > > I agree that (4. Final Watermark) can be done using
>> Global
>> > > > > window.
>> > > > > > > > > >
>> > > > > > > > > > ​~ Bhupesh​
>> > > > > > > > > >
>> > > > > > > > > > _______________________________________________________
>> > > > > > > > > >
>> > > > > > > > > > Bhupesh Chawda
>> > > > > > > > > >
>> > > > > > > > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>> > > > > > > > > >
>> > > > > > > > > > www.datatorrent.com  |  apex.apache.org
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > > On Mon, Feb 27, 2017 at 12:18 PM, David Yan <
>> > > > davidyan@gmail.com>
>> > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > > I'm worried that we are making the watermark concept
>> too
>> > > > > > > complicated.
>> > > > > > > > > > >
>> > > > > > > > > > > Watermarks should simply just tell you what windows
>> can
>> > be
>> > > > > > > considered
>> > > > > > > > > > > complete.
>> > > > > > > > > > >
>> > > > > > > > > > > Point 2 is basically a count-based window. Watermarks
>> do
>> > > not
>> > > > > > play a
>> > > > > > > > > role
>> > > > > > > > > > > here because the window is always complete at the n-th
>> > > tuple.
>> > > > > > > > > > >
>> > > > > > > > > > > If I understand correctly, point 3 is for batch
>> > processing
>> > > of
>> > > > > > > files.
>> > > > > > > > > > Unless
>> > > > > > > > > > > the files contain timed events, it sounds like
>> this
>> > > can
>> > > > > be
>> > > > > > > > > achieved
>> > > > > > > > > > > with just a Global Window. For signaling EOF, a
>> watermark
>> > > > with
>> > > > > a
>> > > > > > > > > > +infinity
>> > > > > > > > > > > timestamp can be used so that triggers will be fired
>> upon
>> > > > > receipt
>> > > > > > > of
>> > > > > > > > > that
>> > > > > > > > > > > watermark.
>> > > > > > > > > > >
>> > > > > > > > > > > For point 4, just like what I mentioned above, can be
>> > > > achieved
>> > > > > > > with a
>> > > > > > > > > > > watermark with a +infinity timestamp.
>> > > > > > > > > > >
>> > > > > > > > > > > David
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > On Sat, Feb 18, 2017 at 8:04 AM, Bhupesh Chawda <
>> > > > > > > > > bhupesh@datatorrent.com
>> > > > > > > > > > >
>> > > > > > > > > > > wrote:
>> > > > > > > > > > >
>> > > > > > > > > > > > Hi Thomas,
>> > > > > > > > > > > >
>> > > > > > > > > > > > For an input operator which is supposed to generate
>> > > > > watermarks
>> > > > > > > for
>> > > > > > > > > > > > downstream operators, I can think about the
>> following
>> > > > > > watermarks
>> > > > > > > > that
>> > > > > > > > > > the
>> > > > > > > > > > > > operator can emit:
>> > > > > > > > > > > > 1. Time based watermarks (the high watermark / low
>> > > > watermark)
>> > > > > > > > > > > > 2. Number of tuple based watermarks (Every n tuples)
>> > > > > > > > > > > > 3. File based watermarks (Start file, end file)
>> > > > > > > > > > > > 4. Final watermark
>> > > > > > > > > > > >
>> > > > > > > > > > > > File based watermarks seem to be applicable for
>> batch
>> > > (file
>> > > > > > > based)
>> > > > > > > > as
>> > > > > > > > > > > well,
>> > > > > > > > > > > > and hence I thought of looking at these first. Does
>> > this
>> > > > seem
>> > > > > > to
>> > > > > > > be
>> > > > > > > > > in
>> > > > > > > > > > > line
>> > > > > > > > > > > > with the thought process?
>> > > > > > > > > > > >
>> > > > > > > > > > > > ~ Bhupesh
>> > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > > ______________________________
>> > _________________________
>> > > > > > > > > > > >
>> > > > > > > > > > > > Bhupesh Chawda
>> > > > > > > > > > > >
>> > > > > > > > > > > > Software Engineer
>> > > > > > > > > > > >
>> > > > > > > > > > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>> > > > > > > > > > > >
>> > > > > > > > > > > > www.datatorrent.com  |  apex.apache.org
>> > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > > On Thu, Feb 16, 2017 at 10:37 AM, Thomas Weise <
>> > > > > thw@apache.org
>> > > > > > >
>> > > > > > > > > wrote:
>> > > > > > > > > > > >
>> > > > > > > > > > > > > I don't think this should be designed based on a
>> > > > simplistic
>> > > > > > > file
>> > > > > > > > > > > > > input-output scenario. It would be good to
>> include a
>> > > > > stateful
>> > > > > > > > > > > > > transformation based on event time.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > More complex pipelines contain stateful
>> > transformations
>> > > > > that
>> > > > > > > > depend
>> > > > > > > > > > on
>> > > > > > > > > > > > > windowing and watermarks. I think we need a
>> watermark
>> > > > > concept
>> > > > > > > > that
>> > > > > > > > > is
>> > > > > > > > > > > > based
>> > > > > > > > > > > > > on progress in event time (or other monotonic
>> > > increasing
>> > > > > > > > sequence)
>> > > > > > > > > > that
>> > > > > > > > > > > > > other operators can generically work with.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Note that even file input in many cases can
>> produce
>> > > time
>> > > > > > based
>> > > > > > > > > > > > watermarks,
>> > > > > > > > > > > > > for example when you read part files that are
>> bound
>> > by
>> > > > > event
>> > > > > > > > time.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Thanks,
>> > > > > > > > > > > > > Thomas
>> > > > > > > > > > > > >
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > On Wed, Feb 15, 2017 at 4:02 AM, Bhupesh Chawda <
>> > > > > > > > > > > bhupesh@datatorrent.com
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > > For better understanding the use case for
>> control
>> > > > tuples
>> > > > > in
>> > > > > > > > > batch,
>> > > > > > > > > > ​I
>> > > > > > > > > > > > am
>> > > > > > > > > > > > > > creating a prototype for a batch application
>> using
>> > > File
>> > > > > > Input
>> > > > > > > > and
>> > > > > > > > > > > File
>> > > > > > > > > > > > > > Output operators.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > To enable basic batch processing for File IO
>> > > > operators, I
>> > > > > > am
>> > > > > > > > > > > proposing
>> > > > > > > > > > > > > the
>> > > > > > > > > > > > > > following changes to File input and output
>> > operators:
>> > > > > > > > > > > > > > 1. File Input operator emits a watermark each
>> time
>> > it
>> > > > > opens
>> > > > > > > and
>> > > > > > > > > > > closes
>> > > > > > > > > > > > a
>> > > > > > > > > > > > > > file. These can be "start file" and "end file"
>> > > > watermarks
>> > > > > > > which
>> > > > > > > > > > > include
>> > > > > > > > > > > > > the
>> > > > > > > > > > > > > > corresponding file names. The "start file" tuple
>> > > should
>> > > > > be
>> > > > > > > sent
>> > > > > > > > > > > before
>> > > > > > > > > > > > > any
>> > > > > > > > > > > > > > of the data from that file flows.
>> > > > > > > > > > > > > > 2. File Input operator can be configured to end
>> the
>> > > > > > > application
>> > > > > > > > > > > after a
>> > > > > > > > > > > > > > single or n scans of the directory (a batch).
>> This
>> > is
>> > > > > where
>> > > > > > > the
>> > > > > > > > > > > > operator
>> > > > > > > > > > > > > > emits the final watermark (the end of
>> application
>> > > > control
>> > > > > > > > tuple).
>> > > > > > > > > > > This
>> > > > > > > > > > > > > will
>> > > > > > > > > > > > > > also shutdown the application.
>> > > > > > > > > > > > > > 3. The File output operator handles these
>> control
>> > > > tuples.
>> > > > > > > > "Start
>> > > > > > > > > > > file"
>> > > > > > > > > > > > > > initializes the file name for the incoming
>> tuples.
>> > > "End
>> > > > > > file"
>> > > > > > > > > > > watermark
>> > > > > > > > > > > > > > forces a finalize on that file.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > The user would be able to enable the operators
>> to
>> > > send
>> > > > > only
>> > > > > > > > those
>> > > > > > > > > > > > > > watermarks that are needed in the application.
>> If
>> > > none
>> > > > of
>> > > > > > the
>> > > > > > > > > > options
>> > > > > > > > > > > > are
>> > > > > > > > > > > > > > configured, the operators behave as in a
>> streaming
>> > > > > > > application.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > There are a few challenges in the implementation
>> > > where
>> > > > > the
>> > > > > > > > input
>> > > > > > > > > > > > operator
>> > > > > > > > > > > > > > is partitioned. In this case, the correlation
>> > between
>> > > > the
>> > > > > > > > > start/end
>> > > > > > > > > > > > for a
>> > > > > > > > > > > > > > file and the data tuples for that file is lost.
>> > Hence
>> > > > we
>> > > > > > need
>> > > > > > > > to
>> > > > > > > > > > > > maintain
>> > > > > > > > > > > > > > the filename as part of each tuple in the
>> pipeline.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > The "start file" and "end file" control tuples
>> in
>> > > this
>> > > > > > > example
>> > > > > > > > > are
>> > > > > > > > > > > > > > temporary names for watermarks. We can have
>> generic
>> > > > > "start
>> > > > > > > > > batch" /
>> > > > > > > > > > > > "end
>> > > > > > > > > > > > > > batch" tuples which could be used for other use
>> > cases
>> > > > as
>> > > > > > > well.
>> > > > > > > > > The
>> > > > > > > > > > > > Final
>> > > > > > > > > > > > > > watermark is common and serves the same purpose
>> in
>> > > each
>> > > > > > case.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > Please let me know your thoughts on this.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > ~ Bhupesh
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > On Wed, Jan 18, 2017 at 12:22 AM, Bhupesh
>> Chawda <
>> > > > > > > > > > > > > bhupesh@datatorrent.com>
>> > > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Yes, this can be part of operator
>> configuration.
>> > > > Given
>> > > > > > > this,
>> > > > > > > > > for
>> > > > > > > > > > a
>> > > > > > > > > > > > user
>> > > > > > > > > > > > > > to
>> > > > > > > > > > > > > > > define a batch application, would mean
>> > configuring
>> > > > the
>> > > > > > > > > connectors
>> > > > > > > > > > > > > (mostly
>> > > > > > > > > > > > > > > the input operator) in the application for the
>> > > > desired
>> > > > > > > > > behavior.
>> > > > > > > > > > > > > > Similarly,
>> > > > > > > > > > > > > > > there can be other use cases that can be
>> achieved
>> > > > other
>> > > > > > > than
>> > > > > > > > > > batch.
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > We may also need to take care of the
>> following:
>> > > > > > > > > > > > > > > 1. Make sure that the watermarks or control
>> > tuples
>> > > > are
>> > > > > > > > > consistent
>> > > > > > > > > > > > > across
>> > > > > > > > > > > > > > > sources. Meaning an HDFS sink should be able
>> to
>> > > > > interpret
>> > > > > > > the
>> > > > > > > > > > > > watermark
>> > > > > > > > > > > > > > > tuple sent out by, say, a JDBC source.
>> > > > > > > > > > > > > > > 2. In addition to I/O connectors, we should
>> also
>> > > look
>> > > > > at
>> > > > > > > the
>> > > > > > > > > need
>> > > > > > > > > > > for
>> > > > > > > > > > > > > > > processing operators to understand some of the
>> > > > control
>> > > > > > > > tuples /
>> > > > > > > > > > > > > > watermarks.
>> > > > > > > > > > > > > > > For example, we may want to reset the operator
>> > > > behavior
>> > > > > > on
>> > > > > > > > > > arrival
>> > > > > > > > > > > of
>> > > > > > > > > > > > > > some
>> > > > > > > > > > > > > > > watermark tuple.
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > ~ Bhupesh
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > On Tue, Jan 17, 2017 at 9:59 PM, Thomas Weise
>> <
>> > > > > > > > thw@apache.org>
>> > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > >> The HDFS source can operate in two modes,
>> > bounded
>> > > or
>> > > > > > > > > unbounded.
>> > > > > > > > > > If
>> > > > > > > > > > > > you
>> > > > > > > > > > > > > > >> scan
>> > > > > > > > > > > > > > >> only once, then it should emit the final
>> > watermark
>> > > > > after
>> > > > > > > it
>> > > > > > > > is
>> > > > > > > > > > > done.
>> > > > > > > > > > > > > > >> Otherwise it would emit watermarks based on a
>> > > policy
>> > > > > > > (files
>> > > > > > > > > > names
>> > > > > > > > > > > > > etc.).
>> > > > > > > > > > > > > > >> The mechanism to generate the marks may
>> depend
>> > on
>> > > > the
>> > > > > > type
>> > > > > > > > of
>> > > > > > > > > > > source
>> > > > > > > > > > > > > and
>> > > > > > > > > > > > > > >> the user needs to be able to
>> influence/configure
>> > > it.
>> > > > > > > > > > > > > > >>
>> > > > > > > > > > > > > > >> Thomas
>> > > > > > > > > > > > > > >>
>> > > > > > > > > > > > > > >>
>> > > > > > > > > > > > > > >> On Tue, Jan 17, 2017 at 5:03 AM, Bhupesh
>> Chawda
>> > <
>> > > > > > > > > > > > > > bhupesh@datatorrent.com>
>> > > > > > > > > > > > > > >> wrote:
>> > > > > > > > > > > > > > >>
>> > > > > > > > > > > > > > >> > Hi Thomas,
>> > > > > > > > > > > > > > >> >
>> > > > > > > > > > > > > > >> > I am not sure that I completely understand
>> > your
>> > > > > > > > suggestion.
>> > > > > > > > > > Are
>> > > > > > > > > > > > you
>> > > > > > > > > > > > > > >> > suggesting to broaden the scope of the
>> > proposal
>> > > to
>> > > > > > treat
>> > > > > > > > all
>> > > > > > > > > > > > sources
>> > > > > > > > > > > > > > as
>> > > > > > > > > > > > > > >> > bounded as well as unbounded?
>> > > > > > > > > > > > > > >> >
>> > > > > > > > > > > > > > >> > In case of Apex, we treat all sources as
>> > > unbounded
>> > > > > > > > sources.
>> > > > > > > > > > Even
>> > > > > > > > > > > > > > bounded
>> > > > > > > > > > > > > > >> > sources like the HDFS file source are treated as
>> > > > > unbounded
>> > > > > > by
>> > > > > > > > > means
>> > > > > > > > > > > of
>> > > > > > > > > > > > > > >> scanning
>> > > > > > > > > > > > > > >> > the input directory repeatedly.
>> > > > > > > > > > > > > > >> >
>> > > > > > > > > > > > > > >> > Let's consider HDFS file source for
>> example:
>> > > > > > > > > > > > > > >> > In this case, if we treat it as a bounded
>> > > source,
>> > > > we
>> > > > > > can
>> > > > > > > > > > define
>> > > > > > > > > > > > > hooks
>> > > > > > > > > > > > > > >> which
>> > > > > > > > > > > > > > >> > allow us to detect the end of the file and
>> > send
>> > > > the
>> > > > > > > > "final
>> > > > > > > > > > > > > > watermark".
>> > > > > > > > > > > > > > >> We
>> > > > > > > > > > > > > > >> > could also consider HDFS file source as a
>> > > > streaming
>> > > > > > > source
>> > > > > > > > > and
>> > > > > > > > > > > > > define
>> > > > > > > > > > > > > > >> hooks
>> > > > > > > > > > > > > > >> > which send watermarks based on different
>> kinds
>> > > of
>> > > > > > > windows.
>> > > > > > > > > > > > > > >> >
>> > > > > > > > > > > > > > >> > Please correct me if I misunderstand.
>> > > > > > > > > > > > > > >> >
>> > > > > > > > > > > > > > >> > ~ Bhupesh
>> > > > > > > > > > > > > > >> >
>> > > > > > > > > > > > > > >> >
>> > > > > > > > > > > > > > >> > On Mon, Jan 16, 2017 at 9:23 PM, Thomas
>> Weise
>> > <
>> > > > > > > > > thw@apache.org
>> > > > > > > > > > >
>> > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > >> >
>> > > > > > > > > > > > > > >> > > Bhupesh,
>> > > > > > > > > > > > > > >> > >
>> > > > > > > > > > > > > > >> > > Please see how that can be solved in a
>> > unified
>> > > > way
>> > > > > > > using
>> > > > > > > > > > > windows
>> > > > > > > > > > > > > and
>> > > > > > > > > > > > > > >> > > watermarks. It is bounded data vs.
>> unbounded
>> > > > data.
>> > > > > > In
>> > > > > > > > Beam
>> > > > > > > > > > for
>> > > > > > > > > > > > > > >> example,
>> > > > > > > > > > > > > > >> > you
>> > > > > > > > > > > > > > >> > > can use the "global window" and the final
>> > > > > watermark
>> > > > > > to
>> > > > > > > > > > > > accomplish
>> > > > > > > > > > > > > > what
>> > > > > > > > > > > > > > >> > you
>> > > > > > > > > > > > > > >> > > are looking for. Batch is just a special
>> > case
>> > > of
>> > > > > > > > streaming
>> > > > > > > > > > > where
>> > > > > > > > > > > > > the
>> > > > > > > > > > > > > > >> > source
>> > > > > > > > > > > > > > >> > > emits the final watermark.
>> > > > > > > > > > > > > > >> > >
>> > > > > > > > > > > > > > >> > > Thanks,
>> > > > > > > > > > > > > > >> > > Thomas
>> > > > > > > > > > > > > > >> > >
>> > > > > > > > > > > > > > >> > >
>> > > > > > > > > > > > > > >> > > On Mon, Jan 16, 2017 at 1:02 AM, Bhupesh
>> > > Chawda
>> > > > <
>> > > > > > > > > > > > > > >> bhupesh@datatorrent.com
>> > > > > > > > > > > > > > >> > >
>> > > > > > > > > > > > > > >> > > wrote:
>> > > > > > > > > > > > > > >> > >
>> > > > > > > > > > > > > > >> > > > Yes, if the user needs to develop a
>> batch
>> > > > > > > application,
>> > > > > > > > > > then
>> > > > > > > > > > > > > batch
>> > > > > > > > > > > > > > >> aware
>> > > > > > > > > > > > > > >> > > > operators need to be used in the
>> > > application.
>> > > > > > > > > > > > > > >> > > > The nature of the application is mostly
>> > > > > controlled
>> > > > > > > by
>> > > > > > > > > the
>> > > > > > > > > > > > input
>> > > > > > > > > > > > > > and
>> > > > > > > > > > > > > > >> the
>> > > > > > > > > > > > > > >> > > > output operators used in the
>> application.
>> > > > > > > > > > > > > > >> > > >
>> > > > > > > > > > > > > > >> > > > For example, consider an application
>> which
>> > > > needs
>> > > > > > to
>> > > > > > > > > filter
>> > > > > > > > > > > > > records
>> > > > > > > > > > > > > > >> in a
>> > > > > > > > > > > > > > >> > > > input file and store the filtered
>> records
>> > in
>> > > > > > another
>> > > > > > > > > file.
>> > > > > > > > > > > The
>> > > > > > > > > > > > > > >> nature
>> > > > > > > > > > > > > > >> > of
>> > > > > > > > > > > > > > >> > > > this app is to end once the entire
>> file is
>> > > > > > > processed.
>> > > > > > > > > > > > Following
>> > > > > > > > > > > > > > >> things
>> > > > > > > > > > > > > > >> > > are
>> > > > > > > > > > > > > > >> > > > expected of the application:
>> > > > > > > > > > > > > > >> > > >
>> > > > > > > > > > > > > > >> > > >    1. Once the input data is over,
>> > finalize
>> > > > the
>> > > > > > > output
>> > > > > > > > > > file
>> > > > > > > > > > > > from
>> > > > > > > > > > > > > > >> .tmp
>> > > > > > > > > > > > > > >> > > >    files. - Responsibility of output
>> > > operator
>> > > > > > > > > > > > > > >> > > >    2. End the application, once the
>> data
>> > is
>> > > > read
>> > > > > > and
>> > > > > > > > > > > > processed -
>> > > > > > > > > > > > > > >> > > >    Responsibility of input operator
>> > > > > > > > > > > > > > >> > > >
>> > > > > > > > > > > > > > >> > > > These functions are essential to allow
>> the
>> > > > user
>> > > > > to
>> > > > > > > do
>> > > > > > > > > > higher
>> > > > > > > > > > > > > level
>> > > > > > > > > > > > > > >> > > > operations like scheduling or running a
>> > > > workflow
>> > > > > > of
>> > > > > > > > > batch
>> > > > > > > > > > > > > > >> applications.
>> > > > > > > > > > > > > > >> > > >
>> > > > > > > > > > > > > > >> > > > I am not sure about intermediate
>> > > (processing)
>> > > > > > > > operators,
>> > > > > > > > > > as
>> > > > > > > > > > > > > there
>> > > > > > > > > > > > > > >> is no
>> > > > > > > > > > > > > > >> > > > change in their functionality for batch
>> > use
>> > > > > cases.
>> > > > > > > > > > Perhaps,
>> > > > > > > > > > > > > > allowing
>> > > > > > > > > > > > > > >> > > > running multiple batches in a single
>> > > > application
>> > > > > > may
>> > > > > > > > > > require
>> > > > > > > > > > > > > > similar
>> > > > > > > > > > > > > > >> > > > changes in processing operators as
>> well.
>> > > > > > > > > > > > > > >> > > >
>> > > > > > > > > > > > > > >> > > > ~ Bhupesh
>> > > > > > > > > > > > > > >> > > >
>> > > > > > > > > > > > > > >> > > > On Mon, Jan 16, 2017 at 2:19 PM,
>> Priyanka
>> > > > > Gugale <
>> > > > > > > > > > > > > > priyag@apache.org
>> > > > > > > > > > > > > > >> >
>> > > > > > > > > > > > > > >> > > > wrote:
>> > > > > > > > > > > > > > >> > > >
>> > > > > > > > > > > > > > >> > > > > Will it give the user the impression that,
>> > > > > > > > > > > > > > >> > > > > if he has a batch use case, he has to use
>> > > > > > > > > > > > > > >> > > > > batch-aware operators only? If so, is that
>> > > > > > > > > > > > > > >> > > > > what we expect? I am not aware of how we
>> > > > > > > > > > > > > > >> > > > > implement the batch scenario, so this might
>> > > > > > > > > > > > > > >> > > > > be a basic question.
>> > > > > > > > > > > > > > >> > > > >
>> > > > > > > > > > > > > > >> > > > > -Priyanka
>> > > > > > > > > > > > > > >> > > > >
>> > > > > > > > > > > > > > >> > > > > On Mon, Jan 16, 2017 at 12:02 PM,
>> > Bhupesh
>> > > > > > Chawda <
>> > > > > > > > > > > > > > >> > > > bhupesh@datatorrent.com>
>> > > > > > > > > > > > > > >> > > > > wrote:
>> > > > > > > > > > > > > > >> > > > >
>> > > > > > > > > > > > > > >> > > > > > Hi All,
>> > > > > > > > > > > > > > >> > > > > >
>> > > > > > > > > > > > > > >> > > > > > While design / implementation for
>> > custom
>> > > > > > control
>> > > > > > > > > > tuples
>> > > > > > > > > > > is
>> > > > > > > > > > > > > > >> > ongoing, I
>> > > > > > > > > > > > > > >> > > > > > thought it would be a good idea to
>> > > > consider
>> > > > > > its
>> > > > > > > > > > > usefulness
>> > > > > > > > > > > > > in
>> > > > > > > > > > > > > > >> one
>> > > > > > > > > > > > > > >> > of
>> > > > > > > > > > > > > > >> > > > the
>> > > > > > > > > > > > > > >> > > > > > use cases -  batch applications.
>> > > > > > > > > > > > > > >> > > > > >
>> > > > > > > > > > > > > > >> > > > > > This is a proposal to adapt /
>> extend
>> > > > > existing
>> > > > > > > > > > operators
>> > > > > > > > > > > in
>> > > > > > > > > > > > > the
>> > > > > > > > > > > > > > >> > Apache
>> > > > > > > > > > > > > > >> > > > > Apex
>> > > > > > > > > > > > > > >> > > > > > Malhar library so that it is easy
>> to
>> > use
>> > > > > them
>> > > > > > in
>> > > > > > > > > batch
>> > > > > > > > > > > use
>> > > > > > > > > > > > > > >> cases.
>> > > > > > > > > > > > > > >> > > > > > Naturally, this would be applicable
>> > for
>> > > > > only a
>> > > > > > > > > subset
>> > > > > > > > > > of
>> > > > > > > > > > > > > > >> operators
>> > > > > > > > > > > > > > >> > > like
>> > > > > > > > > > > > > > >> > > > > > File, JDBC and NoSQL databases.
>> > > > > > > > > > > > > > >> > > > > > For example, for a file based
>> store,
>> > > (say
>> > > > > HDFS
>> > > > > > > > > store),
>> > > > > > > > > > > we
>> > > > > > > > > > > > > > could
>> > > > > > > > > > > > > > >> > have
>> > > > > > > > > > > > > > >> > > > > > FileBatchInput and FileBatchOutput
>> > > > operators
>> > > > > > > which
>> > > > > > > > > > allow
>> > > > > > > > > > > > > easy
>> > > > > > > > > > > > > > >> > > > integration
>> > > > > > > > > > > > > > >> > > > > > into a batch application. These
>> > > operators
>> > > > > > would
>> > > > > > > be
>> > > > > > > > > > > > extended
>> > > > > > > > > > > > > > from
>> > > > > > > > > > > > > > >> > > their
>> > > > > > > > > > > > > > >> > > > > > existing implementations and would
>> be
>> > > > "Batch
>> > > > > > > > Aware",
>> > > > > > > > > > in
>> > > > > > > > > > > > that
>> > > > > > > > > > > > > > >> they
>> > > > > > > > > > > > > > >> > may
>> > > > > > > > > > > > > > >> > > > > > understand the meaning of some
>> > specific
>> > > > > > control
>> > > > > > > > > tuples
>> > > > > > > > > > > > that
>> > > > > > > > > > > > > > flow
>> > > > > > > > > > > > > > >> > > > through
>> > > > > > > > > > > > > > >> > > > > > the DAG. Start batch and end batch
>> > seem
>> > > to
>> > > > > be
>> > > > > > > the
>> > > > > > > > > > > obvious
>> > > > > > > > > > > > > > >> > candidates
>> > > > > > > > > > > > > > >> > > > that
>> > > > > > > > > > > > > > >> > > > > > come to mind. On receipt of such
>> > control
>> > > > > > tuples,
>> > > > > > > > > they
>> > > > > > > > > > > may
>> > > > > > > > > > > > > try
>> > > > > > > > > > > > > > to
>> > > > > > > > > > > > > > >> > > modify
>> > > > > > > > > > > > > > >> > > > > the
>> > > > > > > > > > > > > > >> > > > > > behavior of the operator - to
>> > > reinitialize
>> > > > > > some
>> > > > > > > > > > metrics
>> > > > > > > > > > > or
>> > > > > > > > > > > > > > >> finalize
>> > > > > > > > > > > > > > >> > > an
>> > > > > > > > > > > > > > >> > > > > > output file for example.
>> > > > > > > > > > > > > > >> > > > > >
>> > > > > > > > > > > > > > >> > > > > > We can discuss the potential
>> control
>> > > > tuples
>> > > > > > and
>> > > > > > > > > > actions
>> > > > > > > > > > > in
>> > > > > > > > > > > > > > >> detail,
>> > > > > > > > > > > > > > >> > > but
>> > > > > > > > > > > > > > >> > > > > > first I would like to understand
>> the
>> > > views
>> > > > > of
>> > > > > > > the
>> > > > > > > > > > > > community
>> > > > > > > > > > > > > > for
>> > > > > > > > > > > > > > >> > this
>> > > > > > > > > > > > > > >> > > > > > proposal.
>> > > > > > > > > > > > > > >> > > > > >
>> > > > > > > > > > > > > > >> > > > > > ~ Bhupesh
>> > > > > > > > > > > > > > >> > > > > >
>> > > > > > > > > > > > > > >> > > > >
>> > > > > > > > > > > > > > >> > > >
>> > > > > > > > > > > > > > >> > >
>> > > > > > > > > > > > > > >> >
>> > > > > > > > > > > > > > >>
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by Bhupesh Chawda <bh...@datatorrent.com>.
Created a JIRA to track this:
https://issues.apache.org/jira/browse/APEXMALHAR-2449

~ Bhupesh


_______________________________________________________

Bhupesh Chawda

E: bhupesh@datatorrent.com | Twitter: @bhupeshsc

www.datatorrent.com  |  apex.apache.org



On Tue, Feb 28, 2017 at 10:59 PM, Bhupesh Chawda <bh...@datatorrent.com>
wrote:

> Hi Thomas,
>
> Even though the windowing operator is not just "event time", it seems too
> dependent on the "time" attribute of the incoming tuple. This
> is the reason we had to model the file index as a timestamp to solve the
> batch case for files.
> Perhaps we should work on increasing the scope of the windowed operator to
> consider other types of windows as well. The Sequence option suggested by
> David seems to be something in that direction.
>
> ~ Bhupesh
>
>
> _______________________________________________________
>
> Bhupesh Chawda
>
> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>
> www.datatorrent.com  |  apex.apache.org
>
>
>
> On Tue, Feb 28, 2017 at 10:48 PM, Thomas Weise <th...@apache.org> wrote:
>
>> That's correct, we are looking at a generalized approach for state
>> management vs. a series of special cases.
>>
>> And to be clear, windowing does not imply event time, otherwise it would
>> be
>> "EventTimeOperator" :-)
>>
>> Thomas
>>
>> On Tue, Feb 28, 2017 at 9:11 AM, Bhupesh Chawda <bh...@datatorrent.com>
>> wrote:
>>
>> > Hi David,
>> >
>> > I went through the discussion, but it seems to focus more on event time
>> > watermark handling as opposed to batches. What we are trying to do
>> is
>> > have watermarks serve the purpose of demarcating batches using control
>> > tuples. Since each batch is separate from others, we would like to have
>> > stateful processing within a batch, but not across batches.
>> > At the same time, we would like to do this in a manner which is
>> consistent
>> > with the windowing mechanism provided by the windowed operator. This
>> will
>> > allow us to treat a single batch as a (bounded) stream and apply all the
>> > event time windowing concepts in that time span.
>> >
>> > For example, let's say I need to process data for a day (24 hours) as a
>> > single batch. The application is still streaming in nature: it would end
>> > the batch after a day and start a new batch the next day. At the same
>> time,
>> > I would be able to have early trigger firings every minute as well as
>> drop
>> > any data which is say, 5 mins late. All this within a single day.
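>> >
>> > In windowed-operator terms, that intent would look roughly like the sketch
>> > below (approximate API usage, just to illustrate the combination of batch
>> > boundary, early triggers and allowed lateness):
>> >
>> >   // one day-long window per batch
>> >   windowedOperator.setWindowOption(
>> >       new WindowOption.TimeWindows(Duration.standardDays(1)));
>> >
>> >   // early firings every minute, final firing at the watermark
>> >   windowedOperator.setTriggerOption(TriggerOption.AtWatermark()
>> >       .withEarlyFiringsAtEvery(Duration.standardMinutes(1))
>> >       .discardingFiredPanes());
>> >
>> >   // drop data that arrives more than 5 minutes late
>> >   windowedOperator.setAllowedLateness(Duration.standardMinutes(5));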
>> >
>> > ~ Bhupesh
>> >
>> >
>> >
>> > _______________________________________________________
>> >
>> > Bhupesh Chawda
>> >
>> > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>> >
>> > www.datatorrent.com  |  apex.apache.org
>> >
>> >
>> >
>> > On Tue, Feb 28, 2017 at 9:27 PM, David Yan <da...@gmail.com> wrote:
>> >
>> > > There is a discussion in the Flink mailing list about key-based
>> > watermarks.
>> > > I think it's relevant to our use case here.
>> > > https://lists.apache.org/thread.html/2b90d5b1d5e2654212cfbbcc6510ef
>> > > 424bbafc4fadb164bd5aff9216@%3Cdev.flink.apache.org%3E
>> > >
>> > > David
>> > >
>> > > On Tue, Feb 28, 2017 at 2:13 AM, Bhupesh Chawda <
>> bhupesh@datatorrent.com
>> > >
>> > > wrote:
>> > >
>> > > > Hi David,
>> > > >
>> > > > If using time window does not seem appropriate, we can have another
>> > class
>> > > > which is more suited for such sequential and distinct windows.
>> > Perhaps, a
>> > > > CustomWindow option can be introduced which takes in a window id.
>> The
>> > > > purpose of this window option could be to translate the window id
>> into
>> > > > appropriate timestamps.
>> > > >
>> > > > Another option would be to go with a custom timestampExtractor for
>> such
>> > > > tuples which translates each unique file name to a distinct
>> > timestamp
>> > > > while using time windows in the windowed operator.
>> > > >
>> > > > ~ Bhupesh
>> > > >
>> > > >
>> > > > _______________________________________________________
>> > > >
>> > > > Bhupesh Chawda
>> > > >
>> > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>> > > >
>> > > > www.datatorrent.com  |  apex.apache.org
>> > > >
>> > > >
>> > > >
>> > > > On Tue, Feb 28, 2017 at 12:28 AM, David Yan <da...@gmail.com>
>> > wrote:
>> > > >
>> > > > > I now see your rationale on putting the filename in the window.
>> > > > > As far as I understand, the reasons why the filename is not part
>> of
>> > the
>> > > > key
>> > > > > and the Global Window is not used are:
>> > > > >
>> > > > > 1) The files are processed in sequence, not in parallel
>> > > > > 2) The windowed operator should not keep the state associated with
>> > the
>> > > > file
>> > > > > when the processing of the file is done
>> > > > > 3) The trigger should be fired for the file when a file is done
>> > > > processing.
>> > > > >
>> > > > > However, if the file is just a sequence and has nothing to do with a
>> > > > > timestamp, assigning a timestamp to a file is not an intuitive thing to
>> > > > > do and would just create confusion for the users, especially when it's
>> > > > > used as an example for new users.
>> > > > >
>> > > > > How about having a separate class called SequenceWindow? And
>> perhaps
>> > > > > TimeWindow can inherit from it?
>> > > > >
>> > > > > David
>> > > > >
>> > > > > On Mon, Feb 27, 2017 at 8:58 AM, Thomas Weise <th...@apache.org>
>> > wrote:
>> > > > >
>> > > > > > On Mon, Feb 27, 2017 at 8:50 AM, Bhupesh Chawda <
>> > > > bhupesh@datatorrent.com
>> > > > > >
>> > > > > > wrote:
>> > > > > >
>> > > > > > > I think my comments related to count based windows might be
>> > causing
>> > > > > > > confusion. Let's not discuss count based scenarios for now.
>> > > > > > >
>> > > > > > > Just want to make sure we are on the same page wrt. the "each
>> > file
>> > > > is a
>> > > > > > > batch" use case. As mentioned by Thomas, the each tuple from
>> the
>> > > same
>> > > > > > file
>> > > > > > > has the same timestamp (which is just a sequence number) and
>> that
>> > > > helps
>> > > > > > > keep tuples from each file in a separate window.
>> > > > > > >
>> > > > > >
>> > > > > > Yes, in this case it is a sequence number, but it could be a
>> time
>> > > stamp
>> > > > > > also, depending on the file naming convention. And if it was
>> event
>> > > time
>> > > > > > processing, the watermark would be derived from records within
>> the
>> > > > file.
>> > > > > >
>> > > > > > Agreed, the source should have a mechanism to control the time
>> > stamp
>> > > > > > extraction along with everything else pertaining to the
>> watermark
>> > > > > > generation.
>> > > > > >
>> > > > > >
>> > > > > > > We could also implement a "timestampExtractor" interface to
>> > > identify
>> > > > > the
>> > > > > > > timestamp (sequence number) for a file.
>> > > > > > >
>> > > > > > > ~ Bhupesh
>> > > > > > >
>> > > > > > >
>> > > > > > > _______________________________________________________
>> > > > > > >
>> > > > > > > Bhupesh Chawda
>> > > > > > >
>> > > > > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>> > > > > > >
>> > > > > > > www.datatorrent.com  |  apex.apache.org
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > On Mon, Feb 27, 2017 at 9:52 PM, Thomas Weise <thw@apache.org
>> >
>> > > > wrote:
>> > > > > > >
>> > > > > > > > I don't think this is a use case for count based window.
>> > > > > > > >
>> > > > > > > > We have multiple files that are retrieved in a sequence and
>> > there
>> > > > is
>> > > > > no
>> > > > > > > > knowledge of the number of records per file. The
>> requirement is
>> > > to
>> > > > > > > > aggregate each file separately and emit the aggregate when
>> the
>> > > file
>> > > > > is
>> > > > > > > read
>> > > > > > > > fully. There is no concept of "end of something" for an
>> > > individual
>> > > > > key
>> > > > > > > and
>> > > > > > > > global window isn't applicable.
>> > > > > > > >
>> > > > > > > > However, as already explained and implemented by Bhupesh,
>> this
>> > > can
>> > > > be
>> > > > > > > > solved using watermark and window (in this case the window
>> > > > timestamp
>> > > > > > > isn't
>> > > > > > > > a timestamp, but a file sequence, but that doesn't matter).
>> > > > > > > >
>> > > > > > > > Thomas
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > On Mon, Feb 27, 2017 at 8:05 AM, David Yan <
>> davidyan@gmail.com
>> > >
>> > > > > wrote:
>> > > > > > > >
>> > > > > > > > > I don't think this is the way to go. Global Window only
>> means
>> > > the
>> > > > > > > > timestamp
>> > > > > > > > > does not matter (or that there is no timestamp). It does
>> not
>> > > > > > > necessarily
>> > > > > > > > > mean it's a large batch. Unless there is some notion of
>> event
>> > > > time
>> > > > > > for
>> > > > > > > > each
>> > > > > > > > > file, you don't want to embed the file into the window
>> > itself.
>> > > > > > > > >
>> > > > > > > > > If you want the result broken up by file name, and if the
>> > files
>> > > > are
>> > > > > > to
>> > > > > > > be
>> > > > > > > > > processed in parallel, I think making the file name be
>> part
>> > of
>> > > > the
>> > > > > > key
>> > > > > > > is
>> > > > > > > > > the way to go. I think it's very confusing if we somehow
>> make
>> > > the
>> > > > > > file
>> > > > > > > to
>> > > > > > > > > be part of the window.
>> > > > > > > > >
>> > > > > > > > > For count-based window, it's not implemented yet and
>> you're
>> > > > welcome
>> > > > > > to
>> > > > > > > > add
>> > > > > > > > > that feature. In case of count-based windows, there would
>> be
>> > no
>> > > > > > notion
>> > > > > > > of
>> > > > > > > > > time and you probably only trigger at the end of each
>> window.
>> > > In
>> > > > > the
>> > > > > > > case
>> > > > > > > > > of count-based windows, the watermark only matters for
>> batch
>> > > > since
>> > > > > > you
>> > > > > > > > need
>> > > > > > > > > a way to know when the batch has ended (if the count is
>> 10,
>> > the
>> > > > > > number
>> > > > > > > of
>> > > > > > > > > tuples in the batch is let's say 105, you need a way to
>> end
>> > the
>> > > > > last
>> > > > > > > > window
>> > > > > > > > > with 5 tuples).
>> > > > > > > > >
>> > > > > > > > > David
>> > > > > > > > >
>> > > > > > > > > On Mon, Feb 27, 2017 at 2:41 AM, Bhupesh Chawda <
>> > > > > > > bhupesh@datatorrent.com
>> > > > > > > > >
>> > > > > > > > > wrote:
>> > > > > > > > >
>> > > > > > > > > > Hi David,
>> > > > > > > > > >
>> > > > > > > > > > Thanks for your comments.
>> > > > > > > > > >
>> > > > > > > > > > The wordcount example that I created based on the
>> windowed
>> > > > > operator
>> > > > > > > > does
>> > > > > > > > > > processing of word counts per file (each file as a
>> separate
>> > > > > batch),
>> > > > > > > > i.e.
>> > > > > > > > > > process counts for each file and dump into separate
>> files.
>> > > > > > > > > > As I understand Global window is for one large batch;
>> i.e.
>> > > all
>> > > > > > > incoming
>> > > > > > > > > > data falls into the same batch. This could not be
>> processed
>> > > > using
>> > > > > > > > > > GlobalWindow option as we need more than one window. In
>> > this
>> > > > > > case, I
>> > > > > > > > > > configured the windowed operator to have time windows of
>> > 1ms
>> > > > each
>> > > > > > and
>> > > > > > > > > > passed data for each file with increasing timestamps:
>> > (file1,
>> > > > 1),
>> > > > > > > > (file2,
>> > > > > > > > > > 2) and so on. Is there a better way of handling this
>> > > scenario?
>> > > > > > > > > >
>> > > > > > > > > > Regarding (2 - count based windows), I think there is a
>> > > trigger
>> > > > > > > option
>> > > > > > > > to
>> > > > > > > > > > process count based windows. In case I want to process
>> > every
>> > > > 1000
>> > > > > > > > tuples
>> > > > > > > > > as
>> > > > > > > > > > a batch, I could set the Trigger option to CountTrigger
>> > with
>> > > > the
>> > > > > > > > > > accumulation set to Discarding. Is this correct?
>> > > > > > > > > >
>> > > > > > > > > > I agree that (4. Final Watermark) can be done using
>> Global
>> > > > > window.
>> > > > > > > > > >
>> > > > > > > > > > ​~ Bhupesh​
>> > > > > > > > > >
>> > > > > > > > > > _______________________________________________________
>> > > > > > > > > >
>> > > > > > > > > > Bhupesh Chawda
>> > > > > > > > > >
>> > > > > > > > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>> > > > > > > > > >
>> > > > > > > > > > www.datatorrent.com  |  apex.apache.org
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > > On Mon, Feb 27, 2017 at 12:18 PM, David Yan <
>> > > > davidyan@gmail.com>
>> > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > > I'm worried that we are making the watermark concept
>> too
>> > > > > > > complicated.
>> > > > > > > > > > >
>> > > > > > > > > > > Watermarks should simply just tell you what windows
>> can
>> > be
>> > > > > > > considered
>> > > > > > > > > > > complete.
>> > > > > > > > > > >
>> > > > > > > > > > > Point 2 is basically a count-based window. Watermarks
>> do
>> > > not
>> > > > > > play a
>> > > > > > > > > role
>> > > > > > > > > > > here because the window is always complete at the n-th
>> > > tuple.
>> > > > > > > > > > >
>> > > > > > > > > > > If I understand correctly, point 3 is for batch
>> > processing
>> > > of
>> > > > > > > files.
>> > > > > > > > > > Unless
>> > > > > > > > > > > the files contain timed events, it sounds like
>> this
>> > > can
>> > > > > be
>> > > > > > > > > achieved
>> > > > > > > > > > > with just a Global Window. For signaling EOF, a
>> watermark
>> > > > with
>> > > > > a
>> > > > > > > > > > +infinity
>> > > > > > > > > > > timestamp can be used so that triggers will be fired
>> upon
>> > > > > receipt
>> > > > > > > of
>> > > > > > > > > that
>> > > > > > > > > > > watermark.
>> > > > > > > > > > >
>> > > > > > > > > > > For point 4, just like what I mentioned above, can be
>> > > > achieved
>> > > > > > > with a
>> > > > > > > > > > > watermark with a +infinity timestamp.
>> > > > > > > > > > >
>> > > > > > > > > > > David
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > On Sat, Feb 18, 2017 at 8:04 AM, Bhupesh Chawda <
>> > > > > > > > > bhupesh@datatorrent.com
>> > > > > > > > > > >
>> > > > > > > > > > > wrote:
>> > > > > > > > > > >
>> > > > > > > > > > > > Hi Thomas,
>> > > > > > > > > > > >
>> > > > > > > > > > > > For an input operator which is supposed to generate
>> > > > > watermarks
>> > > > > > > for
>> > > > > > > > > > > > downstream operators, I can think about the
>> following
>> > > > > > watermarks
>> > > > > > > > that
>> > > > > > > > > > the
>> > > > > > > > > > > > operator can emit:
>> > > > > > > > > > > > 1. Time based watermarks (the high watermark / low
>> > > > watermark)
>> > > > > > > > > > > > 2. Number of tuple based watermarks (Every n tuples)
>> > > > > > > > > > > > 3. File based watermarks (Start file, end file)
>> > > > > > > > > > > > 4. Final watermark
>> > > > > > > > > > > >
>> > > > > > > > > > > > File based watermarks seem to be applicable for
>> batch
>> > > (file
>> > > > > > > based)
>> > > > > > > > as
>> > > > > > > > > > > well,
>> > > > > > > > > > > > and hence I thought of looking at these first. Does
>> > this
>> > > > seem
>> > > > > > to
>> > > > > > > be
>> > > > > > > > > in
>> > > > > > > > > > > line
>> > > > > > > > > > > > with the thought process?
>> > > > > > > > > > > >
>> > > > > > > > > > > > ~ Bhupesh
>> > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > > ______________________________
>> > _________________________
>> > > > > > > > > > > >
>> > > > > > > > > > > > Bhupesh Chawda
>> > > > > > > > > > > >
>> > > > > > > > > > > > Software Engineer
>> > > > > > > > > > > >
>> > > > > > > > > > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>> > > > > > > > > > > >
>> > > > > > > > > > > > www.datatorrent.com  |  apex.apache.org
>> > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > > On Thu, Feb 16, 2017 at 10:37 AM, Thomas Weise <
>> > > > > thw@apache.org
>> > > > > > >
>> > > > > > > > > wrote:
>> > > > > > > > > > > >
>> > > > > > > > > > > > > I don't think this should be designed based on a
>> > > > simplistic
>> > > > > > > file
>> > > > > > > > > > > > > input-output scenario. It would be good to
>> include a
>> > > > > stateful
>> > > > > > > > > > > > > transformation based on event time.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > More complex pipelines contain stateful
>> > transformations
>> > > > > that
>> > > > > > > > depend
>> > > > > > > > > > on
>> > > > > > > > > > > > > windowing and watermarks. I think we need a
>> watermark
>> > > > > concept
>> > > > > > > > that
>> > > > > > > > > is
>> > > > > > > > > > > > based
>> > > > > > > > > > > > > on progress in event time (or other monotonic
>> > > increasing
>> > > > > > > > sequence)
>> > > > > > > > > > that
>> > > > > > > > > > > > > other operators can generically work with.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Note that even file input in many cases can
>> produce
>> > > time
>> > > > > > based
>> > > > > > > > > > > > watermarks,
>> > > > > > > > > > > > > for example when you read part files that are
>> bound
>> > by
>> > > > > event
>> > > > > > > > time.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Thanks,
>> > > > > > > > > > > > > Thomas
>> > > > > > > > > > > > >
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > On Wed, Feb 15, 2017 at 4:02 AM, Bhupesh Chawda <
>> > > > > > > > > > > bhupesh@datatorrent.com
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > > For better understanding the use case for
>> control
>> > > > tuples
>> > > > > in
>> > > > > > > > > batch,
>> > > > > > > > > > ​I
>> > > > > > > > > > > > am
>> > > > > > > > > > > > > > creating a prototype for a batch application
>> using
>> > > File
>> > > > > > Input
>> > > > > > > > and
>> > > > > > > > > > > File
>> > > > > > > > > > > > > > Output operators.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > To enable basic batch processing for File IO
>> > > > operators, I
>> > > > > > am
>> > > > > > > > > > > proposing
>> > > > > > > > > > > > > the
>> > > > > > > > > > > > > > following changes to File input and output
>> > operators:
>> > > > > > > > > > > > > > 1. File Input operator emits a watermark each
>> time
>> > it
>> > > > > opens
>> > > > > > > and
>> > > > > > > > > > > closes
>> > > > > > > > > > > > a
>> > > > > > > > > > > > > > file. These can be "start file" and "end file"
>> > > > watermarks
>> > > > > > > which
>> > > > > > > > > > > include
>> > > > > > > > > > > > > the
>> > > > > > > > > > > > > > corresponding file names. The "start file" tuple
>> > > should
>> > > > > be
>> > > > > > > sent
>> > > > > > > > > > > before
>> > > > > > > > > > > > > any
>> > > > > > > > > > > > > > of the data from that file flows.
>> > > > > > > > > > > > > > 2. File Input operator can be configured to end
>> the
>> > > > > > > application
>> > > > > > > > > > > after a
>> > > > > > > > > > > > > > single or n scans of the directory (a batch).
>> This
>> > is
>> > > > > where
>> > > > > > > the
>> > > > > > > > > > > > operator
>> > > > > > > > > > > > > > emits the final watermark (the end of
>> application
>> > > > control
>> > > > > > > > tuple).
>> > > > > > > > > > > This
>> > > > > > > > > > > > > will
>> > > > > > > > > > > > > > also shutdown the application.
>> > > > > > > > > > > > > > 3. The File output operator handles these
>> control
>> > > > tuples.
>> > > > > > > > "Start
>> > > > > > > > > > > file"
>> > > > > > > > > > > > > > initializes the file name for the incoming
>> tuples.
>> > > "End
>> > > > > > file"
>> > > > > > > > > > > watermark
>> > > > > > > > > > > > > > forces a finalize on that file.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > The user would be able to enable the operators
>> to
>> > > send
>> > > > > only
>> > > > > > > > those
>> > > > > > > > > > > > > > watermarks that are needed in the application.
>> If
>> > > none
>> > > > of
>> > > > > > the
>> > > > > > > > > > options
>> > > > > > > > > > > > are
>> > > > > > > > > > > > > > configured, the operators behave as in a
>> streaming
>> > > > > > > application.
>> > > > > > > > > > > > > >

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by Bhupesh Chawda <bh...@datatorrent.com>.
Hi Thomas,

Even though the windowing operator is not just "event time", it seems to
depend too heavily on the "time" attribute of the incoming tuple. This is
the reason we had to model the file index as a timestamp to solve the batch
case for files.
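
As a rough illustration of that mapping (the class and method names below are
only placeholders, not existing Malhar APIs), the file index can be as simple
as a monotonically increasing number assigned per file name and used as the
window "timestamp" for every tuple read from that file:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: each distinct file name gets a monotonically increasing
// sequence number, which the windowed operator can treat as a timestamp so
// that all tuples of one file land in the same window.
public class FileSequenceTimestampExtractor
{
  private final AtomicLong nextSequence = new AtomicLong();
  private final Map<String, Long> fileToSequence = new ConcurrentHashMap<>();

  public long extractTimestamp(String fileName)
  {
    return fileToSequence.computeIfAbsent(fileName, f -> nextSequence.getAndIncrement());
  }
}
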
Perhaps we should work on broadening the scope of the windowed operator to
support other types of windows as well. The SequenceWindow option suggested
by David seems to be a step in that direction.
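
Purely as a sketch of that direction (again a made-up name, nothing that
exists in the library today), such a window would be identified by an
arbitrary monotonically increasing sequence instead of a begin time and
duration:

// Hypothetical sketch: a window keyed by a sequence number (for example a
// file index) rather than by a timestamp range.
public class SequenceWindow
{
  private final long sequence;

  public SequenceWindow(long sequence)
  {
    this.sequence = sequence;
  }

  public long getSequence()
  {
    return sequence;
  }

  @Override
  public boolean equals(Object o)
  {
    return o instanceof SequenceWindow && ((SequenceWindow)o).sequence == sequence;
  }

  @Override
  public int hashCode()
  {
    return Long.hashCode(sequence);
  }
}

A time-based window could then remain the special case where the sequence is
derived from event time.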

~ Bhupesh


_______________________________________________________

Bhupesh Chawda

E: bhupesh@datatorrent.com | Twitter: @bhupeshsc

www.datatorrent.com  |  apex.apache.org



On Tue, Feb 28, 2017 at 10:48 PM, Thomas Weise <th...@apache.org> wrote:

> That's correct; we are looking at a generalized approach for state
> management vs. a series of special cases.
>
> And to be clear, windowing does not imply event time; otherwise it would be
> called "EventTimeOperator" :-)
>
> Thomas
>
> On Tue, Feb 28, 2017 at 9:11 AM, Bhupesh Chawda <bh...@datatorrent.com>
> wrote:
>
> > Hi David,
> >
> > I went through the discussion, but it seems to be more about event-time
> > watermark handling than about batches. What we are trying to do
> is
> > have watermarks serve the purpose of demarcating batches using control
> > tuples. Since each batch is separate from others, we would like to have
> > stateful processing within a batch, but not across batches.
> > At the same time, we would like to do this in a manner which is
> consistent
> > with the windowing mechanism provided by the windowed operator. This will
> > allow us to treat a single batch as a (bounded) stream and apply all the
> > event time windowing concepts in that time span.
> >
> > For example, let's say I need to process data for a day (24 hours) as a
> > single batch. The application is still streaming in nature: it would end
> > the batch after a day and start a new batch the next day. At the same
> time,
> > I would be able to have early trigger firings every minute as well as
> drop
> > any data which is, say, 5 minutes late. All this within a single day.
> >
> > ~ Bhupesh
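
To make the 24-hour example above concrete, here is a minimal sketch (plain
Java with made-up names, not tied to any particular operator or existing API)
of how the batch boundary, the early firings, and the allowed lateness
interact:

import java.time.Duration;

// Sketch only: "one day = one batch" expressed as ordinary windowing with
// early triggers and allowed lateness.
public class DayBatchPolicy
{
  static final long WINDOW_MS = Duration.ofHours(24).toMillis();
  static final long EARLY_TRIGGER_MS = Duration.ofMinutes(1).toMillis();
  static final long ALLOWED_LATENESS_MS = Duration.ofMinutes(5).toMillis();

  /** Start of the day-long window ("batch") an event time falls into. */
  static long windowStart(long eventTimeMs)
  {
    return eventTimeMs - (eventTimeMs % WINDOW_MS);
  }

  /** Drop a tuple whose window closed more than the allowed lateness ago. */
  static boolean isTooLate(long eventTimeMs, long watermarkMs)
  {
    return watermarkMs > windowStart(eventTimeMs) + WINDOW_MS + ALLOWED_LATENESS_MS;
  }

  /** Emit an early result once a minute of processing time while the window is open. */
  static boolean shouldFireEarly(long lastFiringMs, long nowMs)
  {
    return nowMs - lastFiringMs >= EARLY_TRIGGER_MS;
  }
}
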
> >
> >
> >
> > _______________________________________________________
> >
> > Bhupesh Chawda
> >
> > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> >
> > www.datatorrent.com  |  apex.apache.org
> >
> >
> >
> > On Tue, Feb 28, 2017 at 9:27 PM, David Yan <da...@gmail.com> wrote:
> >
> > > There is a discussion in the Flink mailing list about key-based
> > watermarks.
> > > I think it's relevant to our use case here.
> > > https://lists.apache.org/thread.html/2b90d5b1d5e2654212cfbbcc6510ef
> > > 424bbafc4fadb164bd5aff9216@%3Cdev.flink.apache.org%3E
> > >
> > > David
> > >
> > > On Tue, Feb 28, 2017 at 2:13 AM, Bhupesh Chawda <
> bhupesh@datatorrent.com
> > >
> > > wrote:
> > >
> > > > Hi David,
> > > >
> > > > If using time window does not seem appropriate, we can have another
> > class
> > > > which is more suited for such sequential and distinct windows.
> > Perhaps, a
> > > > CustomWindow option can be introduced which takes in a window id. The
> > > > purpose of this window option could be to translate the window id
> into
> > > > appropriate timestamps.
> > > >
> > > > Another option would be to go with a custom timestampExtractor for
> such
> > > > tuples which translates each unique file name to a distinct
> > timestamp
> > > > while using time windows in the windowed operator.
> > > >
> > > > ~ Bhupesh
> > > >
> > > >
> > > > _______________________________________________________
> > > >
> > > > Bhupesh Chawda
> > > >
> > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> > > >
> > > > www.datatorrent.com  |  apex.apache.org
> > > >
> > > >
> > > >
> > > > On Tue, Feb 28, 2017 at 12:28 AM, David Yan <da...@gmail.com>
> > wrote:
> > > >
> > > > > I now see your rationale on putting the filename in the window.
> > > > > As far as I understand, the reasons why the filename is not part of
> > the
> > > > key
> > > > > and the Global Window is not used are:
> > > > >
> > > > > 1) The files are processed in sequence, not in parallel
> > > > > 2) The windowed operator should not keep the state associated with
> > the
> > > > file
> > > > > when the processing of the file is done
> > > > > 3) The trigger should be fired for the file when a file is done
> > > > processing.
> > > > >
> > > > > However, if the file is just a sequence that has nothing to do with a
> > > > timestamp,
> > > > > assigning a timestamp to a file is not an intuitive thing to do and
> > > would
> > > > just create confusion for users, especially when it's used as
> an
> > > > > example for new users.
> > > > >
> > > > > How about having a separate class called SequenceWindow? And
> perhaps
> > > > > TimeWindow can inherit from it?
> > > > >
> > > > > David
> > > > >
> > > > > On Mon, Feb 27, 2017 at 8:58 AM, Thomas Weise <th...@apache.org>
> > wrote:
> > > > >
> > > > > > On Mon, Feb 27, 2017 at 8:50 AM, Bhupesh Chawda <
> > > > bhupesh@datatorrent.com
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > I think my comments related to count based windows might be
> > causing
> > > > > > > confusion. Let's not discuss count based scenarios for now.
> > > > > > >
> > > > > > > Just want to make sure we are on the same page wrt. the "each
> > file
> > > > is a
> > > > > > > batch" use case. As mentioned by Thomas, the each tuple from
> the
> > > same
> > > > > > file
> > > > > > > has the same timestamp (which is just a sequence number) and
> that
> > > > helps
> > > > > > > keep tuples from each file in a separate window.
> > > > > > >
> > > > > >
> > > > > > Yes, in this case it is a sequence number, but it could be a time
> > > stamp
> > > > > > also, depending on the file naming convention. And if it was
> event
> > > time
> > > > > > processing, the watermark would be derived from records within
> the
> > > > file.
> > > > > >
> > > > > > Agreed, the source should have a mechanism to control the time
> > stamp
> > > > > > extraction along with everything else pertaining to the watermark
> > > > > > generation.
> > > > > >
> > > > > >
> > > > > > > We could also implement a "timestampExtractor" interface to
> > > identify
> > > > > the
> > > > > > > timestamp (sequence number) for a file.
> > > > > > >
> > > > > > > ~ Bhupesh
> > > > > > >
> > > > > > >
> > > > > > > _______________________________________________________
> > > > > > >
> > > > > > > Bhupesh Chawda
> > > > > > >
> > > > > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> > > > > > >
> > > > > > > www.datatorrent.com  |  apex.apache.org
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Feb 27, 2017 at 9:52 PM, Thomas Weise <th...@apache.org>
> > > > wrote:
> > > > > > >
> > > > > > > > I don't think this is a use case for count based window.
> > > > > > > >
> > > > > > > > We have multiple files that are retrieved in a sequence and
> > there
> > > > is
> > > > > no
> > > > > > > > knowledge of the number of records per file. The requirement
> is
> > > to
> > > > > > > > aggregate each file separately and emit the aggregate when
> the
> > > file
> > > > > is
> > > > > > > read
> > > > > > > > fully. There is no concept of "end of something" for an
> > > individual
> > > > > key
> > > > > > > and
> > > > > > > > global window isn't applicable.
> > > > > > > >
> > > > > > > > However, as already explained and implemented by Bhupesh,
> this
> > > can
> > > > be
> > > > > > > > solved using watermark and window (in this case the window
> > > > timestamp
> > > > > > > isn't
> > > > > > > > a timestamp, but a file sequence, but that doesn't matter.
> > > > > > > >
> > > > > > > > Thomas
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Feb 27, 2017 at 8:05 AM, David Yan <
> davidyan@gmail.com
> > >
> > > > > wrote:
> > > > > > > >
> > > > > > > > > I don't think this is the way to go. Global Window only
> means
> > > the
> > > > > > > > timestamp
> > > > > > > > > does not matter (or that there is no timestamp). It does
> not
> > > > > > > necessarily
> > > > > > > > > mean it's a large batch. Unless there is some notion of
> event
> > > > time
> > > > > > for
> > > > > > > > each
> > > > > > > > > file, you don't want to embed the file into the window
> > itself.
> > > > > > > > >
> > > > > > > > > If you want the result broken up by file name, and if the
> > files
> > > > are
> > > > > > to
> > > > > > > be
> > > > > > > > > processed in parallel, I think making the file name be part
> > of
> > > > the
> > > > > > key
> > > > > > > is
> > > > > > > > > the way to go. I think it's very confusing if we somehow
> make
> > > the
> > > > > > file
> > > > > > > to
> > > > > > > > > be part of the window.
> > > > > > > > >
> > > > > > > > > For count-based window, it's not implemented yet and you're
> > > > welcome
> > > > > > to
> > > > > > > > add
> > > > > > > > > that feature. In case of count-based windows, there would
> be
> > no
> > > > > > notion
> > > > > > > of
> > > > > > > > > time and you probably only trigger at the end of each
> window.
> > > In
> > > > > the
> > > > > > > case
> > > > > > > > > of count-based windows, the watermark only matters for
> batch
> > > > since
> > > > > > you
> > > > > > > > need
> > > > > > > > > a way to know when the batch has ended (if the count is 10,
> > the
> > > > > > number
> > > > > > > of
> > > > > > > > > tuples in the batch is let's say 105, you need a way to end
> > the
> > > > > last
> > > > > > > > window
> > > > > > > > > with 5 tuples).
> > > > > > > > >
> > > > > > > > > David
> > > > > > > > >
> > > > > > > > > On Mon, Feb 27, 2017 at 2:41 AM, Bhupesh Chawda <
> > > > > > > bhupesh@datatorrent.com
> > > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi David,
> > > > > > > > > >
> > > > > > > > > > Thanks for your comments.
> > > > > > > > > >
> > > > > > > > > > The wordcount example that I created based on the
> windowed
> > > > > operator
> > > > > > > > does
> > > > > > > > > > processing of word counts per file (each file as a
> separate
> > > > > batch),
> > > > > > > > i.e.
> > > > > > > > > > process counts for each file and dump into separate
> files.
> > > > > > > > > > As I understand Global window is for one large batch;
> i.e.
> > > all
> > > > > > > incoming
> > > > > > > > > > data falls into the same batch. This could not be
> processed
> > > > using
> > > > > > > > > > GlobalWindow option as we need more than one window. In
> > this
> > > > > > case, I
> > > > > > > > > > configured the windowed operator to have time windows of
> > 1ms
> > > > each
> > > > > > and
> > > > > > > > > > passed data for each file with increasing timestamps:
> > (file1,
> > > > 1),
> > > > > > > > (file2,
> > > > > > > > > > 2) and so on. Is there a better way of handling this
> > > scenario?
> > > > > > > > > >
> > > > > > > > > > Regarding (2 - count based windows), I think there is a
> > > trigger
> > > > > > > option
> > > > > > > > to
> > > > > > > > > > process count based windows. In case I want to process
> > every
> > > > 1000
> > > > > > > > tuples
> > > > > > > > > as
> > > > > > > > > > a batch, I could set the Trigger option to CountTrigger
> > with
> > > > the
> > > > > > > > > > accumulation set to Discarding. Is this correct?
> > > > > > > > > >
> > > > > > > > > > I agree that (4. Final Watermark) can be done using
> Global
> > > > > window.
> > > > > > > > > >
> > > > > > > > > > ​~ Bhupesh​
> > > > > > > > > >
> > > > > > > > > > _______________________________________________________
> > > > > > > > > >
> > > > > > > > > > Bhupesh Chawda
> > > > > > > > > >
> > > > > > > > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> > > > > > > > > >
> > > > > > > > > > www.datatorrent.com  |  apex.apache.org
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Mon, Feb 27, 2017 at 12:18 PM, David Yan <
> > > > davidyan@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > I'm worried that we are making the watermark concept
> too
> > > > > > > complicated.
> > > > > > > > > > >
> > > > > > > > > > > Watermarks should simply just tell you what windows can
> > be
> > > > > > > considered
> > > > > > > > > > > complete.
> > > > > > > > > > >
> > > > > > > > > > > Point 2 is basically a count-based window. Watermarks
> do
> > > not
> > > > > > play a
> > > > > > > > > role
> > > > > > > > > > > here because the window is always complete at the n-th
> > > tuple.
> > > > > > > > > > >
> > > > > > > > > > > If I understand correctly, point 3 is for batch
> > processing
> > > of
> > > > > > > files.
> > > > > > > > > > Unless
> > > > > > > > > > > the files contain timed events, it sounds to be that
> this
> > > can
> > > > > be
> > > > > > > > > achieved
> > > > > > > > > > > with just a Global Window. For signaling EOF, a
> watermark
> > > > with
> > > > > a
> > > > > > > > > > +infinity
> > > > > > > > > > > timestamp can be used so that triggers will be fired
> upon
> > > > > receipt
> > > > > > > of
> > > > > > > > > that
> > > > > > > > > > > watermark.
> > > > > > > > > > >
> > > > > > > > > > > For point 4, just like what I mentioned above, can be
> > > > achieved
> > > > > > > with a
> > > > > > > > > > > watermark with a +infinity timestamp.
> > > > > > > > > > >
> > > > > > > > > > > David
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Sat, Feb 18, 2017 at 8:04 AM, Bhupesh Chawda <
> > > > > > > > > bhupesh@datatorrent.com
> > > > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Thomas,
> > > > > > > > > > > >
> > > > > > > > > > > > For an input operator which is supposed to generate
> > > > > watermarks
> > > > > > > for
> > > > > > > > > > > > downstream operators, I can think about the following
> > > > > > watermarks
> > > > > > > > that
> > > > > > > > > > the
> > > > > > > > > > > > operator can emit:
> > > > > > > > > > > > 1. Time based watermarks (the high watermark / low
> > > > watermark)
> > > > > > > > > > > > 2. Number of tuple based watermarks (Every n tuples)
> > > > > > > > > > > > 3. File based watermarks (Start file, end file)
> > > > > > > > > > > > 4. Final watermark
> > > > > > > > > > > >
> > > > > > > > > > > > File based watermarks seem to be applicable for batch
> > > (file
> > > > > > > based)
> > > > > > > > as
> > > > > > > > > > > well,
> > > > > > > > > > > > and hence I thought of looking at these first. Does
> > this
> > > > seem
> > > > > > to
> > > > > > > be
> > > > > > > > > in
> > > > > > > > > > > line
> > > > > > > > > > > > with the thought process?
> > > > > > > > > > > >
> > > > > > > > > > > > ~ Bhupesh
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > ______________________________
> > _________________________
> > > > > > > > > > > >
> > > > > > > > > > > > Bhupesh Chawda
> > > > > > > > > > > >
> > > > > > > > > > > > Software Engineer
> > > > > > > > > > > >
> > > > > > > > > > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> > > > > > > > > > > >
> > > > > > > > > > > > www.datatorrent.com  |  apex.apache.org
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Feb 16, 2017 at 10:37 AM, Thomas Weise <
> > > > > thw@apache.org
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > I don't think this should be designed based on a
> > > > simplistic
> > > > > > > file
> > > > > > > > > > > > > input-output scenario. It would be good to include
> a
> > > > > stateful
> > > > > > > > > > > > > transformation based on event time.
> > > > > > > > > > > > >
> > > > > > > > > > > > > More complex pipelines contain stateful
> > transformations
> > > > > that
> > > > > > > > depend
> > > > > > > > > > on
> > > > > > > > > > > > > windowing and watermarks. I think we need a
> watermark
> > > > > concept
> > > > > > > > that
> > > > > > > > > is
> > > > > > > > > > > > based
> > > > > > > > > > > > > on progress in event time (or other monotonic
> > > increasing
> > > > > > > > sequence)
> > > > > > > > > > that
> > > > > > > > > > > > > other operators can generically work with.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Note that even file input in many cases can produce
> > > time
> > > > > > based
> > > > > > > > > > > > watermarks,
> > > > > > > > > > > > > for example when you read part files that are bound
> > by
> > > > > event
> > > > > > > > time.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Thomas
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Feb 15, 2017 at 4:02 AM, Bhupesh Chawda <
> > > > > > > > > > > bhupesh@datatorrent.com
> > > > > > > > > > > > >
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > For better understanding the use case for control
> > > > tuples
> > > > > in
> > > > > > > > > batch,
> > > > > > > > > > ​I
> > > > > > > > > > > > am
> > > > > > > > > > > > > > creating a prototype for a batch application
> using
> > > File
> > > > > > Input
> > > > > > > > and
> > > > > > > > > > > File
> > > > > > > > > > > > > > Output operators.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > To enable basic batch processing for File IO
> > > > operators, I
> > > > > > am
> > > > > > > > > > > proposing
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > following changes to File input and output
> > operators:
> > > > > > > > > > > > > > 1. File Input operator emits a watermark each
> time
> > it
> > > > > opens
> > > > > > > and
> > > > > > > > > > > closes
> > > > > > > > > > > > a
> > > > > > > > > > > > > > file. These can be "start file" and "end file"
> > > > watermarks
> > > > > > > which
> > > > > > > > > > > include
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > corresponding file names. The "start file" tuple
> > > should
> > > > > be
> > > > > > > sent
> > > > > > > > > > > before
> > > > > > > > > > > > > any
> > > > > > > > > > > > > > of the data from that file flows.
> > > > > > > > > > > > > > 2. File Input operator can be configured to end
> the
> > > > > > > application
> > > > > > > > > > > after a
> > > > > > > > > > > > > > single or n scans of the directory (a batch).
> This
> > is
> > > > > where
> > > > > > > the
> > > > > > > > > > > > operator
> > > > > > > > > > > > > > emits the final watermark (the end of application
> > > > control
> > > > > > > > tuple).
> > > > > > > > > > > This
> > > > > > > > > > > > > will
> > > > > > > > > > > > > > also shutdown the application.
> > > > > > > > > > > > > > 3. The File output operator handles these control
> > > > tuples.
> > > > > > > > "Start
> > > > > > > > > > > file"
> > > > > > > > > > > > > > initializes the file name for the incoming
> tuples.
> > > "End
> > > > > > file"
> > > > > > > > > > > watermark
> > > > > > > > > > > > > > forces a finalize on that file.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The user would be able to enable the operators to
> > > send
> > > > > only
> > > > > > > > those
> > > > > > > > > > > > > > watermarks that are needed in the application. If
> > > none
> > > > of
> > > > > > the
> > > > > > > > > > options
> > > > > > > > > > > > are
> > > > > > > > > > > > > > configured, the operators behave as in a
> streaming
> > > > > > > application.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > There are a few challenges in the implementation
> > > where
> > > > > the
> > > > > > > > input
> > > > > > > > > > > > operator
> > > > > > > > > > > > > > is partitioned. In this case, the correlation
> > between
> > > > the
> > > > > > > > > start/end
> > > > > > > > > > > > for a
> > > > > > > > > > > > > > file and the data tuples for that file is lost.
> > Hence
> > > > we
> > > > > > need
> > > > > > > > to
> > > > > > > > > > > > maintain
> > > > > > > > > > > > > > the filename as part of each tuple in the
> pipeline.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The "start file" and "end file" control tuples in
> > > this
> > > > > > > example
> > > > > > > > > are
> > > > > > > > > > > > > > temporary names for watermarks. We can have
> generic
> > > > > "start
> > > > > > > > > batch" /
> > > > > > > > > > > > "end
> > > > > > > > > > > > > > batch" tuples which could be used for other use
> > cases
> > > > as
> > > > > > > well.
> > > > > > > > > The
> > > > > > > > > > > > Final
> > > > > > > > > > > > > > watermark is common and serves the same purpose
> in
> > > each
> > > > > > case.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Please let me know your thoughts on this.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > ~ Bhupesh
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Jan 18, 2017 at 12:22 AM, Bhupesh Chawda
> <
> > > > > > > > > > > > > bhupesh@datatorrent.com>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Yes, this can be part of operator
> configuration.
> > > > Given
> > > > > > > this,
> > > > > > > > > for
> > > > > > > > > > a
> > > > > > > > > > > > user
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > define a batch application, would mean
> > configuring
> > > > the
> > > > > > > > > connectors
> > > > > > > > > > > > > (mostly
> > > > > > > > > > > > > > > the input operator) in the application for the
> > > > desired
> > > > > > > > > behavior.
> > > > > > > > > > > > > > Similarly,
> > > > > > > > > > > > > > > there can be other use cases that can be
> achieved
> > > > other
> > > > > > > than
> > > > > > > > > > batch.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > We may also need to take care of the following:
> > > > > > > > > > > > > > > 1. Make sure that the watermarks or control
> > tuples
> > > > are
> > > > > > > > > consistent
> > > > > > > > > > > > > across
> > > > > > > > > > > > > > > sources. Meaning an HDFS sink should be able to
> > > > > interpret
> > > > > > > the
> > > > > > > > > > > > watermark
> > > > > > > > > > > > > > > tuple sent out by, say, a JDBC source.
> > > > > > > > > > > > > > > 2. In addition to I/O connectors, we should
> also
> > > look
> > > > > at
> > > > > > > the
> > > > > > > > > need
> > > > > > > > > > > for
> > > > > > > > > > > > > > > processing operators to understand some of the
> > > > control
> > > > > > > > tuples /
> > > > > > > > > > > > > > watermarks.
> > > > > > > > > > > > > > > For example, we may want to reset the operator
> > > > behavior
> > > > > > on
> > > > > > > > > > arrival
> > > > > > > > > > > of
> > > > > > > > > > > > > > some
> > > > > > > > > > > > > > > watermark tuple.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > ~ Bhupesh
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Tue, Jan 17, 2017 at 9:59 PM, Thomas Weise <
> > > > > > > > thw@apache.org>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >> The HDFS source can operate in two modes,
> > bounded
> > > or
> > > > > > > > > unbounded.
> > > > > > > > > > If
> > > > > > > > > > > > you
> > > > > > > > > > > > > > >> scan
> > > > > > > > > > > > > > >> only once, then it should emit the final
> > watermark
> > > > > after
> > > > > > > it
> > > > > > > > is
> > > > > > > > > > > done.
> > > > > > > > > > > > > > >> Otherwise it would emit watermarks based on a
> > > policy
> > > > > > > (files
> > > > > > > > > > names
> > > > > > > > > > > > > etc.).
> > > > > > > > > > > > > > >> The mechanism to generate the marks may depend
> > on
> > > > the
> > > > > > type
> > > > > > > > of
> > > > > > > > > > > source
> > > > > > > > > > > > > and
> > > > > > > > > > > > > > >> the user needs to be able to
> influence/configure
> > > it.
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> Thomas
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> On Tue, Jan 17, 2017 at 5:03 AM, Bhupesh
> Chawda
> > <
> > > > > > > > > > > > > > bhupesh@datatorrent.com>
> > > > > > > > > > > > > > >> wrote:
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> > Hi Thomas,
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >> > I am not sure that I completely understand
> > your
> > > > > > > > suggestion.
> > > > > > > > > > Are
> > > > > > > > > > > > you
> > > > > > > > > > > > > > >> > suggesting to broaden the scope of the
> > proposal
> > > to
> > > > > > treat
> > > > > > > > all
> > > > > > > > > > > > sources
> > > > > > > > > > > > > > as
> > > > > > > > > > > > > > >> > bounded as well as unbounded?
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >> > In case of Apex, we treat all sources as
> > > unbounded
> > > > > > > > sources.
> > > > > > > > > > Even
> > > > > > > > > > > > > > bounded
> > > > > > > > > > > > > > >> > sources like HDFS file source is treated as
> > > > > unbounded
> > > > > > by
> > > > > > > > > means
> > > > > > > > > > > of
> > > > > > > > > > > > > > >> scanning
> > > > > > > > > > > > > > >> > the input directory repeatedly.
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >> > Let's consider HDFS file source for example:
> > > > > > > > > > > > > > >> > In this case, if we treat it as a bounded
> > > source,
> > > > we
> > > > > > can
> > > > > > > > > > define
> > > > > > > > > > > > > hooks
> > > > > > > > > > > > > > >> which
> > > > > > > > > > > > > > >> > allows us to detect the end of the file and
> > send
> > > > the
> > > > > > > > "final
> > > > > > > > > > > > > > watermark".
> > > > > > > > > > > > > > >> We
> > > > > > > > > > > > > > >> > could also consider HDFS file source as a
> > > > streaming
> > > > > > > source
> > > > > > > > > and
> > > > > > > > > > > > > define
> > > > > > > > > > > > > > >> hooks
> > > > > > > > > > > > > > >> > which send watermarks based on different
> kinds
> > > of
> > > > > > > windows.
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >> > Please correct me if I misunderstand.
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >> > ~ Bhupesh
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >> > On Mon, Jan 16, 2017 at 9:23 PM, Thomas
> Weise
> > <
> > > > > > > > > thw@apache.org
> > > > > > > > > > >
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >> > > Bhupesh,
> > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > >> > > Please see how that can be solved in a
> > unified
> > > > way
> > > > > > > using
> > > > > > > > > > > windows
> > > > > > > > > > > > > and
> > > > > > > > > > > > > > >> > > watermarks. It is bounded data vs.
> unbounded
> > > > data.
> > > > > > In
> > > > > > > > Beam
> > > > > > > > > > for
> > > > > > > > > > > > > > >> example,
> > > > > > > > > > > > > > >> > you
> > > > > > > > > > > > > > >> > > can use the "global window" and the final
> > > > > watermark
> > > > > > to
> > > > > > > > > > > > accomplish
> > > > > > > > > > > > > > what
> > > > > > > > > > > > > > >> > you
> > > > > > > > > > > > > > >> > > are looking for. Batch is just a special
> > case
> > > of
> > > > > > > > streaming
> > > > > > > > > > > where
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > >> > source
> > > > > > > > > > > > > > >> > > emits the final watermark.
> > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > >> > > Thanks,
> > > > > > > > > > > > > > >> > > Thomas
> > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > >> > > On Mon, Jan 16, 2017 at 1:02 AM, Bhupesh
> > > Chawda
> > > > <
> > > > > > > > > > > > > > >> bhupesh@datatorrent.com
> > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > >> > > wrote:
> > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > >> > > > Yes, if the user needs to develop a
> batch
> > > > > > > application,
> > > > > > > > > > then
> > > > > > > > > > > > > batch
> > > > > > > > > > > > > > >> aware
> > > > > > > > > > > > > > >> > > > operators need to be used in the
> > > application.
> > > > > > > > > > > > > > >> > > > The nature of the application is mostly
> > > > > controlled
> > > > > > > by
> > > > > > > > > the
> > > > > > > > > > > > input
> > > > > > > > > > > > > > and
> > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > >> > > > output operators used in the
> application.
> > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > >> > > > For example, consider an application
> which
> > > > needs
> > > > > > to
> > > > > > > > > filter
> > > > > > > > > > > > > records
> > > > > > > > > > > > > > >> in a
> > > > > > > > > > > > > > >> > > > input file and store the filtered
> records
> > in
> > > > > > another
> > > > > > > > > file.
> > > > > > > > > > > The
> > > > > > > > > > > > > > >> nature
> > > > > > > > > > > > > > >> > of
> > > > > > > > > > > > > > >> > > > this app is to end once the entire file
> is
> > > > > > > processed.
> > > > > > > > > > > > Following
> > > > > > > > > > > > > > >> things
> > > > > > > > > > > > > > >> > > are
> > > > > > > > > > > > > > >> > > > expected of the application:
> > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > >> > > >    1. Once the input data is over,
> > finalize
> > > > the
> > > > > > > output
> > > > > > > > > > file
> > > > > > > > > > > > from
> > > > > > > > > > > > > > >> .tmp
> > > > > > > > > > > > > > >> > > >    files. - Responsibility of output
> > > operator
> > > > > > > > > > > > > > >> > > >    2. End the application, once the data
> > is
> > > > read
> > > > > > and
> > > > > > > > > > > > processed -
> > > > > > > > > > > > > > >> > > >    Responsibility of input operator
> > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > >> > > > These functions are essential to allow
> the
> > > > user
> > > > > to
> > > > > > > do
> > > > > > > > > > higher
> > > > > > > > > > > > > level
> > > > > > > > > > > > > > >> > > > operations like scheduling or running a
> > > > workflow
> > > > > > of
> > > > > > > > > batch
> > > > > > > > > > > > > > >> applications.
> > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > >> > > > I am not sure about intermediate
> > > (processing)
> > > > > > > > operators,
> > > > > > > > > > as
> > > > > > > > > > > > > there
> > > > > > > > > > > > > > >> is no
> > > > > > > > > > > > > > >> > > > change in their functionality for batch
> > use
> > > > > cases.
> > > > > > > > > > Perhaps,
> > > > > > > > > > > > > > allowing
> > > > > > > > > > > > > > >> > > > running multiple batches in a single
> > > > application
> > > > > > may
> > > > > > > > > > require
> > > > > > > > > > > > > > similar
> > > > > > > > > > > > > > >> > > > changes in processing operators as well.
> > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > >> > > > ~ Bhupesh
> > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > >> > > > On Mon, Jan 16, 2017 at 2:19 PM,
> Priyanka
> > > > > Gugale <
> > > > > > > > > > > > > > priyag@apache.org
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >> > > > wrote:
> > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > >> > > > > Will it make an impression on user
> that,
> > > if
> > > > he
> > > > > > > has a
> > > > > > > > > > batch
> > > > > > > > > > > > > > >> usecase he
> > > > > > > > > > > > > > >> > > has
> > > > > > > > > > > > > > >> > > > > to use batch aware operators only? If
> > so,
> > > is
> > > > > > that
> > > > > > > > what
> > > > > > > > > > we
> > > > > > > > > > > > > > expect?
> > > > > > > > > > > > > > >> I
> > > > > > > > > > > > > > >> > am
> > > > > > > > > > > > > > >> > > > not
> > > > > > > > > > > > > > >> > > > > aware of how do we implement batch
> > > scenario
> > > > so
> > > > > > > this
> > > > > > > > > > might
> > > > > > > > > > > > be a
> > > > > > > > > > > > > > >> basic
> > > > > > > > > > > > > > >> > > > > question.
> > > > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > > > >> > > > > -Priyanka
> > > > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > > > >> > > > > On Mon, Jan 16, 2017 at 12:02 PM,
> > Bhupesh
> > > > > > Chawda <
> > > > > > > > > > > > > > >> > > > bhupesh@datatorrent.com>
> > > > > > > > > > > > > > >> > > > > wrote:
> > > > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > > > >> > > > > > Hi All,
> > > > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > > > >> > > > > > While design / implementation for
> > custom
> > > > > > control
> > > > > > > > > > tuples
> > > > > > > > > > > is
> > > > > > > > > > > > > > >> > ongoing, I
> > > > > > > > > > > > > > >> > > > > > thought it would be a good idea to
> > > > consider
> > > > > > its
> > > > > > > > > > > usefulness
> > > > > > > > > > > > > in
> > > > > > > > > > > > > > >> one
> > > > > > > > > > > > > > >> > of
> > > > > > > > > > > > > > >> > > > the
> > > > > > > > > > > > > > >> > > > > > use cases -  batch applications.
> > > > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > > > >> > > > > > This is a proposal to adapt / extend
> > > > > existing
> > > > > > > > > > operators
> > > > > > > > > > > in
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > >> > Apache
> > > > > > > > > > > > > > >> > > > > Apex
> > > > > > > > > > > > > > >> > > > > > Malhar library so that it is easy to
> > use
> > > > > them
> > > > > > in
> > > > > > > > > batch
> > > > > > > > > > > use
> > > > > > > > > > > > > > >> cases.
> > > > > > > > > > > > > > >> > > > > > Naturally, this would be applicable
> > for
> > > > > only a
> > > > > > > > > subset
> > > > > > > > > > of
> > > > > > > > > > > > > > >> operators
> > > > > > > > > > > > > > >> > > like
> > > > > > > > > > > > > > >> > > > > > File, JDBC and NoSQL databases.
> > > > > > > > > > > > > > >> > > > > > For example, for a file based store,
> > > (say
> > > > > HDFS
> > > > > > > > > store),
> > > > > > > > > > > we
> > > > > > > > > > > > > > could
> > > > > > > > > > > > > > >> > have
> > > > > > > > > > > > > > >> > > > > > FileBatchInput and FileBatchOutput
> > > > operators
> > > > > > > which
> > > > > > > > > > allow
> > > > > > > > > > > > > easy
> > > > > > > > > > > > > > >> > > > integration
> > > > > > > > > > > > > > >> > > > > > into a batch application. These
> > > operators
> > > > > > would
> > > > > > > be
> > > > > > > > > > > > extended
> > > > > > > > > > > > > > from
> > > > > > > > > > > > > > >> > > their
> > > > > > > > > > > > > > >> > > > > > existing implementations and would
> be
> > > > "Batch
> > > > > > > > Aware",
> > > > > > > > > > in
> > > > > > > > > > > > that
> > > > > > > > > > > > > > >> they
> > > > > > > > > > > > > > >> > may
> > > > > > > > > > > > > > >> > > > > > understand the meaning of some
> > specific
> > > > > > control
> > > > > > > > > tuples
> > > > > > > > > > > > that
> > > > > > > > > > > > > > flow
> > > > > > > > > > > > > > >> > > > through
> > > > > > > > > > > > > > >> > > > > > the DAG. Start batch and end batch
> > seem
> > > to
> > > > > be
> > > > > > > the
> > > > > > > > > > > obvious
> > > > > > > > > > > > > > >> > candidates
> > > > > > > > > > > > > > >> > > > that
> > > > > > > > > > > > > > >> > > > > > come to mind. On receipt of such
> > control
> > > > > > tuples,
> > > > > > > > > they
> > > > > > > > > > > may
> > > > > > > > > > > > > try
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > >> > > modify
> > > > > > > > > > > > > > >> > > > > the
> > > > > > > > > > > > > > >> > > > > > behavior of the operator - to
> > > reinitialize
> > > > > > some
> > > > > > > > > > metrics
> > > > > > > > > > > or
> > > > > > > > > > > > > > >> finalize
> > > > > > > > > > > > > > >> > > an
> > > > > > > > > > > > > > >> > > > > > output file for example.
> > > > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > > > >> > > > > > We can discuss the potential control
> > > > tuples
> > > > > > and
> > > > > > > > > > actions
> > > > > > > > > > > in
> > > > > > > > > > > > > > >> detail,
> > > > > > > > > > > > > > >> > > but
> > > > > > > > > > > > > > >> > > > > > first I would like to understand the
> > > views
> > > > > of
> > > > > > > the
> > > > > > > > > > > > community
> > > > > > > > > > > > > > for
> > > > > > > > > > > > > > >> > this
> > > > > > > > > > > > > > >> > > > > > proposal.
> > > > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > > > >> > > > > > ~ Bhupesh
> > > > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by Thomas Weise <th...@apache.org>.
That's correct; we are looking at a generalized approach for state
management vs. a series of special cases.

And to be clear, windowing does not imply event time; otherwise it would be
called "EventTimeOperator" :-)

Thomas

On Tue, Feb 28, 2017 at 9:11 AM, Bhupesh Chawda <bh...@datatorrent.com>
wrote:

> Hi David,
>
> I went through the discussion, but it seems to be more about event-time
> watermark handling than about batches. What we are trying to do is
> have watermarks serve the purpose of demarcating batches using control
> tuples. Since each batch is separate from others, we would like to have
> stateful processing within a batch, but not across batches.
> At the same time, we would like to do this in a manner which is consistent
> with the windowing mechanism provided by the windowed operator. This will
> allow us to treat a single batch as a (bounded) stream and apply all the
> event time windowing concepts in that time span.
>
> For example, let's say I need to process data for a day (24 hours) as a
> single batch. The application is still streaming in nature: it would end
> the batch after a day and start a new batch the next day. At the same time,
> I would be able to have early trigger firings every minute as well as drop
> any data which is, say, 5 minutes late. All this within a single day.
>
> ~ Bhupesh
>
>
>
> _______________________________________________________
>
> Bhupesh Chawda
>
> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>
> www.datatorrent.com  |  apex.apache.org
>
>
>
> On Tue, Feb 28, 2017 at 9:27 PM, David Yan <da...@gmail.com> wrote:
>
> > There is a discussion in the Flink mailing list about key-based
> watermarks.
> > I think it's relevant to our use case here.
> > https://lists.apache.org/thread.html/2b90d5b1d5e2654212cfbbcc6510ef
> > 424bbafc4fadb164bd5aff9216@%3Cdev.flink.apache.org%3E
> >
> > David
> >
> > On Tue, Feb 28, 2017 at 2:13 AM, Bhupesh Chawda <bhupesh@datatorrent.com
> >
> > wrote:
> >
> > > Hi David,
> > >
> > > If using time window does not seem appropriate, we can have another
> class
> > > which is more suited for such sequential and distinct windows.
> Perhaps, a
> > > CustomWindow option can be introduced which takes in a window id. The
> > > purpose of this window option could be to translate the window id into
> > > appropriate timestamps.
> > >
> > > Another option would be to go with a custom timestampExtractor for such
> > > tuples which translates each unique file name to a distinct
> timestamp
> > > while using time windows in the windowed operator.
> > >
> > > ~ Bhupesh
> > >
> > >
> > > _______________________________________________________
> > >
> > > Bhupesh Chawda
> > >
> > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> > >
> > > www.datatorrent.com  |  apex.apache.org
> > >
> > >
> > >
> > > On Tue, Feb 28, 2017 at 12:28 AM, David Yan <da...@gmail.com>
> wrote:
> > >
> > > > I now see your rationale on putting the filename in the window.
> > > > As far as I understand, the reasons why the filename is not part of
> the
> > > key
> > > > and the Global Window is not used are:
> > > >
> > > > 1) The files are processed in sequence, not in parallel
> > > > 2) The windowed operator should not keep the state associated with
> the
> > > file
> > > > when the processing of the file is done
> > > > 3) The trigger should be fired for the file when a file is done
> > > processing.
> > > >
> > > > However, if the file is just a sequence that has nothing to do with a
> > > timestamp,
> > > > assigning a timestamp to a file is not an intuitive thing to do and
> > would
> > > > just create confusion for users, especially when it's used as an
> > > > example for new users.
> > > >
> > > > How about having a separate class called SequenceWindow? And perhaps
> > > > TimeWindow can inherit from it?
> > > >
> > > > David
> > > >
> > > > On Mon, Feb 27, 2017 at 8:58 AM, Thomas Weise <th...@apache.org>
> wrote:
> > > >
> > > > > On Mon, Feb 27, 2017 at 8:50 AM, Bhupesh Chawda <
> > > bhupesh@datatorrent.com
> > > > >
> > > > > wrote:
> > > > >
> > > > > > I think my comments related to count based windows might be
> causing
> > > > > > confusion. Let's not discuss count based scenarios for now.
> > > > > >
> > > > > > Just want to make sure we are on the same page wrt. the "each
> file
> > > is a
> > > > > > batch" use case. As mentioned by Thomas, the each tuple from the
> > same
> > > > > file
> > > > > > has the same timestamp (which is just a sequence number) and that
> > > helps
> > > > > > keep tuples from each file in a separate window.
> > > > > >
> > > > >
> > > > > Yes, in this case it is a sequence number, but it could be a time
> > stamp
> > > > > also, depending on the file naming convention. And if it was event
> > time
> > > > > processing, the watermark would be derived from records within the
> > > file.
> > > > >
> > > > > Agreed, the source should have a mechanism to control the time
> stamp
> > > > > extraction along with everything else pertaining to the watermark
> > > > > generation.
> > > > >
> > > > >
> > > > > > We could also implement a "timestampExtractor" interface to
> > identify
> > > > the
> > > > > > timestamp (sequence number) for a file.
> > > > > >
> > > > > > ~ Bhupesh
> > > > > >
> > > > > >
> > > > > > _______________________________________________________
> > > > > >
> > > > > > Bhupesh Chawda
> > > > > >
> > > > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> > > > > >
> > > > > > www.datatorrent.com  |  apex.apache.org
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Mon, Feb 27, 2017 at 9:52 PM, Thomas Weise <th...@apache.org>
> > > wrote:
> > > > > >
> > > > > > > I don't think this is a use case for count based window.
> > > > > > >
> > > > > > > We have multiple files that are retrieved in a sequence and
> there
> > > is
> > > > no
> > > > > > > knowledge of the number of records per file. The requirement is
> > to
> > > > > > > aggregate each file separately and emit the aggregate when the
> > file
> > > > is
> > > > > > read
> > > > > > > fully. There is no concept of "end of something" for an
> > individual
> > > > key
> > > > > > and
> > > > > > > global window isn't applicable.
> > > > > > >
> > > > > > > However, as already explained and implemented by Bhupesh, this
> > can
> > > be
> > > > > > > solved using watermark and window (in this case the window
> > > timestamp
> > > > > > isn't
> > > > > > > a timestamp, but a file sequence, but that doesn't matter.
> > > > > > >
> > > > > > > Thomas
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Feb 27, 2017 at 8:05 AM, David Yan <davidyan@gmail.com
> >
> > > > wrote:
> > > > > > >
> > > > > > > > I don't think this is the way to go. Global Window only means
> > the
> > > > > > > timestamp
> > > > > > > > does not matter (or that there is no timestamp). It does not
> > > > > > necessarily
> > > > > > > > mean it's a large batch. Unless there is some notion of event
> > > time
> > > > > for
> > > > > > > each
> > > > > > > > file, you don't want to embed the file into the window
> itself.
> > > > > > > >
> > > > > > > > If you want the result broken up by file name, and if the
> files
> > > are
> > > > > to
> > > > > > be
> > > > > > > > processed in parallel, I think making the file name be part
> of
> > > the
> > > > > key
> > > > > > is
> > > > > > > > the way to go. I think it's very confusing if we somehow make
> > the
> > > > > file
> > > > > > to
> > > > > > > > be part of the window.
> > > > > > > >
> > > > > > > > For count-based window, it's not implemented yet and you're
> > > welcome
> > > > > to
> > > > > > > add
> > > > > > > > that feature. In case of count-based windows, there would be
> no
> > > > > notion
> > > > > > of
> > > > > > > > time and you probably only trigger at the end of each window.
> > In
> > > > the
> > > > > > case
> > > > > > > > of count-based windows, the watermark only matters for batch
> > > since
> > > > > you
> > > > > > > need
> > > > > > > > a way to know when the batch has ended (if the count is 10,
> the
> > > > > number
> > > > > > of
> > > > > > > > tuples in the batch is let's say 105, you need a way to end
> the
> > > > last
> > > > > > > window
> > > > > > > > with 5 tuples).
> > > > > > > >
> > > > > > > > David
> > > > > > > >
> > > > > > > > On Mon, Feb 27, 2017 at 2:41 AM, Bhupesh Chawda <
> > > > > > bhupesh@datatorrent.com
> > > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi David,
> > > > > > > > >
> > > > > > > > > Thanks for your comments.
> > > > > > > > >
> > > > > > > > > The wordcount example that I created based on the windowed
> > > > operator
> > > > > > > does
> > > > > > > > > processing of word counts per file (each file as a separate
> > > > batch),
> > > > > > > i.e.
> > > > > > > > > process counts for each file and dump into separate files.
> > > > > > > > > As I understand Global window is for one large batch; i.e.
> > all
> > > > > > incoming
> > > > > > > > > data falls into the same batch. This could not be processed
> > > using
> > > > > > > > > GlobalWindow option as we need more than one windows. In
> this
> > > > > case, I
> > > > > > > > > configured the windowed operator to have time windows of
> 1ms
> > > each
> > > > > and
> > > > > > > > > passed data for each file with increasing timestamps:
> (file1,
> > > 1),
> > > > > > > (file2,
> > > > > > > > > 2) and so on. Is there a better way of handling this
> > scenario?
> > > > > > > > >
> > > > > > > > > Regarding (2 - count based windows), I think there is a
> > trigger
> > > > > > option
> > > > > > > to
> > > > > > > > > process count based windows. In case I want to process
> every
> > > 1000
> > > > > > > tuples
> > > > > > > > as
> > > > > > > > > a batch, I could set the Trigger option to CountTrigger
> with
> > > the
> > > > > > > > > accumulation set to Discarding. Is this correct?
> > > > > > > > >
> > > > > > > > > I agree that (4. Final Watermark) can be done using Global
> > > > window.
> > > > > > > > >
> > > > > > > > > ​~ Bhupesh​
> > > > > > > > >
> > > > > > > > > _______________________________________________________
> > > > > > > > >
> > > > > > > > > Bhupesh Chawda
> > > > > > > > >
> > > > > > > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> > > > > > > > >
> > > > > > > > > www.datatorrent.com  |  apex.apache.org
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Mon, Feb 27, 2017 at 12:18 PM, David Yan <
> > > davidyan@gmail.com>
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > I'm worried that we are making the watermark concept too
> > > > > > complicated.
> > > > > > > > > >
> > > > > > > > > > Watermarks should simply just tell you what windows can
> be
> > > > > > considered
> > > > > > > > > > complete.
> > > > > > > > > >
> > > > > > > > > > Point 2 is basically a count-based window. Watermarks do
> > not
> > > > > play a
> > > > > > > > role
> > > > > > > > > > here because the window is always complete at the n-th
> > tuple.
> > > > > > > > > >
> > > > > > > > > > If I understand correctly, point 3 is for batch
> processing
> > of
> > > > > > files.
> > > > > > > > > Unless
> > > > > > > > > > the files contain timed events, it sounds to be that this
> > can
> > > > be
> > > > > > > > achieved
> > > > > > > > > > with just a Global Window. For signaling EOF, a watermark
> > > with
> > > > a
> > > > > > > > > +infinity
> > > > > > > > > > timestamp can be used so that triggers will be fired upon
> > > > receipt
> > > > > > of
> > > > > > > > that
> > > > > > > > > > watermark.
> > > > > > > > > >
> > > > > > > > > > For point 4, just like what I mentioned above, can be
> > > achieved
> > > > > > with a
> > > > > > > > > > watermark with a +infinity timestamp.
> > > > > > > > > >
> > > > > > > > > > David
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Sat, Feb 18, 2017 at 8:04 AM, Bhupesh Chawda <
> > > > > > > > bhupesh@datatorrent.com
> > > > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Thomas,
> > > > > > > > > > >
> > > > > > > > > > > For an input operator which is supposed to generate
> > > > watermarks
> > > > > > for
> > > > > > > > > > > downstream operators, I can think about the following
> > > > > watermarks
> > > > > > > that
> > > > > > > > > the
> > > > > > > > > > > operator can emit:
> > > > > > > > > > > 1. Time based watermarks (the high watermark / low
> > > watermark)
> > > > > > > > > > > 2. Number of tuple based watermarks (Every n tuples)
> > > > > > > > > > > 3. File based watermarks (Start file, end file)
> > > > > > > > > > > 4. Final watermark
> > > > > > > > > > >
> > > > > > > > > > > File based watermarks seem to be applicable for batch
> > (file
> > > > > > based)
> > > > > > > as
> > > > > > > > > > well,
> > > > > > > > > > > and hence I thought of looking at these first. Does
> this
> > > seem
> > > > > to
> > > > > > be
> > > > > > > > in
> > > > > > > > > > line
> > > > > > > > > > > with the thought process?
> > > > > > > > > > >
> > > > > > > > > > > ~ Bhupesh
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > ______________________________
> _________________________
> > > > > > > > > > >
> > > > > > > > > > > Bhupesh Chawda
> > > > > > > > > > >
> > > > > > > > > > > Software Engineer
> > > > > > > > > > >
> > > > > > > > > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> > > > > > > > > > >
> > > > > > > > > > > www.datatorrent.com  |  apex.apache.org
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Feb 16, 2017 at 10:37 AM, Thomas Weise <
> > > > thw@apache.org
> > > > > >
> > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > I don't think this should be designed based on a
> > > simplistic
> > > > > > file
> > > > > > > > > > > > input-output scenario. It would be good to include a
> > > > stateful
> > > > > > > > > > > > transformation based on event time.
> > > > > > > > > > > >
> > > > > > > > > > > > More complex pipelines contain stateful
> transformations
> > > > that
> > > > > > > depend
> > > > > > > > > on
> > > > > > > > > > > > windowing and watermarks. I think we need a watermark
> > > > concept
> > > > > > > that
> > > > > > > > is
> > > > > > > > > > > based
> > > > > > > > > > > > on progress in event time (or other monotonic
> > increasing
> > > > > > > sequence)
> > > > > > > > > that
> > > > > > > > > > > > other operators can generically work with.
> > > > > > > > > > > >
> > > > > > > > > > > > Note that even file input in many cases can produce
> > time
> > > > > based
> > > > > > > > > > > watermarks,
> > > > > > > > > > > > for example when you read part files that are bound
> by
> > > > event
> > > > > > > time.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Thomas
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Feb 15, 2017 at 4:02 AM, Bhupesh Chawda <
> > > > > > > > > > bhupesh@datatorrent.com
> > > > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > For better understanding the use case for control
> > > tuples
> > > > in
> > > > > > > > batch,
> > > > > > > > > ​I
> > > > > > > > > > > am
> > > > > > > > > > > > > creating a prototype for a batch application using
> > File
> > > > > Input
> > > > > > > and
> > > > > > > > > > File
> > > > > > > > > > > > > Output operators.
> > > > > > > > > > > > >
> > > > > > > > > > > > > To enable basic batch processing for File IO
> > > operators, I
> > > > > am
> > > > > > > > > > proposing
> > > > > > > > > > > > the
> > > > > > > > > > > > > following changes to File input and output
> operators:
> > > > > > > > > > > > > 1. File Input operator emits a watermark each time
> it
> > > > opens
> > > > > > and
> > > > > > > > > > closes
> > > > > > > > > > > a
> > > > > > > > > > > > > file. These can be "start file" and "end file"
> > > watermarks
> > > > > > which
> > > > > > > > > > include
> > > > > > > > > > > > the
> > > > > > > > > > > > > corresponding file names. The "start file" tuple
> > should
> > > > be
> > > > > > sent
> > > > > > > > > > before
> > > > > > > > > > > > any
> > > > > > > > > > > > > of the data from that file flows.
> > > > > > > > > > > > > 2. File Input operator can be configured to end the
> > > > > > application
> > > > > > > > > > after a
> > > > > > > > > > > > > single or n scans of the directory (a batch). This
> is
> > > > where
> > > > > > the
> > > > > > > > > > > operator
> > > > > > > > > > > > > emits the final watermark (the end of application
> > > control
> > > > > > > tuple).
> > > > > > > > > > This
> > > > > > > > > > > > will
> > > > > > > > > > > > > also shutdown the application.
> > > > > > > > > > > > > 3. The File output operator handles these control
> > > tuples.
> > > > > > > "Start
> > > > > > > > > > file"
> > > > > > > > > > > > > initializes the file name for the incoming tuples.
> > "End
> > > > > file"
> > > > > > > > > > watermark
> > > > > > > > > > > > > forces a finalize on that file.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The user would be able to enable the operators to
> > send
> > > > only
> > > > > > > those
> > > > > > > > > > > > > watermarks that are needed in the application. If
> > none
> > > of
> > > > > the
> > > > > > > > > options
> > > > > > > > > > > are
> > > > > > > > > > > > > configured, the operators behave as in a streaming
> > > > > > application.
> > > > > > > > > > > > >
> > > > > > > > > > > > > There are a few challenges in the implementation
> > where
> > > > the
> > > > > > > input
> > > > > > > > > > > operator
> > > > > > > > > > > > > is partitioned. In this case, the correlation
> between
> > > the
> > > > > > > > start/end
> > > > > > > > > > > for a
> > > > > > > > > > > > > file and the data tuples for that file is lost.
> Hence
> > > we
> > > > > need
> > > > > > > to
> > > > > > > > > > > maintain
> > > > > > > > > > > > > the filename as part of each tuple in the pipeline.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The "start file" and "end file" control tuples in
> > this
> > > > > > example
> > > > > > > > are
> > > > > > > > > > > > > temporary names for watermarks. We can have generic
> > > > "start
> > > > > > > > batch" /
> > > > > > > > > > > "end
> > > > > > > > > > > > > batch" tuples which could be used for other use
> cases
> > > as
> > > > > > well.
> > > > > > > > The
> > > > > > > > > > > Final
> > > > > > > > > > > > > watermark is common and serves the same purpose in
> > each
> > > > > case.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Please let me know your thoughts on this.
> > > > > > > > > > > > >
> > > > > > > > > > > > > ~ Bhupesh
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Jan 18, 2017 at 12:22 AM, Bhupesh Chawda <
> > > > > > > > > > > > bhupesh@datatorrent.com>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Yes, this can be part of operator configuration.
> > > Given
> > > > > > this,
> > > > > > > > for
> > > > > > > > > a
> > > > > > > > > > > user
> > > > > > > > > > > > > to
> > > > > > > > > > > > > > define a batch application, would mean
> configuring
> > > the
> > > > > > > > connectors
> > > > > > > > > > > > (mostly
> > > > > > > > > > > > > > the input operator) in the application for the
> > > desired
> > > > > > > > behavior.
> > > > > > > > > > > > > Similarly,
> > > > > > > > > > > > > > there can be other use cases that can be achieved
> > > other
> > > > > > than
> > > > > > > > > batch.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > We may also need to take care of the following:
> > > > > > > > > > > > > > 1. Make sure that the watermarks or control
> tuples
> > > are
> > > > > > > > consistent
> > > > > > > > > > > > across
> > > > > > > > > > > > > > sources. Meaning an HDFS sink should be able to
> > > > interpret
> > > > > > the
> > > > > > > > > > > watermark
> > > > > > > > > > > > > > tuple sent out by, say, a JDBC source.
> > > > > > > > > > > > > > 2. In addition to I/O connectors, we should also
> > look
> > > > at
> > > > > > the
> > > > > > > > need
> > > > > > > > > > for
> > > > > > > > > > > > > > processing operators to understand some of the
> > > control
> > > > > > > tuples /
> > > > > > > > > > > > > watermarks.
> > > > > > > > > > > > > > For example, we may want to reset the operator
> > > behavior
> > > > > on
> > > > > > > > > arrival
> > > > > > > > > > of
> > > > > > > > > > > > > some
> > > > > > > > > > > > > > watermark tuple.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > ~ Bhupesh
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Tue, Jan 17, 2017 at 9:59 PM, Thomas Weise <
> > > > > > > thw@apache.org>
> > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >> The HDFS source can operate in two modes,
> bounded
> > or
> > > > > > > > unbounded.
> > > > > > > > > If
> > > > > > > > > > > you
> > > > > > > > > > > > > >> scan
> > > > > > > > > > > > > >> only once, then it should emit the final
> watermark
> > > > after
> > > > > > it
> > > > > > > is
> > > > > > > > > > done.
> > > > > > > > > > > > > >> Otherwise it would emit watermarks based on a
> > policy
> > > > > > (files
> > > > > > > > > names
> > > > > > > > > > > > etc.).
> > > > > > > > > > > > > >> The mechanism to generate the marks may depend
> on
> > > the
> > > > > type
> > > > > > > of
> > > > > > > > > > source
> > > > > > > > > > > > and
> > > > > > > > > > > > > >> the user needs to be able to influence/configure
> > it.
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> Thomas
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> On Tue, Jan 17, 2017 at 5:03 AM, Bhupesh Chawda
> <
> > > > > > > > > > > > > bhupesh@datatorrent.com>
> > > > > > > > > > > > > >> wrote:
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> > Hi Thomas,
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > I am not sure that I completely understand
> your
> > > > > > > suggestion.
> > > > > > > > > Are
> > > > > > > > > > > you
> > > > > > > > > > > > > >> > suggesting to broaden the scope of the
> proposal
> > to
> > > > > treat
> > > > > > > all
> > > > > > > > > > > sources
> > > > > > > > > > > > > as
> > > > > > > > > > > > > >> > bounded as well as unbounded?
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > In case of Apex, we treat all sources as
> > unbounded
> > > > > > > sources.
> > > > > > > > > Even
> > > > > > > > > > > > > bounded
> > > > > > > > > > > > > >> > sources like HDFS file source is treated as
> > > > unbounded
> > > > > by
> > > > > > > > means
> > > > > > > > > > of
> > > > > > > > > > > > > >> scanning
> > > > > > > > > > > > > >> > the input directory repeatedly.
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > Let's consider HDFS file source for example:
> > > > > > > > > > > > > >> > In this case, if we treat it as a bounded
> > source,
> > > we
> > > > > can
> > > > > > > > > define
> > > > > > > > > > > > hooks
> > > > > > > > > > > > > >> which
> > > > > > > > > > > > > >> > allows us to detect the end of the file and
> send
> > > the
> > > > > > > "final
> > > > > > > > > > > > > watermark".
> > > > > > > > > > > > > >> We
> > > > > > > > > > > > > >> > could also consider HDFS file source as a
> > > streaming
> > > > > > source
> > > > > > > > and
> > > > > > > > > > > > define
> > > > > > > > > > > > > >> hooks
> > > > > > > > > > > > > >> > which send watermarks based on different kinds
> > of
> > > > > > windows.
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > Please correct me if I misunderstand.
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > ~ Bhupesh
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > On Mon, Jan 16, 2017 at 9:23 PM, Thomas Weise
> <
> > > > > > > > thw@apache.org
> > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > > Bhupesh,
> > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > >> > > Please see how that can be solved in a
> unified
> > > way
> > > > > > using
> > > > > > > > > > windows
> > > > > > > > > > > > and
> > > > > > > > > > > > > >> > > watermarks. It is bounded data vs. unbounded
> > > data.
> > > > > In
> > > > > > > Beam
> > > > > > > > > for
> > > > > > > > > > > > > >> example,
> > > > > > > > > > > > > >> > you
> > > > > > > > > > > > > >> > > can use the "global window" and the final
> > > > watermark
> > > > > to
> > > > > > > > > > > accomplish
> > > > > > > > > > > > > what
> > > > > > > > > > > > > >> > you
> > > > > > > > > > > > > >> > > are looking for. Batch is just a special
> case
> > of
> > > > > > > streaming
> > > > > > > > > > where
> > > > > > > > > > > > the
> > > > > > > > > > > > > >> > source
> > > > > > > > > > > > > >> > > emits the final watermark.
> > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > >> > > Thanks,
> > > > > > > > > > > > > >> > > Thomas
> > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > >> > > On Mon, Jan 16, 2017 at 1:02 AM, Bhupesh
> > Chawda
> > > <
> > > > > > > > > > > > > >> bhupesh@datatorrent.com
> > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > >> > > wrote:
> > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > >> > > > Yes, if the user needs to develop a batch
> > > > > > application,
> > > > > > > > > then
> > > > > > > > > > > > batch
> > > > > > > > > > > > > >> aware
> > > > > > > > > > > > > >> > > > operators need to be used in the
> > application.
> > > > > > > > > > > > > >> > > > The nature of the application is mostly
> > > > controlled
> > > > > > by
> > > > > > > > the
> > > > > > > > > > > input
> > > > > > > > > > > > > and
> > > > > > > > > > > > > >> the
> > > > > > > > > > > > > >> > > > output operators used in the application.
> > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > >> > > > For example, consider an application which
> > > needs
> > > > > to
> > > > > > > > filter
> > > > > > > > > > > > records
> > > > > > > > > > > > > >> in a
> > > > > > > > > > > > > >> > > > input file and store the filtered records
> in
> > > > > another
> > > > > > > > file.
> > > > > > > > > > The
> > > > > > > > > > > > > >> nature
> > > > > > > > > > > > > >> > of
> > > > > > > > > > > > > >> > > > this app is to end once the entire file is
> > > > > > processed.
> > > > > > > > > > > Following
> > > > > > > > > > > > > >> things
> > > > > > > > > > > > > >> > > are
> > > > > > > > > > > > > >> > > > expected of the application:
> > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > >> > > >    1. Once the input data is over,
> finalize
> > > the
> > > > > > output
> > > > > > > > > file
> > > > > > > > > > > from
> > > > > > > > > > > > > >> .tmp
> > > > > > > > > > > > > >> > > >    files. - Responsibility of output
> > operator
> > > > > > > > > > > > > >> > > >    2. End the application, once the data
> is
> > > read
> > > > > and
> > > > > > > > > > > processed -
> > > > > > > > > > > > > >> > > >    Responsibility of input operator
> > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > >> > > > These functions are essential to allow the
> > > user
> > > > to
> > > > > > do
> > > > > > > > > higher
> > > > > > > > > > > > level
> > > > > > > > > > > > > >> > > > operations like scheduling or running a
> > > workflow
> > > > > of
> > > > > > > > batch
> > > > > > > > > > > > > >> applications.
> > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > >> > > > I am not sure about intermediate
> > (processing)
> > > > > > > operators,
> > > > > > > > > as
> > > > > > > > > > > > there
> > > > > > > > > > > > > >> is no
> > > > > > > > > > > > > >> > > > change in their functionality for batch
> use
> > > > cases.
> > > > > > > > > Perhaps,
> > > > > > > > > > > > > allowing
> > > > > > > > > > > > > >> > > > running multiple batches in a single
> > > application
> > > > > may
> > > > > > > > > require
> > > > > > > > > > > > > similar
> > > > > > > > > > > > > >> > > > changes in processing operators as well.
> > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > >> > > > ~ Bhupesh
> > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > >> > > > On Mon, Jan 16, 2017 at 2:19 PM, Priyanka
> > > > Gugale <
> > > > > > > > > > > > > priyag@apache.org
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > > > wrote:
> > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > >> > > > > Will it make an impression on user that,
> > if
> > > he
> > > > > > has a
> > > > > > > > > batch
> > > > > > > > > > > > > >> usecase he
> > > > > > > > > > > > > >> > > has
> > > > > > > > > > > > > >> > > > > to use batch aware operators only? If
> so,
> > is
> > > > > that
> > > > > > > what
> > > > > > > > > we
> > > > > > > > > > > > > expect?
> > > > > > > > > > > > > >> I
> > > > > > > > > > > > > >> > am
> > > > > > > > > > > > > >> > > > not
> > > > > > > > > > > > > >> > > > > aware of how do we implement batch
> > scenario
> > > so
> > > > > > this
> > > > > > > > > might
> > > > > > > > > > > be a
> > > > > > > > > > > > > >> basic
> > > > > > > > > > > > > >> > > > > question.
> > > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > > >> > > > > -Priyanka
> > > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > > >> > > > > On Mon, Jan 16, 2017 at 12:02 PM,
> Bhupesh
> > > > > Chawda <
> > > > > > > > > > > > > >> > > > bhupesh@datatorrent.com>
> > > > > > > > > > > > > >> > > > > wrote:
> > > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > > >> > > > > > Hi All,
> > > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > > >> > > > > > While design / implementation for
> custom
> > > > > control
> > > > > > > > > tuples
> > > > > > > > > > is
> > > > > > > > > > > > > >> > ongoing, I
> > > > > > > > > > > > > >> > > > > > thought it would be a good idea to
> > > consider
> > > > > its
> > > > > > > > > > usefulness
> > > > > > > > > > > > in
> > > > > > > > > > > > > >> one
> > > > > > > > > > > > > >> > of
> > > > > > > > > > > > > >> > > > the
> > > > > > > > > > > > > >> > > > > > use cases -  batch applications.
> > > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > > >> > > > > > This is a proposal to adapt / extend
> > > > existing
> > > > > > > > > operators
> > > > > > > > > > in
> > > > > > > > > > > > the
> > > > > > > > > > > > > >> > Apache
> > > > > > > > > > > > > >> > > > > Apex
> > > > > > > > > > > > > >> > > > > > Malhar library so that it is easy to
> use
> > > > them
> > > > > in
> > > > > > > > batch
> > > > > > > > > > use
> > > > > > > > > > > > > >> cases.
> > > > > > > > > > > > > >> > > > > > Naturally, this would be applicable
> for
> > > > only a
> > > > > > > > subset
> > > > > > > > > of
> > > > > > > > > > > > > >> operators
> > > > > > > > > > > > > >> > > like
> > > > > > > > > > > > > >> > > > > > File, JDBC and NoSQL databases.
> > > > > > > > > > > > > >> > > > > > For example, for a file based store,
> > (say
> > > > HDFS
> > > > > > > > store),
> > > > > > > > > > we
> > > > > > > > > > > > > could
> > > > > > > > > > > > > >> > have
> > > > > > > > > > > > > >> > > > > > FileBatchInput and FileBatchOutput
> > > operators
> > > > > > which
> > > > > > > > > allow
> > > > > > > > > > > > easy
> > > > > > > > > > > > > >> > > > integration
> > > > > > > > > > > > > >> > > > > > into a batch application. These
> > operators
> > > > > would
> > > > > > be
> > > > > > > > > > > extended
> > > > > > > > > > > > > from
> > > > > > > > > > > > > >> > > their
> > > > > > > > > > > > > >> > > > > > existing implementations and would be
> > > "Batch
> > > > > > > Aware",
> > > > > > > > > in
> > > > > > > > > > > that
> > > > > > > > > > > > > >> they
> > > > > > > > > > > > > >> > may
> > > > > > > > > > > > > >> > > > > > understand the meaning of some
> specific
> > > > > control
> > > > > > > > tuples
> > > > > > > > > > > that
> > > > > > > > > > > > > flow
> > > > > > > > > > > > > >> > > > through
> > > > > > > > > > > > > >> > > > > > the DAG. Start batch and end batch
> seem
> > to
> > > > be
> > > > > > the
> > > > > > > > > > obvious
> > > > > > > > > > > > > >> > candidates
> > > > > > > > > > > > > >> > > > that
> > > > > > > > > > > > > >> > > > > > come to mind. On receipt of such
> control
> > > > > tuples,
> > > > > > > > they
> > > > > > > > > > may
> > > > > > > > > > > > try
> > > > > > > > > > > > > to
> > > > > > > > > > > > > >> > > modify
> > > > > > > > > > > > > >> > > > > the
> > > > > > > > > > > > > >> > > > > > behavior of the operator - to
> > reinitialize
> > > > > some
> > > > > > > > > metrics
> > > > > > > > > > or
> > > > > > > > > > > > > >> finalize
> > > > > > > > > > > > > >> > > an
> > > > > > > > > > > > > >> > > > > > output file for example.
> > > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > > >> > > > > > We can discuss the potential control
> > > tuples
> > > > > and
> > > > > > > > > actions
> > > > > > > > > > in
> > > > > > > > > > > > > >> detail,
> > > > > > > > > > > > > >> > > but
> > > > > > > > > > > > > >> > > > > > first I would like to understand the
> > views
> > > > of
> > > > > > the
> > > > > > > > > > > community
> > > > > > > > > > > > > for
> > > > > > > > > > > > > >> > this
> > > > > > > > > > > > > >> > > > > > proposal.
> > > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > > >> > > > > > ~ Bhupesh
> > > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by Bhupesh Chawda <bh...@datatorrent.com>.
Hi David,

I went through the discussion, but it seems to be more about event-time
watermark handling than about batches. What we are trying to do is have
watermarks serve the purpose of demarcating batches using control tuples.
Since each batch is separate from the others, we would like stateful
processing within a batch, but not across batches.
At the same time, we would like to do this in a manner consistent with the
windowing mechanism provided by the windowed operator. This would allow us
to treat a single batch as a (bounded) stream and apply all the event-time
windowing concepts within that span.
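
To make the "each batch in its own window" idea concrete, below is a minimal
sketch (in the spirit of the timestampExtractor discussed earlier in this
thread and quoted below) that maps every file, i.e. every batch, to its own
monotonically increasing sequence number, so that all tuples of one file fall
into the same window. The class and method names are purely illustrative and
are not an existing Malhar API.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: assigns each file name a distinct, increasing
// "timestamp" (really a sequence number), so that a windowed operator
// configured with small time windows keeps each file's tuples in a window
// of their own.
public class FileSequenceTimestampExtractor
{
  private final Map<String, Long> fileToSequence = new ConcurrentHashMap<>();
  private final AtomicLong nextSequence = new AtomicLong(1);

  // The same file name always yields the same sequence number.
  public long extractTimestamp(String fileName)
  {
    return fileToSequence.computeIfAbsent(fileName, f -> nextSequence.getAndIncrement());
  }
}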

For example, let's say I need to process data for a day (24 hours) as a
single batch. The application is still streaming in nature: it would end
the batch after a day and start a new batch the next day. At the same time,
I would be able to have early trigger firings every minute, as well as drop
any data that is, say, 5 minutes late. All of this happens within the scope
of a single day's batch.
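
As a rough illustration of the above, the windowed operator configuration
might look something like the sketch below. The class and method names
(WindowedOperatorImpl, WindowOption.TimeWindows, TriggerOption.AtWatermark,
setAllowedLateness) are my recollection of the Malhar windowed operator API
and may differ slightly in the actual library, so please treat this as a
sketch rather than verified code.

import org.joda.time.Duration;

import org.apache.apex.malhar.lib.window.TriggerOption;
import org.apache.apex.malhar.lib.window.WindowOption;
import org.apache.apex.malhar.lib.window.impl.WindowedOperatorImpl;

public class DailyBatchWindowing
{
  // Sketch: one day forms one batch (one window), early results are fired
  // every minute, and data arriving more than 5 minutes after the day has
  // closed is dropped.
  public static void configure(WindowedOperatorImpl<?, ?, ?> op)
  {
    // One window == one day == one batch.
    op.setWindowOption(new WindowOption.TimeWindows(Duration.standardDays(1)));

    // Early trigger firings every minute while the batch is open; the final
    // firing happens when the end-of-batch watermark / control tuple arrives.
    op.setTriggerOption(TriggerOption.AtWatermark()
        .withEarlyFiringsAtEvery(Duration.standardMinutes(1))
        .accumulatingFiredPanes());

    // Tuples later than 5 minutes past the end of the day are discarded.
    op.setAllowedLateness(Duration.standardMinutes(5));
  }
}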

~ Bhupesh



_______________________________________________________

Bhupesh Chawda

E: bhupesh@datatorrent.com | Twitter: @bhupeshsc

www.datatorrent.com  |  apex.apache.org



On Tue, Feb 28, 2017 at 9:27 PM, David Yan <da...@gmail.com> wrote:

> There is a discussion in the Flink mailing list about key-based watermarks.
> I think it's relevant to our use case here.
> https://lists.apache.org/thread.html/2b90d5b1d5e2654212cfbbcc6510ef
> 424bbafc4fadb164bd5aff9216@%3Cdev.flink.apache.org%3E
>
> David
>
> On Tue, Feb 28, 2017 at 2:13 AM, Bhupesh Chawda <bh...@datatorrent.com>
> wrote:
>
> > Hi David,
> >
> > If using time window does not seem appropriate, we can have another class
> > which is more suited for such sequential and distinct windows. Perhaps, a
> > CustomWindow option can be introduced which takes in a window id. The
> > purpose of this window option could be to translate the window id into
> > appropriate timestamps.
> >
> > Another option would be to go with a custom timestampExtractor for such
> > tuples which translates the each unique file name to a distinct timestamp
> > while using time windows in the windowed operator.
> >
> > ~ Bhupesh
> >
> >
> > _______________________________________________________
> >
> > Bhupesh Chawda
> >
> > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> >
> > www.datatorrent.com  |  apex.apache.org
> >
> >
> >
> > On Tue, Feb 28, 2017 at 12:28 AM, David Yan <da...@gmail.com> wrote:
> >
> > > I now see your rationale on putting the filename in the window.
> > > As far as I understand, the reasons why the filename is not part of the
> > key
> > > and the Global Window is not used are:
> > >
> > > 1) The files are processed in sequence, not in parallel
> > > 2) The windowed operator should not keep the state associated with the
> > file
> > > when the processing of the file is done
> > > 3) The trigger should be fired for the file when a file is done
> > processing.
> > >
> > > However, if the file is just a sequence has nothing to do with a
> > timestamp,
> > > assigning a timestamp to a file is not an intuitive thing to do and
> would
> > > just create confusions to the users, especially when it's used as an
> > > example for new users.
> > >
> > > How about having a separate class called SequenceWindow? And perhaps
> > > TimeWindow can inherit from it?
> > >
> > > David
> > >
> > > On Mon, Feb 27, 2017 at 8:58 AM, Thomas Weise <th...@apache.org> wrote:
> > >
> > > > On Mon, Feb 27, 2017 at 8:50 AM, Bhupesh Chawda <
> > bhupesh@datatorrent.com
> > > >
> > > > wrote:
> > > >
> > > > > I think my comments related to count based windows might be causing
> > > > > confusion. Let's not discuss count based scenarios for now.
> > > > >
> > > > > Just want to make sure we are on the same page wrt. the "each file
> > is a
> > > > > batch" use case. As mentioned by Thomas, the each tuple from the
> same
> > > > file
> > > > > has the same timestamp (which is just a sequence number) and that
> > helps
> > > > > keep tuples from each file in a separate window.
> > > > >
> > > >
> > > > Yes, in this case it is a sequence number, but it could be a time
> stamp
> > > > also, depending on the file naming convention. And if it was event
> time
> > > > processing, the watermark would be derived from records within the
> > file.
> > > >
> > > > Agreed, the source should have a mechanism to control the time stamp
> > > > extraction along with everything else pertaining to the watermark
> > > > generation.
> > > >
> > > >
> > > > > We could also implement a "timestampExtractor" interface to
> identify
> > > the
> > > > > timestamp (sequence number) for a file.
> > > > >
> > > > > ~ Bhupesh
> > > > >
> > > > >
> > > > > _______________________________________________________
> > > > >
> > > > > Bhupesh Chawda
> > > > >
> > > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> > > > >
> > > > > www.datatorrent.com  |  apex.apache.org
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Feb 27, 2017 at 9:52 PM, Thomas Weise <th...@apache.org>
> > wrote:
> > > > >
> > > > > > I don't think this is a use case for count based window.
> > > > > >
> > > > > > We have multiple files that are retrieved in a sequence and there
> > is
> > > no
> > > > > > knowledge of the number of records per file. The requirement is
> to
> > > > > > aggregate each file separately and emit the aggregate when the
> file
> > > is
> > > > > read
> > > > > > fully. There is no concept of "end of something" for an
> individual
> > > key
> > > > > and
> > > > > > global window isn't applicable.
> > > > > >
> > > > > > However, as already explained and implemented by Bhupesh, this
> can
> > be
> > > > > > solved using watermark and window (in this case the window
> > timestamp
> > > > > isn't
> > > > > > a timestamp, but a file sequence, but that doesn't matter.
> > > > > >
> > > > > > Thomas
> > > > > >
> > > > > >
> > > > > > On Mon, Feb 27, 2017 at 8:05 AM, David Yan <da...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > > I don't think this is the way to go. Global Window only means
> the
> > > > > > timestamp
> > > > > > > does not matter (or that there is no timestamp). It does not
> > > > > necessarily
> > > > > > > mean it's a large batch. Unless there is some notion of event
> > time
> > > > for
> > > > > > each
> > > > > > > file, you don't want to embed the file into the window itself.
> > > > > > >
> > > > > > > If you want the result broken up by file name, and if the files
> > are
> > > > to
> > > > > be
> > > > > > > processed in parallel, I think making the file name be part of
> > the
> > > > key
> > > > > is
> > > > > > > the way to go. I think it's very confusing if we somehow make
> the
> > > > file
> > > > > to
> > > > > > > be part of the window.
> > > > > > >
> > > > > > > For count-based window, it's not implemented yet and you're
> > welcome
> > > > to
> > > > > > add
> > > > > > > that feature. In case of count-based windows, there would be no
> > > > notion
> > > > > of
> > > > > > > time and you probably only trigger at the end of each window.
> In
> > > the
> > > > > case
> > > > > > > of count-based windows, the watermark only matters for batch
> > since
> > > > you
> > > > > > need
> > > > > > > a way to know when the batch has ended (if the count is 10, the
> > > > number
> > > > > of
> > > > > > > tuples in the batch is let's say 105, you need a way to end the
> > > last
> > > > > > window
> > > > > > > with 5 tuples).
> > > > > > >
> > > > > > > David
> > > > > > >
> > > > > > > On Mon, Feb 27, 2017 at 2:41 AM, Bhupesh Chawda <
> > > > > bhupesh@datatorrent.com
> > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi David,
> > > > > > > >
> > > > > > > > Thanks for your comments.
> > > > > > > >
> > > > > > > > The wordcount example that I created based on the windowed
> > > operator
> > > > > > does
> > > > > > > > processing of word counts per file (each file as a separate
> > > batch),
> > > > > > i.e.
> > > > > > > > process counts for each file and dump into separate files.
> > > > > > > > As I understand Global window is for one large batch; i.e.
> all
> > > > > incoming
> > > > > > > > data falls into the same batch. This could not be processed
> > using
> > > > > > > > GlobalWindow option as we need more than one windows. In this
> > > > case, I
> > > > > > > > configured the windowed operator to have time windows of 1ms
> > each
> > > > and
> > > > > > > > passed data for each file with increasing timestamps: (file1,
> > 1),
> > > > > > (file2,
> > > > > > > > 2) and so on. Is there a better way of handling this
> scenario?
> > > > > > > >
> > > > > > > > Regarding (2 - count based windows), I think there is a
> trigger
> > > > > option
> > > > > > to
> > > > > > > > process count based windows. In case I want to process every
> > 1000
> > > > > > tuples
> > > > > > > as
> > > > > > > > a batch, I could set the Trigger option to CountTrigger with
> > the
> > > > > > > > accumulation set to Discarding. Is this correct?
> > > > > > > >
> > > > > > > > I agree that (4. Final Watermark) can be done using Global
> > > window.
> > > > > > > >
> > > > > > > > ​~ Bhupesh​
> > > > > > > >
> > > > > > > > _______________________________________________________
> > > > > > > >
> > > > > > > > Bhupesh Chawda
> > > > > > > >
> > > > > > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> > > > > > > >
> > > > > > > > www.datatorrent.com  |  apex.apache.org
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Feb 27, 2017 at 12:18 PM, David Yan <
> > davidyan@gmail.com>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > I'm worried that we are making the watermark concept too
> > > > > complicated.
> > > > > > > > >
> > > > > > > > > Watermarks should simply just tell you what windows can be
> > > > > considered
> > > > > > > > > complete.
> > > > > > > > >
> > > > > > > > > Point 2 is basically a count-based window. Watermarks do
> not
> > > > play a
> > > > > > > role
> > > > > > > > > here because the window is always complete at the n-th
> tuple.
> > > > > > > > >
> > > > > > > > > If I understand correctly, point 3 is for batch processing
> of
> > > > > files.
> > > > > > > > Unless
> > > > > > > > > the files contain timed events, it sounds to be that this
> can
> > > be
> > > > > > > achieved
> > > > > > > > > with just a Global Window. For signaling EOF, a watermark
> > with
> > > a
> > > > > > > > +infinity
> > > > > > > > > timestamp can be used so that triggers will be fired upon
> > > receipt
> > > > > of
> > > > > > > that
> > > > > > > > > watermark.
> > > > > > > > >
> > > > > > > > > For point 4, just like what I mentioned above, can be
> > achieved
> > > > > with a
> > > > > > > > > watermark with a +infinity timestamp.
> > > > > > > > >
> > > > > > > > > David
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Sat, Feb 18, 2017 at 8:04 AM, Bhupesh Chawda <
> > > > > > > bhupesh@datatorrent.com
> > > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Thomas,
> > > > > > > > > >
> > > > > > > > > > For an input operator which is supposed to generate
> > > watermarks
> > > > > for
> > > > > > > > > > downstream operators, I can think about the following
> > > > watermarks
> > > > > > that
> > > > > > > > the
> > > > > > > > > > operator can emit:
> > > > > > > > > > 1. Time based watermarks (the high watermark / low
> > watermark)
> > > > > > > > > > 2. Number of tuple based watermarks (Every n tuples)
> > > > > > > > > > 3. File based watermarks (Start file, end file)
> > > > > > > > > > 4. Final watermark
> > > > > > > > > >
> > > > > > > > > > File based watermarks seem to be applicable for batch
> (file
> > > > > based)
> > > > > > as
> > > > > > > > > well,
> > > > > > > > > > and hence I thought of looking at these first. Does this
> > seem
> > > > to
> > > > > be
> > > > > > > in
> > > > > > > > > line
> > > > > > > > > > with the thought process?
> > > > > > > > > >
> > > > > > > > > > ~ Bhupesh
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > _______________________________________________________
> > > > > > > > > >
> > > > > > > > > > Bhupesh Chawda
> > > > > > > > > >
> > > > > > > > > > Software Engineer
> > > > > > > > > >
> > > > > > > > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> > > > > > > > > >
> > > > > > > > > > www.datatorrent.com  |  apex.apache.org
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Thu, Feb 16, 2017 at 10:37 AM, Thomas Weise <
> > > thw@apache.org
> > > > >
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > I don't think this should be designed based on a
> > simplistic
> > > > > file
> > > > > > > > > > > input-output scenario. It would be good to include a
> > > stateful
> > > > > > > > > > > transformation based on event time.
> > > > > > > > > > >
> > > > > > > > > > > More complex pipelines contain stateful transformations
> > > that
> > > > > > depend
> > > > > > > > on
> > > > > > > > > > > windowing and watermarks. I think we need a watermark
> > > concept
> > > > > > that
> > > > > > > is
> > > > > > > > > > based
> > > > > > > > > > > on progress in event time (or other monotonic
> increasing
> > > > > > sequence)
> > > > > > > > that
> > > > > > > > > > > other operators can generically work with.
> > > > > > > > > > >
> > > > > > > > > > > Note that even file input in many cases can produce
> time
> > > > based
> > > > > > > > > > watermarks,
> > > > > > > > > > > for example when you read part files that are bound by
> > > event
> > > > > > time.
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Thomas
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Feb 15, 2017 at 4:02 AM, Bhupesh Chawda <
> > > > > > > > > bhupesh@datatorrent.com
> > > > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > For better understanding the use case for control
> > tuples
> > > in
> > > > > > > batch,
> > > > > > > > ​I
> > > > > > > > > > am
> > > > > > > > > > > > creating a prototype for a batch application using
> File
> > > > Input
> > > > > > and
> > > > > > > > > File
> > > > > > > > > > > > Output operators.
> > > > > > > > > > > >
> > > > > > > > > > > > To enable basic batch processing for File IO
> > operators, I
> > > > am
> > > > > > > > > proposing
> > > > > > > > > > > the
> > > > > > > > > > > > following changes to File input and output operators:
> > > > > > > > > > > > 1. File Input operator emits a watermark each time it
> > > opens
> > > > > and
> > > > > > > > > closes
> > > > > > > > > > a
> > > > > > > > > > > > file. These can be "start file" and "end file"
> > watermarks
> > > > > which
> > > > > > > > > include
> > > > > > > > > > > the
> > > > > > > > > > > > corresponding file names. The "start file" tuple
> should
> > > be
> > > > > sent
> > > > > > > > > before
> > > > > > > > > > > any
> > > > > > > > > > > > of the data from that file flows.
> > > > > > > > > > > > 2. File Input operator can be configured to end the
> > > > > application
> > > > > > > > > after a
> > > > > > > > > > > > single or n scans of the directory (a batch). This is
> > > where
> > > > > the
> > > > > > > > > > operator
> > > > > > > > > > > > emits the final watermark (the end of application
> > control
> > > > > > tuple).
> > > > > > > > > This
> > > > > > > > > > > will
> > > > > > > > > > > > also shutdown the application.
> > > > > > > > > > > > 3. The File output operator handles these control
> > tuples.
> > > > > > "Start
> > > > > > > > > file"
> > > > > > > > > > > > initializes the file name for the incoming tuples.
> "End
> > > > file"
> > > > > > > > > watermark
> > > > > > > > > > > > forces a finalize on that file.
> > > > > > > > > > > >
> > > > > > > > > > > > The user would be able to enable the operators to
> send
> > > only
> > > > > > those
> > > > > > > > > > > > watermarks that are needed in the application. If
> none
> > of
> > > > the
> > > > > > > > options
> > > > > > > > > > are
> > > > > > > > > > > > configured, the operators behave as in a streaming
> > > > > application.
> > > > > > > > > > > >
> > > > > > > > > > > > There are a few challenges in the implementation
> where
> > > the
> > > > > > input
> > > > > > > > > > operator
> > > > > > > > > > > > is partitioned. In this case, the correlation between
> > the
> > > > > > > start/end
> > > > > > > > > > for a
> > > > > > > > > > > > file and the data tuples for that file is lost. Hence
> > we
> > > > need
> > > > > > to
> > > > > > > > > > maintain
> > > > > > > > > > > > the filename as part of each tuple in the pipeline.
> > > > > > > > > > > >
> > > > > > > > > > > > The "start file" and "end file" control tuples in
> this
> > > > > example
> > > > > > > are
> > > > > > > > > > > > temporary names for watermarks. We can have generic
> > > "start
> > > > > > > batch" /
> > > > > > > > > > "end
> > > > > > > > > > > > batch" tuples which could be used for other use cases
> > as
> > > > > well.
> > > > > > > The
> > > > > > > > > > Final
> > > > > > > > > > > > watermark is common and serves the same purpose in
> each
> > > > case.
> > > > > > > > > > > >
> > > > > > > > > > > > Please let me know your thoughts on this.
> > > > > > > > > > > >
> > > > > > > > > > > > ~ Bhupesh
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Jan 18, 2017 at 12:22 AM, Bhupesh Chawda <
> > > > > > > > > > > bhupesh@datatorrent.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Yes, this can be part of operator configuration.
> > Given
> > > > > this,
> > > > > > > for
> > > > > > > > a
> > > > > > > > > > user
> > > > > > > > > > > > to
> > > > > > > > > > > > > define a batch application, would mean configuring
> > the
> > > > > > > connectors
> > > > > > > > > > > (mostly
> > > > > > > > > > > > > the input operator) in the application for the
> > desired
> > > > > > > behavior.
> > > > > > > > > > > > Similarly,
> > > > > > > > > > > > > there can be other use cases that can be achieved
> > other
> > > > > than
> > > > > > > > batch.
> > > > > > > > > > > > >
> > > > > > > > > > > > > We may also need to take care of the following:
> > > > > > > > > > > > > 1. Make sure that the watermarks or control tuples
> > are
> > > > > > > consistent
> > > > > > > > > > > across
> > > > > > > > > > > > > sources. Meaning an HDFS sink should be able to
> > > interpret
> > > > > the
> > > > > > > > > > watermark
> > > > > > > > > > > > > tuple sent out by, say, a JDBC source.
> > > > > > > > > > > > > 2. In addition to I/O connectors, we should also
> look
> > > at
> > > > > the
> > > > > > > need
> > > > > > > > > for
> > > > > > > > > > > > > processing operators to understand some of the
> > control
> > > > > > tuples /
> > > > > > > > > > > > watermarks.
> > > > > > > > > > > > > For example, we may want to reset the operator
> > behavior
> > > > on
> > > > > > > > arrival
> > > > > > > > > of
> > > > > > > > > > > > some
> > > > > > > > > > > > > watermark tuple.
> > > > > > > > > > > > >
> > > > > > > > > > > > > ~ Bhupesh
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, Jan 17, 2017 at 9:59 PM, Thomas Weise <
> > > > > > thw@apache.org>
> > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > >> The HDFS source can operate in two modes, bounded
> or
> > > > > > > unbounded.
> > > > > > > > If
> > > > > > > > > > you
> > > > > > > > > > > > >> scan
> > > > > > > > > > > > >> only once, then it should emit the final watermark
> > > after
> > > > > it
> > > > > > is
> > > > > > > > > done.
> > > > > > > > > > > > >> Otherwise it would emit watermarks based on a
> policy
> > > > > (files
> > > > > > > > names
> > > > > > > > > > > etc.).
> > > > > > > > > > > > >> The mechanism to generate the marks may depend on
> > the
> > > > type
> > > > > > of
> > > > > > > > > source
> > > > > > > > > > > and
> > > > > > > > > > > > >> the user needs to be able to influence/configure
> it.
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> Thomas
> > > > > > > > > > > > >>
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> On Tue, Jan 17, 2017 at 5:03 AM, Bhupesh Chawda <
> > > > > > > > > > > > bhupesh@datatorrent.com>
> > > > > > > > > > > > >> wrote:
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> > Hi Thomas,
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > I am not sure that I completely understand your
> > > > > > suggestion.
> > > > > > > > Are
> > > > > > > > > > you
> > > > > > > > > > > > >> > suggesting to broaden the scope of the proposal
> to
> > > > treat
> > > > > > all
> > > > > > > > > > sources
> > > > > > > > > > > > as
> > > > > > > > > > > > >> > bounded as well as unbounded?
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > In case of Apex, we treat all sources as
> unbounded
> > > > > > sources.
> > > > > > > > Even
> > > > > > > > > > > > bounded
> > > > > > > > > > > > >> > sources like HDFS file source is treated as
> > > unbounded
> > > > by
> > > > > > > means
> > > > > > > > > of
> > > > > > > > > > > > >> scanning
> > > > > > > > > > > > >> > the input directory repeatedly.
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > Let's consider HDFS file source for example:
> > > > > > > > > > > > >> > In this case, if we treat it as a bounded
> source,
> > we
> > > > can
> > > > > > > > define
> > > > > > > > > > > hooks
> > > > > > > > > > > > >> which
> > > > > > > > > > > > >> > allows us to detect the end of the file and send
> > the
> > > > > > "final
> > > > > > > > > > > > watermark".
> > > > > > > > > > > > >> We
> > > > > > > > > > > > >> > could also consider HDFS file source as a
> > streaming
> > > > > source
> > > > > > > and
> > > > > > > > > > > define
> > > > > > > > > > > > >> hooks
> > > > > > > > > > > > >> > which send watermarks based on different kinds
> of
> > > > > windows.
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > Please correct me if I misunderstand.
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > ~ Bhupesh
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > On Mon, Jan 16, 2017 at 9:23 PM, Thomas Weise <
> > > > > > > thw@apache.org
> > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > > Bhupesh,
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > Please see how that can be solved in a unified
> > way
> > > > > using
> > > > > > > > > windows
> > > > > > > > > > > and
> > > > > > > > > > > > >> > > watermarks. It is bounded data vs. unbounded
> > data.
> > > > In
> > > > > > Beam
> > > > > > > > for
> > > > > > > > > > > > >> example,
> > > > > > > > > > > > >> > you
> > > > > > > > > > > > >> > > can use the "global window" and the final
> > > watermark
> > > > to
> > > > > > > > > > accomplish
> > > > > > > > > > > > what
> > > > > > > > > > > > >> > you
> > > > > > > > > > > > >> > > are looking for. Batch is just a special case
> of
> > > > > > streaming
> > > > > > > > > where
> > > > > > > > > > > the
> > > > > > > > > > > > >> > source
> > > > > > > > > > > > >> > > emits the final watermark.
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > Thanks,
> > > > > > > > > > > > >> > > Thomas
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > On Mon, Jan 16, 2017 at 1:02 AM, Bhupesh
> Chawda
> > <
> > > > > > > > > > > > >> bhupesh@datatorrent.com
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > wrote:
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > > Yes, if the user needs to develop a batch
> > > > > application,
> > > > > > > > then
> > > > > > > > > > > batch
> > > > > > > > > > > > >> aware
> > > > > > > > > > > > >> > > > operators need to be used in the
> application.
> > > > > > > > > > > > >> > > > The nature of the application is mostly
> > > controlled
> > > > > by
> > > > > > > the
> > > > > > > > > > input
> > > > > > > > > > > > and
> > > > > > > > > > > > >> the
> > > > > > > > > > > > >> > > > output operators used in the application.
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > For example, consider an application which
> > needs
> > > > to
> > > > > > > filter
> > > > > > > > > > > records
> > > > > > > > > > > > >> in a
> > > > > > > > > > > > >> > > > input file and store the filtered records in
> > > > another
> > > > > > > file.
> > > > > > > > > The
> > > > > > > > > > > > >> nature
> > > > > > > > > > > > >> > of
> > > > > > > > > > > > >> > > > this app is to end once the entire file is
> > > > > processed.
> > > > > > > > > > Following
> > > > > > > > > > > > >> things
> > > > > > > > > > > > >> > > are
> > > > > > > > > > > > >> > > > expected of the application:
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > >    1. Once the input data is over, finalize
> > the
> > > > > output
> > > > > > > > file
> > > > > > > > > > from
> > > > > > > > > > > > >> .tmp
> > > > > > > > > > > > >> > > >    files. - Responsibility of output
> operator
> > > > > > > > > > > > >> > > >    2. End the application, once the data is
> > read
> > > > and
> > > > > > > > > > processed -
> > > > > > > > > > > > >> > > >    Responsibility of input operator
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > These functions are essential to allow the
> > user
> > > to
> > > > > do
> > > > > > > > higher
> > > > > > > > > > > level
> > > > > > > > > > > > >> > > > operations like scheduling or running a
> > workflow
> > > > of
> > > > > > > batch
> > > > > > > > > > > > >> applications.
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > I am not sure about intermediate
> (processing)
> > > > > > operators,
> > > > > > > > as
> > > > > > > > > > > there
> > > > > > > > > > > > >> is no
> > > > > > > > > > > > >> > > > change in their functionality for batch use
> > > cases.
> > > > > > > > Perhaps,
> > > > > > > > > > > > allowing
> > > > > > > > > > > > >> > > > running multiple batches in a single
> > application
> > > > may
> > > > > > > > require
> > > > > > > > > > > > similar
> > > > > > > > > > > > >> > > > changes in processing operators as well.
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > ~ Bhupesh
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > On Mon, Jan 16, 2017 at 2:19 PM, Priyanka
> > > Gugale <
> > > > > > > > > > > > priyag@apache.org
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > > > wrote:
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > > Will it make an impression on user that,
> if
> > he
> > > > > has a
> > > > > > > > batch
> > > > > > > > > > > > >> usecase he
> > > > > > > > > > > > >> > > has
> > > > > > > > > > > > >> > > > > to use batch aware operators only? If so,
> is
> > > > that
> > > > > > what
> > > > > > > > we
> > > > > > > > > > > > expect?
> > > > > > > > > > > > >> I
> > > > > > > > > > > > >> > am
> > > > > > > > > > > > >> > > > not
> > > > > > > > > > > > >> > > > > aware of how do we implement batch
> scenario
> > so
> > > > > this
> > > > > > > > might
> > > > > > > > > > be a
> > > > > > > > > > > > >> basic
> > > > > > > > > > > > >> > > > > question.
> > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > >> > > > > -Priyanka
> > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > >> > > > > On Mon, Jan 16, 2017 at 12:02 PM, Bhupesh
> > > > Chawda <
> > > > > > > > > > > > >> > > > bhupesh@datatorrent.com>
> > > > > > > > > > > > >> > > > > wrote:
> > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > >> > > > > > Hi All,
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > > While design / implementation for custom
> > > > control
> > > > > > > > tuples
> > > > > > > > > is
> > > > > > > > > > > > >> > ongoing, I
> > > > > > > > > > > > >> > > > > > thought it would be a good idea to
> > consider
> > > > its
> > > > > > > > > usefulness
> > > > > > > > > > > in
> > > > > > > > > > > > >> one
> > > > > > > > > > > > >> > of
> > > > > > > > > > > > >> > > > the
> > > > > > > > > > > > >> > > > > > use cases -  batch applications.
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > > This is a proposal to adapt / extend
> > > existing
> > > > > > > > operators
> > > > > > > > > in
> > > > > > > > > > > the
> > > > > > > > > > > > >> > Apache
> > > > > > > > > > > > >> > > > > Apex
> > > > > > > > > > > > >> > > > > > Malhar library so that it is easy to use
> > > them
> > > > in
> > > > > > > batch
> > > > > > > > > use
> > > > > > > > > > > > >> cases.
> > > > > > > > > > > > >> > > > > > Naturally, this would be applicable for
> > > only a
> > > > > > > subset
> > > > > > > > of
> > > > > > > > > > > > >> operators
> > > > > > > > > > > > >> > > like
> > > > > > > > > > > > >> > > > > > File, JDBC and NoSQL databases.
> > > > > > > > > > > > >> > > > > > For example, for a file based store,
> (say
> > > HDFS
> > > > > > > store),
> > > > > > > > > we
> > > > > > > > > > > > could
> > > > > > > > > > > > >> > have
> > > > > > > > > > > > >> > > > > > FileBatchInput and FileBatchOutput
> > operators
> > > > > which
> > > > > > > > allow
> > > > > > > > > > > easy
> > > > > > > > > > > > >> > > > integration
> > > > > > > > > > > > >> > > > > > into a batch application. These
> operators
> > > > would
> > > > > be
> > > > > > > > > > extended
> > > > > > > > > > > > from
> > > > > > > > > > > > >> > > their
> > > > > > > > > > > > >> > > > > > existing implementations and would be
> > "Batch
> > > > > > Aware",
> > > > > > > > in
> > > > > > > > > > that
> > > > > > > > > > > > >> they
> > > > > > > > > > > > >> > may
> > > > > > > > > > > > >> > > > > > understand the meaning of some specific
> > > > control
> > > > > > > tuples
> > > > > > > > > > that
> > > > > > > > > > > > flow
> > > > > > > > > > > > >> > > > through
> > > > > > > > > > > > >> > > > > > the DAG. Start batch and end batch seem
> to
> > > be
> > > > > the
> > > > > > > > > obvious
> > > > > > > > > > > > >> > candidates
> > > > > > > > > > > > >> > > > that
> > > > > > > > > > > > >> > > > > > come to mind. On receipt of such control
> > > > tuples,
> > > > > > > they
> > > > > > > > > may
> > > > > > > > > > > try
> > > > > > > > > > > > to
> > > > > > > > > > > > >> > > modify
> > > > > > > > > > > > >> > > > > the
> > > > > > > > > > > > >> > > > > > behavior of the operator - to
> reinitialize
> > > > some
> > > > > > > > metrics
> > > > > > > > > or
> > > > > > > > > > > > >> finalize
> > > > > > > > > > > > >> > > an
> > > > > > > > > > > > >> > > > > > output file for example.
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > > We can discuss the potential control
> > tuples
> > > > and
> > > > > > > > actions
> > > > > > > > > in
> > > > > > > > > > > > >> detail,
> > > > > > > > > > > > >> > > but
> > > > > > > > > > > > >> > > > > > first I would like to understand the
> views
> > > of
> > > > > the
> > > > > > > > > > community
> > > > > > > > > > > > for
> > > > > > > > > > > > >> > this
> > > > > > > > > > > > >> > > > > > proposal.
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > > ~ Bhupesh
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >>
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by David Yan <da...@gmail.com>.
There is a discussion in the Flink mailing list about key-based watermarks.
I think it's relevant to our use case here.
https://lists.apache.org/thread.html/2b90d5b1d5e2654212cfbbcc6510ef424bbafc4fadb164bd5aff9216@%3Cdev.flink.apache.org%3E
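
To relate it to our case: a key-based watermark would essentially be a control
tuple that carries the key (here, the file name) along with the usual timestamp
or sequence, something like this (illustrative only, not an existing Apex or
Flink class):

// Illustrative only: a watermark scoped to one key (e.g. one file), so
// completeness can be tracked per key instead of per global window.
public class KeyedWatermark {
  private final String key;       // e.g. the file name
  private final long sequence;    // event time or file sequence number

  public KeyedWatermark(String key, long sequence) {
    this.key = key;
    this.sequence = sequence;
  }

  public String getKey() { return key; }
  public long getSequence() { return sequence; }
}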

David

On Tue, Feb 28, 2017 at 2:13 AM, Bhupesh Chawda <bh...@datatorrent.com>
wrote:

> Hi David,
>
> If using time window does not seem appropriate, we can have another class
> which is more suited for such sequential and distinct windows. Perhaps, a
> CustomWindow option can be introduced which takes in a window id. The
> purpose of this window option could be to translate the window id into
> appropriate timestamps.
>
> Another option would be to go with a custom timestampExtractor for such
> tuples which translates each unique file name to a distinct timestamp
> while using time windows in the windowed operator.
>
> ~ Bhupesh
>
>
> _______________________________________________________
>
> Bhupesh Chawda
>
> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>
> www.datatorrent.com  |  apex.apache.org
>
>
>

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by Bhupesh Chawda <bh...@datatorrent.com>.
Hi,

I have opened a review-only PR: https://github.com/apache/apex-malhar/pull/567
The PR includes a number of other changes as well:

   1. Pointing to apex-core 3.6.0-SNAPSHOT for Control Tuple Support
   2. Modifying Windowed Operator to use Custom Control Tuples instead of
   Control port
   3. Changes in AbstractFileInputOperator and AbstractFileOutputOperator
   for supporting file based batches

Please ignore items 1 and 2 for the purposes of this discussion.

Here are the changes in the abstract File input and File output operators:

https://github.com/apache/apex-malhar/pull/567/files#diff-8b18d6df947f93d70436ad9b32645463

https://github.com/apache/apex-malhar/pull/567/files#diff-bfe551fb062ee68dd8d0215280039f84
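
At a high level, the changes amount to something like the following. This is
only a rough sketch to convey the intent; the class and method names below are
placeholders, not the actual apex-core control tuple API or the code in the PR:

// Hypothetical watermark tuples used only for illustration; the real ones
// come from the control tuple support in apex-core.
class StartFileWatermark {
  final String fileName;
  StartFileWatermark(String fileName) { this.fileName = fileName; }
}

class EndFileWatermark {
  final String fileName;
  EndFileWatermark(String fileName) { this.fileName = fileName; }
}

// Input side: emit a watermark when a file is opened or closed, and a final
// watermark once the configured number of directory scans is complete.
class BatchFileInputSketch {
  void onFileOpened(String fileName) { emitControl(new StartFileWatermark(fileName)); }
  void onFileClosed(String fileName) { emitControl(new EndFileWatermark(fileName)); }
  void onScanLimitReached() { emitFinalWatermark(); }

  void emitControl(Object watermark) { /* deliver on the control channel */ }
  void emitFinalWatermark() { /* end-of-application control tuple */ }
}

// Output side: "start file" fixes the current output file name, "end file"
// finalizes it (moves the .tmp part file to its final name).
class BatchFileOutputSketch {
  private String currentFile;

  void onControl(Object watermark) {
    if (watermark instanceof StartFileWatermark) {
      currentFile = ((StartFileWatermark) watermark).fileName;
    } else if (watermark instanceof EndFileWatermark) {
      finalizeFile(((EndFileWatermark) watermark).fileName);
      currentFile = null;
    }
  }

  void finalizeFile(String fileName) { /* rename part files, flush state */ }
}

Please refer to the diffs above for how this is actually wired into the
existing abstract operators.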

I will be refining this example to include the notion of something like
SequenceWindow as suggested by David. This will help avoid the confusion
about timestamps being used to demarcate files.
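
Something along these lines is what I have in mind for SequenceWindow (just a
sketch, not the final API and not the existing Malhar Window / TimeWindow
classes): a window identified purely by a monotonically increasing id, with no
time semantics attached:

// Sketch only: a window identified by a sequence id (e.g. the n-th file),
// deliberately carrying no wall-clock or event-time meaning.
public class SequenceWindow {
  private final long sequenceId;

  public SequenceWindow(long sequenceId) {
    this.sequenceId = sequenceId;
  }

  public long getSequenceId() {
    return sequenceId;
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof SequenceWindow && ((SequenceWindow) o).sequenceId == sequenceId;
  }

  @Override
  public int hashCode() {
    return Long.hashCode(sequenceId);
  }
}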

Please also disregard the earlier suggestion about using a timestamp extractor,
since it cannot be used here: it is just a function over the data tuples. The
real windows will be defined by the control tuples (watermarks).
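
To spell out why it cannot work: an extractor is just a pure function of the
data tuple, roughly of this shape (sketch only), and a plain line read from a
file carries nothing that identifies which file or batch it came from, so the
boundary has to come from the watermarks:

// Sketch of what a timestamp extractor amounts to: a function of the data
// tuple alone. A line of text gives it nothing to derive the file/batch from.
interface TimestampExtractor<T> {
  long extractTimestamp(T tuple);
}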

~ Bhupesh


_______________________________________________________

Bhupesh Chawda

E: bhupesh@datatorrent.com | Twitter: @bhupeshsc

www.datatorrent.com  |  apex.apache.org



On Tue, Feb 28, 2017 at 4:11 PM, Bhupesh Chawda <bh...@datatorrent.com>
wrote:

> Let me work on the changes in abstract classes for File Input and File
> Output and come up with a review only PR, which will help understand the
> case better. The same thing can then be extended to other connectors like
> JDBC and NoSQL operators.
>
> ​~ Bhupesh​
>
>
> _______________________________________________________
>
> Bhupesh Chawda
>
> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>
> www.datatorrent.com  |  apex.apache.org
>
>
>
> On Tue, Feb 28, 2017 at 3:43 PM, Bhupesh Chawda <bh...@datatorrent.com>
> wrote:
>
>> Hi David,
>>
>> If using time window does not seem appropriate, we can have another class
>> which is more suited for such sequential and distinct windows. Perhaps, a
>> CustomWindow option can be introduced which takes in a window id. The
>> purpose of this window option could be to translate the window id into
>> appropriate timestamps.
>>
>> Another option would be to go with a custom timestampExtractor for such
>> tuples which translates each unique file name to a distinct timestamp
>> while using time windows in the windowed operator.
>>
>> ~ Bhupesh
>>
>>
>> _______________________________________________________
>>
>> Bhupesh Chawda
>>
>> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>>
>> www.datatorrent.com  |  apex.apache.org
>>
>>
>>
>> On Tue, Feb 28, 2017 at 12:28 AM, David Yan <da...@gmail.com> wrote:
>>
>>> I now see your rationale on putting the filename in the window.
>>> As far as I understand, the reasons why the filename is not part of the
>>> key
>>> and the Global Window is not used are:
>>>
>>> 1) The files are processed in sequence, not in parallel
>>> 2) The windowed operator should not keep the state associated with the
>>> file
>>> when the processing of the file is done
>>> 3) The trigger should be fired for the file when a file is done
>>> processing.
>>>
>>> However, if the file is just a sequence has nothing to do with a
>>> timestamp,
>>> assigning a timestamp to a file is not an intuitive thing to do and would
>>> just create confusions to the users, especially when it's used as an
>>> example for new users.
>>>
>>> How about having a separate class called SequenceWindow? And perhaps
>>> TimeWindow can inherit from it?
>>>
>>> David
>>>
>>> On Mon, Feb 27, 2017 at 8:58 AM, Thomas Weise <th...@apache.org> wrote:
>>>
>>> > On Mon, Feb 27, 2017 at 8:50 AM, Bhupesh Chawda <
>>> bhupesh@datatorrent.com>
>>> > wrote:
>>> >
>>> > > I think my comments related to count based windows might be causing
>>> > > confusion. Let's not discuss count based scenarios for now.
>>> > >
>>> > > Just want to make sure we are on the same page wrt. the "each file
>>> is a
>>> > > batch" use case. As mentioned by Thomas, the each tuple from the same
>>> > file
>>> > > has the same timestamp (which is just a sequence number) and that
>>> helps
>>> > > keep tuples from each file in a separate window.
>>> > >
>>> >
>>> > Yes, in this case it is a sequence number, but it could be a time stamp
>>> > also, depending on the file naming convention. And if it was event time
>>> > processing, the watermark would be derived from records within the
>>> file.
>>> >
>>> > Agreed, the source should have a mechanism to control the time stamp
>>> > extraction along with everything else pertaining to the watermark
>>> > generation.
>>> >
>>> >
>>> > > We could also implement a "timestampExtractor" interface to identify
>>> the
>>> > > timestamp (sequence number) for a file.
>>> > >
>>> > > ~ Bhupesh
>>> > >
>>> > >
>>> > > _______________________________________________________
>>> > >
>>> > > Bhupesh Chawda
>>> > >
>>> > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>>> > >
>>> > > www.datatorrent.com  |  apex.apache.org
>>> > >
>>> > >
>>> > >
>>> > > On Mon, Feb 27, 2017 at 9:52 PM, Thomas Weise <th...@apache.org>
>>> wrote:
>>> > >
>>> > > > I don't think this is a use case for count based window.
>>> > > >
>>> > > > We have multiple files that are retrieved in a sequence and there
>>> is no
>>> > > > knowledge of the number of records per file. The requirement is to
>>> > > > aggregate each file separately and emit the aggregate when the
>>> file is
>>> > > read
>>> > > > fully. There is no concept of "end of something" for an individual
>>> key
>>> > > and
>>> > > > global window isn't applicable.
>>> > > >
>>> > > > However, as already explained and implemented by Bhupesh, this can
>>> be
>>> > > > solved using watermark and window (in this case the window
>>> timestamp
>>> > > isn't
>>> > > > a timestamp, but a file sequence, but that doesn't matter.
>>> > > >
>>> > > > Thomas
>>> > > >
>>> > > >
>>> > > > On Mon, Feb 27, 2017 at 8:05 AM, David Yan <da...@gmail.com>
>>> wrote:
>>> > > >
>>> > > > > I don't think this is the way to go. Global Window only means the
>>> > > > timestamp
>>> > > > > does not matter (or that there is no timestamp). It does not
>>> > > necessarily
>>> > > > > mean it's a large batch. Unless there is some notion of event
>>> time
>>> > for
>>> > > > each
>>> > > > > file, you don't want to embed the file into the window itself.
>>> > > > >
>>> > > > > If you want the result broken up by file name, and if the files
>>> are
>>> > to
>>> > > be
>>> > > > > processed in parallel, I think making the file name be part of
>>> the
>>> > key
>>> > > is
>>> > > > > the way to go. I think it's very confusing if we somehow make the
>>> > file
>>> > > to
>>> > > > > be part of the window.
>>> > > > >
>>> > > > > For count-based window, it's not implemented yet and you're
>>> welcome
>>> > to
>>> > > > add
>>> > > > > that feature. In case of count-based windows, there would be no
>>> > notion
>>> > > of
>>> > > > > time and you probably only trigger at the end of each window. In
>>> the
>>> > > case
>>> > > > > of count-based windows, the watermark only matters for batch
>>> since
>>> > you
>>> > > > need
>>> > > > > a way to know when the batch has ended (if the count is 10, the
>>> > number
>>> > > of
>>> > > > > tuples in the batch is let's say 105, you need a way to end the
>>> last
>>> > > > window
>>> > > > > with 5 tuples).
>>> > > > >
>>> > > > > David
>>> > > > >
>>> > > > > On Mon, Feb 27, 2017 at 2:41 AM, Bhupesh Chawda <
>>> > > bhupesh@datatorrent.com
>>> > > > >
>>> > > > > wrote:
>>> > > > >
>>> > > > > > Hi David,
>>> > > > > >
>>> > > > > > Thanks for your comments.
>>> > > > > >
>>> > > > > > The wordcount example that I created based on the windowed
>>> operator
>>> > > > does
>>> > > > > > processing of word counts per file (each file as a separate
>>> batch),
>>> > > > i.e.
>>> > > > > > process counts for each file and dump into separate files.
>>> > > > > > As I understand Global window is for one large batch; i.e. all
>>> > > incoming
>>> > > > > > data falls into the same batch. This could not be processed
>>> using
>>> > > > > > GlobalWindow option as we need more than one windows. In this
>>> > case, I
>>> > > > > > configured the windowed operator to have time windows of 1ms
>>> each
>>> > and
>>> > > > > > passed data for each file with increasing timestamps: (file1,
>>> 1),
>>> > > > (file2,
>>> > > > > > 2) and so on. Is there a better way of handling this scenario?
>>> > > > > >
>>> > > > > > Regarding (2 - count based windows), I think there is a trigger
>>> > > option
>>> > > > to
>>> > > > > > process count based windows. In case I want to process every
>>> 1000
>>> > > > tuples
>>> > > > > as
>>> > > > > > a batch, I could set the Trigger option to CountTrigger with
>>> the
>>> > > > > > accumulation set to Discarding. Is this correct?
>>> > > > > >
>>> > > > > > I agree that (4. Final Watermark) can be done using Global
>>> window.
>>> > > > > >
>>> > > > > > ​~ Bhupesh​
>>> > > > > >
>>> > > > > > _______________________________________________________
>>> > > > > >
>>> > > > > > Bhupesh Chawda
>>> > > > > >
>>> > > > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>>> > > > > >
>>> > > > > > www.datatorrent.com  |  apex.apache.org
>>> > > > > >
>>> > > > > >
>>> > > > > >
>>> > > > > > On Mon, Feb 27, 2017 at 12:18 PM, David Yan <
>>> davidyan@gmail.com>
>>> > > > wrote:
>>> > > > > >
>>> > > > > > > I'm worried that we are making the watermark concept too
>>> > > complicated.
>>> > > > > > >
>>> > > > > > > Watermarks should simply just tell you what windows can be
>>> > > considered
>>> > > > > > > complete.
>>> > > > > > >
>>> > > > > > > Point 2 is basically a count-based window. Watermarks do not
>>> > play a
>>> > > > > role
>>> > > > > > > here because the window is always complete at the n-th tuple.
>>> > > > > > >
>>> > > > > > > If I understand correctly, point 3 is for batch processing of
>>> > > files.
>>> > > > > > Unless
>>> > > > > > > the files contain timed events, it sounds to be that this
>>> can be
>>> > > > > achieved
>>> > > > > > > with just a Global Window. For signaling EOF, a watermark
>>> with a
>>> > > > > > +infinity
>>> > > > > > > timestamp can be used so that triggers will be fired upon
>>> receipt
>>> > > of
>>> > > > > that
>>> > > > > > > watermark.
>>> > > > > > >
>>> > > > > > > For point 4, just like what I mentioned above, can be
>>> achieved
>>> > > with a
>>> > > > > > > watermark with a +infinity timestamp.
>>> > > > > > >
>>> > > > > > > David
>>> > > > > > >
>>> > > > > > >
>>> > > > > > >
>>> > > > > > >
>>> > > > > > > On Sat, Feb 18, 2017 at 8:04 AM, Bhupesh Chawda <
>>> > > > > bhupesh@datatorrent.com
>>> > > > > > >
>>> > > > > > > wrote:
>>> > > > > > >
>>> > > > > > > > Hi Thomas,
>>> > > > > > > >
>>> > > > > > > > For an input operator which is supposed to generate
>>> watermarks
>>> > > for
>>> > > > > > > > downstream operators, I can think about the following
>>> > watermarks
>>> > > > that
>>> > > > > > the
>>> > > > > > > > operator can emit:
>>> > > > > > > > 1. Time based watermarks (the high watermark / low
>>> watermark)
>>> > > > > > > > 2. Number of tuple based watermarks (Every n tuples)
>>> > > > > > > > 3. File based watermarks (Start file, end file)
>>> > > > > > > > 4. Final watermark
>>> > > > > > > >
>>> > > > > > > > File based watermarks seem to be applicable for batch (file
>>> > > based)
>>> > > > as
>>> > > > > > > well,
>>> > > > > > > > and hence I thought of looking at these first. Does this
>>> seem
>>> > to
>>> > > be
>>> > > > > in
>>> > > > > > > line
>>> > > > > > > > with the thought process?
>>> > > > > > > >
>>> > > > > > > > ~ Bhupesh
>>> > > > > > > >
>>> > > > > > > >
>>> > > > > > > >
>>> > > > > > > > _______________________________________________________
>>> > > > > > > >
>>> > > > > > > > Bhupesh Chawda
>>> > > > > > > >
>>> > > > > > > > Software Engineer
>>> > > > > > > >
>>> > > > > > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>>> > > > > > > >
>>> > > > > > > > www.datatorrent.com  |  apex.apache.org
>>> > > > > > > >
>>> > > > > > > >
>>> > > > > > > >
>>> > > > > > > > On Thu, Feb 16, 2017 at 10:37 AM, Thomas Weise <
>>> thw@apache.org
>>> > >
>>> > > > > wrote:
>>> > > > > > > >
>>> > > > > > > > > I don't think this should be designed based on a
>>> simplistic
>>> > > file
>>> > > > > > > > > input-output scenario. It would be good to include a
>>> stateful
>>> > > > > > > > > transformation based on event time.
>>> > > > > > > > >
>>> > > > > > > > > More complex pipelines contain stateful transformations
>>> that
>>> > > > depend
>>> > > > > > on
>>> > > > > > > > > windowing and watermarks. I think we need a watermark
>>> concept
>>> > > > that
>>> > > > > is
>>> > > > > > > > based
>>> > > > > > > > > on progress in event time (or other monotonic increasing
>>> > > > sequence)
>>> > > > > > that
>>> > > > > > > > > other operators can generically work with.
>>> > > > > > > > >
>>> > > > > > > > > Note that even file input in many cases can produce time
>>> > based
>>> > > > > > > > watermarks,
>>> > > > > > > > > for example when you read part files that are bound by
>>> event
>>> > > > time.
>>> > > > > > > > >
>>> > > > > > > > > Thanks,
>>> > > > > > > > > Thomas
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > On Wed, Feb 15, 2017 at 4:02 AM, Bhupesh Chawda <
>>> > > > > > > bhupesh@datatorrent.com
>>> > > > > > > > >
>>> > > > > > > > > wrote:
>>> > > > > > > > >
>>> > > > > > > > > > For better understanding the use case for control
>>> tuples in
>>> > > > > batch,
>>> > > > > > ​I
>>> > > > > > > > am
>>> > > > > > > > > > creating a prototype for a batch application using File
>>> > Input
>>> > > > and
>>> > > > > > > File
>>> > > > > > > > > > Output operators.
>>> > > > > > > > > >
>>> > > > > > > > > > To enable basic batch processing for File IO
>>> operators, I
>>> > am
>>> > > > > > > proposing
>>> > > > > > > > > the
>>> > > > > > > > > > following changes to File input and output operators:
>>> > > > > > > > > > 1. File Input operator emits a watermark each time it
>>> opens
>>> > > and
>>> > > > > > > closes
>>> > > > > > > > a
>>> > > > > > > > > > file. These can be "start file" and "end file"
>>> watermarks
>>> > > which
>>> > > > > > > include
>>> > > > > > > > > the
>>> > > > > > > > > > corresponding file names. The "start file" tuple
>>> should be
>>> > > sent
>>> > > > > > > before
>>> > > > > > > > > any
>>> > > > > > > > > > of the data from that file flows.
>>> > > > > > > > > > 2. File Input operator can be configured to end the
>>> > > application
>>> > > > > > > after a
>>> > > > > > > > > > single or n scans of the directory (a batch). This is
>>> where
>>> > > the
>>> > > > > > > > operator
>>> > > > > > > > > > emits the final watermark (the end of application
>>> control
>>> > > > tuple).
>>> > > > > > > This
>>> > > > > > > > > will
>>> > > > > > > > > > also shutdown the application.
>>> > > > > > > > > > 3. The File output operator handles these control
>>> tuples.
>>> > > > "Start
>>> > > > > > > file"
>>> > > > > > > > > > initializes the file name for the incoming tuples. "End
>>> > file"
>>> > > > > > > watermark
>>> > > > > > > > > > forces a finalize on that file.
>>> > > > > > > > > >
>>> > > > > > > > > > The user would be able to enable the operators to send
>>> only
>>> > > > those
>>> > > > > > > > > > watermarks that are needed in the application. If none
>>> of
>>> > the
>>> > > > > > options
>>> > > > > > > > are
>>> > > > > > > > > > configured, the operators behave as in a streaming
>>> > > application.
>>> > > > > > > > > >
>>> > > > > > > > > > There are a few challenges in the implementation where
>>> the
>>> > > > input
>>> > > > > > > > operator
>>> > > > > > > > > > is partitioned. In this case, the correlation between
>>> the
>>> > > > > start/end
>>> > > > > > > > for a
>>> > > > > > > > > > file and the data tuples for that file is lost. Hence
>>> we
>>> > need
>>> > > > to
>>> > > > > > > > maintain
>>> > > > > > > > > > the filename as part of each tuple in the pipeline.
>>> > > > > > > > > >
>>> > > > > > > > > > The "start file" and "end file" control tuples in this
>>> > > example
>>> > > > > are
>>> > > > > > > > > > temporary names for watermarks. We can have generic
>>> "start
>>> > > > > batch" /
>>> > > > > > > > "end
>>> > > > > > > > > > batch" tuples which could be used for other use cases
>>> as
>>> > > well.
>>> > > > > The
>>> > > > > > > > Final
>>> > > > > > > > > > watermark is common and serves the same purpose in each
>>> > case.
>>> > > > > > > > > >
>>> > > > > > > > > > Please let me know your thoughts on this.
>>> > > > > > > > > >
>>> > > > > > > > > > ~ Bhupesh
>>> > > > > > > > > >
>>> > > > > > > > > >
>>> > > > > > > > > >
>>> > > > > > > > > > On Wed, Jan 18, 2017 at 12:22 AM, Bhupesh Chawda <
>>> > > > > > > > > bhupesh@datatorrent.com>
>>> > > > > > > > > > wrote:
>>> > > > > > > > > >
>>> > > > > > > > > > > Yes, this can be part of operator configuration.
>>> Given
>>> > > this,
>>> > > > > for
>>> > > > > > a
>>> > > > > > > > user
>>> > > > > > > > > > to
>>> > > > > > > > > > > define a batch application, would mean configuring
>>> the
>>> > > > > connectors
>>> > > > > > > > > (mostly
>>> > > > > > > > > > > the input operator) in the application for the
>>> desired
>>> > > > > behavior.
>>> > > > > > > > > > Similarly,
>>> > > > > > > > > > > there can be other use cases that can be achieved
>>> other
>>> > > than
>>> > > > > > batch.
>>> > > > > > > > > > >
>>> > > > > > > > > > > We may also need to take care of the following:
>>> > > > > > > > > > > 1. Make sure that the watermarks or control tuples
>>> are
>>> > > > > consistent
>>> > > > > > > > > across
>>> > > > > > > > > > > sources. Meaning an HDFS sink should be able to
>>> interpret
>>> > > the
>>> > > > > > > > watermark
>>> > > > > > > > > > > tuple sent out by, say, a JDBC source.
>>> > > > > > > > > > > 2. In addition to I/O connectors, we should also
>>> look at
>>> > > the
>>> > > > > need
>>> > > > > > > for
>>> > > > > > > > > > > processing operators to understand some of the
>>> control
>>> > > > tuples /
>>> > > > > > > > > > watermarks.
>>> > > > > > > > > > > For example, we may want to reset the operator
>>> behavior
>>> > on
>>> > > > > > arrival
>>> > > > > > > of
>>> > > > > > > > > > some
>>> > > > > > > > > > > watermark tuple.
>>> > > > > > > > > > >
>>> > > > > > > > > > > ~ Bhupesh
>>> > > > > > > > > > >
>>> > > > > > > > > > > On Tue, Jan 17, 2017 at 9:59 PM, Thomas Weise <
>>> > > > thw@apache.org>
>>> > > > > > > > wrote:
>>> > > > > > > > > > >
>>> > > > > > > > > > >> The HDFS source can operate in two modes, bounded or
>>> > > > > unbounded.
>>> > > > > > If
>>> > > > > > > > you
>>> > > > > > > > > > >> scan
>>> > > > > > > > > > >> only once, then it should emit the final watermark
>>> after
>>> > > it
>>> > > > is
>>> > > > > > > done.
>>> > > > > > > > > > >> Otherwise it would emit watermarks based on a policy
>>> > > (files
>>> > > > > > names
>>> > > > > > > > > etc.).
>>> > > > > > > > > > >> The mechanism to generate the marks may depend on
>>> the
>>> > type
>>> > > > of
>>> > > > > > > source
>>> > > > > > > > > and
>>> > > > > > > > > > >> the user needs to be able to influence/configure it.
>>> > > > > > > > > > >>
>>> > > > > > > > > > >> Thomas
>>> > > > > > > > > > >>
>>> > > > > > > > > > >>
>>> > > > > > > > > > >> On Tue, Jan 17, 2017 at 5:03 AM, Bhupesh Chawda <
>>> > > > > > > > > > bhupesh@datatorrent.com>
>>> > > > > > > > > > >> wrote:
>>> > > > > > > > > > >>
>>> > > > > > > > > > >> > Hi Thomas,
>>> > > > > > > > > > >> >
>>> > > > > > > > > > >> > I am not sure that I completely understand your
>>> > > > suggestion.
>>> > > > > > Are
>>> > > > > > > > you
>>> > > > > > > > > > >> > suggesting to broaden the scope of the proposal to
>>> > treat
>>> > > > all
>>> > > > > > > > sources
>>> > > > > > > > > > as
>>> > > > > > > > > > >> > bounded as well as unbounded?
>>> > > > > > > > > > >> >
>>> > > > > > > > > > >> > In case of Apex, we treat all sources as unbounded
>>> > > > sources.
>>> > > > > > Even
>>> > > > > > > > > > bounded
>>> > > > > > > > > > >> > sources like HDFS file source is treated as
>>> unbounded
>>> > by
>>> > > > > means
>>> > > > > > > of
>>> > > > > > > > > > >> scanning
>>> > > > > > > > > > >> > the input directory repeatedly.
>>> > > > > > > > > > >> >
>>> > > > > > > > > > >> > Let's consider HDFS file source for example:
>>> > > > > > > > > > >> > In this case, if we treat it as a bounded source,
>>> we
>>> > can
>>> > > > > > define
>>> > > > > > > > > hooks
>>> > > > > > > > > > >> which
>>> > > > > > > > > > >> > allows us to detect the end of the file and send
>>> the
>>> > > > "final
>>> > > > > > > > > > watermark".
>>> > > > > > > > > > >> We
>>> > > > > > > > > > >> > could also consider HDFS file source as a
>>> streaming
>>> > > source
>>> > > > > and
>>> > > > > > > > > define
>>> > > > > > > > > > >> hooks
>>> > > > > > > > > > >> > which send watermarks based on different kinds of
>>> > > windows.
>>> > > > > > > > > > >> >
>>> > > > > > > > > > >> > Please correct me if I misunderstand.
>>> > > > > > > > > > >> >
>>> > > > > > > > > > >> > ~ Bhupesh
>>> > > > > > > > > > >> >
>>> > > > > > > > > > >> >
>>> > > > > > > > > > >> > On Mon, Jan 16, 2017 at 9:23 PM, Thomas Weise <
>>> > > > > thw@apache.org
>>> > > > > > >
>>> > > > > > > > > wrote:
>>> > > > > > > > > > >> >
>>> > > > > > > > > > >> > > Bhupesh,
>>> > > > > > > > > > >> > >
>>> > > > > > > > > > >> > > Please see how that can be solved in a unified
>>> way
>>> > > using
>>> > > > > > > windows
>>> > > > > > > > > and
>>> > > > > > > > > > >> > > watermarks. It is bounded data vs. unbounded
>>> data.
>>> > In
>>> > > > Beam
>>> > > > > > for
>>> > > > > > > > > > >> example,
>>> > > > > > > > > > >> > you
>>> > > > > > > > > > >> > > can use the "global window" and the final
>>> watermark
>>> > to
>>> > > > > > > > accomplish
>>> > > > > > > > > > what
>>> > > > > > > > > > >> > you
>>> > > > > > > > > > >> > > are looking for. Batch is just a special case of
>>> > > > streaming
>>> > > > > > > where
>>> > > > > > > > > the
>>> > > > > > > > > > >> > source
>>> > > > > > > > > > >> > > emits the final watermark.
>>> > > > > > > > > > >> > >
>>> > > > > > > > > > >> > > Thanks,
>>> > > > > > > > > > >> > > Thomas
>>> > > > > > > > > > >> > >
>>> > > > > > > > > > >> > >
>>> > > > > > > > > > >> > > On Mon, Jan 16, 2017 at 1:02 AM, Bhupesh Chawda
>>> <
>>> > > > > > > > > > >> bhupesh@datatorrent.com
>>> > > > > > > > > > >> > >
>>> > > > > > > > > > >> > > wrote:
>>> > > > > > > > > > >> > >
>>> > > > > > > > > > >> > > > Yes, if the user needs to develop a batch
>>> > > application,
>>> > > > > > then
>>> > > > > > > > > batch
>>> > > > > > > > > > >> aware
>>> > > > > > > > > > >> > > > operators need to be used in the application.
>>> > > > > > > > > > >> > > > The nature of the application is mostly
>>> controlled
>>> > > by
>>> > > > > the
>>> > > > > > > > input
>>> > > > > > > > > > and
>>> > > > > > > > > > >> the
>>> > > > > > > > > > >> > > > output operators used in the application.
>>> > > > > > > > > > >> > > >
>>> > > > > > > > > > >> > > > For example, consider an application which
>>> needs
>>> > to
>>> > > > > filter
>>> > > > > > > > > records
>>> > > > > > > > > > >> in a
>>> > > > > > > > > > >> > > > input file and store the filtered records in
>>> > another
>>> > > > > file.
>>> > > > > > > The
>>> > > > > > > > > > >> nature
>>> > > > > > > > > > >> > of
>>> > > > > > > > > > >> > > > this app is to end once the entire file is
>>> > > processed.
>>> > > > > > > > Following
>>> > > > > > > > > > >> things
>>> > > > > > > > > > >> > > are
>>> > > > > > > > > > >> > > > expected of the application:
>>> > > > > > > > > > >> > > >
>>> > > > > > > > > > >> > > >    1. Once the input data is over, finalize
>>> the
>>> > > output
>>> > > > > > file
>>> > > > > > > > from
>>> > > > > > > > > > >> .tmp
>>> > > > > > > > > > >> > > >    files. - Responsibility of output operator
>>> > > > > > > > > > >> > > >    2. End the application, once the data is
>>> read
>>> > and
>>> > > > > > > > processed -
>>> > > > > > > > > > >> > > >    Responsibility of input operator
>>> > > > > > > > > > >> > > >
>>> > > > > > > > > > >> > > > These functions are essential to allow the
>>> user to
>>> > > do
>>> > > > > > higher
>>> > > > > > > > > level
>>> > > > > > > > > > >> > > > operations like scheduling or running a
>>> workflow
>>> > of
>>> > > > > batch
>>> > > > > > > > > > >> applications.
>>> > > > > > > > > > >> > > >
>>> > > > > > > > > > >> > > > I am not sure about intermediate (processing)
>>> > > > operators,
>>> > > > > > as
>>> > > > > > > > > there
>>> > > > > > > > > > >> is no
>>> > > > > > > > > > >> > > > change in their functionality for batch use
>>> cases.
>>> > > > > > Perhaps,
>>> > > > > > > > > > allowing
>>> > > > > > > > > > >> > > > running multiple batches in a single
>>> application
>>> > may
>>> > > > > > require
>>> > > > > > > > > > similar
>>> > > > > > > > > > >> > > > changes in processing operators as well.
>>> > > > > > > > > > >> > > >
>>> > > > > > > > > > >> > > > ~ Bhupesh
>>> > > > > > > > > > >> > > >
>>> > > > > > > > > > >> > > > On Mon, Jan 16, 2017 at 2:19 PM, Priyanka
>>> Gugale <
>>> > > > > > > > > > priyag@apache.org
>>> > > > > > > > > > >> >
>>> > > > > > > > > > >> > > > wrote:
>>> > > > > > > > > > >> > > >
>>> > > > > > > > > > >> > > > > Will it make an impression on user that, if
>>> he
>>> > > has a
>>> > > > > > batch
>>> > > > > > > > > > >> usecase he
>>> > > > > > > > > > >> > > has
>>> > > > > > > > > > >> > > > > to use batch aware operators only? If so, is
>>> > that
>>> > > > what
>>> > > > > > we
>>> > > > > > > > > > expect?
>>> > > > > > > > > > >> I
>>> > > > > > > > > > >> > am
>>> > > > > > > > > > >> > > > not
>>> > > > > > > > > > >> > > > > aware of how do we implement batch scenario
>>> so
>>> > > this
>>> > > > > > might
>>> > > > > > > > be a
>>> > > > > > > > > > >> basic
>>> > > > > > > > > > >> > > > > question.
>>> > > > > > > > > > >> > > > >
>>> > > > > > > > > > >> > > > > -Priyanka
>>> > > > > > > > > > >> > > > >
>>> > > > > > > > > > >> > > > > On Mon, Jan 16, 2017 at 12:02 PM, Bhupesh
>>> > Chawda <
>>> > > > > > > > > > >> > > > bhupesh@datatorrent.com>
>>> > > > > > > > > > >> > > > > wrote:
>>> > > > > > > > > > >> > > > >
>>> > > > > > > > > > >> > > > > > Hi All,
>>> > > > > > > > > > >> > > > > >
>>> > > > > > > > > > >> > > > > > While design / implementation for custom
>>> > control
>>> > > > > > tuples
>>> > > > > > > is
>>> > > > > > > > > > >> > ongoing, I
>>> > > > > > > > > > >> > > > > > thought it would be a good idea to
>>> consider
>>> > its
>>> > > > > > > usefulness
>>> > > > > > > > > in
>>> > > > > > > > > > >> one
>>> > > > > > > > > > >> > of
>>> > > > > > > > > > >> > > > the
>>> > > > > > > > > > >> > > > > > use cases -  batch applications.
>>> > > > > > > > > > >> > > > > >
>>> > > > > > > > > > >> > > > > > This is a proposal to adapt / extend
>>> existing
>>> > > > > > operators
>>> > > > > > > in
>>> > > > > > > > > the
>>> > > > > > > > > > >> > Apache
>>> > > > > > > > > > >> > > > > Apex
>>> > > > > > > > > > >> > > > > > Malhar library so that it is easy to use
>>> them
>>> > in
>>> > > > > batch
>>> > > > > > > use
>>> > > > > > > > > > >> cases.
>>> > > > > > > > > > >> > > > > > Naturally, this would be applicable for
>>> only a
>>> > > > > subset
>>> > > > > > of
>>> > > > > > > > > > >> operators
>>> > > > > > > > > > >> > > like
>>> > > > > > > > > > >> > > > > > File, JDBC and NoSQL databases.
>>> > > > > > > > > > >> > > > > > For example, for a file based store, (say
>>> HDFS
>>> > > > > store),
>>> > > > > > > we
>>> > > > > > > > > > could
>>> > > > > > > > > > >> > have
>>> > > > > > > > > > >> > > > > > FileBatchInput and FileBatchOutput
>>> operators
>>> > > which
>>> > > > > > allow
>>> > > > > > > > > easy
>>> > > > > > > > > > >> > > > integration
>>> > > > > > > > > > >> > > > > > into a batch application. These operators
>>> > would
>>> > > be
>>> > > > > > > > extended
>>> > > > > > > > > > from
>>> > > > > > > > > > >> > > their
>>> > > > > > > > > > >> > > > > > existing implementations and would be
>>> "Batch
>>> > > > Aware",
>>> > > > > > in
>>> > > > > > > > that
>>> > > > > > > > > > >> they
>>> > > > > > > > > > >> > may
>>> > > > > > > > > > >> > > > > > understand the meaning of some specific
>>> > control
>>> > > > > tuples
>>> > > > > > > > that
>>> > > > > > > > > > flow
>>> > > > > > > > > > >> > > > through
>>> > > > > > > > > > >> > > > > > the DAG. Start batch and end batch seem
>>> to be
>>> > > the
>>> > > > > > > obvious
>>> > > > > > > > > > >> > candidates
>>> > > > > > > > > > >> > > > that
>>> > > > > > > > > > >> > > > > > come to mind. On receipt of such control
>>> > tuples,
>>> > > > > they
>>> > > > > > > may
>>> > > > > > > > > try
>>> > > > > > > > > > to
>>> > > > > > > > > > >> > > modify
>>> > > > > > > > > > >> > > > > the
>>> > > > > > > > > > >> > > > > > behavior of the operator - to reinitialize
>>> > some
>>> > > > > > metrics
>>> > > > > > > or
>>> > > > > > > > > > >> finalize
>>> > > > > > > > > > >> > > an
>>> > > > > > > > > > >> > > > > > output file for example.
>>> > > > > > > > > > >> > > > > >
>>> > > > > > > > > > >> > > > > > We can discuss the potential control
>>> tuples
>>> > and
>>> > > > > > actions
>>> > > > > > > in
>>> > > > > > > > > > >> detail,
>>> > > > > > > > > > >> > > but
>>> > > > > > > > > > >> > > > > > first I would like to understand the
>>> views of
>>> > > the
>>> > > > > > > > community
>>> > > > > > > > > > for
>>> > > > > > > > > > >> > this
>>> > > > > > > > > > >> > > > > > proposal.
>>> > > > > > > > > > >> > > > > >
>>> > > > > > > > > > >> > > > > > ~ Bhupesh
>>> > > > > > > > > > >> > > > > >
>>> > > > > > > > > > >> > > > >
>>> > > > > > > > > > >> > > >
>>> > > > > > > > > > >> > >
>>> > > > > > > > > > >> >
>>> > > > > > > > > > >>
>>> > > > > > > > > > >
>>> > > > > > > > > > >
>>> > > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > >
>>> > > > > > >
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>
>>
>

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by Bhupesh Chawda <bh...@datatorrent.com>.
Let me work on the changes to the abstract classes for File Input and File
Output and come up with a review-only PR, which will help in understanding
the case better. The same approach can then be extended to other connectors
such as the JDBC and NoSQL operators.
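
To make the intent concrete, below is a rough sketch of how a batch-aware
file output could react to the proposed "start file" / "end file"
watermarks. The class and method names are only placeholders for
illustration, not the actual Malhar abstract classes that the PR would
change:

import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical sketch only: the class and method names are illustrative
// placeholders, not the existing Malhar abstract classes.
public class BatchAwareFileOutput
{
  private String currentFileName;
  private FileWriter writer;

  // "start file" watermark: remember the file name and open a .tmp part file
  public void onStartFile(String fileName) throws IOException
  {
    currentFileName = fileName;
    writer = new FileWriter(currentFileName + ".tmp");
  }

  // data tuples between the start and end watermarks go to the open .tmp file
  public void onData(String line) throws IOException
  {
    writer.write(line);
    writer.write('\n');
  }

  // "end file" watermark: finalize by closing the .tmp file and renaming it
  public void onEndFile(String fileName) throws IOException
  {
    writer.close();
    Files.move(Paths.get(fileName + ".tmp"), Paths.get(fileName));
    currentFileName = null;
  }
}

The real changes would of course go into the existing abstract file output
operator so that its part-file handling and finalization logic is reused.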

​~ Bhupesh​


_______________________________________________________

Bhupesh Chawda

E: bhupesh@datatorrent.com | Twitter: @bhupeshsc

www.datatorrent.com  |  apex.apache.org



On Tue, Feb 28, 2017 at 3:43 PM, Bhupesh Chawda <bh...@datatorrent.com>
wrote:

> Hi David,
>
> If using time window does not seem appropriate, we can have another class
> which is more suited for such sequential and distinct windows. Perhaps, a
> CustomWindow option can be introduced which takes in a window id. The
> purpose of this window option could be to translate the window id into
> appropriate timestamps.
>
> Another option would be to go with a custom timestampExtractor for such
> tuples which translates the each unique file name to a distinct timestamp
> while using time windows in the windowed operator.
>
> ~ Bhupesh
>
>
> _______________________________________________________
>
> Bhupesh Chawda
>
> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>
> www.datatorrent.com  |  apex.apache.org
>
>
>
> On Tue, Feb 28, 2017 at 12:28 AM, David Yan <da...@gmail.com> wrote:
>
>> I now see your rationale on putting the filename in the window.
>> As far as I understand, the reasons why the filename is not part of the
>> key
>> and the Global Window is not used are:
>>
>> 1) The files are processed in sequence, not in parallel
>> 2) The windowed operator should not keep the state associated with the
>> file
>> when the processing of the file is done
>> 3) The trigger should be fired for the file when a file is done
>> processing.
>>
>> However, if the file is just a sequence has nothing to do with a
>> timestamp,
>> assigning a timestamp to a file is not an intuitive thing to do and would
>> just create confusions to the users, especially when it's used as an
>> example for new users.
>>
>> How about having a separate class called SequenceWindow? And perhaps
>> TimeWindow can inherit from it?
>>
>> David
>>
>> On Mon, Feb 27, 2017 at 8:58 AM, Thomas Weise <th...@apache.org> wrote:
>>
>> > On Mon, Feb 27, 2017 at 8:50 AM, Bhupesh Chawda <
>> bhupesh@datatorrent.com>
>> > wrote:
>> >
>> > > I think my comments related to count based windows might be causing
>> > > confusion. Let's not discuss count based scenarios for now.
>> > >
>> > > Just want to make sure we are on the same page wrt. the "each file is
>> a
>> > > batch" use case. As mentioned by Thomas, the each tuple from the same
>> > file
>> > > has the same timestamp (which is just a sequence number) and that
>> helps
>> > > keep tuples from each file in a separate window.
>> > >
>> >
>> > Yes, in this case it is a sequence number, but it could be a time stamp
>> > also, depending on the file naming convention. And if it was event time
>> > processing, the watermark would be derived from records within the file.
>> >
>> > Agreed, the source should have a mechanism to control the time stamp
>> > extraction along with everything else pertaining to the watermark
>> > generation.
>> >
>> >
>> > > We could also implement a "timestampExtractor" interface to identify
>> the
>> > > timestamp (sequence number) for a file.
>> > >
>> > > ~ Bhupesh
>> > >
>> > >
>> > > _______________________________________________________
>> > >
>> > > Bhupesh Chawda
>> > >
>> > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>> > >
>> > > www.datatorrent.com  |  apex.apache.org
>> > >
>> > >
>> > >
>> > > On Mon, Feb 27, 2017 at 9:52 PM, Thomas Weise <th...@apache.org> wrote:
>> > >
>> > > > I don't think this is a use case for count based window.
>> > > >
>> > > > We have multiple files that are retrieved in a sequence and there
>> is no
>> > > > knowledge of the number of records per file. The requirement is to
>> > > > aggregate each file separately and emit the aggregate when the file
>> is
>> > > read
>> > > > fully. There is no concept of "end of something" for an individual
>> key
>> > > and
>> > > > global window isn't applicable.
>> > > >
>> > > > However, as already explained and implemented by Bhupesh, this can
>> be
>> > > > solved using watermark and window (in this case the window timestamp
>> > > isn't
>> > > > a timestamp, but a file sequence, but that doesn't matter.
>> > > >
>> > > > Thomas
>> > > >
>> > > >
>> > > > On Mon, Feb 27, 2017 at 8:05 AM, David Yan <da...@gmail.com>
>> wrote:
>> > > >
>> > > > > I don't think this is the way to go. Global Window only means the
>> > > > timestamp
>> > > > > does not matter (or that there is no timestamp). It does not
>> > > necessarily
>> > > > > mean it's a large batch. Unless there is some notion of event time
>> > for
>> > > > each
>> > > > > file, you don't want to embed the file into the window itself.
>> > > > >
>> > > > > If you want the result broken up by file name, and if the files
>> are
>> > to
>> > > be
>> > > > > processed in parallel, I think making the file name be part of the
>> > key
>> > > is
>> > > > > the way to go. I think it's very confusing if we somehow make the
>> > file
>> > > to
>> > > > > be part of the window.
>> > > > >
>> > > > > For count-based window, it's not implemented yet and you're
>> welcome
>> > to
>> > > > add
>> > > > > that feature. In case of count-based windows, there would be no
>> > notion
>> > > of
>> > > > > time and you probably only trigger at the end of each window. In
>> the
>> > > case
>> > > > > of count-based windows, the watermark only matters for batch since
>> > you
>> > > > need
>> > > > > a way to know when the batch has ended (if the count is 10, the
>> > number
>> > > of
>> > > > > tuples in the batch is let's say 105, you need a way to end the
>> last
>> > > > window
>> > > > > with 5 tuples).
>> > > > >
>> > > > > David
>> > > > >
>> > > > > On Mon, Feb 27, 2017 at 2:41 AM, Bhupesh Chawda <
>> > > bhupesh@datatorrent.com
>> > > > >
>> > > > > wrote:
>> > > > >
>> > > > > > Hi David,
>> > > > > >
>> > > > > > Thanks for your comments.
>> > > > > >
>> > > > > > The wordcount example that I created based on the windowed
>> operator
>> > > > does
>> > > > > > processing of word counts per file (each file as a separate
>> batch),
>> > > > i.e.
>> > > > > > process counts for each file and dump into separate files.
>> > > > > > As I understand Global window is for one large batch; i.e. all
>> > > incoming
>> > > > > > data falls into the same batch. This could not be processed
>> using
>> > > > > > GlobalWindow option as we need more than one windows. In this
>> > case, I
>> > > > > > configured the windowed operator to have time windows of 1ms
>> each
>> > and
>> > > > > > passed data for each file with increasing timestamps: (file1,
>> 1),
>> > > > (file2,
>> > > > > > 2) and so on. Is there a better way of handling this scenario?
>> > > > > >
>> > > > > > Regarding (2 - count based windows), I think there is a trigger
>> > > option
>> > > > to
>> > > > > > process count based windows. In case I want to process every
>> 1000
>> > > > tuples
>> > > > > as
>> > > > > > a batch, I could set the Trigger option to CountTrigger with the
>> > > > > > accumulation set to Discarding. Is this correct?
>> > > > > >
>> > > > > > I agree that (4. Final Watermark) can be done using Global
>> window.
>> > > > > >
>> > > > > > ​~ Bhupesh​
>> > > > > >
>> > > > > > _______________________________________________________
>> > > > > >
>> > > > > > Bhupesh Chawda
>> > > > > >
>> > > > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>> > > > > >
>> > > > > > www.datatorrent.com  |  apex.apache.org
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > On Mon, Feb 27, 2017 at 12:18 PM, David Yan <davidyan@gmail.com
>> >
>> > > > wrote:
>> > > > > >
>> > > > > > > I'm worried that we are making the watermark concept too
>> > > complicated.
>> > > > > > >
>> > > > > > > Watermarks should simply just tell you what windows can be
>> > > considered
>> > > > > > > complete.
>> > > > > > >
>> > > > > > > Point 2 is basically a count-based window. Watermarks do not
>> > play a
>> > > > > role
>> > > > > > > here because the window is always complete at the n-th tuple.
>> > > > > > >
>> > > > > > > If I understand correctly, point 3 is for batch processing of
>> > > files.
>> > > > > > Unless
>> > > > > > > the files contain timed events, it sounds to be that this can
>> be
>> > > > > achieved
>> > > > > > > with just a Global Window. For signaling EOF, a watermark
>> with a
>> > > > > > +infinity
>> > > > > > > timestamp can be used so that triggers will be fired upon
>> receipt
>> > > of
>> > > > > that
>> > > > > > > watermark.
>> > > > > > >
>> > > > > > > For point 4, just like what I mentioned above, can be achieved
>> > > with a
>> > > > > > > watermark with a +infinity timestamp.
>> > > > > > >
>> > > > > > > David
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > On Sat, Feb 18, 2017 at 8:04 AM, Bhupesh Chawda <
>> > > > > bhupesh@datatorrent.com
>> > > > > > >
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > > > Hi Thomas,
>> > > > > > > >
>> > > > > > > > For an input operator which is supposed to generate
>> watermarks
>> > > for
>> > > > > > > > downstream operators, I can think about the following
>> > watermarks
>> > > > that
>> > > > > > the
>> > > > > > > > operator can emit:
>> > > > > > > > 1. Time based watermarks (the high watermark / low
>> watermark)
>> > > > > > > > 2. Number of tuple based watermarks (Every n tuples)
>> > > > > > > > 3. File based watermarks (Start file, end file)
>> > > > > > > > 4. Final watermark
>> > > > > > > >
>> > > > > > > > File based watermarks seem to be applicable for batch (file
>> > > based)
>> > > > as
>> > > > > > > well,
>> > > > > > > > and hence I thought of looking at these first. Does this
>> seem
>> > to
>> > > be
>> > > > > in
>> > > > > > > line
>> > > > > > > > with the thought process?
>> > > > > > > >
>> > > > > > > > ~ Bhupesh
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > _______________________________________________________
>> > > > > > > >
>> > > > > > > > Bhupesh Chawda
>> > > > > > > >
>> > > > > > > > Software Engineer
>> > > > > > > >
>> > > > > > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>> > > > > > > >
>> > > > > > > > www.datatorrent.com  |  apex.apache.org
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > On Thu, Feb 16, 2017 at 10:37 AM, Thomas Weise <
>> thw@apache.org
>> > >
>> > > > > wrote:
>> > > > > > > >
>> > > > > > > > > I don't think this should be designed based on a
>> simplistic
>> > > file
>> > > > > > > > > input-output scenario. It would be good to include a
>> stateful
>> > > > > > > > > transformation based on event time.
>> > > > > > > > >
>> > > > > > > > > More complex pipelines contain stateful transformations
>> that
>> > > > depend
>> > > > > > on
>> > > > > > > > > windowing and watermarks. I think we need a watermark
>> concept
>> > > > that
>> > > > > is
>> > > > > > > > based
>> > > > > > > > > on progress in event time (or other monotonic increasing
>> > > > sequence)
>> > > > > > that
>> > > > > > > > > other operators can generically work with.
>> > > > > > > > >
>> > > > > > > > > Note that even file input in many cases can produce time
>> > based
>> > > > > > > > watermarks,
>> > > > > > > > > for example when you read part files that are bound by
>> event
>> > > > time.
>> > > > > > > > >
>> > > > > > > > > Thanks,
>> > > > > > > > > Thomas
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > On Wed, Feb 15, 2017 at 4:02 AM, Bhupesh Chawda <
>> > > > > > > bhupesh@datatorrent.com
>> > > > > > > > >
>> > > > > > > > > wrote:
>> > > > > > > > >
>> > > > > > > > > > For better understanding the use case for control
>> tuples in
>> > > > > batch,
>> > > > > > ​I
>> > > > > > > > am
>> > > > > > > > > > creating a prototype for a batch application using File
>> > Input
>> > > > and
>> > > > > > > File
>> > > > > > > > > > Output operators.
>> > > > > > > > > >
>> > > > > > > > > > To enable basic batch processing for File IO operators,
>> I
>> > am
>> > > > > > > proposing
>> > > > > > > > > the
>> > > > > > > > > > following changes to File input and output operators:
>> > > > > > > > > > 1. File Input operator emits a watermark each time it
>> opens
>> > > and
>> > > > > > > closes
>> > > > > > > > a
>> > > > > > > > > > file. These can be "start file" and "end file"
>> watermarks
>> > > which
>> > > > > > > include
>> > > > > > > > > the
>> > > > > > > > > > corresponding file names. The "start file" tuple should
>> be
>> > > sent
>> > > > > > > before
>> > > > > > > > > any
>> > > > > > > > > > of the data from that file flows.
>> > > > > > > > > > 2. File Input operator can be configured to end the
>> > > application
>> > > > > > > after a
>> > > > > > > > > > single or n scans of the directory (a batch). This is
>> where
>> > > the
>> > > > > > > > operator
>> > > > > > > > > > emits the final watermark (the end of application
>> control
>> > > > tuple).
>> > > > > > > This
>> > > > > > > > > will
>> > > > > > > > > > also shutdown the application.
>> > > > > > > > > > 3. The File output operator handles these control
>> tuples.
>> > > > "Start
>> > > > > > > file"
>> > > > > > > > > > initializes the file name for the incoming tuples. "End
>> > file"
>> > > > > > > watermark
>> > > > > > > > > > forces a finalize on that file.
>> > > > > > > > > >
>> > > > > > > > > > The user would be able to enable the operators to send
>> only
>> > > > those
>> > > > > > > > > > watermarks that are needed in the application. If none
>> of
>> > the
>> > > > > > options
>> > > > > > > > are
>> > > > > > > > > > configured, the operators behave as in a streaming
>> > > application.
>> > > > > > > > > >
>> > > > > > > > > > There are a few challenges in the implementation where
>> the
>> > > > input
>> > > > > > > > operator
>> > > > > > > > > > is partitioned. In this case, the correlation between
>> the
>> > > > > start/end
>> > > > > > > > for a
>> > > > > > > > > > file and the data tuples for that file is lost. Hence we
>> > need
>> > > > to
>> > > > > > > > maintain
>> > > > > > > > > > the filename as part of each tuple in the pipeline.
>> > > > > > > > > >
>> > > > > > > > > > The "start file" and "end file" control tuples in this
>> > > example
>> > > > > are
>> > > > > > > > > > temporary names for watermarks. We can have generic
>> "start
>> > > > > batch" /
>> > > > > > > > "end
>> > > > > > > > > > batch" tuples which could be used for other use cases as
>> > > well.
>> > > > > The
>> > > > > > > > Final
>> > > > > > > > > > watermark is common and serves the same purpose in each
>> > case.
>> > > > > > > > > >
>> > > > > > > > > > Please let me know your thoughts on this.
>> > > > > > > > > >
>> > > > > > > > > > ~ Bhupesh
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > > On Wed, Jan 18, 2017 at 12:22 AM, Bhupesh Chawda <
>> > > > > > > > > bhupesh@datatorrent.com>
>> > > > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > > Yes, this can be part of operator configuration. Given
>> > > this,
>> > > > > for
>> > > > > > a
>> > > > > > > > user
>> > > > > > > > > > to
>> > > > > > > > > > > define a batch application, would mean configuring the
>> > > > > connectors
>> > > > > > > > > (mostly
>> > > > > > > > > > > the input operator) in the application for the desired
>> > > > > behavior.
>> > > > > > > > > > Similarly,
>> > > > > > > > > > > there can be other use cases that can be achieved
>> other
>> > > than
>> > > > > > batch.
>> > > > > > > > > > >
>> > > > > > > > > > > We may also need to take care of the following:
>> > > > > > > > > > > 1. Make sure that the watermarks or control tuples are
>> > > > > consistent
>> > > > > > > > > across
>> > > > > > > > > > > sources. Meaning an HDFS sink should be able to
>> interpret
>> > > the
>> > > > > > > > watermark
>> > > > > > > > > > > tuple sent out by, say, a JDBC source.
>> > > > > > > > > > > 2. In addition to I/O connectors, we should also look
>> at
>> > > the
>> > > > > need
>> > > > > > > for
>> > > > > > > > > > > processing operators to understand some of the control
>> > > > tuples /
>> > > > > > > > > > watermarks.
>> > > > > > > > > > > For example, we may want to reset the operator
>> behavior
>> > on
>> > > > > > arrival
>> > > > > > > of
>> > > > > > > > > > some
>> > > > > > > > > > > watermark tuple.
>> > > > > > > > > > >
>> > > > > > > > > > > ~ Bhupesh
>> > > > > > > > > > >
>> > > > > > > > > > > On Tue, Jan 17, 2017 at 9:59 PM, Thomas Weise <
>> > > > thw@apache.org>
>> > > > > > > > wrote:
>> > > > > > > > > > >
>> > > > > > > > > > >> The HDFS source can operate in two modes, bounded or
>> > > > > unbounded.
>> > > > > > If
>> > > > > > > > you
>> > > > > > > > > > >> scan
>> > > > > > > > > > >> only once, then it should emit the final watermark
>> after
>> > > it
>> > > > is
>> > > > > > > done.
>> > > > > > > > > > >> Otherwise it would emit watermarks based on a policy
>> > > (files
>> > > > > > names
>> > > > > > > > > etc.).
>> > > > > > > > > > >> The mechanism to generate the marks may depend on the
>> > type
>> > > > of
>> > > > > > > source
>> > > > > > > > > and
>> > > > > > > > > > >> the user needs to be able to influence/configure it.
>> > > > > > > > > > >>
>> > > > > > > > > > >> Thomas
>> > > > > > > > > > >>
>> > > > > > > > > > >>
>> > > > > > > > > > >> On Tue, Jan 17, 2017 at 5:03 AM, Bhupesh Chawda <
>> > > > > > > > > > bhupesh@datatorrent.com>
>> > > > > > > > > > >> wrote:
>> > > > > > > > > > >>
>> > > > > > > > > > >> > Hi Thomas,
>> > > > > > > > > > >> >
>> > > > > > > > > > >> > I am not sure that I completely understand your
>> > > > suggestion.
>> > > > > > Are
>> > > > > > > > you
>> > > > > > > > > > >> > suggesting to broaden the scope of the proposal to
>> > treat
>> > > > all
>> > > > > > > > sources
>> > > > > > > > > > as
>> > > > > > > > > > >> > bounded as well as unbounded?
>> > > > > > > > > > >> >
>> > > > > > > > > > >> > In case of Apex, we treat all sources as unbounded
>> > > > sources.
>> > > > > > Even
>> > > > > > > > > > bounded
>> > > > > > > > > > >> > sources like HDFS file source is treated as
>> unbounded
>> > by
>> > > > > means
>> > > > > > > of
>> > > > > > > > > > >> scanning
>> > > > > > > > > > >> > the input directory repeatedly.
>> > > > > > > > > > >> >
>> > > > > > > > > > >> > Let's consider HDFS file source for example:
>> > > > > > > > > > >> > In this case, if we treat it as a bounded source,
>> we
>> > can
>> > > > > > define
>> > > > > > > > > hooks
>> > > > > > > > > > >> which
>> > > > > > > > > > >> > allows us to detect the end of the file and send
>> the
>> > > > "final
>> > > > > > > > > > watermark".
>> > > > > > > > > > >> We
>> > > > > > > > > > >> > could also consider HDFS file source as a streaming
>> > > source
>> > > > > and
>> > > > > > > > > define
>> > > > > > > > > > >> hooks
>> > > > > > > > > > >> > which send watermarks based on different kinds of
>> > > windows.
>> > > > > > > > > > >> >
>> > > > > > > > > > >> > Please correct me if I misunderstand.
>> > > > > > > > > > >> >
>> > > > > > > > > > >> > ~ Bhupesh
>> > > > > > > > > > >> >
>> > > > > > > > > > >> >
>> > > > > > > > > > >> > On Mon, Jan 16, 2017 at 9:23 PM, Thomas Weise <
>> > > > > thw@apache.org
>> > > > > > >
>> > > > > > > > > wrote:
>> > > > > > > > > > >> >
>> > > > > > > > > > >> > > Bhupesh,
>> > > > > > > > > > >> > >
>> > > > > > > > > > >> > > Please see how that can be solved in a unified
>> way
>> > > using
>> > > > > > > windows
>> > > > > > > > > and
>> > > > > > > > > > >> > > watermarks. It is bounded data vs. unbounded
>> data.
>> > In
>> > > > Beam
>> > > > > > for
>> > > > > > > > > > >> example,
>> > > > > > > > > > >> > you
>> > > > > > > > > > >> > > can use the "global window" and the final
>> watermark
>> > to
>> > > > > > > > accomplish
>> > > > > > > > > > what
>> > > > > > > > > > >> > you
>> > > > > > > > > > >> > > are looking for. Batch is just a special case of
>> > > > streaming
>> > > > > > > where
>> > > > > > > > > the
>> > > > > > > > > > >> > source
>> > > > > > > > > > >> > > emits the final watermark.
>> > > > > > > > > > >> > >
>> > > > > > > > > > >> > > Thanks,
>> > > > > > > > > > >> > > Thomas
>> > > > > > > > > > >> > >
>> > > > > > > > > > >> > >
>> > > > > > > > > > >> > > On Mon, Jan 16, 2017 at 1:02 AM, Bhupesh Chawda <
>> > > > > > > > > > >> bhupesh@datatorrent.com
>> > > > > > > > > > >> > >
>> > > > > > > > > > >> > > wrote:
>> > > > > > > > > > >> > >
>> > > > > > > > > > >> > > > Yes, if the user needs to develop a batch
>> > > application,
>> > > > > > then
>> > > > > > > > > batch
>> > > > > > > > > > >> aware
>> > > > > > > > > > >> > > > operators need to be used in the application.
>> > > > > > > > > > >> > > > The nature of the application is mostly
>> controlled
>> > > by
>> > > > > the
>> > > > > > > > input
>> > > > > > > > > > and
>> > > > > > > > > > >> the
>> > > > > > > > > > >> > > > output operators used in the application.
>> > > > > > > > > > >> > > >
>> > > > > > > > > > >> > > > For example, consider an application which
>> needs
>> > to
>> > > > > filter
>> > > > > > > > > records
>> > > > > > > > > > >> in a
>> > > > > > > > > > >> > > > input file and store the filtered records in
>> > another
>> > > > > file.
>> > > > > > > The
>> > > > > > > > > > >> nature
>> > > > > > > > > > >> > of
>> > > > > > > > > > >> > > > this app is to end once the entire file is
>> > > processed.
>> > > > > > > > Following
>> > > > > > > > > > >> things
>> > > > > > > > > > >> > > are
>> > > > > > > > > > >> > > > expected of the application:
>> > > > > > > > > > >> > > >
>> > > > > > > > > > >> > > >    1. Once the input data is over, finalize the
>> > > output
>> > > > > > file
>> > > > > > > > from
>> > > > > > > > > > >> .tmp
>> > > > > > > > > > >> > > >    files. - Responsibility of output operator
>> > > > > > > > > > >> > > >    2. End the application, once the data is
>> read
>> > and
>> > > > > > > > processed -
>> > > > > > > > > > >> > > >    Responsibility of input operator
>> > > > > > > > > > >> > > >
>> > > > > > > > > > >> > > > These functions are essential to allow the
>> user to
>> > > do
>> > > > > > higher
>> > > > > > > > > level
>> > > > > > > > > > >> > > > operations like scheduling or running a
>> workflow
>> > of
>> > > > > batch
>> > > > > > > > > > >> applications.
>> > > > > > > > > > >> > > >
>> > > > > > > > > > >> > > > I am not sure about intermediate (processing)
>> > > > operators,
>> > > > > > as
>> > > > > > > > > there
>> > > > > > > > > > >> is no
>> > > > > > > > > > >> > > > change in their functionality for batch use
>> cases.
>> > > > > > Perhaps,
>> > > > > > > > > > allowing
>> > > > > > > > > > >> > > > running multiple batches in a single
>> application
>> > may
>> > > > > > require
>> > > > > > > > > > similar
>> > > > > > > > > > >> > > > changes in processing operators as well.
>> > > > > > > > > > >> > > >
>> > > > > > > > > > >> > > > ~ Bhupesh
>> > > > > > > > > > >> > > >
>> > > > > > > > > > >> > > > On Mon, Jan 16, 2017 at 2:19 PM, Priyanka
>> Gugale <
>> > > > > > > > > > priyag@apache.org
>> > > > > > > > > > >> >
>> > > > > > > > > > >> > > > wrote:
>> > > > > > > > > > >> > > >
>> > > > > > > > > > >> > > > > Will it make an impression on user that, if
>> he
>> > > has a
>> > > > > > batch
>> > > > > > > > > > >> usecase he
>> > > > > > > > > > >> > > has
>> > > > > > > > > > >> > > > > to use batch aware operators only? If so, is
>> > that
>> > > > what
>> > > > > > we
>> > > > > > > > > > expect?
>> > > > > > > > > > >> I
>> > > > > > > > > > >> > am
>> > > > > > > > > > >> > > > not
>> > > > > > > > > > >> > > > > aware of how do we implement batch scenario
>> so
>> > > this
>> > > > > > might
>> > > > > > > > be a
>> > > > > > > > > > >> basic
>> > > > > > > > > > >> > > > > question.
>> > > > > > > > > > >> > > > >
>> > > > > > > > > > >> > > > > -Priyanka
>> > > > > > > > > > >> > > > >
>> > > > > > > > > > >> > > > > On Mon, Jan 16, 2017 at 12:02 PM, Bhupesh
>> > Chawda <
>> > > > > > > > > > >> > > > bhupesh@datatorrent.com>
>> > > > > > > > > > >> > > > > wrote:
>> > > > > > > > > > >> > > > >
>> > > > > > > > > > >> > > > > > Hi All,
>> > > > > > > > > > >> > > > > >
>> > > > > > > > > > >> > > > > > While design / implementation for custom
>> > control
>> > > > > > tuples
>> > > > > > > is
>> > > > > > > > > > >> > ongoing, I
>> > > > > > > > > > >> > > > > > thought it would be a good idea to consider
>> > its
>> > > > > > > usefulness
>> > > > > > > > > in
>> > > > > > > > > > >> one
>> > > > > > > > > > >> > of
>> > > > > > > > > > >> > > > the
>> > > > > > > > > > >> > > > > > use cases -  batch applications.
>> > > > > > > > > > >> > > > > >
>> > > > > > > > > > >> > > > > > This is a proposal to adapt / extend
>> existing
>> > > > > > operators
>> > > > > > > in
>> > > > > > > > > the
>> > > > > > > > > > >> > Apache
>> > > > > > > > > > >> > > > > Apex
>> > > > > > > > > > >> > > > > > Malhar library so that it is easy to use
>> them
>> > in
>> > > > > batch
>> > > > > > > use
>> > > > > > > > > > >> cases.
>> > > > > > > > > > >> > > > > > Naturally, this would be applicable for
>> only a
>> > > > > subset
>> > > > > > of
>> > > > > > > > > > >> operators
>> > > > > > > > > > >> > > like
>> > > > > > > > > > >> > > > > > File, JDBC and NoSQL databases.
>> > > > > > > > > > >> > > > > > For example, for a file based store, (say
>> HDFS
>> > > > > store),
>> > > > > > > we
>> > > > > > > > > > could
>> > > > > > > > > > >> > have
>> > > > > > > > > > >> > > > > > FileBatchInput and FileBatchOutput
>> operators
>> > > which
>> > > > > > allow
>> > > > > > > > > easy
>> > > > > > > > > > >> > > > integration
>> > > > > > > > > > >> > > > > > into a batch application. These operators
>> > would
>> > > be
>> > > > > > > > extended
>> > > > > > > > > > from
>> > > > > > > > > > >> > > their
>> > > > > > > > > > >> > > > > > existing implementations and would be
>> "Batch
>> > > > Aware",
>> > > > > > in
>> > > > > > > > that
>> > > > > > > > > > >> they
>> > > > > > > > > > >> > may
>> > > > > > > > > > >> > > > > > understand the meaning of some specific
>> > control
>> > > > > tuples
>> > > > > > > > that
>> > > > > > > > > > flow
>> > > > > > > > > > >> > > > through
>> > > > > > > > > > >> > > > > > the DAG. Start batch and end batch seem to
>> be
>> > > the
>> > > > > > > obvious
>> > > > > > > > > > >> > candidates
>> > > > > > > > > > >> > > > that
>> > > > > > > > > > >> > > > > > come to mind. On receipt of such control
>> > tuples,
>> > > > > they
>> > > > > > > may
>> > > > > > > > > try
>> > > > > > > > > > to
>> > > > > > > > > > >> > > modify
>> > > > > > > > > > >> > > > > the
>> > > > > > > > > > >> > > > > > behavior of the operator - to reinitialize
>> > some
>> > > > > > metrics
>> > > > > > > or
>> > > > > > > > > > >> finalize
>> > > > > > > > > > >> > > an
>> > > > > > > > > > >> > > > > > output file for example.
>> > > > > > > > > > >> > > > > >
>> > > > > > > > > > >> > > > > > We can discuss the potential control tuples
>> > and
>> > > > > > actions
>> > > > > > > in
>> > > > > > > > > > >> detail,
>> > > > > > > > > > >> > > but
>> > > > > > > > > > >> > > > > > first I would like to understand the views
>> of
>> > > the
>> > > > > > > > community
>> > > > > > > > > > for
>> > > > > > > > > > >> > this
>> > > > > > > > > > >> > > > > > proposal.
>> > > > > > > > > > >> > > > > >
>> > > > > > > > > > >> > > > > > ~ Bhupesh
>> > > > > > > > > > >> > > > > >
>> > > > > > > > > > >> > > > >
>> > > > > > > > > > >> > > >
>> > > > > > > > > > >> > >
>> > > > > > > > > > >> >
>> > > > > > > > > > >>
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by Bhupesh Chawda <bh...@datatorrent.com>.
Hi David,

If using a time window does not seem appropriate, we can have another class
which is better suited to such sequential and distinct windows. Perhaps a
CustomWindow option can be introduced which takes in a window id. The
purpose of this window option would be to translate the window id into
appropriate timestamps.

Another option would be to go with a custom timestampExtractor for such
tuples, which translates each unique file name to a distinct timestamp
while using time windows in the windowed operator.
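
To illustrate the second option, here is a minimal sketch of such an
extractor. The class name and the way it would be wired into the windowed
operator are assumptions for illustration only:

import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: only the idea of mapping each file name to a
// monotonically increasing sequence number used as the timestamp is taken
// from the discussion above; the class name is made up for illustration.
public class FileNameTimestampExtractor
{
  private final Map<String, Long> sequenceByFile = new LinkedHashMap<>();
  private long nextSequence = 1;

  // Returns the same "timestamp" for every tuple of a given file, and a
  // strictly larger one for each new file, so each file falls into its own
  // time window.
  public long getTimestamp(String fileName)
  {
    return sequenceByFile.computeIfAbsent(fileName, f -> nextSequence++);
  }
}

With something like this, all tuples from file1 would carry timestamp 1,
tuples from file2 timestamp 2, and so on, in line with the (file1, 1),
(file2, 2) scheme mentioned earlier in the thread.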

~ Bhupesh


_______________________________________________________

Bhupesh Chawda

E: bhupesh@datatorrent.com | Twitter: @bhupeshsc

www.datatorrent.com  |  apex.apache.org



On Tue, Feb 28, 2017 at 12:28 AM, David Yan <da...@gmail.com> wrote:

> I now see your rationale on putting the filename in the window.
> As far as I understand, the reasons why the filename is not part of the key
> and the Global Window is not used are:
>
> 1) The files are processed in sequence, not in parallel
> 2) The windowed operator should not keep the state associated with the file
> when the processing of the file is done
> 3) The trigger should be fired for the file when a file is done processing.
>
> However, if the file is just a sequence has nothing to do with a timestamp,
> assigning a timestamp to a file is not an intuitive thing to do and would
> just create confusions to the users, especially when it's used as an
> example for new users.
>
> How about having a separate class called SequenceWindow? And perhaps
> TimeWindow can inherit from it?
>
> David
>
> On Mon, Feb 27, 2017 at 8:58 AM, Thomas Weise <th...@apache.org> wrote:
>
> > On Mon, Feb 27, 2017 at 8:50 AM, Bhupesh Chawda <bhupesh@datatorrent.com
> >
> > wrote:
> >
> > > I think my comments related to count based windows might be causing
> > > confusion. Let's not discuss count based scenarios for now.
> > >
> > > Just want to make sure we are on the same page wrt. the "each file is a
> > > batch" use case. As mentioned by Thomas, the each tuple from the same
> > file
> > > has the same timestamp (which is just a sequence number) and that helps
> > > keep tuples from each file in a separate window.
> > >
> >
> > Yes, in this case it is a sequence number, but it could be a time stamp
> > also, depending on the file naming convention. And if it was event time
> > processing, the watermark would be derived from records within the file.
> >
> > Agreed, the source should have a mechanism to control the time stamp
> > extraction along with everything else pertaining to the watermark
> > generation.
> >
> >
> > > We could also implement a "timestampExtractor" interface to identify
> the
> > > timestamp (sequence number) for a file.
> > >
> > > ~ Bhupesh
> > >
> > >
> > > _______________________________________________________
> > >
> > > Bhupesh Chawda
> > >
> > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> > >
> > > www.datatorrent.com  |  apex.apache.org
> > >
> > >
> > >
> > > On Mon, Feb 27, 2017 at 9:52 PM, Thomas Weise <th...@apache.org> wrote:
> > >
> > > > I don't think this is a use case for count based window.
> > > >
> > > > We have multiple files that are retrieved in a sequence and there is
> no
> > > > knowledge of the number of records per file. The requirement is to
> > > > aggregate each file separately and emit the aggregate when the file
> is
> > > read
> > > > fully. There is no concept of "end of something" for an individual
> key
> > > and
> > > > global window isn't applicable.
> > > >
> > > > However, as already explained and implemented by Bhupesh, this can be
> > > > solved using watermark and window (in this case the window timestamp
> > > isn't
> > > > a timestamp, but a file sequence, but that doesn't matter.
> > > >
> > > > Thomas
> > > >
> > > >
> > > > On Mon, Feb 27, 2017 at 8:05 AM, David Yan <da...@gmail.com>
> wrote:
> > > >
> > > > > I don't think this is the way to go. Global Window only means the
> > > > timestamp
> > > > > does not matter (or that there is no timestamp). It does not
> > > necessarily
> > > > > mean it's a large batch. Unless there is some notion of event time
> > for
> > > > each
> > > > > file, you don't want to embed the file into the window itself.
> > > > >
> > > > > If you want the result broken up by file name, and if the files are
> > to
> > > be
> > > > > processed in parallel, I think making the file name be part of the
> > key
> > > is
> > > > > the way to go. I think it's very confusing if we somehow make the
> > file
> > > to
> > > > > be part of the window.
> > > > >
> > > > > For count-based window, it's not implemented yet and you're welcome
> > to
> > > > add
> > > > > that feature. In case of count-based windows, there would be no
> > notion
> > > of
> > > > > time and you probably only trigger at the end of each window. In
> the
> > > case
> > > > > of count-based windows, the watermark only matters for batch since
> > you
> > > > need
> > > > > a way to know when the batch has ended (if the count is 10, the
> > number
> > > of
> > > > > tuples in the batch is let's say 105, you need a way to end the
> last
> > > > window
> > > > > with 5 tuples).
> > > > >
> > > > > David
> > > > >
> > > > > On Mon, Feb 27, 2017 at 2:41 AM, Bhupesh Chawda <
> > > bhupesh@datatorrent.com
> > > > >
> > > > > wrote:
> > > > >
> > > > > > Hi David,
> > > > > >
> > > > > > Thanks for your comments.
> > > > > >
> > > > > > The wordcount example that I created based on the windowed
> operator
> > > > does
> > > > > > processing of word counts per file (each file as a separate
> batch),
> > > > i.e.
> > > > > > process counts for each file and dump into separate files.
> > > > > > As I understand Global window is for one large batch; i.e. all
> > > incoming
> > > > > > data falls into the same batch. This could not be processed using
> > > > > > GlobalWindow option as we need more than one windows. In this
> > case, I
> > > > > > configured the windowed operator to have time windows of 1ms each
> > and
> > > > > > passed data for each file with increasing timestamps: (file1, 1),
> > > > (file2,
> > > > > > 2) and so on. Is there a better way of handling this scenario?
> > > > > >
> > > > > > Regarding (2 - count based windows), I think there is a trigger
> > > option
> > > > to
> > > > > > process count based windows. In case I want to process every 1000
> > > > tuples
> > > > > as
> > > > > > a batch, I could set the Trigger option to CountTrigger with the
> > > > > > accumulation set to Discarding. Is this correct?
> > > > > >
> > > > > > I agree that (4. Final Watermark) can be done using Global
> window.
> > > > > >
> > > > > > ​~ Bhupesh​
> > > > > >
> > > > > > _______________________________________________________
> > > > > >
> > > > > > Bhupesh Chawda
> > > > > >
> > > > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> > > > > >
> > > > > > www.datatorrent.com  |  apex.apache.org
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Mon, Feb 27, 2017 at 12:18 PM, David Yan <da...@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > > I'm worried that we are making the watermark concept too
> > > complicated.
> > > > > > >
> > > > > > > Watermarks should simply just tell you what windows can be
> > > considered
> > > > > > > complete.
> > > > > > >
> > > > > > > Point 2 is basically a count-based window. Watermarks do not
> > play a
> > > > > role
> > > > > > > here because the window is always complete at the n-th tuple.
> > > > > > >
> > > > > > > If I understand correctly, point 3 is for batch processing of
> > > files.
> > > > > > Unless
> > > > > > > the files contain timed events, it sounds to be that this can
> be
> > > > > achieved
> > > > > > > with just a Global Window. For signaling EOF, a watermark with
> a
> > > > > > +infinity
> > > > > > > timestamp can be used so that triggers will be fired upon
> receipt
> > > of
> > > > > that
> > > > > > > watermark.
> > > > > > >
> > > > > > > For point 4, just like what I mentioned above, can be achieved
> > > with a
> > > > > > > watermark with a +infinity timestamp.
> > > > > > >
> > > > > > > David
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Sat, Feb 18, 2017 at 8:04 AM, Bhupesh Chawda <
> > > > > bhupesh@datatorrent.com
> > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Thomas,
> > > > > > > >
> > > > > > > > For an input operator which is supposed to generate
> watermarks
> > > for
> > > > > > > > downstream operators, I can think about the following
> > watermarks
> > > > that
> > > > > > the
> > > > > > > > operator can emit:
> > > > > > > > 1. Time based watermarks (the high watermark / low watermark)
> > > > > > > > 2. Number of tuple based watermarks (Every n tuples)
> > > > > > > > 3. File based watermarks (Start file, end file)
> > > > > > > > 4. Final watermark
> > > > > > > >
> > > > > > > > File based watermarks seem to be applicable for batch (file
> > > based)
> > > > as
> > > > > > > well,
> > > > > > > > and hence I thought of looking at these first. Does this seem
> > to
> > > be
> > > > > in
> > > > > > > line
> > > > > > > > with the thought process?
> > > > > > > >
> > > > > > > > ~ Bhupesh
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > _______________________________________________________
> > > > > > > >
> > > > > > > > Bhupesh Chawda
> > > > > > > >
> > > > > > > > Software Engineer
> > > > > > > >
> > > > > > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> > > > > > > >
> > > > > > > > www.datatorrent.com  |  apex.apache.org
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Feb 16, 2017 at 10:37 AM, Thomas Weise <
> thw@apache.org
> > >
> > > > > wrote:
> > > > > > > >
> > > > > > > > > I don't think this should be designed based on a simplistic
> > > file
> > > > > > > > > input-output scenario. It would be good to include a
> stateful
> > > > > > > > > transformation based on event time.
> > > > > > > > >
> > > > > > > > > More complex pipelines contain stateful transformations
> that
> > > > depend
> > > > > > on
> > > > > > > > > windowing and watermarks. I think we need a watermark
> concept
> > > > that
> > > > > is
> > > > > > > > based
> > > > > > > > > on progress in event time (or other monotonic increasing
> > > > sequence)
> > > > > > that
> > > > > > > > > other operators can generically work with.
> > > > > > > > >
> > > > > > > > > Note that even file input in many cases can produce time
> > based
> > > > > > > > watermarks,
> > > > > > > > > for example when you read part files that are bound by
> event
> > > > time.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Thomas
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, Feb 15, 2017 at 4:02 AM, Bhupesh Chawda <
> > > > > > > bhupesh@datatorrent.com
> > > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > For better understanding the use case for control tuples
> in
> > > > > batch,
> > > > > > ​I
> > > > > > > > am
> > > > > > > > > > creating a prototype for a batch application using File
> > Input
> > > > and
> > > > > > > File
> > > > > > > > > > Output operators.
> > > > > > > > > >
> > > > > > > > > > To enable basic batch processing for File IO operators, I
> > am
> > > > > > > proposing
> > > > > > > > > the
> > > > > > > > > > following changes to File input and output operators:
> > > > > > > > > > 1. File Input operator emits a watermark each time it
> opens
> > > and
> > > > > > > closes
> > > > > > > > a
> > > > > > > > > > file. These can be "start file" and "end file" watermarks
> > > which
> > > > > > > include
> > > > > > > > > the
> > > > > > > > > > corresponding file names. The "start file" tuple should
> be
> > > sent
> > > > > > > before
> > > > > > > > > any
> > > > > > > > > > of the data from that file flows.
> > > > > > > > > > 2. File Input operator can be configured to end the
> > > application
> > > > > > > after a
> > > > > > > > > > single or n scans of the directory (a batch). This is
> where
> > > the
> > > > > > > > operator
> > > > > > > > > > emits the final watermark (the end of application control
> > > > tuple).
> > > > > > > This
> > > > > > > > > will
> > > > > > > > > > also shutdown the application.
> > > > > > > > > > 3. The File output operator handles these control tuples.
> > > > "Start
> > > > > > > file"
> > > > > > > > > > initializes the file name for the incoming tuples. "End
> > file"
> > > > > > > watermark
> > > > > > > > > > forces a finalize on that file.
> > > > > > > > > >
> > > > > > > > > > The user would be able to enable the operators to send
> only
> > > > those
> > > > > > > > > > watermarks that are needed in the application. If none of
> > the
> > > > > > options
> > > > > > > > are
> > > > > > > > > > configured, the operators behave as in a streaming
> > > application.
> > > > > > > > > >
> > > > > > > > > > There are a few challenges in the implementation where
> the
> > > > input
> > > > > > > > operator
> > > > > > > > > > is partitioned. In this case, the correlation between the
> > > > > start/end
> > > > > > > > for a
> > > > > > > > > > file and the data tuples for that file is lost. Hence we
> > need
> > > > to
> > > > > > > > maintain
> > > > > > > > > > the filename as part of each tuple in the pipeline.
> > > > > > > > > >
> > > > > > > > > > The "start file" and "end file" control tuples in this
> > > example
> > > > > are
> > > > > > > > > > temporary names for watermarks. We can have generic
> "start
> > > > > batch" /
> > > > > > > > "end
> > > > > > > > > > batch" tuples which could be used for other use cases as
> > > well.
> > > > > The
> > > > > > > > Final
> > > > > > > > > > watermark is common and serves the same purpose in each
> > case.
> > > > > > > > > >
> > > > > > > > > > Please let me know your thoughts on this.
> > > > > > > > > >
> > > > > > > > > > ~ Bhupesh
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Wed, Jan 18, 2017 at 12:22 AM, Bhupesh Chawda <
> > > > > > > > > bhupesh@datatorrent.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Yes, this can be part of operator configuration. Given
> > > this,
> > > > > for
> > > > > > a
> > > > > > > > user
> > > > > > > > > > to
> > > > > > > > > > > define a batch application, would mean configuring the
> > > > > connectors
> > > > > > > > > (mostly
> > > > > > > > > > > the input operator) in the application for the desired
> > > > > behavior.
> > > > > > > > > > Similarly,
> > > > > > > > > > > there can be other use cases that can be achieved other
> > > than
> > > > > > batch.
> > > > > > > > > > >
> > > > > > > > > > > We may also need to take care of the following:
> > > > > > > > > > > 1. Make sure that the watermarks or control tuples are
> > > > > consistent
> > > > > > > > > across
> > > > > > > > > > > sources. Meaning an HDFS sink should be able to
> interpret
> > > the
> > > > > > > > watermark
> > > > > > > > > > > tuple sent out by, say, a JDBC source.
> > > > > > > > > > > 2. In addition to I/O connectors, we should also look
> at
> > > the
> > > > > need
> > > > > > > for
> > > > > > > > > > > processing operators to understand some of the control
> > > > tuples /
> > > > > > > > > > watermarks.
> > > > > > > > > > > For example, we may want to reset the operator behavior
> > on
> > > > > > arrival
> > > > > > > of
> > > > > > > > > > some
> > > > > > > > > > > watermark tuple.
> > > > > > > > > > >
> > > > > > > > > > > ~ Bhupesh
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Jan 17, 2017 at 9:59 PM, Thomas Weise <
> > > > thw@apache.org>
> > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > >> The HDFS source can operate in two modes, bounded or
> > > > > unbounded.
> > > > > > If
> > > > > > > > you
> > > > > > > > > > >> scan
> > > > > > > > > > >> only once, then it should emit the final watermark
> after
> > > it
> > > > is
> > > > > > > done.
> > > > > > > > > > >> Otherwise it would emit watermarks based on a policy
> > > (files
> > > > > > names
> > > > > > > > > etc.).
> > > > > > > > > > >> The mechanism to generate the marks may depend on the
> > type
> > > > of
> > > > > > > source
> > > > > > > > > and
> > > > > > > > > > >> the user needs to be able to influence/configure it.
> > > > > > > > > > >>
> > > > > > > > > > >> Thomas
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > > >> On Tue, Jan 17, 2017 at 5:03 AM, Bhupesh Chawda <
> > > > > > > > > > bhupesh@datatorrent.com>
> > > > > > > > > > >> wrote:
> > > > > > > > > > >>
> > > > > > > > > > >> > Hi Thomas,
> > > > > > > > > > >> >
> > > > > > > > > > >> > I am not sure that I completely understand your
> > > > suggestion.
> > > > > > Are
> > > > > > > > you
> > > > > > > > > > >> > suggesting to broaden the scope of the proposal to
> > treat
> > > > all
> > > > > > > > sources
> > > > > > > > > > as
> > > > > > > > > > >> > bounded as well as unbounded?
> > > > > > > > > > >> >
> > > > > > > > > > >> > In case of Apex, we treat all sources as unbounded
> > > > sources.
> > > > > > Even
> > > > > > > > > > bounded
> > > > > > > > > > >> > sources like HDFS file source is treated as
> unbounded
> > by
> > > > > means
> > > > > > > of
> > > > > > > > > > >> scanning
> > > > > > > > > > >> > the input directory repeatedly.
> > > > > > > > > > >> >
> > > > > > > > > > >> > Let's consider HDFS file source for example:
> > > > > > > > > > >> > In this case, if we treat it as a bounded source, we
> > can
> > > > > > define
> > > > > > > > > hooks
> > > > > > > > > > >> which
> > > > > > > > > > >> > allows us to detect the end of the file and send the
> > > > "final
> > > > > > > > > > watermark".
> > > > > > > > > > >> We
> > > > > > > > > > >> > could also consider HDFS file source as a streaming
> > > source
> > > > > and
> > > > > > > > > define
> > > > > > > > > > >> hooks
> > > > > > > > > > >> > which send watermarks based on different kinds of
> > > windows.
> > > > > > > > > > >> >
> > > > > > > > > > >> > Please correct me if I misunderstand.
> > > > > > > > > > >> >
> > > > > > > > > > >> > ~ Bhupesh
> > > > > > > > > > >> >
> > > > > > > > > > >> >
> > > > > > > > > > >> > On Mon, Jan 16, 2017 at 9:23 PM, Thomas Weise <

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by Pramod Immaneni <pr...@datatorrent.com>.
Yes, having this type of window not tied to a timestamp will work out better.

On Mon, Feb 27, 2017 at 10:58 AM, David Yan <da...@gmail.com> wrote:

> I now see your rationale on putting the filename in the window.
> As far as I understand, the reasons why the filename is not part of the key
> and the Global Window is not used are:
>
> 1) The files are processed in sequence, not in parallel
> 2) The windowed operator should not keep the state associated with the file
> when the processing of the file is done
> 3) The trigger should be fired for the file when a file is done processing.
>
> However, if the file is just a sequence that has nothing to do with a
> timestamp, assigning a timestamp to a file is not an intuitive thing to do
> and would just create confusion for users, especially when it's used as an
> example for new users.
>
> How about having a separate class called SequenceWindow? And perhaps
> TimeWindow can inherit from it?
>
> David

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by David Yan <da...@gmail.com>.
I now see your rationale on putting the filename in the window.
As far as I understand, the reasons why the filename is not part of the key
and the Global Window is not used are:

1) The files are processed in sequence, not in parallel
2) The windowed operator should not keep the state associated with the file
when the processing of the file is done
3) The trigger should be fired for the file when a file is done processing.

However, if the file is just a sequence that has nothing to do with a
timestamp, assigning a timestamp to a file is not an intuitive thing to do
and would just create confusion for users, especially when it's used as an
example for new users.

How about having a separate class called SequenceWindow? And perhaps
TimeWindow can inherit from it?
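
To make the idea a bit more concrete, here is a rough sketch of what I have
in mind (names are hypothetical and not tied to the current Malhar Window
interface, this is just an illustration):

// Hypothetical sketch only, not an existing Malhar API.
// A window identified by a monotonically increasing sequence number
// (e.g. the n-th file in arrival order) rather than an event-time timestamp.
public class SequenceWindow
{
  private final long sequence;

  public SequenceWindow(long sequence)
  {
    this.sequence = sequence;
  }

  public long getSequence()
  {
    return sequence;
  }

  // A time window could then be seen as a sequence window whose sequence
  // happens to be the begin timestamp, plus a duration.
  public static class TimeWindow extends SequenceWindow
  {
    private final long durationMillis;

    public TimeWindow(long beginTimestamp, long durationMillis)
    {
      super(beginTimestamp);
      this.durationMillis = durationMillis;
    }

    public long getBeginTimestamp()
    {
      return getSequence();
    }

    public long getDurationMillis()
    {
      return durationMillis;
    }
  }
}

That way the file-per-window use case would not have to pretend that the
file index is a timestamp.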

David


Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by Thomas Weise <th...@apache.org>.
On Mon, Feb 27, 2017 at 8:50 AM, Bhupesh Chawda <bh...@datatorrent.com>
wrote:

> I think my comments related to count based windows might be causing
> confusion. Let's not discuss count based scenarios for now.
>
> Just want to make sure we are on the same page wrt. the "each file is a
> batch" use case. As mentioned by Thomas, the each tuple from the same file
> has the same timestamp (which is just a sequence number) and that helps
> keep tuples from each file in a separate window.
>

Yes, in this case it is a sequence number, but it could also be a timestamp,
depending on the file naming convention. And if it were event time
processing, the watermark would be derived from records within the file.

Agreed, the source should have a mechanism to control the timestamp
extraction along with everything else pertaining to the watermark
generation.


> We could also implement a "timestampExtractor" interface to identify the
> timestamp (sequence number) for a file.
>
> ~ Bhupesh
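
Something along these lines could work as a starting point (hypothetical
sketch only; the names and signatures are just for illustration, not an
existing operator API):

// The file input operator could delegate the window timestamp (or sequence
// number) assignment for a file to a pluggable extractor.
public interface TimestampExtractor
{
  // Returns the timestamp / sequence number used for all tuples of the file.
  long extractTimestamp(String fileName, long fileSequence);
}

// Default behavior: use the position of the file in the scan order.
class SequenceExtractor implements TimestampExtractor
{
  @Override
  public long extractTimestamp(String fileName, long fileSequence)
  {
    return fileSequence;
  }
}

// Alternative: derive an event time from the file name, e.g.
// "events-20170227.dat" -> epoch millis for 2017-02-27 (assumes the name
// contains exactly one yyyyMMdd date and no other digits).
class FileNameDateExtractor implements TimestampExtractor
{
  @Override
  public long extractTimestamp(String fileName, long fileSequence)
  {
    String digits = fileName.replaceAll("\\D", "");
    return java.time.LocalDate
        .parse(digits, java.time.format.DateTimeFormatter.BASIC_ISO_DATE)
        .atStartOfDay(java.time.ZoneOffset.UTC)
        .toInstant()
        .toEpochMilli();
  }
}

The operator would invoke the configured extractor once per file and attach
the result to the watermark and to the tuples of that file.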
> On Mon, Feb 27, 2017 at 9:52 PM, Thomas Weise <th...@apache.org> wrote:
>
> > I don't think this is a use case for count based window.
> >
> > We have multiple files that are retrieved in a sequence and there is no
> > knowledge of the number of records per file. The requirement is to
> > aggregate each file separately and emit the aggregate when the file is
> read
> > fully. There is no concept of "end of something" for an individual key
> and
> > global window isn't applicable.
> >
> > However, as already explained and implemented by Bhupesh, this can be
> > solved using watermark and window (in this case the window timestamp
> isn't
> > a timestamp, but a file sequence, but that doesn't matter.
> >
> > Thomas
> >
> >
> > On Mon, Feb 27, 2017 at 8:05 AM, David Yan <da...@gmail.com> wrote:
> >
> > > I don't think this is the way to go. Global Window only means the
> > timestamp
> > > does not matter (or that there is no timestamp). It does not
> necessarily
> > > mean it's a large batch. Unless there is some notion of event time for
> > each
> > > file, you don't want to embed the file into the window itself.
> > >
> > > If you want the result broken up by file name, and if the files are to
> be
> > > processed in parallel, I think making the file name be part of the key
> is
> > > the way to go. I think it's very confusing if we somehow make the file
> to
> > > be part of the window.
> > >
> > > For count-based window, it's not implemented yet and you're welcome to
> > add
> > > that feature. In case of count-based windows, there would be no notion
> of
> > > time and you probably only trigger at the end of each window. In the
> case
> > > of count-based windows, the watermark only matters for batch since you
> > need
> > > a way to know when the batch has ended (if the count is 10, the number
> of
> > > tuples in the batch is let's say 105, you need a way to end the last
> > window
> > > with 5 tuples).
> > >
> > > David
> > >
> > > On Mon, Feb 27, 2017 at 2:41 AM, Bhupesh Chawda <
> bhupesh@datatorrent.com
> > >
> > > wrote:
> > >
> > > > Hi David,
> > > >
> > > > Thanks for your comments.
> > > >
> > > > The wordcount example that I created based on the windowed operator
> > does
> > > > processing of word counts per file (each file as a separate batch),
> > i.e.
> > > > process counts for each file and dump into separate files.
> > > > As I understand Global window is for one large batch; i.e. all
> incoming
> > > > data falls into the same batch. This could not be processed using
> > > > GlobalWindow option as we need more than one windows. In this case, I
> > > > configured the windowed operator to have time windows of 1ms each and
> > > > passed data for each file with increasing timestamps: (file1, 1),
> > (file2,
> > > > 2) and so on. Is there a better way of handling this scenario?
> > > >
> > > > Regarding (2 - count based windows), I think there is a trigger
> option
> > to
> > > > process count based windows. In case I want to process every 1000
> > tuples
> > > as
> > > > a batch, I could set the Trigger option to CountTrigger with the
> > > > accumulation set to Discarding. Is this correct?
> > > >
> > > > I agree that (4. Final Watermark) can be done using Global window.
> > > >
> > > > ​~ Bhupesh​
> > > >
> > > > _______________________________________________________
> > > >
> > > > Bhupesh Chawda
> > > >
> > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> > > >
> > > > www.datatorrent.com  |  apex.apache.org
> > > >
> > > >
> > > >
> > > > On Mon, Feb 27, 2017 at 12:18 PM, David Yan <da...@gmail.com>
> > wrote:
> > > >
> > > > > I'm worried that we are making the watermark concept too
> complicated.
> > > > >
> > > > > Watermarks should simply just tell you what windows can be
> considered
> > > > > complete.
> > > > >
> > > > > Point 2 is basically a count-based window. Watermarks do not play a
> > > role
> > > > > here because the window is always complete at the n-th tuple.
> > > > >
> > > > > If I understand correctly, point 3 is for batch processing of
> files.
> > > > Unless
> > > > > the files contain timed events, it sounds to be that this can be
> > > achieved
> > > > > with just a Global Window. For signaling EOF, a watermark with a
> > > > +infinity
> > > > > timestamp can be used so that triggers will be fired upon receipt
> of
> > > that
> > > > > watermark.
> > > > >
> > > > > For point 4, just like what I mentioned above, can be achieved
> with a
> > > > > watermark with a +infinity timestamp.
> > > > >
> > > > > David
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Sat, Feb 18, 2017 at 8:04 AM, Bhupesh Chawda <
> > > bhupesh@datatorrent.com
> > > > >
> > > > > wrote:
> > > > >
> > > > > > Hi Thomas,
> > > > > >
> > > > > > For an input operator which is supposed to generate watermarks
> for
> > > > > > downstream operators, I can think about the following watermarks
> > that
> > > > the
> > > > > > operator can emit:
> > > > > > 1. Time based watermarks (the high watermark / low watermark)
> > > > > > 2. Number of tuple based watermarks (Every n tuples)
> > > > > > 3. File based watermarks (Start file, end file)
> > > > > > 4. Final watermark
> > > > > >
> > > > > > File based watermarks seem to be applicable for batch (file
> based)
> > as
> > > > > well,
> > > > > > and hence I thought of looking at these first. Does this seem to
> be
> > > in
> > > > > line
> > > > > > with the thought process?
> > > > > >
> > > > > > ~ Bhupesh
> > > > > >
> > > > > >
> > > > > >
> > > > > > _______________________________________________________
> > > > > >
> > > > > > Bhupesh Chawda
> > > > > >
> > > > > > Software Engineer
> > > > > >
> > > > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> > > > > >
> > > > > > www.datatorrent.com  |  apex.apache.org
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Thu, Feb 16, 2017 at 10:37 AM, Thomas Weise <th...@apache.org>
> > > wrote:
> > > > > >
> > > > > > > I don't think this should be designed based on a simplistic
> file
> > > > > > > input-output scenario. It would be good to include a stateful
> > > > > > > transformation based on event time.
> > > > > > >
> > > > > > > More complex pipelines contain stateful transformations that
> > depend
> > > > on
> > > > > > > windowing and watermarks. I think we need a watermark concept
> > that
> > > is
> > > > > > based
> > > > > > > on progress in event time (or other monotonic increasing
> > sequence)
> > > > that
> > > > > > > other operators can generically work with.
> > > > > > >
> > > > > > > Note that even file input in many cases can produce time based
> > > > > > watermarks,
> > > > > > > for example when you read part files that are bound by event
> > time.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Thomas
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Feb 15, 2017 at 4:02 AM, Bhupesh Chawda <
> > > > > bhupesh@datatorrent.com
> > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > For better understanding the use case for control tuples in
> > > batch,
> > > > ​I
> > > > > > am
> > > > > > > > creating a prototype for a batch application using File Input
> > and
> > > > > File
> > > > > > > > Output operators.
> > > > > > > >
> > > > > > > > To enable basic batch processing for File IO operators, I am
> > > > > proposing
> > > > > > > the
> > > > > > > > following changes to File input and output operators:
> > > > > > > > 1. File Input operator emits a watermark each time it opens
> and
> > > > > closes
> > > > > > a
> > > > > > > > file. These can be "start file" and "end file" watermarks
> which
> > > > > include
> > > > > > > the
> > > > > > > > corresponding file names. The "start file" tuple should be
> sent
> > > > > before
> > > > > > > any
> > > > > > > > of the data from that file flows.
> > > > > > > > 2. File Input operator can be configured to end the
> application
> > > > > after a
> > > > > > > > single or n scans of the directory (a batch). This is where
> the
> > > > > > operator
> > > > > > > > emits the final watermark (the end of application control
> > tuple).
> > > > > This
> > > > > > > will
> > > > > > > > also shutdown the application.
> > > > > > > > 3. The File output operator handles these control tuples.
> > "Start
> > > > > file"
> > > > > > > > initializes the file name for the incoming tuples. "End file"
> > > > > watermark
> > > > > > > > forces a finalize on that file.
> > > > > > > >
> > > > > > > > The user would be able to enable the operators to send only
> > those
> > > > > > > > watermarks that are needed in the application. If none of the
> > > > options
> > > > > > are
> > > > > > > > configured, the operators behave as in a streaming
> application.
> > > > > > > >
> > > > > > > > There are a few challenges in the implementation where the
> > input
> > > > > > operator
> > > > > > > > is partitioned. In this case, the correlation between the
> > > start/end
> > > > > > for a
> > > > > > > > file and the data tuples for that file is lost. Hence we need
> > to
> > > > > > maintain
> > > > > > > > the filename as part of each tuple in the pipeline.
> > > > > > > >
> > > > > > > > The "start file" and "end file" control tuples in this
> example
> > > are
> > > > > > > > temporary names for watermarks. We can have generic "start
> > > batch" /
> > > > > > "end
> > > > > > > > batch" tuples which could be used for other use cases as
> well.
> > > The
> > > > > > Final
> > > > > > > > watermark is common and serves the same purpose in each case.
> > > > > > > >
> > > > > > > > Please let me know your thoughts on this.
> > > > > > > >
> > > > > > > > ~ Bhupesh
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Jan 18, 2017 at 12:22 AM, Bhupesh Chawda <
> > > > > > > bhupesh@datatorrent.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Yes, this can be part of operator configuration. Given
> this,
> > > for
> > > > a
> > > > > > user
> > > > > > > > to
> > > > > > > > > define a batch application, would mean configuring the
> > > connectors
> > > > > > > (mostly
> > > > > > > > > the input operator) in the application for the desired
> > > behavior.
> > > > > > > > Similarly,
> > > > > > > > > there can be other use cases that can be achieved other
> than
> > > > batch.
> > > > > > > > >
> > > > > > > > > We may also need to take care of the following:
> > > > > > > > > 1. Make sure that the watermarks or control tuples are
> > > consistent
> > > > > > > across
> > > > > > > > > sources. Meaning an HDFS sink should be able to interpret
> the
> > > > > > watermark
> > > > > > > > > tuple sent out by, say, a JDBC source.
> > > > > > > > > 2. In addition to I/O connectors, we should also look at
> the
> > > need
> > > > > for
> > > > > > > > > processing operators to understand some of the control
> > tuples /
> > > > > > > > watermarks.
> > > > > > > > > For example, we may want to reset the operator behavior on
> > > > arrival
> > > > > of
> > > > > > > > some
> > > > > > > > > watermark tuple.
> > > > > > > > >
> > > > > > > > > ~ Bhupesh
> > > > > > > > >
> > > > > > > > > On Tue, Jan 17, 2017 at 9:59 PM, Thomas Weise <
> > thw@apache.org>
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > >> The HDFS source can operate in two modes, bounded or
> > > unbounded.
> > > > If
> > > > > > you
> > > > > > > > >> scan
> > > > > > > > >> only once, then it should emit the final watermark after
> it
> > is
> > > > > done.
> > > > > > > > >> Otherwise it would emit watermarks based on a policy
> (files
> > > > names
> > > > > > > etc.).
> > > > > > > > >> The mechanism to generate the marks may depend on the type
> > of
> > > > > source
> > > > > > > and
> > > > > > > > >> the user needs to be able to influence/configure it.
> > > > > > > > >>
> > > > > > > > >> Thomas
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >> On Tue, Jan 17, 2017 at 5:03 AM, Bhupesh Chawda <
> > > > > > > > bhupesh@datatorrent.com>
> > > > > > > > >> wrote:
> > > > > > > > >>
> > > > > > > > >> > Hi Thomas,
> > > > > > > > >> >
> > > > > > > > >> > I am not sure that I completely understand your
> > suggestion.
> > > > Are
> > > > > > you
> > > > > > > > >> > suggesting to broaden the scope of the proposal to treat
> > all
> > > > > > sources
> > > > > > > > as
> > > > > > > > >> > bounded as well as unbounded?
> > > > > > > > >> >
> > > > > > > > >> > In case of Apex, we treat all sources as unbounded
> > sources.
> > > > Even
> > > > > > > > bounded
> > > > > > > > >> > sources like HDFS file source is treated as unbounded by
> > > means
> > > > > of
> > > > > > > > >> scanning
> > > > > > > > >> > the input directory repeatedly.
> > > > > > > > >> >
> > > > > > > > >> > Let's consider HDFS file source for example:
> > > > > > > > >> > In this case, if we treat it as a bounded source, we can
> > > > define
> > > > > > > hooks
> > > > > > > > >> which
> > > > > > > > >> > allows us to detect the end of the file and send the
> > "final
> > > > > > > > watermark".
> > > > > > > > >> We
> > > > > > > > >> > could also consider HDFS file source as a streaming
> source
> > > and
> > > > > > > define
> > > > > > > > >> hooks
> > > > > > > > >> > which send watermarks based on different kinds of
> windows.
> > > > > > > > >> >
> > > > > > > > >> > Please correct me if I misunderstand.
> > > > > > > > >> >
> > > > > > > > >> > ~ Bhupesh
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> > On Mon, Jan 16, 2017 at 9:23 PM, Thomas Weise <
> > > thw@apache.org
> > > > >
> > > > > > > wrote:
> > > > > > > > >> >
> > > > > > > > >> > > Bhupesh,
> > > > > > > > >> > >
> > > > > > > > >> > > Please see how that can be solved in a unified way
> using
> > > > > windows
> > > > > > > and
> > > > > > > > >> > > watermarks. It is bounded data vs. unbounded data. In
> > Beam
> > > > for
> > > > > > > > >> example,
> > > > > > > > >> > you
> > > > > > > > >> > > can use the "global window" and the final watermark to
> > > > > > accomplish
> > > > > > > > what
> > > > > > > > >> > you
> > > > > > > > >> > > are looking for. Batch is just a special case of
> > streaming
> > > > > where
> > > > > > > the
> > > > > > > > >> > source
> > > > > > > > >> > > emits the final watermark.
> > > > > > > > >> > >
> > > > > > > > >> > > Thanks,
> > > > > > > > >> > > Thomas
> > > > > > > > >> > >
> > > > > > > > >> > >
> > > > > > > > >> > > On Mon, Jan 16, 2017 at 1:02 AM, Bhupesh Chawda <
> > > > > > > > >> bhupesh@datatorrent.com
> > > > > > > > >> > >
> > > > > > > > >> > > wrote:
> > > > > > > > >> > >
> > > > > > > > >> > > > Yes, if the user needs to develop a batch
> application,
> > > > then
> > > > > > > batch
> > > > > > > > >> aware
> > > > > > > > >> > > > operators need to be used in the application.
> > > > > > > > >> > > > The nature of the application is mostly controlled
> by
> > > the
> > > > > > input
> > > > > > > > and
> > > > > > > > >> the
> > > > > > > > >> > > > output operators used in the application.
> > > > > > > > >> > > >
> > > > > > > > >> > > > For example, consider an application which needs to
> > > filter
> > > > > > > records
> > > > > > > > >> in a
> > > > > > > > >> > > > input file and store the filtered records in another
> > > file.
> > > > > The
> > > > > > > > >> nature
> > > > > > > > >> > of
> > > > > > > > >> > > > this app is to end once the entire file is
> processed.
> > > > > > Following
> > > > > > > > >> things
> > > > > > > > >> > > are
> > > > > > > > >> > > > expected of the application:
> > > > > > > > >> > > >
> > > > > > > > >> > > >    1. Once the input data is over, finalize the
> output
> > > > file
> > > > > > from
> > > > > > > > >> .tmp
> > > > > > > > >> > > >    files. - Responsibility of output operator
> > > > > > > > >> > > >    2. End the application, once the data is read and
> > > > > > processed -
> > > > > > > > >> > > >    Responsibility of input operator
> > > > > > > > >> > > >
> > > > > > > > >> > > > These functions are essential to allow the user to
> do
> > > > higher
> > > > > > > level
> > > > > > > > >> > > > operations like scheduling or running a workflow of
> > > batch
> > > > > > > > >> applications.
> > > > > > > > >> > > >
> > > > > > > > >> > > > I am not sure about intermediate (processing)
> > operators,
> > > > as
> > > > > > > there
> > > > > > > > >> is no
> > > > > > > > >> > > > change in their functionality for batch use cases.
> > > > Perhaps,
> > > > > > > > allowing
> > > > > > > > >> > > > running multiple batches in a single application may
> > > > require
> > > > > > > > similar
> > > > > > > > >> > > > changes in processing operators as well.
> > > > > > > > >> > > >
> > > > > > > > >> > > > ~ Bhupesh
> > > > > > > > >> > > >
> > > > > > > > >> > > > On Mon, Jan 16, 2017 at 2:19 PM, Priyanka Gugale <
> > > > > > > > priyag@apache.org
> > > > > > > > >> >
> > > > > > > > >> > > > wrote:
> > > > > > > > >> > > >
> > > > > > > > >> > > > > Will it make an impression on user that, if he
> has a
> > > > batch
> > > > > > > > >> usecase he
> > > > > > > > >> > > has
> > > > > > > > >> > > > > to use batch aware operators only? If so, is that
> > what
> > > > we
> > > > > > > > expect?
> > > > > > > > >> I
> > > > > > > > >> > am
> > > > > > > > >> > > > not
> > > > > > > > >> > > > > aware of how do we implement batch scenario so
> this
> > > > might
> > > > > > be a
> > > > > > > > >> basic
> > > > > > > > >> > > > > question.
> > > > > > > > >> > > > >
> > > > > > > > >> > > > > -Priyanka
> > > > > > > > >> > > > >
> > > > > > > > >> > > > > On Mon, Jan 16, 2017 at 12:02 PM, Bhupesh Chawda <
> > > > > > > > >> > > > bhupesh@datatorrent.com>
> > > > > > > > >> > > > > wrote:
> > > > > > > > >> > > > >
> > > > > > > > >> > > > > > Hi All,
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > > While design / implementation for custom control
> > > > tuples
> > > > > is
> > > > > > > > >> > ongoing, I
> > > > > > > > >> > > > > > thought it would be a good idea to consider its
> > > > > usefulness
> > > > > > > in
> > > > > > > > >> one
> > > > > > > > >> > of
> > > > > > > > >> > > > the
> > > > > > > > >> > > > > > use cases -  batch applications.
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > > This is a proposal to adapt / extend existing
> > > > operators
> > > > > in
> > > > > > > the
> > > > > > > > >> > Apache
> > > > > > > > >> > > > > Apex
> > > > > > > > >> > > > > > Malhar library so that it is easy to use them in
> > > batch
> > > > > use
> > > > > > > > >> cases.
> > > > > > > > >> > > > > > Naturally, this would be applicable for only a
> > > subset
> > > > of
> > > > > > > > >> operators
> > > > > > > > >> > > like
> > > > > > > > >> > > > > > File, JDBC and NoSQL databases.
> > > > > > > > >> > > > > > For example, for a file based store, (say HDFS
> > > store),
> > > > > we
> > > > > > > > could
> > > > > > > > >> > have
> > > > > > > > >> > > > > > FileBatchInput and FileBatchOutput operators
> which
> > > > allow
> > > > > > > easy
> > > > > > > > >> > > > integration
> > > > > > > > >> > > > > > into a batch application. These operators would
> be
> > > > > > extended
> > > > > > > > from
> > > > > > > > >> > > their
> > > > > > > > >> > > > > > existing implementations and would be "Batch
> > Aware",
> > > > in
> > > > > > that
> > > > > > > > >> they
> > > > > > > > >> > may
> > > > > > > > >> > > > > > understand the meaning of some specific control
> > > tuples
> > > > > > that
> > > > > > > > flow
> > > > > > > > >> > > > through
> > > > > > > > >> > > > > > the DAG. Start batch and end batch seem to be
> the
> > > > > obvious
> > > > > > > > >> > candidates
> > > > > > > > >> > > > that
> > > > > > > > >> > > > > > come to mind. On receipt of such control tuples,
> > > they
> > > > > may
> > > > > > > try
> > > > > > > > to
> > > > > > > > >> > > modify
> > > > > > > > >> > > > > the
> > > > > > > > >> > > > > > behavior of the operator - to reinitialize some
> > > > metrics
> > > > > or
> > > > > > > > >> finalize
> > > > > > > > >> > > an
> > > > > > > > >> > > > > > output file for example.
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > > We can discuss the potential control tuples and
> > > > actions
> > > > > in
> > > > > > > > >> detail,
> > > > > > > > >> > > but
> > > > > > > > >> > > > > > first I would like to understand the views of
> the
> > > > > > community
> > > > > > > > for
> > > > > > > > >> > this
> > > > > > > > >> > > > > > proposal.
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > > ~ Bhupesh
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > >
> > > > > > > > >> > > >
> > > > > > > > >> > >
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by Bhupesh Chawda <bh...@datatorrent.com>.
I think my comments related to count based windows might be causing
confusion. Let's not discuss count based scenarios for now.

Just want to make sure we are on the same page wrt. the "each file is a
batch" use case. As mentioned by Thomas, the each tuple from the same file
has the same timestamp (which is just a sequence number) and that helps
keep tuples from each file in a separate window.

We could also implement a "timestampExtractor" interface to identify the
timestamp (sequence number) for a file.
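
As a rough sketch (Java; "TimestampExtractor" is only a proposed name here, not
an existing Malhar interface), such an extractor could simply map each file to a
monotonically increasing sequence number:

// Illustrative only. The extracted value is used as the window timestamp, so
// all tuples from the same file land in the same window.
public interface TimestampExtractor
{
  long getTimestamp(String fileName);
}

// Simplest possible implementation: assign sequence numbers in the order in
// which files are first seen.
class ArrivalOrderExtractor implements TimestampExtractor
{
  private final java.util.Map<String, Long> sequences = new java.util.HashMap<>();
  private long nextSequence = 0;

  @Override
  public long getTimestamp(String fileName)
  {
    return sequences.computeIfAbsent(fileName, f -> nextSequence++);
  }
}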

~ Bhupesh



Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by Thomas Weise <th...@apache.org>.
I don't think this is a use case for a count-based window.

We have multiple files that are retrieved in a sequence and there is no
knowledge of the number of records per file. The requirement is to
aggregate each file separately and emit the aggregate when the file is read
fully. There is no concept of "end of something" for an individual key and
global window isn't applicable.

However, as already explained and implemented by Bhupesh, this can be
solved using watermarks and windows (in this case the window timestamp isn't
an actual timestamp but a file sequence number, which doesn't matter).
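
To make that concrete, a small self-contained sketch (plain Java, not actual
operator code) of per-file aggregation driven by watermarks, where the window
timestamp is just the file sequence number:

import java.util.HashMap;
import java.util.Map;

// Sketch only: one aggregate per "window", where the window timestamp is the
// sequence number of the file a tuple came from. When the watermark for a file
// sequence arrives (the file has been read fully), the aggregate for that
// window is emitted and its state dropped.
class PerFileAggregation
{
  private final Map<Long, Long> countsBySequence = new HashMap<>();

  void onTuple(long fileSequence)
  {
    countsBySequence.merge(fileSequence, 1L, Long::sum);
  }

  void onWatermark(long completedFileSequence)
  {
    Long count = countsBySequence.remove(completedFileSequence);
    if (count != null) {
      System.out.println("file " + completedFileSequence + ": " + count + " records");
    }
  }

  public static void main(String[] args)
  {
    PerFileAggregation agg = new PerFileAggregation();
    for (int i = 0; i < 3; i++) {
      agg.onTuple(0); // three records from file 0
    }
    agg.onWatermark(0); // file 0 fully read -> emit its aggregate
    for (int i = 0; i < 5; i++) {
      agg.onTuple(1); // five records from file 1
    }
    agg.onWatermark(1); // file 1 fully read -> emit its aggregate
  }
}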

Thomas



Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by David Yan <da...@gmail.com>.
I don't think this is the way to go. Global Window only means the timestamp
does not matter (or that there is no timestamp). It does not necessarily
mean it's a large batch. Unless there is some notion of event time for each
file, you don't want to embed the file into the window itself.

If you want the result broken up by file name, and if the files are to be
processed in parallel, I think making the file name part of the key is the
way to go. It's very confusing if we somehow make the file part of the
window.
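
As a rough illustration of keying by file (a minimal sketch in plain Java,
not the Malhar windowed-operator API; the composite key encoding is just an
assumption for the example):

import java.util.HashMap;
import java.util.Map;

class FileKeyedWordCount
{
  // key = fileName + '\0' + word, value = running count
  private final Map<String, Long> counts = new HashMap<>();

  void process(String fileName, String word)
  {
    counts.merge(fileName + "\u0000" + word, 1L, Long::sum);
  }

  // On the "end file" (or final) watermark for fileName: emit and clear
  // only the entries that belong to that file.
  Map<String, Long> flushFile(String fileName)
  {
    Map<String, Long> result = new HashMap<>();
    counts.entrySet().removeIf(e -> {
      boolean match = e.getKey().startsWith(fileName + "\u0000");
      if (match) {
        result.put(e.getKey(), e.getValue());
      }
      return match;
    });
    return result;
  }
}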

Count-based windows are not implemented yet, and you're welcome to add that
feature. With count-based windows there would be no notion of time, and you
would probably only trigger at the end of each window. In that case the
watermark only matters for batch, since you need a way to know when the batch
has ended (if the window count is 10 and the batch contains, say, 105 tuples,
you need a way to end the last window with only 5 tuples).
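
A minimal sketch of that last point (plain Java, a hypothetical class rather
than existing Malhar code): a count-based window of size 10 where the final
watermark closes the trailing partial window of 5 tuples.

import java.util.ArrayList;
import java.util.List;

class CountWindow<T>
{
  private final int size = 10;
  private final List<T> window = new ArrayList<>();

  void onTuple(T t)
  {
    window.add(t);
    if (window.size() == size) {
      emitWindow();      // window is complete at the n-th tuple, no watermark needed
    }
  }

  void onFinalWatermark()
  {
    if (!window.isEmpty()) {
      emitWindow();      // batch ended: close the last window with fewer than 'size' tuples
    }
  }

  private void emitWindow()
  {
    System.out.println("window of " + window.size() + " tuples: " + window);
    window.clear();
  }
}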

David

On Mon, Feb 27, 2017 at 2:41 AM, Bhupesh Chawda <bh...@datatorrent.com>
wrote:

> Hi David,
>
> Thanks for your comments.
>
> The wordcount example that I created based on the windowed operator does
> processing of word counts per file (each file as a separate batch), i.e.
> process counts for each file and dump into separate files.
> As I understand Global window is for one large batch; i.e. all incoming
> data falls into the same batch. This could not be processed using
> GlobalWindow option as we need more than one windows. In this case, I
> configured the windowed operator to have time windows of 1ms each and
> passed data for each file with increasing timestamps: (file1, 1), (file2,
> 2) and so on. Is there a better way of handling this scenario?
>
> Regarding (2 - count based windows), I think there is a trigger option to
> process count based windows. In case I want to process every 1000 tuples as
> a batch, I could set the Trigger option to CountTrigger with the
> accumulation set to Discarding. Is this correct?
>
> I agree that (4. Final Watermark) can be done using Global window.
>
> ​~ Bhupesh​
>
> _______________________________________________________
>
> Bhupesh Chawda
>
> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>
> www.datatorrent.com  |  apex.apache.org
>
>
>
> On Mon, Feb 27, 2017 at 12:18 PM, David Yan <da...@gmail.com> wrote:
>
> > I'm worried that we are making the watermark concept too complicated.
> >
> > Watermarks should simply just tell you what windows can be considered
> > complete.
> >
> > Point 2 is basically a count-based window. Watermarks do not play a role
> > here because the window is always complete at the n-th tuple.
> >
> > If I understand correctly, point 3 is for batch processing of files.
> Unless
> > the files contain timed events, it sounds to be that this can be achieved
> > with just a Global Window. For signaling EOF, a watermark with a
> +infinity
> > timestamp can be used so that triggers will be fired upon receipt of that
> > watermark.
> >
> > For point 4, just like what I mentioned above, can be achieved with a
> > watermark with a +infinity timestamp.
> >
> > David
> >
> >
> >
> >
> > On Sat, Feb 18, 2017 at 8:04 AM, Bhupesh Chawda <bhupesh@datatorrent.com
> >
> > wrote:
> >
> > > Hi Thomas,
> > >
> > > For an input operator which is supposed to generate watermarks for
> > > downstream operators, I can think about the following watermarks that
> the
> > > operator can emit:
> > > 1. Time based watermarks (the high watermark / low watermark)
> > > 2. Number of tuple based watermarks (Every n tuples)
> > > 3. File based watermarks (Start file, end file)
> > > 4. Final watermark
> > >
> > > File based watermarks seem to be applicable for batch (file based) as
> > well,
> > > and hence I thought of looking at these first. Does this seem to be in
> > line
> > > with the thought process?
> > >
> > > ~ Bhupesh
> > >
> > >
> > >
> > > _______________________________________________________
> > >
> > > Bhupesh Chawda
> > >
> > > Software Engineer
> > >
> > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> > >
> > > www.datatorrent.com  |  apex.apache.org
> > >
> > >
> > >
> > > On Thu, Feb 16, 2017 at 10:37 AM, Thomas Weise <th...@apache.org> wrote:
> > >
> > > > I don't think this should be designed based on a simplistic file
> > > > input-output scenario. It would be good to include a stateful
> > > > transformation based on event time.
> > > >
> > > > More complex pipelines contain stateful transformations that depend
> on
> > > > windowing and watermarks. I think we need a watermark concept that is
> > > based
> > > > on progress in event time (or other monotonic increasing sequence)
> that
> > > > other operators can generically work with.
> > > >
> > > > Note that even file input in many cases can produce time based
> > > watermarks,
> > > > for example when you read part files that are bound by event time.
> > > >
> > > > Thanks,
> > > > Thomas
> > > >
> > > >
> > > > On Wed, Feb 15, 2017 at 4:02 AM, Bhupesh Chawda <
> > bhupesh@datatorrent.com
> > > >
> > > > wrote:
> > > >
> > > > > For better understanding the use case for control tuples in batch,
> ​I
> > > am
> > > > > creating a prototype for a batch application using File Input and
> > File
> > > > > Output operators.
> > > > >
> > > > > To enable basic batch processing for File IO operators, I am
> > proposing
> > > > the
> > > > > following changes to File input and output operators:
> > > > > 1. File Input operator emits a watermark each time it opens and
> > closes
> > > a
> > > > > file. These can be "start file" and "end file" watermarks which
> > include
> > > > the
> > > > > corresponding file names. The "start file" tuple should be sent
> > before
> > > > any
> > > > > of the data from that file flows.
> > > > > 2. File Input operator can be configured to end the application
> > after a
> > > > > single or n scans of the directory (a batch). This is where the
> > > operator
> > > > > emits the final watermark (the end of application control tuple).
> > This
> > > > will
> > > > > also shutdown the application.
> > > > > 3. The File output operator handles these control tuples. "Start
> > file"
> > > > > initializes the file name for the incoming tuples. "End file"
> > watermark
> > > > > forces a finalize on that file.
> > > > >
> > > > > The user would be able to enable the operators to send only those
> > > > > watermarks that are needed in the application. If none of the
> options
> > > are
> > > > > configured, the operators behave as in a streaming application.
> > > > >
> > > > > There are a few challenges in the implementation where the input
> > > operator
> > > > > is partitioned. In this case, the correlation between the start/end
> > > for a
> > > > > file and the data tuples for that file is lost. Hence we need to
> > > maintain
> > > > > the filename as part of each tuple in the pipeline.
> > > > >
> > > > > The "start file" and "end file" control tuples in this example are
> > > > > temporary names for watermarks. We can have generic "start batch" /
> > > "end
> > > > > batch" tuples which could be used for other use cases as well. The
> > > Final
> > > > > watermark is common and serves the same purpose in each case.
> > > > >
> > > > > Please let me know your thoughts on this.
> > > > >
> > > > > ~ Bhupesh
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Jan 18, 2017 at 12:22 AM, Bhupesh Chawda <
> > > > bhupesh@datatorrent.com>
> > > > > wrote:
> > > > >
> > > > > > Yes, this can be part of operator configuration. Given this, for
> a
> > > user
> > > > > to
> > > > > > define a batch application, would mean configuring the connectors
> > > > (mostly
> > > > > > the input operator) in the application for the desired behavior.
> > > > > Similarly,
> > > > > > there can be other use cases that can be achieved other than
> batch.
> > > > > >
> > > > > > We may also need to take care of the following:
> > > > > > 1. Make sure that the watermarks or control tuples are consistent
> > > > across
> > > > > > sources. Meaning an HDFS sink should be able to interpret the
> > > watermark
> > > > > > tuple sent out by, say, a JDBC source.
> > > > > > 2. In addition to I/O connectors, we should also look at the need
> > for
> > > > > > processing operators to understand some of the control tuples /
> > > > > watermarks.
> > > > > > For example, we may want to reset the operator behavior on
> arrival
> > of
> > > > > some
> > > > > > watermark tuple.
> > > > > >
> > > > > > ~ Bhupesh
> > > > > >
> > > > > > On Tue, Jan 17, 2017 at 9:59 PM, Thomas Weise <th...@apache.org>
> > > wrote:
> > > > > >
> > > > > >> The HDFS source can operate in two modes, bounded or unbounded.
> If
> > > you
> > > > > >> scan
> > > > > >> only once, then it should emit the final watermark after it is
> > done.
> > > > > >> Otherwise it would emit watermarks based on a policy (files
> names
> > > > etc.).
> > > > > >> The mechanism to generate the marks may depend on the type of
> > source
> > > > and
> > > > > >> the user needs to be able to influence/configure it.
> > > > > >>
> > > > > >> Thomas
> > > > > >>
> > > > > >>
> > > > > >> On Tue, Jan 17, 2017 at 5:03 AM, Bhupesh Chawda <
> > > > > bhupesh@datatorrent.com>
> > > > > >> wrote:
> > > > > >>
> > > > > >> > Hi Thomas,
> > > > > >> >
> > > > > >> > I am not sure that I completely understand your suggestion.
> Are
> > > you
> > > > > >> > suggesting to broaden the scope of the proposal to treat all
> > > sources
> > > > > as
> > > > > >> > bounded as well as unbounded?
> > > > > >> >
> > > > > >> > In case of Apex, we treat all sources as unbounded sources.
> Even
> > > > > bounded
> > > > > >> > sources like HDFS file source is treated as unbounded by means
> > of
> > > > > >> scanning
> > > > > >> > the input directory repeatedly.
> > > > > >> >
> > > > > >> > Let's consider HDFS file source for example:
> > > > > >> > In this case, if we treat it as a bounded source, we can
> define
> > > > hooks
> > > > > >> which
> > > > > >> > allows us to detect the end of the file and send the "final
> > > > > watermark".
> > > > > >> We
> > > > > >> > could also consider HDFS file source as a streaming source and
> > > > define
> > > > > >> hooks
> > > > > >> > which send watermarks based on different kinds of windows.
> > > > > >> >
> > > > > >> > Please correct me if I misunderstand.
> > > > > >> >
> > > > > >> > ~ Bhupesh
> > > > > >> >
> > > > > >> >
> > > > > >> > On Mon, Jan 16, 2017 at 9:23 PM, Thomas Weise <thw@apache.org
> >
> > > > wrote:
> > > > > >> >
> > > > > >> > > Bhupesh,
> > > > > >> > >
> > > > > >> > > Please see how that can be solved in a unified way using
> > windows
> > > > and
> > > > > >> > > watermarks. It is bounded data vs. unbounded data. In Beam
> for
> > > > > >> example,
> > > > > >> > you
> > > > > >> > > can use the "global window" and the final watermark to
> > > accomplish
> > > > > what
> > > > > >> > you
> > > > > >> > > are looking for. Batch is just a special case of streaming
> > where
> > > > the
> > > > > >> > source
> > > > > >> > > emits the final watermark.
> > > > > >> > >
> > > > > >> > > Thanks,
> > > > > >> > > Thomas
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > On Mon, Jan 16, 2017 at 1:02 AM, Bhupesh Chawda <
> > > > > >> bhupesh@datatorrent.com
> > > > > >> > >
> > > > > >> > > wrote:
> > > > > >> > >
> > > > > >> > > > Yes, if the user needs to develop a batch application,
> then
> > > > batch
> > > > > >> aware
> > > > > >> > > > operators need to be used in the application.
> > > > > >> > > > The nature of the application is mostly controlled by the
> > > input
> > > > > and
> > > > > >> the
> > > > > >> > > > output operators used in the application.
> > > > > >> > > >
> > > > > >> > > > For example, consider an application which needs to filter
> > > > records
> > > > > >> in a
> > > > > >> > > > input file and store the filtered records in another file.
> > The
> > > > > >> nature
> > > > > >> > of
> > > > > >> > > > this app is to end once the entire file is processed.
> > > Following
> > > > > >> things
> > > > > >> > > are
> > > > > >> > > > expected of the application:
> > > > > >> > > >
> > > > > >> > > >    1. Once the input data is over, finalize the output
> file
> > > from
> > > > > >> .tmp
> > > > > >> > > >    files. - Responsibility of output operator
> > > > > >> > > >    2. End the application, once the data is read and
> > > processed -
> > > > > >> > > >    Responsibility of input operator
> > > > > >> > > >
> > > > > >> > > > These functions are essential to allow the user to do
> higher
> > > > level
> > > > > >> > > > operations like scheduling or running a workflow of batch
> > > > > >> applications.
> > > > > >> > > >
> > > > > >> > > > I am not sure about intermediate (processing) operators,
> as
> > > > there
> > > > > >> is no
> > > > > >> > > > change in their functionality for batch use cases.
> Perhaps,
> > > > > allowing
> > > > > >> > > > running multiple batches in a single application may
> require
> > > > > similar
> > > > > >> > > > changes in processing operators as well.
> > > > > >> > > >
> > > > > >> > > > ~ Bhupesh
> > > > > >> > > >
> > > > > >> > > > On Mon, Jan 16, 2017 at 2:19 PM, Priyanka Gugale <
> > > > > priyag@apache.org
> > > > > >> >
> > > > > >> > > > wrote:
> > > > > >> > > >
> > > > > >> > > > > Will it make an impression on user that, if he has a
> batch
> > > > > >> usecase he
> > > > > >> > > has
> > > > > >> > > > > to use batch aware operators only? If so, is that what
> we
> > > > > expect?
> > > > > >> I
> > > > > >> > am
> > > > > >> > > > not
> > > > > >> > > > > aware of how do we implement batch scenario so this
> might
> > > be a
> > > > > >> basic
> > > > > >> > > > > question.
> > > > > >> > > > >
> > > > > >> > > > > -Priyanka
> > > > > >> > > > >
> > > > > >> > > > > On Mon, Jan 16, 2017 at 12:02 PM, Bhupesh Chawda <
> > > > > >> > > > bhupesh@datatorrent.com>
> > > > > >> > > > > wrote:
> > > > > >> > > > >
> > > > > >> > > > > > Hi All,
> > > > > >> > > > > >
> > > > > >> > > > > > While design / implementation for custom control
> tuples
> > is
> > > > > >> > ongoing, I
> > > > > >> > > > > > thought it would be a good idea to consider its
> > usefulness
> > > > in
> > > > > >> one
> > > > > >> > of
> > > > > >> > > > the
> > > > > >> > > > > > use cases -  batch applications.
> > > > > >> > > > > >
> > > > > >> > > > > > This is a proposal to adapt / extend existing
> operators
> > in
> > > > the
> > > > > >> > Apache
> > > > > >> > > > > Apex
> > > > > >> > > > > > Malhar library so that it is easy to use them in batch
> > use
> > > > > >> cases.
> > > > > >> > > > > > Naturally, this would be applicable for only a subset
> of
> > > > > >> operators
> > > > > >> > > like
> > > > > >> > > > > > File, JDBC and NoSQL databases.
> > > > > >> > > > > > For example, for a file based store, (say HDFS store),
> > we
> > > > > could
> > > > > >> > have
> > > > > >> > > > > > FileBatchInput and FileBatchOutput operators which
> allow
> > > > easy
> > > > > >> > > > integration
> > > > > >> > > > > > into a batch application. These operators would be
> > > extended
> > > > > from
> > > > > >> > > their
> > > > > >> > > > > > existing implementations and would be "Batch Aware",
> in
> > > that
> > > > > >> they
> > > > > >> > may
> > > > > >> > > > > > understand the meaning of some specific control tuples
> > > that
> > > > > flow
> > > > > >> > > > through
> > > > > >> > > > > > the DAG. Start batch and end batch seem to be the
> > obvious
> > > > > >> > candidates
> > > > > >> > > > that
> > > > > >> > > > > > come to mind. On receipt of such control tuples, they
> > may
> > > > try
> > > > > to
> > > > > >> > > modify
> > > > > >> > > > > the
> > > > > >> > > > > > behavior of the operator - to reinitialize some
> metrics
> > or
> > > > > >> finalize
> > > > > >> > > an
> > > > > >> > > > > > output file for example.
> > > > > >> > > > > >
> > > > > >> > > > > > We can discuss the potential control tuples and
> actions
> > in
> > > > > >> detail,
> > > > > >> > > but
> > > > > >> > > > > > first I would like to understand the views of the
> > > community
> > > > > for
> > > > > >> > this
> > > > > >> > > > > > proposal.
> > > > > >> > > > > >
> > > > > >> > > > > > ~ Bhupesh
> > > > > >> > > > > >
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by Bhupesh Chawda <bh...@datatorrent.com>.
Hi David,

Thanks for your comments.

The wordcount example that I created based on the windowed operator computes
word counts per file (each file as a separate batch), i.e. it processes the
counts for each file and dumps them into separate output files.
As I understand it, a Global window is for one large batch, i.e. all incoming
data falls into the same batch. That could not be used here, since the
GlobalWindow option gives us only one window and we need more than one. In
this case, I configured the windowed operator to have time windows of 1 ms
each and passed the data for each file with increasing timestamps: (file1, 1),
(file2, 2) and so on. Is there a better way of handling this scenario?
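
A sketch of that workaround (illustrative only; the class and method names
are made up for the example): each file gets a monotonically increasing
sequence number that is used as the tuple timestamp, so 1 ms time windows
isolate one file per window.

class PerFileTimestamper
{
  private long fileSeq = -1;

  // Called on the "start file" control tuple.
  void onStartFile(String fileName)
  {
    fileSeq++;                  // file1 -> 0, file2 -> 1, ...
  }

  // Every data tuple read from the current file carries the same timestamp,
  // so the windowed operator keeps one window (and one state) per file.
  long currentTimestamp()
  {
    return fileSeq;
  }
}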

Regarding (2 - count-based windows), I think there is a trigger option to
process count-based windows. If I want to process every 1000 tuples as a
batch, I could set the trigger option to CountTrigger with the accumulation
mode set to Discarding. Is this correct?
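
If that is the right mechanism, the configuration might look roughly like the
sketch below. The TriggerOption and GlobalWindow names here are my
assumptions (patterned on the Beam-style trigger API) and may not match the
actual Malhar signatures:

// Rough sketch; method names are assumptions, not verified API:
windowedOperator.setWindowOption(new WindowOption.GlobalWindow());
windowedOperator.setTriggerOption(
    TriggerOption.AtWatermark()
        .withEarlyFiringsAtEvery(1000)   // treat every 1000 tuples as one batch
        .discardingFiredPanes());        // "Discarding": state is reset after each firing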

I agree that (4. Final Watermark) can be done using Global window.

​~ Bhupesh​

_______________________________________________________

Bhupesh Chawda

E: bhupesh@datatorrent.com | Twitter: @bhupeshsc

www.datatorrent.com  |  apex.apache.org



On Mon, Feb 27, 2017 at 12:18 PM, David Yan <da...@gmail.com> wrote:

> I'm worried that we are making the watermark concept too complicated.
>
> Watermarks should simply just tell you what windows can be considered
> complete.
>
> Point 2 is basically a count-based window. Watermarks do not play a role
> here because the window is always complete at the n-th tuple.
>
> If I understand correctly, point 3 is for batch processing of files. Unless
> the files contain timed events, it sounds to be that this can be achieved
> with just a Global Window. For signaling EOF, a watermark with a +infinity
> timestamp can be used so that triggers will be fired upon receipt of that
> watermark.
>
> For point 4, just like what I mentioned above, can be achieved with a
> watermark with a +infinity timestamp.
>
> David
>
>
>
>
> On Sat, Feb 18, 2017 at 8:04 AM, Bhupesh Chawda <bh...@datatorrent.com>
> wrote:
>
> > Hi Thomas,
> >
> > For an input operator which is supposed to generate watermarks for
> > downstream operators, I can think about the following watermarks that the
> > operator can emit:
> > 1. Time based watermarks (the high watermark / low watermark)
> > 2. Number of tuple based watermarks (Every n tuples)
> > 3. File based watermarks (Start file, end file)
> > 4. Final watermark
> >
> > File based watermarks seem to be applicable for batch (file based) as
> well,
> > and hence I thought of looking at these first. Does this seem to be in
> line
> > with the thought process?
> >
> > ~ Bhupesh
> >
> >
> >
> > _______________________________________________________
> >
> > Bhupesh Chawda
> >
> > Software Engineer
> >
> > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> >
> > www.datatorrent.com  |  apex.apache.org
> >
> >
> >
> > On Thu, Feb 16, 2017 at 10:37 AM, Thomas Weise <th...@apache.org> wrote:
> >
> > > I don't think this should be designed based on a simplistic file
> > > input-output scenario. It would be good to include a stateful
> > > transformation based on event time.
> > >
> > > More complex pipelines contain stateful transformations that depend on
> > > windowing and watermarks. I think we need a watermark concept that is
> > based
> > > on progress in event time (or other monotonic increasing sequence) that
> > > other operators can generically work with.
> > >
> > > Note that even file input in many cases can produce time based
> > watermarks,
> > > for example when you read part files that are bound by event time.
> > >
> > > Thanks,
> > > Thomas
> > >
> > >
> > > On Wed, Feb 15, 2017 at 4:02 AM, Bhupesh Chawda <
> bhupesh@datatorrent.com
> > >
> > > wrote:
> > >
> > > > For better understanding the use case for control tuples in batch, ​I
> > am
> > > > creating a prototype for a batch application using File Input and
> File
> > > > Output operators.
> > > >
> > > > To enable basic batch processing for File IO operators, I am
> proposing
> > > the
> > > > following changes to File input and output operators:
> > > > 1. File Input operator emits a watermark each time it opens and
> closes
> > a
> > > > file. These can be "start file" and "end file" watermarks which
> include
> > > the
> > > > corresponding file names. The "start file" tuple should be sent
> before
> > > any
> > > > of the data from that file flows.
> > > > 2. File Input operator can be configured to end the application
> after a
> > > > single or n scans of the directory (a batch). This is where the
> > operator
> > > > emits the final watermark (the end of application control tuple).
> This
> > > will
> > > > also shutdown the application.
> > > > 3. The File output operator handles these control tuples. "Start
> file"
> > > > initializes the file name for the incoming tuples. "End file"
> watermark
> > > > forces a finalize on that file.
> > > >
> > > > The user would be able to enable the operators to send only those
> > > > watermarks that are needed in the application. If none of the options
> > are
> > > > configured, the operators behave as in a streaming application.
> > > >
> > > > There are a few challenges in the implementation where the input
> > operator
> > > > is partitioned. In this case, the correlation between the start/end
> > for a
> > > > file and the data tuples for that file is lost. Hence we need to
> > maintain
> > > > the filename as part of each tuple in the pipeline.
> > > >
> > > > The "start file" and "end file" control tuples in this example are
> > > > temporary names for watermarks. We can have generic "start batch" /
> > "end
> > > > batch" tuples which could be used for other use cases as well. The
> > Final
> > > > watermark is common and serves the same purpose in each case.
> > > >
> > > > Please let me know your thoughts on this.
> > > >
> > > > ~ Bhupesh
> > > >
> > > >
> > > >
> > > > On Wed, Jan 18, 2017 at 12:22 AM, Bhupesh Chawda <
> > > bhupesh@datatorrent.com>
> > > > wrote:
> > > >
> > > > > Yes, this can be part of operator configuration. Given this, for a
> > user
> > > > to
> > > > > define a batch application, would mean configuring the connectors
> > > (mostly
> > > > > the input operator) in the application for the desired behavior.
> > > > Similarly,
> > > > > there can be other use cases that can be achieved other than batch.
> > > > >
> > > > > We may also need to take care of the following:
> > > > > 1. Make sure that the watermarks or control tuples are consistent
> > > across
> > > > > sources. Meaning an HDFS sink should be able to interpret the
> > watermark
> > > > > tuple sent out by, say, a JDBC source.
> > > > > 2. In addition to I/O connectors, we should also look at the need
> for
> > > > > processing operators to understand some of the control tuples /
> > > > watermarks.
> > > > > For example, we may want to reset the operator behavior on arrival
> of
> > > > some
> > > > > watermark tuple.
> > > > >
> > > > > ~ Bhupesh
> > > > >
> > > > > On Tue, Jan 17, 2017 at 9:59 PM, Thomas Weise <th...@apache.org>
> > wrote:
> > > > >
> > > > >> The HDFS source can operate in two modes, bounded or unbounded. If
> > you
> > > > >> scan
> > > > >> only once, then it should emit the final watermark after it is
> done.
> > > > >> Otherwise it would emit watermarks based on a policy (files names
> > > etc.).
> > > > >> The mechanism to generate the marks may depend on the type of
> source
> > > and
> > > > >> the user needs to be able to influence/configure it.
> > > > >>
> > > > >> Thomas
> > > > >>
> > > > >>
> > > > >> On Tue, Jan 17, 2017 at 5:03 AM, Bhupesh Chawda <
> > > > bhupesh@datatorrent.com>
> > > > >> wrote:
> > > > >>
> > > > >> > Hi Thomas,
> > > > >> >
> > > > >> > I am not sure that I completely understand your suggestion. Are
> > you
> > > > >> > suggesting to broaden the scope of the proposal to treat all
> > sources
> > > > as
> > > > >> > bounded as well as unbounded?
> > > > >> >
> > > > >> > In case of Apex, we treat all sources as unbounded sources. Even
> > > > bounded
> > > > >> > sources like HDFS file source is treated as unbounded by means
> of
> > > > >> scanning
> > > > >> > the input directory repeatedly.
> > > > >> >
> > > > >> > Let's consider HDFS file source for example:
> > > > >> > In this case, if we treat it as a bounded source, we can define
> > > hooks
> > > > >> which
> > > > >> > allows us to detect the end of the file and send the "final
> > > > watermark".
> > > > >> We
> > > > >> > could also consider HDFS file source as a streaming source and
> > > define
> > > > >> hooks
> > > > >> > which send watermarks based on different kinds of windows.
> > > > >> >
> > > > >> > Please correct me if I misunderstand.
> > > > >> >
> > > > >> > ~ Bhupesh
> > > > >> >
> > > > >> >
> > > > >> > On Mon, Jan 16, 2017 at 9:23 PM, Thomas Weise <th...@apache.org>
> > > wrote:
> > > > >> >
> > > > >> > > Bhupesh,
> > > > >> > >
> > > > >> > > Please see how that can be solved in a unified way using
> windows
> > > and
> > > > >> > > watermarks. It is bounded data vs. unbounded data. In Beam for
> > > > >> example,
> > > > >> > you
> > > > >> > > can use the "global window" and the final watermark to
> > accomplish
> > > > what
> > > > >> > you
> > > > >> > > are looking for. Batch is just a special case of streaming
> where
> > > the
> > > > >> > source
> > > > >> > > emits the final watermark.
> > > > >> > >
> > > > >> > > Thanks,
> > > > >> > > Thomas
> > > > >> > >
> > > > >> > >
> > > > >> > > On Mon, Jan 16, 2017 at 1:02 AM, Bhupesh Chawda <
> > > > >> bhupesh@datatorrent.com
> > > > >> > >
> > > > >> > > wrote:
> > > > >> > >
> > > > >> > > > Yes, if the user needs to develop a batch application, then
> > > batch
> > > > >> aware
> > > > >> > > > operators need to be used in the application.
> > > > >> > > > The nature of the application is mostly controlled by the
> > input
> > > > and
> > > > >> the
> > > > >> > > > output operators used in the application.
> > > > >> > > >
> > > > >> > > > For example, consider an application which needs to filter
> > > records
> > > > >> in a
> > > > >> > > > input file and store the filtered records in another file.
> The
> > > > >> nature
> > > > >> > of
> > > > >> > > > this app is to end once the entire file is processed.
> > Following
> > > > >> things
> > > > >> > > are
> > > > >> > > > expected of the application:
> > > > >> > > >
> > > > >> > > >    1. Once the input data is over, finalize the output file
> > from
> > > > >> .tmp
> > > > >> > > >    files. - Responsibility of output operator
> > > > >> > > >    2. End the application, once the data is read and
> > processed -
> > > > >> > > >    Responsibility of input operator
> > > > >> > > >
> > > > >> > > > These functions are essential to allow the user to do higher
> > > level
> > > > >> > > > operations like scheduling or running a workflow of batch
> > > > >> applications.
> > > > >> > > >
> > > > >> > > > I am not sure about intermediate (processing) operators, as
> > > there
> > > > >> is no
> > > > >> > > > change in their functionality for batch use cases. Perhaps,
> > > > allowing
> > > > >> > > > running multiple batches in a single application may require
> > > > similar
> > > > >> > > > changes in processing operators as well.
> > > > >> > > >
> > > > >> > > > ~ Bhupesh
> > > > >> > > >
> > > > >> > > > On Mon, Jan 16, 2017 at 2:19 PM, Priyanka Gugale <
> > > > priyag@apache.org
> > > > >> >
> > > > >> > > > wrote:
> > > > >> > > >
> > > > >> > > > > Will it make an impression on user that, if he has a batch
> > > > >> usecase he
> > > > >> > > has
> > > > >> > > > > to use batch aware operators only? If so, is that what we
> > > > expect?
> > > > >> I
> > > > >> > am
> > > > >> > > > not
> > > > >> > > > > aware of how do we implement batch scenario so this might
> > be a
> > > > >> basic
> > > > >> > > > > question.
> > > > >> > > > >
> > > > >> > > > > -Priyanka
> > > > >> > > > >
> > > > >> > > > > On Mon, Jan 16, 2017 at 12:02 PM, Bhupesh Chawda <
> > > > >> > > > bhupesh@datatorrent.com>
> > > > >> > > > > wrote:
> > > > >> > > > >
> > > > >> > > > > > Hi All,
> > > > >> > > > > >
> > > > >> > > > > > While design / implementation for custom control tuples
> is
> > > > >> > ongoing, I
> > > > >> > > > > > thought it would be a good idea to consider its
> usefulness
> > > in
> > > > >> one
> > > > >> > of
> > > > >> > > > the
> > > > >> > > > > > use cases -  batch applications.
> > > > >> > > > > >
> > > > >> > > > > > This is a proposal to adapt / extend existing operators
> in
> > > the
> > > > >> > Apache
> > > > >> > > > > Apex
> > > > >> > > > > > Malhar library so that it is easy to use them in batch
> use
> > > > >> cases.
> > > > >> > > > > > Naturally, this would be applicable for only a subset of
> > > > >> operators
> > > > >> > > like
> > > > >> > > > > > File, JDBC and NoSQL databases.
> > > > >> > > > > > For example, for a file based store, (say HDFS store),
> we
> > > > could
> > > > >> > have
> > > > >> > > > > > FileBatchInput and FileBatchOutput operators which allow
> > > easy
> > > > >> > > > integration
> > > > >> > > > > > into a batch application. These operators would be
> > extended
> > > > from
> > > > >> > > their
> > > > >> > > > > > existing implementations and would be "Batch Aware", in
> > that
> > > > >> they
> > > > >> > may
> > > > >> > > > > > understand the meaning of some specific control tuples
> > that
> > > > flow
> > > > >> > > > through
> > > > >> > > > > > the DAG. Start batch and end batch seem to be the
> obvious
> > > > >> > candidates
> > > > >> > > > that
> > > > >> > > > > > come to mind. On receipt of such control tuples, they
> may
> > > try
> > > > to
> > > > >> > > modify
> > > > >> > > > > the
> > > > >> > > > > > behavior of the operator - to reinitialize some metrics
> or
> > > > >> finalize
> > > > >> > > an
> > > > >> > > > > > output file for example.
> > > > >> > > > > >
> > > > >> > > > > > We can discuss the potential control tuples and actions
> in
> > > > >> detail,
> > > > >> > > but
> > > > >> > > > > > first I would like to understand the views of the
> > community
> > > > for
> > > > >> > this
> > > > >> > > > > > proposal.
> > > > >> > > > > >
> > > > >> > > > > > ~ Bhupesh
> > > > >> > > > > >
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by David Yan <da...@gmail.com>.
I'm worried that we are making the watermark concept too complicated.

Watermarks should simply just tell you what windows can be considered
complete.

Point 2 is basically a count-based window. Watermarks do not play a role
here because the window is always complete at the n-th tuple.

If I understand correctly, point 3 is for batch processing of files. Unless
the files contain timed events, it sounds to me like this can be achieved
with just a Global Window. For signaling EOF, a watermark with a +infinity
timestamp can be used so that triggers will be fired upon receipt of that
watermark.

For point 4, just like what I mentioned above, can be achieved with a
watermark with a +infinity timestamp.

David




On Sat, Feb 18, 2017 at 8:04 AM, Bhupesh Chawda <bh...@datatorrent.com>
wrote:

> Hi Thomas,
>
> For an input operator which is supposed to generate watermarks for
> downstream operators, I can think about the following watermarks that the
> operator can emit:
> 1. Time based watermarks (the high watermark / low watermark)
> 2. Number of tuple based watermarks (Every n tuples)
> 3. File based watermarks (Start file, end file)
> 4. Final watermark
>
> File based watermarks seem to be applicable for batch (file based) as well,
> and hence I thought of looking at these first. Does this seem to be in line
> with the thought process?
>
> ~ Bhupesh
>
>
>
> _______________________________________________________
>
> Bhupesh Chawda
>
> Software Engineer
>
> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>
> www.datatorrent.com  |  apex.apache.org
>
>
>
> On Thu, Feb 16, 2017 at 10:37 AM, Thomas Weise <th...@apache.org> wrote:
>
> > I don't think this should be designed based on a simplistic file
> > input-output scenario. It would be good to include a stateful
> > transformation based on event time.
> >
> > More complex pipelines contain stateful transformations that depend on
> > windowing and watermarks. I think we need a watermark concept that is
> based
> > on progress in event time (or other monotonic increasing sequence) that
> > other operators can generically work with.
> >
> > Note that even file input in many cases can produce time based
> watermarks,
> > for example when you read part files that are bound by event time.
> >
> > Thanks,
> > Thomas
> >
> >
> > On Wed, Feb 15, 2017 at 4:02 AM, Bhupesh Chawda <bhupesh@datatorrent.com
> >
> > wrote:
> >
> > > For better understanding the use case for control tuples in batch, ​I
> am
> > > creating a prototype for a batch application using File Input and File
> > > Output operators.
> > >
> > > To enable basic batch processing for File IO operators, I am proposing
> > the
> > > following changes to File input and output operators:
> > > 1. File Input operator emits a watermark each time it opens and closes
> a
> > > file. These can be "start file" and "end file" watermarks which include
> > the
> > > corresponding file names. The "start file" tuple should be sent before
> > any
> > > of the data from that file flows.
> > > 2. File Input operator can be configured to end the application after a
> > > single or n scans of the directory (a batch). This is where the
> operator
> > > emits the final watermark (the end of application control tuple). This
> > will
> > > also shutdown the application.
> > > 3. The File output operator handles these control tuples. "Start file"
> > > initializes the file name for the incoming tuples. "End file" watermark
> > > forces a finalize on that file.
> > >
> > > The user would be able to enable the operators to send only those
> > > watermarks that are needed in the application. If none of the options
> are
> > > configured, the operators behave as in a streaming application.
> > >
> > > There are a few challenges in the implementation where the input
> operator
> > > is partitioned. In this case, the correlation between the start/end
> for a
> > > file and the data tuples for that file is lost. Hence we need to
> maintain
> > > the filename as part of each tuple in the pipeline.
> > >
> > > The "start file" and "end file" control tuples in this example are
> > > temporary names for watermarks. We can have generic "start batch" /
> "end
> > > batch" tuples which could be used for other use cases as well. The
> Final
> > > watermark is common and serves the same purpose in each case.
> > >
> > > Please let me know your thoughts on this.
> > >
> > > ~ Bhupesh
> > >
> > >
> > >
> > > On Wed, Jan 18, 2017 at 12:22 AM, Bhupesh Chawda <
> > bhupesh@datatorrent.com>
> > > wrote:
> > >
> > > > Yes, this can be part of operator configuration. Given this, for a
> user
> > > to
> > > > define a batch application, would mean configuring the connectors
> > (mostly
> > > > the input operator) in the application for the desired behavior.
> > > Similarly,
> > > > there can be other use cases that can be achieved other than batch.
> > > >
> > > > We may also need to take care of the following:
> > > > 1. Make sure that the watermarks or control tuples are consistent
> > across
> > > > sources. Meaning an HDFS sink should be able to interpret the
> watermark
> > > > tuple sent out by, say, a JDBC source.
> > > > 2. In addition to I/O connectors, we should also look at the need for
> > > > processing operators to understand some of the control tuples /
> > > watermarks.
> > > > For example, we may want to reset the operator behavior on arrival of
> > > some
> > > > watermark tuple.
> > > >
> > > > ~ Bhupesh
> > > >
> > > > On Tue, Jan 17, 2017 at 9:59 PM, Thomas Weise <th...@apache.org>
> wrote:
> > > >
> > > >> The HDFS source can operate in two modes, bounded or unbounded. If
> you
> > > >> scan
> > > >> only once, then it should emit the final watermark after it is done.
> > > >> Otherwise it would emit watermarks based on a policy (files names
> > etc.).
> > > >> The mechanism to generate the marks may depend on the type of source
> > and
> > > >> the user needs to be able to influence/configure it.
> > > >>
> > > >> Thomas
> > > >>
> > > >>
> > > >> On Tue, Jan 17, 2017 at 5:03 AM, Bhupesh Chawda <
> > > bhupesh@datatorrent.com>
> > > >> wrote:
> > > >>
> > > >> > Hi Thomas,
> > > >> >
> > > >> > I am not sure that I completely understand your suggestion. Are
> you
> > > >> > suggesting to broaden the scope of the proposal to treat all
> sources
> > > as
> > > >> > bounded as well as unbounded?
> > > >> >
> > > >> > In case of Apex, we treat all sources as unbounded sources. Even
> > > bounded
> > > >> > sources like HDFS file source is treated as unbounded by means of
> > > >> scanning
> > > >> > the input directory repeatedly.
> > > >> >
> > > >> > Let's consider HDFS file source for example:
> > > >> > In this case, if we treat it as a bounded source, we can define
> > hooks
> > > >> which
> > > >> > allows us to detect the end of the file and send the "final
> > > watermark".
> > > >> We
> > > >> > could also consider HDFS file source as a streaming source and
> > define
> > > >> hooks
> > > >> > which send watermarks based on different kinds of windows.
> > > >> >
> > > >> > Please correct me if I misunderstand.
> > > >> >
> > > >> > ~ Bhupesh
> > > >> >
> > > >> >
> > > >> > On Mon, Jan 16, 2017 at 9:23 PM, Thomas Weise <th...@apache.org>
> > wrote:
> > > >> >
> > > >> > > Bhupesh,
> > > >> > >
> > > >> > > Please see how that can be solved in a unified way using windows
> > and
> > > >> > > watermarks. It is bounded data vs. unbounded data. In Beam for
> > > >> example,
> > > >> > you
> > > >> > > can use the "global window" and the final watermark to
> accomplish
> > > what
> > > >> > you
> > > >> > > are looking for. Batch is just a special case of streaming where
> > the
> > > >> > source
> > > >> > > emits the final watermark.
> > > >> > >
> > > >> > > Thanks,
> > > >> > > Thomas
> > > >> > >
> > > >> > >
> > > >> > > On Mon, Jan 16, 2017 at 1:02 AM, Bhupesh Chawda <
> > > >> bhupesh@datatorrent.com
> > > >> > >
> > > >> > > wrote:
> > > >> > >
> > > >> > > > Yes, if the user needs to develop a batch application, then
> > batch
> > > >> aware
> > > >> > > > operators need to be used in the application.
> > > >> > > > The nature of the application is mostly controlled by the
> input
> > > and
> > > >> the
> > > >> > > > output operators used in the application.
> > > >> > > >
> > > >> > > > For example, consider an application which needs to filter
> > records
> > > >> in a
> > > >> > > > input file and store the filtered records in another file. The
> > > >> nature
> > > >> > of
> > > >> > > > this app is to end once the entire file is processed.
> Following
> > > >> things
> > > >> > > are
> > > >> > > > expected of the application:
> > > >> > > >
> > > >> > > >    1. Once the input data is over, finalize the output file
> from
> > > >> .tmp
> > > >> > > >    files. - Responsibility of output operator
> > > >> > > >    2. End the application, once the data is read and
> processed -
> > > >> > > >    Responsibility of input operator
> > > >> > > >
> > > >> > > > These functions are essential to allow the user to do higher
> > level
> > > >> > > > operations like scheduling or running a workflow of batch
> > > >> applications.
> > > >> > > >
> > > >> > > > I am not sure about intermediate (processing) operators, as
> > there
> > > >> is no
> > > >> > > > change in their functionality for batch use cases. Perhaps,
> > > allowing
> > > >> > > > running multiple batches in a single application may require
> > > similar
> > > >> > > > changes in processing operators as well.
> > > >> > > >
> > > >> > > > ~ Bhupesh
> > > >> > > >
> > > >> > > > On Mon, Jan 16, 2017 at 2:19 PM, Priyanka Gugale <
> > > priyag@apache.org
> > > >> >
> > > >> > > > wrote:
> > > >> > > >
> > > >> > > > > Will it make an impression on user that, if he has a batch
> > > >> usecase he
> > > >> > > has
> > > >> > > > > to use batch aware operators only? If so, is that what we
> > > expect?
> > > >> I
> > > >> > am
> > > >> > > > not
> > > >> > > > > aware of how do we implement batch scenario so this might
> be a
> > > >> basic
> > > >> > > > > question.
> > > >> > > > >
> > > >> > > > > -Priyanka
> > > >> > > > >
> > > >> > > > > On Mon, Jan 16, 2017 at 12:02 PM, Bhupesh Chawda <
> > > >> > > > bhupesh@datatorrent.com>
> > > >> > > > > wrote:
> > > >> > > > >
> > > >> > > > > > Hi All,
> > > >> > > > > >
> > > >> > > > > > While design / implementation for custom control tuples is
> > > >> > ongoing, I
> > > >> > > > > > thought it would be a good idea to consider its usefulness
> > in
> > > >> one
> > > >> > of
> > > >> > > > the
> > > >> > > > > > use cases -  batch applications.
> > > >> > > > > >
> > > >> > > > > > This is a proposal to adapt / extend existing operators in
> > the
> > > >> > Apache
> > > >> > > > > Apex
> > > >> > > > > > Malhar library so that it is easy to use them in batch use
> > > >> cases.
> > > >> > > > > > Naturally, this would be applicable for only a subset of
> > > >> operators
> > > >> > > like
> > > >> > > > > > File, JDBC and NoSQL databases.
> > > >> > > > > > For example, for a file based store, (say HDFS store), we
> > > could
> > > >> > have
> > > >> > > > > > FileBatchInput and FileBatchOutput operators which allow
> > easy
> > > >> > > > integration
> > > >> > > > > > into a batch application. These operators would be
> extended
> > > from
> > > >> > > their
> > > >> > > > > > existing implementations and would be "Batch Aware", in
> that
> > > >> they
> > > >> > may
> > > >> > > > > > understand the meaning of some specific control tuples
> that
> > > flow
> > > >> > > > through
> > > >> > > > > > the DAG. Start batch and end batch seem to be the obvious
> > > >> > candidates
> > > >> > > > that
> > > >> > > > > > come to mind. On receipt of such control tuples, they may
> > try
> > > to
> > > >> > > modify
> > > >> > > > > the
> > > >> > > > > > behavior of the operator - to reinitialize some metrics or
> > > >> finalize
> > > >> > > an
> > > >> > > > > > output file for example.
> > > >> > > > > >
> > > >> > > > > > We can discuss the potential control tuples and actions in
> > > >> detail,
> > > >> > > but
> > > >> > > > > > first I would like to understand the views of the
> community
> > > for
> > > >> > this
> > > >> > > > > > proposal.
> > > >> > > > > >
> > > >> > > > > > ~ Bhupesh
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by David Yan <da...@gmail.com>.
>
>
> Understood all but this line:
>
> windowedOperator.setWindowOption(new
> WindowOption.TimeWindows(Duration.millis(2)));
>
> I wonder if there is an option to control this from the source, maybe David
> can take a look?
>
>
It actually does not make sense to have windows at the source.

The source only gives you the events with their corresponding event time.
It's entirely up to the downstream operators to do the windowing. You can have
one source that feeds multiple different downstream pipelines with entirely
different windowing (for example, one with a Global Window, another with a
timed window of 1 minute, another with 5 minutes, and maybe another with
Session Windows).
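
For example, a rough DAG wiring sketch: the dag, source and windowed operator
names are placeholders, and the GlobalWindow option name is an assumption
patterned on the TimeWindows call quoted elsewhere in this thread.

// One stream from the source feeds three windowed operators, each configured
// independently:
globalCounts.setWindowOption(new WindowOption.GlobalWindow());
oneMinuteCounts.setWindowOption(new WindowOption.TimeWindows(Duration.standardMinutes(1)));
fiveMinuteCounts.setWindowOption(new WindowOption.TimeWindows(Duration.standardMinutes(5)));

dag.addStream("events", source.output,
    globalCounts.input, oneMinuteCounts.input, fiveMinuteCounts.input);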

David

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by Thomas Weise <th...@apache.org>.
-->

On Thu, Feb 23, 2017 at 12:02 AM, Bhupesh Chawda <bh...@datatorrent.com>
wrote:

> Hi Thomas,
>
> My response inline:
>
> On Wed, Feb 22, 2017 at 10:17 PM, Thomas Weise <th...@apache.org> wrote:
>
> > Hi Bhupesh,
> >
> > This looks great. You use the watermark as measure of completeness and
> the
> > window to isolate the state, which is how it should work.
> >
> > Questions/comments:
> >
> > Why does the count operator have a 2ms window when this should be driven
> by
> > the watermark from the input operator?
> >
> >
> ​In this example, we trigger at the Watermark. So the count (windowed)
> operator accumulates the state until the watermark and then emits all the
> accumulated counts.
> 2 ms is not necessary, we can make it 1 ms. But in this case it's not the
> time duration that matters. The file input operator makes sure all tuples
> belonging to a file become part of the same window by giving those tuples
> the same timestamp. So all tuples in the first file go out with timestamp 0,
> second file with timestamp 1 and so on.
> ​
>

Understood all but this line:

windowedOperator.setWindowOption(new
WindowOption.TimeWindows(Duration.millis(2)));

I wonder if there is an option to control this from the source, maybe David
can take a look?


>
> > I don't think there should be separate "windowed" connector operators.
> The
> > watermark support needs to be incorporated into existing operators and
> > configurable. Windowing is a concept that the entire library needs to be
> > aware of. I see no reason to arrange classes in separate "window"
> packages
> > except those that are really specific to windowing support such as the
> > watermark tuple.
> >
>
> ​I think, making changes to the existing operators would make them too
> heavy and complex. I suggest we extend the existing operators and have new
> classes with just the logic for watermarks. This will also help keep bugs
> resulting from the new implementations isolated.
> We can keep these in the same package as the existing operators. Just the
> window specific classes (like watermarks) will go into the window package.
>

Watermark support needs to be part of all connectors, just like idempotency,
fault tolerance, partitioning, etc. Well-written operators have pluggable
components and strategies. Inheritance is not the way to solve this; it needs
to be part of the abstract base classes.
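
To make the point concrete, here is a rough sketch of what a pluggable
strategy in an abstract base class could look like (all names here are
hypothetical, not existing Malhar classes):

// Hypothetical strategy interface: decides when the input operator emits a watermark.
interface WatermarkGenerator<T>
{
  // Return a watermark timestamp to emit after this tuple, or null for none.
  Long watermarkFor(T tuple);
}

abstract class AbstractInputOperatorBase<T>
{
  private WatermarkGenerator<T> watermarkGenerator;   // configurable, not inherited

  public void setWatermarkGenerator(WatermarkGenerator<T> generator)
  {
    this.watermarkGenerator = generator;
  }

  protected void emitTuple(T tuple)
  {
    emit(tuple);
    if (watermarkGenerator != null) {
      Long wm = watermarkGenerator.watermarkFor(tuple);
      if (wm != null) {
        emitWatermark(wm);
      }
    }
  }

  protected abstract void emit(T tuple);
  protected abstract void emitWatermark(long timestamp);
}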


> ​
>
>
> >
> > Why does the control tuple have an operatorId in it?
> >
>
> ​Operator id is not used in the current example, but may help the user to
> understand the originating partition for a watermark tuple. This will be in
> scenarios where we cannot distinguish between watermark tuples from
> different partitions; unlike the file based watermarks where filename is
> the distinguishing property.
>


It is not clear why it matters which physical operator generated a watermark;
I don't think anything downstream can rely on that, in general. It would help
if you could provide a real example and document it.


> ​
>
> >
> > Once you make the changes to the operators, please also augment the
> > documentation and examples (in this case wordcount demo).
> >
>
> ​Sure.​
>
>
>
> > Thanks,
> > Thomas
> >
> >
> >
> > On Wed, Feb 22, 2017 at 4:51 AM, Bhupesh Chawda <bhupesh@datatorrent.com
> >
> > wrote:
> >
> > > Hi Thomas,
> > >
> > > Sorry for the delay.
> > > I agree that the watermark concept is general and is understood by
> > > intermediate transformations. File name is some additional information
> in
> > > the watermark which helps the start and end operators do stuff related
> to
> > > batch.
> > > As suggested, I have created a wordcount application which uses
> > watermarks
> > > to create separate windows for each file by means of a long
> (timestamp).
> > > I am linking the source for reference:
> > >
> > > Watermarks:
> > > https://github.com/bhupeshchawda/apex-malhar/blob/batch-io-operators/
> > > library/src/main/java/org/apache/apex/malhar/lib/window/
> > > windowable/FileWatermark.java
> > >
> > > Extended File Input and Output operators:
> > > https://github.com/bhupeshchawda/apex-malhar/blob/batch-io-operators/
> > > library/src/main/java/org/apache/apex/malhar/lib/window/windowable/
> > > WindowedFileInputOperator.java
> > > https://github.com/bhupeshchawda/apex-malhar/blob/batch-io-operators/
> > > library/src/main/java/org/apache/apex/malhar/lib/window/windowable/
> > > WindowedFileOutputOperator.java
> > >
> > >
> > > WordCount Application:
> > >
> > > https://github.com/bhupeshchawda/apex-malhar/blob/batch-io-operators/
> > > library/src/test/java/org/apache/apex/malhar/lib/window/
> > > windowable/WindowedWordCount.java
> > >
> > >
> > > The input operator attaches a timestamp with each file which allows the
> > > WindowedOperator to identify each file and its state in a distinct
> > window.
> > >
> > > Additionally, using the additional file information, the application
> can
> > > store the counts in similarly named files at the destination.
> > >
> > >
> > > Thanks.
> > >
> > > _______________________________________________________
> > >
> > > Bhupesh Chawda
> > >
> > > Software Engineer
> > >
> > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> > >
> > > www.datatorrent.com  |  apex.apache.org
> > >
> > >
> > >
> > > On Sat, Feb 18, 2017 at 10:24 PM, Thomas Weise <th...@apache.org> wrote:
> > >
> > > > Hi Bhupesh,
> > > >
> > > > I think this needs a generic watermark concept that is independent of
> > > > source and destination and can be understood by intermediate
> > > > transformations. File names don't meet this criteria.
> > > >
> > > > One possible approach is to have a monotonic increasing file sequence
> > > > (instead of time, if it is not applicable) that can be mapped to
> > > watermark.
> > > > You can still tag on the file name to the control tuple as extra
> > > > information so that a file output operator that understands it can do
> > > > whatever it wants with it. But it should also work without it, let's
> > say
> > > > when we write the output to the console.
> > > >
> > > > The key here is that you can demonstrate that an intermediate
> stateful
> > > > transformation will work. I would suggest to try wordcount per input
> > file
> > > > with the window operator that emits the counts at file boundary,
> > without
> > > > knowing anything about files.
> > > >
> > > > Thanks,
> > > > Thomas
> > > >
> > > >
> > > > On Sat, Feb 18, 2017 at 8:04 AM, Bhupesh Chawda <
> > bhupesh@datatorrent.com
> > > >
> > > > wrote:
> > > >
> > > > > Hi Thomas,
> > > > >
> > > > > For an input operator which is supposed to generate watermarks for
> > > > > downstream operators, I can think about the following watermarks
> that
> > > the
> > > > > operator can emit:
> > > > > 1. Time based watermarks (the high watermark / low watermark)
> > > > > 2. Number of tuple based watermarks (Every n tuples)
> > > > > 3. File based watermarks (Start file, end file)
> > > > > 4. Final watermark
> > > > >
> > > > > File based watermarks seem to be applicable for batch (file based)
> as
> > > > well,
> > > > > and hence I thought of looking at these first. Does this seem to be
> > in
> > > > line
> > > > > with the thought process?
> > > > >
> > > > > ~ Bhupesh
> > > > >
> > > > >
> > > > >
> > > > > _______________________________________________________
> > > > >
> > > > > Bhupesh Chawda
> > > > >
> > > > > Software Engineer
> > > > >
> > > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> > > > >
> > > > > www.datatorrent.com  |  apex.apache.org
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Feb 16, 2017 at 10:37 AM, Thomas Weise <th...@apache.org>
> > wrote:
> > > > >
> > > > > > I don't think this should be designed based on a simplistic file
> > > > > > input-output scenario. It would be good to include a stateful
> > > > > > transformation based on event time.
> > > > > >
> > > > > > More complex pipelines contain stateful transformations that
> depend
> > > on
> > > > > > windowing and watermarks. I think we need a watermark concept
> that
> > is
> > > > > based
> > > > > > on progress in event time (or other monotonic increasing
> sequence)
> > > that
> > > > > > other operators can generically work with.
> > > > > >
> > > > > > Note that even file input in many cases can produce time based
> > > > > watermarks,
> > > > > > for example when you read part files that are bound by event
> time.
> > > > > >
> > > > > > Thanks,
> > > > > > Thomas
> > > > > >
> > > > > >
> > > > > > On Wed, Feb 15, 2017 at 4:02 AM, Bhupesh Chawda <
> > > > bhupesh@datatorrent.com
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > For better understanding the use case for control tuples in
> > batch,
> > > ​I
> > > > > am
> > > > > > > creating a prototype for a batch application using File Input
> and
> > > > File
> > > > > > > Output operators.
> > > > > > >
> > > > > > > To enable basic batch processing for File IO operators, I am
> > > > proposing
> > > > > > the
> > > > > > > following changes to File input and output operators:
> > > > > > > 1. File Input operator emits a watermark each time it opens and
> > > > closes
> > > > > a
> > > > > > > file. These can be "start file" and "end file" watermarks which
> > > > include
> > > > > > the
> > > > > > > corresponding file names. The "start file" tuple should be sent
> > > > before
> > > > > > any
> > > > > > > of the data from that file flows.
> > > > > > > 2. File Input operator can be configured to end the application
> > > > after a
> > > > > > > single or n scans of the directory (a batch). This is where the
> > > > > operator
> > > > > > > emits the final watermark (the end of application control
> tuple).
> > > > This
> > > > > > will
> > > > > > > also shutdown the application.
> > > > > > > 3. The File output operator handles these control tuples.
> "Start
> > > > file"
> > > > > > > initializes the file name for the incoming tuples. "End file"
> > > > watermark
> > > > > > > forces a finalize on that file.
> > > > > > >
> > > > > > > The user would be able to enable the operators to send only
> those
> > > > > > > watermarks that are needed in the application. If none of the
> > > options
> > > > > are
> > > > > > > configured, the operators behave as in a streaming application.
> > > > > > >
> > > > > > > There are a few challenges in the implementation where the
> input
> > > > > operator
> > > > > > > is partitioned. In this case, the correlation between the
> > start/end
> > > > > for a
> > > > > > > file and the data tuples for that file is lost. Hence we need
> to
> > > > > maintain
> > > > > > > the filename as part of each tuple in the pipeline.
> > > > > > >
> > > > > > > The "start file" and "end file" control tuples in this example
> > are
> > > > > > > temporary names for watermarks. We can have generic "start
> > batch" /
> > > > > "end
> > > > > > > batch" tuples which could be used for other use cases as well.
> > The
> > > > > Final
> > > > > > > watermark is common and serves the same purpose in each case.
> > > > > > >
> > > > > > > Please let me know your thoughts on this.
> > > > > > >
> > > > > > > ~ Bhupesh
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Jan 18, 2017 at 12:22 AM, Bhupesh Chawda <
> > > > > > bhupesh@datatorrent.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Yes, this can be part of operator configuration. Given this,
> > for
> > > a
> > > > > user
> > > > > > > to
> > > > > > > > define a batch application, would mean configuring the
> > connectors
> > > > > > (mostly
> > > > > > > > the input operator) in the application for the desired
> > behavior.
> > > > > > > Similarly,
> > > > > > > > there can be other use cases that can be achieved other than
> > > batch.
> > > > > > > >
> > > > > > > > We may also need to take care of the following:
> > > > > > > > 1. Make sure that the watermarks or control tuples are
> > consistent
> > > > > > across
> > > > > > > > sources. Meaning an HDFS sink should be able to interpret the
> > > > > watermark
> > > > > > > > tuple sent out by, say, a JDBC source.
> > > > > > > > 2. In addition to I/O connectors, we should also look at the
> > need
> > > > for
> > > > > > > > processing operators to understand some of the control
> tuples /
> > > > > > > watermarks.
> > > > > > > > For example, we may want to reset the operator behavior on
> > > arrival
> > > > of
> > > > > > > some
> > > > > > > > watermark tuple.
> > > > > > > >
> > > > > > > > ~ Bhupesh
> > > > > > > >
> > > > > > > > On Tue, Jan 17, 2017 at 9:59 PM, Thomas Weise <
> thw@apache.org>
> > > > > wrote:
> > > > > > > >
> > > > > > > >> The HDFS source can operate in two modes, bounded or
> > unbounded.
> > > If
> > > > > you
> > > > > > > >> scan
> > > > > > > >> only once, then it should emit the final watermark after it
> is
> > > > done.
> > > > > > > >> Otherwise it would emit watermarks based on a policy (files
> > > names
> > > > > > etc.).
> > > > > > > >> The mechanism to generate the marks may depend on the type
> of
> > > > source
> > > > > > and
> > > > > > > >> the user needs to be able to influence/configure it.
> > > > > > > >>
> > > > > > > >> Thomas
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> On Tue, Jan 17, 2017 at 5:03 AM, Bhupesh Chawda <
> > > > > > > bhupesh@datatorrent.com>
> > > > > > > >> wrote:
> > > > > > > >>
> > > > > > > >> > Hi Thomas,
> > > > > > > >> >
> > > > > > > >> > I am not sure that I completely understand your
> suggestion.
> > > Are
> > > > > you
> > > > > > > >> > suggesting to broaden the scope of the proposal to treat
> all
> > > > > sources
> > > > > > > as
> > > > > > > >> > bounded as well as unbounded?
> > > > > > > >> >
> > > > > > > >> > In case of Apex, we treat all sources as unbounded
> sources.
> > > Even
> > > > > > > bounded
> > > > > > > >> > sources like HDFS file source is treated as unbounded by
> > means
> > > > of
> > > > > > > >> scanning
> > > > > > > >> > the input directory repeatedly.
> > > > > > > >> >
> > > > > > > >> > Let's consider HDFS file source for example:
> > > > > > > >> > In this case, if we treat it as a bounded source, we can
> > > define
> > > > > > hooks
> > > > > > > >> which
> > > > > > > >> > allows us to detect the end of the file and send the
> "final
> > > > > > > watermark".
> > > > > > > >> We
> > > > > > > >> > could also consider HDFS file source as a streaming source
> > and
> > > > > > define
> > > > > > > >> hooks
> > > > > > > >> > which send watermarks based on different kinds of windows.
> > > > > > > >> >
> > > > > > > >> > Please correct me if I misunderstand.
> > > > > > > >> >
> > > > > > > >> > ~ Bhupesh
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> > On Mon, Jan 16, 2017 at 9:23 PM, Thomas Weise <
> > thw@apache.org
> > > >
> > > > > > wrote:
> > > > > > > >> >
> > > > > > > >> > > Bhupesh,
> > > > > > > >> > >
> > > > > > > >> > > Please see how that can be solved in a unified way using
> > > > windows
> > > > > > and
> > > > > > > >> > > watermarks. It is bounded data vs. unbounded data. In
> Beam
> > > for
> > > > > > > >> example,
> > > > > > > >> > you
> > > > > > > >> > > can use the "global window" and the final watermark to
> > > > > accomplish
> > > > > > > what
> > > > > > > >> > you
> > > > > > > >> > > are looking for. Batch is just a special case of
> streaming
> > > > where
> > > > > > the
> > > > > > > >> > source
> > > > > > > >> > > emits the final watermark.
> > > > > > > >> > >
> > > > > > > >> > > Thanks,
> > > > > > > >> > > Thomas
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > > On Mon, Jan 16, 2017 at 1:02 AM, Bhupesh Chawda <
> > > > > > > >> bhupesh@datatorrent.com
> > > > > > > >> > >
> > > > > > > >> > > wrote:
> > > > > > > >> > >
> > > > > > > >> > > > Yes, if the user needs to develop a batch application,
> > > then
> > > > > > batch
> > > > > > > >> aware
> > > > > > > >> > > > operators need to be used in the application.
> > > > > > > >> > > > The nature of the application is mostly controlled by
> > the
> > > > > input
> > > > > > > and
> > > > > > > >> the
> > > > > > > >> > > > output operators used in the application.
> > > > > > > >> > > >
> > > > > > > >> > > > For example, consider an application which needs to
> > filter
> > > > > > records
> > > > > > > >> in a
> > > > > > > >> > > > input file and store the filtered records in another
> > file.
> > > > The
> > > > > > > >> nature
> > > > > > > >> > of
> > > > > > > >> > > > this app is to end once the entire file is processed.
> > > > > Following
> > > > > > > >> things
> > > > > > > >> > > are
> > > > > > > >> > > > expected of the application:
> > > > > > > >> > > >
> > > > > > > >> > > >    1. Once the input data is over, finalize the output
> > > file
> > > > > from
> > > > > > > >> .tmp
> > > > > > > >> > > >    files. - Responsibility of output operator
> > > > > > > >> > > >    2. End the application, once the data is read and
> > > > > processed -
> > > > > > > >> > > >    Responsibility of input operator
> > > > > > > >> > > >
> > > > > > > >> > > > These functions are essential to allow the user to do
> > > higher
> > > > > > level
> > > > > > > >> > > > operations like scheduling or running a workflow of
> > batch
> > > > > > > >> applications.
> > > > > > > >> > > >
> > > > > > > >> > > > I am not sure about intermediate (processing)
> operators,
> > > as
> > > > > > there
> > > > > > > >> is no
> > > > > > > >> > > > change in their functionality for batch use cases.
> > > Perhaps,
> > > > > > > allowing
> > > > > > > >> > > > running multiple batches in a single application may
> > > require
> > > > > > > similar
> > > > > > > >> > > > changes in processing operators as well.
> > > > > > > >> > > >
> > > > > > > >> > > > ~ Bhupesh
> > > > > > > >> > > >
> > > > > > > >> > > > On Mon, Jan 16, 2017 at 2:19 PM, Priyanka Gugale <
> > > > > > > priyag@apache.org
> > > > > > > >> >
> > > > > > > >> > > > wrote:
> > > > > > > >> > > >
> > > > > > > >> > > > > Will it make an impression on user that, if he has a
> > > batch
> > > > > > > >> usecase he
> > > > > > > >> > > has
> > > > > > > >> > > > > to use batch aware operators only? If so, is that
> what
> > > we
> > > > > > > expect?
> > > > > > > >> I
> > > > > > > >> > am
> > > > > > > >> > > > not
> > > > > > > >> > > > > aware of how do we implement batch scenario so this
> > > might
> > > > > be a
> > > > > > > >> basic
> > > > > > > >> > > > > question.
> > > > > > > >> > > > >
> > > > > > > >> > > > > -Priyanka
> > > > > > > >> > > > >
> > > > > > > >> > > > > On Mon, Jan 16, 2017 at 12:02 PM, Bhupesh Chawda <
> > > > > > > >> > > > bhupesh@datatorrent.com>
> > > > > > > >> > > > > wrote:
> > > > > > > >> > > > >
> > > > > > > >> > > > > > Hi All,
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > While design / implementation for custom control
> > > tuples
> > > > is
> > > > > > > >> > ongoing, I
> > > > > > > >> > > > > > thought it would be a good idea to consider its
> > > > usefulness
> > > > > > in
> > > > > > > >> one
> > > > > > > >> > of
> > > > > > > >> > > > the
> > > > > > > >> > > > > > use cases -  batch applications.
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > This is a proposal to adapt / extend existing
> > > operators
> > > > in
> > > > > > the
> > > > > > > >> > Apache
> > > > > > > >> > > > > Apex
> > > > > > > >> > > > > > Malhar library so that it is easy to use them in
> > batch
> > > > use
> > > > > > > >> cases.
> > > > > > > >> > > > > > Naturally, this would be applicable for only a
> > subset
> > > of
> > > > > > > >> operators
> > > > > > > >> > > like
> > > > > > > >> > > > > > File, JDBC and NoSQL databases.
> > > > > > > >> > > > > > For example, for a file based store, (say HDFS
> > store),
> > > > we
> > > > > > > could
> > > > > > > >> > have
> > > > > > > >> > > > > > FileBatchInput and FileBatchOutput operators which
> > > allow
> > > > > > easy
> > > > > > > >> > > > integration
> > > > > > > >> > > > > > into a batch application. These operators would be
> > > > > extended
> > > > > > > from
> > > > > > > >> > > their
> > > > > > > >> > > > > > existing implementations and would be "Batch
> Aware",
> > > in
> > > > > that
> > > > > > > >> they
> > > > > > > >> > may
> > > > > > > >> > > > > > understand the meaning of some specific control
> > tuples
> > > > > that
> > > > > > > flow
> > > > > > > >> > > > through
> > > > > > > >> > > > > > the DAG. Start batch and end batch seem to be the
> > > > obvious
> > > > > > > >> > candidates
> > > > > > > >> > > > that
> > > > > > > >> > > > > > come to mind. On receipt of such control tuples,
> > they
> > > > may
> > > > > > try
> > > > > > > to
> > > > > > > >> > > modify
> > > > > > > >> > > > > the
> > > > > > > >> > > > > > behavior of the operator - to reinitialize some
> > > metrics
> > > > or
> > > > > > > >> finalize
> > > > > > > >> > > an
> > > > > > > >> > > > > > output file for example.
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > We can discuss the potential control tuples and
> > > actions
> > > > in
> > > > > > > >> detail,
> > > > > > > >> > > but
> > > > > > > >> > > > > > first I would like to understand the views of the
> > > > > community
> > > > > > > for
> > > > > > > >> > this
> > > > > > > >> > > > > > proposal.
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > ~ Bhupesh
> > > > > > > >> > > > > >
> > > > > > > >> > > > >
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by Bhupesh Chawda <bh...@datatorrent.com>.
Hi Thomas,

My response inline:

On Wed, Feb 22, 2017 at 10:17 PM, Thomas Weise <th...@apache.org> wrote:

> Hi Bhupesh,
>
> This looks great. You use the watermark as measure of completeness and the
> window to isolate the state, which is how it should work.
>
> Questions/comments:
>
> Why does the count operator have a 2ms window when this should be driven by
> the watermark from the input operator?
>
>
​In this example, we trigger at the Watermark. So the count (windowed)
operator accumulates the state until the watermark and then emits all the
accumulated counts.
The 2 ms window is not necessary; we could make it 1 ms. But in this case
it is not the time duration that matters. The file input operator makes
sure all tuples belonging to a file become part of the same window by
assigning them the same timestamp. So all tuples from the first file go out
with timestamp 0, those from the second file with timestamp 1, and so on.
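
Roughly, the idea looks like the sketch below (plain Java with illustrative
names only; this is not the actual operator code):

// Illustration only: every tuple of a file gets the same "timestamp",
// which is a per-file sequence number, and a watermark for that sequence
// is produced when the file is closed.
public class FileSequenceStamper {
  private long fileSequence = -1;   // sequence number of the currently open file

  public void onFileOpened() {
    fileSequence++;                 // next file maps to the next window
  }

  // every data tuple read from the open file carries the same timestamp
  public StampedTuple stamp(String line) {
    return new StampedTuple(fileSequence, line);
  }

  // called when the file is fully read; tells downstream windowed operators
  // that the window identified by this sequence number is complete
  public long watermarkOnFileClosed() {
    return fileSequence;
  }

  public static class StampedTuple {
    public final long timestamp;    // file sequence, not event time
    public final String value;

    public StampedTuple(long timestamp, String value) {
      this.timestamp = timestamp;
      this.value = value;
    }
  }
}

With this, the windowed count operator only needs to react to the watermark;
it does not need to know anything about files.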
​


> I don't think there should be separate "windowed" connector operators. The
> watermark support needs to be incorporated into existing operators and
> configurable. Windowing is a concept that the entire library needs to be
> aware of. I see no reason to arrange classes in separate "window" packages
> except those that are really specific to windowing support such as the
> watermark tuple.
>

​I think, making changes to the existing operators would make them too
heavy and complex. I suggest we extend the existing operators and add new
classes that hold just the watermark logic. This will also keep any bugs
resulting from the new implementation isolated from the existing operators.
We can keep these in the same package as the existing operators. Only the
window specific classes (like the watermarks) will go into the window
package.
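
To make the intent concrete, on the output side the extension could look
roughly like this (class names are hypothetical; the real Malhar operators
have more hooks than this):

class ExistingFileWriter {                  // stands in for an existing Malhar operator
  protected void finalizeFile(String name) {
    // existing logic that renames the .tmp file to its final name
  }
}

class WatermarkAwareFileWriter extends ExistingFileWriter {
  private String currentFileName;

  void onDataTuple(String fileName, String line) {
    currentFileName = fileName;             // remember which file this batch writes to
    // the actual write is delegated to the existing operator logic
  }

  // new code: a watermark marks the end of the current batch, so the output
  // file can be finalized; everything else is reused from the base class
  void onWatermark(long fileSequence) {
    if (currentFileName != null) {
      finalizeFile(currentFileName);
      currentFileName = null;
    }
  }
}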
​


>
> Why does the control tuple have an operatorId in it?
>

​Operator id is not used in the current example, but may help the user to
understand the originating partition for a watermark tuple. This matters in
scenarios where we cannot otherwise distinguish watermark tuples coming from
different partitions, unlike the file based watermarks where the filename is
the distinguishing property.
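
For reference, the shape of such a control tuple could be roughly the
following (field names are illustrative only, not necessarily what the final
class will look like):

// Illustrative only: a file based watermark control tuple.
public class FileWatermarkTuple {
  public final long timestamp;    // monotonically increasing file sequence
  public final String fileName;   // extra information for sinks that understand files
  public final int operatorId;    // originating input partition, useful for debugging

  public FileWatermarkTuple(long timestamp, String fileName, int operatorId) {
    this.timestamp = timestamp;
    this.fileName = fileName;
    this.operatorId = operatorId;
  }
}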
​

>
> Once you make the changes to the operators, please also augment the
> documentation and examples (in this case wordcount demo).
>

Sure.



> Thanks,
> Thomas
>
>
>
> On Wed, Feb 22, 2017 at 4:51 AM, Bhupesh Chawda <bh...@datatorrent.com>
> wrote:
>
> > Hi Thomas,
> >
> > Sorry for the delay.
> > I agree that the watermark concept is general and is understood by
> > intermediate transformations. File name is some additional information in
> > the watermark which helps the start and end operators do stuff related to
> > batch.
> > As suggested, I have created a wordcount application which uses
> watermarks
> > to create separate windows for each file by means of a long (timestamp).
> > I am linking the source for reference:
> >
> > Watermarks:
> > https://github.com/bhupeshchawda/apex-malhar/blob/batch-io-operators/
> > library/src/main/java/org/apache/apex/malhar/lib/window/
> > windowable/FileWatermark.java
> >
> > Extended File Input and Output operators:
> > https://github.com/bhupeshchawda/apex-malhar/blob/batch-io-operators/
> > library/src/main/java/org/apache/apex/malhar/lib/window/windowable/
> > WindowedFileInputOperator.java
> > https://github.com/bhupeshchawda/apex-malhar/blob/batch-io-operators/
> > library/src/main/java/org/apache/apex/malhar/lib/window/windowable/
> > WindowedFileOutputOperator.java
> >
> >
> > WordCount Application:
> >
> > https://github.com/bhupeshchawda/apex-malhar/blob/batch-io-operators/
> > library/src/test/java/org/apache/apex/malhar/lib/window/
> > windowable/WindowedWordCount.java
> >
> >
> > The input operator attaches a timestamp with each file which allows the
> > WindowedOperator to identify each file and its state in a distinct
> window.
> >
> > Additionally, using the additional file information, the application can
> > store the counts in similarly named files at the destination.
> >
> >
> > Thanks.
> >
> > _______________________________________________________
> >
> > Bhupesh Chawda
> >
> > Software Engineer
> >
> > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> >
> > www.datatorrent.com  |  apex.apache.org
> >
> >
> >
> > On Sat, Feb 18, 2017 at 10:24 PM, Thomas Weise <th...@apache.org> wrote:
> >
> > > Hi Bhupesh,
> > >
> > > I think this needs a generic watermark concept that is independent of
> > > source and destination and can be understood by intermediate
> > > transformations. File names don't meet this criteria.
> > >
> > > One possible approach is to have a monotonic increasing file sequence
> > > (instead of time, if it is not applicable) that can be mapped to
> > watermark.
> > > You can still tag on the file name to the control tuple as extra
> > > information so that a file output operator that understands it can do
> > > whatever it wants with it. But it should also work without it, let's
> say
> > > when we write the output to the console.
> > >
> > > The key here is that you can demonstrate that an intermediate stateful
> > > transformation will work. I would suggest to try wordcount per input
> file
> > > with the window operator that emits the counts at file boundary,
> without
> > > knowing anything about files.
> > >
> > > Thanks,
> > > Thomas
> > >
> > >
> > > On Sat, Feb 18, 2017 at 8:04 AM, Bhupesh Chawda <
> bhupesh@datatorrent.com
> > >
> > > wrote:
> > >
> > > > Hi Thomas,
> > > >
> > > > For an input operator which is supposed to generate watermarks for
> > > > downstream operators, I can think about the following watermarks that
> > the
> > > > operator can emit:
> > > > 1. Time based watermarks (the high watermark / low watermark)
> > > > 2. Number of tuple based watermarks (Every n tuples)
> > > > 3. File based watermarks (Start file, end file)
> > > > 4. Final watermark
> > > >
> > > > File based watermarks seem to be applicable for batch (file based) as
> > > well,
> > > > and hence I thought of looking at these first. Does this seem to be
> in
> > > line
> > > > with the thought process?
> > > >
> > > > ~ Bhupesh
> > > >
> > > >
> > > >
> > > > _______________________________________________________
> > > >
> > > > Bhupesh Chawda
> > > >
> > > > Software Engineer
> > > >
> > > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> > > >
> > > > www.datatorrent.com  |  apex.apache.org
> > > >
> > > >
> > > >
> > > > On Thu, Feb 16, 2017 at 10:37 AM, Thomas Weise <th...@apache.org>
> wrote:
> > > >
> > > > > I don't think this should be designed based on a simplistic file
> > > > > input-output scenario. It would be good to include a stateful
> > > > > transformation based on event time.
> > > > >
> > > > > More complex pipelines contain stateful transformations that depend
> > on
> > > > > windowing and watermarks. I think we need a watermark concept that
> is
> > > > based
> > > > > on progress in event time (or other monotonic increasing sequence)
> > that
> > > > > other operators can generically work with.
> > > > >
> > > > > Note that even file input in many cases can produce time based
> > > > watermarks,
> > > > > for example when you read part files that are bound by event time.
> > > > >
> > > > > Thanks,
> > > > > Thomas
> > > > >
> > > > >
> > > > > On Wed, Feb 15, 2017 at 4:02 AM, Bhupesh Chawda <
> > > bhupesh@datatorrent.com
> > > > >
> > > > > wrote:
> > > > >
> > > > > > For better understanding the use case for control tuples in
> batch,
> > ​I
> > > > am
> > > > > > creating a prototype for a batch application using File Input and
> > > File
> > > > > > Output operators.
> > > > > >
> > > > > > To enable basic batch processing for File IO operators, I am
> > > proposing
> > > > > the
> > > > > > following changes to File input and output operators:
> > > > > > 1. File Input operator emits a watermark each time it opens and
> > > closes
> > > > a
> > > > > > file. These can be "start file" and "end file" watermarks which
> > > include
> > > > > the
> > > > > > corresponding file names. The "start file" tuple should be sent
> > > before
> > > > > any
> > > > > > of the data from that file flows.
> > > > > > 2. File Input operator can be configured to end the application
> > > after a
> > > > > > single or n scans of the directory (a batch). This is where the
> > > > operator
> > > > > > emits the final watermark (the end of application control tuple).
> > > This
> > > > > will
> > > > > > also shutdown the application.
> > > > > > 3. The File output operator handles these control tuples. "Start
> > > file"
> > > > > > initializes the file name for the incoming tuples. "End file"
> > > watermark
> > > > > > forces a finalize on that file.
> > > > > >
> > > > > > The user would be able to enable the operators to send only those
> > > > > > watermarks that are needed in the application. If none of the
> > options
> > > > are
> > > > > > configured, the operators behave as in a streaming application.
> > > > > >
> > > > > > There are a few challenges in the implementation where the input
> > > > operator
> > > > > > is partitioned. In this case, the correlation between the
> start/end
> > > > for a
> > > > > > file and the data tuples for that file is lost. Hence we need to
> > > > maintain
> > > > > > the filename as part of each tuple in the pipeline.
> > > > > >
> > > > > > The "start file" and "end file" control tuples in this example
> are
> > > > > > temporary names for watermarks. We can have generic "start
> batch" /
> > > > "end
> > > > > > batch" tuples which could be used for other use cases as well.
> The
> > > > Final
> > > > > > watermark is common and serves the same purpose in each case.
> > > > > >
> > > > > > Please let me know your thoughts on this.
> > > > > >
> > > > > > ~ Bhupesh
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, Jan 18, 2017 at 12:22 AM, Bhupesh Chawda <
> > > > > bhupesh@datatorrent.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Yes, this can be part of operator configuration. Given this,
> for
> > a
> > > > user
> > > > > > to
> > > > > > > define a batch application, would mean configuring the
> connectors
> > > > > (mostly
> > > > > > > the input operator) in the application for the desired
> behavior.
> > > > > > Similarly,
> > > > > > > there can be other use cases that can be achieved other than
> > batch.
> > > > > > >
> > > > > > > We may also need to take care of the following:
> > > > > > > 1. Make sure that the watermarks or control tuples are
> consistent
> > > > > across
> > > > > > > sources. Meaning an HDFS sink should be able to interpret the
> > > > watermark
> > > > > > > tuple sent out by, say, a JDBC source.
> > > > > > > 2. In addition to I/O connectors, we should also look at the
> need
> > > for
> > > > > > > processing operators to understand some of the control tuples /
> > > > > > watermarks.
> > > > > > > For example, we may want to reset the operator behavior on
> > arrival
> > > of
> > > > > > some
> > > > > > > watermark tuple.
> > > > > > >
> > > > > > > ~ Bhupesh
> > > > > > >
> > > > > > > On Tue, Jan 17, 2017 at 9:59 PM, Thomas Weise <th...@apache.org>
> > > > wrote:
> > > > > > >
> > > > > > >> The HDFS source can operate in two modes, bounded or
> unbounded.
> > If
> > > > you
> > > > > > >> scan
> > > > > > >> only once, then it should emit the final watermark after it is
> > > done.
> > > > > > >> Otherwise it would emit watermarks based on a policy (files
> > names
> > > > > etc.).
> > > > > > >> The mechanism to generate the marks may depend on the type of
> > > source
> > > > > and
> > > > > > >> the user needs to be able to influence/configure it.
> > > > > > >>
> > > > > > >> Thomas
> > > > > > >>
> > > > > > >>
> > > > > > >> On Tue, Jan 17, 2017 at 5:03 AM, Bhupesh Chawda <
> > > > > > bhupesh@datatorrent.com>
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >> > Hi Thomas,
> > > > > > >> >
> > > > > > >> > I am not sure that I completely understand your suggestion.
> > Are
> > > > you
> > > > > > >> > suggesting to broaden the scope of the proposal to treat all
> > > > sources
> > > > > > as
> > > > > > >> > bounded as well as unbounded?
> > > > > > >> >
> > > > > > >> > In case of Apex, we treat all sources as unbounded sources.
> > Even
> > > > > > bounded
> > > > > > >> > sources like HDFS file source is treated as unbounded by
> means
> > > of
> > > > > > >> scanning
> > > > > > >> > the input directory repeatedly.
> > > > > > >> >
> > > > > > >> > Let's consider HDFS file source for example:
> > > > > > >> > In this case, if we treat it as a bounded source, we can
> > define
> > > > > hooks
> > > > > > >> which
> > > > > > >> > allows us to detect the end of the file and send the "final
> > > > > > watermark".
> > > > > > >> We
> > > > > > >> > could also consider HDFS file source as a streaming source
> and
> > > > > define
> > > > > > >> hooks
> > > > > > >> > which send watermarks based on different kinds of windows.
> > > > > > >> >
> > > > > > >> > Please correct me if I misunderstand.
> > > > > > >> >
> > > > > > >> > ~ Bhupesh
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > On Mon, Jan 16, 2017 at 9:23 PM, Thomas Weise <
> thw@apache.org
> > >
> > > > > wrote:
> > > > > > >> >
> > > > > > >> > > Bhupesh,
> > > > > > >> > >
> > > > > > >> > > Please see how that can be solved in a unified way using
> > > windows
> > > > > and
> > > > > > >> > > watermarks. It is bounded data vs. unbounded data. In Beam
> > for
> > > > > > >> example,
> > > > > > >> > you
> > > > > > >> > > can use the "global window" and the final watermark to
> > > > accomplish
> > > > > > what
> > > > > > >> > you
> > > > > > >> > > are looking for. Batch is just a special case of streaming
> > > where
> > > > > the
> > > > > > >> > source
> > > > > > >> > > emits the final watermark.
> > > > > > >> > >
> > > > > > >> > > Thanks,
> > > > > > >> > > Thomas
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > On Mon, Jan 16, 2017 at 1:02 AM, Bhupesh Chawda <
> > > > > > >> bhupesh@datatorrent.com
> > > > > > >> > >
> > > > > > >> > > wrote:
> > > > > > >> > >
> > > > > > >> > > > Yes, if the user needs to develop a batch application,
> > then
> > > > > batch
> > > > > > >> aware
> > > > > > >> > > > operators need to be used in the application.
> > > > > > >> > > > The nature of the application is mostly controlled by
> the
> > > > input
> > > > > > and
> > > > > > >> the
> > > > > > >> > > > output operators used in the application.
> > > > > > >> > > >
> > > > > > >> > > > For example, consider an application which needs to
> filter
> > > > > records
> > > > > > >> in a
> > > > > > >> > > > input file and store the filtered records in another
> file.
> > > The
> > > > > > >> nature
> > > > > > >> > of
> > > > > > >> > > > this app is to end once the entire file is processed.
> > > > Following
> > > > > > >> things
> > > > > > >> > > are
> > > > > > >> > > > expected of the application:
> > > > > > >> > > >
> > > > > > >> > > >    1. Once the input data is over, finalize the output
> > file
> > > > from
> > > > > > >> .tmp
> > > > > > >> > > >    files. - Responsibility of output operator
> > > > > > >> > > >    2. End the application, once the data is read and
> > > > processed -
> > > > > > >> > > >    Responsibility of input operator
> > > > > > >> > > >
> > > > > > >> > > > These functions are essential to allow the user to do
> > higher
> > > > > level
> > > > > > >> > > > operations like scheduling or running a workflow of
> batch
> > > > > > >> applications.
> > > > > > >> > > >
> > > > > > >> > > > I am not sure about intermediate (processing) operators,
> > as
> > > > > there
> > > > > > >> is no
> > > > > > >> > > > change in their functionality for batch use cases.
> > Perhaps,
> > > > > > allowing
> > > > > > >> > > > running multiple batches in a single application may
> > require
> > > > > > similar
> > > > > > >> > > > changes in processing operators as well.
> > > > > > >> > > >
> > > > > > >> > > > ~ Bhupesh
> > > > > > >> > > >
> > > > > > >> > > > On Mon, Jan 16, 2017 at 2:19 PM, Priyanka Gugale <
> > > > > > priyag@apache.org
> > > > > > >> >
> > > > > > >> > > > wrote:
> > > > > > >> > > >
> > > > > > >> > > > > Will it make an impression on user that, if he has a
> > batch
> > > > > > >> usecase he
> > > > > > >> > > has
> > > > > > >> > > > > to use batch aware operators only? If so, is that what
> > we
> > > > > > expect?
> > > > > > >> I
> > > > > > >> > am
> > > > > > >> > > > not
> > > > > > >> > > > > aware of how do we implement batch scenario so this
> > might
> > > > be a
> > > > > > >> basic
> > > > > > >> > > > > question.
> > > > > > >> > > > >
> > > > > > >> > > > > -Priyanka
> > > > > > >> > > > >
> > > > > > >> > > > > On Mon, Jan 16, 2017 at 12:02 PM, Bhupesh Chawda <
> > > > > > >> > > > bhupesh@datatorrent.com>
> > > > > > >> > > > > wrote:
> > > > > > >> > > > >
> > > > > > >> > > > > > Hi All,
> > > > > > >> > > > > >
> > > > > > >> > > > > > While design / implementation for custom control
> > tuples
> > > is
> > > > > > >> > ongoing, I
> > > > > > >> > > > > > thought it would be a good idea to consider its
> > > usefulness
> > > > > in
> > > > > > >> one
> > > > > > >> > of
> > > > > > >> > > > the
> > > > > > >> > > > > > use cases -  batch applications.
> > > > > > >> > > > > >
> > > > > > >> > > > > > This is a proposal to adapt / extend existing
> > operators
> > > in
> > > > > the
> > > > > > >> > Apache
> > > > > > >> > > > > Apex
> > > > > > >> > > > > > Malhar library so that it is easy to use them in
> batch
> > > use
> > > > > > >> cases.
> > > > > > >> > > > > > Naturally, this would be applicable for only a
> subset
> > of
> > > > > > >> operators
> > > > > > >> > > like
> > > > > > >> > > > > > File, JDBC and NoSQL databases.
> > > > > > >> > > > > > For example, for a file based store, (say HDFS
> store),
> > > we
> > > > > > could
> > > > > > >> > have
> > > > > > >> > > > > > FileBatchInput and FileBatchOutput operators which
> > allow
> > > > > easy
> > > > > > >> > > > integration
> > > > > > >> > > > > > into a batch application. These operators would be
> > > > extended
> > > > > > from
> > > > > > >> > > their
> > > > > > >> > > > > > existing implementations and would be "Batch Aware",
> > in
> > > > that
> > > > > > >> they
> > > > > > >> > may
> > > > > > >> > > > > > understand the meaning of some specific control
> tuples
> > > > that
> > > > > > flow
> > > > > > >> > > > through
> > > > > > >> > > > > > the DAG. Start batch and end batch seem to be the
> > > obvious
> > > > > > >> > candidates
> > > > > > >> > > > that
> > > > > > >> > > > > > come to mind. On receipt of such control tuples,
> they
> > > may
> > > > > try
> > > > > > to
> > > > > > >> > > modify
> > > > > > >> > > > > the
> > > > > > >> > > > > > behavior of the operator - to reinitialize some
> > metrics
> > > or
> > > > > > >> finalize
> > > > > > >> > > an
> > > > > > >> > > > > > output file for example.
> > > > > > >> > > > > >
> > > > > > >> > > > > > We can discuss the potential control tuples and
> > actions
> > > in
> > > > > > >> detail,
> > > > > > >> > > but
> > > > > > >> > > > > > first I would like to understand the views of the
> > > > community
> > > > > > for
> > > > > > >> > this
> > > > > > >> > > > > > proposal.
> > > > > > >> > > > > >
> > > > > > >> > > > > > ~ Bhupesh
> > > > > > >> > > > > >
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by Thomas Weise <th...@apache.org>.
Hi Bhupesh,

This looks great. You use the watermark as a measure of completeness and the
window to isolate the state, which is how it should work.

Questions/comments:

Why does the count operator have a 2ms window when this should be driven by
the watermark from the input operator?

I don't think there should be separate "windowed" connector operators. The
watermark support needs to be incorporated into existing operators and
configurable. Windowing is a concept that the entire library needs to be
aware of. I see no reason to arrange classes in separate "window" packages
except those that are really specific to windowing support such as the
watermark tuple.

Why does the control tuple have an operatorId in it?

Once you make the changes to the operators, please also augment the
documentation and examples (in this case wordcount demo).

Thanks,
Thomas



On Wed, Feb 22, 2017 at 4:51 AM, Bhupesh Chawda <bh...@datatorrent.com>
wrote:

> Hi Thomas,
>
> Sorry for the delay.
> I agree that the watermark concept is general and is understood by
> intermediate transformations. File name is some additional information in
> the watermark which helps the start and end operators do stuff related to
> batch.
> As suggested, I have created a wordcount application which uses watermarks
> to create separate windows for each file by means of a long (timestamp).
> I am linking the source for reference:
>
> Watermarks:
> https://github.com/bhupeshchawda/apex-malhar/blob/batch-io-operators/
> library/src/main/java/org/apache/apex/malhar/lib/window/
> windowable/FileWatermark.java
>
> Extended File Input and Output operators:
> https://github.com/bhupeshchawda/apex-malhar/blob/batch-io-operators/
> library/src/main/java/org/apache/apex/malhar/lib/window/windowable/
> WindowedFileInputOperator.java
> https://github.com/bhupeshchawda/apex-malhar/blob/batch-io-operators/
> library/src/main/java/org/apache/apex/malhar/lib/window/windowable/
> WindowedFileOutputOperator.java
>
>
> WordCount Application:
>
> https://github.com/bhupeshchawda/apex-malhar/blob/batch-io-operators/
> library/src/test/java/org/apache/apex/malhar/lib/window/
> windowable/WindowedWordCount.java
>
>
> The input operator attaches a timestamp with each file which allows the
> WindowedOperator to identify each file and its state in a distinct window.
>
> Additionally, using the additional file information, the application can
> store the counts in similarly named files at the destination.
>
>
> Thanks.
>
> _______________________________________________________
>
> Bhupesh Chawda
>
> Software Engineer
>
> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>
> www.datatorrent.com  |  apex.apache.org
>
>
>
> On Sat, Feb 18, 2017 at 10:24 PM, Thomas Weise <th...@apache.org> wrote:
>
> > Hi Bhupesh,
> >
> > I think this needs a generic watermark concept that is independent of
> > source and destination and can be understood by intermediate
> > transformations. File names don't meet this criteria.
> >
> > One possible approach is to have a monotonic increasing file sequence
> > (instead of time, if it is not applicable) that can be mapped to
> watermark.
> > You can still tag on the file name to the control tuple as extra
> > information so that a file output operator that understands it can do
> > whatever it wants with it. But it should also work without it, let's say
> > when we write the output to the console.
> >
> > The key here is that you can demonstrate that an intermediate stateful
> > transformation will work. I would suggest to try wordcount per input file
> > with the window operator that emits the counts at file boundary, without
> > knowing anything about files.
> >
> > Thanks,
> > Thomas
> >
> >
> > On Sat, Feb 18, 2017 at 8:04 AM, Bhupesh Chawda <bhupesh@datatorrent.com
> >
> > wrote:
> >
> > > Hi Thomas,
> > >
> > > For an input operator which is supposed to generate watermarks for
> > > downstream operators, I can think about the following watermarks that
> the
> > > operator can emit:
> > > 1. Time based watermarks (the high watermark / low watermark)
> > > 2. Number of tuple based watermarks (Every n tuples)
> > > 3. File based watermarks (Start file, end file)
> > > 4. Final watermark
> > >
> > > File based watermarks seem to be applicable for batch (file based) as
> > well,
> > > and hence I thought of looking at these first. Does this seem to be in
> > line
> > > with the thought process?
> > >
> > > ~ Bhupesh
> > >
> > >
> > >
> > > _______________________________________________________
> > >
> > > Bhupesh Chawda
> > >
> > > Software Engineer
> > >
> > > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> > >
> > > www.datatorrent.com  |  apex.apache.org
> > >
> > >
> > >
> > > On Thu, Feb 16, 2017 at 10:37 AM, Thomas Weise <th...@apache.org> wrote:
> > >
> > > > I don't think this should be designed based on a simplistic file
> > > > input-output scenario. It would be good to include a stateful
> > > > transformation based on event time.
> > > >
> > > > More complex pipelines contain stateful transformations that depend
> on
> > > > windowing and watermarks. I think we need a watermark concept that is
> > > based
> > > > on progress in event time (or other monotonic increasing sequence)
> that
> > > > other operators can generically work with.
> > > >
> > > > Note that even file input in many cases can produce time based
> > > watermarks,
> > > > for example when you read part files that are bound by event time.
> > > >
> > > > Thanks,
> > > > Thomas
> > > >
> > > >
> > > > On Wed, Feb 15, 2017 at 4:02 AM, Bhupesh Chawda <
> > bhupesh@datatorrent.com
> > > >
> > > > wrote:
> > > >
> > > > > For better understanding the use case for control tuples in batch,
> ​I
> > > am
> > > > > creating a prototype for a batch application using File Input and
> > File
> > > > > Output operators.
> > > > >
> > > > > To enable basic batch processing for File IO operators, I am
> > proposing
> > > > the
> > > > > following changes to File input and output operators:
> > > > > 1. File Input operator emits a watermark each time it opens and
> > closes
> > > a
> > > > > file. These can be "start file" and "end file" watermarks which
> > include
> > > > the
> > > > > corresponding file names. The "start file" tuple should be sent
> > before
> > > > any
> > > > > of the data from that file flows.
> > > > > 2. File Input operator can be configured to end the application
> > after a
> > > > > single or n scans of the directory (a batch). This is where the
> > > operator
> > > > > emits the final watermark (the end of application control tuple).
> > This
> > > > will
> > > > > also shutdown the application.
> > > > > 3. The File output operator handles these control tuples. "Start
> > file"
> > > > > initializes the file name for the incoming tuples. "End file"
> > watermark
> > > > > forces a finalize on that file.
> > > > >
> > > > > The user would be able to enable the operators to send only those
> > > > > watermarks that are needed in the application. If none of the
> options
> > > are
> > > > > configured, the operators behave as in a streaming application.
> > > > >
> > > > > There are a few challenges in the implementation where the input
> > > operator
> > > > > is partitioned. In this case, the correlation between the start/end
> > > for a
> > > > > file and the data tuples for that file is lost. Hence we need to
> > > maintain
> > > > > the filename as part of each tuple in the pipeline.
> > > > >
> > > > > The "start file" and "end file" control tuples in this example are
> > > > > temporary names for watermarks. We can have generic "start batch" /
> > > "end
> > > > > batch" tuples which could be used for other use cases as well. The
> > > Final
> > > > > watermark is common and serves the same purpose in each case.
> > > > >
> > > > > Please let me know your thoughts on this.
> > > > >
> > > > > ~ Bhupesh
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Jan 18, 2017 at 12:22 AM, Bhupesh Chawda <
> > > > bhupesh@datatorrent.com>
> > > > > wrote:
> > > > >
> > > > > > Yes, this can be part of operator configuration. Given this, for
> a
> > > user
> > > > > to
> > > > > > define a batch application, would mean configuring the connectors
> > > > (mostly
> > > > > > the input operator) in the application for the desired behavior.
> > > > > Similarly,
> > > > > > there can be other use cases that can be achieved other than
> batch.
> > > > > >
> > > > > > We may also need to take care of the following:
> > > > > > 1. Make sure that the watermarks or control tuples are consistent
> > > > across
> > > > > > sources. Meaning an HDFS sink should be able to interpret the
> > > watermark
> > > > > > tuple sent out by, say, a JDBC source.
> > > > > > 2. In addition to I/O connectors, we should also look at the need
> > for
> > > > > > processing operators to understand some of the control tuples /
> > > > > watermarks.
> > > > > > For example, we may want to reset the operator behavior on
> arrival
> > of
> > > > > some
> > > > > > watermark tuple.
> > > > > >
> > > > > > ~ Bhupesh
> > > > > >
> > > > > > On Tue, Jan 17, 2017 at 9:59 PM, Thomas Weise <th...@apache.org>
> > > wrote:
> > > > > >
> > > > > >> The HDFS source can operate in two modes, bounded or unbounded.
> If
> > > you
> > > > > >> scan
> > > > > >> only once, then it should emit the final watermark after it is
> > done.
> > > > > >> Otherwise it would emit watermarks based on a policy (files
> names
> > > > etc.).
> > > > > >> The mechanism to generate the marks may depend on the type of
> > source
> > > > and
> > > > > >> the user needs to be able to influence/configure it.
> > > > > >>
> > > > > >> Thomas
> > > > > >>
> > > > > >>
> > > > > >> On Tue, Jan 17, 2017 at 5:03 AM, Bhupesh Chawda <
> > > > > bhupesh@datatorrent.com>
> > > > > >> wrote:
> > > > > >>
> > > > > >> > Hi Thomas,
> > > > > >> >
> > > > > >> > I am not sure that I completely understand your suggestion.
> Are
> > > you
> > > > > >> > suggesting to broaden the scope of the proposal to treat all
> > > sources
> > > > > as
> > > > > >> > bounded as well as unbounded?
> > > > > >> >
> > > > > >> > In case of Apex, we treat all sources as unbounded sources.
> Even
> > > > > bounded
> > > > > >> > sources like HDFS file source is treated as unbounded by means
> > of
> > > > > >> scanning
> > > > > >> > the input directory repeatedly.
> > > > > >> >
> > > > > >> > Let's consider HDFS file source for example:
> > > > > >> > In this case, if we treat it as a bounded source, we can
> define
> > > > hooks
> > > > > >> which
> > > > > >> > allows us to detect the end of the file and send the "final
> > > > > watermark".
> > > > > >> We
> > > > > >> > could also consider HDFS file source as a streaming source and
> > > > define
> > > > > >> hooks
> > > > > >> > which send watermarks based on different kinds of windows.
> > > > > >> >
> > > > > >> > Please correct me if I misunderstand.
> > > > > >> >
> > > > > >> > ~ Bhupesh
> > > > > >> >
> > > > > >> >
> > > > > >> > On Mon, Jan 16, 2017 at 9:23 PM, Thomas Weise <thw@apache.org
> >
> > > > wrote:
> > > > > >> >
> > > > > >> > > Bhupesh,
> > > > > >> > >
> > > > > >> > > Please see how that can be solved in a unified way using
> > windows
> > > > and
> > > > > >> > > watermarks. It is bounded data vs. unbounded data. In Beam
> for
> > > > > >> example,
> > > > > >> > you
> > > > > >> > > can use the "global window" and the final watermark to
> > > accomplish
> > > > > what
> > > > > >> > you
> > > > > >> > > are looking for. Batch is just a special case of streaming
> > where
> > > > the
> > > > > >> > source
> > > > > >> > > emits the final watermark.
> > > > > >> > >
> > > > > >> > > Thanks,
> > > > > >> > > Thomas
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > On Mon, Jan 16, 2017 at 1:02 AM, Bhupesh Chawda <
> > > > > >> bhupesh@datatorrent.com
> > > > > >> > >
> > > > > >> > > wrote:
> > > > > >> > >
> > > > > >> > > > Yes, if the user needs to develop a batch application,
> then
> > > > batch
> > > > > >> aware
> > > > > >> > > > operators need to be used in the application.
> > > > > >> > > > The nature of the application is mostly controlled by the
> > > input
> > > > > and
> > > > > >> the
> > > > > >> > > > output operators used in the application.
> > > > > >> > > >
> > > > > >> > > > For example, consider an application which needs to filter
> > > > records
> > > > > >> in a
> > > > > >> > > > input file and store the filtered records in another file.
> > The
> > > > > >> nature
> > > > > >> > of
> > > > > >> > > > this app is to end once the entire file is processed.
> > > Following
> > > > > >> things
> > > > > >> > > are
> > > > > >> > > > expected of the application:
> > > > > >> > > >
> > > > > >> > > >    1. Once the input data is over, finalize the output
> file
> > > from
> > > > > >> .tmp
> > > > > >> > > >    files. - Responsibility of output operator
> > > > > >> > > >    2. End the application, once the data is read and
> > > processed -
> > > > > >> > > >    Responsibility of input operator
> > > > > >> > > >
> > > > > >> > > > These functions are essential to allow the user to do
> higher
> > > > level
> > > > > >> > > > operations like scheduling or running a workflow of batch
> > > > > >> applications.
> > > > > >> > > >
> > > > > >> > > > I am not sure about intermediate (processing) operators,
> as
> > > > there
> > > > > >> is no
> > > > > >> > > > change in their functionality for batch use cases.
> Perhaps,
> > > > > allowing
> > > > > >> > > > running multiple batches in a single application may
> require
> > > > > similar
> > > > > >> > > > changes in processing operators as well.
> > > > > >> > > >
> > > > > >> > > > ~ Bhupesh
> > > > > >> > > >
> > > > > >> > > > On Mon, Jan 16, 2017 at 2:19 PM, Priyanka Gugale <
> > > > > priyag@apache.org
> > > > > >> >
> > > > > >> > > > wrote:
> > > > > >> > > >
> > > > > >> > > > > Will it make an impression on user that, if he has a
> batch
> > > > > >> usecase he
> > > > > >> > > has
> > > > > >> > > > > to use batch aware operators only? If so, is that what
> we
> > > > > expect?
> > > > > >> I
> > > > > >> > am
> > > > > >> > > > not
> > > > > >> > > > > aware of how do we implement batch scenario so this
> might
> > > be a
> > > > > >> basic
> > > > > >> > > > > question.
> > > > > >> > > > >
> > > > > >> > > > > -Priyanka
> > > > > >> > > > >
> > > > > >> > > > > On Mon, Jan 16, 2017 at 12:02 PM, Bhupesh Chawda <
> > > > > >> > > > bhupesh@datatorrent.com>
> > > > > >> > > > > wrote:
> > > > > >> > > > >
> > > > > >> > > > > > Hi All,
> > > > > >> > > > > >
> > > > > >> > > > > > While design / implementation for custom control
> tuples
> > is
> > > > > >> > ongoing, I
> > > > > >> > > > > > thought it would be a good idea to consider its
> > usefulness
> > > > in
> > > > > >> one
> > > > > >> > of
> > > > > >> > > > the
> > > > > >> > > > > > use cases -  batch applications.
> > > > > >> > > > > >
> > > > > >> > > > > > This is a proposal to adapt / extend existing
> operators
> > in
> > > > the
> > > > > >> > Apache
> > > > > >> > > > > Apex
> > > > > >> > > > > > Malhar library so that it is easy to use them in batch
> > use
> > > > > >> cases.
> > > > > >> > > > > > Naturally, this would be applicable for only a subset
> of
> > > > > >> operators
> > > > > >> > > like
> > > > > >> > > > > > File, JDBC and NoSQL databases.
> > > > > >> > > > > > For example, for a file based store, (say HDFS store),
> > we
> > > > > could
> > > > > >> > have
> > > > > >> > > > > > FileBatchInput and FileBatchOutput operators which
> allow
> > > > easy
> > > > > >> > > > integration
> > > > > >> > > > > > into a batch application. These operators would be
> > > extended
> > > > > from
> > > > > >> > > their
> > > > > >> > > > > > existing implementations and would be "Batch Aware",
> in
> > > that
> > > > > >> they
> > > > > >> > may
> > > > > >> > > > > > understand the meaning of some specific control tuples
> > > that
> > > > > flow
> > > > > >> > > > through
> > > > > >> > > > > > the DAG. Start batch and end batch seem to be the
> > obvious
> > > > > >> > candidates
> > > > > >> > > > that
> > > > > >> > > > > > come to mind. On receipt of such control tuples, they
> > may
> > > > try
> > > > > to
> > > > > >> > > modify
> > > > > >> > > > > the
> > > > > >> > > > > > behavior of the operator - to reinitialize some
> metrics
> > or
> > > > > >> finalize
> > > > > >> > > an
> > > > > >> > > > > > output file for example.
> > > > > >> > > > > >
> > > > > >> > > > > > We can discuss the potential control tuples and
> actions
> > in
> > > > > >> detail,
> > > > > >> > > but
> > > > > >> > > > > > first I would like to understand the views of the
> > > community
> > > > > for
> > > > > >> > this
> > > > > >> > > > > > proposal.
> > > > > >> > > > > >
> > > > > >> > > > > > ~ Bhupesh
> > > > > >> > > > > >
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by Bhupesh Chawda <bh...@datatorrent.com>.
Hi Thomas,

Sorry for the delay.
I agree that the watermark concept is general and is understood by
intermediate transformations. The file name is additional information in
the watermark which helps the operators at the start and end of the DAG
carry out the batch-related work.
As suggested, I have created a wordcount application which uses watermarks
to create separate windows for each file by means of a long (timestamp).
I am linking the source for reference:

Watermarks:
https://github.com/bhupeshchawda/apex-malhar/blob/batch-io-operators/library/src/main/java/org/apache/apex/malhar/lib/window/windowable/FileWatermark.java

Extended File Input and Output operators:
https://github.com/bhupeshchawda/apex-malhar/blob/batch-io-operators/library/src/main/java/org/apache/apex/malhar/lib/window/windowable/WindowedFileInputOperator.java
https://github.com/bhupeshchawda/apex-malhar/blob/batch-io-operators/library/src/main/java/org/apache/apex/malhar/lib/window/windowable/WindowedFileOutputOperator.java


WordCount Application:

https://github.com/bhupeshchawda/apex-malhar/blob/batch-io-operators/library/src/test/java/org/apache/apex/malhar/lib/window/windowable/WindowedWordCount.java


The input operator attaches a timestamp to each file, which allows the
WindowedOperator to identify each file and its state in a distinct window.

Additionally, using the file name carried in the watermark, the application
can store the counts in similarly named files at the destination.
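
To summarize the per-file counting idea, the accumulation works roughly as
in the sketch below. This is an illustration only; the WindowedWordCount
application linked above uses the Malhar WindowedOperator for this.

import java.util.HashMap;
import java.util.Map;

// Illustration only: counts are kept per window (one window per file, keyed
// by the file's timestamp); a watermark for that timestamp means the file is
// complete, so its counts are emitted and the window state is dropped.
public class PerFileWordCount {
  private final Map<Long, Map<String, Integer>> countsPerWindow = new HashMap<>();

  public void onTuple(long fileTimestamp, String word) {
    countsPerWindow.computeIfAbsent(fileTimestamp, k -> new HashMap<>())
                   .merge(word, 1, Integer::sum);
  }

  public Map<String, Integer> onWatermark(long fileTimestamp) {
    return countsPerWindow.remove(fileTimestamp);   // emit and clear the window state
  }
}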


Thanks.

_______________________________________________________

Bhupesh Chawda

Software Engineer

E: bhupesh@datatorrent.com | Twitter: @bhupeshsc

www.datatorrent.com  |  apex.apache.org



On Sat, Feb 18, 2017 at 10:24 PM, Thomas Weise <th...@apache.org> wrote:

> Hi Bhupesh,
>
> I think this needs a generic watermark concept that is independent of
> source and destination and can be understood by intermediate
> transformations. File names don't meet this criteria.
>
> One possible approach is to have a monotonic increasing file sequence
> (instead of time, if it is not applicable) that can be mapped to watermark.
> You can still tag on the file name to the control tuple as extra
> information so that a file output operator that understands it can do
> whatever it wants with it. But it should also work without it, let's say
> when we write the output to the console.
>
> The key here is that you can demonstrate that an intermediate stateful
> transformation will work. I would suggest to try wordcount per input file
> with the window operator that emits the counts at file boundary, without
> knowing anything about files.
>
> Thanks,
> Thomas
>
>
> On Sat, Feb 18, 2017 at 8:04 AM, Bhupesh Chawda <bh...@datatorrent.com>
> wrote:
>
> > Hi Thomas,
> >
> > For an input operator which is supposed to generate watermarks for
> > downstream operators, I can think about the following watermarks that the
> > operator can emit:
> > 1. Time based watermarks (the high watermark / low watermark)
> > 2. Number of tuple based watermarks (Every n tuples)
> > 3. File based watermarks (Start file, end file)
> > 4. Final watermark
> >
> > File based watermarks seem to be applicable for batch (file based) as
> well,
> > and hence I thought of looking at these first. Does this seem to be in
> line
> > with the thought process?
> >
> > ~ Bhupesh
> >
> >
> >
> > _______________________________________________________
> >
> > Bhupesh Chawda
> >
> > Software Engineer
> >
> > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> >
> > www.datatorrent.com  |  apex.apache.org
> >
> >
> >
> > On Thu, Feb 16, 2017 at 10:37 AM, Thomas Weise <th...@apache.org> wrote:
> >
> > > I don't think this should be designed based on a simplistic file
> > > input-output scenario. It would be good to include a stateful
> > > transformation based on event time.
> > >
> > > More complex pipelines contain stateful transformations that depend on
> > > windowing and watermarks. I think we need a watermark concept that is
> > based
> > > on progress in event time (or other monotonic increasing sequence) that
> > > other operators can generically work with.
> > >
> > > Note that even file input in many cases can produce time based
> > watermarks,
> > > for example when you read part files that are bound by event time.
> > >
> > > Thanks,
> > > Thomas
> > >
> > >
> > > On Wed, Feb 15, 2017 at 4:02 AM, Bhupesh Chawda <
> bhupesh@datatorrent.com
> > >
> > > wrote:
> > >
> > > > For better understanding the use case for control tuples in batch, ​I
> > am
> > > > creating a prototype for a batch application using File Input and
> File
> > > > Output operators.
> > > >
> > > > To enable basic batch processing for File IO operators, I am
> proposing
> > > the
> > > > following changes to File input and output operators:
> > > > 1. File Input operator emits a watermark each time it opens and
> closes
> > a
> > > > file. These can be "start file" and "end file" watermarks which
> include
> > > the
> > > > corresponding file names. The "start file" tuple should be sent
> before
> > > any
> > > > of the data from that file flows.
> > > > 2. File Input operator can be configured to end the application
> after a
> > > > single or n scans of the directory (a batch). This is where the
> > operator
> > > > emits the final watermark (the end of application control tuple).
> This
> > > will
> > > > also shutdown the application.
> > > > 3. The File output operator handles these control tuples. "Start
> file"
> > > > initializes the file name for the incoming tuples. "End file"
> watermark
> > > > forces a finalize on that file.
> > > >
> > > > The user would be able to enable the operators to send only those
> > > > watermarks that are needed in the application. If none of the options
> > are
> > > > configured, the operators behave as in a streaming application.
> > > >
> > > > There are a few challenges in the implementation where the input
> > operator
> > > > is partitioned. In this case, the correlation between the start/end
> > for a
> > > > file and the data tuples for that file is lost. Hence we need to
> > maintain
> > > > the filename as part of each tuple in the pipeline.
> > > >
> > > > The "start file" and "end file" control tuples in this example are
> > > > temporary names for watermarks. We can have generic "start batch" /
> > "end
> > > > batch" tuples which could be used for other use cases as well. The
> > Final
> > > > watermark is common and serves the same purpose in each case.
> > > >
> > > > Please let me know your thoughts on this.
> > > >
> > > > ~ Bhupesh
> > > >
> > > >
> > > >
> > > > On Wed, Jan 18, 2017 at 12:22 AM, Bhupesh Chawda <
> > > bhupesh@datatorrent.com>
> > > > wrote:
> > > >
> > > > > Yes, this can be part of operator configuration. Given this, for a
> > user
> > > > to
> > > > > define a batch application, would mean configuring the connectors
> > > (mostly
> > > > > the input operator) in the application for the desired behavior.
> > > > Similarly,
> > > > > there can be other use cases that can be achieved other than batch.
> > > > >
> > > > > We may also need to take care of the following:
> > > > > 1. Make sure that the watermarks or control tuples are consistent
> > > across
> > > > > sources. Meaning an HDFS sink should be able to interpret the
> > watermark
> > > > > tuple sent out by, say, a JDBC source.
> > > > > 2. In addition to I/O connectors, we should also look at the need
> for
> > > > > processing operators to understand some of the control tuples /
> > > > watermarks.
> > > > > For example, we may want to reset the operator behavior on arrival
> of
> > > > some
> > > > > watermark tuple.
> > > > >
> > > > > ~ Bhupesh
> > > > >
> > > > > On Tue, Jan 17, 2017 at 9:59 PM, Thomas Weise <th...@apache.org>
> > wrote:
> > > > >
> > > > >> The HDFS source can operate in two modes, bounded or unbounded. If
> > you
> > > > >> scan
> > > > >> only once, then it should emit the final watermark after it is
> done.
> > > > >> Otherwise it would emit watermarks based on a policy (files names
> > > etc.).
> > > > >> The mechanism to generate the marks may depend on the type of
> source
> > > and
> > > > >> the user needs to be able to influence/configure it.
> > > > >>
> > > > >> Thomas
> > > > >>
> > > > >>
> > > > >> On Tue, Jan 17, 2017 at 5:03 AM, Bhupesh Chawda <
> > > > bhupesh@datatorrent.com>
> > > > >> wrote:
> > > > >>
> > > > >> > Hi Thomas,
> > > > >> >
> > > > >> > I am not sure that I completely understand your suggestion. Are
> > you
> > > > >> > suggesting to broaden the scope of the proposal to treat all
> > sources
> > > > as
> > > > >> > bounded as well as unbounded?
> > > > >> >
> > > > >> > In case of Apex, we treat all sources as unbounded sources. Even
> > > > bounded
> > > > >> > sources like HDFS file source is treated as unbounded by means
> of
> > > > >> scanning
> > > > >> > the input directory repeatedly.
> > > > >> >
> > > > >> > Let's consider HDFS file source for example:
> > > > >> > In this case, if we treat it as a bounded source, we can define
> > > hooks
> > > > >> which
> > > > >> > allows us to detect the end of the file and send the "final
> > > > watermark".
> > > > >> We
> > > > >> > could also consider HDFS file source as a streaming source and
> > > define
> > > > >> hooks
> > > > >> > which send watermarks based on different kinds of windows.
> > > > >> >
> > > > >> > Please correct me if I misunderstand.
> > > > >> >
> > > > >> > ~ Bhupesh
> > > > >> >
> > > > >> >
> > > > >> > On Mon, Jan 16, 2017 at 9:23 PM, Thomas Weise <th...@apache.org>
> > > wrote:
> > > > >> >
> > > > >> > > Bhupesh,
> > > > >> > >
> > > > >> > > Please see how that can be solved in a unified way using
> windows
> > > and
> > > > >> > > watermarks. It is bounded data vs. unbounded data. In Beam for
> > > > >> example,
> > > > >> > you
> > > > >> > > can use the "global window" and the final watermark to
> > accomplish
> > > > what
> > > > >> > you
> > > > >> > > are looking for. Batch is just a special case of streaming
> where
> > > the
> > > > >> > source
> > > > >> > > emits the final watermark.
> > > > >> > >
> > > > >> > > Thanks,
> > > > >> > > Thomas
> > > > >> > >
> > > > >> > >
> > > > >> > > On Mon, Jan 16, 2017 at 1:02 AM, Bhupesh Chawda <
> > > > >> bhupesh@datatorrent.com
> > > > >> > >
> > > > >> > > wrote:
> > > > >> > >
> > > > >> > > > Yes, if the user needs to develop a batch application, then
> > > batch
> > > > >> aware
> > > > >> > > > operators need to be used in the application.
> > > > >> > > > The nature of the application is mostly controlled by the
> > input
> > > > and
> > > > >> the
> > > > >> > > > output operators used in the application.
> > > > >> > > >
> > > > >> > > > For example, consider an application which needs to filter
> > > records
> > > > >> in a
> > > > >> > > > input file and store the filtered records in another file.
> The
> > > > >> nature
> > > > >> > of
> > > > >> > > > this app is to end once the entire file is processed.
> > Following
> > > > >> things
> > > > >> > > are
> > > > >> > > > expected of the application:
> > > > >> > > >
> > > > >> > > >    1. Once the input data is over, finalize the output file
> > from
> > > > >> .tmp
> > > > >> > > >    files. - Responsibility of output operator
> > > > >> > > >    2. End the application, once the data is read and
> > processed -
> > > > >> > > >    Responsibility of input operator
> > > > >> > > >
> > > > >> > > > These functions are essential to allow the user to do higher
> > > level
> > > > >> > > > operations like scheduling or running a workflow of batch
> > > > >> applications.
> > > > >> > > >
> > > > >> > > > I am not sure about intermediate (processing) operators, as
> > > there
> > > > >> is no
> > > > >> > > > change in their functionality for batch use cases. Perhaps,
> > > > allowing
> > > > >> > > > running multiple batches in a single application may require
> > > > similar
> > > > >> > > > changes in processing operators as well.
> > > > >> > > >
> > > > >> > > > ~ Bhupesh
> > > > >> > > >
> > > > >> > > > On Mon, Jan 16, 2017 at 2:19 PM, Priyanka Gugale <
> > > > priyag@apache.org
> > > > >> >
> > > > >> > > > wrote:
> > > > >> > > >
> > > > >> > > > > Will it make an impression on user that, if he has a batch
> > > > >> usecase he
> > > > >> > > has
> > > > >> > > > > to use batch aware operators only? If so, is that what we
> > > > expect?
> > > > >> I
> > > > >> > am
> > > > >> > > > not
> > > > >> > > > > aware of how do we implement batch scenario so this might
> > be a
> > > > >> basic
> > > > >> > > > > question.
> > > > >> > > > >
> > > > >> > > > > -Priyanka
> > > > >> > > > >
> > > > >> > > > > On Mon, Jan 16, 2017 at 12:02 PM, Bhupesh Chawda <
> > > > >> > > > bhupesh@datatorrent.com>
> > > > >> > > > > wrote:
> > > > >> > > > >
> > > > >> > > > > > Hi All,
> > > > >> > > > > >
> > > > >> > > > > > While design / implementation for custom control tuples
> is
> > > > >> > ongoing, I
> > > > >> > > > > > thought it would be a good idea to consider its
> usefulness
> > > in
> > > > >> one
> > > > >> > of
> > > > >> > > > the
> > > > >> > > > > > use cases -  batch applications.
> > > > >> > > > > >
> > > > >> > > > > > This is a proposal to adapt / extend existing operators
> in
> > > the
> > > > >> > Apache
> > > > >> > > > > Apex
> > > > >> > > > > > Malhar library so that it is easy to use them in batch
> use
> > > > >> cases.
> > > > >> > > > > > Naturally, this would be applicable for only a subset of
> > > > >> operators
> > > > >> > > like
> > > > >> > > > > > File, JDBC and NoSQL databases.
> > > > >> > > > > > For example, for a file based store, (say HDFS store),
> we
> > > > could
> > > > >> > have
> > > > >> > > > > > FileBatchInput and FileBatchOutput operators which allow
> > > easy
> > > > >> > > > integration
> > > > >> > > > > > into a batch application. These operators would be
> > extended
> > > > from
> > > > >> > > their
> > > > >> > > > > > existing implementations and would be "Batch Aware", in
> > that
> > > > >> they
> > > > >> > may
> > > > >> > > > > > understand the meaning of some specific control tuples
> > that
> > > > flow
> > > > >> > > > through
> > > > >> > > > > > the DAG. Start batch and end batch seem to be the
> obvious
> > > > >> > candidates
> > > > >> > > > that
> > > > >> > > > > > come to mind. On receipt of such control tuples, they
> may
> > > try
> > > > to
> > > > >> > > modify
> > > > >> > > > > the
> > > > >> > > > > > behavior of the operator - to reinitialize some metrics
> or
> > > > >> finalize
> > > > >> > > an
> > > > >> > > > > > output file for example.
> > > > >> > > > > >
> > > > >> > > > > > We can discuss the potential control tuples and actions
> in
> > > > >> detail,
> > > > >> > > but
> > > > >> > > > > > first I would like to understand the views of the
> > community
> > > > for
> > > > >> > this
> > > > >> > > > > > proposal.
> > > > >> > > > > >
> > > > >> > > > > > ~ Bhupesh
> > > > >> > > > > >
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by Amol Kekre <am...@datatorrent.com>.
Thomas,
I believe Bhupesh's proposal is to have a monotonically increasing
watermark with the filename as extra information. The usage of "file start" may
have caused confusion. I agree, we do not need an explicit "file start"
watermark. I am at a loss for names; maybe "start <something>" -> "end
<something>", and then a final "all done" watermark.

Thks
Amol


*Follow @amolhkekre*
*Join us at Apex Big Data World-San Jose
<http://www.apexbigdata.com/san-jose.html>, April 4, 2017!*

On Sat, Feb 18, 2017 at 8:54 AM, Thomas Weise <th...@apache.org> wrote:

> Hi Bhupesh,
>
> I think this needs a generic watermark concept that is independent of
> source and destination and can be understood by intermediate
> transformations. File names don't meet this criteria.
>
> One possible approach is to have a monotonic increasing file sequence
> (instead of time, if it is not applicable) that can be mapped to watermark.
> You can still tag on the file name to the control tuple as extra
> information so that a file output operator that understands it can do
> whatever it wants with it. But it should also work without it, let's say
> when we write the output to the console.
>
> The key here is that you can demonstrate that an intermediate stateful
> transformation will work. I would suggest to try wordcount per input file
> with the window operator that emits the counts at file boundary, without
> knowing anything about files.
>
> Thanks,
> Thomas
>
>
> On Sat, Feb 18, 2017 at 8:04 AM, Bhupesh Chawda <bh...@datatorrent.com>
> wrote:
>
> > Hi Thomas,
> >
> > For an input operator which is supposed to generate watermarks for
> > downstream operators, I can think about the following watermarks that the
> > operator can emit:
> > 1. Time based watermarks (the high watermark / low watermark)
> > 2. Number of tuple based watermarks (Every n tuples)
> > 3. File based watermarks (Start file, end file)
> > 4. Final watermark
> >
> > File based watermarks seem to be applicable for batch (file based) as
> well,
> > and hence I thought of looking at these first. Does this seem to be in
> line
> > with the thought process?
> >
> > ~ Bhupesh
> >
> >
> >
> > _______________________________________________________
> >
> > Bhupesh Chawda
> >
> > Software Engineer
> >
> > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> >
> > www.datatorrent.com  |  apex.apache.org
> >
> >
> >
> > On Thu, Feb 16, 2017 at 10:37 AM, Thomas Weise <th...@apache.org> wrote:
> >
> > > I don't think this should be designed based on a simplistic file
> > > input-output scenario. It would be good to include a stateful
> > > transformation based on event time.
> > >
> > > More complex pipelines contain stateful transformations that depend on
> > > windowing and watermarks. I think we need a watermark concept that is
> > based
> > > on progress in event time (or other monotonic increasing sequence) that
> > > other operators can generically work with.
> > >
> > > Note that even file input in many cases can produce time based
> > watermarks,
> > > for example when you read part files that are bound by event time.
> > >
> > > Thanks,
> > > Thomas
> > >
> > >
> > > On Wed, Feb 15, 2017 at 4:02 AM, Bhupesh Chawda <
> bhupesh@datatorrent.com
> > >
> > > wrote:
> > >
> > > > For better understanding the use case for control tuples in batch, ​I
> > am
> > > > creating a prototype for a batch application using File Input and
> File
> > > > Output operators.
> > > >
> > > > To enable basic batch processing for File IO operators, I am
> proposing
> > > the
> > > > following changes to File input and output operators:
> > > > 1. File Input operator emits a watermark each time it opens and
> closes
> > a
> > > > file. These can be "start file" and "end file" watermarks which
> include
> > > the
> > > > corresponding file names. The "start file" tuple should be sent
> before
> > > any
> > > > of the data from that file flows.
> > > > 2. File Input operator can be configured to end the application
> after a
> > > > single or n scans of the directory (a batch). This is where the
> > operator
> > > > emits the final watermark (the end of application control tuple).
> This
> > > will
> > > > also shutdown the application.
> > > > 3. The File output operator handles these control tuples. "Start
> file"
> > > > initializes the file name for the incoming tuples. "End file"
> watermark
> > > > forces a finalize on that file.
> > > >
> > > > The user would be able to enable the operators to send only those
> > > > watermarks that are needed in the application. If none of the options
> > are
> > > > configured, the operators behave as in a streaming application.
> > > >
> > > > There are a few challenges in the implementation where the input
> > operator
> > > > is partitioned. In this case, the correlation between the start/end
> > for a
> > > > file and the data tuples for that file is lost. Hence we need to
> > maintain
> > > > the filename as part of each tuple in the pipeline.
> > > >
> > > > The "start file" and "end file" control tuples in this example are
> > > > temporary names for watermarks. We can have generic "start batch" /
> > "end
> > > > batch" tuples which could be used for other use cases as well. The
> > Final
> > > > watermark is common and serves the same purpose in each case.
> > > >
> > > > Please let me know your thoughts on this.
> > > >
> > > > ~ Bhupesh
> > > >
> > > >
> > > >
> > > > On Wed, Jan 18, 2017 at 12:22 AM, Bhupesh Chawda <
> > > bhupesh@datatorrent.com>
> > > > wrote:
> > > >
> > > > > Yes, this can be part of operator configuration. Given this, for a
> > user
> > > > to
> > > > > define a batch application, would mean configuring the connectors
> > > (mostly
> > > > > the input operator) in the application for the desired behavior.
> > > > Similarly,
> > > > > there can be other use cases that can be achieved other than batch.
> > > > >
> > > > > We may also need to take care of the following:
> > > > > 1. Make sure that the watermarks or control tuples are consistent
> > > across
> > > > > sources. Meaning an HDFS sink should be able to interpret the
> > watermark
> > > > > tuple sent out by, say, a JDBC source.
> > > > > 2. In addition to I/O connectors, we should also look at the need
> for
> > > > > processing operators to understand some of the control tuples /
> > > > watermarks.
> > > > > For example, we may want to reset the operator behavior on arrival
> of
> > > > some
> > > > > watermark tuple.
> > > > >
> > > > > ~ Bhupesh
> > > > >
> > > > > On Tue, Jan 17, 2017 at 9:59 PM, Thomas Weise <th...@apache.org>
> > wrote:
> > > > >
> > > > >> The HDFS source can operate in two modes, bounded or unbounded. If
> > you
> > > > >> scan
> > > > >> only once, then it should emit the final watermark after it is
> done.
> > > > >> Otherwise it would emit watermarks based on a policy (files names
> > > etc.).
> > > > >> The mechanism to generate the marks may depend on the type of
> source
> > > and
> > > > >> the user needs to be able to influence/configure it.
> > > > >>
> > > > >> Thomas
> > > > >>
> > > > >>
> > > > >> On Tue, Jan 17, 2017 at 5:03 AM, Bhupesh Chawda <
> > > > bhupesh@datatorrent.com>
> > > > >> wrote:
> > > > >>
> > > > >> > Hi Thomas,
> > > > >> >
> > > > >> > I am not sure that I completely understand your suggestion. Are
> > you
> > > > >> > suggesting to broaden the scope of the proposal to treat all
> > sources
> > > > as
> > > > >> > bounded as well as unbounded?
> > > > >> >
> > > > >> > In case of Apex, we treat all sources as unbounded sources. Even
> > > > bounded
> > > > >> > sources like HDFS file source is treated as unbounded by means
> of
> > > > >> scanning
> > > > >> > the input directory repeatedly.
> > > > >> >
> > > > >> > Let's consider HDFS file source for example:
> > > > >> > In this case, if we treat it as a bounded source, we can define
> > > hooks
> > > > >> which
> > > > >> > allows us to detect the end of the file and send the "final
> > > > watermark".
> > > > >> We
> > > > >> > could also consider HDFS file source as a streaming source and
> > > define
> > > > >> hooks
> > > > >> > which send watermarks based on different kinds of windows.
> > > > >> >
> > > > >> > Please correct me if I misunderstand.
> > > > >> >
> > > > >> > ~ Bhupesh
> > > > >> >
> > > > >> >
> > > > >> > On Mon, Jan 16, 2017 at 9:23 PM, Thomas Weise <th...@apache.org>
> > > wrote:
> > > > >> >
> > > > >> > > Bhupesh,
> > > > >> > >
> > > > >> > > Please see how that can be solved in a unified way using
> windows
> > > and
> > > > >> > > watermarks. It is bounded data vs. unbounded data. In Beam for
> > > > >> example,
> > > > >> > you
> > > > >> > > can use the "global window" and the final watermark to
> > accomplish
> > > > what
> > > > >> > you
> > > > >> > > are looking for. Batch is just a special case of streaming
> where
> > > the
> > > > >> > source
> > > > >> > > emits the final watermark.
> > > > >> > >
> > > > >> > > Thanks,
> > > > >> > > Thomas
> > > > >> > >
> > > > >> > >
> > > > >> > > On Mon, Jan 16, 2017 at 1:02 AM, Bhupesh Chawda <
> > > > >> bhupesh@datatorrent.com
> > > > >> > >
> > > > >> > > wrote:
> > > > >> > >
> > > > >> > > > Yes, if the user needs to develop a batch application, then
> > > batch
> > > > >> aware
> > > > >> > > > operators need to be used in the application.
> > > > >> > > > The nature of the application is mostly controlled by the
> > input
> > > > and
> > > > >> the
> > > > >> > > > output operators used in the application.
> > > > >> > > >
> > > > >> > > > For example, consider an application which needs to filter
> > > records
> > > > >> in a
> > > > >> > > > input file and store the filtered records in another file.
> The
> > > > >> nature
> > > > >> > of
> > > > >> > > > this app is to end once the entire file is processed.
> > Following
> > > > >> things
> > > > >> > > are
> > > > >> > > > expected of the application:
> > > > >> > > >
> > > > >> > > >    1. Once the input data is over, finalize the output file
> > from
> > > > >> .tmp
> > > > >> > > >    files. - Responsibility of output operator
> > > > >> > > >    2. End the application, once the data is read and
> > processed -
> > > > >> > > >    Responsibility of input operator
> > > > >> > > >
> > > > >> > > > These functions are essential to allow the user to do higher
> > > level
> > > > >> > > > operations like scheduling or running a workflow of batch
> > > > >> applications.
> > > > >> > > >
> > > > >> > > > I am not sure about intermediate (processing) operators, as
> > > there
> > > > >> is no
> > > > >> > > > change in their functionality for batch use cases. Perhaps,
> > > > allowing
> > > > >> > > > running multiple batches in a single application may require
> > > > similar
> > > > >> > > > changes in processing operators as well.
> > > > >> > > >
> > > > >> > > > ~ Bhupesh
> > > > >> > > >
> > > > >> > > > On Mon, Jan 16, 2017 at 2:19 PM, Priyanka Gugale <
> > > > priyag@apache.org
> > > > >> >
> > > > >> > > > wrote:
> > > > >> > > >
> > > > >> > > > > Will it make an impression on user that, if he has a batch
> > > > >> usecase he
> > > > >> > > has
> > > > >> > > > > to use batch aware operators only? If so, is that what we
> > > > expect?
> > > > >> I
> > > > >> > am
> > > > >> > > > not
> > > > >> > > > > aware of how do we implement batch scenario so this might
> > be a
> > > > >> basic
> > > > >> > > > > question.
> > > > >> > > > >
> > > > >> > > > > -Priyanka
> > > > >> > > > >
> > > > >> > > > > On Mon, Jan 16, 2017 at 12:02 PM, Bhupesh Chawda <
> > > > >> > > > bhupesh@datatorrent.com>
> > > > >> > > > > wrote:
> > > > >> > > > >
> > > > >> > > > > > Hi All,
> > > > >> > > > > >
> > > > >> > > > > > While design / implementation for custom control tuples
> is
> > > > >> > ongoing, I
> > > > >> > > > > > thought it would be a good idea to consider its
> usefulness
> > > in
> > > > >> one
> > > > >> > of
> > > > >> > > > the
> > > > >> > > > > > use cases -  batch applications.
> > > > >> > > > > >
> > > > >> > > > > > This is a proposal to adapt / extend existing operators
> in
> > > the
> > > > >> > Apache
> > > > >> > > > > Apex
> > > > >> > > > > > Malhar library so that it is easy to use them in batch
> use
> > > > >> cases.
> > > > >> > > > > > Naturally, this would be applicable for only a subset of
> > > > >> operators
> > > > >> > > like
> > > > >> > > > > > File, JDBC and NoSQL databases.
> > > > >> > > > > > For example, for a file based store, (say HDFS store),
> we
> > > > could
> > > > >> > have
> > > > >> > > > > > FileBatchInput and FileBatchOutput operators which allow
> > > easy
> > > > >> > > > integration
> > > > >> > > > > > into a batch application. These operators would be
> > extended
> > > > from
> > > > >> > > their
> > > > >> > > > > > existing implementations and would be "Batch Aware", in
> > that
> > > > >> they
> > > > >> > may
> > > > >> > > > > > understand the meaning of some specific control tuples
> > that
> > > > flow
> > > > >> > > > through
> > > > >> > > > > > the DAG. Start batch and end batch seem to be the
> obvious
> > > > >> > candidates
> > > > >> > > > that
> > > > >> > > > > > come to mind. On receipt of such control tuples, they
> may
> > > try
> > > > to
> > > > >> > > modify
> > > > >> > > > > the
> > > > >> > > > > > behavior of the operator - to reinitialize some metrics
> or
> > > > >> finalize
> > > > >> > > an
> > > > >> > > > > > output file for example.
> > > > >> > > > > >
> > > > >> > > > > > We can discuss the potential control tuples and actions
> in
> > > > >> detail,
> > > > >> > > but
> > > > >> > > > > > first I would like to understand the views of the
> > community
> > > > for
> > > > >> > this
> > > > >> > > > > > proposal.
> > > > >> > > > > >
> > > > >> > > > > > ~ Bhupesh
> > > > >> > > > > >
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by Thomas Weise <th...@apache.org>.
Hi Bhupesh,

I think this needs a generic watermark concept that is independent of
source and destination and can be understood by intermediate
transformations. File names don't meet this criterion.

One possible approach is to have a monotonically increasing file sequence
(instead of time, if time is not applicable) that can be mapped to a watermark.
You can still attach the file name to the control tuple as extra
information so that a file output operator that understands it can do
whatever it wants with it. But it should also work without it, say
when we write the output to the console.

The key here is that you can demonstrate that an intermediate stateful
transformation will work. I would suggest trying word count per input file
with a window operator that emits the counts at the file boundary, without
knowing anything about files.
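
Something along these lines is what I mean for the intermediate operator
(sketch only; the onControlWatermark() callback is hypothetical, since
delivery of custom control tuples does not exist yet):

    import java.util.HashMap;
    import java.util.Map;

    import com.datatorrent.api.DefaultInputPort;
    import com.datatorrent.api.DefaultOutputPort;
    import com.datatorrent.common.util.BaseOperator;

    // Counts words per watermark "window"; it never looks at file names.
    public class WordCountPerWatermark extends BaseOperator
    {
      private final Map<String, Long> counts = new HashMap<String, Long>();

      public final transient DefaultOutputPort<Map<String, Long>> output =
          new DefaultOutputPort<Map<String, Long>>();

      public final transient DefaultInputPort<String> input = new DefaultInputPort<String>()
      {
        @Override
        public void process(String word)
        {
          Long c = counts.get(word);
          counts.put(word, c == null ? 1L : c + 1);
        }
      };

      // Hypothetical callback, invoked when the generic watermark advances
      // (a file boundary upstream, for example). Only the sequence is visible here.
      public void onControlWatermark(long sequence)
      {
        output.emit(new HashMap<String, Long>(counts));
        counts.clear();
      }
    }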

Thanks,
Thomas


On Sat, Feb 18, 2017 at 8:04 AM, Bhupesh Chawda <bh...@datatorrent.com>
wrote:

> Hi Thomas,
>
> For an input operator which is supposed to generate watermarks for
> downstream operators, I can think about the following watermarks that the
> operator can emit:
> 1. Time based watermarks (the high watermark / low watermark)
> 2. Number of tuple based watermarks (Every n tuples)
> 3. File based watermarks (Start file, end file)
> 4. Final watermark
>
> File based watermarks seem to be applicable for batch (file based) as well,
> and hence I thought of looking at these first. Does this seem to be in line
> with the thought process?
>
> ~ Bhupesh
>
>
>
> _______________________________________________________
>
> Bhupesh Chawda
>
> Software Engineer
>
> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>
> www.datatorrent.com  |  apex.apache.org
>
>
>
> On Thu, Feb 16, 2017 at 10:37 AM, Thomas Weise <th...@apache.org> wrote:
>
> > I don't think this should be designed based on a simplistic file
> > input-output scenario. It would be good to include a stateful
> > transformation based on event time.
> >
> > More complex pipelines contain stateful transformations that depend on
> > windowing and watermarks. I think we need a watermark concept that is
> based
> > on progress in event time (or other monotonic increasing sequence) that
> > other operators can generically work with.
> >
> > Note that even file input in many cases can produce time based
> watermarks,
> > for example when you read part files that are bound by event time.
> >
> > Thanks,
> > Thomas
> >
> >
> > On Wed, Feb 15, 2017 at 4:02 AM, Bhupesh Chawda <bhupesh@datatorrent.com
> >
> > wrote:
> >
> > > For better understanding the use case for control tuples in batch, ​I
> am
> > > creating a prototype for a batch application using File Input and File
> > > Output operators.
> > >
> > > To enable basic batch processing for File IO operators, I am proposing
> > the
> > > following changes to File input and output operators:
> > > 1. File Input operator emits a watermark each time it opens and closes
> a
> > > file. These can be "start file" and "end file" watermarks which include
> > the
> > > corresponding file names. The "start file" tuple should be sent before
> > any
> > > of the data from that file flows.
> > > 2. File Input operator can be configured to end the application after a
> > > single or n scans of the directory (a batch). This is where the
> operator
> > > emits the final watermark (the end of application control tuple). This
> > will
> > > also shutdown the application.
> > > 3. The File output operator handles these control tuples. "Start file"
> > > initializes the file name for the incoming tuples. "End file" watermark
> > > forces a finalize on that file.
> > >
> > > The user would be able to enable the operators to send only those
> > > watermarks that are needed in the application. If none of the options
> are
> > > configured, the operators behave as in a streaming application.
> > >
> > > There are a few challenges in the implementation where the input
> operator
> > > is partitioned. In this case, the correlation between the start/end
> for a
> > > file and the data tuples for that file is lost. Hence we need to
> maintain
> > > the filename as part of each tuple in the pipeline.
> > >
> > > The "start file" and "end file" control tuples in this example are
> > > temporary names for watermarks. We can have generic "start batch" /
> "end
> > > batch" tuples which could be used for other use cases as well. The
> Final
> > > watermark is common and serves the same purpose in each case.
> > >
> > > Please let me know your thoughts on this.
> > >
> > > ~ Bhupesh
> > >
> > >
> > >
> > > On Wed, Jan 18, 2017 at 12:22 AM, Bhupesh Chawda <
> > bhupesh@datatorrent.com>
> > > wrote:
> > >
> > > > Yes, this can be part of operator configuration. Given this, for a
> user
> > > to
> > > > define a batch application, would mean configuring the connectors
> > (mostly
> > > > the input operator) in the application for the desired behavior.
> > > Similarly,
> > > > there can be other use cases that can be achieved other than batch.
> > > >
> > > > We may also need to take care of the following:
> > > > 1. Make sure that the watermarks or control tuples are consistent
> > across
> > > > sources. Meaning an HDFS sink should be able to interpret the
> watermark
> > > > tuple sent out by, say, a JDBC source.
> > > > 2. In addition to I/O connectors, we should also look at the need for
> > > > processing operators to understand some of the control tuples /
> > > watermarks.
> > > > For example, we may want to reset the operator behavior on arrival of
> > > some
> > > > watermark tuple.
> > > >
> > > > ~ Bhupesh
> > > >
> > > > On Tue, Jan 17, 2017 at 9:59 PM, Thomas Weise <th...@apache.org>
> wrote:
> > > >
> > > >> The HDFS source can operate in two modes, bounded or unbounded. If
> you
> > > >> scan
> > > >> only once, then it should emit the final watermark after it is done.
> > > >> Otherwise it would emit watermarks based on a policy (files names
> > etc.).
> > > >> The mechanism to generate the marks may depend on the type of source
> > and
> > > >> the user needs to be able to influence/configure it.
> > > >>
> > > >> Thomas
> > > >>
> > > >>
> > > >> On Tue, Jan 17, 2017 at 5:03 AM, Bhupesh Chawda <
> > > bhupesh@datatorrent.com>
> > > >> wrote:
> > > >>
> > > >> > Hi Thomas,
> > > >> >
> > > >> > I am not sure that I completely understand your suggestion. Are
> you
> > > >> > suggesting to broaden the scope of the proposal to treat all
> sources
> > > as
> > > >> > bounded as well as unbounded?
> > > >> >
> > > >> > In case of Apex, we treat all sources as unbounded sources. Even
> > > bounded
> > > >> > sources like HDFS file source is treated as unbounded by means of
> > > >> scanning
> > > >> > the input directory repeatedly.
> > > >> >
> > > >> > Let's consider HDFS file source for example:
> > > >> > In this case, if we treat it as a bounded source, we can define
> > hooks
> > > >> which
> > > >> > allows us to detect the end of the file and send the "final
> > > watermark".
> > > >> We
> > > >> > could also consider HDFS file source as a streaming source and
> > define
> > > >> hooks
> > > >> > which send watermarks based on different kinds of windows.
> > > >> >
> > > >> > Please correct me if I misunderstand.
> > > >> >
> > > >> > ~ Bhupesh
> > > >> >
> > > >> >
> > > >> > On Mon, Jan 16, 2017 at 9:23 PM, Thomas Weise <th...@apache.org>
> > wrote:
> > > >> >
> > > >> > > Bhupesh,
> > > >> > >
> > > >> > > Please see how that can be solved in a unified way using windows
> > and
> > > >> > > watermarks. It is bounded data vs. unbounded data. In Beam for
> > > >> example,
> > > >> > you
> > > >> > > can use the "global window" and the final watermark to
> accomplish
> > > what
> > > >> > you
> > > >> > > are looking for. Batch is just a special case of streaming where
> > the
> > > >> > source
> > > >> > > emits the final watermark.
> > > >> > >
> > > >> > > Thanks,
> > > >> > > Thomas
> > > >> > >
> > > >> > >
> > > >> > > On Mon, Jan 16, 2017 at 1:02 AM, Bhupesh Chawda <
> > > >> bhupesh@datatorrent.com
> > > >> > >
> > > >> > > wrote:
> > > >> > >
> > > >> > > > Yes, if the user needs to develop a batch application, then
> > batch
> > > >> aware
> > > >> > > > operators need to be used in the application.
> > > >> > > > The nature of the application is mostly controlled by the
> input
> > > and
> > > >> the
> > > >> > > > output operators used in the application.
> > > >> > > >
> > > >> > > > For example, consider an application which needs to filter
> > records
> > > >> in a
> > > >> > > > input file and store the filtered records in another file. The
> > > >> nature
> > > >> > of
> > > >> > > > this app is to end once the entire file is processed.
> Following
> > > >> things
> > > >> > > are
> > > >> > > > expected of the application:
> > > >> > > >
> > > >> > > >    1. Once the input data is over, finalize the output file
> from
> > > >> .tmp
> > > >> > > >    files. - Responsibility of output operator
> > > >> > > >    2. End the application, once the data is read and
> processed -
> > > >> > > >    Responsibility of input operator
> > > >> > > >
> > > >> > > > These functions are essential to allow the user to do higher
> > level
> > > >> > > > operations like scheduling or running a workflow of batch
> > > >> applications.
> > > >> > > >
> > > >> > > > I am not sure about intermediate (processing) operators, as
> > there
> > > >> is no
> > > >> > > > change in their functionality for batch use cases. Perhaps,
> > > allowing
> > > >> > > > running multiple batches in a single application may require
> > > similar
> > > >> > > > changes in processing operators as well.
> > > >> > > >
> > > >> > > > ~ Bhupesh
> > > >> > > >
> > > >> > > > On Mon, Jan 16, 2017 at 2:19 PM, Priyanka Gugale <
> > > priyag@apache.org
> > > >> >
> > > >> > > > wrote:
> > > >> > > >
> > > >> > > > > Will it make an impression on user that, if he has a batch
> > > >> usecase he
> > > >> > > has
> > > >> > > > > to use batch aware operators only? If so, is that what we
> > > expect?
> > > >> I
> > > >> > am
> > > >> > > > not
> > > >> > > > > aware of how do we implement batch scenario so this might
> be a
> > > >> basic
> > > >> > > > > question.
> > > >> > > > >
> > > >> > > > > -Priyanka
> > > >> > > > >
> > > >> > > > > On Mon, Jan 16, 2017 at 12:02 PM, Bhupesh Chawda <
> > > >> > > > bhupesh@datatorrent.com>
> > > >> > > > > wrote:
> > > >> > > > >
> > > >> > > > > > Hi All,
> > > >> > > > > >
> > > >> > > > > > While design / implementation for custom control tuples is
> > > >> > ongoing, I
> > > >> > > > > > thought it would be a good idea to consider its usefulness
> > in
> > > >> one
> > > >> > of
> > > >> > > > the
> > > >> > > > > > use cases -  batch applications.
> > > >> > > > > >
> > > >> > > > > > This is a proposal to adapt / extend existing operators in
> > the
> > > >> > Apache
> > > >> > > > > Apex
> > > >> > > > > > Malhar library so that it is easy to use them in batch use
> > > >> cases.
> > > >> > > > > > Naturally, this would be applicable for only a subset of
> > > >> operators
> > > >> > > like
> > > >> > > > > > File, JDBC and NoSQL databases.
> > > >> > > > > > For example, for a file based store, (say HDFS store), we
> > > could
> > > >> > have
> > > >> > > > > > FileBatchInput and FileBatchOutput operators which allow
> > easy
> > > >> > > > integration
> > > >> > > > > > into a batch application. These operators would be
> extended
> > > from
> > > >> > > their
> > > >> > > > > > existing implementations and would be "Batch Aware", in
> that
> > > >> they
> > > >> > may
> > > >> > > > > > understand the meaning of some specific control tuples
> that
> > > flow
> > > >> > > > through
> > > >> > > > > > the DAG. Start batch and end batch seem to be the obvious
> > > >> > candidates
> > > >> > > > that
> > > >> > > > > > come to mind. On receipt of such control tuples, they may
> > try
> > > to
> > > >> > > modify
> > > >> > > > > the
> > > >> > > > > > behavior of the operator - to reinitialize some metrics or
> > > >> finalize
> > > >> > > an
> > > >> > > > > > output file for example.
> > > >> > > > > >
> > > >> > > > > > We can discuss the potential control tuples and actions in
> > > >> detail,
> > > >> > > but
> > > >> > > > > > first I would like to understand the views of the
> community
> > > for
> > > >> > this
> > > >> > > > > > proposal.
> > > >> > > > > >
> > > >> > > > > > ~ Bhupesh
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by Bhupesh Chawda <bh...@datatorrent.com>.
Hi Thomas,

For an input operator which is supposed to generate watermarks for
downstream operators, I can think of the following watermarks that the
operator could emit:
1. Time based watermarks (the high watermark / low watermark)
2. Tuple-count based watermarks (every n tuples)
3. File based watermarks (start file, end file)
4. Final watermark

File based watermarks seem to be applicable for batch (file based) use cases as
well, and hence I thought of looking at these first (rough sketch below). Does
this seem to be in line with the thought process?
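
To make the list concrete, I am imagining something like the following on the
batch aware input operators (the enum and the setter in the comment are only
illustrative; nothing like this exists in Malhar today):

    // Illustrative only: the kinds of watermarks an input operator could be
    // configured to emit.
    public enum WatermarkKind
    {
      TIME_BASED,      // high / low watermark derived from (event) time
      TUPLE_COUNT,     // every n tuples
      FILE_BOUNDARY,   // start file / end file
      FINAL            // bounded input is done
    }

    // e.g. on a hypothetical batch-aware file input operator:
    // fileInput.setWatermarkKinds(EnumSet.of(WatermarkKind.FILE_BOUNDARY, WatermarkKind.FINAL));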

~ Bhupesh



_______________________________________________________

Bhupesh Chawda

Software Engineer

E: bhupesh@datatorrent.com | Twitter: @bhupeshsc

www.datatorrent.com  |  apex.apache.org



On Thu, Feb 16, 2017 at 10:37 AM, Thomas Weise <th...@apache.org> wrote:

> I don't think this should be designed based on a simplistic file
> input-output scenario. It would be good to include a stateful
> transformation based on event time.
>
> More complex pipelines contain stateful transformations that depend on
> windowing and watermarks. I think we need a watermark concept that is based
> on progress in event time (or other monotonic increasing sequence) that
> other operators can generically work with.
>
> Note that even file input in many cases can produce time based watermarks,
> for example when you read part files that are bound by event time.
>
> Thanks,
> Thomas
>
>
> On Wed, Feb 15, 2017 at 4:02 AM, Bhupesh Chawda <bh...@datatorrent.com>
> wrote:
>
> > For better understanding the use case for control tuples in batch, ​I am
> > creating a prototype for a batch application using File Input and File
> > Output operators.
> >
> > To enable basic batch processing for File IO operators, I am proposing
> the
> > following changes to File input and output operators:
> > 1. File Input operator emits a watermark each time it opens and closes a
> > file. These can be "start file" and "end file" watermarks which include
> the
> > corresponding file names. The "start file" tuple should be sent before
> any
> > of the data from that file flows.
> > 2. File Input operator can be configured to end the application after a
> > single or n scans of the directory (a batch). This is where the operator
> > emits the final watermark (the end of application control tuple). This
> will
> > also shutdown the application.
> > 3. The File output operator handles these control tuples. "Start file"
> > initializes the file name for the incoming tuples. "End file" watermark
> > forces a finalize on that file.
> >
> > The user would be able to enable the operators to send only those
> > watermarks that are needed in the application. If none of the options are
> > configured, the operators behave as in a streaming application.
> >
> > There are a few challenges in the implementation where the input operator
> > is partitioned. In this case, the correlation between the start/end for a
> > file and the data tuples for that file is lost. Hence we need to maintain
> > the filename as part of each tuple in the pipeline.
> >
> > The "start file" and "end file" control tuples in this example are
> > temporary names for watermarks. We can have generic "start batch" / "end
> > batch" tuples which could be used for other use cases as well. The Final
> > watermark is common and serves the same purpose in each case.
> >
> > Please let me know your thoughts on this.
> >
> > ~ Bhupesh
> >
> >
> >
> > On Wed, Jan 18, 2017 at 12:22 AM, Bhupesh Chawda <
> bhupesh@datatorrent.com>
> > wrote:
> >
> > > Yes, this can be part of operator configuration. Given this, for a user
> > to
> > > define a batch application, would mean configuring the connectors
> (mostly
> > > the input operator) in the application for the desired behavior.
> > Similarly,
> > > there can be other use cases that can be achieved other than batch.
> > >
> > > We may also need to take care of the following:
> > > 1. Make sure that the watermarks or control tuples are consistent
> across
> > > sources. Meaning an HDFS sink should be able to interpret the watermark
> > > tuple sent out by, say, a JDBC source.
> > > 2. In addition to I/O connectors, we should also look at the need for
> > > processing operators to understand some of the control tuples /
> > watermarks.
> > > For example, we may want to reset the operator behavior on arrival of
> > some
> > > watermark tuple.
> > >
> > > ~ Bhupesh
> > >
> > > On Tue, Jan 17, 2017 at 9:59 PM, Thomas Weise <th...@apache.org> wrote:
> > >
> > >> The HDFS source can operate in two modes, bounded or unbounded. If you
> > >> scan
> > >> only once, then it should emit the final watermark after it is done.
> > >> Otherwise it would emit watermarks based on a policy (files names
> etc.).
> > >> The mechanism to generate the marks may depend on the type of source
> and
> > >> the user needs to be able to influence/configure it.
> > >>
> > >> Thomas
> > >>
> > >>
> > >> On Tue, Jan 17, 2017 at 5:03 AM, Bhupesh Chawda <
> > bhupesh@datatorrent.com>
> > >> wrote:
> > >>
> > >> > Hi Thomas,
> > >> >
> > >> > I am not sure that I completely understand your suggestion. Are you
> > >> > suggesting to broaden the scope of the proposal to treat all sources
> > as
> > >> > bounded as well as unbounded?
> > >> >
> > >> > In case of Apex, we treat all sources as unbounded sources. Even
> > bounded
> > >> > sources like HDFS file source is treated as unbounded by means of
> > >> scanning
> > >> > the input directory repeatedly.
> > >> >
> > >> > Let's consider HDFS file source for example:
> > >> > In this case, if we treat it as a bounded source, we can define
> hooks
> > >> which
> > >> > allows us to detect the end of the file and send the "final
> > watermark".
> > >> We
> > >> > could also consider HDFS file source as a streaming source and
> define
> > >> hooks
> > >> > which send watermarks based on different kinds of windows.
> > >> >
> > >> > Please correct me if I misunderstand.
> > >> >
> > >> > ~ Bhupesh
> > >> >
> > >> >
> > >> > On Mon, Jan 16, 2017 at 9:23 PM, Thomas Weise <th...@apache.org>
> wrote:
> > >> >
> > >> > > Bhupesh,
> > >> > >
> > >> > > Please see how that can be solved in a unified way using windows
> and
> > >> > > watermarks. It is bounded data vs. unbounded data. In Beam for
> > >> example,
> > >> > you
> > >> > > can use the "global window" and the final watermark to accomplish
> > what
> > >> > you
> > >> > > are looking for. Batch is just a special case of streaming where
> the
> > >> > source
> > >> > > emits the final watermark.
> > >> > >
> > >> > > Thanks,
> > >> > > Thomas
> > >> > >
> > >> > >
> > >> > > On Mon, Jan 16, 2017 at 1:02 AM, Bhupesh Chawda <
> > >> bhupesh@datatorrent.com
> > >> > >
> > >> > > wrote:
> > >> > >
> > >> > > > Yes, if the user needs to develop a batch application, then
> batch
> > >> aware
> > >> > > > operators need to be used in the application.
> > >> > > > The nature of the application is mostly controlled by the input
> > and
> > >> the
> > >> > > > output operators used in the application.
> > >> > > >
> > >> > > > For example, consider an application which needs to filter
> records
> > >> in a
> > >> > > > input file and store the filtered records in another file. The
> > >> nature
> > >> > of
> > >> > > > this app is to end once the entire file is processed. Following
> > >> things
> > >> > > are
> > >> > > > expected of the application:
> > >> > > >
> > >> > > >    1. Once the input data is over, finalize the output file from
> > >> .tmp
> > >> > > >    files. - Responsibility of output operator
> > >> > > >    2. End the application, once the data is read and processed -
> > >> > > >    Responsibility of input operator
> > >> > > >
> > >> > > > These functions are essential to allow the user to do higher
> level
> > >> > > > operations like scheduling or running a workflow of batch
> > >> applications.
> > >> > > >
> > >> > > > I am not sure about intermediate (processing) operators, as
> there
> > >> is no
> > >> > > > change in their functionality for batch use cases. Perhaps,
> > allowing
> > >> > > > running multiple batches in a single application may require
> > similar
> > >> > > > changes in processing operators as well.
> > >> > > >
> > >> > > > ~ Bhupesh
> > >> > > >
> > >> > > > On Mon, Jan 16, 2017 at 2:19 PM, Priyanka Gugale <
> > priyag@apache.org
> > >> >
> > >> > > > wrote:
> > >> > > >
> > >> > > > > Will it make an impression on user that, if he has a batch
> > >> usecase he
> > >> > > has
> > >> > > > > to use batch aware operators only? If so, is that what we
> > expect?
> > >> I
> > >> > am
> > >> > > > not
> > >> > > > > aware of how do we implement batch scenario so this might be a
> > >> basic
> > >> > > > > question.
> > >> > > > >
> > >> > > > > -Priyanka
> > >> > > > >
> > >> > > > > On Mon, Jan 16, 2017 at 12:02 PM, Bhupesh Chawda <
> > >> > > > bhupesh@datatorrent.com>
> > >> > > > > wrote:
> > >> > > > >
> > >> > > > > > Hi All,
> > >> > > > > >
> > >> > > > > > While design / implementation for custom control tuples is
> > >> > ongoing, I
> > >> > > > > > thought it would be a good idea to consider its usefulness
> in
> > >> one
> > >> > of
> > >> > > > the
> > >> > > > > > use cases -  batch applications.
> > >> > > > > >
> > >> > > > > > This is a proposal to adapt / extend existing operators in
> the
> > >> > Apache
> > >> > > > > Apex
> > >> > > > > > Malhar library so that it is easy to use them in batch use
> > >> cases.
> > >> > > > > > Naturally, this would be applicable for only a subset of
> > >> operators
> > >> > > like
> > >> > > > > > File, JDBC and NoSQL databases.
> > >> > > > > > For example, for a file based store, (say HDFS store), we
> > could
> > >> > have
> > >> > > > > > FileBatchInput and FileBatchOutput operators which allow
> easy
> > >> > > > integration
> > >> > > > > > into a batch application. These operators would be extended
> > from
> > >> > > their
> > >> > > > > > existing implementations and would be "Batch Aware", in that
> > >> they
> > >> > may
> > >> > > > > > understand the meaning of some specific control tuples that
> > flow
> > >> > > > through
> > >> > > > > > the DAG. Start batch and end batch seem to be the obvious
> > >> > candidates
> > >> > > > that
> > >> > > > > > come to mind. On receipt of such control tuples, they may
> try
> > to
> > >> > > modify
> > >> > > > > the
> > >> > > > > > behavior of the operator - to reinitialize some metrics or
> > >> finalize
> > >> > > an
> > >> > > > > > output file for example.
> > >> > > > > >
> > >> > > > > > We can discuss the potential control tuples and actions in
> > >> detail,
> > >> > > but
> > >> > > > > > first I would like to understand the views of the community
> > for
> > >> > this
> > >> > > > > > proposal.
> > >> > > > > >
> > >> > > > > > ~ Bhupesh
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by Bhupesh Chawda <bh...@datatorrent.com>.
Yes Amol,
Watermarks, in general, solve a different issue and we should not mix them
with data association. If needed, data association can be handled by the
user at the operator / application layer. The engine should not worry about
it.

Regarding your point on event time based watermarks, I think they too can
be handled at the operator level, where the input operators can play a role.
We should keep the event-time awareness logic out of the engine.
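
For example (sketch only; the separate control port and the TimeWatermark
class are made-up names, just to show where the event time logic would live):

    import com.datatorrent.api.DefaultOutputPort;
    import com.datatorrent.api.InputOperator;
    import com.datatorrent.common.util.BaseOperator;

    // Sketch: the input operator tracks event time from its own records and
    // emits the time watermark itself, so the engine stays event-time agnostic.
    public abstract class EventTimeAwareInput<T> extends BaseOperator implements InputOperator
    {
      public static class TimeWatermark
      {
        public final long eventTime;
        public TimeWatermark(long eventTime) { this.eventTime = eventTime; }
      }

      public final transient DefaultOutputPort<T> data = new DefaultOutputPort<T>();
      // made-up control port; there is no separate control channel today
      public final transient DefaultOutputPort<TimeWatermark> control =
          new DefaultOutputPort<TimeWatermark>();

      private long maxEventTime = Long.MIN_VALUE;

      // concrete sources call this for every record they read from the source
      protected void emitRecord(T record, long eventTime)
      {
        maxEventTime = Math.max(maxEventTime, eventTime);
        data.emit(record);
      }

      @Override
      public void endWindow()
      {
        if (maxEventTime != Long.MIN_VALUE) {
          control.emit(new TimeWatermark(maxEventTime));  // operator-generated watermark
        }
      }
    }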

~ Bhupesh


_______________________________________________________

Bhupesh Chawda

Software Engineer

E: bhupesh@datatorrent.com | Twitter: @bhupeshsc

www.datatorrent.com  |  apex.apache.org



On Sat, Feb 18, 2017 at 10:44 PM, Amol Kekre <am...@datatorrent.com> wrote:

> Bhupesh,
> That is true, but in reality watermarks do not solve a design problem in
> the DAG where data is getting mixed up. All the watermarks do is to convey
> "start" and "end" within the stream. The start and end control tuples
> should have the physical operator id, + a monotonically increasing number.
> Both these are inserted by engine and are not user supplied, i.e. engine
> takes up the guarantees of idenfying these watermarks. This concept is same
> as our current start-window and end-window (which has worked well).
>
> Today Apex does not have watermarks, and lets say I am sending "start
> something", "end something" through another port. I will still need to not
> mix data in a transform operator down stream. That problem exist today and
> will continue. Putting filename on every tuple is too much of a performance
> hit. Secondly a lot of batch operations are not file related (i.e. file to
> file), they are collection of "data" split into part files (due to
> performance reason) and grouping/dimensions/event time/... are done based
> on internals of the file. In case of file to file copy, user should be
> expected to route the data properly (parallel partition?).
>
> Event-time based watermarks needs a separate thread. I am certain that
> engine will need to be event-time aware, and will need to take this into
> account for proper layout.
>
> Thks
> Amol
>
>
> *Follow @amolhkekre*
> *Join us at Apex Big Data World-San Jose
> <http://www.apexbigdata.com/san-jose.html>, April 4, 2017!*
>
> On Sat, Feb 18, 2017 at 8:17 AM, Bhupesh Chawda <bh...@datatorrent.com>
> wrote:
>
> > Amol, agreed. We can address event time based watermarks once file batch
> is
> > done.
> > Regarding, file batch support: by allowing to partition an input (file)
> > operator, we are implicitly mixing multiple batches. Even if the user
> does
> > not do any transformations, we should be able to write the correct data
> to
> > right files at the destination.
> >
> > ~ Bhupesh
> >
> >
> > _______________________________________________________
> >
> > Bhupesh Chawda
> >
> > Software Engineer
> >
> > E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
> >
> > www.datatorrent.com  |  apex.apache.org
> >
> >
> >
> > On Sat, Feb 18, 2017 at 12:26 PM, Amol Kekre <am...@datatorrent.com>
> wrote:
> >
> > > Thomas,
> > > The watermarks we have in Apex (start-window and end-window) are
> working
> > > good. It is fine to take a look at event time, but basic file I/O does
> > not
> > > need anything more than start and end. Lets say they are
> start-something,
> > > end-something. The main difference here is that the tuples are user
> > > generated, other than that they should follow similar principle as
> > > start-window & end-window. The commonality includes
> > > - dedup of start-st and end-st
> > > - First start-st passes through
> > > - Last end-st passes through
> > > - Engine indentifies them with chronologically increasing number and
> > source
> > >
> > > The only main difference is that an emit of these is user controlled
> and
> > > cannot be guaranteed to happen as such. BTW, part files are rarely done
> > > based on event time, they are almost always split by size. A vast
> > majority
> > > of batch cases have hourly files bound by arrival time and not event
> > time.
> > >
> > > Bhupesh,
> > > Attaching file names to tuples does not scale. If user mixes two
> batches,
> > > then the user would need to handle the transformations. Post file batch
> > > support, we should look at event time support. Unlike file based
> batches,
> > > event time will overlap each other, i.e. at a given time at least two
> (if
> > > not more) event times will be active. I think the engine will need to
> be
> > > event time aware.
> > >
> > > Thks
> > > Amol
> > >
> > >
> > >
> > > *Follow @amolhkekre*
> > > *Join us at Apex Big Data World-San Jose
> > > <http://www.apexbigdata.com/san-jose.html>, April 4, 2017!*
> > >
> > > On Wed, Feb 15, 2017 at 9:07 PM, Thomas Weise <th...@apache.org> wrote:
> > >
> > > > I don't think this should be designed based on a simplistic file
> > > > input-output scenario. It would be good to include a stateful
> > > > transformation based on event time.
> > > >
> > > > More complex pipelines contain stateful transformations that depend
> on
> > > > windowing and watermarks. I think we need a watermark concept that is
> > > based
> > > > on progress in event time (or other monotonic increasing sequence)
> that
> > > > other operators can generically work with.
> > > >
> > > > Note that even file input in many cases can produce time based
> > > watermarks,
> > > > for example when you read part files that are bound by event time.
> > > >
> > > > Thanks,
> > > > Thomas
> > > >
> > > >
> > > > On Wed, Feb 15, 2017 at 4:02 AM, Bhupesh Chawda <
> > bhupesh@datatorrent.com
> > > >
> > > > wrote:
> > > >
> > > > > For better understanding the use case for control tuples in batch,
> ​I
> > > am
> > > > > creating a prototype for a batch application using File Input and
> > File
> > > > > Output operators.
> > > > >
> > > > > To enable basic batch processing for File IO operators, I am
> > proposing
> > > > the
> > > > > following changes to File input and output operators:
> > > > > 1. File Input operator emits a watermark each time it opens and
> > closes
> > > a
> > > > > file. These can be "start file" and "end file" watermarks which
> > include
> > > > the
> > > > > corresponding file names. The "start file" tuple should be sent
> > before
> > > > any
> > > > > of the data from that file flows.
> > > > > 2. File Input operator can be configured to end the application
> > after a
> > > > > single or n scans of the directory (a batch). This is where the
> > > operator
> > > > > emits the final watermark (the end of application control tuple).
> > This
> > > > will
> > > > > also shutdown the application.
> > > > > 3. The File output operator handles these control tuples. "Start
> > file"
> > > > > initializes the file name for the incoming tuples. "End file"
> > watermark
> > > > > forces a finalize on that file.
> > > > >
> > > > > The user would be able to enable the operators to send only those
> > > > > watermarks that are needed in the application. If none of the
> options
> > > are
> > > > > configured, the operators behave as in a streaming application.
> > > > >
> > > > > There are a few challenges in the implementation where the input
> > > operator
> > > > > is partitioned. In this case, the correlation between the start/end
> > > for a
> > > > > file and the data tuples for that file is lost. Hence we need to
> > > maintain
> > > > > the filename as part of each tuple in the pipeline.
> > > > >
> > > > > The "start file" and "end file" control tuples in this example are
> > > > > temporary names for watermarks. We can have generic "start batch" /
> > > "end
> > > > > batch" tuples which could be used for other use cases as well. The
> > > Final
> > > > > watermark is common and serves the same purpose in each case.
> > > > >
> > > > > Please let me know your thoughts on this.
> > > > >
> > > > > ~ Bhupesh
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Jan 18, 2017 at 12:22 AM, Bhupesh Chawda <
> > > > bhupesh@datatorrent.com>
> > > > > wrote:
> > > > >
> > > > > > Yes, this can be part of operator configuration. Given this, for
> a
> > > user
> > > > > to
> > > > > > define a batch application, would mean configuring the connectors
> > > > (mostly
> > > > > > the input operator) in the application for the desired behavior.
> > > > > Similarly,
> > > > > > there can be other use cases that can be achieved other than
> batch.
> > > > > >
> > > > > > We may also need to take care of the following:
> > > > > > 1. Make sure that the watermarks or control tuples are consistent
> > > > across
> > > > > > sources. Meaning an HDFS sink should be able to interpret the
> > > watermark
> > > > > > tuple sent out by, say, a JDBC source.
> > > > > > 2. In addition to I/O connectors, we should also look at the need
> > for
> > > > > > processing operators to understand some of the control tuples /
> > > > > watermarks.
> > > > > > For example, we may want to reset the operator behavior on
> arrival
> > of
> > > > > some
> > > > > > watermark tuple.
> > > > > >
> > > > > > ~ Bhupesh
> > > > > >
> > > > > > On Tue, Jan 17, 2017 at 9:59 PM, Thomas Weise <th...@apache.org>
> > > wrote:
> > > > > >
> > > > > >> The HDFS source can operate in two modes, bounded or unbounded.
> If
> > > you
> > > > > >> scan
> > > > > >> only once, then it should emit the final watermark after it is
> > done.
> > > > > >> Otherwise it would emit watermarks based on a policy (files
> names
> > > > etc.).
> > > > > >> The mechanism to generate the marks may depend on the type of
> > source
> > > > and
> > > > > >> the user needs to be able to influence/configure it.
> > > > > >>
> > > > > >> Thomas
> > > > > >>
> > > > > >>
> > > > > >> On Tue, Jan 17, 2017 at 5:03 AM, Bhupesh Chawda <
> > > > > bhupesh@datatorrent.com>
> > > > > >> wrote:
> > > > > >>
> > > > > >> > Hi Thomas,
> > > > > >> >
> > > > > >> > I am not sure that I completely understand your suggestion.
> Are
> > > you
> > > > > >> > suggesting to broaden the scope of the proposal to treat all
> > > sources
> > > > > as
> > > > > >> > bounded as well as unbounded?
> > > > > >> >
> > > > > >> > In case of Apex, we treat all sources as unbounded sources.
> Even
> > > > > bounded
> > > > > >> > sources like HDFS file source is treated as unbounded by means
> > of
> > > > > >> scanning
> > > > > >> > the input directory repeatedly.
> > > > > >> >
> > > > > >> > Let's consider HDFS file source for example:
> > > > > >> > In this case, if we treat it as a bounded source, we can
> define
> > > > hooks
> > > > > >> which
> > > > > >> > allows us to detect the end of the file and send the "final
> > > > > watermark".
> > > > > >> We
> > > > > >> > could also consider HDFS file source as a streaming source and
> > > > define
> > > > > >> hooks
> > > > > >> > which send watermarks based on different kinds of windows.
> > > > > >> >
> > > > > >> > Please correct me if I misunderstand.
> > > > > >> >
> > > > > >> > ~ Bhupesh
> > > > > >> >
> > > > > >> >
> > > > > >> > On Mon, Jan 16, 2017 at 9:23 PM, Thomas Weise <thw@apache.org
> >
> > > > wrote:
> > > > > >> >
> > > > > >> > > Bhupesh,
> > > > > >> > >
> > > > > >> > > Please see how that can be solved in a unified way using
> > windows
> > > > and
> > > > > >> > > watermarks. It is bounded data vs. unbounded data. In Beam
> for
> > > > > >> example,
> > > > > >> > you
> > > > > >> > > can use the "global window" and the final watermark to
> > > accomplish
> > > > > what
> > > > > >> > you
> > > > > >> > > are looking for. Batch is just a special case of streaming
> > where
> > > > the
> > > > > >> > source
> > > > > >> > > emits the final watermark.
> > > > > >> > >
> > > > > >> > > Thanks,
> > > > > >> > > Thomas
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > On Mon, Jan 16, 2017 at 1:02 AM, Bhupesh Chawda <
> > > > > >> bhupesh@datatorrent.com
> > > > > >> > >
> > > > > >> > > wrote:
> > > > > >> > >
> > > > > >> > > > Yes, if the user needs to develop a batch application,
> then
> > > > batch
> > > > > >> aware
> > > > > >> > > > operators need to be used in the application.
> > > > > >> > > > The nature of the application is mostly controlled by the
> > > input
> > > > > and
> > > > > >> the
> > > > > >> > > > output operators used in the application.
> > > > > >> > > >
> > > > > >> > > > For example, consider an application which needs to filter
> > > > records
> > > > > >> in a
> > > > > >> > > > input file and store the filtered records in another file.
> > The
> > > > > >> nature
> > > > > >> > of
> > > > > >> > > > this app is to end once the entire file is processed.
> > > Following
> > > > > >> things
> > > > > >> > > are
> > > > > >> > > > expected of the application:
> > > > > >> > > >
> > > > > >> > > >    1. Once the input data is over, finalize the output
> file
> > > from
> > > > > >> .tmp
> > > > > >> > > >    files. - Responsibility of output operator
> > > > > >> > > >    2. End the application, once the data is read and
> > > processed -
> > > > > >> > > >    Responsibility of input operator
> > > > > >> > > >
> > > > > >> > > > These functions are essential to allow the user to do
> higher
> > > > level
> > > > > >> > > > operations like scheduling or running a workflow of batch
> > > > > >> applications.
> > > > > >> > > >
> > > > > >> > > > I am not sure about intermediate (processing) operators,
> as
> > > > there
> > > > > >> is no
> > > > > >> > > > change in their functionality for batch use cases.
> Perhaps,
> > > > > allowing
> > > > > >> > > > running multiple batches in a single application may
> require
> > > > > similar
> > > > > >> > > > changes in processing operators as well.
> > > > > >> > > >
> > > > > >> > > > ~ Bhupesh
> > > > > >> > > >
> > > > > >> > > > On Mon, Jan 16, 2017 at 2:19 PM, Priyanka Gugale <
> > > > > priyag@apache.org
> > > > > >> >
> > > > > >> > > > wrote:
> > > > > >> > > >
> > > > > >> > > > > Will it make an impression on user that, if he has a
> batch
> > > > > >> usecase he
> > > > > >> > > has
> > > > > >> > > > > to use batch aware operators only? If so, is that what
> we
> > > > > expect?
> > > > > >> I
> > > > > >> > am
> > > > > >> > > > not
> > > > > >> > > > > aware of how do we implement batch scenario so this
> might
> > > be a
> > > > > >> basic
> > > > > >> > > > > question.
> > > > > >> > > > >
> > > > > >> > > > > -Priyanka
> > > > > >> > > > >
> > > > > >> > > > > On Mon, Jan 16, 2017 at 12:02 PM, Bhupesh Chawda <
> > > > > >> > > > bhupesh@datatorrent.com>
> > > > > >> > > > > wrote:
> > > > > >> > > > >
> > > > > >> > > > > > Hi All,
> > > > > >> > > > > >
> > > > > >> > > > > > While design / implementation for custom control
> tuples
> > is
> > > > > >> > ongoing, I
> > > > > >> > > > > > thought it would be a good idea to consider its
> > usefulness
> > > > in
> > > > > >> one
> > > > > >> > of
> > > > > >> > > > the
> > > > > >> > > > > > use cases -  batch applications.
> > > > > >> > > > > >
> > > > > >> > > > > > This is a proposal to adapt / extend existing
> operators
> > in
> > > > the
> > > > > >> > Apache
> > > > > >> > > > > Apex
> > > > > >> > > > > > Malhar library so that it is easy to use them in batch
> > use
> > > > > >> cases.
> > > > > >> > > > > > Naturally, this would be applicable for only a subset
> of
> > > > > >> operators
> > > > > >> > > like
> > > > > >> > > > > > File, JDBC and NoSQL databases.
> > > > > >> > > > > > For example, for a file based store, (say HDFS store),
> > we
> > > > > could
> > > > > >> > have
> > > > > >> > > > > > FileBatchInput and FileBatchOutput operators which
> allow
> > > > easy
> > > > > >> > > > integration
> > > > > >> > > > > > into a batch application. These operators would be
> > > extended
> > > > > from
> > > > > >> > > their
> > > > > >> > > > > > existing implementations and would be "Batch Aware",
> in
> > > that
> > > > > >> they
> > > > > >> > may
> > > > > >> > > > > > understand the meaning of some specific control tuples
> > > that
> > > > > flow
> > > > > >> > > > through
> > > > > >> > > > > > the DAG. Start batch and end batch seem to be the
> > obvious
> > > > > >> > candidates
> > > > > >> > > > that
> > > > > >> > > > > > come to mind. On receipt of such control tuples, they
> > may
> > > > try
> > > > > to
> > > > > >> > > modify
> > > > > >> > > > > the
> > > > > >> > > > > > behavior of the operator - to reinitialize some
> metrics
> > or
> > > > > >> finalize
> > > > > >> > > an
> > > > > >> > > > > > output file for example.
> > > > > >> > > > > >
> > > > > >> > > > > > We can discuss the potential control tuples and
> actions
> > in
> > > > > >> detail,
> > > > > >> > > but
> > > > > >> > > > > > first I would like to understand the views of the
> > > community
> > > > > for
> > > > > >> > this
> > > > > >> > > > > > proposal.
> > > > > >> > > > > >
> > > > > >> > > > > > ~ Bhupesh
> > > > > >> > > > > >
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by Amol Kekre <am...@datatorrent.com>.
Bhupesh,
That is true, but in reality watermarks do not solve a design problem in
the DAG where data is getting mixed up. All the watermarks do is convey
"start" and "end" within the stream. The start and end control tuples
should carry the physical operator id plus a monotonically increasing
number. Both of these are inserted by the engine and are not user
supplied, i.e. the engine takes up the guarantee of identifying these
watermarks. This concept is the same as our current start-window and
end-window (which has worked well).
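
As a rough illustration (a sketch only; the class name and fields are my
assumptions, not an existing Apex API), such a marker could look like:

    // Illustrative sketch only -- not the actual Apex control tuple API.
    // A start/end marker stamped by the engine with the emitting physical
    // operator id and a monotonically increasing sequence number, mirroring
    // how start-window/end-window are identified today.
    public class BatchControlTuple implements java.io.Serializable
    {
      public enum Type { START, END }

      private final Type type;
      private final int physicalOperatorId;  // set by the engine, not the user
      private final long sequence;           // monotonically increasing per operator

      public BatchControlTuple(Type type, int physicalOperatorId, long sequence)
      {
        this.type = type;
        this.physicalOperatorId = physicalOperatorId;
        this.sequence = sequence;
      }

      public Type getType() { return type; }
      public int getPhysicalOperatorId() { return physicalOperatorId; }
      public long getSequence() { return sequence; }
    }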

Today Apex does not have watermarks, and let's say I am sending "start
something", "end something" through another port. I will still need to
avoid mixing data in a transform operator downstream. That problem exists
today and will continue. Putting the filename on every tuple is too much of
a performance hit. Secondly, a lot of batch operations are not file related
(i.e. file to file); they are collections of "data" split into part files
(for performance reasons), and grouping/dimensions/event time/... are done
based on the internals of the file. In case of a file to file copy, the
user should be expected to route the data properly (parallel partition?).
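
For the file-to-file copy case, a minimal wiring sketch could look like the
following (LineReader and LineWriter are placeholders for the actual Malhar
file input/output operators, so this is only an outline of the idea):

    import org.apache.hadoop.conf.Configuration;

    import com.datatorrent.api.Context;
    import com.datatorrent.api.DAG;
    import com.datatorrent.api.StreamingApplication;

    // Sketch only: LineReader and LineWriter stand in for the concrete
    // Malhar file reader/writer operators used in the application.
    public class FileCopyApp implements StreamingApplication
    {
      @Override
      public void populateDAG(DAG dag, Configuration conf)
      {
        LineReader reader = dag.addOperator("reader", new LineReader());
        LineWriter writer = dag.addOperator("writer", new LineWriter());

        dag.addStream("lines", reader.output, writer.input);
        // 1:1 (parallel) partitioning: each writer stays paired with its
        // reader, so tuples from different input partitions are never mixed.
        dag.setInputPortAttribute(writer.input,
            Context.PortContext.PARTITION_PARALLEL, true);
      }
    }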

Event-time based watermarks need a separate thread. I am certain that the
engine will need to be event-time aware, and will need to take this into
account for proper layout.

Thks
Amol


*Follow @amolhkekre*
*Join us at Apex Big Data World-San Jose
<http://www.apexbigdata.com/san-jose.html>, April 4, 2017!*

On Sat, Feb 18, 2017 at 8:17 AM, Bhupesh Chawda <bh...@datatorrent.com>
wrote:

> Amol, agreed. We can address event time based watermarks once file batch is
> done.
> Regarding, file batch support: by allowing to partition an input (file)
> operator, we are implicitly mixing multiple batches. Even if the user does
> not do any transformations, we should be able to write the correct data to
> right files at the destination.
>
> ~ Bhupesh
>
>
> _______________________________________________________
>
> Bhupesh Chawda
>
> Software Engineer
>
> E: bhupesh@datatorrent.com | Twitter: @bhupeshsc
>
> www.datatorrent.com  |  apex.apache.org
>
>
>
> On Sat, Feb 18, 2017 at 12:26 PM, Amol Kekre <am...@datatorrent.com> wrote:
>
> > Thomas,
> > The watermarks we have in Apex (start-window and end-window) are working
> > good. It is fine to take a look at event time, but basic file I/O does
> not
> > need anything more than start and end. Lets say they are start-something,
> > end-something. The main difference here is that the tuples are user
> > generated, other than that they should follow similar principle as
> > start-window & end-window. The commonality includes
> > - dedup of start-st and end-st
> > - First start-st passes through
> > - Last end-st passes through
> > - Engine indentifies them with chronologically increasing number and
> source
> >
> > The only main difference is that an emit of these is user controlled and
> > cannot be guaranteed to happen as such. BTW, part files are rarely done
> > based on event time, they are almost always split by size. A vast
> majority
> > of batch cases have hourly files bound by arrival time and not event
> time.
> >
> > Bhupesh,
> > Attaching file names to tuples does not scale. If user mixes two batches,
> > then the user would need to handle the transformations. Post file batch
> > support, we should look at event time support. Unlike file based batches,
> > event time will overlap each other, i.e. at a given time at least two (if
> > not more) event times will be active. I think the engine will need to be
> > event time aware.
> >
> > Thks
> > Amol
> >
> >
> >
> > *Follow @amolhkekre*
> > *Join us at Apex Big Data World-San Jose
> > <http://www.apexbigdata.com/san-jose.html>, April 4, 2017!*
> > [image: http://www.apexbigdata.com/san-jose-register.html]
> > <http://www.apexbigdata.com/san-jose-register.html>
> >
> > On Wed, Feb 15, 2017 at 9:07 PM, Thomas Weise <th...@apache.org> wrote:
> >
> > > I don't think this should be designed based on a simplistic file
> > > input-output scenario. It would be good to include a stateful
> > > transformation based on event time.
> > >
> > > More complex pipelines contain stateful transformations that depend on
> > > windowing and watermarks. I think we need a watermark concept that is
> > based
> > > on progress in event time (or other monotonic increasing sequence) that
> > > other operators can generically work with.
> > >
> > > Note that even file input in many cases can produce time based
> > watermarks,
> > > for example when you read part files that are bound by event time.
> > >
> > > Thanks,
> > > Thomas
> > >
> > >
> > > On Wed, Feb 15, 2017 at 4:02 AM, Bhupesh Chawda <
> bhupesh@datatorrent.com
> > >
> > > wrote:
> > >
> > > > For better understanding the use case for control tuples in batch, ​I
> > am
> > > > creating a prototype for a batch application using File Input and
> File
> > > > Output operators.
> > > >
> > > > To enable basic batch processing for File IO operators, I am
> proposing
> > > the
> > > > following changes to File input and output operators:
> > > > 1. File Input operator emits a watermark each time it opens and
> closes
> > a
> > > > file. These can be "start file" and "end file" watermarks which
> include
> > > the
> > > > corresponding file names. The "start file" tuple should be sent
> before
> > > any
> > > > of the data from that file flows.
> > > > 2. File Input operator can be configured to end the application
> after a
> > > > single or n scans of the directory (a batch). This is where the
> > operator
> > > > emits the final watermark (the end of application control tuple).
> This
> > > will
> > > > also shutdown the application.
> > > > 3. The File output operator handles these control tuples. "Start
> file"
> > > > initializes the file name for the incoming tuples. "End file"
> watermark
> > > > forces a finalize on that file.
> > > >
> > > > The user would be able to enable the operators to send only those
> > > > watermarks that are needed in the application. If none of the options
> > are
> > > > configured, the operators behave as in a streaming application.
> > > >
> > > > There are a few challenges in the implementation where the input
> > operator
> > > > is partitioned. In this case, the correlation between the start/end
> > for a
> > > > file and the data tuples for that file is lost. Hence we need to
> > maintain
> > > > the filename as part of each tuple in the pipeline.
> > > >
> > > > The "start file" and "end file" control tuples in this example are
> > > > temporary names for watermarks. We can have generic "start batch" /
> > "end
> > > > batch" tuples which could be used for other use cases as well. The
> > Final
> > > > watermark is common and serves the same purpose in each case.
> > > >
> > > > Please let me know your thoughts on this.
> > > >
> > > > ~ Bhupesh
> > > >
> > > >
> > > >
> > > > On Wed, Jan 18, 2017 at 12:22 AM, Bhupesh Chawda <
> > > bhupesh@datatorrent.com>
> > > > wrote:
> > > >
> > > > > Yes, this can be part of operator configuration. Given this, for a
> > user
> > > > to
> > > > > define a batch application, would mean configuring the connectors
> > > (mostly
> > > > > the input operator) in the application for the desired behavior.
> > > > Similarly,
> > > > > there can be other use cases that can be achieved other than batch.
> > > > >
> > > > > We may also need to take care of the following:
> > > > > 1. Make sure that the watermarks or control tuples are consistent
> > > across
> > > > > sources. Meaning an HDFS sink should be able to interpret the
> > watermark
> > > > > tuple sent out by, say, a JDBC source.
> > > > > 2. In addition to I/O connectors, we should also look at the need
> for
> > > > > processing operators to understand some of the control tuples /
> > > > watermarks.
> > > > > For example, we may want to reset the operator behavior on arrival
> of
> > > > some
> > > > > watermark tuple.
> > > > >
> > > > > ~ Bhupesh
> > > > >
> > > > > On Tue, Jan 17, 2017 at 9:59 PM, Thomas Weise <th...@apache.org>
> > wrote:
> > > > >
> > > > >> The HDFS source can operate in two modes, bounded or unbounded. If
> > you
> > > > >> scan
> > > > >> only once, then it should emit the final watermark after it is
> done.
> > > > >> Otherwise it would emit watermarks based on a policy (files names
> > > etc.).
> > > > >> The mechanism to generate the marks may depend on the type of
> source
> > > and
> > > > >> the user needs to be able to influence/configure it.
> > > > >>
> > > > >> Thomas
> > > > >>
> > > > >>
> > > > >> On Tue, Jan 17, 2017 at 5:03 AM, Bhupesh Chawda <
> > > > bhupesh@datatorrent.com>
> > > > >> wrote:
> > > > >>
> > > > >> > Hi Thomas,
> > > > >> >
> > > > >> > I am not sure that I completely understand your suggestion. Are
> > you
> > > > >> > suggesting to broaden the scope of the proposal to treat all
> > sources
> > > > as
> > > > >> > bounded as well as unbounded?
> > > > >> >
> > > > >> > In case of Apex, we treat all sources as unbounded sources. Even
> > > > bounded
> > > > >> > sources like HDFS file source is treated as unbounded by means
> of
> > > > >> scanning
> > > > >> > the input directory repeatedly.
> > > > >> >
> > > > >> > Let's consider HDFS file source for example:
> > > > >> > In this case, if we treat it as a bounded source, we can define
> > > hooks
> > > > >> which
> > > > >> > allows us to detect the end of the file and send the "final
> > > > watermark".
> > > > >> We
> > > > >> > could also consider HDFS file source as a streaming source and
> > > define
> > > > >> hooks
> > > > >> > which send watermarks based on different kinds of windows.
> > > > >> >
> > > > >> > Please correct me if I misunderstand.
> > > > >> >
> > > > >> > ~ Bhupesh
> > > > >> >
> > > > >> >
> > > > >> > On Mon, Jan 16, 2017 at 9:23 PM, Thomas Weise <th...@apache.org>
> > > wrote:
> > > > >> >
> > > > >> > > Bhupesh,
> > > > >> > >
> > > > >> > > Please see how that can be solved in a unified way using
> windows
> > > and
> > > > >> > > watermarks. It is bounded data vs. unbounded data. In Beam for
> > > > >> example,
> > > > >> > you
> > > > >> > > can use the "global window" and the final watermark to
> > accomplish
> > > > what
> > > > >> > you
> > > > >> > > are looking for. Batch is just a special case of streaming
> where
> > > the
> > > > >> > source
> > > > >> > > emits the final watermark.
> > > > >> > >
> > > > >> > > Thanks,
> > > > >> > > Thomas
> > > > >> > >
> > > > >> > >
> > > > >> > > On Mon, Jan 16, 2017 at 1:02 AM, Bhupesh Chawda <
> > > > >> bhupesh@datatorrent.com
> > > > >> > >
> > > > >> > > wrote:
> > > > >> > >
> > > > >> > > > Yes, if the user needs to develop a batch application, then
> > > batch
> > > > >> aware
> > > > >> > > > operators need to be used in the application.
> > > > >> > > > The nature of the application is mostly controlled by the
> > input
> > > > and
> > > > >> the
> > > > >> > > > output operators used in the application.
> > > > >> > > >
> > > > >> > > > For example, consider an application which needs to filter
> > > records
> > > > >> in a
> > > > >> > > > input file and store the filtered records in another file.
> The
> > > > >> nature
> > > > >> > of
> > > > >> > > > this app is to end once the entire file is processed.
> > Following
> > > > >> things
> > > > >> > > are
> > > > >> > > > expected of the application:
> > > > >> > > >
> > > > >> > > >    1. Once the input data is over, finalize the output file
> > from
> > > > >> .tmp
> > > > >> > > >    files. - Responsibility of output operator
> > > > >> > > >    2. End the application, once the data is read and
> > processed -
> > > > >> > > >    Responsibility of input operator
> > > > >> > > >
> > > > >> > > > These functions are essential to allow the user to do higher
> > > level
> > > > >> > > > operations like scheduling or running a workflow of batch
> > > > >> applications.
> > > > >> > > >
> > > > >> > > > I am not sure about intermediate (processing) operators, as
> > > there
> > > > >> is no
> > > > >> > > > change in their functionality for batch use cases. Perhaps,
> > > > allowing
> > > > >> > > > running multiple batches in a single application may require
> > > > similar
> > > > >> > > > changes in processing operators as well.
> > > > >> > > >
> > > > >> > > > ~ Bhupesh
> > > > >> > > >
> > > > >> > > > On Mon, Jan 16, 2017 at 2:19 PM, Priyanka Gugale <
> > > > priyag@apache.org
> > > > >> >
> > > > >> > > > wrote:
> > > > >> > > >
> > > > >> > > > > Will it make an impression on user that, if he has a batch
> > > > >> usecase he
> > > > >> > > has
> > > > >> > > > > to use batch aware operators only? If so, is that what we
> > > > expect?
> > > > >> I
> > > > >> > am
> > > > >> > > > not
> > > > >> > > > > aware of how do we implement batch scenario so this might
> > be a
> > > > >> basic
> > > > >> > > > > question.
> > > > >> > > > >
> > > > >> > > > > -Priyanka
> > > > >> > > > >
> > > > >> > > > > On Mon, Jan 16, 2017 at 12:02 PM, Bhupesh Chawda <
> > > > >> > > > bhupesh@datatorrent.com>
> > > > >> > > > > wrote:
> > > > >> > > > >
> > > > >> > > > > > Hi All,
> > > > >> > > > > >
> > > > >> > > > > > While design / implementation for custom control tuples
> is
> > > > >> > ongoing, I
> > > > >> > > > > > thought it would be a good idea to consider its
> usefulness
> > > in
> > > > >> one
> > > > >> > of
> > > > >> > > > the
> > > > >> > > > > > use cases -  batch applications.
> > > > >> > > > > >
> > > > >> > > > > > This is a proposal to adapt / extend existing operators
> in
> > > the
> > > > >> > Apache
> > > > >> > > > > Apex
> > > > >> > > > > > Malhar library so that it is easy to use them in batch
> use
> > > > >> cases.
> > > > >> > > > > > Naturally, this would be applicable for only a subset of
> > > > >> operators
> > > > >> > > like
> > > > >> > > > > > File, JDBC and NoSQL databases.
> > > > >> > > > > > For example, for a file based store, (say HDFS store),
> we
> > > > could
> > > > >> > have
> > > > >> > > > > > FileBatchInput and FileBatchOutput operators which allow
> > > easy
> > > > >> > > > integration
> > > > >> > > > > > into a batch application. These operators would be
> > extended
> > > > from
> > > > >> > > their
> > > > >> > > > > > existing implementations and would be "Batch Aware", in
> > that
> > > > >> they
> > > > >> > may
> > > > >> > > > > > understand the meaning of some specific control tuples
> > that
> > > > flow
> > > > >> > > > through
> > > > >> > > > > > the DAG. Start batch and end batch seem to be the
> obvious
> > > > >> > candidates
> > > > >> > > > that
> > > > >> > > > > > come to mind. On receipt of such control tuples, they
> may
> > > try
> > > > to
> > > > >> > > modify
> > > > >> > > > > the
> > > > >> > > > > > behavior of the operator - to reinitialize some metrics
> or
> > > > >> finalize
> > > > >> > > an
> > > > >> > > > > > output file for example.
> > > > >> > > > > >
> > > > >> > > > > > We can discuss the potential control tuples and actions
> in
> > > > >> detail,
> > > > >> > > but
> > > > >> > > > > > first I would like to understand the views of the
> > community
> > > > for
> > > > >> > this
> > > > >> > > > > > proposal.
> > > > >> > > > > >
> > > > >> > > > > > ~ Bhupesh
> > > > >> > > > > >
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by Bhupesh Chawda <bh...@datatorrent.com>.
Amol, agreed. We can address event-time based watermarks once file batch
support is done.
Regarding file batch support: by allowing an input (file) operator to be
partitioned, we are implicitly mixing multiple batches. Even if the user
does not do any transformations, we should be able to write the correct
data to the right files at the destination.
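
As a rough sketch of what I mean (assuming Malhar's
AbstractFileOutputOperator with its getFileName()/getBytesForTuple() hooks,
and a hypothetical LineWithFile POJO that carries the source file name with
each line), the writer can derive the destination file from the tuple
itself:

    import com.datatorrent.lib.io.fs.AbstractFileOutputOperator;

    // Sketch only: LineWithFile is a hypothetical POJO pairing a line of
    // text with the name of the file it was read from. Deriving the output
    // file from the tuple keeps the data correct even when upstream
    // partitions interleave lines from several input files.
    public class PerFileWriter extends AbstractFileOutputOperator<LineWithFile>
    {
      @Override
      protected String getFileName(LineWithFile tuple)
      {
        // one output part per source file
        return tuple.getFileName() + ".out";
      }

      @Override
      protected byte[] getBytesForTuple(LineWithFile tuple)
      {
        return (tuple.getLine() + "\n").getBytes();
      }
    }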

~ Bhupesh


_______________________________________________________

Bhupesh Chawda

Software Engineer

E: bhupesh@datatorrent.com | Twitter: @bhupeshsc

www.datatorrent.com  |  apex.apache.org



On Sat, Feb 18, 2017 at 12:26 PM, Amol Kekre <am...@datatorrent.com> wrote:

> Thomas,
> The watermarks we have in Apex (start-window and end-window) are working
> good. It is fine to take a look at event time, but basic file I/O does not
> need anything more than start and end. Lets say they are start-something,
> end-something. The main difference here is that the tuples are user
> generated, other than that they should follow similar principle as
> start-window & end-window. The commonality includes
> - dedup of start-st and end-st
> - First start-st passes through
> - Last end-st passes through
> - Engine indentifies them with chronologically increasing number and source
>
> The only main difference is that an emit of these is user controlled and
> cannot be guaranteed to happen as such. BTW, part files are rarely done
> based on event time, they are almost always split by size. A vast majority
> of batch cases have hourly files bound by arrival time and not event time.
>
> Bhupesh,
> Attaching file names to tuples does not scale. If user mixes two batches,
> then the user would need to handle the transformations. Post file batch
> support, we should look at event time support. Unlike file based batches,
> event time will overlap each other, i.e. at a given time at least two (if
> not more) event times will be active. I think the engine will need to be
> event time aware.
>
> Thks
> Amol
>
>
>
> *Follow @amolhkekre*
> *Join us at Apex Big Data World-San Jose
> <http://www.apexbigdata.com/san-jose.html>, April 4, 2017!*
> [image: http://www.apexbigdata.com/san-jose-register.html]
> <http://www.apexbigdata.com/san-jose-register.html>
>
> On Wed, Feb 15, 2017 at 9:07 PM, Thomas Weise <th...@apache.org> wrote:
>
> > I don't think this should be designed based on a simplistic file
> > input-output scenario. It would be good to include a stateful
> > transformation based on event time.
> >
> > More complex pipelines contain stateful transformations that depend on
> > windowing and watermarks. I think we need a watermark concept that is
> based
> > on progress in event time (or other monotonic increasing sequence) that
> > other operators can generically work with.
> >
> > Note that even file input in many cases can produce time based
> watermarks,
> > for example when you read part files that are bound by event time.
> >
> > Thanks,
> > Thomas
> >
> >
> > On Wed, Feb 15, 2017 at 4:02 AM, Bhupesh Chawda <bhupesh@datatorrent.com
> >
> > wrote:
> >
> > > For better understanding the use case for control tuples in batch, ​I
> am
> > > creating a prototype for a batch application using File Input and File
> > > Output operators.
> > >
> > > To enable basic batch processing for File IO operators, I am proposing
> > the
> > > following changes to File input and output operators:
> > > 1. File Input operator emits a watermark each time it opens and closes
> a
> > > file. These can be "start file" and "end file" watermarks which include
> > the
> > > corresponding file names. The "start file" tuple should be sent before
> > any
> > > of the data from that file flows.
> > > 2. File Input operator can be configured to end the application after a
> > > single or n scans of the directory (a batch). This is where the
> operator
> > > emits the final watermark (the end of application control tuple). This
> > will
> > > also shutdown the application.
> > > 3. The File output operator handles these control tuples. "Start file"
> > > initializes the file name for the incoming tuples. "End file" watermark
> > > forces a finalize on that file.
> > >
> > > The user would be able to enable the operators to send only those
> > > watermarks that are needed in the application. If none of the options
> are
> > > configured, the operators behave as in a streaming application.
> > >
> > > There are a few challenges in the implementation where the input
> operator
> > > is partitioned. In this case, the correlation between the start/end
> for a
> > > file and the data tuples for that file is lost. Hence we need to
> maintain
> > > the filename as part of each tuple in the pipeline.
> > >
> > > The "start file" and "end file" control tuples in this example are
> > > temporary names for watermarks. We can have generic "start batch" /
> "end
> > > batch" tuples which could be used for other use cases as well. The
> Final
> > > watermark is common and serves the same purpose in each case.
> > >
> > > Please let me know your thoughts on this.
> > >
> > > ~ Bhupesh
> > >
> > >
> > >
> > > On Wed, Jan 18, 2017 at 12:22 AM, Bhupesh Chawda <
> > bhupesh@datatorrent.com>
> > > wrote:
> > >
> > > > Yes, this can be part of operator configuration. Given this, for a
> user
> > > to
> > > > define a batch application, would mean configuring the connectors
> > (mostly
> > > > the input operator) in the application for the desired behavior.
> > > Similarly,
> > > > there can be other use cases that can be achieved other than batch.
> > > >
> > > > We may also need to take care of the following:
> > > > 1. Make sure that the watermarks or control tuples are consistent
> > across
> > > > sources. Meaning an HDFS sink should be able to interpret the
> watermark
> > > > tuple sent out by, say, a JDBC source.
> > > > 2. In addition to I/O connectors, we should also look at the need for
> > > > processing operators to understand some of the control tuples /
> > > watermarks.
> > > > For example, we may want to reset the operator behavior on arrival of
> > > some
> > > > watermark tuple.
> > > >
> > > > ~ Bhupesh
> > > >
> > > > On Tue, Jan 17, 2017 at 9:59 PM, Thomas Weise <th...@apache.org>
> wrote:
> > > >
> > > >> The HDFS source can operate in two modes, bounded or unbounded. If
> you
> > > >> scan
> > > >> only once, then it should emit the final watermark after it is done.
> > > >> Otherwise it would emit watermarks based on a policy (files names
> > etc.).
> > > >> The mechanism to generate the marks may depend on the type of source
> > and
> > > >> the user needs to be able to influence/configure it.
> > > >>
> > > >> Thomas
> > > >>
> > > >>
> > > >> On Tue, Jan 17, 2017 at 5:03 AM, Bhupesh Chawda <
> > > bhupesh@datatorrent.com>
> > > >> wrote:
> > > >>
> > > >> > Hi Thomas,
> > > >> >
> > > >> > I am not sure that I completely understand your suggestion. Are
> you
> > > >> > suggesting to broaden the scope of the proposal to treat all
> sources
> > > as
> > > >> > bounded as well as unbounded?
> > > >> >
> > > >> > In case of Apex, we treat all sources as unbounded sources. Even
> > > bounded
> > > >> > sources like HDFS file source is treated as unbounded by means of
> > > >> scanning
> > > >> > the input directory repeatedly.
> > > >> >
> > > >> > Let's consider HDFS file source for example:
> > > >> > In this case, if we treat it as a bounded source, we can define
> > hooks
> > > >> which
> > > >> > allows us to detect the end of the file and send the "final
> > > watermark".
> > > >> We
> > > >> > could also consider HDFS file source as a streaming source and
> > define
> > > >> hooks
> > > >> > which send watermarks based on different kinds of windows.
> > > >> >
> > > >> > Please correct me if I misunderstand.
> > > >> >
> > > >> > ~ Bhupesh
> > > >> >
> > > >> >
> > > >> > On Mon, Jan 16, 2017 at 9:23 PM, Thomas Weise <th...@apache.org>
> > wrote:
> > > >> >
> > > >> > > Bhupesh,
> > > >> > >
> > > >> > > Please see how that can be solved in a unified way using windows
> > and
> > > >> > > watermarks. It is bounded data vs. unbounded data. In Beam for
> > > >> example,
> > > >> > you
> > > >> > > can use the "global window" and the final watermark to
> accomplish
> > > what
> > > >> > you
> > > >> > > are looking for. Batch is just a special case of streaming where
> > the
> > > >> > source
> > > >> > > emits the final watermark.
> > > >> > >
> > > >> > > Thanks,
> > > >> > > Thomas
> > > >> > >
> > > >> > >
> > > >> > > On Mon, Jan 16, 2017 at 1:02 AM, Bhupesh Chawda <
> > > >> bhupesh@datatorrent.com
> > > >> > >
> > > >> > > wrote:
> > > >> > >
> > > >> > > > Yes, if the user needs to develop a batch application, then
> > batch
> > > >> aware
> > > >> > > > operators need to be used in the application.
> > > >> > > > The nature of the application is mostly controlled by the
> input
> > > and
> > > >> the
> > > >> > > > output operators used in the application.
> > > >> > > >
> > > >> > > > For example, consider an application which needs to filter
> > records
> > > >> in a
> > > >> > > > input file and store the filtered records in another file. The
> > > >> nature
> > > >> > of
> > > >> > > > this app is to end once the entire file is processed.
> Following
> > > >> things
> > > >> > > are
> > > >> > > > expected of the application:
> > > >> > > >
> > > >> > > >    1. Once the input data is over, finalize the output file
> from
> > > >> .tmp
> > > >> > > >    files. - Responsibility of output operator
> > > >> > > >    2. End the application, once the data is read and
> processed -
> > > >> > > >    Responsibility of input operator
> > > >> > > >
> > > >> > > > These functions are essential to allow the user to do higher
> > level
> > > >> > > > operations like scheduling or running a workflow of batch
> > > >> applications.
> > > >> > > >
> > > >> > > > I am not sure about intermediate (processing) operators, as
> > there
> > > >> is no
> > > >> > > > change in their functionality for batch use cases. Perhaps,
> > > allowing
> > > >> > > > running multiple batches in a single application may require
> > > similar
> > > >> > > > changes in processing operators as well.
> > > >> > > >
> > > >> > > > ~ Bhupesh
> > > >> > > >
> > > >> > > > On Mon, Jan 16, 2017 at 2:19 PM, Priyanka Gugale <
> > > priyag@apache.org
> > > >> >
> > > >> > > > wrote:
> > > >> > > >
> > > >> > > > > Will it make an impression on user that, if he has a batch
> > > >> usecase he
> > > >> > > has
> > > >> > > > > to use batch aware operators only? If so, is that what we
> > > expect?
> > > >> I
> > > >> > am
> > > >> > > > not
> > > >> > > > > aware of how do we implement batch scenario so this might
> be a
> > > >> basic
> > > >> > > > > question.
> > > >> > > > >
> > > >> > > > > -Priyanka
> > > >> > > > >
> > > >> > > > > On Mon, Jan 16, 2017 at 12:02 PM, Bhupesh Chawda <
> > > >> > > > bhupesh@datatorrent.com>
> > > >> > > > > wrote:
> > > >> > > > >
> > > >> > > > > > Hi All,
> > > >> > > > > >
> > > >> > > > > > While design / implementation for custom control tuples is
> > > >> > ongoing, I
> > > >> > > > > > thought it would be a good idea to consider its usefulness
> > in
> > > >> one
> > > >> > of
> > > >> > > > the
> > > >> > > > > > use cases -  batch applications.
> > > >> > > > > >
> > > >> > > > > > This is a proposal to adapt / extend existing operators in
> > the
> > > >> > Apache
> > > >> > > > > Apex
> > > >> > > > > > Malhar library so that it is easy to use them in batch use
> > > >> cases.
> > > >> > > > > > Naturally, this would be applicable for only a subset of
> > > >> operators
> > > >> > > like
> > > >> > > > > > File, JDBC and NoSQL databases.
> > > >> > > > > > For example, for a file based store, (say HDFS store), we
> > > could
> > > >> > have
> > > >> > > > > > FileBatchInput and FileBatchOutput operators which allow
> > easy
> > > >> > > > integration
> > > >> > > > > > into a batch application. These operators would be
> extended
> > > from
> > > >> > > their
> > > >> > > > > > existing implementations and would be "Batch Aware", in
> that
> > > >> they
> > > >> > may
> > > >> > > > > > understand the meaning of some specific control tuples
> that
> > > flow
> > > >> > > > through
> > > >> > > > > > the DAG. Start batch and end batch seem to be the obvious
> > > >> > candidates
> > > >> > > > that
> > > >> > > > > > come to mind. On receipt of such control tuples, they may
> > try
> > > to
> > > >> > > modify
> > > >> > > > > the
> > > >> > > > > > behavior of the operator - to reinitialize some metrics or
> > > >> finalize
> > > >> > > an
> > > >> > > > > > output file for example.
> > > >> > > > > >
> > > >> > > > > > We can discuss the potential control tuples and actions in
> > > >> detail,
> > > >> > > but
> > > >> > > > > > first I would like to understand the views of the
> community
> > > for
> > > >> > this
> > > >> > > > > > proposal.
> > > >> > > > > >
> > > >> > > > > > ~ Bhupesh
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by Amol Kekre <am...@datatorrent.com>.
Thomas,
The watermarks we have in Apex (start-window and end-window) are working
well. It is fine to take a look at event time, but basic file I/O does not
need anything more than start and end. Let's say they are start-something,
end-something. The main difference here is that the tuples are user
generated; other than that, they should follow a similar principle to
start-window & end-window. The commonality includes:
- dedup of start-st and end-st
- First start-st passes through
- Last end-st passes through
- Engine identifies them with a chronologically increasing number and source

The only main difference is that the emission of these is user controlled
and cannot be guaranteed to happen as such. BTW, part files are rarely done
based on event time; they are almost always split by size. A vast majority
of batch cases have hourly files bound by arrival time and not event time.
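
To illustrate the dedup rule (a sketch only, not engine code; the names are
placeholders): a physical operator fed by N upstream partitions would
forward only the first start marker and only the last end marker it sees
for a given sequence number:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of start-st / end-st dedup at an operator with several
    // upstream partitions, mirroring how start-window/end-window are
    // handled today: first start passes through, last end passes through.
    public class ControlTupleDedup
    {
      private final int upstreamPartitions;
      private final Map<Long, Integer> startCounts = new HashMap<>();
      private final Map<Long, Integer> endCounts = new HashMap<>();

      public ControlTupleDedup(int upstreamPartitions)
      {
        this.upstreamPartitions = upstreamPartitions;
      }

      /** @return true only for the first start marker of this sequence. */
      public boolean onStart(long sequence)
      {
        return startCounts.merge(sequence, 1, Integer::sum) == 1;
      }

      /** @return true only when the last upstream partition delivers its end marker. */
      public boolean onEnd(long sequence)
      {
        return endCounts.merge(sequence, 1, Integer::sum) == upstreamPartitions;
      }
    }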

Bhupesh,
Attaching file names to tuples does not scale. If the user mixes two
batches, then the user would need to handle the transformations. Post file
batch support, we should look at event time support. Unlike file based
batches, event times will overlap each other, i.e. at a given time at least
two (if not more) event times will be active. I think the engine will need
to be event time aware.

Thks
Amol



*Follow @amolhkekre*
*Join us at Apex Big Data World-San Jose
<http://www.apexbigdata.com/san-jose.html>, April 4, 2017!*

On Wed, Feb 15, 2017 at 9:07 PM, Thomas Weise <th...@apache.org> wrote:

> I don't think this should be designed based on a simplistic file
> input-output scenario. It would be good to include a stateful
> transformation based on event time.
>
> More complex pipelines contain stateful transformations that depend on
> windowing and watermarks. I think we need a watermark concept that is based
> on progress in event time (or other monotonic increasing sequence) that
> other operators can generically work with.
>
> Note that even file input in many cases can produce time based watermarks,
> for example when you read part files that are bound by event time.
>
> Thanks,
> Thomas
>
>
> On Wed, Feb 15, 2017 at 4:02 AM, Bhupesh Chawda <bh...@datatorrent.com>
> wrote:
>
> > For better understanding the use case for control tuples in batch, ​I am
> > creating a prototype for a batch application using File Input and File
> > Output operators.
> >
> > To enable basic batch processing for File IO operators, I am proposing
> the
> > following changes to File input and output operators:
> > 1. File Input operator emits a watermark each time it opens and closes a
> > file. These can be "start file" and "end file" watermarks which include
> the
> > corresponding file names. The "start file" tuple should be sent before
> any
> > of the data from that file flows.
> > 2. File Input operator can be configured to end the application after a
> > single or n scans of the directory (a batch). This is where the operator
> > emits the final watermark (the end of application control tuple). This
> will
> > also shutdown the application.
> > 3. The File output operator handles these control tuples. "Start file"
> > initializes the file name for the incoming tuples. "End file" watermark
> > forces a finalize on that file.
> >
> > The user would be able to enable the operators to send only those
> > watermarks that are needed in the application. If none of the options are
> > configured, the operators behave as in a streaming application.
> >
> > There are a few challenges in the implementation where the input operator
> > is partitioned. In this case, the correlation between the start/end for a
> > file and the data tuples for that file is lost. Hence we need to maintain
> > the filename as part of each tuple in the pipeline.
> >
> > The "start file" and "end file" control tuples in this example are
> > temporary names for watermarks. We can have generic "start batch" / "end
> > batch" tuples which could be used for other use cases as well. The Final
> > watermark is common and serves the same purpose in each case.
> >
> > Please let me know your thoughts on this.
> >
> > ~ Bhupesh
> >
> >
> >
> > On Wed, Jan 18, 2017 at 12:22 AM, Bhupesh Chawda <
> bhupesh@datatorrent.com>
> > wrote:
> >
> > > Yes, this can be part of operator configuration. Given this, for a user
> > to
> > > define a batch application, would mean configuring the connectors
> (mostly
> > > the input operator) in the application for the desired behavior.
> > Similarly,
> > > there can be other use cases that can be achieved other than batch.
> > >
> > > We may also need to take care of the following:
> > > 1. Make sure that the watermarks or control tuples are consistent
> across
> > > sources. Meaning an HDFS sink should be able to interpret the watermark
> > > tuple sent out by, say, a JDBC source.
> > > 2. In addition to I/O connectors, we should also look at the need for
> > > processing operators to understand some of the control tuples /
> > watermarks.
> > > For example, we may want to reset the operator behavior on arrival of
> > some
> > > watermark tuple.
> > >
> > > ~ Bhupesh
> > >
> > > On Tue, Jan 17, 2017 at 9:59 PM, Thomas Weise <th...@apache.org> wrote:
> > >
> > >> The HDFS source can operate in two modes, bounded or unbounded. If you
> > >> scan
> > >> only once, then it should emit the final watermark after it is done.
> > >> Otherwise it would emit watermarks based on a policy (files names
> etc.).
> > >> The mechanism to generate the marks may depend on the type of source
> and
> > >> the user needs to be able to influence/configure it.
> > >>
> > >> Thomas
> > >>
> > >>
> > >> On Tue, Jan 17, 2017 at 5:03 AM, Bhupesh Chawda <
> > bhupesh@datatorrent.com>
> > >> wrote:
> > >>
> > >> > Hi Thomas,
> > >> >
> > >> > I am not sure that I completely understand your suggestion. Are you
> > >> > suggesting to broaden the scope of the proposal to treat all sources
> > as
> > >> > bounded as well as unbounded?
> > >> >
> > >> > In case of Apex, we treat all sources as unbounded sources. Even
> > bounded
> > >> > sources like HDFS file source is treated as unbounded by means of
> > >> scanning
> > >> > the input directory repeatedly.
> > >> >
> > >> > Let's consider HDFS file source for example:
> > >> > In this case, if we treat it as a bounded source, we can define
> hooks
> > >> which
> > >> > allows us to detect the end of the file and send the "final
> > watermark".
> > >> We
> > >> > could also consider HDFS file source as a streaming source and
> define
> > >> hooks
> > >> > which send watermarks based on different kinds of windows.
> > >> >
> > >> > Please correct me if I misunderstand.
> > >> >
> > >> > ~ Bhupesh
> > >> >
> > >> >
> > >> > On Mon, Jan 16, 2017 at 9:23 PM, Thomas Weise <th...@apache.org>
> wrote:
> > >> >
> > >> > > Bhupesh,
> > >> > >
> > >> > > Please see how that can be solved in a unified way using windows
> and
> > >> > > watermarks. It is bounded data vs. unbounded data. In Beam for
> > >> example,
> > >> > you
> > >> > > can use the "global window" and the final watermark to accomplish
> > what
> > >> > you
> > >> > > are looking for. Batch is just a special case of streaming where
> the
> > >> > source
> > >> > > emits the final watermark.
> > >> > >
> > >> > > Thanks,
> > >> > > Thomas
> > >> > >
> > >> > >
> > >> > > On Mon, Jan 16, 2017 at 1:02 AM, Bhupesh Chawda <
> > >> bhupesh@datatorrent.com
> > >> > >
> > >> > > wrote:
> > >> > >
> > >> > > > Yes, if the user needs to develop a batch application, then
> batch
> > >> aware
> > >> > > > operators need to be used in the application.
> > >> > > > The nature of the application is mostly controlled by the input
> > and
> > >> the
> > >> > > > output operators used in the application.
> > >> > > >
> > >> > > > For example, consider an application which needs to filter
> records
> > >> in a
> > >> > > > input file and store the filtered records in another file. The
> > >> nature
> > >> > of
> > >> > > > this app is to end once the entire file is processed. Following
> > >> things
> > >> > > are
> > >> > > > expected of the application:
> > >> > > >
> > >> > > >    1. Once the input data is over, finalize the output file from
> > >> .tmp
> > >> > > >    files. - Responsibility of output operator
> > >> > > >    2. End the application, once the data is read and processed -
> > >> > > >    Responsibility of input operator
> > >> > > >
> > >> > > > These functions are essential to allow the user to do higher
> level
> > >> > > > operations like scheduling or running a workflow of batch
> > >> applications.
> > >> > > >
> > >> > > > I am not sure about intermediate (processing) operators, as
> there
> > >> is no
> > >> > > > change in their functionality for batch use cases. Perhaps,
> > allowing
> > >> > > > running multiple batches in a single application may require
> > similar
> > >> > > > changes in processing operators as well.
> > >> > > >
> > >> > > > ~ Bhupesh
> > >> > > >
> > >> > > > On Mon, Jan 16, 2017 at 2:19 PM, Priyanka Gugale <
> > priyag@apache.org
> > >> >
> > >> > > > wrote:
> > >> > > >
> > >> > > > > Will it make an impression on user that, if he has a batch
> > >> usecase he
> > >> > > has
> > >> > > > > to use batch aware operators only? If so, is that what we
> > expect?
> > >> I
> > >> > am
> > >> > > > not
> > >> > > > > aware of how do we implement batch scenario so this might be a
> > >> basic
> > >> > > > > question.
> > >> > > > >
> > >> > > > > -Priyanka
> > >> > > > >
> > >> > > > > On Mon, Jan 16, 2017 at 12:02 PM, Bhupesh Chawda <
> > >> > > > bhupesh@datatorrent.com>
> > >> > > > > wrote:
> > >> > > > >
> > >> > > > > > Hi All,
> > >> > > > > >
> > >> > > > > > While design / implementation for custom control tuples is
> > >> > ongoing, I
> > >> > > > > > thought it would be a good idea to consider its usefulness
> in
> > >> one
> > >> > of
> > >> > > > the
> > >> > > > > > use cases -  batch applications.
> > >> > > > > >
> > >> > > > > > This is a proposal to adapt / extend existing operators in
> the
> > >> > Apache
> > >> > > > > Apex
> > >> > > > > > Malhar library so that it is easy to use them in batch use
> > >> cases.
> > >> > > > > > Naturally, this would be applicable for only a subset of
> > >> operators
> > >> > > like
> > >> > > > > > File, JDBC and NoSQL databases.
> > >> > > > > > For example, for a file based store, (say HDFS store), we
> > could
> > >> > have
> > >> > > > > > FileBatchInput and FileBatchOutput operators which allow
> easy
> > >> > > > integration
> > >> > > > > > into a batch application. These operators would be extended
> > from
> > >> > > their
> > >> > > > > > existing implementations and would be "Batch Aware", in that
> > >> they
> > >> > may
> > >> > > > > > understand the meaning of some specific control tuples that
> > flow
> > >> > > > through
> > >> > > > > > the DAG. Start batch and end batch seem to be the obvious
> > >> > candidates
> > >> > > > that
> > >> > > > > > come to mind. On receipt of such control tuples, they may
> try
> > to
> > >> > > modify
> > >> > > > > the
> > >> > > > > > behavior of the operator - to reinitialize some metrics or
> > >> finalize
> > >> > > an
> > >> > > > > > output file for example.
> > >> > > > > >
> > >> > > > > > We can discuss the potential control tuples and actions in
> > >> detail,
> > >> > > but
> > >> > > > > > first I would like to understand the views of the community
> > for
> > >> > this
> > >> > > > > > proposal.
> > >> > > > > >
> > >> > > > > > ~ Bhupesh
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>

Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

Posted by Thomas Weise <th...@apache.org>.
I don't think this should be designed based on a simplistic file
input-output scenario. It would be good to include a stateful
transformation based on event time.

More complex pipelines contain stateful transformations that depend on
windowing and watermarks. I think we need a watermark concept that is based
on progress in event time (or another monotonically increasing sequence) that
other operators can generically work with.
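
As a minimal sketch of what such a generic watermark could look like (the
name and method here are illustrative only, not a committed API):

    // Illustrative only: a watermark announces progress along a monotonically
    // increasing sequence (event time being the common case). Any operator can
    // compare it against the event-time state it is buffering without knowing
    // whether the source was a file, Kafka, JDBC, etc.
    public interface Watermark
    {
      /** Milliseconds since epoch, or any other monotonically increasing value. */
      long getTimestamp();
    }

    // Operator-side use, e.g. for a buffered event-time window:
    //   if (watermark.getTimestamp() >= windowEndTime) {
    //     emitAndPurgeWindow(windowEndTime);
    //   }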

Note that even file input in many cases can produce time-based watermarks,
for example when you read part files that are bound by event time.

Thanks,
Thomas


On Wed, Feb 15, 2017 at 4:02 AM, Bhupesh Chawda <bh...@datatorrent.com>
wrote:

> For better understanding the use case for control tuples in batch, ​I am
> creating a prototype for a batch application using File Input and File
> Output operators.
>
> To enable basic batch processing for File IO operators, I am proposing the
> following changes to File input and output operators:
> 1. File Input operator emits a watermark each time it opens and closes a
> file. These can be "start file" and "end file" watermarks which include the
> corresponding file names. The "start file" tuple should be sent before any
> of the data from that file flows.
> 2. File Input operator can be configured to end the application after a
> single or n scans of the directory (a batch). This is where the operator
> emits the final watermark (the end of application control tuple). This will
> also shutdown the application.
> 3. The File output operator handles these control tuples. "Start file"
> initializes the file name for the incoming tuples. "End file" watermark
> forces a finalize on that file.
>
> The user would be able to enable the operators to send only those
> watermarks that are needed in the application. If none of the options are
> configured, the operators behave as in a streaming application.
>
> There are a few challenges in the implementation where the input operator
> is partitioned. In this case, the correlation between the start/end for a
> file and the data tuples for that file is lost. Hence we need to maintain
> the filename as part of each tuple in the pipeline.
>
> The "start file" and "end file" control tuples in this example are
> temporary names for watermarks. We can have generic "start batch" / "end
> batch" tuples which could be used for other use cases as well. The Final
> watermark is common and serves the same purpose in each case.
>
> Please let me know your thoughts on this.
>
> ~ Bhupesh
>
>
>
> On Wed, Jan 18, 2017 at 12:22 AM, Bhupesh Chawda <bh...@datatorrent.com>
> wrote:
>
> > Yes, this can be part of operator configuration. Given this, for a user
> to
> > define a batch application, would mean configuring the connectors (mostly
> > the input operator) in the application for the desired behavior.
> Similarly,
> > there can be other use cases that can be achieved other than batch.
> >
> > We may also need to take care of the following:
> > 1. Make sure that the watermarks or control tuples are consistent across
> > sources. Meaning an HDFS sink should be able to interpret the watermark
> > tuple sent out by, say, a JDBC source.
> > 2. In addition to I/O connectors, we should also look at the need for
> > processing operators to understand some of the control tuples /
> watermarks.
> > For example, we may want to reset the operator behavior on arrival of
> some
> > watermark tuple.
> >
> > ~ Bhupesh
> >
> > On Tue, Jan 17, 2017 at 9:59 PM, Thomas Weise <th...@apache.org> wrote:
> >
> >> The HDFS source can operate in two modes, bounded or unbounded. If you
> >> scan
> >> only once, then it should emit the final watermark after it is done.
> >> Otherwise it would emit watermarks based on a policy (files names etc.).
> >> The mechanism to generate the marks may depend on the type of source and
> >> the user needs to be able to influence/configure it.
> >>
> >> Thomas
> >>
> >>
> >> On Tue, Jan 17, 2017 at 5:03 AM, Bhupesh Chawda <
> bhupesh@datatorrent.com>
> >> wrote:
> >>
> >> > Hi Thomas,
> >> >
> >> > I am not sure that I completely understand your suggestion. Are you
> >> > suggesting to broaden the scope of the proposal to treat all sources
> as
> >> > bounded as well as unbounded?
> >> >
> >> > In case of Apex, we treat all sources as unbounded sources. Even
> bounded
> >> > sources like HDFS file source is treated as unbounded by means of
> >> scanning
> >> > the input directory repeatedly.
> >> >
> >> > Let's consider HDFS file source for example:
> >> > In this case, if we treat it as a bounded source, we can define hooks
> >> which
> >> > allows us to detect the end of the file and send the "final
> watermark".
> >> We
> >> > could also consider HDFS file source as a streaming source and define
> >> hooks
> >> > which send watermarks based on different kinds of windows.
> >> >
> >> > Please correct me if I misunderstand.
> >> >
> >> > ~ Bhupesh
> >> >
> >> >
> >> > On Mon, Jan 16, 2017 at 9:23 PM, Thomas Weise <th...@apache.org> wrote:
> >> >
> >> > > Bhupesh,
> >> > >
> >> > > Please see how that can be solved in a unified way using windows and
> >> > > watermarks. It is bounded data vs. unbounded data. In Beam for
> >> example,
> >> > you
> >> > > can use the "global window" and the final watermark to accomplish
> what
> >> > you
> >> > > are looking for. Batch is just a special case of streaming where the
> >> > source
> >> > > emits the final watermark.
> >> > >
> >> > > Thanks,
> >> > > Thomas
> >> > >
> >> > >
> >> > > On Mon, Jan 16, 2017 at 1:02 AM, Bhupesh Chawda <
> >> bhupesh@datatorrent.com
> >> > >
> >> > > wrote:
> >> > >
> >> > > > Yes, if the user needs to develop a batch application, then batch
> >> aware
> >> > > > operators need to be used in the application.
> >> > > > The nature of the application is mostly controlled by the input
> and
> >> the
> >> > > > output operators used in the application.
> >> > > >
> >> > > > For example, consider an application which needs to filter records
> >> in a
> >> > > > input file and store the filtered records in another file. The
> >> nature
> >> > of
> >> > > > this app is to end once the entire file is processed. Following
> >> things
> >> > > are
> >> > > > expected of the application:
> >> > > >
> >> > > >    1. Once the input data is over, finalize the output file from
> >> .tmp
> >> > > >    files. - Responsibility of output operator
> >> > > >    2. End the application, once the data is read and processed -
> >> > > >    Responsibility of input operator
> >> > > >
> >> > > > These functions are essential to allow the user to do higher level
> >> > > > operations like scheduling or running a workflow of batch
> >> applications.
> >> > > >
> >> > > > I am not sure about intermediate (processing) operators, as there
> >> is no
> >> > > > change in their functionality for batch use cases. Perhaps,
> allowing
> >> > > > running multiple batches in a single application may require
> similar
> >> > > > changes in processing operators as well.
> >> > > >
> >> > > > ~ Bhupesh
> >> > > >
> >> > > > On Mon, Jan 16, 2017 at 2:19 PM, Priyanka Gugale <
> priyag@apache.org
> >> >
> >> > > > wrote:
> >> > > >
> >> > > > > Will it make an impression on user that, if he has a batch
> >> usecase he
> >> > > has
> >> > > > > to use batch aware operators only? If so, is that what we
> expect?
> >> I
> >> > am
> >> > > > not
> >> > > > > aware of how do we implement batch scenario so this might be a
> >> basic
> >> > > > > question.
> >> > > > >
> >> > > > > -Priyanka
> >> > > > >
> >> > > > > On Mon, Jan 16, 2017 at 12:02 PM, Bhupesh Chawda <
> >> > > > bhupesh@datatorrent.com>
> >> > > > > wrote:
> >> > > > >
> >> > > > > > Hi All,
> >> > > > > >
> >> > > > > > While design / implementation for custom control tuples is
> >> > ongoing, I
> >> > > > > > thought it would be a good idea to consider its usefulness in
> >> one
> >> > of
> >> > > > the
> >> > > > > > use cases -  batch applications.
> >> > > > > >
> >> > > > > > This is a proposal to adapt / extend existing operators in the
> >> > Apache
> >> > > > > Apex
> >> > > > > > Malhar library so that it is easy to use them in batch use
> >> cases.
> >> > > > > > Naturally, this would be applicable for only a subset of
> >> operators
> >> > > like
> >> > > > > > File, JDBC and NoSQL databases.
> >> > > > > > For example, for a file based store, (say HDFS store), we
> could
> >> > have
> >> > > > > > FileBatchInput and FileBatchOutput operators which allow easy
> >> > > > integration
> >> > > > > > into a batch application. These operators would be extended
> from
> >> > > their
> >> > > > > > existing implementations and would be "Batch Aware", in that
> >> they
> >> > may
> >> > > > > > understand the meaning of some specific control tuples that
> flow
> >> > > > through
> >> > > > > > the DAG. Start batch and end batch seem to be the obvious
> >> > candidates
> >> > > > that
> >> > > > > > come to mind. On receipt of such control tuples, they may try
> to
> >> > > modify
> >> > > > > the
> >> > > > > > behavior of the operator - to reinitialize some metrics or
> >> finalize
> >> > > an
> >> > > > > > output file for example.
> >> > > > > >
> >> > > > > > We can discuss the potential control tuples and actions in
> >> detail,
> >> > > but
> >> > > > > > first I would like to understand the views of the community
> for
> >> > this
> >> > > > > > proposal.
> >> > > > > >
> >> > > > > > ~ Bhupesh
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>