Posted to dev@flink.apache.org by Fabian Hueske <fh...@apache.org> on 2014/12/02 21:55:40 UTC

Re: InputFormat API and current scanned row count

Hi Flavio,

We recently started a few efforts to collect monitoring and
runtime/data statistics.
Counting the number of elements emitted by an operator (or data source)
will be included.

Do you want to count the number of produced tuples for monitoring the
progress or do you see a different use case?

2014-11-28 9:37 GMT+01:00 Flavio Pompermaier <po...@okkam.it>:

> Hi guys,
>
> I was debugging an InputFormat and discovered that there's no way to
> find out how many records have been processed in a split.
> So I added a counter to my input format that is incremented on every
> nextRecord call. Do you think adding something like "public int
> getProcessedRecordsCount()" to the InputFormat interface could be useful?
> Or are you going to manage this count from the caller of nextRecord?
>
> Best,
> Flavio
>
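
[Editor's sketch] The counter Flavio describes can be kept outside the format itself by wrapping it. The code below is a minimal illustration under stated assumptions: SimpleInputFormat is a hypothetical, simplified stand-in and not Flink's actual InputFormat interface, and CountingInputFormat and ListInputFormat are names invented for this example. Only getProcessedRecordsCount comes from the message above.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical, simplified stand-in for an input format.
// This is NOT Flink's InputFormat interface; it only mirrors the idea of
// opening a split and pulling records from it one at a time.
interface SimpleInputFormat<T> {
    void open(int splitId);
    boolean reachedEnd();
    T nextRecord();
}

// Decorator that counts the records returned for the currently open split.
final class CountingInputFormat<T> implements SimpleInputFormat<T> {
    private final SimpleInputFormat<T> delegate;
    private long processedRecords;

    CountingInputFormat(SimpleInputFormat<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public void open(int splitId) {
        processedRecords = 0; // the count is per split, so reset it here
        delegate.open(splitId);
    }

    @Override
    public boolean reachedEnd() {
        return delegate.reachedEnd();
    }

    @Override
    public T nextRecord() {
        T record = delegate.nextRecord();
        if (record != null) {
            processedRecords++; // count only records that were actually produced
        }
        return record;
    }

    long getProcessedRecordsCount() {
        return processedRecords;
    }
}

// Toy format for demonstration: every split is an in-memory list of strings.
final class ListInputFormat implements SimpleInputFormat<String> {
    private final List<List<String>> splits;
    private Iterator<String> current;

    ListInputFormat(List<List<String>> splits) {
        this.splits = splits;
    }

    @Override
    public void open(int splitId) {
        current = splits.get(splitId).iterator();
    }

    @Override
    public boolean reachedEnd() {
        return !current.hasNext();
    }

    @Override
    public String nextRecord() {
        return current.hasNext() ? current.next() : null;
    }
}
```

The sketch returns a long rather than the int proposed above, on the assumption that a single split may hold more records than an int can represent. Keeping the counter in a wrapper corresponds to the second option in the question, where the caller of nextRecord manages the count.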

Re: InputFormat API and current scanned row count

Posted by Fabian Hueske <fh...@apache.org>.
Yes, sure.
Tracking records per split and UDF exec time per call (min, max, avg, or
histogram) would be valuable information when debugging the performance of
a program.
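
[Editor's sketch] To make the "min, max, avg" part concrete, the following is one possible way to accumulate per-call execution times. It is an illustration only: CallTimeStats and timed are hypothetical names, not part of any Flink API, and the histogram variant is not covered.

```java
import java.util.function.Function;

// Hypothetical helper that keeps a running min/max/avg of call durations
// in nanoseconds. Not thread-safe; intended for a single task thread.
final class CallTimeStats {
    private long count;
    private long totalNanos;
    private long minNanos = Long.MAX_VALUE;
    private long maxNanos = 0L;

    void record(long nanos) {
        count++;
        totalNanos += nanos;
        minNanos = Math.min(minNanos, nanos);
        maxNanos = Math.max(maxNanos, nanos);
    }

    long count() {
        return count;
    }

    // Returns 0 when nothing has been recorded yet.
    long minNanos() {
        return count == 0 ? 0L : minNanos;
    }

    long maxNanos() {
        return maxNanos;
    }

    // Returns 0.0 when nothing has been recorded yet.
    double avgNanos() {
        return count == 0 ? 0.0 : (double) totalNanos / count;
    }

    // Wraps a user function so that every call is timed with System.nanoTime().
    static <A, B> Function<A, B> timed(Function<A, B> udf, CallTimeStats stats) {
        return input -> {
            long start = System.nanoTime();
            try {
                return udf.apply(input);
            } finally {
                // record the duration even if the user function throws
                stats.record(System.nanoTime() - start);
            }
        };
    }
}
```

Only four fields are kept per function, so the bookkeeping cost per call is small, although the two System.nanoTime() calls may still be noticeable for very cheap user functions.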

2014-12-02 22:08 GMT+01:00 Flavio Pompermaier <po...@okkam.it>:

> In my specific use case I was interested in understanding why the scans
> of the splits were taking a long time, so I was interested in getting
> statistics about the number of records contained in each split and the
> rate/speed at which they are read. Do you think it could be something
> useful in general?

Re: InputFormat API and current scanned row count

Posted by Flavio Pompermaier <po...@okkam.it>.
In my specific use case I was interested in understanding why the scans
of the splits were taking a long time, so I was interested in getting
statistics about the number of records contained in each split and the
rate/speed at which they are read. Do you think it could be something
useful in general?
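
[Editor's sketch] The per-split statistics described here come down to a record count and an elapsed time for each split, from which a read rate follows. The class below is a hypothetical illustration of that arithmetic; SplitReadStats and its fields are invented names and do not correspond to anything in Flink.

```java
// Hypothetical value class holding read statistics for a single split.
final class SplitReadStats {
    private final int splitId;
    private final long records;
    private final long elapsedNanos;

    SplitReadStats(int splitId, long records, long elapsedNanos) {
        this.splitId = splitId;
        this.records = records;
        this.elapsedNanos = elapsedNanos;
    }

    int splitId() {
        return splitId;
    }

    long records() {
        return records;
    }

    // Read rate in records per second; 0.0 if no time was measured.
    double recordsPerSecond() {
        if (elapsedNanos <= 0) {
            return 0.0;
        }
        return records * 1e9 / elapsedNanos;
    }
}
```

Comparing recordsPerSecond() across splits would help separate a split that is slow because it holds many records from one that is slow because each record is expensive to read.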