Posted to dev@datafu.apache.org by "Russell Melick (JIRA)" <ji...@apache.org> on 2015/07/12 20:16:04 UTC

[jira] [Created] (DATAFU-98) New UDF for Histogram / Frequency counting

Russell Melick created DATAFU-98:
------------------------------------

             Summary: New UDF for Histogram / Frequency counting
                 Key: DATAFU-98
                 URL: https://issues.apache.org/jira/browse/DATAFU-98
             Project: DataFu
          Issue Type: New Feature
            Reporter: Russell Melick


I was thinking of creating a new UDF to compute histograms / frequency counts of input bags.  It seems like it would make sense to support ints, longs, floats, and doubles.

I tried looking around to see if this was already implemented, but ValueHistogram and AggregateWordHistogram were about the only things I found.  They seem to exist as an example job, and only work for Strings.
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/aggregate/ValueHistogram.html
https://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/examples/AggregateWordHistogram.html

Should the user specify the bin size or the number of bins?  Specifying bin size probably makes the implementation simpler since you can bin things without having seen all of the data.
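
To illustrate why a fixed bin size is the simpler option, here is a minimal sketch in plain Java (no Pig dependencies). The class and method names are hypothetical and not part of DataFu. It assumes integer values and left-closed bins, so bin i covers [min + i*binSize, min + (i+1)*binSize). Each value's bin depends only on min and binSize, so no pass over the full data set is needed first.

```java
public final class FixedWidthBins {
    private FixedWidthBins() {}

    /** Index of the bin that contains value; negative if value is below min. */
    public static long binIndex(long value, long min, long binSize) {
        if (binSize <= 0) {
            throw new IllegalArgumentException("binSize must be positive");
        }
        // floorDiv keeps values below min in negative bins instead of bin 0.
        return Math.floorDiv(value - min, binSize);
    }

    /** Inclusive-range label for an integer bin, in the "0-49" style used in this proposal. */
    public static String binLabel(long index, long min, long binSize) {
        long lower = min + index * binSize;
        long upper = lower + binSize - 1;
        return lower + "-" + upper;
    }

    public static void main(String[] args) {
        long min = 0;
        long binSize = 50;
        for (long v : new long[] {0, 49, 50, 120}) {
            System.out.println(v + " -> " + binLabel(binIndex(v, min, binSize), min, binSize));
        }
    }
}
```

Specifying the number of bins instead would require knowing the minimum and maximum before any value could be assigned to a bin.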

I think it would make sense to implement a version of this that didn't need any reducers.  It could use counters to keep track of the counts per bin without sending any data to a reducer.  You would be able to call this without a preceding GROUP BY as well.

Here's my proposal for the two UDFs.  This assumes the input data is two columns, memberId and numConnections.
{code}
DEFINE BinnedFrequency datafu.pig.stats.BinnedFrequency('min=0;binSize=50');

connections = LOAD 'connections' AS (memberId, numConnections);
connectionHistogram = FOREACH (GROUP connections ALL) GENERATE BinnedFrequency(connections.numConnections);
{code}

The output here would be a bag with the frequency counts
{code}
{('0-49', 5), ('50-99', 0), ('100-149', 10)}
{code}
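
As a rough sketch of the aggregation behind that output, the following plain Java (no Pig dependencies) counts values per bin and fills in empty bins between min and the highest occupied bin. The class name, the method name, and the choice to ignore values below min are assumptions for illustration, not a settled design.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public final class BinnedFrequencySketch {
    private BinnedFrequencySketch() {}

    /** Returns bin label -> count, in ascending bin order, including empty bins. */
    public static Map<String, Long> histogram(long[] values, long min, long binSize) {
        if (binSize <= 0) {
            throw new IllegalArgumentException("binSize must be positive");
        }
        TreeMap<Long, Long> countsByIndex = new TreeMap<>();
        for (long v : values) {
            if (v < min) {
                continue; // assumption: values below min are ignored
            }
            countsByIndex.merge((v - min) / binSize, 1L, Long::sum);
        }
        Map<String, Long> result = new LinkedHashMap<>();
        if (countsByIndex.isEmpty()) {
            return result;
        }
        // Walk every bin from the first one up to the highest occupied one,
        // so bins with no values show up with a count of 0.
        for (long i = 0; i <= countsByIndex.lastKey(); i++) {
            long lower = min + i * binSize;
            String label = lower + "-" + (lower + binSize - 1);
            result.put(label, countsByIndex.getOrDefault(i, 0L));
        }
        return result;
    }

    public static void main(String[] args) {
        long[] numConnections = {3, 10, 120, 149, 101};
        System.out.println(histogram(numConnections, 0, 50));
    }
}
```

For the sample input in main, the three bins get counts 2, 0, and 3. Reporting the empty middle bin requires the explicit fill-in loop, since no input value ever touches it.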

{code}
DEFINE BinnedFrequencyCounter datafu.pig.stats.BinnedFrequencyCounter('min=0;binSize=50;name=numConnectionsHistogram');

connections = LOAD 'connections' AS (memberId, numConnections);
connections = FOREACH connections GENERATE BinnedFrequencyCounter(numConnections);
{code}

The output here would just be a counter for each bin, all in the same counter group, numConnectionsHistogram.  It would look something like:

numConnectionsHistogram.'0-49' = 5
numConnectionsHistogram.'50-99' = 0
numConnectionsHistogram.'100-149' = 10
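
The counter-based variant could be sketched roughly as follows. This is plain Java in which an in-memory map stands in for Hadoop job counters; the class and its methods are hypothetical. In a real Pig UDF each increment would go to the job's counters rather than to a local map.

```java
import java.util.Map;
import java.util.TreeMap;

public final class BinnedFrequencyCounterSketch {
    private final String group;
    private final long min;
    private final long binSize;
    // Stand-in for job counters, keyed by "group.binLabel".
    private final Map<String, Long> counters = new TreeMap<>();

    public BinnedFrequencyCounterSketch(String group, long min, long binSize) {
        if (binSize <= 0) {
            throw new IllegalArgumentException("binSize must be positive");
        }
        this.group = group;
        this.min = min;
        this.binSize = binSize;
    }

    /** Called once per input value; bumps the counter for that value's bin. */
    public void accept(long value) {
        long index = Math.floorDiv(value - min, binSize);
        long lower = min + index * binSize;
        String binLabel = lower + "-" + (lower + binSize - 1);
        counters.merge(group + "." + binLabel, 1L, Long::sum);
    }

    /** Copy of the current counter values. */
    public Map<String, Long> snapshot() {
        return new TreeMap<>(counters);
    }

    public static void main(String[] args) {
        BinnedFrequencyCounterSketch udf =
            new BinnedFrequencyCounterSketch("numConnectionsHistogram", 0, 50);
        for (long v : new long[] {10, 20, 120}) {
            udf.accept(v);
        }
        udf.snapshot().forEach((name, count) -> System.out.println(name + " = " + count));
    }
}
```

One difference from the bag version: in this sketch a bin that receives no values never gets a counter, so a zero entry such as the '50-99' line above would only appear if empty bins were registered up front.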



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Re: [jira] [Created] (DATAFU-98) New UDF for Histogram / Frequency counting

Posted by Mitul Tiwari <mi...@gmail.com>.
What about the Quantile UDF in DataFu?
http://datafu.incubator.apache.org/docs/datafu/1.1.0/datafu/pig/stats/Quantile.html

Is that useful here? If not, can it be modified to cover Russell's use
case?

Thanks,
Mitul

